
This commit is contained in:
2025-12-29 17:05:03 -08:00
parent 226de9316a
commit c019c00aed
104 changed files with 19669 additions and 12954 deletions


@@ -0,0 +1,234 @@
# get-url Architecture & Flow
## Overview
The enhanced `get-url` command supports two modes:
```
get-url
├── SEARCH MODE (new)
│ └── -url "pattern"
│ ├── Normalize pattern (strip protocol, www)
│ ├── Search all stores
│ ├── Match URLs with wildcards
│ └── Return grouped results
└── ORIGINAL MODE (unchanged)
├── Hash lookup
├── Store lookup
└── Return URLs for file
```
## Flow Diagram: URL Search
```
User Input
v
get-url -url "youtube.com*"
v
_normalize_url_for_search()
│ Strips: https://, http://, www.
│ Result: "youtube.com*" (unchanged, already normalized)
v
_search_urls_across_stores()
├─→ Store 1 (Hydrus)
│ ├─→ search("*", limit=1000)
│ ├─→ get_url(file_hash) for each file
│ └─→ _match_url_pattern() for each URL
├─→ Store 2 (Folder)
│ ├─→ search("*", limit=1000)
│ ├─→ get_url(file_hash) for each file
│ └─→ _match_url_pattern() for each URL
└─→ ...more stores...
Matching URLs:
├─→ https://www.youtube.com/watch?v=123
├─→ http://youtube.com/shorts/abc
└─→ https://youtube.com/playlist?list=xyz
Normalized for matching:
├─→ youtube.com/watch?v=123 ✓ Matches "youtube.com*"
├─→ youtube.com/shorts/abc ✓ Matches "youtube.com*"
└─→ youtube.com/playlist?... ✓ Matches "youtube.com*"
v
Collect UrlItem results
├─→ UrlItem(url="https://www.youtube.com/watch?v=123",
│ hash="abcd1234...", store="hydrus")
├─→ UrlItem(url="http://youtube.com/shorts/abc",
│ hash="efgh5678...", store="folder")
└─→ ...more items...
v
Group by store
├─→ Hydrus
│ ├─→ https://www.youtube.com/watch?v=123
│ └─→ ...
└─→ Folder
├─→ http://youtube.com/shorts/abc
└─→ ...
v
Emit UrlItem objects for piping
v
Return exit code 0 (success)
```
## Code Structure
```
Get_Url (class)
├── __init__()
│ └── Register command with CLI
├── _normalize_url_for_search() [static]
│ └── Strip protocol & www, lowercase
├── _match_url_pattern() [static]
│ └── fnmatch with normalization
├── _search_urls_across_stores() [instance]
│ ├── Iterate stores
│ ├── Search files in store
│ ├── Get URLs for each file
│ ├── Apply pattern matching
│ └── Return (items, stores_found)
└── run() [main execution]
├── Check for -url flag
│ ├── YES: Search mode
│ │ └── _search_urls_across_stores()
│ └── NO: Original mode
│ └── Hash+store lookup
└── Return exit code
```
## Data Flow Examples
### Example 1: Search by Domain
```
Input: get-url -url "www.google.com"
Normalize: "google.com" (www. stripped)
Search Results:
Store "hydrus":
- https://www.google.com ✓
- https://google.com/search?q=hello ✓
- https://google.com/maps ✓
Store "folder":
- http://google.com ✓
- https://google.com/images ✓
Output: 5 matching URLs grouped by store
```
### Example 2: Wildcard Pattern
```
Input: get-url -url "youtube.com/watch*"
Pattern: "youtube.com/watch*"
Search Results:
Store "hydrus":
- https://www.youtube.com/watch?v=123 ✓
- https://youtube.com/watch?list=abc ✓
- https://www.youtube.com/shorts/xyz ✗ (doesn't match /watch*)
Store "folder":
- http://youtube.com/watch?v=456 ✓
Output: 3 matching URLs (watch only, not shorts)
```
### Example 3: Subdomain Wildcard
```
Input: get-url -url "*.example.com*"
Normalize: "*.example.com*" (already normalized)
Search Results:
Store "hydrus":
- https://cdn.example.com/video.mp4 ✓
- https://api.example.com/endpoint ✓
- https://www.example.com ✓
- https://other.org ✗
Output: 3 matching URLs
```
## Integration with Piping
```
# Search → Filter → Add Tag
get-url -url "youtube.com*" | add-tag -tag "video-source"
# Search → Count
get-url -url "reddit.com*" | wc -l
# Search → Export
get-url -url "github.com*" > github_urls.txt
```
## Error Handling Flow
```
get-url -url "pattern"
├─→ No stores configured?
│ └─→ Log "Error: No stores configured"
│ └─→ Return exit code 1
├─→ Store search fails?
│ └─→ Log error, skip store, continue
├─→ No matches found?
│ └─→ Log "No urls matching pattern"
│ └─→ Return exit code 1
└─→ Matches found?
└─→ Return exit code 0
```
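The skip-and-continue behavior could be structured roughly like this (the store API names `search` and `get_url` are taken from the flow diagram; everything else is illustrative):

```python
import fnmatch
import re


def _norm(u: str) -> str:
    # strip protocol and leading "www.", lowercase
    return re.sub(r"^(?:[a-z][a-z0-9+.-]*://)?(?:www\.)?", "", u.lower())


def search_urls_across_stores(stores, pattern, log=print):
    """Return (matching items, exit code); a failing store is logged and skipped."""
    if not stores:
        log("Error: No stores configured")
        return [], 1
    items = []
    for name, store in stores.items():
        try:
            hashes = store.search("*", limit=1000)
        except Exception as exc:
            log(f"Error searching store {name}: {exc}")  # skip store, continue
            continue
        for file_hash in hashes:
            for url in store.get_url(file_hash):
                if fnmatch.fnmatchcase(_norm(url), _norm(pattern)):
                    items.append((name, file_hash, url))
    if not items:
        log("No urls matching pattern")
        return items, 1
    return items, 0
```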
## Performance Considerations
1. **Store Iteration**: Loops through all configured stores
2. **File Scanning**: Each store searches up to 1000 files
3. **URL Matching**: Each URL is tested against the pattern with `fnmatch` (roughly linear in the URL length)
4. **Memory**: Stores all matching items in memory before display
Optimization opportunities:
- Cache store results
- Limit search scope with --store flag
- Early exit with --limit N
- Pagination support
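As an illustration, the early-exit idea (a hypothetical `--limit N`, not currently implemented) amounts to breaking out of the match loop:

```python
import fnmatch


def collect_matches(candidate_urls, pattern, limit=None):
    """Stop scanning as soon as `limit` matches have been collected."""
    matches = []
    for url in candidate_urls:
        if fnmatch.fnmatchcase(url.lower(), pattern.lower()):
            matches.append(url)
            if limit is not None and len(matches) >= limit:
                break  # early exit: skip the remaining candidates
    return matches
```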
## Backward Compatibility
Original mode (unchanged):
```
@1 | get-url
└─→ No -url flag
└─→ Use original logic
├─→ Get hash from result
├─→ Get store from result or args
├─→ Call backend.get_url(hash)
└─→ Return URLs for that file
```
All original functionality is preserved; the new `-url` flag is purely additive.

docs/GET_URL_QUICK_REF.md Normal file

@@ -0,0 +1,76 @@
# Quick Reference: get-url URL Search
## Basic Syntax
```bash
# Search mode (new)
get-url -url "pattern"
# Original mode (unchanged)
@1 | get-url
```
## Examples
### Exact domain match
```bash
get-url -url "google.com"
```
Matches: `https://www.google.com`, `http://google.com/search`, `https://google.com/maps`
### YouTube URL search
```bash
get-url -url "https://www.youtube.com/watch?v=xx_88TDWmEs"
```
Normalizes to: `youtube.com/watch?v=xx_88tdwmes`
Matches: the same video across protocols and `www.` variants (note that `?` in the pattern is itself a single-character wildcard)
### Wildcard domain
```bash
get-url -url "youtube.com*"
```
Matches: All YouTube URLs (videos, shorts, playlists, etc.)
### Subdomain wildcard
```bash
get-url -url "*.example.com*"
```
Matches: `cdn.example.com`, `api.example.com`, `www.example.com`
### Specific path pattern
```bash
get-url -url "youtube.com/watch*"
```
Matches: Only YouTube watch URLs (not shorts or playlists)
### Single character wildcard
```bash
get-url -url "example.com/file?.mp4"
```
Matches: `example.com/file1.mp4`, `example.com/fileA.mp4` (not `file12.mp4`)
## How It Works
1. **Normalization**: Strips the protocol (`https://`, `http://`) and the `www.` prefix from both the pattern and all URLs
2. **Pattern Matching**: Uses `*` and `?` wildcards (case-insensitive)
3. **Search**: Scans all configured stores for matching URLs
4. **Results**: Groups matches by store, shows URL and hash
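The four steps above can be reproduced with Python's `fnmatch` module (a sketch of the idea, not the command's actual code):

```python
import fnmatch
import re


def normalize(url: str) -> str:
    # Step 1: strip the protocol and a leading "www.", lowercase everything
    return re.sub(r"^(?:[a-z][a-z0-9+.-]*://)?(?:www\.)?", "", url.lower())


urls = [
    "https://www.youtube.com/watch?v=123",
    "https://youtube.com/shorts/abc",
    "http://reddit.com/r/python",
]
# Steps 2-3: case-insensitive wildcard match against each normalized URL
hits = [u for u in urls if fnmatch.fnmatchcase(normalize(u), "youtube.com/watch*")]
print(hits)  # -> ['https://www.youtube.com/watch?v=123']
```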
## Return Values
- Exit code **0** if matches found
- Exit code **1** if no matches or error
## Piping Results
```bash
get-url -url "youtube.com*" | grep -i video
get-url -url "example.com*" | add-tag -tag "external-source"
```
## Common Patterns
| Pattern | Matches | Notes |
|---------|---------|-------|
| `google.com` | Google URLs | Exact domain (after normalization) |
| `youtube.com*` | All YouTube | Wildcard at end |
| `*.example.com*` | Subdomains | Wildcard at start and end |
| `github.com/user*` | User repos | Path pattern |
| `reddit.com/r/*` | Subreddit | Path with wildcard |

docs/GET_URL_SEARCH.md Normal file

@@ -0,0 +1,91 @@
# get-url Enhanced URL Search
The `get-url` command now supports searching for URLs across all stores with automatic protocol and `www` prefix stripping.
## Features
### 1. **Protocol Stripping**
URLs are normalized by removing:
- Protocol prefixes: `https://`, `http://`, `ftp://`, etc.
- `www.` prefix (case-insensitive)
### 2. **Wildcard Matching**
Patterns support standard wildcards:
- `*` - matches any sequence of characters
- `?` - matches any single character
### 3. **Case-Insensitive Matching**
All matching is case-insensitive for both domains and paths.
## Usage Examples
### Search by full domain
```bash
get-url -url "www.google.com"
# Matches:
# - https://www.google.com
# - http://google.com/search
# - https://google.com/maps
```
### Search with YouTube example
```bash
get-url -url "https://www.youtube.com/watch?v=xx_88TDWmEs"
# Becomes: youtube.com/watch?v=xx_88tdwmes
# Matches:
# - https://www.youtube.com/watch?v=xx_88TDWmEs
# - http://youtube.com/watch?v=xx_88TDWmEs
```
### Domain wildcard matching
```bash
get-url -url "youtube.com*"
# Matches any URL starting with youtube.com:
# - https://www.youtube.com/watch?v=123
# - https://youtube.com/shorts/abc
# - http://youtube.com/playlist?list=xyz
```
### Subdomain matching
```bash
get-url -url "*example.com*"
# Matches:
# - https://cdn.example.com/file.mp4
# - https://www.example.com
# - https://api.example.com/endpoint
```
### Specific path matching
```bash
get-url -url "youtube.com/watch*"
# Matches:
# - https://www.youtube.com/watch?v=123
# - http://youtube.com/watch?list=abc
# Does NOT match:
# - https://youtube.com/shorts/abc
```
## Get URLs for Specific File
The original functionality is still supported:
```bash
@1 | get-url
# Requires hash and store from piped result
```
## Output
Results are organized by store and show:
- **Store**: Backend name (hydrus, folder, etc.)
- **Url**: The full matched URL
- **Hash**: First 16 characters of the file hash (for compactness)
## Implementation Details
The search:
1. Iterates through all configured stores
2. Searches for all files in each store (limit 1000 per store)
3. Retrieves URLs for each file
4. Applies pattern matching with normalization
5. Returns results grouped by store
6. Emits `UrlItem` objects for piping to other commands
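Steps 5 and 6 (grouping and emission) could look roughly like this; the `UrlItem` field names and the `(store, hash, url)` tuple shape are assumptions based on the flow described above:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UrlItem:
    url: str
    hash: str
    store: str


def group_and_emit(matches):
    """matches: iterable of (store, hash, url) tuples from the search loop."""
    grouped = defaultdict(list)
    for store, file_hash, url in matches:
        grouped[store].append(UrlItem(url=url, hash=file_hash, store=store))
    for store, items in grouped.items():
        print(f"Store: {store}")
        for item in items:
            print(f"  {item.url}  {item.hash[:16]}")  # hash truncated for display
    # flatten for piping to downstream commands
    return [item for items in grouped.values() for item in items]
```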