get-url Architecture & Flow
Overview
The enhanced get-url command supports two modes:
get-url
├── SEARCH MODE (new)
│ └── -url "pattern"
│ ├── Normalize pattern (strip protocol, www)
│ ├── Search all stores
│ ├── Match URLs with wildcards
│ └── Return grouped results
│
└── ORIGINAL MODE (unchanged)
├── Hash lookup
├── Store lookup
└── Return URLs for file
Flow Diagram: URL Search
User Input
│
v
get-url -url "youtube.com*"
│
v
_normalize_url_for_search()
│ Strips: https://, http://, www.
│ Result: "youtube.com*" (unchanged, already normalized)
v
_search_urls_across_stores()
│
├─→ Store 1 (Hydrus)
│ ├─→ search("*", limit=1000)
│ ├─→ get_url(file_hash) for each file
│ └─→ _match_url_pattern() for each URL
│
├─→ Store 2 (Folder)
│ ├─→ search("*", limit=1000)
│ ├─→ get_url(file_hash) for each file
│ └─→ _match_url_pattern() for each URL
│
└─→ ...more stores...
Matching URLs:
├─→ https://www.youtube.com/watch?v=123
├─→ http://youtube.com/shorts/abc
└─→ https://youtube.com/playlist?list=xyz
Normalized for matching:
├─→ youtube.com/watch?v=123 ✓ Matches "youtube.com*"
├─→ youtube.com/shorts/abc ✓ Matches "youtube.com*"
└─→ youtube.com/playlist?... ✓ Matches "youtube.com*"
v
Collect UrlItem results
│
├─→ UrlItem(url="https://www.youtube.com/watch?v=123",
│ hash="abcd1234...", store="hydrus")
│
├─→ UrlItem(url="http://youtube.com/shorts/abc",
│ hash="efgh5678...", store="folder")
│
└─→ ...more items...
v
Group by store
│
├─→ Hydrus
│ ├─→ https://www.youtube.com/watch?v=123
│ └─→ ...
│
└─→ Folder
├─→ http://youtube.com/shorts/abc
└─→ ...
v
Emit UrlItem objects for piping
│
v
Return exit code 0 (success)
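The normalization and matching steps above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the module-level form, the exact prefixes stripped, and the implicit trailing wildcard for bare patterns (suggested by Example 1 below) are assumptions.

```python
from fnmatch import fnmatch


def _normalize_url_for_search(url: str) -> str:
    """Strip the protocol and a leading www., and lowercase."""
    url = url.strip().lower()
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    if url.startswith("www."):
        url = url[4:]
    return url


def _match_url_pattern(url: str, pattern: str) -> bool:
    """Wildcard-match a stored URL against a search pattern."""
    pattern = _normalize_url_for_search(pattern)
    if "*" not in pattern:
        # Assumption: a bare pattern behaves like a prefix (see Example 1 below),
        # so "google.com" also matches "google.com/search?q=hello".
        pattern += "*"
    return fnmatch(_normalize_url_for_search(url), pattern)
```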
Code Structure
Get_Url (class)
│
├── __init__()
│ └── Register command with CLI
│
├── _normalize_url_for_search() [static]
│ └── Strip protocol & www, lowercase
│
├── _match_url_pattern() [static]
│ └── fnmatch with normalization
│
├── _search_urls_across_stores() [instance]
│ ├── Iterate stores
│ ├── Search files in store
│ ├── Get URLs for each file
│ ├── Apply pattern matching
│ └── Return (items, stores_found)
│
└── run() [main execution]
├── Check for -url flag
│ ├── YES: Search mode
│ │ └── _search_urls_across_stores()
│ └── NO: Original mode
│ └── Hash+store lookup
│
└── Return exit code
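A rough sketch of the search helper following this structure. The UrlItem shape, the self.stores mapping, self.log, and the backend search()/get_url() signatures are inferred from the diagrams above and may not match the real interfaces:

```python
from dataclasses import dataclass


@dataclass
class UrlItem:
    # Shape inferred from the flow diagram; the real class may differ.
    url: str
    hash: str
    store: str


class Get_Url:
    def _search_urls_across_stores(self, pattern, limit=1000):
        """Scan every configured store and collect URLs matching the pattern."""
        items, stores_found = [], set()
        for store_name, backend in self.stores.items():  # self.stores: name -> backend (assumed)
            try:
                file_hashes = backend.search("*", limit=limit)
            except Exception as exc:
                # A failing store is logged and skipped; the scan continues.
                self.log(f"Error searching store {store_name}: {exc}")
                continue
            for file_hash in file_hashes:
                for url in backend.get_url(file_hash) or []:
                    if self._match_url_pattern(url, pattern):
                        items.append(UrlItem(url=url, hash=file_hash, store=store_name))
                        stores_found.add(store_name)
        return items, stores_found
```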
Data Flow Examples
Example 1: Search by Domain
Input: get-url -url "www.google.com"
Normalize: "google.com" (www. stripped)
Search Results:
Store "hydrus":
- https://www.google.com ✓
- https://google.com/search?q=hello ✓
- https://google.com/maps ✓
Store "folder":
- http://google.com ✓
- https://google.com/images ✓
Output: 5 matching URLs grouped by store
Example 2: Wildcard Pattern
Input: get-url -url "youtube.com/watch*"
Pattern: "youtube.com/watch*"
Search Results:
Store "hydrus":
- https://www.youtube.com/watch?v=123 ✓
- https://youtube.com/watch?list=abc ✓
- https://www.youtube.com/shorts/xyz ✗ (doesn't match /watch*)
Store "folder":
- http://youtube.com/watch?v=456 ✓
Output: 3 matching URLs (watch only, not shorts)
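Assuming the fnmatch-based helper sketched after the flow diagram, Example 2 can be reproduced directly:

```python
urls = [
    "https://www.youtube.com/watch?v=123",
    "https://youtube.com/watch?list=abc",
    "https://www.youtube.com/shorts/xyz",
    "http://youtube.com/watch?v=456",
]
matches = [u for u in urls if _match_url_pattern(u, "youtube.com/watch*")]
print(len(matches))  # 3 -- the /shorts/ URL does not match
```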
Example 3: Subdomain Wildcard
Input: get-url -url "*.example.com*"
Normalize: "*.example.com*" (already normalized)
Search Results:
Store "hydrus":
- https://cdn.example.com/video.mp4 ✓
- https://api.example.com/endpoint ✓
- https://www.example.com ✓
- https://other.org ✗
Output: 3 matching URLs
Integration with Piping
# Search → Filter → Add Tag
get-url -url "youtube.com*" | add-tag -tag "video-source"
# Search → Count
get-url -url "reddit.com*" | wc -l
# Search → Export
get-url -url "github.com*" > github_urls.txt
Error Handling Flow
get-url -url "pattern"
│
├─→ No stores configured?
│ └─→ Log "Error: No stores configured"
│ └─→ Return exit code 1
│
├─→ Store search fails?
│ └─→ Log error, skip store, continue
│
├─→ No matches found?
│ └─→ Log "No urls matching pattern"
│ └─→ Return exit code 1
│
└─→ Matches found?
└─→ Return exit code 0
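A sketch of how the search-mode branch of run() could map onto these exit codes (self.log and self.emit are assumed helper names, not confirmed API):

```python
def run(self, args) -> int:
    """Search-mode path of run(); exit codes follow the flow above."""
    if not self.stores:
        self.log("Error: No stores configured")
        return 1
    items, _stores_found = self._search_urls_across_stores(args.url)
    if not items:
        self.log(f"No urls matching {args.url}")
        return 1
    for item in items:
        self.emit(item)  # emit UrlItem objects for piping
    return 0
```

Per-store failures are handled inside the search helper itself: the store is logged and skipped, and the remaining stores are still scanned.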
Performance Considerations
- Store Iteration: Loops through all configured stores
- File Scanning: Each store searches up to 1000 files
- URL Matching: Each URL tested against pattern (fnmatch - O(n) per URL)
- Memory: Stores all matching items in memory before display
Optimization opportunities:
- Cache store results
- Limit search scope with --store flag
- Early exit with --limit N (sketched after this list)
- Pagination support
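As an illustration of the early-exit idea only: the --limit flag does not exist yet, and this hypothetical variant simply stops scanning once enough matches have been collected (names reuse the earlier sketches):

```python
def _search_urls_with_limit(self, pattern, max_items):
    """Hypothetical early-exit variant: stop scanning once max_items URLs match."""
    items = []
    for store_name, backend in self.stores.items():
        for file_hash in backend.search("*", limit=1000):
            for url in backend.get_url(file_hash) or []:
                if self._match_url_pattern(url, pattern):
                    items.append(UrlItem(url=url, hash=file_hash, store=store_name))
                    if len(items) >= max_items:
                        return items, {i.store for i in items}
    return items, {i.store for i in items}
```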
Backward Compatibility
Original mode (unchanged):
@1 | get-url
│
└─→ No -url flag
└─→ Use original logic
├─→ Get hash from result
├─→ Get store from result or args
├─→ Call backend.get_url(hash)
└─→ Return URLs for that file
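Roughly, the dispatch could look like the sketch below; the result.hash / args.store field names and self.emit are assumptions used for illustration:

```python
def run(self, args, result=None) -> int:
    """Dispatch between the new search mode and the original lookup."""
    if getattr(args, "url", None):
        # Search mode (new): scan all stores, as sketched above.
        items, _ = self._search_urls_across_stores(args.url)
        for item in items:
            self.emit(item)
        return 0 if items else 1
    # Original mode (unchanged): hash + store lookup for a single file.
    file_hash = result.hash if result is not None else args.hash
    store_name = getattr(result, "store", None) or args.store
    urls = self.stores[store_name].get_url(file_hash) or []
    for url in urls:
        self.emit(url)
    return 0
```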
All original functionality preserved. New -url flag is additive only.