# get-url Architecture & Flow

## Overview

The enhanced get-url command supports two modes:

```
get-url
├── SEARCH MODE (new)
│   └── -url "pattern"
│       ├── Normalize pattern (strip protocol, www)
│       ├── Search all stores
│       ├── Match URLs with wildcards
│       └── Return grouped results
│
└── ORIGINAL MODE (unchanged)
    ├── Hash lookup
    ├── Store lookup
    └── Return URLs for file
```

```
User Input
    │
    v
get-url -url "youtube.com*"
    │
    v
_normalize_url_for_search()
    │ Strips: https://, http://, www.
    │ Result: "youtube.com*" (unchanged, already normalized)
    v
_search_urls_across_stores()
    │
    ├─→ Store 1 (Hydrus)
    │   ├─→ search("*", limit=1000)
    │   ├─→ get_url(file_hash) for each file
    │   └─→ _match_url_pattern() for each URL
    │
    ├─→ Store 2 (Folder)
    │   ├─→ search("*", limit=1000)
    │   ├─→ get_url(file_hash) for each file
    │   └─→ _match_url_pattern() for each URL
    │
    └─→ ...more stores...
    
    Matching URLs:
    ├─→ https://www.youtube.com/watch?v=123
    ├─→ http://youtube.com/shorts/abc
    └─→ https://youtube.com/playlist?list=xyz
    
    Normalized for matching:
    ├─→ youtube.com/watch?v=123  ✓ Matches "youtube.com*"
    ├─→ youtube.com/shorts/abc   ✓ Matches "youtube.com*"
    └─→ youtube.com/playlist?...  ✓ Matches "youtube.com*"
    
    v
Collect UrlItem results
    │
    ├─→ UrlItem(url="https://www.youtube.com/watch?v=123", 
    │           hash="abcd1234...", store="hydrus")
    │
    ├─→ UrlItem(url="http://youtube.com/shorts/abc",
    │           hash="efgh5678...", store="folder")
    │
    └─→ ...more items...
    
    v
Group by store
    │
    ├─→ Hydrus
    │   ├─→ https://www.youtube.com/watch?v=123
    │   └─→ ...
    │
    └─→ Folder
        ├─→ http://youtube.com/shorts/abc
        └─→ ...
    
    v
Emit UrlItem objects for piping
    │
    v
Return exit code 0 (success)
```
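
The two steps above, normalization and pattern matching, are the heart of search mode. A minimal sketch of how they might look, assuming plain fnmatch semantics; treating a bare pattern as a prefix and also trying the www-kept form of each URL are assumptions inferred from the Data Flow Examples below:

```python
import fnmatch


def _normalize_url_for_search(url: str) -> str:
    """Strip the protocol and a leading www., lowercase the rest."""
    url = url.strip().lower()
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    if url.startswith("www."):
        url = url[len("www."):]
    return url


def _match_url_pattern(url: str, pattern: str) -> bool:
    """Test a stored URL against a user-supplied wildcard pattern."""
    pat = _normalize_url_for_search(pattern)
    if "*" not in pat:
        pat += "*"  # bare patterns behave as prefixes (see Example 1)
    # Try two candidates: protocol stripped only, and fully normalized,
    # so "*.example.com*" can still match www.example.com (see Example 3).
    bare = url.strip().lower()
    for prefix in ("https://", "http://"):
        if bare.startswith(prefix):
            bare = bare[len(prefix):]
            break
    candidates = (bare, _normalize_url_for_search(url))
    return any(fnmatch.fnmatch(c, pat) for c in candidates)
```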

## Code Structure

```
Get_Url (class)
    │
    ├── __init__()
    │   └── Register command with CLI
    │
    ├── _normalize_url_for_search() [static]
    │   └── Strip protocol & www, lowercase
    │
    ├── _match_url_pattern() [static]
    │   └── fnmatch with normalization
    │
    ├── _search_urls_across_stores() [instance]
    │   ├── Iterate stores
    │   ├── Search files in store
    │   ├── Get URLs for each file
    │   ├── Apply pattern matching
    │   └── Return (items, stores_found)
    │
    └── run() [main execution]
        ├── Check for -url flag
        │   ├── YES: Search mode
        │   │   └── _search_urls_across_stores()
        │   └── NO: Original mode
        │       └── Hash+store lookup
        │
        └── Return exit code
```
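
A hedged sketch of the store scan and the group-by-store step from the flow diagram. The store interface (store.search, store.get_url) and the UrlItem fields follow the diagrams above; the exact names in the codebase may differ:

```python
import logging
from collections import defaultdict
from dataclasses import dataclass

log = logging.getLogger(__name__)


@dataclass
class UrlItem:
    url: str    # original, un-normalized URL
    hash: str   # hash of the file the URL is attached to
    store: str  # name of the store the file lives in


def search_urls_across_stores(stores: dict, pattern: str):
    """Scan every configured store; collect items whose URL matches."""
    items: list[UrlItem] = []
    stores_found: set[str] = set()
    for name, store in stores.items():
        try:
            file_hashes = store.search("*", limit=1000)
        except Exception as exc:
            # A failing store is logged and skipped; the scan continues.
            log.error("search failed for store %s: %s", name, exc)
            continue
        for file_hash in file_hashes:
            for url in store.get_url(file_hash):
                if _match_url_pattern(url, pattern):
                    items.append(UrlItem(url=url, hash=file_hash, store=name))
                    stores_found.add(name)
    return items, stores_found


def group_by_store(items: list[UrlItem]) -> dict[str, list[str]]:
    """Arrange matches for display, one bucket per store."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for item in items:
        grouped[item.store].append(item.url)
    return grouped
```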

## Data Flow Examples

### Example 1: Search by Domain

```
Input:  get-url -url "www.google.com"
        
Normalize: "google.com" (www. stripped)

Search Results:
  Store "hydrus":
    - https://www.google.com ✓
    - https://google.com/search?q=hello ✓
    - https://google.com/maps ✓
  
  Store "folder":
    - http://google.com ✓
    - https://google.com/images ✓

Output: 5 matching URLs grouped by store
```

### Example 2: Wildcard Pattern

```
Input:  get-url -url "youtube.com/watch*"

Pattern: "youtube.com/watch*"

Search Results:
  Store "hydrus":
    - https://www.youtube.com/watch?v=123 ✓
    - https://youtube.com/watch?list=abc ✓
    - https://www.youtube.com/shorts/xyz ✗ (doesn't match /watch*)
  
  Store "folder":
    - http://youtube.com/watch?v=456 ✓

Output: 3 matching URLs (watch only, not shorts)
```

### Example 3: Subdomain Wildcard

```
Input:  get-url -url "*.example.com*"

Normalize: "*.example.com*" (already normalized)

Search Results:
  Store "hydrus":
    - https://cdn.example.com/video.mp4 ✓
    - https://api.example.com/endpoint ✓
    - https://www.example.com ✓
    - https://other.org ✗

Output: 3 matching URLs
```
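
All three examples are plain fnmatch semantics; a quick check against the _match_url_pattern sketch from earlier:

```python
# Example 1: a bare domain behaves as a prefix after www. is stripped
assert _match_url_pattern("https://google.com/search?q=hello", "www.google.com")

# Example 2: /watch* matches watch URLs but not shorts
assert _match_url_pattern("https://www.youtube.com/watch?v=123", "youtube.com/watch*")
assert not _match_url_pattern("https://www.youtube.com/shorts/xyz", "youtube.com/watch*")

# Example 3: a subdomain wildcard, and no match for an unrelated domain
assert _match_url_pattern("https://cdn.example.com/video.mp4", "*.example.com*")
assert not _match_url_pattern("https://other.org", "*.example.com*")
```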

## Integration with Piping

```sh
# Search → Filter → Add Tag
get-url -url "youtube.com*" | add-tag -tag "video-source"

# Search → Count
get-url -url "reddit.com*" | wc -l

# Search → Export
get-url -url "github.com*" > github_urls.txt

## Error Handling Flow

```
get-url -url "pattern"
    │
    ├─→ No stores configured?
    │   └─→ Log "Error: No stores configured"
    │   └─→ Return exit code 1
    │
    ├─→ Store search fails?
    │   └─→ Log error, skip store, continue
    │
    ├─→ No matches found?
    │   └─→ Log "No urls matching pattern"
    │   └─→ Return exit code 1
    │
    └─→ Matches found?
        └─→ Return exit code 0
```
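
Mapped onto code, the search branch of run() might look roughly like this (a sketch: self.emit, self._run_original, and the exact log wording are illustrative):

```python
class Get_Url:
    def run(self, args) -> int:
        """Dispatch between search mode (-url) and the original lookup."""
        if not getattr(args, "url", None):
            return self._run_original(args)   # ORIGINAL MODE, unchanged
        if not self.stores:
            log.error("Error: No stores configured")
            return 1
        items, _found = search_urls_across_stores(self.stores, args.url)
        if not items:
            log.info("No urls matching pattern")
            return 1
        for item in items:
            self.emit(item)                   # UrlItem objects for piping
        return 0
```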

## Performance Considerations

  1. Store Iteration: Loops through all configured stores
  2. File Scanning: Each store searches up to 1000 files
  3. URL Matching: Each URL is tested against the pattern with fnmatch (roughly linear in the URL length)
  4. Memory: Stores all matching items in memory before display

Optimization opportunities:

  • Cache store results
  • Limit search scope with --store flag
  • Early exit with --limit N (see the sketch after this list)
  • Pagination support
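
For instance, the early-exit idea needs only one check inside the collection loop (a sketch; --limit is a hypothetical flag here, and the store interface is assumed as above):

```python
def search_with_limit(stores: dict, pattern: str, limit: int | None = None):
    """Like search_urls_across_stores, but stops once `limit` items are found."""
    items = []
    for name, store in stores.items():
        for file_hash in store.search("*", limit=1000):
            for url in store.get_url(file_hash):
                if _match_url_pattern(url, pattern):
                    items.append((url, file_hash, name))
                    if limit is not None and len(items) >= limit:
                        return items  # early exit: skip remaining files and stores
    return items
```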

## Backward Compatibility

Original mode (unchanged):

```
@1 | get-url
    │
    └─→ No -url flag
        └─→ Use original logic
            ├─→ Get hash from result
            ├─→ Get store from result or args
            ├─→ Call backend.get_url(hash)
            └─→ Return URLs for that file
```

All original functionality is preserved; the new -url flag is purely additive.
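
The original branch, sketched under the same assumptions (self.result holding the piped-in selection is illustrative):

```python
def _run_original(self, args) -> int:
    """Original behavior: look up URLs for one file by hash and store."""
    file_hash = self.result.hash    # hash taken from the piped-in result
    store_name = getattr(args, "store", None) or self.result.store
    backend = self.stores[store_name]
    for url in backend.get_url(file_hash):
        self.emit(url)              # the URLs recorded for that file
    return 0
```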