# get-url Architecture & Flow
## Overview
The enhanced `get-url` command supports two modes:
```
get-url
├── SEARCH MODE (new)
│   └── -url "pattern"
│       ├── Normalize pattern (strip protocol, www)
│       ├── Search all stores
│       ├── Match URLs with wildcards
│       └── Return grouped results
└── ORIGINAL MODE (unchanged)
    ├── Hash lookup
    ├── Store lookup
    └── Return URLs for file
```
## Flow Diagram: URL Search
```
User Input
    v
get-url -url "youtube.com*"
    v
_normalize_url_for_search()
    │  Strips: https://, http://, www.
    │  Result: "youtube.com*" (unchanged, already normalized)
    v
_search_urls_across_stores()
    ├─→ Store 1 (Hydrus)
    │     ├─→ search("*", limit=1000)
    │     ├─→ get_url(file_hash) for each file
    │     └─→ _match_url_pattern() for each URL
    ├─→ Store 2 (Folder)
    │     ├─→ search("*", limit=1000)
    │     ├─→ get_url(file_hash) for each file
    │     └─→ _match_url_pattern() for each URL
    └─→ ...more stores...

Matching URLs:
    ├─→ https://www.youtube.com/watch?v=123
    ├─→ http://youtube.com/shorts/abc
    └─→ https://youtube.com/playlist?list=xyz

Normalized for matching:
    ├─→ youtube.com/watch?v=123    ✓ Matches "youtube.com*"
    ├─→ youtube.com/shorts/abc     ✓ Matches "youtube.com*"
    └─→ youtube.com/playlist?...   ✓ Matches "youtube.com*"
    v
Collect UrlItem results
    ├─→ UrlItem(url="https://www.youtube.com/watch?v=123",
    │           hash="abcd1234...", store="hydrus")
    ├─→ UrlItem(url="http://youtube.com/shorts/abc",
    │           hash="efgh5678...", store="folder")
    └─→ ...more items...
    v
Group by store
    ├─→ Hydrus
    │     ├─→ https://www.youtube.com/watch?v=123
    │     └─→ ...
    └─→ Folder
          ├─→ http://youtube.com/shorts/abc
          └─→ ...
    v
Emit UrlItem objects for piping
    v
Return exit code 0 (success)
```
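
The two helpers in this diagram are small enough to sketch. The following is a minimal approximation assuming only the standard-library `fnmatch`; the implicit trailing wildcard is inferred from Example 1 below (a bare domain also matches URLs with paths), and the real implementation may differ in details such as normalization order.

```python
from fnmatch import fnmatch


def _normalize_url_for_search(url: str) -> str:
    """Strip protocol and a leading www., then lowercase."""
    url = url.strip().lower()
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    if url.startswith("www."):
        url = url[len("www."):]
    return url


def _match_url_pattern(url: str, pattern: str) -> bool:
    """fnmatch after normalizing both the URL and the pattern."""
    norm_url = _normalize_url_for_search(url)
    norm_pattern = _normalize_url_for_search(pattern)
    # Assumption: Example 1 shows a bare domain matching URLs with
    # paths, so a trailing wildcard is added when the pattern has none.
    if not norm_pattern.endswith("*"):
        norm_pattern += "*"
    return fnmatch(norm_url, norm_pattern)
```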
## Code Structure
```
Get_Url (class)
├── __init__()
│   └── Register command with CLI
├── _normalize_url_for_search() [static]
│   └── Strip protocol & www, lowercase
├── _match_url_pattern() [static]
│   └── fnmatch with normalization
├── _search_urls_across_stores() [instance]
│   ├── Iterate stores
│   ├── Search files in store
│   ├── Get URLs for each file
│   ├── Apply pattern matching
│   └── Return (items, stores_found)
└── run() [main execution]
    ├── Check for -url flag
    │   ├── YES: Search mode
    │   │   └── _search_urls_across_stores()
    │   └── NO: Original mode
    │       └── Hash+store lookup
    └── Return exit code
```
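
For orientation, here is a hedged sketch of the search walk described above. The store interface (`search()`, `get_url()`), the `UrlItem` fields, and the `(items, stores_found)` return shape are read off the diagrams in this document, not the actual Medios-Macina backend API; `_match_url_pattern` refers to the sketch in the previous section.

```python
from dataclasses import dataclass


@dataclass
class UrlItem:
    url: str
    hash: str
    store: str


def _search_urls_across_stores(stores, pattern):
    """Return (matching UrlItems, names of stores that had hits)."""
    items, stores_found = [], []
    for name, store in stores.items():
        try:
            file_hashes = store.search("*", limit=1000)
        except Exception as exc:
            # A failing store is logged and skipped, not fatal.
            print(f"Error searching store {name}: {exc}")
            continue
        found_here = False
        for file_hash in file_hashes:
            for url in store.get_url(file_hash):
                if _match_url_pattern(url, pattern):
                    items.append(UrlItem(url=url, hash=file_hash, store=name))
                    found_here = True
        if found_here:
            stores_found.append(name)
    return items, stores_found
```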
## Data Flow Examples
### Example 1: Search by Domain
```
Input:     get-url -url "www.google.com"
Normalize: "google.com" (www. stripped)

Search Results:
  Store "hydrus":
    - https://www.google.com              ✓
    - https://google.com/search?q=hello   ✓
    - https://google.com/maps             ✓
  Store "folder":
    - http://google.com                   ✓
    - https://google.com/images           ✓

Output: 5 matching URLs grouped by store
```
### Example 2: Wildcard Pattern
```
Input:   get-url -url "youtube.com/watch*"
Pattern: "youtube.com/watch*"

Search Results:
  Store "hydrus":
    - https://www.youtube.com/watch?v=123   ✓
    - https://youtube.com/watch?list=abc    ✓
    - https://www.youtube.com/shorts/xyz    ✗ (doesn't match /watch*)
  Store "folder":
    - http://youtube.com/watch?v=456        ✓

Output: 3 matching URLs (watch only, not shorts)
```
### Example 3: Subdomain Wildcard
```
Input:     get-url -url "*.example.com*"
Normalize: "*.example.com*" (already normalized)

Search Results:
  Store "hydrus":
    - https://cdn.example.com/video.mp4   ✓
    - https://api.example.com/endpoint    ✓
    - https://www.example.com             ✓
    - https://other.org                   ✗

Output: 3 matching URLs
```
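These examples can be spot-checked against the matcher sketched under "Code Structure" (the `www.example.com` case in Example 3 depends on whether stored URLs or only patterns are www-stripped, so it is omitted here):

```python
# Example 1: bare domain matches URLs with paths (implicit wildcard).
assert _match_url_pattern("https://google.com/search?q=hello",
                          "www.google.com")
# Example 2: /watch* matches watch URLs but not shorts.
assert _match_url_pattern("https://www.youtube.com/watch?v=123",
                          "youtube.com/watch*")
assert not _match_url_pattern("https://www.youtube.com/shorts/xyz",
                              "youtube.com/watch*")
# Example 3: subdomain wildcard matches cdn.*, rejects other domains.
assert _match_url_pattern("https://cdn.example.com/video.mp4",
                          "*.example.com*")
assert not _match_url_pattern("https://other.org", "*.example.com*")
```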
## Integration with Piping
```
# Search → Filter → Add Tag
get-url -url "youtube.com*" | add-tag -tag "video-source"

# Search → Count
get-url -url "reddit.com*" | wc -l

# Search → Export
get-url -url "github.com*" > github_urls.txt
```
## Error Handling Flow
```
get-url -url "pattern"
├─→ No stores configured?
│     └─→ Log "Error: No stores configured"
│           └─→ Return exit code 1
├─→ Store search fails?
│     └─→ Log error, skip store, continue
├─→ No matches found?
│     └─→ Log "No urls matching pattern"
│           └─→ Return exit code 1
└─→ Matches found?
      └─→ Return exit code 0
```
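
A hedged sketch of how these branches could map to code. The message strings come from the diagram; the function name and `stores` shape are illustrative only. Per-store failures are already handled inside the search walk (logged and skipped), so only the top-level branches appear here.

```python
def run_search(stores, pattern) -> int:
    """Sketch of the search-mode exit paths."""
    if not stores:
        print("Error: No stores configured")
        return 1
    items, _stores_found = _search_urls_across_stores(stores, pattern)
    if not items:
        print("No urls matching pattern")
        return 1
    for item in items:
        print(f"[{item.store}] {item.url}")
    return 0
```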
## Performance Considerations
1. **Store Iteration**: Loops through all configured stores sequentially
2. **File Scanning**: Each store searches up to 1000 files
3. **URL Matching**: Every URL is tested against the pattern (fnmatch, roughly linear in URL length)
4. **Memory**: All matching items are held in memory before display

Optimization opportunities (see the sketch after this list):
- Cache store results
- Limit search scope with --store flag
- Early exit with --limit N
- Pagination support
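
As a concrete illustration of the early-exit idea, here is a hypothetical variant of the search loop; neither this function nor a `--limit` flag exists in the current command.

```python
def _search_with_limit(stores, pattern, limit):
    """Like _search_urls_across_stores, but stops after `limit` hits."""
    items, stores_found = [], []
    for name, store in stores.items():
        for file_hash in store.search("*", limit=1000):
            for url in store.get_url(file_hash):
                if not _match_url_pattern(url, pattern):
                    continue
                items.append(UrlItem(url=url, hash=file_hash, store=name))
                if name not in stores_found:
                    stores_found.append(name)
                if len(items) >= limit:
                    return items, stores_found  # early exit: stop scanning
    return items, stores_found
```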
## Backward Compatibility
Original mode (unchanged):
```
@1 | get-url
└─→ No -url flag
      └─→ Use original logic
            ├─→ Get hash from result
            ├─→ Get store from result or args
            ├─→ Call backend.get_url(hash)
            └─→ Return URLs for that file
```
All original functionality is preserved; the new `-url` flag is purely additive.
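
A minimal sketch of the resulting dispatch, assuming the argument and result shapes from the trees above rather than the real CLI plumbing:

```python
def run(args, result, stores) -> int:
    """Sketch of the mode dispatch in Get_Url.run()."""
    if getattr(args, "url", None):
        # -url present: new search mode.
        items, _ = _search_urls_across_stores(stores, args.url)
        return 0 if items else 1
    # No -url flag: original mode, unchanged.
    store_name = result.store or args.store   # store from result or args
    urls = stores[store_name].get_url(result.hash)
    for url in urls:
        print(url)
    return 0 if urls else 1
```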