Medios-Macina/HASH_STORE_PRIORITY_PATTERN.md
2025-12-11 12:47:30 -08:00


# Hash+Store Priority Pattern & Database Connection Fixes
## Summary of Changes
### 1. Database Connection Leak Fixes ✅
**Problem:** FolderDB connections were not being properly closed, causing database locks and resource leaks.
**Files Fixed:**
- `cmdlets/search_store.py` - Now uses `with FolderDB()` context manager
- `cmdlets/search_provider.py` - Now uses `with FolderDB()` context manager
- `helper/store.py` (Folder.__init__) - Now uses `with FolderDB()` for temporary connections
- `helper/worker_manager.py` - Added `close()` method and context manager support (`__enter__`/`__exit__`)
**Pattern:**
```python
# OLD (leaked connections):
db = FolderDB(path)
try:
    db.do_something()
finally:
    if db:
        db.close()  # Skipped if an exception occurs before the try block runs

# NEW (guaranteed cleanup):
with FolderDB(path) as db:
    db.do_something()
# Connection automatically closed when exiting the block
```
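The context-manager support added to `helper/worker_manager.py` can follow the same shape. A minimal sketch of what `close()`/`__enter__`/`__exit__` might look like (the real class wraps an actual database handle, which is stubbed out here):

```python
class WorkerManager:
    """Sketch of context-manager support; the real class holds a DB connection."""

    def __init__(self, library_path):
        self.library_path = library_path
        self._db = object()  # placeholder for the real database handle

    def close(self):
        """Release the underlying database connection."""
        self._db = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Always close, even when the block raised an exception.
        self.close()
        return False  # do not suppress exceptions
```

Because `__exit__` runs on both normal and exceptional exits, `with WorkerManager(...) as wm:` gives the same guarantee as the `FolderDB` pattern above.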
### 2. Hash+Store Priority Pattern ✅
**Philosophy:** The hash+store pair is the **canonical identifier** for files across all storage backends. Sort order and table structure should not matter because we're always using hash+store.
**Why This Matters:**
- `@N` selections pass hash+store from search results
- Hash+store works consistently across all backends (Hydrus, Folder, Remote)
- Path-based resolution is fragile (files move, temp paths expire, etc.)
- Hash+store never changes and uniquely identifies content
**Updated Resolution Priority in `add_file.py`:**
```python
def _resolve_source(result, path_arg, pipe_obj, config):
    """Resolve the source file for add-file.

    PRIORITY 1: hash+store from the result dict (most reliable for @N selections)
        - Checks result.get("hash") and result.get("store")
        - Uses FileStorage[store].get_file(hash) to retrieve the file
        - Works for the Hydrus, Folder, and Remote backends
    PRIORITY 2: Explicit -path argument
        - Direct path specified by the user
    PRIORITY 3: pipe_obj.file_path
        - Legacy path from the previous pipeline stage
    PRIORITY 4: Hydrus hash from pipe_obj.extra
        - Fallback for older Hydrus workflows
    PRIORITY 5: String/list result parsing
        - Last resort for simple string paths
    """
```
**Example Flow:**
```bash
# User searches and selects a result
$ search-store system:limit=5
# Result items include:
#   {
#     "hash": "a1b2c3d4...",
#     "store": "home",        # Specific Hydrus instance
#     "title": "example.mp4"
#   }

# User selects @2 (index 1)
$ @2 | add-file -storage test

# add-file now:
#   1. Extracts hash="a1b2c3d4..." store="home" from the result dict
#   2. Calls FileStorage["home"].get_file("a1b2c3d4...")
#   3. Retrieves the actual file path from the "home" backend
#   4. Proceeds with the copy/upload to the "test" storage
```
### 3. Benefits of This Approach
**Consistency:**
- @N selection always uses the same hash+store regardless of display order
- No confusion about which row index maps to which file
- Table synchronization issues (rows vs items) don't break selection
**Reliability:**
- Hash uniquely identifies content (SHA256 collision is effectively impossible)
- Store identifies the authoritative source backend
- No dependency on temporary paths or file locations
**Multi-Instance Support:**
- Works seamlessly with multiple Hydrus instances ("home", "work")
- Works with mixed backends (Hydrus + Folder + Remote)
- Each backend can independently retrieve file by hash
**Debugging:**
- Hash+store are visible in debug logs: `[add-file] Using hash+store: hash=a1b2c3d4..., store=home`
- Easy to trace which backend is being queried
- Clear error messages when hash+store lookup fails
## How @N Selection Works Now
### Selection Process:
1. **Search creates result list with hash+store:**
```python
results_list = [
    {"hash": "abc123...", "store": "home", "title": "file1.mp4"},
    {"hash": "def456...", "store": "default", "title": "file2.jpg"},
    {"hash": "ghi789...", "store": "test", "title": "file3.png"},
]
```
2. **User selects @2 (second item, index 1):**
- CLI extracts: `result = {"hash": "def456...", "store": "default", "title": "file2.jpg"}`
- Passes this dict to the next cmdlet
3. **Next cmdlet receives dict with hash+store:**
```python
def run(self, result, args, config):
    # result is the dict from the selection
    file_hash = result.get("hash")    # "def456..."
    store_name = result.get("store")  # "default"

    # Use hash+store to retrieve the file
    backend = FileStorage(config)[store_name]
    file_path = backend.get_file(file_hash)
```
### Why This is Better Than Path-Based:
**Path-Based (OLD):**
```python
# Fragile: path could be temp file, symlink, moved file, etc.
result = {"file_path": "/tmp/hydrus-abc123.mp4"}
# What if file was moved? What if it's a temp path that expires?
```
**Hash+Store (NEW):**
```python
# Reliable: hash+store always works regardless of current location
result = {"hash": "abc123...", "store": "home"}
# Backend retrieves current location from its database/API
```
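Taken together, the selection steps above reduce to a 1-based index lookup plus a hash+store handoff. A minimal runnable sketch (the `select` helper name is hypothetical, not the actual CLI internals):

```python
def select(results_list, n):
    """@N selection: a 1-based index into the current result list (sketch)."""
    if not 1 <= n <= len(results_list):
        raise IndexError(f"@{n} is out of range for {len(results_list)} results")
    return results_list[n - 1]  # @2 -> index 1

results = [
    {"hash": "abc123", "store": "home", "title": "file1.mp4"},
    {"hash": "def456", "store": "default", "title": "file2.jpg"},
]
selected = select(results, 2)
# Downstream cmdlets only need the canonical identifier pair:
file_hash, store_name = selected["hash"], selected["store"]
```

Note that the lookup never touches a file path: whatever the backend's current storage layout, `@2` always resolves to the same hash+store pair.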
## Testing the Fixes
### 1. Test Database Connections:
```powershell
# Search multiple times and check for database locks
search-store system:limit=5
search-store system:limit=5
search-store system:limit=5
# Should complete without "database is locked" errors
```
### 2. Test Hash+Store Selection:
```powershell
# Search and select
search-store system:limit=5
@2 | get-metadata
# Should show metadata for the selected file, resolved via hash+store
# For add-file pipelines, the debug log shows: [add-file] Using hash+store from result: hash=...
```
### 3. Test WorkerManager Cleanup:
```python
# In a Python script:
from pathlib import Path

from helper.worker_manager import WorkerManager

with WorkerManager(Path("C:/path/to/library")) as wm:
    # Do work
    pass
# Database automatically closed when exiting the block
```
## Cmdlets That Already Use Hash+Store Pattern
These cmdlets already correctly extract hash+store:
- ✅ `get-file` - Export file via hash+store
- ✅ `get-metadata` - Retrieve metadata via hash+store
- ✅ `get-url` - Get URL via hash+store
- ✅ `get-tag` - Get tags via hash+store
- ✅ `add-url` - Add URL via hash+store
- ✅ `delete-url` - Delete URL via hash+store
- ✅ `add-file` - **NOW UPDATED** to prioritize hash+store
## Future Improvements
1. **Make hash+store mandatory in result dicts:**
- All search cmdlets should emit hash+store
- Validate that result dicts include these fields
2. **Add hash+store validation:**
- Warn if hash is not 64-char hex string
- Warn if store is not a registered backend
3. **Standardize error messages:**
- "File not found via hash+store: hash=abc123 store=home"
- Makes debugging much clearer
4. **Consider deprecating path-based workflows:**
- Migrate legacy cmdlets to hash+store pattern
- Remove path-based fallbacks once all cmdlets updated
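Item 2 above could be a small helper along these lines. This is a sketch only: the helper name, the warn-don't-fail policy, and the shape of the registered-store collection are all assumptions:

```python
import re
import warnings

# SHA256 hex digest: exactly 64 lowercase hex characters
_HASH_RE = re.compile(r"[0-9a-f]{64}")

def validate_hash_store(file_hash, store, registered_stores):
    """Warn (rather than fail) on a malformed hash+store pair; return validity."""
    ok = True
    if not (isinstance(file_hash, str) and _HASH_RE.fullmatch(file_hash.lower())):
        warnings.warn(f"hash is not a 64-char hex string: {file_hash!r}")
        ok = False
    if store not in registered_stores:
        warnings.warn(f"store is not a registered backend: {store!r}")
        ok = False
    return ok
```

Calling this at the top of each cmdlet's hash+store branch would surface malformed result dicts early, without breaking legacy path-based fallbacks.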
## Key Takeaway
**The hash+store pair is now the primary way to identify and retrieve files across the entire system.** This makes the codebase more reliable, consistent, and easier to debug. Database connections are properly cleaned up to prevent locks and resource leaks.