# Hash+Store Priority Pattern & Database Connection Fixes

## Summary of Changes

### 1. Database Connection Leak Fixes ✅

**Problem:** FolderDB connections were not being properly closed, causing database locks and resource leaks.

**Files Fixed:**

- `cmdlets/search_store.py` - Now uses `with FolderDB()` context manager
- `cmdlets/search_provider.py` - Now uses `with FolderDB()` context manager
- `helper/store.py` (`Folder.__init__`) - Now uses `with FolderDB()` for temporary connections
- `helper/worker_manager.py` - Added `close()` method and context manager support (`__enter__`/`__exit__`)

**Pattern:**

```python
# OLD (leaked connections):
db = FolderDB(path)
try:
    db.do_something()
finally:
    if db:
        db.close()  # Could be skipped if an exception occurs before the try block is entered

# NEW (guaranteed cleanup):
with FolderDB(path) as db:
    db.do_something()
# Connection automatically closed when exiting the block
```
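
Where a connection class predates `__enter__`/`__exit__` (as `WorkerManager` did before this change), the standard library's `contextlib.closing` gives the same guarantee without modifying the class. A minimal sketch with a stand-in class (`LegacyDB` is hypothetical, not part of the codebase):

```python
from contextlib import closing

class LegacyDB:
    """Stand-in for a connection type that has close() but no __enter__/__exit__."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

# closing() calls db.close() on exit, even if the body raises.
with closing(LegacyDB()) as db:
    pass  # use db here

assert db.closed
```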

### 2. Hash+Store Priority Pattern ✅

**Philosophy:** The hash+store pair is the **canonical identifier** for files across all storage backends. Sort order and table structure should not matter, because resolution always goes through hash+store.

**Why This Matters:**

- `@N` selections pass hash+store from search results
- Hash+store works consistently across all backends (Hydrus, Folder, Remote)
- Path-based resolution is fragile (files move, temp paths expire, etc.)
- Hash+store never changes and uniquely identifies content
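
The canonical-identifier contract can be sketched as a small helper. `extract_hash_store` is a hypothetical name, shown only to make the rule concrete; the real cmdlets read the fields inline:

```python
from typing import Optional, Tuple

def extract_hash_store(result) -> Optional[Tuple[str, str]]:
    """Return the canonical (hash, store) pair from a result dict, or None.

    Both fields must be present, string-typed, and non-empty; anything else
    forces a fall back to the more fragile path-based resolution.
    """
    if not isinstance(result, dict):
        return None
    file_hash = result.get("hash")
    store = result.get("store")
    if isinstance(file_hash, str) and file_hash and isinstance(store, str) and store:
        return file_hash, store
    return None
```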

**Updated Resolution Priority in `add_file.py`:**

```python
def _resolve_source(result, path_arg, pipe_obj, config):
    """
    PRIORITY 1: hash+store from result dict (most reliable for @N selections)
    - Checks result.get("hash") and result.get("store")
    - Uses FileStorage[store].get_file(hash) to retrieve
    - Works for: Hydrus, Folder, Remote backends

    PRIORITY 2: Explicit -path argument
    - Direct path specified by user

    PRIORITY 3: pipe_obj.file_path
    - Legacy path from previous pipeline stage

    PRIORITY 4: Hydrus hash from pipe_obj.extra
    - Fallback for older Hydrus workflows

    PRIORITY 5: String/list result parsing
    - Last resort for simple string paths
    """
```
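
The body behind that docstring might look roughly like the following sketch. This is not the real implementation: `storage` is modeled as a plain mapping of backends, and the `"hydrus"` key and attribute names are taken from the docstring, not the actual code:

```python
from pathlib import Path

def resolve_source(result, path_arg, pipe_obj, storage):
    """Sketch of the five-step priority chain described in the docstring."""
    # PRIORITY 1: hash+store from the result dict
    if isinstance(result, dict) and result.get("hash") and result.get("store"):
        return storage[result["store"]].get_file(result["hash"])
    # PRIORITY 2: explicit -path argument
    if path_arg:
        return Path(path_arg)
    # PRIORITY 3: legacy path from the previous pipeline stage
    if pipe_obj is not None and getattr(pipe_obj, "file_path", None):
        return Path(pipe_obj.file_path)
    # PRIORITY 4: Hydrus hash carried in pipe_obj.extra
    if pipe_obj is not None and getattr(pipe_obj, "extra", {}).get("hash"):
        return storage["hydrus"].get_file(pipe_obj.extra["hash"])
    # PRIORITY 5: last resort -- treat a plain string result as a path
    if isinstance(result, str):
        return Path(result)
    return None
```

Each rule returns as soon as it matches, so a result dict carrying hash+store always wins over any path that may also be present.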

**Example Flow:**

```bash
# User searches and selects a result
$ search-store system:limit=5

# Result items include:
# {
#     "hash": "a1b2c3d4...",
#     "store": "home",        # Specific Hydrus instance
#     "title": "example.mp4"
# }

# User selects @2 (index 1)
$ @2 | add-file -storage test

# add-file now:
# 1. Extracts hash="a1b2c3d4..." store="home" from the result dict
# 2. Calls FileStorage["home"].get_file("a1b2c3d4...")
# 3. Retrieves the actual file path from the "home" backend
# 4. Proceeds with copy/upload to "test" storage
```

### 3. Benefits of This Approach

**Consistency:**
- `@N` selection always uses the same hash+store regardless of display order
- No confusion about which row index maps to which file
- Table synchronization issues (rows vs items) don't break selection

**Reliability:**
- Hash uniquely identifies content (a SHA-256 collision is effectively impossible)
- Store identifies the authoritative source backend
- No dependency on temporary paths or file locations

**Multi-Instance Support:**
- Works seamlessly with multiple Hydrus instances ("home", "work")
- Works with mixed backends (Hydrus + Folder + Remote)
- Each backend can independently retrieve a file by hash

**Debugging:**
- Hash+store are visible in debug logs: `[add-file] Using hash+store: hash=a1b2c3d4..., store=home`
- Easy to trace which backend is being queried
- Clear error messages when hash+store lookup fails

## How @N Selection Works Now

### Selection Process:

1. **Search creates a result list with hash+store:**

   ```python
   results_list = [
       {"hash": "abc123...", "store": "home", "title": "file1.mp4"},
       {"hash": "def456...", "store": "default", "title": "file2.jpg"},
       {"hash": "ghi789...", "store": "test", "title": "file3.png"},
   ]
   ```

2. **User selects @2 (second item, index 1):**
   - CLI extracts: `result = {"hash": "def456...", "store": "default", "title": "file2.jpg"}`
   - Passes this dict to the next cmdlet

3. **Next cmdlet receives the dict with hash+store:**

   ```python
   def run(self, result, args, config):
       # result is the dict from selection
       file_hash = result.get("hash")    # "def456..."
       store_name = result.get("store")  # "default"

       # Use hash+store to retrieve the file
       backend = FileStorage(config)[store_name]
       file_path = backend.get_file(file_hash)
   ```

### Why This is Better Than Path-Based:

**Path-Based (OLD):**

```python
# Fragile: path could be a temp file, symlink, moved file, etc.
result = {"file_path": "/tmp/hydrus-abc123.mp4"}
# What if the file was moved? What if it's a temp path that expires?
```

**Hash+Store (NEW):**

```python
# Reliable: hash+store always works regardless of current location
result = {"hash": "abc123...", "store": "home"}
# The backend retrieves the current location from its database/API
```

## Testing the Fixes

### 1. Test Database Connections:

```powershell
# Search multiple times and check for database locks
search-store system:limit=5
search-store system:limit=5
search-store system:limit=5

# Should complete without "database is locked" errors
```

### 2. Test Hash+Store Selection:

```powershell
# Search and select
search-store system:limit=5
@2 | get-metadata

# Should show metadata for the selected file using hash+store
# Debug log should show: [add-file] Using hash+store from result: hash=...
```

### 3. Test WorkerManager Cleanup:

```python
# In a Python script:
from helper.worker_manager import WorkerManager
from pathlib import Path

with WorkerManager(Path("C:/path/to/library")) as wm:
    # Do work
    pass
# Database automatically closed when exiting the block
```
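
The `close()`/`__enter__`/`__exit__` support added to `WorkerManager` follows Python's standard context-manager protocol. A minimal sketch with a stand-in class (the real class wraps its database handle rather than a flag):

```python
class ManagedWorker:
    """Stand-in illustrating the context-manager pattern added to WorkerManager."""
    def __init__(self, library_path):
        self.library_path = library_path
        self.closed = False  # stands in for an open database handle

    def close(self):
        self.closed = True  # idempotent: safe to call more than once

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions raised in the with-body

with ManagedWorker("C:/path/to/library") as wm:
    pass  # do work

assert wm.closed  # close() ran on exit
```

Because `__exit__` runs whether or not the body raises, cleanup no longer depends on every call site remembering a `try`/`finally`.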

## Cmdlets That Already Use Hash+Store Pattern

These cmdlets already correctly extract hash+store:
- ✅ `get-file` - Export file via hash+store
- ✅ `get-metadata` - Retrieve metadata via hash+store
- ✅ `get-url` - Get URL via hash+store
- ✅ `get-tag` - Get tags via hash+store
- ✅ `add-url` - Add URL via hash+store
- ✅ `delete-url` - Delete URL via hash+store
- ✅ `add-file` - **NOW UPDATED** to prioritize hash+store

## Future Improvements

1. **Make hash+store mandatory in result dicts:**
   - All search cmdlets should emit hash+store
   - Validate that result dicts include these fields

2. **Add hash+store validation:**
   - Warn if the hash is not a 64-char hex string
   - Warn if the store is not a registered backend

3. **Standardize error messages:**
   - "File not found via hash+store: hash=abc123 store=home"
   - Makes debugging much clearer

4. **Consider deprecating path-based workflows:**
   - Migrate legacy cmdlets to the hash+store pattern
   - Remove path-based fallbacks once all cmdlets are updated
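
The validation in item 2 could be sketched as follows. `validate_hash_store` and its return shape are hypothetical, shown only to make the warnings concrete; lowercase hex is assumed, matching the example hashes above:

```python
import re

_HEX64 = re.compile(r"^[0-9a-f]{64}$")

def validate_hash_store(file_hash, store, registered_stores):
    """Return a list of warning strings; an empty list means the pair looks valid."""
    warnings = []
    if not _HEX64.match(file_hash or ""):
        warnings.append(f"hash is not a 64-char lowercase hex string: {file_hash!r}")
    if store not in registered_stores:
        warnings.append(f"store is not a registered backend: {store!r}")
    return warnings
```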

## Key Takeaway

**The hash+store pair is now the primary way to identify and retrieve files across the entire system.** This makes the codebase more reliable, consistent, and easier to debug. Database connections are properly cleaned up to prevent locks and resource leaks.