Medios-Macina/HASH_STORE_PRIORITY_PATTERN.md
2025-12-11 12:47:30 -08:00

Hash+Store Priority Pattern & Database Connection Fixes

Summary of Changes

1. Database Connection Leak Fixes

Problem: FolderDB connections were not being properly closed, causing database locks and resource leaks.

Files Fixed:

  • cmdlets/search_store.py - Now uses with FolderDB() context manager
  • cmdlets/search_provider.py - Now uses with FolderDB() context manager
  • helper/store.py (Folder.init) - Now uses with FolderDB() for temporary connections
  • helper/worker_manager.py - Added close() method and context manager support (__enter__/__exit__)

Pattern:

```python
# OLD (leak-prone): easy to forget the try/finally entirely, and the
# connection leaks if an exception occurs before the try block is entered.
db = FolderDB(path)
try:
    db.do_something()
finally:
    if db:
        db.close()

# NEW (guaranteed cleanup):
with FolderDB(path) as db:
    db.do_something()
# Connection automatically closed when exiting the block, even on error.
```
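The `with` support added to `WorkerManager` (and relied on for `FolderDB`) comes down to a pair of dunder methods. A minimal sketch, assuming a `close()` method and an internal `_conn` attribute (both names illustrative, backed here by `sqlite3` for demonstration):

```python
import sqlite3


class FolderDB:
    """Illustrative context-manager-aware DB wrapper."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)

    def do_something(self):
        return self._conn.execute("SELECT 1").fetchone()[0]

    def close(self):
        # Idempotent close so repeated calls are harmless.
        if self._conn is not None:
            self._conn.close()
            self._conn = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs even if the body raised, so the connection can never leak.
        self.close()
        return False  # do not suppress exceptions
```

Any class with these two methods can be used in a `with` block, which is what makes the cleanup guarantee hold regardless of where an exception is raised.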

2. Hash+Store Priority Pattern

Philosophy: The hash+store pair is the canonical identifier for files across all storage backends. Display sort order and table row structure should not matter, because selection always resolves through hash+store.

Why This Matters:

  • @N selections pass hash+store from search results
  • Hash+store works consistently across all backends (Hydrus, Folder, Remote)
  • Path-based resolution is fragile (files move, temp paths expire, etc.)
  • Hash+store never changes and uniquely identifies content

Updated Resolution Priority in add_file.py:

```python
def _resolve_source(result, path_arg, pipe_obj, config):
    """
    PRIORITY 1: hash+store from result dict (most reliable for @N selections)
       - Checks result.get("hash") and result.get("store")
       - Uses FileStorage[store].get_file(hash) to retrieve
       - Works for: Hydrus, Folder, Remote backends

    PRIORITY 2: Explicit -path argument
       - Direct path specified by user

    PRIORITY 3: pipe_obj.file_path
       - Legacy path from previous pipeline stage

    PRIORITY 4: Hydrus hash from pipe_obj.extra
       - Fallback for older Hydrus workflows

    PRIORITY 5: String/list result parsing
       - Last resort for simple string paths
    """
```

Example Flow:

```shell
# User searches and selects a result
$ search-store system:limit=5
```

Result items include:

```python
{
  "hash": "a1b2c3d4...",
  "store": "home",  # Specific Hydrus instance
  "title": "example.mp4"
}
```

```shell
# User selects @2 (index 1)
$ @2 | add-file -storage test
```

add-file now:

  1. Extracts hash="a1b2c3d4..." and store="home" from the result dict
  2. Calls FileStorage["home"].get_file("a1b2c3d4...")
  3. Retrieves the actual file path from the "home" backend
  4. Proceeds with the copy/upload to the "test" storage

3. Benefits of This Approach

Consistency:

  • @N selection always uses the same hash+store regardless of display order
  • No confusion about which row index maps to which file
  • Table synchronization issues (rows vs items) don't break selection

Reliability:

  • Hash uniquely identifies content (SHA256 collision is effectively impossible)
  • Store identifies the authoritative source backend
  • No dependency on temporary paths or file locations

Multi-Instance Support:

  • Works seamlessly with multiple Hydrus instances ("home", "work")
  • Works with mixed backends (Hydrus + Folder + Remote)
  • Each backend can independently retrieve file by hash

Debugging:

  • Hash+store are visible in debug logs: `[add-file] Using hash+store: hash=a1b2c3d4..., store=home`
  • Easy to trace which backend is being queried
  • Clear error messages when hash+store lookup fails

How @N Selection Works Now

Selection Process:

  1. Search creates a result list with hash+store:

```python
results_list = [
    {"hash": "abc123...", "store": "home", "title": "file1.mp4"},
    {"hash": "def456...", "store": "default", "title": "file2.jpg"},
    {"hash": "ghi789...", "store": "test", "title": "file3.png"},
]
```

  2. User selects @2 (second item, index 1):

    • CLI extracts: result = {"hash": "def456...", "store": "default", "title": "file2.jpg"}
    • Passes this dict to the next cmdlet

  3. The next cmdlet receives the dict with hash+store:

```python
def run(self, result, args, config):
    # result is the dict from selection
    file_hash = result.get("hash")    # "def456..."
    store_name = result.get("store")  # "default"

    # Use hash+store to retrieve the file
    backend = FileStorage(config)[store_name]
    file_path = backend.get_file(file_hash)
```
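The extraction in step 2 is just an index lookup into the current result list. A minimal sketch of a hypothetical `select()` helper (the real CLI parsing may differ; 1-based `@N` syntax assumed):

```python
def select(results_list, token):
    """Resolve an @N token against the current results (illustrative)."""
    if not token.startswith("@"):
        raise ValueError(f"not a selection token: {token!r}")
    index = int(token[1:]) - 1  # @N is 1-based; the list is 0-based
    return results_list[index]


results_list = [
    {"hash": "abc123", "store": "home", "title": "file1.mp4"},
    {"hash": "def456", "store": "default", "title": "file2.jpg"},
]
selected = select(results_list, "@2")
```

Because the selected dict carries hash+store, whatever cmdlet receives it next can resolve the file without ever touching a row index or a path.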

Why This is Better Than Path-Based:

Path-Based (OLD):

```python
# Fragile: the path could be a temp file, a symlink, a moved file, etc.
result = {"file_path": "/tmp/hydrus-abc123.mp4"}
# What if the file was moved? What if it's a temp path that expires?
```

Hash+Store (NEW):

```python
# Reliable: hash+store works regardless of the file's current location
result = {"hash": "abc123...", "store": "home"}
# The backend retrieves the current location from its own database/API
```

Testing the Fixes

1. Test Database Connections:

```shell
# Search multiple times and check for database locks
search-store system:limit=5
search-store system:limit=5
search-store system:limit=5

# Should complete without "database is locked" errors
```

2. Test Hash+Store Selection:

```shell
# Search and select
search-store system:limit=5
@2 | get-metadata

# Should show metadata for the selected file, resolved via hash+store
# (piping to add-file instead logs: [add-file] Using hash+store from result: hash=...)
```

3. Test WorkerManager Cleanup:

```python
# In a Python script:
from helper.worker_manager import WorkerManager
from pathlib import Path

with WorkerManager(Path("C:/path/to/library")) as wm:
    ...  # do work
# Database automatically closed when exiting the block
```

Cmdlets That Already Use Hash+Store Pattern

These cmdlets already correctly extract hash+store:

  • get-file - Export file via hash+store
  • get-metadata - Retrieve metadata via hash+store
  • get-url - Get URLs via hash+store
  • get-tag - Get tags via hash+store
  • add-url - Add URL via hash+store
  • delete-url - Delete URL via hash+store
  • add-file - NOW UPDATED to prioritize hash+store

Future Improvements

  1. Make hash+store mandatory in result dicts:

    • All search cmdlets should emit hash+store
    • Validate that result dicts include these fields
  2. Add hash+store validation:

    • Warn if hash is not 64-char hex string
    • Warn if store is not a registered backend
  3. Standardize error messages:

    • "File not found via hash+store: hash=abc123 store=home"
    • Makes debugging much clearer
  4. Consider deprecating path-based workflows:

    • Migrate legacy cmdlets to hash+store pattern
    • Remove path-based fallbacks once all cmdlets updated
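The validation proposed in item 2 is cheap to prototype. A sketch of a hypothetical `validate_hash_store()` helper (function name and return shape are illustrative; `registered_stores` stands in for however backends are enumerated):

```python
import re

# SHA256 hashes are 64 lowercase hex characters.
HEX64 = re.compile(r"^[0-9a-f]{64}$")


def validate_hash_store(result, registered_stores):
    """Return a list of warning strings for a result dict (illustrative)."""
    warnings = []
    h = result.get("hash")
    if not (isinstance(h, str) and HEX64.match(h.lower())):
        warnings.append(f"hash is not a 64-char hex string: {h!r}")
    store = result.get("store")
    if store not in registered_stores:
        warnings.append(f"store is not a registered backend: {store!r}")
    return warnings
```

Emitting these as warnings rather than errors keeps legacy path-based results flowing while the migration in item 4 is still in progress.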

Key Takeaway

The hash+store pair is now the primary way to identify and retrieve files across the entire system. This makes the codebase more reliable, consistent, and easier to debug. Database connections are properly cleaned up to prevent locks and resource leaks.