mirror of
https://github.com/Significant-Gravitas/AutoGPT.git
synced 2026-02-08 22:05:08 -05:00
## Summary This PR extends the embedding system to support **blocks** and **documentation** content types in addition to store agents, and introduces **unified hybrid search** across all content types using a single `UnifiedContentEmbedding` table. ### Key Changes 1. **Unified Hybrid Search Architecture** - Added `search` tsvector column to `UnifiedContentEmbedding` table - New `unified_hybrid_search()` function searches across all content types (agents, blocks, docs) - Updated `hybrid_search()` for store agents to use `UnifiedContentEmbedding.search` - Removed deprecated `search` column from `StoreListingVersion` table 2. **Pluggable Content Handler Architecture** - Created abstract `ContentHandler` base class for extensibility - Implemented handlers: `StoreAgentHandler`, `BlockHandler`, `DocumentationHandler` - Registry pattern for easy addition of new content types 3. **Block Embeddings** - Discovers all blocks using `get_blocks()` - Extracts searchable text from: name, description, categories, input/output schemas 4. **Documentation Embeddings** - Scans `/docs/` directory for `.md` and `.mdx` files - Extracts title from first `#` heading or uses filename as fallback 5. **Hybrid Search Graceful Degradation** - Falls back to lexical-only search if query embedding generation fails - Redistributes semantic weight proportionally to other components - Logs warning instead of throwing error 6. **Database Migrations** - `20260115200000_add_unified_search_tsvector`: Adds search column to UnifiedContentEmbedding with auto-update trigger - `20260115210000_remove_storelistingversion_search`: Removes deprecated search column and updates StoreAgent view 7. **Orphan Cleanup** - `cleanup_orphaned_embeddings()` removes embeddings for deleted content - Always runs after backfill, even at 100% coverage ### Review Comments Addressed - ✅ SQL parameter index bug when user_id provided (embeddings.py) - ✅ Early return skipping cleanup at 100% coverage (scheduler.py) - ✅ Inconsistent return structure across code paths (scheduler.py) - ✅ SQL UNION syntax error - added parentheses for ORDER BY/LIMIT (hybrid_search.py) - ✅ Version numeric ordering in aggregations (migration) - ✅ Embedding dimension uses EMBEDDING_DIM constant ### Files Changed - `backend/api/features/store/content_handlers.py` (NEW): Handler architecture - `backend/api/features/store/embeddings.py`: Refactored to use handlers - `backend/api/features/store/hybrid_search.py`: Unified search + graceful degradation - `backend/executor/scheduler.py`: Process all content types, consistent returns - `migrations/20260115200000_add_unified_search_tsvector/`: Add tsvector to unified table - `migrations/20260115210000_remove_storelistingversion_search/`: Remove old search column - `schema.prisma`: Updated UnifiedContentEmbedding and StoreListingVersion models - `*_test.py`: Added tests for unified_hybrid_search ## Test Plan 1. ✅ All tests passing on Python 3.11, 3.12, 3.13 2. ✅ Types check passing 3. ✅ CodeRabbit and Sentry reviews addressed 4. Deploy to staging and verify: - Backfill job processes all content types - Search results include blocks and docs - Search works without OpenAI API (graceful degradation) 🤖 Generated with [Claude Code](https://claude.ai/code) --------- Co-authored-by: Swifty <craigswift13@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>