Files
AutoGPT/.dockerignore
Zamil Majdy 8b83bb8647 feat(backend): unified hybrid search with embedding backfill for all content types (#11767)
## Summary

This PR extends the embedding system to support **blocks** and
**documentation** content types in addition to store agents, and
introduces **unified hybrid search** across all content types using a
single `UnifiedContentEmbedding` table.

### Key Changes

1. **Unified Hybrid Search Architecture**
   - Added `search` tsvector column to `UnifiedContentEmbedding` table
- New `unified_hybrid_search()` function searches across all content
types (agents, blocks, docs)
- Updated `hybrid_search()` for store agents to use
`UnifiedContentEmbedding.search`
   - Removed deprecated `search` column from `StoreListingVersion` table

2. **Pluggable Content Handler Architecture**
   - Created abstract `ContentHandler` base class for extensibility
- Implemented handlers: `StoreAgentHandler`, `BlockHandler`,
`DocumentationHandler`
   - Registry pattern for easy addition of new content types

3. **Block Embeddings**
   - Discovers all blocks using `get_blocks()`
- Extracts searchable text from: name, description, categories,
input/output schemas

4. **Documentation Embeddings**
   - Scans `/docs/` directory for `.md` and `.mdx` files
   - Extracts title from first `#` heading or uses filename as fallback

5. **Hybrid Search Graceful Degradation**
- Falls back to lexical-only search if query embedding generation fails
   - Redistributes semantic weight proportionally to other components
   - Logs warning instead of throwing error

6. **Database Migrations**
- `20260115200000_add_unified_search_tsvector`: Adds search column to
UnifiedContentEmbedding with auto-update trigger
- `20260115210000_remove_storelistingversion_search`: Removes deprecated
search column and updates StoreAgent view

7. **Orphan Cleanup**
- `cleanup_orphaned_embeddings()` removes embeddings for deleted content
   - Always runs after backfill, even at 100% coverage

### Review Comments Addressed

-  SQL parameter index bug when user_id provided (embeddings.py)
-  Early return skipping cleanup at 100% coverage (scheduler.py)
-  Inconsistent return structure across code paths (scheduler.py)
-  SQL UNION syntax error - added parentheses for ORDER BY/LIMIT
(hybrid_search.py)
-  Version numeric ordering in aggregations (migration)
-  Embedding dimension uses EMBEDDING_DIM constant

### Files Changed

- `backend/api/features/store/content_handlers.py` (NEW): Handler
architecture
- `backend/api/features/store/embeddings.py`: Refactored to use handlers
- `backend/api/features/store/hybrid_search.py`: Unified search +
graceful degradation
- `backend/executor/scheduler.py`: Process all content types, consistent
returns
- `migrations/20260115200000_add_unified_search_tsvector/`: Add tsvector
to unified table
- `migrations/20260115210000_remove_storelistingversion_search/`: Remove
old search column
- `schema.prisma`: Updated UnifiedContentEmbedding and
StoreListingVersion models
- `*_test.py`: Added tests for unified_hybrid_search

## Test Plan

1.  All tests passing on Python 3.11, 3.12, 3.13
2.  Types check passing
3.  CodeRabbit and Sentry reviews addressed
4. Deploy to staging and verify:
   - Backfill job processes all content types
   - Search results include blocks and docs
   - Search works without OpenAI API (graceful degradation)

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Swifty <craigswift13@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 09:47:19 +01:00

70 lines
1.9 KiB
Plaintext

# Ignore everything by default, selectively add things to context
*
# Documentation (for embeddings/search)
!docs/
# Platform - Libs
!autogpt_platform/autogpt_libs/autogpt_libs/
!autogpt_platform/autogpt_libs/pyproject.toml
!autogpt_platform/autogpt_libs/poetry.lock
!autogpt_platform/autogpt_libs/README.md
# Platform - Backend
!autogpt_platform/backend/backend/
!autogpt_platform/backend/test/e2e_test_data.py
!autogpt_platform/backend/migrations/
!autogpt_platform/backend/schema.prisma
!autogpt_platform/backend/pyproject.toml
!autogpt_platform/backend/poetry.lock
!autogpt_platform/backend/README.md
!autogpt_platform/backend/.env
!autogpt_platform/backend/gen_prisma_types_stub.py
# Platform - Market
!autogpt_platform/market/market/
!autogpt_platform/market/scripts.py
!autogpt_platform/market/schema.prisma
!autogpt_platform/market/pyproject.toml
!autogpt_platform/market/poetry.lock
!autogpt_platform/market/README.md
# Platform - Frontend
!autogpt_platform/frontend/src/
!autogpt_platform/frontend/public/
!autogpt_platform/frontend/scripts/
!autogpt_platform/frontend/package.json
!autogpt_platform/frontend/pnpm-lock.yaml
!autogpt_platform/frontend/tsconfig.json
!autogpt_platform/frontend/README.md
## config
!autogpt_platform/frontend/*.config.*
!autogpt_platform/frontend/.env.*
!autogpt_platform/frontend/.env
# Classic - AutoGPT
!classic/original_autogpt/autogpt/
!classic/original_autogpt/pyproject.toml
!classic/original_autogpt/poetry.lock
!classic/original_autogpt/README.md
!classic/original_autogpt/tests/
# Classic - Benchmark
!classic/benchmark/agbenchmark/
!classic/benchmark/pyproject.toml
!classic/benchmark/poetry.lock
!classic/benchmark/README.md
# Classic - Forge
!classic/forge/
!classic/forge/pyproject.toml
!classic/forge/poetry.lock
!classic/forge/README.md
# Classic - Frontend
!classic/frontend/build/web/
# Explicitly re-ignore some folders
.*
**/__pycache__