Files
AutoGPT/autogpt_platform/backend/backend/blocks/exa/contents.py
Nicholas Tindle 37b3e4e82e feat(blocks)!: Update Exa search block to match latest API specification (#11185)
BREAKING CHANGE: Removed deprecated use_auto_prompt field from Input
schema. Existing workflows using this field will need to be updated to
use the type field set to "auto" instead.

## Summary of Changes 📝

This PR comprehensively updates all Exa search blocks to match the
latest Exa API specification and adds significant new functionality
through the Websets API integration.

### Core API Updates 🔄

- **Migration to Exa SDK**: Replaced manual API calls with the official
`exa_py` AsyncExa SDK across all blocks for better reliability and
maintainability
- **Removed deprecated fields**: Eliminated
`use_auto_prompt`/`useAutoprompt` field (breaking change)
- **Fixed incomplete field definitions**: Corrected `user_location`
field definition
- **Added new input fields**: Added `moderation` and `context` fields
for enhanced content filtering

### Enhanced Content Settings 🛠️

- **Text field improvements**: Support both boolean and advanced object
configurations
- **New content options**: 
  - Added `livecrawl` settings (never, fallback, always, preferred)
  - Added `subpages` support for deeper content retrieval
  - Added `extras` settings for links and images
  - Added `context` settings for additional contextual information
- **Updated settings**: Enhanced `highlight` and `summary`
configurations with new query and schema options

### Comprehensive Cost Tracking 💰

- Added detailed cost tracking models:
  - `CostDollars` for monetary costs
  - `CostCredits` for API credit tracking
  - `CostDuration` for time-based costs
- New output fields: `request_id`, `resolved_search_type`,
`cost_dollars`
- Improved response handling to conditionally yield fields based on
availability

### New Websets API Integration 🚀

Added eight new specialized blocks for Exa's Websets API:
- **`websets.py`**: Core webset management (create, get, list, delete)
- **`websets_search.py`**: Search operations within websets
- **`websets_items.py`**: Individual item management (add, get, update,
delete)
- **`websets_enrichment.py`**: Data enrichment operations
- **`websets_import_export.py`**: Bulk import/export functionality
- **`websets_monitor.py`**: Monitor and track webset changes
- **`websets_polling.py`**: Poll for updates and changes

### New Special-Purpose Blocks 🎯

- **`code_context.py`**: Code search capabilities for finding relevant
code snippets from open source repositories, documentation, and Stack
Overflow
- **`research.py`**: Asynchronous research capabilities that explore the
web, gather sources, synthesize findings, and return structured results
with citations

### Code Organization Improvements 📁

- **Removed legacy code**: Deleted `model.py` file containing deprecated
API models
- **Centralized helpers**: Consolidated shared models and utilities in
`helpers.py`
- **Improved modularity**: Each webset operation is now in its own
dedicated file

### Other Changes 🔧

- Updated `.gitignore` for better development workflow
- Updated `CLAUDE.md` with project-specific instructions
- Updated documentation in `docs/content/platform/new_blocks.md` with
error handling, data models, and file input guidelines
- Improved webhook block implementations with SDK integration

### Files Changed 📂

- **Modified (11 files)**:
  - `.gitignore`
  - `autogpt_platform/CLAUDE.md`
  - `autogpt_platform/backend/backend/blocks/exa/answers.py`
  - `autogpt_platform/backend/backend/blocks/exa/contents.py`
  - `autogpt_platform/backend/backend/blocks/exa/helpers.py`
  - `autogpt_platform/backend/backend/blocks/exa/search.py`
  - `autogpt_platform/backend/backend/blocks/exa/similar.py`
  - `autogpt_platform/backend/backend/blocks/exa/webhook_blocks.py`
  - `autogpt_platform/backend/backend/blocks/exa/websets.py`
  - `docs/content/platform/new_blocks.md`

- **Added (8 files)**:
  - `autogpt_platform/backend/backend/blocks/exa/code_context.py`
  - `autogpt_platform/backend/backend/blocks/exa/research.py`
  - `autogpt_platform/backend/backend/blocks/exa/websets_enrichment.py`
- `autogpt_platform/backend/backend/blocks/exa/websets_import_export.py`
  - `autogpt_platform/backend/backend/blocks/exa/websets_items.py`
  - `autogpt_platform/backend/backend/blocks/exa/websets_monitor.py`
  - `autogpt_platform/backend/backend/blocks/exa/websets_polling.py`
  - `autogpt_platform/backend/backend/blocks/exa/websets_search.py`

- **Deleted (1 file)**:
  - `autogpt_platform/backend/backend/blocks/exa/model.py`

### Migration Guide 🚦

For users with existing workflows using the deprecated `use_auto_prompt`
field:
1. Remove the `use_auto_prompt` field from your input configuration
2. Set the `type` field to `ExaSearchTypes.AUTO` (or "auto" in JSON) to
achieve the same behavior
3. Review any custom content settings as the structure has been enhanced

### Testing Recommendations 

- Test existing workflows to ensure they handle the breaking change
- Verify cost tracking fields are properly returned
- Test new content settings options (livecrawl, subpages, extras,
context)
- Validate websets functionality if using the new Websets API blocks

🤖 Generated with [Claude Code](https://claude.com/claude-code)

### Checklist 📋

#### For code changes:
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] made + ran a test agent for the blocks and flows between them
[Exa
Tests_v44.json](https://github.com/user-attachments/files/23226143/Exa.Tests_v44.json)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Migrates Exa blocks to AsyncExa SDK, adds comprehensive
Websets/research/code-context blocks, updates existing
search/content/answers/similar, deletes legacy models, adjusts
tests/docs; breaking: remove `use_auto_prompt` in favor of
`type="auto"`.
> 
> - **Backend — Exa integration (SDK migration & BREAKING)**:
> - Replace manual HTTP calls with `exa_py.AsyncExa` across `search`,
`similar`, `contents`, `answers`, and webhooks; richer outputs
(citations, context, costs, resolved search type).
>   - BREAKING: remove `Input.use_auto_prompt`; use `type = "auto"`.
> - Centralize models/utilities in `exa/helpers.py` (content settings,
cost models, result mappers).
> - **New Blocks**:
> - **Websets**: management (`websets.py`), searches, items,
enrichments, imports/exports, monitors, polling (new files under
`exa/websets_*`).
> - **Research**: async research task create/get/wait/list
(`exa/research.py`).
> - **Code Context**: code snippet/context retrieval
(`exa/code_context.py`).
> - **Removals**:
>   - Delete deprecated `exa/model.py`.
> - **Docs & DX**:
> - Update `docs/new_blocks.md` (error handling, models, file input) and
`CLAUDE.md`; ignore backend logs in `.gitignore`.
> - **Frontend Tests**:
> - Split/extend “e” block tests and improve block add robustness in
Playwright (`build.spec.ts`, `build.page.ts`).
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
6e5e572322. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added multiple Exa research and webset management blocks for task
creation, monitoring, and completion tracking.
* Introduced new search capabilities including code context retrieval,
content search, and enhanced filtering options.
* Added webset enrichment, import/export, and item management
functionality.
  * Expanded search with location-based and category filters.

* **Documentation**
* Updated guidance on error handling, data models, and file input
handling.

* **Refactor**
* Modernized backend API integration with improved response structure
and error reporting.
  * Simplified configuration options for search operations.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-11-05 19:52:48 +00:00

226 lines
7.8 KiB
Python

from enum import Enum
from typing import Optional
from exa_py import AsyncExa
from pydantic import BaseModel
from backend.sdk import (
APIKeyCredentials,
Block,
BlockCategory,
BlockOutput,
BlockSchemaInput,
BlockSchemaOutput,
CredentialsMetaInput,
SchemaField,
)
from ._config import exa
from .helpers import (
CostDollars,
ExaSearchResults,
ExtrasSettings,
HighlightSettings,
LivecrawlTypes,
SummarySettings,
)
class ContentStatusTag(str, Enum):
CRAWL_NOT_FOUND = "CRAWL_NOT_FOUND"
CRAWL_TIMEOUT = "CRAWL_TIMEOUT"
CRAWL_LIVECRAWL_TIMEOUT = "CRAWL_LIVECRAWL_TIMEOUT"
SOURCE_NOT_AVAILABLE = "SOURCE_NOT_AVAILABLE"
CRAWL_UNKNOWN_ERROR = "CRAWL_UNKNOWN_ERROR"
class ContentError(BaseModel):
tag: Optional[ContentStatusTag] = SchemaField(
default=None, description="Specific error type"
)
httpStatusCode: Optional[int] = SchemaField(
default=None, description="The corresponding HTTP status code"
)
class ContentStatus(BaseModel):
id: str = SchemaField(description="The URL that was requested")
status: str = SchemaField(
description="Status of the content fetch operation (success or error)"
)
error: Optional[ContentError] = SchemaField(
default=None, description="Error details, only present when status is 'error'"
)
class ExaContentsBlock(Block):
class Input(BlockSchemaInput):
credentials: CredentialsMetaInput = exa.credentials_field(
description="The Exa integration requires an API Key."
)
urls: list[str] = SchemaField(
description="Array of URLs to crawl (preferred over 'ids')",
default_factory=list,
advanced=False,
)
ids: list[str] = SchemaField(
description="[DEPRECATED - use 'urls' instead] Array of document IDs obtained from searches",
default_factory=list,
advanced=True,
)
text: bool = SchemaField(
description="Retrieve text content from pages",
default=True,
)
highlights: HighlightSettings = SchemaField(
description="Text snippets most relevant from each page",
default=HighlightSettings(),
)
summary: SummarySettings = SchemaField(
description="LLM-generated summary of the webpage",
default=SummarySettings(),
)
livecrawl: Optional[LivecrawlTypes] = SchemaField(
description="Livecrawling options: never, fallback (default), always, preferred",
default=LivecrawlTypes.FALLBACK,
advanced=True,
)
livecrawl_timeout: Optional[int] = SchemaField(
description="Timeout for livecrawling in milliseconds",
default=10000,
advanced=True,
)
subpages: Optional[int] = SchemaField(
description="Number of subpages to crawl", default=0, ge=0, advanced=True
)
subpage_target: Optional[str | list[str]] = SchemaField(
description="Keyword(s) to find specific subpages of search results",
default=None,
advanced=True,
)
extras: ExtrasSettings = SchemaField(
description="Extra parameters for additional content",
default=ExtrasSettings(),
advanced=True,
)
class Output(BlockSchemaOutput):
results: list[ExaSearchResults] = SchemaField(
description="List of document contents with metadata"
)
result: ExaSearchResults = SchemaField(
description="Single document content result"
)
context: str = SchemaField(
description="A formatted string of the results ready for LLMs"
)
request_id: str = SchemaField(description="Unique identifier for the request")
statuses: list[ContentStatus] = SchemaField(
description="Status information for each requested URL"
)
cost_dollars: Optional[CostDollars] = SchemaField(
description="Cost breakdown for the request"
)
error: str = SchemaField(description="Error message if the request failed")
def __init__(self):
super().__init__(
id="c52be83f-f8cd-4180-b243-af35f986b461",
description="Retrieves document contents using Exa's contents API",
categories={BlockCategory.SEARCH},
input_schema=ExaContentsBlock.Input,
output_schema=ExaContentsBlock.Output,
)
async def run(
self, input_data: Input, *, credentials: APIKeyCredentials, **kwargs
) -> BlockOutput:
if not input_data.urls and not input_data.ids:
raise ValueError("Either 'urls' or 'ids' must be provided")
sdk_kwargs = {}
# Prefer urls over ids
if input_data.urls:
sdk_kwargs["urls"] = input_data.urls
elif input_data.ids:
sdk_kwargs["ids"] = input_data.ids
if input_data.text:
sdk_kwargs["text"] = {"includeHtmlTags": True}
# Handle highlights - only include if modified from defaults
if input_data.highlights and (
input_data.highlights.num_sentences != 1
or input_data.highlights.highlights_per_url != 1
or input_data.highlights.query is not None
):
highlights_dict = {}
highlights_dict["numSentences"] = input_data.highlights.num_sentences
highlights_dict["highlightsPerUrl"] = (
input_data.highlights.highlights_per_url
)
if input_data.highlights.query:
highlights_dict["query"] = input_data.highlights.query
sdk_kwargs["highlights"] = highlights_dict
# Handle summary - only include if modified from defaults
if input_data.summary and (
input_data.summary.query is not None
or input_data.summary.schema is not None
):
summary_dict = {}
if input_data.summary.query:
summary_dict["query"] = input_data.summary.query
if input_data.summary.schema:
summary_dict["schema"] = input_data.summary.schema
sdk_kwargs["summary"] = summary_dict
if input_data.livecrawl:
sdk_kwargs["livecrawl"] = input_data.livecrawl.value
if input_data.livecrawl_timeout is not None:
sdk_kwargs["livecrawl_timeout"] = input_data.livecrawl_timeout
if input_data.subpages is not None:
sdk_kwargs["subpages"] = input_data.subpages
if input_data.subpage_target:
sdk_kwargs["subpage_target"] = input_data.subpage_target
# Handle extras - only include if modified from defaults
if input_data.extras and (
input_data.extras.links > 0 or input_data.extras.image_links > 0
):
extras_dict = {}
if input_data.extras.links:
extras_dict["links"] = input_data.extras.links
if input_data.extras.image_links:
extras_dict["image_links"] = input_data.extras.image_links
sdk_kwargs["extras"] = extras_dict
# Always enable context for LLM-ready output
sdk_kwargs["context"] = True
aexa = AsyncExa(api_key=credentials.api_key.get_secret_value())
response = await aexa.get_contents(**sdk_kwargs)
converted_results = [
ExaSearchResults.from_sdk(sdk_result)
for sdk_result in response.results or []
]
yield "results", converted_results
for result in converted_results:
yield "result", result
if response.context:
yield "context", response.context
if response.statuses:
yield "statuses", response.statuses
if response.cost_dollars:
yield "cost_dollars", response.cost_dollars