- Add 'playwright install chromium' step to Forge CI workflow
- Auto-detect default model from available API keys (ANTHROPIC_API_KEY,
OPENAI_API_KEY, GROQ_API_KEY) in direct_benchmark harness
- Prefer Claude > OpenAI > Groq, fallback to OpenAI if no keys found
- Remove old benchmark/ folder with agbenchmark framework
- Move challenges to direct_benchmark/challenges/
- Move analysis tools (analyze_reports.py, analyze_failures.py) to direct_benchmark/
- Move challenges_already_beaten.json to direct_benchmark/
- Update CI workflow to use direct_benchmark
- Update CLAUDE.md files with new benchmarking instructions
- Add benchmarking section to original_autogpt/CLAUDE.md
The direct_benchmark harness directly instantiates agents without HTTP
server overhead, enabling parallel execution with asyncio semaphore.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add model comparison support to test harness (claude, openai, gpt5, opus presets)
- Add --models, --smart-llm, --fast-llm, --list-models CLI args
- Add real-time logging with timestamps and progress indicators
- Fix success parsing bug: read results[0].success instead of non-existent metrics.success
- Fix agbenchmark TestResult validation: use exception typename when value is empty
- Fix WebArena challenge validation: use strings instead of integers in instantiation_dict
- Fix Agent type annotations: create AnyActionProposal union for all prompt strategies
- Add pytest integration tests for the strategy benchmark harness
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Extract _get_tool_error_message helper method
- Replace 20+ levels of nesting with simple for loop
- Improve readability of tool_result construction
- Update benchmark poetry.lock
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace basic DuckDuckGo-only search with a modern tiered system:
1. Tavily (primary) - AI-optimized results with content extraction
- AI-generated answer summaries
- Relevance scoring
- Full page content extraction via search_and_extract command
2. Serper (secondary) - Fast, cheap Google SERP results
- $0.30-1.00 per 1K queries
- Real Google results without scraping
3. DDGS multi-engine (fallback) - Free, no API key required
- Automatic fallback chain: DuckDuckGo → Bing → Brave → Google → etc.
- 8 search backends supported
Key changes:
- Upgrade duckduckgo-search to ddgs v9.10 (renamed successor package)
- Add Tavily and Serper API integrations
- Implement automatic provider selection and fallback chain
- Add search_and_extract command for research with content extraction
- Add TAVILY_API_KEY and SERPER_API_KEY to env templates
- Update benchmark httpx constraint for ddgs compatibility
- 23 comprehensive tests for all providers and fallback scenarios
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove openai_functions config option - native tool calling is now always enabled
- Remove use_functions_api from BaseAgentConfiguration and prompt strategy
- Add use_prefill config to disable prefill for Anthropic (prefill + tools incompatible)
- Update anthropic dependency to ^0.45.0 for tools API support
- Simplify prompt strategy to always expect tool_calls from LLM response
This fixes the N/A command loop bug where models would output "N/A" as a
command name when function calling was disabled. With native tool calling
always enabled, models are forced to pick from valid tools only.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This small PR resolves the deprecation warnings of the `logger` library:
```
DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
```
Make all changes necessary to make everything work with Poetry v2.0.0.
- Resolves#9196
## Changes
- Removed `--no-update` flag from `poetry lock` command in codebase
- Removed extra path arguments from `poetry -C [path] run [command]`
occurrences
- Regenerated all lock files in hierarchical order
- Added workaround for Poetry bug where `packages.[i].format` is now
suddenly required
Additionally:
- Fixed up .dockerignore
- Fixes .venv being erroneously copied over from local
- Fixes build context bloat (300MB -> 2.5MB)
- Fixed warnings about entrypoint script not being installed in docker
builds
### Relevant (breaking) changes in v2.0.0
- `--no-update` flag no longer exists for `poetry lock` as it has become
default behavior
- The `-C` option now actually changes the directory, so any path
arguments in `poetry run` commands can/must be removed
- Poetry v2.0.0 uses the new v2.1 lock file spec, so all lock files have
to be regenerated to avoid false-positive lock file updates and checks
on future PRs
- **BUG:** when specifying `poetry.tool.packages`, `format` is required
now
- python-poetry/poetry#9961
Full Poetry v2.0.0 release notes and change log:
https://python-poetry.org/blog/announcing-poetry-2.0.0
Restructuring the Repo to make it clear the difference between classic autogpt and the autogpt platform:
* Move the "classic" projects `autogpt`, `forge`, `frontend`, and `benchmark` into a `classic` folder
* Also rename `autogpt` to `original_autogpt` for absolute clarity
* Rename `rnd/` to `autogpt_platform/`
* `rnd/autogpt_builder` -> `autogpt_platform/frontend`
* `rnd/autogpt_server` -> `autogpt_platform/backend`
* Adjust any paths accordingly