- Fix line-too-long in test_permissions.py docstring
- Fix type annotation in validators.py (callable -> Callable)
- Add --fresh flag to benchmark tests to prevent state resumption
- Exclude direct_benchmark/adapters from pyright (optional deps)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Integrate standard AI agent benchmarks into the direct_benchmark infrastructure
using a plugin-based adapter pattern:
- Add BenchmarkAdapter base class with setup(), load_challenges(), and evaluate()
- Implement GAIAAdapter for the GAIA benchmark (requires HF token)
- Implement SWEBenchAdapter for SWE-bench (requires Docker)
- Implement AgentBenchAdapter for AgentBench multi-environment benchmark
- Extend HarnessConfig with benchmark options (--benchmark, --benchmark-split, etc.)
- Modify ParallelExecutor to use adapter's evaluate() for external benchmarks
- Fix runner to record finish step (was being skipped, breaking answer extraction)
- Add optional benchmarks dependency group with datasets and huggingface-hub
- Increase default benchmark timeout to 900s
Usage:
poetry run direct-benchmark run \
--benchmark agent-bench \
--benchmark-subset dbbench \
--strategies one_shot \
--models claude
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update black version to match pre-commit hook (24.10.0) and reformat
all files with the new version.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add missing strategies (lats, multi_agent_debate) to PromptStrategyName
- Fix method override signatures for reasoning_effort parameter
- Fix Pydantic Field() overload issues with helper function
- Fix BeautifulSoup Tag type narrowing in web_fetch.py
- Fix Optional member access in playwright_browser.py and rewoo.py
- Convert hasattr patterns to getattr for proper type narrowing
- Add proper type casts for Literal types
- Fix file storage path type conversions
- Exclude legacy challenges/ from pyright checking
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Merge forge/, original_autogpt/, and direct_benchmark/ into a single Poetry
project to eliminate cross-project path dependency issues.
Changes:
- Create classic/pyproject.toml with merged dependencies from all three projects
- Remove individual pyproject.toml and poetry.lock files from subdirectories
- Update all CLAUDE.md files to reflect commands run from classic/ root
- Update all README.md files with new installation and usage instructions
All packages are now included via the packages directive:
- forge/forge (core agent framework)
- original_autogpt/autogpt (AutoGPT agent)
- direct_benchmark/direct_benchmark (benchmark harness)
CLI entry points preserved: autogpt, serve, direct-benchmark
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>