- Add model comparison support to test harness (claude, openai, gpt5, opus presets)
- Add --models, --smart-llm, --fast-llm, --list-models CLI args
- Add real-time logging with timestamps and progress indicators
- Fix success parsing bug: read results[0].success instead of non-existent metrics.success (see the sketch after this list)
- Fix agbenchmark TestResult validation: use exception typename when value is empty
- Fix WebArena challenge validation: use strings instead of integers in instantiation_dict
- Fix Agent type annotations: create AnyActionProposal union for all prompt strategies
- Add pytest integration tests for the strategy benchmark harness

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
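To make the success-parsing fix above concrete, here is a minimal sketch that reads results[0].success instead of a non-existent metrics.success field. The report shape and the challenge_succeeded helper are illustrative assumptions, not the harness's actual code.

```python
# Hypothetical sketch of the success-parsing fix listed above: read success
# from results[0].success rather than a non-existent metrics.success field.
# The report shape and helper name are assumptions, not the harness's API.
from typing import Any


def challenge_succeeded(challenge_report: dict[str, Any]) -> bool:
    """Return True only when the first result entry reports success."""
    results = challenge_report.get("results") or []
    if not results:
        # No recorded results counts as a failure.
        return False
    return bool(results[0].get("success", False))


# Example with a made-up report entry:
report = {"results": [{"success": True, "cost": 0.12}]}
assert challenge_succeeded(report) is True
```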
Auto-GPT Benchmarks
Built to benchmark agent performance regardless of how the agent is implemented.
Objectively measure how well your agent performs in categories like code, retrieval, memory, and safety.
Smart dependencies between challenges save time and money by skipping challenges whose prerequisites have already failed, and the whole run is automated.
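As a rough illustration of how dependency-aware skipping saves time and money, the sketch below skips any challenge whose prerequisites did not pass. The challenge names, data shapes, and skip logic are assumptions for illustration, not the benchmark's real implementation.

```python
# Illustrative only: a tiny dependency-aware runner. Challenges whose
# prerequisites did not pass are skipped instead of executed, avoiding
# wasted agent runs. Names and structures are assumptions.

# Challenges listed in dependency order; "run" stands in for invoking the agent.
challenges = {
    "WriteFile": {"dependencies": [], "run": lambda: True},
    "ReadFile": {"dependencies": ["WriteFile"], "run": lambda: True},
    "Retrieval": {"dependencies": ["ReadFile"], "run": lambda: False},
    "Summarize": {"dependencies": ["Retrieval"], "run": lambda: True},
}

outcomes: dict[str, str] = {}  # name -> "passed" | "failed" | "skipped"

for name, spec in challenges.items():
    if any(outcomes.get(dep) != "passed" for dep in spec["dependencies"]):
        # A prerequisite failed or was skipped, so don't spend tokens here.
        outcomes[name] = "skipped"
        continue
    outcomes[name] = "passed" if spec["run"]() else "failed"

print(outcomes)
# {'WriteFile': 'passed', 'ReadFile': 'passed', 'Retrieval': 'failed', 'Summarize': 'skipped'}
```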
Scores:
Overall ranking:
Detailed results:
Click here to see the results and the raw data!
More agents coming soon!