mirror of https://github.com/Significant-Gravitas/AutoGPT.git synced 2026-04-08 03:00:28 -04:00

Files

Nicholas Tindle 49f56b4e8d feat(classic): enhance strategy benchmark harness with model comparison and bug fixes

- Add model comparison support to test harness (claude, openai, gpt5, opus presets)
- Add --models, --smart-llm, --fast-llm, --list-models CLI args
- Add real-time logging with timestamps and progress indicators
- Fix success parsing bug: read results[0].success instead of non-existent metrics.success
- Fix agbenchmark TestResult validation: use exception typename when value is empty
- Fix WebArena challenge validation: use strings instead of integers in instantiation_dict
- Fix Agent type annotations: create AnyActionProposal union for all prompt strategies
- Add pytest integration tests for the strategy benchmark harness

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-19 18:07:14 -06:00

| Name | Last commit | Date |
| --- | --- | --- |
| abilities | refactor: AutoGPT Platform Stealth Launch Repo Re-Org (#8113) | 2024-09-20 16:50:43 +02:00 |
| alignment | refactor: AutoGPT Platform Stealth Launch Repo Re-Org (#8113) | 2024-09-20 16:50:43 +02:00 |
| library | refactor: AutoGPT Platform Stealth Launch Repo Re-Org (#8113) | 2024-09-20 16:50:43 +02:00 |
| verticals | refactor: AutoGPT Platform Stealth Launch Repo Re-Org (#8113) | 2024-09-20 16:50:43 +02:00 |
| __init__.py | refactor: AutoGPT Platform Stealth Launch Repo Re-Org (#8113) | 2024-09-20 16:50:43 +02:00 |
| base.py | refactor: AutoGPT Platform Stealth Launch Repo Re-Org (#8113) | 2024-09-20 16:50:43 +02:00 |
| builtin.py | refactor: AutoGPT Platform Stealth Launch Repo Re-Org (#8113) | 2024-09-20 16:50:43 +02:00 |
| CHALLENGE.md | refactor: AutoGPT Platform Stealth Launch Repo Re-Org (#8113) | 2024-09-20 16:50:43 +02:00 |
| optional_categories.json | refactor: AutoGPT Platform Stealth Launch Repo Re-Org (#8113) | 2024-09-20 16:50:43 +02:00 |
| README.md | refactor: AutoGPT Platform Stealth Launch Repo Re-Org (#8113) | 2024-09-20 16:50:43 +02:00 |
| webarena_selection.json | feat(classic): enhance strategy benchmark harness with model comparison and bug fixes | 2026-01-19 18:07:14 -06:00 |
| webarena.py | fix: Resolve logger.warn(..) deprecration warnings (#9938) | 2025-05-16 10:56:03 +02:00 |

README.md

Auto-GPT-Benchmarks

The goal of this repo is to make it easy to create challenges for test-driven development with the Auto-GPT-Benchmarks package. It is essentially a library for crafting challenges in a DSL (JSON files, in this case).
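As a rough illustration of what a JSON-defined challenge could look like, the sketch below builds one in Python and serializes it. The helper `make_challenge` and every key it emits (`name`, `category`, `task`, `dependencies`, `cutoff`, `ground`) are assumptions made for this example, not a confirmed schema; CHALLENGE.md is the place to check the actual format.

```python
import json


def make_challenge(name, task, should_contain, category=("general",), cutoff=60):
    """Build a hypothetical challenge spec as a plain dict.

    All keys below are illustrative assumptions, not the confirmed schema.
    """
    if not name or not task:
        raise ValueError("name and task must be non-empty")
    return {
        "name": name,
        "category": list(category),
        "task": task,
        "dependencies": [],  # names of challenges that must pass first
        "cutoff": cutoff,  # time limit in seconds
        "ground": {"should_contain": list(should_contain)},
    }


challenge = make_challenge(
    name="WriteFile",
    task="Write the word 'Washington' to a .txt file",
    should_contain=["Washington"],
)

# Serialize to the JSON text that would be stored alongside the challenge.
challenge_json = json.dumps(challenge, indent=2)
print(challenge_json)
```

Keeping the spec as plain data means a challenge can be added or reviewed without touching any test code.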

The up-to-date dependency graph is available at: https://sapphire-denys-23.tiiny.site/

How to use

Make sure you have the package installed with pip install agbenchmark.

If you only want to use the default challenges, you do not need this repo. Install the package and the default challenges are available.

To add new challenges as you develop, add this repo as a submodule to your project/agbenchmark folder. Any new challenges you add within the submodule will get registered automatically.
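The README does not say how automatic registration works. One plausible mechanism is a recursive scan for challenge spec files, sketched below. The file name `data.json`, the `name` key, and the function `discover_challenges` are assumptions for illustration only, and the temporary directory stands in for a submodule checkout.

```python
import json
import tempfile
from pathlib import Path


def discover_challenges(root):
    """Return {challenge name: spec} for every data.json found under root.

    The file name 'data.json' and the 'name' key are assumptions for this sketch.
    """
    found = {}
    for spec_path in sorted(Path(root).rglob("data.json")):
        spec = json.loads(spec_path.read_text(encoding="utf-8"))
        found[spec["name"]] = spec
    return found


# Demo on a throwaway directory standing in for the submodule checkout.
with tempfile.TemporaryDirectory() as tmp:
    challenge_dir = Path(tmp) / "my_challenges" / "write_file"
    challenge_dir.mkdir(parents=True)
    (challenge_dir / "data.json").write_text(
        json.dumps({"name": "WriteFile", "task": "Write a file"}),
        encoding="utf-8",
    )
    registry = discover_challenges(tmp)

print(sorted(registry))
```

With a scan like this, dropping a new folder containing a spec file into the submodule is enough for the challenge to be picked up on the next run.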