- Add step callback to AgentRunner for real-time step logging
- BenchmarkUI now shows:
- Active runs with current step info
- Recent steps panel with colored config prefixes
- Proper Live display refresh (implements __rich_console__)
- Each config gets a distinct color for easy identification
- Verbose mode prints step logs immediately with config prefix
- Fix Live display not updating (pass UI object, not rendered content)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove old benchmark/ folder with agbenchmark framework
- Move challenges to direct_benchmark/challenges/
- Move analysis tools (analyze_reports.py, analyze_failures.py) to direct_benchmark/
- Move challenges_already_beaten.json to direct_benchmark/
- Update CI workflow to use direct_benchmark
- Update CLAUDE.md files with new benchmarking instructions
- Add benchmarking section to original_autogpt/CLAUDE.md
The direct_benchmark harness directly instantiates agents without HTTP
server overhead, enabling parallel execution with asyncio semaphore.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>