Graham Neubig
a081935fd8
Simplify eval code (#2775)
* Start simplifying eval code
* Update
* Add EDA
* Updated GAIA
* Update gpqa
* Add humanevalfix
* Fix logic_reasoning
* Add miniwob
* Add mint and ml_bench
* toolqa
* Added swe-bench
* Fixed webarena
* Refactor parameters
2024-07-05 19:33:08 +09:00
மனோஜ்குமார் பழனிச்சாமி
143f38d25a
Refactored sandbox config and added fast boot (#2455)
* Refactored sandbox config and added fastboot
* added tests
* fixed tests
* fixed tests
* notify user about breaking change
* remove default config from eval
* check for lowercase env
* add test
* Revert Migration
* migrate old sandbox configs
* resolve merge conflict
* revert migration 2
* Revert "remove default config from eval"
This reverts commit de57c588db.
* change type to box_type
* fix var name
* linted
* lint
* lint comments
* fix tests
* fix tests
* fix typo
* fix box_type, remove fast_boot
* add tests for sandbox config
* fix test
* update eval docs
* small removal comments
* adapt toml template
* old fields shouldn't be in the app dataclass
* fix old keys in app config
* clean up exec box
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-07-05 03:30:21 +00:00
Graham Neubig
ffd3c7144c
Remove global args (#2760)
* Remove global args
* Remove global args
* Update files
* Update main
* Bug fixes
* Fix logging
2024-07-03 20:07:52 +09:00
Engel Nyst
2d9bb56763
Add ability to restore the cli session (optional) (#2699)
* add ability to restore the main session
* add quick log
* rename to cli session
2024-06-30 06:56:55 +00:00
Engel Nyst
874b4c9075
CLI concurrency (#2695)
* add session id in cli, evals
* fix main sid
2024-06-30 04:04:30 +02:00
Graham Neubig
cab7a288ca
Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings (#2597)
* Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings
* Update evaluation/webarena/scripts/run_infer.sh
---------
Co-authored-by: OpenDevin <opendevin@all-hands.dev>
2024-06-23 03:43:43 +00:00
மனோஜ்குமார் பழனிச்சாமி
41564c2eac
Use :main instead of :latest (#2539)
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-06-21 03:57:50 +00:00
Boxuan Li
feabc97aba
Evaluation time travel: build sandbox on the fly (#2491)
2024-06-20 20:22:02 -06:00
Boxuan Li
6f235937cf
Evaluation time travel: allow evaluation on a specific version (#2356)
* Time travel for evaluation
* Fix source script path
* Exit script if given version doesn't exist
* Exit on failure
* Update README
* Change scripts of all other benchmarks
* Modify README files
* Fix logic_reasoning README
2024-06-16 10:25:14 -04:00
RainRat
745ae42a72
fix typos (#2352)
2024-06-09 12:57:58 -07:00
Boxuan Li
208b1461ca
[AgentBench evaluation] set run_as_devin to true (#2269)
Co-authored-by: Leo <ifuryst@gmail.com>
2024-06-05 07:53:33 +00:00
Leo
040d6bd806
fix: add an early exit check for agent answers in agent bench. (#2257)
Signed-off-by: ifuryst <ifuryst@gmail.com >
2024-06-04 18:45:07 -07:00
Ryan H. Tran
22e8fb39b1
add cost metrics to evaluation outputs for all benchmarks (#2199)
2024-06-02 08:28:00 +00:00
Leo
be251b11de
Add AgentBench. (#2012)
* Add AgentBench.
* Load the datasets from HF.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Add helper functions.
* Add mock executor.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Add retrieve agent answer cmd.
* Adjust the dataset.
* Refine test results.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Consolidate all AgentBench datasets and scripts into a single CSV dataset.
* Refactor dataset source.
* Update helper functions.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Fix the CRLF problem.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Separate the instance's workspace.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Add cleanup logic and error handling for sandbox closure.
* Normalized dataset
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Update README.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Update the prompt to capture the answer.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Refactor script execution paths to use absolute container workspace path.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Update AgentBench README.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Delete useless functions.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Update evaluation/agent_bench/README.md
* Add script to summarize test results from JSONL file in AgentBench
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Delete useless script and codes.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Update evaluation/agent_bench/scripts/summarise_results.py
---------
Signed-off-by: ifuryst <ifuryst@gmail.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-06-01 07:58:14 +00:00