Jiayi Pan
917d96e06f
Fix doc error in evals (#2654)
2024-06-27 16:13:47 +00:00
Graham Neubig
cab7a288ca
Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings (#2597)
...
* Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings
* Update evaluation/webarena/scripts/run_infer.sh
---------
Co-authored-by: OpenDevin <opendevin@all-hands.dev>
2024-06-23 03:43:43 +00:00
Boxuan Li
feabc97aba
Evaluation time travel: build sandbox on the fly (#2491)
2024-06-20 20:22:02 -06:00
Boxuan Li
6f235937cf
Evaluation time travel: allow evaluation on a specific version (#2356)
...
* Time travel for evaluation
* Fix source script path
* Exit script if given version doesn't exist
* Exit on failure
* Update README
* Change scripts of all other benchmarks
* Modify README files
* Fix logic_reasoning README
2024-06-16 10:25:14 -04:00
finaltrip
05b84df9cb
chore: fix some comments (#2234)
...
Signed-off-by: finaltrip <finaltrip@qq.com>
2024-06-03 16:04:34 +00:00
Ryan H. Tran
22e8fb39b1
add cost metrics to evaluation outputs for all benchmarks (#2199)
2024-06-02 08:28:00 +00:00
Binyuan Hui
46dcf4bb3e
Support BIRD benchmark (#2117)
...
* update: change timeout from 10 to 30
* update: readme for bird evaluation
* Update evaluation/bird/run_infer.py
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
* Update evaluation/bird/README.md
Co-authored-by: Shimada666 <649940882@qq.com>
* Update evaluation/bird/README.md
Co-authored-by: Shimada666 <649940882@qq.com>
* Update evaluation/bird/run_infer.py
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Shimada666 <649940882@qq.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
2024-06-01 11:34:36 +00:00