Graham Neubig
cab7a288ca
Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings (#2597)
...
* Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings
* Update evaluation/webarena/scripts/run_infer.sh
---------
Co-authored-by: OpenDevin <opendevin@all-hands.dev>
2024-06-23 03:43:43 +00:00
Boxuan Li
feabc97aba
Evaluation time travel: build sandbox on the fly (#2491)
2024-06-20 20:22:02 -06:00
Boxuan Li
6f235937cf
Evaluation time travel: allow evaluation on a specific version (#2356)
...
* Time travel for evaluation
* Fix source script path
* Exit script if given version doesn't exist
* Exit on failure
* Update README
* Change scripts of all other benchmarks
* Modify README files
* Fix logic_reasoning README
2024-06-16 10:25:14 -04:00
Robert
7fc57650f3
BioCoder integration (#2076)
...
* prepare execution and inference
* Create README.md
* Update README.md
* Update evaluation/biocoder/README.md
* Update evaluation/swe_bench/swe_env_box.py
* switch to biocoder docker container and test-specific code
* code for copying and running test files into container
* add metrics
* add readme
* BioCoder evaluation code finished (rewrite testing infrastructure, prompt tuning, and bug fixes)
* Update README.md
---------
Co-authored-by: lilbillybiscuit <qianbill2014@outlook.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
2024-06-10 11:11:40 +08:00