6 Commits

Author SHA1 Message Date
Xingyao Wang
50c13aad98 [Eval] Improve SWE-Bench Eval harness: multi-run support & entry script simplification (#4396) 2024-10-15 21:34:52 +08:00
Xingyao Wang
6b16a5da0b [Eval,Arch] Update GPQA eval and add headless_mode for Controller (#2994)
* update and polish gpqa eval

* fix typo

* Update evaluation/gpqa/README.md

Co-authored-by: Graham Neubig <neubig@gmail.com>

* Update evaluation/gpqa/run_infer.py

Co-authored-by: Graham Neubig <neubig@gmail.com>

* add headless mode to all appropriate agent controller call

* delegate set to error when in headless mode

* try to deduplicate a bit

* make headless_mode default to True and only change it to false for AgentSession

---------

Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-07-20 03:35:48 +00:00
Graham Neubig
cab7a288ca Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings (#2597)
* Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings

* Update evaluation/webarena/scripts/run_infer.sh

---------

Co-authored-by: OpenDevin <opendevin@all-hands.dev>
2024-06-23 03:43:43 +00:00
Boxuan Li
feabc97aba Evaluation time travel: build sandbox on the fly (#2491) 2024-06-20 20:22:02 -06:00
Boxuan Li
6f235937cf Evaluation time travel: allow evaluation on a specific version (#2356)
* Time travel for evaluation

* Fix source script path

* Exit script if given version doesn't exist

* Exit on failure

* Update README

* Change scripts of all other benchmarks

* Modify README files

* Fix logic_reasoning README
2024-06-16 10:25:14 -04:00
Jaskirat Singh
e8307608c2 Support gpqa benchmark evaluation (#2080)
* feat: add gpqa benchmark evaluation

* add metrics

* reset configs in final block

* make lint

---------

Co-authored-by: yufansong <yufan@risingwave-labs.com>
2024-06-08 16:24:24 +00:00