5 Commits

Author SHA1 Message Date
Graham Neubig
cab7a288ca Add NUM_WORKERS variable to run_infer.sh scripts for configurable woker settings (#2597)
* Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings

* Update evaluation/webarena/scripts/run_infer.sh

---------

Co-authored-by: OpenDevin <opendevin@all-hands.dev>
2024-06-23 03:43:43 +00:00
Boxuan Li
feabc97aba Evaluation time travel: build sandbox on the fly (#2491) 2024-06-20 20:22:02 -06:00
Boxuan Li
6f235937cf Evaluation time travel: allow evaluation on a specific version (#2356)
* Time travel for evaluation

* Fix source script path

* Exit script if given version doesn't exist

* Exit on failure

* Update README

* Change scripts of all other benchmarks

* Modify README files

* Fix logic_reasoning README
2024-06-16 10:25:14 -04:00
Ryan H. Tran
0584e428b2 [Mint evaluation] Fix bug in stopping when the agent reaches max steps or solution proposals (#2268)
* fix: bug in stopping when the agent reaches max steps or solution proposals

* remove --eval-num-workers

* update env.py
2024-06-05 06:47:07 +00:00
Ryan H. Tran
9434bcce48 Support MINT benchmark (MATH, GSM8K subset) (#1955)
* setup boilerplate and README

* setup test script and load dataset

* add temp intg that works

* refactor code

* add solution evaluation through 'fake_user_response_fn'

* finish integrating MATH subset

* Update evaluation/mint/run_infer.py

* Update evaluation/mint/run_infer.sh

* Update opendevin/core/main.py

* remove redudant templates, add eval_note, update README

* use <execute_ipython> tag instead of <execute>

* hardcode AGENT option for run_infer.sh

* Update evaluation/mint/task.py

Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>

* fix: bug no message returned when task's success

* change message to make the agent exit

* import bash abstractmethod

* install all required packages inside sandbox before the agent runs, adjust prompt

* add subset eval folder separation and test for gsm8k

* fix bug in Reasoning task result check, add requirements.txt

* Fix syntax error in evaluation/mint/run_infer.py

* update README, add default values for `SUBSET` and `EVAL_LIMIT`

---------

Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-05-28 07:42:52 +00:00