OpenHands

mirror of https://github.com/All-Hands-AI/OpenHands.git synced 2026-04-29 03:00:45 -04:00

Author	SHA1	Message	Date
Jiayi Pan	917d96e06f	Fix doc error in evals (#2654)	2024-06-27 16:13:47 +00:00
Graham Neubig	cab7a288ca	Add NUM_WORKERS variable to run_infer.sh scripts for configurable woker settings (#2597) * Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings * Update evaluation/webarena/scripts/run_infer.sh --------- Co-authored-by: OpenDevin <opendevin@all-hands.dev>	2024-06-23 03:43:43 +00:00
Boxuan Li	feabc97aba	Evaluation time travel: build sandbox on the fly (#2491)	2024-06-20 20:22:02 -06:00
Boxuan Li	6f235937cf	Evaluation time travel: allow evaluation on a specific version (#2356) * Time travel for evaluation * Fix source script path * Exit script if given version doesn't exist * Exit on failure * Update README * Change scripts of all other benchmarks * Modify README files * Fix logic_reasoning README	2024-06-16 10:25:14 -04:00
Ryan H. Tran	0584e428b2	[Mint evaluation] Fix bug in stopping when the agent reaches max steps or solution proposals (#2268) * fix: bug in stopping when the agent reaches max steps or solution proposals * remove --eval-num-workers * update env.py	2024-06-05 06:47:07 +00:00
finaltrip	05b84df9cb	chore: fix some comments (#2234) Signed-off-by: finaltrip <finaltrip@qq.com>	2024-06-03 16:04:34 +00:00
Ryan H. Tran	22e8fb39b1	add cost metrics to evaluation outputs for all benchmarks (#2199)	2024-06-02 08:28:00 +00:00
RainRat	ed6dcc8381	fix typos (#2187) * fix typos no functional change * fix typos	2024-06-01 20:40:30 +00:00
Ryan H. Tran	01296ff79d	Add remaining subsets for MINT benchmark (#2142) * add MMLU subset * add theoremqa subset * remove redundant packages from requirements.txt, adjust prompts, handle gpt3.5 propose a wrong answer after a correct answer * add MBPP subset * add humaneval subset * update README * exit actively after the agent finishes the task	2024-05-31 20:04:13 +00:00
Ryan H. Tran	9434bcce48	Support MINT benchmark (MATH, GSM8K subset) (#1955) * setup boilerplate and README * setup test script and load dataset * add temp intg that works * refactor code * add solution evaluation through 'fake_user_response_fn' * finish integrating MATH subset * Update evaluation/mint/run_infer.py * Update evaluation/mint/run_infer.sh * Update opendevin/core/main.py * remove redudant templates, add eval_note, update README * use <execute_ipython> tag instead of <execute> * hardcode AGENT option for run_infer.sh * Update evaluation/mint/task.py Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> * fix: bug no message returned when task's success * change message to make the agent exit * import bash abstractmethod * install all required packages inside sandbox before the agent runs, adjust prompt * add subset eval folder separation and test for gsm8k * fix bug in Reasoning task result check, add requirements.txt * Fix syntax error in evaluation/mint/run_infer.py * update README, add default values for `SUBSET` and `EVAL_LIMIT` --------- Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com> Co-authored-by: yufansong <yufan@risingwave-labs.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>	2024-05-28 07:42:52 +00:00

10 Commits