OpenHands

mirror of https://github.com/All-Hands-AI/OpenHands.git synced 2026-04-29 03:00:45 -04:00

Author	SHA1	Message	Date
Graham Neubig	a081935fd8	Simplify eval code (#2775) * Start simplifying eval code * Update * Add EDA * Updated GAIA * Update gpqa * Add humanevalfix * Fix logic_reasoning * Add miniwob * Add mint and ml_bench * toolqa * Added swe-bench * Fixed webarena * Refactor parameters	2024-07-05 19:33:08 +09:00
மனோஜ்குமார் பழனிச்சாமி	143f38d25a	Refactored sandbox config and added fast boot (#2455) * Refactored sandbox config and added fastboot * added tests * fixed tests * fixed tests * intimate user about breaking change * remove default config from eval * check for lowercase env * add test * Revert Migration * migrate old sandbox configs * resolve merge conflict * revert migration 2 * Revert "remove default config from eval" This reverts commit `de57c588db`. * change type to box_type * fix var name * linted * lint * lint comments * fix tests * fix tests * fix typo * fix box_type, remove fast_boot * add tests for sandbox config * fix test * update eval docs * small removal comments * adapt toml template * old fields shouldn't be in the app dataclass * fix old keys in app config * clean up exec box --------- Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2024-07-05 03:30:21 +00:00
Graham Neubig	ffd3c7144c	Remove global args (#2760) * Remove global args * Remove global args * Update files * Update main * Bug fixes * Fix logging	2024-07-03 20:07:52 +09:00
Engel Nyst	2d9bb56763	Add ability to restore the cli session (optional) (#2699) * add ability to restore the main session * add quick log * rename to cli session	2024-06-30 06:56:55 +00:00
Engel Nyst	874b4c9075	CLI concurrency (#2695) * add session id in cli, evals * fix main sid	2024-06-30 04:04:30 +02:00
Graham Neubig	cab7a288ca	Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings (#2597) * Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings * Update evaluation/webarena/scripts/run_infer.sh --------- Co-authored-by: OpenDevin <opendevin@all-hands.dev>	2024-06-23 03:43:43 +00:00
மனோஜ்குமார் பழனிச்சாமி	41564c2eac	Use :main instead of :latest (#2539) Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>	2024-06-21 03:57:50 +00:00
Boxuan Li	feabc97aba	Evaluation time travel: build sandbox on the fly (#2491)	2024-06-20 20:22:02 -06:00
Boxuan Li	6f235937cf	Evaluation time travel: allow evaluation on a specific version (#2356) * Time travel for evaluation * Fix source script path * Exit script if given version doesn't exist * Exit on failure * Update README * Change scripts of all other benchmarks * Modify README files * Fix logic_reasoning README	2024-06-16 10:25:14 -04:00
RainRat	745ae42a72	fix typos (#2352)	2024-06-09 12:57:58 -07:00
Boxuan Li	208b1461ca	[AgentBench evaluation] set run_as_devin to true (#2269) Co-authored-by: Leo <ifuryst@gmail.com>	2024-06-05 07:53:33 +00:00
Leo	040d6bd806	fix: add an early exit check for agent answers in agent bench. (#2257) Signed-off-by: ifuryst <ifuryst@gmail.com>	2024-06-04 18:45:07 -07:00
Ryan H. Tran	22e8fb39b1	add cost metrics to evaluation outputs for all benchmarks (#2199)	2024-06-02 08:28:00 +00:00
Leo	be251b11de	Add AgentBench. (#2012) * Add AgentBench. * Load the datasets from HF. Signed-off-by: ifuryst <ifuryst@gmail.com> * Add helper functions. * Add mock executor. Signed-off-by: ifuryst <ifuryst@gmail.com> * Add retriv agent answer cmd. * Adjust the dataset. * Refine test results. Signed-off-by: ifuryst <ifuryst@gmail.com> * Consolidate all AgentBench datasets and scripts into a single CSV dataset. * Refactor dataset source. * Update helper functions. Signed-off-by: ifuryst <ifuryst@gmail.com> * Fix the CRLF problem. Signed-off-by: ifuryst <ifuryst@gmail.com> * Separate the instance's workspace. Signed-off-by: ifuryst <ifuryst@gmail.com> * Add cleanup logic and error handling for sandbox closure. * Normalized dataset Signed-off-by: ifuryst <ifuryst@gmail.com> * Update README. Signed-off-by: ifuryst <ifuryst@gmail.com> * Update the prompt to capture the answer. Signed-off-by: ifuryst <ifuryst@gmail.com> * Refactor script execution paths to use absolute container workspace path. Signed-off-by: ifuryst <ifuryst@gmail.com> * Update AgentBench README. Signed-off-by: ifuryst <ifuryst@gmail.com> * Delete useless functions. Signed-off-by: ifuryst <ifuryst@gmail.com> * Update evaluation/agent_bench/README.md * Add script to summarize test results from JSONL file in AgentBench Signed-off-by: ifuryst <ifuryst@gmail.com> * Delete useless script and codes. Signed-off-by: ifuryst <ifuryst@gmail.com> * Update evaluation/agent_bench/scripts/summarise_results.py --------- Signed-off-by: ifuryst <ifuryst@gmail.com> Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>	2024-06-01 07:58:14 +00:00

14 Commits