GAIA Evaluation

This folder contains the evaluation harness for evaluating agents on the GAIA benchmark.

Configure OpenDevin and your LLM

Create a config.toml file at the root of the workspace if one does not already exist. Please check the top-level README.md for instructions on how to set this up.
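
For reference, below is a minimal sketch of what such an LLM config group might look like. The group name eval_gpt4_1106_preview and the exact keys shown are illustrative assumptions; defer to the top-level README.md for the authoritative schema.

[eval_gpt4_1106_preview]        # illustrative group name; this is what you pass as [model_config] below
model = "gpt-4-1106-preview"    # the LLM to evaluate
api_key = "sk-..."              # your provider API key (do not commit this)
temperature = 0.0               # a low temperature keeps eval runs more reproducible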

Run the evaluation

We use the GAIA dataset hosted on Hugging Face. Please accept the dataset's terms and make sure you have logged in on your machine via huggingface-cli login before running the evaluation.
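
If you have not logged in on this machine yet, the typical sequence is shown below (the pip install line is only needed if the huggingface-cli tool is not already available):

pip install -U huggingface_hub    # provides the huggingface-cli tool
huggingface-cli login             # paste an access token from https://huggingface.co/settings/tokens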

The following is the basic command to start the evaluation; it runs on the validation set of the selected GAIA subset. You can adjust ./evaluation/gaia/scripts/run_infer.sh, or pass the gaia_subset argument described below, to change which subset you evaluate on.

./evaluation/gaia/scripts/run_infer.sh [model_config] [agent] [eval_limit] [gaia_subset]
# e.g., ./evaluation/gaia/scripts/run_infer.sh eval_gpt4_1106_preview CodeActAgent 300

where model_config is mandatory, while agent, eval_limit and gaia_subset are optional.

model_config, e.g. eval_gpt4_1106_preview, is the config group name for your LLM settings, as defined in your config.toml.

agent, e.g. CodeActAgent, is the name of the agent for benchmarks, defaulting to CodeActAgent.

eval_limit, e.g. 10, limits the evaluation to the first eval_limit instances. By default, the script evaluates the entire selected GAIA subset. Note: in order to use eval_limit, you must also set agent.

gaia_subset, e.g. 2023_level1, selects the GAIA subset to evaluate: 2023_level1, 2023_level2, 2023_level3, or 2023_all. If not provided, it defaults to 2023_level1.

For example, if you'd like to run 10 instances using eval_gpt4_1106_preview and CodeActAgent, your command would be:

./evaluation/gaia/scripts/run_infer.sh eval_gpt4_1106_preview CodeActAgent 10
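
To also choose a different subset, append it as the fourth positional argument, following the argument order shown above. For example, to evaluate 10 instances of 2023_level2:

./evaluation/gaia/scripts/run_infer.sh eval_gpt4_1106_preview CodeActAgent 10 2023_level2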

Get score

Then you can get stats by running the following command:

python ./evaluation/gaia/get_score.py \
--file <path_to/output.json>