mirror of https://github.com/All-Hands-AI/OpenHands.git synced 2026-04-29 03:00:45 -04:00

Files

tobitege 8fcf0817d4 (eval) Aider_bench: add eval_ids arg to run specific instance id's (#3592 )

* add eval_ids arg to run specific instance id's; fix/extend README

* fix description in parser for --eval-ids

* fix test_arg_parser.py to account for added arg

* fix typo in README to say "summarize" instead of "summarise" for script

2024-08-27 00:49:26 +08:00

scripts

(eval) Aider_bench: add eval_ids arg to run specific instance id's (#3592 )

2024-08-27 00:49:26 +08:00

create_dataset.py

[Feat] Aider Benchmark (#3507 )

2024-08-21 18:05:41 +00:00

helper.py

[Feat] Aider Benchmark (#3507 )

2024-08-21 18:05:41 +00:00

README.md

(eval) Aider_bench: add eval_ids arg to run specific instance id's (#3592 )

2024-08-27 00:49:26 +08:00

run_infer.py

(eval) Aider_bench: add eval_ids arg to run specific instance id's (#3592 )

2024-08-27 00:49:26 +08:00

README.md

AiderBench Evaluation

This folder contains evaluation harness for evaluating agents on the Aider Editing Benchmark. This will allow us to develop better editing approach without running the full SWE-bench. The benchmark uses the RajMaheshwari/Exercism-Python Hugging Face dataset based on the Exercism python coding exercises.

Setup Environment and LLM Configuration

Please follow instruction here to setup your local development environment and LLM.

Start the evaluation

./evaluation/aider_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]

model_config, e.g. eval_gpt4_1106_preview, is the config group name for your LLM settings, as defined in your config.toml.
git-version, e.g. HEAD, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like 0.9.0.
agent, e.g. CodeActAgent, is the name of the agent for benchmarks, defaulting to CodeActAgent.
eval_limit, e.g. 10, limits the evaluation to the first eval_limit instances. By default, the script evaluates the entire Exercism test set (133 issues). Note: in order to use eval_limit, you must also set agent.
eval-num-workers: the number of workers to use for evaluation. Default: 1.
eval_ids, e.g. "1,3,10", limits the evaluation to instances with the given IDs (comma separated).

Following is the basic command to start the evaluation.

You can update the arguments in the script evaluation/aider_bench/scripts/run_infer.sh, such as --max-iterations, --eval-num-workers and so on:

--agent-cls, the agent to use. For example, CodeActAgent.
--llm-config: the LLM configuration to use. For example, eval_gpt4_1106_preview.
--max-iterations: the max allowed number of iterations to run the evaluation. Default: 30.
--eval-num-workers: the number of workers to use for evaluation. Default: 1.
--eval-n-limit: the number of examples to evaluate. For example, 100.
--eval-ids: the IDs of the examples to evaluate (comma separated). For example, "1,3,10".

./evaluation/aider_bench/scripts/run_infer.sh eval_gpt35_turbo HEAD CodeActAgent 100 1 "1,3,10"

Summarize Results

poetry run python ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file]

Full example:

poetry run python ./evaluation/aider_bench/scripts/summarize_results.py evaluation/evaluation_outputs/outputs/AiderBench/CodeActAgent/claude-3-5-sonnet@20240620_maxiter_30_N_v1.9/output.jsonl

This will list the instances that passed and the instances that failed. For each instance, the corresponding set of test cases (which can vary for each instance) are run on the file edited by the agent. We consider an instance to be passed only if ALL test cases are passed. Sometimes even a single failed test case will cause the entire instance to be marked as filed.

You can inspect the test_results field in the output json file to know the exact outcome of the tests. If there are no syntax or indentation errors, you can expect to see something like "..F...EF..", where "." means the test case passed, "E" means there was an error while executing the test case and "F" means some assertion failed and returned output was not as expected.