From 54250e3fe24707ad294b3aec79f1342e36efd2a7 Mon Sep 17 00:00:00 2001
From: Graham Neubig
Date: Tue, 22 Oct 2024 10:42:22 -0400
Subject: [PATCH] Update evaluation README.md structure (#4516)

---
 evaluation/README.md | 85 +++++++++++++++++++++++++-------------------
 1 file changed, 48 insertions(+), 37 deletions(-)

diff --git a/evaluation/README.md b/evaluation/README.md
index 7a555e3ee6..7eb59c7b8d 100644
--- a/evaluation/README.md
+++ b/evaluation/README.md
@@ -2,19 +2,47 @@
 
 This folder contains code and resources to run experiments and evaluations.
 
-## Logistics
+## For Benchmark Users
 
-To better organize the evaluation folder, we should follow the rules below:
+### Setup
 
-- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain
-all the preprocessing/evaluation/analysis scripts.
-- Raw data and experimental records should not be stored within this repo.
-- For model outputs, they should be stored at [this huggingface space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization.
-- Important data files of manageable size and analysis scripts (e.g., jupyter notebooks) can be directly uploaded to this repo.
+Before starting evaluation, follow the instructions [here](https://github.com/All-Hands-AI/OpenHands/blob/main/Development.md) to set up your local development environment and LLM.
+
+Once you are done with setup, you can follow the benchmark-specific instructions in each subdirectory of the evaluation directory.
+Generally, these will involve running `run_infer.py` to perform inference with the agents.
+
+### Implementing and Evaluating an Agent
+
+To add an agent to OpenHands, you will need to implement it in the [agenthub directory](https://github.com/All-Hands-AI/OpenHands/tree/main/openhands/agenthub). There is a README there with more information.
+
+To evaluate an agent, you can provide the agent's name to the `run_infer.py` program.
+
+### Evaluating Different LLMs
+
+OpenHands in development mode uses `config.toml` to keep track of most configuration options.
+Here's an example configuration file you can use to define and use multiple LLMs:
+
+```toml
+[llm]
+# IMPORTANT: add your API key here, and set the model to the one you want to evaluate
+model = "gpt-4o-2024-05-13"
+api_key = "sk-XXX"
+
+[llm.eval_gpt4_1106_preview_llm]
+model = "gpt-4-1106-preview"
+api_key = "XXX"
+temperature = 0.0
+
+[llm.eval_some_openai_compatible_model_llm]
+model = "openai/MODEL_NAME"
+base_url = "https://OPENAI_COMPATIBLE_URL/v1"
+api_key = "XXX"
+temperature = 0.0
+```
 
 ## Supported Benchmarks
 
-To learn more about how to integrate your benchmark into OpenHands, check out [tutorial here](https://docs.all-hands.dev/modules/usage/how-to/evaluation-harness).
+The OpenHands evaluation harness supports a wide variety of benchmarks across software engineering, web browsing, and miscellaneous assistance tasks.
 
 ### Software Engineering
 
@@ -41,36 +69,19 @@
 - Entity deduction Arena (EDA): [`evaluation/EDA`](./EDA)
 - ProofWriter: [`evaluation/logic_reasoning`](./logic_reasoning)
 
-## Before everything begins: Setup Environment and LLM Configuration
-
-Please follow instruction [here](https://github.com/All-Hands-AI/OpenHands/blob/main/Development.md) to setup your local development environment and LLM.
-
-OpenHands in development mode uses `config.toml` to keep track of most configurations.
-
-Here's an example configuration file you can use to define and use multiple LLMs:
-
-```toml
-[llm]
-# IMPORTANT: add your API key here, and set the model to the one you want to evaluate
-model = "gpt-4o-2024-05-13"
-api_key = "sk-XXX"
-
-[llm.eval_gpt4_1106_preview_llm]
-model = "gpt-4-1106-preview"
-api_key = "XXX"
-temperature = 0.0
-
-[llm.eval_some_openai_compatible_model_llm]
-model = "openai/MODEL_NAME"
-base_url = "https://OPENAI_COMPATIBLE_URL/v1"
-api_key = "XXX"
-temperature = 0.0
-```
-
-### Result Visualization
+## Result Visualization
 
 Check [this huggingface space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization of existing experimental results.
 
-### Upload your results
-
 You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR of your evaluation results to our hosted huggingface repo via PR following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
+
+## For Benchmark Developers
+
+To learn more about how to integrate your benchmark into OpenHands, check out the [tutorial here](https://docs.all-hands.dev/modules/usage/how-to/evaluation-harness). Briefly:
+
+- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain
+all the preprocessing/evaluation/analysis scripts.
+- Raw data and experimental records should not be stored within this repo.
+- For model outputs, they should be stored at [this huggingface space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization.
+- Important data files of manageable size and analysis scripts (e.g., jupyter notebooks) can be directly uploaded to this repo.
+
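Each `[llm.<name>]` group defined in the example above can be selected by name at evaluation time, which is how a single `config.toml` drives runs against several models. Below is a minimal sketch of one more group following the same pattern; the group name, model string, and key are illustrative placeholders, not part of this patch:

```toml
# Hypothetical additional evaluation LLM, following the same naming convention
# as the groups above; it would be referenced by its group name
# (here "eval_my_claude_llm") when a benchmark's run_infer.py is launched.
[llm.eval_my_claude_llm]
model = "anthropic/claude-3-5-sonnet-20240620"  # any LiteLLM-style model identifier
api_key = "XXX"
temperature = 0.0
```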