update documentation for evaluation tutorial

This commit is contained in:
Xingyao Wang
2024-08-06 14:55:42 -04:00
parent 9c44d94cef
commit 7270d21cf9
3 changed files with 290 additions and 199 deletions

View File

@@ -0,0 +1,259 @@
---
sidebar_position: 6
---
# How to contribute to OpenDevin Evaluation Harness
This guide provides an overview of how to integrate your own evaluation benchmark into the OpenDevin framework.
## Before everything begins: Setup Environment and LLM Configuration
Please follow the instructions [here](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up your local development environment and LLM.
OpenDevin in development mode uses `config.toml` to keep track of most configurations.
Here's an example configuration file you can use to define and use multiple LLMs:
```toml
[llm]
# IMPORTANT: add your API key here, and set the model to the one you want to evaluate
model = "gpt-4o-2024-05-13"
api_key = "sk-XXX"
[llm.eval_gpt4_1106_preview_llm]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model_llm]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
## How to use OpenDevin in the command line
OpenDevin can be run from the command line using the following format:
```bash
poetry run python ./opendevin/core/main.py \
-i <max_iterations> \
-t "<task_description>" \
-c <agent_class> \
-l <llm_config>
```
For example:
```bash
poetry run python ./opendevin/core/main.py \
-i 10 \
-t "Write me a bash script that prints hello world." \
-c CodeActAgent \
-l llm
```
This command runs OpenDevin with:
- A maximum of 10 iterations
- The specified task description
- The CodeActAgent
- The LLM configuration defined in the `llm` section of your `config.toml` file
## How does OpenDevin work
The main entry point for OpenDevin is in `opendevin/core/main.py`. Here's a simplified flow of how it works:
1. Parse command-line arguments and load the configuration.
2. Create a runtime environment using `create_runtime()`.
3. Initialize the specified agent.
4. Run the controller using `run_controller()`, which:
- Attaches the runtime to the agent
- Executes the agent's task
- Returns a final state when complete
The `run_controller()` function is the core of OpenDevin's execution. It manages the interaction between the agent, the runtime, and the task, handling things like user input simulation and event processing.
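To make this flow concrete, here is a minimal sketch of driving OpenDevin programmatically with `create_runtime()` and `run_controller()`. The keyword arguments mirror the snippets later in this guide, but the import paths are assumptions based on the file locations above; treat this as an illustration of the flow rather than a drop-in script.
```python
import asyncio

# Import paths are assumed from the description above (opendevin/core/main.py).
from opendevin.core.config import AppConfig
from opendevin.core.main import create_runtime, run_controller


async def run_single_task() -> None:
    # 1. Build a configuration (normally assembled from config.toml and CLI arguments).
    config = AppConfig(
        default_agent='CodeActAgent',
        runtime='eventstream',
        max_iterations=10,
    )

    # 2. Create the runtime environment the agent will act in.
    runtime = await create_runtime(config, sid='tutorial-example')

    # 3-4. Run the controller: it attaches the runtime to the agent, executes the
    # task, and returns the final state when complete.
    state = await run_controller(
        config=config,
        task_str='Write me a bash script that prints hello world.',
        runtime=runtime,
    )
    print(state.metrics.get() if state and state.metrics else 'No metrics recorded')


if __name__ == '__main__':
    asyncio.run(run_single_task())
```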
## Easiest way to get started: Exploring Existing Benchmarks
We encourage you to review the various evaluation benchmarks available in the [`evaluation/` directory](https://github.com/OpenDevin/OpenDevin/blob/main/evaluation) of our repository.
To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.
## How to create an evaluation workflow
To create an evaluation workflow for your benchmark, follow these steps:
1. Create a configuration:
```python
def get_config(instance: pd.Series, metadata: EvalMetadata) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
container_image='your_container_image',
enable_auto_lint=True,
timeout=300,
),
)
config.set_llm_config(metadata.llm_config)
return config
```
2. Initialize the runtime and set up the evaluation environment:
```python
async def initialize_runtime(runtime: Runtime, instance: pd.Series):
# Set up your evaluation environment here
# For example, setting environment variables, preparing files, etc.
pass
```
3. Create a function to process each instance:
```python
async def process_instance(instance: pd.Series, metadata: EvalMetadata) -> EvalOutput:
config = get_config(instance, metadata)
runtime = await create_runtime(config, sid=instance.instance_id)
await initialize_runtime(runtime, instance)
instruction = get_instruction(instance, metadata)
state = await run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=your_user_response_function,
)
# Evaluate the agent's actions
evaluation_result = await evaluate_agent_actions(runtime, instance)
return EvalOutput(
instance_id=instance.instance_id,
instruction=instruction,
test_result=evaluation_result,
metadata=metadata,
history=state.history.compatibility_for_eval_history_pairs(),
metrics=state.metrics.get() if state.metrics else None,
error=state.last_error if state and state.last_error else None,
)
```
4. Run the evaluation:
```python
metadata = make_metadata(llm_config, dataset_name, agent_class, max_iterations, eval_note, eval_output_dir)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(your_dataset, output_file, eval_n_limit)
await run_evaluation(
instances,
metadata,
output_file,
num_workers,
process_instance
)
```
This workflow sets up the configuration, initializes the runtime environment, processes each instance by running the agent and evaluating its actions, and then collects the results into an `EvalOutput` object. The `run_evaluation` function handles parallelization and progress tracking.
Remember to customize the `get_instruction`, `your_user_response_function`, and `evaluate_agent_actions` functions according to your specific benchmark requirements.
By following this structure, you can create a robust evaluation workflow for your benchmark within the OpenDevin framework.
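For reference, here is one possible shape for those benchmark-specific helpers. The names `get_instruction` and `evaluate_agent_actions` come from the steps above, but the bodies below are illustrative placeholders: the `problem_statement` column and the `run_tests_in_runtime` helper are hypothetical, and you should adapt them to your dataset's fields and grading logic.
```python
import pandas as pd

# EvalMetadata and Runtime are the same OpenDevin types used in the snippets above.


def get_instruction(instance: pd.Series, metadata: EvalMetadata) -> str:
    # Illustrative: build the task prompt from whatever fields your dataset provides
    # (a hypothetical `problem_statement` column is assumed here).
    return (
        f'Please solve the following problem:\n\n{instance.problem_statement}\n\n'
        'When you are done, run: <execute_bash> exit </execute_bash>.\n'
    )


async def evaluate_agent_actions(runtime: Runtime, instance: pd.Series) -> dict:
    # Illustrative: typically you would execute your benchmark's grading command
    # through `runtime` and parse its output. `run_tests_in_runtime` is a
    # hypothetical helper standing in for that step.
    test_output = await run_tests_in_runtime(runtime, instance)
    return {'success': 'ALL TESTS PASSED' in test_output}
```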
## Understanding the `user_response_fn`
The `user_response_fn` is a crucial component in OpenDevin's evaluation workflow. It simulates user interaction with the agent, allowing for automated responses during the evaluation process. This function is particularly useful when you want to provide consistent, predefined responses to the agent's queries or actions.
### Workflow and Interaction
The correct workflow for handling actions and the `user_response_fn` is as follows:
1. Agent receives a task and starts processing
2. Agent emits an Action
3. If the Action is executable (e.g., CmdRunAction, IPythonRunCellAction):
- The Runtime processes the Action
- Runtime returns an Observation
4. If the Action is not executable (typically a MessageAction):
- The `user_response_fn` is called
- It returns a simulated user response
5. The agent receives either the Observation or the simulated response
6. Steps 2-5 repeat until the task is completed or max iterations are reached
Here's a more accurate visual representation:
```
[Agent]
|
v
[Emit Action]
|
v
[Is Action Executable?]
/ \
Yes No
| |
v v
[Runtime] [user_response_fn]
| |
v v
[Return Observation] [Simulated Response]
\ /
\ /
v v
[Agent receives feedback]
|
v
[Continue or Complete Task]
```
In this workflow:
- Executable actions (like running commands or executing code) are handled directly by the Runtime.
- Non-executable actions (typically when the agent wants to communicate or ask for clarification) are handled by the `user_response_fn`.
- The agent then processes the feedback, whether it's an Observation from the Runtime or a simulated response from the `user_response_fn`.
This approach allows for automated handling of both concrete actions and simulated user interactions, making it suitable for evaluation scenarios where you want to test the agent's ability to complete tasks with minimal human intervention.
### Example Implementation
Here's an example of a `user_response_fn` used in the SWE-Bench evaluation:
```python
def codeact_user_response(state: State | None) -> str:
msg = (
'Please continue working on the task on whatever approach you think is suitable.\n'
'If you think you have solved the task, please first send your answer to user through message and then <execute_bash> exit </execute_bash>.\n'
'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP.\n'
)
if state and state.history:
# check if the agent has tried to talk to the user 3 times, if so, let the agent know it can give up
user_msgs = [
event
for event in state.history.get_events()
if isinstance(event, MessageAction) and event.source == 'user'
]
if len(user_msgs) >= 2:
# let the agent know that it can give up when it has tried 3 times
return (
msg
+ 'If you want to give up, run: <execute_bash> exit </execute_bash>.\n'
)
return msg
```
This function does the following:
1. Provides a standard message encouraging the agent to continue working.
2. Checks how many times the agent has attempted to communicate with the user.
3. If the agent has made multiple attempts, it provides an option to give up.
By using this function, you can ensure consistent behavior across multiple evaluation runs and prevent the agent from getting stuck waiting for human input.
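When wiring the response function into `run_controller`, one convenient pattern (not required by OpenDevin) is a small mapping from agent class to its response function, so the same script can evaluate several agents. The mapping name below is just an example:
```python
# Example wiring inside your process_instance function; the mapping is only a
# convenience. The requirement is simply that fake_user_response_fn accepts a
# State (or None) and returns a string.
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
    'CodeActAgent': codeact_user_response,
}

state = await run_controller(
    config=config,
    task_str=instruction,
    runtime=runtime,
    fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[metadata.agent_class],
)
```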

View File

@@ -12,29 +12,47 @@ all the preprocessing/evaluation/analysis scripts.
## Supported Benchmarks
To learn more about how to integrate your benchmark into OpenDevin, check out [tutorial here](https://docs.all-hands.dev/modules/usage/evaluation_harness).
### Software Engineering
- SWE-Bench: [`evaluation/swe_bench`](./swe_bench)
- HumanEvalFix: [`evaluation/humanevalfix`](./humanevalfix)
- BIRD: [`evaluation/bird`](./bird)
- BioCoder: [`evaluation/biocoder`](./biocoder)
- ML-Bench: [`evaluation/ml_bench`](./ml_bench)
- APIBench: [`evaluation/gorilla`](./gorilla/)
- ToolQA: [`evaluation/toolqa`](./toolqa/)
### Web Browsing
- WebArena: [`evaluation/webarena`](./webarena/)
- MiniWob++: [`evaluation/miniwob`](./miniwob/)
### Misc. Assistance
- GAIA: [`evaluation/gaia`](./gaia)
- GPQA: [`evaluation/gpqa`](./gpqa)
- AgentBench: [`evaluation/agent_bench`](./agent_bench)
- MINT: [`evaluation/mint`](./mint)
- Entity deduction Arena (EDA): [`evaluation/EDA`](./EDA)
- ProofWriter: [`evaluation/logic_reasoning`](./logic_reasoning)
## Before everything begins: Setup Environment and LLM Configuration
Please follow the instructions [here](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up your local development environment and LLM.
OpenDevin in development mode uses `config.toml` to keep track of most configurations. Create a `config.toml` file at the root of the workspace if it does not exist; you can copy from `config.template.toml` if it is easier for you.
Here's an example configuration file you can use to define and use multiple LLMs:
```toml
# TODO: Change these to the model you want to evaluate
[llm]
# IMPORTANT: add your API key here, and set the model to the one you want to evaluate
model = "gpt-4o-2024-05-13"
api_key = "sk-XXX"
[llm.eval_gpt4_1106_preview_llm]
model = "gpt-4-1106-preview"
api_key = "XXX"

View File

@@ -1,186 +0,0 @@
# Tutorial: How to add a New Evaluation Benchmark to OpenDevin
This tutorial provides a general guide on how to integrate your own evaluation benchmark into the OpenDevin framework.
You can read this for details, and also learn by example by looking at our existing evaluations:
- [swe_bench](swe_bench/)
## A quick walk-through of OpenDevin architecture
### Before everything begins
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
### Configuration file
OpenDevin uses `config.toml` to keep track of most configurations.
Here's an example configuration file you can use:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
# IMPORTANT: You should set these two paths to YOUR WORKSPACE directory,
# which will be mounted into the Sandbox for the agent to interact with!
# The OpenDevin agent will be able to read/write whatever files it likes (even `rm -rf`)
# in this directory, so be careful!!
workspace_base = "/path/to/your/workspace"
workspace_mount_path = "/path/to/your/workspace"
# ==========================
ssh_hostname = "localhost"
run_as_devin = false
[sandbox]
# SWEBench eval specific - but you can tweak it to your needs
use_host_network = false
# linting python after editing helps LLM fix indentations
enable_auto_lint = true
box_type = "ssh"
timeout = 120
[llm]
# IMPORTANT: add your API key here, and set the model to the one you want to evaluate
model = "gpt-4o-2024-05-13"
api_key = "sk-XXX"
```
### How to use OpenDevin programmatically
In this section, for the purpose of building an evaluation task, we don't use the standard OpenDevin web-based GUI, but rather run the OpenDevin backend from the CLI.
For example, you can run the following, which performs the specified task `-t`, with a particular model config `-l` and agent `-c`, for a maximum number of iterations `-i`:
```bash
poetry run python ./opendevin/core/main.py \
-i 10 \
-t "Write me a bash script that print hello world." \
-c CodeActAgent \
-l llm
```
After running the script, you will observe the following:
![](./static/example_task_1.png)
You can see the agent uses bash to write a script, makes it executable, and then tests it by running it to make sure it is working.
At the end of the above screenshot, OpenDevin actually requests user input when it thinks it has finished the task. This will cause issues in evaluation, since most evaluations don't assume additional user input. To fix this, we introduce the `fake_user_response_fn` functionality in the `main` function, which we describe in the next section.
## The `main` function
The signature of `main` (in [`opendevin/core/main.py`](../opendevin/core/main.py)) is as follows:
```python
async def main(
task_str: str = '',
exit_on_message: bool = False,
fake_user_response_fn: Optional[Callable[[Optional[State]], str]] = None,
sandbox: Optional[Sandbox] = None,
) -> Optional[State]:
```
- `task_str`: The task instruction to run. In the above example, it is "Write me a bash script that prints hello world."
- `exit_on_message`: whether to quit if the agent asks for a message from the user
- `fake_user_response_fn`: An optional function that receives the current state (could be None) and returns a fake user response.
- `sandbox`: An optional sandbox to run the agent in.
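As a rough illustration, `main` can be invoked programmatically like this. The import path is assumed from the file location above, and `codeact_user_response` is the function shown in the next section:
```python
import asyncio

from opendevin.core.main import main  # path assumed from opendevin/core/main.py

state = asyncio.run(
    main(
        task_str='Write me a bash script that prints hello world.',
        fake_user_response_fn=codeact_user_response,  # defined in the next section
    )
)
if state is not None:
    print(f'Run finished with {len(state.history)} (action, observation) pairs')
```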
### `fake_user_response_fn`
Here's an example of `fake_user_response_fn` in the implementation for SWE-Bench in [`evaluation/swe_bench/run_infer.py`](swe_bench/run_infer.py):
```python
def codeact_user_response(state: State) -> str:
msg = (
'Please continue working on the task on whatever approach you think is suitable.\n'
'If you think you have modified the code in a way that fixes the issue, please run the following command: <execute_bash> exit </execute_bash>.\n'
'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.\n'
)
# check if the agent has tried to talk to the user 3 times, if so, let the agent know it can give up
if state.history:
user_msgs = [
event
for event in state.history.get_events()
if isinstance(event, MessageAction) and event.source == 'user'
]
if len(user_msgs) > 2:
# let the agent know that it can give up when it has tried 3 times
return (
msg
+ 'If you want to give up, run: <execute_bash> exit </execute_bash>.\n'
)
return msg
```
### Return value
The main function returns a `State`, which is defined in [`opendevin/controller/state/state.py`](../opendevin/controller/state/state.py). We mainly use `state.history` here, which is the most important field of data. You can think of it as a more structured version of OpenAI's chat completion [messages](https://platform.openai.com/docs/guides/text-generation/chat-completions-api).
`history: list[tuple[Action, Observation]] = field(default_factory=list)` is a list of (action, observation) tuples. All actions are defined in [`opendevin/events/action`](../opendevin/events/action) and all observations in [`opendevin/events/observation`](../opendevin/events/observation).
The agent can emit different actions, like `CmdRunAction` (`opendevin/events/action/commands.py`) to execute bash commands and receive a `CmdOutputObservation` (`opendevin/events/observation/commands.py`), `IPythonRunCellAction` to run code and receive an `IPythonRunCellObservation`, or `BrowseInteractiveAction` (`opendevin/events/action/browse.py`) to browse the web and receive a `BrowserOutputObservation` (`opendevin/events/observation/browse.py`).
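As a small illustration of working with this structure, the sketch below walks the returned history and prints every bash command the agent executed along with its output. It assumes you have the final `state` from `main` and have imported `CmdRunAction` and `CmdOutputObservation` from the modules above:
```python
# Illustrative only: scan the (action, observation) pairs in the final state
# for executed bash commands and their outputs.
for action, observation in state.history:
    if isinstance(action, CmdRunAction) and isinstance(observation, CmdOutputObservation):
        print(f'$ {action.command}')
        print(observation.content)
```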
The action we use in this example is `MessageAction` (`opendevin/events/action/message.py`), which denotes a message from either `agent` or `user`. In the [CodeAct agent example](https://github.com/OpenDevin/OpenDevin/blob/7ca560471bd262f22513f3863995d0a8e6121c07/agenthub/codeact_agent/codeact_agent.py#L239-L273), an agent is considered to emit a `MessageAction` when it does not trigger a `CmdRunAction`, `IPythonRunCellAction`, or `BrowseInteractiveAction`.
Typically, the agent emits a `MessageAction` when it is confused about the task and wants to ask a human for follow-up clarification, which is a good thing in real-world tasks, but not necessarily in evaluation. So in this example, we provide a dummy prompt telling the agent: "Please continue working on the task on whatever approach you think is suitable[...]".
If you see something like this, you can consider adding this to your evaluation pipeline as well.
### `sandbox`
Sandbox is a fully functioning docker container where the agent can perform all sorts of tasks, e.g., using bash, calling Python, installing packages, and more. You can leave `sandbox` as `None` if you don't need to do anything special to pre-configure the `Sandbox`.
In SWE-Bench, we need to copy the proper repository directory to the workspace and activate the right python virtual environment before the agent can start performing the task, so we defined a custom [`SWEBenchSSHBox`](https://github.com/OpenDevin/OpenDevin/blob/7ca560471bd262f22513f3863995d0a8e6121c07/evaluation/swe_bench/swe_env_box.py#L12-L118) that inherits from the default sandbox [`SSHBox`](https://github.com/OpenDevin/OpenDevin/blob/7ca560471bd262f22513f3863995d0a8e6121c07/opendevin/runtime/docker/ssh_box.py#L188) and handles all of this initial setup. If you need to configure the `sandbox` for your evaluation, check `SWEBenchSSHBox` for a reference implementation.
## How to put together an evaluation script?
Now we know how to start running the agent end-to-end, and how `fake_user_response_fn` and `sandbox` work. We will walk through a piece of dummy code (a simplified version of SWE-Bench's [`run_infer.py`](https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/run_infer.py)) that outlines the general workflow:
- Load the dataset and prepare the evaluation configuration.
- Filter out any instances that have already been processed.
- For each instance in the dataset:
- Set up the sandbox environment.
- Run the agent to generate a solution.
- Apply the solution to the instance and execute the test command.
- Collect the results and write them to the output file.
- Perform cleanup after the evaluation is complete.
You can see the [swe_bench/run_infer.py](swe_bench/run_infer.py) file for an example.
Once you fully understand `run_infer.py`, you are ready to actually start the evaluation!
## Run the evaluation!
You can write your `run_infer.sh` script mimicking SWE-Bench's [`run_infer.sh`](https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/scripts/run_infer.sh).
You can start the evaluation by running:
```bash
./run_infer.sh eval_gpt_4o_2024_05_13
```
Where `eval_gpt_4o_2024_05_13` is the model config you defined in your `config.toml`.
Like this:
```toml
[core]
...
[llm]
model="gpt-4-32k"
...
[eval_gpt_4o_2024_05_13]
model="gpt-4o-2024-05-13"
api_key="sk-xxx"
```
If `[eval_gpt_4o_2024_05_13]` is not present, it will default to using the model configured in `[llm]`.