Docs: Clarify config.toml usage in evaluation harness (#6828)

Co-authored-by: openhands <openhands@all-hands.dev>
2026-01-08 22:38:05 -05:00 · 2025-02-21 01:16:17 -05:00
parent c27b191358
commit e52aee168e
1 changed file with 4 additions and 0 deletions
--- a/evaluation/README.md
+++ b/evaluation/README.md
@@ -20,6 +20,8 @@ To evaluate an agent, you can provide the agent's name to the `run_infer.py` pro
 ### Evaluating Different LLMs

 OpenHands in development mode uses `config.toml` to keep track of most configuration.
+**IMPORTANT: For evaluation, only the LLM section of `config.toml` is used. Other settings, such as `save_trajectory_path`, are not applied during evaluation.**
+
 Here's an example configuration file you can use to define and use multiple LLMs:

 ```toml
@@ -40,6 +42,8 @@ api_key = "XXX"
 temperature = 0.0
 ```

+Other evaluation-specific settings, such as `save_trajectory_path`, are typically set in the `get_config` function of each benchmark's `run_infer.py` file.
+
 ## Supported Benchmarks

 The OpenHands evaluation harness supports a wide variety of benchmarks across [software engineering](#software-engineering), [web browsing](#web-browsing), [miscellaneous assistance](#misc-assistance), and [real-world](#real-world) tasks.