diff --git a/evaluation/README.md b/evaluation/README.md
index e87c0657da..b8da72f8c5 100644
--- a/evaluation/README.md
+++ b/evaluation/README.md
@@ -42,7 +42,7 @@ temperature = 0.0
 
 ## Supported Benchmarks
 
-The OpenHands evaluation harness supports a wide variety of benchmarks across [software engineering](#software-engineering), [web browsing](#web-browsing), and [miscellaneous assistance](#misc-assistance) tasks.
+The OpenHands evaluation harness supports a wide variety of benchmarks across [software engineering](#software-engineering), [web browsing](#web-browsing), [miscellaneous assistance](#misc-assistance), and [real-world](#real-world) tasks.
 
 ### Software Engineering
 
@@ -73,6 +73,10 @@ The OpenHands evaluation harness supports a wide variety of benchmarks across [s
 - ProofWriter: [`evaluation/benchmarks/logic_reasoning`](./benchmarks/logic_reasoning)
 - ScienceAgentBench: [`evaluation/benchmarks/scienceagentbench`](./benchmarks/scienceagentbench)
 
+### Real World
+
+- TheAgentCompany: [`evaluation/benchmarks/the_agent_company`](./benchmarks/the_agent_company)
+
 ## Result Visualization
 
 Check [this huggingface space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization of existing experimental results.