mirror of https://github.com/All-Hands-AI/OpenHands.git synced 2026-04-29 03:00:45 -04:00

MINT Benchmark

This folder contains the evaluation harness for the MINT benchmark on LLMs' ability to solve tasks with multi-turn interactions.

Configure OpenDevin and your LLM

Create a config.toml file at the root of the workspace if it does not already exist. See the project's main README.md for how to set this up.

Start the evaluation

We are using the MINT dataset hosted on Hugging Face.

The basic command to start the evaluation is shown below. Currently, CodeActAgent is the only agent supported with MINT.

./evaluation/mint/scripts/run_infer.sh [model_config] [git-version] [subset] [eval_limit]

where model_config is mandatory and the other arguments are optional.

model_config, e.g. eval_gpt4_1106_preview, is the config group name for your LLM settings, as defined in your config.toml.
git-version, e.g. HEAD, is the git commit hash (or a ref such as HEAD) of the OpenDevin version you would like to evaluate. It can also be a release tag such as 0.6.2.
subset, e.g. math, is the subset of the MINT benchmark to evaluate on, defaulting to math. It must be one of: math, gsm8k, mmlu, theoremqa, mbpp, humaneval.
eval_limit, e.g. 2, limits the evaluation to the first eval_limit instances, defaulting to all instances.

Note: in order to use eval_limit, you must also set subset.

For example,

./evaluation/mint/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 gsm8k 3
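
The argument rules described above can be sketched as a small Python helper that assembles the command line and rejects invalid combinations before anything is launched. This is only an illustration: build_mint_command is a hypothetical name and is not part of the repository. It uses the script path from the basic command above, and it assumes the arguments are purely positional, so a later argument can only be given when every earlier one is given too.

```python
import shlex

# Subsets listed in this README.
MINT_SUBSETS = ("math", "gsm8k", "mmlu", "theoremqa", "mbpp", "humaneval")
# Script path taken from the basic command in this README.
SCRIPT = "./evaluation/mint/scripts/run_infer.sh"


def build_mint_command(model_config, git_version=None, subset=None, eval_limit=None):
    """Return the argv list for run_infer.sh (hypothetical helper, not in the repo)."""
    if not model_config:
        raise ValueError("model_config is mandatory")
    if subset is not None and subset not in MINT_SUBSETS:
        raise ValueError(f"unknown subset {subset!r}; expected one of {MINT_SUBSETS}")
    if eval_limit is not None and subset is None:
        raise ValueError("eval_limit requires subset to be set as well")
    # Assumption: arguments are positional, so subset cannot be passed
    # unless git_version is passed before it.
    if subset is not None and git_version is None:
        raise ValueError("subset requires git_version because arguments are positional")

    argv = [SCRIPT, model_config]
    for value in (git_version, subset, eval_limit):
        if value is None:
            break  # stop at the first omitted optional argument
        argv.append(str(value))
    return argv


if __name__ == "__main__":
    # Rebuilds the example invocation: GPT-4 config, release 0.6.2, gsm8k, 3 instances.
    print(shlex.join(build_mint_command("eval_gpt4_1106_preview", "0.6.2", "gsm8k", 3)))
```

Passing the returned list to subprocess.run from the repository root would start the evaluation; the helper itself only builds and validates the arguments.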

Reference

@misc{wang2024mint,
    title={MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback},
    author={Xingyao Wang and Zihan Wang and Jiateng Liu and Yangyi Chen and Lifan Yuan and Hao Peng and Heng Ji},
    year={2024},
    eprint={2309.10691},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}