MINT Benchmark
This folder contains the evaluation harness for the MINT benchmark, which evaluates LLMs' ability to solve tasks with multi-turn interactions.
Configure OpenDevin and LM
Create a config.toml file at the root of the workspace if it does not already exist. Please check README.md for how to set this up.
Start the evaluation
We are using the MINT dataset hosted on Hugging Face.
The following is the basic command to start the evaluation. Currently, the only agent supported with MINT is CodeActAgent.
./evaluation/mint/scripts/run_infer.sh [model_config] [subset] [eval_limit]
where model_config is mandatory, while subset and eval_limit are optional.
- model_config, e.g. eval_gpt4_1106_preview, is the config group name for your LLM settings, as defined in your config.toml.
- subset, e.g. math, is the subset of the MINT benchmark to evaluate on, defaulting to math. It can be one of: math, gsm8k, mmlu, theoremqa, mbpp, humaneval.
- eval_limit, e.g. 2, limits the evaluation to the first eval_limit instances, defaulting to all instances.
Note: in order to use eval_limit, you must also set subset.
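The note above follows from the arguments being positional. The sketch below illustrates that behaviour under stated assumptions: parse_mint_args is a hypothetical function written for this example, and the "all" value is only a placeholder for "no limit". It is not the actual contents of run_infer.sh.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of positional-argument handling; NOT the real run_infer.sh.

parse_mint_args() {
  # model_config is mandatory: fail if it is missing or empty.
  if [ -z "${1:-}" ]; then
    echo "usage: run_infer.sh [model_config] [subset] [eval_limit]" >&2
    return 1
  fi
  local model_config="$1"
  local subset="${2:-math}"      # second argument, defaulting to math
  local eval_limit="${3:-all}"   # third argument; "all" is a placeholder for "no limit"
  echo "model_config=$model_config subset=$subset eval_limit=$eval_limit"
}

# Only the mandatory argument: both defaults apply.
parse_mint_args eval_gpt4_1106_preview
# All three arguments given.
parse_mint_args eval_gpt4_1106_preview gsm8k 3
# A lone number in second position is read as the subset name, not as a limit.
parse_mint_args eval_gpt4_1106_preview 3
```

In the last call the value 3 lands in the subset slot, which is why eval_limit can only be given together with subset.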
For example, to run 3 instances on the gsm8k subset using eval_gpt4_1106_preview, the command would be:
./evaluation/mint/scripts/run_infer.sh eval_gpt4_1106_preview gsm8k 3
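To evaluate a few instances on every subset in turn, the same command can be wrapped in a loop. This is a minimal sketch, assuming it is run from the repository root. run_all_subsets and the DRY_RUN variable are hypothetical names introduced here for illustration; with DRY_RUN left at its default of 1, the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch: run the MINT evaluation once per subset.
# DRY_RUN is a hypothetical helper variable; run_infer.sh itself does not read it.
DRY_RUN="${DRY_RUN:-1}"

run_all_subsets() {
  local model_config="$1"
  local eval_limit="$2"
  local subset
  for subset in math gsm8k mmlu theoremqa mbpp humaneval; do
    if [ "$DRY_RUN" = "1" ]; then
      # Print the command instead of executing it.
      echo "./evaluation/mint/scripts/run_infer.sh $model_config $subset $eval_limit"
    else
      ./evaluation/mint/scripts/run_infer.sh "$model_config" "$subset" "$eval_limit"
    fi
  done
}

run_all_subsets eval_gpt4_1106_preview 3
```

Setting DRY_RUN=0 in the environment makes the loop invoke the script for each subset rather than printing the commands.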
Reference
@misc{wang2024mint,
title={MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback},
author={Xingyao Wang and Zihan Wang and Jiateng Liu and Yangyi Chen and Lifan Yuan and Hao Peng and Heng Ji},
year={2024},
eprint={2309.10691},
archivePrefix={arXiv},
primaryClass={cs.CL}
}