Add inference for SWT-Bench (#7201)

Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Calvin Smith <email@cjsmith.io>
Niels Mündler
2025-04-17 22:49:42 +02:00
committed by GitHub
parent 988d4aa679
commit 4b124d5906
5 changed files with 1044 additions and 6 deletions

View File

@@ -2,6 +2,8 @@
This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)).
**UPDATE (4/8/2025): We now support running SWT-Bench evaluation! For more details, check out [the corresponding section](#SWT-Bench-Evaluation).**
**UPDATE (03/27/2025): We now support SWE-Bench multimodal evaluation! Simply use "princeton-nlp/SWE-bench_Multimodal" as the dataset name in the `run_infer.sh` script to evaluate on multimodal instances.**
**UPDATE (2/18/2025): We now support running SWE-Gym using the same evaluation harness here. For more details, check out [this README](./SWE-Gym.md).**
@@ -141,7 +143,7 @@ With `output.jsonl` file, you can run `eval_infer.sh` to evaluate generated patc
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh $YOUR_OUTPUT_JSONL [instance_id] [dataset_name] [split]
# Example
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Lite/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
```
The script now accepts optional arguments:
@@ -182,3 +184,58 @@ To clean-up all existing runtimes that you've already started, run:
```bash
ALLHANDS_API_KEY="YOUR-API-KEY" ./evaluation/utils/scripts/cleanup_remote_runtime.sh
```
## SWT-Bench Evaluation
[SWT-Bench](https://swtbench.com/) ([paper](https://arxiv.org/abs/2406.12952)) is a benchmark for evaluating the capability of LLMs to create unit tests. It uses the same instances as SWE-Bench, but requires a separate evaluation harness to capture coverage and issue reproduction. Below we detail how to run inference on SWT-Bench with the inference script in this folder and how to evaluate the resulting tests with the SWT-Bench evaluation harness.
### Run inference on SWT-Bench
To run inference on SWT-Bench, use the same `run_infer.sh` script as described for evaluation on plain SWE-Bench. The only difference is that you need to set the `mode` parameter to `swt` or `swt-ci` when running the script. For example, to run inference on SWT-Bench Verified, run the following command:
```bash
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [swe-dataset] test 1 swt
# Example - This runs inference with CodeActAgent on 500 instances of the SWT-Bench Verified test set (corresponding to SWE-bench_Verified), with at most 100 iterations per instance and 1 worker running in parallel
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.eval_gpt4o-2024-11-20 HEAD CodeActAgent 500 100 1 princeton-nlp/SWE-bench_Verified test 1 swt
```
The two modes `swt` and `swt-ci` have the following effects:
- `swt`: This mode changes the prompt to instruct the agent to generate test cases that reproduce the issue instead of resolving it.
- `swt-ci`: In addition to the changes made by `swt`, this mode sets up the CI environment by (i) pre-installing the environment in the Docker image so that the test framework can be executed without errors, and (ii) telling the model the exact command to run the test framework.
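For example, an `swt-ci` run differs from the `swt` example above only in the last argument (the model config and dataset below are the same illustrative values as before; adjust them to your setup):

```bash
# Same as the SWT-Bench Verified example above, but in swt-ci mode: the environment is
# pre-installed in the image and the agent is told the exact test command to run
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.eval_gpt4o-2024-11-20 HEAD CodeActAgent 500 100 1 princeton-nlp/SWE-bench_Verified test 1 swt-ci
```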
### Run evaluation for SWT-Bench
These results are evaluated using [the SWT-Bench evaluation harness](https://github.com/logic-star-ai/swt-bench/tree/master).
#### Extracting results into SWT-Bench harness format
To evaluate the obtained inference results with the SWT-Bench harness, we first transform them into the format that the harness expects.
```bash
python3 evaluation/benchmarks/swe_bench/scripts/swtbench/convert.py --prediction_file [output.jsonl] > [output_swt.jsonl]
# Example
python3 evaluation/benchmarks/swe_bench/scripts/swtbench/convert.py --prediction_file "evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Verified-test/CodeActAgent/gpt-4o-2024-11-20_maxiter_100_N_v0.31.0-no-hint-swt-run_1/output.jsonl" > OpenHands-gpt-4o-2024-11-20.jsonl
```
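Each line of the converted file is a JSON object with the fields the SWT-Bench harness expects, as emitted by `convert.py` (`instance_id`, `model_name_or_path`, `model_patch`, `full_output`). A rough sketch of what one converted entry looks like (values are illustrative):

```bash
head -n 1 OpenHands-gpt-4o-2024-11-20.jsonl
# {"instance_id": "django__django-11099",
#  "model_name_or_path": "OpenHands__CodeActAgent__gpt-4o-2024-11-20",
#  "model_patch": "diff --git a/tests/auth_tests/test_validators.py ...",
#  "full_output": "{... the full prediction record from output.jsonl ...}"}
```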
#### Running the results in SWT-Bench
Next, we run the [SWT-Bench evaluation harness](https://github.com/logic-star-ai/swt-bench/tree/master) with these results.
First, set up and validate the installation as described in the harness [setup instructions](https://github.com/logic-star-ai/swt-bench/tree/master?tab=readme-ov-file#-set-up).
Then, run the evaluation with the following command:
```bash
# Example
python3 -m src.main \
--dataset_name princeton-nlp/SWE-bench_Verified \
--predictions_path <pathTo>/OpenHands-gpt-4o-2024-11-20.jsonl \
--max_workers 12 \
--run_id OpenHands-CodeAct-gpt-4o-2024-11-20 --patch_types vanilla --build_mode api
```
The results of the evaluation can be obtained by running the reporting script of the harness.
```bash
# Example
python -m src.report run_instance_swt_logs/OpenHands-CodeAct-gpt-4o-2024-11-20/OpenHands__CodeActAgent__gpt-4o-2024-11-20 --dataset verified
```

View File

@@ -0,0 +1,832 @@
# Based on https://github.com/logic-star-ai/swt-bench/blob/master/src/constants.py
# Constants - Installation Specifications
MAP_VERSION_TO_INSTALL_SKLEARN = {
k: {
"python": "3.6",
"packages": "numpy scipy cython pytest pandas matplotlib",
"install": "python -m pip install -v --no-use-pep517 --no-build-isolation -e .",
"pip_packages": [
"cython",
"numpy==1.19.2",
"setuptools",
"scipy==1.5.2",
],
}
for k in ["0.20", "0.21", "0.22"]
}
MAP_VERSION_TO_INSTALL_SKLEARN.update(
{
k: {
"python": "3.9",
"packages": "'numpy==1.19.2' 'scipy==1.5.2' 'cython==3.0.10' pytest 'pandas<2.0.0' 'matplotlib<3.9.0' setuptools pytest joblib threadpoolctl",
"install": "python -m pip install -v --no-use-pep517 --no-build-isolation -e .",
"pip_packages": ["cython", "setuptools", "numpy", "scipy"],
}
for k in ["1.3", "1.4"]
}
)
MAP_VERSION_TO_INSTALL_FLASK = {
"2.0": {
"python": "3.9",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"pip_packages": [
"setuptools==70.0.0",
"Werkzeug==2.3.7",
"Jinja2==3.0.1",
"itsdangerous==2.1.2",
"click==8.0.1",
"MarkupSafe==2.1.3",
],
},
"2.1": {
"python": "3.10",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"pip_packages": [
"click==8.1.3",
"itsdangerous==2.1.2",
"Jinja2==3.1.2",
"MarkupSafe==2.1.1",
"Werkzeug==2.3.7",
],
},
}
MAP_VERSION_TO_INSTALL_FLASK.update(
{
k: {
"python": "3.11",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"pip_packages": [
"click==8.1.3",
"itsdangerous==2.1.2",
"Jinja2==3.1.2",
"MarkupSafe==2.1.1",
"Werkzeug==2.3.7",
],
}
for k in ["2.2", "2.3"]
}
)
MAP_VERSION_TO_INSTALL_DJANGO = {
k: {
"python": "3.5",
"packages": "requirements.txt",
"pre_install": [
"apt-get update && apt-get install -y locales",
"echo 'en_US UTF-8' > /etc/locale.gen",
"locale-gen en_US.UTF-8",
],
"install": "python setup.py install",
"pip_packages": ["setuptools"],
"eval_commands": [
"export LANG=en_US.UTF-8",
"export LC_ALL=en_US.UTF-8",
"export PYTHONIOENCODING=utf8",
"export LANGUAGE=en_US:en",
],
}
for k in ["1.7", "1.8", "1.9", "1.10", "1.11", "2.0", "2.1", "2.2"]
}
MAP_VERSION_TO_INSTALL_DJANGO.update(
{
k: {"python": "3.5", "install": "python setup.py install"}
for k in ["1.4", "1.5", "1.6"]
}
)
MAP_VERSION_TO_INSTALL_DJANGO.update(
{
k: {
"python": "3.6",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"eval_commands": [
"sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && locale-gen",
"export LANG=en_US.UTF-8",
"export LANGUAGE=en_US:en",
"export LC_ALL=en_US.UTF-8",
],
}
for k in ["3.0", "3.1", "3.2"]
}
)
MAP_VERSION_TO_INSTALL_DJANGO.update(
{
k: {
"python": "3.8",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
}
for k in ["4.0"]
}
)
MAP_VERSION_TO_INSTALL_DJANGO.update(
{
k: {
"python": "3.9",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
}
for k in ["4.1", "4.2"]
}
)
MAP_VERSION_TO_INSTALL_DJANGO.update(
{
k: {
"python": "3.11",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
}
for k in ["5.0"]
}
)
MAP_VERSION_TO_INSTALL_REQUESTS = {
k: {"python": "3.9", "packages": "pytest", "install": "python -m pip install ."}
for k in ["0.7", "0.8", "0.9", "0.11", "0.13", "0.14", "1.1", "1.2", "2.0", "2.2"]
+ ["2.3", "2.4", "2.5", "2.7", "2.8", "2.9", "2.10", "2.11", "2.12", "2.17"]
+ ["2.18", "2.19", "2.22", "2.26", "2.25", "2.27", "3.0"]
}
MAP_VERSION_TO_INSTALL_SEABORN = {
k: {
"python": "3.9",
"install": "python -m pip install -e .",
"pip_packages": [
"contourpy==1.1.0",
"cycler==0.11.0",
"fonttools==4.42.1",
"importlib-resources==6.0.1",
"kiwisolver==1.4.5",
"matplotlib==3.7.2",
"numpy==1.25.2",
"packaging==23.1",
"pandas==1.3.5", # 2.0.3
"pillow==10.0.0",
"pyparsing==3.0.9",
"pytest",
"python-dateutil==2.8.2",
"pytz==2023.3.post1",
"scipy==1.11.2",
"six==1.16.0",
"tzdata==2023.1",
"zipp==3.16.2",
],
}
for k in ["0.11"]
}
MAP_VERSION_TO_INSTALL_SEABORN.update(
{
k: {
"python": "3.9",
"install": "python -m pip install -e .[dev]",
"pip_packages": [
"contourpy==1.1.0",
"cycler==0.11.0",
"fonttools==4.42.1",
"importlib-resources==6.0.1",
"kiwisolver==1.4.5",
"matplotlib==3.7.2",
"numpy==1.25.2",
"packaging==23.1",
"pandas==2.0.0",
"pillow==10.0.0",
"pyparsing==3.0.9",
"pytest",
"python-dateutil==2.8.2",
"pytz==2023.3.post1",
"scipy==1.11.2",
"six==1.16.0",
"tzdata==2023.1",
"zipp==3.16.2",
],
}
for k in ["0.12", "0.13"]
}
)
MAP_VERSION_TO_INSTALL_PYTEST = {
k: {"python": "3.9", "install": "python -m pip install -e ."}
for k in [
"4.4",
"4.5",
"4.6",
"5.0",
"5.1",
"5.2",
"5.3",
"5.4",
"6.0",
"6.2",
"6.3",
"7.0",
"7.1",
"7.2",
"7.4",
"8.0",
]
}
MAP_VERSION_TO_INSTALL_PYTEST["4.4"]["pip_packages"] = [
"atomicwrites==1.4.1",
"attrs==23.1.0",
"more-itertools==10.1.0",
"pluggy==0.13.1",
"py==1.11.0",
"setuptools==68.0.0",
"six==1.16.0",
]
MAP_VERSION_TO_INSTALL_PYTEST["4.5"]["pip_packages"] = [
"atomicwrites==1.4.1",
"attrs==23.1.0",
"more-itertools==10.1.0",
"pluggy==0.11.0",
"py==1.11.0",
"setuptools==68.0.0",
"six==1.16.0",
"wcwidth==0.2.6",
]
MAP_VERSION_TO_INSTALL_PYTEST["4.6"]["pip_packages"] = [
"atomicwrites==1.4.1",
"attrs==23.1.0",
"more-itertools==10.1.0",
"packaging==23.1",
"pluggy==0.13.1",
"py==1.11.0",
"six==1.16.0",
"wcwidth==0.2.6",
]
for k in ["5.0", "5.1", "5.2"]:
MAP_VERSION_TO_INSTALL_PYTEST[k]["pip_packages"] = [
"atomicwrites==1.4.1",
"attrs==23.1.0",
"more-itertools==10.1.0",
"packaging==23.1",
"pluggy==0.13.1",
"py==1.11.0",
"wcwidth==0.2.6",
]
MAP_VERSION_TO_INSTALL_PYTEST["5.3"]["pip_packages"] = [
"attrs==23.1.0",
"more-itertools==10.1.0",
"packaging==23.1",
"pluggy==0.13.1",
"py==1.11.0",
"wcwidth==0.2.6",
]
MAP_VERSION_TO_INSTALL_PYTEST["5.4"]["pip_packages"] = [
"py==1.11.0",
"packaging==23.1",
"attrs==23.1.0",
"more-itertools==10.1.0",
"pluggy==0.13.1",
]
MAP_VERSION_TO_INSTALL_PYTEST["6.0"]["pip_packages"] = [
"attrs==23.1.0",
"iniconfig==2.0.0",
"more-itertools==10.1.0",
"packaging==23.1",
"pluggy==0.13.1",
"py==1.11.0",
"toml==0.10.2",
]
for k in ["6.2", "6.3"]:
MAP_VERSION_TO_INSTALL_PYTEST[k]["pip_packages"] = [
"attrs==23.1.0",
"iniconfig==2.0.0",
"packaging==23.1",
"pluggy==0.13.1",
"py==1.11.0",
"toml==0.10.2",
]
MAP_VERSION_TO_INSTALL_PYTEST["7.0"]["pip_packages"] = [
"attrs==23.1.0",
"iniconfig==2.0.0",
"packaging==23.1",
"pluggy==0.13.1",
"py==1.11.0",
]
for k in ["7.1", "7.2"]:
MAP_VERSION_TO_INSTALL_PYTEST[k]["pip_packages"] = [
"attrs==23.1.0",
"iniconfig==2.0.0",
"packaging==23.1",
"pluggy==0.13.1",
"py==1.11.0",
"tomli==2.0.1",
]
MAP_VERSION_TO_INSTALL_PYTEST["7.4"]["pip_packages"] = [
"iniconfig==2.0.0",
"packaging==23.1",
"pluggy==1.3.0",
"exceptiongroup==1.1.3",
"tomli==2.0.1",
]
MAP_VERSION_TO_INSTALL_PYTEST["8.0"]["pip_packages"] = [
"iniconfig==2.0.0",
"packaging==23.1",
"pluggy==1.3.0",
"exceptiongroup==1.1.3",
"tomli==2.0.1",
]
MAP_VERSION_TO_INSTALL_MATPLOTLIB = {
k: {
"python": "3.11",
"packages": "environment.yml",
"install": "python -m pip install -e .",
"pre_install": [
"apt-get -y update && apt-get -y upgrade && apt-get install -y imagemagick ffmpeg texlive texlive-latex-extra texlive-fonts-recommended texlive-xetex texlive-luatex cm-super dvipng"
],
"pip_packages": [
"contourpy==1.1.0",
"cycler==0.11.0",
"fonttools==4.42.1",
"ghostscript",
"kiwisolver==1.4.5",
"numpy==1.25.2",
"packaging==23.1",
"pillow==10.0.0",
"pikepdf",
"pyparsing==3.0.9",
"python-dateutil==2.8.2",
"six==1.16.0",
"setuptools==68.1.2",
"setuptools-scm==7.1.0",
"typing-extensions==4.7.1",
],
}
for k in ["3.5", "3.6", "3.7"]
}
MAP_VERSION_TO_INSTALL_MATPLOTLIB.update(
{
k: {
"python": "3.8",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"pre_install": [
"apt-get -y update && apt-get -y upgrade && apt-get install -y imagemagick ffmpeg libfreetype6-dev pkg-config texlive texlive-latex-extra texlive-fonts-recommended texlive-xetex texlive-luatex cm-super"
],
"pip_packages": ["pytest", "ipython"],
}
for k in ["3.1", "3.2", "3.3", "3.4"]
}
)
MAP_VERSION_TO_INSTALL_MATPLOTLIB.update(
{
k: {
"python": "3.7",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"pre_install": [
"apt-get -y update && apt-get -y upgrade && apt-get install -y imagemagick ffmpeg libfreetype6-dev pkg-config"
],
"pip_packages": ["pytest"],
}
for k in ["3.0"]
}
)
MAP_VERSION_TO_INSTALL_MATPLOTLIB.update(
{
k: {
"python": "3.5",
"install": "python setup.py build; python setup.py install",
"pre_install": [
"apt-get -y update && apt-get -y upgrade && && apt-get install -y imagemagick ffmpeg"
],
"pip_packages": ["pytest"],
"execute_test_as_nonroot": True,
}
for k in ["2.0", "2.1", "2.2", "1.0", "1.1", "1.2", "1.3", "1.4", "1.5"]
}
)
MAP_VERSION_TO_INSTALL_SPHINX = {
k: {
"python": "3.9",
"pip_packages": ["tox==4.16.0", "tox-current-env==0.0.11"],
"install": "python -m pip install -e .[test]",
"pre_install": ["sed -i 's/pytest/pytest -rA/' tox.ini"],
}
for k in ["1.5", "1.6", "1.7", "1.8", "2.0", "2.1", "2.2", "2.3", "2.4", "3.0"]
+ ["3.1", "3.2", "3.3", "3.4", "3.5", "4.0", "4.1", "4.2", "4.3", "4.4"]
+ ["4.5", "5.0", "5.1", "5.2", "5.3", "6.0", "6.2", "7.0", "7.1", "7.2"]
}
for k in ["3.0", "3.1", "3.2", "3.3", "3.4", "3.5", "4.0", "4.1", "4.2", "4.3", "4.4"]:
MAP_VERSION_TO_INSTALL_SPHINX[k][
"pre_install"
].extend([
"sed -i 's/Jinja2>=2.3/Jinja2<3.0/' setup.py",
"sed -i 's/sphinxcontrib-applehelp/sphinxcontrib-applehelp<=1.0.7/' setup.py",
"sed -i 's/sphinxcontrib-devhelp/sphinxcontrib-devhelp<=1.0.5/' setup.py",
"sed -i 's/sphinxcontrib-qthelp/sphinxcontrib-qthelp<=1.0.6/' setup.py",
"sed -i 's/alabaster>=0.7,<0.8/alabaster>=0.7,<0.7.12/' setup.py",
'sed -i "s/\'packaging\',/\'packaging\', \'markupsafe<=2.0.1\',/" setup.py',
])
if k in ["4.2", "4.3", "4.4"]:
MAP_VERSION_TO_INSTALL_SPHINX[k]["pre_install"].extend([
"sed -i 's/sphinxcontrib-htmlhelp>=2.0.0/sphinxcontrib-htmlhelp>=2.0.0,<=2.0.4/' setup.py",
"sed -i 's/sphinxcontrib-serializinghtml>=1.1.5/sphinxcontrib-serializinghtml>=1.1.5,<=1.1.9/' setup.py",
])
elif k == "4.1":
MAP_VERSION_TO_INSTALL_SPHINX[k]["pre_install"].extend([
(
"grep -q 'sphinxcontrib-htmlhelp>=2.0.0' setup.py && "
"sed -i 's/sphinxcontrib-htmlhelp>=2.0.0/sphinxcontrib-htmlhelp>=2.0.0,<=2.0.4/' setup.py || "
"sed -i 's/sphinxcontrib-htmlhelp/sphinxcontrib-htmlhelp<=2.0.4/' setup.py"
),
(
"grep -q 'sphinxcontrib-serializinghtml>=1.1.5' setup.py && "
"sed -i 's/sphinxcontrib-serializinghtml>=1.1.5/sphinxcontrib-serializinghtml>=1.1.5,<=1.1.9/' setup.py || "
"sed -i 's/sphinxcontrib-serializinghtml/sphinxcontrib-serializinghtml<=1.1.9/' setup.py"
)
])
else:
MAP_VERSION_TO_INSTALL_SPHINX[k]["pre_install"].extend([
"sed -i 's/sphinxcontrib-htmlhelp/sphinxcontrib-htmlhelp<=2.0.4/' setup.py",
"sed -i 's/sphinxcontrib-serializinghtml/sphinxcontrib-serializinghtml<=1.1.9/' setup.py",
])
MAP_VERSION_TO_INSTALL_SPHINX["7.2"]["pre_install"] += [
"apt-get update && apt-get install -y graphviz"
]
MAP_VERSION_TO_INSTALL_ASTROPY = {
k: {
"python": "3.9",
"install": "python -m pip install -e .[test] --verbose",
"pip_packages": [
"attrs==23.1.0",
"exceptiongroup==1.1.3",
"execnet==2.0.2",
"hypothesis==6.82.6",
"iniconfig==2.0.0",
"numpy==1.25.2",
"packaging==23.1",
"pluggy==1.3.0",
"psutil==5.9.5",
"pyerfa==2.0.0.3",
"pytest-arraydiff==0.5.0",
"pytest-astropy-header==0.2.2",
"pytest-astropy==0.10.0",
"pytest-cov==4.1.0",
"pytest-doctestplus==1.0.0",
"pytest-filter-subpackage==0.1.2",
"pytest-mock==3.11.1",
"pytest-openfiles==0.5.0",
"pytest-remotedata==0.4.0",
"pytest-xdist==3.3.1",
"pytest==7.4.0",
"PyYAML==6.0.1",
"setuptools==68.0.0",
"sortedcontainers==2.4.0",
"tomli==2.0.1",
],
}
for k in ["0.1", "0.2", "0.3", "0.4", "1.1", "1.2", "1.3", "3.0", "3.1", "3.2"]
+ ["4.1", "4.2", "4.3", "5.0", "5.1", "5.2"]
}
for k in ["4.1", "4.2", "4.3", "5.0", "5.1", "5.2"]:
MAP_VERSION_TO_INSTALL_ASTROPY[k]["pre_install"] = [
'sed -i \'s/requires = \\["setuptools",/requires = \\["setuptools==68.0.0",/\' pyproject.toml'
]
MAP_VERSION_TO_INSTALL_SYMPY = {
k: {
"python": "3.9",
"packages": "mpmath flake8",
"pip_packages": ["mpmath==1.3.0", "flake8-comprehensions"],
"install": "python -m pip install -e .",
}
for k in ["0.7", "1.0", "1.1", "1.10", "1.11", "1.12", "1.2", "1.4", "1.5", "1.6"]
+ ["1.7", "1.8", "1.9"]
}
MAP_VERSION_TO_INSTALL_SYMPY.update(
{
k: {
"python": "3.9",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"pip_packages": ["mpmath==1.3.0"],
}
for k in ["1.13"]
}
)
MAP_VERSION_TO_INSTALL_PYLINT = {
k: {
"python": "3.9",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
}
for k in [
"2.10",
"2.11",
"2.13",
"2.14",
"2.15",
"2.16",
"2.17",
"2.8",
"2.9",
"3.0",
]
}
MAP_VERSION_TO_INSTALL_PYLINT["2.8"]["pip_packages"] = ["pyenchant==3.2"]
MAP_VERSION_TO_INSTALL_PYLINT["2.8"]["pre_install"] = [
"apt-get update && apt-get install -y libenchant-2-dev hunspell-en-us"
]
MAP_VERSION_TO_INSTALL_PYLINT.update(
{
k: {
**MAP_VERSION_TO_INSTALL_PYLINT[k],
"pip_packages": ["astroid==3.0.0a6", "setuptools"],
}
for k in ["3.0"]
}
)
MAP_VERSION_TO_INSTALL_XARRAY = {
k: {
"python": "3.10",
"packages": "environment.yml",
"install": "python -m pip install -e .",
"pip_packages": [
"numpy==1.23.0",
"packaging==23.1",
"pandas==1.5.3",
"pytest==7.4.0",
"python-dateutil==2.8.2",
"pytz==2023.3",
"six==1.16.0",
"scipy==1.11.1",
"setuptools==68.0.0"
],
"no_use_env": True,
}
for k in ["0.12", "0.18", "0.19", "0.20", "2022.03", "2022.06", "2022.09"]
}
MAP_VERSION_TO_INSTALL_SQLFLUFF = {
k: {
"python": "3.9",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
}
for k in [
"0.10",
"0.11",
"0.12",
"0.13",
"0.4",
"0.5",
"0.6",
"0.8",
"0.9",
"1.0",
"1.1",
"1.2",
"1.3",
"1.4",
"2.0",
"2.1",
"2.2",
]
}
MAP_VERSION_TO_INSTALL_DBT_CORE = {
k: {
"python": "3.9",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
}
for k in [
"0.13",
"0.14",
"0.15",
"0.16",
"0.17",
"0.18",
"0.19",
"0.20",
"0.21",
"1.0",
"1.1",
"1.2",
"1.3",
"1.4",
"1.5",
"1.6",
"1.7",
]
}
MAP_VERSION_TO_INSTALL_PYVISTA = {
k: {
"python": "3.9",
"install": "python -m pip install -e .",
"pip_packages": ["pytest"],
}
for k in ["0.20", "0.21", "0.22", "0.23"]
}
MAP_VERSION_TO_INSTALL_PYVISTA.update(
{
k: {
"python": "3.9",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"pip_packages": ["pytest"],
}
for k in [
"0.24",
"0.25",
"0.26",
"0.27",
"0.28",
"0.29",
"0.30",
"0.31",
"0.32",
"0.33",
"0.34",
"0.35",
"0.36",
"0.37",
"0.38",
"0.39",
"0.40",
"0.41",
"0.42",
"0.43",
]
}
)
MAP_VERSION_TO_INSTALL_ASTROID = {
k: {
"python": "3.9",
"install": "python -m pip install -e .",
"pip_packages": ["pytest"],
}
for k in [
"2.10",
"2.12",
"2.13",
"2.14",
"2.15",
"2.16",
"2.5",
"2.6",
"2.7",
"2.8",
"2.9",
"3.0",
]
}
MAP_VERSION_TO_INSTALL_MARSHMALLOW = {
k: {
"python": "3.9",
"install": "python -m pip install -e '.[dev]'",
}
for k in [
"2.18",
"2.19",
"2.20",
"3.0",
"3.1",
"3.10",
"3.11",
"3.12",
"3.13",
"3.15",
"3.16",
"3.19",
"3.2",
"3.4",
"3.8",
"3.9",
]
}
MAP_VERSION_TO_INSTALL_PVLIB = {
k: {
"python": "3.9",
"install": "python -m pip install -e .[all]",
"packages": "pandas scipy",
"pip_packages": ["jupyter", "ipython", "matplotlib", "pytest", "flake8"],
}
for k in ["0.1", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9"]
}
MAP_VERSION_TO_INSTALL_PYDICOM = {
k: {"python": "3.6", "install": "python -m pip install -e .", "packages": "numpy"}
for k in [
"1.0",
"1.1",
"1.2",
"1.3",
"1.4",
"2.0",
"2.1",
"2.2",
"2.3",
"2.4",
"3.0",
]
}
MAP_VERSION_TO_INSTALL_PYDICOM.update(
{k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], "python": "3.8"} for k in ["1.4", "2.0"]}
)
MAP_VERSION_TO_INSTALL_PYDICOM.update(
{k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], "python": "3.9"} for k in ["2.1", "2.2"]}
)
MAP_VERSION_TO_INSTALL_PYDICOM.update(
{k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], "python": "3.10"} for k in ["2.3"]}
)
MAP_VERSION_TO_INSTALL_PYDICOM.update(
{k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], "python": "3.11"} for k in ["2.4", "3.0"]}
)
MAP_VERSION_TO_INSTALL_HUMANEVAL = {k: {"python": "3.9"} for k in ["1.0"]}
MAP_VERSION_TO_INSTALL_HUMANEVAL_FIX = {k: {"python": "3.10", "packages": "pytest"} for k in ["0.0.1"]}
# Constants - Task Instance Installation Environment
MAP_VERSION_TO_INSTALL = {
"astropy/astropy": MAP_VERSION_TO_INSTALL_ASTROPY,
"dbt-labs/dbt-core": MAP_VERSION_TO_INSTALL_DBT_CORE,
"django/django": MAP_VERSION_TO_INSTALL_DJANGO,
"matplotlib/matplotlib": MAP_VERSION_TO_INSTALL_MATPLOTLIB,
"marshmallow-code/marshmallow": MAP_VERSION_TO_INSTALL_MARSHMALLOW,
"mwaskom/seaborn": MAP_VERSION_TO_INSTALL_SEABORN,
"pallets/flask": MAP_VERSION_TO_INSTALL_FLASK,
"psf/requests": MAP_VERSION_TO_INSTALL_REQUESTS,
"pvlib/pvlib-python": MAP_VERSION_TO_INSTALL_PVLIB,
"pydata/xarray": MAP_VERSION_TO_INSTALL_XARRAY,
"pydicom/pydicom": MAP_VERSION_TO_INSTALL_PYDICOM,
"pylint-dev/astroid": MAP_VERSION_TO_INSTALL_ASTROID,
"pylint-dev/pylint": MAP_VERSION_TO_INSTALL_PYLINT,
"pytest-dev/pytest": MAP_VERSION_TO_INSTALL_PYTEST,
"pyvista/pyvista": MAP_VERSION_TO_INSTALL_PYVISTA,
"scikit-learn/scikit-learn": MAP_VERSION_TO_INSTALL_SKLEARN,
"sphinx-doc/sphinx": MAP_VERSION_TO_INSTALL_SPHINX,
"sqlfluff/sqlfluff": MAP_VERSION_TO_INSTALL_SQLFLUFF,
"swe-bench/humaneval": MAP_VERSION_TO_INSTALL_HUMANEVAL,
"nielstron/humaneval_fix": MAP_VERSION_TO_INSTALL_HUMANEVAL_FIX,
"sympy/sympy": MAP_VERSION_TO_INSTALL_SYMPY,
}
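# Example lookup (illustrative): consumers such as run_infer.py resolve the per-instance
# install spec via
#   install = MAP_VERSION_TO_INSTALL.get(instance['repo'], {}).get(instance['version'], {})
# e.g. MAP_VERSION_TO_INSTALL['django/django']['4.0'] yields
#   {'python': '3.8', 'packages': 'requirements.txt', 'install': 'python -m pip install -e .'}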
# Constants - Repository Specific Installation Instructions
MAP_REPO_TO_INSTALL = {}
# Constants - Task Instance Test Frameworks
TEST_PYTEST_VERBOSE = "pytest -rA --tb=long -p no:cacheprovider"
MAP_REPO_TO_TEST_FRAMEWORK_VERBOSE = {
"astropy/astropy": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_ASTROPY.keys()
},
"django/django": {
k: "./tests/runtests.py --verbosity 2 --settings=test_sqlite --parallel 1"
for k in MAP_VERSION_TO_INSTALL_DJANGO.keys()
},
"marshmallow-code/marshmallow": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_MARSHMALLOW.keys()
},
"matplotlib/matplotlib": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_MATPLOTLIB.keys()
},
"mwaskom/seaborn": {
k: "pytest -rA --tb=long" for k in MAP_VERSION_TO_INSTALL_SEABORN.keys()
},
"pallets/flask": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_FLASK.keys()
},
"psf/requests": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_REQUESTS.keys()
},
"pvlib/pvlib-python": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PVLIB.keys()
},
"pydata/xarray": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_XARRAY.keys()
},
"pydicom/pydicom": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PYDICOM.keys()
},
"pylint-dev/astroid": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_ASTROID.keys()
},
"pylint-dev/pylint": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PYLINT.keys()
},
"pytest-dev/pytest": {
k: "pytest -rA --tb=long" for k in MAP_VERSION_TO_INSTALL_PYTEST.keys()
},
"pyvista/pyvista": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PYVISTA.keys()
},
"scikit-learn/scikit-learn": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_SKLEARN.keys()
},
"sphinx-doc/sphinx": {
k: "tox -epy39 -v --" for k in MAP_VERSION_TO_INSTALL_SPHINX.keys()
},
"sqlfluff/sqlfluff": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_SQLFLUFF.keys()
},
"swe-bench/humaneval": {
k: "python" for k in MAP_VERSION_TO_INSTALL_HUMANEVAL.keys()
},
"nielstron/humaneval_fix": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_HUMANEVAL_FIX.keys()
},
"sympy/sympy": {
k: "bin/test -C --verbose" for k in MAP_VERSION_TO_INSTALL_SYMPY.keys()
},
}
MAP_REPO_TO_TEST_FRAMEWORK["django/django"]["1.9"] = "./tests/runtests.py --verbosity 2"

View File

@@ -3,13 +3,18 @@ import copy
import json
import os
import tempfile
from typing import Any
from typing import Any, Literal
import pandas as pd
import toml
from datasets import load_dataset
import openhands.agenthub
from evaluation.benchmarks.swe_bench.resource.swt_bench_constants import (
MAP_REPO_TO_TEST_FRAMEWORK_VERBOSE,
MAP_REPO_TO_INSTALL,
MAP_VERSION_TO_INSTALL
)
from evaluation.benchmarks.swe_bench.binary_patch_utils import (
remove_binary_diffs,
remove_binary_files_from_git,
@@ -55,6 +60,7 @@ from openhands.utils.shutdown_listener import sleep_if_should_continue
USE_HINT_TEXT = os.environ.get('USE_HINT_TEXT', 'false').lower() == 'true'
RUN_WITH_BROWSING = os.environ.get('RUN_WITH_BROWSING', 'false').lower() == 'true'
BenchMode = Literal["swe", "swt", "swt-ci"]
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
@@ -68,7 +74,32 @@ def _get_swebench_workspace_dir_name(instance: pd.Series) -> str:
def get_instruction(instance: pd.Series, metadata: EvalMetadata) -> MessageAction:
workspace_dir_name = _get_swebench_workspace_dir_name(instance)
instruction = f"""
mode = metadata.details["mode"]
if mode.startswith('swt'):
test_instructions = f"The following command can be used to run the tests: `{list(MAP_REPO_TO_TEST_FRAMEWORK_VERBOSE[instance.repo].values())[0]}`. Make sure they fail in the expected way.\n" if mode.endswith("ci") else ""
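# For example, for django/django instances the verbose test command resolves to
# "./tests/runtests.py --verbosity 2 --settings=test_sqlite --parallel 1" (see swt_bench_constants.py); only used in swt-ci mode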
instruction = f"""\
<uploaded_files>
/workspace/{workspace_dir_name}
</uploaded_files>
I've uploaded a python code repository in the directory {workspace_dir_name}. Consider the following issue description:
<issue_description>
{instance.problem_statement}
</issue_description>
Can you help me implement the necessary changes to the repository to test whether the issue in <issue_description> was resolved?
I will take care of all changes to any of the non-test files. This means you DON'T have to modify the actual logic and ONLY have to update test logic and tests!
Your task is to make minimal changes to the test files in the /workspace directory to reproduce the issue in the <issue_description>, i.e., such that the generated tests fail in the current state (where the issue is unresolved) and pass once the issue is resolved.
Follow these steps to reproduce the issue:
1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure.
2. Create a script `reproduction.py` to reproduce the error and execute it with `python reproduction.py` using the BashTool, to confirm the error
3. Edit the source code of the repo to integrate your reproduction script into the test framework
4. Run the test framework and make sure your tests fail! Only submit FAILING tests! Never submit passing tests.
{test_instructions}Your thinking should be thorough and so it's fine if it's very long.
"""
else:
instruction = f"""
<uploaded_files>
/workspace/{workspace_dir_name}
</uploaded_files>
@@ -356,6 +387,29 @@ def initialize_runtime(
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert_and_raise(obs.exit_code == 0, f'Failed to remove git remotes: {str(obs)}')
if metadata.details["mode"] == "swt-ci":
# set up repo
setup_commands = []
if instance["repo"] in MAP_REPO_TO_INSTALL:
setup_commands.append(MAP_REPO_TO_INSTALL[instance["repo"]])
# Run pre-install set up if provided
install = MAP_VERSION_TO_INSTALL.get(instance['repo'], {}).get(instance['version'], [])
if "pre_install" in install:
for pre_install in install["pre_install"]:
setup_commands.append(pre_install)
if "install" in install:
setup_commands.append(install["install"])
for command in setup_commands:
action = CmdRunAction(command=command)
action.set_hard_timeout(600)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
if 'multimodal' not in metadata.dataset.lower():
# Only for non-multimodal datasets, we need to activate the testbed environment for Python
# SWE-Bench multimodal datasets are not using the testbed environment
@@ -678,6 +732,13 @@ if __name__ == '__main__':
default='test',
help='split to evaluate on',
)
parser.add_argument(
'--mode',
type=str,
default='swe',
choices=['swe', 'swt', 'swt-ci'],
help="mode to run the evaluation, either 'swe', 'swt', or 'swt-ci'",
)
args, _ = parser.parse_known_args()
# NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
@@ -714,7 +775,7 @@ if __name__ == '__main__':
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
details = {}
details = {"mode": args.mode}
_agent_cls = openhands.agenthub.Agent.get_cls(args.agent_cls)
dataset_descrption = (

View File

@@ -12,6 +12,7 @@ NUM_WORKERS=$6
DATASET=$7
SPLIT=$8
N_RUNS=$9
MODE=${10}
if [ -z "$NUM_WORKERS" ]; then
NUM_WORKERS=1
@@ -45,6 +46,11 @@ if [ -z "$SPLIT" ]; then
SPLIT="test"
fi
if [ -z "$MODE" ]; then
MODE="swe"
echo "MODE not specified, use default $MODE"
fi
export RUN_WITH_BROWSING=$RUN_WITH_BROWSING
echo "RUN_WITH_BROWSING: $RUN_WITH_BROWSING"
@@ -55,6 +61,10 @@ echo "OPENHANDS_VERSION: $OPENHANDS_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "DATASET: $DATASET"
echo "SPLIT: $SPLIT"
echo "MAX_ITER: $MAX_ITER"
echo "NUM_WORKERS: $NUM_WORKERS"
echo "COMMIT_HASH: $COMMIT_HASH"
echo "MODE: $MODE"
# Default to NOT use Hint
if [ -z "$USE_HINT_TEXT" ]; then
@@ -74,9 +84,13 @@ fi
if [ -n "$EXP_NAME" ]; then
EVAL_NOTE="$EVAL_NOTE-$EXP_NAME"
fi
# if mode != swe, add mode to the eval note
if [ "$MODE" != "swe" ]; then
EVAL_NOTE="${EVAL_NOTE}-${MODE}"
fi
function run_eval() {
local eval_note=$1
local eval_note="${1}"
COMMAND="poetry run python evaluation/benchmarks/swe_bench/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
@@ -84,7 +98,8 @@ function run_eval() {
--eval-num-workers $NUM_WORKERS \
--eval-note $eval_note \
--dataset $DATASET \
--split $SPLIT"
--split $SPLIT \
--mode $MODE"
if [ -n "$EVAL_LIMIT" ]; then
echo "EVAL_LIMIT: $EVAL_LIMIT"

View File

@@ -0,0 +1,73 @@
import json
import argparse
import logging
import unidiff
from evaluation.benchmarks.swe_bench.resource.swt_bench_constants import MAP_VERSION_TO_INSTALL
_LOGGER = logging.getLogger(__name__)
def remove_setup_files(model_patch: str, instance: dict, delete_setup_changes: bool):
""" Discard all changes that a patch applies to files changes by the pre_install script and that are reproduction scripts (top-level script)"""
setup_files = ["setup.py", "tox.ini", "pyproject.toml"]
pre_install = MAP_VERSION_TO_INSTALL.get(instance["repo"], {}).get(instance["version"], {}).get("pre_install", [])
relevant_files = [
file
for file in setup_files
if any(file in install and "sed" in install for install in pre_install)
] if delete_setup_changes else []
for i in range(10):
try:
# Apparently outputs.jsonl has .strip() applied, so we try to reconstruct the original patch by appending trailing newlines
patch = unidiff.PatchSet(model_patch + i*"\n")
break
except unidiff.UnidiffParseError as e:
pass
to_delete = []
for i, file in enumerate(patch):
if any(f in file.source_file for f in relevant_files) or file.target_file.count("/") == 1:
to_delete.append(i)
for i in reversed(to_delete):
del patch[i]
return str(patch)
def main(
prediction_file: str,
):
"""Main function to extract the model patches from the OpenHands prediction file and turn them into the expected SWT-Bench format."""
with open(prediction_file) as f:
for line in f:
pred = json.loads(line)
try:
git_diff = pred["test_result"]["git_patch"]
except KeyError:
_LOGGER.warning("Warning: No git diff found for instance %s", pred["instance_id"])
continue
ci_mode = pred["metadata"]["details"].get("mode", "") == "swt-ci"
try:
git_diff = remove_setup_files(git_diff, pred["instance"], ci_mode)
except Exception:
_LOGGER.warning("Warning: Invalid git diff found for instance %s", pred["instance_id"])
print(json.dumps({
"instance_id": pred["instance_id"],
"model_name_or_path": f'{pred["metadata"]["llm_config"]["openrouter_app_name"]}__{pred["metadata"]["agent_class"]}__{pred["metadata"]["llm_config"]["model"]}',
"model_patch": git_diff,
"full_output": json.dumps(pred),
}))
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
"--prediction_file",
type=str,
required=True,
help="Path to the prediction file (.../outputs.jsonl)",
)
args = parser.parse_args()
main(args.prediction_file)