mirror of
https://github.com/All-Hands-AI/OpenHands.git
synced 2026-01-09 14:57:59 -05:00
Add inference for SWT-Bench (#7201)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev> Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Calvin Smith <email@cjsmith.io>
@@ -2,6 +2,8 @@
|
||||
|
||||
This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)).
|
||||
|
||||
**UPDATE (4/8/2025): We now support running SWT-Bench evaluation! For more details, check out [the corresponding section](#SWT-Bench-Evaluation).**
|
||||
|
||||
**UPDATE (03/27/2025): We now support SWE-Bench multimodal evaluation! Simply use "princeton-nlp/SWE-bench_Multimodal" as the dataset name in the `run_infer.sh` script to evaluate on multimodal instances.**
|
||||
|
||||
**UPDATE (2/18/2025): We now support running SWE-Gym using the same evaluation harness here. For more details, check out [this README](./SWE-Gym.md).**
|
||||
@@ -141,7 +143,7 @@ With `output.jsonl` file, you can run `eval_infer.sh` to evaluate generated patc
|
||||
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh $YOUR_OUTPUT_JSONL [instance_id] [dataset_name] [split]
|
||||
|
||||
# Example
|
||||
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
|
||||
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Lite/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
|
||||
```
|
||||
|
||||
The script now accepts optional arguments:
|
||||
@@ -182,3 +184,58 @@ To clean-up all existing runtimes that you've already started, run:
|
||||
```bash
|
||||
ALLHANDS_API_KEY="YOUR-API-KEY" ./evaluation/utils/scripts/cleanup_remote_runtime.sh
|
||||
```
|
||||
|
||||
## SWT-Bench Evaluation
|
||||
|
||||
[SWT-Bench](https://swtbench.com/) ([paper](https://arxiv.org/abs/2406.12952)) is a benchmark for evaluating how well LLMs create unit tests. It uses the same instances as SWE-Bench, but requires a separate evaluation harness to capture coverage and issue reproduction. We therefore detail below how to use the inference script in this folder to run inference on SWT-Bench and how to use the SWT-Bench evaluation harness to evaluate the results.
|
||||
|
||||
### Run inference on SWT-Bench
|
||||
|
||||
To run inference on SWT-Bench, you can use the same `run_infer.sh` script as described for evaluation on plain SWE-Bench. The only difference is that you need to set the `mode` parameter to `swt` or `swt-ci` when running the script. For example, to run inference on SWT-Bench Verified, run the following command:
|
||||
|
||||
```bash
|
||||
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [swe-dataset] test 1 swt
|
||||
|
||||
# Example - This runs CodeActAgent on 500 instances of the "SWT-bench_Verified" test set (corresponding to SWE-bench_Verified), with at most 100 iterations per instance and 1 worker running in parallel
|
||||
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.eval_gpt4o-2024-11-20 HEAD CodeActAgent 500 100 1 princeton-nlp/SWE-bench_Verified test 1 swt
|
||||
```
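
The positional arguments above are order-sensitive, so it can help to assemble the command from named values. The following is only a sketch: the helper `build_run_infer_command` is hypothetical, and the names `split` and `n_runs` are labels assumed here for the eighth and ninth positions (`test` and `1` in the example), not names confirmed by the script. The accepted mode values mirror the `BenchMode` literal introduced in this commit.

```python
import shlex

RUN_INFER = "./evaluation/benchmarks/swe_bench/scripts/run_infer.sh"
VALID_MODES = ("swe", "swt", "swt-ci")  # mirrors BenchMode in run_infer.py


def build_run_infer_command(
    model_config: str,
    git_version: str,
    agent: str,
    eval_limit: int,
    max_iter: int,
    num_workers: int,
    dataset: str,
    split: str = "test",  # assumed label for the 8th positional argument
    n_runs: int = 1,      # assumed label for the 9th positional argument
    mode: str = "swt",
) -> str:
    """Return the shell command line for run_infer.sh as a single string."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    args = [
        RUN_INFER,
        model_config,
        git_version,
        agent,
        str(eval_limit),
        str(max_iter),
        str(num_workers),
        dataset,
        split,
        str(n_runs),
        mode,
    ]
    # shlex.join quotes any argument that needs it for the shell
    return shlex.join(args)


command = build_run_infer_command(
    "llm.eval_gpt4o-2024-11-20",
    "HEAD",
    "CodeActAgent",
    500,
    100,
    1,
    "princeton-nlp/SWE-bench_Verified",
)
print(command)
```

Printing the command before running it makes it easy to confirm that `swt` or `swt-ci` ends up in the last position.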
|
||||
|
||||
The two modes `swt` and `swt-ci` have the following effect:
|
||||
- `swt`: This mode will change the prompt to instruct the agent to generate reproducing test cases instead of resolving the issue.
|
||||
- `swt-ci`: In addition to the changes by `swt`, this mode sets up the CI environment by i) pre-installing the environment in the docker image, such that the test framework can be executed without errors and ii) telling the model the exact command to run the test framework.
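
As a rough illustration of how the two modes differ on the prompt side, the sketch below reproduces the mode check used by the inference script in this commit: the extra sentence naming the test command is added only when the mode ends in `ci`. The function name `build_test_hint` is hypothetical, and the Docker pre-installation step of `swt-ci` is not modelled here.

```python
def build_test_hint(mode: str, test_command: str) -> str:
    """Return the extra prompt sentence for `swt-ci`, or an empty string otherwise."""
    if mode.startswith("swt") and mode.endswith("ci"):
        return (
            f"The following command can be used to run the tests: `{test_command}`. "
            "Make sure they fail in the expected way.\n"
        )
    # Plain `swt` (and `swe`) get no test-command hint
    return ""


print(repr(build_test_hint("swt", "pytest -rA --tb=long -p no:cacheprovider")))
print(repr(build_test_hint("swt-ci", "pytest -rA --tb=long -p no:cacheprovider")))
```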
|
||||
|
||||
### Run evaluation for SWT-Bench
|
||||
|
||||
These results are evaluated with [the SWT-Bench evaluation harness](https://github.com/logic-star-ai/swt-bench/tree/master).
|
||||
|
||||
#### Extracting results into SWT-Bench harness format
|
||||
To evaluate the inference results with the SWT-Bench harness, we first convert them into the format that the harness expects.
|
||||
|
||||
```bash
|
||||
python3 evaluation/benchmarks/swe_bench/scripts/swtbench/convert.py --prediction_file [output.jsonl] > [output_swt.jsonl]
|
||||
|
||||
# Example
|
||||
python3 evaluation/benchmarks/swe_bench/scripts/swtbench/convert.py --prediction_file "evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Verified-test/CodeActAgent/gpt-4o-2024-11-20_maxiter_100_N_v0.31.0-no-hint-swt-run_1/output.jsonl" > OpenHands-gpt-4o-2024-11-20.jsonl
|
||||
```
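
To give an idea of what such a conversion involves, here is a minimal sketch that maps JSONL inference records to JSONL prediction records. It is not the code of `convert.py`: the input field `test_result.git_patch` and the output fields `model_name_or_path` and `model_patch` are assumptions made for illustration, as are the helper names and the sample record.

```python
import json
from typing import Iterable


def to_prediction(record: dict, model_name: str) -> dict:
    """Map one inference record to a prediction record (field names are assumed)."""
    patch = record.get("test_result", {}).get("git_patch", "")
    return {
        "instance_id": record["instance_id"],
        "model_name_or_path": model_name,
        "model_patch": patch,
    }


def convert_lines(lines: Iterable[str], model_name: str) -> list[str]:
    """Convert JSONL lines one by one, skipping blank lines."""
    return [
        json.dumps(to_prediction(json.loads(line), model_name))
        for line in lines
        if line.strip()
    ]


# Hypothetical sample record, for illustration only
sample = json.dumps(
    {
        "instance_id": "example__repo-1",
        "test_result": {"git_patch": "diff --git a/tests/test_x.py b/tests/test_x.py\n"},
    }
)
converted = convert_lines([sample, ""], "OpenHands-example")
print(converted[0])
```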
|
||||
|
||||
#### Running the results in SWT-Bench
|
||||
|
||||
Next, we run the [SWT-Bench evaluation harness](https://github.com/logic-star-ai/swt-bench/tree/master) with these results.
|
||||
First, set up the harness and validate the setup as described [here](https://github.com/logic-star-ai/swt-bench/tree/master?tab=readme-ov-file#-set-up).
|
||||
Then, run the evaluation with the following command:
|
||||
|
||||
```bash
|
||||
# Example
|
||||
python3 -m src.main \
|
||||
--dataset_name princeton-nlp/SWE-bench_Verified \
|
||||
--predictions_path <pathTo>/OpenHands-gpt-4o-2024-11-20.jsonl \
|
||||
--max_workers 12 \
|
||||
--run_id OpenHands-CodeAct-gpt-4o-2024-11-20 --patch_types vanilla --build_mode api
|
||||
```
|
||||
|
||||
The results of the evaluation can be obtained by running the reporting script of the harness.
|
||||
|
||||
```bash
|
||||
# Example
|
||||
python -m src.report run_instance_swt_logs/OpenHands-CodeAct-gpt-4o-2024-11-20/OpenHands__CodeActAgent__gpt-4o-2024-11-20 --dataset verified
|
||||
```
|
||||
832
evaluation/benchmarks/swe_bench/resource/swt_bench_constants.py
Normal file
@@ -0,0 +1,832 @@
|
||||
# Based on https://github.com/logic-star-ai/swt-bench/blob/master/src/constants.py
|
||||
|
||||
# Constants - Installation Specifications
|
||||
MAP_VERSION_TO_INSTALL_SKLEARN = {
|
||||
k: {
|
||||
"python": "3.6",
|
||||
"packages": "numpy scipy cython pytest pandas matplotlib",
|
||||
"install": "python -m pip install -v --no-use-pep517 --no-build-isolation -e .",
|
||||
"pip_packages": [
|
||||
"cython",
|
||||
"numpy==1.19.2",
|
||||
"setuptools",
|
||||
"scipy==1.5.2",
|
||||
],
|
||||
}
|
||||
for k in ["0.20", "0.21", "0.22"]
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_SKLEARN.update(
|
||||
{
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"packages": "'numpy==1.19.2' 'scipy==1.5.2' 'cython==3.0.10' pytest 'pandas<2.0.0' 'matplotlib<3.9.0' setuptools pytest joblib threadpoolctl",
|
||||
"install": "python -m pip install -v --no-use-pep517 --no-build-isolation -e .",
|
||||
"pip_packages": ["cython", "setuptools", "numpy", "scipy"],
|
||||
}
|
||||
for k in ["1.3", "1.4"]
|
||||
}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_FLASK = {
|
||||
"2.0": {
|
||||
"python": "3.9",
|
||||
"packages": "requirements.txt",
|
||||
"install": "python -m pip install -e .",
|
||||
"pip_packages": [
|
||||
"setuptools==70.0.0",
|
||||
"Werkzeug==2.3.7",
|
||||
"Jinja2==3.0.1",
|
||||
"itsdangerous==2.1.2",
|
||||
"click==8.0.1",
|
||||
"MarkupSafe==2.1.3",
|
||||
],
|
||||
},
|
||||
"2.1": {
|
||||
"python": "3.10",
|
||||
"packages": "requirements.txt",
|
||||
"install": "python -m pip install -e .",
|
||||
"pip_packages": [
|
||||
"click==8.1.3",
|
||||
"itsdangerous==2.1.2",
|
||||
"Jinja2==3.1.2",
|
||||
"MarkupSafe==2.1.1",
|
||||
"Werkzeug==2.3.7",
|
||||
],
|
||||
},
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_FLASK.update(
|
||||
{
|
||||
k: {
|
||||
"python": "3.11",
|
||||
"packages": "requirements.txt",
|
||||
"install": "python -m pip install -e .",
|
||||
"pip_packages": [
|
||||
"click==8.1.3",
|
||||
"itsdangerous==2.1.2",
|
||||
"Jinja2==3.1.2",
|
||||
"MarkupSafe==2.1.1",
|
||||
"Werkzeug==2.3.7",
|
||||
],
|
||||
}
|
||||
for k in ["2.2", "2.3"]
|
||||
}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_DJANGO = {
|
||||
k: {
|
||||
"python": "3.5",
|
||||
"packages": "requirements.txt",
|
||||
"pre_install": [
|
||||
"apt-get update && apt-get install -y locales",
|
||||
"echo 'en_US UTF-8' > /etc/locale.gen",
|
||||
"locale-gen en_US.UTF-8",
|
||||
],
|
||||
"install": "python setup.py install",
|
||||
"pip_packages": ["setuptools"],
|
||||
"eval_commands": [
|
||||
"export LANG=en_US.UTF-8",
|
||||
"export LC_ALL=en_US.UTF-8",
|
||||
"export PYTHONIOENCODING=utf8",
|
||||
"export LANGUAGE=en_US:en",
|
||||
],
|
||||
}
|
||||
for k in ["1.7", "1.8", "1.9", "1.10", "1.11", "2.0", "2.1", "2.2"]
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_DJANGO.update(
|
||||
{
|
||||
k: {"python": "3.5", "install": "python setup.py install"}
|
||||
for k in ["1.4", "1.5", "1.6"]
|
||||
}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_DJANGO.update(
|
||||
{
|
||||
k: {
|
||||
"python": "3.6",
|
||||
"packages": "requirements.txt",
|
||||
"install": "python -m pip install -e .",
|
||||
"eval_commands": [
|
||||
"sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && locale-gen",
|
||||
"export LANG=en_US.UTF-8",
|
||||
"export LANGUAGE=en_US:en",
|
||||
"export LC_ALL=en_US.UTF-8",
|
||||
],
|
||||
}
|
||||
for k in ["3.0", "3.1", "3.2"]
|
||||
}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_DJANGO.update(
|
||||
{
|
||||
k: {
|
||||
"python": "3.8",
|
||||
"packages": "requirements.txt",
|
||||
"install": "python -m pip install -e .",
|
||||
}
|
||||
for k in ["4.0"]
|
||||
}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_DJANGO.update(
|
||||
{
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"packages": "requirements.txt",
|
||||
"install": "python -m pip install -e .",
|
||||
}
|
||||
for k in ["4.1", "4.2"]
|
||||
}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_DJANGO.update(
|
||||
{
|
||||
k: {
|
||||
"python": "3.11",
|
||||
"packages": "requirements.txt",
|
||||
"install": "python -m pip install -e .",
|
||||
}
|
||||
for k in ["5.0"]
|
||||
}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_REQUESTS = {
|
||||
k: {"python": "3.9", "packages": "pytest", "install": "python -m pip install ."}
|
||||
for k in ["0.7", "0.8", "0.9", "0.11", "0.13", "0.14", "1.1", "1.2", "2.0", "2.2"]
|
||||
+ ["2.3", "2.4", "2.5", "2.7", "2.8", "2.9", "2.10", "2.11", "2.12", "2.17"]
|
||||
+ ["2.18", "2.19", "2.22", "2.26", "2.25", "2.27", "3.0"]
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_SEABORN = {
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"install": "python -m pip install -e .",
|
||||
"pip_packages": [
|
||||
"contourpy==1.1.0",
|
||||
"cycler==0.11.0",
|
||||
"fonttools==4.42.1",
|
||||
"importlib-resources==6.0.1",
|
||||
"kiwisolver==1.4.5",
|
||||
"matplotlib==3.7.2",
|
||||
"numpy==1.25.2",
|
||||
"packaging==23.1",
|
||||
"pandas==1.3.5", # 2.0.3
|
||||
"pillow==10.0.0",
|
||||
"pyparsing==3.0.9",
|
||||
"pytest",
|
||||
"python-dateutil==2.8.2",
|
||||
"pytz==2023.3.post1",
|
||||
"scipy==1.11.2",
|
||||
"six==1.16.0",
|
||||
"tzdata==2023.1",
|
||||
"zipp==3.16.2",
|
||||
],
|
||||
}
|
||||
for k in ["0.11"]
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_SEABORN.update(
|
||||
{
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"install": "python -m pip install -e .[dev]",
|
||||
"pip_packages": [
|
||||
"contourpy==1.1.0",
|
||||
"cycler==0.11.0",
|
||||
"fonttools==4.42.1",
|
||||
"importlib-resources==6.0.1",
|
||||
"kiwisolver==1.4.5",
|
||||
"matplotlib==3.7.2",
|
||||
"numpy==1.25.2",
|
||||
"packaging==23.1",
|
||||
"pandas==2.0.0",
|
||||
"pillow==10.0.0",
|
||||
"pyparsing==3.0.9",
|
||||
"pytest",
|
||||
"python-dateutil==2.8.2",
|
||||
"pytz==2023.3.post1",
|
||||
"scipy==1.11.2",
|
||||
"six==1.16.0",
|
||||
"tzdata==2023.1",
|
||||
"zipp==3.16.2",
|
||||
],
|
||||
}
|
||||
for k in ["0.12", "0.13"]
|
||||
}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_PYTEST = {
|
||||
k: {"python": "3.9", "install": "python -m pip install -e ."}
|
||||
for k in [
|
||||
"4.4",
|
||||
"4.5",
|
||||
"4.6",
|
||||
"5.0",
|
||||
"5.1",
|
||||
"5.2",
|
||||
"5.3",
|
||||
"5.4",
|
||||
"6.0",
|
||||
"6.2",
|
||||
"6.3",
|
||||
"7.0",
|
||||
"7.1",
|
||||
"7.2",
|
||||
"7.4",
|
||||
"8.0",
|
||||
]
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_PYTEST["4.4"]["pip_packages"] = [
|
||||
"atomicwrites==1.4.1",
|
||||
"attrs==23.1.0",
|
||||
"more-itertools==10.1.0",
|
||||
"pluggy==0.13.1",
|
||||
"py==1.11.0",
|
||||
"setuptools==68.0.0",
|
||||
"six==1.16.0",
|
||||
]
|
||||
MAP_VERSION_TO_INSTALL_PYTEST["4.5"]["pip_packages"] = [
|
||||
"atomicwrites==1.4.1",
|
||||
"attrs==23.1.0",
|
||||
"more-itertools==10.1.0",
|
||||
"pluggy==0.11.0",
|
||||
"py==1.11.0",
|
||||
"setuptools==68.0.0",
|
||||
"six==1.16.0",
|
||||
"wcwidth==0.2.6",
|
||||
]
|
||||
MAP_VERSION_TO_INSTALL_PYTEST["4.6"]["pip_packages"] = [
|
||||
"atomicwrites==1.4.1",
|
||||
"attrs==23.1.0",
|
||||
"more-itertools==10.1.0",
|
||||
"packaging==23.1",
|
||||
"pluggy==0.13.1",
|
||||
"py==1.11.0",
|
||||
"six==1.16.0",
|
||||
"wcwidth==0.2.6",
|
||||
]
|
||||
for k in ["5.0", "5.1", "5.2"]:
|
||||
MAP_VERSION_TO_INSTALL_PYTEST[k]["pip_packages"] = [
|
||||
"atomicwrites==1.4.1",
|
||||
"attrs==23.1.0",
|
||||
"more-itertools==10.1.0",
|
||||
"packaging==23.1",
|
||||
"pluggy==0.13.1",
|
||||
"py==1.11.0",
|
||||
"wcwidth==0.2.6",
|
||||
]
|
||||
MAP_VERSION_TO_INSTALL_PYTEST["5.3"]["pip_packages"] = [
|
||||
"attrs==23.1.0",
|
||||
"more-itertools==10.1.0",
|
||||
"packaging==23.1",
|
||||
"pluggy==0.13.1",
|
||||
"py==1.11.0",
|
||||
"wcwidth==0.2.6",
|
||||
]
|
||||
MAP_VERSION_TO_INSTALL_PYTEST["5.4"]["pip_packages"] = [
|
||||
"py==1.11.0",
|
||||
"packaging==23.1",
|
||||
"attrs==23.1.0",
|
||||
"more-itertools==10.1.0",
|
||||
"pluggy==0.13.1",
|
||||
]
|
||||
MAP_VERSION_TO_INSTALL_PYTEST["6.0"]["pip_packages"] = [
|
||||
"attrs==23.1.0",
|
||||
"iniconfig==2.0.0",
|
||||
"more-itertools==10.1.0",
|
||||
"packaging==23.1",
|
||||
"pluggy==0.13.1",
|
||||
"py==1.11.0",
|
||||
"toml==0.10.2",
|
||||
]
|
||||
for k in ["6.2", "6.3"]:
|
||||
MAP_VERSION_TO_INSTALL_PYTEST[k]["pip_packages"] = [
|
||||
"attrs==23.1.0",
|
||||
"iniconfig==2.0.0",
|
||||
"packaging==23.1",
|
||||
"pluggy==0.13.1",
|
||||
"py==1.11.0",
|
||||
"toml==0.10.2",
|
||||
]
|
||||
MAP_VERSION_TO_INSTALL_PYTEST["7.0"]["pip_packages"] = [
|
||||
"attrs==23.1.0",
|
||||
"iniconfig==2.0.0",
|
||||
"packaging==23.1",
|
||||
"pluggy==0.13.1",
|
||||
"py==1.11.0",
|
||||
]
|
||||
for k in ["7.1", "7.2"]:
|
||||
MAP_VERSION_TO_INSTALL_PYTEST[k]["pip_packages"] = [
|
||||
"attrs==23.1.0",
|
||||
"iniconfig==2.0.0",
|
||||
"packaging==23.1",
|
||||
"pluggy==0.13.1",
|
||||
"py==1.11.0",
|
||||
"tomli==2.0.1",
|
||||
]
|
||||
MAP_VERSION_TO_INSTALL_PYTEST["7.4"]["pip_packages"] = [
|
||||
"iniconfig==2.0.0",
|
||||
"packaging==23.1",
|
||||
"pluggy==1.3.0",
|
||||
"exceptiongroup==1.1.3",
|
||||
"tomli==2.0.1",
|
||||
]
|
||||
MAP_VERSION_TO_INSTALL_PYTEST["8.0"]["pip_packages"] = [
|
||||
"iniconfig==2.0.0",
|
||||
"packaging==23.1",
|
||||
"pluggy==1.3.0",
|
||||
"exceptiongroup==1.1.3",
|
||||
"tomli==2.0.1",
|
||||
]
|
||||
MAP_VERSION_TO_INSTALL_MATPLOTLIB = {
|
||||
k: {
|
||||
"python": "3.11",
|
||||
"packages": "environment.yml",
|
||||
"install": "python -m pip install -e .",
|
||||
"pre_install": [
|
||||
"apt-get -y update && apt-get -y upgrade && apt-get install -y imagemagick ffmpeg texlive texlive-latex-extra texlive-fonts-recommended texlive-xetex texlive-luatex cm-super dvipng"
|
||||
],
|
||||
"pip_packages": [
|
||||
"contourpy==1.1.0",
|
||||
"cycler==0.11.0",
|
||||
"fonttools==4.42.1",
|
||||
"ghostscript",
|
||||
"kiwisolver==1.4.5",
|
||||
"numpy==1.25.2",
|
||||
"packaging==23.1",
|
||||
"pillow==10.0.0",
|
||||
"pikepdf",
|
||||
"pyparsing==3.0.9",
|
||||
"python-dateutil==2.8.2",
|
||||
"six==1.16.0",
|
||||
"setuptools==68.1.2",
|
||||
"setuptools-scm==7.1.0",
|
||||
"typing-extensions==4.7.1",
|
||||
],
|
||||
}
|
||||
for k in ["3.5", "3.6", "3.7"]
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_MATPLOTLIB.update(
|
||||
{
|
||||
k: {
|
||||
"python": "3.8",
|
||||
"packages": "requirements.txt",
|
||||
"install": "python -m pip install -e .",
|
||||
"pre_install": [
|
||||
"apt-get -y update && apt-get -y upgrade && apt-get install -y imagemagick ffmpeg libfreetype6-dev pkg-config texlive texlive-latex-extra texlive-fonts-recommended texlive-xetex texlive-luatex cm-super"
|
||||
],
|
||||
"pip_packages": ["pytest", "ipython"],
|
||||
}
|
||||
for k in ["3.1", "3.2", "3.3", "3.4"]
|
||||
}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_MATPLOTLIB.update(
|
||||
{
|
||||
k: {
|
||||
"python": "3.7",
|
||||
"packages": "requirements.txt",
|
||||
"install": "python -m pip install -e .",
|
||||
"pre_install": [
|
||||
"apt-get -y update && apt-get -y upgrade && apt-get install -y imagemagick ffmpeg libfreetype6-dev pkg-config"
|
||||
],
|
||||
"pip_packages": ["pytest"],
|
||||
}
|
||||
for k in ["3.0"]
|
||||
}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_MATPLOTLIB.update(
|
||||
{
|
||||
k: {
|
||||
"python": "3.5",
|
||||
"install": "python setup.py build; python setup.py install",
|
||||
"pre_install": [
|
||||
"apt-get -y update && apt-get -y upgrade && apt-get install -y imagemagick ffmpeg"
|
||||
],
|
||||
"pip_packages": ["pytest"],
|
||||
"execute_test_as_nonroot": True,
|
||||
}
|
||||
for k in ["2.0", "2.1", "2.2", "1.0", "1.1", "1.2", "1.3", "1.4", "1.5"]
|
||||
}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_SPHINX = {
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"pip_packages": ["tox==4.16.0", "tox-current-env==0.0.11"],
|
||||
"install": "python -m pip install -e .[test]",
|
||||
"pre_install": ["sed -i 's/pytest/pytest -rA/' tox.ini"],
|
||||
}
|
||||
for k in ["1.5", "1.6", "1.7", "1.8", "2.0", "2.1", "2.2", "2.3", "2.4", "3.0"]
|
||||
+ ["3.1", "3.2", "3.3", "3.4", "3.5", "4.0", "4.1", "4.2", "4.3", "4.4"]
|
||||
+ ["4.5", "5.0", "5.1", "5.2", "5.3", "6.0", "6.2", "7.0", "7.1", "7.2"]
|
||||
}
|
||||
for k in ["3.0", "3.1", "3.2", "3.3", "3.4", "3.5", "4.0", "4.1", "4.2", "4.3", "4.4"]:
|
||||
MAP_VERSION_TO_INSTALL_SPHINX[k][
|
||||
"pre_install"
|
||||
].extend([
|
||||
"sed -i 's/Jinja2>=2.3/Jinja2<3.0/' setup.py",
|
||||
"sed -i 's/sphinxcontrib-applehelp/sphinxcontrib-applehelp<=1.0.7/' setup.py",
|
||||
"sed -i 's/sphinxcontrib-devhelp/sphinxcontrib-devhelp<=1.0.5/' setup.py",
|
||||
"sed -i 's/sphinxcontrib-qthelp/sphinxcontrib-qthelp<=1.0.6/' setup.py",
|
||||
"sed -i 's/alabaster>=0.7,<0.8/alabaster>=0.7,<0.7.12/' setup.py",
|
||||
'sed -i "s/\'packaging\',/\'packaging\', \'markupsafe<=2.0.1\',/" setup.py',
|
||||
])
|
||||
if k in ["4.2", "4.3", "4.4"]:
|
||||
MAP_VERSION_TO_INSTALL_SPHINX[k]["pre_install"].extend([
|
||||
"sed -i 's/sphinxcontrib-htmlhelp>=2.0.0/sphinxcontrib-htmlhelp>=2.0.0,<=2.0.4/' setup.py",
|
||||
"sed -i 's/sphinxcontrib-serializinghtml>=1.1.5/sphinxcontrib-serializinghtml>=1.1.5,<=1.1.9/' setup.py",
|
||||
])
|
||||
elif k == "4.1":
|
||||
MAP_VERSION_TO_INSTALL_SPHINX[k]["pre_install"].extend([
|
||||
(
|
||||
"grep -q 'sphinxcontrib-htmlhelp>=2.0.0' setup.py && "
|
||||
"sed -i 's/sphinxcontrib-htmlhelp>=2.0.0/sphinxcontrib-htmlhelp>=2.0.0,<=2.0.4/' setup.py || "
|
||||
"sed -i 's/sphinxcontrib-htmlhelp/sphinxcontrib-htmlhelp<=2.0.4/' setup.py"
|
||||
),
|
||||
(
|
||||
"grep -q 'sphinxcontrib-serializinghtml>=1.1.5' setup.py && "
|
||||
"sed -i 's/sphinxcontrib-serializinghtml>=1.1.5/sphinxcontrib-serializinghtml>=1.1.5,<=1.1.9/' setup.py || "
|
||||
"sed -i 's/sphinxcontrib-serializinghtml/sphinxcontrib-serializinghtml<=1.1.9/' setup.py"
|
||||
)
|
||||
])
|
||||
else:
|
||||
MAP_VERSION_TO_INSTALL_SPHINX[k]["pre_install"].extend([
|
||||
"sed -i 's/sphinxcontrib-htmlhelp/sphinxcontrib-htmlhelp<=2.0.4/' setup.py",
|
||||
"sed -i 's/sphinxcontrib-serializinghtml/sphinxcontrib-serializinghtml<=1.1.9/' setup.py",
|
||||
])
|
||||
MAP_VERSION_TO_INSTALL_SPHINX["7.2"]["pre_install"] += [
|
||||
"apt-get update && apt-get install -y graphviz"
|
||||
]
|
||||
MAP_VERSION_TO_INSTALL_ASTROPY = {
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"install": "python -m pip install -e .[test] --verbose",
|
||||
"pip_packages": [
|
||||
"attrs==23.1.0",
|
||||
"exceptiongroup==1.1.3",
|
||||
"execnet==2.0.2",
|
||||
"hypothesis==6.82.6",
|
||||
"iniconfig==2.0.0",
|
||||
"numpy==1.25.2",
|
||||
"packaging==23.1",
|
||||
"pluggy==1.3.0",
|
||||
"psutil==5.9.5",
|
||||
"pyerfa==2.0.0.3",
|
||||
"pytest-arraydiff==0.5.0",
|
||||
"pytest-astropy-header==0.2.2",
|
||||
"pytest-astropy==0.10.0",
|
||||
"pytest-cov==4.1.0",
|
||||
"pytest-doctestplus==1.0.0",
|
||||
"pytest-filter-subpackage==0.1.2",
|
||||
"pytest-mock==3.11.1",
|
||||
"pytest-openfiles==0.5.0",
|
||||
"pytest-remotedata==0.4.0",
|
||||
"pytest-xdist==3.3.1",
|
||||
"pytest==7.4.0",
|
||||
"PyYAML==6.0.1",
|
||||
"setuptools==68.0.0",
|
||||
"sortedcontainers==2.4.0",
|
||||
"tomli==2.0.1",
|
||||
],
|
||||
}
|
||||
for k in ["0.1", "0.2", "0.3", "0.4", "1.1", "1.2", "1.3", "3.0", "3.1", "3.2"]
|
||||
+ ["4.1", "4.2", "4.3", "5.0", "5.1", "5.2"]
|
||||
}
|
||||
for k in ["4.1", "4.2", "4.3", "5.0", "5.1", "5.2"]:
|
||||
MAP_VERSION_TO_INSTALL_ASTROPY[k]["pre_install"] = [
|
||||
'sed -i \'s/requires = \\["setuptools",/requires = \\["setuptools==68.0.0",/\' pyproject.toml'
|
||||
]
|
||||
MAP_VERSION_TO_INSTALL_SYMPY = {
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"packages": "mpmath flake8",
|
||||
"pip_packages": ["mpmath==1.3.0", "flake8-comprehensions"],
|
||||
"install": "python -m pip install -e .",
|
||||
}
|
||||
for k in ["0.7", "1.0", "1.1", "1.10", "1.11", "1.12", "1.2", "1.4", "1.5", "1.6"]
|
||||
+ ["1.7", "1.8", "1.9"]
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_SYMPY.update(
|
||||
{
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"packages": "requirements.txt",
|
||||
"install": "python -m pip install -e .",
|
||||
"pip_packages": ["mpmath==1.3.0"],
|
||||
}
|
||||
for k in ["1.13"]
|
||||
}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_PYLINT = {
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"packages": "requirements.txt",
|
||||
"install": "python -m pip install -e .",
|
||||
}
|
||||
for k in [
|
||||
"2.10",
|
||||
"2.11",
|
||||
"2.13",
|
||||
"2.14",
|
||||
"2.15",
|
||||
"2.16",
|
||||
"2.17",
|
||||
"2.8",
|
||||
"2.9",
|
||||
"3.0",
|
||||
]
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_PYLINT["2.8"]["pip_packages"] = ["pyenchant==3.2"]
|
||||
MAP_VERSION_TO_INSTALL_PYLINT["2.8"]["pre_install"] = [
|
||||
"apt-get update && apt-get install -y libenchant-2-dev hunspell-en-us"
|
||||
]
|
||||
MAP_VERSION_TO_INSTALL_PYLINT.update(
|
||||
{
|
||||
k: {
|
||||
**MAP_VERSION_TO_INSTALL_PYLINT[k],
|
||||
"pip_packages": ["astroid==3.0.0a6", "setuptools"],
|
||||
}
|
||||
for k in ["3.0"]
|
||||
}
|
||||
)
|
||||
|
||||
MAP_VERSION_TO_INSTALL_XARRAY = {
|
||||
k: {
|
||||
"python": "3.10",
|
||||
"packages": "environment.yml",
|
||||
"install": "python -m pip install -e .",
|
||||
"pip_packages": [
|
||||
"numpy==1.23.0",
|
||||
"packaging==23.1",
|
||||
"pandas==1.5.3",
|
||||
"pytest==7.4.0",
|
||||
"python-dateutil==2.8.2",
|
||||
"pytz==2023.3",
|
||||
"six==1.16.0",
|
||||
"scipy==1.11.1",
|
||||
"setuptools==68.0.0"
|
||||
],
|
||||
"no_use_env": True,
|
||||
}
|
||||
for k in ["0.12", "0.18", "0.19", "0.20", "2022.03", "2022.06", "2022.09"]
|
||||
}
|
||||
|
||||
MAP_VERSION_TO_INSTALL_SQLFLUFF = {
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"packages": "requirements.txt",
|
||||
"install": "python -m pip install -e .",
|
||||
}
|
||||
for k in [
|
||||
"0.10",
|
||||
"0.11",
|
||||
"0.12",
|
||||
"0.13",
|
||||
"0.4",
|
||||
"0.5",
|
||||
"0.6",
|
||||
"0.8",
|
||||
"0.9",
|
||||
"1.0",
|
||||
"1.1",
|
||||
"1.2",
|
||||
"1.3",
|
||||
"1.4",
|
||||
"2.0",
|
||||
"2.1",
|
||||
"2.2",
|
||||
]
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_DBT_CORE = {
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"packages": "requirements.txt",
|
||||
"install": "python -m pip install -e .",
|
||||
}
|
||||
for k in [
|
||||
"0.13",
|
||||
"0.14",
|
||||
"0.15",
|
||||
"0.16",
|
||||
"0.17",
|
||||
"0.18",
|
||||
"0.19",
|
||||
"0.20",
|
||||
"0.21",
|
||||
"1.0",
|
||||
"1.1",
|
||||
"1.2",
|
||||
"1.3",
|
||||
"1.4",
|
||||
"1.5",
|
||||
"1.6",
|
||||
"1.7",
|
||||
]
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_PYVISTA = {
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"install": "python -m pip install -e .",
|
||||
"pip_packages": ["pytest"],
|
||||
}
|
||||
for k in ["0.20", "0.21", "0.22", "0.23"]
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_PYVISTA.update(
|
||||
{
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"packages": "requirements.txt",
|
||||
"install": "python -m pip install -e .",
|
||||
"pip_packages": ["pytest"],
|
||||
}
|
||||
for k in [
|
||||
"0.24",
|
||||
"0.25",
|
||||
"0.26",
|
||||
"0.27",
|
||||
"0.28",
|
||||
"0.29",
|
||||
"0.30",
|
||||
"0.31",
|
||||
"0.32",
|
||||
"0.33",
|
||||
"0.34",
|
||||
"0.35",
|
||||
"0.36",
|
||||
"0.37",
|
||||
"0.38",
|
||||
"0.39",
|
||||
"0.40",
|
||||
"0.41",
|
||||
"0.42",
|
||||
"0.43",
|
||||
]
|
||||
}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_ASTROID = {
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"install": "python -m pip install -e .",
|
||||
"pip_packages": ["pytest"],
|
||||
}
|
||||
for k in [
|
||||
"2.10",
|
||||
"2.12",
|
||||
"2.13",
|
||||
"2.14",
|
||||
"2.15",
|
||||
"2.16",
|
||||
"2.5",
|
||||
"2.6",
|
||||
"2.7",
|
||||
"2.8",
|
||||
"2.9",
|
||||
"3.0",
|
||||
]
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_MARSHMALLOW = {
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"install": "python -m pip install -e '.[dev]'",
|
||||
}
|
||||
for k in [
|
||||
"2.18",
|
||||
"2.19",
|
||||
"2.20",
|
||||
"3.0",
|
||||
"3.1",
|
||||
"3.10",
|
||||
"3.11",
|
||||
"3.12",
|
||||
"3.13",
|
||||
"3.15",
|
||||
"3.16",
|
||||
"3.19",
|
||||
"3.2",
|
||||
"3.4",
|
||||
"3.8",
|
||||
"3.9",
|
||||
]
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_PVLIB = {
|
||||
k: {
|
||||
"python": "3.9",
|
||||
"install": "python -m pip install -e .[all]",
|
||||
"packages": "pandas scipy",
|
||||
"pip_packages": ["jupyter", "ipython", "matplotlib", "pytest", "flake8"],
|
||||
}
|
||||
for k in ["0.1", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9"]
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_PYDICOM = {
|
||||
k: {"python": "3.6", "install": "python -m pip install -e .", "packages": "numpy"}
|
||||
for k in [
|
||||
"1.0",
|
||||
"1.1",
|
||||
"1.2",
|
||||
"1.3",
|
||||
"1.4",
|
||||
"2.0",
|
||||
"2.1",
|
||||
"2.2",
|
||||
"2.3",
|
||||
"2.4",
|
||||
"3.0",
|
||||
]
|
||||
}
|
||||
MAP_VERSION_TO_INSTALL_PYDICOM.update(
|
||||
{k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], "python": "3.8"} for k in ["1.4", "2.0"]}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_PYDICOM.update(
|
||||
{k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], "python": "3.9"} for k in ["2.1", "2.2"]}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_PYDICOM.update(
|
||||
{k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], "python": "3.10"} for k in ["2.3"]}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_PYDICOM.update(
|
||||
{k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], "python": "3.11"} for k in ["2.4", "3.0"]}
|
||||
)
|
||||
MAP_VERSION_TO_INSTALL_HUMANEVAL = {k: {"python": "3.9"} for k in ["1.0"]}
|
||||
MAP_VERSION_TO_INSTALL_HUMANEVAL_FIX = {k: {"python": "3.10", "packages": "pytest"} for k in ["0.0.1"]}
|
||||
|
||||
# Constants - Task Instance Installation Environment
|
||||
MAP_VERSION_TO_INSTALL = {
|
||||
"astropy/astropy": MAP_VERSION_TO_INSTALL_ASTROPY,
|
||||
"dbt-labs/dbt-core": MAP_VERSION_TO_INSTALL_DBT_CORE,
|
||||
"django/django": MAP_VERSION_TO_INSTALL_DJANGO,
|
||||
"matplotlib/matplotlib": MAP_VERSION_TO_INSTALL_MATPLOTLIB,
|
||||
"marshmallow-code/marshmallow": MAP_VERSION_TO_INSTALL_MARSHMALLOW,
|
||||
"mwaskom/seaborn": MAP_VERSION_TO_INSTALL_SEABORN,
|
||||
"pallets/flask": MAP_VERSION_TO_INSTALL_FLASK,
|
||||
"psf/requests": MAP_VERSION_TO_INSTALL_REQUESTS,
|
||||
"pvlib/pvlib-python": MAP_VERSION_TO_INSTALL_PVLIB,
|
||||
"pydata/xarray": MAP_VERSION_TO_INSTALL_XARRAY,
|
||||
"pydicom/pydicom": MAP_VERSION_TO_INSTALL_PYDICOM,
|
||||
"pylint-dev/astroid": MAP_VERSION_TO_INSTALL_ASTROID,
|
||||
"pylint-dev/pylint": MAP_VERSION_TO_INSTALL_PYLINT,
|
||||
"pytest-dev/pytest": MAP_VERSION_TO_INSTALL_PYTEST,
|
||||
"pyvista/pyvista": MAP_VERSION_TO_INSTALL_PYVISTA,
|
||||
"scikit-learn/scikit-learn": MAP_VERSION_TO_INSTALL_SKLEARN,
|
||||
"sphinx-doc/sphinx": MAP_VERSION_TO_INSTALL_SPHINX,
|
||||
"sqlfluff/sqlfluff": MAP_VERSION_TO_INSTALL_SQLFLUFF,
|
||||
"swe-bench/humaneval": MAP_VERSION_TO_INSTALL_HUMANEVAL,
|
||||
"nielstron/humaneval_fix": MAP_VERSION_TO_INSTALL_HUMANEVAL_FIX,
|
||||
"sympy/sympy": MAP_VERSION_TO_INSTALL_SYMPY,
|
||||
}
|
||||
|
||||
# Constants - Repository Specific Installation Instructions
|
||||
MAP_REPO_TO_INSTALL = {}
|
||||
|
||||
# Constants - Task Instance Test Frameworks
|
||||
TEST_PYTEST_VERBOSE = "pytest -rA --tb=long -p no:cacheprovider"
|
||||
MAP_REPO_TO_TEST_FRAMEWORK_VERBOSE = {
|
||||
"astropy/astropy": {
|
||||
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_ASTROPY.keys()
|
||||
},
|
||||
"django/django": {
|
||||
k: "./tests/runtests.py --verbosity 2 --settings=test_sqlite --parallel 1"
|
||||
for k in MAP_VERSION_TO_INSTALL_DJANGO.keys()
|
||||
},
|
||||
"marshmallow-code/marshmallow": {
|
||||
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_MARSHMALLOW.keys()
|
||||
},
|
||||
"matplotlib/matplotlib": {
|
||||
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_MATPLOTLIB.keys()
|
||||
},
|
||||
"mwaskom/seaborn": {
|
||||
k: "pytest -rA --tb=long" for k in MAP_VERSION_TO_INSTALL_SEABORN.keys()
|
||||
},
|
||||
"pallets/flask": {
|
||||
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_FLASK.keys()
|
||||
},
|
||||
"psf/requests": {
|
||||
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_REQUESTS.keys()
|
||||
},
|
||||
"pvlib/pvlib-python": {
|
||||
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PVLIB.keys()
|
||||
},
|
||||
"pydata/xarray": {
|
||||
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_XARRAY.keys()
|
||||
},
|
||||
"pydicom/pydicom": {
|
||||
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PYDICOM.keys()
|
||||
},
|
||||
"pylint-dev/astroid": {
|
||||
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_ASTROID.keys()
|
||||
},
|
||||
"pylint-dev/pylint": {
|
||||
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PYLINT.keys()
|
||||
},
|
||||
"pytest-dev/pytest": {
|
||||
k: "pytest -rA --tb=long" for k in MAP_VERSION_TO_INSTALL_PYTEST.keys()
|
||||
},
|
||||
"pyvista/pyvista": {
|
||||
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PYVISTA.keys()
|
||||
},
|
||||
"scikit-learn/scikit-learn": {
|
||||
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_SKLEARN.keys()
|
||||
},
|
||||
"sphinx-doc/sphinx": {
|
||||
k: "tox -epy39 -v --" for k in MAP_VERSION_TO_INSTALL_SPHINX.keys()
|
||||
},
|
||||
"sqlfluff/sqlfluff": {
|
||||
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_SQLFLUFF.keys()
|
||||
},
|
||||
"swe-bench/humaneval": {
|
||||
k: "python" for k in MAP_VERSION_TO_INSTALL_HUMANEVAL.keys()
|
||||
},
|
||||
"nielstron/humaneval_fix": {
|
||||
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_HUMANEVAL_FIX.keys()
|
||||
},
|
||||
"sympy/sympy": {
|
||||
k: "bin/test -C --verbose" for k in MAP_VERSION_TO_INSTALL_SYMPY.keys()
|
||||
},
|
||||
}
|
||||
MAP_REPO_TO_TEST_FRAMEWORK_VERBOSE["django/django"]["1.9"] = "./tests/runtests.py --verbosity 2"
|
||||
@@ -3,13 +3,18 @@ import copy
import json
import os
import tempfile
from typing import Any
from typing import Any, Literal

import pandas as pd
import toml
from datasets import load_dataset

import openhands.agenthub
from evaluation.benchmarks.swe_bench.resource.swt_bench_constants import (
    MAP_REPO_TO_TEST_FRAMEWORK_VERBOSE,
    MAP_REPO_TO_INSTALL,
    MAP_VERSION_TO_INSTALL
)
from evaluation.benchmarks.swe_bench.binary_patch_utils import (
    remove_binary_diffs,
    remove_binary_files_from_git,
@@ -55,6 +60,7 @@ from openhands.utils.shutdown_listener import sleep_if_should_continue

USE_HINT_TEXT = os.environ.get('USE_HINT_TEXT', 'false').lower() == 'true'
RUN_WITH_BROWSING = os.environ.get('RUN_WITH_BROWSING', 'false').lower() == 'true'
BenchMode = Literal["swe", "swt", "swt-ci"]


AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
@@ -68,7 +74,32 @@ def _get_swebench_workspace_dir_name(instance: pd.Series) -> str:

def get_instruction(instance: pd.Series, metadata: EvalMetadata) -> MessageAction:
    workspace_dir_name = _get_swebench_workspace_dir_name(instance)
    instruction = f"""
    mode = metadata.details["mode"]
    if mode.startswith('swt'):
        test_instructions = f"The following command can be used to run the tests: `{list(MAP_REPO_TO_TEST_FRAMEWORK_VERBOSE[instance.repo].values())[0]}`. Make sure they fail in the expected way.\n" if mode.endswith("ci") else ""
        instruction = f"""\
<uploaded_files>
/workspace/{workspace_dir_name}
</uploaded_files>
I've uploaded a python code repository in the directory {workspace_dir_name}. Consider the following issue description:

<issue_description>
{instance.problem_statement}
</issue_description>


Can you help me implement the necessary changes to the repository to test whether the issue in <issue_description> was resolved?
I will take care of all changes to any of the non-test files. This means you DON'T have to modify the actual logic and ONLY have to update test logic and tests!
Your task is to make the minimal changes to tests files in the /workspace directory to reproduce the issue in the <issue_description>, i.e., such that the generated tests fail in the current state (where the issue is unresolved) and pass when the issue will be resolved.
Follow these steps to reproduce the issue:
1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure.
2. Create a script `reproduction.py` to reproduce the error and execute it with `python reproduction.py` using the BashTool, to confirm the error
3. Edit the sourcecode of the repo to integrate your reproduction script into the test framework
4. Run the test framework and make sure your tests fail! Only submit FAILING tests! Never submit passing tests.
{test_instructions}Your thinking should be thorough and so it's fine if it's very long.
"""
    else:
        instruction = f"""
<uploaded_files>
/workspace/{workspace_dir_name}
</uploaded_files>
@@ -356,6 +387,29 @@ def initialize_runtime(
    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
    assert_and_raise(obs.exit_code == 0, f'Failed to remove git remotes: {str(obs)}')

    if metadata.details["mode"] == "swt-ci":
        # set up repo
        setup_commands = []
        if instance["repo"] in MAP_REPO_TO_INSTALL:
            setup_commands.append(MAP_REPO_TO_INSTALL[instance["repo"]])

        # Run pre-install set up if provided
        install = MAP_VERSION_TO_INSTALL.get(instance['repo'], {}).get(instance['version'], [])
        if "pre_install" in install:
            for pre_install in install["pre_install"]:
                setup_commands.append(pre_install)

        if "install" in install:
            setup_commands.append(install["install"])

        for command in setup_commands:
            action = CmdRunAction(command=command)
            action.set_hard_timeout(600)
            logger.info(action, extra={'msg_type': 'ACTION'})
            obs = runtime.run_action(action)
            logger.info(obs, extra={'msg_type': 'OBSERVATION'})


    if 'multimodal' not in metadata.dataset.lower():
        # Only for non-multimodal datasets, we need to activate the testbed environment for Python
        # SWE-Bench multimodal datasets are not using the testbed environment
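The `swt-ci` setup step in the hunk above collects commands from two lookup tables before running them one by one. A minimal sketch of that ordering follows, assuming tiny stand-in tables: `EXAMPLE_REPO_TO_INSTALL`, `EXAMPLE_VERSION_TO_INSTALL`, `build_setup_commands`, and the `example/repo` entry are all hypothetical and are not taken from `swt_bench_constants.py`.

```python
from typing import Any

# Hypothetical stand-ins for MAP_REPO_TO_INSTALL and MAP_VERSION_TO_INSTALL;
# the real tables are defined in swt_bench_constants.py.
EXAMPLE_REPO_TO_INSTALL: dict[str, str] = {}
EXAMPLE_VERSION_TO_INSTALL: dict[str, dict[str, dict[str, Any]]] = {
    "example/repo": {
        "1.0": {
            "pre_install": ["sed -i 's/old/new/' setup.py"],
            "install": "python -m pip install -e .",
        }
    }
}


def build_setup_commands(repo: str, version: str) -> list[str]:
    """Collect repo-level, pre-install, and install commands, in that order."""
    commands: list[str] = []
    if repo in EXAMPLE_REPO_TO_INSTALL:
        commands.append(EXAMPLE_REPO_TO_INSTALL[repo])
    # This sketch defaults to an empty dict so unknown repos yield no commands.
    install = EXAMPLE_VERSION_TO_INSTALL.get(repo, {}).get(version, {})
    commands.extend(install.get("pre_install", []))
    if "install" in install:
        commands.append(install["install"])
    return commands


print(build_setup_commands("example/repo", "1.0"))
```

The sketch only builds the list; in the hunk each command is then wrapped in a `CmdRunAction` with a 600-second hard timeout and executed in the runtime.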
@@ -678,6 +732,13 @@ if __name__ == '__main__':
        default='test',
        help='split to evaluate on',
    )
    parser.add_argument(
        '--mode',
        type=str,
        default='swe',
        choices=['swe', 'swt', 'swt-ci'],
        help="mode to run the evaluation, either 'swe', 'swt', or 'swt-ci'",
    )
    args, _ = parser.parse_known_args()

    # NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
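The new `--mode` flag drives three checks elsewhere in this diff: `startswith('swt')` selects the test-writing prompt, `endswith("ci")` adds the test command hint, and `== "swt-ci"` triggers the repository setup. A small sketch of how the three values map onto those checks, where `describe_mode` is a hypothetical helper and not part of `run_infer.py`:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--mode',
    type=str,
    default='swe',
    choices=['swe', 'swt', 'swt-ci'],
)


def describe_mode(argv: list[str]) -> dict:
    """Hypothetical helper: show which branches a given --mode value enables."""
    args, _ = parser.parse_known_args(argv)
    mode = args.mode
    return {
        "mode": mode,
        "test_writing_prompt": mode.startswith("swt"),
        "test_command_hint": mode.endswith("ci"),
        "runs_repo_setup": mode == "swt-ci",
    }


for argv in ([], ["--mode", "swt"], ["--mode", "swt-ci"]):
    print(describe_mode(argv))
```

Because `choices` is set, argparse rejects any other value before the evaluation starts.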
@@ -714,7 +775,7 @@ if __name__ == '__main__':
    if llm_config is None:
        raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')

    details = {}
    details = {"mode": args.mode}
    _agent_cls = openhands.agenthub.Agent.get_cls(args.agent_cls)

    dataset_descrption = (
@@ -12,6 +12,7 @@ NUM_WORKERS=$6
DATASET=$7
SPLIT=$8
N_RUNS=$9
MODE=${10}

if [ -z "$NUM_WORKERS" ]; then
  NUM_WORKERS=1
@@ -45,6 +46,11 @@ if [ -z "$SPLIT" ]; then
  SPLIT="test"
fi

if [ -z "$MODE" ]; then
  MODE="swe"
  echo "MODE not specified, use default $MODE"
fi

export RUN_WITH_BROWSING=$RUN_WITH_BROWSING
echo "RUN_WITH_BROWSING: $RUN_WITH_BROWSING"

@@ -55,6 +61,10 @@ echo "OPENHANDS_VERSION: $OPENHANDS_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "DATASET: $DATASET"
echo "SPLIT: $SPLIT"
echo "MAX_ITER: $MAX_ITER"
echo "NUM_WORKERS: $NUM_WORKERS"
echo "COMMIT_HASH: $COMMIT_HASH"
echo "MODE: $MODE"

# Default to NOT use Hint
if [ -z "$USE_HINT_TEXT" ]; then
@@ -74,9 +84,13 @@ fi
if [ -n "$EXP_NAME" ]; then
  EVAL_NOTE="$EVAL_NOTE-$EXP_NAME"
fi
# if mode != swe, add mode to the eval note
if [ "$MODE" != "swe" ]; then
  EVAL_NOTE="${EVAL_NOTE}-${MODE}"
fi

function run_eval() {
  local eval_note=$1
  local eval_note="${1}"
  COMMAND="poetry run python evaluation/benchmarks/swe_bench/run_infer.py \
    --agent-cls $AGENT \
    --llm-config $MODEL_CONFIG \
@@ -84,7 +98,8 @@ function run_eval() {
    --eval-num-workers $NUM_WORKERS \
    --eval-note $eval_note \
    --dataset $DATASET \
    --split $SPLIT"
    --split $SPLIT \
    --mode $MODE"

  if [ -n "$EVAL_LIMIT" ]; then
    echo "EVAL_LIMIT: $EVAL_LIMIT"
73
evaluation/benchmarks/swe_bench/scripts/swtbench/convert.py
Normal file
@@ -0,0 +1,73 @@
import json
import argparse
import logging


import unidiff

from evaluation.benchmarks.swe_bench.resource.swt_bench_constants import MAP_VERSION_TO_INSTALL

_LOGGER = logging.getLogger(__name__)


def remove_setup_files(model_patch: str, instance: dict, delete_setup_changes: bool):
    """ Discard all changes that a patch applies to files changed by the pre_install script and to reproduction scripts (top-level scripts)"""
    setup_files = ["setup.py", "tox.ini", "pyproject.toml"]
    pre_install = MAP_VERSION_TO_INSTALL.get(instance["repo"], {}).get(instance["version"], {}).get("pre_install", [])
    relevant_files = [
        file
        for file in setup_files
        if any(file in install and "sed" in install for install in pre_install)
    ] if delete_setup_changes else []
    for i in range(10):
        try:
            # Apparently outputs.jsonl has .strip() applied, so we try to reconstruct the original patch by adding auxiliary whitespace
            patch = unidiff.PatchSet(model_patch + i*"\n")
            break
        except unidiff.UnidiffParseError as e:
            pass

    to_delete = []
    for i, file in enumerate(patch):
        if any(f in file.source_file for f in relevant_files) or file.target_file.count("/") == 1:
            to_delete.append(i)
    for i in reversed(to_delete):
        del patch[i]
    return str(patch)


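The filter in `remove_setup_files` drops a file from the patch when either of two conditions holds: its source path mentions one of the setup files that the pre-install step edits with `sed`, or its target path contains exactly one `/`. A minimal sketch of that rule on plain strings, assuming paths carry the `a/` and `b/` prefixes that git writes into diff headers (so one slash means a file at the repository root); `is_dropped` is a hypothetical name used only for illustration.

```python
def is_dropped(source_file: str, target_file: str, relevant_files: list[str]) -> bool:
    """Hypothetical restatement of the per-file condition in remove_setup_files."""
    # Substring match against setup files touched by the pre-install step.
    touches_setup_file = any(name in source_file for name in relevant_files)
    # "b/reproduction.py" has one slash; "b/tests/test_x.py" has two.
    is_top_level = target_file.count("/") == 1
    return touches_setup_file or is_top_level


print(is_dropped("/dev/null", "b/reproduction.py", []))
print(is_dropped("a/tests/test_x.py", "b/tests/test_x.py", []))
print(is_dropped("a/docs/tox.ini", "b/docs/tox.ini", ["tox.ini"]))
```

Note that the setup-file check is a substring test, so under this rule a nested file whose path merely contains `tox.ini` is dropped as well.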
def main(
    prediction_file: str,
):
    """Main function to extract the model patches from the OpenHands prediction file and turn them into the expected SWT-Bench format."""
    with open(prediction_file) as f:
        for line in f:
            pred = json.loads(line)
            try:
                git_diff = pred["test_result"]["git_patch"]
            except KeyError:
                _LOGGER.warning("Warning: No git diff found for instance %s", pred["instance_id"])
                continue
            ci_mode = pred["metadata"]["details"].get("mode", "") == "swt-ci"
            try:
                git_diff = remove_setup_files(git_diff, pred["instance"], ci_mode)
            except:
                _LOGGER.warning("Warning: Invalid git diff found for instance %s", pred["instance_id"])
            print(json.dumps({
                "instance_id": pred["instance_id"],
                "model_name_or_path": f'{pred["metadata"]["llm_config"]["openrouter_app_name"]}__{pred["metadata"]["agent_class"]}__{pred["metadata"]["llm_config"]["model"]}',
                "model_patch": git_diff,
                "full_output": json.dumps(pred),
            }))

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--prediction_file",
        type=str,
        required=True,
        help="Path to the prediction file (.../outputs.jsonl)",
    )
    args = parser.parse_args()

    main(args.prediction_file)
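Each line that `main` prints is one JSON record with four keys. The following sketch shows the shape of that record for a single prediction; the `pred` values (`example__instance-1`, `example-model`, and so on) are made up for illustration, and `to_swt_bench_record` is a hypothetical helper rather than a function in `convert.py`.

```python
import json


def to_swt_bench_record(pred: dict, model_patch: str) -> dict:
    """Hypothetical helper mirroring the dict that main() serializes per line."""
    meta = pred["metadata"]
    llm = meta["llm_config"]
    return {
        "instance_id": pred["instance_id"],
        # app name, agent class, and model joined by double underscores
        "model_name_or_path": f'{llm["openrouter_app_name"]}__{meta["agent_class"]}__{llm["model"]}',
        "model_patch": model_patch,
        # the full original prediction is kept as a JSON string
        "full_output": json.dumps(pred),
    }


# Made-up prediction containing only the fields the conversion reads.
example_pred = {
    "instance_id": "example__instance-1",
    "metadata": {
        "agent_class": "CodeActAgent",
        "llm_config": {"openrouter_app_name": "OpenHands", "model": "example-model"},
        "details": {"mode": "swt"},
    },
    "test_result": {"git_patch": "diff --git a/tests/test_x.py b/tests/test_x.py\n"},
}

record = to_swt_bench_record(example_pred, example_pred["test_result"]["git_patch"])
print(json.dumps(record, indent=2))
```

Since the script writes records to standard output, redirecting that output to a file yields a JSON Lines predictions file.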