Add inference for SWT-Bench (#7201)

Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Calvin Smith <email@cjsmith.io>
Niels Mündler
2025-04-17 22:49:42 +02:00
committed by GitHub
parent 988d4aa679
commit 4b124d5906
5 changed files with 1044 additions and 6 deletions

View File

@@ -2,6 +2,8 @@
This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)).
**UPDATE (4/8/2025): We now support running SWT-Bench evaluation! For more details, check out [the corresponding section](#SWT-Bench-Evaluation).**
**UPDATE (03/27/2025): We now support SWE-Bench multimodal evaluation! Simply use "princeton-nlp/SWE-bench_Multimodal" as the dataset name in the `run_infer.sh` script to evaluate on multimodal instances.**
**UPDATE (2/18/2025): We now support running SWE-Gym using the same evaluation harness here. For more details, check out [this README](./SWE-Gym.md).**
@@ -141,7 +143,7 @@ With `output.jsonl` file, you can run `eval_infer.sh` to evaluate generated patc
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh $YOUR_OUTPUT_JSONL [instance_id] [dataset_name] [split]
# Example
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Lite/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
```
The script now accepts optional arguments:
@@ -182,3 +184,58 @@ To clean-up all existing runtimes that you've already started, run:
```bash
ALLHANDS_API_KEY="YOUR-API-KEY" ./evaluation/utils/scripts/cleanup_remote_runtime.sh
```
## SWT-Bench Evaluation
[SWT-Bench](https://swtbench.com/) ([paper](https://arxiv.org/abs/2406.12952)) is a benchmark for evaluating the capability of LLMs to create unit tests. It uses the same instances as SWE-Bench, but requires a separate evaluation harness to capture coverage and issue reproduction. Below we detail how to run inference on SWT-Bench with the inference script in this folder and how to evaluate the resulting tests with the SWT-Bench evaluation harness.
### Run inference on SWT-Bench
To run inference on SWT-Bench, use the same `run_infer.sh` script as described for evaluation on plain SWE-Bench. The only difference is that you need to set the `mode` parameter to `swt` or `swt-ci` when running the script. For example, to run inference on SWT-Bench Verified, run the following command:
```bash
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [swe-dataset] test 1 swt
# Example - This runs inference with CodeActAgent on 500 instances of the SWT-Bench Verified test set (corresponding to SWE-bench_Verified), with at most 100 iterations per instance and 1 worker running in parallel
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.eval_gpt4o-2024-11-20 HEAD CodeActAgent 500 100 1 princeton-nlp/SWE-bench_Verified test 1 swt
```
The two modes `swt` and `swt-ci` have the following effects:
- `swt`: This mode changes the prompt to instruct the agent to generate test cases that reproduce the issue instead of resolving it.
- `swt-ci`: In addition to the changes made by `swt`, this mode sets up the CI environment by (i) pre-installing the environment in the Docker image so that the test framework can be executed without errors, and (ii) telling the model the exact command to run the test framework.
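For example, an `swt-ci` run differs from the `swt` example above only in the last argument (the model config and dataset below are the same illustrative values as before; adjust them to your setup):

```bash
# Same as the SWT-Bench Verified example above, but in swt-ci mode: the environment is
# pre-installed in the image and the agent is told the exact test command to run
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.eval_gpt4o-2024-11-20 HEAD CodeActAgent 500 100 1 princeton-nlp/SWE-bench_Verified test 1 swt-ci
```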
### Run evaluation for SWT-Bench
These results are evaluated using [the SWT-Bench evaluation harness](https://github.com/logic-star-ai/swt-bench/tree/master).
#### Extracting results into SWT-Bench harness format
To evaluate the obtained inference results with the SWT-Bench harness, we first transform them into the format that the harness expects.
```bash
python3 evaluation/benchmarks/swe_bench/scripts/swtbench/convert.py --prediction_file [output.jsonl] > [output_swt.jsonl]
# Example
python3 evaluation/benchmarks/swe_bench/scripts/swtbench/convert.py --prediction_file "evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Verified-test/CodeActAgent/gpt-4o-2024-11-20_maxiter_100_N_v0.31.0-no-hint-swt-run_1/output.jsonl" > OpenHands-gpt-4o-2024-11-20.jsonl
```
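Each line of the converted file is a JSON object with the fields the SWT-Bench harness expects, as emitted by `convert.py` (`instance_id`, `model_name_or_path`, `model_patch`, `full_output`). A rough sketch of what one converted entry looks like (values are illustrative):

```bash
head -n 1 OpenHands-gpt-4o-2024-11-20.jsonl
# {"instance_id": "django__django-11099",
#  "model_name_or_path": "OpenHands__CodeActAgent__gpt-4o-2024-11-20",
#  "model_patch": "diff --git a/tests/auth_tests/test_validators.py ...",
#  "full_output": "{... the full prediction record from output.jsonl ...}"}
```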
#### Running the results in SWT-Bench
Next, we run the [SWT-Bench evaluation harness](https://github.com/logic-star-ai/swt-bench/tree/master) with these results.
First, set up and validate the installation as described in the harness [setup instructions](https://github.com/logic-star-ai/swt-bench/tree/master?tab=readme-ov-file#-set-up).
Then, run the evaluation with the following command:
```bash
# Example
python3 -m src.main \
--dataset_name princeton-nlp/SWE-bench_Verified \
--predictions_path <pathTo>/OpenHands-gpt-4o-2024-11-20.jsonl \
--max_workers 12 \
--run_id OpenHands-CodeAct-gpt-4o-2024-11-20 --patch_types vanilla --build_mode api
```
The results of the evaluation can be obtained by running the reporting script of the harness.
```bash
# Example
python -m src.report run_instance_swt_logs/OpenHands-CodeAct-gpt-4o-2024-11-20/OpenHands__CodeActAgent__gpt-4o-2024-11-20 --dataset verified
```

View File

@@ -0,0 +1,832 @@
# Based on https://github.com/logic-star-ai/swt-bench/blob/master/src/constants.py
# Constants - Installation Specifications
MAP_VERSION_TO_INSTALL_SKLEARN = {
k: {
"python": "3.6",
"packages": "numpy scipy cython pytest pandas matplotlib",
"install": "python -m pip install -v --no-use-pep517 --no-build-isolation -e .",
"pip_packages": [
"cython",
"numpy==1.19.2",
"setuptools",
"scipy==1.5.2",
],
}
for k in ["0.20", "0.21", "0.22"]
}
MAP_VERSION_TO_INSTALL_SKLEARN.update(
{
k: {
"python": "3.9",
"packages": "'numpy==1.19.2' 'scipy==1.5.2' 'cython==3.0.10' pytest 'pandas<2.0.0' 'matplotlib<3.9.0' setuptools pytest joblib threadpoolctl",
"install": "python -m pip install -v --no-use-pep517 --no-build-isolation -e .",
"pip_packages": ["cython", "setuptools", "numpy", "scipy"],
}
for k in ["1.3", "1.4"]
}
)
MAP_VERSION_TO_INSTALL_FLASK = {
"2.0": {
"python": "3.9",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"pip_packages": [
"setuptools==70.0.0",
"Werkzeug==2.3.7",
"Jinja2==3.0.1",
"itsdangerous==2.1.2",
"click==8.0.1",
"MarkupSafe==2.1.3",
],
},
"2.1": {
"python": "3.10",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"pip_packages": [
"click==8.1.3",
"itsdangerous==2.1.2",
"Jinja2==3.1.2",
"MarkupSafe==2.1.1",
"Werkzeug==2.3.7",
],
},
}
MAP_VERSION_TO_INSTALL_FLASK.update(
{
k: {
"python": "3.11",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"pip_packages": [
"click==8.1.3",
"itsdangerous==2.1.2",
"Jinja2==3.1.2",
"MarkupSafe==2.1.1",
"Werkzeug==2.3.7",
],
}
for k in ["2.2", "2.3"]
}
)
MAP_VERSION_TO_INSTALL_DJANGO = {
k: {
"python": "3.5",
"packages": "requirements.txt",
"pre_install": [
"apt-get update && apt-get install -y locales",
"echo 'en_US UTF-8' > /etc/locale.gen",
"locale-gen en_US.UTF-8",
],
"install": "python setup.py install",
"pip_packages": ["setuptools"],
"eval_commands": [
"export LANG=en_US.UTF-8",
"export LC_ALL=en_US.UTF-8",
"export PYTHONIOENCODING=utf8",
"export LANGUAGE=en_US:en",
],
}
for k in ["1.7", "1.8", "1.9", "1.10", "1.11", "2.0", "2.1", "2.2"]
}
MAP_VERSION_TO_INSTALL_DJANGO.update(
{
k: {"python": "3.5", "install": "python setup.py install"}
for k in ["1.4", "1.5", "1.6"]
}
)
MAP_VERSION_TO_INSTALL_DJANGO.update(
{
k: {
"python": "3.6",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"eval_commands": [
"sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && locale-gen",
"export LANG=en_US.UTF-8",
"export LANGUAGE=en_US:en",
"export LC_ALL=en_US.UTF-8",
],
}
for k in ["3.0", "3.1", "3.2"]
}
)
MAP_VERSION_TO_INSTALL_DJANGO.update(
{
k: {
"python": "3.8",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
}
for k in ["4.0"]
}
)
MAP_VERSION_TO_INSTALL_DJANGO.update(
{
k: {
"python": "3.9",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
}
for k in ["4.1", "4.2"]
}
)
MAP_VERSION_TO_INSTALL_DJANGO.update(
{
k: {
"python": "3.11",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
}
for k in ["5.0"]
}
)
MAP_VERSION_TO_INSTALL_REQUESTS = {
k: {"python": "3.9", "packages": "pytest", "install": "python -m pip install ."}
for k in ["0.7", "0.8", "0.9", "0.11", "0.13", "0.14", "1.1", "1.2", "2.0", "2.2"]
+ ["2.3", "2.4", "2.5", "2.7", "2.8", "2.9", "2.10", "2.11", "2.12", "2.17"]
+ ["2.18", "2.19", "2.22", "2.26", "2.25", "2.27", "3.0"]
}
MAP_VERSION_TO_INSTALL_SEABORN = {
k: {
"python": "3.9",
"install": "python -m pip install -e .",
"pip_packages": [
"contourpy==1.1.0",
"cycler==0.11.0",
"fonttools==4.42.1",
"importlib-resources==6.0.1",
"kiwisolver==1.4.5",
"matplotlib==3.7.2",
"numpy==1.25.2",
"packaging==23.1",
"pandas==1.3.5", # 2.0.3
"pillow==10.0.0",
"pyparsing==3.0.9",
"pytest",
"python-dateutil==2.8.2",
"pytz==2023.3.post1",
"scipy==1.11.2",
"six==1.16.0",
"tzdata==2023.1",
"zipp==3.16.2",
],
}
for k in ["0.11"]
}
MAP_VERSION_TO_INSTALL_SEABORN.update(
{
k: {
"python": "3.9",
"install": "python -m pip install -e .[dev]",
"pip_packages": [
"contourpy==1.1.0",
"cycler==0.11.0",
"fonttools==4.42.1",
"importlib-resources==6.0.1",
"kiwisolver==1.4.5",
"matplotlib==3.7.2",
"numpy==1.25.2",
"packaging==23.1",
"pandas==2.0.0",
"pillow==10.0.0",
"pyparsing==3.0.9",
"pytest",
"python-dateutil==2.8.2",
"pytz==2023.3.post1",
"scipy==1.11.2",
"six==1.16.0",
"tzdata==2023.1",
"zipp==3.16.2",
],
}
for k in ["0.12", "0.13"]
}
)
MAP_VERSION_TO_INSTALL_PYTEST = {
k: {"python": "3.9", "install": "python -m pip install -e ."}
for k in [
"4.4",
"4.5",
"4.6",
"5.0",
"5.1",
"5.2",
"5.3",
"5.4",
"6.0",
"6.2",
"6.3",
"7.0",
"7.1",
"7.2",
"7.4",
"8.0",
]
}
MAP_VERSION_TO_INSTALL_PYTEST["4.4"]["pip_packages"] = [
"atomicwrites==1.4.1",
"attrs==23.1.0",
"more-itertools==10.1.0",
"pluggy==0.13.1",
"py==1.11.0",
"setuptools==68.0.0",
"six==1.16.0",
]
MAP_VERSION_TO_INSTALL_PYTEST["4.5"]["pip_packages"] = [
"atomicwrites==1.4.1",
"attrs==23.1.0",
"more-itertools==10.1.0",
"pluggy==0.11.0",
"py==1.11.0",
"setuptools==68.0.0",
"six==1.16.0",
"wcwidth==0.2.6",
]
MAP_VERSION_TO_INSTALL_PYTEST["4.6"]["pip_packages"] = [
"atomicwrites==1.4.1",
"attrs==23.1.0",
"more-itertools==10.1.0",
"packaging==23.1",
"pluggy==0.13.1",
"py==1.11.0",
"six==1.16.0",
"wcwidth==0.2.6",
]
for k in ["5.0", "5.1", "5.2"]:
MAP_VERSION_TO_INSTALL_PYTEST[k]["pip_packages"] = [
"atomicwrites==1.4.1",
"attrs==23.1.0",
"more-itertools==10.1.0",
"packaging==23.1",
"pluggy==0.13.1",
"py==1.11.0",
"wcwidth==0.2.6",
]
MAP_VERSION_TO_INSTALL_PYTEST["5.3"]["pip_packages"] = [
"attrs==23.1.0",
"more-itertools==10.1.0",
"packaging==23.1",
"pluggy==0.13.1",
"py==1.11.0",
"wcwidth==0.2.6",
]
MAP_VERSION_TO_INSTALL_PYTEST["5.4"]["pip_packages"] = [
"py==1.11.0",
"packaging==23.1",
"attrs==23.1.0",
"more-itertools==10.1.0",
"pluggy==0.13.1",
]
MAP_VERSION_TO_INSTALL_PYTEST["6.0"]["pip_packages"] = [
"attrs==23.1.0",
"iniconfig==2.0.0",
"more-itertools==10.1.0",
"packaging==23.1",
"pluggy==0.13.1",
"py==1.11.0",
"toml==0.10.2",
]
for k in ["6.2", "6.3"]:
MAP_VERSION_TO_INSTALL_PYTEST[k]["pip_packages"] = [
"attrs==23.1.0",
"iniconfig==2.0.0",
"packaging==23.1",
"pluggy==0.13.1",
"py==1.11.0",
"toml==0.10.2",
]
MAP_VERSION_TO_INSTALL_PYTEST["7.0"]["pip_packages"] = [
"attrs==23.1.0",
"iniconfig==2.0.0",
"packaging==23.1",
"pluggy==0.13.1",
"py==1.11.0",
]
for k in ["7.1", "7.2"]:
MAP_VERSION_TO_INSTALL_PYTEST[k]["pip_packages"] = [
"attrs==23.1.0",
"iniconfig==2.0.0",
"packaging==23.1",
"pluggy==0.13.1",
"py==1.11.0",
"tomli==2.0.1",
]
MAP_VERSION_TO_INSTALL_PYTEST["7.4"]["pip_packages"] = [
"iniconfig==2.0.0",
"packaging==23.1",
"pluggy==1.3.0",
"exceptiongroup==1.1.3",
"tomli==2.0.1",
]
MAP_VERSION_TO_INSTALL_PYTEST["8.0"]["pip_packages"] = [
"iniconfig==2.0.0",
"packaging==23.1",
"pluggy==1.3.0",
"exceptiongroup==1.1.3",
"tomli==2.0.1",
]
MAP_VERSION_TO_INSTALL_MATPLOTLIB = {
k: {
"python": "3.11",
"packages": "environment.yml",
"install": "python -m pip install -e .",
"pre_install": [
"apt-get -y update && apt-get -y upgrade && apt-get install -y imagemagick ffmpeg texlive texlive-latex-extra texlive-fonts-recommended texlive-xetex texlive-luatex cm-super dvipng"
],
"pip_packages": [
"contourpy==1.1.0",
"cycler==0.11.0",
"fonttools==4.42.1",
"ghostscript",
"kiwisolver==1.4.5",
"numpy==1.25.2",
"packaging==23.1",
"pillow==10.0.0",
"pikepdf",
"pyparsing==3.0.9",
"python-dateutil==2.8.2",
"six==1.16.0",
"setuptools==68.1.2",
"setuptools-scm==7.1.0",
"typing-extensions==4.7.1",
],
}
for k in ["3.5", "3.6", "3.7"]
}
MAP_VERSION_TO_INSTALL_MATPLOTLIB.update(
{
k: {
"python": "3.8",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"pre_install": [
"apt-get -y update && apt-get -y upgrade && apt-get install -y imagemagick ffmpeg libfreetype6-dev pkg-config texlive texlive-latex-extra texlive-fonts-recommended texlive-xetex texlive-luatex cm-super"
],
"pip_packages": ["pytest", "ipython"],
}
for k in ["3.1", "3.2", "3.3", "3.4"]
}
)
MAP_VERSION_TO_INSTALL_MATPLOTLIB.update(
{
k: {
"python": "3.7",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"pre_install": [
"apt-get -y update && apt-get -y upgrade && apt-get install -y imagemagick ffmpeg libfreetype6-dev pkg-config"
],
"pip_packages": ["pytest"],
}
for k in ["3.0"]
}
)
MAP_VERSION_TO_INSTALL_MATPLOTLIB.update(
{
k: {
"python": "3.5",
"install": "python setup.py build; python setup.py install",
"pre_install": [
"apt-get -y update && apt-get -y upgrade && && apt-get install -y imagemagick ffmpeg"
],
"pip_packages": ["pytest"],
"execute_test_as_nonroot": True,
}
for k in ["2.0", "2.1", "2.2", "1.0", "1.1", "1.2", "1.3", "1.4", "1.5"]
}
)
MAP_VERSION_TO_INSTALL_SPHINX = {
k: {
"python": "3.9",
"pip_packages": ["tox==4.16.0", "tox-current-env==0.0.11"],
"install": "python -m pip install -e .[test]",
"pre_install": ["sed -i 's/pytest/pytest -rA/' tox.ini"],
}
for k in ["1.5", "1.6", "1.7", "1.8", "2.0", "2.1", "2.2", "2.3", "2.4", "3.0"]
+ ["3.1", "3.2", "3.3", "3.4", "3.5", "4.0", "4.1", "4.2", "4.3", "4.4"]
+ ["4.5", "5.0", "5.1", "5.2", "5.3", "6.0", "6.2", "7.0", "7.1", "7.2"]
}
for k in ["3.0", "3.1", "3.2", "3.3", "3.4", "3.5", "4.0", "4.1", "4.2", "4.3", "4.4"]:
MAP_VERSION_TO_INSTALL_SPHINX[k][
"pre_install"
].extend([
"sed -i 's/Jinja2>=2.3/Jinja2<3.0/' setup.py",
"sed -i 's/sphinxcontrib-applehelp/sphinxcontrib-applehelp<=1.0.7/' setup.py",
"sed -i 's/sphinxcontrib-devhelp/sphinxcontrib-devhelp<=1.0.5/' setup.py",
"sed -i 's/sphinxcontrib-qthelp/sphinxcontrib-qthelp<=1.0.6/' setup.py",
"sed -i 's/alabaster>=0.7,<0.8/alabaster>=0.7,<0.7.12/' setup.py",
'sed -i "s/\'packaging\',/\'packaging\', \'markupsafe<=2.0.1\',/" setup.py',
])
if k in ["4.2", "4.3", "4.4"]:
MAP_VERSION_TO_INSTALL_SPHINX[k]["pre_install"].extend([
"sed -i 's/sphinxcontrib-htmlhelp>=2.0.0/sphinxcontrib-htmlhelp>=2.0.0,<=2.0.4/' setup.py",
"sed -i 's/sphinxcontrib-serializinghtml>=1.1.5/sphinxcontrib-serializinghtml>=1.1.5,<=1.1.9/' setup.py",
])
elif k == "4.1":
MAP_VERSION_TO_INSTALL_SPHINX[k]["pre_install"].extend([
(
"grep -q 'sphinxcontrib-htmlhelp>=2.0.0' setup.py && "
"sed -i 's/sphinxcontrib-htmlhelp>=2.0.0/sphinxcontrib-htmlhelp>=2.0.0,<=2.0.4/' setup.py || "
"sed -i 's/sphinxcontrib-htmlhelp/sphinxcontrib-htmlhelp<=2.0.4/' setup.py"
),
(
"grep -q 'sphinxcontrib-serializinghtml>=1.1.5' setup.py && "
"sed -i 's/sphinxcontrib-serializinghtml>=1.1.5/sphinxcontrib-serializinghtml>=1.1.5,<=1.1.9/' setup.py || "
"sed -i 's/sphinxcontrib-serializinghtml/sphinxcontrib-serializinghtml<=1.1.9/' setup.py"
)
])
else:
MAP_VERSION_TO_INSTALL_SPHINX[k]["pre_install"].extend([
"sed -i 's/sphinxcontrib-htmlhelp/sphinxcontrib-htmlhelp<=2.0.4/' setup.py",
"sed -i 's/sphinxcontrib-serializinghtml/sphinxcontrib-serializinghtml<=1.1.9/' setup.py",
])
MAP_VERSION_TO_INSTALL_SPHINX["7.2"]["pre_install"] += [
"apt-get update && apt-get install -y graphviz"
]
MAP_VERSION_TO_INSTALL_ASTROPY = {
k: {
"python": "3.9",
"install": "python -m pip install -e .[test] --verbose",
"pip_packages": [
"attrs==23.1.0",
"exceptiongroup==1.1.3",
"execnet==2.0.2",
"hypothesis==6.82.6",
"iniconfig==2.0.0",
"numpy==1.25.2",
"packaging==23.1",
"pluggy==1.3.0",
"psutil==5.9.5",
"pyerfa==2.0.0.3",
"pytest-arraydiff==0.5.0",
"pytest-astropy-header==0.2.2",
"pytest-astropy==0.10.0",
"pytest-cov==4.1.0",
"pytest-doctestplus==1.0.0",
"pytest-filter-subpackage==0.1.2",
"pytest-mock==3.11.1",
"pytest-openfiles==0.5.0",
"pytest-remotedata==0.4.0",
"pytest-xdist==3.3.1",
"pytest==7.4.0",
"PyYAML==6.0.1",
"setuptools==68.0.0",
"sortedcontainers==2.4.0",
"tomli==2.0.1",
],
}
for k in ["0.1", "0.2", "0.3", "0.4", "1.1", "1.2", "1.3", "3.0", "3.1", "3.2"]
+ ["4.1", "4.2", "4.3", "5.0", "5.1", "5.2"]
}
for k in ["4.1", "4.2", "4.3", "5.0", "5.1", "5.2"]:
MAP_VERSION_TO_INSTALL_ASTROPY[k]["pre_install"] = [
'sed -i \'s/requires = \\["setuptools",/requires = \\["setuptools==68.0.0",/\' pyproject.toml'
]
MAP_VERSION_TO_INSTALL_SYMPY = {
k: {
"python": "3.9",
"packages": "mpmath flake8",
"pip_packages": ["mpmath==1.3.0", "flake8-comprehensions"],
"install": "python -m pip install -e .",
}
for k in ["0.7", "1.0", "1.1", "1.10", "1.11", "1.12", "1.2", "1.4", "1.5", "1.6"]
+ ["1.7", "1.8", "1.9"]
}
MAP_VERSION_TO_INSTALL_SYMPY.update(
{
k: {
"python": "3.9",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"pip_packages": ["mpmath==1.3.0"],
}
for k in ["1.13"]
}
)
MAP_VERSION_TO_INSTALL_PYLINT = {
k: {
"python": "3.9",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
}
for k in [
"2.10",
"2.11",
"2.13",
"2.14",
"2.15",
"2.16",
"2.17",
"2.8",
"2.9",
"3.0",
]
}
MAP_VERSION_TO_INSTALL_PYLINT["2.8"]["pip_packages"] = ["pyenchant==3.2"]
MAP_VERSION_TO_INSTALL_PYLINT["2.8"]["pre_install"] = [
"apt-get update && apt-get install -y libenchant-2-dev hunspell-en-us"
]
MAP_VERSION_TO_INSTALL_PYLINT.update(
{
k: {
**MAP_VERSION_TO_INSTALL_PYLINT[k],
"pip_packages": ["astroid==3.0.0a6", "setuptools"],
}
for k in ["3.0"]
}
)
MAP_VERSION_TO_INSTALL_XARRAY = {
k: {
"python": "3.10",
"packages": "environment.yml",
"install": "python -m pip install -e .",
"pip_packages": [
"numpy==1.23.0",
"packaging==23.1",
"pandas==1.5.3",
"pytest==7.4.0",
"python-dateutil==2.8.2",
"pytz==2023.3",
"six==1.16.0",
"scipy==1.11.1",
"setuptools==68.0.0"
],
"no_use_env": True,
}
for k in ["0.12", "0.18", "0.19", "0.20", "2022.03", "2022.06", "2022.09"]
}
MAP_VERSION_TO_INSTALL_SQLFLUFF = {
k: {
"python": "3.9",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
}
for k in [
"0.10",
"0.11",
"0.12",
"0.13",
"0.4",
"0.5",
"0.6",
"0.8",
"0.9",
"1.0",
"1.1",
"1.2",
"1.3",
"1.4",
"2.0",
"2.1",
"2.2",
]
}
MAP_VERSION_TO_INSTALL_DBT_CORE = {
k: {
"python": "3.9",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
}
for k in [
"0.13",
"0.14",
"0.15",
"0.16",
"0.17",
"0.18",
"0.19",
"0.20",
"0.21",
"1.0",
"1.1",
"1.2",
"1.3",
"1.4",
"1.5",
"1.6",
"1.7",
]
}
MAP_VERSION_TO_INSTALL_PYVISTA = {
k: {
"python": "3.9",
"install": "python -m pip install -e .",
"pip_packages": ["pytest"],
}
for k in ["0.20", "0.21", "0.22", "0.23"]
}
MAP_VERSION_TO_INSTALL_PYVISTA.update(
{
k: {
"python": "3.9",
"packages": "requirements.txt",
"install": "python -m pip install -e .",
"pip_packages": ["pytest"],
}
for k in [
"0.24",
"0.25",
"0.26",
"0.27",
"0.28",
"0.29",
"0.30",
"0.31",
"0.32",
"0.33",
"0.34",
"0.35",
"0.36",
"0.37",
"0.38",
"0.39",
"0.40",
"0.41",
"0.42",
"0.43",
]
}
)
MAP_VERSION_TO_INSTALL_ASTROID = {
k: {
"python": "3.9",
"install": "python -m pip install -e .",
"pip_packages": ["pytest"],
}
for k in [
"2.10",
"2.12",
"2.13",
"2.14",
"2.15",
"2.16",
"2.5",
"2.6",
"2.7",
"2.8",
"2.9",
"3.0",
]
}
MAP_VERSION_TO_INSTALL_MARSHMALLOW = {
k: {
"python": "3.9",
"install": "python -m pip install -e '.[dev]'",
}
for k in [
"2.18",
"2.19",
"2.20",
"3.0",
"3.1",
"3.10",
"3.11",
"3.12",
"3.13",
"3.15",
"3.16",
"3.19",
"3.2",
"3.4",
"3.8",
"3.9",
]
}
MAP_VERSION_TO_INSTALL_PVLIB = {
k: {
"python": "3.9",
"install": "python -m pip install -e .[all]",
"packages": "pandas scipy",
"pip_packages": ["jupyter", "ipython", "matplotlib", "pytest", "flake8"],
}
for k in ["0.1", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9"]
}
MAP_VERSION_TO_INSTALL_PYDICOM = {
k: {"python": "3.6", "install": "python -m pip install -e .", "packages": "numpy"}
for k in [
"1.0",
"1.1",
"1.2",
"1.3",
"1.4",
"2.0",
"2.1",
"2.2",
"2.3",
"2.4",
"3.0",
]
}
MAP_VERSION_TO_INSTALL_PYDICOM.update(
{k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], "python": "3.8"} for k in ["1.4", "2.0"]}
)
MAP_VERSION_TO_INSTALL_PYDICOM.update(
{k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], "python": "3.9"} for k in ["2.1", "2.2"]}
)
MAP_VERSION_TO_INSTALL_PYDICOM.update(
{k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], "python": "3.10"} for k in ["2.3"]}
)
MAP_VERSION_TO_INSTALL_PYDICOM.update(
{k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], "python": "3.11"} for k in ["2.4", "3.0"]}
)
MAP_VERSION_TO_INSTALL_HUMANEVAL = {k: {"python": "3.9"} for k in ["1.0"]}
MAP_VERSION_TO_INSTALL_HUMANEVAL_FIX = {k: {"python": "3.10", "packages": "pytest"} for k in ["0.0.1"]}
# Constants - Task Instance Installation Environment
MAP_VERSION_TO_INSTALL = {
"astropy/astropy": MAP_VERSION_TO_INSTALL_ASTROPY,
"dbt-labs/dbt-core": MAP_VERSION_TO_INSTALL_DBT_CORE,
"django/django": MAP_VERSION_TO_INSTALL_DJANGO,
"matplotlib/matplotlib": MAP_VERSION_TO_INSTALL_MATPLOTLIB,
"marshmallow-code/marshmallow": MAP_VERSION_TO_INSTALL_MARSHMALLOW,
"mwaskom/seaborn": MAP_VERSION_TO_INSTALL_SEABORN,
"pallets/flask": MAP_VERSION_TO_INSTALL_FLASK,
"psf/requests": MAP_VERSION_TO_INSTALL_REQUESTS,
"pvlib/pvlib-python": MAP_VERSION_TO_INSTALL_PVLIB,
"pydata/xarray": MAP_VERSION_TO_INSTALL_XARRAY,
"pydicom/pydicom": MAP_VERSION_TO_INSTALL_PYDICOM,
"pylint-dev/astroid": MAP_VERSION_TO_INSTALL_ASTROID,
"pylint-dev/pylint": MAP_VERSION_TO_INSTALL_PYLINT,
"pytest-dev/pytest": MAP_VERSION_TO_INSTALL_PYTEST,
"pyvista/pyvista": MAP_VERSION_TO_INSTALL_PYVISTA,
"scikit-learn/scikit-learn": MAP_VERSION_TO_INSTALL_SKLEARN,
"sphinx-doc/sphinx": MAP_VERSION_TO_INSTALL_SPHINX,
"sqlfluff/sqlfluff": MAP_VERSION_TO_INSTALL_SQLFLUFF,
"swe-bench/humaneval": MAP_VERSION_TO_INSTALL_HUMANEVAL,
"nielstron/humaneval_fix": MAP_VERSION_TO_INSTALL_HUMANEVAL_FIX,
"sympy/sympy": MAP_VERSION_TO_INSTALL_SYMPY,
}
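# Example lookup (illustrative): consumers such as run_infer.py resolve the per-instance
# install spec via
#   install = MAP_VERSION_TO_INSTALL.get(instance['repo'], {}).get(instance['version'], {})
# e.g. MAP_VERSION_TO_INSTALL['django/django']['4.0'] yields
#   {'python': '3.8', 'packages': 'requirements.txt', 'install': 'python -m pip install -e .'}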
# Constants - Repository Specific Installation Instructions
MAP_REPO_TO_INSTALL = {}
# Constants - Task Instance Test Frameworks
TEST_PYTEST_VERBOSE = "pytest -rA --tb=long -p no:cacheprovider"
MAP_REPO_TO_TEST_FRAMEWORK_VERBOSE = {
"astropy/astropy": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_ASTROPY.keys()
},
"django/django": {
k: "./tests/runtests.py --verbosity 2 --settings=test_sqlite --parallel 1"
for k in MAP_VERSION_TO_INSTALL_DJANGO.keys()
},
"marshmallow-code/marshmallow": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_MARSHMALLOW.keys()
},
"matplotlib/matplotlib": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_MATPLOTLIB.keys()
},
"mwaskom/seaborn": {
k: "pytest -rA --tb=long" for k in MAP_VERSION_TO_INSTALL_SEABORN.keys()
},
"pallets/flask": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_FLASK.keys()
},
"psf/requests": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_REQUESTS.keys()
},
"pvlib/pvlib-python": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PVLIB.keys()
},
"pydata/xarray": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_XARRAY.keys()
},
"pydicom/pydicom": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PYDICOM.keys()
},
"pylint-dev/astroid": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_ASTROID.keys()
},
"pylint-dev/pylint": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PYLINT.keys()
},
"pytest-dev/pytest": {
k: "pytest -rA --tb=long" for k in MAP_VERSION_TO_INSTALL_PYTEST.keys()
},
"pyvista/pyvista": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PYVISTA.keys()
},
"scikit-learn/scikit-learn": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_SKLEARN.keys()
},
"sphinx-doc/sphinx": {
k: "tox -epy39 -v --" for k in MAP_VERSION_TO_INSTALL_SPHINX.keys()
},
"sqlfluff/sqlfluff": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_SQLFLUFF.keys()
},
"swe-bench/humaneval": {
k: "python" for k in MAP_VERSION_TO_INSTALL_HUMANEVAL.keys()
},
"nielstron/humaneval_fix": {
k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_HUMANEVAL_FIX.keys()
},
"sympy/sympy": {
k: "bin/test -C --verbose" for k in MAP_VERSION_TO_INSTALL_SYMPY.keys()
},
}
MAP_REPO_TO_TEST_FRAMEWORK["django/django"]["1.9"] = "./tests/runtests.py --verbosity 2"

View File

@@ -3,13 +3,18 @@ import copy
import json
import os
import tempfile
from typing import Any
from typing import Any, Literal
import pandas as pd
import toml
from datasets import load_dataset
import openhands.agenthub
from evaluation.benchmarks.swe_bench.resource.swt_bench_constants import (
MAP_REPO_TO_TEST_FRAMEWORK_VERBOSE,
MAP_REPO_TO_INSTALL,
MAP_VERSION_TO_INSTALL
)
from evaluation.benchmarks.swe_bench.binary_patch_utils import (
remove_binary_diffs,
remove_binary_files_from_git,
@@ -55,6 +60,7 @@ from openhands.utils.shutdown_listener import sleep_if_should_continue
USE_HINT_TEXT = os.environ.get('USE_HINT_TEXT', 'false').lower() == 'true'
RUN_WITH_BROWSING = os.environ.get('RUN_WITH_BROWSING', 'false').lower() == 'true'
BenchMode = Literal["swe", "swt", "swt-ci"]
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
@@ -68,7 +74,32 @@ def _get_swebench_workspace_dir_name(instance: pd.Series) -> str:
def get_instruction(instance: pd.Series, metadata: EvalMetadata) -> MessageAction:
workspace_dir_name = _get_swebench_workspace_dir_name(instance)
instruction = f"""
mode = metadata.details["mode"]
if mode.startswith('swt'):
test_instructions = f"The following command can be used to run the tests: `{list(MAP_REPO_TO_TEST_FRAMEWORK_VERBOSE[instance.repo].values())[0]}`. Make sure they fail in the expected way.\n" if mode.endswith("ci") else ""
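# For example, for django/django instances the verbose test command resolves to
# "./tests/runtests.py --verbosity 2 --settings=test_sqlite --parallel 1" (see swt_bench_constants.py); only used in swt-ci mode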
instruction = f"""\
<uploaded_files>
/workspace/{workspace_dir_name}
</uploaded_files>
I've uploaded a python code repository in the directory {workspace_dir_name}. Consider the following issue description:
<issue_description>
{instance.problem_statement}
</issue_description>
Can you help me implement the necessary changes to the repository to test whether the issue in <issue_description> was resolved?
I will take care of all changes to any of the non-test files. This means you DON'T have to modify the actual logic and ONLY have to update test logic and tests!
Your task is to make minimal changes to the test files in the /workspace directory to reproduce the issue in the <issue_description>, i.e., such that the generated tests fail in the current state (where the issue is unresolved) and pass once the issue is resolved.
Follow these steps to reproduce the issue:
1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure.
2. Create a script `reproduction.py` to reproduce the error and execute it with `python reproduction.py` using the BashTool, to confirm the error
3. Edit the source code of the repo to integrate your reproduction script into the test framework
4. Run the test framework and make sure your tests fail! Only submit FAILING tests! Never submit passing tests.
{test_instructions}Your thinking should be thorough and so it's fine if it's very long.
"""
else:
instruction = f"""
<uploaded_files>
/workspace/{workspace_dir_name}
</uploaded_files>
@@ -356,6 +387,29 @@ def initialize_runtime(
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert_and_raise(obs.exit_code == 0, f'Failed to remove git remotes: {str(obs)}')
if metadata.details["mode"] == "swt-ci":
# set up repo
setup_commands = []
if instance["repo"] in MAP_REPO_TO_INSTALL:
setup_commands.append(MAP_REPO_TO_INSTALL[instance["repo"]])
# Run pre-install set up if provided
install = MAP_VERSION_TO_INSTALL.get(instance['repo'], {}).get(instance['version'], [])
if "pre_install" in install:
for pre_install in install["pre_install"]:
setup_commands.append(pre_install)
if "install" in install:
setup_commands.append(install["install"])
for command in setup_commands:
action = CmdRunAction(command=command)
action.set_hard_timeout(600)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
if 'multimodal' not in metadata.dataset.lower():
# Only for non-multimodal datasets, we need to activate the testbed environment for Python
# SWE-Bench multimodal datasets are not using the testbed environment
@@ -678,6 +732,13 @@ if __name__ == '__main__':
default='test',
help='split to evaluate on',
)
parser.add_argument(
'--mode',
type=str,
default='swe',
choices=['swe', 'swt', 'swt-ci'],
help="mode to run the evaluation, either 'swe', 'swt', or 'swt-ci'",
)
args, _ = parser.parse_known_args()
# NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
@@ -714,7 +775,7 @@ if __name__ == '__main__':
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
details = {}
details = {"mode": args.mode}
_agent_cls = openhands.agenthub.Agent.get_cls(args.agent_cls)
dataset_descrption = (

View File

@@ -12,6 +12,7 @@ NUM_WORKERS=$6
DATASET=$7
SPLIT=$8
N_RUNS=$9
MODE=${10}
if [ -z "$NUM_WORKERS" ]; then
NUM_WORKERS=1
@@ -45,6 +46,11 @@ if [ -z "$SPLIT" ]; then
SPLIT="test"
fi
if [ -z "$MODE" ]; then
MODE="swe"
echo "MODE not specified, use default $MODE"
fi
export RUN_WITH_BROWSING=$RUN_WITH_BROWSING
echo "RUN_WITH_BROWSING: $RUN_WITH_BROWSING"
@@ -55,6 +61,10 @@ echo "OPENHANDS_VERSION: $OPENHANDS_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "DATASET: $DATASET"
echo "SPLIT: $SPLIT"
echo "MAX_ITER: $MAX_ITER"
echo "NUM_WORKERS: $NUM_WORKERS"
echo "COMMIT_HASH: $COMMIT_HASH"
echo "MODE: $MODE"
# Default to NOT use Hint
if [ -z "$USE_HINT_TEXT" ]; then
@@ -74,9 +84,13 @@ fi
if [ -n "$EXP_NAME" ]; then
EVAL_NOTE="$EVAL_NOTE-$EXP_NAME"
fi
# if mode != swe, add mode to the eval note
if [ "$MODE" != "swe" ]; then
EVAL_NOTE="${EVAL_NOTE}-${MODE}"
fi
function run_eval() {
local eval_note=$1
local eval_note="${1}"
COMMAND="poetry run python evaluation/benchmarks/swe_bench/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
@@ -84,7 +98,8 @@ function run_eval() {
--eval-num-workers $NUM_WORKERS \
--eval-note $eval_note \
--dataset $DATASET \
--split $SPLIT"
--split $SPLIT \
--mode $MODE"
if [ -n "$EVAL_LIMIT" ]; then
echo "EVAL_LIMIT: $EVAL_LIMIT"

View File

@@ -0,0 +1,73 @@
import json
import argparse
import logging
import unidiff
from evaluation.benchmarks.swe_bench.resource.swt_bench_constants import MAP_VERSION_TO_INSTALL
_LOGGER = logging.getLogger(__name__)
def remove_setup_files(model_patch: str, instance: dict, delete_setup_changes: bool):
""" Discard all changes that a patch applies to files changes by the pre_install script and that are reproduction scripts (top-level script)"""
setup_files = ["setup.py", "tox.ini", "pyproject.toml"]
pre_install = MAP_VERSION_TO_INSTALL.get(instance["repo"], {}).get(instance["version"], {}).get("pre_install", [])
relevant_files = [
file
for file in setup_files
if any(file in install and "sed" in install for install in pre_install)
] if delete_setup_changes else []
for i in range(10):
try:
# Apparently outputs.jsonl has .strip() applied, so we try to reconstruct the original patch by appending trailing newlines
patch = unidiff.PatchSet(model_patch + i*"\n")
break
except unidiff.UnidiffParseError as e:
pass
to_delete = []
for i, file in enumerate(patch):
if any(f in file.source_file for f in relevant_files) or file.target_file.count("/") == 1:
to_delete.append(i)
for i in reversed(to_delete):
del patch[i]
return str(patch)
def main(
prediction_file: str,
):
"""Main function to extract the model patches from the OpenHands prediction file and turn them into the expected SWT-Bench format."""
with open(prediction_file) as f:
for line in f:
pred = json.loads(line)
try:
git_diff = pred["test_result"]["git_patch"]
except KeyError:
_LOGGER.warning("Warning: No git diff found for instance %s", pred["instance_id"])
continue
ci_mode = pred["metadata"]["details"].get("mode", "") == "swt-ci"
try:
git_diff = remove_setup_files(git_diff, pred["instance"], ci_mode)
except Exception:
_LOGGER.warning("Warning: Invalid git diff found for instance %s", pred["instance_id"])
print(json.dumps({
"instance_id": pred["instance_id"],
"model_name_or_path": f'{pred["metadata"]["llm_config"]["openrouter_app_name"]}__{pred["metadata"]["agent_class"]}__{pred["metadata"]["llm_config"]["model"]}',
"model_patch": git_diff,
"full_output": json.dumps(pred),
}))
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
"--prediction_file",
type=str,
required=True,
help="Path to the prediction file (.../outputs.jsonl)",
)
args = parser.parse_args()
main(args.prediction_file)