[eval] SWE-Gym Integration (#6651)

Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
This commit is contained in:
Xingyao Wang
2025-03-05 15:15:02 -05:00
committed by GitHub
parent bbf40c6576
commit 9f720a9d69
10 changed files with 638 additions and 20 deletions

View File

@@ -2,6 +2,8 @@
This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)).
**UPDATE (2/18/2025): We now support running SWE-Gym using the same evaluation harness here. For more details, checkout [this README](./SWE-Gym.md).
**UPDATE (7/1/2024): We now support the official SWE-Bench dockerized evaluation as announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**
The evaluation consists of three steps:

View File

@@ -0,0 +1,128 @@
<h1 align="center"> Training Software Engineering Agents and Verifiers with SWE-Gym </h1>
<p align="center">
<a href="https://www.jiayipan.com/" style="text-decoration: none;">Jiayi Pan<sup>*,1</sup></a>,
<a href="https://xwang.dev/" style="text-decoration: none;">Xingyao Wang<sup>*,2</sup></a>,
<a href="https://www.phontron.com/" style="text-decoration: none;">Graham Neubig<sup>3</sup></a>,
<a href="https://www.cs.toronto.edu/~ndjaitly/" style="text-decoration: none;">Navdeep Jaitly<sup>4</sup></a>,
<a href="https://blender.cs.illinois.edu/hengji.html" style="text-decoration: none;">Heng Ji<sup>2</sup></a>,
<a href="https://www.alanesuhr.com/" style="text-decoration: none;">Alane Suhr<sup>^,1</sup></a>,
<a href="https://dreasysnail.github.io/" style="text-decoration: none;">Yizhe Zhang<sup>^,4</sup></a>
</p>
<p align="center">
<sup>1</sup>UC Berkeley, <sup>2</sup>UIUC, <sup>3</sup>CMU, <sup>4</sup>Apple </br>
<sub><sup>*</sup>Equal contribution, <sup>^</sup>Equal supervision</sub>
</p>
<p align="center">
<a href="https://arxiv.org/abs/2412.21139">📃 Paper</a>
<a href="https://huggingface.co/SWE-Gym" >🤗 Data & Models</a>
</p>
We present **SWE-Gym**, the first environment for training real-world software engineering agents.
We use it to train strong LM agents that achieve state-of-the-art open results on SWE-Bench, with early, promising scaling characteristics as we increase training and inference-time compute.
<p align="center">
<img src="https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/teaser.jpg?raw=true" width="100%" alt="teaser">
</p>
---
# Run SWE-Gym with OpenHands
The process of running SWE-Gym is very similar to how you'd run SWE-Bench evaluation.
1. First, clone OpenHands repo `git clone https://github.com/All-Hands-AI/OpenHands.git`
2. Then setup the repo following [Development.md](https://github.com/All-Hands-AI/OpenHands/blob/main/Development.md)
3. Then you can simply serve your own model as an OpenAI compatible endpoint, put those info in config.toml. You can do this by following instruction [here](../../README.md#setup).
4. And then simply do the following to sample for 16x parallelism:
```bash
export ALLHANDS_API_KEY=ah-yourkey # You don't need to set this when running these in local docker container
./evaluation/benchmarks/swe_bench/scripts/rollout_swegym.sh llm.mymodel-temp05 'train-t05' 16
```
NOTE: SWE-Gym sampling with parallelism is currently only tested with AllHands RemoteRuntime (limited beta). Fill [this form](https://docs.google.com/forms/d/e/1FAIpQLSckVz_JFwg2_mOxNZjCtr7aoBFI2Mwdan3f75J_TrdMS1JV2g/viewform) to apply for access.
5. When `rollout_swegym.sh` finishes, you will get a file called `output.with_completions.jsonl.gz`. Then you can use [`./scripts/swegym/convert_data.ipynb`](./scripts/swegym/convert_data.ipynb) to convert them into SFT data format.
---
# More info about SWE-Gym
Progress in agents for software engineering has been limited by the lack of training environments that both include rigorous verification for reinforcement learning and cover the expansive tasks encountered in real-world repository-level engineering.
We introduce SWE-Gym: An Open Environment for Training Software Engineering Agents & Verifiers.
Our baselines achieve new open SOTA - 32%/26% on SWE-Bench Verified/Lite, with promising scaling trends.
![SWE-Gym Scaling](https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/scaling.jpg?raw=true)
*SWE-Gym enables scalable improvements for software engineering agents at both training and inference time. Our current results is primarily bottlenecked by training and inference compute, rather than the size of our environment.*
## SWE-Gym Environment
We create SWE-Gym, the first environment for training SWE agents, with **2.4K real tasks from 11 Python repos** & a Lite split of 234 instances. SWE-Gym combines real-world Python tasks, repository context, executable environments, and test verification to train agents for solving software engineering problems.
![SWE-Gym Repo Distribution](https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/swe-gym.jpg?raw=true)
## SWE-Gym trains LMs as agents
When fine-tuned on less than 500 agent-environment interaction trajectories sampled from it from GPT-4o and Claude 3.5 Sonnet, we achieve **+14%** absolute gains on SWE-Bench Verified with an 32B LM-powered OpenHands agent.
![OpenHands Performance diff before and after training](https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/oh-agent.jpg?raw=true)
## SWE-Gym enables self-improvement
SWE-Gym is also effective across agent scaffolds. With rejection sampling fine-tuning and MoatlessTools scaffold, our 32B and 7B models achieve 20% and 10% respectively on SWE-Bench Lite through self-improvement.
<p align="center">
<img src="https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/ml-agent.jpg?raw=true" width="80%" alt="Moatless self-improvement">
</p>
## SWE-Gym enables inference-time scaling
SWE-Gym enables inference-time scaling through verifiers trained on agent trajectories.
These verifiers identify most promising solutions via best-of-n selection, together with our learned agents, they achieve 32%/26% on SWE-Bench Verified/Lite, a new open SoTA.
![Inference Time Scaling for Moatless Agent](https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/inference-ml.jpg?raw=true)
*Inference Time Scaling for Moatless Agent*
![Inference Time Scaling for OpenHands Agent](https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/inference-oh.jpg?raw=true)
*Inference Time Scaling for OpenHands Agent*
## Our baselines on SWE-Gym shows strong scaling trends
Lastly, our ablations reveal strong scaling trends - performance is now bottlenecked by train and inference compute, rather than the size of our dataset. Pushing and improving these scaling trends further is an exciting direction for future work.
![](https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/scaling.jpg?raw=true)
## Reproducing Results
**The Dataset**
To access SWE-Gym dataset, checkout our huggingface hub page [SWE-Gym](https://huggingface.co/SWE-Gym)
The environment constants are currently saved at [SWE-Bench-Fork](https://github.com/SWE-Gym/SWE-Bench-Fork)
We also have pre-built docker images for each instance under [xingyaoww/sweb.eval.x86_64](https://hub.docker.com/search?q=xingyaoww%2Fsweb.eval.x86_64.) prefix at docker hub.
## 📚 Citation
```bibtex
@misc{pan2024trainingsoftwareengineeringagents,
title={Training Software Engineering Agents and Verifiers with SWE-Gym},
author={Jiayi Pan and Xingyao Wang and Graham Neubig and Navdeep Jaitly and Heng Ji and Alane Suhr and Yizhe Zhang},
year={2024},
eprint={2412.21139},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2412.21139},
}
```

View File

@@ -6,17 +6,6 @@ import time
from functools import partial
import pandas as pd
from swebench.harness.grading import get_eval_report
from swebench.harness.run_evaluation import (
APPLY_PATCH_FAIL,
APPLY_PATCH_PASS,
)
from swebench.harness.test_spec.test_spec import (
SWEbenchInstance,
TestSpec,
make_test_spec,
)
from swebench.harness.utils import load_swebench_dataset
from tqdm import tqdm
from evaluation.benchmarks.swe_bench.resource.mapping import (
@@ -348,6 +337,31 @@ if __name__ == '__main__':
)
args, _ = parser.parse_known_args()
if 'SWE-Gym' in args.dataset:
from swegym.harness.grading import get_eval_report
from swegym.harness.run_evaluation import (
APPLY_PATCH_FAIL,
APPLY_PATCH_PASS,
)
from swegym.harness.test_spec import (
SWEbenchInstance,
TestSpec,
make_test_spec,
)
from swegym.harness.utils import load_swebench_dataset
else: # Newer version of SWE-Bench have different import paths
from swebench.harness.grading import get_eval_report
from swebench.harness.run_evaluation import (
APPLY_PATCH_FAIL,
APPLY_PATCH_PASS,
)
from swebench.harness.test_spec.test_spec import (
SWEbenchInstance,
TestSpec,
make_test_spec,
)
from swebench.harness.utils import load_swebench_dataset
# Load SWE-Bench dataset
full_dataset: list[SWEbenchInstance] = load_swebench_dataset(
args.dataset, args.split

View File

@@ -533,6 +533,20 @@ def filter_dataset(dataset: pd.DataFrame, filter_column: str) -> pd.DataFrame:
return dataset
# A list of instances that are known to be tricky to infer
# (will cause runtime failure even with resource factor = 8)
SWEGYM_EXCLUDE_IDS = [
'dask__dask-10422',
'pandas-dev__pandas-50548',
'pandas-dev__pandas-53672',
'pandas-dev__pandas-54174',
'pandas-dev__pandas-55518',
'pandas-dev__pandas-58383',
'pydata__xarray-6721',
'pytest-dev__pytest-10081',
'pytest-dev__pytest-7236',
]
if __name__ == '__main__':
parser = get_parser()
parser.add_argument(
@@ -556,6 +570,13 @@ if __name__ == '__main__':
logger.info(
f'Loaded dataset {args.dataset} with split {args.split}: {len(swe_bench_tests)} tasks'
)
if 'SWE-Gym' in args.dataset:
swe_bench_tests = swe_bench_tests[
~swe_bench_tests['instance_id'].isin(SWEGYM_EXCLUDE_IDS)
]
logger.info(
f'{len(swe_bench_tests)} tasks left after excluding SWE-Gym excluded tasks'
)
llm_config = None
if args.llm_config:
@@ -599,6 +620,6 @@ if __name__ == '__main__':
output_file,
args.eval_num_workers,
process_instance,
timeout_seconds=120 * 60, # 2 hour PER instance should be more than enough
timeout_seconds=8 * 60 * 60, # 8 hour PER instance should be more than enough
max_retries=5,
)

View File

@@ -24,7 +24,7 @@ def load_completions(output_dir: str, instance_id: str):
# create messages
messages = result['messages']
messages.append(result['response']['choices'][0]['message'])
tools = result['kwargs']['tools']
tools = result['kwargs'].get('tools', [])
return {
'messages': messages,
'tools': tools,

View File

@@ -75,7 +75,7 @@ def load_completions(instance_id: str):
# create messages
messages = result['messages']
messages.append(result['response']['choices'][0]['message'])
tools = result['kwargs']['tools']
tools = result['kwargs'].get('tools', None)
return {
'messages': messages,
'tools': tools,

View File

@@ -0,0 +1,132 @@
#!/bin/bash
# NOTE: this script is for rolling out the SWE-Gym dataset for **TRAINING**
# For more information, please refer to
# 1. the Github Repo: https://github.com/SWE-Gym/SWE-Gym
# 2. the paper: https://arxiv.org/abs/2412.21139
MODEL=$1 # eg your llm config name in config.toml (eg: "llm.claude-3-5-sonnet-20241022-t05")
EXP_NAME=$2 # "train-t05"
N_WORKERS=${3:-64}
N_RUNS=${4:-1}
export EXP_NAME=$EXP_NAME
# use 2x resources for rollout since some codebases are pretty resource-intensive
export DEFAULT_RUNTIME_RESOURCE_FACTOR=2
echo "MODEL: $MODEL"
echo "EXP_NAME: $EXP_NAME"
DATASET="SWE-Gym/SWE-Gym" # change this to the "/SWE-Gym-Lite" if you want to rollout the lite subset
SPLIT="train"
if [ -z "$ALLHANDS_API_KEY" ]; then
echo "ALLHANDS_API_KEY is not set. Will rollout and evaluate locally using Docker. WARNING: A large value of N_WORKERS will result in a large number of Docker containers being spun up and may crash your machine."
export RUNTIME=docker
else
echo "ALLHANDS_API_KEY is set. Continuing rollout and evaluation with remote runtime..."
export RUNTIME=remote
export SANDBOX_REMOTE_RUNTIME_API_URL="https://runtime.eval.all-hands.dev"
export EVAL_DOCKER_IMAGE_PREFIX="us-central1-docker.pkg.dev/evaluation-092424/swe-bench-images"
fi
EVAL_LIMIT=3000
MAX_ITER=100
# ===== Run inference =====
source "evaluation/utils/version_control.sh"
get_openhands_version
echo "OPENHANDS_VERSION: $OPENHANDS_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "DATASET: $DATASET"
echo "SPLIT: $SPLIT"
# Default to NOT use Hint
export USE_INSTANCE_IMAGE=true
export USE_HINT_TEXT=false
export RUN_WITH_BROWSING=false
echo "USE_HINT_TEXT: $USE_HINT_TEXT"
EVAL_NOTE="$OPENHANDS_VERSION-no-hint-$EXP_NAME"
function run_eval() {
local eval_note=$1
COMMAND="poetry run python evaluation/benchmarks/swe_bench/run_infer.py \
--agent-cls CodeActAgent \
--llm-config $MODEL \
--max-iterations $MAX_ITER \
--eval-num-workers $N_WORKERS \
--eval-note $eval_note \
--dataset $DATASET \
--split $SPLIT"
if [ -n "$EVAL_LIMIT" ]; then
echo "EVAL_LIMIT: $EVAL_LIMIT"
COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
fi
# Run the command
eval $COMMAND
}
for run_idx in $(seq 1 $N_RUNS); do
while true; do
echo "### Running inference... ###"
unset SANDBOX_ENV_GITHUB_TOKEN # prevent the agent from using the github token to push
current_eval_note="$EVAL_NOTE-run_$run_idx"
echo "EVAL_NOTE: $current_eval_note"
INFER_OUTPUT=$(run_eval $current_eval_note)
INFER_STATUS=$? # Capture the exit status of run_infer.sh
echo "INFER_STATUS: $INFER_STATUS"
echo "### Cleaning up remote runtime... ###"
./evaluation/utils/scripts/cleanup_remote_runtime.sh
if [ $INFER_STATUS -eq 0 ]; then
echo "### Inference completed successfully. ###"
break
else
echo "### Inference failed with exit code $INFER_STATUS. Retrying... ###"
fi
done
# Extract the output directory using the special delimiters
OUTPUT_FILE=$(echo "$INFER_OUTPUT" | grep -o '### OUTPUT FILE:.* ###' | sed 's/### OUTPUT FILE: \(.*\) ###/\1/')
echo "Got OUTPUT_FILE: $OUTPUT_FILE"
while true; do
echo "### Evaluating on $OUTPUT_FILE ... ###"
COMMAND="poetry run python evaluation/benchmarks/swe_bench/eval_infer.py \
--eval-num-workers $((N_WORKERS * 2)) \
--input-file $OUTPUT_FILE \
--dataset $DATASET \
--split $SPLIT"
if [ -n "$EVAL_LIMIT" ]; then
echo "EVAL_LIMIT: $EVAL_LIMIT"
COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
fi
echo "Running command: $COMMAND"
# Run the command
eval $COMMAND
EVAL_STATUS=$?
if [ $EVAL_STATUS -eq 0 ]; then
echo "### Evaluation completed successfully. ###"
break
else
echo "### Evaluation failed with exit code $EVAL_STATUS. Retrying... ###"
fi
./evaluation/utils/scripts/cleanup_remote_runtime.sh
done
# update the output with evaluation results
echo "### Updating the output with evaluation results... ###"
poetry run python evaluation/benchmarks/swe_bench/scripts/eval/update_output_with_eval.py $OUTPUT_FILE
echo "### Combining the final completions... ###"
poetry run python evaluation/benchmarks/swe_bench/scripts/eval/combine_final_completions.py $OUTPUT_FILE
echo "### DONE for run $run_idx! ###"
echo "You can find the final output at $(dirname $OUTPUT_FILE)/$FINAL_OUTPUT_FILE"
done

View File

@@ -0,0 +1,287 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import pandas as pd\n",
"from tqdm import tqdm\n",
"\n",
"tqdm.pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Load raw data and convert to training data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gzip\n",
"import json\n",
"\n",
"from tqdm import tqdm\n",
"\n",
"FILE_PATHS = [\n",
" 'YOURPATH-no-hint-train-t05-run_1/output.with_completions.jsonl.gz',\n",
" 'YOURPATH-no-hint-train-t05-run_2/output.with_completions.jsonl.gz',\n",
"]\n",
"\n",
"# More memory efficient for large files\n",
"# Initialize lists to store the data\n",
"data = []\n",
"\n",
"\n",
"# Read file line by line\n",
"for FILE_PATH in FILE_PATHS:\n",
" with gzip.open(FILE_PATH, 'rb') as f: # Use 'rb' for gzipped files\n",
" for i, line in tqdm(\n",
" enumerate(f), desc=f\"Processing {FILE_PATH.split('/')[-1]}\"\n",
" ):\n",
" # Parse only the fields we need\n",
" raw_data = json.loads(line)\n",
" data.append(\n",
" {\n",
" 'resolved': raw_data['report']['resolved'],\n",
" 'messages': raw_data['raw_completions']['messages']\n",
" if raw_data['raw_completions'] is not None\n",
" else None,\n",
" 'git_patch': raw_data['test_result'].get('git_patch', ''),\n",
" 'tools': raw_data['raw_completions']['tools']\n",
" if raw_data['raw_completions'] is not None\n",
" and 'tools' in raw_data['raw_completions']\n",
" else None,\n",
" }\n",
" )\n",
"\n",
"# Convert to DataFrame after collecting all data\n",
"df = pd.DataFrame(data)\n",
"print(f'#total amount of data={len(df)}')\n",
"df = df[~df['messages'].isna()]\n",
"print(f'#total amount of data after removing nan={len(df)}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def _contains_multiple_tool_calls(messages: list[dict]) -> bool:\n",
" return any(\n",
" message.get('tool_calls') and len(message['tool_calls']) > 1\n",
" for message in messages\n",
" )\n",
"\n",
"\n",
"df['contains_multiple_tool_calls'] = df['messages'].apply(_contains_multiple_tool_calls)\n",
"display(df.groupby(['contains_multiple_tool_calls'])['resolved'].sum())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import copy\n",
"\n",
"# Convert function calling messages to non-function calling messages\n",
"from openhands.llm.fn_call_converter import (\n",
" FunctionCallConversionError,\n",
" convert_fncall_messages_to_non_fncall_messages,\n",
" convert_from_multiple_tool_calls_to_single_tool_call_messages,\n",
")\n",
"\n",
"total_failed = 0\n",
"\n",
"\n",
"def _convert_messages(messages: list[dict], tools: list[dict]) -> list[dict]:\n",
" global total_failed\n",
" message_copy = copy.deepcopy(messages)\n",
" for message in message_copy:\n",
" if message['content'] is None:\n",
" message['content'] = ''\n",
" try:\n",
" return convert_fncall_messages_to_non_fncall_messages(\n",
" message_copy, tools, add_in_context_learning_example=False\n",
" )\n",
" except FunctionCallConversionError:\n",
" total_failed += 1\n",
" # print(f'Failed to convert messages: {messages}\\nTools: {tools}')\n",
" # traceback.print_exc()\n",
" return None\n",
"\n",
"\n",
"df['converted_messages'] = df.apply(\n",
" lambda row: convert_from_multiple_tool_calls_to_single_tool_call_messages(\n",
" row['messages'], ignore_final_tool_result=True\n",
" ),\n",
" axis=1,\n",
")\n",
"df['nonfncall_messages'] = df.apply(\n",
" lambda row: _convert_messages(row['converted_messages'], row['tools']), axis=1\n",
")\n",
"print('total nan', df['nonfncall_messages'].isna().sum())\n",
"df = df[~df['nonfncall_messages'].isna()]\n",
"print(f'Total failed: {total_failed}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tokenization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pandarallel import pandarallel\n",
"from transformers import AutoTokenizer\n",
"\n",
"os.environ['TOKENIZERS_PARALLELISM'] = 'false'\n",
"pandarallel.initialize(progress_bar=True, verbose=1, nb_workers=16)\n",
"tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-7B-Instruct')\n",
"df['n_tokens'] = df['rm_conv'].parallel_apply(\n",
" lambda x: len(tokenizer.apply_chat_template(x))\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f'BEFORE: #total={len(df)}')\n",
"df_selected = df[df['n_tokens'] < 131072]\n",
"print(f'AFTER(truncated to 128k): #total={len(df_selected)}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_selected['n_tokens'].describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ecdf of n_tokens\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"display(df.groupby(['resolved'])['n_tokens'].describe())\n",
"sns.ecdfplot(x='n_tokens', data=df, hue='resolved')\n",
"plt.show()\n",
"\n",
"print(f'#total={len(df)}')\n",
"df_selected = df[df['n_tokens'] < 131072]\n",
"print(f'#selected={len(df_selected)}')\n",
"display(df_selected.groupby(['resolved'])['n_tokens'].describe())\n",
"sns.ecdfplot(x='n_tokens', data=df_selected, hue='resolved')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_selected[~df_selected['resolved']]['n_tokens'].describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_selected['resolved'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_selected.groupby(['resolved'])['n_tokens'].describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Save Resolved Messages for SFT"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_selected[df_selected['resolved']][['nonfncall_messages']].rename(\n",
" columns={'nonfncall_messages': 'messages'}\n",
").to_json(\n",
" os.path.join(\n",
" 'YOUR_OUTPUT_FOLDER',\n",
" f'policy_traj_128k_swegym_{df_selected[\"resolved\"].value_counts()[True]}i.jsonl',\n",
" ),\n",
" lines=True,\n",
" orient='records',\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "openhands-ai-CPy6G0pU-py3.12",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

45
poetry.lock generated
View File

@@ -1,4 +1,4 @@
# This file is automatically @generated by Poetry 2.0.1 and should not be changed by hand.
# This file is automatically @generated by Poetry 2.0.0 and should not be changed by hand.
[[package]]
name = "aiohappyeyeballs"
@@ -8938,7 +8938,7 @@ files = [
[package.dependencies]
greenlet = [
{version = "!=0.4.17", optional = true, markers = "python_version < \"3.14\" and (platform_machine == \"aarch64\" or platform_machine == \"ppc64le\" or platform_machine == \"x86_64\" or platform_machine == \"amd64\" or platform_machine == \"AMD64\" or platform_machine == \"win32\" or platform_machine == \"WIN32\") or extra == \"asyncio\""},
{version = "!=0.4.17", markers = "python_version < \"3.14\" and (platform_machine == \"aarch64\" or platform_machine == \"ppc64le\" or platform_machine == \"x86_64\" or platform_machine == \"amd64\" or platform_machine == \"AMD64\" or platform_machine == \"win32\" or platform_machine == \"WIN32\")"},
{version = "!=0.4.17", optional = true, markers = "python_version < \"3.14\" and (platform_machine == \"aarch64\" or platform_machine == \"ppc64le\" or platform_machine == \"x86_64\" or platform_machine == \"amd64\" or platform_machine == \"AMD64\" or platform_machine == \"win32\" or platform_machine == \"WIN32\") or extra == \"asyncio\""},
]
typing-extensions = ">=4.6.0"
@@ -9109,14 +9109,14 @@ files = [
[[package]]
name = "swebench"
version = "3.0.13"
version = "3.0.15"
description = "The official SWE-bench package - a benchmark for evaluating LMs on software engineering"
optional = false
python-versions = ">=3.8"
groups = ["evaluation"]
files = [
{file = "swebench-3.0.13-py3-none-any.whl", hash = "sha256:0949e0a7269fcebb287dd951d14c049bd8189c7740fc4878354dbec756531c0f"},
{file = "swebench-3.0.13.tar.gz", hash = "sha256:d1cce406d0674cb1f3ca7da90089644d1ded3649c98f239a5a7ef4829d2f7c58"},
{file = "swebench-3.0.15-py3-none-any.whl", hash = "sha256:dd694356f9c155a55d3d2e113fe58446f7385eea0574230af5e2504426f8b85b"},
{file = "swebench-3.0.15.tar.gz", hash = "sha256:24e734fbcce34082665a25719075e6899382b7135103dd8c6cc09a6e23789101"},
]
[package.dependencies]
@@ -9139,6 +9139,39 @@ unidiff = "*"
inference = ["anthropic", "flash_attn", "jedi", "openai", "peft", "protobuf", "sentencepiece", "tiktoken", "torch", "transformers", "triton"]
test = ["pytest", "pytest-cov"]
[[package]]
name = "swegym"
version = "2.0.13"
description = "Fork of SWE-bench package - a benchmark for evaluating LMs on software engineering"
optional = false
python-versions = ">=3.8"
groups = ["evaluation"]
files = []
develop = false
[package.dependencies]
beautifulsoup4 = "*"
chardet = "*"
datasets = "*"
docker = "*"
ghapi = "*"
GitPython = "*"
pre-commit = "*"
python-dotenv = "*"
requests = "*"
rich = "*"
tqdm = "*"
unidiff = "*"
[package.extras]
inference = ["anthropic", "flash_attn", "jedi", "openai", "peft", "protobuf", "sentencepiece", "tenacity", "tiktoken", "torch", "transformers", "triton"]
[package.source]
type = "git"
url = "https://github.com/SWE-Gym/SWE-Bench-Package.git"
reference = "HEAD"
resolved_reference = "16dd480cce9b27bf111a362d280881c6def5d2a7"
[[package]]
name = "sympy"
version = "1.13.1"
@@ -10855,4 +10888,4 @@ testing = ["coverage[toml]", "zope.event", "zope.testing"]
[metadata]
lock-version = "2.1"
python-versions = "^3.12"
content-hash = "83da0b681253a79417c9842862cdd102c1ab6e8770d9dd9e0c42bc7994be2cd0"
content-hash = "c3f32c54606e5f313d9a909625f77cc3d575bf951e986633bcecd94520f36450"

View File

@@ -146,6 +146,7 @@ whatthepatch = "*"
retry = "*"
evaluate = "*"
swebench = "^3.0.8"
swegym = { git = "https://github.com/SWE-Gym/SWE-Bench-Package.git" }
commit0 = "*"
func_timeout = "*"
sympy = "*"