Merge branch 'main' into chuck-build

test
issue #9388, this will fix the issue (#10450)
2026-04-29 03:00:45 -04:00 · 2025-09-23 14:25:16 -04:00 · 2025-09-23 14:19:20 -04:00 · 2025-09-22 16:56:53 -04:00 · 2025-09-22 20:35:30 +00:00 · 2025-09-22 15:56:26 -04:00
133 changed files with 6620 additions and 1932 deletions
--- a/.github/workflows/stale.yml
+++ b/.github/workflows/stale.yml
@@ -15,7 +15,7 @@ jobs:
          stale-issue-message: 'This issue is stale because it has been open for 40 days with no activity. Remove the stale label or leave a comment, otherwise it will be closed in 10 days.'
          stale-pr-message: 'This PR is stale because it has been open for 40 days with no activity. Remove the stale label or leave a comment, otherwise it will be closed in 10 days.'
          days-before-stale: 40
-          exempt-issue-labels: roadmap,backlog
+          exempt-issue-labels: roadmap,backlog,app-team
          close-issue-message: 'This issue was automatically closed due to 50 days of inactivity. We do this to help keep the issues somewhat manageable and focus on active issues.'
          close-pr-message: 'This PR was closed because it had no activity for 50 days. If you feel this was closed in error, and you would like to continue the PR, please resubmit or let us know.'
          days-before-close: 10
--- a/docs/usage/llms/openhands-llms.mdx
+++ b/docs/usage/llms/openhands-llms.mdx
@@ -30,6 +30,20 @@ When running OpenHands, you'll need to set the following in the OpenHands UI thr

 ## Pricing

-Pricing follows official API provider rates. [You can view model prices here.](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json)
+Pricing follows official API provider rates. Below are the current pricing details for OpenHands models:

-For `qwen3-coder-480b`, we charge the cheapest FP8 rate available on openrouter: \$0.4 per million input tokens and \$1.6 per million output tokens.
+| Model | Input Cost (per 1M tokens) | Cached Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Max Input Tokens | Max Output Tokens |
+|-------|----------------------------|-----------------------------------|------------------------------|------------------|-------------------|
+| claude-opus-4-20250514 | $15.00 | $1.50 | $75.00 | 200,000 | 32,000 |
+| claude-sonnet-4-20250514 | $3.00 | $0.30 | $15.00 | 200,000 | 64,000 |
+| devstral-medium-2507 | $0.40 | N/A | $2.00 | 128,000 | 128,000 |
+| devstral-small-2505 | $0.10 | N/A | $0.30 | 128,000 | 128,000 |
+| devstral-small-2507 | $0.10 | N/A | $0.30 | 128,000 | 128,000 |
+| gemini-2.5-pro | $1.25 | $0.31 | $10.00 | 1,048,576 | 65,535 |
+| gpt-5-2025-08-07 | $1.25 | $0.125 | $10.00 | 400,000 | 128,000 |
+| gpt-5-mini-2025-08-07 | $0.25 | $0.025 | $2.00 | 400,000 | 128,000 |
+| o3 | $2.00 | $0.50 | $8.00 | 200,000 | 100,000 |
+| o4-mini | $1.10 | $0.28 | $4.40 | 200,000 | 100,000 |
+| qwen3-coder-480b | $0.40 | N/A | $1.60 | N/A | N/A |
+
+**Note:** Cached input tokens are charged at a reduced rate when the same content is reused across requests. Models that don't support prompt caching show "N/A" for cached input cost.
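Review note: the pricing table above can be turned into a per-request cost estimate. The following is a minimal sketch, not OpenHands billing code. It assumes cached input tokens are billed at the cached rate in place of the regular input rate, and that models without a cached rate bill all input at the full rate. The `RATES` dictionary and the `estimate_cost` helper are hypothetical names; the rates are copied from two rows of the table.

```python
# USD per 1M tokens, copied from the pricing table; None stands for "N/A".
RATES = {
    'claude-sonnet-4-20250514': {'input': 3.00, 'cached': 0.30, 'output': 15.00},
    'devstral-small-2507': {'input': 0.10, 'cached': None, 'output': 0.30},
}


def estimate_cost(
    model: str, input_tokens: int, output_tokens: int, cached_tokens: int = 0
) -> float:
    """Estimate the USD cost of one request (hypothetical helper).

    `cached_tokens` is the portion of `input_tokens` served from the prompt cache.
    """
    rates = RATES[model]
    cached_rate = rates['cached']
    if cached_rate is None:
        # Assumption: without prompt caching, all input is billed at the full rate.
        cached_rate = rates['input']
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * rates['input']
        + cached_tokens * cached_rate
        + output_tokens * rates['output']
    )
    return total / 1_000_000


sonnet_cost = estimate_cost(
    'claude-sonnet-4-20250514', 1_000_000, 50_000, cached_tokens=200_000
)
print(f'{sonnet_cost:.2f}')
```

Under these assumptions, 800,000 uncached input tokens, 200,000 cached input tokens, and 50,000 output tokens on `claude-sonnet-4-20250514` come to 2.40 + 0.06 + 0.75 = 3.21 USD.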
--- a/enterprise/Dockerfile
+++ b/enterprise/Dockerfile
@@ -7,14 +7,28 @@ LABEL com.datadoghq.tags.service="deploy"
 LABEL com.datadoghq.tags.env="${DD_ENV}"

 # Install Node.js v20+ and npm (which includes npx)
+# Apply security updates to fix CVEs
 RUN apt-get update && \
    apt-get install -y curl && \
    curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
    apt-get install -y nodejs && \
    apt-get install -y jq gettext && \
-    apt-get clean
+    # Apply security updates for packages with available fixes
+    apt-get upgrade -y \
+        libc-bin \
+        libc6 \
+        libgnutls30 \
+        libsqlite3-0 \
+        perl-base && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*

-RUN pip install alembic psycopg2-binary cloud-sql-python-connector pg8000 gspread stripe python-keycloak asyncpg sqlalchemy[asyncio] resend tenacity slack-sdk ddtrace posthog "limits==5.2.0" coredis prometheus-client shap scikit-learn pandas numpy
+# Install Python packages with security fixes
+RUN pip install alembic psycopg2-binary cloud-sql-python-connector pg8000 gspread stripe python-keycloak asyncpg sqlalchemy[asyncio] resend tenacity slack-sdk ddtrace posthog "limits==5.2.0" coredis prometheus-client shap scikit-learn pandas numpy && \
+    # Update packages with known CVE fixes
+    pip install --upgrade \
+        "mcp>=1.10.0" \
+        "pillow>=11.3.0"

 WORKDIR /app
 COPY enterprise .
--- a/evaluation/benchmarks/multi_swe_bench/SWE-Gym.md
+++ b/evaluation/benchmarks/multi_swe_bench/SWE-Gym.md
@@ -0,0 +1,152 @@
+<h1 align="center"> Training Software Engineering Agents and Verifiers with SWE-Gym </h1>
+
+A Multi-SWE-bench implementation of SWE-Gym.
+
+<p align="center">
+  <a href="https://www.jiayipan.com/" style="text-decoration: none;">Jiayi Pan<sup>*,1</sup></a>,
+  <a href="https://xwang.dev/" style="text-decoration: none;">Xingyao Wang<sup>*,2</sup></a>,
+  <a href="https://www.phontron.com/" style="text-decoration: none;">Graham Neubig<sup>3</sup></a>,
+  <a href="https://www.cs.toronto.edu/~ndjaitly/" style="text-decoration: none;">Navdeep Jaitly<sup>4</sup></a>,
+  <a href="https://blender.cs.illinois.edu/hengji.html" style="text-decoration: none;">Heng Ji<sup>2</sup></a>,
+  <a href="https://www.alanesuhr.com/" style="text-decoration: none;">Alane Suhr<sup>^,1</sup></a>,
+  <a href="https://dreasysnail.github.io/" style="text-decoration: none;">Yizhe Zhang<sup>^,4</sup></a>
+</p>
+
+<p align="center">
+  <sup>1</sup>UC Berkeley, <sup>2</sup>UIUC, <sup>3</sup>CMU, <sup>4</sup>Apple <br>
+  <sub><sup>*</sup>Equal contribution, <sup>^</sup>Equal supervision</sub>
+</p>
+
+<p align="center">
+<a href="https://arxiv.org/abs/2412.21139">📃 Paper</a>
+•
+<a href="https://huggingface.co/SWE-Gym" >🤗 Data & Models</a>
+</p>
+
+We present **SWE-Gym**, the first environment for training real-world software engineering agents.
+We use it to train strong LM agents that achieve state-of-the-art open results on SWE-Bench, with early, promising scaling characteristics as we increase training and inference-time compute.
+
+<p align="center">
+  <img src="https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/teaser.jpg?raw=true" width="100%" alt="teaser">
+</p>
+
+---
+# Run SWE-Gym with OpenHands
+
+The process of running SWE-Gym is very similar to how you'd run SWE-Bench evaluation.
+
+
+1. First, clone the OpenHands repo: `git clone https://github.com/All-Hands-AI/OpenHands.git`
+2. Then set up the repo following [Development.md](https://github.com/All-Hands-AI/OpenHands/blob/main/Development.md)
+3. Then serve your own model as an OpenAI-compatible endpoint and put its details in `config.toml`. You can do this by following the instructions [here](../../README.md#setup).
+4. Then run the following to sample with 16x parallelism:
+
+```bash
+export ALLHANDS_API_KEY=ah-yourkey  # You don't need to set this when running these in local docker container
+./evaluation/benchmarks/multi_swe_bench/scripts/rollout_swegym.sh llm.mymodel-temp05 'train-t05' 16
+```
+
+NOTE: SWE-Gym sampling with parallelism is currently only tested with AllHands RemoteRuntime (limited beta). Fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSckVz_JFwg2_mOxNZjCtr7aoBFI2Mwdan3f75J_TrdMS1JV2g/viewform) to apply for access.
+
+
+5. When `rollout_swegym.sh` finishes, you will get a file called `output.with_completions.jsonl.gz`. You can then use [`./scripts/swegym/convert_data.ipynb`](./scripts/swegym/convert_data.ipynb) to convert it into the SFT data format.
+
+## Running the Jupyter Notebook
+
+To run the data conversion notebook, follow these steps:
+
+1. Navigate to the OpenHands repository root:
+```bash
+cd openhands_repo
+```
+
+2. Set the PYTHONPATH and start Jupyter notebook:
+```bash
+PYTHONPATH=$(pwd) jupyter notebook
+```
+
+3. In the Jupyter interface, navigate to `evaluation/benchmarks/swe_bench/scripts/swegym/convert_data.ipynb`
+
+4. Update the file paths in the notebook:
+   - Set `FILE_PATHS` to point to your `output.with_completions.jsonl.gz` files
+   - Set `YOUR_OUTPUT_FOLDER` to your desired output directory
+
+5. Run the notebook cells sequentially to process your data and generate the SFT training format.
+
+---
+# More info about SWE-Gym
+
+Progress in agents for software engineering has been limited by the lack of training environments that both include rigorous verification for reinforcement learning and cover the expansive tasks encountered in real-world repository-level engineering.
+
+We introduce SWE-Gym: An Open Environment for Training Software Engineering Agents & Verifiers.
+Our baselines achieve a new open SOTA: 32%/26% on SWE-Bench Verified/Lite, with promising scaling trends.
+
+![SWE-Gym Scaling](https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/scaling.jpg?raw=true)
+*SWE-Gym enables scalable improvements for software engineering agents at both training and inference time. Our current results are primarily bottlenecked by training and inference compute, rather than the size of our environment.*
+
+## SWE-Gym Environment
+
+We create SWE-Gym, the first environment for training SWE agents, with **2.4K real tasks from 11 Python repos** & a Lite split of 234 instances. SWE-Gym combines real-world Python tasks, repository context, executable environments, and test verification to train agents for solving software engineering problems.
+
+![SWE-Gym Repo Distribution](https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/swe-gym.jpg?raw=true)
+
+
+## SWE-Gym trains LMs as agents
+
+When fine-tuned on fewer than 500 agent-environment interaction trajectories sampled in SWE-Gym with GPT-4o and Claude 3.5 Sonnet, a 32B LM-powered OpenHands agent achieves a **+14%** absolute gain on SWE-Bench Verified.
+
+![OpenHands Performance diff before and after training](https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/oh-agent.jpg?raw=true)
+
+
+## SWE-Gym enables self-improvement
+
+SWE-Gym is also effective across agent scaffolds. With rejection sampling fine-tuning and the MoatlessTools scaffold, our 32B and 7B models achieve 20% and 10%, respectively, on SWE-Bench Lite through self-improvement.
+
+<p align="center">
+  <img src="https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/ml-agent.jpg?raw=true" width="80%" alt="Moatless self-improvement">
+</p>
+
+
+
+## SWE-Gym enables inference-time scaling
+
+SWE-Gym enables inference-time scaling through verifiers trained on agent trajectories.
+These verifiers identify the most promising solutions via best-of-n selection. Together with our learned agents, they achieve 32%/26% on SWE-Bench Verified/Lite, a new open SoTA.
+
+
+![Inference Time Scaling for Moatless Agent](https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/inference-ml.jpg?raw=true)
+*Inference Time Scaling for Moatless Agent*
+
+![Inference Time Scaling for OpenHands Agent](https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/inference-oh.jpg?raw=true)
+*Inference Time Scaling for OpenHands Agent*
+
+
+## Our baselines on SWE-Gym show strong scaling trends
+
+Lastly, our ablations reveal strong scaling trends: performance is now bottlenecked by training and inference compute, rather than the size of our dataset. Pushing and improving these scaling trends further is an exciting direction for future work.
+
+![](https://github.com/SWE-Gym/SWE-Gym/blob/main/assets/images/scaling.jpg?raw=true)
+
+## Reproducing Results
+**The Dataset**
+
+To access the SWE-Gym dataset, check out our Hugging Face Hub page [SWE-Gym](https://huggingface.co/SWE-Gym)
+
+The environment constants are currently saved at [SWE-Bench-Fork](https://github.com/SWE-Gym/SWE-Bench-Fork)
+
+We also have pre-built Docker images for each instance under the [xingyaoww/sweb.eval.x86_64](https://hub.docker.com/search?q=xingyaoww%2Fsweb.eval.x86_64.) prefix on Docker Hub.
+
+
+## 📚 Citation
+
+```bibtex
+@misc{pan2024trainingsoftwareengineeringagents,
+      title={Training Software Engineering Agents and Verifiers with SWE-Gym},
+      author={Jiayi Pan and Xingyao Wang and Graham Neubig and Navdeep Jaitly and Heng Ji and Alane Suhr and Yizhe Zhang},
+      year={2024},
+      eprint={2412.21139},
+      archivePrefix={arXiv},
+      primaryClass={cs.SE},
+      url={https://arxiv.org/abs/2412.21139},
+}
+```
--- a/evaluation/benchmarks/multi_swe_bench/run_infer.py
+++ b/evaluation/benchmarks/multi_swe_bench/run_infer.py
@@ -51,8 +51,8 @@ RUN_WITH_BROWSING = os.environ.get('RUN_WITH_BROWSING', 'false').lower() == 'tru

 # TODO: migrate all swe-bench docker to ghcr.io/openhands
 # TODO: adapt to all languages
-DOCKER_IMAGE_PREFIX = os.environ.get('EVAL_DOCKER_IMAGE_PREFIX', '')
-LANGUAGE = os.environ.get('LANGUAGE', 'python')
+DOCKER_IMAGE_PREFIX = os.environ.get('EVAL_DOCKER_IMAGE_PREFIX', 'mswebench')
+LANGUAGE = os.environ.get('LANGUAGE', 'java')
 logger.info(f'Using docker image prefix: {DOCKER_IMAGE_PREFIX}')


@@ -305,31 +305,19 @@ def get_instance_docker_image(instance: pd.Series):
        instance_id = instance.get('instance_id', '')
        tag_suffix = instance_id.split('-')[-1] if instance_id else ''
        container_tag = f'pr-{tag_suffix}'
-        # pdb.set_trace()
-        return f'mswebench/{container_name}:{container_tag}'
-        # return "kong/insomnia:pr-8284"
-        # return "'sweb.eval.x86_64.local_insomnia"
-        # return "local_insomnia_why"
-        # return "local/kong-insomnia:pr-8117"
+        return f'{DOCKER_IMAGE_PREFIX}/{container_name}:{container_tag}'


 def get_config(
    instance: pd.Series,
    metadata: EvalMetadata,
 ) -> OpenHandsConfig:
-    SWE_BENCH_CONTAINER_IMAGE = 'ghcr.io/opendevin/eval-swe-bench:full-v1.2.1'
-    if USE_INSTANCE_IMAGE:
-        # We use a different instance image for the each instance of swe-bench eval
-        # base_container_image = get_instance_docker_image(instance['instance_id'])
-        base_container_image = get_instance_docker_image(instance)
-        logger.info(
-            f'Using instance container image: {base_container_image}. '
-            f'Please make sure this image exists. '
-            f'Submit an issue on https://github.com/All-Hands-AI/OpenHands if you run into any issues.'
-        )
-    else:
-        base_container_image = SWE_BENCH_CONTAINER_IMAGE
-        logger.info(f'Using swe-bench container image: {base_container_image}')
+    base_container_image = get_instance_docker_image(instance)
+    logger.info(
+        f'Using instance container image: {base_container_image}. '
+        f'Please make sure this image exists. '
+        f'Submit an issue on https://github.com/All-Hands-AI/OpenHands if you run into any issues.'
+    )

    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = base_container_image
@@ -772,7 +760,6 @@ if __name__ == '__main__':
    parser.add_argument(
        '--dataset',
        type=str,
-        default='princeton-nlp/SWE-bench',
        help='data set to evaluate on, either full-test or lite-test',
    )
    parser.add_argument(
@@ -787,6 +774,7 @@ if __name__ == '__main__':
    # so we don't need to manage file uploading to OpenHands's repo
    # dataset = load_dataset(args.dataset, split=args.split)
    # dataset = load_dataset(args.dataset)
+    logger.info(f'Loading dataset {args.dataset} with split {args.split} ')
    dataset = load_dataset('json', data_files=args.dataset)
    dataset = dataset[args.split]
    swe_bench_tests = filter_dataset(dataset.to_pandas(), 'instance_id')
@@ -839,7 +827,7 @@ if __name__ == '__main__':
        args.eval_num_workers,
        process_instance,
        timeout_seconds=120 * 60,  # 2 hour PER instance should be more than enough
-        max_retries=5,
+        max_retries=3,
    )
    # Check if any instances reached maximum retries
    check_maximum_retries_exceeded(metadata.eval_output_dir)
--- a/evaluation/benchmarks/multi_swe_bench/scripts/data/data_change.py
+++ b/evaluation/benchmarks/multi_swe_bench/scripts/data/data_change.py
@@ -1,37 +1,54 @@
+import argparse
 import json

-input_file = 'XXX.jsonl'
-output_file = 'YYY.jsonl'

-with (
-    open(input_file, 'r', encoding='utf-8') as fin,
-    open(output_file, 'w', encoding='utf-8') as fout,
-):
-    for line in fin:
-        line = line.strip()
-        if not line:
-            continue
+def main(input_file, output_file):
+    with (
+        open(input_file, 'r', encoding='utf-8') as fin,
+        open(output_file, 'w', encoding='utf-8') as fout,
+    ):
+        for line in fin:
+            line = line.strip()
+            if not line:
+                continue

-        data = json.loads(line)
-        item = data
+            data = json.loads(line)
+            item = data

-        # Extract the original data
-        org = item.get('org', '')
-        repo = item.get('repo', '')
-        number = str(item.get('number', ''))
+            # Skip instances that don't have resolved_issues or have empty resolved_issues
+            if not item.get('resolved_issues') or len(item['resolved_issues']) == 0:
+                print(
+                    f'Skipping instance {item.get("org", "")}/{item.get("repo", "")}-{item.get("number", "")} - no resolved_issues'
+                )
+                continue

-        new_item = {}
-        new_item['repo'] = f'{org}/{repo}'
-        new_item['instance_id'] = f'{org}__{repo}-{number}'
-        new_item['problem_statement'] = (
-            item['resolved_issues'][0].get('title', '')
-            + '\n'
-            + item['resolved_issues'][0].get('body', '')
-        )
-        new_item['FAIL_TO_PASS'] = []
-        new_item['PASS_TO_PASS'] = []
-        new_item['base_commit'] = item['base'].get('sha', '')
-        new_item['version'] = '0.1'  # depends
+            # Extract the original data
+            org = item.get('org', '')
+            repo = item.get('repo', '')
+            number = str(item.get('number', ''))

-        output_data = new_item
-        fout.write(json.dumps(output_data, ensure_ascii=False) + '\n')
+            new_item = {}
+            new_item['repo'] = f'{org}/{repo}'
+            new_item['instance_id'] = f'{org}__{repo}-{number}'
+
+            # Get the first resolved issue
+            resolved_issue = item['resolved_issues'][0]
+            title = resolved_issue.get('title') or ''
+            body = resolved_issue.get('body') or ''
+
+            new_item['problem_statement'] = title + '\n' + body
+            new_item['FAIL_TO_PASS'] = []
+            new_item['PASS_TO_PASS'] = []
+            new_item['base_commit'] = item['base'].get('sha', '')
+            new_item['version'] = '0.1'  # depends
+
+            output_data = new_item
+            fout.write(json.dumps(output_data, ensure_ascii=False) + '\n')
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--input', required=True, help='Input .jsonl file path')
+    parser.add_argument('--output', required=True, help='Output .jsonl file path')
+    args = parser.parse_args()
+    main(args.input, args.output)
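Review note: the per-record transformation in `data_change.py` may be easier to follow in isolation. The sketch below mirrors the field mapping from the diff in a standalone function; `convert_record` is a hypothetical helper name that does not exist in the script, and the sample record is made up for illustration.

```python
import json
from typing import Optional


def convert_record(item: dict) -> Optional[dict]:
    """Map one Multi-SWE-bench record to SWE-bench-style fields (hypothetical helper)."""
    issues = item.get('resolved_issues')
    if not issues:
        # Mirrors the skip branch: missing or empty resolved_issues.
        return None
    org = item.get('org', '')
    repo = item.get('repo', '')
    number = str(item.get('number', ''))
    first_issue = issues[0]
    # `or ''` also covers an explicit null title or body in the JSON.
    title = first_issue.get('title') or ''
    body = first_issue.get('body') or ''
    return {
        'repo': f'{org}/{repo}',
        'instance_id': f'{org}__{repo}-{number}',
        'problem_statement': title + '\n' + body,
        'FAIL_TO_PASS': [],
        'PASS_TO_PASS': [],
        'base_commit': item['base'].get('sha', ''),
        'version': '0.1',
    }


# Made-up sample record, for illustration only.
sample = {
    'org': 'example-org',
    'repo': 'example-repo',
    'number': 42,
    'resolved_issues': [{'title': 'Crash on empty input', 'body': None}],
    'base': {'sha': 'abc123'},
}
converted = convert_record(sample)
print(json.dumps(converted, ensure_ascii=False))
```

The `body: None` case is the one the new `or ''` handling addresses: the previous `.get('body', '')` form returns `None` for an explicit null, and the string concatenation then raises a `TypeError`.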
--- a/evaluation/benchmarks/multi_swe_bench/scripts/eval/combine_final_completions.py
+++ b/evaluation/benchmarks/multi_swe_bench/scripts/eval/combine_final_completions.py
@@ -0,0 +1,69 @@
+import argparse
+import gzip
+import json
+import os
+from glob import glob
+
+from tqdm import tqdm
+
+tqdm.pandas()
+
+
+# Load trajectories for resolved instances
+def load_completions(output_dir: str, instance_id: str):
+    glob_path = os.path.join(output_dir, 'llm_completions', instance_id, '*.json')
+    files = sorted(glob(glob_path))  # this is ascending order
+    # pick the last file (last turn)
+    try:
+        file_path = files[-1]
+    except IndexError:
+        # print(f'No files found for instance {instance_id}: files={files}')
+        return None
+    with open(file_path, 'r') as f:
+        result = json.load(f)
+    # create messages
+    messages = result['messages']
+    messages.append(result['response']['choices'][0]['message'])
+    tools = result['kwargs'].get('tools', [])
+    return {
+        'messages': messages,
+        'tools': tools,
+    }
+
+
+parser = argparse.ArgumentParser()
+parser.add_argument('jsonl_path', type=str)
+args = parser.parse_args()
+
+output_dir = os.path.dirname(args.jsonl_path)
+output_path = os.path.join(output_dir, 'output.with_completions.jsonl.gz')
+
+# Check if output would be different from input
+needs_update = False
+with open(args.jsonl_path, 'r') as f_in:
+    for line in tqdm(f_in, desc='Checking for changes'):
+        data = json.loads(line)
+        new_completions = load_completions(output_dir, data['instance_id'])
+        current_completions = data.get('raw_completions')
+        if current_completions != new_completions:
+            needs_update = True
+            break
+
+if not needs_update:
+    print('No updates required. Skipping file update.')
+    exit(0)
+
+if os.path.exists(output_path):
+    print(f'Output file already exists at {output_path}, overwriting? (y/n)')
+    if input() != 'y':
+        print('Exiting...')
+        exit(0)
+
+# Process line by line
+with open(args.jsonl_path, 'r') as f_in, gzip.open(output_path, 'wt') as f_out:
+    for line in tqdm(f_in):
+        data = json.loads(line)
+        data['raw_completions'] = load_completions(output_dir, data['instance_id'])
+        f_out.write(json.dumps(data) + '\n')
+
+print(f'Saved compressed output to {output_path}')
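Review note: a consumer of `output.with_completions.jsonl.gz` can read it back with the standard library. This is a minimal sketch, assuming only the layout visible in the script above (one JSON object per line, with a `raw_completions` field that is `None` when no completion file was found). `iter_rows` is a hypothetical helper and the demo row is made up.

```python
import gzip
import json
import os
import tempfile


def iter_rows(path: str):
    """Yield one dict per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, 'rt') as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Round-trip demo: write a made-up row the same way the script does, then read it back.
demo_row = {
    'instance_id': 'example-org__example-repo-42',
    'raw_completions': {'messages': [{'role': 'user', 'content': 'hi'}], 'tools': []},
}
demo_path = os.path.join(tempfile.mkdtemp(), 'output.with_completions.jsonl.gz')
with gzip.open(demo_path, 'wt') as f_out:
    f_out.write(json.dumps(demo_row) + '\n')

rows = list(iter_rows(demo_path))
# Keep only the rows that actually carry a trajectory.
with_completions = [r for r in rows if r['raw_completions'] is not None]
print(len(rows), len(with_completions))
```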
--- a/evaluation/benchmarks/multi_swe_bench/scripts/eval/convert.py
+++ b/evaluation/benchmarks/multi_swe_bench/scripts/eval/convert.py
@@ -1,13 +1,11 @@
+import argparse
 import json
 import re

-IN_FILE = 'output.jsonl'
-OUT_FILE = 'patch.jsonl'

-
-def main():
-    with open(IN_FILE, 'r') as fin:
-        with open(OUT_FILE, 'w') as fout:
+def main(input_file, output_file):
+    with open(input_file, 'r') as fin:
+        with open(output_file, 'w') as fout:
            for line in fin:
                data = json.loads(line)
                groups = re.match(r'(.*)__(.*)-(.*)', data['instance_id'])
@@ -15,10 +13,14 @@ def main():
                    'org': groups.group(1),
                    'repo': groups.group(2),
                    'number': groups.group(3),
-                    'fix_patch': data['test_result']['git_patch'],
+                    'fix_patch': data.get('test_result', {}).get('git_patch', '') or '',
                }
                fout.write(json.dumps(patch) + '\n')


 if __name__ == '__main__':
-    main()
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--input', required=True, help='Input .jsonl file path')
+    parser.add_argument('--output', required=True, help='Output .jsonl file path')
+    args = parser.parse_args()
+    main(args.input, args.output)
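Review note: `convert.py` relies on the greedy pattern `(.*)__(.*)-(.*)` to split an `instance_id` into org, repo, and PR number. The sketch below shows how that pattern behaves on hyphenated names and on an ID that does not match. `split_instance_id` is a hypothetical helper and the IDs are made up. As written in the diff, the script calls `groups.group(...)` without a `None` check, so a non-matching ID would raise `AttributeError`.

```python
import re
from typing import Optional, Tuple

# Same pattern as in convert.py.
INSTANCE_ID_RE = re.compile(r'(.*)__(.*)-(.*)')


def split_instance_id(instance_id: str) -> Optional[Tuple[str, str, str]]:
    """Return (org, repo, number), or None if the ID does not match (hypothetical helper)."""
    m = INSTANCE_ID_RE.match(instance_id)
    if m is None:
        return None
    return m.group(1), m.group(2), m.group(3)


# The greedy groups split on the last '__' and the last '-',
# so hyphens inside the org or repo name are preserved.
parsed = split_instance_id('example-org__example-repo-42')
unmatched = split_instance_id('no-separator')
print(parsed, unmatched)
```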
--- a/evaluation/benchmarks/multi_swe_bench/scripts/eval/update_multi_swe_bench_config.py
+++ b/evaluation/benchmarks/multi_swe_bench/scripts/eval/update_multi_swe_bench_config.py
@@ -0,0 +1,70 @@
+import argparse
+import json
+import os
+import subprocess
+
+
+def update_multi_swe_config(output_jsonl_path, config_path, dataset):
+    path_to_parent = os.path.dirname(os.path.abspath(output_jsonl_path))
+    converted_path = os.path.join(path_to_parent, 'output_converted.jsonl')
+
+    # Run the conversion script
+    subprocess.run(
+        [
+            'python3',
+            './evaluation/benchmarks/multi_swe_bench/scripts/eval/convert.py',
+            '--input',
+            output_jsonl_path,
+            '--output',
+            converted_path,
+        ],
+        check=True,
+    )
+
+    # Create required directories
+    os.makedirs(os.path.join(path_to_parent, 'eval_files', 'dataset'), exist_ok=True)
+    os.makedirs(os.path.join(path_to_parent, 'eval_files', 'workdir'), exist_ok=True)
+    os.makedirs(os.path.join(path_to_parent, 'eval_files', 'repos'), exist_ok=True)
+    os.makedirs(os.path.join(path_to_parent, 'eval_files', 'logs'), exist_ok=True)
+
+    # Prepare config dict
+    config = {
+        'mode': 'evaluation',
+        'workdir': os.path.join(path_to_parent, 'eval_files', 'workdir'),
+        'patch_files': [converted_path],
+        'dataset_files': [dataset],
+        'force_build': True,
+        'output_dir': os.path.join(path_to_parent, 'eval_files', 'dataset'),
+        'specifics': [],
+        'skips': [],
+        'repo_dir': os.path.join(path_to_parent, 'eval_files', 'repos'),
+        'need_clone': True,
+        'global_env': [],
+        'clear_env': True,
+        'stop_on_error': False,
+        'max_workers': 5,
+        'max_workers_build_image': 5,
+        'max_workers_run_instance': 5,
+        'log_dir': os.path.join(path_to_parent, 'eval_files', 'logs'),
+        'log_level': 'DEBUG',
+        'fix_patch_run_cmd': (
+            'bash -c "apt update ; apt install -y patch ; '
+            "sed -i 's@git apply.*@patch --batch --fuzz=5 -p1 -i /home/test.patch;"
+            'patch --batch --fuzz=5 -p1 -i /home/fix.patch@g\' /home/fix-run.sh ; chmod +x /home/*.sh  ; /home/fix-run.sh"'
+        ),
+    }
+
+    # Save to multibench.config
+    os.makedirs(os.path.dirname(config_path), exist_ok=True)
+    with open(config_path, 'w') as f:
+        json.dump(config, f, indent=4)
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--input', required=True, help='Path to input file')
+    parser.add_argument('--output', required=True, help='Path to create config')
+    parser.add_argument('--dataset', required=True, help='Path to dataset')
+    args = parser.parse_args()
+
+    update_multi_swe_config(args.input, args.output, args.dataset)
--- a/evaluation/benchmarks/multi_swe_bench/scripts/eval/update_output_with_eval.py
+++ b/evaluation/benchmarks/multi_swe_bench/scripts/eval/update_output_with_eval.py
@@ -0,0 +1,176 @@
+import argparse
+import json
+import os
+from collections import defaultdict
+
+from tqdm import tqdm
+
+parser = argparse.ArgumentParser()
+parser.add_argument('input_file', type=str)
+parser.add_argument(
+    '--force',
+    action='store_true',
+    help='Force update all reports even if no changes are detected',
+)
+parser.add_argument(
+    '--overwrite-backup',
+    action='store_true',
+    help='Automatically overwrite existing backup files without prompting',
+)
+args = parser.parse_args()
+
+dirname = os.path.dirname(args.input_file)
+
+# Initialize counters and data structures
+instance_id_to_status = defaultdict(
+    lambda: {
+        'empty_generation': False,
+        'resolved': False,
+        'failed_apply_patch': False,
+        'error_eval': False,
+        'test_timeout': False,
+    }
+)
+
+# Process official report if it exists
+swebench_official_report_json = os.path.join(
+    dirname, 'eval_files/dataset/final_report.json'
+)
+openhands_remote_report_jsonl = args.input_file.replace(
+    '.jsonl', '.swebench_eval.jsonl'
+)
+
+if os.path.exists(swebench_official_report_json):
+    output_md_filepath = os.path.join(dirname, 'README.md')
+    with open(swebench_official_report_json, 'r') as f:
+        report = json.load(f)
+
+    # Convert instance IDs from "repo/name:pr-123" format to "repo__name-123" format
+    def convert_instance_id(instance_id):
+        """Convert instance ID from slash/colon-pr format to double underscore/dash format."""
+        if '/' in instance_id and ':pr-' in instance_id:
+            # Split on '/' and ':pr-'
+            parts = instance_id.split('/')
+            if len(parts) == 2:
+                repo_part = parts[0]
+                name_and_pr = parts[1]
+                if ':pr-' in name_and_pr:
+                    name, pr_number = name_and_pr.split(':pr-')
+                    return f'{repo_part}__{name}-{pr_number}'
+        return instance_id
+
+    # Convert all instance ID lists in the report
+    for key in [
+        'resolved_ids',
+        'unresolved_ids',
+        'error_ids',
+        'empty_patch_ids',
+        'incomplete_ids',
+    ]:
+        if key in report:
+            report[key] = [
+                convert_instance_id(instance_id) for instance_id in report[key]
+            ]
+
+    output_md = (
+        '# Multi-SWE-bench Report\n'
+        'This folder contains the evaluation results of the SWE-bench using the [official evaluation docker containerization](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md#choosing-the-right-cache_level).\n\n'
+        '## Summary\n'
+        f'- total instances: {report["total_instances"]}\n'
+        f'- submitted instances: {report["submitted_instances"]}\n'
+        f'- completed instances: {report["completed_instances"]}\n'
+        f'- empty patch instances: {report["empty_patch_instances"]}\n'
+        f'- resolved instances: {report["resolved_instances"]}\n'
+        f'- unresolved instances: {report["unresolved_instances"]}\n'
+        f'- error instances: {report["error_instances"]}\n'
+    )
+
+    output_md += '\n## Resolved Instances\n'
+    # instance_id to status
+    for instance_id in report['resolved_ids']:
+        instance_id_to_status[instance_id]['resolved'] = True
+        output_md += (
+            f'- [{instance_id}](./eval_outputs/{instance_id}/run_instance.log)\n'
+        )
+
+    output_md += '\n## Unresolved Instances\n'
+    for instance_id in report['unresolved_ids']:
+        output_md += (
+            f'- [{instance_id}](./eval_outputs/{instance_id}/run_instance.log)\n'
+        )
+
+    output_md += '\n## Error Instances\n'
+    for instance_id in report['error_ids']:
+        instance_id_to_status[instance_id]['error_eval'] = True
+        output_md += (
+            f'- [{instance_id}](./eval_outputs/{instance_id}/run_instance.log)\n'
+        )
+
+    output_md += '\n## Empty Patch Instances\n'
+    for instance_id in report['empty_patch_ids']:
+        instance_id_to_status[instance_id]['empty_generation'] = True
+        output_md += (
+            f'- [{instance_id}](./eval_outputs/{instance_id}/run_instance.log)\n'
+        )
+
+    output_md += '\n## Incomplete Instances\n'
+    for instance_id in report['incomplete_ids']:
+        output_md += (
+            f'- [{instance_id}](./eval_outputs/{instance_id}/run_instance.log)\n'
+        )
+
+    with open(output_md_filepath, 'w') as f:
+        f.write(output_md)
+
+else:
+    print(
+        f'No report file found: Both {swebench_official_report_json} and {openhands_remote_report_jsonl} do not exist.'
+    )
+    exit()
+
+# Before backup and update, check if any changes would be made (unless --force is used)
+if not args.force:
+    needs_update = False
+    with open(args.input_file, 'r') as infile:
+        for line in tqdm(infile, desc='Checking for changes'):
+            data = json.loads(line)
+            instance_id = data['instance_id']
+            current_report = data.get('report', {})
+            new_report = instance_id_to_status[
+                instance_id
+            ]  # if no report, it's not resolved
+            if current_report != new_report:
+                needs_update = True
+                break
+
+    if not needs_update:
+        print('No updates detected. Skipping file update.')
+        exit()
+else:
+    print('Force flag enabled. Updating all reports regardless of changes.')
+
+# Backup and update the original file row by row
+if os.path.exists(args.input_file + '.bak'):
+    if args.overwrite_backup:
+        print(
+            'Existing backup file found. Overwriting automatically due to --overwrite-backup flag.'
+        )
+        os.remove(args.input_file + '.bak')
+    else:
+        conf = input('Existing backup file found. Do you want to overwrite it? (y/n)')
+        if conf != 'y':
+            exit()
+        os.remove(args.input_file + '.bak')
+
+os.rename(args.input_file, args.input_file + '.bak')
+
+# Process and write file row by row
+with (
+    open(args.input_file + '.bak', 'r') as infile,
+    open(args.input_file, 'w') as outfile,
+):
+    for line in tqdm(infile, desc='Updating output file'):
+        data = json.loads(line)
+        instance_id = data['instance_id']
+        data['report'] = instance_id_to_status[instance_id]
+        outfile.write(json.dumps(data) + '\n')
--- a/evaluation/benchmarks/multi_swe_bench/scripts/rollout_multi_swegym.sh
+++ b/evaluation/benchmarks/multi_swe_bench/scripts/rollout_multi_swegym.sh
@@ -0,0 +1,146 @@
+#!/bin/bash
+
+# NOTE: this script is for rolling out the Multi-SWE-Gym dataset for **TRAINING**
+# For more information, please refer to
+# 1. the Github Repo: https://github.com/SWE-Gym/SWE-Gym
+# 2. the paper: https://arxiv.org/abs/2412.21139
+
+MODEL=$1  # eg your llm config name in config.toml (eg: "llm.claude-3-5-sonnet-20241022-t05")
+EXP_NAME=$2 # "train-t05"
+EVAL_DATASET=$3  # path to original dataset (jsonl file)
+N_WORKERS=${4:-64}
+N_RUNS=${5:-1}
+
+export EXP_NAME=$EXP_NAME
+# use 2x resources for rollout since some codebases are pretty resource-intensive
+export DEFAULT_RUNTIME_RESOURCE_FACTOR=2
+echo "MODEL: $MODEL"
+echo "EXP_NAME: $EXP_NAME"
+echo "EVAL_DATASET: $EVAL_DATASET"
+# Generate DATASET path by adding _with_runtime_ before .jsonl extension
+DATASET="${EVAL_DATASET%.jsonl}_with_runtime_.jsonl"  # path to converted dataset
+
+# Create the converted dataset file
+echo "Creating converted dataset at: $DATASET"
+poetry run python ./evaluation/benchmarks/multi_swe_bench/scripts/data/data_change.py --input "$EVAL_DATASET" --output "$DATASET"
+
+SPLIT="train"
+export LANGUAGE=java
+
+if [ -z "$ALLHANDS_API_KEY" ] || [ "$RUNTIME" != "remote" ]; then
+    echo "ALLHANDS_API_KEY is not set or RUNTIME is not set to remote. Will rollout and evaluate locally using Docker. WARNING: A large value of N_WORKERS will result in a large number of Docker containers being spun up and may crash your machine."
+    export RUNTIME=docker
+else
+    echo "ALLHANDS_API_KEY is set and RUNTIME is set to remote. Continuing rollout and evaluation with remote runtime..."
+    export SANDBOX_REMOTE_RUNTIME_API_URL="https://runtime.eval.all-hands.dev"
+fi
+
+#EVAL_LIMIT=3000
+MAX_ITER=100
+
+
+# ===== Run inference =====
+source "evaluation/utils/version_control.sh"
+get_openhands_version
+
+echo "OPENHANDS_VERSION: $OPENHANDS_VERSION"
+echo "MODEL_CONFIG: $MODEL_CONFIG"
+echo "DATASET: $DATASET"
+echo "EVAL_DOCKER_IMAGE_PREFIX: $EVAL_DOCKER_IMAGE_PREFIX"
+
+# Default to NOT use Hint
+export USE_INSTANCE_IMAGE=true
+export USE_HINT_TEXT=false
+export RUN_WITH_BROWSING=false
+echo "USE_HINT_TEXT: $USE_HINT_TEXT"
+EVAL_NOTE="$OPENHANDS_VERSION-no-hint-$EXP_NAME"
+
+function run_eval() {
+  local eval_note=$1
+  export LANGUAGE=java
+  echo "About to run command"
+  COMMAND="EVAL_DOCKER_IMAGE_PREFIX=$EVAL_DOCKER_IMAGE_PREFIX; LANGUAGE=java;
+    poetry run python evaluation/benchmarks/multi_swe_bench/run_infer.py \
+    --agent-cls CodeActAgent \
+    --llm-config $MODEL \
+    --max-iterations $MAX_ITER \
+    --eval-num-workers $N_WORKERS \
+    --eval-note $eval_note \
+    --dataset $DATASET \
+    --split $SPLIT"
+
+  echo "Running command: $COMMAND"
+  if [ -n "$EVAL_LIMIT" ]; then
+    echo "EVAL_LIMIT: $EVAL_LIMIT"
+    COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
+  fi
+
+  # Run the command
+  eval $COMMAND
+}
+
+for run_idx in $(seq 1 $N_RUNS); do
+
+    while true; do
+        echo "### Running inference... ###"
+        unset SANDBOX_ENV_GITHUB_TOKEN # prevent the agent from using the github token to push
+        current_eval_note="$EVAL_NOTE-run_$run_idx"
+        echo "EVAL_NOTE: $current_eval_note"
+        echo "DATASET command: $DATASET"
+        #INFER_OUTPUT=$(run_eval $current_eval_note)
+        INFER_OUTPUT=$(set -o pipefail; run_eval $current_eval_note | tee /dev/stderr)
+        INFER_STATUS=$?  # Exit status of run_eval (pipefail keeps tee from masking a failure)
+        echo "INFER_STATUS: $INFER_STATUS"
+
+        echo "### Cleaning up remote runtime... ###"
+        ./evaluation/utils/scripts/cleanup_remote_runtime.sh
+
+        if [ $INFER_STATUS -eq 0 ]; then
+            echo "### Inference completed successfully. ###"
+            break
+        else
+            echo "### Inference failed with exit code $INFER_STATUS. Retrying... ###"
+        fi
+    done
+
+    # Extract the output directory using the special delimiters
+    OUTPUT_FILE=$(echo "$INFER_OUTPUT" | grep -o '### OUTPUT FILE:.* ###' | sed 's/### OUTPUT FILE: \(.*\) ###/\1/')
+    echo "Got OUTPUT_FILE: $OUTPUT_FILE"
+
+    while true; do
+        echo "### Evaluating on $OUTPUT_FILE ... ###"
+        OUTPUT_CONFIG_FILE="${OUTPUT_FILE%.jsonl}_config.json"
+        export EVAL_SKIP_BUILD_ERRORS=true
+        pip install multi-swe-bench --quiet --disable-pip-version-check > /dev/null 2>&1
+        COMMAND="poetry run python ./evaluation/benchmarks/multi_swe_bench/scripts/eval/update_multi_swe_bench_config.py --input $OUTPUT_FILE --output $OUTPUT_CONFIG_FILE --dataset $EVAL_DATASET;
+        python -m multi_swe_bench.harness.run_evaluation --config $OUTPUT_CONFIG_FILE
+        "
+
+        if [ -n "$EVAL_LIMIT" ]; then
+            echo "EVAL_LIMIT: $EVAL_LIMIT"
+            COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
+        fi
+        echo "Running command: $COMMAND"
+        # Run the command
+        eval $COMMAND
+        EVAL_STATUS=$?
+        if [ $EVAL_STATUS -eq 0 ]; then
+            echo "### Evaluation completed successfully. ###"
+            break
+        else
+            echo "### Evaluation failed with exit code $EVAL_STATUS. Retrying... ###"
+        fi
+
+        ./evaluation/utils/scripts/cleanup_remote_runtime.sh
+    done
+
+    # update the output with evaluation results
+    echo "### Updating the output with evaluation results... ###"
+    poetry run python evaluation/benchmarks/multi_swe_bench/scripts/eval/update_output_with_eval.py $OUTPUT_FILE
+
+    echo "### Combining the final completions... ###"
+    poetry run python evaluation/benchmarks/multi_swe_bench/scripts/eval/combine_final_completions.py $OUTPUT_FILE
+
+    echo "### DONE for run $run_idx! ###"
+    echo "You can find the final output at $(dirname $OUTPUT_FILE)/$FINAL_OUTPUT_FILE"
+done
--- a/evaluation/benchmarks/multi_swe_bench/scripts/run_infer.sh
+++ b/evaluation/benchmarks/multi_swe_bench/scripts/run_infer.sh
@@ -47,8 +47,8 @@ if [ -z "$DATASET" ]; then
 fi

 if [ -z "$LANGUAGE" ]; then
-  echo "LANUGUAGE not specified, use default python"
-  LANGUAGE="python"
+  echo "LANGUAGE not specified, use default java"
+  LANGUAGE="java"
 fi

 if [ -z "$SPLIT" ]; then
@@ -69,10 +69,10 @@ fi

 if [ -z "$EVAL_DOCKER_IMAGE_PREFIX" ]; then
  if [ "$LANGUAGE" = "python" ]; then
-  echo "EVAL_DOCKER_IMAGE_PREFIX is docker.io/xingyaoww/ as default as LANUGUAGE is python"
+  echo "EVAL_DOCKER_IMAGE_PREFIX is docker.io/xingyaoww/ as default as LANGUAGE is python"
    EVAL_DOCKER_IMAGE_PREFIX="docker.io/xingyaoww/"
  elif [ "$LANGUAGE" = "java" ]; then
-  echo "EVAL_DOCKER_IMAGE_PREFIX is java_verified as LANUGUAGE is java"
+  echo "EVAL_DOCKER_IMAGE_PREFIX is empty as LANGUAGE is java"
    EVAL_DOCKER_IMAGE_PREFIX=""
  fi
 fi
--- a/evaluation/benchmarks/multi_swe_bench/scripts/swegym/convert_data.ipynb
+++ b/evaluation/benchmarks/multi_swe_bench/scripts/swegym/convert_data.ipynb
@@ -0,0 +1,344 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "import pandas as pd\n",
+    "from tqdm import tqdm\n",
+    "\n",
+    "tqdm.pandas()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 1. Load raw data and convert to training data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import gzip\n",
+    "import json\n",
+    "\n",
+    "from tqdm import tqdm\n",
+    "\n",
+    "FILE_PATHS = [\n",
+    "    'YOURPATH-no-hint-train-t05-run_1/output.with_completions.jsonl.gz',\n",
+    "    'YOURPATH-no-hint-train-t05-run_2/output.with_completions.jsonl.gz',\n",
+    "]\n",
+    "\n",
+    "# More memory efficient for large files\n",
+    "# Initialize lists to store the data\n",
+    "data = []\n",
+    "\n",
+    "\n",
+    "# Read file line by line\n",
+    "for FILE_PATH in FILE_PATHS:\n",
+    "    with gzip.open(FILE_PATH, 'rb') as f:  # Use 'rb' for gzipped files\n",
+    "        for i, line in tqdm(\n",
+    "            enumerate(f), desc=f'Processing {FILE_PATH.split(\"/\")[-1]}'\n",
+    "        ):\n",
+    "            # Parse only the fields we need\n",
+    "            raw_data = json.loads(line)\n",
+    "            data.append(\n",
+    "                {\n",
+    "                    'resolved': raw_data['report']['resolved'],\n",
+    "                    'messages': raw_data['raw_completions']['messages']\n",
+    "                    if raw_data['raw_completions'] is not None\n",
+    "                    else None,\n",
+    "                    'git_patch': raw_data['test_result'].get('git_patch', ''),\n",
+    "                    'tools': raw_data['raw_completions']['tools']\n",
+    "                    if raw_data['raw_completions'] is not None\n",
+    "                    and 'tools' in raw_data['raw_completions']\n",
+    "                    else None,\n",
+    "                }\n",
+    "            )\n",
+    "\n",
+    "# Convert to DataFrame after collecting all data\n",
+    "df = pd.DataFrame(data)\n",
+    "print(f'#total amount of data={len(df)}')\n",
+    "df = df[~df['messages'].isna()]\n",
+    "print(f'#total amount of data after removing nan={len(df)}')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Filter"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def _contains_multiple_tool_calls(messages: list[dict]) -> bool:\n",
+    "    return any(\n",
+    "        message.get('tool_calls') and len(message['tool_calls']) > 1\n",
+    "        for message in messages\n",
+    "    )\n",
+    "\n",
+    "\n",
+    "df['contains_multiple_tool_calls'] = df['messages'].apply(_contains_multiple_tool_calls)\n",
+    "display(df.groupby(['contains_multiple_tool_calls'])['resolved'].sum())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "import copy\n",
+    "\n",
+    "# Convert function calling messages to non-function calling messages\n",
+    "from openhands.llm.fn_call_converter import (\n",
+    "    FunctionCallConversionError,\n",
+    "    convert_fncall_messages_to_non_fncall_messages,\n",
+    "    convert_from_multiple_tool_calls_to_single_tool_call_messages,\n",
+    ")\n",
+    "\n",
+    "total_failed = 0\n",
+    "\n",
+    "\n",
+    "def _convert_messages(messages: list[dict], tools: list[dict]) -> list[dict]:\n",
+    "    global total_failed\n",
+    "    message_copy = copy.deepcopy(messages)\n",
+    "    for message in message_copy:\n",
+    "        if message['content'] is None:\n",
+    "            message['content'] = ''\n",
+    "    try:\n",
+    "        return convert_fncall_messages_to_non_fncall_messages(\n",
+    "            message_copy, tools, add_in_context_learning_example=False\n",
+    "        )\n",
+    "    except FunctionCallConversionError:\n",
+    "        total_failed += 1\n",
+    "        # print(f'Failed to convert messages: {messages}\\nTools: {tools}')\n",
+    "        # traceback.print_exc()\n",
+    "        return None\n",
+    "\n",
+    "\n",
+    "df['converted_messages'] = df.apply(\n",
+    "    lambda row: convert_from_multiple_tool_calls_to_single_tool_call_messages(\n",
+    "        row['messages'], ignore_final_tool_result=True\n",
+    "    ),\n",
+    "    axis=1,\n",
+    ")\n",
+    "df['nonfncall_messages'] = df.apply(\n",
+    "    lambda row: _convert_messages(row['converted_messages'], row['tools']), axis=1\n",
+    ")\n",
+    "print('total nan', df['nonfncall_messages'].isna().sum())\n",
+    "df = df[~df['nonfncall_messages'].isna()]\n",
+    "print(df['nonfncall_messages'].iloc[0])\n",
+    "\n",
+    "print(f'Total failed: {total_failed}')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Tokenization"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pandarallel import pandarallel\n",
+    "from transformers import AutoTokenizer\n",
+    "\n",
+    "os.environ['TOKENIZERS_PARALLELISM'] = 'false'\n",
+    "pandarallel.initialize(progress_bar=True, verbose=1, nb_workers=16)\n",
+    "tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-7B-Instruct')\n",
+    "\n",
+    "\n",
+    "def clean_messages(messages):\n",
+    "    clean = []\n",
+    "    for msg in messages:\n",
+    "        if not isinstance(msg, dict):\n",
+    "            continue\n",
+    "        role = msg.get('role')\n",
+    "        content = msg.get('content')\n",
+    "        if isinstance(content, str):\n",
+    "            text = content\n",
+    "        elif isinstance(content, dict):\n",
+    "            text = content.get('text')\n",
+    "        elif (\n",
+    "            isinstance(content, list)\n",
+    "            and len(content) == 1\n",
+    "            and isinstance(content[0], dict)\n",
+    "        ):\n",
+    "            text = content[0].get('text')\n",
+    "        else:\n",
+    "            print(f'Format not accepted {content}')\n",
+    "            text = ''  # fall back to empty content so `text` is always defined\n",
+    "        clean.append({'role': role, 'content': text})\n",
+    "    return clean\n",
+    "\n",
+    "\n",
+    "# Step 1: Clean the messages\n",
+    "df['nonfncall_messages'] = df['nonfncall_messages'].apply(clean_messages)\n",
+    "\n",
+    "# Step 2: Compute token count\n",
+    "df['n_tokens'] = df['nonfncall_messages'].parallel_apply(\n",
+    "    lambda x: len(tokenizer.apply_chat_template(x))\n",
+    ")\n",
+    "\n",
+    "# print(df['nonfncall_messages'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(f'BEFORE: #total={len(df)}')\n",
+    "df_selected = df[df['n_tokens'] < 131072]\n",
+    "print(f'AFTER (filtered to < 128k tokens): #total={len(df_selected)}')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_selected['n_tokens'].describe()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# ecdf of n_tokens\n",
+    "import matplotlib.pyplot as plt\n",
+    "import seaborn as sns\n",
+    "\n",
+    "display(df.groupby(['resolved'])['n_tokens'].describe())\n",
+    "sns.ecdfplot(x='n_tokens', data=df, hue='resolved')\n",
+    "plt.show()\n",
+    "\n",
+    "print(f'#total={len(df)}')\n",
+    "df_selected = df[df['n_tokens'] < 131072]\n",
+    "print(f'#selected={len(df_selected)}')\n",
+    "display(df_selected.groupby(['resolved'])['n_tokens'].describe())\n",
+    "sns.ecdfplot(x='n_tokens', data=df_selected, hue='resolved')\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_selected[~df_selected['resolved']]['n_tokens'].describe()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_selected['resolved'].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_selected.groupby(['resolved'])['n_tokens'].describe()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Save Resolved Messages for SFT"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Flatten messages and change format to {\"content\": \"\", \"role\": \"\"}\n",
+    "df_selected[df_selected['resolved']][['nonfncall_messages']].rename(\n",
+    "    columns={'nonfncall_messages': 'messages'}\n",
+    ").to_json(\n",
+    "    os.path.join(\n",
+    "        'PATH_TO_FILE',\n",
+    "        f'policy_traj_128k_swegym_{df_selected[\"resolved\"].value_counts()[True]}i.jsonl',\n",
+    "    ),\n",
+    "    lines=True,\n",
+    "    orient='records',\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/evaluation/benchmarks/swe_perf/README.md
+++ b/evaluation/benchmarks/swe_perf/README.md
@@ -0,0 +1,81 @@
+# SWE-Perf Evaluation
+
+This folder contains the OpenHands inference pipeline for the [SWE-Perf benchmark](https://swe-perf.github.io/) ([paper](https://arxiv.org/pdf/2507.12415v1)).
+
+The evaluation consists of three steps:
+
+1. Environment setup: [install the Python environment](../../README.md#development-environment) and [configure your LLM](../../README.md#configure-openhands-and-your-llm).
+2. [Run inference](#running-inference-locally-with-docker): Generate an edit patch for each GitHub issue.
+3. [Evaluate patches](#evaluate-generated-patches)
+
+## Setup Environment and LLM Configuration
+
+Please follow the instructions [here](../../README.md#setup) to set up your local development environment and LLM.
+
+## Running inference Locally with Docker
+
+Make sure your Docker daemon is running and that you have ample disk space (at least 200-500GB, depending on the SWE-Perf set you are running on) for the instance-level Docker images.
+
+When the `run_infer.sh` script is started, it will automatically pull the relevant SWE-Perf images.
+For example, for instance ID `scikit-learn_scikit-learn-11674`, it will try to pull our pre-built Docker image `betty1202/sweb.eval.x86_64.scikit-learn_s_scikit-learn-11674` from DockerHub.
+This image will be used to create an OpenHands runtime image in which the agent operates.
+
+```bash
+./evaluation/benchmarks/swe_perf/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split] [n_runs] [mode]
+
+# Example
+./evaluation/benchmarks/swe_perf/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 500 100 1 SWE-Perf/SWE-Perf test
+```
+
+where `model_config` is mandatory, and the rest are optional.
+
+- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
+LLM settings, as defined in your `config.toml`.
+- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would
+like to evaluate. It could also be a release tag like `0.6.2`.
+- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
+to `CodeActAgent`.
+- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By
+default, the script evaluates the entire SWE-Perf test set (140 issues). Note:
+in order to use `eval_limit`, you must also set `agent`.
+- `max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By
+default, it is set to 100.
+- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By
+default, it is set to 1.
+- `dataset`, a Hugging Face dataset name, e.g. `SWE-Perf/SWE-Perf`, specifies which dataset to evaluate on.
+- `dataset_split`, the split of the Hugging Face dataset, e.g. `test` or `dev`. Defaults to `test`.
+
+- `n_runs`, e.g. `3`, is the number of times to run the evaluation. Default is 1.
+- `mode`, e.g. `swt`, `swt-ci`, or `swe`, specifies the evaluation mode. Default is `swe`.
+
+> [!CAUTION]
+> Setting `num_workers` larger than 1 is not officially tested, YMMV.
+
+
+Let's say you'd like to run 10 instances using `llm.eval_gpt4_1106_preview` and CodeActAgent; then your command would be:
+
+```bash
+./evaluation/benchmarks/swe_perf/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 10
+```
+
+## Evaluate Generated Patches
+
+
+To evaluate the generated patch, follow these steps:
+
+### 1. Convert output to the evaluation standard format
+Run the following command:
+```bash
+python -m evaluation.benchmarks.swe_perf.format_conversion \
+    --input_path [input_path] \
+    --output_path [output_path]
+```
+
+* `input_path`: Path to the raw JSONL output file produced by inference.
+* `output_path`: Directory where the converted file will be saved as `output.jsonl`.
+
+### 2. Run the SWE-Perf benchmark official evaluation
+
+Once the output is converted, use the [official SWE-Perf benchmark evaluation](https://github.com/SWE-Perf/SWE-Perf/tree/main/evaluation) to evaluate it.
--- a/evaluation/benchmarks/swe_perf/init.py
+++ b/evaluation/benchmarks/swe_perf/init.py
--- a/evaluation/benchmarks/swe_perf/binary_patch_utils.py
+++ b/evaluation/benchmarks/swe_perf/binary_patch_utils.py
@@ -0,0 +1,52 @@
+"""
+Utilities for handling binary files and patch generation in SWE-Perf evaluation.
+"""
+
+
+def remove_binary_diffs(patch_text):
+    """
+    Remove binary file diffs from a git patch.
+
+    Args:
+        patch_text (str): The git patch text
+
+    Returns:
+        str: The cleaned patch text with binary diffs removed
+    """
+    lines = patch_text.splitlines()
+    cleaned_lines = []
+    block = []
+    is_binary_block = False
+
+    for line in lines:
+        if line.startswith('diff --git '):
+            if block and not is_binary_block:
+                cleaned_lines.extend(block)
+            block = [line]
+            is_binary_block = False
+        elif 'Binary files' in line:
+            is_binary_block = True
+            block.append(line)
+        else:
+            block.append(line)
+
+    if block and not is_binary_block:
+        cleaned_lines.extend(block)
+    return '\n'.join(cleaned_lines)
+
+
+def remove_binary_files_from_git():
+    """
+    Generate a bash command to remove binary files from git staging.
+
+    Returns:
+        str: A bash command that removes binary files from git staging
+    """
+    return """
+    for file in $(git status --porcelain | grep -E "^(M| M|\\?\\?|A| A)" | cut -c4-); do
+        if [ -f "$file" ] && (file "$file" | grep -q "executable" || git check-attr binary "$file" | grep -q "binary: set"); then
+            git rm -f "$file" 2>/dev/null || rm -f "$file"
+            echo "Removed: $file"
+        fi
+    done
+    """.strip()
--- a/evaluation/benchmarks/swe_perf/format_conversion.py
+++ b/evaluation/benchmarks/swe_perf/format_conversion.py
@@ -0,0 +1,45 @@
+import json
+import os
+from argparse import ArgumentParser
+
+parser = ArgumentParser()
+parser.add_argument('--input_path', type=str, help='Path to the input JSONL file.')
+parser.add_argument(
+    '--output_path', type=str, help='Directory in which output.jsonl is written.'
+)
+args = parser.parse_args()
+
+input_path = args.input_path
+output_path = args.output_path
+os.makedirs(output_path, exist_ok=True)
+
+
+def load_jsonl(file_path):
+    """Load JSONL file into a list of dictionaries."""
+    data = []
+    with open(file_path, 'r') as f:
+        for line in f:
+            data.append(json.loads(line))
+    return data
+
+
+dataset = load_jsonl(input_path)
+output_dataset = []
+for data in dataset:
+    instance_id = data['instance_id']
+    model_name_or_path = 'openhands'
+    model_patch = (
+        data['test_result']['git_patch']
+        if 'test_result' in data and 'git_patch' in data['test_result']
+        else None
+    )
+    output_dataset.append(
+        {
+            'instance_id': instance_id,
+            'model_name_or_path': model_name_or_path,
+            'model_patch': model_patch,
+        }
+    )
+
+with open(os.path.join(output_path, 'output.jsonl'), 'w') as f:
+    for item in output_dataset:
+        json_line = json.dumps(item, ensure_ascii=False)
+        f.write(json_line + '\n')
--- a/evaluation/benchmarks/swe_perf/resource/mapping.py
+++ b/evaluation/benchmarks/swe_perf/resource/mapping.py
@@ -0,0 +1,39 @@
+"""Mapping instance_id to resource_factor.
+
+Different instances may have different resource requirements.
+e.g., some instances may require more memory/CPU to run inference.
+This file tracks the resource requirements of different instances.
+"""
+
+import json
+import os
+
+from openhands.core.logger import openhands_logger as logger
+
+CUR_DIR = os.path.dirname(os.path.abspath(__file__))
+DEFAULT_RUNTIME_RESOURCE_FACTOR = int(
+    os.environ.get('DEFAULT_RUNTIME_RESOURCE_FACTOR', 1)
+)
+
+# dataset to resource mapping
+_global_resource_mapping: dict[str, dict[str, float]] = {}
+
+
+def get_resource_mapping(dataset_name: str) -> dict[str, float] | None:
+    if dataset_name not in _global_resource_mapping:
+        file_path = os.path.join(CUR_DIR, f'{dataset_name}.json')
+        if not os.path.exists(file_path):
+            logger.info(f'Resource mapping for {dataset_name} not found.')
+            return None
+
+        with open(file_path, 'r') as f:
+            _global_resource_mapping[dataset_name] = json.load(f)
+        logger.debug(f'Loaded resource mapping for {dataset_name}')
+    return _global_resource_mapping[dataset_name]
+
+
+def get_instance_resource_factor(dataset_name: str, instance_id: str) -> int:
+    resource_mapping = get_resource_mapping(dataset_name)
+    if resource_mapping is None:
+        return DEFAULT_RUNTIME_RESOURCE_FACTOR
+    return int(resource_mapping.get(instance_id, DEFAULT_RUNTIME_RESOURCE_FACTOR))
--- a/evaluation/benchmarks/swe_perf/resource/swt_bench_constants.py
+++ b/evaluation/benchmarks/swe_perf/resource/swt_bench_constants.py
@@ -0,0 +1,842 @@
+# Based on https://github.com/logic-star-ai/swt-bench/blob/master/src/constants.py
+
+# Constants - Installation Specifications
+MAP_VERSION_TO_INSTALL_SKLEARN = {
+    k: {
+        'python': '3.6',
+        'packages': 'numpy scipy cython pytest pandas matplotlib',
+        'install': 'python -m pip install -v --no-use-pep517 --no-build-isolation -e .',
+        'pip_packages': [
+            'cython',
+            'numpy==1.19.2',
+            'setuptools',
+            'scipy==1.5.2',
+        ],
+    }
+    for k in ['0.20', '0.21', '0.22']
+}
+MAP_VERSION_TO_INSTALL_SKLEARN.update(
+    {
+        k: {
+            'python': '3.9',
+            'packages': "'numpy==1.19.2' 'scipy==1.5.2' 'cython==3.0.10' pytest 'pandas<2.0.0' 'matplotlib<3.9.0' setuptools pytest joblib threadpoolctl",
+            'install': 'python -m pip install -v --no-use-pep517 --no-build-isolation -e .',
+            'pip_packages': ['cython', 'setuptools', 'numpy', 'scipy'],
+        }
+        for k in ['1.3', '1.4']
+    }
+)
+MAP_VERSION_TO_INSTALL_FLASK = {
+    '2.0': {
+        'python': '3.9',
+        'packages': 'requirements.txt',
+        'install': 'python -m pip install -e .',
+        'pip_packages': [
+            'setuptools==70.0.0',
+            'Werkzeug==2.3.7',
+            'Jinja2==3.0.1',
+            'itsdangerous==2.1.2',
+            'click==8.0.1',
+            'MarkupSafe==2.1.3',
+        ],
+    },
+    '2.1': {
+        'python': '3.10',
+        'packages': 'requirements.txt',
+        'install': 'python -m pip install -e .',
+        'pip_packages': [
+            'click==8.1.3',
+            'itsdangerous==2.1.2',
+            'Jinja2==3.1.2',
+            'MarkupSafe==2.1.1',
+            'Werkzeug==2.3.7',
+        ],
+    },
+}
+MAP_VERSION_TO_INSTALL_FLASK.update(
+    {
+        k: {
+            'python': '3.11',
+            'packages': 'requirements.txt',
+            'install': 'python -m pip install -e .',
+            'pip_packages': [
+                'click==8.1.3',
+                'itsdangerous==2.1.2',
+                'Jinja2==3.1.2',
+                'MarkupSafe==2.1.1',
+                'Werkzeug==2.3.7',
+            ],
+        }
+        for k in ['2.2', '2.3']
+    }
+)
+MAP_VERSION_TO_INSTALL_DJANGO = {
+    k: {
+        'python': '3.5',
+        'packages': 'requirements.txt',
+        'pre_install': [
+            'apt-get update && apt-get install -y locales',
+            "echo 'en_US UTF-8' > /etc/locale.gen",
+            'locale-gen en_US.UTF-8',
+        ],
+        'install': 'python setup.py install',
+        'pip_packages': ['setuptools'],
+        'eval_commands': [
+            'export LANG=en_US.UTF-8',
+            'export LC_ALL=en_US.UTF-8',
+            'export PYTHONIOENCODING=utf8',
+            'export LANGUAGE=en_US:en',
+        ],
+    }
+    for k in ['1.7', '1.8', '1.9', '1.10', '1.11', '2.0', '2.1', '2.2']
+}
+MAP_VERSION_TO_INSTALL_DJANGO.update(
+    {
+        k: {'python': '3.5', 'install': 'python setup.py install'}
+        for k in ['1.4', '1.5', '1.6']
+    }
+)
+MAP_VERSION_TO_INSTALL_DJANGO.update(
+    {
+        k: {
+            'python': '3.6',
+            'packages': 'requirements.txt',
+            'install': 'python -m pip install -e .',
+            'eval_commands': [
+                "sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && locale-gen",
+                'export LANG=en_US.UTF-8',
+                'export LANGUAGE=en_US:en',
+                'export LC_ALL=en_US.UTF-8',
+            ],
+        }
+        for k in ['3.0', '3.1', '3.2']
+    }
+)
+MAP_VERSION_TO_INSTALL_DJANGO.update(
+    {
+        k: {
+            'python': '3.8',
+            'packages': 'requirements.txt',
+            'install': 'python -m pip install -e .',
+        }
+        for k in ['4.0']
+    }
+)
+MAP_VERSION_TO_INSTALL_DJANGO.update(
+    {
+        k: {
+            'python': '3.9',
+            'packages': 'requirements.txt',
+            'install': 'python -m pip install -e .',
+        }
+        for k in ['4.1', '4.2']
+    }
+)
+MAP_VERSION_TO_INSTALL_DJANGO.update(
+    {
+        k: {
+            'python': '3.11',
+            'packages': 'requirements.txt',
+            'install': 'python -m pip install -e .',
+        }
+        for k in ['5.0']
+    }
+)
+MAP_VERSION_TO_INSTALL_REQUESTS = {
+    k: {'python': '3.9', 'packages': 'pytest', 'install': 'python -m pip install .'}
+    for k in ['0.7', '0.8', '0.9', '0.11', '0.13', '0.14', '1.1', '1.2', '2.0', '2.2']
+    + ['2.3', '2.4', '2.5', '2.7', '2.8', '2.9', '2.10', '2.11', '2.12', '2.17']
+    + ['2.18', '2.19', '2.22', '2.26', '2.25', '2.27', '3.0']
+}
+MAP_VERSION_TO_INSTALL_SEABORN = {
+    k: {
+        'python': '3.9',
+        'install': 'python -m pip install -e .',
+        'pip_packages': [
+            'contourpy==1.1.0',
+            'cycler==0.11.0',
+            'fonttools==4.42.1',
+            'importlib-resources==6.0.1',
+            'kiwisolver==1.4.5',
+            'matplotlib==3.7.2',
+            'numpy==1.25.2',
+            'packaging==23.1',
+            'pandas==1.3.5',  # 2.0.3
+            'pillow==10.0.0',
+            'pyparsing==3.0.9',
+            'pytest',
+            'python-dateutil==2.8.2',
+            'pytz==2023.3.post1',
+            'scipy==1.11.2',
+            'six==1.16.0',
+            'tzdata==2023.1',
+            'zipp==3.16.2',
+        ],
+    }
+    for k in ['0.11']
+}
+MAP_VERSION_TO_INSTALL_SEABORN.update(
+    {
+        k: {
+            'python': '3.9',
+            'install': 'python -m pip install -e .[dev]',
+            'pip_packages': [
+                'contourpy==1.1.0',
+                'cycler==0.11.0',
+                'fonttools==4.42.1',
+                'importlib-resources==6.0.1',
+                'kiwisolver==1.4.5',
+                'matplotlib==3.7.2',
+                'numpy==1.25.2',
+                'packaging==23.1',
+                'pandas==2.0.0',
+                'pillow==10.0.0',
+                'pyparsing==3.0.9',
+                'pytest',
+                'python-dateutil==2.8.2',
+                'pytz==2023.3.post1',
+                'scipy==1.11.2',
+                'six==1.16.0',
+                'tzdata==2023.1',
+                'zipp==3.16.2',
+            ],
+        }
+        for k in ['0.12', '0.13']
+    }
+)
+MAP_VERSION_TO_INSTALL_PYTEST = {
+    k: {'python': '3.9', 'install': 'python -m pip install -e .'}
+    for k in [
+        '4.4',
+        '4.5',
+        '4.6',
+        '5.0',
+        '5.1',
+        '5.2',
+        '5.3',
+        '5.4',
+        '6.0',
+        '6.2',
+        '6.3',
+        '7.0',
+        '7.1',
+        '7.2',
+        '7.4',
+        '8.0',
+    ]
+}
+MAP_VERSION_TO_INSTALL_PYTEST['4.4']['pip_packages'] = [
+    'atomicwrites==1.4.1',
+    'attrs==23.1.0',
+    'more-itertools==10.1.0',
+    'pluggy==0.13.1',
+    'py==1.11.0',
+    'setuptools==68.0.0',
+    'six==1.16.0',
+]
+MAP_VERSION_TO_INSTALL_PYTEST['4.5']['pip_packages'] = [
+    'atomicwrites==1.4.1',
+    'attrs==23.1.0',
+    'more-itertools==10.1.0',
+    'pluggy==0.11.0',
+    'py==1.11.0',
+    'setuptools==68.0.0',
+    'six==1.16.0',
+    'wcwidth==0.2.6',
+]
+MAP_VERSION_TO_INSTALL_PYTEST['4.6']['pip_packages'] = [
+    'atomicwrites==1.4.1',
+    'attrs==23.1.0',
+    'more-itertools==10.1.0',
+    'packaging==23.1',
+    'pluggy==0.13.1',
+    'py==1.11.0',
+    'six==1.16.0',
+    'wcwidth==0.2.6',
+]
+for k in ['5.0', '5.1', '5.2']:
+    MAP_VERSION_TO_INSTALL_PYTEST[k]['pip_packages'] = [
+        'atomicwrites==1.4.1',
+        'attrs==23.1.0',
+        'more-itertools==10.1.0',
+        'packaging==23.1',
+        'pluggy==0.13.1',
+        'py==1.11.0',
+        'wcwidth==0.2.6',
+    ]
+MAP_VERSION_TO_INSTALL_PYTEST['5.3']['pip_packages'] = [
+    'attrs==23.1.0',
+    'more-itertools==10.1.0',
+    'packaging==23.1',
+    'pluggy==0.13.1',
+    'py==1.11.0',
+    'wcwidth==0.2.6',
+]
+MAP_VERSION_TO_INSTALL_PYTEST['5.4']['pip_packages'] = [
+    'py==1.11.0',
+    'packaging==23.1',
+    'attrs==23.1.0',
+    'more-itertools==10.1.0',
+    'pluggy==0.13.1',
+]
+MAP_VERSION_TO_INSTALL_PYTEST['6.0']['pip_packages'] = [
+    'attrs==23.1.0',
+    'iniconfig==2.0.0',
+    'more-itertools==10.1.0',
+    'packaging==23.1',
+    'pluggy==0.13.1',
+    'py==1.11.0',
+    'toml==0.10.2',
+]
+for k in ['6.2', '6.3']:
+    MAP_VERSION_TO_INSTALL_PYTEST[k]['pip_packages'] = [
+        'attrs==23.1.0',
+        'iniconfig==2.0.0',
+        'packaging==23.1',
+        'pluggy==0.13.1',
+        'py==1.11.0',
+        'toml==0.10.2',
+    ]
+MAP_VERSION_TO_INSTALL_PYTEST['7.0']['pip_packages'] = [
+    'attrs==23.1.0',
+    'iniconfig==2.0.0',
+    'packaging==23.1',
+    'pluggy==0.13.1',
+    'py==1.11.0',
+]
+for k in ['7.1', '7.2']:
+    MAP_VERSION_TO_INSTALL_PYTEST[k]['pip_packages'] = [
+        'attrs==23.1.0',
+        'iniconfig==2.0.0',
+        'packaging==23.1',
+        'pluggy==0.13.1',
+        'py==1.11.0',
+        'tomli==2.0.1',
+    ]
+MAP_VERSION_TO_INSTALL_PYTEST['7.4']['pip_packages'] = [
+    'iniconfig==2.0.0',
+    'packaging==23.1',
+    'pluggy==1.3.0',
+    'exceptiongroup==1.1.3',
+    'tomli==2.0.1',
+]
+MAP_VERSION_TO_INSTALL_PYTEST['8.0']['pip_packages'] = [
+    'iniconfig==2.0.0',
+    'packaging==23.1',
+    'pluggy==1.3.0',
+    'exceptiongroup==1.1.3',
+    'tomli==2.0.1',
+]
+MAP_VERSION_TO_INSTALL_MATPLOTLIB = {
+    k: {
+        'python': '3.11',
+        'packages': 'environment.yml',
+        'install': 'python -m pip install -e .',
+        'pre_install': [
+            'apt-get -y update && apt-get -y upgrade && apt-get install -y imagemagick ffmpeg texlive texlive-latex-extra texlive-fonts-recommended texlive-xetex texlive-luatex cm-super dvipng'
+        ],
+        'pip_packages': [
+            'contourpy==1.1.0',
+            'cycler==0.11.0',
+            'fonttools==4.42.1',
+            'ghostscript',
+            'kiwisolver==1.4.5',
+            'numpy==1.25.2',
+            'packaging==23.1',
+            'pillow==10.0.0',
+            'pikepdf',
+            'pyparsing==3.0.9',
+            'python-dateutil==2.8.2',
+            'six==1.16.0',
+            'setuptools==68.1.2',
+            'setuptools-scm==7.1.0',
+            'typing-extensions==4.7.1',
+        ],
+    }
+    for k in ['3.5', '3.6', '3.7']
+}
+MAP_VERSION_TO_INSTALL_MATPLOTLIB.update(
+    {
+        k: {
+            'python': '3.8',
+            'packages': 'requirements.txt',
+            'install': 'python -m pip install -e .',
+            'pre_install': [
+                'apt-get -y update && apt-get -y upgrade && apt-get install -y imagemagick ffmpeg libfreetype6-dev pkg-config texlive texlive-latex-extra texlive-fonts-recommended texlive-xetex texlive-luatex cm-super'
+            ],
+            'pip_packages': ['pytest', 'ipython'],
+        }
+        for k in ['3.1', '3.2', '3.3', '3.4']
+    }
+)
+MAP_VERSION_TO_INSTALL_MATPLOTLIB.update(
+    {
+        k: {
+            'python': '3.7',
+            'packages': 'requirements.txt',
+            'install': 'python -m pip install -e .',
+            'pre_install': [
+                'apt-get -y update && apt-get -y upgrade && apt-get install -y imagemagick ffmpeg libfreetype6-dev pkg-config'
+            ],
+            'pip_packages': ['pytest'],
+        }
+        for k in ['3.0']
+    }
+)
+MAP_VERSION_TO_INSTALL_MATPLOTLIB.update(
+    {
+        k: {
+            'python': '3.5',
+            'install': 'python setup.py build; python setup.py install',
+            'pre_install': [
+                'apt-get -y update && apt-get -y upgrade && apt-get install -y imagemagick ffmpeg'
+            ],
+            'pip_packages': ['pytest'],
+            'execute_test_as_nonroot': True,
+        }
+        for k in ['2.0', '2.1', '2.2', '1.0', '1.1', '1.2', '1.3', '1.4', '1.5']
+    }
+)
+MAP_VERSION_TO_INSTALL_SPHINX = {
+    k: {
+        'python': '3.9',
+        'pip_packages': ['tox==4.16.0', 'tox-current-env==0.0.11'],
+        'install': 'python -m pip install -e .[test]',
+        'pre_install': ["sed -i 's/pytest/pytest -rA/' tox.ini"],
+    }
+    for k in ['1.5', '1.6', '1.7', '1.8', '2.0', '2.1', '2.2', '2.3', '2.4', '3.0']
+    + ['3.1', '3.2', '3.3', '3.4', '3.5', '4.0', '4.1', '4.2', '4.3', '4.4']
+    + ['4.5', '5.0', '5.1', '5.2', '5.3', '6.0', '6.2', '7.0', '7.1', '7.2']
+}
+for k in ['3.0', '3.1', '3.2', '3.3', '3.4', '3.5', '4.0', '4.1', '4.2', '4.3', '4.4']:
+    MAP_VERSION_TO_INSTALL_SPHINX[k]['pre_install'].extend(
+        [
+            "sed -i 's/Jinja2>=2.3/Jinja2<3.0/' setup.py",
+            "sed -i 's/sphinxcontrib-applehelp/sphinxcontrib-applehelp<=1.0.7/' setup.py",
+            "sed -i 's/sphinxcontrib-devhelp/sphinxcontrib-devhelp<=1.0.5/' setup.py",
+            "sed -i 's/sphinxcontrib-qthelp/sphinxcontrib-qthelp<=1.0.6/' setup.py",
+            "sed -i 's/alabaster>=0.7,<0.8/alabaster>=0.7,<0.7.12/' setup.py",
+            "sed -i \"s/'packaging',/'packaging', 'markupsafe<=2.0.1',/\" setup.py",
+        ]
+    )
+    if k in ['4.2', '4.3', '4.4']:
+        MAP_VERSION_TO_INSTALL_SPHINX[k]['pre_install'].extend(
+            [
+                "sed -i 's/sphinxcontrib-htmlhelp>=2.0.0/sphinxcontrib-htmlhelp>=2.0.0,<=2.0.4/' setup.py",
+                "sed -i 's/sphinxcontrib-serializinghtml>=1.1.5/sphinxcontrib-serializinghtml>=1.1.5,<=1.1.9/' setup.py",
+            ]
+        )
+    elif k == '4.1':
+        MAP_VERSION_TO_INSTALL_SPHINX[k]['pre_install'].extend(
+            [
+                (
+                    "grep -q 'sphinxcontrib-htmlhelp>=2.0.0' setup.py && "
+                    "sed -i 's/sphinxcontrib-htmlhelp>=2.0.0/sphinxcontrib-htmlhelp>=2.0.0,<=2.0.4/' setup.py || "
+                    "sed -i 's/sphinxcontrib-htmlhelp/sphinxcontrib-htmlhelp<=2.0.4/' setup.py"
+                ),
+                (
+                    "grep -q 'sphinxcontrib-serializinghtml>=1.1.5' setup.py && "
+                    "sed -i 's/sphinxcontrib-serializinghtml>=1.1.5/sphinxcontrib-serializinghtml>=1.1.5,<=1.1.9/' setup.py || "
+                    "sed -i 's/sphinxcontrib-serializinghtml/sphinxcontrib-serializinghtml<=1.1.9/' setup.py"
+                ),
+            ]
+        )
+    else:
+        MAP_VERSION_TO_INSTALL_SPHINX[k]['pre_install'].extend(
+            [
+                "sed -i 's/sphinxcontrib-htmlhelp/sphinxcontrib-htmlhelp<=2.0.4/' setup.py",
+                "sed -i 's/sphinxcontrib-serializinghtml/sphinxcontrib-serializinghtml<=1.1.9/' setup.py",
+            ]
+        )
+MAP_VERSION_TO_INSTALL_SPHINX['7.2']['pre_install'] += [
+    'apt-get update && apt-get install -y graphviz'
+]
+MAP_VERSION_TO_INSTALL_ASTROPY = {
+    k: {
+        'python': '3.9',
+        'install': 'python -m pip install -e .[test] --verbose',
+        'pip_packages': [
+            'attrs==23.1.0',
+            'exceptiongroup==1.1.3',
+            'execnet==2.0.2',
+            'hypothesis==6.82.6',
+            'iniconfig==2.0.0',
+            'numpy==1.25.2',
+            'packaging==23.1',
+            'pluggy==1.3.0',
+            'psutil==5.9.5',
+            'pyerfa==2.0.0.3',
+            'pytest-arraydiff==0.5.0',
+            'pytest-astropy-header==0.2.2',
+            'pytest-astropy==0.10.0',
+            'pytest-cov==4.1.0',
+            'pytest-doctestplus==1.0.0',
+            'pytest-filter-subpackage==0.1.2',
+            'pytest-mock==3.11.1',
+            'pytest-openfiles==0.5.0',
+            'pytest-remotedata==0.4.0',
+            'pytest-xdist==3.3.1',
+            'pytest==7.4.0',
+            'PyYAML==6.0.1',
+            'setuptools==68.0.0',
+            'sortedcontainers==2.4.0',
+            'tomli==2.0.1',
+        ],
+    }
+    for k in ['0.1', '0.2', '0.3', '0.4', '1.1', '1.2', '1.3', '3.0', '3.1', '3.2']
+    + ['4.1', '4.2', '4.3', '5.0', '5.1', '5.2']
+}
+for k in ['4.1', '4.2', '4.3', '5.0', '5.1', '5.2']:
+    MAP_VERSION_TO_INSTALL_ASTROPY[k]['pre_install'] = [
+        'sed -i \'s/requires = \\["setuptools",/requires = \\["setuptools==68.0.0",/\' pyproject.toml'
+    ]
+MAP_VERSION_TO_INSTALL_SYMPY = {
+    k: {
+        'python': '3.9',
+        'packages': 'mpmath flake8',
+        'pip_packages': ['mpmath==1.3.0', 'flake8-comprehensions'],
+        'install': 'python -m pip install -e .',
+    }
+    for k in ['0.7', '1.0', '1.1', '1.10', '1.11', '1.12', '1.2', '1.4', '1.5', '1.6']
+    + ['1.7', '1.8', '1.9']
+}
+MAP_VERSION_TO_INSTALL_SYMPY.update(
+    {
+        k: {
+            'python': '3.9',
+            'packages': 'requirements.txt',
+            'install': 'python -m pip install -e .',
+            'pip_packages': ['mpmath==1.3.0'],
+        }
+        for k in ['1.13']
+    }
+)
+MAP_VERSION_TO_INSTALL_PYLINT = {
+    k: {
+        'python': '3.9',
+        'packages': 'requirements.txt',
+        'install': 'python -m pip install -e .',
+    }
+    for k in [
+        '2.10',
+        '2.11',
+        '2.13',
+        '2.14',
+        '2.15',
+        '2.16',
+        '2.17',
+        '2.8',
+        '2.9',
+        '3.0',
+    ]
+}
+MAP_VERSION_TO_INSTALL_PYLINT['2.8']['pip_packages'] = ['pyenchant==3.2']
+MAP_VERSION_TO_INSTALL_PYLINT['2.8']['pre_install'] = [
+    'apt-get update && apt-get install -y libenchant-2-dev hunspell-en-us'
+]
+MAP_VERSION_TO_INSTALL_PYLINT.update(
+    {
+        k: {
+            **MAP_VERSION_TO_INSTALL_PYLINT[k],
+            'pip_packages': ['astroid==3.0.0a6', 'setuptools'],
+        }
+        for k in ['3.0']
+    }
+)
+
+MAP_VERSION_TO_INSTALL_XARRAY = {
+    k: {
+        'python': '3.10',
+        'packages': 'environment.yml',
+        'install': 'python -m pip install -e .',
+        'pip_packages': [
+            'numpy==1.23.0',
+            'packaging==23.1',
+            'pandas==1.5.3',
+            'pytest==7.4.0',
+            'python-dateutil==2.8.2',
+            'pytz==2023.3',
+            'six==1.16.0',
+            'scipy==1.11.1',
+            'setuptools==68.0.0',
+        ],
+        'no_use_env': True,
+    }
+    for k in ['0.12', '0.18', '0.19', '0.20', '2022.03', '2022.06', '2022.09']
+}
+
+MAP_VERSION_TO_INSTALL_SQLFLUFF = {
+    k: {
+        'python': '3.9',
+        'packages': 'requirements.txt',
+        'install': 'python -m pip install -e .',
+    }
+    for k in [
+        '0.10',
+        '0.11',
+        '0.12',
+        '0.13',
+        '0.4',
+        '0.5',
+        '0.6',
+        '0.8',
+        '0.9',
+        '1.0',
+        '1.1',
+        '1.2',
+        '1.3',
+        '1.4',
+        '2.0',
+        '2.1',
+        '2.2',
+    ]
+}
+MAP_VERSION_TO_INSTALL_DBT_CORE = {
+    k: {
+        'python': '3.9',
+        'packages': 'requirements.txt',
+        'install': 'python -m pip install -e .',
+    }
+    for k in [
+        '0.13',
+        '0.14',
+        '0.15',
+        '0.16',
+        '0.17',
+        '0.18',
+        '0.19',
+        '0.20',
+        '0.21',
+        '1.0',
+        '1.1',
+        '1.2',
+        '1.3',
+        '1.4',
+        '1.5',
+        '1.6',
+        '1.7',
+    ]
+}
+MAP_VERSION_TO_INSTALL_PYVISTA = {
+    k: {
+        'python': '3.9',
+        'install': 'python -m pip install -e .',
+        'pip_packages': ['pytest'],
+    }
+    for k in ['0.20', '0.21', '0.22', '0.23']
+}
+MAP_VERSION_TO_INSTALL_PYVISTA.update(
+    {
+        k: {
+            'python': '3.9',
+            'packages': 'requirements.txt',
+            'install': 'python -m pip install -e .',
+            'pip_packages': ['pytest'],
+        }
+        for k in [
+            '0.24',
+            '0.25',
+            '0.26',
+            '0.27',
+            '0.28',
+            '0.29',
+            '0.30',
+            '0.31',
+            '0.32',
+            '0.33',
+            '0.34',
+            '0.35',
+            '0.36',
+            '0.37',
+            '0.38',
+            '0.39',
+            '0.40',
+            '0.41',
+            '0.42',
+            '0.43',
+        ]
+    }
+)
+MAP_VERSION_TO_INSTALL_ASTROID = {
+    k: {
+        'python': '3.9',
+        'install': 'python -m pip install -e .',
+        'pip_packages': ['pytest'],
+    }
+    for k in [
+        '2.10',
+        '2.12',
+        '2.13',
+        '2.14',
+        '2.15',
+        '2.16',
+        '2.5',
+        '2.6',
+        '2.7',
+        '2.8',
+        '2.9',
+        '3.0',
+    ]
+}
+MAP_VERSION_TO_INSTALL_MARSHMALLOW = {
+    k: {
+        'python': '3.9',
+        'install': "python -m pip install -e '.[dev]'",
+    }
+    for k in [
+        '2.18',
+        '2.19',
+        '2.20',
+        '3.0',
+        '3.1',
+        '3.10',
+        '3.11',
+        '3.12',
+        '3.13',
+        '3.15',
+        '3.16',
+        '3.19',
+        '3.2',
+        '3.4',
+        '3.8',
+        '3.9',
+    ]
+}
+MAP_VERSION_TO_INSTALL_PVLIB = {
+    k: {
+        'python': '3.9',
+        'install': 'python -m pip install -e .[all]',
+        'packages': 'pandas scipy',
+        'pip_packages': ['jupyter', 'ipython', 'matplotlib', 'pytest', 'flake8'],
+    }
+    for k in ['0.1', '0.2', '0.3', '0.4', '0.5', '0.6', '0.7', '0.8', '0.9']
+}
+MAP_VERSION_TO_INSTALL_PYDICOM = {
+    k: {'python': '3.6', 'install': 'python -m pip install -e .', 'packages': 'numpy'}
+    for k in [
+        '1.0',
+        '1.1',
+        '1.2',
+        '1.3',
+        '1.4',
+        '2.0',
+        '2.1',
+        '2.2',
+        '2.3',
+        '2.4',
+        '3.0',
+    ]
+}
+MAP_VERSION_TO_INSTALL_PYDICOM.update(
+    {k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], 'python': '3.8'} for k in ['1.4', '2.0']}
+)
+MAP_VERSION_TO_INSTALL_PYDICOM.update(
+    {k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], 'python': '3.9'} for k in ['2.1', '2.2']}
+)
+MAP_VERSION_TO_INSTALL_PYDICOM.update(
+    {k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], 'python': '3.10'} for k in ['2.3']}
+)
+MAP_VERSION_TO_INSTALL_PYDICOM.update(
+    {k: {**MAP_VERSION_TO_INSTALL_PYDICOM[k], 'python': '3.11'} for k in ['2.4', '3.0']}
+)
+MAP_VERSION_TO_INSTALL_HUMANEVAL = {k: {'python': '3.9'} for k in ['1.0']}
+MAP_VERSION_TO_INSTALL_HUMANEVAL_FIX = {
+    k: {'python': '3.10', 'packages': 'pytest'} for k in ['0.0.1']
+}
+
+# Constants - Task Instance Installation Environment
+MAP_VERSION_TO_INSTALL = {
+    'astropy/astropy': MAP_VERSION_TO_INSTALL_ASTROPY,
+    'dbt-labs/dbt-core': MAP_VERSION_TO_INSTALL_DBT_CORE,
+    'django/django': MAP_VERSION_TO_INSTALL_DJANGO,
+    'matplotlib/matplotlib': MAP_VERSION_TO_INSTALL_MATPLOTLIB,
+    'marshmallow-code/marshmallow': MAP_VERSION_TO_INSTALL_MARSHMALLOW,
+    'mwaskom/seaborn': MAP_VERSION_TO_INSTALL_SEABORN,
+    'pallets/flask': MAP_VERSION_TO_INSTALL_FLASK,
+    'psf/requests': MAP_VERSION_TO_INSTALL_REQUESTS,
+    'pvlib/pvlib-python': MAP_VERSION_TO_INSTALL_PVLIB,
+    'pydata/xarray': MAP_VERSION_TO_INSTALL_XARRAY,
+    'pydicom/pydicom': MAP_VERSION_TO_INSTALL_PYDICOM,
+    'pylint-dev/astroid': MAP_VERSION_TO_INSTALL_ASTROID,
+    'pylint-dev/pylint': MAP_VERSION_TO_INSTALL_PYLINT,
+    'pytest-dev/pytest': MAP_VERSION_TO_INSTALL_PYTEST,
+    'pyvista/pyvista': MAP_VERSION_TO_INSTALL_PYVISTA,
+    'scikit-learn/scikit-learn': MAP_VERSION_TO_INSTALL_SKLEARN,
+    'sphinx-doc/sphinx': MAP_VERSION_TO_INSTALL_SPHINX,
+    'sqlfluff/sqlfluff': MAP_VERSION_TO_INSTALL_SQLFLUFF,
+    'swe-bench/humaneval': MAP_VERSION_TO_INSTALL_HUMANEVAL,
+    'nielstron/humaneval_fix': MAP_VERSION_TO_INSTALL_HUMANEVAL_FIX,
+    'sympy/sympy': MAP_VERSION_TO_INSTALL_SYMPY,
+}
+
+# Constants - Repository Specific Installation Instructions
+MAP_REPO_TO_INSTALL = {}
+
+# Constants - Task Instance Test Frameworks
+TEST_PYTEST_VERBOSE = 'pytest -rA --tb=long -p no:cacheprovider'
+MAP_REPO_TO_TEST_FRAMEWORK_VERBOSE = {
+    'astropy/astropy': {
+        k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_ASTROPY.keys()
+    },
+    'django/django': {
+        k: './tests/runtests.py --verbosity 2 --settings=test_sqlite --parallel 1'
+        for k in MAP_VERSION_TO_INSTALL_DJANGO.keys()
+    },
+    'marshmallow-code/marshmallow': {
+        k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_MARSHMALLOW.keys()
+    },
+    'matplotlib/matplotlib': {
+        k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_MATPLOTLIB.keys()
+    },
+    'mwaskom/seaborn': {
+        k: 'pytest -rA --tb=long' for k in MAP_VERSION_TO_INSTALL_SEABORN.keys()
+    },
+    'pallets/flask': {
+        k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_FLASK.keys()
+    },
+    'psf/requests': {
+        k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_REQUESTS.keys()
+    },
+    'pvlib/pvlib-python': {
+        k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PVLIB.keys()
+    },
+    'pydata/xarray': {
+        k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_XARRAY.keys()
+    },
+    'pydicom/pydicom': {
+        k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PYDICOM.keys()
+    },
+    'pylint-dev/astroid': {
+        k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_ASTROID.keys()
+    },
+    'pylint-dev/pylint': {
+        k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PYLINT.keys()
+    },
+    'pytest-dev/pytest': {
+        k: 'pytest -rA --tb=long' for k in MAP_VERSION_TO_INSTALL_PYTEST.keys()
+    },
+    'pyvista/pyvista': {
+        k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_PYVISTA.keys()
+    },
+    'scikit-learn/scikit-learn': {
+        k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_SKLEARN.keys()
+    },
+    'sphinx-doc/sphinx': {
+        k: 'tox -epy39 -v --' for k in MAP_VERSION_TO_INSTALL_SPHINX.keys()
+    },
+    'sqlfluff/sqlfluff': {
+        k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_SQLFLUFF.keys()
+    },
+    'swe-bench/humaneval': {
+        k: 'python' for k in MAP_VERSION_TO_INSTALL_HUMANEVAL.keys()
+    },
+    'nielstron/humaneval_fix': {
+        k: TEST_PYTEST_VERBOSE for k in MAP_VERSION_TO_INSTALL_HUMANEVAL_FIX.keys()
+    },
+    'sympy/sympy': {
+        k: 'bin/test -C --verbose' for k in MAP_VERSION_TO_INSTALL_SYMPY.keys()
+    },
+}
+MAP_REPO_TO_TEST_FRAMEWORK_VERBOSE['django/django']['1.9'] = (
+    './tests/runtests.py --verbosity 2'
+)
--- a/evaluation/benchmarks/swe_perf/run_infer.py
+++ b/evaluation/benchmarks/swe_perf/run_infer.py
@@ -0,0 +1,978 @@
+import asyncio
+import copy
+import json
+import os
+import tempfile
+from typing import Any, Literal
+
+import pandas as pd
+import toml
+from datasets import load_dataset
+
+import openhands.agenthub
+from evaluation.benchmarks.swe_perf.binary_patch_utils import (
+    remove_binary_diffs,
+    remove_binary_files_from_git,
+)
+from evaluation.benchmarks.swe_perf.resource.mapping import (
+    get_instance_resource_factor,
+)
+from evaluation.benchmarks.swe_perf.resource.swt_bench_constants import (
+    MAP_REPO_TO_INSTALL,
+    MAP_VERSION_TO_INSTALL,
+)
+from evaluation.utils.shared import (
+    EvalException,
+    EvalMetadata,
+    EvalOutput,
+    assert_and_raise,
+    check_maximum_retries_exceeded,
+    codeact_user_response,
+    get_default_sandbox_config_for_eval,
+    get_metrics,
+    is_fatal_evaluation_error,
+    make_metadata,
+    prepare_dataset,
+    reset_logger_for_multiprocessing,
+    run_evaluation,
+    update_llm_config_for_completions_logging,
+)
+from openhands.controller.state.state import State
+from openhands.core.config import (
+    AgentConfig,
+    OpenHandsConfig,
+    get_evaluation_parser,
+    get_llm_config_arg,
+)
+from openhands.core.config.condenser_config import NoOpCondenserConfig
+from openhands.core.config.utils import get_condenser_config_arg
+from openhands.core.logger import openhands_logger as logger
+from openhands.core.main import create_runtime, run_controller
+from openhands.critic import AgentFinishedCritic
+from openhands.events.action import CmdRunAction, FileReadAction, MessageAction
+from openhands.events.observation import (
+    CmdOutputObservation,
+    ErrorObservation,
+    FileReadObservation,
+)
+from openhands.events.serialization.event import event_from_dict, event_to_dict
+from openhands.runtime.base import Runtime
+from openhands.utils.async_utils import call_async_from_sync
+from openhands.utils.shutdown_listener import sleep_if_should_continue
+
+USE_HINT_TEXT = os.environ.get('USE_HINT_TEXT', 'false').lower() == 'true'
+RUN_WITH_BROWSING = os.environ.get('RUN_WITH_BROWSING', 'false').lower() == 'true'
+ENABLE_LLM_EDITOR = os.environ.get('ENABLE_LLM_EDITOR', 'false').lower() == 'true'
+BenchMode = Literal['swe', 'swt', 'swt-ci']
+
+# Global variable to track dataset type
+DATASET_TYPE = 'SWE-Perf'
+
+
+AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
+    'CodeActAgent': codeact_user_response,
+}
+
+
+def _get_sweperf_workspace_dir_name(instance: pd.Series) -> str:
+    return f'{instance.repo}__{instance.version}'.replace('/', '__')
+
+
+def get_instruction(instance: pd.Series, metadata: EvalMetadata) -> MessageAction:
+    workspace_dir_name = _get_sweperf_workspace_dir_name(instance)
+
+    # The instruction
+    instruction = f"""
+<uploaded_files>
+/workspace/{workspace_dir_name}
+</uploaded_files>
+
+I've uploaded a Python code repository in the directory {workspace_dir_name}. Consider the following issue description:
+
+
+<issue_description>
+{instance.problem_statement_realistic}
+</issue_description>
+
+Can you help me implement the necessary changes to the repository so that the requirements specified in the <issue_description> are met?
+I've already taken care of all changes to any of the test files described in the <issue_description>. This means you DON'T have to modify the testing logic or any of the tests in any way!
+Also the development Python environment is already set up for you (i.e., all dependencies already installed), so you don't need to install other packages.
+Your task is to make the minimal changes to non-test files in the /workspace/{workspace_dir_name} directory to ensure the <issue_description> is satisfied.
+
+Follow these phases to resolve the issue:
+
+## ⚙️ Phase 1: Understand the Problem & Test Reuse
+
+**1.1. Install the package locally:**
+
+```bash
+python -m pip install pyinstrument
+python -m pip install -e .
+```
+
+> Only proceed to README-based install if the above fails.
+
+**1.2. Identify relevant modules and logic:**
+
+* Use test cases mentioned in `<issue_description>` to locate the functions and files involved.
+* Focus on potential performance bottlenecks: loops, I/O, locks, cache access, data structures, etc.
+
+**1.3. Run initial benchmark:**
+
+```bash
+pytest -rA --durations=0 --disable-warnings -p no:warnings --tb=no <test_case>
+```
+
+## 📊 Phase 2: Localization (Hierarchical Bottleneck Detection)
+
+**2.1. Global profiling using `pyinstrument`:**
+
+```bash
+pyinstrument -m pytest -rA --durations=0 --disable-warnings --tb=no --continue-on-collection-errors -p no:warnings <test_case>
+```
+
+**2.2. Analyze performance stack if necessary:**
+
+* 🔍 **Module level**: Identify hot files and methods.
+* 🔬 **Function level**: Focus on top-consuming classes/functions.
+* 🧬 **Line level**: Add fine-grained sampling/logging if needed.
+
+**2.3. Output a layered summary** showing where time is spent and why.
+
+
+## 🧠 Phase 3: Repair (Design Candidate Fixes)
+
+**3.1. Propose multiple optimization ideas:**
+
+* Algorithm refinement
+* Data structure improvement
+* Parallelism / async
+* Caching / batching
+
+**3.2. For each candidate:**
+
+* Describe the idea using pseudocode or `diff`
+* Evaluate expected gain vs implementation complexity
+
+---
+
+## 🔬 Phase 4: Patch Validation (Quantitative Benchmarking)
+
+**4.1. Apply each patch separately**
+
+**4.2. Re-run benchmark using the same test case**
+
+**4.3. Record improvements in pytest `call` phase**
+
+**4.4. Build a comparison table:**
+
+| Patch | Call Time (s) | % Improvement | Notes |
+| ----- | ------------- | ------------- | ----- |
+
+**4.5. Select the best-performing, low-risk patch**
+
+---
+
+## 📌 Phase 5: Fix Analysis (Explain the Fix)
+
+**5.1.** Clearly describe the **problem**
+
+**5.2.** Identify the **exact location** in the code
+
+**5.3.** Describe **how the test reveals** the problem
+
+**5.4.** State **best practices** relevant to the fix
+
+**5.5.** Explain **how the fix resolves** the issue
+
+---
+
+## 🧩 Phase 6: Fix Implementation
+
+**6.1.** Implement the selected fix in the non-test source files
+**6.2.** Ensure the changes are minimal and performance-focused
+
+---
+
+## ✅ Phase 7: Final Verification & Review
+
+**7.1.** Run **full regression tests** to ensure correctness
+
+**7.2.** Re-benchmark performance and compare with base commit `{instance['base_commit']}`
+
+**7.3.** Ensure `call` time is significantly improved in `pytest` output
+
+**7.4.** If gain is limited, consider deeper optimization
+
+**7.5.** Write a concise final report including:
+
+* Bottleneck path
+* Fix alternatives considered
+* Final solution and metrics
+* Code diff summary
+
+## Important ##
+1. Please do not optimize specifically for certain test cases, as this may be considered cheating.
+2. Delete all files unrelated to optimization, including temporary files, auxiliary test code, and so on.
+3. Avoid excessive optimization and unnecessary divergence; if the improvement is not significant, stop promptly to maintain efficiency and focus.
+
+Be thorough in your exploration, testing, and reasoning. It's fine if your thinking process is lengthy - quality and completeness are more important than brevity.
+"""
+
+    if RUN_WITH_BROWSING:
+        instruction += (
+            '<IMPORTANT!>\nYou SHOULD NEVER attempt to browse the web. </IMPORTANT!>\n'
+        )
+
+    if 'image_assets' in instance:
+        assets = json.loads(instance['image_assets'])
+        assert 'problem_statement' in assets, (
+            'problem_statement is required in image_assets'
+        )
+        image_urls = assets['problem_statement']
+        return MessageAction(content=instruction, image_urls=image_urls)
+    return MessageAction(content=instruction)
+
+
+def get_instance_docker_image(
+    instance_id: str,
+) -> str:
+    docker_image_prefix = 'docker.io/betty1202/'
+    image_name = 'sweb.eval.x86_64.' + instance_id
+    image_name = image_name.replace(
+        '__', '_s_'
+    )  # to comply with docker image naming convention
+    return (docker_image_prefix.rstrip('/') + '/' + image_name).lower()
+
+
+def get_config(
+    instance: pd.Series,
+    metadata: EvalMetadata,
+) -> OpenHandsConfig:
+    base_container_image = get_instance_docker_image(
+        instance['instance_id'],
+    )
+    logger.info(
+        f'Using instance container image: {base_container_image}. '
+        f'Please make sure this image exists. '
+        f'Submit an issue on https://github.com/All-Hands-AI/OpenHands if you run into any issues.'
+    )
+
+    sandbox_config = get_default_sandbox_config_for_eval()
+    sandbox_config.base_container_image = base_container_image
+    sandbox_config.enable_auto_lint = True
+    sandbox_config.use_host_network = False
+    # Add platform to the sandbox config to solve issue 4401
+    sandbox_config.platform = 'linux/amd64'
+    sandbox_config.remote_runtime_resource_factor = get_instance_resource_factor(
+        dataset_name=metadata.dataset,
+        instance_id=instance['instance_id'],
+    )
+
+    config = OpenHandsConfig(
+        default_agent=metadata.agent_class,
+        run_as_openhands=False,
+        max_iterations=metadata.max_iterations,
+        enable_browser=RUN_WITH_BROWSING,
+        runtime=os.environ.get('RUNTIME', 'docker'),
+        sandbox=sandbox_config,
+        # do not mount workspace
+        workspace_base=None,
+        workspace_mount_path=None,
+    )
+
+    config.set_llm_config(
+        update_llm_config_for_completions_logging(
+            metadata.llm_config, metadata.eval_output_dir, instance['instance_id']
+        )
+    )
+    # get 'draft_editor' config if exists
+    config.set_llm_config(get_llm_config_arg('draft_editor'), 'draft_editor')
+
+    agent_config = AgentConfig(
+        enable_jupyter=False,
+        enable_browsing=RUN_WITH_BROWSING,
+        enable_llm_editor=ENABLE_LLM_EDITOR,
+        enable_mcp=False,
+        condenser=metadata.condenser_config,
+        enable_prompt_extensions=False,
+    )
+    config.set_agent_config(agent_config)
+    return config
+
+
+def initialize_runtime(
+    runtime: Runtime,
+    instance: pd.Series,  # provides the instance id and workspace name
+    metadata: EvalMetadata,
+):
+    """Initialize the runtime for the agent.
+
+    This function is called before the runtime is used to run the agent.
+    """
+    logger.info('-' * 30)
+    logger.info('BEGIN Runtime Initialization Fn')
+    logger.info('-' * 30)
+    workspace_dir_name = _get_sweperf_workspace_dir_name(instance)
+    obs: CmdOutputObservation
+
+    # Set instance id and git configuration
+    action = CmdRunAction(
+        command=f"""echo 'export SWE_INSTANCE_ID={instance['instance_id']}' >> ~/.bashrc && echo 'export PIP_CACHE_DIR=~/.cache/pip' >> ~/.bashrc && echo "alias git='git --no-pager'" >> ~/.bashrc && git config --global core.pager "" && git config --global diff.binary false"""
+    )
+    action.set_hard_timeout(600)
+    logger.info(action, extra={'msg_type': 'ACTION'})
+    obs = runtime.run_action(action)
+    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+    assert_and_raise(
+        obs.exit_code == 0,
+        f'Failed to export SWE_INSTANCE_ID and configure git: {str(obs)}',
+    )
+
+    action = CmdRunAction(command="""export USER=$(whoami); echo USER=${USER} """)
+    action.set_hard_timeout(600)
+    logger.info(action, extra={'msg_type': 'ACTION'})
+    obs = runtime.run_action(action)
+    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+    assert_and_raise(obs.exit_code == 0, f'Failed to export USER: {str(obs)}')
+
+    # inject the init script
+    script_dir = os.path.dirname(__file__)
+
+    # inject the instance info
+    action = CmdRunAction(command='mkdir -p /swe_util/eval_data/instances')
+    action.set_hard_timeout(600)
+    logger.info(action, extra={'msg_type': 'ACTION'})
+    obs = runtime.run_action(action)
+    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+    assert_and_raise(
+        obs.exit_code == 0,
+        f'Failed to create /swe_util/eval_data/instances: {str(obs)}',
+    )
+
+    swe_instance_json_name = 'swe-perf-instance.json'
+    with tempfile.TemporaryDirectory() as temp_dir:
+        # Construct the full path for the desired file name within the temporary directory
+        temp_file_path = os.path.join(temp_dir, swe_instance_json_name)
+        # Write to the file with the desired name within the temporary directory
+        with open(temp_file_path, 'w') as f:
+            if not isinstance(instance, dict):
+                json.dump([instance.to_dict()], f)
+            else:
+                json.dump([instance], f)
+
+        # Copy the file to the desired location
+        runtime.copy_to(temp_file_path, '/swe_util/eval_data/instances/')
+
+        # inject the instance swe entry
+        entry_script_path = 'instance_swe_entry.sh'
+        runtime.copy_to(
+            str(os.path.join(script_dir, f'scripts/setup/{entry_script_path}')),
+            '/swe_util/',
+        )
+
+    action = CmdRunAction(command='cat ~/.bashrc')
+    action.set_hard_timeout(600)
+    logger.info(action, extra={'msg_type': 'ACTION'})
+    obs = runtime.run_action(action)
+    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+    assert_and_raise(obs.exit_code == 0, f'Failed to cat ~/.bashrc: {str(obs)}')
+
+    action = CmdRunAction(command='source ~/.bashrc')
+    action.set_hard_timeout(600)
+    logger.info(action, extra={'msg_type': 'ACTION'})
+    obs = runtime.run_action(action)
+    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+    if isinstance(obs, ErrorObservation):
+        logger.error(f'Failed to source ~/.bashrc: {str(obs)}')
+    assert_and_raise(obs.exit_code == 0, f'Failed to source ~/.bashrc: {str(obs)}')
+
+    action = CmdRunAction(command=f'source /swe_util/{entry_script_path}')
+    action.set_hard_timeout(600)
+    logger.info(action, extra={'msg_type': 'ACTION'})
+    obs = runtime.run_action(action)
+    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+    assert_and_raise(
+        obs.exit_code == 0,
+        f'Failed to source /swe_util/{entry_script_path}: {str(obs)}',
+    )
+
+    action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
+    action.set_hard_timeout(600)
+    logger.info(action, extra={'msg_type': 'ACTION'})
+    obs = runtime.run_action(action)
+    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+    assert_and_raise(
+        obs.exit_code == 0,
+        f'Failed to cd to /workspace/{workspace_dir_name}: {str(obs)}',
+    )
+
+    action = CmdRunAction(command='git reset --hard')
+    action.set_hard_timeout(600)
+    logger.info(action, extra={'msg_type': 'ACTION'})
+    obs = runtime.run_action(action)
+    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+    assert_and_raise(obs.exit_code == 0, f'Failed to git reset --hard: {str(obs)}')
+
+    action = CmdRunAction(
+        command='for remote_name in $(git remote); do git remote remove "${remote_name}"; done'
+    )
+    action.set_hard_timeout(600)
+    logger.info(action, extra={'msg_type': 'ACTION'})
+    obs = runtime.run_action(action)
+    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+    assert_and_raise(obs.exit_code == 0, f'Failed to remove git remotes: {str(obs)}')
+
+    if metadata.details['mode'] == 'swt-ci':
+        # set up repo
+        setup_commands = []
+        if instance['repo'] in MAP_REPO_TO_INSTALL:
+            setup_commands.append(MAP_REPO_TO_INSTALL[instance['repo']])
+
+        # Run pre-install set up if provided
+        install = MAP_VERSION_TO_INSTALL.get(instance['repo'], {}).get(
+            instance['version'], []
+        )
+        if 'pre_install' in install:
+            for pre_install in install['pre_install']:
+                setup_commands.append(pre_install)
+
+        if 'install' in install:
+            setup_commands.append(install['install'])
+
+        for command in setup_commands:
+            action = CmdRunAction(command=command)
+            action.set_hard_timeout(600)
+            logger.info(action, extra={'msg_type': 'ACTION'})
+            obs = runtime.run_action(action)
+            logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+
+    action = CmdRunAction(command='which python')
+    action.set_hard_timeout(600)
+    logger.info(action, extra={'msg_type': 'ACTION'})
+    obs = runtime.run_action(action)
+    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+    assert_and_raise(
+        obs.exit_code == 0 and 'testbed' in obs.content,
+        f'Expected to find python interpreter from testbed, but got: {str(obs)}',
+    )
+
+    logger.info('-' * 30)
+    logger.info('END Runtime Initialization Fn')
+    logger.info('-' * 30)
+
+
+def complete_runtime(
+    runtime: Runtime,
+    instance: pd.Series,  # this argument is not required, but it is used to get the workspace_dir_name
+) -> dict[str, Any]:
+    """Complete the runtime for the agent.
+
+    This function is called after the agent has finished running.
+    If you need to do something in the sandbox to get the correctness metric after
+    the agent has run, modify this function.
+    """
+    logger.info('-' * 30)
+    logger.info('BEGIN Runtime Completion Fn')
+    logger.info('-' * 30)
+    obs: CmdOutputObservation
+    workspace_dir_name = _get_sweperf_workspace_dir_name(instance)
+
+    action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
+    action.set_hard_timeout(600)
+    logger.info(action, extra={'msg_type': 'ACTION'})
+    obs = runtime.run_action(action)
+    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+
+    if obs.exit_code == -1:
+        # The previous command is still running
+        # We need to kill previous command
+        logger.info('The previous command is still running, trying to kill it...')
+        action = CmdRunAction(command='C-c')
+        obs = runtime.run_action(action)
+        logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+
+        # Then run the command again
+        action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
+        action.set_hard_timeout(600)
+        logger.info(action, extra={'msg_type': 'ACTION'})
+        obs = runtime.run_action(action)
+        logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+
+    if obs.exit_code == -1:
+        # The previous command is still running
+        # We need to kill previous command
+        logger.info('The previous command is still running, trying to ctrl+z it...')
+        action = CmdRunAction(command='C-z')
+        obs = runtime.run_action(action)
+        logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+
+        # Then run the command again
+        action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
+        action.set_hard_timeout(600)
+        logger.info(action, extra={'msg_type': 'ACTION'})
+        obs = runtime.run_action(action)
+        logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+
+    assert_and_raise(
+        isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
+        f'Failed to cd to /workspace/{workspace_dir_name}: {str(obs)}',
+    )
+
+    action = CmdRunAction(command='git config --global core.pager ""')
+    action.set_hard_timeout(600)
+    logger.info(action, extra={'msg_type': 'ACTION'})
+    obs = runtime.run_action(action)
+    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+    assert_and_raise(
+        isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
+        f'Failed to git config --global core.pager "": {str(obs)}',
+    )
+
+    # First check for any git repositories in subdirectories
+    action = CmdRunAction(command='find . -type d -name .git -not -path "./.git"')
+    action.set_hard_timeout(600)
+    logger.info(action, extra={'msg_type': 'ACTION'})
+    obs = runtime.run_action(action)
+    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+    assert_and_raise(
+        isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
+        f'Failed to find git repositories: {str(obs)}',
+    )
+
+    git_dirs = [p for p in obs.content.strip().split('\n') if p]
+    if git_dirs:
+        # Remove all .git directories in subdirectories
+        for git_dir in git_dirs:
+            action = CmdRunAction(command=f'rm -rf "{git_dir}"')
+            action.set_hard_timeout(600)
+            logger.info(action, extra={'msg_type': 'ACTION'})
+            obs = runtime.run_action(action)
+            logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+            assert_and_raise(
+                isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
+                f'Failed to remove git directory {git_dir}: {str(obs)}',
+            )
+
+    # add all files
+    action = CmdRunAction(command='git add -A')
+    action.set_hard_timeout(600)
+    logger.info(action, extra={'msg_type': 'ACTION'})
+    obs = runtime.run_action(action)
+    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+    assert_and_raise(
+        isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
+        f'Failed to git add -A: {str(obs)}',
+    )
+
+    # Remove binary files from git staging
+    action = CmdRunAction(command=remove_binary_files_from_git())
+    action.set_hard_timeout(600)
+    logger.info(action, extra={'msg_type': 'ACTION'})
+    obs = runtime.run_action(action)
+    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+    assert_and_raise(
+        isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
+        f'Failed to remove binary files: {str(obs)}',
+    )
+
+    n_retries = 0
+    git_patch = None
+    while n_retries < 5:
+        action = CmdRunAction(
+            command=f'git diff --no-color --cached {instance["base_commit"]} > patch.diff'
+        )
+        action.set_hard_timeout(max(300 + 100 * n_retries, 600))
+        logger.info(action, extra={'msg_type': 'ACTION'})
+        obs = runtime.run_action(action)
+        logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+        n_retries += 1
+        if isinstance(obs, CmdOutputObservation):
+            if obs.exit_code == 0:
+                # Read the patch file
+                action = FileReadAction(path='patch.diff')
+                action.set_hard_timeout(max(300 + 100 * n_retries, 600))
+                logger.info(action, extra={'msg_type': 'ACTION'})
+                obs = runtime.run_action(action)
+                logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+                if isinstance(obs, FileReadObservation):
+                    git_patch = obs.content
+                    break
+                elif isinstance(obs, ErrorObservation):
+                    # Fall back to cat "patch.diff" to get the patch
+                    assert 'File could not be decoded as utf-8' in obs.content
+                    action = CmdRunAction(command='cat patch.diff')
+                    action.set_hard_timeout(max(300 + 100 * n_retries, 600))
+                    logger.info(action, extra={'msg_type': 'ACTION'})
+                    obs = runtime.run_action(action)
+                    assert isinstance(obs, CmdOutputObservation) and obs.exit_code == 0
+                    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+                    git_patch = obs.content
+                    break
+                else:
+                    assert_and_raise(False, f'Unexpected observation type: {str(obs)}')
+            else:
+                logger.info('Failed to get git diff, retrying...')
+                sleep_if_should_continue(10)
+        elif isinstance(obs, ErrorObservation):
+            logger.error(f'Error occurred: {obs.content}. Retrying...')
+            sleep_if_should_continue(10)
+        else:
+            assert_and_raise(False, f'Unexpected observation type: {str(obs)}')
+
+    assert_and_raise(git_patch is not None, 'Failed to get git diff (None)')
+
+    # Remove binary diffs from the patch
+    git_patch = remove_binary_diffs(git_patch)
+
+    logger.info('-' * 30)
+    logger.info('END Runtime Completion Fn')
+    logger.info('-' * 30)
+    return {'git_patch': git_patch}
+
+
+def process_instance(
+    instance: pd.Series,
+    metadata: EvalMetadata,
+    reset_logger: bool = True,
+    runtime_failure_count: int = 0,
+) -> EvalOutput:
+    config = get_config(instance, metadata)
+
+    # Setup the logger properly, so you can run multi-processing to parallelize the evaluation
+    if reset_logger:
+        log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
+        reset_logger_for_multiprocessing(logger, instance.instance_id, log_dir)
+    else:
+        logger.info(f'Starting evaluation for instance {instance.instance_id}.')
+
+    # Increase resource_factor with increasing attempt_id
+    if runtime_failure_count > 0:
+        config.sandbox.remote_runtime_resource_factor = min(
+            config.sandbox.remote_runtime_resource_factor * (2**runtime_failure_count),
+            8,
+        )
+        logger.warning(
+            f'This is attempt #{runtime_failure_count + 1} for instance {instance.instance_id}, setting resource factor to {config.sandbox.remote_runtime_resource_factor}'
+        )
+
+    metadata = copy.deepcopy(metadata)
+    metadata.details['runtime_failure_count'] = runtime_failure_count
+    metadata.details['remote_runtime_resource_factor'] = (
+        config.sandbox.remote_runtime_resource_factor
+    )
+
+    runtime = create_runtime(config)
+    call_async_from_sync(runtime.connect)
+
+    try:
+        initialize_runtime(runtime, instance, metadata)
+
+        message_action = get_instruction(instance, metadata)
+
+        # Here's how you can run the agent (similar to the `main` function) and get the final task state
+        state: State | None = asyncio.run(
+            run_controller(
+                config=config,
+                initial_user_action=message_action,
+                runtime=runtime,
+                fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[
+                    metadata.agent_class
+                ],
+            )
+        )
+
+        # if fatal error, raise EvalException to trigger re-run
+        if state is not None and is_fatal_evaluation_error(state.last_error):
+            raise EvalException('Fatal error detected: ' + state.last_error)
+
+        # Get git patch
+        complete_runtime_fn = complete_runtime
+        return_val = complete_runtime_fn(runtime, instance)
+        git_patch = return_val['git_patch']
+        logger.info(
+            f'Got git diff for instance {instance.instance_id}:\n--------\n{git_patch}\n--------'
+        )
+    finally:
+        runtime.close()
+    # ==========================================
+
+    # ======= Attempt to evaluate the agent's edits =======
+    # we use eval_infer.sh to evaluate the agent's edits, not here
+    # because the agent may alter the environment / testcases
+    test_result = {
+        'git_patch': git_patch,
+    }
+
+    # If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
+    # You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
+    if state is None:
+        raise ValueError('State should not be None.')
+
+    # NOTE: this is NO LONGER the event stream, but an agent history that includes delegate agent's events
+    histories = [event_to_dict(event) for event in state.history]
+    metrics = get_metrics(state)
+
+    # Save the output
+    instruction = message_action.content
+    if message_action.image_urls:
+        instruction += (
+            '\n\n<image_urls>' + '\n'.join(message_action.image_urls) + '</image_urls>'
+        )
+    output = EvalOutput(
+        instance_id=instance.instance_id,
+        instruction=instruction,
+        instance=instance.to_dict(),  # SWE Bench specific
+        test_result=test_result,
+        metadata=metadata,
+        history=histories,
+        metrics=metrics,
+        error=state.last_error if state and state.last_error else None,
+    )
+    return output
+
+
+def filter_dataset(dataset: pd.DataFrame, filter_column: str) -> pd.DataFrame:
+    file_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'config.toml')
+    if os.path.exists(file_path):
+        with open(file_path, 'r') as file:
+            data = toml.load(file)
+            if 'selected_ids' in data:
+                selected_ids = data['selected_ids']
+                logger.info(
+                    f'Filtering {len(selected_ids)} tasks from "selected_ids"...'
+                )
+                subset = dataset[dataset[filter_column].isin(selected_ids)]
+                logger.info(f'Retained {subset.shape[0]} tasks after filtering')
+                return subset
+            if 'selected_repos' in data:
+                selected_repos = data['selected_repos']
+                if isinstance(selected_repos, str):
+                    selected_repos = [selected_repos]
+                assert isinstance(selected_repos, list)
+                logger.info(
+                    f'Filtering {selected_repos} tasks from "selected_repos"...'
+                )
+                subset = dataset[dataset['repo'].isin(selected_repos)]
+                logger.info(f'Retained {subset.shape[0]} tasks after filtering')
+                return subset
+
+    skip_ids = [i for i in os.environ.get('SKIP_IDS', '').split(',') if i]
+    if skip_ids:
+        logger.info(f'Filtering {len(skip_ids)} tasks from "SKIP_IDS"...')
+        return dataset[~dataset[filter_column].isin(skip_ids)]
+    return dataset
+
+
+if __name__ == '__main__':
+    parser = get_evaluation_parser()
+    parser.add_argument(
+        '--dataset',
+        type=str,
+        default='SWE-Perf/SWE-Perf',
+        help='Hugging Face dataset name to evaluate on',
+    )
+    parser.add_argument(
+        '--split',
+        type=str,
+        default='test',
+        help='split to evaluate on',
+    )
+    parser.add_argument(
+        '--mode',
+        type=str,
+        default='swe',
+        choices=['swe', 'swt', 'swt-ci'],
+        help="mode to run the evaluation, either 'swe', 'swt', or 'swt-ci'",
+    )
+
+    args, _ = parser.parse_known_args()
+
+    # NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
+    # so we don't need to manage file uploading to OpenHands's repo
+    dataset = load_dataset(args.dataset, split=args.split)
+
+    swe_perf_tests = filter_dataset(dataset.to_pandas(), 'instance_id')
+    logger.info(
+        f'Loaded dataset {args.dataset} with split {args.split}: {len(swe_perf_tests)} tasks'
+    )
+
+    llm_config = None
+    if args.llm_config:
+        llm_config = get_llm_config_arg(args.llm_config)
+        llm_config.log_completions = True
+        # modify_params must be False for evaluation, for reproducibility and accuracy of results
+        llm_config.modify_params = False
+
+    if llm_config is None:
+        raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
+
+    # Get condenser config from environment variable
+    condenser_name = os.environ.get('EVAL_CONDENSER')
+    if condenser_name:
+        condenser_config = get_condenser_config_arg(condenser_name)
+        if condenser_config is None:
+            raise ValueError(
+                f'Could not find Condenser config: EVAL_CONDENSER={condenser_name}'
+            )
+    else:
+        # If no specific condenser config is provided via env var, default to NoOpCondenser
+        condenser_config = NoOpCondenserConfig()
+        logger.debug(
+            'No Condenser config provided via EVAL_CONDENSER, using NoOpCondenser.'
+        )
+
+    details = {'mode': args.mode}
+    _agent_cls = openhands.agenthub.Agent.get_cls(args.agent_cls)
+
+    dataset_description = (
+        args.dataset.replace('/', '__') + '-' + args.split.replace('/', '__')
+    )
+    metadata = make_metadata(
+        llm_config,
+        dataset_description,
+        args.agent_cls,
+        args.max_iterations,
+        args.eval_note,
+        args.eval_output_dir,
+        details=details,
+        condenser_config=condenser_config,
+    )
+
+    output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
+    print(f'### OUTPUT FILE: {output_file} ###')
+
+    # Run evaluation in iterative mode:
+    # If a rollout fails to output AgentFinishAction, we retry until it succeeds OR ITERATIVE_EVAL_MODE_MAX_ATTEMPTS (default 3) attempts have been made.
+    ITERATIVE_EVAL_MODE = (
+        os.environ.get('ITERATIVE_EVAL_MODE', 'false').lower() == 'true'
+    )
+    ITERATIVE_EVAL_MODE_MAX_ATTEMPTS = int(
+        os.environ.get('ITERATIVE_EVAL_MODE_MAX_ATTEMPTS', '3')
+    )
+
+    if not ITERATIVE_EVAL_MODE:
+        # load the dataset
+        instances = prepare_dataset(swe_perf_tests, output_file, args.eval_n_limit)
+
+        run_evaluation(
+            instances,
+            metadata,
+            output_file,
+            args.eval_num_workers,
+            process_instance,
+            timeout_seconds=8
+            * 60
+            * 60,  # 8 hour PER instance should be more than enough
+            max_retries=5,
+        )
+    else:
+        critic = AgentFinishedCritic()
+
+        def get_cur_output_file_path(attempt: int) -> str:
+            return (
+                f'{output_file.removesuffix(".jsonl")}.critic_attempt_{attempt}.jsonl'
+            )
+
+        eval_ids = None
+        for attempt in range(1, ITERATIVE_EVAL_MODE_MAX_ATTEMPTS + 1):
+            cur_output_file = get_cur_output_file_path(attempt)
+            logger.info(
+                f'Running evaluation with critic {critic.__class__.__name__} for attempt {attempt} of {ITERATIVE_EVAL_MODE_MAX_ATTEMPTS}.'
+            )
+
+            # If temperature is 0 (deterministic), raise it to 0.1 on retry attempts (attempt > 1)
+            # so that a retry can produce a slightly different result
+            if attempt > 1 and metadata.llm_config.temperature == 0:
+                logger.info(
+                    f'Detected temperature is 0 for (>1) attempt {attempt}. Setting temperature to 0.1...'
+                )
+                metadata.llm_config.temperature = 0.1
+
+            # Load instances - at first attempt, we evaluate all instances
+            # On subsequent attempts, we only evaluate the instances that failed the previous attempt determined by critic
+            instances = prepare_dataset(
+                swe_perf_tests, cur_output_file, args.eval_n_limit, eval_ids=eval_ids
+            )
+
+            # Run evaluation - but save them to cur_output_file
+            logger.info(
+                f'Evaluating {len(instances)} instances for attempt {attempt}...'
+            )
+            run_evaluation(
+                instances,
+                metadata,
+                cur_output_file,
+                args.eval_num_workers,
+                process_instance,
+                timeout_seconds=8
+                * 60
+                * 60,  # 8 hour PER instance should be more than enough
+                max_retries=5,
+            )
+
+            # When eval is done, we update eval_ids to the instances that failed the current attempt
+            instances_failed = []
+            logger.info(
+                f'Use critic {critic.__class__.__name__} to check {len(instances)} instances for attempt {attempt}...'
+            )
+            with open(cur_output_file, 'r') as f:
+                for line in f:
+                    instance = json.loads(line)
+                    try:
+                        history = [
+                            event_from_dict(event) for event in instance['history']
+                        ]
+                        critic_result = critic.evaluate(
+                            history, instance['test_result'].get('git_patch', '')
+                        )
+                        if not critic_result.success:
+                            instances_failed.append(instance['instance_id'])
+                    except Exception as e:
+                        logger.error(
+                            f'Error loading history for instance {instance["instance_id"]}: {e}'
+                        )
+                        instances_failed.append(instance['instance_id'])
+            logger.info(
+                f'{len(instances_failed)} instances failed the current attempt {attempt}: {instances_failed}'
+            )
+            eval_ids = instances_failed
+
+            # If no instances failed, we break
+            if len(instances_failed) == 0:
+                break
+
+        # Then we should aggregate the results from all attempts into the original output file
+        # and remove the intermediate files
+        logger.info(
+            'Aggregating results from all attempts into the original output file...'
+        )
+        fout = open(output_file, 'w')
+        added_instance_ids = set()
+        for attempt in reversed(range(1, ITERATIVE_EVAL_MODE_MAX_ATTEMPTS + 1)):
+            cur_output_file = get_cur_output_file_path(attempt)
+            if not os.path.exists(cur_output_file):
+                logger.warning(
+                    f'Intermediate output file {cur_output_file} does not exist. Skipping...'
+                )
+                continue
+
+            with open(cur_output_file, 'r') as f:
+                for line in f:
+                    instance = json.loads(line)
+                    # Also make sure git_patch is not empty - otherwise we fall back to previous attempt (empty patch is worse than anything else)
+                    if (
+                        instance['instance_id'] not in added_instance_ids
+                        and instance['test_result'].get('git_patch', '').strip()
+                    ):
+                        fout.write(line)
+                        added_instance_ids.add(instance['instance_id'])
+            logger.info(
+                f'Aggregated instances from {cur_output_file}. Total instances added so far: {len(added_instance_ids)}'
+            )
+        fout.close()
+        logger.info(
+            f'Done! Total {len(added_instance_ids)} instances added to {output_file}'
+        )
+        # Check if any instances reached maximum retries
+        check_maximum_retries_exceeded(metadata.eval_output_dir)
--- a/evaluation/benchmarks/swe_perf/scripts/run_infer.sh
+++ b/evaluation/benchmarks/swe_perf/scripts/run_infer.sh
@@ -0,0 +1,146 @@
+#!/usr/bin/env bash
+set -eo pipefail
+
+source "evaluation/utils/version_control.sh"
+
+MODEL_CONFIG=$1
+COMMIT_HASH=$2
+AGENT=$3
+EVAL_LIMIT=$4
+MAX_ITER=$5
+NUM_WORKERS=$6
+DATASET=$7
+SPLIT=$8
+N_RUNS=$9
+MODE=${10}
+
+
+if [ -z "$NUM_WORKERS" ]; then
+  NUM_WORKERS=1
+  echo "Number of workers not specified, use default $NUM_WORKERS"
+fi
+checkout_eval_branch
+
+if [ -z "$AGENT" ]; then
+  echo "Agent not specified, use default CodeActAgent"
+  AGENT="CodeActAgent"
+fi
+
+if [ -z "$MAX_ITER" ]; then
+  echo "MAX_ITER not specified, use default 100"
+  MAX_ITER=100
+fi
+
+if [ -z "$RUN_WITH_BROWSING" ]; then
+  echo "RUN_WITH_BROWSING not specified, use default false"
+  RUN_WITH_BROWSING=false
+fi
+
+
+if [ -z "$DATASET" ]; then
+  echo "DATASET not specified, use default SWE-Perf/SWE-Perf"
+  DATASET="SWE-Perf/SWE-Perf"
+fi
+
+if [ -z "$SPLIT" ]; then
+  echo "SPLIT not specified, use default test"
+  SPLIT="test"
+fi
+
+if [ -z "$MODE" ]; then
+  MODE="swe"
+  echo "MODE not specified, use default $MODE"
+fi
+
+if [ -n "$EVAL_CONDENSER" ]; then
+  echo "Using Condenser Config: $EVAL_CONDENSER"
+else
+  echo "No Condenser Config provided via EVAL_CONDENSER, use default (NoOpCondenser)."
+fi
+
+export RUN_WITH_BROWSING=$RUN_WITH_BROWSING
+echo "RUN_WITH_BROWSING: $RUN_WITH_BROWSING"
+
+get_openhands_version
+
+echo "AGENT: $AGENT"
+echo "OPENHANDS_VERSION: $OPENHANDS_VERSION"
+echo "MODEL_CONFIG: $MODEL_CONFIG"
+echo "DATASET: $DATASET"
+echo "SPLIT: $SPLIT"
+echo "MAX_ITER: $MAX_ITER"
+echo "NUM_WORKERS: $NUM_WORKERS"
+echo "COMMIT_HASH: $COMMIT_HASH"
+echo "MODE: $MODE"
+echo "EVAL_CONDENSER: $EVAL_CONDENSER"
+
+# Default to NOT use Hint
+if [ -z "$USE_HINT_TEXT" ]; then
+  export USE_HINT_TEXT=false
+fi
+echo "USE_HINT_TEXT: $USE_HINT_TEXT"
+EVAL_NOTE="$OPENHANDS_VERSION"
+# if not using Hint, add -no-hint to the eval note
+if [ "$USE_HINT_TEXT" = false ]; then
+  EVAL_NOTE="$EVAL_NOTE-no-hint"
+fi
+
+if [ "$RUN_WITH_BROWSING" = true ]; then
+  EVAL_NOTE="$EVAL_NOTE-with-browsing"
+fi
+
+if [ -n "$EXP_NAME" ]; then
+  EVAL_NOTE="$EVAL_NOTE-$EXP_NAME"
+fi
+# if mode != swe, add mode to the eval note
+if [ "$MODE" != "swe" ]; then
+  EVAL_NOTE="${EVAL_NOTE}-${MODE}"
+fi
+# Add condenser config to eval note if provided
+if [ -n "$EVAL_CONDENSER" ]; then
+  EVAL_NOTE="${EVAL_NOTE}-${EVAL_CONDENSER}"
+fi
+
+function run_eval() {
+  local eval_note="${1}"
+  COMMAND="poetry run python evaluation/benchmarks/swe_perf/run_infer.py \
+    --agent-cls $AGENT \
+    --llm-config $MODEL_CONFIG \
+    --max-iterations $MAX_ITER \
+    --eval-num-workers $NUM_WORKERS \
+    --eval-note $eval_note \
+    --dataset $DATASET \
+    --split $SPLIT \
+    --mode $MODE"
+
+
+
+  if [ -n "$EVAL_LIMIT" ]; then
+    echo "EVAL_LIMIT: $EVAL_LIMIT"
+    COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
+  fi
+
+  # Run the command
+  eval $COMMAND
+}
+
+unset SANDBOX_ENV_GITHUB_TOKEN # prevent the agent from using the github token to push
+if [ -z "$N_RUNS" ]; then
+  N_RUNS=1
+  echo "N_RUNS not specified, use default $N_RUNS"
+fi
+
+# Skip runs if the run number is in the SKIP_RUNS list
+# read from env variable SKIP_RUNS as a comma separated list of run numbers
+SKIP_RUNS=(${SKIP_RUNS//,/ })
+for i in $(seq 1 $N_RUNS); do
+  if [[ " ${SKIP_RUNS[*]} " =~ " $i " ]]; then
+    echo "Skipping run $i"
+    continue
+  fi
+  current_eval_note="$EVAL_NOTE-run_$i"
+  echo "EVAL_NOTE: $current_eval_note"
+  run_eval "$current_eval_note"
+done
+
+checkout_original_branch
--- a/evaluation/benchmarks/swe_perf/scripts/setup/compare_patch_filename.py
+++ b/evaluation/benchmarks/swe_perf/scripts/setup/compare_patch_filename.py
@@ -0,0 +1,54 @@
+"""This script compares gold patches with OpenHands-generated patches and checks whether
+OpenHands found the right (set of) files to modify.
+"""
+
+import argparse
+import json
+import re
+
+
+def extract_modified_files(patch):
+    modified_files = set()
+    file_pattern = re.compile(r'^diff --git a/(.*?) b/')
+
+    for line in patch.split('\n'):
+        match = file_pattern.match(line)
+        if match:
+            modified_files.add(match.group(1))
+
+    return modified_files
+
+
+def process_report(oh_output_file):
+    succ = 0
+    fail = 0
+    for line in open(oh_output_file):
+        line = json.loads(line)
+        instance_id = line['instance_id']
+        gold_patch = line['instance']['patch']
+        generated_patch = line['test_result']['git_patch']
+        gold_modified_files = extract_modified_files(gold_patch)
+        # swe-bench lite only: a gold patch always contains exactly one file
+        assert len(gold_modified_files) == 1
+        generated_modified_files = extract_modified_files(generated_patch)
+
+        # Check if all files in gold_patch are also in generated_patch
+        all_files_in_generated = gold_modified_files.issubset(generated_modified_files)
+        if all_files_in_generated:
+            succ += 1
+        else:
+            fail += 1
+            print(
+                f'{instance_id}: file mismatch, gold = {gold_modified_files}, generated = {generated_modified_files}'
+            )
+    print(
+        f'\nSUMMARY: {succ} out of {succ + fail} instances found correct files to edit, success rate = {succ / max(succ + fail, 1)}'
+    )
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--oh_output_file', required=True, help='Path to the OH output file')
+    args = parser.parse_args()
+
+    process_report(args.oh_output_file)
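
Note on the file-matching check above: an instance counts as a success when every file touched by the gold patch also appears in the generated patch, so extra files in the generated patch do not cause a mismatch. A minimal sketch of that check follows. It reuses the regex from the script, and the two patch strings are made-up inputs, not data from any benchmark.

```python
import re


def extract_modified_files(patch: str) -> set[str]:
    """Collect the 'a/' side path from every 'diff --git' header in a patch."""
    file_pattern = re.compile(r'^diff --git a/(.*?) b/')
    modified = set()
    for line in patch.split('\n'):
        match = file_pattern.match(line)
        if match:
            modified.add(match.group(1))
    return modified


# Hypothetical patches, for illustration only.
gold_patch = (
    "diff --git a/pkg/core.py b/pkg/core.py\n"
    "--- a/pkg/core.py\n"
    "+++ b/pkg/core.py\n"
)
generated_patch = (
    "diff --git a/pkg/core.py b/pkg/core.py\n"
    "diff --git a/tests/test_core.py b/tests/test_core.py\n"
)

gold_files = extract_modified_files(gold_patch)
generated_files = extract_modified_files(generated_patch)

# Success means the gold files are a subset of the generated files.
found_all = gold_files.issubset(generated_files)
```

Only `diff --git` header lines are matched, so the `---` and `+++` lines in a patch do not add entries.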
--- a/evaluation/benchmarks/swe_perf/scripts/setup/instance_swe_entry.sh
+++ b/evaluation/benchmarks/swe_perf/scripts/setup/instance_swe_entry.sh
@@ -0,0 +1,43 @@
+#!/usr/bin/env bash
+
+source ~/.bashrc
+SWEUTIL_DIR=/swe_util
+
+# FIXME: Cannot read SWE_INSTANCE_ID from the environment variable
+# SWE_INSTANCE_ID=django__django-11099
+if [ -z "$SWE_INSTANCE_ID" ]; then
+    echo "Error: SWE_INSTANCE_ID is not set." >&2
+    exit 1
+fi
+
+# Read swe-bench-instance.json and extract the item whose instance_id matches SWE_INSTANCE_ID
+item=$(jq --arg INSTANCE_ID "$SWE_INSTANCE_ID" '.[] | select(.instance_id == $INSTANCE_ID)' $SWEUTIL_DIR/eval_data/instances/swe-bench-instance.json)
+
+if [[ -z "$item" ]]; then
+  echo "No item found for the provided instance ID."
+  exit 1
+fi
+
+
+WORKSPACE_NAME=$(echo "$item" | jq -r '(.repo | tostring) + "__" + (.version | tostring) | gsub("/"; "__")')
+
+echo "WORKSPACE_NAME: $WORKSPACE_NAME"
+
+# Clear the workspace
+if [ -d /workspace ]; then
+    rm -rf /workspace/*
+else
+    mkdir /workspace
+fi
+# Copy repo to workspace
+if [ -d /workspace/$WORKSPACE_NAME ]; then
+    rm -rf /workspace/$WORKSPACE_NAME
+fi
+mkdir -p /workspace
+cp -r /testbed /workspace/$WORKSPACE_NAME
+
+# Activate instance-specific environment
+if [ -d /opt/miniconda3 ]; then
+    . /opt/miniconda3/etc/profile.d/conda.sh
+    conda activate testbed
+fi
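
Note on WORKSPACE_NAME above: the jq expression joins the repo and version with a double underscore and then replaces every slash in the result, because `|` binds more loosely than `+`. A rough Python equivalent is sketched below. The function name is hypothetical, and the repo and version values are assumed for illustration rather than read from the instance file.

```python
def workspace_name(repo: str, version: str) -> str:
    """Approximate the jq expression used for WORKSPACE_NAME.

    jq: (.repo | tostring) + "__" + (.version | tostring) | gsub("/"; "__")
    """
    # Concatenate first, then replace slashes across the whole string.
    return f"{repo}__{version}".replace("/", "__")


# Assumed example values, not taken from swe-bench-instance.json.
name = workspace_name("django/django", "3.0")
```

The repository is then copied from /testbed to /workspace/ followed by this name.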
--- a/force_build.txt
+++ b/force_build.txt
@@ -0,0 +1 @@
+test
--- a/frontend/tests/components/features/conversation-panel/conversation-card.test.tsx
+++ b/frontend/tests/components/features/conversation-panel/conversation-card.test.tsx
@@ -357,69 +357,6 @@ describe("ConversationCard", () => {
    expect(onClick).not.toHaveBeenCalled();
  });

-  it("should show display cost button only when showOptions is true", async () => {
-    const onContextMenuToggle = vi.fn();
-    const { rerender } = renderWithProviders(
-      <ConversationCard
-        onDelete={onDelete}
-        onChangeTitle={onChangeTitle}
-        title="Conversation 1"
-        selectedRepository={null}
-        lastUpdatedAt="2021-10-01T12:00:00Z"
-        contextMenuOpen
-        onContextMenuToggle={onContextMenuToggle}
-      />,
-    );
-
-    // Wait for context menu to appear
-    const menu = await screen.findByTestId("context-menu");
-    expect(
-      within(menu).queryByTestId("display-cost-button"),
-    ).not.toBeInTheDocument();
-
-    rerender(
-      <ConversationCard
-        onDelete={onDelete}
-        onChangeTitle={onChangeTitle}
-        showOptions
-        title="Conversation 1"
-        selectedRepository={null}
-        lastUpdatedAt="2021-10-01T12:00:00Z"
-        contextMenuOpen
-        onContextMenuToggle={onContextMenuToggle}
-      />,
-    );
-
-    // Wait for context menu to appear and check for display cost button
-    const newMenu = await screen.findByTestId("context-menu");
-    within(newMenu).getByTestId("display-cost-button");
-  });
-
-  it("should show metrics modal when clicking the display cost button", async () => {
-    const user = userEvent.setup();
-    const onContextMenuToggle = vi.fn();
-    renderWithProviders(
-      <ConversationCard
-        onDelete={onDelete}
-        onChangeTitle={onChangeTitle}
-        title="Conversation 1"
-        selectedRepository={null}
-        lastUpdatedAt="2021-10-01T12:00:00Z"
-        showOptions
-        contextMenuOpen
-        onContextMenuToggle={onContextMenuToggle}
-      />,
-    );
-
-    const menu = screen.getByTestId("context-menu");
-    const displayCostButton = within(menu).getByTestId("display-cost-button");
-
-    await user.click(displayCostButton);
-
-    // Verify if metrics modal is displayed by checking for the modal content
-    expect(screen.getByTestId("metrics-modal")).toBeInTheDocument();
-  });
-
  it("should not display the edit or delete options if the handler is not provided", async () => {
    const onContextMenuToggle = vi.fn();
    const { rerender } = renderWithProviders(
--- a/frontend/tests/components/features/conversation-panel/conversation-panel.test.tsx
+++ b/frontend/tests/components/features/conversation-panel/conversation-panel.test.tsx
@@ -1,10 +1,9 @@
 import { screen, waitFor, within } from "@testing-library/react";
 import { beforeAll, beforeEach, describe, expect, it, vi } from "vitest";
-import { QueryClientConfig } from "@tanstack/react-query";
 import userEvent from "@testing-library/user-event";
 import { createRoutesStub } from "react-router";
 import React from "react";
-import { renderWithProviders } from "test-utils";
+import { renderWithQueryAndI18n } from "test-utils";
 import { ConversationPanel } from "#/components/features/conversation-panel/conversation-panel";
 import ConversationService from "#/api/conversation-service/conversation-service.api";
 import { Conversation } from "#/api/open-hands.types";
@@ -18,16 +17,7 @@ describe("ConversationPanel", () => {
    },
  ]);

-  const renderConversationPanel = (config?: QueryClientConfig) =>
-    renderWithProviders(<RouterStub />, {
-      preloadedState: {
-        metrics: {
-          cost: null,
-          max_budget_per_task: null,
-          usage: null,
-        },
-      },
-    });
+  const renderConversationPanel = () => renderWithQueryAndI18n(<RouterStub />);

  beforeAll(() => {
    vi.mock("react-router", async (importOriginal) => ({
@@ -297,15 +287,7 @@ describe("ConversationPanel", () => {
      },
    ]);

-    renderWithProviders(<MyRouterStub />, {
-      preloadedState: {
-        metrics: {
-          cost: null,
-          max_budget_per_task: null,
-          usage: null,
-        },
-      },
-    });
+    renderWithQueryAndI18n(<MyRouterStub />);

    const toggleButton = screen.getByText("Toggle");

--- a/frontend/tests/components/features/microagent-management/microagent-management.test.tsx
+++ b/frontend/tests/components/features/microagent-management/microagent-management.test.tsx
@@ -57,11 +57,6 @@ describe("MicroagentManagement", () => {
  const renderMicroagentManagement = (config?: QueryClientConfig) =>
    renderWithProviders(<RouterStub />, {
      preloadedState: {
-        metrics: {
-          cost: null,
-          max_budget_per_task: null,
-          usage: null,
-        },
        microagentManagement: {
          addMicroagentModalVisible: false,
          updateMicroagentModalVisible: false,
@@ -1351,11 +1346,6 @@ describe("MicroagentManagement", () => {
      // Render with modal already visible in Redux state
      renderWithProviders(<RouterStub />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            selectedMicroagentItem: null,
            addMicroagentModalVisible: true, // Start with modal visible
@@ -1646,11 +1636,6 @@ describe("MicroagentManagement", () => {
    const renderMicroagentManagementMain = (selectedMicroagentItem: any) =>
      renderWithProviders(<MicroagentManagementMain />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            addMicroagentModalVisible: false,
            selectedRepository: {
@@ -1998,11 +1983,6 @@ describe("MicroagentManagement", () => {
      // Render with update modal visible in Redux state
      renderWithProviders(<RouterStub />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            selectedMicroagentItem: {
              microagent: mockMicroagentForUpdate,
@@ -2037,11 +2017,6 @@ describe("MicroagentManagement", () => {
      // Render with update modal visible and selected microagent
      renderWithProviders(<RouterStub />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            selectedMicroagentItem: {
              microagent: mockMicroagentForUpdate,
@@ -2075,11 +2050,6 @@ describe("MicroagentManagement", () => {
      // Render with update modal visible and selected microagent
      renderWithProviders(<RouterStub />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            selectedMicroagentItem: {
              microagent: mockMicroagentForUpdate,
@@ -2118,11 +2088,6 @@ describe("MicroagentManagement", () => {
      // Render with update modal visible and selected microagent
      renderWithProviders(<RouterStub />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            selectedMicroagentItem: {
              microagent: mockMicroagentForUpdate,
@@ -2174,11 +2139,6 @@ describe("MicroagentManagement", () => {
      // Render with update modal visible
      renderWithProviders(<RouterStub />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            selectedMicroagentItem: {
              microagent: mockMicroagentForUpdate,
@@ -2225,11 +2185,6 @@ describe("MicroagentManagement", () => {
      // Render with update modal visible
      renderWithProviders(<RouterStub />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            selectedMicroagentItem: {
              microagent: mockMicroagentForUpdate,
@@ -2279,11 +2234,6 @@ describe("MicroagentManagement", () => {
      // Render with update modal visible but no microagent data
      renderWithProviders(<RouterStub />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            selectedMicroagentItem: null,
            addMicroagentModalVisible: false,
@@ -2325,11 +2275,6 @@ describe("MicroagentManagement", () => {
      // Render with update modal visible and microagent
      renderWithProviders(<RouterStub />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            selectedMicroagentItem: {
              microagent: mockMicroagentForUpdate,
@@ -2374,11 +2319,6 @@ describe("MicroagentManagement", () => {
      // Render with update modal visible and microagent
      renderWithProviders(<RouterStub />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            selectedMicroagentItem: {
              microagent: mockMicroagentForUpdate,
@@ -2561,11 +2501,6 @@ describe("MicroagentManagement", () => {
      // Render with selected microagent
      renderWithProviders(<RouterStub />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            selectedMicroagentItem: {
              microagent: mockMicroagentForLearn,
@@ -2601,11 +2536,6 @@ describe("MicroagentManagement", () => {
      // Render with selected microagent
      renderWithProviders(<RouterStub />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            selectedMicroagentItem: {
              microagent: mockMicroagentForLearn,
@@ -2658,11 +2588,6 @@ describe("MicroagentManagement", () => {
      // Render with selected microagent
      renderWithProviders(<RouterStub />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            selectedMicroagentItem: {
              microagent: mockMicroagentForLearn,
@@ -2718,11 +2643,6 @@ describe("MicroagentManagement", () => {
      // Render with selected microagent
      renderWithProviders(<RouterStub />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            selectedMicroagentItem: {
              microagent: mockMicroagentForLearn,
@@ -2776,11 +2696,6 @@ describe("MicroagentManagement", () => {
      // Render with selected microagent
      renderWithProviders(<RouterStub />, {
        preloadedState: {
-          metrics: {
-            cost: null,
-            max_budget_per_task: null,
-            usage: null,
-          },
          microagentManagement: {
            selectedMicroagentItem: {
              microagent: mockMicroagentForLearn,
--- a/frontend/tests/components/terminal/terminal.test.tsx
+++ b/frontend/tests/components/terminal/terminal.test.tsx
@@ -1,17 +1,14 @@
 import { act, screen } from "@testing-library/react";
 import { renderWithProviders } from "test-utils";
 import { vi, describe, afterEach, it, expect } from "vitest";
-import { Command, appendInput, appendOutput } from "#/state/command-slice";
+import { Command, useCommandStore } from "#/state/command-store";
 import Terminal from "#/components/features/terminal/terminal";

-const renderTerminal = (commands: Command[] = []) =>
-  renderWithProviders(<Terminal />, {
-    preloadedState: {
-      cmd: {
-        commands,
-      },
-    },
-  });
+const renderTerminal = (commands: Command[] = []) => {
+  // Set initial commands in Zustand store
+  useCommandStore.setState({ commands });
+  return renderWithProviders(<Terminal />);
+};

 describe.skip("Terminal", () => {
  global.ResizeObserver = vi.fn().mockImplementation(() => ({
@@ -58,25 +55,25 @@ describe.skip("Terminal", () => {
  });

  it("should write commands to the terminal", () => {
-    const { store } = renderTerminal();
+    renderTerminal();

    act(() => {
-      store.dispatch(appendInput("echo Hello"));
-      store.dispatch(appendOutput("Hello"));
+      useCommandStore.getState().appendInput("echo Hello");
+      useCommandStore.getState().appendOutput("Hello");
    });

    expect(mockTerminal.writeln).toHaveBeenNthCalledWith(1, "echo Hello");
    expect(mockTerminal.writeln).toHaveBeenNthCalledWith(2, "Hello");

    act(() => {
-      store.dispatch(appendInput("echo World"));
+      useCommandStore.getState().appendInput("echo World");
    });

    expect(mockTerminal.writeln).toHaveBeenNthCalledWith(3, "echo World");
  });

  it("should load and write commands to the terminal", () => {
-    const { store } = renderTerminal([
+    renderTerminal([
      { type: "input", content: "echo Hello" },
      { type: "output", content: "Hello" },
    ]);
@@ -85,17 +82,17 @@ describe.skip("Terminal", () => {
    expect(mockTerminal.writeln).toHaveBeenNthCalledWith(2, "Hello");

    act(() => {
-      store.dispatch(appendInput("echo Hello"));
+      useCommandStore.getState().appendInput("echo Hello");
    });

    expect(mockTerminal.writeln).toHaveBeenNthCalledWith(3, "echo Hello");
  });

  it("should end the line with a dollar sign after writing a command", () => {
-    const { store } = renderTerminal();
+    renderTerminal();

    act(() => {
-      store.dispatch(appendInput("echo Hello"));
+      useCommandStore.getState().appendInput("echo Hello");
    });

    expect(mockTerminal.writeln).toHaveBeenCalledWith("echo Hello");
--- a/frontend/tests/hooks/use-terminal.test.tsx
+++ b/frontend/tests/hooks/use-terminal.test.tsx
@@ -1,7 +1,7 @@
 import { beforeAll, describe, expect, it, vi } from "vitest";
 import { afterEach } from "node:test";
 import { useTerminal } from "#/hooks/use-terminal";
-import { Command } from "#/state/command-slice";
+import { Command, useCommandStore } from "#/state/command-store";
 import { AgentState } from "#/types/agent-state";
 import { renderWithProviders } from "../../test-utils";

@@ -19,10 +19,10 @@ interface TestTerminalComponentProps {
  commands: Command[];
 }

-function TestTerminalComponent({
-  commands,
-}: TestTerminalComponentProps) {
-  const ref = useTerminal({ commands });
+function TestTerminalComponent({ commands }: TestTerminalComponentProps) {
+  // Set commands in Zustand store
+  useCommandStore.setState({ commands });
+  const ref = useTerminal();
  return <div ref={ref} />;
 }

@@ -60,7 +60,6 @@ describe("useTerminal", () => {
    renderWithProviders(<TestTerminalComponent commands={[]} />, {
      preloadedState: {
        agent: { curAgentState: AgentState.RUNNING },
-        cmd: { commands: [] },
      },
    });
  });
@@ -74,7 +73,6 @@ describe("useTerminal", () => {
    renderWithProviders(<TestTerminalComponent commands={commands} />, {
      preloadedState: {
        agent: { curAgentState: AgentState.RUNNING },
-        cmd: { commands },
      },
    });

@@ -94,17 +92,11 @@ describe("useTerminal", () => {
      { content: secret, type: "output" },
    ];

-    renderWithProviders(
-      <TestTerminalComponent
-        commands={commands}
-      />,
-      {
-        preloadedState: {
-          agent: { curAgentState: AgentState.RUNNING },
-          cmd: { commands },
-        },
+    renderWithProviders(<TestTerminalComponent commands={commands} />, {
+      preloadedState: {
+        agent: { curAgentState: AgentState.RUNNING },
      },
-    );
+    });

    // This test is no longer relevant as secrets filtering has been removed
  });
--- a/frontend/tests/initial-query.test.tsx
+++ b/frontend/tests/initial-query.test.tsx
@@ -1,20 +1,24 @@
-import { describe, it, expect } from "vitest";
-import store from "../src/store";
-import {
-  setInitialPrompt,
-  clearInitialPrompt,
-} from "../src/state/initial-query-slice";
+import { describe, it, expect, beforeEach } from "vitest";
+import { useInitialQueryStore } from "../src/stores/initial-query-store";

 describe("Initial Query Behavior", () => {
-  it("should clear initial query when clearInitialPrompt is dispatched", () => {
+  beforeEach(() => {
+    // Reset the store before each test
+    useInitialQueryStore.getState().reset();
+  });
+
+  it("should clear initial query when clearInitialPrompt is called", () => {
+    const { setInitialPrompt, clearInitialPrompt } =
+      useInitialQueryStore.getState();
+
    // Set up initial query in the store
-    store.dispatch(setInitialPrompt("test query"));
-    expect(store.getState().initialQuery.initialPrompt).toBe("test query");
+    setInitialPrompt("test query");
+    expect(useInitialQueryStore.getState().initialPrompt).toBe("test query");

    // Clear the initial query
-    store.dispatch(clearInitialPrompt());
+    clearInitialPrompt();

    // Verify initial query is cleared
-    expect(store.getState().initialQuery.initialPrompt).toBeNull();
+    expect(useInitialQueryStore.getState().initialPrompt).toBeNull();
  });
 });
--- a/frontend/tests/services/actions.test.tsx
+++ b/frontend/tests/services/actions.test.tsx
@@ -13,14 +13,26 @@ vi.mock("#/store", () => ({
  },
 }));

-vi.mock("#/state/command-slice", () => ({
-  appendInput: mockAppendInput,
+vi.mock("#/state/command-store", () => ({
+  useCommandStore: {
+    getState: () => ({
+      appendInput: mockAppendInput,
+    }),
+  },
 }));

 vi.mock("#/state/jupyter-slice", () => ({
  appendJupyterInput: mockAppendJupyterInput,
 }));

+vi.mock("#/state/metrics-slice", () => ({
+  setMetrics: vi.fn(),
+}));
+
+vi.mock("#/state/security-analyzer-slice", () => ({
+  appendSecurityAnalyzerInput: vi.fn(),
+}));
+
 describe("handleActionMessage", () => {
  beforeEach(() => {
    // Clear all mocks before each test
@@ -45,7 +57,8 @@ describe("handleActionMessage", () => {
    handleActionMessage(runAction);

    // Check that appendInput was called with the command
-    expect(mockDispatch).toHaveBeenCalledWith(mockAppendInput("ls -la"));
+    expect(mockAppendInput).toHaveBeenCalledWith("ls -la");
+    expect(mockDispatch).not.toHaveBeenCalled();
    expect(mockAppendJupyterInput).not.toHaveBeenCalled();
  });

@@ -59,7 +72,8 @@ describe("handleActionMessage", () => {
      args: {
        code: "print('Hello from Jupyter!')",
      },
-      message: "Running Python code interactively: print('Hello from Jupyter!')",
+      message:
+        "Running Python code interactively: print('Hello from Jupyter!')",
      timestamp: "2023-01-01T00:00:00Z",
    };

@@ -67,7 +81,9 @@ describe("handleActionMessage", () => {
    handleActionMessage(ipythonAction);

    // Check that appendJupyterInput was called with the code
-    expect(mockDispatch).toHaveBeenCalledWith(mockAppendJupyterInput("print('Hello from Jupyter!')"));
+    expect(mockDispatch).toHaveBeenCalledWith(
+      mockAppendJupyterInput("print('Hello from Jupyter!')"),
+    );
    expect(mockAppendInput).not.toHaveBeenCalled();
  });

@@ -89,7 +105,9 @@ describe("handleActionMessage", () => {
    // Handle the action
    handleActionMessage(hiddenAction);

-    // Check that nothing was dispatched
+    // Check that nothing was dispatched or called
    expect(mockDispatch).not.toHaveBeenCalled();
+    expect(mockAppendInput).not.toHaveBeenCalled();
+    expect(mockAppendJupyterInput).not.toHaveBeenCalled();
  });
 });
--- a/frontend/package-lock.json
+++ b/frontend/package-lock.json
@@ -59,7 +59,8 @@
        "tailwind-scrollbar": "^4.0.2",
        "vite": "^7.1.4",
        "web-vitals": "^5.1.0",
-        "ws": "^8.18.2"
+        "ws": "^8.18.2",
+        "zustand": "^5.0.8"
      },
      "devDependencies": {
        "@babel/parser": "^7.28.3",
@@ -18326,6 +18327,35 @@
      "dev": true,
      "license": "MIT"
    },
+    "node_modules/zustand": {
+      "version": "5.0.8",
+      "resolved": "https://registry.npmjs.org/zustand/-/zustand-5.0.8.tgz",
+      "integrity": "sha512-gyPKpIaxY9XcO2vSMrLbiER7QMAMGOQZVRdJ6Zi782jkbzZygq5GI9nG8g+sMgitRtndwaBSl7uiqC49o1SSiw==",
+      "license": "MIT",
+      "engines": {
+        "node": ">=12.20.0"
+      },
+      "peerDependencies": {
+        "@types/react": ">=18.0.0",
+        "immer": ">=9.0.6",
+        "react": ">=18.0.0",
+        "use-sync-external-store": ">=1.2.0"
+      },
+      "peerDependenciesMeta": {
+        "@types/react": {
+          "optional": true
+        },
+        "immer": {
+          "optional": true
+        },
+        "react": {
+          "optional": true
+        },
+        "use-sync-external-store": {
+          "optional": true
+        }
+      }
+    },
    "node_modules/zwitch": {
      "version": "2.0.4",
      "resolved": "https://registry.npmjs.org/zwitch/-/zwitch-2.0.4.tgz",
--- a/frontend/package.json
+++ b/frontend/package.json
@@ -58,7 +58,8 @@
    "tailwind-scrollbar": "^4.0.2",
    "vite": "^7.1.4",
    "web-vitals": "^5.1.0",
-    "ws": "^8.18.2"
+    "ws": "^8.18.2",
+    "zustand": "^5.0.8"
  },
  "scripts": {
    "dev": "npm run make-i18n && cross-env VITE_MOCK_API=false react-router dev",
--- a/frontend/src/components/features/chat/chat-interface.tsx
+++ b/frontend/src/components/features/chat/chat-interface.tsx
@@ -18,6 +18,7 @@ import { useWsClient } from "#/context/ws-client-provider";
 import { Messages } from "./messages";
 import { ChatSuggestions } from "./chat-suggestions";
 import { ScrollProvider } from "#/context/scroll-context";
+import { useInitialQueryStore } from "#/stores/initial-query-store";

 import { ScrollToBottomButton } from "#/components/shared/buttons/scroll-to-bottom-button";
 import { LoadingSpinner } from "#/components/shared/loading-spinner";
@@ -33,6 +34,7 @@ import { useUploadFiles } from "#/hooks/mutation/use-upload-files";
 import { useConfig } from "#/hooks/query/use-config";
 import { validateFiles } from "#/utils/file-validation";
 import { setMessageToSend } from "#/state/conversation-slice";
+import ConfirmationModeEnabled from "./confirmation-mode-enabled";

 function getEntryPoint(
  hasRepository: boolean | null,
@@ -67,9 +69,7 @@ export function ChatInterface() {
    "positive" | "negative"
  >("positive");
  const [feedbackModalIsOpen, setFeedbackModalIsOpen] = React.useState(false);
-  const { selectedRepository, replayJson } = useSelector(
-    (state: RootState) => state.initialQuery,
-  );
+  const { selectedRepository, replayJson } = useInitialQueryStore();
  const params = useParams();
  const { mutateAsync: uploadFiles } = useUploadFiles();

@@ -210,17 +210,20 @@ export function ChatInterface() {

        <div className="flex flex-col gap-[6px]">
          <div className="flex justify-between relative">
-            {events.length > 0 && (
-              <TrajectoryActions
-                onPositiveFeedback={() =>
-                  onClickShareFeedbackActionButton("positive")
-                }
-                onNegativeFeedback={() =>
-                  onClickShareFeedbackActionButton("negative")
-                }
-                isSaasMode={config?.APP_MODE === "saas"}
-              />
-            )}
+            <div className="flex items-center gap-1">
+              <ConfirmationModeEnabled />
+              {events.length > 0 && (
+                <TrajectoryActions
+                  onPositiveFeedback={() =>
+                    onClickShareFeedbackActionButton("positive")
+                  }
+                  onNegativeFeedback={() =>
+                    onClickShareFeedbackActionButton("negative")
+                  }
+                  isSaasMode={config?.APP_MODE === "saas"}
+                />
+              )}
+            </div>

            <div className="absolute left-1/2 transform -translate-x-1/2 bottom-0">
              {curAgentState === AgentState.RUNNING && <TypingIndicator />}
--- a/frontend/src/components/features/chat/components/chat-input-actions.tsx
+++ b/frontend/src/components/features/chat/components/chat-input-actions.tsx
@@ -0,0 +1,34 @@
+import { ConversationStatus } from "#/types/conversation-status";
+import { ServerStatus } from "#/components/features/controls/server-status";
+import { AgentStatus } from "#/components/features/controls/agent-status";
+import { Tools } from "../../controls/tools";
+
+interface ChatInputActionsProps {
+  conversationStatus: ConversationStatus | null;
+  disabled: boolean;
+  handleStop: (onStop?: () => void) => void;
+  handleResumeAgent: () => void;
+  onStop?: () => void;
+}
+
+export function ChatInputActions({
+  conversationStatus,
+  disabled,
+  handleStop,
+  handleResumeAgent,
+  onStop,
+}: ChatInputActionsProps) {
+  return (
+    <div className="w-full flex items-center justify-between">
+      <div className="flex items-center gap-1">
+        <Tools />
+        <ServerStatus conversationStatus={conversationStatus} />
+      </div>
+      <AgentStatus
+        handleStop={() => handleStop(onStop)}
+        handleResumeAgent={handleResumeAgent}
+        disabled={disabled}
+      />
+    </div>
+  );
+}
--- a/frontend/src/components/features/chat/components/chat-input-container.tsx
+++ b/frontend/src/components/features/chat/components/chat-input-container.tsx
@@ -0,0 +1,89 @@
+import React from "react";
+import { ConversationStatus } from "#/types/conversation-status";
+import { DragOver } from "../drag-over";
+import { UploadedFiles } from "../uploaded-files";
+import { ChatInputRow } from "./chat-input-row";
+import { ChatInputActions } from "./chat-input-actions";
+
+interface ChatInputContainerProps {
+  chatContainerRef: React.RefObject<HTMLDivElement | null>;
+  isDragOver: boolean;
+  disabled: boolean;
+  showButton: boolean;
+  buttonClassName: string;
+  conversationStatus: ConversationStatus | null;
+  chatInputRef: React.RefObject<HTMLDivElement | null>;
+  handleFileIconClick: (isDisabled: boolean) => void;
+  handleSubmit: () => void;
+  handleStop: (onStop?: () => void) => void;
+  handleResumeAgent: () => void;
+  onDragOver: (e: React.DragEvent, isDisabled: boolean) => void;
+  onDragLeave: (e: React.DragEvent, isDisabled: boolean) => void;
+  onDrop: (e: React.DragEvent, isDisabled: boolean) => void;
+  onInput: () => void;
+  onPaste: (e: React.ClipboardEvent) => void;
+  onKeyDown: (e: React.KeyboardEvent) => void;
+  onFocus?: () => void;
+  onBlur?: () => void;
+  onStop?: () => void;
+}
+
+export function ChatInputContainer({
+  chatContainerRef,
+  isDragOver,
+  disabled,
+  showButton,
+  buttonClassName,
+  conversationStatus,
+  chatInputRef,
+  handleFileIconClick,
+  handleSubmit,
+  handleStop,
+  handleResumeAgent,
+  onDragOver,
+  onDragLeave,
+  onDrop,
+  onInput,
+  onPaste,
+  onKeyDown,
+  onFocus,
+  onBlur,
+  onStop,
+}: ChatInputContainerProps) {
+  return (
+    <div
+      ref={chatContainerRef}
+      className="bg-[#25272D] box-border content-stretch flex flex-col items-start justify-center p-4 pt-3 relative rounded-[15px] w-full"
+      onDragOver={(e) => onDragOver(e, disabled)}
+      onDragLeave={(e) => onDragLeave(e, disabled)}
+      onDrop={(e) => onDrop(e, disabled)}
+    >
+      {/* Drag Over UI */}
+      {isDragOver && <DragOver />}
+
+      <UploadedFiles />
+
+      <ChatInputRow
+        chatInputRef={chatInputRef}
+        disabled={disabled}
+        showButton={showButton}
+        buttonClassName={buttonClassName}
+        handleFileIconClick={handleFileIconClick}
+        handleSubmit={handleSubmit}
+        onInput={onInput}
+        onPaste={onPaste}
+        onKeyDown={onKeyDown}
+        onFocus={onFocus}
+        onBlur={onBlur}
+      />
+
+      <ChatInputActions
+        conversationStatus={conversationStatus}
+        disabled={disabled}
+        handleStop={handleStop}
+        handleResumeAgent={handleResumeAgent}
+        onStop={onStop}
+      />
+    </div>
+  );
+}
--- a/frontend/src/components/features/chat/components/chat-input-field.tsx
+++ b/frontend/src/components/features/chat/components/chat-input-field.tsx
@@ -0,0 +1,44 @@
+import React from "react";
+import { useTranslation } from "react-i18next";
+
+interface ChatInputFieldProps {
+  chatInputRef: React.RefObject<HTMLDivElement | null>;
+  onInput: () => void;
+  onPaste: (e: React.ClipboardEvent) => void;
+  onKeyDown: (e: React.KeyboardEvent) => void;
+  onFocus?: () => void;
+  onBlur?: () => void;
+}
+
+export function ChatInputField({
+  chatInputRef,
+  onInput,
+  onPaste,
+  onKeyDown,
+  onFocus,
+  onBlur,
+}: ChatInputFieldProps) {
+  const { t } = useTranslation();
+
+  return (
+    <div
+      className="box-border content-stretch flex flex-row items-center justify-start min-h-6 p-0 relative shrink-0 flex-1"
+      data-name="Text & caret"
+    >
+      <div className="basis-0 flex flex-col font-normal grow justify-center leading-[0] min-h-px min-w-px overflow-ellipsis overflow-hidden relative shrink-0 text-[#d0d9fa] text-[16px] text-left">
+        <div
+          ref={chatInputRef}
+          className="chat-input bg-transparent text-white text-[16px] font-normal leading-[20px] outline-none resize-none custom-scrollbar min-h-[20px] max-h-[400px] [text-overflow:inherit] [text-wrap-mode:inherit] [white-space-collapse:inherit] block whitespace-pre-wrap"
+          contentEditable
+          data-placeholder={t("SUGGESTIONS$WHAT_TO_BUILD")}
+          data-testid="chat-input"
+          onInput={onInput}
+          onPaste={onPaste}
+          onKeyDown={onKeyDown}
+          onFocus={onFocus}
+          onBlur={onBlur}
+        />
+      </div>
+    </div>
+  );
+}
--- a/frontend/src/components/features/chat/components/chat-input-grip.tsx
+++ b/frontend/src/components/features/chat/components/chat-input-grip.tsx
@@ -0,0 +1,38 @@
+import React from "react";
+import { cn } from "#/utils/utils";
+
+interface ChatInputGripProps {
+  gripRef: React.RefObject<HTMLDivElement | null>;
+  isGripVisible: boolean;
+  handleTopEdgeClick: (e: React.MouseEvent) => void;
+  handleGripMouseDown: (e: React.MouseEvent) => void;
+  handleGripTouchStart: (e: React.TouchEvent) => void;
+}
+
+export function ChatInputGrip({
+  gripRef,
+  isGripVisible,
+  handleTopEdgeClick,
+  handleGripMouseDown,
+  handleGripTouchStart,
+}: ChatInputGripProps) {
+  return (
+    <div
+      className="absolute -top-[12px] left-0 w-full h-6 lg:h-3 z-20 group"
+      id="resize-grip"
+      onClick={handleTopEdgeClick}
+    >
+      {/* Resize Grip - appears on hover of top edge area, when dragging, or when clicked */}
+      <div
+        ref={gripRef}
+        className={cn(
+          "absolute top-[4px] left-0 w-full h-[3px] bg-white cursor-ns-resize z-10 transition-opacity duration-200",
+          isGripVisible ? "opacity-100" : "opacity-0 group-hover:opacity-100",
+        )}
+        onMouseDown={handleGripMouseDown}
+        onTouchStart={handleGripTouchStart}
+        style={{ userSelect: "none" }}
+      />
+    </div>
+  );
+}
--- a/frontend/src/components/features/chat/components/chat-input-row.tsx
+++ b/frontend/src/components/features/chat/components/chat-input-row.tsx
@@ -0,0 +1,62 @@
+import React from "react";
+import { cn } from "#/utils/utils";
+import { ChatAddFileButton } from "../chat-add-file-button";
+import { ChatSendButton } from "../chat-send-button";
+import { ChatInputField } from "./chat-input-field";
+
+interface ChatInputRowProps {
+  chatInputRef: React.RefObject<HTMLDivElement | null>;
+  disabled: boolean;
+  showButton: boolean;
+  buttonClassName: string;
+  handleFileIconClick: (isDisabled: boolean) => void;
+  handleSubmit: () => void;
+  onInput: () => void;
+  onPaste: (e: React.ClipboardEvent) => void;
+  onKeyDown: (e: React.KeyboardEvent) => void;
+  onFocus?: () => void;
+  onBlur?: () => void;
+}
+
+export function ChatInputRow({
+  chatInputRef,
+  disabled,
+  showButton,
+  buttonClassName,
+  handleFileIconClick,
+  handleSubmit,
+  onInput,
+  onPaste,
+  onKeyDown,
+  onFocus,
+  onBlur,
+}: ChatInputRowProps) {
+  return (
+    <div className="box-border content-stretch flex flex-row items-end justify-between p-0 relative shrink-0 w-full pb-[18px] gap-2">
+      <div className="basis-0 box-border content-stretch flex flex-row gap-4 grow items-end justify-start min-h-px min-w-px p-0 relative shrink-0">
+        <ChatAddFileButton
+          disabled={disabled}
+          handleFileIconClick={() => handleFileIconClick(disabled)}
+        />
+
+        <ChatInputField
+          chatInputRef={chatInputRef}
+          onInput={onInput}
+          onPaste={onPaste}
+          onKeyDown={onKeyDown}
+          onFocus={onFocus}
+          onBlur={onBlur}
+        />
+      </div>
+
+      {/* Send Button */}
+      {showButton && (
+        <ChatSendButton
+          buttonClassName={cn(buttonClassName, "translate-y-[3px]")}
+          handleSubmit={handleSubmit}
+          disabled={disabled}
+        />
+      )}
+    </div>
+  );
+}
--- a/frontend/src/components/features/chat/components/hidden-file-input.tsx
+++ b/frontend/src/components/features/chat/components/hidden-file-input.tsx
@@ -0,0 +1,23 @@
+import React from "react";
+
+interface HiddenFileInputProps {
+  fileInputRef: React.RefObject<HTMLInputElement | null>;
+  onChange: (e: React.ChangeEvent<HTMLInputElement>) => void;
+}
+
+export function HiddenFileInput({
+  fileInputRef,
+  onChange,
+}: HiddenFileInputProps) {
+  return (
+    <input
+      type="file"
+      ref={fileInputRef}
+      multiple
+      accept="*/*"
+      style={{ display: "none" }}
+      onChange={onChange}
+      data-testid="upload-image-input"
+    />
+  );
+}
--- a/frontend/src/components/features/chat/confirmation-mode-enabled.tsx
+++ b/frontend/src/components/features/chat/confirmation-mode-enabled.tsx
@@ -0,0 +1,29 @@
+import { useTranslation } from "react-i18next";
+import { Tooltip } from "@heroui/react";
+import { I18nKey } from "#/i18n/declaration";
+import LockIcon from "#/icons/lock.svg?react";
+import { useSettings } from "#/hooks/query/use-settings";
+
+function ConfirmationModeEnabled() {
+  const { t } = useTranslation();
+
+  const { data: settings } = useSettings();
+
+  if (!settings?.CONFIRMATION_MODE) {
+    return null;
+  }
+
+  return (
+    <Tooltip
+      content={t(I18nKey.COMMON$CONFIRMATION_MODE_ENABLED)}
+      closeDelay={100}
+      className="bg-white text-black hover:bg-transparent"
+    >
+      <div className="flex items-center justify-center w-[26px] h-[26px] rounded-lg bg-[#25272D]">
+        <LockIcon width={15} height={15} />
+      </div>
+    </Tooltip>
+  );
+}
+
+export default ConfirmationModeEnabled;
--- a/frontend/src/components/features/chat/custom-chat-input.tsx
+++ b/frontend/src/components/features/chat/custom-chat-input.tsx
@@ -1,25 +1,20 @@
-import React, { useRef, useCallback, useState, useEffect } from "react";
+import React, { useEffect } from "react";
 import { useDispatch, useSelector } from "react-redux";
-import { useTranslation } from "react-i18next";
 import { ConversationStatus } from "#/types/conversation-status";
-import { ServerStatus } from "#/components/features/controls/server-status";
-import { AgentStatus } from "#/components/features/controls/agent-status";
-import { ChatSendButton } from "./chat-send-button";
-import { ChatAddFileButton } from "./chat-add-file-button";
-import { cn, isMobileDevice } from "#/utils/utils";
-import { useAutoResize } from "#/hooks/use-auto-resize";
-import { DragOver } from "./drag-over";
-import { UploadedFiles } from "./uploaded-files";
-import { Tools } from "../controls/tools";
 import {
  clearAllFiles,
  setShouldHideSuggestions,
  setSubmittedMessage,
-  setMessageToSend,
-  setIsRightPanelShown,
 } from "#/state/conversation-slice";
-import { CHAT_INPUT } from "#/utils/constants";
 import { RootState } from "#/store";
+import { useChatInputLogic } from "#/hooks/chat/use-chat-input-logic";
+import { useFileHandling } from "#/hooks/chat/use-file-handling";
+import { useGripResize } from "#/hooks/chat/use-grip-resize";
+import { useChatInputEvents } from "#/hooks/chat/use-chat-input-events";
+import { useChatSubmission } from "#/hooks/chat/use-chat-submission";
+import { ChatInputGrip } from "./components/chat-input-grip";
+import { ChatInputContainer } from "./components/chat-input-container";
+import { HiddenFileInput } from "./components/hidden-file-input";

 export interface CustomChatInputProps {
  disabled?: boolean;
@@ -46,13 +41,9 @@ export function CustomChatInput({
  className = "",
  buttonClassName = "",
 }: CustomChatInputProps) {
-  const [isDragOver, setIsDragOver] = useState(false);
-  const [isGripVisible, setIsGripVisible] = useState(false);
-
-  const { messageToSend, submittedMessage, hasRightPanelToggled } = useSelector(
+  const { submittedMessage } = useSelector(
    (state: RootState) => state.conversation,
  );
-
  const dispatch = useDispatch();

  // Disable input when conversation is stopped
@@ -68,87 +59,55 @@ export function CustomChatInput({
    dispatch(setSubmittedMessage(null));
  }, [submittedMessage, disabled, onSubmit, dispatch]);

-  const { t } = useTranslation();
-
-  const chatInputRef = useRef<HTMLDivElement>(null);
-  const fileInputRef = useRef<HTMLInputElement>(null);
-  const chatContainerRef = useRef<HTMLDivElement>(null);
-  const gripRef = useRef<HTMLDivElement>(null);
-
-  // Save current input value when drawer state changes
-  useEffect(() => {
-    if (chatInputRef.current) {
-      const currentText = chatInputRef.current?.innerText || "";
-      // Dispatch to save current input value when drawer state changes
-      dispatch(setMessageToSend(currentText));
-      dispatch(setIsRightPanelShown(hasRightPanelToggled));
-    }
-  }, [hasRightPanelToggled, dispatch]);
-
-  // Helper function to check if contentEditable is truly empty
-  const isContentEmpty = useCallback((): boolean => {
-    if (!chatInputRef.current) return true;
-    const text =
-      chatInputRef.current.innerText || chatInputRef.current.textContent || "";
-    return text.trim() === "";
-  }, []);
-
-  // Helper function to properly clear contentEditable for placeholder display
-  const clearEmptyContent = useCallback((): void => {
-    if (chatInputRef.current && isContentEmpty()) {
-      chatInputRef.current.innerHTML = "";
-      chatInputRef.current.textContent = "";
-    }
-  }, [isContentEmpty]);
-
-  // Drag state management callbacks
-  const handleDragStart = useCallback(() => {
-    // Keep grip visible during drag by adding a CSS class
-    if (gripRef.current) {
-      gripRef.current.classList.add("opacity-100");
-      gripRef.current.classList.remove("opacity-0");
-    }
-  }, []);
-
-  const handleDragEnd = useCallback(() => {
-    // Restore hover-based visibility
-    if (gripRef.current) {
-      gripRef.current.classList.remove("opacity-100");
-      gripRef.current.classList.add("opacity-0");
-    }
-  }, []);
-
-  // Handle click on top edge area to toggle grip visibility
-  const handleTopEdgeClick = (e: React.MouseEvent) => {
-    e.stopPropagation();
-    setIsGripVisible((prev) => !prev);
-  };
-
-  // Callback to handle height changes and manage suggestions visibility
-  const handleHeightChange = useCallback(
-    (height: number) => {
-      // Hide suggestions when input height exceeds the threshold
-      const shouldHideChatSuggestions = height > CHAT_INPUT.HEIGHT_THRESHOLD;
-      dispatch(setShouldHideSuggestions(shouldHideChatSuggestions));
-    },
-    [dispatch],
-  );
-
-  // Use the auto-resize hook with height change callback
+  // Custom hooks
  const {
+    chatInputRef,
+    messageToSend,
+    checkIsContentEmpty,
+    clearEmptyContentHandler,
+  } = useChatInputLogic();
+
+  const {
+    fileInputRef,
+    chatContainerRef,
+    isDragOver,
+    handleFileIconClick,
+    handleFileInputChange,
+    handleDragOver,
+    handleDragLeave,
+    handleDrop,
+  } = useFileHandling(onFilesPaste);
+
+  const {
+    gripRef,
+    isGripVisible,
+    handleTopEdgeClick,
    smartResize,
    handleGripMouseDown,
    handleGripTouchStart,
    increaseHeightForEmptyContent,
-  } = useAutoResize(chatInputRef, {
-    minHeight: 20,
-    maxHeight: 400,
-    onHeightChange: handleHeightChange,
-    onGripDragStart: handleDragStart,
-    onGripDragEnd: handleDragEnd,
-    value: messageToSend ?? undefined,
-    enableManualResize: true,
-  });
+  } = useGripResize(
+    chatInputRef as React.RefObject<HTMLDivElement | null>,
+    messageToSend,
+  );
+
+  const { handleSubmit, handleResumeAgent, handleStop } = useChatSubmission(
+    chatInputRef as React.RefObject<HTMLDivElement | null>,
+    fileInputRef as React.RefObject<HTMLInputElement | null>,
+    smartResize,
+    onSubmit,
+  );
+
+  const { handleInput, handlePaste, handleKeyDown, handleBlur, handleFocus } =
+    useChatInputEvents(
+      chatInputRef as React.RefObject<HTMLDivElement | null>,
+      smartResize,
+      increaseHeightForEmptyContent,
+      checkIsContentEmpty,
+      clearEmptyContentHandler,
+      onFocus,
+      onBlur,
+    );

  // Cleanup: reset suggestions visibility when component unmounts
  useEffect(
@@ -159,283 +118,46 @@ export function CustomChatInput({
    [dispatch],
  );

-  // Function to add files and notify parent
-  const addFiles = useCallback(
-    (files: File[]) => {
-      // Call onFilesPaste if provided with the new files
-      if (onFilesPaste && files.length > 0) {
-        onFilesPaste(files);
-      }
-    },
-    [onFilesPaste],
-  );
-
-  // File icon click handler
-  const handleFileIconClick = () => {
-    if (!isDisabled && fileInputRef.current) {
-      fileInputRef.current.click();
-    }
-  };
-
-  // File input change handler
-  const handleFileInputChange = (e: React.ChangeEvent<HTMLInputElement>) => {
-    const files = Array.from(e.target.files || []);
-    addFiles(files);
-  };
-
-  // Drag and drop event handlers
-  const handleDragOver = (e: React.DragEvent) => {
-    if (isDisabled) return;
-    e.preventDefault();
-    setIsDragOver(true);
-  };
-
-  const handleDragLeave = (e: React.DragEvent) => {
-    if (isDisabled) return;
-    e.preventDefault();
-    // Only remove drag-over class if we're leaving the container entirely
-    if (!chatContainerRef.current?.contains(e.relatedTarget as Node)) {
-      setIsDragOver(false);
-    }
-  };
-
-  const handleDrop = (e: React.DragEvent) => {
-    if (isDisabled) return;
-    e.preventDefault();
-    setIsDragOver(false);
-
-    const files = Array.from(e.dataTransfer.files);
-    addFiles(files);
-  };
-
-  // Send button click handler
-  const handleSubmit = () => {
-    const message = chatInputRef.current?.innerText || "";
-
-    if (message.trim()) {
-      onSubmit(message);
-
-      // Clear the input
-      if (chatInputRef.current) {
-        chatInputRef.current.textContent = "";
-      }
-      if (fileInputRef.current) {
-        fileInputRef.current.value = "";
-      }
-
-      // Reset height and show suggestions again
-      smartResize();
-    }
-  };
-
-  // Resume agent button click handler
-  const handleResumeAgent = () => {
-    const message = chatInputRef.current?.innerText || "continue";
-
-    onSubmit(message.trim());
-
-    // Clear the input
-    if (chatInputRef.current) {
-      chatInputRef.current.textContent = "";
-    }
-    if (fileInputRef.current) {
-      fileInputRef.current.value = "";
-    }
-
-    // Reset height and show suggestions again
-    smartResize();
-  };
-
-  // Handle stop button click
-  const handleStop = () => {
-    if (onStop) {
-      onStop();
-    }
-  };
-
-  // Handle input events
-  const handleInput = () => {
-    smartResize();
-
-    // Clear empty content to ensure placeholder shows
-    if (chatInputRef.current) {
-      clearEmptyContent();
-    }
-
-    // Ensure cursor stays visible when content is scrollable
-    if (!chatInputRef.current) {
-      return;
-    }
-
-    const selection = window.getSelection();
-    if (!selection || selection.rangeCount === 0) {
-      return;
-    }
-
-    const range = selection.getRangeAt(0);
-    if (
-      !range.getBoundingClientRect ||
-      !chatInputRef.current.getBoundingClientRect
-    ) {
-      return;
-    }
-
-    const rect = range.getBoundingClientRect();
-    const inputRect = chatInputRef.current.getBoundingClientRect();
-
-    // If cursor is below the visible area, scroll to show it
-    if (rect.bottom > inputRect.bottom) {
-      chatInputRef.current.scrollTop =
-        chatInputRef.current.scrollHeight - chatInputRef.current.clientHeight;
-    }
-  };
-
-  // Handle paste events to clean up formatting
-  const handlePaste = (e: React.ClipboardEvent) => {
-    e.preventDefault();
-
-    // Get plain text from clipboard
-    const text = e.clipboardData.getData("text/plain");
-
-    // Insert plain text
-    document.execCommand("insertText", false, text);
-
-    // Trigger resize
-    setTimeout(smartResize, 0);
-  };
-
-  // Handle key events
-  const handleKeyDown = (e: React.KeyboardEvent) => {
-    if (e.key !== "Enter") {
-      return;
-    }
-
-    if (isContentEmpty()) {
-      e.preventDefault();
-      increaseHeightForEmptyContent();
-      return;
-    }
-
-    // Original submit logic - only for desktop without shift key
-    if (!isMobileDevice() && !e.shiftKey && !disabled) {
-      e.preventDefault();
-      handleSubmit();
-    }
-  };
-
-  // Handle blur events to ensure placeholder shows when empty
-  const handleBlur = () => {
-    // Clear empty content to ensure placeholder shows
-    if (chatInputRef.current) {
-      clearEmptyContent();
-    }
-
-    // Call the original onBlur callback if provided
-    if (onBlur) {
-      onBlur();
-    }
-  };
-
  return (
    <div className={`w-full ${className}`}>
      {/* Hidden file input */}
-      <input
-        type="file"
-        ref={fileInputRef}
-        multiple
-        accept="*/*"
-        style={{ display: "none" }}
+      <HiddenFileInput
+        fileInputRef={fileInputRef}
        onChange={handleFileInputChange}
-        data-testid="upload-image-input"
      />

      {/* Container with grip */}
      <div className="relative w-full">
-        {/* Top edge hover area - invisible area that triggers grip visibility */}
-        <div
-          className="absolute -top-[12px] left-0 w-full h-6 lg:h-3 z-20 group"
-          id="resize-grip"
-          onClick={handleTopEdgeClick}
-        >
-          {/* Resize Grip - appears on hover of top edge area, when dragging, or when clicked */}
-          <div
-            ref={gripRef}
-            className={cn(
-              "absolute top-[4px] left-0 w-full h-[3px] bg-white cursor-ns-resize z-10 transition-opacity duration-200",
-              isGripVisible
-                ? "opacity-100"
-                : "opacity-0 group-hover:opacity-100",
-            )}
-            onMouseDown={handleGripMouseDown}
-            onTouchStart={handleGripTouchStart}
-            style={{ userSelect: "none" }}
-          />
-        </div>
+        <ChatInputGrip
+          gripRef={gripRef}
+          isGripVisible={isGripVisible}
+          handleTopEdgeClick={handleTopEdgeClick}
+          handleGripMouseDown={handleGripMouseDown}
+          handleGripTouchStart={handleGripTouchStart}
+        />

-        {/* Chat Input Component */}
-        <div
-          ref={chatContainerRef}
-          className="bg-[#25272D] box-border content-stretch flex flex-col items-start justify-center p-4 pt-3 relative rounded-[15px] w-full"
+        <ChatInputContainer
+          chatContainerRef={chatContainerRef}
+          isDragOver={isDragOver}
+          disabled={isDisabled}
+          showButton={showButton}
+          buttonClassName={buttonClassName}
+          conversationStatus={conversationStatus}
+          chatInputRef={chatInputRef}
+          handleFileIconClick={handleFileIconClick}
+          handleSubmit={handleSubmit}
+          handleStop={handleStop}
+          handleResumeAgent={handleResumeAgent}
          onDragOver={handleDragOver}
          onDragLeave={handleDragLeave}
          onDrop={handleDrop}
-        >
-          {/* Drag Over UI */}
-          {isDragOver && <DragOver />}
-
-          <UploadedFiles />
-          {/* Main Input Row */}
-          <div className="box-border content-stretch flex flex-row items-end justify-between p-0 relative shrink-0 w-full pb-[18px] gap-2">
-            <div className="basis-0 box-border content-stretch flex flex-row gap-4 grow items-end justify-start min-h-px min-w-px p-0 relative shrink-0">
-              <ChatAddFileButton
-                disabled={disabled}
-                handleFileIconClick={handleFileIconClick}
-              />
-
-              {/* Chat Input Area */}
-              <div
-                className="box-border content-stretch flex flex-row items-center justify-start min-h-6 p-0 relative shrink-0 flex-1"
-                data-name="Text & caret"
-              >
-                <div className="basis-0 flex flex-col font-normal grow justify-center leading-[0] min-h-px min-w-px overflow-ellipsis overflow-hidden relative shrink-0 text-[#d0d9fa] text-[16px] text-left">
-                  <div
-                    ref={chatInputRef}
-                    className="chat-input bg-transparent text-white text-[16px] font-normal leading-[20px] outline-none resize-none custom-scrollbar min-h-[20px] max-h-[400px] [text-overflow:inherit] [text-wrap-mode:inherit] [white-space-collapse:inherit] block whitespace-pre-wrap"
-                    contentEditable
-                    data-placeholder={t("SUGGESTIONS$WHAT_TO_BUILD")}
-                    data-testid="chat-input"
-                    onInput={handleInput}
-                    onPaste={handlePaste}
-                    onKeyDown={handleKeyDown}
-                    onFocus={onFocus}
-                    onBlur={handleBlur}
-                  />
-                </div>
-              </div>
-            </div>
-
-            {/* Send Button */}
-            {showButton && (
-              <ChatSendButton
-                buttonClassName={cn(buttonClassName, "translate-y-[3px]")}
-                handleSubmit={handleSubmit}
-                disabled={disabled}
-              />
-            )}
-          </div>
-
-          <div className="w-full flex items-center justify-between">
-            <div className="flex items-center gap-1">
-              <Tools />
-              <ServerStatus conversationStatus={conversationStatus} />
-            </div>
-            <AgentStatus
-              handleStop={handleStop}
-              handleResumeAgent={handleResumeAgent}
-              disabled={disabled}
-            />
-          </div>
-        </div>
+          onInput={handleInput}
+          onPaste={handlePaste}
+          onKeyDown={(e) => handleKeyDown(e, isDisabled, handleSubmit)}
+          onFocus={handleFocus}
+          onBlur={handleBlur}
+          onStop={onStop}
+        />
      </div>
    </div>
  );
--- a/frontend/src/components/features/chat/task-tracking-observation-content.tsx
+++ b/frontend/src/components/features/chat/task-tracking-observation-content.tsx
@@ -1,6 +1,6 @@
-import React from "react";
-import { useTranslation } from "react-i18next";
 import { TaskTrackingObservation } from "#/types/core/observations";
+import { TaskListSection } from "./task-tracking/task-list-section";
+import { ResultSection } from "./task-tracking/result-section";

 interface TaskTrackingObservationContentProps {
  event: TaskTrackingObservation;
@@ -9,101 +9,17 @@ interface TaskTrackingObservationContentProps {
 export function TaskTrackingObservationContent({
  event,
 }: TaskTrackingObservationContentProps) {
-  const { t } = useTranslation();
-
  const { command, task_list: taskList } = event.extras;
  const shouldShowTaskList = command === "plan" && taskList.length > 0;

-  const getStatusIcon = (status: string) => {
-    switch (status) {
-      case "todo":
-        return "⏳";
-      case "in_progress":
-        return "🔄";
-      case "done":
-        return "✅";
-      default:
-        return "❓";
-    }
-  };
-
-  const getStatusClassName = (status: string) => {
-    if (status === "done") {
-      return "bg-green-800 text-green-200";
-    }
-    if (status === "in_progress") {
-      return "bg-yellow-800 text-yellow-200";
-    }
-    return "bg-gray-700 text-gray-300";
-  };
-
  return (
    <div className="flex flex-col gap-4">
      {/* Task List section - only show for 'plan' command */}
-      {shouldShowTaskList && (
-        <div className="flex flex-col gap-2">
-          <div className="flex items-center justify-between">
-            <h3 className="text-sm font-semibold text-gray-300">
-              {t("TASK_TRACKING_OBSERVATION$TASK_LIST")} ({taskList.length}{" "}
-              {taskList.length === 1 ? "item" : "items"})
-            </h3>
-          </div>
-          <div className="p-3 bg-gray-900 rounded-md overflow-auto text-gray-300 max-h-[400px] shadow-inner">
-            <div className="space-y-3">
-              {taskList.map((task, index) => (
-                <div key={task.id} className="border-l-2 border-gray-600 pl-3">
-                  <div className="flex items-start gap-2">
-                    <span className="text-lg">
-                      {getStatusIcon(task.status)}
-                    </span>
-                    <div className="flex-1">
-                      <div className="flex items-center gap-2 mb-1">
-                        <span className="text-sm text-gray-400">
-                          {index + 1}.
-                        </span>
-                        <span
-                          className={`text-xs px-2 py-1 rounded uppercase font-semibold ${getStatusClassName(
-                            task.status,
-                          )}`}
-                        >
-                          {task.status.replace("_", " ")}
-                        </span>
-                      </div>
-                      <h4 className="font-medium text-white mb-1">
-                        {task.title}
-                      </h4>
-                      <p className="text-xs text-gray-400 mb-1">
-                        {t("TASK_TRACKING_OBSERVATION$TASK_ID")}: {task.id}
-                      </p>
-                      {task.notes && (
-                        <p className="text-sm text-gray-300 italic">
-                          {t("TASK_TRACKING_OBSERVATION$TASK_NOTES")}:{" "}
-                          {task.notes}
-                        </p>
-                      )}
-                    </div>
-                  </div>
-                </div>
-              ))}
-            </div>
-          </div>
-        </div>
-      )}
+      {shouldShowTaskList && <TaskListSection taskList={taskList} />}

      {/* Result message - only show if there's meaningful content */}
      {event.content && event.content.trim() && (
-        <div className="flex flex-col gap-2">
-          <div className="flex items-center justify-between">
-            <h3 className="text-sm font-semibold text-gray-300">
-              {t("TASK_TRACKING_OBSERVATION$RESULT")}
-            </h3>
-          </div>
-          <div className="p-3 bg-gray-900 rounded-md overflow-auto text-gray-300 shadow-inner">
-            <pre className="whitespace-pre-wrap text-sm">
-              {event.content.trim()}
-            </pre>
-          </div>
-        </div>
+        <ResultSection content={event.content} />
      )}
    </div>
  );
--- a/frontend/src/components/features/chat/task-tracking/result-section.tsx
+++ b/frontend/src/components/features/chat/task-tracking/result-section.tsx
@@ -0,0 +1,21 @@
+import { useTranslation } from "react-i18next";
+import { Typography } from "#/ui/typography";
+
+interface ResultSectionProps {
+  content: string;
+}
+
+export function ResultSection({ content }: ResultSectionProps) {
+  const { t } = useTranslation();
+
+  return (
+    <div className="flex flex-col gap-2">
+      <div className="flex items-center justify-between">
+        <Typography.H3>{t("TASK_TRACKING_OBSERVATION$RESULT")}</Typography.H3>
+      </div>
+      <div className="p-3 bg-gray-900 rounded-md overflow-auto text-gray-300 shadow-inner">
+        <pre className="whitespace-pre-wrap text-sm">{content.trim()}</pre>
+      </div>
+    </div>
+  );
+}
--- a/frontend/src/components/features/chat/task-tracking/status-badge.tsx
+++ b/frontend/src/components/features/chat/task-tracking/status-badge.tsx
@@ -0,0 +1,17 @@
+import { getStatusClassName } from "#/utils/utils";
+
+interface StatusBadgeProps {
+  status: string;
+}
+
+export function StatusBadge({ status }: StatusBadgeProps) {
+  return (
+    <span
+      className={`text-xs px-2 py-1 rounded uppercase font-semibold ${getStatusClassName(
+        status,
+      )}`}
+    >
+      {status.replace(/_/g, " ")}
+    </span>
+  );
+}
--- a/frontend/src/components/features/chat/task-tracking/status-icon.tsx
+++ b/frontend/src/components/features/chat/task-tracking/status-icon.tsx
@@ -0,0 +1,9 @@
+import { getStatusIcon } from "#/utils/utils";
+
+interface StatusIconProps {
+  status: string;
+}
+
+export function StatusIcon({ status }: StatusIconProps) {
+  return <span className="text-lg">{getStatusIcon(status)}</span>;
+}
--- a/frontend/src/components/features/chat/task-tracking/task-item.tsx
+++ b/frontend/src/components/features/chat/task-tracking/task-item.tsx
@@ -0,0 +1,43 @@
+import { useTranslation } from "react-i18next";
+import { Typography } from "#/ui/typography";
+import { StatusIcon } from "./status-icon";
+import { StatusBadge } from "./status-badge";
+
+interface TaskItemProps {
+  task: {
+    id: string;
+    title: string;
+    status: "todo" | "in_progress" | "done";
+    notes?: string;
+  };
+  index: number;
+}
+
+export function TaskItem({ task, index }: TaskItemProps) {
+  const { t } = useTranslation();
+
+  return (
+    <div className="border-l-2 border-gray-600 pl-3">
+      <div className="flex items-start gap-2">
+        <StatusIcon status={task.status} />
+        <div className="flex-1">
+          <div className="flex items-center gap-2 mb-1">
+            <Typography.Text className="text-sm text-gray-400">
+              {index + 1}.
+            </Typography.Text>
+            <StatusBadge status={task.status} />
+          </div>
+          <h4 className="font-medium text-white mb-1">{task.title}</h4>
+          <Typography.Text className="text-xs text-gray-400 mb-1">
+            {t("TASK_TRACKING_OBSERVATION$TASK_ID")}: {task.id}
+          </Typography.Text>
+          {task.notes && (
+            <Typography.Text className="text-sm text-gray-300 italic">
+              {t("TASK_TRACKING_OBSERVATION$TASK_NOTES")}: {task.notes}
+            </Typography.Text>
+          )}
+        </div>
+      </div>
+    </div>
+  );
+}
--- a/frontend/src/components/features/chat/task-tracking/task-list-section.tsx
+++ b/frontend/src/components/features/chat/task-tracking/task-list-section.tsx
@@ -0,0 +1,34 @@
+import { useTranslation } from "react-i18next";
+import { TaskItem } from "./task-item";
+import { Typography } from "#/ui/typography";
+
+interface TaskListSectionProps {
+  taskList: Array<{
+    id: string;
+    title: string;
+    status: "todo" | "in_progress" | "done";
+    notes?: string;
+  }>;
+}
+
+export function TaskListSection({ taskList }: TaskListSectionProps) {
+  const { t } = useTranslation();
+
+  return (
+    <div className="flex flex-col gap-2">
+      <div className="flex items-center justify-between">
+        <Typography.H3>
+          {t("TASK_TRACKING_OBSERVATION$TASK_LIST")} ({taskList.length}{" "}
+          {taskList.length === 1 ? "item" : "items"})
+        </Typography.H3>
+      </div>
+      <div className="p-3 bg-gray-900 rounded-md overflow-auto text-gray-300 max-h-[400px] shadow-inner">
+        <div className="space-y-3">
+          {taskList.map((task, index) => (
+            <TaskItem key={task.id} task={task} index={index} />
+          ))}
+        </div>
+      </div>
+    </div>
+  );
+}
--- a/frontend/src/components/features/chat/utils/chat-input.utils.ts
+++ b/frontend/src/components/features/chat/utils/chat-input.utils.ts
@@ -0,0 +1,75 @@
+/**
+ * Utility functions for chat input component
+ */
+/* eslint-disable no-param-reassign -- these helpers intentionally mutate the DOM element passed in */
+/**
+ * Check whether a contentEditable element is null or contains only whitespace
+ */
+export const isContentEmpty = (element: HTMLDivElement | null): boolean => {
+  if (!element) {
+    return true;
+  }
+  const text = element.innerText || element.textContent || "";
+  return text.trim() === "";
+};
+
+/**
+ * Reset a whitespace-only contentEditable element so its placeholder can display
+ */
+export const clearEmptyContent = (element: HTMLDivElement | null): void => {
+  if (element && isContentEmpty(element)) {
+    element.innerHTML = "";
+    element.textContent = "";
+  }
+};
+
+/**
+ * Get the rendered text (innerText) of a contentEditable element
+ */
+export const getTextContent = (element: HTMLDivElement | null): string =>
+  element?.innerText || "";
+
+/**
+ * Clear text content from contentEditable element
+ */
+export const clearTextContent = (element: HTMLDivElement | null): void => {
+  if (element) {
+    element.textContent = "";
+  }
+};
+
+/**
+ * Clear file input value
+ */
+export const clearFileInput = (element: HTMLInputElement | null): void => {
+  if (element) {
+    element.value = "";
+  }
+};
+
+/**
+ * Ensure cursor stays visible when content is scrollable
+ */
+export const ensureCursorVisible = (element: HTMLDivElement | null): void => {
+  if (!element) {
+    return;
+  }
+
+  const selection = window.getSelection();
+  if (!selection || selection.rangeCount === 0) {
+    return;
+  }
+
+  const range = selection.getRangeAt(0);
+  if (!range.getBoundingClientRect || !element.getBoundingClientRect) {
+    return;
+  }
+
+  const rect = range.getBoundingClientRect();
+  const inputRect = element.getBoundingClientRect();
+
+  // If the cursor is below the visible area, scroll to the bottom to reveal it
+  if (rect.bottom > inputRect.bottom) {
+    element.scrollTop = element.scrollHeight - element.clientHeight;
+  }
+};
--- a/frontend/src/components/features/controls/agent-status.tsx
+++ b/frontend/src/components/features/controls/agent-status.tsx
@@ -2,6 +2,7 @@ import { useTranslation } from "react-i18next";
 import { useSelector, useDispatch } from "react-redux";
 import { useEffect } from "react";
 import { RootState } from "#/store";
+import { useStatusStore } from "#/state/status-store";
 import { useWsClient } from "#/context/ws-client-provider";
 import { useActiveConversation } from "#/hooks/query/use-active-conversation";
 import { getStatusCode } from "#/utils/status";
@@ -30,7 +31,7 @@ export function AgentStatus({
  const { t } = useTranslation();
  const dispatch = useDispatch();
  const { curAgentState } = useSelector((state: RootState) => state.agent);
-  const { curStatusMessage } = useSelector((state: RootState) => state.status);
+  const { curStatusMessage } = useStatusStore();
  const { webSocketStatus } = useWsClient();
  const { data: conversation } = useActiveConversation();

--- a/frontend/src/components/features/conversation-panel/conversation-card/conversation-card-actions.tsx
+++ b/frontend/src/components/features/conversation-panel/conversation-card/conversation-card-actions.tsx
@@ -0,0 +1,65 @@
+import React from "react";
+import { cn } from "#/utils/utils";
+import { ConversationStatus } from "#/types/conversation-status";
+import { ConversationCardContextMenu } from "./conversation-card-context-menu";
+import EllipsisIcon from "#/icons/ellipsis.svg?react";
+
+interface ConversationCardActionsProps {
+  contextMenuOpen: boolean;
+  onContextMenuToggle: (isOpen: boolean) => void;
+  onDelete?: (event: React.MouseEvent<HTMLButtonElement>) => void;
+  onStop?: (event: React.MouseEvent<HTMLButtonElement>) => void;
+  onEdit?: (event: React.MouseEvent<HTMLButtonElement>) => void;
+  onDownloadViaVSCode?: (event: React.MouseEvent<HTMLButtonElement>) => void;
+  conversationStatus?: ConversationStatus;
+  conversationId?: string;
+  showOptions?: boolean;
+}
+
+export function ConversationCardActions({
+  contextMenuOpen,
+  onContextMenuToggle,
+  onDelete,
+  onStop,
+  onEdit,
+  onDownloadViaVSCode,
+  conversationStatus,
+  conversationId,
+  showOptions,
+}: ConversationCardActionsProps) {
+  return (
+    <div className="group">
+      <button
+        data-testid="ellipsis-button"
+        type="button"
+        onClick={(event) => {
+          event.preventDefault();
+          event.stopPropagation();
+          onContextMenuToggle(!contextMenuOpen);
+        }}
+        className="cursor-pointer w-6 h-6 flex flex-row items-center justify-end"
+      >
+        <EllipsisIcon />
+      </button>
+      <div
+        className={cn(
+          // Show on hover (desktop) or when explicitly opened (click/touch)
+          "relative opacity-0 invisible group-hover:opacity-100 group-hover:visible",
+          // Override hover styles when explicitly opened via click
+          contextMenuOpen && "opacity-100 visible",
+        )}
+      >
+        <ConversationCardContextMenu
+          onClose={() => onContextMenuToggle(false)}
+          onDelete={onDelete}
+          onStop={conversationStatus !== "STOPPED" ? onStop : undefined}
+          onEdit={onEdit}
+          onDownloadViaVSCode={
+            conversationId && showOptions ? onDownloadViaVSCode : undefined
+          }
+          position="bottom"
+        />
+      </div>
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/conversation-card/conversation-card-footer.tsx
+++ b/frontend/src/components/features/conversation-panel/conversation-card/conversation-card-footer.tsx
@@ -0,0 +1,38 @@
+import { useTranslation } from "react-i18next";
+import { formatTimeDelta } from "#/utils/format-time-delta";
+import { cn } from "#/utils/utils";
+import { I18nKey } from "#/i18n/declaration";
+import { RepositorySelection } from "#/api/open-hands.types";
+import { ConversationRepoLink } from "./conversation-repo-link";
+import { NoRepository } from "./no-repository";
+
+interface ConversationCardFooterProps {
+  selectedRepository: RepositorySelection | null;
+  lastUpdatedAt: string; // ISO 8601
+  createdAt?: string; // ISO 8601
+}
+
+export function ConversationCardFooter({
+  selectedRepository,
+  lastUpdatedAt,
+  createdAt,
+}: ConversationCardFooterProps) {
+  const { t } = useTranslation();
+
+  return (
+    <div className="flex flex-row justify-between items-center mt-1">
+      {selectedRepository?.selected_repository ? (
+        <ConversationRepoLink selectedRepository={selectedRepository} />
+      ) : (
+        <NoRepository />
+      )}
+      {(lastUpdatedAt ?? createdAt) && (
+        <p className="text-xs text-[#A3A3A3] flex-1 text-right">
+          <time>
+            {`${formatTimeDelta(new Date(lastUpdatedAt ?? createdAt))} ${t(I18nKey.CONVERSATION$AGO)}`}
+          </time>
+        </p>
+      )}
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/conversation-card/conversation-card-header.tsx
+++ b/frontend/src/components/features/conversation-panel/conversation-card/conversation-card-header.tsx
@@ -0,0 +1,40 @@
+import { ConversationStatus } from "#/types/conversation-status";
+import { ConversationCardTitle } from "./conversation-card-title";
+import { ConversationStatusIndicator } from "../../home/recent-conversations/conversation-status-indicator";
+import { ConversationStatusBadges } from "./conversation-status-badges";
+
+interface ConversationCardHeaderProps {
+  title: string;
+  titleMode: "view" | "edit";
+  onTitleSave: (title: string) => void;
+  conversationStatus?: ConversationStatus;
+}
+
+export function ConversationCardHeader({
+  title,
+  titleMode,
+  onTitleSave,
+  conversationStatus,
+}: ConversationCardHeaderProps) {
+  return (
+    <div className="flex items-center gap-2 flex-1 min-w-0 overflow-hidden mr-2">
+      {/* Status Indicator */}
+      {conversationStatus && (
+        <div className="flex items-center">
+          <ConversationStatusIndicator
+            conversationStatus={conversationStatus}
+          />
+        </div>
+      )}
+      <ConversationCardTitle
+        title={title}
+        titleMode={titleMode}
+        onSave={onTitleSave}
+      />
+      {/* Status Badges */}
+      {conversationStatus && (
+        <ConversationStatusBadges conversationStatus={conversationStatus} />
+      )}
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/conversation-card/conversation-card.tsx
+++ b/frontend/src/components/features/conversation-panel/conversation-card/conversation-card.tsx
@@ -1,28 +1,13 @@
 import React from "react";
-import { useSelector } from "react-redux";
 import posthog from "posthog-js";
-import { useTranslation } from "react-i18next";
-import { formatTimeDelta } from "#/utils/format-time-delta";
-import { ConversationRepoLink } from "./conversation-repo-link";
-import { ConversationCardContextMenu } from "./conversation-card-context-menu";
-import { SystemMessageModal } from "../system-message-modal";
-import { MicroagentsModal } from "../microagents-modal";
-import { BudgetDisplay } from "../budget-display";
 import { cn } from "#/utils/utils";
-import { BaseModal } from "../../../shared/modals/base-modal/base-modal";
-import { RootState } from "#/store";
-import { I18nKey } from "#/i18n/declaration";
 import { transformVSCodeUrl } from "#/utils/vscode-url-helper";
 import ConversationService from "#/api/conversation-service/conversation-service.api";
-import { useWsClient } from "#/context/ws-client-provider";
-import { isSystemMessage } from "#/types/core/guards";
 import { ConversationStatus } from "#/types/conversation-status";
 import { RepositorySelection } from "#/api/open-hands.types";
-import EllipsisIcon from "#/icons/ellipsis.svg?react";
-import { ConversationCardTitle } from "./conversation-card-title";
-import { ConversationStatusIndicator } from "../../home/recent-conversations/conversation-status-indicator";
-import { ConversationStatusBadges } from "./conversation-status-badges";
-import { NoRepository } from "./no-repository";
+import { ConversationCardHeader } from "./conversation-card-header";
+import { ConversationCardActions } from "./conversation-card-actions";
+import { ConversationCardFooter } from "./conversation-card-footer";

 interface ConversationCardProps {
  onClick?: () => void;
@@ -57,18 +42,7 @@ export function ConversationCard({
  contextMenuOpen = false,
  onContextMenuToggle,
 }: ConversationCardProps) {
-  const { t } = useTranslation();
-  const { parsedEvents } = useWsClient();
  const [titleMode, setTitleMode] = React.useState<"view" | "edit">("view");
-  const [metricsModalVisible, setMetricsModalVisible] = React.useState(false);
-  const [systemModalVisible, setSystemModalVisible] = React.useState(false);
-  const [microagentsModalVisible, setMicroagentsModalVisible] =
-    React.useState(false);
-
-  const systemMessage = parsedEvents.find(isSystemMessage);
-
-  // Subscribe to metrics data from Redux store
-  const metrics = useSelector((state: RootState) => state.metrics);

  const onTitleSave = (newTitle: string) => {
    if (newTitle !== "" && newTitle !== title) {
@@ -124,250 +98,47 @@ export function ConversationCard({
    onContextMenuToggle?.(false);
  };

-  const handleDisplayCost = (event: React.MouseEvent<HTMLButtonElement>) => {
-    event.stopPropagation();
-    setMetricsModalVisible(true);
-  };
-
-  const handleShowAgentTools = (event: React.MouseEvent<HTMLButtonElement>) => {
-    event.stopPropagation();
-    setSystemModalVisible(true);
-  };
-
-  const handleShowMicroagents = (
-    event: React.MouseEvent<HTMLButtonElement>,
-  ) => {
-    event.stopPropagation();
-    setMicroagentsModalVisible(true);
-  };
-
  const hasContextMenu = !!(onDelete || onChangeTitle || showOptions);

  return (
-    <>
-      <div
-        data-testid="conversation-card"
-        data-context-menu-open={contextMenuOpen.toString()}
-        onClick={onClick}
-        className={cn(
-          "relative h-auto w-full p-3.5 border-b border-neutral-600 cursor-pointer",
-          "data-[context-menu-open=false]:hover:bg-[#454545]",
-          conversationStatus === "ARCHIVED" && "opacity-60",
+    <div
+      data-testid="conversation-card"
+      data-context-menu-open={contextMenuOpen.toString()}
+      onClick={onClick}
+      className={cn(
+        "relative h-auto w-full p-3.5 border-b border-neutral-600 cursor-pointer",
+        "data-[context-menu-open=false]:hover:bg-[#454545]",
+        conversationStatus === "ARCHIVED" && "opacity-60",
+      )}
+    >
+      <div className="flex items-center justify-between w-full">
+        <ConversationCardHeader
+          title={title}
+          titleMode={titleMode}
+          onTitleSave={onTitleSave}
+          conversationStatus={conversationStatus}
+        />
+
+        {hasContextMenu && (
+          <ConversationCardActions
+            contextMenuOpen={contextMenuOpen}
+            onContextMenuToggle={onContextMenuToggle || (() => {})}
+            onDelete={onDelete && handleDelete}
+            onStop={onStop && handleStop}
+            onEdit={onChangeTitle && handleEdit}
+            onDownloadViaVSCode={handleDownloadViaVSCode}
+            conversationStatus={conversationStatus}
+            conversationId={conversationId}
+            showOptions={showOptions}
+          />
        )}
-      >
-        <div className="flex items-center justify-between w-full">
-          <div className="flex items-center gap-2 flex-1 min-w-0 overflow-hidden mr-2">
-            {/* Status Indicator */}
-            {conversationStatus && (
-              <div className="flex items-center">
-                <ConversationStatusIndicator
-                  conversationStatus={conversationStatus}
-                />
-              </div>
-            )}
-            <ConversationCardTitle
-              title={title}
-              titleMode={titleMode}
-              onSave={onTitleSave}
-            />
-            {/* Status Badges */}
-            {conversationStatus && (
-              <ConversationStatusBadges
-                conversationStatus={conversationStatus}
-              />
-            )}
-          </div>
-
-          {hasContextMenu && (
-            <div className="group">
-              <button
-                data-testid="ellipsis-button"
-                type="button"
-                onClick={(event) => {
-                  event.preventDefault();
-                  event.stopPropagation();
-                  onContextMenuToggle?.(!contextMenuOpen);
-                }}
-                className="cursor-pointer w-6 h-6 flex flex-row items-center justify-end"
-              >
-                <EllipsisIcon />
-              </button>
-              <div
-                className={cn(
-                  // Show on hover (desktop) or when explicitly opened (click/touch)
-                  "relative opacity-0 invisible group-hover:opacity-100 group-hover:visible",
-                  // Override hover styles when explicitly opened via click
-                  contextMenuOpen && "opacity-100 visible",
-                )}
-              >
-                <ConversationCardContextMenu
-                  onClose={() => onContextMenuToggle?.(false)}
-                  onDelete={onDelete && handleDelete}
-                  onStop={
-                    conversationStatus !== "STOPPED"
-                      ? onStop && handleStop
-                      : undefined
-                  }
-                  onEdit={onChangeTitle && handleEdit}
-                  onDownloadViaVSCode={
-                    conversationId && showOptions
-                      ? handleDownloadViaVSCode
-                      : undefined
-                  }
-                  onDisplayCost={showOptions ? handleDisplayCost : undefined}
-                  onShowAgentTools={
-                    showOptions && systemMessage
-                      ? handleShowAgentTools
-                      : undefined
-                  }
-                  onShowMicroagents={
-                    showOptions && conversationId
-                      ? handleShowMicroagents
-                      : undefined
-                  }
-                  position="bottom"
-                />
-              </div>
-            </div>
-          )}
-        </div>
-
-        <div className={cn("flex flex-row justify-between items-center mt-1")}>
-          {selectedRepository?.selected_repository ? (
-            <ConversationRepoLink selectedRepository={selectedRepository} />
-          ) : (
-            <NoRepository />
-          )}
-          {(createdAt ?? lastUpdatedAt) && (
-            <p className="text-xs text-[#A3A3A3] flex-1 text-right">
-              <time>
-                {`${formatTimeDelta(new Date(lastUpdatedAt ?? createdAt))} ${t(I18nKey.CONVERSATION$AGO)}`}
-              </time>
-            </p>
-          )}
-        </div>
      </div>

-      <BaseModal
-        isOpen={metricsModalVisible}
-        onOpenChange={setMetricsModalVisible}
-        title={t(I18nKey.CONVERSATION$METRICS_INFO)}
-        testID="metrics-modal"
-      >
-        <div className="space-y-4">
-          {(metrics?.cost !== null || metrics?.usage !== null) && (
-            <div className="rounded-md p-3">
-              <div className="grid gap-3">
-                {metrics?.cost !== null && (
-                  <div className="flex justify-between items-center pb-2">
-                    <span className="text-lg font-semibold">
-                      {t(I18nKey.CONVERSATION$TOTAL_COST)}
-                    </span>
-                    <span className="font-semibold">
-                      ${metrics.cost.toFixed(4)}
-                    </span>
-                  </div>
-                )}
-                <BudgetDisplay
-                  cost={metrics?.cost ?? null}
-                  maxBudgetPerTask={metrics?.max_budget_per_task ?? null}
-                />
-
-                {metrics?.usage !== null && (
-                  <>
-                    <div className="flex justify-between items-center pb-2">
-                      <span>{t(I18nKey.CONVERSATION$INPUT)}</span>
-                      <span className="font-semibold">
-                        {metrics.usage.prompt_tokens.toLocaleString()}
-                      </span>
-                    </div>
-
-                    <div className="grid grid-cols-2 gap-2 pl-4 text-sm">
-                      <span className="text-neutral-400">
-                        {t(I18nKey.CONVERSATION$CACHE_HIT)}
-                      </span>
-                      <span className="text-right">
-                        {metrics.usage.cache_read_tokens.toLocaleString()}
-                      </span>
-                      <span className="text-neutral-400">
-                        {t(I18nKey.CONVERSATION$CACHE_WRITE)}
-                      </span>
-                      <span className="text-right">
-                        {metrics.usage.cache_write_tokens.toLocaleString()}
-                      </span>
-                    </div>
-
-                    <div className="flex justify-between items-center border-b border-neutral-700 pb-2">
-                      <span>{t(I18nKey.CONVERSATION$OUTPUT)}</span>
-                      <span className="font-semibold">
-                        {metrics.usage.completion_tokens.toLocaleString()}
-                      </span>
-                    </div>
-
-                    <div className="flex justify-between items-center border-b border-neutral-700 pb-2">
-                      <span className="font-semibold">
-                        {t(I18nKey.CONVERSATION$TOTAL)}
-                      </span>
-                      <span className="font-bold">
-                        {(
-                          metrics.usage.prompt_tokens +
-                          metrics.usage.completion_tokens
-                        ).toLocaleString()}
-                      </span>
-                    </div>
-
-                    <div className="flex flex-col gap-2">
-                      <div className="flex items-center justify-between">
-                        <span className="font-semibold">
-                          {t(I18nKey.CONVERSATION$CONTEXT_WINDOW)}
-                        </span>
-                      </div>
-                      <div className="w-full h-1.5 bg-neutral-700 rounded-full overflow-hidden">
-                        <div
-                          className="h-full bg-blue-500 transition-all duration-300"
-                          style={{
-                            width: `${Math.min(100, (metrics.usage.per_turn_token / metrics.usage.context_window) * 100)}%`,
-                          }}
-                        />
-                      </div>
-                      <div className="flex justify-end">
-                        <span className="text-xs text-neutral-400">
-                          {metrics.usage.per_turn_token.toLocaleString()} /{" "}
-                          {metrics.usage.context_window.toLocaleString()} (
-                          {(
-                            (metrics.usage.per_turn_token /
-                              metrics.usage.context_window) *
-                            100
-                          ).toFixed(2)}
-                          % {t(I18nKey.CONVERSATION$USED)})
-                        </span>
-                      </div>
-                    </div>
-                  </>
-                )}
-              </div>
-            </div>
-          )}
-
-          {!metrics?.cost && !metrics?.usage && (
-            <div className="rounded-md p-4 text-center">
-              <p className="text-neutral-400">
-                {t(I18nKey.CONVERSATION$NO_METRICS)}
-              </p>
-            </div>
-          )}
-        </div>
-      </BaseModal>
-
-      <SystemMessageModal
-        isOpen={systemModalVisible}
-        onClose={() => setSystemModalVisible(false)}
-        systemMessage={systemMessage ? systemMessage.args : null}
+      <ConversationCardFooter
+        selectedRepository={selectedRepository}
+        lastUpdatedAt={lastUpdatedAt}
+        createdAt={createdAt}
      />
-
-      {microagentsModalVisible && (
-        <MicroagentsModal onClose={() => setMicroagentsModalVisible(false)} />
-      )}
-    </>
+    </div>
  );
 }
--- a/frontend/src/components/features/conversation-panel/microagent-content.tsx
+++ b/frontend/src/components/features/conversation-panel/microagent-content.tsx
@@ -0,0 +1,35 @@
+import { useTranslation } from "react-i18next";
+import { I18nKey } from "#/i18n/declaration";
+import { Typography } from "#/ui/typography";
+import { Pre } from "#/ui/pre";
+
+interface MicroagentContentProps {
+  content: string;
+}
+
+export function MicroagentContent({ content }: MicroagentContentProps) {
+  const { t } = useTranslation();
+
+  return (
+    <div className="mt-2">
+      <Typography.Text className="text-sm font-semibold text-gray-300 mb-2">
+        {t(I18nKey.MICROAGENTS_MODAL$CONTENT)}
+      </Typography.Text>
+      <Pre
+        size="default"
+        font="mono"
+        lineHeight="relaxed"
+        background="dark"
+        textColor="light"
+        padding="medium"
+        borderRadius="medium"
+        shadow="inner"
+        maxHeight="small"
+        overflow="auto"
+        className="mt-2"
+      >
+        {content || t(I18nKey.MICROAGENTS_MODAL$NO_CONTENT)}
+      </Pre>
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/microagent-item.tsx
+++ b/frontend/src/components/features/conversation-panel/microagent-item.tsx
@@ -0,0 +1,52 @@
+import { ChevronDown, ChevronRight } from "lucide-react";
+import { Microagent } from "#/api/open-hands.types";
+import { Typography } from "#/ui/typography";
+import { MicroagentTriggers } from "./microagent-triggers";
+import { MicroagentContent } from "./microagent-content";
+
+interface MicroagentItemProps {
+  agent: Microagent;
+  isExpanded: boolean;
+  onToggle: (agentName: string) => void;
+}
+
+export function MicroagentItem({
+  agent,
+  isExpanded,
+  onToggle,
+}: MicroagentItemProps) {
+  return (
+    <div className="rounded-md overflow-hidden">
+      <button
+        type="button"
+        onClick={() => onToggle(agent.name)}
+        className="w-full py-3 px-2 text-left flex items-center justify-between hover:bg-gray-700 transition-colors"
+      >
+        <div className="flex items-center">
+          <Typography.Text className="font-bold text-gray-100">
+            {agent.name}
+          </Typography.Text>
+        </div>
+        <div className="flex items-center">
+          <Typography.Text className="px-2 py-1 text-xs rounded-full bg-gray-800 mr-2">
+            {agent.type === "repo" ? "Repository" : "Knowledge"}
+          </Typography.Text>
+          <Typography.Text className="text-gray-300">
+            {isExpanded ? (
+              <ChevronDown size={18} />
+            ) : (
+              <ChevronRight size={18} />
+            )}
+          </Typography.Text>
+        </div>
+      </button>
+
+      {isExpanded && (
+        <div className="px-2 pb-3 pt-1">
+          <MicroagentTriggers triggers={agent.triggers} />
+          <MicroagentContent content={agent.content} />
+        </div>
+      )}
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/microagent-triggers.tsx
+++ b/frontend/src/components/features/conversation-panel/microagent-triggers.tsx
@@ -0,0 +1,33 @@
+import { useTranslation } from "react-i18next";
+import { I18nKey } from "#/i18n/declaration";
+import { Typography } from "#/ui/typography";
+
+interface MicroagentTriggersProps {
+  triggers: string[];
+}
+
+export function MicroagentTriggers({ triggers }: MicroagentTriggersProps) {
+  const { t } = useTranslation();
+
+  if (!triggers || triggers.length === 0) {
+    return null;
+  }
+
+  return (
+    <div className="mt-2 mb-3">
+      <Typography.Text className="text-sm font-semibold text-gray-300 mb-2">
+        {t(I18nKey.MICROAGENTS_MODAL$TRIGGERS)}
+      </Typography.Text>
+      <div className="flex flex-wrap gap-1">
+        {triggers.map((trigger) => (
+          <Typography.Text
+            key={trigger}
+            className="px-2 py-1 text-xs rounded-full bg-blue-900"
+          >
+            {trigger}
+          </Typography.Text>
+        ))}
+      </div>
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/microagents-empty-state.tsx
+++ b/frontend/src/components/features/conversation-panel/microagents-empty-state.tsx
@@ -0,0 +1,21 @@
+import { useTranslation } from "react-i18next";
+import { I18nKey } from "#/i18n/declaration";
+import { Typography } from "#/ui/typography";
+
+interface MicroagentsEmptyStateProps {
+  isError: boolean;
+}
+
+export function MicroagentsEmptyState({ isError }: MicroagentsEmptyStateProps) {
+  const { t } = useTranslation();
+
+  return (
+    <div className="flex items-center justify-center h-full p-4">
+      <Typography.Text className="text-gray-400">
+        {isError
+          ? t(I18nKey.MICROAGENTS_MODAL$FETCH_ERROR)
+          : t(I18nKey.CONVERSATION$NO_MICROAGENTS)}
+      </Typography.Text>
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/microagents-loading-state.tsx
+++ b/frontend/src/components/features/conversation-panel/microagents-loading-state.tsx
@@ -0,0 +1,7 @@
+export function MicroagentsLoadingState() {
+  return (
+    <div className="flex justify-center items-center py-8">
+      <div className="animate-spin rounded-full h-8 w-8 border-t-2 border-b-2 border-primary" />
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/microagents-modal-header.tsx
+++ b/frontend/src/components/features/conversation-panel/microagents-modal-header.tsx
@@ -0,0 +1,45 @@
+import { useTranslation } from "react-i18next";
+import { RefreshCw } from "lucide-react";
+import { BaseModalTitle } from "#/components/shared/modals/confirmation-modals/base-modal";
+import { I18nKey } from "#/i18n/declaration";
+import { BrandButton } from "../settings/brand-button";
+
+interface MicroagentsModalHeaderProps {
+  isAgentReady: boolean;
+  isLoading: boolean;
+  isRefetching: boolean;
+  onRefresh: () => void;
+}
+
+export function MicroagentsModalHeader({
+  isAgentReady,
+  isLoading,
+  isRefetching,
+  onRefresh,
+}: MicroagentsModalHeaderProps) {
+  const { t } = useTranslation();
+
+  return (
+    <div className="flex flex-col gap-6 w-full">
+      <div className="flex items-center justify-between w-full">
+        <BaseModalTitle title={t(I18nKey.MICROAGENTS_MODAL$TITLE)} />
+        {isAgentReady && (
+          <BrandButton
+            testId="refresh-microagents"
+            type="button"
+            variant="primary"
+            className="flex items-center gap-2"
+            onClick={onRefresh}
+            isDisabled={isLoading || isRefetching}
+          >
+            <RefreshCw
+              size={16}
+              className={isRefetching ? "animate-spin" : ""}
+            />
+            {t(I18nKey.BUTTON$REFRESH)}
+          </BrandButton>
+        )}
+      </div>
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/microagents-modal.tsx
+++ b/frontend/src/components/features/conversation-panel/microagents-modal.tsx
@@ -1,15 +1,17 @@
 import { useState } from "react";
 import { useTranslation } from "react-i18next";
 import { useSelector } from "react-redux";
-import { ChevronDown, ChevronRight, RefreshCw } from "lucide-react";
-import { BaseModalTitle } from "#/components/shared/modals/confirmation-modals/base-modal";
 import { ModalBackdrop } from "#/components/shared/modals/modal-backdrop";
 import { ModalBody } from "#/components/shared/modals/modal-body";
 import { I18nKey } from "#/i18n/declaration";
 import { useConversationMicroagents } from "#/hooks/query/use-conversation-microagents";
 import { RootState } from "#/store";
 import { AgentState } from "#/types/agent-state";
-import { BrandButton } from "../settings/brand-button";
+import { Typography } from "#/ui/typography";
+import { MicroagentsModalHeader } from "./microagents-modal-header";
+import { MicroagentsLoadingState } from "./microagents-loading-state";
+import { MicroagentsEmptyState } from "./microagents-empty-state";
+import { MicroagentItem } from "./microagent-item";

 interface MicroagentsModalProps {
  onClose: () => void;
@@ -47,57 +49,34 @@ export function MicroagentsModal({ onClose }: MicroagentsModalProps) {
        className="max-h-[80vh] flex flex-col items-start"
        testID="microagents-modal"
      >
-        <div className="flex flex-col gap-6 w-full">
-          <div className="flex items-center justify-between w-full">
-            <BaseModalTitle title={t(I18nKey.MICROAGENTS_MODAL$TITLE)} />
-            {isAgentReady && (
-              <BrandButton
-                testId="refresh-microagents"
-                type="button"
-                variant="primary"
-                className="flex items-center gap-2"
-                onClick={refetch}
-                isDisabled={isLoading || isRefetching}
-              >
-                <RefreshCw
-                  size={16}
-                  className={`${isRefetching ? "animate-spin" : ""}`}
-                />
-                {t(I18nKey.BUTTON$REFRESH)}
-              </BrandButton>
-            )}
-          </div>
-        </div>
+        <MicroagentsModalHeader
+          isAgentReady={isAgentReady}
+          isLoading={isLoading}
+          isRefetching={isRefetching}
+          onRefresh={refetch}
+        />

        {isAgentReady && (
-          <span className="text-sm text-gray-400">
+          <Typography.Text className="text-sm text-gray-400">
            {t(I18nKey.MICROAGENTS_MODAL$WARNING)}
-          </span>
+          </Typography.Text>
        )}

-        <div className="w-full h-[60vh] overflow-auto rounded-md">
+        <div className="w-full h-[60vh] overflow-auto rounded-md custom-scrollbar-always">
          {!isAgentReady && (
            <div className="w-full h-full flex items-center text-center justify-center text-2xl text-tertiary-light">
-              {t(I18nKey.DIFF_VIEWER$WAITING_FOR_RUNTIME)}
+              <Typography.Text>
+                {t(I18nKey.DIFF_VIEWER$WAITING_FOR_RUNTIME)}
+              </Typography.Text>
            </div>
          )}

-          {isLoading && (
-            <div className="flex justify-center items-center py-8">
-              <div className="animate-spin rounded-full h-8 w-8 border-t-2 border-b-2 border-primary" />
-            </div>
-          )}
+          {isLoading && <MicroagentsLoadingState />}

          {!isLoading &&
            isAgentReady &&
            (isError || !microagents || microagents.length === 0) && (
-              <div className="flex items-center justify-center h-full p-4">
-                <p className="text-gray-400">
-                  {isError
-                    ? t(I18nKey.MICROAGENTS_MODAL$FETCH_ERROR)
-                    : t(I18nKey.CONVERSATION$NO_MICROAGENTS)}
-                </p>
-              </div>
+              <MicroagentsEmptyState isError={isError} />
            )}

          {!isLoading &&
@@ -109,68 +88,12 @@ export function MicroagentsModal({ onClose }: MicroagentsModalProps) {
                  const isExpanded = expandedAgents[agent.name] || false;

                  return (
-                    <div
+                    <MicroagentItem
                      key={agent.name}
-                      className="rounded-md overflow-hidden"
-                    >
-                      <button
-                        type="button"
-                        onClick={() => toggleAgent(agent.name)}
-                        className="w-full py-3 px-2 text-left flex items-center justify-between hover:bg-gray-700 transition-colors"
-                      >
-                        <div className="flex items-center">
-                          <h3 className="font-bold text-gray-100">
-                            {agent.name}
-                          </h3>
-                        </div>
-                        <div className="flex items-center">
-                          <span className="px-2 py-1 text-xs rounded-full bg-gray-800 mr-2">
-                            {agent.type === "repo" ? "Repository" : "Knowledge"}
-                          </span>
-                          <span className="text-gray-300">
-                            {isExpanded ? (
-                              <ChevronDown size={18} />
-                            ) : (
-                              <ChevronRight size={18} />
-                            )}
-                          </span>
-                        </div>
-                      </button>
-
-                      {isExpanded && (
-                        <div className="px-2 pb-3 pt-1">
-                          {agent.triggers && agent.triggers.length > 0 && (
-                            <div className="mt-2 mb-3">
-                              <h4 className="text-sm font-semibold text-gray-300 mb-2">
-                                {t(I18nKey.MICROAGENTS_MODAL$TRIGGERS)}
-                              </h4>
-                              <div className="flex flex-wrap gap-1">
-                                {agent.triggers.map((trigger) => (
-                                  <span
-                                    key={trigger}
-                                    className="px-2 py-1 text-xs rounded-full bg-blue-900"
-                                  >
-                                    {trigger}
-                                  </span>
-                                ))}
-                              </div>
-                            </div>
-                          )}
-
-                          <div className="mt-2">
-                            <h4 className="text-sm font-semibold text-gray-300 mb-2">
-                              {t(I18nKey.MICROAGENTS_MODAL$CONTENT)}
-                            </h4>
-                            <div className="text-sm mt-2 p-3 bg-gray-900 rounded-md overflow-auto text-gray-300 max-h-[400px] shadow-inner">
-                              <pre className="whitespace-pre-wrap font-mono text-sm leading-relaxed">
-                                {agent.content ||
-                                  t(I18nKey.MICROAGENTS_MODAL$NO_CONTENT)}
-                              </pre>
-                            </div>
-                          </div>
-                        </div>
-                      )}
-                    </div>
+                      agent={agent}
+                      isExpanded={isExpanded}
+                      onToggle={toggleAgent}
+                    />
                  );
                })}
              </div>
--- a/frontend/src/components/features/conversation-panel/system-message-modal.tsx
+++ b/frontend/src/components/features/conversation-panel/system-message-modal.tsx
@@ -1,12 +1,9 @@
-import React, { useState } from "react";
-import { useTranslation } from "react-i18next";
-import { ChevronDown, ChevronRight } from "lucide-react";
-import ReactJsonView from "@microlink/react-json-view";
-import { BaseModalTitle } from "#/components/shared/modals/confirmation-modals/base-modal";
+import { useState } from "react";
 import { ModalBackdrop } from "#/components/shared/modals/modal-backdrop";
 import { ModalBody } from "#/components/shared/modals/modal-body";
-import { cn } from "#/utils/utils";
-import { JSON_VIEW_THEME } from "#/utils/constants";
+import { SystemMessageHeader } from "./system-message-modal/system-message-header";
+import { TabNavigation } from "./system-message-modal/tab-navigation";
+import { TabContent } from "./system-message-modal/tab-content";

 interface SystemMessageModalProps {
  isOpen: boolean;
@@ -19,26 +16,11 @@ interface SystemMessageModalProps {
  } | null;
 }

-interface FunctionData {
-  name?: string;
-  description?: string;
-  parameters?: Record<string, unknown>;
-}
-
-interface ToolData {
-  type?: string;
-  function?: FunctionData;
-  name?: string;
-  description?: string;
-  parameters?: Record<string, unknown>;
-}
-
 export function SystemMessageModal({
  isOpen,
  onClose,
  systemMessage,
 }: SystemMessageModalProps) {
-  const { t } = useTranslation();
  const [activeTab, setActiveTab] = useState<"system" | "tools">("system");
  const [expandedTools, setExpandedTools] = useState<Record<number, boolean>>(
    {},
@@ -62,155 +44,27 @@ export function SystemMessageModal({
          width="medium"
          className="max-h-[80vh] flex flex-col items-start"
        >
-          <div className="flex flex-col gap-6 w-full">
-            <BaseModalTitle title={t("SYSTEM_MESSAGE_MODAL$TITLE")} />
-            <div className="flex flex-col gap-2">
-              {systemMessage.agent_class && (
-                <div className="text-sm">
-                  <span className="font-semibold text-gray-300">
-                    {t("SYSTEM_MESSAGE_MODAL$AGENT_CLASS")}
-                  </span>{" "}
-                  <span className="font-medium text-gray-100">
-                    {systemMessage.agent_class}
-                  </span>
-                </div>
-              )}
-              {systemMessage.openhands_version && (
-                <div className="text-sm">
-                  <span className="font-semibold text-gray-300">
-                    {t("SYSTEM_MESSAGE_MODAL$OPENHANDS_VERSION")}
-                  </span>{" "}
-                  <span className="text-gray-100">
-                    {systemMessage.openhands_version}
-                  </span>
-                </div>
-              )}
-            </div>
-          </div>
+          <SystemMessageHeader
+            agentClass={systemMessage.agent_class}
+            openhandsVersion={systemMessage.openhands_version}
+          />

          <div className="w-full">
-            <div className="flex border-b mb-2">
-              <button
-                type="button"
-                className={cn(
-                  "px-4 py-2 font-medium border-b-2 transition-colors",
-                  activeTab === "system"
-                    ? "border-primary text-gray-100"
-                    : "border-transparent hover:text-gray-700 dark:hover:text-gray-300",
-                )}
-                onClick={() => setActiveTab("system")}
-              >
-                {t("SYSTEM_MESSAGE_MODAL$SYSTEM_MESSAGE_TAB")}
-              </button>
-              {systemMessage.tools && systemMessage.tools.length > 0 && (
-                <button
-                  type="button"
-                  className={cn(
-                    "px-4 py-2 font-medium border-b-2 transition-colors",
-                    activeTab === "tools"
-                      ? "border-primary text-gray-100"
-                      : "border-transparent hover:text-gray-700 dark:hover:text-gray-300",
-                  )}
-                  onClick={() => setActiveTab("tools")}
-                >
-                  {t("SYSTEM_MESSAGE_MODAL$TOOLS_TAB")}
-                </button>
-              )}
-            </div>
+            <TabNavigation
+              activeTab={activeTab}
+              onTabChange={setActiveTab}
+              hasTools={
+                !!(systemMessage.tools && systemMessage.tools.length > 0)
+              }
+            />

-            <div className="max-h-[51vh] overflow-auto rounded-md">
-              {activeTab === "system" && (
-                <div className="p-4 whitespace-pre-wrap font-mono text-sm leading-relaxed text-gray-300 shadow-inner">
-                  {systemMessage.content}
-                </div>
-              )}
-
-              {activeTab === "tools" &&
-                systemMessage.tools &&
-                systemMessage.tools.length > 0 && (
-                  <div className="p-2 space-y-3">
-                    {systemMessage.tools.map((tool, index) => {
-                      // Extract function data from the nested structure
-                      const toolData = tool as ToolData;
-                      const functionData = toolData.function || toolData;
-                      const name =
-                        functionData.name ||
-                        (toolData.type === "function" &&
-                          toolData.function?.name) ||
-                        "";
-                      const description =
-                        functionData.description ||
-                        (toolData.type === "function" &&
-                          toolData.function?.description) ||
-                        "";
-                      const parameters =
-                        functionData.parameters ||
-                        (toolData.type === "function" &&
-                          toolData.function?.parameters) ||
-                        null;
-
-                      const isExpanded = expandedTools[index] || false;
-
-                      return (
-                        <div key={index} className="rounded-md overflow-hidden">
-                          <button
-                            type="button"
-                            onClick={() => toggleTool(index)}
-                            className="w-full py-3 px-2 text-left flex items-center justify-between hover:bg-gray-700 transition-colors"
-                          >
-                            <div className="flex items-center">
-                              <h3 className="font-bold text-gray-100">
-                                {String(name)}
-                              </h3>
-                            </div>
-                            <span className="text-gray-300">
-                              {isExpanded ? (
-                                <ChevronDown size={18} />
-                              ) : (
-                                <ChevronRight size={18} />
-                              )}
-                            </span>
-                          </button>
-
-                          {isExpanded && (
-                            <div className="px-2 pb-3 pt-1">
-                              <div className="mt-2 mb-3">
-                                <p className="text-sm whitespace-pre-wrap text-gray-300 leading-relaxed">
-                                  {String(description)}
-                                </p>
-                              </div>
-
-                              {/* Parameters section */}
-                              {parameters && (
-                                <div className="mt-2">
-                                  <h4 className="text-sm font-semibold text-gray-300">
-                                    {t("SYSTEM_MESSAGE_MODAL$PARAMETERS")}
-                                  </h4>
-                                  <div className="text-sm mt-2 p-3 bg-gray-900 rounded-md overflow-auto text-gray-300 max-h-[400px] shadow-inner">
-                                    <ReactJsonView
-                                      name={false}
-                                      src={parameters}
-                                      theme={JSON_VIEW_THEME}
-                                    />
-                                  </div>
-                                </div>
-                              )}
-                            </div>
-                          )}
-                        </div>
-                      );
-                    })}
-                  </div>
-                )}
-
-              {activeTab === "tools" &&
-                (!systemMessage.tools || systemMessage.tools.length === 0) && (
-                  <div className="flex items-center justify-center h-full p-4">
-                    <p className="text-gray-400">
-                      {t("SYSTEM_MESSAGE_MODAL$NO_TOOLS")}
-                    </p>
-                  </div>
-                )}
+            <div className="max-h-[51vh] overflow-auto rounded-md custom-scrollbar-always">
+              <TabContent
+                activeTab={activeTab}
+                systemMessage={systemMessage}
+                expandedTools={expandedTools}
+                onToggleTool={toggleTool}
+              />
            </div>
          </div>
        </ModalBody>
--- a/frontend/src/components/features/conversation-panel/system-message-modal/empty-tools-state.tsx
+++ b/frontend/src/components/features/conversation-panel/system-message-modal/empty-tools-state.tsx
@@ -0,0 +1,14 @@
+import { useTranslation } from "react-i18next";
+import { Typography } from "#/ui/typography";
+
+export function EmptyToolsState() {
+  const { t } = useTranslation();
+
+  return (
+    <div className="flex items-center justify-center h-full p-4">
+      <Typography.Text className="text-gray-400">
+        {t("SYSTEM_MESSAGE_MODAL$NO_TOOLS")}
+      </Typography.Text>
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/system-message-modal/system-message-content.tsx
+++ b/frontend/src/components/features/conversation-panel/system-message-modal/system-message-content.tsx
@@ -0,0 +1,13 @@
+import { Typography } from "#/ui/typography";
+
+interface SystemMessageContentProps {
+  content: string;
+}
+
+export function SystemMessageContent({ content }: SystemMessageContentProps) {
+  return (
+    <div className="p-4 shadow-inner">
+      <Typography.CodeBlock>{content}</Typography.CodeBlock>
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/system-message-modal/system-message-header.tsx
+++ b/frontend/src/components/features/conversation-panel/system-message-modal/system-message-header.tsx
@@ -0,0 +1,43 @@
+import { useTranslation } from "react-i18next";
+import { BaseModalTitle } from "#/components/shared/modals/confirmation-modals/base-modal";
+import { Typography } from "#/ui/typography";
+
+interface SystemMessageHeaderProps {
+  agentClass: string | null;
+  openhandsVersion: string | null;
+}
+
+export function SystemMessageHeader({
+  agentClass,
+  openhandsVersion,
+}: SystemMessageHeaderProps) {
+  const { t } = useTranslation();
+
+  return (
+    <div className="flex flex-col gap-6 w-full">
+      <BaseModalTitle title={t("SYSTEM_MESSAGE_MODAL$TITLE")} />
+      <div className="flex flex-col gap-2">
+        {agentClass && (
+          <div className="text-sm">
+            <Typography.Text className="font-semibold text-gray-300">
+              {t("SYSTEM_MESSAGE_MODAL$AGENT_CLASS")}
+            </Typography.Text>{" "}
+            <Typography.Text className="font-medium text-gray-100">
+              {agentClass}
+            </Typography.Text>
+          </div>
+        )}
+        {openhandsVersion && (
+          <div className="text-sm">
+            <Typography.Text className="font-semibold text-gray-300">
+              {t("SYSTEM_MESSAGE_MODAL$OPENHANDS_VERSION")}
+            </Typography.Text>{" "}
+            <Typography.Text className="text-gray-100">
+              {openhandsVersion}
+            </Typography.Text>
+          </div>
+        )}
+      </div>
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/system-message-modal/tab-button.tsx
+++ b/frontend/src/components/features/conversation-panel/system-message-modal/tab-button.tsx
@@ -0,0 +1,37 @@
+import { cn } from "#/utils/utils";
+
+interface TabButtonProps {
+  isActive: boolean;
+  children: React.ReactNode;
+  onClick: () => void;
+  className?: string;
+  disabled?: boolean;
+}
+
+export function TabButton({
+  isActive,
+  children,
+  onClick,
+  className,
+  disabled = false,
+}: TabButtonProps) {
+  return (
+    <button
+      type="button"
+      disabled={disabled}
+      className={cn(
+        "px-4 py-2 font-medium border-b-2 transition-colors",
+        isActive
+          ? "border-primary text-gray-100"
+          : "border-transparent hover:text-gray-700 dark:hover:text-gray-300",
+        disabled && "opacity-50 cursor-not-allowed",
+        className,
+      )}
+      onClick={onClick}
+      aria-selected={isActive}
+      role="tab"
+    >
+      {children}
+    </button>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/system-message-modal/tab-content.tsx
+++ b/frontend/src/components/features/conversation-panel/system-message-modal/tab-content.tsx
@@ -0,0 +1,40 @@
+import { SystemMessageContent } from "./system-message-content";
+import { ToolsList } from "./tools-list";
+import { EmptyToolsState } from "./empty-tools-state";
+
+interface TabContentProps {
+  activeTab: "system" | "tools";
+  systemMessage: {
+    content: string;
+    tools: Array<Record<string, unknown>> | null;
+  };
+  expandedTools: Record<number, boolean>;
+  onToggleTool: (index: number) => void;
+}
+
+export function TabContent({
+  activeTab,
+  systemMessage,
+  expandedTools,
+  onToggleTool,
+}: TabContentProps) {
+  if (activeTab === "system") {
+    return <SystemMessageContent content={systemMessage.content} />;
+  }
+
+  if (activeTab === "tools") {
+    if (systemMessage.tools && systemMessage.tools.length > 0) {
+      return (
+        <ToolsList
+          tools={systemMessage.tools}
+          expandedTools={expandedTools}
+          onToggleTool={onToggleTool}
+        />
+      );
+    }
+
+    return <EmptyToolsState />;
+  }
+
+  return null;
+}
--- a/frontend/src/components/features/conversation-panel/system-message-modal/tab-navigation.tsx
+++ b/frontend/src/components/features/conversation-panel/system-message-modal/tab-navigation.tsx
@@ -0,0 +1,35 @@
+import { useTranslation } from "react-i18next";
+import { TabButton } from "./tab-button";
+
+interface TabNavigationProps {
+  activeTab: "system" | "tools";
+  onTabChange: (tab: "system" | "tools") => void;
+  hasTools: boolean;
+}
+
+export function TabNavigation({
+  activeTab,
+  onTabChange,
+  hasTools,
+}: TabNavigationProps) {
+  const { t } = useTranslation();
+
+  return (
+    <div className="flex border-b mb-2" role="tablist">
+      <TabButton
+        isActive={activeTab === "system"}
+        onClick={() => onTabChange("system")}
+      >
+        {t("SYSTEM_MESSAGE_MODAL$SYSTEM_MESSAGE_TAB")}
+      </TabButton>
+      {hasTools && (
+        <TabButton
+          isActive={activeTab === "tools"}
+          onClick={() => onTabChange("tools")}
+        >
+          {t("SYSTEM_MESSAGE_MODAL$TOOLS_TAB")}
+        </TabButton>
+      )}
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/system-message-modal/toggle-button.tsx
+++ b/frontend/src/components/features/conversation-panel/system-message-modal/toggle-button.tsx
@@ -0,0 +1,33 @@
+import { ChevronDown, ChevronRight } from "lucide-react";
+import { Typography } from "#/ui/typography";
+
+interface ToggleButtonProps {
+  title: string;
+  isExpanded: boolean;
+  onClick: () => void;
+  className?: string;
+}
+
+export function ToggleButton({
+  title,
+  isExpanded,
+  onClick,
+  className,
+}: ToggleButtonProps) {
+  return (
+    <button
+      type="button"
+      onClick={onClick}
+      className={`w-full py-3 px-2 text-left flex items-center justify-between hover:bg-gray-700 transition-colors ${className || ""}`}
+    >
+      <div className="flex items-center">
+        <Typography.Text className="font-bold text-gray-100">
+          {title}
+        </Typography.Text>
+      </div>
+      <Typography.Text className="text-gray-300">
+        {isExpanded ? <ChevronDown size={18} /> : <ChevronRight size={18} />}
+      </Typography.Text>
+    </button>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/system-message-modal/tool-item.tsx
+++ b/frontend/src/components/features/conversation-panel/system-message-modal/tool-item.tsx
@@ -0,0 +1,65 @@
+import { Typography } from "#/ui/typography";
+import { ToolParameters } from "./tool-parameters";
+import { ToggleButton } from "./toggle-button";
+
+interface FunctionData {
+  name?: string;
+  description?: string;
+  parameters?: Record<string, unknown>;
+}
+
+interface ToolData {
+  type?: string;
+  function?: FunctionData;
+  name?: string;
+  description?: string;
+  parameters?: Record<string, unknown>;
+}
+
+interface ToolItemProps {
+  tool: Record<string, unknown>;
+  index: number;
+  isExpanded: boolean;
+  onToggle: (index: number) => void;
+}
+
+export function ToolItem({ tool, index, isExpanded, onToggle }: ToolItemProps) {
+  // Tools may be flat or nested under `function`; read fields from whichever shape is present
+  const toolData = tool as ToolData;
+  const functionData = toolData.function || toolData;
+  const name =
+    functionData.name ||
+    (toolData.type === "function" && toolData.function?.name) ||
+    "";
+  const description =
+    functionData.description ||
+    (toolData.type === "function" && toolData.function?.description) ||
+    "";
+  const parameters =
+    functionData.parameters ||
+    (toolData.type === "function" && toolData.function?.parameters) ||
+    null;
+
+  return (
+    <div className="rounded-md overflow-hidden">
+      <ToggleButton
+        title={String(name)}
+        isExpanded={isExpanded}
+        onClick={() => onToggle(index)}
+      />
+
+      {isExpanded && (
+        <div className="px-2 pb-3 pt-1">
+          <div className="mt-2 mb-3">
+            <Typography.Text className="text-sm whitespace-pre-wrap text-gray-300 leading-relaxed">
+              {String(description)}
+            </Typography.Text>
+          </div>
+
+          {/* Parameters section */}
+          {parameters && <ToolParameters parameters={parameters} />}
+        </div>
+      )}
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/system-message-modal/tool-parameters.tsx
+++ b/frontend/src/components/features/conversation-panel/system-message-modal/tool-parameters.tsx
@@ -0,0 +1,23 @@
+import { useTranslation } from "react-i18next";
+import ReactJsonView from "@microlink/react-json-view";
+import { JSON_VIEW_THEME } from "#/utils/constants";
+import { Typography } from "#/ui/typography";
+
+interface ToolParametersProps {
+  parameters: Record<string, unknown>;
+}
+
+export function ToolParameters({ parameters }: ToolParametersProps) {
+  const { t } = useTranslation();
+
+  return (
+    <div className="mt-2">
+      <Typography.Text className="text-sm font-semibold text-gray-300">
+        {t("SYSTEM_MESSAGE_MODAL$PARAMETERS")}
+      </Typography.Text>
+      <div className="text-sm mt-2 p-3 bg-gray-900 rounded-md overflow-auto text-gray-300 max-h-[400px] shadow-inner">
+        <ReactJsonView name={false} src={parameters} theme={JSON_VIEW_THEME} />
+      </div>
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation-panel/system-message-modal/tools-list.tsx
+++ b/frontend/src/components/features/conversation-panel/system-message-modal/tools-list.tsx
@@ -0,0 +1,27 @@
+import { ToolItem } from "./tool-item";
+
+interface ToolsListProps {
+  tools: Array<Record<string, unknown>>;
+  expandedTools: Record<number, boolean>;
+  onToggleTool: (index: number) => void;
+}
+
+export function ToolsList({
+  tools,
+  expandedTools,
+  onToggleTool,
+}: ToolsListProps) {
+  return (
+    <div className="p-2 space-y-3">
+      {tools.map((tool, index) => (
+        <ToolItem
+          key={index}
+          tool={tool}
+          index={index}
+          isExpanded={expandedTools[index] || false}
+          onToggle={onToggleTool}
+        />
+      ))}
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation/conversation-main.tsx
+++ b/frontend/src/components/features/conversation/conversation-main.tsx
@@ -1,86 +0,0 @@
-import { useSelector } from "react-redux";
-import { useWindowSize } from "@uidotdev/usehooks";
-import { Panel, PanelGroup, PanelResizeHandle } from "react-resizable-panels";
-import { ChatInterface } from "../chat/chat-interface";
-import { ConversationTabContent } from "./conversation-tabs/conversation-tab-content/conversation-tab-content";
-import { cn } from "#/utils/utils";
-import { RootState } from "#/store";
-
-interface ChatInterfaceWrapperProps {
-  isRightPanelShown: boolean;
-}
-
-export function ChatInterfaceWrapper({
-  isRightPanelShown,
-}: ChatInterfaceWrapperProps) {
-  if (!isRightPanelShown) {
-    return (
-      <div className="flex justify-center w-full h-full">
-        <div className="w-full max-w-[768px]">
-          <ChatInterface />
-        </div>
-      </div>
-    );
-  }
-
-  return <ChatInterface />;
-}
-
-export function ConversationMain() {
-  const { width } = useWindowSize();
-  const isRightPanelShown = useSelector(
-    (state: RootState) => state.conversation.isRightPanelShown,
-  );
-
-  if (width && width <= 1024) {
-    return (
-      <div className="flex flex-col gap-3 overflow-auto w-full">
-        <div
-          className={cn(
-            "overflow-hidden w-full bg-base min-h-[600px]",
-            !isRightPanelShown && "h-full",
-          )}
-        >
-          <ChatInterface />
-        </div>
-        {isRightPanelShown && (
-          <div className="h-full w-full min-h-[494px] flex flex-col gap-3">
-            <ConversationTabContent />
-          </div>
-        )}
-      </div>
-    );
-  }
-
-  if (isRightPanelShown) {
-    return (
-      <PanelGroup
-        direction="horizontal"
-        className="grow h-full min-h-0 min-w-0"
-        autoSaveId="react-resizable-panels:layout"
-      >
-        <Panel minSize={30} maxSize={80} className="overflow-hidden bg-base">
-          <ChatInterfaceWrapper isRightPanelShown={isRightPanelShown} />
-        </Panel>
-        <PanelResizeHandle className="cursor-ew-resize" />
-        <Panel
-          minSize={20}
-          maxSize={70}
-          className="flex flex-col overflow-hidden"
-        >
-          <div className="flex flex-col flex-1 gap-3">
-            <ConversationTabContent />
-          </div>
-        </Panel>
-      </PanelGroup>
-    );
-  }
-
-  return (
-    <div className="flex flex-col gap-3 overflow-auto w-full h-full">
-      <div className="overflow-hidden w-full h-full bg-base">
-        <ChatInterfaceWrapper isRightPanelShown={isRightPanelShown} />
-      </div>
-    </div>
-  );
-}
--- a/frontend/src/components/features/conversation/conversation-main/chat-interface-wrapper.tsx
+++ b/frontend/src/components/features/conversation/conversation-main/chat-interface-wrapper.tsx
@@ -0,0 +1,21 @@
+import { ChatInterface } from "../../chat/chat-interface";
+
+interface ChatInterfaceWrapperProps {
+  isRightPanelShown: boolean;
+}
+
+export function ChatInterfaceWrapper({
+  isRightPanelShown,
+}: ChatInterfaceWrapperProps) {
+  if (!isRightPanelShown) {
+    return (
+      <div className="flex justify-center w-full h-full">
+        <div className="w-full max-w-[768px]">
+          <ChatInterface />
+        </div>
+      </div>
+    );
+  }
+
+  return <ChatInterface />;
+}
--- a/frontend/src/components/features/conversation/conversation-main/conversation-main.tsx
+++ b/frontend/src/components/features/conversation/conversation-main/conversation-main.tsx
@@ -0,0 +1,18 @@
+import { useSelector } from "react-redux";
+import { useWindowSize } from "@uidotdev/usehooks";
+import { RootState } from "#/store";
+import { MobileLayout } from "./mobile-layout";
+import { DesktopLayout } from "./desktop-layout";
+
+export function ConversationMain() {
+  const { width } = useWindowSize();
+  const isRightPanelShown = useSelector(
+    (state: RootState) => state.conversation.isRightPanelShown,
+  );
+
+  if (width && width <= 1024) {
+    return <MobileLayout isRightPanelShown={isRightPanelShown} />;
+  }
+
+  return <DesktopLayout isRightPanelShown={isRightPanelShown} />;
+}
--- a/frontend/src/components/features/conversation/conversation-main/desktop-layout.tsx
+++ b/frontend/src/components/features/conversation/conversation-main/desktop-layout.tsx
@@ -0,0 +1,35 @@
+import { Panel, PanelGroup, PanelResizeHandle } from "react-resizable-panels";
+import { ChatInterfaceWrapper } from "./chat-interface-wrapper";
+import { ConversationTabContent } from "../conversation-tabs/conversation-tab-content/conversation-tab-content";
+
+interface DesktopLayoutProps {
+  isRightPanelShown: boolean;
+}
+
+export function DesktopLayout({ isRightPanelShown }: DesktopLayoutProps) {
+  return (
+    <PanelGroup
+      direction="horizontal"
+      className="grow h-full min-h-0 min-w-0"
+      autoSaveId="react-resizable-panels:layout"
+    >
+      <Panel minSize={30} maxSize={80} className="overflow-hidden bg-base">
+        <ChatInterfaceWrapper isRightPanelShown={isRightPanelShown} />
+      </Panel>
+      {isRightPanelShown && (
+        <>
+          <PanelResizeHandle className="cursor-ew-resize" />
+          <Panel
+            minSize={20}
+            maxSize={70}
+            className="flex flex-col overflow-hidden"
+          >
+            <div className="flex flex-col flex-1 gap-3">
+              <ConversationTabContent />
+            </div>
+          </Panel>
+        </>
+      )}
+    </PanelGroup>
+  );
+}
--- a/frontend/src/components/features/conversation/conversation-main/mobile-layout.tsx
+++ b/frontend/src/components/features/conversation/conversation-main/mobile-layout.tsx
@@ -0,0 +1,27 @@
+import { ChatInterface } from "../../chat/chat-interface";
+import { ConversationTabContent } from "../conversation-tabs/conversation-tab-content/conversation-tab-content";
+import { cn } from "#/utils/utils";
+
+interface MobileLayoutProps {
+  isRightPanelShown: boolean;
+}
+
+export function MobileLayout({ isRightPanelShown }: MobileLayoutProps) {
+  return (
+    <div className="flex flex-col gap-3 overflow-auto w-full">
+      <div
+        className={cn(
+          "overflow-hidden w-full bg-base min-h-[600px]",
+          !isRightPanelShown && "h-full",
+        )}
+      >
+        <ChatInterface />
+      </div>
+      {isRightPanelShown && (
+        <div className="h-full w-full min-h-[494px] flex flex-col gap-3">
+          <ConversationTabContent />
+        </div>
+      )}
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation/conversation-name.tsx
+++ b/frontend/src/components/features/conversation/conversation-name.tsx
@@ -12,7 +12,7 @@ import { SystemMessageModal } from "../conversation-panel/system-message-modal";
 import { MicroagentsModal } from "../conversation-panel/microagents-modal";
 import { ConfirmDeleteModal } from "../conversation-panel/confirm-delete-modal";
 import { ConfirmStopModal } from "../conversation-panel/confirm-stop-modal";
-import { MetricsModal } from "./metrics-modal";
+import { MetricsModal } from "./metrics-modal/metrics-modal";

 export function ConversationName() {
  const { t } = useTranslation();
--- a/frontend/src/components/features/conversation/metrics-modal.tsx
+++ b/frontend/src/components/features/conversation/metrics-modal.tsx
@@ -1,130 +0,0 @@
-import React from "react";
-import { useTranslation } from "react-i18next";
-import { useSelector } from "react-redux";
-import { BaseModal } from "../../shared/modals/base-modal/base-modal";
-import { BudgetDisplay } from "../conversation-panel/budget-display";
-import { RootState } from "#/store";
-import { I18nKey } from "#/i18n/declaration";
-
-interface MetricsModalProps {
-  isOpen: boolean;
-  onOpenChange: (open: boolean) => void;
-}
-
-export function MetricsModal({ isOpen, onOpenChange }: MetricsModalProps) {
-  const { t } = useTranslation();
-  const metrics = useSelector((state: RootState) => state.metrics);
-
-  return (
-    <BaseModal
-      isOpen={isOpen}
-      onOpenChange={onOpenChange}
-      title={t(I18nKey.CONVERSATION$METRICS_INFO)}
-      testID="metrics-modal"
-    >
-      <div className="space-y-4">
-        {(metrics?.cost !== null || metrics?.usage !== null) && (
-          <div className="rounded-md p-3">
-            <div className="grid gap-3">
-              {metrics?.cost !== null && (
-                <div className="flex justify-between items-center pb-2">
-                  <span className="text-lg font-semibold">
-                    {t(I18nKey.CONVERSATION$TOTAL_COST)}
-                  </span>
-                  <span className="font-semibold">
-                    ${metrics.cost.toFixed(4)}
-                  </span>
-                </div>
-              )}
-              <BudgetDisplay
-                cost={metrics?.cost ?? null}
-                maxBudgetPerTask={metrics?.max_budget_per_task ?? null}
-              />
-
-              {metrics?.usage !== null && (
-                <>
-                  <div className="flex justify-between items-center pb-2">
-                    <span>{t(I18nKey.CONVERSATION$INPUT)}</span>
-                    <span className="font-semibold">
-                      {metrics.usage.prompt_tokens.toLocaleString()}
-                    </span>
-                  </div>
-
-                  <div className="grid grid-cols-2 gap-2 pl-4 text-sm">
-                    <span className="text-neutral-400">
-                      {t(I18nKey.CONVERSATION$CACHE_HIT)}
-                    </span>
-                    <span className="text-right">
-                      {metrics.usage.cache_read_tokens.toLocaleString()}
-                    </span>
-                    <span className="text-neutral-400">
-                      {t(I18nKey.CONVERSATION$CACHE_WRITE)}
-                    </span>
-                    <span className="text-right">
-                      {metrics.usage.cache_write_tokens.toLocaleString()}
-                    </span>
-                  </div>
-
-                  <div className="flex justify-between items-center border-b border-neutral-700 pb-2">
-                    <span>{t(I18nKey.CONVERSATION$OUTPUT)}</span>
-                    <span className="font-semibold">
-                      {metrics.usage.completion_tokens.toLocaleString()}
-                    </span>
-                  </div>
-
-                  <div className="flex justify-between items-center border-b border-neutral-700 pb-2">
-                    <span className="font-semibold">
-                      {t(I18nKey.CONVERSATION$TOTAL)}
-                    </span>
-                    <span className="font-bold">
-                      {(
-                        metrics.usage.prompt_tokens +
-                        metrics.usage.completion_tokens
-                      ).toLocaleString()}
-                    </span>
-                  </div>
-
-                  <div className="flex flex-col gap-2">
-                    <div className="flex items-center justify-between">
-                      <span className="font-semibold">
-                        {t(I18nKey.CONVERSATION$CONTEXT_WINDOW)}
-                      </span>
-                    </div>
-                    <div className="w-full h-1.5 bg-neutral-700 rounded-full overflow-hidden">
-                      <div
-                        className="h-full bg-blue-500 transition-all duration-300"
-                        style={{
-                          width: `${Math.min(100, (metrics.usage.per_turn_token / metrics.usage.context_window) * 100)}%`,
-                        }}
-                      />
-                    </div>
-                    <div className="flex justify-end">
-                      <span className="text-xs text-neutral-400">
-                        {metrics.usage.per_turn_token.toLocaleString()} /{" "}
-                        {metrics.usage.context_window.toLocaleString()} (
-                        {(
-                          (metrics.usage.per_turn_token /
-                            metrics.usage.context_window) *
-                          100
-                        ).toFixed(2)}
-                        % {t(I18nKey.CONVERSATION$USED)})
-                      </span>
-                    </div>
-                  </div>
-                </>
-              )}
-            </div>
-          </div>
-        )}
-
-        {!metrics?.cost && !metrics?.usage && (
-          <div className="rounded-md p-4 text-center">
-            <p className="text-neutral-400">
-              {t(I18nKey.CONVERSATION$NO_METRICS)}
-            </p>
-          </div>
-        )}
-      </div>
-    </BaseModal>
-  );
-}
--- a/frontend/src/components/features/conversation/metrics-modal/context-window-section.tsx
+++ b/frontend/src/components/features/conversation/metrics-modal/context-window-section.tsx
@@ -0,0 +1,39 @@
+import { useTranslation } from "react-i18next";
+import { I18nKey } from "#/i18n/declaration";
+
+interface ContextWindowSectionProps {
+  perTurnToken: number;
+  contextWindow: number;
+}
+
+export function ContextWindowSection({
+  perTurnToken,
+  contextWindow,
+}: ContextWindowSectionProps) {
+  const { t } = useTranslation();
+
+  const usagePercentage = contextWindow > 0 ? (perTurnToken / contextWindow) * 100 : 0;
+  const progressWidth = Math.min(100, usagePercentage);
+
+  return (
+    <div className="flex flex-col gap-2">
+      <div className="flex items-center justify-between">
+        <span className="font-semibold">
+          {t(I18nKey.CONVERSATION$CONTEXT_WINDOW)}
+        </span>
+      </div>
+      <div className="w-full h-1.5 bg-neutral-700 rounded-full overflow-hidden">
+        <div
+          className="h-full bg-blue-500 transition-all duration-300"
+          style={{ width: `${progressWidth}%` }}
+        />
+      </div>
+      <div className="flex justify-end">
+        <span className="text-xs text-neutral-400">
+          {perTurnToken.toLocaleString()} / {contextWindow.toLocaleString()} (
+          {usagePercentage.toFixed(2)}% {t(I18nKey.CONVERSATION$USED)})
+        </span>
+      </div>
+    </div>
+  );
+}
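
The percentage and progress-bar math in `ContextWindowSection` can be checked outside React. The following is a minimal sketch, not part of the commit: `contextWindowUsage` is a hypothetical helper name, and the zero-window guard is an assumption about how an unset context window should be displayed.

```typescript
// Hypothetical helper mirroring the arithmetic in ContextWindowSection.
// The guard for contextWindow <= 0 is an assumption, added so the label
// never renders "NaN" or "Infinity".
interface ContextWindowUsage {
  usagePercentage: number; // raw percentage, may exceed 100
  progressWidth: number; // clamped to 100 for the progress bar
  label: string; // percentage formatted to two decimals
}

function contextWindowUsage(
  perTurnToken: number,
  contextWindow: number,
): ContextWindowUsage {
  const usagePercentage =
    contextWindow > 0 ? (perTurnToken / contextWindow) * 100 : 0;

  return {
    usagePercentage,
    progressWidth: Math.min(100, usagePercentage),
    label: `${usagePercentage.toFixed(2)}%`,
  };
}
```

The text label keeps the unclamped percentage, so an overflowing turn still reads as, say, 150.00% while the bar stays full.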
--- a/frontend/src/components/features/conversation/metrics-modal/cost-section.tsx
+++ b/frontend/src/components/features/conversation/metrics-modal/cost-section.tsx
@@ -0,0 +1,28 @@
+import { useTranslation } from "react-i18next";
+import { BudgetDisplay } from "../../conversation-panel/budget-display";
+import { I18nKey } from "#/i18n/declaration";
+
+interface CostSectionProps {
+  cost: number | null;
+  maxBudgetPerTask: number | null;
+}
+
+export function CostSection({ cost, maxBudgetPerTask }: CostSectionProps) {
+  const { t } = useTranslation();
+
+  if (cost === null) {
+    return null;
+  }
+
+  return (
+    <>
+      <div className="flex justify-between items-center pb-2">
+        <span className="text-lg font-semibold">
+          {t(I18nKey.CONVERSATION$TOTAL_COST)}
+        </span>
+        <span className="font-semibold">${cost.toFixed(4)}</span>
+      </div>
+      <BudgetDisplay cost={cost} maxBudgetPerTask={maxBudgetPerTask} />
+    </>
+  );
+}
--- a/frontend/src/components/features/conversation/metrics-modal/empty-state.tsx
+++ b/frontend/src/components/features/conversation/metrics-modal/empty-state.tsx
@@ -0,0 +1,12 @@
+import { useTranslation } from "react-i18next";
+import { I18nKey } from "#/i18n/declaration";
+
+export function EmptyState() {
+  const { t } = useTranslation();
+
+  return (
+    <div className="rounded-md p-4 text-center">
+      <p className="text-neutral-400">{t(I18nKey.CONVERSATION$NO_METRICS)}</p>
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation/metrics-modal/metric-row.tsx
+++ b/frontend/src/components/features/conversation/metrics-modal/metric-row.tsx
@@ -0,0 +1,22 @@
+import { ReactNode } from "react";
+
+interface MetricRowProps {
+  label: ReactNode;
+  value: ReactNode;
+  labelClassName?: string;
+  valueClassName?: string;
+}
+
+export function MetricRow({
+  label,
+  value,
+  labelClassName = "",
+  valueClassName = "font-semibold",
+}: MetricRowProps) {
+  return (
+    <div className="flex justify-between items-center border-b border-neutral-700 pb-2">
+      <span className={labelClassName}>{label}</span>
+      <span className={valueClassName}>{value}</span>
+    </div>
+  );
+}
--- a/frontend/src/components/features/conversation/metrics-modal/metrics-modal.tsx
+++ b/frontend/src/components/features/conversation/metrics-modal/metrics-modal.tsx
@@ -0,0 +1,52 @@
+import { useTranslation } from "react-i18next";
+import { BaseModal } from "../../../shared/modals/base-modal/base-modal";
+import { I18nKey } from "#/i18n/declaration";
+import { CostSection } from "./cost-section";
+import { UsageSection } from "./usage-section";
+import { ContextWindowSection } from "./context-window-section";
+import { EmptyState } from "./empty-state";
+import useMetricsStore from "#/stores/metrics-store";
+
+interface MetricsModalProps {
+  isOpen: boolean;
+  onOpenChange: (open: boolean) => void;
+}
+
+export function MetricsModal({ isOpen, onOpenChange }: MetricsModalProps) {
+  const { t } = useTranslation();
+  const metrics = useMetricsStore();
+
+  return (
+    <BaseModal
+      isOpen={isOpen}
+      onOpenChange={onOpenChange}
+      title={t(I18nKey.CONVERSATION$METRICS_INFO)}
+      testID="metrics-modal"
+    >
+      <div className="space-y-4">
+        {(metrics?.cost !== null || metrics?.usage !== null) && (
+          <div className="rounded-md p-3">
+            <div className="grid gap-3">
+              <CostSection
+                cost={metrics?.cost ?? null}
+                maxBudgetPerTask={metrics?.max_budget_per_task ?? null}
+              />
+
+              {metrics?.usage !== null && (
+                <>
+                  <UsageSection usage={metrics.usage} />
+                  <ContextWindowSection
+                    perTurnToken={metrics.usage.per_turn_token}
+                    contextWindow={metrics.usage.context_window}
+                  />
+                </>
+              )}
+            </div>
+          </div>
+        )}
+
+        {metrics?.cost === null && metrics?.usage === null && <EmptyState />}
+      </div>
+    </BaseModal>
+  );
+}
--- a/frontend/src/components/features/conversation/metrics-modal/usage-section.tsx
+++ b/frontend/src/components/features/conversation/metrics-modal/usage-section.tsx
@@ -0,0 +1,52 @@
+import { useTranslation } from "react-i18next";
+import { I18nKey } from "#/i18n/declaration";
+import { MetricRow } from "./metric-row";
+
+interface UsageSectionProps {
+  usage: {
+    prompt_tokens: number;
+    completion_tokens: number;
+    cache_read_tokens: number;
+    cache_write_tokens: number;
+  };
+}
+
+export function UsageSection({ usage }: UsageSectionProps) {
+  const { t } = useTranslation();
+
+  return (
+    <>
+      <MetricRow
+        label={t(I18nKey.CONVERSATION$INPUT)}
+        value={usage.prompt_tokens.toLocaleString()}
+      />
+
+      <div className="grid grid-cols-2 gap-2 pl-4 text-sm">
+        <span className="text-neutral-400">
+          {t(I18nKey.CONVERSATION$CACHE_HIT)}
+        </span>
+        <span className="text-right">
+          {usage.cache_read_tokens.toLocaleString()}
+        </span>
+        <span className="text-neutral-400">
+          {t(I18nKey.CONVERSATION$CACHE_WRITE)}
+        </span>
+        <span className="text-right">
+          {usage.cache_write_tokens.toLocaleString()}
+        </span>
+      </div>
+
+      <MetricRow
+        label={t(I18nKey.CONVERSATION$OUTPUT)}
+        value={usage.completion_tokens.toLocaleString()}
+      />
+
+      <MetricRow
+        label={t(I18nKey.CONVERSATION$TOTAL)}
+        value={(usage.prompt_tokens + usage.completion_tokens).toLocaleString()}
+        labelClassName="font-semibold"
+        valueClassName="font-bold"
+      />
+    </>
+  );
+}
--- a/frontend/src/components/features/terminal/terminal.tsx
+++ b/frontend/src/components/features/terminal/terminal.tsx
@@ -7,14 +7,11 @@ import { cn } from "#/utils/utils";
 import { WaitingForRuntimeMessage } from "../chat/waiting-for-runtime-message";

 function Terminal() {
-  const { commands } = useSelector((state: RootState) => state.cmd);
  const { curAgentState } = useSelector((state: RootState) => state.agent);

  const isRuntimeInactive = RUNTIME_INACTIVE_STATES.includes(curAgentState);

-  const ref = useTerminal({
-    commands,
-  });
+  const ref = useTerminal();

  return (
    <div className="h-full flex flex-col rounded-xl">
--- a/frontend/src/context/ws-client-provider.tsx
+++ b/frontend/src/context/ws-client-provider.tsx
@@ -263,8 +263,8 @@ export function WsClientProvider({
    }
    sio.io.opts.query = sio.io.opts.query || {};
    sio.io.opts.query.latest_event_id = lastEventRef.current?.id;
-    updateStatusWhenErrorMessagePresent(data);

+    updateStatusWhenErrorMessagePresent(data);
    setErrorMessage(hasValidMessageProperty(data) ? data.message : "");
  }

@@ -296,8 +296,29 @@ export function WsClientProvider({
    if (!conversationId) {
      throw new Error("No conversation ID provided");
    }
-    if (conversation?.status !== "RUNNING" && !conversation?.runtime_status) {
-      return () => undefined; // conversation not yet loaded
+
+    // Clear error messages when conversation is intentionally stopped
+    if (conversation && conversation.status === "STOPPED") {
+      removeErrorMessage();
+      setWebSocketStatus("DISCONNECTED");
+      return () => undefined; // conversation intentionally stopped
+    }
+
+    // Set connecting status when conversation is starting
+    if (conversation && conversation.status === "STARTING") {
+      removeErrorMessage();
+      setWebSocketStatus("CONNECTING");
+      return () => undefined; // conversation is starting, will connect when ready
+    }
+
+    // Only connect when conversation is fully loaded and running
+    if (
+      !conversation ||
+      conversation.status !== "RUNNING" ||
+      !conversation.runtime_status ||
+      conversation.runtime_status === "STATUS$STOPPED"
+    ) {
+      return () => undefined; // conversation not ready for WebSocket connection
    }

    let sio = sioRef.current;
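
The guard sequence added to `WsClientProvider` above is effectively a small decision table. A minimal sketch of that table as a pure function follows. The names `ConversationLike`, `ConnectionDecision`, and `decideConnection` are hypothetical and do not appear in the commit, and the conversation type is reduced to the two fields the guards read.

```typescript
// Hypothetical, reduced shape: only the fields the guards inspect.
interface ConversationLike {
  status: string; // e.g. "STARTING", "RUNNING", "STOPPED"
  runtime_status: string | null;
}

type ConnectionDecision =
  | "DISCONNECT" // conversation intentionally stopped
  | "SHOW_CONNECTING" // conversation is starting
  | "WAIT" // not loaded yet or runtime not ready
  | "CONNECT"; // safe to open the WebSocket

function decideConnection(
  conversation: ConversationLike | undefined,
): ConnectionDecision {
  // Same order as the guards in the effect: STOPPED, STARTING, then readiness.
  if (conversation?.status === "STOPPED") {
    return "DISCONNECT";
  }
  if (conversation?.status === "STARTING") {
    return "SHOW_CONNECTING";
  }
  if (
    !conversation ||
    conversation.status !== "RUNNING" ||
    !conversation.runtime_status ||
    conversation.runtime_status === "STATUS$STOPPED"
  ) {
    return "WAIT";
  }
  return "CONNECT";
}
```

In the effect itself, the first two outcomes also clear the error message and set the WebSocket status before returning a no-op cleanup. This sketch covers only the branching.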
--- a/frontend/src/hooks/chat/use-chat-input-events.ts
+++ b/frontend/src/hooks/chat/use-chat-input-events.ts
@@ -0,0 +1,99 @@
+import { useCallback } from "react";
+import { isMobileDevice } from "#/utils/utils";
+import {
+  ensureCursorVisible,
+  clearEmptyContent,
+} from "#/components/features/chat/utils/chat-input.utils";
+
+/**
+ * Hook for handling chat input events
+ */
+export const useChatInputEvents = (
+  chatInputRef: React.RefObject<HTMLDivElement | null>,
+  smartResize: () => void,
+  increaseHeightForEmptyContent: () => void,
+  checkIsContentEmpty: () => boolean,
+  clearEmptyContentHandler: () => void,
+  onFocus?: () => void,
+  onBlur?: () => void,
+) => {
+  // Handle input events
+  const handleInput = useCallback(() => {
+    smartResize();
+
+    // Clear empty content to ensure placeholder shows
+    if (chatInputRef.current) {
+      clearEmptyContent(chatInputRef.current);
+    }
+
+    // Ensure cursor stays visible when content is scrollable
+    ensureCursorVisible(chatInputRef.current);
+  }, [smartResize, chatInputRef]);
+
+  // Handle paste events to clean up formatting
+  const handlePaste = useCallback(
+    (e: React.ClipboardEvent) => {
+      e.preventDefault();
+
+      // Get plain text from clipboard
+      const text = e.clipboardData.getData("text/plain");
+
+      // Insert plain text
+      document.execCommand("insertText", false, text);
+
+      // Trigger resize
+      setTimeout(smartResize, 0);
+    },
+    [smartResize],
+  );
+
+  // Handle key events
+  const handleKeyDown = useCallback(
+    (e: React.KeyboardEvent, disabled: boolean, handleSubmit: () => void) => {
+      if (e.key !== "Enter") {
+        return;
+      }
+
+      if (checkIsContentEmpty()) {
+        e.preventDefault();
+        increaseHeightForEmptyContent();
+        return;
+      }
+
+      // Original submit logic - only for desktop without shift key
+      if (!isMobileDevice() && !e.shiftKey && !disabled) {
+        e.preventDefault();
+        handleSubmit();
+      }
+    },
+    [checkIsContentEmpty, increaseHeightForEmptyContent],
+  );
+
+  // Handle blur events to ensure placeholder shows when empty
+  const handleBlur = useCallback(() => {
+    // Clear empty content to ensure placeholder shows
+    if (chatInputRef.current) {
+      clearEmptyContent(chatInputRef.current);
+    }
+
+    // Call the original onBlur callback if provided
+    if (onBlur) {
+      onBlur();
+    }
+  }, [chatInputRef, onBlur]);
+
+  // Handle focus events
+  const handleFocus = useCallback(() => {
+    if (onFocus) {
+      onFocus();
+    }
+  }, [onFocus]);
+
+  return {
+    handleInput,
+    handlePaste,
+    handleKeyDown,
+    handleBlur,
+    handleFocus,
+  };
+};
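
The Enter-key branching in `handleKeyDown` can be read as a pure decision function, which makes it easy to test without a DOM. This is a sketch under assumptions: `EnterContext`, `EnterAction`, and `decideEnterAction` are hypothetical names, and the inputs stand in for `checkIsContentEmpty()`, `isMobileDevice()`, and the `disabled` argument.

```typescript
// Hypothetical inputs standing in for the values handleKeyDown reads.
interface EnterContext {
  key: string;
  shiftKey: boolean;
  isContentEmpty: boolean;
  isMobile: boolean;
  disabled: boolean;
}

type EnterAction =
  | "IGNORE" // not the Enter key, nothing to do
  | "GROW_EMPTY_INPUT" // Enter on empty content: prevent default, grow input
  | "SUBMIT" // desktop Enter without Shift: prevent default, submit
  | "DEFAULT_NEWLINE"; // let the browser insert a line break

function decideEnterAction(ctx: EnterContext): EnterAction {
  if (ctx.key !== "Enter") {
    return "IGNORE";
  }
  if (ctx.isContentEmpty) {
    return "GROW_EMPTY_INPUT";
  }
  if (!ctx.isMobile && !ctx.shiftKey && !ctx.disabled) {
    return "SUBMIT";
  }
  return "DEFAULT_NEWLINE";
}
```

On mobile, with Shift held, or while the input is disabled, the hook calls neither `preventDefault` nor `handleSubmit`, so the browser's default Enter behavior applies.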
--- a/frontend/src/hooks/chat/use-chat-input-logic.ts
+++ b/frontend/src/hooks/chat/use-chat-input-logic.ts
@@ -0,0 +1,59 @@
+import { useRef, useCallback, useEffect } from "react";
+import { useDispatch, useSelector } from "react-redux";
+import {
+  setMessageToSend,
+  setIsRightPanelShown,
+} from "#/state/conversation-slice";
+import { RootState } from "#/store";
+import {
+  isContentEmpty,
+  clearEmptyContent,
+  getTextContent,
+} from "#/components/features/chat/utils/chat-input.utils";
+
+/**
+ * Hook for managing chat input content logic
+ */
+export const useChatInputLogic = () => {
+  const chatInputRef = useRef<HTMLDivElement | null>(null);
+
+  const { messageToSend, hasRightPanelToggled } = useSelector(
+    (state: RootState) => state.conversation,
+  );
+
+  const dispatch = useDispatch();
+
+  // Save current input value when drawer state changes
+  useEffect(() => {
+    if (chatInputRef.current) {
+      const currentText = getTextContent(chatInputRef.current);
+      dispatch(setMessageToSend(currentText));
+      dispatch(setIsRightPanelShown(hasRightPanelToggled));
+    }
+  }, [hasRightPanelToggled, dispatch]);
+
+  // Helper function to check if contentEditable is truly empty
+  const checkIsContentEmpty = useCallback(
+    (): boolean => isContentEmpty(chatInputRef.current),
+    [],
+  );
+
+  // Helper function to properly clear contentEditable for placeholder display
+  const clearEmptyContentHandler = useCallback((): void => {
+    clearEmptyContent(chatInputRef.current);
+  }, []);
+
+  // Get current message text
+  const getCurrentMessage = useCallback(
+    (): string => getTextContent(chatInputRef.current),
+    [],
+  );
+
+  return {
+    chatInputRef,
+    messageToSend,
+    checkIsContentEmpty,
+    clearEmptyContentHandler,
+    getCurrentMessage,
+  };
+};
--- a/frontend/src/hooks/chat/use-chat-submission.ts
+++ b/frontend/src/hooks/chat/use-chat-submission.ts
@@ -0,0 +1,61 @@
+import { useCallback } from "react";
+import {
+  clearTextContent,
+  clearFileInput,
+} from "#/components/features/chat/utils/chat-input.utils";
+
+/**
+ * Hook for handling chat message submission
+ */
+export const useChatSubmission = (
+  chatInputRef: React.RefObject<HTMLDivElement | null>,
+  fileInputRef: React.RefObject<HTMLInputElement | null>,
+  smartResize: () => void,
+  onSubmit: (message: string) => void,
+) => {
+  // Send button click handler
+  const handleSubmit = useCallback(() => {
+    const message = chatInputRef.current?.innerText || "";
+    const trimmedMessage = message.trim();
+
+    if (!trimmedMessage) {
+      return;
+    }
+
+    onSubmit(message);
+
+    // Clear the input
+    clearTextContent(chatInputRef.current);
+    clearFileInput(fileInputRef.current);
+
+    // Reset height and show suggestions again
+    smartResize();
+  }, [chatInputRef, fileInputRef, smartResize, onSubmit]);
+
+  // Resume agent button click handler
+  const handleResumeAgent = useCallback(() => {
+    // Fall back to "continue" when the input is empty or whitespace-only
+    const message = chatInputRef.current?.innerText.trim() || "continue";
+    onSubmit(message);
+
+    // Clear the input
+    clearTextContent(chatInputRef.current);
+    clearFileInput(fileInputRef.current);
+
+    // Reset height and show suggestions again
+    smartResize();
+  }, [chatInputRef, fileInputRef, smartResize, onSubmit]);
+
+  // Handle stop button click
+  const handleStop = useCallback((onStop?: () => void) => {
+    if (onStop) {
+      onStop();
+    }
+  }, []);
+
+  return {
+    handleSubmit,
+    handleResumeAgent,
+    handleStop,
+  };
+};
--- a/frontend/src/hooks/chat/use-file-handling.ts
+++ b/frontend/src/hooks/chat/use-file-handling.ts
@@ -0,0 +1,103 @@
+import React, { useRef, useCallback, useState } from "react";
+
+interface UseFileHandlingReturn {
+  fileInputRef: React.RefObject<HTMLInputElement | null>;
+  chatContainerRef: React.RefObject<HTMLDivElement | null>;
+  isDragOver: boolean;
+  handleFileIconClick: (isDisabled: boolean) => void;
+  handleFileInputChange: (e: React.ChangeEvent<HTMLInputElement>) => void;
+  handleDragOver: (e: React.DragEvent, isDisabled: boolean) => void;
+  handleDragLeave: (e: React.DragEvent, isDisabled: boolean) => void;
+  handleDrop: (e: React.DragEvent, isDisabled: boolean) => void;
+}
+
+/**
+ * Hook for handling file operations (upload, drag & drop)
+ */
+export const useFileHandling = (
+  onFilesPaste?: (files: File[]) => void,
+): UseFileHandlingReturn => {
+  const fileInputRef = useRef<HTMLInputElement | null>(null);
+  const chatContainerRef = useRef<HTMLDivElement | null>(null);
+  const [isDragOver, setIsDragOver] = useState(false);
+
+  // Function to add files and notify parent
+  const addFiles = useCallback(
+    (files: File[]) => {
+      if (onFilesPaste && files.length > 0) {
+        onFilesPaste(files);
+      }
+    },
+    [onFilesPaste],
+  );
+
+  // File icon click handler
+  const handleFileIconClick = useCallback((isDisabled: boolean) => {
+    if (!isDisabled && fileInputRef.current) {
+      fileInputRef.current.click();
+    }
+  }, []);
+
+  // File input change handler
+  const handleFileInputChange = useCallback(
+    (e: React.ChangeEvent<HTMLInputElement>) => {
+      const files = Array.from(e.target.files || []);
+      addFiles(files);
+    },
+    [addFiles],
+  );
+
+  // Drag and drop event handlers
+  const handleDragOver = useCallback(
+    (e: React.DragEvent, isDisabled: boolean) => {
+      if (isDisabled) {
+        return;
+      }
+      e.preventDefault();
+      setIsDragOver(true);
+    },
+    [],
+  );
+
+  const handleDragLeave = useCallback(
+    (e: React.DragEvent, isDisabled: boolean) => {
+      if (
+        isDisabled ||
+        chatContainerRef.current?.contains(e.relatedTarget as Node)
+      ) {
+        return;
+      }
+
+      e.preventDefault();
+      setIsDragOver(false);
+    },
+    [],
+  );
+
+  const handleDrop = useCallback(
+    (e: React.DragEvent, isDisabled: boolean) => {
+      if (isDisabled) {
+        return;
+      }
+
+      e.preventDefault();
+
+      setIsDragOver(false);
+
+      const files = Array.from(e.dataTransfer.files);
+      addFiles(files);
+    },
+    [addFiles],
+  );
+
+  return {
+    fileInputRef,
+    chatContainerRef,
+    isDragOver,
+    handleFileIconClick,
+    handleFileInputChange,
+    handleDragOver,
+    handleDragLeave,
+    handleDrop,
+  };
+};
--- a/frontend/src/hooks/chat/use-grip-resize.ts
+++ b/frontend/src/hooks/chat/use-grip-resize.ts
@@ -0,0 +1,81 @@
+import { useRef, useState, useCallback } from "react";
+import { useDispatch } from "react-redux";
+import { useAutoResize } from "#/hooks/use-auto-resize";
+import {
+  IMessageToSend,
+  setShouldHideSuggestions,
+} from "#/state/conversation-slice";
+import { CHAT_INPUT } from "#/utils/constants";
+
+/**
+ * Hook for managing grip resize functionality
+ */
+export const useGripResize = (
+  chatInputRef: React.RefObject<HTMLDivElement | null>,
+  messageToSend: IMessageToSend | null,
+) => {
+  const gripRef = useRef<HTMLDivElement | null>(null);
+
+  const [isGripVisible, setIsGripVisible] = useState(false);
+
+  const dispatch = useDispatch();
+
+  // Drag state management callbacks
+  const handleDragStart = useCallback(() => {
+    // Keep grip visible during drag by adding a CSS class
+    if (gripRef.current) {
+      gripRef.current.classList.add("opacity-100");
+      gripRef.current.classList.remove("opacity-0");
+    }
+  }, []);
+
+  const handleDragEnd = useCallback(() => {
+    // Restore hover-based visibility
+    if (gripRef.current) {
+      gripRef.current.classList.remove("opacity-100");
+      gripRef.current.classList.add("opacity-0");
+    }
+  }, []);
+
+  // Handle click on top edge area to toggle grip visibility
+  const handleTopEdgeClick = useCallback((e: React.MouseEvent) => {
+    e.stopPropagation();
+    setIsGripVisible((prev) => !prev);
+  }, []);
+
+  // Callback to handle height changes and manage suggestions visibility
+  const handleHeightChange = useCallback(
+    (height: number) => {
+      // Hide suggestions when input height exceeds the threshold
+      const shouldHideChatSuggestions = height > CHAT_INPUT.HEIGHT_THRESHOLD;
+      dispatch(setShouldHideSuggestions(shouldHideChatSuggestions));
+    },
+    [dispatch],
+  );
+
+  // Use the auto-resize hook with height change callback
+  const {
+    smartResize,
+    handleGripMouseDown,
+    handleGripTouchStart,
+    increaseHeightForEmptyContent,
+  } = useAutoResize(chatInputRef as React.RefObject<HTMLElement | null>, {
+    minHeight: 20,
+    maxHeight: 400,
+    onHeightChange: handleHeightChange,
+    onGripDragStart: handleDragStart,
+    onGripDragEnd: handleDragEnd,
+    value: messageToSend ?? undefined,
+    enableManualResize: true,
+  });
+
+  return {
+    gripRef,
+    isGripVisible,
+    handleTopEdgeClick,
+    smartResize,
+    handleGripMouseDown,
+    handleGripTouchStart,
+    increaseHeightForEmptyContent,
+  };
+};
--- a/frontend/src/hooks/mutation/use-stop-conversation.ts
+++ b/frontend/src/hooks/mutation/use-stop-conversation.ts
@@ -1,8 +1,13 @@
 import { useMutation, useQueryClient } from "@tanstack/react-query";
+import { useNavigate, useParams } from "react-router";
 import ConversationService from "#/api/conversation-service/conversation-service.api";

 export const useStopConversation = () => {
  const queryClient = useQueryClient();
+  const navigate = useNavigate();
+  const { conversationId: currentConversationId } = useParams<{
+    conversationId: string;
+  }>();

  return useMutation({
    mutationFn: (variables: { conversationId: string }) =>
@@ -32,5 +37,14 @@ export const useStopConversation = () => {
      // Also invalidate the conversations list for consistency
      queryClient.invalidateQueries({ queryKey: ["user", "conversations"] });
    },
+    onSuccess: (_, variables) => {
+      // Only redirect if we're stopping the conversation we're currently viewing
+      if (
+        currentConversationId &&
+        variables.conversationId === currentConversationId
+      ) {
+        navigate("/");
+      }
+    },
  });
 };
--- a/frontend/src/hooks/use-conversation-name-context-menu.ts
+++ b/frontend/src/hooks/use-conversation-name-context-menu.ts
@@ -2,10 +2,9 @@ import { useTranslation } from "react-i18next";
 import React from "react";
 import posthog from "posthog-js";
 import { useParams, useNavigate } from "react-router";
-import { useSelector } from "react-redux";
 import { useWsClient } from "#/context/ws-client-provider";
 import { transformVSCodeUrl } from "#/utils/vscode-url-helper";
-import { RootState } from "#/store";
+import useMetricsStore from "#/stores/metrics-store";
 import { isSystemMessage } from "#/types/core/guards";
 import { ConversationStatus } from "#/types/conversation-status";
 import ConversationService from "#/api/conversation-service/conversation-service.api";
@@ -36,7 +35,7 @@ export function useConversationNameContextMenu({
  const { mutate: deleteConversation } = useDeleteConversation();
  const { mutate: stopConversation } = useStopConversation();
  const { mutate: getTrajectory } = useGetTrajectory();
-  const metrics = useSelector((state: RootState) => state.metrics);
+  const metrics = useMetricsStore();

  const [metricsModalVisible, setMetricsModalVisible] = React.useState(false);
  const [systemModalVisible, setSystemModalVisible] = React.useState(false);
--- a/frontend/src/hooks/use-terminal.ts
+++ b/frontend/src/hooks/use-terminal.ts
@@ -2,26 +2,18 @@ import { FitAddon } from "@xterm/addon-fit";
 import { Terminal } from "@xterm/xterm";
 import React from "react";
 import { useSelector } from "react-redux";
-import { Command } from "#/state/command-slice";
-import { RootState } from "#/store";
+import { Command, useCommandStore } from "#/state/command-store";
 import { RUNTIME_INACTIVE_STATES } from "#/types/agent-state";
 import { useWsClient } from "#/context/ws-client-provider";
 import { getTerminalCommand } from "#/services/terminal-service";
 import { parseTerminalOutput } from "#/utils/parse-terminal-output";
+import { RootState } from "#/store";

 /*
  NOTE: Tests for this hook are indirectly covered by the tests for the XTermTerminal component.
  The reason for this is that the hook exposes a ref that requires a DOM element to be rendered.
 */

-interface UseTerminalConfig {
-  commands: Command[];
-}
-
-const DEFAULT_TERMINAL_CONFIG: UseTerminalConfig = {
-  commands: [],
-};
-
 const renderCommand = (
  command: Command,
  terminal: Terminal,
@@ -44,11 +36,10 @@ const renderCommand = (
 // This ensures terminal history is preserved when navigating away and back
 const persistentLastCommandIndex = { current: 0 };

-export const useTerminal = ({
-  commands,
-}: UseTerminalConfig = DEFAULT_TERMINAL_CONFIG) => {
+export const useTerminal = () => {
  const { send } = useWsClient();
  const { curAgentState } = useSelector((state: RootState) => state.agent);
+  const commands = useCommandStore((state) => state.commands);
  const terminal = React.useRef<Terminal | null>(null);
  const fitAddon = React.useRef<FitAddon | null>(null);
  const ref = React.useRef<HTMLDivElement>(null);
--- a/frontend/src/i18n/declaration.ts
+++ b/frontend/src/i18n/declaration.ts
@@ -910,4 +910,5 @@ export enum I18nKey {
  COMMON$STOP_RUNTIME = "COMMON$STOP_RUNTIME",
  COMMON$START_RUNTIME = "COMMON$START_RUNTIME",
  COMMON$JUPYTER_EMPTY_MESSAGE = "COMMON$JUPYTER_EMPTY_MESSAGE",
+  COMMON$CONFIRMATION_MODE_ENABLED = "COMMON$CONFIRMATION_MODE_ENABLED",
 }
--- a/frontend/src/i18n/translation.json
+++ b/frontend/src/i18n/translation.json
@@ -14558,5 +14558,21 @@
    "tr": "Jupyter defteriniz boş. Gösterilecek hücre yok.",
    "de": "Ihr Jupyter-Notebook ist leer. Keine Zellen zum Anzeigen.",
    "uk": "Ваш Jupyter-ноутбук порожній. Немає клітинок для відображення."
+  },
+  "COMMON$CONFIRMATION_MODE_ENABLED": {
+    "en": "Confirmation mode enabled",
+    "ja": "確認モードが有効になっています",
+    "zh-CN": "已启用确认模式",
+    "zh-TW": "已啟用確認模式",
+    "ko-KR": "확인 모드가 활성화되었습니다",
+    "no": "Bekreftelsesmodus aktivert",
+    "it": "Modalità di conferma abilitata",
+    "pt": "Modo de confirmação ativado",
+    "es": "Modo de confirmación activado",
+    "ar": "تم تفعيل وضع التأكيد",
+    "fr": "Mode de confirmation activé",
+    "tr": "Onay modu etkinleştirildi",
+    "de": "Bestätigungsmodus aktiviert",
+    "uk": "Режим підтвердження увімкнено"
  }
 }
--- a/frontend/src/icons/lock.svg
+++ b/frontend/src/icons/lock.svg
@@ -0,0 +1,3 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none">
+<path fill-rule="evenodd" clip-rule="evenodd" d="M12 3.25C9.92893 3.25 8.25 4.92893 8.25 7V10H7C6.44772 10 6 10.4477 6 11V19C6 19.5523 6.44772 20 7 20H17C17.5523 20 18 19.5523 18 19V11C18 10.4477 17.5523 10 17 10H15.75V7C15.75 4.92893 14.0711 3.25 12 3.25ZM14.25 10V7C14.25 5.75736 13.2426 4.75 12 4.75C10.7574 4.75 9.75 5.75736 9.75 7V10H14.25Z" fill="currentColor"/>
+</svg>
--- a/frontend/src/routes/conversation.tsx
+++ b/frontend/src/routes/conversation.tsx
@@ -1,9 +1,10 @@
 import React from "react";
 import { useNavigate } from "react-router";
 import { useDispatch } from "react-redux";
+import { useQueryClient } from "@tanstack/react-query";

 import { useConversationId } from "#/hooks/use-conversation-id";
-import { clearTerminal } from "#/state/command-slice";
+import { useCommandStore } from "#/state/command-store";
 import { useEffectOnce } from "#/hooks/use-effect-once";
 import { clearJupyter } from "#/state/jupyter-slice";
 import { resetConversationState } from "#/state/conversation-slice";
@@ -19,25 +20,28 @@ import { useActiveConversation } from "#/hooks/query/use-active-conversation";

 import { displayErrorToast } from "#/utils/custom-toast-handlers";
 import { useDocumentTitleFromState } from "#/hooks/use-document-title-from-state";
-import ConversationService from "#/api/conversation-service/conversation-service.api";
 import { useIsAuthed } from "#/hooks/query/use-is-authed";
 import { ConversationSubscriptionsProvider } from "#/context/conversation-subscriptions-provider";
 import { useUserProviders } from "#/hooks/use-user-providers";

-import { ConversationMain } from "#/components/features/conversation/conversation-main";
+import { ConversationMain } from "#/components/features/conversation/conversation-main/conversation-main";
 import { ConversationName } from "#/components/features/conversation/conversation-name";

 import { ConversationTabs } from "#/components/features/conversation/conversation-tabs/conversation-tabs";
+import { useStartConversation } from "#/hooks/mutation/use-start-conversation";

 function AppContent() {
  useConversationConfig();

  const { conversationId } = useConversationId();
  const { data: conversation, isFetched, refetch } = useActiveConversation();
+  const { mutate: startConversation } = useStartConversation();
  const { data: isAuthed } = useIsAuthed();
  const { providers } = useUserProviders();
  const dispatch = useDispatch();
  const navigate = useNavigate();
+  const clearTerminal = useCommandStore((state) => state.clearTerminal);
+  const queryClient = useQueryClient();

  // Fetch batch feedback data when conversation is loaded
  useBatchFeedback();
@@ -45,6 +49,13 @@ function AppContent() {
  // Set the document title to the conversation title when available
  useDocumentTitleFromState();

+  // Force fresh conversation data when navigating to prevent stale cache issues
+  React.useEffect(() => {
+    queryClient.invalidateQueries({
+      queryKey: ["user", "conversation", conversationId],
+    });
+  }, [conversationId, queryClient]);
+
  React.useEffect(() => {
    if (isFetched && !conversation && isAuthed) {
      displayErrorToast(
@@ -52,23 +63,35 @@ function AppContent() {
      );
      navigate("/");
    } else if (conversation?.status === "STOPPED") {
-      // start the conversation if the state is stopped on initial load
-      ConversationService.startConversation(
-        conversation.conversation_id,
-        providers,
-      ).then(() => refetch());
+      // If conversation is STOPPED, attempt to start it
+      startConversation(
+        { conversationId: conversation.conversation_id, providers },
+        {
+          onError: (error) => {
+            displayErrorToast(`Failed to start conversation: ${error.message}`);
+            // Refetch the conversation to ensure UI consistency
+            refetch();
+          },
+        },
+      );
    }
-  }, [conversation?.conversation_id, isFetched, isAuthed, providers]);
+  }, [
+    conversation?.conversation_id,
+    conversation?.status,
+    isFetched,
+    isAuthed,
+    providers,
+  ]);

  React.useEffect(() => {
-    dispatch(clearTerminal());
+    clearTerminal();
    dispatch(clearJupyter());
    dispatch(resetConversationState());
    dispatch(setCurrentAgentState(AgentState.LOADING));
-  }, [conversationId]);
+  }, [conversationId, clearTerminal]);

  useEffectOnce(() => {
-    dispatch(clearTerminal());
+    clearTerminal();
    dispatch(clearJupyter());
    dispatch(resetConversationState());
    dispatch(setCurrentAgentState(AgentState.LOADING));
--- a/frontend/src/services/tests/actions.test.ts
+++ b/frontend/src/services/tests/actions.test.ts
@@ -3,7 +3,7 @@ import { handleStatusMessage } from "../actions";
 import { StatusMessage } from "#/types/message";
 import { queryClient } from "#/query-client-config";
 import store from "#/store";
-import { setCurStatusMessage } from "#/state/status-slice";
+import { useStatusStore } from "#/state/status-store";
 import { trackError } from "#/utils/error-handler";

 // Mock dependencies
@@ -19,12 +19,12 @@ vi.mock("#/store", () => ({
  },
 }));

-vi.mock("#/state/status-slice", () => ({
-  setCurStatusMessage: vi.fn(),
-}));
-
-vi.mock("#/state/chat-slice", () => ({
-  addErrorMessage: vi.fn(),
+vi.mock("#/state/status-store", () => ({
+  useStatusStore: {
+    getState: vi.fn(() => ({
+      setCurStatusMessage: vi.fn(),
+    })),
+  },
 }));

 vi.mock("#/utils/error-handler", () => ({
@@ -61,7 +61,7 @@ describe("handleStatusMessage", () => {
    expect(store.dispatch).not.toHaveBeenCalled();
  });

-  it("should dispatch setCurStatusMessage for info messages without conversation_title", () => {
+  it("should call setCurStatusMessage for info messages without conversation_title", () => {
    // Create a status message without a conversation title
    const statusMessage: StatusMessage = {
      status_update: true,
@@ -69,19 +69,28 @@ describe("handleStatusMessage", () => {
      message: "Some info message",
    };

+    const mockSetCurStatusMessage = vi.fn();
+    vi.mocked(useStatusStore.getState).mockReturnValue({
+      setCurStatusMessage: mockSetCurStatusMessage,
+      curStatusMessage: {
+        status_update: true,
+        type: "info",
+        id: "",
+        message: "",
+      },
+    });
+
    // Call the function
    handleStatusMessage(statusMessage);

-    // Verify that store.dispatch was called with setCurStatusMessage
-    expect(store.dispatch).toHaveBeenCalledWith(
-      setCurStatusMessage(statusMessage),
-    );
+    // Verify that setCurStatusMessage was called with the correct message
+    expect(mockSetCurStatusMessage).toHaveBeenCalledWith(statusMessage);

    // Verify that queryClient.invalidateQueries was not called
    expect(queryClient.invalidateQueries).not.toHaveBeenCalled();
  });

-  it("should dispatch addErrorMessage for error messages", () => {
+  it("should call trackError for error messages", () => {
    // Create an error status message
    const statusMessage: StatusMessage = {
      status_update: true,
@@ -100,6 +109,9 @@ describe("handleStatusMessage", () => {
      metadata: { msgId: "ERROR_ID" },
    });

+    // Verify that store.dispatch was not called
+    expect(store.dispatch).not.toHaveBeenCalled();
+
    // Verify that queryClient.invalidateQueries was not called
    expect(queryClient.invalidateQueries).not.toHaveBeenCalled();
  });
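Reviewer note: the hunks above replace Redux slice imports (`#/state/command-slice`, `#/state/status-slice`) with store hooks that are read in two ways: through a selector inside components (`useCommandStore((state) => state.commands)`) and through `getState()` outside React (as mocked in `actions.test.ts`). The sketch below illustrates that store shape only. It is a hand-rolled stand-in, not zustand and not the project's `command-store`. `createMiniStore`, the `Command` fields, `appendInput`, and `appendOutput` are hypothetical names chosen for illustration. Only `commands` and `clearTerminal` appear in the diff.

```typescript
// Minimal stand-in for a store exposing getState/setState/subscribe.
// Assumption: this mimics the general shape of such stores; it is not zustand.
type Listener<T> = (state: T) => void;

interface MiniStore<T> {
  getState: () => T;
  setState: (partial: Partial<T>) => void;
  subscribe: (listener: Listener<T>) => () => void;
}

function createMiniStore<T extends object>(
  init: (set: (partial: Partial<T>) => void, get: () => T) => T,
): MiniStore<T> {
  let state: T;
  const listeners = new Set<Listener<T>>();

  const getState = (): T => state;

  // Shallow-merge the partial update, then notify every subscriber.
  const setState = (partial: Partial<T>): void => {
    state = { ...state, ...partial };
    listeners.forEach((listener) => listener(state));
  };

  // Returns an unsubscribe function.
  const subscribe = (listener: Listener<T>): (() => void) => {
    listeners.add(listener);
    return () => {
      listeners.delete(listener);
    };
  };

  state = init(setState, getState);
  return { getState, setState, subscribe };
}

// Hypothetical command shape; the real `Command` type is not shown in this diff.
interface Command {
  content: string;
  type: "input" | "output";
}

interface CommandState {
  commands: Command[];
  appendInput: (content: string) => void; // hypothetical action
  appendOutput: (content: string) => void; // hypothetical action
  clearTerminal: () => void; // name taken from the diff
}

const commandStore = createMiniStore<CommandState>((set, get) => ({
  commands: [],
  appendInput: (content) =>
    set({ commands: [...get().commands, { content, type: "input" }] }),
  appendOutput: (content) =>
    set({ commands: [...get().commands, { content, type: "output" }] }),
  clearTerminal: () => set({ commands: [] }),
}));
```

Because actions live on the state object, non-React code can call `commandStore.getState().clearTerminal()` directly, with no `dispatch`. This is why the updated test asserts on a mocked `setCurStatusMessage` rather than on `store.dispatch`.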
Author	SHA1	Message	Date
chuckbutkus	7c556d6396	Merge branch 'main' into chuck-build	2025-09-23 14:25:16 -04:00
openhands	8bb5aa21b9	test	2025-09-23 14:19:20 -04:00
BenYao21	d3d70fcc60	issue #9388 , this will fix the issue (#10450 ) Co-authored-by: mamoodi <mamoodiha@gmail.com> Co-authored-by: Graham Neubig <neubig@gmail.com>	2025-09-22 16:56:53 -04:00
Xinyi He	7906eab6b1	Add inference generation of SWE-Perf Benchmark (#10246 ) Co-authored-by: mamoodi <mamoodiha@gmail.com> Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: openhands <openhands@all-hands.dev>	2025-09-22 20:35:30 +00:00
juanmichelini	547e1049f1	Multi swe gym (#10605 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-09-22 15:56:26 -04:00
mamoodi	818cc60b52	New label for not going stale (#11069 )	2025-09-22 11:53:47 -04:00
Robert Brennan	431d2c1f43	security: upgrade setuptools to >=78.1.1 to address CVE-2025-47273 and CVE-2024-6345 (#11038 ) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: enyst <engel.nyst@gmail.com>	2025-09-22 04:05:45 +00:00
Engel Nyst	07f23641a3	build(deps): pin litellm to avoid build failure (#11054 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-09-22 03:54:37 +02:00
Hiep Le	de84af5586	feat(frontend): display lock icon when confirmation mode is enabled (#11030 )	2025-09-20 10:55:19 +07:00
Hiep Le	b7765ba3f7	refactor(frontend): fix typecheck (#11037 )	2025-09-19 13:43:00 -04:00
Hiep Le	b89f2e51e4	refactor(frontend): migration of metrics-slice.ts to zustand (#11018 )	2025-09-19 23:52:21 +07:00
mamoodi	e09f93aa75	Release 0.57.0 (#10981 ) Co-authored-by: Ray Myers <ray.myers@gmail.com> Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: sp.wack <83104063+amanape@users.noreply.github.com> Co-authored-by: Rohit Malhotra <rohitvinodmalhotra@gmail.com>	2025-09-19 12:40:56 -04:00
Hiep Le	9f529b105a	refactor(frontend): migration of command-slice.ts to zustand (#11003 )	2025-09-19 23:33:59 +07:00
Graham Neubig	89e3d2a867	Improve OpenHands provider pricing documentation (#10974 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-09-20 00:22:44 +08:00
Hiep Le	a7b9a4f291	refactor(frontend): migration of status-slice.ts to zustand (#11017 )	2025-09-19 22:27:55 +07:00
Hiep Le	88cd16ae21	refactor(frontend): migration of initial-query-slice.ts to zustand (#11020 )	2025-09-19 22:27:20 +07:00
Hiep Le	a8a3e9e604	refactor(frontend): remove the code-slice.ts file (#11021 )	2025-09-19 21:22:29 +07:00
Hiep Le	0061bcc0b0	refactor(frontend): custom chat input (#10984 )	2025-09-19 21:06:18 +07:00
Hiep Le	9c9fa780b0	refactor(frontend): task tracking observation content (#11002 )	2025-09-19 20:03:05 +07:00
Alona	569ac16163	Improve token refresh error logging (#11026 )	2025-09-19 14:18:38 +07:00
openhands	08096db29f	test	2025-09-18 22:50:21 -04:00
openhands	b2b6ddf90c	test	2025-09-18 22:24:35 -04:00
openhands	87fe36d811	test	2025-09-18 21:44:34 -04:00
openhands	39d255d313	test	2025-09-18 21:27:03 -04:00
openhands	e334b67f21	Add logging	2025-09-18 20:48:24 -04:00
Robert Brennan	46f7738f41	Update Python packages to latest versions (#11023 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-09-18 19:52:46 +00:00
Rohit Malhotra	3f3669dd34	Hotfix: rm model choice override (#11022 )	2025-09-18 14:40:06 -04:00
sp.wack	cd65645eea	Hide Tavily search API key help text in SaaS mode (#11014 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-09-18 16:40:29 +00:00
Robert Brennan	8e88a7a277	fix: resolve critical and high CVEs in enterprise Docker image (#10987 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-09-18 11:25:33 -04:00
Hiep Le	b393d52439	refactor(frontend): conversation main (#10985 )	2025-09-18 20:23:13 +07:00
Hiep Le	faeec48365	refactor(frontend): conversation card (#10986 )	2025-09-18 20:22:59 +07:00
chuckbutkus	d5c02bf87b	Merge branch 'main' into allow-custom-user	2025-09-17 22:43:30 -04:00
openhands	14a4664fe8	Make su commands optional	2025-09-17 22:40:21 -04:00
sp.wack	774caf0607	feat: refactor status indicators (#10983 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-09-17 22:32:55 +04:00
chuckbutkus	3a7df33acf	Merge branch 'main' into test-user	2025-09-17 14:02:52 -04:00
sp.wack	7222730df0	Fix SaaS callback URLs and pro pill positioning (#10998 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-09-17 16:56:02 +00:00
Hiep Le	910177fc57	refactor(frontend): system message modal (#10969 )	2025-09-17 21:56:14 +07:00
Hiep Le	ac9badbd20	refactor(frontend): metrics modal (#10968 )	2025-09-17 21:55:25 +07:00
Ray Myers	02c299d88f	Fix Slack resolver failing on AWAITING_USER_INPUT state (#10992 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-09-17 09:20:12 -05:00
mamoodi	f65fbef649	Remove runtime settings (#10996 )	2025-09-17 13:59:29 +00:00
Hiep Le	3c2acad28d	refactor(frontend): microagents modal (#10970 )	2025-09-16 22:32:23 +07:00
chuckbutkus	69fddecc7f	Merge branch 'main' into test-user	2025-09-07 21:55:39 -04:00
Chuck Butkus	3afe5ccee5	Add Logging	2025-09-05 20:52:48 -04:00
chuckbutkus	3d5a8dcf5a	Merge branch 'main' into test-user	2025-09-05 14:20:10 -04:00
Chuck Butkus	2ee1abe22c	Lint fix	2025-09-05 13:16:03 -04:00
Chuck Butkus	148940f553	Added logging around alive checks	2025-09-05 11:10:57 -04:00
Chuck Butkus	1f09296136	Fix username checks	2025-09-03 21:40:13 -04:00
Chuck Butkus	70e5d12ba9	Revert "Change to a non-login shell" This reverts commit `bcb3160d95`.	2025-08-29 01:48:47 -04:00
Chuck Butkus	bcb3160d95	Change to a non-login shell	2025-08-29 01:37:02 -04:00
Chuck Butkus	174c691744	Update	2025-08-28 02:25:05 -04:00
Chuck Butkus	af34d446e9	Remove vscode username restriction	2025-08-28 02:22:27 -04:00
Chuck Butkus	6604924f76	Fix bash username	2025-08-28 02:21:41 -04:00
chuckbutkus	b2def1e438	Merge branch 'main' into test-user	2025-08-27 23:33:45 -04:00
Chuck Butkus	2b8e47aca9	Add runtime user env vars	2025-08-27 23:02:39 -04:00
Chuck Butkus	dba8b28824	Logging	2025-08-27 21:30:47 -04:00