Merge branch 'main' into fix-git-coauthorship-cli-runtime

Evaluation: redirect sessions to repo-local .eval_sessions via helper; apply across entrypoints; add tests (#10540)
Co-authored-by: openhands <openhands@all-hands.dev>
2026-04-29 03:00:45 -04:00 · 2025-08-22 09:41:09 -04:00 · 2025-08-22 13:34:02 +00:00 · 2025-08-22 09:09:45 -04:00 · 2025-08-22 05:00:11 +08:00 · 2025-08-21 19:03:20 +02:00
256 changed files with 21309 additions and 15916 deletions
--- a/.github/workflows/e2e-tests.yml
+++ b/.github/workflows/e2e-tests.yml
@@ -22,7 +22,7 @@ jobs:
        uses: actions/checkout@v4

      - name: Install poetry via pipx
-        uses: abatilo/actions-poetry@v3
+        uses: abatilo/actions-poetry@v4
        with:
          poetry-version: 2.1.3

--- a/.github/workflows/py-tests.yml
+++ b/.github/workflows/py-tests.yml
@@ -73,7 +73,7 @@ jobs:
      - name: Install Python dependencies using Poetry
        run: poetry install --with dev,test,runtime
      - name: Run Windows unit tests
-        run: poetry run pytest -svv tests/unit/test_windows_bash.py
+        run: poetry run pytest -svv tests/unit/runtime/utils/test_windows_bash.py
        env:
          PYTHONPATH: ".;$env:PYTHONPATH"
          DEBUG: "1"
--- a/.github/workflows/welcome-good-first-issue.yml
+++ b/.github/workflows/welcome-good-first-issue.yml
@@ -0,0 +1,50 @@
+name: Welcome Good First Issue
+
+on:
+  issues:
+    types: [labeled]
+
+permissions:
+  issues: write
+
+jobs:
+  comment-on-good-first-issue:
+    if: github.event.label.name == 'good first issue'
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check if welcome comment already exists
+        id: check_comment
+        uses: actions/github-script@v7
+        with:
+          result-encoding: string
+          script: |
+            const issueNumber = context.issue.number;
+            const comments = await github.paginate(github.rest.issues.listComments, {
+              ...context.repo,
+              issue_number: issueNumber
+            });
+
+            const alreadyCommented = comments.some(
+              (comment) =>
+                comment.body.includes('<!-- auto-comment:good-first-issue -->')
+            );
+
+            return alreadyCommented ? 'true' : 'false';
+
+      - name: Leave welcome comment
+        if: steps.check_comment.outputs.result == 'false'
+        uses: actions/github-script@v7
+        with:
+          script: |
+            const repoUrl = `https://github.com/${context.repo.owner}/${context.repo.repo}`;
+
+            await github.rest.issues.createComment({
+              ...context.repo,
+              issue_number: context.issue.number,
+              body: "🙌 **Hey there, future contributor!** 🙌\n\n" +
+                    "This issue has been labeled as **good first issue**, which means it's a great place to get started with the OpenHands project.\n\n" +
+                    "If you're interested in working on it, feel free to! No need to ask for permission.\n\n" +
+                    "Be sure to check out our [development setup guide](" + repoUrl + "/blob/main/Development.md) to get your environment set up, and follow our [contribution guidelines](" + repoUrl + "/blob/main/CONTRIBUTING.md) when you're ready to submit a fix.\n\n" +
+                    "🙌 Happy hacking! 🙌\n\n" +
+                    "<!-- auto-comment:good-first-issue -->"
+            });
--- a/.gitignore
+++ b/.gitignore
@@ -257,3 +257,5 @@ containers/runtime/code

 # test results
 test-results
+
+.eval_sessions
--- a/.openhands/microagents/repo.md
+++ b/.openhands/microagents/repo.md
@@ -144,6 +144,35 @@ Your specialized knowledge and instructions here...
     - Add the setting to the `Settings` model in `openhands/storage/data_models/settings.py`
     - Update any relevant backend code to apply the setting (e.g., in session creation)

+#### Settings UI Patterns:
+
+There are two main patterns for saving settings in the OpenHands frontend:
+
+**Pattern 1: Entity-based Resources (Immediate Save)**
+- Used for: API Keys, Secrets, MCP Servers
+- Behavior: Changes are saved immediately when the user performs an action (add/edit/delete)
+- Implementation:
+  - No "Save Changes" button
+  - No local state management or `isDirty` tracking
+  - Uses dedicated mutation hooks for each operation (e.g., `use-add-mcp-server.ts`, `use-delete-mcp-server.ts`)
+  - Each mutation triggers an immediate API call and invalidates the relevant queries so the UI refreshes
+  - Example: MCP settings, API Keys & Secrets tabs
+- Benefits: Simpler UX, no risk of losing changes, consistent with modern web app patterns
+
+**Pattern 2: Form-based Settings (Manual Save)**
+- Used for: Application settings, LLM configuration
+- Behavior: Changes are accumulated locally and saved when the user clicks "Save Changes"
+- Implementation:
+  - Has "Save Changes" button that becomes enabled when changes are detected
+  - Uses local state management with `isDirty` tracking
+  - Uses `useSaveSettings` hook to save all changes at once
+  - Example: LLM tab, Application tab
+- Benefits: Allows bulk changes, explicit save action, can validate all fields before saving
+
+**When to use each pattern:**
+- Use Pattern 1 (Immediate Save) for entity management where each item is independent
+- Use Pattern 2 (Manual Save) for configuration forms where settings are interdependent or need validation
+
 ### Adding New LLM Models

 To add a new LLM model to OpenHands, you need to update multiple files across both frontend and backend:
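The two settings-save patterns described in the repo.md hunk above differ mainly in where state lives. The frontend itself is TypeScript/React; the Python sketch below models only the state logic, and `SettingsForm`, `add_entity`, and the `persist` callback are hypothetical names, not the frontend's `useSaveSettings` hook or mutation hooks.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Persist = Callable[[dict[str, Any]], None]


@dataclass
class SettingsForm:
    """Hypothetical model of Pattern 2 (manual save)."""

    saved: dict[str, Any]
    draft: dict[str, Any] = field(init=False)

    def __post_init__(self) -> None:
        # The draft starts as a copy of the last saved values.
        self.draft = dict(self.saved)

    def set(self, key: str, value: Any) -> None:
        self.draft[key] = value

    @property
    def is_dirty(self) -> bool:
        # "Save Changes" is enabled only while the draft differs from the baseline.
        return self.draft != self.saved

    def save(self, persist: Persist) -> None:
        # One request carries every field, then the baseline is reset.
        persist(dict(self.draft))
        self.saved = dict(self.draft)


def add_entity(store: dict[str, Any], name: str, value: Any, persist: Persist) -> None:
    """Hypothetical model of Pattern 1 (immediate save): no draft, no is_dirty."""
    store[name] = value
    persist({'op': 'add', 'name': name, 'value': value})
```

Because `is_dirty` compares values rather than recording that an edit happened, setting a field back to its saved value disables the save button again.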
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -0,0 +1,47 @@
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v5.0.0
+    hooks:
+      - id: trailing-whitespace
+        exclude: ^(docs/|modules/|python/|openhands-ui/|third_party/)
+      - id: end-of-file-fixer
+        exclude: ^(docs/|modules/|python/|openhands-ui/|third_party/)
+      - id: check-yaml
+        args: ["--allow-multiple-documents"]
+      - id: debug-statements
+
+  - repo: https://github.com/tox-dev/pyproject-fmt
+    rev: v2.5.1
+    hooks:
+      - id: pyproject-fmt
+  - repo: https://github.com/abravalheri/validate-pyproject
+    rev: v0.24.1
+    hooks:
+      - id: validate-pyproject
+
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    # Ruff version.
+    rev: v0.11.8
+    hooks:
+      # Run the linter.
+      - id: ruff
+        entry: ruff check --config dev_config/python/ruff.toml
+        types_or: [python, pyi, jupyter]
+        args: [--fix, --unsafe-fixes]
+        exclude: third_party/
+      # Run the formatter.
+      - id: ruff-format
+        entry: ruff format --config dev_config/python/ruff.toml
+        types_or: [python, pyi, jupyter]
+        exclude: third_party/
+
+  - repo: https://github.com/pre-commit/mirrors-mypy
+    rev: v1.15.0
+    hooks:
+      - id: mypy
+        additional_dependencies:
+          [types-requests, types-setuptools, types-pyyaml, types-toml, types-docker, types-Markdown, pydantic, lxml]
+        # To see gaps add `--html-report mypy-report/`
+        entry: mypy --config-file dev_config/python/mypy.ini openhands/
+        always_run: true
+        pass_filenames: false
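The `exclude` values in the new `.pre-commit-config.yaml` are Python regular expressions that pre-commit searches against each repo-relative, `/`-separated path. The sketch below shows why the anchored pattern on the whitespace hooks and the unanchored `third_party/` on the ruff hooks behave differently; `is_excluded` is a hypothetical helper for illustration.

```python
import re

# Anchored pattern used by trailing-whitespace and end-of-file-fixer above.
HOOK_EXCLUDE = re.compile(r'^(docs/|modules/|python/|openhands-ui/|third_party/)')
# Unanchored pattern used by the ruff hooks above.
RUFF_EXCLUDE = re.compile(r'third_party/')


def is_excluded(pattern: re.Pattern[str], path: str) -> bool:
    # A match anywhere in the path excludes the file, hence the '^' anchor matters.
    return pattern.search(path) is not None


if __name__ == '__main__':
    print(is_excluded(HOOK_EXCLUDE, 'evaluation/third_party/x.py'))  # False
    print(is_excluded(RUFF_EXCLUDE, 'evaluation/third_party/x.py'))  # True
```

So a nested `third_party/` directory is still checked for trailing whitespace but skipped by ruff.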
--- a/Development.md
+++ b/Development.md
@@ -159,7 +159,7 @@ poetry run pytest ./tests/unit/test_*.py
 To reduce build time (e.g., if no changes were made to the client-runtime component), you can use an existing Docker
 container image by setting the SANDBOX_RUNTIME_CONTAINER_IMAGE environment variable to the desired Docker image.

-Example: `export SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/all-hands-ai/runtime:0.53-nikolaik`
+Example: `export SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/all-hands-ai/runtime:0.54-nikolaik`

 ## Develop inside Docker container

--- a/README.md
+++ b/README.md
@@ -79,17 +79,17 @@ You'll find OpenHands running at [http://localhost:3000](http://localhost:3000)
 You can also run OpenHands directly with Docker:

 ```bash
-docker pull docker.all-hands.dev/all-hands-ai/runtime:0.53-nikolaik
+docker pull docker.all-hands.dev/all-hands-ai/runtime:0.54-nikolaik

 docker run -it --rm --pull=always \
-    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.53-nikolaik \
+    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.54-nikolaik \
    -e LOG_ALL_EVENTS=true \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ~/.openhands:/.openhands \
    -p 3000:3000 \
    --add-host host.docker.internal:host-gateway \
    --name openhands-app \
-    docker.all-hands.dev/all-hands-ai/openhands:0.53
+    docker.all-hands.dev/all-hands-ai/openhands:0.54
 ```

 </details>
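Every Docker snippet bumped in this release pins the app image (`openhands:0.54`) and the runtime image (`runtime:0.54-nikolaik`) to the same version. A release script could verify that pairing with a check like the hypothetical one below. It assumes runtime tags follow the `<version>-nikolaik` shape seen in these docs and does not handle digest (`@sha256:`) references.

```python
def image_tag(ref: str) -> str:
    """Return the tag of an image reference, or 'latest' when none is given."""
    # Only look after the last '/', so a registry port such as 'host:5000/' is ignored.
    name = ref.rsplit('/', 1)[-1]
    return name.split(':', 1)[1] if ':' in name else 'latest'


def versions_match(app_ref: str, runtime_ref: str) -> bool:
    """Check that '<version>-nikolaik' on the runtime matches the app '<version>'."""
    return image_tag(runtime_ref).split('-', 1)[0] == image_tag(app_ref)


if __name__ == '__main__':
    app = 'docker.all-hands.dev/all-hands-ai/openhands:0.54'
    runtime = 'docker.all-hands.dev/all-hands-ai/runtime:0.54-nikolaik'
    print(versions_match(app, runtime))  # True
```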
--- a/README_CN.md
+++ b/README_CN.md
@@ -51,17 +51,17 @@ OpenHands也可以使用Docker在本地系统上运行。


 ```bash
-docker pull docker.all-hands.dev/all-hands-ai/runtime:0.53-nikolaik
+docker pull docker.all-hands.dev/all-hands-ai/runtime:0.54-nikolaik

 docker run -it --rm --pull=always \
-    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.53-nikolaik \
+    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.54-nikolaik \
    -e LOG_ALL_EVENTS=true \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ~/.openhands:/.openhands \
    -p 3000:3000 \
    --add-host host.docker.internal:host-gateway \
    --name openhands-app \
-    docker.all-hands.dev/all-hands-ai/openhands:0.53
+    docker.all-hands.dev/all-hands-ai/openhands:0.54
 ```

 > **注意**: 如果您在0.44版本之前使用过OpenHands，您可能需要运行 `mv ~/.openhands-state ~/.openhands` 来将对话历史迁移到新位置。
--- a/README_JA.md
+++ b/README_JA.md
@@ -42,17 +42,17 @@ OpenHandsはDockerを利用してローカル環境でも実行できます。
 > 公共ネットワークで実行していますか？[Hardened Docker Installation Guide](https://docs.all-hands.dev/usage/runtimes/docker#hardened-docker-installation)を参照して、ネットワークバインディングの制限や追加のセキュリティ対策を実施してください。

 ```bash
-docker pull docker.all-hands.dev/all-hands-ai/runtime:0.53-nikolaik
+docker pull docker.all-hands.dev/all-hands-ai/runtime:0.54-nikolaik

 docker run -it --rm --pull=always \
-    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.53-nikolaik \
+    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.54-nikolaik \
    -e LOG_ALL_EVENTS=true \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ~/.openhands:/.openhands \
    -p 3000:3000 \
    --add-host host.docker.internal:host-gateway \
    --name openhands-app \
-    docker.all-hands.dev/all-hands-ai/openhands:0.53
+    docker.all-hands.dev/all-hands-ai/openhands:0.54
 ```

 **注**: バージョン0.44以前のOpenHandsを使用していた場合は、会話履歴を移行するために `mv ~/.openhands-state ~/.openhands` を実行してください。
--- a/containers/app/Dockerfile
+++ b/containers/app/Dockerfile
@@ -21,7 +21,7 @@ ENV POETRY_NO_INTERACTION=1 \
    POETRY_CACHE_DIR=/tmp/poetry_cache

 RUN apt-get update -y \
-    && apt-get install -y curl make git build-essential \
+    && apt-get install -y curl make git build-essential jq gettext \
    && python3 -m pip install poetry --break-system-packages

 COPY pyproject.toml poetry.lock ./
--- a/containers/dev/compose.yml
+++ b/containers/dev/compose.yml
@@ -12,7 +12,7 @@ services:
      - SANDBOX_API_HOSTNAME=host.docker.internal
      - DOCKER_HOST_ADDR=host.docker.internal
      #
-      - SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-ghcr.io/all-hands-ai/runtime:0.53-nikolaik}
+      - SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-ghcr.io/all-hands-ai/runtime:0.54-nikolaik}
      - SANDBOX_USER_ID=${SANDBOX_USER_ID:-1234}
      - WORKSPACE_MOUNT_PATH=${WORKSPACE_BASE:-$PWD/workspace}
    ports:
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -7,7 +7,7 @@ services:
    image: openhands:latest
    container_name: openhands-app-${DATE:-}
    environment:
-      - SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-docker.all-hands.dev/all-hands-ai/runtime:0.53-nikolaik}
+      - SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-docker.all-hands.dev/all-hands-ai/runtime:0.54-nikolaik}
      #- SANDBOX_USER_ID=${SANDBOX_USER_ID:-1234} # enable this only if you want a specific non-root sandbox user but you will have to manually adjust permissions of ~/.openhands for this user
      - WORKSPACE_MOUNT_PATH=${WORKSPACE_BASE:-$PWD/workspace}
    ports:
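The compose files set the runtime image with `${SANDBOX_RUNTIME_CONTAINER_IMAGE:-...}`, which falls back to the default when the variable is unset or empty; the `${NAME-default}` form falls back only when it is unset. The Python sketch below mirrors both behaviours; the function names are illustrative, not part of OpenHands.

```python
import os
from collections.abc import Mapping

DEFAULT_RUNTIME_IMAGE = 'docker.all-hands.dev/all-hands-ai/runtime:0.54-nikolaik'


def resolve_colon_dash(env: Mapping[str, str], name: str, default: str) -> str:
    # Like Compose's ${NAME:-default}: fall back when NAME is unset OR empty.
    value = env.get(name)
    return value if value else default


def resolve_dash(env: Mapping[str, str], name: str, default: str) -> str:
    # Like ${NAME-default}: fall back only when NAME is unset; '' is kept.
    return env.get(name, default)


if __name__ == '__main__':
    print(resolve_colon_dash(os.environ, 'SANDBOX_RUNTIME_CONTAINER_IMAGE', DEFAULT_RUNTIME_IMAGE))
```

Note that `os.environ.get('RUNTIME', 'docker')`, used by several benchmark scripts in this diff, behaves like the `-` form: an exported but empty `RUNTIME` yields `''` rather than `'docker'`.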
--- a/docs/openapi.json
+++ b/docs/openapi.json
--- a/docs/usage/how-to/cli-mode.mdx
+++ b/docs/usage/how-to/cli-mode.mdx
@@ -119,7 +119,7 @@ The conversation history will be saved in `~/.openhands/sessions`.
 ```bash
 docker run -it \
    --pull=always \
-    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.53-nikolaik \
+    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.54-nikolaik \
    -e SANDBOX_USER_ID=$(id -u) \
    -e SANDBOX_VOLUMES=$SANDBOX_VOLUMES \
    -e LLM_API_KEY=$LLM_API_KEY \
@@ -128,8 +128,8 @@ docker run -it \
    -v ~/.openhands:/.openhands \
    --add-host host.docker.internal:host-gateway \
    --name openhands-app-$(date +%Y%m%d%H%M%S) \
-    docker.all-hands.dev/all-hands-ai/openhands:0.53 \
-    python -m openhands.cli.main --override-cli-mode true
+    docker.all-hands.dev/all-hands-ai/openhands:0.54 \
+    python -m openhands.cli.entry --override-cli-mode true
 ```

 <Note>
--- a/docs/usage/how-to/headless-mode.mdx
+++ b/docs/usage/how-to/headless-mode.mdx
@@ -61,7 +61,7 @@ export GITHUB_TOKEN="your-token"  # Required for repository operations
 # Run OpenHands
 docker run -it \
    --pull=always \
-    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.53-nikolaik \
+    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.54-nikolaik \
    -e SANDBOX_USER_ID=$(id -u) \
    -e SANDBOX_VOLUMES=$SANDBOX_VOLUMES \
    -e LLM_API_KEY=$LLM_API_KEY \
@@ -73,7 +73,7 @@ docker run -it \
    -v ~/.openhands:/.openhands \
    --add-host host.docker.internal:host-gateway \
    --name openhands-app-$(date +%Y%m%d%H%M%S) \
-    docker.all-hands.dev/all-hands-ai/openhands:0.53 \
+    docker.all-hands.dev/all-hands-ai/openhands:0.54 \
    python -m openhands.core.main -t "write a bash script that prints hi"
 ```

--- a/docs/usage/llms/local-llms.mdx
+++ b/docs/usage/llms/local-llms.mdx
@@ -68,23 +68,23 @@ Download and install the LM Studio desktop app from [lmstudio.ai](https://lmstud
 1. Check [the installation guide](/usage/local-setup) and ensure all prerequisites are met before running OpenHands, then run:

 ```bash
-docker pull docker.all-hands.dev/all-hands-ai/runtime:0.53-nikolaik
+docker pull docker.all-hands.dev/all-hands-ai/runtime:0.54-nikolaik

 docker run -it --rm --pull=always \
-    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.53-nikolaik \
+    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.54-nikolaik \
    -e LOG_ALL_EVENTS=true \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ~/.openhands:/.openhands \
    -p 3000:3000 \
    --add-host host.docker.internal:host-gateway \
    --name openhands-app \
-    docker.all-hands.dev/all-hands-ai/openhands:0.53
+    docker.all-hands.dev/all-hands-ai/openhands:0.54
 ```

 2. Wait until the server is running (see log below):
 ```
 Digest: sha256:e72f9baecb458aedb9afc2cd5bc935118d1868719e55d50da73190d3a85c674f
-Status: Image is up to date for docker.all-hands.dev/all-hands-ai/openhands:0.53
+Status: Image is up to date for docker.all-hands.dev/all-hands-ai/openhands:0.54
 Starting OpenHands...
 Running OpenHands as root
 14:22:13 - openhands:INFO: server_config.py:50 - Using config class None
--- a/docs/usage/local-setup.mdx
+++ b/docs/usage/local-setup.mdx
@@ -109,17 +109,17 @@ Note that you'll still need `uv` installed for the default MCP servers to work p
 <Accordion title="Docker Command (Click to expand)">

 ```bash
-docker pull docker.all-hands.dev/all-hands-ai/runtime:0.53-nikolaik
+docker pull docker.all-hands.dev/all-hands-ai/runtime:0.54-nikolaik

 docker run -it --rm --pull=always \
-    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.53-nikolaik \
+    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.54-nikolaik \
    -e LOG_ALL_EVENTS=true \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ~/.openhands:/.openhands \
    -p 3000:3000 \
    --add-host host.docker.internal:host-gateway \
    --name openhands-app \
-    docker.all-hands.dev/all-hands-ai/openhands:0.53
+    docker.all-hands.dev/all-hands-ai/openhands:0.54
 ```

 </Accordion>
--- a/docs/usage/runtimes/docker.mdx
+++ b/docs/usage/runtimes/docker.mdx
@@ -130,3 +130,28 @@ docker run # ... \
 <Note>
 **Docker Desktop Required**: Network isolation features, including custom networks and `host.docker.internal` routing, require Docker Desktop. Docker Engine alone does not support these features on localhost across custom networks. If you're using Docker Engine without Docker Desktop, network isolation may not work as expected.
 </Note>
+
+### Sidecar Containers
+
+If you want to run sidecar containers alongside the sandbox 'runner' containers without exposing the sandbox containers to the host network, you can use the `SANDBOX_ADDITIONAL_NETWORKS` environment variable to list additional Docker networks that the sandbox containers should join.
+
+```bash
+docker network create openhands-sccache
+
+docker run -d \
+  --hostname openhandsredis \
+  --network openhands-sccache \
+  redis
+
+docker run # ... \
+    -e SANDBOX_ADDITIONAL_NETWORKS='["openhands-sccache"]' \
+    # ...
+```
+
+All sandbox instances will then have access to a shared Redis instance at `openhandsredis:6379`.
+
+#### Docker Compose gotcha
+
+Note that Docker Compose prefixes the networks it creates with a scope (the project name) by default, and the additional-networks config does not account for this prefix. Therefore, when using Docker Compose, you must either:
+- specify a network name via the `name` field to remove the scoping (https://docs.docker.com/reference/compose-file/networks/#name), or
+- provide the scope within the given config (e.g. `SANDBOX_ADDITIONAL_NETWORKS: '["myscope_openhands-sccache"]'`, where `myscope` is the prefix assigned by Docker Compose).
--- a/evaluation/benchmarks/EDA/run_infer.py
+++ b/evaluation/benchmarks/EDA/run_infer.py
@@ -9,7 +9,8 @@ from evaluation.utils.shared import (
    EvalMetadata,
    EvalOutput,
    compatibility_for_eval_history_pairs,
-    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -60,18 +61,15 @@ AGENT_CLS_TO_INST_SUFFIX = {
 def get_config(
    metadata: EvalMetadata,
 ) -> OpenHandsConfig:
-    sandbox_config = get_default_sandbox_config_for_eval()
-    sandbox_config.base_container_image = 'python:3.12-bookworm'
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    # Create config with EDA-specific container image
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
    )
+
+    # Override the container image for EDA
+    config.sandbox.base_container_image = 'python:3.12-bookworm'
+
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
    agent_config.enable_prompt_extensions = False
@@ -146,7 +144,7 @@ def process_instance(

    logger.info(f'Final message: {final_message} | Ground truth: {instance["text"]}')
    test_result = game.reward()
-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)

    # history is now available as a stream of events, rather than list of pairs of (Action, Observation)
    # for compatibility with the existing output format, we can remake the pairs here
--- a/evaluation/benchmarks/agent_bench/run_infer.py
+++ b/evaluation/benchmarks/agent_bench/run_infer.py
@@ -17,7 +17,8 @@ from evaluation.utils.shared import (
    EvalMetadata,
    EvalOutput,
    compatibility_for_eval_history_pairs,
-    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -40,19 +41,12 @@ from openhands.utils.async_utils import call_async_from_sync
 def get_config(
    metadata: EvalMetadata,
 ) -> OpenHandsConfig:
-    sandbox_config = get_default_sandbox_config_for_eval()
-    sandbox_config.base_container_image = 'python:3.12-slim'
+    # Create config with agent_bench-specific container image
+    config = get_openhands_config_for_eval(metadata=metadata)
+
+    # Override the container image for agent_bench
+    config.sandbox.base_container_image = 'python:3.12-slim'

-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
-        runtime=os.environ.get('RUNTIME', 'docker'),
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
-    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
    agent_config.enable_prompt_extensions = False
@@ -273,7 +267,7 @@ def process_instance(
    # remove when it becomes unnecessary
    histories = compatibility_for_eval_history_pairs(state.history)

-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)

    # Save the output
    output = EvalOutput(
--- a/evaluation/benchmarks/aider_bench/run_infer.py
+++ b/evaluation/benchmarks/aider_bench/run_infer.py
@@ -17,6 +17,8 @@ from evaluation.utils.shared import (
    EvalOutput,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -49,15 +51,10 @@ def get_config(
 ) -> OpenHandsConfig:
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = 'python:3.11-bookworm'
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
+        sandbox_config=sandbox_config,
        runtime=os.environ.get('RUNTIME', 'docker'),
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
@@ -246,7 +243,7 @@ def process_instance(
    # for compatibility with the existing output format, we can remake the pairs here
    # remove when it becomes unnecessary
    histories = compatibility_for_eval_history_pairs(state.history)
-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)

    # Save the output
    output = EvalOutput(
--- a/evaluation/benchmarks/biocoder/run_infer.py
+++ b/evaluation/benchmarks/biocoder/run_infer.py
@@ -15,6 +15,8 @@ from evaluation.utils.shared import (
    codeact_user_response,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -60,15 +62,10 @@ def get_config(
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = BIOCODER_BENCH_CONTAINER_IMAGE

-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
@@ -294,7 +291,7 @@ def process_instance(
        raise ValueError('State should not be None.')

    test_result = complete_runtime(runtime, instance)
-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)
    # history is now available as a stream of events, rather than list of pairs of (Action, Observation)
    # for compatibility with the existing output format, we can remake the pairs here
    # remove when it becomes unnecessary
--- a/evaluation/benchmarks/bird/run_infer.py
+++ b/evaluation/benchmarks/bird/run_infer.py
@@ -18,6 +18,8 @@ from evaluation.utils.shared import (
    EvalOutput,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -74,15 +76,10 @@ def get_config(
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = 'python:3.12-bookworm'

-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
@@ -422,7 +419,7 @@ def process_instance(
    # You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
    if state is None:
        raise ValueError('State should not be None.')
-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)

    # history is now available as a stream of events, rather than list of pairs of (Action, Observation)
    # for compatibility with the existing output format, we can remake the pairs here
--- a/evaluation/benchmarks/browsing_delegation/run_infer.py
+++ b/evaluation/benchmarks/browsing_delegation/run_infer.py
@@ -11,6 +11,8 @@ from evaluation.utils.shared import (
    EvalOutput,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -39,14 +41,8 @@ def get_config(
    )
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = 'python:3.12-bookworm'
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
-        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        workspace_base=None,
-        workspace_mount_path=None,
+    config = get_openhands_config_for_eval(
+        metadata=metadata, runtime='docker', sandbox_config=sandbox_config
    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
@@ -88,7 +84,7 @@ def process_instance(
    if state is None:
        raise ValueError('State should not be None.')

-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)
    # history is now available as a stream of events, rather than list of pairs of (Action, Observation)
    # for compatibility with the existing output format, we can remake the pairs here
    # remove when it becomes unnecessary
--- a/evaluation/benchmarks/commit0/run_infer.py
+++ b/evaluation/benchmarks/commit0/run_infer.py
@@ -16,6 +16,8 @@ from evaluation.utils.shared import (
    assert_and_raise,
    codeact_user_response,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -113,16 +115,11 @@ def get_config(
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = base_container_image

-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
-        max_iterations=metadata.max_iterations,
-        enable_browser=RUN_WITH_BROWSING,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
+        sandbox_config=sandbox_config,
        runtime=os.environ.get('RUNTIME', 'docker'),
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        enable_browser=RUN_WITH_BROWSING,
    )
    config.set_llm_config(
        update_llm_config_for_completions_logging(
@@ -480,7 +477,7 @@ def process_instance(

    # NOTE: this is NO LONGER the event stream, but an agent history that includes delegate agent's events
    histories = [event_to_dict(event) for event in state.history]
-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)

    # Save the output
    output = EvalOutput(
--- a/evaluation/benchmarks/discoverybench/run_infer.py
+++ b/evaluation/benchmarks/discoverybench/run_infer.py
@@ -17,6 +17,8 @@ from evaluation.utils.shared import (
    codeact_user_response,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -64,15 +66,10 @@ def get_config(
 ) -> OpenHandsConfig:
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = 'python:3.12-bookworm'
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
@@ -294,7 +291,7 @@ def process_instance(
    if state is None:
        raise ValueError('State should not be None.')

-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)
    test_result = complete_runtime(state)

    # history is now available as a stream of events, rather than list of pairs of (Action, Observation)
--- a/evaluation/benchmarks/gaia/run_infer.py
+++ b/evaluation/benchmarks/gaia/run_infer.py
@@ -22,6 +22,8 @@ from evaluation.utils.shared import (
    codeact_user_response,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -59,15 +61,10 @@ def get_config(
 ) -> OpenHandsConfig:
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = 'nikolaik/python-nodejs:python3.12-nodejs22'
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
+        sandbox_config=sandbox_config,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
    )
    config.set_llm_config(metadata.llm_config)
    if metadata.agent_config:
@@ -269,7 +266,7 @@ Here is the task:
        'model_answer': model_answer,
        'ground_truth': instance['Final answer'],
    }
-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)

    # history is now available as a stream of events, rather than list of pairs of (Action, Observation)
    # for compatibility with the existing output format, we can remake the pairs here
--- a/evaluation/benchmarks/gorilla/run_infer.py
+++ b/evaluation/benchmarks/gorilla/run_infer.py
@@ -12,6 +12,8 @@ from evaluation.utils.shared import (
    codeact_user_response,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -42,15 +44,10 @@ def get_config(
 ) -> OpenHandsConfig:
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = 'python:3.12-bookworm'
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
@@ -108,7 +105,7 @@ def process_instance(
    # attempt to parse model_answer
    ast_eval_fn = instance['ast_eval']
    correct, hallucination = ast_eval_fn(instance_id, model_answer_raw)
-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)
    logger.info(
        f'Final message: {model_answer_raw} | Correctness: {correct} | Hallucination: {hallucination}'
    )
--- a/evaluation/benchmarks/gpqa/run_infer.py
+++ b/evaluation/benchmarks/gpqa/run_infer.py
@@ -30,6 +30,8 @@ from evaluation.utils.shared import (
    EvalOutput,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -63,15 +65,10 @@ def get_config(
 ) -> OpenHandsConfig:
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = 'python:3.12-bookworm'
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
@@ -292,7 +289,7 @@ Ok now its time to start solving the question. Good luck!
    if state is None:
        raise ValueError('State should not be None.')

-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)

    # Save the output
    output = EvalOutput(
--- a/evaluation/benchmarks/humanevalfix/run_infer.py
+++ b/evaluation/benchmarks/humanevalfix/run_infer.py
@@ -23,6 +23,8 @@ from evaluation.utils.shared import (
    codeact_user_response,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -84,15 +86,10 @@ def get_config(
 ) -> OpenHandsConfig:
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = 'python:3.12-bookworm'
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
@@ -248,7 +245,7 @@ def process_instance(

    if state is None:
        raise ValueError('State should not be None.')
-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)
    test_result = complete_runtime(runtime, instance)

    # history is now available as a stream of events, rather than list of pairs of (Action, Observation)
--- a/evaluation/benchmarks/lca_ci_build_repair/eval_infer.py
+++ b/evaluation/benchmarks/lca_ci_build_repair/eval_infer.py
@@ -16,6 +16,7 @@ import ruamel.yaml
 from evaluation.utils.shared import (
    EvalMetadata,
    get_default_sandbox_config_for_eval,
+    get_openhands_config_for_eval,
    make_metadata,
 )
 from openhands.core.config import (
@@ -37,15 +38,10 @@ def get_config(
 ) -> OpenHandsConfig:
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = 'python:3.12-bookworm'
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
--- a/evaluation/benchmarks/lca_ci_build_repair/run_infer.py
+++ b/evaluation/benchmarks/lca_ci_build_repair/run_infer.py
@@ -22,6 +22,8 @@ from evaluation.utils.shared import (
    codeact_user_response,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -47,15 +49,10 @@ def get_config(
 ) -> OpenHandsConfig:
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = 'python:3.12-bookworm'
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
@@ -335,7 +332,7 @@ Be thorough in your exploration, testing, and reasoning. It's fine if your think
        )
    )
    assert state is not None
-    metrics = state.metrics.get() if state.metrics else {}
+    metrics = get_metrics(state)

    test_result = complete_runtime(runtime, instance)

--- a/evaluation/benchmarks/logic_reasoning/run_infer.py
+++ b/evaluation/benchmarks/logic_reasoning/run_infer.py
@@ -10,6 +10,8 @@ from evaluation.utils.shared import (
    codeact_user_response,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -51,15 +53,10 @@ def get_config(
        '$OH_INTERPRETER_PATH -m pip install scitools-pyke'
    )

-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
@@ -247,7 +244,7 @@ def process_instance(
    )
    test_result['final_message'] = final_message

-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)
    # history is now available as a stream of events, rather than list of pairs of (Action, Observation)
    # for compatibility with the existing output format, we can remake the pairs here
    # remove when it becomes unnecessary
--- a/evaluation/benchmarks/miniwob/run_infer.py
+++ b/evaluation/benchmarks/miniwob/run_infer.py
@@ -13,6 +13,8 @@ from evaluation.utils.shared import (
    codeact_user_response,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -57,15 +59,10 @@ def get_config(
 ) -> OpenHandsConfig:
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = 'xingyaoww/od-eval-miniwob:v1.0'
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime=os.environ.get('RUNTIME', 'docker'),
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(
        update_llm_config_for_completions_logging(
@@ -174,7 +171,7 @@ def process_instance(
    if state is None:
        raise ValueError('State should not be None.')

-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)

    # Instruction is the first message from the USER
    instruction = ''
--- a/evaluation/benchmarks/mint/run_infer.py
+++ b/evaluation/benchmarks/mint/run_infer.py
@@ -15,6 +15,8 @@ from evaluation.utils.shared import (
    EvalOutput,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -109,15 +111,10 @@ def get_config(
        f'$OH_INTERPRETER_PATH -m pip install {" ".join(MINT_DEPENDENCIES)}'
    )

-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
@@ -205,7 +202,7 @@ def process_instance(
        task_state = state.extra_data['task_state']
        logger.info('Task state: ' + str(task_state.to_dict()))

-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)

    # history is now available as a stream of events, rather than list of pairs of (Action, Observation)
    # for compatibility with the existing output format, we can remake the pairs here
--- a/evaluation/benchmarks/ml_bench/run_infer.py
+++ b/evaluation/benchmarks/ml_bench/run_infer.py
@@ -26,6 +26,8 @@ from evaluation.utils.shared import (
    codeact_user_response,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -79,15 +81,10 @@ def get_config(
 ) -> OpenHandsConfig:
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = 'public.ecr.aws/i5g0m1f6/ml-bench'
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
@@ -250,7 +247,7 @@ def process_instance(instance: Any, metadata: EvalMetadata, reset_logger: bool =
        )
    )
    assert state is not None
-    metrics = state.metrics.get() if state.metrics else {}
+    metrics = get_metrics(state)

    test_result = complete_runtime(runtime)

--- a/evaluation/benchmarks/multi_swe_bench/eval_infer.py
+++ b/evaluation/benchmarks/multi_swe_bench/eval_infer.py
@@ -23,6 +23,7 @@ from evaluation.utils.shared import (
    EvalMetadata,
    EvalOutput,
    get_default_sandbox_config_for_eval,
+    get_openhands_config_for_eval,
    prepare_dataset,
    reset_logger_for_multiprocessing,
    run_evaluation,
@@ -87,13 +88,9 @@ def get_config(metadata: EvalMetadata, instance: pd.Series) -> OpenHandsConfig:
        dataset_name=metadata.dataset,
        instance_id=instance['instance_id'],
    )
-    config = OpenHandsConfig(
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
        runtime=os.environ.get('RUNTIME', 'docker'),
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    return config

--- a/evaluation/benchmarks/multi_swe_bench/run_infer.py
+++ b/evaluation/benchmarks/multi_swe_bench/run_infer.py
@@ -21,6 +21,7 @@ from evaluation.utils.shared import (
    codeact_user_response,
    get_default_sandbox_config_for_eval,
    get_metrics,
+    get_openhands_config_for_eval,
    is_fatal_evaluation_error,
    make_metadata,
    prepare_dataset,
@@ -341,16 +342,11 @@ def get_config(
        instance_id=instance['instance_id'],
    )

-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
-        max_iterations=metadata.max_iterations,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        enable_browser=RUN_WITH_BROWSING,
        runtime=os.environ.get('RUNTIME', 'docker'),
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(
        update_llm_config_for_completions_logging(
--- a/evaluation/benchmarks/nocode_bench/run_infer_nc.py
+++ b/evaluation/benchmarks/nocode_bench/run_infer_nc.py
@@ -31,6 +31,7 @@ from evaluation.utils.shared import (
    codeact_user_response,
    get_default_sandbox_config_for_eval,
    get_metrics,
+    get_openhands_config_for_eval,
    is_fatal_evaluation_error,
    make_metadata,
    prepare_dataset,
@@ -174,15 +175,10 @@ def get_config(
        instance_id=instance['instance_id'],
    )

-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
-        max_iterations=metadata.max_iterations,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime=os.environ.get('RUNTIME', 'docker'),
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )

    config.set_llm_config(
--- a/evaluation/benchmarks/scienceagentbench/run_infer.py
+++ b/evaluation/benchmarks/scienceagentbench/run_infer.py
@@ -12,6 +12,8 @@ from evaluation.utils.shared import (
    codeact_user_response,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -63,16 +65,10 @@ def get_config(
    sandbox_config.base_container_image = (
        'docker.io/xingyaoww/openhands-eval-scienceagentbench'
    )
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime=os.environ.get('RUNTIME', 'docker'),
-        max_budget_per_task=4,
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(
        update_llm_config_for_completions_logging(
@@ -218,7 +214,7 @@ If the program uses some packages that are incompatible, please figure out alter
    # You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
    if state is None:
        raise ValueError('State should not be None.')
-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)

    # history is now available as a stream of events, rather than list of pairs of (Action, Observation)
    # for compatibility with the existing output format, we can remake the pairs here
--- a/evaluation/benchmarks/swe_bench/README.md
+++ b/evaluation/benchmarks/swe_bench/README.md
@@ -93,6 +93,9 @@ export USE_HINT_TEXT=true # Ignore this if you are not sure.

 # Specify a condenser configuration for memory management (default: NoOpCondenser)
 export EVAL_CONDENSER=summarizer_for_eval # Name of the condenser config group in config.toml
+
+# Specify the instruction prompt template file name
+export INSTRUCTION_TEMPLATE_NAME=swe_custom.j2 # Name of the file in the swe_bench/prompts folder.
 ```

 Let's say you'd like to run 10 instances using `llm.eval_gpt4_1106_preview` and CodeActAgent,
--- a/evaluation/benchmarks/swe_bench/eval_infer.py
+++ b/evaluation/benchmarks/swe_bench/eval_infer.py
@@ -19,6 +19,7 @@ from evaluation.utils.shared import (
    EvalMetadata,
    EvalOutput,
    get_default_sandbox_config_for_eval,
+    get_openhands_config_for_eval,
    prepare_dataset,
    reset_logger_for_multiprocessing,
    run_evaluation,
@@ -83,13 +84,9 @@ def get_config(metadata: EvalMetadata, instance: pd.Series) -> OpenHandsConfig:
        dataset_name=metadata.dataset,
        instance_id=instance['instance_id'],
    )
-    config = OpenHandsConfig(
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
        runtime=os.environ.get('RUNTIME', 'docker'),
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    return config

--- a/evaluation/benchmarks/swe_bench/run_infer.py
+++ b/evaluation/benchmarks/swe_bench/run_infer.py
@@ -32,6 +32,7 @@ from evaluation.utils.shared import (
    codeact_user_response,
    get_default_sandbox_config_for_eval,
    get_metrics,
+    get_openhands_config_for_eval,
    is_fatal_evaluation_error,
    make_metadata,
    prepare_dataset,
@@ -108,7 +109,9 @@ def get_instruction(instance: pd.Series, metadata: EvalMetadata) -> MessageActio
    llm_model = metadata.llm_config.model

    # Determine the template file based on mode and LLM
-    if mode.startswith('swt'):
+    if metadata.instruction_template_name:
+        template_name = metadata.instruction_template_name
+    elif mode.startswith('swt'):
        template_name = 'swt.j2'
    elif mode == 'swe':
        if 'gpt-4.1' in llm_model:
@@ -122,6 +125,7 @@ def get_instruction(instance: pd.Series, metadata: EvalMetadata) -> MessageActio
        logger.error(f'Unexpected evaluation mode: {mode}. Falling back to default.')
        template_name = 'swe_default.j2'

+    logger.debug(f'Using instruction template file: {template_name}')
    # Set up Jinja2 environment
    # Assuming templates are in 'evaluation/benchmarks/swe_bench/prompts' relative to this script
    prompts_dir = os.path.join(os.path.dirname(__file__), 'prompts')
@@ -224,16 +228,11 @@ def get_config(
        instance_id=instance['instance_id'],
    )

-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
-        max_iterations=metadata.max_iterations,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        enable_browser=RUN_WITH_BROWSING,
        runtime=os.environ.get('RUNTIME', 'docker'),
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )

    config.set_llm_config(
--- a/evaluation/benchmarks/swe_bench/run_infer_interact.py
+++ b/evaluation/benchmarks/swe_bench/run_infer_interact.py
@@ -21,6 +21,7 @@ from evaluation.utils.shared import (
    EvalException,
    EvalMetadata,
    EvalOutput,
+    get_metrics,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -179,7 +180,7 @@ def process_instance(
        raise ValueError('State should not be None.')

    histories = [event_to_dict(event) for event in state.history]
-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)

    # Save the output
    instruction = message_action.content
--- a/evaluation/benchmarks/swe_bench/run_localize.py
+++ b/evaluation/benchmarks/swe_bench/run_localize.py
@@ -20,6 +20,7 @@ from evaluation.utils.shared import (
    codeact_user_response,
    get_default_sandbox_config_for_eval,
    get_metrics,
+    get_openhands_config_for_eval,
    is_fatal_evaluation_error,
    make_metadata,
    prepare_dataset,
@@ -199,16 +200,11 @@ def get_config(
        'REPO_PATH': f'/workspace/{workspace_dir_name}/',
    }

-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
-        max_iterations=metadata.max_iterations,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        enable_browser=RUN_WITH_BROWSING,
        runtime=os.environ.get('RUNTIME', 'docker'),
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(
        update_llm_config_for_completions_logging(
--- a/evaluation/benchmarks/testgeneval/eval_infer.py
+++ b/evaluation/benchmarks/testgeneval/eval_infer.py
@@ -37,6 +37,7 @@ from evaluation.benchmarks.testgeneval.utils import load_testgeneval_dataset
 from evaluation.utils.shared import (
    EvalMetadata,
    EvalOutput,
+    get_openhands_config_for_eval,
    prepare_dataset,
    reset_logger_for_multiprocessing,
    run_evaluation,
@@ -58,20 +59,21 @@ def get_config(instance: pd.Series) -> OpenHandsConfig:
        f'Invalid container image for instance {instance["instance_id_swebench"]}.'
    )
    logger.info(f'Using instance container image: {base_container_image}.')
-    return OpenHandsConfig(
-        run_as_openhands=False,
-        runtime=os.environ.get('RUNTIME', 'eventstream'),
-        sandbox=SandboxConfig(
-            base_container_image=base_container_image,
-            use_host_network=False,
-            timeout=1800,
-            api_key=os.environ.get('ALLHANDS_API_KEY'),
-            remote_runtime_api_url=os.environ.get(
-                'SANDBOX_REMOTE_RUNTIME_API_URL', 'http://localhost:8000'
-            ),
+
+    # Create custom sandbox config for testgeneval with specific requirements
+    sandbox_config = SandboxConfig(
+        base_container_image=base_container_image,
+        use_host_network=False,
+        timeout=1800,  # Longer timeout than default (300)
+        api_key=os.environ.get('ALLHANDS_API_KEY'),
+        remote_runtime_api_url=os.environ.get(
+            'SANDBOX_REMOTE_RUNTIME_API_URL', 'http://localhost:8000'
        ),
-        workspace_base=None,
-        workspace_mount_path=None,
+    )
+
+    return get_openhands_config_for_eval(
+        sandbox_config=sandbox_config,
+        runtime=os.environ.get('RUNTIME', 'docker'),
    )


--- a/evaluation/benchmarks/testgeneval/run_infer.py
+++ b/evaluation/benchmarks/testgeneval/run_infer.py
@@ -25,6 +25,7 @@ from evaluation.utils.shared import (
    assert_and_raise,
    codeact_user_response,
    get_metrics,
+    get_openhands_config_for_eval,
    is_fatal_evaluation_error,
    make_metadata,
    prepare_dataset,
@@ -126,29 +127,26 @@ def get_config(
        f'Submit an issue on https://github.com/All-Hands-AI/OpenHands if you run into any issues.'
    )

-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
-        max_iterations=metadata.max_iterations,
-        runtime=os.environ.get('RUNTIME', 'eventstream'),
-        sandbox=SandboxConfig(
-            base_container_image=base_container_image,
-            enable_auto_lint=True,
-            use_host_network=False,
-            # large enough timeout, since some testcases take very long to run
-            timeout=300,
-            # Add platform to the sandbox config to solve issue 4401
-            platform='linux/amd64',
-            api_key=os.environ.get('ALLHANDS_API_KEY', None),
-            remote_runtime_api_url=os.environ.get(
-                'SANDBOX_REMOTE_RUNTIME_API_URL', 'http://localhost:8000'
-            ),
-            keep_runtime_alive=False,
-            remote_runtime_init_timeout=3600,
+    sandbox_config = SandboxConfig(
+        base_container_image=base_container_image,
+        enable_auto_lint=True,
+        use_host_network=False,
+        # large enough timeout, since some testcases take very long to run
+        timeout=300,
+        # Add platform to the sandbox config to solve issue 4401
+        platform='linux/amd64',
+        api_key=os.environ.get('ALLHANDS_API_KEY', None),
+        remote_runtime_api_url=os.environ.get(
+            'SANDBOX_REMOTE_RUNTIME_API_URL', 'http://localhost:8000'
        ),
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        keep_runtime_alive=False,
+        remote_runtime_init_timeout=3600,
+    )
+
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
+        sandbox_config=sandbox_config,
+        runtime=os.environ.get('RUNTIME', 'docker'),
    )
    config.set_llm_config(
        update_llm_config_for_completions_logging(
--- a/evaluation/benchmarks/the_agent_company/run_infer.py
+++ b/evaluation/benchmarks/the_agent_company/run_infer.py
@@ -12,7 +12,10 @@ import tempfile
 import yaml
 from browsing import pre_login

-from evaluation.utils.shared import get_default_sandbox_config_for_eval
+from evaluation.utils.shared import (
+    get_default_sandbox_config_for_eval,
+    get_openhands_config_for_eval,
+)
 from openhands.controller.state.state import State
 from openhands.core.config import (
    LLMConfig,
@@ -42,19 +45,17 @@ def get_config(
    sandbox_config.enable_auto_lint = True
    # If the web services are running on the host machine, this must be set to True
    sandbox_config.use_host_network = True
-    config = OpenHandsConfig(
-        run_as_openhands=False,
-        max_budget_per_task=4,
+    config = get_openhands_config_for_eval(
        max_iterations=100,
-        save_trajectory_path=os.path.join(
-            mount_path_on_host, f'traj_{task_short_name}.json'
-        ),
-        sandbox=sandbox_config,
+        sandbox_config=sandbox_config,
        # we mount trajectories path so that trajectories, generated by OpenHands
        # controller, can be accessible to the evaluator file in the runtime container
        workspace_mount_path=mount_path_on_host,
-        workspace_mount_path_in_sandbox='/outputs',
    )
+    config.save_trajectory_path = os.path.join(
+        mount_path_on_host, f'traj_{task_short_name}.json'
+    )
+    config.max_budget_per_task = 4
    config.set_llm_config(llm_config)
    if agent_config:
        config.set_agent_config(agent_config)
--- a/evaluation/benchmarks/toolqa/run_infer.py
+++ b/evaluation/benchmarks/toolqa/run_infer.py
@@ -11,6 +11,8 @@ from evaluation.utils.shared import (
    codeact_user_response,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -43,15 +45,10 @@ def get_config(
 ) -> OpenHandsConfig:
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.base_container_image = 'python:3.12-bookworm'
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
@@ -134,7 +131,7 @@ def process_instance(instance: Any, metadata: EvalMetadata, reset_logger: bool =
    correct = eval_answer(str(model_answer_raw), str(answer))
    logger.info(f'Final message: {model_answer_raw} | Correctness: {correct}')

-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)

    # history is now available as a stream of events, rather than list of pairs of (Action, Observation)
    # for compatibility with the existing output format, we can remake the pairs here
--- a/evaluation/benchmarks/visual_swe_bench/run_infer.py
+++ b/evaluation/benchmarks/visual_swe_bench/run_infer.py
@@ -20,6 +20,7 @@ from evaluation.utils.shared import (
    codeact_user_response,
    get_default_sandbox_config_for_eval,
    get_metrics,
+    get_openhands_config_for_eval,
    is_fatal_evaluation_error,
    make_metadata,
    prepare_dataset,
@@ -160,16 +161,11 @@ def get_config(
        instance_id=instance['instance_id'],
    )

-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
-        max_iterations=metadata.max_iterations,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        enable_browser=RUN_WITH_BROWSING,
        runtime=os.environ.get('RUNTIME', 'docker'),
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(
        update_llm_config_for_completions_logging(
--- a/evaluation/benchmarks/visualwebarena/run_infer.py
+++ b/evaluation/benchmarks/visualwebarena/run_infer.py
@@ -12,6 +12,8 @@ from evaluation.utils.shared import (
    EvalOutput,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -72,16 +74,10 @@ def get_config(
        'VWA_WIKIPEDIA': f'{base_url}:8888',
        'VWA_HOMEPAGE': f'{base_url}:4399',
    }
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
-        attach_to_existing=True,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(
        update_llm_config_for_completions_logging(
@@ -179,7 +175,7 @@ def process_instance(
    if state is None:
        raise ValueError('State should not be None.')

-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)

    # Instruction obtained from the first message from the USER
    instruction = ''
--- a/evaluation/benchmarks/webarena/run_infer.py
+++ b/evaluation/benchmarks/webarena/run_infer.py
@@ -12,6 +12,8 @@ from evaluation.utils.shared import (
    EvalOutput,
    compatibility_for_eval_history_pairs,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -64,15 +66,10 @@ def get_config(
        'MAP': f'{base_url}:3000',
        'HOMEPAGE': f'{base_url}:4399',
    }
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime='docker',
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
+        sandbox_config=sandbox_config,
    )
    config.set_llm_config(metadata.llm_config)
    agent_config = config.get_agent_config(metadata.agent_class)
@@ -163,7 +160,7 @@ def process_instance(
    if state is None:
        raise ValueError('State should not be None.')

-    metrics = state.metrics.get() if state.metrics else None
+    metrics = get_metrics(state)

    # Instruction is the first message from the USER
    instruction = ''
--- a/evaluation/integration_tests/run_infer.py
+++ b/evaluation/integration_tests/run_infer.py
@@ -9,6 +9,8 @@ from evaluation.utils.shared import (
    EvalMetadata,
    EvalOutput,
    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
    make_metadata,
    prepare_dataset,
    reset_logger_for_multiprocessing,
@@ -44,18 +46,12 @@ def get_config(
 ) -> OpenHandsConfig:
    sandbox_config = get_default_sandbox_config_for_eval()
    sandbox_config.platform = 'linux/amd64'
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
        runtime=os.environ.get('RUNTIME', 'docker'),
-        max_iterations=metadata.max_iterations,
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
-        # debug
-        debug=True,
+        sandbox_config=sandbox_config,
    )
+    config.debug = True
    config.set_llm_config(
        update_llm_config_for_completions_logging(
            metadata.llm_config, metadata.eval_output_dir, instance_id
@@ -135,7 +131,7 @@ def process_instance(
        assert len(histories) > 0, 'History should not be empty'

        test_result: TestResult = test_class.verify_result(runtime, histories)
-        metrics = state.metrics.get() if state.metrics else None
+        metrics = get_metrics(state)
    finally:
        runtime.close()

--- a/evaluation/utils/shared.py
+++ b/evaluation/utils/shared.py
@@ -53,6 +53,7 @@ class EvalMetadata(BaseModel):
    data_split: str | None = None
    details: dict[str, Any] | None = None
    condenser_config: CondenserConfig | None = None
+    instruction_template_name: str | None = None


 class EvalOutput(BaseModel):
@@ -205,6 +206,7 @@ def make_metadata(
        condenser_config=condenser_config
        if condenser_config
        else NoOpCondenserConfig(),
+        instruction_template_name=os.environ.get('INSTRUCTION_TEMPLATE_NAME'),
    )
    metadata_json = metadata.model_dump_json()
    logger.info(f'Metadata: {metadata_json}')
@@ -666,8 +668,23 @@ def is_fatal_runtime_error(error: str | None) -> bool:


 def get_metrics(state: State) -> dict[str, Any]:
-    """Extract metrics from the state."""
-    metrics = state.metrics.get() if state.metrics else {}
+    """Extract metrics for evaluations.
+
+    Prefer ConversationStats (the source of truth); fall back to state.metrics when
+    the stats are missing or raise, for backward compatibility.
+    """
+    metrics: dict[str, Any]
+    try:
+        if getattr(state, 'conversation_stats', None):
+            combined = state.conversation_stats.get_combined_metrics()
+            metrics = combined.get()
+        elif getattr(state, 'metrics', None):
+            metrics = state.metrics.get()
+        else:
+            metrics = {}
+    except Exception:
+        metrics = state.metrics.get() if getattr(state, 'metrics', None) else {}
+
    metrics['condenser'] = get_condensation_metadata(state)
    return metrics

@@ -686,3 +703,79 @@ def get_default_sandbox_config_for_eval() -> SandboxConfig:
        remote_runtime_enable_retries=True,
        remote_runtime_class='sysbox',
    )
+
+
+def get_openhands_config_for_eval(
+    metadata: EvalMetadata | None = None,
+    sandbox_config: SandboxConfig | None = None,
+    runtime: str | None = None,
+    max_iterations: int | None = None,
+    default_agent: str | None = None,
+    enable_browser: bool = False,
+    workspace_base: str | None = None,
+    workspace_mount_path: str | None = None,
+):
+    """Create an OpenHandsConfig with common patterns used across evaluation scripts.
+
+    This function provides a standardized way to create OpenHands configurations
+    for evaluation runs, with sensible defaults that match the patterns used in
+    most run_infer.py scripts. Individual evaluation scripts can override specific
+    attributes as needed.
+
+    Args:
+        metadata: EvalMetadata containing agent class, max iterations, etc.
+        sandbox_config: Custom sandbox config. If None, uses get_default_sandbox_config_for_eval()
+        runtime: Runtime type. If None, uses environment RUNTIME or 'docker'
+        max_iterations: Max iterations for the agent. If None, uses metadata.max_iterations
+        default_agent: Agent class name. If None, uses metadata.agent_class
+        enable_browser: Whether to enable browser functionality
+        workspace_base: Workspace base path. Defaults to None
+        workspace_mount_path: Workspace mount path. Defaults to None
+
+    Returns:
+        OpenHandsConfig: Configured for evaluation with eval-specific overrides applied
+    """
+    # Defer import to avoid circular imports at module load time
+    from openhands.core.config.openhands_config import (
+        OpenHandsConfig as _OHConfig,  # type: ignore
+    )
+
+    # Use provided sandbox config or get default
+    if sandbox_config is None:
+        sandbox_config = get_default_sandbox_config_for_eval()
+
+    # Extract values from metadata if provided
+    if metadata is not None:
+        if max_iterations is None:
+            max_iterations = metadata.max_iterations
+        if default_agent is None:
+            default_agent = metadata.agent_class
+
+    # Use environment runtime or default
+    if runtime is None:
+        runtime = os.environ.get('RUNTIME', 'docker')
+
+    # Provide sensible defaults if still None
+    if default_agent is None:
+        default_agent = 'CodeActAgent'
+    if max_iterations is None:
+        max_iterations = 50
+
+    # Use .eval_sessions under the current working directory (expected: repo root)
+    eval_store = os.path.abspath(os.path.join(os.getcwd(), '.eval_sessions'))
+
+    # Create the base config with evaluation-specific overrides
+    config = _OHConfig(
+        default_agent=default_agent,
+        run_as_openhands=False,
+        runtime=runtime,
+        max_iterations=max_iterations,
+        enable_browser=enable_browser,
+        sandbox=sandbox_config,
+        workspace_base=workspace_base,
+        workspace_mount_path=workspace_mount_path,
+        file_store='local',
+        file_store_path=eval_store,
+    )
+
+    return config
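The precedence that `get_openhands_config_for_eval` applies before building the config (explicit argument first, then `metadata`, then the `RUNTIME` environment variable for the runtime, then hard-coded defaults) can be exercised without importing OpenHands. A sketch under that assumption: `FakeMetadata` and `resolve_eval_settings` are hypothetical names, and a plain dict stands in for `OpenHandsConfig`.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeMetadata:
    """Hypothetical stand-in for EvalMetadata (only the fields read here)."""

    agent_class: str
    max_iterations: int


def resolve_eval_settings(
    metadata: FakeMetadata | None = None,
    runtime: str | None = None,
    max_iterations: int | None = None,
    default_agent: str | None = None,
) -> dict[str, Any]:
    """Resolve values in the same order as get_openhands_config_for_eval."""
    # 1. Explicit arguments win; metadata only fills values still None
    if metadata is not None:
        if max_iterations is None:
            max_iterations = metadata.max_iterations
        if default_agent is None:
            default_agent = metadata.agent_class
    # 2. Runtime falls back to the RUNTIME env var, then to 'docker'
    if runtime is None:
        runtime = os.environ.get("RUNTIME", "docker")
    # 3. Hard-coded fallbacks when nothing else supplied a value
    if default_agent is None:
        default_agent = "CodeActAgent"
    if max_iterations is None:
        max_iterations = 50
    return {
        "default_agent": default_agent,
        "max_iterations": max_iterations,
        "runtime": runtime,
        "run_as_openhands": False,
        "file_store": "local",
        # Absolute path derived from the current working directory
        "file_store_path": os.path.abspath(os.path.join(os.getcwd(), ".eval_sessions")),
    }
```

Because `file_store_path` is derived from `os.getcwd()`, sessions land under whichever directory the script is launched from; running `run_infer.py` from the repository root is what makes the store repo-local.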
--- a/frontend/tests/components/features/home/repo-selection-form.test.tsx
+++ b/frontend/tests/components/features/home/repo-selection-form.test.tsx
@@ -232,13 +232,16 @@ describe("RepositorySelectionForm", () => {
    renderForm();

    const dropdown = await screen.findByTestId("repo-dropdown");
-    const input = dropdown.querySelector('input[type="text"]') as HTMLInputElement;
+    const input = dropdown.querySelector(
+      'input[type="text"]',
+    ) as HTMLInputElement;
    expect(input).toBeInTheDocument();

    await userEvent.type(input, "https://github.com/kubernetes/kubernetes");
    expect(searchGitReposSpy).toHaveBeenLastCalledWith(
      "kubernetes/kubernetes",
      3,
+      "github",
    );
  });

@@ -268,13 +271,16 @@ describe("RepositorySelectionForm", () => {
    renderForm();

    const dropdown = await screen.findByTestId("repo-dropdown");
-    const input = dropdown.querySelector('input[type="text"]') as HTMLInputElement;
+    const input = dropdown.querySelector(
+      'input[type="text"]',
+    ) as HTMLInputElement;
    expect(input).toBeInTheDocument();

    await userEvent.type(input, "https://github.com/kubernetes/kubernetes");
    expect(searchGitReposSpy).toHaveBeenLastCalledWith(
      "kubernetes/kubernetes",
      3,
+      "github",
    );
  });
 });
--- a/frontend/tests/components/features/microagent-management/microagent-management.test.tsx
+++ b/frontend/tests/components/features/microagent-management/microagent-management.test.tsx
@@ -444,28 +444,38 @@ describe("MicroagentManagement", () => {
    expect(filePath2).toBeInTheDocument();
  });

-  it("should display add microagent button in repository accordion", async () => {
+  it("should render add microagent button", async () => {
    renderMicroagentManagement();

-    // Wait for repositories to be loaded
+    // Wait for repositories to be loaded and processed
    await waitFor(() => {
      expect(mockUseUserRepositories).toHaveBeenCalled();
    });

+    // Wait for repositories to be displayed in the accordion
+    await waitFor(() => {
+      expect(screen.getByTestId("repository-name-tooltip")).toBeInTheDocument();
+    });
+
    // Check that add microagent buttons are present
    const addButtons = screen.getAllByTestId("add-microagent-button");
    expect(addButtons.length).toBeGreaterThan(0);
  });

-  it("should open add microagent modal when add button is clicked", async () => {
+  it("should open modal when add button is clicked", async () => {
    const user = userEvent.setup();
    renderMicroagentManagement();

-    // Wait for repositories to be loaded
+    // Wait for repositories to be loaded and processed
    await waitFor(() => {
      expect(mockUseUserRepositories).toHaveBeenCalled();
    });

+    // Wait for repositories to be displayed in the accordion
+    await waitFor(() => {
+      expect(screen.getByTestId("repository-name-tooltip")).toBeInTheDocument();
+    });
+
    // Find and click the first add microagent button
    const addButtons = screen.getAllByTestId("add-microagent-button");
    await user.click(addButtons[0]);
@@ -1292,11 +1302,18 @@ describe("MicroagentManagement", () => {
    it("should render add microagent button", async () => {
      renderMicroagentManagement();

-      // Wait for repositories to be loaded
+      // Wait for repositories to be loaded and processed
      await waitFor(() => {
        expect(mockUseUserRepositories).toHaveBeenCalled();
      });

+      // Wait for repositories to be displayed in the accordion
+      await waitFor(() => {
+        expect(
+          screen.getByTestId("repository-name-tooltip"),
+        ).toBeInTheDocument();
+      });
+
      // Check that add microagent buttons are present
      const addButtons = screen.getAllByTestId("add-microagent-button");
      expect(addButtons.length).toBeGreaterThan(0);
@@ -1306,11 +1323,18 @@ describe("MicroagentManagement", () => {
      const user = userEvent.setup();
      renderMicroagentManagement();

-      // Wait for repositories to be loaded
+      // Wait for repositories to be loaded and processed
      await waitFor(() => {
        expect(mockUseUserRepositories).toHaveBeenCalled();
      });

+      // Wait for repositories to be displayed in the accordion
+      await waitFor(() => {
+        expect(
+          screen.getByTestId("repository-name-tooltip"),
+        ).toBeInTheDocument();
+      });
+
      // Find and click the first add microagent button
      const addButtons = screen.getAllByTestId("add-microagent-button");
      await user.click(addButtons[0]);
@@ -1361,11 +1385,18 @@ describe("MicroagentManagement", () => {
      const user = userEvent.setup();
      renderMicroagentManagement();

-      // Wait for repositories to be loaded
+      // Wait for repositories to be loaded and processed
      await waitFor(() => {
        expect(mockUseUserRepositories).toHaveBeenCalled();
      });

+      // Wait for repositories to be displayed in the accordion
+      await waitFor(() => {
+        expect(
+          screen.getByTestId("repository-name-tooltip"),
+        ).toBeInTheDocument();
+      });
+
      // Find and click the first add microagent button
      const addButtons = screen.getAllByTestId("add-microagent-button");
      await user.click(addButtons[0]);
@@ -1385,11 +1416,18 @@ describe("MicroagentManagement", () => {
      const user = userEvent.setup();
      renderMicroagentManagement();

-      // Wait for repositories to be loaded
+      // Wait for repositories to be loaded and processed
      await waitFor(() => {
        expect(mockUseUserRepositories).toHaveBeenCalled();
      });

+      // Wait for repositories to be displayed in the accordion
+      await waitFor(() => {
+        expect(
+          screen.getByTestId("repository-name-tooltip"),
+        ).toBeInTheDocument();
+      });
+
      // Find and click the first add microagent button
      const addButtons = screen.getAllByTestId("add-microagent-button");
      await user.click(addButtons[0]);
@@ -1408,11 +1446,18 @@ describe("MicroagentManagement", () => {
      const user = userEvent.setup();
      renderMicroagentManagement();

-      // Wait for repositories to be loaded
+      // Wait for repositories to be loaded and processed
      await waitFor(() => {
        expect(mockUseUserRepositories).toHaveBeenCalled();
      });

+      // Wait for repositories to be displayed in the accordion
+      await waitFor(() => {
+        expect(
+          screen.getByTestId("repository-name-tooltip"),
+        ).toBeInTheDocument();
+      });
+
      // Find and click the first add microagent button
      const addButtons = screen.getAllByTestId("add-microagent-button");
      await user.click(addButtons[0]);
@@ -1441,11 +1486,18 @@ describe("MicroagentManagement", () => {
      const user = userEvent.setup();
      renderMicroagentManagement();

-      // Wait for repositories to be loaded
+      // Wait for repositories to be loaded and processed
      await waitFor(() => {
        expect(mockUseUserRepositories).toHaveBeenCalled();
      });

+      // Wait for repositories to be displayed in the accordion
+      await waitFor(() => {
+        expect(
+          screen.getByTestId("repository-name-tooltip"),
+        ).toBeInTheDocument();
+      });
+
      // Find and click the first add microagent button
      const addButtons = screen.getAllByTestId("add-microagent-button");
      await user.click(addButtons[0]);
@@ -1468,11 +1520,18 @@ describe("MicroagentManagement", () => {
      const user = userEvent.setup();
      renderMicroagentManagement();

-      // Wait for repositories to be loaded
+      // Wait for repositories to be loaded and processed
      await waitFor(() => {
        expect(mockUseUserRepositories).toHaveBeenCalled();
      });

+      // Wait for repositories to be displayed in the accordion
+      await waitFor(() => {
+        expect(
+          screen.getByTestId("repository-name-tooltip"),
+        ).toBeInTheDocument();
+      });
+
      // Find and click the first add microagent button
      const addButtons = screen.getAllByTestId("add-microagent-button");
      await user.click(addButtons[0]);
@@ -1494,11 +1553,18 @@ describe("MicroagentManagement", () => {
      const user = userEvent.setup();
      renderMicroagentManagement();

-      // Wait for repositories to be loaded
+      // Wait for repositories to be loaded and processed
      await waitFor(() => {
        expect(mockUseUserRepositories).toHaveBeenCalled();
      });

+      // Wait for repositories to be displayed in the accordion
+      await waitFor(() => {
+        expect(
+          screen.getByTestId("repository-name-tooltip"),
+        ).toBeInTheDocument();
+      });
+
      // Find and click the first add microagent button
      const addButtons = screen.getAllByTestId("add-microagent-button");
      await user.click(addButtons[0]);
--- a/frontend/tests/routes/settings.test.tsx
+++ b/frontend/tests/routes/settings.test.tsx
@@ -136,7 +136,7 @@ describe("Settings Screen", () => {
      "secrets",
      "api keys",
    ];
-    const sectionsToExclude = ["llm", "mcp"];
+    const sectionsToExclude = ["llm"];

    renderSettingsScreen();

--- a/frontend/package-lock.json
+++ b/frontend/package-lock.json
@@ -1,12 +1,12 @@
 {
  "name": "openhands-frontend",
-  "version": "0.53.0",
+  "version": "0.54.0",
  "lockfileVersion": 3,
  "requires": true,
  "packages": {
    "": {
      "name": "openhands-frontend",
-      "version": "0.53.0",
+      "version": "0.54.0",
      "dependencies": {
        "@heroui/react": "^2.8.2",
        "@heroui/use-infinite-scroll": "^2.2.10",
--- a/frontend/package.json
+++ b/frontend/package.json
@@ -1,6 +1,6 @@
 {
  "name": "openhands-frontend",
-  "version": "0.53.0",
+  "version": "0.54.0",
  "private": true,
  "type": "module",
  "engines": {
--- a/frontend/src/components/common/git-repository-dropdown.tsx
+++ b/frontend/src/components/common/git-repository-dropdown.tsx
@@ -1,7 +1,9 @@
-import { useCallback, useMemo, useRef } from "react";
+import { useCallback, useMemo, useState } from "react";
 import { useTranslation } from "react-i18next";
 import { Provider } from "../../types/settings";
 import { useGitRepositories } from "../../hooks/query/use-git-repositories";
+import { useSearchRepositories } from "../../hooks/query/use-search-repositories";
+import { useDebounce } from "../../hooks/use-debounce";
 import OpenHands from "../../api/open-hands";
 import { GitRepository } from "../../types/git";
 import {
@@ -19,10 +21,6 @@ export interface GitRepositoryDropdownProps {
  onChange?: (repository?: GitRepository) => void;
 }

-interface SearchCache {
-  [key: string]: GitRepository[];
-}
-
 export function GitRepositoryDropdown({
  provider,
  value,
@@ -33,6 +31,20 @@ export function GitRepositoryDropdown({
  onChange,
 }: GitRepositoryDropdownProps) {
  const { t } = useTranslation();
+  const [searchInput, setSearchInput] = useState("");
+  const debouncedSearchInput = useDebounce(searchInput, 300);
+
+  // Process search input to handle URLs
+  const processedSearchInput = useMemo(() => {
+    if (debouncedSearchInput.startsWith("https://")) {
+      const match = debouncedSearchInput.match(
+        /https:\/\/[^/]+\/([^/]+\/[^/]+)/,
+      );
+      return match ? match[1] : debouncedSearchInput;
+    }
+    return debouncedSearchInput;
+  }, [debouncedSearchInput]);
+
  const {
    data,
    fetchNextPage,
@@ -45,6 +57,10 @@ export function GitRepositoryDropdown({
    enabled: !disabled,
  });

+  // Search query for processed input (handles URLs)
+  const { data: searchData, isLoading: isSearchLoading } =
+    useSearchRepositories(processedSearchInput, provider);
+
  const allOptions: AsyncSelectOption[] = useMemo(
    () =>
      data?.pages
@@ -58,75 +74,83 @@ export function GitRepositoryDropdown({
    [data],
  );

-  // Keep track of search results
-  const searchCache = useRef<SearchCache>({});
+  const searchOptions: AsyncSelectOption[] = useMemo(
+    () =>
+      searchData
+        ? searchData.map((repo) => ({
+            value: repo.id,
+            label: repo.full_name,
+          }))
+        : [],
+    [searchData],
+  );

  const selectedOption = useMemo(() => {
    // First check in loaded pages
    const option = allOptions.find((opt) => opt.value === value);
    if (option) return option;

-    // If not found, check in search cache
-    const repo = Object.values(searchCache.current)
-      .flat()
-      .find((r) => r.id === value);
-
-    if (repo) {
-      return {
-        value: repo.id,
-        label: repo.full_name,
-      };
-    }
+    // If not found, check in search results
+    const searchOption = searchOptions.find((opt) => opt.value === value);
+    if (searchOption) return searchOption;

    return null;
-  }, [allOptions, value]);
+  }, [allOptions, searchOptions, value]);

  const loadOptions = useCallback(
    async (inputValue: string): Promise<AsyncSelectOption[]> => {
+      // Update search input to trigger debounced search
+      setSearchInput(inputValue);
+
      // If empty input, show all loaded options
      if (!inputValue.trim()) {
        return allOptions;
      }

-      // If it looks like a URL, extract the repo name and search
+      // For very short inputs, do local filtering
+      if (inputValue.length < 2) {
+        return allOptions.filter((option) =>
+          option.label.toLowerCase().includes(inputValue.toLowerCase()),
+        );
+      }
+
+      // Handle URL inputs by performing direct search
      if (inputValue.startsWith("https://")) {
        const match = inputValue.match(/https:\/\/[^/]+\/([^/]+\/[^/]+)/);
        if (match) {
          const repoName = match[1];
-          const searchResults = await OpenHands.searchGitRepositories(
-            repoName,
-            3,
-          );
-          // Cache the search results
-          searchCache.current[repoName] = searchResults;
-          return searchResults.map((repo) => ({
-            value: repo.id,
-            label: repo.full_name,
-          }));
+          try {
+            // Perform direct search for URL-based inputs
+            const repositories = await OpenHands.searchGitRepositories(
+              repoName,
+              3,
+              provider,
+            );
+            return repositories.map((repo) => ({
+              value: repo.id,
+              label: repo.full_name,
+              data: repo,
+            }));
+          } catch (error) {
+            // Fall back to local filtering if search fails
+            return allOptions.filter((option) =>
+              option.label.toLowerCase().includes(repoName.toLowerCase()),
+            );
+          }
        }
      }

-      // For any other input, search via API
-      if (inputValue.length >= 2) {
-        // Only search if at least 2 characters
-        const searchResults = await OpenHands.searchGitRepositories(
-          inputValue,
-          10,
-        );
-        // Cache the search results
-        searchCache.current[inputValue] = searchResults;
-        return searchResults.map((repo) => ({
-          value: repo.id,
-          label: repo.full_name,
-        }));
+      // For regular text inputs, use hook-based search results if available
+      if (searchOptions.length > 0 && processedSearchInput === inputValue) {
+        return searchOptions;
      }

-      // For very short inputs, do local filtering
+      // Fallback to local filtering while search is loading
      return allOptions.filter((option) =>
        option.label.toLowerCase().includes(inputValue.toLowerCase()),
      );
    },
-    [allOptions],
+    [allOptions, searchOptions, processedSearchInput, provider],
  );

  const handleChange = (option: AsyncSelectOption | null) => {
@@ -142,9 +166,7 @@ export function GitRepositoryDropdown({

    // If not found, check in search results
    if (!repo) {
-      repo = Object.values(searchCache.current)
-        .flat()
-        .find((r) => r.id === option.value);
+      repo = searchData?.find((r) => r.id === option.value);
    }

    onChange?.(repo);
@@ -167,7 +189,7 @@ export function GitRepositoryDropdown({
        errorMessage={errorMessage}
        disabled={disabled}
        isClearable={false}
-        isLoading={isLoading || isLoading || isFetchingNextPage}
+        isLoading={isLoading || isFetchingNextPage || isSearchLoading}
        cacheOptions
        defaultOptions={allOptions}
        onChange={handleChange}
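The URL normalization in `GitRepositoryDropdown` (used both for `processedSearchInput` and inside `loadOptions`) is easy to misread, so here is a Python mirror of the same regular expression, for illustration only. `process_search_input` is a hypothetical name, and `re.search` stands in for JavaScript's non-global `String.prototype.match`, which likewise scans the whole string rather than anchoring at the start.

```python
import re

# Mirrors the TypeScript pattern /https:\/\/[^/]+\/([^/]+\/[^/]+)/
REPO_URL_PATTERN = re.compile(r"https://[^/]+/([^/]+/[^/]+)")


def process_search_input(value: str) -> str:
    """Return 'owner/repo' for an https:// repository URL, else the input unchanged."""
    if value.startswith("https://"):
        match = REPO_URL_PATTERN.search(value)
        # No owner/repo pair after the host: keep the raw input
        return match.group(1) if match else value
    return value
```

Two consequences follow from the pattern: only the first two path segments are kept, so a GitLab subgroup URL such as `https://gitlab.com/group/sub/project` becomes `group/sub`, and `http://` URLs are passed through unchanged.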
--- a/frontend/src/components/features/microagent-management/microagent-management-accordion-title.tsx
+++ b/frontend/src/components/features/microagent-management/microagent-management-accordion-title.tsx
@@ -17,7 +17,7 @@ export function MicroagentManagementAccordionTitle({
        <TooltipButton
          tooltip={repository.full_name}
          ariaLabel={repository.full_name}
-          className="text-white text-base font-normal bg-transparent p-0 min-w-0 h-auto cursor-pointer truncate max-w-[232px]"
+          className="text-white text-base font-normal bg-transparent p-0 min-w-0 h-auto cursor-pointer truncate max-w-[200px] translate-y-[-1px]"
          testId="repository-name-tooltip"
          placement="bottom"
        >
--- a/frontend/src/components/features/microagent-management/microagent-management-add-microagent-button.tsx
+++ b/frontend/src/components/features/microagent-management/microagent-management-add-microagent-button.tsx
@@ -7,8 +7,6 @@ import {
 } from "#/state/microagent-management-slice";
 import { RootState } from "#/store";
 import { GitRepository } from "#/types/git";
-import PlusIcon from "#/icons/plus.svg?react";
-import { TooltipButton } from "#/components/shared/buttons/tooltip-button";

 interface MicroagentManagementAddMicroagentButtonProps {
  repository: GitRepository;
@@ -25,23 +23,22 @@ export function MicroagentManagementAddMicroagentButton({

  const dispatch = useDispatch();

-  const handleClick = (e: React.MouseEvent<HTMLDivElement>) => {
+  const handleClick = (e: React.MouseEvent<HTMLButtonElement>) => {
    e.stopPropagation();
    dispatch(setAddMicroagentModalVisible(!addMicroagentModalVisible));
    dispatch(setSelectedRepository(repository));
  };

  return (
-    <div onClick={handleClick}>
-      <TooltipButton
-        tooltip={t(I18nKey.COMMON$ADD_MICROAGENT)}
-        ariaLabel={t(I18nKey.COMMON$ADD_MICROAGENT)}
-        className="p-0 min-w-0 h-6 w-6 flex items-center justify-center bg-transparent cursor-pointer"
-        testId="add-microagent-button"
-        placement="bottom"
-      >
-        <PlusIcon width={22} height={22} />
-      </TooltipButton>
-    </div>
+    <button
+      type="button"
+      onClick={handleClick}
+      className="translate-y-[-1px]"
+      data-testid="add-microagent-button"
+    >
+      <span className="text-sm font-normal leading-5 text-[#8480FF] cursor-pointer hover:text-[#6C63FF] transition-colors duration-200">
+        {t(I18nKey.COMMON$ADD_MICROAGENT)}
+      </span>
+    </button>
  );
 }
--- a/frontend/src/components/features/microagent-management/microagent-management-content.tsx
+++ b/frontend/src/components/features/microagent-management/microagent-management-content.tsx
@@ -1,4 +1,5 @@
 import React, { useEffect, useState } from "react";
+import { useTranslation } from "react-i18next";
 import { useDispatch, useSelector } from "react-redux";
 import { MicroagentManagementSidebar } from "./microagent-management-sidebar";
 import { MicroagentManagementMain } from "./microagent-management-main";
@@ -25,6 +26,12 @@ import { GitRepository } from "#/types/git";
 import { queryClient } from "#/query-client-config";
 import { Provider } from "#/types/settings";
 import { MicroagentManagementLearnThisRepoModal } from "./microagent-management-learn-this-repo-modal";
+import {
+  displaySuccessToast,
+  displayErrorToast,
+} from "#/utils/custom-toast-handlers";
+import { getFirstPRUrl } from "#/utils/parse-pr-url";
+import { I18nKey } from "#/i18n/declaration";

 // Handle error events
 const isErrorEvent = (evt: unknown): evt is { error: true; message: string } =>
@@ -112,6 +119,8 @@ export function MicroagentManagementContent() {
    learnThisRepoModalVisible,
  } = useSelector((state: RootState) => state.microagentManagement);

+  const { t } = useTranslation();
+
  const dispatch = useDispatch();

  const { createConversationAndSubscribe, isPending } =
@@ -159,6 +168,37 @@ export function MicroagentManagementContent() {
          ? (selectedRepository as GitRepository).full_name
          : "";

+      // Check if agent is running and ready to work
+      if (
+        isOpenHandsEvent(socketEvent) &&
+        isAgentStateChangeObservation(socketEvent) &&
+        socketEvent.extras.agent_state === AgentState.RUNNING
+      ) {
+        displaySuccessToast(
+          t(I18nKey.MICROAGENT_MANAGEMENT$OPENING_PR_TO_CREATE_MICROAGENT),
+        );
+      }
+
+      // Check if agent has finished and we have a PR
+      if (isOpenHandsEvent(socketEvent) && isFinishAction(socketEvent)) {
+        const prUrl = getFirstPRUrl(socketEvent.args.final_thought || "");
+        if (prUrl) {
+          displaySuccessToast(
+            t(I18nKey.MICROAGENT_MANAGEMENT$PR_READY_FOR_REVIEW),
+          );
+        } else {
+          // Agent finished but no PR found
+          displaySuccessToast(t(I18nKey.MICROAGENT_MANAGEMENT$PR_NOT_CREATED));
+        }
+      }
+
+      // Handle error events
+      if (isErrorEvent(socketEvent) || isAgentStatusError(socketEvent)) {
+        displayErrorToast(
+          t(I18nKey.MICROAGENT_MANAGEMENT$ERROR_CREATING_MICROAGENT),
+        );
+      }
+
      if (shouldInvalidateConversationsList(socketEvent)) {
        invalidateConversationsList(repositoryName);
      }
--- a/frontend/src/components/features/microagent-management/microagent-management-repo-microagents.tsx
+++ b/frontend/src/components/features/microagent-management/microagent-management-repo-microagents.tsx
@@ -65,6 +65,18 @@ export function MicroagentManagementRepoMicroagents({
    }
  }, [conversations]);

+  useEffect(
+    () => () => {
+      dispatch(
+        setSelectedMicroagentItem({
+          microagent: null,
+          conversation: null,
+        }),
+      );
+    },
+    [],
+  );
+
  // Show loading only when both queries are loading
  const isLoading = isLoadingMicroagents || isLoadingConversations;

@@ -82,7 +94,7 @@ export function MicroagentManagementRepoMicroagents({
  // If there's an error with microagents, show the learn this repo component
  if (isError) {
    return (
-      <div className="pb-4">
+      <div>
        <MicroagentManagementLearnThisRepo repository={repository} />
      </div>
    );
@@ -93,7 +105,7 @@ export function MicroagentManagementRepoMicroagents({
  const totalItems = numberOfMicroagents + numberOfConversations;

  return (
-    <div className="pb-4">
+    <div>
      {totalItems === 0 && (
        <MicroagentManagementLearnThisRepo repository={repository} />
      )}
--- a/frontend/src/components/features/microagent-management/microagent-management-repositories.tsx
+++ b/frontend/src/components/features/microagent-management/microagent-management-repositories.tsx
@@ -97,8 +97,10 @@ export function MicroagentManagementRepositories({
        variant="splitted"
        className="w-full px-0 gap-3"
        itemClasses={{
-          base: "shadow-none bg-transparent border border-[#ffffff40] rounded-[6px] cursor-pointer",
-          trigger: "cursor-pointer gap-1",
+          base: "shadow-none bg-transparent cursor-pointer px-0",
+          trigger: "cursor-pointer gap-2 py-3",
+          indicator:
+            "flex items-center justify-center p-0.5 pr-[3px] text-white hover:bg-[#454545] rounded transition-colors duration-200 rotate-180",
        }}
        selectionMode="multiple"
      >
--- a/frontend/src/components/features/settings/mcp-settings/tests/mcp-server-form.validation.test.tsx
+++ b/frontend/src/components/features/settings/mcp-settings/tests/mcp-server-form.validation.test.tsx
@@ -0,0 +1,110 @@
+import { render, screen, fireEvent } from "@testing-library/react";
+import { describe, it, expect, vi } from "vitest";
+import { MCPServerForm } from "../mcp-server-form";
+
+// i18n mock
+vi.mock("react-i18next", () => ({
+  useTranslation: () => ({
+    t: (key: string) => key,
+  }),
+}));
+
+describe("MCPServerForm validation", () => {
+  const noop = () => {};
+
+  it("rejects invalid env var lines and allows blank lines", () => {
+    const onSubmit = vi.fn();
+
+    render(
+      <MCPServerForm
+        mode="add"
+        server={{ id: "tmp", type: "stdio" }}
+        existingServers={[]}
+        onSubmit={onSubmit}
+        onCancel={noop}
+      />,
+    );
+
+    // Fill required fields
+    fireEvent.change(screen.getByTestId("name-input"), {
+      target: { value: "my-server" },
+    });
+    fireEvent.change(screen.getByTestId("command-input"), {
+      target: { value: "npx" },
+    });
+
+    // Invalid env entries mixed with blank lines
+    fireEvent.change(screen.getByTestId("env-input"), {
+      target: { value: "invalid\n\nKEY=value\n=novalue\nKEY_ONLY=" },
+    });
+
+    fireEvent.click(screen.getByTestId("submit-button"));
+
+    // Should show invalid env format error
+    expect(
+      screen.getByText("SETTINGS$MCP_ERROR_ENV_INVALID_FORMAT"),
+    ).toBeInTheDocument();
+
+    // Fix env with valid lines and blank lines
+    fireEvent.change(screen.getByTestId("env-input"), {
+      target: { value: "KEY=value\n\nANOTHER=123" },
+    });
+
+    fireEvent.click(screen.getByTestId("submit-button"));
+
+    // No error; submit should be called
+    expect(onSubmit).toHaveBeenCalledTimes(1);
+  });
+
+  it("rejects duplicate URLs across sse/shttp types", () => {
+    const onSubmit = vi.fn();
+
+    const existingServers = [
+      { id: "sse-1", type: "sse" as const, url: "https://api.example.com" },
+      { id: "shttp-1", type: "shttp" as const, url: "https://x.example.com" },
+    ];
+
+    const r1 = render(
+      <MCPServerForm
+        mode="add"
+        server={{ id: "tmp", type: "sse" }}
+        existingServers={existingServers}
+        onSubmit={onSubmit}
+        onCancel={noop}
+      />,
+    );
+
+    fireEvent.change(screen.getAllByTestId("url-input")[0], {
+      target: { value: "https://api.example.com" },
+    });
+
+    fireEvent.click(screen.getAllByTestId("submit-button")[0]);
+    expect(
+      screen.getByText("SETTINGS$MCP_ERROR_URL_DUPLICATE"),
+    ).toBeInTheDocument();
+
+    // Unmount first form, then check shttp duplicate
+    r1.unmount();
+
+    const r2 = render(
+      <MCPServerForm
+        mode="add"
+        server={{ id: "tmp2", type: "shttp" }}
+        existingServers={existingServers}
+        onSubmit={onSubmit}
+        onCancel={noop}
+      />,
+    );
+
+    fireEvent.change(screen.getAllByTestId("url-input")[0], {
+      target: { value: "https://api.example.com" },
+    });
+
+    fireEvent.click(screen.getAllByTestId("submit-button")[0]);
+    expect(
+      screen.getByText("SETTINGS$MCP_ERROR_URL_DUPLICATE"),
+    ).toBeInTheDocument();
+
+    r2.unmount();
+  });
+});
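The duplicate-URL test above relies on two checks in `mcp-server-form.tsx`. `validateUrl` requires a parseable URL with an `http:` or `https:` protocol. `validateUrlUniqueness` treats SSE and SHTTP URLs as one shared namespace and skips the check when an edit leaves the URL unchanged. The sketch below combines both into one hypothetical function, `checkServerUrl`, which returns short error codes in place of the translated `I18nKey` messages. The names `checkServerUrl`, `parseUrl`, and `ServerLike` are illustrative only and are not part of the codebase.

```typescript
// Hypothetical standalone version of the URL checks in mcp-server-form.tsx.
type ServerType = "sse" | "stdio" | "shttp";

interface ServerLike {
  id: string;
  type: ServerType;
  url?: string;
}

type UrlError = "required" | "invalid" | "protocol" | "duplicate";

// new URL() throws a TypeError on unparseable input; map that to null.
function parseUrl(url: string): URL | null {
  try {
    return new URL(url);
  } catch {
    return null;
  }
}

// Returns an error code (stand-in for a translated message), or null if acceptable.
function checkServerUrl(
  url: string,
  existing: ServerLike[],
  originalUrl?: string, // set only when editing an existing server
): UrlError | null {
  if (!url) return "required";
  const parsed = parseUrl(url);
  if (!parsed) return "invalid";
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return "protocol";
  }
  // Editing without changing the URL must not flag the server against itself.
  if (originalUrl !== undefined && originalUrl === url) return null;
  // SSE and SHTTP share one namespace: a URL taken by either type is a duplicate.
  const taken = existing.some(
    (s) => (s.type === "sse" || s.type === "shttp") && s.url === url,
  );
  return taken ? "duplicate" : null;
}
```

Checking the format before uniqueness follows the order used in `validateForm`, so a malformed URL is reported as invalid rather than as a duplicate.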
--- a/frontend/src/components/features/settings/mcp-settings/tests/mcp-server-list.test.tsx
+++ b/frontend/src/components/features/settings/mcp-settings/tests/mcp-server-list.test.tsx
@@ -0,0 +1,158 @@
+import { render, screen } from "@testing-library/react";
+import { describe, it, expect, vi } from "vitest";
+import { MCPServerList } from "../mcp-server-list";
+
+// Mock react-i18next
+vi.mock("react-i18next", () => ({
+  useTranslation: () => ({
+    t: (key: string) => key,
+  }),
+}));
+
+const mockServers = [
+  {
+    id: "sse-0",
+    type: "sse" as const,
+    url: "https://very-long-url-that-could-cause-layout-overflow.example.com/api/v1/mcp/server/endpoint/with/many/path/segments",
+  },
+  {
+    id: "stdio-0",
+    type: "stdio" as const,
+    name: "test-stdio-server",
+    command: "python",
+    args: ["-m", "test_server"],
+  },
+];
+
+describe("MCPServerList", () => {
+  it("should render servers with proper layout structure", () => {
+    const mockOnEdit = vi.fn();
+    const mockOnDelete = vi.fn();
+
+    render(
+      <MCPServerList
+        servers={mockServers}
+        onEdit={mockOnEdit}
+        onDelete={mockOnDelete}
+      />,
+    );
+
+    // Check that the table structure is rendered
+    const table = screen.getByRole("table");
+    expect(table).toBeInTheDocument();
+    expect(table).toHaveClass("w-full");
+
+    // Check that server items are rendered
+    const serverItems = screen.getAllByTestId("mcp-server-item");
+    expect(serverItems).toHaveLength(2);
+
+    // Check that action buttons are present for each server
+    const editButtons = screen.getAllByTestId("edit-mcp-server-button");
+    const deleteButtons = screen.getAllByTestId("delete-mcp-server-button");
+    expect(editButtons).toHaveLength(2);
+    expect(deleteButtons).toHaveLength(2);
+  });
+
+  it("should render empty state when no servers", () => {
+    const mockOnEdit = vi.fn();
+    const mockOnDelete = vi.fn();
+
+    render(
+      <MCPServerList
+        servers={[]}
+        onEdit={mockOnEdit}
+        onDelete={mockOnDelete}
+      />,
+    );
+
+    expect(screen.getByText("SETTINGS$MCP_NO_SERVERS")).toBeInTheDocument();
+  });
+
+  it("should handle long URLs without breaking layout", () => {
+    const longUrlServer = {
+      id: "sse-0",
+      type: "sse" as const,
+      url: "https://extremely-long-url-that-would-previously-cause-layout-overflow-and-push-action-buttons-out-of-view.example.com/api/v1/mcp/server/endpoint/with/many/path/segments/and/query/parameters?param1=value1&param2=value2&param3=value3",
+    };
+
+    const mockOnEdit = vi.fn();
+    const mockOnDelete = vi.fn();
+
+    render(
+      <MCPServerList
+        servers={[longUrlServer]}
+        onEdit={mockOnEdit}
+        onDelete={mockOnDelete}
+      />,
+    );
+
+    // Check that action buttons are still present and accessible
+    const editButton = screen.getByTestId("edit-mcp-server-button");
+    const deleteButton = screen.getByTestId("delete-mcp-server-button");
+
+    expect(editButton).toBeInTheDocument();
+    expect(deleteButton).toBeInTheDocument();
+
+    // Check that the URL is properly displayed with title attribute for accessibility
+    const detailsCells = screen.getAllByTitle(longUrlServer.url);
+    expect(detailsCells).toHaveLength(2); // Name and Details columns both have the URL
+
+    // Check that both name and details cells use truncation and have title for tooltip
+    const [nameCell, detailsCell] = detailsCells;
+    expect(nameCell).toHaveClass("truncate");
+    expect(detailsCell).toHaveClass("truncate");
+  });
+
+  it("should display command and arguments for STDIO servers", () => {
+    const stdioServer = {
+      id: "stdio-1",
+      type: "stdio" as const,
+      name: "test-server",
+      command: "python",
+      args: ["-m", "test_module", "--verbose"],
+    };
+
+    const mockOnEdit = vi.fn();
+    const mockOnDelete = vi.fn();
+
+    render(
+      <MCPServerList
+        servers={[stdioServer]}
+        onEdit={mockOnEdit}
+        onDelete={mockOnDelete}
+      />,
+    );
+
+    // Check that the server details show command + arguments
+    const expectedDetails = "python -m test_module --verbose";
+    expect(screen.getByTitle(expectedDetails)).toBeInTheDocument();
+    expect(screen.getByText(expectedDetails)).toBeInTheDocument();
+  });
+
+  it("should fallback to server name for STDIO servers without command", () => {
+    const stdioServer = {
+      id: "stdio-2",
+      type: "stdio" as const,
+      name: "fallback-server",
+    };
+
+    const mockOnEdit = vi.fn();
+    const mockOnDelete = vi.fn();
+
+    render(
+      <MCPServerList
+        servers={[stdioServer]}
+        onEdit={mockOnEdit}
+        onDelete={mockOnDelete}
+      />,
+    );
+
+    // Check that the server details show the server name as fallback
+    // Both name and details columns will have the same value, so we expect 2 elements
+    const fallbackElements = screen.getAllByTitle("fallback-server");
+    expect(fallbackElements).toHaveLength(2);
+
+    const fallbackTextElements = screen.getAllByText("fallback-server");
+    expect(fallbackTextElements).toHaveLength(2);
+  });
+});
--- a/frontend/src/components/features/settings/mcp-settings/mcp-config-editor.tsx
+++ b/frontend/src/components/features/settings/mcp-settings/mcp-config-editor.tsx
@@ -1,78 +0,0 @@
-import React, { useState } from "react";
-import { useTranslation } from "react-i18next";
-import { MCPConfig } from "#/types/settings";
-import { I18nKey } from "#/i18n/declaration";
-import { MCPSSEServers } from "./mcp-sse-servers";
-import { MCPStdioServers } from "./mcp-stdio-servers";
-import { MCPJsonEditor } from "./mcp-json-editor";
-import { BrandButton } from "../brand-button";
-
-interface MCPConfigEditorProps {
-  mcpConfig?: MCPConfig;
-  onChange: (config: MCPConfig) => void;
-}
-
-export function MCPConfigEditor({ mcpConfig, onChange }: MCPConfigEditorProps) {
-  const { t } = useTranslation();
-  const [isEditing, setIsEditing] = useState(false);
-  const handleConfigChange = (newConfig: MCPConfig) => {
-    onChange(newConfig);
-    setIsEditing(false);
-  };
-
-  const config = mcpConfig || { sse_servers: [], stdio_servers: [] };
-
-  return (
-    <div>
-      <div className="flex flex-col gap-2 mb-6">
-        <div className="text-sm font-medium">
-          {t(I18nKey.SETTINGS$MCP_TITLE)}
-        </div>
-        <p className="text-xs text-[#A3A3A3]">
-          {t(I18nKey.SETTINGS$MCP_DESCRIPTION)}
-        </p>
-      </div>
-      {!isEditing && (
-        <div className="flex justify-between items-center mb-4">
-          <div className="flex items-center">
-            <BrandButton
-              type="button"
-              variant="primary"
-              onClick={() => setIsEditing(true)}
-            >
-              {t(I18nKey.SETTINGS$MCP_EDIT_CONFIGURATION)}
-            </BrandButton>
-          </div>
-        </div>
-      )}
-      <div>
-        {isEditing ? (
-          <MCPJsonEditor
-            mcpConfig={mcpConfig}
-            onChange={handleConfigChange}
-            onCancel={() => setIsEditing(false)}
-          />
-        ) : (
-          <>
-            <div className="flex flex-col gap-6">
-              <div>
-                <MCPSSEServers servers={config.sse_servers} />
-              </div>
-
-              <div>
-                <MCPStdioServers servers={config.stdio_servers} />
-              </div>
-            </div>
-
-            {config.sse_servers.length === 0 &&
-              config.stdio_servers.length === 0 && (
-                <div className="mt-4 p-2 bg-yellow-50 border border-yellow-200 rounded-md text-sm text-yellow-700">
-                  {t(I18nKey.SETTINGS$MCP_NO_SERVERS_CONFIGURED)}
-                </div>
-              )}
-          </>
-        )}
-      </div>
-    </div>
-  );
-}
--- a/frontend/src/components/features/settings/mcp-settings/mcp-json-editor.tsx
+++ b/frontend/src/components/features/settings/mcp-settings/mcp-json-editor.tsx
@@ -1,139 +0,0 @@
-import React, { useState, useRef, useEffect } from "react";
-import { useTranslation, Trans } from "react-i18next";
-import { MCPConfig } from "#/types/settings";
-import { I18nKey } from "#/i18n/declaration";
-import { BrandButton } from "../brand-button";
-import { cn } from "#/utils/utils";
-
-interface MCPJsonEditorProps {
-  mcpConfig?: MCPConfig;
-  onChange: (config: MCPConfig) => void;
-  onCancel: () => void;
-}
-
-const MCP_DEFAULT_CONFIG: MCPConfig = {
-  sse_servers: [],
-  stdio_servers: [],
-};
-
-export function MCPJsonEditor({
-  mcpConfig,
-  onChange,
-  onCancel,
-}: MCPJsonEditorProps) {
-  const { t } = useTranslation();
-  const [configText, setConfigText] = useState(() =>
-    mcpConfig
-      ? JSON.stringify(mcpConfig, null, 2)
-      : JSON.stringify(MCP_DEFAULT_CONFIG, null, 2),
-  );
-
-  const [error, setError] = useState<string | null>(null);
-
-  const textareaRef = useRef<HTMLTextAreaElement>(null);
-
-  useEffect(() => {
-    textareaRef.current?.focus();
-  }, []);
-
-  const handleTextChange = (e: React.ChangeEvent<HTMLTextAreaElement>) => {
-    setConfigText(e.target.value);
-  };
-
-  const handleSave = () => {
-    try {
-      const newConfig = JSON.parse(configText);
-
-      // Validate the structure
-      if (!newConfig.sse_servers || !Array.isArray(newConfig.sse_servers)) {
-        throw new Error(t(I18nKey.SETTINGS$MCP_ERROR_SSE_ARRAY));
-      }
-
-      if (!newConfig.stdio_servers || !Array.isArray(newConfig.stdio_servers)) {
-        throw new Error(t(I18nKey.SETTINGS$MCP_ERROR_STDIO_ARRAY));
-      }
-
-      // Validate SSE servers
-      for (const server of newConfig.sse_servers) {
-        if (
-          typeof server !== "string" &&
-          (!server.url || typeof server.url !== "string")
-        ) {
-          throw new Error(t(I18nKey.SETTINGS$MCP_ERROR_SSE_URL));
-        }
-      }
-
-      // Validate stdio servers
-      for (const server of newConfig.stdio_servers) {
-        if (!server.name || !server.command) {
-          throw new Error(t(I18nKey.SETTINGS$MCP_ERROR_STDIO_PROPS));
-        }
-      }
-
-      onChange(newConfig);
-      setError(null);
-    } catch (e) {
-      setError(
-        e instanceof Error
-          ? e.message
-          : t(I18nKey.SETTINGS$MCP_ERROR_INVALID_JSON),
-      );
-    }
-  };
-
-  return (
-    <div>
-      <p className="mb-2 text-sm text-gray-400">
-        <Trans
-          i18nKey={I18nKey.SETTINGS$MCP_CONFIG_DESCRIPTION}
-          components={{
-            a: (
-              <a
-                href="https://docs.all-hands.dev/usage/mcp"
-                target="_blank"
-                rel="noopener noreferrer"
-                className="text-blue-400 hover:underline"
-              >
-                documentation
-              </a>
-            ),
-          }}
-        />
-      </p>
-      <textarea
-        ref={textareaRef}
-        className={cn(
-          "w-full h-64 resize-y p-2 rounded-sm text-sm font-mono",
-          "bg-tertiary border border-[#717888]",
-          "placeholder:italic placeholder:text-tertiary-alt",
-          "focus:outline-none focus:ring-1 focus:ring-primary",
-          "disabled:bg-[#2D2F36] disabled:border-[#2D2F36] disabled:cursor-not-allowed",
-        )}
-        value={configText}
-        onChange={handleTextChange}
-        spellCheck="false"
-      />
-      {error && (
-        <div className="mt-2 p-2 bg-red-100 border border-red-300 rounded-md text-sm text-red-700">
-          <strong>{t(I18nKey.SETTINGS$MCP_CONFIG_ERROR)}</strong> {error}
-        </div>
-      )}
-      <div className="mt-2 text-sm text-gray-400">
-        <strong>{t(I18nKey.SETTINGS$MCP_CONFIG_EXAMPLE)}</strong>{" "}
-        <code>
-          {
-            '{ "sse_servers": ["https://example-mcp-server.com/sse"], "stdio_servers": [{ "name": "fetch", "command": "uvx", "args": ["mcp-server-fetch"] }] }'
-          }
-        </code>
-      </div>
-      <div className="mt-4 flex justify-end gap-3">
-        <BrandButton type="button" variant="secondary" onClick={onCancel}>
-          {t(I18nKey.BUTTON$CANCEL)}
-        </BrandButton>
-        <BrandButton type="button" variant="primary" onClick={handleSave}>
-          {t(I18nKey.SETTINGS$MCP_PREVIEW_CHANGES)}
-        </BrandButton>
-      </div>
-    </div>
-  );
-}
--- a/frontend/src/components/features/settings/mcp-settings/mcp-server-form.tsx
+++ b/frontend/src/components/features/settings/mcp-settings/mcp-server-form.tsx
@@ -0,0 +1,376 @@
+import React from "react";
+import { useTranslation } from "react-i18next";
+import { I18nKey } from "#/i18n/declaration";
+import { SettingsInput } from "../settings-input";
+import { SettingsDropdownInput } from "../settings-dropdown-input";
+import { BrandButton } from "../brand-button";
+import { OptionalTag } from "../optional-tag";
+import { cn } from "#/utils/utils";
+
+type MCPServerType = "sse" | "stdio" | "shttp";
+
+interface MCPServerConfig {
+  id: string;
+  type: MCPServerType;
+  name?: string;
+  url?: string;
+  api_key?: string;
+  command?: string;
+  args?: string[];
+  env?: Record<string, string>;
+}
+
+interface MCPServerFormProps {
+  mode: "add" | "edit";
+  server?: MCPServerConfig;
+  existingServers?: MCPServerConfig[];
+  onSubmit: (server: MCPServerConfig) => void;
+  onCancel: () => void;
+}
+
+export function MCPServerForm({
+  mode,
+  server,
+  existingServers,
+  onSubmit,
+  onCancel,
+}: MCPServerFormProps) {
+  const { t } = useTranslation();
+  const [serverType, setServerType] = React.useState<MCPServerType>(
+    server?.type || "sse",
+  );
+  const [error, setError] = React.useState<string | null>(null);
+
+  const serverTypeOptions = [
+    { key: "sse", label: t(I18nKey.SETTINGS$MCP_SERVER_TYPE_SSE) },
+    { key: "stdio", label: t(I18nKey.SETTINGS$MCP_SERVER_TYPE_STDIO) },
+    { key: "shttp", label: t(I18nKey.SETTINGS$MCP_SERVER_TYPE_SHTTP) },
+  ];
+
+  const validateUrl = (url: string): string | null => {
+    if (!url) return t(I18nKey.SETTINGS$MCP_ERROR_URL_REQUIRED);
+    try {
+      const urlObj = new URL(url);
+      if (!["http:", "https:"].includes(urlObj.protocol)) {
+        return t(I18nKey.SETTINGS$MCP_ERROR_URL_INVALID_PROTOCOL);
+      }
+    } catch {
+      return t(I18nKey.SETTINGS$MCP_ERROR_URL_INVALID);
+    }
+    return null;
+  };
+
+  const validateName = (name: string): string | null => {
+    if (!name) return t(I18nKey.SETTINGS$MCP_ERROR_NAME_REQUIRED);
+    if (!/^[a-zA-Z0-9_-]+$/.test(name)) {
+      return t(I18nKey.SETTINGS$MCP_ERROR_NAME_INVALID);
+    }
+    return null;
+  };
+
+  const validateNameUniqueness = (name: string): string | null => {
+    if (!existingServers) return null;
+    const shouldCheckUniqueness =
+      mode === "add" || (mode === "edit" && server?.name !== name);
+    if (!shouldCheckUniqueness) return null;
+
+    const existingStdioNames = existingServers
+      .filter((s) => s.type === "stdio")
+      .map((s) => s.name)
+      .filter(Boolean);
+    if (existingStdioNames.includes(name)) {
+      return t(I18nKey.SETTINGS$MCP_ERROR_NAME_DUPLICATE);
+    }
+    return null;
+  };
+
+  const validateCommand = (command: string): string | null => {
+    if (!command) return t(I18nKey.SETTINGS$MCP_ERROR_COMMAND_REQUIRED);
+    if (command.includes(" ")) {
+      return t(I18nKey.SETTINGS$MCP_ERROR_COMMAND_NO_SPACES);
+    }
+    return null;
+  };
+
+  const validateUrlUniqueness = (url: string): string | null => {
+    if (!existingServers) return null;
+    const originalUrl = server?.url;
+    const changed = mode === "add" || (mode === "edit" && originalUrl !== url);
+    if (!changed) return null;
+    // For URL-based servers (sse/shttp), ensure URL is unique across both types
+    const exists = existingServers.some(
+      (s) => (s.type === "sse" || s.type === "shttp") && s.url === url,
+    );
+    if (exists) return t(I18nKey.SETTINGS$MCP_ERROR_URL_DUPLICATE);
+    return null;
+  };
+
+  const validateEnvFormat = (envString: string): string | null => {
+    if (!envString.trim()) return null;
+    const lines = envString.split("\n");
+    for (let i = 0; i < lines.length; i += 1) {
+      const trimmed = lines[i].trim();
+      if (trimmed) {
+        const eq = trimmed.indexOf("=");
+        if (eq === -1) return t(I18nKey.SETTINGS$MCP_ERROR_ENV_INVALID_FORMAT);
+        const key = trimmed.substring(0, eq).trim();
+        if (!key) return t(I18nKey.SETTINGS$MCP_ERROR_ENV_INVALID_FORMAT);
+      }
+    }
+    return null;
+  };
+
+  const validateStdioServer = (formData: FormData): string | null => {
+    const name = formData.get("name")?.toString().trim() || "";
+    const command = formData.get("command")?.toString().trim() || "";
+    const envString = formData.get("env")?.toString() || "";
+
+    const nameError = validateName(name);
+    if (nameError) return nameError;
+
+    const uniquenessError = validateNameUniqueness(name);
+    if (uniquenessError) return uniquenessError;
+
+    const commandError = validateCommand(command);
+    if (commandError) return commandError;
+
+    // Validate environment variable format
+    const envError = validateEnvFormat(envString);
+    if (envError) return envError;
+
+    return null;
+  };
+
+  const validateForm = (formData: FormData): string | null => {
+    if (serverType === "sse" || serverType === "shttp") {
+      const url = formData.get("url")?.toString().trim() || "";
+      const urlError = validateUrl(url);
+      if (urlError) return urlError;
+      const urlDupError = validateUrlUniqueness(url);
+      if (urlDupError) return urlDupError;
+      return null;
+    }
+
+    if (serverType === "stdio") {
+      return validateStdioServer(formData);
+    }
+
+    return null;
+  };
+
+  const parseEnvironmentVariables = (
+    envString: string,
+  ): Record<string, string> => {
+    const env: Record<string, string> = {};
+    const input = envString.trim();
+    if (!input) return env;
+
+    for (const line of input.split("\n")) {
+      const trimmed = line.trim();
+      const eq = trimmed.indexOf("=");
+      const key = eq >= 0 ? trimmed.substring(0, eq).trim() : "";
+      if (trimmed && eq !== -1 && key) {
+        env[key] = trimmed.substring(eq + 1).trim();
+      }
+    }
+    return env;
+  };
+
+  const formatEnvironmentVariables = (env?: Record<string, string>): string => {
+    if (!env) return "";
+    return Object.entries(env)
+      .map(([key, value]) => `${key}=${value}`)
+      .join("\n");
+  };
+
+  const handleSubmit = (event: React.FormEvent<HTMLFormElement>) => {
+    event.preventDefault();
+    setError(null);
+
+    const formData = new FormData(event.currentTarget);
+    const validationError = validateForm(formData);
+
+    if (validationError) {
+      setError(validationError);
+      return;
+    }
+
+    const baseConfig = {
+      id: server?.id || `${serverType}-${Date.now()}`,
+      type: serverType,
+    };
+
+    if (serverType === "sse" || serverType === "shttp") {
+      const url = formData.get("url")?.toString().trim();
+      const apiKey = formData.get("api_key")?.toString().trim();
+
+      onSubmit({
+        ...baseConfig,
+        url: url!,
+        ...(apiKey && { api_key: apiKey }),
+      });
+    } else if (serverType === "stdio") {
+      const name = formData.get("name")?.toString().trim();
+      const command = formData.get("command")?.toString().trim();
+      const argsString = formData.get("args")?.toString().trim();
+      const envString = formData.get("env")?.toString().trim();
+
+      const args = argsString
+        ? argsString
+            .split("\n")
+            .map((arg) => arg.trim())
+            .filter(Boolean)
+        : [];
+      const env = parseEnvironmentVariables(envString || "");
+
+      onSubmit({
+        ...baseConfig,
+        name: name!,
+        command: command!,
+        ...(args.length > 0 && { args }),
+        ...(Object.keys(env).length > 0 && { env }),
+      });
+    }
+  };
+
+  const formTestId =
+    mode === "add" ? "add-mcp-server-form" : "edit-mcp-server-form";
+
+  return (
+    <form
+      data-testid={formTestId}
+      onSubmit={handleSubmit}
+      className="flex flex-col items-start gap-6"
+    >
+      {mode === "add" && (
+        <SettingsDropdownInput
+          testId="server-type-dropdown"
+          name="server-type"
+          label={t(I18nKey.SETTINGS$MCP_SERVER_TYPE)}
+          items={serverTypeOptions}
+          selectedKey={serverType}
+          onSelectionChange={(key) => setServerType(key as MCPServerType)}
+          onInputChange={() => {}} // Prevent input changes
+          isClearable={false}
+          allowsCustomValue={false}
+          required
+          wrapperClassName={cn("w-full", "max-w-[680px]")}
+        />
+      )}
+
+      {error && <p className="text-red-500 text-sm">{error}</p>}
+
+      {(serverType === "sse" || serverType === "shttp") && (
+        <>
+          <SettingsInput
+            testId="url-input"
+            name="url"
+            type="url"
+            label={t(I18nKey.SETTINGS$MCP_URL)}
+            className="w-full max-w-[680px]"
+            required
+            defaultValue={server?.url || ""}
+            placeholder="https://api.example.com"
+          />
+
+          <SettingsInput
+            testId="api-key-input"
+            name="api_key"
+            type="password"
+            label={t(I18nKey.SETTINGS$MCP_API_KEY)}
+            className="w-full max-w-[680px]"
+            showOptionalTag
+            defaultValue={server?.api_key || ""}
+            placeholder={t(I18nKey.SETTINGS$MCP_API_KEY_PLACEHOLDER)}
+          />
+        </>
+      )}
+
+      {serverType === "stdio" && (
+        <>
+          <SettingsInput
+            testId="name-input"
+            name="name"
+            type="text"
+            label={t(I18nKey.SETTINGS$MCP_NAME)}
+            className="w-full max-w-[680px]"
+            required
+            defaultValue={server?.name || ""}
+            placeholder="my-mcp-server"
+            pattern="^[a-zA-Z0-9_-]+$"
+          />
+
+          <SettingsInput
+            testId="command-input"
+            name="command"
+            type="text"
+            label={t(I18nKey.SETTINGS$MCP_COMMAND)}
+            className="w-full max-w-[680px]"
+            required
+            defaultValue={server?.command || ""}
+            placeholder="npx"
+          />
+
+          <label className="flex flex-col gap-2.5 w-full max-w-[680px]">
+            <div className="flex items-center gap-2">
+              <span className="text-sm">
+                {t(I18nKey.SETTINGS$MCP_COMMAND_ARGUMENTS)}
+              </span>
+              <OptionalTag />
+            </div>
+            <textarea
+              data-testid="args-input"
+              name="args"
+              rows={3}
+              defaultValue={server?.args?.join("\n") || ""}
+              placeholder="arg1&#10;arg2&#10;arg3"
+              className={cn(
+                "bg-tertiary border border-[#717888] w-full rounded-sm p-2 placeholder:italic placeholder:text-tertiary-alt resize-none",
+                "disabled:bg-[#2D2F36] disabled:border-[#2D2F36] disabled:cursor-not-allowed",
+              )}
+            />
+            <p className="text-xs text-tertiary-alt">
+              {t(I18nKey.SETTINGS$MCP_COMMAND_ARGUMENTS_HELP)}
+            </p>
+          </label>
+
+          <label className="flex flex-col gap-2.5 w-full max-w-[680px]">
+            <div className="flex items-center gap-2">
+              <span className="text-sm">
+                {t(I18nKey.SETTINGS$MCP_ENVIRONMENT_VARIABLES)}
+              </span>
+              <OptionalTag />
+            </div>
+            <textarea
+              data-testid="env-input"
+              name="env"
+              rows={4}
+              defaultValue={formatEnvironmentVariables(server?.env)}
+              placeholder="KEY1=value1&#10;KEY2=value2"
+              className={cn(
+                "resize-none",
+                "bg-tertiary border border-[#717888] rounded-sm p-2 placeholder:italic placeholder:text-tertiary-alt",
+                "disabled:bg-[#2D2F36] disabled:border-[#2D2F36] disabled:cursor-not-allowed",
+              )}
+            />
+          </label>
+        </>
+      )}
+
+      <div className="flex items-center gap-4">
+        <BrandButton
+          testId="cancel-button"
+          type="button"
+          variant="secondary"
+          onClick={onCancel}
+        >
+          {t(I18nKey.BUTTON$CANCEL)}
+        </BrandButton>
+        <BrandButton testId="submit-button" type="submit" variant="primary">
+          {mode === "add" && t(I18nKey.SETTINGS$MCP_ADD_SERVER)}
+          {mode === "edit" && t(I18nKey.SETTINGS$MCP_SAVE_SERVER)}
+        </BrandButton>
+      </div>
+    </form>
+  );
+}
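`validateEnvFormat` and `parseEnvironmentVariables` apply the same `KEY=value` rules. Blank lines are skipped, and every other line must contain `=` with a non-empty key before it. Only the first `=` separates key from value, so values may themselves contain `=`, and both key and value are trimmed. A minimal standalone sketch of those rules follows. `findInvalidEnvLine` is a hypothetical variant that returns the 1-based number of the first malformed line instead of a translated message. Neither function exists in the codebase.

```typescript
// Hypothetical helpers mirroring the KEY=value rules in mcp-server-form.tsx.

// Returns the 1-based number of the first malformed line, or null if all are valid.
function findInvalidEnvLine(envString: string): number | null {
  const lines = envString.split("\n");
  for (let i = 0; i < lines.length; i += 1) {
    const trimmed = lines[i].trim();
    if (!trimmed) continue; // blank lines are allowed
    const eq = trimmed.indexOf("=");
    // A line needs an "=" and a non-empty key before it.
    if (eq === -1 || !trimmed.substring(0, eq).trim()) return i + 1;
  }
  return null;
}

// Parses KEY=value lines, splitting on the first "=" only.
function parseEnvLines(envString: string): Record<string, string> {
  const env: Record<string, string> = {};
  for (const line of envString.split("\n")) {
    const trimmed = line.trim();
    const eq = trimmed.indexOf("=");
    if (eq === -1) continue; // also skips blank lines
    const key = trimmed.substring(0, eq).trim();
    if (key) env[key] = trimmed.substring(eq + 1).trim();
  }
  return env;
}

// Inputs modeled on mcp-server-form.validation.test.tsx
console.log(findInvalidEnvLine("invalid\n\nKEY=value\n=novalue\nKEY_ONLY=")); // 1
console.log(findInvalidEnvLine("KEY=value\n\n=novalue")); // 3
console.log(JSON.stringify(parseEnvLines("KEY=value\n\nANOTHER=123\nURL=a=b")));
// {"KEY":"value","ANOTHER":"123","URL":"a=b"}
```

Reporting the line number is only one possible extension; the form as written shows a single `SETTINGS$MCP_ERROR_ENV_INVALID_FORMAT` message for any malformed line.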
--- a/frontend/src/components/features/settings/mcp-settings/mcp-server-list-item.tsx
+++ b/frontend/src/components/features/settings/mcp-settings/mcp-server-list-item.tsx
@@ -0,0 +1,110 @@
+import { FaPencil, FaTrash } from "react-icons/fa6";
+import { useTranslation } from "react-i18next";
+import { I18nKey } from "#/i18n/declaration";
+
+interface MCPServerConfig {
+  id: string;
+  type: "sse" | "stdio" | "shttp";
+  name?: string;
+  url?: string;
+  api_key?: string;
+  command?: string;
+  args?: string[];
+  env?: Record<string, string>;
+}
+
+export function MCPServerListItem({
+  server,
+  onEdit,
+  onDelete,
+}: {
+  server: MCPServerConfig;
+  onEdit: () => void;
+  onDelete: () => void;
+}) {
+  const { t } = useTranslation();
+
+  const getServerTypeLabel = (type: string) => {
+    switch (type) {
+      case "sse":
+        return t(I18nKey.SETTINGS$MCP_SERVER_TYPE_SSE);
+      case "stdio":
+        return t(I18nKey.SETTINGS$MCP_SERVER_TYPE_STDIO);
+      case "shttp":
+        return t(I18nKey.SETTINGS$MCP_SERVER_TYPE_SHTTP);
+      default:
+        return type.toUpperCase();
+    }
+  };
+
+  const getServerDescription = (serverConfig: MCPServerConfig) => {
+    if (serverConfig.type === "stdio") {
+      if (serverConfig.command) {
+        const args =
+          serverConfig.args && serverConfig.args.length > 0
+            ? ` ${serverConfig.args.join(" ")}`
+            : "";
+        return `${serverConfig.command}${args}`;
+      }
+      return serverConfig.name || "";
+    }
+    if (
+      (serverConfig.type === "sse" || serverConfig.type === "shttp") &&
+      serverConfig.url
+    ) {
+      return serverConfig.url;
+    }
+    return "";
+  };
+
+  const serverName = server.type === "stdio" ? server.name : server.url;
+  const serverDescription = getServerDescription(server);
+
+  return (
+    <tr
+      data-testid="mcp-server-item"
+      className="grid grid-cols-[minmax(0,0.25fr)_120px_minmax(0,1fr)_120px] gap-4 items-start border-t border-tertiary"
+    >
+      <td
+        className="p-3 text-sm text-content-2 truncate min-w-0"
+        title={serverName}
+      >
+        {serverName}
+      </td>
+
+      <td className="p-3 text-sm text-content-2 whitespace-nowrap">
+        {getServerTypeLabel(server.type)}
+      </td>
+
+      <td
+        className="p-3 text-sm text-content-2 opacity-80 italic min-w-0 truncate"
+        title={serverDescription}
+      >
+        <span className="inline-block max-w-full align-bottom">
+          {serverDescription}
+        </span>
+      </td>
+
+      <td className="p-3 flex items-start justify-end gap-4 whitespace-nowrap">
+        <button
+          data-testid="edit-mcp-server-button"
+          type="button"
+          onClick={onEdit}
+          aria-label={`Edit ${serverName}`}
+          className="cursor-pointer hover:text-content-1 transition-colors"
+        >
+          <FaPencil size={16} />
+        </button>
+        <button
+          data-testid="delete-mcp-server-button"
+          type="button"
+          onClick={onDelete}
+          aria-label={`Delete ${serverName}`}
+          className="cursor-pointer hover:text-content-1 transition-colors"
+        >
+          <FaTrash size={16} />
+        </button>
+      </td>
+    </tr>
+  );
+}
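The details column is produced by `getServerDescription`. STDIO servers show the command followed by their space-joined arguments and fall back to the server name when no command is set; SSE and SHTTP servers show their URL. `mcp-server-list.test.tsx` covers these cases. A condensed sketch follows, using the hypothetical names `describeServer` and `ServerSummary`, which are not part of the codebase.

```typescript
// Hypothetical condensed version of getServerDescription from mcp-server-list-item.tsx.
interface ServerSummary {
  type: "sse" | "stdio" | "shttp";
  name?: string;
  url?: string;
  command?: string;
  args?: string[];
}

// Text shown in the "details" column of the MCP server table.
function describeServer(server: ServerSummary): string {
  if (server.type === "stdio") {
    // Command plus space-joined arguments; fall back to the name without a command.
    if (server.command) {
      return [server.command, ...(server.args ?? [])].join(" ");
    }
    return server.name ?? "";
  }
  // SSE and SHTTP servers are identified by their URL.
  return server.url ?? "";
}
```

Joining `[command, ...args]` produces the same string as the original's conditional `` ` ${args.join(" ")}` `` suffix, including when `args` is empty or missing.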
--- a/frontend/src/components/features/settings/mcp-settings/mcp-server-list.tsx
+++ b/frontend/src/components/features/settings/mcp-settings/mcp-server-list.tsx
@@ -0,0 +1,71 @@
+import { useTranslation } from "react-i18next";
+import { MCPServerListItem } from "./mcp-server-list-item";
+import { I18nKey } from "#/i18n/declaration";
+
+interface MCPServerConfig {
+  id: string;
+  type: "sse" | "stdio" | "shttp";
+  name?: string;
+  url?: string;
+  api_key?: string;
+  command?: string;
+  args?: string[];
+  env?: Record<string, string>;
+}
+
+interface MCPServerListProps {
+  servers: MCPServerConfig[];
+  onEdit: (server: MCPServerConfig) => void;
+  onDelete: (serverId: string) => void;
+}
+
+export function MCPServerList({
+  servers,
+  onEdit,
+  onDelete,
+}: MCPServerListProps) {
+  const { t } = useTranslation();
+
+  if (servers.length === 0) {
+    return (
+      <div className="border border-tertiary rounded-md p-8 text-center">
+        <p className="text-content-2 text-sm">
+          {t(I18nKey.SETTINGS$MCP_NO_SERVERS)}
+        </p>
+      </div>
+    );
+  }
+
+  return (
+    <div className="border border-tertiary rounded-md overflow-hidden">
+      <table className="w-full">
+        <thead className="bg-base-tertiary">
+          <tr className="grid grid-cols-[minmax(0,0.25fr)_120px_minmax(0,1fr)_120px] gap-4 items-start">
+            <th className="text-left p-3 text-sm font-medium">
+              {t(I18nKey.SETTINGS$NAME)}
+            </th>
+            <th className="text-left p-3 text-sm font-medium">
+              {t(I18nKey.SETTINGS$MCP_SERVER_TYPE)}
+            </th>
+            <th className="text-left p-3 text-sm font-medium">
+              {t(I18nKey.SETTINGS$MCP_SERVER_DETAILS)}
+            </th>
+            <th className="text-right p-3 text-sm font-medium">
+              {t(I18nKey.SETTINGS$ACTIONS)}
+            </th>
+          </tr>
+        </thead>
+        <tbody>
+          {servers.map((server) => (
+            <MCPServerListItem
+              key={server.id}
+              server={server}
+              onEdit={() => onEdit(server)}
+              onDelete={() => onDelete(server.id)}
+            />
+          ))}
+        </tbody>
+      </table>
+    </div>
+  );
+}
--- a/frontend/src/components/shared/modals/modal-backdrop.tsx
+++ b/frontend/src/components/shared/modals/modal-backdrop.tsx
@@ -23,7 +23,7 @@ export function ModalBackdrop({ children, onClose }: ModalBackdropProps) {
    <div className="fixed inset-0 flex items-center justify-center z-20">
      <div
        onClick={handleClick}
-        className="fixed inset-0 bg-black bg-opacity-80"
+        className="fixed inset-0 bg-black opacity-60"
      />
      <div className="relative">{children}</div>
    </div>
--- a/frontend/src/hooks/mutation/use-add-mcp-server.ts
+++ b/frontend/src/hooks/mutation/use-add-mcp-server.ts
@@ -0,0 +1,67 @@
+import { useMutation, useQueryClient } from "@tanstack/react-query";
+import { useSettings } from "#/hooks/query/use-settings";
+import OpenHands from "#/api/open-hands";
+import { MCPSSEServer, MCPStdioServer, MCPSHTTPServer } from "#/types/settings";
+
+type MCPServerType = "sse" | "stdio" | "shttp";
+
+interface MCPServerConfig {
+  type: MCPServerType;
+  name?: string;
+  url?: string;
+  api_key?: string;
+  command?: string;
+  args?: string[];
+  env?: Record<string, string>;
+}
+
+export function useAddMcpServer() {
+  const queryClient = useQueryClient();
+  const { data: settings } = useSettings();
+
+  return useMutation({
+    mutationFn: async (server: MCPServerConfig): Promise<void> => {
+      if (!settings) return;
+
+      const currentConfig = settings.MCP_CONFIG || {
+        sse_servers: [],
+        stdio_servers: [],
+        shttp_servers: [],
+      };
+
+      const newConfig = { ...currentConfig };
+
+      if (server.type === "sse") {
+        const sseServer: MCPSSEServer = {
+          url: server.url!,
+          ...(server.api_key && { api_key: server.api_key }),
+        };
+        newConfig.sse_servers = [...newConfig.sse_servers, sseServer];
+      } else if (server.type === "stdio") {
+        const stdioServer: MCPStdioServer = {
+          name: server.name!,
+          command: server.command!,
+          ...(server.args && { args: server.args }),
+          ...(server.env && { env: server.env }),
+        };
+        newConfig.stdio_servers = [...newConfig.stdio_servers, stdioServer];
+      } else if (server.type === "shttp") {
+        const shttpServer: MCPSHTTPServer = {
+          url: server.url!,
+          ...(server.api_key && { api_key: server.api_key }),
+        };
+        newConfig.shttp_servers = [...newConfig.shttp_servers, shttpServer];
+      }
+
+      const apiSettings = {
+        mcp_config: newConfig,
+      };
+
+      await OpenHands.saveSettings(apiSettings);
+    },
+    onSuccess: () => {
+      // Invalidate the settings query to trigger a refetch
+      queryClient.invalidateQueries({ queryKey: ["settings"] });
+    },
+  });
+}
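Note on `useAddMcpServer`: it appends the new entry to the list that matches its transport. Below is a minimal sketch of that step as a pure function, assuming the `MCPConfig` shape from `frontend/src/types/settings.ts`. `NewServer`, `urlEntry`, and `addServer` are illustrative names, not part of the codebase. The sketch copies the array instead of calling `push`. That way, the object returned by `useSettings` (presumably TanStack Query's cached data) is not modified before the save succeeds.

```typescript
// Local copies of the shapes in frontend/src/types/settings.ts.
type MCPSSEServer = { url: string; api_key?: string };
type MCPSHTTPServer = { url: string; api_key?: string };
type MCPStdioServer = {
  name: string;
  command: string;
  args?: string[];
  env?: Record<string, string>;
};
type MCPConfig = {
  sse_servers: (string | MCPSSEServer)[];
  stdio_servers: MCPStdioServer[];
  shttp_servers: (string | MCPSHTTPServer)[];
};

// Hypothetical input type: one variant per transport.
type NewServer =
  | { type: "sse"; url: string; api_key?: string }
  | { type: "shttp"; url: string; api_key?: string }
  | ({ type: "stdio" } & MCPStdioServer);

// Build a URL entry, leaving out api_key when it is empty.
function urlEntry(url: string, apiKey?: string): MCPSSEServer {
  return apiKey ? { url, api_key: apiKey } : { url };
}

// Return a new config with `server` appended to its list. The touched array
// is copied, so the input config is never mutated.
function addServer(config: MCPConfig, server: NewServer): MCPConfig {
  if (server.type === "sse") {
    return {
      ...config,
      sse_servers: [...config.sse_servers, urlEntry(server.url, server.api_key)],
    };
  }
  if (server.type === "shttp") {
    return {
      ...config,
      shttp_servers: [
        ...config.shttp_servers,
        urlEntry(server.url, server.api_key),
      ],
    };
  }
  // Only the stdio variant remains here.
  const entry: MCPStdioServer = { name: server.name, command: server.command };
  if (server.args) entry.args = server.args;
  if (server.env) entry.env = server.env;
  return { ...config, stdio_servers: [...config.stdio_servers, entry] };
}
```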
--- a/frontend/src/hooks/mutation/use-delete-mcp-server.ts
+++ b/frontend/src/hooks/mutation/use-delete-mcp-server.ts
@@ -0,0 +1,37 @@
+import { useMutation, useQueryClient } from "@tanstack/react-query";
+import { useSettings } from "#/hooks/query/use-settings";
+import OpenHands from "#/api/open-hands";
+import { MCPConfig } from "#/types/settings";
+
+export function useDeleteMcpServer() {
+  const queryClient = useQueryClient();
+  const { data: settings } = useSettings();
+
+  return useMutation({
+    mutationFn: async (serverId: string): Promise<void> => {
+      if (!settings?.MCP_CONFIG) return;
+
+      const newConfig: MCPConfig = { ...settings.MCP_CONFIG };
+      const [serverType, indexStr] = serverId.split("-");
+      const index = parseInt(indexStr, 10);
+
+      if (serverType === "sse") {
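+        newConfig.sse_servers = newConfig.sse_servers.filter((_, i) => i !== index);
+      } else if (serverType === "stdio") {
+        newConfig.stdio_servers = newConfig.stdio_servers.filter((_, i) => i !== index);
+      } else if (serverType === "shttp") {
+        newConfig.shttp_servers = newConfig.shttp_servers.filter((_, i) => i !== index);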
+      }
+
+      const apiSettings = {
+        mcp_config: newConfig,
+      };
+
+      await OpenHands.saveSettings(apiSettings);
+    },
+    onSuccess: () => {
+      // Invalidate the settings query to trigger a refetch
+      queryClient.invalidateQueries({ queryKey: ["settings"] });
+    },
+  });
+}
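Note on `useDeleteMcpServer`: it decodes the row id that the settings screen builds as `${type}-${index}`. Below is a sketch of the decode-then-remove step, assuming that id format. `parseServerId` and `deleteServer` are hypothetical helpers. The regex is stricter than `split("-")` plus `parseInt`: a malformed id yields `null` instead of a `NaN` index. This matters because `splice(NaN, 1)` treats the start as 0 and would remove the first entry.

```typescript
type MCPSSEServer = { url: string; api_key?: string };
type MCPSHTTPServer = { url: string; api_key?: string };
type MCPStdioServer = {
  name: string;
  command: string;
  args?: string[];
  env?: Record<string, string>;
};
type MCPConfig = {
  sse_servers: (string | MCPSSEServer)[];
  stdio_servers: MCPStdioServer[];
  shttp_servers: (string | MCPSHTTPServer)[];
};

type ServerType = "sse" | "stdio" | "shttp";

// Parse ids of the form `${type}-${index}`, e.g. "stdio-2".
// Returns null for anything that does not match exactly.
function parseServerId(id: string): { type: ServerType; index: number } | null {
  const match = /^(sse|stdio|shttp)-(\d+)$/.exec(id);
  if (!match) return null;
  return { type: match[1] as ServerType, index: Number(match[2]) };
}

// Return a new config without the addressed server; unknown ids are a no-op.
function deleteServer(config: MCPConfig, serverId: string): MCPConfig {
  const parsed = parseServerId(serverId);
  if (!parsed) return config;
  const { type, index } = parsed;
  const keep = (_: unknown, i: number) => i !== index;
  if (type === "sse") {
    return { ...config, sse_servers: config.sse_servers.filter(keep) };
  }
  if (type === "stdio") {
    return { ...config, stdio_servers: config.stdio_servers.filter(keep) };
  }
  return { ...config, shttp_servers: config.shttp_servers.filter(keep) };
}
```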
--- a/frontend/src/hooks/mutation/use-update-mcp-server.ts
+++ b/frontend/src/hooks/mutation/use-update-mcp-server.ts
@@ -0,0 +1,69 @@
+import { useMutation, useQueryClient } from "@tanstack/react-query";
+import { useSettings } from "#/hooks/query/use-settings";
+import OpenHands from "#/api/open-hands";
+import { MCPSSEServer, MCPStdioServer, MCPSHTTPServer } from "#/types/settings";
+
+type MCPServerType = "sse" | "stdio" | "shttp";
+
+interface MCPServerConfig {
+  type: MCPServerType;
+  name?: string;
+  url?: string;
+  api_key?: string;
+  command?: string;
+  args?: string[];
+  env?: Record<string, string>;
+}
+
+export function useUpdateMcpServer() {
+  const queryClient = useQueryClient();
+  const { data: settings } = useSettings();
+
+  return useMutation({
+    mutationFn: async ({
+      serverId,
+      server,
+    }: {
+      serverId: string;
+      server: MCPServerConfig;
+    }): Promise<void> => {
+      if (!settings?.MCP_CONFIG) return;
+
+      const newConfig = { ...settings.MCP_CONFIG };
+      const [serverType, indexStr] = serverId.split("-");
+      const index = parseInt(indexStr, 10);
+
+      if (serverType === "sse") {
+        const sseServer: MCPSSEServer = {
+          url: server.url!,
+          ...(server.api_key && { api_key: server.api_key }),
+        };
+        newConfig.sse_servers = newConfig.sse_servers.map((s, i) => (i === index ? sseServer : s));
+      } else if (serverType === "stdio") {
+        const stdioServer: MCPStdioServer = {
+          name: server.name!,
+          command: server.command!,
+          ...(server.args && { args: server.args }),
+          ...(server.env && { env: server.env }),
+        };
+        newConfig.stdio_servers = newConfig.stdio_servers.map((s, i) => (i === index ? stdioServer : s));
+      } else if (serverType === "shttp") {
+        const shttpServer: MCPSHTTPServer = {
+          url: server.url!,
+          ...(server.api_key && { api_key: server.api_key }),
+        };
+        newConfig.shttp_servers = newConfig.shttp_servers.map((s, i) => (i === index ? shttpServer : s));
+      }
+
+      const apiSettings = {
+        mcp_config: newConfig,
+      };
+
+      await OpenHands.saveSettings(apiSettings);
+    },
+    onSuccess: () => {
+      // Invalidate the settings query to trigger a refetch
+      queryClient.invalidateQueries({ queryKey: ["settings"] });
+    },
+  });
+}
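Note on `useUpdateMcpServer`: it writes the edited entry back at the decoded index. Below is a generic sketch of that replacement, done as a copy rather than an in-place assignment. `replaceAt` is a hypothetical helper. Wiring it in per list, for example `stdio_servers: replaceAt(config.stdio_servers, index, stdioServer)`, is an assumption.

```typescript
// Return a copy of `list` with the element at `index` replaced.
// Indices outside the list (including NaN) return an unchanged copy,
// whereas `list[index] = value` would extend the array or add a stray key.
function replaceAt<T>(list: readonly T[], index: number, value: T): T[] {
  if (!Number.isInteger(index) || index < 0 || index >= list.length) {
    return [...list];
  }
  return list.map((item, i) => (i === index ? value : item));
}

// Usage (hypothetical values): prints "https://a.example, https://c.example"
const before = ["https://a.example", "https://b.example"];
const after = replaceAt(before, 1, "https://c.example");
console.log(after.join(", "));
```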
--- a/frontend/src/i18n/declaration.ts
+++ b/frontend/src/i18n/declaration.ts
@@ -781,4 +781,37 @@ export enum I18nKey {
  PROJECT_MANAGEMENT$SVC_ACC_EMAIL_VALIDATION_ERROR = "PROJECT_MANAGEMENT$SVC_ACC_EMAIL_VALIDATION_ERROR",
  PROJECT_MANAGEMENT$SVC_ACC_API_KEY_VALIDATION_ERROR = "PROJECT_MANAGEMENT$SVC_ACC_API_KEY_VALIDATION_ERROR",
  MICROAGENT_MANAGEMENT$ERROR_LOADING_MICROAGENT_CONTENT = "MICROAGENT_MANAGEMENT$ERROR_LOADING_MICROAGENT_CONTENT",
+  SETTINGS$MCP_ERROR_ENV_INVALID_FORMAT = "SETTINGS$MCP_ERROR_ENV_INVALID_FORMAT",
+  SETTINGS$MCP_ERROR_URL_DUPLICATE = "SETTINGS$MCP_ERROR_URL_DUPLICATE",
+  SETTINGS$MCP_SERVER_TYPE_SSE = "SETTINGS$MCP_SERVER_TYPE_SSE",
+  SETTINGS$MCP_SERVER_TYPE_STDIO = "SETTINGS$MCP_SERVER_TYPE_STDIO",
+  SETTINGS$MCP_SERVER_TYPE_SHTTP = "SETTINGS$MCP_SERVER_TYPE_SHTTP",
+  SETTINGS$MCP_ERROR_URL_REQUIRED = "SETTINGS$MCP_ERROR_URL_REQUIRED",
+  SETTINGS$MCP_ERROR_URL_INVALID_PROTOCOL = "SETTINGS$MCP_ERROR_URL_INVALID_PROTOCOL",
+  SETTINGS$MCP_ERROR_URL_INVALID = "SETTINGS$MCP_ERROR_URL_INVALID",
+  SETTINGS$MCP_ERROR_NAME_REQUIRED = "SETTINGS$MCP_ERROR_NAME_REQUIRED",
+  SETTINGS$MCP_ERROR_NAME_INVALID = "SETTINGS$MCP_ERROR_NAME_INVALID",
+  SETTINGS$MCP_ERROR_NAME_DUPLICATE = "SETTINGS$MCP_ERROR_NAME_DUPLICATE",
+  SETTINGS$MCP_ERROR_COMMAND_REQUIRED = "SETTINGS$MCP_ERROR_COMMAND_REQUIRED",
+  SETTINGS$MCP_ERROR_COMMAND_NO_SPACES = "SETTINGS$MCP_ERROR_COMMAND_NO_SPACES",
+  SETTINGS$MCP_SERVER_TYPE = "SETTINGS$MCP_SERVER_TYPE",
+  SETTINGS$MCP_API_KEY_PLACEHOLDER = "SETTINGS$MCP_API_KEY_PLACEHOLDER",
+  SETTINGS$MCP_COMMAND_ARGUMENTS = "SETTINGS$MCP_COMMAND_ARGUMENTS",
+  SETTINGS$MCP_COMMAND_ARGUMENTS_HELP = "SETTINGS$MCP_COMMAND_ARGUMENTS_HELP",
+  SETTINGS$MCP_ENVIRONMENT_VARIABLES = "SETTINGS$MCP_ENVIRONMENT_VARIABLES",
+  SETTINGS$MCP_ADD_SERVER = "SETTINGS$MCP_ADD_SERVER",
+  SETTINGS$MCP_SAVE_SERVER = "SETTINGS$MCP_SAVE_SERVER",
+  SETTINGS$MCP_NO_SERVERS = "SETTINGS$MCP_NO_SERVERS",
+  SETTINGS$MCP_SERVER_DETAILS = "SETTINGS$MCP_SERVER_DETAILS",
+  SETTINGS$MCP_CONFIRM_DELETE = "SETTINGS$MCP_CONFIRM_DELETE",
+  SETTINGS$MCP_CONFIRM_CHANGES = "SETTINGS$MCP_CONFIRM_CHANGES",
+  SETTINGS$MCP_DEFAULT_CONFIG = "SETTINGS$MCP_DEFAULT_CONFIG",
+  PROJECT_MANAGEMENT$WORKSPACE_NAME_PLACEHOLDER = "PROJECT_MANAGEMENT$WORKSPACE_NAME_PLACEHOLDER",
+  PROJECT_MANAGEMENT$CONFIGURE_MODAL_DESCRIPTION = "PROJECT_MANAGEMENT$CONFIGURE_MODAL_DESCRIPTION",
+  PROJECT_MANAGEMENT$IMPORTANT_WORKSPACE_INTEGRATION = "PROJECT_MANAGEMENT$IMPORTANT_WORKSPACE_INTEGRATION",
+  SETTINGS = "SETTINGS",
+  MICROAGENT_MANAGEMENT$OPENING_PR_TO_CREATE_MICROAGENT = "MICROAGENT_MANAGEMENT$OPENING_PR_TO_CREATE_MICROAGENT",
+  MICROAGENT_MANAGEMENT$PR_READY_FOR_REVIEW = "MICROAGENT_MANAGEMENT$PR_READY_FOR_REVIEW",
+  MICROAGENT_MANAGEMENT$PR_NOT_CREATED = "MICROAGENT_MANAGEMENT$PR_NOT_CREATED",
+  MICROAGENT_MANAGEMENT$ERROR_CREATING_MICROAGENT = "MICROAGENT_MANAGEMENT$ERROR_CREATING_MICROAGENT",
 }
--- a/frontend/src/i18n/translation.json
+++ b/frontend/src/i18n/translation.json
--- a/frontend/src/routes/mcp-settings.tsx
+++ b/frontend/src/routes/mcp-settings.tsx
@@ -1,86 +1,191 @@
-import React, { useState, useEffect } from "react";
+import React, { useState } from "react";
 import { useTranslation } from "react-i18next";
-import posthog from "posthog-js";
 import { useSettings } from "#/hooks/query/use-settings";
-import { useSaveSettings } from "#/hooks/mutation/use-save-settings";
-import { MCPConfig } from "#/types/settings";
-import { MCPConfigEditor } from "#/components/features/settings/mcp-settings/mcp-config-editor";
-import { BrandButton } from "#/components/features/settings/brand-button";
+import { useDeleteMcpServer } from "#/hooks/mutation/use-delete-mcp-server";
+import { useAddMcpServer } from "#/hooks/mutation/use-add-mcp-server";
+import { useUpdateMcpServer } from "#/hooks/mutation/use-update-mcp-server";
 import { I18nKey } from "#/i18n/declaration";
-import {
-  displayErrorToast,
-  displaySuccessToast,
-} from "#/utils/custom-toast-handlers";
-import { retrieveAxiosErrorMessage } from "#/utils/retrieve-axios-error-message";
+
+import { MCPServerList } from "#/components/features/settings/mcp-settings/mcp-server-list";
+import { MCPServerForm } from "#/components/features/settings/mcp-settings/mcp-server-form";
+import { ConfirmationModal } from "#/components/shared/modals/confirmation-modal";
+import { BrandButton } from "#/components/features/settings/brand-button";
+import { MCPConfig } from "#/types/settings";
+
+type MCPServerType = "sse" | "stdio" | "shttp";
+
+interface MCPServerConfig {
+  id: string;
+  type: MCPServerType;
+  name?: string;
+  url?: string;
+  api_key?: string;
+  command?: string;
+  args?: string[];
+  env?: Record<string, string>;
+}

 function MCPSettingsScreen() {
  const { t } = useTranslation();
  const { data: settings, isLoading } = useSettings();
-  const { mutate: saveSettings, isPending } = useSaveSettings();
+  const { mutate: deleteMcpServer } = useDeleteMcpServer();
+  const { mutate: addMcpServer } = useAddMcpServer();
+  const { mutate: updateMcpServer } = useUpdateMcpServer();

-  const [mcpConfig, setMcpConfig] = useState<MCPConfig | undefined>(undefined);
-  const [isDirty, setIsDirty] = useState(false);
+  const [view, setView] = useState<"list" | "add" | "edit">("list");
+  const [editingServer, setEditingServer] = useState<MCPServerConfig | null>(
+    null,
+  );
+  const [confirmationModalIsVisible, setConfirmationModalIsVisible] =
+    useState(false);
+  const [serverToDelete, setServerToDelete] = useState<string | null>(null);

-  useEffect(() => {
-    if (!mcpConfig && settings?.MCP_CONFIG) {
-      setMcpConfig(settings.MCP_CONFIG);
-    }
-  }, [settings, mcpConfig]);
-
-  const handleConfigChange = (config: MCPConfig) => {
-    setMcpConfig(config);
-    setIsDirty(true);
+  const mcpConfig: MCPConfig = settings?.MCP_CONFIG || {
+    sse_servers: [],
+    stdio_servers: [],
+    shttp_servers: [],
  };

-  const formAction = () => {
-    if (!settings) return;
+  // Convert servers to a unified format for display
+  const allServers: MCPServerConfig[] = [
+    ...mcpConfig.sse_servers.map((server, index) => ({
+      id: `sse-${index}`,
+      type: "sse" as const,
+      url: typeof server === "string" ? server : server.url,
+      api_key: typeof server === "object" ? server.api_key : undefined,
+    })),
+    ...mcpConfig.stdio_servers.map((server, index) => ({
+      id: `stdio-${index}`,
+      type: "stdio" as const,
+      name: server.name,
+      command: server.command,
+      args: server.args,
+      env: server.env,
+    })),
+    ...mcpConfig.shttp_servers.map((server, index) => ({
+      id: `shttp-${index}`,
+      type: "shttp" as const,
+      url: typeof server === "string" ? server : server.url,
+      api_key: typeof server === "object" ? server.api_key : undefined,
+    })),
+  ];

-    saveSettings(
-      { MCP_CONFIG: mcpConfig },
+  const handleAddServer = (serverConfig: MCPServerConfig) => {
+    addMcpServer(serverConfig, {
+      onSuccess: () => {
+        setView("list");
+      },
+    });
+  };
+
+  const handleEditServer = (serverConfig: MCPServerConfig) => {
+    updateMcpServer(
+      {
+        serverId: serverConfig.id,
+        server: serverConfig,
+      },
      {
        onSuccess: () => {
-          displaySuccessToast(t(I18nKey.SETTINGS$SAVED));
-          posthog.capture("settings_saved", {
-            HAS_MCP_CONFIG: mcpConfig ? "YES" : "NO",
-            MCP_SSE_SERVERS_COUNT: mcpConfig?.sse_servers?.length || 0,
-            MCP_STDIO_SERVERS_COUNT: mcpConfig?.stdio_servers?.length || 0,
-          });
-          setIsDirty(false);
-        },
-        onError: (error) => {
-          const errorMessage = retrieveAxiosErrorMessage(error);
-          displayErrorToast(errorMessage || t(I18nKey.ERROR$GENERIC));
+          setView("list");
        },
      },
    );
  };

+  const handleDeleteServer = (serverId: string) => {
+    deleteMcpServer(serverId, {
+      onSuccess: () => {
+        setConfirmationModalIsVisible(false);
+      },
+    });
+  };
+
+  const handleEditClick = (server: MCPServerConfig) => {
+    setEditingServer(server);
+    setView("edit");
+  };
+
+  const handleDeleteClick = (serverId: string) => {
+    setServerToDelete(serverId);
+    setConfirmationModalIsVisible(true);
+  };
+
+  const handleConfirmDelete = () => {
+    if (serverToDelete) {
+      handleDeleteServer(serverToDelete);
+      setServerToDelete(null);
+    }
+  };
+
+  const handleCancelDelete = () => {
+    setConfirmationModalIsVisible(false);
+    setServerToDelete(null);
+  };
+
  if (isLoading) {
-    return <div className="p-9">{t(I18nKey.HOME$LOADING)}</div>;
+    return (
+      <div className="px-11 py-9 flex flex-col gap-5">
+        <div className="animate-pulse">
+          <div className="h-6 bg-gray-300 rounded w-1/4 mb-4" />
+          <div className="h-4 bg-gray-300 rounded w-1/2 mb-8" />
+          <div className="h-10 bg-gray-300 rounded w-32" />
+        </div>
+      </div>
+    );
  }

  return (
-    <form
-      data-testid="mcp-settings-screen"
-      action={formAction}
-      className="flex flex-col h-full justify-between"
-    >
-      <div className="p-9 flex flex-col gap-12">
-        <MCPConfigEditor mcpConfig={mcpConfig} onChange={handleConfigChange} />
-      </div>
+    <div className="px-11 py-9 flex flex-col gap-5">
+      {view === "list" && (
+        <>
+          <BrandButton
+            testId="add-mcp-server-button"
+            type="button"
+            variant="primary"
+            onClick={() => setView("add")}
+            isDisabled={isLoading}
+          >
+            {t(I18nKey.SETTINGS$MCP_ADD_SERVER)}
+          </BrandButton>

-      <div className="flex gap-6 p-6 justify-end border-t border-t-tertiary">
-        <BrandButton
-          testId="submit-button"
-          type="submit"
-          variant="primary"
-          isDisabled={!isDirty || isPending}
-        >
-          {!isPending && t(I18nKey.SETTINGS$SAVE_CHANGES)}
-          {isPending && t(I18nKey.SETTINGS$SAVING)}
-        </BrandButton>
-      </div>
-    </form>
+          <MCPServerList
+            servers={allServers}
+            onEdit={handleEditClick}
+            onDelete={handleDeleteClick}
+          />
+        </>
+      )}
+
+      {view === "add" && (
+        <MCPServerForm
+          mode="add"
+          existingServers={allServers}
+          onSubmit={handleAddServer}
+          onCancel={() => setView("list")}
+        />
+      )}
+
+      {view === "edit" && editingServer && (
+        <MCPServerForm
+          mode="edit"
+          server={editingServer}
+          existingServers={allServers}
+          onSubmit={handleEditServer}
+          onCancel={() => {
+            setView("list");
+            setEditingServer(null);
+          }}
+        />
+      )}
+
+      {confirmationModalIsVisible && (
+        <ConfirmationModal
+          text={t(I18nKey.SETTINGS$MCP_CONFIRM_DELETE)}
+          onConfirm={handleConfirmDelete}
+          onCancel={handleCancelDelete}
+        />
+      )}
+    </div>
  );
 }

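Note on the settings screen: it flattens the three lists in `MCP_CONFIG` into one array of rows whose `id` is `${type}-${index}`. The mutation hooks decode that id again. Below is a standalone sketch of the flattening that follows the `allServers` mapping above. `ServerRow`, `urlRow`, and `toRows` are illustrative names, not the component's code.

```typescript
type MCPSSEServer = { url: string; api_key?: string };
type MCPSHTTPServer = { url: string; api_key?: string };
type MCPStdioServer = {
  name: string;
  command: string;
  args?: string[];
  env?: Record<string, string>;
};
type MCPConfig = {
  sse_servers: (string | MCPSSEServer)[];
  stdio_servers: MCPStdioServer[];
  shttp_servers: (string | MCPSHTTPServer)[];
};

// Flattened row, mirroring the screen's local MCPServerConfig interface.
type ServerRow = {
  id: string;
  type: "sse" | "stdio" | "shttp";
  name?: string;
  url?: string;
  api_key?: string;
  command?: string;
  args?: string[];
  env?: Record<string, string>;
};

// URL-based entries may be bare strings or objects; normalize both.
function urlRow(
  type: "sse" | "shttp",
  entry: string | MCPSSEServer,
  index: number,
): ServerRow {
  const id = `${type}-${index}`;
  return typeof entry === "string"
    ? { id, type, url: entry }
    : { id, type, url: entry.url, api_key: entry.api_key };
}

// Ids encode the list and position, so they are only valid until the
// underlying arrays change (e.g. a delete shifts later indices).
function toRows(config: MCPConfig): ServerRow[] {
  return [
    ...config.sse_servers.map((s, i) => urlRow("sse", s, i)),
    ...config.stdio_servers.map(
      (s, i): ServerRow => ({ id: `stdio-${i}`, type: "stdio", ...s }),
    ),
    ...config.shttp_servers.map((s, i) => urlRow("shttp", s, i)),
  ];
}
```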
--- a/frontend/src/routes/settings.tsx
+++ b/frontend/src/routes/settings.tsx
@@ -23,6 +23,7 @@ const SAAS_NAV_ITEMS = [
  { to: "/settings/billing", text: "SETTINGS$NAV_CREDITS" },
  { to: "/settings/secrets", text: "SETTINGS$NAV_SECRETS" },
  { to: "/settings/api-keys", text: "SETTINGS$NAV_API_KEYS" },
+  { to: "/settings/mcp", text: "SETTINGS$NAV_MCP" },
 ];

 const OSS_NAV_ITEMS = [
--- a/frontend/src/services/settings.ts
+++ b/frontend/src/services/settings.ts
@@ -26,6 +26,7 @@ export const DEFAULT_SETTINGS: Settings = {
  MCP_CONFIG: {
    sse_servers: [],
    stdio_servers: [],
+    shttp_servers: [],
  },
  GIT_USER_NAME: "openhands",
  GIT_USER_EMAIL: "openhands@all-hands.dev",
--- a/frontend/src/types/settings.ts
+++ b/frontend/src/types/settings.ts
@@ -24,9 +24,15 @@ export type MCPStdioServer = {
  env?: Record<string, string>;
 };

+export type MCPSHTTPServer = {
+  url: string;
+  api_key?: string;
+};
+
 export type MCPConfig = {
  sse_servers: (string | MCPSSEServer)[];
  stdio_servers: MCPStdioServer[];
+  shttp_servers: (string | MCPSHTTPServer)[];
 };

 export type Settings = {
@@ -77,6 +83,7 @@ export type ApiSettings = {
  mcp_config?: {
    sse_servers: (string | MCPSSEServer)[];
    stdio_servers: MCPStdioServer[];
+    shttp_servers: (string | MCPSHTTPServer)[];
  };
  email?: string;
  email_verified?: boolean;
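Note on `MCPConfig`: it now carries a third list, and both URL-based lists accept either a bare string or an object. Below is a small sketch of normalizing such values on the client. `normalizeMcpConfig` and `toShttpServer` are hypothetical helpers. The idea that a stored config might lack `shttp_servers` is an assumption, not something this diff shows.

```typescript
type MCPSSEServer = { url: string; api_key?: string };
type MCPSHTTPServer = { url: string; api_key?: string };
type MCPStdioServer = {
  name: string;
  command: string;
  args?: string[];
  env?: Record<string, string>;
};
type MCPConfig = {
  sse_servers: (string | MCPSSEServer)[];
  stdio_servers: MCPStdioServer[];
  shttp_servers: (string | MCPSHTTPServer)[];
};

// Hypothetical helper: fill in any list missing from a stored config,
// e.g. one saved before shttp_servers existed (an assumption).
function normalizeMcpConfig(raw: Partial<MCPConfig> | undefined): MCPConfig {
  return {
    sse_servers: raw?.sse_servers ?? [],
    stdio_servers: raw?.stdio_servers ?? [],
    shttp_servers: raw?.shttp_servers ?? [],
  };
}

// The union allows a bare URL string; convert it to the object form.
function toShttpServer(entry: string | MCPSHTTPServer): MCPSHTTPServer {
  return typeof entry === "string" ? { url: entry } : entry;
}
```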
--- a/openhands/cli/main.py
+++ b/openhands/cli/main.py
@@ -83,7 +83,7 @@ from openhands.microagent.microagent import BaseMicroagent
 from openhands.runtime import get_runtime_cls
 from openhands.runtime.base import Runtime
 from openhands.storage.settings.file_settings_store import FileSettingsStore
-from openhands.utils.utils import create_registry_and_convo_stats
+from openhands.utils.utils import create_registry_and_conversation_stats


 async def cleanup_session(
@@ -148,7 +148,7 @@ async def run_session(
        None, display_initialization_animation, 'Initializing...', is_loaded
    )

-    llm_registry, convo_stats, config = create_registry_and_convo_stats(
+    llm_registry, conversation_stats, config = create_registry_and_conversation_stats(
        config,
        sid,
        None,
@@ -169,7 +169,9 @@ async def run_session(

    runtime.subscribe_to_shell_stream(stream_to_console)

-    controller, initial_state = create_controller(agent, runtime, config, convo_stats)
+    controller, initial_state = create_controller(
+        agent, runtime, config, conversation_stats
+    )

    event_stream = runtime.event_stream

--- a/openhands/controller/agent_controller.py
+++ b/openhands/controller/agent_controller.py
@@ -109,7 +109,7 @@ class AgentController:
        self,
        agent: Agent,
        event_stream: EventStream,
-        convo_stats: ConversationStats,
+        conversation_stats: ConversationStats,
        iteration_delta: int,
        budget_per_task_delta: float | None = None,
        agent_to_llm_config: dict[str, LLMConfig] | None = None,
@@ -149,7 +149,7 @@ class AgentController:
        self.agent = agent
        self.headless_mode = headless_mode
        self.is_delegate = is_delegate
-        self.convo_stats = convo_stats
+        self.conversation_stats = conversation_stats

        # the event stream must be set before maybe subscribing to it
        self.event_stream = event_stream
@@ -165,7 +165,7 @@ class AgentController:
        # state from the previous session, state from a parent agent, or a fresh state
        self.set_initial_state(
            state=initial_state,
-            convo_stats=convo_stats,
+            conversation_stats=conversation_stats,
            max_iterations=iteration_delta,
            max_budget_per_task=budget_per_task_delta,
            confirmation_mode=confirmation_mode,
@@ -687,7 +687,7 @@ class AgentController:
            user_id=self.user_id,
            agent=delegate_agent,
            event_stream=self.event_stream,
-            convo_stats=self.convo_stats,
+            conversation_stats=self.conversation_stats,
            iteration_delta=self._initial_max_iterations,
            budget_per_task_delta=self._initial_max_budget_per_task,
            agent_to_llm_config=self.agent_to_llm_config,
@@ -951,7 +951,7 @@ class AgentController:
    def set_initial_state(
        self,
        state: State | None,
-        convo_stats: ConversationStats,
+        conversation_stats: ConversationStats,
        max_iterations: int,
        max_budget_per_task: float | None,
        confirmation_mode: bool = False,
@@ -959,7 +959,7 @@ class AgentController:
        self.state_tracker.set_initial_state(
            self.id,
            state,
-            convo_stats,
+            conversation_stats,
            max_iterations,
            max_budget_per_task,
            confirmation_mode,
@@ -1000,7 +1000,7 @@ class AgentController:
            action: The action to attach metrics to
        """
        # Get metrics from agent LLM
-        metrics = self.convo_stats.get_combined_metrics()
+        metrics = self.conversation_stats.get_combined_metrics()

        # Create a clean copy with only the fields we want to keep
        clean_metrics = Metrics()
--- a/openhands/controller/state/state.py
+++ b/openhands/controller/state/state.py
@@ -85,7 +85,7 @@ class State:
            limit_increase_amount=100, current_value=0, max_value=100
        )
    )
-    convo_stats: ConversationStats | None = None
+    conversation_stats: ConversationStats | None = None
    budget_flag: BudgetControlFlag | None = None
    confirmation_mode: bool = False
    history: list[Event] = field(default_factory=list)
@@ -122,8 +122,8 @@ class State:
    def save_to_session(
        self, sid: str, file_store: FileStore, user_id: str | None
    ) -> None:
-        convo_stats = self.convo_stats
-        self.convo_stats = None  # Don't save convo stats, handles itself
+        conversation_stats = self.conversation_stats
+        self.conversation_stats = None  # Exclude from pickle; stats are saved via save_metrics()

        pickled = pickle.dumps(self)
        logger.debug(f'Saving state to session {sid}:{self.agent_state}')
@@ -144,7 +144,7 @@ class State:
            logger.error(f'Failed to save state to session: {e}')
            raise e

-        self.convo_stats = convo_stats  # restore reference
+        self.conversation_stats = conversation_stats  # restore reference

    @staticmethod
    def restore_from_session(
--- a/openhands/controller/state/state_tracker.py
+++ b/openhands/controller/state/state_tracker.py
@@ -51,7 +51,7 @@ class StateTracker:
        self,
        id: str,
        state: State | None,
-        convo_stats: ConversationStats,
+        conversation_stats: ConversationStats,
        max_iterations: int,
        max_budget_per_task: float | None,
        confirmation_mode: bool = False,
@@ -74,7 +74,7 @@ class StateTracker:
                session_id=id.removesuffix('-delegate'),
                user_id=self.user_id,
                inputs={},
-                convo_stats=convo_stats,
+                conversation_stats=conversation_stats,
                iteration_flag=IterationControlFlag(
                    limit_increase_amount=max_iterations,
                    current_value=0,
@@ -99,7 +99,7 @@ class StateTracker:
            if self.state.start_id <= -1:
                self.state.start_id = 0

-            state.convo_stats = convo_stats
+            state.conversation_stats = conversation_stats

    def _init_history(self, event_stream: EventStream) -> None:
        """Initializes the agent's history from the event stream.
@@ -248,8 +248,8 @@ class StateTracker:
        if self.sid and self.file_store:
            self.state.save_to_session(self.sid, self.file_store, self.user_id)

-        if self.state.convo_stats:
-            self.state.convo_stats.save_metrics()
+        if self.state.conversation_stats:
+            self.state.conversation_stats.save_metrics()

    def run_control_flags(self):
        """Performs one step of the control flags"""
@@ -262,7 +262,7 @@ class StateTracker:
        Budget flag will monitor for when budget is exceeded
        """
        # Sync cost across all llm services from llm registry
-        if self.state.budget_flag and self.state.convo_stats:
+        if self.state.budget_flag and self.state.conversation_stats:
            self.state.budget_flag.current_value = (
-                self.state.convo_stats.get_combined_metrics().accumulated_cost
+                self.state.conversation_stats.get_combined_metrics().accumulated_cost
            )
--- a/openhands/core/config/llm_config.py
+++ b/openhands/core/config/llm_config.py
@@ -172,9 +172,6 @@ class LLMConfig(BaseModel):

        # Set reasoning_effort to 'high' by default for non-Gemini models
        # Gemini models use optimized thinking budget when reasoning_effort is None
-        logger.debug(
-            f'Setting reasoning_effort for model {self.model} with reasoning_effort {self.reasoning_effort}'
-        )
        if self.reasoning_effort is None and 'gemini-2.5-pro' not in self.model:
            self.reasoning_effort = 'high'

--- a/openhands/core/config/sandbox_config.py
+++ b/openhands/core/config/sandbox_config.py
@@ -18,6 +18,7 @@ class SandboxConfig(BaseModel):
        remote_runtime_enable_retries: Whether to enable retries (on recoverable errors like requests.ConnectionError) for the remote runtime API requests.
        enable_auto_lint: Whether to enable auto-lint.
        use_host_network: Whether to use the host network.
+        additional_networks: A list of additional Docker networks to connect to.
        runtime_binding_address: The binding address for the runtime ports.  It specifies which network interface on the host machine Docker should bind the runtime ports to.
        initialize_plugins: Whether to initialize plugins.
        force_rebuild_runtime: Whether to force rebuild the runtime image.
@@ -65,6 +66,7 @@ class SandboxConfig(BaseModel):
        default=False
    )  # once enabled, OpenHands would lint files after editing
    use_host_network: bool = Field(default=False)
+    additional_networks: list[str] = Field(default_factory=list)
    runtime_binding_address: str = Field(default='0.0.0.0')
    runtime_extra_build_args: list[str] | None = Field(default=None)
    initialize_plugins: bool = Field(default=True)
--- a/openhands/core/main.py
+++ b/openhands/core/main.py
@@ -32,12 +32,11 @@ from openhands.events.action.action import Action
 from openhands.events.event import Event
 from openhands.events.observation import AgentStateChangedObservation
 from openhands.io import read_input, read_task
-from openhands.llm.llm_registry import LLMRegistry
 from openhands.mcp import add_mcp_tools_to_agent
 from openhands.memory.memory import Memory
 from openhands.runtime.base import Runtime
 from openhands.utils.async_utils import call_async_from_sync
-from openhands.utils.utils import create_registry_and_convo_stats
+from openhands.utils.utils import create_registry_and_conversation_stats


 class FakeUserResponseFunc(Protocol):
@@ -59,7 +58,6 @@ async def run_controller(
    headless_mode: bool = True,
    memory: Memory | None = None,
    conversation_instructions: str | None = None,
-    llm_registry: LLMRegistry | None = None,
 ) -> State | None:
    """Main coroutine to run the agent controller with task input flexibility.

@@ -98,7 +96,7 @@ async def run_controller(
    """
    sid = sid or generate_sid(config)

-    llm_registry, convo_stats, config = create_registry_and_convo_stats(
+    llm_registry, conversation_stats, config = create_registry_and_conversation_stats(
        config,
        sid,
        None,
@@ -165,7 +163,7 @@ async def run_controller(
        )

    controller, initial_state = create_controller(
-        agent, runtime, config, convo_stats, replay_events=replay_events
+        agent, runtime, config, conversation_stats, replay_events=replay_events
    )

    assert isinstance(initial_user_action, Action), (
--- a/openhands/core/setup.py
+++ b/openhands/core/setup.py
@@ -35,7 +35,7 @@ from openhands.utils.async_utils import GENERAL_TIMEOUT, call_async_from_sync

 def create_runtime(
    config: OpenHandsConfig,
-    llm_registry: LLMRegistry,
+    llm_registry: LLMRegistry | None = None,
    sid: str | None = None,
    headless_mode: bool = True,
    agent: Agent | None = None,
@@ -84,7 +84,7 @@ def create_runtime(
        sid=session_id,
        plugins=agent_cls.sandbox_plugins,
        headless_mode=headless_mode,
-        llm_registry=llm_registry,
+        llm_registry=llm_registry or LLMRegistry(config),
        git_provider_tokens=git_provider_tokens,
    )

@@ -218,7 +218,7 @@ def create_controller(
    agent: Agent,
    runtime: Runtime,
    config: OpenHandsConfig,
-    convo_stats: ConversationStats,
+    conversation_stats: ConversationStats,
    headless_mode: bool = True,
    replay_events: list[Event] | None = None,
 ) -> tuple[AgentController, State | None]:
@@ -236,7 +236,7 @@ def create_controller(

    controller = AgentController(
        agent=agent,
-        convo_stats=convo_stats,
+        conversation_stats=conversation_stats,
        iteration_delta=config.max_iterations,
        budget_per_task_delta=config.max_budget_per_task,
        agent_to_llm_config=config.get_agent_to_llm_config_map(),
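The `create_runtime` change above makes `llm_registry` optional and falls back to `LLMRegistry(config)` when none is passed. The block below is a rough sketch of that default-argument pattern. It uses hypothetical stand-ins `Config` and `Registry`, not the OpenHands classes, and an explicit `is None` check. The two forms differ only when a registry object can be falsy. The `__len__` on `Registry` is an assumption added purely to make that difference visible, and this diff does not show whether the real `LLMRegistry` can ever be falsy.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Config:
    """Hypothetical stand-in for OpenHandsConfig."""

    model: str = 'gpt-4o'


@dataclass
class Registry:
    """Hypothetical stand-in for LLMRegistry; empty instances are falsy via __len__."""

    config: Config
    llms: dict = field(default_factory=dict)

    def __len__(self) -> int:
        return len(self.llms)


def resolve_registry(config: Config, registry: Registry | None = None) -> Registry:
    """Use the caller's registry if one was passed, else build a default from config."""
    if registry is None:
        registry = Registry(config)
    return registry


cfg = Config()
shared = Registry(cfg)
# The explicit None check keeps the caller's (empty) registry...
assert resolve_registry(cfg, shared) is shared
# ...whereas `or` would replace it, because an empty Registry is falsy here.
assert (shared or Registry(cfg)) is not shared
# With no registry passed, a new one is built from the given config.
assert resolve_registry(cfg).config is cfg
```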
--- a/openhands/integrations/github/github_service.py
+++ b/openhands/integrations/github/github_service.py
@@ -321,6 +321,36 @@ class GitHubService(BaseGitService, GitService, InstallationsService):
        installations = response.get('installations', [])
        return [str(i['id']) for i in installations]

+    async def get_user_organizations(self) -> list[str]:
+        """Get list of organization logins that the user is a member of."""
+        url = f'{self.BASE_URL}/user/orgs'
+        try:
+            response, _ = await self._make_request(url)
+            orgs = [org['login'] for org in response]
+            return orgs
+        except Exception as e:
+            logger.warning(f'Failed to get user organizations: {e}')
+            return []
+
+    def _fuzzy_match_org_name(self, query: str, org_name: str) -> bool:
+        """Check if query fuzzy matches organization name."""
+        query_lower = query.lower().replace('-', '').replace('_', '').replace(' ', '')
+        org_lower = org_name.lower().replace('-', '').replace('_', '').replace(' ', '')
+
+        # Exact match after normalization
+        if query_lower == org_lower:
+            return True
+
+        # Query is a substring of org name
+        if query_lower in org_lower:
+            return True
+
+        # Org name is a substring of query (less common but possible)
+        if org_lower in query_lower:
+            return True
+
+        return False
+
    async def search_repositories(
        self, query: str, per_page: int, sort: str, order: str, public: bool
    ) -> list[Repository]:
@@ -341,21 +371,68 @@ class GitHubService(BaseGitService, GitService, InstallationsService):
            # Add is:public to the query to ensure we only search for public repositories
            params['q'] = f'in:name {org}/{repo_name} is:public'

-        # Perhaps we should go through all orgs and the search for repos under every org
-        # Currently it will only search user repos, and org repos when '/' is in the name
+        # Handle private repository searches
        if not public and '/' in query:
            org, repo_query = query.split('/', 1)
            query_with_user = f'org:{org} in:name {repo_query}'
            params['q'] = query_with_user
        elif not public:
+            # Expand search scope to include user's repositories and organizations they're a member of
            user = await self.get_user()
-            params['q'] = f'in:name {query} user:{user.login}'
+            user_orgs = await self.get_user_organizations()

+            # Search in user repos and org repos separately
+            all_repos = []
+
+            # Search in user repositories
+            user_query = f'{query} user:{user.login}'
+            user_params = params.copy()
+            user_params['q'] = user_query
+
+            try:
+                user_response, _ = await self._make_request(url, user_params)
+                user_items = user_response.get('items', [])
+                all_repos.extend(user_items)
+            except Exception as e:
+                logger.warning(f'User search failed: {e}')
+
+            # Search each of the user's organizations for repos matching the query
+            for org in user_orgs:
+                org_query = f'{query} org:{org}'
+                org_params = params.copy()
+                org_params['q'] = org_query
+
+                try:
+                    org_response, _ = await self._make_request(url, org_params)
+                    org_items = org_response.get('items', [])
+                    all_repos.extend(org_items)
+                except Exception as e:
+                    logger.warning(f'Org {org} search failed: {e}')
+
+            # Also search for top repos from orgs that match the query name
+            for org in user_orgs:
+                if self._fuzzy_match_org_name(query, org):
+                    org_repos_query = f'org:{org}'
+                    org_repos_params = params.copy()
+                    org_repos_params['q'] = org_repos_query
+                    org_repos_params['sort'] = 'stars'
+                    org_repos_params['per_page'] = 2  # Limit to first 2 repos
+
+                    try:
+                        org_repos_response, _ = await self._make_request(
+                            url, org_repos_params
+                        )
+                        org_repo_items = org_repos_response.get('items', [])
+                        all_repos.extend(org_repo_items)
+                    except Exception as e:
+                        logger.warning(f'Org repos search for {org} failed: {e}')
+
+            return [self._parse_repository(repo) for repo in all_repos]
+
+        # Default case (public search or slash query)
        response, _ = await self._make_request(url, params)
        repo_items = response.get('items', [])
-        repos = [self._parse_repository(repo) for repo in repo_items]
-
-        return repos
+        return [self._parse_repository(repo) for repo in repo_items]

    async def execute_graphql_query(
        self, query: str, variables: dict[str, Any]
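The private-search branch above issues three kinds of query: one scoped to the user, one `org:` query per organization, and a top-two-by-stars query for each organization whose name fuzzy-matches the search text. It then concatenates all results into `all_repos`. The sketch below condenses the fuzzy check into two containment tests, since equality after normalization is already covered by either one. It also adds a hypothetical `merge_unique` helper that is not part of this PR. Because the per-org query and the top-repos query can both return the same repository, a merge keyed on the search items' `full_name` field is one way to avoid duplicate entries.

```python
def normalize_org(name: str) -> str:
    """Lowercase and drop '-', '_' and spaces, mirroring _fuzzy_match_org_name."""
    return name.lower().replace('-', '').replace('_', '').replace(' ', '')


def fuzzy_match_org_name(query: str, org_name: str) -> bool:
    """Equality after normalization is implied by either containment check."""
    q, o = normalize_org(query), normalize_org(org_name)
    return q in o or o in q


def merge_unique(*result_lists: list[dict]) -> list[dict]:
    """Concatenate search-result lists, keeping the first item for each full_name."""
    seen: set[str] = set()
    merged: list[dict] = []
    for items in result_lists:
        for item in items:
            key = item['full_name']  # GitHub search items carry 'owner/name' here
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged


# 'all-hands' normalizes to 'allhands', a substring of 'allhandsai'
assert fuzzy_match_org_name('all-hands', 'All-Hands-AI')
assert not fuzzy_match_org_name('openhands', 'All-Hands-AI')
```

As in the diff, a query that normalizes to an empty string (for example `'-'`) matches every organization.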
--- a/openhands/integrations/vscode/package-lock.json
+++ b/openhands/integrations/vscode/package-lock.json
--- a/openhands/integrations/vscode/src/test/suite/index.ts
+++ b/openhands/integrations/vscode/src/test/suite/index.ts
@@ -1,8 +1,8 @@
 import * as path from "path";
-import Mocha = require("mocha"); // Changed import style
-import glob = require("glob"); // Changed import style
+import Mocha = require("mocha");
+import { glob } from "glob"; // Updated for glob v9+ API

-export function run(): Promise<void> {
+export async function run(): Promise<void> {
  // Create the mocha test
  const mocha = new Mocha({
    // This should now work with the changed import
@@ -13,33 +13,25 @@ export function run(): Promise<void> {

  const testsRoot = path.resolve(__dirname, ".."); // Root of the /src/test folder (compiled to /out/test)

-  return new Promise((c, e) => {
+  try {
    // Use glob to find all test files (ending with .test.js in the compiled output)
-    glob(
-      "**/**.test.js",
-      { cwd: testsRoot },
-      (err: NodeJS.ErrnoException | null, files: string[]) => {
-        if (err) {
-          return e(err);
-        }
+    const files = await glob("**/**.test.js", { cwd: testsRoot });

-        // Add files to the test suite
-        files.forEach((f: string) => mocha.addFile(path.resolve(testsRoot, f)));
+    // Add files to the test suite
+    files.forEach((f: string) => mocha.addFile(path.resolve(testsRoot, f)));

-        try {
-          // Run the mocha test
-          mocha.run((failures: number) => {
-            if (failures > 0) {
-              e(new Error(`${failures} tests failed.`));
-            } else {
-              c();
-            }
-          });
-        } catch (err) {
-          console.error(err);
-          e(err);
+    // Run the mocha test
+    return await new Promise<void>((resolve, reject) => {
+      mocha.run((failures: number) => {
+        if (failures > 0) {
+          reject(new Error(`${failures} tests failed.`));
+        } else {
+          resolve();
        }
-      },
-    );
-  });
+      });
+    });
+  } catch (err) {
+    console.error(err);
+    throw err;
+  }
 }
--- a/openhands/llm/async_llm.py
+++ b/openhands/llm/async_llm.py
@@ -9,8 +9,8 @@ from openhands.core.logger import openhands_logger as logger
 from openhands.llm.llm import (
    LLM,
    LLM_RETRY_EXCEPTIONS,
-    REASONING_EFFORT_SUPPORTED_MODELS,
 )
+from openhands.llm.model_features import get_features
 from openhands.utils.shutdown_listener import should_continue


@@ -63,7 +63,7 @@ class AsyncLLM(LLM):
                messages = kwargs['messages']

            # Set reasoning effort for models that support it
-            if self.config.model.lower() in REASONING_EFFORT_SUPPORTED_MODELS:
+            if get_features(self.config.model).supports_reasoning_effort:
                kwargs['reasoning_effort'] = self.config.reasoning_effort

            # ensure we work with a list of messages
--- a/openhands/llm/fn_call_converter.py
+++ b/openhands/llm/fn_call_converter.py
@@ -705,6 +705,25 @@ def _fix_stopword(content: str) -> str:
    return content


+def _normalize_parameter_tags(fn_body: str) -> str:
+    """Normalize malformed parameter tags to the canonical format.
+
+    Some models occasionally emit malformed parameter tags like:
+        <parameter=command=str_replace</parameter>
+    instead of the correct:
+        <parameter=command>str_replace</parameter>
+
+    This function rewrites the malformed form into the correct one to allow
+    downstream parsing to succeed.
+    """
+    # Replace '<parameter=name=value</parameter>' with '<parameter=name>value</parameter>'
+    return re.sub(
+        r'<parameter=([a-zA-Z0-9_]+)=([^<]*)</parameter>',
+        r'<parameter=\1>\2</parameter>',
+        fn_body,
+    )
+
+
 def convert_non_fncall_messages_to_fncall_messages(
    messages: list[dict],
    tools: list[ChatCompletionToolParam],
@@ -852,7 +871,7 @@ def convert_non_fncall_messages_to_fncall_messages(

            if fn_match:
                fn_name = fn_match.group(1)
-                fn_body = fn_match.group(2)
+                fn_body = _normalize_parameter_tags(fn_match.group(2))
                matching_tool = next(
                    (
                        tool['function']
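`_normalize_parameter_tags` repairs tags in which a model fused the parameter name and value with a second `=`. The standalone sketch below reuses the exact regex from the hunk so its behavior can be checked without the rest of `fn_call_converter.py`. The name without the leading underscore is only for illustration.

```python
import re

# Same pattern as _normalize_parameter_tags: '<parameter=NAME=VALUE</parameter>',
# where NAME is [a-zA-Z0-9_]+ and VALUE contains no '<'.
_MALFORMED_PARAM = re.compile(r'<parameter=([a-zA-Z0-9_]+)=([^<]*)</parameter>')


def normalize_parameter_tags(fn_body: str) -> str:
    """Rewrite malformed parameter tags into '<parameter=NAME>VALUE</parameter>'."""
    return _MALFORMED_PARAM.sub(r'<parameter=\1>\2</parameter>', fn_body)


body = '<parameter=command=view</parameter>\n<parameter=path>/tmp/a.py</parameter>'
fixed = normalize_parameter_tags(body)
# Only the malformed first tag changes; the well-formed second tag is untouched.
assert fixed == '<parameter=command>view</parameter>\n<parameter=path>/tmp/a.py</parameter>'
```

Because the value group is `[^<]*`, a malformed tag whose value itself contains `<` (for example a comparison like `a<b`) does not match and passes through unchanged. Running the function on already-correct output is a no-op.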
--- a/openhands/llm/llm.py
+++ b/openhands/llm/llm.py
@@ -9,6 +9,7 @@ import httpx

 from openhands.core.config import LLMConfig
 from openhands.llm.metrics import Metrics
+from openhands.llm.model_features import get_features

 with warnings.catch_warnings():
    warnings.simplefilter('ignore')
@@ -49,79 +50,6 @@ LLM_RETRY_EXCEPTIONS: tuple[type[Exception], ...] = (
    LLMNoResponseError,
 )

-# cache prompt supporting models
-# remove this when we gemini and deepseek are supported
-CACHE_PROMPT_SUPPORTED_MODELS = [
-    'claude-3-7-sonnet-20250219',
-    'claude-sonnet-3-7-latest',
-    'claude-3.7-sonnet',
-    'claude-3-5-sonnet-20241022',
-    'claude-3-5-sonnet-20240620',
-    'claude-3-5-haiku-20241022',
-    'claude-3-haiku-20240307',
-    'claude-3-opus-20240229',
-    'claude-sonnet-4-20250514',
-    'claude-sonnet-4',
-    'claude-opus-4-20250514',
-    'claude-opus-4-1-20250805',
-]
-
-# function calling supporting models
-FUNCTION_CALLING_SUPPORTED_MODELS = [
-    'claude-3-7-sonnet-20250219',
-    'claude-sonnet-3-7-latest',
-    'claude-3-5-sonnet',
-    'claude-3-5-sonnet-20240620',
-    'claude-3-5-sonnet-20241022',
-    'claude-3.5-haiku',
-    'claude-3-5-haiku-20241022',
-    'claude-sonnet-4-20250514',
-    'claude-sonnet-4',
-    'claude-opus-4-20250514',
-    'claude-opus-4-1-20250805',
-    'gpt-4o-mini',
-    'gpt-4o',
-    'o1-2024-12-17',
-    'o3-mini-2025-01-31',
-    'o3-mini',
-    'o3',
-    'o3-2025-04-16',
-    'o4-mini',
-    'o4-mini-2025-04-16',
-    'gemini-2.5-pro',
-    'gpt-4.1',
-    'kimi-k2-0711-preview',
-    'kimi-k2-instruct',
-    'Qwen3-Coder-480B-A35B-Instruct',
-    'qwen3-coder',  # this will match both qwen3-coder-480b (openhands provider) and qwen3-coder (for openrouter)
-    'gpt-5',
-    'gpt-5-2025-08-07',
-]
-
-REASONING_EFFORT_SUPPORTED_MODELS = [
-    'o1-2024-12-17',
-    'o1',
-    'o3',
-    'o3-2025-04-16',
-    'o3-mini-2025-01-31',
-    'o3-mini',
-    'o4-mini',
-    'o4-mini-2025-04-16',
-    'gemini-2.5-flash',
-    'gemini-2.5-pro',
-    'gpt-5',
-    'gpt-5-2025-08-07',
-    'claude-opus-4-1-20250805',  # we need to remove top_p for opus 4.1
-]
-
-MODELS_WITHOUT_STOP_WORDS = [
-    'o1-mini',
-    'o1-preview',
-    'o1',
-    'o1-2024-12-17',
-    'xai/grok-4-0709',
-]
-

 class LLM(RetryMixin, DebugMixin):
    """The LLM class represents a Language Model instance.
@@ -154,6 +82,7 @@ class LLM(RetryMixin, DebugMixin):
        )

        self.model_info: ModelInfo | None = None
+        self._function_calling_active: bool = False
        self.retry_listener = retry_listener
        if self.config.log_completions:
            if self.config.log_completions_folder is None:
@@ -202,10 +131,8 @@ class LLM(RetryMixin, DebugMixin):
                f'Rewrote openhands/{model_name} to {self.config.model} with base URL {self.config.base_url}'
            )

-        if (
-            self.config.model.lower() in REASONING_EFFORT_SUPPORTED_MODELS
-            or self.config.model.split('/')[-1] in REASONING_EFFORT_SUPPORTED_MODELS
-        ):
+        features = get_features(self.config.model)
+        if features.supports_reasoning_effort:
            # For Gemini models, only map 'low' to optimized thinking budget
            # Let other reasoning_effort values pass through to API as-is
            if 'gemini-2.5-pro' in self.config.model:
@@ -239,6 +166,20 @@ class LLM(RetryMixin, DebugMixin):
        elif 'gemini' in self.config.model.lower() and self.config.safety_settings:
            kwargs['safety_settings'] = self.config.safety_settings

+        # Explicitly disable Anthropic extended thinking for Opus 4.1 to avoid
+        # requiring 'thinking' content blocks. See issue #10510.
+        if 'claude-opus-4-1' in self.config.model.lower():
+            kwargs['thinking'] = {'type': 'disabled'}
+
+        # Anthropic constraint: Opus models cannot accept both temperature and top_p
+        # Prefer temperature (drop top_p) if both are specified.
+        _model_lower = self.config.model.lower()
+        # Limit to Opus 4.1 specifically to avoid changing behavior of other Anthropic models
+        if ('claude-opus-4-1' in _model_lower) and (
+            'temperature' in kwargs and 'top_p' in kwargs
+        ):
+            kwargs.pop('top_p', None)
+
        self._completion = partial(
            litellm_completion,
            model=self.config.model,
@@ -312,7 +253,7 @@ class LLM(RetryMixin, DebugMixin):

                # add stop words if the model supports it and stop words are not disabled
                if (
-                    self.config.model not in MODELS_WITHOUT_STOP_WORDS
+                    get_features(self.config.model).supports_stop_words
                    and not self.config.disable_stop_word
                ):
                    kwargs['stop'] = STOP_WORDS
@@ -556,17 +497,10 @@ class LLM(RetryMixin, DebugMixin):
                ):
                    self.config.max_output_tokens = self.model_info['max_tokens']

-        # Initialize function calling capability
-        # Check if model name is in our supported list
-        model_name_supported = (
-            self.config.model in FUNCTION_CALLING_SUPPORTED_MODELS
-            or self.config.model.split('/')[-1] in FUNCTION_CALLING_SUPPORTED_MODELS
-            or any(m in self.config.model for m in FUNCTION_CALLING_SUPPORTED_MODELS)
-        )
-
-        # Handle native_tool_calling user-defined configuration
+        # Initialize function calling using centralized model features
+        features = get_features(self.config.model)
        if self.config.native_tool_calling is None:
-            self._function_calling_active = model_name_supported
+            self._function_calling_active = features.supports_function_calling
        else:
            self._function_calling_active = self.config.native_tool_calling

@@ -601,14 +535,10 @@ class LLM(RetryMixin, DebugMixin):
        Returns:
            boolean: True if prompt caching is supported and enabled for the given model.
        """
-        return (
-            self.config.caching_prompt is True
-            and (
-                self.config.model in CACHE_PROMPT_SUPPORTED_MODELS
-                or self.config.model.split('/')[-1] in CACHE_PROMPT_SUPPORTED_MODELS
-            )
-            # We don't need to look-up model_info, because only Anthropic models needs the explicit caching breakpoint
-        )
+        if not self.config.caching_prompt:
+            return False
+        # We don't need to look-up model_info, because only Anthropic models need explicit caching breakpoints
+        return get_features(self.config.model).supports_prompt_cache

    def is_function_calling_active(self) -> bool:
        """Returns whether function calling is supported and enabled for this LLM instance.
@@ -850,6 +780,8 @@ class LLM(RetryMixin, DebugMixin):
                message.force_string_serializer = True
            if 'kimi-k2-instruct' in self.config.model and 'groq' in self.config.model:
                message.force_string_serializer = True
+            if 'openrouter/anthropic/claude-sonnet-4' in self.config.model:
+                message.force_string_serializer = True

        # let pydantic handle the serialization
        return [message.model_dump() for message in messages]
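The two Opus 4.1 rules added to `LLM.__init__` can be read as a pure function over the completion kwargs. The sketch below is a hypothetical helper, `adjust_anthropic_kwargs`, written against a plain dict. It applies the same substring test on the lowercased model name, forces `thinking` to `{'type': 'disabled'}`, and drops `top_p` only when `temperature` is also present. Other Anthropic models pass through unchanged, which matches the diff's comment that the change is limited to Opus 4.1.

```python
def adjust_anthropic_kwargs(model: str, kwargs: dict) -> dict:
    """Return a copy of kwargs with the Opus 4.1 adjustments from the diff applied."""
    out = dict(kwargs)  # do not mutate the caller's dict
    if 'claude-opus-4-1' in model.lower():
        # Disable extended thinking so 'thinking' content blocks are not required
        out['thinking'] = {'type': 'disabled'}
        # The endpoint rejects temperature together with top_p; keep temperature
        if 'temperature' in out and 'top_p' in out:
            out.pop('top_p')
    return out


opus = adjust_anthropic_kwargs(
    'anthropic/claude-opus-4-1-20250805', {'temperature': 0.0, 'top_p': 1.0}
)
assert opus == {'temperature': 0.0, 'thinking': {'type': 'disabled'}}
```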
--- a/openhands/llm/model_features.py
+++ b/openhands/llm/model_features.py
@@ -0,0 +1,138 @@
+from __future__ import annotations
+
+from dataclasses import dataclass
+from fnmatch import fnmatch
+
+
+def normalize_model_name(model: str) -> str:
+    """Normalize a model string to a canonical, comparable name.
+
+    Strategy:
+    - Trim whitespace
+    - Lowercase
+    - If there is a '/', keep only the basename after the last '/'
+      (handles prefixes like openrouter/, litellm_proxy/, anthropic/, etc.)
+      and treat ':' inside that basename as an Ollama-style variant tag to be removed
+    - There is no provider:model form; providers, when present, use 'provider/model'
+    - Drop a trailing "-gguf" suffix if present
+    """
+    raw = (model or '').strip().lower()
+    if '/' in raw:
+        name = raw.split('/')[-1]
+        if ':' in name:
+            # Drop Ollama-style variant tag in basename
+            name = name.split(':', 1)[0]
+    else:
+        # No '/', keep the whole raw name (we do not support provider:model)
+        name = raw
+    if name.endswith('-gguf'):
+        name = name[: -len('-gguf')]
+    return name
+
+
+def model_matches(model: str, patterns: list[str]) -> bool:
+    """Return True if the model matches any of the glob patterns.
+
+    If a pattern contains a '/', it is treated as provider-qualified and matched
+    against the full, lowercased model string (including provider prefix).
+    Otherwise, it is matched against the normalized basename.
+    """
+    raw = (model or '').strip().lower()
+    name = normalize_model_name(model)
+    for pat in patterns:
+        pat_l = pat.lower()
+        if '/' in pat_l:
+            if fnmatch(raw, pat_l):
+                return True
+        else:
+            if fnmatch(name, pat_l):
+                return True
+    return False
+
+
+@dataclass(frozen=True)
+class ModelFeatures:
+    supports_function_calling: bool
+    supports_reasoning_effort: bool
+    supports_prompt_cache: bool
+    supports_stop_words: bool
+
+
+# Pattern tables capturing current behavior. Keep patterns lowercase.
+FUNCTION_CALLING_PATTERNS: list[str] = [
+    # Anthropic families
+    'claude-3-7-sonnet*',
+    'claude-3.7-sonnet*',
+    'claude-sonnet-3-7-latest',
+    'claude-3-5-sonnet*',
+    'claude-3.5-haiku*',
+    'claude-3-5-haiku*',
+    'claude-sonnet-4*',
+    'claude-opus-4*',
+    # OpenAI families
+    'gpt-4o*',
+    'gpt-4.1',
+    'gpt-5*',
+    # o-series (keep exact o1 support per existing list)
+    'o1-2024-12-17',
+    'o3*',
+    'o4-mini*',
+    # Google Gemini
+    'gemini-2.5-pro*',
+    # Others
+    'kimi-k2-0711-preview',
+    'kimi-k2-instruct',
+    'qwen3-coder*',
+    'qwen3-coder-480b-a35b-instruct',
+]
+
+REASONING_EFFORT_PATTERNS: list[str] = [
+    # Mirror main behavior exactly (no unintended expansion), plus DeepSeek support
+    'o1-2024-12-17',
+    'o1',
+    'o3',
+    'o3-2025-04-16',
+    'o3-mini-2025-01-31',
+    'o3-mini',
+    'o4-mini',
+    'o4-mini-2025-04-16',
+    'gemini-2.5-flash',
+    'gemini-2.5-pro',
+    'gpt-5',
+    'gpt-5-2025-08-07',
+    # DeepSeek reasoning family
+    'deepseek-r1-0528*',
+]
+
+PROMPT_CACHE_PATTERNS: list[str] = [
+    'claude-3-7-sonnet*',
+    'claude-3.7-sonnet*',
+    'claude-sonnet-3-7-latest',
+    'claude-3-5-sonnet*',
+    'claude-3-5-haiku*',
+    'claude-3.5-haiku*',
+    'claude-3-haiku-20240307',
+    'claude-3-opus-20240229',
+    'claude-sonnet-4*',
+    'claude-opus-4*',
+]
+
+SUPPORTS_STOP_WORDS_FALSE_PATTERNS: list[str] = [
+    # o1 family doesn't support stop words
+    'o1*',
+    # grok-4 specific model name (basename)
+    'grok-4-0709',
+    # DeepSeek R1 family
+    'deepseek-r1-0528*',
+]
+
+
+def get_features(model: str) -> ModelFeatures:
+    return ModelFeatures(
+        supports_function_calling=model_matches(model, FUNCTION_CALLING_PATTERNS),
+        supports_reasoning_effort=model_matches(model, REASONING_EFFORT_PATTERNS),
+        supports_prompt_cache=model_matches(model, PROMPT_CACHE_PATTERNS),
+        supports_stop_words=not model_matches(
+            model, SUPPORTS_STOP_WORDS_FALSE_PATTERNS
+        ),
+    )
--- a/openhands/llm/streaming_llm.py
+++ b/openhands/llm/streaming_llm.py
@@ -5,7 +5,7 @@ from typing import Any, Callable
 from openhands.core.exceptions import UserCancelledError
 from openhands.core.logger import openhands_logger as logger
 from openhands.llm.async_llm import LLM_RETRY_EXCEPTIONS, AsyncLLM
-from openhands.llm.llm import REASONING_EFFORT_SUPPORTED_MODELS
+from openhands.llm.model_features import get_features


 class StreamingLLM(AsyncLLM):
@@ -65,7 +65,7 @@ class StreamingLLM(AsyncLLM):
                )

            # Set reasoning effort for models that support it
-            if self.config.model.lower() in REASONING_EFFORT_SUPPORTED_MODELS:
+            if get_features(self.config.model).supports_reasoning_effort:
                kwargs['reasoning_effort'] = self.config.reasoning_effort

            self.log_prompt(messages)
--- a/openhands/runtime/base.py
+++ b/openhands/runtime/base.py
@@ -67,6 +67,7 @@ from openhands.runtime.plugins import (
 from openhands.runtime.runtime_status import RuntimeStatus
 from openhands.runtime.utils.edit import FileEditRuntimeMixin
 from openhands.runtime.utils.git_handler import CommandResult, GitHandler
+from openhands.storage.locations import get_conversation_dir
 from openhands.utils.async_utils import (
    GENERAL_TIMEOUT,
    call_async_from_sync,
@@ -876,8 +877,14 @@ fi
            if isinstance(action, AgentThinkAction):
                return AgentThinkObservation('Your thought has been logged.')
            elif isinstance(action, TaskTrackingAction):
-                # If `command` is `plan`, write the serialized task list to the file TASKS.md under `.openhands/`
+                # Get the session-specific task file path
+                conversation_dir = get_conversation_dir(
+                    self.sid, self.event_stream.user_id
+                )
+                task_file_path = f'{conversation_dir}TASKS.md'
+
                if action.command == 'plan':
+                    # Write the serialized task list to the session directory
                    content = '# Task List\n\n'
                    for i, task in enumerate(action.task_list, 1):
                        status_icon = {
@@ -886,33 +893,39 @@ fi
                            'done': '✅',
                        }.get(task.get('status', 'todo'), '⏳')
                        content += f'{i}. {status_icon} {task.get("title", "")}\n{task.get("notes", "")}\n'
-                    write_obs = self.write(
-                        FileWriteAction(path='.openhands/TASKS.md', content=content)
-                    )
-                    if isinstance(write_obs, ErrorObservation):
+
+                    try:
+                        self.event_stream.file_store.write(task_file_path, content)
+                        return TaskTrackingObservation(
+                            content=f'Task list has been updated with {len(action.task_list)} items. Stored in session directory: {task_file_path}',
+                            command=action.command,
+                            task_list=action.task_list,
+                        )
+                    except Exception as e:
                        return ErrorObservation(
-                            f'Failed to write task list to .openhands/TASKS.md: {write_obs.content}'
+                            f'Failed to write task list to session directory {task_file_path}: {str(e)}'
                        )

-                    return TaskTrackingObservation(
-                        content=f'Task list has been updated with {len(action.task_list)} items.',
-                        command=action.command,
-                        task_list=action.task_list,
-                    )
                elif action.command == 'view':
-                    # If `command` is `view`, read the TASKS.md file and return its content
-                    read_obs = self.read(FileReadAction(path='.openhands/TASKS.md'))
-                    if isinstance(read_obs, FileReadObservation):
+                    # Read the TASKS.md file from the session directory
+                    try:
+                        content = self.event_stream.file_store.read(task_file_path)
                        return TaskTrackingObservation(
-                            content=read_obs.content,
+                            content=content,
                            command=action.command,
                            task_list=[],  # Empty for view command
                        )
-                    else:
-                        return TaskTrackingObservation(  # Return observation if error occurs because file might not exist yet
+                    except FileNotFoundError:
+                        return TaskTrackingObservation(
                            command=action.command,
                            task_list=[],
-                            content=f'Failed to read the task list. Error: {read_obs.content}',
+                            content='No task list found. Use the "plan" command to create one.',
+                        )
+                    except Exception as e:
+                        return TaskTrackingObservation(
+                            command=action.command,
+                            task_list=[],
+                            content=f'Failed to read the task list from session directory {task_file_path}. Error: {str(e)}',
                        )

            return NullObservation('')
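The `TaskTrackingAction` branch now writes and reads `TASKS.md` through `self.event_stream.file_store` at a per-conversation path, instead of `.openhands/TASKS.md` in the workspace. The sketch below reproduces the serialization and the missing-file handling with a hypothetical in-memory `MemoryFileStore`. Only the `'done': '✅'` entry of the status-icon mapping is visible in the hunk, so the sketch maps that one status explicitly and sends everything else to the `⏳` fallback. It also builds the path with `posixpath.join`, which yields the same result whether or not the directory string ends in `/`. The diff's `f'{conversation_dir}TASKS.md'` instead relies on `get_conversation_dir` returning a trailing slash.

```python
import posixpath

STATUS_ICONS = {'done': '✅'}  # only this entry is visible in the hunk; others are elided
DEFAULT_ICON = '⏳'  # fallback for any status not in STATUS_ICONS


def render_tasks(task_list: list[dict]) -> str:
    """Serialize tasks the way the 'plan' branch builds TASKS.md content."""
    content = '# Task List\n\n'
    for i, task in enumerate(task_list, 1):
        icon = STATUS_ICONS.get(task.get('status', 'todo'), DEFAULT_ICON)
        content += f'{i}. {icon} {task.get("title", "")}\n{task.get("notes", "")}\n'
    return content


class MemoryFileStore:
    """Hypothetical in-memory stand-in for event_stream.file_store."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}

    def write(self, path: str, content: str) -> None:
        self.files[path] = content

    def read(self, path: str) -> str:
        if path not in self.files:
            raise FileNotFoundError(path)
        return self.files[path]


def view_tasks(store: MemoryFileStore, path: str) -> str:
    """Mirror the 'view' branch: a missing file yields a hint, not an error."""
    try:
        return store.read(path)
    except FileNotFoundError:
        return 'No task list found. Use the "plan" command to create one.'


# Hypothetical conversation directory; the trailing '/' does not change the result
task_path = posixpath.join('sessions/abc123', 'TASKS.md')
store = MemoryFileStore()
assert view_tasks(store, task_path).startswith('No task list found')
tasks = [{'title': 'Write tests', 'status': 'done', 'notes': 'unit only'}]
store.write(task_path, render_tasks(tasks))
assert view_tasks(store, task_path) == '# Task List\n\n1. ✅ Write tests\nunit only\n'
```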
Author	SHA1	Message	Date
Xingyao Wang	8067ae85c3	Merge branch 'main' into fix-git-coauthorship-cli-runtime	2025-08-22 09:41:09 -04:00
Xingyao Wang	4507a25b85	Evaluation: redirect sessions to repo-local .eval_sessions via helper; apply across entrypoints; add tests (#10540 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-22 13:34:02 +00:00
llamantino	d9cf5b7302	ci: add GitHub Action to post welcome message on good first issues (#9707 ) Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2025-08-22 09:09:45 -04:00
Xingyao Wang	2a86e32263	fix(CI): Pin @modelcontextprotocol/server-filesystem to version 2025.8.18 (#10561 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-22 05:00:11 +08:00
Engel Nyst	b311ae6e15	fix: normalize malformed <parameter> tags (Qwen3) (#10539 )	2025-08-21 19:03:20 +02:00
Ryan H. Tran	adb773789a	Upgrade `aci` to 0.3.2: clamp view_range end to file length and emit warning instead of error (#10502 )	2025-08-21 23:01:54 +07:00
Engel Nyst	91d3d1d20a	Fix: expose aggregated LLM metrics in State for evaluation scripts (#10537 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-21 17:43:09 +02:00
llamantino	e9e2c98946	fix(tests): increase hard timeout in test_bash_server to avoid timeout on Windows (#9930 )	2025-08-21 17:12:42 +02:00
Engel Nyst	7861c1ddf7	fix(anthropic): disable extended thinking for Opus 4.1 (#10532 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-21 00:13:15 +02:00
Engel Nyst	5ce5469bfa	docs: update OpenAPI specification to include all current endpoints (#10412 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-20 21:58:35 +02:00
Xingyao Wang	4a3f5dd9b4	fix(runtime): correctly set session_api_key for local runtime (#10506 )	2025-08-21 03:51:19 +08:00
Joe O'Connor	bc8b995dd3	Add additional networks (#9566 ) Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2025-08-20 18:52:31 +00:00
chuckbutkus	07c4742496	Add useful tools jq and gettext to image (#10531 )	2025-08-20 18:27:09 +00:00
mamoodi	b5887f8a9d	Fix CLI docs command (#10520 )	2025-08-20 14:53:15 +00:00
mamoodi	0166df6575	Release 0.54.0 (#10465 )	2025-08-20 10:29:15 -04:00
Ryan H. Tran	e03a1f4e37	Move TASKS.md to session-specific directory in `~/.openhands` (#10493 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-20 22:26:55 +08:00
sp.wack	c763f0e368	chroe(vscode): Refresh vscode integration lockfile (#9965 ) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2025-08-20 15:33:11 +02:00
Engel Nyst	bb0e24d23b	Centralize model feature checks (#10414 ) Co-authored-by: OpenHands-GPT-5 <openhands@all-hands.dev>	2025-08-19 20:30:07 +00:00
sp.wack	aa6b454772	fix: Enhance GitHub repository search to include user organizations (#10324 ) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-19 15:56:15 +00:00
openhands	4e300b24b7	fix(runtime): ensure git safe.directory is configured for root user When running DockerRuntime with run_as_openhands=False (i.e., as root), the git safe.directory configuration was not being set up, causing 'dubious ownership' errors when git commands were executed. This fix extracts the git configuration logic into a separate function and ensures it's called for both root and non-root users, preventing the 'fatal: detected dubious ownership in repository' error. Fixes tests/runtime/test_bash.py::test_bash_remove_prefix[DockerRuntime-False] Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-19 14:51:08 +00:00
sp.wack	0297b3da18	Fix conversation ID validation to return 400 instead of 500 for long IDs (#10496 )	2025-08-19 18:03:05 +04:00
Hiep Le	476954f3a4	refactor(frontend): update the styling for the microagent management page. (#10494 )	2025-08-19 19:50:42 +07:00
dependabot[bot]	f296d7bde5	chore(deps): bump abatilo/actions-poetry from 3 to 4 (#10487 ) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-08-19 13:58:39 +02:00
Zacharias Fisches	f866b3f8ea	Update modal runtime for modal>=1.0 (#10479 ) Co-authored-by: Ryan H. Tran <descience.thh10@gmail.com>	2025-08-19 10:33:03 +00:00
Zacharias Fisches	36d31b74f7	fix jinja / dockerfile syntax by removing newlines (#10476 ) Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>	2025-08-19 02:50:41 +00:00
openhands	6ceae397d7	tests(runtime): align timeout assertion and robust git remote setup in test_bash\n\n- get_timeout_suffix(): assert on stable prefix only\n- test_bash_remove_prefix: tolerate existing origin via add-or-set-url\n\nCo-authored-by: openhands <openhands@all-hands.dev>	2025-08-19 02:46:34 +00:00
openhands	f894c25597	runtime(docker/cli): fix git co-authorship + git safe.directory and workspace ownership - Ensure workspace ownership and permissions are set after (or alongside) user creation - Add defensive guard for UID=0 for non-root users (use 1000) - Configure git safe.directory for /workspace to avoid ‘dubious ownership’ errors - Set global core.hooksPath and init.templateDir to /openhands/git-hooks - Ship prepare-commit-msg hook at runtime (copy from code or generate fallback) to always append ‘Co-authored-by: openhands <openhands@all-hands.dev>’ - BashSession: start shell in correct working dir for target user - Command startup: never pass UID 0 to openhands user This fixes tests/runtime/test_bash.py::test_git_co_authorship_runtime_setup for DockerRuntime and CLIRuntime. Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-18 23:19:34 +00:00
Engel Nyst	634a7691a2	tests: reorganize unit tests into subdirectories mirroring source modules (#10484) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-19 01:11:07 +02:00
openhands	87b936b04a	bash: normalize timeout suffix format to 1 decimal for hard timeouts Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-18 21:51:05 +00:00
Xingyao Wang	81ba4399fa	fix(frontend): fix MCP tab in frontend unit tests (#10481) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-18 21:25:09 +00:00
openhands	6068e4298b	cli: ensure git co-authorship works in CLIRuntime - Always prefix PATH for subprocess to use wrapper - Configure global prepare-commit-msg hook for fallback Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-18 21:21:53 +00:00
Rohit Malhotra	875036d920	(Hotfix): Fix logs and filestore init for llm registry (#10470)	2025-08-18 20:57:08 +00:00
Xingyao Wang	39333dd5de	feat: enable MCP in SaaS (#10480)	2025-08-18 20:40:42 +00:00
Rohit Malhotra	3660933d59	refactor: replace 'convo' naming with 'conversation' (#10473) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-18 15:10:32 -04:00
Xingyao Wang	baf2cc5c7e	Pin OpenAI Python SDK to 1.99.9 to avoid LiteLLM import breakage (BerriAI/litellm#13711) (#10471) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Rohit Malhotra <rohitvinodmalhotra@gmail.com>	2025-08-18 18:45:34 +00:00
Rohit Malhotra	7b31d57a2f	Update conversation stats filename (#10472)	2025-08-18 18:09:13 +00:00
Rohit Malhotra	61d90c31eb	(Hotfix): Fix eval pipeline (#10466)	2025-08-18 12:51:51 -04:00
Xingyao Wang	3fea7fd2fc	feat: improve MCP config UI with comprehensive add/edit/delete functionality (#10145) Co-authored-by: OpenHands <openhands@all-hands.dev>	2025-08-18 16:33:27 +00:00
suixinio	c64b1ae111	fix(openrouter): Force string serialization for openrouter/anthropic/claude-sonnet-4 model (#10454)	2025-08-18 17:50:01 +02:00
Kevin Musgrave	74ba21bad0	feat(evaluation): Added INSTRUCTION_TEMPLATE_NAME to run_infer.py in swe_bench (#10270) Co-authored-by: Xingyao Wang <xingyao@all-hands.dev> Co-authored-by: mamoodi <mamoodiha@gmail.com>	2025-08-18 14:18:08 +00:00
openhands	388e3ba496	Revert "Fix git config tests by ensuring local module imports" This reverts commit `2fd68cef2f`.	2025-08-15 22:01:28 +00:00
openhands	2fd68cef2f	Fix git config tests by ensuring local module imports The tests were failing because the poetry environment was using an installed version of the package from /openhands/code/ instead of the local development version. This caused the tests to use the old version of get_action_execution_server_startup_command that didn't include git configuration arguments. Fixed by adding the current directory to sys.path at the beginning of the test file to ensure local modules are imported first. Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-15 21:46:04 +00:00
openhands	e1788a74c5	Fix Docker build: Move git hook setup to separate RUN command - Move git hook setup after source code is copied - Separate RUN command prevents build failure when trying to copy hooks before source exists - Git hooks are now set up after COPY ./code/openhands command - This ensures the prepare-commit-msg file exists before trying to copy it Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-13 22:15:35 +00:00
Xingyao Wang	f046982d41	Update openhands/runtime/impl/cli/cli_runtime.py	2025-08-14 06:14:09 +08:00
openhands	e600225f0f	Move CLI git wrapper to ~/.openhands/bin - Use ~/.openhands/bin instead of workspace .openhands_bin directory - This follows standard user binary patterns and persists across workspaces - Avoids cluttering workspace with runtime infrastructure - Updated test to check new location Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-13 22:05:06 +00:00
openhands	dd8401cc98	Update test to expect co-authorship for all runtimes All runtimes have git hooks installed via Dockerfile.j2, so they should all automatically add co-authorship. CLI runtime has additional PATH-based wrapper but the base hook functionality works universally. Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-13 22:03:02 +00:00
openhands	e99f41372a	Fix test to not manually set up git hooks - Remove manual git hook setup from test - runtime should handle this - Rename test to reflect it tests runtime setup, not just hooks - Make test work with different runtime types (CLI uses wrapper, others may differ) - Test should verify runtime's co-authorship mechanism, not set it up Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-13 21:59:49 +00:00
openhands	e43f73f643	Implement automatic git co-authorship in CLI runtime using PATH-based wrapper - Replace conditional environment variable check with always-enabled git co-authorship - Use PATH manipulation instead of command wrapping to handle chained commands - Create modified git wrapper that uses full path to real git executable to avoid recursion - Update tests to reflect always-enabled behavior - Add comprehensive documentation for git hooks and wrapper functionality Fixes https://github.com/All-Hands-AI/OpenHands/issues/9957 Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-13 21:51:38 +00:00