Remove KeyboardInterrupt exit behavior from main chat loop

- Change KeyboardInterrupt handler to continue loop instead of exiting
- Let signal handler manage Ctrl+C behavior completely
- Only exit on explicit /exit command or outer KeyboardInterrupt

This ensures that Ctrl+C during agent processing returns to chat loop instead of exiting the entire application.

Co-authored-by: openhands <openhands@all-hands.dev>
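
The behavior described in this commit message can be illustrated with a minimal sketch. This is not the actual openhands-cli code: `run_chat_loop`, `fake_agent`, and the transcript list are hypothetical names used only for illustration, and the separate signal handler mentioned above is not modeled. The sketch assumes only that a `KeyboardInterrupt` raised during agent processing is caught inside the loop, and that the loop ends on an explicit `/exit`.

```python
from typing import Callable, Iterable, List


def run_chat_loop(inputs: Iterable[str], process: Callable[[str], str]) -> List[str]:
    """Hypothetical chat loop: Ctrl+C during processing returns to the prompt."""
    transcript: List[str] = []
    for message in inputs:
        # Only an explicit /exit command ends the loop.
        if message.strip() == "/exit":
            transcript.append("exit")
            break
        try:
            transcript.append(process(message))
        except KeyboardInterrupt:
            # Continue with the next prompt instead of exiting the application.
            transcript.append("interrupted")
            continue
    return transcript


def fake_agent(message: str) -> str:
    """Stand-in for agent processing (an assumption, not the real agent)."""
    if message == "long task":
        # Simulate the user pressing Ctrl+C while the agent is working.
        raise KeyboardInterrupt
    return f"reply: {message}"


if __name__ == "__main__":
    print(run_chat_loop(["hi", "long task", "again", "/exit", "never"], fake_agent))
```

In this sketch the interrupted message is recorded and the loop moves on to the next input. An interrupt raised outside the `try` block (the "outer KeyboardInterrupt" case) would still propagate and end the program.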
Fix Ctrl+C behavior to return to chat loop instead of exiting
2026-04-29 03:00:45 -04:00 · 2025-11-03 17:09:16 +00:00 · 2025-11-03 17:07:47 +00:00 · 2025-11-03 16:59:29 +00:00 · 2025-11-03 16:57:02 +00:00 · 2025-11-03 16:51:47 +00:00
453 changed files with 12614 additions and 23173 deletions
--- a/.devcontainer/README.md
+++ b/.devcontainer/README.md
@@ -1 +0,0 @@
-This way of running OpenHands is not officially supported. It is maintained by the community.
--- a/.devcontainer/setup.sh
+++ b/.devcontainer/setup.sh
@@ -7,8 +7,5 @@ git config --global --add safe.directory "$(realpath .)"
 # Install `nc`
 sudo apt update && sudo apt install netcat -y

-# Install `uv` and `uvx`
-wget -qO- https://astral.sh/uv/install.sh | sh
-
 # Do common setup tasks
 source .openhands/setup.sh
--- a/.github/pull_request_template.md
+++ b/.github/pull_request_template.md
@@ -13,7 +13,6 @@
 - [ ] Other (dependency update, docs, typo fixes, etc.)

 ## Checklist
-<!-- AI/LLM AGENTS: This checklist is for a human author to complete. Do NOT check either of the two boxes below. Leave them unchecked until a human has personally reviewed and tested the changes. -->

 - [ ] I have read and reviewed the code and I understand what the code is doing.
 - [ ] I have tested the code to the best of my ability and ensured it works as expected.
--- a/.github/scripts/check_version_consistency.py
+++ b/.github/scripts/check_version_consistency.py
@@ -0,0 +1,73 @@
+#!/usr/bin/env python3
+import os
+import re
+import sys
+
+
+def find_version_references(directory: str) -> tuple[set[str], set[str]]:
+    openhands_versions = set()
+    runtime_versions = set()
+
+    version_pattern_openhands = re.compile(r'openhands:(\d{1})\.(\d{2})')
+    version_pattern_runtime = re.compile(r'runtime:(\d{1})\.(\d{2})')
+
+    for root, _, files in os.walk(directory):
+        # Skip .git directory and docs/build directory
+        if '.git' in root or 'docs/build' in root:
+            continue
+
+        for file in files:
+            if file.endswith(
+                ('.md', '.yml', '.yaml', '.txt', '.html', '.py', '.js', '.ts')
+            ):
+                file_path = os.path.join(root, file)
+                try:
+                    with open(file_path, 'r', encoding='utf-8') as f:
+                        content = f.read()
+
+                        # Find all openhands version references
+                        matches = version_pattern_openhands.findall(content)
+                        if matches:
+                            print(f'Found openhands version {matches} in {file_path}')
+                            openhands_versions.update(matches)
+
+                        # Find all runtime version references
+                        matches = version_pattern_runtime.findall(content)
+                        if matches:
+                            print(f'Found runtime version {matches} in {file_path}')
+                            runtime_versions.update(matches)
+                except Exception as e:
+                    print(f'Error reading {file_path}: {e}', file=sys.stderr)
+
+    return openhands_versions, runtime_versions
+
+
+def main():
+    repo_root = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', '..'))
+    print(f'Checking version consistency in {repo_root}')
+    openhands_versions, runtime_versions = find_version_references(repo_root)
+
+    print(f'Found openhands versions: {sorted(openhands_versions)}')
+    print(f'Found runtime versions: {sorted(runtime_versions)}')
+
+    exit_code = 0
+
+    if len(openhands_versions) > 1:
+        print('Error: Multiple openhands versions found:', file=sys.stderr)
+        print('Found versions:', sorted(openhands_versions), file=sys.stderr)
+        exit_code = 1
+    elif len(openhands_versions) == 0:
+        print('Warning: No openhands version references found', file=sys.stderr)
+
+    if len(runtime_versions) > 1:
+        print('Error: Multiple runtime versions found:', file=sys.stderr)
+        print('Found versions:', sorted(runtime_versions), file=sys.stderr)
+        exit_code = 1
+    elif len(runtime_versions) == 0:
+        print('Warning: No runtime version references found', file=sys.stderr)
+
+    sys.exit(exit_code)
+
+
+if __name__ == '__main__':
+    main()
--- a/.github/scripts/update_pr_description.sh
+++ b/.github/scripts/update_pr_description.sh
@@ -17,6 +17,9 @@ DOCKER_RUN_COMMAND="docker run -it --rm \
  --name openhands-app-${SHORT_SHA} \
  docker.openhands.dev/openhands/openhands:${SHORT_SHA}"

+# Define the uvx command
+UVX_RUN_COMMAND="uvx --python 3.12 --from git+https://github.com/OpenHands/OpenHands@${BRANCH_NAME}#subdirectory=openhands-cli openhands"
+
 # Get the current PR body
 PR_BODY=$(gh pr view "$PR_NUMBER" --json body --jq .body)

@@ -34,6 +37,11 @@ GUI with Docker:
 \`\`\`
 ${DOCKER_RUN_COMMAND}
 \`\`\`
+
+CLI with uvx:
+\`\`\`
+${UVX_RUN_COMMAND}
+\`\`\`
 EOF
 )
 else
@@ -49,6 +57,11 @@ GUI with Docker:
 \`\`\`
 ${DOCKER_RUN_COMMAND}
 \`\`\`
+
+CLI with uvx:
+\`\`\`
+${UVX_RUN_COMMAND}
+\`\`\`
 EOF
 )
 fi
--- a/.github/workflows/check-package-versions.yml
+++ b/.github/workflows/check-package-versions.yml
@@ -1,65 +0,0 @@
-name: Check Package Versions
-
-on:
-  push:
-    branches: [main]
-  pull_request:
-  workflow_dispatch:
-
-jobs:
-  check-package-versions:
-    runs-on: ubuntu-latest
-
-    steps:
-      - name: Checkout repository
-        uses: actions/checkout@v4
-
-      - name: Set up Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: "3.12"
-
-      - name: Check for any 'rev' fields in pyproject.toml
-        run: |
-          python - <<'PY'
-          import sys, tomllib, pathlib
-
-          path = pathlib.Path("pyproject.toml")
-          if not path.exists():
-              print("❌ ERROR: pyproject.toml not found")
-              sys.exit(1)
-
-          try:
-              data = tomllib.loads(path.read_text(encoding="utf-8"))
-          except Exception as e:
-              print(f"❌ ERROR: Failed to parse pyproject.toml: {e}")
-              sys.exit(1)
-
-          poetry = data.get("tool", {}).get("poetry", {})
-          sections = {
-              "dependencies": poetry.get("dependencies", {}),
-          }
-
-          errors = []
-
-          print("🔍 Checking for any dependencies with 'rev' fields...\n")
-          for section_name, deps in sections.items():
-              if not isinstance(deps, dict):
-                  continue
-
-              for pkg_name, cfg in deps.items():
-                  if isinstance(cfg, dict) and "rev" in cfg:
-                      msg = f"  ✖ {pkg_name} in [{section_name}] uses rev='{cfg['rev']}' (NOT ALLOWED)"
-                      print(msg)
-                      errors.append(msg)
-                  else:
-                      print(f"  • {pkg_name}: OK")
-
-          if errors:
-              print("\n❌ FAILED: Found dependencies using 'rev' fields:\n" + "\n".join(errors))
-              print("\nPlease use versioned releases instead, e.g.:")
-              print('  my-package = "1.0.0"')
-              sys.exit(1)
-
-          print("\n✅ SUCCESS: No 'rev' fields found. All dependencies are using proper versioned releases.")
-          PY
--- a/.github/workflows/clean-up.yml
+++ b/.github/workflows/clean-up.yml
@@ -0,0 +1,69 @@
+# Workflow that cleans up outdated and old workflows to prevent out of disk issues
+name: Delete old workflow runs
+
+# This workflow is currently only triggered manually
+on:
+  workflow_dispatch:
+    inputs:
+      days:
+        description: 'Days-worth of runs to keep for each workflow'
+        required: true
+        default: '30'
+      minimum_runs:
+        description: 'Minimum runs to keep for each workflow'
+        required: true
+        default: '10'
+      delete_workflow_pattern:
+        description: 'Name or filename of the workflow (if not set, all workflows are targeted)'
+        required: false
+      delete_workflow_by_state_pattern:
+        description: 'Filter workflows by state: active, deleted, disabled_fork, disabled_inactivity, disabled_manually'
+        required: true
+        default: "ALL"
+        type: choice
+        options:
+          - "ALL"
+          - active
+          - deleted
+          - disabled_inactivity
+          - disabled_manually
+      delete_run_by_conclusion_pattern:
+        description: 'Remove runs based on conclusion: action_required, cancelled, failure, skipped, success'
+        required: true
+        default: 'ALL'
+        type: choice
+        options:
+          - 'ALL'
+          - 'Unsuccessful: action_required,cancelled,failure,skipped'
+          - action_required
+          - cancelled
+          - failure
+          - skipped
+          - success
+      dry_run:
+        description: 'Logs simulated changes, no deletions are performed'
+        required: false
+
+jobs:
+  del_runs:
+    runs-on: blacksmith-4vcpu-ubuntu-2204
+    permissions:
+      actions: write
+      contents: read
+    steps:
+      - name: Delete workflow runs
+        uses: Mattraks/delete-workflow-runs@v2
+        with:
+          token: ${{ github.token }}
+          repository: ${{ github.repository }}
+          retain_days: ${{ github.event.inputs.days }}
+          keep_minimum_runs: ${{ github.event.inputs.minimum_runs }}
+          delete_workflow_pattern: ${{ github.event.inputs.delete_workflow_pattern }}
+          delete_workflow_by_state_pattern: ${{ github.event.inputs.delete_workflow_by_state_pattern }}
+          delete_run_by_conclusion_pattern: >-
+            ${{
+              startsWith(github.event.inputs.delete_run_by_conclusion_pattern, 'Unsuccessful:')
+              && 'action_required,cancelled,failure,skipped'
+              || github.event.inputs.delete_run_by_conclusion_pattern
+            }}
+          dry_run: ${{ github.event.inputs.dry_run }}
--- a/.github/workflows/cli-build-binary-and-optionally-release.yml
+++ b/.github/workflows/cli-build-binary-and-optionally-release.yml
@@ -0,0 +1,122 @@
+# Workflow that builds and tests the CLI binary executable
+name: CLI - Build binary and optionally release
+
+# Run on pushes to main branch and CLI tags, and on pull requests when CLI files change
+on:
+  push:
+    branches:
+      - main
+    tags:
+      - "*-cli"
+  pull_request:
+    paths:
+      - "openhands-cli/**"
+
+permissions:
+  contents: write       # needed to create releases or upload assets
+
+# Cancel previous runs if a new commit is pushed
+concurrency:
+  group: ${{ github.workflow }}-${{ (github.head_ref && github.ref) || github.run_id }}
+  cancel-in-progress: true
+
+jobs:
+  build-binary:
+    name: Build binary executable
+    strategy:
+      matrix:
+        include:
+          # Build on Ubuntu 22.04 for maximum GLIBC compatibility (GLIBC 2.31)
+          - os: ubuntu-22.04
+            platform: linux
+            artifact_name: openhands-cli-linux
+          # Build on macOS for macOS users
+          - os: macos-15
+            platform: macos
+            artifact_name: openhands-cli-macos
+    runs-on: ${{ matrix.os }}
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: 3.12
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v3
+        with:
+          version: "latest"
+
+      - name: Install dependencies
+        working-directory: openhands-cli
+        run: |
+          uv sync
+
+      - name: Build binary executable
+        working-directory: openhands-cli
+        run: |
+          ./build.sh --install-pyinstaller | tee output.log
+          echo "Full output:"
+          cat output.log
+
+          if grep -q "❌" output.log; then
+            echo "❌ Found failure marker in output"
+            exit 1
+          fi
+
+          echo "✅ Build & test finished without ❌ markers"
+
+      - name: Verify binary files exist
+        run: |
+          if ! ls openhands-cli/dist/openhands* 1> /dev/null 2>&1; then
+            echo "❌ No binaries found to upload!"
+            exit 1
+          fi
+          echo "✅ Found binaries to upload."
+
+      - name: Upload binary artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: ${{ matrix.artifact_name }}
+          path: openhands-cli/dist/openhands*
+          retention-days: 30
+
+  create-github-release:
+    name: Create GitHub Release
+    runs-on: ubuntu-latest
+    needs: build-binary
+    if: startsWith(github.ref, 'refs/tags/')
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Download all artifacts
+        uses: actions/download-artifact@v4
+        with:
+          path: artifacts
+
+      - name: Prepare release assets
+        run: |
+          mkdir -p release-assets
+          # Copy binaries with appropriate names for release
+          if [ -f artifacts/openhands-cli-linux/openhands ]; then
+            cp artifacts/openhands-cli-linux/openhands release-assets/openhands-linux
+          fi
+          if [ -f artifacts/openhands-cli-macos/openhands ]; then
+            cp artifacts/openhands-cli-macos/openhands release-assets/openhands-macos
+          fi
+          ls -la release-assets/
+
+      - name: Create GitHub Release
+        uses: softprops/action-gh-release@v2
+        with:
+          files: release-assets/*
+          draft: true
+          prerelease: false
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
--- a/.github/workflows/dispatch-to-docs.yml
+++ b/.github/workflows/dispatch-to-docs.yml
@@ -0,0 +1,23 @@
+name: Dispatch to docs repo
+
+on:
+  push:
+    branches: [main]
+    paths:
+      - 'docs/**'
+  workflow_dispatch:
+
+jobs:
+  dispatch:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        repo: ["OpenHands/docs"]
+    steps:
+      - name: Push to docs repo
+        uses: peter-evans/repository-dispatch@v3
+        with:
+          token: ${{ secrets.ALLHANDS_BOT_GITHUB_PAT }}
+          repository: ${{ matrix.repo }}
+          event-type: update
+          client-payload: '{"ref": "${{ github.ref }}", "sha": "${{ github.sha }}", "module": "openhands", "branch": "main"}'
--- a/.github/workflows/ghcr-build.yml
+++ b/.github/workflows/ghcr-build.yml
@@ -86,7 +86,7 @@ jobs:

  # Builds the runtime Docker images
  ghcr_build_runtime:
-    name: Build Runtime Image
+    name: Build Image
    runs-on: blacksmith-8vcpu-ubuntu-2204
    if: "!(github.event_name == 'push' && startsWith(github.ref, 'refs/tags/ext-v'))"
    permissions:
@@ -256,7 +256,7 @@ jobs:
  test_runtime_root:
    name: RT Unit Tests (Root)
    needs: [ghcr_build_runtime, define-matrix]
-    runs-on: blacksmith-4vcpu-ubuntu-2404
+    runs-on: blacksmith-8vcpu-ubuntu-2204
    strategy:
      fail-fast: false
      matrix:
@@ -298,7 +298,7 @@ jobs:
          # We install pytest-xdist in order to run tests across CPUs
          poetry run pip install pytest-xdist

-          # Install to be able to retry on failures for flakey tests
+          # Install to be able to retry on failures for flaky tests
          poetry run pip install pytest-rerunfailures

          image_name=ghcr.io/${{ env.REPO_OWNER }}/runtime:${{ env.RELEVANT_SHA }}-${{ matrix.base_image.tag }}
@@ -311,14 +311,14 @@ jobs:
          SANDBOX_RUNTIME_CONTAINER_IMAGE=$image_name \
          TEST_IN_CI=true \
          RUN_AS_OPENHANDS=false \
-          poetry run pytest -n 5 -raRs --reruns 2 --reruns-delay 3 -s ./tests/runtime --ignore=tests/runtime/test_browsergym_envs.py --durations=10
+          poetry run pytest -n 0 -raRs --reruns 2 --reruns-delay 5 -s ./tests/runtime --ignore=tests/runtime/test_browsergym_envs.py --durations=10
        env:
          DEBUG: "1"

  # Run unit tests with the Docker runtime Docker images as openhands user
  test_runtime_oh:
    name: RT Unit Tests (openhands)
-    runs-on: blacksmith-4vcpu-ubuntu-2404
+    runs-on: blacksmith-8vcpu-ubuntu-2204
    needs: [ghcr_build_runtime, define-matrix]
    strategy:
      matrix:
@@ -370,7 +370,7 @@ jobs:
          SANDBOX_RUNTIME_CONTAINER_IMAGE=$image_name \
          TEST_IN_CI=true \
          RUN_AS_OPENHANDS=true \
-          poetry run pytest -n 5 -raRs --reruns 2 --reruns-delay 3 -s ./tests/runtime --ignore=tests/runtime/test_browsergym_envs.py --durations=10
+          poetry run pytest -n 0 -raRs --reruns 2 --reruns-delay 5 -s ./tests/runtime --ignore=tests/runtime/test_browsergym_envs.py --durations=10
        env:
          DEBUG: "1"

--- a/.github/workflows/integration-runner.yml
+++ b/.github/workflows/integration-runner.yml
@@ -0,0 +1,199 @@
+name: Run Integration Tests
+
+on:
+  pull_request:
+    types: [labeled]
+  workflow_dispatch:
+    inputs:
+      reason:
+        description: 'Reason for manual trigger'
+        required: true
+        default: ''
+  schedule:
+    - cron: '30 22 * * *'  # Runs at 10:30pm UTC every day
+
+env:
+  N_PROCESSES: 10 # Global configuration for number of parallel processes for evaluation
+
+jobs:
+  run-integration-tests:
+    if: github.event.label.name == 'integration-test' || github.event_name == 'workflow_dispatch' || github.event_name == 'schedule'
+    runs-on: blacksmith-4vcpu-ubuntu-2204
+    permissions:
+      contents: "read"
+      id-token: "write"
+      pull-requests: "write"
+      issues: "write"
+    strategy:
+      matrix:
+        python-version: ["3.12"]
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Install poetry via pipx
+        run: pipx install poetry
+
+      - name: Set up Python
+        uses: useblacksmith/setup-python@v6
+        with:
+          python-version: ${{ matrix.python-version }}
+          cache: "poetry"
+
+      - name: Setup Node.js
+        uses: useblacksmith/setup-node@v5
+        with:
+          node-version: '22.x'
+
+      - name: Comment on PR if 'integration-test' label is present
+        if: github.event_name == 'pull_request' && github.event.label.name == 'integration-test'
+        uses: KeisukeYamashita/create-comment@v1
+        with:
+          unique: false
+          comment: |
+            Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.
+
+      - name: Install Python dependencies using Poetry
+        run: poetry install --with dev,test,runtime,evaluation
+
+      - name: Configure config.toml for testing with Haiku
+        env:
+          LLM_MODEL: "litellm_proxy/claude-3-5-haiku-20241022"
+          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
+          LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
+          MAX_ITERATIONS: 10
+        run: |
+          echo "[llm.eval]" > config.toml
+          echo "model = \"$LLM_MODEL\"" >> config.toml
+          echo "api_key = \"$LLM_API_KEY\"" >> config.toml
+          echo "base_url = \"$LLM_BASE_URL\"" >> config.toml
+          echo "temperature = 0.0" >> config.toml
+
+      - name: Build environment
+        run: make build
+
+      - name: Run integration test evaluation for Haiku
+        env:
+          SANDBOX_FORCE_REBUILD_RUNTIME: True
+        run: |
+          poetry run ./evaluation/integration_tests/scripts/run_infer.sh llm.eval HEAD CodeActAgent '' 10 $N_PROCESSES '' 'haiku_run'
+
+          # get integration tests report
+          REPORT_FILE_HAIKU=$(find evaluation/evaluation_outputs/outputs/integration_tests/CodeActAgent/*haiku*_maxiter_10_N* -name "report.md" -type f | head -n 1)
+          echo "REPORT_FILE: $REPORT_FILE_HAIKU"
+          echo "INTEGRATION_TEST_REPORT_HAIKU<<EOF" >> $GITHUB_ENV
+          cat $REPORT_FILE_HAIKU >> $GITHUB_ENV
+          echo >> $GITHUB_ENV
+          echo "EOF" >> $GITHUB_ENV
+
+      - name: Wait a little bit
+        run: sleep 10
+
+      - name: Configure config.toml for testing with DeepSeek
+        env:
+          LLM_MODEL: "litellm_proxy/deepseek-chat"
+          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
+          LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
+          MAX_ITERATIONS: 10
+        run: |
+          echo "[llm.eval]" > config.toml
+          echo "model = \"$LLM_MODEL\"" >> config.toml
+          echo "api_key = \"$LLM_API_KEY\"" >> config.toml
+          echo "base_url = \"$LLM_BASE_URL\"" >> config.toml
+          echo "temperature = 0.0" >> config.toml
+
+      - name: Run integration test evaluation for DeepSeek
+        env:
+          SANDBOX_FORCE_REBUILD_RUNTIME: True
+        run: |
+          poetry run ./evaluation/integration_tests/scripts/run_infer.sh llm.eval HEAD CodeActAgent '' 10 $N_PROCESSES '' 'deepseek_run'
+
+          # get integration tests report
+          REPORT_FILE_DEEPSEEK=$(find evaluation/evaluation_outputs/outputs/integration_tests/CodeActAgent/deepseek*_maxiter_10_N* -name "report.md" -type f | head -n 1)
+          echo "REPORT_FILE: $REPORT_FILE_DEEPSEEK"
+          echo "INTEGRATION_TEST_REPORT_DEEPSEEK<<EOF" >> $GITHUB_ENV
+          cat $REPORT_FILE_DEEPSEEK >> $GITHUB_ENV
+          echo >> $GITHUB_ENV
+          echo "EOF" >> $GITHUB_ENV
+
+      # -------------------------------------------------------------
+      # Run VisualBrowsingAgent tests for DeepSeek, limited to t05 and t06
+      - name: Wait a little bit (again)
+        run: sleep 5
+
+      - name: Configure config.toml for testing VisualBrowsingAgent (DeepSeek)
+        env:
+          LLM_MODEL: "litellm_proxy/deepseek-chat"
+          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
+          LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
+          MAX_ITERATIONS: 15
+        run: |
+          echo "[llm.eval]" > config.toml
+          echo "model = \"$LLM_MODEL\"" >> config.toml
+          echo "api_key = \"$LLM_API_KEY\"" >> config.toml
+          echo "base_url = \"$LLM_BASE_URL\"" >> config.toml
+          echo "temperature = 0.0" >> config.toml
+      - name: Run integration test evaluation for VisualBrowsingAgent (DeepSeek)
+        env:
+          SANDBOX_FORCE_REBUILD_RUNTIME: True
+        run: |
+          poetry run ./evaluation/integration_tests/scripts/run_infer.sh llm.eval HEAD VisualBrowsingAgent '' 15 $N_PROCESSES "t05_simple_browsing,t06_github_pr_browsing.py" 'visualbrowsing_deepseek_run'
+
+          # Find and export the visual browsing agent test results
+          REPORT_FILE_VISUALBROWSING_DEEPSEEK=$(find evaluation/evaluation_outputs/outputs/integration_tests/VisualBrowsingAgent/deepseek*_maxiter_15_N* -name "report.md" -type f | head -n 1)
+          echo "REPORT_FILE_VISUALBROWSING_DEEPSEEK: $REPORT_FILE_VISUALBROWSING_DEEPSEEK"
+          echo "INTEGRATION_TEST_REPORT_VISUALBROWSING_DEEPSEEK<<EOF" >> $GITHUB_ENV
+          cat $REPORT_FILE_VISUALBROWSING_DEEPSEEK >> $GITHUB_ENV
+          echo >> $GITHUB_ENV
+          echo "EOF" >> $GITHUB_ENV
+
+      - name: Create archive of evaluation outputs
+        run: |
+          TIMESTAMP=$(date +'%y-%m-%d-%H-%M')
+          cd evaluation/evaluation_outputs/outputs  # Change to the outputs directory
+          tar -czvf ../../../integration_tests_${TIMESTAMP}.tar.gz integration_tests/CodeActAgent/* integration_tests/VisualBrowsingAgent/* # Only include the actual result directories
+
+      - name: Upload evaluation results as artifact
+        uses: actions/upload-artifact@v4
+        id: upload_results_artifact
+        with:
+          name: integration-test-outputs-${{ github.run_id }}-${{ github.run_attempt }}
+          path: integration_tests_*.tar.gz
+
+      - name: Get artifact URLs
+        run: |
+          echo "ARTIFACT_URL=${{ steps.upload_results_artifact.outputs.artifact-url }}" >> $GITHUB_ENV
+
+      - name: Set timestamp and trigger reason
+        run: |
+          echo "TIMESTAMP=$(date +'%Y-%m-%d-%H-%M')" >> $GITHUB_ENV
+          if [[ "${{ github.event_name }}" == "pull_request" ]]; then
+            echo "TRIGGER_REASON=pr-${{ github.event.pull_request.number }}" >> $GITHUB_ENV
+          elif [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
+            echo "TRIGGER_REASON=manual-${{ github.event.inputs.reason }}" >> $GITHUB_ENV
+          else
+            echo "TRIGGER_REASON=nightly-scheduled" >> $GITHUB_ENV
+          fi
+
+      - name: Comment with results and artifact link
+        id: create_comment
+        uses: KeisukeYamashita/create-comment@v1
+        with:
+          # if triggered by PR, use PR number, otherwise use 9745 as fallback issue number for manual triggers
+          number: ${{ github.event_name == 'pull_request' && github.event.pull_request.number || 9745 }}
+          unique: false
+          comment: |
+              Trigger by: ${{ github.event_name == 'pull_request' && format('Pull Request (integration-test label on PR #{0})', github.event.pull_request.number) || (github.event_name == 'workflow_dispatch' && format('Manual Trigger: {0}', github.event.inputs.reason)) || 'Nightly Scheduled Run' }}
+              Commit: ${{ github.sha }}
+              **Integration Tests Report (Haiku)**
+              Haiku LLM Test Results:
+              ${{ env.INTEGRATION_TEST_REPORT_HAIKU }}
+              ---
+              **Integration Tests Report (DeepSeek)**
+              DeepSeek LLM Test Results:
+              ${{ env.INTEGRATION_TEST_REPORT_DEEPSEEK }}
+              ---
+              **Integration Tests Report VisualBrowsing (DeepSeek)**
+              ${{ env.INTEGRATION_TEST_REPORT_VISUALBROWSING_DEEPSEEK }}
+              ---
+              Download testing outputs (includes both Haiku and DeepSeek results): [Download](${{ steps.upload_results_artifact.outputs.artifact-url }})
--- a/.github/workflows/lint.yml
+++ b/.github/workflows/lint.yml
@@ -72,3 +72,34 @@ jobs:
      - name: Run pre-commit hooks
        working-directory: ./enterprise
        run: pre-commit run --all-files --show-diff-on-failure --config ./dev_config/python/.pre-commit-config.yaml
+
+  lint-cli-python:
+    name: Lint CLI python
+    runs-on: blacksmith-4vcpu-ubuntu-2204
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+      - name: Set up python
+        uses: useblacksmith/setup-python@v6
+        with:
+          python-version: 3.12
+          cache: "pip"
+      - name: Install pre-commit
+        run: pip install pre-commit==4.2.0
+      - name: Run pre-commit hooks
+        working-directory: ./openhands-cli
+        run: pre-commit run --all-files --config ./dev_config/python/.pre-commit-config.yaml
+
+  # Check version consistency across documentation
+  check-version-consistency:
+    name: Check version consistency
+    runs-on: blacksmith-4vcpu-ubuntu-2204
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up python
+        uses: useblacksmith/setup-python@v6
+        with:
+          python-version: 3.12
+      - name: Run version consistency check
+        run: .github/scripts/check_version_consistency.py
--- a/.github/workflows/mdx-lint.yml
+++ b/.github/workflows/mdx-lint.yml
@@ -0,0 +1,70 @@
+# Workflow that checks MDX format in docs/ folder
+name: MDX Lint
+
+# Run on pushes to main and on pull requests that modify docs/ files
+on:
+  push:
+    branches:
+      - main
+    paths:
+      - 'docs/**/*.mdx'
+  pull_request:
+    paths:
+      - 'docs/**/*.mdx'
+
+# If triggered by a PR, it will be in the same group. However, each commit on main will be in its own unique group
+concurrency:
+  group: ${{ github.workflow }}-${{ (github.head_ref && github.ref) || github.run_id }}
+  cancel-in-progress: true
+
+jobs:
+  mdx-lint:
+    name: Lint MDX files
+    runs-on: blacksmith-4vcpu-ubuntu-2204
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install Node.js 22
+        uses: useblacksmith/setup-node@v5
+        with:
+          node-version: 22
+
+      - name: Install MDX dependencies
+        run: |
+          npm install @mdx-js/mdx@3 glob@10
+
+      - name: Validate MDX files
+        run: |
+          node -e "
+          const {compile} = require('@mdx-js/mdx');
+          const fs = require('fs');
+          const path = require('path');
+          const glob = require('glob');
+
+          async function validateMDXFiles() {
+            const files = glob.sync('docs/**/*.mdx');
+            console.log('Found', files.length, 'MDX files to validate');
+
+            let hasErrors = false;
+
+            for (const file of files) {
+              try {
+                const content = fs.readFileSync(file, 'utf8');
+                await compile(content);
+                console.log('✅ MDX parsing successful for', file);
+              } catch (err) {
+                console.error('❌ MDX parsing failed for', file, ':', err.message);
+                hasErrors = true;
+              }
+            }
+
+            if (hasErrors) {
+              console.error('\\n❌ Some MDX files have parsing errors. Please fix them before merging.');
+              process.exit(1);
+            } else {
+              console.log('\\n✅ All MDX files are valid!');
+            }
+          }
+
+          validateMDXFiles();
+          "
--- a/.github/workflows/py-tests.yml
+++ b/.github/workflows/py-tests.yml
@@ -48,10 +48,7 @@ jobs:
          python-version: ${{ matrix.python-version }}
          cache: "poetry"
      - name: Install Python dependencies using Poetry
-        run: |
-          poetry install --with dev,test,runtime
-          poetry run pip install pytest-xdist
-          poetry run pip install pytest-rerunfailures
+        run: poetry install --with dev,test,runtime
      - name: Build Environment
        run: make build
      - name: Run Unit Tests
@@ -59,7 +56,7 @@ jobs:
        env:
          COVERAGE_FILE: ".coverage.${{ matrix.python_version }}"
      - name: Run Runtime Tests with CLIRuntime
-        run: PYTHONPATH=".:$PYTHONPATH" TEST_RUNTIME=cli poetry run pytest -n 5 --reruns 2 --reruns-delay 3 -s tests/runtime/test_bash.py --cov=openhands --cov-branch
+        run: PYTHONPATH=".:$PYTHONPATH" TEST_RUNTIME=cli poetry run pytest -s tests/runtime/test_bash.py --cov=openhands --cov-branch
        env:
          COVERAGE_FILE: ".coverage.runtime.${{ matrix.python_version }}"
      - name: Store coverage file
@@ -70,7 +67,37 @@ jobs:
            .coverage.${{ matrix.python_version }}
            .coverage.runtime.${{ matrix.python_version }}
          include-hidden-files: true
-
+  # Run specific Windows python tests
+  test-on-windows:
+    name: Python Tests on Windows
+    runs-on: windows-latest
+    strategy:
+      matrix:
+        python-version: ["3.12"]
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install pipx
+        run: pip install pipx
+      - name: Install poetry via pipx
+        run: pipx install poetry
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+          cache: "poetry"
+      - name: Install Python dependencies using Poetry
+        run: poetry install --with dev,test,runtime
+      - name: Run Windows unit tests
+        run: poetry run pytest -svv tests/unit/runtime/utils/test_windows_bash.py
+        env:
+          PYTHONPATH: ".;$env:PYTHONPATH"
+          DEBUG: "1"
+      - name: Run Windows runtime tests with LocalRuntime
+        run: $env:TEST_RUNTIME="local"; poetry run pytest -svv tests/runtime/test_bash.py
+        env:
+          PYTHONPATH: ".;$env:PYTHONPATH"
+          TEST_RUNTIME: local
+          DEBUG: "1"
  test-enterprise:
    name: Enterprise Python Unit Tests
    runs-on: blacksmith-4vcpu-ubuntu-2404
@@ -101,11 +128,57 @@ jobs:
          path: ".coverage.enterprise.${{ matrix.python_version }}"
          include-hidden-files: true

+  # Run CLI unit tests
+  test-cli-python:
+    name: CLI Unit Tests
+    runs-on: blacksmith-4vcpu-ubuntu-2404
+    strategy:
+      matrix:
+        python-version: ["3.12"]
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Set up Python
+        uses: useblacksmith/setup-python@v6
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v3
+        with:
+          version: "latest"
+
+      - name: Install dependencies
+        working-directory: ./openhands-cli
+        run: |
+          uv sync --group dev
+
+      - name: Run CLI unit tests
+        working-directory: ./openhands-cli
+        env:
+          # write coverage to repo root so the merge step finds it
+          COVERAGE_FILE: "${{ github.workspace }}/.coverage.openhands-cli.${{ matrix.python-version }}"
+        run: |
+          uv run pytest --forked -n auto -s \
+            -p no:ddtrace -p no:ddtrace.pytest_bdd -p no:ddtrace.pytest_benchmark \
+            tests --cov=openhands_cli --cov-branch
+
+      - name: Store coverage file
+        uses: actions/upload-artifact@v4
+        with:
+          name: coverage-openhands-cli
+          path: ".coverage.openhands-cli.${{ matrix.python-version }}"
+          include-hidden-files: true
+
+
  coverage-comment:
    name: Coverage Comment
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
-    needs: [test-on-linux, test-enterprise]
+    needs: [test-on-linux, test-enterprise, test-cli-python]

    permissions:
      pull-requests: write
@@ -119,6 +192,9 @@ jobs:
          pattern: coverage-*
          merge-multiple: true

+      - name: Create symlink for CLI source files
+        run: ln -sf openhands-cli/openhands_cli openhands_cli
+
      - name: Coverage comment
        id: coverage_comment
        uses: py-cov-action/python-coverage-comment-action@v3
--- a/.github/workflows/pypi-release.yml
+++ b/.github/workflows/pypi-release.yml
@@ -10,6 +10,7 @@ on:
        type: choice
        options:
          - app server
+          - cli
        default: app server
  push:
    tags:
@@ -38,3 +39,36 @@ jobs:
        run: ./build.sh
      - name: publish
        run: poetry publish -u __token__ -p ${{ secrets.PYPI_TOKEN }}
+
+  release-cli:
+    name: Publish CLI to PyPI
+    runs-on: ubuntu-latest
+    # Run when manually dispatched for "cli" OR for tag pushes that contain '-cli'
+    if: |
+      (github.event_name == 'workflow_dispatch' && github.event.inputs.reason == 'cli')
+      || (github.event_name == 'push' && startsWith(github.ref, 'refs/tags/') && contains(github.ref, '-cli'))
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: 3.12
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v3
+        with:
+          version: "latest"
+
+      - name: Build CLI package
+        working-directory: openhands-cli
+        run: |
+          # Clean dist directory to avoid conflicts with binary builds
+          rm -rf dist/
+          uv build
+
+      - name: Publish CLI to PyPI
+        working-directory: openhands-cli
+        run: |
+          uv publish --token ${{ secrets.PYPI_TOKEN_OPENHANDS }}
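Review note: the `if:` guard on `release-cli` can be read as a two-branch predicate. The sketch below restates it in Python to make the matching rules easier to check. The function name and arguments are hypothetical and not part of the workflow. It lower-cases its inputs on the assumption that GitHub Actions string comparisons, `startsWith`, and `contains` ignore case.

```python
def should_release_cli(event_name: str, ref: str = "", reason: str = "") -> bool:
    """Approximate the `if:` expression of the release-cli job (illustrative only)."""
    ref_lc = ref.lower()
    # Branch 1: manual dispatch with the "cli" choice selected.
    manual_cli = event_name == "workflow_dispatch" and reason.lower() == "cli"
    # Branch 2: a tag push whose ref contains "-cli" anywhere in the string.
    cli_tag = (
        event_name == "push"
        and ref_lc.startswith("refs/tags/")
        and "-cli" in ref_lc
    )
    return manual_cli or cli_tag
```

Note that the tag branch matches `-cli` anywhere in the ref, not only as a suffix, so a tag such as `1.0.0-client` would also satisfy it.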
--- a/.github/workflows/run-eval.yml
+++ b/.github/workflows/run-eval.yml
@@ -0,0 +1,135 @@
+# Run evaluation on a PR, after releases, or manually
+name: Run Eval
+
+# Runs when a PR is labeled with one of the "run-eval-" labels, after releases, or manually triggered
+on:
+  pull_request:
+    types: [labeled]
+  release:
+    types: [published]
+  workflow_dispatch:
+    inputs:
+      branch:
+        description: 'Branch to evaluate'
+        required: true
+        default: 'main'
+      eval_instances:
+        description: 'Number of evaluation instances'
+        required: true
+        default: '50'
+        type: choice
+        options:
+          - '1'
+          - '2'
+          - '50'
+          - '100'
+      reason:
+        description: 'Reason for manual trigger'
+        required: false
+        default: ''
+
+env:
+  # Environment variable for the master GitHub issue number where all evaluation results will be commented
+  # This should be set to the issue number where you want all evaluation results to be posted
+  MASTER_EVAL_ISSUE_NUMBER: ${{ vars.MASTER_EVAL_ISSUE_NUMBER || '0' }}
+
+jobs:
+  trigger-job:
+    name: Trigger remote eval job
+    if: ${{ (github.event_name == 'pull_request' && (github.event.label.name == 'run-eval-1' || github.event.label.name == 'run-eval-2' || github.event.label.name == 'run-eval-50' || github.event.label.name == 'run-eval-100')) || github.event_name == 'release' || github.event_name == 'workflow_dispatch' }}
+    runs-on: blacksmith-4vcpu-ubuntu-2204
+
+    steps:
+      - name: Checkout branch
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ github.event_name == 'pull_request' && github.head_ref || (github.event_name == 'workflow_dispatch' && github.event.inputs.branch) || github.ref }}
+
+      - name: Set evaluation parameters
+        id: eval_params
+        run: |
+          REPO_URL="https://github.com/${{ github.repository }}"
+          echo "Repository URL: $REPO_URL"
+
+          # Determine branch based on trigger type
+          if [[ "${{ github.event_name }}" == "pull_request" ]]; then
+            EVAL_BRANCH="${{ github.head_ref }}"
+            echo "PR Branch: $EVAL_BRANCH"
+          elif [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
+            EVAL_BRANCH="${{ github.event.inputs.branch }}"
+            echo "Manual Branch: $EVAL_BRANCH"
+          else
+            # For release events, use the tag name or main branch
+            EVAL_BRANCH="${{ github.ref_name }}"
+            echo "Release Branch/Tag: $EVAL_BRANCH"
+          fi
+
+          # Determine evaluation instances based on trigger type
+          if [[ "${{ github.event_name }}" == "pull_request" ]]; then
+            if [[ "${{ github.event.label.name }}" == "run-eval-1" ]]; then
+              EVAL_INSTANCES="1"
+            elif [[ "${{ github.event.label.name }}" == "run-eval-2" ]]; then
+              EVAL_INSTANCES="2"
+            elif [[ "${{ github.event.label.name }}" == "run-eval-50" ]]; then
+              EVAL_INSTANCES="50"
+            elif [[ "${{ github.event.label.name }}" == "run-eval-100" ]]; then
+              EVAL_INSTANCES="100"
+            fi
+          elif [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
+            EVAL_INSTANCES="${{ github.event.inputs.eval_instances }}"
+          else
+            # For release events, default to 50 instances
+            EVAL_INSTANCES="50"
+          fi
+
+          echo "Evaluation instances: $EVAL_INSTANCES"
+          echo "repo_url=$REPO_URL" >> $GITHUB_OUTPUT
+          echo "eval_branch=$EVAL_BRANCH" >> $GITHUB_OUTPUT
+          echo "eval_instances=$EVAL_INSTANCES" >> $GITHUB_OUTPUT
+
+      - name: Trigger remote job
+        run: |
+          # Determine PR number for the remote evaluation system
+          if [[ "${{ github.event_name }}" == "pull_request" ]]; then
+            PR_NUMBER="${{ github.event.pull_request.number }}"
+          else
+            # For non-PR triggers, use the master issue number as PR number
+            PR_NUMBER="${{ env.MASTER_EVAL_ISSUE_NUMBER }}"
+          fi
+
+          curl -X POST \
+            -H "Authorization: Bearer ${{ secrets.PAT_TOKEN }}" \
+            -H "Accept: application/vnd.github+json" \
+            -d "{\"ref\": \"main\", \"inputs\": {\"github-repo\": \"${{ steps.eval_params.outputs.repo_url }}\", \"github-branch\": \"${{ steps.eval_params.outputs.eval_branch }}\", \"pr-number\": \"${PR_NUMBER}\", \"eval-instances\": \"${{ steps.eval_params.outputs.eval_instances }}\"}}" \
+            https://api.github.com/repos/OpenHands/evaluation/actions/workflows/create-branch.yml/dispatches
+
+          # Send Slack message
+          if [[ "${{ github.event_name }}" == "pull_request" ]]; then
+            TRIGGER_URL="https://github.com/${{ github.repository }}/pull/${{ github.event.pull_request.number }}"
+            slack_text="PR $TRIGGER_URL has triggered evaluation on ${{ steps.eval_params.outputs.eval_instances }} instances..."
+          elif [[ "${{ github.event_name }}" == "release" ]]; then
+            TRIGGER_URL="https://github.com/${{ github.repository }}/releases/tag/${{ github.ref_name }}"
+            slack_text="Release $TRIGGER_URL has triggered evaluation on ${{ steps.eval_params.outputs.eval_instances }} instances..."
+          else
+            TRIGGER_URL="https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+            slack_text="Manual trigger (${{ github.event.inputs.reason || 'No reason provided' }}) has triggered evaluation on ${{ steps.eval_params.outputs.eval_instances }} instances for branch ${{ steps.eval_params.outputs.eval_branch }}..."
+          fi
+
+          curl -X POST -H 'Content-type: application/json' --data '{"text":"'"$slack_text"'"}' \
+            https://hooks.slack.com/services/${{ secrets.SLACK_TOKEN }}
+
+      - name: Comment on issue/PR
+        uses: KeisukeYamashita/create-comment@v1
+        with:
+          # For PR triggers, comment on the PR. For other triggers, comment on the master issue
+          number: ${{ github.event_name == 'pull_request' && github.event.pull_request.number || env.MASTER_EVAL_ISSUE_NUMBER }}
+          unique: false
+          comment: |
+            **Evaluation Triggered**
+
+            **Trigger:** ${{ github.event_name == 'pull_request' && format('Pull Request #{0}', github.event.pull_request.number) || (github.event_name == 'release' && 'Release') || format('Manual Trigger: {0}', github.event.inputs.reason || 'No reason provided') }}
+            **Branch:** ${{ steps.eval_params.outputs.eval_branch }}
+            **Instances:** ${{ steps.eval_params.outputs.eval_instances }}
+            **Commit:** ${{ github.sha }}
+
+            Running evaluation on the specified branch. Once eval is done, the results will be posted here.
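Review note: the "Set evaluation parameters" step resolves a branch and an instance count from the trigger type. A minimal Python sketch of that decision table follows. The function and constant names are hypothetical and exist only to illustrate the shell logic above, not to replace it.

```python
# Label names and counts are taken from the workflow; everything else is illustrative.
LABEL_TO_INSTANCES = {
    "run-eval-1": "1",
    "run-eval-2": "2",
    "run-eval-50": "50",
    "run-eval-100": "100",
}
DEFAULT_RELEASE_INSTANCES = "50"


def resolve_eval_params(
    event_name: str,
    *,
    head_ref: str = "",
    label: str = "",
    input_branch: str = "",
    input_instances: str = "",
    ref_name: str = "",
) -> tuple[str, str]:
    """Return (eval_branch, eval_instances) the way the shell step derives them."""
    if event_name == "pull_request":
        # An unrecognized label leaves the count empty, as in the shell script.
        return head_ref, LABEL_TO_INSTANCES.get(label, "")
    if event_name == "workflow_dispatch":
        return input_branch, input_instances
    # Release events: use the tag or branch name and the default count.
    return ref_name, DEFAULT_RELEASE_INSTANCES
```

For example, `resolve_eval_params("pull_request", head_ref="fix-bug", label="run-eval-2")` yields the PR branch with two instances, while a release event always falls through to the default of 50.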
--- a/.gitignore
+++ b/.gitignore
@@ -185,9 +185,6 @@ cython_debug/
 .repomix
 repomix-output.txt

-# Emacs backup
-*~
-
 # evaluation
 evaluation/evaluation_outputs
 evaluation/outputs
--- a/1
+++ b/1
@@ -1 +0,0 @@
-docs.all-hands.dev
--- a/COMMUNITY.md
+++ b/COMMUNITY.md
@@ -1,45 +1,43 @@
-# The OpenHands Community
+# 🙌 The OpenHands Community

-OpenHands is a community of engineers, academics, and enthusiasts reimagining software development for an AI-powered world.
+The OpenHands community is built around the belief that (1) AI and AI agents are going to fundamentally change the way
+we build software, and (2) if this is true, we should do everything we can to make sure that the benefits provided by
+such powerful technology are accessible to everyone.

-## Mission
+If this resonates with you, we'd love to have you join us in our quest!

-It’s very clear that AI is changing software development. We want the developer community to drive that change organically, through open source.
+## 🤝 How to Join

-So we’re not just building friendly interfaces for AI-driven development. We’re publishing _building blocks_ that empower developers to create new experiences, tailored to your own habits, needs, and imagination.
+Check out our [How to Join the Community section.](https://github.com/OpenHands/OpenHands?tab=readme-ov-file#-how-to-join-the-community)

-## Ethos
+## 💪 Becoming a Contributor

-We have two core values: **high openness** and **high agency**. While we don’t expect everyone in the community to embody these values, we want to establish them as norms.
+We welcome contributions from everyone! Whether you're a developer, a researcher, or simply enthusiastic about advancing
+the field of software engineering with AI, there are many ways to get involved:

-### High Openness
+- **Code Contributions:** Help us develop new core functionality, improve our agents, improve the frontend and other
+interfaces, or anything else that would help make OpenHands better.
+- **Research and Evaluation:** Contribute to our understanding of LLMs in software engineering, participate in
+evaluating the models, or suggest improvements.
+- **Feedback and Testing:** Use the OpenHands toolset, report bugs, suggest features, or provide feedback on usability.

-We welcome anyone and everyone into our community by default. You don’t have to be a software developer to help us build. You don’t have to be pro-AI to help us learn.
+For details, please check [CONTRIBUTING.md](./CONTRIBUTING.md).

-Our plans, our work, our successes, and our failures are all public record. We want the world to see not just the fruits of our work, but the whole process of growing it.
+## Code of Conduct

-We welcome thoughtful criticism, whether it’s a comment on a PR or feedback on the community as a whole.
+We have a [Code of Conduct](./CODE_OF_CONDUCT.md) that we expect all contributors to adhere to.
+Long story short, we are aiming for an open, welcoming, diverse, inclusive, and healthy community.
+All contributors are expected to contribute to building this sort of community.

-### High Agency
+## 🛠️ Becoming a Maintainer

-Everyone should feel empowered to contribute to OpenHands. Whether it’s by making a PR, hosting an event, sharing feedback, or just asking a question, don’t hold back!
+For contributors who have made significant and sustained contributions to the project, there is a possibility of joining
+the maintainer team. The process for this is as follows:

-OpenHands gives everyone the building blocks to create state-of-the-art developer experiences. We experiment constantly and love building new things.
+1. Any contributor who has made sustained and high-quality contributions to the codebase can be nominated by any
+maintainer. If you feel that you may qualify, you can reach out to any of the maintainers who have reviewed your PRs and ask if you can be nominated.
+2. Once a maintainer nominates a new maintainer, there will be a discussion period among the maintainers for at least 3 days.
+3. If no concerns are raised, the nomination will be accepted by acclamation; if concerns are raised, there will be a discussion and possibly a vote.

-Coding, development practices, and communities are changing rapidly. We won’t hesitate to change direction and make big bets.
-
-## Relationship to All Hands
-
-OpenHands is supported by the for-profit organization [All Hands AI, Inc](https://www.all-hands.dev/).
-
-All Hands was founded by three of the first major contributors to OpenHands:
-
-- Xingyao Wang, a UIUC PhD candidate who got OpenHands to the top of the SWE-bench leaderboards
-- Graham Neubig, a CMU Professor who rallied the academic community around OpenHands
-- Robert Brennan, a software engineer who architected the user-facing features of OpenHands
-
-All Hands is an important part of the OpenHands ecosystem. We’ve raised over $20M--mainly to hire developers and researchers who can work on OpenHands full-time, and to provide them with expensive infrastructure. ([Join us!](https://allhandsai.applytojob.com/apply/))
-
-But we see OpenHands as much larger, and ultimately more important, than All Hands. When our financial responsibility to investors is at odds with our social responsibility to the community—as it inevitably will be, from time to time—we promise to navigate that conflict thoughtfully and transparently.
-
-At some point, we may transfer custody of OpenHands to an open source foundation. But for now, the [Benevolent Dictator approach](http://www.catb.org/~esr/writings/cathedral-bazaar/homesteading/ar01s16.html) helps us move forward with speed and intention. If we ever forget the “benevolent” part, please: fork us.
+Note that just making many PRs does not immediately imply that you will become a maintainer. We will be looking
+at sustained high-quality contributions over a period of time, as well as good teamwork and adherence to our [Code of Conduct](./CODE_OF_CONDUCT.md).
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -58,7 +58,7 @@ by implementing the [interface specified here](https://github.com/OpenHands/Open

 #### Testing
 When you write code, it is also good to write tests. Please navigate to the [`./tests`](./tests) folder to see existing test suites.
-At the moment, we have these kinds of tests: [`unit`](./tests/unit), [`runtime`](./tests/runtime), and [`end-to-end (e2e)`](./tests/e2e). Please refer to the README for each test suite. These tests also run on GitHub's continuous integration to ensure quality of the project.
+At the moment, we have two kinds of tests: [`unit`](./tests/unit) and [`integration`](./evaluation/integration_tests). Please refer to the README for each test suite. These tests also run on GitHub's continuous integration to ensure quality of the project.

 ## Sending Pull Requests to OpenHands

--- a/Development.md
+++ b/Development.md
@@ -91,14 +91,14 @@ make run
 #### Option B: Individual Server Startup

 - **Start the Backend Server:** If you prefer, you can start the backend server independently to focus on
-  backend-related tasks or configurations.
+backend-related tasks or configurations.

  ```bash
  make start-backend
  ```

 - **Start the Frontend Server:** Similarly, you can start the frontend server on its own to work on frontend-related
-  components or interface enhancements.
+components or interface enhancements.
  ```bash
  make start-frontend
  ```
@@ -110,7 +110,6 @@ You can use OpenHands to develop and improve OpenHands itself! This is a powerfu
 #### Quick Start

 1. **Build and run OpenHands:**
-
   ```bash
   export INSTALL_DOCKER=0
   export RUNTIME=local
@@ -118,7 +117,6 @@ You can use OpenHands to develop and improve OpenHands itself! This is a powerfu
   ```

 2. **Access the interface:**
-
   - Local development: http://localhost:3001
   - Remote/cloud environments: Use the appropriate external URL

@@ -161,7 +159,7 @@ poetry run pytest ./tests/unit/test_*.py
 To reduce build time (e.g., if no changes were made to the client-runtime component), you can use an existing Docker
 container image by setting the SANDBOX_RUNTIME_CONTAINER_IMAGE environment variable to the desired Docker image.

-Example: `export SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/openhands/runtime:0.62-nikolaik`
+Example: `export SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/openhands/runtime:0.60-nikolaik`

 ## Develop inside Docker container

@@ -201,6 +199,6 @@ Here's a guide to the important documentation files in the repository:
 - [/containers/README.md](./containers/README.md): Information about Docker containers and deployment
 - [/tests/unit/README.md](./tests/unit/README.md): Guide to writing and running unit tests
 - [/evaluation/README.md](./evaluation/README.md): Documentation for the evaluation framework and benchmarks
-- [/skills/README.md](./skills/README.md): Information about the skills architecture and implementation
+- [/microagents/README.md](./microagents/README.md): Information about the microagents architecture and implementation
 - [/openhands/server/README.md](./openhands/server/README.md): Server implementation details and API documentation
 - [/openhands/runtime/README.md](./openhands/runtime/README.md): Documentation for the runtime environment and execution model
--- a/README.md
+++ b/README.md
@@ -1,18 +1,22 @@
 <a name="readme-top"></a>

 <div align="center">
-  <img src="https://raw.githubusercontent.com/OpenHands/docs/main/openhands/static/img/logo.png" alt="Logo" width="200">
-  <h1 align="center" style="border-bottom: none">OpenHands: AI-Driven Development</h1>
+  <img src="https://raw.githubusercontent.com/All-Hands-AI/docs/main/openhands/static/img/logo.png" alt="Logo" width="200">
+  <h1 align="center">OpenHands: Code Less, Make More</h1>
 </div>


 <div align="center">
-  <a href="https://github.com/OpenHands/OpenHands/blob/main/LICENSE"><img src="https://img.shields.io/badge/LICENSE-MIT-20B2AA?style=for-the-badge" alt="MIT License"></a>
-  <a href="https://docs.google.com/spreadsheets/d/1wOUdFCMyY6Nt0AIqF705KN4JKOWgeI4wUGUP60krXXs/edit?gid=811504672#gid=811504672"><img src="https://img.shields.io/badge/SWEBench-72.8-00cc00?logoColor=FFE165&style=for-the-badge" alt="Benchmark Score"></a>
+  <a href="https://github.com/OpenHands/OpenHands/graphs/contributors"><img src="https://img.shields.io/github/contributors/OpenHands/OpenHands?style=for-the-badge&color=blue" alt="Contributors"></a>
+  <a href="https://github.com/OpenHands/OpenHands/stargazers"><img src="https://img.shields.io/github/stars/OpenHands/OpenHands?style=for-the-badge&color=blue" alt="Stargazers"></a>
+  <a href="https://github.com/OpenHands/OpenHands/blob/main/LICENSE"><img src="https://img.shields.io/github/license/OpenHands/OpenHands?style=for-the-badge&color=blue" alt="MIT License"></a>
  <br/>
-  <a href="https://docs.openhands.dev/sdk"><img src="https://img.shields.io/badge/Documentation-000?logo=googledocs&logoColor=FFE165&style=for-the-badge" alt="Check out the documentation"></a>
-  <a href="https://arxiv.org/abs/2511.03690"><img src="https://img.shields.io/badge/Paper-000?logoColor=FFE165&logo=arxiv&style=for-the-badge" alt="Tech Report"></a>
-
+  <a href="https://all-hands.dev/joinslack"><img src="https://img.shields.io/badge/Slack-Join%20Us-red?logo=slack&logoColor=white&style=for-the-badge" alt="Join our Slack community"></a>
+  <a href="https://github.com/OpenHands/OpenHands/blob/main/CREDITS.md"><img src="https://img.shields.io/badge/Project-Credits-blue?style=for-the-badge&color=FFE165&logo=github&logoColor=white" alt="Credits"></a>
+  <br/>
+  <a href="https://docs.all-hands.dev/usage/getting-started"><img src="https://img.shields.io/badge/Documentation-000?logo=googledocs&logoColor=FFE165&style=for-the-badge" alt="Check out the documentation"></a>
+  <a href="https://arxiv.org/abs/2407.16741"><img src="https://img.shields.io/badge/Paper%20on%20Arxiv-000?logoColor=FFE165&logo=arxiv&style=for-the-badge" alt="Paper on Arxiv"></a>
+  <a href="https://docs.google.com/spreadsheets/d/1wOUdFCMyY6Nt0AIqF705KN4JKOWgeI4wUGUP60krXXs/edit?gid=0#gid=0"><img src="https://img.shields.io/badge/Benchmark%20score-000?logoColor=FFE165&logo=huggingface&style=for-the-badge" alt="Evaluation Benchmark Score"></a>

  <!-- Keep these links. Translations will automatically update with the README. -->
  <a href="https://www.readme-i18n.com/OpenHands/OpenHands?lang=de">Deutsch</a> |
@@ -24,63 +28,157 @@
  <a href="https://www.readme-i18n.com/OpenHands/OpenHands?lang=ru">Русский</a> |
  <a href="https://www.readme-i18n.com/OpenHands/OpenHands?lang=zh">中文</a>

+  <hr>
 </div>

-<hr>
+Welcome to OpenHands (formerly OpenDevin), a platform for software development agents powered by AI.

-🙌 Welcome to OpenHands, a [community](COMMUNITY.md) focused on AI-driven development. We’d love for you to [join us on Slack](https://dub.sh/openhands).
+OpenHands agents can do anything a human developer can: modify code, run commands, browse the web,
+call APIs, and yes—even copy code snippets from StackOverflow.

-There are a few ways to work with OpenHands:
+Learn more at [docs.all-hands.dev](https://docs.all-hands.dev), or [sign up for OpenHands Cloud](https://app.all-hands.dev) to get started.

-### OpenHands Software Agent SDK
-The SDK is a composable Python library that contains all of our agentic tech. It's the engine that powers everything else below.

-Define agents in code, then run them locally, or scale to 1000s of agents in the cloud.
+> [!IMPORTANT]
+> **Upcoming change**: We are renaming our GitHub Org from `All-Hands-AI` to `OpenHands` on October 20th, 2025.
+> Check the [tracking issue](https://github.com/All-Hands-AI/OpenHands/issues/11376) for more information.

-[Check out the docs](https://docs.openhands.dev/sdk) or [view the source](https://github.com/OpenHands/software-agent-sdk/)

-### OpenHands CLI
-The CLI is the easiest way to start using OpenHands. The experience will be familiar to anyone who has worked
-with e.g. Claude Code or Codex. You can power it with Claude, GPT, or any other LLM.
+> [!IMPORTANT]
+> Using OpenHands for work? We'd love to chat! Fill out
+> [this short form](https://docs.google.com/forms/d/e/1FAIpQLSet3VbGaz8z32gW9Wm-Grl4jpt5WgMXPgJ4EDPVmCETCBpJtQ/viewform)
+> to join our Design Partner program, where you'll get early access to commercial features and the opportunity to provide input on our product roadmap.

-[Check out the docs](https://docs.openhands.dev/openhands/usage/run-openhands/cli-mode) or [view the source](https://github.com/OpenHands/OpenHands-CLI)
+## ☁️ OpenHands Cloud
+The easiest way to get started with OpenHands is on [OpenHands Cloud](https://app.all-hands.dev),
+which comes with $20 in free credits for new users.

-### OpenHands Local GUI
-Use the Local GUI for running agents on your laptop. It comes with a REST API and a single-page React application.
-The experience will be familiar to anyone who has used Devin or Jules.
+## 💻 Running OpenHands Locally

-[Check out the docs](https://docs.openhands.dev/openhands/usage/run-openhands/local-setup) or view the source in this repo.
+### Option 1: CLI Launcher (Recommended)

-### OpenHands Cloud
-This is a deployment of OpenHands GUI, running on hosted infrastructure.
+The easiest way to run OpenHands locally is using the CLI launcher with [uv](https://docs.astral.sh/uv/). This provides better isolation from your current project's virtual environment and is required for OpenHands' default MCP servers.

-You can try it with a free $10 credit by [signing in with your GitHub account](https://app.all-hands.dev).
+**Install uv** (if you haven't already):

-OpenHands Cloud comes with source-available features and integrations:
-- Integrations with Slack, Jira, and Linear
-- Multi-user support
-- RBAC and permissions
-- Collaboration features (e.g., conversation sharing)
+See the [uv installation guide](https://docs.astral.sh/uv/getting-started/installation/) for the latest installation instructions for your platform.

-### OpenHands Enterprise
-Large enterprises can work with us to self-host OpenHands Cloud in their own VPC, via Kubernetes.
-OpenHands Enterprise can also work with the CLI and SDK above.
+**Launch OpenHands**:
+```bash
+# Launch the GUI server
+uvx --python 3.12 openhands serve

-OpenHands Enterprise is source-available--you can see all the source code here in the enterprise/ directory,
-but you'll need to purchase a license if you want to run it for more than one month.
+# Or launch the CLI
+uvx --python 3.12 openhands
+```

-Enterprise contracts also come with extended support and access to our research team.
+You'll find OpenHands running at [http://localhost:3000](http://localhost:3000) (for GUI mode)!

-Learn more at [openhands.dev/enterprise](https://openhands.dev/enterprise)
+### Option 2: Docker

-### Everything Else
+<details>
+<summary>Click to expand Docker command</summary>

-Check out our [Product Roadmap](https://github.com/orgs/openhands/projects/1), and feel free to
-[open up an issue](https://github.com/OpenHands/OpenHands/issues) if there's something you'd like to see!
+You can also run OpenHands directly with Docker:

-You might also be interested in our [evaluation infrastructure](https://github.com/OpenHands/benchmarks), our [chrome extension](https://github.com/OpenHands/openhands-chrome-extension/), or our [Theory-of-Mind module](https://github.com/OpenHands/ToM-SWE).
+```bash
+docker pull docker.openhands.dev/openhands/runtime:0.60-nikolaik

-All our work is available under the MIT license, except for the `enterprise/` directory in this repository (see the [enterprise license](enterprise/LICENSE) for details).
-The core `openhands` and `agent-server` Docker images are fully MIT-licensed as well.
+docker run -it --rm --pull=always \
+    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.openhands.dev/openhands/runtime:0.60-nikolaik \
+    -e LOG_ALL_EVENTS=true \
+    -v /var/run/docker.sock:/var/run/docker.sock \
+    -v ~/.openhands:/.openhands \
+    -p 3000:3000 \
+    --add-host host.docker.internal:host-gateway \
+    --name openhands-app \
+    docker.openhands.dev/openhands/openhands:0.60
+```

-If you need help with anything, or just want to chat, [come find us on Slack](https://dub.sh/openhands).
+</details>
+
+> **Note**: If you used OpenHands before version 0.44, you may want to run `mv ~/.openhands-state ~/.openhands` to migrate your conversation history to the new location.
+
+> [!WARNING]
+> On a public network? See our [Hardened Docker Installation Guide](https://docs.all-hands.dev/usage/runtimes/docker#hardened-docker-installation)
+> to secure your deployment by restricting network binding and implementing additional security measures.
+
+### Getting Started
+
+When you open the application, you'll be asked to choose an LLM provider and add an API key.
+[Anthropic's Claude Sonnet 4.5](https://www.anthropic.com/api) (`anthropic/claude-sonnet-4-5-20250929`)
+works best, but you have [many options](https://docs.all-hands.dev/usage/llms).
+
+See the [Running OpenHands](https://docs.all-hands.dev/usage/installation) guide for
+system requirements and more information.
+
+## 💡 Other ways to run OpenHands
+
+> [!WARNING]
+> OpenHands is meant to be run by a single user on their local workstation.
+> It is not appropriate for multi-tenant deployments where multiple users share the same instance. There is no built-in authentication, isolation, or scalability.
+>
+> If you're interested in running OpenHands in a multi-tenant environment, check out the source-available, commercially-licensed
+> [OpenHands Cloud Helm Chart](https://github.com/openHands/OpenHands-cloud)
+
+You can [connect OpenHands to your local filesystem](https://docs.all-hands.dev/usage/runtimes/docker#connecting-to-your-filesystem),
+interact with it via a [friendly CLI](https://docs.all-hands.dev/usage/how-to/cli-mode),
+run OpenHands in a scriptable [headless mode](https://docs.all-hands.dev/usage/how-to/headless-mode),
+or run it on tagged issues with [a GitHub Action](https://docs.all-hands.dev/usage/how-to/github-action).
+
+Visit [Running OpenHands](https://docs.all-hands.dev/usage/installation) for more information and setup instructions.
+
+If you want to modify the OpenHands source code, check out [Development.md](https://github.com/OpenHands/OpenHands/blob/main/Development.md).
+
+Having issues? The [Troubleshooting Guide](https://docs.all-hands.dev/usage/troubleshooting) can help.
+
+## 📖 Documentation
+
+To learn more about the project, and for tips on using OpenHands,
+check out our [documentation](https://docs.all-hands.dev/usage/getting-started).
+
+There you'll find resources on how to use different LLM providers,
+troubleshooting resources, and advanced configuration options.
+
+## 🤝 How to Join the Community
+
+OpenHands is a community-driven project, and we welcome contributions from everyone. We do most of our communication
+through Slack, so this is the best place to start, but we are also happy to have you contact us on GitHub:
+
+- [Join our Slack workspace](https://all-hands.dev/joinslack) - Here we talk about research, architecture, and future development.
+- [Read or post GitHub Issues](https://github.com/OpenHands/OpenHands/issues) - Check out the issues we're working on, or add your own ideas.
+
+See more about the community in [COMMUNITY.md](./COMMUNITY.md) or find details on contributing in [CONTRIBUTING.md](./CONTRIBUTING.md).
+
+## 📈 Progress
+
+See the monthly OpenHands roadmap [here](https://github.com/orgs/OpenHands/projects/1) (updated at the maintainers' meeting at the end of each month).
+
+<p align="center">
+  <a href="https://star-history.com/#OpenHands/OpenHands&Date">
+    <img src="https://api.star-history.com/svg?repos=OpenHands/OpenHands&type=Date" width="500" alt="Star History Chart">
+  </a>
+</p>
+
+## 📜 License
+
+Distributed under the MIT License, with the exception of the `enterprise/` folder. See [`LICENSE`](./LICENSE) for more information.
+
+## 🙏 Acknowledgements
+
+OpenHands is built by a large number of contributors, and every contribution is greatly appreciated! We also build upon other open source projects, and we are deeply thankful for their work.
+
+For a list of open source projects and licenses used in OpenHands, please see our [CREDITS.md](./CREDITS.md) file.
+
+## 📚 Cite
+
+```
+@inproceedings{
+  wang2025openhands,
+  title={OpenHands: An Open Platform for {AI} Software Developers as Generalist Agents},
+  author={Xingyao Wang and Boxuan Li and Yufan Song and Frank F. Xu and Xiangru Tang and Mingchen Zhuge and Jiayi Pan and Yueqi Song and Bowen Li and Jaskirat Singh and Hoang H. Tran and Fuqiang Li and Ren Ma and Mingzhang Zheng and Bill Qian and Yanjun Shao and Niklas Muennighoff and Yizhe Zhang and Binyuan Hui and Junyang Lin and Robert Brennan and Hao Peng and Heng Ji and Graham Neubig},
+  booktitle={The Thirteenth International Conference on Learning Representations},
+  year={2025},
+  url={https://openreview.net/forum?id=OJd3ayDDoF}
+}
+```
--- a/containers/app/Dockerfile
+++ b/containers/app/Dockerfile
@@ -73,7 +73,7 @@ ENV VIRTUAL_ENV=/app/.venv \

 COPY --chown=openhands:openhands --chmod=770 --from=backend-builder ${VIRTUAL_ENV} ${VIRTUAL_ENV}

-COPY --chown=openhands:openhands --chmod=770 ./skills ./skills
+COPY --chown=openhands:openhands --chmod=770 ./microagents ./microagents
 COPY --chown=openhands:openhands --chmod=770 ./openhands ./openhands
 COPY --chown=openhands:openhands --chmod=777 ./openhands/runtime/plugins ./openhands/runtime/plugins
 COPY --chown=openhands:openhands pyproject.toml poetry.lock README.md MANIFEST.in LICENSE ./
--- a/containers/dev/README.md
+++ b/containers/dev/README.md
@@ -1,7 +1,7 @@
 # Develop in Docker

 > [!WARNING]
-> This way of running OpenHands is not officially supported. It is maintained by the community and may not work.
+> This is not officially supported and may not work.

 Install [Docker](https://docs.docker.com/engine/install/) on your host machine and run:

--- a/containers/dev/compose.yml
+++ b/containers/dev/compose.yml
@@ -12,7 +12,7 @@ services:
      - SANDBOX_API_HOSTNAME=host.docker.internal
      - DOCKER_HOST_ADDR=host.docker.internal
      #
-      - SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-ghcr.io/openhands/runtime:0.62-nikolaik}
+      - SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-ghcr.io/openhands/runtime:0.60-nikolaik}
      - SANDBOX_USER_ID=${SANDBOX_USER_ID:-1234}
      - WORKSPACE_MOUNT_PATH=${WORKSPACE_BASE:-$PWD/workspace}
    ports:
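The `${SANDBOX_RUNTIME_CONTAINER_IMAGE:-...}` form in the hunk above is Compose's shell-style default interpolation: the value after `:-` is used when the variable is unset or empty. A minimal Python sketch of that rule follows. The helper name `resolve_with_default` and the `example.invalid` image are hypothetical and only illustrate the semantics; they are not part of Docker Compose or OpenHands.

```python
from typing import Mapping


def resolve_with_default(env: Mapping[str, str], name: str, default: str) -> str:
    """Mimic `${NAME:-default}`: use the default when NAME is unset or empty."""
    value = env.get(name)
    return value if value else default


VAR = "SANDBOX_RUNTIME_CONTAINER_IMAGE"
DEFAULT_IMAGE = "ghcr.io/openhands/runtime:0.60-nikolaik"  # default from the hunk above

# Unset and empty values both fall back to the default; an explicit value wins.
unset_image = resolve_with_default({}, VAR, DEFAULT_IMAGE)
empty_image = resolve_with_default({VAR: ""}, VAR, DEFAULT_IMAGE)
custom_image = resolve_with_default({VAR: "example.invalid/my-runtime:dev"}, VAR, DEFAULT_IMAGE)

print(unset_image)
print(empty_image)
print(custom_image)
```

In practice this means the version bump in the hunk only affects developers who have not exported `SANDBOX_RUNTIME_CONTAINER_IMAGE` themselves.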
--- a/dev_config/python/.pre-commit-config.yaml
+++ b/dev_config/python/.pre-commit-config.yaml
@@ -3,9 +3,9 @@ repos:
    rev: v5.0.0
    hooks:
      - id: trailing-whitespace
-        exclude: ^(docs/|modules/|python/|openhands-ui/|third_party/|enterprise/)
+        exclude: ^(docs/|modules/|python/|openhands-ui/|third_party/|enterprise/|openhands-cli/)
      - id: end-of-file-fixer
-        exclude: ^(docs/|modules/|python/|openhands-ui/|third_party/|enterprise/)
+        exclude: ^(docs/|modules/|python/|openhands-ui/|third_party/|enterprise/|openhands-cli/)
      - id: check-yaml
        args: ["--allow-multiple-documents"]
      - id: debug-statements
@@ -28,12 +28,12 @@ repos:
        entry: ruff check --config dev_config/python/ruff.toml
        types_or: [python, pyi, jupyter]
        args: [--fix, --unsafe-fixes]
-        exclude: ^(third_party/|enterprise/)
+        exclude: ^(third_party/|enterprise/|openhands-cli/)
      # Run the formatter.
      - id: ruff-format
        entry: ruff format --config dev_config/python/ruff.toml
        types_or: [python, pyi, jupyter]
-        exclude: ^(third_party/|enterprise/)
+        exclude: ^(third_party/|enterprise/|openhands-cli/)

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.15.0
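pre-commit treats `exclude` as a Python regular expression and applies it with `re.search` to each repository-relative path, so the `^(...)` alternation in the hunks above skips whole top-level directories. The sketch below shows the effect of adding `openhands-cli/` to the ruff pattern. The sample paths and the helper `is_excluded` are made up for illustration and are not taken from the repository.

```python
import re

# Patterns copied from the ruff hooks before and after this change.
OLD_EXCLUDE = re.compile(r"^(third_party/|enterprise/)")
NEW_EXCLUDE = re.compile(r"^(third_party/|enterprise/|openhands-cli/)")


def is_excluded(pattern: re.Pattern, path: str) -> bool:
    """Return True when the hook would skip this path."""
    return pattern.search(path) is not None


# Hypothetical paths, chosen to exercise each branch of the pattern.
SAMPLE_PATHS = [
    "openhands-cli/openhands_cli/agent_chat.py",
    "enterprise/server/app.py",
    "openhands/core/main.py",
    "docs/openhands-cli/usage.md",
]

# Paths that were linted before the change and are skipped after it.
newly_skipped = [
    path
    for path in SAMPLE_PATHS
    if is_excluded(NEW_EXCLUDE, path) and not is_excluded(OLD_EXCLUDE, path)
]
print(newly_skipped)
```

Because the pattern is anchored with `^`, a nested directory such as `docs/openhands-cli/` is not matched by the new alternative.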
--- a/doc/design-doc/custom-agent-s1.md
+++ b/doc/design-doc/custom-agent-s1.md
@@ -1,836 +0,0 @@
-# Custom Agent Packages with Custom Runtime Images (Scenario 1)
-
-## 1. Introduction
-
-### 1.1 Problem Statement
-
-OpenHands currently supports agent customization through the software-agent-sdk, but users who need custom system dependencies, specialized tools, or non-Python runtime environments cannot easily deploy their agents. The current V1 architecture uses a fixed agent server image (`ghcr.io/openhands/agent-server:5f62cee-python`) that may not contain the required dependencies for specialized agents.
-
-Some agents require:
-- Custom system packages (e.g., specialized compilers, databases, ML frameworks)
-- Non-Python tools and runtimes (e.g., Node.js, Go, Rust toolchains)
-- Custom Docker base images with specific OS configurations
-- Proprietary or licensed software installations
-
-Users building such agents currently have no supported path to deploy them to OpenHands Enterprise.
-
-### 1.2 Proposed Solution
-
-We propose extending the existing **Sandbox Specification System** to support custom agent runtime images with proper permissions and security controls. This approach builds directly on OpenHands' current sandbox infrastructure rather than creating parallel systems.
-
-Users will be able to:
-1. Create custom Docker images containing their agent code and dependencies
-2. Register these images as enhanced sandbox specifications with rich metadata
-3. Deploy conversations using their custom sandbox specs (with proper permissions)
-4. Maintain full compatibility with existing sandbox management and API infrastructure
-
-The solution extends the current `SandboxSpecService` with:
-- **Permission-based access control** to limit custom specs to authorized users
-- **Enhanced sandbox specifications** that include agent-specific metadata and requirements
-- **Secure image management** with validation and approval workflows
-- **Integrated deployment** through existing conversation creation APIs
-
-**Trade-offs**: This approach requires users to build and maintain Docker images, increasing complexity compared to simple Python package deployment. However, it provides the necessary isolation and dependency management for complex agent requirements while leveraging proven sandbox infrastructure.
-
-## 2. User Interface
-
-### 2.1 Custom Agent Image Creation
-
-Users create a custom agent image by extending the base agent server image:
-
-```dockerfile
-# Dockerfile for custom agent
-FROM ghcr.io/openhands/agent-server:5f62cee-python
-
-# Install custom system dependencies
-RUN apt-get update && apt-get install -y \
-    nodejs \
-    npm \
-    golang-go \
-    && rm -rf /var/lib/apt/lists/*
-
-# Install custom Python packages
-COPY requirements.txt /tmp/
-RUN pip install -r /tmp/requirements.txt
-
-# Copy custom agent code
-COPY my_custom_agent/ /app/my_custom_agent/
-COPY agent_config.json /app/config/
-
-# Set custom agent as default
-ENV CUSTOM_AGENT_MODULE=my_custom_agent
-ENV CUSTOM_AGENT_CLASS=MySpecializedAgent
-```
-
-### 2.2 Enhanced Sandbox Spec Registration
-
-Users register their custom agent image as an enhanced sandbox specification:
-
-```yaml
-# enhanced-sandbox-spec.yaml
-apiVersion: openhands.ai/v1
-kind: SandboxSpec
-metadata:
-  name: specialized-ml-agent
-  version: "1.0.0"
-  owner: user@company.com
-  permissions:
-    users: ["user@company.com", "team-lead@company.com"]
-    groups: ["ml-team", "data-science"]
-spec:
-  image: "myregistry/specialized-ml-agent:v1.0.0"
-  description: "ML agent with TensorFlow and custom data processing tools"
-  # Agent-specific metadata
-  agent:
-    capabilities:
-      - machine_learning
-      - data_analysis
-      - custom_visualization
-    type: "custom"
-    module: "agents.specialized_ml_agent"
-    class: "SpecializedMLAgent"
-  requirements:
-    memory: "4Gi"
-    cpu: "2"
-  environment:
-    TENSORFLOW_VERSION: "2.15.0"
-    CUSTOM_MODEL_PATH: "/app/models"
-    # Agent server configuration
-    CUSTOM_AGENT_MODULE: "agents.specialized_ml_agent"
-    CUSTOM_AGENT_CLASS: "SpecializedMLAgent"
-  ports:
-    - name: agent-server
-      port: 8000
-    - name: tensorboard
-      port: 6006
-```
-
-### 2.3 Conversation Creation with Custom Sandbox Spec
-
-Users create conversations using their custom sandbox specs through the existing API:
-
-```bash
-# Create conversation with custom sandbox spec
-curl -X POST "https://api.openhands.ai/api/conversations" \
-  -H "Authorization: Bearer $API_KEY" \
-  -H "Content-Type: application/json" \
-  -d '{
-    "sandbox_spec_id": "specialized-ml-agent:v1.0.0",
-    "initial_message": "Analyze this dataset and create a predictive model",
-    "workspace": {
-      "type": "local",
-      "working_dir": "/workspace/ml-project"
-    }
-  }'
-```
-
-### 2.4 Image Management Workflows
-
-#### 2.4.1 Pre-built Image Approach
-
-For organizations that want to manage custom agent images centrally:
-
-```bash
-# Admin registers pre-built image as sandbox spec
-curl -X POST "https://api.openhands.ai/api/sandbox-specs" \
-  -H "Authorization: Bearer $ADMIN_API_KEY" \
-  -H "Content-Type: application/json" \
-  -d '{
-    "name": "company-ml-agent",
-    "version": "1.0.0",
-    "image": "company-registry/ml-agent:v1.0.0",
-    "permissions": {
-      "groups": ["ml-team", "data-science"]
-    },
-    "agent": {
-      "type": "custom",
-      "capabilities": ["machine_learning", "data_analysis"]
-    }
-  }'
-```
-
-#### 2.4.2 User Upload Approach
-
-For users who want to upload their own custom images:
-
-```bash
-# User uploads custom image (with security validation)
-curl -X POST "https://api.openhands.ai/api/sandbox-specs/upload" \
-  -H "Authorization: Bearer $API_KEY" \
-  -F "dockerfile=@Dockerfile" \
-  -F "context=@agent-context.tar.gz" \
-  -F "spec=@sandbox-spec.yaml"
-```
-
-## 3. Other Context
-
-### 3.1 Current Sandbox Specification System
-
-OpenHands V1 uses a sandbox specification system to manage container deployments:
-
-- **Single Default Spec**: Currently only one sandbox spec exists, shared by all users
-- **SandboxSpecService**: Manages sandbox specifications and container creation
-- **SandboxSpecInfo**: Contains image, environment, and resource configuration
-- **No Permissions**: Current system lacks user-based access control
-
-The existing system provides the foundation but needs enhancement for custom agents:
-- **Permission Layer**: Required to control access to custom specs
-- **Rich Metadata**: Need agent-specific information beyond basic container config
-- **Image Management**: Need secure workflows for custom image registration
-
-### 3.2 Enhanced Sandbox Specification Architecture
-
-Our proposal extends the existing system with:
-
-#### 3.2.1 Permission-Based Access Control
-- **User Permissions**: Individual user access to specific sandbox specs
-- **Group Permissions**: Team-based access control for organizational specs
-- **Owner Management**: Spec ownership and delegation capabilities
-- **Admin Override**: Administrative access for spec management
-
-#### 3.2.2 Agent-Specific Metadata
-- **Agent Configuration**: Module, class, and capability information
-- **Resource Requirements**: Memory, CPU, and storage specifications
-- **Environment Variables**: Agent-specific configuration and secrets
-- **Port Mappings**: Additional ports for agent services (e.g., TensorBoard)
-
-#### 3.2.3 Image Management Integration
-- **Registry Support**: Integration with Docker registries for image storage
-- **Security Validation**: Image scanning and approval workflows
-- **Version Management**: Support for multiple versions of custom specs
-- **Build Integration**: Optional image building from Dockerfile uploads
-
-### 3.3 Existing Container Orchestration Integration
-
-The enhanced system leverages existing OpenHands infrastructure:
-
-- **Sandbox Service**: Extended to support permission checks and enhanced specs
-- **Container Management**: Same lifecycle management with additional metadata
-- **Network Isolation**: Maintains existing security boundaries
-- **Resource Enforcement**: Enhanced with custom resource requirements
-- **Health Monitoring**: Extended to track custom agent-specific metrics
-
-## 4. Technical Design
-
-### 4.1 Enhanced Sandbox Specification Model
-
-#### 4.1.1 Extended SandboxSpecInfo Structure
-
-The existing `SandboxSpecInfo` model is enhanced to support custom agents:
-
-```python
-# openhands/app_server/sandbox/sandbox_spec_models.py (enhanced)
-from pydantic import BaseModel, Field
-from typing import Any, Dict, List, Optional
-
-class AgentMetadata(BaseModel):
-    """Agent-specific metadata for custom agents."""
-    type: str = Field(default="default", description="Agent type (default|custom)")
-    capabilities: List[str] = Field(default_factory=list, description="Agent capabilities")
-    # Optional fields carry an explicit default so the model can be built with no arguments
-    module: Optional[str] = Field(default=None, description="Python module containing agent class")
-    class_name: Optional[str] = Field(default=None, description="Agent class name")
-
-class PermissionSpec(BaseModel):
-    """Permission specification for sandbox spec access."""
-    users: List[str] = Field(default_factory=list, description="Authorized user emails")
-    groups: List[str] = Field(default_factory=list, description="Authorized group names")
-    owner: Optional[str] = Field(default=None, description="Spec owner")
-
-class EnhancedSandboxSpecInfo(BaseModel):
-    """Enhanced sandbox specification with agent metadata and permissions."""
-
-    # Existing fields from SandboxSpecInfo
-    id: str = Field(description="Docker image identifier")
-    command: List[str] = Field(default_factory=lambda: ['--port', '8000'])
-    initial_env: Dict[str, str] = Field(default_factory=dict)
-    working_dir: str = Field(default="/workspace/project")
-
-    # Enhanced fields
-    name: str = Field(description="Human-readable spec name")
-    version: str = Field(description="Spec version")
-    description: Optional[str] = Field(default=None, description="Spec description")
-
-    # Agent-specific metadata
-    agent: AgentMetadata = Field(default_factory=AgentMetadata)
-
-    # Permission and access control
-    permissions: PermissionSpec = Field(default_factory=PermissionSpec)
-
-    # Resource requirements
-    memory_limit: Optional[str] = Field(default=None, description="Memory limit (e.g., '4Gi')")
-    cpu_limit: Optional[str] = Field(default=None, description="CPU limit (e.g., '2')")
-
-    # Additional ports for custom services
-    ports: List[Dict[str, Any]] = Field(
-        default_factory=lambda: [{"name": "agent-server", "port": 8000}]
-    )
-```
-
-#### 4.1.2 Custom Agent Image Structure
-
-Custom agent images extend the base agent server with this structure:
-
-```
-/app/
-├── config/
-│   ├── agent_config.json          # Agent configuration
-│   └── tool_registry.json         # Custom tool definitions (optional)
-├── agents/
-│   └── custom_agent.py            # Agent implementation
-├── tools/                         # Custom tools (optional)
-│   ├── __init__.py
-│   └── custom_tools.py
-└── startup/
-    └── init_agent.py              # Agent initialization script
-```
-
-### 4.2 Agent Implementation Interface
-
-#### 4.2.1 Custom Agent Base Class
-
-```python
-# agents/custom_agent.py
-from openhands.sdk.agent.base import AgentBase
-from openhands.sdk.llm import LLM
-from openhands.sdk.tool import Tool
-from typing import List, Dict, Any
-
-class SpecializedMLAgent(AgentBase):
-    """Custom ML agent with TensorFlow capabilities."""
-
-    def __init__(
-        self,
-        llm: LLM,
-        tools: List[Tool],
-        config: Dict[str, Any] = None
-    ):
-        super().__init__(llm=llm, tools=tools)
-        self.config = config or {}
-        self.model_cache = self.config.get('MODEL_CACHE_DIR', '/app/models')
-
-    async def initialize(self) -> None:
-        """Initialize custom agent resources."""
-        # Load pre-trained models
-        await self._load_models()
-
-        # Initialize custom tools
-        await self._setup_custom_tools()
-
-    async def _load_models(self) -> None:
-        """Load TensorFlow models from cache."""
-        import tensorflow as tf
-        # Custom model loading logic
-        pass
-
-    async def _setup_custom_tools(self) -> None:
-        """Initialize custom tools with agent context."""
-        # Custom tool setup logic
-        pass
-```
-
-#### 4.2.2 Custom Tool Implementation
-
-```python
-# tools/custom_tools.py
-from openhands.sdk.tool import Tool, ToolExecutor, register_tool
-from openhands.sdk import Action, Observation
-from pydantic import Field
-from typing import Dict
-import tensorflow as tf
-
-class TensorFlowAnalysisAction(Action):
-    dataset_path: str = Field(description="Path to dataset file")
-    model_type: str = Field(description="Type of ML model to create")
-    target_column: str = Field(description="Target column for prediction")
-
-class TensorFlowAnalysisObservation(Observation):
-    model_accuracy: float = Field(description="Model accuracy score")
-    feature_importance: Dict[str, float] = Field(description="Feature importance scores")
-    model_path: str = Field(description="Path to saved model")
-
-class TensorFlowToolExecutor(ToolExecutor[TensorFlowAnalysisAction, TensorFlowAnalysisObservation]):
-    def __call__(self, action: TensorFlowAnalysisAction, conversation=None) -> TensorFlowAnalysisObservation:
-        # Custom TensorFlow analysis logic
-        model = self._create_model(action.dataset_path, action.model_type, action.target_column)
-        accuracy = self._evaluate_model(model)
-        importance = self._get_feature_importance(model)
-        model_path = self._save_model(model)
-
-        return TensorFlowAnalysisObservation(
-            model_accuracy=accuracy,
-            feature_importance=importance,
-            model_path=model_path
-        )
-
-# Register the custom tool
-register_tool(
-    Tool(
-        name="TensorFlowTool",
-        executor=TensorFlowToolExecutor(),
-        definition=ToolDefinition(
-            name="tensorflow_analysis",
-            description="Perform machine learning analysis using TensorFlow",
-            parameters=TensorFlowAnalysisAction.model_json_schema()
-        )
-    )
-)
-```
-
-### 4.3 Runtime Integration
-
-#### 4.3.1 Custom Agent Loader
-
-```python
-# startup/init_agent.py
-import json
-import importlib
-from pathlib import Path
-from typing import List
-from openhands.sdk.agent.base import AgentBase
-from openhands.sdk.llm import LLM
-from openhands.sdk.tool import Tool, resolve_tool
-
-class CustomAgentLoader:
-    """Loads custom agents from configuration."""
-
-    def __init__(self, config_path: str = "/app/config/agent_config.json"):
-        self.config_path = Path(config_path)
-        self.config = self._load_config()
-
-    def _load_config(self) -> dict:
-        """Load agent configuration from JSON file."""
-        with open(self.config_path) as f:
-            return json.load(f)
-
-    def create_agent(self, llm: LLM) -> AgentBase:
-        """Create custom agent instance."""
-        agent_config = self.config["agent"]
-
-        # Import custom agent class
-        module = importlib.import_module(agent_config["module"])
-        agent_class = getattr(module, agent_config["class"])
-
-        # Load custom tools
-        tools = self._load_tools()
-
-        # Create agent instance
-        agent = agent_class(
-            llm=llm,
-            tools=tools,
-            config=self.config.get("environment", {})
-        )
-
-        return agent
-
-    def _load_tools(self) -> List[Tool]:
-        """Load and resolve custom tools."""
-        tools = []
-        for tool_config in self.config.get("tools", []):
-            if "module" in tool_config:
-                # Import custom tool module to register it
-                importlib.import_module(tool_config["module"])
-
-            tool = resolve_tool(tool_config["name"])
-            tools.append(tool)
-
-        return tools
-```
-
-#### 4.3.2 Agent Server Startup Integration
-
-```python
-# Modified agent server startup in software-agent-sdk
-import os
-from openhands.agent_server.api import app
-from openhands.agent_server.conversation_service import ConversationService
-from startup.init_agent import CustomAgentLoader
-
-@app.on_event("startup")
-async def startup_event():
-    """Initialize custom agent during server startup."""
-
-    # Check for custom agent configuration
-    custom_agent_module = os.getenv('CUSTOM_AGENT_MODULE')
-    custom_agent_class = os.getenv('CUSTOM_AGENT_CLASS')
-
-    if custom_agent_module and custom_agent_class:
-        # Load custom agent
-        loader = CustomAgentLoader()
-        app.state.agent_factory = loader.create_agent
-        print(f"Loaded custom agent: {custom_agent_class}")
-    else:
-        # Use default agent
-        from openhands.sdk.agent import Agent
-        app.state.agent_factory = lambda llm: Agent(llm=llm, tools=get_default_tools())
-        print("Using default OpenHands agent")
-```
-
-### 4.4 Enhanced Sandbox Service Integration
-
-#### 4.4.1 Permission-Aware Sandbox Service
-
-```python
-# openhands/app_server/sandbox/enhanced_sandbox_spec_service.py
-from openhands.app_server.sandbox.sandbox_spec_service import SandboxSpecService
-from openhands.app_server.sandbox.sandbox_spec_models import SandboxSpecInfo, EnhancedSandboxSpecInfo
-from typing import Dict, List, Optional
-
-class EnhancedSandboxSpecService(SandboxSpecService):
-    """Enhanced sandbox service with permissions and custom agent support."""
-
-    def __init__(self, spec_registry: Dict[str, EnhancedSandboxSpecInfo]):
-        super().__init__()
-        self.spec_registry = spec_registry
-
-    def get_available_sandbox_specs(self, user_email: str, user_groups: List[str]) -> List[str]:
-        """Get sandbox specs available to the user based on permissions."""
-        available_specs = []
-        
-        for spec_key, spec in self.spec_registry.items():
-            if self._has_permission(spec, user_email, user_groups):
-                available_specs.append(spec_key)
-        
-        return available_specs
-
-    def get_sandbox_spec_by_id(
-        self, 
-        spec_id: str, 
-        user_email: str, 
-        user_groups: List[str]
-    ) -> SandboxSpecInfo:
-        """Get sandbox spec by ID with permission check."""
-        
-        if spec_id not in self.spec_registry:
-            # Fall back to default specs for backward compatibility
-            return super().get_default_sandbox_specs()[0]
-        
-        enhanced_spec = self.spec_registry[spec_id]
-        
-        # Check permissions
-        if not self._has_permission(enhanced_spec, user_email, user_groups):
-            raise PermissionError(f"User {user_email} does not have access to spec {spec_id}")
-        
-        # Convert to SandboxSpecInfo for existing infrastructure
-        return self._convert_to_sandbox_spec_info(enhanced_spec)
-
-    def _has_permission(
-        self, 
-        spec: EnhancedSandboxSpecInfo, 
-        user_email: str, 
-        user_groups: List[str]
-    ) -> bool:
-        """Check if user has permission to use the sandbox spec."""
-        
-        # Owner always has access
-        if spec.permissions.owner == user_email:
-            return True
-        
-        # Check user permissions
-        if user_email in spec.permissions.users:
-            return True
-        
-        # Check group permissions
-        for group in user_groups:
-            if group in spec.permissions.groups:
-                return True
-        
-        return False
-
-    def _convert_to_sandbox_spec_info(self, enhanced_spec: EnhancedSandboxSpecInfo) -> SandboxSpecInfo:
-        """Convert enhanced spec to standard SandboxSpecInfo."""
-        
-        # Build environment variables including agent configuration
-        env_vars = {
-            'OPENVSCODE_SERVER_ROOT': '/openhands/.openvscode-server',
-            'OH_ENABLE_VNC': '0',
-            'LOG_JSON': 'true',
-            'OH_CONVERSATIONS_PATH': '/workspace/conversations',
-            'OH_BASH_EVENTS_DIR': '/workspace/bash_events',
-            'PYTHONUNBUFFERED': '1',
-            'ENV_LOG_LEVEL': '20',
-            **enhanced_spec.initial_env
-        }
-        
-        # Add custom agent configuration if specified
-        if enhanced_spec.agent.type == "custom":
-            env_vars.update({
-                'CUSTOM_AGENT_MODULE': enhanced_spec.agent.module,
-                'CUSTOM_AGENT_CLASS': enhanced_spec.agent.class_name,
-            })
-
-        return SandboxSpecInfo(
-            id=enhanced_spec.id,
-            command=enhanced_spec.command,
-            initial_env=env_vars,
-            working_dir=enhanced_spec.working_dir,
-        )
-
-    def register_sandbox_spec(
-        self, 
-        spec: EnhancedSandboxSpecInfo,
-        admin_user: str
-    ) -> str:
-        """Register a new sandbox spec (admin only)."""
-        
-        spec_key = f"{spec.name}:{spec.version}"
-        
-        # Validate spec
-        self._validate_sandbox_spec(spec)
-        
-        # Store in registry
-        self.spec_registry[spec_key] = spec
-        
-        return spec_key
-
-    def _validate_sandbox_spec(self, spec: EnhancedSandboxSpecInfo) -> None:
-        """Validate sandbox spec for security and correctness."""
-        
-        # Image validation
-        if not spec.id or not spec.id.strip():
-            raise ValueError("Image ID cannot be empty")
-        
-        # Permission validation
-        if not spec.permissions.owner:
-            raise ValueError("Sandbox spec must have an owner")
-        
-        # Agent validation for custom agents
-        if spec.agent.type == "custom":
-            if not spec.agent.module or not spec.agent.class_name:
-                raise ValueError("Custom agents must specify module and class_name")
-```
-
-### 4.5 Enhanced API Integration
-
-#### 4.5.1 Enhanced Conversation Creation
-
-```python
-# openhands/server/routes/conversation_routes.py (enhanced)
-from fastapi import APIRouter, HTTPException, Depends
-from pydantic import BaseModel
-from typing import Optional, Dict, Any, List
-from uuid import UUID
-
-from openhands.app_server.sandbox.enhanced_sandbox_spec_service import EnhancedSandboxSpecService
-from openhands.server.session.agent_session import AgentSession
-from openhands.server.auth import get_current_user, get_user_groups
-
-# Enhanced conversation creation request
-class CreateConversationRequest(BaseModel):
-    initial_message: str
-    workspace_config: Optional[Dict[str, Any]] = None
-    # New field for custom sandbox spec
-    sandbox_spec_id: Optional[str] = None
-
-@router.post("/conversations")
-async def create_conversation(
-    request: CreateConversationRequest,
-    current_user: str = Depends(get_current_user),
-    user_groups: List[str] = Depends(get_user_groups),
-    sandbox_service: EnhancedSandboxSpecService = Depends(get_enhanced_sandbox_service)
-) -> ConversationResponse:
-    """Create conversation with optional custom sandbox spec."""
-
-    try:
-        if request.sandbox_spec_id:
-            # Use custom sandbox spec with permission check
-            sandbox_spec = sandbox_service.get_sandbox_spec_by_id(
-                request.sandbox_spec_id, 
-                current_user, 
-                user_groups
-            )
-        else:
-            # Use default sandbox spec
-            sandbox_spec = sandbox_service.get_default_sandbox_specs()[0]
-
-        # Create sandbox and conversation
-        sandbox = await sandbox_service.create_sandbox(sandbox_spec)
-        await wait_for_agent_server_ready(sandbox)
-
-        conversation = await create_conversation_with_sandbox(
-            sandbox=sandbox,
-            initial_message=request.initial_message,
-            workspace_config=request.workspace_config
-        )
-
-        return ConversationResponse(
-            conversation_id=conversation.id,
-            status="created",
-            sandbox_spec_id=request.sandbox_spec_id or "default"
-        )
-
-    except PermissionError as e:
-        raise HTTPException(status_code=403, detail=str(e))
-    except ValueError as e:
-        raise HTTPException(status_code=404, detail=str(e))
-```
-
-#### 4.5.2 Sandbox Spec Management API
-
-```python
-# openhands/server/routes/sandbox_spec_routes.py (new)
-from fastapi import APIRouter, HTTPException, Depends, UploadFile, File
-from pydantic import BaseModel
-from typing import Dict, List, Optional
-import yaml
-
-from openhands.app_server.sandbox.enhanced_sandbox_spec_service import EnhancedSandboxSpecService
-from openhands.app_server.sandbox.sandbox_spec_models import EnhancedSandboxSpecInfo
-from openhands.server.auth import get_current_user, get_user_groups, require_admin
-
-router = APIRouter(prefix="/api/sandbox-specs", tags=["Sandbox Specs"])
-
-@router.get("/")
-async def list_available_sandbox_specs(
-    current_user: str = Depends(get_current_user),
-    user_groups: List[str] = Depends(get_user_groups),
-    sandbox_service: EnhancedSandboxSpecService = Depends(get_enhanced_sandbox_service)
-) -> List[str]:
-    """List sandbox specs available to the current user."""
-    
-    return sandbox_service.get_available_sandbox_specs(current_user, user_groups)
-
-@router.post("/")
-async def register_sandbox_spec(
-    spec_data: EnhancedSandboxSpecInfo,
-    current_user: str = Depends(require_admin),
-    sandbox_service: EnhancedSandboxSpecService = Depends(get_enhanced_sandbox_service)
-) -> Dict[str, str]:
-    """Register a new sandbox spec (admin only)."""
-    
-    try:
-        spec_key = sandbox_service.register_sandbox_spec(spec_data, current_user)
-        return {"spec_id": spec_key, "status": "registered"}
-    except ValueError as e:
-        raise HTTPException(status_code=400, detail=str(e))
-
-@router.post("/upload")
-async def upload_custom_image(
-    dockerfile: UploadFile = File(...),
-    context: UploadFile = File(...),
-    spec: UploadFile = File(...),
-    current_user: str = Depends(get_current_user),
-    sandbox_service: EnhancedSandboxSpecService = Depends(get_enhanced_sandbox_service)
-) -> Dict[str, str]:
-    """Upload custom image with Dockerfile and context (with security validation)."""
-    
-    try:
-        # Parse spec file
-        spec_content = await spec.read()
-        spec_data = yaml.safe_load(spec_content)
-        
-        # Validate user has permission to create specs
-        if not _can_user_create_specs(current_user):
-            raise HTTPException(status_code=403, detail="User not authorized to create custom specs")
-        
-        # Security validation of Dockerfile
-        dockerfile_content = await dockerfile.read()
-        _validate_dockerfile_security(dockerfile_content)
-        
-        # Build image (implementation depends on build system)
-        image_id = await _build_custom_image(dockerfile_content, context, current_user)
-        
-        # Create enhanced spec
-        enhanced_spec = EnhancedSandboxSpecInfo(**spec_data)
-        enhanced_spec.id = image_id
-        enhanced_spec.permissions.owner = current_user
-        
-        # Register the spec
-        spec_key = sandbox_service.register_sandbox_spec(enhanced_spec, current_user)
-        
-        return {"spec_id": spec_key, "image_id": image_id, "status": "uploaded"}
-        
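-    except HTTPException:
-        # Preserve the 403 raised above instead of rewrapping it as a 400
-        raise
-    except Exception as e:
-        raise HTTPException(status_code=400, detail=f"Upload failed: {str(e)}")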
-```
-
-## 5. Implementation Plan
-
-All implementation must pass existing lints and tests. New functionality requires comprehensive test coverage including unit tests, integration tests, and end-to-end scenarios.
-
-### 5.1 Enhanced Sandbox Models and Permissions (M1)
-
-#### 5.1.1 Enhanced Sandbox Specification Models
-
-* `openhands/app_server/sandbox/sandbox_spec_models.py` (enhanced)
-* `tests/unit/app_server/sandbox/test_enhanced_sandbox_spec_models.py`
-
-Extend existing `SandboxSpecInfo` with `EnhancedSandboxSpecInfo` including agent metadata, permissions, and resource requirements. This is the **core requirement** identified by the engineer.
-
-#### 5.1.2 Permission System Foundation
-
-* `openhands/server/auth/permissions.py`
-* `tests/unit/server/auth/test_permissions.py`
-
-Implement user and group-based permission system for sandbox spec access control. This addresses the **security concerns** from V0 mentioned by the engineer.
-
-**Demo**: Create enhanced sandbox specs with permission restrictions and verify access control works correctly.
-
-### 5.2 Enhanced Sandbox Service (M2)
-
-#### 5.2.1 Permission-Aware Sandbox Service
-
-* `openhands/app_server/sandbox/enhanced_sandbox_spec_service.py`
-* `tests/unit/app_server/sandbox/test_enhanced_sandbox_spec_service.py`
-
-Extend existing `SandboxSpecService` with permission checks and enhanced spec management. This **builds on existing infrastructure** as the engineer suggested.
-
-#### 5.2.2 Agent Server Startup Integration
-
-* `openhands-agent-server/openhands/agent_server/custom_agent_loader.py`
-* `tests/unit/agent_server/test_custom_agent_loader.py`
-
-Implement custom agent loading mechanism in agent server startup process with configuration-driven agent instantiation.
-
-**Demo**: Deploy custom agents using enhanced sandbox specs and verify permission-based access control works end-to-end.
-
-### 5.3 Image Management and API Integration (M3)
-
-#### 5.3.1 Secure Image Management
-
-* `openhands/app_server/sandbox/image_builder.py`
-* `openhands/app_server/security/dockerfile_validator.py`
-* `tests/unit/app_server/sandbox/test_image_builder.py`
-* `tests/unit/app_server/security/test_dockerfile_validator.py`
-
-Implement both **pre-built image registration** and **secure user upload** workflows as identified by the engineer. This addresses the security issues from V0.
-
-#### 5.3.2 Enhanced Conversation API
-
-* `openhands/server/routes/conversation_routes.py` (enhanced)
-* `openhands/server/routes/sandbox_spec_routes.py` (new)
-* `tests/unit/server/routes/test_enhanced_conversation_routes.py`
-* `tests/unit/server/routes/test_sandbox_spec_routes.py`
-
-Enhance existing conversation creation API to support `sandbox_spec_id` parameter and add new sandbox spec management endpoints.
-
-**Demo**: Create conversations with custom sandbox specs through existing API endpoints and demonstrate both pre-built and user-uploaded image workflows.
-
-### 5.4 Advanced Security and Management (M4)
-
-#### 5.4.1 Image Security Validation
-
-* `openhands/app_server/security/image_scanner.py`
-* `openhands/app_server/security/security_policies.py`
-* `tests/unit/app_server/security/test_image_scanner.py`
-
-Implement comprehensive security validation including image vulnerability scanning, Dockerfile analysis, and approval workflows.
-
-#### 5.4.2 Spec Registry and Lifecycle Management
-
-* `openhands/app_server/sandbox/spec_registry.py`
-* `openhands/app_server/sandbox/spec_lifecycle.py`
-* `tests/unit/app_server/sandbox/test_spec_registry.py`
-
-Add persistent storage for enhanced sandbox specs, version management, and lifecycle policies (deprecation, cleanup).
-
-**Demo**: Deploy multiple custom agents with different permission levels, demonstrate security validation workflows, and show proper spec lifecycle management.
-
----
-
-## Key Alignment with Engineer's Approach
-
-This revised implementation plan directly addresses the engineer's requirements:
-
-1. **✅ Uses existing sandbox specs system** - Enhanced rather than replaced
-2. **✅ Permissions as core requirement** - Moved to M1 instead of M4
-3. **✅ Two image management approaches** - Pre-built registration and secure user uploads
-4. **✅ Security-first design** - Addresses V0 security issues with comprehensive validation
-5. **✅ Minimal infrastructure changes** - Builds on existing `SandboxSpecService` and conversation APIs
--- a/doc/design-doc/custom-agent-s2.md
+++ b/doc/design-doc/custom-agent-s2.md
@@ -1,934 +0,0 @@
-# Dynamic Custom Agent Package Loading (Scenario 2)
-
-## 1. Introduction
-
-### 1.1 Problem Statement
-
-OpenHands V1 architecture uses a fixed agent server image (`ghcr.io/openhands/agent-server:5f62cee-python`) that contains the default agent implementation. Users who want to customize agent behavior with pure Python packages that don't require additional system dependencies currently have no supported mechanism to deploy their custom agents without building entirely new Docker images.
-
-This creates unnecessary complexity for the common use case where users simply want to:
-- Customize agent prompts and reasoning logic
-- Add new Python-based tools and capabilities
-- Integrate with Python APIs and libraries already available in the base image
-- Deploy agents with different LLM configurations or specialized workflows
-
-The current approach forces all customization through the heavyweight Scenario 1 path (custom Docker images), even when the base agent server image already contains all necessary dependencies.
-
-### 1.2 Proposed Solution
-
-We propose implementing **Dynamic Custom Agent Package Loading** within the existing V1 agent server container. This allows users to deploy custom agents by providing Python packages that are downloaded, installed, and instantiated at runtime without requiring custom Docker images.
-
-Users will be able to:
-1. Package their custom agents as standard Python packages (pip-installable)
-2. Specify agent package URLs (Git repositories, PyPI packages, or ZIP archives) in conversation creation
-3. Have the agent server dynamically download and install the package at startup
-4. Instantiate their custom agent within the existing container environment
-5. Maintain full compatibility with the existing HTTP API (`/ask_agent` endpoint)
-
-The solution leverages the existing V1 architecture's agent server container but extends the startup process to support dynamic agent loading based on environment configuration.
-
-**Trade-offs**: This approach is limited to Python packages that can run within the existing agent server environment. Users needing custom system dependencies, non-Python tools, or different base images must use Scenario 1. However, this covers the majority of agent customization use cases with significantly reduced complexity.
-
-## 2. User Interface
-
-### 2.1 Custom Agent Package Structure
-
-Users create a standard Python package with the required interface:
-
-```
-# my_custom_agent/
-├── setup.py
-├── requirements.txt                    # Optional additional dependencies
-├── my_custom_agent/
-│   ├── __init__.py
-│   ├── agent.py                       # Main agent implementation
-│   ├── tools.py                       # Custom tools (optional)
-│   └── config.py                      # Agent configuration
-```
-
-### 2.2 Agent Implementation
-
-```python
-# my_custom_agent/agent.py
-from openhands.sdk.agent.base import AgentBase
-from openhands.sdk.llm import LLM
-from openhands.sdk.tool import Tool
-from typing import Any, Dict, List, Optional
-
-class MyCustomAgent(AgentBase):
-    """Custom agent with specialized behavior."""
-
-    def __init__(self, llm: LLM, tools: List[Tool], config: Optional[Dict[str, Any]] = None):
-        super().__init__(llm=llm, tools=tools)
-        self.config = config or {}
-
-    async def initialize(self) -> None:
-        """Initialize custom agent resources."""
-        # Custom initialization logic
-        pass
-
-# Factory function for agent creation
-def create_agent(llm: LLM, tools: List[Tool], config: Optional[Dict[str, Any]] = None) -> AgentBase:
-    """Factory function to create the custom agent."""
-    return MyCustomAgent(llm=llm, tools=tools, config=config)
-```
-
-### 2.3 Package Entry Point
-
-```python
-# my_custom_agent/__init__.py
-from .agent import create_agent, MyCustomAgent
-
-__all__ = ['create_agent', 'MyCustomAgent']
-```
-
-### 2.4 Setup Configuration
-
-```python
-# setup.py
-from setuptools import setup, find_packages
-
-setup(
-    name="my-custom-agent",
-    version="1.0.0",
-    packages=find_packages(),
-    install_requires=[
-        # Only additional dependencies beyond base image
-        "requests>=2.25.0",
-        "beautifulsoup4>=4.9.0",
-    ],
-    entry_points={
-        'openhands.agents': [
-            'my_custom_agent = my_custom_agent:create_agent',
-        ],
-    },
-)
-```
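
The `openhands.agents` entry point group declared in `setup.py` can serve as a discovery mechanism once the package is installed. The following is a minimal sketch under that assumption; the helper name `discover_agent_factories` is hypothetical and not part of OpenHands.

```python
from importlib.metadata import entry_points
from typing import Callable, Dict


def discover_agent_factories(group: str = "openhands.agents") -> Dict[str, Callable]:
    """Return {entry point name: loaded factory} for every installed plugin in `group`."""
    try:
        # Python 3.10+: select entry points by group directly.
        selected = entry_points(group=group)
    except TypeError:
        # Python 3.8/3.9: entry_points() takes no arguments and returns a dict.
        selected = entry_points().get(group, [])
    # ep.load() imports the target module and returns the referenced object,
    # e.g. the `create_agent` function from the setup.py example.
    return {ep.name: ep.load() for ep in selected}
```

With the example package installed, the result would be expected to contain a `my_custom_agent` key mapped to its `create_agent` factory; with nothing installed in the group, the function returns an empty dict.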
-
-### 2.5 Conversation Creation with Dynamic Agent Loading
-
-Users create conversations by specifying the agent package URL:
-
-```bash
-# Create conversation with custom agent package
-curl -X POST "https://api.openhands.ai/api/conversations" \
-  -H "Authorization: Bearer $API_KEY" \
-  -H "Content-Type: application/json" \
-  -d '{
-    "agent_package_url": "git+https://github.com/user/my-custom-agent.git",
-    "initial_message": "Help me analyze this codebase",
-    "workspace": {
-      "type": "local",
-      "working_dir": "/workspace/project"
-    }
-  }'
-```
-
-Alternative package sources:
-```bash
-# PyPI package
-"agent_package_url": "my-custom-agent==1.0.0"
-
-# ZIP archive
-"agent_package_url": "https://example.com/agents/my-custom-agent.zip"
-
-# Private Git repository
-"agent_package_url": "git+https://token@github.com/private/agent.git"
-```
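
Private repository URLs such as the last example embed a token, and the loader in section 4.4.1 writes the package URL to its logs. One way to avoid leaking the token is to redact the credentials before logging. This is a sketch only; `redact_credentials` is a hypothetical helper, and it does not handle bracketed IPv6 hosts.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_credentials(package_url: str) -> str:
    """Replace any user info in a package URL with '***' for safe logging."""
    prefix, url = "", package_url
    if url.startswith("git+"):
        # pip's VCS prefix is not part of the URL proper; set it aside.
        prefix, url = "git+", url[len("git+"):]
    parts = urlsplit(url)
    if not parts.scheme or parts.username is None:
        # Plain requirement specifiers and credential-free URLs pass through.
        return package_url
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return prefix + urlunsplit(parts._replace(netloc=f"***@{host}"))
```

Applied to the three example sources above, only the private Git URL changes; the PyPI specifier and the ZIP URL are returned as-is.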
-
-## 3. Other Context
-
-### 3.1 Current V1 Architecture Overview
-
-The OpenHands V1 architecture follows a distributed service model with clear separation between the main application server and agent execution environment. Understanding this architecture is crucial for implementing dynamic agent loading.
-
-#### 3.1.1 Service Separation
-
-The V1 system consists of three primary components:
-
-1. **Main Server** (`openhands/app_server/`): Handles user requests, conversation management, and sandbox orchestration
-2. **Agent Server** (`software-agent-sdk/openhands-agent-server/`): Executes agent logic and manages conversation state
-3. **Action Execution Server**: Handles tool execution (bash commands, file operations) within sandboxed environments
-
-#### 3.1.2 Communication Flow
-
-The current communication pattern follows this sequence:
-
-```
-User Request → Main Server → HTTP API → Agent Server → Agent Instance → Tools → Response
-```
-
-This separation allows for:
-- **Isolation**: Agent execution is isolated from the main application
-- **Scalability**: Multiple agent servers can be spawned for different conversations
-- **Security**: Sandboxed execution prevents agent actions from affecting the host system
-- **Flexibility**: Different agent configurations can be deployed without affecting the main server
-
-#### 3.1.3 Container Orchestration
-
-The main server uses `DockerSandboxSpecService` to create and manage agent server containers:
-
-- **Image Selection**: Currently hardcoded to `ghcr.io/openhands/agent-server:5f62cee-python`
-- **Environment Configuration**: Passed via `initial_env` in `SandboxSpecInfo`
-- **Network Isolation**: Each conversation gets its own container instance
-- **Resource Management**: Memory and CPU limits enforced at container level
-
-### 3.2 Agent Server Internal Architecture
-
-#### 3.2.1 FastAPI Application Structure
-
-The agent server is built as a FastAPI application with these key components:
-
-- **Conversation Router** (`conversation_router.py`): Handles HTTP endpoints for agent interaction
-- **Conversation Service** (`conversation_service.py`): Manages conversation lifecycle and state
-- **Event Service** (`event_service.py`): Processes agent actions and observations
-- **Dependencies** (`dependencies.py`): Provides dependency injection for services
-
-#### 3.2.2 Agent Instantiation Pattern
-
-Currently, agents are instantiated during server startup using a fixed pattern:
-
-```python
-# Simplified current pattern
-agent = Agent(
-    llm=LLM(model="default-model", api_key="..."),
-    tools=[TerminalTool(), FileEditorTool(), ...]
-)
-```
-
-This creates a single agent instance that serves all requests to that container.
-
-#### 3.2.3 Request Processing Flow
-
-When the `/ask_agent` endpoint receives a request:
-
-1. **Request Validation**: `AskAgentRequest` is validated and parsed
-2. **Conversation Lookup**: Conversation state is retrieved or created
-3. **Agent Invocation**: The fixed agent instance processes the question
-4. **Response Formatting**: Result is wrapped in `AskAgentResponse`
-5. **HTTP Response**: JSON response sent back to main server
-
-### 3.3 Software Agent SDK Integration Points
-
-#### 3.3.1 Agent Interface Requirements
-
-The `software-agent-sdk` defines the contract that all agents must follow:
-
-- **AgentBase**: Abstract base class requiring `llm` and `tools` parameters
-- **Tool Integration**: Agents must work with the standardized tool system
-- **Event Handling**: Agents process events through the conversation framework
-- **State Management**: Agents maintain conversation context through event streams
-
-#### 3.3.2 Tool System Architecture
-
-The tool system provides the foundation for agent capabilities:
-
-- **Tool Registration**: Tools are registered globally and resolved by name
-- **Execution Framework**: `ToolExecutor` classes handle action execution
-- **Built-in Tools**: Standard tools (Terminal, FileEditor, Browser) are always available
-- **Custom Tools**: Additional tools can be registered through the plugin system
-
-#### 3.3.3 LLM Integration
-
-Agents interact with language models through the SDK's LLM abstraction:
-
-- **Provider Agnostic**: Supports multiple LLM providers through a unified interface
-- **Configuration**: LLM settings (model, API keys, parameters) are configurable
-- **Response Processing**: Structured handling of LLM responses and tool calls
-
-### 3.4 Dynamic Loading Technical Foundation
-
-#### 3.4.1 Python Package Management
-
-Our dynamic loading approach relies on standard Python packaging and import machinery:
-
-- **pip install**: Supports Git repositories, PyPI packages, and archive files
-- **importlib**: Enables runtime module importing and class instantiation
-- **entry_points**: Provides a standardized plugin discovery mechanism
-- **sys.path**: Allows dynamic modification of Python module search paths
-
-#### 3.4.2 Container Environment Considerations
-
-The agent server container provides a controlled environment for dynamic loading:
-
-- **Python Runtime**: Pre-installed Python 3.x with pip and common libraries
-- **Network Access**: Required for downloading packages from external sources
-- **File System**: Writable areas for package installation and caching
-- **Security Context**: Isolated from host system with appropriate permissions
-
-#### 3.4.3 State Management Implications
-
-Dynamic agent loading affects conversation state management:
-
-- **Agent Persistence**: Custom agents must maintain state across requests
-- **Configuration Isolation**: Different conversations can use different agent configurations
-- **Resource Cleanup**: Proper cleanup of agent resources when conversations end
-- **Error Recovery**: Fallback mechanisms when custom agents fail to load or execute
-
-## 4. Technical Design
-
-### 4.1 Current V1 Agent Instantiation Flow
-
-To understand how our proposal integrates with the existing system, it's important to first examine how agents are currently instantiated and executed in the V1 architecture.
-
-#### 4.1.1 Current Agent Server Startup Process
-
-In the current V1 flow, agent instantiation follows this sequence:
-
-1. **Main Server Request**: When a user creates a conversation, the main server (`openhands/app_server`) creates a sandbox specification via `DockerSandboxSpecService.get_default_sandbox_specs()`
-2. **Container Launch**: The sandbox service launches the agent server container using the hardcoded image `ghcr.io/openhands/agent-server:5f62cee-python`
-3. **Agent Server Initialization**: The agent server container starts with the command `['--port', '8000']` and initializes a FastAPI application
-4. **Default Agent Creation**: During startup, the agent server creates a default agent instance (typically from the software-agent-sdk) with standard tools and configuration
-5. **HTTP API Ready**: The agent server exposes the `/api/conversations/{id}/ask_agent` endpoint, routing requests to the default agent instance
-
-#### 4.1.2 Current Agent Execution Flow
-
-When a user sends a message through the V1 API:
-
-1. **HTTP Request**: Main server makes POST request to `http://agent-server:8000/api/conversations/{id}/ask_agent`
-2. **Agent Router**: `conversation_router.py` receives the request and extracts the `AskAgentRequest`
-3. **Conversation Service**: `ConversationService.ask_agent()` method is called with the user's question
-4. **Event Service**: The request is forwarded to `EventService.ask_agent()` which manages the conversation state
-5. **Agent Execution**: The default agent processes the question using its configured LLM and tools
-6. **Response Return**: The agent's response is returned through the same HTTP chain back to the main server
-
-#### 4.1.3 Limitations of Current Approach
-
-The current system has several limitations for custom agent deployment:
-
-- **Fixed Agent Implementation**: The agent server container contains a single, hardcoded agent implementation
-- **Static Configuration**: Agent behavior cannot be modified without rebuilding the entire container
-- **No Runtime Customization**: Users cannot specify different agent types or configurations per conversation
-- **Deployment Complexity**: Any agent customization requires building and maintaining custom Docker images
-
-### 4.2 Proposed Dynamic Agent Loading Architecture
-
-Our proposal extends the current V1 flow by introducing dynamic agent loading capabilities while maintaining full backward compatibility with existing APIs and infrastructure.
-
-#### 4.2.1 Enhanced Agent Server Startup Process
-
-The modified startup process introduces agent selection based on environment configuration:
-
-1. **Environment Detection**: During agent server startup, check for `CUSTOM_AGENT_PACKAGE_URL` environment variable
-2. **Conditional Loading**: If custom agent URL is present, download and install the package; otherwise use default agent
-3. **Agent Factory Creation**: Create an agent factory function that can instantiate either custom or default agents
-4. **HTTP API Registration**: Register the same `/ask_agent` endpoint, but route to the dynamically selected agent
-
-#### 4.2.2 Dynamic Package Installation Process
-
-When a custom agent package URL is detected, the system performs these steps:
-
-1. **Package Download**: Use `pip install` to download the package from Git, PyPI, or ZIP sources
-2. **Dependency Resolution**: Install any additional Python dependencies specified in the package
-3. **Module Import**: Use `importlib` to dynamically import the custom agent module
-4. **Agent Instantiation**: Call the package's `create_agent()` factory function with LLM and tools
-5. **Initialization**: Execute any custom initialization logic defined by the agent
-6. **Caching**: Cache the agent instance for reuse across multiple requests
-
-#### 4.2.3 Modified Execution Flow
-
-The execution flow remains largely unchanged from the user's perspective, but internally:
-
-1. **Same HTTP API**: The `/ask_agent` endpoint signature and behavior remain identical
-2. **Dynamic Routing**: Requests are routed to either custom or default agent based on startup configuration
-3. **Transparent Operation**: The main server is unaware of whether it's communicating with a custom or default agent
-4. **Consistent Response Format**: All agents return responses in the same `AskAgentResponse` format
-
-### 4.3 Integration Points and Modifications
-
-#### 4.3.1 Sandbox Service Modifications
-
-The main server's sandbox service requires minimal changes to support dynamic agent loading:
-
-```python
-# Current: Fixed environment for all conversations
-def get_default_sandbox_specs():
-    return [SandboxSpecInfo(
-        id=AGENT_SERVER_IMAGE,
-        command=['--port', '8000'],
-        initial_env={...}  # Standard environment
-    )]
-
-# Enhanced: Dynamic environment based on conversation requirements
-def create_dynamic_agent_sandbox_spec(agent_package_url: str):
-    return SandboxSpecInfo(
-        id=AGENT_SERVER_IMAGE,  # Same base image
-        command=['--port', '8000'],
-        initial_env={
-            **standard_env,  # Same variables as the default spec
-            'CUSTOM_AGENT_PACKAGE_URL': agent_package_url  # New variable
-        }
-    )
-```
-
-#### 4.3.2 Agent Server Startup Modifications
-
-The agent server startup process is enhanced to detect and load custom agents:
-
-```python
-# Current: Fixed agent creation
-@app.on_event("startup")
-async def startup_event():
-    app.state.agent = DefaultAgent(llm=default_llm, tools=default_tools)
-
-# Enhanced: Dynamic agent creation
-@app.on_event("startup")
-async def startup_event():
-    custom_agent_url = os.getenv('CUSTOM_AGENT_PACKAGE_URL')
-    if custom_agent_url:
-        loader = DynamicAgentLoader()
-        app.state.agent = await loader.load_agent_from_url(custom_agent_url, ...)
-    else:
-        app.state.agent = DefaultAgent(llm=default_llm, tools=default_tools)
-```
-
-#### 4.3.3 Conversation Service Integration
-
-The conversation service routing logic is updated to use the dynamically loaded agent:
-
-```python
-# Current: Direct agent usage
-async def ask_agent(self, conversation_id: UUID, question: str) -> str:
-    event_service = self.event_services[conversation_id]
-    return await event_service.ask_agent(question)
-
-# Enhanced: Dynamic agent resolution
-async def ask_agent(self, conversation_id: UUID, question: str) -> str:
-    event_service = self.event_services[conversation_id]
-    # Agent is now dynamically determined at startup
-    return await event_service.ask_agent(question)
-```
-
-### 4.4 Dynamic Agent Loading Implementation
-
-#### 4.4.1 Agent Package Loader
-
-```python
-# openhands/agent_server/dynamic_agent_loader.py
-import subprocess
-import importlib
-import tempfile
-import os
-import sys
-from typing import Dict, Any, Optional
-from urllib.parse import urlparse
-from pathlib import Path
-
-from openhands.sdk.agent.base import AgentBase
-from openhands.sdk.llm import LLM
-from openhands.sdk.tool import Tool
-from openhands.sdk.logger import get_logger
-
-logger = get_logger(__name__)
-
-class DynamicAgentLoader:
-    """Loads custom agents from package URLs at runtime."""
-
-    def __init__(self):
-        self.installed_packages: Dict[str, str] = {}
-
-    async def load_agent_from_url(
-        self,
-        package_url: str,
-        llm: LLM,
-        tools: list[Tool],
-        config: Optional[Dict[str, Any]] = None
-    ) -> AgentBase:
-        """Load and instantiate agent from package URL."""
-
-        # Check if already installed
-        if package_url in self.installed_packages:
-            package_name = self.installed_packages[package_url]
-            return await self._create_agent_instance(package_name, llm, tools, config)
-
-        # Install the package
-        package_name = await self._install_package(package_url)
-        self.installed_packages[package_url] = package_name
-
-        # Create agent instance
-        return await self._create_agent_instance(package_name, llm, tools, config)
-
-    async def _install_package(self, package_url: str) -> str:
-        """Install package from URL and return package name."""
-
-        logger.info(f"Installing custom agent package: {package_url}")
-
-        try:
-            # Install package using pip
-            result = subprocess.run([
-                sys.executable, "-m", "pip", "install", package_url
-            ], capture_output=True, text=True, check=True)
-
-            logger.info(f"Package installation successful: {result.stdout}")
-
-            # Extract package name from URL
-            package_name = self._extract_package_name(package_url)
-            return package_name
-
-        except subprocess.CalledProcessError as e:
-            logger.error(f"Failed to install package {package_url}: {e.stderr}")
-            # Chain the original error so the pip failure stays in the traceback
-            raise RuntimeError(f"Package installation failed: {e.stderr}") from e
-
-    def _extract_package_name(self, package_url: str) -> str:
-        """Derive the importable module name from various URL formats.
-
-        Heuristic: assumes the import name is the repository or distribution
-        name with hyphens replaced by underscores.
-        """
-
-        if package_url.startswith('git+'):
-            # Git URL: use the repository name
-            url = package_url.replace('git+', '', 1)
-            name = Path(urlparse(url).path).stem
-        elif '==' in package_url:
-            # PyPI with version: take the part before the version pin
-            name = package_url.split('==')[0]
-        elif package_url.endswith('.zip'):
-            # ZIP file: use the archive filename
-            name = Path(urlparse(package_url).path).stem
-        else:
-            # Assume it's a simple package name
-            name = package_url
-
-        # `my-custom-agent` is imported as `my_custom_agent`
-        return name.replace('-', '_')
-
-    async def _create_agent_instance(
-        self,
-        package_name: str,
-        llm: LLM,
-        tools: list[Tool],
-        config: Optional[Dict[str, Any]] = None
-    ) -> AgentBase:
-        """Create agent instance from installed package."""
-
-        try:
-            # Import the package
-            module = importlib.import_module(package_name)
-
-            # Look for create_agent function
-            if hasattr(module, 'create_agent'):
-                create_agent_func = getattr(module, 'create_agent')
-                agent = create_agent_func(llm=llm, tools=tools, config=config)
-            else:
-                # Fallback: look for agent class
-                agent_classes = [
-                    attr for attr in dir(module)
-                    if (isinstance(getattr(module, attr), type) and
-                        issubclass(getattr(module, attr), AgentBase) and
-                        getattr(module, attr) != AgentBase)
-                ]
-
-                if not agent_classes:
-                    raise RuntimeError(f"No agent class found in package {package_name}")
-
-                agent_class = getattr(module, agent_classes[0])
-                agent = agent_class(llm=llm, tools=tools, config=config)
-
-            # Initialize the agent
-            if hasattr(agent, 'initialize'):
-                await agent.initialize()
-
-            logger.info(f"Successfully created agent from package: {package_name}")
-            return agent
-
-        except Exception as e:
-            logger.error(f"Failed to create agent from package {package_name}: {e}")
-            raise RuntimeError(f"Agent instantiation failed: {e}") from e
-```
-
-#### 4.4.2 Agent Server Integration
-
-```python
-# Modified openhands/agent_server/conversation_service.py
-import os
-from typing import Any, Dict, Optional
-from uuid import UUID
-
-from openhands.agent_server.dynamic_agent_loader import DynamicAgentLoader
-from openhands.sdk.agent.base import AgentBase
-from openhands.sdk.agent import Agent  # Default agent
-from openhands.sdk.llm import LLM
-from openhands.sdk.tool import Tool
-
-class ConversationService:
-    """Enhanced conversation service with dynamic agent loading."""
-
-    def __init__(self, config: Config):
-        self.config = config
-        self.agent_loader = DynamicAgentLoader()
-        self._default_agent_factory = None
-        self._custom_agent_cache: Dict[str, AgentBase] = {}
-
-    async def _get_or_create_agent(
-        self,
-        conversation_id: UUID,
-        llm: LLM,
-        tools: list[Tool]
-    ) -> AgentBase:
-        """Get or create agent for conversation."""
-
-        # Check for custom agent package URL in environment
-        custom_agent_url = os.getenv('CUSTOM_AGENT_PACKAGE_URL')
-
-        if custom_agent_url:
-            # Use custom agent
-            if custom_agent_url not in self._custom_agent_cache:
-                agent = await self.agent_loader.load_agent_from_url(
-                    package_url=custom_agent_url,
-                    llm=llm,
-                    tools=tools,
-                    config=self._get_agent_config()
-                )
-                self._custom_agent_cache[custom_agent_url] = agent
-
-            return self._custom_agent_cache[custom_agent_url]
-        else:
-            # Use default agent
-            if not self._default_agent_factory:
-                self._default_agent_factory = Agent(llm=llm, tools=tools)
-
-            return self._default_agent_factory
-
-    def _get_agent_config(self) -> Dict[str, Any]:
-        """Extract agent configuration from environment."""
-        config = {}
-
-        # Parse JSON config if provided
-        config_json = os.getenv('CUSTOM_AGENT_CONFIG')
-        if config_json:
-            import json
-            config.update(json.loads(config_json))
-
-        return config
-```
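
`_get_agent_config` above trusts whatever is in `CUSTOM_AGENT_CONFIG`. A slightly more defensive variant rejects malformed JSON and non-object values with a clear error. This is a standalone sketch; `load_agent_config` and its `env` parameter are hypothetical, with `env` added only to make the function testable without touching the process environment.

```python
import json
import os
from typing import Any, Dict, Mapping, Optional


def load_agent_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Parse CUSTOM_AGENT_CONFIG into a dict, or return {} when it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUSTOM_AGENT_CONFIG")
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"CUSTOM_AGENT_CONFIG is not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        # The agent factory expects keyword-style configuration.
        raise ValueError("CUSTOM_AGENT_CONFIG must be a JSON object")
    return parsed
```

Because the sandbox side serializes the config with `json.dumps`, a value produced there round-trips through this function unchanged.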
-
-### 4.2 Sandbox Service Integration
-
-#### 4.2.1 Enhanced Sandbox Specification
-
-```python
-# openhands/app_server/sandbox/docker_sandbox_spec_service.py
-from typing import Any, Dict, Optional
-
-class DockerSandboxSpecService(SandboxSpecService):
-    """Enhanced sandbox service supporting dynamic agent loading."""
-
-    def create_dynamic_agent_sandbox_spec(
-        self,
-        agent_package_url: str,
-        agent_config: Optional[Dict[str, Any]] = None
-    ) -> SandboxSpecInfo:
-        """Create sandbox spec with dynamic agent loading configuration."""
-
-        # Base environment from existing implementation
-        base_env = {
-            'OPENVSCODE_SERVER_ROOT': '/openhands/.openvscode-server',
-            'OH_ENABLE_VNC': '0',
-            'LOG_JSON': 'true',
-            'OH_CONVERSATIONS_PATH': '/workspace/conversations',
-            'OH_BASH_EVENTS_DIR': '/workspace/bash_events',
-            'PYTHONUNBUFFERED': '1',
-            'ENV_LOG_LEVEL': '20',
-        }
-
-        # Add dynamic agent configuration
-        dynamic_env = {
-            **base_env,
-            'CUSTOM_AGENT_PACKAGE_URL': agent_package_url,
-        }
-
-        # Add agent configuration as JSON if provided
-        if agent_config:
-            import json
-            dynamic_env['CUSTOM_AGENT_CONFIG'] = json.dumps(agent_config)
-
-        return SandboxSpecInfo(
-            id=AGENT_SERVER_IMAGE,  # Same base image
-            command=['--port', '8000'],
-            initial_env=dynamic_env,
-            working_dir='/workspace/project',
-        )
-```
-
-#### 4.2.2 Conversation Creation API Enhancement
-
-```python
-# openhands/server/routes/conversation_routes.py
-from pydantic import BaseModel
-from typing import Optional, Dict, Any
-
-class CreateConversationRequest(BaseModel):
-    """Enhanced conversation creation request."""
-    initial_message: str
-    workspace_config: Optional[Dict[str, Any]] = None
-    # New field for dynamic agent loading
-    agent_package_url: Optional[str] = None
-    agent_config: Optional[Dict[str, Any]] = None
-
-@router.post("/conversations")
-async def create_conversation(
-    request: CreateConversationRequest,
-    sandbox_service: DockerSandboxSpecService = Depends(get_sandbox_service)
-) -> ConversationResponse:
-    """Create conversation with optional dynamic agent loading."""
-
-    if request.agent_package_url:
-        # Create sandbox with dynamic agent loading
-        sandbox_spec = sandbox_service.create_dynamic_agent_sandbox_spec(
-            agent_package_url=request.agent_package_url,
-            agent_config=request.agent_config
-        )
-    else:
-        # Use default sandbox specification
-        sandbox_spec = sandbox_service.get_default_sandbox_specs()[0]
-
-    # Create sandbox and conversation
-    sandbox = await sandbox_service.create_sandbox(sandbox_spec)
-    await wait_for_agent_server_ready(sandbox)
-
-    conversation = await create_conversation_with_sandbox(
-        sandbox=sandbox,
-        initial_message=request.initial_message,
-        workspace_config=request.workspace_config
-    )
-
-    return ConversationResponse(
-        conversation_id=conversation.id,
-        status="created",
-        agent_type="custom" if request.agent_package_url else "default"
-    )
-```
-
-### 4.3 Agent Server Startup Process
-
-#### 4.3.1 Enhanced Agent Server Initialization
-
-```python
-# openhands/agent_server/api.py startup modification
-import os
-from openhands.agent_server.dynamic_agent_loader import DynamicAgentLoader
-
-@app.on_event("startup")
-async def startup_event():
-    """Enhanced startup with dynamic agent loading support."""
-
-    # Initialize dynamic agent loader
-    app.state.agent_loader = DynamicAgentLoader()
-
-    # Check for custom agent package URL
-    custom_agent_url = os.getenv('CUSTOM_AGENT_PACKAGE_URL')
-
-    if custom_agent_url:
-        logger.info(f"Dynamic agent loading enabled: {custom_agent_url}")
-        # Pre-validate package URL (optional)
-        try:
-            await app.state.agent_loader._install_package(custom_agent_url)
-            logger.info("Custom agent package pre-installed successfully")
-        except Exception as e:
-            logger.warning(f"Custom agent pre-installation failed: {e}")
-            # Continue with startup - will retry on first conversation
-    else:
-        logger.info("Using default agent configuration")
-```
-
-### 4.4 Error Handling and Fallback
-
-#### 4.4.1 Robust Error Handling
-
-```python
-# openhands/agent_server/dynamic_agent_loader.py (enhanced)
-class DynamicAgentLoader:
-    """Enhanced loader with comprehensive error handling."""
-
-    async def load_agent_with_fallback(
-        self,
-        package_url: str,
-        llm: LLM,
-        tools: list[Tool],
-        config: Optional[Dict[str, Any]] = None
-    ) -> AgentBase:
-        """Load custom agent with fallback to default agent."""
-
-        try:
-            return await self.load_agent_from_url(package_url, llm, tools, config)
-        except Exception as e:
-            logger.error(f"Custom agent loading failed: {e}")
-            logger.info("Falling back to default agent")
-
-            # Import default agent
-            from openhands.sdk.agent import Agent
-            return Agent(llm=llm, tools=tools)
-
-    async def _validate_package_url(self, package_url: str) -> bool:
-        """Validate package URL accessibility."""
-
-        try:
-            if package_url.startswith('git+'):
-                # Validate Git repository access
-                import subprocess
-                result = subprocess.run([
-                    'git', 'ls-remote', package_url.replace('git+', '')
-                ], capture_output=True, timeout=30)
-                return result.returncode == 0
-            elif package_url.startswith('http'):
-                # Validate HTTP URL accessibility
-                import httpx
-                async with httpx.AsyncClient() as client:
-                    response = await client.head(package_url, timeout=30)
-                    return response.status_code == 200
-            else:
-                # Assume PyPI package - always return True
-                return True
-        except Exception:
-            return False
-```
-
-### 4.5 Security and Isolation
-
-#### 4.5.1 Package Security Validation
-
-```python
-# openhands/agent_server/security/package_validator.py
-import re
-from typing import List, Set
-from urllib.parse import urlparse
-
-class PackageSecurityValidator:
-    """Validates custom agent packages for security compliance."""
-
-    ALLOWED_DOMAINS: Set[str] = {
-        'github.com',
-        'gitlab.com',
-        'bitbucket.org',
-        'pypi.org',
-        'files.pythonhosted.org'
-    }
-
-    BLOCKED_PACKAGES: Set[str] = {
-        # Add known malicious packages
-    }
-
-    def validate_package_url(self, package_url: str) -> bool:
-        """Validate package URL against security policies."""
-
-        # Check blocked packages
-        if self._is_blocked_package(package_url):
-            return False
-
-        # Validate domain for HTTP/Git URLs
-        if package_url.startswith(('http', 'git+')):
-            parsed = urlparse(package_url.replace('git+', ''))
-            if parsed.hostname not in self.ALLOWED_DOMAINS:
-                return False
-
-        # Additional security checks
-        return self._validate_package_name(package_url)
-
-    def _is_blocked_package(self, package_url: str) -> bool:
-        """Check if package is in blocklist."""
-        for blocked in self.BLOCKED_PACKAGES:
-            if blocked in package_url.lower():
-                return True
-        return False
-
-    def _validate_package_name(self, package_url: str) -> bool:
-        """Validate package name format."""
-        # Basic validation for malicious patterns
-        malicious_patterns = [
-            r'\.\./',  # Path traversal
-            r'[;&|`$]',  # Command injection
-            r'<script',  # XSS attempts
-        ]
-
-        for pattern in malicious_patterns:
-            if re.search(pattern, package_url, re.IGNORECASE):
-                return False
-
-        return True
-```
-
-## 5. Implementation Plan
-
-All implementation must pass existing lints and tests. New functionality requires comprehensive test coverage including unit tests, integration tests, and end-to-end scenarios.
-
-### 5.1 Dynamic Agent Loading Foundation (M1)
-
-#### 5.1.1 Dynamic Agent Loader Implementation
-
-* `openhands/agent_server/dynamic_agent_loader.py`
-* `tests/unit/agent_server/test_dynamic_agent_loader.py`
-
-Implement core dynamic agent loading functionality with package installation, module importing, and agent instantiation.
-
-#### 5.1.2 Package Security Validation
-
-* `openhands/agent_server/security/package_validator.py`
-* `tests/unit/agent_server/security/test_package_validator.py`
-
-Add security validation for custom agent packages including domain allowlists and malicious pattern detection.
-
-**Demo**: Load a simple custom agent from a Git repository and verify it responds to basic queries through the existing `/ask_agent` HTTP API.
-
-### 5.2 Sandbox Service Integration (M2)
-
-#### 5.2.1 Enhanced Sandbox Specification
-
-* `openhands/app_server/sandbox/docker_sandbox_spec_service.py` (modifications)
-* `tests/unit/app_server/sandbox/test_docker_sandbox_spec_service.py` (enhancements)
-
-Extend existing sandbox service to support dynamic agent loading configuration through environment variables.
-
-#### 5.2.2 Agent Server Startup Integration
-
-* `openhands/agent_server/conversation_service.py` (modifications)
-* `openhands/agent_server/api.py` (startup enhancements)
-* `tests/unit/agent_server/test_conversation_service.py` (enhancements)
-
-Integrate dynamic agent loading into agent server startup and conversation management processes.
-
-**Demo**: Create conversations with custom agents specified via environment variables and demonstrate proper agent instantiation and tool execution.
-
-### 5.3 API Integration (M3)
-
-#### 5.3.1 Enhanced Conversation Creation API
-
-* `openhands/server/routes/conversation_routes.py` (modifications)
-* `tests/unit/server/routes/test_conversation_routes.py` (enhancements)
-
-Extend conversation creation API to accept agent package URLs and configuration parameters.
-
-#### 5.3.2 Error Handling and Fallback
-
-* `openhands/agent_server/dynamic_agent_loader.py` (enhancements)
-* `tests/unit/agent_server/test_dynamic_agent_fallback.py`
-
-Implement comprehensive error handling with fallback to default agents when custom agent loading fails.
-
-**Demo**: Create conversations through API endpoints with various package URL formats (Git, PyPI, ZIP) and demonstrate proper error handling and fallback behavior.
-
-### 5.4 Advanced Features and Optimization (M4)
-
-#### 5.4.1 Agent Caching and Performance
-
-* `openhands/agent_server/agent_cache.py`
-* `tests/unit/agent_server/test_agent_cache.py`
-
-Implement agent instance caching to avoid repeated package installation and improve performance for multiple conversations with the same custom agent.
-
-#### 5.4.2 Package Management and Cleanup
-
-* `openhands/agent_server/package_manager.py`
-* `tests/unit/agent_server/test_package_manager.py`
-
-Add package lifecycle management including cleanup of unused packages and version management for package updates.
-
-**Demo**: Deploy multiple conversations with different custom agents simultaneously and demonstrate proper resource management, caching, and cleanup behavior.
-
---
-
-## References
-
-This design document is based on analysis of the following source materials:
-
-1. **OpenHands V1 Architecture**: Analysis of `openhands/app_server/sandbox/docker_sandbox_spec_service.py` and `openhands/app_server/event_callback/github_v1_callback_processor.py` for understanding the V1 flow and agent server integration.
-
-2. **Software Agent SDK**: Analysis of the `software-agent-sdk` repository, specifically:
-   - `openhands-agent-server/openhands/agent_server/conversation_router.py` for HTTP API patterns
-   - `openhands-sdk/openhands/sdk/agent/base.py` for agent interface requirements
-   - `examples/01_standalone_sdk/02_custom_tools.py` for custom agent implementation patterns
-
-3. **Agent Server Models**: Analysis of `openhands.agent_server.models` imports in the main OpenHands codebase for understanding the API contract between main server and agent server.
-
-4. **Container Architecture**: Analysis of `AGENT_SERVER_IMAGE` constant usage in `openhands/app_server/sandbox/sandbox_spec_service.py` for understanding the current container deployment model.
-
-All technical specifications and implementation details are derived from examination of the existing codebase and established patterns within the OpenHands ecosystem.
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -7,7 +7,7 @@ services:
    image: openhands:latest
    container_name: openhands-app-${DATE:-}
    environment:
-      - SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-docker.openhands.dev/openhands/runtime:0.62-nikolaik}
+      - SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-docker.openhands.dev/openhands/runtime:0.60-nikolaik}
      #- SANDBOX_USER_ID=${SANDBOX_USER_ID:-1234} # enable this only if you want a specific non-root sandbox user but you will have to manually adjust permissions of ~/.openhands for this user
      - WORKSPACE_MOUNT_PATH=${WORKSPACE_BASE:-$PWD/workspace}
    ports:
--- a/enterprise/experiments/experiment_manager.py
+++ b/enterprise/experiments/experiment_manager.py
@@ -5,8 +5,12 @@ from experiments.constants import (
    EXPERIMENT_SYSTEM_PROMPT_EXPERIMENT,
 )
 from experiments.experiment_versions import (
+    handle_condenser_max_step_experiment,
    handle_system_prompt_experiment,
 )
+from experiments.experiment_versions._004_condenser_max_step_experiment import (
+    handle_condenser_max_step_experiment__v1,
+)

 from openhands.core.config.openhands_config import OpenHandsConfig
 from openhands.core.logger import openhands_logger as logger
@@ -27,6 +31,10 @@ class SaaSExperimentManager(ExperimentManager):
            )
            return agent

+        agent = handle_condenser_max_step_experiment__v1(
+            user_id, conversation_id, agent
+        )
+
        if EXPERIMENT_SYSTEM_PROMPT_EXPERIMENT:
            agent = agent.model_copy(
                update={'system_prompt_filename': 'system_prompt_long_horizon.j2'}
@@ -52,7 +60,20 @@ class SaaSExperimentManager(ExperimentManager):
        """
        logger.debug(
            'experiment_manager:run_conversation_variant_test:started',
-            extra={'user_id': user_id, 'conversation_id': conversation_id},
+            extra={'user_id': user_id},
+        )
+
+        # Skip all experiment processing if the experiment manager is disabled
+        if not ENABLE_EXPERIMENT_MANAGER:
+            logger.info(
+                'experiment_manager:run_conversation_variant_test:skipped',
+                extra={'reason': 'experiment_manager_disabled'},
+            )
+            return conversation_settings
+
+        # Apply conversation-scoped experiments
+        conversation_settings = handle_condenser_max_step_experiment(
+            user_id, conversation_id, conversation_settings
        )

        return conversation_settings
--- a/enterprise/integrations/github/github_manager.py
+++ b/enterprise/integrations/github/github_manager.py
@@ -292,26 +292,18 @@ class GithubManager(Manager):
                    f'[GitHub] Created conversation {conversation_id} for user {user_info.username}'
                )

-                from openhands.server.shared import ConversationStoreImpl, config
-
-                conversation_store = await ConversationStoreImpl.get_instance(
-                    config, github_view.user_info.keycloak_user_id
+                # Create a GithubCallbackProcessor
+                processor = GithubCallbackProcessor(
+                    github_view=github_view,
+                    send_summary_instruction=True,
                )
-                metadata = await conversation_store.get_metadata(conversation_id)

-                if metadata.conversation_version != 'v1':
-                    # Create a GithubCallbackProcessor
-                    processor = GithubCallbackProcessor(
-                        github_view=github_view,
-                        send_summary_instruction=True,
-                    )
+                # Register the callback processor
+                register_callback_processor(conversation_id, processor)

-                    # Register the callback processor
-                    register_callback_processor(conversation_id, processor)
-
-                    logger.info(
-                        f'[Github] Registered callback processor for conversation {conversation_id}'
-                    )
+                logger.info(
+                    f'[Github] Registered callback processor for conversation {conversation_id}'
+                )

                # Send message with conversation link
                conversation_link = CONVERSATION_URL.format(conversation_id)
--- a/enterprise/integrations/github/github_view.py
+++ b/enterprise/integrations/github/github_view.py
@@ -1,4 +1,4 @@
-from uuid import UUID, uuid4
+from uuid import uuid4

 from github import Github, GithubIntegration
 from github.Issue import Issue
@@ -26,22 +26,10 @@ from storage.proactive_conversation_store import ProactiveConversationStore
 from storage.saas_secrets_store import SaasSecretsStore
 from storage.saas_settings_store import SaasSettingsStore

-from openhands.agent_server.models import SendMessageRequest
-from openhands.app_server.app_conversation.app_conversation_models import (
-    AppConversationStartRequest,
-    AppConversationStartTaskStatus,
-)
-from openhands.app_server.config import get_app_conversation_service
-from openhands.app_server.services.injector import InjectorState
-from openhands.app_server.user.specifiy_user_context import USER_CONTEXT_ATTR
-from openhands.app_server.user.user_context import UserContext
-from openhands.app_server.user.user_models import UserInfo
 from openhands.core.logger import openhands_logger as logger
 from openhands.integrations.github.github_service import GithubServiceImpl
 from openhands.integrations.provider import PROVIDER_TOKEN_TYPE, ProviderType
 from openhands.integrations.service_types import Comment
-from openhands.sdk import TextContent
-from openhands.sdk.conversation.secret_source import SecretSource
 from openhands.server.services.conversation_service import (
    initialize_conversation,
    start_conversation,
@@ -55,49 +43,6 @@ from openhands.utils.async_utils import call_sync_from_async
 OH_LABEL, INLINE_OH_LABEL = get_oh_labels(HOST)


-class GithubUserContext(UserContext):
-    """User context for GitHub integration that provides user info without web request."""
-
-    def __init__(self, keycloak_user_id: str, git_provider_tokens: PROVIDER_TOKEN_TYPE):
-        self.keycloak_user_id = keycloak_user_id
-        self.git_provider_tokens = git_provider_tokens
-        self.settings_store = SaasSettingsStore(
-            user_id=self.keycloak_user_id,
-            session_maker=session_maker,
-            config=get_config(),
-        )
-
-        self.secrets_store = SaasSecretsStore(
-            self.keycloak_user_id, session_maker, get_config()
-        )
-
-    async def get_user_id(self) -> str | None:
-        return self.keycloak_user_id
-
-    async def get_user_info(self) -> UserInfo:
-        user_settings = await self.settings_store.load()
-        return UserInfo(
-            id=self.keycloak_user_id,
-            **user_settings.model_dump(context={'expose_secrets': True}),
-        )
-
-    async def get_authenticated_git_url(self, repository: str) -> str:
-        # This would need to be implemented based on the git provider tokens
-        # For now, return a basic HTTPS URL
-        return f'https://github.com/{repository}.git'
-
-    async def get_latest_token(self, provider_type: ProviderType) -> str | None:
-        # Return the appropriate token from git_provider_tokens
-        if provider_type == ProviderType.GITHUB and self.git_provider_tokens:
-            return self.git_provider_tokens.get(ProviderType.GITHUB)
-        return None
-
-    async def get_secrets(self) -> dict[str, SecretSource]:
-        # Return empty dict for now - GitHub integration handles secrets separately
-        user_secrets = await self.secrets_store.load()
-        return dict(user_secrets.custom_secrets) if user_secrets else {}
-
-
 async def get_user_proactive_conversation_setting(user_id: str | None) -> bool:
    """Get the user's proactive conversation setting.

@@ -131,35 +76,6 @@ async def get_user_proactive_conversation_setting(user_id: str | None) -> bool:
    return settings.enable_proactive_conversation_starters


-async def get_user_v1_enabled_setting(user_id: str | None) -> bool:
-    """Get the user's V1 conversation API setting.
-
-    Args:
-        user_id: The keycloak user ID
-
-    Returns:
-        True if V1 conversations are enabled for this user, False otherwise
-    """
-
-    # If no user ID is provided, we can't check user settings
-    if not user_id:
-        return False
-
-    config = get_config()
-    settings_store = SaasSettingsStore(
-        user_id=user_id, session_maker=session_maker, config=config
-    )
-
-    settings = await call_sync_from_async(
-        settings_store.get_user_settings_by_keycloak_id, user_id
-    )
-
-    if not settings or settings.v1_enabled is None:
-        return False
-
-    return settings.v1_enabled
-
-
 # =================================================
 # SECTION: Github view types
 # =================================================
@@ -243,31 +159,6 @@ class GithubIssue(ResolverViewInterface):
        git_provider_tokens: PROVIDER_TOKEN_TYPE,
        conversation_metadata: ConversationMetadata,
    ):
-        v1_enabled = await get_user_v1_enabled_setting(self.user_info.keycloak_user_id)
-
-        if v1_enabled:
-            try:
-                # Use V1 app conversation service
-                await self._create_v1_conversation(
-                    jinja_env, git_provider_tokens, conversation_metadata
-                )
-                return
-
-            except Exception as e:
-                logger.warning(f'Error checking V1 settings, falling back to V0: {e}')
-
-        # Use existing V0 conversation service
-        await self._create_v0_conversation(
-            jinja_env, git_provider_tokens, conversation_metadata
-        )
-
-    async def _create_v0_conversation(
-        self,
-        jinja_env: Environment,
-        git_provider_tokens: PROVIDER_TOKEN_TYPE,
-        conversation_metadata: ConversationMetadata,
-    ):
-        """Create conversation using the legacy V0 system."""
        custom_secrets = await self._get_user_secrets()

        user_instructions, conversation_instructions = await self._get_instructions(
@@ -286,77 +177,6 @@ class GithubIssue(ResolverViewInterface):
            conversation_instructions=conversation_instructions,
        )

-    async def _create_v1_conversation(
-        self,
-        jinja_env: Environment,
-        git_provider_tokens: PROVIDER_TOKEN_TYPE,
-        conversation_metadata: ConversationMetadata,
-    ):
-        """Create conversation using the new V1 app conversation system."""
-        user_instructions, conversation_instructions = await self._get_instructions(
-            jinja_env
-        )
-
-        # Create the initial message request
-        initial_message = SendMessageRequest(
-            role='user', content=[TextContent(text=user_instructions)]
-        )
-
-        # Create the GitHub V1 callback processor
-        github_callback_processor = self._create_github_v1_callback_processor()
-
-        # Get the app conversation service and start the conversation
-        injector_state = InjectorState()
-
-        # Create the V1 conversation start request with the callback processor
-        start_request = AppConversationStartRequest(
-            conversation_id=UUID(conversation_metadata.conversation_id),
-            system_message_suffix=conversation_instructions,
-            initial_message=initial_message,
-            selected_repository=self.full_repo_name,
-            git_provider=ProviderType.GITHUB,
-            title=f'GitHub Issue #{self.issue_number}: {self.title}',
-            trigger=ConversationTrigger.RESOLVER,
-            processors=[
-                github_callback_processor
-            ],  # Pass the callback processor directly
-        )
-
-        # Set up the GitHub user context for the V1 system
-        github_user_context = GithubUserContext(
-            keycloak_user_id=self.user_info.keycloak_user_id,
-            git_provider_tokens=git_provider_tokens,
-        )
-        setattr(injector_state, USER_CONTEXT_ATTR, github_user_context)
-
-        async with get_app_conversation_service(
-            injector_state
-        ) as app_conversation_service:
-            async for task in app_conversation_service.start_app_conversation(
-                start_request
-            ):
-                if task.status == AppConversationStartTaskStatus.ERROR:
-                    logger.error(f'Failed to start V1 conversation: {task.detail}')
-                    raise RuntimeError(
-                        f'Failed to start V1 conversation: {task.detail}'
-                    )
-
-    def _create_github_v1_callback_processor(self):
-        """Create a V1 callback processor for GitHub integration."""
-        from openhands.app_server.event_callback.github_v1_callback_processor import (
-            GithubV1CallbackProcessor,
-        )
-
-        # Create and return the GitHub V1 callback processor
-        return GithubV1CallbackProcessor(
-            github_view_data={
-                'issue_number': self.issue_number,
-                'full_repo_name': self.full_repo_name,
-                'installation_id': self.installation_id,
-            },
-            send_summary_instruction=self.send_summary_instruction,
-        )
-

@dataclass
 class GithubIssueComment(GithubIssue):
@@ -472,24 +292,6 @@ class GithubInlinePRComment(GithubPRComment):

        return user_instructions, conversation_instructions

-    def _create_github_v1_callback_processor(self):
-        """Create a V1 callback processor for GitHub integration."""
-        from openhands.app_server.event_callback.github_v1_callback_processor import (
-            GithubV1CallbackProcessor,
-        )
-
-        # Create and return the GitHub V1 callback processor
-        return GithubV1CallbackProcessor(
-            github_view_data={
-                'issue_number': self.issue_number,
-                'full_repo_name': self.full_repo_name,
-                'installation_id': self.installation_id,
-                'comment_id': self.comment_id,
-            },
-            inline_pr_comment=True,
-            send_summary_instruction=self.send_summary_instruction,
-        )
-

@dataclass
 class GithubFailingAction:
--- a/enterprise/migrations/versions/080_add_status_and_updated_at_to_callback.py
+++ b/enterprise/migrations/versions/080_add_status_and_updated_at_to_callback.py
@@ -1,71 +0,0 @@
-"""add status and updated_at to callback
-
-Revision ID: 080
-Revises: 079
-Create Date: 2025-11-05 00:00:00.000000
-
-"""
-
-from enum import Enum
-from typing import Sequence, Union
-
-import sqlalchemy as sa
-from alembic import op
-
-# revision identifiers, used by Alembic.
-revision: str = '080'
-down_revision: Union[str, None] = '079'
-branch_labels: Union[str, Sequence[str], None] = None
-depends_on: Union[str, Sequence[str], None] = None
-
-
-class EventCallbackStatus(Enum):
-    ACTIVE = 'ACTIVE'
-    DISABLED = 'DISABLED'
-    COMPLETED = 'COMPLETED'
-    ERROR = 'ERROR'
-
-
-def upgrade() -> None:
-    """Upgrade schema."""
-    status = sa.Enum(EventCallbackStatus, name='eventcallbackstatus')
-    status.create(op.get_bind(), checkfirst=True)
-    op.add_column(
-        'event_callback',
-        sa.Column('status', status, nullable=False, server_default='ACTIVE'),
-    )
-    op.add_column(
-        'event_callback',
-        sa.Column(
-            'updated_at', sa.DateTime, nullable=False, server_default=sa.func.now()
-        ),
-    )
-    op.drop_index('ix_event_callback_result_event_id')
-    op.drop_column('event_callback_result', 'event_id')
-    op.add_column(
-        'event_callback_result', sa.Column('event_id', sa.String, nullable=True)
-    )
-    op.create_index(
-        op.f('ix_event_callback_result_event_id'),
-        'event_callback_result',
-        ['event_id'],
-        unique=False,
-    )
-
-
-def downgrade() -> None:
-    """Downgrade schema."""
-    op.drop_column('event_callback', 'status')
-    op.drop_column('event_callback', 'updated_at')
-    op.drop_index('ix_event_callback_result_event_id')
-    op.drop_column('event_callback_result', 'event_id')
-    op.add_column(
-        'event_callback_result', sa.Column('event_id', sa.UUID, nullable=True)
-    )
-    op.create_index(
-        op.f('ix_event_callback_result_event_id'),
-        'event_callback_result',
-        ['event_id'],
-        unique=False,
-    )
-    op.execute('DROP TYPE eventcallbackstatus')
--- a/enterprise/migrations/versions/081_add_parent_conversation_id.py
+++ b/enterprise/migrations/versions/081_add_parent_conversation_id.py
@@ -1,41 +0,0 @@
-"""add parent_conversation_id to conversation_metadata
-
-Revision ID: 081
-Revises: 080
-Create Date: 2025-11-06 00:00:00.000000
-
-"""
-
-from typing import Sequence, Union
-
-import sqlalchemy as sa
-from alembic import op
-
-# revision identifiers, used by Alembic.
-revision: str = '081'
-down_revision: Union[str, None] = '080'
-branch_labels: Union[str, Sequence[str], None] = None
-depends_on: Union[str, Sequence[str], None] = None
-
-
-def upgrade() -> None:
-    """Upgrade schema."""
-    op.add_column(
-        'conversation_metadata',
-        sa.Column('parent_conversation_id', sa.String(), nullable=True),
-    )
-    op.create_index(
-        op.f('ix_conversation_metadata_parent_conversation_id'),
-        'conversation_metadata',
-        ['parent_conversation_id'],
-        unique=False,
-    )
-
-
-def downgrade() -> None:
-    """Downgrade schema."""
-    op.drop_index(
-        op.f('ix_conversation_metadata_parent_conversation_id'),
-        table_name='conversation_metadata',
-    )
-    op.drop_column('conversation_metadata', 'parent_conversation_id')
--- a/enterprise/migrations/versions/082_add_setting_up_skills_enum_value.py
+++ b/enterprise/migrations/versions/082_add_setting_up_skills_enum_value.py
@@ -1,51 +0,0 @@
-"""Add SETTING_UP_SKILLS to appconversationstarttaskstatus enum
-
-Revision ID: 082
-Revises: 081
-Create Date: 2025-11-19 12:00:00.000000
-
-"""
-
-from typing import Sequence, Union
-
-from alembic import op
-from sqlalchemy import text
-
-# revision identifiers, used by Alembic.
-revision: str = '082'
-down_revision: Union[str, Sequence[str], None] = '081'
-branch_labels: Union[str, Sequence[str], None] = None
-depends_on: Union[str, Sequence[str], None] = None
-
-
-def upgrade() -> None:
-    """Add SETTING_UP_SKILLS enum value to appconversationstarttaskstatus."""
-    # Check if the enum value already exists before adding it
-    # This handles the case where the enum was created with the value already included
-    connection = op.get_bind()
-    result = connection.execute(
-        text(
-            "SELECT 1 FROM pg_enum WHERE enumlabel = 'SETTING_UP_SKILLS' "
-            "AND enumtypid = (SELECT oid FROM pg_type WHERE typname = 'appconversationstarttaskstatus')"
-        )
-    )
-
-    if not result.fetchone():
-        # Add the new enum value only if it doesn't already exist
-        op.execute(
-            "ALTER TYPE appconversationstarttaskstatus ADD VALUE 'SETTING_UP_SKILLS'"
-        )
-
-
-def downgrade() -> None:
-    """Remove SETTING_UP_SKILLS enum value from appconversationstarttaskstatus.
-
-    Note: PostgreSQL doesn't support removing enum values directly.
-    This would require recreating the enum type and updating all references.
-    For safety, this downgrade is not implemented.
-    """
-    # PostgreSQL doesn't support removing enum values directly
-    # This would require a complex migration to recreate the enum
-    # For now, we'll leave this as a no-op since removing enum values
-    # is rarely needed and can be dangerous
-    pass
--- a/enterprise/migrations/versions/083_add_v1_enabled_to_user_settings.py
+++ b/enterprise/migrations/versions/083_add_v1_enabled_to_user_settings.py
@@ -1,35 +0,0 @@
-"""Add v1_enabled column to user_settings
-
-Revision ID: 083
-Revises: 082
-Create Date: 2025-11-18 00:00:00.000000
-
-"""
-
-from typing import Sequence, Union
-
-import sqlalchemy as sa
-from alembic import op
-
-# revision identifiers, used by Alembic.
-revision: str = '083'
-down_revision: Union[str, None] = '082'
-branch_labels: Union[str, Sequence[str], None] = None
-depends_on: Union[str, Sequence[str], None] = None
-
-
-def upgrade() -> None:
-    """Add v1_enabled column to user_settings table."""
-    op.add_column(
-        'user_settings',
-        sa.Column(
-            'v1_enabled',
-            sa.Boolean(),
-            nullable=True,
-        ),
-    )
-
-
-def downgrade() -> None:
-    """Remove v1_enabled column from user_settings table."""
-    op.drop_column('user_settings', 'v1_enabled')
--- a/enterprise/poetry.lock
+++ b/enterprise/poetry.lock
@@ -201,20 +201,19 @@ files = [

 [[package]]
 name = "anthropic"
-version = "0.72.0"
+version = "0.65.0"
 description = "The official Python library for the anthropic API"
 optional = false
 python-versions = ">=3.8"
 groups = ["main"]
 files = [
-    {file = "anthropic-0.72.0-py3-none-any.whl", hash = "sha256:0e9f5a7582f038cab8efbb4c959e49ef654a56bfc7ba2da51b5a7b8a84de2e4d"},
-    {file = "anthropic-0.72.0.tar.gz", hash = "sha256:8971fe76dcffc644f74ac3883069beb1527641115ae0d6eb8fa21c1ce4082f7a"},
+    {file = "anthropic-0.65.0-py3-none-any.whl", hash = "sha256:ba9d9f82678046c74ddf5698ca06d9f5b0f599cfac922ab0d5921638eb448d98"},
+    {file = "anthropic-0.65.0.tar.gz", hash = "sha256:6b6b6942574e54342050dfd42b8d856a8366b171daec147df3b80be4722733b9"},
 ]

 [package.dependencies]
 anyio = ">=3.5.0,<5"
 distro = ">=1.7.0,<2"
-docstring-parser = ">=0.15,<1"
 google-auth = {version = ">=2,<3", extras = ["requests"], optional = true, markers = "extra == \"vertex\""}
 httpx = ">=0.25.0,<1"
 jiter = ">=0.4.0,<1"
@@ -223,7 +222,7 @@ sniffio = "*"
 typing-extensions = ">=4.10,<5"

 [package.extras]
-aiohttp = ["aiohttp", "httpx-aiohttp (>=0.1.9)"]
+aiohttp = ["aiohttp", "httpx-aiohttp (>=0.1.8)"]
 bedrock = ["boto3 (>=1.28.57)", "botocore (>=1.31.57)"]
 vertex = ["google-auth[requests] (>=2,<3)"]

@@ -682,34 +681,31 @@ crt = ["awscrt (==0.27.6)"]

 [[package]]
 name = "browser-use"
-version = "0.9.5"
+version = "0.7.10"
 description = "Make websites accessible for AI agents"
 optional = false
 python-versions = "<4.0,>=3.11"
 groups = ["main"]
 files = [
-    {file = "browser_use-0.9.5-py3-none-any.whl", hash = "sha256:4a2e92847204d1ded269026a99cb0cc0e60e38bd2751fa3f58aedd78f00b4e67"},
-    {file = "browser_use-0.9.5.tar.gz", hash = "sha256:f8285fe253b149d01769a7084883b4cf4db351e2f38e26302c157bcbf14a703f"},
+    {file = "browser_use-0.7.10-py3-none-any.whl", hash = "sha256:669e12571a0c0c4c93e5fd26abf9e2534eb9bacbc510328aedcab795bd8906a9"},
+    {file = "browser_use-0.7.10.tar.gz", hash = "sha256:f93ce59e06906c12d120360dee4aa33d83618ddf7c9a575dd0ac517d2de7ccbc"},
 ]

 [package.dependencies]
 aiohttp = "3.12.15"
-anthropic = ">=0.68.1,<1.0.0"
+anthropic = ">=0.58.2,<1.0.0"
 anyio = ">=4.9.0"
 authlib = ">=1.6.0"
 bubus = ">=1.5.6"
 cdp-use = ">=1.4.0"
-click = ">=8.1.8"
-cloudpickle = ">=3.1.1"
 google-api-core = ">=2.25.0"
 google-api-python-client = ">=2.174.0"
 google-auth = ">=2.40.3"
 google-auth-oauthlib = ">=1.2.2"
 google-genai = ">=1.29.0,<2.0.0"
 groq = ">=0.30.0"
+html2text = ">=2025.4.15"
 httpx = ">=0.28.1"
-inquirerpy = ">=0.3.4"
-markdownify = ">=1.2.0"
 mcp = ">=1.10.1"
 ollama = ">=0.5.1"
 openai = ">=1.99.2,<2.0.0"
@@ -724,20 +720,16 @@ pypdf = ">=5.7.0"
 python-dotenv = ">=1.0.1"
 reportlab = ">=4.0.0"
 requests = ">=2.32.3"
-rich = ">=14.0.0"
 screeninfo = {version = ">=0.8.1", markers = "platform_system != \"darwin\""}
 typing-extensions = ">=4.12.2"
 uuid7 = ">=0.1.0"

 [package.extras]
-all = ["agentmail (==0.0.59)", "boto3 (>=1.38.45)", "botocore (>=1.37.23)", "imgcat (>=0.6.0)", "langchain-openai (>=0.3.26)", "oci (>=2.126.4)", "textual (>=3.2.0)"]
+all = ["agentmail (>=0.0.53)", "boto3 (>=1.38.45)", "botocore (>=1.37.23)", "click (>=8.1.8)", "imgcat (>=0.6.0)", "langchain-openai (>=0.3.26)", "rich (>=14.0.0)", "textual (>=3.2.0)"]
 aws = ["boto3 (>=1.38.45)"]
-cli = ["textual (>=3.2.0)"]
-cli-oci = ["oci (>=2.126.4)", "textual (>=3.2.0)"]
-code = ["matplotlib (>=3.9.0)", "numpy (>=2.3.2)", "pandas (>=2.2.0)", "tabulate (>=0.9.0)"]
-eval = ["anyio (>=4.9.0)", "datamodel-code-generator (>=0.26.0)", "lmnr[all] (==0.7.17)", "psutil (>=7.0.0)"]
-examples = ["agentmail (==0.0.59)", "botocore (>=1.37.23)", "imgcat (>=0.6.0)", "langchain-openai (>=0.3.26)"]
-oci = ["oci (>=2.126.4)"]
+cli = ["click (>=8.1.8)", "rich (>=14.0.0)", "textual (>=3.2.0)"]
+eval = ["anyio (>=4.9.0)", "browserbase (==1.4.0)", "datamodel-code-generator (>=0.26.0)", "hyperbrowser (==0.47.0)", "lmnr[all] (==0.7.10)", "psutil (>=7.0.0)"]
+examples = ["agentmail (>=0.0.53)", "botocore (>=1.37.23)", "imgcat (>=0.6.0)", "langchain-openai (>=0.3.26)"]
 video = ["imageio[ffmpeg] (>=2.37.0)", "numpy (>=2.3.2)"]

 [[package]]
@@ -3533,25 +3525,6 @@ files = [
    {file = "iniconfig-2.1.0.tar.gz", hash = "sha256:3abbd2e30b36733fee78f9c7f7308f2d0050e88f0087fd25c2645f63c773e1c7"},
 ]

-[[package]]
-name = "inquirerpy"
-version = "0.3.4"
-description = "Python port of Inquirer.js (A collection of common interactive command-line user interfaces)"
-optional = false
-python-versions = ">=3.7,<4.0"
-groups = ["main"]
-files = [
-    {file = "InquirerPy-0.3.4-py3-none-any.whl", hash = "sha256:c65fdfbac1fa00e3ee4fb10679f4d3ed7a012abf4833910e63c295827fe2a7d4"},
-    {file = "InquirerPy-0.3.4.tar.gz", hash = "sha256:89d2ada0111f337483cb41ae31073108b2ec1e618a49d7110b0d7ade89fc197e"},
-]
-
-[package.dependencies]
-pfzy = ">=0.3.1,<0.4.0"
-prompt-toolkit = ">=3.0.1,<4.0.0"
-
-[package.extras]
-docs = ["Sphinx (>=4.1.2,<5.0.0)", "furo (>=2021.8.17-beta.43,<2022.0.0)", "myst-parser (>=0.15.1,<0.16.0)", "sphinx-autobuild (>=2021.3.14,<2022.0.0)", "sphinx-copybutton (>=0.4.0,<0.5.0)"]
-
 [[package]]
 name = "installer"
 version = "0.7.0"
@@ -4607,62 +4580,6 @@ files = [
    {file = "llvmlite-0.44.0.tar.gz", hash = "sha256:07667d66a5d150abed9157ab6c0b9393c9356f229784a4385c02f99e94fc94d4"},
 ]

-[[package]]
-name = "lmnr"
-version = "0.7.20"
-description = "Python SDK for Laminar"
-optional = false
-python-versions = "<4,>=3.10"
-groups = ["main"]
-files = [
-    {file = "lmnr-0.7.20-py3-none-any.whl", hash = "sha256:5f9fa7444e6f96c25e097f66484ff29e632bdd1de0e9346948bf5595f4a8af38"},
-    {file = "lmnr-0.7.20.tar.gz", hash = "sha256:1f484cd618db2d71af65f90a0b8b36d20d80dc91a5138b811575c8677bf7c4fd"},
-]
-
-[package.dependencies]
-grpcio = ">=1"
-httpx = ">=0.24.0"
-opentelemetry-api = ">=1.33.0"
-opentelemetry-exporter-otlp-proto-grpc = ">=1.33.0"
-opentelemetry-exporter-otlp-proto-http = ">=1.33.0"
-opentelemetry-instrumentation = ">=0.54b0"
-opentelemetry-instrumentation-threading = ">=0.57b0"
-opentelemetry-sdk = ">=1.33.0"
-opentelemetry-semantic-conventions = ">=0.54b0"
-opentelemetry-semantic-conventions-ai = ">=0.4.13"
-orjson = ">=3.0.0"
-packaging = ">=22.0"
-pydantic = ">=2.0.3,<3.0.0"
-python-dotenv = ">=1.0"
-tenacity = ">=8.0"
-tqdm = ">=4.0"
-
-[package.extras]
-alephalpha = ["opentelemetry-instrumentation-alephalpha (>=0.47.1)"]
-all = ["opentelemetry-instrumentation-alephalpha (>=0.47.1)", "opentelemetry-instrumentation-bedrock (>=0.47.1)", "opentelemetry-instrumentation-chromadb (>=0.47.1)", "opentelemetry-instrumentation-cohere (>=0.47.1)", "opentelemetry-instrumentation-crewai (>=0.47.1)", "opentelemetry-instrumentation-haystack (>=0.47.1)", "opentelemetry-instrumentation-lancedb (>=0.47.1)", "opentelemetry-instrumentation-langchain (>=0.47.1)", "opentelemetry-instrumentation-llamaindex (>=0.47.1)", "opentelemetry-instrumentation-marqo (>=0.47.1)", "opentelemetry-instrumentation-mcp (>=0.47.1)", "opentelemetry-instrumentation-milvus (>=0.47.1)", "opentelemetry-instrumentation-mistralai (>=0.47.1)", "opentelemetry-instrumentation-ollama (>=0.47.1)", "opentelemetry-instrumentation-pinecone (>=0.47.1)", "opentelemetry-instrumentation-qdrant (>=0.47.1)", "opentelemetry-instrumentation-replicate (>=0.47.1)", "opentelemetry-instrumentation-sagemaker (>=0.47.1)", "opentelemetry-instrumentation-together (>=0.47.1)", "opentelemetry-instrumentation-transformers (>=0.47.1)", "opentelemetry-instrumentation-vertexai (>=0.47.1)", "opentelemetry-instrumentation-watsonx (>=0.47.1)", "opentelemetry-instrumentation-weaviate (>=0.47.1)"]
-bedrock = ["opentelemetry-instrumentation-bedrock (>=0.47.1)"]
-chromadb = ["opentelemetry-instrumentation-chromadb (>=0.47.1)"]
-cohere = ["opentelemetry-instrumentation-cohere (>=0.47.1)"]
-crewai = ["opentelemetry-instrumentation-crewai (>=0.47.1)"]
-haystack = ["opentelemetry-instrumentation-haystack (>=0.47.1)"]
-lancedb = ["opentelemetry-instrumentation-lancedb (>=0.47.1)"]
-langchain = ["opentelemetry-instrumentation-langchain (>=0.47.1)"]
-llamaindex = ["opentelemetry-instrumentation-llamaindex (>=0.47.1)"]
-marqo = ["opentelemetry-instrumentation-marqo (>=0.47.1)"]
-mcp = ["opentelemetry-instrumentation-mcp (>=0.47.1)"]
-milvus = ["opentelemetry-instrumentation-milvus (>=0.47.1)"]
-mistralai = ["opentelemetry-instrumentation-mistralai (>=0.47.1)"]
-ollama = ["opentelemetry-instrumentation-ollama (>=0.47.1)"]
-pinecone = ["opentelemetry-instrumentation-pinecone (>=0.47.1)"]
-qdrant = ["opentelemetry-instrumentation-qdrant (>=0.47.1)"]
-replicate = ["opentelemetry-instrumentation-replicate (>=0.47.1)"]
-sagemaker = ["opentelemetry-instrumentation-sagemaker (>=0.47.1)"]
-together = ["opentelemetry-instrumentation-together (>=0.47.1)"]
-transformers = ["opentelemetry-instrumentation-transformers (>=0.47.1)"]
-vertexai = ["opentelemetry-instrumentation-vertexai (>=0.47.1)"]
-watsonx = ["opentelemetry-instrumentation-watsonx (>=0.47.1)"]
-weaviate = ["opentelemetry-instrumentation-weaviate (>=0.47.1)"]
-
 [[package]]
 name = "lxml"
 version = "6.0.1"
@@ -5820,15 +5737,13 @@ llama = ["llama-index (>=0.12.29,<0.13.0)", "llama-index-core (>=0.12.29,<0.13.0

 [[package]]
 name = "openhands-agent-server"
-version = "1.3.0"
+version = "1.0.0a4"
 description = "OpenHands Agent Server - REST/WebSocket interface for OpenHands AI Agent"
 optional = false
 python-versions = ">=3.12"
 groups = ["main"]
-files = [
-    {file = "openhands_agent_server-1.3.0-py3-none-any.whl", hash = "sha256:2f87f790c740dc3fb81821c5f9fa375af875fbb937ebca3baa6dc5c035035b3c"},
-    {file = "openhands_agent_server-1.3.0.tar.gz", hash = "sha256:0a83ae77373f5c41d0ba0e22d8f0f6144d54d55784183a50b7c098c96cd5135c"},
-]
+files = []
+develop = false

 [package.dependencies]
 aiosqlite = ">=0.19"
@@ -5841,9 +5756,16 @@ uvicorn = ">=0.31.1"
 websockets = ">=12"
 wsproto = ">=1.2.0"

+[package.source]
+type = "git"
+url = "https://github.com/OpenHands/agent-sdk.git"
+reference = "3d8af53b2f0259dc98555a4acd4238f90e0afbce"
+resolved_reference = "3d8af53b2f0259dc98555a4acd4238f90e0afbce"
+subdirectory = "openhands-agent-server"
+
 [[package]]
 name = "openhands-ai"
-version = "0.62.0"
+version = "0.0.0-post.5456+15c207c40"
 description = "OpenHands: Code Less, Make More"
 optional = false
 python-versions = "^3.12,<3.14"
@@ -5860,7 +5782,6 @@ bashlex = "^0.18"
 boto3 = "*"
 browsergym-core = "0.13.3"
 deprecated = "*"
-deprecation = "^2.1.0"
 dirhash = "*"
 docker = "*"
 fastapi = "*"
@@ -5880,14 +5801,13 @@ jupyter_kernel_gateway = "*"
 kubernetes = "^33.1.0"
 libtmux = ">=0.46.2"
 litellm = ">=1.74.3, <1.78.0, !=1.64.4, !=1.67.*"
-lmnr = "^0.7.20"
 memory-profiler = "^0.61.0"
 numpy = "*"
 openai = "1.99.9"
 openhands-aci = "0.3.2"
-openhands-agent-server = "1.3.0"
-openhands-sdk = "1.3.0"
-openhands-tools = "1.3.0"
+openhands-agent-server = {git = "https://github.com/OpenHands/agent-sdk.git", rev = "3d8af53b2f0259dc98555a4acd4238f90e0afbce", subdirectory = "openhands-agent-server"}
+openhands-sdk = {git = "https://github.com/OpenHands/agent-sdk.git", rev = "3d8af53b2f0259dc98555a4acd4238f90e0afbce", subdirectory = "openhands-sdk"}
+openhands-tools = {git = "https://github.com/OpenHands/agent-sdk.git", rev = "3d8af53b2f0259dc98555a4acd4238f90e0afbce", subdirectory = "openhands-tools"}
 opentelemetry-api = "^1.33.1"
 opentelemetry-exporter-otlp-proto-grpc = "^1.33.1"
 pathspec = "^0.12.1"
@@ -5943,22 +5863,18 @@ url = ".."

 [[package]]
 name = "openhands-sdk"
-version = "1.3.0"
+version = "1.0.0a4"
 description = "OpenHands SDK - Core functionality for building AI agents"
 optional = false
 python-versions = ">=3.12"
 groups = ["main"]
-files = [
-    {file = "openhands_sdk-1.3.0-py3-none-any.whl", hash = "sha256:feee838346f8e60ea3e4d3391de7cb854314eb8b3c9e3dbbb56f98a784aadc56"},
-    {file = "openhands_sdk-1.3.0.tar.gz", hash = "sha256:2d060803a78de462121b56dea717a66356922deb02276f37b29fae8af66343fb"},
-]
+files = []
+develop = false

 [package.dependencies]
-deprecation = ">=2.1.0"
 fastmcp = ">=2.11.3"
 httpx = ">=0.27.0"
 litellm = ">=1.77.7.dev9"
-lmnr = ">=0.7.20"
 pydantic = ">=2.11.7"
 python-frontmatter = ">=1.1.0"
 python-json-logger = ">=3.3.0"
@@ -5968,28 +5884,40 @@ websockets = ">=12"
 [package.extras]
 boto3 = ["boto3 (>=1.35.0)"]

+[package.source]
+type = "git"
+url = "https://github.com/OpenHands/agent-sdk.git"
+reference = "3d8af53b2f0259dc98555a4acd4238f90e0afbce"
+resolved_reference = "3d8af53b2f0259dc98555a4acd4238f90e0afbce"
+subdirectory = "openhands-sdk"
+
 [[package]]
 name = "openhands-tools"
-version = "1.3.0"
+version = "1.0.0a4"
 description = "OpenHands Tools - Runtime tools for AI agents"
 optional = false
 python-versions = ">=3.12"
 groups = ["main"]
-files = [
-    {file = "openhands_tools-1.3.0-py3-none-any.whl", hash = "sha256:f31056d87c3058ac92709f9161c7c602daeee3ed0cb4439097b43cda105ed03e"},
-    {file = "openhands_tools-1.3.0.tar.gz", hash = "sha256:3da46f09e28593677d3e17252ce18584fcc13caab1a73213e66bd7edca2cebe0"},
-]
+files = []
+develop = false

 [package.dependencies]
 bashlex = ">=0.18"
 binaryornot = ">=0.4.4"
-browser-use = ">=0.8.0"
+browser-use = ">=0.7.7"
 cachetools = "*"
 func-timeout = ">=4.3.5"
 libtmux = ">=0.46.2"
 openhands-sdk = "*"
 pydantic = ">=2.11.7"

+[package.source]
+type = "git"
+url = "https://github.com/OpenHands/agent-sdk.git"
+reference = "3d8af53b2f0259dc98555a4acd4238f90e0afbce"
+resolved_reference = "3d8af53b2f0259dc98555a4acd4238f90e0afbce"
+subdirectory = "openhands-tools"
+
 [[package]]
 name = "openpyxl"
 version = "3.1.5"
@@ -6060,62 +5988,6 @@ opentelemetry-proto = "1.36.0"
 opentelemetry-sdk = ">=1.36.0,<1.37.0"
 typing-extensions = ">=4.6.0"

-[[package]]
-name = "opentelemetry-exporter-otlp-proto-http"
-version = "1.36.0"
-description = "OpenTelemetry Collector Protobuf over HTTP Exporter"
-optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "opentelemetry_exporter_otlp_proto_http-1.36.0-py3-none-any.whl", hash = "sha256:3d769f68e2267e7abe4527f70deb6f598f40be3ea34c6adc35789bea94a32902"},
-    {file = "opentelemetry_exporter_otlp_proto_http-1.36.0.tar.gz", hash = "sha256:dd3637f72f774b9fc9608ab1ac479f8b44d09b6fb5b2f3df68a24ad1da7d356e"},
-]
-
-[package.dependencies]
-googleapis-common-protos = ">=1.52,<2.0"
-opentelemetry-api = ">=1.15,<2.0"
-opentelemetry-exporter-otlp-proto-common = "1.36.0"
-opentelemetry-proto = "1.36.0"
-opentelemetry-sdk = ">=1.36.0,<1.37.0"
-requests = ">=2.7,<3.0"
-typing-extensions = ">=4.5.0"
-
-[[package]]
-name = "opentelemetry-instrumentation"
-version = "0.57b0"
-description = "Instrumentation Tools & Auto Instrumentation for OpenTelemetry Python"
-optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "opentelemetry_instrumentation-0.57b0-py3-none-any.whl", hash = "sha256:9109280f44882e07cec2850db28210b90600ae9110b42824d196de357cbddf7e"},
-    {file = "opentelemetry_instrumentation-0.57b0.tar.gz", hash = "sha256:f2a30135ba77cdea2b0e1df272f4163c154e978f57214795d72f40befd4fcf05"},
-]
-
-[package.dependencies]
-opentelemetry-api = ">=1.4,<2.0"
-opentelemetry-semantic-conventions = "0.57b0"
-packaging = ">=18.0"
-wrapt = ">=1.0.0,<2.0.0"
-
-[[package]]
-name = "opentelemetry-instrumentation-threading"
-version = "0.57b0"
-description = "Thread context propagation support for OpenTelemetry"
-optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "opentelemetry_instrumentation_threading-0.57b0-py3-none-any.whl", hash = "sha256:adfd64857c8c78d6111cf80552311e1713bad64272dd81abdd61f07b892a161b"},
-    {file = "opentelemetry_instrumentation_threading-0.57b0.tar.gz", hash = "sha256:06fa4c98d6bfe4670e7532497670ac202db42afa647ff770aedce0e422421c6e"},
-]
-
-[package.dependencies]
-opentelemetry-api = ">=1.12,<2.0"
-opentelemetry-instrumentation = "0.57b0"
-wrapt = ">=1.0.0,<2.0.0"
-
 [[package]]
 name = "opentelemetry-proto"
 version = "1.36.0"
@@ -6164,115 +6036,6 @@ files = [
 opentelemetry-api = "1.36.0"
 typing-extensions = ">=4.5.0"

-[[package]]
-name = "opentelemetry-semantic-conventions-ai"
-version = "0.4.13"
-description = "OpenTelemetry Semantic Conventions Extension for Large Language Models"
-optional = false
-python-versions = "<4,>=3.9"
-groups = ["main"]
-files = [
-    {file = "opentelemetry_semantic_conventions_ai-0.4.13-py3-none-any.whl", hash = "sha256:883a30a6bb5deaec0d646912b5f9f6dcbb9f6f72557b73d0f2560bf25d13e2d5"},
-    {file = "opentelemetry_semantic_conventions_ai-0.4.13.tar.gz", hash = "sha256:94efa9fb4ffac18c45f54a3a338ffeb7eedb7e1bb4d147786e77202e159f0036"},
-]
-
-[[package]]
-name = "orjson"
-version = "3.11.4"
-description = "Fast, correct Python JSON library supporting dataclasses, datetimes, and numpy"
-optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "orjson-3.11.4-cp310-cp310-macosx_10_15_x86_64.macosx_11_0_arm64.macosx_10_15_universal2.whl", hash = "sha256:e3aa2118a3ece0d25489cbe48498de8a5d580e42e8d9979f65bf47900a15aba1"},
-    {file = "orjson-3.11.4-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a69ab657a4e6733133a3dca82768f2f8b884043714e8d2b9ba9f52b6efef5c44"},
-    {file = "orjson-3.11.4-cp310-cp310-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:3740bffd9816fc0326ddc406098a3a8f387e42223f5f455f2a02a9f834ead80c"},
-    {file = "orjson-3.11.4-cp310-cp310-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:65fd2f5730b1bf7f350c6dc896173d3460d235c4be007af73986d7cd9a2acd23"},
-    {file = "orjson-3.11.4-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:9fdc3ae730541086158d549c97852e2eea6820665d4faf0f41bf99df41bc11ea"},
-    {file = "orjson-3.11.4-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:e10b4d65901da88845516ce9f7f9736f9638d19a1d483b3883dc0182e6e5edba"},
-    {file = "orjson-3.11.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fb6a03a678085f64b97f9d4a9ae69376ce91a3a9e9b56a82b1580d8e1d501aff"},
-    {file = "orjson-3.11.4-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:2c82e4f0b1c712477317434761fbc28b044c838b6b1240d895607441412371ac"},
-    {file = "orjson-3.11.4-cp310-cp310-musllinux_1_2_armv7l.whl", hash = "sha256:d58c166a18f44cc9e2bad03a327dc2d1a3d2e85b847133cfbafd6bfc6719bd79"},
-    {file = "orjson-3.11.4-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:94f206766bf1ea30e1382e4890f763bd1eefddc580e08fec1ccdc20ddd95c827"},
-    {file = "orjson-3.11.4-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:41bf25fb39a34cf8edb4398818523277ee7096689db352036a9e8437f2f3ee6b"},
-    {file = "orjson-3.11.4-cp310-cp310-win32.whl", hash = "sha256:fa9627eba4e82f99ca6d29bc967f09aba446ee2b5a1ea728949ede73d313f5d3"},
-    {file = "orjson-3.11.4-cp310-cp310-win_amd64.whl", hash = "sha256:23ef7abc7fca96632d8174ac115e668c1e931b8fe4dde586e92a500bf1914dcc"},
-    {file = "orjson-3.11.4-cp311-cp311-macosx_10_15_x86_64.macosx_11_0_arm64.macosx_10_15_universal2.whl", hash = "sha256:5e59d23cd93ada23ec59a96f215139753fbfe3a4d989549bcb390f8c00370b39"},
-    {file = "orjson-3.11.4-cp311-cp311-macosx_15_0_arm64.whl", hash = "sha256:5c3aedecfc1beb988c27c79d52ebefab93b6c3921dbec361167e6559aba2d36d"},
-    {file = "orjson-3.11.4-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:da9e5301f1c2caa2a9a4a303480d79c9ad73560b2e7761de742ab39fe59d9175"},
-    {file = "orjson-3.11.4-cp311-cp311-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:8873812c164a90a79f65368f8f96817e59e35d0cc02786a5356f0e2abed78040"},
-    {file = "orjson-3.11.4-cp311-cp311-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:5d7feb0741ebb15204e748f26c9638e6665a5fa93c37a2c73d64f1669b0ddc63"},
-    {file = "orjson-3.11.4-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:01ee5487fefee21e6910da4c2ee9eef005bee568a0879834df86f888d2ffbdd9"},
-    {file = "orjson-3.11.4-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:3d40d46f348c0321df01507f92b95a377240c4ec31985225a6668f10e2676f9a"},
-    {file = "orjson-3.11.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:95713e5fc8af84d8edc75b785d2386f653b63d62b16d681687746734b4dfc0be"},
-    {file = "orjson-3.11.4-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:ad73ede24f9083614d6c4ca9a85fe70e33be7bf047ec586ee2363bc7418fe4d7"},
-    {file = "orjson-3.11.4-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:842289889de515421f3f224ef9c1f1efb199a32d76d8d2ca2706fa8afe749549"},
-    {file = "orjson-3.11.4-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:3b2427ed5791619851c52a1261b45c233930977e7de8cf36de05636c708fa905"},
-    {file = "orjson-3.11.4-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:3c36e524af1d29982e9b190573677ea02781456b2e537d5840e4538a5ec41907"},
-    {file = "orjson-3.11.4-cp311-cp311-win32.whl", hash = "sha256:87255b88756eab4a68ec61837ca754e5d10fa8bc47dc57f75cedfeaec358d54c"},
-    {file = "orjson-3.11.4-cp311-cp311-win_amd64.whl", hash = "sha256:e2d5d5d798aba9a0e1fede8d853fa899ce2cb930ec0857365f700dffc2c7af6a"},
-    {file = "orjson-3.11.4-cp311-cp311-win_arm64.whl", hash = "sha256:6bb6bb41b14c95d4f2702bce9975fda4516f1db48e500102fc4d8119032ff045"},
-    {file = "orjson-3.11.4-cp312-cp312-macosx_10_15_x86_64.macosx_11_0_arm64.macosx_10_15_universal2.whl", hash = "sha256:d4371de39319d05d3f482f372720b841c841b52f5385bd99c61ed69d55d9ab50"},
-    {file = "orjson-3.11.4-cp312-cp312-macosx_15_0_arm64.whl", hash = "sha256:e41fd3b3cac850eaae78232f37325ed7d7436e11c471246b87b2cd294ec94853"},
-    {file = "orjson-3.11.4-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:600e0e9ca042878c7fdf189cf1b028fe2c1418cc9195f6cb9824eb6ed99cb938"},
-    {file = "orjson-3.11.4-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:7bbf9b333f1568ef5da42bc96e18bf30fd7f8d54e9ae066d711056add508e415"},
-    {file = "orjson-3.11.4-cp312-cp312-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:4806363144bb6e7297b8e95870e78d30a649fdc4e23fc84daa80c8ebd366ce44"},
-    {file = "orjson-3.11.4-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:ad355e8308493f527d41154e9053b86a5be892b3b359a5c6d5d95cda23601cb2"},
-    {file = "orjson-3.11.4-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:c8a7517482667fb9f0ff1b2f16fe5829296ed7a655d04d68cd9711a4d8a4e708"},
-    {file = "orjson-3.11.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:97eb5942c7395a171cbfecc4ef6701fc3c403e762194683772df4c54cfbb2210"},
-    {file = "orjson-3.11.4-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:149d95d5e018bdd822e3f38c103b1a7c91f88d38a88aada5c4e9b3a73a244241"},
-    {file = "orjson-3.11.4-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:624f3951181eb46fc47dea3d221554e98784c823e7069edb5dbd0dc826ac909b"},
-    {file = "orjson-3.11.4-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:03bfa548cf35e3f8b3a96c4e8e41f753c686ff3d8e182ce275b1751deddab58c"},
-    {file = "orjson-3.11.4-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:525021896afef44a68148f6ed8a8bf8375553d6066c7f48537657f64823565b9"},
-    {file = "orjson-3.11.4-cp312-cp312-win32.whl", hash = "sha256:b58430396687ce0f7d9eeb3dd47761ca7d8fda8e9eb92b3077a7a353a75efefa"},
-    {file = "orjson-3.11.4-cp312-cp312-win_amd64.whl", hash = "sha256:c6dbf422894e1e3c80a177133c0dda260f81428f9de16d61041949f6a2e5c140"},
-    {file = "orjson-3.11.4-cp312-cp312-win_arm64.whl", hash = "sha256:d38d2bc06d6415852224fcc9c0bfa834c25431e466dc319f0edd56cca81aa96e"},
-    {file = "orjson-3.11.4-cp313-cp313-macosx_10_15_x86_64.macosx_11_0_arm64.macosx_10_15_universal2.whl", hash = "sha256:2d6737d0e616a6e053c8b4acc9eccea6b6cce078533666f32d140e4f85002534"},
-    {file = "orjson-3.11.4-cp313-cp313-macosx_15_0_arm64.whl", hash = "sha256:afb14052690aa328cc118a8e09f07c651d301a72e44920b887c519b313d892ff"},
-    {file = "orjson-3.11.4-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:38aa9e65c591febb1b0aed8da4d469eba239d434c218562df179885c94e1a3ad"},
-    {file = "orjson-3.11.4-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:f2cf4dfaf9163b0728d061bebc1e08631875c51cd30bf47cb9e3293bfbd7dcd5"},
-    {file = "orjson-3.11.4-cp313-cp313-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:89216ff3dfdde0e4070932e126320a1752c9d9a758d6a32ec54b3b9334991a6a"},
-    {file = "orjson-3.11.4-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:9daa26ca8e97fae0ce8aa5d80606ef8f7914e9b129b6b5df9104266f764ce436"},
-    {file = "orjson-3.11.4-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:5c8b2769dc31883c44a9cd126560327767f848eb95f99c36c9932f51090bfce9"},
-    {file = "orjson-3.11.4-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1469d254b9884f984026bd9b0fa5bbab477a4bfe558bba6848086f6d43eb5e73"},
-    {file = "orjson-3.11.4-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:68e44722541983614e37117209a194e8c3ad07838ccb3127d96863c95ec7f1e0"},
-    {file = "orjson-3.11.4-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:8e7805fda9672c12be2f22ae124dcd7b03928d6c197544fe12174b86553f3196"},
-    {file = "orjson-3.11.4-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:04b69c14615fb4434ab867bf6f38b2d649f6f300af30a6705397e895f7aec67a"},
-    {file = "orjson-3.11.4-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:639c3735b8ae7f970066930e58cf0ed39a852d417c24acd4a25fc0b3da3c39a6"},
-    {file = "orjson-3.11.4-cp313-cp313-win32.whl", hash = "sha256:6c13879c0d2964335491463302a6ca5ad98105fc5db3565499dcb80b1b4bd839"},
-    {file = "orjson-3.11.4-cp313-cp313-win_amd64.whl", hash = "sha256:09bf242a4af98732db9f9a1ec57ca2604848e16f132e3f72edfd3c5c96de009a"},
-    {file = "orjson-3.11.4-cp313-cp313-win_arm64.whl", hash = "sha256:a85f0adf63319d6c1ba06fb0dbf997fced64a01179cf17939a6caca662bf92de"},
-    {file = "orjson-3.11.4-cp314-cp314-macosx_10_15_x86_64.macosx_11_0_arm64.macosx_10_15_universal2.whl", hash = "sha256:42d43a1f552be1a112af0b21c10a5f553983c2a0938d2bbb8ecd8bc9fb572803"},
-    {file = "orjson-3.11.4-cp314-cp314-macosx_15_0_arm64.whl", hash = "sha256:26a20f3fbc6c7ff2cb8e89c4c5897762c9d88cf37330c6a117312365d6781d54"},
-    {file = "orjson-3.11.4-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:6e3f20be9048941c7ffa8fc523ccbd17f82e24df1549d1d1fe9317712d19938e"},
-    {file = "orjson-3.11.4-cp314-cp314-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:aac364c758dc87a52e68e349924d7e4ded348dedff553889e4d9f22f74785316"},
-    {file = "orjson-3.11.4-cp314-cp314-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:d5c54a6d76e3d741dcc3f2707f8eeb9ba2a791d3adbf18f900219b62942803b1"},
-    {file = "orjson-3.11.4-cp314-cp314-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:f28485bdca8617b79d44627f5fb04336897041dfd9fa66d383a49d09d86798bc"},
-    {file = "orjson-3.11.4-cp314-cp314-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:bfc2a484cad3585e4ba61985a6062a4c2ed5c7925db6d39f1fa267c9d166487f"},
-    {file = "orjson-3.11.4-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e34dbd508cb91c54f9c9788923daca129fe5b55c5b4eebe713bf5ed3791280cf"},
-    {file = "orjson-3.11.4-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:b13c478fa413d4b4ee606ec8e11c3b2e52683a640b006bb586b3041c2ca5f606"},
-    {file = "orjson-3.11.4-cp314-cp314-musllinux_1_2_armv7l.whl", hash = "sha256:724ca721ecc8a831b319dcd72cfa370cc380db0bf94537f08f7edd0a7d4e1780"},
-    {file = "orjson-3.11.4-cp314-cp314-musllinux_1_2_i686.whl", hash = "sha256:977c393f2e44845ce1b540e19a786e9643221b3323dae190668a98672d43fb23"},
-    {file = "orjson-3.11.4-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:1e539e382cf46edec157ad66b0b0872a90d829a6b71f17cb633d6c160a223155"},
-    {file = "orjson-3.11.4-cp314-cp314-win32.whl", hash = "sha256:d63076d625babab9db5e7836118bdfa086e60f37d8a174194ae720161eb12394"},
-    {file = "orjson-3.11.4-cp314-cp314-win_amd64.whl", hash = "sha256:0a54d6635fa3aaa438ae32e8570b9f0de36f3f6562c308d2a2a452e8b0592db1"},
-    {file = "orjson-3.11.4-cp314-cp314-win_arm64.whl", hash = "sha256:78b999999039db3cf58f6d230f524f04f75f129ba3d1ca2ed121f8657e575d3d"},
-    {file = "orjson-3.11.4-cp39-cp39-macosx_10_15_x86_64.macosx_11_0_arm64.macosx_10_15_universal2.whl", hash = "sha256:405261b0a8c62bcbd8e2931c26fdc08714faf7025f45531541e2b29e544b545b"},
-    {file = "orjson-3.11.4-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:af02ff34059ee9199a3546f123a6ab4c86caf1708c79042caf0820dc290a6d4f"},
-    {file = "orjson-3.11.4-cp39-cp39-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:0b2eba969ea4203c177c7b38b36c69519e6067ee68c34dc37081fac74c796e10"},
-    {file = "orjson-3.11.4-cp39-cp39-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:0baa0ea43cfa5b008a28d3c07705cf3ada40e5d347f0f44994a64b1b7b4b5350"},
-    {file = "orjson-3.11.4-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:80fd082f5dcc0e94657c144f1b2a3a6479c44ad50be216cf0c244e567f5eae19"},
-    {file = "orjson-3.11.4-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:1e3704d35e47d5bee811fb1cbd8599f0b4009b14d451c4c57be5a7e25eb89a13"},
-    {file = "orjson-3.11.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:caa447f2b5356779d914658519c874cf3b7629e99e63391ed519c28c8aea4919"},
-    {file = "orjson-3.11.4-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:bba5118143373a86f91dadb8df41d9457498226698ebdf8e11cbb54d5b0e802d"},
-    {file = "orjson-3.11.4-cp39-cp39-musllinux_1_2_armv7l.whl", hash = "sha256:622463ab81d19ef3e06868b576551587de8e4d518892d1afab71e0fbc1f9cffc"},
-    {file = "orjson-3.11.4-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:3e0a700c4b82144b72946b6629968df9762552ee1344bfdb767fecdd634fbd5a"},
-    {file = "orjson-3.11.4-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:6e18a5c15e764e5f3fc569b47872450b4bcea24f2a6354c0a0e95ad21045d5a9"},
-    {file = "orjson-3.11.4-cp39-cp39-win32.whl", hash = "sha256:fb1c37c71cad991ef4d89c7a634b5ffb4447dbd7ae3ae13e8f5ee7f1775e7ab1"},
-    {file = "orjson-3.11.4-cp39-cp39-win_amd64.whl", hash = "sha256:e2985ce8b8c42d00492d0ed79f2bd2b6460d00f2fa671dfde4bf2e02f49bf5c6"},
-    {file = "orjson-3.11.4.tar.gz", hash = "sha256:39485f4ab4c9b30a3943cfe99e1a213c4776fb69e8abd68f66b83d5a0b0fdc6d"},
-]
-
 [[package]]
 name = "packaging"
 version = "25.0"
@@ -6489,21 +6252,6 @@ files = [
 [package.dependencies]
 ptyprocess = ">=0.5"

-[[package]]
-name = "pfzy"
-version = "0.3.4"
-description = "Python port of the fzy fuzzy string matching algorithm"
-optional = false
-python-versions = ">=3.7,<4.0"
-groups = ["main"]
-files = [
-    {file = "pfzy-0.3.4-py3-none-any.whl", hash = "sha256:5f50d5b2b3207fa72e7ec0ef08372ef652685470974a107d0d4999fc5a903a96"},
-    {file = "pfzy-0.3.4.tar.gz", hash = "sha256:717ea765dd10b63618e7298b2d98efd819e0b30cd5905c9707223dceeb94b3f1"},
-]
-
-[package.extras]
-docs = ["Sphinx (>=4.1.2,<5.0.0)", "furo (>=2021.8.17-beta.43,<2022.0.0)", "myst-parser (>=0.15.1,<0.16.0)", "sphinx-autobuild (>=2021.3.14,<2022.0.0)", "sphinx-copybutton (>=0.4.0,<0.5.0)"]
-
 [[package]]
 name = "pg8000"
 version = "1.31.5"
--- a/enterprise/server/auth/constants.py
+++ b/enterprise/server/auth/constants.py
@@ -30,11 +30,3 @@ JIRA_DC_CLIENT_SECRET = os.getenv('JIRA_DC_CLIENT_SECRET', '').strip()
 JIRA_DC_BASE_URL = os.getenv('JIRA_DC_BASE_URL', '').strip()
 JIRA_DC_ENABLE_OAUTH = os.getenv('JIRA_DC_ENABLE_OAUTH', '1') in ('1', 'true')
 AUTH_URL = os.getenv('AUTH_URL', '').rstrip('/')
-ROLE_CHECK_ENABLED = os.getenv('ROLE_CHECK_ENABLED', 'false').lower() in (
-    '1',
-    'true',
-    't',
-    'yes',
-    'y',
-    'on',
-)
--- a/enterprise/server/constants.py
+++ b/enterprise/server/constants.py
@@ -50,7 +50,7 @@ SUBSCRIPTION_PRICE_DATA = {
    },
 }

-DEFAULT_INITIAL_BUDGET = float(os.environ.get('DEFAULT_INITIAL_BUDGET', '10'))
+DEFAULT_INITIAL_BUDGET = float(os.environ.get('DEFAULT_INITIAL_BUDGET', '20'))
 STRIPE_API_KEY = os.environ.get('STRIPE_API_KEY', None)
 STRIPE_WEBHOOK_SECRET = os.environ.get('STRIPE_WEBHOOK_SECRET', None)
 REQUIRE_PAYMENT = os.environ.get('REQUIRE_PAYMENT', '0') in ('1', 'true')
--- a/enterprise/server/routes/auth.py
+++ b/enterprise/server/routes/auth.py
@@ -12,7 +12,6 @@ from server.auth.constants import (
    KEYCLOAK_CLIENT_ID,
    KEYCLOAK_REALM_NAME,
    KEYCLOAK_SERVER_URL_EXT,
-    ROLE_CHECK_ENABLED,
 )
 from server.auth.gitlab_sync import schedule_gitlab_repo_sync
 from server.auth.saas_user_auth import SaasUserAuth
@@ -133,12 +132,6 @@ async def keycloak_callback(

    user_info = await token_manager.get_user_info(keycloak_access_token)
    logger.debug(f'user_info: {user_info}')
-    if ROLE_CHECK_ENABLED and 'roles' not in user_info:
-        return JSONResponse(
-            status_code=status.HTTP_401_UNAUTHORIZED,
-            content={'error': 'Missing required role'},
-        )
-
    if 'sub' not in user_info or 'preferred_username' not in user_info:
        return JSONResponse(
            status_code=status.HTTP_400_BAD_REQUEST,
--- a/enterprise/storage/saas_conversation_store.py
+++ b/enterprise/storage/saas_conversation_store.py
@@ -35,7 +35,6 @@ class SaasConversationStore(ConversationStore):
            session.query(StoredConversationMetadata)
            .filter(StoredConversationMetadata.user_id == self.user_id)
            .filter(StoredConversationMetadata.conversation_id == conversation_id)
-            .filter(StoredConversationMetadata.conversation_version == 'V0')
        )

    def _to_external_model(self, conversation_metadata: StoredConversationMetadata):
@@ -60,7 +59,6 @@ class SaasConversationStore(ConversationStore):
        kwargs.pop('reasoning_tokens', None)
        kwargs.pop('context_window', None)
        kwargs.pop('per_turn_token', None)
-        kwargs.pop('parent_conversation_id', None)

        return ConversationMetadata(**kwargs)

@@ -125,7 +123,6 @@ class SaasConversationStore(ConversationStore):
                conversations = (
                    session.query(StoredConversationMetadata)
                    .filter(StoredConversationMetadata.user_id == self.user_id)
-                    .filter(StoredConversationMetadata.conversation_version == 'V0')
                    .order_by(StoredConversationMetadata.created_at.desc())
                    .offset(offset)
                    .limit(limit + 1)
--- a/enterprise/storage/saas_settings_store.py
+++ b/enterprise/storage/saas_settings_store.py
@@ -97,10 +97,6 @@ class SaasSettingsStore(SettingsStore):
            return settings

    async def store(self, item: Settings):
-        # Check if provider is OpenHands and generate API key if needed
-        if item and self._is_openhands_provider(item):
-            await self._ensure_openhands_api_key(item)
-
        with self.session_maker() as session:
            existing = None
            kwargs = {}
@@ -372,30 +368,6 @@ class SaasSettingsStore(SettingsStore):
    def _should_encrypt(self, key: str) -> bool:
        return key in ('llm_api_key', 'llm_api_key_for_byor', 'search_api_key')

-    def _is_openhands_provider(self, item: Settings) -> bool:
-        """Check if the settings use the OpenHands provider."""
-        return bool(item.llm_model and item.llm_model.startswith('openhands/'))
-
-    async def _ensure_openhands_api_key(self, item: Settings) -> None:
-        """Generate and set the OpenHands API key for the given settings.
-
-        First checks if an existing key with the OpenHands alias exists,
-        and reuses it if found. Otherwise, generates a new key.
-        """
-        # Generate new key if none exists
-        generated_key = await self._generate_openhands_key()
-        if generated_key:
-            item.llm_api_key = SecretStr(generated_key)
-            logger.info(
-                'saas_settings_store:store:generated_openhands_key',
-                extra={'user_id': self.user_id},
-            )
-        else:
-            logger.warning(
-                'saas_settings_store:store:failed_to_generate_openhands_key',
-                extra={'user_id': self.user_id},
-            )
-
    async def _create_user_in_lite_llm(
        self, client: httpx.AsyncClient, email: str | None, max_budget: int, spend: int
    ):
@@ -418,55 +390,3 @@ class SaasSettingsStore(SettingsStore):
            },
        )
        return response
-
-    async def _generate_openhands_key(self) -> str | None:
-        """Generate a new OpenHands provider key for a user."""
-        if not (LITE_LLM_API_KEY and LITE_LLM_API_URL):
-            logger.warning(
-                'saas_settings_store:_generate_openhands_key:litellm_config_not_found',
-                extra={'user_id': self.user_id},
-            )
-            return None
-
-        try:
-            async with httpx.AsyncClient(
-                verify=httpx_verify_option(),
-                headers={
-                    'x-goog-api-key': LITE_LLM_API_KEY,
-                },
-            ) as client:
-                response = await client.post(
-                    f'{LITE_LLM_API_URL}/key/generate',
-                    json={
-                        'user_id': self.user_id,
-                        'metadata': {'type': 'openhands'},
-                    },
-                )
-                response.raise_for_status()
-                response_json = response.json()
-                key = response_json.get('key')
-
-                if key:
-                    logger.info(
-                        'saas_settings_store:_generate_openhands_key:success',
-                        extra={
-                            'user_id': self.user_id,
-                            'key_length': len(key) if key else 0,
-                            'key_prefix': (
-                                key[:10] + '...' if key and len(key) > 10 else key
-                            ),
-                        },
-                    )
-                    return key
-                else:
-                    logger.error(
-                        'saas_settings_store:_generate_openhands_key:no_key_in_response',
-                        extra={'user_id': self.user_id, 'response_json': response_json},
-                    )
-                    return None
-        except Exception as e:
-            logger.exception(
-                'saas_settings_store:_generate_openhands_key:error',
-                extra={'user_id': self.user_id, 'error': str(e)},
-            )
-            return None
--- a/enterprise/storage/user_settings.py
+++ b/enterprise/storage/user_settings.py
@@ -38,4 +38,3 @@ class UserSettings(Base):  # type: ignore
    email_verified = Column(Boolean, nullable=True)
    git_user_name = Column(String, nullable=True)
    git_user_email = Column(String, nullable=True)
-    v1_enabled = Column(Boolean, nullable=True)
--- a/enterprise/tests/unit/experiments/test_saas_experiment_manager.py
+++ b/enterprise/tests/unit/experiments/test_saas_experiment_manager.py
@@ -92,8 +92,11 @@ def test_unknown_variant_returns_original_agent_without_changes(monkeypatch):
    assert getattr(result, 'condenser', None) is None


+@patch('experiments.experiment_manager.handle_condenser_max_step_experiment__v1')
@patch('experiments.experiment_manager.ENABLE_EXPERIMENT_MANAGER', False)
-def test_run_agent_variant_tests_v1_noop_when_manager_disabled():
+def test_run_agent_variant_tests_v1_noop_when_manager_disabled(
+    mock_handle_condenser,
+):
    """If ENABLE_EXPERIMENT_MANAGER is False, the method returns the exact same agent and does not call the handler."""
    agent = make_agent()
    conv_id = uuid4()
@@ -106,6 +109,8 @@ def test_run_agent_variant_tests_v1_noop_when_manager_disabled():

    # Same object returned (no copy)
    assert result is agent
+    # Handler should not have been called
+    mock_handle_condenser.assert_not_called()


@patch('experiments.experiment_manager.ENABLE_EXPERIMENT_MANAGER', True)
@@ -126,3 +131,7 @@ def test_run_agent_variant_tests_v1_calls_handler_and_sets_system_prompt(monkeyp
    # Should be a different instance than the original (copied after handler runs)
    assert result is not agent
    assert result.system_prompt_filename == 'system_prompt_long_horizon.j2'
+
+    # The condenser returned by the handler must be preserved after the system-prompt override copy
+    assert isinstance(result.condenser, LLMSummarizingCondenser)
+    assert result.condenser.max_size == 80
--- a/enterprise/tests/unit/test_github_view.py
+++ b/enterprise/tests/unit/test_github_view.py
@@ -1,9 +1,7 @@
 from unittest import TestCase, mock
-from unittest.mock import MagicMock, patch

-from integrations.github.github_view import GithubFactory, GithubIssue, get_oh_labels
+from integrations.github.github_view import GithubFactory, get_oh_labels
 from integrations.models import Message, SourceType
-from integrations.types import UserData


 class TestGithubLabels(TestCase):
@@ -77,128 +75,3 @@ class TestGithubCommentCaseInsensitivity(TestCase):
        self.assertTrue(GithubFactory.is_issue_comment(message_lower))
        self.assertTrue(GithubFactory.is_issue_comment(message_upper))
        self.assertTrue(GithubFactory.is_issue_comment(message_mixed))
-
-
-class TestGithubV1ConversationRouting(TestCase):
-    """Test V1 conversation routing logic in GitHub integration."""
-
-    def setUp(self):
-        """Set up test fixtures."""
-        # Create a proper UserData instance instead of MagicMock
-        user_data = UserData(
-            user_id=123, username='testuser', keycloak_user_id='test-keycloak-id'
-        )
-
-        # Create a mock raw_payload
-        raw_payload = Message(
-            source=SourceType.GITHUB,
-            message={
-                'payload': {
-                    'action': 'opened',
-                    'issue': {'number': 123},
-                }
-            },
-        )
-
-        self.github_issue = GithubIssue(
-            user_info=user_data,
-            full_repo_name='test/repo',
-            issue_number=123,
-            installation_id=456,
-            conversation_id='test-conversation-id',
-            should_extract=True,
-            send_summary_instruction=False,
-            is_public_repo=True,
-            raw_payload=raw_payload,
-            uuid='test-uuid',
-            title='Test Issue',
-            description='Test issue description',
-            previous_comments=[],
-        )
-
-    @patch('integrations.github.github_view.get_user_v1_enabled_setting')
-    @patch.object(GithubIssue, '_create_v0_conversation')
-    @patch.object(GithubIssue, '_create_v1_conversation')
-    async def test_create_new_conversation_routes_to_v0_when_disabled(
-        self, mock_create_v1, mock_create_v0, mock_get_v1_setting
-    ):
-        """Test that conversation creation routes to V0 when v1_enabled is False."""
-        # Mock v1_enabled as False
-        mock_get_v1_setting.return_value = False
-        mock_create_v0.return_value = None
-        mock_create_v1.return_value = None
-
-        # Mock parameters
-        jinja_env = MagicMock()
-        git_provider_tokens = MagicMock()
-        conversation_metadata = MagicMock()
-
-        # Call the method
-        await self.github_issue.create_new_conversation(
-            jinja_env, git_provider_tokens, conversation_metadata
-        )
-
-        # Verify V0 was called and V1 was not
-        mock_create_v0.assert_called_once_with(
-            jinja_env, git_provider_tokens, conversation_metadata
-        )
-        mock_create_v1.assert_not_called()
-
-    @patch('integrations.github.github_view.get_user_v1_enabled_setting')
-    @patch.object(GithubIssue, '_create_v0_conversation')
-    @patch.object(GithubIssue, '_create_v1_conversation')
-    async def test_create_new_conversation_routes_to_v1_when_enabled(
-        self, mock_create_v1, mock_create_v0, mock_get_v1_setting
-    ):
-        """Test that conversation creation routes to V1 when v1_enabled is True."""
-        # Mock v1_enabled as True
-        mock_get_v1_setting.return_value = True
-        mock_create_v0.return_value = None
-        mock_create_v1.return_value = None
-
-        # Mock parameters
-        jinja_env = MagicMock()
-        git_provider_tokens = MagicMock()
-        conversation_metadata = MagicMock()
-
-        # Call the method
-        await self.github_issue.create_new_conversation(
-            jinja_env, git_provider_tokens, conversation_metadata
-        )
-
-        # Verify V1 was called and V0 was not
-        mock_create_v1.assert_called_once_with(
-            jinja_env, git_provider_tokens, conversation_metadata
-        )
-        mock_create_v0.assert_not_called()
-
-    @patch('integrations.github.github_view.get_user_v1_enabled_setting')
-    @patch.object(GithubIssue, '_create_v0_conversation')
-    @patch.object(GithubIssue, '_create_v1_conversation')
-    async def test_create_new_conversation_fallback_on_v1_setting_error(
-        self, mock_create_v1, mock_create_v0, mock_get_v1_setting
-    ):
-        """Test that conversation creation falls back to V0 when _create_v1_conversation fails."""
-        # Mock v1_enabled as True so V1 is attempted
-        mock_get_v1_setting.return_value = True
-        # Mock _create_v1_conversation to raise an exception
-        mock_create_v1.side_effect = Exception('V1 conversation creation failed')
-        mock_create_v0.return_value = None
-
-        # Mock parameters
-        jinja_env = MagicMock()
-        git_provider_tokens = MagicMock()
-        conversation_metadata = MagicMock()
-
-        # Call the method
-        await self.github_issue.create_new_conversation(
-            jinja_env, git_provider_tokens, conversation_metadata
-        )
-
-        # Verify V1 was attempted first, then V0 was called as fallback
-        mock_create_v1.assert_called_once_with(
-            jinja_env, git_provider_tokens, conversation_metadata
-        )
-        mock_create_v0.assert_called_once_with(
-            jinja_env, git_provider_tokens, conversation_metadata
-        )
--- a/enterprise/tests/unit/test_saas_settings_store.py
+++ b/enterprise/tests/unit/test_saas_settings_store.py
@@ -243,7 +243,7 @@ async def test_update_settings_with_litellm_default(
    # Check that the URL and most of the JSON payload match what we expect
    assert call_args['json']['user_email'] == 'testy@tester.com'
    assert call_args['json']['models'] == []
-    assert call_args['json']['max_budget'] == 10.0
+    assert call_args['json']['max_budget'] == 20.0
    assert call_args['json']['user_id'] == 'user-id'
    assert call_args['json']['teams'] == ['test_team']
    assert call_args['json']['auto_create_key'] is True
--- a/evaluation/benchmarks/multi_swe_bench/README.md
+++ b/evaluation/benchmarks/multi_swe_bench/README.md
@@ -15,7 +15,7 @@ python evaluation/benchmarks/multi_swe_bench/scripts/data/data_change.py

 ## Docker image download

-Please download the multi-swe-bench docker images from [here](https://github.com/multi-swe-bench/multi-swe-bench?tab=readme-ov-file#run-evaluation).
+Please download the multi-swe-bench dokcer images from [here](https://github.com/multi-swe-bench/multi-swe-bench?tab=readme-ov-file#run-evaluation).

 ## Generate patch

@@ -47,7 +47,7 @@ For debugging purposes, you can set `export EVAL_SKIP_MAXIMUM_RETRIES_EXCEEDED=t

 The results will be generated in evaluation/evaluation_outputs/outputs/XXX/CodeActAgent/YYY/output.jsonl, you can refer to the [example](examples/output.jsonl).

-## Running evaluation
+## Runing evaluation

 First, install [multi-swe-bench](https://github.com/multi-swe-bench/multi-swe-bench).

--- a/evaluation/benchmarks/swefficiency/README.md
+++ b/evaluation/benchmarks/swefficiency/README.md
@@ -1,65 +0,0 @@
-# SWE-fficiency Evaluation
-
-This folder contains the OpenHands inference generation of the [SWE-fficiency benchmark](https://swefficiency.com/) ([paper](https://arxiv.org/pdf/2507.12415v1)).
-
-The evaluation consists of three steps:
-
-1. Environment setup: [install python environment](../../README.md#development-environment) and [configure LLM config](../../README.md#configure-openhands-and-your-llm).
-2. [Run inference](#running-inference-locally-with-docker): Generate an edit patch for each GitHub issue
-3. [Evaluate patches](#evaluate-generated-patches)
-
-## Setup Environment and LLM Configuration
-
-Please follow the instructions [here](../../README.md#setup) to set up your local development environment and LLM.
-
-## Running inference Locally with Docker
-
-Make sure your Docker daemon is running, and you have ample disk space (at least 200-500GB, depending on the SWE-Perf set you are running on) for the instance-level docker image.
-
-When the `run_infer.sh` script is started, it will automatically pull the relevant SWE-Perf images.
-For example, for instance ID `scikit-learn_scikit-learn-11674`, it will try to pull our pre-built docker image `betty1202/sweb.eval.x86_64.scikit-learn_s_scikit-learn-11674` from DockerHub.
-This image will be used to create an OpenHands runtime image in which the agent will operate.
-
-```bash
-./evaluation/benchmarks/swefficiency/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split] [n_runs] [mode]
-
-# Example
-./evaluation/benchmarks/swefficiency/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 500 100 1 swefficiency/swefficiency test
-```
-
-where `model_config` is mandatory, and the rest are optional.
-
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
-LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would
-like to evaluate. It could also be a release tag like `0.6.2`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
-to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By
-default, the script evaluates the entire SWE-Perf test set (140 issues). Note:
-in order to use `eval_limit`, you must also set `agent`.
- `max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By
-default, it is set to 100.
- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By
-default, it is set to 1.
- `dataset`, a huggingface dataset name. e.g. `SWE-Perf/SWE-Perf`, specifies which dataset to evaluate on.
- `dataset_split`, split for the huggingface dataset. e.g., `test`, `dev`. Default to `test`.
-
- `n_runs`, e.g. `3`, is the number of times to run the evaluation. Default is 1.
- `mode`, e.g. `swt`, `swt-ci`, or `swe`, specifies the evaluation mode. Default is `swe`.
-
-> [!CAUTION]
-> Setting `num_workers` larger than 1 is not officially tested, YMMV.
-
-
-Let's say you'd like to run 10 instances using `llm.eval_gpt4_1106_preview` and CodeActAgent,
-
-then your command would be:
-
-```bash
-./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 10
-```
-
-### 2. Run the SWE-fficiency benchmark official evaluation
-
-Once the output is converted, use the [official SWE-fficiency benchmark evaluation](https://github.com/swefficiency/swefficiency) to evaluate it.
--- a/evaluation/benchmarks/swefficiency/binary_patch_utils.py
+++ b/evaluation/benchmarks/swefficiency/binary_patch_utils.py
@@ -1,52 +0,0 @@
-"""
-Utilities for handling binary files and patch generation in SWE-bench evaluation.
-"""
-
-
-def remove_binary_diffs(patch_text):
-    """
-    Remove binary file diffs from a git patch.
-
-    Args:
-        patch_text (str): The git patch text
-
-    Returns:
-        str: The cleaned patch text with binary diffs removed
-    """
-    lines = patch_text.splitlines()
-    cleaned_lines = []
-    block = []
-    is_binary_block = False
-
-    for line in lines:
-        if line.startswith('diff --git '):
-            if block and not is_binary_block:
-                cleaned_lines.extend(block)
-            block = [line]
-            is_binary_block = False
-        elif 'Binary files' in line:
-            is_binary_block = True
-            block.append(line)
-        else:
-            block.append(line)
-
-    if block and not is_binary_block:
-        cleaned_lines.extend(block)
-    return '\n'.join(cleaned_lines)
-
-
-def remove_binary_files_from_git():
-    """
-    Generate a bash command to remove binary files from git staging.
-
-    Returns:
-        str: A bash command that removes binary files from git staging
-    """
-    return """
-    for file in $(git status --porcelain | grep -E "^(M| M|\\?\\?|A| A)" | cut -c4-); do
-        if [ -f "$file" ] && (file "$file" | grep -q "executable" || git check-attr binary "$file" | grep -q "binary: set"); then
-            git rm -f "$file" 2>/dev/null || rm -f "$file"
-            echo "Removed: $file"
-        fi
-    done
-    """.strip()
--- a/evaluation/benchmarks/swefficiency/run_infer.py
+++ b/evaluation/benchmarks/swefficiency/run_infer.py
@@ -1,960 +0,0 @@
-import asyncio
-import copy
-import functools
-import json
-import multiprocessing
-import os
-import tempfile
-from typing import Any, Literal
-
-import pandas as pd
-import toml
-from datasets import load_dataset
-
-import openhands.agenthub
-from evaluation.benchmarks.swe_bench.binary_patch_utils import (
-    remove_binary_diffs,
-    remove_binary_files_from_git,
-)
-from evaluation.utils.shared import (
-    EvalException,
-    EvalMetadata,
-    EvalOutput,
-    assert_and_raise,
-    codeact_user_response,
-    get_default_sandbox_config_for_eval,
-    get_metrics,
-    is_fatal_evaluation_error,
-    make_metadata,
-    prepare_dataset,
-    reset_logger_for_multiprocessing,
-    run_evaluation,
-    update_llm_config_for_completions_logging,
-)
-from openhands.controller.state.state import State
-from openhands.core.config import (
-    AgentConfig,
-    OpenHandsConfig,
-    get_evaluation_parser,
-    get_llm_config_arg,
-)
-from openhands.core.config.condenser_config import NoOpCondenserConfig
-from openhands.core.config.utils import get_condenser_config_arg
-from openhands.core.logger import openhands_logger as logger
-from openhands.core.main import create_runtime, run_controller
-from openhands.critic import AgentFinishedCritic
-from openhands.events.action import CmdRunAction, FileReadAction, MessageAction
-from openhands.events.observation import (
-    CmdOutputObservation,
-    ErrorObservation,
-    FileReadObservation,
-)
-from openhands.events.serialization.event import event_from_dict, event_to_dict
-from openhands.runtime.base import Runtime
-from openhands.utils.async_utils import call_async_from_sync
-from openhands.utils.shutdown_listener import sleep_if_should_continue
-
-USE_HINT_TEXT = os.environ.get('USE_HINT_TEXT', 'false').lower() == 'true'
-RUN_WITH_BROWSING = os.environ.get('RUN_WITH_BROWSING', 'false').lower() == 'true'
-BenchMode = Literal['swe', 'swt', 'swt-ci']
-
-
-AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
-    'CodeActAgent': codeact_user_response,
-}
-
-
-def _get_swebench_workspace_dir_name(instance: pd.Series) -> str:
-    return f'{instance.repo}__{instance.version}'.replace('/', '__')
-
-
-def get_instruction(instance: pd.Series, metadata: EvalMetadata) -> MessageAction:
-    workspace_dir_name = _get_swebench_workspace_dir_name(instance)
-
-    # TODO: Change to testbed?
-    instruction = f"""
-<uploaded_files>
-/workspace/{workspace_dir_name}
-</uploaded_files>
-
-I’ve uploaded a python code repository in the directory workspace_dir_name. Consider the following performance workload and `workload()` function showing an specific usage of the repository:
-<performance_workload>
-{instance.workload}
-</performance_workload>
-
-Can you help me implement the necessary changes to the repository so that the runtime of the `workload()` function is faster? Basic guidelines:
-1. Your task is to make changes to non-test files in the /workspace directory to improve the performance of the code running in `workload()`. Please do not directly change the implementation of the `workload()` function to optimize things: I want you to focus on making the workload AS IS run faster by only editing the repository containing code that the `workload()` function calls.
-2. Make changes while ensuring the repository is functionally equivalent to the original: your changes should not introduce new bugs or cause already-passing tests to begin failing after your changes. However, you do not need to worry about tests that already fail without any changes made. For relevant test files you find in the repository, you can run them via the bash command `{instance.test_cmd} <test_file>` to check for correctness. Note that running all the tests may take a long time, so you need to determine which tests are relevant to your changes.
-3. Make sure the `workload()` function improves in performance after you make changes to the repository. The workload can potentially take some time to run, so please allow it to finish and be generous with setting your timeout parameter (a timeout value of 3600 or larger here is encouraged): for faster iteration, you should adjust the workload script to use fewer iterations. Before you complete your task, please make sure to check that the **original performance workload** and `workload()` function runs successfully and the performance is improved.
-4. You may need to reinstall/rebuild the repo for your changes to take effect before testing if you made non-Python changes. Reinstalling may take a long time to run (a timeout value of 3600 or larger here is encouraged), so please be patient with running it and allow it to complete if possible. You can reinstall the repository by running the bash command `{instance.rebuild_cmd}` in the workspace directory.
-5. All the dependencies required to run the `workload()` function are already installed in the environment. You should not install or upgrade any dependencies.
-
-Follow these steps to improve performance:
-1. As a first step, explore the repository structure.
-2. Create a Python script to reproduce the performance workload, execute it with python <workload_file>, and examine the printed output metrics.
-3. Edit the source code of the repository to improve performance. Please do not change the contents of the `workload()` function itself, but focus on optimizing the code in the repository that the original `workload()` function uses.
-4. If non-Python changes were made, rebuild the repo to make sure the changes take effect.
-5. Rerun your script to confirm that performance has improved.
-6. If necessary, identify any relevant test files in the repository related to your changes and verify that test statuses did not change after your modifications.
-7. After each attempted change, please reflect on the changes attempted and the performance impact observed. If the performance did not improve, consider alternative approaches or optimizations.
-8. Once you are satisfied, please use the finish command to complete your task.
-
-Please remember that you should not change the implementation of the `workload()` function. The performance improvement should solely come from editing the source files in the code repository.
-"""
-
-    if RUN_WITH_BROWSING:
-        instruction += (
-            '<IMPORTANT!>\nYou SHOULD NEVER attempt to browse the web. </IMPORTANT!>\n'
-        )
-
-    return MessageAction(content=instruction)
-
-
-def get_instance_docker_image(
-    instance_id: str,
-) -> str:
-    return f'ghcr.io/swefficiency/swefficiency-images:{instance_id}'
-
-
-def get_config(
-    instance: pd.Series,
-    metadata: EvalMetadata,
-    cpu_group: list[int] | None = None,
-) -> OpenHandsConfig:
-    # We use a different instance image for each instance of swe-bench eval
-    base_container_image = get_instance_docker_image(
-        instance['instance_id'],
-    )
-    logger.info(
-        f'Using instance container image: {base_container_image}. '
-        f'Please make sure this image exists. '
-        f'Submit an issue on https://github.com/All-Hands-AI/OpenHands if you run into any issues.'
-    )
-
-    sandbox_config = get_default_sandbox_config_for_eval()
-    sandbox_config.base_container_image = base_container_image
-    sandbox_config.enable_auto_lint = True
-    sandbox_config.use_host_network = False
-    sandbox_config.timeout = 3600
-
-    # Control container cleanup behavior via environment variable
-    # Default to False for multiprocessing stability to prevent cascade failures
-    sandbox_config.rm_all_containers = True
-
-    sandbox_config.platform = 'linux/amd64'
-    sandbox_config.remote_runtime_resource_factor = 4.0
-    sandbox_config.runtime_startup_env_vars.update(
-        {
-            'NO_CHANGE_TIMEOUT_SECONDS': '900',  # 15 minutes
-        }
-    )
-
-    if cpu_group is not None:
-        print(f'Configuring Docker runtime with CPU group: {cpu_group}')
-        sandbox_config.docker_runtime_kwargs = {
-            # HACK: Use the cpu_group if provided, otherwise use all available CPUs
-            'cpuset_cpus': ','.join(map(str, cpu_group)),
-            'nano_cpus': int(1e9 * len(cpu_group)),  # optional: hard cap to vCPU count
-            'mem_limit': '16g',
-        }
-
-    # Note: We keep rm_all_containers = False for worker process safety
-
-    config = OpenHandsConfig(
-        default_agent=metadata.agent_class,
-        run_as_openhands=False,
-        max_iterations=metadata.max_iterations,
-        runtime=os.environ.get('RUNTIME', 'docker'),
-        sandbox=sandbox_config,
-        # do not mount workspace
-        workspace_base=None,
-        workspace_mount_path=None,
-    )
-    config.set_llm_config(
-        update_llm_config_for_completions_logging(
-            metadata.llm_config, metadata.eval_output_dir, instance['instance_id']
-        )
-    )
-    agent_config = AgentConfig(
-        enable_jupyter=False,
-        enable_browsing=RUN_WITH_BROWSING,
-        enable_llm_editor=False,
-        enable_mcp=False,
-        condenser=metadata.condenser_config,
-        enable_prompt_extensions=False,
-    )
-    config.set_agent_config(agent_config)
-    return config
-
-
-def initialize_runtime(
-    runtime: Runtime,
-    instance: pd.Series,  # this argument is not required
-    metadata: EvalMetadata,
-):
-    """Initialize the runtime for the agent.
-
-    This function is called before the runtime is used to run the agent.
-    """
-    logger.info('-' * 30)
-    logger.info('BEGIN Runtime Initialization Fn')
-    logger.info('-' * 30)
-    workspace_dir_name = _get_swebench_workspace_dir_name(instance)
-    obs: CmdOutputObservation
-
-    # Set instance id and git configuration
-    action = CmdRunAction(
-        command=f"""echo 'export SWE_INSTANCE_ID={instance['instance_id']}' >> ~/.bashrc && echo 'export PIP_CACHE_DIR=~/.cache/pip' >> ~/.bashrc && echo "alias git='git --no-pager'" >> ~/.bashrc && git config --global core.pager "" && git config --global diff.binary false"""
-    )
-    action.set_hard_timeout(600)
-    logger.info(action, extra={'msg_type': 'ACTION'})
-    obs = runtime.run_action(action)
-    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-    assert_and_raise(
-        obs.exit_code == 0,
-        f'Failed to export SWE_INSTANCE_ID and configure git: {str(obs)}',
-    )
-
-    action = CmdRunAction(command="""export USER=$(whoami); echo USER=${USER} """)
-    action.set_hard_timeout(600)
-    logger.info(action, extra={'msg_type': 'ACTION'})
-    obs = runtime.run_action(action)
-    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-    assert_and_raise(obs.exit_code == 0, f'Failed to export USER: {str(obs)}')
-
-    # inject the init script
-    script_dir = os.path.dirname(__file__)
-
-    # inject the instance info
-    action = CmdRunAction(command='mkdir -p /swe_util/eval_data/instances')
-    action.set_hard_timeout(600)
-    logger.info(action, extra={'msg_type': 'ACTION'})
-    obs = runtime.run_action(action)
-    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-    assert_and_raise(
-        obs.exit_code == 0,
-        f'Failed to create /swe_util/eval_data/instances: {str(obs)}',
-    )
-
-    swe_instance_json_name = 'swe-bench-instance.json'
-    with tempfile.TemporaryDirectory() as temp_dir:
-        # Construct the full path for the desired file name within the temporary directory
-        temp_file_path = os.path.join(temp_dir, swe_instance_json_name)
-        # Write to the file with the desired name within the temporary directory
-        with open(temp_file_path, 'w') as f:
-            if not isinstance(instance, dict):
-                json.dump([instance.to_dict()], f)
-            else:
-                json.dump([instance], f)
-
-        # Copy the file to the desired location
-        runtime.copy_to(temp_file_path, '/swe_util/eval_data/instances/')
-
-        # inject the instance swe entry
-        runtime.copy_to(
-            str(os.path.join(script_dir, 'scripts/setup/instance_swe_entry.sh')),
-            '/swe_util/',
-        )
-
-    action = CmdRunAction(command='cat ~/.bashrc')
-    action.set_hard_timeout(600)
-    logger.info(action, extra={'msg_type': 'ACTION'})
-    obs = runtime.run_action(action)
-    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-    assert_and_raise(obs.exit_code == 0, f'Failed to cat ~/.bashrc: {str(obs)}')
-
-    action = CmdRunAction(command='source ~/.bashrc')
-    action.set_hard_timeout(600)
-    logger.info(action, extra={'msg_type': 'ACTION'})
-    obs = runtime.run_action(action)
-    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-    if isinstance(obs, ErrorObservation):
-        logger.error(f'Failed to source ~/.bashrc: {str(obs)}')
-    assert_and_raise(obs.exit_code == 0, f'Failed to source ~/.bashrc: {str(obs)}')
-
-    action = CmdRunAction(command='source /swe_util/instance_swe_entry.sh')
-    action.set_hard_timeout(600)
-    logger.info(action, extra={'msg_type': 'ACTION'})
-    obs = runtime.run_action(action)
-    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-    assert_and_raise(
-        obs.exit_code == 0,
-        f'Failed to source /swe_util/instance_swe_entry.sh: {str(obs)}',
-    )
-
-    action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
-    action.set_hard_timeout(600)
-    logger.info(action, extra={'msg_type': 'ACTION'})
-    obs = runtime.run_action(action)
-    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-    assert_and_raise(
-        obs.exit_code == 0,
-        f'Failed to cd to /workspace/{workspace_dir_name}: {str(obs)}',
-    )
-
-    action = CmdRunAction(command='git reset --hard')
-    action.set_hard_timeout(600)
-    logger.info(action, extra={'msg_type': 'ACTION'})
-    obs = runtime.run_action(action)
-    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-    assert_and_raise(obs.exit_code == 0, f'Failed to git reset --hard: {str(obs)}')
-
-    action = CmdRunAction(
-        command='for remote_name in $(git remote); do git remote remove "${remote_name}"; done'
-    )
-    action.set_hard_timeout(600)
-    logger.info(action, extra={'msg_type': 'ACTION'})
-    obs = runtime.run_action(action)
-    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-    assert_and_raise(obs.exit_code == 0, f'Failed to remove git remotes: {str(obs)}')
-
-    action = CmdRunAction(command='which python')
-    action.set_hard_timeout(600)
-    logger.info(action, extra={'msg_type': 'ACTION'})
-    obs = runtime.run_action(action)
-    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-    assert_and_raise(
-        obs.exit_code == 0 and 'testbed' in obs.content,
-        f'Expected to find python interpreter from testbed, but got: {str(obs)}',
-    )
-
-    logger.info('-' * 30)
-    logger.info('END Runtime Initialization Fn')
-    logger.info('-' * 30)
-
-
-def complete_runtime(
-    runtime: Runtime,
-    instance: pd.Series,  # this argument is not required, but it is used to get the workspace_dir_name
-) -> dict[str, Any]:
-    """Complete the runtime for the agent.
-
-    This function is called after the agent has finished running.
-    If you need to do something in the sandbox to get the correctness metric after
-    the agent has run, modify this function.
-    """
-    logger.info('-' * 30)
-    logger.info('BEGIN Runtime Completion Fn')
-    logger.info('-' * 30)
-    obs: CmdOutputObservation
-    workspace_dir_name = _get_swebench_workspace_dir_name(instance)
-
-    action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
-    action.set_hard_timeout(600)
-    logger.info(action, extra={'msg_type': 'ACTION'})
-    obs = runtime.run_action(action)
-    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-
-    if obs.exit_code == -1:
-        # The previous command is still running
-        # We need to kill previous command
-        logger.info('The previous command is still running, trying to kill it...')
-        action = CmdRunAction(command='C-c')
-        obs = runtime.run_action(action)
-        logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-
-        # Then run the command again
-        action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
-        action.set_hard_timeout(600)
-        logger.info(action, extra={'msg_type': 'ACTION'})
-        obs = runtime.run_action(action)
-        logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-
-    if obs.exit_code == -1:
-        # The previous command survived Ctrl+C and is still running,
-        # so suspend it with Ctrl+Z instead
-        logger.info('The previous command is still running, trying to ctrl+z it...')
-        action = CmdRunAction(command='C-z')
-        obs = runtime.run_action(action)
-        logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-
-        # Then run the command again
-        action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
-        action.set_hard_timeout(600)
-        logger.info(action, extra={'msg_type': 'ACTION'})
-        obs = runtime.run_action(action)
-        logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-
-    assert_and_raise(
-        isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
-        f'Failed to cd to /workspace/{workspace_dir_name}: {str(obs)}',
-    )
-
-    action = CmdRunAction(command='git config --global core.pager ""')
-    action.set_hard_timeout(600)
-    logger.info(action, extra={'msg_type': 'ACTION'})
-    obs = runtime.run_action(action)
-    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-    assert_and_raise(
-        isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
-        f'Failed to git config --global core.pager "": {str(obs)}',
-    )
-
-    # First check for any git repositories in subdirectories
-    action = CmdRunAction(command='find . -type d -name .git -not -path "./.git"')
-    action.set_hard_timeout(600)
-    logger.info(action, extra={'msg_type': 'ACTION'})
-    obs = runtime.run_action(action)
-    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-    assert_and_raise(
-        isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
-        f'Failed to find git repositories: {str(obs)}',
-    )
-
-    git_dirs = [p for p in obs.content.strip().split('\n') if p]
-    if git_dirs:
-        # Remove all .git directories in subdirectories
-        for git_dir in git_dirs:
-            action = CmdRunAction(command=f'rm -rf "{git_dir}"')
-            action.set_hard_timeout(600)
-            logger.info(action, extra={'msg_type': 'ACTION'})
-            obs = runtime.run_action(action)
-            logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-            assert_and_raise(
-                isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
-                f'Failed to remove git directory {git_dir}: {str(obs)}',
-            )
-
-    # add all files
-    action = CmdRunAction(command='git add -A')
-    action.set_hard_timeout(600)
-    logger.info(action, extra={'msg_type': 'ACTION'})
-    obs = runtime.run_action(action)
-    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-    assert_and_raise(
-        isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
-        f'Failed to git add -A: {str(obs)}',
-    )
-
-    # Remove binary files from git staging
-    action = CmdRunAction(command=remove_binary_files_from_git())
-    action.set_hard_timeout(600)
-    logger.info(action, extra={'msg_type': 'ACTION'})
-    obs = runtime.run_action(action)
-    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-    assert_and_raise(
-        isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
-        f'Failed to remove binary files: {str(obs)}',
-    )
-
-    n_retries = 0
-    git_patch = None
-    while n_retries < 5:
-        action = CmdRunAction(
-            command=f'git diff --no-color --cached {instance["base_commit"]} > patch.diff'
-        )
-        action.set_hard_timeout(max(300 + 100 * n_retries, 600))
-        logger.info(action, extra={'msg_type': 'ACTION'})
-        obs = runtime.run_action(action)
-        logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-        n_retries += 1
-        if isinstance(obs, CmdOutputObservation):
-            if obs.exit_code == 0:
-                # Read the patch file
-                action = FileReadAction(path='patch.diff')
-                action.set_hard_timeout(max(300 + 100 * n_retries, 600))
-                logger.info(action, extra={'msg_type': 'ACTION'})
-                obs = runtime.run_action(action)
-                logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-                if isinstance(obs, FileReadObservation):
-                    git_patch = obs.content
-                    break
-                elif isinstance(obs, ErrorObservation):
-                    # Fall back to cat "patch.diff" to get the patch
-                    assert 'File could not be decoded as utf-8' in obs.content
-                    action = CmdRunAction(command='cat patch.diff')
-                    action.set_hard_timeout(max(300 + 100 * n_retries, 600))
-                    logger.info(action, extra={'msg_type': 'ACTION'})
-                    obs = runtime.run_action(action)
-                    assert isinstance(obs, CmdOutputObservation) and obs.exit_code == 0
-                    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
-                    git_patch = obs.content
-                    break
-                else:
-                    assert_and_raise(False, f'Unexpected observation type: {str(obs)}')
-            else:
-                logger.info('Failed to get git diff, retrying...')
-                sleep_if_should_continue(10)
-        elif isinstance(obs, ErrorObservation):
-            logger.error(f'Error occurred: {obs.content}. Retrying...')
-            sleep_if_should_continue(10)
-        else:
-            assert_and_raise(False, f'Unexpected observation type: {str(obs)}')
-
-    assert_and_raise(git_patch is not None, 'Failed to get git diff (None)')
-
-    # Remove binary diffs from the patch
-    git_patch = remove_binary_diffs(git_patch)
-
-    logger.info('-' * 30)
-    logger.info('END Runtime Completion Fn')
-    logger.info('-' * 30)
-    return {'git_patch': git_patch}
-
-
-class CPUGroupManager:
-    def __init__(self, cpu_groups_queue: multiprocessing.Queue):
-        self.cpu_groups_queue = cpu_groups_queue
-
-    def __enter__(self):
-        # Get the current CPU group for this worker
-        if self.cpu_groups_queue is not None:
-            self.cpu_group = self.cpu_groups_queue.get()
-            logger.info(f'Worker started with CPU group: {self.cpu_group}')
-            return self.cpu_group
-        return None
-
-    def __exit__(self, exc_type, exc_value, traceback):
-        # Put the CPU group back into the queue for other workers to use
-        if self.cpu_groups_queue is not None:
-            self.cpu_groups_queue.put(self.cpu_group)
-            logger.info(f'Worker finished with CPU group: {self.cpu_group}')
-
-
-def cleanup_docker_resources_for_worker():
-    """Clean up Docker resources specific to this worker process.
-
-    This prevents cascade failures when one worker's container crashes.
-    Note: This only cleans up stale locks, not containers, to avoid
-    interfering with other workers. Container cleanup is handled
-    by the DockerRuntime.close() method based on configuration.
-    """
-
-    # Clean up any stale port locks from crashed processes
-    try:
-        from openhands.runtime.utils.port_lock import cleanup_stale_locks
-
-        cleanup_stale_locks(max_age_seconds=300)  # Clean up locks older than 5 minutes
-    except Exception as e:
-        logger.debug(f'Error cleaning up stale port locks: {e}')
-
-
-def process_instance(
-    instance: pd.Series,
-    metadata: EvalMetadata,
-    reset_logger: bool = True,
-    runtime_failure_count: int = 0,
-    cpu_groups_queue: multiprocessing.Queue = None,
-) -> EvalOutput:
-    # Clean up any Docker resources from previous failed runs
-    cleanup_docker_resources_for_worker()
-
-    # HACK: Use the global and get the cpu group for this worker.
-    with CPUGroupManager(cpu_groups_queue) as cpu_group:
-        config = get_config(instance, metadata, cpu_group=cpu_group)
-
-        # Setup the logger properly, so you can run multi-processing to parallelize the evaluation
-        if reset_logger:
-            log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
-            reset_logger_for_multiprocessing(logger, instance.instance_id, log_dir)
-        else:
-            logger.info(f'Starting evaluation for instance {instance.instance_id}.')
-
-        metadata = copy.deepcopy(metadata)
-        metadata.details['runtime_failure_count'] = runtime_failure_count
-        metadata.details['remote_runtime_resource_factor'] = (
-            config.sandbox.remote_runtime_resource_factor
-        )
-
-        runtime = create_runtime(config, sid=None)
-        call_async_from_sync(runtime.connect)
-
-        try:
-            initialize_runtime(runtime, instance, metadata)
-
-            message_action = get_instruction(instance, metadata)
-
-            # Here's how you can run the agent (similar to the `main` function) and get the final task state
-            state: State | None = asyncio.run(
-                run_controller(
-                    config=config,
-                    initial_user_action=message_action,
-                    runtime=runtime,
-                    fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN[
-                        metadata.agent_class
-                    ],
-                )
-            )
-
-            # On a fatal error, raise EvalException to trigger a re-run
-            if state is not None and is_fatal_evaluation_error(state.last_error):
-                raise EvalException('Fatal error detected: ' + state.last_error)
-
-            # ======= THIS IS SWE-Bench specific =======
-            # Get git patch
-            return_val = complete_runtime(runtime, instance)
-            git_patch = return_val['git_patch']
-            logger.info(
-                f'Got git diff for instance {instance.instance_id}:\n--------\n{git_patch}\n--------'
-            )
-        except Exception as e:
-            # Log the error but don't let it crash other workers
-            logger.error(
-                f'Error in worker processing instance {instance.instance_id}: {str(e)}'
-            )
-            raise
-        finally:
-            # Ensure runtime is properly closed to prevent cascade failures
-            try:
-                runtime.close()
-            except Exception as e:
-                logger.warning(
-                    f'Error closing runtime for {instance.instance_id}: {str(e)}'
-                )
-                # Don't re-raise - we want to continue cleanup
-
-        # ==========================================
-
-        # ======= Attempt to evaluate the agent's edits =======
-        # we use eval_infer.sh to evaluate the agent's edits, not here
-        # because the agent may alter the environment / testcases
-        test_result = {
-            'git_patch': git_patch,
-        }
-
-        # If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
-        # You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
-        if state is None:
-            raise ValueError('State should not be None.')
-
-        # NOTE: this is NO LONGER the event stream, but an agent history that includes delegate agent's events
-        histories = [event_to_dict(event) for event in state.history]
-        metrics = get_metrics(state)
-
-        # Save the output
-        instruction = message_action.content
-        if message_action.image_urls:
-            instruction += (
-                '\n\n<image_urls>'
-                + '\n'.join(message_action.image_urls)
-                + '</image_urls>'
-            )
-        output = EvalOutput(
-            instance_id=instance.instance_id,
-            instruction=instruction,
-            instance=instance.to_dict(),  # SWE Bench specific
-            test_result=test_result,
-            metadata=metadata,
-            history=histories,
-            metrics=metrics,
-            error=state.last_error if state and state.last_error else None,
-        )
-        return output
-
-
-def filter_dataset(dataset: pd.DataFrame, filter_column: str) -> pd.DataFrame:
-    file_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'config.toml')
-    if os.path.exists(file_path):
-        with open(file_path, 'r') as file:
-            data = toml.load(file)
-            if 'selected_ids' in data:
-                selected_ids = data['selected_ids']
-                logger.info(
-                    f'Filtering {len(selected_ids)} tasks from "selected_ids"...'
-                )
-                subset = dataset[dataset[filter_column].isin(selected_ids)]
-                logger.info(f'Retained {subset.shape[0]} tasks after filtering')
-                return subset
-            if 'selected_repos' in data:
-                # repos for the swe-bench instances:
-                # ['astropy/astropy', 'django/django', 'matplotlib/matplotlib', 'mwaskom/seaborn', 'pallets/flask', 'psf/requests', 'pydata/xarray', 'pylint-dev/pylint', 'pytest-dev/pytest', 'scikit-learn/scikit-learn', 'sphinx-doc/sphinx', 'sympy/sympy']
-                selected_repos = data['selected_repos']
-                if isinstance(selected_repos, str):
-                    selected_repos = [selected_repos]
-                assert isinstance(selected_repos, list)
-                logger.info(
-                    f'Filtering {selected_repos} tasks from "selected_repos"...'
-                )
-                subset = dataset[dataset['repo'].isin(selected_repos)]
-                logger.info(f'Retained {subset.shape[0]} tasks after filtering')
-                return subset
-
-    # ''.split(',') returns [''], so drop empty entries before checking
-    skip_ids = [i for i in os.environ.get('SKIP_IDS', '').split(',') if i]
-    if skip_ids:
-        logger.info(f'Filtering {len(skip_ids)} tasks from "SKIP_IDS"...')
-        return dataset[~dataset[filter_column].isin(skip_ids)]
-    return dataset
-
-
-def divide_cpus_among_workers(num_workers, num_cpus_per_worker=4, num_to_skip=0):
-    """Divide CPUs among workers, with better error handling for multiprocessing."""
-    try:
-        current_cpus = list(os.sched_getaffinity(0))
-    except AttributeError:
-        # os.sched_getaffinity not available on all platforms
-        import multiprocessing
-
-        current_cpus = list(range(multiprocessing.cpu_count()))
-
-    num_cpus = len(current_cpus)
-    if num_workers <= 0:
-        raise ValueError('Number of workers must be greater than 0')
-
-    # Check that num_workers and num_cpus_per_worker fit into the available CPUs
-    total_cpus_needed = num_workers * num_cpus_per_worker + num_to_skip
-    if total_cpus_needed > num_cpus:
-        raise ValueError(
-            f'Not enough CPUs available. Requested {total_cpus_needed} '
-            f'CPUs (num_workers={num_workers}, num_cpus_per_worker={num_cpus_per_worker}, '
-            f'num_to_skip={num_to_skip}), but only {num_cpus} CPUs are available.'
-        )
-
-    # Divide this into groups, skipping the first `num_to_skip` CPUs.
-    available_cpus = current_cpus[num_to_skip:]
-    cpu_groups = [
-        available_cpus[i * num_cpus_per_worker : (i + 1) * num_cpus_per_worker]
-        for i in range(num_workers)
-    ]
-    print(
-        f'Divided {num_cpus} CPUs into {num_workers} groups, each with {num_cpus_per_worker} CPUs.'
-    )
-    print(f'CPU groups: {cpu_groups}')
-
-    return cpu_groups
-
-
-if __name__ == '__main__':
-    parser = get_evaluation_parser()
-    parser.add_argument(
-        '--dataset',
-        type=str,
-        default=None,
-        help='data set to evaluate on, for now use local.',
-    )
-    parser.add_argument(
-        '--split',
-        type=str,
-        default='test',
-        help='split to evaluate on',
-    )
-    parser.add_argument(
-        '--mode',
-        type=str,
-        default='swe',
-        help='mode to evaluate on',
-    )
-
-    args, _ = parser.parse_known_args()
-
-    # NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
-    # so we don't need to manage file uploading to OpenHands's repo
-
-    # dataset = load_dataset(args.dataset, split=args.split)
-    # swe_bench_tests = filter_dataset(dataset.to_pandas(), 'instance_id')
-    dataset = load_dataset(args.dataset, split=args.split)
-
-    # Convert dataset to pandas DataFrame if it is not already.
-    if not isinstance(dataset, pd.DataFrame):
-        dataset = dataset.to_pandas()
-
-    dataset['version'] = dataset['version'].astype(str)
-
-    # Convert created_at column to string.
-    dataset['created_at'] = dataset['created_at'].astype(str)
-
-    swe_bench_tests = filter_dataset(dataset, 'instance_id')
-
-    logger.info(
-        f'Loaded dataset {args.dataset} with split {args.split}: {len(swe_bench_tests)} tasks'
-    )
-
-    llm_config = None
-    if args.llm_config:
-        llm_config = get_llm_config_arg(args.llm_config)
-        llm_config.log_completions = True
-        # modify_params must be False for evaluation purposes, for reproducibility and accuracy of results
-        llm_config.modify_params = False
-
-    if llm_config is None:
-        raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
-
-    # Get condenser config from environment variable
-    condenser_name = os.environ.get('EVAL_CONDENSER')
-    if condenser_name:
-        condenser_config = get_condenser_config_arg(condenser_name)
-        if condenser_config is None:
-            raise ValueError(
-                f'Could not find Condenser config: EVAL_CONDENSER={condenser_name}'
-            )
-    else:
-        # If no specific condenser config is provided via env var, default to NoOpCondenser
-        condenser_config = NoOpCondenserConfig()
-        logger.debug(
-            'No Condenser config provided via EVAL_CONDENSER, using NoOpCondenser.'
-        )
-
-    details = {'mode': args.mode}
-    _agent_cls = openhands.agenthub.Agent.get_cls(args.agent_cls)
-
-    dataset_description = (
-        args.dataset.replace('/', '__') + '-' + args.split.replace('/', '__')
-    )
-    metadata = make_metadata(
-        llm_config,
-        dataset_description,
-        args.agent_cls,
-        args.max_iterations,
-        args.eval_note,
-        args.eval_output_dir,
-        details=details,
-        condenser_config=condenser_config,
-    )
-
-    output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
-    print(f'### OUTPUT FILE: {output_file} ###')
-
-    # Run evaluation in iterative mode:
-    # If a rollout fails to output AgentFinishAction, we retry until it succeeds
-    # or ITERATIVE_EVAL_MODE_MAX_ATTEMPTS (default 3) attempts have been made.
-    ITERATIVE_EVAL_MODE = (
-        os.environ.get('ITERATIVE_EVAL_MODE', 'false').lower() == 'true'
-    )
-    ITERATIVE_EVAL_MODE_MAX_ATTEMPTS = int(
-        os.environ.get('ITERATIVE_EVAL_MODE_MAX_ATTEMPTS', '3')
-    )
-
-    # Get all CPUs, divide them into one group per worker, and put the groups into a multiprocessing queue.
-    cpu_groups_queue = None
-    cpu_groups_list = divide_cpus_among_workers(args.eval_num_workers, num_to_skip=8)
-    cpu_groups_queue = multiprocessing.Manager().Queue()
-    for cpu_group in cpu_groups_list:
-        cpu_groups_queue.put(cpu_group)
-
-    if not ITERATIVE_EVAL_MODE:
-        # load the dataset
-        instances = prepare_dataset(swe_bench_tests, output_file, args.eval_n_limit)
-
-        process_instance_with_cpu_groups = functools.partial(
-            process_instance,
-            cpu_groups_queue=cpu_groups_queue,
-        )
-
-        config = get_config(
-            instances.iloc[0],  # Use the first instance to get the config
-            metadata,
-            cpu_group=None,  # We will use the cpu_groups_queue to get the cpu group later
-        )
-
-        run_evaluation(
-            instances,
-            metadata,
-            output_file,
-            args.eval_num_workers,
-            process_instance_with_cpu_groups,
-            timeout_seconds=8 * 60 * 60,  # 8 hours per instance should be more than enough
-            max_retries=3,
-        )
-    else:
-        critic = AgentFinishedCritic()
-
-        def get_cur_output_file_path(attempt: int) -> str:
-            return (
-                f'{output_file.removesuffix(".jsonl")}.critic_attempt_{attempt}.jsonl'
-            )
-
-        eval_ids = None
-        for attempt in range(1, ITERATIVE_EVAL_MODE_MAX_ATTEMPTS + 1):
-            cur_output_file = get_cur_output_file_path(attempt)
-            logger.info(
-                f'Running evaluation with critic {critic.__class__.__name__} for attempt {attempt} of {ITERATIVE_EVAL_MODE_MAX_ATTEMPTS}.'
-            )
-
-            # With temperature 0 a retry would likely repeat the same rollout, so from the
-            # second attempt on we raise the temperature to 0.1 to get slightly different results
-            if attempt > 1 and metadata.llm_config.temperature == 0:
-                logger.info(
-                    f'Detected temperature is 0 for (>1) attempt {attempt}. Setting temperature to 0.1...'
-                )
-                metadata.llm_config.temperature = 0.1
-
-            # Load instances - at first attempt, we evaluate all instances
-            # On subsequent attempts, we only evaluate the instances that failed the previous attempt determined by critic
-            instances = prepare_dataset(
-                swe_bench_tests, cur_output_file, args.eval_n_limit, eval_ids=eval_ids
-            )
-            if len(instances) > 0 and not isinstance(
-                instances['PASS_TO_PASS'][instances['PASS_TO_PASS'].index[0]], str
-            ):
-                for col in ['PASS_TO_PASS', 'FAIL_TO_PASS']:
-                    instances[col] = instances[col].apply(lambda x: str(x))
-
-            # Run evaluation - but save them to cur_output_file
-            logger.info(
-                f'Evaluating {len(instances)} instances for attempt {attempt}...'
-            )
-            run_evaluation(
-                instances,
-                metadata,
-                cur_output_file,
-                args.eval_num_workers,
-                process_instance,
-                timeout_seconds=8 * 60 * 60,  # 8 hours per instance should be more than enough
-                max_retries=1,
-            )
-
-            # When eval is done, we update eval_ids to the instances that failed the current attempt
-            instances_failed = []
-            logger.info(
-                f'Use critic {critic.__class__.__name__} to check {len(instances)} instances for attempt {attempt}...'
-            )
-            with open(cur_output_file, 'r') as f:
-                for line in f:
-                    instance = json.loads(line)
-                    try:
-                        history = [
-                            event_from_dict(event) for event in instance['history']
-                        ]
-                        critic_result = critic.evaluate(
-                            history, instance['test_result'].get('git_patch', '')
-                        )
-                        if not critic_result.success:
-                            instances_failed.append(instance['instance_id'])
-                    except Exception as e:
-                        logger.error(
-                            f'Error loading history for instance {instance["instance_id"]}: {e}'
-                        )
-                        instances_failed.append(instance['instance_id'])
-            logger.info(
-                f'{len(instances_failed)} instances failed the current attempt {attempt}: {instances_failed}'
-            )
-            eval_ids = instances_failed
-
-            # If no instances failed, we break
-            if len(instances_failed) == 0:
-                break
-
-        # Then we should aggregate the results from all attempts into the original output file
-        # and remove the intermediate files
-        logger.info(
-            'Aggregating results from all attempts into the original output file...'
-        )
-        fout = open(output_file, 'w')
-        added_instance_ids = set()
-        for attempt in reversed(range(1, ITERATIVE_EVAL_MODE_MAX_ATTEMPTS + 1)):
-            cur_output_file = get_cur_output_file_path(attempt)
-            if not os.path.exists(cur_output_file):
-                logger.warning(
-                    f'Intermediate output file {cur_output_file} does not exist. Skipping...'
-                )
-                continue
-
-            with open(cur_output_file, 'r') as f:
-                for line in f:
-                    instance = json.loads(line)
-                    # Also make sure git_patch is not empty - otherwise we fall back to previous attempt (empty patch is worse than anything else)
-                    if (
-                        instance['instance_id'] not in added_instance_ids
-                        and instance['test_result'].get('git_patch', '').strip()
-                    ):
-                        fout.write(line)
-                        added_instance_ids.add(instance['instance_id'])
-            logger.info(
-                f'Aggregated instances from {cur_output_file}. Total instances added so far: {len(added_instance_ids)}'
-            )
-        fout.close()
-        logger.info(
-            f'Done! Total {len(added_instance_ids)} instances added to {output_file}'
-        )
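Reviewer note: for readers following the removed `divide_cpus_among_workers` helper, its partitioning step can be sketched in isolation. This is a minimal illustration under stated assumptions, not the removed implementation: `partition_cpus` is a hypothetical name, and it takes the CPU list as an argument rather than querying `os.sched_getaffinity`, so it behaves the same on any platform.

```python
def partition_cpus(cpus, num_workers, cpus_per_worker=4, num_to_skip=0):
    """Split `cpus` into `num_workers` disjoint groups of `cpus_per_worker`.

    Hypothetical helper: the first `num_to_skip` CPUs are left unassigned,
    mirroring the skip behaviour of the removed function.
    """
    if num_workers <= 0:
        raise ValueError('Number of workers must be greater than 0')

    # Every worker needs a full group, on top of the skipped CPUs.
    needed = num_workers * cpus_per_worker + num_to_skip
    if needed > len(cpus):
        raise ValueError(
            f'Requested {needed} CPUs, but only {len(cpus)} are available'
        )

    usable = cpus[num_to_skip:]
    return [
        usable[i * cpus_per_worker : (i + 1) * cpus_per_worker]
        for i in range(num_workers)
    ]


groups = partition_cpus(
    list(range(16)), num_workers=2, cpus_per_worker=4, num_to_skip=8
)
print(groups)  # [[8, 9, 10, 11], [12, 13, 14, 15]]
```

Because the groups are disjoint slices, handing them out through a queue (as the removed `CPUGroupManager` did) lets each worker use its own set of cores without overlapping another worker's.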
--- a/evaluation/benchmarks/swefficiency/scripts/run_infer.sh
+++ b/evaluation/benchmarks/swefficiency/scripts/run_infer.sh
@@ -1,148 +0,0 @@
-#!/usr/bin/env bash
-set -eo pipefail
-
-source "evaluation/utils/version_control.sh"
-
-MODEL_CONFIG=$1
-COMMIT_HASH=$2
-AGENT=$3
-EVAL_LIMIT=$4
-MAX_ITER=$5
-NUM_WORKERS=$6
-DATASET=$7
-SPLIT=$8
-N_RUNS=$9
-MODE=${10}
-
-
-if [ -z "$NUM_WORKERS" ]; then
-  NUM_WORKERS=1
-  echo "Number of workers not specified, use default $NUM_WORKERS"
-fi
-checkout_eval_branch
-
-if [ -z "$AGENT" ]; then
-  echo "Agent not specified, use default CodeActAgent"
-  AGENT="CodeActAgent"
-fi
-
-if [ -z "$MAX_ITER" ]; then
-  echo "MAX_ITER not specified, use default 100"
-  MAX_ITER=100
-fi
-
-if [ -z "$RUN_WITH_BROWSING" ]; then
-  echo "RUN_WITH_BROWSING not specified, use default false"
-  RUN_WITH_BROWSING=false
-fi
-
-
-if [ -z "$DATASET" ]; then
-  echo "DATASET not specified, use default swefficiency/swefficiency"
-  DATASET="swefficiency/swefficiency"
-fi
-
-if [ -z "$SPLIT" ]; then
-  echo "SPLIT not specified, use default test"
-  SPLIT="test"
-fi
-
-if [ -z "$MODE" ]; then
-  MODE="swe"
-  echo "MODE not specified, use default $MODE"
-fi
-
-if [ -n "$EVAL_CONDENSER" ]; then
-  echo "Using Condenser Config: $EVAL_CONDENSER"
-else
-  echo "No Condenser Config provided via EVAL_CONDENSER, use default (NoOpCondenser)."
-fi
-
-export RUN_WITH_BROWSING=$RUN_WITH_BROWSING
-echo "RUN_WITH_BROWSING: $RUN_WITH_BROWSING"
-
-get_openhands_version
-
-echo "AGENT: $AGENT"
-echo "OPENHANDS_VERSION: $OPENHANDS_VERSION"
-echo "MODEL_CONFIG: $MODEL_CONFIG"
-echo "DATASET: $DATASET"
-echo "SPLIT: $SPLIT"
-echo "MAX_ITER: $MAX_ITER"
-echo "NUM_WORKERS: $NUM_WORKERS"
-echo "COMMIT_HASH: $COMMIT_HASH"
-echo "MODE: $MODE"
-echo "EVAL_CONDENSER: $EVAL_CONDENSER"
-
-# Default to NOT use Hint
-if [ -z "$USE_HINT_TEXT" ]; then
-  export USE_HINT_TEXT=false
-fi
-echo "USE_HINT_TEXT: $USE_HINT_TEXT"
-EVAL_NOTE="$OPENHANDS_VERSION"
-# if not using Hint, add -no-hint to the eval note
-if [ "$USE_HINT_TEXT" = false ]; then
-  EVAL_NOTE="$EVAL_NOTE-no-hint"
-fi
-
-if [ "$RUN_WITH_BROWSING" = true ]; then
-  EVAL_NOTE="$EVAL_NOTE-with-browsing"
-fi
-
-if [ -n "$EXP_NAME" ]; then
-  EVAL_NOTE="$EVAL_NOTE-$EXP_NAME"
-fi
-# if mode != swe, add mode to the eval note
-if [ "$MODE" != "swe" ]; then
-  EVAL_NOTE="${EVAL_NOTE}-${MODE}"
-fi
-# Add condenser config to eval note if provided
-if [ -n "$EVAL_CONDENSER" ]; then
-  EVAL_NOTE="${EVAL_NOTE}-${EVAL_CONDENSER}"
-fi
-
-# export RUNTIME="remote"
-# export SANDBOX_REMOTE_RUNTIME_API_URL="https://runtime.eval.all-hands.dev"
-export NO_CHANGE_TIMEOUT_SECONDS=900 # 15 minutes
-
-function run_eval() {
-  local eval_note="${1}"
-  COMMAND="poetry run python evaluation/benchmarks/swefficiency/run_infer.py \
-    --agent-cls $AGENT \
-    --llm-config $MODEL_CONFIG \
-    --max-iterations $MAX_ITER \
-    --eval-num-workers $NUM_WORKERS \
-    --eval-note $eval_note \
-    --dataset $DATASET \
-    --split $SPLIT \
-    --mode $MODE"
-
-  if [ -n "$EVAL_LIMIT" ]; then
-    echo "EVAL_LIMIT: $EVAL_LIMIT"
-    COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
-  fi
-
-  # Run the command
-  eval $COMMAND
-}
-
-unset SANDBOX_ENV_GITHUB_TOKEN # prevent the agent from using the github token to push
-if [ -z "$N_RUNS" ]; then
-  N_RUNS=1
-  echo "N_RUNS not specified, use default $N_RUNS"
-fi
-
-# Skip runs if the run number is in the SKIP_RUNS list
-# read from env variable SKIP_RUNS as a comma separated list of run numbers
-SKIP_RUNS=(${SKIP_RUNS//,/ })
-for i in $(seq 1 $N_RUNS); do
-  if [[ " ${SKIP_RUNS[@]} " =~ " $i " ]]; then
-    echo "Skipping run $i"
-    continue
-  fi
-  current_eval_note="$EVAL_NOTE-run_$i"
-  echo "EVAL_NOTE: $current_eval_note"
-  run_eval $current_eval_note
-done
-
-checkout_original_branch
--- a/evaluation/benchmarks/swefficiency/scripts/setup/instance_swe_entry.sh
+++ b/evaluation/benchmarks/swefficiency/scripts/setup/instance_swe_entry.sh
@@ -1,43 +0,0 @@
-#!/usr/bin/env bash
-
-source ~/.bashrc
-SWEUTIL_DIR=/swe_util
-
-# FIXME: Cannot read SWE_INSTANCE_ID from the environment variable
-# SWE_INSTANCE_ID=django__django-11099
-if [ -z "$SWE_INSTANCE_ID" ]; then
-    echo "Error: SWE_INSTANCE_ID is not set." >&2
-    exit 1
-fi
-
-# Read the swe-bench-test-lite.json file and extract the required item based on instance_id
-item=$(jq --arg INSTANCE_ID "$SWE_INSTANCE_ID" '.[] | select(.instance_id == $INSTANCE_ID)' $SWEUTIL_DIR/eval_data/instances/swe-bench-instance.json)
-
-if [[ -z "$item" ]]; then
-  echo "No item found for the provided instance ID."
-  exit 1
-fi
-
-
-WORKSPACE_NAME=$(echo "$item" | jq -r '(.repo | tostring) + "__" + (.version | tostring) | gsub("/"; "__")')
-
-echo "WORKSPACE_NAME: $WORKSPACE_NAME"
-
-# Clear the workspace
-if [ -d /workspace ]; then
-    rm -rf /workspace/*
-else
-    mkdir /workspace
-fi
-# Copy repo to workspace
-if [ -d /workspace/$WORKSPACE_NAME ]; then
-    rm -rf /workspace/$WORKSPACE_NAME
-fi
-mkdir -p /workspace
-cp -r /testbed /workspace/$WORKSPACE_NAME
-
-# Activate instance-specific environment
-if [ -d /opt/miniconda3 ]; then
-    . /opt/miniconda3/etc/profile.d/conda.sh
-    conda activate testbed
-fi
--- a/evaluation/benchmarks/swefficiency/scripts/setup/prepare_swe_utils.sh
+++ b/evaluation/benchmarks/swefficiency/scripts/setup/prepare_swe_utils.sh
@@ -1,27 +0,0 @@
-#!/usr/bin/env bash
-
-set -e
-EVAL_WORKSPACE="evaluation/benchmarks/swe_bench/eval_workspace"
-mkdir -p $EVAL_WORKSPACE
-
-# 1. Prepare REPO
-echo "==== Prepare SWE-bench repo ===="
-OH_SWE_BENCH_REPO_PATH="https://github.com/All-Hands-AI/SWE-bench.git"
-OH_SWE_BENCH_REPO_BRANCH="eval"
-git clone -b $OH_SWE_BENCH_REPO_BRANCH $OH_SWE_BENCH_REPO_PATH $EVAL_WORKSPACE/OH-SWE-bench
-
-# 2. Prepare DATA
-echo "==== Prepare SWE-bench data ===="
-EVAL_IMAGE=ghcr.io/all-hands-ai/eval-swe-bench:builder_with_conda
-EVAL_WORKSPACE=$(realpath $EVAL_WORKSPACE)
-chmod +x $EVAL_WORKSPACE/OH-SWE-bench/swebench/harness/prepare_data.sh
-if [ -d $EVAL_WORKSPACE/eval_data ]; then
-    rm -r $EVAL_WORKSPACE/eval_data
-fi
-docker run \
-    -v $EVAL_WORKSPACE:/workspace \
-    -w /workspace \
-    -u $(id -u):$(id -g) \
-    -e HF_DATASETS_CACHE="/tmp" \
-    --rm -it $EVAL_IMAGE \
-    bash -c "cd OH-SWE-bench/swebench/harness && /swe_util/miniforge3/bin/conda run -n swe-bench-eval ./prepare_data.sh && mv eval_data /workspace/"
--- a/evaluation/benchmarks/swefficiency/scripts/setup/swe_entry.sh
+++ b/evaluation/benchmarks/swefficiency/scripts/setup/swe_entry.sh
@@ -1,96 +0,0 @@
-#!/usr/bin/env bash
-
-set -e
-
-# assert user name is `root`
-if [ "$USER" != "root" ]; then
-    echo "Error: This script is intended to be run by the 'root' user only." >&2
-    exit 1
-fi
-
-source ~/.bashrc
-
-SWEUTIL_DIR=/swe_util
-
-# Create logs directory
-LOG_DIR=/openhands/logs
-mkdir -p $LOG_DIR && chmod 777 $LOG_DIR
-
-# FIXME: Cannot read SWE_INSTANCE_ID from the environment variable
-# SWE_INSTANCE_ID=django__django-11099
-if [ -z "$SWE_INSTANCE_ID" ]; then
-    echo "Error: SWE_INSTANCE_ID is not set." >&2
-    exit 1
-fi
-
-# Read the swe-bench-test-lite.json file and extract the required item based on instance_id
-item=$(jq --arg INSTANCE_ID "$SWE_INSTANCE_ID" '.[] | select(.instance_id == $INSTANCE_ID)' $SWEUTIL_DIR/eval_data/instances/swe-bench-test-lite.json)
-
-if [[ -z "$item" ]]; then
-  echo "No item found for the provided instance ID."
-  exit 1
-fi
-
-CONDA_ENV_NAME=$(echo "$item" | jq -r '.repo + "__" + .version | gsub("/"; "__")')
-
-echo "CONDA_ENV_NAME: $CONDA_ENV_NAME"
-
-SWE_TASK_DIR=/openhands/swe_tasks
-mkdir -p $SWE_TASK_DIR
-# Dump test_patch to /workspace/test.patch
-echo "$item" | jq -r '.test_patch' > $SWE_TASK_DIR/test.patch
-# Dump patch to /workspace/gold.patch
-echo "$item" | jq -r '.patch' > $SWE_TASK_DIR/gold.patch
-# Dump the item to /workspace/instance.json except for the "test_patch" and "patch" fields
-echo "$item" | jq 'del(.test_patch, .patch)' > $SWE_TASK_DIR/instance.json
-
-# Clear the workspace
-rm -rf /workspace/*
-# Copy repo to workspace
-if [ -d /workspace/$CONDA_ENV_NAME ]; then
-    rm -rf /workspace/$CONDA_ENV_NAME
-fi
-cp -r $SWEUTIL_DIR/eval_data/testbeds/$CONDA_ENV_NAME /workspace
-
-# Reset swe-bench testbed and install the repo
-. $SWEUTIL_DIR/miniforge3/etc/profile.d/conda.sh
-conda config --set changeps1 False
-conda config --append channels conda-forge
-conda activate swe-bench-eval
-
-mkdir -p $SWE_TASK_DIR/reset_testbed_temp
-mkdir -p $SWE_TASK_DIR/reset_testbed_log_dir
-SWE_BENCH_DIR=/swe_util/OH-SWE-bench
-output=$(
-    export PYTHONPATH=$SWE_BENCH_DIR && \
-    cd $SWE_BENCH_DIR && \
-    python swebench/harness/reset_swe_env.py \
-    --swe_bench_tasks $SWEUTIL_DIR/eval_data/instances/swe-bench-test.json \
-    --temp_dir $SWE_TASK_DIR/reset_testbed_temp \
-    --testbed /workspace \
-    --conda_path $SWEUTIL_DIR/miniforge3 \
-    --instance_id $SWE_INSTANCE_ID \
-    --log_dir $SWE_TASK_DIR/reset_testbed_log_dir \
-    --timeout 900 \
-    --verbose
-)
-
-REPO_PATH=$(echo "$output" | awk -F': ' '/repo_path:/ {print $2}')
-TEST_CMD=$(echo "$output" | awk -F': ' '/test_cmd:/ {print $2}')
-echo "Repo Path: $REPO_PATH"
-echo "Test Command: $TEST_CMD"
-
-echo "export SWE_BENCH_DIR=\"$SWE_BENCH_DIR\"" >> ~/.bashrc
-echo "export REPO_PATH=\"$REPO_PATH\"" >> ~/.bashrc
-echo "export TEST_CMD=\"$TEST_CMD\"" >> ~/.bashrc
-
-if [[ "$REPO_PATH" == "None" ]]; then
-    echo "Error: Failed to retrieve repository path. Tests may not have passed or output was not as expected." >&2
-    exit 1
-fi
-
-# Activate instance-specific environment
-. $SWEUTIL_DIR/miniforge3/etc/profile.d/conda.sh
-conda activate $CONDA_ENV_NAME
-
-set +e
--- a/evaluation/integration_tests/README.md
+++ b/evaluation/integration_tests/README.md
@@ -0,0 +1,69 @@
+# Integration tests
+
+This directory implements the integration tests that [used to run in CI](https://github.com/OpenHands/OpenHands/tree/23d3becf1d6f5d07e592f7345750c314a826b4e9/tests/integration).
+
+[PR 3985](https://github.com/OpenHands/OpenHands/pull/3985) introduced LLM-based editing, which requires access to an LLM to perform edits. Hence, we removed the integration tests from CI and intend to run them as a nightly evaluation to ensure the quality of the OpenHands software.
+
+## To add new tests
+
+Each test is a file named like `tXX_testname.py`, where `XX` is a number.
+Make sure each test's file name starts with `t` and ends with `.py`.
+
+Each test should be structured as a subclass of [`BaseIntegrationTest`](./tests/base.py), where you need to implement `initialize_runtime`, which sets up the runtime environment before the test, and `verify_result`, which takes a `Runtime` and the history of `Event`s and returns a `TestResult`. See [t01_fix_simple_typo.py](./tests/t01_fix_simple_typo.py) and [t05_simple_browsing.py](./tests/t05_simple_browsing.py) for two representative examples.
+
+```python
+class TestResult(BaseModel):
+    success: bool
+    reason: str | None = None
+
+
+class BaseIntegrationTest(ABC):
+    """Base class for integration tests."""
+
+    INSTRUCTION: str
+
+    @classmethod
+    @abstractmethod
+    def initialize_runtime(cls, runtime: Runtime) -> None:
+        """Initialize the runtime for the test to run."""
+        pass
+
+    @classmethod
+    @abstractmethod
+    def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
+        """Verify the result of the test.
+
+        This method will be called after the agent performs the task on the runtime.
+        """
+        pass
+```
+
+
+## Setup Environment and LLM Configuration
+
+Please follow the instructions [here](../README.md#setup) to set up your local
+development environment and LLM.
+
+## Start the evaluation
+
+```bash
+./evaluation/integration_tests/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iterations] [eval-num-workers] [eval_ids]
+```
+
+- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for
+    your LLM settings, as defined in your `config.toml`.
+- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version
+    you would like to evaluate. It could also be a release tag like `0.9.0`.
+- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks,
+    defaulting to `CodeActAgent`.
+- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit`
+    instances. By default, the script evaluates every test under `tests/`.
+    Note: in order to use `eval_limit`, you must also set `agent`.
+- `max_iterations`: the maximum number of agent iterations. Default: `10`.
+- `eval-num-workers`: the number of workers to use for evaluation. Default: `1`.
+- `eval_ids`, e.g. `"1,3,10"`, limits the evaluation to the given comma-separated instance IDs.
+
+Example:
+```bash
+./evaluation/integration_tests/scripts/run_infer.sh llm.claude-35-sonnet-eval HEAD CodeActAgent
+```
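Review note: the naming rule in this README (`t` prefix, `.py` suffix) is what test discovery relies on. The sketch below illustrates that rule with the standard library only. `discover_tests` and the demo file names are hypothetical, not part of the PR. It also shows why `os.path.splitext` is a safer way to derive an instance ID than `str.rstrip('.py')`, which strips a set of trailing characters rather than a suffix.

```python
import os
import tempfile


def discover_tests(test_dir: str) -> dict[str, str]:
    """Map instance IDs to file paths for files named like tXX_testname.py."""
    found = {}
    for name in sorted(os.listdir(test_dir)):
        if name.startswith('t') and name.endswith('.py'):
            # splitext removes only the final extension; str.rstrip('.py')
            # would instead strip any trailing '.', 'p', or 'y' characters.
            instance_id = os.path.splitext(name)[0]
            found[instance_id] = os.path.join(test_dir, name)
    return found


# Demo with hypothetical file names in a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ('t01_fix_simple_typo.py', 't99_copy.py', 'base.py', 'notes.txt'):
        open(os.path.join(demo_dir, name), 'w').close()
    demo_ids = list(discover_tests(demo_dir))

print(demo_ids)
```

Only the two `t*.py` files are picked up; `base.py` and `notes.txt` are ignored because they fail the prefix or suffix check.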
--- a/evaluation/benchmarks/swefficiency/__init__.py
+++ b/evaluation/benchmarks/swefficiency/__init__.py
--- a/evaluation/integration_tests/run_infer.py
+++ b/evaluation/integration_tests/run_infer.py
@@ -0,0 +1,251 @@
+import asyncio
+import importlib.util
+import os
+
+import pandas as pd
+
+from evaluation.integration_tests.tests.base import BaseIntegrationTest, TestResult
+from evaluation.utils.shared import (
+    EvalMetadata,
+    EvalOutput,
+    get_default_sandbox_config_for_eval,
+    get_metrics,
+    get_openhands_config_for_eval,
+    make_metadata,
+    prepare_dataset,
+    reset_logger_for_multiprocessing,
+    run_evaluation,
+    update_llm_config_for_completions_logging,
+)
+from evaluation.utils.shared import (
+    codeact_user_response as fake_user_response,
+)
+from openhands.controller.state.state import State
+from openhands.core.config import (
+    AgentConfig,
+    OpenHandsConfig,
+    get_evaluation_parser,
+    get_llm_config_arg,
+)
+from openhands.core.logger import openhands_logger as logger
+from openhands.core.main import create_runtime, run_controller
+from openhands.events.action import MessageAction
+from openhands.events.serialization.event import event_to_dict
+from openhands.runtime.base import Runtime
+from openhands.utils.async_utils import call_async_from_sync
+
+FAKE_RESPONSES = {
+    'CodeActAgent': fake_user_response,
+    'VisualBrowsingAgent': fake_user_response,
+}
+
+
+def get_config(
+    metadata: EvalMetadata,
+    instance_id: str,
+) -> OpenHandsConfig:
+    sandbox_config = get_default_sandbox_config_for_eval()
+    sandbox_config.platform = 'linux/amd64'
+    config = get_openhands_config_for_eval(
+        metadata=metadata,
+        runtime=os.environ.get('RUNTIME', 'docker'),
+        sandbox_config=sandbox_config,
+    )
+    config.debug = True
+    config.set_llm_config(
+        update_llm_config_for_completions_logging(
+            metadata.llm_config, metadata.eval_output_dir, instance_id
+        )
+    )
+    agent_config = AgentConfig(
+        enable_jupyter=True,
+        enable_browsing=True,
+        enable_llm_editor=False,
+    )
+    config.set_agent_config(agent_config)
+    return config
+
+
+def process_instance(
+    instance: pd.Series,
+    metadata: EvalMetadata,
+    reset_logger: bool = True,
+) -> EvalOutput:
+    config = get_config(metadata, instance.instance_id)
+
+    # Setup the logger properly, so you can run multi-processing to parallelize the evaluation
+    if reset_logger:
+        log_dir = os.path.join(metadata.eval_output_dir, 'infer_logs')
+        reset_logger_for_multiprocessing(logger, str(instance.instance_id), log_dir)
+    else:
+        logger.info(
+            f'\nStarting evaluation for instance {str(instance.instance_id)}.\n'
+        )
+
+    # =============================================
+    # import test instance
+    # =============================================
+    instance_id = instance.instance_id
+    spec = importlib.util.spec_from_file_location(instance_id, instance.file_path)
+    test_module = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(test_module)
+    assert hasattr(test_module, 'Test'), (
+        f'Test module {instance_id} does not have a Test class'
+    )
+
+    test_class: type[BaseIntegrationTest] = test_module.Test
+    assert issubclass(test_class, BaseIntegrationTest), (
+        f'Test class {instance_id} does not inherit from BaseIntegrationTest'
+    )
+
+    instruction = test_class.INSTRUCTION
+
+    # =============================================
+    # create sandbox and run the agent
+    # =============================================
+    runtime: Runtime = create_runtime(config)
+    call_async_from_sync(runtime.connect)
+    try:
+        test_class.initialize_runtime(runtime)
+
+        # Here's how you can run the agent (similar to the `main` function) and get the final task state
+        state: State | None = asyncio.run(
+            run_controller(
+                config=config,
+                initial_user_action=MessageAction(content=instruction),
+                runtime=runtime,
+                fake_user_response_fn=FAKE_RESPONSES[metadata.agent_class],
+            )
+        )
+        if state is None:
+            raise ValueError('State should not be None.')
+
+        # # =============================================
+        # # result evaluation
+        # # =============================================
+
+        histories = state.history
+
+        # some basic check
+        logger.info(f'Total events in history: {len(histories)}')
+        assert len(histories) > 0, 'History should not be empty'
+
+        test_result: TestResult = test_class.verify_result(runtime, histories)
+        metrics = get_metrics(state)
+    finally:
+        runtime.close()
+
+    # Save the output
+    output = EvalOutput(
+        instance_id=str(instance.instance_id),
+        instance=instance.to_dict(),
+        instruction=instruction,
+        metadata=metadata,
+        history=[event_to_dict(event) for event in histories],
+        metrics=metrics,
+        error=state.last_error if state and state.last_error else None,
+        test_result=test_result.model_dump(),
+    )
+    return output
+
+
+def load_integration_tests() -> pd.DataFrame:
+    """Load tests from python files under ./tests"""
+    cur_dir = os.path.dirname(os.path.abspath(__file__))
+    test_dir = os.path.join(cur_dir, 'tests')
+    test_files = [
+        os.path.join(test_dir, f)
+        for f in os.listdir(test_dir)
+        if f.startswith('t') and f.endswith('.py')
+    ]
+    df = pd.DataFrame(test_files, columns=['file_path'])
+    df['instance_id'] = df['file_path'].apply(
+        lambda x: os.path.splitext(os.path.basename(x))[0]
+    )
+    return df
+
+
+if __name__ == '__main__':
+    parser = get_evaluation_parser()
+    args, _ = parser.parse_known_args()
+    integration_tests = load_integration_tests()
+
+    llm_config = None
+    if args.llm_config:
+        llm_config = get_llm_config_arg(args.llm_config)
+
+    if llm_config is None:
+        raise ValueError(f'Could not find LLM config: --llm-config {args.llm_config}')
+
+    metadata = make_metadata(
+        llm_config,
+        'integration_tests',
+        args.agent_cls,
+        args.max_iterations,
+        args.eval_note,
+        args.eval_output_dir,
+    )
+    output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
+
+    # Parse dataset IDs if provided
+    eval_ids = None
+    if args.eval_ids:
+        eval_ids = str(args.eval_ids).split(',')
+        logger.info(f'\nUsing specific dataset IDs: {eval_ids}\n')
+
+    instances = prepare_dataset(
+        integration_tests,
+        output_file,
+        args.eval_n_limit,
+        eval_ids=eval_ids,
+    )
+
+    run_evaluation(
+        instances,
+        metadata,
+        output_file,
+        args.eval_num_workers,
+        process_instance,
+    )
+
+    df = pd.read_json(output_file, lines=True, orient='records')
+
+    # record success and reason
+    df['success'] = df['test_result'].apply(lambda x: x['success'])
+    df['reason'] = df['test_result'].apply(lambda x: x['reason'])
+    logger.info('-' * 100)
+    logger.info(
+        f'Success rate: {df["success"].mean():.2%} ({df["success"].sum()}/{len(df)})'
+    )
+    logger.info(
+        '\nEvaluation Results:'
+        + '\n'
+        + df[['instance_id', 'success', 'reason']].to_string(index=False)
+    )
+    logger.info('-' * 100)
+
+    # record cost for each instance, with 3 decimal places
+    # we sum up all the "costs" from the metrics array
+    df['cost'] = df['metrics'].apply(
+        lambda m: round(sum(c['cost'] for c in m['costs']), 3)
+        if m and 'costs' in m
+        else 0.0
+    )
+
+    # capture the top-level error if present, per instance
+    df['error_message'] = df.get('error', None)
+
+    logger.info(f'Total cost: USD {df["cost"].sum():.2f}')
+
+    report_file = os.path.join(metadata.eval_output_dir, 'report.md')
+    with open(report_file, 'w') as f:
+        f.write(
+            f'Success rate: {df["success"].mean():.2%}'
+            f' ({df["success"].sum()}/{len(df)})\n'
+        )
+        f.write(f'\nTotal cost: USD {df["cost"].sum():.2f}\n')
+        f.write(
+            df[
+                ['instance_id', 'success', 'reason', 'cost', 'error_message']
+            ].to_markdown(index=False)
+        )
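Review note: `process_instance` loads each test file dynamically and expects it to expose a class named `Test`. A minimal standalone sketch of that loading step follows. `load_test_class`, `TEST_SOURCE`, and the file name `t01_demo.py` are hypothetical stand-ins, and the demo class does not subclass `BaseIntegrationTest` as a real test would.

```python
import importlib.util
import os
import tempfile

# Hypothetical stand-in for a test file; real tests subclass BaseIntegrationTest.
TEST_SOURCE = "class Test:\n    INSTRUCTION = 'Fix typos in bad.txt.'\n"


def load_test_class(instance_id: str, file_path: str) -> type:
    """Import a module from an explicit path and return its Test class."""
    spec = importlib.util.spec_from_file_location(instance_id, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f'Cannot load {file_path}')
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    if not hasattr(module, 'Test'):
        raise AttributeError(f'Test module {instance_id} does not have a Test class')
    return module.Test


with tempfile.TemporaryDirectory() as demo_dir:
    path = os.path.join(demo_dir, 't01_demo.py')
    with open(path, 'w') as f:
        f.write(TEST_SOURCE)
    loaded = load_test_class('t01_demo', path)

print(loaded.INSTRUCTION)
```

Raising explicit exceptions instead of using `assert` is a choice made for this sketch only; it keeps the checks active when Python runs with `-O`.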
--- a/evaluation/integration_tests/scripts/run_infer.sh
+++ b/evaluation/integration_tests/scripts/run_infer.sh
@@ -0,0 +1,62 @@
+#!/usr/bin/env bash
+set -eo pipefail
+
+source "evaluation/utils/version_control.sh"
+
+MODEL_CONFIG=$1
+COMMIT_HASH=$2
+AGENT=$3
+EVAL_LIMIT=$4
+MAX_ITERATIONS=$5
+NUM_WORKERS=$6
+EVAL_IDS=$7
+
+if [ -z "$NUM_WORKERS" ]; then
+  NUM_WORKERS=1
+  echo "Number of workers not specified, use default $NUM_WORKERS"
+fi
+checkout_eval_branch
+
+if [ -z "$AGENT" ]; then
+  echo "Agent not specified, use default CodeActAgent"
+  AGENT="CodeActAgent"
+fi
+
+get_openhands_version
+
+echo "AGENT: $AGENT"
+echo "OPENHANDS_VERSION: $OPENHANDS_VERSION"
+echo "MODEL_CONFIG: $MODEL_CONFIG"
+
+EVAL_NOTE=$OPENHANDS_VERSION
+
+# Default to NOT use unit tests.
+if [ -z "$USE_UNIT_TESTS" ]; then
+  export USE_UNIT_TESTS=false
+fi
+echo "USE_UNIT_TESTS: $USE_UNIT_TESTS"
+# If unit tests are used, append -w-test to EVAL_NOTE
+if [ "$USE_UNIT_TESTS" = true ]; then
+  EVAL_NOTE=$EVAL_NOTE-w-test
+fi
+
+# export PYTHONPATH=evaluation/integration_tests:\$PYTHONPATH
+COMMAND="poetry run python evaluation/integration_tests/run_infer.py \
+  --agent-cls $AGENT \
+  --llm-config $MODEL_CONFIG \
+  --max-iterations ${MAX_ITERATIONS:-10} \
+  --eval-num-workers $NUM_WORKERS \
+  --eval-note $EVAL_NOTE"
+
+if [ -n "$EVAL_LIMIT" ]; then
+  echo "EVAL_LIMIT: $EVAL_LIMIT"
+  COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
+fi
+
+if [ -n "$EVAL_IDS" ]; then
+  echo "EVAL_IDS: $EVAL_IDS"
+  COMMAND="$COMMAND --eval-ids $EVAL_IDS"
+fi
+
+# Run the command
+eval $COMMAND
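Review note: the script maps seven positional arguments onto flags of `run_infer.py`, with defaults for the agent, iteration limit, and worker count. The Python sketch below mirrors that mapping so the defaults are easy to check at a glance. `build_command` and the `eval_note` placeholder are hypothetical. The real script derives the note from the OpenHands version and runs the command through `eval`.

```python
import shlex


def build_command(
    model_config: str,
    agent: str = '',
    eval_limit: str = '',
    max_iterations: str = '',
    num_workers: str = '',
    eval_ids: str = '',
    eval_note: str = 'v0.0.0',  # placeholder; the script uses the OpenHands version
) -> list[str]:
    """Build the run_infer.py argument list, applying the script's defaults."""
    cmd = [
        'poetry', 'run', 'python', 'evaluation/integration_tests/run_infer.py',
        '--agent-cls', agent or 'CodeActAgent',
        '--llm-config', model_config,
        '--max-iterations', max_iterations or '10',
        '--eval-num-workers', num_workers or '1',
        '--eval-note', eval_note,
    ]
    # Optional flags are appended only when the corresponding value is non-empty.
    if eval_limit:
        cmd += ['--eval-n-limit', eval_limit]
    if eval_ids:
        cmd += ['--eval-ids', eval_ids]
    return cmd


demo_cmd = build_command('llm.claude-35-sonnet-eval', eval_ids='1,3,10')
print(shlex.join(demo_cmd))
```

Building a list rather than a single string avoids the word-splitting issues that `eval $COMMAND` can run into when a value contains spaces.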
--- a/evaluation/integration_tests/tests/__init__.py
+++ b/evaluation/integration_tests/tests/__init__.py
--- a/evaluation/integration_tests/tests/base.py
+++ b/evaluation/integration_tests/tests/base.py
@@ -0,0 +1,32 @@
+from abc import ABC, abstractmethod
+
+from pydantic import BaseModel
+
+from openhands.events.event import Event
+from openhands.runtime.base import Runtime
+
+
+class TestResult(BaseModel):
+    success: bool
+    reason: str | None = None
+
+
+class BaseIntegrationTest(ABC):
+    """Base class for integration tests."""
+
+    INSTRUCTION: str
+
+    @classmethod
+    @abstractmethod
+    def initialize_runtime(cls, runtime: Runtime) -> None:
+        """Initialize the runtime for the test to run."""
+        pass
+
+    @classmethod
+    @abstractmethod
+    def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
+        """Verify the result of the test.
+
+        This method will be called after the agent performs the task on the runtime.
+        """
+        pass
--- a/evaluation/integration_tests/tests/t01_fix_simple_typo.py
+++ b/evaluation/integration_tests/tests/t01_fix_simple_typo.py
@@ -0,0 +1,39 @@
+import os
+import tempfile
+
+from evaluation.integration_tests.tests.base import BaseIntegrationTest, TestResult
+from openhands.events.action import CmdRunAction
+from openhands.events.event import Event
+from openhands.runtime.base import Runtime
+
+
+class Test(BaseIntegrationTest):
+    INSTRUCTION = 'Fix typos in bad.txt.'
+
+    @classmethod
+    def initialize_runtime(cls, runtime: Runtime) -> None:
+        # create a file with a typo in /workspace/bad.txt
+        with tempfile.TemporaryDirectory() as temp_dir:
+            temp_file_path = os.path.join(temp_dir, 'bad.txt')
+            with open(temp_file_path, 'w') as f:
+                f.write('This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!')
+
+            # Copy the file to the desired location
+            runtime.copy_to(temp_file_path, '/workspace')
+
+    @classmethod
+    def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
+        # read back the contents of /workspace/bad.txt
+        action = CmdRunAction(command='cat /workspace/bad.txt')
+        obs = runtime.run_action(action)
+        if obs.exit_code != 0:
+            return TestResult(
+                success=False, reason=f'Failed to run command: {obs.content}'
+            )
+        # check if the file /workspace/bad.txt has been fixed
+        if (
+            obs.content.strip().replace('\r\n', '\n')
+            == 'This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!'
+        ):
+            return TestResult(success=True)
+        return TestResult(success=False, reason=f'File not fixed: {obs.content}')
--- a/evaluation/integration_tests/tests/t02_add_bash_hello.py
+++ b/evaluation/integration_tests/tests/t02_add_bash_hello.py
@@ -0,0 +1,40 @@
+from evaluation.integration_tests.tests.base import BaseIntegrationTest, TestResult
+from evaluation.utils.shared import assert_and_raise
+from openhands.events.action import CmdRunAction
+from openhands.events.event import Event
+from openhands.runtime.base import Runtime
+
+
+class Test(BaseIntegrationTest):
+    INSTRUCTION = "Write a shell script '/workspace/hello.sh' that prints 'hello'."
+
+    @classmethod
+    def initialize_runtime(cls, runtime: Runtime) -> None:
+        action = CmdRunAction(command='mkdir -p /workspace')
+        obs = runtime.run_action(action)
+        assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')
+
+    @classmethod
+    def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
+        # check if the file /workspace/hello.sh exists
+        action = CmdRunAction(command='cat /workspace/hello.sh')
+        obs = runtime.run_action(action)
+        if obs.exit_code != 0:
+            return TestResult(
+                success=False,
+                reason=f'Failed to cat /workspace/hello.sh: {obs.content}.',
+            )
+
+        # execute the script
+        action = CmdRunAction(command='bash /workspace/hello.sh')
+        obs = runtime.run_action(action)
+        if obs.exit_code != 0:
+            return TestResult(
+                success=False,
+                reason=f'Failed to execute /workspace/hello.sh: {obs.content}.',
+            )
+        if obs.content.strip() != 'hello':
+            return TestResult(
+                success=False, reason=f'Script did not print "hello": {obs.content}.'
+            )
+        return TestResult(success=True)
--- a/evaluation/integration_tests/tests/t03_jupyter_write_file.py
+++ b/evaluation/integration_tests/tests/t03_jupyter_write_file.py
@@ -0,0 +1,43 @@
+from evaluation.integration_tests.tests.base import BaseIntegrationTest, TestResult
+from evaluation.utils.shared import assert_and_raise
+from openhands.events.action import CmdRunAction
+from openhands.events.event import Event
+from openhands.runtime.base import Runtime
+
+
+class Test(BaseIntegrationTest):
+    INSTRUCTION = "Use Jupyter IPython to write a text file containing 'hello world' to '/workspace/test.txt'."
+
+    @classmethod
+    def initialize_runtime(cls, runtime: Runtime) -> None:
+        action = CmdRunAction(command='mkdir -p /workspace')
+        obs = runtime.run_action(action)
+        assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')
+
+    @classmethod
+    def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
+        # check if the file /workspace/test.txt exists
+        action = CmdRunAction(command='cat /workspace/test.txt')
+        obs = runtime.run_action(action)
+        if obs.exit_code != 0:
+            return TestResult(
+                success=False,
+                reason=f'Failed to cat /workspace/test.txt: {obs.content}.',
+            )
+
+        # read the file contents
+        action = CmdRunAction(command='cat /workspace/test.txt')
+        obs = runtime.run_action(action)
+
+        if obs.exit_code != 0:
+            return TestResult(
+                success=False,
+                reason=f'Failed to cat /workspace/test.txt: {obs.content}.',
+            )
+
+        if 'hello world' not in obs.content.strip():
+            return TestResult(
+                success=False,
+                reason=f'File did not contain "hello world": {obs.content}.',
+            )
+        return TestResult(success=True)
--- a/evaluation/integration_tests/tests/t04_git_staging.py
+++ b/evaluation/integration_tests/tests/t04_git_staging.py
@@ -0,0 +1,57 @@
+from evaluation.integration_tests.tests.base import BaseIntegrationTest, TestResult
+from evaluation.utils.shared import assert_and_raise
+from openhands.events.action import CmdRunAction
+from openhands.events.event import Event
+from openhands.runtime.base import Runtime
+
+
+class Test(BaseIntegrationTest):
+    INSTRUCTION = 'Write a git commit message for the current staging area and commit the changes.'
+
+    @classmethod
+    def initialize_runtime(cls, runtime: Runtime) -> None:
+        action = CmdRunAction(command='mkdir -p /workspace')
+        obs = runtime.run_action(action)
+        assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')
+
+        # git init
+        action = CmdRunAction(command='git init')
+        obs = runtime.run_action(action)
+        assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')
+
+        # create file
+        action = CmdRunAction(command='echo \'print("hello world")\' > hello.py')
+        obs = runtime.run_action(action)
+        assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')
+
+        # git add
+        cmd_str = 'git add hello.py'
+        action = CmdRunAction(command=cmd_str)
+        obs = runtime.run_action(action)
+        assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')
+
+    @classmethod
+    def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
+        # check if the file /workspace/hello.py exists
+        action = CmdRunAction(command='cat /workspace/hello.py')
+        obs = runtime.run_action(action)
+        if obs.exit_code != 0:
+            return TestResult(
+                success=False,
+                reason=f'Failed to cat /workspace/hello.py: {obs.content}.',
+            )
+
+        # check if the staging area is empty
+        action = CmdRunAction(command='git status')
+        obs = runtime.run_action(action)
+        if obs.exit_code != 0:
+            return TestResult(
+                success=False, reason=f'Failed to git status: {obs.content}.'
+            )
+        if 'nothing to commit, working tree clean' in obs.content.strip():
+            return TestResult(success=True)
+
+        return TestResult(
+            success=False,
+            reason=f'Failed to check for "nothing to commit, working tree clean": {obs.content}.',
+        )
--- a/evaluation/integration_tests/tests/t05_simple_browsing.py
+++ b/evaluation/integration_tests/tests/t05_simple_browsing.py
@@ -0,0 +1,145 @@
+import os
+import tempfile
+
+from evaluation.integration_tests.tests.base import BaseIntegrationTest, TestResult
+from evaluation.utils.shared import assert_and_raise
+from openhands.events.action import AgentFinishAction, CmdRunAction, MessageAction
+from openhands.events.event import Event
+from openhands.events.observation import AgentDelegateObservation
+from openhands.runtime.base import Runtime
+
+HTML_FILE = """
+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>The Ultimate Answer</title>
+    <style>
+        body {
+            display: flex;
+            justify-content: center;
+            align-items: center;
+            height: 100vh;
+            margin: 0;
+            background: linear-gradient(to right, #1e3c72, #2a5298);
+            color: #fff;
+            font-family: 'Arial', sans-serif;
+            text-align: center;
+        }
+        .container {
+            text-align: center;
+            padding: 20px;
+            background: rgba(255, 255, 255, 0.1);
+            border-radius: 10px;
+            box-shadow: 0 0 10px rgba(0, 0, 0, 0.2);
+        }
+        h1 {
+            font-size: 36px;
+            margin-bottom: 20px;
+        }
+        p {
+            font-size: 18px;
+            margin-bottom: 30px;
+        }
+        #showButton {
+            padding: 10px 20px;
+            font-size: 16px;
+            color: #1e3c72;
+            background: #fff;
+            border: none;
+            border-radius: 5px;
+            cursor: pointer;
+            transition: background 0.3s ease;
+        }
+        #showButton:hover {
+            background: #f0f0f0;
+        }
+        #result {
+            margin-top: 20px;
+            font-size: 24px;
+        }
+    </style>
+</head>
+<body>
+    <div class="container">
+        <h1>The Ultimate Answer</h1>
+        <p>Click the button to reveal the answer to life, the universe, and everything.</p>
+        <button id="showButton">Click me</button>
+        <div id="result"></div>
+    </div>
+    <script>
+        document.getElementById('showButton').addEventListener('click', function() {
+            document.getElementById('result').innerText = 'The answer is OpenHands is all you need!';
+        });
+    </script>
+</body>
+</html>
+"""
+
+
+class Test(BaseIntegrationTest):
+    INSTRUCTION = 'Browse localhost:8000, and tell me the ultimate answer to life.'
+
+    @classmethod
+    def initialize_runtime(cls, runtime: Runtime) -> None:
+        action = CmdRunAction(command='mkdir -p /workspace')
+        obs = runtime.run_action(action)
+        assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')
+
+        action = CmdRunAction(command='mkdir -p /tmp/server')
+        obs = runtime.run_action(action)
+        assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')
+
+        # write index.html to a temporary directory, then copy it into /tmp/server
+        with tempfile.TemporaryDirectory() as temp_dir:
+            temp_file_path = os.path.join(temp_dir, 'index.html')
+            with open(temp_file_path, 'w') as f:
+                f.write(HTML_FILE)
+            # Copy the file to the desired location
+            runtime.copy_to(temp_file_path, '/tmp/server')
+
+        # start a background HTTP server on port 8000 that serves /tmp/server
+        action = CmdRunAction(
+            command='cd /tmp/server && nohup python3 -m http.server 8000 &'
+        )
+        obs = runtime.run_action(action)
+
+    @classmethod
+    def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
+        from openhands.core.logger import openhands_logger as logger
+
+        # check if the "The answer is OpenHands is all you need!" is in any message
+        message_actions = [
+            event
+            for event in histories
+            if isinstance(
+                event, (MessageAction, AgentFinishAction, AgentDelegateObservation)
+            )
+        ]
+        logger.debug(f'Total message-like events: {len(message_actions)}')
+
+        for event in message_actions:
+            try:
+                if isinstance(event, AgentDelegateObservation):
+                    content = event.content
+                elif isinstance(event, AgentFinishAction):
+                    content = event.outputs.get('content', '')
+                elif isinstance(event, MessageAction):
+                    content = event.content
+                else:
+                    logger.warning(f'Unexpected event type: {type(event)}')
+                    continue
+
+                if 'OpenHands is all you need!' in content:
+                    return TestResult(success=True)
+            except Exception as e:
+                logger.error(f'Error processing event: {e}')
+
+        logger.debug(
+            f'Total messages: {len(message_actions)}. Messages: {message_actions}'
+        )
+        return TestResult(
+            success=False,
+            reason=f'The answer is not found in any message. Total messages: {len(message_actions)}.',
+        )
--- a/evaluation/integration_tests/tests/t06_github_pr_browsing.py
+++ b/evaluation/integration_tests/tests/t06_github_pr_browsing.py
@@ -0,0 +1,58 @@
+from evaluation.integration_tests.tests.base import BaseIntegrationTest, TestResult
+from openhands.events.action import AgentFinishAction, MessageAction
+from openhands.events.event import Event
+from openhands.events.observation import AgentDelegateObservation
+from openhands.runtime.base import Runtime
+
+
+class Test(BaseIntegrationTest):
+    INSTRUCTION = 'Look at https://github.com/OpenHands/OpenHands/pull/8, and tell me what is happening there and what did @asadm suggest.'
+
+    @classmethod
+    def initialize_runtime(cls, runtime: Runtime) -> None:
+        pass
+
+    @classmethod
+    def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
+        from openhands.core.logger import openhands_logger as logger
+
+        # check if the license information is in any message
+        message_actions = [
+            event
+            for event in histories
+            if isinstance(
+                event, (MessageAction, AgentFinishAction, AgentDelegateObservation)
+            )
+        ]
+        logger.info(f'Total message-like events: {len(message_actions)}')
+
+        for event in message_actions:
+            try:
+                if isinstance(event, AgentDelegateObservation):
+                    content = event.content
+                elif isinstance(event, AgentFinishAction):
+                    content = event.outputs.get('content', '')
+                    if event.thought:
+                        content += f'\n\n{event.thought}'
+                elif isinstance(event, MessageAction):
+                    content = event.content
+                else:
+                    logger.warning(f'Unexpected event type: {type(event)}')
+                    continue
+
+                if (
+                    'non-commercial' in content
+                    or 'MIT' in content
+                    or 'Apache 2.0' in content
+                ):
+                    return TestResult(success=True)
+            except Exception as e:
+                logger.error(f'Error processing event: {e}')
+
+        logger.debug(
+            f'Total messages: {len(message_actions)}. Messages: {message_actions}'
+        )
+        return TestResult(
+            success=False,
+            reason=f'The answer is not found in any message. Total messages: {len(message_actions)}.',
+        )
--- a/evaluation/integration_tests/tests/t07_interactive_commands.py
+++ b/evaluation/integration_tests/tests/t07_interactive_commands.py
@@ -0,0 +1,73 @@
+import hashlib
+
+from evaluation.integration_tests.tests.base import BaseIntegrationTest, TestResult
+from openhands.events.action import (
+    AgentFinishAction,
+    FileWriteAction,
+    MessageAction,
+)
+from openhands.events.event import Event
+from openhands.events.observation import AgentDelegateObservation
+from openhands.runtime.base import Runtime
+
+
+class Test(BaseIntegrationTest):
+    INSTRUCTION = 'Execute the python script /workspace/python_script.py with input "John" and "25" and tell me the secret number.'
+    SECRET_NUMBER = int(hashlib.sha256(str(25).encode()).hexdigest()[:8], 16) % 1000
+
+    @classmethod
+    def initialize_runtime(cls, runtime: Runtime) -> None:
+        from openhands.core.logger import openhands_logger as logger
+
+        action = FileWriteAction(
+            path='/workspace/python_script.py',
+            content=(
+                'name = input("Enter your name: "); age = input("Enter your age: "); '
+                'import hashlib; secret = int(hashlib.sha256(str(age).encode()).hexdigest()[:8], 16) % 1000; '
+                'print(f"Hello {name}, you are {age} years old. Tell you a secret number: {secret}")'
+            ),
+        )
+        logger.info(action, extra={'msg_type': 'ACTION'})
+        observation = runtime.run_action(action)
+        logger.info(observation, extra={'msg_type': 'OBSERVATION'})
+
+    @classmethod
+    def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
+        from openhands.core.logger import openhands_logger as logger
+
+        # check if the secret number is in any message
+        message_actions = [
+            event
+            for event in histories
+            if isinstance(
+                event, (MessageAction, AgentFinishAction, AgentDelegateObservation)
+            )
+        ]
+        logger.info(f'Total message-like events: {len(message_actions)}')
+
+        for event in message_actions:
+            try:
+                if isinstance(event, AgentDelegateObservation):
+                    content = event.content
+                elif isinstance(event, AgentFinishAction):
+                    content = event.outputs.get('content', '')
+                    if event.thought:
+                        content += f'\n\n{event.thought}'
+                elif isinstance(event, MessageAction):
+                    content = event.content
+                else:
+                    logger.warning(f'Unexpected event type: {type(event)}')
+                    continue
+
+                if str(cls.SECRET_NUMBER) in content:
+                    return TestResult(success=True)
+            except Exception as e:
+                logger.error(f'Error processing event: {e}')
+
+        logger.debug(
+            f'Total messages: {len(message_actions)}. Messages: {message_actions}'
+        )
+        return TestResult(
+            success=False,
+            reason=f'The answer is not found in any message. Total messages: {len(message_actions)}.',
+        )
--- a/frontend/tests/components/context-menu/account-settings-context-menu.test.tsx
+++ b/frontend/tests/components/context-menu/account-settings-context-menu.test.tsx
@@ -33,24 +33,9 @@ describe("AccountSettingsContextMenu", () => {
    expect(
      screen.getByTestId("account-settings-context-menu"),
    ).toBeInTheDocument();
-    expect(screen.getByText("SIDEBAR$DOCS")).toBeInTheDocument();
    expect(screen.getByText("ACCOUNT_SETTINGS$LOGOUT")).toBeInTheDocument();
  });

-  it("should render Documentation link with correct attributes", () => {
-    renderWithRouter(
-      <AccountSettingsContextMenu
-        onLogout={onLogoutMock}
-        onClose={onCloseMock}
-      />,
-    );
-
-    const documentationLink = screen.getByText("SIDEBAR$DOCS").closest("a");
-    expect(documentationLink).toHaveAttribute("href", "https://docs.openhands.dev");
-    expect(documentationLink).toHaveAttribute("target", "_blank");
-    expect(documentationLink).toHaveAttribute("rel", "noopener noreferrer");
-  });
-
  it("should call onLogout when the logout option is clicked", async () => {
    renderWithRouter(
      <AccountSettingsContextMenu
--- a/frontend/tests/components/features/auth-modal.test.tsx
+++ b/frontend/tests/components/features/auth-modal.test.tsx
@@ -8,13 +8,6 @@ vi.mock("#/hooks/use-auth-url", () => ({
  useAuthUrl: () => "https://gitlab.com/oauth/authorize",
 }));

-// Mock the useTracking hook
-vi.mock("#/hooks/use-tracking", () => ({
-  useTracking: () => ({
-    trackLoginButtonClick: vi.fn(),
-  }),
-}));
-
 describe("AuthModal", () => {
  beforeEach(() => {
    vi.stubGlobal("location", { href: "" });
--- a/frontend/tests/components/features/chat/task-tracking-observation-content.test.tsx
+++ b/frontend/tests/components/features/chat/task-tracking-observation-content.test.tsx
@@ -8,11 +8,10 @@ vi.mock("react-i18next", () => ({
  useTranslation: () => ({
    t: (key: string) => {
      const translations: Record<string, string> = {
-        TASK_TRACKING_OBSERVATION$TASK_LIST: "Task List",
-        TASK_TRACKING_OBSERVATION$TASK_ID: "ID",
-        TASK_TRACKING_OBSERVATION$TASK_NOTES: "Notes",
-        TASK_TRACKING_OBSERVATION$RESULT: "Result",
-        COMMON$TASKS: "Tasks",
+        "TASK_TRACKING_OBSERVATION$TASK_LIST": "Task List",
+        "TASK_TRACKING_OBSERVATION$TASK_ID": "ID",
+        "TASK_TRACKING_OBSERVATION$TASK_NOTES": "Notes",
+        "TASK_TRACKING_OBSERVATION$RESULT": "Result",
      };
      return translations[key] || key;
    },
@@ -62,26 +61,19 @@ describe("TaskTrackingObservationContent", () => {
  it("renders task list when command is 'plan' and tasks exist", () => {
    render(<TaskTrackingObservationContent event={mockEvent} />);

-    expect(screen.getByText("Tasks")).toBeInTheDocument();
+    expect(screen.getByText("Task List (3 items)")).toBeInTheDocument();
    expect(screen.getByText("Implement feature A")).toBeInTheDocument();
    expect(screen.getByText("Fix bug B")).toBeInTheDocument();
    expect(screen.getByText("Deploy to production")).toBeInTheDocument();
  });

  it("displays correct status icons and badges", () => {
-    const { container } = render(
-      <TaskTrackingObservationContent event={mockEvent} />,
-    );
+    render(<TaskTrackingObservationContent event={mockEvent} />);

-    // Status is represented by icons, not text. Verify task items are rendered with their titles
-    // which indicates the status icons are present (status affects icon rendering)
-    expect(screen.getByText("Implement feature A")).toBeInTheDocument();
-    expect(screen.getByText("Fix bug B")).toBeInTheDocument();
-    expect(screen.getByText("Deploy to production")).toBeInTheDocument();
-
-    // Verify task items are present (they contain the status icons)
-    const taskItems = container.querySelectorAll('[data-name="item"]');
-    expect(taskItems).toHaveLength(3);
+    // Check for status text (the icons are emojis)
+    expect(screen.getByText("todo")).toBeInTheDocument();
+    expect(screen.getByText("in progress")).toBeInTheDocument();
+    expect(screen.getByText("done")).toBeInTheDocument();
  });

  it("displays task IDs and notes", () => {
@@ -92,9 +84,14 @@ describe("TaskTrackingObservationContent", () => {
    expect(screen.getByText("ID: task-3")).toBeInTheDocument();

    expect(screen.getByText("Notes: This is a test task")).toBeInTheDocument();
-    expect(
-      screen.getByText("Notes: Completed successfully"),
-    ).toBeInTheDocument();
+    expect(screen.getByText("Notes: Completed successfully")).toBeInTheDocument();
+  });
+
+  it("renders result section when content exists", () => {
+    render(<TaskTrackingObservationContent event={mockEvent} />);
+
+    expect(screen.getByText("Result")).toBeInTheDocument();
+    expect(screen.getByText("Task tracking operation completed successfully")).toBeInTheDocument();
  });

  it("does not render task list when command is not 'plan'", () => {
@@ -108,7 +105,7 @@ describe("TaskTrackingObservationContent", () => {

    render(<TaskTrackingObservationContent event={eventWithoutPlan} />);

-    expect(screen.queryByText("Tasks")).not.toBeInTheDocument();
+    expect(screen.queryByText("Task List")).not.toBeInTheDocument();
  });

  it("does not render task list when task list is empty", () => {
@@ -122,6 +119,17 @@ describe("TaskTrackingObservationContent", () => {

    render(<TaskTrackingObservationContent event={eventWithEmptyTasks} />);

-    expect(screen.queryByText("Tasks")).not.toBeInTheDocument();
+    expect(screen.queryByText("Task List")).not.toBeInTheDocument();
+  });
+
+  it("does not render result section when content is empty", () => {
+    const eventWithoutContent = {
+      ...mockEvent,
+      content: "",
+    };
+
+    render(<TaskTrackingObservationContent event={eventWithoutContent} />);
+
+    expect(screen.queryByText("Result")).not.toBeInTheDocument();
  });
 });
--- a/frontend/tests/components/features/conversation/server-status.test.tsx
+++ b/frontend/tests/components/features/conversation/server-status.test.tsx
@@ -13,6 +13,34 @@ vi.mock("#/hooks/use-agent-state", () => ({
  useAgentState: vi.fn(),
 }));

+// Mock the custom hooks
+const mockStartConversationMutate = vi.fn();
+const mockStopConversationMutate = vi.fn();
+
+vi.mock("#/hooks/mutation/use-unified-start-conversation", () => ({
+  useUnifiedStartConversation: () => ({
+    mutate: mockStartConversationMutate,
+  }),
+}));
+
+vi.mock("#/hooks/mutation/use-unified-stop-conversation", () => ({
+  useUnifiedStopConversation: () => ({
+    mutate: mockStopConversationMutate,
+  }),
+}));
+
+vi.mock("#/hooks/use-conversation-id", () => ({
+  useConversationId: () => ({
+    conversationId: "test-conversation-id",
+  }),
+}));
+
+vi.mock("#/hooks/use-user-providers", () => ({
+  useUserProviders: () => ({
+    providers: [],
+  }),
+}));
+
 vi.mock("#/hooks/query/use-task-polling", () => ({
  useTaskPolling: () => ({
    isTask: false,
@@ -38,12 +66,8 @@ vi.mock("react-i18next", async () => {
          COMMON$SERVER_STOPPED: "Server Stopped",
          COMMON$ERROR: "Error",
          COMMON$STARTING: "Starting",
-          COMMON$STOPPING: "Stopping...",
          COMMON$STOP_RUNTIME: "Stop Runtime",
          COMMON$START_RUNTIME: "Start Runtime",
-          CONVERSATION$ERROR_STARTING_CONVERSATION:
-            "Error starting conversation",
-          CONVERSATION$READY: "Ready",
        };
        return translations[key] || key;
      },
@@ -55,6 +79,10 @@ vi.mock("react-i18next", async () => {
 });

 describe("ServerStatus", () => {
+  // Mock functions for handlers
+  const mockHandleStop = vi.fn();
+  const mockHandleResumeAgent = vi.fn();
+
  // Helper function to mock agent state with specific state
  const mockAgentStore = (agentState: AgentState) => {
    vi.mocked(useAgentState).mockReturnValue({
@@ -66,91 +94,248 @@ describe("ServerStatus", () => {
    vi.clearAllMocks();
  });

-  it("should render server status with RUNNING conversation status", () => {
+  it("should render server status with different conversation statuses", () => {
+    // Mock agent store to return RUNNING state
    mockAgentStore(AgentState.RUNNING);

-    renderWithProviders(<ServerStatus conversationStatus="RUNNING" />);
+    // Test RUNNING status
+    const { rerender } = renderWithProviders(
+      <ServerStatus
+        conversationStatus="RUNNING"
+        handleStop={mockHandleStop}
+        handleResumeAgent={mockHandleResumeAgent}
+      />,
+    );
+    expect(screen.getByText("Running")).toBeInTheDocument();

-    expect(screen.getByTestId("server-status")).toBeInTheDocument();
+    // Test STOPPED status
+    rerender(
+      <ServerStatus
+        conversationStatus="STOPPED"
+        handleStop={mockHandleStop}
+        handleResumeAgent={mockHandleResumeAgent}
+      />,
+    );
+    expect(screen.getByText("Server Stopped")).toBeInTheDocument();
+
+    // Test STARTING status (shows "Running" due to agent state being RUNNING)
+    rerender(
+      <ServerStatus
+        conversationStatus="STARTING"
+        handleStop={mockHandleStop}
+        handleResumeAgent={mockHandleResumeAgent}
+      />,
+    );
+    expect(screen.getByText("Running")).toBeInTheDocument();
+
+    // Test null status (shows "Running" due to agent state being RUNNING)
+    rerender(
+      <ServerStatus
+        conversationStatus={null}
+        handleStop={mockHandleStop}
+        handleResumeAgent={mockHandleResumeAgent}
+      />,
+    );
    expect(screen.getByText("Running")).toBeInTheDocument();
  });

-  it("should render server status with STOPPED conversation status", () => {
-    mockAgentStore(AgentState.RUNNING);
+  it("should show context menu when clicked with RUNNING status", async () => {
+    const user = userEvent.setup();

-    renderWithProviders(<ServerStatus conversationStatus="STOPPED" />);
-
-    expect(screen.getByTestId("server-status")).toBeInTheDocument();
-    expect(screen.getByText("Server Stopped")).toBeInTheDocument();
-  });
-
-  it("should render STARTING status when agent state is LOADING", () => {
-    mockAgentStore(AgentState.LOADING);
-
-    renderWithProviders(<ServerStatus conversationStatus="STARTING" />);
-
-    expect(screen.getByTestId("server-status")).toBeInTheDocument();
-    expect(screen.getByText("Starting")).toBeInTheDocument();
-  });
-
-  it("should render STARTING status when agent state is INIT", () => {
-    mockAgentStore(AgentState.INIT);
-
-    renderWithProviders(<ServerStatus conversationStatus="STARTING" />);
-
-    expect(screen.getByTestId("server-status")).toBeInTheDocument();
-    expect(screen.getByText("Starting")).toBeInTheDocument();
-  });
-
-  it("should render ERROR status when agent state is ERROR", () => {
-    mockAgentStore(AgentState.ERROR);
-
-    renderWithProviders(<ServerStatus conversationStatus="RUNNING" />);
-
-    expect(screen.getByTestId("server-status")).toBeInTheDocument();
-    expect(screen.getByText("Error")).toBeInTheDocument();
-  });
-
-  it("should render STOPPING status when isPausing is true", () => {
+    // Mock agent store to return RUNNING state
    mockAgentStore(AgentState.RUNNING);

    renderWithProviders(
-      <ServerStatus conversationStatus="RUNNING" isPausing={true} />,
+      <ServerStatus
+        conversationStatus="RUNNING"
+        handleStop={mockHandleStop}
+        handleResumeAgent={mockHandleResumeAgent}
+      />,
    );

-    expect(screen.getByTestId("server-status")).toBeInTheDocument();
-    expect(screen.getByText("Stopping...")).toBeInTheDocument();
+    const statusContainer = screen.getByText("Running").closest("div");
+    expect(statusContainer).toBeInTheDocument();
+
+    await user.click(statusContainer!);
+
+    // Context menu should appear
+    expect(
+      screen.getByTestId("server-status-context-menu"),
+    ).toBeInTheDocument();
+    expect(screen.getByTestId("stop-server-button")).toBeInTheDocument();
+  });
+
+  it("should show context menu when clicked with STOPPED status", async () => {
+    const user = userEvent.setup();
+
+    // Mock agent store to return STOPPED state
+    mockAgentStore(AgentState.STOPPED);
+
+    renderWithProviders(
+      <ServerStatus
+        conversationStatus="STOPPED"
+        handleStop={mockHandleStop}
+        handleResumeAgent={mockHandleResumeAgent}
+      />,
+    );
+
+    const statusContainer = screen.getByText("Server Stopped").closest("div");
+    expect(statusContainer).toBeInTheDocument();
+
+    await user.click(statusContainer!);
+
+    // Context menu should appear
+    expect(
+      screen.getByTestId("server-status-context-menu"),
+    ).toBeInTheDocument();
+    expect(screen.getByTestId("start-server-button")).toBeInTheDocument();
+  });
+
+  it("should not show context menu when clicked with other statuses", async () => {
+    const user = userEvent.setup();
+
+    // Mock agent store to return RUNNING state
+    mockAgentStore(AgentState.RUNNING);
+
+    renderWithProviders(
+      <ServerStatus
+        conversationStatus="STARTING"
+        handleStop={mockHandleStop}
+        handleResumeAgent={mockHandleResumeAgent}
+      />,
+    );
+
+    const statusContainer = screen.getByText("Running").closest("div");
+    expect(statusContainer).toBeInTheDocument();
+
+    await user.click(statusContainer!);
+
+    // Context menu should not appear
+    expect(
+      screen.queryByTestId("server-status-context-menu"),
+    ).not.toBeInTheDocument();
+  });
+
+  it("should call stop conversation mutation when stop server is clicked", async () => {
+    const user = userEvent.setup();
+
+    // Clear previous calls
+    mockHandleStop.mockClear();
+
+    // Mock agent store to return RUNNING state
+    mockAgentStore(AgentState.RUNNING);
+
+    renderWithProviders(
+      <ServerStatus
+        conversationStatus="RUNNING"
+        handleStop={mockHandleStop}
+        handleResumeAgent={mockHandleResumeAgent}
+      />,
+    );
+
+    const statusContainer = screen.getByText("Running").closest("div");
+    await user.click(statusContainer!);
+
+    const stopButton = screen.getByTestId("stop-server-button");
+    await user.click(stopButton);
+
+    expect(mockHandleStop).toHaveBeenCalledTimes(1);
+  });
+
+  it("should call start conversation mutation when start server is clicked", async () => {
+    const user = userEvent.setup();
+
+    // Clear previous calls
+    mockHandleResumeAgent.mockClear();
+
+    // Mock agent store to return STOPPED state
+    mockAgentStore(AgentState.STOPPED);
+
+    renderWithProviders(
+      <ServerStatus
+        conversationStatus="STOPPED"
+        handleStop={mockHandleStop}
+        handleResumeAgent={mockHandleResumeAgent}
+      />,
+    );
+
+    const statusContainer = screen.getByText("Server Stopped").closest("div");
+    await user.click(statusContainer!);
+
+    const startButton = screen.getByTestId("start-server-button");
+    await user.click(startButton);
+
+    expect(mockHandleResumeAgent).toHaveBeenCalledTimes(1);
+  });
+
+  it("should close context menu after stop server action", async () => {
+    const user = userEvent.setup();
+
+    // Mock agent store to return RUNNING state
+    mockAgentStore(AgentState.RUNNING);
+
+    renderWithProviders(
+      <ServerStatus
+        conversationStatus="RUNNING"
+        handleStop={mockHandleStop}
+        handleResumeAgent={mockHandleResumeAgent}
+      />,
+    );
+
+    const statusContainer = screen.getByText("Running").closest("div");
+    await user.click(statusContainer!);
+
+    const stopButton = screen.getByTestId("stop-server-button");
+    await user.click(stopButton);
+
+    // Context menu should be closed (handled by the component)
+    expect(mockHandleStop).toHaveBeenCalledTimes(1);
+  });
+
+  it("should close context menu after start server action", async () => {
+    const user = userEvent.setup();
+
+    // Mock agent store to return STOPPED state
+    mockAgentStore(AgentState.STOPPED);
+
+    renderWithProviders(
+      <ServerStatus
+        conversationStatus="STOPPED"
+        handleStop={mockHandleStop}
+        handleResumeAgent={mockHandleResumeAgent}
+      />,
+    );
+
+    const statusContainer = screen.getByText("Server Stopped").closest("div");
+    await user.click(statusContainer!);
+
+    const startButton = screen.getByTestId("start-server-button");
+    await user.click(startButton);
+
+    // Context menu should be closed
+    expect(
+      screen.queryByTestId("server-status-context-menu"),
+    ).not.toBeInTheDocument();
  });

  it("should handle null conversation status", () => {
-    mockAgentStore(AgentState.RUNNING);
-
-    renderWithProviders(<ServerStatus conversationStatus={null} />);
-
-    expect(screen.getByTestId("server-status")).toBeInTheDocument();
-    expect(screen.getByText("Running")).toBeInTheDocument();
-  });
-
-  it("should apply custom className", () => {
+    // Mock agent store to return RUNNING state
    mockAgentStore(AgentState.RUNNING);

    renderWithProviders(
-      <ServerStatus conversationStatus="RUNNING" className="custom-class" />,
+      <ServerStatus
+        conversationStatus={null}
+        handleStop={mockHandleStop}
+        handleResumeAgent={mockHandleResumeAgent}
+      />,
    );

-    const container = screen.getByTestId("server-status");
-    expect(container).toHaveClass("custom-class");
+    const statusText = screen.getByText("Running");
+    expect(statusText).toBeInTheDocument();
  });
 });

 describe("ServerStatusContextMenu", () => {
-  // Helper function to mock agent state with specific state
-  const mockAgentStore = (agentState: AgentState) => {
-    vi.mocked(useAgentState).mockReturnValue({
-      curAgentState: agentState,
-    });
-  };
-
  const defaultProps = {
    onClose: vi.fn(),
    conversationStatus: "RUNNING" as ConversationStatus,
@@ -161,8 +346,6 @@ describe("ServerStatusContextMenu", () => {
  });

  it("should render stop server button when status is RUNNING", () => {
-    mockAgentStore(AgentState.RUNNING);
-
    renderWithProviders(
      <ServerStatusContextMenu
        {...defaultProps}
@@ -171,14 +354,11 @@ describe("ServerStatusContextMenu", () => {
      />,
    );

-    expect(screen.getByTestId("server-status")).toBeInTheDocument();
    expect(screen.getByTestId("stop-server-button")).toBeInTheDocument();
    expect(screen.getByText("Stop Runtime")).toBeInTheDocument();
  });

  it("should render start server button when status is STOPPED", () => {
-    mockAgentStore(AgentState.RUNNING);
-
    renderWithProviders(
      <ServerStatusContextMenu
        {...defaultProps}
@@ -187,14 +367,11 @@ describe("ServerStatusContextMenu", () => {
      />,
    );

-    expect(screen.getByTestId("server-status")).toBeInTheDocument();
    expect(screen.getByTestId("start-server-button")).toBeInTheDocument();
    expect(screen.getByText("Start Runtime")).toBeInTheDocument();
  });

  it("should not render stop server button when onStopServer is not provided", () => {
-    mockAgentStore(AgentState.RUNNING);
-
    renderWithProviders(
      <ServerStatusContextMenu
        {...defaultProps}
@@ -202,13 +379,10 @@ describe("ServerStatusContextMenu", () => {
      />,
    );

-    expect(screen.getByTestId("server-status")).toBeInTheDocument();
    expect(screen.queryByTestId("stop-server-button")).not.toBeInTheDocument();
  });

  it("should not render start server button when onStartServer is not provided", () => {
-    mockAgentStore(AgentState.RUNNING);
-
    renderWithProviders(
      <ServerStatusContextMenu
        {...defaultProps}
@@ -216,14 +390,12 @@ describe("ServerStatusContextMenu", () => {
      />,
    );

-    expect(screen.getByTestId("server-status")).toBeInTheDocument();
    expect(screen.queryByTestId("start-server-button")).not.toBeInTheDocument();
  });

  it("should call onStopServer when stop button is clicked", async () => {
    const user = userEvent.setup();
    const onStopServer = vi.fn();
-    mockAgentStore(AgentState.RUNNING);

    renderWithProviders(
      <ServerStatusContextMenu
@@ -242,7 +414,6 @@ describe("ServerStatusContextMenu", () => {
  it("should call onStartServer when start button is clicked", async () => {
    const user = userEvent.setup();
    const onStartServer = vi.fn();
-    mockAgentStore(AgentState.RUNNING);

    renderWithProviders(
      <ServerStatusContextMenu
@@ -259,8 +430,6 @@ describe("ServerStatusContextMenu", () => {
  });

  it("should render correct text content for stop server button", () => {
-    mockAgentStore(AgentState.RUNNING);
-
    renderWithProviders(
      <ServerStatusContextMenu
        {...defaultProps}
@@ -275,8 +444,6 @@ describe("ServerStatusContextMenu", () => {
  });

  it("should render correct text content for start server button", () => {
-    mockAgentStore(AgentState.RUNNING);
-
    renderWithProviders(
      <ServerStatusContextMenu
        {...defaultProps}
@@ -292,7 +459,6 @@ describe("ServerStatusContextMenu", () => {

  it("should call onClose when context menu is closed", () => {
    const onClose = vi.fn();
-    mockAgentStore(AgentState.RUNNING);

    renderWithProviders(
      <ServerStatusContextMenu
@@ -309,8 +475,6 @@ describe("ServerStatusContextMenu", () => {
  });

  it("should not render any buttons for other conversation statuses", () => {
-    mockAgentStore(AgentState.RUNNING);
-
    renderWithProviders(
      <ServerStatusContextMenu
        {...defaultProps}
@@ -318,7 +482,6 @@ describe("ServerStatusContextMenu", () => {
      />,
    );

-    expect(screen.getByTestId("server-status")).toBeInTheDocument();
    expect(screen.queryByTestId("stop-server-button")).not.toBeInTheDocument();
    expect(screen.queryByTestId("start-server-button")).not.toBeInTheDocument();
  });
--- a/frontend/tests/components/features/home/repo-connector.test.tsx
+++ b/frontend/tests/components/features/home/repo-connector.test.tsx
@@ -71,7 +71,6 @@ beforeEach(() => {
    provider_tokens_set: {
      github: "some-token",
      gitlab: null,
-      azure_devops: null,
    },
  });
 });
--- a/frontend/tests/components/features/home/task-card.test.tsx
+++ b/frontend/tests/components/features/home/task-card.test.tsx
@@ -23,7 +23,6 @@ const MOCK_RESPOSITORIES: GitRepository[] = [
  { id: "2", full_name: "repo2", git_provider: "github", is_public: true },
  { id: "3", full_name: "repo3", git_provider: "gitlab", is_public: true },
  { id: "4", full_name: "repo4", git_provider: "gitlab", is_public: true },
-  { id: "5", full_name: "repo5", git_provider: "azure_devops", is_public: true },
 ];

 const renderTaskCard = (task = MOCK_TASK_1) => {
--- a/frontend/tests/components/features/microagent-management/microagent-management.test.tsx
+++ b/frontend/tests/components/features/microagent-management/microagent-management.test.tsx
@@ -21,7 +21,6 @@ const mockUseConfig = vi.fn();
 const mockUseRepositoryMicroagents = vi.fn();
 const mockUseMicroagentManagementConversations = vi.fn();
 const mockUseSearchRepositories = vi.fn();
-const mockUseCreateConversationAndSubscribeMultiple = vi.fn();

 vi.mock("#/hooks/use-user-providers", () => ({
  useUserProviders: () => mockUseUserProviders(),
@@ -48,17 +47,6 @@ vi.mock("#/hooks/query/use-search-repositories", () => ({
  useSearchRepositories: () => mockUseSearchRepositories(),
 }));

-vi.mock("#/hooks/use-tracking", () => ({
-  useTracking: () => ({
-    trackEvent: vi.fn(),
-  }),
-}));
-
-vi.mock("#/hooks/use-create-conversation-and-subscribe-multiple", () => ({
-  useCreateConversationAndSubscribeMultiple: () =>
-    mockUseCreateConversationAndSubscribeMultiple(),
-}));
-
 describe("MicroagentManagement", () => {
  const RouterStub = createRoutesStub([
    {
@@ -321,16 +309,6 @@ describe("MicroagentManagement", () => {
      isError: false,
    });

-    mockUseCreateConversationAndSubscribeMultiple.mockReturnValue({
-      createConversationAndSubscribe: vi.fn(({ onSuccessCallback }) => {
-        // Immediately call the success callback to close the modal
-        if (onSuccessCallback) {
-          onSuccessCallback();
-        }
-      }),
-      isPending: false,
-    });
-
    // Mock the search repositories hook to return repositories with OpenHands suffixes
    const mockSearchResults =
      getRepositoriesWithOpenHandsSuffix(mockRepositories);
--- a/frontend/tests/components/image-preview.test.tsx
+++ b/frontend/tests/components/image-preview.test.tsx
@@ -30,7 +30,7 @@ describe("ImagePreview", () => {
    expect(onRemoveMock).toHaveBeenCalledOnce();
  });

-  it("should not display the close button when onRemove is not provided", () => {
+  it("shoud not display the close button when onRemove is not provided", () => {
    render(<ImagePreview src="https://example.com/image.jpg" />);
    expect(screen.queryByRole("button")).not.toBeInTheDocument();
  });
--- a/frontend/tests/hooks/use-websocket.test.ts
+++ b/frontend/tests/hooks/use-websocket.test.ts
@@ -268,7 +268,7 @@ describe("useWebSocket", () => {
    });

    // onError handler should have been called
-    expect(onErrorSpy).toHaveBeenCalled();
+    expect(onErrorSpy).toHaveBeenCalledOnce();
  });

  it("should provide sendMessage function to send messages to WebSocket", async () => {
--- a/frontend/tests/posthog-tracking.test.tsx
+++ b/frontend/tests/posthog-tracking.test.tsx
@@ -1,233 +0,0 @@
-import {
-  describe,
-  it,
-  expect,
-  beforeAll,
-  afterAll,
-  afterEach,
-  vi,
-} from "vitest";
-import { screen, waitFor, render, cleanup } from "@testing-library/react";
-import { QueryClient, QueryClientProvider } from "@tanstack/react-query";
-import { createMockAgentErrorEvent } from "#/mocks/mock-ws-helpers";
-import { ConversationWebSocketProvider } from "#/contexts/conversation-websocket-context";
-import { conversationWebSocketTestSetup } from "./helpers/msw-websocket-setup";
-import { ConnectionStatusComponent } from "./helpers/websocket-test-components";
-
-// Mock the tracking function
-const mockTrackCreditLimitReached = vi.fn();
-
-// Mock useTracking hook
-vi.mock("#/hooks/use-tracking", () => ({
-  useTracking: () => ({
-    trackCreditLimitReached: mockTrackCreditLimitReached,
-    trackLoginButtonClick: vi.fn(),
-    trackConversationCreated: vi.fn(),
-    trackPushButtonClick: vi.fn(),
-    trackPullButtonClick: vi.fn(),
-    trackCreatePrButtonClick: vi.fn(),
-    trackGitProviderConnected: vi.fn(),
-    trackUserSignupCompleted: vi.fn(),
-    trackCreditsPurchased: vi.fn(),
-  }),
-}));
-
-// Mock useActiveConversation hook
-vi.mock("#/hooks/query/use-active-conversation", () => ({
-  useActiveConversation: () => ({
-    data: null,
-    isLoading: false,
-    error: null,
-  }),
-}));
-
-// MSW WebSocket mock setup
-const { wsLink, server: mswServer } = conversationWebSocketTestSetup();
-
-beforeAll(() => {
-  // The global MSW server from vitest.setup.ts is already running
-  // We just need to start our WebSocket-specific server
-  mswServer.listen({ onUnhandledRequest: "bypass" });
-});
-
-afterEach(() => {
-  // Clear all mocks before each test
-  mockTrackCreditLimitReached.mockClear();
-  mswServer.resetHandlers();
-  // Clean up any React components
-  cleanup();
-});
-
-afterAll(async () => {
-  // Close the WebSocket MSW server
-  mswServer.close();
-
-  // Give time for any pending WebSocket connections to close. This is very important to prevent serious memory leaks
-  await new Promise((resolve) => {
-    setTimeout(resolve, 500);
-  });
-});
-
-// Helper function to render components with all necessary providers
-function renderWithProviders(
-  children: React.ReactNode,
-  conversationId = "test-conversation-123",
-  conversationUrl = "http://localhost:3000/api/conversations/test-conversation-123",
-) {
-  const queryClient = new QueryClient({
-    defaultOptions: {
-      queries: { retry: false },
-      mutations: { retry: false },
-    },
-  });
-
-  return render(
-    <QueryClientProvider client={queryClient}>
-      <ConversationWebSocketProvider
-        conversationId={conversationId}
-        conversationUrl={conversationUrl}
-        sessionApiKey={null}
-      >
-        {children}
-      </ConversationWebSocketProvider>
-    </QueryClientProvider>,
-  );
-}
-
-describe("PostHog Analytics Tracking", () => {
-  describe("Credit Limit Tracking", () => {
-    it("should track credit_limit_reached when AgentErrorEvent contains budget error", async () => {
-      // Create a mock AgentErrorEvent with budget-related error message
-      const mockBudgetErrorEvent = createMockAgentErrorEvent({
-        error: "ExceededBudget: Task exceeded maximum budget of $10.00",
-      });
-
-      // Set up MSW to send the budget error event when connection is established
-      mswServer.use(
-        wsLink.addEventListener("connection", ({ client, server }) => {
-          server.connect();
-          // Send the mock budget error event after connection
-          client.send(JSON.stringify(mockBudgetErrorEvent));
-        }),
-      );
-
-      // Render with all providers
-      renderWithProviders(<ConnectionStatusComponent />);
-
-      // Wait for connection to be established
-      await waitFor(() => {
-        expect(screen.getByTestId("connection-state")).toHaveTextContent(
-          "OPEN",
-        );
-      });
-
-      // Wait for the tracking event to be captured
-      await waitFor(() => {
-        expect(mockTrackCreditLimitReached).toHaveBeenCalledWith(
-          expect.objectContaining({
-            conversationId: "test-conversation-123",
-          }),
-        );
-      });
-    });
-
-    it("should track credit_limit_reached when AgentErrorEvent contains 'credit' keyword", async () => {
-      // Create error with "credit" keyword (case-insensitive)
-      const mockCreditErrorEvent = createMockAgentErrorEvent({
-        error: "Insufficient CREDIT to complete this operation",
-      });
-
-      mswServer.use(
-        wsLink.addEventListener("connection", ({ client, server }) => {
-          server.connect();
-          client.send(JSON.stringify(mockCreditErrorEvent));
-        }),
-      );
-
-      renderWithProviders(<ConnectionStatusComponent />);
-
-      await waitFor(() => {
-        expect(screen.getByTestId("connection-state")).toHaveTextContent(
-          "OPEN",
-        );
-      });
-
-      await waitFor(() => {
-        expect(mockTrackCreditLimitReached).toHaveBeenCalledWith(
-          expect.objectContaining({
-            conversationId: "test-conversation-123",
-          }),
-        );
-      });
-    });
-
-    it("should NOT track credit_limit_reached for non-budget errors", async () => {
-      // Create a regular error without budget/credit keywords
-      const mockRegularErrorEvent = createMockAgentErrorEvent({
-        error: "Failed to execute command: Permission denied",
-      });
-
-      mswServer.use(
-        wsLink.addEventListener("connection", ({ client, server }) => {
-          server.connect();
-          client.send(JSON.stringify(mockRegularErrorEvent));
-        }),
-      );
-
-      renderWithProviders(<ConnectionStatusComponent />);
-
-      // Wait for connection and error to be processed
-      await waitFor(() => {
-        expect(screen.getByTestId("connection-state")).toHaveTextContent(
-          "OPEN",
-        );
-      });
-
-      // Verify that credit_limit_reached was NOT tracked
-      expect(mockTrackCreditLimitReached).not.toHaveBeenCalled();
-    });
-
-    it("should only track credit_limit_reached once per error event", async () => {
-      const mockBudgetErrorEvent = createMockAgentErrorEvent({
-        error: "Budget exceeded: $10.00 limit reached",
-      });
-
-      mswServer.use(
-        wsLink.addEventListener("connection", ({ client, server }) => {
-          server.connect();
-          // Send the same error event twice
-          client.send(JSON.stringify(mockBudgetErrorEvent));
-          client.send(
-            JSON.stringify({ ...mockBudgetErrorEvent, id: "different-id" }),
-          );
-        }),
-      );
-
-      renderWithProviders(<ConnectionStatusComponent />);
-
-      await waitFor(() => {
-        expect(screen.getByTestId("connection-state")).toHaveTextContent(
-          "OPEN",
-        );
-      });
-
-      await waitFor(() => {
-        expect(mockTrackCreditLimitReached).toHaveBeenCalledTimes(2);
-      });
-
-      // Both calls should be for credit_limit_reached (once per event)
-      expect(mockTrackCreditLimitReached).toHaveBeenNthCalledWith(
-        1,
-        expect.objectContaining({
-          conversationId: "test-conversation-123",
-        }),
-      );
-      expect(mockTrackCreditLimitReached).toHaveBeenNthCalledWith(
-        2,
-        expect.objectContaining({
-          conversationId: "test-conversation-123",
-        }),
-      );
-    });
-  });
-});
--- a/frontend/tests/routes/accept-tos.test.tsx
+++ b/frontend/tests/routes/accept-tos.test.tsx
@@ -1,9 +1,10 @@
 import { render, screen } from "@testing-library/react";
 import { it, describe, expect, vi, beforeEach, afterEach } from "vitest";
 import userEvent from "@testing-library/user-event";
-import { QueryClient, QueryClientProvider } from "@tanstack/react-query";
 import AcceptTOS from "#/routes/accept-tos";
 import * as CaptureConsent from "#/utils/handle-capture-consent";
+import * as ToastHandlers from "#/utils/custom-toast-handlers";
+import { QueryClient, QueryClientProvider } from "@tanstack/react-query";
 import { openHands } from "#/api/open-hands-axios";

 // Mock the react-router hooks
@@ -43,13 +44,9 @@ const createWrapper = () => {
    },
  });

-  function Wrapper({ children }: { children: React.ReactNode }) {
-    return (
-      <QueryClientProvider client={queryClient}>{children}</QueryClientProvider>
-    );
-  }
-
-  return Wrapper;
+  return ({ children }: { children: React.ReactNode }) => (
+    <QueryClientProvider client={queryClient}>{children}</QueryClientProvider>
+  );
 };

 describe("AcceptTOS", () => {
@@ -109,10 +106,7 @@ describe("AcceptTOS", () => {
    // Wait for the mutation to complete
    await new Promise(process.nextTick);

-    expect(handleCaptureConsentSpy).toHaveBeenCalledWith(
-      expect.anything(),
-      true,
-    );
+    expect(handleCaptureConsentSpy).toHaveBeenCalledWith(true);
    expect(openHands.post).toHaveBeenCalledWith("/api/accept_tos", {
      redirect_url: "/dashboard",
    });
--- a/frontend/tests/routes/app-settings.test.tsx
+++ b/frontend/tests/routes/app-settings.test.tsx
@@ -46,21 +46,6 @@ describe("Content", () => {
    });
  });

-  it("should render analytics toggle as enabled when server returns null (opt-in by default)", async () => {
-    const getSettingsSpy = vi.spyOn(SettingsService, "getSettings");
-    getSettingsSpy.mockResolvedValue({
-      ...MOCK_DEFAULT_USER_SETTINGS,
-      user_consents_to_analytics: null,
-    });
-
-    renderAppSettingsScreen();
-
-    await waitFor(() => {
-      const analytics = screen.getByTestId("enable-analytics-switch");
-      expect(analytics).toBeChecked();
-    });
-  });
-
  it("should render the language options", async () => {
    renderAppSettingsScreen();

@@ -178,10 +163,7 @@ describe("Form submission", () => {
    await userEvent.click(submit);

    await waitFor(() =>
-      expect(handleCaptureConsentsSpy).toHaveBeenCalledWith(
-        expect.anything(),
-        true,
-      ),
+      expect(handleCaptureConsentsSpy).toHaveBeenCalledWith(true),
    );
  });

@@ -206,10 +188,7 @@ describe("Form submission", () => {
    await userEvent.click(submit);

    await waitFor(() =>
-      expect(handleCaptureConsentsSpy).toHaveBeenCalledWith(
-        expect.anything(),
-        false,
-      ),
+      expect(handleCaptureConsentsSpy).toHaveBeenCalledWith(false),
    );
  });

--- a/frontend/tests/routes/git-settings.test.tsx
+++ b/frontend/tests/routes/git-settings.test.tsx
@@ -124,9 +124,6 @@ describe("Content", () => {
    await screen.findByTestId("bitbucket-token-input");
    await screen.findByTestId("bitbucket-token-help-anchor");

-    await screen.findByTestId("azure-devops-token-input");
-    await screen.findByTestId("azure-devops-token-help-anchor");
-
    getConfigSpy.mockResolvedValue(VALID_SAAS_CONFIG);
    queryClient.invalidateQueries();
    rerender();
@@ -152,13 +149,6 @@ describe("Content", () => {
      expect(
        screen.queryByTestId("bitbucket-token-help-anchor"),
      ).not.toBeInTheDocument();
-
-      expect(
-        screen.queryByTestId("azure-devops-token-input"),
-      ).not.toBeInTheDocument();
-      expect(
-        screen.queryByTestId("azure-devops-token-help-anchor"),
-      ).not.toBeInTheDocument();
    });
  });

@@ -297,7 +287,6 @@ describe("Form submission", () => {
      github: { token: "test-token", host: "" },
      gitlab: { token: "", host: "" },
      bitbucket: { token: "", host: "" },
-      azure_devops: { token: "", host: "" },
    });
  });

@@ -319,7 +308,6 @@ describe("Form submission", () => {
      github: { token: "", host: "" },
      gitlab: { token: "test-token", host: "" },
      bitbucket: { token: "", host: "" },
-      azure_devops: { token: "", host: "" },
    });
  });

@@ -341,29 +329,6 @@ describe("Form submission", () => {
      github: { token: "", host: "" },
      gitlab: { token: "", host: "" },
      bitbucket: { token: "test-token", host: "" },
-      azure_devops: { token: "", host: "" },
-    });
-  });
-
-  it("should save the Azure DevOps token", async () => {
-    const saveProvidersSpy = vi.spyOn(SecretsService, "addGitProvider");
-    saveProvidersSpy.mockImplementation(() => Promise.resolve(true));
-    const getConfigSpy = vi.spyOn(OptionService, "getConfig");
-    getConfigSpy.mockResolvedValue(VALID_OSS_CONFIG);
-
-    renderGitSettingsScreen();
-
-    const azureDevOpsInput = await screen.findByTestId("azure-devops-token-input");
-    const submit = await screen.findByTestId("submit-button");
-
-    await userEvent.type(azureDevOpsInput, "test-token");
-    await userEvent.click(submit);
-
-    expect(saveProvidersSpy).toHaveBeenCalledWith({
-      github: { token: "", host: "" },
-      gitlab: { token: "", host: "" },
-      bitbucket: { token: "", host: "" },
-      azure_devops: { token: "test-token", host: "" },
    });
  });

--- a/frontend/tests/stores/use-event-store.test.ts
+++ b/frontend/tests/stores/use-event-store.test.ts
@@ -55,7 +55,7 @@ const mockObservationEvent: ObservationEvent = {
  tool_call_id: "call_123",
  observation: {
    kind: "ExecuteBashObservation",
-    content: [{ type: "text", text: "hello\n" }],
+    output: "hello\n",
    command: "echo hello",
    exit_code: 0,
    error: false,
--- a/frontend/tests/utils/convert-raw-providers-to-list.test.ts
+++ b/frontend/tests/utils/convert-raw-providers-to-list.test.ts
@@ -7,7 +7,6 @@ describe("convertRawProvidersToList", () => {
    const example1: Partial<Record<Provider, string | null>> | undefined = {
      github: "test-token",
      gitlab: "test-token",
-      azure_devops: "test-token",
    };
    const example2: Partial<Record<Provider, string | null>> | undefined = {
      github: "",
@@ -15,13 +14,9 @@ describe("convertRawProvidersToList", () => {
    const example3: Partial<Record<Provider, string | null>> | undefined = {
      gitlab: null,
    };
-    const example4: Partial<Record<Provider, string | null>> | undefined = {
-      azure_devops: "test-token",
-    };

-    expect(convertRawProvidersToList(example1)).toEqual(["github", "gitlab", "azure_devops"]);
+    expect(convertRawProvidersToList(example1)).toEqual(["github", "gitlab"]);
    expect(convertRawProvidersToList(example2)).toEqual(["github"]);
    expect(convertRawProvidersToList(example3)).toEqual(["gitlab"]);
-    expect(convertRawProvidersToList(example4)).toEqual(["azure_devops"]);
  });
 });
--- a/frontend/tests/utils/error-handler.test.ts
+++ b/frontend/tests/utils/error-handler.test.ts
@@ -32,7 +32,6 @@ describe("Error Handler", () => {
      const error = {
        message: "Test error",
        source: "test",
-        posthog,
      };

      trackError(error);
@@ -53,7 +52,6 @@ describe("Error Handler", () => {
          extra: "info",
          details: { foo: "bar" },
        },
-        posthog,
      };

      trackError(error);
@@ -75,7 +73,6 @@ describe("Error Handler", () => {
      const error = {
        message: "Toast error",
        source: "toast-test",
-        posthog,
      };

      showErrorToast(error);
@@ -97,7 +94,6 @@ describe("Error Handler", () => {
        message: "Toast error",
        source: "toast-test",
        metadata: { context: "testing" },
-        posthog,
      };

      showErrorToast(error);
@@ -117,7 +113,6 @@ describe("Error Handler", () => {
        message: "Agent error",
        source: "agent-status",
        metadata: { id: "error.agent" },
-        posthog,
      });

      expect(posthog.captureException).toHaveBeenCalledWith(
@@ -132,7 +127,6 @@ describe("Error Handler", () => {
        message: "Server error",
        source: "server",
        metadata: { error_code: 500, details: "Internal error" },
-        posthog,
      });

      expect(posthog.captureException).toHaveBeenCalledWith(
@@ -151,7 +145,6 @@ describe("Error Handler", () => {
        message: error.message,
        source: "feedback",
        metadata: { conversationId: "123", error },
-        posthog,
      });

      expect(posthog.captureException).toHaveBeenCalledWith(
@@ -171,7 +164,6 @@ describe("Error Handler", () => {
        message: "Chat error",
        source: "chat-test",
        msgId: "123",
-        posthog,
      };

      showChatError(error);
--- a/frontend/tests/utils/handle-capture-consent.test.ts
+++ b/frontend/tests/utils/handle-capture-consent.test.ts
@@ -13,14 +13,14 @@ describe("handleCaptureConsent", () => {
  });

  it("should opt out of of capturing", () => {
-    handleCaptureConsent(posthog, false);
+    handleCaptureConsent(false);

    expect(optOutSpy).toHaveBeenCalled();
    expect(optInSpy).not.toHaveBeenCalled();
  });

  it("should opt in to capturing if the user consents", () => {
-    handleCaptureConsent(posthog, true);
+    handleCaptureConsent(true);

    expect(optInSpy).toHaveBeenCalled();
    expect(optOutSpy).not.toHaveBeenCalled();
@@ -28,7 +28,7 @@ describe("handleCaptureConsent", () => {

  it("should not opt in to capturing if the user is already opted in", () => {
    hasOptedInSpy.mockReturnValueOnce(true);
-    handleCaptureConsent(posthog, true);
+    handleCaptureConsent(true);

    expect(optInSpy).not.toHaveBeenCalled();
    expect(optOutSpy).not.toHaveBeenCalled();
@@ -36,7 +36,7 @@ describe("handleCaptureConsent", () => {

  it("should not opt out of capturing if the user is already opted out", () => {
    hasOptedOutSpy.mockReturnValueOnce(true);
-    handleCaptureConsent(posthog, false);
+    handleCaptureConsent(false);

    expect(optOutSpy).not.toHaveBeenCalled();
    expect(optInSpy).not.toHaveBeenCalled();
--- a/frontend/tests/utils/handle-event-for-ui.test.ts
+++ b/frontend/tests/utils/handle-event-for-ui.test.ts
@@ -17,7 +17,7 @@ describe("handleEventForUI", () => {
    tool_call_id: "call_123",
    observation: {
      kind: "ExecuteBashObservation",
-      content: [{ type: "text", text: "hello\n" }],
+      output: "hello\n",
      command: "echo hello",
      exit_code: 0,
      error: false,
--- a/frontend/package-lock.json
+++ b/frontend/package-lock.json
--- a/frontend/package.json
+++ b/frontend/package.json
@@ -1,17 +1,16 @@
 {
  "name": "openhands-frontend",
-  "version": "0.62.0",
+  "version": "0.60.0",
  "private": true,
  "type": "module",
  "engines": {
    "node": ">=22.0.0"
  },
  "dependencies": {
-    "@heroui/react": "2.8.5",
+    "@heroui/react": "^2.8.4",
    "@heroui/use-infinite-scroll": "^2.2.11",
    "@microlink/react-json-view": "^1.26.2",
    "@monaco-editor/react": "^4.7.0-rc.0",
-    "@posthog/react": "^1.4.0",
    "@react-router/node": "^7.9.3",
    "@react-router/serve": "^7.9.3",
    "@react-types/shared": "^3.32.0",
@@ -38,7 +37,7 @@
    "jose": "^6.1.0",
    "lucide-react": "^0.544.0",
    "monaco-editor": "^0.53.0",
-    "posthog-js": "^1.298.1",
+    "posthog-js": "^1.268.8",
    "react": "^19.1.1",
    "react-dom": "^19.1.1",
    "react-highlight": "^0.15.0",
--- a/frontend/public/android-chrome-192x192.png
+++ b/frontend/public/android-chrome-192x192.png
--- a/frontend/public/android-chrome-512x512.png
+++ b/frontend/public/android-chrome-512x512.png
--- a/frontend/public/apple-touch-icon.png
+++ b/frontend/public/apple-touch-icon.png
--- a/frontend/public/favicon-16x16.png
+++ b/frontend/public/favicon-16x16.png
--- a/frontend/public/favicon-32x32.png
+++ b/frontend/public/favicon-32x32.png
--- a/frontend/public/favicon.ico
+++ b/frontend/public/favicon.ico
--- a/frontend/public/safari-pinned-tab.svg
+++ b/frontend/public/safari-pinned-tab.svg
@@ -1,7 +1,32 @@
-<svg width="1365" height="1365" viewBox="0 -24 148 148" fill="none" xmlns="http://www.w3.org/2000/svg">
-<path d="M71.7542 16.863V2.97414C71.7542 1.82355 72.6872 0.890503 73.8378 0.890503C74.9884 0.890503 75.9214 1.82355 75.9214 2.97414V16.863C75.9214 18.0136 74.9884 18.9466 73.8378 18.9466C72.6872 18.9466 71.7542 18.0136 71.7542 16.863Z" fill="black"/>
-<path d="M82.5272 18.9329L89.4716 6.90477C90.0469 5.90832 91.3215 5.5668 92.3179 6.1421C93.3144 6.7174 93.6559 7.99197 93.0806 8.98841L86.1362 21.0165C85.5609 22.0129 84.2863 22.3545 83.2899 21.7792C82.2934 21.2039 81.9519 19.9293 82.5272 18.9329Z" fill="black"/>
-<path d="M65.1481 18.9329L58.2037 6.90477C57.6284 5.90832 56.3538 5.5668 55.3574 6.1421C54.3609 6.7174 54.0194 7.99197 54.5947 8.98841L61.5391 21.0165C62.1144 22.0129 63.389 22.3545 64.3854 21.7792C65.3819 21.2039 65.7234 19.9293 65.1481 18.9329Z" fill="black"/>
-<path d="M140.606 62.0292C140.606 58.409 141.583 47.6748 141.89 44.1323C142.097 41.7374 141.809 40.4247 141.424 39.7542C141.141 39.2626 140.699 38.915 139.634 38.8436C138.865 38.7921 138.027 39.0114 137.401 39.5761C136.814 40.1052 136.159 41.1682 136.159 43.3176L136.155 43.4388L135.198 59.758C135.164 60.3451 134.883 60.8911 134.424 61.2599C133.966 61.6284 133.374 61.7859 132.793 61.6941L122.764 60.1068L111.948 58.6703C110.949 58.5376 110.188 57.7084 110.142 56.7016L109.561 44.1323C109.535 43.621 109.51 43.1141 109.484 42.6146C109.241 37.9294 109.022 33.7805 109.022 32.4282C109.022 28.3859 108.338 26.6806 107.74 25.9634C107.263 25.3915 106.577 25.1402 105.11 25.1402C104.583 25.1402 104.212 25.2481 103.933 25.4111C103.659 25.5714 103.346 25.8587 103.049 26.4208C102.41 27.6257 101.945 29.891 102.118 33.8479C102.342 38.9804 102.692 42.8146 103.035 46.2718C103.377 49.7231 103.718 52.8561 103.908 56.4971C104.204 62.1966 104.178 66.1256 103.945 68.7924C103.828 70.124 103.656 71.1996 103.423 72.0501C103.202 72.8558 102.871 73.6757 102.296 74.2887C101.6 75.0303 100.608 75.3844 99.577 75.136C98.7592 74.9389 98.1847 74.4215 97.8706 74.0916C97.2141 73.4017 96.7501 72.5106 96.568 72.0512C95.5097 69.3812 92.2352 63.1808 87.8023 59.6811C86.5089 58.6599 85.5666 58.3652 84.9736 58.3204C84.4148 58.2783 84.0094 58.4436 83.6909 58.6967C83.34 58.9756 83.0781 59.3811 82.9479 59.7643C82.9019 59.8999 82.8823 59.9968 82.8741 60.0584C84.0759 62.0865 88.8421 69.5222 91.0896 77.069C92.7648 82.6941 96.8038 88.4259 99.8194 90.8809C102.74 93.258 107.988 94.7313 113.9 95.0218C119.756 95.3095 125.788 94.4121 130.033 92.5092C138.233 88.8334 139.903 80.7382 140.651 77.2292C141.232 74.5057 141.243 71.5987 141.087 68.9009C141.01 67.5551 140.894 66.2969 140.793 65.1373C140.695 64.0105 140.606 62.9215 140.606 62.0292ZM120.986 27.0953C120.986 25.8314 120.648 24.7049 120.089 23.9514C119.583 23.27 118.84 22.7987 117.646 22.7984C116.668 22.7982 116.011 22.9187 115.546 23.1167C115.13 23.2943 114.781 23.5699 114.463 24.0831C113.73 25.2671 113.192 27.6455 113.189 32.384L113.721 43.9088C113.91 47.5661 114.106 51.4922 114.235 54.7707L120.986 55.6666V27.0953ZM125.153 56.2652L131.172 57.218L131.992 43.267V32.5083C131.992 31.031 131.39 30.1275 130.678 29.5489C129.884 28.9039 128.957 28.6731 128.519 28.6731C127.722 28.6731 126.899 28.797 126.306 29.2179C125.849 29.5421 125.153 30.3087 125.153 32.5083V56.2652ZM136.159 35.4278C137.406 34.8069 138.74 34.6083 139.912 34.6868C142.037 34.8292 143.91 35.718 145.037 37.6779C146.06 39.4592 146.273 41.8136 146.041 44.4927C145.72 48.1949 144.772 58.6457 144.772 62.0292C144.772 62.708 144.843 63.6116 144.944 64.7758C145.042 65.907 145.165 67.2389 145.247 68.6606C145.411 71.4987 145.422 74.8383 144.727 78.0987C144.002 81.4953 142.041 91.6918 131.738 96.3108C126.731 98.5551 120.002 99.4936 113.696 99.1838C107.445 98.8767 101.128 97.3189 97.1887 94.1122C93.4809 91.0937 88.9938 84.6307 87.0962 78.2589C84.9529 71.0619 80.3109 63.9646 79.1527 61.9533C78.4706 60.7689 78.684 59.3628 79.0019 58.4258C79.3607 57.3688 80.0554 56.2631 81.0993 55.4337C82.1758 54.5784 83.6043 54.0377 85.2876 54.1647C86.9369 54.2893 88.6462 55.0393 90.3834 56.4107C94.8541 59.9401 98.1342 65.5082 99.7424 68.9231C99.759 68.7664 99.779 68.6024 99.7941 68.4298C100.003 66.0435 100.039 62.3344 99.7467 56.7132C99.5635 53.1942 99.2356 50.1809 98.8888 46.6828C98.5425 43.1904 98.184 39.2713 97.955 34.0302C97.7722 29.8481 98.2012 26.6722 99.3672 24.471C99.9716 23.3302 100.79 22.4223 101.83 21.814C102.866 21.2087 103.995 20.974 105.11 20.974C106.759 20.974 108.813 21.2062 110.448 22.7678C110.593 22.4576 110.75 22.1652 110.921 21.8899C111.676 20.6698 112.681 19.8084 113.912 19.2835C115.095 18.7791 116.378 18.6309 117.646 18.6311C120.195 18.6315 122.164 19.7567 123.434 21.4683C124.256 22.576 124.75 23.8775 124.985 25.1982C126.338 24.5876 127.691 24.5068 128.519 24.5068C129.933 24.5068 131.784 25.0791 133.305 26.3154C134.908 27.6179 136.159 29.6733 136.159 32.5083V35.4278Z" fill="black"/>
-<path d="M7.15661 62.0292C7.15661 58.409 6.17994 47.6748 5.87291 44.1323C5.6654 41.7374 5.95357 40.4247 6.33875 39.7542C6.62116 39.2626 7.06336 38.915 8.12834 38.8436C8.89759 38.7921 9.73544 39.0114 10.3616 39.5761C10.9484 40.1052 11.6032 41.1682 11.6032 43.3176L11.6074 43.4388L12.5644 59.758C12.5988 60.3451 12.8798 60.8911 13.338 61.2599C13.7961 61.6284 14.3887 61.7859 14.9695 61.6941L24.9988 60.1068L35.8143 58.6703C36.8135 58.5376 37.5741 57.7084 37.6208 56.7016L38.2015 44.1323C38.2279 43.621 38.2525 43.1141 38.2784 42.6146C38.5218 37.9294 38.7401 33.7805 38.7401 32.4282C38.7401 28.3859 39.4246 26.6806 40.0227 25.9634C40.4996 25.3915 41.185 25.1402 42.6523 25.1402C43.1794 25.1402 43.5505 25.2481 43.8295 25.4111C44.1038 25.5714 44.416 25.8587 44.7138 26.4208C45.3521 27.6257 45.8173 29.891 45.6444 33.8479C45.4201 38.9804 45.0703 42.8146 44.7275 46.2718C44.3853 49.7231 44.0443 52.8561 43.8548 56.4971C43.5582 62.1966 43.5847 66.1256 43.8179 68.7924C43.9344 70.124 44.1069 71.1996 44.3396 72.0501C44.5601 72.8558 44.891 73.6757 45.4663 74.2887C46.1625 75.0303 47.1546 75.3844 48.1855 75.136C49.0033 74.9389 49.5778 74.4215 49.8918 74.0916C50.5484 73.4017 51.0123 72.5106 51.1945 72.0512C52.2527 69.3812 55.5272 63.1808 59.9601 59.6811C61.2536 58.6599 62.1958 58.3652 62.7889 58.3204C63.3476 58.2783 63.753 58.4436 64.0715 58.6967C64.4225 58.9756 64.6844 59.3811 64.8146 59.7643C64.8606 59.8999 64.8801 59.9968 64.8883 60.0584C63.6866 62.0865 58.9204 69.5222 56.6729 77.069C54.9977 82.6941 50.9586 88.4259 47.9431 90.8809C45.0229 93.258 39.7747 94.7313 33.8624 95.0218C28.0068 95.3095 21.9748 94.4121 17.7297 92.5092C9.52988 88.8334 7.85961 80.7382 7.11129 77.2292C6.53054 74.5057 6.5195 71.5987 6.67496 68.9009C6.75251 67.5551 6.86809 66.2969 6.96901 65.1373C7.06707 64.0105 7.1566 62.9215 7.15661 62.0292ZM26.7768 27.0953C26.7768 25.8314 27.1147 24.7049 27.6737 23.9514C28.1792 23.27 28.9221 22.7987 30.1167 22.7984C31.0942 22.7982 31.7518 22.9187 32.2162 23.1167C32.6326 23.2943 32.9817 23.5699 33.2996 24.0831C34.0328 25.2671 34.5705 27.6455 34.5738 32.384L34.0416 43.9088C33.8524 47.5661 33.6565 51.4922 33.5273 54.7707L26.7768 55.6666V27.0953ZM22.6095 56.2652L16.5904 57.218L15.7705 43.267V32.5083C15.7705 31.031 16.3726 30.1275 17.0847 29.5489C17.8785 28.9039 18.8058 28.6731 19.2432 28.6731C20.0404 28.6731 20.8634 28.797 21.4565 29.2179C21.9131 29.5421 22.6095 30.3087 22.6095 32.5083V56.2652ZM11.6032 35.4278C10.3568 34.8069 9.02265 34.6083 7.8501 34.6868C5.72541 34.8292 3.85197 35.718 2.72584 37.6779C1.70247 39.4592 1.48924 41.8136 1.72143 44.4927C2.0423 48.1949 2.99038 58.6457 2.99038 62.0292C2.99037 62.708 2.91991 63.6116 2.81859 64.7758C2.72014 65.907 2.59699 67.2389 2.51505 68.6606C2.3515 71.4987 2.34041 74.8383 3.0357 78.0987C3.76005 81.4953 5.72154 91.6918 16.0245 96.3108C21.0311 98.5551 27.7601 99.4936 34.0669 99.1838C40.3172 98.8767 46.6346 97.3189 50.5737 94.1122C54.2816 91.0937 58.7686 84.6307 60.6662 78.2589C62.8095 71.0619 67.4515 63.9646 68.6098 61.9533C69.2919 60.7689 69.0785 59.3628 68.7605 58.4258C68.4018 57.3688 67.707 56.2631 66.6632 55.4337C65.5867 54.5784 64.1582 54.0377 62.4748 54.1647C60.8256 54.2893 59.1162 55.0393 57.379 56.4107C52.9083 59.9401 49.6283 65.5082 48.02 68.9231C48.0034 68.7664 47.9835 68.6024 47.9684 68.4298C47.7597 66.0435 47.7232 62.3344 48.0158 56.7132C48.1989 53.1942 48.5269 50.1809 48.8737 46.6828C49.22 43.1904 49.5784 39.2713 49.8075 34.0302C49.9903 29.8481 49.5612 26.6722 48.3952 24.471C47.7909 23.3302 46.9729 22.4223 45.9321 21.814C44.8964 21.2087 43.7676 20.974 42.6523 20.974C41.0038 20.974 38.9497 21.2062 37.3141 22.7678C37.1698 22.4576 37.0124 22.1652 36.8419 21.8899C36.0863 20.6698 35.0817 19.8084 33.8508 19.2835C32.6679 18.7791 31.3849 18.6309 30.1167 18.6311C27.5677 18.6315 25.5986 19.7567 24.3285 21.4683C23.5066 22.576 23.0121 23.8775 22.7771 25.1982C21.4247 24.5876 20.0718 24.5068 19.2432 24.5068C17.8298 24.5068 15.9788 25.0791 14.4573 26.3154C12.8542 27.6179 11.6032 29.6733 11.6032 32.5083V35.4278Z" fill="black"/>
+<svg version="1.2" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1365 1365" width="1365" height="1365">
+	<title>safari-pinned-tab-svg</title>
+	<defs>
+		<clipPath clipPathUnits="userSpaceOnUse" id="cp1">
+			<path d="m655.2 313.1v822.23h-622.69v-822.23z"/>
+		</clipPath>
+		<clipPath clipPathUnits="userSpaceOnUse" id="cp2">
+			<path d="m1308.84 304v828.5h-617.24v-828.5z"/>
+		</clipPath>
+	</defs>
+	<style>
+		.s0 { fill: none }
+		.s1 { fill: #000000 }
+	</style>
+	<g id="surface1">
+		<path class="s0" d="m1258.6 499.2c-40.8-26.1-68 13.8-64.7 68.1l-0.3 0.4c0.1-56.6-7.3-119.2-31.7-169.8-8.7-17.9-26.2-47.4-61-33.5-15.2 6.1-29 24.4-21.9 71.7 0 0 8 49.7 6.5 112.2v0.8c-9.9-172.2-47.3-224.7-100.7-221.3-17.1 3.1-40.5 10.8-32.6 63.8 0 0 8.5 55.2 11.3 99.2l0.2 2.2h-0.2c-25.2-94.9-59-96.2-83.5-92.5-22.3 3.3-46.6 27.3-34.3 74.7 38.6 148.4 31 327.2 28.2 352.9-7.9-17.6-10.3-31.4-21.3-50.7-43.9-77.1-64.8-82.8-90.4-84-25.5-1.1-53 15.2-51.2 46.3 1.9 31.1 17.1 36.3 38.7 79.6 16.9 33.8 21.7 78 55.7 158.4 28.1 66.5 101.6 139.5 235.6 130.8 108.5-3.7 270.6-43.2 242.4-302.5-7-45-1.7-82.7 1.9-121.4 5.8-60 14.1-159.4-26.7-185.5z"/>
+		<path class="s0" d="m580.9 695.3c-25.7 1.7-46.4 7.7-89 85.7-10.6 19.4-12.7 33.3-20.3 50.9-3.3-25.5-14-204.2 21.9-353.4 11.4-47.5-13.3-71-35.6-73.9-24.6-3.3-58.5-1.3-81.9 94.6h-0.3l0.4-2.8c1.9-44 9.5-99.4 9.5-99.4 6.9-53.1-16.6-60.3-33.7-63.2-53.4-2.3-89.7 50.4-96.7 221.1h-0.2c-2.4-61.8 4.6-111 4.6-111 6.3-47.5-7.9-65.5-23.2-71.3-35-13.3-52 16.6-60.3 34.7-23.6 51-29.9 113.7-28.7 170.3l-0.4-0.4c2.4-54.2-25.6-93.7-65.9-66.9-40.3 26.9-30.1 126.2-23.4 186 4.5 38.6 10.4 76.2 4.1 121.4-23.5 259.7 139.3 296.1 247.8 297.8 134.1 6.1 206.3-68.3 233.3-135.4 32.5-80.9 36.6-125.3 52.8-159.3 20.8-43.8 36-49.2 37.3-80.3 1.3-31.1-26.5-46.9-52-45.3z"/>
+		<g id="Clip-Path" clip-path="url(#cp1)">
+			<g>
+				<path fill-rule="evenodd" class="s1" d="m634.4 695.7c11.9 12 17.8 27.9 17.1 45.7-1 24.6-9.5 37.8-19.4 53.2-5.8 9-12.3 19.2-19.8 34.9-6.6 13.8-11.2 30.4-17 51.5-7.6 27.3-17 61.2-35.3 106.7-14.1 35.1-71.4 146.1-231.3 147.6q-9.8 0.1-20-0.3c-93.6-1.5-164.2-28.5-209.3-80.6-46.9-54-65.8-134.2-56.4-238.4l0.1-0.9c5.2-37.6 1.5-69.5-2.6-103.3-0.5-4.4-1-8.7-1.5-13.1-9.4-82.5-15.4-173.1 31.8-204.6 27.9-18.5 49.3-11.1 59.6-4.9 1 0.6 1.9 1.2 2.9 1.9 4.5-32.3 12.6-63.9 25.6-92.2 11.2-24.2 37.1-62.2 83.8-44.5 7.6 3 17.3 8.8 24.8 20.2 6.7-14.7 14.5-26.6 23.5-35.8 16.6-17.2 37.4-25.4 61.6-24.3l2.1 0.2c39.2 6.5 55.9 35 49.4 84.9 0 0 0 0.3 0 0.6 20-17 40.4-16.9 56.1-14.8 17.2 2.2 32.8 12.1 42.8 27.2 8.6 13 17.1 35.8 8.7 70.7-22.3 92.9-26.1 198.5-25.2 268.8 36.3-61.5 59.9-74.1 93.2-76.2 20.8-1.3 41.3 6 54.8 19.8zm-316.9-329.4c-35.1 36.2-49.7 148.5-43.4 333.7 19.5-3 39.5-5.2 59.6-6.5 4.7-91.2 13.3-153.7 23.7-197q0-0.5 0-0.9c2-44.4 9.3-98.9 9.7-101.2 4.7-36.3-5.7-39.2-17-41.1-13.2-0.3-23.5 3.8-32.5 13zm-124.6 49.5c-50 108.3-16.2 277.2-8.5 303.8 16.1-4.8 33.8-9.2 52.4-13-2.1-57.7-2.3-108.1-0.6-151.9-2.3-62.3 4.4-111.4 4.7-113.5 1.8-13 4.2-44.5-11.1-50.3-10.9-4.1-22.8-5.7-36.9 24.8zm407.9 357.3c8.9-13.8 12.6-19.5 13.2-33.3 0.2-6.7-1.6-12-5.9-16.3-5.9-6.1-16-9.4-26.1-8.8-17.3 1.2-33.6 2.2-73.8 75.9-5.6 10.4-8.5 18.9-11.8 28.6-2.1 6.3-4.4 13.2-7.7 20.8-0.1 0.3-0.2 0.6-0.4 0.8-1.4 3.3-3 6.8-4.8 10.4-17.6 34.2-46.2 53.4-76.5 51.6-10.4-0.7-18.2-9.9-17.6-20.6 0.6-10.7 9.6-18.7 19.8-18.2 18 1.1 33.1-15.3 41.2-31.1 0.7-1.4 1.3-2.8 2-4.2-4-41.1-11.8-210.7 22.9-354.8 3.9-16.5 2.8-30.1-3.3-39.3-5.6-8.4-13.4-10.3-16.5-10.7-11.2-1.5-19.3-0.9-27.8 6.4-20.5 17.8-46.8 77.8-56.4 261.8 5.8 0 11.6 0 17.3 0.2 10.3 0.3 18.5 9.2 18.2 19.9-0.3 10.7-8.8 19.1-19.2 18.9-95.8-2.8-204.4 22.8-250.2 47.9-2.8 1.5-5.7 2.3-8.6 2.3-6.8 0.1-13.4-3.7-16.8-10.3-4.8-9.5-1.4-21.2 7.8-26.3 8.1-4.4 17.9-8.9 28.9-13.1-6.7-22.4-18.6-82-20.1-150.6-0.3-1.5-0.5-3-0.4-4.6 1.2-29.4-7.6-48.4-16.4-53.6-5.2-3.1-12.1-1.8-20.6 3.9-31.6 
21-19.4 127.4-14.9 167.4 0.5 4.3 1 8.6 1.5 12.9 4.1 34.7 8.4 70.6 2.6 113-8.3 92.7 7.5 162.9 47 208.5 38.4 44.2 98 66.3 182.4 67.6 151.5 7.1 203.3-92.6 215.7-123.3 17.4-43.4 26.5-76.2 33.8-102.5 6.4-22.9 11.4-41 19.5-58 8.5-17.9 16.1-29.7 22.2-39.2z"/>
+			</g>
+		</g>
+		<g id="Clip-Path" clip-path="url(#cp2)">
+			<g>
+				<path fill-rule="evenodd" class="s1" d="m1301.9 802.9l0.1 0.9c11.3 104-6.3 184.6-52.2 239.5-44.1 52.8-114.2 81.3-208.3 84.5q-10.2 0.7-19.9 0.8c-159.5 1.6-218.8-108.4-233.5-143.2-19.1-45.1-29.1-78.9-37.2-106-6.2-21-11.1-37.5-18-51.2-7.7-15.5-14.4-25.6-20.4-34.5-10.1-15.1-18.8-28.2-20.3-52.8-1.1-17.7 4.5-33.6 16.2-45.9 13.3-14 33.8-21.8 54.5-20.9 33.3 1.5 57.1 13.7 94.5 74.5-0.3-70.4-6-175.9-30-268.4-9-34.8-0.9-57.7 7.5-70.8 9.7-15.3 25.1-25.5 42.2-28.1 15.7-2.3 36.2-2.9 56.4 13.8 0-0.2 0-0.4 0-0.4-7.4-49.9 8.8-78.8 47.9-86l2.1-0.3c24.1-1.6 45 6.3 62 23.2 9.1 9 17.1 20.7 24.1 35.3 7.4-11.6 16.9-17.6 24.5-20.6 46.4-18.7 72.9 18.9 84.5 42.9 13.5 28 22.1 59.5 27.3 91.6 0.9-0.6 1.8-1.3 2.8-1.9 10.2-6.3 31.5-14.3 59.7 3.8 47.8 30.6 43.4 121.3 35.5 203.9-0.4 4.4-0.8 8.7-1.3 13.1-3.4 33.9-6.6 65.9-0.8 103.3zm-197.6-256.5c2.5 43.7 3.2 94.2 2.1 151.9 18.7 3.5 36.4 7.5 52.7 12 7.2-26.7 37.9-196.3-14-303.7-14.6-30.1-26.5-28.4-37.4-24.1-15.1 6.1-12.2 37.5-10.2 50.7 0.4 1.9 8 50.9 6.8 113.2zm-117.5-199.2c-11.4 2.2-21.6 5.2-16.3 41.6 0.4 2.1 8.8 56.4 11.5 100.8q0 0.5 0 0.9c11.2 43.2 21 105.4 27.2 196.6 20.2 0.9 40.2 2.7 59.7 5.3 3-185.3-13.5-297.2-49.3-332.8-9.1-9.1-19.5-13-32.7-12.4zm278.5 348.5c0.5-4.3 0.9-8.6 1.3-13 3.9-40.1 14.1-146.6-17.9-167-8.6-5.5-15.5-6.7-20.6-3.5-8.7 5.4-17.2 24.5-15.4 53.9 0.1 1.5 0 3.1-0.3 4.6-0.4 68.5-11.2 128.4-17.5 150.9 11.1 4.1 20.9 8.3 29.1 12.6 9.3 4.9 13 16.6 8.3 26.2-3.3 6.7-9.8 10.6-16.6 10.6-2.9 0-5.9-0.6-8.6-2.1-46.2-24.3-155.4-47.7-251-43.2-10.5 0.3-19.2-7.7-19.6-18.5-0.5-10.7 7.5-19.8 17.9-20.2 5.6-0.3 11.4-0.5 17.2-0.6-12.8-183.8-40.2-243.3-61-260.7-8.6-7.1-16.8-7.6-27.9-5.9-3.2 0.5-10.9 2.5-16.4 11.1-5.9 9.3-6.8 22.9-2.5 39.3 37.3 143.5 32.4 313.2 29.2 354.3 0.7 1.4 1.3 2.7 2.1 4.2 8.4 15.6 23.7 31.7 41.7 30.3 10.3-0.7 19.3 7.2 20.1 17.9 0.8 10.6-6.9 20-17.2 20.8-30.3 2.5-59.3-16.3-77.4-50.1-2-3.6-3.6-7-5.1-10.3-0.1-0.2-0.2-0.5-0.3-0.7-3.5-7.5-5.9-14.5-8.2-20.8-3.5-9.6-6.4-18-12.3-28.3-41.6-72.9-57.9-73.7-75.1-74.5-10.2-0.5-20.2 
3.1-26.1 9.3-4.1 4.3-5.9 9.7-5.5 16.4 0.8 13.8 4.6 19.4 13.7 33.1 6.3 9.4 14.1 21 23 38.7 8.3 16.9 13.7 34.8 20.5 57.7 7.7 26.2 17.4 58.7 35.6 101.8 12.9 30.5 66.7 129.1 217.3 119.2 84.9-2.9 144.2-26.2 181.7-71.1 38.7-46.3 53.2-116.8 43.3-209.4-6.6-42.3-3-78.2 0.5-113z"/>
+			</g>
+		</g>
+		<path class="s1" d="m739.8 434.2c-3 0-6.1-0.6-9-2-10.5-5-15.2-18-10.3-28.9 15.7-35.6 38.8-68.4 66.6-94.9 8.5-8.1 21.9-7.6 29.7 1.3 7.9 8.9 7.4 22.7-1.2 30.8-23.7 22.7-43.4 50.6-56.8 81-3.6 7.9-11.1 12.6-19 12.7z"/>
+		<path class="s1" d="m668.8 421.6c-10.9 0.1-20.3-8.5-21.2-20-4-51.4-4.2-103.5-0.4-154.8 0.9-12 11.1-21 22.6-20.1 11.6 0.9 20.3 11.3 19.4 23.3-3.6 49.1-3.4 98.9 0.4 148 1 12-7.7 22.5-19.3 23.5-0.5 0-1 0-1.5 0z"/>
+		<path class="s1" d="m596.2 435.1c-9.4 0.1-18.2-6.5-20.6-16.4-8.9-36.3-25.9-70.8-48.9-99.7-7.4-9.3-6.1-23 2.8-30.7 9-7.6 22.3-6.3 29.7 3 26.9 33.8 46.7 74.2 57.2 116.7 2.9 11.6-4 23.5-15.2 26.5-1.8 0.4-3.4 0.6-5.1 0.7z"/>
+	</g>
 </svg>
--- a/frontend/src/api/conversation-service/v1-conversation-service.api.ts
+++ b/frontend/src/api/conversation-service/v1-conversation-service.api.ts
@@ -11,6 +11,7 @@ import type {
  V1AppConversationStartTask,
  V1AppConversationStartTaskPage,
  V1AppConversation,
+  V1SandboxInfo,
 } from "./v1-conversation-service.types";

 class V1ConversationService {
@@ -60,8 +61,6 @@ class V1ConversationService {
    selected_branch?: string,
    conversationInstructions?: string,
    trigger?: ConversationTrigger,
-    parent_conversation_id?: string,
-    agent_type?: "default" | "plan",
  ): Promise<V1AppConversationStartTask> {
    const body: V1AppConversationStartRequest = {
      selected_repository: selectedRepository,
@@ -69,8 +68,6 @@ class V1ConversationService {
      selected_branch,
      title: conversationInstructions,
      trigger,
-      parent_conversation_id: parent_conversation_id || null,
-      agent_type,
    };

    // Add initial message if provided
@@ -115,11 +112,11 @@ class V1ConversationService {
   * Search for start tasks (ongoing tasks that haven't completed yet)
   * Use this to find tasks that were started but the user navigated away
   *
-   * Note: Backend supports filtering by limit and created_at__gte. To filter by repository/trigger,
+   * Note: Backend only supports filtering by limit. To filter by repository/trigger,
   * filter the results client-side after fetching.
   *
   * @param limit Maximum number of tasks to return (max 100)
-   * @returns Array of start tasks from the last 20 minutes
+   * @returns Array of start tasks
   */
  static async searchStartTasks(
    limit: number = 100,
@@ -127,10 +124,6 @@ class V1ConversationService {
    const params = new URLSearchParams();
    params.append("limit", limit.toString());

-    // Only get tasks from the last 20 minutes
-    const twentyMinutesAgo = new Date(Date.now() - 20 * 60 * 1000);
-    params.append("created_at__gte", twentyMinutesAgo.toISOString());
-
    const { data } = await openHands.get<V1AppConversationStartTaskPage>(
      `/api/v1/app-conversations/start-tasks/search?${params.toString()}`,
    );
@@ -220,6 +213,36 @@ class V1ConversationService {
    return data;
  }

+  /**
+   * Pause a V1 sandbox
+   * Calls the /api/v1/sandboxes/{id}/pause endpoint
+   *
+   * @param sandboxId The sandbox ID to pause
+   * @returns Success response
+   */
+  static async pauseSandbox(sandboxId: string): Promise<{ success: boolean }> {
+    const { data } = await openHands.post<{ success: boolean }>(
+      `/api/v1/sandboxes/${sandboxId}/pause`,
+      {},
+    );
+    return data;
+  }
+
+  /**
+   * Resume a V1 sandbox
+   * Calls the /api/v1/sandboxes/{id}/resume endpoint
+   *
+   * @param sandboxId The sandbox ID to resume
+   * @returns Success response
+   */
+  static async resumeSandbox(sandboxId: string): Promise<{ success: boolean }> {
+    const { data } = await openHands.post<{ success: boolean }>(
+      `/api/v1/sandboxes/${sandboxId}/resume`,
+      {},
+    );
+    return data;
+  }
+
  /**
   * Batch get V1 app conversations by their IDs
   * Returns null for any missing conversations
@@ -246,6 +269,32 @@ class V1ConversationService {
    return data;
  }

+  /**
+   * Batch get V1 sandboxes by their IDs
+   * Returns null for any missing sandboxes
+   *
+   * @param ids Array of sandbox IDs (max 100)
+   * @returns Array of sandboxes or null for missing ones
+   */
+  static async batchGetSandboxes(
+    ids: string[],
+  ): Promise<(V1SandboxInfo | null)[]> {
+    if (ids.length === 0) {
+      return [];
+    }
+    if (ids.length > 100) {
+      throw new Error("Cannot request more than 100 sandboxes at once");
+    }
+
+    const params = new URLSearchParams();
+    ids.forEach((id) => params.append("id", id));
+
+    const { data } = await openHands.get<(V1SandboxInfo | null)[]>(
+      `/api/v1/sandboxes?${params.toString()}`,
+    );
+    return data;
+  }
+
  /**
   * Upload a single file to the V1 conversation workspace
   * V1 API endpoint: POST /api/file/upload/{path}
@@ -298,21 +347,20 @@ class V1ConversationService {
  }

  /**
-   * Read a file from a specific conversation's sandbox workspace
-   * @param conversationId The conversation ID
-   * @param filePath Path to the file to read within the sandbox workspace (defaults to /workspace/project/PLAN.md)
-   * @returns The content of the file or an empty string if the file doesn't exist
+   * Get the count of events for a conversation
+   * Uses the V1 API endpoint: GET /api/v1/events/count
+   *
+   * @param conversationId The conversation ID to get event count for
+   * @returns The number of events in the conversation
   */
-  static async readConversationFile(
-    conversationId: string,
-    filePath: string = "/workspace/project/PLAN.md",
-  ): Promise<string> {
+  static async getEventCount(conversationId: string): Promise<number> {
    const params = new URLSearchParams();
-    params.append("file_path", filePath);
+    params.append("conversation_id__eq", conversationId);

-    const { data } = await openHands.get<string>(
-      `/api/v1/app-conversations/${conversationId}/file?${params.toString()}`,
+    const { data } = await openHands.get<number>(
+      `/api/v1/events/count?${params.toString()}`,
    );
+
    return data;
  }
 }
--- a/frontend/src/api/conversation-service/v1-conversation-service.types.ts
+++ b/frontend/src/api/conversation-service/v1-conversation-service.types.ts
@@ -1,6 +1,5 @@
 import { ConversationTrigger } from "../open-hands.types";
 import { Provider } from "#/types/settings";
-import { V1SandboxStatus } from "../sandbox-service/sandbox-service.types";

 // V1 API Types for requests
 // Note: This represents the serialized API format, not the internal TextContent/ImageContent types
@@ -30,8 +29,6 @@ export interface V1AppConversationStartRequest {
  title?: string | null;
  trigger?: ConversationTrigger | null;
  pr_number?: number[];
-  parent_conversation_id?: string | null;
-  agent_type?: "default" | "plan";
 }

 export type V1AppConversationStartTaskStatus =
@@ -40,7 +37,6 @@ export type V1AppConversationStartTaskStatus =
  | "PREPARING_REPOSITORY"
  | "RUNNING_SETUP_SCRIPT"
  | "SETTING_UP_GIT_HOOKS"
-  | "SETTING_UP_SKILLS"
  | "STARTING_CONVERSATION"
  | "READY"
  | "ERROR";
@@ -68,7 +64,14 @@ export interface V1AppConversationStartTaskPage {
  next_page_id: string | null;
 }

-export type V1ConversationExecutionStatus =
+export type V1SandboxStatus =
+  | "MISSING"
+  | "STARTING"
+  | "RUNNING"
+  | "STOPPED"
+  | "PAUSED";
+
+export type V1AgentExecutionStatus =
  | "RUNNING"
  | "AWAITING_USER_INPUT"
  | "AWAITING_USER_CONFIRMATION"
@@ -91,7 +94,22 @@ export interface V1AppConversation {
  created_at: string;
  updated_at: string;
  sandbox_status: V1SandboxStatus;
-  execution_status: V1ConversationExecutionStatus | null;
+  agent_status: V1AgentExecutionStatus | null;
  conversation_url: string | null;
  session_api_key: string | null;
 }
+
+export interface V1ExposedUrl {
+  name: string;
+  url: string;
+}
+
+export interface V1SandboxInfo {
+  id: string;
+  created_by_user_id: string | null;
+  sandbox_spec_id: string;
+  status: V1SandboxStatus;
+  session_api_key: string | null;
+  exposed_urls: V1ExposedUrl[] | null;
+  created_at: string;
+}
--- a/frontend/src/api/event-service/event-service.api.ts
+++ b/frontend/src/api/event-service/event-service.api.ts
@@ -5,7 +5,6 @@ import type {
  ConfirmationResponseRequest,
  ConfirmationResponseResponse,
 } from "./event-service.types";
-import { openHands } from "../open-hands-axios";

 class EventService {
  /**
@@ -37,14 +36,6 @@ class EventService {

    return data;
  }
-
-  static async getEventCount(conversationId: string): Promise<number> {
-    const params = new URLSearchParams();
-    params.append("conversation_id__eq", conversationId);
-    const { data } = await openHands.get<number>(
-      `/api/v1/events/count?${params.toString()}`,
-    );
-    return data;
-  }
 }
+
 export default EventService;
--- a/frontend/src/api/open-hands.types.ts
+++ b/frontend/src/api/open-hands.types.ts
@@ -77,7 +77,6 @@ export interface Conversation {
  session_api_key: string | null;
  pr_number?: number[] | null;
  conversation_version?: "V0" | "V1";
-  sub_conversation_ids?: string[];
 }

 export interface ResultSet<T> {
--- a/frontend/src/api/sandbox-service/sandbox-service.api.ts
+++ b/frontend/src/api/sandbox-service/sandbox-service.api.ts
@@ -1,52 +0,0 @@
-// sandbox-service.api.ts
-// This file contains API methods for /api/v1/sandboxes endpoints.
-
-import { openHands } from "../open-hands-axios";
-import type { V1SandboxInfo } from "./sandbox-service.types";
-
-export class SandboxService {
-  /**
-   * Pause a V1 sandbox
-   * Calls the /api/v1/sandboxes/{id}/pause endpoint
-   */
-  static async pauseSandbox(sandboxId: string): Promise<{ success: boolean }> {
-    const { data } = await openHands.post<{ success: boolean }>(
-      `/api/v1/sandboxes/${sandboxId}/pause`,
-      {},
-    );
-    return data;
-  }
-
-  /**
-   * Resume a V1 sandbox
-   * Calls the /api/v1/sandboxes/{id}/resume endpoint
-   */
-  static async resumeSandbox(sandboxId: string): Promise<{ success: boolean }> {
-    const { data } = await openHands.post<{ success: boolean }>(
-      `/api/v1/sandboxes/${sandboxId}/resume`,
-      {},
-    );
-    return data;
-  }
-
-  /**
-   * Batch get V1 sandboxes by their IDs
-   * Returns null for any missing sandboxes
-   */
-  static async batchGetSandboxes(
-    ids: string[],
-  ): Promise<(V1SandboxInfo | null)[]> {
-    if (ids.length === 0) {
-      return [];
-    }
-    if (ids.length > 100) {
-      throw new Error("Cannot request more than 100 sandboxes at once");
-    }
-    const params = new URLSearchParams();
-    ids.forEach((id) => params.append("id", id));
-    const { data } = await openHands.get<(V1SandboxInfo | null)[]>(
-      `/api/v1/sandboxes?${params.toString()}`,
-    );
-    return data;
-  }
-}
--- a/frontend/src/api/sandbox-service/sandbox-service.types.ts
+++ b/frontend/src/api/sandbox-service/sandbox-service.types.ts
@@ -1,24 +0,0 @@
-// sandbox-service.types.ts
-// This file contains types for Sandbox API.
-
-export type V1SandboxStatus =
-  | "MISSING"
-  | "STARTING"
-  | "RUNNING"
-  | "STOPPED"
-  | "PAUSED";
-
-export interface V1ExposedUrl {
-  name: string;
-  url: string;
-}
-
-export interface V1SandboxInfo {
-  id: string;
-  created_by_user_id: string | null;
-  sandbox_spec_id: string;
-  status: V1SandboxStatus;
-  session_api_key: string | null;
-  exposed_urls: V1ExposedUrl[] | null;
-  created_at: string;
-}
commit eb348a5f3d
Author: openhands
Date:   2025-11-03 17:09:16 +00:00

    Remove KeyboardInterrupt exit behavior from main chat loop

    - Change KeyboardInterrupt handler to continue loop instead of exiting
    - Let signal handler manage Ctrl+C behavior completely
    - Only exit on explicit /exit command or outer KeyboardInterrupt

    This ensures that Ctrl+C during agent processing returns to chat loop instead of exiting the entire application.

    Co-authored-by: openhands <openhands@all-hands.dev>

commit 099dcb787f
Author: openhands
Date:   2025-11-03 17:07:47 +00:00

    Fix Ctrl+C behavior to return to chat loop instead of exiting

    - Remove os._exit(1) from second Ctrl+C handler
    - Reset Ctrl+C counter after force killing process
    - Add graceful handling in SimpleProcessRunner for killed processes
    - Show user-friendly message that they can continue sending messages

    This allows users to stop a running agent and continue with new messages instead of having to restart the entire CLI application.

    Co-authored-by: openhands <openhands@all-hands.dev>

commit b3034a0d75
Author: openhands
Date:   2025-11-03 16:59:29 +00:00

    Fix multiprocessing serialization issues in SimpleProcessRunner

    - Pass conversation_id and message_data instead of full objects to subprocess
    - Recreate conversation and message objects in the subprocess
    - Extract text content from Message objects for serialization
    - Store conversation_id as string for subprocess recreation

    This fixes the 'cannot pickle _asyncio.Future object' error by avoiding passing non-serializable objects between processes.

    Co-authored-by: openhands <openhands@all-hands.dev>

commit 459e224d37
Author: openhands
Date:   2025-11-03 16:57:02 +00:00

    Fix Message creation to include required role field

    - Add role='user' to Message constructor in agent_chat.py
    - This fixes the validation error when processing user messages

    Co-authored-by: openhands <openhands@all-hands.dev>

commit 97f13b7100
Author: openhands
Date:   2025-11-03 16:51:47 +00:00

    Fix SimpleProcessRunner to use proper SDK imports

    - Replace incorrect openhands.core.main imports with openhands.sdk
    - Use existing ConversationRunner from runner.py instead of run_controller
    - Update SimpleProcessRunner to accept BaseConversation instead of setup function
    - Update agent_chat.py to create conversation first, then pass to SimpleProcessRunner
    - Fix process_message to use proper Message object with TextContent

    This ensures the openhands-cli remains standalone and only uses the SDK library as intended, without importing from the main OpenHands codebase.

    Co-authored-by: openhands <openhands@all-hands.dev>

commit 6ecaca5b3c
Author: openhands
Date:   2025-11-03 16:46:43 +00:00

    Simplify Ctrl+C handling implementation

    - Replace complex ProcessSignalHandler with SimpleSignalHandler
    - Direct signal handling in main process instead of queue communication
    - Simple Ctrl+C counting with immediate force kill on second press
    - Reset functionality to clear count when starting new operations
    - Replace ProcessBasedConversationRunner with SimpleProcessRunner
    - Minimal multiprocessing - only process_message runs in subprocess
    - Direct method calls for status, settings, and other operations
    - No unnecessary queue communication
    - Update agent_chat.py to use simplified components
    - Reset Ctrl+C count when starting new message processing
    - Direct method calls for commands that don't need process isolation
    - Cleaner error handling and resource cleanup
    - Update simple_main.py imports

    Fixes issues where second Ctrl+C wouldn't register properly due to complex queue-based communication and race conditions.

    Co-authored-by: openhands <openhands@all-hands.dev>

commit 5351702d3a
Author: openhands
Date:   2025-11-03 16:24:23 +00:00

    Implement improved Ctrl+C handling for OpenHands CLI

    - First Ctrl+C attempts graceful pause of agent
    - Second Ctrl+C (within 3 seconds) kills process immediately
    - Added SignalHandler and ProcessSignalHandler classes for signal management
    - Implemented ProcessBasedConversationRunner for separate process execution
    - Modified pause_listener to remove Ctrl+C handling (now handled by signal handler)
    - Updated agent_chat.py to use process-based runner with new signal management
    - Updated simple_main.py to install basic signal handler
    - Added comprehensive test script and documentation

    Co-authored-by: openhands <openhands@all-hands.dev>
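
The commit messages above describe a two-press Ctrl+C policy: the first press asks the agent to pause, a second press within 3 seconds force kills the worker process, and the counter is reset after a kill or when a new message starts processing. A minimal sketch of that counting logic follows. It is not the repository's `SimpleSignalHandler`; the names `CtrlCCounter`, `press`, `reset`, `install`, and the injectable `clock` parameter are hypothetical and chosen only for illustration.

```python
import signal
import time
from typing import Callable, Optional


class CtrlCCounter:
    """Hypothetical helper: decides whether a Ctrl+C press means pause or kill."""

    def __init__(
        self,
        window_seconds: float = 3.0,  # window taken from commit 5351702d3a
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_seconds = window_seconds
        self._clock = clock
        self._first_press: Optional[float] = None

    def press(self) -> str:
        """Record one Ctrl+C press and return "pause" or "kill"."""
        now = self._clock()
        if (
            self._first_press is not None
            and now - self._first_press <= self.window_seconds
        ):
            # Second press inside the window: escalate, then reset so the
            # chat loop can keep accepting messages afterwards.
            self._first_press = None
            return "kill"
        self._first_press = now
        return "pause"

    def reset(self) -> None:
        """Clear the count, e.g. when a new message starts processing."""
        self._first_press = None


def install(
    counter: CtrlCCounter,
    on_pause: Callable[[], None],
    on_kill: Callable[[], None],
) -> None:
    """Route SIGINT through the counter (must be called from the main thread)."""

    def handler(signum: int, frame: object) -> None:
        if counter.press() == "kill":
            on_kill()
        else:
            on_pause()

    signal.signal(signal.SIGINT, handler)
```

In this sketch `on_kill` would terminate the subprocess running `process_message` rather than the CLI itself, which matches the stated goal of returning to the chat loop instead of exiting. Injecting `clock` keeps the timing logic testable without sending real signals.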