Compare commits


24 Commits

Author SHA1 Message Date
Ray Myers 4f0c659dad Bump version to 0.44.0 2025-06-16 14:46:48 -05:00
Rohit Malhotra 1f90086030 (Hotfix): Slack app installation flow (#9162)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-06-16 19:33:43 +00:00
Xingyao Wang 2c4ecd02f7 feat(frontend): add user feedback Likert scale for agent performance rating (only on OH Cloud) (#8992)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: sp.wack <83104063+amanape@users.noreply.github.com>
2025-06-16 19:26:24 +00:00
Rohit Malhotra 2fd1fdcd7e [Refactor, Fix]: Agent controller state/metrics management (#9012)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-06-16 11:24:13 -04:00
Graham Neubig cbe32a1a12 Fix bash timeout issue caused by interactive git clone prompts (#9148)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-06-16 08:39:28 -04:00
better629 432d8829dc disable mcp in run_localize and install oh-aci[llama] for issue 9150 (#9151) 2025-06-16 11:03:17 +00:00
Graham Neubig 24f891687d Fix CLI displaying claude-2 as default model for anthropic provider (#9101)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-06-15 21:21:33 -04:00
Graham Neubig 2d2ccf1329 Fix conversation URL format in pull request links (#9143) 2025-06-15 15:41:08 -04:00
FT e5bff91e8e Fix Typo: Change "accurancy" to "accuracy" in Evaluation Benchmark Comments (#9139) 2025-06-15 12:48:26 +00:00
Linghao Zhang a93b0457c6 feat(eval): Support evaluation on SWE-bench-Live (#9137) 2025-06-15 12:30:47 +00:00
Graham Neubig 98e0f5509c Update CLI mode docs to accurately reflect settings workflow (#9134) 2025-06-14 19:21:18 +00:00
kilavvy 4e99aabcb2 Minor Code Comment Corrections and Clarifications (#9129) 2025-06-14 18:57:14 +00:00
Graham Neubig 0c307ea12e Lint all files in the repo (#9131)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-06-14 16:25:59 +00:00
Graham Neubig 5134a7d938 Add secrets manager documentation to GUI mode docs (#9084)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-06-14 12:13:24 -04:00
Graham Neubig a1627914ad Fix broken link to LLMs section in GUI mode documentation (#9121)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-06-14 23:26:41 +08:00
Graham Neubig ccdd86e476 docs: remove 'coming soon' mentions from Slack app installation page (#9112)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Rohit Malhotra <rohitvinodmalhotra@gmail.com>
2025-06-14 14:35:04 +00:00
ASTONE be62ba6b35 add_versicode (#8221) 2025-06-14 13:17:18 +00:00
leopardracer 13c298d35f Minor Typo Fixes in Comments and Documentation (#9058) 2025-06-14 12:51:38 +00:00
llamantino 47b0dc548e feat: support dev container networking without host mode (#9122) 2025-06-14 08:38:18 -04:00
Graham Neubig 90ae4bda0d Restore Windows without WSL documentation (#9090)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-06-14 08:35:30 -04:00
dependabot[bot] 8963644fb4 chore(deps): bump the version-all group across 1 directory with 14 updates (#9107)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-14 07:58:24 -04:00
Engel Nyst fd3b4ac8e6 Refactor SWE-bench instruction (#8010) 2025-06-13 23:27:52 +02:00
Rohit Malhotra 53623c76b5 [Fix]: allow agent to configure draft status for opened prs/mrs via git mcp (#9117) 2025-06-13 21:06:23 +00:00
Ray Myers e6036b8346 Bump version for 0.43.0 release (#9109) 2025-06-13 14:47:26 -05:00
151 changed files with 5805 additions and 1417 deletions
+4 -1
@@ -12,5 +12,8 @@
"ghcr.io/devcontainers/features/node:1": {},
},
"postCreateCommand": ".devcontainer/setup.sh",
"runArgs": ["--network=host"],
"runArgs": ["--add-host=host.docker.internal:host-gateway"],
"containerEnv": {
"DOCKER_HOST_ADDR": "host.docker.internal"
},
}
+1 -1
@@ -74,7 +74,7 @@ jobs:
- name: Fix python lint issues
run: |
# Run all pre-commit hooks and continue even if they modify files (exit code 1)
pre-commit run --config ./dev_config/python/.pre-commit-config.yaml --files openhands/**/* evaluation/**/* tests/**/* || true
pre-commit run --config ./dev_config/python/.pre-commit-config.yaml --all-files || true
# Commit and push changes if any
- name: Check for changes
+1 -1
@@ -53,7 +53,7 @@ jobs:
- name: Install pre-commit
run: pip install pre-commit==3.7.0
- name: Run pre-commit hooks
run: pre-commit run --files openhands/**/* evaluation/**/* tests/**/* --show-diff-on-failure --config ./dev_config/python/.pre-commit-config.yaml
run: pre-commit run --all-files --show-diff-on-failure --config ./dev_config/python/.pre-commit-config.yaml
# Check version consistency across documentation
check-version-consistency:
-1
@@ -81,4 +81,3 @@ jobs:
env:
TEST_RUNTIME: local
DEBUG: "1"
+1 -1
@@ -136,7 +136,7 @@ poetry run pytest ./tests/unit/test_*.py
To reduce build time (e.g., if no changes were made to the client-runtime component), you can use an existing Docker
container image by setting the SANDBOX_RUNTIME_CONTAINER_IMAGE environment variable to the desired Docker image.
Example: `export SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/all-hands-ai/runtime:0.42-nikolaik`
Example: `export SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/all-hands-ai/runtime:0.44-nikolaik`
## Develop inside Docker container
+1 -1
@@ -189,7 +189,7 @@ install-pre-commit-hooks:
lint-backend:
@echo "$(YELLOW)Running linters...$(RESET)"
@poetry run pre-commit run --files openhands/**/* evaluation/**/* tests/**/* --show-diff-on-failure --config $(PRE_COMMIT_CONFIG_PATH)
@poetry run pre-commit run --all-files --show-diff-on-failure --config $(PRE_COMMIT_CONFIG_PATH)
lint-frontend:
@echo "$(YELLOW)Running linters for frontend...$(RESET)"
+11 -11
@@ -20,15 +20,15 @@
<a href="https://docs.google.com/spreadsheets/d/1wOUdFCMyY6Nt0AIqF705KN4JKOWgeI4wUGUP60krXXs/edit?gid=0#gid=0"><img src="https://img.shields.io/badge/Benchmark%20score-000?logoColor=FFE165&logo=huggingface&style=for-the-badge" alt="Evaluation Benchmark Score"></a>
<!-- Keep these links. Translations will automatically update with the README. -->
<a href="https://www.readme-i18n.com/All-Hands-AI/OpenHands?lang=de">Deutsch</a> |
<a href="https://www.readme-i18n.com/All-Hands-AI/OpenHands?lang=es">Español</a> |
<a href="https://www.readme-i18n.com/All-Hands-AI/OpenHands?lang=fr">français</a> |
<a href="https://www.readme-i18n.com/All-Hands-AI/OpenHands?lang=ja">日本語</a> |
<a href="https://www.readme-i18n.com/All-Hands-AI/OpenHands?lang=ko">한국어</a> |
<a href="https://www.readme-i18n.com/All-Hands-AI/OpenHands?lang=pt">Português</a> |
<a href="https://www.readme-i18n.com/All-Hands-AI/OpenHands?lang=ru">Русский</a> |
<a href="https://www.readme-i18n.com/All-Hands-AI/OpenHands?lang=de">Deutsch</a> |
<a href="https://www.readme-i18n.com/All-Hands-AI/OpenHands?lang=es">Español</a> |
<a href="https://www.readme-i18n.com/All-Hands-AI/OpenHands?lang=fr">français</a> |
<a href="https://www.readme-i18n.com/All-Hands-AI/OpenHands?lang=ja">日本語</a> |
<a href="https://www.readme-i18n.com/All-Hands-AI/OpenHands?lang=ko">한국어</a> |
<a href="https://www.readme-i18n.com/All-Hands-AI/OpenHands?lang=pt">Português</a> |
<a href="https://www.readme-i18n.com/All-Hands-AI/OpenHands?lang=ru">Русский</a> |
<a href="https://www.readme-i18n.com/All-Hands-AI/OpenHands?lang=zh">中文</a>
<hr>
</div>
@@ -62,17 +62,17 @@ system requirements and more information.
```bash
docker pull docker.all-hands.dev/all-hands-ai/runtime:0.42-nikolaik
docker pull docker.all-hands.dev/all-hands-ai/runtime:0.44-nikolaik
docker run -it --rm --pull=always \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.42-nikolaik \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.44-nikolaik \
-e LOG_ALL_EVENTS=true \
-v /var/run/docker.sock:/var/run/docker.sock \
-v ~/.openhands-state:/.openhands-state \
-p 3000:3000 \
--add-host host.docker.internal:host-gateway \
--name openhands-app \
docker.all-hands.dev/all-hands-ai/openhands:0.42
docker.all-hands.dev/all-hands-ai/openhands:0.44
```
You'll find OpenHands running at [http://localhost:3000](http://localhost:3000)!
+3 -3
@@ -51,17 +51,17 @@ OpenHands也可以使用Docker在本地系统上运行。
```bash
docker pull docker.all-hands.dev/all-hands-ai/runtime:0.42-nikolaik
docker pull docker.all-hands.dev/all-hands-ai/runtime:0.44-nikolaik
docker run -it --rm --pull=always \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.42-nikolaik \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.44-nikolaik \
-e LOG_ALL_EVENTS=true \
-v /var/run/docker.sock:/var/run/docker.sock \
-v ~/.openhands-state:/.openhands-state \
-p 3000:3000 \
--add-host host.docker.internal:host-gateway \
--name openhands-app \
docker.all-hands.dev/all-hands-ai/openhands:0.42
docker.all-hands.dev/all-hands-ai/openhands:0.44
```
您将在[http://localhost:3000](http://localhost:3000)找到运行中的OpenHands
+2 -2
@@ -10,13 +10,13 @@ services:
environment:
- BACKEND_HOST=${BACKEND_HOST:-"0.0.0.0"}
- SANDBOX_API_HOSTNAME=host.docker.internal
- DOCKER_HOST_ADDR=host.docker.internal
#
- SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-ghcr.io/all-hands-ai/runtime:0.42-nikolaik}
- SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-ghcr.io/all-hands-ai/runtime:0.44-nikolaik}
- SANDBOX_USER_ID=${SANDBOX_USER_ID:-1234}
- WORKSPACE_MOUNT_PATH=${WORKSPACE_BASE:-$PWD/workspace}
ports:
- "3000:3000"
network_mode: host
extra_hosts:
- "host.docker.internal:host-gateway"
volumes:
+1 -1
@@ -7,7 +7,7 @@ services:
image: openhands:latest
container_name: openhands-app-${DATE:-}
environment:
- SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-docker.all-hands.dev/all-hands-ai/runtime:0.42-nikolaik}
- SANDBOX_RUNTIME_CONTAINER_IMAGE=${SANDBOX_RUNTIME_CONTAINER_IMAGE:-docker.all-hands.dev/all-hands-ai/runtime:0.44-nikolaik}
#- SANDBOX_USER_ID=${SANDBOX_USER_ID:-1234} # enable this only if you want a specific non-root sandbox user but you will have to manually adjust permissions of openhands-state for this user
- WORKSPACE_MOUNT_PATH=${WORKSPACE_BASE:-$PWD/workspace}
ports:
+2 -2
@@ -4,7 +4,7 @@
npm install -g mint
```
or
or
```
yarn global add mint
@@ -14,4 +14,4 @@ yarn global add mint
```
mint dev
```
```
+29 -8
@@ -1,21 +1,42 @@
---
title: Slack Integration - Coming soon...
title: Slack Integration (Beta)
description: This guide walks you through installing the OpenHands Slack app.
---
<Warning>This integration is not live yet, but will be available soon.</Warning>
## Prerequisites
- You are a slack workspace admin
- Access to OpenHands Cloud
## Installation Steps
1. Log in to [OpenHands Cloud](https://app.all-hands.dev)
2. Click the button below to OpenHands Slack App <a target="_blank" href="https://slack.com/oauth/v2/authorize?client_id=7477886716822.8729519890534&scope=app_mentions:read,chat:write,users:read,channels:history,groups:history,mpim:history,im:history&user_scope=channels:history,groups:history,im:history,mpim:history"><img alt="Add to Slack" height="40" width="139" src="https://platform.slack-edge.com/img/add_to_slack.png" srcSet="https://platform.slack-edge.com/img/add_to_slack.png 1x, https://platform.slack-edge.com/img/add_to_slack@2x.png 2x" /></a>
3. In the top right corner, select the workspace to install the OpenHands Slack app.
4. Review permissions and click allow
<AccordionGroup>
<Accordion title="Install Slack App (only for Slack admins/owners)">
**This step is for Slack admins/owners**
1. Make sure you have permissions to install Apps to your workspace.
2. Click the button below to install OpenHands Slack App <a target="_blank" href="https://slack.com/oauth/v2/authorize?client_id=7477886716822.8729519890534&scope=app_mentions:read,chat:write,users:read,channels:history,groups:history,mpim:history,im:history&user_scope=channels:history,groups:history,im:history,mpim:history"><img alt="Add to Slack" height="40" width="139" src="https://platform.slack-edge.com/img/add_to_slack.png" srcSet="https://platform.slack-edge.com/img/add_to_slack.png 1x, https://platform.slack-edge.com/img/add_to_slack@2x.png 2x" /></a>
3. In the top right corner, select the workspace to install the OpenHands Slack app.
4. Review permissions and click allow.
</Accordion>
<Accordion title="Authorize Slack App (for all Slack workspace members)">
**Make sure your Slack workspace admin/owner has installed OpenHands Slack App first**
Every user in the Slack workspace (including admins/owners) must link their OpenHands Cloud account to the OpenHands Slack App. To do this:
1. Visit [integrations settings](https://app.all-hands.dev/settings/integrations) in OpenHands Cloud.
2. Click the button "Install Slack App".
3. In the top right corner, select the workspace to install the OpenHands Slack app.
4. Review permissions and click allow.
Depending on the workspace settings, you may need approval from your Slack admin to authorize the Slack App.
</Accordion>
</AccordionGroup>
## Working With the Slack App
+5 -4
@@ -17,13 +17,14 @@ for scripting.
pip install openhands-ai
```
2. Set your model, API key, and other preferences using environment variables or with the [`config.toml`](https://github.com/All-Hands-AI/OpenHands/blob/main/config.template.toml) file.
3. Launch an interactive OpenHands conversation from the command line:
2. Launch an interactive OpenHands conversation from the command line:
```bash
openhands
```
3. Set your model, API key, and other preferences using the UI (or alternatively environment variables, below).
This command opens an interactive prompt where you can type tasks or commands and get responses from OpenHands.
#### For Developers
@@ -46,7 +47,7 @@ poetry run python -m openhands.cli.main
```bash
docker run -it \
--pull=always \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.42-nikolaik \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.44-nikolaik \
-e SANDBOX_USER_ID=$(id -u) \
-e SANDBOX_VOLUMES=$SANDBOX_VOLUMES \
-e LLM_API_KEY=$LLM_API_KEY \
@@ -55,7 +56,7 @@ docker run -it \
-v ~/.openhands-state:/.openhands-state \
--add-host host.docker.internal:host-gateway \
--name openhands-app-$(date +%Y%m%d%H%M%S) \
docker.all-hands.dev/all-hands-ai/openhands:0.42 \
docker.all-hands.dev/all-hands-ai/openhands:0.44 \
python -m openhands.cli.main --override-cli-mode true
```
+32 -2
@@ -27,7 +27,7 @@ You can use the Settings page at any time to:
- [Configure MCP servers](/usage/mcp).
- [Connect to GitHub](/usage/how-to/gui-mode#github-setup) and [connect to GitLab](/usage/how-to/gui-mode#gitlab-setup)
- Set application settings like your preferred language, notifications and other preferences.
- Generate custom secrets.
- [Manage custom secrets](/usage/how-to/gui-mode#secrets-management).
#### GitHub Setup
@@ -122,6 +122,36 @@ OpenHands automatically exports a `GITLAB_TOKEN` to the shell environment if pro
</Accordion>
</AccordionGroup>
#### Secrets Management
OpenHands provides a secrets manager that allows you to securely store and manage sensitive information that can be accessed by the agent during runtime, such as API keys. These secrets are automatically exported as environment variables in the agent's runtime environment.
1. **Accessing the Secrets Manager**:
- In the Settings page, navigate to the `Secrets` tab.
- You'll see a list of all your existing custom secrets (if any).
2. **Adding a New Secret**:
- Click the `Add New Secret` button.
- Fill in the following fields:
- **Name**: A unique identifier for your secret (e.g., `AWS_ACCESS_KEY`). This will be the environment variable name.
- **Value**: The sensitive information you want to store.
- **Description** (optional): A brief description of what the secret is used for, which is also provided to the agent.
- Click `Add Secret` to save.
3. **Editing a Secret**:
- Click the `Edit` button next to the secret you want to modify.
- You can update the name and description of the secret.
- Note: For security reasons, you cannot view or edit the value of an existing secret. If you need to change the value, delete the secret and create a new one.
4. **Deleting a Secret**:
- Click the `Delete` button next to the secret you want to remove.
- Confirm the deletion when prompted.
5. **Using Secrets in the Agent**:
- All custom secrets are automatically exported as environment variables in the agent's runtime environment.
- You can access them in your code using standard environment variable access methods (e.g., `os.environ['SECRET_NAME']` in Python).
- Example: If you create a secret named `OPENAI_API_KEY`, you can access it in your code as `process.env.OPENAI_API_KEY` in JavaScript or `os.environ['OPENAI_API_KEY']` in Python.
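As a minimal sketch of the access pattern described in step 5, the helper below reads a secret from the environment and fails with a clear message when it is missing. The function name `require_secret` is hypothetical and not part of OpenHands; `OPENAI_API_KEY` is only the example name used above.

```python
import os


def require_secret(name: str) -> str:
    """Return the value of an exported secret, or raise if it is not set."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} is not set in this environment")
    return value


if __name__ == "__main__":
    # Assumes a secret named OPENAI_API_KEY was created in the Secrets tab.
    try:
        api_key = require_secret("OPENAI_API_KEY")
        # Print only the length so the secret itself never reaches the logs.
        print(f"Loaded a key of length {len(api_key)}")
    except RuntimeError as err:
        print(err)
```

Failing early like this tends to give a clearer error than a bare `KeyError` raised deep inside a script.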
#### Advanced Settings
The `Advanced` settings allow configuration of additional LLM settings. Inside the Settings page, under the `LLM` tab,
@@ -154,7 +184,7 @@ is loaded. Typically these include:
## Tips for Effective Use
- Be specific in your requests to get the most accurate and helpful responses, as described in the [prompting best practices](../prompting/prompting-best-practices).
- Use one of the recommended models, as described in the [LLMs section](usage/llms/llms.md).
- Use one of the recommended models, as described in the [LLMs section](/usage/llms/llms).
## Other Ways to Run OpenHands
- [Run OpenHands in a scriptable headless mode.](/usage/how-to/headless-mode)
+2 -2
@@ -32,7 +32,7 @@ To run OpenHands in Headless mode with Docker:
```bash
docker run -it \
--pull=always \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.42-nikolaik \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.44-nikolaik \
-e SANDBOX_USER_ID=$(id -u) \
-e SANDBOX_VOLUMES=$SANDBOX_VOLUMES \
-e LLM_API_KEY=$LLM_API_KEY \
@@ -42,7 +42,7 @@ docker run -it \
-v ~/.openhands-state:/.openhands-state \
--add-host host.docker.internal:host-gateway \
--name openhands-app-$(date +%Y%m%d%H%M%S) \
docker.all-hands.dev/all-hands-ai/openhands:0.42 \
docker.all-hands.dev/all-hands-ai/openhands:0.44 \
python -m openhands.core.main -t "write a bash script that prints hi"
```
+2 -2
@@ -8,7 +8,7 @@ description: OpenHands uses LiteLLM to make calls to Google's chat models. You c
When running OpenHands, you'll need to set the following in the OpenHands UI through the Settings under the `LLM` tab:
- `LLM Provider` to `Gemini`
- `LLM Model` to the model you will be using.
If the model is not in the list, enable `Advanced` options, and enter it in `Custom Model`
If the model is not in the list, enable `Advanced` options, and enter it in `Custom Model`
(e.g. gemini/&lt;model-name&gt; like `gemini/gemini-2.0-flash`).
- `API Key` to your Gemini API key
@@ -26,5 +26,5 @@ VERTEXAI_LOCATION="<your-gcp-location>"
Then set the following in the OpenHands UI through the Settings under the `LLM` tab:
- `LLM Provider` to `VertexAI`
- `LLM Model` to the model you will be using.
If the model is not in the list, enable `Advanced` options, and enter it in `Custom Model`
If the model is not in the list, enable `Advanced` options, and enter it in `Custom Model`
(e.g. vertex_ai/&lt;model-name&gt;).
+1 -1
@@ -8,7 +8,7 @@ description: OpenHands uses LiteLLM to make calls to chat models on Groq. You ca
When running OpenHands, you'll need to set the following in the OpenHands UI through the Settings under the `LLM` tab:
- `LLM Provider` to `Groq`
- `LLM Model` to the model you will be using. [Visit here to see the list of
models that Groq hosts](https://console.groq.com/docs/models). If the model is not in the list,
models that Groq hosts](https://console.groq.com/docs/models). If the model is not in the list,
enable `Advanced` options, and enter it in `Custom Model` (e.g. groq/&lt;model-name&gt; like `groq/llama3-70b-8192`).
- `API key` to your Groq API key. To find or create your Groq API Key, [see here](https://console.groq.com/keys).
+1 -1
@@ -16,7 +16,7 @@ To use LiteLLM proxy with OpenHands, you need to:
## Supported Models
The supported models depend on your LiteLLM proxy configuration. OpenHands supports any model that your LiteLLM proxy
The supported models depend on your LiteLLM proxy configuration. OpenHands supports any model that your LiteLLM proxy
is configured to handle.
Refer to your LiteLLM proxy configuration for the list of available models and their names.
+4 -4
@@ -54,25 +54,25 @@ Check [the installation guide](/usage/local-setup) to make sure you have all the
export LMSTUDIO_MODEL_NAME="imported-models/uncategorized/devstralq4_k_m.gguf" # <- Replace this with the model name you copied from LMStudio
export LMSTUDIO_URL="http://host.docker.internal:1234" # <- Replace this with the port from LMStudio
docker pull docker.all-hands.dev/all-hands-ai/runtime:0.42-nikolaik
docker pull docker.all-hands.dev/all-hands-ai/runtime:0.44-nikolaik
mkdir -p ~/.openhands-state && echo '{"language":"en","agent":"CodeActAgent","max_iterations":null,"security_analyzer":null,"confirmation_mode":false,"llm_model":"lm_studio/'$LMSTUDIO_MODEL_NAME'","llm_api_key":"dummy","llm_base_url":"'$LMSTUDIO_URL/v1'","remote_runtime_resource_factor":null,"github_token":null,"enable_default_condenser":true,"user_consents_to_analytics":true}' > ~/.openhands-state/settings.json
docker run -it --rm --pull=always \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.42-nikolaik \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.44-nikolaik \
-e LOG_ALL_EVENTS=true \
-v /var/run/docker.sock:/var/run/docker.sock \
-v ~/.openhands-state:/.openhands-state \
-p 3000:3000 \
--add-host host.docker.internal:host-gateway \
--name openhands-app \
docker.all-hands.dev/all-hands-ai/openhands:0.42
docker.all-hands.dev/all-hands-ai/openhands:0.44
```
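The one-line `echo` in the block above is hard to read and easy to break with shell quoting. As an alternative sketch, the same settings file could be written from Python. The keys and values mirror the JSON in that command; the function names `build_settings` and `write_settings` are hypothetical, and the script assumes `LMSTUDIO_MODEL_NAME` and `LMSTUDIO_URL` are exported as shown.

```python
import json
import os
from pathlib import Path


def build_settings(model_name: str, base_url: str) -> dict:
    """Build the same settings the echo command writes to settings.json."""
    return {
        "language": "en",
        "agent": "CodeActAgent",
        "max_iterations": None,
        "security_analyzer": None,
        "confirmation_mode": False,
        "llm_model": f"lm_studio/{model_name}",
        "llm_api_key": "dummy",
        "llm_base_url": f"{base_url}/v1",
        "remote_runtime_resource_factor": None,
        "github_token": None,
        "enable_default_condenser": True,
        "user_consents_to_analytics": True,
    }


def write_settings(state_dir: Path, settings: dict) -> Path:
    """Create the state directory if needed and write settings.json into it."""
    state_dir.mkdir(parents=True, exist_ok=True)
    target = state_dir / "settings.json"
    target.write_text(json.dumps(settings), encoding="utf-8")
    return target


if __name__ == "__main__":
    model = os.environ.get("LMSTUDIO_MODEL_NAME")
    url = os.environ.get("LMSTUDIO_URL")
    if model and url:
        state_dir = Path.home() / ".openhands-state"
        print(write_settings(state_dir, build_settings(model, url)))
    else:
        print("Set LMSTUDIO_MODEL_NAME and LMSTUDIO_URL first")
```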
Once your server is running, you can visit `http://localhost:3000` in your browser to use OpenHands with the local Devstral model:
```
Digest: sha256:e72f9baecb458aedb9afc2cd5bc935118d1868719e55d50da73190d3a85c674f
Status: Image is up to date for docker.all-hands.dev/all-hands-ai/openhands:0.42
Status: Image is up to date for docker.all-hands.dev/all-hands-ai/openhands:0.44
Starting OpenHands...
Running OpenHands as root
14:22:13 - openhands:INFO: server_config.py:50 - Using config class None
+1 -1
@@ -9,6 +9,6 @@ When running OpenHands, you'll need to set the following in the OpenHands UI thr
* `LLM Provider` to `OpenRouter`
* `LLM Model` to the model you will be using.
[Visit here to see a full list of OpenRouter models](https://openrouter.ai/models).
If the model is not in the list, enable `Advanced` options, and enter it in
If the model is not in the list, enable `Advanced` options, and enter it in
`Custom Model` (e.g. openrouter/&lt;model-name&gt; like `openrouter/anthropic/claude-3.5-sonnet`).
* `API Key` to your OpenRouter API key.
+8 -3
@@ -10,6 +10,7 @@ description: Getting started with running OpenHands on your own.
- MacOS with [Docker Desktop support](https://docs.docker.com/desktop/setup/install/mac-install/#system-requirements)
- Linux
- Windows with [WSL](https://learn.microsoft.com/en-us/windows/wsl/install) and [Docker Desktop support](https://docs.docker.com/desktop/setup/install/windows-install/#system-requirements)
- Windows without WSL (see [Windows Without WSL Guide](/usage/windows-without-wsl))
A system with a modern processor and a minimum of **4GB RAM** is recommended to run OpenHands.
@@ -55,6 +56,10 @@ A system with a modern processor and a minimum of **4GB RAM** is recommended to
The docker command below to start the app must be run inside the WSL terminal.
</Note>
**Alternative: Windows without WSL**
If you prefer to run OpenHands on Windows without WSL or Docker, see our [Windows Without WSL Guide](/usage/windows-without-wsl).
</Accordion>
</AccordionGroup>
@@ -62,17 +67,17 @@ A system with a modern processor and a minimum of **4GB RAM** is recommended to
### Start the App
```bash
docker pull docker.all-hands.dev/all-hands-ai/runtime:0.42-nikolaik
docker pull docker.all-hands.dev/all-hands-ai/runtime:0.44-nikolaik
docker run -it --rm --pull=always \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.42-nikolaik \
-e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.44-nikolaik \
-e LOG_ALL_EVENTS=true \
-v /var/run/docker.sock:/var/run/docker.sock \
-v ~/.openhands-state:/.openhands-state \
-p 3000:3000 \
--add-host host.docker.internal:host-gateway \
--name openhands-app \
docker.all-hands.dev/all-hands-ai/openhands:0.42
docker.all-hands.dev/all-hands-ai/openhands:0.44
```
You'll find OpenHands running at http://localhost:3000!
+1 -1
@@ -5,7 +5,7 @@ description: Organizations and users can define microagents that apply to all re
## Usage
These microagents can be [any type of microagent](./microagents-overview#microagent-types) and will be loaded
These microagents can be [any type of microagent](./microagents-overview#microagent-types) and will be loaded
accordingly. However, they are applied to all repositories belonging to the organization or user.
Add a `.openhands` repository under the organization or user and create a `microagents` directory and place the
+1 -1
@@ -15,7 +15,7 @@ Before using the Local Runtime, ensure that:
1. You can run OpenHands using the [Development workflow](https://github.com/All-Hands-AI/OpenHands/blob/main/Development.md).
2. For Linux and Mac, tmux is available on your system.
3. For Windows, PowerShell is available on your system.
- Only [CLI mode](../how-to/cli-mode) and [headless mode](../how-to/headless-mode) are supported in Windows with Local Runtime.
- Only [CLI mode](../how-to/cli-mode) and [headless mode](../how-to/headless-mode) are supported in Windows with Local Runtime.
## Configuration
+200
@@ -0,0 +1,200 @@
---
title: Windows Without WSL
description: Running OpenHands GUI on Windows without using WSL or Docker
---
# Running OpenHands GUI on Windows Without WSL
This guide provides step-by-step instructions for running OpenHands on a Windows machine without using WSL or Docker.
## Prerequisites
1. **Windows 10/11** - A modern Windows operating system
2. **PowerShell 7+** - While Windows PowerShell comes pre-installed on Windows 10/11, PowerShell 7+ is strongly recommended to avoid compatibility issues (see Troubleshooting section for "System.Management.Automation" errors)
3. **.NET Core Runtime** - Required for the PowerShell integration via pythonnet
4. **Python 3.12 or 3.13** - Required; Python 3.14 is not supported due to pythonnet compatibility
5. **Git** - For cloning the repository and version control
6. **Node.js and npm** - For running the frontend
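The Python version requirement above can be checked before going any further. The following is a small sketch, not part of OpenHands; the names `SUPPORTED_MINORS` and `is_supported` are hypothetical, and the supported set simply restates the prerequisite list.

```python
import sys

# Minor versions of Python 3 listed as supported in the prerequisites above.
SUPPORTED_MINORS = {12, 13}


def is_supported(version) -> bool:
    """Return True if (major, minor, ...) is one of the supported versions."""
    major, minor = version[0], version[1]
    return major == 3 and minor in SUPPORTED_MINORS


if __name__ == "__main__":
    found = f"{sys.version_info.major}.{sys.version_info.minor}"
    status = "supported" if is_supported(sys.version_info) else "not supported"
    print(f"Python {found} is {status} for this setup")
```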
## Step 1: Install Required Software
1. **Install Python 3.12 or 3.13**
- Download Python 3.12.x or 3.13.x from [python.org](https://www.python.org/downloads/)
- During installation, check "Add Python to PATH"
- Verify installation by opening PowerShell and running:
```powershell
python --version
```
2. **Install PowerShell 7**
- Download and install PowerShell 7 from the [official PowerShell GitHub repository](https://github.com/PowerShell/PowerShell/releases)
- Choose the MSI installer appropriate for your system (x64 for most modern computers)
- Run the installer with default options
- Verify installation by opening a new terminal and running:
```powershell
pwsh --version
```
- Using PowerShell 7 (pwsh) instead of Windows PowerShell will help avoid "System.Management.Automation" errors
3. **Install .NET Core Runtime**
- Download and install the .NET Core Runtime from [Microsoft's .NET download page](https://dotnet.microsoft.com/download)
- Choose the latest .NET Core Runtime (not SDK)
- Verify installation by opening PowerShell and running:
```powershell
dotnet --info
```
- This step is required for the PowerShell integration via pythonnet. Without it, OpenHands will fall back to a more limited PowerShell implementation.
4. **Install Git**
- Download Git from [git-scm.com](https://git-scm.com/download/win)
- Use default installation options
- Verify installation:
```powershell
git --version
```
5. **Install Node.js and npm**
- Download Node.js from [nodejs.org](https://nodejs.org/) (LTS version recommended)
- During installation, accept the default options which will install npm as well
- Verify installation:
```powershell
node --version
npm --version
```
6. **Install Poetry**
- Open PowerShell as Administrator and run:
```powershell
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python -
```
- Add Poetry to your PATH (this change applies to the current PowerShell session only):
```powershell
$env:Path += ";$env:APPDATA\Python\Scripts"
```
- Verify installation:
```powershell
poetry --version
```
## Step 2: Clone and Set Up OpenHands
1. **Clone the Repository**
```powershell
git clone https://github.com/All-Hands-AI/OpenHands.git
cd OpenHands
```
2. **Install Dependencies**
```powershell
poetry install
```
This will install all required dependencies, including:
- pythonnet - Required for Windows PowerShell integration
- All other OpenHands dependencies
## Step 3: Run OpenHands
1. **Build the Frontend**
```powershell
cd frontend
npm install
npm run build
cd ..
```
This will build the frontend files that the backend will serve.
2. **Start the Backend**
```powershell
# Make sure to use PowerShell 7 (pwsh) instead of Windows PowerShell
pwsh
$env:RUNTIME="local"; poetry run uvicorn openhands.server.listen:app --host 0.0.0.0 --port 3000 --reload --reload-exclude "./workspace"
```
This will start the OpenHands app using the local runtime with PowerShell integration, available at `localhost:3000`.
> **Note**: If you encounter a `RuntimeError: Directory './frontend/build' does not exist` error, make sure you've built the frontend first using the command above.
> **Important**: Using PowerShell 7 (pwsh) instead of Windows PowerShell is recommended to avoid "System.Management.Automation" errors. If you encounter this error, see the Troubleshooting section below.
3. **Alternatively, Run the Frontend in Development Mode (in a separate PowerShell window)**
```powershell
cd frontend
npm run dev
```
4. **Access the OpenHands GUI**
Open your browser and navigate to:
```
http://localhost:3000
```
> **Note**: If you're running the frontend in development mode (using `npm run dev`), use port 3001 instead: `http://localhost:3001`
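The `./frontend/build` error mentioned in step 2 can be caught before starting the backend. The sketch below only checks that the directory exists; the function name `frontend_is_built` is hypothetical, and it assumes the script is run from the repository root.

```python
from pathlib import Path


def frontend_is_built(repo_root: Path) -> bool:
    """Return True if the built frontend directory exists under repo_root."""
    return (repo_root / "frontend" / "build").is_dir()


if __name__ == "__main__":
    if frontend_is_built(Path.cwd()):
        print("frontend/build found; the backend can serve the GUI")
    else:
        print("frontend/build is missing; run `npm run build` in ./frontend first")
```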
## Limitations on Windows
When running OpenHands on Windows without WSL or Docker, be aware of the following limitations:
1. **Browser Tool Not Supported**: The browser tool is not currently supported on Windows.
2. **.NET Core Requirement**: The PowerShell integration requires .NET Core Runtime to be installed. If .NET Core is not available, OpenHands will automatically fall back to a more limited PowerShell implementation with reduced functionality.
3. **Interactive Shell Commands**: Some interactive shell commands may not work as expected. The PowerShell session implementation has limitations compared to the bash session used on Linux/macOS.
4. **Path Handling**: Windows uses backslashes (`\`) in paths, which may require adjustments when working with code examples designed for Unix-like systems.
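For the path handling limitation, Python's standard `pathlib` module can convert between the two separator styles without manual string replacement. This is a general illustration rather than anything OpenHands-specific; the helper names `to_windows` and `to_posix` are hypothetical.

```python
from pathlib import PureWindowsPath


def to_windows(path: str) -> str:
    """Render a path with backslashes, accepting either separator as input."""
    return str(PureWindowsPath(path))


def to_posix(path: str) -> str:
    """Render a Windows-style path with forward slashes."""
    return PureWindowsPath(path).as_posix()


if __name__ == "__main__":
    print(to_windows("src/app/main.py"))
    print(to_posix("src\\app\\main.py"))
```

The "pure" path classes do not touch the filesystem, so the same code behaves identically on Windows, Linux, and macOS.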
## Troubleshooting
### "System.Management.Automation" Not Found Error
If you encounter an error message stating that "System.Management.Automation" was not found, this typically indicates that you have a minimal version of PowerShell installed or that the .NET components required for PowerShell integration are missing.
> **IMPORTANT**: This error is most commonly caused by using the built-in Windows PowerShell (powershell.exe) instead of PowerShell 7 (pwsh.exe). Even if you installed PowerShell 7 during the prerequisites, you may still be using the older Windows PowerShell by default.
To resolve this issue:
1. **Install the latest version of PowerShell 7** from the official Microsoft repository:
- Visit [https://github.com/PowerShell/PowerShell/releases](https://github.com/PowerShell/PowerShell/releases)
- Download and install the latest MSI package for your system architecture (x64 for most systems)
- During installation, ensure you select the following options:
- "Add PowerShell to PATH environment variable"
- "Register Windows PowerShell 7 as the default shell"
- "Enable PowerShell remoting"
- The installer will place PowerShell 7 in `C:\Program Files\PowerShell\7` by default
2. **Restart your terminal or command prompt** to ensure the new PowerShell is available
3. **Verify the installation** by running:
```powershell
pwsh --version
```
You should see output indicating PowerShell 7.x.x
4. **Run OpenHands using PowerShell 7** instead of Windows PowerShell:
```powershell
pwsh
cd path\to\openhands
$env:RUNTIME="local"; poetry run uvicorn openhands.server.listen:app --host 0.0.0.0 --port 3000 --reload --reload-exclude "./workspace"
```
> **Note**: Make sure you're explicitly using `pwsh` (PowerShell 7) and not `powershell` (Windows PowerShell). The command prompt or terminal title should say "PowerShell 7" rather than just "Windows PowerShell".
5. **If the issue persists**, ensure that you have the .NET Runtime installed:
- Download and install the latest .NET Runtime from [Microsoft's .NET download page](https://dotnet.microsoft.com/download)
- Choose ".NET Runtime" (not SDK) version 6.0 or later
- After installation, verify it's properly installed by running:
```powershell
dotnet --info
```
- Restart your computer after installation
- Try running OpenHands again
6. **Ensure that the .NET Framework is properly installed** on your system:
- Go to Control Panel > Programs > Programs and Features > Turn Windows features on or off
- Make sure ".NET Framework 4.8 Advanced Services" is enabled
- Click OK and restart if prompted
This error occurs because OpenHands uses the pythonnet package to interact with PowerShell, which requires the System.Management.Automation assembly from the .NET framework. A minimal PowerShell installation or older Windows PowerShell (rather than PowerShell 7+) might not include all the necessary components for this integration.
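A quick way to check which shell is available before starting the server is to look for the `pwsh` executable on `PATH`. The following is a minimal sketch using only the Python standard library; the function name `find_powershell7` is hypothetical, and the check says nothing about whether the required .NET components are present.

```python
import shutil
from typing import Optional


def find_powershell7() -> Optional[str]:
    """Return the full path of `pwsh` if it is on PATH, otherwise None."""
    return shutil.which("pwsh")


path = find_powershell7()
if path is None:
    print("pwsh not found on PATH; install PowerShell 7 or update PATH")
else:
    print(f"PowerShell 7 executable: {path}")
```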
@@ -144,7 +144,7 @@ if __name__ == '__main__':
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
# modify_params must be False for evaluation purpose, for reproducibility and accurancy of results
# modify_params must be False for evaluation purpose, for reproducibility and accuracy of results
llm_config.modify_params = False
if llm_config is None:
@@ -223,7 +223,7 @@ if __name__ == '__main__':
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
# modify_params must be False for evaluation purpose, for reproducibility and accurancy of results
# modify_params must be False for evaluation purpose, for reproducibility and accuracy of results
llm_config.modify_params = False
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
@@ -2,6 +2,8 @@
This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)).
**UPDATE (6/15/2025): We now support running SWE-bench-Live evaluation (see the paper [here](https://arxiv.org/abs/2505.23419))! For how to run it, check out [this README](./SWE-bench-Live.md).**
**UPDATE (5/26/2025): We now support running interactive SWE-Bench evaluation (see the paper [here](https://arxiv.org/abs/2502.13069))! For how to run it, check out [this README](./SWE-Interact.md).**
**UPDATE (4/8/2025): We now support running SWT-Bench evaluation! For more details, check out [the corresponding section](#SWT-Bench-Evaluation).**
@@ -0,0 +1,65 @@
# SWE-bench-Live
<p align="center">
<a href="https://arxiv.org/abs/2505.23419">📃 Paper</a>
<a href="https://huggingface.co/SWE-bench-Live" >🤗 HuggingFace</a>
<a href="https://SWE-bench-Live.github.io" >📊 Leaderboard</a>
</p>
SWE-bench-Live is a live benchmark for issue resolving, providing a dataset that contains the latest issue tasks. This document explains how to run the evaluation of OpenHands on SWE-bench-Live.
Since SWE-bench-Live has an almost identical setting to SWE-bench, you only need to change the dataset name to `SWE-bench-Live/SWE-bench-Live`; everything else is essentially the same as running on SWE-bench.
## Setting Up
Set up the development environment and configure your LLM provider by following the [README](README.md).
## Running Inference
Use the same script, but change the dataset name to `SWE-bench-Live/SWE-bench-Live` and select the split (either `lite` or `full`). The lite split contains 300 instances from the past six months, while the full split includes 1,319 instances created after 2024.
```shell
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]
```
In the original SWE-bench-Live paper, `max_iterations` is set to 100, which corresponds to the `[max_iter]` argument:
```shell
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.your_llm HEAD CodeActAgent 300 100 3 SWE-bench-Live/SWE-bench-Live lite
```
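If you launch runs from Python rather than typing the command by hand, the positional arguments can be assembled programmatically. The sketch below only builds the command line shown above; the function name `build_run_infer_command` and its defaults are illustrative assumptions, not an OpenHands API.

```python
import shlex

RUN_INFER = "./evaluation/benchmarks/swe_bench/scripts/run_infer.sh"


def build_run_infer_command(
    model_config: str,
    git_version: str = "HEAD",
    agent: str = "CodeActAgent",
    eval_limit: int = 300,
    max_iter: int = 100,
    num_workers: int = 3,
    dataset: str = "SWE-bench-Live/SWE-bench-Live",
    dataset_split: str = "lite",
) -> list:
    """Return the argument list in the positional order run_infer.sh expects."""
    return [
        RUN_INFER,
        model_config,
        git_version,
        agent,
        str(eval_limit),
        str(max_iter),
        str(num_workers),
        dataset,
        dataset_split,
    ]


command = build_run_infer_command("llm.your_llm")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root.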
## Evaluating Results
After OpenHands generates patch results for each issue, we evaluate the results using the [SWE-bench-Live evaluation harness](https://github.com/microsoft/SWE-bench-Live).
Convert the output to the prediction format used by SWE benchmarks:
```shell
# You can find output.jsonl in evaluation/evaluation_outputs
python evaluation/benchmarks/swe_bench/scripts/live/convert.py --output_jsonl [path/to/evaluation/output.jsonl] > preds.jsonl
```
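The conversion itself is handled by `convert.py`, whose source is not shown here. As a rough sketch of the idea, the snippet below assumes that each line of `output.jsonl` carries an `instance_id` and a `test_result.git_patch` field, and that a prediction uses the SWE-bench keys `instance_id`, `model_name_or_path`, and `model_patch`. The helper `to_prediction` and the sample record are hypothetical.

```python
import json


def to_prediction(output_line: str, model_name: str = "openhands") -> str:
    """Map one output.jsonl line to one prediction line (assumed field names)."""
    record = json.loads(output_line)
    # Fall back to an empty patch if the run produced no diff.
    patch = (record.get("test_result") or {}).get("git_patch") or ""
    prediction = {
        "instance_id": record["instance_id"],
        "model_name_or_path": model_name,
        "model_patch": patch,
    }
    return json.dumps(prediction)


sample = json.dumps(
    {
        "instance_id": "demo__repo-1",
        "test_result": {"git_patch": "diff --git a/x.py b/x.py\n"},
    }
)
print(to_prediction(sample))
```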
Please refer to the original [SWE-bench-Live repository](https://github.com/microsoft/SWE-bench-Live) to set up the evaluation harness and use the provided scripts to generate the evaluation report:
```shell
python -m swebench.harness.run_evaluation \
--dataset_name SWE-bench-Live/SWE-bench-Live \
--split lite \
--namespace starryzhang \
--predictions_path preds.jsonl \
--max_workers 10 \
--run_id openhands
```
## Citation
```bibtex
@article{zhang2025swebenchgoeslive,
title={SWE-bench Goes Live!},
author={Linghao Zhang and Shilin He and Chaoyun Zhang and Yu Kang and Bowen Li and Chengxing Xie and Junhao Wang and Maoquan Wang and Yufan Huang and Shengyu Fu and Elsie Nallipogu and Qingwei Lin and Yingnong Dang and Saravan Rajmohan and Dongmei Zhang},
journal={arXiv preprint arXiv:2505.23419},
year={2025}
}
```
@@ -0,0 +1,80 @@
from typing import Any
import pandas as pd
from evaluation.utils.shared import assert_and_raise
from openhands.core.logger import openhands_logger as logger
from openhands.events.action import CmdRunAction
from openhands.events.observation import (
CmdOutputObservation,
ErrorObservation,
)
from openhands.runtime.base import Runtime
from openhands.utils.shutdown_listener import sleep_if_should_continue
def complete_runtime(
runtime: Runtime,
instance: pd.Series,
) -> dict[str, Any]:
"""Complete the runtime and export the git patch for SWE-bench-Live."""
logger.info('-' * 30)
logger.info('BEGIN Runtime Completion Fn')
logger.info('-' * 30)
obs: CmdOutputObservation
workspace_dir_name = instance.instance_id
action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
action.set_hard_timeout(600)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert_and_raise(
isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
f'Failed to cd to /workspace/{workspace_dir_name}: {str(obs)}',
)
action = CmdRunAction(command='git config --global core.pager ""')
action.set_hard_timeout(600)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert_and_raise(
isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
f'Failed to git config --global core.pager "": {str(obs)}',
)
action = CmdRunAction(command='git add -A')
action.set_hard_timeout(600)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert_and_raise(
isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
f'Failed to git add -A: {str(obs)}',
)
n_retries = 0
git_patch = None
while n_retries < 5:
action = CmdRunAction(
command=f'git diff --no-color --cached {instance["base_commit"]}',
)
action.set_hard_timeout(100 + 10 * n_retries)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
n_retries += 1
if isinstance(obs, CmdOutputObservation):
if obs.exit_code == 0:
git_patch = obs.content.strip()
break
else:
logger.info('Failed to get git diff, retrying...')
sleep_if_should_continue(10)
elif isinstance(obs, ErrorObservation):
logger.error(f'Error occurred: {obs.content}. Retrying...')
sleep_if_should_continue(10)
else:
assert_and_raise(False, f'Unexpected observation type: {str(obs)}')
assert_and_raise(git_patch is not None, 'Failed to get git diff (None)')
logger.info('-' * 30)
logger.info('END Runtime Completion Fn')
logger.info('-' * 30)
return {'git_patch': git_patch}
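The retry loop above raises the hard timeout by 10 seconds on every attempt, starting from 100 seconds. A minimal sketch of that schedule, using a hypothetical helper `hard_timeouts` that is not part of the commit:

```python
def hard_timeouts(max_retries: int = 5, base: int = 100, step: int = 10) -> list:
    """Timeout in seconds for each attempt, mirroring 100 + 10 * n_retries."""
    return [base + step * attempt for attempt in range(max_retries)]


print(hard_timeouts())
```

With the defaults this yields five attempts, the last one allowed 140 seconds.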
@@ -1,4 +1,4 @@
TASK_INSTRUECTION="""
TASK_INSTRUECTION = """
Given the following GitHub problem description, your objective is to localize the specific files, classes or functions, and lines of code that need modification or contain key information to resolve the issue.
Follow these steps to localize the issue:
@@ -66,4 +66,4 @@ FAKE_USER_MSG_FOR_LOC = (
'Verify that you have carefully analyzed the impact of the found locations on the repository, especially their dependencies. '
'If you think you have solved the task, please send your final answer (including the former answer and reranking) to user through message and then call `finish` to finish.\n'
'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP.\n'
)
)
@@ -0,0 +1,65 @@
<uploaded_files>
/workspace/{{ workspace_dir_name }}
</uploaded_files>
I've uploaded a python code repository in the directory {{ workspace_dir_name }}. Consider the following issue description:
<issue_description>
{{ instance.problem_statement }}
</issue_description>
Can you help me implement the necessary changes to the repository so that the requirements specified in the <issue_description> are met?
I've already taken care of all changes to any of the test files described in the <issue_description>. This means you DON'T have to modify the testing logic or any of the tests in any way!
Also the development Python environment is already set up for you (i.e., all dependencies already installed), so you don't need to install other packages.
Your task is to make the minimal changes to non-test files in the /workspace/{{ workspace_dir_name }} directory to ensure the <issue_description> is satisfied.
Follow these phases to resolve the issue:
Phase 1. READING: read the problem and reword it in clearer terms
1.1 If there are code or config snippets, express in words any best practices or conventions in them.
1.2 Highlight message errors, method names, variables, file names, stack traces, and technical details.
1.3 Explain the problem in clear terms.
1.4 Enumerate the steps to reproduce the problem.
1.5 Highlight any best practices to take into account when testing and fixing the issue.
Phase 2. RUNNING: install and run the tests on the repository
2.1 Follow the readme
2.2 Install the environment and anything needed
2.3 Iterate and figure out how to run the tests
Phase 3. EXPLORATION: find the files that are related to the problem and possible solutions
3.1 Use `grep` to search for relevant methods, classes, keywords and error messages.
3.2 Identify all files related to the problem statement.
3.3 Propose the methods and files to fix the issue and explain why.
3.4 From the possible file locations, select the most likely location to fix the issue.
Phase 4. TEST CREATION: before implementing any fix, create a script to reproduce and verify the issue.
4.1 Look at existing test files in the repository to understand the test format/structure.
4.2 Create a minimal reproduction script that reproduces the located issue.
4.3 Run the reproduction script to confirm you are reproducing the issue.
4.4 Adjust the reproduction script as necessary.
Phase 5. FIX ANALYSIS: state clearly the problem and how to fix it
5.1 State clearly what the problem is.
5.2 State clearly where the problem is located.
5.3 State clearly how the test reproduces the issue.
5.4 State clearly the best practices to take into account in the fix.
5.5 State clearly how to fix the problem.
Phase 6. FIX IMPLEMENTATION: Edit the source code to implement your chosen solution.
6.1 Make minimal, focused changes to fix the issue.
Phase 7. VERIFICATION: Test your implementation thoroughly.
7.1 Run your reproduction script to verify the fix works.
7.2 Add edge cases to your test script to ensure comprehensive coverage.
7.3 Run existing tests related to the modified code to ensure you haven't broken anything.
Phase 8. FINAL REVIEW: Carefully re-read the problem description and compare your changes with the base commit {{ instance.base_commit }}.
8.1 Ensure you've fully addressed all requirements.
8.2 Run any tests in the repository related to:
8.2.1 The issue you are fixing
8.2.2 The files you modified
8.2.3 The functions you changed
8.3 If any tests fail, revise your implementation until all tests pass
Be thorough in your exploration, testing, and reasoning. It's fine if your thinking process is lengthy - quality and completeness are more important than brevity.
@@ -0,0 +1,65 @@
<uploaded_files>
/workspace/{{ workspace_dir_name }}
</uploaded_files>
I've uploaded a python code repository in the directory {{ workspace_dir_name }}. Consider the following issue description:
<issue_description>
{{ instance.problem_statement }}
</issue_description>
Can you help me implement the necessary changes to the repository so that the requirements specified in the <issue_description> are met?
I've already taken care of all changes to any of the test files described in the <issue_description>. This means you DON'T have to modify the testing logic or any of the tests in any way!
Also the development Python environment is already set up for you (i.e., all dependencies already installed), so you don't need to install other packages.
Your task is to make the minimal changes to non-test files in the /workspace/{{ workspace_dir_name }} directory to ensure the <issue_description> is satisfied.
Follow these phases to resolve the issue:
Phase 1. READING: read the problem and reword it in clearer terms
1.1 If there are code or config snippets, express in words any best practices or conventions in them.
1.2 Highlight message errors, method names, variables, file names, stack traces, and technical details.
1.3 Explain the problem in clear terms.
1.4 Enumerate the steps to reproduce the problem.
1.5 Highlight any best practices to take into account when testing and fixing the issue.
Phase 2. RUNNING: install and run the tests on the repository
2.1 Follow the readme
2.2 Install the environment and anything needed
2.3 Iterate and figure out how to run the tests
Phase 3. EXPLORATION: find the files that are related to the problem and possible solutions
3.1 Use `grep` to search for relevant methods, classes, keywords and error messages.
3.2 Identify all files related to the problem statement.
3.3 Propose the methods and files to fix the issue and explain why.
3.4 From the possible file locations, select the most likely location to fix the issue.
Phase 4. TEST CREATION: before implementing any fix, create a script to reproduce and verify the issue.
4.1 Look at existing test files in the repository to understand the test format/structure.
4.2 Create a minimal reproduction script that reproduces the located issue.
4.3 Run the reproduction script to confirm you are reproducing the issue.
4.4 Adjust the reproduction script as necessary.
Phase 5. FIX ANALYSIS: state clearly the problem and how to fix it
5.1 State clearly what the problem is.
5.2 State clearly where the problem is located.
5.3 State clearly how the test reproduces the issue.
5.4 State clearly the best practices to take into account in the fix.
5.5 State clearly how to fix the problem.
Phase 6. FIX IMPLEMENTATION: Edit the source code to implement your chosen solution.
6.1 Make minimal, focused changes to fix the issue.
Phase 7. VERIFICATION: Test your implementation thoroughly.
7.1 Run your reproduction script to verify the fix works.
7.2 Add edge cases to your test script to ensure comprehensive coverage.
7.3 Run existing tests related to the modified code to ensure you haven't broken anything.
Phase 8. FINAL REVIEW: Carefully re-read the problem description and compare your changes with the base commit {{ instance.base_commit }}.
8.1 Ensure you've fully addressed all requirements.
8.2 Run any tests in the repository related to:
8.2.1 The issue you are fixing
8.2.2 The files you modified
8.2.3 The functions you changed
8.3 If any tests fail, revise your implementation until all tests pass
Be thorough in your exploration, testing, and reasoning. It's fine if your thinking process is lengthy - quality and completeness are more important than brevity.
@@ -0,0 +1,45 @@
# Task: Fix Issue in Python Repository
## Repository Context
You are provided with a Python code repository that contains an issue requiring your attention. The repository is located in a sandboxed environment, and you have access to the codebase to implement the necessary changes.
The code repository is located at: `/workspace/{{ workspace_dir_name }}`
(This path is provided for context; use file system tools to confirm paths before access).
## Goal
Your goal is to fix the issue described in the **Issue Description** section below. Implement the necessary changes to **non-test files only** within the repository, ensuring that **all relevant tests pass** after your changes.
## Key Requirements & Constraints
1. **Understand the problem** very well: it is a bug report, and you know humans don't always write good descriptions. Explore the codebase to understand the related code and the problem in depth. It is possible that the solution needs to be a bit more extensive than just the stated text. Don't exaggerate though: don't do unrelated refactoring, but also don't interpret the description too strictly.
2. **Focus on the issues:** Implement the fix focusing on non-test files related to the issue.
3. **Environment Ready:** The Python environment is pre-configured with all dependencies. Do not install packages.
4. **Mandatory Testing Procedure:**
* **Create Test to Reproduce the Issue:** *Before* implementing any fix, you MUST create a *new test* (separate from existing tests) that specifically reproduces the issue.
* Take existing tests as example to understand the testing format/structure.
* Enhance this test with edge cases.
* Run this test to confirm reproduction.
* **Verify Fix:** After implementing the fix, run your test again to verify the issue is resolved.
* **Identify ALL Relevant Tests:** You MUST perform a **dedicated search and analysis** to identify **all** existing unit tests potentially affected by your changes. This includes:
* Tests in the same module/directory as the changed files (e.g., `tests/` subdirectories).
* Tests explicitly importing or using the modified code/classes/functions.
* Tests mentioned in the issue description or related documentation.
* Tests covering functionalities that *depend on* the modified code (analyze callers/dependencies if necessary).
**If you cannot confidently identify a specific subset, you MUST identify and plan to run the entire test suite for the modified application or module(s). State your identified test scope clearly.**
* **Run Identified Relevant Tests:** You MUST execute the **complete set** of relevant existing unit tests you identified in the previous step. Ensure you are running the *correct and comprehensive set* of tests. You MUST NOT modify these existing tests.
* **Final Check & Verification:** Before finishing, ensure **all** identified relevant existing tests pass. **Explicitly confirm that you have considered potential omissions in your test selection and believe the executed tests comprehensively cover the impact of your changes.** Failing to identify and run the *complete* relevant set constitutes a failure. If any identified tests fail, revise your fix. Passing all relevant tests is the primary measure of success.
5. **Defensive Programming:** Actively practice defensive programming: anticipate and handle potential edge cases, unexpected inputs, and different ways the affected code might be called **to ensure the fix works reliably and allows relevant tests to pass.** Analyze the potential impact on other parts of the codebase.
6. **Final Review:** Compare your solution against the original issue and the base commit ({{ instance.base_commit }}) to ensure completeness and test passage.
## General Workflow Guidance
* Prioritize understanding the problem, exploring the code, planning your fix, implementing it carefully using the required diff format, and **thoroughly testing** according to the **Mandatory Testing Procedure**.
* Consider trade-offs between different solutions. The goal is a **robust change that makes the relevant tests pass.** Quality, correctness, and reliability are key.
* Actively practice defensive programming: anticipate and handle potential edge cases, unexpected inputs, and different ways the affected code might be called **to ensure the fix works reliably and allows relevant tests to pass.** Analyze the potential impact on other parts of the codebase.
* IMPORTANT: Your solution will be tested by additional hidden tests, so do not assume the task is complete just because visible tests pass! Refine the solution until you are confident that it is robust and comprehensive according to the **Defensive Programming** requirement.
## Final Note
Be thorough in your exploration, testing, and reasoning. It's fine if your thinking process is lengthy - quality and completeness are more important than brevity.
## Issue Description
{{ instance.problem_statement }}
@@ -0,0 +1,80 @@
You will be tasked to fix an issue from an open-source repository.
Your thinking should be thorough and so it's fine if it's very long. You can think step by step before and after each action you decide to take.
You MUST iterate and keep going until the problem is solved.
You already have everything you need to solve this problem in the /workspace/{{ workspace_dir_name }} folder, even without an internet connection. I want you to fully solve this autonomously before coming back to me.
Only terminate your turn when you are sure that the problem is solved. Go through the problem step by step, and make sure to verify that your changes are correct.
NEVER end your turn without having solved the problem, and when you say you are going to make a tool call, make sure you ACTUALLY make the tool call, instead of ending your turn.
THE PROBLEM CAN DEFINITELY BE SOLVED WITHOUT THE INTERNET.
Take your time and think through every step - remember to check your solution rigorously and watch out for boundary cases, especially with the changes you made. Your solution must be perfect. If not, continue working on it.
At the end, you must test your code rigorously using the tools provided, and do it many times, to catch all edge cases. If it is not robust, iterate more and make it perfect. Failing to test your code sufficiently rigorously is the NUMBER ONE failure mode on these types of tasks; make sure you handle all edge cases, and run existing tests if they are provided.
You MUST plan extensively before each function call, and reflect extensively on the outcomes of the previous function calls. DO NOT do this entire process by making function calls only, as this can impair your ability to solve the problem and think insightfully.
# Workflow
## High-Level Problem Solving Strategy
1. Understand the problem deeply. Carefully read the issue and think critically about what is required.
2. Investigate the codebase. Explore relevant files, search for key functions, and gather context.
3. Develop a clear, step-by-step plan. Break down the fix into manageable, incremental steps.
4. Implement the fix incrementally. Make small, testable code changes.
5. Debug as needed. Use debugging techniques to isolate and resolve issues.
6. Test frequently. Run tests after each change to verify correctness.
7. Iterate until the root cause is fixed and all tests pass.
8. Reflect and validate comprehensively. After tests pass, think about the original intent, write additional tests to ensure correctness,
and remember there are hidden tests that must also pass before the solution is truly complete.
Refer to the detailed sections below for more information on each step.
## 1. Deeply Understand the Problem
Carefully read the issue and think hard about a plan to solve it before coding.
## 2. Codebase Investigation
- Explore relevant files and directories.
- Search for key functions, classes, or variables related to the issue.
- Read and understand relevant code snippets.
- Identify the root cause of the problem.
- Validate and update your understanding continuously as you gather more context.
## 3. Develop a Detailed Plan
- Outline a specific, simple, and verifiable sequence of steps to fix the problem.
- Break down the fix into small, incremental changes.
## 4. Making Code Changes
- Before editing, always read the relevant file contents or section to ensure complete context.
- If a patch is not applied correctly, attempt to reapply it.
- Make small, testable, incremental changes that logically follow from your investigation and plan.
## 5. Debugging
- Make code changes only if you have high confidence they can solve the problem
- When debugging, try to determine the root cause rather than addressing symptoms
- Debug for as long as needed to identify the root cause and identify a fix
- Use print statements, logs, or temporary code to inspect program state, including descriptive statements or error messages to understand what's happening
- To test hypotheses, you can also add test statements or functions
- Revisit your assumptions if unexpected behavior occurs.
## 6. Testing
- Run tests frequently using `python3 run_tests.py` (or equivalent).
- After each change, verify correctness by running relevant tests.
- If tests fail, analyze failures and revise your patch.
- Write additional tests if needed to capture important behaviors or edge cases.
- Ensure all tests pass before finalizing.
## 7. Final Verification
- Confirm the root cause is fixed.
- Review your solution for logic correctness and robustness.
- Iterate until you are extremely confident the fix is complete and all tests pass.
## 8. Final Reflection and Additional Testing
- Reflect carefully on the original intent of the user and the problem statement.
- Think about potential edge cases or scenarios that may not be covered by existing tests.
- Write additional tests that would need to pass to fully validate the correctness of your solution.
- Run these new tests and ensure they all pass.
- Be aware that there are additional hidden tests that must also pass for the solution to be successful.
- Do not assume the task is complete just because the visible tests pass; continue refining until you are confident the fix is robust and comprehensive.
@@ -0,0 +1,19 @@
<uploaded_files>
/workspace/{{ workspace_dir_name }}
</uploaded_files>
I've uploaded a python code repository in the directory {{ workspace_dir_name }}. Consider the following issue description:
<issue_description>
{{ instance.problem_statement }}
</issue_description>
Can you help me implement the necessary changes to the repository to test whether the issue in <issue_description> was resolved?
I will take care of all changes to any of the non-test files. This means you DON'T have to modify the actual logic and ONLY have to update test logic and tests!
Your task is to make the minimal changes to test files in the /workspace directory to reproduce the issue in the <issue_description>, i.e., such that the generated tests fail in the current state (where the issue is unresolved) and pass once the issue is resolved.
Follow these steps to reproduce the issue:
1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure.
2. Create a script `reproduction.py` to reproduce the error and execute it with `python reproduction.py` using the BashTool, to confirm the error.
3. Edit the source code of the repo to integrate your reproduction script into the test framework.
4. Run the test framework and make sure your tests fail! Only submit FAILING tests! Never submit passing tests.
{{ test_instructions }}Your thinking should be thorough and so it's fine if it's very long.
@@ -8,6 +8,7 @@ from typing import Any, Literal
import pandas as pd
import toml
from datasets import load_dataset
from jinja2 import Environment, FileSystemLoader
import openhands.agenthub
from evaluation.benchmarks.swe_bench.binary_patch_utils import (
@@ -42,7 +43,7 @@ from openhands.core.config import (
AgentConfig,
OpenHandsConfig,
get_llm_config_arg,
get_parser
get_parser,
)
from openhands.core.config.condenser_config import NoOpCondenserConfig
from openhands.core.config.utils import get_condenser_config_arg
@@ -65,6 +66,26 @@ RUN_WITH_BROWSING = os.environ.get('RUN_WITH_BROWSING', 'false').lower() == 'tru
ENABLE_LLM_EDITOR = os.environ.get('ENABLE_LLM_EDITOR', 'false').lower() == 'true'
BenchMode = Literal['swe', 'swt', 'swt-ci']
# Global variable to track dataset type
DATASET_TYPE = 'SWE-bench'
def set_dataset_type(dataset_name: str) -> None:
"""Set dataset type based on dataset name."""
global DATASET_TYPE
name_lower = dataset_name.lower()
if 'swe-gym' in name_lower:
DATASET_TYPE = 'SWE-Gym'
elif 'swe-bench-live' in name_lower:
DATASET_TYPE = 'SWE-bench-Live'
elif 'multimodal' in name_lower:
DATASET_TYPE = 'Multimodal'
else:
DATASET_TYPE = 'SWE-bench'
logger.info(f'Dataset type set to: {DATASET_TYPE}')
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': codeact_user_response,
@@ -72,107 +93,59 @@ AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
def _get_swebench_workspace_dir_name(instance: pd.Series) -> str:
return f'{instance.repo}__{instance.version}'.replace('/', '__')
if DATASET_TYPE == 'SWE-bench-Live':
return instance.instance_id
else:
return f'{instance.repo}__{instance.version}'.replace('/', '__')
def get_instruction(instance: pd.Series, metadata: EvalMetadata) -> MessageAction:
workspace_dir_name = _get_swebench_workspace_dir_name(instance)
mode = metadata.details['mode']
llm_model = metadata.llm_config.model
# Determine the template file based on mode and LLM
if mode.startswith('swt'):
test_instructions = (
f'The following command can be used to run the tests: `{list(MAP_REPO_TO_TEST_FRAMEWORK_VERBOSE[instance.repo].values())[0]}`. Make sure they fail in the expected way.\n'
if mode.endswith('ci')
else ''
)
instruction = f"""\
<uploaded_files>
/workspace/{workspace_dir_name}
</uploaded_files>
I've uploaded a python code repository in the directory {workspace_dir_name}. Consider the following issue description:
<issue_description>
{instance.problem_statement}
</issue_description>
Can you help me implement the necessary changes to the repository to test whether the issue in <issue_description> was resolved?
I will take care of all changes to any of the non-test files. This means you DON'T have to modify the actual logic and ONLY have to update test logic and tests!
Your task is to make the minimal changes to tests files in the /workspace directory to reproduce the issue in the <issue_description>, i.e., such that the generated tests fail in the current state (where the issue is unresolved) and pass when the issue will be resolved.
Follow these steps to reproduce the issue:
1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure.
2. Create a script `reproduction.py` to reproduce the error and execute it with `python reproduction.py` using the BashTool, to confirm the error
3. Edit the sourcecode of the repo to integrate your reproduction script into the test framework
4. Run the test framework and make sure your tests fail! Only submit FAILING tests! Never submit passing tests.
{test_instructions}Your thinking should be thorough and so it's fine if it's very long.
"""
template_name = 'swt.j2'
elif mode == 'swe':
if 'claude' in llm_model:
template_name = 'swe_claude.j2'
elif 'gemini' in llm_model:
template_name = 'swe_gemini.j2'
elif 'gpt-4.1' in llm_model:
template_name = 'swe_gpt4.j2'
else:
template_name = (
'swe_default.j2' # Default for 'swe' mode (regular swe-bench)
)
else:
instruction = f"""
<uploaded_files>
/workspace/{workspace_dir_name}
</uploaded_files>
# Fallback or error handling if mode is unexpected
logger.error(f'Unexpected evaluation mode: {mode}. Falling back to default.')
template_name = 'swe_default.j2'
I've uploaded a python code repository in the directory {workspace_dir_name}. Consider the following issue description:
# Set up Jinja2 environment
# Assuming templates are in 'evaluation/benchmarks/swe_bench/prompts' relative to this script
prompts_dir = os.path.join(os.path.dirname(__file__), 'prompts')
env = Environment(loader=FileSystemLoader(prompts_dir))
template = env.get_template(template_name)
<issue_description>
{instance.problem_statement}
</issue_description>
# Prepare context for rendering
context = {
'instance': instance,
'workspace_dir_name': workspace_dir_name,
'metadata': metadata, # Pass metadata if needed in templates
}
Can you help me implement the necessary changes to the repository so that the requirements specified in the <issue_description> are met?
I've already taken care of all changes to any of the test files described in the <issue_description>. This means you DON'T have to modify the testing logic or any of the tests in any way!
Also the development Python environment is already set up for you (i.e., all dependencies already installed), so you don't need to install other packages.
Your task is to make the minimal changes to non-test files in the /workspace/{workspace_dir_name} directory to ensure the <issue_description> is satisfied.
# Add specific context for swt-ci mode if needed
if mode == 'swt-ci':
context['test_instructions'] = (
f'The following command can be used to run the tests: `{list(MAP_REPO_TO_TEST_FRAMEWORK_VERBOSE[instance.repo].values())[0]}`. Make sure they fail in the expected way.\n'
)
else:
context['test_instructions'] = '' # Ensure it's defined for other modes
Follow these phases to resolve the issue:
Phase 1. READING: read the problem and reword it in clearer terms
1.1 If there are code or config snippets. Express in words any best practices or conventions in them.
1.2 Hightlight message errors, method names, variables, file names, stack traces, and technical details.
1.3 Explain the problem in clear terms.
1.4 Enumerate the steps to reproduce the problem.
1.5 Hightlight any best practices to take into account when testing and fixing the issue
Phase 2. RUNNING: install and run the tests on the repository
2.1 Follow the readme
2.2 Install the environment and anything needed
2.2 Iterate and figure out how to run the tests
Phase 3. EXPLORATION: find the files that are related to the problem and possible solutions
3.1 Use `grep` to search for relevant methods, classes, keywords and error messages.
3.2 Identify all files related to the problem statement.
3.3 Propose the methods and files to fix the issue and explain why.
3.4 From the possible file locations, select the most likely location to fix the issue.
Phase 4. TEST CREATION: before implementing any fix, create a script to reproduce and verify the issue.
4.1 Look at existing test files in the repository to understand the test format/structure.
4.2 Create a minimal reproduction script that reproduces the located issue.
4.3 Run the reproduction script to confirm you are reproducing the issue.
4.4 Adjust the reproduction script as necessary.
Phase 5. FIX ANALYSIS: state clearly the problem and how to fix it
5.1 State clearly what the problem is.
5.2 State clearly where the problem is located.
5.3 State clearly how the test reproduces the issue.
5.4 State clearly the best practices to take into account in the fix.
5.5 State clearly how to fix the problem.
Phase 6. FIX IMPLEMENTATION: Edit the source code to implement your chosen solution.
6.1 Make minimal, focused changes to fix the issue.
Phase 7. VERIFICATION: Test your implementation thoroughly.
7.1 Run your reproduction script to verify the fix works.
7.2 Add edge cases to your test script to ensure comprehensive coverage.
7.3 Run existing tests related to the modified code to ensure you haven't broken anything.
8. FINAL REVIEW: Carefully re-read the problem description and compare your changes with the base commit {instance['base_commit']}.
8.1 Ensure you've fully addressed all requirements.
8.2 Run any tests in the repository related to:
8.2.1 The issue you are fixing
8.2.2 The files you modified
8.2.3 The functions you changed
8.3 If any tests fail, revise your implementation until all tests pass
Be thorough in your exploration, testing, and reasoning. It's fine if your thinking process is lengthy - quality and completeness are more important than brevity.
"""
# Render the instruction
instruction = template.render(context)
if RUN_WITH_BROWSING:
instruction += (
@@ -203,9 +176,13 @@ def get_instance_docker_image(
if swebench_official_image:
# Official SWE-Bench image
# swebench/sweb.eval.x86_64.django_1776_django-11333:v1
docker_image_prefix = 'docker.io/swebench/'
# SWE-bench-Live uses the same naming convention as SWE-Bench
if DATASET_TYPE == 'SWE-bench-Live':
docker_image_prefix = 'docker.io/starryzhang/'
elif DATASET_TYPE == 'SWE-bench':
docker_image_prefix = 'docker.io/swebench/'
repo, name = instance_id.split('__')
image_name = f'swebench/sweb.eval.x86_64.{repo}_1776_{name}:latest'.lower()
image_name = f'{docker_image_prefix.rstrip("/")}/sweb.eval.x86_64.{repo}_1776_{name}:latest'.lower()
logger.debug(f'Using official SWE-Bench image: {image_name}')
return image_name
else:
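As shown in this hunk, `docker_image_prefix` is assigned only when `DATASET_TYPE` is `'SWE-bench-Live'` or `'SWE-bench'`, while `get_config` enables the official-image branch for every type except `'SWE-Gym'`. A `'Multimodal'` run would therefore appear to reach the f-string with the prefix unset. Below is a minimal sketch of one way to make the lookup total, assuming that falling back to the previous `docker.io/swebench/` prefix is acceptable. The helper name `official_image_name` and the `_PREFIXES` table are hypothetical and not part of this PR.

```python
# Hypothetical helper; it mirrors the image naming scheme shown in the diff.
_PREFIXES = {
    'SWE-bench-Live': 'docker.io/starryzhang/',
    'SWE-bench': 'docker.io/swebench/',
}
# Assumption: any other dataset type keeps the pre-PR prefix.
_DEFAULT_PREFIX = 'docker.io/swebench/'


def official_image_name(instance_id: str, dataset_type: str) -> str:
    """Build the official image name for an instance such as 'django__django-11333'."""
    prefix = _PREFIXES.get(dataset_type, _DEFAULT_PREFIX)
    repo, name = instance_id.split('__')
    return f'{prefix.rstrip("/")}/sweb.eval.x86_64.{repo}_1776_{name}:latest'.lower()


print(official_image_name('django__django-11333', 'SWE-bench-Live'))
```

A dictionary lookup with a default keeps the prefix defined on every path, which an `if`/`elif` chain without a final `else` does not guarantee.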
@@ -223,7 +200,8 @@ def get_config(
metadata: EvalMetadata,
) -> OpenHandsConfig:
# We use a different instance image for the each instance of swe-bench eval
use_swebench_official_image = 'swe-gym' not in metadata.dataset.lower()
use_swebench_official_image = DATASET_TYPE != 'SWE-Gym'
base_container_image = get_instance_docker_image(
instance['instance_id'],
swebench_official_image=use_swebench_official_image,
@@ -340,8 +318,12 @@ def initialize_runtime(
runtime.copy_to(temp_file_path, '/swe_util/eval_data/instances/')
# inject the instance swe entry
if DATASET_TYPE == 'SWE-bench-Live':
entry_script_path = 'instance_swe_entry_live.sh'
else:
entry_script_path = 'instance_swe_entry.sh'
runtime.copy_to(
str(os.path.join(script_dir, 'scripts/setup/instance_swe_entry.sh')),
str(os.path.join(script_dir, f'scripts/setup/{entry_script_path}')),
'/swe_util/',
)
@@ -361,14 +343,14 @@ def initialize_runtime(
logger.error(f'Failed to source ~/.bashrc: {str(obs)}')
assert_and_raise(obs.exit_code == 0, f'Failed to source ~/.bashrc: {str(obs)}')
action = CmdRunAction(command='source /swe_util/instance_swe_entry.sh')
action = CmdRunAction(command=f'source /swe_util/{entry_script_path}')
action.set_hard_timeout(600)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert_and_raise(
obs.exit_code == 0,
f'Failed to source /swe_util/instance_swe_entry.sh: {str(obs)}',
f'Failed to source /swe_util/{entry_script_path}: {str(obs)}',
)
action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
@@ -421,9 +403,9 @@ def initialize_runtime(
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
if 'multimodal' not in metadata.dataset.lower():
if DATASET_TYPE != 'Multimodal' and DATASET_TYPE != 'SWE-bench-Live':
# Only for non-multimodal datasets, we need to activate the testbed environment for Python
# SWE-Bench multimodal datasets are not using the testbed environment
# SWE-Bench multimodal datasets and SWE-bench-Live are not using the testbed environment
action = CmdRunAction(command='which python')
action.set_hard_timeout(600)
logger.info(action, extra={'msg_type': 'ACTION'})
@@ -665,7 +647,13 @@ def process_instance(
# ======= THIS IS SWE-Bench specific =======
# Get git patch
return_val = complete_runtime(runtime, instance)
if DATASET_TYPE == 'SWE-bench-Live':
from evaluation.benchmarks.swe_bench.live_utils import (
complete_runtime as complete_runtime_fn,
)
else:
complete_runtime_fn = complete_runtime
return_val = complete_runtime_fn(runtime, instance)
git_patch = return_val['git_patch']
logger.info(
f'Got git diff for instance {instance.instance_id}:\n--------\n{git_patch}\n--------'
@@ -770,11 +758,15 @@ if __name__ == '__main__':
# NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
# so we don't need to manage file uploading to OpenHands's repo
dataset = load_dataset(args.dataset, split=args.split)
# Set the global dataset type based on dataset name
set_dataset_type(args.dataset)
swe_bench_tests = filter_dataset(dataset.to_pandas(), 'instance_id')
logger.info(
f'Loaded dataset {args.dataset} with split {args.split}: {len(swe_bench_tests)} tasks'
)
if 'SWE-Gym' in args.dataset:
if DATASET_TYPE == 'SWE-Gym':
with open(
os.path.join(
os.path.dirname(os.path.abspath(__file__)),
@@ -192,6 +192,8 @@ def get_config(
dataset_name=metadata.dataset,
instance_id=instance['instance_id'],
)
oh_aci_li_cmd = '/openhands/micromamba/bin/micromamba run -n openhands poetry run pip install openhands-aci[llama]'
sandbox_config.runtime_extra_deps = oh_aci_li_cmd
workspace_dir_name = _get_swebench_workspace_dir_name(instance)
sandbox_config.runtime_startup_env_vars = {
'REPO_PATH': f'/workspace/{workspace_dir_name}/',
@@ -216,6 +218,7 @@ def get_config(
enable_jupyter=False,
enable_browsing=RUN_WITH_BROWSING,
enable_llm_editor=False,
enable_mcp=os.environ.get('ENABLE_MCP', False),
condenser=metadata.condenser_config,
enable_prompt_extensions=False,
)
@@ -0,0 +1,33 @@
import argparse
import json
def main(output_jsonl: str):
with open(output_jsonl, 'r') as f:
for line in f:
try:
output = json.loads(line)
pred = {
'instance_id': output['instance_id'],
'model_name_or_path': output['metadata']['llm_config']['model'],
'model_patch': output['test_result']['git_patch'],
}
except Exception as e:
print(
f'Error while reading output of instance {output["instance_id"]}: {e}'
)
print(json.dumps(pred))
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
'--output_jsonl',
type=str,
required=True,
help='Path to the prediction file (.../outputs.jsonl)',
)
args = parser.parse_args()
main(args.output_jsonl)
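In the script as written, the `except` branch reads `output["instance_id"]` even when `json.loads` itself failed, and `print(json.dumps(pred))` still runs after an error, so a bad line can either raise again or repeat the previous prediction. The following is a minimal sketch of a more defensive conversion that skips unreadable lines. The function names `to_prediction` and `convert` are hypothetical; the field names are the ones used in the script.

```python
import json
import sys
from typing import Iterable, Iterator, Optional


def to_prediction(line: str) -> Optional[dict]:
    """Return a prediction dict for one outputs.jsonl line, or None if unusable."""
    try:
        output = json.loads(line)
        return {
            'instance_id': output['instance_id'],
            'model_name_or_path': output['metadata']['llm_config']['model'],
            'model_patch': output['test_result']['git_patch'],
        }
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        # Report on stderr so stdout stays a clean JSONL stream.
        print(f'Skipping unreadable line: {e}', file=sys.stderr)
        return None


def convert(lines: Iterable[str]) -> Iterator[str]:
    """Yield one JSON string per line that could be converted."""
    for line in lines:
        pred = to_prediction(line)
        if pred is not None:
            yield json.dumps(pred)
```

Typical use would be `for row in convert(open(path)): print(row)`, which keeps the same stdout format as the original script.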
@@ -0,0 +1,41 @@
#!/usr/bin/env bash
source ~/.bashrc
SWEUTIL_DIR=/swe_util
# FIXME: Cannot read SWE_INSTANCE_ID from the environment variable
# SWE_INSTANCE_ID=django__django-11099
if [ -z "$SWE_INSTANCE_ID" ]; then
echo "Error: SWE_INSTANCE_ID is not set." >&2
exit 1
fi
# Read swe-bench-instance.json and extract the item whose instance_id matches
item=$(jq --arg INSTANCE_ID "$SWE_INSTANCE_ID" '.[] | select(.instance_id == $INSTANCE_ID)' $SWEUTIL_DIR/eval_data/instances/swe-bench-instance.json)
if [[ -z "$item" ]]; then
echo "No item found for the provided instance ID."
exit 1
fi
echo "WORKSPACE_NAME: $SWE_INSTANCE_ID"
# Clear the workspace
if [ -d /workspace ]; then
rm -rf /workspace/*
else
mkdir /workspace
fi
# Copy repo to workspace
if [ -d /workspace/$SWE_INSTANCE_ID ]; then
rm -rf /workspace/$SWE_INSTANCE_ID
fi
mkdir -p /workspace
cp -r /testbed /workspace/$SWE_INSTANCE_ID
# SWE-bench-Live does not use conda to manage Python
# if [ -d /opt/miniconda3 ]; then
# . /opt/miniconda3/etc/profile.d/conda.sh
# conda activate testbed
# fi
@@ -921,7 +921,7 @@ SPECS_PYDICOM.update(
SPECS_HUMANEVAL = {k: {'python': '3.9', 'test_cmd': 'python'} for k in ['1.0']}
# Constants - Task Instance Instllation Environment
# Constants - Task Instance Installation Environment
MAP_REPO_VERSION_TO_SPECS: dict[str, dict[str, Any]] = {
'astropy/astropy': SPECS_ASTROPY,
'dbt-labs/dbt-core': SPECS_DBT_CORE,
@@ -539,7 +539,7 @@ if __name__ == '__main__':
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
llm_config.log_completions = True
# modify_params must be False for evaluation purpose, for reproducibility and accurancy of results
# modify_params must be False for evaluation purpose, for reproducibility and accuracy of results
llm_config.modify_params = False
if llm_config is None:
@@ -0,0 +1,102 @@
# VersiCode benchmark
This project is used to evaluate the performance of the model on VersiCode. It includes:
- data: the test data needed and the model outputs
- inference_utils: inference scripts for our tasks and models
- metric: scripts for calculating the various metrics
- output_processing: process the model output to facilitate the calculation of model metrics
# Details
1. **Prepare the environment**
```shell
#create and activate the conda environment
conda create -n VersiCode python==3.12
conda activate VersiCode
#install requirements
pip install -r requirements.txt
```
2. **Experiment Data**
To obtain the experimental data, please visit the Hugging Face link: https://huggingface.co/datasets/AstoneNg/VersiCode.
Locate the files `VersiCode_block_completion.json` and `VersiCode_migration.json` under the `experiment_data` directory, and place them in the `/data/test_data` directory of this project.
3. **Model inference**
```shell
#cd inference_utils directory
cd inference_utils
#Scripts whose names start with 'test' run a local model
#Scripts whose names start with 'api' call a model through an API
#block-level code completion
#Modify the 10th and 12th lines of code to specify the base URL and model name
python api_test_block_completion.py
#Modify the 30th line of code to specify the local model path
python test_block.py
# code migration (migration order is 'old_to_new')
#Modify the 10th and 12th lines of code to specify the base URL and model name
python api_code_migration.py
#Modify the 30th line of code to specify the local model path
python test_migration.py
```
4. **Process output**
Process the model output: remove redundant content and extract the specified content so that the metrics are easy to compute.
```shell
#cd output_processing
cd output_processing
#Extract the content between <start> and <end>
#Modify the 8th and 9th lines of code to specify the model and task granularity
python clear_ans.py
#In the block completion and migration tasks, cdc@k is computed on the key (core) lines, so select them first
#Modify lines 76 and 79 to specify the data path
python choose_core_line_from_block_versicode.py
python choose_core_line_from_migration_versicode.py
```
5. **Metric**
We have three metrics: pass@k, em@k, and cdc@k. Because we cannot automatically build a dynamic evaluation environment, we do not provide pass@k.
```shell
#cd metric
cd metric
#Modify lines 137-140 of compute_migration_cdc_score.py (migration task) or lines 143-145 of compute_versicode_cdc_score.py and compute_versicode_em_score.py (block and line completion tasks) to specify the data path and the k value of the metric
python compute_migration_cdc_score.py
python compute_versicode_cdc_score.py
python compute_versicode_em_score.py
#Notes
#We found limitations in the ISM@k and PM@k metrics for evaluating code generation, so we use them only as a reference in our experiments.
#Modify lines 261-265 (block and line completion tasks) to specify the data path and the k value of the metric
python compute_ism_pm_score.py
```
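The inference scripts store `model_output` as the string form of a Python list (`str(ans_list)`), and the prompts ask the model to wrap its code in `<start>` and `<end>`. The sketch below shows one way such a field could be turned back into a list of code snippets. It is only an illustration under those two assumptions, not the implementation in `clear_ans.py`, and the function name `extract_answers` is hypothetical.

```python
import ast
import re

# Non-greedy match so only the first <start>...<end> pair in a sample is taken.
_BLOCK = re.compile(r'<start>(.*?)<end>', re.DOTALL)


def extract_answers(model_output: str) -> list:
    """Parse the stringified sample list and pull the code out of each sample."""
    samples = ast.literal_eval(model_output)  # safe parser for Python literals
    answers = []
    for sample in samples:
        if not isinstance(sample, str):
            # For example, a None completion from the API.
            answers.append('')
            continue
        match = _BLOCK.search(sample)
        # Fall back to the whole sample when the model omitted the markers.
        answers.append(match.group(1).strip() if match else sample.strip())
    return answers


raw = str(['<start>\nprint(1)\n<end>', 'print(2)', None])
print(extract_answers(raw))  # ['print(1)', 'print(2)', '']
```

`ast.literal_eval` is used instead of `eval` because the field contains model-generated text.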
# Citation
```
@article{versicode,
author={Tongtong Wu and Weigang Wu and Xingyu Wang and Kang Xu and Suyu Ma and Bo Jiang and Ping Yang and Zhenchang Xing and Yuan-Fang Li and Gholamreza Haffari},
title = {VersiCode: Towards Version-controllable Code Generation},
journal = {CoRR},
volume = {abs/2406.07411},
year = {2024},
url = {https://arxiv.org/abs/2406.07411},
}
```
**Github url**: https://github.com/wutong8023/VersiCode
# Contributors
[Tongtong Wu](https://scholar.google.com/citations?hl=zh-CN&user=u1Qp8lUAAAAJ&view_op=list_works&sortby=pubdate), [Weigang Wu](https://scholar.google.com/citations?hl=zh-CN&user=UneIZo8AAAAJ), [Xingyu Wang](https://scholar.google.com/citations?hl=zh-CN&user=wqPJcxcAAAAJ), [Kang Xu](https://scholar.google.com/citations?hl=zh-CN&user=N1UUDi0AAAAJ), [Suyu Ma](https://scholar.google.com/citations?hl=zh-CN&user=NJHR1ukAAAAJ), [Bo Jiang](https://wutong8023.site/VersiCode/), [Ping Yang](https://scholar.google.com/citations?view_op=list_works&hl=en&hl=en&user=hrogvxoAAAAJ), [Zhenchang Xing](https://scholar.google.com/citations?hl=zh-CN&user=0vCxuH4AAAAJ), [Yuan-Fang Li](https://scholar.google.com/citations?hl=zh-CN&user=wufXO1kAAAAJ), [Gholamreza Haffari](https://scholar.google.com/citations?hl=zh-CN&user=Perjx5EAAAAJ)
@@ -0,0 +1,134 @@
"""
GPT performs line level generation prediction and truncates overly long tokens
"""
import json
import os
import tiktoken
from openai import OpenAI
max_tokens = 127000 # gpt-3.5 has a 16k-token context; gpt-4o has 128k
model_name = ''
os.environ['OPENAI_API_KEY'] = ''
client = OpenAI()
def truncate_text(text, max_tokens):
encoding = tiktoken.get_encoding('cl100k_base')
disallowed_special = ()
tokens = encoding.encode(text, disallowed_special=disallowed_special)
print(len(tokens))
if len(tokens) > max_tokens:
tokens = tokens[:max_tokens]
truncated_text = encoding.decode(tokens)
return truncated_text
def predict(content, model_name):
response = client.chat.completions.create(
model=model_name,
messages=[{'role': 'user', 'content': content}],
frequency_penalty=0.1,
max_tokens=128,
logit_bias=None,
logprobs=None,
n=6,
presence_penalty=0.0,
seed=None,
stop=None,
stream=False,
temperature=0.8,
top_p=0.95,
)
ans_list = []
choices_list = response.choices
for c in choices_list:
content = c.message.content
ans_list.append(content)
final_ans = str(ans_list)
return final_ans
def bulid_prompt(description, old_version, old_code, new_version) -> str:
"""
build prompt
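:param description: functional description of the code
:param old_version: dependency and its old version (package==x.x.x)
:param old_code: code written for the old version
:param new_version: dependency and its new version (package==x.x.x)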
:return:
"""
prompt = f"""
You are now a professional Python programming engineer. I will provide you with a code snippet and a description of its functionality,
including the dependencies and versions used in the code. Then, I will provide the same dependencies but with a specified new version.
Your task is to refactor the code using the methods provided by the specified new version and return the refactored code.
Please note that you only need to return the refactored code and enclose it with <start> and <end>:
###Functionality description of the code
{description}
###Dependency and old version
{old_version}
###Old version code
{old_code}
###Dependency and new version
{new_version}
###Refactored new code
"""
return prompt
json_path = '../data/test_data/VersiCode_migration.json'
with open(json_path, 'r', encoding='utf-8') as fr:
lodict = json.load(fr)
data_dict = lodict
data_list = data_dict
for data in data_list:
if 'model_output' in data:
print(
f'the {data_list.index(data) + 1} has already been predicted, skipping this data!'
)
continue
try:
print(f'Predicting {data_list.index(data) + 1} ')
old_version = data['dependency'] + data['old_version'] # package == x.x.x
new_version = data['dependency'] + data['new_version'] # package == x.x.x
description = data['description'] # functional description
old_code = data['old_code'] # code written for the old version
instruction = bulid_prompt(description, old_version, old_code, new_version)
truncated_text = truncate_text(instruction, max_tokens)
prediction = predict(truncated_text, model_name)
data['model_output'] = prediction
except Exception as e:
print(f'error{e}')
print('save current data')
save_folder_path = os.path.join(
'../data/result_data/code_migration', model_name
)
if not os.path.exists(save_folder_path):
os.makedirs(save_folder_path)
save_json_path = os.path.join(save_folder_path, json_path.split('/')[-1])
with open(save_json_path, 'w', encoding='utf-8') as fw:
json.dump(data_dict, fw, indent=4, ensure_ascii=False)
break
save_folder_path = os.path.join('../data/result_data/code_migration', model_name)
if not os.path.exists(save_folder_path):
os.makedirs(save_folder_path)
save_json_path = os.path.join(save_folder_path, json_path.split('/')[-1])
with open(save_json_path, 'w', encoding='utf-8') as fw:
json.dump(data_dict, fw, indent=4, ensure_ascii=False)
@@ -0,0 +1,141 @@
"""
GPT performs line level generation prediction and truncates overly long tokens
"""
import json
import os
import tiktoken
from openai import OpenAI
max_tokens = 127000 # gpt-3.5 has a 16k-token context; gpt-4o has 128k
model_name = ''
os.environ['OPENAI_API_KEY'] = ''
client = OpenAI()
def truncate_text(text, max_tokens):
encoding = tiktoken.get_encoding('cl100k_base')
disallowed_special = ()
tokens = encoding.encode(text, disallowed_special=disallowed_special)
print(len(tokens))
if len(tokens) > max_tokens:
tokens = tokens[:max_tokens]
truncated_text = encoding.decode(tokens)
return truncated_text
def predict(content, model_name):
response = client.chat.completions.create(
model=model_name,
messages=[{'role': 'user', 'content': content}],
frequency_penalty=0.1,
max_tokens=128,
logit_bias=None,
logprobs=None,
n=6,
presence_penalty=0.0,
seed=None,
stop=None,
stream=False,
temperature=0.8,
top_p=0.95,
)
ans_list = []
choices_list = response.choices
for c in choices_list:
content = c.message.content
ans_list.append(content)
final_ans = str(ans_list)
return final_ans
def bulid_prompt(version, description) -> str:
"""
build prompt
:param version: dependency and its version (package==x.x.x)
:param description: functional description of the code
:return:
"""
prompt = f"""
You are a professional Python engineer, and I will provide functional descriptions and versions of specified dependency packages.
You need to write code in Python to implement this feature based on the functional description and using the dependency package and version I specified.
Please note that you only need to return the code that implements the function, and do not return any other content.
Please use <start> and <end> to enclose the generated code. Here is an example:
###Function Description
The function of this code is to print the results predicted by calling the model using vllm.
###dependeny and version
vllm==0.3.3
###response:
<start>
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print("Prompt,Generated text")
<end>
###Function Description
{description}
###dependeny and version
{version}
###response:
"""
return prompt
json_path = '../data/test_data/VersiCode_block_completion.json'
with open(json_path, 'r', encoding='utf-8') as fr:
lodict = json.load(fr)
data_dict = lodict
data_list = data_dict
for data in data_list:
if 'model_output' in data:
print(
f'the {data_list.index(data) + 1} has already been predicted, skipping this data!'
)
continue
try:
print(f'Predicting {data_list.index(data) + 1} ')
version = data['dependency'] + data['version'] # package == x.x.x
description = data['description'] # func description
instruction = bulid_prompt(version, description)
truncated_text = truncate_text(instruction, max_tokens)
prediction = predict(truncated_text, model_name)
data['model_output'] = prediction
except Exception as e:
print(f'error{e}')
print('save current data')
save_folder_path = os.path.join(
'../data/result_data/block_completion', model_name
)
if not os.path.exists(save_folder_path):
os.makedirs(save_folder_path)
save_json_path = os.path.join(save_folder_path, json_path.split('/')[-1])
with open(save_json_path, 'w', encoding='utf-8') as fw:
json.dump(data_dict, fw, indent=4, ensure_ascii=False)
break
save_folder_path = os.path.join('../data/result_data/block_completion', model_name)
if not os.path.exists(save_folder_path):
os.makedirs(save_folder_path)
save_json_path = os.path.join(save_folder_path, json_path.split('/')[-1])
with open(save_json_path, 'w', encoding='utf-8') as fw:
json.dump(data_dict, fw, indent=4, ensure_ascii=False)
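Both API scripts call `data_list.index(data)` to print progress, which rescans the list on every iteration and reports the first equal entry when two records are identical. They also repeat the save logic in the `except` branch and after the loop. A minimal sketch of the same control flow with `enumerate` and a single save point follows. The names `run_predictions`, `predict_fn`, and `save_fn` are hypothetical stand-ins for the script's prompt-building, `predict`, and `json.dump` steps.

```python
from typing import Callable, List


def run_predictions(
    data_list: List[dict],
    predict_fn: Callable[[dict], str],
    save_fn: Callable[[List[dict]], None],
) -> int:
    """Fill in 'model_output' for each record; return how many were newly predicted."""
    done = 0
    try:
        for i, data in enumerate(data_list, start=1):
            if 'model_output' in data:
                print(f'Record {i} has already been predicted, skipping.')
                continue
            print(f'Predicting {i}')
            data['model_output'] = predict_fn(data)
            done += 1
    except Exception as e:
        # Same policy as the scripts: stop at the first failure and keep partial results.
        print(f'error: {e}')
    finally:
        # Runs on success and on failure, so results are written exactly once.
        save_fn(data_list)
    return done
```

Because already-predicted records are skipped, rerunning the function on the saved data resumes from the first record that failed.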
@@ -0,0 +1,129 @@
"""
block completion
"""
import copy
import gc
import json
import os
import time
from multiprocessing import Process
import tiktoken
import torch
from vllm import LLM, SamplingParams
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
def truncate_text(text, max_tokens):
encoding = tiktoken.get_encoding('cl100k_base')
disallowed_special = ()
tokens = encoding.encode(text, disallowed_special=disallowed_special)
print(len(tokens))
if len(tokens) > max_tokens:
tokens = tokens[:max_tokens]
truncated_text = encoding.decode(tokens)
return truncated_text
model_list = ['/data2/base models/starcoder2-15b', '/data2/base models/CodeGemma-7B']
def run_inference(model_name, origin_data_list):
temp_data_list = copy.deepcopy(origin_data_list)
test_list = []
for data in temp_data_list:
version = data['dependency'] + data['version'] # package == x.x.x
description = data['description'] # func description
instruction = bulid_prompt(version, description)
test_list.append(instruction)
sampling_params = SamplingParams(n=6, temperature=0.8, top_p=0.95, max_tokens=64)
llm = LLM(
model=model_name,
tensor_parallel_size=4,
gpu_memory_utilization=0.9,
swap_space=20,
)
outputs = llm.generate(test_list, sampling_params)
for output in outputs:
requests_id = int(output.request_id)
temp_ans_list = []
output_list = output.outputs
for o in output_list:
text = o.text
temp_ans_list.append(text)
temp_data_list[requests_id]['model_output'] = str(temp_ans_list)
save_folder_path = os.path.join(
'../data/result_data/block_completion', model_name.split('/')[-1]
)
if not os.path.exists(save_folder_path):
os.makedirs(save_folder_path)
save_json_path = os.path.join(save_folder_path, json_path.split('/')[-1])
with open(save_json_path, 'w', encoding='utf-8') as fw:
json.dump(temp_data_list, fw, indent=4, ensure_ascii=False)
gc.collect()
torch.cuda.empty_cache()
def bulid_prompt(version, description) -> str:
"""
build prompt
:param version: dependency and its version (package==x.x.x)
:param description: functional description of the code
:return:
"""
prompt = f"""
You are a professional Python engineer, and I will provide functional descriptions and versions of specified dependency packages.
You need to write code in Python to implement this feature based on the functional description and using the dependency package and version I specified.
Please note that you only need to return the code that implements the function, and do not return any other content.
Please use <start> and <end> to enclose the generated code. Here is an example:
###Function Description
The function of this code is to print the results predicted by calling the model using vllm.
###dependeny and version
vllm==0.3.3
###response:
<start>
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print("Prompt,Generated text")
<end>
###Function Description
{description}
###dependeny and version
{version}
###response:
"""
return prompt
json_path = '../data/test_data/VersiCode_block_completion.json'
with open(json_path, 'r', encoding='utf-8') as fr:
lodict = json.load(fr)
origin_data_list = lodict
for model_name in model_list:
process = Process(target=run_inference, args=(model_name, origin_data_list))
process.start()
process.join()
time.sleep(120)
@@ -0,0 +1,122 @@
"""
code migration
"""
import copy
import gc
import json
import os
import time
from multiprocessing import Process
import tiktoken
import torch
from vllm import LLM, SamplingParams
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
def truncate_text(text, max_tokens):
encoding = tiktoken.get_encoding('cl100k_base')
disallowed_special = ()
tokens = encoding.encode(text, disallowed_special=disallowed_special)
print(len(tokens))
if len(tokens) > max_tokens:
tokens = tokens[:max_tokens]
truncated_text = encoding.decode(tokens)
return truncated_text
model_list = ['/data2/base models/starcoder2-15b', '/data2/base models/CodeGemma-7B']
def run_inference(model_name, origin_data_list):
temp_data_list = copy.deepcopy(origin_data_list)
test_list = []
for data in temp_data_list:
old_version = data['dependency'] + data['old_version'] # package == x.x.x
new_version = data['dependency'] + data['new_version'] # package == x.x.x
description = data['description'] # functional description
old_code = data['old_code'] # code written for the old version
instruction = bulid_prompt(description, old_version, old_code, new_version)
test_list.append(instruction)
sampling_params = SamplingParams(n=6, temperature=0.8, top_p=0.95, max_tokens=512)
llm = LLM(
model=model_name,
tensor_parallel_size=4,
gpu_memory_utilization=0.6,
swap_space=40,
)
outputs = llm.generate(test_list, sampling_params)
for output in outputs:
requests_id = int(output.request_id)
temp_ans_list = []
output_list = output.outputs
for o in output_list:
text = o.text
temp_ans_list.append(text)
temp_data_list[requests_id]['model_output'] = str(temp_ans_list)
save_folder_path = os.path.join(
'../data/result_data/code_migration', model_name.split('/')[-1]
)
if not os.path.exists(save_folder_path):
os.makedirs(save_folder_path)
save_json_path = os.path.join(save_folder_path, json_path.split('/')[-1])
with open(save_json_path, 'w', encoding='utf-8') as fw:
json.dump(temp_data_list, fw, indent=4, ensure_ascii=False)
gc.collect()
torch.cuda.empty_cache()
def bulid_prompt(description, old_version, old_code, new_version) -> str:
"""
build prompt
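:param description: functional description of the code
:param old_version: dependency and its old version (package==x.x.x)
:param old_code: code written for the old version
:param new_version: dependency and its new version (package==x.x.x)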
:return:
"""
prompt = f"""
You are now a professional Python programming engineer. I will provide you with a code snippet and a description of its functionality,
including the dependencies and versions used in the code. Then, I will provide the same dependencies but with a specified new version.
Your task is to refactor the code using the methods provided by the specified new version and return the refactored code.
Please note that you only need to return the refactored code and enclose it with <start> and <end>:
###Functionality description of the code
{description}
###Dependency and old version
{old_version}
###Old version code
{old_code}
###Dependency and new version
{new_version}
###Refactored new code
"""
return prompt
json_path = '../data/test_data/VersiCode_migration.json'
with open(json_path, 'r', encoding='utf-8') as fr:
lodict = json.load(fr)
origin_data_list = lodict
for model_name in model_list:
process = Process(target=run_inference, args=(model_name, origin_data_list))
process.start()
process.join()
time.sleep(120)
@@ -0,0 +1,356 @@
"""
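Evaluate block-level prediction ability
1. Check whether the output contains the correct function name
2. Check whether the output is valid code
3. Compute ISM and PM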
"""
import io
import json
import math
import os
import re
import tokenize
def is_code_valid(code):
try:
compile(code, '<string>', 'exec')
return True
except Exception:
return False
def longest_common_prefix_between_lists_with_elements(list1, list2):
"""
Compute the length of the longest common prefix between elements of two lists of strings
:param list1:
:param list2:
:return:
"""
max_prefix_length = 0
max_prefix_elements = ()
for str1 in list1:
for str2 in list2:
prefix_length = 0
min_len = min(len(str1), len(str2))
for i in range(min_len):
if str1[i] == str2[i]:
prefix_length += 1
else:
break
if prefix_length > max_prefix_length:
max_prefix_length = prefix_length
max_prefix_elements = (str1, str2)
return max_prefix_length, max_prefix_elements
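To make the behaviour concrete: the function compares every pair of strings across the two lists and keeps the first pair with the longest shared prefix. The sketch below is a compact equivalent that delegates the character-wise comparison to `os.path.commonprefix` from the standard library. The name `longest_prefix_pair` is hypothetical and the inputs are made up for illustration.

```python
import os


def longest_prefix_pair(list1, list2):
    """Return (prefix_length, (s1, s2)) for the pair sharing the longest prefix."""
    best_len, best_pair = 0, ()
    for s1 in list1:
        for s2 in list2:
            # commonprefix works character by character on any list of strings.
            n = len(os.path.commonprefix([s1, s2]))
            if n > best_len:  # strict '>' keeps the first pair on ties
                best_len, best_pair = n, (s1, s2)
    return best_len, best_pair


print(longest_prefix_pair(['load_model', 'print'], ['load_state', 'len']))
# (5, ('load_model', 'load_state'))
```

With this result, the score computed later in `get_ISM` would be `5 / max(10, 10) = 0.5`, since the prefix length is divided by the length of the longer string in the pair.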
def get_token(ans_code: str, output_code: str):
"""
Tokenize both code strings into identifiers and return the two identifier lists
:param ans_code:
:param output_code:
:return:
"""
output_flag = True
ans_flag = True
try:
tokens_ans = tokenize.tokenize(io.BytesIO(ans_code.encode('utf-8')).readline)
except Exception:
tokens_ans = ans_code.splitlines()
ans_flag = False
try:
tokens_output = tokenize.tokenize(
io.BytesIO(output_code.encode('utf-8')).readline
)
except Exception:
tokens_output = output_code.splitlines()
output_flag = False
identifiers_ans = []
identifiers_output = []
if ans_flag:
try:
for token in tokens_ans:
if token.type == tokenize.NAME:
identifiers_ans.append(token.string)
except Exception:
identifiers_ans = tokens_ans
else:
identifiers_ans = tokens_ans
if output_flag:
try:
for to in tokens_output:
if to.type == tokenize.NAME:
identifiers_output.append(to.string)
except Exception:
identifiers_output = tokens_output
else:
identifiers_output = tokens_output
return identifiers_ans, identifiers_output
def get_token_per_line(code: str):
"""
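Tokenize each line of code and record the identifiers on each line
:param code: the code string
:return: a list containing one list of identifiers per line
"""
lines = code.split('\n') # split the code into a list of lines
identifiers_per_line = [] # holds the list of identifiers for each line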
for line in lines:
tokens = tokenize.tokenize(io.BytesIO(line.encode('utf-8')).readline)
identifiers = []
try:
for token in tokens:
if token.type == tokenize.NAME:
identifiers.append(token.string)
except Exception:
identifiers = line.split(' ')
identifiers_per_line.append(identifiers)
return identifiers_per_line
def get_ISM(answer_code: str, model_output_list: list, asnwer_name: str) -> list:
"""
Compute ISM and return the scores sorted in descending order
:return:
"""
score_list = []
for code in model_output_list:
if '```python' in code:
code = code.replace('```python', '')
code = code.replace('```', '')
if not re.search(rf'\b{re.escape(asnwer_name)}\b', code) or not is_code_valid(
code
):
score_list.append(0)
continue
# if asnwer_name not in code:
# score_list.append(0)
# continue
identifiers_ans, identifiers_output = get_token(answer_code, code)
max_len, elements = longest_common_prefix_between_lists_with_elements(
identifiers_ans, identifiers_output
)
if max_len != 0:
base_element_len = max(len(elements[0]), len(elements[1]))
temp_score = max_len / base_element_len
score_list.append(temp_score)
else:
score_list.append(0)
# base_element_len = max(len(elements[0]), len(elements[1]))
# temp_score = max_len/base_element_len
# score_list.append(temp_score)
score_list = sorted(score_list, reverse=True)
return score_list
def get_ISM_without_verification(
answer_code: str, model_output_list: list, asnwer_name: str
) -> list:
"""
Compute ISM and return the scores sorted in descending order
:return:
"""
score_list = []
for code in model_output_list:
if asnwer_name not in code:
score_list.append(0)
continue
# if asnwer_name not in code:
# score_list.append(0)
# continue
identifiers_ans, identifiers_output = get_token(answer_code, code)
max_len, elements = longest_common_prefix_between_lists_with_elements(
identifiers_ans, identifiers_output
)
if max_len != 0:
base_element_len = max(len(elements[0]), len(elements[1]))
temp_score = max_len / base_element_len
score_list.append(temp_score)
else:
score_list.append(0)
# base_element_len = max(len(elements[0]), len(elements[1]))
# temp_score = max_len/base_element_len
# score_list.append(temp_score)
score_list = sorted(score_list, reverse=True)
return score_list
def longest_common_prefix_with_lengths(list1, list2):
"""
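Compute the longest common-prefix length between the sublists of two 2D lists, and record the lengths of the two sublists that achieve it
:param list1: the first 2D list
:param list2: the second 2D list
:return: the longest common-prefix length and the lengths of the two sublists that achieve it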
"""
max_length = 0
len_list1 = 0
len_list2 = 0
for i, sublist1 in enumerate(list1):
for j, sublist2 in enumerate(list2):
match_length = 0
min_length = min(len(sublist1), len(sublist2))
for k in range(min_length):
if sublist1[k] == sublist2[k]:
match_length += 1
else:
break
if match_length > max_length:
max_length = match_length
len_list1 = len(sublist1)
len_list2 = len(sublist2)
return max_length, len_list1, len_list2
def get_PM(answer_code: str, model_output_list: list, asnwer_name: str) -> list:
"""
Compute PM and return the scores sorted in descending order
:return:
"""
score_list = []
for code in model_output_list:
if '```python' in code:
code = code.replace('```python', '')
code = code.replace('```', '')
if not re.search(rf'\b{re.escape(asnwer_name)}\b', code) or not is_code_valid(
code
):
# if asnwer_name not in code or is_code_valid(code) == False:
score_list.append(0)
continue
# if asnwer_name not in code:
# score_list.append(0)
# continue
ans_list = get_token_per_line(answer_code)
output_token_list = get_token_per_line(code)
max_len, len1, len2 = longest_common_prefix_with_lengths(
ans_list, output_token_list
)
base_element_len = max(len1, len2)
if base_element_len != 0:
temp_score = max_len / base_element_len
score_list.append(temp_score)
else:
score_list.append(0)
score_list = sorted(score_list, reverse=True)
return score_list
def get_score(score_list: list, k):
"""
Compute score@k from the n sorted scores
:param score_list:
:param k:
:return:
"""
n = len(score_list)
sum = 0
final = n - k + 1
for i in range(1, final + 1):
sum += math.comb(n - i, k - 1) * score_list[i - 1]
final_score = sum / math.comb(n, k)
return final_score
k = 1
task = 'block' # block or line
json_name = f'Versicode_{task}_completion.json'
folder_path = f'../data/result_data/{task}_completion'
model_list = os.listdir(folder_path)
for model in model_list:
model_json_path = os.path.join(folder_path, model, json_name)
with open(model_json_path, 'r', encoding='utf-8') as fr:
lodict = json.load(fr)
data_dict = lodict
data_list = data_dict
data_len = len(data_list)
sum_ISM = 0
sum_PM = 0
for data in data_list:
# model_output_list = eval(data['model_output'])
model_output_list = eval(data['model_output_clear'])[:1]
temp_list = []
for o in model_output_list:
temp_out = o.replace('```python', '')
temp_out = temp_out.replace('```', '')
temp_list.append(temp_out)
model_output_list = temp_list
answer_code = data['code']
answer_name = data['core_token']
#
# answer_code = data['new_code'] #code editing
# answer_name = data['new_name'] #code editing
# answer_code = data['old_code'] # code editing new to old
# answer_name = data['old_name'] # code editing new to old
#
ISM_score_list = get_ISM(answer_code, model_output_list, answer_name)
# ISM_score_without_verification_list = get_ISM_without_verification(answer_code, model_output_list, answer_name) # newly added
PM_score_list = get_PM(answer_code, model_output_list, answer_name)
# if not ISM_score_without_verification_list == ISM_score_list:# newly added
# for s in ISM_score_list:# newly added
# if s != ISM_score_without_verification_list[ISM_score_list.index(s)]:# newly added
# print('Metadata:')# newly added
# print(data)# newly added
# print('Answer:')# newly added
# print(model_output_list[ISM_score_list.index(s)])# newly added
# flag = int(input('Enter 1 to continue, 0 to exit'))# newly added
# if flag == 1:
# continue
ISM_score = get_score(ISM_score_list, k)
PM_score = get_score(PM_score_list, k)
sum_ISM += ISM_score
sum_PM += PM_score
# print(f"ISM score: {ISM_score}")
# print(f"PM score: {PM_score}")
print(f'{model}, {task} completion task, ISM@{k} score: {sum_ISM / data_len}')
print(f'{model}, {task} completion task, PM@{k} score: {sum_PM / data_len}')
# def get_token(ans_code:str, output_code:str):
# """
# Tokenize both code strings and return their two lists of identifiers
# :param ans_code:
# :param output_code:
# :return:
# """
# tokens_ans = tokenize.tokenize(io.BytesIO(ans_code.encode('utf-8')).readline)
# tokens_output = tokenize.tokenize(io.BytesIO(output_code.encode('utf-8')).readline)
# identifiers_ans = []
# identifiers_output = []
# for token in tokens_ans:
# if token.type == tokenize.NAME:
# identifiers_ans.append(token.string)
#
# for to in tokens_output:
# if to.type == tokenize.NAME:
# identifiers_output.append(to.string)
#
# return identifiers_ans, identifiers_output
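Reviewer note on `get_score`: the formula appears to compute the expected best score among `k` outputs drawn without replacement from the `n` scored outputs, given that the list is sorted in descending order. The following is a minimal sketch under that reading, checking the closed form against brute-force enumeration. The names `expected_best_of_k` and `brute_force_best_of_k` are illustrative and not part of this PR.

```python
import itertools
import math


def expected_best_of_k(scores, k):
    """Closed form over scores sorted in descending order.

    The i-th ranked score is the maximum of a k-subset exactly when the
    other k-1 members are chosen from the n-i lower-ranked items.
    """
    ordered = sorted(scores, reverse=True)
    n = len(ordered)
    total = sum(
        math.comb(n - i, k - 1) * ordered[i - 1] for i in range(1, n - k + 2)
    )
    return total / math.comb(n, k)


def brute_force_best_of_k(scores, k):
    """Average of the best score over every k-subset (for cross-checking only)."""
    subsets = list(itertools.combinations(scores, k))
    return sum(max(subset) for subset in subsets) / len(subsets)


scores = [1.0, 0.5, 0.0]
for k in (1, 2, 3):
    print(k, expected_best_of_k(scores, k), brute_force_best_of_k(scores, k))
```

With `k = 1` the value reduces to the mean of the scores, and with `k = n` it reduces to the best score, which is consistent with the script slicing the outputs to one sample when it reports ISM@1 and PM@1.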
@@ -0,0 +1,198 @@
"""
Calculate the cdc score for migration
"""
import json
import math
import os
import re
# warnings.filterwarnings("ignore", category=SyntaxWarning)
def is_correct_parameter_count(function_name, correct_code, test_code):
"""
Check whether the number of parameters matches
:param function_name:
:param correct_code:
:param test_code:
:return:
"""
# Get the number of parameters in the correct code
# return True
pattern = rf'{function_name}\((.*?)\)'
correct_match = re.search(pattern, correct_code)
if correct_match:
correct_params = correct_match.group(1).strip()
correct_param_list = [p.strip() for p in correct_params.split(',') if p.strip()]
expected_count = len(correct_param_list)
else:
expected_count = 0 # If there are no parameters, the expected count is 0
# Find the function call in the code under test
test_match = re.search(pattern, test_code)
if test_match:
test_params = test_match.group(1).strip()
test_param_list = [p.strip() for p in test_params.split(',') if p.strip()]
return len(test_param_list) == expected_count # Check the parameter count
else:
# If there are no parentheses, check whether the function name appears in the string
return expected_count == 0 and function_name in test_code
def check_keyword_parameters(function_name, correct_code, test_code):
"""
Check whether keyword arguments are used correctly
:param function_name:
:param correct_code:
:param test_code:
:return:
"""
# Use a regular expression to match the function call in the correct code
# return True
pattern = rf'{function_name}\((.*?)\)'
correct_match = re.search(pattern, correct_code)
if correct_match:
correct_params = correct_match.group(1).strip()
correct_param_list = [p.strip() for p in correct_params.split(',') if p.strip()]
# Check the function call in the code under test
test_match = re.search(pattern, test_code)
if test_match:
test_params = test_match.group(1).strip()
test_param_list = [p.strip() for p in test_params.split(',') if p.strip()]
# Ensure each parameter in the code under test is passed as a keyword argument
for correct_param in correct_param_list:
if '=' in correct_param: # Only when the correct code uses a keyword argument
param_name = correct_param.split('=')[0].strip()
if not any(
param_name in test_param and '=' in test_param
for test_param in test_param_list
):
return False # Return False if the corresponding parameter is not a keyword argument
return True # All keyword arguments match
return False # Return False if there is no match
def with_correct(answer_code: str, model_output: str) -> bool:
"""
When the answer is a with statement, check whether the model output is also a with statement
:param answer_code:
:param model_output:
:return:
"""
# return True
if not answer_code.startswith('with') and not model_output.startswith('with'):
return True
elif answer_code.startswith('with') and model_output.startswith('with'):
return True
else:
return False
def compute_block_score_k(
answer: str,
model_output: list,
k: int,
model_filled_code,
core_line_in_core_block,
core_line_in_output_clear,
):
"""
CDC requires all five conditions to be met; EM only requires the first
"""
c = 0
n = len(model_output)
for index, code in enumerate(model_output):
if (
re.search(rf'\b{re.escape(answer)}\b', code)
and is_code_valid(model_filled_code[index])
and is_correct_parameter_count(
answer, core_line_in_core_block, core_line_in_output_clear[index]
)
and with_correct(core_line_in_core_block, core_line_in_output_clear[index])
and check_keyword_parameters(
answer, core_line_in_core_block, core_line_in_output_clear[index]
)
): # block
# if re.search(rf'\b{re.escape(answer)}\b', code):#block
c += 1
if n - c < k:
return 1.0
score = 1 - (math.comb(n - c, k)) / (math.comb(n, k))
return score
def is_code_valid(code):
try:
compile(code, '<string>', 'exec')
return True
except Exception:
return False
def compute_score_k(answer: str, model_output: list, k: int):
c = 0
n = len(model_output)
for output in model_output:
if '```python' in output:
output = output.replace('```python', '')
output = output.replace('```', '')
# if answer == output:
if re.search(rf'\b{re.escape(answer)}\b', output) and is_code_valid(output):
c += 1
if n - c < k:
return 1.0
score = 1 - (math.comb(n - c, k)) / (math.comb(n, k))
return score
k = 1 # cdc@k
json_name = 'VersiCode_migration.json'
task = 'migration'
folder_path = '../data/result_data/code_migration'
model_list = os.listdir(folder_path)
for model in model_list:
# if model != 'gpt-4o':
# continue
model_json_path = os.path.join(folder_path, model, json_name)
with open(model_json_path, 'r', encoding='utf-8') as fr:
lodict = json.load(fr)
data_list = lodict
score_list = []
for data in data_list:
answer = data['new_name'] # old -> new
model_output = data['model_output_clear'] # old -> new
model_filled_code = model_output
# core_line_in_core_block = data['core_line_in_new_core_block']# old -> new
core_line_in_core_block = data['core_line_in_code'] # old -> new
core_line_in_output_clear = data['core_line_in_output_clear'] # old -> new
score_list.append(
compute_block_score_k(
answer,
model_output,
k,
model_filled_code,
core_line_in_core_block,
core_line_in_output_clear,
)
)
final_score = sum(score_list) / len(score_list)
print(f'{model}, {task} task, cdc@{k} score: {final_score}')
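Reviewer note on the `score = 1 - comb(n - c, k) / comb(n, k)` expression in `compute_block_score_k` and `compute_score_k`: it has the same form as the familiar pass@k estimator, that is, the probability that at least one of `k` outputs drawn without replacement from `n` is among the `c` outputs that satisfy every check. The sketch below works through one example under that assumption; `at_least_one_of_k` is a hypothetical helper name, not something defined in this PR.

```python
import math


def at_least_one_of_k(n: int, c: int, k: int) -> float:
    """Probability that a random k-subset of n outputs contains at least one of the c passing ones."""
    if n - c < k:
        # Fewer than k failing outputs exist, so every k-subset includes a pass.
        return 1.0
    return 1 - math.comb(n - c, k) / math.comb(n, k)


# Worked example: 5 sampled outputs, 2 of which satisfy all checks.
for k in (1, 3, 5):
    print(f'cdc@{k} for n=5, c=2:', at_least_one_of_k(5, 2, k))
```

For `k = 1` the estimator is simply `c / n`, so the migration script's `k = 1` setting reports the fraction of outputs that pass.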
@@ -0,0 +1,225 @@
"""
Calculate the cdc score for line and block
"""
import json
import math
import os
import re
# warnings.filterwarnings("ignore", category=SyntaxWarning)
def is_code_valid(code):
try:
compile(code, '<string>', 'exec')
return True
except Exception:
return False
def is_correct_parameter_count(function_name, correct_code, test_code):
"""
Check whether the number of parameters matches
:param function_name:
:param correct_code:
:param test_code:
:return:
"""
# Get the number of parameters in the correct code
# return True
pattern = rf'{function_name}\((.*?)\)'
correct_match = re.search(pattern, correct_code)
if correct_match:
correct_params = correct_match.group(1).strip()
correct_param_list = [p.strip() for p in correct_params.split(',') if p.strip()]
expected_count = len(correct_param_list)
else:
expected_count = 0 # If there are no parameters, the expected count is 0
# Find the function call in the code under test
test_match = re.search(pattern, test_code)
if test_match:
test_params = test_match.group(1).strip()
test_param_list = [p.strip() for p in test_params.split(',') if p.strip()]
return len(test_param_list) == expected_count # Check the parameter count
else:
# If there are no parentheses, check whether the function name appears in the string
return expected_count == 0 and function_name in test_code
def check_keyword_parameters(function_name, correct_code, test_code):
"""
Check whether keyword arguments are used correctly
:param function_name:
:param correct_code:
:param test_code:
:return:
"""
# Use a regular expression to match the function call in the correct code
# return True
pattern = rf'{function_name}\((.*?)\)'
correct_match = re.search(pattern, correct_code)
if correct_match:
correct_params = correct_match.group(1).strip()
correct_param_list = [p.strip() for p in correct_params.split(',') if p.strip()]
# Check the function call in the code under test
test_match = re.search(pattern, test_code)
if test_match:
test_params = test_match.group(1).strip()
test_param_list = [p.strip() for p in test_params.split(',') if p.strip()]
# Ensure each parameter in the code under test is passed as a keyword argument
for correct_param in correct_param_list:
if '=' in correct_param: # Only when the correct code uses a keyword argument
param_name = correct_param.split('=')[0].strip()
if not any(
param_name in test_param and '=' in test_param
for test_param in test_param_list
):
return False # Return False if the corresponding parameter is not a keyword argument
return True # All keyword arguments match
return False # Return False if there is no match
def with_correct(answer_code: str, model_output: str) -> bool:
"""
When the answer is a with statement, check whether the model output is also a with statement
:param answer_code:
:param model_output:
:return:
"""
# return True
if not answer_code.startswith('with') and not model_output.startswith('with'):
return True
elif answer_code.startswith('with') and model_output.startswith('with'):
return True
else:
return False
def compute_line_score_k(
answer: str, model_output: list, k: int, model_filled_code, core_line
):
c = 0
n = len(model_output)
for index, code in enumerate(model_output):
if (
re.search(rf'\b{re.escape(answer)}\b', code)
and is_code_valid(model_filled_code[index])
and is_correct_parameter_count(answer, core_line, code)
and with_correct(core_line, code)
and check_keyword_parameters(answer, core_line, code)
): # line
c += 1
if n - c < k:
return 1.0
score = 1 - (math.comb(n - c, k)) / (math.comb(n, k))
return score
def compute_block_score_k(
answer: str,
model_output: list,
k: int,
model_filled_code,
core_line_in_core_block,
core_line_in_output_clear,
):
c = 0
n = len(model_output)
for index, code in enumerate(model_output):
if (
re.search(rf'\b{re.escape(answer)}\b', code)
and is_code_valid(model_filled_code[index])
and is_correct_parameter_count(
answer, core_line_in_core_block, core_line_in_output_clear[index]
)
and with_correct(core_line_in_core_block, core_line_in_output_clear[index])
and check_keyword_parameters(
answer, core_line_in_core_block, core_line_in_output_clear[index]
)
): # block
c += 1
if n - c < k:
return 1.0
score = 1 - (math.comb(n - c, k)) / (math.comb(n, k))
return score
def compute_score_k(answer: str, model_output: list, k: int):
c = 0
n = len(model_output)
for index, code in enumerate(model_output):
if re.search(rf'\b{re.escape(answer)}\b', code) and is_code_valid(
code
): # block
# if re.search(rf'\b{re.escape(answer)}\b', code):#line
c += 1
if n - c < k:
return 1.0
score = 1 - (math.comb(n - c, k)) / (math.comb(n, k))
return score
k = 3 # cdc@k
task = 'block' # line or block
json_name = f'Versicode_{task}_completion.json'
folder_path = f'../data/result_data/{task}_completion'
model_list = os.listdir(folder_path)
for model in model_list:
model_json_path = os.path.join(folder_path, model, json_name)
with open(model_json_path, 'r', encoding='utf-8') as fr:
lodict = json.load(fr)
data_list = lodict
if task == 'line':
score_list = []
for data in data_list:
answer = data['core_token']
model_output = eval(data['model_output_clear'])
model_filled_code = [
data['masked_code'].replace('<mask>', i) for i in model_output
]
core_line = data['core_line']
score_list.append(
compute_line_score_k(
answer, model_output, k, model_filled_code, core_line
)
)
else:
score_list = []
for data in data_list:
answer = data['core_token']
model_output = eval(data['model_output_clear'])
model_filled_code = eval(data['model_output_clear'])
core_line = data['core_line']
core_line_in_output_clear = data['core_line_in_output_clear']
score_list.append(
compute_block_score_k(
answer,
model_output,
k,
model_filled_code,
core_line,
core_line_in_output_clear,
)
)
final_score = sum(score_list) / len(score_list)
print(f'{model}, {task} completion task, cdc@{k} score: {final_score}')
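Reviewer note on `is_correct_parameter_count`: the pattern `rf'{function_name}\((.*?)\)'` is non-greedy, so it stops at the first `)` and the captured text is then split on every comma. Nested calls or tuple arguments can therefore be miscounted. Purely as an illustration (this is not what the PR does), the sketch below contrasts that counting with an `ast`-based count. `regex_param_count` and `ast_param_count` are hypothetical names, and the `ast` variant assumes the snippet parses on its own, which a bare `with ...:` header line does not.

```python
import ast
import re
from typing import Optional


def regex_param_count(function_name: str, code: str) -> int:
    """Mimic the regex-based counting: capture up to the first ')' and split on ','.

    re.escape is added here for safety; the code in the PR interpolates the name directly.
    """
    match = re.search(rf'{re.escape(function_name)}\((.*?)\)', code)
    if not match:
        return 0
    return len([p for p in match.group(1).split(',') if p.strip()])


def ast_param_count(function_name: str, code: str) -> Optional[int]:
    """Count positional plus keyword arguments of the first matching call found by ast.walk.

    Returns None when the code does not parse or no matching call exists.
    """
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return None
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # Plain names expose .id; attribute access such as np.mean exposes .attr.
            name = func.id if isinstance(func, ast.Name) else getattr(func, 'attr', None)
            if name == function_name:
                return len(node.args) + len(node.keywords)
    return None


line = 'result = foo(bar(x), y, key=(1, 2))'
print('regex count:', regex_param_count('foo', line))
print('ast count:', ast_param_count('foo', line))
```

Because both the reference line and the model line go through the same regex, the two miscounts often cancel out; the difference matters mainly when the model nests calls differently from the reference.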
@@ -0,0 +1,209 @@
"""
Calculate the em score for line and block
"""
import json
import math
import os
import re
# warnings.filterwarnings("ignore", category=SyntaxWarning)
def is_code_valid(code):
try:
compile(code, '<string>', 'exec')
return True
except Exception:
return False
def is_correct_parameter_count(function_name, correct_code, test_code):
"""
Check whether the number of parameters matches
:param function_name:
:param correct_code:
:param test_code:
:return:
"""
# Get the number of parameters in the correct code
# return True
pattern = rf'{function_name}\((.*?)\)'
correct_match = re.search(pattern, correct_code)
if correct_match:
correct_params = correct_match.group(1).strip()
correct_param_list = [p.strip() for p in correct_params.split(',') if p.strip()]
expected_count = len(correct_param_list)
else:
expected_count = 0 # If there are no parameters, the expected count is 0
# Find the function call in the code under test
test_match = re.search(pattern, test_code)
if test_match:
test_params = test_match.group(1).strip()
test_param_list = [p.strip() for p in test_params.split(',') if p.strip()]
return len(test_param_list) == expected_count # Check the parameter count
else:
# If there are no parentheses, check whether the function name appears in the string
return expected_count == 0 and function_name in test_code
def check_keyword_parameters(function_name, correct_code, test_code):
"""
Check whether keyword arguments are used correctly
:param function_name:
:param correct_code:
:param test_code:
:return:
"""
# Use a regular expression to match the function call in the correct code
# return True
pattern = rf'{function_name}\((.*?)\)'
correct_match = re.search(pattern, correct_code)
if correct_match:
correct_params = correct_match.group(1).strip()
correct_param_list = [p.strip() for p in correct_params.split(',') if p.strip()]
# Check the function call in the code under test
test_match = re.search(pattern, test_code)
if test_match:
test_params = test_match.group(1).strip()
test_param_list = [p.strip() for p in test_params.split(',') if p.strip()]
# Ensure each parameter in the code under test is passed as a keyword argument
for correct_param in correct_param_list:
if '=' in correct_param: # Only when the correct code uses a keyword argument
param_name = correct_param.split('=')[0].strip()
if not any(
param_name in test_param and '=' in test_param
for test_param in test_param_list
):
return False # Return False if the corresponding parameter is not a keyword argument
return True # All keyword arguments match
return False # Return False if there is no match
def with_correct(answer_code: str, model_output: str) -> bool:
"""
When the answer is a with statement, check whether the model output is also a with statement
:param answer_code:
:param model_output:
:return:
"""
# return True
if not answer_code.startswith('with') and not model_output.startswith('with'):
return True
elif answer_code.startswith('with') and model_output.startswith('with'):
return True
else:
return False
def compute_line_score_k(
answer: str, model_output: list, k: int, model_filled_code, core_line
):
c = 0
n = len(model_output)
for index, code in enumerate(model_output):
if re.search(rf'\b{re.escape(answer)}\b', code): # line
c += 1
if n - c < k:
return 1.0
score = 1 - (math.comb(n - c, k)) / (math.comb(n, k))
return score
def compute_block_score_k(
answer: str,
model_output: list,
k: int,
model_filled_code,
core_line_in_core_block,
core_line_in_output_clear,
):
c = 0
n = len(model_output)
for index, code in enumerate(model_output):
if re.search(rf'\b{re.escape(answer)}\b', code): # block
c += 1
if n - c < k:
return 1.0
score = 1 - (math.comb(n - c, k)) / (math.comb(n, k))
return score
def compute_score_k(answer: str, model_output: list, k: int):
c = 0
n = len(model_output)
for index, code in enumerate(model_output):
if re.search(rf'\b{re.escape(answer)}\b', code) and is_code_valid(
code
): # block
# if re.search(rf'\b{re.escape(answer)}\b', code):#line
c += 1
if n - c < k:
return 1.0
score = 1 - (math.comb(n - c, k)) / (math.comb(n, k))
return score
k = 3 # em@k
task = 'block' # line or block
json_name = f'Versicode_{task}_completion.json'
folder_path = f'../data/result_data/{task}_completion'
model_list = os.listdir(folder_path)
for model in model_list:
model_json_path = os.path.join(folder_path, model, json_name)
with open(model_json_path, 'r', encoding='utf-8') as fr:
lodict = json.load(fr)
data_list = lodict
if task == 'line':
score_list = []
for data in data_list:
answer = data['core_token']
model_output = eval(data['model_output_clear'])
model_filled_code = [
data['masked_code'].replace('<mask>', i) for i in model_output
]
core_line = data['core_line']
score_list.append(
compute_line_score_k(
answer, model_output, k, model_filled_code, core_line
)
)
else:
score_list = []
for data in data_list:
answer = data['core_token']
model_output = eval(data['model_output_clear'])
model_filled_code = eval(data['model_output_clear'])
core_line = data['core_line']
core_line_in_output_clear = data['core_line_in_output_clear']
score_list.append(
compute_block_score_k(
answer,
model_output,
k,
model_filled_code,
core_line,
core_line_in_output_clear,
)
)
final_score = sum(score_list) / len(score_list)
print(f'{model}, {task} completion task, em@{k} score: {final_score}')
@@ -0,0 +1,99 @@
"""
Locate the core line (the line that uses the core token) in the reference code and in each block generated by the model
"""
import json
import os
import random
import re
def process_line_mask(code_snippet, core_token):
if not core_token:
return None, None
replaced_lines = {}
lines = code_snippet.split('\n')
in_multi_line_comment = False
for i, line in enumerate(lines):
if in_multi_line_comment:
if ('"""' in line or "'''" in line) and not re.findall(
r"'''(.*?)'''|\"\"\"(.*?)\"\"\"", line
):
in_multi_line_comment = False
continue
elif line.strip().startswith('#'):
continue
elif re.findall(r"'''(.*?)'''|\"\"\"(.*?)\"\"\"", line):
continue
elif ('"""' in line or "'''" in line) and not re.findall(
r"'''(.*?)'''|\"\"\"(.*?)\"\"\"", line
):
in_multi_line_comment = True
continue
else:
if re.search(r'\bdef\s+task_function\b', line):
continue
if re.search(r'\b{}\b(?!\s*=)'.format(re.escape(core_token)), line):
replaced_lines.update({i: line})
if replaced_lines:
random_line_location = random.choice(list(replaced_lines.keys()))
masked_line = lines[random_line_location]
leading_spaces = re.match(r'^\s*', masked_line).group(0)
masked_line = masked_line.strip()
lines[random_line_location] = leading_spaces + '<line_mask>'
masked_code = '\n'.join(lines)
return masked_code, masked_line
return None, None
def load_json(file_path):
with open(file_path, 'r', encoding='utf-8') as f:
data = json.load(f)
return data
def save_json(file_path, data):
with open(file_path, 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=4)
if __name__ == '__main__':
model_list = os.listdir('../data/result_data/block_completion')
for model in model_list:
input_json_file = f'../data/result_data/block_completion/{model}/VersiCode_block_completion.json'
output_json_file = input_json_file
data = load_json(input_json_file)
for item in data:
core_token = item['core_token']
code = item['code']
_, core_line_in_code = process_line_mask(code, core_token)
if core_line_in_code:
item['core_line_in_code'] = core_line_in_code
else:
item['core_line_in_code'] = 'N/A'
model_output_clear = item['model_output_clear']
core_line_in_output_list = []
for entry in eval(model_output_clear):
_, core_line_in_output = process_line_mask(entry, core_token)
if core_line_in_output:
core_line_in_output_list.append(core_line_in_output)
else:
core_line_in_output_list.append('N/A')
item['core_line_in_output_clear'] = core_line_in_output_list
save_json(output_json_file, data)
print('Done!')
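Reviewer note on `process_line_mask`: candidate lines are selected with the pattern `\b{core_token}\b(?!\s*=)` and one of them is then picked at random. The sketch below isolates just that pattern to show which lines it accepts. `lines_using_token` is an illustrative helper, and it deliberately omits the comment and docstring skipping that the function above also performs.

```python
import re


def lines_using_token(code: str, core_token: str) -> list:
    """Return stripped lines where core_token appears as a whole word not followed by '='."""
    pattern = re.compile(r'\b{}\b(?!\s*=)'.format(re.escape(core_token)))
    return [line.strip() for line in code.split('\n') if pattern.search(line)]


snippet = '\n'.join(
    [
        'load = None',  # assignment target: rejected by the lookahead
        'data = load(path)',  # ordinary use: accepted
        'if load == other:',  # comparison: also rejected, since '==' starts with '='
        'run(load=fast)',  # keyword argument: also rejected
    ]
)
print(lines_using_token(snippet, 'load'))
```

The lookahead therefore excludes comparisons and keyword-argument uses as well as assignments. Since the final choice goes through `random.choice`, repeated runs may store a different `core_line_in_code` when several lines qualify; calling `random.seed` with a fixed value beforehand would be one way to make the selection reproducible, if that matters for the reported scores.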
@@ -0,0 +1,102 @@
"""
Locate the core line (the line that uses the core token) in the old code and in each block generated by the model
"""
import json
import os
import random
import re
def process_line_mask(code_snippet, core_token):
if not core_token:
return None, None
replaced_lines = {}
lines = code_snippet.split('\n')
in_multi_line_comment = False
for i, line in enumerate(lines):
if in_multi_line_comment:
if ('"""' in line or "'''" in line) and not re.findall(
r"'''(.*?)'''|\"\"\"(.*?)\"\"\"", line
):
in_multi_line_comment = False
continue
elif line.strip().startswith('#'):
continue
elif re.findall(r"'''(.*?)'''|\"\"\"(.*?)\"\"\"", line):
continue
elif ('"""' in line or "'''" in line) and not re.findall(
r"'''(.*?)'''|\"\"\"(.*?)\"\"\"", line
):
in_multi_line_comment = True
continue
else:
if re.search(r'\bdef\s+task_function\b', line):
continue
if re.search(r'\b{}\b(?!\s*=)'.format(re.escape(core_token)), line):
replaced_lines.update({i: line})
if replaced_lines:
random_line_location = random.choice(list(replaced_lines.keys()))
masked_line = lines[random_line_location]
leading_spaces = re.match(r'^\s*', masked_line).group(0)
masked_line = masked_line.strip()
lines[random_line_location] = leading_spaces + '<line_mask>'
masked_code = '\n'.join(lines)
return masked_code, masked_line
return None, None
def load_json(file_path):
with open(file_path, 'r', encoding='utf-8') as f:
data = json.load(f)
return data
def save_json(file_path, data):
with open(file_path, 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=4)
if __name__ == '__main__':
model_list = os.listdir('../data/result_data/code_migration')
for model in model_list:
input_json_file = (
f'../data/result_data/code_migration/{model}/VersiCode_migration.json'
)
output_json_file = input_json_file
data = load_json(input_json_file)
for item in data:
core_token = item['old_name']
code = item['old_code']
_, core_line_in_code = process_line_mask(code, core_token)
if core_line_in_code:
item['core_line_in_code'] = core_line_in_code
else:
item['core_line_in_code'] = 'N/A'
model_output_clear = item['model_output_clear']
core_line_in_output_list = []
core_token = item['new_name']
for entry in eval(model_output_clear):
_, core_line_in_output = process_line_mask(entry, core_token)
if core_line_in_output:
core_line_in_output_list.append(core_line_in_output)
else:
core_line_in_output_list.append('N/A')
item['core_line_in_output_clear'] = core_line_in_output_list
save_json(output_json_file, data)
print('Done!')
@@ -0,0 +1,38 @@
"""
Remove the <start> and <end> tags generated by the model during inference
"""
import json
model_name = ''
task = 'block_completion'
result_path = f'../data/result_data/{task}/{model_name}/VersiCode_block_completion.json' # Modify the file according to the task format
with open(result_path, 'r', encoding='utf-8') as fr:
lodict = json.load(fr)
data_dict = lodict
data_list = data_dict
for data in data_list:
temp_list = []
model_output_list = eval(data['model_output'])
for output in model_output_list:
if '<start>' in output and '<end>' in output:
start_index = output.find('<start>') + len('<start>')
end_index = output.find('<end>')
content = (
output[start_index:end_index]
.replace('```python', '')
.replace('```', '')
)
else:
content = 'no_answer'
temp_list.append(content)
data['model_output_clear'] = str(temp_list)
with open(result_path, 'w', encoding='utf-8') as fw:
json.dump(data_dict, fw, indent=4, ensure_ascii=False)
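Reviewer note on `model_output_clear`: the cleanup script stores the list as `str(list)`, and the scoring scripts read it back with `eval`. The sketch below shows the same round trip using `ast.literal_eval`, which only accepts Python literals. `extract_answer` is a hypothetical helper that mirrors the tag-stripping logic above, and swapping `eval` for `ast.literal_eval` is a suggestion rather than something this PR does.

```python
import ast

FENCE = '`' * 3  # Markdown code fence, built this way to keep the example readable


def extract_answer(output: str) -> str:
    """Return the text between <start> and <end> with Markdown fences removed, else 'no_answer'."""
    if '<start>' in output and '<end>' in output:
        start_index = output.find('<start>') + len('<start>')
        end_index = output.find('<end>')
        return (
            output[start_index:end_index]
            .replace(FENCE + 'python', '')
            .replace(FENCE, '')
        )
    return 'no_answer'


raw_outputs = [
    'Sure! <start>x = np.mean(a)<end>',
    'I could not refactor this.',
]
stored = str([extract_answer(o) for o in raw_outputs])  # the string form written to JSON
restored = ast.literal_eval(stored)  # parses literals only; no code is executed
print(restored)
```

Storing the list directly in the JSON file would avoid the string round trip altogether, at the cost of changing the result-file format that the other scripts expect.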
@@ -0,0 +1,146 @@
aiohappyeyeballs==2.6.1
aiohttp==3.11.18
aiosignal==1.3.2
airportsdata==20250224
annotated-types==0.7.0
anyio==4.9.0
astor==0.8.1
attrs==25.3.0
blake3==1.0.4
cachetools==5.5.2
certifi==2025.1.31
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
compressed-tensors==0.9.3
cupy-cuda12x==13.4.1
Deprecated==1.2.18
depyf==0.18.0
dill==0.4.0
diskcache==5.6.3
distro==1.9.0
dnspython==2.7.0
einops==0.8.1
email_validator==2.2.0
fastapi==0.115.12
fastapi-cli==0.0.7
fastrlock==0.8.3
filelock==3.18.0
frozenlist==1.6.0
fsspec==2025.3.2
gguf==0.16.2
googleapis-common-protos==1.70.0
grpcio==1.71.0
h11==0.14.0
hf-xet==1.0.3
httpcore==1.0.8
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.30.2
idna==3.10
importlib_metadata==8.0.0
interegular==0.3.3
Jinja2==3.1.6
jiter==0.9.0
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
lark==1.2.2
llguidance==0.7.16
llvmlite==0.44.0
lm-format-enforcer==0.10.11
markdown-it-py==3.0.0
MarkupSafe==3.0.2
mdurl==0.1.2
mistral_common==1.5.4
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.19.0
multidict==6.4.3
nest-asyncio==1.6.0
networkx==3.4.2
ninja==1.11.1.4
numba==0.61.2
numpy==2.2.5
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
openai==1.75.0
opencv-python-headless==4.11.0.86
opentelemetry-api==1.26.0
opentelemetry-exporter-otlp==1.26.0
opentelemetry-exporter-otlp-proto-common==1.26.0
opentelemetry-exporter-otlp-proto-grpc==1.26.0
opentelemetry-exporter-otlp-proto-http==1.26.0
opentelemetry-proto==1.26.0
opentelemetry-sdk==1.26.0
opentelemetry-semantic-conventions==0.47b0
opentelemetry-semantic-conventions-ai==0.4.3
outlines==0.1.11
outlines_core==0.1.26
packaging==25.0
partial-json-parser==0.2.1.1.post5
pillow==11.2.1
prometheus-fastapi-instrumentator==7.1.0
prometheus_client==0.21.1
propcache==0.3.1
protobuf==4.25.6
psutil==7.0.0
py-cpuinfo==9.0.0
pycountry==24.6.1
pydantic==2.11.3
pydantic_core==2.33.1
Pygments==2.19.1
python-dotenv==1.1.0
python-json-logger==3.3.0
python-multipart==0.0.20
PyYAML==6.0.2
pyzmq==26.4.0
ray==2.43.0
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rich==14.0.0
rich-toolkit==0.14.1
rpds-py==0.24.0
safetensors==0.5.3
scipy==1.15.2
sentencepiece==0.2.0
setuptools==75.8.0
shellingham==1.5.4
six==1.17.0
sniffio==1.3.1
starlette==0.46.2
sympy==1.13.1
tiktoken==0.9.0
tokenizers==0.21.1
torch==2.6.0
torchaudio==2.6.0
torchvision==0.21.0
tqdm==4.67.1
transformers==4.51.3
triton==3.2.0
typer==0.15.2
typing-inspection==0.4.0
typing_extensions==4.13.2
urllib3==2.4.0
uvicorn==0.34.2
uvloop==0.21.0
vllm==0.8.4
watchfiles==1.0.5
websockets==15.0.1
wheel==0.45.1
wrapt==1.17.2
xformers==0.0.29.post2
xgrammar==0.1.18
yarl==1.20.0
zipp==3.21.0
+1 -1
@@ -212,7 +212,7 @@ if __name__ == '__main__':
llm_config = None
if args.llm_config:
llm_config = get_llm_config_arg(args.llm_config)
# modify_params must be False for evaluation purpose, for reproducibility and accurancy of results
# modify_params must be False for evaluation purpose, for reproducibility and accuracy of results
llm_config.modify_params = False
if llm_config is None:
raise ValueError(f'Could not find LLM config: --llm_config {args.llm_config}')
+12 -1
@@ -263,8 +263,19 @@ def prepare_dataset(
f'Randomly sampling {eval_n_limit} unique instances with random seed 42.'
)
def make_serializable(instance: pd.Series) -> dict:
import numpy as np
instance_dict = instance.to_dict()
for k, v in instance_dict.items():
if isinstance(v, np.ndarray):
instance_dict[k] = v.tolist()
elif isinstance(v, pd.Timestamp):
instance_dict[k] = str(v)
return instance_dict
new_dataset = [
instance
make_serializable(instance)
for _, instance in dataset.iterrows()
if str(instance[id_column]) not in finished_ids
]
@@ -31,7 +31,7 @@ const renderRepoConnector = () => {
},
{
Component: () => <div data-testid="git-settings-screen" />,
path: "/settings/git",
path: "/settings/integrations",
},
],
},
@@ -16,8 +16,8 @@ vi.mock("react-i18next", async () => {
if (i18nKey === "SETTINGS$API_KEYS_DESCRIPTION") {
return (
<span>
API keys allow you to authenticate with the OpenHands API programmatically.
Keep your API keys secure; anyone with your API key can access your account.
API keys allow you to authenticate with the OpenHands API programmatically.
Keep your API keys secure; anyone with your API key can access your account.
For more information on how to use the API, see our {components.a}
</span>
);
@@ -48,7 +48,7 @@ describe("ApiKeysManager", () => {
it("should render the API documentation link", () => {
renderComponent();
// Find the link to the API documentation
const link = screen.getByRole("link");
expect(link).toBeInTheDocument();
@@ -56,4 +56,4 @@ describe("ApiKeysManager", () => {
expect(link).toHaveAttribute("target", "_blank");
expect(link).toHaveAttribute("rel", "noopener noreferrer");
});
});
});
@@ -35,13 +35,13 @@ const queryClient = new QueryClient();
const GitSettingsRouterStub = createRoutesStub([
{
Component: GitSettingsScreen,
path: "/settings/github",
path: "/settings/integrations",
},
]);
const renderGitSettingsScreen = () => {
const { rerender, ...rest } = render(
<GitSettingsRouterStub initialEntries={["/settings/github"]} />,
<GitSettingsRouterStub initialEntries={["/settings/integrations"]} />,
{
wrapper: ({ children }) => (
<QueryClientProvider client={queryClient}>
@@ -54,7 +54,7 @@ const renderGitSettingsScreen = () => {
const rerenderGitSettingsScreen = () =>
rerender(
<QueryClientProvider client={queryClient}>
<GitSettingsRouterStub initialEntries={["/settings/github"]} />
<GitSettingsRouterStub initialEntries={["/settings/integrations"]} />
</QueryClientProvider>,
);
@@ -31,7 +31,7 @@ const RouterStub = createRoutesStub([
},
{
Component: () => <div data-testid="git-settings-screen" />,
path: "/settings/git",
path: "/settings/integrations",
},
],
},
@@ -30,7 +30,7 @@ vi.mock("react-i18next", async () => {
useTranslation: () => ({
t: (key: string) => {
const translations: Record<string, string> = {
"SETTINGS$NAV_GIT": "Git",
"SETTINGS$NAV_INTEGRATIONS": "Integrations",
"SETTINGS$NAV_APPLICATION": "Application",
"SETTINGS$NAV_CREDITS": "Credits",
"SETTINGS$NAV_API_KEYS": "API Keys",
@@ -61,7 +61,7 @@ describe("Settings Billing", () => {
},
{
Component: () => <div data-testid="git-settings-screen" />,
path: "/settings/git",
path: "/settings/integrations",
},
{
Component: () => <div data-testid="user-settings-screen" />,
+4 -4
@@ -14,7 +14,7 @@ vi.mock("react-i18next", async () => {
useTranslation: () => ({
t: (key: string) => {
const translations: Record<string, string> = {
SETTINGS$NAV_GIT: "Git",
SETTINGS$NAV_INTEGRATIONS: "Integrations",
SETTINGS$NAV_APPLICATION: "Application",
SETTINGS$NAV_CREDITS: "Credits",
SETTINGS$NAV_API_KEYS: "API Keys",
@@ -49,7 +49,7 @@ describe("Settings Screen", () => {
},
{
Component: () => <div data-testid="git-settings-screen" />,
path: "/settings/git",
path: "/settings/integrations",
},
{
Component: () => <div data-testid="application-settings-screen" />,
@@ -79,7 +79,7 @@ describe("Settings Screen", () => {
};
it("should render the navbar", async () => {
const sectionsToInclude = ["llm", "git", "application", "secrets"];
const sectionsToInclude = ["llm", "integrations", "application", "secrets"];
const sectionsToExclude = ["api keys", "credits"];
const getConfigSpy = vi.spyOn(OpenHands, "getConfig");
// @ts-expect-error - only return app mode
@@ -111,7 +111,7 @@ describe("Settings Screen", () => {
APP_MODE: "saas",
});
const sectionsToInclude = [
"git",
"integrations",
"application",
"credits",
"secrets",
@@ -39,4 +39,4 @@ describe("Check for hardcoded English strings in Home components", () => {
expect(text).not.toContain(str);
});
});
});
});
+2 -2
@@ -1,12 +1,12 @@
{
"name": "openhands-frontend",
"version": "0.42.0",
"version": "0.44.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "openhands-frontend",
"version": "0.42.0",
"version": "0.44.0",
"dependencies": {
"@heroui/react": "^2.8.0-beta.7",
"@microlink/react-json-view": "^1.26.2",
+1 -1
@@ -1,6 +1,6 @@
{
"name": "openhands-frontend",
"version": "0.42.0",
"version": "0.44.0",
"private": true,
"type": "module",
"engines": {
+90 -53
@@ -5,24 +5,23 @@
* Mock Service Worker.
* @see https://github.com/mswjs/msw
* - Please do NOT modify this file.
* - Please do NOT serve this file on production.
*/
const PACKAGE_VERSION = '2.8.4'
const INTEGRITY_CHECKSUM = '00729d72e3b82faf54ca8b9621dbb96f'
const PACKAGE_VERSION = '2.10.2'
const INTEGRITY_CHECKSUM = 'f5825c521429caf22a4dd13b66e243af'
const IS_MOCKED_RESPONSE = Symbol('isMockedResponse')
const activeClientIds = new Set()
self.addEventListener('install', function () {
addEventListener('install', function () {
self.skipWaiting()
})
self.addEventListener('activate', function (event) {
addEventListener('activate', function (event) {
event.waitUntil(self.clients.claim())
})
self.addEventListener('message', async function (event) {
const clientId = event.source.id
addEventListener('message', async function (event) {
const clientId = Reflect.get(event.source || {}, 'id')
if (!clientId || !self.clients) {
return
@@ -94,17 +93,18 @@ self.addEventListener('message', async function (event) {
}
})
self.addEventListener('fetch', function (event) {
const { request } = event
addEventListener('fetch', function (event) {
// Bypass navigation requests.
if (request.mode === 'navigate') {
if (event.request.mode === 'navigate') {
return
}
// Opening the DevTools triggers the "only-if-cached" request
// that cannot be handled by the worker. Bypass such requests.
if (request.cache === 'only-if-cached' && request.mode !== 'same-origin') {
if (
event.request.cache === 'only-if-cached' &&
event.request.mode !== 'same-origin'
) {
return
}
@@ -115,48 +115,62 @@ self.addEventListener('fetch', function (event) {
return
}
// Generate unique request ID.
const requestId = crypto.randomUUID()
event.respondWith(handleRequest(event, requestId))
})
/**
* @param {FetchEvent} event
* @param {string} requestId
*/
async function handleRequest(event, requestId) {
const client = await resolveMainClient(event)
const requestCloneForEvents = event.request.clone()
const response = await getResponse(event, client, requestId)
// Send back the response clone for the "response:*" life-cycle events.
// Ensure MSW is active and ready to handle the message, otherwise
// this message will pend indefinitely.
if (client && activeClientIds.has(client.id)) {
;(async function () {
const responseClone = response.clone()
const serializedRequest = await serializeRequest(requestCloneForEvents)
sendToClient(
client,
{
type: 'RESPONSE',
payload: {
requestId,
isMockedResponse: IS_MOCKED_RESPONSE in response,
// Clone the response so both the client and the library could consume it.
const responseClone = response.clone()
sendToClient(
client,
{
type: 'RESPONSE',
payload: {
isMockedResponse: IS_MOCKED_RESPONSE in response,
request: {
id: requestId,
...serializedRequest,
},
response: {
type: responseClone.type,
status: responseClone.status,
statusText: responseClone.statusText,
body: responseClone.body,
headers: Object.fromEntries(responseClone.headers.entries()),
body: responseClone.body,
},
},
[responseClone.body],
)
})()
},
responseClone.body ? [serializedRequest.body, responseClone.body] : [],
)
}
return response
}
// Resolve the main client for the given event.
// Client that issues a request doesn't necessarily equal the client
// that registered the worker. It's with the latter the worker should
// communicate with during the response resolving phase.
/**
* Resolve the main client for the given event.
* Client that issues a request doesn't necessarily equal the client
* that registered the worker. It's with the latter the worker should
* communicate with during the response resolving phase.
* @param {FetchEvent} event
* @returns {Promise<Client | undefined>}
*/
async function resolveMainClient(event) {
const client = await self.clients.get(event.clientId)
@@ -184,12 +198,16 @@ async function resolveMainClient(event) {
})
}
/**
* @param {FetchEvent} event
* @param {Client | undefined} client
* @param {string} requestId
* @returns {Promise<Response>}
*/
async function getResponse(event, client, requestId) {
const { request } = event
// Clone the request because it might've been already used
// (i.e. its body has been read and sent to the client).
const requestClone = request.clone()
const requestClone = event.request.clone()
function passthrough() {
// Cast the request headers to a new Headers instance
@@ -230,29 +248,17 @@ async function getResponse(event, client, requestId) {
}
// Notify the client that a request has been intercepted.
const requestBuffer = await request.arrayBuffer()
const serializedRequest = await serializeRequest(event.request)
const clientMessage = await sendToClient(
client,
{
type: 'REQUEST',
payload: {
id: requestId,
url: request.url,
mode: request.mode,
method: request.method,
headers: Object.fromEntries(request.headers.entries()),
cache: request.cache,
credentials: request.credentials,
destination: request.destination,
integrity: request.integrity,
redirect: request.redirect,
referrer: request.referrer,
referrerPolicy: request.referrerPolicy,
body: requestBuffer,
keepalive: request.keepalive,
...serializedRequest,
},
},
[requestBuffer],
[serializedRequest.body],
)
switch (clientMessage.type) {
@@ -268,6 +274,12 @@ async function getResponse(event, client, requestId) {
return passthrough()
}
/**
* @param {Client} client
* @param {any} message
* @param {Array<Transferable>} transferrables
* @returns {Promise<any>}
*/
function sendToClient(client, message, transferrables = []) {
return new Promise((resolve, reject) => {
const channel = new MessageChannel()
@@ -280,14 +292,18 @@ function sendToClient(client, message, transferrables = []) {
resolve(event.data)
}
client.postMessage(
message,
[channel.port2].concat(transferrables.filter(Boolean)),
)
client.postMessage(message, [
channel.port2,
...transferrables.filter(Boolean),
])
})
}
async function respondWithMock(response) {
/**
* @param {Response} response
* @returns {Response}
*/
function respondWithMock(response) {
// Setting response status code to 0 is a no-op.
// However, when responding with a "Response.error()", the produced Response
// instance will have status code set to 0. Since it's not possible to create
@@ -305,3 +321,24 @@ async function respondWithMock(response) {
return mockedResponse
}
/**
* @param {Request} request
*/
async function serializeRequest(request) {
return {
url: request.url,
mode: request.mode,
method: request.method,
headers: Object.fromEntries(request.headers.entries()),
cache: request.cache,
credentials: request.credentials,
destination: request.destination,
integrity: request.integrity,
redirect: request.redirect,
referrer: request.referrer,
referrerPolicy: request.referrerPolicy,
body: await request.arrayBuffer(),
keepalive: request.keepalive,
}
}
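The transfer list that `sendToClient` passes to `postMessage` always starts with the reply port and skips empty entries such as a missing body. A minimal sketch of that one step, assuming a hypothetical `buildTransferList` helper that is not part of the worker script:

```typescript
// Hypothetical helper, not part of the diff: mirrors
// [channel.port2, ...transferrables.filter(Boolean)] from sendToClient.
function buildTransferList<T>(
  port: T,
  extras: ReadonlyArray<T | null | undefined> = [],
): T[] {
  // Drop null/undefined entries (for example, a response without a body).
  const present = extras.filter((item): item is T => Boolean(item));
  // The reply port always comes first so the client can answer on it.
  return [port, ...present];
}
```

In the worker itself the entries would be a `MessagePort` and `ArrayBuffer`/stream bodies; the generic parameter here only keeps the sketch free of DOM typings.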
@@ -60,11 +60,11 @@ Object.entries(translationJson).forEach(([key, translations]) => {
if (Object.keys(missingTranslations).length > 0) {
console.error('\x1b[31m%s\x1b[0m', 'ERROR: Missing translations detected');
console.error(`Found ${Object.keys(missingTranslations).length} translation keys with missing languages:`);
Object.entries(missingTranslations).forEach(([key, langs]) => {
console.error(`- Key "${key}" is missing translations for: ${langs.join(', ')}`);
});
console.error('\nPlease add the missing translations before committing.');
}
@@ -72,11 +72,11 @@ if (Object.keys(missingTranslations).length > 0) {
if (Object.keys(extraLanguages).length > 0) {
console.error('\x1b[31m%s\x1b[0m', 'ERROR: Extra languages detected');
console.error(`Found ${Object.keys(extraLanguages).length} translation keys with extra languages not in AvailableLanguages:`);
Object.entries(extraLanguages).forEach(([key, langs]) => {
console.error(`- Key "${key}" has translations for unsupported languages: ${langs.join(', ')}`);
});
console.error('\nPlease remove the extra languages before committing.');
}
@@ -85,4 +85,4 @@ if (hasErrors) {
process.exit(1);
} else {
console.log('\x1b[32m%s\x1b[0m', 'All translation keys have complete language coverage!');
}
}
+53
@@ -111,6 +111,59 @@ class OpenHands {
return data;
}
/**
* Submit conversation feedback with rating
* @param conversationId The conversation ID
* @param rating The rating (1-5)
* @param eventId Optional event ID this feedback corresponds to
* @param reason Optional reason for the rating
* @returns Response from the feedback endpoint
*/
static async submitConversationFeedback(
conversationId: string,
rating: number,
eventId?: number,
reason?: string,
): Promise<{ status: string; message: string }> {
const url = `/feedback/conversation`;
const payload = {
conversation_id: conversationId,
event_id: eventId,
rating,
reason,
metadata: { source: "likert-scale" },
};
const { data } = await openHands.post<{ status: string; message: string }>(
url,
payload,
);
return data;
}
/**
* Check if feedback exists for a specific conversation and event
* @param conversationId The conversation ID
* @param eventId The event ID to check
* @returns Feedback data including existence, rating, and reason
*/
static async checkFeedbackExists(
conversationId: string,
eventId: number,
): Promise<{ exists: boolean; rating?: number; reason?: string }> {
try {
const url = `/feedback/conversation/${conversationId}/${eventId}`;
const { data } = await openHands.get<{
exists: boolean;
rating?: number;
reason?: string;
}>(url);
return data;
} catch (error) {
// Error checking if feedback exists
return { exists: false };
}
}
/**
* Authenticate with GitHub token
* @returns Response with authentication status and user info if successful
@@ -18,6 +18,7 @@ import { useWsClient } from "#/context/ws-client-provider";
import { Messages } from "./messages";
import { ChatSuggestions } from "./chat-suggestions";
import { ActionSuggestions } from "./action-suggestions";
import { ScrollProvider } from "#/context/scroll-context";
import { ScrollToBottomButton } from "#/components/shared/buttons/scroll-to-bottom-button";
import { LoadingSpinner } from "#/components/shared/loading-spinner";
@@ -28,6 +29,7 @@ import { useOptimisticUserMessage } from "#/hooks/use-optimistic-user-message";
import { useWSErrorMessage } from "#/hooks/use-ws-error-message";
import { ErrorMessageBanner } from "./error-message-banner";
import { shouldRenderEvent } from "./event-content-helpers/should-render-event";
import { useConfig } from "#/hooks/query/use-config";
function getEntryPoint(
hasRepository: boolean | null,
@@ -45,8 +47,15 @@ export function ChatInterface() {
useOptimisticUserMessage();
const { t } = useTranslation();
const scrollRef = React.useRef<HTMLDivElement>(null);
const { scrollDomToBottom, onChatBodyScroll, hitBottom } =
useScrollToBottom(scrollRef);
const {
scrollDomToBottom,
onChatBodyScroll,
hitBottom,
autoScroll,
setAutoScroll,
setHitBottom,
} = useScrollToBottom(scrollRef);
const { data: config } = useConfig();
const { curAgentState } = useSelector((state: RootState) => state.agent);
@@ -126,80 +135,97 @@ export function ChatInterface() {
curAgentState === AgentState.AWAITING_USER_INPUT ||
curAgentState === AgentState.FINISHED;
// Create a ScrollProvider with the scroll hook values
const scrollProviderValue = {
scrollRef,
autoScroll,
setAutoScroll,
scrollDomToBottom,
hitBottom,
setHitBottom,
onChatBodyScroll,
};
return (
<div className="h-full flex flex-col justify-between">
{events.length === 0 && !optimisticUserMessage && (
<ChatSuggestions onSuggestionsClick={setMessageToSend} />
)}
<div
ref={scrollRef}
onScroll={(e) => onChatBodyScroll(e.currentTarget)}
className="scrollbar scrollbar-thin scrollbar-thumb-gray-400 scrollbar-thumb-rounded-full scrollbar-track-gray-800 hover:scrollbar-thumb-gray-300 flex flex-col grow overflow-y-auto overflow-x-hidden px-4 pt-4 gap-2 fast-smooth-scroll"
>
{isLoadingMessages && (
<div className="flex justify-center">
<LoadingSpinner size="small" />
</div>
<ScrollProvider value={scrollProviderValue}>
<div className="h-full flex flex-col justify-between">
{events.length === 0 && !optimisticUserMessage && (
<ChatSuggestions onSuggestionsClick={setMessageToSend} />
)}
{!isLoadingMessages && (
<Messages
messages={events}
isAwaitingUserConfirmation={
curAgentState === AgentState.AWAITING_USER_CONFIRMATION
}
/>
)}
<div
ref={scrollRef}
onScroll={(e) => onChatBodyScroll(e.currentTarget)}
className="scrollbar scrollbar-thin scrollbar-thumb-gray-400 scrollbar-thumb-rounded-full scrollbar-track-gray-800 hover:scrollbar-thumb-gray-300 flex flex-col grow overflow-y-auto overflow-x-hidden px-4 pt-4 gap-2 fast-smooth-scroll"
>
{isLoadingMessages && (
<div className="flex justify-center">
<LoadingSpinner size="small" />
</div>
)}
{isWaitingForUserInput &&
events.length > 0 &&
!optimisticUserMessage && (
<ActionSuggestions
onSuggestionsClick={(value) => handleSendMessage(value, [])}
{!isLoadingMessages && (
<Messages
messages={events}
isAwaitingUserConfirmation={
curAgentState === AgentState.AWAITING_USER_CONFIRMATION
}
/>
)}
</div>
<div className="flex flex-col gap-[6px] px-4 pb-4">
<div className="flex justify-between relative">
<TrajectoryActions
onPositiveFeedback={() =>
onClickShareFeedbackActionButton("positive")
}
onNegativeFeedback={() =>
onClickShareFeedbackActionButton("negative")
}
onExportTrajectory={() => onClickExportTrajectoryButton()}
/>
<div className="absolute left-1/2 transform -translate-x-1/2 bottom-0">
{curAgentState === AgentState.RUNNING && <TypingIndicator />}
</div>
{!hitBottom && <ScrollToBottomButton onClick={scrollDomToBottom} />}
{isWaitingForUserInput &&
events.length > 0 &&
!optimisticUserMessage && (
<ActionSuggestions
onSuggestionsClick={(value) => handleSendMessage(value, [])}
/>
)}
</div>
{errorMessage && <ErrorMessageBanner message={errorMessage} />}
<div className="flex flex-col gap-[6px] px-4 pb-4">
<div className="flex justify-between relative">
{config?.APP_MODE !== "saas" && (
<TrajectoryActions
onPositiveFeedback={() =>
onClickShareFeedbackActionButton("positive")
}
onNegativeFeedback={() =>
onClickShareFeedbackActionButton("negative")
}
onExportTrajectory={() => onClickExportTrajectoryButton()}
/>
)}
<InteractiveChatBox
onSubmit={handleSendMessage}
onStop={handleStop}
isDisabled={
curAgentState === AgentState.LOADING ||
curAgentState === AgentState.AWAITING_USER_CONFIRMATION
}
mode={curAgentState === AgentState.RUNNING ? "stop" : "submit"}
value={messageToSend ?? undefined}
onChange={setMessageToSend}
/>
<div className="absolute left-1/2 transform -translate-x-1/2 bottom-0">
{curAgentState === AgentState.RUNNING && <TypingIndicator />}
</div>
{!hitBottom && <ScrollToBottomButton onClick={scrollDomToBottom} />}
</div>
{errorMessage && <ErrorMessageBanner message={errorMessage} />}
<InteractiveChatBox
onSubmit={handleSendMessage}
onStop={handleStop}
isDisabled={
curAgentState === AgentState.LOADING ||
curAgentState === AgentState.AWAITING_USER_CONFIRMATION
}
mode={curAgentState === AgentState.RUNNING ? "stop" : "submit"}
value={messageToSend ?? undefined}
onChange={setMessageToSend}
/>
</div>
{config?.APP_MODE !== "saas" && (
<FeedbackModal
isOpen={feedbackModalIsOpen}
onClose={() => setFeedbackModalIsOpen(false)}
polarity={feedbackPolarity}
/>
)}
</div>
<FeedbackModal
isOpen={feedbackModalIsOpen}
onClose={() => setFeedbackModalIsOpen(false)}
polarity={feedbackPolarity}
/>
</div>
</ScrollProvider>
);
}
@@ -1,3 +1,4 @@
import React from "react";
import { ConfirmationButtons } from "#/components/shared/buttons/confirmation-buttons";
import { OpenHandsAction } from "#/types/core/actions";
import {
@@ -18,6 +19,10 @@ import { MCPObservationContent } from "./mcp-observation-content";
import { getObservationResult } from "./event-content-helpers/get-observation-result";
import { getEventContent } from "./event-content-helpers/get-event-content";
import { GenericEventMessage } from "./generic-event-message";
import { LikertScale } from "../feedback/likert-scale";
import { useConfig } from "#/hooks/query/use-config";
import { useFeedbackExists } from "#/hooks/query/use-feedback-exists";
const hasThoughtProperty = (
obj: Record<string, unknown>,
@@ -39,6 +44,14 @@ export function EventMessage({
const shouldShowConfirmationButtons =
isLastMessage && event.source === "agent" && isAwaitingUserConfirmation;
const { data: config } = useConfig();
// Use our query hook to check if feedback exists and get rating/reason
const {
data: feedbackData = { exists: false },
isLoading: isCheckingFeedback,
} = useFeedbackExists(isFinishAction(event) ? event.id : undefined);
if (isErrorObservation(event)) {
return (
<ErrorMessage
@@ -55,9 +68,25 @@ export function EventMessage({
return null;
}
const showLikertScale =
config?.APP_MODE === "saas" &&
isFinishAction(event) &&
isLastMessage &&
!isCheckingFeedback;
if (isFinishAction(event)) {
return (
<ChatMessage type="agent" message={getEventContent(event).details} />
<>
<ChatMessage type="agent" message={getEventContent(event).details} />
{showLikertScale && (
<LikertScale
eventId={event.id}
initiallySubmitted={feedbackData.exists}
initialRating={feedbackData.rating}
initialReason={feedbackData.reason}
/>
)}
</>
);
}
@@ -0,0 +1,248 @@
import React, { useState, useEffect, useContext } from "react";
import { cn } from "#/utils/utils";
import i18n from "#/i18n";
import { useSubmitConversationFeedback } from "#/hooks/mutation/use-submit-conversation-feedback";
import { ScrollContext } from "#/context/scroll-context";
// Global timeout duration in milliseconds
const AUTO_SUBMIT_TIMEOUT = 10000;
interface LikertScaleProps {
eventId?: number;
initiallySubmitted?: boolean;
initialRating?: number;
initialReason?: string;
}
const FEEDBACK_REASONS = [
i18n.t("FEEDBACK$REASON_MISUNDERSTOOD_INSTRUCTION"),
i18n.t("FEEDBACK$REASON_FORGOT_CONTEXT"),
i18n.t("FEEDBACK$REASON_UNNECESSARY_CHANGES"),
i18n.t("FEEDBACK$REASON_OTHER"),
];
export function LikertScale({
eventId,
initiallySubmitted = false,
initialRating,
initialReason,
}: LikertScaleProps) {
const [selectedRating, setSelectedRating] = useState<number | null>(
initialRating || null,
);
const [selectedReason, setSelectedReason] = useState<string | null>(
initialReason || null,
);
const [showReasons, setShowReasons] = useState(false);
const [reasonTimeout, setReasonTimeout] = useState<NodeJS.Timeout | null>(
null,
);
const [isSubmitted, setIsSubmitted] = useState(initiallySubmitted);
const [countdown, setCountdown] = useState<number>(0);
// Get scroll context
const scrollContext = useContext(ScrollContext);
// If scrollContext is undefined, we're not inside a ScrollProvider
const scrollToBottom = scrollContext?.scrollDomToBottom;
const autoScroll = scrollContext?.autoScroll;
// Use our mutation hook
const { mutate: submitConversationFeedback } =
useSubmitConversationFeedback();
// Update isSubmitted if initiallySubmitted changes
useEffect(() => {
setIsSubmitted(initiallySubmitted);
}, [initiallySubmitted]);
// Update selectedRating if initialRating changes
useEffect(() => {
if (initialRating) {
setSelectedRating(initialRating);
}
}, [initialRating]);
// Update selectedReason if initialReason changes
useEffect(() => {
if (initialReason) {
setSelectedReason(initialReason);
}
}, [initialReason]);
// Submit feedback and disable the component
const submitFeedback = (rating: number, reason?: string) => {
submitConversationFeedback(
{
rating,
eventId,
reason,
},
{
onSuccess: () => {
setSelectedReason(reason || null);
setShowReasons(false);
setIsSubmitted(true);
},
},
);
};
// Handle star rating selection
const handleRatingClick = (rating: number) => {
if (isSubmitted) return; // Prevent changes after submission
setSelectedRating(rating);
// Only show reasons if rating is 3 or less (1, 2, or 3 stars)
// For ratings > 3 (4 or 5 stars), submit immediately without showing reasons
if (rating <= 3) {
setShowReasons(true);
setCountdown(Math.ceil(AUTO_SUBMIT_TIMEOUT / 1000));
// Set a timeout to auto-submit if no reason is selected
const timeout = setTimeout(() => {
submitFeedback(rating);
}, AUTO_SUBMIT_TIMEOUT);
setReasonTimeout(timeout);
// Only scroll to bottom if the user is already at the bottom (autoScroll is true)
if (scrollToBottom && autoScroll) {
// Small delay to ensure the reasons are fully rendered
setTimeout(() => {
scrollToBottom();
}, 100);
}
} else {
// For ratings > 3 (4 or 5 stars), submit immediately without showing reasons
setShowReasons(false);
submitFeedback(rating);
}
};
// Handle reason selection
const handleReasonClick = (reason: string) => {
if (selectedRating && reasonTimeout && !isSubmitted) {
clearTimeout(reasonTimeout);
setCountdown(0);
submitFeedback(selectedRating, reason);
}
};
// Countdown effect
useEffect(() => {
if (countdown > 0 && showReasons && !isSubmitted) {
const timer = setTimeout(() => {
setCountdown(countdown - 1);
}, 1000);
return () => clearTimeout(timer);
}
return () => {};
}, [countdown, showReasons, isSubmitted]);
// Clean up timeout on unmount
useEffect(
() => () => {
if (reasonTimeout) {
clearTimeout(reasonTimeout);
}
},
[reasonTimeout],
);
// Scroll to bottom when component mounts, but only if user is already at the bottom
useEffect(() => {
if (scrollToBottom && autoScroll && !isSubmitted) {
// Small delay to ensure the component is fully rendered
setTimeout(() => {
scrollToBottom();
}, 100);
}
}, [scrollToBottom, autoScroll, isSubmitted]);
// Scroll to bottom when reasons are shown, but only if user is already at the bottom
useEffect(() => {
if (scrollToBottom && autoScroll && showReasons) {
// Small delay to ensure the reasons are fully rendered
setTimeout(() => {
scrollToBottom();
}, 100);
}
}, [scrollToBottom, autoScroll, showReasons]);
// Helper function to get button class based on state
const getButtonClass = (rating: number) => {
if (isSubmitted) {
return selectedRating && selectedRating >= rating
? "text-yellow-400 cursor-not-allowed"
: "text-gray-300 opacity-50 cursor-not-allowed";
}
return selectedRating && selectedRating >= rating
? "text-yellow-400"
: "text-gray-300 hover:text-yellow-200";
};
return (
<div className="mt-3 flex flex-col gap-1">
<div className="text-sm text-gray-500 mb-1">
{isSubmitted
? i18n.t("FEEDBACK$THANK_YOU_FOR_FEEDBACK")
: i18n.t("FEEDBACK$RATE_AGENT_PERFORMANCE")}
</div>
<div className="flex flex-col gap-1">
<span className="flex gap-2 items-center flex-wrap">
{[1, 2, 3, 4, 5].map((rating) => (
<button
type="button"
key={rating}
onClick={() => handleRatingClick(rating)}
disabled={isSubmitted}
className={cn("text-xl transition-all", getButtonClass(rating))}
aria-label={`Rate ${rating} stars`}
>
★
</button>
))}
{/* Show selected reason inline with stars when submitted (only for ratings <= 3) */}
{isSubmitted &&
selectedReason &&
selectedRating &&
selectedRating <= 3 && (
<span className="text-sm text-gray-500 italic">
{selectedReason}
</span>
)}
</span>
</div>
{showReasons && !isSubmitted && (
<div className="mt-1 flex flex-col gap-1">
<div className="text-xs text-gray-500 mb-1">
{i18n.t("FEEDBACK$SELECT_REASON")}
</div>
{countdown > 0 && (
<div className="text-xs text-gray-400 mb-1 italic">
{i18n.t("FEEDBACK$SELECT_REASON_COUNTDOWN", {
countdown,
})}
</div>
)}
<div className="flex flex-col gap-0.5">
{FEEDBACK_REASONS.map((reason) => (
<button
type="button"
key={reason}
onClick={() => handleReasonClick(reason)}
className="text-sm text-left py-1 px-2 rounded hover:bg-gray-700 transition-colors"
>
{reason}
</button>
))}
</div>
</div>
)}
</div>
);
}
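The branch inside `handleRatingClick` above can be read as a small pure decision: low ratings ask for a reason and start a countdown, high ratings submit at once. A minimal sketch of that rule, assuming the hypothetical names `RatingStep` and `nextStep` (neither exists in this change) and adding a range check that the component itself does not perform:

```typescript
// Same value as the constant declared in the component above.
const AUTO_SUBMIT_TIMEOUT = 10000;

// Hypothetical result type, for illustration only.
type RatingStep =
  | { kind: "ask-reason"; countdownSeconds: number }
  | { kind: "submit-now" };

// Hypothetical helper mirroring the rating <= 3 branch in handleRatingClick.
function nextStep(rating: number): RatingStep {
  // Assumption: ratings are whole numbers from 1 to 5 (the five star buttons).
  if (!Number.isInteger(rating) || rating < 1 || rating > 5) {
    throw new RangeError(`rating must be an integer from 1 to 5, got ${rating}`);
  }
  // 1 to 3 stars: show the reasons and start the auto-submit countdown.
  if (rating <= 3) {
    return {
      kind: "ask-reason",
      countdownSeconds: Math.ceil(AUTO_SUBMIT_TIMEOUT / 1000),
    };
  }
  // 4 or 5 stars: submit immediately without asking for a reason.
  return { kind: "submit-now" };
}
```

Keeping the rule in one function like this would make the threshold testable without rendering the component; the diff itself keeps the logic inline.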
@@ -10,7 +10,10 @@ export function ConnectToProviderMessage() {
return (
<div className="flex flex-col gap-4">
<p>{t("HOME$CONNECT_PROVIDER_MESSAGE")}</p>
<Link data-testid="navigate-to-settings-button" to="/settings/git">
<Link
data-testid="navigate-to-settings-button"
to="/settings/integrations"
>
<BrandButton type="button" variant="primary" isDisabled={isLoading}>
{!isLoading && t("SETTINGS$TITLE")}
{isLoading && t("HOME$LOADING")}
@@ -0,0 +1,21 @@
import { useTranslation } from "react-i18next";
import { I18nKey } from "#/i18n/declaration";
import { BrandButton } from "../brand-button";
export function InstallSlackAppAnchor() {
const { t } = useTranslation();
return (
<a
data-testid="install-slack-app-button"
href="https://slack.com/oauth/v2/authorize?client_id=7477886716822.8729519890534&scope=app_mentions:read,chat:write,users:read,channels:history,groups:history,mpim:history,im:history&user_scope=channels:history,groups:history,im:history,mpim:history"
target="_blank"
rel="noreferrer noopener"
className="py-9"
>
<BrandButton type="button" variant="secondary">
{t(I18nKey.SLACK$INSTALL_APP)}
</BrandButton>
</a>
);
}
+42
@@ -0,0 +1,42 @@
import React, { createContext, useContext, ReactNode, RefObject } from "react";
import { useScrollToBottom } from "#/hooks/use-scroll-to-bottom";
interface ScrollContextType {
scrollRef: RefObject<HTMLDivElement | null>;
autoScroll: boolean;
setAutoScroll: (value: boolean) => void;
scrollDomToBottom: () => void;
hitBottom: boolean;
setHitBottom: (value: boolean) => void;
onChatBodyScroll: (e: HTMLElement) => void;
}
export const ScrollContext = createContext<ScrollContextType | undefined>(
undefined,
);
interface ScrollProviderProps {
children: ReactNode;
value?: ScrollContextType;
}
export function ScrollProvider({ children, value }: ScrollProviderProps) {
const scrollHook = useScrollToBottom(React.useRef<HTMLDivElement>(null));
// Use provided value or default to the hook
const contextValue = value || scrollHook;
return (
<ScrollContext.Provider value={contextValue}>
{children}
</ScrollContext.Provider>
);
}
export function useScrollContext() {
const context = useContext(ScrollContext);
if (context === undefined) {
throw new Error("useScrollContext must be used within a ScrollProvider");
}
return context;
}
@@ -0,0 +1,39 @@
import { useMutation, useQueryClient } from "@tanstack/react-query";
import { useTranslation } from "react-i18next";
import OpenHands from "#/api/open-hands";
import { useConversationId } from "#/hooks/use-conversation-id";
type SubmitConversationFeedbackArgs = {
rating: number;
eventId?: number;
reason?: string;
};
export const useSubmitConversationFeedback = () => {
const { conversationId } = useConversationId();
const queryClient = useQueryClient();
const { t } = useTranslation();
return useMutation({
mutationFn: ({ rating, eventId, reason }: SubmitConversationFeedbackArgs) =>
OpenHands.submitConversationFeedback(
conversationId,
rating,
eventId,
reason,
),
onSuccess: (_, { eventId }) => {
// Invalidate the feedback existence query to trigger a refetch
if (eventId) {
queryClient.invalidateQueries({
queryKey: ["feedback", "exists", conversationId, eventId],
});
}
},
onError: (error) => {
// Log error but don't show toast - user will just see the UI stay in unsubmitted state
// eslint-disable-next-line no-console
console.error(t("FEEDBACK$FAILED_TO_SUBMIT"), error);
},
});
};
@@ -0,0 +1,24 @@
import { useQuery } from "@tanstack/react-query";
import OpenHands from "#/api/open-hands";
import { useConversationId } from "#/hooks/use-conversation-id";
export interface FeedbackData {
exists: boolean;
rating?: number;
reason?: string;
}
export const useFeedbackExists = (eventId?: number) => {
const { conversationId } = useConversationId();
return useQuery<FeedbackData>({
queryKey: ["feedback", "exists", conversationId, eventId],
queryFn: () => {
if (!eventId) return { exists: false };
return OpenHands.checkFeedbackExists(conversationId, eventId);
},
enabled: !!eventId,
staleTime: 1000 * 60 * 5, // 5 minutes
gcTime: 1000 * 60 * 15, // 15 minutes
});
};
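The query above and the mutation's `onSuccess` invalidation both spell out the same `["feedback", "exists", conversationId, eventId]` key by hand, so the refetch only works while the two stay identical. A minimal sketch of sharing the key, assuming a hypothetical `feedbackExistsKey` helper that is not part of this change:

```typescript
// Hypothetical helper, not in the diff: one source of truth for the query key
// used by useFeedbackExists and invalidated by useSubmitConversationFeedback.
function feedbackExistsKey(conversationId: string, eventId?: number) {
  return ["feedback", "exists", conversationId, eventId] as const;
}

// Both call sites would then produce structurally equal keys.
const queryKey = feedbackExistsKey("conversation-1", 42);
const invalidationKey = feedbackExistsKey("conversation-1", 42);
console.log(JSON.stringify(queryKey) === JSON.stringify(invalidationKey));
```

The conversation ID and event ID in the usage lines are placeholder values.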
+11 -1
@@ -80,7 +80,7 @@ export enum I18nKey {
ANALYTICS$CONFIRM_PREFERENCES = "ANALYTICS$CONFIRM_PREFERENCES",
SETTINGS$SAVING = "SETTINGS$SAVING",
SETTINGS$SAVE_CHANGES = "SETTINGS$SAVE_CHANGES",
SETTINGS$NAV_GIT = "SETTINGS$NAV_GIT",
SETTINGS$NAV_INTEGRATIONS = "SETTINGS$NAV_INTEGRATIONS",
SETTINGS$NAV_APPLICATION = "SETTINGS$NAV_APPLICATION",
SETTINGS$NAV_CREDITS = "SETTINGS$NAV_CREDITS",
SETTINGS$NAV_SECRETS = "SETTINGS$NAV_SECRETS",
@@ -174,6 +174,7 @@ export enum I18nKey {
GITHUB$TOKEN_INVALID = "GITHUB$TOKEN_INVALID",
BUTTON$DISCONNECT = "BUTTON$DISCONNECT",
GITHUB$CONFIGURE_REPOS = "GITHUB$CONFIGURE_REPOS",
SLACK$INSTALL_APP = "SLACK$INSTALL_APP",
COMMON$CLICK_FOR_INSTRUCTIONS = "COMMON$CLICK_FOR_INSTRUCTIONS",
LLM$SELECT_MODEL_PLACEHOLDER = "LLM$SELECT_MODEL_PLACEHOLDER",
LLM$MODEL = "LLM$MODEL",
@@ -583,4 +584,13 @@ export enum I18nKey {
SETTINGS$EMAIL_VERIFICATION_RESTRICTION_MESSAGE = "SETTINGS$EMAIL_VERIFICATION_RESTRICTION_MESSAGE",
SETTINGS$RESEND_VERIFICATION = "SETTINGS$RESEND_VERIFICATION",
SETTINGS$FAILED_TO_RESEND_VERIFICATION = "SETTINGS$FAILED_TO_RESEND_VERIFICATION",
FEEDBACK$RATE_AGENT_PERFORMANCE = "FEEDBACK$RATE_AGENT_PERFORMANCE",
FEEDBACK$SELECT_REASON = "FEEDBACK$SELECT_REASON",
FEEDBACK$SELECT_REASON_COUNTDOWN = "FEEDBACK$SELECT_REASON_COUNTDOWN",
FEEDBACK$REASON_MISUNDERSTOOD_INSTRUCTION = "FEEDBACK$REASON_MISUNDERSTOOD_INSTRUCTION",
FEEDBACK$REASON_FORGOT_CONTEXT = "FEEDBACK$REASON_FORGOT_CONTEXT",
FEEDBACK$REASON_UNNECESSARY_CHANGES = "FEEDBACK$REASON_UNNECESSARY_CHANGES",
FEEDBACK$REASON_OTHER = "FEEDBACK$REASON_OTHER",
FEEDBACK$THANK_YOU_FOR_FEEDBACK = "FEEDBACK$THANK_YOU_FOR_FEEDBACK",
FEEDBACK$FAILED_TO_SUBMIT = "FEEDBACK$FAILED_TO_SUBMIT",
}
+175 -15
@@ -1279,21 +1279,21 @@
"de": "Änderungen speichern",
"uk": "Зберегти зміни"
},
"SETTINGS$NAV_GIT": {
"en": "Git",
"ja": "Git",
"zh-CN": "Git",
"zh-TW": "Git",
"ko-KR": "Git",
"no": "Git",
"it": "Git",
"pt": "Git",
"es": "Git",
"ar": "Git",
"fr": "Git",
"tr": "Git",
"de": "Git",
"uk": "Git"
"SETTINGS$NAV_INTEGRATIONS": {
"en": "Integrations",
"ja": "統合",
"zh-CN": "集成",
"zh-TW": "整合",
"ko-KR": "통합",
"no": "Integrasjoner",
"it": "Integrazioni",
"pt": "Integrações",
"es": "Integraciones",
"ar": "التكامل",
"fr": "Intégrations",
"tr": "Entegrasyonlar",
"de": "Integrationen",
"uk": "Інтеграції"
},
"SETTINGS$NAV_APPLICATION": {
"en": "Application",
@@ -2783,6 +2783,22 @@
"de": "GitHub-Repositories konfigurieren",
"uk": "Налаштування репозиторіїв Github"
},
"SLACK$INSTALL_APP": {
"en": "Install OpenHands Slack App",
"ja": "OpenHands Slackアプリをインストール",
"zh-CN": "安装 OpenHands Slack 应用",
"zh-TW": "安裝 OpenHands Slack 應用程式",
"ko-KR": "OpenHands Slack 앱 설치",
"no": "Installer OpenHands Slack-app",
"it": "Installa l'app Slack di OpenHands",
"pt": "Instalar aplicativo Slack do OpenHands",
"es": "Instalar aplicación Slack de OpenHands",
"ar": "تثبيت تطبيق OpenHands Slack",
"fr": "Installer l'application Slack OpenHands",
"tr": "OpenHands Slack uygulamasını yükle",
"de": "OpenHands Slack-App installieren",
"uk": "Встановити додаток OpenHands Slack"
},
"COMMON$CLICK_FOR_INSTRUCTIONS": {
"en": "Click here for instructions",
"ja": "手順はこちらをクリック",
@@ -9326,5 +9342,149 @@
"tr": "Doğrulama e-postası yeniden gönderilemedi",
"de": "Bestätigungs-E-Mail konnte nicht erneut gesendet werden",
"uk": "Не вдалося повторно надіслати лист підтвердження"
},
"FEEDBACK$RATE_AGENT_PERFORMANCE": {
"en": "Rate the agent's performance:",
"ja": "エージェントのパフォーマンスを評価してください:",
"zh-CN": "评价代理的表现:",
"zh-TW": "評價代理的表現:",
"ko-KR": "에이전트의 성능을 평가하세요:",
"no": "Vurder agentens ytelse:",
"it": "Valuta le prestazioni dell'agente:",
"pt": "Avalie o desempenho do agente:",
"es": "Evalúe el rendimiento del agente:",
"ar": "قيم أداء الوكيل:",
"fr": "Évaluez la performance de l'agent :",
"tr": "Ajanın performansını değerlendirin:",
"de": "Bewerten Sie die Leistung des Agenten:",
"uk": "Оцініть продуктивність агента:"
},
"FEEDBACK$SELECT_REASON": {
"en": "Select a reason (optional):",
"ja": "理由を選択してください(任意):",
"zh-CN": "选择原因(可选):",
"zh-TW": "選擇原因(可選):",
"ko-KR": "이유 선택 (선택 사항):",
"no": "Velg en grunn (valgfritt):",
"it": "Seleziona un motivo (opzionale):",
"pt": "Selecione um motivo (opcional):",
"es": "Seleccione un motivo (opcional):",
"ar": "حدد سببًا (اختياري):",
"fr": "Sélectionnez une raison (facultatif) :",
"tr": "Bir neden seçin (isteğe bağlı):",
"de": "Wählen Sie einen Grund (optional):",
"uk": "Виберіть причину (необов'язково):"
},
"FEEDBACK$SELECT_REASON_COUNTDOWN": {
"en": "Auto-submitting in {{countdown}} seconds...",
"ja": "{{countdown}}秒後に自動送信されます...",
"zh-CN": "{{countdown}}秒后自动提交...",
"zh-TW": "{{countdown}}秒後自動提交...",
"ko-KR": "{{countdown}}초 후 자동 제출...",
"no": "Sender automatisk om {{countdown}} sekunder...",
"it": "Invio automatico tra {{countdown}} secondi...",
"pt": "Enviando automaticamente em {{countdown}} segundos...",
"es": "Enviando automáticamente en {{countdown}} segundos...",
"ar": "الإرسال التلقائي خلال {{countdown}} ثانية...",
"fr": "Envoi automatique dans {{countdown}} secondes...",
"tr": "{{countdown}} saniye içinde otomatik gönderilecek...",
"de": "Automatische Übermittlung in {{countdown}} Sekunden...",
"uk": "Автоматична відправка через {{countdown}} секунд..."
},
"FEEDBACK$REASON_MISUNDERSTOOD_INSTRUCTION": {
"en": "The agent misunderstood my instruction",
"ja": "エージェントは私の指示を誤解しました",
"zh-CN": "代理误解了我的指示",
"zh-TW": "代理誤解了我的指示",
"ko-KR": "에이전트가 내 지시를 잘못 이해했습니다",
"no": "Agenten misforsto instruksjonene mine",
"it": "L'agente ha frainteso le mie istruzioni",
"pt": "O agente não entendeu minhas instruções",
"es": "El agente malinterpretó mis instrucciones",
"ar": "أساء الوكيل فهم تعليماتي",
"fr": "L'agent a mal compris mes instructions",
"tr": "Ajan talimatlarımı yanlış anladı",
"de": "Der Agent hat meine Anweisungen missverstanden",
"uk": "Агент неправильно зрозумів мої інструкції"
},
"FEEDBACK$REASON_FORGOT_CONTEXT": {
"en": "The agent forgot about the earlier context",
"ja": "エージェントは以前のコンテキストを忘れました",
"zh-CN": "代理忘记了之前的上下文",
"zh-TW": "代理忘記了之前的上下文",
"ko-KR": "에이전트가 이전 컨텍스트를 잊었습니다",
"no": "Agenten glemte den tidligere konteksten",
"it": "L'agente ha dimenticato il contesto precedente",
"pt": "O agente esqueceu o contexto anterior",
"es": "El agente olvidó el contexto anterior",
"ar": "نسي الوكيل السياق السابق",
"fr": "L'agent a oublié le contexte précédent",
"tr": "Ajan önceki bağlamı unuttu",
"de": "Der Agent hat den früheren Kontext vergessen",
"uk": "Агент забув про попередній контекст"
},
"FEEDBACK$REASON_UNNECESSARY_CHANGES": {
"en": "The agent made unnecessary changes",
"ja": "エージェントは不要な変更を行いました",
"zh-CN": "代理进行了不必要的更改",
"zh-TW": "代理進行了不必要的更改",
"ko-KR": "에이전트가 불필요한 변경을 했습니다",
"no": "Agenten gjorde unødvendige endringer",
"it": "L'agente ha apportato modifiche non necessarie",
"pt": "O agente fez alterações desnecessárias",
"es": "El agente hizo cambios innecesarios",
"ar": "قام الوكيل بتغييرات غير ضرورية",
"fr": "L'agent a apporté des modifications inutiles",
"tr": "Ajan gereksiz değişiklikler yaptı",
"de": "Der Agent hat unnötige Änderungen vorgenommen",
"uk": "Агент зробив непотрібні зміни"
},
"FEEDBACK$REASON_OTHER": {
"en": "Other",
"ja": "その他",
"zh-CN": "其他",
"zh-TW": "其他",
"ko-KR": "기타",
"no": "Annet",
"it": "Altro",
"pt": "Outro",
"es": "Otro",
"ar": "أخرى",
"fr": "Autre",
"tr": "Diğer",
"de": "Andere",
"uk": "Інше"
},
"FEEDBACK$THANK_YOU_FOR_FEEDBACK": {
"en": "Thank you for your feedback! This will help us improve OpenHands going forward.",
"ja": "フィードバックをありがとうございます!これにより、今後OpenHandsを改善していくことができます。",
"zh-CN": "感谢您的反馈!这将帮助我们改进OpenHands。",
"zh-TW": "感謝您的反饋!這將幫助我們改進OpenHands。",
"ko-KR": "피드백 감사합니다! 이를 통해 OpenHands를 개선해 나가겠습니다.",
"no": "Takk for tilbakemeldingen! Dette vil hjelpe oss med å forbedre OpenHands fremover.",
"it": "Grazie per il tuo feedback! Questo ci aiuterà a migliorare OpenHands in futuro.",
"pt": "Obrigado pelo seu feedback! Isso nos ajudará a melhorar o OpenHands no futuro.",
"es": "¡Gracias por su comentario! Esto nos ayudará a mejorar OpenHands en el futuro.",
"ar": "شكرا على ملاحظاتك! سيساعدنا هذا في تحسين OpenHands في المستقبل.",
"fr": "Merci pour votre retour ! Cela nous aidera à améliorer OpenHands à l'avenir.",
"tr": "Geri bildiriminiz için teşekkürler! Bu, OpenHands'i ileride geliştirmemize yardımcı olacak.",
"de": "Vielen Dank für Ihr Feedback! Das hilft uns, OpenHands in Zukunft zu verbessern.",
"uk": "Дякуємо за ваш відгук! Це допоможе нам покращити OpenHands у майбутньому."
},
"FEEDBACK$FAILED_TO_SUBMIT": {
"en": "Failed to submit feedback",
"ja": "フィードバックの送信に失敗しました",
"zh-CN": "提交反馈失败",
"zh-TW": "提交反饋失敗",
"ko-KR": "피드백 제출 실패",
"no": "Kunne ikke sende tilbakemelding",
"it": "Impossibile inviare feedback",
"pt": "Falha ao enviar feedback",
"es": "Error al enviar comentarios",
"ar": "فشل في تقديم التعليقات",
"fr": "Échec de l'envoi des commentaires",
"tr": "Geri bildirim gönderilemedi",
"de": "Feedback konnte nicht gesendet werden",
"uk": "Не вдалося надіслати відгук"
}
}
+1 -1
@@ -13,7 +13,7 @@ export default [
index("routes/llm-settings.tsx"),
route("mcp", "routes/mcp-settings.tsx"),
route("user", "routes/user-settings.tsx"),
route("git", "routes/git-settings.tsx"),
route("integrations", "routes/git-settings.tsx"),
route("app", "routes/app-settings.tsx"),
route("billing", "routes/billing.tsx"),
route("secrets", "routes/secrets-settings.tsx"),
+5
@@ -7,6 +7,7 @@ import { useLogout } from "#/hooks/mutation/use-logout";
import { GitHubTokenInput } from "#/components/features/settings/git-settings/github-token-input";
import { GitLabTokenInput } from "#/components/features/settings/git-settings/gitlab-token-input";
import { ConfigureGitHubRepositoriesAnchor } from "#/components/features/settings/git-settings/configure-github-repositories-anchor";
import { InstallSlackAppAnchor } from "#/components/features/settings/git-settings/install-slack-app-anchor";
import { I18nKey } from "#/i18n/declaration";
import {
displayErrorToast,
@@ -103,6 +104,10 @@ function GitSettingsScreen() {
<ConfigureGitHubRepositoriesAnchor slug={config.APP_SLUG!} />
)}
{shouldRenderExternalConfigureButtons && !isLoading && (
<InstallSlackAppAnchor />
)}
{!isSaas && (
<GitHubTokenInput
name="github-token-input"
+5 -1
@@ -84,7 +84,11 @@ function SecretsSettingsScreen() {
)}
{shouldRenderConnectToGitButton && (
<Link to="/settings/git" data-testid="connect-git-button" type="button">
<Link
to="/settings/integrations"
data-testid="connect-git-button"
type="button"
>
<BrandButton type="button" variant="secondary">
Connect a Git provider to manage secrets
</BrandButton>
+2 -2
@@ -16,7 +16,7 @@ function SettingsScreen() {
const saasNavItems = [
{ to: "/settings/user", text: t("SETTINGS$NAV_USER") },
{ to: "/settings/git", text: t("SETTINGS$NAV_GIT") },
{ to: "/settings/integrations", text: t("SETTINGS$NAV_INTEGRATIONS") },
{ to: "/settings/app", text: t("SETTINGS$NAV_APPLICATION") },
{ to: "/settings/billing", text: t("SETTINGS$NAV_CREDITS") },
{ to: "/settings/secrets", text: t("SETTINGS$NAV_SECRETS") },
@@ -26,7 +26,7 @@ function SettingsScreen() {
const ossNavItems = [
{ to: "/settings", text: t("SETTINGS$NAV_LLM") },
{ to: "/settings/mcp", text: t("SETTINGS$NAV_MCP") },
{ to: "/settings/git", text: t("SETTINGS$NAV_GIT") },
{ to: "/settings/integrations", text: t("SETTINGS$NAV_INTEGRATIONS") },
{ to: "/settings/app", text: t("SETTINGS$NAV_APPLICATION") },
{ to: "/settings/secrets", text: t("SETTINGS$NAV_SECRETS") },
];
+27 -27
@@ -19,10 +19,10 @@ vi.mock("react-i18next", () => ({
describe("RepositorySelectionForm", () => {
const mockOnRepoSelection = vi.fn();
beforeEach(() => {
vi.resetAllMocks();
// Mock the hooks with default values
(useUserRepositories as any).mockReturnValue({
data: [
@@ -32,7 +32,7 @@ describe("RepositorySelectionForm", () => {
isLoading: false,
isError: false,
});
(useRepositoryBranches as any).mockReturnValue({
data: [
{ name: "main" },
@@ -41,90 +41,90 @@ describe("RepositorySelectionForm", () => {
isLoading: false,
isError: false,
});
(useCreateConversation as any).mockReturnValue({
mutate: vi.fn(),
isPending: false,
isSuccess: false,
});
(useIsCreatingConversation as any).mockReturnValue(false);
});
it("should clear selected branch when input is empty", async () => {
render(<RepositorySelectionForm onRepoSelection={mockOnRepoSelection} />);
// First select a repository to enable the branch dropdown
const repoDropdown = screen.getByTestId("repository-dropdown");
fireEvent.change(repoDropdown, { target: { value: "test/repo1" } });
// Get the branch dropdown and verify it's enabled
const branchDropdown = screen.getByTestId("branch-dropdown");
expect(branchDropdown).not.toBeDisabled();
// Simulate deleting all text in the branch input
fireEvent.change(branchDropdown, { target: { value: "" } });
// Verify the branch input is cleared (no selected branch)
expect(branchDropdown).toHaveValue("");
});
it("should clear selected branch when input contains only whitespace", async () => {
render(<RepositorySelectionForm onRepoSelection={mockOnRepoSelection} />);
// First select a repository to enable the branch dropdown
const repoDropdown = screen.getByTestId("repository-dropdown");
fireEvent.change(repoDropdown, { target: { value: "test/repo1" } });
// Get the branch dropdown and verify it's enabled
const branchDropdown = screen.getByTestId("branch-dropdown");
expect(branchDropdown).not.toBeDisabled();
// Simulate entering only whitespace in the branch input
fireEvent.change(branchDropdown, { target: { value: " " } });
// Verify the branch input is cleared (no selected branch)
expect(branchDropdown).toHaveValue("");
});
it("should keep branch empty after being cleared even with auto-selection", async () => {
render(<RepositorySelectionForm onRepoSelection={mockOnRepoSelection} />);
// First select a repository to enable the branch dropdown
const repoDropdown = screen.getByTestId("repository-dropdown");
fireEvent.change(repoDropdown, { target: { value: "test/repo1" } });
// Get the branch dropdown and verify it's enabled
const branchDropdown = screen.getByTestId("branch-dropdown");
expect(branchDropdown).not.toBeDisabled();
// The branch should be auto-selected to "main" initially
expect(branchDropdown).toHaveValue("main");
// Simulate deleting all text in the branch input
fireEvent.change(branchDropdown, { target: { value: "" } });
// Verify the branch input is cleared (no selected branch)
expect(branchDropdown).toHaveValue("");
// Trigger a re-render by changing something else
fireEvent.change(repoDropdown, { target: { value: "test/repo2" } });
fireEvent.change(repoDropdown, { target: { value: "test/repo1" } });
// The branch should be auto-selected to "main" again after repo change
expect(branchDropdown).toHaveValue("main");
// Clear it again
fireEvent.change(branchDropdown, { target: { value: "" } });
// Verify it stays empty
expect(branchDropdown).toHaveValue("");
// Simulate a component update without changing repos
// This would normally trigger the useEffect if our fix wasn't working
fireEvent.blur(branchDropdown);
// Verify it still stays empty
expect(branchDropdown).toHaveValue("");
});
});
});
@@ -125,9 +125,9 @@ class BrowsingAgent(Agent):
self.reset()
def reset(self) -> None:
"""Resets the Browsing Agent."""
"""Resets the Browsing Agent's internal state."""
super().reset()
self.cost_accumulator = 0
# Reset agent-specific counters but not LLM metrics
self.error_accumulator = 0
def step(self, state: State) -> Action:
@@ -136,8 +136,9 @@ class CodeActAgent(Agent):
return tools
def reset(self) -> None:
"""Resets the CodeAct Agent."""
"""Resets the CodeAct Agent's internal state."""
super().reset()
# Only clear pending actions, not LLM metrics
self.pending_actions.clear()
def step(self, state: State) -> 'Action':
+4 -4
@@ -119,14 +119,14 @@ class DummyAgent(Agent):
]
def step(self, state: State) -> Action:
if state.iteration >= len(self.steps):
if state.iteration_flag.current_value >= len(self.steps):
return AgentFinishAction()
current_step = self.steps[state.iteration]
current_step = self.steps[state.iteration_flag.current_value]
action = current_step['action']
if state.iteration > 0:
prev_step = self.steps[state.iteration - 1]
if state.iteration_flag.current_value > 0:
prev_step = self.steps[state.iteration_flag.current_value - 1]
if 'observations' in prev_step and prev_step['observations']:
expected_observations = prev_step['observations']
@@ -176,9 +176,9 @@ Note:
self.reset()
def reset(self) -> None:
"""Resets the VisualBrowsingAgent."""
"""Resets the VisualBrowsingAgent's internal state."""
super().reset()
self.cost_accumulator = 0
# Reset agent-specific counters but not LLM metrics
self.error_accumulator = 0
def step(self, state: State) -> Action:
@@ -208,7 +208,9 @@ Note:
# for visualwebarena, webarena and miniwob++ eval, we need to retrieve the initial observation already in browser env
# initialize and retrieve the first observation by issuing a noop op
# For non-benchmark browsing, the browser env starts with a blank page, and the agent is expected to first navigate to desired websites
return BrowseInteractiveAction(browser_actions='noop(1000)', return_axtree=True)
return BrowseInteractiveAction(
browser_actions='noop(1000)', return_axtree=True
)
for event in state.view:
if isinstance(event, BrowseInteractiveAction):
+12 -4
@@ -215,10 +215,18 @@ async def modify_llm_settings_basic(
]
provider_models = VERIFIED_ANTHROPIC_MODELS + provider_models
# Set default model to the first model in the list (which will be a verified model if available)
default_model = (
provider_models[0] if provider_models else 'claude-sonnet-4-20250514'
)
# Set default model to the best verified model for the provider
if provider == 'anthropic' and VERIFIED_ANTHROPIC_MODELS:
# Use the first model in the VERIFIED_ANTHROPIC_MODELS list as it's the best/newest
default_model = VERIFIED_ANTHROPIC_MODELS[0]
elif provider == 'openai' and VERIFIED_OPENAI_MODELS:
# Use the first model in the VERIFIED_OPENAI_MODELS list as it's the best/newest
default_model = VERIFIED_OPENAI_MODELS[0]
else:
# For other providers, use the first model in the list
default_model = (
provider_models[0] if provider_models else 'claude-sonnet-4-20250514'
)
# Show the default model but allow changing it
print_formatted_text(
+9 -9
@@ -158,17 +158,17 @@ VERIFIED_OPENAI_MODELS = [
]
VERIFIED_ANTHROPIC_MODELS = [
'claude-2',
'claude-2.1',
'claude-3-5-sonnet-20240620',
'claude-3-5-sonnet-20241022',
'claude-3-5-haiku-20241022',
'claude-3-haiku-20240307',
'claude-3-opus-20240229',
'claude-3-sonnet-20240229',
'claude-3-7-sonnet-20250219',
'claude-sonnet-4-20250514',
'claude-opus-4-20250514',
'claude-3-7-sonnet-20250219',
'claude-3-sonnet-20240229',
'claude-3-opus-20240229',
'claude-3-haiku-20240307',
'claude-3-5-haiku-20241022',
'claude-3-5-sonnet-20241022',
'claude-3-5-sonnet-20240620',
'claude-2.1',
'claude-2',
]
+2 -8
@@ -103,16 +103,10 @@ class Agent(ABC):
pass
def reset(self) -> None:
"""Resets the agent's execution status and clears the history. This method can be used
to prepare the agent for restarting the instruction or cleaning up before destruction.
"""
# TODO clear history
"""Resets the agent's execution status."""
# Only reset the completion status, not the LLM metrics
self._complete = False
if self.llm:
self.llm.reset()
@property
def name(self) -> str:
return self.__class__.__name__
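The practical effect of dropping `self.llm.reset()` from `Agent.reset()` is that accumulated LLM metrics survive an agent reset. The sketch below illustrates that behavior with hypothetical stand-in classes (`MetricsSketch`, `AgentSketch`); it does not reproduce the real `Agent`, `LLM`, or `Metrics` classes.

```python
from dataclasses import dataclass


@dataclass
class MetricsSketch:
    """Hypothetical stand-in for the LLM metrics object."""

    accumulated_cost: float = 0.0


class AgentSketch:
    """Hypothetical agent that only tracks completion status and metrics."""

    def __init__(self, metrics: MetricsSketch) -> None:
        self.metrics = metrics
        self._complete = False

    def reset(self) -> None:
        # Only the completion status is reset; metrics are left untouched.
        self._complete = False


agent = AgentSketch(MetricsSketch(accumulated_cost=0.5))
agent._complete = True
agent.reset()
cost_after_reset = agent.metrics.accumulated_cost
complete_after_reset = agent._complete
```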
+110 -323
@@ -7,7 +7,6 @@ import time
import traceback
from typing import Callable
import litellm # noqa
from litellm.exceptions import ( # noqa
APIConnectionError,
APIError,
@@ -25,7 +24,8 @@ from litellm.exceptions import ( # noqa
from openhands.controller.agent import Agent
from openhands.controller.replay import ReplayManager
from openhands.controller.state.state import State, TrafficControlState
from openhands.controller.state.state import State
from openhands.controller.state.state_tracker import StateTracker
from openhands.controller.stuck import StuckDetector
from openhands.core.config import AgentConfig, LLMConfig
from openhands.core.exceptions import (
@@ -61,7 +61,6 @@ from openhands.events.action import (
)
from openhands.events.action.agent import CondensationAction, RecallAction
from openhands.events.event import Event
from openhands.events.event_filter import EventFilter
from openhands.events.observation import (
AgentDelegateObservation,
AgentStateChangedObservation,
@@ -69,10 +68,11 @@ from openhands.events.observation import (
NullObservation,
Observation,
)
from openhands.events.serialization.event import event_to_trajectory, truncate_content
from openhands.events.serialization.event import truncate_content
from openhands.llm.llm import LLM
from openhands.llm.metrics import Metrics, TokenUsage
from openhands.memory.view import View
from openhands.storage.files import FileStore
# note: RESUME is only available on web GUI
TRAFFIC_CONTROL_REMINDER = (
@@ -101,11 +101,13 @@ class AgentController:
self,
agent: Agent,
event_stream: EventStream,
max_iterations: int,
max_budget_per_task: float | None = None,
iteration_delta: int,
budget_per_task_delta: float | None = None,
agent_to_llm_config: dict[str, LLMConfig] | None = None,
agent_configs: dict[str, AgentConfig] | None = None,
sid: str | None = None,
file_store: FileStore | None = None,
user_id: str | None = None,
confirmation_mode: bool = False,
initial_state: State | None = None,
is_delegate: bool = False,
@@ -132,7 +134,10 @@ class AgentController:
status_callback: Optional callback function to handle status updates.
replay_events: A list of logs to replay.
"""
self.id = sid or event_stream.sid
self.user_id = user_id
self.file_store = file_store
self.agent = agent
self.headless_mode = headless_mode
self.is_delegate = is_delegate
@@ -146,29 +151,22 @@ class AgentController:
EventStreamSubscriber.AGENT_CONTROLLER, self.on_event, self.id
)
# filter out events that are not relevant to the agent
# so they will not be included in the agent history
self.agent_history_filter = EventFilter(
exclude_types=(
NullAction,
NullObservation,
ChangeAgentStateAction,
AgentStateChangedObservation,
),
exclude_hidden=True,
)
self.state_tracker = StateTracker(sid, file_store, user_id)
# state from the previous session, state from a parent agent, or a fresh state
self.set_initial_state(
state=initial_state,
max_iterations=max_iterations,
max_iterations=iteration_delta,
max_budget_per_task=budget_per_task_delta,
confirmation_mode=confirmation_mode,
)
self.max_budget_per_task = max_budget_per_task
self.state = self.state_tracker.state # TODO: share between manager and controller for backward compatibility; we should ideally move all state-related logic to the state manager
self.agent_to_llm_config = agent_to_llm_config if agent_to_llm_config else {}
self.agent_configs = agent_configs if agent_configs else {}
self._initial_max_iterations = max_iterations
self._initial_max_budget_per_task = max_budget_per_task
self._initial_max_iterations = iteration_delta
self._initial_max_budget_per_task = budget_per_task_delta
# stuck helper
self._stuck_detector = StuckDetector(self.state)
@@ -214,26 +212,7 @@ class AgentController:
if set_stop_state:
await self.set_agent_state_to(AgentState.STOPPED)
# we made history, now is the time to rewrite it!
# the final state.history will be used by external scripts like evals, tests, etc.
# history will need to be complete WITH delegates events
# like the regular agent history, it does not include:
# - 'hidden' events, events with hidden=True
# - backend events (the default 'filtered out' types, types in self.filter_out)
start_id = self.state.start_id if self.state.start_id >= 0 else 0
end_id = (
self.state.end_id
if self.state.end_id >= 0
else self.event_stream.get_latest_event_id()
)
self.state.history = list(
self.event_stream.search_events(
start_id=start_id,
end_id=end_id,
reverse=False,
filter=self.agent_history_filter,
)
)
self.state_tracker.close(self.event_stream)
# unsubscribe from the event stream
# only the root parent controller subscribes to the event stream
@@ -257,14 +236,6 @@ class AgentController:
extra_merged = {'session_id': self.id, **extra}
getattr(logger, level)(message, extra=extra_merged, stacklevel=2)
def update_state_before_step(self) -> None:
self.state.iteration += 1
self.state.local_iteration += 1
async def update_state_after_step(self) -> None:
# update metrics especially for cost. Use deepcopy to avoid it being modified by agent._reset()
self.state.local_metrics = copy.deepcopy(self.agent.llm.metrics)
async def _react_to_exception(
self,
e: Exception,
@@ -390,10 +361,17 @@ class AgentController:
# If we have a delegate that is not finished or errored, forward events to it
if self.delegate is not None:
delegate_state = self.delegate.get_agent_state()
if delegate_state not in (
AgentState.FINISHED,
AgentState.ERROR,
AgentState.REJECTED,
if (
delegate_state
not in (
AgentState.FINISHED,
AgentState.ERROR,
AgentState.REJECTED,
)
or 'RuntimeError: Agent reached maximum iteration.'
in self.delegate.state.last_error
or 'RuntimeError:Agent reached maximum budget for conversation'
in self.delegate.state.last_error
):
# Forward the event to delegate and skip parent processing
asyncio.get_event_loop().run_until_complete(
@@ -412,9 +390,7 @@ class AgentController:
if hasattr(event, 'hidden') and event.hidden:
return
# if the event is not filtered out, add it to the history
if self.agent_history_filter.include(event):
self.state.history.append(event)
self.state_tracker.add_history(event)
if isinstance(event, Action):
await self._handle_action(event)
@@ -457,11 +433,9 @@ class AgentController:
elif isinstance(action, AgentFinishAction):
self.state.outputs = action.outputs
self.state.metrics.merge(self.state.local_metrics)
await self.set_agent_state_to(AgentState.FINISHED)
elif isinstance(action, AgentRejectAction):
self.state.outputs = action.outputs
self.state.metrics.merge(self.state.local_metrics)
await self.set_agent_state_to(AgentState.REJECTED)
async def _handle_observation(self, observation: Observation) -> None:
@@ -481,8 +455,10 @@ class AgentController:
log_level, str(observation_to_print), extra={'msg_type': 'OBSERVATION'}
)
# TODO: these metrics come from the draft editor, and they get accumulated into the controller's state metrics and the agent's LLM metrics
# In the future, we should have a more principled way of sharing metrics across all LLM instances for a given conversation
if observation.llm_metrics is not None:
self.agent.llm.metrics.merge(observation.llm_metrics)
self.state_tracker.merge_metrics(observation.llm_metrics)
# this happens for runnable actions and microagent actions
if self._pending_action and self._pending_action.id == observation.cause:
@@ -496,9 +472,6 @@ class AgentController:
if self.state.agent_state == AgentState.USER_REJECTED:
await self.set_agent_state_to(AgentState.AWAITING_USER_INPUT)
return
elif isinstance(observation, ErrorObservation):
if self.state.agent_state == AgentState.ERROR:
self.state.metrics.merge(self.state.local_metrics)
async def _handle_message_action(self, action: MessageAction) -> None:
"""Handles message actions from the event stream.
@@ -516,22 +489,6 @@ class AgentController:
str(action),
extra={'msg_type': 'ACTION', 'event_source': EventSource.USER},
)
# Extend max iterations when the user sends a message (only in non-headless mode)
if self._initial_max_iterations is not None and not self.headless_mode:
self.state.max_iterations = (
self.state.iteration + self._initial_max_iterations
)
if (
self.state.traffic_control_state == TrafficControlState.THROTTLING
or self.state.traffic_control_state == TrafficControlState.PAUSED
):
self.state.traffic_control_state = TrafficControlState.NORMAL
self.log(
'debug',
f'Extended max iterations to {self.state.max_iterations} after user message',
)
# try to retrieve microagents relevant to the user message
# set pending_action while we search for information
# if this is the first user message for this agent, matters for the microagent info type
first_user_message = self._first_user_message()
@@ -605,36 +562,16 @@ class AgentController:
return
if new_state in (AgentState.STOPPED, AgentState.ERROR):
# sync existing metrics BEFORE resetting the agent
await self.update_state_after_step()
self.state.metrics.merge(self.state.local_metrics)
self._reset()
elif (
new_state == AgentState.RUNNING
and self.state.agent_state == AgentState.PAUSED
# TODO: do we really need both THROTTLING and PAUSED states, or can we clean up one of them completely?
and self.state.traffic_control_state == TrafficControlState.THROTTLING
):
# user intends to interrupt traffic control and let the task resume temporarily
self.state.traffic_control_state = TrafficControlState.PAUSED
# User has chosen to deliberately continue - lets double the max iterations
if (
self.state.iteration is not None
and self.state.max_iterations is not None
and self._initial_max_iterations is not None
and not self.headless_mode
):
if self.state.iteration >= self.state.max_iterations:
self.state.max_iterations += self._initial_max_iterations
if (
self.state.metrics.accumulated_cost is not None
and self.max_budget_per_task is not None
and self._initial_max_budget_per_task is not None
):
if self.state.metrics.accumulated_cost >= self.max_budget_per_task:
self.max_budget_per_task += self._initial_max_budget_per_task
elif self._pending_action is not None and (
# The user is resuming after an error: check the control limits and expand them if applicable
if (
self.state.agent_state == AgentState.ERROR
and new_state == AgentState.RUNNING
):
self.state_tracker.maybe_increase_control_flags_limits(self.headless_mode)
if self._pending_action is not None and (
new_state in (AgentState.USER_CONFIRMED, AgentState.USER_REJECTED)
):
if hasattr(self._pending_action, 'thought'):
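The inline "extend the limit when the user resumes" logic removed in this hunk now lives behind `state_tracker.maybe_increase_control_flags_limits(...)`, whose body is not part of this diff. The sketch below is one plausible reading, modeled on the removed iteration logic; `IterationFlagSketch`, `max_value`, `limit_increase_amount`, `reached_limit`, and `maybe_increase_limit` are assumed names, and only `current_value` appears in the diff itself.

```python
from dataclasses import dataclass


@dataclass
class IterationFlagSketch:
    """Hypothetical stand-in for the object behind `state.iteration_flag`."""

    limit_increase_amount: int  # assumed: the "delta" passed at construction
    current_value: int = 0
    max_value: int = 0  # assumed field name

    def reached_limit(self) -> bool:
        return self.current_value >= self.max_value

    def maybe_increase_limit(self, headless_mode: bool) -> None:
        # Mirrors the removed inline logic: extend only in interactive mode,
        # and only once the limit has actually been reached.
        if not headless_mode and self.reached_limit():
            self.max_value += self.limit_increase_amount


flag = IterationFlagSketch(limit_increase_amount=10, current_value=10, max_value=10)
flag.maybe_increase_limit(headless_mode=False)
extended_max = flag.max_value

headless_flag = IterationFlagSketch(
    limit_increase_amount=10, current_value=10, max_value=10
)
headless_flag.maybe_increase_limit(headless_mode=True)
headless_max = headless_flag.max_value
```

Keeping the delta on the flag itself would explain why the constructor parameters were renamed to `iteration_delta` and `budget_per_task_delta`: they describe the size of each extension rather than an absolute cap.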
@@ -659,6 +596,10 @@ class AgentController:
EventSource.ENVIRONMENT,
)
# Save state whenever agent state changes to ensure we don't lose state
# in case of crashes or unexpected circumstances
self.save_state()
def get_agent_state(self) -> AgentState:
"""Returns the current state of the agent.
@@ -686,19 +627,27 @@ class AgentController:
agent_cls: type[Agent] = Agent.get_cls(action.agent)
agent_config = self.agent_configs.get(action.agent, self.agent.config)
llm_config = self.agent_to_llm_config.get(action.agent, self.agent.llm.config)
llm = LLM(config=llm_config, retry_listener=self._notify_on_llm_retry)
# Make sure metrics are shared between parent and child for global accumulation
llm = LLM(
config=llm_config,
retry_listener=self.agent.llm.retry_listener,
metrics=self.state.metrics,
)
delegate_agent = agent_cls(llm=llm, config=agent_config)
# Take a snapshot of the current metrics before starting the delegate
state = State(
session_id=self.id.removesuffix('-delegate'),
inputs=action.inputs or {},
local_iteration=0,
iteration=self.state.iteration,
max_iterations=self.state.max_iterations,
iteration_flag=self.state.iteration_flag,
budget_flag=self.state.budget_flag,
delegate_level=self.state.delegate_level + 1,
# global metrics should be shared between parent and child
metrics=self.state.metrics,
# start on top of the stream
start_id=self.event_stream.get_latest_event_id() + 1,
parent_metrics_snapshot=self.state_tracker.get_metrics_snapshot(),
parent_iteration=self.state.iteration_flag.current_value,
)
self.log(
'debug',
@@ -708,10 +657,12 @@ class AgentController:
# Create the delegate with is_delegate=True so it does NOT subscribe directly
self.delegate = AgentController(
sid=self.id + '-delegate',
file_store=self.file_store,
user_id=self.user_id,
agent=delegate_agent,
event_stream=self.event_stream,
max_iterations=self.state.max_iterations,
max_budget_per_task=self.max_budget_per_task,
iteration_delta=self._initial_max_iterations,
budget_per_task_delta=self._initial_max_budget_per_task,
agent_to_llm_config=self.agent_to_llm_config,
agent_configs=self.agent_configs,
initial_state=state,
@@ -730,7 +681,13 @@ class AgentController:
delegate_state = self.delegate.get_agent_state()
# update iteration that is shared across agents
self.state.iteration = self.delegate.state.iteration
self.state.iteration_flag.current_value = (
self.delegate.state.iteration_flag.current_value
)
# Calculate delegate-specific metrics before closing the delegate
delegate_metrics = self.state.get_local_metrics()
logger.info(f'Local metrics for delegate: {delegate_metrics}')
# close the delegate controller before adding new events
asyncio.get_event_loop().run_until_complete(self.delegate.close())
@@ -743,8 +700,12 @@ class AgentController:
# prepare delegate result observation
# TODO: replace this with AI-generated summary (#2395)
# Filter out metrics from the formatted output to avoid clutter
display_outputs = {
k: v for k, v in delegate_outputs.items() if k != 'metrics'
}
formatted_output = ', '.join(
f'{key}: {value}' for key, value in delegate_outputs.items()
f'{key}: {value}' for key, value in display_outputs.items()
)
content = (
f'{self.delegate.agent.name} finishes task with {formatted_output}'
@@ -798,24 +759,16 @@ class AgentController:
self.log(
'debug',
f'LEVEL {self.state.delegate_level} LOCAL STEP {self.state.local_iteration} GLOBAL STEP {self.state.iteration}',
f'LEVEL {self.state.delegate_level} LOCAL STEP {self.state.get_local_step()} GLOBAL STEP {self.state.iteration_flag.current_value}',
extra={'msg_type': 'STEP'},
)
stop_step = False
if self.state.iteration >= self.state.max_iterations:
stop_step = await self._handle_traffic_control(
'iteration', self.state.iteration, self.state.max_iterations
)
if self.max_budget_per_task is not None:
current_cost = self.state.metrics.accumulated_cost
if current_cost > self.max_budget_per_task:
stop_step = await self._handle_traffic_control(
'budget', current_cost, self.max_budget_per_task
)
if stop_step:
logger.warning('Stopping agent due to traffic control')
return
# Ensure budget control flag is synchronized with the latest metrics.
# In the future, we should centralize the use of one LLM object per conversation.
# This will help us unify the cost for auto generating titles, running the condenser, etc.
# Because many services may touch the same LLM cost field, we sync the budget flag for the controller
# and check that we haven't exceeded budget BEFORE executing an agent step.
self.state_tracker.sync_budget_flag_with_metrics()
if self._is_stuck():
await self._react_to_exception(
@@ -823,7 +776,13 @@ class AgentController:
)
return
self.update_state_before_step()
try:
self.state_tracker.run_control_flags()
except Exception as e:
logger.warning('Control flag limits hit')
await self._react_to_exception(e)
return
action: Action = NullAction()
if self._replay_manager.should_replay():
@@ -894,60 +853,9 @@ class AgentController:
self.event_stream.add_event(action, action._source) # type: ignore [attr-defined]
await self.update_state_after_step()
log_level = 'info' if LOG_ALL_EVENTS else 'debug'
self.log(log_level, str(action), extra={'msg_type': 'ACTION'})
def _notify_on_llm_retry(self, retries: int, max: int) -> None:
if self.status_callback is not None:
msg_id = 'STATUS$LLM_RETRY'
self.status_callback(
'info', msg_id, f'Retrying LLM request, {retries} / {max}'
)
async def _handle_traffic_control(
self, limit_type: str, current_value: float, max_value: float
) -> bool:
"""Handles agent state after hitting the traffic control limit.
Args:
limit_type (str): The type of limit that was hit.
current_value (float): The current value of the limit.
max_value (float): The maximum value of the limit.
"""
stop_step = False
if self.state.traffic_control_state == TrafficControlState.PAUSED:
self.log(
'debug', 'Hitting traffic control, temporarily resume upon user request'
)
self.state.traffic_control_state = TrafficControlState.NORMAL
else:
self.state.traffic_control_state = TrafficControlState.THROTTLING
# Format values as integers for iterations, keep decimals for budget
if limit_type == 'iteration':
current_str = str(int(current_value))
max_str = str(int(max_value))
else:
current_str = f'{current_value:.2f}'
max_str = f'{max_value:.2f}'
if self.headless_mode:
e = RuntimeError(
f'Agent reached maximum {limit_type} in headless mode. '
f'Current {limit_type}: {current_str}, max {limit_type}: {max_str}'
)
await self._react_to_exception(e)
else:
e = RuntimeError(
f'Agent reached maximum {limit_type}. '
f'Current {limit_type}: {current_str}, max {limit_type}: {max_str}. '
)
# FIXME: this isn't really an exception--we should have a different path
await self._react_to_exception(e)
stop_step = True
return stop_step
@property
def _pending_action(self) -> Action | None:
"""Get the current pending action with time tracking.
@@ -1015,150 +923,26 @@ class AgentController:
self,
state: State | None,
max_iterations: int,
max_budget_per_task: float | None,
confirmation_mode: bool = False,
) -> None:
"""Sets the initial state for the agent, either from the previous session, or from a parent agent, or by creating a new one.
Args:
state: The state to initialize with, or None to create a new state.
max_iterations: The maximum number of iterations allowed for the task.
confirmation_mode: Whether to enable confirmation mode.
"""
# state can come from:
# - the previous session, in which case it has history
# - from a parent agent, in which case it has no history
# - None / a new state
# If state is None, we create a brand new state and still load the event stream so we can restore the history
if state is None:
self.state = State(
session_id=self.id.removesuffix('-delegate'),
inputs={},
max_iterations=max_iterations,
confirmation_mode=confirmation_mode,
)
self.state.start_id = 0
self.log(
'info',
f'AgentController {self.id} - created new state. start_id: {self.state.start_id}',
)
else:
self.state = state
if self.state.start_id <= -1:
self.state.start_id = 0
self.log(
'info',
f'AgentController {self.id} initializing history from event {self.state.start_id}',
)
):
self.state_tracker.set_initial_state(
self.id,
self.agent,
state,
max_iterations,
max_budget_per_task,
confirmation_mode,
)
# Always load from the event stream to avoid losing history
self._init_history()
self.state_tracker._init_history(
self.event_stream,
)
def get_trajectory(self, include_screenshots: bool = False) -> list[dict]:
# state history could be partially hidden/truncated before controller is closed
assert self._closed
return [
event_to_trajectory(event, include_screenshots)
for event in self.state.history
]
def _init_history(self) -> None:
"""Initializes the agent's history from the event stream.
The history is a list of events that:
- Excludes events of types listed in self.filter_out
- Excludes events with hidden=True attribute
- For delegate events (between AgentDelegateAction and AgentDelegateObservation):
- Excludes all events between the action and observation
- Includes the delegate action and observation themselves
"""
# define range of events to fetch
# delegates start with a start_id and initially won't find any events
# otherwise we're restoring a previous session
start_id = self.state.start_id if self.state.start_id >= 0 else 0
end_id = (
self.state.end_id
if self.state.end_id >= 0
else self.event_stream.get_latest_event_id()
)
# sanity check
if start_id > end_id + 1:
self.log(
'warning',
f'start_id {start_id} is greater than end_id + 1 ({end_id + 1}). History will be empty.',
)
self.state.history = []
return
events: list[Event] = []
# Get rest of history
events_to_add = list(
self.event_stream.search_events(
start_id=start_id,
end_id=end_id,
reverse=False,
filter=self.agent_history_filter,
)
)
events.extend(events_to_add)
# Find all delegate action/observation pairs
delegate_ranges: list[tuple[int, int]] = []
delegate_action_ids: list[int] = [] # stack of unmatched delegate action IDs
for event in events:
if isinstance(event, AgentDelegateAction):
delegate_action_ids.append(event.id)
# Note: we can get agent=event.agent and task=event.inputs.get('task','')
# if we need to track these in the future
elif isinstance(event, AgentDelegateObservation):
# Match with most recent unmatched delegate action
if not delegate_action_ids:
self.log(
'warning',
f'Found AgentDelegateObservation without matching action at id={event.id}',
)
continue
action_id = delegate_action_ids.pop()
delegate_ranges.append((action_id, event.id))
# Filter out events between delegate action/observation pairs
if delegate_ranges:
filtered_events: list[Event] = []
current_idx = 0
for start_id, end_id in sorted(delegate_ranges):
# Add events before delegate range
filtered_events.extend(
event for event in events[current_idx:] if event.id < start_id
)
# Add delegate action and observation
filtered_events.extend(
event for event in events if event.id in (start_id, end_id)
)
# Update index to after delegate range
current_idx = next(
(i for i, e in enumerate(events) if e.id > end_id), len(events)
)
# Add any remaining events after last delegate range
filtered_events.extend(events[current_idx:])
self.state.history = filtered_events
else:
self.state.history = events
# make sure history is in sync
self.state.start_id = start_id
return self.state_tracker.get_trajectory(include_screenshots)
def _handle_long_context_error(self) -> None:
# When context window is exceeded, keep roughly half of agent interactions
@@ -1359,7 +1143,7 @@ class AgentController:
action: The action to attach metrics to
"""
# Get metrics from agent LLM
agent_metrics = self.agent.llm.metrics
agent_metrics = self.state.metrics
# Get metrics from condenser LLM if it exists
condenser_metrics: TokenUsage | None = None
@@ -1390,10 +1174,10 @@ class AgentController:
# Log the metrics information for debugging
# Get the latest usage directly from the agent's metrics
latest_usage = None
if self.agent.llm.metrics.token_usages:
latest_usage = self.agent.llm.metrics.token_usages[-1]
if self.state.metrics.token_usages:
latest_usage = self.state.metrics.token_usages[-1]
accumulated_usage = self.agent.llm.metrics.accumulated_token_usage
accumulated_usage = self.state.metrics.accumulated_token_usage
self.log(
'debug',
f'Action metrics - accumulated_cost: {metrics.accumulated_cost}, '
@@ -1481,3 +1265,6 @@ class AgentController:
None,
)
return self._cached_first_user_message
def save_state(self):
self.state_tracker.save_state()
@@ -0,0 +1,95 @@
from __future__ import annotations
from dataclasses import dataclass
from typing import Generic, TypeVar
T = TypeVar(
'T', int, float
) # Type for the value (int for iterations, float for budget)
@dataclass
class ControlFlag(Generic[T]):
"""Base class for control flags that manage limits and state transitions."""
limit_increase_amount: T
current_value: T
max_value: T
headless_mode: bool = False
_hit_limit: bool = False
def reached_limit(self) -> bool:
"""Check if the limit has been reached.
Returns:
bool: True if the limit has been reached, False otherwise.
"""
raise NotImplementedError
def increase_limit(self, headless_mode: bool) -> None:
"""Expand the limit when needed."""
raise NotImplementedError
def step(self):
"""Perform one step of the flag; subclasses raise RuntimeError once the limit is reached."""
raise NotImplementedError
@dataclass
class IterationControlFlag(ControlFlag[int]):
"""Control flag for managing iteration limits."""
def reached_limit(self) -> bool:
"""Check if the iteration limit has been reached."""
self._hit_limit = self.current_value >= self.max_value
return self._hit_limit
def increase_limit(self, headless_mode: bool) -> None:
"""Expand the iteration limit by adding the initial value."""
if not headless_mode and self._hit_limit:
self.max_value += self.limit_increase_amount
self._hit_limit = False
def step(self):
if self.reached_limit():
raise RuntimeError(
f'Agent reached maximum iteration. '
f'Current iteration: {self.current_value}, max iteration: {self.max_value}'
)
# Increment the current value
self.current_value += 1
@dataclass
class BudgetControlFlag(ControlFlag[float]):
"""Control flag for managing budget limits."""
def reached_limit(self) -> bool:
"""Check if the budget limit has been reached."""
self._hit_limit = self.current_value >= self.max_value
return self._hit_limit
def increase_limit(self, headless_mode: bool) -> None:
"""Expand the budget limit by adding the initial value to the current value."""
if self._hit_limit:
self.max_value = self.current_value + self.limit_increase_amount
self._hit_limit = False
def step(self):
"""Check if we've reached the limit and update state accordingly.
Note: Unlike IterationControlFlag, this doesn't increment the value
as the budget is updated externally.
"""
if self.reached_limit():
current_str = f'{self.current_value:.2f}'
max_str = f'{self.max_value:.2f}'
raise RuntimeError(
f'Agent reached maximum budget for conversation. '
f'Current budget: {current_str}, max budget: {max_str}'
)
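Review note: the limit/extend cycle of the iteration flag above can be sketched in isolation. The class below is a simplified stand-in written for illustration (`IterationFlagSketch` and `run_until_limit` are hypothetical names, not part of OpenHands); it copies the `reached_limit` / `increase_limit` / `step` logic from the diff so the behavior can be traced without the rest of the controller.

```python
from dataclasses import dataclass


@dataclass
class IterationFlagSketch:
    """Simplified stand-in for IterationControlFlag (illustration only)."""

    limit_increase_amount: int
    current_value: int
    max_value: int
    _hit_limit: bool = False

    def reached_limit(self) -> bool:
        self._hit_limit = self.current_value >= self.max_value
        return self._hit_limit

    def increase_limit(self, headless_mode: bool) -> None:
        # Extend only when a limit was actually hit and we are not headless.
        if not headless_mode and self._hit_limit:
            self.max_value += self.limit_increase_amount
            self._hit_limit = False

    def step(self) -> None:
        if self.reached_limit():
            raise RuntimeError(
                f'Agent reached maximum iteration. '
                f'Current iteration: {self.current_value}, max iteration: {self.max_value}'
            )
        self.current_value += 1


def run_until_limit(flag: IterationFlagSketch) -> int:
    """Hypothetical helper: step until the flag raises, return successful steps."""
    steps = 0
    while True:
        try:
            flag.step()
        except RuntimeError:
            return steps
        steps += 1


flag = IterationFlagSketch(limit_increase_amount=2, current_value=0, max_value=2)
first_run = run_until_limit(flag)          # stops once current_value reaches max_value
flag.increase_limit(headless_mode=True)    # no-op: headless runs are never extended
headless_max = flag.max_value
flag.increase_limit(headless_mode=False)   # extends max_value by limit_increase_amount
second_run = run_until_limit(flag)
```

Under these assumptions, a headless run stays stopped at the original limit, while an interactive run gains another `limit_increase_amount` steps each time the limit is extended.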
@@ -8,6 +8,10 @@ from enum import Enum
from typing import Any
import openhands
from openhands.controller.state.control_flags import (
BudgetControlFlag,
IterationControlFlag,
)
from openhands.core.logger import openhands_logger as logger
from openhands.core.schema import AgentState
from openhands.events.action import (
@@ -20,7 +24,15 @@ from openhands.memory.view import View
from openhands.storage.files import FileStore
from openhands.storage.locations import get_conversation_agent_state_filename
RESUMABLE_STATES = [
AgentState.RUNNING,
AgentState.PAUSED,
AgentState.AWAITING_USER_INPUT,
AgentState.FINISHED,
]
# NOTE: this is deprecated
class TrafficControlState(str, Enum):
# default state, no rate limiting
NORMAL = 'normal'
@@ -32,14 +44,6 @@ class TrafficControlState(str, Enum):
PAUSED = 'paused'
RESUMABLE_STATES = [
AgentState.RUNNING,
AgentState.PAUSED,
AgentState.AWAITING_USER_INPUT,
AgentState.FINISHED,
]
@dataclass
class State:
"""
@@ -75,35 +79,43 @@ class State:
"""
session_id: str = ''
# global iteration for the current task
iteration: int = 0
# local iteration for the current subtask
local_iteration: int = 0
# max number of iterations for the current task
max_iterations: int = 100
iteration_flag: IterationControlFlag = field(
default_factory=lambda: IterationControlFlag(
limit_increase_amount=100, current_value=0, max_value=100
)
)
budget_flag: BudgetControlFlag | None = None
confirmation_mode: bool = False
history: list[Event] = field(default_factory=list)
inputs: dict = field(default_factory=dict)
outputs: dict = field(default_factory=dict)
agent_state: AgentState = AgentState.LOADING
resume_state: AgentState | None = None
traffic_control_state: TrafficControlState = TrafficControlState.NORMAL
# global metrics for the current task
metrics: Metrics = field(default_factory=Metrics)
# local metrics for the current subtask
local_metrics: Metrics = field(default_factory=Metrics)
# root agent has level 0, and every delegate increases the level by one
delegate_level: int = 0
# start_id and end_id track the range of events in history
start_id: int = -1
end_id: int = -1
delegates: dict[tuple[int, int], tuple[str, str]] = field(default_factory=dict)
# NOTE: this is used by the controller to track parent's metrics snapshot before delegation
parent_metrics_snapshot: Metrics | None = None
parent_iteration: int = 100
# NOTE: This will never be used by the controller, but it can be used by different
# evaluation tasks to store extra data needed to track the progress/state of the task.
extra_data: dict[str, Any] = field(default_factory=dict)
last_error: str = ''
# NOTE: deprecated args, kept here temporarily for backwards compatibility
# Will be removed in 30 days
iteration: int | None = None
local_iteration: int | None = None
max_iterations: int | None = None
traffic_control_state: TrafficControlState | None = None
local_metrics: Metrics | None = None
delegates: dict[tuple[int, int], tuple[str, str]] | None = None
def save_to_session(
self, sid: str, file_store: FileStore, user_id: str | None
) -> None:
@@ -165,6 +177,10 @@ class State:
# first state after restore
state.agent_state = AgentState.LOADING
# We don't need to clean up deprecated fields here
# They will be handled by __getstate__ when the state is saved again
return state
def __getstate__(self) -> dict:
@@ -177,15 +193,52 @@ class State:
state.pop('_history_checksum', None)
state.pop('_view', None)
# Remove deprecated fields before pickling
state.pop('iteration', None)
state.pop('local_iteration', None)
state.pop('max_iterations', None)
state.pop('traffic_control_state', None)
state.pop('local_metrics', None)
state.pop('delegates', None)
return state
def __setstate__(self, state: dict) -> None:
# Check if we're restoring from an older version (before control flags)
is_old_version = 'iteration' in state
# Convert old iteration tracking to new iteration_flag if needed
if is_old_version:
# Create iteration_flag from old values
max_iterations = state.get('max_iterations', 100)
current_iteration = state.get('iteration', 0)
# Add the iteration_flag to the state
state['iteration_flag'] = IterationControlFlag(
limit_increase_amount=max_iterations,
current_value=current_iteration,
max_value=max_iterations,
)
# Update the state
self.__dict__.update(state)
# We keep the deprecated fields for backward compatibility
# They will be removed by __getstate__ when the state is saved again
# make sure we always have the attribute history
if not hasattr(self, 'history'):
self.history = []
# Ensure we have default values for new fields if they're missing
if not hasattr(self, 'iteration_flag'):
self.iteration_flag = IterationControlFlag(
limit_increase_amount=100, current_value=0, max_value=100
)
if not hasattr(self, 'budget_flag'):
self.budget_flag = None
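Review note: the backward-compatibility path in `__getstate__` / `__setstate__` above can be exercised with a reduced model. The sketch below is an assumption-laden illustration (`FlagSketch` and `StateSketch` are hypothetical stand-ins with only the fields needed here); it calls the two hooks directly, which is what `pickle` does when saving and restoring an object.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FlagSketch:
    """Stand-in for IterationControlFlag (illustration only)."""

    limit_increase_amount: int
    current_value: int
    max_value: int


@dataclass
class StateSketch:
    """Stand-in for State with one new field and two deprecated ones."""

    iteration_flag: FlagSketch = field(
        default_factory=lambda: FlagSketch(100, 0, 100)
    )
    # Deprecated fields, kept only so that old saved states still load.
    iteration: Optional[int] = None
    max_iterations: Optional[int] = None

    def __getstate__(self) -> dict:
        state = self.__dict__.copy()
        # Drop deprecated fields so they are not written out again.
        state.pop('iteration', None)
        state.pop('max_iterations', None)
        return state

    def __setstate__(self, state: dict) -> None:
        # An old-format state is recognized by the presence of 'iteration'.
        if 'iteration' in state:
            max_iterations = state.get('max_iterations', 100)
            state['iteration_flag'] = FlagSketch(
                limit_increase_amount=max_iterations,
                current_value=state.get('iteration', 0),
                max_value=max_iterations,
            )
        self.__dict__.update(state)


# Simulate restoring a state that was saved before control flags existed.
restored = StateSketch.__new__(StateSketch)
restored.__setstate__({'iteration': 7, 'max_iterations': 50})

# Saving again drops the deprecated keys but keeps the converted flag.
saved = restored.__getstate__()
```

The design choice shown is a one-way migration: old fields are read once on load, converted into the flag, and silently dropped on the next save.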
def get_current_user_intent(self) -> tuple[str | None, list[str] | None]:
"""Returns the latest user message and images (if provided) that appear after a FinishAction, or the first (the task) if nothing was finished yet."""
last_user_message = None
@@ -223,6 +276,17 @@ class State:
],
}
def get_local_step(self):
if not self.parent_iteration:
return self.iteration_flag.current_value
return self.iteration_flag.current_value - self.parent_iteration
def get_local_metrics(self):
if not self.parent_metrics_snapshot:
return self.metrics
return self.metrics.diff(self.parent_metrics_snapshot)
@property
def view(self) -> View:
# Compute a simple checksum from the history to see if we can re-use any
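Review note: `get_local_step` and `get_local_metrics` above both recover delegate-local values by subtracting a parent baseline from a shared global value. A minimal sketch of that idea follows, assuming a cost-only metrics object; `MetricsSketch` is a hypothetical stand-in, and the real `Metrics.diff` may track more than cost.

```python
import copy
from dataclasses import dataclass


@dataclass
class MetricsSketch:
    """Hypothetical cost-only stand-in for Metrics (illustration only)."""

    accumulated_cost: float = 0.0

    def copy(self) -> 'MetricsSketch':
        return copy.deepcopy(self)

    def diff(self, baseline: 'MetricsSketch') -> 'MetricsSketch':
        # Everything accumulated since the baseline snapshot was taken.
        return MetricsSketch(self.accumulated_cost - baseline.accumulated_cost)


# Parent and delegate share ONE metrics object for global accumulation.
shared = MetricsSketch()
shared.accumulated_cost += 0.25           # cost incurred by the parent

# Snapshot taken at delegation time, before the delegate runs.
parent_snapshot = shared.copy()
parent_iteration = 3

shared.accumulated_cost += 0.5            # cost incurred by the delegate
global_iteration = 5                      # shared iteration counter after two delegate steps

local_metrics = shared.diff(parent_snapshot)
local_step = global_iteration - parent_iteration
```

Because the snapshot is a deep copy, later writes to the shared object do not change the baseline, so the difference isolates the delegate's own usage.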
@@ -0,0 +1,290 @@
from openhands.controller.agent import Agent
from openhands.controller.state.control_flags import (
BudgetControlFlag,
IterationControlFlag,
)
from openhands.controller.state.state import State
from openhands.core.logger import openhands_logger as logger
from openhands.events.action.agent import AgentDelegateAction, ChangeAgentStateAction
from openhands.events.action.empty import NullAction
from openhands.events.event import Event
from openhands.events.event_filter import EventFilter
from openhands.events.observation.agent import AgentStateChangedObservation
from openhands.events.observation.delegate import AgentDelegateObservation
from openhands.events.observation.empty import NullObservation
from openhands.events.serialization.event import event_to_trajectory
from openhands.events.stream import EventStream
from openhands.llm.metrics import Metrics
from openhands.storage.files import FileStore
class StateTracker:
"""Manages and synchronizes the state of an agent throughout its lifecycle.
It is responsible for:
1. Maintaining agent state persistence across sessions
2. Managing agent history by filtering and tracking relevant events (previously done in the agent controller)
3. Synchronizing metrics between the controller and LLM components
4. Updating control flags for budget and iteration limits
"""
def __init__(
self, sid: str | None, file_store: FileStore | None, user_id: str | None
):
self.sid = sid
self.file_store = file_store
self.user_id = user_id
# filter out events that are not relevant to the agent
# so they will not be included in the agent history
self.agent_history_filter = EventFilter(
exclude_types=(
NullAction,
NullObservation,
ChangeAgentStateAction,
AgentStateChangedObservation,
),
exclude_hidden=True,
)
def set_initial_state(
self,
id: str,
agent: Agent,
state: State | None,
max_iterations: int,
max_budget_per_task: float | None,
confirmation_mode: bool = False,
) -> None:
"""Sets the initial state for the agent, either from the previous session, or from a parent agent, or by creating a new one.
Args:
state: The state to initialize with, or None to create a new state.
max_iterations: The maximum number of iterations allowed for the task.
confirmation_mode: Whether to enable confirmation mode.
"""
# state can come from:
# - the previous session, in which case it has history
# - from a parent agent, in which case it has no history
# - None / a new state
# If state is None, we create a brand new state and still load the event stream so we can restore the history
if state is None:
self.state = State(
session_id=id.removesuffix('-delegate'),
inputs={},
iteration_flag=IterationControlFlag(
limit_increase_amount=max_iterations,
current_value=0,
max_value=max_iterations,
),
budget_flag=None
if not max_budget_per_task
else BudgetControlFlag(
limit_increase_amount=max_budget_per_task,
current_value=0,
max_value=max_budget_per_task,
),
confirmation_mode=confirmation_mode,
)
self.state.start_id = 0
logger.info(
f'AgentController {id} - created new state. start_id: {self.state.start_id}'
)
else:
self.state = state
if self.state.start_id <= -1:
self.state.start_id = 0
logger.info(
f'AgentController {id} initializing history from event {self.state.start_id}',
)
# Share the state metrics with the agent's LLM metrics
# This ensures that all accumulated metrics are always in sync between controller and llm
agent.llm.metrics = self.state.metrics
def _init_history(self, event_stream: EventStream) -> None:
"""Initializes the agent's history from the event stream.
The history is a list of events that:
- Excludes events of types listed in self.agent_history_filter
- Excludes events with hidden=True attribute
- For delegate events (between AgentDelegateAction and AgentDelegateObservation):
- Excludes all events between the action and observation
- Includes the delegate action and observation themselves
"""
# define range of events to fetch
# delegates start with a start_id and initially won't find any events
# otherwise we're restoring a previous session
start_id = self.state.start_id if self.state.start_id >= 0 else 0
end_id = (
self.state.end_id
if self.state.end_id >= 0
else event_stream.get_latest_event_id()
)
# sanity check
if start_id > end_id + 1:
logger.warning(
f'start_id {start_id} is greater than end_id + 1 ({end_id + 1}). History will be empty.',
)
self.state.history = []
return
events: list[Event] = []
# Get rest of history
events_to_add = list(
event_stream.search_events(
start_id=start_id,
end_id=end_id,
reverse=False,
filter=self.agent_history_filter,
)
)
events.extend(events_to_add)
# Find all delegate action/observation pairs
delegate_ranges: list[tuple[int, int]] = []
delegate_action_ids: list[int] = [] # stack of unmatched delegate action IDs
for event in events:
if isinstance(event, AgentDelegateAction):
delegate_action_ids.append(event.id)
# Note: we can get agent=event.agent and task=event.inputs.get('task','')
# if we need to track these in the future
elif isinstance(event, AgentDelegateObservation):
# Match with most recent unmatched delegate action
if not delegate_action_ids:
logger.warning(
f'Found AgentDelegateObservation without matching action at id={event.id}',
)
continue
action_id = delegate_action_ids.pop()
delegate_ranges.append((action_id, event.id))
# Filter out events between delegate action/observation pairs
if delegate_ranges:
filtered_events: list[Event] = []
current_idx = 0
for start_id, end_id in sorted(delegate_ranges):
# Add events before delegate range
filtered_events.extend(
event for event in events[current_idx:] if event.id < start_id
)
# Add delegate action and observation
filtered_events.extend(
event for event in events if event.id in (start_id, end_id)
)
# Update index to after delegate range
current_idx = next(
(i for i, e in enumerate(events) if e.id > end_id), len(events)
)
# Add any remaining events after last delegate range
filtered_events.extend(events[current_idx:])
self.state.history = filtered_events
else:
self.state.history = events
# make sure history is in sync
self.state.start_id = start_id
def close(self, event_stream: EventStream):
# we made history, now is the time to rewrite it!
# the final state.history will be used by external scripts like evals, tests, etc.
# history will need to be complete WITH delegates events
# like the regular agent history, it does not include:
# - 'hidden' events, events with hidden=True
# - backend events (the default 'filtered out' types, types in self.agent_history_filter)
start_id = self.state.start_id if self.state.start_id >= 0 else 0
end_id = (
self.state.end_id
if self.state.end_id >= 0
else event_stream.get_latest_event_id()
)
self.state.history = list(
event_stream.search_events(
start_id=start_id,
end_id=end_id,
reverse=False,
filter=self.agent_history_filter,
)
)
def add_history(self, event: Event):
# if the event is not filtered out, add it to the history
if self.agent_history_filter.include(event):
self.state.history.append(event)
def get_trajectory(self, include_screenshots: bool = False) -> list[dict]:
return [
event_to_trajectory(event, include_screenshots)
for event in self.state.history
]
def maybe_increase_control_flags_limits(self, headless_mode: bool):
# Iteration and budget extensions are independent of each other
# An error will be thrown if any one of the control flags has reached or exceeded its limit
self.state.iteration_flag.increase_limit(headless_mode)
if self.state.budget_flag:
self.state.budget_flag.increase_limit(headless_mode)
def get_metrics_snapshot(self):
"""
Deep copy of metrics
This serves as a snapshot for the parent's metrics at the time a delegate is created
It will be stored and used to compute local metrics for the delegate
(since delegates now accumulate metrics from where their parent left off)
"""
return self.state.metrics.copy()
def save_state(self):
"""
Saves the current state to the persistent store
"""
if self.sid and self.file_store:
self.state.save_to_session(self.sid, self.file_store, self.user_id)
def run_control_flags(self):
"""
Performs one step of the control flags
"""
self.state.iteration_flag.step()
if self.state.budget_flag:
self.state.budget_flag.step()
def sync_budget_flag_with_metrics(self):
"""
Ensures that budget flag is up to date with accumulated costs from llm completions
Budget flag will monitor for when budget is exceeded
"""
if self.state.budget_flag:
self.state.budget_flag.current_value = self.state.metrics.accumulated_cost
def merge_metrics(self, metrics: Metrics):
"""
Merges metrics with the state metrics
NOTE: this should be refactored in the future. We should have services (draft llm, title autocomplete, condenser, etc)
use their own LLMs, but the metrics object should be shared. This way we have one source of truth for accumulated costs from
all services
This would prevent having fragmented stores for metrics, and we don't have the burden of deciding where and how to store them
if we decide to introduce more specialized services that require llm completions
"""
self.state.metrics.merge(metrics)
if self.state.budget_flag:
self.state.budget_flag.current_value = self.state.metrics.accumulated_cost
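Review note: the delegate filtering that `_init_history` performs in this file (keep the delegate action and observation, drop everything in between) can be sketched on plain records. `Ev` and `filter_delegate_events` are hypothetical names for illustration; the matching and slicing logic mirrors the method in the diff, with event kinds represented as strings instead of event classes.

```python
from dataclasses import dataclass


@dataclass
class Ev:
    """Hypothetical minimal event: an id plus a kind tag."""

    id: int
    kind: str  # 'delegate_action', 'delegate_obs', or 'other'


def filter_delegate_events(events: list[Ev]) -> list[Ev]:
    # Pair each delegate observation with the most recent unmatched action.
    ranges: list[tuple[int, int]] = []
    unmatched: list[int] = []
    for e in events:
        if e.kind == 'delegate_action':
            unmatched.append(e.id)
        elif e.kind == 'delegate_obs':
            if not unmatched:
                continue  # observation without a matching action: skip pairing
            ranges.append((unmatched.pop(), e.id))

    if not ranges:
        return list(events)

    filtered: list[Ev] = []
    idx = 0
    for start_id, end_id in sorted(ranges):
        # Events before the delegate range.
        filtered.extend(e for e in events[idx:] if e.id < start_id)
        # The delegate action and observation themselves.
        filtered.extend(e for e in events if e.id in (start_id, end_id))
        # Continue after the delegate range.
        idx = next((i for i, e in enumerate(events) if e.id > end_id), len(events))
    filtered.extend(events[idx:])
    return filtered


events = [
    Ev(0, 'other'),
    Ev(1, 'delegate_action'),
    Ev(2, 'other'),
    Ev(3, 'other'),
    Ev(4, 'delegate_obs'),
    Ev(5, 'other'),
]
kept_ids = [e.id for e in filter_delegate_events(events)]
```

In this example the delegate's internal events (ids 2 and 3) are dropped from the parent's history, while the surrounding action/observation pair is kept.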
@@ -54,6 +54,7 @@ class MCPStdioServerConfig(BaseModel):
and set(self.env.items()) == set(other.env.items())
)
class MCPSHTTPServerConfig(BaseModel):
url: str
api_key: str | None = None
@@ -206,8 +206,8 @@ def create_controller(
controller = AgentController(
agent=agent,
max_iterations=config.max_iterations,
max_budget_per_task=config.max_budget_per_task,
iteration_delta=config.max_iterations,
budget_per_task_delta=config.max_budget_per_task,
agent_to_llm_config=config.get_agent_to_llm_config_map(),
event_stream=event_stream,
initial_state=initial_state,
@@ -39,6 +39,7 @@ class GitHubService(BaseGitService, GitService):
The class is instantiated via get_impl() in openhands.server.shared.py.
"""
BASE_URL = 'https://api.github.com'
token: SecretStr = SecretStr('')
refresh = False
@@ -508,7 +509,6 @@ class GitHubService(BaseGitService, GitService):
return response['html_url']
github_service_cls = os.environ.get(
'OPENHANDS_GITHUB_SERVICE_CLS',
'openhands.integrations.github.github_service.GitHubService',
@@ -32,6 +32,7 @@ class GitLabService(BaseGitService, GitService):
The class is instantiated via get_impl() in openhands.server.shared.py.
"""
BASE_URL = 'https://gitlab.com/api/v4'
GRAPHQL_URL = 'https://gitlab.com/api/graphql'
token: SecretStr = SecretStr('')
@@ -470,7 +471,6 @@ class GitLabService(BaseGitService, GitService):
target_branch: The name of the branch you want the changes merged into
title: The title of the merge request (optional, defaults to a generic title)
description: The description of the merge request (optional)
draft: Whether to create the MR as a draft (optional, defaults to False)
Returns:
- MR URL when successful
@@ -483,9 +483,7 @@ class GitLabService(BaseGitService, GitService):
# Set default description if none provided
if not description:
description = (
f'Merging changes from {source_branch} into {target_branch}'
)
description = f'Merging changes from {source_branch} into {target_branch}'
# Prepare the request payload
payload = {
@@ -500,11 +498,9 @@ class GitLabService(BaseGitService, GitService):
url=url, params=payload, method=RequestMethod.POST
)
return response['web_url']
gitlab_service_cls = os.environ.get(
'OPENHANDS_GITLAB_SERVICE_CLS',
'openhands.integrations.gitlab.gitlab_service.GitLabService',
@@ -1 +1 @@
{{ issue_comment }}
{{ issue_comment }}
@@ -1 +1 @@
Please fix issue number #{{ issue_number }} in your repository.
Please fix issue number #{{ issue_number }} in your repository.

Some files were not shown because too many files have changed in this diff.