Revamp AgentRejectAction and allow ManagerAgent to handle rejection (#1735)

* Fix AgentRejectAction handling

* Add ManagerAgent to integration tests

* Fix regenerate.sh

* Fix merge

* Update README for micro-agents

* Add test reject to regenerate.sh

* regenerate.sh: Add support for running a specific test and/or agent

* Refine reject schema, and allow ManagerAgent to handle reject

* Add test artifacts for test_simple_task_rejection

* Fix manager agent tests

* Fix README

* test_simple_task_rejection: check final agent state

* Integration test: exit if mock prompt not found

* Update test_simple_task_rejection tests

* Fix test_edits test artifacts after prompt update

* Fix ManagerAgent test_edits

* WIP

* Fix tests

* update test_edits for ManagerAgent

* Skip local sandbox for reject test

* Fix test comparison
Boxuan Li
2024-06-08 23:12:30 -07:00
committed by GitHub
parent c062468dcf
commit a9a2f10170
36 changed files with 675 additions and 18 deletions

View File

@@ -55,6 +55,15 @@ TEST_ONLY=true ./tests/integration/regenerate.sh
to run all integration tests until the first failure.
If you only want to run a specific test, set environment variable
`ONLY_TEST_NAME` to the test name. If you only want to run a specific agent,
set environment variable `ONLY_TEST_AGENT` to the agent. You could also use both,
e.g.
```bash
TEST_ONLY=true ONLY_TEST_NAME="test_simple_task_rejection" ONLY_TEST_AGENT="ManagerAgent" ./tests/integration/regenerate.sh
```
## Regenerate Integration Tests
When you make changes to an agent's prompt, the integration tests will fail. You'll need to regenerate them

View File

@@ -1,9 +1,9 @@
import io
import os
import re
import subprocess
import sys
import tempfile
import subprocess
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler
from threading import Thread
@@ -18,7 +18,8 @@ workspace_path = os.getenv('WORKSPACE_BASE')
def filter_out_symbols(input):
return ' '.join([char for char in input if char.isalnum()])
input = re.sub(r'\\n|\\r\\n|\\r|\s+', '', input)
return input
def get_log_id(prompt_log_name):
@@ -84,13 +85,19 @@ def get_mock_response(test_name: str, messages: str, id: int) -> str:
print('Mismatched Prompt File path', prompt_file_path)
print('---' * 10)
# Create a temporary file to store messages
with tempfile.NamedTemporaryFile(delete=False, mode='w', encoding='utf-8') as tmp_file:
with tempfile.NamedTemporaryFile(
delete=False, mode='w', encoding='utf-8'
) as tmp_file:
tmp_file_path = tmp_file.name
tmp_file.write(messages)
try:
# Use diff command to compare files and capture the output
result = subprocess.run(['diff', '-u', prompt_file_path, tmp_file_path], capture_output=True, text=True)
result = subprocess.run(
['diff', '-u', prompt_file_path, tmp_file_path],
capture_output=True,
text=True,
)
if result.returncode != 0:
print('Diff:')
print(result.stdout)
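
The `filter_out_symbols` hunk above appears to replace the old alphanumeric-only join with a regex that strips whitespace and escaped newlines before prompts are compared (the +/- markers were lost in this view, so which line is old and which is new is inferred from the hunk sizes and the added `import re`). A minimal standalone sketch of that normalization follows; the parameter name `text` and the sample strings are illustrative, not taken from the repository:

```python
import re


def filter_out_symbols(text: str) -> str:
    # Remove literal escaped newlines (backslash followed by n or r)
    # and every run of real whitespace, so that two prompts differing
    # only in line breaks or spacing compare equal.
    return re.sub(r'\\n|\\r\\n|\\r|\s+', '', text)


# Illustrative inputs: one holds an escaped newline, the other a real one.
recorded = 'Fix typos\\nin bad.txt'
live = 'Fix typos\n in  bad.txt'

print(filter_out_symbols(recorded) == filter_out_symbols(live))
```

Unlike the alphanumeric filter, this keeps punctuation and JSON delimiters, so a comparison still notices changed quotes or brackets in a prompt.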

View File

@@ -11,6 +11,15 @@ can do the actual work. A description of each agent is provided below. You MUST
select one of the delegates below to move towards accomplishing the task, and you MUST
provide the correct inputs for the delegate you select.
Note: the delegated agent either returns "finish" or "reject".
- If the action is "finish", but the full task is not done yet, you should
continue to delegate to one of the agents below to until the full task is finished.
- If the action is "reject", it means the delegated agent is not capable of the
task you send to. You should revisit the input you send to the delegate, and consider
whether any other delegate would be able to solve the task. If you cannot find
a proper delegate agent, or the delegate attempts keep failing, call the `reject`
action.
## Agents
### CoderAgent
@@ -62,6 +71,9 @@ They are time-ordered, with your most recent action at the bottom.
[[{"source": "user", "action": "message", "args": {"content": "Fix typos in bad.txt. Do not ask me for confirmation at any point.", "wait_for_response": false}}, {"observation": "null", "content": "", "extras": {}}]]
If the last item in the history is an error, you should try to fix it. If you
cannot fix it, call the `reject` action.
## Available Actions
* `delegate` - send a task to another agent from the list provided. Arguments:
* `agent` - the agent to which the task is delegated. MUST match a name in the list of agents provided.
@@ -70,6 +82,9 @@ They are time-ordered, with your most recent action at the bottom.
* `finish` - if you're absolutely certain that you've completed your task, use the finish action to stop working. Arguments:
* `outputs` - a dictionary representing the outputs of your task, if any
* `reject` - reject the task. Arguments:
* `outputs` - a dictionary with only a `reason` attribute
## Format
Your response MUST be in JSON format. It must be an object, and it must contain two fields:

View File

@@ -52,7 +52,7 @@ as well as observations you've made. This only includes the MOST RECENT
actions and observations--more may have happened before that.
They are time-ordered, with your most recent action at the bottom.
[[{"source": "agent", "action": "read", "args": {"path": "./bad.txt", "start": 0, "end": -1, "thought": ""}}, {"source": "agent", "observation": "read", "content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n", "extras": {"path": "./bad.txt"}}], [{"source": "agent", "action": "message", "args": {"content": "The content of 'bad.txt' has been read. The next step is to identify and fix the typos.", "wait_for_response": false}}, {"observation": "null", "content": "", "extras": {}}]]
[[{"source": "agent", "action": "read", "args": {"path": "./bad.txt", "start": 0, "end": -1, "thought": ""}}, {"source": "agent", "observation": "read", "content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n", "extras": {"path": "./bad.txt"}}], [{"source": "agent", "action": "message", "args": {"content": "The content of 'bad.txt' has been read. The following typos have been identified:\n1. 'typoo' should be 'typo'\n2. 'mor' should be 'more'\nI will now proceed to fix these typos and overwrite the file with the corrected content.", "wait_for_response": false}}, {"observation": "null", "content": "", "extras": {}}]]
## Format
Your response MUST be in JSON format. It must be an object, and it must contain two fields:

View File

@@ -52,7 +52,7 @@ as well as observations you've made. This only includes the MOST RECENT
actions and observations--more may have happened before that.
They are time-ordered, with your most recent action at the bottom.
[[{"source": "agent", "action": "read", "args": {"path": "./bad.txt", "start": 0, "end": -1, "thought": ""}}, {"source": "agent", "observation": "read", "content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n", "extras": {"path": "./bad.txt"}}], [{"source": "agent", "action": "message", "args": {"content": "The content of 'bad.txt' has been read. The next step is to identify and fix the typos.", "wait_for_response": false}}, {"observation": "null", "content": "", "extras": {}}], [{"source": "agent", "action": "write", "args": {"path": "./bad.txt", "content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n", "start": 0, "end": -1, "thought": ""}}, {"source": "agent", "observation": "write", "content": "", "extras": {"path": "./bad.txt"}}]]
[[{"source": "agent", "action": "read", "args": {"path": "./bad.txt", "start": 0, "end": -1, "thought": ""}}, {"source": "agent", "observation": "read", "content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n", "extras": {"path": "./bad.txt"}}], [{"source": "agent", "action": "message", "args": {"content": "The content of 'bad.txt' has been read. The following typos have been identified:\n1. 'typoo' should be 'typo'\n2. 'mor' should be 'more'\nI will now proceed to fix these typos and overwrite the file with the corrected content.", "wait_for_response": false}}, {"observation": "null", "content": "", "extras": {}}], [{"source": "agent", "action": "write", "args": {"path": "./bad.txt", "content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n", "start": 0, "end": -1, "thought": ""}}, {"source": "agent", "observation": "write", "content": "", "extras": {"path": "./bad.txt"}}]]
## Format
Your response MUST be in JSON format. It must be an object, and it must contain two fields:

View File

@@ -11,6 +11,15 @@ can do the actual work. A description of each agent is provided below. You MUST
select one of the delegates below to move towards accomplishing the task, and you MUST
provide the correct inputs for the delegate you select.
Note: the delegated agent either returns "finish" or "reject".
- If the action is "finish", but the full task is not done yet, you should
continue to delegate to one of the agents below to until the full task is finished.
- If the action is "reject", it means the delegated agent is not capable of the
task you send to. You should revisit the input you send to the delegate, and consider
whether any other delegate would be able to solve the task. If you cannot find
a proper delegate agent, or the delegate attempts keep failing, call the `reject`
action.
## Agents
### CoderAgent
@@ -60,7 +69,10 @@ as well as observations you've made. This only includes the MOST RECENT
actions and observations--more may have happened before that.
They are time-ordered, with your most recent action at the bottom.
[[{"source": "user", "action": "message", "args": {"content": "Fix typos in bad.txt. Do not ask me for confirmation at any point.", "wait_for_response": false}}, {"observation": "null", "content": "", "extras": {}}], [{"source": "agent", "action": "delegate", "args": {"agent": "TypoFixerAgent", "inputs": {"task": "Fix typos in bad.txt"}, "thought": ""}}, {"observation": "null", "content": "", "extras": {}}], [{"action": "null", "args": {}}, {"source": "agent", "observation": "delegate", "content": "", "extras": {"outputs": {"summary": {"file": "./bad.txt", "typos_fixed": [{"original": "typoo", "fixed": "typo"}, {"original": "mor", "fixed": "more"}]}}}}]]
[[{"source": "user", "action": "message", "args": {"content": "Fix typos in bad.txt. Do not ask me for confirmation at any point.", "wait_for_response": false}}, {"observation": "null", "content": "", "extras": {}}], [{"source": "agent", "action": "delegate", "args": {"agent": "TypoFixerAgent", "inputs": {"task": "Fix typos in bad.txt"}, "thought": ""}}, {"observation": "null", "content": "", "extras": {}}], [{"action": "null", "args": {}}, {"source": "agent", "observation": "delegate", "content": "", "extras": {"outputs": {"summary": {"file": "bad.txt", "typos_fixed": [{"original": "typoo", "corrected": "typo"}, {"original": "mor", "corrected": "more"}]}}}}]]
If the last item in the history is an error, you should try to fix it. If you
cannot fix it, call the `reject` action.
## Available Actions
* `delegate` - send a task to another agent from the list provided. Arguments:
@@ -70,6 +82,9 @@ They are time-ordered, with your most recent action at the bottom.
* `finish` - if you're absolutely certain that you've completed your task, use the finish action to stop working. Arguments:
* `outputs` - a dictionary representing the outputs of your task, if any
* `reject` - reject the task. Arguments:
* `outputs` - a dictionary with only a `reason` attribute
## Format
Your response MUST be in JSON format. It must be an object, and it must contain two fields:

View File

@@ -1,7 +1,7 @@
{
"action": "message",
"args": {
"content": "The content of 'bad.txt' has been read. The next step is to identify and fix the typos.",
"content": "The content of 'bad.txt' has been read. The following typos have been identified:\n1. 'typoo' should be 'typo'\n2. 'mor' should be 'more'\nI will now proceed to fix these typos and overwrite the file with the corrected content.",
"wait_for_response": false
}
}

View File

@@ -3,15 +3,15 @@
"args": {
"outputs": {
"summary": {
"file": "./bad.txt",
"file": "bad.txt",
"typos_fixed": [
{
"original": "typoo",
"fixed": "typo"
"corrected": "typo"
},
{
"original": "mor",
"fixed": "more"
"corrected": "more"
}
]
}

View File

@@ -3,15 +3,15 @@
"action": "finish",
"args": {
"outputs": {
"file": "./bad.txt",
"file": "bad.txt",
"typos_fixed": [
{
"original": "typoo",
"fixed": "typo"
"corrected": "typo"
},
{
"original": "mor",
"fixed": "more"
"corrected": "more"
}
]
}

View File

@@ -0,0 +1,94 @@
----------
# Task
You are in charge of accomplishing the following task:
Write a git commit message for the current staging area. Do not ask me for confirmation at any point.
In order to accomplish this goal, you must delegate tasks to one or more agents, who
can do the actual work. A description of each agent is provided below. You MUST
select one of the delegates below to move towards accomplishing the task, and you MUST
provide the correct inputs for the delegate you select.
Note: the delegated agent either returns "finish" or "reject".
- If the action is "finish", but the full task is not done yet, you should
continue to delegate to one of the agents below to until the full task is finished.
- If the action is "reject", it means the delegated agent is not capable of the
task you send to. You should revisit the input you send to the delegate, and consider
whether any other delegate would be able to solve the task. If you cannot find
a proper delegate agent, or the delegate attempts keep failing, call the `reject`
action.
## Agents
### CoderAgent
Given a particular task, and a detailed description of the codebase, accomplishes the task
#### Inputs
{"task": "string", "summary": "string"}
### CommitWriterAgent
Write a git commit message for files in the git staging area
#### Inputs
{}
### MathAgent
Solves simple and complex math problems using python
#### Inputs
{"task": "string"}
### PostgresAgent
Writes and maintains PostgreSQL migrations
#### Inputs
{"task": "string"}
### RepoExplorerAgent
Generates a detailed summary of an existing codebase
#### Inputs
{}
### StudyRepoForTaskAgent
Given a particular task, finds and describes all relevant parts of the codebase
#### Inputs
{"task": "string"}
### TypoFixerAgent
Fixes typos in files in the current working directory
#### Inputs
{"task": "string"}
### VerifierAgent
Given a particular task, verifies that the task has been completed
#### Inputs
{"task": "string"}
## History
Here is a recent history of actions you've taken in service of this plan,
as well as observations you've made. This only includes the MOST RECENT
actions and observations--more may have happened before that.
They are time-ordered, with your most recent action at the bottom.
[[{"source": "user", "action": "message", "args": {"content": "Write a git commit message for the current staging area. Do not ask me for confirmation at any point.", "wait_for_response": false}}, {"observation": "null", "content": "", "extras": {}}]]
If the last item in the history is an error, you should try to fix it. If you
cannot fix it, call the `reject` action.
## Available Actions
* `delegate` - send a task to another agent from the list provided. Arguments:
* `agent` - the agent to which the task is delegated. MUST match a name in the list of agents provided.
* `inputs` - a dictionary of input parameters to the agent, as specified in the list
* `finish` - if you're absolutely certain that you've completed your task, use the finish action to stop working. Arguments:
* `outputs` - a dictionary representing the outputs of your task, if any
* `reject` - reject the task. Arguments:
* `outputs` - a dictionary with only a `reason` attribute
## Format
Your response MUST be in JSON format. It must be an object, and it must contain two fields:
* `action`, which is one of the actions specified here
* `args`, which is a map of key-value pairs, specifying the arguments for that action
You MUST NOT include any other text besides the JSON response

View File

@@ -0,0 +1,50 @@
----------
# Task
You are a responsible software engineer and always write good commit messages.
Please analyze the diff in the staging area, understand the context and content
of the updates from the diff only. Identify key elements like:
- Which files are affected?
- What types of changes were made (e.g., new features, bug fixes, refactoring, documentation, testing)?
Then you should generate a commit message that succinctly summarizes the staged
changes. The commit message should include:
- A summary line that clearly states the purpose of the changes.
- Optionally, a detailed description if the changes are complex or need further explanation.
You should find the diff using `git diff --cached`, compile a commit message,
and call the `finish` action with `outputs.answer` set to the answer. If current
repo is not a valid git repo, or there is no diff in the staging area, please call
the `reject` action.
## History
Here is a recent history of actions you've taken in service of this plan,
as well as observations you've made. This only includes the MOST RECENT
actions and observations--more may have happened before that.
They are time-ordered, with your most recent action at the bottom.
[]
If the last item in the history is an error, you should try to fix it.
## Available Actions
* `run` - runs a command on the command line in a Linux shell. Arguments:
* `command` - the command to run
* `background` - if true, run the command in the background, so that other commands can be run concurrently. Useful for e.g. starting a server. You won't be able to see the logs. You don't need to end the command with `&`, just set this to true.
* `reject` - reject the task. Arguments:
* `outputs` - a dictionary with only a `reason` attribute
* `finish` - if you're absolutely certain that you've completed your task, use the finish action to stop working. Arguments:
* `outputs` - a dictionary representing the outputs of your task, if any
## Format
Your response MUST be in JSON format. It must be an object, and it must contain two fields:
* `action`, which is one of the actions specified here
* `args`, which is a map of key-value pairs, specifying the arguments for that action
You MUST NOT include any other text besides the JSON response

File diff suppressed because one or more lines are too long (5 files)

View File

@@ -0,0 +1,94 @@
----------
# Task
You are in charge of accomplishing the following task:
Write a git commit message for the current staging area. Do not ask me for confirmation at any point.
In order to accomplish this goal, you must delegate tasks to one or more agents, who
can do the actual work. A description of each agent is provided below. You MUST
select one of the delegates below to move towards accomplishing the task, and you MUST
provide the correct inputs for the delegate you select.
Note: the delegated agent either returns "finish" or "reject".
- If the action is "finish", but the full task is not done yet, you should
continue to delegate to one of the agents below to until the full task is finished.
- If the action is "reject", it means the delegated agent is not capable of the
task you send to. You should revisit the input you send to the delegate, and consider
whether any other delegate would be able to solve the task. If you cannot find
a proper delegate agent, or the delegate attempts keep failing, call the `reject`
action.
## Agents
### CoderAgent
Given a particular task, and a detailed description of the codebase, accomplishes the task
#### Inputs
{"task": "string", "summary": "string"}
### CommitWriterAgent
Write a git commit message for files in the git staging area
#### Inputs
{}
### MathAgent
Solves simple and complex math problems using python
#### Inputs
{"task": "string"}
### PostgresAgent
Writes and maintains PostgreSQL migrations
#### Inputs
{"task": "string"}
### RepoExplorerAgent
Generates a detailed summary of an existing codebase
#### Inputs
{}
### StudyRepoForTaskAgent
Given a particular task, finds and describes all relevant parts of the codebase
#### Inputs
{"task": "string"}
### TypoFixerAgent
Fixes typos in files in the current working directory
#### Inputs
{"task": "string"}
### VerifierAgent
Given a particular task, verifies that the task has been completed
#### Inputs
{"task": "string"}
## History
Here is a recent history of actions you've taken in service of this plan,
as well as observations you've made. This only includes the MOST RECENT
actions and observations--more may have happened before that.
They are time-ordered, with your most recent action at the bottom.
[[{"source": "user", "action": "message", "args": {"content": "Write a git commit message for the current staging area. Do not ask me for confirmation at any point.", "wait_for_response": false}}, {"observation": "null", "content": "", "extras": {}}], [{"source": "agent", "action": "delegate", "args": {"agent": "CommitWriterAgent", "inputs": {}, "thought": ""}}, {"observation": "null", "content": "", "extras": {}}], [{"action": "null", "args": {}}, {"source": "agent", "observation": "delegate", "content": "", "extras": {"outputs": {"reason": "The current repository is not a valid git repository or there is no diff in the staging area."}}}]]
If the last item in the history is an error, you should try to fix it. If you
cannot fix it, call the `reject` action.
## Available Actions
* `delegate` - send a task to another agent from the list provided. Arguments:
* `agent` - the agent to which the task is delegated. MUST match a name in the list of agents provided.
* `inputs` - a dictionary of input parameters to the agent, as specified in the list
* `finish` - if you're absolutely certain that you've completed your task, use the finish action to stop working. Arguments:
* `outputs` - a dictionary representing the outputs of your task, if any
* `reject` - reject the task. Arguments:
* `outputs` - a dictionary with only a `reason` attribute
## Format
Your response MUST be in JSON format. It must be an object, and it must contain two fields:
* `action`, which is one of the actions specified here
* `args`, which is a map of key-value pairs, specifying the arguments for that action
You MUST NOT include any other text besides the JSON response

View File

@@ -0,0 +1,9 @@
```json
{
"action": "delegate",
"args": {
"agent": "CommitWriterAgent",
"inputs": {}
}
}
```

View File

@@ -0,0 +1,8 @@
```json
{
"action": "run",
"args": {
"command": "git diff --cached"
}
}
```

View File

@@ -0,0 +1,9 @@
```json
{
"action": "run",
"args": {
"command": "git diff --cached",
"background": false
}
}
```

View File

@@ -0,0 +1,9 @@
```json
{
"action": "run",
"args": {
"command": "git diff --staged",
"background": false
}
}
```

View File

@@ -0,0 +1,8 @@
```json
{
"action": "reject",
"args": {
"reason": "The current repository is not a valid git repository or there is no diff in the staging area."
}
}
```

View File

@@ -0,0 +1,8 @@
```json
{
"action": "reject",
"args": {
"reason": "The current repository is not a valid git repository or there is no diff in the staging area."
}
}
```

View File

@@ -0,0 +1,10 @@
```json
{
"action": "reject",
"args": {
"outputs": {
"reason": "The current repository is not a valid git repository or there is no diff in the staging area."
}
}
}
```

View File

@@ -0,0 +1,10 @@
```json
{
"action": "reject",
"args": {
"outputs": {
"reason": "The current repository is not a valid git repository or there is no diff in the staging area."
}
}
}
```
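
The mock responses above show two shapes for `reject`: earlier ones put `reason` directly under `args`, while the later ones nest it under `args.outputs`, which matches the prompt's "a dictionary with only a `reason` attribute". As a rough sketch of checking the refined shape, the helper below is hypothetical (`is_valid_reject` is not an OpenDevin function) and assumes the surrounding ```json fence has already been stripped from the response:

```python
import json


def is_valid_reject(raw: str) -> bool:
    """Hypothetical check: does `raw` match the refined reject schema?"""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict) or payload.get('action') != 'reject':
        return False
    args = payload.get('args')
    if not isinstance(args, dict):
        return False
    outputs = args.get('outputs')
    # `outputs` must be a dict whose only key is `reason`, holding a string.
    return (
        isinstance(outputs, dict)
        and set(outputs) == {'reason'}
        and isinstance(outputs['reason'], str)
    )


nested = '{"action": "reject", "args": {"outputs": {"reason": "not a git repo"}}}'
flat = '{"action": "reject", "args": {"reason": "not a git repo"}}'
print(is_valid_reject(nested), is_valid_reject(flat))
```

Whether the real parser rejects the flat form or tolerates it is not visible in this diff; the sketch only illustrates the stricter reading of the schema.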

View File

@@ -25,6 +25,7 @@ tasks=(
"Fix typos in bad.txt."
"Write a shell script 'hello.sh' that prints 'hello'."
"Use Jupyter IPython to write a text file containing 'hello world' to '/workspace/test.txt'."
"Write a git commit message for the current staging area."
"Install and import pymsgbox==1.0.9 and print it's version in /workspace/test.txt."
"Browse localhost:8000, and tell me the ultimate answer to life."
)
@@ -32,6 +33,7 @@ test_names=(
"test_edits"
"test_write_simple_script"
"test_ipython"
"test_simple_task_rejection"
"test_ipython_module"
"test_browse_internet"
)
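
The two arrays above are matched by position, which is why the new task and `test_simple_task_rejection` are both inserted at the same index. The Python sketch below mirrors that pairing and a name filter in the spirit of `ONLY_TEST_NAME`. How `regenerate.sh` actually applies the variable is not shown in this diff, so `select_tests` is a hypothetical helper, not a port of the script:

```python
import os
from typing import List, Optional, Tuple

# Entries copied from the hunk above, in the same order.
TASKS = [
    'Fix typos in bad.txt.',
    "Write a shell script 'hello.sh' that prints 'hello'.",
    "Use Jupyter IPython to write a text file containing 'hello world' to '/workspace/test.txt'.",
    'Write a git commit message for the current staging area.',
    "Install and import pymsgbox==1.0.9 and print it's version in /workspace/test.txt.",
    'Browse localhost:8000, and tell me the ultimate answer to life.',
]
TEST_NAMES = [
    'test_edits',
    'test_write_simple_script',
    'test_ipython',
    'test_simple_task_rejection',
    'test_ipython_module',
    'test_browse_internet',
]


def select_tests(only_test_name: Optional[str] = None) -> List[Tuple[str, str]]:
    """Pair names with tasks by index; optionally keep a single test."""
    if len(TASKS) != len(TEST_NAMES):
        raise ValueError('tasks and test_names must have the same length')
    pairs = list(zip(TEST_NAMES, TASKS))
    if only_test_name:
        pairs = [pair for pair in pairs if pair[0] == only_test_name]
    return pairs


print(select_tests(os.environ.get('ONLY_TEST_NAME')))
```

The length check makes the positional coupling explicit: adding a task without a matching test name fails loudly instead of silently shifting every later pair.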

View File

@@ -10,6 +10,7 @@ from opendevin.core.main import main
from opendevin.core.schema import AgentState
from opendevin.events.action import (
AgentFinishAction,
AgentRejectAction,
)
workspace_base = os.getenv('WORKSPACE_BASE')
@@ -23,6 +24,10 @@ workspace_base = os.getenv('WORKSPACE_BASE')
os.getenv('AGENT') == 'CodeActAgent' and os.getenv('SANDBOX_TYPE').lower() != 'ssh',
reason='CodeActAgent only supports ssh sandbox which is stateful',
)
@pytest.mark.skipif(
os.getenv('AGENT') == 'ManagerAgent',
reason='Manager agent is not capable of finishing this in reasonable steps yet',
)
def test_write_simple_script():
task = "Write a shell script 'hello.sh' that prints 'hello'. Do not ask me for confirmation at any point."
final_state: State = asyncio.run(main(task, exit_on_message=True))
@@ -109,6 +114,23 @@ def test_ipython():
), f'Expected content "hello world", but got "{content.strip()}"'
@pytest.mark.skipif(
os.getenv('AGENT') != 'ManagerAgent',
reason='Currently, only ManagerAgent supports task rejection',
)
@pytest.mark.skipif(
os.getenv('SANDBOX_TYPE') == 'local',
reason='FIXME: local sandbox does not capture stderr',
)
def test_simple_task_rejection():
# Give an impossible task to do: cannot write a commit message because
# the workspace is not a git repo
task = 'Write a git commit message for the current staging area. Do not ask me for confirmation at any point.'
final_state: State = asyncio.run(main(task))
assert final_state.agent_state == AgentState.STOPPED
assert isinstance(final_state.history[-1][0], AgentRejectAction)
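
The two assertions in `test_simple_task_rejection` rely on `history` being a list of (action, observation) pairs, so `history[-1][0]` is the last action taken. The self-contained sketch below illustrates that check with stand-in classes; `AgentState`, `AgentRejectAction`, `NullObservation`, and `State` here are simplified placeholders defined for this example, not the OpenDevin implementations:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, List, Tuple


class AgentState(Enum):  # stand-in for opendevin.core.schema.AgentState
    RUNNING = 'running'
    STOPPED = 'stopped'


@dataclass
class AgentRejectAction:  # stand-in carrying only the `outputs` dict
    outputs: dict = field(default_factory=dict)


@dataclass
class NullObservation:  # stand-in for an empty observation
    content: str = ''


@dataclass
class State:  # stand-in exposing only the fields the test reads
    agent_state: AgentState
    history: List[Tuple[Any, Any]] = field(default_factory=list)


def ended_in_rejection(state: State) -> bool:
    """True if the run stopped and its last recorded action was a reject."""
    if not state.history:
        return False
    last_action, _last_observation = state.history[-1]
    return state.agent_state == AgentState.STOPPED and isinstance(
        last_action, AgentRejectAction
    )


final_state = State(
    agent_state=AgentState.STOPPED,
    history=[(AgentRejectAction({'reason': 'not a git repo'}), NullObservation())],
)
print(ended_in_rejection(final_state))
```

Checking the agent state alongside the last action distinguishes a deliberate rejection from a run that merely stopped for another reason.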
@pytest.mark.skipif(
os.getenv('AGENT') != 'CodeActAgent',
reason='currently only CodeActAgent defaults to have IPython (Jupyter) execution',