Default to gpt-4o (#2158)

* Default to gpt-4o
* Fix default

Bump @nextui-org/react from 2.4.0 to 2.4.1 in /frontend (#2161)
2026-04-29 03:00:45 -04:00 · 2024-05-31 14:44:07 +00:00 · 2024-05-31 14:32:21 +00:00 · 2024-05-31 14:30:14 +00:00 · 2024-05-31 12:32:17 +08:00 · 2024-05-31 03:12:27 +00:00
160 changed files with 8302 additions and 1839 deletions
@@ -10,6 +10,9 @@ on:
    - main
  pull_request:

+env:
+  PERSIST_SANDBOX : "false"
+
 jobs:
  test:
    runs-on: ubuntu-latest
@@ -15,6 +15,9 @@ on:
      - 'evaluation/**'
  pull_request:

+env:
+  PERSIST_SANDBOX : "false"
+
 jobs:
  integration-tests-on-linux:
    name: Integration Tests on Linux
@@ -15,6 +15,9 @@ on:
      - 'evaluation/**'
  pull_request:

+env:
+  PERSIST_SANDBOX : "false"
+
 jobs:
  test-on-macos:
    name: Test on macOS
@@ -5,8 +5,8 @@ This guide is for people working on OpenDevin and editing the source code.

 ### 1. Requirements
 * Linux, Mac OS, or [WSL on Windows](https://learn.microsoft.com/en-us/windows/wsl/install)
-* [Docker](https://docs.docker.com/engine/install/)(For those on MacOS, make sure to allow the default Docker socket to be used from advanced settings!)
-* [Python](https://www.python.org/downloads/) >= 3.11
+* [Docker](https://docs.docker.com/engine/install/) (For those on MacOS, make sure to allow the default Docker socket to be used from advanced settings!)
+* [Python](https://www.python.org/downloads/) = 3.11
 * [NodeJS](https://nodejs.org/en/download/package-manager) >= 18.17.1
 * [Poetry](https://python-poetry.org/docs/#installing-with-the-official-installer) >= 1.8

@@ -45,6 +45,7 @@ To configure the LM of your choice, follow these steps:
   make setup-config
   ```
   This command will prompt you to enter the LLM API key, model name, and other variables ensuring that OpenDevin is tailored to your specific needs. Note that the model name will apply only when you run headless. If you use the UI, please set the model in the UI.
+   Set `persist_sandbox` to false if you want to use a clean sandbox for each task. If `persist_sandbox` is set to true, you will also need to set `ssh_password`.

 **Note on Alternative Models:**
 Some alternative models may prove more challenging to tame than others. Fear not, brave adventurer! We shall soon unveil LLM-specific documentation to guide you on your quest. And if you've already mastered the art of wielding a model other than OpenAI's GPT, we encourage you to [share your setup instructions with us](https://github.com/OpenDevin/OpenDevin/issues/417).
@@ -98,4 +99,4 @@ Please refer to [this README](./tests/integration/README.md) for details.
 ### 9. Add or update dependency

 1. Add your dependency in `pyproject.toml` or use `poetry add xxx`
-2. Update the poetry.lock file via `poetry lock --no-update`
+2. Update the poetry.lock file via `poetry lock --no-update`
@@ -7,7 +7,7 @@ BACKEND_PORT = 3000
 BACKEND_HOST = "127.0.0.1:$(BACKEND_PORT)"
 FRONTEND_PORT = 3001
 DEFAULT_WORKSPACE_DIR = "./workspace"
-DEFAULT_MODEL = "gpt-3.5-turbo"
+DEFAULT_MODEL = "gpt-4o"
 CONFIG_FILE = config.toml
 PRECOMMIT_CONFIG_PATH = "./dev_config/python/.pre-commit-config.yaml"

@@ -226,6 +226,15 @@ setup-config-prompts:
 	 workspace_dir=$${workspace_dir:-$(DEFAULT_WORKSPACE_DIR)}; \
 	 echo "workspace_base=\"$$workspace_dir\"" >> $(CONFIG_FILE).tmp

+	@read -p "Do you want to persist the sandbox container? [true/false] [default: true]: " persist_sandbox; \
+	 persist_sandbox=$${persist_sandbox:-true}; \
+	 if [ "$$persist_sandbox" = "true" ]; then \
+		 read -p "Enter a password for the sandbox container: " ssh_password; \
+		 echo "ssh_password=\"$$ssh_password\"" >> $(CONFIG_FILE).tmp; \
+	 else \
+		echo "persist_sandbox=\"$$persist_sandbox\"" >> $(CONFIG_FILE).tmp; \
+	 fi
+
 	@echo "" >> $(CONFIG_FILE).tmp

 	@echo "[llm]" >> $(CONFIG_FILE).tmp
@@ -51,20 +51,21 @@ You must be using Linux, Mac OS, or WSL on Windows.

 To start the app, run these commands, replacing `$(pwd)/workspace` with the directory you want OpenDevin to work with.

+> [!WARNING]
+> OpenDevin runs bash commands within a Docker sandbox, so it should not affect your machine.
+> But your workspace directory will be attached to that sandbox, and files in the directory may be modified or deleted.
+
 ```bash
 # The directory you want OpenDevin to work with. MUST be an absolute path!
 export WORKSPACE_BASE=$(pwd)/workspace;
 ```

-> [!WARNING]  
-> OpenDevin runs bash commands within a Docker sandbox, so it should not affect your machine. 
-> But your workspace directory will be attached to that sandbox, and files in the directory may be modified or deleted.
-
 ```bash
-docker run \
-    -it \
+docker run -it \
    --pull=always \
    -e SANDBOX_USER_ID=$(id -u) \
+    -e PERSIST_SANDBOX="true" \
+    -e SSH_PASSWORD="make something up here" \
    -e WORKSPACE_MOUNT_PATH=$WORKSPACE_BASE \
    -v $WORKSPACE_BASE:/opt/workspace_base \
    -v /var/run/docker.sock:/var/run/docker.sock \
@@ -12,6 +12,7 @@ from . import (  # noqa: E402
    SWE_agent,
    browsing_agent,
    codeact_agent,
+    codeact_swe_agent,
    delegator_agent,
    dummy_agent,
    monologue_agent,
@@ -21,6 +22,7 @@ from . import (  # noqa: E402
 __all__ = [
    'monologue_agent',
    'codeact_agent',
+    'codeact_swe_agent',
    'planner_agent',
    'SWE_agent',
    'delegator_agent',
@@ -105,6 +105,18 @@ def truncate_observation(observation: str, max_chars: int = 10_000) -> str:
    )


+# FIXME: We can tweak these two settings to create MicroAgents specialized for different areas
+def get_system_message() -> str:
+    if ENABLE_GITHUB:
+        return f'{SYSTEM_PREFIX}\n{GITHUB_MESSAGE}\n\n{COMMAND_DOCS}\n\n{SYSTEM_SUFFIX}'
+    else:
+        return f'{SYSTEM_PREFIX}\n\n{COMMAND_DOCS}\n\n{SYSTEM_SUFFIX}'
+
+
+def get_in_context_example() -> str:
+    return EXAMPLES
+
+
 class CodeActAgent(Agent):
    VERSION = '1.5'
    """
@@ -152,11 +164,8 @@ class CodeActAgent(Agent):
    ]
    jupyter_kernel_init_code: str = 'from agentskills import *'

-    system_message: str = (
-        f'{SYSTEM_PREFIX}\n{GITHUB_MESSAGE}\n\n{COMMAND_DOCS}\n\n{SYSTEM_SUFFIX}'
-        if ENABLE_GITHUB
-        else f'{SYSTEM_PREFIX}\n\n{COMMAND_DOCS}\n\n{SYSTEM_SUFFIX}'
-    )
+    system_message: str = get_system_message()
+    in_context_example: str = f"Here is an example of how you can interact with the environment for task solving:\n{get_in_context_example()}\n\nNOW, LET'S START!"

    def __init__(
        self,
@@ -194,10 +203,7 @@ class CodeActAgent(Agent):
        """
        messages: list[dict[str, str]] = [
            {'role': 'system', 'content': self.system_message},
-            {
-                'role': 'user',
-                'content': f"Here is an example of how you can interact with the environment for task solving:\n{EXAMPLES}\n\nNOW, LET'S START!",
-            },
+            {'role': 'user', 'content': self.in_context_example},
        ]

        for prev_action, obs in state.history:
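The hunk above shows the prompt being built as a system message, one in-context example, and then the interaction history. A minimal sketch of that assembly follows, assuming the history is reduced to plain `(action_text, observation_text)` string pairs; `build_messages` is a hypothetical helper, not a function from this change, and the real agent converts typed actions and observations instead.

```python
def build_messages(system_message: str, example: str,
                   history: list[tuple[str, str]]) -> list[dict[str, str]]:
    """Assemble chat messages: system prompt, in-context example, then history."""
    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': example},
    ]
    for action_text, observation_text in history:
        # Simplification (assumption): every action is attributed to the
        # assistant and every observation is fed back as a user turn.
        messages.append({'role': 'assistant', 'content': action_text})
        messages.append({'role': 'user', 'content': 'OBSERVATION:\n' + observation_text})
    return messages


messages = build_messages(
    system_message='You can run bash and IPython.',
    example='Here is an example ...',
    history=[('<execute_bash>\nls\n</execute_bash>', 'README.md')],
)
roles = [m['role'] for m in messages]
```

Keeping the system message and the example as separate attributes means a specialized agent only has to swap those two strings to change its behavior.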
@@ -8,17 +8,23 @@ COMMAND_DOCS = (
    "Please note that THE `edit_file` FUNCTION REQUIRES PROPER INDENTATION. If the assistant would like to add the line '        print(x)', it must fully write that out, with all those spaces before the code! Indentation is important and code that is not indented correctly will fail and require fixing before it can be run."
 )

-SYSTEM_PREFIX = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
+# ======= SYSTEM MESSAGE =======
+MINIMAL_SYSTEM_PREFIX = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
 The assistant can interact with an interactive Python (Jupyter Notebook) environment and receive the corresponding output when needed. The code should be enclosed using "<execute_ipython>" tag, for example:
 <execute_ipython>
 print("Hello World!")
 </execute_ipython>
 The assistant can execute bash commands on behalf of the user by wrapping them with <execute_bash> and </execute_bash>.
 For example, you can list the files in the current directory by <execute_bash> ls </execute_bash>.
-The assistant can browse the Internet with commands on behalf of the user by wrapping them with <execute_browse> and </execute_browse>.
+"""
+
+BROWSING_PREFIX = """The assistant can browse the Internet with commands on behalf of the user by wrapping them with <execute_browse> and </execute_browse>.
 For example, you can browse a given URL by <execute_browse> goto("<URL>") </execute_browse>.
 The assistant should attempt fewer things at a time instead of putting too much commands OR code in one "execute" block.
-The assistant can install Python packages using the %pip magic command in an IPython environment by using the following syntax: <execute_ipython> %pip install [package needed] </execute_ipython> and should always import packages and define variables before starting to use them."""
+"""
+PIP_INSTALL_PREFIX = """The assistant can install Python packages using the %pip magic command in an IPython environment by using the following syntax: <execute_ipython> %pip install [package needed] </execute_ipython> and should always import packages and define variables before starting to use them."""
+
+SYSTEM_PREFIX = MINIMAL_SYSTEM_PREFIX + BROWSING_PREFIX + PIP_INSTALL_PREFIX

 GITHUB_MESSAGE = """To do any activities on GitHub, the assistant should use the token in the $GITHUB_TOKEN environment variable.
 For instance, to push a local branch `my_branch` to the github repo `owner/repo`, the assistant can use the following four commands:
@@ -30,6 +36,8 @@ The assistant should include ONLY ONE <execute_ipython> or <execute_bash> or <ex
 IMPORTANT: Whenever possible, execute the code for the user using <execute_ipython> or <execute_bash> or <execute_browse> instead of providing it.
 """

+
+# ======= EXAMPLE MESSAGE =======
 EXAMPLES = """
 --- START OF EXAMPLE ---

@@ -0,0 +1,7 @@
+# CodeAct (SWE Edit Specialized)
+
+This agent is an adaptation of the original [SWE Agent](https://swe-agent.com/) based on CodeAct using the `agentskills` library of OpenDevin.
+
+Its intended use is **solving GitHub issues**.
+
+It removes web-browsing and GitHub capabilities from the original CodeAct agent to avoid confusing the agent.
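At the prompt level, removing these capabilities appears to mean leaving the corresponding fragments out of the system message. The sketch below illustrates that idea only: the fragment texts are shortened placeholders rather than the real prompts, and `build_system_message` is a hypothetical helper that does not exist in the repository.

```python
# Placeholder fragments (assumptions); the real strings live in the prompt modules.
MINIMAL_SYSTEM_PREFIX = 'Use <execute_ipython> or <execute_bash> to act.\n'
BROWSING_PREFIX = 'Use <execute_browse> to browse the web.\n'
GITHUB_MESSAGE = 'Use $GITHUB_TOKEN for GitHub operations.\n'
COMMAND_DOCS = 'Available functions: open_file, edit_file, find_file.\n'
SYSTEM_SUFFIX = 'Keep responses concise.\n'


def build_system_message(browsing: bool, github: bool) -> str:
    """Hypothetical helper: join only the fragments an agent should see."""
    parts = [MINIMAL_SYSTEM_PREFIX]
    if browsing:
        parts.append(BROWSING_PREFIX)
    if github:
        parts.append(GITHUB_MESSAGE)
    parts += [COMMAND_DOCS, SYSTEM_SUFFIX]
    return '\n'.join(parts)


full_prompt = build_system_message(browsing=True, github=True)
swe_prompt = build_system_message(browsing=False, github=False)
```

With both flags off, the model is never told about `<execute_browse>` or the GitHub token, which is the narrowing this README describes.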
@@ -0,0 +1,5 @@
+from opendevin.controller.agent import Agent
+
+from .codeact_swe_agent import CodeActSWEAgent
+
+Agent.register('CodeActSWEAgent', CodeActSWEAgent)
@@ -0,0 +1,246 @@
+import re
+
+from agenthub.codeact_swe_agent.prompt import (
+    COMMAND_DOCS,
+    MINIMAL_SYSTEM_PREFIX,
+    SWE_EXAMPLE,
+    SYSTEM_SUFFIX,
+)
+from opendevin.controller.agent import Agent
+from opendevin.controller.state.state import State
+from opendevin.events.action import (
+    Action,
+    AgentFinishAction,
+    BrowseInteractiveAction,
+    CmdRunAction,
+    IPythonRunCellAction,
+    MessageAction,
+)
+from opendevin.events.observation import (
+    BrowserOutputObservation,
+    CmdOutputObservation,
+    IPythonRunCellObservation,
+)
+from opendevin.llm.llm import LLM
+from opendevin.runtime.plugins import (
+    AgentSkillsRequirement,
+    JupyterRequirement,
+    PluginRequirement,
+)
+
+
+def parse_response(response) -> str:
+    action = response.choices[0].message.content
+    for lang in ['bash', 'ipython', 'browse']:
+        if f'<execute_{lang}>' in action and f'</execute_{lang}>' not in action:
+            action += f'</execute_{lang}>'
+    return action
+
+
+def action_to_str(action: Action) -> str:
+    if isinstance(action, CmdRunAction):
+        return f'{action.thought}\n<execute_bash>\n{action.command}\n</execute_bash>'
+    elif isinstance(action, IPythonRunCellAction):
+        return f'{action.thought}\n<execute_ipython>\n{action.code}\n</execute_ipython>'
+    elif isinstance(action, BrowseInteractiveAction):
+        return f'{action.thought}\n<execute_browse>\n{action.browser_actions}\n</execute_browse>'
+    elif isinstance(action, MessageAction):
+        return action.content
+    return ''
+
+
+def get_action_message(action: Action) -> dict[str, str] | None:
+    if (
+        isinstance(action, BrowseInteractiveAction)
+        or isinstance(action, CmdRunAction)
+        or isinstance(action, IPythonRunCellAction)
+        or isinstance(action, MessageAction)
+    ):
+        return {
+            'role': 'user' if action.source == 'user' else 'assistant',
+            'content': action_to_str(action),
+        }
+    return None
+
+
+def get_observation_message(obs) -> dict[str, str] | None:
+    if isinstance(obs, CmdOutputObservation):
+        content = 'OBSERVATION:\n' + truncate_observation(obs.content)
+        content += (
+            f'\n[Command {obs.command_id} finished with exit code {obs.exit_code}]'
+        )
+        return {'role': 'user', 'content': content}
+    elif isinstance(obs, IPythonRunCellObservation):
+        content = 'OBSERVATION:\n' + obs.content
+        # replace base64 images with a placeholder
+        splitted = content.split('\n')
+        for i, line in enumerate(splitted):
+            if '![image](data:image/png;base64,' in line:
+                splitted[i] = (
+                    '![image](data:image/png;base64, ...) already displayed to user'
+                )
+        content = '\n'.join(splitted)
+        content = truncate_observation(content)
+        return {'role': 'user', 'content': content}
+    elif isinstance(obs, BrowserOutputObservation):
+        content = 'OBSERVATION:\n' + truncate_observation(obs.content)
+        return {'role': 'user', 'content': content}
+    return None
+
+
+def truncate_observation(observation: str, max_chars: int = 10_000) -> str:
+    """
+    Truncate the middle of the observation if it is too long.
+    """
+    if len(observation) <= max_chars:
+        return observation
+    half = max_chars // 2
+    return (
+        observation[:half]
+        + '\n[... Observation truncated due to length ...]\n'
+        + observation[-half:]
+    )
+
+
+def get_system_message() -> str:
+    return f'{MINIMAL_SYSTEM_PREFIX}\n\n{COMMAND_DOCS}\n\n{SYSTEM_SUFFIX}'
+
+
+def get_in_context_example() -> str:
+    return SWE_EXAMPLE
+
+
+class CodeActSWEAgent(Agent):
+    VERSION = '1.5'
+    """
+    This agent is an adaptation of the original [SWE Agent](https://swe-agent.com/) based on CodeAct 1.5 using the `agentskills` library of OpenDevin.
+
+    Its intended use is **solving GitHub issues**.
+
+    It removes web-browsing and GitHub capabilities from the original CodeAct agent to avoid confusing the agent.
+    """
+
+    sandbox_plugins: list[PluginRequirement] = [
+        # NOTE: AgentSkillsRequirement needs to go before JupyterRequirement, since
+        # AgentSkillsRequirement provides a lot of Python functions
+        # and it needs to be initialized before Jupyter for Jupyter to use those functions.
+        AgentSkillsRequirement(),
+        JupyterRequirement(),
+    ]
+    jupyter_kernel_init_code: str = 'from agentskills import *'
+
+    system_message: str = get_system_message()
+    in_context_example: str = f"Here is an example of how you can interact with the environment for task solving:\n{get_in_context_example()}\n\nNOW, LET'S START!"
+
+    def __init__(
+        self,
+        llm: LLM,
+    ) -> None:
+        """
+        Initializes a new instance of the CodeActSWEAgent class.
+
+        Parameters:
+        - llm (LLM): The llm to be used by this agent
+        """
+        super().__init__(llm)
+        self.reset()
+
+    def reset(self) -> None:
+        """
+        Resets the CodeAct Agent.
+        """
+        super().reset()
+
+    def step(self, state: State) -> Action:
+        """
+        Performs one step using the CodeAct Agent.
+        This includes gathering info on previous steps and prompting the model to make a command to execute.
+
+        Parameters:
+        - state (State): used to get updated info and background commands
+
+        Returns:
+        - CmdRunAction(command) - bash command to run
+        - IPythonRunCellAction(code) - IPython code to run
+        - BrowseInteractiveAction(browsergym_command) - BrowserGym commands to run
+        - MessageAction(content) - Message action to run (e.g. ask for clarification)
+        - AgentFinishAction() - end the interaction
+        """
+        messages: list[dict[str, str]] = [
+            {'role': 'system', 'content': self.system_message},
+            {'role': 'user', 'content': self.in_context_example},
+        ]
+
+        for prev_action, obs in state.history:
+            action_message = get_action_message(prev_action)
+            if action_message:
+                messages.append(action_message)
+
+            obs_message = get_observation_message(obs)
+            if obs_message:
+                messages.append(obs_message)
+
+        latest_user_message = [m for m in messages if m['role'] == 'user'][-1]
+        if latest_user_message:
+            if latest_user_message['content'].strip() == '/exit':
+                return AgentFinishAction()
+            latest_user_message['content'] += (
+                f'\n\nENVIRONMENT REMINDER: You have {state.max_iterations - state.iteration} turns left to complete the task.'
+            )
+
+        response = self.llm.do_completion(
+            messages=messages,
+            stop=[
+                '</execute_ipython>',
+                '</execute_bash>',
+                '</execute_browse>',
+            ],
+            temperature=0.0,
+        )
+
+        action_str: str = parse_response(response)
+        state.num_of_chars += sum(
+            len(message['content']) for message in messages
+        ) + len(action_str)
+
+        if finish_command := re.search(r'<finish>.*</finish>', action_str, re.DOTALL):
+            thought = action_str.replace(finish_command.group(0), '').strip()
+            return AgentFinishAction(thought=thought)
+        if bash_command := re.search(
+            r'<execute_bash>(.*?)</execute_bash>', action_str, re.DOTALL
+        ):
+            # remove the command from the action string to get thought
+            thought = action_str.replace(bash_command.group(0), '').strip()
+            # a command was found
+            command_group = bash_command.group(1).strip()
+
+            if command_group.strip() == 'exit':
+                return AgentFinishAction()
+            return CmdRunAction(command=command_group, thought=thought)
+        elif python_code := re.search(
+            r'<execute_ipython>(.*?)</execute_ipython>', action_str, re.DOTALL
+        ):
+            # a code block was found
+            code_group = python_code.group(1).strip()
+            thought = action_str.replace(python_code.group(0), '').strip()
+            return IPythonRunCellAction(
+                code=code_group,
+                thought=thought,
+                kernel_init_code=self.jupyter_kernel_init_code,
+            )
+        elif browse_command := re.search(
+            r'<execute_browse>(.*)</execute_browse>', action_str, re.DOTALL
+        ):
+            # BrowserGym actions were found
+            browse_actions = browse_command.group(1).strip()
+            thought = action_str.replace(browse_command.group(0), '').strip()
+            return BrowseInteractiveAction(
+                browser_actions=browse_actions, thought=thought
+            )
+        else:
+            # We assume the LLM is GOOD enough that when it returns pure natural language
+            # it wants to talk to the user
+            return MessageAction(content=action_str, wait_for_response=True)
+
+    def search_memory(self, query: str) -> list[str]:
+        raise NotImplementedError('Implement this abstract method')
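The file above restores a closing tag that the stop sequence cut off and then uses regular expressions to separate the command from the surrounding "thought". The following standalone sketch shows that technique in isolation. `restore_closing_tag` and `split_action` are hypothetical names, the result is a plain tuple rather than the OpenDevin action classes, and `<finish>` handling is omitted.

```python
import re

TAGS = ('bash', 'ipython', 'browse')


def restore_closing_tag(text: str) -> str:
    """Re-append a closing tag that was cut off by the stop sequence."""
    for lang in TAGS:
        if f'<execute_{lang}>' in text and f'</execute_{lang}>' not in text:
            text += f'</execute_{lang}>'
    return text


def split_action(text: str) -> tuple[str, str, str]:
    """Return (kind, payload, thought); kind is 'message' when no tag matches."""
    for lang in TAGS:
        match = re.search(rf'<execute_{lang}>(.*?)</execute_{lang}>', text, re.DOTALL)
        if match:
            # Everything outside the tagged block is treated as the thought.
            thought = text.replace(match.group(0), '').strip()
            return lang, match.group(1).strip(), thought
    return 'message', text, ''


# The completion stops before the closing tag, so it has to be restored first.
raw = 'Let me list the files.\n<execute_bash>\nls -F\n'
kind, payload, thought = split_action(restore_closing_tag(raw))
```

Here `kind` is `'bash'`, `payload` is `'ls -F'`, and `thought` is the sentence before the tag; text without any tag falls through to the `'message'` case, mirroring the agent's final `else` branch.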
@@ -0,0 +1,451 @@
+from opendevin.runtime.plugins import AgentSkillsRequirement
+
+_AGENT_SKILLS_DOCS = AgentSkillsRequirement.documentation
+
+COMMAND_DOCS = (
+    '\nApart from the standard Python library, the assistant can also use the following functions (already imported) in <execute_ipython> environment:\n'
+    f'{_AGENT_SKILLS_DOCS}'
+    "Please note that THE `edit_file` FUNCTION REQUIRES PROPER INDENTATION. If the assistant would like to add the line '        print(x)', it must fully write that out, with all those spaces before the code! Indentation is important and code that is not indented correctly will fail and require fixing before it can be run."
+)
+
+# ======= SYSTEM MESSAGE =======
+MINIMAL_SYSTEM_PREFIX = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
+The assistant can interact with an interactive Python (Jupyter Notebook) environment and receive the corresponding output when needed. The code should be enclosed using "<execute_ipython>" tag, for example:
+<execute_ipython>
+print("Hello World!")
+</execute_ipython>
+The assistant can execute bash commands on behalf of the user by wrapping them with <execute_bash> and </execute_bash>.
+For example, you can list the files in the current directory by <execute_bash> ls </execute_bash>.
+"""
+
+SYSTEM_SUFFIX = """The assistant's response should be concise.
+The assistant should include ONLY ONE <execute_ipython> or <execute_bash> or <execute_browse> in every one of the responses, unless the assistant is finished with the task or need more input or action from the user in order to proceed.
+IMPORTANT: Whenever possible, execute the code for the user using <execute_ipython> or <execute_bash> or <execute_browse> instead of providing it.
+"""
+
+SWE_EXAMPLE = """
+--- START OF EXAMPLE ---
+
+USER: We are currently solving the following issue within our repository. Here is the issue text:
+--- BEGIN ISSUE ---
+TimeDelta serialization precision
+Hi there!
+
+I just found quite strange behaviour of `TimeDelta` field serialization
+
+```python3
+from marshmallow.fields import TimeDelta
+from datetime import timedelta
+
+td_field = TimeDelta(precision="milliseconds")
+
+obj = dict()
+obj["td_field"] = timedelta(milliseconds=345)
+
+print(td_field.serialize("td_field", obj))
+```
+
+Output of this snippet is `344`, but it seems that `345` is correct.
+
+Looks like a rounding issue here: https://github.com/marshmallow-code/marshmallow/blob/dev/src/marshmallow/fields.py#L1474
+--- END ISSUE ---
+
+[Current directory: /workspace/marshmallow-code__marshmallow]
+
+ASSISTANT:
+Let's first start by reproducing the results of the issue. The issue includes some example code for reproduction, which we can use. We'll create a new file called `reproduce.py` and paste the example code into it.
+<execute_ipython>
+create_file('/workspace/marshmallow-code__marshmallow/reproduce.py')
+</execute_ipython>
+
+USER:
+OBSERVATION:
+[File: /workspace/marshmallow-code__marshmallow/reproduce.py (1 lines total)]
+1|
+[File /workspace/marshmallow-code__marshmallow/reproduce.py created.]
+
+ASSISTANT:
+Now let's paste in the example code from the issue.
+<execute_ipython>
+EDITED_CODE = \"\"\"from marshmallow.fields import TimeDelta
+from datetime import timedelta
+
+td_field = TimeDelta(precision="milliseconds")
+
+obj = dict()
+obj["td_field"] = timedelta(milliseconds=345)
+
+print(td_field.serialize("td_field", obj))
+\"\"\"
+edit_file(start=1, end=1, content=EDITED_CODE)
+</execute_ipython>
+
+USER:
+OBSERVATION:
+[File: /workspace/marshmallow-code__marshmallow/reproduce.py (9 lines total)]
+1|from marshmallow.fields import TimeDelta
+2|from datetime import timedelta
+3|
+4|td_field = TimeDelta(precision="milliseconds")
+5|
+6|obj = dict()
+7|obj["td_field"] = timedelta(milliseconds=345)
+8|
+9|print(td_field.serialize("td_field", obj))
+[File updated. Please review the changes and make sure they are correct (correct indentation, no duplicate lines, etc). Edit the file again if necessary.]
+
+ASSISTANT:
+Now let's run the code to see if we see the same output as the issue.
+<execute_bash>
+python reproduce.py
+</execute_bash>
+
+USER:
+OBSERVATION:
+344
+
+ASSISTANT:
+We are indeed seeing the same output as the issue. The issue suggests that we should look at line 1474 of the `fields.py` file to see if there is a rounding issue. Let's open the `fields.py` file and navigate to line 1474 to see the relevant code. First we should find the `fields.py` file in the repository. It is likely to be in the `src/marshmallow` directory, but we should check to be sure.
+<execute_bash>
+ls -F
+</execute_bash>
+
+USER:
+OBSERVATION:
+AUTHORS.rst
+CHANGELOG.rst
+CODE_OF_CONDUCT.md
+CONTRIBUTING.rst
+LICENSE
+MANIFEST.in
+NOTICE
+README.rst
+RELEASING.md
+azure-pipelines.yml
+docs/
+examples/
+performance/
+pyproject.toml
+reproduce.py
+setup.cfg
+setup.py
+src/
+tests/
+tox.ini
+
+ASSISTANT:
+It looks like the `src` directory is present, which suggests that the `fields.py` file is likely to be in the `src` directory. Let's use find_file to see where it is.
+<execute_ipython>
+find_file("fields.py", "/workspace/marshmallow-code__marshmallow/src")
+</execute_ipython>
+
+USER:
+OBSERVATION:
+[Found 1 matches for "fields.py" in /workspace/marshmallow-code__marshmallow/src]
+/workspace/marshmallow-code__marshmallow/src/marshmallow/fields.py
+
+ASSISTANT:
+It looks like the `fields.py` file is present in the `./src/marshmallow/` directory. The issue also points to a specific URL with line number 1474. We should navigate to that line in fields.py to see the relevant code for the `TimeDelta` serialization.
+<execute_ipython>
+open_file("/workspace/marshmallow-code__marshmallow/src/marshmallow/fields.py", 1474)
+</execute_ipython>
+
+USER:
+OBSERVATION:
+[File: /workspace/marshmallow-code__marshmallow/src/marshmallow/fields.py (1997 lines total)]
+(1458 more lines above)
+1459|            self.WEEKS,
+1460|        )
+1461|
+1462|        if precision not in units:
+1463|            msg = 'The precision must be {} or "{}".'.format(
+1464|                ", ".join([f'"{each}"' for each in units[:-1]]), units[-1]
+1465|            )
+1466|            raise ValueError(msg)
+1467|
+1468|        self.precision = precision
+1469|        super().__init__(**kwargs)
+1470|
+1471|    def _serialize(self, value, attr, obj, **kwargs):
+1472|        if value is None:
+1473|            return None
+1474|        base_unit = dt.timedelta(**{self.precision: 1})
+1475|        return int(value.total_seconds() / base_unit.total_seconds())
+1476|
+1477|    def _deserialize(self, value, attr, data, **kwargs):
+1478|        try:
+1479|            value = int(value)
+1480|        except (TypeError, ValueError) as error:
+1481|            raise self.make_error("invalid") from error
+1482|
+1483|        kwargs = {self.precision: value}
+1484|
+1485|        try:
+1486|            return dt.timedelta(**kwargs)
+1487|        except OverflowError as error:
+1488|            raise self.make_error("invalid") from error
+1489|
+1490|
+1491|class Mapping(Field):
+1492|    \"\"\"An abstract class for objects with key-value pairs.
+1493|
+1494|    :param keys: A field class or instance for dict keys.
+1495|    :param values: A field class or instance for dict values.
+1496|    :param kwargs: The same keyword arguments that :class:`Field` receives.
+1497|
+1498|    .. note::
+1499|        When the structure of nested data is not known, you may omit the
+1500|        `keys` and `values` arguments to prevent content validation.
+1501|
+1502|    .. versionadded:: 3.0.0rc4
+1503|    \"\"\"
+1504|
+1505|    mapping_type = dict
+1506|
+1507|    #: Default error messages.
+1508|    default_error_messages = {"invalid": "Not a valid mapping type."}
+1509|
+1510|    def __init__(
+1511|        self,
+1512|        keys: typing.Optional[typing.Union[Field, type]] = None,
+1513|        values: typing.Optional[typing.Union[Field, type]] = None,
+1514|        **kwargs,
+1515|    ):
+1516|        super().__init__(**kwargs)
+1517|        if keys is None:
+1518|            self.key_field = None
+1519|        else:
+1520|            try:
+1521|                self.key_field = resolve_field_instance(keys)
+1522|            except FieldInstanceResolutionError as error:
+1523|                raise ValueError(
+1524|                    '"keys" must be a subclass or instance of '
+1525|                    "marshmallow.base.FieldABC."
+1526|                ) from error
+1527|
+1528|        if values is None:
+1529|            self.value_field = None
+1530|        else:
+1531|            try:
+1532|                self.value_field = resolve_field_instance(values)
+1533|            except FieldInstanceResolutionError as error:
+1534|                raise ValueError(
+1535|                    '"values" must be a subclass or instance of '
+1536|                    "marshmallow.base.FieldABC."
+1537|                ) from error
+1538|            if isinstance(self.value_field, Nested):
+1539|                self.only = self.value_field.only
+1540|                self.exclude = self.value_field.exclude
+1541|
+1542|    def _bind_to_schema(self, field_name, schema):
+1543|        super()._bind_to_schema(field_name, schema)
+1544|        if self.value_field:
+1545|            self.value_field = copy.deepcopy(self.value_field)
+1546|            self.value_field._bind_to_schema(field_name, self)
+1547|        if isinstance(self.value_field, Nested):
+1548|            self.value_field.only = self.only
+1549|            self.value_field.exclude = self.exclude
+1550|        if self.key_field:
+1551|            self.key_field = copy.deepcopy(self.key_field)
+1552|            self.key_field._bind_to_schema(field_name, self)
+1553|
+1554|    def _serialize(self, value, attr, obj, **kwargs):
+1555|        if value is None:
+1556|            return None
+1557|        if not self.value_field and not self.key_field:
+1558|            return self.mapping_type(value)
+(439 more lines below)
+
+ASSISTANT:
+We are now looking at the relevant section of the `fields.py` file where the `TimeDelta` serialization occurs. The issue suggests that there is a rounding problem near line 1474. The current implementation uses integer division, which truncates the decimal part and can lead to incorrect rounding on line 1475. To fix this issue, we should ensure that the division results in a float, and then round the result to the nearest integer before converting it to an int. This should preserve the milliseconds precision as expected. Let's make the necessary edit to the code.
+<execute_ipython>
+edit_file(start=1475, end=1475, content="return int(round(value.total_seconds() / base_unit.total_seconds()))  # round to nearest int")
+</execute_ipython>
+
+USER:
+OBSERVATION:
+[Your proposed edit has introduced new syntax error(s). Please understand the errors and retry your edit command.]
+
+ERRORS:
+- E999 IndentationError: unexpected indent
+
+[This is how your edit would have looked if applied]
+-------------------------------------------------
+[File: /marshmallow-code__marshmallow/src/marshmallow/fields.py (1997 lines total)]
+(1468 more lines above)
+1469|        super().__init__(**kwargs)
+1470|
+1471|    def _serialize(self, value, attr, obj, **kwargs):
+1472|        if value is None:
+1473|            return None
+1474|        base_unit = dt.timedelta(**{self.precision: 1})
+1475|return int(round(value.total_seconds() / base_unit.total_seconds()))  # round to nearest int
+1476|
+1477|    def _deserialize(self, value, attr, data, **kwargs):
+1478|        try:
+1479|            value = int(value)
+(518 more lines below)
+-------------------------------------------------
+
+[This is the original code before your edit]
+-------------------------------------------------
+[File: /marshmallow-code__marshmallow/src/marshmallow/fields.py (1997 lines total)]
+(1469 more lines above)
+1470|
+1471|    def _serialize(self, value, attr, obj, **kwargs):
+1472|        if value is None:
+1473|            return None
+1474|        base_unit = dt.timedelta(**{self.precision: 1})
+1475|        return int(value.total_seconds() / base_unit.total_seconds())
+1476|
+1477|    def _deserialize(self, value, attr, data, **kwargs):
+1478|        try:
+1479|            value = int(value)
+1480|        except (TypeError, ValueError) as error:
+(517 more lines below)
+-------------------------------------------------
+Your changes have NOT been applied. Please fix your edit command and try again.
+You either need to 1) Specify the correct start/end line arguments or 2) Correct your edit code.
+DO NOT re-run the same failed edit command. Running it again will lead to the same error.
+
+ASSISTANT:
+My edit command did not use the proper indentation. I will fix the indentation in this follow-up edit command.
+<execute_ipython>
+EDITED_CODE="        return int(round(value.total_seconds() / base_unit.total_seconds()))  # round to nearest int"
+edit_file(start=1475, end=1475, content=EDITED_CODE)
+</execute_ipython>
+
+USER:
+OBSERVATION:
+[File: /marshmallow-code__marshmallow/src/marshmallow/fields.py (1997 lines total)]
+(1424 more lines above)
+1425|
+1426|    :param precision: Influences how the integer is interpreted during
+1427|        (de)serialization. Must be 'days', 'seconds', 'microseconds',
+1428|        'milliseconds', 'minutes', 'hours' or 'weeks'.
+1429|    :param kwargs: The same keyword arguments that :class:`Field` receives.
+1430|
+1431|    .. versionchanged:: 2.0.0
+1432|        Always serializes to an integer value to avoid rounding errors.
+1433|        Add `precision` parameter.
+1434|    \"\"\"
+1435|
+1436|    DAYS = "days"
+1437|    SECONDS = "seconds"
+1438|    MICROSECONDS = "microseconds"
+1439|    MILLISECONDS = "milliseconds"
+1440|    MINUTES = "minutes"
+1441|    HOURS = "hours"
+1442|    WEEKS = "weeks"
+1443|
+1444|    #: Default error messages.
+1445|    default_error_messages = {
+1446|        "invalid": "Not a valid period of time.",
+1447|        "format": "{input!r} cannot be formatted as a timedelta.",
+1448|    }
+1449|
+1450|    def __init__(self, precision: str = SECONDS, **kwargs):
+1451|        precision = precision.lower()
+1452|        units = (
+1453|            self.DAYS,
+1454|            self.SECONDS,
+1455|            self.MICROSECONDS,
+1456|            self.MILLISECONDS,
+1457|            self.MINUTES,
+1458|            self.HOURS,
+1459|            self.WEEKS,
+1460|        )
+1461|
+1462|        if precision not in units:
+1463|            msg = 'The precision must be {} or "{}".'.format(
+1464|                ", ".join([f'"{each}"' for each in units[:-1]]), units[-1]
+1465|            )
+1466|            raise ValueError(msg)
+1467|
+1468|        self.precision = precision
+1469|        super().__init__(**kwargs)
+1470|
+1471|    def _serialize(self, value, attr, obj, **kwargs):
+1472|        if value is None:
+1473|            return None
+1474|        base_unit = dt.timedelta(**{self.precision: 1})
+1475|        return int(round(value.total_seconds() / base_unit.total_seconds()))  # round to nearest int
+1476|
+1477|    def _deserialize(self, value, attr, data, **kwargs):
+1478|        try:
+1479|            value = int(value)
+1480|        except (TypeError, ValueError) as error:
+1481|            raise self.make_error("invalid") from error
+1482|
+1483|        kwargs = {self.precision: value}
+1484|
+1485|        try:
+1486|            return dt.timedelta(**kwargs)
+1487|        except OverflowError as error:
+1488|            raise self.make_error("invalid") from error
+1489|
+1490|
+1491|class Mapping(Field):
+1492|    \"\"\"An abstract class for objects with key-value pairs.
+1493|
+1494|    :param keys: A field class or instance for dict keys.
+1495|    :param values: A field class or instance for dict values.
+1496|    :param kwargs: The same keyword arguments that :class:`Field` receives.
+1497|
+1498|    .. note::
+1499|        When the structure of nested data is not known, you may omit the
+1500|        `keys` and `values` arguments to prevent content validation.
+1501|
+1502|    .. versionadded:: 3.0.0rc4
+1503|    \"\"\"
+1504|
+1505|    mapping_type = dict
+1506|
+1507|    #: Default error messages.
+1508|    default_error_messages = {"invalid": "Not a valid mapping type."}
+1509|
+1510|    def __init__(
+1511|        self,
+1512|        keys: typing.Optional[typing.Union[Field, type]] = None,
+1513|        values: typing.Optional[typing.Union[Field, type]] = None,
+1514|        **kwargs,
+1515|    ):
+1516|        super().__init__(**kwargs)
+1517|        if keys is None:
+1518|            self.key_field = None
+1519|        else:
+1520|            try:
+1521|                self.key_field = resolve_field_instance(keys)
+1522|            except FieldInstanceResolutionError as error:
+1523|                raise ValueError(
+1524|                    '"keys" must be a subclass or instance of '
+(473 more lines below)
+[File updated. Please review the changes and make sure they are correct (correct indentation, no duplicate lines, etc). Edit the file again if necessary.]
+
+ASSISTANT:
+The code has been updated to use the `round` function, which should fix the rounding issue. Before submitting the changes, it would be prudent to run the reproduce.py code again to ensure that this change has the desired effect. I see that "Current directory" is still `/marshmallow-code__marshmallow`, so we can just run `reproduce.py` directly.
+<execute_bash>
+python reproduce.py
+</execute_bash>
+
+USER:
+OBSERVATION:
+345
+
+ASSISTANT:
+The output has changed from 344 to 345, which suggests that the rounding issue has been fixed. Let's remove the reproduce.py file since it is no longer needed.
+<execute_bash>
+rm reproduce.py
+</execute_bash>
+
+USER:
+OBSERVATION:
+
+ASSISTANT:
+rm doesn't have any output when it deletes successfully, so that must have worked. Now that we have fixed the issue, we can exit the current task.
+<execute_bash>
+exit
+</execute_bash>
+
+--- END OF EXAMPLE ---
+"""
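The rounding fix walked through in the example above can be reproduced in isolation. The following is a minimal sketch, not the marshmallow implementation: the function names `serialize_truncating`, `serialize_rounding`, and `serialize_exact` are hypothetical, and the third variant (integer floor division on `timedelta` objects) is an alternative that the example does not use.

```python
import datetime as dt


def serialize_truncating(value: dt.timedelta, precision: str) -> int:
    """Mirror of the original line 1475: int() drops the fractional part."""
    base_unit = dt.timedelta(**{precision: 1})
    return int(value.total_seconds() / base_unit.total_seconds())


def serialize_rounding(value: dt.timedelta, precision: str) -> int:
    """Mirror of the edited line 1475: round to the nearest integer first."""
    base_unit = dt.timedelta(**{precision: 1})
    return int(round(value.total_seconds() / base_unit.total_seconds()))


def serialize_exact(value: dt.timedelta, precision: str) -> int:
    """Hypothetical alternative: timedelta // timedelta is exact integer
    arithmetic, so no float error is introduced (it floors, not rounds)."""
    base_unit = dt.timedelta(**{precision: 1})
    return value // base_unit


value = dt.timedelta(milliseconds=345)
truncated = serialize_truncating(value, "milliseconds")
rounded = serialize_rounding(value, "milliseconds")
exact = serialize_exact(value, "milliseconds")
print(truncated, rounded, exact)
```

The float quotient of 0.345 and 0.001 lands just below 345, which is why truncation reports 344 while rounding reports 345, matching the before and after output of `reproduce.py` in the example.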
@@ -1,2 +1,2 @@
-* `finish` - if you're absolutely certain that you've completed your task and have tested your work, use the finish action to stop working. Arguments:
+* `finish` - if you're absolutely certain that you've completed your task, use the finish action to stop working. Arguments:
  * `outputs` - a dictionary representing the outputs of your task, if any
@@ -55,14 +55,13 @@ class MicroAgent(Agent):
        del self.delegates[self.agent_definition['name']]

    def step(self, state: State) -> Action:
-        latest_user_message = state.get_current_user_intent()
        prompt = self.prompt_template.render(
            state=state,
            instructions=instructions,
            to_json=to_json,
            history_to_json=history_to_json,
            delegates=self.delegates,
-            latest_user_message=latest_user_message,
+            latest_user_message=state.get_current_user_intent(),
        )
        messages = [{'content': prompt, 'role': 'user'}]
        resp = self.llm.do_completion(messages=messages)
@@ -2,5 +2,5 @@ name: CoderAgent
 description: Given a particular task, and a detailed description of the codebase, accomplishes the task
 inputs:
  task: string
-  codebase_summary: string
+  summary: string
 outputs: {}
@@ -2,7 +2,7 @@
 You are a software engineer. You've inherited an existing codebase, which you
 need to modify to complete this task:

-{{ latest_user_message }}
+{{ state.inputs.task }}

 {% if state.inputs.summary %}
 Here's a summary of the codebase, as it relates to this task:
@@ -1,7 +1,7 @@
 # Task
 You are a brilliant mathematician and programmer. You've been given the following problem to solve:

-{{ latest_user_message }}
+`{{ state.inputs.task }}`

 Please write a python script that solves this problem, and prints the answer to stdout.
 ONLY print the answer to stdout, nothing else.
@@ -2,7 +2,7 @@
 You are a database engineer. You are working on an existing Postgres project, and have been given
 the following task:

-{{ latest_user_message }}
+{{ state.inputs.task }}

 You must:
 * Investigate the existing migrations to understand the current schema
@@ -4,7 +4,10 @@ import yaml

 all_microagents = {}

-for dir in os.listdir(os.path.dirname(__file__)):
+# Get the list of directories and sort them to preserve determinism
+dirs = sorted(os.listdir(os.path.dirname(__file__)))
+
+for dir in dirs:
    base = os.path.dirname(__file__) + '/' + dir
    if os.path.isfile(base):
        continue
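The reason for sorting here is that `os.listdir` returns entries in arbitrary order, so the registration order of micro-agents could otherwise differ between machines. A minimal sketch of the same pattern follows; the helper `list_agent_dirs` and the directory names are hypothetical and only illustrate the idea.

```python
import os
import tempfile


def list_agent_dirs(root: str) -> list[str]:
    """Return the sub-directory names under root in sorted order.

    Plain files (such as __init__.py) are skipped, as in the loop above.
    """
    return sorted(
        name
        for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )


with tempfile.TemporaryDirectory() as root:
    # Create the directories in a deliberately unsorted order.
    for name in ("typo_fixer", "coder", "math_agent"):
        os.mkdir(os.path.join(root, name))
    # A plain file that the loader should ignore.
    open(os.path.join(root, "__init__.py"), "w").close()

    agent_dirs = list_agent_dirs(root)
    print(agent_dirs)
```

Sorting the names before iterating keeps the resulting order stable regardless of how the file system enumerates the entries.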
@@ -1,9 +1,11 @@
 # Task
-You are a software engineer. You've inherited an existing codebase, which you're
-learning about for the first time. You need to study the codebase to find all
-the information needed to complete this task:
+You are a software architect. Your team has inherited an existing codebase, and
+needs to finish a project:

-{{ latest_user_message }}
+{{ state.inputs.task }}
+
+As an architect, you need to study the codebase to find all the information that
+might be helpful for your software engineering team.

 ## Available Actions
 {{ instructions.actions.run }}
@@ -11,11 +13,14 @@ the information needed to complete this task:
 {{ instructions.actions.message }}
 {{ instructions.actions.finish }}

-You must ONLY `run` commands that have no side-effects, like `ls` and `grep`.
+You must ONLY `run` commands that have no side-effects, like `ls` and `grep`. You
+MUST NOT modify or write to any file.

 Do NOT finish until you have a complete understanding of which parts of the
-codebase are relevant to the task, including particular files, functions, and classes.
+codebase are relevant to the project, including particular files, functions, and classes.
 When you're done, put your summary in `outputs.summary` in the `finish` action.
+Remember, your task is to explore and study the current repository, not actually
+implement the solution. If the codebase is empty, you should call the `finish` action.

 ## History
 {{ instructions.history_truncated }}
@@ -23,3 +28,36 @@ When you're done, put your summary in `outputs.summary` in the `finish` action.

 ## Format
 {{ instructions.format.action }}
+
+## Examples
+
+Here is an example of how you can interact with the environment for task solving:
+
+--- START OF EXAMPLE ---
+
+USER: Can you create a list of numbers from 1 to 10, and create a web page to display them at port 5000?
+
+ASSISTANT:
+{
+  "action": "run",
+  "args": {
+    "command": "ls",
+    "background": false
+  }
+}
+
+USER:
+OBSERVATION:
+[]
+
+ASSISTANT:
+{
+  "action": "finish",
+  "args": {
+    "outputs": {
+      "summary": "The codebase appears to be empty. Engineers should start everything from scratch."
+    }
+  }
+}
+
+--- END OF EXAMPLE ---
@@ -1,5 +1,6 @@
 name: TypoFixerAgent
 description: Fixes typos in files in the current working directory
-inputs: {}
+inputs:
+  task: string
 outputs:
  summary: string
@@ -1,5 +1,13 @@
 # Task
-You are a proofreader tasked with fixing typos in the files in your current working directory. Your goal is to:
+You are a proofreader tasked with fixing typos in the files in your current working directory.
+
+{% if state.inputs.task %}
+Specifically, your task is:
+{{ state.inputs.task }}
+{% endif %}
+
+To achieve this goal, you should:
+
 1. Scan the files for typos
 2. Overwrite the files with the typos fixed
 3. Provide a summary of the typos fixed
@@ -13,10 +21,10 @@ You are a proofreader tasked with fixing typos in the files in your current work

 To complete this task:
 1. Use the `read` action to read the contents of the files in your current working directory. Make sure to provide the file path in the format `'./file_name.ext'`.
-2. Use the `think` action to analyze the contents and identify typos.
+2. Use the `message` action to analyze the contents and identify typos.
 3. Use the `write` action to create new versions of the files with the typos fixed.
  - Overwrite the original files with the corrected content. Make sure to provide the file path in the format `'./file_name.ext'`.
-4. Use the `think` action to generate a summary of the typos fixed, including the original and fixed versions of each typo, and the file(s) they were found in.
+4. Use the `message` action to generate a summary of the typos fixed, including the original and fixed versions of each typo, and the file(s) they were found in.
 5. Use the `finish` action to return the summary in the `outputs.summary` field.

 Do NOT finish until you have fixed all the typos and generated a summary.
@@ -2,9 +2,10 @@
 You are a quality assurance engineer. Another engineer has made changes to the
 codebase which are supposed to solve this task:

-{{ latest_user_message }}
+{{ state.inputs.task }}

-Your goal is to verify that the changes are correct and bug-free.
+Note the changes might have already been applied in-line. You should focus on
+validating if the task is solved, nothing else.

 ## Available Actions
 {{ instructions.actions.run }}
@@ -81,43 +81,6 @@ const config: Config = {
        },
      ],
    },
-    footer: {
-      style: "dark",
-      links: [
-        {
-          title: "OpenDevin",
-          items: [
-            {
-              label: "Docs",
-              to: "/modules/usage/intro",
-            },
-          ],
-        },
-        {
-          title: "Community",
-          items: [
-            {
-              label: "Slack",
-              href: "https://join.slack.com/t/opendevin/shared_invite/zt-2ggtwn3k5-PvAA2LUmqGHVZ~XzGq~ILw"
-            },
-            {
-              label: "Discord",
-              href: "https://discord.gg/ESHStjSjD4",
-            },
-          ],
-        },
-        {
-          title: "More",
-          items: [
-            {
-              label: "GitHub",
-              href: "https://github.com/OpenDevin/OpenDevin",
-            },
-          ],
-        },
-      ],
-      copyright: `Copyright © ${new Date().getFullYear()} OpenDevin`,
-    },
    prism: {
      theme: prismThemes.oneLight,
      darkTheme: prismThemes.oneDark,
@@ -73,11 +73,11 @@ OpenDevin runs bash commands within a Docker sandbox, so it should not affect yo
 :::

 ```
-docker run \
-    -it \
+docker run -it \
    --pull=always \
-    -e LLM_API_KEY \
    -e SANDBOX_USER_ID=$(id -u) \
+    -e PERSIST_SANDBOX="true" \
+    -e SSH_PASSWORD="make something up here" \
    -e WORKSPACE_MOUNT_PATH=$WORKSPACE_BASE \
    -v $WORKSPACE_BASE:/opt/workspace_base \
    -v /var/run/docker.sock:/var/run/docker.sock \
@@ -92,7 +92,7 @@ You'll find OpenDevin running at [http://localhost:3000](http://localhost:3000).
 If you want to use the **(unstable!)** bleeding edge, you can use `ghcr.io/opendevin/opendevin:main` as the image (last line).
 :::

-See [Development.md](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) for instructions on running OpenDevin without Docker.
+For the development workflow, see [Development.md](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md).

 Are you having trouble? Check out our [Troubleshooting Guide](https://opendevin.github.io/OpenDevin/modules/usage/troubleshooting).

@@ -11,19 +11,20 @@
        "@docusaurus/core": "3.2.1",
        "@docusaurus/preset-classic": "3.2.1",
        "@mdx-js/react": "^3.0.0",
-        "autoprefixer": "^10.4.19",
        "clsx": "^2.0.0",
-        "postcss": "^8.4.38",
        "prism-react-renderer": "^2.3.0",
        "react": "^18.0.0",
        "react-dom": "^18.0.0",
-        "react-use": "^17.5.0",
-        "tailwindcss": "^3.4.3"
+        "react-icons": "^5.2.1",
+        "react-use": "^17.5.0"
      },
      "devDependencies": {
        "@docusaurus/module-type-aliases": "3.2.1",
        "@docusaurus/tsconfig": "3.2.1",
        "@docusaurus/types": "3.2.1",
+        "autoprefixer": "^10.4.19",
+        "postcss": "^8.4.38",
+        "tailwindcss": "^3.4.3",
        "typescript": "~5.2.2"
      },
      "engines": {
@@ -213,6 +214,7 @@
      "version": "5.2.0",
      "resolved": "https://registry.npmjs.org/@alloc/quick-lru/-/quick-lru-5.2.0.tgz",
      "integrity": "sha512-UrcABB+4bUrFABwbluTIBErXwvbsU/V7TZWfmbgJfbkwiBuziS9gxdODUyuiecfdGQ85jglMW6juS3+z5TsKLw==",
+      "dev": true,
      "engines": {
        "node": ">=10"
      },
@@ -2763,6 +2765,7 @@
      "version": "8.0.2",
      "resolved": "https://registry.npmjs.org/@isaacs/cliui/-/cliui-8.0.2.tgz",
      "integrity": "sha512-O8jcjabXaleOG9DQ0+ARXWZBTfnP4WNAqzuiJK7ll44AmxGKv/J2M4TPjxjY3znBCfvBXFzucm1twdyFybFqEA==",
+      "dev": true,
      "dependencies": {
        "string-width": "^5.1.2",
        "string-width-cjs": "npm:string-width@^4.2.0",
@@ -2779,6 +2782,7 @@
      "version": "6.0.1",
      "resolved": "https://registry.npmjs.org/ansi-regex/-/ansi-regex-6.0.1.tgz",
      "integrity": "sha512-n5M855fKb2SsfMIiFFoVrABHJC8QtHwVx+mHWP3QcEqBHYienj5dHSgjbxtC0WEZXYt4wcD6zrQElDPhFuZgfA==",
+      "dev": true,
      "engines": {
        "node": ">=12"
      },
@@ -2790,6 +2794,7 @@
      "version": "7.1.0",
      "resolved": "https://registry.npmjs.org/strip-ansi/-/strip-ansi-7.1.0.tgz",
      "integrity": "sha512-iq6eVVI64nQQTRYq2KtEg2d2uU7LElhTJwsH4YzIHZshxlgZms/wIc4VoDQTlG/IvVIrBKG06CrZnp0qv7hkcQ==",
+      "dev": true,
      "dependencies": {
        "ansi-regex": "^6.0.1"
      },
@@ -2970,6 +2975,7 @@
      "version": "0.11.0",
      "resolved": "https://registry.npmjs.org/@pkgjs/parseargs/-/parseargs-0.11.0.tgz",
      "integrity": "sha512-+1VkjdD0QBLPodGrJUeqarH8VAIvQODIbwh9XpP5Syisf7YoQgsJKPNFoqqLQlu+VQ/tVSshMR6loPMn8U+dPg==",
+      "dev": true,
      "optional": true,
      "engines": {
        "node": ">=14"
@@ -4048,7 +4054,8 @@
    "node_modules/any-promise": {
      "version": "1.3.0",
      "resolved": "https://registry.npmjs.org/any-promise/-/any-promise-1.3.0.tgz",
-      "integrity": "sha512-7UvmKalWRt1wgjL1RrGxoSJW/0QZFIegpeGvZG9kjp8vrRu55XTHbwnqq2GpXm9uLbcuhxm3IqX9OB4MZR1b2A=="
+      "integrity": "sha512-7UvmKalWRt1wgjL1RrGxoSJW/0QZFIegpeGvZG9kjp8vrRu55XTHbwnqq2GpXm9uLbcuhxm3IqX9OB4MZR1b2A==",
+      "dev": true
    },
    "node_modules/anymatch": {
      "version": "3.1.3",
@@ -4472,6 +4479,7 @@
      "version": "2.0.1",
      "resolved": "https://registry.npmjs.org/camelcase-css/-/camelcase-css-2.0.1.tgz",
      "integrity": "sha512-QOSvevhslijgYwRx6Rv7zKdMF8lbRmx+uQGx2+vDc+KI/eBnsy9kit5aj23AgGu3pa4t9AgwbnXWqS+iOY+2aA==",
+      "dev": true,
      "engines": {
        "node": ">= 6"
      }
@@ -5626,7 +5634,8 @@
    "node_modules/didyoumean": {
      "version": "1.2.2",
      "resolved": "https://registry.npmjs.org/didyoumean/-/didyoumean-1.2.2.tgz",
-      "integrity": "sha512-gxtyfqMg7GKyhQmb056K7M3xszy/myH8w+B4RT+QXBQsvAOdc3XymqDDPHx1BgPgsdAA5SIifona89YtRATDzw=="
+      "integrity": "sha512-gxtyfqMg7GKyhQmb056K7M3xszy/myH8w+B4RT+QXBQsvAOdc3XymqDDPHx1BgPgsdAA5SIifona89YtRATDzw==",
+      "dev": true
    },
    "node_modules/dir-glob": {
      "version": "3.0.1",
@@ -5642,7 +5651,8 @@
    "node_modules/dlv": {
      "version": "1.1.3",
      "resolved": "https://registry.npmjs.org/dlv/-/dlv-1.1.3.tgz",
-      "integrity": "sha512-+HlytyjlPKnIG8XuRG8WvmBP8xs8P71y+SKKS6ZXWoEgLuePxtDoUEiH7WkdePWrQ5JBpE6aoVqfZfJUQkjXwA=="
+      "integrity": "sha512-+HlytyjlPKnIG8XuRG8WvmBP8xs8P71y+SKKS6ZXWoEgLuePxtDoUEiH7WkdePWrQ5JBpE6aoVqfZfJUQkjXwA==",
+      "dev": true
    },
    "node_modules/dns-packet": {
      "version": "5.6.1",
@@ -6464,6 +6474,7 @@
      "version": "3.1.1",
      "resolved": "https://registry.npmjs.org/foreground-child/-/foreground-child-3.1.1.tgz",
      "integrity": "sha512-TMKDUnIte6bfb5nWv7V/caI169OHgvwjb7V4WkeUvbQQdjr5rWKqHFiKWb/fcOwB+CzBT+qbWjvj+DVwRskpIg==",
+      "dev": true,
      "dependencies": {
        "cross-spawn": "^7.0.0",
        "signal-exit": "^4.0.1"
@@ -6479,6 +6490,7 @@
      "version": "4.1.0",
      "resolved": "https://registry.npmjs.org/signal-exit/-/signal-exit-4.1.0.tgz",
      "integrity": "sha512-bzyZ1e88w9O1iNJbKnOlvYTrWPDl46O1bG0D3XInv+9tkPrxrN8jUUTiFlDkkmKWgn1M6CfIA13SuGqOa9Korw==",
+      "dev": true,
      "engines": {
        "node": ">=14"
      },
@@ -7958,6 +7970,7 @@
      "version": "2.3.6",
      "resolved": "https://registry.npmjs.org/jackspeak/-/jackspeak-2.3.6.tgz",
      "integrity": "sha512-N3yCS/NegsOBokc8GAdM8UcmfsKiSS8cipheD/nivzr700H+nsMOxJjQnvwOcRYVuFkdH0wGUvW2WbXGmrZGbQ==",
+      "dev": true,
      "dependencies": {
        "@isaacs/cliui": "^8.0.2"
      },
@@ -10501,6 +10514,7 @@
      "version": "7.0.4",
      "resolved": "https://registry.npmjs.org/minipass/-/minipass-7.0.4.tgz",
      "integrity": "sha512-jYofLM5Dam9279rdkWzqHozUo4ybjdZmCsDHePy5V/PbBcVMiSZR97gmAy45aqi8CK1lG2ECd356FU86avfwUQ==",
+      "dev": true,
      "engines": {
        "node": ">=16 || 14 >=14.17"
      }
@@ -10534,6 +10548,7 @@
      "version": "2.7.0",
      "resolved": "https://registry.npmjs.org/mz/-/mz-2.7.0.tgz",
      "integrity": "sha512-z81GNO7nnYMEhrGh9LeymoE4+Yr0Wn5McHIZMK5cfQCl+NDX08sCZgUc9/6MHni9IWuFLm1Z3HTCXu2z9fN62Q==",
+      "dev": true,
      "dependencies": {
        "any-promise": "^1.0.0",
        "object-assign": "^4.0.1",
@@ -10691,6 +10706,7 @@
      "version": "3.0.0",
      "resolved": "https://registry.npmjs.org/object-hash/-/object-hash-3.0.0.tgz",
      "integrity": "sha512-RSn9F68PjH9HqtltsSnqYC1XXoWe9Bju5+213R98cNGttag9q9yAOTzdbsqvIa7aNm5WffBZFpWYr2aWrklWAw==",
+      "dev": true,
      "engines": {
        "node": ">= 6"
      }
@@ -11029,6 +11045,7 @@
      "version": "1.10.2",
      "resolved": "https://registry.npmjs.org/path-scurry/-/path-scurry-1.10.2.tgz",
      "integrity": "sha512-7xTavNy5RQXnsjANvVvMkEjvloOinkAjv/Z6Ildz9v2RinZ4SBKTWFOVRbaF8p0vpHnyjV/UwNDdKuUv6M5qcA==",
+      "dev": true,
      "dependencies": {
        "lru-cache": "^10.2.0",
        "minipass": "^5.0.0 || ^6.0.2 || ^7.0.0"
@@ -11044,6 +11061,7 @@
      "version": "10.2.1",
      "resolved": "https://registry.npmjs.org/lru-cache/-/lru-cache-10.2.1.tgz",
      "integrity": "sha512-tS24spDe/zXhWbNPErCHs/AGOzbKGHT+ybSBqmdLm8WZ1xXLWvH8Qn71QPAlqVhd0qUTWjy+Kl9JmISgDdEjsA==",
+      "dev": true,
      "engines": {
        "node": "14 || >=16.14"
      }
@@ -11094,6 +11112,7 @@
      "version": "2.3.0",
      "resolved": "https://registry.npmjs.org/pify/-/pify-2.3.0.tgz",
      "integrity": "sha512-udgsAY+fTnvv7kI7aaxbqwWNb0AHiB0qBO89PZKPkoTmGOgdbrHDKD+0B2X4uTfJ/FT1R09r9gTsjUjNJotuog==",
+      "dev": true,
      "engines": {
        "node": ">=0.10.0"
      }
@@ -11102,6 +11121,7 @@
      "version": "4.0.6",
      "resolved": "https://registry.npmjs.org/pirates/-/pirates-4.0.6.tgz",
      "integrity": "sha512-saLsH7WeYYPiD25LDuLRRY/i+6HaPYr6G1OUlN39otzkSTxKnubR9RTxS3/Kk50s1g2JTgFwWQDQyplC5/SHZg==",
+      "dev": true,
      "engines": {
        "node": ">= 6"
      }
@@ -11320,6 +11340,7 @@
      "version": "15.1.0",
      "resolved": "https://registry.npmjs.org/postcss-import/-/postcss-import-15.1.0.tgz",
      "integrity": "sha512-hpr+J05B2FVYUAXHeK1YyI267J/dDDhMU6B6civm8hSY1jYJnBXxzKDKDswzJmtLHryrjhnDjqqp/49t8FALew==",
+      "dev": true,
      "dependencies": {
        "postcss-value-parser": "^4.0.0",
        "read-cache": "^1.0.0",
@@ -11336,6 +11357,7 @@
      "version": "4.0.1",
      "resolved": "https://registry.npmjs.org/postcss-js/-/postcss-js-4.0.1.tgz",
      "integrity": "sha512-dDLF8pEO191hJMtlHFPRa8xsizHaM82MLfNkUHdUtVEV3tgTp5oj+8qbEqYM57SLfc74KSbw//4SeJma2LRVIw==",
+      "dev": true,
      "dependencies": {
        "camelcase-css": "^2.0.1"
      },
@@ -11354,6 +11376,7 @@
      "version": "4.0.2",
      "resolved": "https://registry.npmjs.org/postcss-load-config/-/postcss-load-config-4.0.2.tgz",
      "integrity": "sha512-bSVhyJGL00wMVoPUzAVAnbEoWyqRxkjv64tUl427SKnPrENtq6hJwUojroMz2VB+Q1edmi4IfrAPpami5VVgMQ==",
+      "dev": true,
      "funding": [
        {
          "type": "opencollective",
@@ -11388,6 +11411,7 @@
      "version": "3.1.1",
      "resolved": "https://registry.npmjs.org/lilconfig/-/lilconfig-3.1.1.tgz",
      "integrity": "sha512-O18pf7nyvHTckunPWCV1XUNXU1piu01y2b7ATJ0ppkUkk8ocqVWBrYjJBCwHDjD/ZWcfyrA0P4gKhzWGi5EINQ==",
+      "dev": true,
      "engines": {
        "node": ">=14"
      },
@@ -11399,6 +11423,7 @@
      "version": "2.4.1",
      "resolved": "https://registry.npmjs.org/yaml/-/yaml-2.4.1.tgz",
      "integrity": "sha512-pIXzoImaqmfOrL7teGUBt/T7ZDnyeGBWyXQBvOVhLkWLN37GXv8NMLK406UY6dS51JfcQHsmcW5cJ441bHg6Lg==",
+      "dev": true,
      "bin": {
        "yaml": "bin.mjs"
      },
@@ -11618,6 +11643,7 @@
      "version": "6.0.1",
      "resolved": "https://registry.npmjs.org/postcss-nested/-/postcss-nested-6.0.1.tgz",
      "integrity": "sha512-mEp4xPMi5bSWiMbsgoPfcP74lsWLHkQbZc3sY+jWYd65CUwXrUaTp0fmNpa01ZcETKlIgUdFN/MpS2xZtqL9dQ==",
+      "dev": true,
      "dependencies": {
        "postcss-selector-parser": "^6.0.11"
      },
@@ -12282,6 +12308,14 @@
        "react-dom": "^16.6.0 || ^17.0.0 || ^18.0.0"
      }
    },
+    "node_modules/react-icons": {
+      "version": "5.2.1",
+      "resolved": "https://registry.npmjs.org/react-icons/-/react-icons-5.2.1.tgz",
+      "integrity": "sha512-zdbW5GstTzXaVKvGSyTaBalt7HSfuK5ovrzlpyiWHAFXndXTdd/1hdDHI4xBM1Mn7YriT6aqESucFl9kEXzrdw==",
+      "peerDependencies": {
+        "react": "*"
+      }
+    },
    "node_modules/react-is": {
      "version": "16.13.1",
      "resolved": "https://registry.npmjs.org/react-is/-/react-is-16.13.1.tgz",
@@ -12412,6 +12446,7 @@
      "version": "1.0.0",
      "resolved": "https://registry.npmjs.org/read-cache/-/read-cache-1.0.0.tgz",
      "integrity": "sha512-Owdv/Ft7IjOgm/i0xvNDZ1LrRANRfew4b2prF3OWMQLxLfu3bS8FVhCsrSCMK4lR56Y9ya+AThoTpDCTxCmpRA==",
+      "dev": true,
      "dependencies": {
        "pify": "^2.3.0"
      }
@@ -13616,6 +13651,7 @@
      "version": "4.2.3",
      "resolved": "https://registry.npmjs.org/string-width/-/string-width-4.2.3.tgz",
      "integrity": "sha512-wKyQRQpjJ0sIp62ErSZdGsjMJWsap5oRNihHhu6G7JVO/9jIB6UyevL+tXuOqrng8j/cxKTWyWUwvSTriiZz/g==",
+      "dev": true,
      "dependencies": {
        "emoji-regex": "^8.0.0",
        "is-fullwidth-code-point": "^3.0.0",
@@ -13628,7 +13664,8 @@
    "node_modules/string-width-cjs/node_modules/emoji-regex": {
      "version": "8.0.0",
      "resolved": "https://registry.npmjs.org/emoji-regex/-/emoji-regex-8.0.0.tgz",
-      "integrity": "sha512-MSjYzcWNOA0ewAHpz0MxpYFvwg6yjy1NG3xteoqz644VCo/RPgnr1/GGt+ic3iJTzQ8Eu3TdM14SawnVUmGE6A=="
+      "integrity": "sha512-MSjYzcWNOA0ewAHpz0MxpYFvwg6yjy1NG3xteoqz644VCo/RPgnr1/GGt+ic3iJTzQ8Eu3TdM14SawnVUmGE6A==",
+      "dev": true
    },
    "node_modules/string-width/node_modules/ansi-regex": {
      "version": "6.0.1",
@@ -13697,6 +13734,7 @@
      "version": "6.0.1",
      "resolved": "https://registry.npmjs.org/strip-ansi/-/strip-ansi-6.0.1.tgz",
      "integrity": "sha512-Y38VPSHcqkFrCpFnQ9vuSXmquuv5oXOKpGeT6aGrr3o3Gc9AlVa6JBfUSOCnbxGGZF+/0ooI7KrPuUSztUdU5A==",
+      "dev": true,
      "dependencies": {
        "ansi-regex": "^5.0.1"
      },
@@ -13763,6 +13801,7 @@
      "version": "3.35.0",
      "resolved": "https://registry.npmjs.org/sucrase/-/sucrase-3.35.0.tgz",
      "integrity": "sha512-8EbVDiu9iN/nESwxeSxDKe0dunta1GOlHufmSSXxMD2z2/tMZpDMpvXQGsc+ajGo8y2uYUmixaSRUc/QPoQ0GA==",
+      "dev": true,
      "dependencies": {
        "@jridgewell/gen-mapping": "^0.3.2",
        "commander": "^4.0.0",
@@ -13784,6 +13823,7 @@
      "version": "2.0.1",
      "resolved": "https://registry.npmjs.org/brace-expansion/-/brace-expansion-2.0.1.tgz",
      "integrity": "sha512-XnAIvQ8eM+kC6aULx6wuQiwVsnzsi9d3WxzV3FpWTGA19F621kwdbsAcFKXgKUHZWsy+mY6iL1sHTxWEFCytDA==",
+      "dev": true,
      "dependencies": {
        "balanced-match": "^1.0.0"
      }
@@ -13792,6 +13832,7 @@
      "version": "4.1.1",
      "resolved": "https://registry.npmjs.org/commander/-/commander-4.1.1.tgz",
      "integrity": "sha512-NOKm8xhkzAjzFx8B2v5OAHT+u5pRQc2UCa2Vq9jYL/31o2wi9mxBA7LIFs3sV5VSC49z6pEhfbMULvShKj26WA==",
+      "dev": true,
      "engines": {
        "node": ">= 6"
      }
@@ -13800,6 +13841,7 @@
      "version": "10.3.12",
      "resolved": "https://registry.npmjs.org/glob/-/glob-10.3.12.tgz",
      "integrity": "sha512-TCNv8vJ+xz4QiqTpfOJA7HvYv+tNIRHKfUWw/q+v2jdgN4ebz+KY9tGx5J4rHP0o84mNP+ApH66HRX8us3Khqg==",
+      "dev": true,
      "dependencies": {
        "foreground-child": "^3.1.0",
        "jackspeak": "^2.3.6",
@@ -13821,6 +13863,7 @@
      "version": "9.0.4",
      "resolved": "https://registry.npmjs.org/minimatch/-/minimatch-9.0.4.tgz",
      "integrity": "sha512-KqWh+VchfxcMNRAJjj2tnsSJdNbHsVgnkBhTNrW7AjVo6OvLtxw8zfT9oLw1JSohlFzJ8jCoTgaoXvJ+kHt6fw==",
+      "dev": true,
      "dependencies": {
        "brace-expansion": "^2.0.1"
      },
@@ -13953,6 +13996,7 @@
      "version": "3.4.3",
      "resolved": "https://registry.npmjs.org/tailwindcss/-/tailwindcss-3.4.3.tgz",
      "integrity": "sha512-U7sxQk/n397Bmx4JHbJx/iSOOv5G+II3f1kpLpY2QeUv5DcPdcTsYLlusZfq1NthHS1c1cZoyFmmkex1rzke0A==",
+      "dev": true,
      "dependencies": {
        "@alloc/quick-lru": "^5.2.0",
        "arg": "^5.0.2",
@@ -13989,6 +14033,7 @@
      "version": "6.0.2",
      "resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-6.0.2.tgz",
      "integrity": "sha512-XxwI8EOhVQgWp6iDL+3b0r86f4d6AX6zSU55HfB4ydCEuXLXc5FcYeOu+nnGftS4TEju/11rt4KJPTMgbfmv4A==",
+      "dev": true,
      "dependencies": {
        "is-glob": "^4.0.3"
      },
@@ -14140,6 +14185,7 @@
      "version": "3.3.1",
      "resolved": "https://registry.npmjs.org/thenify/-/thenify-3.3.1.tgz",
      "integrity": "sha512-RVZSIV5IG10Hk3enotrhvz0T9em6cyHBLkH/YAZuKqd8hRkKhSfCGIcP2KUY0EPxndzANBmNllzWPwak+bheSw==",
+      "dev": true,
      "dependencies": {
        "any-promise": "^1.0.0"
      }
@@ -14148,6 +14194,7 @@
      "version": "1.6.0",
      "resolved": "https://registry.npmjs.org/thenify-all/-/thenify-all-1.6.0.tgz",
      "integrity": "sha512-RNxQH/qI8/t3thXJDwcstUO4zeqo64+Uy/+sNVRBx4Xn2OX+OZ9oP+iJnNFqplFra2ZUVeKCSa2oVWi3T4uVmA==",
+      "dev": true,
      "dependencies": {
        "thenify": ">= 3.1.0 < 4"
      },
@@ -14244,7 +14291,8 @@
    "node_modules/ts-interface-checker": {
      "version": "0.1.13",
      "resolved": "https://registry.npmjs.org/ts-interface-checker/-/ts-interface-checker-0.1.13.tgz",
-      "integrity": "sha512-Y/arvbn+rrz3JCKl9C4kVNfTfSm2/mEp5FSz5EsZSANGPSlQrpRI5M4PKF+mJnE52jOO90PnPSc3Ur3bTQw0gA=="
+      "integrity": "sha512-Y/arvbn+rrz3JCKl9C4kVNfTfSm2/mEp5FSz5EsZSANGPSlQrpRI5M4PKF+mJnE52jOO90PnPSc3Ur3bTQw0gA==",
+      "dev": true
    },
    "node_modules/tslib": {
      "version": "2.6.2",
@@ -15202,6 +15250,7 @@
      "version": "7.0.0",
      "resolved": "https://registry.npmjs.org/wrap-ansi/-/wrap-ansi-7.0.0.tgz",
      "integrity": "sha512-YVGIj2kamLSTxw6NsZjoBxfSwsn0ycdesmc4p+Q21c5zPuZ1pl+NfxVdxPtdHvmNVOQ6XSYG4AUtyt/Fi7D16Q==",
+      "dev": true,
      "dependencies": {
        "ansi-styles": "^4.0.0",
        "string-width": "^4.1.0",
@@ -15217,12 +15266,14 @@
    "node_modules/wrap-ansi-cjs/node_modules/emoji-regex": {
      "version": "8.0.0",
      "resolved": "https://registry.npmjs.org/emoji-regex/-/emoji-regex-8.0.0.tgz",
-      "integrity": "sha512-MSjYzcWNOA0ewAHpz0MxpYFvwg6yjy1NG3xteoqz644VCo/RPgnr1/GGt+ic3iJTzQ8Eu3TdM14SawnVUmGE6A=="
+      "integrity": "sha512-MSjYzcWNOA0ewAHpz0MxpYFvwg6yjy1NG3xteoqz644VCo/RPgnr1/GGt+ic3iJTzQ8Eu3TdM14SawnVUmGE6A==",
+      "dev": true
    },
    "node_modules/wrap-ansi-cjs/node_modules/string-width": {
      "version": "4.2.3",
      "resolved": "https://registry.npmjs.org/string-width/-/string-width-4.2.3.tgz",
      "integrity": "sha512-wKyQRQpjJ0sIp62ErSZdGsjMJWsap5oRNihHhu6G7JVO/9jIB6UyevL+tXuOqrng8j/cxKTWyWUwvSTriiZz/g==",
+      "dev": true,
      "dependencies": {
        "emoji-regex": "^8.0.0",
        "is-fullwidth-code-point": "^3.0.0",
@@ -18,19 +18,20 @@
    "@docusaurus/core": "3.2.1",
    "@docusaurus/preset-classic": "3.2.1",
    "@mdx-js/react": "^3.0.0",
-    "autoprefixer": "^10.4.19",
    "clsx": "^2.0.0",
-    "postcss": "^8.4.38",
    "prism-react-renderer": "^2.3.0",
    "react": "^18.0.0",
    "react-dom": "^18.0.0",
-    "react-use": "^17.5.0",
-    "tailwindcss": "^3.4.3"
+    "react-icons": "^5.2.1",
+    "react-use": "^17.5.0"
  },
  "devDependencies": {
    "@docusaurus/module-type-aliases": "3.2.1",
    "@docusaurus/tsconfig": "3.2.1",
    "@docusaurus/types": "3.2.1",
+    "autoprefixer": "^10.4.19",
+    "postcss": "^8.4.38",
+    "tailwindcss": "^3.4.3",
    "typescript": "~5.2.2"
  },
  "browserslist": {
@@ -0,0 +1,6 @@
+module.exports = {
+  plugins: {
+    tailwindcss: {},
+    autoprefixer: {},
+  },
+};
@@ -0,0 +1,27 @@
+import { FaSlack, FaDiscord, FaGithub } from "react-icons/fa";
+
+function CustomFooter() {
+  return (
+    <footer className="dark:text-white h-[25vh] bg-gradient-to-b from-gray-900 to-gray-900">
+        <div className="flex flex-col justify-between w-full items-center p-2 h-full">
+          <div className="flex gap-2">
+            <div className="font-bold  text-lg md:text-3xl">OpenDevin</div>
+            <div className="text-sm"><a className="hover:text-white transition-all duration-300 cursor-pointer hover:no-underline" href="/modules/usage/intro">Docs</a></div>
+          </div>
+            <div className="uppercase font-light">Community</div>
+          <div className="flex gap-6 text-3xl">
+              <div><a className="hover:text-white transition-all duration-300" href="https://join.slack.com/t/opendevin/shared_invite/zt-2ggtwn3k5-PvAA2LUmqGHVZ~XzGq~ILw" target="_blank"><FaSlack /></a></div>
+              <div><a className="hover:text-white transition-all duration-300" href="https://discord.gg/ESHStjSjD4" target="_blank"><FaDiscord /></a></div>
+              <div><a className="hover:text-white transition-all duration-300" href="https://github.com/OpenDevin/OpenDevin" target="_blank"><FaGithub /></a></div>
+          </div>
+          <div >
+          </div>
+        <div >
+          <p className="uppercase">Copyright &copy; {new Date().getFullYear()} OpenDevin</p>
+        </div>
+      </div>
+    </footer>
+  );
+}
+
+export default CustomFooter;
@@ -7,9 +7,14 @@ import styles from "./index.module.css";
 export function HomepageHeader() {
  const { siteConfig } = useDocusaurusContext();
  return (
-    <div className={styles.headerContainer}>
-      <div className={styles.header}>
-        <Heading as="h1" className="hero__title">
+    <div className="h-screen bg-gradient-to-t from-slate-600 to-black">
+    {/* <div className={styles.headerContainer}> */}
+      <div className={`text-white flex flex-col 
+      items-center p-6 font-light w-full`}>
+        <Heading as="h1" className="
+        text-5xl
+        ">
+          {/* hero__title  */}
          {siteConfig.title}
        </Heading>
        <p className="hero__subtitle">{siteConfig.tagline}</p>
@@ -21,8 +26,8 @@ export function HomepageHeader() {
            Get Started
          </Link>
        </div>
-      </div>{" "}
      <Demo />
+      </div>
    </div>
  );
 }
@@ -1,11 +1,14 @@
 import styles from "./styles.module.css";
-
+import "../../pages/index.module.css"
 export function Welcome() {
  return (
-    <div className={styles.container}>
-      <div className={styles.innerContainer}>
-        <img src="img/logo.png" className={styles.sidebarImage} />
-        <p className={styles.welcomeText}>
+    <div className="text-white">
+      <div className="flex justify-center items-center flex-col md:flex-row bg-gradient-to-b from-slate-600 dark:to-gray-900 to-gray-200">
+        <img src="img/logo.png" className="
+        max-sm:h-[40vw] max-sm:w-[40vw]
+        h-[45vh] w-[45vw]
+        md:h-[60vh] md:w-[350px]" />
+        <p className=" px-6 md:p-2 mb-6 font-light text-lg md:text-2xl">
          Welcome to OpenDevin, an open-source project aiming to replicate
          Devin, an autonomous AI software engineer who is capable of executing
          complex engineering tasks and collaborating actively with users on
@@ -0,0 +1,4 @@
+/* src/css/main.css */
+@tailwind base;
+@tailwind components;
+@tailwind utilities;
@@ -1,31 +1,32 @@
 import Layout from "@theme/Layout";
+import CustomFooter from "../components/CustomFooter";

 export default function FAQ() {
  return (
+    <>
    <Layout title="FAQ" description="Frequently Asked Questions">
      <div
        id="faq"
-        style={{
-          maxWidth: "900px",
-          margin: "0px auto",
-          padding: "40px",
-          textAlign: "justify",
-        }}
+        className="m-auto p-6 flex flex-col gap-2 mb-6"
      >
-        <h1 style={{ fontSize: "3rem" }}>Frequently Asked Questions</h1>
-        <h2 style={{ fontSize: "2rem" }}>Support</h2>
-        <h3>How can I report an issue with OpenDevin?</h3>
-        <p>
+        <div className="flex items-center justify-center text-2xl lg:text-6xl p-2 uppercase font-bold">Frequently Asked Questions</div>
+        <div className="flex flex-col gap-2 w-full mb-6" >
+        <div className="uppercase font-bold text-4xl tracking-wider">Support</div>
+        <div>How can I report an issue with OpenDevin?</div>
+        <div>
          Please file a bug on{" "}
-          <a href="https://github.com/OpenDevin/OpenDevin/issues">GitHub</a> if
+          <a href="https://github.com/OpenDevin/OpenDevin/issues" target="_blank">GitHub</a> if
          you notice a problem that likely affects others.
          If you're having trouble installing, or have general questions, reach out on{" "}
-          <a href="https://discord.gg/mBuDGRzzES">Discord</a> or{" "}
-          <a href="https://join.slack.com/t/opendevin/shared_invite/zt-2ggtwn3k5-PvAA2LUmqGHVZ~XzGq~ILw">Slack</a>.
-        </p>
-        <h2 style={{ fontSize: "2rem" }}>General</h2>
-        <h3>What is Devin?</h3>
-        <p>
+          <a href="https://discord.gg/mBuDGRzzES" target="_blank">Discord</a> or{" "}
+          <a href="https://join.slack.com/t/opendevin/shared_invite/zt-2ggtwn3k5-PvAA2LUmqGHVZ~XzGq~ILw" target="_blank">Slack</a>.
+        </div>
+
+        </div>
+        <div className="flex flex-col gap-2 w-full mb-6">
+        <div className="uppercase font-bold text-4xl tracking-wider" >General</div>
+        <div>What is Devin?</div>
+        <div>
          <span style={{ fontWeight: "600", color: "var(--logo)" }}>Devin</span>{" "}
          represents a cutting-edge autonomous agent designed to navigate the
          complexities of software engineering. It leverages a combination of
@@ -34,8 +35,10 @@ export default function FAQ() {
          explore and expand upon Devin's capabilities, identifying both its
          strengths and areas for improvement, to guide the progress of open
          code models.
-        </p>
-        <h3>Why OpenDevin?</h3>
+        </div>
+        </div>
+        <div className="flex flex-col gap-2 w-full mb-6">
+        <div className="uppercase font-bold text-4xl tracking-wider">Why OpenDevin?</div>
        <p>
          The{" "}
          <span style={{ fontWeight: "600", color: "var(--logo)" }}>
@@ -50,8 +53,11 @@ export default function FAQ() {
          scenarios, producing works that significantly contribute to the
          community and pave the way for future advancements.
        </p>
-        <h3>How to fix an issue on OpenDevin?</h3>
-        <p>
+
+        </div>
+        <div className="flex flex-col gap-2 w-full mb-6">
+        <div className="uppercase font-bold text-4xl tracking-wider">How to fix an issue on OpenDevin?</div>
+        <div>
          To fix an issue on GitHub using OpenDevin, send a prompt to OpenDevin asking it to follow these steps:
          <ol>
            <li>Read the issue on <a href="https://github.com/OpenDevin/OpenDevin/issues/1611">GitHub</a></li>
@@ -61,16 +67,19 @@ export default function FAQ() {
            <li>Tell me the link that I need to go to to send a pull request</li>
          </ol>
          Before you run OpenDevin, you can do:
-          <pre>
+          <div className="flex flex-col p-2 bg-gray-300 rounded-md my-2">
            export SANDBOX_ENV_GITHUB_TOKEN=XXX
-          </pre>
+          </div>
          where XXX is a GitHub token that you created that has permissions to push to the OpenDevin repo. If you don’t have write permission to the OpenDevin repo, you might need to change that to:
-          <pre>
+          <div className="flex flex-col p-2 bg-gray-300 rounded-md my-2">
            4. Push the resulting output to my fork at https://github.com/USERNAME/OpenDevin/ using the GITHUB_TOKEN environment variable
-          </pre>
+          </div>
          where USERNAME is your GitHub username.
-        </p>
+        </div>
+        </div>
      </div>
    </Layout>
+    <CustomFooter/>
+    </>
  );
 }
@@ -1,12 +1,14 @@
 import useDocusaurusContext from "@docusaurus/useDocusaurusContext";
 import Layout from "@theme/Layout";
+import '../css/main.css';

 import { HomepageHeader } from "../components/HomepageHeader/HomepageHeader";
 import { Welcome } from "../components/Welcome/Welcome";
-
+import CustomFooter from "../components/CustomFooter";
 export function Header({ title, summary, description }): JSX.Element {
  return (
    <div>
+      <h1>{title}</h1>
      <h2 style={{ fontSize: "40px" }}>{summary}</h2>
      <h3 className="headerDescription">{description}</h3>
    </div>
@@ -16,8 +18,9 @@ export function Header({ title, summary, description }): JSX.Element {
 export default function Home(): JSX.Element {
  const { siteConfig } = useDocusaurusContext();
  return (
+    <>
    <Layout
-      title={`Hello from ${siteConfig.title}`}
+      title={`${siteConfig.title}`}
      description="AI-powered code generation for software engineering."
    >
      <div>
@@ -27,5 +30,7 @@ export default function Home(): JSX.Element {
        </div>
      </div>
    </Layout>
+    <CustomFooter />
+    </>
  );
 }
@@ -0,0 +1,12 @@
+/** @type {import('tailwindcss').Config} */
+module.exports = {
+  content: [
+    "./src/**/*.{js,jsx,ts,tsx}",
+    "./src/components/**/*.{js,jsx,ts,tsx}",
+    "./src/pages/**/*.{js,jsx,ts,tsx}",
+  ],
+  theme: {
+    extend: {},
+  },
+  plugins: [],
+};
@@ -16,6 +16,7 @@ all the preprocessing/evaluation/analysis scripts.
 - HumanEvalFix: [`evaluation/humanevalfix`](./humanevalfix)
 - GAIA: [`evaluation/gaia`](./gaia)
 - Entity deduction Arena (EDA): [`evaluation/EDA`](./EDA)
+- MINT: [`evaluation/mint`](./mint)

 ### Result Visualization

@@ -0,0 +1,12 @@
+Cold(Bob, True)
+Quiet(Bob, True)
+Red(Bob, True)
+Smart(Bob, True)
+Kind(Charlie, True)
+Quiet(Charlie, True)
+Red(Charlie, True)
+Rough(Charlie, True)
+Cold(Dave, True)
+Kind(Dave, True)
+Smart(Dave, True)
+Quiet(Fiona, True)
@@ -0,0 +1,52 @@
+fact1
+	foreach
+		facts.Quiet($x, True)
+		facts.Cold($x, True)
+	assert
+		facts.Smart($x, True)
+
+fact2
+	foreach
+		facts.Red($x, True)
+		facts.Cold($x, True)
+	assert
+		facts.Round($x, True)
+
+fact3
+	foreach
+		facts.Kind($x, True)
+		facts.Rough($x, True)
+	assert
+		facts.Red($x, True)
+
+fact4
+	foreach
+		facts.Quiet($x, True)
+	assert
+		facts.Rough($x, True)
+
+fact5
+	foreach
+		facts.Cold($x, True)
+		facts.Smart($x, True)
+	assert
+		facts.Red($x, True)
+
+fact6
+	foreach
+		facts.Rough($x, True)
+	assert
+		facts.Cold($x, True)
+
+fact7
+	foreach
+		facts.Red($x, True)
+	assert
+		facts.Rough($x, True)
+
+fact8
+	foreach
+		facts.Smart(Dave, True)
+		facts.Kind(Dave, True)
+	assert
+		facts.Quiet(Dave, True)
@@ -0,0 +1,35 @@
+# Logic Reasoning Evaluation
+
+This folder contains the evaluation harness for evaluating agents on the logic reasoning benchmarks [ProntoQA](https://github.com/asaparov/prontoqa) and [ProofWriter](https://allenai.org/data/proofwriter).
+
+## Configure OpenDevin and your LLM
+
+Create a `config.toml` file if it does not exist at the root of the workspace.
+
+Add the following configurations:
+
+```toml
+[core]
+max_iterations = 100
+cache_dir = "/tmp/cache"
+ssh_hostname = "localhost"
+enable_auto_lint = true
+
+# TODO: Change these to the model you want to evaluate
+[eval_gpt4_1106_preview]
+model = "gpt-4-1106-preview"
+api_key = "XXX"
+temperature = 0.0
+
+[eval_some_openai_compatible_model]
+model = "openai/MODEL_NAME"
+base_url = "https://OPENAI_COMPATIBLE_URL/v1"
+api_key = "XXX"
+temperature = 0.0
+```
+
+## Run Inference on logic_reasoning
+The following code will run inference on the first example of the ProntoQA dataset with model gpt-4o.
+```bash
+./evaluation/logic_reasoning/scripts/run_infer.sh ProntoQA gpt-4o 1
+```
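Once inference finishes, results are appended to `output.jsonl` in the evaluation output directory. A minimal sketch of computing accuracy from that file, assuming each line is the JSON-serialized `output` dict built in `run_infer.py` (with a `test_result` entry of the form `{'result': bool}`); the helper name `accuracy_from_jsonl` is hypothetical and not part of this PR:

```python
import json


def accuracy_from_jsonl(lines):
    """Return the fraction of instances whose test_result['result'] is true.

    `lines` is any iterable of JSON strings, e.g. an open output.jsonl file.
    Blank lines are skipped; an empty input yields 0.0.
    """
    results = [
        bool(json.loads(line)['test_result']['result'])
        for line in lines
        if line.strip()
    ]
    return sum(results) / len(results) if results else 0.0


# Illustrative records only; real records also carry the instance, history, etc.
sample = [
    json.dumps({'id': 0, 'test_result': {'result': True}}),
    json.dumps({'id': 1, 'test_result': {'result': False}}),
]
print(accuracy_from_jsonl(sample))
```

With a real run, the same function could be called on the open file handle, for example `accuracy_from_jsonl(open(path_to_output_jsonl))`, where the path depends on the agent, dataset, and model used.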
@@ -0,0 +1,20 @@
+You are a helpful assistant assigned a logic reasoning task. You need to determine the correctness of a query given some facts and rules.
+You can interact with an interactive Python (Jupyter Notebook) environment and receive the corresponding output when needed. The code should be enclosed in the "<execute_ipython>" tag.
+In this task, you need to use the code in [[logic_inference_path.py]] to help you. Specifically, you first need to instantiate a **LogicInferenceEngine** class and use the **safe_execute_program** method to prove the **logic programs**. You should receive *answer*, *flag*, *error_message* from the output. 
+
+An example would look like this:
+    <execute_ipython>
+    import sys
+    sys.path.append(workspace_mount_path)
+    engine = LogicInferenceEngine(dataset_name, workspace_mount_path)
+    answer, flag, error_message = engine.safe_execute_program(logic_programs)
+    </execute_ipython>
+
+Please send the *answer* variable through message.
+
+dataset_name:
+[[dataset_name]]
+
+logic_programs:
+[[logic_programs]]
+
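The `[[...]]` markers in this template are placeholders that `run_infer.py` fills in with plain `str.replace` calls before the instruction is sent to the agent. A minimal sketch of that substitution, assuming a shortened template and a hypothetical helper name `fill_instruction` (neither appears in this PR):

```python
# Shortened stand-in for instruction.txt; the real template has more text.
TEMPLATE = (
    'dataset_name:\n'
    '[[dataset_name]]\n'
    '\n'
    'logic_programs:\n'
    '[[logic_programs]]\n'
)


def fill_instruction(template: str, values: dict) -> str:
    """Replace each [[key]] marker in the template with its value."""
    for key, value in values.items():
        template = template.replace(f'[[{key}]]', value)
    return template


filled = fill_instruction(
    TEMPLATE,
    {
        'dataset_name': 'ProntoQA',
        'logic_programs': 'Query:\nCold(Bob, True)',
    },
)
print(filled)
```

Because the substitution is purely textual, a placeholder that is missing from the values simply stays in the prompt unchanged, which is worth checking for when editing the template.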
@@ -0,0 +1,220 @@
+import os
+import random
+import re
+import shutil
+
+from pyke import knowledge_engine
+
+
+class PykeProgram:
+    def __init__(
+        self, logic_program: str, dataset_name='ProntoQA', workspace_mount_path='./'
+    ) -> None:
+        self.logic_program = logic_program
+        self.flag = self.parse_logic_program()
+        self.dataset_name = dataset_name
+        self.cache_dir = os.path.join(workspace_mount_path, '.cache_program')
+
+        # prepare the files for facts and rules
+        try:
+            self.create_fact_file(self.Facts)
+            self.create_rule_file(self.Rules)
+            self.flag = True
+        except Exception:
+            self.flag = False
+
+        self.answer_map = {
+            'ProntoQA': self.answer_map_prontoqa,
+            'ProofWriter': self.answer_map_proofwriter,
+        }
+
+    def parse_logic_program(self):
+        keywords = ['Query:', 'Rules:', 'Facts:', 'Predicates:']
+        program_str = self.logic_program
+        for keyword in keywords:
+            try:
+                program_str, segment_list = self._parse_segment(program_str, keyword)
+                setattr(self, keyword[:-1], segment_list)
+            except Exception:
+                setattr(self, keyword[:-1], None)
+
+        return self.validate_program()
+
+    def _parse_segment(self, program_str, key_phrase):
+        remain_program_str, segment = program_str.split(key_phrase)
+        segment_list = segment.strip().split('\n')
+        for i in range(len(segment_list)):
+            segment_list[i] = segment_list[i].split(':::')[0].strip()
+        return remain_program_str, segment_list
+
+    # check if the program is valid; if not, try to fix it
+    def validate_program(self):
+        if self.Rules is not None and self.Facts is not None:
+            if not self.Rules[0] == '' and not self.Facts[0] == '':
+                return True
+        # try to fix the program
+        tmp_rules = []
+        tmp_facts = []
+        statements = self.Facts if self.Facts is not None else self.Rules
+        if statements is None:
+            return False
+
+        for fact in statements:
+            if fact.find('>>>') >= 0:  # this is a rule
+                tmp_rules.append(fact)
+            else:
+                tmp_facts.append(fact)
+        self.Rules = tmp_rules
+        self.Facts = tmp_facts
+        return False
+
+    def create_fact_file(self, facts):
+        with open(os.path.join(self.cache_dir, 'facts.kfb'), 'w') as f:
+            for fact in facts:
+                # check for invalid facts
+                if not fact.find('$x') >= 0:
+                    f.write(fact + '\n')
+
+    def create_rule_file(self, rules):
+        pyke_rules = []
+        for idx, rule in enumerate(rules):
+            pyke_rules.append(self.parse_forward_rule(idx + 1, rule))
+
+        with open(os.path.join(self.cache_dir, 'rules.krb'), 'w') as f:
+            f.write('\n\n'.join(pyke_rules))
+
+    # example rule: Furry($x, True) && Quiet($x, True) >>> White($x, True)
+    def parse_forward_rule(self, f_index, rule):
+        premise, conclusion = rule.split('>>>')
+        premise = premise.strip()
+        # split the premise into multiple facts if needed
+        premise = premise.split('&&')
+        premise_list = [p.strip() for p in premise]
+
+        conclusion = conclusion.strip()
+        # split the conclusion into multiple facts if needed
+        conclusion = conclusion.split('&&')
+        conclusion_list = [c.strip() for c in conclusion]
+
+        # create the Pyke rule
+        pyke_rule = f"""fact{f_index}\n\tforeach"""
+        for p in premise_list:
+            pyke_rule += f"""\n\t\tfacts.{p}"""
+        pyke_rule += """\n\tassert"""
+        for c in conclusion_list:
+            pyke_rule += f"""\n\t\tfacts.{c}"""
+        return pyke_rule
+
+    """
+    for example: Is Marvin from Mars?
+    Query: FromMars(Marvin, $label)
+    """
+
+    def check_specific_predicate(self, subject_name, predicate_name, engine):
+        results = []
+        with engine.prove_goal(
+            f'facts.{predicate_name}({subject_name}, $label)'
+        ) as gen:
+            for vars, plan in gen:
+                results.append(vars['label'])
+
+        with engine.prove_goal(
+            f'rules.{predicate_name}({subject_name}, $label)'
+        ) as gen:
+            for vars, plan in gen:
+                results.append(vars['label'])
+
+        if len(results) == 1:
+            return results[0]
+        elif len(results) == 2:
+            return results[0] and results[1]
+        elif len(results) == 0:
+            return None
+
+    """
+    Input Example: Metallic(Wren, False)
+    """
+
+    def parse_query(self, query):
+        pattern = r'(\w+)\(([^,]+),\s*([^)]+)\)'
+        match = re.match(pattern, query)
+        if match:
+            function_name = match.group(1)
+            arg1 = match.group(2)
+            arg2 = match.group(3)
+            arg2 = True if arg2 == 'True' else False
+            return function_name, arg1, arg2
+        else:
+            raise ValueError(f'Invalid query: {query}')
+
+    def execute_program(self):
+        # delete the compiled_krb dir
+        complied_krb_dir = './models/compiled_krb'
+        if os.path.exists(complied_krb_dir):
+            print('removing compiled_krb')
+            # os.system(f'rm -rf {complied_krb_dir}/*')
+            shutil.rmtree(complied_krb_dir)
+
+        # absolute_path = os.path.abspath(complied_krb_dir)
+        # print(absolute_path)
+        try:
+            engine = knowledge_engine.engine(self.cache_dir)
+            engine.reset()
+            engine.activate('rules')
+            engine.get_kb('facts')
+
+            # parse the logic query into pyke query
+            predicate, subject, value_to_check = self.parse_query(self.Query[0])
+            result = self.check_specific_predicate(subject, predicate, engine)
+            answer = self.answer_map[self.dataset_name](result, value_to_check)
+        except Exception as err:
+            return None, err
+
+        return answer, ''
+
+    def answer_mapping(self, answer):
+        return answer
+
+    def answer_map_prontoqa(self, result, value_to_check):
+        if result == value_to_check:
+            return 'A'
+        else:
+            return 'B'
+
+    def answer_map_proofwriter(self, result, value_to_check):
+        if result is None:
+            return 'C'
+        elif result == value_to_check:
+            return 'A'
+        else:
+            return 'B'
+
+
+class LogicInferenceEngine:
+    def __init__(self, dataset_name, workspace_mount_path):
+        self.dataset_name = dataset_name
+        self.workspace_mount_path = workspace_mount_path
+
+    def random_backup(self):
+        if self.dataset_name == 'ProntoQA':
+            return random.choice(['A', 'B'])
+        elif self.dataset_name == 'ProofWriter':
+            return random.choice(['A', 'B', 'C'])
+
+    def safe_execute_program(self, logic_program):
+        program = PykeProgram(
+            logic_program, self.dataset_name, self.workspace_mount_path
+        )
+        # cannot parse the program
+        if not program.flag:
+            answer = self.random_backup()
+            return answer, 'parsing error', ''
+        # execute the program
+        answer, error_message = program.execute_program()
+        # not executable
+        if answer is None:
+            answer = self.random_backup()
+            return answer, 'execution error', error_message
+        # successfully executed
+        answer = program.answer_mapping(answer)
+        return answer, 'success', ''
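`parse_forward_rule` turns each `premise >>> conclusion` line of a logic program into a Pyke forward-chaining rule of the kind shown in the `rules.krb` file earlier in this diff. A standalone sketch of the same translation, using a hypothetical function name `to_pyke_rule` and no Pyke dependency, so the string format can be checked in isolation:

```python
def to_pyke_rule(index: int, rule: str) -> str:
    """Translate 'A && B >>> C' into the text of a Pyke forward-chaining rule.

    Premises go under `foreach`, conclusions under `assert`, and every
    statement is prefixed with the `facts.` knowledge-base name.
    """
    premise, conclusion = rule.split('>>>')
    premises = [p.strip() for p in premise.split('&&')]
    conclusions = [c.strip() for c in conclusion.split('&&')]

    lines = [f'fact{index}', '\tforeach']
    lines += [f'\t\tfacts.{p}' for p in premises]
    lines.append('\tassert')
    lines += [f'\t\tfacts.{c}' for c in conclusions]
    return '\n'.join(lines)


example = to_pyke_rule(1, 'Quiet($x, True) && Cold($x, True) >>> Smart($x, True)')
print(example)
```

The printed text has the same shape as the `fact1` entry in `rules.krb`. Like the method in the PR, this sketch raises `ValueError` when a line does not contain exactly one `>>>`.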
@@ -0,0 +1,436 @@
+import asyncio
+import json
+import logging
+import multiprocessing as mp
+import os
+import pathlib
+import shutil
+import time
+from concurrent.futures import ProcessPoolExecutor
+
+from datasets import load_dataset
+from tqdm import tqdm
+
+from evaluation.swe_bench.swe_env_box import DockerSSHBox
+from opendevin.controller.state.state import State
+from opendevin.core.config import config, get_llm_config_arg, get_parser
+from opendevin.core.logger import get_console_handler
+from opendevin.core.logger import opendevin_logger as logger
+from opendevin.core.main import main
+from opendevin.events.action import MessageAction
+from opendevin.events.serialization.event import event_to_dict
+
+
+def cleanup():
+    logger.info('Cleaning up child processes...')
+    for process in mp.active_children():
+        logger.info(f'Terminating child process: {process.name}')
+        process.terminate()
+        process.join()
+
+
+def codeact_user_response(state: State) -> str:
+    msg = (
+        'Please continue working on the task on whatever approach you think is suitable.\n'
+        'If you think you have solved the task, please run the following command: <execute_bash> exit </execute_bash>.\n'
+        'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.\n'
+    )
+    if state.history:
+        user_msgs = [
+            action
+            for action, _ in state.history
+            if isinstance(action, MessageAction) and action.source == 'user'
+        ]
+        if len(user_msgs) >= 2:
+            # let the agent know that it can give up when it has tried 3 times
+            return (
+                msg
+                + 'If you want to give up, run: <execute_bash> exit </execute_bash>.\n'
+            )
+    return msg
+
+
+def monologue_user_response(state: State) -> str:
+    raise NotImplementedError('MonologueAgent should never ask for user responses.')
+
+
+AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
+    'CodeActAgent': codeact_user_response,
+    'MonologueAgent': monologue_user_response,
+}
+
+AGENT_CLS_TO_INST_SUFFIX = {
+    'CodeActAgent': 'When you think you have solved the question, please first send your answer to user through message and then exit.\n'
+}
+
+
+def get_choice(answer_str):
+    choices = [
+        'A',
+        'B',
+        'C',
+        'D',
+        'E',
+        'F',
+        'G',
+        'H',
+        'A)',
+        'B)',
+        'C)',
+        'D)',
+        'E)',
+        'F)',
+        'G)',
+        'H)',
+        'A.',
+        'B.',
+        'C.',
+        'D.',
+        'E.',
+        'F.',
+        'G.',
+        'H.',
+    ]
+    for c in choices:
+        if answer_str.startswith(c):
+            return c.replace(')', '')
+
+    if answer_str.startswith(':'):
+        return answer_str.replace(':', '').replace('.', '').strip()
+    return None
+
+
+def get_test_result(
+    model_answer: str,
+    ground_truth: str,
+) -> dict:
+    gold_answer = ground_truth.replace('(', '').replace(')', '').strip()
+    answer_str = model_answer if model_answer is not None else ''
+    prediction = get_choice(answer_str)
+
+    indicators = [
+        'the correct option is',
+        'the correct answer is',
+        'The correct answer is',
+        'The correct option is',
+        'Thus, the answer is',
+    ]
+    if prediction is None:
+        for indicator in indicators:
+            if answer_str.find(indicator) >= 0:
+                answer_str = answer_str.split(indicator)[1].strip()
+                prediction = get_choice(answer_str)
+                break
+
+    isTrue = prediction == gold_answer
+    test_result = {'result': isTrue}
+    return test_result
+
+
+def process_instance(
+    instance,
+    agent_class,
+    # metadata,
+    dataset_name,
+    skip_workspace_mount,
+    eval_output_dir,
+    reset_logger: bool = True,
+):
+    old_workspace_mount_path = config.workspace_mount_path
+    old_workspace_base = config.workspace_base
+    workspace_mount_path = os.path.join(config.workspace_mount_path, '_eval_workspace')
+    # create process-specific workspace dir
+    # if `not skip_workspace_mount` - we will create a workspace directory for EACH process
+    # so that different agents don't interfere with each other.
+    if not skip_workspace_mount:
+        workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
+        pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
+
+    # reset workspace to config
+    config.workspace_base = workspace_mount_path
+    config.workspace_mount_path = workspace_mount_path
+
+    # Set up the logger properly, so you can run multiprocessing to parallelize the evaluation
+    if reset_logger:
+        # Set up logger
+        log_file = os.path.join(
+            eval_output_dir, 'logs', f'instance_{instance["id"]}.log'
+        )
+        # Remove all existing handlers from logger
+        for handler in logger.handlers[:]:
+            logger.removeHandler(handler)
+        # add back the console handler to print ONE line
+        logger.addHandler(get_console_handler())
+        logger.info(
+            f'Starting evaluation for instance {instance["id"]}.\nLOG:   tail -f {log_file}'
+        )
+        # Remove all existing handlers from logger
+        for handler in logger.handlers[:]:
+            logger.removeHandler(handler)
+        file_handler = logging.FileHandler(log_file)
+        file_handler.setFormatter(
+            logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
+        )
+        logger.addHandler(file_handler)
+
+    if not skip_workspace_mount:
+        logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
+
+    # sandbox = DockerSSHBox()
+    logic_inference_path = os.path.join(workspace_mount_path, 'logic_inference.py')
+    if not os.path.exists(logic_inference_path):
+        shutil.copyfile(
+            './evaluation/logic_reasoning/logic_inference.py', logic_inference_path
+        )
+    logger.info(f'logic_inference.py copied to {workspace_mount_path}')
+
+    cache_dir = os.path.join(workspace_mount_path, '.cache_program')
+    if not os.path.exists(cache_dir):
+        os.makedirs(cache_dir)
+
+    # Prepare instruction
+
+    with open('./evaluation/logic_reasoning/instruction.txt', 'r') as f:
+        instruction = f.read()
+
+    instance_logic_programs = instance['raw_logic_programs'][0].strip()
+    instruction = instruction.replace('[[dataset_name]]', dataset_name)
+    instruction = instruction.replace('[[logic_programs]]', instance_logic_programs)
+    instruction = instruction.replace(
+        '[[logic_inference_path.py]]', logic_inference_path
+    )
+
+    # NOTE: You can actually set slightly different instruction for different agents
+    instruction += AGENT_CLS_TO_INST_SUFFIX.get(agent_class, '')
+
+    sandbox = DockerSSHBox()
+    exit_code, command_output = sandbox.execute('pip install scitools-pyke')
+    
+    # Here's how you can run the agent (similar to the `main` function) and get the final task state
+    state: State = asyncio.run(
+        main(
+            instruction,
+            fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(agent_class),
+            sandbox=sandbox,
+        )
+    )
+    # ======= Attempt to evaluate the agent's edits =======
+    # If you are working on a simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
+    # You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
+
+    if state is None:
+        raise ValueError('State should not be None.')
+
+    final_message = ''
+    messages = []
+    for action, obs in reversed(state.history):
+        # if isinstance(act, MessageAction):
+        messages.append(obs.content)
+        # print("obs.content:", obs.content)
+        if str(obs.content) in ["'A'", "'B'", "'C'"]:
+            final_message = obs.content
+            break
+    
+    final_message = final_message.strip("'")
+    logger.info(f'Predicted answer: {final_message}, Ground truth: {instance["answer"]}')
+
+    test_result = get_test_result(
+        model_answer=final_message, ground_truth=instance['answer']
+    )
+
+    # Save the output
+    output = {
+        'id': instance['id'],
+        'instance': instance,
+        'instruction': instruction,
+        # 'metadata': metadata,
+        'history': [
+            (event_to_dict(action), event_to_dict(obs)) for action, obs in state.history
+        ],
+        'final_message': final_message,
+        'messages': messages,
+        'error': state.error if state and state.error else None,
+        'test_result': test_result,
+    }
+    config.workspace_mount_path = old_workspace_mount_path
+    config.workspace_base = old_workspace_base
+    
+    # Close the sandbox
+    sandbox.close()
+    
+    return output
+
+
+if __name__ == '__main__':
+    parser = get_parser()
+    parser.add_argument(
+        '--dataset',
+        type=str,
+        help='the logic reasoning dataset to evaluate on {ProntoQA, ProofWriter}',
+        default='ProntoQA',
+    )
+    parser.add_argument(
+        '--data_split',
+        type=str,
+        help='data split to evaluate on {validation}', # right now we only support validation split
+        default='validation',
+    )
+
+    args, _ = parser.parse_known_args()
+    if args.directory:
+        config.workspace_base = os.path.abspath(args.directory)
+        print(f'Setting workspace base to {config.workspace_base}')
+    # NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
+    # so we don't need to manage file uploading to OpenDevin's repo
+
+    dataset_name = args.dataset
+    data_split = args.data_split
+    dataset = load_dataset(f'renma/{dataset_name}')
+    logic_reasoning_tests = dataset[data_split]
+    logger.info(f'Evaluating logic reasoning dataset {dataset_name} {data_split} split')
+
+    # Check https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/README.md#configure-opendevin-and-your-llm
+    # for details of how to set `llm_config`
+    if args.llm_config:
+        specified_llm_config = get_llm_config_arg(args.llm_config)
+        if specified_llm_config:
+            config.llm = specified_llm_config
+    logger.info(f'Config for evaluation: {config}')
+
+    # TEST METADATA
+    agent_class = args.agent_cls
+    assert (
+        agent_class in AGENT_CLS_TO_FAKE_USER_RESPONSE_FN
+    ), f'Unsupported agent class: {agent_class}'
+    model_name = config.llm.model.split('/')[-1]
+    max_iterations = args.max_iterations
+    eval_note = ''
+    if args.eval_note is not None:
+        eval_note += '_N_' + args.eval_note
+
+    eval_output_dir = os.path.join(
+        args.eval_output_dir,
+        'logic_reasoning',
+        agent_class,
+        dataset_name,
+        model_name + '_maxiter_' + str(max_iterations) + eval_note
+    )
+
+    pathlib.Path(eval_output_dir).mkdir(parents=True, exist_ok=True)
+    pathlib.Path(os.path.join(eval_output_dir, 'logs')).mkdir(
+        parents=True, exist_ok=True
+    )
+    logger.info(f'Using evaluation output directory: {eval_output_dir}')
+
+    # LIMIT EVALUATION
+    eval_n_limit = args.eval_n_limit
+    if eval_n_limit:
+        logic_reasoning_tests = logic_reasoning_tests.select(list(range(eval_n_limit)))
+        logger.info(f'Limiting evaluation to first {eval_n_limit} instances.')
+
+    start_time = time.strftime('%Y-%m-%d %H:%M:%S')
+
+    # OUTPUT FILE
+    output_file = os.path.join(eval_output_dir, 'output.jsonl')
+    logger.info(f'Writing evaluation output to {output_file}')
+    finished_task_ids = set()
+    if os.path.exists(output_file):
+        with open(output_file, 'r') as f:
+            for line in f:
+                data = json.loads(line)
+                finished_task_ids.add(data['id'])
+        logger.warning(
+            f'Output file {output_file} already exists. Loaded {len(finished_task_ids)} finished instances.'
+        )
+    output_fp = open(output_file, 'a')
+
+    logger.info(
+        f'Evaluation started with Agent {agent_class}, model {model_name}, max iterations {max_iterations}.'
+    )
+
+    # =============================================
+    # filter out finished instances
+    new_logic_reasoning_tests = []
+    for instance in logic_reasoning_tests:
+        if instance['id'] in finished_task_ids:
+            logger.info(
+                f'Skipping instance {instance["id"]} as it is already finished.'
+            )
+            continue
+        new_logic_reasoning_tests.append(instance)
+
+    logic_reasoning_tests = new_logic_reasoning_tests
+    logger.info(
+        f'Finished instances: {len(finished_task_ids)}, Remaining instances: {len(logic_reasoning_tests)}'
+    )
+    # =============================================
+
+    pbar = tqdm(total=len(logic_reasoning_tests))
+
+    # This function tracks progress AND writes the output to a JSONL file
+    def update_progress(future):
+        pbar.update(1)
+        output = future.result()
+        pbar.set_description(f'Instance {output["id"]}')
+        pbar.set_postfix_str(f'Test Result: {output["test_result"]["result"]}')
+        logger.info(
+            f'Finished evaluation for instance {output["id"]}: {output["test_result"]["result"]}'
+        )
+        output_fp.write(json.dumps(output) + '\n')
+        # json.dump(output, output_fp, indent=4)
+        output_fp.flush()
+
+    # This sets the multi-processing
+    num_workers = args.eval_num_workers
+    # num_workers = 1
+    logger.info(f'Using {num_workers} workers for evaluation.')
+
+    # This is SWE-Bench specific - CodeActAgent doesn't require a mounted workspace to work
+    skip_workspace_mount = False
+    logger.info(f'Skipping workspace mount: {skip_workspace_mount}')
+
+    try:
+        with ProcessPoolExecutor(num_workers) as executor:
+            futures = []
+            # This is how we perform multi-processing
+            for instance in logic_reasoning_tests:
+                future = executor.submit(
+                    process_instance,
+                    instance,
+                    agent_class,
+                    dataset_name,
+                    skip_workspace_mount,
+                    eval_output_dir,
+                    reset_logger=bool(num_workers > 1),
+                )
+                future.add_done_callback(update_progress)
+                futures.append(future)
+
+            # Wait for all futures to complete
+            for future in futures:
+                future.result()
+    except KeyboardInterrupt:
+        print('KeyboardInterrupt received. Cleaning up...')
+        cleanup()
+
+    output_fp.close()
+    
+    with open(output_file, 'r') as f:
+        test_result = [json.loads(line)['test_result']['result'] for line in f]
+
+    metadata = {
+        'Dataset': dataset_name,
+        'Data split': data_split,
+        'Number of Samples': len(test_result),
+        'Agent class': agent_class,
+        'Model name': model_name,
+        'Start_time': start_time,
+        'End_time': time.strftime('%Y-%m-%d %H:%M:%S'),
+        # Guard against an empty output file to avoid ZeroDivisionError
+        'Final Accuracy': f'{sum(test_result) / len(test_result):.2f}'
+        if test_result
+        else 'N/A',
+    }
+    
+    with open(os.path.join(eval_output_dir, 'metadata.json'), 'w') as f:
+        json.dump(metadata, f, indent=4)
+        
+    logger.info(f'Metadata: {json.dumps(metadata, indent=4)}')
+    logger.info(f'Evaluation finished. Metadata saved to {eval_output_dir}/metadata.json')
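Note on the resume logic in the script above: it reads `output.jsonl`, collects the `id` of every record already written, and skips those instances on the next run. A minimal standalone sketch of that idea follows, assuming each JSONL record carries an `id` field as in the script. The helper names `load_finished_ids` and `filter_pending` are hypothetical and do not exist in the repository.

```python
import json
import os


def load_finished_ids(output_file):
    """Return the set of 'id' values already recorded in a JSONL output file."""
    finished = set()
    if not os.path.exists(output_file):
        return finished
    with open(output_file, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                # Assumption: tolerate blank lines (the script itself does not).
                continue
            finished.add(json.loads(line)['id'])
    return finished


def filter_pending(instances, finished_ids):
    """Keep only the instances whose 'id' has not been evaluated yet."""
    return [inst for inst in instances if inst['id'] not in finished_ids]
```

Because the output file is opened in append mode and flushed after every record, an interrupted run can be restarted with the same arguments and will only process what `filter_pending` would return.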
@@ -0,0 +1,37 @@
+#!/bin/bash
+DATASET=$1
+MODEL_CONFIG=$2
+EVAL_LIMIT=$3
+AGENT=$4
+
+# ################################################################################
+
+if [ -z "$AGENT" ]; then
+  echo "Agent not specified, use default CodeActAgent"
+  AGENT="CodeActAgent"
+fi
+
+# IMPORTANT: Because Agent's prompt changes fairly often in the rapidly evolving codebase of OpenDevin
+# We need to track the version of Agent in the evaluation to make sure results are comparable
+AGENT_VERSION=v$(poetry run python -c "import agenthub; from opendevin.controller.agent import Agent; print(Agent.get_cls('$AGENT').VERSION)")
+
+echo "AGENT: $AGENT"
+echo "AGENT_VERSION: $AGENT_VERSION"
+echo "MODEL_CONFIG: $MODEL_CONFIG"
+
+COMMAND="poetry run python evaluation/logic_reasoning/run_infer.py \
+  --agent-cls $AGENT \
+  --llm-config $MODEL_CONFIG \
+  --dataset $DATASET \
+  --max-iterations 10 \
+  --max-chars 10000000 \
+  --eval-num-workers 1 \
+  --eval-note $AGENT_VERSION"
+
+if [ -n "$EVAL_LIMIT" ]; then
+  echo "EVAL_LIMIT: $EVAL_LIMIT"
+  COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
+fi
+
+# Run the command
+eval $COMMAND
@@ -0,0 +1 @@
+!requirements.txt
@@ -0,0 +1,45 @@
+# MINT Benchmark
+
+This folder contains the evaluation harness for the [MINT benchmark](https://arxiv.org/abs/2309.10691) on LLMs' ability to solve tasks with multi-turn interactions.
+
+## Configure OpenDevin and LM
+
+Create a `config.toml` file if it does not exist at the root of the workspace. Please check [README.md](../../README.md) for how to set this up.
+
+## Start the evaluation
+
+We are using the MINT dataset hosted on [Hugging Face](https://huggingface.co/datasets/ryanhoangt/xingyaoww-mint-bench).
+
+The following is the basic command to start the evaluation. Currently, the only agent supported with MINT is `CodeActAgent`.
+
+```bash
+./evaluation/mint/scripts/run_infer.sh [model_config] [subset] [eval_limit]
+```
+
+where `model_config` is mandatory, while `subset` and `eval_limit` are optional.
+
+- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.
+
+- `subset`, e.g. `math`, is the subset of the MINT benchmark to evaluate on, defaulting to `math`.
+
+- `eval_limit`, e.g. `2`, limits the evaluation to the first `eval_limit` instances, defaulting to all instances.
+
+Note: in order to use `eval_limit`, you must also set `subset`.
+
+Let's say you'd like to run 3 instances on the `gsm8k` subset using `eval_gpt4_1106_preview`,
+then your command would be:
+
+```bash
+./evaluation/mint/scripts/run_infer.sh eval_gpt4_1106_preview gsm8k 3
+```
+
+## Reference
+```
+@misc{wang2024mint,
+    title={MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback},
+    author={Xingyao Wang and Zihan Wang and Jiateng Liu and Yangyi Chen and Lifan Yuan and Hao Peng and Heng Ji},
+    year={2024},
+    eprint={2309.10691},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
@@ -0,0 +1,5 @@
+TASK_INFO_MAP = {
+    # === Reasoning ===
+    'gsm8k': {'class': 'ReasoningTask', 'type': 'reasoning'},
+    'math': {'class': 'ReasoningTask', 'type': 'reasoning'},
+}
@@ -0,0 +1,82 @@
+import enum
+from typing import Any, Dict, Tuple
+
+
+class TaskState:
+    def __init__(
+        self,
+        finished: bool = False,
+        success: bool = False,
+        agent_action_count: dict = None,
+        terminate_reason: str = None,
+        latest_output: Dict[str, Any] = None,
+    ):
+        self.finished = finished
+        self.success = success
+        self.agent_action_count: Dict[str, int] = agent_action_count or {
+            'propose_solution': 0,
+            'use_tool': 0,
+            'invalid_action': 0,
+        }
+        self.terminate_reason = terminate_reason
+        self.latest_output = latest_output
+
+    def to_dict(self) -> Dict[str, Any]:
+        return {
+            'finished': self.finished,
+            'success': self.success,
+            'agent_action_count': self.agent_action_count,
+            'terminate_reason': self.terminate_reason,
+            'latest_output': self.latest_output,
+        }
+
+
+class ParseError(Exception):
+    pass
+
+
+class FeedbackType(enum.Enum):
+    FEEDBACK_WITH_GT = 'feedback_with_gt'
+    FEEDBACK_WO_GT = 'feedback_wo_gt'
+    NO_FEEDBACK = 'no_feedback'
+
+
+class StepOutput:
+    def __init__(
+        self,
+        observation: str = None,
+        success: bool = False,
+        extra: Dict[str, Any] = None,
+        turn_info: Tuple[int, int] = None,
+    ):
+        self.observation: str = observation
+        self.success: bool = success
+        self.extra: Dict[str, Any] = extra
+        self.turn_info = turn_info
+
+    def __repr__(self) -> str:
+        # __repr__ must return a str; observation may be None
+        return str(self.observation)
+
+    def to_str(self) -> str:
+        output = 'Observation:\n'
+        if self.observation is not None:
+            output += self.observation + '\n'
+        else:
+            if not self.success:
+                output += 'Your answer is wrong.\n'
+
+        if self.turn_info is not None:
+            n_steps_left, n_propose_solution_left = self.turn_info
+            output += 'You have {} steps left and {} chances to propose solution left.\n'.format(
+                n_steps_left, n_propose_solution_left
+            )
+            if n_steps_left <= 1:
+                output += 'You should take the last step to propose a solution.\n'
+
+        return output
+
+    def to_dict(self) -> Dict[str, Any]:
+        return {
+            'observation': self.observation,
+            'success': self.success,
+        }
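Note on `StepOutput.to_str` above: the feedback message is assembled from up to three parts (the observation or a default "wrong answer" line, the remaining budget, and a final-step hint). The sketch below mirrors that layout as a plain function so the resulting strings are easy to inspect. The function name `format_observation` is hypothetical and not part of the diff.

```python
from typing import Optional, Tuple


def format_observation(
    observation: Optional[str],
    success: bool,
    turn_info: Optional[Tuple[int, int]] = None,
) -> str:
    """Mirror the message layout of StepOutput.to_str as a plain function."""
    output = 'Observation:\n'
    if observation is not None:
        output += observation + '\n'
    elif not success:
        # No explicit observation and the task has not succeeded yet.
        output += 'Your answer is wrong.\n'

    if turn_info is not None:
        steps_left, proposals_left = turn_info
        output += (
            f'You have {steps_left} steps left and '
            f'{proposals_left} chances to propose solution left.\n'
        )
        if steps_left <= 1:
            output += 'You should take the last step to propose a solution.\n'
    return output
```

For example, `format_observation(None, False, (3, 1))` yields the default "wrong answer" line followed by the remaining budget, which is the message the agent receives after an incorrect proposal.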
@@ -0,0 +1,119 @@
+import re
+import traceback
+from typing import Dict, Optional
+
+from datatypes import ParseError, StepOutput, TaskState
+from task import Task
+
+from opendevin.controller.state.state import State
+
+
+class SimplifiedEnv:
+    INVALID_INPUT_MESSAGE = (
+        "I don't understand your input. \n"
+        'If you want to execute code, please use <execute_ipython> YOUR_CODE_HERE </execute_ipython>.\n'
+        'If you want to give me an answer, please use <solution> YOUR_SOLUTION_HERE </solution>.\n'
+        'For example: The answer to the question is <solution> 42 </solution>. \n'
+    )
+
+    def __init__(self, agent_state: State, task: Task, task_config: Dict[str, int]):
+        self.agent_state = agent_state
+        self.task = task
+        self.task_state = TaskState()
+        self.task_config = task_config
+
+    def step(self, lm_message: str):
+        observation = self.handle_propose_solution(lm_message)
+
+        self.check_max_iteration()
+
+        turn_info = (
+            self.task_config['max_iterations'] - self.agent_state.iteration,
+            self.task_config['max_propose_solution']
+            - self.task_state.agent_action_count['propose_solution'],
+        )
+
+        output = StepOutput(
+            observation=observation,
+            success=self.task_state.success,
+            turn_info=turn_info,
+        )
+
+        self.log_output(output)
+        return self.task_state
+
+    def handle_propose_solution(self, lm_message) -> Optional[str]:
+        """Propose answer to check the task success.
+
+        It might set self.task_state.finished = True if the task is successful.
+        """
+        self.task_state.agent_action_count['propose_solution'] += 1
+        try:
+            parsed = self.parse_propose_solution(lm_message)
+            task_success = self.check_task_success(parsed['answer'])
+            if task_success:
+                self.task_state.finished = True
+                self.task_state.success = True
+                self.task_state.terminate_reason = 'task_success'
+                # NOTE: do not return from the function here, because we still need to log the output
+                # Setting task_state.finished = True will terminate the episode
+        except ParseError:
+            return SimplifiedEnv.INVALID_INPUT_MESSAGE
+        except Exception:
+            error_traceback = traceback.format_exc()
+            return f'{error_traceback}'
+
+    def parse_propose_solution(self, lm_message: str) -> dict:
+        """Define the parsing logic."""
+        lm_output = '\n' + lm_message + '\n'
+
+        answer = '\n'.join(
+            [
+                i.strip()
+                for i in re.findall(r'<solution>(.*?)</solution>', lm_output, re.DOTALL)
+            ]
+        )
+        if answer == '':
+            raise ParseError('No answer found.')
+
+        return {'answer': answer}
+
+    def log_output(self, output: StepOutput) -> None:
+        if self.task_state.finished:
+            return
+
+        content = output.to_str()
+        # self.state.history.append({"role": "user", "content": content})
+        self.task_state.latest_output = output.to_dict()
+        self.task_state.latest_output['content'] = content
+
+    def check_task_success(self, answer: str) -> bool:
+        # log_message.info(f"STUDENT ANSWER: [{answer}]")
+        # log_message.info(f"REFERENCE ANSWER: [{self.task.reference}]")
+        return self.task.success(answer)
+
+    def check_max_iteration(self):
+        """Check if the agent has reached the max iteration limit.
+
+        It might set self.task_state.finished = True if the agent has reached the max iteration limit.
+        """
+        if self.task_state.finished:
+            # ignore if the episode is already finished (e.g., task success)
+            return
+
+        if (
+            # number of proposed solutions >= max_propose_solution
+            self.task_state.agent_action_count['propose_solution']
+            >= self.task_config['max_propose_solution']
+        ):
+            self.task_state.finished = True
+            self.task_state.success = False
+            self.task_state.terminate_reason = 'max_propose_steps'
+        elif (
+            # total action count (propose_solution + use_tool + invalid_action) >= max_iterations
+            sum(self.task_state.agent_action_count.values())
+            >= self.task_config['max_iterations']
+        ):
+            self.task_state.finished = True
+            self.task_state.success = False
+            self.task_state.terminate_reason = 'max_iterations'
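Note on `SimplifiedEnv` above: two small pieces of logic drive the episode, namely extracting the answer from `<solution>` tags and deciding when to stop. The sketch below isolates both as pure functions, under the assumption that the counters and config use the same keys as in the diff. The names `extract_solution` and `terminate_reason` are hypothetical, and returning `None` stands in for the `ParseError` raised by the real code.

```python
import re
from typing import Dict, Optional


def extract_solution(message: str) -> Optional[str]:
    """Join the stripped contents of every <solution>...</solution> pair.

    Returns None when no non-empty answer is found (the env raises ParseError).
    """
    parts = re.findall(r'<solution>(.*?)</solution>', message, re.DOTALL)
    answer = '\n'.join(part.strip() for part in parts)
    return answer or None


def terminate_reason(
    action_count: Dict[str, int], task_config: Dict[str, int]
) -> Optional[str]:
    """Return why the episode should stop, or None if it may continue."""
    if action_count['propose_solution'] >= task_config['max_propose_solution']:
        return 'max_propose_steps'
    if sum(action_count.values()) >= task_config['max_iterations']:
        return 'max_iterations'
    return None
```

The proposal limit is checked first, so an agent that exhausts both budgets on the same turn is reported as `max_propose_steps`, matching the `if`/`elif` order in `check_max_iteration`.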
@@ -0,0 +1,48 @@
+Task:
+Evaluate $\int\int_C (x+y)^2 dx\,dy$, where C is the circle defined by $x^2 + y^2 = M$, and M is the birth year of Li Bai.
+Output format requirement: Round to three decimal places, using half up rounding rule.
+
+Assistant:
+<thought> To solve this problem, let's start by determining Li Bai's birth year. </thought>
+<execute_ipython> print(wikipedia_search('Li Bai')) </execute_ipython>
+
+Observation:
+Page: Li Bai
+Summary: Li Bai (Chinese: 李白; pinyin: Lǐ Bái, 701–762), also pronounced as Li Bo, courtesy name Taibai (Chinese: 太白), was a Chinese poet, acclaimed from his own time to the present as a brilliant and romantic figure who took traditional poetic forms to new heights.
+
+Assistant:
+<thought>
+From the wikipedia_search result, we can see that Li Bai was born in 701, thus M = 701.
+To evaluate the double integral of $(x+y)^2$ over the region $C$, where $C$ is the circle defined by $x^2 + y^2 = 701$, we should use polar coordinates.
+In polar coordinates, $x = r\cos(\theta)$ and $y = r\sin(\theta)$, and the differential area element $dx\,dy$ becomes $r\,dr\,d\theta$.
+$\int\int_C (x+y)^2 dx\,dy = \int_0^{2\pi}\int_0^{\sqrt{701}} (r\cos(\theta) + r\sin(\theta))^2 r\,dr\,d\theta$
+Let's use the SymPy package to calculate this.
+</thought>
+<execute_ipython>
+from sympy import symbols, integrate, cos, sin, pi
+
+r, theta = symbols('r theta')
+
+# the integrand
+integrand = (r*cos(theta) + r*sin(theta))**2 * r
+
+# perform the double integration
+double_integral = integrate(integrate(integrand, (r, 0, 701 ** 0.5)), (theta, 0, 2*pi))
+print(double_integral)
+</execute_ipython>
+
+Observation:
+Out[2]: 245700.5*pi
+
+Assistant:
+<thought> Let's round this result to three decimal places using half up rounding rule. </thought>
+<execute_ipython>
+import math
+print(round(245700.5*math.pi, 3))
+</execute_ipython>
+
+Observation:
+771890.886
+
+Assistant:
+The answer is <solution> 771890.886 </solution>.
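Note on the rounding step in the example above: the task asks for half-up rounding, while Python's built-in `round` works on binary floats and resolves exact ties to the nearest even digit, so the two can disagree on tie cases. A hedged sketch of strict half-up rounding with the standard `decimal` module follows. The helper name `round_half_up` is hypothetical and not part of the benchmark.

```python
from decimal import ROUND_HALF_UP, Decimal


def round_half_up(value, places: int = 3) -> Decimal:
    """Round to `places` decimals, with ties going away from zero."""
    # Decimal(1).scaleb(-3) is Decimal('0.001'), the quantum for 3 places.
    quantum = Decimal(1).scaleb(-places)
    # Going through str() uses the short decimal repr of a float input.
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
```

For the value in the transcript the two approaches agree, since it is not a tie case. The difference only shows up for inputs such as `2.675` rounded to two places.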
@@ -0,0 +1,25 @@
+import os
+
+from utils import load_file
+
+PROMPT_DIR = os.path.dirname(__file__)
+TEMPLATE_WITH_TOOL = load_file(os.path.join(PROMPT_DIR, 'template_with_tool.txt'))
+
+
+class PromptTemplate:
+    """A prompt template."""
+
+    def __init__(self, template: str):
+        self.template: str = template
+
+    def __call__(self, **kwargs) -> str:
+        return self.template.format(**kwargs)
+
+
+class ToolPromptTemplate(PromptTemplate):
+    def __init__(self, use_tool: bool):
+        if use_tool:
+            template = TEMPLATE_WITH_TOOL
+        else:
+            raise NotImplementedError('Evaluation without tool is not supported yet.')
+        super().__init__(template)
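Note on `PromptTemplate` above: it is a thin wrapper around `str.format`, so the placeholders in `template_with_tool.txt` are filled by keyword. The sketch below repeats the class with a shortened, hypothetical stand-in template (`DEMO_TEMPLATE` is not the real file) to show how a call assembles the final prompt.

```python
class PromptTemplate:
    """Same shape as the class in prompts.py: a thin wrapper around str.format."""

    def __init__(self, template: str):
        self.template = template

    def __call__(self, **kwargs) -> str:
        return self.template.format(**kwargs)


# Hypothetical, shortened stand-in for template_with_tool.txt.
DEMO_TEMPLATE = (
    'You have {max_total_steps} chances to interact with the environment. '
    'You can only propose a solution {max_propose_solution} times.\n'
    '---\n{in_context_example}\n---\n'
    '# Problem statement:\n{task_prompt}'
)

prompt = PromptTemplate(DEMO_TEMPLATE)(
    max_total_steps=5,
    max_propose_solution=2,
    # Braces inside a substituted value are copied verbatim, not re-formatted.
    in_context_example='Compute $x^{2}$ for x = 3.',
    task_prompt='Task:\nWhat is 1 + 1?',
)
```

One consequence of this design is that literal braces are safe inside the substituted values (such as LaTeX in the in-context example) but would need escaping as `{{` and `}}` if they appeared in the template text itself.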
@@ -0,0 +1,19 @@
+You are a helpful assistant assigned with the task of problem-solving.
+To solve the task, you can only interact with the interactive Python (Jupyter Notebook) environment using <execute_ipython> tag. Other tools cannot be used.
+At each turn, you should first provide your step-by-step thinking for solving the task. Your thought process should be enclosed using "<thought>" tag, for example: <thought> I need to print "Hello World!" </thought>.
+
+After that, you have two options:
+1) Interact with a Python programming environment and receive the corresponding output.
+2) Directly provide a solution by sending your answer to user through message that adheres to the required format for the given task. Your solution should be enclosed using "<solution>" tag, for example: The answer is <solution> A </solution>.
+Whether you choose to interact with the Python environment or provide a solution, you need to send a message to the user to evaluate your response and provide feedback.
+
+You have {max_total_steps} chances to interact with the environment or propose a solution. You can only propose a solution {max_propose_solution} times.
+
+---
+
+{in_context_example}
+
+---
+
+# Problem statement:
+{task_prompt}
@@ -0,0 +1,32 @@
+pre-commit
+openai
+datasets
+backoff
+charset-normalizer==3.1.0
+# Alfworld
+pandas==1.4.4
+opencv-python
+networkx
+tqdm
+vocab
+revtok
+Click
+ai2thor==2.1.0
+transformers
+tokenizers
+scipy==1.10.1
+ipython
+matplotlib
+cython
+nltk
+gym==0.15.4
+pipreqs
+pyyaml
+pytz
+visdom
+sympy
+pycocotools
+seaborn
+google-generativeai
+python-dateutil
+statsmodels
@@ -0,0 +1,357 @@
+import asyncio
+import functools
+import json
+import logging
+import multiprocessing as mp
+import os
+import pathlib
+import subprocess
+import time
+from concurrent.futures import ProcessPoolExecutor
+from typing import Dict
+
+from datasets import load_dataset
+from datatypes import TaskState
+from env import SimplifiedEnv
+from prompts import ToolPromptTemplate
+from task import ReasoningTask, Task
+from tqdm import tqdm
+
+from evaluation.swe_bench.swe_env_box import DockerSSHBox
+from opendevin.controller.state.state import State
+from opendevin.core.config import config, get_llm_config_arg, get_parser
+from opendevin.core.logger import get_console_handler
+from opendevin.core.logger import opendevin_logger as logger
+from opendevin.core.main import main
+from opendevin.events.serialization.event import event_to_dict
+
+
+def cleanup():
+    print('Cleaning up child processes...')
+    for process in mp.active_children():
+        print(f'Terminating child process: {process.name}')
+        process.terminate()
+        process.join()
+
+
+def codeact_user_response(state: State, task: Task, task_config: Dict[str, int]):
+    logger.info(f'Gold reference: {task.reference}')
+    logger.info(f'Task config: {task_config}')
+
+    env = SimplifiedEnv(
+        agent_state=state,
+        task=task,
+        task_config=task_config,
+    )
+    last_action, _ = state.history[-1]
+    result_state: TaskState = env.step(last_action.message)
+    state.task_state = result_state
+
+    if not result_state.latest_output:
+        if result_state.success:
+            msg = 'Your answer is correct. Please EXIT using the following command: <execute_bash> exit </execute_bash>.'
+        else:
+            msg = 'Something went wrong! No output from the model.'
+    else:
+        msg = result_state.latest_output['content']
+
+    logger.info('User response:' + msg)
+    return msg
+
+
+def monologue_user_response(state: State) -> str:
+    raise NotImplementedError('MonologueAgent should never ask for user responses.')
+
+
+AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
+    'CodeActAgent': codeact_user_response,
+    'MonologueAgent': monologue_user_response,
+}
+
+AGENT_CLS_TO_INST_SUFFIX = {
+    'CodeActAgent': '\nIMPORTANT: When your answer is confirmed by the user to be correct, you can exit using the following command: <execute_bash> exit </execute_bash>.\n'
+}
+
+
+def process_instance(
+    instance: Task,
+    agent_class,
+    metadata,
+    skip_workspace_mount,
+    eval_output_dir,
+    reset_logger: bool = True,
+):
+    workspace_mount_path = os.path.join(config.workspace_mount_path, '_eval_workspace')
+    # create process-specific workspace dir
+    # if `not skip_workspace_mount` - we will create a workspace directory for EACH process
+    # so that different agents don't interfere with each other.
+    if not skip_workspace_mount:
+        workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
+        pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
+
+    # Set up the logger properly, so you can run multi-processing to parallelize the evaluation
+    if reset_logger:
+        # Set up logger
+        log_file = os.path.join(
+            eval_output_dir, 'logs', f'instance_{instance.task_id}.log'
+        )
+        # Remove all existing handlers from logger
+        for handler in logger.handlers[:]:
+            logger.removeHandler(handler)
+        # add back the console handler to print ONE line
+        logger.addHandler(get_console_handler())
+        logger.info(
+            f'Starting evaluation for instance {instance.task_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
+        )
+        # Remove all existing handlers from logger
+        for handler in logger.handlers[:]:
+            logger.removeHandler(handler)
+        file_handler = logging.FileHandler(log_file)
+        file_handler.setFormatter(
+            logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
+        )
+        logger.addHandler(file_handler)
+
+    if not skip_workspace_mount:
+        logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
+
+    sandbox = DockerSSHBox()
+
+    requirements_host_src = 'evaluation/mint/requirements.txt'
+    requirements_sandbox_dest = '/opendevin/plugins/mint/requirements.txt'
+    sandbox.copy_to(
+        host_src=requirements_host_src,
+        sandbox_dest=requirements_sandbox_dest,
+        recursive=False,
+    )
+    logger.info(
+        f'Copied files from [{requirements_host_src}] to [{requirements_sandbox_dest}] inside sandbox.'
+    )
+    exit_code, output = sandbox.execute(f'pip install -r {requirements_sandbox_dest}')
+
+    # Prepare instruction
+    instruction = ToolPromptTemplate(use_tool=True)(
+        max_total_steps=metadata['max_iterations'],
+        max_propose_solution=metadata['max_propose_solution'],
+        in_context_example=instance.in_context_example(
+            use_tool=True, with_feedback=False
+        ),
+        task_prompt='Task:\n' + instance.prompt,
+    )
+    instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you or provide the solution inside <solution> tag AND NEVER ASK FOR HUMAN HELP.\n'
+
+    # NOTE: You can actually set slightly different instruction for different agents
+    instruction += AGENT_CLS_TO_INST_SUFFIX.get(agent_class, '')
+
+    # Here's how you can run the agent (similar to the `main` function) and get the final task state
+    fake_user_response_fn = functools.partial(
+        AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(agent_class),
+        task=instance,
+        task_config={
+            'max_iterations': metadata['max_iterations'],
+            'max_propose_solution': metadata['max_propose_solution'],
+        },
+    )
+
+    state: State = asyncio.run(
+        main(
+            instruction,
+            fake_user_response_fn=fake_user_response_fn,
+            sandbox=sandbox,
+        )
+    )
+
+    if state is None:
+        raise ValueError('State should not be None.')
+
+    logger.info('Msgs: ' + str(state.history))
+
+    task_state: TaskState = state.task_state
+    logger.info('Task state: ' + str(task_state.to_dict()))
+
+    # Save the output
+    output = {
+        'id': instance.task_id,
+        'instance': instance.to_dict(),
+        'instruction': instruction,
+        'metadata': metadata,
+        'history': [
+            (event_to_dict(action), event_to_dict(obs)) for action, obs in state.history
+        ],
+        'error': state.error if state and state.error else None,
+        'test_result': task_state.success,
+    }
+
+    # Close the sandbox
+    sandbox.close()
+
+    return output
+
+
+if __name__ == '__main__':
+    parser = get_parser()
+
+    parser.add_argument(
+        '--subset',
+        default='math',
+        choices=['math', 'gsm8k'],
+        type=str,
+        help='subset of the dataset to be used',
+    )
+    parser.add_argument(
+        '--max-propose-solution',
+        default=2,
+        type=int,
+        help='maximum number of times the agent can propose a solution',
+    )
+
+    args, _ = parser.parse_known_args()
+
+    # NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
+    # so we don't need to manage file uploading to OpenDevin's repo
+    mint_dataset = load_dataset(
+        'ryanhoangt/xingyaoww-mint-bench', name=args.subset, split='test'
+    )
+    logger.info(f'Evaluating MINT - {args.subset} subset')
+
+    # Check https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/README.md#configure-opendevin-and-your-llm
+    # for details of how to set `llm_config`
+    if args.llm_config:
+        specified_llm_config = get_llm_config_arg(args.llm_config)
+        if specified_llm_config:
+            config.llm = specified_llm_config
+    logger.info(f'Config for evaluation: {config}')
+
+    # TEST METADATA
+    agent_class = args.agent_cls
+    assert (
+        agent_class in AGENT_CLS_TO_FAKE_USER_RESPONSE_FN
+    ), f'Unsupported agent class: {agent_class}'
+    model_name = config.llm.model.split('/')[-1]
+    max_iterations = args.max_iterations
+    eval_note = ''
+    if args.eval_note is not None:
+        eval_note += '_N_' + args.eval_note
+    eval_output_dir = os.path.join(
+        args.eval_output_dir,
+        'mint',
+        agent_class,
+        model_name + '_maxiter_' + str(max_iterations) + eval_note,
+        args.subset,
+    )
+
+    pathlib.Path(eval_output_dir).mkdir(parents=True, exist_ok=True)
+    pathlib.Path(os.path.join(eval_output_dir, 'logs')).mkdir(
+        parents=True, exist_ok=True
+    )
+    logger.info(f'Using evaluation output directory: {eval_output_dir}')
+
+    metadata = {
+        'agent_class': agent_class,
+        'model_name': model_name,
+        'max_iterations': max_iterations,
+        'max_propose_solution': args.max_propose_solution,
+        'eval_output_dir': eval_output_dir,
+        'start_time': time.strftime('%Y-%m-%d %H:%M:%S'),
+        # get the commit id of the current repo for reproducibility
+        'git_commit': subprocess.check_output(['git', 'rev-parse', 'HEAD'])
+        .decode('utf-8')
+        .strip(),
+    }
+    logger.info(f'Metadata: {metadata}')
+    with open(os.path.join(eval_output_dir, 'metadata.json'), 'w') as f:
+        json.dump(metadata, f)
+
+    # LIMIT EVALUATION
+    eval_n_limit = args.eval_n_limit
+    if eval_n_limit:
+        mint_dataset = mint_dataset.select(range(eval_n_limit))
+        logger.info(f'Limiting evaluation to first {eval_n_limit} instances.')
+
+    # OUTPUT FILE
+    output_file = os.path.join(eval_output_dir, 'output.jsonl')
+    logger.info(f'Writing evaluation output to {output_file}')
+    finished_instance_ids = set()
+    if os.path.exists(output_file):
+        with open(output_file, 'r') as f:
+            for line in f:
+                data = json.loads(line)
+                finished_instance_ids.add(data['id'])
+        logger.warning(
+            f'Output file {output_file} already exists. Loaded {len(finished_instance_ids)} finished instances.'
+        )
+    output_fp = open(output_file, 'a')
+
+    logger.info(
+        f'Evaluation started with Agent {agent_class}, model {model_name}, max iterations {max_iterations}, max propose solution {args.max_propose_solution}.'
+    )
+
+    # =============================================
+    # filter out finished instances
+    task_class = ReasoningTask
+    new_mint_tests: list[ReasoningTask] = []
+    for instance in mint_dataset:
+        if instance['id'] in finished_instance_ids:
+            logger.info(
+                f'Skipping instance {instance["id"]} as it is already finished.'
+            )
+            continue
+        # convert to Task object
+        instance = ReasoningTask(**instance)
+        new_mint_tests.append(instance)
+
+    mint_dataset = new_mint_tests
+    logger.info(
+        f'Finished instances: {len(finished_instance_ids)}, Remaining instances: {len(mint_dataset)}'
+    )
+    # =============================================
+
+    pbar = tqdm(total=len(mint_dataset))
+
+    # This function tracks progress AND writes the output to a JSONL file
+    def update_progress(future):
+        pbar.update(1)
+        output = future.result()
+        # logger.info('Output: ', output)
+        # pbar.set_description(f'Instance {output["instance_id"]}')
+        # pbar.set_postfix_str(f'Test Result: {output["test_result"]["result"]}')
+        # logger.info(
+        #     f'Finished evaluation for instance {output["instance_id"]}: {output["test_result"]["result"]}'
+        # )
+        output_fp.write(json.dumps(output) + '\n')
+        output_fp.flush()
+
+    # This sets the multi-processing
+    num_workers = args.eval_num_workers
+    logger.info(f'Using {num_workers} workers for evaluation.')
+
+    # This is SWE-Bench specific - CodeActAgent doesn't require mounted workspace to work
+    skip_workspace_mount = agent_class == 'CodeActAgent'
+    logger.info(f'Skipping workspace mount: {skip_workspace_mount}')
+
+    try:
+        with ProcessPoolExecutor(num_workers) as executor:
+            futures = []
+            # This is how we perform multi-processing
+            for instance in mint_dataset:
+                future = executor.submit(
+                    process_instance,
+                    instance,
+                    agent_class,
+                    metadata,
+                    skip_workspace_mount,
+                    eval_output_dir,
+                    reset_logger=bool(num_workers > 1),
+                )
+                future.add_done_callback(update_progress)
+                futures.append(future)
+
+            # Wait for all futures to complete
+            for future in futures:
+                future.result()
+    except KeyboardInterrupt:
+        print('KeyboardInterrupt received. Cleaning up...')
+        cleanup()
+
+    output_fp.close()
+    logger.info('Evaluation finished.')
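The submit-then-callback pattern used above can be sketched in isolation. This is a minimal illustration, not the evaluation code itself: `process_item` and `run_all` are hypothetical names, and `ThreadPoolExecutor` is swapped in for `ProcessPoolExecutor` so the sketch runs anywhere without pickling concerns. The callback appends one JSON line per finished future, as `update_progress` does.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor


def process_item(item_id: int) -> dict:
    """Hypothetical stand-in for process_instance; returns a JSON-serializable record."""
    return {'id': item_id, 'result': item_id * item_id}


def run_all(item_ids: list[int], output_fp, num_workers: int = 2) -> None:
    """Submit every item and append each result to output_fp as one JSON line."""
    write_lock = threading.Lock()

    def write_result(future) -> None:
        # Called once per completed future; result() re-raises worker exceptions.
        record = future.result()
        with write_lock:
            output_fp.write(json.dumps(record) + '\n')
            output_fp.flush()

    with ThreadPoolExecutor(num_workers) as executor:
        futures = []
        for item_id in item_ids:
            future = executor.submit(process_item, item_id)
            future.add_done_callback(write_result)
            futures.append(future)
        # Block until every task has finished.
        for future in futures:
            future.result()
```

Results are written in completion order, so the output lines are not guaranteed to follow submission order.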
@@ -0,0 +1,37 @@
+#!/bin/bash
+
+MODEL_CONFIG=$1
+SUBSET=$2
+EVAL_LIMIT=$3
+# Only 'CodeActAgent' is supported for MINT now
+AGENT="CodeActAgent"
+
+# We need to track the version of Agent in the evaluation to make sure results are comparable
+AGENT_VERSION=v$(poetry run python -c "import agenthub; from opendevin.controller.agent import Agent; print(Agent.get_cls('$AGENT').VERSION)")
+
+echo "AGENT: $AGENT"
+echo "AGENT_VERSION: $AGENT_VERSION"
+
+export PYTHONPATH=$(pwd)
+
+COMMAND="poetry run python ./evaluation/mint/run_infer.py \
+    --max-iterations 5 \
+    --max-propose-solution 2 \
+    --eval-note $AGENT_VERSION"
+
+if [ -n "$SUBSET" ]; then
+  echo "SUBSET: $SUBSET"
+  COMMAND="$COMMAND --subset $SUBSET"
+# otherwise default to use the math subset
+else
+  echo "SUBSET: math"
+  COMMAND="$COMMAND --subset math"
+fi
+
+if [ -n "$EVAL_LIMIT" ]; then
+  echo "EVAL_LIMIT: $EVAL_LIMIT"
+  COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
+fi
+
+# Run the command
+eval $COMMAND
@@ -0,0 +1,121 @@
+import json
+import logging
+import os
+from abc import ABC, abstractmethod
+from typing import List, Optional, Tuple
+
+from utils import load_file
+
+LOGGER = logging.getLogger('MINT')
+
+
+class Task(ABC):
+    """Base class for a task instance."""
+
+    task_name: str = 'base'
+    in_context_example_dir = os.path.join(
+        os.path.dirname(os.path.abspath(__file__)),
+        'in_context_examples',
+    )
+
+    def __init__(self, **kwargs) -> None:
+        if 'loaded_history' in kwargs:
+            self.loaded_history = kwargs['loaded_history']
+        else:
+            self.loaded_history = None
+        # pre-load the in-context example
+        task_dir = os.path.join(self.in_context_example_dir, self.task_name)
+        self._in_context_example = {
+            'with_tool': load_file(os.path.join(task_dir, 'with_tool.txt')),
+        }
+        self.metadata = {}
+
+    @property
+    def task_id(self) -> str:
+        """Return the task id."""
+        assert hasattr(self, '_id'), 'Task does not have an id.'
+        return self._id
+
+    def in_context_example(
+        self, use_tool: bool = True, with_feedback: bool = False
+    ) -> str:
+        """Return the in-context example for the task."""
+        if use_tool and not with_feedback:
+            return self._in_context_example['with_tool']
+        else:
+            raise NotImplementedError
+
+    @property
+    def prompt(self) -> str:
+        """Return the task prompt."""
+        assert hasattr(self, '_prompt'), 'Task does not have a prompt.'
+        return self._prompt
+
+    @property
+    def reference(self) -> str:
+        """Return the reference solution for the task."""
+        assert hasattr(self, '_reference'), 'Task does not have a reference solution.'
+        return self._reference
+
+    @abstractmethod
+    def extract_answer(self, solution: str) -> Optional[str]:
+        """Extract the answer from the given solution."""
+        pass
+
+    @abstractmethod
+    def success(self, solution: str) -> bool:
+        """This checks whether the given solution can complete the current task.
+
+        Can be used to provide binary feedback.
+        """
+        answer = self.extract_answer(solution)
+        return answer == self.reference
+
+    @classmethod
+    def load_tasks(cls, path: str) -> Tuple[List['Task'], int]:
+        """Load all the tasks from a given jsonl file."""
+        assert path.endswith('.jsonl') or path.endswith('.json')
+        with open(path, 'r') as f:
+            tasks = [cls(**json.loads(line)) for line in f.readlines()]
+        LOGGER.info(f'Loaded {len(tasks)} tasks from {path}')
+        return tasks, len(tasks)
+
+    def to_dict(self) -> dict:
+        """Convert the task to a dictionary."""
+        return {
+            'task_name': self.task_name,
+            'task_id': self.task_id,
+            'prompt': self.prompt,
+            'reference': self.reference,
+            'metadata': self.metadata,
+        }
+
+
+class ReasoningTask(Task):
+    task_name = 'reasoning'
+
+    def __init__(self, id: str, prompt: str, reference: str, **kwargs):
+        super().__init__(**kwargs)
+        self._id = id
+        self._prompt = prompt.strip()
+        self._reference = str(reference).strip().lower()
+
+    def extract_answer(self, solution: str) -> Optional[str]:
+        """Extract the answer from the given solution."""
+        return solution.lower().strip()
+
+    def compare_w_digits(self, reference: str, answer: str) -> bool:
+        """Compare the reference and answer with digits."""
+        # if reference and answer can both be converted to floats by float()
+        try:
+            float(reference)
+            float(answer)
+            return abs(float(reference) - float(answer)) <= 0.05 * abs(float(reference))
+        except ValueError:
+            return reference in answer
+        except Exception:
+            raise ValueError(f'Cannot compare {reference} and {answer}')
+
+    def success(self, solution: str) -> bool:
+        answer = self.extract_answer(solution)
+        return self.compare_w_digits(self._reference, answer)
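The comparison rule in `compare_w_digits` can be illustrated as a standalone function. This is a sketch under assumptions: `within_tolerance` and its `rel_tol` parameter are hypothetical names, with the 5% relative tolerance taken from the code above. Numeric strings match when they differ by at most that fraction of the reference; anything else falls back to substring containment.

```python
def within_tolerance(reference: str, answer: str, rel_tol: float = 0.05) -> bool:
    """Numeric match within rel_tol of the reference, else substring match."""
    try:
        ref_value = float(reference)
        ans_value = float(answer)
    except ValueError:
        # At least one side is not numeric: fall back to substring containment.
        return reference in answer
    return abs(ref_value - ans_value) <= rel_tol * abs(ref_value)
```

Note that a reference of zero only matches an answer that is exactly zero, because the allowed difference scales with the reference value.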
@@ -0,0 +1,10 @@
+import functools
+
+
+# use a cache to avoid loading the same file multiple times,
+# which can lead to a "too many open files" error
+@functools.lru_cache(maxsize=128)
+def load_file(filepath: str) -> str:
+    with open(filepath, 'r') as f:
+        content = f.read()
+    return content
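One consequence of caching `load_file` is that a file changed on disk after its first read is not re-read. The sketch below reproduces the helper and demonstrates this; `demo` is a hypothetical function added only for illustration.

```python
import functools
import os
import tempfile


@functools.lru_cache(maxsize=128)
def load_file(filepath: str) -> str:
    with open(filepath, 'r') as f:
        return f.read()


def demo() -> tuple[str, str, int, int]:
    """Read a file twice around a rewrite and report cache hits and misses."""
    load_file.cache_clear()
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, 'example.txt')
        with open(path, 'w') as f:
            f.write('first')
        before = load_file(path)  # cache miss: reads from disk
        with open(path, 'w') as f:
            f.write('second')
        after = load_file(path)  # cache hit: returns the earlier content
    info = load_file.cache_info()
    return before, after, info.hits, info.misses
```

This is fine for static in-context example files, which is how the helper is used here; calling `load_file.cache_clear()` forces a fresh read if it is ever needed.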
@@ -116,9 +116,11 @@ selected_ids = ['sphinx-doc__sphinx-8721', 'sympy__sympy-14774', 'scikit-learn__
 Then only these tasks (rows whose `instance_id` is in the above list) will be evaluated.
 In this case, `eval_limit` option applies to tasks that are in the `selected_ids` list.

+After running the inference, you will obtain an `output.jsonl` file (by default it will be saved to `evaluation/evaluation_outputs`).
+
 ## Evaluate Generated Patches

-After running the inference described in the previous section, you will obtain a `output.jsonl` (by default it will save to `evaluation/evaluation_outputs`). Then you can run this one line script to evaluate generated patches, and produce a fine-grained report:
+With the `output.jsonl` file, you can run `eval_infer.sh` to evaluate the generated patches and produce a fine-grained report.

 If you want to evaluate existing results, you should first run this to clone existing outputs

@@ -185,6 +187,15 @@ It will contains an additional field `fine_grained_report` (see example below) c

 Please refer to [EVAL_PATCH.md](./EVAL_PATCH.md) if you want to learn more about how to evaluate patches that are already generated (e.g., not by OpenDevin).

+## View Result Summary
+
+If you just want to know the resolve rate and/or a summary of which tests pass and which don't, you can run:
+
+```bash
+poetry run python ./evaluation/swe_bench/scripts/summarise_results.py <path_to_output_merged_jsonl_file>
+# e.g. poetry run python ./evaluation/swe_bench/scripts/summarise_results.py ./evaluation/evaluation_outputs/outputs/swe_bench_lite/CodeActSWEAgent/gpt-4o-2024-05-13_maxiter_50_N_v1.5-no-hint/output.merged.jsonl
+```
+
 ## Submit your evaluation results

 You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
@@ -62,11 +62,13 @@ def monologue_user_response(state: State) -> str:

 AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
    'CodeActAgent': codeact_user_response,
+    'CodeActSWEAgent': codeact_user_response,
    'MonologueAgent': monologue_user_response,
 }

 AGENT_CLS_TO_INST_SUFFIX = {
-    'CodeActAgent': 'When you think you have fixed the issue through code changes, please run the following command: <execute_bash> exit </execute_bash>.\n'
+    'CodeActAgent': 'When you think you have fixed the issue through code changes, please run the following command: <execute_bash> exit </execute_bash>.\n',
+    'CodeActSWEAgent': 'When you think you have fixed the issue through code changes, please run the following command: <execute_bash> exit </execute_bash>.\n',
 }


@@ -243,19 +245,62 @@ def process_instance(
    )

    # Prepare instruction
-    instruction = (
-        f'Please fix the following issue for the repository in /workspace/{workspace_dir_name}.\n'
-        'Environment has been set up for you to start working. You may assume all necessary tools are installed.\n\n'
-        '# Problem Statement\n'
-        f'{instance.problem_statement}\n\n'
-    )
-    if USE_HINT_TEXT and instance.hints_text:
-        instruction += f'# Hints\n{instance.hints_text}\n\n'
-    instruction += (
-        'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
-        'You should NOT modify any existing test case files. If needed, you can add new test cases in a NEW file to reproduce the issue.\n'
-        'You SHOULD INCLUDE PROPER INDENTATION in your edit commands.\n'
-    )
+    if agent_class == 'CodeActSWEAgent':
+        instruction = (
+            'We are currently solving the following issue within our repository. Here is the issue text:\n'
+            '--- BEGIN ISSUE ---\n'
+            f'{instance.problem_statement}\n'
+            '--- END ISSUE ---\n\n'
+        )
+
+        if USE_HINT_TEXT and instance.hints_text:
+            instruction += (
+                f'--- BEGIN HINTS ---\n{instance.hints_text}\n--- END HINTS ---\n'
+            )
+        instruction += f"""Now, you're going to solve this issue on your own. Your terminal session has started and you're in the repository's root directory. You can use any bash commands or the special interface to help you. Edit all the files you need to and run any checks or tests that you want.
+Remember, YOU CAN ONLY ENTER ONE COMMAND AT A TIME. You should always wait for feedback after every command.
+When you're satisfied with all of the changes you've made, you can run the following command: <execute_bash> exit </execute_bash>.
+Note however that you cannot use any interactive session commands (e.g. vim) in this environment, but you can write scripts and run them. E.g. you can write a python script and then run it with `python <script_name>.py`.
+
+NOTE ABOUT THE EDIT COMMAND: Indentation really matters! When editing a file, make sure to insert appropriate indentation before each line!
+
+IMPORTANT TIPS:
+1. Always start by trying to replicate the bug that the issue discusses.
+    If the issue includes code for reproducing the bug, we recommend that you re-implement that in your environment, and run it to make sure you can reproduce the bug.
+    Then start trying to fix it.
+    When you think you've fixed the bug, re-run the bug reproduction script to make sure that the bug has indeed been fixed.
+
+    If the bug reproduction script does not print anything when it successfully runs, we recommend adding a print("Script completed successfully, no errors.") command at the end of the file,
+    so that you can be sure that the script indeed ran fine all the way through.
+
+2. If you run a command and it doesn't work, try running a different command. A command that did not work once will not work the second time unless you modify it!
+
+3. If you open a file and need to get to an area around a specific line that is not in the first 100 lines, say line 583, don't just use the scroll_down command multiple times. Instead, use the goto 583 command. It's much quicker.
+
+4. If the bug reproduction script requires inputting/reading a specific file, such as buggy-input.png, and you'd like to understand how to input that file, conduct a search in the existing repo code, to see whether someone else has already done that. Do this by running the command: find_file("buggy-input.png") If that doesn't work, use the linux 'find' command.
+
+5. Always make sure to look at the currently open file and the current working directory (which appears right after the currently open file). The currently open file might be in a different directory than the working directory! Note that some commands, such as 'create', open files, so they might change the currently open file.
+
+6. When editing files, it is easy to accidentally specify a wrong line number or to write code with incorrect indentation. Always check the code after you issue an edit to make sure that it reflects what you wanted to accomplish. If it didn't, issue another command to fix it.
+
+[Current directory: /workspace/{workspace_dir_name}]
+"""
+    else:
+        # Testing general agents
+        instruction = (
+            f'Please fix the following issue for the repository in /workspace/{workspace_dir_name}.\n'
+            'Environment has been set up for you to start working. You may assume all necessary tools are installed.\n\n'
+            '# Problem Statement\n'
+            f'{instance.problem_statement}\n\n'
+        )
+        if USE_HINT_TEXT and instance.hints_text:
+            instruction += f'# Hints\n{instance.hints_text}\n\n'
+        instruction += (
+            'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
+            'You should NOT modify any existing test case files. If needed, you can add new test cases in a NEW file to reproduce the issue.\n'
+            'You SHOULD INCLUDE PROPER INDENTATION in your edit commands.\n'
+        )
+
    # NOTE: You can actually set slightly different instruction for different agents
    instruction += AGENT_CLS_TO_INST_SUFFIX.get(agent_class, '')

@@ -370,6 +415,11 @@ if __name__ == '__main__':
        .decode('utf-8')
        .strip(),
    }
+    _agent_cls = agenthub.Agent.get_cls(agent_class)
+    if hasattr(_agent_cls, 'system_message'):
+        metadata['system_message'] = _agent_cls.system_message
+    if hasattr(_agent_cls, 'in_context_example'):
+        metadata['in_context_example'] = _agent_cls.in_context_example
    logger.info(f'Metadata: {metadata}')
    with open(os.path.join(eval_output_dir, 'metadata.json'), 'w') as f:
        json.dump(metadata, f)
@@ -2,12 +2,18 @@
 MODEL_CONFIG=$1
 AGENT=$2
 EVAL_LIMIT=$3
+MAX_ITER=$4

 if [ -z "$AGENT" ]; then
  echo "Agent not specified, use default CodeActAgent"
  AGENT="CodeActAgent"
 fi

+if [ -z "$MAX_ITER" ]; then
+  echo "MAX_ITER not specified, use default 30"
+  MAX_ITER=30
+fi
+
 # IMPORTANT: Because Agent's prompt changes fairly often in the rapidly evolving codebase of OpenDevin
 # We need to track the version of Agent in the evaluation to make sure results are comparable
 AGENT_VERSION=v$(poetry run python -c "import agenthub; from opendevin.controller.agent import Agent; print(Agent.get_cls('$AGENT').VERSION)")
@@ -32,7 +38,7 @@ unset SANDBOX_ENV_GITHUB_TOKEN # prevent the agent from using the github token t
 COMMAND="poetry run python evaluation/swe_bench/run_infer.py \
  --agent-cls $AGENT \
  --llm-config $MODEL_CONFIG \
-  --max-iterations 30 \
+  --max-iterations $MAX_ITER \
  --max-chars 10000000 \
  --eval-num-workers 8 \
  --eval-note $EVAL_NOTE"
@@ -0,0 +1,39 @@
+import json
+import sys
+
+
+def extract_test_results(json_file_path):
+    passed_tests = []
+    failed_tests = []
+    with open(json_file_path, 'r') as file:
+        for line in file:
+            data = json.loads(line.strip())
+            instance_id = data['instance_id']
+            resolved = False
+            if 'fine_grained_report' in data:
+                resolved = data['fine_grained_report']['resolved']
+            else:
+                resolved = data['test_result']['result']['resolved']
+            if resolved:
+                passed_tests.append(instance_id)
+            else:
+                failed_tests.append(instance_id)
+    return passed_tests, failed_tests
+
+
+if __name__ == '__main__':
+    if len(sys.argv) != 2:
+        print(
+            'Usage: poetry run python summarise_results.py <path_to_output_merged_jsonl_file>'
+        )
+        sys.exit(1)
+    json_file_path = sys.argv[1]
+    passed_tests, failed_tests = extract_test_results(json_file_path)
+    succ_rate = len(passed_tests) / max(len(passed_tests) + len(failed_tests), 1)
+    print(
+        f'\nPassed {len(passed_tests)} tests, failed {len(failed_tests)} tests, resolve rate = {succ_rate}'
+    )
+    print('PASSED TESTS:')
+    print(passed_tests)
+    print('FAILED TESTS:')
+    print(failed_tests)
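The resolve-rate calculation can also be expressed over any iterable of JSONL lines, which makes it easy to check without a file on disk. This is a sketch, assuming the same two record layouts the script reads (`fine_grained_report.resolved`, otherwise `test_result.result.resolved`); `resolve_rate` is a hypothetical helper and not part of the script.

```python
import json
from typing import Iterable


def resolve_rate(lines: Iterable[str]) -> float:
    """Return the fraction of resolved instances, or 0.0 when there are none."""
    resolved_flags = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        data = json.loads(line)
        if 'fine_grained_report' in data:
            resolved_flags.append(bool(data['fine_grained_report']['resolved']))
        else:
            resolved_flags.append(bool(data['test_result']['result']['resolved']))
    if not resolved_flags:
        return 0.0
    return sum(resolved_flags) / len(resolved_flags)
```

Returning 0.0 for empty input is a design choice made for this sketch so that an empty results file does not raise `ZeroDivisionError`.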
@@ -25,12 +25,14 @@ class SWEBenchSSHBox(DockerSSHBox):
        swe_instance: dict | None = None,
        skip_workspace_mount: bool = True,
        sandbox_plugins: list[PluginRequirement] = [],  # noqa: B006
+        workspace_dir_name: str | None = None,
    ):
        if swe_instance_id is None:
            raise ValueError('swe_instance_id must be provided!')
        self.swe_instance_id = swe_instance_id
        self.swe_instance = swe_instance
        self.skip_workspace_mount = skip_workspace_mount
+        self.workspace_dir_name = workspace_dir_name

        assert (
            container_image is not None
@@ -94,6 +96,7 @@ class SWEBenchSSHBox(DockerSSHBox):
            swe_instance=instance,
            skip_workspace_mount=skip_workspace_mount,
            sandbox_plugins=sandbox_plugins,
+            workspace_dir_name=workspace_dir_name,
        )
        logger.info(f"SSH box started for instance {instance['instance_id']}.")

@@ -123,7 +126,13 @@ class SWEBenchSSHBox(DockerSSHBox):

    def get_diff_patch(self):
        # add everything to the index
-        exit_code, output = self.execute('git add --all')
+        exit_code, output = self.execute(f'cd /workspace/{self.workspace_dir_name}')
+        if exit_code != 0:
+            logger.error('Failed to cd to the repo')
+            return ''
+
+        # add everything to the index
+        exit_code, output = self.execute('git add -A')
        if exit_code != 0:
            logger.error('Failed to add everything to the index')
            return ''
@@ -44,6 +44,7 @@
        }],
        // For https://stackoverflow.com/questions/55844608/stuck-with-eslint-error-i-e-separately-loops-should-be-avoided-in-favor-of-arra
        "no-restricted-syntax": "off",
+        "react/require-default-props": "off",
        "import/prefer-default-export": "off",
        "no-underscore-dangle": "off",
        "jsx-a11y/no-static-element-interactions": "off",
@@ -8,15 +8,15 @@
  },
  "dependencies": {
    "@monaco-editor/react": "^4.6.0",
-    "@nextui-org/react": "^2.3.6",
+    "@nextui-org/react": "^2.4.1",
    "@react-types/shared": "^3.23.1",
    "@reduxjs/toolkit": "^2.2.5",
-    "@vitejs/plugin-react": "^4.2.1",
+    "@vitejs/plugin-react": "^4.3.0",
    "@xterm/addon-fit": "^0.10.0",
    "@xterm/xterm": "^5.4.0",
    "clsx": "^2.1.1",
    "eslint-config-airbnb-typescript": "^18.0.0",
-    "framer-motion": "^11.2.6",
+    "framer-motion": "^11.2.10",
    "i18next": "^23.11.5",
    "i18next-browser-languagedetector": "^8.0.0",
    "i18next-http-backend": "^2.5.2",
@@ -33,7 +33,7 @@
    "react-router-dom": "^6.23.1",
    "react-syntax-highlighter": "^15.5.0",
    "tailwind-merge": "^2.3.0",
-    "vite": "^5.2.11",
+    "vite": "^5.2.12",
    "web-vitals": "^3.5.2"
  },
  "scripts": {
@@ -62,14 +62,14 @@
    "@tailwindcss/typography": "^0.5.13",
    "@testing-library/jest-dom": "^6.4.5",
    "@testing-library/react": "^15.0.7",
-    "@testing-library/user-event": "^13.5.0",
-    "@types/node": "^18.0.0 ",
+    "@testing-library/user-event": "^14.5.2",
+    "@types/node": "^20.12.13",
    "@types/react": "^18.3.3",
    "@types/react-dom": "^18.3.0",
    "@types/react-highlight": "^0.12.8",
    "@types/react-syntax-highlighter": "^15.5.13",
-    "@typescript-eslint/eslint-plugin": "^7.10.0",
-    "@typescript-eslint/parser": "^7.10.0",
+    "@typescript-eslint/eslint-plugin": "^7.11.0",
+    "@typescript-eslint/parser": "^7.11.0",
    "autoprefixer": "^10.4.19",
    "eslint": "^8.57.0",
    "eslint-config-airbnb": "^19.0.4",
@@ -78,15 +78,15 @@
    "eslint-plugin-import": "^2.29.1",
    "eslint-plugin-jsx-a11y": "^6.8.0",
    "eslint-plugin-prettier": "^5.1.3",
-    "eslint-plugin-react": "^7.34.1",
+    "eslint-plugin-react": "^7.34.2",
    "eslint-plugin-react-hooks": "^4.6.2",
    "husky": "^9.0.11",
-    "jsdom": "^24.0.0",
-    "lint-staged": "^15.2.4",
+    "jsdom": "^24.1.0",
+    "lint-staged": "^15.2.5",
    "postcss": "^8.4.38",
    "prettier": "^3.2.5",
    "tailwindcss": "^3.4.2",
-    "typescript": "^5.4.3",
+    "typescript": "^5.4.5",
    "vite-tsconfig-paths": "^4.3.2",
    "vitest": "^1.6.0"
  },
@@ -41,7 +41,7 @@ function ActionButton({
  action,
  handleAction,
  children,
-  large,
+  large = false,
 }: React.PropsWithChildren<ButtonProps>): React.ReactNode {
  return (
    <Tooltip content={content} closeDelay={100}>
@@ -57,10 +57,6 @@ function ActionButton({
  );
 }

-ActionButton.defaultProps = {
-  large: false,
-};
-
 function AgentControlBar() {
  const { curAgentState } = useSelector((state: RootState) => state.agent);
  const [desiredState, setDesiredState] = React.useState(AgentState.INIT);
@@ -12,7 +12,7 @@ function IconButton({
  icon,
  onClick,
  ariaLabel,
-  testId,
+  testId = "",
 }: IconButtonProps): React.ReactElement {
  return (
    <Button
@@ -28,8 +28,4 @@ function IconButton({
  );
 }

-IconButton.defaultProps = {
-  testId: "",
-};
-
 export default IconButton;
@@ -10,7 +10,7 @@ interface ChatInputProps {
  onSendMessage: (message: string) => void;
 }

-function ChatInput({ disabled, onSendMessage }: ChatInputProps) {
+function ChatInput({ disabled = false, onSendMessage }: ChatInputProps) {
  const { t } = useTranslation();

  const [message, setMessage] = React.useState("");
@@ -70,8 +70,4 @@ function ChatInput({ disabled, onSendMessage }: ChatInputProps) {
  );
 }

-ChatInput.defaultProps = {
-  disabled: false,
-};
-
 export default ChatInput;
@@ -16,8 +16,4 @@ function ExplorerTree({ files, defaultOpen = false }: ExplorerTreeProps) {
  );
 }

-ExplorerTree.defaultProps = {
-  defaultOpen: false,
-};
-
 export default ExplorerTree;
@@ -94,8 +94,4 @@ function TreeNode({ path, defaultOpen = false }: TreeNodeProps) {
  );
 }

-TreeNode.defaultProps = {
-  defaultOpen: false,
-};
-
 export default React.memo(TreeNode);
@@ -24,9 +24,9 @@ function BaseModal({
  onOpenChange,
  title,
  isDismissable = true,
-  subtitle,
-  actions,
-  children,
+  subtitle = undefined,
+  actions = [],
+  children = null,
 }: BaseModalProps) {
  return (
    <Modal
@@ -60,11 +60,4 @@ function BaseModal({
  );
 }

-BaseModal.defaultProps = {
-  isDismissable: true,
-  subtitle: undefined,
-  actions: [],
-  children: null,
-};
-
 export default BaseModal;
@@ -5,7 +5,10 @@ interface HeaderContentProps {
  subtitle?: string;
 }

-export function HeaderContent({ title, subtitle }: HeaderContentProps) {
+export function HeaderContent({
+  title,
+  subtitle = undefined,
+}: HeaderContentProps) {
  return (
    <>
      <h3>{title}</h3>
@@ -15,7 +18,3 @@ export function HeaderContent({ title, subtitle }: HeaderContentProps) {
    </>
  );
 }
-
-HeaderContent.defaultProps = {
-  subtitle: undefined,
-};
@@ -77,8 +77,3 @@ export function AutocompleteCombobox({
    </Tooltip>
  );
 }
-
-AutocompleteCombobox.defaultProps = {
-  allowCustomValue: false,
-  disabled: false,
-};
@@ -23,12 +23,12 @@ vi.spyOn(Session, "isConnected").mockImplementation(() => true);
 vi.mock("#/services/settings", async (importOriginal) => ({
  ...(await importOriginal<typeof import("#/services/settings")>()),
  getSettings: vi.fn().mockReturnValue({
-    LLM_MODEL: "gpt-3.5-turbo",
+    LLM_MODEL: "gpt-4o",
    AGENT: "MonologueAgent",
    LANGUAGE: "en",
  }),
  getDefaultSettings: vi.fn().mockReturnValue({
-    LLM_MODEL: "gpt-3.5-turbo",
+    LLM_MODEL: "gpt-4o",
    AGENT: "CodeActAgent",
    LANGUAGE: "en",
    LLM_API_KEY: "",
@@ -81,7 +81,7 @@ describe("SettingsModal", () => {
  it("should disabled the save button if the settings contain a missing value", async () => {
    const onOpenChangeMock = vi.fn();
    (getSettings as Mock).mockReturnValueOnce({
-      LLM_MODEL: "gpt-3.5-turbo",
+      LLM_MODEL: "gpt-4o",
      AGENT: "",
    });
    await act(async () =>
@@ -97,7 +97,7 @@ describe("SettingsModal", () => {

  describe("onHandleSave", () => {
    const initialSettings: Settings = {
-      LLM_MODEL: "gpt-3.5-turbo",
+      LLM_MODEL: "gpt-4o",
      AGENT: "MonologueAgent",
      LANGUAGE: "en",
      LLM_API_KEY: "sk-...",
@@ -5,12 +5,10 @@ const WAIT_FOR_AUTH_DELAY_MS = 500;

 export async function request(
  url: string,
-  optionsIn: RequestInit = {},
+  options: RequestInit = {},
  disableToast: boolean = false,
  /* eslint-disable-next-line @typescript-eslint/no-explicit-any */
 ): Promise<any> {
-  const options = JSON.parse(JSON.stringify(optionsIn));
-
  const onFail = (msg: string) => {
    if (!disableToast) {
      toast.error("api", msg);
@@ -23,11 +21,12 @@ export async function request(
  if (!token && needsAuth) {
    return new Promise((resolve) => {
      setTimeout(() => {
-        resolve(request(url, optionsIn, disableToast));
+        resolve(request(url, options, disableToast));
      }, WAIT_FOR_AUTH_DELAY_MS);
    });
  }
  if (token) {
+    // eslint-disable-next-line no-param-reassign
    options.headers = {
      ...(options.headers || {}),
      Authorization: `Bearer ${token}`,
@@ -8,7 +8,7 @@ export type Settings = {
 };

 export const DEFAULT_SETTINGS: Settings = {
-  LLM_MODEL: "gpt-3.5-turbo",
+  LLM_MODEL: "gpt-4o",
  AGENT: "CodeActAgent",
  LANGUAGE: "en",
  LLM_API_KEY: "",
@@ -79,8 +79,8 @@ export const saveSettings = (settings: Partial<Settings>) => {
 * Useful for notifying the user of exact changes.
 *
 * @example
- * // Assuming the current settings are: { LLM_MODEL: "gpt-3.5", AGENT: "MonologueAgent", LANGUAGE: "en" }
- * const updatedSettings = getSettingsDifference({ LLM_MODEL: "gpt-3.5", AGENT: "OTHER_AGENT", LANGUAGE: "en" });
+ * // Assuming the current settings are: { LLM_MODEL: "gpt-4o", AGENT: "MonologueAgent", LANGUAGE: "en" }
+ * const updatedSettings = getSettingsDifference({ LLM_MODEL: "gpt-4o", AGENT: "OTHER_AGENT", LANGUAGE: "en" });
 * // updatedSettings = { AGENT: "OTHER_AGENT" }
 *
 * @param settings - the settings to compare
@@ -47,6 +47,7 @@ class AgentController:
    event_stream: EventStream
    state: State
    agent_task: Optional[asyncio.Task] = None
+    parent: 'AgentController | None' = None
    delegate: 'AgentController | None' = None
    _pending_action: Action | None = None

@@ -58,7 +59,8 @@ class AgentController:
        max_iterations: int = MAX_ITERATIONS,
        max_chars: int = MAX_CHARS,
        max_budget_per_task: float | None = MAX_BUDGET_PER_TASK,
-        inputs: dict | None = None,
+        initial_state: State | None = None,
+        is_delegate: bool = False,
    ):
        """Initializes a new instance of the AgentController class.

@@ -69,25 +71,30 @@ class AgentController:
            max_iterations: The maximum number of iterations the agent can run.
            max_chars: The maximum number of characters the agent can output.
            max_budget_per_task: The maximum budget (in USD) allowed per task, beyond which the agent will stop.
-            inputs: The initial inputs to the agent.
+            initial_state: The initial state of the controller.
+            is_delegate: Whether this controller is a delegate.
        """
+        self._step_lock = asyncio.Lock()
        self.id = sid
        self.agent = agent
-        self.state = State(inputs=inputs or {}, max_iterations=max_iterations)
+        self.max_chars = max_chars
+        if initial_state is None:
+            self.state = State(inputs={}, max_iterations=max_iterations)
+        else:
+            self.state = initial_state
        self.event_stream = event_stream
        self.event_stream.subscribe(
-            EventStreamSubscriber.AGENT_CONTROLLER, self.on_event
+            EventStreamSubscriber.AGENT_CONTROLLER, self.on_event, append=is_delegate
        )
-        self.max_iterations = max_iterations
-        self.max_chars = max_chars
        self.max_budget_per_task = max_budget_per_task
-        self.agent_task = asyncio.create_task(self._start_step_loop())
+        if not is_delegate:
+            self.agent_task = asyncio.create_task(self._start_step_loop())

    async def close(self):
        if self.agent_task is not None:
            self.agent_task.cancel()
-        self.event_stream.unsubscribe(EventStreamSubscriber.AGENT_CONTROLLER)
        await self.set_agent_state_to(AgentState.STOPPED)
+        self.event_stream.unsubscribe(EventStreamSubscriber.AGENT_CONTROLLER)

    def update_state_before_step(self):
        self.state.iteration += 1
@@ -117,6 +124,7 @@ class AgentController:
        self.state.updated_info.append((action, observation))

    async def _start_step_loop(self):
+        logger.info(f'[Agent Controller {self.id}] Starting step loop...')
        while True:
            try:
                await self._step()
@@ -164,13 +172,16 @@ class AgentController:
            elif isinstance(event, CmdOutputObservation):
                await self.add_history(NullAction(), event)
                logger.info(event, extra={'msg_type': 'OBSERVATION'})
+            elif isinstance(event, AgentDelegateObservation):
+                await self.add_history(NullAction(), event)
+                logger.info(event, extra={'msg_type': 'OBSERVATION'})

    def reset_task(self):
        self.agent.reset()

    async def set_agent_state_to(self, new_state: AgentState):
        logger.info(
-            f'Setting agent({type(self.agent).__name__}) state from {self.state.agent_state} to {new_state}'
+            f'[Agent Controller {self.id}] Setting agent({type(self.agent).__name__}) state from {self.state.agent_state} to {new_state}'
        )

        if new_state == self.state.agent_state:
@@ -195,45 +206,84 @@ class AgentController:
    async def start_delegate(self, action: AgentDelegateAction):
        AgentCls: Type[Agent] = Agent.get_cls(action.agent)
        agent = AgentCls(llm=self.agent.llm)
+        state = State(
+            inputs=action.inputs or {},
+            iteration=0,
+            max_iterations=self.state.max_iterations,
+            num_of_chars=self.state.num_of_chars,
+            delegate_level=self.state.delegate_level + 1,
+        )
+        logger.info(f'[Agent Controller {self.id}]: start delegate')
        self.delegate = AgentController(
            sid=self.id + '-delegate',
            agent=agent,
            event_stream=self.event_stream,
-            max_iterations=self.max_iterations,
+            max_iterations=self.state.max_iterations,
            max_chars=self.max_chars,
-            inputs=action.inputs,
+            initial_state=state,
+            is_delegate=True,
        )
+        await self.delegate.set_agent_state_to(AgentState.RUNNING)

    async def _step(self):
+        logger.debug(f'[Agent Controller {self.id}] Entering step method')
        if self.get_agent_state() != AgentState.RUNNING:
-            logger.debug('waiting for agent to run...')
            await asyncio.sleep(1)
            return

        if self._pending_action:
-            logger.debug('waiting for pending action: ' + str(self._pending_action))
+            logger.info(
+                f'[Agent Controller {self.id}] waiting for pending action: {self._pending_action}'
+            )
            await asyncio.sleep(1)
            return

-        logger.info(f'STEP {self.state.iteration}', extra={'msg_type': 'STEP'})
-        if self.state.iteration >= self.max_iterations:
-            await self.report_error('Agent reached maximum number of iterations')
-            await self.set_agent_state_to(AgentState.ERROR)
-            return
-
        if self.delegate is not None:
-            delegate_done = await self.delegate._step()
+            logger.debug(f'[Agent Controller {self.id}] Delegate not none, awaiting...')
+            assert self.delegate != self
+            await self.delegate._step()
+            logger.debug(f'[Agent Controller {self.id}] Delegate step done')
+            assert self.delegate is not None
+            delegate_state = self.delegate.get_agent_state()
+            if delegate_state == AgentState.ERROR:
+                # close the delegate upon error
+                await self.delegate.close()
+                await self.report_error('Delegate agent encountered an error')
+                # propagate error state until an agent or user can handle it
+                await self.set_agent_state_to(AgentState.ERROR)
+                return
+            delegate_done = delegate_state == AgentState.FINISHED
            if delegate_done:
+                logger.info(
+                    f'[Agent Controller {self.id}] Delegate agent has finished execution'
+                )
+                # retrieve delegate result
                outputs = self.delegate.state.outputs if self.delegate.state else {}
-                obs: Observation = AgentDelegateObservation(content='', outputs=outputs)
-                await self.event_stream.add_event(obs, EventSource.AGENT)
+
+                # close delegate controller: we must close the delegate controller before adding new events
+                await self.delegate.close()
+
+                # clean up delegate status
                self.delegate = None
                self.delegateAction = None
+
+                # update delegate result observation
+                obs: Observation = AgentDelegateObservation(outputs=outputs, content='')
+                await self.event_stream.add_event(obs, EventSource.AGENT)
            return

        if self.state.num_of_chars > self.max_chars:
            raise MaxCharsExceedError(self.state.num_of_chars, self.max_chars)

+        logger.info(
+            f'{type(self.agent).__name__} LEVEL {self.state.delegate_level} STEP {self.state.iteration}',
+            extra={'msg_type': 'STEP'},
+        )
+        if self.state.iteration >= self.state.max_iterations:
+            await self.report_error('Agent reached maximum number of iterations')
+            await self.set_agent_state_to(AgentState.ERROR)
+            return
+
        self.update_state_before_step()
        action: Action = NullAction()
        try:
@@ -335,6 +385,14 @@ class AgentController:

        return False

+    def __repr__(self):
+        return (
+            f'AgentController(id={self.id}, agent={self.agent!r}, '
+            f'event_stream={self.event_stream!r}, '
+            f'state={self.state!r}, agent_task={self.agent_task!r}, '
+            f'delegate={self.delegate!r}, _pending_action={self._pending_action!r})'
+        )
+
    def _eq_no_pid(self, obj1, obj2):
        if isinstance(obj1, CmdOutputObservation) and isinstance(
            obj2, CmdOutputObservation
@@ -40,6 +40,8 @@ class State:
    agent_state: AgentState = AgentState.LOADING
    resume_state: AgentState | None = None
    metrics: Metrics = Metrics()
+    # root agent has level 0, and every delegate increases the level by one
+    delegate_level: int = 0

    def save_to_session(self, sid: str):
        fs = get_file_store()
@@ -48,7 +48,7 @@ class LLMConfig(metaclass=Singleton):
        output_cost_per_token: The cost per output token. This will available in logs for the user to check.
    """

-    model: str = 'gpt-3.5-turbo'
+    model: str = 'gpt-4o'
    api_key: str | None = None
    base_url: str | None = None
    api_version: str | None = None
@@ -179,6 +179,9 @@ class AppConfig(metaclass=Singleton):
    disable_color: bool = False
    sandbox_user_id: int = os.getuid() if hasattr(os, 'getuid') else 1000
    sandbox_timeout: int = 120
+    persist_sandbox: bool = False
+    ssh_port: int = 63710
+    ssh_password: str | None = None
    github_token: str | None = None
    jwt_secret: str = uuid.uuid4().hex
    debug: bool = False
@@ -2,6 +2,7 @@ from enum import Enum


 class ConfigType(str, Enum):
+    # For frontend
    LLM_CUSTOM_LLM_PROVIDER = 'LLM_CUSTOM_LLM_PROVIDER'
    LLM_MAX_INPUT_TOKENS = 'LLM_MAX_INPUT_TOKENS'
    LLM_MAX_OUTPUT_TOKENS = 'LLM_MAX_OUTPUT_TOKENS'
@@ -21,7 +21,9 @@ class EventStreamSubscriber(str, Enum):

 class EventStream:
    sid: str
-    _subscribers: dict[str, Callable]
+    # For each subscriber ID, there is a stack of callback functions - useful
+    # when there are agent delegates
+    _subscribers: dict[str, list[Callable]]
    _cur_id: int
    _lock: asyncio.Lock
    _file_store: FileStore
@@ -69,17 +71,22 @@ class EventStream:
        data = json.loads(content)
        return event_from_dict(data)

-    def subscribe(self, id: EventStreamSubscriber, callback: Callable):
+    def subscribe(self, id: EventStreamSubscriber, callback: Callable, append=False):
        if id in self._subscribers:
-            raise ValueError('Subscriber already exists: ' + id)
+            if append:
+                self._subscribers[id].append(callback)
+            else:
+                raise ValueError('Subscriber already exists: ' + id)
        else:
-            self._subscribers[id] = callback
+            self._subscribers[id] = [callback]

    def unsubscribe(self, id: EventStreamSubscriber):
        if id not in self._subscribers:
            logger.warning('Subscriber not found during unsubscribe: ' + id)
        else:
-            del self._subscribers[id]
+            self._subscribers[id].pop()
+            if len(self._subscribers[id]) == 0:
+                del self._subscribers[id]

    # TODO: make this not async
    async def add_event(self, event: Event, source: EventSource):
@@ -93,5 +100,6 @@ class EventStream:
            self._file_store.write(
                self._get_filename_for_id(event.id), json.dumps(data)
            )
-        for key, fn in self._subscribers.items():
-            await fn(event)
+        for key, stack in self._subscribers.items():
+            callback = stack[-1]
+            await callback(event)
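
Review note: the hunks above turn each subscriber entry into a stack, so a delegate controller can temporarily take over event handling from its parent. A minimal, self-contained sketch of that behaviour follows. `StackedSubscribers`, `publish`, and `demo` are hypothetical names used only for illustration and are not the project's `EventStream` API, and events are reduced to plain strings.

```python
import asyncio
from typing import Awaitable, Callable

Callback = Callable[[str], Awaitable[None]]


class StackedSubscribers:
    """Toy stand-in for the subscriber bookkeeping (not the real EventStream)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callback]] = {}

    def subscribe(self, id: str, callback: Callback, append: bool = False) -> None:
        if id in self._subscribers:
            if not append:
                raise ValueError('Subscriber already exists: ' + id)
            # Push on top of the existing callback instead of replacing it.
            self._subscribers[id].append(callback)
        else:
            self._subscribers[id] = [callback]

    def unsubscribe(self, id: str) -> None:
        stack = self._subscribers.get(id)
        if stack is None:
            return
        # Pop only the most recent callback; drop the entry once it is empty.
        stack.pop()
        if not stack:
            del self._subscribers[id]

    async def publish(self, event: str) -> None:
        # Only the callback on top of each stack receives the event.
        for stack in list(self._subscribers.values()):
            await stack[-1](event)


async def demo() -> list[str]:
    seen: list[str] = []

    async def parent(event: str) -> None:
        seen.append('parent:' + event)

    async def delegate(event: str) -> None:
        seen.append('delegate:' + event)

    subs = StackedSubscribers()
    subs.subscribe('agent_controller', parent)
    subs.subscribe('agent_controller', delegate, append=True)
    await subs.publish('e1')  # the delegate is on top of the stack
    subs.unsubscribe('agent_controller')
    await subs.publish('e2')  # the parent handles events again
    return seen
```

While the delegate is subscribed, the parent receives nothing; once the delegate unsubscribes, the parent's callback becomes active again without re-subscribing.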
@@ -131,7 +131,7 @@ class LLM:
        # litellm actually uses base Exception here for unknown model
        self.model_info = None
        try:
-            self.model_info = litellm.get_model_info(self.model_name)
+            self.model_info = litellm.get_model_info(self.model_name.split(':')[0])
        # noinspection PyBroadException
        except Exception:
            logger.warning(f'Could not get model info for {self.model_name}')
@@ -216,38 +216,50 @@ class DockerSSHBox(Sandbox):
            )
            raise ex

-        self.instance_id = (
-            sid + str(uuid.uuid4()) if sid is not None else str(uuid.uuid4())
-        )
+        if config.persist_sandbox:
+            self.instance_id = 'persisted'
+        else:
+            self.instance_id = (sid or '') + str(uuid.uuid4())

        self.timeout = timeout
-        self.container_image = (
-            config.sandbox_container_image
-            if container_image is None
-            else container_image
-        )
+        self.container_image = container_image or config.sandbox_container_image
        self.container_name = self.container_name_prefix + self.instance_id

        # set up random user password
-        self._ssh_password = str(uuid.uuid4())
-        self._ssh_port = find_available_tcp_port()
-
-        # always restart the container, cuz the initial be regarded as a new session
-        n_tries = 5
-        while n_tries > 0:
-            try:
-                self.restart_docker_container()
-                break
-            except Exception as e:
-                logger.exception(
-                    'Failed to start Docker container, retrying...', exc_info=False
+        if config.persist_sandbox:
+            if not config.ssh_password:
+                raise Exception(
+                    'Please add ssh_password to your config.toml or add -e SSH_PASSWORD to your docker run command'
                )
-                n_tries -= 1
-                if n_tries == 0:
-                    raise e
-                time.sleep(5)
-        self.setup_user()
-
+            self._ssh_password = config.ssh_password
+            self._ssh_port = config.ssh_port
+        else:
+            self._ssh_password = str(uuid.uuid4())
+            self._ssh_port = find_available_tcp_port()
+        try:
+            self.docker_client.containers.get(self.container_name)
+            is_initial_session = False
+        except docker.errors.NotFound:
+            is_initial_session = True
+            logger.info('Creating new Docker container')
+        if not config.persist_sandbox or is_initial_session:
+            n_tries = 5
+            while n_tries > 0:
+                try:
+                    self.restart_docker_container()
+                    break
+                except Exception as e:
+                    logger.exception(
+                        'Failed to start Docker container, retrying...', exc_info=False
+                    )
+                    n_tries -= 1
+                    if n_tries == 0:
+                        raise e
+                    time.sleep(5)
+            self.setup_user()
+        else:
+            self.container = self.docker_client.containers.get(self.container_name)
+            logger.info('Using existing Docker container')
        try:
            self.start_ssh_session()
        except pxssh.ExceptionPxssh as e:
@@ -391,6 +403,9 @@ class DockerSSHBox(Sandbox):
        # cd to workspace
        self.ssh.sendline(f'cd {self.sandbox_workspace_dir}')
        self.ssh.prompt()
+        # load bashrc
+        self.ssh.sendline('source ~/.bashrc')
+        self.ssh.prompt()

    def get_exec_cmd(self, cmd: str) -> list[str]:
        if self.run_as_devin:
@@ -704,7 +719,10 @@ class DockerSSHBox(Sandbox):
        containers = self.docker_client.containers.list(all=True)
        for container in containers:
            try:
-                if container.name.startswith(self.container_name):
+                if (
+                    container.name.startswith(self.container_name)
+                    and not config.persist_sandbox
+                ):
                    # only remove the container we created
                    # otherwise all other containers with the same prefix will be removed
                    # which will mess up with parallel evaluation
@@ -16,6 +16,7 @@ Functions:
 """

 import base64
+import functools
 import os
 import subprocess
 from inspect import signature
@@ -46,6 +47,22 @@ OPENAI_PROXY = f'{OPENAI_BASE_URL}/chat/completions'
 client = OpenAI(api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)


+# Define the decorator using the functionality of UpdatePwd
+def update_pwd_decorator(func):
+    @functools.wraps(func)
+    def wrapper(*args, **kwargs):
+        old_pwd = os.getcwd()
+        jupyter_pwd = os.environ.get('JUPYTER_PWD', None)
+        if jupyter_pwd:
+            os.chdir(jupyter_pwd)
+        try:
+            return func(*args, **kwargs)
+        finally:
+            os.chdir(old_pwd)
+
+    return wrapper
+
+
 def _lint_file(file_path: str) -> Optional[str]:
    """
    Lint the file at the given path.
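
Review note: the decorator added in this hunk runs each skill from the directory named by `JUPYTER_PWD` and restores the previous working directory afterwards, even if the wrapped function raises. The sketch below reproduces that pattern in isolation. `run_in_env_dir`, `where_am_i`, and `demo` are hypothetical names for illustration, and a temporary directory stands in for the sandbox workspace.

```python
import functools
import os
import tempfile


def run_in_env_dir(func):
    """Run func with the cwd set to $JUPYTER_PWD (if set), then restore it."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        old_pwd = os.getcwd()
        target = os.environ.get('JUPYTER_PWD')
        if target:
            os.chdir(target)
        try:
            return func(*args, **kwargs)
        finally:
            # Always restore the original directory, even on exceptions.
            os.chdir(old_pwd)

    return wrapper


@run_in_env_dir
def where_am_i() -> str:
    return os.path.realpath(os.getcwd())


def demo() -> tuple[bool, bool]:
    before = os.path.realpath(os.getcwd())
    with tempfile.TemporaryDirectory() as tmp:
        os.environ['JUPYTER_PWD'] = tmp
        try:
            inside = where_am_i()
        finally:
            del os.environ['JUPYTER_PWD']
        ran_in_tmp = inside == os.path.realpath(tmp)
    restored = os.path.realpath(os.getcwd()) == before
    return ran_in_tmp, restored
```

Because the restore happens in a `finally` block, relative paths passed to a wrapped function resolve against the shell's directory without permanently moving the kernel process.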
@@ -88,12 +105,21 @@ def _print_window(CURRENT_FILE, CURRENT_LINE, WINDOW, return_str=False):
        start = max(0, CURRENT_LINE - WINDOW // 2)
        end = min(len(lines), CURRENT_LINE + WINDOW // 2)
        output = ''
+
+        # only display this when there are lines above
+        if start > 0:
+            n_above_lines = start
+            output += f'({n_above_lines} more lines above)\n'
        for i in range(start, end):
            _new_line = f'{i + 1}|{lines[i]}'
            if not _new_line.endswith('\n'):
                _new_line += '\n'
            output += _new_line
+        if end < len(lines):
+            n_below_lines = len(lines) - end
+            output += f'({n_below_lines} more lines below)\n'
        output = output.rstrip()
+
        if return_str:
            return output
        else:
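
Review note: the change to `_print_window` adds "more lines above/below" markers around the visible window. A simplified sketch of the same windowing arithmetic is shown here. `render_window` is a hypothetical helper for illustration only: it takes the lines as a list of strings without trailing newlines instead of reading a file.

```python
def render_window(lines: list[str], current_line: int, window: int) -> str:
    """Render a window of lines around current_line with overflow markers."""
    start = max(0, current_line - window // 2)
    end = min(len(lines), current_line + window // 2)
    out: list[str] = []
    # Only mention hidden lines when some actually exist.
    if start > 0:
        out.append(f'({start} more lines above)')
    for i in range(start, end):
        out.append(f'{i + 1}|{lines[i]}')
    if end < len(lines):
        out.append(f'({len(lines) - end} more lines below)')
    return '\n'.join(out)


sample = [f'line{i}' for i in range(1, 11)]
middle = render_window(sample, 5, 4)
top = render_window(sample, 1, 4)
```

With ten lines and a window of four centred on line 5, the output shows lines 4 to 7 and reports three hidden lines on each side. At the top of the file the "above" marker is omitted.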
@@ -104,6 +130,7 @@ def _cur_file_header(CURRENT_FILE, total_lines):
    return f'[File: {os.path.abspath(CURRENT_FILE)} ({total_lines} lines total)]\n'


+@update_pwd_decorator
 def open_file(path: str, line_number: Optional[int] = None) -> None:
    """
    Opens the file at the given path in the editor. If line_number is provided, the window will be moved to include that line.
@@ -116,7 +143,7 @@ def open_file(path: str, line_number: Optional[int] = None) -> None:
    if not os.path.isfile(path):
        raise FileNotFoundError(f'File {path} not found')

-    CURRENT_FILE = path
+    CURRENT_FILE = os.path.abspath(path)
    with open(CURRENT_FILE) as file:
        total_lines = sum(1 for _ in file)

@@ -136,6 +163,7 @@ def open_file(path: str, line_number: Optional[int] = None) -> None:
    print(output)


+@update_pwd_decorator
 def goto_line(line_number: int) -> None:
    """
    Moves the window to show the specified line number.
@@ -158,6 +186,7 @@ def goto_line(line_number: int) -> None:
    print(output)


+@update_pwd_decorator
 def scroll_down() -> None:
    """Moves the window down by 100 lines.

@@ -175,6 +204,7 @@ def scroll_down() -> None:
    print(output)


+@update_pwd_decorator
 def scroll_up() -> None:
    """Moves the window up by 100 lines.

@@ -192,6 +222,7 @@ def scroll_up() -> None:
    print(output)


+@update_pwd_decorator
 def create_file(filename: str) -> None:
    """Creates and opens a new file with the given name.

@@ -209,6 +240,7 @@ def create_file(filename: str) -> None:
    print(f'[File {filename} created.]')


+@update_pwd_decorator
 def edit_file(start: int, end: int, content: str) -> None:
    """Edit a file.

@@ -227,21 +259,35 @@ def edit_file(start: int, end: int, content: str) -> None:
    with open(CURRENT_FILE, 'r') as file:
        lines = file.readlines()

+    ERROR_MSG = f'[Error editing opened file {CURRENT_FILE}. Please confirm the opened file is correct.]'
+    ERROR_MSG_SUFFIX = (
+        'Your changes have NOT been applied. Please fix your edit command and try again.\n'
+        'You either need to 1) Open the correct file and try again or 2) Specify the correct start/end line arguments.\n'
+        'DO NOT re-run the same failed edit command. Running it again will lead to the same error.'
+    )
    # Check arguments
    if not (1 <= start <= len(lines)):
-        raise ValueError(
-            f'Invalid start line number: {start}. Line numbers must be between 1 and {len(lines)} (inclusive).'
+        print(
+            f'{ERROR_MSG}\n'
+            f'Invalid start line number: {start}. Line numbers must be between 1 and {len(lines)} (inclusive).\n'
+            f'{ERROR_MSG_SUFFIX}'
        )
+        return

    if not (1 <= end <= len(lines)):
-        raise ValueError(
-            f'Invalid end line number: {end}. Line numbers must be between 1 and {len(lines)} (inclusive).'
+        print(
+            f'{ERROR_MSG}\n'
+            f'Invalid end line number: {end}. Line numbers must be between 1 and {len(lines)} (inclusive).\n'
+            f'{ERROR_MSG_SUFFIX}'
        )
-
+        return
    if start > end:
-        raise ValueError(
-            f'Invalid line range: {start}-{end}. Start must be less than or equal to end.'
+        print(
+            f'{ERROR_MSG}\n'
+            f'Invalid line range: {start}-{end}. Start must be less than or equal to end.\n'
+            f'{ERROR_MSG_SUFFIX}'
        )
+        return

    edited_content = content + '\n'
    n_edited_lines = len(edited_content.split('\n'))
@@ -270,14 +316,20 @@ def edit_file(start: int, end: int, content: str) -> None:
            print('[This is how your edit would have looked if applied]')
            print('-------------------------------------------------')
            cur_line = (n_edited_lines // 2) + start
-            _print_window(CURRENT_FILE, cur_line, WINDOW)
+            _print_window(CURRENT_FILE, cur_line, 10)
            print('-------------------------------------------------\n')

            print('[This is the original code before your edit]')
            print('-------------------------------------------------')
-            _print_window(original_file_backup_path, CURRENT_LINE, WINDOW)
+            _print_window(original_file_backup_path, cur_line, 10)
            print('-------------------------------------------------')

+            print(
+                'Your changes have NOT been applied. Please fix your edit command and try again.\n'
+                'You either need to 1) Specify the correct start/end line arguments or 2) Correct your edit code.\n'
+                'DO NOT re-run the same failed edit command. Running it again will lead to the same error.'
+            )
+
            # recover the original file
            with open(original_file_backup_path, 'r') as fin, open(
                CURRENT_FILE, 'w'
@@ -301,6 +353,7 @@ def edit_file(start: int, end: int, content: str) -> None:
    )


+@update_pwd_decorator
 def search_dir(search_term: str, dir_path: str = './') -> None:
    """Searches for search_term in all files in dir. If dir is not provided, searches in the current directory.

@@ -310,7 +363,6 @@ def search_dir(search_term: str, dir_path: str = './') -> None:
    """
    if not os.path.isdir(dir_path):
        raise FileNotFoundError(f'Directory {dir_path} not found')
-
    matches = []
    for root, _, files in os.walk(dir_path):
        for file in files:
@@ -341,6 +393,7 @@ def search_dir(search_term: str, dir_path: str = './') -> None:
    print(f'[End of matches for "{search_term}" in {dir_path}]')


+@update_pwd_decorator
 def search_file(search_term: str, file_path: Optional[str] = None) -> None:
    """Searches for search_term in file. If file is not provided, searches in the current open file.

@@ -373,6 +426,7 @@ def search_file(search_term: str, file_path: Optional[str] = None) -> None:
        print(f'[No matches found for "{search_term}" in {file_path}]')


+@update_pwd_decorator
 def find_file(file_name: str, dir_path: str = './') -> None:
    """Finds all files with the given name in the specified directory.

@@ -398,6 +452,7 @@ def find_file(file_name: str, dir_path: str = './') -> None:
        print(f'[No matches found for "{file_name}" in {dir_path}]')


+@update_pwd_decorator
 def parse_pdf(file_path: str) -> None:
    """Parses the content of a PDF file and prints it.

@@ -416,6 +471,7 @@ def parse_pdf(file_path: str) -> None:
    print(text.strip())


+@update_pwd_decorator
 def parse_docx(file_path: str) -> None:
    """
    Parses the content of a DOCX file and prints it.
@@ -431,6 +487,7 @@ def parse_docx(file_path: str) -> None:
    print(text)


+@update_pwd_decorator
 def parse_latex(file_path: str) -> None:
    """
    Parses the content of a LaTex file and prints it.
@@ -484,6 +541,7 @@ def _prepare_image_messages(task: str, base64_image: str):
    ]


+@update_pwd_decorator
 def parse_audio(file_path: str, model: str = 'whisper-1') -> None:
    """
    Parses the content of an audio file and prints it.
@@ -503,6 +561,7 @@ def parse_audio(file_path: str, model: str = 'whisper-1') -> None:
        print(f'Error transcribing audio file: {e}')


+@update_pwd_decorator
 def parse_image(
    file_path: str, task: str = 'Describe this image as detail as possible.'
 ) -> None:
@@ -529,6 +588,7 @@ def parse_image(
        print(f'Error with the request: {error}')


+@update_pwd_decorator
 def parse_video(
    file_path: str,
    task: str = 'Describe this image as detail as possible.',
@@ -577,6 +637,7 @@ def parse_video(
            print(f'Error with the request: {error}')


+@update_pwd_decorator
 def parse_pptx(file_path: str) -> None:
    """
    Parses the content of a pptx file and prints it.
@@ -7,20 +7,33 @@ import requests
 # Read the Python code from STDIN
 code = sys.stdin.read()

-# Set the default kernel ID
-kernel_id = 'default'

-PORT = os.environ.get('JUPYTER_EXEC_SERVER_PORT')
-POST_URL = f'http://localhost:{PORT}/execute'
+def execute_code(code, print_output=True):
+    PORT = os.environ.get('JUPYTER_EXEC_SERVER_PORT')
+    POST_URL = f'http://localhost:{PORT}/execute'

-for i in range(10):
-    try:
-        response = requests.post(POST_URL, json={'kernel_id': kernel_id, 'code': code})
-        if '500: Internal Server Error' not in response.text:
-            print(response.text)
-            break
-    except requests.exceptions.ConnectionError:
-        pass
-    time.sleep(2)
-else:
-    print('Failed to connect to the Jupyter server')
+    # Set the default kernel ID
+    kernel_id = 'default'
+
+    for i in range(10):
+        try:
+            response = requests.post(
+                POST_URL, json={'kernel_id': kernel_id, 'code': code}
+            )
+            if '500: Internal Server Error' not in response.text:
+                if print_output:
+                    print(response.text)
+                break
+        except requests.exceptions.ConnectionError:
+            pass
+        time.sleep(2)
+    else:
+        print('Failed to connect to the Jupyter server')
+
+
+if jupyter_pwd := os.environ.get('JUPYTER_PWD'):
+    execute_code(
+        f'import os\nos.environ["JUPYTER_PWD"] = {jupyter_pwd!r}\n', print_output=False
+    )
+
+execute_code(code)
@@ -134,7 +134,7 @@ class JupyterKernel:
        )
        self.heartbeat_callback.start()

-    async def execute(self, code, timeout=60):
+    async def execute(self, code, timeout=120):
        if not self.ws:
            await self._connect()

@@ -55,7 +55,10 @@ class ServerRuntime(Runtime):

        # run the code
        obs = self._run_command(
-            ('cat /tmp/opendevin_jupyter_temp.py | execute_cli'), background=False
+            (
+                'export JUPYTER_PWD=$(pwd) && cat /tmp/opendevin_jupyter_temp.py | execute_cli'
+            ),
+            background=False,
        )
        output = obs.content
        if 'pip install' in action.code and 'Successfully installed' in output:
@@ -24,7 +24,7 @@ websocat ws://127.0.0.1:3000/ws

 ```sh
 LLM_API_KEY=sk-... # Your OpenAI API Key
-LLM_MODEL=gpt-3.5-turbo # Default model for the agent to use
+LLM_MODEL=gpt-4o # Default model for the agent to use
 WORKSPACE_BASE=/path/to/your/workspace # Default path to model's workspace
 ```

@@ -416,17 +416,17 @@ files = [

 [[package]]
 name = "boto3"
-version = "1.34.112"
+version = "1.34.115"
 description = "The AWS SDK for Python"
 optional = false
 python-versions = ">=3.8"
 files = [
-    {file = "boto3-1.34.112-py3-none-any.whl", hash = "sha256:4cf28ce2c19a4e4963f1cb1f9b659a548f840f88af3e2da727b35ceb104f9223"},
-    {file = "boto3-1.34.112.tar.gz", hash = "sha256:1092ac6c68acdd33051ed0d2b7cb6f5a4527c5d1535a48cda53f7012accde206"},
+    {file = "boto3-1.34.115-py3-none-any.whl", hash = "sha256:0a580de3d25364da5db26ecc7dde9438ee1be1e529a7c04cc96972b6e2258378"},
+    {file = "boto3-1.34.115.tar.gz", hash = "sha256:67f5a6d6e6eff9c15711c265173b53eb4ad8d05b756b76ef33ac792cea7958f6"},
 ]

 [package.dependencies]
-botocore = ">=1.34.112,<1.35.0"
+botocore = ">=1.34.115,<1.35.0"
 jmespath = ">=0.7.1,<2.0.0"
 s3transfer = ">=0.10.0,<0.11.0"

@@ -435,13 +435,13 @@ crt = ["botocore[crt] (>=1.21.0,<2.0a0)"]

 [[package]]
 name = "botocore"
-version = "1.34.112"
+version = "1.34.115"
 description = "Low-level, data-driven core of boto 3."
 optional = false
 python-versions = ">=3.8"
 files = [
-    {file = "botocore-1.34.112-py3-none-any.whl", hash = "sha256:637f568a6c3322fb7e5ee55e0c5367324a15a331e87a497783ac6209253dde30"},
-    {file = "botocore-1.34.112.tar.gz", hash = "sha256:053495953910bcf95d336ab1adb13efb70edc5462932eff180560737ad069319"},
+    {file = "botocore-1.34.115-py3-none-any.whl", hash = "sha256:15b8ad1ee0e9cd57884fb0bcaf3a9551d2552e44a02c2ffb55ec583eebdb888e"},
+    {file = "botocore-1.34.115.tar.gz", hash = "sha256:a5d5e28b9c847b17a1ecb7660b46b83d9512b125f671e03e93d14bf6f0b274c2"},
 ]

 [package.dependencies]
@@ -454,31 +454,31 @@ crt = ["awscrt (==0.20.9)"]

 [[package]]
 name = "browsergym"
-version = "0.3.2"
+version = "0.3.4"
 description = "BrowserGym: a gym environment for web task automation in the Chromium browser"
 optional = false
 python-versions = ">3.7"
 files = [
-    {file = "browsergym-0.3.2-py3-none-any.whl", hash = "sha256:1e4380392804542c328bf990584ad7090f77d15c035c8160d6a15fc9dbba11d7"},
-    {file = "browsergym-0.3.2.tar.gz", hash = "sha256:8c11a6a5540af2ea8924fc00b5ee8ab18fca970aa7205568dffbccf6fffc74c5"},
+    {file = "browsergym-0.3.4-py3-none-any.whl", hash = "sha256:ecc06a42a6b7541f9025fa9cdc208d48eb4a745283358524715447257fc80adc"},
+    {file = "browsergym-0.3.4.tar.gz", hash = "sha256:853937f29c3855577a5fbc038a4371e82e50e393f4bdfc458df222590470807c"},
 ]

 [package.dependencies]
-browsergym-core = "0.3.2"
-browsergym-experiments = "0.3.2"
-browsergym-miniwob = "0.3.2"
-browsergym-webarena = "0.3.2"
+browsergym-core = "0.3.4"
+browsergym-experiments = "0.3.4"
+browsergym-miniwob = "0.3.4"
+browsergym-webarena = "0.3.4"
 browsergym-workarena = "*"

 [[package]]
 name = "browsergym-core"
-version = "0.3.2"
+version = "0.3.4"
 description = "BrowserGym: a gym environment for web task automation in the Chromium browser"
 optional = false
 python-versions = ">3.7"
 files = [
-    {file = "browsergym_core-0.3.2-py3-none-any.whl", hash = "sha256:b444d0297896ab9d1c5b04991286c6e52023673214302117cbd20ec3b4bb9279"},
-    {file = "browsergym_core-0.3.2.tar.gz", hash = "sha256:ff4750ffeb63ca96a6eb71fa30048175cf59cd5a27278238355118001b96730e"},
+    {file = "browsergym_core-0.3.4-py3-none-any.whl", hash = "sha256:1d7164b9afab613af6ae269fb811721738b09d5935df567cceba87dd1ecb4f23"},
+    {file = "browsergym_core-0.3.4.tar.gz", hash = "sha256:357d4cc61f2447983f9c5c0c262d5d6cca129e926ab576ec72f6b974bd1f7fd6"},
 ]

 [package.dependencies]
@@ -492,46 +492,46 @@ pyparsing = ">=3"

 [[package]]
 name = "browsergym-experiments"
-version = "0.3.2"
+version = "0.3.4"
 description = "Experimentation tools for BrowserGym"
 optional = false
 python-versions = ">3.7"
 files = [
-    {file = "browsergym_experiments-0.3.2-py3-none-any.whl", hash = "sha256:d27775ea401fc297111ccbb922a27be0f877ae021a824c1a918438454989fe8f"},
-    {file = "browsergym_experiments-0.3.2.tar.gz", hash = "sha256:47dce382162faf62c859a37b853e38bdac83e85b28a7c9bed36cb32391d412a8"},
+    {file = "browsergym_experiments-0.3.4-py3-none-any.whl", hash = "sha256:d2e4a75b4a2e79f9300eb289c9b2432f07dee82622d384924972f4157069f3fe"},
+    {file = "browsergym_experiments-0.3.4.tar.gz", hash = "sha256:16309c6b2be59627ea90c7e36448eb897512bcef033cf481472879f4c5be317b"},
 ]

 [package.dependencies]
-browsergym-core = "0.3.2"
+browsergym-core = "0.3.4"
 tiktoken = ">=0.4"

 [[package]]
 name = "browsergym-miniwob"
-version = "0.3.2"
+version = "0.3.4"
 description = "MiniWoB++ benchmark for BrowserGym"
 optional = false
 python-versions = ">3.7"
 files = [
-    {file = "browsergym_miniwob-0.3.2-py3-none-any.whl", hash = "sha256:d63d4eee2426bbf0557a0f81b35fd712ac8a478faa18559b1e763d808c1d9062"},
-    {file = "browsergym_miniwob-0.3.2.tar.gz", hash = "sha256:fb74866423c1b3f957aca6ce65e318cf852ca51f21aa3d828c00bed79c824c67"},
+    {file = "browsergym_miniwob-0.3.4-py3-none-any.whl", hash = "sha256:4de41ee146d6f0bcb2e49b0fb8fd49f519439bf44808aef6146f5ae00064062b"},
+    {file = "browsergym_miniwob-0.3.4.tar.gz", hash = "sha256:938d58a9882c4118e46160d303a9a6d93ac1a08288e81e2c6d5c768719f012fe"},
 ]

 [package.dependencies]
-browsergym-core = "0.3.2"
+browsergym-core = "0.3.4"

 [[package]]
 name = "browsergym-webarena"
-version = "0.3.2"
+version = "0.3.4"
 description = "WebArena benchmark for BrowserGym"
 optional = false
 python-versions = ">3.7"
 files = [
-    {file = "browsergym_webarena-0.3.2-py3-none-any.whl", hash = "sha256:bb706929d4c1e95f53592af58e4314d2775051b91800d0f2fb11f51a38b5b127"},
-    {file = "browsergym_webarena-0.3.2.tar.gz", hash = "sha256:a65013a98903bb14ad999dbedb0313ac35a21a3fb35984df2c76c8f7d423b95e"},
+    {file = "browsergym_webarena-0.3.4-py3-none-any.whl", hash = "sha256:fd9f9bb4cdf1e32d22e6cd525fd0c28adf9dda615e4dc614b677c25f675a9b73"},
+    {file = "browsergym_webarena-0.3.4.tar.gz", hash = "sha256:ba921a76223910d8842d0c9dd6d3393db14819f9a74c477289f0d2625bdd8feb"},
 ]

 [package.dependencies]
-browsergym-core = "0.3.2"
+browsergym-core = "0.3.4"
 libwebarena = "0.0.3"

 [[package]]
@@ -2403,13 +2403,13 @@ files = [

 [[package]]
 name = "json-repair"
-version = "0.19.2"
+version = "0.21.0"
 description = "A package to repair broken json strings"
 optional = false
 python-versions = ">=3.7"
 files = [
-    {file = "json_repair-0.19.2-py3-none-any.whl", hash = "sha256:eeacf422c620d98499c6a7d6da78dc52857bd419f2276157d44ef2441eccca2e"},
-    {file = "json_repair-0.19.2.tar.gz", hash = "sha256:0bb1963a2a0958b18f403a4cc937fdb580f63ba7b86b9779c5a9be6d9bdc9e9d"},
+    {file = "json_repair-0.21.0-py3-none-any.whl", hash = "sha256:b432d5f4a09c75c279e7185381d6ac600154793def0367a5df56f267038d39b0"},
+    {file = "json_repair-0.21.0.tar.gz", hash = "sha256:6df5b381b08a0cc386aefd4ddeabdb071f22345101d64ca2b34cbb32dfdf2eec"},
 ]

 [[package]]
@@ -2627,13 +2627,13 @@ types-tqdm = "*"

 [[package]]
 name = "litellm"
-version = "1.38.10"
+version = "1.39.3"
 description = "Library to easily interface with LLM API providers"
 optional = false
 python-versions = "!=2.7.*,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,!=3.6.*,!=3.7.*,>=3.8"
 files = [
-    {file = "litellm-1.38.10-py3-none-any.whl", hash = "sha256:4d33465eacde566832b9d7aa7677476e61aa7ba4ec26631fb1c8411c87219ed1"},
-    {file = "litellm-1.38.10.tar.gz", hash = "sha256:1a0b3088fe4b072f367343a7d7d25e4c5f9990975d9ee7dbf21f3b25ff046bb0"},
+    {file = "litellm-1.39.3-py3-none-any.whl", hash = "sha256:ac2769499b2d57091d49d0c9524d3368de9355075a3898f71448fa442b01c429"},
+    {file = "litellm-1.39.3.tar.gz", hash = "sha256:0c78d7bb03b077fa4e5a87fca85e7b2d448440da362f86c0b15fdde754d0468e"},
 ]

 [package.dependencies]
@@ -2770,13 +2770,13 @@ llama-index-llms-azure-openai = ">=0.1.3,<0.2.0"

 [[package]]
 name = "llama-index-embeddings-huggingface"
-version = "0.2.0"
+version = "0.2.1"
 description = "llama-index embeddings huggingface integration"
 optional = false
 python-versions = "<4.0,>=3.8.1"
 files = [
-    {file = "llama_index_embeddings_huggingface-0.2.0-py3-none-any.whl", hash = "sha256:e8beb7cbdea36bcee26a0282809f8329b0c55b2b4949a590a8da0f348aac066e"},
-    {file = "llama_index_embeddings_huggingface-0.2.0.tar.gz", hash = "sha256:dcf0a99455f37c4e1a2fdd5cd65c9dd1a451bb868c3f80c335c4d0c9b69d0071"},
+    {file = "llama_index_embeddings_huggingface-0.2.1-py3-none-any.whl", hash = "sha256:326468966e269acc7fbc77cad4f65ec061133ea91b0063fe181e72d01a6a8511"},
+    {file = "llama_index_embeddings_huggingface-0.2.1.tar.gz", hash = "sha256:bac68a13ad5131a055da3ef174cca70e15230426eec7d471b372e81e8489d888"},
 ]

 [package.dependencies]
@@ -4054,13 +4054,13 @@ sympy = "*"

 [[package]]
 name = "openai"
-version = "1.30.1"
+version = "1.30.5"
 description = "The official Python library for the openai API"
 optional = false
 python-versions = ">=3.7.1"
 files = [
-    {file = "openai-1.30.1-py3-none-any.whl", hash = "sha256:c9fb3c3545c118bbce8deb824397b9433a66d0d0ede6a96f7009c95b76de4a46"},
-    {file = "openai-1.30.1.tar.gz", hash = "sha256:4f85190e577cba0b066e1950b8eb9b11d25bc7ebcc43a86b326ce1bfa564ec74"},
+    {file = "openai-1.30.5-py3-none-any.whl", hash = "sha256:2ad95e926de0d2e09cde632a9204b0a6dca4a03c2cdcc84329b01f355784355a"},
+    {file = "openai-1.30.5.tar.gz", hash = "sha256:5366562eb2c5917e6116ae0391b7ae6e3acd62b0ae3f565ada32b35d8fcfa106"},
 ]

 [package.dependencies]
@@ -5665,28 +5665,28 @@ pyasn1 = ">=0.1.3"

 [[package]]
 name = "ruff"
-version = "0.4.5"
+version = "0.4.6"
 description = "An extremely fast Python linter and code formatter, written in Rust."
 optional = false
 python-versions = ">=3.7"
 files = [
-    {file = "ruff-0.4.5-py3-none-macosx_10_12_x86_64.whl", hash = "sha256:8f58e615dec58b1a6b291769b559e12fdffb53cc4187160a2fc83250eaf54e96"},
-    {file = "ruff-0.4.5-py3-none-macosx_11_0_arm64.whl", hash = "sha256:84dd157474e16e3a82745d2afa1016c17d27cb5d52b12e3d45d418bcc6d49264"},
-    {file = "ruff-0.4.5-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:25f483ad9d50b00e7fd577f6d0305aa18494c6af139bce7319c68a17180087f4"},
-    {file = "ruff-0.4.5-py3-none-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:63fde3bf6f3ad4e990357af1d30e8ba2730860a954ea9282c95fc0846f5f64af"},
-    {file = "ruff-0.4.5-py3-none-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:78e3ba4620dee27f76bbcad97067766026c918ba0f2d035c2fc25cbdd04d9c97"},
-    {file = "ruff-0.4.5-py3-none-manylinux_2_17_ppc64.manylinux2014_ppc64.whl", hash = "sha256:441dab55c568e38d02bbda68a926a3d0b54f5510095c9de7f95e47a39e0168aa"},
-    {file = "ruff-0.4.5-py3-none-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:1169e47e9c4136c997f08f9857ae889d614c5035d87d38fda9b44b4338909cdf"},
-    {file = "ruff-0.4.5-py3-none-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:755ac9ac2598a941512fc36a9070a13c88d72ff874a9781493eb237ab02d75df"},
-    {file = "ruff-0.4.5-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f4b02a65985be2b34b170025a8b92449088ce61e33e69956ce4d316c0fe7cce0"},
-    {file = "ruff-0.4.5-py3-none-musllinux_1_2_aarch64.whl", hash = "sha256:75a426506a183d9201e7e5664de3f6b414ad3850d7625764106f7b6d0486f0a1"},
-    {file = "ruff-0.4.5-py3-none-musllinux_1_2_armv7l.whl", hash = "sha256:6e1b139b45e2911419044237d90b60e472f57285950e1492c757dfc88259bb06"},
-    {file = "ruff-0.4.5-py3-none-musllinux_1_2_i686.whl", hash = "sha256:a6f29a8221d2e3d85ff0c7b4371c0e37b39c87732c969b4d90f3dad2e721c5b1"},
-    {file = "ruff-0.4.5-py3-none-musllinux_1_2_x86_64.whl", hash = "sha256:d6ef817124d72b54cc923f3444828ba24fa45c3164bc9e8f1813db2f3d3a8a11"},
-    {file = "ruff-0.4.5-py3-none-win32.whl", hash = "sha256:aed8166c18b1a169a5d3ec28a49b43340949e400665555b51ee06f22813ef062"},
-    {file = "ruff-0.4.5-py3-none-win_amd64.whl", hash = "sha256:b0b03c619d2b4350b4a27e34fd2ac64d0dabe1afbf43de57d0f9d8a05ecffa45"},
-    {file = "ruff-0.4.5-py3-none-win_arm64.whl", hash = "sha256:9d15de3425f53161b3f5a5658d4522e4eee5ea002bf2ac7aa380743dd9ad5fba"},
-    {file = "ruff-0.4.5.tar.gz", hash = "sha256:286eabd47e7d4d521d199cab84deca135557e6d1e0f0d01c29e757c3cb151b54"},
+    {file = "ruff-0.4.6-py3-none-macosx_10_12_x86_64.whl", hash = "sha256:ef995583a038cd4a7edf1422c9e19118e2511b8ba0b015861b4abd26ec5367c5"},
+    {file = "ruff-0.4.6-py3-none-macosx_11_0_arm64.whl", hash = "sha256:602ebd7ad909eab6e7da65d3c091547781bb06f5f826974a53dbe563d357e53c"},
+    {file = "ruff-0.4.6-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:3f9ced5cbb7510fd7525448eeb204e0a22cabb6e99a3cb160272262817d49786"},
+    {file = "ruff-0.4.6-py3-none-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:04a80acfc862e0e1630c8b738e70dcca03f350bad9e106968a8108379e12b31f"},
+    {file = "ruff-0.4.6-py3-none-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:be47700ecb004dfa3fd4dcdddf7322d4e632de3c06cd05329d69c45c0280e618"},
+    {file = "ruff-0.4.6-py3-none-manylinux_2_17_ppc64.manylinux2014_ppc64.whl", hash = "sha256:1ff930d6e05f444090a0139e4e13e1e2e1f02bd51bb4547734823c760c621e79"},
+    {file = "ruff-0.4.6-py3-none-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:f13410aabd3b5776f9c5699f42b37a3a348d65498c4310589bc6e5c548dc8a2f"},
+    {file = "ruff-0.4.6-py3-none-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:0cf5cc02d3ae52dfb0c8a946eb7a1d6ffe4d91846ffc8ce388baa8f627e3bd50"},
+    {file = "ruff-0.4.6-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ea3424793c29906407e3cf417f28fc33f689dacbbadfb52b7e9a809dd535dcef"},
+    {file = "ruff-0.4.6-py3-none-musllinux_1_2_aarch64.whl", hash = "sha256:1fa8561489fadf483ffbb091ea94b9c39a00ed63efacd426aae2f197a45e67fc"},
+    {file = "ruff-0.4.6-py3-none-musllinux_1_2_armv7l.whl", hash = "sha256:4d5b914818d8047270308fe3e85d9d7f4a31ec86c6475c9f418fbd1624d198e0"},
+    {file = "ruff-0.4.6-py3-none-musllinux_1_2_i686.whl", hash = "sha256:4f02284335c766678778475e7698b7ab83abaf2f9ff0554a07b6f28df3b5c259"},
+    {file = "ruff-0.4.6-py3-none-musllinux_1_2_x86_64.whl", hash = "sha256:3a6a0a4f4b5f54fff7c860010ab3dd81425445e37d35701a965c0248819dde7a"},
+    {file = "ruff-0.4.6-py3-none-win32.whl", hash = "sha256:9018bf59b3aa8ad4fba2b1dc0299a6e4e60a4c3bc62bbeaea222679865453062"},
+    {file = "ruff-0.4.6-py3-none-win_amd64.whl", hash = "sha256:a769ae07ac74ff1a019d6bd529426427c3e30d75bdf1e08bb3d46ac8f417326a"},
+    {file = "ruff-0.4.6-py3-none-win_arm64.whl", hash = "sha256:735a16407a1a8f58e4c5b913ad6102722e80b562dd17acb88887685ff6f20cf6"},
+    {file = "ruff-0.4.6.tar.gz", hash = "sha256:a797a87da50603f71e6d0765282098245aca6e3b94b7c17473115167d8dfb0b7"},
 ]

 [[package]]
@@ -6821,13 +6821,13 @@ zstd = ["zstandard (>=0.18.0)"]

 [[package]]
 name = "uvicorn"
-version = "0.29.0"
+version = "0.30.0"
 description = "The lightning-fast ASGI server."
 optional = false
 python-versions = ">=3.8"
 files = [
-    {file = "uvicorn-0.29.0-py3-none-any.whl", hash = "sha256:2c2aac7ff4f4365c206fd773a39bf4ebd1047c238f8b8268ad996829323473de"},
-    {file = "uvicorn-0.29.0.tar.gz", hash = "sha256:6a69214c0b6a087462412670b3ef21224fa48cae0e452b5883e8e8bdfdd11dd0"},
+    {file = "uvicorn-0.30.0-py3-none-any.whl", hash = "sha256:78fa0b5f56abb8562024a59041caeb555c86e48d0efdd23c3fe7de7a4075bdab"},
+    {file = "uvicorn-0.30.0.tar.gz", hash = "sha256:f678dec4fa3a39706bbf49b9ec5fc40049d42418716cea52b53f07828a60aa37"},
 ]

 [package.dependencies]
@@ -7552,4 +7552,4 @@ testing = ["coverage (>=5.0.3)", "zope.event", "zope.testing"]
 [metadata]
 lock-version = "2.0"
 python-versions = "^3.11"
-content-hash = "05410bbac602e5b5a91986d9f58c06bab86f63a87ffa62f5e52de94b472a1910"
+content-hash = "3f55a686a38bee8dc0cf22e301e40c8103698ff0b9e1f4217db55a1dbd993762"
@@ -22,7 +22,7 @@ uvicorn = "*"
 types-toml = "*"
 numpy = "*"
 json-repair = "*"
-browsergym = "0.3.2" # integrate browsergym as the browsing interface
+browsergym = "0.3.4" # integrate browsergym as the browsing interface
 html2text = "*"
 e2b = "^0.17.0"
 pexpect = "*"
@@ -44,7 +44,7 @@ llama-index-embeddings-azure-openai = "*"
 llama-index-embeddings-ollama = "*"

 [tool.poetry.group.dev.dependencies]
-ruff = "0.4.5"
+ruff = "0.4.6"
 mypy = "1.10.0"
 pre-commit = "3.7.1"

@@ -42,6 +42,8 @@ where `conftest.py` defines the infrastructure needed to load real-world LLM pro
 and responses for mocking purpose. Prompts and responses generated during real runs
 of agents with real LLMs are stored under `mock/AgentName/TestName` folders.

+**Note:** Set PERSIST_SANDBOX=false to use a clean sandbox for each test.
+
 ## Run Integration Tests

 Take a look at `run-integration-tests.yml` to learn how integration tests are
@@ -2,6 +2,8 @@ import io
 import os
 import re
 import sys
+import tempfile
+import subprocess
 from functools import partial
 from http.server import HTTPServer, SimpleHTTPRequestHandler
 from threading import Thread
@@ -81,14 +83,24 @@ def get_mock_response(test_name: str, messages: str, id: int) -> str:
            # print the mismatched lines
            print('Mismatched Prompt File path', prompt_file_path)
            print('---' * 10)
-            print(messages)
+            # Create a temporary file to store messages
+            with tempfile.NamedTemporaryFile(delete=False, mode='w', encoding='utf-8') as tmp_file:
+                tmp_file_path = tmp_file.name
+                tmp_file.write(messages)
+
+            try:
+                # Use diff command to compare files and capture the output
+                result = subprocess.run(['diff', '-u', prompt_file_path, tmp_file_path], capture_output=True, text=True)
+                if result.returncode != 0:
+                    print('Diff:')
+                    print(result.stdout)
+                else:
+                    print('No differences found.')
+            finally:
+                # Clean up the temporary file
+                os.remove(tmp_file_path)
+
            print('---' * 10)
-            for i, (c1, c2) in enumerate(zip(file_content, prompt)):
-                if c1 != c2:
-                    print(
-                        f'Mismatch at index {i}: {c1[max(0,i-100):i+100]} vs {c2[max(0,i-100):i+100]}'
-                    )
-                    break


 def mock_user_response(*args, test_name, **kwargs):
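
Reviewer note on the hunk above: the new mismatch report writes `messages` to a temporary file and shells out to `diff -u`. The sketch below is a hypothetical, portable alternative built on the standard-library `difflib`; it is not part of this change. It would avoid depending on an external `diff` binary and on the temporary file. The helper name `format_prompt_diff` and its parameters are illustrative assumptions, not identifiers from the repository.

```python
import difflib


def format_prompt_diff(
    expected: str,
    actual: str,
    expected_name: str = "expected",
    actual_name: str = "actual",
) -> str:
    """Return a unified diff between two prompt strings ('' if identical).

    Hypothetical helper: the names here are assumptions for illustration.
    """
    diff_lines = difflib.unified_diff(
        expected.splitlines(keepends=True),
        actual.splitlines(keepends=True),
        fromfile=expected_name,
        tofile=actual_name,
    )
    return "".join(diff_lines)


if __name__ == "__main__":
    # Stand-ins for the stored prompt file content and the live messages.
    stored_prompt = "This is a stupid typoo.\nReally?\n"
    live_messages = "This is a stupid typo.\nReally?\n"

    report = format_prompt_diff(stored_prompt, live_messages)
    if report:
        print("Diff:")
        print(report)
    else:
        print("No differences found.")
```

With a helper of this shape, the mismatch branch could call it on the stored prompt content and `messages` directly, so no cleanup in a `finally` block would be needed.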
@@ -0,0 +1,86 @@
+
+
+----------
+
+# Task
+You are a software architect. Your team has inherited an existing codebase, and
+need to finish a project:
+
+Fix typos in bad.txt. Do not ask me for confirmation at any point.
+
+As an architect, you need to study the codebase to find all the information that
+might be helpful for your software engineering team.
+
+## Available Actions
+* `run` - runs a command on the command line in a Linux shell. Arguments:
+  * `command` - the command to run
+  * `background` - if true, run the command in the background, so that other commands can be run concurrently. Useful for e.g. starting a server. You won't be able to see the logs. You don't need to end the command with `&`, just set this to true.
+
+* `read` - reads the content of a file. Arguments:
+  * `path` - the path of the file to read
+
+* `message` - make a plan, set a goal, record your thoughts, or ask for more input from the user. Arguments:
+  * `content` - the thought to record
+  * `wait_for_response` - set to `true` to wait for the user to respond before proceeding
+
+* `finish` - if you're absolutely certain that you've completed your task, use the finish action to stop working. Arguments:
+  * `outputs` - a dictionary representing the outputs of your task, if any
+
+
+You must ONLY `run` commands that have no side-effects, like `ls` and `grep`. You
+MUST NOT modify or write to any file.
+
+Do NOT finish until you have a complete understanding of which parts of the
+codebase are relevant to the project, including particular files, functions, and classes.
+When you're done, put your summary in `outputs.summary` in the `finish` action.
+Remember, your task is to explore and study the current repository, not actually
+implement the solution. If the codebase is empty, you shoud call the `finish` action.
+
+## History
+Here is a recent history of actions you've taken in service of this plan,
+as well as observations you've made. This only includes the MOST RECENT
+actions and observations--more may have happened before that.
+They are time-ordered, with your most recent action at the bottom.
+
+[]
+
+## Format
+Your response MUST be in JSON format. It must be an object, and it must contain two fields:
+* `action`, which is one of the actions specified here
+* `args`, which is a map of key-value pairs, specifying the arguments for that action
+
+You MUST NOT include any other text besides the JSON response
+
+
+## Examples
+
+Here is an example of how you can interact with the environment for task solving:
+
+--- START OF EXAMPLE ---
+
+USER: Can you create a list of numbers from 1 to 10, and create a web page to display them at port 5000?
+
+ASSISTANT:
+{
+  "action": "run",
+  "args": {
+    "command": "ls",
+    "background": false
+  }
+}
+
+USER:
+OBSERVATION:
+[]
+
+ASSISTANT:
+{
+  "action": "finish",
+  "args": {
+    "outputs": {
+      "summary": "The codebase appears to be empty. Engineers should start everything from scratch."
+    }
+  }
+}
+
+--- END OF EXAMPLE ---
@@ -0,0 +1,86 @@
+
+
+----------
+
+# Task
+You are a software architect. Your team has inherited an existing codebase, and
+need to finish a project:
+
+Fix typos in bad.txt. Do not ask me for confirmation at any point.
+
+As an architect, you need to study the codebase to find all the information that
+might be helpful for your software engineering team.
+
+## Available Actions
+* `run` - runs a command on the command line in a Linux shell. Arguments:
+  * `command` - the command to run
+  * `background` - if true, run the command in the background, so that other commands can be run concurrently. Useful for e.g. starting a server. You won't be able to see the logs. You don't need to end the command with `&`, just set this to true.
+
+* `read` - reads the content of a file. Arguments:
+  * `path` - the path of the file to read
+
+* `message` - make a plan, set a goal, record your thoughts, or ask for more input from the user. Arguments:
+  * `content` - the thought to record
+  * `wait_for_response` - set to `true` to wait for the user to respond before proceeding
+
+* `finish` - if you're absolutely certain that you've completed your task, use the finish action to stop working. Arguments:
+  * `outputs` - a dictionary representing the outputs of your task, if any
+
+
+You must ONLY `run` commands that have no side-effects, like `ls` and `grep`. You
+MUST NOT modify or write to any file.
+
+Do NOT finish until you have a complete understanding of which parts of the
+codebase are relevant to the project, including particular files, functions, and classes.
+When you're done, put your summary in `outputs.summary` in the `finish` action.
+Remember, your task is to explore and study the current repository, not actually
+implement the solution. If the codebase is empty, you shoud call the `finish` action.
+
+## History
+Here is a recent history of actions you've taken in service of this plan,
+as well as observations you've made. This only includes the MOST RECENT
+actions and observations--more may have happened before that.
+They are time-ordered, with your most recent action at the bottom.
+
+[[{"source": "agent", "action": "run", "args": {"command": "ls", "background": false, "thought": ""}}, {"source": "agent", "observation": "run", "content": "bad.txt", "extras": {"command_id": -1, "command": "ls", "exit_code": 0}}]]
+
+## Format
+Your response MUST be in JSON format. It must be an object, and it must contain two fields:
+* `action`, which is one of the actions specified here
+* `args`, which is a map of key-value pairs, specifying the arguments for that action
+
+You MUST NOT include any other text besides the JSON response
+
+
+## Examples
+
+Here is an example of how you can interact with the environment for task solving:
+
+--- START OF EXAMPLE ---
+
+USER: Can you create a list of numbers from 1 to 10, and create a web page to display them at port 5000?
+
+ASSISTANT:
+{
+  "action": "run",
+  "args": {
+    "command": "ls",
+    "background": false
+  }
+}
+
+USER:
+OBSERVATION:
+[]
+
+ASSISTANT:
+{
+  "action": "finish",
+  "args": {
+    "outputs": {
+      "summary": "The codebase appears to be empty. Engineers should start everything from scratch."
+    }
+  }
+}
+
+--- END OF EXAMPLE ---
@@ -0,0 +1,86 @@
+
+
+----------
+
+# Task
+You are a software architect. Your team has inherited an existing codebase, and
+need to finish a project:
+
+Fix typos in bad.txt. Do not ask me for confirmation at any point.
+
+As an architect, you need to study the codebase to find all the information that
+might be helpful for your software engineering team.
+
+## Available Actions
+* `run` - runs a command on the command line in a Linux shell. Arguments:
+  * `command` - the command to run
+  * `background` - if true, run the command in the background, so that other commands can be run concurrently. Useful for e.g. starting a server. You won't be able to see the logs. You don't need to end the command with `&`, just set this to true.
+
+* `read` - reads the content of a file. Arguments:
+  * `path` - the path of the file to read
+
+* `message` - make a plan, set a goal, record your thoughts, or ask for more input from the user. Arguments:
+  * `content` - the thought to record
+  * `wait_for_response` - set to `true` to wait for the user to respond before proceeding
+
+* `finish` - if you're absolutely certain that you've completed your task, use the finish action to stop working. Arguments:
+  * `outputs` - a dictionary representing the outputs of your task, if any
+
+
+You must ONLY `run` commands that have no side-effects, like `ls` and `grep`. You
+MUST NOT modify or write to any file.
+
+Do NOT finish until you have a complete understanding of which parts of the
+codebase are relevant to the project, including particular files, functions, and classes.
+When you're done, put your summary in `outputs.summary` in the `finish` action.
+Remember, your task is to explore and study the current repository, not actually
+implement the solution. If the codebase is empty, you shoud call the `finish` action.
+
+## History
+Here is a recent history of actions you've taken in service of this plan,
+as well as observations you've made. This only includes the MOST RECENT
+actions and observations--more may have happened before that.
+They are time-ordered, with your most recent action at the bottom.
+
+[[{"source": "agent", "action": "run", "args": {"command": "ls", "background": false, "thought": ""}}, {"source": "agent", "observation": "run", "content": "bad.txt", "extras": {"command_id": -1, "command": "ls", "exit_code": 0}}], [{"source": "agent", "action": "read", "args": {"path": "bad.txt", "start": 0, "end": -1, "thought": ""}}, {"source": "agent", "observation": "read", "content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n", "extras": {"path": "bad.txt"}}]]
+
+## Format
+Your response MUST be in JSON format. It must be an object, and it must contain two fields:
+* `action`, which is one of the actions specified here
+* `args`, which is a map of key-value pairs, specifying the arguments for that action
+
+You MUST NOT include any other text besides the JSON response
+
+
+## Examples
+
+Here is an example of how you can interact with the environment for task solving:
+
+--- START OF EXAMPLE ---
+
+USER: Can you create a list of numbers from 1 to 10, and create a web page to display them at port 5000?
+
+ASSISTANT:
+{
+  "action": "run",
+  "args": {
+    "command": "ls",
+    "background": false
+  }
+}
+
+USER:
+OBSERVATION:
+[]
+
+ASSISTANT:
+{
+  "action": "finish",
+  "args": {
+    "outputs": {
+      "summary": "The codebase appears to be empty. Engineers should start everything from scratch."
+    }
+  }
+}
+
+--- END OF EXAMPLE ---
@@ -0,0 +1,59 @@
+
+
+----------
+
+# Task
+You are a software engineer. You've inherited an existing codebase, which you
+need to modify to complete this task:
+
+Fix typos in bad.txt. Do not ask me for confirmation at any point.
+
+
+Here's a summary of the codebase, as it relates to this task:
+
+The codebase contains a single file named 'bad.txt' with some typos. The content of 'bad.txt' is:
+
+This is a stupid typoo.
+Really?
+No mor typos!
+Enjoy!
+
+The engineering team needs to correct the typos in this file.
+
+
+## Available Actions
+* `run` - runs a command on the command line in a Linux shell. Arguments:
+  * `command` - the command to run
+  * `background` - if true, run the command in the background, so that other commands can be run concurrently. Useful for e.g. starting a server. You won't be able to see the logs. You don't need to end the command with `&`, just set this to true.
+
+* `write` - writes the content to a file. Arguments:
+  * `path` - the path of the file to write
+  * `content` - the content to write to the file
+
+* `read` - reads the content of a file. Arguments:
+  * `path` - the path of the file to read
+
+* `message` - make a plan, set a goal, record your thoughts, or ask for more input from the user. Arguments:
+  * `content` - the thought to record
+  * `wait_for_response` - set to `true` to wait for the user to respond before proceeding
+
+* `finish` - if you're absolutely certain that you've completed your task, use the finish action to stop working. Arguments:
+  * `outputs` - a dictionary representing the outputs of your task, if any
+
+
+Do NOT finish until you have completed the tasks.
+
+## History
+Here is a recent history of actions you've taken in service of this plan,
+as well as observations you've made. This only includes the MOST RECENT
+actions and observations--more may have happened before that.
+They are time-ordered, with your most recent action at the bottom.
+
+[]
+
+## Format
+Your response MUST be in JSON format. It must be an object, and it must contain two fields:
+* `action`, which is one of the actions specified here
+* `args`, which is a map of key-value pairs, specifying the arguments for that action
+
+You MUST NOT include any other text besides the JSON response
@@ -0,0 +1,59 @@
+
+
+----------
+
+# Task
+You are a software engineer. You've inherited an existing codebase, which you
+need to modify to complete this task:
+
+Fix typos in bad.txt. Do not ask me for confirmation at any point.
+
+
+Here's a summary of the codebase, as it relates to this task:
+
+The codebase contains a single file named 'bad.txt' with some typos. The content of 'bad.txt' is:
+
+This is a stupid typoo.
+Really?
+No mor typos!
+Enjoy!
+
+The engineering team needs to correct the typos in this file.
+
+
+## Available Actions
+* `run` - runs a command on the command line in a Linux shell. Arguments:
+  * `command` - the command to run
+  * `background` - if true, run the command in the background, so that other commands can be run concurrently. Useful for e.g. starting a server. You won't be able to see the logs. You don't need to end the command with `&`, just set this to true.
+
+* `write` - writes the content to a file. Arguments:
+  * `path` - the path of the file to write
+  * `content` - the content to write to the file
+
+* `read` - reads the content of a file. Arguments:
+  * `path` - the path of the file to read
+
+* `message` - make a plan, set a goal, record your thoughts, or ask for more input from the user. Arguments:
+  * `content` - the thought to record
+  * `wait_for_response` - set to `true` to wait for the user to respond before proceeding
+
+* `finish` - if you're absolutely certain that you've completed your task, use the finish action to stop working. Arguments:
+  * `outputs` - a dictionary representing the outputs of your task, if any
+
+
+Do NOT finish until you have completed the tasks.
+
+## History
+Here is a recent history of actions you've taken in service of this plan,
+as well as observations you've made. This only includes the MOST RECENT
+actions and observations--more may have happened before that.
+They are time-ordered, with your most recent action at the bottom.
+
+[[{"source": "agent", "action": "read", "args": {"path": "bad.txt", "start": 0, "end": -1, "thought": ""}}, {"source": "agent", "observation": "read", "content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n", "extras": {"path": "bad.txt"}}]]
+
+## Format
+Your response MUST be in JSON format. It must be an object, and it must contain two fields:
+* `action`, which is one of the actions specified here
+* `args`, which is a map of key-value pairs, specifying the arguments for that action
+
+You MUST NOT include any other text besides the JSON response
@@ -0,0 +1,59 @@
+
+
+----------
+
+# Task
+You are a software engineer. You've inherited an existing codebase, which you
+need to modify to complete this task:
+
+Fix typos in bad.txt. Do not ask me for confirmation at any point.
+
+
+Here's a summary of the codebase, as it relates to this task:
+
+The codebase contains a single file named 'bad.txt' with some typos. The content of 'bad.txt' is:
+
+This is a stupid typoo.
+Really?
+No mor typos!
+Enjoy!
+
+The engineering team needs to correct the typos in this file.
+
+
+## Available Actions
+* `run` - runs a command on the command line in a Linux shell. Arguments:
+  * `command` - the command to run
+  * `background` - if true, run the command in the background, so that other commands can be run concurrently. Useful for e.g. starting a server. You won't be able to see the logs. You don't need to end the command with `&`, just set this to true.
+
+* `write` - writes the content to a file. Arguments:
+  * `path` - the path of the file to write
+  * `content` - the content to write to the file
+
+* `read` - reads the content of a file. Arguments:
+  * `path` - the path of the file to read
+
+* `message` - make a plan, set a goal, record your thoughts, or ask for more input from the user. Arguments:
+  * `content` - the thought to record
+  * `wait_for_response` - set to `true` to wait for the user to respond before proceeding
+
+* `finish` - if you're absolutely certain that you've completed your task, use the finish action to stop working. Arguments:
+  * `outputs` - a dictionary representing the outputs of your task, if any
+
+
+Do NOT finish until you have completed the tasks.
+
+## History
+Here is a recent history of actions you've taken in service of this plan,
+as well as observations you've made. This only includes the MOST RECENT
+actions and observations--more may have happened before that.
+They are time-ordered, with your most recent action at the bottom.
+
+[[{"source": "agent", "action": "read", "args": {"path": "bad.txt", "start": 0, "end": -1, "thought": ""}}, {"source": "agent", "observation": "read", "content": "This is a stupid typoo.\nReally?\nNo mor typos!\nEnjoy!\n", "extras": {"path": "bad.txt"}}], [{"source": "agent", "action": "write", "args": {"path": "bad.txt", "content": "This is a stupid typo.\nReally?\nNo more typos!\nEnjoy!\n", "start": 0, "end": -1, "thought": ""}}, {"source": "agent", "observation": "write", "content": "", "extras": {"path": "bad.txt"}}]]
+
+## Format
+Your response MUST be in JSON format. It must be an object, and it must contain two fields:
+* `action`, which is one of the actions specified here
+* `args`, which is a map of key-value pairs, specifying the arguments for that action
+
+You MUST NOT include any other text besides the JSON response
@@ -0,0 +1,50 @@
+
+
+----------
+
+# Task
+You are a quality assurance engineer. Another engineer has made changes to the
+codebase which are supposed to solve this task:
+
+Fix typos in bad.txt. Do not ask me for confirmation at any point.
+
+Note the changes might have already been applied in-line. You should focus on
+validating if the task is solved, nothing else.
+
+## Available Actions
+* `run` - runs a command on the command line in a Linux shell. Arguments:
+  * `command` - the command to run
+  * `background` - if true, run the command in the background, so that other commands can be run concurrently. Useful for e.g. starting a server. You won't be able to see the logs. You don't need to end the command with `&`, just set this to true.
+
+* `read` - reads the content of a file. Arguments:
+  * `path` - the path of the file to read
+
+* `message` - make a plan, set a goal, record your thoughts, or ask for more input from the user. Arguments:
+  * `content` - the thought to record
+  * `wait_for_response` - set to `true` to wait for the user to respond before proceeding
+
+* `finish` - if you're absolutely certain that you've completed your task, use the finish action to stop working. Arguments:
+  * `outputs` - a dictionary representing the outputs of your task, if any
+
+
+You must ONLY `run` commands that have no side-effects, like `ls`, `grep`, and test scripts.
+
+Do NOT finish until you know whether the task is complete and correct.
+When you're done, add a `completed` boolean to the `outputs` of the `finish` action.
+If `completed` is `false`, you MUST also provide a `summary` in the `outputs` of the `finish` action
+explaining what the problem is.
+
+## History
+Here is a recent history of actions you've taken in service of this plan,
+as well as observations you've made. This only includes the MOST RECENT
+actions and observations--more may have happened before that.
+They are time-ordered, with your most recent action at the bottom.
+
+[]
+
+## Format
+Your response MUST be in JSON format. It must be an object, and it must contain two fields:
+* `action`, which is one of the actions specified here
+* `args`, which is a map of key-value pairs, specifying the arguments for that action
+
+You MUST NOT include any other text besides the JSON response