Compare commits

...

88 Commits

Author SHA1 Message Date
Xingyao Wang
11a2d1682d Minor SWE-Bench inference config tweak (#2381)
* save infer logs to infer_logs

* set max budget for swebench eval
2024-06-10 20:14:22 +00:00
tobitege
e4145aef66 avoid repeat logging of unneeded messages (#2380) 2024-06-10 20:08:09 +00:00
Xingyao Wang
a6ba6c5277 Add SWEBench-docker eval (#2085)
* add initial version of swebench-docker eval

* update the branch of git repo

* add poetry run

* download dev set too and pre-load f2p and p2p

* update eval infer script

* increase timeout

* add poetry run

* install swebench from our fork

* update script

* update loc

* support single instance debug

* replace \r\n from model patch

* replace eval docker from namespace xingyaoww

* update script to auto detect swe-bench format jsonl

* support eval infer on single instance id

* change log output dir to logs

* update summarise result script

* update README

* update readme

* tweak branch

* Update evaluation/swe_bench/scripts/eval/prep_eval.sh

Co-authored-by: Graham Neubig <neubig@gmail.com>

---------

Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-06-10 19:30:40 +00:00
tobitege
9605106e72 feat: append_file incl. all tests [agentskills] (#2346)
* new skill: append_file incl. all tests

* more tests needed caring

* file_name for append_file/edit_file; updated tests
2024-06-10 17:18:40 +00:00
dependabot[bot]
a5f5bc30b4 chore(deps): bump @vitejs/plugin-react from 4.3.0 to 4.3.1 in /frontend (#2371) 2024-06-11 00:32:10 +08:00
Yufan Song
f4cb192ebe Fix llm key leaks bug (#2376)
* fix bug

* fix bug

* add
2024-06-10 15:55:33 +00:00
dependabot[bot]
c633d41091 chore(deps-dev): bump llama-index-vector-stores-chroma (#2375)
Bumps llama-index-vector-stores-chroma from 0.1.8 to 0.1.9.

---
updated-dependencies:
- dependency-name: llama-index-vector-stores-chroma
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-10 15:39:42 +00:00
dependabot[bot]
e2512b43b6 chore(deps-dev): bump llama-index-embeddings-azure-openai (#2374)
Bumps llama-index-embeddings-azure-openai from 0.1.9 to 0.1.10.

---
updated-dependencies:
- dependency-name: llama-index-embeddings-azure-openai
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-10 15:39:31 +00:00
dependabot[bot]
090046b2e6 chore(deps-dev): bump openai from 1.32.0 to 1.33.0 (#2373)
Bumps [openai](https://github.com/openai/openai-python) from 1.32.0 to 1.33.0.
- [Release notes](https://github.com/openai/openai-python/releases)
- [Changelog](https://github.com/openai/openai-python/blob/main/CHANGELOG.md)
- [Commits](https://github.com/openai/openai-python/compare/v1.32.0...v1.33.0)

---
updated-dependencies:
- dependency-name: openai
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-10 15:32:52 +00:00
dependabot[bot]
f29a2704f2 chore(deps): bump boto3 from 1.34.121 to 1.34.122 (#2372)
Bumps [boto3](https://github.com/boto/boto3) from 1.34.121 to 1.34.122.
- [Release notes](https://github.com/boto/boto3/releases)
- [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst)
- [Commits](https://github.com/boto/boto3/compare/1.34.121...1.34.122)

---
updated-dependencies:
- dependency-name: boto3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-10 15:25:38 +00:00
dependabot[bot]
a3062ba4d0 chore(deps): bump litellm from 1.40.4 to 1.40.7 (#2370)
Bumps [litellm](https://github.com/BerriAI/litellm) from 1.40.4 to 1.40.7.
- [Release notes](https://github.com/BerriAI/litellm/releases)
- [Commits](https://github.com/BerriAI/litellm/compare/v1.40.4...v1.40.7)

---
updated-dependencies:
- dependency-name: litellm
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-10 15:22:30 +00:00
tobitege
f1760f3a67 remove some MonologueAgent mentions (#2364) 2024-06-10 11:57:37 +00:00
Yufan Song
f7491bd2fa Refactor response to action in agent step (#2350)
* refactor action parser

* Fix typos

* fix typo

---------

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-06-10 10:17:30 +00:00
Robert
7fc57650f3 BioCoder integration (#2076)
* prepare execution and inference

* Create README.md

* Update README.md

* Update evaluation/biocoder/README.md

* Update evaluation/swe_bench/swe_env_box.py

* switch to biocoder docker container and test-specific code

* code for copying and running test files into container

* add metrics

* add readme

* Biocoder evaluation code finished (rewrite testing infrastructure, prompt tuning, and bug fixes)

* Update README.md

---------

Co-authored-by: lilbillybiscuit <qianbill2014@outlook.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
2024-06-10 11:11:40 +08:00
Boxuan Li
91ddd93756 conftest: Exit without revealing secrets (#2351) 2024-06-10 10:47:31 +08:00
மனோஜ்குமார் பழனிச்சாமி
003b599dd0 Issues Category Update: Removed Question Type (#2345)
We've removed the "Question" type from the Issues category to streamline our issue-tracking process. This change will help us focus on actionable issues and feature requests. If you have any questions or discussions, please use the Discussions tab. This is better suited for community engagement, sharing knowledge, and getting help from other contributors.
2024-06-09 21:14:56 -04:00
tobitege
41344f0dfe remove backtick handling from run_ipython (#2347) 2024-06-09 22:53:06 +00:00
RainRat
745ae42a72 fix typos (#2352) 2024-06-09 12:57:58 -07:00
Serg Kryvonos
a400e94971 Parameterize Python version (#2348) 2024-06-09 17:29:37 +00:00
Temo
e925cefeef Refactored prompt.py to reduce token usage (#1996)
* Refactored prompt.py to reduce token usage

* Reverted some destructive changes

* Update agenthub/codeact_agent/prompt.py

* Update agenthub/codeact_agent/prompt.py

* Update agenthub/codeact_agent/prompt.py

* Update agenthub/codeact_agent/prompt.py

* Update agenthub/codeact_agent/prompt.py

* Update agenthub/codeact_agent/prompt.py

* Update agenthub/codeact_agent/prompt.py

* Apply suggestions from code review

* Apply suggestions from code review

* Update agenthub/codeact_agent/prompt.py

* fix integration test

* make lint

* feat: support ToolQA benchmark (#2263)

* Add files via upload

* Update README.md

* Update run_infer.py

* Update utils.py

* make lint

* Update evaluation/toolqa/run_infer.py

---------

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* feat: revert hiden special paths change in file action (#2328)

* revert change in file action

* remove useless code

* make lint

* Support gpqa benchmark evaluation (#2080)

* feat: add gpqa benchmark evaluation

* add metrics

* reset configs in final block

* make lint

---------

Co-authored-by: yufansong <yufan@risingwave-labs.com>

* fix(frontend): prevent API key from resetting after modal change (#2329)

* remove bottom chatbox fade

* Modal wider; fix lint error

* settings: attempt to not clear api key for same provider

* prevent api key from resetting after changing the model

* revert other changes and fix post test tear down error

---------

Co-authored-by: amanape <83104063+amanape@users.noreply.github.com>

* fix: codeact bug [If running a command that never returns, it gets stuck #1895] (#2034)

* fix: codeact bug https://github.com/OpenDevin/OpenDevin/issues/1895

* fix: add CmdRunAction timeout hint.

* Update agenthub/codeact_agent/prompt.py

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>

* regenerate integration test

---------

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>

* Feat: Support Gorilla APIBench  (#2081)

* removed unused files from gorilla

* Update run_infer.py, removed unused imports

* Update utils.py

* Update ast_eval_hf.py

* Update ast_eval_tf.py

* Update ast_eval_th.py

* Create README.md

* Update run_infer.py

* make lint

* Update run_infer.py

* fix lint

---------

Co-authored-by: yufansong <yufan@risingwave-labs.com>

* remote useless (#2332)

* fix integration test

* Update agenthub/codeact_agent/prompt.py

* Update agenthub/codeact_agent/prompt.py

* fix integration test

---------

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
Co-authored-by: Frank Xu <frankxu2004@gmail.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
Co-authored-by: yueqis <141804823+yueqis@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
Co-authored-by: Jaskirat Singh <1.jaskiratsingh@gmail.com>
Co-authored-by: tobitege <tobitege@gmx.de>
Co-authored-by: amanape <83104063+amanape@users.noreply.github.com>
Co-authored-by: Aaron Xia <zhhuaxia@gmail.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-06-09 10:19:05 -07:00
Bibek Poudel
221a4e83f1 doc: Added citation subsection in README (#2339)
* added citation in readme

* minor change to date format

* Update README.md

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>

---------

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2024-06-09 14:05:35 +00:00
Frank Xu
bd00f0f049 Restore previous browsing agent behavior when evaluating on WebArena and miniwob++ only (#2341)
* restore eval mode

* fix
2024-06-09 04:10:02 -04:00
Engel Nyst
fab8c9003b remove deprecated github-token config (#2334)
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2024-06-09 09:50:24 +02:00
மனோஜ்குமார் பழனிச்சாமி
e0ad289483 Downgraded Python version to 3.12.3 (#2331)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-06-09 11:54:30 +05:30
Boxuan Li
a9a2f10170 Revamp AgentRejectAction and allow ManagerAgent to handle rejection (#1735)
* Fix AgentRejectAction handling

* Add ManagerAgent to integration tests

* Fix regenerate.sh

* Fix merge

* Update README for micro-agents

* Add test reject to regenerate.sh

* regenerate.sh: Add support for running a specific test and/or agent

* Refine reject schema, and allow ManagerAgent to handle reject

* Add test artifacts for test_simple_task_rejection

* Fix manager agent tests

* Fix README

* test_simple_task_rejection: check final agent state

* Integration test: exit if mock prompt not found

* Update test_simple_task_rejection tests

* Fix test_edits test artifacts after prompt update

* Fix ManagerAgent test_edits

* WIP

* Fix tests

* update test_edits for ManagerAgent

* Skip local sandbox for reject test

* Fix test comparison
2024-06-08 23:12:30 -07:00
tobitege
c062468dcf fix: warning about zope-interface (pyproject) (#2335) 2024-06-08 22:51:55 +00:00
tobitege
a97d0767e9 fix: Backticks get always escaped by runtime; add Ipython test (#2321)
* added tests related to backticks

* updated .gitignore

* added extra linter test for #2210

* hotfix for integration test

* added test_ipython unit test

* added test_ipython unit test

* remove draft test from test_ipython.py

---------

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-06-08 21:02:27 +00:00
Yufan Song
1bdf8752e6 remote useless (#2332) 2024-06-08 19:04:43 +00:00
yueqis
68d9ad61cf Feat: Support Gorilla APIBench (#2081)
* removed unused files from gorilla

* Update run_infer.py, removed unused imports

* Update utils.py

* Update ast_eval_hf.py

* Update ast_eval_tf.py

* Update ast_eval_th.py

* Create README.md

* Update run_infer.py

* make lint

* Update run_infer.py

* fix lint

---------

Co-authored-by: yufansong <yufan@risingwave-labs.com>
2024-06-08 16:54:54 +00:00
Aaron Xia
b5a17efc45 fix: codeact bug [If running a command that never returns, it gets stuck #1895] (#2034)
* fix: codeact bug https://github.com/OpenDevin/OpenDevin/issues/1895

* fix: add CmdRunAction timeout hint.

* Update agenthub/codeact_agent/prompt.py

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>

* regenerate integration test

---------

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
2024-06-08 16:40:23 +00:00
tobitege
a8c6fd0d42 fix(frontend): prevent API key from resetting after modal change (#2329)
* remove bottom chatbox fade

* Modal wider; fix lint error

* settings: attempt to not clear api key for same provider

* prevent api key from resetting after changing the model

* revert other changes and fix post test tear down error

---------

Co-authored-by: amanape <83104063+amanape@users.noreply.github.com>
2024-06-08 16:27:43 +00:00
Jaskirat Singh
e8307608c2 Support gpqa benchmark evaluation (#2080)
* feat: add gpqa benchmark evaluation

* add metrics

* reset configs in final block

* make lint

---------

Co-authored-by: yufansong <yufan@risingwave-labs.com>
2024-06-08 16:24:24 +00:00
Yufan Song
06a6ffcb09 feat: revert hiden special paths change in file action (#2328)
* revert change in file action

* remove useless code

* make lint
2024-06-08 12:12:52 +00:00
yueqis
82d4d25b09 feat: support ToolQA benchmark (#2263)
* Add files via upload

* Update README.md

* Update run_infer.py

* Update utils.py

* make lint

* Update evaluation/toolqa/run_infer.py

---------

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-06-08 07:54:01 -04:00
Xingyao Wang
903381f16e Add back jupyter PWD env var for agentskills (#2327)
* add back jupyter pwd env var for agentskills

* add unit test for pwd change in execute_cli
2024-06-08 08:51:42 +00:00
tobitege
c3c2b2d7b6 fix: remove bottom chatbox fade (frontend) (#2323)
* remove bottom chatbox fade

* Modal wider; fix lint error
2024-06-08 07:09:21 +00:00
tobitege
5e42f140cb fix: hide special paths; sort models (#2325) 2024-06-08 02:13:11 +00:00
dependabot[bot]
705758ac36 Bump vite from 5.2.12 to 5.2.13 in /frontend (#2315)
Bumps [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite) from 5.2.12 to 5.2.13.
- [Release notes](https://github.com/vitejs/vite/releases)
- [Changelog](https://github.com/vitejs/vite/blob/v5.2.13/packages/vite/CHANGELOG.md)
- [Commits](https://github.com/vitejs/vite/commits/v5.2.13/packages/vite)

---
updated-dependencies:
- dependency-name: vite
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-07 17:47:36 +00:00
dependabot[bot]
f1fc2c3fea Bump boto3 from 1.34.120 to 1.34.121 (#2316)
Bumps [boto3](https://github.com/boto/boto3) from 1.34.120 to 1.34.121.
- [Release notes](https://github.com/boto/boto3/releases)
- [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst)
- [Commits](https://github.com/boto/boto3/compare/1.34.120...1.34.121)

---
updated-dependencies:
- dependency-name: boto3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-06-07 19:16:55 +02:00
dependabot[bot]
7e64df8332 Bump openai from 1.31.2 to 1.32.0 (#2317)
Bumps [openai](https://github.com/openai/openai-python) from 1.31.2 to 1.32.0.
- [Release notes](https://github.com/openai/openai-python/releases)
- [Changelog](https://github.com/openai/openai-python/blob/main/CHANGELOG.md)
- [Commits](https://github.com/openai/openai-python/compare/v1.31.2...v1.32.0)

---
updated-dependencies:
- dependency-name: openai
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-06-07 19:16:44 +02:00
Yufan Song
dc94914ad7 fix dogfood (#2313) 2024-06-07 16:35:12 +00:00
tobitege
b431fce938 tests: more Agentskills tests; updated .gitignore (#2307)
* added tests related to backticks

* updated .gitignore

* added extra linter test for #2210

* hotfix for integration test

---------

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-06-07 16:29:03 +00:00
Yufan Song
6aba337416 fix (#2318) 2024-06-07 09:22:29 -07:00
Frank Xu
4455260290 [bugfix] browse actions shouldn't change url and screenshot, only observations (#2311)
* browse related actions shouldn't change url and screenshot, only the observations should

* fix linting

* fix integrat

* update integration test

---------

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2024-06-08 00:03:32 +08:00
Boxuan Li
45ce09d70e CodeActAgent: Delegate to BrowsingAgent for browsing tasks (#2103) 2024-06-07 00:53:47 -07:00
Biraj Silwal
001cc33664 fix: ExplorerActions overlapping with file name. (#2287)
* fix ExplorerActions overlapping with file name.

* Update frontend/src/components/file-explorer/FileExplorer.tsx

---------

Co-authored-by: Leo <ifuryst@gmail.com>
2024-06-07 03:30:16 +00:00
dependabot[bot]
1df9649c7e Bump tailwindcss from 3.4.3 to 3.4.4 in /frontend (#2298) 2024-06-07 09:03:03 +08:00
Mohammad Sadoughi
19788cbad8 updated Makefile setup-config to store the persist_sandbox bolean value to config.toml (#2304)
Co-authored-by: msadough <msadough@amazon.com>
2024-06-06 22:14:09 +00:00
dependabot[bot]
dea9b5c258 Bump boto3 from 1.34.119 to 1.34.120 (#2299)
Bumps [boto3](https://github.com/boto/boto3) from 1.34.119 to 1.34.120.
- [Release notes](https://github.com/boto/boto3/releases)
- [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst)
- [Commits](https://github.com/boto/boto3/compare/1.34.119...1.34.120)

---
updated-dependencies:
- dependency-name: boto3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-06 18:50:12 +02:00
dependabot[bot]
07423c9277 Bump ruff from 0.4.7 to 0.4.8 (#2297)
Bumps [ruff](https://github.com/astral-sh/ruff) from 0.4.7 to 0.4.8.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](https://github.com/astral-sh/ruff/compare/v0.4.7...v0.4.8)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-06 18:49:38 +02:00
dependabot[bot]
bb757223a2 Bump litellm from 1.40.2 to 1.40.4 (#2300)
Bumps [litellm](https://github.com/BerriAI/litellm) from 1.40.2 to 1.40.4.
- [Release notes](https://github.com/BerriAI/litellm/releases)
- [Commits](https://github.com/BerriAI/litellm/compare/v1.40.2...v1.40.4)

---
updated-dependencies:
- dependency-name: litellm
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-06 18:48:51 +02:00
dependabot[bot]
ac0c6efc82 Bump openai from 1.31.0 to 1.31.2 (#2301)
Bumps [openai](https://github.com/openai/openai-python) from 1.31.0 to 1.31.2.
- [Release notes](https://github.com/openai/openai-python/releases)
- [Changelog](https://github.com/openai/openai-python/blob/main/CHANGELOG.md)
- [Commits](https://github.com/openai/openai-python/compare/v1.31.0...v1.31.2)

---
updated-dependencies:
- dependency-name: openai
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-06 18:48:04 +02:00
tobitege
1ce4d383d3 doc: add Python keyring to Troubleshooting documentation (#2289)
* fix: set Python keyring for Poetry

* Python keyring troubleshooting added

* Revert Makefile change

* Troubleshooting extended

* setup config: added absolute path hint
2024-06-06 12:26:58 +00:00
Aleksandar
b0b19e6c25 Update AgentHubREADME.md (#2290)
Co-authored-by: sp.wack <83104063+amanape@users.noreply.github.com>
2024-06-06 11:14:41 +00:00
dependabot[bot]
08137d1968 Bump boto3 from 1.34.118 to 1.34.119 (#2280)
Bumps [boto3](https://github.com/boto/boto3) from 1.34.118 to 1.34.119.
- [Release notes](https://github.com/boto/boto3/releases)
- [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst)
- [Commits](https://github.com/boto/boto3/compare/1.34.118...1.34.119)

---
updated-dependencies:
- dependency-name: boto3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-06 14:13:41 +08:00
super-dainiu
beabcce16d [Hotfix] Fix ML-Bench continue `run_inference.py` (#2284)
* add ml-bench w/o exec env

* fix typos (#1956)

no functional change

* Refactored Logs (#1939)

* [Feat] A competitive Web Browsing agent (#1856)

* initial attempt at a browsing only agent

* add browsing agent

* update

* implement agent

* update

* fix comments

* remove unnecessary things from memory extras

* update image processing

---------

Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>

* Update README.md SWE-bench score (#1959)

* Update README.md SWE-bench score

Our most recent results on swe-bench lite are 25%, so this updates the README accordingly.

* Update

* fix: llm is_local function logic error (#1961)

Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>

* doc: update documentation about poetry update (#1962)

* add doc

* Update Development.md

---------

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* feat: add metrics related to cost for better observability (#1944)

* add metrics for total_cost

* make lint

* refact codeact

* change metrics into llm

* add costs list, add into state

* refactor log completion

* refactor and test others

* make lint

* Update opendevin/core/metrics.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Update opendevin/llm/llm.py

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>

* refactor

* add code

---------

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>

* doc: add more cmd in unit test documentation (#1963)

* --- (#1975)

updated-dependencies:
- dependency-name: boto3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* --- (#1976)

updated-dependencies:
- dependency-name: litellm
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Logging security (#1943)

* update .gitignore

* Rename the confusing 'INFO' style to 'DETAIL'

* override str and repr

* feat: api_key desensitize

* feat: add SensitiveDataFilter in file handler

* tweak regex, add tests

* more tweaks, include other attrs

* add env vars, those with equivalent config

* fix tests

* tests are invaluable

---------

Co-authored-by: Shimada666 <649940882@qq.com>

* --- (#1967)

updated-dependencies:
- dependency-name: react-dom
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: "@types/react-dom"
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* --- (#1968)

updated-dependencies:
- dependency-name: "@reduxjs/toolkit"
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* --- (#1969)

updated-dependencies:
- dependency-name: husky
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* --- (#1970)

updated-dependencies:
- dependency-name: tailwind-merge
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* --- (#1971)

updated-dependencies:
- dependency-name: i18next
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>

* Refactor session management (#1810)

* refactor session mgmt

* defer file handling to runtime

* add todo

* refactor sessions a bit more

* remove messages logic from FE

* fix up socket handshake

* refactor frontend auth a bit

* first pass at redoing file explorer

* implement directory suffix

* fix up file tree

* close agent on websocket close

* remove session saving

* move file refresh

* remove getWorkspace

* plumb path/code differently

* fix build issues

* fix the tests

* fix npm build

* add session rehydration

* fix event serialization

* logspam

* fix user message rehydration

* add get_event fn

* agent state restoration

* change history tracking for codeact

* fix responsiveness of init

* fix lint

* lint

* delint

* fix prop

* update tests

* logspam

* lint

* fix test

* revert codeact

* change fileService to use API

* fix up session loading

* delint

* delint

* fix integration tests

* revert test

* fix up access to options endpoints

* fix initial files load

* delint

* fix file initialization

* fix mock server

* fixl int

* fix auth for html

* Update frontend/src/i18n/translation.json

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>

* refactor sessions and sockets

* avoid reinitializing the same session

* fix reconnect issue

* change up intro message

* more guards on reinit

* rename agent_session

* delint

* fix a bunch of tests

* delint

* fix last test

* remove code editor context

* fix build

* fix any

* fix dot notation

* Update frontend/src/services/api.ts

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* fix up error handling

* Update opendevin/server/session/agent.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Update opendevin/server/session/agent.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Update frontend/src/services/session.ts

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* fix build errs

* fix else

* add closed state

* delint

* Update opendevin/server/session/session.py

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>

---------

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>

* fix #1960 (#1964)

* Add ruff for shared mutable defaults (B) (#1938)

* Add ruff for shared mutable defaults (B)

* Apply B006, B008 on current files, except fast API

* Update agenthub/SWE_agent/prompts.py

Co-authored-by: Graham Neubig <neubig@gmail.com>

* fix unintended behavior change

* this is correct, tell Ruff to leave it alone

---------

Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Refactor integration testing CI, add optional Mac tests, and mark a few agents as deprecated (#1888)

* Add MacOS to integration tests

* Switch back to python 3.11

* Install Docker for macos pipeline

* regenerate.sh: Use environmental variable for sandbox type

* Pack different agents' tests into a single check

* Fix CodeAct tests

* Reduce file match and extensive debug logs

* Add TEST_IN_CI mode that reports codecov

* Small fix: don't quit if reusing old responses failed

* Merge codecov results

* Fix typos

* Remove coverage merge step - codecov automatically does that

* Make mac integration tests as optional - too slow

* Fix codecov args

* Add comments in yaml

* Include sandbox type in codecov report name

* Fix codecov report merge

* Revert renaming of test_matrix_success

* Remove SWEAgent and PlannerAgent from tests

* Mark planner agent and SWE agent as deprecated

* CodeCov: Ignore planner and sweagent

* Revert "Remove SWEAgent and PlannerAgent from tests"

This reverts commit 040cb3bfb9.

* Remove all tests for SWE Agent

* Only keep basic tests for MonologueAgent and PlannerAgent

* Mark SWE Agent as deprecated, and ignore code coverage for it

---------

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>

* Fix Repeated Responses in Chat by Adding IPythonRunCellObservation (#1987)

Co-authored-by: jianghongwei <jianghongwei@58.com>
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>

* Save CI cycles for backend tests (#1985)

* Fix typo in prompt (#1992)

* Refactor monologue and SWE agent to use the messages in state history (#1863)

* Refactor monologue to use the messages in state history

* add messages, clean up

* fix monologue

* update integration tests

* move private method

* update SWE agent to use the history from State

* integration tests for SWE agent

* rename monologue to initial_thoughts, since that is what it is

* fix: catch session file not existed exception when init EventStream(maybe creating a new session with no session files stored). (#1994)

* add ml-bench in readme

* Bump boto3 from 1.34.110 to 1.34.111 (#2001)

Bumps [boto3](https://github.com/boto/boto3) from 1.34.110 to 1.34.111.
- [Release notes](https://github.com/boto/boto3/releases)
- [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst)
- [Commits](https://github.com/boto/boto3/compare/1.34.110...1.34.111)

---
updated-dependencies:
- dependency-name: boto3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump docker from 7.0.0 to 7.1.0 (#2002)

Bumps [docker](https://github.com/docker/docker-py) from 7.0.0 to 7.1.0.
- [Release notes](https://github.com/docker/docker-py/releases)
- [Commits](https://github.com/docker/docker-py/compare/7.0.0...7.1.0)

---
updated-dependencies:
- dependency-name: docker
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump litellm from 1.37.20 to 1.38.0 (#2005)

Bumps [litellm](https://github.com/BerriAI/litellm) from 1.37.20 to 1.38.0.
- [Release notes](https://github.com/BerriAI/litellm/releases)
- [Commits](https://github.com/BerriAI/litellm/compare/v1.37.20...v1.38.0)

---
updated-dependencies:
- dependency-name: litellm
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix SWE-Bench evaluation due to setuptools version (#1995)

* correctly setup plugins for swebench eval

* bump swe-bench version and add logging

* Revert "correctly setup plugins for swebench eval"

This reverts commit 2bd1055673.

* bump version

* fix session state after resuming (#1999)

* fix state resuming

* fix session reconnection

* fix lint

* Implement `agentskills` for OpenDevin to helpfully improve edit AND including more useful tools/skills (#1941)

* add draft for skills

* Implement and test agentskills functions: open_file, goto_line, scroll_down, scroll_up, create_file, search_dir, search_file, find_file

* Remove new_sample.txt file

* add some work from opendevin w/ fixes

* Add unit tests for agentskills module

* fix some issues and updated tests

* add more tests for open

* tweak and handle goto_line

* add tests for some edge cases

* add tests for scrolling

* add tests for edit

* add tests for search_dir

* update tests to use pytest

* use pytest --forked to avoid file op unit tests to interfere with each other via global var

* update doc based on swe agent tool

* update and add tests for find_file and search_file

* move agent_skills to plugins

* add agentskills as plugin and docs

* add agentskill to ssh box and fix sandbox integration

* remove extra returns in doc

* add agentskills to initial tool for jupyter

* support re-init jupyter kernel (for agentskills) after restart

* fix print window's issue with indentation and add testcases

* add prompt for codeact with the newest edit primitives

* modify the way line number is presented (remove leading space)

* change prompt to the newest display format

* support tracking of costs via metrics

* Update opendevin/runtime/plugins/agent_skills/README.md

* Update opendevin/runtime/plugins/agent_skills/README.md

* implement and add tests for py linting

* remove extra text arg for incompatible subprocess ver

* remove sample.txt

* update test_edits integration tests

* fix all integration

* Update opendevin/runtime/plugins/agent_skills/README.md

* Update opendevin/runtime/plugins/agent_skills/README.md

* Update opendevin/runtime/plugins/agent_skills/README.md

* Update agenthub/codeact_agent/prompt.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Update agenthub/codeact_agent/prompt.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Update agenthub/codeact_agent/prompt.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Update opendevin/runtime/plugins/agent_skills/agentskills.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* correctly setup plugins for swebench eval

* bump swe-bench version and add logging

* correctly setup plugins for swebench eval

* bump swe-bench version and add logging

* Revert "correctly setup plugins for swebench eval"

This reverts commit 2bd1055673.

* bump version

* remove _AGENT_SKILLS_DOCS

* move flake8 to test dep

* update poetry.lock

* remove extra arg

* reduce max iter for eval

* update poetry

* fix integration tests

---------

Co-authored-by: OpenDevin <opendevin@opendevin.ai>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* build: Add poetry command to use Python 3.11 for environment setup (#1972)

* Bump @react-types/shared from 3.23.0 to 3.23.1 in /frontend (#2006)

Bumps [@react-types/shared](https://github.com/adobe/react-spectrum) from 3.23.0 to 3.23.1.
- [Release notes](https://github.com/adobe/react-spectrum/releases)
- [Commits](https://github.com/adobe/react-spectrum/compare/@react-types/shared@3.23.0...@react-types/shared@3.23.1)

---
updated-dependencies:
- dependency-name: "@react-types/shared"
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump @types/react-syntax-highlighter in /frontend (#2007)

Bumps [@types/react-syntax-highlighter](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react-syntax-highlighter) from 15.5.11 to 15.5.13.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react-syntax-highlighter)

---
updated-dependencies:
- dependency-name: "@types/react-syntax-highlighter"
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump @typescript-eslint/parser from 7.9.0 to 7.10.0 in /frontend (#2008)

Bumps [@typescript-eslint/parser](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/parser) from 7.9.0 to 7.10.0.
- [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases)
- [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/parser/CHANGELOG.md)
- [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v7.10.0/packages/parser)

---
updated-dependencies:
- dependency-name: "@typescript-eslint/parser"
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump lint-staged from 15.2.2 to 15.2.4 in /frontend (#2009)

Bumps [lint-staged](https://github.com/okonet/lint-staged) from 15.2.2 to 15.2.4.
- [Release notes](https://github.com/okonet/lint-staged/releases)
- [Changelog](https://github.com/lint-staged/lint-staged/blob/master/CHANGELOG.md)
- [Commits](https://github.com/okonet/lint-staged/compare/v15.2.2...v15.2.4)

---
updated-dependencies:
- dependency-name: lint-staged
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update README.md

* Update README.md

* add run_infer.sh

* fix input output

* fix docker sandbox

* fix run

* update and clean run_infer.py

* add script to clean up dockers

* update repo uid

* add description

* new

* Update README.md

* use root for sandbox

* update readme

* update ml-bench conda env

* update readme

* update readme

* use try except

* modify raise exception

* add int

* update README

* longer time

* fix existing issues

* fix existing issue

* new docker image

* add metrics of cost

* add result parsing cost

* fix

* fix

* update summarize

* fix

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-31-157.ec2.internal>
Co-authored-by: RainRat <rainrat78@yahoo.ca>
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>
Co-authored-by: Frank Xu <frankxu2004@gmail.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Shimada666 <649940882@qq.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: Rahul Anand <62982824+zeul22@users.noreply.github.com>
Co-authored-by: jiangleo <jiangleo@users.noreply.github.com>
Co-authored-by: jianghongwei <jianghongwei@58.com>
Co-authored-by: Jeremi Joslin <jeremi@newlogic.com>
Co-authored-by: Aaron Xia <zhhuaxia@gmail.com>
Co-authored-by: OpenDevin <opendevin@opendevin.ai>
Co-authored-by: DaxServer <7479937+DaxServer@users.noreply.github.com>
Co-authored-by: Robert <871607149@qq.com>
2024-06-06 03:53:21 +00:00
tobitege
1fa09e0414 fix: test_sandbox tests didn't close dockers (#2274)
* fix test_sandbox tests to close dockers

* removed try/finally

---------

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-06-06 03:45:45 +00:00
Frank Xu
48151bdbb0 [feat] WebArena benchmark, MiniWoB++ benchmark and related arch changes (#2170)
* add webarena, and revamp messaging for webarena eval

* add changes for browsergym

* update infer script

* fix unit tests

* update

* add multiple run for miniwob

* update instruction, remove personal path

* update

* add code for getting final reward, fix integration, add results

* add avg cost calculation
2024-06-06 09:01:20 +08:00
dependabot[bot]
99c6333e1a Bump openai from 1.30.5 to 1.31.0 (#2283)
Bumps [openai](https://github.com/openai/openai-python) from 1.30.5 to 1.31.0.
- [Release notes](https://github.com/openai/openai-python/releases)
- [Changelog](https://github.com/openai/openai-python/blob/main/CHANGELOG.md)
- [Commits](https://github.com/openai/openai-python/compare/v1.30.5...v1.31.0)

---
updated-dependencies:
- dependency-name: openai
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-05 21:14:49 +00:00
Xingyao Wang
42d3dd8a2e Update screenshot (#2286)
* Add files via upload

* Update screenshot.png
2024-06-05 22:43:46 +02:00
மனோஜ்குமார் பழனிச்சாமி
971ad68431 Solved Hugging Face cache issue. (#2277) 2024-06-05 21:18:33 +05:30
dependabot[bot]
3bf0636a53 Bump litellm from 1.40.0 to 1.40.2 (#2282)
Bumps [litellm](https://github.com/BerriAI/litellm) from 1.40.0 to 1.40.2.
- [Release notes](https://github.com/BerriAI/litellm/releases)
- [Commits](https://github.com/BerriAI/litellm/compare/v1.40.0...v1.40.2)

---
updated-dependencies:
- dependency-name: litellm
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-05 23:46:00 +08:00
dependabot[bot]
105b5b9103 Bump json-repair from 0.21.0 to 0.23.1 (#2278)
Bumps [json-repair](https://github.com/mangiucugna/json_repair) from 0.21.0 to 0.23.1.
- [Release notes](https://github.com/mangiucugna/json_repair/releases)
- [Commits](https://github.com/mangiucugna/json_repair/compare/0.21.0...0.23.1)

---
updated-dependencies:
- dependency-name: json-repair
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-05 23:45:42 +08:00
dependabot[bot]
a4bccfc6aa Bump @types/node from 20.14.1 to 20.14.2 in /frontend (#2279)
Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 20.14.1 to 20.14.2.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node)

---
updated-dependencies:
- dependency-name: "@types/node"
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-05 23:40:53 +08:00
dependabot[bot]
4e540da85e Bump prettier from 3.3.0 to 3.3.1 in /frontend (#2281)
Bumps [prettier](https://github.com/prettier/prettier) from 3.3.0 to 3.3.1.
- [Release notes](https://github.com/prettier/prettier/releases)
- [Changelog](https://github.com/prettier/prettier/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prettier/prettier/compare/3.3.0...3.3.1)

---
updated-dependencies:
- dependency-name: prettier
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-05 23:39:41 +08:00
RainRat
3b0e1361a4 fix typos (#2267)
* fix typos

no functional change

* fix typos

* fix typos

* fix integration test

---------

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Leo <ifuryst@gmail.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
2024-06-05 23:06:40 +08:00
மனோஜ்குமார் பழனிச்சாமி
ae815b20d2 Improved logs (#2272) 2024-06-05 17:54:40 +05:30
Aaron Xia
69542c9999 fix: there maybe unexpected files in event file list, not like 1.json… (#2270)
* fix: there maybe unexpected files in event file list, not like 1.json, 2.json, but .DS_Store for macOS system.

* log

---------

Co-authored-by: sp.wack <83104063+amanape@users.noreply.github.com>
2024-06-05 17:56:39 +08:00
dependabot[bot]
95a9be2dc5 Bump @typescript-eslint/eslint-plugin from 7.11.0 to 7.12.0 in /frontend (#2260) 2024-06-05 08:10:05 +00:00
Boxuan Li
208b1461ca [AgentBench evaluation] set run_as_devin to true (#2269)
Co-authored-by: Leo <ifuryst@gmail.com>
2024-06-05 07:53:33 +00:00
dependabot[bot]
1b25a37ad4 Bump @testing-library/react from 15.0.7 to 16.0.0 in /frontend (#2227)
* Bump @testing-library/react from 15.0.7 to 16.0.0 in /frontend

Bumps [@testing-library/react](https://github.com/testing-library/react-testing-library) from 15.0.7 to 16.0.0.
- [Release notes](https://github.com/testing-library/react-testing-library/releases)
- [Changelog](https://github.com/testing-library/react-testing-library/blob/main/CHANGELOG.md)
- [Commits](https://github.com/testing-library/react-testing-library/compare/v15.0.7...v16.0.0)

---
updated-dependencies:
- dependency-name: "@testing-library/react"
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* resolve error during test teardown

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: amanape <83104063+amanape@users.noreply.github.com>
2024-06-05 07:51:58 +00:00
Ryan H. Tran
0584e428b2 [Mint evaluation] Fix bug in stopping when the agent reaches max steps or solution proposals (#2268)
* fix: bug in stopping when the agent reaches max steps or solution proposals

* remove --eval-num-workers

* update env.py
2024-06-05 06:47:07 +00:00
super-dainiu
ebafb702e5 Add ML-Bench Evaluation with OpenDevin (#2015)
* add ml-bench w/o exec env

* fix typos (#1956)

no functional change

* Refactored Logs (#1939)

* [Feat] A competitive Web Browsing agent (#1856)

* initial attempt at a browsing only agent

* add browsing agent

* update

* implement agent

* update

* fix comments

* remove unnecessary things from memory extras

* update image processing

---------

Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>

* Update README.md SWE-bench score (#1959)

* Update README.md SWE-bench score

Our most recent results on swe-bench lite are 25%, so this updates the README accordingly.

* Update

* fix: llm is_local function logic error (#1961)

Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>

* doc: update documentation about poetry update (#1962)

* add doc

* Update Development.md

---------

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* feat: add metrics related to cost for better observability (#1944)

* add metrics for total_cost

* make lint

* refact codeact

* change metrics into llm

* add costs list, add into state

* refactor log completion

* refactor and test others

* make lint

* Update opendevin/core/metrics.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Update opendevin/llm/llm.py

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>

* refactor

* add code

---------

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>

* doc: add more cmd in unit test documentation (#1963)

* --- (#1975)

updated-dependencies:
- dependency-name: boto3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* --- (#1976)

updated-dependencies:
- dependency-name: litellm
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Logging security (#1943)

* update .gitignore

* Rename the confusing 'INFO' style to 'DETAIL'

* override str and repr

* feat: api_key desensitize

* feat: add SensitiveDataFilter in file handler

* tweak regex, add tests

* more tweaks, include other attrs

* add env vars, those with equivalent config

* fix tests

* tests are invaluable

---------

Co-authored-by: Shimada666 <649940882@qq.com>

* --- (#1967)

updated-dependencies:
- dependency-name: react-dom
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: "@types/react-dom"
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* --- (#1968)

updated-dependencies:
- dependency-name: "@reduxjs/toolkit"
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* --- (#1969)

updated-dependencies:
- dependency-name: husky
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* --- (#1970)

updated-dependencies:
- dependency-name: tailwind-merge
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* --- (#1971)

updated-dependencies:
- dependency-name: i18next
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>

* Refactor session management (#1810)

* refactor session mgmt

* defer file handling to runtime

* add todo

* refactor sessions a bit more

* remove messages logic from FE

* fix up socket handshake

* refactor frontend auth a bit

* first pass at redoing file explorer

* implement directory suffix

* fix up file tree

* close agent on websocket close

* remove session saving

* move file refresh

* remove getWorkspace

* plumb path/code differently

* fix build issues

* fix the tests

* fix npm build

* add session rehydration

* fix event serialization

* logspam

* fix user message rehydration

* add get_event fn

* agent state restoration

* change history tracking for codeact

* fix responsiveness of init

* fix lint

* lint

* delint

* fix prop

* update tests

* logspam

* lint

* fix test

* revert codeact

* change fileService to use API

* fix up session loading

* delint

* delint

* fix integration tests

* revert test

* fix up access to options endpoints

* fix initial files load

* delint

* fix file initialization

* fix mock server

* fixl int

* fix auth for html

* Update frontend/src/i18n/translation.json

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>

* refactor sessions and sockets

* avoid reinitializing the same session

* fix reconnect issue

* change up intro message

* more guards on reinit

* rename agent_session

* delint

* fix a bunch of tests

* delint

* fix last test

* remove code editor context

* fix build

* fix any

* fix dot notation

* Update frontend/src/services/api.ts

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* fix up error handling

* Update opendevin/server/session/agent.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Update opendevin/server/session/agent.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Update frontend/src/services/session.ts

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* fix build errs

* fix else

* add closed state

* delint

* Update opendevin/server/session/session.py

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>

---------

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>

* fix #1960 (#1964)

* Add ruff for shared mutable defaults (B) (#1938)

* Add ruff for shared mutable defaults (B)

* Apply B006, B008 on current files, except fast API

* Update agenthub/SWE_agent/prompts.py

Co-authored-by: Graham Neubig <neubig@gmail.com>

* fix unintended behavior change

* this is correct, tell Ruff to leave it alone

---------

Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Refactor integration testing CI, add optional Mac tests, and mark a few agents as deprecated (#1888)

* Add MacOS to integration tests

* Switch back to python 3.11

* Install Docker for macos pipeline

* regenerate.sh: Use environmental variable for sandbox type

* Pack different agents' tests into a single check

* Fix CodeAct tests

* Reduce file match and extensive debug logs

* Add TEST_IN_CI mode that reports codecov

* Small fix: don't quit if reusing old responses failed

* Merge codecov results

* Fix typos

* Remove coverage merge step - codecov automatically does that

* Make mac integration tests as optional - too slow

* Fix codecov args

* Add comments in yaml

* Include sandbox type in codecov report name

* Fix codecov report merge

* Revert renaming of test_matrix_success

* Remove SWEAgent and PlannerAgent from tests

* Mark planner agent and SWE agent as deprecated

* CodeCov: Ignore planner and sweagent

* Revert "Remove SWEAgent and PlannerAgent from tests"

This reverts commit 040cb3bfb9.

* Remove all tests for SWE Agent

* Only keep basic tests for MonologueAgent and PlannerAgent

* Mark SWE Agent as deprecated, and ignore code coverage for it

---------

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>

* Fix Repeated Responses in Chat by Adding IPythonRunCellObservation (#1987)

Co-authored-by: jianghongwei <jianghongwei@58.com>
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>

* Save CI cycles for backend tests (#1985)

* Fix typo in prompt (#1992)

* Refactor monologue and SWE agent to use the messages in state history (#1863)

* Refactor monologue to use the messages in state history

* add messages, clean up

* fix monologue

* update integration tests

* move private method

* update SWE agent to use the history from State

* integration tests for SWE agent

* rename monologue to initial_thoughts, since that is what it is

* fix: catch session file not existed exception when init EventStream(maybe creating a new session with no session files stored). (#1994)

* add ml-bench in readme

* Bump boto3 from 1.34.110 to 1.34.111 (#2001)

Bumps [boto3](https://github.com/boto/boto3) from 1.34.110 to 1.34.111.
- [Release notes](https://github.com/boto/boto3/releases)
- [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst)
- [Commits](https://github.com/boto/boto3/compare/1.34.110...1.34.111)

---
updated-dependencies:
- dependency-name: boto3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump docker from 7.0.0 to 7.1.0 (#2002)

Bumps [docker](https://github.com/docker/docker-py) from 7.0.0 to 7.1.0.
- [Release notes](https://github.com/docker/docker-py/releases)
- [Commits](https://github.com/docker/docker-py/compare/7.0.0...7.1.0)

---
updated-dependencies:
- dependency-name: docker
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump litellm from 1.37.20 to 1.38.0 (#2005)

Bumps [litellm](https://github.com/BerriAI/litellm) from 1.37.20 to 1.38.0.
- [Release notes](https://github.com/BerriAI/litellm/releases)
- [Commits](https://github.com/BerriAI/litellm/compare/v1.37.20...v1.38.0)

---
updated-dependencies:
- dependency-name: litellm
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix SWE-Bench evaluation due to setuptools version (#1995)

* correctly setup plugins for swebench eval

* bump swe-bench version and add logging

* Revert "correctly setup plugins for swebench eval"

This reverts commit 2bd1055673.

* bump version

* fix session state after resuming (#1999)

* fix state resuming

* fix session reconnection

* fix lint

* Implement `agentskills` for OpenDevin to helpfully improve edit AND including more useful tools/skills (#1941)

* add draft for skills

* Implement and test agentskills functions: open_file, goto_line, scroll_down, scroll_up, create_file, search_dir, search_file, find_file

* Remove new_sample.txt file

* add some work from opendevin w/ fixes

* Add unit tests for agentskills module

* fix some issues and updated tests

* add more tests for open

* tweak and handle goto_line

* add tests for some edge cases

* add tests for scrolling

* add tests for edit

* add tests for search_dir

* update tests to use pytest

* use pytest --forked to avoid file op unit tests to interfere with each other via global var

* update doc based on swe agent tool

* update and add tests for find_file and search_file

* move agent_skills to plugins

* add agentskills as plugin and docs

* add agentskill to ssh box and fix sandbox integration

* remove extra returns in doc

* add agentskills to initial tool for jupyter

* support re-init jupyter kernel (for agentskills) after restart

* fix print window's issue with indentation and add testcases

* add prompt for codeact with the newest edit primitives

* modify the way line number is presented (remove leading space)

* change prompt to the newest display format

* support tracking of costs via metrics

* Update opendevin/runtime/plugins/agent_skills/README.md

* Update opendevin/runtime/plugins/agent_skills/README.md

* implement and add tests for py linting

* remove extra text arg for incompatible subprocess ver

* remove sample.txt

* update test_edits integration tests

* fix all integration

* Update opendevin/runtime/plugins/agent_skills/README.md

* Update opendevin/runtime/plugins/agent_skills/README.md

* Update opendevin/runtime/plugins/agent_skills/README.md

* Update agenthub/codeact_agent/prompt.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Update agenthub/codeact_agent/prompt.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Update agenthub/codeact_agent/prompt.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Update opendevin/runtime/plugins/agent_skills/agentskills.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* correctly setup plugins for swebench eval

* bump swe-bench version and add logging

* correctly setup plugins for swebench eval

* bump swe-bench version and add logging

* Revert "correctly setup plugins for swebench eval"

This reverts commit 2bd1055673.

* bump version

* remove _AGENT_SKILLS_DOCS

* move flake8 to test dep

* update poetry.lock

* remove extra arg

* reduce max iter for eval

* update poetry

* fix integration tests

---------

Co-authored-by: OpenDevin <opendevin@opendevin.ai>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* build: Add poetry command to use Python 3.11 for environment setup (#1972)

* Bump @react-types/shared from 3.23.0 to 3.23.1 in /frontend (#2006)

Bumps [@react-types/shared](https://github.com/adobe/react-spectrum) from 3.23.0 to 3.23.1.
- [Release notes](https://github.com/adobe/react-spectrum/releases)
- [Commits](https://github.com/adobe/react-spectrum/compare/@react-types/shared@3.23.0...@react-types/shared@3.23.1)

---
updated-dependencies:
- dependency-name: "@react-types/shared"
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump @types/react-syntax-highlighter in /frontend (#2007)

Bumps [@types/react-syntax-highlighter](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react-syntax-highlighter) from 15.5.11 to 15.5.13.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react-syntax-highlighter)

---
updated-dependencies:
- dependency-name: "@types/react-syntax-highlighter"
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump @typescript-eslint/parser from 7.9.0 to 7.10.0 in /frontend (#2008)

Bumps [@typescript-eslint/parser](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/parser) from 7.9.0 to 7.10.0.
- [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases)
- [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/parser/CHANGELOG.md)
- [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v7.10.0/packages/parser)

---
updated-dependencies:
- dependency-name: "@typescript-eslint/parser"
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump lint-staged from 15.2.2 to 15.2.4 in /frontend (#2009)

Bumps [lint-staged](https://github.com/okonet/lint-staged) from 15.2.2 to 15.2.4.
- [Release notes](https://github.com/okonet/lint-staged/releases)
- [Changelog](https://github.com/lint-staged/lint-staged/blob/master/CHANGELOG.md)
- [Commits](https://github.com/okonet/lint-staged/compare/v15.2.2...v15.2.4)

---
updated-dependencies:
- dependency-name: lint-staged
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update README.md

* Update README.md

* add run_infer.sh

* fix input output

* fix docker sandbox

* fix run

* update and clean run_infer.py

* add script to clean up dockers

* update repo uid

* add description

* new

* Update README.md

* use root for sandbox

* update readme

* update ml-bench conda env

* update readme

* update readme

* use try except

* modify raise exception

* add int

* update README

* longer time

* fix existing issues

* fix existing issue

* new docker image

* add metrics of cost

* add result parsing cost

* fix

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-31-157.ec2.internal>
Co-authored-by: RainRat <rainrat78@yahoo.ca>
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>
Co-authored-by: Frank Xu <frankxu2004@gmail.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Shimada666 <649940882@qq.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: Rahul Anand <62982824+zeul22@users.noreply.github.com>
Co-authored-by: jiangleo <jiangleo@users.noreply.github.com>
Co-authored-by: jianghongwei <jianghongwei@58.com>
Co-authored-by: Jeremi Joslin <jeremi@newlogic.com>
Co-authored-by: Aaron Xia <zhhuaxia@gmail.com>
Co-authored-by: OpenDevin <opendevin@opendevin.ai>
Co-authored-by: DaxServer <7479937+DaxServer@users.noreply.github.com>
Co-authored-by: Robert <871607149@qq.com>
2024-06-05 01:56:39 +00:00
Leo
040d6bd806 fix: add an early exit check for agent answers in agent bench. (#2257)
Signed-off-by: ifuryst <ifuryst@gmail.com>
2024-06-04 18:45:07 -07:00
tobitege
5776474dcf Fix SWE-Bench README typos (#2250) 2024-06-05 01:18:02 +00:00
tobitege
44bbe5e208 Fix agentskills tests (#2242)
* Fix agentskills tests

* Improved test_agent_skill

---------

Co-authored-by: Leo <ifuryst@gmail.com>
2024-06-04 21:33:32 +00:00
tobitege
0082640ac8 fix test_config to prevent leaks (#2245) 2024-06-04 21:32:46 +02:00
tobitege
7263705492 fix frontend tests; minor readme update (#2219)
* fix frontend tests; minor readme update

* Fix indent in ChatInput.test

* Fix linting errors, finally

* lint: minor fixes (per make lint)

* All tests passed!
2024-06-04 20:46:47 +03:00
dependabot[bot]
4de08a9c00 Bump @types/node from 20.14.0 to 20.14.1 in /frontend (#2258)
Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 20.14.0 to 20.14.1.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node)

---
updated-dependencies:
- dependency-name: "@types/node"
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Leo <ifuryst@gmail.com>
2024-06-04 16:09:17 +00:00
dependabot[bot]
7e3e740616 Bump jose from 5.3.0 to 5.4.0 in /frontend (#2259)
Bumps [jose](https://github.com/panva/jose) from 5.3.0 to 5.4.0.
- [Release notes](https://github.com/panva/jose/releases)
- [Changelog](https://github.com/panva/jose/blob/main/CHANGELOG.md)
- [Commits](https://github.com/panva/jose/compare/v5.3.0...v5.4.0)

---
updated-dependencies:
- dependency-name: jose
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-04 16:06:38 +00:00
மனோஜ்குமார் பழனிச்சாமி
2ffd54d258 fixed output logging (#2244)
Co-authored-by: Leo <ifuryst@gmail.com>
2024-06-04 16:05:23 +00:00
dependabot[bot]
6dd6e6c087 Bump @typescript-eslint/parser from 7.11.0 to 7.12.0 in /frontend (#2261)
Bumps [@typescript-eslint/parser](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/parser) from 7.11.0 to 7.12.0.
- [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases)
- [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/parser/CHANGELOG.md)
- [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v7.12.0/packages/parser)

---
updated-dependencies:
- dependency-name: "@typescript-eslint/parser"
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Leo <ifuryst@gmail.com>
2024-06-04 15:59:35 +00:00
dependabot[bot]
aec3e18836 Bump litellm from 1.39.5 to 1.40.0 (#2256)
Bumps [litellm](https://github.com/BerriAI/litellm) from 1.39.5 to 1.40.0.
- [Release notes](https://github.com/BerriAI/litellm/releases)
- [Commits](https://github.com/BerriAI/litellm/compare/v1.39.5...v1.40.0)

---
updated-dependencies:
- dependency-name: litellm
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-04 15:36:15 +00:00
dependabot[bot]
d85c548bf5 Bump opencv-python from 4.9.0.80 to 4.10.0.82 (#2255)
Bumps [opencv-python](https://github.com/opencv/opencv-python) from 4.9.0.80 to 4.10.0.82.
- [Release notes](https://github.com/opencv/opencv-python/releases)
- [Commits](https://github.com/opencv/opencv-python/commits)

---
updated-dependencies:
- dependency-name: opencv-python
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-04 15:08:05 +00:00
dependabot[bot]
0f60899ee0 Bump google-generativeai from 0.5.4 to 0.6.0 (#2254)
Bumps [google-generativeai](https://github.com/google/generative-ai-python) from 0.5.4 to 0.6.0.
- [Release notes](https://github.com/google/generative-ai-python/releases)
- [Changelog](https://github.com/google-gemini/generative-ai-python/blob/main/RELEASE.md)
- [Commits](https://github.com/google/generative-ai-python/compare/v0.5.4...v0.6.0)

---
updated-dependencies:
- dependency-name: google-generativeai
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-04 15:06:51 +00:00
மனோஜ்குமார் பழனிச்சாமி
4e479038f9 Bugfix by added config to disable plugin initialization for Persistent sandbox (#2179)
* refactored source bashrc logic

* added initialize_plugins config

---------

Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-06-04 10:59:30 -04:00
dependabot[bot]
11b66bd733 Bump boto3 from 1.34.117 to 1.34.118 (#2253)
Bumps [boto3](https://github.com/boto/boto3) from 1.34.117 to 1.34.118.
- [Release notes](https://github.com/boto/boto3/releases)
- [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst)
- [Commits](https://github.com/boto/boto3/compare/1.34.117...1.34.118)

---
updated-dependencies:
- dependency-name: boto3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-04 14:57:58 +00:00
dependabot[bot]
62c179be6c Bump pytest from 8.2.1 to 8.2.2 (#2252)
Bumps [pytest](https://github.com/pytest-dev/pytest) from 8.2.1 to 8.2.2.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pytest-dev/pytest/compare/8.2.1...8.2.2)

---
updated-dependencies:
- dependency-name: pytest
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-04 14:55:09 +00:00
186 changed files with 8918 additions and 1439 deletions

View File

@@ -1,16 +0,0 @@
---
name: Question
about: Use this template to ask a question regarding the project.
title: ''
labels: question
assignees: ''
---
## Describe your question
<!--A clear and concise description of what you want to know.-->
## Additional context
<!--Add any other context about the question here, like what you've tried so far.-->

View File

@@ -23,7 +23,7 @@ jobs:
name: Test on macOS
runs-on: macos-13
env:
INSTALL_DOCKER: "0" # Set to '0' to skip Docker installation
INSTALL_DOCKER: "1" # Set to '0' to skip Docker installation
strategy:
matrix:
python-version: ["3.11"]

View File

@@ -35,15 +35,28 @@ jobs:
echo "" >> task.txt
echo "BODY:" >> task.txt
echo "${ISSUE_BODY}" >> task.txt
- name: Set up environment
run: |
curl -sSL https://install.python-poetry.org | python3 -
export PATH="/github/home/.local/bin:$PATH"
poetry install --without evaluation
poetry run playwright install --with-deps chromium
- name: Run OpenDevin
env:
ISSUE_TITLE: ${{ github.event.issue.title }}
ISSUE_BODY: ${{ github.event.issue.body }}
LLM_API_KEY: ${{ secrets.OPENAI_API_KEY }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
SANDBOX_TYPE: exec
run: |
WORKSPACE_MOUNT_PATH=$GITHUB_WORKSPACE python ./opendevin/core/main.py -i 50 -f task.txt -d $GITHUB_WORKSPACE
# Append path to launch poetry
export PATH="/github/home/.local/bin:$PATH"
# Append path to correctly import package, note: must set pwd at first
export PYTHONPATH=$(pwd):$PYTHONPATH
WORKSPACE_MOUNT_PATH=$GITHUB_WORKSPACE poetry run python ./opendevin/core/main.py -i 50 -f task.txt -d $GITHUB_WORKSPACE
rm task.txt
- name: Setup Git, Create Branch, and Commit Changes

12
.gitignore vendored
View File

@@ -161,9 +161,14 @@ cython_debug/
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
.vscode/
.cursorignore
# evaluation
evaluation/evaluation_outputs
evaluation/outputs
evaluation/swe_bench/eval_workspace*
evaluation/SWE-bench/data
evaluation/webarena/scripts/webarena_env.sh
# frontend
@@ -176,6 +181,8 @@ frontend/yarn.lock
# testing
frontend/coverage
test_results*
/_test_files_tmp/
# production
frontend/build
@@ -204,8 +211,3 @@ cache
# configuration
config.toml
config.toml.bak
evaluation/swe_bench/eval_workspace*
evaluation/outputs
evaluation/evaluation_outputs
test_results*
/_test_files_tmp/

View File

@@ -98,5 +98,5 @@ Please refer to [this README](./tests/integration/README.md) for details.
### 9. Add or update dependency
1. Add your dependency in `pyproject.toml` or use `peotry add xxx`
1. Add your dependency in `pyproject.toml` or use `poetry add xxx`
2. Update the poetry.lock file via `poetry lock --no-update`

View File

@@ -10,6 +10,7 @@ DEFAULT_WORKSPACE_DIR = "./workspace"
DEFAULT_MODEL = "gpt-4o"
CONFIG_FILE = config.toml
PRECOMMIT_CONFIG_PATH = "./dev_config/python/.pre-commit-config.yaml"
PYTHON_VERSION = 3.11
# ANSI color codes
GREEN=$(shell tput -Txterm setaf 2)
@@ -62,10 +63,10 @@ check-system:
check-python:
@echo "$(YELLOW)Checking Python installation...$(RESET)"
@if command -v python3.11 > /dev/null; then \
echo "$(BLUE)$(shell python3.11 --version) is already installed.$(RESET)"; \
@if command -v python$(PYTHON_VERSION) > /dev/null; then \
echo "$(BLUE)$(shell python$(PYTHON_VERSION) --version) is already installed.$(RESET)"; \
else \
echo "$(RED)Python 3.11 is not installed. Please install Python 3.11 to continue.$(RESET)"; \
echo "$(RED)Python $(PYTHON_VERSION) is not installed. Please install Python $(PYTHON_VERSION) to continue.$(RESET)"; \
exit 1; \
fi
@@ -112,13 +113,13 @@ check-poetry:
echo "$(BLUE)$(shell poetry --version) is already installed.$(RESET)"; \
else \
echo "$(RED)Poetry 1.8 or later is required. You can install poetry by running the following command, then adding Poetry to your PATH:"; \
echo "$(RED) curl -sSL https://install.python-poetry.org | python3 -$(RESET)"; \
echo "$(RED) curl -sSL https://install.python-poetry.org | python$(PYTHON_VERSION) -$(RESET)"; \
echo "$(RED)More detail here: https://python-poetry.org/docs/#installing-with-the-official-installer$(RESET)"; \
exit 1; \
fi; \
else \
echo "$(RED)Poetry is not installed. You can install poetry by running the following command, then adding Poetry to your PATH:"; \
echo "$(RED) curl -sSL https://install.python-poetry.org | python3.11 -$(RESET)"; \
echo "$(RED) curl -sSL https://install.python-poetry.org | python$(PYTHON_VERSION) -$(RESET)"; \
echo "$(RED)More detail here: https://python-poetry.org/docs/#installing-with-the-official-installer$(RESET)"; \
exit 1; \
fi
@@ -130,7 +131,7 @@ pull-docker-image:
install-python-dependencies:
@echo "$(GREEN)Installing Python dependencies...$(RESET)"
poetry env use python3.11
poetry env use python$(PYTHON_VERSION)
@if [ "$(shell uname)" = "Darwin" ]; then \
echo "$(BLUE)Installing chroma-hnswlib...$(RESET)"; \
export HNSWLIB_NO_NATIVE=1; \
@@ -229,7 +230,7 @@ setup-config:
setup-config-prompts:
@echo "[core]" > $(CONFIG_FILE).tmp
@read -p "Enter your workspace directory [default: $(DEFAULT_WORKSPACE_DIR)]: " workspace_dir; \
@read -p "Enter your workspace directory (as absolute path) [default: $(DEFAULT_WORKSPACE_DIR)]: " workspace_dir; \
workspace_dir=$${workspace_dir:-$(DEFAULT_WORKSPACE_DIR)}; \
echo "workspace_base=\"$$workspace_dir\"" >> $(CONFIG_FILE).tmp
@@ -238,6 +239,7 @@ setup-config-prompts:
if [ "$$persist_sandbox" = "true" ]; then \
read -p "Enter a password for the sandbox container: " ssh_password; \
echo "ssh_password=\"$$ssh_password\"" >> $(CONFIG_FILE).tmp; \
echo "persist_sandbox=$$persist_sandbox" >> $(CONFIG_FILE).tmp; \
else \
echo "persist_sandbox=$$persist_sandbox" >> $(CONFIG_FILE).tmp; \
fi

View File

@@ -129,3 +129,16 @@ Distributed under the MIT License. See [`LICENSE`](./LICENSE) for more informati
[issues-url]: https://github.com/OpenDevin/OpenDevin/issues
[license-shield]: https://img.shields.io/github/license/opendevin/opendevin?style=for-the-badge
[license-url]: https://github.com/OpenDevin/OpenDevin/blob/main/LICENSE
## 📚 Cite
```
@misc{opendevin2024,
author = {{OpenDevin Team}},
title = {{OpenDevin: An Open Platform for AI Software Developers as Generalist Agents}},
year = {2024},
version = {v1.0},
howpublished = {\url{https://github.com/OpenDevin/OpenDevin}},
note = {Accessed: ENTER THE DATE YOU ACCESSED THE PROJECT}
}
```

View File

@@ -2,15 +2,15 @@
In this folder, there may exist multiple implementations of `Agent` that will be used by the framework.
For example, `agenthub/monologue_agent`, `agenthub/metagpt_agent`, `agenthub/codeact_agent`, etc.
For example, `agenthub/codeact_agent`, etc.
Contributors from different backgrounds and interests can choose to contribute to any (or all!) of these directions.
## Constructing an Agent
The abstraction for an agent can be found [here](../opendevin/agent.py).
The abstraction for an agent can be found [here](../opendevin/controller/agent.py).
Agents are run inside of a loop. At each iteration, `agent.step()` is called with a
[State](../opendevin/state.py) input, and the agent must output an [Action](../opendevin/action).
[State](../opendevin/controller/state/state.py) input, and the agent must output an [Action](../opendevin/events/action).
Every agent also has a `self.llm` which it can use to interact with the LLM configured by the user.
See the [LiteLLM docs for `self.llm.completion`](https://docs.litellm.ai/docs/completion).
@@ -28,21 +28,19 @@ The `state` contains:
Here is a list of available Actions, which can be returned by `agent.step()`:
- [`CmdRunAction`](../opendevin/action/bash.py) - Runs a command inside a sandboxed terminal
- [`CmdKillAction`](../opendevin/action/bash.py) - Kills a background command
- [`IPythonRunCellAction`](../opendevin/action/bash.py) - Execute a block of Python code interactively (in Jupyter notebook) and receives `CmdOutputObservation`. Requires setting up `jupyter` [plugin](../opendevin/sandbox/plugins) as a requirement.
- [`FileReadAction`](../opendevin/action/fileop.py) - Reads the content of a file
- [`FileWriteAction`](../opendevin/action/fileop.py) - Writes new content to a file
- [`BrowseURLAction`](../opendevin/action/browse.py) - Gets the content of a URL
- [`AgentRecallAction`](../opendevin/action/agent.py) - Searches memory (e.g. a vector database)
- [`AddTaskAction`](../opendevin/action/tasks.py) - Adds a subtask to the plan
- [`ModifyTaskAction`](../opendevin/action/tasks.py) - Changes the state of a subtask
- [`AgentThinkAction`](../opendevin/action/agent.py) - A no-op that allows the agent to add plaintext to the history (as well as the chat log)
- [`AgentTalkAction`](../opendevin/action/agent.py) - A no-op that allows the agent to add plaintext to the history and talk to the user.
- [`AgentFinishAction`](../opendevin/action/agent.py) - Stops the control loop, allowing the user/delegator agent to enter a new task
- [`AgentRejectAction`](../opendevin/action/agent.py) - Stops the control loop, allowing the user/delegator agent to enter a new task
- [`AgentFinishAction`](../opendevin/action/agent.py) - Stops the control loop, allowing the user to enter a new task
- [`MessageAction`](../opendevin/action/message.py) - Represents a message from an agent or the user
- [`CmdRunAction`](../opendevin/events/action/commands.py) - Runs a command inside a sandboxed terminal
- [`CmdKillAction`](../opendevin/events/action/commands.py) - Kills a background command
- [`IPythonRunCellAction`](../opendevin/events/action/commands.py) - Execute a block of Python code interactively (in Jupyter notebook) and receives `CmdOutputObservation`. Requires setting up `jupyter` [plugin](../opendevin/runtime/plugins) as a requirement.
- [`FileReadAction`](../opendevin/events/action/files.py) - Reads the content of a file
- [`FileWriteAction`](../opendevin/events/action/files.py) - Writes new content to a file
- [`BrowseURLAction`](../opendevin/events/action/browse.py) - Gets the content of a URL
- [`AgentRecallAction`](../opendevin/events/action/agent.py) - Searches memory (e.g. a vector database)
- [`AddTaskAction`](../opendevin/events/action/tasks.py) - Adds a subtask to the plan
- [`ModifyTaskAction`](../opendevin/events/action/tasks.py) - Changes the state of a subtask.
- [`AgentFinishAction`](../opendevin/events/action/agent.py) - Stops the control loop, allowing the user/delegator agent to enter a new task
- [`AgentRejectAction`](../opendevin/events/action/agent.py) - Stops the control loop, allowing the user/delegator agent to enter a new task
- [`AgentFinishAction`](../opendevin/events/action/agent.py) - Stops the control loop, allowing the user to enter a new task
- [`MessageAction`](../opendevin/events/action/message.py) - Represents a message from an agent or the user
You can use `action.to_dict()` and `action_from_dict` to serialize and deserialize actions.
@@ -54,13 +52,13 @@ in the background).
Here is a list of available Observations:
- [`CmdOutputObservation`](../opendevin/observation/run.py)
- [`BrowserOutputObservation`](../opendevin/observation/browse.py)
- [`FileReadObservation`](../opendevin/observation/files.py)
- [`FileWriteObservation`](../opendevin/observation/files.py)
- [`AgentRecallObservation`](../opendevin/observation/recall.py)
- [`ErrorObservation`](../opendevin/observation/error.py)
- [`SuccessObservation`](../opendevin/observation/success.py)
- [`CmdOutputObservation`](../opendevin/events/observation/commands.py)
- [`BrowserOutputObservation`](../opendevin/events/observation/browse.py)
- [`FileReadObservation`](../opendevin/events/observation/files.py)
- [`FileWriteObservation`](../opendevin/events/observation/files.py)
- [`AgentRecallObservation`](../opendevin/events/observation/recall.py)
- [`ErrorObservation`](../opendevin/events/observation/error.py)
- [`SuccessObservation`](../opendevin/events/observation/success.py)
You can use `observation.to_dict()` and `observation_from_dict` to serialize and deserialize observations.

View File

@@ -1,4 +1,5 @@
import ast
import os
from browsergym.core.action.highlevel import HighLevelActionSet
from browsergym.utils.obs import flatten_axtree_to_str
@@ -12,6 +13,7 @@ from opendevin.events.action import (
BrowseInteractiveAction,
MessageAction,
)
from opendevin.events.event import EventSource
from opendevin.events.observation import BrowserOutputObservation
from opendevin.llm.llm import LLM
from opendevin.runtime.plugins import (
@@ -19,21 +21,17 @@ from opendevin.runtime.plugins import (
)
from opendevin.runtime.tools import RuntimeTool
USE_NAV = (
os.environ.get('USE_NAV', 'true') == 'true'
) # only disable NAV actions when running webarena and miniwob benchmarks
USE_CONCISE_ANSWER = (
os.environ.get('USE_CONCISE_ANSWER', 'false') == 'true'
) # only return concise answer when running webarena and miniwob benchmarks
def parse_response(response: str) -> Action:
if '```' not in response:
# unexpected response format, message back to user
return MessageAction(response)
thought = response.split('```')[0].strip()
action_str = response.split('```')[1].strip()
# handle send message to user function call in BrowserGym
for sub_action in action_str.split('\n'):
if 'send_msg_to_user(' in sub_action:
tree = ast.parse(sub_action)
args = tree.body[0].value.args # type: ignore
return MessageAction(args[0].value)
return BrowseInteractiveAction(browser_actions=action_str, thought=thought)
if not USE_NAV and USE_CONCISE_ANSWER:
EVAL_MODE = True # disabled NAV actions and only return concise answer, for webarena and miniwob benchmarks\
else:
EVAL_MODE = False
class BrowsingAgent(Agent):
@@ -56,13 +54,13 @@ class BrowsingAgent(Agent):
- llm (LLM): The llm to be used by this agent
"""
super().__init__(llm)
# define a configurable action space, with chat functionality, web navigation, and webpage grounding using accessibility tree and HTML.
# see https://github.com/ServiceNow/BrowserGym/blob/main/core/src/browsergym/core/action/highlevel.py for more details
action_subsets = ['chat', 'bid']
if USE_NAV:
action_subsets.append('nav')
self.action_space = HighLevelActionSet(
# see https://github.com/ServiceNow/BrowserGym/blob/main/core/src/browsergym/core/action/highlevel.py for more details
subsets=[
'chat',
'bid',
'nav',
], # define a configurable action space, with chat functionality, web navigation, and webpage grounding using accessibility tree and HTML.
subsets=action_subsets,
strict=False, # less strict on the parsing of the actions
multiaction=True, # enable to agent to take multiple actions at once
)
@@ -75,6 +73,32 @@ class BrowsingAgent(Agent):
"""
super().reset()
self.cost_accumulator = 0
self.error_accumulator = 0
def parse_response(self, response: str) -> Action:
if '```' not in response:
# unexpected response format, message back to user
action_str = f'send_msg_to_user("""{response}""")'
return BrowseInteractiveAction(
browser_actions=action_str,
thought=response,
browsergym_send_msg_to_user=response,
)
thought = response.split('```')[0].strip()
action_str = response.split('```')[1].strip()
# handle send message to user function call in BrowserGym
msg_content = ''
for sub_action in action_str.split('\n'):
if 'send_msg_to_user(' in sub_action:
tree = ast.parse(sub_action)
args = tree.body[0].value.args # type: ignore
msg_content = args[0].value
return BrowseInteractiveAction(
browser_actions=action_str,
thought=thought,
browsergym_send_msg_to_user=msg_content,
)
def step(self, state: State) -> Action:
"""
@@ -90,27 +114,66 @@ class BrowsingAgent(Agent):
- AgentFinishAction() - end the interaction
"""
goal = state.get_current_user_intent()
if goal is None:
goal = state.inputs['task']
messages = []
prev_actions = ''
prev_actions = []
cur_axtree_txt = ''
error_prefix = ''
last_obs = None
last_action = None
if EVAL_MODE and len(state.history) == 1:
# for webarena and miniwob++ eval, we need to retrieve the initial observation already in browser env
# initialize and retrieve the first observation by issuing an noop OP
# For non-benchmark browsing, the browser env starts with a blank page, and the agent is expected to first navigate to desired websites
return BrowseInteractiveAction(browser_actions='noop()')
for prev_action, obs in state.history:
if isinstance(prev_action, BrowseInteractiveAction):
prev_actions += f'{prev_action.browser_actions}\n'
prev_actions.append(prev_action.browser_actions)
last_obs = obs
last_action = prev_action
elif (
isinstance(prev_action, MessageAction) and prev_action.source != 'user'
isinstance(prev_action, MessageAction)
and prev_action.source == EventSource.AGENT
):
# agent has responded, task finish.
return AgentFinishAction()
return AgentFinishAction(outputs={'content': prev_action.content})
if EVAL_MODE:
prev_actions = prev_actions[1:] # remove the first noop action
prev_action_str = '\n'.join(prev_actions)
# if the final BrowserInteractiveAction exec BrowserGym's send_msg_to_user,
# we should also send a message back to the user in OpenDevin and call it a day
if (
isinstance(last_action, BrowseInteractiveAction)
and last_action.browsergym_send_msg_to_user
):
return MessageAction(last_action.browsergym_send_msg_to_user)
if isinstance(last_obs, BrowserOutputObservation):
if last_obs.error:
# add error recovery prompt prefix
error_prefix = f'IMPORTANT! Last action is incorrect:\n{last_obs.last_browser_action}\nThink again with the current observation of the page.\n'
cur_axtree_txt = flatten_axtree_to_str(last_obs.axtree_object)
try:
cur_axtree_txt = flatten_axtree_to_str(
last_obs.axtree_object,
extra_properties=last_obs.extra_element_properties,
with_clickable=True,
filter_visible_only=True,
)
except Exception as e:
logger.error(
'Error when trying to process the accessibility tree: %s', e
)
return MessageAction('Error encountered when browsing.')
if error_prefix:
self.error_accumulator += 1
if self.error_accumulator > 5:
return MessageAction('Too many errors encountered. Task failed.')
system_msg = f"""\
# Instructions
Review the current state of the page and all other information to find the best
@@ -133,7 +196,7 @@ and executed by a program, make sure to follow the formatting instructions.
{cur_axtree_txt}
# Previous Actions
{prev_actions}
{prev_action_str}
Here is an example with chain of thought of a valid action when clicking on a button:
"
@@ -141,16 +204,31 @@ In order to accomplish my goal I need to click on the button with bid 12
```click("12")```
"
""".strip()
if USE_CONCISE_ANSWER:
concise_instruction = """\
Here is another example with chain of thought of a valid action when providing a concise answer to user:
"
In order to accomplish my goal I need to send the information asked back to the user. This page list the information of HP Inkjet Fax Machine, which is the product identified in the objective. Its price is $279.49. I will send a message back to user with the answer.
```send_msg_to_user("$279.49")```
"
"""
prompt += concise_instruction
messages.append({'role': 'user', 'content': prompt})
response = self.llm.completion(
messages=messages,
temperature=0.0,
stop=[')```', ')\n```'],
)
self.log_cost(response)
action_resp = response['choices'][0]['message']['content']
action_resp = response['choices'][0]['message']['content'].strip()
if not action_resp.endswith('```'):
action_resp = action_resp + ')```'
logger.info(prompt)
logger.info(action_resp)
return parse_response(action_resp)
return self.parse_response(action_resp)
def search_memory(self, query: str) -> list[str]:
raise NotImplementedError('Implement this abstract method')

View File

@@ -161,7 +161,7 @@ class Truncater(Shrinkable):
def __init__(self, visible, shrink_speed=0.3, start_truncate_iteration=10):
super().__init__(visible=visible)
self.shrink_speed = shrink_speed # the percentage shrinked in each iteration
self.shrink_speed = shrink_speed # the percentage shrunk in each iteration
self.start_truncate_iteration = (
start_truncate_iteration # the iteration to start truncating
)

View File

@@ -0,0 +1,182 @@
import re
from opendevin.controller.action_parser import ActionParser, ResponseParser
from opendevin.events.action import (
Action,
AgentDelegateAction,
AgentFinishAction,
CmdRunAction,
IPythonRunCellAction,
MessageAction,
)
class CodeActResponseParser(ResponseParser):
"""
Parser action:
- CmdRunAction(command) - bash command to run
- IPythonRunCellAction(code) - IPython code to run
- AgentDelegateAction(agent, inputs) - delegate action for (sub)task
- MessageAction(content) - Message action to run (e.g. ask for clarification)
- AgentFinishAction() - end the interaction
"""
def __init__(
self,
):
# Need pay attention to the item order in self.action_parsers
self.action_parsers = [
CodeActActionParserFinish(),
CodeActActionParserCmdRun(),
CodeActActionParserIPythonRunCell(),
CodeActActionParserAgentDelegate(),
]
self.default_parser = CodeActActionParserMessage()
def parse(self, response: str) -> Action:
action_str = self.parse_response(response)
return self.parse_action(action_str)
def parse_response(self, response) -> str:
action = response.choices[0].message.content
for lang in ['bash', 'ipython', 'browse']:
if f'<execute_{lang}>' in action and f'</execute_{lang}>' not in action:
action += f'</execute_{lang}>'
return action
def parse_action(self, action_str: str) -> Action:
for action_parser in self.action_parsers:
if action_parser.check_condition(action_str):
return action_parser.parse(action_str)
return self.default_parser.parse(action_str)
class CodeActActionParserFinish(ActionParser):
"""
Parser action:
- AgentFinishAction() - end the interaction
"""
def __init__(
self,
):
self.finish_command = None
def check_condition(self, action_str: str) -> bool:
self.finish_command = re.search(r'<finish>.*</finish>', action_str, re.DOTALL)
return self.finish_command is not None
def parse(self, action_str: str) -> Action:
assert (
self.finish_command is not None
), 'self.finish_command should not be None when parse is called'
thought = action_str.replace(self.finish_command.group(0), '').strip()
return AgentFinishAction(thought=thought)
class CodeActActionParserCmdRun(ActionParser):
"""
Parser action:
- CmdRunAction(command) - bash command to run
- AgentFinishAction() - end the interaction
"""
def __init__(
self,
):
self.bash_command = None
def check_condition(self, action_str: str) -> bool:
self.bash_command = re.search(
r'<execute_bash>(.*?)</execute_bash>', action_str, re.DOTALL
)
return self.bash_command is not None
def parse(self, action_str: str) -> Action:
assert (
self.bash_command is not None
), 'self.bash_command should not be None when parse is called'
thought = action_str.replace(self.bash_command.group(0), '').strip()
# a command was found
command_group = self.bash_command.group(1).strip()
if command_group.strip() == 'exit':
return AgentFinishAction()
return CmdRunAction(command=command_group, thought=thought)
class CodeActActionParserIPythonRunCell(ActionParser):
"""
Parser action:
- IPythonRunCellAction(code) - IPython code to run
"""
def __init__(
self,
):
self.python_code = None
self.jupyter_kernel_init_code: str = 'from agentskills import *'
def check_condition(self, action_str: str) -> bool:
self.python_code = re.search(
r'<execute_ipython>(.*?)</execute_ipython>', action_str, re.DOTALL
)
return self.python_code is not None
def parse(self, action_str: str) -> Action:
assert (
self.python_code is not None
), 'self.python_code should not be None when parse is called'
code_group = self.python_code.group(1).strip()
thought = action_str.replace(self.python_code.group(0), '').strip()
return IPythonRunCellAction(
code=code_group,
thought=thought,
kernel_init_code=self.jupyter_kernel_init_code,
)
class CodeActActionParserAgentDelegate(ActionParser):
"""
Parser action:
- AgentDelegateAction(agent, inputs) - delegate action for (sub)task
"""
def __init__(
self,
):
self.agent_delegate = None
def check_condition(self, action_str: str) -> bool:
self.agent_delegate = re.search(
r'<execute_browse>(.*)</execute_browse>', action_str, re.DOTALL
)
return self.agent_delegate is not None
def parse(self, action_str: str) -> Action:
assert (
self.agent_delegate is not None
), 'self.agent_delegate should not be None when parse is called'
thought = action_str.replace(self.agent_delegate.group(0), '').strip()
browse_actions = self.agent_delegate.group(1).strip()
task = f'{thought}. I should start with: {browse_actions}'
return AgentDelegateAction(agent='BrowsingAgent', inputs={'task': task})
class CodeActActionParserMessage(ActionParser):
"""
Parser action:
- MessageAction(content) - Message action to run (e.g. ask for clarification)
"""
def __init__(
self,
):
pass
def check_condition(self, action_str: str) -> bool:
# We assume the LLM is GOOD enough that when it returns pure natural language
# it wants to talk to the user
return True
def parse(self, action_str: str) -> Action:
return MessageAction(content=action_str, wait_for_response=True)

View File

@@ -1,5 +1,4 @@
import re
from agenthub.codeact_agent.action_parser import CodeActResponseParser
from agenthub.codeact_agent.prompt import (
COMMAND_DOCS,
EXAMPLES,
@@ -18,6 +17,7 @@ from opendevin.events.action import (
MessageAction,
)
from opendevin.events.observation import (
AgentDelegateObservation,
BrowserOutputObservation,
CmdOutputObservation,
IPythonRunCellObservation,
@@ -33,14 +33,6 @@ from opendevin.runtime.tools import RuntimeTool
ENABLE_GITHUB = True
def parse_response(response) -> str:
action = response.choices[0].message.content
for lang in ['bash', 'ipython', 'browse']:
if f'<execute_{lang}>' in action and f'</execute_{lang}>' not in action:
action += f'</execute_{lang}>'
return action
def action_to_str(action: Action) -> str:
if isinstance(action, CmdRunAction):
return f'{action.thought}\n<execute_bash>\n{action.command}\n</execute_bash>'
@@ -89,6 +81,9 @@ def get_observation_message(obs) -> dict[str, str] | None:
elif isinstance(obs, BrowserOutputObservation):
content = 'OBSERVATION:\n' + truncate_observation(obs.content)
return {'role': 'user', 'content': content}
elif isinstance(obs, AgentDelegateObservation):
content = 'OBSERVATION:\n' + truncate_observation(str(obs.outputs))
return {'role': 'user', 'content': content}
return None
@@ -119,7 +114,7 @@ def get_in_context_example() -> str:
class CodeActAgent(Agent):
VERSION = '1.5'
VERSION = '1.6'
"""
The Code Act Agent is a minimalist agent.
The agent works by passing the model a list of action-observation pairs and prompting the model to take the next step.
@@ -164,11 +159,12 @@ class CodeActAgent(Agent):
JupyterRequirement(),
]
runtime_tools: list[RuntimeTool] = [RuntimeTool.BROWSER]
jupyter_kernel_init_code: str = 'from agentskills import *'
system_message: str = get_system_message()
in_context_example: str = f"Here is an example of how you can interact with the environment for task solving:\n{get_in_context_example()}\n\nNOW, LET'S START!"
action_parser = CodeActResponseParser()
def __init__(
self,
llm: LLM,
@@ -199,7 +195,7 @@ class CodeActAgent(Agent):
Returns:
- CmdRunAction(command) - bash command to run
- IPythonRunCellAction(code) - IPython code to run
- BrowseInteractiveAction(browsergym_command) - BrowserGym commands to run
- AgentDelegateAction(agent, inputs) - delegate action for (sub)task
- MessageAction(content) - Message action to run (e.g. ask for clarification)
- AgentFinishAction() - end the interaction
"""
@@ -234,50 +230,10 @@ class CodeActAgent(Agent):
],
temperature=0.0,
)
action_str: str = parse_response(response)
state.num_of_chars += sum(
len(message['content']) for message in messages
) + len(action_str)
if finish_command := re.search(r'<finish>.*</finish>', action_str, re.DOTALL):
thought = action_str.replace(finish_command.group(0), '').strip()
return AgentFinishAction(thought=thought)
if bash_command := re.search(
r'<execute_bash>(.*?)</execute_bash>', action_str, re.DOTALL
):
# remove the command from the action string to get thought
thought = action_str.replace(bash_command.group(0), '').strip()
# a command was found
command_group = bash_command.group(1).strip()
if command_group.strip() == 'exit':
return AgentFinishAction()
return CmdRunAction(command=command_group, thought=thought)
elif python_code := re.search(
r'<execute_ipython>(.*?)</execute_ipython>', action_str, re.DOTALL
):
# a code block was found
code_group = python_code.group(1).strip()
thought = action_str.replace(python_code.group(0), '').strip()
return IPythonRunCellAction(
code=code_group,
thought=thought,
kernel_init_code=self.jupyter_kernel_init_code,
)
elif browse_command := re.search(
r'<execute_browse>(.*)</execute_browse>', action_str, re.DOTALL
):
# BrowserGym actions was found
browse_actions = browse_command.group(1).strip()
thought = action_str.replace(browse_command.group(0), '').strip()
return BrowseInteractiveAction(
browser_actions=browse_actions, thought=thought
)
else:
# We assume the LLM is GOOD enough that when it returns pure natural language
# it want to talk to the user
return MessageAction(content=action_str, wait_for_response=True)
) + len(response.choices[0].message.content)
return self.action_parser.parse(response)
def search_memory(self, query: str) -> list[str]:
raise NotImplementedError('Implement this abstract method')

View File

@@ -5,35 +5,41 @@ _AGENT_SKILLS_DOCS = AgentSkillsRequirement.documentation
COMMAND_DOCS = (
'\nApart from the standard Python library, the assistant can also use the following functions (already imported) in <execute_ipython> environment:\n'
f'{_AGENT_SKILLS_DOCS}'
"Please note that THE `edit_file` FUNCTION REQUIRES PROPER INDENTATION. If the assistant would like to add the line ' print(x)', it must fully write that out, with all those spaces before the code! Indentation is important and code that is not indented correctly will fail and require fixing before it can be run."
"Please note that THE `edit_file` and `append_file` FUNCTIONS REQUIRE PROPER INDENTATION. If the assistant would like to add the line ' print(x)', it must fully write that out, with all those spaces before the code! Indentation is important and code that is not indented correctly will fail and require fixing before it can be run."
)
# ======= SYSTEM MESSAGE =======
MINIMAL_SYSTEM_PREFIX = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
The assistant can interact with an interactive Python (Jupyter Notebook) environment and receive the corresponding output when needed. The code should be enclosed using "<execute_ipython>" tag, for example:
The assistant can use an interactive Python (Jupyter Notebook) environment, executing code with <execute_ipython>.
<execute_ipython>
print("Hello World!")
</execute_ipython>
The assistant can execute bash commands on behalf of the user by wrapping them with <execute_bash> and </execute_bash>.
For example, you can list the files in the current directory by <execute_bash> ls </execute_bash>.
Important, however: do not run interactive commands. You do not have access to stdin.
Also, you need to handle commands that may run indefinitely and not return a result. For such cases, you should redirect the output to a file and run the command in the background to avoid blocking the execution.
For example, to run a Python script that might run indefinitely without returning immediately, you can use the following format: <execute_bash> python3 app.py > server.log 2>&1 & </execute_bash>
Also, if a command execution result saying like: Command: "npm start" timed out. Sending SIGINT to the process, you should also retry with running the command in the background.
"""
BROWSING_PREFIX = """The assistant can browse the Internet with commands on behalf of the user by wrapping them with <execute_browse> and </execute_browse>.
For example, you can browse a given URL by <execute_browse> goto("<URL>") </execute_browse>.
The assistant should attempt fewer things at a time instead of putting too much commands OR code in one "execute" block.
BROWSING_PREFIX = """The assistant can browse the Internet with <execute_browse> and </execute_browse>.
For example, <execute_browse> Tell me the usa's president using google search </execute_browse>.
Or <execute_browse> Tell me what is in http://example.com </execute_browse>.
"""
PIP_INSTALL_PREFIX = """The assistant can install Python packages using the %pip magic command in an IPython environment by using the following syntax: <execute_ipython> %pip install [package needed] </execute_ipython> and should always import packages and define variables before starting to use them."""
SYSTEM_PREFIX = MINIMAL_SYSTEM_PREFIX + BROWSING_PREFIX + PIP_INSTALL_PREFIX
GITHUB_MESSAGE = """To do any activities on GitHub, the assistant should use the token in the $GITHUB_TOKEN environment variable.
For instance, to push a local branch `my_branch` to the github repo `owner/repo`, the assistant can use the following four commands:
GITHUB_MESSAGE = """To interact with GitHub, use the $GITHUB_TOKEN environment variable.
For example, to push a branch `my_branch` to the GitHub repo `owner/repo`:
<execute_bash> git push https://$GITHUB_TOKEN@github.com/owner/repo.git my_branch </execute_bash>
If the assistant require access to GitHub but $GITHUB_TOKEN is not set, ask the user to set it."""
If $GITHUB_TOKEN is not set, ask the user to set it."""
SYSTEM_SUFFIX = """The assistant's response should be concise.
The assistant should include ONLY ONE <execute_ipython> or <execute_bash> or <execute_browse> in every one of the responses, unless the assistant is finished with the task or need more input or action from the user in order to proceed.
IMPORTANT: Whenever possible, execute the code for the user using <execute_ipython> or <execute_bash> or <execute_browse> instead of providing it.
SYSTEM_SUFFIX = """Responses should be concise.
The assistant should attempt fewer things at a time instead of putting too much commands OR code in one "execute" block.
Include ONLY ONE <execute_ipython>, <execute_bash>, or <execute_browse> per response, unless the assistant is finished with the task or need more input or action from the user in order to proceed.
IMPORTANT: Execute code using <execute_ipython>, <execute_bash>, or <execute_browse> whenever possible.
"""
@@ -41,10 +47,10 @@ IMPORTANT: Whenever possible, execute the code for the user using <execute_ipyth
EXAMPLES = """
--- START OF EXAMPLE ---
USER: Can you create a list of numbers from 1 to 10, and create a web page to display them at port 5000?
USER: Create a list of numbers from 1 to 10, and display them in a web page at port 5000.
ASSISTANT:
Sure! Let me create a file first:
Sure! Let me create a Python file `app.py`:
<execute_ipython>
create_file('app.py')
</execute_ipython>
@@ -68,7 +74,7 @@ def index():
if __name__ == '__main__':
app.run(port=5000)\"\"\"
edit_file(start=1, end=1, content=EDITED_CODE)
edit_file('app.py', start=1, end=1, content=EDITED_CODE)
</execute_ipython>
USER:
@@ -163,7 +169,7 @@ USER: Now browse the newly started server's homepage and show me the content.
ASSISTANT:
Sure! Let me browse the server's homepage at http://127.0.0.1:5000:
<execute_browse>
goto("http://127.0.0.1:5000")
Get the content on "http://127.0.0.1:5000"
</execute_browse>
USER:
@@ -207,7 +213,7 @@ USER:
ASSISTANT:
I should edit the file to display the numbers in a table format. I should include correct indentation. Let me update the file:
<execute_ipython>
edit_file(start=7, end=7, content=" return '<table>' + ''.join([f'<tr><td>{i}</td></tr>' for i in numbers]) + '</table>'")
edit_file('app.py', start=7, end=7, content=" return '<table>' + ''.join([f'<tr><td>{i}</td></tr>' for i in numbers]) + '</table>'")
</execute_ipython>
USER:
@@ -226,7 +232,7 @@ Observation:
[File updated. Please review the changes and make sure they are correct (correct indentation, no duplicate lines, etc). Edit the file again if necessary.]
ASSISTANT:
The file has been updated. Let me run the Python file again with the new changes:
Running the updated file:
<execute_bash>
python3 app.py > server.log 2>&1 &
</execute_bash>
@@ -236,14 +242,14 @@ Observation:
[1] 126
ASSISTANT:
The server is running on port 5000 with PID 126. You can access the list of numbers in a table format by visiting http://127.0.0.1:5000. Free free to let me know if you have any further requests!
The server is running on port 5000 with PID 126. You can access the list of numbers in a table format by visiting http://127.0.0.1:5000. Let me know if you have any further requests!
--- END OF EXAMPLE ---
"""
INVALID_INPUT_MESSAGE = (
"I don't understand your input. \n"
'If you want to execute a bash command, please use <execute_bash> YOUR_COMMAND_HERE </execute_bash>.\n'
'If you want to execute a block of Python code, please use <execute_ipython> YOUR_COMMAND_HERE </execute_ipython>.\n'
'If you want to browse the Internet, please use <execute_browse> YOUR_COMMAND_HERE </execute_browse>.\n'
'For bash commands, use <execute_bash> YOUR_COMMAND </execute_bash>.\n'
'For Python code, use <execute_ipython> YOUR_CODE </execute_ipython>.\n'
'For browsing, use <execute_browse> YOUR_COMMAND </execute_browse>.\n'
)

View File

@@ -12,3 +12,6 @@ in the following structure:
Note that `prompt.md` could use jinja2 template syntax. During runtime, `prompt.md`
is loaded and rendered, and used together with `agent.yaml` to initialize a
micro-agent.
Micro-agents can be used independently. You can also use `ManagerAgent` which knows
how to coordinate the agents and collaboratively finish a task.

View File

@@ -1,2 +1,2 @@
* `reject` - reject the task. Arguments:
* `outputs` - a dictionary representing the outputs of your task, if any
* `outputs` - a dictionary with only a `reason` attribute

View File

@@ -3,3 +3,4 @@ description: "Write a git commit message for files in the git staging area"
inputs: {}
outputs:
answer: string
reason: string

View File

@@ -14,7 +14,7 @@ changes. The commit message should include:
You should find the diff using `git diff --cached`, compile a commit message,
and call the `finish` action with `outputs.answer` set to the answer. If current
repo is not a valid git repo, or there is no diff in the staging area, please call
the `reject` action with `outputs.answer` set to the reason.
the `reject` action.
## History
{{ instructions.history_truncated }}

View File

@@ -3,4 +3,6 @@ description: Delegates tasks to microagents based on their area of expertise
generates: Action
inputs:
task: string
outputs: {}
outputs:
summary: string # if finished
reason: string # if rejected

View File

@@ -7,6 +7,15 @@ can do the actual work. A description of each agent is provided below. You MUST
select one of the delegates below to move towards accomplishing the task, and you MUST
provide the correct inputs for the delegate you select.
Note: the delegated agent either returns "finish" or "reject".
- If the action is "finish", but the full task is not done yet, you should
continue to delegate to one of the agents below to until the full task is finished.
- If the action is "reject", it means the delegated agent is not capable of the
task you send to. You should revisit the input you send to the delegate, and consider
whether any other delegate would be able to solve the task. If you cannot find
a proper delegate agent, or the delegate attempts keep failing, call the `reject`
action.
## Agents
{% for name, details in delegates.items() %}
### {{ name }}
@@ -19,9 +28,13 @@ provide the correct inputs for the delegate you select.
{{ instructions.history_truncated }}
{{ history_to_json(state.history[-10:]) }}
If the last item in the history is an error, you should try to fix it. If you
cannot fix it, call the `reject` action.
## Available Actions
{{ instructions.actions.delegate }}
{{ instructions.actions.finish }}
{{ instructions.actions.reject }}
## Format
{{ instructions.format.action }}

View File

@@ -20,7 +20,7 @@ Do NOT finish until you have a complete understanding of which parts of the
codebase are relevant to the project, including particular files, functions, and classes.
When you're done, put your summary in `outputs.summary` in the `finish` action.
Remember, your task is to explore and study the current repository, not actually
implement the solution. If the codebase is empty, you shoud call the `finish` action.
implement the solution. If the codebase is empty, you should call the `finish` action.
## History
{{ instructions.history_truncated }}

View File

@@ -10,7 +10,7 @@ RUN npm ci
COPY ./frontend ./
RUN npm run make-i18n && npm run build
FROM python:3.12-slim as backend-builder
FROM python:3.12.3-slim as backend-builder
WORKDIR /app
ENV PYTHONPATH '/app'
@@ -28,7 +28,7 @@ COPY ./pyproject.toml ./poetry.lock ./
RUN touch README.md
RUN poetry install --without evaluation --no-root && rm -rf $POETRY_CACHE_DIR
FROM python:3.12-slim as runtime
FROM python:3.12.3-slim as runtime
WORKDIR /app

View File

@@ -50,6 +50,7 @@ else
groupadd -g $DOCKER_SOCKET_GID docker
fi
mkdir -p /home/enduser/.cache/huggingface/hub/
mkdir -p /home/enduser/.cache/ms-playwright/
mv /home/opendevin/.cache/ms-playwright/ /home/enduser/.cache/

View File

@@ -15,7 +15,7 @@ Achieving full replication of production-grade applications with LLMs is a compl
## 🚧 Default Agent
- Our default Agent is currently the MonologueAgent, which has limited capabilities, but is fairly stable. We're working on other Agent implementations, including [SWE Agent](https://swe-agent.com/). You can [read about our current set of agents here](./agents).
- Our default Agent is currently the CodeActAgent, which is capable of generating code and handling files. We're working on other Agent implementations, including [SWE Agent](https://swe-agent.com/). You can [read about our current set of agents here](./agents).
## 🤝 How to Contribute

View File

@@ -4,52 +4,53 @@ sidebar_position: 5
# 🚧 Troubleshooting
There are some error messages that get reported over and over by users.
We'll try to make the install process easier, and to make these error messages
better in the future. But for now, you can look for your error message below,
and see if there are any workaround.
There are some error messages that frequently get reported by users.
We'll try to make the install process easier and these error messages
better in the future. But for now, you can look for your error message below and see if there are any workarounds.
For each of these error messages **there is an existing issue**. Please do not
open an new issue--just comment there.
open a new issue--just comment there.
If you find more information or a workaround for one of these issues, please
open a PR to add details to this file.
open a *PR* to add details to this file.
:::tip
If you're running on Windows and having trouble, check out our [guide for Windows users](troubleshooting/windows)
If you're running on Windows and having trouble, check out our [guide for Windows (WSL) users](troubleshooting/windows).
:::
## Unable to connect to docker
## Unable to connect to Docker
[GitHub Issue](https://github.com/OpenDevin/OpenDevin/issues/1226)
### Symptoms
```
```bash
Error creating controller. Please check Docker is running and visit `https://opendevin.github.io/OpenDevin/modules/usage/troubleshooting` for more debugging information.
```
```
```bash
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
```
### Details
OpenDevin uses a docker container to do its work safely, without potentially breaking your machine.
OpenDevin uses a Docker container to do its work safely, without potentially breaking your machine.
### Workarounds
* Run `docker ps` to ensure that docker is running
* Make sure you don't need `sudo` to run docker [see here](https://www.baeldung.com/linux/docker-run-without-sudo)
* If you are on a mac, check the [permissions requirements](https://docs.docker.com/desktop/mac/permission-requirements/) and in particular consider enabling the "Allow the default Docker socket to be used" under "Settings > Advanced" in Docker Desktop.
* If you are on a mac, Upgrade your Docker to the latest version under "Check for Updates"
* If you are on a Mac, check the [permissions requirements](https://docs.docker.com/desktop/mac/permission-requirements/) and in particular consider enabling the `Allow the default Docker socket to be used` under `Settings > Advanced` in Docker Desktop.
* In addition, upgrade your Docker to the latest version under `Check for Updates`
## Unable to connect to SSH box
[GitHub Issue](https://github.com/OpenDevin/OpenDevin/issues/1156)
### Symptoms
```
```python
self.shell = DockerSSHBox(
...
pexpect.pxssh.ExceptionPxssh: Could not establish connection to host
@@ -62,17 +63,19 @@ especially Windows, this seems to fail.
### Workarounds
- Restart your computer (sometimes works?)
- Be sure to have the latest versions of WSL and Docker
- Try [this reinstallation guide](https://github.com/OpenDevin/OpenDevin/issues/1156#issuecomment-2064549427)
- Set `-e SANDBOX_TYPE=exec` to switch to the ExecBox docker container
* Restart your computer (sometimes it does work)
* Be sure to have the latest versions of WSL and Docker
* Check that your distribution in WSL is up to date as well
* Try [this reinstallation guide](https://github.com/OpenDevin/OpenDevin/issues/1156#issuecomment-2064549427)
* Set `-e SANDBOX_TYPE=exec` to switch to the ExecBox docker container
## Unable to connect to LLM
[GitHub Issue](https://github.com/OpenDevin/OpenDevin/issues/1208)
### Symptoms
```
```python
File "/app/.venv/lib/python3.12/site-packages/openai/_exceptions.py", line 81, in __init__
super().__init__(message, response.request, body=body)
^^^^^^^^^^^^^^^^
@@ -83,18 +86,20 @@ AttributeError: 'NoneType' object has no attribute 'request'
[GitHub Issues](https://github.com/OpenDevin/OpenDevin/issues?q=is%3Aissue+is%3Aopen+404)
This usually happens with local LLM setups, when OpenDevin can't connect to the LLM server.
This usually happens with *local* LLM setups, when OpenDevin can't connect to the LLM server.
See our guide for [local LLMs](llms/localLLMs) for more information.
### Workarounds
- Check your `LLM_BASE_URL`
- Check that ollama is running OK
- Make sure you're using `--add-host host.docker.internal:host-gateway` when running in docker
* Check your `base_url` in your config.toml (if it exists) under the "llm" section
* Check that ollama (or whatever LLM you're using) is running OK
* Make sure you're using `--add-host host.docker.internal:host-gateway` when running in Docker
## `404 Resource not found`
## 404 Resource not found
### Symptoms
```
```python
Traceback (most recent call last):
File "/app/.venv/lib/python3.12/site-packages/litellm/llms/openai.py", line 414, in completion
raise e
@@ -119,18 +124,86 @@ openai.NotFoundError: Error code: 404 - {'error': {'code': '404', 'message': 'Re
```
### Details
This happens when LiteLLM (our library for connecting to different LLM providers) can't find
the API you're trying to connect to. Most often this happens for Azure or ollama users.
the API endpoint you're trying to connect to. Most often this happens for Azure or ollama users.
### Workarounds
- Check that you've set `LLM_BASE_URL` properly
- Check that model is set properly, based on the [LiteLLM docs](https://docs.litellm.ai/docs/providers)
- If you're running inside the UI, be sure to set the `model` in the settings modal
- If you're running headless (via main.py) be sure to set `LLM_MODEL` in your env/config
- Make sure you've followed any special instructions for your LLM provider
- [ollama](/OpenDevin/modules/usage/llms/localLLMs)
- [Azure](/OpenDevin/modules/usage/llms/azureLLMs)
- [Google](/OpenDevin/modules/usage/llms/googleLLMs)
- Make sure your API key is correct
- See if you can connect to the LLM using `curl`
- Try [connecting via LiteLLM directly](https://github.com/BerriAI/litellm) to test your setup
* Check that you've set `LLM_BASE_URL` properly
* Check that model is set properly, based on the [LiteLLM docs](https://docs.litellm.ai/docs/providers)
* If you're running inside the UI, be sure to set the `model` in the settings modal
* If you're running headless (via main.py) be sure to set `LLM_MODEL` in your env/config
* Make sure you've followed any special instructions for your LLM provider
* [ollama](/OpenDevin/modules/usage/llms/localLLMs)
* [Azure](/OpenDevin/modules/usage/llms/azureLLMs)
* [Google](/OpenDevin/modules/usage/llms/googleLLMs)
* Make sure your API key is correct
* See if you can connect to the LLM using `curl`
* Try [connecting via LiteLLM directly](https://github.com/BerriAI/litellm) to test your setup
## `make build` getting stuck on package installations
### Symptoms
Package installation stuck on `Pending...` without any error message:
```bash
Package operations: 286 installs, 0 updates, 0 removals
- Installing certifi (2024.2.2): Pending...
- Installing h11 (0.14.0): Pending...
- Installing idna (3.7): Pending...
- Installing sniffio (1.3.1): Pending...
- Installing typing-extensions (4.11.0): Pending...
```
### Details
In rare cases, `make build` can seemingly get stuck on package installations
without any error message.
### Workarounds
* The package installer Poetry may miss a configuration setting for
where credentials are to be looked up (keyring).
### Workaround
First check with `env` if a value for `PYTHON_KEYRING_BACKEND` exists.
If not, run the below command to set it to a known value and retry the build:
```bash
export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring
```
## Sessions are not restored
### Symptoms
OpenDevin usually asks whether to resume or start a new session when opening the UI.
But clicking "Resume" still starts a fresh new chat.
### Details
With a standard installation as of today session data is stored in memory.
Currently, if OpenDevin's service is restarted, previous sessions become
invalid (a new secret is generated) and thus not recoverable.
### Workarounds
* Change configuration to make sessions persistent by editing the `config.toml`
file (in OpenDevin's root folder) by specifying a `file_store` and an
absolute `file_store_path`:
```toml
file_store="local"
file_store_path="/absolute/path/to/opendevin/cache/directory"
```
* Add a fixed jwt secret in your .bashrc, like below, so that previous session id's
should stay accepted.
```bash
EXPORT JWT_SECRET=A_CONST_VALUE
```

View File

@@ -3,7 +3,7 @@
.text-white {
color: white;
}
.welcome-container {
display: flex;
justify-content: center;
@@ -11,44 +11,43 @@
flex-direction: column;
background: linear-gradient(to bottom, #64748b, #1f2937);
}
@media (min-width: 768px) {
.welcome-container {
flex-direction: row;
background: linear-gradient(to bottom, #64748b, #1f2937);
}
}
.welcome-logo {
height: 45vh;
width: 45vw;
}
@media (max-width: 640px) {
.welcome-logo {
height: 40vw;
width: 40vw;
}
}
@media (min-width: 768px) {
.welcome-logo {
height: auto;
width: 350px;
}
}
.welcome-text {
padding: 24px;
margin-bottom: 24px;
font-weight: 300;
font-size: 1.125rem;
}
@media (min-width: 768px) {
.welcome-text {
padding: 8px;
font-size: 1.5rem;
}
}

Binary file not shown.

Before

Width:  |  Height:  |  Size: 453 KiB

After

Width:  |  Height:  |  Size: 663 KiB

View File

@@ -181,7 +181,7 @@ class Q20GameCelebrity(Q20Game):
user_messages = [
{
'role': 'system',
'content': f'Based on on your knowledge about the celebrity: {self.item}, '
'content': f'Based on your knowledge about the celebrity: {self.item}, '
f'respond to the following question or guess. '
f"Limit your respond to only 'Yes.', 'No.' or 'Dunno.', with no explanation or other words. "
f"Never say the name {self.item} in your response. Do not say 'Dunno.' if it can be answered by 'Yes.' or 'No.' "

View File

@@ -45,7 +45,7 @@ def codeact_user_response(state: State) -> str:
msg = game.generate_user_response(model_guess)
game.curr_turn += 1
logger.info(f'Model guess: {model_guess}')
logger.info(f'Anwser response: {msg}')
logger.info(f'Answer response: {msg}')
if 'bingo!' in msg.lower():
return '/exit'
return msg
@@ -65,7 +65,9 @@ AGENT_CLS_TO_INST_SUFFIX = {
}
def process_instance(instance, agent_class, metadata, reset_logger: bool = True):
def process_instance(
instance, agent_class, metadata, openai_api_key, reset_logger: bool = True
):
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
eval_output_dir = metadata['eval_output_dir']
if reset_logger:
@@ -107,7 +109,7 @@ def process_instance(instance, agent_class, metadata, reset_logger: bool = True)
answerer_model=metadata['answerer_model'],
guesser_model=None,
num_turns=metadata['max_iterations'],
openai_api_key=metadata['openai_api'],
openai_api_key=openai_api_key,
guesser_kargs=guesser_kargs,
)
@@ -234,7 +236,6 @@ if __name__ == '__main__':
'data_split': args.data_split,
'answerer_model': args.answerer_model,
'agent_class': agent_class,
'openai_api': args.OPENAI_API_KEY,
'model_name': model_name,
'max_iterations': max_iterations,
'eval_output_dir': eval_output_dir,
@@ -317,6 +318,7 @@ if __name__ == '__main__':
instance,
agent_class,
metadata,
args.OPENAI_API_KEY,
reset_logger=bool(num_workers > 1),
)
future.add_done_callback(update_progress)

View File

@@ -13,6 +13,7 @@ all the preprocessing/evaluation/analysis scripts.
## Supported Benchmarks
- SWE-Bench: [`evaluation/swe_bench`](./swe_bench)
- ML-Bench: [`evaluation/ml_bench`](./ml_bench)
- HumanEvalFix: [`evaluation/humanevalfix`](./humanevalfix)
- GAIA: [`evaluation/gaia`](./gaia)
- Entity deduction Arena (EDA): [`evaluation/EDA`](./EDA)

View File

@@ -24,7 +24,8 @@ sandbox_timeout = 120
ssh_hostname = "localhost"
use_host_network = false
run_as_devin = false
# AgentBench specific
run_as_devin = true
enable_auto_lint = true
[eval_gpt35_turbo]

View File

@@ -1,4 +1,7 @@
import os
import re
from opendevin.events.action import CmdRunAction, MessageAction
def analysis_size(size_str):
@@ -42,3 +45,17 @@ def create_sh_file(filename: str, cmds: str) -> None:
with open(filename, 'w', encoding='utf-8') as file:
file.write(cmds.replace('\r\n', '\n'))
os.chmod(filename, 0o755)
def try_parse_answer(act) -> str | None:
raw_ans = ''
if isinstance(act, MessageAction) and act.source == 'agent':
raw_ans = act.content
elif isinstance(act, CmdRunAction) and act.source == 'agent':
raw_ans = act.thought
else:
return None
agent_answer = re.findall(r'<solution>(.*?)</solution>', raw_ans)
if not agent_answer:
return None
return agent_answer[0].strip()

View File

@@ -14,7 +14,11 @@ import docker
from datasets import load_dataset
from tqdm import tqdm
from evaluation.agent_bench.helper import compare_results, create_sh_file
from evaluation.agent_bench.helper import (
compare_results,
create_sh_file,
try_parse_answer,
)
from opendevin.controller.state.state import State
from opendevin.core.config import args, config, get_llm_config_arg
from opendevin.core.logger import get_console_handler
@@ -43,6 +47,12 @@ def codeact_user_response(state: State) -> str:
'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP.\n'
)
if state.history:
# check if the last action is an answer, if so, return exit for early exit
last_action, _ = state.history[-1]
ans = try_parse_answer(last_action)
if ans is not None:
return '/exit'
user_msgs = [
action
for action, _ in state.history
@@ -99,7 +109,7 @@ def process_instance(
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {inst_id}.\nHint: run "tail -f {log_file}" to see live logs in a seperate shell'
f'Starting evaluation for instance {inst_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:

View File

@@ -0,0 +1,59 @@
# BioCoder Evaluation with Opendevin
Implements evaluation of agents on BioCoder from the BioCoder benchmark introduced in [BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models](https://arxiv.org/abs/2308.16458). Please see [here](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/humanevalpack.py) for the reference implementation used in the paper.
## Setup Environment
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local develop environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace. Please check [README.md](../../README.md) for how to set this up.
## BioCoder Docker Image
In the opendevin branch of the Biocoder repository, we have slightly modified our original Docker image to work with the OpenDevin environment. In the Docker image are testing scripts (`/testing/start_test_opendevin.py` and aux files in `/testing_files/`) to assist with evaluation. Additionally, we have installed all dependencies, including OpenJDK, mamba (with Python 3.6), and many system libraries. Notably, we have **not** packaged all repositories into the image, so they are downloaded at runtime.
**Before first execution, pull our Docker image with the following command**
```bash
docker pull public.ecr.aws/i5g0m1f6/eval_biocoder:v1.0
```
To reproduce this image, please see the Dockerfile_Opendevin in the `biocoder` repository.
## Start the evaluation
```bash
./evaluation/biocoder/scripts/run_infer.sh [model_config] [agent] [eval_limit]
```
where `model_config` is mandatory, while `agent`, `dataset` and `eval_limit` are optional.
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
LLM settings, as defined in your `config.toml`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default it infers all instances.
Let's say you'd like to run 10 instances using `eval_gpt4_1106_eval_gpt4o_2024_05_13preview` and CodeActAgent,
then your command would be:
## Examples
```bash
./evaluation/biocoder/scripts/run_infer.sh eval_gpt4o_2024_05_13 CodeActAgent 1
```
## Reference
```
@misc{tang2024biocoder,
title={BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models},
author={Xiangru Tang and Bill Qian and Rick Gao and Jiakang Chen and Xinyun Chen and Mark Gerstein},
year={2024},
eprint={2308.16458},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```

View File

@@ -0,0 +1,396 @@
import json
import os
import re
import sys
from collections import defaultdict
from dataclasses import dataclass
from datasets import load_dataset
from opendevin.core.config import config
from opendevin.core.logger import opendevin_logger as logger
from opendevin.runtime.docker.ssh_box import DockerSSHBox
from opendevin.runtime.plugins import (
JupyterRequirement,
PluginRequirement,
SWEAgentCommandsRequirement,
)
BIOCODER_BENCH_CONTAINER_IMAGE = 'public.ecr.aws/i5g0m1f6/eval_biocoder:v1.0'
@dataclass
class BiocoderData:
filePath: str
numLines: int
lineStart: int
lineEnd: int
signature: str
comment: str
content: str
repository: str
promptSummaryOnly: str
contextCode: str
goldenCode: str
test_case_id: str
language: str
def to_dict(self):
return {
'filePath': self.filePath,
'numLines': self.numLines,
'lineStart': self.lineStart,
'lineEnd': self.lineEnd,
'signature': self.signature,
'comment': self.comment,
'content': self.content,
'repository': self.repository,
'promptSummaryOnly': self.promptSummaryOnly,
'contextCode': self.contextCode,
'goldenCode': self.goldenCode,
'test_case_id': self.test_case_id,
'language': self.language,
}
def get_likely_indent_size(array_of_tabs) -> int:
sizes = defaultdict(int)
for i in range(len(array_of_tabs) - 1):
diff = array_of_tabs[i + 1] - array_of_tabs[i]
if diff > 0:
sizes[diff] += 1
if len(sizes) == 0:
return 4
return int(max(sizes, key=sizes.get))
class BiocoderSSHBox(DockerSSHBox):
def __init__(
self,
container_image: str,
timeout: int = 120,
sid: str | None = None,
biocoder_instance_id: str | None = None,
biocoder_instance: BiocoderData | None = None,
skip_workspace_mount: bool = True,
sandbox_plugins: list[PluginRequirement] = [], # noqa: B006
biocoder_cache_folder: str = 'biocoder_cache',
workspace_dir_name: str | None = None,
):
if biocoder_instance_id is None:
raise ValueError('biocoder_instance_id must be provided')
self.biocoder_instance_id = biocoder_instance_id
self.biocoder_instance = biocoder_instance
self.skip_workspace_mount = skip_workspace_mount
self.biocoder_cache_folder = biocoder_cache_folder
self.first_line_after_removed = None
self.workspace_dir_name = workspace_dir_name
self.workspace_base = config.workspace_base
self.workspace_mount_path = config.workspace_mount_path
# self.workspace_dir_name_host = os.path.join(config.workspace_base, workspace_dir_name)
self.context_path = None
self.generated_path = None
self.golden_path = None
assert (
container_image is not None
), 'container_image is required for BiocoderBenchSSHBox!'
super().__init__(container_image, timeout, sid)
self.init_plugins(sandbox_plugins)
@property
def volumes(self):
if self.skip_workspace_mount:
return {
k: v
for k, v in super().volumes.items()
if not v['bind'] == self.sandbox_workspace_dir
}
return super().volumes
def get_target_filepath(self):
target_filepath = os.path.join(
self.workspace_mount_path,
self.biocoder_instance.repository.split('/')[1],
self.biocoder_instance.filePath,
)
return target_filepath
def get_changed_code(self, include_signature=False):
# copies changed code into /testing_files/
# Note that this does NOT copy the function signature
target_filepath = self.get_target_filepath()
selected_lines = []
offset = 1 if include_signature else 0
if self.first_line_after_removed is None:
logger.warning('First line after removed is None')
with open(target_filepath, 'r') as f:
lines = f.read().split('\n')
for i in range(self.biocoder_instance.lineStart - offset, len(lines)):
if lines[i].strip() == self.first_line_after_removed.strip():
break
selected_lines.append(lines[i])
text = '\n'.join(selected_lines)
return text
def copy_changed_code(self):
changed_code = self.get_changed_code(include_signature=True)
with open(self.generated_path, 'w') as f:
f.write(changed_code)
exit_code, output = self.execute_and_check(
f'cp -r /workspace/{self.biocoder_cache_folder}/* /testing_files',
'Failed to copy the files',
)
def remove_code(self):
comment_prefix = {'python': '#', 'java': '//'}
target_filepath = self.get_target_filepath()
line_start = self.biocoder_instance.lineStart
line_end = self.biocoder_instance.lineEnd
with open(target_filepath, 'r') as f:
lines = f.read().split('\n')
# print("="*10+"ORIGINAL"+"="*10)
# print("\n".join(lines))
signature_line = lines[line_start - 1]
# get the number of tabs
def get_indent_size(s: str):
return len(re.match(r'\s*', s).group())
indent_sizes = list(map(get_indent_size, lines))
indent_size = get_likely_indent_size(indent_sizes)
comment_indent_size = get_indent_size(signature_line) + indent_size
lines = (
lines[:line_start]
+ [
f"{' '*comment_indent_size+comment_prefix[self.biocoder_instance.language.lower()]}TODO: replace with your code here"
]
+ ([''] * 2)
+ lines[line_end:]
)
first_line_after_removed_index = line_start
while len(
lines[first_line_after_removed_index].strip()
) == 0 and first_line_after_removed_index < len(lines):
first_line_after_removed_index += 1
self.first_line_after_removed = lines[first_line_after_removed_index]
# print("FIRST LINE AFTER REMOVED: ", self.first_line_after_removed)
with open(target_filepath, 'w') as f:
f.write('\n'.join(lines))
# with open(target_filepath, 'r') as f:
# print("="*10+"MODIFIED"+"="*10)
# print(f.read())
def execute_and_check(self, cmd: str, error_msg: str) -> tuple[int, str]:
exit_code, output = self.execute(cmd)
if exit_code != 0:
logger.error(error_msg)
sys.exit(1)
return exit_code, output
@classmethod
def get_box_for_instance(
cls,
instance,
workspace_dir_name=None,
skip_workspace_mount: bool = False,
workspace_mount_path: str | None = None,
sandbox_plugins: list[PluginRequirement] = [], # noqa: B006
) -> 'BiocoderSSHBox':
"""This method initializes a container image, then runs some initialization commands"""
if workspace_dir_name is None:
workspace_dir_name = f'{instance.repository}__{instance.test_case_id[:10]}__{os.getpid()}'.replace(
'/', '__'
)
workspace_base = str(os.path.join(config.workspace_base, workspace_dir_name))
old_workspace_base = config.workspace_base
old_workspace_mount_path = config.workspace_mount_path
try:
config.workspace_base = workspace_base
config.workspace_mount_path = workspace_base
# linting python after editing helps LLM fix indentations
config.enable_auto_lint = True
# create folder for transferring files back/forth
biocoder_cache_folder = 'biocoder_cache'
if not os.path.exists(os.path.join(workspace_base, biocoder_cache_folder)):
os.makedirs(
os.path.join(workspace_base, biocoder_cache_folder), exist_ok=True
)
file_ext = {
'python': 'py',
'java': 'java',
'c': 'c',
'cpp': 'cpp',
'javascript': 'js',
'typescript': 'ts',
}[instance.language.lower()]
context_path = os.path.join(
workspace_base, biocoder_cache_folder, 'context.' + file_ext
)
generated_path = os.path.join(
workspace_base, biocoder_cache_folder, 'generated.' + file_ext
)
golden_path = os.path.join(
workspace_base, biocoder_cache_folder, 'golden.' + file_ext
)
# print(instance.contextCode)
with open(context_path, 'w') as f:
f.write(instance.contextCode)
with open(generated_path, 'w') as f:
f.write(instance.goldenCode)
with open(golden_path, 'w') as f:
f.write(instance.goldenCode)
testcase_json = {
'test_case_id': instance.test_case_id,
'num_cases': 1000,
'language': instance.language.lower(),
}
with open(
os.path.join(
workspace_base, biocoder_cache_folder, 'testcase_biocoder.json'
),
'w',
) as f:
f.write(json.dumps(testcase_json, indent=4))
# linting python after editing helps LLM fix indentations
config.enable_auto_lint = True
sandbox = cls(
container_image=BIOCODER_BENCH_CONTAINER_IMAGE,
biocoder_instance_id=instance.test_case_id,
biocoder_instance=instance,
skip_workspace_mount=skip_workspace_mount,
sandbox_plugins=sandbox_plugins,
biocoder_cache_folder=biocoder_cache_folder,
workspace_dir_name=workspace_dir_name,
)
except Exception:
raise
finally:
config.workspace_base = old_workspace_base
config.workspace_mount_path = old_workspace_mount_path
sandbox.context_path = context_path
sandbox.generated_path = generated_path
sandbox.golden_path = golden_path
logger.info(f'SSH box started for instance {instance.test_case_id}.')
# cd to the workspace
exit_code, output = sandbox.execute_and_check(
'cd /workspace', 'Failed to cd to workspace'
)
logger.info(f'cd to workspace: {output}')
# download repository archive
repository_url = f"https://biocoder.lilbillbiscuit.com/repos/{instance.repository.split('/')[1]}.zip"
exit_code, output = sandbox.execute_and_check(
'wget -O repo.zip ' + repository_url, 'Failed to download the repository'
)
logger.info(f'Downloaded the repository: {output}')
exit_code, output = sandbox.execute_and_check(
'unzip -o -q repo.zip', 'Failed to unzip the repository'
)
logger.info(f'Unzipped the repository: {output}')
# copy the context, generated and golden files to the /testing_files folder
exit_code, output = sandbox.execute_and_check(
f'cp -r /workspace/{biocoder_cache_folder}/* /testing_files',
'Failed to copy the files',
)
# chmod 777
exit_code, output = sandbox.execute_and_check(
'chmod -R 777 /workspace',
'Failed to chmod the files',
)
return sandbox
if __name__ == '__main__':
biocoder_dataset = load_dataset('Lilbillbiscuit/biocoder_public')
EXAMPLE_INSTANCE = biocoder_dataset['test'][0]
EXAMPLE_INSTANCE = BiocoderData(**EXAMPLE_INSTANCE)
sandbox = BiocoderSSHBox.get_box_for_instance(
instance=EXAMPLE_INSTANCE,
workspace_mount_path='/home/ubuntu/OpenDevinBioCoder/workspace',
skip_workspace_mount=False,
sandbox_plugins=[JupyterRequirement(), SWEAgentCommandsRequirement()],
)
# PRE TEST
exit_code, output = sandbox.execute_and_check(
'cd /testing',
'Failed to cd /testing',
)
logger.info(f'cd $REPO_PATH: {output}')
exit_code, output = sandbox.execute_and_check(
'whoami',
'Failed to run whoami',
)
logger.info(f'whoami: {output}')
# TEST
exit_code, output = sandbox.execute(
'/home/devin/mambaforge/bin/mamba run -n test python3 /testing/start_test_opendevin.py'
)
assert exit_code == 0, 'Expected exit code 0 (this should have passed)'
logger.info(f'$TEST_CMD:\n{output}')
exit_code, output = sandbox.execute_and_check(
'cat /testing_files/results_biocoder.json', 'Failed to read the result file'
)
print(output)
json_obj = json.loads(output)
if json_obj['result'] == 'pass':
print('PASS')
else:
print('FAIL')
bg_cmd = sandbox.execute_in_background(
"while true; do echo 'dot ' && sleep 10; done"
)
sys.stdout.flush()
try:
while True:
try:
user_input = input('>>> ')
except EOFError:
logger.info('Exiting...')
break
if user_input.lower() == 'exit':
logger.info('Exiting...')
break
if user_input.lower() == 'kill':
sandbox.kill_background(bg_cmd.pid)
logger.info('Background process killed')
continue
exit_code, output = sandbox.execute(user_input)
logger.info('exit code: %d', exit_code)
logger.info(output)
if bg_cmd.pid in sandbox.background_commands:
logs = sandbox.read_logs(bg_cmd.pid)
logger.info('background logs: %s', logs)
sys.stdout.flush()
except KeyboardInterrupt:
logger.info('Exiting...')
sandbox.close()

View File

@@ -0,0 +1,393 @@
import asyncio
import json
import logging
import multiprocessing as mp
import os
import pathlib
import subprocess
import time
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
from datasets import load_dataset
from tqdm import tqdm
import agenthub
from evaluation.biocoder.biocoder_env_box import BiocoderData, BiocoderSSHBox
from opendevin.controller.state.state import State
from opendevin.core.config import args, config, get_llm_config_arg
from opendevin.core.logger import get_console_handler
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import main
from opendevin.events.action import MessageAction
from opendevin.events.serialization.event import event_to_dict
def cleanup():
print('Cleaning up child processes...')
for process in mp.active_children():
print(f'Terminating child process: {process.name}')
process.terminate()
process.join()
def codeact_user_response(state: State) -> str:
msg = (
'Please continue working on the task on whatever approach you think is suitable.\n'
'If you think you have modified the code in a way that fixes the issue, please run the following command: <execute_bash> exit </execute_bash>.\n'
'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.\n'
)
if state.history:
user_msgs = [
action
for action, _ in state.history
if isinstance(action, MessageAction) and action.source == 'user'
]
if len(user_msgs) >= 2:
# let the agent know that it can give up when it has tried 3 times
return (
msg
+ 'If you want to give up, run: <execute_bash> exit </execute_bash>.\n'
)
return msg
def monologue_user_response(state: State) -> str:
raise NotImplementedError('MonologueAgent should never ask for user responses.')
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': codeact_user_response,
'MonologueAgent': monologue_user_response,
}
AGENT_CLS_TO_INST_SUFFIX = {
'CodeActAgent': 'When you think you have fixed the issue through code changes, please run the following command: <execute_bash> exit </execute_bash>.\n'
}
def get_test_result(instance, sandbox, workspace_dir_name):
test_result = {'result': {}, 'metadata': {}}
try:
code = sandbox.get_changed_code(include_signature=True)
sandbox.copy_changed_code()
test_result['metadata']['1_copy_change_success'] = True
test_result['metadata']['1_copy_change_code'] = code
except Exception:
logger.error('Error fetching changed code for this instance')
test_result['metadata']['1_copy_change_success'] = False
test_result['metadata']['1_copy_change_code'] = None
exit_code, output = sandbox.execute_and_check(
'cd /testing',
'Failed to cd /testing',
)
logger.info(f'cd $REPO_PATH: {output}')
exit_code, output = sandbox.execute_and_check(
'whoami',
'Failed to run whoami',
)
logger.info(f'whoami: {output}')
exit_code, output = sandbox.execute(
'/home/devin/mambaforge/bin/mamba run -n test python3 /testing/start_test_opendevin.py'
)
logger.info(f'$TEST_CMD:\n{output}')
exit_code, output = sandbox.execute_and_check(
'cat /testing_files/results_biocoder.json', 'Failed to read the result file'
)
if exit_code == 0:
test_result['metadata']['2_run_test_success'] = True
test_result['metadata']['2_run_test_result'] = str(output)
else:
test_result['metadata']['2_run_test_success'] = False
test_result['metadata']['2_run_test_result'] = str(output)
json_obj = json.loads(output)
test_result['result'] = json_obj['result']
return test_result
def process_instance(
instance,
agent_class,
metadata,
skip_workspace_mount,
eval_output_dir,
reset_logger: bool = True,
):
instance = BiocoderData(**instance)
print(instance)
workspace_dir_name = (
f'{instance.repository}__{instance.test_case_id[:10]}__{os.getpid()}'.replace(
'/', '__'
)
)
workspace_mount_path = os.path.join(config.workspace_base, workspace_dir_name)
# create process-specific workspace dir
# if `not skip_workspace_mount` - we will create a workspace directory for EACH process
# so that different agent don't interfere with each other.
if not skip_workspace_mount:
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
# Setup the logger properly, so you can run multi-processing to parallize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
eval_output_dir, 'logs', f'instance_{instance.test_case_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance.test_case_id}.\nHint: run "tail -f {log_file}" to see live logs in a seperate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
if not skip_workspace_mount:
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
# NOTE: this is something special we do for SWE-Bench due to the reason described in the previous section
# You can omit this if you don't need to setup specialized sandbox
workspace_dir_name = f'{instance.repository}__{instance.test_case_id[:10]}'.replace(
'/', '__'
)
sandbox = BiocoderSSHBox.get_box_for_instance(
instance,
workspace_dir_name,
skip_workspace_mount=False,
workspace_mount_path=workspace_mount_path,
sandbox_plugins=agenthub.Agent.get_cls(agent_class).sandbox_plugins,
)
sandbox.remove_code()
# Prepare instruction
instruction = (
f'Please complete the function "{instance.signature}" in the file /workspace/{instance.repository.split("/")[1]}/{instance.filePath}.\n'
f'The environment has been set up for you to start working. You may assume all necessary tools are installed.\n'
f'To complete the task, you must directly modify the file and fill in the function, keeping in mind that the function signature is on line {instance.lineStart-1}\n\n'
f'The function should do the following:\n'
f'{instance.promptSummaryOnly}\n\n'
)
instruction += (
'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
'You should NOT modify any other files other than the file intended. This means that you should NOT write any test cases.\n'
'You may need context from other files in the repository to complete this task.'
'Do NOT add any import statements or change anything else other than the writing the function body.\n'
'You do not need to run the code to check if it works. \n'
'Make sure to include proper formatting in Java and Python, including correct braces and/or indentation.\n'
)
# instruction = (
# f'In the file {instance.filePath}, there is a function with a signature and without a body. Your job is to complete the function, according to the given instructions. When you complete the function, respond with the function body, and nothing else.'
# 'The repository has cloned for you to start working. You are not allowed to run any bash commands, just modify the files. \n\n'
# '# Problem Statement\n'
# 'Complete the following function signature:\n\n'
# f'{instance.signature}'
# 'The function should do the following:\n\n'
# f'{instance.promptSummaryOnly}\n\n'
# )
#
# instruction += (
# 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
# 'You should NOT modify any other files other than the file intended. This means that you should NOT write any test cases.\n'
# 'Do NOT add any import statements or change anything else other than the writing the function body.\n'
# 'You do not need to run the code to check if it works. The system will automatically check the correctness of your code.\n'
# 'Make sure to include proper formatting in Java and Python, including correct braces and/or indentation.\n'
# )
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX.get(agent_class, '')
# Here's how you can run the agent (similar to the `main` function) and get the final task state
state: State = asyncio.run(
main(
instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(agent_class),
sandbox=sandbox,
)
)
test_result = get_test_result(instance, sandbox, workspace_dir_name)
if state is None:
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
# Save the output
output = {
'test_case_id': instance.test_case_id,
'biocoder_instance': instance.to_dict(),
'instruction': instruction,
'generated': test_result['metadata']['1_copy_change_code'],
'metadata': metadata,
'history': [
(event_to_dict(action), event_to_dict(obs)) for action, obs in state.history
],
'metrics': metrics,
'error': state.error if state and state.error else None,
'test_result': test_result,
}
# Close the sandbox
sandbox.close()
return output
if __name__ == '__main__':
# NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
# so we don't need to manage file uploading to OpenDevin's repo
dataset = load_dataset('lilbillbiscuit/biocoder_public')
biocoder_tests = dataset['test'].to_pandas()
# Check https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/README.md#configure-opendevin-and-your-llm
# for details of how to set `llm_config`
if args.llm_config:
specified_llm_config = get_llm_config_arg(args.llm_config)
if specified_llm_config:
config.llm = specified_llm_config
logger.info(f'Config for evaluation: {config}')
# TEST METADATA
agent_class = args.agent_cls
assert (
agent_class in AGENT_CLS_TO_FAKE_USER_RESPONSE_FN
), f'Unsupported agent class: {agent_class}'
model_name = config.llm.model.split('/')[-1]
max_iterations = args.max_iterations
eval_note = ''
if args.eval_note is not None:
eval_note += '_N_' + args.eval_note
eval_output_dir = os.path.join(
args.eval_output_dir,
'biocoder',
agent_class,
model_name + '_maxiter_' + str(max_iterations) + eval_note,
)
eval_output_dir = str(eval_output_dir)
pathlib.Path(eval_output_dir).mkdir(parents=True, exist_ok=True)
pathlib.Path(os.path.join(eval_output_dir, 'logs')).mkdir(
parents=True, exist_ok=True
)
logger.info(f'Using evaluation output directory: {eval_output_dir}')
metadata = {
'agent_class': agent_class,
'model_name': model_name,
'max_iterations': max_iterations,
'eval_output_dir': eval_output_dir,
'start_time': time.strftime('%Y-%m-%d %H:%M:%S'),
# get the commit id of current repo for reproduciblity
'git_commit': subprocess.check_output(['git', 'rev-parse', 'HEAD'])
.decode('utf-8')
.strip(),
}
logger.info(f'Metadata: {metadata}')
with open(os.path.join(eval_output_dir, 'metadata.json'), 'w') as f:
json.dump(metadata, f)
# LIMIT EVALUATION
eval_n_limit = args.eval_n_limit
if eval_n_limit:
biocoder_tests = biocoder_tests.head(eval_n_limit)
logger.info(f'Limiting evaluation to first {eval_n_limit} instances.')
# OUTPUT FILE
output_file = os.path.join(eval_output_dir, 'output.jsonl')
logger.info(f'Writing evaluation output to {output_file}')
finished_test_case_ids = set()
if os.path.exists(output_file):
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
finished_test_case_ids.add(data['test_case_id'])
logger.warning(
f'Output file {output_file} already exists. Loaded {len(finished_test_case_ids)} finished instances.'
)
output_fp = open(output_file, 'a')
logger.info(
f'Evaluation started with Agent {agent_class}, model {model_name}, max iterations {max_iterations}.'
)
# =============================================
# filter out finished instances
new_biocoder_tests = []
for idx, instance in biocoder_tests.iterrows():
if instance.test_case_id in finished_test_case_ids:
logger.info(
f'Skipping instance {instance.test_case_id} as it is already finished.'
)
continue
new_biocoder_tests.append(instance)
biocoder_tests = pd.DataFrame(new_biocoder_tests)
logger.info(
f'Finished instances: {len(finished_test_case_ids)}, Remaining instances: {len(biocoder_tests)}'
)
# =============================================
pbar = tqdm(total=len(biocoder_tests))
# This function tracks the progress AND write the output to a JSONL file
def update_progress(future):
pbar.update(1)
output = future.result()
pbar.set_description(f'Instance {output["test_case_id"]}')
pbar.set_postfix_str(f'Test Result: {output["test_result"]}')
logger.info(
f'Finished evaluation for instance {output["test_case_id"]}: {output["test_result"]}'
)
output_fp.write(json.dumps(output) + '\n')
output_fp.flush()
# This sets the multi-processing
num_workers = args.eval_num_workers
logger.info(f'Using {num_workers} workers for evaluation.')
# This is SWE-Bench specific - CodeActAgent doesn't require mounted workspace to work
skip_workspace_mount = agent_class == 'CodeActAgent'
logger.info(f'Skipping workspace mount: {skip_workspace_mount}')
try:
with ProcessPoolExecutor(num_workers) as executor:
futures = []
# This is how we perform multi-processing
for row_idx, instance in biocoder_tests.iterrows():
future = executor.submit(
process_instance,
instance,
agent_class,
metadata,
skip_workspace_mount,
eval_output_dir,
reset_logger=bool(num_workers > 1),
)
future.add_done_callback(update_progress)
futures.append(future)
# Wait for all futures to complete
for future in futures:
future.result()
except KeyboardInterrupt:
print('KeyboardInterrupt received. Cleaning up...')
cleanup()
output_fp.close()
logger.info('Evaluation finished.')

View File

@@ -0,0 +1,37 @@
#!/bin/bash
MODEL_CONFIG=$1
AGENT=$2
EVAL_LIMIT=$3
DATASET="biocoder"
if [ -z "$AGENT" ]; then
echo "Agent not specified, use default CodeActAgent"
AGENT="CodeActAgent"
fi
# IMPORTANT: Because Agent's prompt changes fairly often in the rapidly evolving codebase of OpenDevin
# We need to track the version of Agent in the evaluation to make sure results are comparable
AGENT_VERSION=v$(poetry run python -c "import agenthub; from opendevin.controller.agent import Agent; print(Agent.get_cls('$AGENT').VERSION)")
echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "DATASET: $DATASET"
COMMAND="poetry run python evaluation/biocoder/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 10 \
--max-chars 10000000 \
--eval-num-workers 1 \
--eval-note ${AGENT_VERSION}_${DATASET}"
if [ -n "$EVAL_LIMIT" ]; then
echo "EVAL_LIMIT: $EVAL_LIMIT"
COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
fi
# Run the command
echo $COMMAND
eval $COMMAND

View File

@@ -0,0 +1,41 @@
# Gorilla APIBench Evaluation with OpenDevin
This folder contains evaluation harness we built on top of the original [Gorilla APIBench](https://github.com/ShishirPatil/gorilla) ([paper](https://arxiv.org/pdf/2305.15334)).
## Setup Environment
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local development environment for OpenDevin.
## Configure OpenDevin and your LLM
Run `make setup-config` to set up the `config.toml` file if it does not exist at the root of the workspace.
## Run Inference on APIBench Instances
Make sure your Docker daemon is running, then run this bash script:
```bash
bash evaluation/gorilla/scripts/run_infer.sh [model_config] [agent] [eval_limit] [hubs]
```
where `model_config` is mandatory, while all other arguments are optional.
`model_config`, e.g. `llm`, is the config group name for your
LLM settings, as defined in your `config.toml`.
`agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
to `CodeActAgent`.
`eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances.
By default, the script evaluates 1 instance.
`hubs`, the hub from APIBench to evaluate from. You could choose one or more from `torch` or `th` (which is abbreviation of torch), `hf` (which is abbreviation of huggingface), and `tf` (which is abbreviation of tensorflow), for `hubs`. The default is `hf,torch,tf`.
Note: in order to use `eval_limit`, you must also set `agent`; in order to use `hubs`, you must also set `eval_limit`.
Let's say you'd like to run 10 instances using `llm` and CodeActAgent on `th` test,
then your command would be:
```bash
bash evaluation/gorilla/scripts/run_infer.sh llm CodeActAgent 10 th
```

View File

@@ -0,0 +1,127 @@
# Copyright 2023 https://github.com/ShishirPatil/gorilla
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is modified from https://github.com/ShishirPatil/gorilla/blob/main/eval/eval-scripts/ast_eval_hf.py
from tree_sitter import Language, Parser
# Get all the subtrees given a root_node
def get_all_sub_trees(root_node):
node_stack = []
sub_tree_sexp_list = []
depth = 1
# text = root_node.text
node_stack.append([root_node, depth])
while len(node_stack) != 0:
cur_node, cur_depth = node_stack.pop()
if cur_node.child_count > 0:
sub_tree_sexp_list.append(
[cur_node.sexp(), cur_depth, cur_node, cur_node.children[0].text]
)
else:
sub_tree_sexp_list.append([cur_node.sexp(), cur_depth, cur_node, None])
for child_node in cur_node.children:
if len(child_node.children) != 0:
depth = cur_depth + 1
node_stack.append([child_node, depth])
return sub_tree_sexp_list
# Parse the program into AST trees
def ast_parse(candidate, lang='python'):
LANGUAGE = Language('evaluation/gorilla/my-languages.so', lang)
parser = Parser()
parser.set_language(LANGUAGE)
candidate_tree = parser.parse(bytes(candidate, 'utf8')).root_node
return candidate_tree
# Get all the arguments in the ast tree
def get_args(node):
if node.child_count == 0:
return []
args_list = []
for child in node.children[0].children[0].children[1].children:
if '=' in child.text.decode():
args_list.append(child.children[2].text)
elif (
child.text.decode() != '('
and child.text.decode() != ')'
and child.text.decode() != ','
):
args_list.append(child.text)
return args_list
# Check if there is an api match
def ast_check(candidate_subtree_list, base_tree_list):
for idx, base_tree in enumerate(base_tree_list):
if base_tree.children[0].children[0].child_count == 0:
continue
api_name = base_tree.children[0].children[0].children[0].text
for candidate_tree in candidate_subtree_list:
if candidate_tree[3] == api_name:
break
# Now we have a sub-tree
candidate_tree = candidate_tree[2]
args_list = get_args(base_tree)
if len(args_list) == 0:
continue
ast_match = True
for arg in args_list:
if arg.decode().lstrip("'").rstrip("'") not in candidate_tree.text.decode():
ast_match = False
break
if ast_match:
return idx
return -1
def ast_eval_hf(api_database, qa_pairs, ast_database, question_id, response):
# Check correctness
correct = False
hallucination = False
output = response
# Index the "api_call" domain
output = output.split('api_call')
if len(output) == 1:
api_call = output[0]
else:
# Parse the output
output = output[1].split('api_provider')[0]
if ':' not in output:
start = 0
else:
start = output.index(':')
if ')' not in output:
end = -2
else:
end = output.rindex(')')
api_call = output[start + 2 : end + 1]
# Parse the api_call into AST tree
ast_tree = ast_parse(api_call)
# Search for a subtree
ast_subtree_list = get_all_sub_trees(ast_tree)
# Check which ast tree is matching
database_index = ast_check(ast_subtree_list, ast_database)
# We cannot index this ast in our database
if database_index == -1:
hallucination = True
# We index our reference api_call
ref_api_call = api_database[database_index]
# Check for functionality
if ref_api_call['domain'] == qa_pairs[question_id - 1]['domain']:
correct = True
return correct, hallucination

View File

@@ -0,0 +1,127 @@
# Copyright 2023 https://github.com/ShishirPatil/gorilla
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is modified from https://github.com/ShishirPatil/gorilla/blob/main/eval/eval-scripts/ast_eval_tf.py
from tree_sitter import Language, Parser
# Get all the subtrees given a root_node
def get_all_sub_trees(root_node):
node_stack = []
sub_tree_sexp_list = []
depth = 1
# text = root_node.text
node_stack.append([root_node, depth])
while len(node_stack) != 0:
cur_node, cur_depth = node_stack.pop()
if cur_node.child_count > 0:
sub_tree_sexp_list.append(
[cur_node.sexp(), cur_depth, cur_node, cur_node.children[0].text]
)
else:
sub_tree_sexp_list.append([cur_node.sexp(), cur_depth, cur_node, None])
for child_node in cur_node.children:
if len(child_node.children) != 0:
depth = cur_depth + 1
node_stack.append([child_node, depth])
return sub_tree_sexp_list
# Parse the program into AST trees
def ast_parse(candidate, lang='python'):
LANGUAGE = Language('evaluation/gorilla/my-languages.so', lang)
parser = Parser()
parser.set_language(LANGUAGE)
candidate_tree = parser.parse(bytes(candidate, 'utf8')).root_node
return candidate_tree
# Get all the arguments in the ast tree
def get_args(node):
if node.child_count == 0:
return []
args_list = []
for child in node.children[0].children[0].children[1].children:
if 'model=' in child.text.decode() or 'model =' in child.text.decode():
args_list.append(child.children[2].text)
elif (
child.text.decode() != '('
and child.text.decode() != ')'
and child.text.decode() != ','
):
args_list.append(child.text)
return args_list
# Check if there is an api match
def ast_check(candidate_subtree_list, base_tree_list):
for idx, base_tree in enumerate(base_tree_list):
if base_tree.children[0].children[0].child_count == 0:
continue
api_name = base_tree.children[0].children[0].children[0].text
for candidate_tree in candidate_subtree_list:
if candidate_tree[3] == api_name:
break
# Now we have a sub-tree
candidate_tree = candidate_tree[2]
args_list = get_args(base_tree)
if len(args_list) == 0:
continue
ast_match = True
for arg in args_list:
if arg.decode().lstrip("'").rstrip("'") not in candidate_tree.text.decode():
ast_match = False
break
if ast_match:
return idx
return -1
def ast_eval_tf(api_database, qa_pairs, ast_database, question_id, response):
# Check correctness
correct = False
hallucination = False
output = response
# Index the "api_call" domain
output = output.split('api_call')
if len(output) == 1:
api_call = output[0]
else:
# Parse the output
output = output[1].split('api_provider')[0]
if ':' not in output:
start = 0
else:
start = output.index(':')
if ')' not in output:
end = -2
else:
end = output.rindex(')')
api_call = output[start + 2 : end + 1]
# Parse the api_call into AST tree
ast_tree = ast_parse(api_call)
# Search for a subtree
ast_subtree_list = get_all_sub_trees(ast_tree)
# Check which ast tree is matching
database_index = ast_check(ast_subtree_list, ast_database)
# We cannot index this ast in our database
if database_index == -1:
hallucination = True
# We index our reference api_call
ref_api_call = api_database[database_index]
# Check for functionality
if ref_api_call['domain'] == qa_pairs[question_id - 1]['domain']:
correct = True
return correct, hallucination

View File

@@ -0,0 +1,123 @@
# Copyright 2023 https://github.com/ShishirPatil/gorilla
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is modified from https://github.com/ShishirPatil/gorilla/blob/main/eval/eval-scripts/ast_eval_th.py
from tree_sitter import Language, Parser
# Get all the subtrees given a root_node
def get_all_sub_trees(root_node):
node_stack = []
sub_tree_sexp_list = []
depth = 1
# text = root_node.text
node_stack.append([root_node, depth])
while len(node_stack) != 0:
cur_node, cur_depth = node_stack.pop()
if cur_node.child_count > 0:
sub_tree_sexp_list.append(
[cur_node.sexp(), cur_depth, cur_node, cur_node.children[0].text]
)
else:
sub_tree_sexp_list.append([cur_node.sexp(), cur_depth, cur_node, None])
for child_node in cur_node.children:
if len(child_node.children) != 0:
depth = cur_depth + 1
node_stack.append([child_node, depth])
return sub_tree_sexp_list
# Parse the program into AST trees
def ast_parse(candidate, lang='python'):
LANGUAGE = Language('evaluation/gorilla/my-languages.so', lang)
parser = Parser()
parser.set_language(LANGUAGE)
candidate_tree = parser.parse(bytes(candidate, 'utf8')).root_node
return candidate_tree
# Get all the arguments in the ast tree
def get_args(node):
if node.child_count == 0:
return []
args_list = []
for child in node.children[0].children[0].children[1].children:
if 'repo_or_dir' in child.text.decode() or 'model' in child.text.decode():
args_list.append(child.children[2].text)
return args_list
# Check if there is an api match
def ast_check(candidate_subtree_list, base_tree_list):
for idx, base_tree in enumerate(base_tree_list):
if base_tree.children[0].children[0].child_count == 0:
continue
api_name = base_tree.children[0].children[0].children[0].text
for candidate_tree in candidate_subtree_list:
if candidate_tree[3] == api_name:
break
# Now we have a sub-tree
candidate_tree = candidate_tree[2]
args_list = get_args(base_tree)
if len(args_list) == 0:
continue
ast_match = True
for arg in args_list:
if arg.decode().lstrip("'").rstrip("'") not in candidate_tree.text.decode():
ast_match = False
break
if ast_match:
return idx
return -1
def process_response(question_id, output, api_database, qa_pairs, ast_database):
# Index the "api_call" domain
output = output.split('api_call')
if len(output) == 1:
return False, False
else:
output = output[1].split('api_provider')[0]
if ':' not in output:
start = 0
else:
start = output.index(':')
if ')' not in output:
end = -2
else:
end = output.rindex(')')
api_call = output[start + 2 : end + 1]
# Parse the api_call into AST tree
ast_tree = ast_parse(api_call)
# Search for a subtree
ast_subtree_list = get_all_sub_trees(ast_tree)
# Check which ast tree is matching
database_index = ast_check(ast_subtree_list, ast_database)
# We cannot index this ast in our database
if database_index == -1:
return False, True
# We index our reference api_call
ref_api_call = api_database[database_index]
# Check for functionality
if ref_api_call['domain'] == qa_pairs[question_id - 1]['domain']:
return True, False
else:
return False, False
def ast_eval_th(api_database, qa_pairs, ast_database, question_id, response):
# Check correctness
return process_response(question_id, response, api_database, qa_pairs, ast_database)

View File

@@ -0,0 +1,355 @@
import asyncio
import json
import logging
import multiprocessing as mp
import os
import pathlib
import subprocess
import time
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm
from utils import encode_question, get_data
from opendevin.controller.state.state import State
from opendevin.core.config import config, get_llm_config_arg, get_parser
from opendevin.core.logger import get_console_handler
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import main
from opendevin.events.action import MessageAction
from opendevin.events.serialization.event import event_to_dict
def cleanup():
print('Cleaning up child processes...')
for process in mp.active_children():
print(f'Terminating child process: {process.name}')
process.terminate()
process.join()
def codeact_user_response(state: State) -> str:
msg = (
#'Please continue working on the task on whatever approach you think is suitable.\n'
'Please run the following command: <execute_bash> exit </execute_bash>.\n'
#'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.\n'
)
if state.history:
user_msgs = [
action
for action, _ in state.history
if isinstance(action, MessageAction) and action.source == 'user'
]
if len(user_msgs) >= 2:
# let the agent know that it can give up when it has tried 3 times
return (
msg
+ 'If you want to give up, run: <execute_bash> exit </execute_bash>.\n'
)
return msg
def monologue_user_response(state: State) -> str:
raise NotImplementedError('MonologueAgent should never ask for user responses.')
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': codeact_user_response,
'MonologueAgent': monologue_user_response,
}
AGENT_CLS_TO_INST_SUFFIX = {
'CodeActAgent': 'When you think you have completed the request, please run the following command: <execute_bash> exit </execute_bash>.\n'
}
def process_instance(
question_id, question, agent_class, metadata, reset_logger: bool = True
):
# create process-specific workspace dir
# we will create a workspace directory for EACH process
# so that different agent don't interfere with each other.
old_workspace_mount_path = config.workspace_mount_path
try:
workspace_mount_path = os.path.join(
config.workspace_mount_path, '_eval_workspace'
)
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
config.workspace_mount_path = workspace_mount_path
# Setup the logger properly, so you can run multi-processing to parallize the evaluation
eval_output_dir = metadata['eval_output_dir']
if reset_logger:
# Set up logger
log_file = os.path.join(
eval_output_dir, 'logs', f'instance_{question_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {question_id}.\nLOG: tail -f {log_file}'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
# Prepare instruction
instruction = encode_question(question, metadata['hub'])
instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX.get(agent_class, '')
# logger.info(f'Instruction:\n{instruction}', extra={'msg_type': 'OBSERVATION'})
# Here's how you can run the agent (similar to the `main` function) and get the final task state
state: State = asyncio.run(
main(
instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
agent_class
),
)
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
model_answer_raw = ''
for act, _ in reversed(state.history):
if isinstance(act, MessageAction) and act.source == 'agent':
model_answer_raw = act.content
break
# attempt to parse model_answer
_, _, ast_eval = get_data(metadata['hub'])
correct, hallucination = ast_eval(question_id, model_answer_raw)
metrics = state.metrics.get() if state.metrics else None
logger.info(
f'Final message: {model_answer_raw} | Correctness: {correct} | Hallucination: {hallucination}'
)
# Save the output
output = {
'question_id': question_id,
'text': model_answer_raw,
'correct': correct,
'hallucination': hallucination,
'answer_id': 'None',
'model_id': metadata['model_name'],
'metadata': metadata,
'history': [
(event_to_dict(action), event_to_dict(obs))
for action, obs in state.history
],
'metrics': metrics,
'error': state.error if state and state.error else None,
}
except Exception:
logger.error('Process instance failed')
raise
finally:
config.workspace_mount_path = old_workspace_mount_path
return output
if __name__ == '__main__':
parser = get_parser()
parser.add_argument(
'--hubs',
type=str,
help='Which hubs to evaluate from APIBench. APIBench contains 3 hubs, namely huggingface, torch, and tensorflow. You could choose one or more from hf, torch, or tf, separated by commas. For example, the default is --hub hf,torch,tf.',
default='hf,torch,tf',
)
args, _ = parser.parse_known_args()
if args.directory:
config.workspace_base = os.path.abspath(args.directory)
print(f'Setting workspace base to {config.workspace_base}')
# Check https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/README.md#configure-opendevin-and-your-llm
# for details of how to set `llm_config`
if args.llm_config:
specified_llm_config = get_llm_config_arg(args.llm_config)
if specified_llm_config:
config.llm = specified_llm_config
logger.info(f'Config for evaluation: {config}')
agent_class = args.agent_cls
assert (
agent_class in AGENT_CLS_TO_FAKE_USER_RESPONSE_FN
), f'Unsupported agent class: {agent_class}'
model_name = config.llm.model.split('/')[-1]
max_iterations = args.max_iterations
eval_note = ''
if args.eval_note is not None:
eval_note += '_N_' + args.eval_note
eval_output_dir = os.path.join(
args.eval_output_dir,
'gorilla',
agent_class,
model_name + '_maxiter_' + str(max_iterations) + eval_note,
)
pathlib.Path(eval_output_dir).mkdir(parents=True, exist_ok=True)
pathlib.Path(os.path.join(eval_output_dir, 'logs')).mkdir(
parents=True, exist_ok=True
)
logger.info(f'Using evaluation output directory: {eval_output_dir}')
hubs = []
if 'hf' in args.hubs:
hubs.append('hf')
if 'torch' in args.hubs or 'th' in args.hubs:
hubs.append('torch')
if 'tf' in args.hubs:
hubs.append('tf')
if hubs == []:
raise ValueError('Please choose at least one from hf, torch, and tf for hubs.')
for hub in hubs:
logger.info(f'Evaluating APIBench {hub} test')
questions, question_ids, ast_eval = get_data(hub)
# TEST METADATA
metadata = {
'hub': hub,
'agent_class': agent_class,
'model_name': model_name,
'max_iterations': max_iterations,
'eval_output_dir': eval_output_dir,
'start_time': time.strftime('%Y-%m-%d %H:%M:%S'),
# get the commit id of current repo for reproduciblity
'git_commit': subprocess.check_output(['git', 'rev-parse', 'HEAD'])
.decode('utf-8')
.strip(),
}
logger.info(f'Metadata: {metadata}')
with open(os.path.join(eval_output_dir, f'metadata_{hub}.json'), 'w') as f:
json.dump(metadata, f)
# LIMIT EVALUATION
eval_n_limit = args.eval_n_limit
if eval_n_limit:
questions = questions[: (eval_n_limit // len(hubs))]
question_ids = question_ids[: (eval_n_limit // len(hubs))]
logger.info(
f'Limiting evaluation to a total of first {eval_n_limit} instances -> first {eval_n_limit//len(hubs)} instances per hub.'
)
output_file = os.path.join(eval_output_dir, f'output_{model_name}_{hub}.jsonl')
logger.info(f'Writing evaluation output to {output_file}')
finished_task_ids = set()
if os.path.exists(output_file):
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
for i in range(len(question_ids)):
if question_ids[i] == int(data['question_id']):
finished_task_ids.add(data['question_id'])
logger.warning(
f'Output file {output_file} already exists. Loaded {len(finished_task_ids)} finished instances.'
)
output_fp = open(output_file, 'a')
logger.info(
f'Evaluation started with Agent {agent_class}, model {model_name}, max iterations {max_iterations}.'
)
# =============================================
# filter out finished instances
new_questions = []
new_question_ids = []
for i in range(len(question_ids)):
if question_ids[i] in finished_task_ids:
logger.info(
f'Skipping instance {question_ids[i]} as it is already finished.'
)
continue
new_questions.append(questions[i])
new_question_ids.append(question_ids[i])
finished_task_number = len(finished_task_ids)
questions = new_questions
question_ids = new_question_ids
logger.info(
f'Finished instances: {finished_task_number}, Remaining instances: {len(question_ids)}'
)
# =============================================
pbar = tqdm(total=len(question_ids))
# This function tracks the progress AND write the output to a JSONL file
def update_progress(future, pbar, output_fp, finished_task_ids):
pbar.update(1)
output = future.result()
pbar.set_description(f'Instance {output["question_id"]}')
pbar.set_postfix_str(f'Test Result: {output["correct"]}')
logger.info(
f'Finished evaluation for instance {output["question_id"]}: {output["correct"]}'
)
output_fp.write(json.dumps(output) + '\n')
output_fp.flush()
finished_task_ids.add(output['question_id'])
# This sets the multi-processing
num_workers = args.eval_num_workers
logger.info(f'Using {num_workers} workers for evaluation.')
try:
with ProcessPoolExecutor(num_workers) as executor:
futures = []
# This is how we perform multi-processing
for i in range(len(question_ids)):
try:
question_id = question_ids[i]
question = questions[i]
future = executor.submit(
process_instance,
question_id,
question,
agent_class,
metadata,
reset_logger=bool(num_workers > 1),
)
future.add_done_callback(
update_progress, pbar, output_fp, finished_task_ids
)
futures.append(future)
except Exception:
continue
# Wait for all futures to complete
for future in futures:
try:
future.result()
except Exception:
continue
except KeyboardInterrupt:
logger.info('KeyboardInterrupt received. Cleaning up...')
cleanup()
output_fp.close()
total_correct = 0
total_hallucination = 0
output = []
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
output.append(data)
if int(data['question_id']) in finished_task_ids:
if str(data['correct']).lower() == 'true':
total_correct += 1
if str(data['hallucination']).lower() == 'true':
total_hallucination += 1
# sort all output by question_id
output = sorted(output, key=lambda x: x['question_id'])
with open(output_file, 'w') as f:
for dat in output:
f.write(json.dumps(dat) + '\n')
f.flush()
logger.info(
f'Evaluation finished for {hub}. Total: {len(question_ids)+finished_task_number}; Correct: {total_correct}; Hallucination: {total_hallucination}. Accuracy: {total_correct / (len(question_ids)+finished_task_number)}'
)

View File

@@ -0,0 +1,42 @@
#!/bin/bash
MODEL_CONFIG=$1
AGENT=$2
EVAL_LIMIT=$3
HUBS=$4
if [ -z "$AGENT" ]; then
echo "Agent not specified, use default CodeActAgent"
AGENT="CodeActAgent"
fi
if [ -z "$HUBS" ]; then
HUBS="hf,torch,tf"
echo "Hubs not specified, use default $HUBS"
fi
# IMPORTANT: Because Agent's prompt changes fairly often in the rapidly evolving codebase of OpenDevin
# We need to track the version of Agent in the evaluation to make sure results are comparable
AGENT_VERSION=v$(poetry run python -c "import agenthub; from opendevin.controller.agent import Agent; print(Agent.get_cls('$AGENT').VERSION)")
echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "HUBS: $HUBS"
COMMAND="poetry run python evaluation/gorilla/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 30 \
--hubs $HUBS \
--data-split validation \
--max-chars 10000000 \
--eval-num-workers 1 \
--eval-note ${AGENT_VERSION}_${LEVELS}"
if [ -n "$EVAL_LIMIT" ]; then
echo "EVAL_LIMIT: $EVAL_LIMIT"
COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
fi
# Run the command
eval $COMMAND

101
evaluation/gorilla/utils.py Normal file
View File

@@ -0,0 +1,101 @@
import json
from functools import partial
import requests
from ast_eval_hf import ast_eval_hf, ast_parse
from ast_eval_tf import ast_eval_tf
from ast_eval_th import ast_eval_th
# This function is modified from Gorilla's APIBench implementations (https://github.com/ShishirPatil/gorilla/blob/main/eval/get_llm_responses.py).
def encode_question(question, api_name):
"""Encode multiple prompt instructions into a single string."""
prompts = []
if api_name == 'torch':
api_name = 'torchhub'
domains = '1. $DOMAIN is inferred from the task description and should include one of {Classification, Semantic Segmentation, Object Detection, Audio Separation, Video Classification, Text-to-Speech}.'
elif api_name == 'hf':
api_name = 'huggingface'
domains = '1. $DOMAIN should include one of {Multimodal Feature Extraction, Multimodal Text-to-Image, Multimodal Image-to-Text, Multimodal Text-to-Video, \
Multimodal Visual Question Answering, Multimodal Document Question Answer, Multimodal Graph Machine Learning, Computer Vision Depth Estimation,\
Computer Vision Image Classification, Computer Vision Object Detection, Computer Vision Image Segmentation, Computer Vision Image-to-Image, \
Computer Vision Unconditional Image Generation, Computer Vision Video Classification, Computer Vision Zero-Shor Image Classification, \
Natural Language Processing Text Classification, Natural Language Processing Token Classification, Natural Language Processing Table Question Answering, \
Natural Language Processing Question Answering, Natural Language Processing Zero-Shot Classification, Natural Language Processing Translation, \
Natural Language Processing Summarization, Natural Language Processing Conversational, Natural Language Processing Text Generation, Natural Language Processing Fill-Mask,\
Natural Language Processing Text2Text Generation, Natural Language Processing Sentence Similarity, Audio Text-to-Speech, Audio Automatic Speech Recognition, \
Audio Audio-to-Audio, Audio Audio Classification, Audio Voice Activity Detection, Tabular Tabular Classification, Tabular Tabular Regression, \
Reinforcement Learning Reinforcement Learning, Reinforcement Learning Robotics }'
elif api_name == 'tf':
api_name = 'tensorhub'
domains = '1. $DOMAIN is inferred from the task description and should include one of {text-sequence-alignment, text-embedding, text-language-model, text-preprocessing, text-classification, text-generation, text-question-answering, text-retrieval-question-answering, text-segmentation, text-to-mel, image-classification, image-feature-vector, image-object-detection, image-segmentation, image-generator, image-pose-detection, image-rnn-agent, image-augmentation, image-classifier, image-style-transfer, image-aesthetic-quality, image-depth-estimation, image-super-resolution, image-deblurring, image-extrapolation, image-text-recognition, image-dehazing, image-deraining, image-enhancemenmt, image-classification-logits, image-frame-interpolation, image-text-detection, image-denoising, image-others, video-classification, video-feature-extraction, video-generation, video-audio-text, video-text, audio-embedding, audio-event-classification, audio-command-detection, audio-paralinguists-classification, audio-speech-to-text, audio-speech-synthesis, audio-synthesis, audio-pitch-extraction}'
else:
print('Error: API name is not supported.')
prompt = (
question
+ '\nWrite a python program in 1 to 2 lines to call API in '
+ api_name
+ '.\n\nThe answer should follow the format: <<<domain>>> $DOMAIN, <<<api_call>>>: $API_CALL, <<<api_provider>>>: $API_PROVIDER, <<<explanation>>>: $EXPLANATION, <<<code>>>: $CODE}. Here are the requirements:\n'
+ domains
+ '\n2. The $API_CALL should have only 1 line of code that calls api.\n3. The $API_PROVIDER should be the programming framework used.\n4. $EXPLANATION should be a step-by-step explanation.\n5. The $CODE is the python code.\n6. Do not repeat the format in your answer.'
)
# prompts.append({"role": "system", "content": ""})
prompts = (
'You are a helpful API writer who can write APIs based on requirements.\n'
+ prompt
)
return prompts
def get_data(hub):
if hub == 'hf':
question_data = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/eval/eval-data/questions/huggingface/questions_huggingface_0_shot.jsonl'
api_dataset = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/api/huggingface_api.jsonl'
apibench = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/apibench/huggingface_eval.json'
ast_eval = ast_eval_hf
if hub == 'torch':
question_data = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/eval/eval-data/questions/torchhub/questions_torchhub_0_shot.jsonl'
api_dataset = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/api/torchhub_api.jsonl'
apibench = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/apibench/torchhub_eval.json'
ast_eval = ast_eval_th
if hub == 'tf':
question_data = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/eval/eval-data/questions/tensorflowhub/questions_tensorflowhub_0_shot.jsonl'
api_dataset = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/api/tensorflowhub_api.jsonl'
apibench = 'https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/apibench/tensorflow_eval.json'
ast_eval = ast_eval_tf
# get questions and question_ids
questions = []
question_ids = []
question_data = requests.get(question_data)
if question_data.status_code == 200:
lines = question_data.text.splitlines()
for line in lines:
questions.append(json.loads(line)['text'])
question_ids.append(json.loads(line)['question_id'])
# get the api datasest
api_database = []
api_dataset = requests.get(api_dataset)
if api_dataset.status_code == 200:
lines = api_dataset.text.splitlines()
for line in lines:
api_database.append(json.loads(line))
# get the question answer pair datasest
qa_pairs = []
apibench = requests.get(apibench)
if apibench.status_code == 200:
lines = apibench.text.splitlines()
for line in lines:
qa_pairs.append(json.loads(line)['api_data'])
# Parse all apis to ast trees
ast_database = []
for data in api_database:
ast_tree = ast_parse(data['api_call'])
ast_database.append(ast_tree)
ast_eval = partial(ast_eval, api_database, qa_pairs, ast_database)
return questions, question_ids, ast_eval

70
evaluation/gpqa/README.md Normal file
View File

@@ -0,0 +1,70 @@
# Evaluating GPQA (A Graduate-Level Google-Proof Q&A Benchmark) with OpenDevin
Implements the evaluation of agents on the GPQA benchmark introduced in [GPQA: A Graduate-Level Google-Proof Q&A Benchmark](https://arxiv.org/abs/2308.07124).
This code implements the evaluation of agents on the GPQA Benchmark with Open Book setting.
- The benchmark consists of 448 high-quality and extremely difficult multiple-choice questions in the domains of biology, physics, and chemistry. The questions are intentionally designed to be "Google-proof," meaning that even highly skilled non-expert validators achieve only 34% accuracy despite unrestricted access to the web.
- Even experts in the corresponding domains achieve only 65% accuracy.
- State-of-the-art AI systems achieve only 39% accuracy on this challenging dataset.
**Note**
Accurate solving of above graduate level questions would require both tool use (e.g., python for calculations) and web-search for finding related facts as information required for the questions might not be part of the LLM knowledge / training data.
Further references:
- https://arxiv.org/pdf/2311.12022
- https://paperswithcode.com/dataset/gpqa
- https://github.com/idavidrein/gpqa
## TODOs
- [ ] Add support for other agents (currently only tested on `CodeActAgent`)
- [ ] Complete full benchmark evaluation
- [ ] Fix intermittent `BrowserException: Failed to start browser environment` error
## Setup Environment
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local develop environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
enable_auto_lint = true
# TODO: Change these to the model you want to evaluate
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[eval_azure_openai_compatible_model]
model = "AZURE_OPENAI_EXACT_DEPLOYMENT_MODEL_NAME"
base_url = "AZURE_OPENAI_ENDPOINT"
api_key = "AZURE_ENDPOINT_API_KEY"
temperature = 0.0
```
## Run Inference on GPQA Benchmark
'gpqa_main', 'gqpa_diamond', 'gpqa_experts', 'gpqa_extended' -- data split options
From the root of the OpenDevin repo, run the following command:
```bash
./evaluation/gpqa/scripts/run_infer.sh [model_config_name] [num_samples_eval] [data_split] [AgentClass]
```
You can replace `model_config_name` with any model you set up in `config.toml`.
- `model_config_name`: The model configuration name from `config.toml` that you want to evaluate.
- `num_samples_eval`: Number of samples to evaluate (useful for testing and debugging).
- `data_split`: The data split to evaluate on. Must be one of `gpqa_main`, `gqpa_diamond`, `gpqa_experts`, `gpqa_extended`. Defaults to `gpqa_diamond` as done in the paper.
- `AgentClass`: The agent class to use for evaluation. Currently only supports `CodeActAgent` for CodeActAgent.
## Benchmark Evaluation Results
- [] TODO: Finish the evaluation run across the entire benchmark and compile results

View File

View File

@@ -0,0 +1,468 @@
"""
Overview:
This code implements the evaluation of agents on the GPQA Benchmark with Open Book setting.
- The benchmark consists of 448 high-quality and extremely difficult multiple-choice questions in the domains of biology, physics, and chemistry. The questions are intentionally designed to be "Google-proof," meaning that even highly skilled non-expert validators achieve only 34% accuracy despite unrestricted access to the web.
- Even experts in the corresponding domains achieve only 65% accuracy.
- State-of-the-art AI systems achieve only 39% accuracy on this challenging dataset.
Accurate solving of above graduate level questions would require both tool use (e.g., python for calculations) and web-search for finding related facts as information required for the questions might not be part of the LLM knowledge / training data.
Further references:
- https://arxiv.org/pdf/2311.12022
- https://paperswithcode.com/dataset/gpqa
- https://github.com/idavidrein/gpqa
TODOs:
- Add evaluation on other Agent classes (e.g., MonologueAgent)
- Batch inference and evaluation of agents on the GPQA Benchmark.
"""
import asyncio
import json
import logging
import multiprocessing as mp
import os
import pathlib
import random
import re
import subprocess
import time
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
from datasets import load_dataset
from tqdm import tqdm
from opendevin.controller.state.state import State
from opendevin.core.config import config, get_llm_config_arg, get_parser
from opendevin.core.logger import get_console_handler
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import main
from opendevin.events.action import MessageAction
from opendevin.events.serialization.event import event_to_dict
def cleanup():
logger.info('Cleaning up child processes...')
for process in mp.active_children():
logger.info(f'Terminating child process: {process.name}')
process.terminate()
process.join()
def codeact_user_response(state: State) -> str:
msg = (
'Please continue working on the task on whatever approach you think is suitable.\n'
'Feel free to use all tools for calculations and solving the problem, and web-search for finding relevant facts during the process if needed\n'
'If you think you have reliably finished solving the problem, first generate a message reporting the final concise answer to the user. Once that is done, please run the following command: <execute_bash> exit </execute_bash>.\n'
'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP TO SOLVE THIS TASK.\n'
)
if state.history:
user_msgs = [
action
for action, _ in state.history
if isinstance(action, MessageAction) and action.source == 'user'
]
if len(user_msgs) >= 2:
# let the agent know that it can give up when it has tried 3 times
return (
msg
+ 'If you want to give up, just generate a final answer message to the user and in the next turn --> run: <execute_bash> exit </execute_bash>.\n'
)
return msg
def monologue_user_response(state: State) -> str:
raise NotImplementedError('MonologueAgent should never ask for user responses.')
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': codeact_user_response,
'MonologueAgent': monologue_user_response,
}
AGENT_CLS_TO_INST_SUFFIX = {
'CodeActAgent': '\n\n SUPER IMPORTANT: When you think you have solved the question, first report it back to the user in the requested format. Only once that is done, in the next turn, please run the following command: <execute_bash> exit </execute_bash>.\n'
}
def parse_final_answer(final_answer: str) -> str:
"""
Parse the final answer from the final message generated by the agent
to extract the final answer. The final answer is usually enclosed in the format:
<<FINAL_ANSWER||
<insert correct answer here>
||FINAL_ANSWER>>
"""
pattern = re.compile(r'<<FINAL_ANSWER\|\|(.*?)\|\|FINAL_ANSWER>>', re.DOTALL)
match = pattern.search(final_answer)
if match:
return match.group(1).strip()
else:
return 'No final answer found in the provided string.'
def compare_answers(predicted_answer, ground_truth):
"""
Compare the predicted answer with the ground truth answer
"""
return predicted_answer == ground_truth
def get_test_result(model_output, ground_truth):
"""
Implements the evaluation logic for GPQA
Checks if the output of a given instance is correct (as per the ground truth)
"""
# parse the final answer from model output
predicted_answer = parse_final_answer(model_output)
# check if the model output matches the ground truth
result = compare_answers(predicted_answer, ground_truth)
return result
def convert_instance_dict(instance):
"""
Used for preprocessing the hf dataset into a format that can be used by the agent.
Reads and extracts relevant information from the dataset instance.
"""
out_instance_dict = {}
out_instance_dict['question'] = instance['Question']
correct_answer = instance['Correct Answer']
out_instance_dict['choices'] = [
correct_answer,
instance['Incorrect Answer 1'],
instance['Incorrect Answer 2'],
instance['Incorrect Answer 3'],
]
# Randomize the order of choices
random.shuffle(out_instance_dict['choices'])
# Find the index of the correct answer after shuffling and store it as a letter (A/B/C/D)
correct_index = out_instance_dict['choices'].index(correct_answer)
correct_letter = chr(
65 + correct_index
) # Convert index (0-3) to corresponding letter (A-D)
out_instance_dict['correct_solution'] = correct_letter
return out_instance_dict
def process_instance(
instance: dict,
agent_class: str,
metadata: dict,
skip_workspace_mount: bool,
eval_output_dir: str,
reset_logger: bool = True,
):
"""
Process a single instance from the dataset
"""
old_workspace_mount_path = config.workspace_mount_path
old_workspace_base = config.workspace_base
try:
workspace_mount_path = os.path.join(
config.workspace_mount_path, '_eval_workspace'
)
# create process-specific workspace dir
# if `not skip_workspace_mount` - we will create a workspace directory for EACH process
# so that different agent don't interfere with each other.
skip_workspace_mount = False
if not skip_workspace_mount:
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
# reset workspace to config
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
# workspace_mount_path = os.path.join(config.workspace_mount_path, '_eval_workspace')
# workspace_mount_path = os.path.abspath(workspace_mount_path)
# # create process-specific workspace dir
# # if `not skip_workspace_mount` - we will create a workspace directory for EACH process
# # so that different agent don't interfere with each other.
# if not skip_workspace_mount:
# workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
# pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
eval_output_dir, 'logs', f'instance_{instance.instance_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {instance.instance_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
else:
logger.info(f'Starting evaluation for instance {instance.instance_id}.')
if not skip_workspace_mount:
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
# ======= Run the agent on the instance =======
# Prepare instruction for the agent using suggested format in gpqa codebase
instruction = f"""
What is the correct answer to this question:\n
{instance['question']}\n
Choices:\n
(A) {instance['choices'][0]}\n
(B) {instance['choices'][1]}\n
(C) {instance['choices'][2]}\n
(D) {instance['choices'][3]}\n
\n\n
MOST IMPORTANT: Format your response as follows:
<<FINAL_ANSWER||
<insert correct answer here, must be one of A, B, C, D> (Please dont use any additional characters. Just the letter of the correct answer (A/B/C/D).)
||FINAL_ANSWER>>
Additional Instructions:
- You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.
"""
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX.get(agent_class, '')
# Here's how you can run the agent (similar to the `main` function) and get the final task state
state: State = asyncio.run(
main(
instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
agent_class
),
)
)
# ======= Attempt to evaluate the agent's edits =======
# get the final message from the state history (default to None if not found)
final_message = next(
(
act.content
for act in reversed(state.history)
if isinstance(act, MessageAction)
),
None,
)
logger.info(f'Final message generated by the agent: {final_message}')
test_result = get_test_result(final_message, instance.correct_solution)
# If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
# Save the output
output = {
'task_id': instance.task_id,
'instance_id': instance.instance_id,
'instruction': instruction,
'metadata': metadata,
'history': [
(event_to_dict(action), event_to_dict(obs))
for action, obs in state.history
],
'metrics': metrics,
'error': state.error if state and state.error else None,
'test_result': test_result,
}
except Exception:
logger.error('Process instance failed')
raise
finally:
config.workspace_mount_path = old_workspace_mount_path
config.workspace_base = old_workspace_base
return output
if __name__ == '__main__':
parser = get_parser()
# data split must be one of 'gpqa_main', 'gqpa_diamond', 'gpqa_experts', 'gpqa_extended'
parser.add_argument(
'--data-split',
type=str,
choices=['gpqa_main', 'gpqa_diamond', 'gpqa_experts', 'gpqa_extended'],
default='gpqa_diamond',
help='data split to evaluate, eg. gpqa_diamond',
)
args, _ = parser.parse_known_args()
# NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
# so we don't need to manage file uploading to OpenDevin's repo
dataset = load_dataset('Idavidrein/gpqa', args.data_split)
gpqa_dataset = dataset['train']
# preprocess the dataset
gpqa_dataset = gpqa_dataset.map(convert_instance_dict)
gpqa_dataset = gpqa_dataset.to_pandas()
# Add a new column 'instance_id' with the index
gpqa_dataset['instance_id'] = gpqa_dataset.index
gpqa_dataset['task_id'] = gpqa_dataset.index
# gpqa_dataset = dataset['train'].to_pandas().sort_values(by='id').reset_index(drop=True)
# Check https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/README.md#configure-opendevin-and-your-llm
# for details of how to set `llm_config`
if args.llm_config:
specified_llm_config = get_llm_config_arg(args.llm_config)
if specified_llm_config:
config.llm = specified_llm_config
logger.info(f'Config for evaluation: {config}')
# TEST METADATA
agent_class = args.agent_cls
assert (
agent_class in AGENT_CLS_TO_FAKE_USER_RESPONSE_FN
), f'Unsupported agent class: {agent_class}'
model_name = config.llm.model.split('/')[-1]
max_iterations = args.max_iterations
eval_note = ''
if args.eval_note is not None:
eval_note += '_N_' + args.eval_note
eval_output_dir = os.path.join(
args.eval_output_dir,
'gpqa',
agent_class,
model_name + '_maxiter_' + str(max_iterations) + eval_note,
)
pathlib.Path(eval_output_dir).mkdir(parents=True, exist_ok=True)
pathlib.Path(os.path.join(eval_output_dir, 'logs')).mkdir(
parents=True, exist_ok=True
)
logger.info(f'Using evaluation output directory: {eval_output_dir}')
metadata = {
'agent_class': agent_class,
'model_name': model_name,
'max_iterations': max_iterations,
'eval_output_dir': eval_output_dir,
'start_time': time.strftime('%Y-%m-%d %H:%M:%S'),
# get the commit id of current repo for reproduciblity
'git_commit': subprocess.check_output(['git', 'rev-parse', 'HEAD'])
.decode('utf-8')
.strip(),
}
logger.info(f'Metadata: {metadata}')
with open(os.path.join(eval_output_dir, 'metadata.json'), 'w') as f:
json.dump(metadata, f)
# LIMIT EVALUATION
eval_n_limit = args.eval_n_limit # NOTE: This is useful for debugging and testing using a smaller subset of the dataset
if eval_n_limit:
# start_index = 20
# gpqa_dataset = gpqa_dataset.iloc[start_index:]
gpqa_dataset = gpqa_dataset.head(eval_n_limit)
logger.info(f'Limiting evaluation to first {eval_n_limit} instances.')
logger.info('#############################################')
logger.info(f'{eval_n_limit} instances will be evaluated.')
logger.info('#############################################')
# OUTPUT FILE
output_file = os.path.join(eval_output_dir, 'output.jsonl')
logger.info(f'Writing evaluation output to {output_file}')
finished_instance_ids = set()
if os.path.exists(output_file):
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
finished_instance_ids.add(data['instance_id'])
logger.warning(
f'Output file {output_file} already exists. Loaded {len(finished_instance_ids)} finished instances.'
)
output_fp = open(output_file, 'a')
logger.info(
f'Evaluation started with Agent {agent_class}, model {model_name}, max iterations {max_iterations}.'
)
# =============================================
# filter out finished instances
new_gpqa_dataset = []
for idx, instance in gpqa_dataset.iterrows():
# instance = convert_instance_dict(instance) # preprocessing
if instance.instance_id in finished_instance_ids:
logger.info(
f'Skipping instance {instance.instance_id} as it is already finished.'
)
continue
new_gpqa_dataset.append(instance)
gpqa_dataset = pd.DataFrame(new_gpqa_dataset)
logger.info(
f'Finished instances: {len(finished_instance_ids)}, Remaining instances: {len(gpqa_dataset)}'
)
# =============================================
pbar = tqdm(total=len(gpqa_dataset))
# This function tracks the progress AND write the output to a JSONL file
def update_progress(future):
pbar.update(1)
output = future.result()
pbar.set_description(f'Instance {output["instance_id"]}')
pbar.set_postfix_str(f'Test Result: {output["test_result"]["result"]}')
logger.info(
f'Finished evaluation for instance {output["instance_id"]}: {output["test_result"]["result"]}'
)
output_fp.write(json.dumps(output) + '\n')
output_fp.flush()
# This sets the multi-processing
num_workers = args.eval_num_workers
logger.info(f'Using {num_workers} workers for evaluation.')
# This is SWE-Bench specific - CodeActAgent doesn't require mounted workspace to work
skip_workspace_mount = agent_class == 'CodeActAgent'
logger.info(f'Skipping workspace mount: {skip_workspace_mount}')
try:
with ProcessPoolExecutor(num_workers) as executor:
futures = []
# This is how we perform multi-processing
for row_idx, instance in gpqa_dataset.iterrows():
future = executor.submit(
process_instance,
instance,
agent_class,
metadata,
skip_workspace_mount,
eval_output_dir,
reset_logger=bool(num_workers > 1),
)
future.add_done_callback(update_progress)
futures.append(future)
# Wait for all futures to complete
for future in futures:
future.result()
except KeyboardInterrupt:
print('KeyboardInterrupt received. Cleaning up...')
cleanup()
output_fp.close()
logger.info('Evaluation finished.')

View File

@@ -0,0 +1,41 @@
#!/bin/bash
MODEL_CONFIG=$1
EVAL_LIMIT=$2
DATA_SPLIT=$3
AGENT=$4
if [ -z "$AGENT" ]; then
echo "Agent not specified, use default CodeActAgent ..."
AGENT="CodeActAgent"
fi
# NOTE: if data split is not provided, use the default value 'gpqa_diamond'
if [ -z "$DATA_SPLIT" ]; then
echo "Data split not specified, using default gpqa_diamond ..."
DATA_SPLIT="gpqa_diamond"
fi
# IMPORTANT: Because Agent's prompt changes fairly often in the rapidly evolving codebase of OpenDevin
# We need to track the version of Agent in the evaluation to make sure results are comparable
AGENT_VERSION=v$(poetry run python -c "import agenthub; from opendevin.controller.agent import Agent; print(Agent.get_cls('$AGENT').VERSION)")
echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
COMMAND="poetry run python evaluation/gpqa/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 10 \
--max-chars 10000000 \
--eval-num-workers 1 \
--data-split $DATA_SPLIT \
--eval-note $AGENT_VERSION"
if [ -n "$EVAL_LIMIT" ]; then
echo "EVAL_LIMIT: $EVAL_LIMIT"
COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
fi
# Run the command
eval $COMMAND

View File

@@ -0,0 +1,81 @@
# WebArena Evaluation with OpenDevin Browsing Agents
This folder contains evaluation for [MiniWoB++](https://miniwob.farama.org/) benchmark, powered by [BrowserGym](https://github.com/ServiceNow/BrowserGym) for easy evaluation of how well an agent capable of browsing can perform on synthetic web browsing tasks.
## Setup OpenDevin Environment
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local develop environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
sandbox_container_image = "ghcr.io/opendevin/sandbox:latest"
sandbox_type = "ssh"
ssh_hostname = "localhost"
sandbox_timeout = 120
# TODO: Change these to the model you want to evaluate
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
## Setup MiniWoB++ Environment and Environment Variables of MiniWoB++
MiniWoB++ requires you to set up websites containing a static website that is accessible via URL to the machine running the OpenDevin agents.
- Clone miniwob (use a specific frozen commit for reproducibility)
```sh
git clone git@github.com:Farama-Foundation/miniwob-plusplus.git
git -C "./miniwob-plusplus" reset --hard 7fd85d71a4b60325c6585396ec4f48377d049838
```
- Setup Miniwob URL (change `PATH_TO_MINIWOB_CLONED_REPO` here to the absolute path to your `miniwob-plusplus` folder) in `evaluation/miniwob/scripts/run_infer.sh`
```sh
export MINIWOB_URL="file://<PATH_TO_MINIWOB_CLONED_REPO>/miniwob/html/miniwob/"
```
## Test if your environment works
Access with browser the above MiniWoB URLs and see if they load correctly.
## Run Evaluation
```sh
bash evaluation/miniwob/scripts/run_infer.sh
```
Results will be in `evaluation/evaluation_outputs/outputs/miniwob/`
To calculate the average reward, run:
```sh
poetry run python evaluation/miniwob/get_success_rate.py evaluation/evaluation_outputs/outputs/miniwob/SOME_AGENT/EXP_NAME/output.jsonl
```
## Submit your evaluation results
You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
## BrowsingAgent V1.0 result
Tested on BrowsingAgent V1.0
MiniWoB++, 125 tasks (3 runs due to random init task), max step 10
- GPT4o: 0.384, 0.416, 0.424, avg: 0.408
- GPT3.5: 0.288, 0.256, 0.272, avg: 0.272

View File

View File

@@ -0,0 +1,33 @@
import argparse
import json
import browsergym.miniwob # noqa F401 register miniwob tasks as gym environments
import gymnasium as gym
parser = argparse.ArgumentParser(description='Calculate average reward.')
parser.add_argument('output_path', type=str, help='path to output.jsonl')
args = parser.parse_args()
if __name__ == '__main__':
env_ids = [
id for id in gym.envs.registry.keys() if id.startswith('browsergym/miniwob')
]
total_num = len(env_ids)
print('Total number of tasks: ', total_num)
total_reward = 0
total_cost = 0
actual_num = 0
with open(args.output_path, 'r') as f:
for line in f:
data = json.loads(line)
actual_num += 1
total_cost += data['metrics']['accumulated_cost']
total_reward += data['test_result']
avg_reward = total_reward / total_num
print('Avg Reward: ', avg_reward)
avg_cost = total_cost / actual_num
print('Avg Cost: ', avg_cost)
print('Actual number of tasks finished: ', actual_num)

View File

@@ -0,0 +1,214 @@
import asyncio
import json
import logging
import os
import pathlib
import subprocess
import time
import browsergym.miniwob # noqa F401 register miniwob tasks as gym environments
import gymnasium as gym
from tqdm import tqdm
from opendevin.controller.state.state import State
from opendevin.core.config import args, config, get_llm_config_arg
from opendevin.core.logger import get_console_handler
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import main
from opendevin.events.serialization.event import event_to_dict
from opendevin.runtime.docker.ssh_box import DockerSSHBox
from opendevin.runtime.tools import RuntimeTool
SUPPORTED_AGENT_CLS = {'BrowsingAgent'}
def process_instance(
env_id: str,
metadata: dict,
eval_output_dir: str,
docker_sandbox: DockerSSHBox,
reset_logger: bool = True,
):
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(eval_output_dir, 'logs', f'instance_{env_id}.log')
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {env_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
else:
logger.info(f'Starting evaluation for instance {env_id}.')
# Here's how you can run the agent (similar to the `main` function) and get the final task state
runtime_tools_config = {
RuntimeTool.BROWSER: {
'browsergym_eval': env_id,
'browsergym_eval_save_dir': eval_output_dir,
}
}
state: State = asyncio.run(
main(
'PLACEHOLDER_GOAL',
runtime_tools_config=runtime_tools_config,
sandbox=docker_sandbox,
)
)
# ======= Attempt to evaluate the agent's environment impact =======
# If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
browsergym_eval_dir = os.path.join(eval_output_dir, env_id.split('/')[1])
# read goal
with open(
os.path.join(browsergym_eval_dir, 'goal.txt'), 'r', encoding='utf-8'
) as f:
instruction = f.read()
# read reward
with open(
os.path.join(browsergym_eval_dir, 'rewards.json'), 'r', encoding='utf-8'
) as f:
rewards = json.load(f)
reward = max(rewards)
# Save the output
output = {
'instance_id': env_id,
'instruction': instruction,
'metadata': metadata,
'history': [
(event_to_dict(action), event_to_dict(obs)) for action, obs in state.history
],
'metrics': metrics,
'error': state.error if state and state.error else None,
'test_result': reward,
}
return output
if __name__ == '__main__':
env_ids = [
id for id in gym.envs.registry.keys() if id.startswith('browsergym/miniwob')
]
# Check https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/README.md#configure-opendevin-and-your-llm
# for details of how to set `llm_config`
if args.llm_config:
specified_llm_config = get_llm_config_arg(args.llm_config)
if specified_llm_config:
config.llm = specified_llm_config
logger.info(f'Config for evaluation: {config}')
# TEST METADATA
agent_class = args.agent_cls
assert agent_class in SUPPORTED_AGENT_CLS, f'Unsupported agent class: {agent_class}'
model_name = config.llm.model.split('/')[-1]
max_iterations = args.max_iterations
eval_note = ''
if args.eval_note is not None:
eval_note += '_N_' + args.eval_note
eval_output_dir = os.path.join(
args.eval_output_dir,
'miniwob',
agent_class,
model_name + '_maxiter_' + str(max_iterations) + eval_note,
)
pathlib.Path(eval_output_dir).mkdir(parents=True, exist_ok=True)
pathlib.Path(os.path.join(eval_output_dir, 'logs')).mkdir(
parents=True, exist_ok=True
)
logger.info(f'Using evaluation output directory: {eval_output_dir}')
metadata = {
'agent_class': agent_class,
'model_name': model_name,
'max_iterations': max_iterations,
'eval_output_dir': eval_output_dir,
'start_time': time.strftime('%Y-%m-%d %H:%M:%S'),
# get the commit id of current repo for reproducibility
'git_commit': subprocess.check_output(['git', 'rev-parse', 'HEAD'])
.decode('utf-8')
.strip(),
}
logger.info(f'Metadata: {metadata}')
with open(os.path.join(eval_output_dir, 'metadata.json'), 'w') as f:
json.dump(metadata, f)
# LIMIT EVALUATION
eval_n_limit = args.eval_n_limit
if eval_n_limit:
env_ids = env_ids[:eval_n_limit]
logger.info(f'Limiting evaluation to first {eval_n_limit} instances.')
# OUTPUT FILE
output_file = os.path.join(eval_output_dir, 'output.jsonl')
logger.info(f'Writing evaluation output to {output_file}')
finished_instance_ids = set()
if os.path.exists(output_file):
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
finished_instance_ids.add(data['instance_id'])
logger.warning(
f'Output file {output_file} already exists. Loaded {len(finished_instance_ids)} finished instances.'
)
output_fp = open(output_file, 'a')
logger.info(
f'Evaluation started with Agent {agent_class}, model {model_name}, max iterations {max_iterations}.'
)
# =============================================
# filter out finished instances
new_env_ids = []
for idx in env_ids:
if idx in finished_instance_ids:
logger.info(f'Skipping instance {idx} as it is already finished.')
continue
new_env_ids.append(idx)
env_ids = new_env_ids
logger.info(
f'Finished instances: {len(finished_instance_ids)}, Remaining instances: {len(env_ids)}'
)
# =============================================
docker_sandbox = DockerSSHBox()
for env_id in tqdm(env_ids):
try:
output = process_instance(
env_id=env_id,
metadata=metadata,
eval_output_dir=eval_output_dir,
docker_sandbox=docker_sandbox,
reset_logger=False,
)
output_fp.write(json.dumps(output) + '\n')
output_fp.flush()
except Exception as e:
logger.error(f'Error processing instance {env_id}: {e}')
output_fp.close()
logger.info('Evaluation finished.')

View File

@@ -0,0 +1,44 @@
#!/bin/bash
# configure miniwob website, change URL to yours
export MINIWOB_URL="file:///home/fangzhex/miniwob-plusplus/miniwob/html/miniwob/"
# configure browsing agent
export USE_NAV="false"
export USE_CONCISE_ANSWER="true"
MODEL_CONFIG=$1
AGENT=$2
NOTE=$3
EVAL_LIMIT=$4
if [ -z "$AGENT" ]; then
echo "Agent not specified, use default BrowsingAgent"
AGENT="BrowsingAgent"
fi
# IMPORTANT: Because Agent's prompt changes fairly often in the rapidly evolving codebase of OpenDevin
# We need to track the version of Agent in the evaluation to make sure results are comparable
AGENT_VERSION=v$(poetry run python -c "import agenthub; from opendevin.controller.agent import Agent; print(Agent.get_cls('$AGENT').VERSION)")
echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
EVAL_NOTE="${AGENT_VERSION}_${NOTE}"
COMMAND="poetry run python evaluation/miniwob/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 10 \
--max-chars 10000000 \
--eval-note $EVAL_NOTE"
if [ -n "$EVAL_LIMIT" ]; then
echo "EVAL_LIMIT: $EVAL_LIMIT"
COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
fi
# Run the command
eval $COMMAND

View File

@@ -13,11 +13,15 @@ class TaskState:
):
self.finished = finished
self.success = success
self.agent_action_count: Dict[str, int] = agent_action_count or {
'propose_solution': 0,
'use_tool': 0,
'invalid_action': 0,
}
self.agent_action_count: Dict[str, int] = (
agent_action_count
if agent_action_count
else {
'propose_solution': 0,
'use_tool': 0,
'invalid_action': 0,
}
)
self.terminate_reason = terminate_reason
self.latest_output = latest_output

View File

@@ -19,7 +19,20 @@ class SimplifiedEnv:
def __init__(self, agent_state: State, task: Task, task_config: Dict[str, int]):
self.agent_state = agent_state
self.task = task
self.task_state = TaskState()
agent_action_count = {
'propose_solution': 0,
'use_tool': 0,
'invalid_action': 0,
}
# check if agent_state has attribute turn_info set
if hasattr(self.agent_state, 'propose_solution_count'):
agent_action_count['propose_solution'] = (
self.agent_state.propose_solution_count
)
self.task_state = TaskState(agent_action_count=agent_action_count)
self.task_config = task_config
def step(self, lm_message: str):
@@ -39,6 +52,9 @@ class SimplifiedEnv:
turn_info=turn_info,
)
self.agent_state.propose_solution_count = self.task_state.agent_action_count[
'propose_solution'
]
self.log_output(output)
return self.task_state
@@ -109,11 +125,7 @@ class SimplifiedEnv:
self.task_state.finished = True
self.task_state.success = False
self.task_state.terminate_reason = 'max_propose_steps'
elif (
# (propose_solution + use_tool) > max iteration limit
sum(self.task_state.agent_action_count.values())
>= self.task_config['max_iterations']
):
elif self.agent_state.iteration >= self.task_config['max_iterations']:
self.task_state.finished = True
self.task_state.success = False
self.task_state.terminate_reason = 'max_iterations'

View File

@@ -51,10 +51,8 @@ def codeact_user_response(state: State, task: Task, task_config: Dict[str, int])
state.task_state = result_state
if not result_state.latest_output:
if result_state.success:
msg = '/exit'
else:
msg = 'Something went wrong! No output from the model.'
# Task is finished
msg = '/exit'
else:
msg = result_state.latest_output['content']

View File

@@ -15,6 +15,7 @@ echo "AGENT_VERSION: $AGENT_VERSION"
export PYTHONPATH=$(pwd)
COMMAND="poetry run python ./evaluation/mint/run_infer.py \
--llm-config $MODEL_CONFIG \
--max-iterations 5 \
--max-propose-solution 2 \
--eval-note $AGENT_VERSION"

View File

@@ -0,0 +1,126 @@
# ML-Bench Evaluation with OpenDevin
This project implements the evaluation of agents on the [ML-Bench](https://arxiv.org/abs/2311.09835) dataset using OpenDevin. [ML-Bench](https://arxiv.org/abs/2311.09835) is a comprehensive benchmark designed to assess the effectiveness of Large Language Models (LLMs) in leveraging existing functions in open-source libraries for machine learning tasks. The benchmark consists of 10,040 samples spanning 130 tasks over 14 notable machine learning GitHub repositories.
## Task Overview
The ML-Bench task presents a scenario where, given a GitHub repository, the language model has access to all files within the repository. Upon receiving a user instruction with a set of specific parameters, the Agent is required to write code that invokes models or functions from the GitHub repository. The generated code must align with the user's instruction, particularly in terms of the specified parameters, and must be executable.
The task introduces new challenges for LLMs, such as comprehending long and language-code interleaved documents, understanding complex cross-file code structures, and effectively navigating the codebase to locate relevant information. ML-Bench serves as a critical tool for assessing the efficiency and adaptability of various methods in real-world scenarios.
For more details on the ML-Bench task and dataset, please refer to the paper: [ML-Bench: Evaluating Large Language Models for Code Generation in Repository-Level Machine Learning Tasks](https://arxiv.org/abs/2311.09835).
## Setup Environment
Please follow the [OpenDevin setup guide](https://github.com/OpenDevin/OpenDevin/blob/main/docs/setup.md) to set up the local development environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
enable_auto_lint = true
run_as_devin = false
sandbox_container_image = "public.ecr.aws/i5g0m1f6/ml-bench" # Use the latest image from the ML-Bench repository
# TODO: Change these to the model you want to evaluate
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
## Run Inference on ML-Bench
To run the evaluation on the ML-Bench dataset, use the following command:
```bash
./evaluation/ml_bench/scripts/run_infer.sh [model_config] [split] [agent] [eval_limit]
# e.g., ./evaluation/ml_bench/scripts/run_infer.sh eval_gpt4_1106_preview full CodeActAgent 10
```
You can replace `eval_gpt4_1106_preview` with any model you set up in `config.toml`.
## Examples
For each task in the ML-Bench dataset, OpenDevin provides the agent with a set number of iterations to complete the task. The `history` field in the evaluation output shows each iteration's response and actions taken by the agent to complete the task.
Here's an example of the evaluation output for a single task instance:
```json
{
"instance_id": 3,
"repo": "https://github.com/dmlc/dgl",
"instruction": "Please complete the Machine Learning task in the following repository: dgl\n\nThe task is: DGL Implementation of NGCF model\n\nI have a deep desire to embark on a journey brimming with knowledge and expertise. My objective is to train a cutting-edge NGCF Model, known for its unparalleled capabilities, on the illustrious dataset known as gowalla. To ensure swift execution, I kindly request your assistance in crafting the code, making use of the powerful GPU #3 and an embedding size of 32. Can you lend a helping hand to transform this dream into a reality?\n\nYou should create a script named `run.sh` under the specified path in the repo to run the task.\n\nYou can find the task repo at: /workspace/dgl/examples/pytorch/NGCF/NGCF\n\nYou should terminate the subprocess after running the task (e.g., call subprocess.Popen(args).wait()).When you think you have completed the task, please run the following command: <execute_bash> exit </execute_bash>.\n",
"metadata": {
"agent_class": "CodeActAgent",
"model_name": "gpt-4-1106-preview",
"max_iterations": 10,
"eval_output_dir": "evaluation/evaluation_outputs/outputs/ml_bench/CodeActAgent/gpt-4-1106-preview_maxiter_10_N_v1.5",
"start_time": "2024-05-26 17:39:59",
"git_commit": "dd8ee9044a94a213dc2e31d2085dbf2924ee80a1"
},
"history": [
[
{
"id": 0,
"timestamp": "2024-05-26T17:40:41.060009",
"source": "user",
"message": "Please complete the Machine Learning task in the following repository: dgl\n\nThe task is: DGL Implementation of NGCF model\n\nI have a deep desire to embark on a journey brimming with knowledge and expertise. My objective is to train a cutting-edge NGCF Model, known for its unparalleled capabilities, on the illustrious dataset known as gowalla. To ensure swift execution, I kindly request your assistance in crafting the code, making use of the powerful GPU #3 and an embedding size of 32. Can you lend a helping hand to transform this dream into a reality?\n\nYou should create a script named `run.sh` under the specified path in the repo to run the task.\n\nYou can find the task repo at: /workspace/dgl/examples/pytorch/NGCF/NGCF\n\nYou should terminate the subprocess after running the task (e.g., call subprocess.Popen(args).wait()).When you think you have completed the task, please run the following command: <execute_bash> exit </execute_bash>.\n",
"action": "message",
"args": {
"content": "Please complete the Machine Learning task in the following repository: dgl\n\nThe task is: DGL Implementation of NGCF model\n\nI have a deep desire to embark on a journey brimming with knowledge and expertise. My objective is to train a cutting-edge NGCF Model, known for its unparalleled capabilities, on the illustrious dataset known as gowalla. To ensure swift execution, I kindly request your assistance in crafting the code, making use of the powerful GPU #3 and an embedding size of 32. Can you lend a helping hand to transform this dream into a reality?\n\nYou should create a script named `run.sh` under the specified path in the repo to run the task.\n\nYou can find the task repo at: /workspace/dgl/examples/pytorch/NGCF/NGCF\n\nYou should terminate the subprocess after running the task (e.g., call subprocess.Popen(args).wait()).When you think you have completed the task, please run the following command: <execute_bash> exit </execute_bash>.\n",
"wait_for_response": false
}
},
{
"message": "No observation",
"observation": "null",
"content": "",
"extras": {}
}
],
// ... more iterations
],
"eval_exit_code": 124, // ML-Bench believes the agent is successful if it continues to run until timeout
"eval_output": "",
"eval_script": "pip install Matplotlib==2.2.2\r\n"
"cd /workspace/dgl/examples/pytorch/dgmg\r\n"
"python main.py",
"metrics": {
"success": 1
}
}
```
The `history` field contains the agent's actions and observations at each iteration, including the commands executed, file edits, and the agent's thoughts.
The `eval_exit_code` and `eval_output` fields provide information about the execution of the evaluation command and its output.
The `metrics` field contains the parsed evaluation metrics from the `eval_output`.
## Customization
You can customize the evaluation script by modifying the `evaluation/ml_bench/run_infer.py` file. This script handles loading the ML-Bench dataset, running the agent on each task instance, and saving the evaluation outputs.
Feel free to adjust the configuration, logging, and output formatting to suit your needs.
## Contributing
If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request on the [GitHub repository](https://github.com/gersteinlab/ML-bench).
## License
This project is licensed under the [MIT License](LICENSE).

View File

View File

@@ -0,0 +1,387 @@
"""
Implements evaluation of agents on ML-Bench, a benchmark for assessing the effectiveness of
Large Language Models (LLMs) in leveraging existing functions in open-source libraries for
machine learning tasks. The benchmark is introduced in the paper "ML-Bench: Evaluating Large
Language Models for Code Generation in Repository-Level Machine Learning Tasks"
(https://arxiv.org/abs/2311.09835).
Please see https://ghcr.io/super-dainiu/ml_bench and https://huggingface.co/datasets/super-dainiu/ml-bench
for more details on the dataset and docker image used in this evaluation script.
TODOs:
- Support additional evaluation settings, such as providing raw README content or using a
retriever to extract relevant segments.
- Clean up the code and docker image used for evaluation.
"""
import asyncio
import json
import logging
import multiprocessing as mp
import os
import pathlib
import subprocess
import time
from concurrent.futures import ProcessPoolExecutor
from datasets import load_dataset
from tqdm import tqdm
from opendevin.controller.state.state import State
from opendevin.core.config import config, get_llm_config_arg, get_parser
from opendevin.core.logger import get_console_handler
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import main
from opendevin.events.action import MessageAction
from opendevin.events.serialization.event import event_to_dict
from opendevin.runtime.docker.ssh_box import DockerSSHBox
def cleanup():
logger.info('Cleaning up child processes...')
for process in mp.active_children():
logger.info(f'Terminating child process: {process.name}')
process.terminate()
process.join()
def codeact_user_response(state: State) -> str:
msg = (
'Please continue working on the task on whatever approach you think is suitable.\n'
'If you think you have completed the task, please run the following command: <execute_bash> exit </execute_bash>.\n'
'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.\n'
)
if state.history:
user_msgs = [
action
for action, _ in state.history
if isinstance(action, MessageAction) and action.source == 'user'
]
if len(user_msgs) >= 2:
# let the agent know that it can give up when it has tried 3 times
return (
msg
+ 'If you want to give up, run: <execute_bash> exit </execute_bash>.\n'
)
return msg
def monologue_user_response(state: State) -> str:
raise NotImplementedError('MonologueAgent should never ask for user responses.')
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': codeact_user_response,
'MonologueAgent': monologue_user_response,
}
AGENT_CLS_TO_INST_SUFFIX = {
'CodeActAgent': 'When you think you have completed the task, please run the following command: <execute_bash> exit </execute_bash>.\n'
}
ID2CONDA = {
1: 'dgl_DS',
2: 'bert_DS',
3: 'lavis_DS',
4: 'if_DS',
5: 'V2V_DS',
6: 'esm_DS',
7: 'OP_DS',
8: 'TSL_DS',
9: 'EAP_DS',
10: 'PG_DS',
11: 'PIM_DS',
12: 'AD2_DS',
13: 'L3_DS',
14: 'MZ2_DS',
15: 'GSA2_DS',
}
def process_instance(
instance, agent_class, metadata, eval_output_dir, reset_logger: bool = True
):
old_workspace_mount_path = config.workspace_mount_path
old_workspace_base = config.workspace_base
try:
workspace_mount_path = os.path.join(
config.workspace_mount_path, '_eval_workspace'
)
# create process-specific workspace dir
# so that different agent don't interfere with each other.
workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
# reset workspace to config
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(
eval_output_dir,
'logs',
f"instance_{instance['id']}_pid_{os.getpid()}.log",
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f"Starting evaluation for instance {instance['id']}.\nLOG: tail -f {log_file}"
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
# Create a sandbox, using the instance ID as the session ID to avoid conflicts
sandbox = DockerSSHBox(sid=str(instance['id']) + '_' + str(os.getpid()))
# Set up the task environment
sandbox.execute(f'conda activate {ID2CONDA[instance["github_id"]]}')
# Clone the task repo into the sandbox
repo_url = instance['github']
repo_name = repo_url.split('/')[-1]
sandbox.execute(f'git clone {repo_url} /workspace/{repo_name}')
sandbox.execute(f'chmod -R 777 /workspace/{repo_name}')
# Navigate to the task's code path
task_path = os.path.join('/workspace', repo_name, instance['path'][2:])
sandbox.execute(f'cd {task_path}')
# Prepare the task instruction
instruction = (
f'Please complete the Machine Learning task in the following repository: {repo_name}\n\n'
f'The task is: {instance["task"]}\n\n'
f'{instance["instruction"]}\n\n'
'You should create a script named `run.sh` under the specified path in the repo to run the task.\n\n'
f'You can find the task repo at: {task_path}\n\n'
+ (
'Here is the prefix code for the task:\n'
'```bash\n'
f'{instance["prefix_code"]}\n'
'```\n\n'
if instance['prefix_code']
else ''
)
+ 'You should terminate the subprocess after running the task (e.g., call subprocess.Popen(args).wait()).'
)
instruction += AGENT_CLS_TO_INST_SUFFIX.get(agent_class, '')
# Run the agent
state: State = asyncio.run(
main(
instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(
agent_class
),
sandbox=sandbox,
)
)
metrics = state.metrics.get() if state.metrics else {}
# Evaluate the agent's script
eval_script = os.path.join(task_path, 'run.sh')
logger.info(f'Running evaluation script: {eval_script}')
try:
_, eval_script_content = sandbox.execute(f'cat {eval_script}')
except Exception as e:
logger.error(f'Error reading evaluation script: {e}')
eval_script_content = ''
try:
exit_code, eval_output = sandbox.execute(
f'timeout 120s conda run -n {ID2CONDA[instance["github_id"]]} bash {eval_script}',
timeout=600,
)
except Exception as e:
logger.error(f'Error running evaluation script: {e}')
exit_code = -1
eval_output = ''
if exit_code != 0 and exit_code != 124:
logger.warning(f'Evaluation script failed with exit code {exit_code}')
logger.warning(f'Output: {eval_output}')
metrics['success'] = int(
'KeyboardInterrupt' in eval_output
) # super-dainiu: assume ``KeyboardInterrupt`` is a success as is done in ML-Bench
else:
logger.info(f'Evaluation script succeeded with exit code {exit_code}')
logger.info(f'Output: {eval_output}')
metrics['success'] = 1
# Save the output
output = {
'instance_id': instance['id'],
'repo': repo_url,
'instruction': instruction,
'metadata': metadata,
'history': [
(event_to_dict(action), event_to_dict(obs))
for action, obs in state.history
],
'eval_script': eval_script_content,
'eval_exit_code': exit_code,
'eval_output': eval_output,
'metrics': metrics,
}
except Exception as e:
logger.error(f'Error processing instance {instance["id"]}: {e}')
raise
finally:
config.workspace_mount_path = old_workspace_mount_path
config.workspace_base = old_workspace_base
# Shutdown the sandbox
sandbox.close()
return output
if __name__ == '__main__':
parser = get_parser()
parser.add_argument(
'-s',
'--eval-split',
type=str,
default='quarter',
choices=['full', 'quarter'],
help='data split to evaluate on, either full or quarter',
)
args, _ = parser.parse_known_args()
data_split = args.eval_split
agent_class = args.agent_cls
num_workers = args.eval_num_workers
# Check https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/README.md#configure-opendevin-and-your-llm
# for details of how to set `llm_config`
if args.llm_config:
specified_llm_config = get_llm_config_arg(args.llm_config)
if specified_llm_config:
config.llm = specified_llm_config
logger.info(f'Config for evaluation: {config}')
# NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
# so we don't need to manage file uploading to OpenDevin's repo
ml_bench = load_dataset('super-dainiu/ml-bench', split=data_split).to_pandas()
# LIMIT EVALUATION
eval_n_limit = args.eval_n_limit
if eval_n_limit:
ml_bench = ml_bench.head(eval_n_limit)
logger.info(f'Limiting evaluation to {eval_n_limit} instances.')
# TEST METADATA
model_name = config.llm.model.split('/')[-1]
max_iterations = args.max_iterations
eval_note = ''
if args.eval_note is not None:
eval_note += '_N_' + args.eval_note
eval_output_dir = os.path.join(
args.eval_output_dir,
'ml_bench',
agent_class,
model_name + '_maxiter_' + str(max_iterations) + eval_note,
)
os.makedirs(eval_output_dir, exist_ok=True)
os.makedirs(os.path.join(eval_output_dir, 'logs'), exist_ok=True)
logger.info(f'Using evaluation output directory: {eval_output_dir}')
metadata = {
'agent_class': agent_class,
'model_name': model_name,
'max_iterations': max_iterations,
'eval_output_dir': eval_output_dir,
'start_time': time.strftime('%Y-%m-%d %H:%M:%S'),
# get the commit id of current repo for reproducibility
'git_commit': subprocess.check_output(['git', 'rev-parse', 'HEAD'])
.decode('utf-8')
.strip(),
}
logger.info(f'Metadata: {metadata}')
output_file = os.path.join(eval_output_dir, 'output.jsonl')
logger.info(f'Evaluating on data split: {data_split}')
logger.info(f'Using {num_workers} worker processes')
logger.info(f'Writing evaluation output to {output_file}')
finished_instance_ids = set()
if os.path.exists(output_file):
with open(output_file, 'r') as f:
for line in f:
try:
data = json.loads(line)
except json.JSONDecodeError:
print(f'Error parsing line: {line}')
finished_instance_ids.add(data['instance_id'])
logger.warning(
f'Output file {output_file} already exists. Loaded {len(finished_instance_ids)} finished instances.'
)
output_fp = open(output_file, 'a')
logger.info(
f'Evaluation started with Agent {agent_class}, model {model_name}, data split {data_split}.'
)
# Filter out finished instances
new_instances = [
instance
for _, instance in ml_bench.iterrows()
if instance['id'] not in finished_instance_ids
]
logger.info(
f'Finished instances: {len(finished_instance_ids)}, Remaining instances: {len(new_instances)}'
)
pbar = tqdm(total=len(new_instances))
# This function tracks the progress AND writes the output to a JSONL file
def update_progress(future):
pbar.update(1)
output = future.result()
pbar.set_description(f'Instance {output["instance_id"]}')
pbar.set_postfix_str(f'Metrics: {output["metrics"]}')
logger.info(
f'Finished evaluation for instance {output["instance_id"]}: {output["metrics"]}'
)
output_fp.write(json.dumps(output) + '\n')
output_fp.flush()
# This sets the multi-processing
num_workers = args.eval_num_workers
logger.info(f'Using {num_workers} workers for evaluation.')
try:
with ProcessPoolExecutor(num_workers) as executor:
futures = []
for _, instance in enumerate(new_instances):
future = executor.submit(
process_instance,
instance,
agent_class,
metadata,
eval_output_dir,
reset_logger=bool(num_workers > 1),
)
future.add_done_callback(update_progress)
futures.append(future)
for future in futures:
output = future.result()
except KeyboardInterrupt:
print('KeyboardInterrupt received. Cleaning up...')
cleanup()
logger.info('Evaluation completed.')

View File

@@ -0,0 +1,15 @@
#!/bin/bash
# Step 1: Stop all running containers
echo "Stopping all running containers..."
docker stop $(docker ps -q)
# Step 2: Remove all containers (running and stopped)
echo "Removing all containers..."
docker rm $(docker ps -a -q)
# Optional: Remove all Docker images (if you want to clean up images too)
# echo "Removing all Docker images..."
# docker rmi $(docker images -q)
echo "All containers have been removed."

View File

@@ -0,0 +1,44 @@
#!/bin/bash
MODEL_CONFIG=$1
SPLIT=$2
AGENT=$3
EVAL_LIMIT=$4
if [ -z "$MODEL_CONFIG" ]; then
echo "Model config not specified, use default"
MODEL_CONFIG="eval_gpt4_1106_preview"
fi
if [ -z "$AGENT" ]; then
echo "Agent not specified, use default CodeActAgent"
AGENT="CodeActAgent"
fi
# IMPORTANT: Because Agent's prompt changes fairly often in the rapidly evolving codebase of OpenDevin
# We need to track the version of Agent in the evaluation to make sure results are comparable
AGENT_VERSION=v$(poetry run python -c "import agenthub; from opendevin.controller.agent import Agent; print(Agent.get_cls('$AGENT').VERSION)")
echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
COMMAND="poetry run python evaluation/ml_bench/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 10 \
--eval-num-workers 4 \
--eval-note $AGENT_VERSION"
if [ -n "$EVAL_LIMIT" ]; then
echo "EVAL_LIMIT: $EVAL_LIMIT"
COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
fi
if [ -n "$SPLIT" ]; then
echo "SPLIT: $SPLIT"
COMMAND="$COMMAND --eval-split $SPLIT"
fi
# Run the command
eval $COMMAND

View File

@@ -0,0 +1,70 @@
import json
import pprint
import sys
def extract_test_results(res_file_path: str) -> tuple[list[str], list[str]]:
passed = []
failed = []
costs = []
instance_ids = set()
instances = []
with open(res_file_path, 'r') as file:
for line in file:
data = json.loads(line.strip())
success = data['metrics']['success']
if data['instance_id'] in instance_ids:
print(f'WARNING: Duplicate instance_id found: {data["instance_id"]}')
continue
instance_ids.add(data['instance_id'])
instances.append(data)
if success:
passed.append(
{
'instance_id': data['instance_id'],
'repo': data['repo'],
'instruction': data['instruction'],
'eval_script': data['eval_script'],
'eval_exit_code': data['eval_exit_code'],
'eval_output': data['eval_output'],
'accumulated_cost': data['metrics']['accumulated_cost'],
}
)
else:
failed.append(
{
'instance_id': data['instance_id'],
'repo': data['repo'],
'instruction': data['instruction'],
'eval_script': data['eval_script'],
'eval_exit_code': data['eval_exit_code'],
'eval_output': data['eval_output'],
'accumulated_cost': data['metrics']['accumulated_cost'],
}
)
costs.append(data['metrics']['accumulated_cost'])
# sort by instance_id
instances.sort(key=lambda x: x['instance_id'])
with open(res_file_path, 'w') as file:
for instance in instances:
file.write(json.dumps(instance) + '\n')
return passed, failed, costs
if __name__ == '__main__':
if len(sys.argv) != 2:
print(
'Usage: poetry run python summarise_results.py <path_to_output_jsonl_file>'
)
sys.exit(1)
json_file_path = sys.argv[1]
passed_tests, failed_tests, costs = extract_test_results(json_file_path)
success_rate = len(passed_tests) / (len(passed_tests) + len(failed_tests))
print('PASSED TESTS:')
pprint.pprint(passed_tests)
print('FAILED TESTS:')
pprint.pprint(failed_tests)
print(
f'\nPassed {len(passed_tests)} tests, failed {len(failed_tests)} tests, success rate = {success_rate}, average cost = {sum(costs) / len(costs)}'
)

View File

@@ -1,256 +0,0 @@
# Evaluate Generated Patches
## Evaluate patches generated by OpenDevin
This section explains in detail how `evaluation/swe_bench/scripts/eval_infer.sh` described in [SWE-Bench README](./README.md) works.
Use `scripts/setup/get_agent_report.sh` to evaluate patches generated by an OpenDevin agent. This script is available in the container at `/swe_util/get_agent_report.sh`.
- `output-file` (*required*): specify the path to your patch file inside the container
- `agent-name` (*required*): your agent name
- `dataset` (*required*): `swe-bench-test-lite` or `swe-bench-test`
- `num-processes`: defaults to 15.
- `experiment-name`: set to `${parent_folder_of_output_fils}_${current_folder_of_output_file}` if not given. E.g., `xxx/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v2_cd/output.jsonl` -> `CodeActAgent_gpt-4-1106-preview_maxiter_50_N_v2_cd` as experiment name.
- `merge_report`: if set, merges the evaluation report into the original output jsonl file and saves as a `.merged.jsonl` file.
An example to run evaluation on the given example agent output (`./examples/example_agent_output.json`).
```shell
export MINICONDA3=/swe_util/miniforge3
export OD_SWE_BENCH=/OD-SWE-bench
export EVAL_DATA_DIR=/swe_util/eval_data
cd /swe_util && ./get_agent_report.sh --output-file /swe_bench_output/example_agent_output.jsonl \
--agent-name CodeActAgent \
--dataset swe-bench-test-lite \
--experiment-name test_experiment \
--merge-report
```
You should get the following report:
```shell
- no_generation: 4
- generated: 26
- with_logs: 26
- install_fail: 0
- reset_failed: 0
- no_apply: 0
- applied: 24
- test_errored: 0
- test_timeout: 0
- resolved: 6
['sphinx-doc__sphinx-8721', 'sympy__sympy-14774', 'django__django-17087', 'sympy__sympy-20590', 'django__django-11583', 'sympy__sympy-21612']
Report saved at /swe_util/eval_data/eval_logs/test_experiment/test_experiment_swe-bench-test-lite.report.json
Agent output with report merged created at /swe_bench_output/example_agent_output.merged.jsonl
```
An additional `fine_grained_report` field will be added to each instance in the `example_agent_output.merged.jsonl`.
```json
"fine_grained_report": {
"gold_tests": {
"FAIL_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_default\"]",
"PASS_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_enabled\", \"tests/test_ext_viewcode.py::test_linkcode\", \"tests/test_ext_viewcode.py::test_local_source_files\"]"
},
"generated": true,
"with_logs": true,
"applied": true,
"test_errored": false,
"test_timeout": false,
"resolved": true,
"log_parse": {
"tests/test_ext_viewcode.py::test_viewcode_epub_default": "PASSED",
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled": "PASSED",
"tests/test_ext_viewcode.py::test_linkcode": "PASSED",
"tests/test_ext_viewcode.py::test_local_source_files": "PASSED",
"tests/test_ext_viewcode.py::test_viewcode": "FAILED"
},
"eval_report": {
"FAIL_TO_PASS": {
"success": [
"tests/test_ext_viewcode.py::test_viewcode_epub_default"
],
"failure": []
},
"PASS_TO_PASS": {
"success": [
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled",
"tests/test_ext_viewcode.py::test_linkcode",
"tests/test_ext_viewcode.py::test_local_source_files"
],
"failure": []
},
"FAIL_TO_FAIL": {
"success": [],
"failure": []
},
"PASS_TO_FAIL": {
"success": [],
"failure": []
}
}
}
```
## If you already have patches not generated by OpenDevin
### Prepare Output Files
Ensure that model outputs are formatted correctly as below:
```json
[
{
"instance_id": "",
"model_patch": "",
"model_name_or_path": ""
},
...
]
```
An example can be found [here](./examples/example_model_output.json).
Agent output should be adhere to the OpenDevin format. An example can be found [here](./examples/example_agent_output.json).
### Set Up the Environment
Before evaluating generated patches, you need to set up the Docker environment. Run the following command to instantiate the Docker container and mount the directory to your output files on the host:
```shell
docker run -it \
-v DIR_TO_YOUR_PATCH_FILES_ON_HOST:/swe_bench_output \
ghcr.io/opendevin/eval-swe-bench:full-v1.2.1 /bin/bash
```
### Evaluate Model Generated Patches
Use `scripts/get_model_report.sh` to evaluate patches generated by a model. This script is located in the container at `/swe_util/get_model_report.sh`.
- `output-file` (*required*): specify the path to your patch file inside the container
- `model-name` (*required*): this must match the `model_name_or_path` in your patch file
- `dataset` (*required*): `swe-bench-test-lite` or `swe-bench-test`
- `num-processes`: defaults to 15.
- `experiment-name`: set to `{model-name}__{dataset}` unless specified
An example to run evaluation on the given example model output (`./examples/example_agent_output.json`).
```shell
export MINICONDA3=/swe_util/miniforge3
export OD_SWE_BENCH=/swe_util/OD-SWE-bench
export EVAL_DATA_DIR=/swe_util/eval_data
cd /swe_util && ./get_model_report.sh --output-file /swe_bench_output/example_model_output.json \
--model-name opendevin \
--dataset swe-bench-test-lite
```
You should get the following report:
```shell
- no_generation: 4
- generated: 26
- with_logs: 26
- install_fail: 0
- reset_failed: 0
- no_apply: 0
- applied: 24
- test_errored: 0
- test_timeout: 0
- resolved: 6
['sphinx-doc__sphinx-8721', 'sympy__sympy-14774', 'django__django-17087', 'sympy__sympy-20590', 'django__django-11583', 'sympy__sympy-21612']
Report saved at /swe_util/eval_data/eval_logs/opendevin__swe-bench-test-lite/example_model_output.report.json
```
Note: please ignore the `no_apply` in the report for now.
The script will generate a `{experiment_name}` folder under `$EVAL_DATA_DIR/eval_logs`
```shell
├── $EVAL_DATA_DIR/eval_logs/$experiment_name
│ ├── $experiment_name.json
│ ├── $experiment_name.report.json
│ ├── $model_name # eval log dir
```
### Evaluate Agent Generated Patches
Use `scripts/setup/get_agent_report.sh` to evaluate patches generated by an agent. This script is available in the container at `/swe_util/get_agent_report.sh`.
- `output-file` (*required*): specify the path to your patch file inside the container
- `agent-name` (*required*): your agent name
- `dataset` (*required*): `swe-bench-test-lite` or `swe-bench-test`
- `num-processes`: defaults to 15.
- `experiment-name`: set to `${parent_folder_of_output_fils}_${current_folder_of_output_file}` if not given. E.g., `xxx/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v2_cd/output.jsonl` -> `CodeActAgent_gpt-4-1106-preview_maxiter_50_N_v2_cd` as experiment name.
- `merge_report`: if set, merges the evaluation report into the original output jsonl file and saves as a `.merged.jsonl` file.
An example to run evaluation on the given example agent output (`./examples/example_agent_output.json`).
```shell
export MINICONDA3=/swe_util/miniforge3
export OD_SWE_BENCH=/OD-SWE-bench
export EVAL_DATA_DIR=/swe_util/eval_data
cd /swe_util && ./get_agent_report.sh --output-file /swe_bench_output/example_agent_output.jsonl \
--agent-name CodeActAgent \
--dataset swe-bench-test-lite \
--experiment-name test_experiment \
--merge-report
```
You should get the following report:
```shell
- no_generation: 4
- generated: 26
- with_logs: 26
- install_fail: 0
- reset_failed: 0
- no_apply: 0
- applied: 24
- test_errored: 0
- test_timeout: 0
- resolved: 6
['sphinx-doc__sphinx-8721', 'sympy__sympy-14774', 'django__django-17087', 'sympy__sympy-20590', 'django__django-11583', 'sympy__sympy-21612']
Report saved at /swe_util/eval_data/eval_logs/test_experiment/test_experiment_swe-bench-test-lite.report.json
Agent output with report merged created at /swe_bench_output/example_agent_output.merged.jsonl
```
An additional `fine_grained_report` field will be added to each instance in the `example_agent_output.merged.jsonl`.
```json
"fine_grained_report": {
"gold_tests": {
"FAIL_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_default\"]",
"PASS_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_enabled\", \"tests/test_ext_viewcode.py::test_linkcode\", \"tests/test_ext_viewcode.py::test_local_source_files\"]"
},
"generated": true,
"with_logs": true,
"applied": true,
"test_errored": false,
"test_timeout": false,
"resolved": true,
"log_parse": {
"tests/test_ext_viewcode.py::test_viewcode_epub_default": "PASSED",
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled": "PASSED",
"tests/test_ext_viewcode.py::test_linkcode": "PASSED",
"tests/test_ext_viewcode.py::test_local_source_files": "PASSED",
"tests/test_ext_viewcode.py::test_viewcode": "FAILED"
},
"eval_report": {
"FAIL_TO_PASS": {
"success": [
"tests/test_ext_viewcode.py::test_viewcode_epub_default"
],
"failure": []
},
"PASS_TO_PASS": {
"success": [
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled",
"tests/test_ext_viewcode.py::test_linkcode",
"tests/test_ext_viewcode.py::test_local_source_files"
],
"failure": []
},
"FAIL_TO_FAIL": {
"success": [],
"failure": []
},
"PASS_TO_FAIL": {
"success": [],
"failure": []
}
}
}
```

View File

@@ -1,16 +1,14 @@
# SWE-Bench Evaluation with OpenDevin SWE-Bench Docker Image
This folder contains evaluation harness we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)). We create [a fork of SWE-Bench](https://github.com/OpenDevin/OD-SWE-bench.git) mostly build on top of [the original repo](https://github.com/princeton-nlp/SWE-bench) and [containerized](#opendevin-swe-bench-docker-image) it for easy evaluation.
This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)). We created [a fork of SWE-Bench](https://github.com/OpenDevin/OD-SWE-bench.git) mostly built on top of [the original repo](https://github.com/princeton-nlp/SWE-bench) and [containerized](#opendevin-swe-bench-docker-image) it for easy evaluation.
## Setup Environment
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local develop environment for OpenDevin.
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
## OpenDevin SWE-Bench Docker Image
In [OpenDevin-SWE-Bench fork](https://github.com/OpenDevin/OD-SWE-bench.git) (mostly from [original repo](https://github.com/princeton-nlp/SWE-bench) with some fixes), we try to pre-build the **testbed** (i.e., code of the repository we want the agent to edit) AND the **conda environment**, so that in evaluation (inference) time, we can directly leverage existing environments for effecienct evaluation.
In [OpenDevin-SWE-Bench fork](https://github.com/OpenDevin/OD-SWE-bench.git) (mostly from [original repo](https://github.com/princeton-nlp/SWE-bench) with some fixes), we try to pre-build the **testbed** (i.e., code of the repository we want the agent to edit) AND the **conda environment**, so that in evaluation (inference) time, we can directly leverage existing environments for efficient evaluation.
**We pack everything you need for SWE-Bench evaluation into one, gigantic, docker image.** To use it:
@@ -19,8 +17,9 @@ docker pull ghcr.io/opendevin/eval-swe-bench:full-v1.2.1
```
The Docker image contains several important directories:
- `/swe_util/OD-SWE-bench`: root directory for the OD-SWE-bench repository
- `/swe_util/eval_data`: director to eval data
- `/swe_util/eval_data`: directory to eval data
- `/swe_util/eval_data/eval_logs/`: evaluation logs
- `/swe_util/eval_data/eval_temp/`: temporary folder for the evaluation process
- `/swe_util/eval_data/instances/`: swe-bench raw instances
@@ -31,7 +30,7 @@ The Docker image contains several important directories:
To reproduce how we pack the image, check [this doc](./BUILD_TESTBED_AND_ENV.md).
NOTE: We only support SWE-Bench lite for now. But modifying our existing scripts for full SWE-Bench should be quite straight forward.
NOTE: We only support SWE-Bench lite for now. But modifying our existing scripts for full SWE-Bench should be quite straightforward.
## Configure OpenDevin and your LLM
@@ -52,6 +51,7 @@ sandbox_timeout = 120
use_host_network = false
run_as_devin = false
enable_auto_lint = true
max_budget_per_task = 4 # 4 USD
# TODO: Change these to the model you want to evaluate
[eval_gpt4_1106_preview]
@@ -128,62 +128,28 @@ If you want to evaluate existing results, you should first run this to clone exi
git clone https://huggingface.co/spaces/OpenDevin/evaluation evaluation/evaluation_outputs
```
To prepare for swe-bench evaluation, you should pull evaluation docker from [OpenDevin/SWE-bench-docker](https://github.com/OpenDevin/SWE-bench-docker) and download swe-bench data by running:
```bash
evaluation/swe_bench/scripts/eval/prep_eval.sh
```
Then you can run the following:
```bash
# ./evaluation/swe_bench/scripts/eval_infer.sh $YOUR_OUTPUT_JSONL
# For example:
./evaluation/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
```
The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.merged.jsonl`.
PS: You can also pass in a JSONL with [SWE-Bench format](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-creating-predictions) to `./evaluation/swe_bench/scripts/eval_infer.sh`, where each line is a JSON of `{"model_patch": "XXX", "model_name_or_path": "YYY", "instance_id": "ZZZ"}`.
It will contains an additional field `fine_grained_report` (see example below) compared to the `output.jsonl` from the previous inference stage.
The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/` with the following files/directory (following format of [SWE-bench-docker](https://github.com/aorwall/SWE-bench-docker/tree/main/evaluations/SWE-bench_Lite_golden)):
```json
"fine_grained_report": {
"gold_tests": {
"FAIL_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_default\"]",
"PASS_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_enabled\", \"tests/test_ext_viewcode.py::test_linkcode\", \"tests/test_ext_viewcode.py::test_local_source_files\"]"
},
"generated": true,
"with_logs": true,
"applied": true,
"test_errored": false,
"test_timeout": false,
"resolved": true,
"log_parse": {
"tests/test_ext_viewcode.py::test_viewcode_epub_default": "PASSED",
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled": "PASSED",
"tests/test_ext_viewcode.py::test_linkcode": "PASSED",
"tests/test_ext_viewcode.py::test_local_source_files": "PASSED",
"tests/test_ext_viewcode.py::test_viewcode": "FAILED"
},
"eval_report": {
"FAIL_TO_PASS": {
"success": [
"tests/test_ext_viewcode.py::test_viewcode_epub_default"
],
"failure": []
},
"PASS_TO_PASS": {
"success": [
"tests/test_ext_viewcode.py::test_viewcode_epub_enabled",
"tests/test_ext_viewcode.py::test_linkcode",
"tests/test_ext_viewcode.py::test_local_source_files"
],
"failure": []
},
"FAIL_TO_FAIL": {
"success": [],
"failure": []
},
"PASS_TO_FAIL": {
"success": [],
"failure": []
}
}
}
```
- `README.md`: a report showing what are the instances that passed, failed, etc.
- `logs/`: a directory of test logs
- `report.json`: a JSON file that contains keys like `"resolved"` pointing to instance IDs that are resolved by the agent.
- `summary.json`: a JSON file contains more fine-grained information for each test instance.
Please refer to [EVAL_PATCH.md](./EVAL_PATCH.md) if you want to learn more about how to evaluate patches that are already generated (e.g., not by OpenDevin).
@@ -192,8 +158,8 @@ Please refer to [EVAL_PATCH.md](./EVAL_PATCH.md) if you want to learn more about
If you just want to know the resolve rate, and/or a summary of what tests pass and what don't, you could run
```bash
poetry run python ./evaluation/swe_bench/scripts/summarise_results.py <path_to_output_merged_jsonl_file>
# e.g. poetry run python ./evaluation/swe_bench/scripts/summarise_results.py ./evaluation/evaluation_outputs/outputs/swe_bench_lite/CodeActSWEAgent/gpt-4o-2024-05-13_maxiter_50_N_v1.5-no-hint/output.merged.jsonl
poetry run python ./evaluation/swe_bench/scripts/summarise_results.py <path_to_report_json_file>
# e.g. poetry run python ./evaluation/swe_bench/scripts/summarise_results.py ./evaluation/evaluation_outputs/outputs/swe_bench_lite/CodeActSWEAgent/gpt-4o-2024-05-13_maxiter_50_N_v1.5-no-hint/report.json
```
## Submit your evaluation results

View File

@@ -209,7 +209,7 @@ def process_instance(
if reset_logger:
# Set up logger
log_file = os.path.join(
eval_output_dir, 'logs', f'instance_{instance.instance_id}.log'
eval_output_dir, 'infer_logs', f'instance_{instance.instance_id}.log'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
@@ -471,7 +471,7 @@ if __name__ == '__main__':
def update_progress(future):
pbar.update(1)
output = future.result()
pbar.set_description(f'Instance {output["instance_id"]}')
pbar.set_description(f'Instance {output["instance_id"][:10]}')
pbar.set_postfix_str(f'Test Result: {output["test_result"]["result"]}')
logger.info(
f'Finished evaluation for instance {output["instance_id"]}: {output["test_result"]["result"]}'

View File

@@ -0,0 +1,7 @@
#!/bin/bash
mkdir evaluation/swe_bench/eval_workspace
pushd evaluation/swe_bench/eval_workspace
git clone https://github.com/OpenDevin/SWE-bench-docker.git
cd SWE-bench-docker
scripts/pull_docker_images.sh docker/ xingyaoww

View File

@@ -0,0 +1,26 @@
import argparse
import os
import pandas as pd
parser = argparse.ArgumentParser()
parser.add_argument('od_output_file', type=str)
args = parser.parse_args()
output_filepath = args.od_output_file.replace('.jsonl', '.swebench.jsonl')
print(f'Converting {args.od_output_file} to {output_filepath}')
od_format = pd.read_json(args.od_output_file, orient='records', lines=True)
# model name is the folder name of od_output_file
model_name = os.path.basename(os.path.dirname(args.od_output_file))
def convert_row_to_swebench_format(row):
return {
'instance_id': row['instance_id'],
'model_patch': row['git_patch'].replace('\r\n', '\n'),
'model_name_or_path': model_name,
}
swebench_format = od_format.apply(convert_row_to_swebench_format, axis=1)
swebench_format.to_json(output_filepath, lines=True, orient='records')

View File

@@ -0,0 +1,34 @@
import argparse
import json
import pandas as pd
from datasets import load_dataset
parser = argparse.ArgumentParser()
parser.add_argument(
'output_dir',
type=str,
default='eval_data/instances',
help='Path to the directory to save the instances.',
)
args = parser.parse_args()
dataset = load_dataset('princeton-nlp/SWE-bench')
test = dataset['test'].to_pandas()
test['FAIL_TO_PASS'] = test['FAIL_TO_PASS'].apply(json.loads)
test['PASS_TO_PASS'] = test['PASS_TO_PASS'].apply(json.loads)
test.to_json(f'{args.output_dir}/swe-bench-test.json', orient='records')
dataset = load_dataset('princeton-nlp/SWE-bench_Lite')
test = dataset['test'].to_pandas()
test['FAIL_TO_PASS'] = test['FAIL_TO_PASS'].apply(json.loads)
test['PASS_TO_PASS'] = test['PASS_TO_PASS'].apply(json.loads)
test.to_json(f'{args.output_dir}/swe-bench-lite-test.json', orient='records')
dev = dataset['dev'].to_pandas()
dev['FAIL_TO_PASS'] = dev['FAIL_TO_PASS'].apply(json.loads)
dev['PASS_TO_PASS'] = dev['PASS_TO_PASS'].apply(json.loads)
dev.to_json(f'{args.output_dir}/swe-bench-lite-dev.json', orient='records')
all_data = pd.concat([test, dev])
all_data.to_json(f'{args.output_dir}/swe-bench-lite-all.json', orient='records')

View File

@@ -0,0 +1,11 @@
#!/bin/bash
echo "Cloning OpenDevin SWE-Bench Fork"
git clone https://github.com/OpenDevin/SWE-bench.git evaluation/swe_bench/eval_workspace/SWE-bench
echo "Pulling all evaluation dockers..."
evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh
echo "Downloading SWE-bench data..."
mkdir -p evaluation/swe_bench/eval_workspace/eval_data/instances
poetry run python3 evaluation/swe_bench/scripts/eval/download_swe_bench_data.py evaluation/swe_bench/eval_workspace/eval_data/instances

View File

@@ -11,25 +11,91 @@ if [ ! -f $PROCESS_FILEPATH ]; then
exit 1
fi
# If instance_id is empty, it means we want to eval on the whole $PROCESS_FILEPATH
# otherwise, we want to eval on the instance_id
INSTANCE_ID=$2
echo "INSTANCE_ID: $INSTANCE_ID"
PROCESS_FILEPATH=$(realpath $PROCESS_FILEPATH)
FILE_DIR=$(dirname $PROCESS_FILEPATH)
FILE_NAME=$(basename $PROCESS_FILEPATH)
mkdir -p $FILE_DIR/eval_logs
mkdir -p $FILE_DIR/logs
mkdir -p $FILE_DIR/swe_bench_format
echo "Evaluating $FILE_NAME @ $FILE_DIR"
echo "Merged output file with fine-grained report will be saved to $FILE_DIR"
DOCKERHUB_NAMESPACE="xingyaoww"
SWEBENCH_TASKS=$(realpath evaluation/swe_bench/eval_workspace/eval_data/instances/swe-bench-lite-all.json)
export SWEBENCH_DOCKER_FORK_DIR=$(realpath evaluation/swe_bench/eval_workspace/SWE-bench-docker)
docker run --rm \
-v $FILE_DIR:/swe_bench_output \
-e MINICONDA3=/swe_util/miniforge3 \
-e OD_SWE_BENCH=/swe_util/OD-SWE-bench \
-e EVAL_DATA_DIR=/swe_util/eval_data \
-w /swe_util \
ghcr.io/opendevin/eval-swe-bench:full-v1.2.1 \
bash -c "./get_agent_report.sh --output-file /swe_bench_output/$FILE_NAME \
--agent-name CodeActAgent \
--dataset swe-bench-test-lite \
--experiment-name test_experiment \
--merge-report && cp -r /swe_util/eval_data/eval_logs/test_experiment/* /swe_bench_output/eval_logs \
&& cp -r /swe_util/eval_data/outputs/* /swe_bench_output/swe_bench_format/"
# ================================================
# detect whether PROCESS_FILEPATH is in OD format or in SWE-bench format
echo "=============================================================="
echo "Detecting whether PROCESS_FILEPATH is in OD format or in SWE-bench format"
echo "=============================================================="
# SWE-bench format is a JSONL where every line has three fields: model_name_or_path, instance_id, and model_patch
function is_swebench_format() {
# Read the first line of the file
read -r first_line < "$PROCESS_FILEPATH"
# Use jq to check if the first line has the required fields
echo "$first_line" | jq -e '. | has("model_name_or_path") and has("instance_id") and has("model_patch")' > /dev/null
if [ $? -ne 0 ]; then
return 1 # Return 1 if the first line does not have the required fields
fi
return 0 # Return 0 if the first line has the required fields
}
# Call the function with the file path
is_swebench_format "$PROCESS_FILEPATH"
IS_SWEBENCH_FORMAT=$?
# Use the result in an if-else statement
if [ $IS_SWEBENCH_FORMAT -eq 0 ]; then
echo "The file IS in SWE-bench format."
SWEBENCH_FORMAT_JSONL=$PROCESS_FILEPATH
else
echo "The file IS NOT in SWE-bench format."
# ==== Convert OD format to SWE-bench format ====
echo "Merged output file with fine-grained report will be saved to $FILE_DIR"
poetry run python3 evaluation/swe_bench/scripts/eval/convert_od_output_to_swe_json.py $PROCESS_FILEPATH
# replace .jsonl with .swebench.jsonl in filename
SWEBENCH_FORMAT_JSONL=${PROCESS_FILEPATH/.jsonl/.swebench.jsonl}
echo "SWEBENCH_FORMAT_JSONL: $SWEBENCH_FORMAT_JSONL"
# assert that the file exists
if [ ! -f $SWEBENCH_FORMAT_JSONL ]; then
echo "Error: $SWEBENCH_FORMAT_JSONL does not exist. There is probably an error in the conversion process."
exit 1
fi
SWEBENCH_FORMAT_JSONL=$(realpath $SWEBENCH_FORMAT_JSONL)
fi
# ================================================
echo "=============================================================="
echo "Running SWE-bench evaluation"
echo "=============================================================="
if [ -z "$INSTANCE_ID" ]; then
echo "Running SWE-bench evaluation on the whole input file..."
poetry run python $SWEBENCH_DOCKER_FORK_DIR/run_evaluation.py \
--predictions_path $SWEBENCH_FORMAT_JSONL \
--log_dir $FILE_DIR/logs \
--swe_bench_tasks $SWEBENCH_TASKS \
--namespace $DOCKERHUB_NAMESPACE \
--timeout 1800
else
echo "Running SWE-bench evaluation on the instance_id: $INSTANCE_ID"
poetry run python $SWEBENCH_DOCKER_FORK_DIR/run_single_instance.py \
--predictions_path $SWEBENCH_FORMAT_JSONL \
--swe_bench_tasks $SWEBENCH_TASKS \
--namespace $DOCKERHUB_NAMESPACE \
--instance_id $INSTANCE_ID
fi
poetry run python $SWEBENCH_DOCKER_FORK_DIR/generate_report.py \
--predictions_path $SWEBENCH_FORMAT_JSONL \
--log_dir $FILE_DIR/logs \
--output_dir $FILE_DIR \
--swe_bench_tasks $SWEBENCH_TASKS

View File

@@ -3,37 +3,37 @@ import sys
def extract_test_results(json_file_path):
passed_tests = []
failed_tests = []
passed_instances = set()
all_instances = set()
with open(json_file_path, 'r') as file:
for line in file:
data = json.loads(line.strip())
instance_id = data['instance_id']
resolved = False
if 'fine_grained_report' in data:
resolved = data['fine_grained_report']['resolved']
else:
resolved = data['test_result']['result']['resolved']
if resolved:
passed_tests.append(instance_id)
else:
failed_tests.append(instance_id)
return passed_tests, failed_tests
report = json.load(file)
# Add resolved instances
for instance_id in report['resolved']:
passed_instances.add(instance_id)
# Add all instances in the report
for _, instance_ids in report.items():
for instance_id in instance_ids:
all_instances.add(instance_id)
return passed_instances, all_instances
if __name__ == '__main__':
if len(sys.argv) != 2:
print(
'Usage: poetry run python summarise_results.py <path_to_output_merged_jsonl_file>'
'Usage: poetry run python summarise_results.py <path_to_report_json_file>'
)
sys.exit(1)
json_file_path = sys.argv[1]
passed_tests, failed_tests = extract_test_results(json_file_path)
succ_rate = len(passed_tests) / (len(passed_tests) + len(failed_tests))
passed_instances, all_instances = extract_test_results(json_file_path)
succ_rate = len(passed_instances) / len(all_instances)
print(
f'\nPassed {len(passed_tests)} tests, failed {len(failed_tests)} tests, resolve rate = {succ_rate}'
f'\nPassed {len(passed_instances)} tests, total {len(all_instances)} tests, resolve rate = {succ_rate:.2%}'
)
print('PASSED TESTS:')
print(passed_tests)
print(sorted(list(passed_instances)))
print('FAILED TESTS:')
print(failed_tests)
print(sorted(list(all_instances - passed_instances)))

View File

@@ -51,7 +51,9 @@ class SWEBenchSSHBox(DockerSSHBox):
assert exit_code == 0, f'Failed to set SWE_INSTANCE_ID in ~/.bashrc: {output}'
logger.info('Sourcing swe_entry.sh to set up environment variables')
# larger timeout for SWEBench init to account for long-running installations (e.g., require compilation)
logger.info(
'Initialization of SWEBench may take approximately 10 minutes due to long-running installations, such as those requiring compilation.'
)
exit_code, output = self.execute('source /swe_util/swe_entry.sh', timeout=600)
logger.info('exit code: %d', exit_code)
logger.info(output)

View File

@@ -0,0 +1,45 @@
# ToolQA Evaluation with OpenDevin
This folder contains an evaluation harness we built on top of the original [ToolQA](https://github.com/night-chen/ToolQA) ([paper](https://arxiv.org/pdf/2306.13304)).
## Setup Environment
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local development environment for OpenDevin.
## Configure OpenDevin and your LLM
Run `make setup-config` to set up the `config.toml` file if it does not exist at the root of the workspace.
## Run Inference on ToolQA Instances
Make sure your Docker daemon is running, then run this bash script:
```bash
bash evaluation/toolqa/scripts/run_infer.sh [model_config] [agent] [eval_limit] [dataset] [hardness] [wolfram_alpha_appid]
```
where `model_config` is mandatory, while all other arguments are optional.
`model_config`, e.g. `llm`, is the config group name for your
LLM settings, as defined in your `config.toml`.
`agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
to `CodeActAgent`.
`eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances.
By default, the script evaluates 1 instance.
`dataset`, the dataset from ToolQA to evaluate from. You could choose from `agenda`, `airbnb`, `coffee`, `dblp`, `flight`, `gsm8k`, `scirex`, `yelp` for dataset. The default is `coffee`.
`hardness`, the hardness to evaluate. You could choose from `easy` and `hard`. The default is `easy`.
`wolfram_alpha_appid` is an optional argument. When given `wolfram_alpha_appid`, the agent will be able to access Wolfram Alpha's APIs.
Note: in order to use `eval_limit`, you must also set `agent`; in order to use `dataset`, you must also set `eval_limit`; in order to use `hardness`, you must also set `dataset`.
Let's say you'd like to run 10 instances using `llm` and CodeActAgent on `coffee` `easy` test,
then your command would be:
```bash
bash evaluation/toolqa/scripts/run_infer.sh llm CodeActAgent 10 coffee easy
```

View File

@@ -0,0 +1,353 @@
import asyncio
import json
import logging
import multiprocessing as mp
import os
import pathlib
import subprocess
import time
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm
from utils import download_data, download_tools, encode_question, eval_answer, get_data
from opendevin.controller.state.state import State
from opendevin.core.config import config, get_llm_config_arg, get_parser
from opendevin.core.logger import get_console_handler
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import main
from opendevin.events.action import MessageAction
from opendevin.events.serialization.event import event_to_dict
def cleanup():
print('Cleaning up child processes...')
for process in mp.active_children():
print(f'Terminating child process: {process.name}')
process.terminate()
process.join()
def codeact_user_response(state: State) -> str:
msg = (
'Please continue working on the task on whatever approach you think is suitable.\n'
'When you think you finished the task, respond with `Finish[answer]` where you include your answer in `[]`\n'
'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.\n'
)
if state.history:
user_msgs = [
action
for action, _ in state.history
if isinstance(action, MessageAction) and action.source == 'user'
]
if len(user_msgs) >= 2:
# let the agent know that it can give up when it has tried 3 times
return (
msg
+ 'If you want to give up, run: <execute_bash> exit </execute_bash>.\n'
)
return msg
def monologue_user_response(state: State) -> str:
raise NotImplementedError('MonologueAgent should never ask for user responses.')
AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
'CodeActAgent': codeact_user_response,
'MonologueAgent': monologue_user_response,
}
AGENT_CLS_TO_INST_SUFFIX = {
'CodeActAgent': 'When you think you have completed the request, please run the following command: <execute_bash> exit </execute_bash>.\n'
}
def process_instance(task, agent_class, metadata, reset_logger: bool = True):
# create process-specific workspace dir
# we will create a workspace directory for EACH process
# so that different agent don't interfere with each other.
workspace_mount_path = config.workspace_mount_path
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
eval_output_dir = metadata['eval_output_dir']
qid = task['qid']
question = task['question']
answer = task['answer']
if reset_logger:
# Set up logger
log_file = os.path.join(eval_output_dir, 'logs', f'instance_{qid}.log')
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {qid}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
# Prepare instruction
instruction = encode_question(question)
instruction += 'IMPORTANT: You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.\n'
# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX.get(agent_class, '')
# logger.info(f'Instruction:\n{instruction}', extra={'msg_type': 'OBSERVATION'})
# Here's how you can run the agent (similar to the `main` function) and get the final task state
state: State = asyncio.run(
main(
instruction,
fake_user_response_fn=AGENT_CLS_TO_FAKE_USER_RESPONSE_FN.get(agent_class),
)
)
# ======= Attempt to evaluate the agent's edits =======
# If you are working on simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
model_answer_raw = ''
for act, _ in reversed(state.history):
if isinstance(act, MessageAction) and act.source == 'agent':
model_answer_raw = act.content
break
# attempt to parse model_answer
correct = eval_answer(str(model_answer_raw), str(answer))
metrics = state.metrics.get() if state.metrics else None
logger.info(f'Final message: {model_answer_raw} | Correctness: {correct}')
# Save the output
output = {
'qid': qid,
'text': model_answer_raw,
'correct': correct,
'answer_id': 'None',
'model_id': metadata['model_name'],
'metadata': metadata,
'history': [
(event_to_dict(action), event_to_dict(obs)) for action, obs in state.history
],
'metrics': metrics,
'error': state.error if state and state.error else None,
}
return output
if __name__ == '__main__':
parser = get_parser()
parser.add_argument(
'--dataset',
type=str,
help='Which dataset to evaluate from ToolQA. ToolQA contains 8 datasets, namely agenda, airbnb, coffee, dblp, flight, gsm8k, scirex, yelp. For example, the default is --dataset flight.',
default='flight',
)
parser.add_argument(
'--hardness',
type=str,
help='Which level of difficulty to evaluate from ToolQA. ToolQA contains 2 levels of hardness, namely easy and hard. For example, the default is --hardness easy.',
default='easy',
)
parser.add_argument(
'--wolfram_alpha_appid',
type=str,
help='wolfram alpha appid to use for wolfram alpha related tests',
default='YOUR_WOLFRAMALPHA_APPID',
)
args, _ = parser.parse_known_args()
if args.directory:
config.workspace_base = os.path.abspath(args.directory)
print(f'Setting workspace base to {config.workspace_base}')
# Check https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/README.md#configure-opendevin-and-your-llm
# for details of how to set `llm_config`
if args.llm_config:
specified_llm_config = get_llm_config_arg(args.llm_config)
if specified_llm_config:
config.llm = specified_llm_config
logger.info(f'Config for evaluation: {config}')
agent_class = args.agent_cls
assert (
agent_class in AGENT_CLS_TO_FAKE_USER_RESPONSE_FN
), f'Unsupported agent class: {agent_class}'
model_name = config.llm.model.split('/')[-1]
max_iterations = args.max_iterations
eval_note = ''
if args.eval_note is not None:
eval_note += '_N_' + args.eval_note
eval_output_dir = os.path.join(
args.eval_output_dir,
'toolqa',
agent_class,
model_name + '_maxiter_' + str(max_iterations) + eval_note,
)
pathlib.Path(eval_output_dir).mkdir(parents=True, exist_ok=True)
pathlib.Path(os.path.join(eval_output_dir, 'logs')).mkdir(
parents=True, exist_ok=True
)
logger.info(f'Using evaluation output directory: {eval_output_dir}')
dataset = ''
hardness = ''
dataset_choices = [
'agenda',
'airbnb',
'coffee',
'dblp',
'flight',
'gsm8k',
'scirex',
'yelp',
'genda',
]
if args.dataset in dataset_choices:
dataset = args.dataset
else:
raise ValueError(
'Please choose from agenda, airbnb, coffee, dblp, flight, gsm8k, scirex, yelp for dataset.'
)
if args.hardness == 'easy':
hardness = 'easy'
elif args.hardness == 'hard':
hardness = 'hard'
else:
raise ValueError('Please choose from easy and hard for hardness.')
logger.info(f'Evaluating ToolQA {dataset} {hardness} test')
# workspace_mount_path = os.path.join(config.workspace_mount_path, '_eval_workspace')
workspace_mount_path = config.workspace_mount_path
pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
toolqa_test = get_data(dataset, hardness)
toolqa_data_path = download_data(workspace_mount_path)
toolqa_tool_path = download_tools(workspace_mount_path, args.wolfram_alpha_appid)
# TEST METADATA
metadata = {
'dataset': dataset,
'hardness': hardness,
'agent_class': agent_class,
'model_name': model_name,
'max_iterations': max_iterations,
'eval_output_dir': eval_output_dir,
'start_time': time.strftime('%Y-%m-%d %H:%M:%S'),
# get the commit id of current repo for reproduciblity
'git_commit': subprocess.check_output(['git', 'rev-parse', 'HEAD'])
.decode('utf-8')
.strip(),
}
logger.info(f'Metadata: {metadata}')
with open(
os.path.join(eval_output_dir, f'metadata_{dataset}_{hardness}.json'), 'w'
) as f:
json.dump(metadata, f)
# LIMIT EVALUATION
eval_n_limit = args.eval_n_limit
if eval_n_limit:
toolqa_test = toolqa_test[:eval_n_limit]
logger.info(
f'Limiting evaluation to a total of first {eval_n_limit} instances.'
)
output_file = os.path.join(
eval_output_dir, f'output_{model_name}_{dataset}_{hardness}.jsonl'
)
logger.info(f'Writing evaluation output to {output_file}')
finished_task_ids = set()
if os.path.exists(output_file):
with open(output_file, 'r') as f:
for line in f:
task = json.loads(line)
finished_task_ids.add(task['qid'])
logger.warning(
f'Output file {output_file} already exists. Loaded {len(finished_task_ids)} finished instances.'
)
output_fp = open(output_file, 'a')
logger.info(
f'Evaluation started with Agent {agent_class}, model {model_name}, max iterations {max_iterations}.'
)
# =============================================
# filter out finished instances
new_toolqa_test = []
for task in toolqa_test:
qid = task['qid']
if qid in finished_task_ids:
logger.info(f'Skipping instance {qid} as it is already finished.')
continue
new_toolqa_test.append(task)
finished_task_number = len(finished_task_ids)
toolqa_test = new_toolqa_test
logger.info(
f'Finished instances: {finished_task_number}, Remaining instances: {len(toolqa_test)}'
)
# =============================================
pbar = tqdm(total=len(toolqa_test))
# This function tracks the progress AND write the output to a JSONL file
def update_progress(future):
pbar.update(1)
output = future.result()
pbar.set_description(f'Instance {output["qid"]}')
pbar.set_postfix_str(f'Test Result: {output["correct"]}')
logger.info(
f'Finished evaluation for instance {output["qid"]}: {output["correct"]}'
)
output_fp.write(json.dumps(output) + '\n')
output_fp.flush()
finished_task_ids.add(output['qid'])
# This sets the multi-processing
num_workers = args.eval_num_workers
logger.info(f'Using {num_workers} workers for evaluation.')
try:
with ProcessPoolExecutor(num_workers) as executor:
futures = []
# This is how we perform multi-processing
for task in toolqa_test:
try:
future = executor.submit(
process_instance,
task,
agent_class,
metadata,
reset_logger=bool(num_workers > 1),
)
future.add_done_callback(update_progress)
futures.append(future)
except Exception:
continue
# Wait for all futures to complete
for future in futures:
try:
future.result()
except Exception:
continue
except KeyboardInterrupt:
logger.info('KeyboardInterrupt received. Cleaning up...')
cleanup()
output_fp.close()
total_correct = 0
output = []
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
output.append(data)
if data['qid'] in finished_task_ids:
if str(data['correct']).lower() == 'true':
total_correct += 1
# sort all output by question_id
output = sorted(output, key=lambda x: x['qid'])
with open(output_file, 'w') as f:
for dat in output:
f.write(json.dumps(dat) + '\n')
f.flush()
logger.info(
f'Evaluation finished for {dataset}-{hardness}. Total: {len(toolqa_test)+finished_task_number}; Correct: {total_correct}; Accuracy: {total_correct / (len(toolqa_test)+finished_task_number)}'
)

View File

@@ -0,0 +1,58 @@
#!/bin/bash
MODEL_CONFIG=$1
AGENT=$2
EVAL_LIMIT=$3
DATASET=$4
HARDNESS=$5
WOLFRAM_APPID=$6
if [ -z "$AGENT" ]; then
echo "Agent not specified, use default CodeActAgent"
AGENT="CodeActAgent"
fi
if [ -z "$DATASET" ]; then
DATASET="flight"
echo "Dataset not specified, use default $DATASET"
fi
if [ -z "$HARDNESS" ]; then
HARDNESS="easy"
echo "Hardness not specified, use default $HARDNESS"
fi
if [ -z "$WOLFRAM_APPID" ]; then
WOLFRAM_APPID="YOUR_WOLFRAMALPHA_APPID"
echo "WOLFRAM_APPID not specified"
fi
# IMPORTANT: Because Agent's prompt changes fairly often in the rapidly evolving codebase of OpenDevin
# We need to track the version of Agent in the evaluation to make sure results are comparable
AGENT_VERSION=v$(poetry run python -c "import agenthub; from opendevin.controller.agent import Agent; print(Agent.get_cls('$AGENT').VERSION)")
echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
echo "DATASET: $DATASET"
echo "HARDNESS: $HARDNESS"
echo "WOLFRAM_APPID: $WOLFRAM_APPID"
COMMAND="poetry run python evaluation/toolqa/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 30 \
--dataset $DATASET \
--hardness $HARDNESS \
--wolfram_alpha_appid $WOLFRAM_APPID\
--data-split validation \
--max-chars 10000000 \
--eval-num-workers 1 \
--eval-note ${AGENT_VERSION}_${LEVELS}"
if [ -n "$EVAL_LIMIT" ]; then
echo "EVAL_LIMIT: $EVAL_LIMIT"
COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
fi
# Run the command
eval $COMMAND

112
evaluation/toolqa/utils.py Normal file
View File

@@ -0,0 +1,112 @@
import json
import os
import re
import string
import zipfile
import gdown
import requests
def download_data(dir):
data_path = os.path.join(dir, 'data/external_corpus')
if os.path.exists(data_path):
return data_path
url = 'https://drive.google.com/uc?id=1zRbHzPW2x4dDcfmphBWlan8cxUCRNmqk'
zip_path = os.path.join(dir, 'data.zip')
gdown.download(url, zip_path, quiet=False)
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
zip_ref.extractall(os.path.join(dir, 'data'))
if os.path.exists(zip_path):
os.remove(zip_path)
return data_path
def download_tools(dir, wolfram_alpha_appid='YOUR_WOLFRAMALPHA_APPID'):
tool_path = os.path.join(dir, 'tools')
if os.path.exists(tool_path):
return tool_path
os.mkdir(tool_path)
tools = [
'code/sql_interpreter.py',
'graph/graphtools.py',
'math/calculator.py',
'table/mysql_db_create.py',
'table/tabtools.py',
'text/agenda_retriever.py',
'text/scirex_retriever.py',
]
for tool in tools:
url = f'https://raw.githubusercontent.com/night-chen/ToolQA/main/benchmark/ReAct/code/tools/{tool}'
response = requests.get(url)
output_file = os.path.join(tool_path, tool.split('/')[1])
with open(output_file, 'wb') as f:
f.write(response.content)
with open(os.path.join(tool_path, 'calculator.py'), 'r') as f:
content = f.read()
new_content = content.replace('YOUR_WOLFRAMALPHA_APPID', wolfram_alpha_appid)
with open(os.path.join(tool_path, 'calculator.py'), 'w') as f:
f.write(new_content)
with open(os.path.join(tool_path, 'agenda_retriever.py'), 'r') as f:
content = f.read()
new_content = content.replace('/<YOUR_OWN_PATH>/ToolQA/', '')
with open(os.path.join(tool_path, 'agenda_retriever.py'), 'w') as f:
f.write(new_content)
with open(os.path.join(tool_path, 'mysql_db_create.py'), 'r') as f:
content = f.read()
new_content = content.replace('/<YOUR_OWN_PATH>/ToolQA/', '')
with open(os.path.join(tool_path, 'mysql_db_create.py'), 'w') as f:
f.write(new_content)
with open(os.path.join(tool_path, 'scirex_retriever.py'), 'r') as f:
content = f.read()
new_content = content.replace('/<YOUR_OWN_PATH>/ToolQA/', '')
with open(os.path.join(tool_path, 'scirex_retriever.py'), 'w') as f:
f.write(new_content)
def get_data(dataset, hardness):
data = []
url = f'https://raw.githubusercontent.com/night-chen/ToolQA/main/data/questions/{hardness}/{dataset}-{hardness}.jsonl'
url = requests.get(url)
if url.status_code == 200:
lines = url.text.splitlines()
for line in lines:
data.append(json.loads(line))
return data
REACT_INSTRUCTION = """Use tools in the tools directory to solve the task: {question}
You could use all tools which are under the tools/ directory and all the data under the data/ directory.
When you think you finished the task, respond with `Finish[answer]` where you include your answer in `[]`.
IMPORTANT: Make sure that in your final answer, you should not print any additional text/instructions other than the actual answer, which should be a word or a simple phrase.
"""
def encode_question(question):
return REACT_INSTRUCTION.format(question=question)
# imported from https://github.com/night-chen/ToolQA/tree/main/benchmark/ReAct/code/agents_chatgpt.py
def normalize_answer(s):
def remove_articles(text):
return re.sub(r'\b(a|an|the|usd)\b', ' ', text)
def white_space_fix(text):
return ' '.join(text.split())
def remove_punc(text):
exclude = set(string.punctuation)
return ''.join(ch for ch in text if ch not in exclude)
def lower(text):
return text.lower()
return white_space_fix(remove_articles(remove_punc(lower(s))))
def eval_answer(pred, answer):
pattern = r'Finish\[(.*?)\]'
match = re.search(pattern, pred)
if match:
pred = match.group(1)
return normalize_answer(pred) == normalize_answer(answer)

View File

@@ -0,0 +1,91 @@
# WebArena Evaluation with OpenDevin Browsing Agents
This folder contains evaluation for [WebArena](https://github.com/web-arena-x/webarena) benchmark, powered by [BrowserGym](https://github.com/ServiceNow/BrowserGym) for easy evaluation of how well an agent capable of browsing can perform on realistic web browsing tasks.
## Setup OpenDevin Environment
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to setup local develop environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file if it does not exist at the root of the workspace.
Add the following configurations:
```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
sandbox_container_image = "ghcr.io/opendevin/sandbox:latest"
sandbox_type = "ssh"
ssh_hostname = "localhost"
sandbox_timeout = 120
# TODO: Change these to the model you want to evaluate
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
## Setup WebArena Environment
WebArena requires you to set up websites containing pre-populated content that is accessible via URL to the machine running the OpenDevin agents.
Follow [this document](https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md) to set up your own WebArena environment through local servers or AWS EC2 instances.
Take note of the base URL of the machine where the environment is installed.
## Setup Environment Variables of WebArena Websites
Create a script `webarena_env.sh` under `evaluation/webarena/scripts` with the following:
```bash
export BASE_URL=<YOUR_SERVER_URL_HERE>
export SHOPPING="$BASE_URL:7770/"
export SHOPPING_ADMIN="$BASE_URL:7780/admin"
export REDDIT="$BASE_URL:9999"
export GITLAB="$BASE_URL:8023"
export WIKIPEDIA="$BASE_URL:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing"
export MAP="$BASE_URL:3000"
export HOMEPAGE="$BASE_URL:4399"
export OPENAI_API_KEY="yourkey" # this key is required for some WebArena validators that utilize LLMs
```
## Test if your environment works
Access with browser the above WebArena website URLs and see if they load correctly.
If you cannot access the website, make sure the firewall allows public access of the aforementioned ports on your server
Check the network security policy if you are using an AWS machine.
Follow the WebArena environment setup guide carefully, and make sure the URL fields are populated with the correct base URL of your server.
## Run Evaluation
```sh
bash evaluation/webarena/scripts/run_infer.sh
```
Results will be in `evaluation/evaluation_outputs/outputs/webarena/`
To calculate the success rate, run:
```sh
poetry run python evaluation/webarena/get_success_rate.py evaluation/evaluation_outputs/outputs/webarena/SOME_AGENT/EXP_NAME/output.jsonl
```
## Submit your evaluation results
You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
## BrowsingAgent V1.0 result
Tested on BrowsingAgent V1.0
WebArena, 812 tasks (high cost, single run due to fixed task), max step 15
- GPT4o: 0.1478
- GPT3.5: 0.0517

View File

View File

@@ -0,0 +1,33 @@
import argparse
import json
import browsergym.webarena # noqa F401 register webarena tasks as gym environments
import gymnasium as gym
parser = argparse.ArgumentParser(description='Calculate average reward.')
parser.add_argument('output_path', type=str, help='path to output.jsonl')
args = parser.parse_args()
if __name__ == '__main__':
env_ids = [
id for id in gym.envs.registry.keys() if id.startswith('browsergym/webarena')
]
total_num = len(env_ids)
print('Total number of tasks: ', total_num)
total_reward = 0
total_cost = 0
actual_num = 0
with open(args.output_path, 'r') as f:
for line in f:
data = json.loads(line)
actual_num += 1
total_cost += data['metrics']['accumulated_cost']
total_reward += data['test_result']
avg_reward = total_reward / total_num
print('Success Rate: ', avg_reward)
avg_cost = total_cost / actual_num
print('Avg Cost: ', avg_cost)
print('Actual number of tasks finished: ', actual_num)

View File

@@ -0,0 +1,214 @@
import asyncio
import json
import logging
import os
import pathlib
import subprocess
import time
import browsergym.webarena # noqa F401 register webarena tasks as gym environments
import gymnasium as gym
from tqdm import tqdm
from opendevin.controller.state.state import State
from opendevin.core.config import args, config, get_llm_config_arg
from opendevin.core.logger import get_console_handler
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import main
from opendevin.events.serialization.event import event_to_dict
from opendevin.runtime.docker.ssh_box import DockerSSHBox
from opendevin.runtime.tools import RuntimeTool
SUPPORTED_AGENT_CLS = {'BrowsingAgent'}
def process_instance(
env_id: str,
metadata: dict,
eval_output_dir: str,
docker_sandbox: DockerSSHBox,
reset_logger: bool = True,
):
# Setup the logger properly, so you can run multi-processing to parallelize the evaluation
if reset_logger:
# Set up logger
log_file = os.path.join(eval_output_dir, 'logs', f'instance_{env_id}.log')
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
# add back the console handler to print ONE line
logger.addHandler(get_console_handler())
logger.info(
f'Starting evaluation for instance {env_id}.\nHint: run "tail -f {log_file}" to see live logs in a separate shell'
)
# Remove all existing handlers from logger
for handler in logger.handlers[:]:
logger.removeHandler(handler)
file_handler = logging.FileHandler(log_file)
file_handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
)
logger.addHandler(file_handler)
else:
logger.info(f'Starting evaluation for instance {env_id}.')
# Here's how you can run the agent (similar to the `main` function) and get the final task state
runtime_tools_config = {
RuntimeTool.BROWSER: {
'browsergym_eval': env_id,
'browsergym_eval_save_dir': eval_output_dir,
}
}
state: State = asyncio.run(
main(
'PLACEHOLDER_GOAL',
runtime_tools_config=runtime_tools_config,
sandbox=docker_sandbox,
)
)
# ======= Attempt to evaluate the agent's environment impact =======
# If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
# You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
if state is None:
raise ValueError('State should not be None.')
metrics = state.metrics.get() if state.metrics else None
browsergym_eval_dir = os.path.join(eval_output_dir, env_id.split('/')[1])
# read goal
with open(
os.path.join(browsergym_eval_dir, 'goal.txt'), 'r', encoding='utf-8'
) as f:
instruction = f.read()
# read reward
with open(
os.path.join(browsergym_eval_dir, 'rewards.json'), 'r', encoding='utf-8'
) as f:
rewards = json.load(f)
reward = max(rewards)
# Save the output
output = {
'instance_id': env_id,
'instruction': instruction,
'metadata': metadata,
'history': [
(event_to_dict(action), event_to_dict(obs)) for action, obs in state.history
],
'metrics': metrics,
'error': state.error if state and state.error else None,
'test_result': reward,
}
return output
if __name__ == '__main__':
env_ids = [
id for id in gym.envs.registry.keys() if id.startswith('browsergym/webarena')
]
# Check https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/README.md#configure-opendevin-and-your-llm
# for details of how to set `llm_config`
if args.llm_config:
specified_llm_config = get_llm_config_arg(args.llm_config)
if specified_llm_config:
config.llm = specified_llm_config
logger.info(f'Config for evaluation: {config}')
# TEST METADATA
agent_class = args.agent_cls
assert agent_class in SUPPORTED_AGENT_CLS, f'Unsupported agent class: {agent_class}'
model_name = config.llm.model.split('/')[-1]
max_iterations = args.max_iterations
eval_note = ''
if args.eval_note is not None:
eval_note += '_N_' + args.eval_note
eval_output_dir = os.path.join(
args.eval_output_dir,
'webarena',
agent_class,
model_name + '_maxiter_' + str(max_iterations) + eval_note,
)
pathlib.Path(eval_output_dir).mkdir(parents=True, exist_ok=True)
pathlib.Path(os.path.join(eval_output_dir, 'logs')).mkdir(
parents=True, exist_ok=True
)
logger.info(f'Using evaluation output directory: {eval_output_dir}')
metadata = {
'agent_class': agent_class,
'model_name': model_name,
'max_iterations': max_iterations,
'eval_output_dir': eval_output_dir,
'start_time': time.strftime('%Y-%m-%d %H:%M:%S'),
# get the commit id of current repo for reproducibility
'git_commit': subprocess.check_output(['git', 'rev-parse', 'HEAD'])
.decode('utf-8')
.strip(),
}
logger.info(f'Metadata: {metadata}')
with open(os.path.join(eval_output_dir, 'metadata.json'), 'w') as f:
json.dump(metadata, f)
# LIMIT EVALUATION
eval_n_limit = args.eval_n_limit
if eval_n_limit:
env_ids = env_ids[:eval_n_limit]
logger.info(f'Limiting evaluation to first {eval_n_limit} instances.')
# OUTPUT FILE
output_file = os.path.join(eval_output_dir, 'output.jsonl')
logger.info(f'Writing evaluation output to {output_file}')
finished_instance_ids = set()
if os.path.exists(output_file):
with open(output_file, 'r') as f:
for line in f:
data = json.loads(line)
finished_instance_ids.add(data['instance_id'])
logger.warning(
f'Output file {output_file} already exists. Loaded {len(finished_instance_ids)} finished instances.'
)
output_fp = open(output_file, 'a')
logger.info(
f'Evaluation started with Agent {agent_class}, model {model_name}, max iterations {max_iterations}.'
)
# =============================================
# filter out finished instances
new_env_ids = []
for idx in env_ids:
if idx in finished_instance_ids:
logger.info(f'Skipping instance {idx} as it is already finished.')
continue
new_env_ids.append(idx)
env_ids = new_env_ids
logger.info(
f'Finished instances: {len(finished_instance_ids)}, Remaining instances: {len(env_ids)}'
)
# =============================================
docker_sandbox = DockerSSHBox()
for env_id in tqdm(env_ids):
try:
output = process_instance(
env_id=env_id,
metadata=metadata,
eval_output_dir=eval_output_dir,
docker_sandbox=docker_sandbox,
reset_logger=False,
)
output_fp.write(json.dumps(output) + '\n')
output_fp.flush()
except Exception as e:
logger.error(f'Error processing instance {env_id}: {e}')
output_fp.close()
logger.info('Evaluation finished.')

View File

@@ -0,0 +1,42 @@
#!/bin/bash
# configure webarena websites and environment
source evaluation/webarena/scripts/webarena_env.sh
# configure browsing agent
export USE_NAV="false"
export USE_CONCISE_ANSWER="true"
MODEL_CONFIG=$1
AGENT=$2
EVAL_LIMIT=$3
if [ -z "$AGENT" ]; then
echo "Agent not specified, use default BrowsingAgent"
AGENT="BrowsingAgent"
fi
# IMPORTANT: Because Agent's prompt changes fairly often in the rapidly evolving codebase of OpenDevin
# We need to track the version of Agent in the evaluation to make sure results are comparable
AGENT_VERSION=v$(poetry run python -c "import agenthub; from opendevin.controller.agent import Agent; print(Agent.get_cls('$AGENT').VERSION)")
echo "AGENT: $AGENT"
echo "AGENT_VERSION: $AGENT_VERSION"
echo "MODEL_CONFIG: $MODEL_CONFIG"
EVAL_NOTE="$AGENT_VERSION"
COMMAND="poetry run python evaluation/webarena/run_infer.py \
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 15 \
--max-chars 10000000 \
--eval-note $EVAL_NOTE"
if [ -n "$EVAL_LIMIT" ]; then
echo "EVAL_LIMIT: $EVAL_LIMIT"
COMMAND="$COMMAND --eval-n-limit $EVAL_LIMIT"
fi
# Run the command
eval $COMMAND

View File

@@ -6,7 +6,7 @@ In the project directory, you can run:
### `npm run start -- --port 3001`
Runs the app in the development mode.\
Runs the app in development mode.\
Open [http://localhost:3001](http://localhost:3001) to view it in the browser.
The page will reload if you make edits.\
@@ -14,13 +14,15 @@ You will also see any lint errors in the console.
### `npm run make-i18n`
This command is used to generate the i18n declaration file.\
It should be run when first setting up the repository or when updating translations.
Generates the i18n declaration file.\
Run this when first setting up the repository or when updating translations.
### `npm run test`
This command runs the available test suites for the application.\
It launches the test runner in the interactive watch mode, allowing you to see the results of your tests in real time.
Runs the available test suites for the application.\
It launches the test runner in interactive watch mode, allowing you to see the results of your tests in real time.
In order to skip all but one specific test file, like the one for the ChatInterface, the following command might be used: `npm run test -- -t "ChatInterface"`
### `npm run build`
@@ -31,14 +33,18 @@ The build is minified and the filenames include the hashes.\
Your app is ready to be deployed!
## Environment Variables
You can set the environment variables in `frontend/.env` to configure the frontend. The following variables are available:
You can set the environment variables in `frontend/.env` to configure the frontend.
The following variables are available:
```javascript
VITE_BACKEND_HOST="127.0.0.1:3000" // The host of the backend
VITE_USE_TLS="false" // Whether to use TLS for the backend(includes HTTPS and WSS)
VITE_USE_TLS="false" // Whether to use TLS for the backend (includes HTTPS and WSS)
VITE_INSECURE_SKIP_VERIFY="false" // Whether to skip verifying the backend's certificate. Only takes effect if `VITE_USE_TLS` is true. Don't use this in production!
VITE_FRONTEND_PORT="3001" // The port of the frontend
```
You can also set the environment variables from outside the project, like `exporter VITE_BACKEND_HOST="127.0.0.1:3000"`.
You can also set the environment variables from outside the project, like `export VITE_BACKEND_HOST="127.0.0.1:3000"`.
The outside environment variables will override the ones in the `.env` file.

View File

@@ -12,7 +12,7 @@
"@nextui-org/react": "^2.4.1",
"@react-types/shared": "^3.23.1",
"@reduxjs/toolkit": "^2.2.5",
"@vitejs/plugin-react": "^4.3.0",
"@vitejs/plugin-react": "^4.3.1",
"@xterm/addon-fit": "^0.10.0",
"@xterm/xterm": "^5.4.0",
"clsx": "^2.1.1",
@@ -21,7 +21,7 @@
"i18next": "^23.11.5",
"i18next-browser-languagedetector": "^8.0.0",
"i18next-http-backend": "^2.5.2",
"jose": "^5.3.0",
"jose": "^5.4.0",
"monaco-editor": "^0.49.0",
"react": "^18.3.1",
"react-dom": "^18.3.1",
@@ -34,21 +34,21 @@
"react-router-dom": "^6.23.1",
"react-syntax-highlighter": "^15.5.0",
"tailwind-merge": "^2.3.0",
"vite": "^5.2.12",
"vite": "^5.2.13",
"web-vitals": "^3.5.2"
},
"devDependencies": {
"@tailwindcss/typography": "^0.5.13",
"@testing-library/jest-dom": "^6.4.5",
"@testing-library/react": "^15.0.7",
"@testing-library/react": "^16.0.0",
"@testing-library/user-event": "^14.5.2",
"@types/node": "^20.14.0",
"@types/node": "^20.14.2",
"@types/react": "^18.3.3",
"@types/react-dom": "^18.3.0",
"@types/react-highlight": "^0.12.8",
"@types/react-syntax-highlighter": "^15.5.13",
"@typescript-eslint/eslint-plugin": "^7.11.0",
"@typescript-eslint/parser": "^7.11.0",
"@typescript-eslint/eslint-plugin": "^7.12.0",
"@typescript-eslint/parser": "^7.12.0",
"autoprefixer": "^10.4.19",
"eslint": "^8.57.0",
"eslint-config-airbnb": "^19.0.4",
@@ -63,8 +63,8 @@
"jsdom": "^24.1.0",
"lint-staged": "^15.2.5",
"postcss": "^8.4.38",
"prettier": "^3.3.0",
"tailwindcss": "^3.4.2",
"prettier": "^3.3.1",
"tailwindcss": "^3.4.4",
"typescript": "^5.4.5",
"vite-tsconfig-paths": "^4.3.2",
"vitest": "^1.6.0"
@@ -6004,6 +6004,7 @@
"resolved": "https://registry.npmjs.org/@testing-library/dom/-/dom-10.1.0.tgz",
"integrity": "sha512-wdsYKy5zupPyLCW2Je5DLHSxSfbIp6h80WoHOQc+RPtmPGA52O9x5MJEkv92Sjonpq+poOAtUKhh1kBGAXBrNA==",
"dev": true,
"peer": true,
"dependencies": {
"@babel/code-frame": "^7.10.4",
"@babel/runtime": "^7.12.5",
@@ -6083,26 +6084,29 @@
"dev": true
},
"node_modules/@testing-library/react": {
"version": "15.0.7",
"resolved": "https://registry.npmjs.org/@testing-library/react/-/react-15.0.7.tgz",
"integrity": "sha512-cg0RvEdD1TIhhkm1IeYMQxrzy0MtUNfa3minv4MjbgcYzJAZ7yD0i0lwoPOTPr+INtiXFezt2o8xMSnyHhEn2Q==",
"version": "16.0.0",
"resolved": "https://registry.npmjs.org/@testing-library/react/-/react-16.0.0.tgz",
"integrity": "sha512-guuxUKRWQ+FgNX0h0NS0FIq3Q3uLtWVpBzcLOggmfMoUpgBnzBzvLLd4fbm6yS8ydJd94cIfY4yP9qUQjM2KwQ==",
"dev": true,
"dependencies": {
"@babel/runtime": "^7.12.5",
"@testing-library/dom": "^10.0.0",
"@types/react-dom": "^18.0.0"
"@babel/runtime": "^7.12.5"
},
"engines": {
"node": ">=18"
},
"peerDependencies": {
"@testing-library/dom": "^10.0.0",
"@types/react": "^18.0.0",
"@types/react-dom": "^18.0.0",
"react": "^18.0.0",
"react-dom": "^18.0.0"
},
"peerDependenciesMeta": {
"@types/react": {
"optional": true
},
"@types/react-dom": {
"optional": true
}
}
},
@@ -6123,7 +6127,8 @@
"version": "5.0.4",
"resolved": "https://registry.npmjs.org/@types/aria-query/-/aria-query-5.0.4.tgz",
"integrity": "sha512-rfT93uj5s0PRL7EzccGMs3brplhcrghnDoV26NqKhCAS1hVo+WdNsPvE/yb6ilfr5hi2MEk6d5EWJTKdxg8jVw==",
"dev": true
"dev": true,
"peer": true
},
"node_modules/@types/babel__core": {
"version": "7.20.5",
@@ -6315,9 +6320,9 @@
"integrity": "sha512-nG96G3Wp6acyAgJqGasjODb+acrI7KltPiRxzHPXnP3NgI28bpQDRv53olbqGXbfcgF5aiiHmO3xpwEpS5Ld9g=="
},
"node_modules/@types/node": {
"version": "20.14.0",
"resolved": "https://registry.npmjs.org/@types/node/-/node-20.14.0.tgz",
"integrity": "sha512-5cHBxFGJx6L4s56Bubp4fglrEpmyJypsqI6RgzMfBHWUJQGWAAi8cWcgetEbZXHYXo9C2Fa4EEds/uSyS4cxmA==",
"version": "20.14.2",
"resolved": "https://registry.npmjs.org/@types/node/-/node-20.14.2.tgz",
"integrity": "sha512-xyu6WAMVwv6AKFLB+e/7ySZVr/0zLCzOa7rSpq6jNwpqOrUbcACDWC+53d4n2QHOnDou0fbIsg8wZu/sxrnI4Q==",
"devOptional": true,
"dependencies": {
"undici-types": "~5.26.4"
@@ -6402,16 +6407,16 @@
"peer": true
},
"node_modules/@typescript-eslint/eslint-plugin": {
"version": "7.11.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/eslint-plugin/-/eslint-plugin-7.11.0.tgz",
"integrity": "sha512-P+qEahbgeHW4JQ/87FuItjBj8O3MYv5gELDzr8QaQ7fsll1gSMTYb6j87MYyxwf3DtD7uGFB9ShwgmCJB5KmaQ==",
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/eslint-plugin/-/eslint-plugin-7.12.0.tgz",
"integrity": "sha512-7F91fcbuDf/d3S8o21+r3ZncGIke/+eWk0EpO21LXhDfLahriZF9CGj4fbAetEjlaBdjdSm9a6VeXbpbT6Z40Q==",
"dev": true,
"dependencies": {
"@eslint-community/regexpp": "^4.10.0",
"@typescript-eslint/scope-manager": "7.11.0",
"@typescript-eslint/type-utils": "7.11.0",
"@typescript-eslint/utils": "7.11.0",
"@typescript-eslint/visitor-keys": "7.11.0",
"@typescript-eslint/scope-manager": "7.12.0",
"@typescript-eslint/type-utils": "7.12.0",
"@typescript-eslint/utils": "7.12.0",
"@typescript-eslint/visitor-keys": "7.12.0",
"graphemer": "^1.4.0",
"ignore": "^5.3.1",
"natural-compare": "^1.4.0",
@@ -6434,16 +6439,63 @@
}
}
},
"node_modules/@typescript-eslint/parser": {
"version": "7.11.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/parser/-/parser-7.11.0.tgz",
"integrity": "sha512-yimw99teuaXVWsBcPO1Ais02kwJ1jmNA1KxE7ng0aT7ndr1pT1wqj0OJnsYVGKKlc4QJai86l/025L6z8CljOg==",
"node_modules/@typescript-eslint/eslint-plugin/node_modules/@typescript-eslint/scope-manager": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/scope-manager/-/scope-manager-7.12.0.tgz",
"integrity": "sha512-itF1pTnN6F3unPak+kutH9raIkL3lhH1YRPGgt7QQOh43DQKVJXmWkpb+vpc/TiDHs6RSd9CTbDsc/Y+Ygq7kg==",
"dev": true,
"dependencies": {
"@typescript-eslint/scope-manager": "7.11.0",
"@typescript-eslint/types": "7.11.0",
"@typescript-eslint/typescript-estree": "7.11.0",
"@typescript-eslint/visitor-keys": "7.11.0",
"@typescript-eslint/types": "7.12.0",
"@typescript-eslint/visitor-keys": "7.12.0"
},
"engines": {
"node": "^18.18.0 || >=20.0.0"
},
"funding": {
"type": "opencollective",
"url": "https://opencollective.com/typescript-eslint"
}
},
"node_modules/@typescript-eslint/eslint-plugin/node_modules/@typescript-eslint/types": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/types/-/types-7.12.0.tgz",
"integrity": "sha512-o+0Te6eWp2ppKY3mLCU+YA9pVJxhUJE15FV7kxuD9jgwIAa+w/ycGJBMrYDTpVGUM/tgpa9SeMOugSabWFq7bg==",
"dev": true,
"engines": {
"node": "^18.18.0 || >=20.0.0"
},
"funding": {
"type": "opencollective",
"url": "https://opencollective.com/typescript-eslint"
}
},
"node_modules/@typescript-eslint/eslint-plugin/node_modules/@typescript-eslint/visitor-keys": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/visitor-keys/-/visitor-keys-7.12.0.tgz",
"integrity": "sha512-uZk7DevrQLL3vSnfFl5bj4sL75qC9D6EdjemIdbtkuUmIheWpuiiylSY01JxJE7+zGrOWDZrp1WxOuDntvKrHQ==",
"dev": true,
"dependencies": {
"@typescript-eslint/types": "7.12.0",
"eslint-visitor-keys": "^3.4.3"
},
"engines": {
"node": "^18.18.0 || >=20.0.0"
},
"funding": {
"type": "opencollective",
"url": "https://opencollective.com/typescript-eslint"
}
},
"node_modules/@typescript-eslint/parser": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/parser/-/parser-7.12.0.tgz",
"integrity": "sha512-dm/J2UDY3oV3TKius2OUZIFHsomQmpHtsV0FTh1WO8EKgHLQ1QCADUqscPgTpU+ih1e21FQSRjXckHn3txn6kQ==",
"dev": true,
"dependencies": {
"@typescript-eslint/scope-manager": "7.12.0",
"@typescript-eslint/types": "7.12.0",
"@typescript-eslint/typescript-estree": "7.12.0",
"@typescript-eslint/visitor-keys": "7.12.0",
"debug": "^4.3.4"
},
"engines": {
@@ -6462,14 +6514,14 @@
}
}
},
"node_modules/@typescript-eslint/scope-manager": {
"version": "7.11.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/scope-manager/-/scope-manager-7.11.0.tgz",
"integrity": "sha512-27tGdVEiutD4POirLZX4YzT180vevUURJl4wJGmm6TrQoiYwuxTIY98PBp6L2oN+JQxzE0URvYlzJaBHIekXAw==",
"node_modules/@typescript-eslint/parser/node_modules/@typescript-eslint/scope-manager": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/scope-manager/-/scope-manager-7.12.0.tgz",
"integrity": "sha512-itF1pTnN6F3unPak+kutH9raIkL3lhH1YRPGgt7QQOh43DQKVJXmWkpb+vpc/TiDHs6RSd9CTbDsc/Y+Ygq7kg==",
"dev": true,
"dependencies": {
"@typescript-eslint/types": "7.11.0",
"@typescript-eslint/visitor-keys": "7.11.0"
"@typescript-eslint/types": "7.12.0",
"@typescript-eslint/visitor-keys": "7.12.0"
},
"engines": {
"node": "^18.18.0 || >=20.0.0"
@@ -6479,37 +6531,10 @@
"url": "https://opencollective.com/typescript-eslint"
}
},
"node_modules/@typescript-eslint/type-utils": {
"version": "7.11.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/type-utils/-/type-utils-7.11.0.tgz",
"integrity": "sha512-WmppUEgYy+y1NTseNMJ6mCFxt03/7jTOy08bcg7bxJJdsM4nuhnchyBbE8vryveaJUf62noH7LodPSo5Z0WUCg==",
"dev": true,
"dependencies": {
"@typescript-eslint/typescript-estree": "7.11.0",
"@typescript-eslint/utils": "7.11.0",
"debug": "^4.3.4",
"ts-api-utils": "^1.3.0"
},
"engines": {
"node": "^18.18.0 || >=20.0.0"
},
"funding": {
"type": "opencollective",
"url": "https://opencollective.com/typescript-eslint"
},
"peerDependencies": {
"eslint": "^8.56.0"
},
"peerDependenciesMeta": {
"typescript": {
"optional": true
}
}
},
"node_modules/@typescript-eslint/types": {
"version": "7.11.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/types/-/types-7.11.0.tgz",
"integrity": "sha512-MPEsDRZTyCiXkD4vd3zywDCifi7tatc4K37KqTprCvaXptP7Xlpdw0NR2hRJTetG5TxbWDB79Ys4kLmHliEo/w==",
"node_modules/@typescript-eslint/parser/node_modules/@typescript-eslint/types": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/types/-/types-7.12.0.tgz",
"integrity": "sha512-o+0Te6eWp2ppKY3mLCU+YA9pVJxhUJE15FV7kxuD9jgwIAa+w/ycGJBMrYDTpVGUM/tgpa9SeMOugSabWFq7bg==",
"dev": true,
"engines": {
"node": "^18.18.0 || >=20.0.0"
@@ -6519,14 +6544,14 @@
"url": "https://opencollective.com/typescript-eslint"
}
},
"node_modules/@typescript-eslint/typescript-estree": {
"version": "7.11.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/typescript-estree/-/typescript-estree-7.11.0.tgz",
"integrity": "sha512-cxkhZ2C/iyi3/6U9EPc5y+a6csqHItndvN/CzbNXTNrsC3/ASoYQZEt9uMaEp+xFNjasqQyszp5TumAVKKvJeQ==",
"node_modules/@typescript-eslint/parser/node_modules/@typescript-eslint/typescript-estree": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/typescript-estree/-/typescript-estree-7.12.0.tgz",
"integrity": "sha512-5bwqLsWBULv1h6pn7cMW5dXX/Y2amRqLaKqsASVwbBHMZSnHqE/HN4vT4fE0aFsiwxYvr98kqOWh1a8ZKXalCQ==",
"dev": true,
"dependencies": {
"@typescript-eslint/types": "7.11.0",
"@typescript-eslint/visitor-keys": "7.11.0",
"@typescript-eslint/types": "7.12.0",
"@typescript-eslint/visitor-keys": "7.12.0",
"debug": "^4.3.4",
"globby": "^11.1.0",
"is-glob": "^4.0.3",
@@ -6547,16 +6572,118 @@
}
}
},
"node_modules/@typescript-eslint/parser/node_modules/@typescript-eslint/visitor-keys": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/visitor-keys/-/visitor-keys-7.12.0.tgz",
"integrity": "sha512-uZk7DevrQLL3vSnfFl5bj4sL75qC9D6EdjemIdbtkuUmIheWpuiiylSY01JxJE7+zGrOWDZrp1WxOuDntvKrHQ==",
"dev": true,
"dependencies": {
"@typescript-eslint/types": "7.12.0",
"eslint-visitor-keys": "^3.4.3"
},
"engines": {
"node": "^18.18.0 || >=20.0.0"
},
"funding": {
"type": "opencollective",
"url": "https://opencollective.com/typescript-eslint"
}
},
"node_modules/@typescript-eslint/type-utils": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/type-utils/-/type-utils-7.12.0.tgz",
"integrity": "sha512-lib96tyRtMhLxwauDWUp/uW3FMhLA6D0rJ8T7HmH7x23Gk1Gwwu8UZ94NMXBvOELn6flSPiBrCKlehkiXyaqwA==",
"dev": true,
"dependencies": {
"@typescript-eslint/typescript-estree": "7.12.0",
"@typescript-eslint/utils": "7.12.0",
"debug": "^4.3.4",
"ts-api-utils": "^1.3.0"
},
"engines": {
"node": "^18.18.0 || >=20.0.0"
},
"funding": {
"type": "opencollective",
"url": "https://opencollective.com/typescript-eslint"
},
"peerDependencies": {
"eslint": "^8.56.0"
},
"peerDependenciesMeta": {
"typescript": {
"optional": true
}
}
},
"node_modules/@typescript-eslint/type-utils/node_modules/@typescript-eslint/types": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/types/-/types-7.12.0.tgz",
"integrity": "sha512-o+0Te6eWp2ppKY3mLCU+YA9pVJxhUJE15FV7kxuD9jgwIAa+w/ycGJBMrYDTpVGUM/tgpa9SeMOugSabWFq7bg==",
"dev": true,
"engines": {
"node": "^18.18.0 || >=20.0.0"
},
"funding": {
"type": "opencollective",
"url": "https://opencollective.com/typescript-eslint"
}
},
"node_modules/@typescript-eslint/type-utils/node_modules/@typescript-eslint/typescript-estree": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/typescript-estree/-/typescript-estree-7.12.0.tgz",
"integrity": "sha512-5bwqLsWBULv1h6pn7cMW5dXX/Y2amRqLaKqsASVwbBHMZSnHqE/HN4vT4fE0aFsiwxYvr98kqOWh1a8ZKXalCQ==",
"dev": true,
"dependencies": {
"@typescript-eslint/types": "7.12.0",
"@typescript-eslint/visitor-keys": "7.12.0",
"debug": "^4.3.4",
"globby": "^11.1.0",
"is-glob": "^4.0.3",
"minimatch": "^9.0.4",
"semver": "^7.6.0",
"ts-api-utils": "^1.3.0"
},
"engines": {
"node": "^18.18.0 || >=20.0.0"
},
"funding": {
"type": "opencollective",
"url": "https://opencollective.com/typescript-eslint"
},
"peerDependenciesMeta": {
"typescript": {
"optional": true
}
}
},
"node_modules/@typescript-eslint/type-utils/node_modules/@typescript-eslint/visitor-keys": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/visitor-keys/-/visitor-keys-7.12.0.tgz",
"integrity": "sha512-uZk7DevrQLL3vSnfFl5bj4sL75qC9D6EdjemIdbtkuUmIheWpuiiylSY01JxJE7+zGrOWDZrp1WxOuDntvKrHQ==",
"dev": true,
"dependencies": {
"@typescript-eslint/types": "7.12.0",
"eslint-visitor-keys": "^3.4.3"
},
"engines": {
"node": "^18.18.0 || >=20.0.0"
},
"funding": {
"type": "opencollective",
"url": "https://opencollective.com/typescript-eslint"
}
},
"node_modules/@typescript-eslint/utils": {
"version": "7.11.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/utils/-/utils-7.11.0.tgz",
"integrity": "sha512-xlAWwPleNRHwF37AhrZurOxA1wyXowW4PqVXZVUNCLjB48CqdPJoJWkrpH2nij9Q3Lb7rtWindtoXwxjxlKKCA==",
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/utils/-/utils-7.12.0.tgz",
"integrity": "sha512-Y6hhwxwDx41HNpjuYswYp6gDbkiZ8Hin9Bf5aJQn1bpTs3afYY4GX+MPYxma8jtoIV2GRwTM/UJm/2uGCVv+DQ==",
"dev": true,
"dependencies": {
"@eslint-community/eslint-utils": "^4.4.0",
"@typescript-eslint/scope-manager": "7.11.0",
"@typescript-eslint/types": "7.11.0",
"@typescript-eslint/typescript-estree": "7.11.0"
"@typescript-eslint/scope-manager": "7.12.0",
"@typescript-eslint/types": "7.12.0",
"@typescript-eslint/typescript-estree": "7.12.0"
},
"engines": {
"node": "^18.18.0 || >=20.0.0"
@@ -6569,13 +6696,71 @@
"eslint": "^8.56.0"
}
},
"node_modules/@typescript-eslint/visitor-keys": {
"version": "7.11.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/visitor-keys/-/visitor-keys-7.11.0.tgz",
"integrity": "sha512-7syYk4MzjxTEk0g/w3iqtgxnFQspDJfn6QKD36xMuuhTzjcxY7F8EmBLnALjVyaOF1/bVocu3bS/2/F7rXrveQ==",
"node_modules/@typescript-eslint/utils/node_modules/@typescript-eslint/scope-manager": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/scope-manager/-/scope-manager-7.12.0.tgz",
"integrity": "sha512-itF1pTnN6F3unPak+kutH9raIkL3lhH1YRPGgt7QQOh43DQKVJXmWkpb+vpc/TiDHs6RSd9CTbDsc/Y+Ygq7kg==",
"dev": true,
"dependencies": {
"@typescript-eslint/types": "7.11.0",
"@typescript-eslint/types": "7.12.0",
"@typescript-eslint/visitor-keys": "7.12.0"
},
"engines": {
"node": "^18.18.0 || >=20.0.0"
},
"funding": {
"type": "opencollective",
"url": "https://opencollective.com/typescript-eslint"
}
},
"node_modules/@typescript-eslint/utils/node_modules/@typescript-eslint/types": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/types/-/types-7.12.0.tgz",
"integrity": "sha512-o+0Te6eWp2ppKY3mLCU+YA9pVJxhUJE15FV7kxuD9jgwIAa+w/ycGJBMrYDTpVGUM/tgpa9SeMOugSabWFq7bg==",
"dev": true,
"engines": {
"node": "^18.18.0 || >=20.0.0"
},
"funding": {
"type": "opencollective",
"url": "https://opencollective.com/typescript-eslint"
}
},
"node_modules/@typescript-eslint/utils/node_modules/@typescript-eslint/typescript-estree": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/typescript-estree/-/typescript-estree-7.12.0.tgz",
"integrity": "sha512-5bwqLsWBULv1h6pn7cMW5dXX/Y2amRqLaKqsASVwbBHMZSnHqE/HN4vT4fE0aFsiwxYvr98kqOWh1a8ZKXalCQ==",
"dev": true,
"dependencies": {
"@typescript-eslint/types": "7.12.0",
"@typescript-eslint/visitor-keys": "7.12.0",
"debug": "^4.3.4",
"globby": "^11.1.0",
"is-glob": "^4.0.3",
"minimatch": "^9.0.4",
"semver": "^7.6.0",
"ts-api-utils": "^1.3.0"
},
"engines": {
"node": "^18.18.0 || >=20.0.0"
},
"funding": {
"type": "opencollective",
"url": "https://opencollective.com/typescript-eslint"
},
"peerDependenciesMeta": {
"typescript": {
"optional": true
}
}
},
"node_modules/@typescript-eslint/utils/node_modules/@typescript-eslint/visitor-keys": {
"version": "7.12.0",
"resolved": "https://registry.npmjs.org/@typescript-eslint/visitor-keys/-/visitor-keys-7.12.0.tgz",
"integrity": "sha512-uZk7DevrQLL3vSnfFl5bj4sL75qC9D6EdjemIdbtkuUmIheWpuiiylSY01JxJE7+zGrOWDZrp1WxOuDntvKrHQ==",
"dev": true,
"dependencies": {
"@typescript-eslint/types": "7.12.0",
"eslint-visitor-keys": "^3.4.3"
},
"engines": {
@@ -6592,9 +6777,9 @@
"integrity": "sha512-zuVdFrMJiuCDQUMCzQaD6KL28MjnqqN8XnAqiEq9PNm/hCPTSGfrXCOfwj1ow4LFb/tNymJPwsNbVePc1xFqrQ=="
},
"node_modules/@vitejs/plugin-react": {
"version": "4.3.0",
"resolved": "https://registry.npmjs.org/@vitejs/plugin-react/-/plugin-react-4.3.0.tgz",
"integrity": "sha512-KcEbMsn4Dpk+LIbHMj7gDPRKaTMStxxWRkRmxsg/jVdFdJCZWt1SchZcf0M4t8lIKdwwMsEyzhrcOXRrDPtOBw==",
"version": "4.3.1",
"resolved": "https://registry.npmjs.org/@vitejs/plugin-react/-/plugin-react-4.3.1.tgz",
"integrity": "sha512-m/V2syj5CuVnaxcUJOQRel/Wr31FFXRFlnOoq1TVtkCxsY5veGMTEmpWHndrhB2U8ScHtCQB1e+4hWYExQc6Lg==",
"dependencies": {
"@babel/core": "^7.24.5",
"@babel/plugin-transform-react-jsx-self": "^7.24.5",
@@ -8223,7 +8408,8 @@
"version": "0.5.16",
"resolved": "https://registry.npmjs.org/dom-accessibility-api/-/dom-accessibility-api-0.5.16.tgz",
"integrity": "sha512-X7BJ2yElsnOJ30pZF4uIIDfBEVgF4XEBxL9Bxhy6dnrm5hkzqmsWHGTiHqRiITNhMyFLyAiWndIJP7Z1NTteDg==",
"dev": true
"dev": true,
"peer": true
},
"node_modules/eastasianwidth": {
"version": "0.2.0",
@@ -11712,9 +11898,9 @@
}
},
"node_modules/jose": {
"version": "5.3.0",
"resolved": "https://registry.npmjs.org/jose/-/jose-5.3.0.tgz",
"integrity": "sha512-IChe9AtAE79ru084ow8jzkN2lNrG3Ntfiv65Cvj9uOCE2m5LNsdHG+9EbxWxAoWRF9TgDOqLN5jm08++owDVRg==",
"version": "5.4.0",
"resolved": "https://registry.npmjs.org/jose/-/jose-5.4.0.tgz",
"integrity": "sha512-6rpxTHPAQyWMb9A35BroFl1Sp0ST3DpPcm5EVIxZxdH+e0Hv9fwhyB3XLKFUcHNpdSDnETmBfuPPTTlYz5+USw==",
"funding": {
"url": "https://github.com/sponsors/panva"
}
@@ -12360,6 +12546,7 @@
"resolved": "https://registry.npmjs.org/lz-string/-/lz-string-1.5.0.tgz",
"integrity": "sha512-h5bgJWpxJNswbU7qCrV0tIKQCaS3blPDrqKWx+QxzuzL1zGUzij9XCWLrSLsJPu5t+eWA/ycetzYAO5IOMcWAQ==",
"dev": true,
"peer": true,
"bin": {
"lz-string": "bin/bin.js"
}
@@ -13975,9 +14162,9 @@
}
},
"node_modules/prettier": {
"version": "3.3.0",
"resolved": "https://registry.npmjs.org/prettier/-/prettier-3.3.0.tgz",
"integrity": "sha512-J9odKxERhCQ10OC2yb93583f6UnYutOeiV5i0zEDS7UGTdUt0u+y8erxl3lBKvwo/JHyyoEdXjwp4dke9oyZ/g==",
"version": "3.3.1",
"resolved": "https://registry.npmjs.org/prettier/-/prettier-3.3.1.tgz",
"integrity": "sha512-7CAwy5dRsxs8PHXT3twixW9/OEll8MLE0VRPCJyl7CkS6VHGPSlsVaWTiASPTyGyYRyApxlaWTzwUxVNrhcwDg==",
"dev": true,
"bin": {
"prettier": "bin/prettier.cjs"
@@ -14006,6 +14193,7 @@
"resolved": "https://registry.npmjs.org/pretty-format/-/pretty-format-27.5.1.tgz",
"integrity": "sha512-Qb1gy5OrP5+zDf2Bvnzdl3jsTf1qXVMazbvCoKhtKqVs4/YK4ozX4gKQJJVyNe+cajNPn0KoC0MC3FUmaHWEmQ==",
"dev": true,
"peer": true,
"dependencies": {
"ansi-regex": "^5.0.1",
"ansi-styles": "^5.0.0",
@@ -14020,6 +14208,7 @@
"resolved": "https://registry.npmjs.org/ansi-styles/-/ansi-styles-5.2.0.tgz",
"integrity": "sha512-Cxwpt2SfTzTtXcfOlzGEee8O+c+MmUgGrNiBcXnuWxuFJHe6a5Hz7qwhwe5OgaSYI0IJvkLqWX1ASG+cJOkEiA==",
"dev": true,
"peer": true,
"engines": {
"node": ">=10"
},
@@ -14216,7 +14405,8 @@
"version": "17.0.2",
"resolved": "https://registry.npmjs.org/react-is/-/react-is-17.0.2.tgz",
"integrity": "sha512-w2GsyukL62IJnlaff/nRegPQR94C/XXamvMWmSHRJ4y7Ts/4ocGRmTHvOs8PSE6pB3dWOrD/nueuU5sduBsQ4w==",
"dev": true
"dev": true,
"peer": true
},
"node_modules/react-markdown": {
"version": "9.0.1",
@@ -15483,9 +15673,9 @@
}
},
"node_modules/tailwindcss": {
"version": "3.4.3",
"resolved": "https://registry.npmjs.org/tailwindcss/-/tailwindcss-3.4.3.tgz",
"integrity": "sha512-U7sxQk/n397Bmx4JHbJx/iSOOv5G+II3f1kpLpY2QeUv5DcPdcTsYLlusZfq1NthHS1c1cZoyFmmkex1rzke0A==",
"version": "3.4.4",
"resolved": "https://registry.npmjs.org/tailwindcss/-/tailwindcss-3.4.4.tgz",
"integrity": "sha512-ZoyXOdJjISB7/BcLTR6SEsLgKtDStYyYZVLsUtWChO4Ps20CBad7lfJKVDiejocV4ME1hLmyY0WJE3hSDcmQ2A==",
"dependencies": {
"@alloc/quick-lru": "^5.2.0",
"arg": "^5.0.2",
@@ -16243,9 +16433,9 @@
"integrity": "sha512-dqId9J8K/vGi5Zr7oo212BGii5m3q5Hxlkwy3WpYuKPklmBEvsbMYYyLxAQpSffdLl/gdW0XUpKWFvYmyoWCoQ=="
},
"node_modules/vite": {
"version": "5.2.12",
"resolved": "https://registry.npmjs.org/vite/-/vite-5.2.12.tgz",
"integrity": "sha512-/gC8GxzxMK5ntBwb48pR32GGhENnjtY30G4A0jemunsBkiEZFw60s8InGpN8gkhHEkjnRK1aSAxeQgwvFhUHAA==",
"version": "5.2.13",
"resolved": "https://registry.npmjs.org/vite/-/vite-5.2.13.tgz",
"integrity": "sha512-SSq1noJfY9pR3I1TUENL3rQYDQCFqgD+lM6fTRAM8Nv6Lsg5hDLaXkjETVeBt+7vZBCMoibD+6IWnT2mJ+Zb/A==",
"dependencies": {
"esbuild": "^0.20.1",
"postcss": "^8.4.38",

View File

@@ -11,7 +11,7 @@
"@nextui-org/react": "^2.4.1",
"@react-types/shared": "^3.23.1",
"@reduxjs/toolkit": "^2.2.5",
"@vitejs/plugin-react": "^4.3.0",
"@vitejs/plugin-react": "^4.3.1",
"@xterm/addon-fit": "^0.10.0",
"@xterm/xterm": "^5.4.0",
"clsx": "^2.1.1",
@@ -20,7 +20,7 @@
"i18next": "^23.11.5",
"i18next-browser-languagedetector": "^8.0.0",
"i18next-http-backend": "^2.5.2",
"jose": "^5.3.0",
"jose": "^5.4.0",
"monaco-editor": "^0.49.0",
"react": "^18.3.1",
"react-dom": "^18.3.1",
@@ -33,7 +33,7 @@
"react-router-dom": "^6.23.1",
"react-syntax-highlighter": "^15.5.0",
"tailwind-merge": "^2.3.0",
"vite": "^5.2.12",
"vite": "^5.2.13",
"web-vitals": "^3.5.2"
},
"scripts": {
@@ -61,15 +61,15 @@
"devDependencies": {
"@tailwindcss/typography": "^0.5.13",
"@testing-library/jest-dom": "^6.4.5",
"@testing-library/react": "^15.0.7",
"@testing-library/react": "^16.0.0",
"@testing-library/user-event": "^14.5.2",
"@types/node": "^20.14.0",
"@types/node": "^20.14.2",
"@types/react": "^18.3.3",
"@types/react-dom": "^18.3.0",
"@types/react-highlight": "^0.12.8",
"@types/react-syntax-highlighter": "^15.5.13",
"@typescript-eslint/eslint-plugin": "^7.11.0",
"@typescript-eslint/parser": "^7.11.0",
"@typescript-eslint/eslint-plugin": "^7.12.0",
"@typescript-eslint/parser": "^7.12.0",
"autoprefixer": "^10.4.19",
"eslint": "^8.57.0",
"eslint-config-airbnb": "^19.0.4",
@@ -84,8 +84,8 @@
"jsdom": "^24.1.0",
"lint-staged": "^15.2.5",
"postcss": "^8.4.38",
"prettier": "^3.3.0",
"tailwindcss": "^3.4.2",
"prettier": "^3.3.1",
"tailwindcss": "^3.4.4",
"typescript": "^5.4.5",
"vite-tsconfig-paths": "^4.3.2",
"vitest": "^1.6.0"

View File

@@ -98,7 +98,7 @@ function Workspace() {
// Only need to show the tab only when a cell is added
setChanges((prev) => ({ ...prev, [TabOption.JUPYTER]: true }));
}
}, [jupyterCells]);
}, [activeTab, jupyterCells]);
return (
<div className="flex flex-col min-h-0 grow">

View File

@@ -1,15 +1,15 @@
import React from "react";
import userEvent from "@testing-library/user-event";
import { act, render } from "@testing-library/react";
import { act, render, fireEvent } from "@testing-library/react";
import ChatInput from "./ChatInput";
describe("ChatInput", () => {
const onSendMessage = vi.fn();
afterEach(() => {
vi.clearAllMocks();
});
const onSendMessage = vi.fn();
it("should render a textarea", () => {
const { getByRole } = render(<ChatInput onSendMessage={onSendMessage} />);
const textarea = getByRole("textbox");
@@ -47,26 +47,29 @@ describe("ChatInput", () => {
expect(button).toBeInTheDocument();
});
it("should call sendChatMessage with the input when the send button is clicked", () => {
it("should call sendChatMessage with the input when the send button is clicked", async () => {
const { getByRole } = render(<ChatInput onSendMessage={onSendMessage} />);
const textarea = getByRole("textbox");
const button = getByRole("button");
act(() => {
userEvent.type(textarea, "Hello, world!");
userEvent.click(button);
fireEvent.change(textarea, { target: { value: "Hello, world!" } });
await act(async () => {
await userEvent.click(button);
});
expect(onSendMessage).toHaveBeenCalledWith("Hello, world!");
// Additionally, check if the callback is called exactly once
expect(onSendMessage).toHaveBeenCalledTimes(1);
});
it("should be able to send a message when the enter key is pressed", () => {
const { getByRole } = render(<ChatInput onSendMessage={onSendMessage} />);
const textarea = getByRole("textbox");
act(() => {
userEvent.type(textarea, "Hello, world!{enter}");
});
fireEvent.change(textarea, { target: { value: "Hello, world!" } });
fireEvent.keyDown(textarea, { key: "Enter", code: "Enter", charCode: 13 });
expect(onSendMessage).toHaveBeenCalledWith("Hello, world!");
});
@@ -100,28 +103,18 @@ describe("ChatInput", () => {
expect(onSendMessage).not.toHaveBeenCalled();
});
it("should clear the input message after sending a message", () => {
it("should clear the input message after sending a message", async () => {
const { getByRole } = render(<ChatInput onSendMessage={onSendMessage} />);
const textarea = getByRole("textbox");
const button = getByRole("button");
act(() => {
userEvent.type(textarea, "Hello, world!");
});
fireEvent.change(textarea, { target: { value: "Hello, world!" } });
expect(textarea).toHaveValue("Hello, world!");
act(() => {
userEvent.click(button);
});
fireEvent.click(button);
expect(textarea).toHaveValue("");
act(() => {
userEvent.type(textarea, "Hello, world!{enter}");
});
expect(textarea).toHaveValue(""); // no new line
});
// this is already implemented but need to figure out how to test it

View File

@@ -1,7 +1,6 @@
import React from "react";
import { screen } from "@testing-library/react";
import { screen, act, fireEvent } from "@testing-library/react";
import { describe, expect, it } from "vitest";
import { act } from "react-dom/test-utils";
import userEvent from "@testing-library/user-event";
import { renderWithProviders } from "test-utils";
import ChatInterface from "./ChatInterface";
@@ -23,12 +22,16 @@ vi.spyOn(Session, "isConnected").mockImplementation(() => true);
HTMLElement.prototype.scrollTo = vi.fn(() => {});
describe("ChatInterface", () => {
afterEach(() => {
sessionSpy.mockClear();
});
it("should render empty message list and input", () => {
renderWithProviders(<ChatInterface />);
expect(screen.queryAllByTestId("message")).toHaveLength(0);
});
it("should render the new message the user has typed", async () => {
it("should render the new message the user has typed", () => {
renderWithProviders(<ChatInterface />, {
preloadedState: {
agent: {
@@ -38,12 +41,8 @@ describe("ChatInterface", () => {
});
const input = screen.getByRole("textbox");
act(() => {
userEvent.type(input, "my message{enter}");
});
expect(screen.getByText("my message")).toBeInTheDocument();
fireEvent.change(input, { target: { value: "my message" } });
expect(input).toHaveValue("my message");
});
it("should render user and assistant messages", () => {
@@ -66,7 +65,7 @@ describe("ChatInterface", () => {
expect(screen.getByText("Hello to you!")).toBeInTheDocument();
});
it("should send the a start event to the Session", () => {
it("should send a start event to the Session", () => {
renderWithProviders(<ChatInterface />, {
preloadedState: {
agent: {
@@ -76,9 +75,8 @@ describe("ChatInterface", () => {
});
const input = screen.getByRole("textbox");
act(() => {
userEvent.type(input, "my message{enter}");
});
fireEvent.change(input, { target: { value: "my message" } });
fireEvent.keyDown(input, { key: "Enter", code: "Enter", charCode: 13 });
const event = {
action: ActionType.MESSAGE,
@@ -87,7 +85,7 @@ describe("ChatInterface", () => {
expect(sessionSpy).toHaveBeenCalledWith(JSON.stringify(event));
});
it("should send the a user message event to the Session", () => {
it("should send a user message event to the Session", async () => {
renderWithProviders(<ChatInterface />, {
preloadedState: {
agent: {
@@ -97,9 +95,7 @@ describe("ChatInterface", () => {
});
const input = screen.getByRole("textbox");
act(() => {
userEvent.type(input, "my message{enter}");
});
await userEvent.type(input, "my message{enter}");
const event = {
action: ActionType.MESSAGE,

View File

@@ -3,7 +3,6 @@ import { useDispatch, useSelector } from "react-redux";
import { IoMdChatbubbles } from "react-icons/io";
import { RiArrowRightDoubleLine } from "react-icons/ri";
import { useTranslation } from "react-i18next";
import { twMerge } from "tailwind-merge";
import { VscArrowDown } from "react-icons/vsc";
import { FaRegThumbsDown, FaRegThumbsUp } from "react-icons/fa";
import { useDisclosure } from "@nextui-org/react";
@@ -125,14 +124,6 @@ function ChatInterface() {
>
<Chat messages={messages} />
</div>
{/* Fade between messages and input */}
<div
className={twMerge(
"absolute bottom-0 left-0 right-0",
curAgentState === AgentState.AWAITING_USER_INPUT ? "h-10" : "h-4",
"bg-gradient-to-b from-transparent to-neutral-800",
)}
/>
</div>
<div className="relative">

View File

@@ -1,7 +1,6 @@
import React from "react";
import { waitFor, screen } from "@testing-library/react";
import { waitFor, act } from "@testing-library/react";
import userEvent from "@testing-library/user-event";
import { act } from "react-dom/test-utils";
import { renderWithProviders } from "test-utils";
import { describe, it, expect, vi, Mock } from "vitest";
import FileExplorer from "./FileExplorer";
@@ -43,7 +42,7 @@ describe("FileExplorer", () => {
it.todo("should render an empty workspace");
it.only("should refetch the workspace when clicking the refresh button", async () => {
const { getByText } = renderWithProviders(<FileExplorer />, {
const { getByText, getByTestId } = renderWithProviders(<FileExplorer />, {
preloadedState: {
agent: {
curAgentState: AgentState.RUNNING,
@@ -57,11 +56,13 @@ describe("FileExplorer", () => {
expect(listFiles).toHaveBeenCalledTimes(2); // once for root, once for folder 1
// The 'await' keyword is required here to avoid a warning during test runs
await act(() => {
userEvent.click(screen.getByTestId("refresh"));
await act(async () => {
await userEvent.click(getByTestId("refresh"));
});
expect(listFiles).toHaveBeenCalledTimes(4); // 2 from initial render, 2 from refresh button
await waitFor(() => {
expect(listFiles).toHaveBeenCalledTimes(4); // 2 from initial render, 2 from refresh button
});
});
it("should toggle the explorer visibility when clicking the close button", async () => {

View File

@@ -167,17 +167,19 @@ function FileExplorer() {
isHidden ? "min-w-[48px]" : "min-w-[228px]",
)}
>
<div className="flex p-2 items-center justify-between relative">
<div className="flex flex-col p-2 relative">
<div className="flex items-center justify-end mb-8">
<ExplorerActions
isHidden={isHidden}
toggleHidden={() => setIsHidden((prev) => !prev)}
onRefresh={refreshWorkspace}
onUpload={selectFileInput}
/>
</div>
<div style={{ display: isHidden ? "none" : "block" }}>
<ExplorerTree files={files} defaultOpen />
</div>
<ExplorerActions
isHidden={isHidden}
toggleHidden={() => setIsHidden((prev) => !prev)}
onRefresh={refreshWorkspace}
onUpload={selectFileInput}
/>
</div>
<input
data-testid="file-input"

View File

@@ -1,7 +1,6 @@
import React from "react";
import { render, screen } from "@testing-library/react";
import { render, screen, act } from "@testing-library/react";
import userEvent from "@testing-library/user-event";
import { act } from "react-dom/test-utils";
import BaseModal from "./BaseModal";
describe("BaseModal", () => {
@@ -27,7 +26,7 @@ describe("BaseModal", () => {
expect(screen.getByText("Subtitle")).toBeInTheDocument();
});
it("should render actions", () => {
it("should render actions", async () => {
const onPrimaryClickMock = vi.fn();
const onSecondaryClickMock = vi.fn();
@@ -53,18 +52,18 @@ describe("BaseModal", () => {
expect(screen.getByText("Save")).toBeInTheDocument();
expect(screen.getByText("Cancel")).toBeInTheDocument();
act(() => {
userEvent.click(screen.getByText("Save"));
await act(async () => {
await userEvent.click(screen.getByText("Save"));
});
expect(onPrimaryClickMock).toHaveBeenCalledTimes(1);
act(() => {
userEvent.click(screen.getByText("Cancel"));
await act(async () => {
await userEvent.click(screen.getByText("Cancel"));
});
expect(onSecondaryClickMock).toHaveBeenCalledTimes(1);
});
it("should close the modal after an action is performed", () => {
it("should close the modal after an action is performed", async () => {
const onOpenChangeMock = vi.fn();
render(
<BaseModal
@@ -81,8 +80,8 @@ describe("BaseModal", () => {
/>,
);
act(() => {
userEvent.click(screen.getByText("Save"));
await act(async () => {
await userEvent.click(screen.getByText("Save"));
});
expect(onOpenChangeMock).toHaveBeenCalledTimes(1);
});

View File

@@ -39,7 +39,7 @@ function BaseModal({
size="sm"
className="bg-neutral-900 rounded-lg"
>
<ModalContent className="max-w-[24rem] p-[40px]">
<ModalContent className="max-w-[30rem] p-[40px]">
{(closeModal) => (
<>
<ModalHeader className="flex flex-col p-0">

View File

@@ -32,7 +32,7 @@ describe("LoadPreviousSession", () => {
screen.getByRole("button", { name: RESUME_SESSION_BUTTON_LABEL_KEY });
});
it("should clear messages if user chooses to start a new session", () => {
it("should clear messages if user chooses to start a new session", async () => {
const onOpenChangeMock = vi.fn();
render(<LoadPreviousSessionModal isOpen onOpenChange={onOpenChangeMock} />);
@@ -40,8 +40,8 @@ describe("LoadPreviousSession", () => {
name: START_NEW_SESSION_BUTTON_LABEL_KEY,
});
act(() => {
userEvent.click(startNewSessionButton);
await act(async () => {
await userEvent.click(startNewSessionButton);
});
// modal should close right after clearing messages

View File

@@ -29,7 +29,7 @@ describe("AutocompleteCombobox", () => {
expect(modelInput).toHaveValue("model1");
});
it("should open a dropdown with the available values", () => {
it("should open a dropdown with the available values", async () => {
renderComponent();
const modelInput = screen.getByRole("combobox", { name: "model" });
@@ -37,27 +37,27 @@ describe("AutocompleteCombobox", () => {
expect(screen.queryByText("model2")).not.toBeInTheDocument();
expect(screen.queryByText("model3")).not.toBeInTheDocument();
act(() => {
userEvent.click(modelInput);
await act(async () => {
await userEvent.click(modelInput);
});
expect(screen.getByText("model2")).toBeInTheDocument();
expect(screen.getByText("model3")).toBeInTheDocument();
});
it("should call the onChange handler when a new value is selected", () => {
it("should call the onChange handler when a new value is selected", async () => {
renderComponent();
const modelInput = screen.getByRole("combobox", { name: "model" });
expect(modelInput).toHaveValue("model1");
act(() => {
userEvent.click(modelInput);
await act(async () => {
await userEvent.click(modelInput);
});
const model2 = screen.getByText("model2");
act(() => {
userEvent.click(model2);
await act(async () => {
await userEvent.click(model2);
});
expect(onChangeMock).toHaveBeenCalledWith("model2");

View File

@@ -92,60 +92,60 @@ describe("SettingsForm", () => {
});
describe("onChange handlers", () => {
it("should call the onModelChange handler when the model changes", () => {
it("should call the onModelChange handler when the model changes", async () => {
renderSettingsForm();
const modelInput = screen.getByRole("combobox", { name: "model" });
act(() => {
userEvent.click(modelInput);
await act(async () => {
await userEvent.click(modelInput);
});
const model3 = screen.getByText("model3");
act(() => {
userEvent.click(model3);
await act(async () => {
await userEvent.click(model3);
});
expect(onModelChangeMock).toHaveBeenCalledWith("model3");
});
it("should call the onAgentChange handler when the agent changes", () => {
it("should call the onAgentChange handler when the agent changes", async () => {
renderSettingsForm();
const agentInput = screen.getByRole("combobox", { name: "agent" });
act(() => {
userEvent.click(agentInput);
await act(async () => {
await userEvent.click(agentInput);
});
const agent3 = screen.getByText("agent3");
act(() => {
userEvent.click(agent3);
await act(async () => {
await userEvent.click(agent3);
});
expect(onAgentChangeMock).toHaveBeenCalledWith("agent3");
});
it("should call the onLanguageChange handler when the language changes", () => {
it("should call the onLanguageChange handler when the language changes", async () => {
renderSettingsForm();
const languageInput = screen.getByRole("combobox", { name: "language" });
act(() => {
userEvent.click(languageInput);
await act(async () => {
await userEvent.click(languageInput);
});
const french = screen.getByText("Français");
act(() => {
userEvent.click(french);
await act(async () => {
await userEvent.click(french);
});
expect(onLanguageChangeMock).toHaveBeenCalledWith("Français");
});
it("should call the onAPIKeyChange handler when the API key changes", () => {
it("should call the onAPIKeyChange handler when the API key changes", async () => {
renderSettingsForm();
const apiKeyInput = screen.getByTestId("apikey");
act(() => {
userEvent.type(apiKeyInput, "x");
await act(async () => {
await userEvent.type(apiKeyInput, "x");
});
expect(onAPIKeyChangeMock).toHaveBeenCalledWith("sk-...x");

View File

@@ -1,4 +1,4 @@
import { act, screen, waitFor } from "@testing-library/react";
import { screen, act, waitFor } from "@testing-library/react";
import userEvent from "@testing-library/user-event";
import i18next from "i18next";
import React from "react";
@@ -26,6 +26,7 @@ vi.mock("#/services/settings", async (importOriginal) => ({
LLM_MODEL: "gpt-4o",
AGENT: "MonologueAgent",
LANGUAGE: "en",
LLM_API_KEY: "sk-...",
}),
getDefaultSettings: vi.fn().mockReturnValue({
LLM_MODEL: "gpt-4o",
@@ -47,6 +48,14 @@ vi.mock("#/services/options", async (importOriginal) => ({
.mockResolvedValue(Promise.resolve(["agent1", "agent2", "agent3"])),
}));
// Helper function to assert that fetchModels was called
async function assertModelsAndAgentsFetched() {
await waitFor(() => {
expect(fetchAgents).toHaveBeenCalledTimes(1);
expect(fetchModels).toHaveBeenCalledTimes(1);
});
}
describe("SettingsModal", () => {
afterEach(() => {
vi.clearAllMocks();
@@ -55,10 +64,7 @@ describe("SettingsModal", () => {
it("should fetch existing agents and models from the API", async () => {
renderWithProviders(<SettingsModal isOpen onOpenChange={vi.fn()} />);
await waitFor(() => {
expect(fetchModels).toHaveBeenCalledTimes(1);
expect(fetchAgents).toHaveBeenCalledTimes(1);
});
assertModelsAndAgentsFetched();
});
it("should close the modal when the close button is clicked", async () => {
@@ -71,8 +77,8 @@ describe("SettingsModal", () => {
name: /MODAL_CLOSE_BUTTON_LABEL/i, // i18n key
});
act(() => {
userEvent.click(cancelButton);
await act(async () => {
await userEvent.click(cancelButton);
});
expect(onOpenChange).toHaveBeenCalledWith(false);
@@ -98,7 +104,7 @@ describe("SettingsModal", () => {
describe("onHandleSave", () => {
const initialSettings: Settings = {
LLM_MODEL: "gpt-4o",
AGENT: "MonologueAgent",
AGENT: "CodeActAgent",
LANGUAGE: "en",
LLM_API_KEY: "sk-...",
};
@@ -111,27 +117,29 @@ describe("SettingsModal", () => {
),
);
// Use the helper function to assert models were fetched
await assertModelsAndAgentsFetched();
const saveButton = screen.getByRole("button", { name: /save/i });
const modelInput = screen.getByRole("combobox", { name: "model" });
act(() => {
userEvent.click(modelInput);
await act(async () => {
await userEvent.click(modelInput);
});
const model3 = screen.getByText("model3");
act(() => {
userEvent.click(model3);
await act(async () => {
await userEvent.click(model3);
});
act(() => {
userEvent.click(saveButton);
await act(async () => {
await userEvent.click(saveButton);
});
expect(saveSettings).toHaveBeenCalledWith({
...initialSettings,
LLM_MODEL: "model3",
LLM_API_KEY: "", // reset after model change
});
});
@@ -146,18 +154,18 @@ describe("SettingsModal", () => {
const saveButton = screen.getByRole("button", { name: /save/i });
const modelInput = screen.getByRole("combobox", { name: "model" });
act(() => {
userEvent.click(modelInput);
await act(async () => {
await userEvent.click(modelInput);
});
const model3 = screen.getByText("model3");
act(() => {
userEvent.click(model3);
await act(async () => {
await userEvent.click(model3);
});
act(() => {
userEvent.click(saveButton);
await act(async () => {
await userEvent.click(saveButton);
});
expect(startNewSessionSpy).toHaveBeenCalled();
@@ -174,21 +182,21 @@ describe("SettingsModal", () => {
const saveButton = screen.getByRole("button", { name: /save/i });
const modelInput = screen.getByRole("combobox", { name: "model" });
act(() => {
userEvent.click(modelInput);
await act(async () => {
await userEvent.click(modelInput);
});
const model3 = screen.getByText("model3");
act(() => {
userEvent.click(model3);
await act(async () => {
await userEvent.click(model3);
});
act(() => {
userEvent.click(saveButton);
await act(async () => {
await userEvent.click(saveButton);
});
expect(toastSpy).toHaveBeenCalledTimes(2);
expect(toastSpy).toHaveBeenCalledTimes(3);
});
it("should change the language", async () => {
@@ -202,18 +210,18 @@ describe("SettingsModal", () => {
const saveButton = screen.getByRole("button", { name: /save/i });
const languageInput = screen.getByRole("combobox", { name: "language" });
act(() => {
userEvent.click(languageInput);
await act(async () => {
await userEvent.click(languageInput);
});
const spanish = screen.getByText("Español");
act(() => {
userEvent.click(spanish);
await act(async () => {
await userEvent.click(spanish);
});
act(() => {
userEvent.click(saveButton);
await act(async () => {
await userEvent.click(saveButton);
});
expect(i18nSpy).toHaveBeenCalledWith("es");
@@ -227,21 +235,25 @@ describe("SettingsModal", () => {
),
);
await waitFor(() => {
expect(fetchModels).toHaveBeenCalledTimes(1);
});
const saveButton = screen.getByRole("button", { name: /save/i });
const modelInput = screen.getByRole("combobox", { name: "model" });
act(() => {
userEvent.click(modelInput);
await act(async () => {
await userEvent.click(modelInput);
});
const model3 = screen.getByText("model3");
act(() => {
userEvent.click(model3);
await act(async () => {
await userEvent.click(model3);
});
act(() => {
userEvent.click(saveButton);
await act(async () => {
await userEvent.click(saveButton);
});
expect(onOpenChangeMock).toHaveBeenCalledWith(false);
@@ -261,17 +273,17 @@ describe("SettingsModal", () => {
});
const agentInput = screen.getByRole("combobox", { name: "agent" });
act(() => {
userEvent.click(agentInput);
await act(async () => {
await userEvent.click(agentInput);
});
const agent3 = screen.getByText("agent3");
act(() => {
userEvent.click(agent3);
await act(async () => {
await userEvent.click(agent3);
});
expect(agentInput).toHaveValue("agent3");
act(() => {
userEvent.click(resetButton);
await act(async () => {
await userEvent.click(resetButton);
});
expect(getDefaultSettings).toHaveBeenCalled();

View File

@@ -66,12 +66,9 @@ function SettingsModal({ isOpen, onOpenChange }: SettingsProps) {
}, []);
const handleModelChange = (model: string) => {
// Needs to also reset the API key.
const key = localStorage.getItem(`API_KEY_${model}`);
setSettings((prev) => ({
...prev,
LLM_MODEL: model,
LLM_API_KEY: key || "",
}));
};

View File

@@ -1,4 +1,3 @@
import { setScreenshotSrc, setUrl } from "#/state/browserSlice";
import { addAssistantMessage, addUserMessage } from "#/state/chatSlice";
import { setCode, setActiveFilepath } from "#/state/codeSlice";
import { appendInput } from "#/state/commandSlice";
@@ -13,18 +12,14 @@ import { getRootTask } from "./taskService";
const messageActions = {
[ActionType.BROWSE]: (message: ActionMessage) => {
const { url, screenshotSrc } = message.args;
store.dispatch(setUrl(url));
store.dispatch(setScreenshotSrc(screenshotSrc));
store.dispatch(addAssistantMessage(message.message));
},
[ActionType.BROWSE_INTERACTIVE]: (message: ActionMessage) => {
if (message.args.thought) {
store.dispatch(addAssistantMessage(message.args.thought));
} else {
store.dispatch(addAssistantMessage(message.message));
}
const { url, screenshotSrc } = message.args;
store.dispatch(setUrl(url));
store.dispatch(setScreenshotSrc(screenshotSrc));
},
[ActionType.WRITE]: (message: ActionMessage) => {
const { path, content } = message.args;

View File

@@ -0,0 +1,76 @@
from abc import ABC, abstractmethod
from opendevin.events.action import Action
class ResponseParser(ABC):
"""
This abstract base class is a general interface for an response parser dedicated to
parsing the action from the response from the LLM.
"""
def __init__(
self,
):
# Need pay attention to the item order in self.action_parsers
self.action_parsers = []
@abstractmethod
def parse(self, response: str) -> Action:
"""
Parses the action from the response from the LLM.
Parameters:
- response (str): The response from the LLM.
Returns:
- action (Action): The action parsed from the response.
"""
pass
@abstractmethod
def parse_response(self, response) -> str:
"""
Parses the action from the response from the LLM.
Parameters:
- response (str): The response from the LLM.
Returns:
- action_str (str): The action str parsed from the response.
"""
pass
@abstractmethod
def parse_action(self, action_str: str) -> Action:
"""
Parses the action from the response from the LLM.
Parameters:
- action_str (str): The response from the LLM.
Returns:
- action (Action): The action parsed from the response.
"""
pass
class ActionParser(ABC):
"""
This abstract base class is an general interface for an action parser dedicated to
parsing the action from the action str from the LLM.
"""
@abstractmethod
def check_condition(self, action_str: str) -> bool:
"""
Check if the action string can be parsed by this parser.
"""
pass
@abstractmethod
def parse(self, action_str: str) -> Action:
"""
Parses the action from the action string from the LLM response.
"""
pass

View File

@@ -19,6 +19,7 @@ from opendevin.events.action import (
AddTaskAction,
AgentDelegateAction,
AgentFinishAction,
AgentRejectAction,
ChangeAgentStateAction,
MessageAction,
ModifyTaskAction,
@@ -164,6 +165,9 @@ class AgentController:
elif isinstance(event, AgentFinishAction):
self.state.outputs = event.outputs # type: ignore[attr-defined]
await self.set_agent_state_to(AgentState.FINISHED)
elif isinstance(event, AgentRejectAction):
self.state.outputs = event.outputs # type: ignore[attr-defined]
await self.set_agent_state_to(AgentState.REJECTED)
elif isinstance(event, Observation):
if self._pending_action and self._pending_action.id == event.cause:
await self.add_history(self._pending_action, event)
@@ -252,7 +256,7 @@ class AgentController:
# propagate error state until an agent or user can handle it
await self.set_agent_state_to(AgentState.ERROR)
return
delegate_done = delegate_state == AgentState.FINISHED
delegate_done = delegate_state in (AgentState.FINISHED, AgentState.REJECTED)
if delegate_done:
logger.info(
f'[Agent Controller {self.id}] Delegate agent has finished execution'

Some files were not shown because too many files have changed in this diff Show More