Xingyao Wang
b3bdc44292
mkdir infer_logs instead of logs (#2382)
2024-06-11 07:18:19 +08:00
Xingyao Wang
11a2d1682d
Minor SWE-Bench inference config tweak (#2381)
* save infer logs to infer_logs
* set max budget for swebench eval
2024-06-10 20:14:22 +00:00
Xingyao Wang
a6ba6c5277
Add SWEBench-docker eval (#2085)
* add initial version of swebench-docker eval
* update the branch of git repo
* add poetry run
* download dev set too and pre-load f2p and p2p
* update eval infer script
* increase timeout
* add poetry run
* install swebench from our fork
* update script
* update loc
* support single instance debug
* replace \r\n from model patch
* replace eval docker from namespace xingyaoww
* update script to auto detect swe-bench format jsonl
* support eval infer on single instance id
* change log output dir to logs
* update summarise result script
* update README
* update readme
* tweak branch
* Update evaluation/swe_bench/scripts/eval/prep_eval.sh
Co-authored-by: Graham Neubig <neubig@gmail.com>
---------
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-06-10 19:30:40 +00:00
Yufan Song
f4cb192ebe
Fix llm key leaks bug (#2376)
* fix bug
* fix bug
* add
2024-06-10 15:55:33 +00:00
Robert
7fc57650f3
BioCoder integration (#2076)
* prepare execution and inference
* Create README.md
* Update README.md
* Update evaluation/biocoder/README.md
* Update evaluation/swe_bench/swe_env_box.py
* switch to biocoder docker container and test-specific code
* code for copying and running test files into container
* add metrics
* add readme
* Biocoder evaluation code finished (rewrite testing infrastructure, prompt tuning, and bug fixes)
* Update README.md
---------
Co-authored-by: lilbillybiscuit <qianbill2014@outlook.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
2024-06-10 11:11:40 +08:00
RainRat
745ae42a72
fix typos (#2352)
2024-06-09 12:57:58 -07:00
yueqis
68d9ad61cf
Feat: Support Gorilla APIBench (#2081)
* removed unused files from gorilla
* Update run_infer.py, removed unused imports
* Update utils.py
* Update ast_eval_hf.py
* Update ast_eval_tf.py
* Update ast_eval_th.py
* Create README.md
* Update run_infer.py
* make lint
* Update run_infer.py
* fix lint
---------
Co-authored-by: yufansong <yufan@risingwave-labs.com>
2024-06-08 16:54:54 +00:00
Jaskirat Singh
e8307608c2
Support gpqa benchmark evaluation (#2080)
* feat: add gpqa benchmark evaluation
* add metrics
* reset configs in final block
* make lint
---------
Co-authored-by: yufansong <yufan@risingwave-labs.com>
2024-06-08 16:24:24 +00:00
yueqis
82d4d25b09
feat: support ToolQA benchmark (#2263)
* Add files via upload
* Update README.md
* Update run_infer.py
* Update utils.py
* make lint
* Update evaluation/toolqa/run_infer.py
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-06-08 07:54:01 -04:00
super-dainiu
beabcce16d
[Hotfix] Fix ML-Bench continue `run_inference.py` (#2284)
* add ml-bench w/o exec env
* fix typos (#1956 )
no functional change
* Refactored Logs (#1939 )
* [Feat] A competitive Web Browsing agent (#1856 )
* initial attempt at a browsing only agent
* add browsing agent
* update
* implement agent
* update
* fix comments
* remove unnecessary things from memory extras
* update image processing
---------
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
* Update README.md SWE-bench score (#1959 )
* Update README.md SWE-bench score
Our most recent results on swe-bench lite are 25%, so this updates the README accordingly.
* Update
* fix: llm is_local function logic error (#1961 )
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
* doc: update documentation about poetry update (#1962 )
* add doc
* Update Development.md
---------
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* feat: add metrics related to cost for better observability (#1944 )
* add metrics for total_cost
* make lint
* refactor codeact
* change metrics into llm
* add costs list, add into state
* refactor log completion
* refactor and test others
* make lint
* Update opendevin/core/metrics.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/llm/llm.py
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* refactor
* add code
---------
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* doc: add more cmd in unit test documentation (#1963 )
* --- (#1975 )
updated-dependencies:
- dependency-name: boto3
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1976 )
updated-dependencies:
- dependency-name: litellm
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Logging security (#1943 )
* update .gitignore
* Rename the confusing 'INFO' style to 'DETAIL'
* override str and repr
* feat: api_key desensitize
* feat: add SensitiveDataFilter in file handler
* tweak regex, add tests
* more tweaks, include other attrs
* add env vars, those with equivalent config
* fix tests
* tests are invaluable
---------
Co-authored-by: Shimada666 <649940882@qq.com >
* --- (#1967 )
updated-dependencies:
- dependency-name: react-dom
dependency-type: direct:production
update-type: version-update:semver-minor
- dependency-name: "@types/react-dom"
dependency-type: direct:development
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1968 )
updated-dependencies:
- dependency-name: "@reduxjs/toolkit"
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1969 )
updated-dependencies:
- dependency-name: husky
dependency-type: direct:development
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1970 )
updated-dependencies:
- dependency-name: tailwind-merge
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1971 )
updated-dependencies:
- dependency-name: i18next
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
* Refactor session management (#1810 )
* refactor session mgmt
* defer file handling to runtime
* add todo
* refactor sessions a bit more
* remove messages logic from FE
* fix up socket handshake
* refactor frontend auth a bit
* first pass at redoing file explorer
* implement directory suffix
* fix up file tree
* close agent on websocket close
* remove session saving
* move file refresh
* remove getWorkspace
* plumb path/code differently
* fix build issues
* fix the tests
* fix npm build
* add session rehydration
* fix event serialization
* logspam
* fix user message rehydration
* add get_event fn
* agent state restoration
* change history tracking for codeact
* fix responsiveness of init
* fix lint
* lint
* delint
* fix prop
* update tests
* logspam
* lint
* fix test
* revert codeact
* change fileService to use API
* fix up session loading
* delint
* delint
* fix integration tests
* revert test
* fix up access to options endpoints
* fix initial files load
* delint
* fix file initialization
* fix mock server
* fix lint
* fix auth for html
* Update frontend/src/i18n/translation.json
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* refactor sessions and sockets
* avoid reinitializing the same session
* fix reconnect issue
* change up intro message
* more guards on reinit
* rename agent_session
* delint
* fix a bunch of tests
* delint
* fix last test
* remove code editor context
* fix build
* fix any
* fix dot notation
* Update frontend/src/services/api.ts
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* fix up error handling
* Update opendevin/server/session/agent.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/server/session/agent.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update frontend/src/services/session.ts
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* fix build errs
* fix else
* add closed state
* delint
* Update opendevin/server/session/session.py
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
---------
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
Co-authored-by: Graham Neubig <neubig@gmail.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
* fix #1960 (#1964 )
* Add ruff for shared mutable defaults (B) (#1938 )
* Add ruff for shared mutable defaults (B)
* Apply B006, B008 on current files, except fast API
* Update agenthub/SWE_agent/prompts.py
Co-authored-by: Graham Neubig <neubig@gmail.com >
* fix unintended behavior change
* this is correct, tell Ruff to leave it alone
---------
Co-authored-by: Graham Neubig <neubig@gmail.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Refactor integration testing CI, add optional Mac tests, and mark a few agents as deprecated (#1888 )
* Add MacOS to integration tests
* Switch back to python 3.11
* Install Docker for macos pipeline
* regenerate.sh: Use environmental variable for sandbox type
* Pack different agents' tests into a single check
* Fix CodeAct tests
* Reduce file match and extensive debug logs
* Add TEST_IN_CI mode that reports codecov
* Small fix: don't quit if reusing old responses failed
* Merge codecov results
* Fix typos
* Remove coverage merge step - codecov automatically does that
* Make Mac integration tests optional (too slow)
* Fix codecov args
* Add comments in yaml
* Include sandbox type in codecov report name
* Fix codecov report merge
* Revert renaming of test_matrix_success
* Remove SWEAgent and PlannerAgent from tests
* Mark planner agent and SWE agent as deprecated
* CodeCov: Ignore planner and sweagent
* Revert "Remove SWEAgent and PlannerAgent from tests"
This reverts commit 040cb3bfb9 .
* Remove all tests for SWE Agent
* Only keep basic tests for MonologueAgent and PlannerAgent
* Mark SWE Agent as deprecated, and ignore code coverage for it
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
* Fix Repeated Responses in Chat by Adding IPythonRunCellObservation (#1987 )
Co-authored-by: jianghongwei <jianghongwei@58.com >
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
* Save CI cycles for backend tests (#1985 )
* Fix typo in prompt (#1992 )
* Refactor monologue and SWE agent to use the messages in state history (#1863 )
* Refactor monologue to use the messages in state history
* add messages, clean up
* fix monologue
* update integration tests
* move private method
* update SWE agent to use the history from State
* integration tests for SWE agent
* rename monologue to initial_thoughts, since that is what it is
* fix: catch the exception raised when the session file does not exist while initializing EventStream (e.g. when creating a new session with no stored session files) (#1994)
* add ml-bench in readme
* Bump boto3 from 1.34.110 to 1.34.111 (#2001 )
Bumps [boto3](https://github.com/boto/boto3 ) from 1.34.110 to 1.34.111.
- [Release notes](https://github.com/boto/boto3/releases )
- [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst )
- [Commits](https://github.com/boto/boto3/compare/1.34.110...1.34.111 )
---
updated-dependencies:
- dependency-name: boto3
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump docker from 7.0.0 to 7.1.0 (#2002 )
Bumps [docker](https://github.com/docker/docker-py ) from 7.0.0 to 7.1.0.
- [Release notes](https://github.com/docker/docker-py/releases )
- [Commits](https://github.com/docker/docker-py/compare/7.0.0...7.1.0 )
---
updated-dependencies:
- dependency-name: docker
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump litellm from 1.37.20 to 1.38.0 (#2005 )
Bumps [litellm](https://github.com/BerriAI/litellm ) from 1.37.20 to 1.38.0.
- [Release notes](https://github.com/BerriAI/litellm/releases )
- [Commits](https://github.com/BerriAI/litellm/compare/v1.37.20...v1.38.0 )
---
updated-dependencies:
- dependency-name: litellm
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix SWE-Bench evaluation due to setuptools version (#1995 )
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* Revert "correctly setup plugins for swebench eval"
This reverts commit 2bd1055673 .
* bump version
* fix session state after resuming (#1999 )
* fix state resuming
* fix session reconnection
* fix lint
* Implement `agentskills` for OpenDevin to help improve editing and include more useful tools/skills (#1941)
* add draft for skills
* Implement and test agentskills functions: open_file, goto_line, scroll_down, scroll_up, create_file, search_dir, search_file, find_file
* Remove new_sample.txt file
* add some work from opendevin w/ fixes
* Add unit tests for agentskills module
* fix some issues and updated tests
* add more tests for open
* tweak and handle goto_line
* add tests for some edge cases
* add tests for scrolling
* add tests for edit
* add tests for search_dir
* update tests to use pytest
* use pytest --forked to keep file-op unit tests from interfering with each other via a global variable
* update doc based on swe agent tool
* update and add tests for find_file and search_file
* move agent_skills to plugins
* add agentskills as plugin and docs
* add agentskill to ssh box and fix sandbox integration
* remove extra returns in doc
* add agentskills to initial tool for jupyter
* support re-init jupyter kernel (for agentskills) after restart
* fix print window's issue with indentation and add testcases
* add prompt for codeact with the newest edit primitives
* modify the way line number is presented (remove leading space)
* change prompt to the newest display format
* support tracking of costs via metrics
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* implement and add tests for py linting
* remove extra text arg for incompatible subprocess ver
* remove sample.txt
* update test_edits integration tests
* fix all integration
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/runtime/plugins/agent_skills/agentskills.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* Revert "correctly setup plugins for swebench eval"
This reverts commit 2bd1055673 .
* bump version
* remove _AGENT_SKILLS_DOCS
* move flake8 to test dep
* update poetry.lock
* remove extra arg
* reduce max iter for eval
* update poetry
* fix integration tests
---------
Co-authored-by: OpenDevin <opendevin@opendevin.ai >
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* build: Add poetry command to use Python 3.11 for environment setup (#1972 )
* Bump @react-types/shared from 3.23.0 to 3.23.1 in /frontend (#2006 )
Bumps [@react-types/shared](https://github.com/adobe/react-spectrum ) from 3.23.0 to 3.23.1.
- [Release notes](https://github.com/adobe/react-spectrum/releases )
- [Commits](https://github.com/adobe/react-spectrum/compare/@react-types/shared@3.23.0...@react-types/shared@3.23.1 )
---
updated-dependencies:
- dependency-name: "@react-types/shared"
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump @types/react-syntax-highlighter in /frontend (#2007 )
Bumps [@types/react-syntax-highlighter](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react-syntax-highlighter ) from 15.5.11 to 15.5.13.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases )
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react-syntax-highlighter )
---
updated-dependencies:
- dependency-name: "@types/react-syntax-highlighter"
dependency-type: direct:development
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump @typescript-eslint/parser from 7.9.0 to 7.10.0 in /frontend (#2008 )
Bumps [@typescript-eslint/parser](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/parser ) from 7.9.0 to 7.10.0.
- [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases )
- [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/parser/CHANGELOG.md )
- [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v7.10.0/packages/parser )
---
updated-dependencies:
- dependency-name: "@typescript-eslint/parser"
dependency-type: direct:development
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump lint-staged from 15.2.2 to 15.2.4 in /frontend (#2009 )
Bumps [lint-staged](https://github.com/okonet/lint-staged ) from 15.2.2 to 15.2.4.
- [Release notes](https://github.com/okonet/lint-staged/releases )
- [Changelog](https://github.com/lint-staged/lint-staged/blob/master/CHANGELOG.md )
- [Commits](https://github.com/okonet/lint-staged/compare/v15.2.2...v15.2.4 )
---
updated-dependencies:
- dependency-name: lint-staged
dependency-type: direct:development
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Update README.md
* Update README.md
* add run_infer.sh
* fix input output
* fix docker sandbox
* fix run
* update and clean run_infer.py
* add script to clean up dockers
* update repo uid
* add description
* new
* Update README.md
* use root for sandbox
* update readme
* update ml-bench conda env
* update readme
* update readme
* use try except
* modify raise exception
* add int
* update README
* longer time
* fix existing issues
* fix existing issue
* new docker image
* add metrics of cost
* add result parsing cost
* fix
* fix
* update summarize
* fix
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-31-157.ec2.internal>
Co-authored-by: RainRat <rainrat78@yahoo.ca>
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>
Co-authored-by: Frank Xu <frankxu2004@gmail.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Shimada666 <649940882@qq.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: Rahul Anand <62982824+zeul22@users.noreply.github.com>
Co-authored-by: jiangleo <jiangleo@users.noreply.github.com>
Co-authored-by: jianghongwei <jianghongwei@58.com>
Co-authored-by: Jeremi Joslin <jeremi@newlogic.com>
Co-authored-by: Aaron Xia <zhhuaxia@gmail.com>
Co-authored-by: OpenDevin <opendevin@opendevin.ai>
Co-authored-by: DaxServer <7479937+DaxServer@users.noreply.github.com>
Co-authored-by: Robert <871607149@qq.com>
2024-06-06 03:53:21 +00:00
Frank Xu
48151bdbb0
[feat] WebArena benchmark, MiniWoB++ benchmark and related arch changes (#2170)
* add webarena, and revamp messaging for webarena eval
* add changes for browsergym
* update infer script
* fix unit tests
* update
* add multiple run for miniwob
* update instruction, remove personal path
* update
* add code for getting final reward, fix integration, add results
* add avg cost calculation
2024-06-06 09:01:20 +08:00
மனோஜ்குமார் பழனிச்சாமி
ae815b20d2
Improved logs (#2272)
2024-06-05 17:54:40 +05:30
Boxuan Li
208b1461ca
[AgentBench evaluation] set run_as_devin to true (#2269)
Co-authored-by: Leo <ifuryst@gmail.com>
2024-06-05 07:53:33 +00:00
Ryan H. Tran
0584e428b2
[Mint evaluation] Fix bug in stopping when the agent reaches max steps or solution proposals (#2268)
* fix: bug in stopping when the agent reaches max steps or solution proposals
* remove --eval-num-workers
* update env.py
2024-06-05 06:47:07 +00:00
super-dainiu
ebafb702e5
Add ML-Bench Evaluation with OpenDevin (#2015)
* add ml-bench w/o exec env
* fix typos (#1956 )
no functional change
* Refactored Logs (#1939 )
* [Feat] A competitive Web Browsing agent (#1856 )
* initial attempt at a browsing only agent
* add browsing agent
* update
* implement agent
* update
* fix comments
* remove unnecessary things from memory extras
* update image processing
---------
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
* Update README.md SWE-bench score (#1959 )
* Update README.md SWE-bench score
Our most recent results on swe-bench lite are 25%, so this updates the README accordingly.
* Update
* fix: llm is_local function logic error (#1961 )
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
* doc: update documentation about poetry update (#1962 )
* add doc
* Update Development.md
---------
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* feat: add metrics related to cost for better observability (#1944 )
* add metrics for total_cost
* make lint
* refactor codeact
* change metrics into llm
* add costs list, add into state
* refactor log completion
* refactor and test others
* make lint
* Update opendevin/core/metrics.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/llm/llm.py
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* refactor
* add code
---------
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* doc: add more cmd in unit test documentation (#1963 )
* --- (#1975 )
updated-dependencies:
- dependency-name: boto3
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1976 )
updated-dependencies:
- dependency-name: litellm
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Logging security (#1943 )
* update .gitignore
* Rename the confusing 'INFO' style to 'DETAIL'
* override str and repr
* feat: api_key desensitize
* feat: add SensitiveDataFilter in file handler
* tweak regex, add tests
* more tweaks, include other attrs
* add env vars, those with equivalent config
* fix tests
* tests are invaluable
---------
Co-authored-by: Shimada666 <649940882@qq.com >
* --- (#1967 )
updated-dependencies:
- dependency-name: react-dom
dependency-type: direct:production
update-type: version-update:semver-minor
- dependency-name: "@types/react-dom"
dependency-type: direct:development
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1968 )
updated-dependencies:
- dependency-name: "@reduxjs/toolkit"
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1969 )
updated-dependencies:
- dependency-name: husky
dependency-type: direct:development
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1970 )
updated-dependencies:
- dependency-name: tailwind-merge
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1971 )
updated-dependencies:
- dependency-name: i18next
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
* Refactor session management (#1810 )
* refactor session mgmt
* defer file handling to runtime
* add todo
* refactor sessions a bit more
* remove messages logic from FE
* fix up socket handshake
* refactor frontend auth a bit
* first pass at redoing file explorer
* implement directory suffix
* fix up file tree
* close agent on websocket close
* remove session saving
* move file refresh
* remove getWorkspace
* plumb path/code differently
* fix build issues
* fix the tests
* fix npm build
* add session rehydration
* fix event serialization
* logspam
* fix user message rehydration
* add get_event fn
* agent state restoration
* change history tracking for codeact
* fix responsiveness of init
* fix lint
* lint
* delint
* fix prop
* update tests
* logspam
* lint
* fix test
* revert codeact
* change fileService to use API
* fix up session loading
* delint
* delint
* fix integration tests
* revert test
* fix up access to options endpoints
* fix initial files load
* delint
* fix file initialization
* fix mock server
* fix lint
* fix auth for html
* Update frontend/src/i18n/translation.json
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* refactor sessions and sockets
* avoid reinitializing the same session
* fix reconnect issue
* change up intro message
* more guards on reinit
* rename agent_session
* delint
* fix a bunch of tests
* delint
* fix last test
* remove code editor context
* fix build
* fix any
* fix dot notation
* Update frontend/src/services/api.ts
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* fix up error handling
* Update opendevin/server/session/agent.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/server/session/agent.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update frontend/src/services/session.ts
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* fix build errs
* fix else
* add closed state
* delint
* Update opendevin/server/session/session.py
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
---------
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
Co-authored-by: Graham Neubig <neubig@gmail.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
* fix #1960 (#1964 )
* Add ruff for shared mutable defaults (B) (#1938 )
* Add ruff for shared mutable defaults (B)
* Apply B006, B008 on current files, except fast API
* Update agenthub/SWE_agent/prompts.py
Co-authored-by: Graham Neubig <neubig@gmail.com >
* fix unintended behavior change
* this is correct, tell Ruff to leave it alone
---------
Co-authored-by: Graham Neubig <neubig@gmail.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Refactor integration testing CI, add optional Mac tests, and mark a few agents as deprecated (#1888 )
* Add MacOS to integration tests
* Switch back to python 3.11
* Install Docker for macos pipeline
* regenerate.sh: Use environmental variable for sandbox type
* Pack different agents' tests into a single check
* Fix CodeAct tests
* Reduce file match and extensive debug logs
* Add TEST_IN_CI mode that reports codecov
* Small fix: don't quit if reusing old responses failed
* Merge codecov results
* Fix typos
* Remove coverage merge step - codecov automatically does that
* Make Mac integration tests optional (too slow)
* Fix codecov args
* Add comments in yaml
* Include sandbox type in codecov report name
* Fix codecov report merge
* Revert renaming of test_matrix_success
* Remove SWEAgent and PlannerAgent from tests
* Mark planner agent and SWE agent as deprecated
* CodeCov: Ignore planner and sweagent
* Revert "Remove SWEAgent and PlannerAgent from tests"
This reverts commit 040cb3bfb9 .
* Remove all tests for SWE Agent
* Only keep basic tests for MonologueAgent and PlannerAgent
* Mark SWE Agent as deprecated, and ignore code coverage for it
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
* Fix Repeated Responses in Chat by Adding IPythonRunCellObservation (#1987 )
Co-authored-by: jianghongwei <jianghongwei@58.com >
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
* Save CI cycles for backend tests (#1985 )
* Fix typo in prompt (#1992 )
* Refactor monologue and SWE agent to use the messages in state history (#1863 )
* Refactor monologue to use the messages in state history
* add messages, clean up
* fix monologue
* update integration tests
* move private method
* update SWE agent to use the history from State
* integration tests for SWE agent
* rename monologue to initial_thoughts, since that is what it is
* fix: catch the exception raised when the session file does not exist while initializing EventStream (e.g. when creating a new session with no stored session files) (#1994)
* add ml-bench in readme
* Bump boto3 from 1.34.110 to 1.34.111 (#2001 )
Bumps [boto3](https://github.com/boto/boto3 ) from 1.34.110 to 1.34.111.
- [Release notes](https://github.com/boto/boto3/releases )
- [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst )
- [Commits](https://github.com/boto/boto3/compare/1.34.110...1.34.111 )
---
updated-dependencies:
- dependency-name: boto3
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump docker from 7.0.0 to 7.1.0 (#2002 )
Bumps [docker](https://github.com/docker/docker-py ) from 7.0.0 to 7.1.0.
- [Release notes](https://github.com/docker/docker-py/releases )
- [Commits](https://github.com/docker/docker-py/compare/7.0.0...7.1.0 )
---
updated-dependencies:
- dependency-name: docker
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump litellm from 1.37.20 to 1.38.0 (#2005 )
Bumps [litellm](https://github.com/BerriAI/litellm ) from 1.37.20 to 1.38.0.
- [Release notes](https://github.com/BerriAI/litellm/releases )
- [Commits](https://github.com/BerriAI/litellm/compare/v1.37.20...v1.38.0 )
---
updated-dependencies:
- dependency-name: litellm
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix SWE-Bench evaluation due to setuptools version (#1995 )
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* Revert "correctly setup plugins for swebench eval"
This reverts commit 2bd1055673 .
* bump version
* fix session state after resuming (#1999 )
* fix state resuming
* fix session reconnection
* fix lint
* Implement `agentskills` for OpenDevin to helpfully improve edit AND including more useful tools/skills (#1941 )
* add draft for skills
* Implement and test agentskills functions: open_file, goto_line, scroll_down, scroll_up, create_file, search_dir, search_file, find_file
* Remove new_sample.txt file
* add some work from opendevin w/ fixes
* Add unit tests for agentskills module
* fix some issues and updated tests
* add more tests for open
* tweak and handle goto_line
* add tests for some edge cases
* add tests for scrolling
* add tests for edit
* add tests for search_dir
* update tests to use pytest
* use pytest --forked to keep file op unit tests from interfering with each other via global var
* update doc based on swe agent tool
* update and add tests for find_file and search_file
* move agent_skills to plugins
* add agentskills as plugin and docs
* add agentskill to ssh box and fix sandbox integration
* remove extra returns in doc
* add agentskills to initial tool for jupyter
* support re-init jupyter kernel (for agentskills) after restart
* fix print window's issue with indentation and add testcases
* add prompt for codeact with the newest edit primitives
* modify the way line number is presented (remove leading space)
* change prompt to the newest display format
* support tracking of costs via metrics
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* implement and add tests for py linting
* remove extra text arg for incompatible subprocess ver
* remove sample.txt
* update test_edits integration tests
* fix all integration
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/runtime/plugins/agent_skills/agentskills.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* Revert "correctly setup plugins for swebench eval"
This reverts commit 2bd1055673 .
* bump version
* remove _AGENT_SKILLS_DOCS
* move flake8 to test dep
* update poetry.lock
* remove extra arg
* reduce max iter for eval
* update poetry
* fix integration tests
---------
Co-authored-by: OpenDevin <opendevin@opendevin.ai >
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* build: Add poetry command to use Python 3.11 for environment setup (#1972 )
* Bump @react-types/shared from 3.23.0 to 3.23.1 in /frontend (#2006 )
Bumps [@react-types/shared](https://github.com/adobe/react-spectrum ) from 3.23.0 to 3.23.1.
- [Release notes](https://github.com/adobe/react-spectrum/releases )
- [Commits](https://github.com/adobe/react-spectrum/compare/@react-types/shared@3.23.0...@react-types/shared@3.23.1 )
---
updated-dependencies:
- dependency-name: "@react-types/shared"
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump @types/react-syntax-highlighter in /frontend (#2007 )
Bumps [@types/react-syntax-highlighter](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react-syntax-highlighter ) from 15.5.11 to 15.5.13.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases )
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react-syntax-highlighter )
---
updated-dependencies:
- dependency-name: "@types/react-syntax-highlighter"
dependency-type: direct:development
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump @typescript-eslint/parser from 7.9.0 to 7.10.0 in /frontend (#2008 )
Bumps [@typescript-eslint/parser](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/parser ) from 7.9.0 to 7.10.0.
- [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases )
- [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/parser/CHANGELOG.md )
- [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v7.10.0/packages/parser )
---
updated-dependencies:
- dependency-name: "@typescript-eslint/parser"
dependency-type: direct:development
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump lint-staged from 15.2.2 to 15.2.4 in /frontend (#2009 )
Bumps [lint-staged](https://github.com/okonet/lint-staged ) from 15.2.2 to 15.2.4.
- [Release notes](https://github.com/okonet/lint-staged/releases )
- [Changelog](https://github.com/lint-staged/lint-staged/blob/master/CHANGELOG.md )
- [Commits](https://github.com/okonet/lint-staged/compare/v15.2.2...v15.2.4 )
---
updated-dependencies:
- dependency-name: lint-staged
dependency-type: direct:development
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Update README.md
* Update README.md
* add run_infer.sh
* fix input output
* fix docker sandbox
* fix run
* update and clean run_infer.py
* add script to clean up dockers
* update repo uid
* add description
* new
* Update README.md
* use root for sandbox
* update readme
* update ml-bench conda env
* update readme
* update readme
* use try except
* modify raise exception
* add int
* update README
* longer time
* fix existing issues
* fix existing issue
* new docker image
* add metrics of cost
* add result parsing cost
* fix
---------
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: Ubuntu <ubuntu@ip-172-31-31-157.ec2.internal >
Co-authored-by: RainRat <rainrat78@yahoo.ca >
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
Co-authored-by: Frank Xu <frankxu2004@gmail.com >
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
Co-authored-by: Graham Neubig <neubig@gmail.com >
Co-authored-by: Shimada666 <649940882@qq.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
Co-authored-by: Robert Brennan <accounts@rbren.io >
Co-authored-by: Rahul Anand <62982824+zeul22@users.noreply.github.com >
Co-authored-by: jiangleo <jiangleo@users.noreply.github.com >
Co-authored-by: jianghongwei <jianghongwei@58.com >
Co-authored-by: Jeremi Joslin <jeremi@newlogic.com >
Co-authored-by: Aaron Xia <zhhuaxia@gmail.com >
Co-authored-by: OpenDevin <opendevin@opendevin.ai >
Co-authored-by: DaxServer <7479937+DaxServer@users.noreply.github.com >
Co-authored-by: Robert <871607149@qq.com >
2024-06-05 01:56:39 +00:00
Leo
040d6bd806
fix: add an early exit check for agent answers in agent bench. ( #2257 )
...
Signed-off-by: ifuryst <ifuryst@gmail.com >
2024-06-04 18:45:07 -07:00
tobitege
5776474dcf
Fix SWE-Bench README typos ( #2250 )
2024-06-05 01:18:02 +00:00
Leo
9ada36e30b
fix: restore python linting. ( #2228 )
...
* fix: restore python linting.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* update: extend the Python lint check to evaluation.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Update evaluation/logic_reasoning/instruction.txt
---------
Signed-off-by: ifuryst <ifuryst@gmail.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
2024-06-04 06:36:19 +00:00
finaltrip
05b84df9cb
chore: fix some comments ( #2234 )
...
Signed-off-by: finaltrip <finaltrip@qq.com >
2024-06-03 16:04:34 +00:00
Boxuan Li
538d1d85a2
evaluation: Reset configs in finally block ( #2214 )
2024-06-03 09:52:12 +08:00
Ryan H. Tran
22e8fb39b1
add cost metrics to evaluation outputs for all benchmarks ( #2199 )
2024-06-02 08:28:00 +00:00
Yizhe Zhang
8d79c3edbc
modify the exiting logic and reward calculation, delete unused function ( #2198 )
2024-06-02 06:38:09 +00:00
RainRat
ed6dcc8381
fix typos ( #2187 )
...
* fix typos
no functional change
* fix typos
2024-06-01 20:40:30 +00:00
Leo
2c231c57c9
Add supported benchmarks to evaluation README (AgentBench, BIRD, LogicReasoning) ( #2183 )
...
Signed-off-by: ifuryst <ifuryst@gmail.com >
2024-06-01 11:33:01 -04:00
Binyuan Hui
46dcf4bb3e
Support BIRD benchmark ( #2117 )
...
* update: change timeout from 10 to 30
* update: readme for bird evaluation
* Update evaluation/bird/run_infer.py
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
* Update evaluation/bird/README.md
Co-authored-by: Shimada666 <649940882@qq.com >
* Update evaluation/bird/README.md
Co-authored-by: Shimada666 <649940882@qq.com >
* Update evaluation/bird/run_infer.py
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
Co-authored-by: Shimada666 <649940882@qq.com >
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
2024-06-01 11:34:36 +00:00
Leo
be251b11de
Add AgentBench. ( #2012 )
...
* Add AgentBench.
* Load the datasets from HF.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Add helper functions.
* Add mock executor.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Add retrieve agent answer cmd.
* Adjust the dataset.
* Refine test results.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Consolidate all AgentBench datasets and scripts into a single CSV dataset.
* Refactor dataset source.
* Update helper functions.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Fix the CRLF problem.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Separate the instance's workspace.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Add cleanup logic and error handling for sandbox closure.
* Normalized dataset
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Update README.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Update the prompt to capture the answer.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Refactor script execution paths to use absolute container workspace path.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Update AgentBench README.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Delete useless functions.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Update evaluation/agent_bench/README.md
* Add script to summarize test results from JSONL file in AgentBench
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Delete useless script and codes.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Update evaluation/agent_bench/scripts/summarise_results.py
---------
Signed-off-by: ifuryst <ifuryst@gmail.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
2024-06-01 07:58:14 +00:00
Ryan H. Tran
01296ff79d
Add remaining subsets for MINT benchmark ( #2142 )
...
* add MMLU subset
* add theoremqa subset
* remove redundant packages from requirements.txt, adjust prompts, handle gpt3.5 proposing a wrong answer after a correct answer
* add MBPP subset
* add humaneval subset
* update README
* exit actively after the agent finishes the task
2024-05-31 20:04:13 +00:00
Boxuan Li
4d14b44a9a
SWE-bench: Add summarise utility script to view passed/failed task IDs ( #2137 )
...
* SWE-bench: Add summarise utility script to view passed/failed task IDs
* Fix typos
* Move file
* Prettify
* Use merged jsonl file
2024-05-31 12:32:17 +08:00
Boxuan Li
f188abd7a3
Delete evaluation outputs files ( #2152 )
...
* Delete evaluation outputs files
* Fix README
2024-05-31 03:12:27 +00:00
Ren Ma
a9823491e6
Support Logic Reasoning Benchmark ( #1973 )
2024-05-30 16:35:15 +08:00
Xingyao Wang
01ef90205d
Add CodeActSWEAgent to remove browsing & github + improvements on agentskills ( #2105 )
...
* update swe_bench prompt;
use minimal prompt for codeact;
* upgrade agentskills and update testcases
* update infer prompt
* fix cwd
* add icl for swebench
* also log in_context_example to run infer
* remove extra print
* change prompt to abs path
* update error message to include current file info
* change cwd for jupyter if needed
* update edit error message
* update prompt
* improve git get patch
* update hint string
* default to 50 turns
* revert changes from codeact agent and create new CodeActSWEAgent
* revert changes to codeact
* revert instructions for run infer
* revert instructions for run infer
* update README
* update max iter
* add codeact swe agent
* fix issue for CodeActSWEAgent
* allow specifying max iter in cmdline script
* stop printing
* Update agenthub/codeact_swe_agent/README.md
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
* Fix prompt regression in jupyter plugin
---------
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
2024-05-29 21:19:00 -07:00
Ryan H. Tran
9434bcce48
Support MINT benchmark (MATH, GSM8K subset) ( #1955 )
...
* setup boilerplate and README
* setup test script and load dataset
* add temp intg that works
* refactor code
* add solution evaluation through 'fake_user_response_fn'
* finish integrating MATH subset
* Update evaluation/mint/run_infer.py
* Update evaluation/mint/run_infer.sh
* Update opendevin/core/main.py
* remove redundant templates, add eval_note, update README
* use <execute_ipython> tag instead of <execute>
* hardcode AGENT option for run_infer.sh
* Update evaluation/mint/task.py
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
* fix: bug where no message is returned when the task succeeds
* change message to make the agent exit
* import bash abstractmethod
* install all required packages inside sandbox before the agent runs, adjust prompt
* add subset eval folder separation and test for gsm8k
* fix bug in Reasoning task result check, add requirements.txt
* Fix syntax error in evaluation/mint/run_infer.py
* update README, add default values for `SUBSET` and `EVAL_LIMIT`
---------
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
Co-authored-by: yufansong <yufan@risingwave-labs.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
2024-05-28 07:42:52 +00:00
Xingyao Wang
2c0a2dbc61
fix yet another swe_bench issue ( #2069 )
2024-05-26 10:01:43 -07:00
Gant
f0271f9f91
need to run as root to use SWEBench container ( #2068 )
2024-05-26 14:21:33 +00:00
Xingyao Wang
5114230e53
Some SWE-Bench infer fixes and improvements ( #2065 )
...
* reset workspace base properly
* support running without hint
* support running without hint
* bump swe-bench eval docker to v1.2 for latest agentskills
* only give hint when use hint text is true
* add swe-agent instructions for validation
* update dockerfile
* pin the python interpreter for execute_cli
* avoid initialize plugins twice
* default to use hint
* save results to swe_bench_lite
* unset gh token and increase max iter to 50
* remove printing of use hint status
* refactor ssh login into one function
* ok drop to 30 turns bc it is so expensive :(
* remove reproduce comments to avoid getting stuck
2024-05-26 10:02:11 +00:00
Yizhe Zhang
0c829cd067
Support Entity-Deduction-Arena (EDA) Benchmark ( #1931 )
...
* adding draft evaluation code for EDA, using chatgpt as the temporary agent for now
* Update README.md
* Delete frontend/package.json
* reverse the irrelevant changes
* reverse package.json
* use chatgpt as the codeactagent
* integrate with opendevin
* Update evaluation/EDA/README.md
* Update evaluation/EDA/README.md
* Use poetry to manage packages
* integrate with opendevin
* minor update
* minor update
* update poetry
* update README
* clean-up infer scripts
* add run_infer script and improve readme
* log final success and final message & ground truth
---------
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
Co-authored-by: yufansong <yufan@risingwave-labs.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
2024-05-25 23:17:04 +08:00
Xingyao Wang
28ab00946b
update README for GAIA ( #2054 )
...
* update README for GAIA
* Update evaluation/gaia/README.md
* Update evaluation/gaia/README.md
* Update evaluation/gaia/README.md
---------
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
2024-05-25 15:01:03 +00:00
Jiayi Pan
2d52298a1d
Support GAIA benchmark ( #1911 )
...
* Add gaia test
* Improve gaia prompts
* Fix browser_env hang bug
* Fix gaia bugs
* add gaia to eval readme
* Fix gaia bugs
* minor fix
* add run_infer.sh and update readme
* set num eval worker to 1
* default to 2023 gaia level1 subset
* default to level 1
* add prompt to instruct the model to enclose its answer within <solution> tag
* add missing break
---------
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
Co-authored-by: yufansong <yufan@risingwave-labs.com >
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
2024-05-24 11:22:28 +00:00
Xingyao Wang
602ffcdffb
Implement agentskills for OpenDevin to helpfully improve edit AND including more useful tools/skills ( #1941 )
...
* add draft for skills
* Implement and test agentskills functions: open_file, goto_line, scroll_down, scroll_up, create_file, search_dir, search_file, find_file
* Remove new_sample.txt file
* add some work from opendevin w/ fixes
* Add unit tests for agentskills module
* fix some issues and updated tests
* add more tests for open
* tweak and handle goto_line
* add tests for some edge cases
* add tests for scrolling
* add tests for edit
* add tests for search_dir
* update tests to use pytest
* use pytest --forked to keep file op unit tests from interfering with each other via global var
* update doc based on swe agent tool
* update and add tests for find_file and search_file
* move agent_skills to plugins
* add agentskills as plugin and docs
* add agentskill to ssh box and fix sandbox integration
* remove extra returns in doc
* add agentskills to initial tool for jupyter
* support re-init jupyter kernel (for agentskills) after restart
* fix print window's issue with indentation and add testcases
* add prompt for codeact with the newest edit primitives
* modify the way line number is presented (remove leading space)
* change prompt to the newest display format
* support tracking of costs via metrics
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* implement and add tests for py linting
* remove extra text arg for incompatible subprocess ver
* remove sample.txt
* update test_edits integration tests
* fix all integration
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/runtime/plugins/agent_skills/agentskills.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* Revert "correctly setup plugins for swebench eval"
This reverts commit 2bd1055673 .
* bump version
* remove _AGENT_SKILLS_DOCS
* move flake8 to test dep
* update poetry.lock
* remove extra arg
* reduce max iter for eval
* update poetry
* fix integration tests
---------
Co-authored-by: OpenDevin <opendevin@opendevin.ai >
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
2024-05-23 16:04:09 +00:00
Xingyao Wang
6ff50ed369
Fix SWE-Bench evaluation due to setuptools version ( #1995 )
...
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* Revert "correctly setup plugins for swebench eval"
This reverts commit 2bd1055673 .
* bump version
2024-05-23 23:17:42 +08:00
Niklas Muennighoff
ef6cdb7532
HumanEvalFix integration ( #1908 )
...
* Preliminary HumanEvalFix integration
* Clean paths
* fix: set workspace path correctly for config
fix: task in that contains /
* add missing run_infer.sh
* update run_infer w/o hard coded agent
* fix typo
* change `instance_id` to `task_id`
* add the warning and env var setting to run_infer.sh
* reset back workspace mount at the end of each instance
* 10 max iter is probably enough for humanevalfix
* Remove unneeded section
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* Fix link
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
* Use logger
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
* Update run_infer.py
fix a bug:
ERROR:concurrent.futures:exception calling callback for <Future at 0x309cbc470 state=finished raised NameError>
concurrent.futures.process._RemoteTraceback:
* Update README.md
* Update README.md
* Update README.md
* Update README.md
added an example
* Update README.md
added: enable_auto_lint = true
* Update pyproject.toml
add: evaluate package
* Delete poetry.lock
update poetry.lock
* update poetry.lock
update poetry.lock
* Update README.md
* Update README.md
---------
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
Co-authored-by: Robert <871607149@qq.com >
2024-05-23 13:09:40 +00:00
RainRat
43c187b949
fix typos ( #1956 )
...
no functional change
2024-05-21 19:00:48 +00:00
Boxuan Li
4add8a5595
SWE-bench: Allow selection of tasks ( #1935 )
2024-05-21 16:53:29 +08:00
Shimada666
75cecf68e0
docs: update tutorial docs ( #1912 )
...
* docs: update tutorial docs
* Update evaluation/TUTORIAL.md
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
2024-05-20 14:40:31 +00:00
Boxuan Li
b845a38169
Small improvements & fixes to SWE-Bench ( #1874 )
...
I was able to run a few benchmark instances from SWE-Bench by myself following the documentation - it was great! In general the experience was smooth, thanks to @xingyaoww, @libowen2121 and the team! I made a few small enhancements and fixes to further improve the developer experience.
* Always use `poetry run python` (Python from poetry's virtual environment) instead of `python` or `python3` in scripts, so that behavior is consistent.
* Make AGENT configurable. One can use an argument to control which agent they would like to benchmark. To facilitate this, I removed the hardcoded CodeActAgent from run_infer.sh, and also added a VERSION attribute to all agents, as the benchmark needs to record the agent version.
* Make EVAL_LIMIT configurable. One can use an argument to control how many instances they'd like to benchmark. Useful for debugging & development purposes.
* Fix the 'eval_output_dir' not defined error in run_infer.py.
* Other enhancements to the README file and logs.
I also noticed that a lot of code from run_infer.py could be shared by other benchmarks, but since we only have one benchmark now, I think we could avoid over-engineering. A refactor and code dedup would be useful in the future once we have more benchmarks, though.
2024-05-20 08:03:30 +00:00
Xingyao Wang
b2fdb963b6
Add detailed tutorial for adding new evaluation benchmarks ( #1827 )
...
* Add detailed tutorial for adding new evaluation benchmarks
* update tutorial, fix typo, and log observation to the cmdline
* fix url
* Update evaluation/TUTORIAL.md
* Update evaluation/TUTORIAL.md
* Update evaluation/TUTORIAL.md
* Update evaluation/TUTORIAL.md
Co-authored-by: Graham Neubig <neubig@gmail.com >
* Update evaluation/TUTORIAL.md
Co-authored-by: Graham Neubig <neubig@gmail.com >
* Update evaluation/TUTORIAL.md
Co-authored-by: Graham Neubig <neubig@gmail.com >
* Update evaluation/TUTORIAL.md
Co-authored-by: Graham Neubig <neubig@gmail.com >
* Update evaluation/TUTORIAL.md
Co-authored-by: Graham Neubig <neubig@gmail.com >
* Update evaluation/TUTORIAL.md
Co-authored-by: Graham Neubig <neubig@gmail.com >
* Update evaluation/TUTORIAL.md
Co-authored-by: Graham Neubig <neubig@gmail.com >
* Update evaluation/TUTORIAL.md
Co-authored-by: Graham Neubig <neubig@gmail.com >
* Update evaluation/TUTORIAL.md
Co-authored-by: Graham Neubig <neubig@gmail.com >
* Update evaluation/TUTORIAL.md
Co-authored-by: Graham Neubig <neubig@gmail.com >
* Update evaluation/TUTORIAL.md
Co-authored-by: Graham Neubig <neubig@gmail.com >
* Update evaluation/TUTORIAL.md
Co-authored-by: Graham Neubig <neubig@gmail.com >
* Update evaluation/TUTORIAL.md
Co-authored-by: Graham Neubig <neubig@gmail.com >
* Update evaluation/TUTORIAL.md
Co-authored-by: Graham Neubig <neubig@gmail.com >
* simplify readme and add comments to the actual code
* Fix typo in evaluation/TUTORIAL.md
* Fix typo in evaluation/swe_bench/run_infer.py
* Fix another typo in evaluation/swe_bench/run_infer.py
* Update TUTORIAL.md
* Set host network to false for SWEBench
* Update evaluation/TUTORIAL.md
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update evaluation/TUTORIAL.md
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update evaluation/TUTORIAL.md
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update evaluation/TUTORIAL.md
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
---------
Co-authored-by: OpenDevin <opendevin@opendevin.ai >
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
Co-authored-by: Graham Neubig <neubig@gmail.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
2024-05-18 13:40:53 -04:00
Boxuan Li
a57a213c7c
Turn off auto linting by default, and on for swe_bench ( #1861 )
...
Disable Python linting by default, and turn it on for SWE Bench.
It is turned off by default since this behavior is weird and somewhat annoying to end users.
It is turned on for SWE Bench because linting Python files gives the LLM a chance to fix indentation errors.
2024-05-18 04:04:38 +00:00
Xingyao Wang
e31f8b8322
automatically get agent version for eval ( #1844 )
2024-05-16 13:39:00 -04:00
Leo
e89cc8f19b
Feat: add stream output to exec_run ( #1625 )
...
* Feat: add stream output to exec_run
* Using command timeout to control the exec_box's timeout.
* add bash -c to source command for compatibility with sh.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* Feat: add stream output to SSHBox execute
Signed-off-by: ifuryst <ifuryst@gmail.com >
* fix the test case fail.
Signed-off-by: ifuryst <ifuryst@gmail.com >
* fix the test case importing the method from the wrong path.
Signed-off-by: ifuryst <ifuryst@gmail.com >
---------
Signed-off-by: ifuryst <ifuryst@gmail.com >
2024-05-16 14:37:49 +00:00
Xingyao Wang
0fdbe1ee93
Update README.md ( #1825 )
2024-05-16 11:06:28 +00:00