Graham Neubig
a081935fd8
Simplify eval code ( #2775 )
...
* Start simplifying eval code
* Update
* Add EDA
* Updated GAIA
* Update gpqa
* Add humanevalfix
* Fix logic_reasoning
* Add miniwob
* Add mint and ml_bench
* toolqa
* Added swe-bench
* Fixed webarena
* Refactor parameters
2024-07-05 19:33:08 +09:00
மனோஜ்குமார் பழனிச்சாமி
143f38d25a
Refactored sandbox config and added fast boot ( #2455 )
...
* Refactored sandbox config and added fastboot
* added tests
* fixed tests
* fixed tests
* notify user about breaking change
* remove default config from eval
* check for lowercase env
* add test
* Revert Migration
* migrate old sandbox configs
* resolve merge conflict
* revert migration 2
* Revert "remove default config from eval"
This reverts commit de57c588db .
* change type to box_type
* fix var name
* linted
* lint
* lint comments
* fix tests
* fix tests
* fix typo
* fix box_type, remove fast_boot
* add tests for sandbox config
* fix test
* update eval docs
* small removal comments
* adapt toml template
* old fields shouldn't be in the app dataclass
* fix old keys in app config
* clean up exec box
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
2024-07-05 03:30:21 +00:00
Xingyao Wang
298956c78a
[Eval] initialize llm inside process_instance to circumvent "AttributeError:… ( #2805 )
...
* initialize llm inside process_instance to circumvent "AttributeError: Can't pickle local object"
* update kwargs
2024-07-05 01:26:03 +00:00
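A minimal sketch of the failure mode this commit works around, assuming the evaluation workers are started through a mechanism that pickles their arguments (as Python's `multiprocessing` does). The names `make_local_callable`, `can_pickle`, and the body of `process_instance` are hypothetical stand-ins, not the repository's code; the point is only that an object defined inside a function cannot be pickled, so it is built inside the worker instead of being passed in.

```python
import pickle


def make_local_callable():
    # A function defined inside another function is a "local object";
    # pickle cannot serialize it by reference.
    def local_fn(x: int) -> int:
        return x + 1

    return local_fn


def can_pickle(obj) -> bool:
    # Depending on the Python version, the failure surfaces as
    # AttributeError or pickle.PicklingError, so both are caught.
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


def process_instance(instance_id: int) -> int:
    # Hypothetical worker: the unpicklable helper is created here, inside
    # the worker, rather than being shipped from the parent process.
    helper = make_local_callable()
    return helper(instance_id)


if __name__ == "__main__":
    print(can_pickle(make_local_callable()))  # the local function is rejected
    print(process_instance(1))
```

Only plain data such as `instance_id` crosses the process boundary in this sketch, which is why the pickling error does not occur.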
Xingyao Wang
e6cdf18d3b
[Evaluation] Log empty patch stats for SWE-Bench ( #2776 )
...
* bump swebench version since the fix PR is merged
* add empty generation stats from latest pr
* delete eval_outputs if it already exists
* handle non string patch
2024-07-05 07:03:27 +08:00
Graham Neubig
ffd3c7144c
Remove global args ( #2760 )
...
* Remove global args
* Remove global args
* Update files
* Update main
* Bug fixes
* Fix logging
2024-07-03 20:07:52 +09:00
Xingyao Wang
4d0c4f37d6
[Evaluation] fix SWE-Bench docker image name ( #2751 )
...
* fix double underscore
* remove unused script
2024-07-03 04:30:38 +08:00
Xingyao Wang
41ddba84bd
[Agent] (Potentially) improve Editing using diff ( #2685 )
...
* add replace-based block edit & preliminary test case fix
* further fix the insert behavior
* make edit only work on first occurrence
* bump codeact version since we now use new edit agentskills
* update prompt for new agentskills
* update integration tests
* make run_infer.sh executable
* remove code block for edit_file
* update integration test for prompt changes
* default to not use hint for eval
* fix insert emptyfile bug
* throw value error when `to_replace` is empty
* make `_edit_or_insert_file` return string so we can try to fix some linter errors (best attempt)
* add todo
* update integration test
* fix sandbox test for this PR
2024-07-02 11:50:15 +09:00
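The replace-based edit behaviour listed above (edit only the first occurrence, raise a value error when `to_replace` is empty) can be sketched as follows. This is an illustration under assumptions, not the agentskills implementation: the function name `replace_first` and its signature are hypothetical, and leaving the content unchanged when nothing matches is a choice made for this sketch.

```python
def replace_first(content: str, to_replace: str, new_content: str) -> str:
    """Replace only the first occurrence of `to_replace` in `content`."""
    if not to_replace:
        # An empty search string would match everywhere, so reject it.
        raise ValueError("`to_replace` must not be empty")
    if to_replace not in content:
        # Assumption for this sketch: no match leaves the content unchanged.
        return content
    # The third argument limits str.replace to a single replacement.
    return content.replace(to_replace, new_content, 1)


if __name__ == "__main__":
    print(replace_first("a = 1\na = 1\n", "a = 1", "a = 2"))
```

Limiting the replacement to one occurrence keeps an edit from silently rewriting later, unrelated matches in the same file.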
Xingyao Wang
6a0ffc5c61
[Evaluation] Use the latest official SWE-Bench Dockerization for evaluation ( #2728 )
...
* add newline after patch to fix patch apply
* new swebench wip
* add newline after patch to fix patch apply
* only add newline if not empty
* update swebench source and update
* update gitignore for swebench eval
* update old prep_eval
* update gitignore
* add scripts for push and pull swebench images
* update eval_infer.sh
* update eval_infer for new docker workflow
* update script to create markdown report based on report.json
* update eval infer to use update output
* update readme
* only move result to folder if running whole file
* remove set-x
* update conversion script
* Update evaluation/swe_bench/README.md
* Update evaluation/swe_bench/README.md
* Update evaluation/swe_bench/README.md
* make sure last line end with newline
* switch to a fix-attempt branch of swebench
* Update evaluation/swe_bench/README.md
* Update evaluation/swe_bench/README.md
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
2024-07-01 23:58:30 +00:00
Engel Nyst
2d9bb56763
Add ability to restore the cli session (optional) ( #2699 )
...
* add ability to restore the main session
* add quick log
* rename to cli session
2024-06-30 06:56:55 +00:00
Engel Nyst
874b4c9075
CLI concurrency ( #2695 )
...
* add session id in cli, evals
* fix main sid
2024-06-30 04:04:30 +02:00
Xingyao Wang
15e0c524f4
default to not use hint for eval ( #2696 )
2024-06-29 21:27:57 +00:00
Xingyao Wang
e8cb6803df
[Evaluation] Improve patch apply in SWE-Bench ( #2684 )
...
* add newline after patch to fix patch apply
* only add newline if not empty
2024-06-29 14:11:07 +08:00
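A minimal sketch of the two bullets above, assuming the model patch is held as a Python string before being written to disk and applied. The helper name `ensure_trailing_newline` is hypothetical; the repository's actual code may differ.

```python
def ensure_trailing_newline(patch: str) -> str:
    """Append a newline to a non-empty patch that lacks one."""
    # An empty patch stays empty, so it can still be detected as "no patch".
    if patch and not patch.endswith("\n"):
        return patch + "\n"
    return patch


if __name__ == "__main__":
    print(repr(ensure_trailing_newline("--- a/f.py\n+++ b/f.py")))
    print(repr(ensure_trailing_newline("")))
```

Skipping the empty case matters because an added newline would turn an empty patch into a non-empty string.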
மனோஜ்குமார் பழனிச்சாமி
af9385322b
Refactor: Simplify message formatting ( #2670 )
...
Removed redundant `str()` conversion in f-string.
2024-06-28 07:34:26 +02:00
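As a small illustration of why the `str()` call is redundant: with no format spec, an f-string replacement field formats the value through `format()`, which for ordinary objects falls back to `str()`. The `Status` class below is hypothetical and exists only for this sketch.

```python
class Status:
    """Hypothetical object with a custom string representation."""

    def __init__(self, code: int) -> None:
        self.code = code

    def __str__(self) -> str:
        return f"status={self.code}"


status = Status(404)

# Explicit conversion inside the f-string ...
verbose = f"Request failed: {str(status)}"
# ... gives the same text as letting the f-string do the conversion.
concise = f"Request failed: {status}"

if __name__ == "__main__":
    print(verbose == concise)
```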
Jiayi Pan
917d96e06f
Fix doc error in evals ( #2654 )
2024-06-27 16:13:47 +00:00
Xavier Vergés
cd91d45b44
Allow SANDBOX_CONTAINER_IMAGEs built from opendevin/sandbox:main ( #2622 )
2024-06-26 12:05:07 +08:00
Xingyao Wang
6de584d77d
update swe-bench output with eval results ( #2606 )
2024-06-24 08:07:28 +09:00
Graham Neubig
cab7a288ca
Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings ( #2597 )
...
* Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings
* Update evaluation/webarena/scripts/run_infer.sh
---------
Co-authored-by: OpenDevin <opendevin@all-hands.dev >
2024-06-23 03:43:43 +00:00
மனோஜ்குமார் பழனிச்சாமி
41564c2eac
Use :main instead of :latest ( #2539 )
...
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
2024-06-21 03:57:50 +00:00
Boxuan Li
feabc97aba
Evaluation time travel: build sandbox on the fly ( #2491 )
2024-06-20 20:22:02 -06:00
Xingyao Wang
b569ba710d
docs: Add visualizer instruction for SWE-Bench ( #2529 )
...
* Update README.md for visualizer instruction
* Polish the visualization guidance (#2531 )
* fix conda create error
* fix and polish the readme for visualization
* Update README.md
---------
Co-authored-by: Haofei Yu <haofeiy@cs.cmu.edu >
2024-06-19 20:41:09 +00:00
Xingyao Wang
1f379bebc2
Update README.md ( #2505 )
...
LGTM
2024-06-18 18:14:21 +02:00
Boxuan Li
6f235937cf
Evaluation time travel: allow evaluation on a specific version ( #2356 )
...
* Time travel for evaluation
* Fix source script path
* Exit script if given version doesn't exist
* Exit on failure
* Update README
* Change scripts of all other benchmarks
* Modify README files
* Fix logic_reasoning README
2024-06-16 10:25:14 -04:00
super-dainiu
563bc41fd3
Use LLM to analyze ML-Bench failure cases ( #2399 )
...
* add ml-bench w/o exec env
* fix typos (#1956 )
no functional change
* Refactored Logs (#1939 )
* [Feat] A competitive Web Browsing agent (#1856 )
* initial attempt at a browsing only agent
* add browsing agent
* update
* implement agent
* update
* fix comments
* remove unnecessary things from memory extras
* update image processing
---------
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
* Update README.md SWE-bench score (#1959 )
* Update README.md SWE-bench score
Our most recent results on swe-bench lite are 25%, so this updates the README accordingly.
* Update
* fix: llm is_local function logic error (#1961 )
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
* doc: update documentation about poetry update (#1962 )
* add doc
* Update Development.md
---------
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* feat: add metrics related to cost for better observability (#1944 )
* add metrics for total_cost
* make lint
* refact codeact
* change metrics into llm
* add costs list, add into state
* refactor log completion
* refactor and test others
* make lint
* Update opendevin/core/metrics.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/llm/llm.py
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* refactor
* add code
---------
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* doc: add more cmd in unit test documentation (#1963 )
* --- (#1975 )
updated-dependencies:
- dependency-name: boto3
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1976 )
updated-dependencies:
- dependency-name: litellm
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Logging security (#1943 )
* update .gitignore
* Rename the confusing 'INFO' style to 'DETAIL'
* override str and repr
* feat: api_key desensitize
* feat: add SensitiveDataFilter in file handler
* tweak regex, add tests
* more tweaks, include other attrs
* add env vars, those with equivalent config
* fix tests
* tests are invaluable
---------
Co-authored-by: Shimada666 <649940882@qq.com >
* --- (#1967 )
updated-dependencies:
- dependency-name: react-dom
dependency-type: direct:production
update-type: version-update:semver-minor
- dependency-name: "@types/react-dom"
dependency-type: direct:development
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1968 )
updated-dependencies:
- dependency-name: "@reduxjs/toolkit"
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1969 )
updated-dependencies:
- dependency-name: husky
dependency-type: direct:development
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1970 )
updated-dependencies:
- dependency-name: tailwind-merge
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1971 )
updated-dependencies:
- dependency-name: i18next
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
* Refactor session management (#1810 )
* refactor session mgmt
* defer file handling to runtime
* add todo
* refactor sessions a bit more
* remove messages logic from FE
* fix up socket handshake
* refactor frontend auth a bit
* first pass at redoing file explorer
* implement directory suffix
* fix up file tree
* close agent on websocket close
* remove session saving
* move file refresh
* remove getWorkspace
* plumb path/code differently
* fix build issues
* fix the tests
* fix npm build
* add session rehydration
* fix event serialization
* logspam
* fix user message rehydration
* add get_event fn
* agent state restoration
* change history tracking for codeact
* fix responsiveness of init
* fix lint
* lint
* delint
* fix prop
* update tests
* logspam
* lint
* fix test
* revert codeact
* change fileService to use API
* fix up session loading
* delint
* delint
* fix integration tests
* revert test
* fix up access to options endpoints
* fix initial files load
* delint
* fix file initialization
* fix mock server
* fix lint
* fix auth for html
* Update frontend/src/i18n/translation.json
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* refactor sessions and sockets
* avoid reinitializing the same session
* fix reconnect issue
* change up intro message
* more guards on reinit
* rename agent_session
* delint
* fix a bunch of tests
* delint
* fix last test
* remove code editor context
* fix build
* fix any
* fix dot notation
* Update frontend/src/services/api.ts
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* fix up error handling
* Update opendevin/server/session/agent.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/server/session/agent.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update frontend/src/services/session.ts
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* fix build errs
* fix else
* add closed state
* delint
* Update opendevin/server/session/session.py
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
---------
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
Co-authored-by: Graham Neubig <neubig@gmail.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
* fix #1960 (#1964 )
* Add ruff for shared mutable defaults (B) (#1938 )
* Add ruff for shared mutable defaults (B)
* Apply B006, B008 on current files, except fast API
* Update agenthub/SWE_agent/prompts.py
Co-authored-by: Graham Neubig <neubig@gmail.com >
* fix unintended behavior change
* this is correct, tell Ruff to leave it alone
---------
Co-authored-by: Graham Neubig <neubig@gmail.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Refactor integration testing CI, add optional Mac tests, and mark a few agents as deprecated (#1888 )
* Add MacOS to integration tests
* Switch back to python 3.11
* Install Docker for macos pipeline
* regenerate.sh: Use environmental variable for sandbox type
* Pack different agents' tests into a single check
* Fix CodeAct tests
* Reduce file match and extensive debug logs
* Add TEST_IN_CI mode that reports codecov
* Small fix: don't quit if reusing old responses failed
* Merge codecov results
* Fix typos
* Remove coverage merge step - codecov automatically does that
* Make mac integration tests as optional - too slow
* Fix codecov args
* Add comments in yaml
* Include sandbox type in codecov report name
* Fix codecov report merge
* Revert renaming of test_matrix_success
* Remove SWEAgent and PlannerAgent from tests
* Mark planner agent and SWE agent as deprecated
* CodeCov: Ignore planner and sweagent
* Revert "Remove SWEAgent and PlannerAgent from tests"
This reverts commit 040cb3bfb9 .
* Remove all tests for SWE Agent
* Only keep basic tests for MonologueAgent and PlannerAgent
* Mark SWE Agent as deprecated, and ignore code coverage for it
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
* Fix Repeated Responses in Chat by Adding IPythonRunCellObservation (#1987 )
Co-authored-by: jianghongwei <jianghongwei@58.com >
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
* Save CI cycles for backend tests (#1985 )
* Fix typo in prompt (#1992 )
* Refactor monologue and SWE agent to use the messages in state history (#1863 )
* Refactor monologue to use the messages in state history
* add messages, clean up
* fix monologue
* update integration tests
* move private method
* update SWE agent to use the history from State
* integration tests for SWE agent
* rename monologue to initial_thoughts, since that is what it is
* fix: catch the exception for missing session files when initializing EventStream (e.g. when creating a new session with no session files stored). (#1994 )
* add ml-bench in readme
* Bump boto3 from 1.34.110 to 1.34.111 (#2001 )
Bumps [boto3](https://github.com/boto/boto3 ) from 1.34.110 to 1.34.111.
- [Release notes](https://github.com/boto/boto3/releases )
- [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst )
- [Commits](https://github.com/boto/boto3/compare/1.34.110...1.34.111 )
---
updated-dependencies:
- dependency-name: boto3
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump docker from 7.0.0 to 7.1.0 (#2002 )
Bumps [docker](https://github.com/docker/docker-py ) from 7.0.0 to 7.1.0.
- [Release notes](https://github.com/docker/docker-py/releases )
- [Commits](https://github.com/docker/docker-py/compare/7.0.0...7.1.0 )
---
updated-dependencies:
- dependency-name: docker
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump litellm from 1.37.20 to 1.38.0 (#2005 )
Bumps [litellm](https://github.com/BerriAI/litellm ) from 1.37.20 to 1.38.0.
- [Release notes](https://github.com/BerriAI/litellm/releases )
- [Commits](https://github.com/BerriAI/litellm/compare/v1.37.20...v1.38.0 )
---
updated-dependencies:
- dependency-name: litellm
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix SWE-Bench evaluation due to setuptools version (#1995 )
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* Revert "correctly setup plugins for swebench eval"
This reverts commit 2bd1055673 .
* bump version
* fix session state after resuming (#1999 )
* fix state resuming
* fix session reconnection
* fix lint
* Implement `agentskills` for OpenDevin to helpfully improve edit AND including more useful tools/skills (#1941 )
* add draft for skills
* Implement and test agentskills functions: open_file, goto_line, scroll_down, scroll_up, create_file, search_dir, search_file, find_file
* Remove new_sample.txt file
* add some work from opendevin w/ fixes
* Add unit tests for agentskills module
* fix some issues and updated tests
* add more tests for open
* tweak and handle goto_line
* add tests for some edge cases
* add tests for scrolling
* add tests for edit
* add tests for search_dir
* update tests to use pytest
* use pytest --forked to keep file op unit tests from interfering with each other via global var
* update doc based on swe agent tool
* update and add tests for find_file and search_file
* move agent_skills to plugins
* add agentskills as plugin and docs
* add agentskill to ssh box and fix sandbox integration
* remove extra returns in doc
* add agentskills to initial tool for jupyter
* support re-init jupyter kernel (for agentskills) after restart
* fix print window's issue with indentation and add testcases
* add prompt for codeact with the newest edit primitives
* modify the way line number is presented (remove leading space)
* change prompt to the newest display format
* support tracking of costs via metrics
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* implement and add tests for py linting
* remove extra text arg for incompatible subprocess ver
* remove sample.txt
* update test_edits integration tests
* fix all integration
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/runtime/plugins/agent_skills/agentskills.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* Revert "correctly setup plugins for swebench eval"
This reverts commit 2bd1055673 .
* bump version
* remove _AGENT_SKILLS_DOCS
* move flake8 to test dep
* update poetry.lock
* remove extra arg
* reduce max iter for eval
* update poetry
* fix integration tests
---------
Co-authored-by: OpenDevin <opendevin@opendevin.ai >
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* build: Add poetry command to use Python 3.11 for environment setup (#1972 )
* Bump @react-types/shared from 3.23.0 to 3.23.1 in /frontend (#2006 )
Bumps [@react-types/shared](https://github.com/adobe/react-spectrum ) from 3.23.0 to 3.23.1.
- [Release notes](https://github.com/adobe/react-spectrum/releases )
- [Commits](https://github.com/adobe/react-spectrum/compare/@react-types/shared@3.23.0...@react-types/shared@3.23.1 )
---
updated-dependencies:
- dependency-name: "@react-types/shared"
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump @types/react-syntax-highlighter in /frontend (#2007 )
Bumps [@types/react-syntax-highlighter](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react-syntax-highlighter ) from 15.5.11 to 15.5.13.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases )
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react-syntax-highlighter )
---
updated-dependencies:
- dependency-name: "@types/react-syntax-highlighter"
dependency-type: direct:development
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump @typescript-eslint/parser from 7.9.0 to 7.10.0 in /frontend (#2008 )
Bumps [@typescript-eslint/parser](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/parser ) from 7.9.0 to 7.10.0.
- [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases )
- [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/parser/CHANGELOG.md )
- [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v7.10.0/packages/parser )
---
updated-dependencies:
- dependency-name: "@typescript-eslint/parser"
dependency-type: direct:development
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump lint-staged from 15.2.2 to 15.2.4 in /frontend (#2009 )
Bumps [lint-staged](https://github.com/okonet/lint-staged ) from 15.2.2 to 15.2.4.
- [Release notes](https://github.com/okonet/lint-staged/releases )
- [Changelog](https://github.com/lint-staged/lint-staged/blob/master/CHANGELOG.md )
- [Commits](https://github.com/okonet/lint-staged/compare/v15.2.2...v15.2.4 )
---
updated-dependencies:
- dependency-name: lint-staged
dependency-type: direct:development
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Update README.md
* Update README.md
* add run_infer.sh
* fix input output
* fix docker sandbox
* fix run
* update and clean run_infer.py
* add script to clean up dockers
* update repo uid
* add description
* new
* Update README.md
* use root for sandbox
* update readme
* update ml-bench conda env
* update readme
* update readme
* use try except
* modify raise exception
* add int
* update README
* longer time
* fix existing issues
* fix existing issue
* new docker image
* add metrics of cost
* add result parsing cost
* fix
* fix
* update summarize
* fix
* add analyze
* update readme
* use 4o
* add eval output
---------
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: Ubuntu <ubuntu@ip-172-31-31-157.ec2.internal >
Co-authored-by: RainRat <rainrat78@yahoo.ca >
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
Co-authored-by: Frank Xu <frankxu2004@gmail.com >
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
Co-authored-by: Graham Neubig <neubig@gmail.com >
Co-authored-by: Shimada666 <649940882@qq.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
Co-authored-by: Robert Brennan <accounts@rbren.io >
Co-authored-by: Rahul Anand <62982824+zeul22@users.noreply.github.com >
Co-authored-by: jiangleo <jiangleo@users.noreply.github.com >
Co-authored-by: jianghongwei <jianghongwei@58.com >
Co-authored-by: Jeremi Joslin <jeremi@newlogic.com >
Co-authored-by: Aaron Xia <zhhuaxia@gmail.com >
Co-authored-by: OpenDevin <opendevin@opendevin.ai >
Co-authored-by: DaxServer <7479937+DaxServer@users.noreply.github.com >
Co-authored-by: Robert <871607149@qq.com >
2024-06-13 09:30:55 +08:00
Xingyao Wang
b3bdc44292
mkdir infer_logs instead of logs ( #2382 )
2024-06-11 07:18:19 +08:00
Xingyao Wang
11a2d1682d
Minor SWE-Bench inference config tweak ( #2381 )
...
* save infer logs to infer_logs
* set max budget for swebench eval
2024-06-10 20:14:22 +00:00
Xingyao Wang
a6ba6c5277
Add SWEBench-docker eval ( #2085 )
...
* add initial version of swebench-docker eval
* update the branch of git repo
* add poetry run
* download dev set too and pre-load f2p and p2p
* update eval infer script
* increase timeout
* add poetry run
* install swebench from our fork
* update script
* update loc
* support single instance debug
* replace \r\n from model patch
* replace eval docker from namespace xingyaoww
* update script to auto detect swe-bench format jsonl
* support eval infer on single instance id
* change log output dir to logs
* update summarise result script
* update README
* update readme
* tweak branch
* Update evaluation/swe_bench/scripts/eval/prep_eval.sh
Co-authored-by: Graham Neubig <neubig@gmail.com >
---------
Co-authored-by: Graham Neubig <neubig@gmail.com >
2024-06-10 19:30:40 +00:00
Yufan Song
f4cb192ebe
Fix llm key leaks bug ( #2376 )
...
* fix bug
* fix bug
* add
2024-06-10 15:55:33 +00:00
Robert
7fc57650f3
BioCoder integration ( #2076 )
...
* prepare execution and inference
* Create README.md
* Update README.md
* Update evaluation/biocoder/README.md
* Update evaluation/swe_bench/swe_env_box.py
* switch to biocoder docker container and test-specific code
* code for copying and running test files into container
* add metrics
* add readme
* Biocoder evaluation code finished (rewrite testing infrastructure, prompt tuning, and bug fixes)
* Update README.md
---------
Co-authored-by: lilbillybiscuit <qianbill2014@outlook.com >
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
Co-authored-by: yufansong <yufan@risingwave-labs.com >
2024-06-10 11:11:40 +08:00
RainRat
745ae42a72
fix typos ( #2352 )
2024-06-09 12:57:58 -07:00
yueqis
68d9ad61cf
Feat: Support Gorilla APIBench ( #2081 )
...
* removed unused files from gorilla
* Update run_infer.py, removed unused imports
* Update utils.py
* Update ast_eval_hf.py
* Update ast_eval_tf.py
* Update ast_eval_th.py
* Create README.md
* Update run_infer.py
* make lint
* Update run_infer.py
* fix lint
---------
Co-authored-by: yufansong <yufan@risingwave-labs.com >
2024-06-08 16:54:54 +00:00
Jaskirat Singh
e8307608c2
Support gpqa benchmark evaluation ( #2080 )
...
* feat: add gpqa benchmark evaluation
* add metrics
* reset configs in final block
* make lint
---------
Co-authored-by: yufansong <yufan@risingwave-labs.com >
2024-06-08 16:24:24 +00:00
yueqis
82d4d25b09
feat: support ToolQA benchmark ( #2263 )
...
* Add files via upload
* Update README.md
* Update run_infer.py
* Update utils.py
* make lint
* Update evaluation/toolqa/run_infer.py
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
Co-authored-by: yufansong <yufan@risingwave-labs.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
2024-06-08 07:54:01 -04:00
super-dainiu
beabcce16d
[Hotfix] Fix ML-Bench continue `run_inference.py` ( #2284 )
...
* add ml-bench w/o exec env
* fix typos (#1956 )
no functional change
* Refactored Logs (#1939 )
* [Feat] A competitive Web Browsing agent (#1856 )
* initial attempt at a browsing only agent
* add browsing agent
* update
* implement agent
* update
* fix comments
* remove unnecessary things from memory extras
* update image processing
---------
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
* Update README.md SWE-bench score (#1959 )
* Update README.md SWE-bench score
Our most recent results on swe-bench lite are 25%, so this updates the README accordingly.
* Update
* fix: llm is_local function logic error (#1961 )
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
* doc: update documentation about poetry update (#1962 )
* add doc
* Update Development.md
---------
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* feat: add metrics related to cost for better observability (#1944 )
* add metrics for total_cost
* make lint
* refact codeact
* change metrics into llm
* add costs list, add into state
* refactor log completion
* refactor and test others
* make lint
* Update opendevin/core/metrics.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/llm/llm.py
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* refactor
* add code
---------
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* doc: add more cmd in unit test documentation (#1963 )
* --- (#1975 )
updated-dependencies:
- dependency-name: boto3
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1976 )
updated-dependencies:
- dependency-name: litellm
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Logging security (#1943 )
* update .gitignore
* Rename the confusing 'INFO' style to 'DETAIL'
* override str and repr
* feat: api_key desensitize
* feat: add SensitiveDataFilter in file handler
* tweak regex, add tests
* more tweaks, include other attrs
* add env vars, those with equivalent config
* fix tests
* tests are invaluable
---------
Co-authored-by: Shimada666 <649940882@qq.com >
* --- (#1967 )
updated-dependencies:
- dependency-name: react-dom
dependency-type: direct:production
update-type: version-update:semver-minor
- dependency-name: "@types/react-dom"
dependency-type: direct:development
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1968 )
updated-dependencies:
- dependency-name: "@reduxjs/toolkit"
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1969 )
updated-dependencies:
- dependency-name: husky
dependency-type: direct:development
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1970 )
updated-dependencies:
- dependency-name: tailwind-merge
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1971 )
updated-dependencies:
- dependency-name: i18next
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
* Refactor session management (#1810 )
* refactor session mgmt
* defer file handling to runtime
* add todo
* refactor sessions a bit more
* remove messages logic from FE
* fix up socket handshake
* refactor frontend auth a bit
* first pass at redoing file explorer
* implement directory suffix
* fix up file tree
* close agent on websocket close
* remove session saving
* move file refresh
* remove getWorkspace
* plumb path/code differently
* fix build issues
* fix the tests
* fix npm build
* add session rehydration
* fix event serialization
* logspam
* fix user message rehydration
* add get_event fn
* agent state restoration
* change history tracking for codeact
* fix responsiveness of init
* fix lint
* lint
* delint
* fix prop
* update tests
* logspam
* lint
* fix test
* revert codeact
* change fileService to use API
* fix up session loading
* delint
* delint
* fix integration tests
* revert test
* fix up access to options endpoints
* fix initial files load
* delint
* fix file initialization
* fix mock server
* fix lint
* fix auth for html
* Update frontend/src/i18n/translation.json
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* refactor sessions and sockets
* avoid reinitializing the same session
* fix reconnect issue
* change up intro message
* more guards on reinit
* rename agent_session
* delint
* fix a bunch of tests
* delint
* fix last test
* remove code editor context
* fix build
* fix any
* fix dot notation
* Update frontend/src/services/api.ts
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* fix up error handling
* Update opendevin/server/session/agent.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/server/session/agent.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update frontend/src/services/session.ts
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* fix build errs
* fix else
* add closed state
* delint
* Update opendevin/server/session/session.py
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
---------
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
Co-authored-by: Graham Neubig <neubig@gmail.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
* fix #1960 (#1964 )
* Add ruff for shared mutable defaults (B) (#1938 )
* Add ruff for shared mutable defaults (B)
* Apply B006, B008 on current files, except fast API
* Update agenthub/SWE_agent/prompts.py
Co-authored-by: Graham Neubig <neubig@gmail.com >
* fix unintended behavior change
* this is correct, tell Ruff to leave it alone
---------
Co-authored-by: Graham Neubig <neubig@gmail.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Refactor integration testing CI, add optional Mac tests, and mark a few agents as deprecated (#1888 )
* Add MacOS to integration tests
* Switch back to python 3.11
* Install Docker for macos pipeline
* regenerate.sh: Use environmental variable for sandbox type
* Pack different agents' tests into a single check
* Fix CodeAct tests
* Reduce file match and extensive debug logs
* Add TEST_IN_CI mode that reports codecov
* Small fix: don't quit if reusing old responses failed
* Merge codecov results
* Fix typos
* Remove coverage merge step - codecov automatically does that
* Make mac integration tests optional - too slow
* Fix codecov args
* Add comments in yaml
* Include sandbox type in codecov report name
* Fix codecov report merge
* Revert renaming of test_matrix_success
* Remove SWEAgent and PlannerAgent from tests
* Mark planner agent and SWE agent as deprecated
* CodeCov: Ignore planner and sweagent
* Revert "Remove SWEAgent and PlannerAgent from tests"
This reverts commit 040cb3bfb9 .
* Remove all tests for SWE Agent
* Only keep basic tests for MonologueAgent and PlannerAgent
* Mark SWE Agent as deprecated, and ignore code coverage for it
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
* Fix Repeated Responses in Chat by Adding IPythonRunCellObservation (#1987 )
Co-authored-by: jianghongwei <jianghongwei@58.com >
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
* Save CI cycles for backend tests (#1985 )
* Fix typo in prompt (#1992 )
* Refactor monologue and SWE agent to use the messages in state history (#1863 )
* Refactor monologue to use the messages in state history
* add messages, clean up
* fix monologue
* update integration tests
* move private method
* update SWE agent to use the history from State
* integration tests for SWE agent
* rename monologue to initial_thoughts, since that is what it is
* fix: catch the exception raised when the session file does not exist during EventStream init (e.g. when creating a new session with no session files stored). (#1994 )
* add ml-bench in readme
* Bump boto3 from 1.34.110 to 1.34.111 (#2001 )
Bumps [boto3](https://github.com/boto/boto3 ) from 1.34.110 to 1.34.111.
- [Release notes](https://github.com/boto/boto3/releases )
- [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst )
- [Commits](https://github.com/boto/boto3/compare/1.34.110...1.34.111 )
---
updated-dependencies:
- dependency-name: boto3
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump docker from 7.0.0 to 7.1.0 (#2002 )
Bumps [docker](https://github.com/docker/docker-py ) from 7.0.0 to 7.1.0.
- [Release notes](https://github.com/docker/docker-py/releases )
- [Commits](https://github.com/docker/docker-py/compare/7.0.0...7.1.0 )
---
updated-dependencies:
- dependency-name: docker
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump litellm from 1.37.20 to 1.38.0 (#2005 )
Bumps [litellm](https://github.com/BerriAI/litellm ) from 1.37.20 to 1.38.0.
- [Release notes](https://github.com/BerriAI/litellm/releases )
- [Commits](https://github.com/BerriAI/litellm/compare/v1.37.20...v1.38.0 )
---
updated-dependencies:
- dependency-name: litellm
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix SWE-Bench evaluation due to setuptools version (#1995 )
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* Revert "correctly setup plugins for swebench eval"
This reverts commit 2bd1055673 .
* bump version
* fix session state after resuming (#1999 )
* fix state resuming
* fix session reconnection
* fix lint
* Implement `agentskills` for OpenDevin to helpfully improve edit AND including more useful tools/skills (#1941 )
* add draft for skills
* Implement and test agentskills functions: open_file, goto_line, scroll_down, scroll_up, create_file, search_dir, search_file, find_file
* Remove new_sample.txt file
* add some work from opendevin w/ fixes
* Add unit tests for agentskills module
* fix some issues and updated tests
* add more tests for open
* tweak and handle goto_line
* add tests for some edge cases
* add tests for scrolling
* add tests for edit
* add tests for search_dir
* update tests to use pytest
* use pytest --forked to keep file op unit tests from interfering with each other via global var
* update doc based on swe agent tool
* update and add tests for find_file and search_file
* move agent_skills to plugins
* add agentskills as plugin and docs
* add agentskill to ssh box and fix sandbox integration
* remove extra returns in doc
* add agentskills to initial tool for jupyter
* support re-init jupyter kernel (for agentskills) after restart
* fix print window's issue with indentation and add testcases
* add prompt for codeact with the newest edit primitives
* modify the way line number is presented (remove leading space)
* change prompt to the newest display format
* support tracking of costs via metrics
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* implement and add tests for py linting
* remove extra text arg for incompatible subprocess ver
* remove sample.txt
* update test_edits integration tests
* fix all integration
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/runtime/plugins/agent_skills/agentskills.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* Revert "correctly setup plugins for swebench eval"
This reverts commit 2bd1055673 .
* bump version
* remove _AGENT_SKILLS_DOCS
* move flake8 to test dep
* update poetry.lock
* remove extra arg
* reduce max iter for eval
* update poetry
* fix integration tests
---------
Co-authored-by: OpenDevin <opendevin@opendevin.ai >
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* build: Add poetry command to use Python 3.11 for environment setup (#1972 )
* Bump @react-types/shared from 3.23.0 to 3.23.1 in /frontend (#2006 )
Bumps [@react-types/shared](https://github.com/adobe/react-spectrum ) from 3.23.0 to 3.23.1.
- [Release notes](https://github.com/adobe/react-spectrum/releases )
- [Commits](https://github.com/adobe/react-spectrum/compare/@react-types/shared@3.23.0...@react-types/shared@3.23.1 )
---
updated-dependencies:
- dependency-name: "@react-types/shared"
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump @types/react-syntax-highlighter in /frontend (#2007 )
Bumps [@types/react-syntax-highlighter](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react-syntax-highlighter ) from 15.5.11 to 15.5.13.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases )
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react-syntax-highlighter )
---
updated-dependencies:
- dependency-name: "@types/react-syntax-highlighter"
dependency-type: direct:development
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump @typescript-eslint/parser from 7.9.0 to 7.10.0 in /frontend (#2008 )
Bumps [@typescript-eslint/parser](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/parser ) from 7.9.0 to 7.10.0.
- [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases )
- [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/parser/CHANGELOG.md )
- [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v7.10.0/packages/parser )
---
updated-dependencies:
- dependency-name: "@typescript-eslint/parser"
dependency-type: direct:development
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump lint-staged from 15.2.2 to 15.2.4 in /frontend (#2009 )
Bumps [lint-staged](https://github.com/okonet/lint-staged ) from 15.2.2 to 15.2.4.
- [Release notes](https://github.com/okonet/lint-staged/releases )
- [Changelog](https://github.com/lint-staged/lint-staged/blob/master/CHANGELOG.md )
- [Commits](https://github.com/okonet/lint-staged/compare/v15.2.2...v15.2.4 )
---
updated-dependencies:
- dependency-name: lint-staged
dependency-type: direct:development
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Update README.md
* Update README.md
* add run_infer.sh
* fix input output
* fix docker sandbox
* fix run
* update and clean run_infer.py
* add script to clean up dockers
* update repo uid
* add description
* new
* Update README.md
* use root for sandbox
* update readme
* update ml-bench conda env
* update readme
* update readme
* use try except
* modify raise exception
* add int
* update README
* longer time
* fix existing issues
* fix existing issue
* new docker image
* add metrics of cost
* add result parsing cost
* fix
* fix
* update summarize
* fix
---------
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: Ubuntu <ubuntu@ip-172-31-31-157.ec2.internal >
Co-authored-by: RainRat <rainrat78@yahoo.ca >
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
Co-authored-by: Frank Xu <frankxu2004@gmail.com >
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
Co-authored-by: Graham Neubig <neubig@gmail.com >
Co-authored-by: Shimada666 <649940882@qq.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
Co-authored-by: Robert Brennan <accounts@rbren.io >
Co-authored-by: Rahul Anand <62982824+zeul22@users.noreply.github.com >
Co-authored-by: jiangleo <jiangleo@users.noreply.github.com >
Co-authored-by: jianghongwei <jianghongwei@58.com >
Co-authored-by: Jeremi Joslin <jeremi@newlogic.com >
Co-authored-by: Aaron Xia <zhhuaxia@gmail.com >
Co-authored-by: OpenDevin <opendevin@opendevin.ai >
Co-authored-by: DaxServer <7479937+DaxServer@users.noreply.github.com >
Co-authored-by: Robert <871607149@qq.com >
2024-06-06 03:53:21 +00:00
Frank Xu
48151bdbb0
[feat] WebArena benchmark, MiniWoB++ benchmark and related arch changes ( #2170 )
...
* add webarena, and revamp messaging for webarena eval
* add changes for browsergym
* update infer script
* fix unit tests
* update
* add multiple run for miniwob
* update instruction, remove personal path
* update
* add code for getting final reward, fix integration, add results
* add avg cost calculation
2024-06-06 09:01:20 +08:00
மனோஜ்குமார் பழனிச்சாமி
ae815b20d2
Improved logs ( #2272 )
2024-06-05 17:54:40 +05:30
Boxuan Li
208b1461ca
[AgentBench evaluation] set run_as_devin to true ( #2269 )
...
Co-authored-by: Leo <ifuryst@gmail.com >
2024-06-05 07:53:33 +00:00
Ryan H. Tran
0584e428b2
[Mint evaluation] Fix bug in stopping when the agent reaches max steps or solution proposals ( #2268 )
...
* fix: bug in stopping when the agent reaches max steps or solution proposals
* remove --eval-num-workers
* update env.py
2024-06-05 06:47:07 +00:00
super-dainiu
ebafb702e5
Add ML-Bench Evaluation with OpenDevin ( #2015 )
...
* add ml-bench w/o exec env
* fix typos (#1956 )
no functional change
* Refactored Logs (#1939 )
* [Feat] A competitive Web Browsing agent (#1856 )
* initial attempt at a browsing only agent
* add browsing agent
* update
* implement agent
* update
* fix comments
* remove unnecessary things from memory extras
* update image processing
---------
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
* Update README.md SWE-bench score (#1959 )
* Update README.md SWE-bench score
Our most recent results on swe-bench lite are 25%, so this updates the README accordingly.
* Update
* fix: llm is_local function logic error (#1961 )
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
* doc: update documentation about poetry update (#1962 )
* add doc
* Update Development.md
---------
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* feat: add metrics related to cost for better observability (#1944 )
* add metrics for total_cost
* make lint
* refact codeact
* change metrics into llm
* add costs list, add into state
* refactor log completion
* refactor and test others
* make lint
* Update opendevin/core/metrics.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/llm/llm.py
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* refactor
* add code
---------
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* doc: add more cmd in unit test documentation (#1963 )
* --- (#1975 )
updated-dependencies:
- dependency-name: boto3
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1976 )
updated-dependencies:
- dependency-name: litellm
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Logging security (#1943 )
* update .gitignore
* Rename the confusing 'INFO' style to 'DETAIL'
* override str and repr
* feat: api_key desensitize
* feat: add SensitiveDataFilter in file handler
* tweak regex, add tests
* more tweaks, include other attrs
* add env vars, those with equivalent config
* fix tests
* tests are invaluable
---------
Co-authored-by: Shimada666 <649940882@qq.com >
* --- (#1967 )
updated-dependencies:
- dependency-name: react-dom
dependency-type: direct:production
update-type: version-update:semver-minor
- dependency-name: "@types/react-dom"
dependency-type: direct:development
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1968 )
updated-dependencies:
- dependency-name: "@reduxjs/toolkit"
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1969 )
updated-dependencies:
- dependency-name: husky
dependency-type: direct:development
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1970 )
updated-dependencies:
- dependency-name: tailwind-merge
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* --- (#1971 )
updated-dependencies:
- dependency-name: i18next
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com >
* Refactor session management (#1810 )
* refactor session mgmt
* defer file handling to runtime
* add todo
* refactor sessions a bit more
* remove messages logic from FE
* fix up socket handshake
* refactor frontend auth a bit
* first pass at redoing file explorer
* implement directory suffix
* fix up file tree
* close agent on websocket close
* remove session saving
* move file refresh
* remove getWorkspace
* plumb path/code differently
* fix build issues
* fix the tests
* fix npm build
* add session rehydration
* fix event serialization
* logspam
* fix user message rehydration
* add get_event fn
* agent state restoration
* change history tracking for codeact
* fix responsiveness of init
* fix lint
* lint
* delint
* fix prop
* update tests
* logspam
* lint
* fix test
* revert codeact
* change fileService to use API
* fix up session loading
* delint
* delint
* fix integration tests
* revert test
* fix up access to options endpoints
* fix initial files load
* delint
* fix file initialization
* fix mock server
* fix lint
* fix auth for html
* Update frontend/src/i18n/translation.json
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
* refactor sessions and sockets
* avoid reinitializing the same session
* fix reconnect issue
* change up intro message
* more guards on reinit
* rename agent_session
* delint
* fix a bunch of tests
* delint
* fix last test
* remove code editor context
* fix build
* fix any
* fix dot notation
* Update frontend/src/services/api.ts
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* fix up error handling
* Update opendevin/server/session/agent.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/server/session/agent.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update frontend/src/services/session.ts
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* fix build errs
* fix else
* add closed state
* delint
* Update opendevin/server/session/session.py
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
---------
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu >
Co-authored-by: Graham Neubig <neubig@gmail.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
* fix #1960 (#1964 )
* Add ruff for shared mutable defaults (B) (#1938 )
* Add ruff for shared mutable defaults (B)
* Apply B006, B008 on current files, except fast API
* Update agenthub/SWE_agent/prompts.py
Co-authored-by: Graham Neubig <neubig@gmail.com >
* fix unintended behavior change
* this is correct, tell Ruff to leave it alone
---------
Co-authored-by: Graham Neubig <neubig@gmail.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Refactor integration testing CI, add optional Mac tests, and mark a few agents as deprecated (#1888 )
* Add MacOS to integration tests
* Switch back to python 3.11
* Install Docker for macos pipeline
* regenerate.sh: Use environmental variable for sandbox type
* Pack different agents' tests into a single check
* Fix CodeAct tests
* Reduce file match and extensive debug logs
* Add TEST_IN_CI mode that reports codecov
* Small fix: don't quit if reusing old responses failed
* Merge codecov results
* Fix typos
* Remove coverage merge step - codecov automatically does that
* Make mac integration tests optional - too slow
* Fix codecov args
* Add comments in yaml
* Include sandbox type in codecov report name
* Fix codecov report merge
* Revert renaming of test_matrix_success
* Remove SWEAgent and PlannerAgent from tests
* Mark planner agent and SWE agent as deprecated
* CodeCov: Ignore planner and sweagent
* Revert "Remove SWEAgent and PlannerAgent from tests"
This reverts commit 040cb3bfb9 .
* Remove all tests for SWE Agent
* Only keep basic tests for MonologueAgent and PlannerAgent
* Mark SWE Agent as deprecated, and ignore code coverage for it
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
* Fix Repeated Responses in Chat by Adding IPythonRunCellObservation (#1987 )
Co-authored-by: jianghongwei <jianghongwei@58.com >
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
* Save CI cycles for backend tests (#1985 )
* Fix typo in prompt (#1992 )
* Refactor monologue and SWE agent to use the messages in state history (#1863 )
* Refactor monologue to use the messages in state history
* add messages, clean up
* fix monologue
* update integration tests
* move private method
* update SWE agent to use the history from State
* integration tests for SWE agent
* rename monologue to initial_thoughts, since that is what it is
* fix: catch the exception raised when the session file does not exist during EventStream init (e.g. when creating a new session with no session files stored). (#1994 )
* add ml-bench in readme
* Bump boto3 from 1.34.110 to 1.34.111 (#2001 )
Bumps [boto3](https://github.com/boto/boto3 ) from 1.34.110 to 1.34.111.
- [Release notes](https://github.com/boto/boto3/releases )
- [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst )
- [Commits](https://github.com/boto/boto3/compare/1.34.110...1.34.111 )
---
updated-dependencies:
- dependency-name: boto3
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump docker from 7.0.0 to 7.1.0 (#2002 )
Bumps [docker](https://github.com/docker/docker-py ) from 7.0.0 to 7.1.0.
- [Release notes](https://github.com/docker/docker-py/releases )
- [Commits](https://github.com/docker/docker-py/compare/7.0.0...7.1.0 )
---
updated-dependencies:
- dependency-name: docker
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump litellm from 1.37.20 to 1.38.0 (#2005 )
Bumps [litellm](https://github.com/BerriAI/litellm ) from 1.37.20 to 1.38.0.
- [Release notes](https://github.com/BerriAI/litellm/releases )
- [Commits](https://github.com/BerriAI/litellm/compare/v1.37.20...v1.38.0 )
---
updated-dependencies:
- dependency-name: litellm
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix SWE-Bench evaluation due to setuptools version (#1995 )
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* Revert "correctly setup plugins for swebench eval"
This reverts commit 2bd1055673 .
* bump version
* fix session state after resuming (#1999 )
* fix state resuming
* fix session reconnection
* fix lint
* Implement `agentskills` for OpenDevin to helpfully improve edit AND including more useful tools/skills (#1941 )
* add draft for skills
* Implement and test agentskills functions: open_file, goto_line, scroll_down, scroll_up, create_file, search_dir, search_file, find_file
* Remove new_sample.txt file
* add some work from opendevin w/ fixes
* Add unit tests for agentskills module
* fix some issues and updated tests
* add more tests for open
* tweak and handle goto_line
* add tests for some edge cases
* add tests for scrolling
* add tests for edit
* add tests for search_dir
* update tests to use pytest
* use pytest --forked to keep file op unit tests from interfering with each other via global var
* update doc based on swe agent tool
* update and add tests for find_file and search_file
* move agent_skills to plugins
* add agentskills as plugin and docs
* add agentskill to ssh box and fix sandbox integration
* remove extra returns in doc
* add agentskills to initial tool for jupyter
* support re-init jupyter kernel (for agentskills) after restart
* fix print window's issue with indentation and add testcases
* add prompt for codeact with the newest edit primitives
* modify the way line number is presented (remove leading space)
* change prompt to the newest display format
* support tracking of costs via metrics
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* implement and add tests for py linting
* remove extra text arg for incompatible subprocess ver
* remove sample.txt
* update test_edits integration tests
* fix all integration
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update opendevin/runtime/plugins/agent_skills/README.md
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update agenthub/codeact_agent/prompt.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* Update opendevin/runtime/plugins/agent_skills/agentskills.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* correctly setup plugins for swebench eval
* bump swe-bench version and add logging
* Revert "correctly setup plugins for swebench eval"
This reverts commit 2bd1055673 .
* bump version
* remove _AGENT_SKILLS_DOCS
* move flake8 to test dep
* update poetry.lock
* remove extra arg
* reduce max iter for eval
* update poetry
* fix integration tests
---------
Co-authored-by: OpenDevin <opendevin@opendevin.ai >
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com >
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk >
* build: Add poetry command to use Python 3.11 for environment setup (#1972 )
* Bump @react-types/shared from 3.23.0 to 3.23.1 in /frontend (#2006 )
Bumps [@react-types/shared](https://github.com/adobe/react-spectrum ) from 3.23.0 to 3.23.1.
- [Release notes](https://github.com/adobe/react-spectrum/releases )
- [Commits](https://github.com/adobe/react-spectrum/compare/@react-types/shared@3.23.0...@react-types/shared@3.23.1 )
---
updated-dependencies:
- dependency-name: "@react-types/shared"
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump @types/react-syntax-highlighter in /frontend (#2007 )
Bumps [@types/react-syntax-highlighter](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react-syntax-highlighter ) from 15.5.11 to 15.5.13.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases )
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react-syntax-highlighter )
---
updated-dependencies:
- dependency-name: "@types/react-syntax-highlighter"
dependency-type: direct:development
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump @typescript-eslint/parser from 7.9.0 to 7.10.0 in /frontend (#2008 )
Bumps [@typescript-eslint/parser](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/parser ) from 7.9.0 to 7.10.0.
- [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases )
- [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/parser/CHANGELOG.md )
- [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v7.10.0/packages/parser )
---
updated-dependencies:
- dependency-name: "@typescript-eslint/parser"
dependency-type: direct:development
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump lint-staged from 15.2.2 to 15.2.4 in /frontend (#2009 )
Bumps [lint-staged](https://github.com/okonet/lint-staged ) from 15.2.2 to 15.2.4.
- [Release notes](https://github.com/okonet/lint-staged/releases )
- [Changelog](https://github.com/lint-staged/lint-staged/blob/master/CHANGELOG.md )
- [Commits](https://github.com/okonet/lint-staged/compare/v15.2.2...v15.2.4 )
---
updated-dependencies:
- dependency-name: lint-staged
dependency-type: direct:development
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Update README.md
* Update README.md
* add run_infer.sh
* fix input output
* fix docker sandbox
* fix run
* update and clean run_infer.py
* add script to clean up dockers
* update repo uid
* add description
* new
* Update README.md
* use root for sandbox
* update readme
* update ml-bench conda env
* update readme
* update readme
* use try except
* modify raise exception
* add int
* update README
* longer time
* fix existing issues
* fix existing issue
* new docker image
* add metrics of cost
* add result parsing cost
* fix
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-31-157.ec2.internal>
Co-authored-by: RainRat <rainrat78@yahoo.ca>
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>
Co-authored-by: Frank Xu <frankxu2004@gmail.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Shimada666 <649940882@qq.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: Rahul Anand <62982824+zeul22@users.noreply.github.com>
Co-authored-by: jiangleo <jiangleo@users.noreply.github.com>
Co-authored-by: jianghongwei <jianghongwei@58.com>
Co-authored-by: Jeremi Joslin <jeremi@newlogic.com>
Co-authored-by: Aaron Xia <zhhuaxia@gmail.com>
Co-authored-by: OpenDevin <opendevin@opendevin.ai>
Co-authored-by: DaxServer <7479937+DaxServer@users.noreply.github.com>
Co-authored-by: Robert <871607149@qq.com>
2024-06-05 01:56:39 +00:00
Leo
040d6bd806
fix: add an early exit check for agent answers in agent bench. ( #2257 )
...
Signed-off-by: ifuryst <ifuryst@gmail.com>
2024-06-04 18:45:07 -07:00
tobitege
5776474dcf
Fix SWE-Bench README typos ( #2250 )
2024-06-05 01:18:02 +00:00
Leo
9ada36e30b
fix: restore python linting. ( #2228 )
...
* fix: restore python linting.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* update: extend the Python lint check to evaluation.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Update evaluation/logic_reasoning/instruction.txt
---------
Signed-off-by: ifuryst <ifuryst@gmail.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-06-04 06:36:19 +00:00
finaltrip
05b84df9cb
chore: fix some comments ( #2234 )
...
Signed-off-by: finaltrip <finaltrip@qq.com>
2024-06-03 16:04:34 +00:00
Boxuan Li
538d1d85a2
evaluation: Reset configs in finally block ( #2214 )
2024-06-03 09:52:12 +08:00
Ryan H. Tran
22e8fb39b1
add cost metrics to evaluation outputs for all benchmarks ( #2199 )
2024-06-02 08:28:00 +00:00
Yizhe Zhang
8d79c3edbc
modify the exiting logic and reward calculation, delete unused function ( #2198 )
2024-06-02 06:38:09 +00:00
RainRat
ed6dcc8381
fix typos ( #2187 )
...
* fix typos
no functional change
* fix typos
2024-06-01 20:40:30 +00:00
Leo
2c231c57c9
Add supported benchmarks to evaluation README (AgentBench, BIRD, LogicReasoning) ( #2183 )
...
Signed-off-by: ifuryst <ifuryst@gmail.com>
2024-06-01 11:33:01 -04:00
Binyuan Hui
46dcf4bb3e
Support BIRD benchmark ( #2117 )
...
* update: change timeout from 10 to 30
* update: readme for bird evaluation
* Update evaluation/bird/run_infer.py
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
* Update evaluation/bird/README.md
Co-authored-by: Shimada666 <649940882@qq.com>
* Update evaluation/bird/README.md
Co-authored-by: Shimada666 <649940882@qq.com>
* Update evaluation/bird/run_infer.py
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
---------
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Shimada666 <649940882@qq.com>
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
2024-06-01 11:34:36 +00:00
Leo
be251b11de
Add AgentBench. ( #2012 )
...
* Add AgentBench.
* Load the datasets from HF.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Add helper functions.
* Add mock executor.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Add retriv agent answer cmd.
* Adjust the dataset.
* Refine test results.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Consolidate all AgentBench datasets and scripts into a single CSV dataset.
* Refactor dataset source.
* Update helper functions.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Fix the CRLF problem.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Separate the instance's workspace.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Add cleanup logic and error handling for sandbox closure.
* Normalized dataset
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Update README.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Update the prompt to capture the answer.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Refactor script execution paths to use absolute container workspace path.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Update AgentBench README.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Delete useless functions.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Update evaluation/agent_bench/README.md
* Add script to summarize test results from JSONL file in AgentBench
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Delete useless script and codes.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Update evaluation/agent_bench/scripts/summarise_results.py
---------
Signed-off-by: ifuryst <ifuryst@gmail.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-06-01 07:58:14 +00:00
Ryan H. Tran
01296ff79d
Add remaining subsets for MINT benchmark ( #2142 )
...
* add MMLU subset
* add theoremqa subset
* remove redundant packages from requirements.txt, adjust prompts, handle gpt3.5 propose a wrong answer after a correct answer
* add MBPP subset
* add humaneval subset
* update README
* exit actively after the agent finishes the task
2024-05-31 20:04:13 +00:00