Commit Graph

54 Commits

Author SHA1 Message Date
Xingyao Wang
53390d9885 Fix issue #4583: [Bug]: Unable to pull the full SWE-Bench test set (#4813)
Co-authored-by: openhands <openhands@all-hands.dev>
2024-11-07 22:35:20 +08:00
Xingyao Wang
966da7b7c8 feat(agent, CodeAct 2.2): native CodeAct support for Browsing (#4667)
Co-authored-by: tofarr <tofarr@gmail.com>
2024-11-05 00:27:27 +08:00
Xingyao Wang
6d19c93d19 [eval] add evaluation workflow (#4489)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-10-29 13:52:25 +00:00
Xingyao Wang
ae13171194 feat(agent): CodeAct with function calling (#4537)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: tobitege <10787084+tobitege@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: tofarr <tofarr@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-29 11:06:33 +08:00
Xingyao Wang
da548d308c [agent] LLM-based editing (#3985)
Co-authored-by: Tim O'Farrell <tofarr@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-10-22 04:51:44 +08:00
Xingyao Wang
50c13aad98 [Eval] Improve SWE-Bench Eval harness: multi-run support & entry script simplification (#4396) 2024-10-15 21:34:52 +08:00
Xingyao Wang
4dfc7a7ef0 [Eval] Add a more lightweight / easier-to-use SWE-Bench output visualizer (#4360)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-10-14 02:09:01 +00:00
Alejandro Cuadron Lafuente
a3571ec510 [Fix] Error when trying to pull all docker evaluation containers (#4244) 2024-10-08 05:03:36 +08:00
Xingyao Wang
01ae54a69d fix swebench repo/version being string (#4241) 2024-10-07 22:01:42 +08:00
Xingyao Wang
245334e89d [eval] improve update output script for swe-bench (#4180) 2024-10-04 15:10:03 +00:00
Xingyao Wang
1109637efb Update instruction for new version of eval runtime-api (#4128) 2024-09-30 23:48:38 +00:00
Xingyao Wang
8d6eda3623 fix eval_infer.sh to correctly copy SWE-Bench logs (#4111) 2024-09-29 18:39:18 -05:00
Xingyao Wang
b13ed017d8 [eval] add git patch post-processing for SWE-Bench eval_infer (#3980) 2024-09-20 15:33:53 +00:00
Xingyao Wang
5d7f2fd4ae [eval] Allow evaluation of SWE-Bench patches on RemoteRuntime (#3927)
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-09-18 16:07:34 -04:00
Xingyao Wang
f996b31d64 [eval] Fix multi-processing bug (again^3) & allow set EXP_NAME for each run_infer (#3907)
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-09-17 14:07:58 +00:00
tobitege
52c5abccbf (enh) Dockerfile.j2: improve env vars for bash and activate in .bashrc (#3871) 2024-09-17 08:49:04 +02:00
Xingyao Wang
688068a44e Fix issues for running RemoteRuntime in parallel on SWE-Bench (#3716)
* feat: add SWE-bench fullset support

* fix instance image list

* update eval script and documentation

* increase timeout for remote runtime

* add push script

* handle the case when ret push is a generator

* update pbar

* set SWE-Bench default to run SWE-Bench lite

* add script to cleanup remote runtime

* fix the cases when tag is too long

* update README

* update readme for cleanup

* rename od to oh

* Update evaluation/swe_bench/README.md

Co-authored-by: Graham Neubig <neubig@gmail.com>

* Update evaluation/swe_bench/README.md

Co-authored-by: Graham Neubig <neubig@gmail.com>

* Update evaluation/swe_bench/scripts/cleanup_remote_runtime.sh

Co-authored-by: Graham Neubig <neubig@gmail.com>

* Update evaluation/swe_bench/scripts/cleanup_remote_runtime.sh

Co-authored-by: Graham Neubig <neubig@gmail.com>

* Update evaluation/swe_bench/scripts/cleanup_remote_runtime.sh

Co-authored-by: Graham Neubig <neubig@gmail.com>

* gets API key and Runtime from env var

---------

Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-09-05 10:34:31 +08:00
Xingyao Wang
d8a87d7ccb [Eval] Make SWE-Bench run_infer.sh to default to run SWE-Bench Lite (#3704)
* feat: add SWE-bench fullset support

* fix instance image list

* update eval script and documentation

* increase timeout for remote runtime

* add push script

* handle the case when ret push is a generator

* update pbar

* set SWE-Bench default to run SWE-Bench lite
2024-09-04 00:58:14 +08:00
Xingyao Wang
d283420ac2 feat: add SWE-bench fullset support (#3477)
* feat: add SWE-bench fullset support

* fix instance image list

* update eval script and documentation

* add push script

* handle the case when ret push is a generator

* update pbar
2024-09-02 20:28:52 -04:00
Xingyao Wang
98081b9b1b (eval) EOF fixes for SWE-Bench evaluation (#3623)
* add error handling for client eof

* remove root check

* remove set -e

* echo USER to fix for swebench infer

* fix entry timeout

* add timeout;
fix runtime close
2024-08-27 21:09:31 +00:00
tobitege
7ef5a2d1ff (fix) Rename last opendevin occurrences (#3490)
* renaming more opendevin occurrences

* remove DOCKER_IMAGE variable from Makefile

* Revert rename in evaluation/swe_bench/run_infer.py

Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>

---------

Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2024-08-20 16:45:26 +00:00
Robert Brennan
01ae22ef57 Rename OpenDevin to OpenHands (#3472)
* Replace OpenDevin with OpenHands

* Update CONTRIBUTING.md

* Update README.md

* Update README.md

* update poetry lock; move opendevin folder to openhands

* fix env var

* revert image references in docs

* revert permissions

* revert permissions

---------

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2024-08-20 00:44:54 +08:00
Xingyao Wang
c3f62c3ce9 allow setting dataset name and split 2024-08-17 20:43:59 -05:00
Xingyao Wang
b8ec420ccd remove unused stuff 2024-08-17 20:43:09 -05:00
Xingyao Wang
7b6ae3638e remove unused swebench scripts 2024-08-17 20:38:49 -05:00
Xingyao Wang
31b244f95e [Refactor, Evaluation] Refactor and clean up evaluation harness to remove global config and use EventStreamRuntime (#3230)
* move multi-line bash tests to test_runtime;
support multi-line bash for esruntime;

* add testcase to handle PS2 prompt

* use bashlex for bash parsing to handle multi-line commands;
add testcases for multi-line commands

* revert ghcr runtime change

* Apply stash

* fix run as other user;
make test async;

* fix test runtime for run as od

* add run-as-devin to all the runtime tests

* handle the case when username is root

* move all run-as-devin tests from sandbox;
only tests a few cases on different user to save time;

* move over multi-line echo related tests to test_runtime

* fix user-specific jupyter by fixing the pypoetry virtualenv folder

* make plugin's init async;
chdir at initialization of jupyter plugin;
move ipy simple testcase to test runtime;

* support agentskills import in
move tests for jupyter pwd tests;
overload `add_env_vars` for EventStreamRuntime to update env var also in Jupyter;
make agentskills read env var lazily, in case env var is updated;

* fix ServerRuntime agentskills issue

* move agnostic image test to test_runtime

* merge runtime tests in CI

* fix enable auto lint as env var

* update warning message

* update warning message

* test for different container images

* change parsing output as debug

* add exception handling for update_pwd_decorator

* fix unit test indentation

* add plugins as default input to Runtime class;
remove init_sandbox_plugins;
implement add_env_var (include jupyter) in the base class;

* fix server runtime auto lint

* Revert "add exception handling for update_pwd_decorator"

This reverts commit 2b668b1506.

* tries to print debugging info for agentskills

* explicitly setting uid (try fix permission issue)

* Revert "tries to print debugging info for agentskills"

This reverts commit 8be4c86756.

* set sandbox user id during testing to hopefully fix the permission issue

* add browser tools for server runtime

* try to debug for old pwd

* update debug cmd

* only test agnostic runtime when TEST_RUNTIME is Server

* fix temp dir mkdir

* load TEST_RUNTIME at the beginning

* remove ipython tests

* only log to file when DEBUG

* default logging to project root

* temporarily remove log to file

* fix LLM logger dir

* fix logger

* make set pwd an optional aux action

* fix prev pwd

* fix infinite recursion

* simplify

* do not import the whole od library to avoid logger folder by jupyter

* fix browsing

* increase timeout

* attempt to fix agentskills yet again

* clean up in testcases, since CI maybe run as non-root

* add _cause attribute for event.id

* remove parent

* add a bunch of debugging statement again for CI :(

* fix temp_dir fixture

* change all temp dir to follow pytest's tmp_path_factory

* remove extra bracket

* clean up error printing a bit

* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization

* jupyter chdir to self.config.workspace_mount_path_in_sandbox on initialization

* add typing for tmp dir fixture

* clear the directory before running the test to avoid weird CI temp dir

* remove agnostic test case for server runtime

* Revert "remove agnostic test case for server runtime"

This reverts commit 30e2181c3f.

* disable agnostic tests in CI

* fix test

* make sure plugin arg is not passed when no plugin is specified;
remove redundant on_event function;

* move mock prompt

* rename runtime

* remove extra logging

* refactor run_controller's interface;
support multiple runtime for integration test;
filter out hostname for prompt

* uncomment other tests

* pass the right runtime to controller

* log runtime when start

* uncomment tests

* improve symbol filters

* add integration test prompts that seemed ok

* add integration test workflow

* add python3 to default ubuntu image

* symlink python and fix permission to jupyter pip

* add retry for jupyter execute server

* fix jupyter pip install;
add post-process for jupyter pip install;
simplify init by add agent_skills path to PYTHONPATH;
add testcase to tests jupyter pip install;

* fix bug

* use ubuntu:22.04 for eventstream integration tests

* add todo

* update testcase

* remove redundant code

* fix unit test

* reduce dependency for runtime

* try making llama-index an optional dependency that's not installed by default

* remove pip install since it seemed not needed

* log ipython execution;
await write message since it returns a future

* update ipy testcase

* do not install llama-index in CI

* do not install llama-index in the app docker as well

* set sandbox container image in the integration test script

* log plugins & env var for runtime

* update conftest for sha256

* add git

* remove all non-alphanumeric characters

* add working ipy module tests!

* default to use host network

* remove is_async from browser to make thing a little more reliable;
retry loading browser when error;

* add sleep to wait a bit for http server

* kill http server before regenerate browsing tests

* fix browsing

* only set sandbox container image if undefined

* skip empty config value

* update evaluation to use the latest run_controller

* revert logger in execute_server to be compatible with server runtime

* revert logging level to fix jupyter

* set logger level

* revert the logging

* chmod for workspace to fix permission

* support getting timeout from action

* update test for server runtime

* try to fix file permission

* fix test_cmd_run_action_serialization_deserialization test (added timeout)

* poetry: pip 24.2, torch 2.2.2

* revert adding pip to pyproject.toml

* add build to dependencies in pyproject.toml

* forgot poetry lock --no-update

* fix a DelegatorAgent prompt_002.log (timeout)

* fix a DelegatorAgent prompt_003.log (timeout)

* couple more timeout attribs in prompt files

* some more prompt files

* prompts galore

* add clarification comment for timeout

* default timeout to config

* add assert

* update integration tests for eventstream

* update integration tests

* fix timeout for action<->dict

* remove redundant on_event

* default to use instance image

* update run_controller interface

* add logging for copy

* refactor swe_bench for the new design

* fix action execution timeout

* update lock

* remove build sandbox locally

* fix runtime

* use plain for-loop for single process

* remove extra print

* get swebench inference working

* print whole `test_result` dict

* got swebench patch post-process working

* update swe-bench evaluation readme

* refactor using shared reset_logger function

* move messy swebench prompt to a different file

* support the ability to specify whether to keep prompt

* support the ability to specify whether to keep prompt

* fix dockerfile

* fix import and remove unnecessary strip logic

* fix action serialization

* get agentbench running

* remove extra ls for agent bench

* fix agentbench metric

* factor out common documentation for eval

* update biocoder doc

* remove swe_env_box since it is no longer needed

* get biocoder working

* add func timeout for bird

* fix jupyter pwd with ~ as user name

* fix jupyter pwd with ~ as user name

* get bird working

* get browsing evaluation working

* make eda runnable

* fix id column

* fix eda run_infer

* unify eval output using a structured format;
make swebench compatible with that format;
update client source code for every swebench run;
do not inject testcmd for swebench

* standardize existing benchs for the new eval output

* set update source code = true

* get gaia standardized

* fix gaia

* gorilla refactored but stuck at language.so to test

* refactor and make gpqa work

* refactor humanevalfix and get it working

* refactor logic reasoning and get it working

* refactor browser env so it works with eventstream runtime for eval

* add initial version of miniwob refactor

* fix browsergym environment

* get miniwob working!!

* allowing injecting additional dependency to OD runtime docker image

* allowing injecting additional dependency to OD runtime docker image

* support logic reasoning with pre-injected dependency

* get mint working

* update runtime build

* fix mint docker

* add test for keep_prompt;
add missing await close for some tests

* update integration tests for eventstream runtime

* fix integration tests for server runtime

* refactor ml bench and toolqa

* refactor webarena

* fix default factory

* Update run_infer.py

* add APIError to retry

* increase timeout for swebench

* make sure to hide api key when dump eval output

* update the behavior of put source code to put files instead of tarball

* add dirhash to dependency

* sendintr when timeout

* fix dockerfile copy

* reduce timeout

* use dirhash to avoid repeat building for update source

* fix runtime_build testcase

* add dir_hash to docker build pipeline

* revert api error

* update poetry lock

* add retries for swebench run infer

* fix git patch

* update poetry lock

* adjust config order

* fix mount volumes

* enforce all eval to use "instance_id"

* remove file store from runtime

* make file_store public inside eventstream

* move the runtime logic inside `main` out

* support using async function for process_instance_fn

* refactor run_infer with the create_time

* fix file store

* Update evaluation/toolqa/utils.py

Co-authored-by: Graham Neubig <neubig@gmail.com>

* fix typo

---------

Co-authored-by: tobitege <tobitege@gmx.de>
Co-authored-by: super-dainiu <78588128+super-dainiu@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-08-06 17:21:45 +00:00
Xingyao Wang
1c813a2fa0 support swebench pull from custom namespace (#3136) 2024-07-26 18:46:36 +00:00
Jiayi Pan
7111e8ee14 Support Instance Level Images for SWE-Bench Evaluation (#2874)
* rename pulled instance images

* Swebench: add support to instance level images

* Update evaluation/swe_bench/run_infer.py

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>

* instance swebench: use env var and docker tags instead

* swebench disable instance report for instance images

* Update evaluation/swe_bench/README.md

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>

---------

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2024-07-17 01:31:42 +08:00
Anush Kumar V
8f76587e5c docs: updated docstrings using ruff's autofix feature (#2923)
* Updated documentation using ruff's autofix feature

* Updated pyproject.toml to include docstring validations

* Updated documentation using ruff's autofix feature

* Updated pyproject.toml to include docstring validations

* Updated docstrings using ruff's autofix feature

* Deleted opendevin/runtime/utils/soource.py, Keeping in sync with main

---------

Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-07-16 01:35:33 +00:00
Boxuan Li
4b4fa1c390 Remove legacy swe_bench/scripts/summarise_results.py (#2932)
* Remove swe_bench/scripts/summarise_results.py

* Remove mention of legacy script
2024-07-15 15:03:07 -04:00
Boxuan Li
b834b354e5 Add compare_patch_filename.py (#2934) 2024-07-15 23:55:45 +08:00
Xingyao Wang
e6cdf18d3b [Evaluation] Log empty patch stats for SWE-Bench (#2776)
* bump swebench version since the fix PR is merged

* add empty generation stats from latest pr

* delete eval_outputs if it already exists

* handle non string patch
2024-07-05 07:03:27 +08:00
Xingyao Wang
4d0c4f37d6 [Evaluation] fix SWE-Bench docker image name (#2751)
* fix double underscore

* remove unused script
2024-07-03 04:30:38 +08:00
Xingyao Wang
41ddba84bd [Agent] (Potentially) improve Editing using diff (#2685)
* add replace-based block edit & preliminary test case fix

* further fix the insert behavior

* make edit only work on first occurrence

* bump codeact version since we now use new edit agentskills

* update prompt for new agentskills

* update integration tests

* make run_infer.sh executable

* remove code block for edit_file

* update integration test for prompt changes

* default to not use hint for eval

* fix insert emptyfile bug

* throw value error when `to_replace` is empty

* make `_edit_or_insert_file` return string so we can try to fix some linter errors (best attempt)

* add todo

* update integration test

* fix sandbox test for this PR
2024-07-02 11:50:15 +09:00
Xingyao Wang
6a0ffc5c61 [Evaluation] Use the latest official SWE-Bench Dockerization for evaluation (#2728)
* add newline after patch to fix patch apply

* new swebench wip

* add newline after patch to fix patch apply

* only add newline if not empty

* update swebench source and update

* update gitignore for swebench eval

* update old prep_eval

* update gitignore

* add scripts for push and pull swebench images

* update eval_infer.sh

* update eval_infer for new docker workflow

* update script to create markdown report based on report.json

* update eval infer to use update output

* update readme

* only move result to folder if running whole file

* remove set-x

* update conversion script

* Update evaluation/swe_bench/README.md

* Update evaluation/swe_bench/README.md

* Update evaluation/swe_bench/README.md

* make sure last line end with newline

* switch to a fix attempt branch of swebench

* Update evaluation/swe_bench/README.md

* Update evaluation/swe_bench/README.md

---------

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-07-01 23:58:30 +00:00
Xingyao Wang
15e0c524f4 default to not use hint for eval (#2696) 2024-06-29 21:27:57 +00:00
Xingyao Wang
e8cb6803df [Evaluation] Improve patch apply in SWE-Bench (#2684)
* add newline after patch to fix patch apply

* only add newline if not empty
2024-06-29 14:11:07 +08:00
மனோஜ்குமார் பழனிச்சாமி
af9385322b Refactor: Simplify message formatting (#2670)
Removed redundant `str()` conversion in f-string.
2024-06-28 07:34:26 +02:00
Xavier Vergés
cd91d45b44 Allow SANDBOX_CONTAINER_IMAGEs built from opendevin/sandbox:main (#2622) 2024-06-26 12:05:07 +08:00
Xingyao Wang
6de584d77d update swe-bench output with eval results (#2606) 2024-06-24 08:07:28 +09:00
Graham Neubig
cab7a288ca Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings (#2597)
* Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings

* Update evaluation/webarena/scripts/run_infer.sh

---------

Co-authored-by: OpenDevin <opendevin@all-hands.dev>
2024-06-23 03:43:43 +00:00
மனோஜ்குமார் பழனிச்சாமி
41564c2eac Use :main instead of :latest (#2539)
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-06-21 03:57:50 +00:00
Boxuan Li
6f235937cf Evaluation time travel: allow evaluation on a specific version (#2356)
* Time travel for evaluation

* Fix source script path

* Exit script if given version doesn't exist

* Exit on failure

* Update README

* Change scripts of all other benchmarks

* Modify README files

* Fix logic_reasoning README
2024-06-16 10:25:14 -04:00
Xingyao Wang
a6ba6c5277 Add SWEBench-docker eval (#2085)
* add initial version of swebench-docker eval

* update the branch of git repo

* add poetry run

* download dev set too and pre-load f2p and p2p

* update eval infer script

* increase timeout

* add poetry run

* install swebench from our fork

* update script

* update loc

* support single instance debug

* replace \r\n from model patch

* replace eval docker from namespace xingyaoww

* update script to auto detect swe-bench format jsonl

* support eval infer on single instance id

* change log output dir to logs

* update summarise result script

* update README

* update readme

* tweak branch

* Update evaluation/swe_bench/scripts/eval/prep_eval.sh

Co-authored-by: Graham Neubig <neubig@gmail.com>

---------

Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-06-10 19:30:40 +00:00
Boxuan Li
4d14b44a9a SWE-bench: Add summarise utility script to view passed/failed task IDs (#2137)
* SWE-bench: Add summarise utility script to view passed/failed task IDs

* Fix typos

* Move file

* Prettify

* Use merged jsonl file
2024-05-31 12:32:17 +08:00
Xingyao Wang
01ef90205d Add CodeActSWEAgent to remove browsing & github + improvements on agentskills (#2105)
* update swe_bench prompt;
use minimal prompt for codeact;

* upgrade agentskills and update testcases

* update infer prompt

* fix cwd

* add icl for swebench

* also log in_context_example to run infer

* remove extra print

* change prompt to abs path

* update error message to include current file info

* change cwd for jupyter if needed

* update edit error message

* update prompt

* improve git get patch

* update hint string

* default to 50 turns

* revert changes from codeact agent and create new CodeActSWEAgent

* revert changes to codeact

* revert instructions for run infer

* revert instructions for run infer

* update README

* update max iter

* add codeact swe agent

* fix issue for CodeActSWEAgent

* allow specifying max iter in cmdline script

* stop printing

* Update agenthub/codeact_swe_agent/README.md

Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>

* Fix prompt regression in jupyter plugin

---------

Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-05-29 21:19:00 -07:00
Xingyao Wang
2c0a2dbc61 fix yet another swe_bench issue (#2069) 2024-05-26 10:01:43 -07:00
Xingyao Wang
5114230e53 Some SWE-Bench infer fixes and improvements (#2065)
* reset workspace base properly

* support running without hint

* support running without hint

* bump swe-bench eval docker to v1.2 for latest agentskills

* only give hint when use hint text is true

* add swe-agent instructions for validation

* update dockerfile

* pin the python interpreter for execute_cli

* avoid initialize plugins twice

* default to use hint

* save results to swe_bench_lite

* unset gh token and increase max iter to 50

* remove printing of use hint status

* refactor ssh login into one function

* ok drop to 30 turns bc it is so expensive :(

* remove reproduce comments to avoid stuck
2024-05-26 10:02:11 +00:00
Xingyao Wang
602ffcdffb Implement agentskills for OpenDevin to hopefully improve editing and include more useful tools/skills (#1941)
* add draft for skills

* Implement and test agentskills functions: open_file, goto_line, scroll_down, scroll_up, create_file, search_dir, search_file, find_file

* Remove new_sample.txt file

* add some work from opendevin w/ fixes

* Add unit tests for agentskills module

* fix some issues and updated tests

* add more tests for open

* tweak and handle goto_line

* add tests for some edge cases

* add tests for scrolling

* add tests for edit

* add tests for search_dir

* update tests to use pytest

* use pytest --forked to avoid file op unit tests to interfere with each other via global var

* update doc based on swe agent tool

* update and add tests for find_file and search_file

* move agent_skills to plugins

* add agentskills as plugin and docs

* add agentskill to ssh box and fix sandbox integration

* remove extra returns in doc

* add agentskills to initial tool for jupyter

* support re-init jupyter kernel (for agentskills) after restart

* fix print window's issue with indentation and add testcases

* add prompt for codeact with the newest edit primitives

* modify the way line number is presented (remove leading space)

* change prompt to the newest display format

* support tracking of costs via metrics

* Update opendevin/runtime/plugins/agent_skills/README.md

* Update opendevin/runtime/plugins/agent_skills/README.md

* implement and add tests for py linting

* remove extra text arg for incompatible subprocess ver

* remove sample.txt

* update test_edits integration tests

* fix all integration

* Update opendevin/runtime/plugins/agent_skills/README.md

* Update opendevin/runtime/plugins/agent_skills/README.md

* Update opendevin/runtime/plugins/agent_skills/README.md

* Update agenthub/codeact_agent/prompt.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Update agenthub/codeact_agent/prompt.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Update agenthub/codeact_agent/prompt.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* Update opendevin/runtime/plugins/agent_skills/agentskills.py

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>

* correctly setup plugins for swebench eval

* bump swe-bench version and add logging

* correctly setup plugins for swebench eval

* bump swe-bench version and add logging

* Revert "correctly setup plugins for swebench eval"

This reverts commit 2bd1055673.

* bump version

* remove _AGENT_SKILLS_DOCS

* move flake8 to test dep

* update poetry.lock

* remove extra arg

* reduce max iter for eval

* update poetry

* fix integration tests

---------

Co-authored-by: OpenDevin <opendevin@opendevin.ai>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-05-23 16:04:09 +00:00
Xingyao Wang
6ff50ed369 Fix SWE-Bench evaluation due to setuptools version (#1995)
* correctly setup plugins for swebench eval

* bump swe-bench version and add logging

* Revert "correctly setup plugins for swebench eval"

This reverts commit 2bd1055673.

* bump version
2024-05-23 23:17:42 +08:00