Commit Graph

127 Commits

Author SHA1 Message Date
Xingyao Wang
4ce3b9094a Revert "(feat): Prompt engineering to remind o1 to generate a patch" (#4846) 2024-11-08 16:12:57 +00:00
Alejandro Cuadron Lafuente
a6810fa6ad (feat): Prompt engineering to remind o1 to generate a patch (#4807)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: mamoodi <mamoodiha@gmail.com>
Co-authored-by: tofarr <tofarr@gmail.com>
Co-authored-by: Robert Brennan <contact@rbren.io>
2024-11-08 03:10:18 +00:00
Xingyao Wang
53390d9885 Fix issue #4583: [Bug]: Unable to pull the full SWE-Bench test set (#4813)
Co-authored-by: openhands <openhands@all-hands.dev>
2024-11-07 22:35:20 +08:00
OpenHands
025dac5d8f Fix issue #4776: [Bug]: Files are not uploaded to the environment (SWE-Bench) (#4795) 2024-11-06 19:05:06 +00:00
Engel Nyst
eeb2342509 Refactor history/event stream (#3808) 2024-11-05 03:36:14 +01:00
Xingyao Wang
966da7b7c8 feat(agent, CodeAct 2.2): native CodeAct support for Browsing (#4667)
Co-authored-by: tofarr <tofarr@gmail.com>
2024-11-05 00:27:27 +08:00
Xingyao Wang
9c2b48ff5d fix(eval): SWE-Bench instance with upper-case instance id (#4649) 2024-10-30 21:24:18 +00:00
Xingyao Wang
6d19c93d19 [eval] add evaluation workflow (#4489)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-10-29 13:52:25 +00:00
Xingyao Wang
ae13171194 feat(agent): CodeAct with function calling (#4537)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: tobitege <10787084+tobitege@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: tofarr <tofarr@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-29 11:06:33 +08:00
Xingyao Wang
1f23dc89b6 fix(eval): add runtime.connect to all eval harness (#4565) 2024-10-26 00:41:30 +08:00
Xingyao Wang
7340b78962 feat(eval): rewrite log_completions to save completions to directory (#4566) 2024-10-25 16:36:11 +00:00
Xingyao Wang
2d5b360505 refactor: re-organize different runtime implementations into an impl folder (#4346)
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-10-23 10:10:03 +00:00
Xingyao Wang
da548d308c [agent] LLM-based editing (#3985)
Co-authored-by: Tim O'Farrell <tofarr@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-10-22 04:51:44 +08:00
Alejandro Cuadron Lafuente
a9a593bb21 [Fix] Added support to specify the platform on which the runtime image should be built. (#4402)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: mamoodi <mamoodiha@gmail.com>
Co-authored-by: tofarr <tofarr@gmail.com>
Co-authored-by: Robert Brennan <contact@rbren.io>
2024-10-20 09:19:05 +08:00
Xingyao Wang
91308ba4dc feat: clean-up retries RemoteRuntime & add FatalErrorObservation (#4485) 2024-10-18 17:23:13 +00:00
Jiayi Pan
c1b323a076 Show actual dataset name in swebench log directory (#4417) 2024-10-17 10:32:38 +08:00
Xingyao Wang
50c13aad98 [Eval] Improve SWE-Bench Eval harness: multi-run support & entry script simplification (#4396) 2024-10-15 21:34:52 +08:00
Xingyao Wang
4dfc7a7ef0 [Eval] Add a more lightweight / easier-to-use SWE-Bench output visualizer (#4360)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-10-14 02:09:01 +00:00
Xingyao Wang
b23c7aab5a [eval] stop set sid in eval (#4311) 2024-10-10 11:47:27 +08:00
Robert Brennan
45fb4fb9bc allow reconnecting to a runtime (#4223) 2024-10-09 16:37:52 +00:00
Engel Nyst
e6847e9e61 Move agenthub within openhands (#4130) 2024-10-08 00:34:18 +00:00
Alejandro Cuadron Lafuente
a3571ec510 [Fix] Error when trying to pull all docker evaluation containers (#4244) 2024-10-08 05:03:36 +08:00
Aditya Bharat Soni
0809d26f4d fix: Allow evaluation benchmarks to pass image urls in run_controller() instead of simply passing strings (#4100)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2024-10-07 15:37:08 -04:00
Xingyao Wang
01ae54a69d fix swebench repo/version being string (#4241) 2024-10-07 22:01:42 +08:00
Xingyao Wang
245334e89d [eval] improve update output script for swe-bench (#4180) 2024-10-04 15:10:03 +00:00
Xingyao Wang
9cc9b19958 eval: improve swebench infer error handling and retry (#4205) 2024-10-04 07:09:56 -05:00
tofarr
152f99c64f Chore Bump python version (#3545) 2024-10-03 13:40:55 -04:00
mamoodi
0144caaf1f Update eval doc for remote runtime (#4145) 2024-10-01 13:14:36 -04:00
Xingyao Wang
1109637efb Update instruction for new version of eval runtime-api (#4128) 2024-09-30 23:48:38 +00:00
Xingyao Wang
8d6eda3623 fix eval_infer.sh to correctly copy SWE-Bench logs (#4111) 2024-09-29 18:39:18 -05:00
Ikko Eltociear Ashimine
c84495830e [eval] update swe_bench/README.md (#3990) 2024-09-23 11:03:09 +02:00
Xingyao Wang
714e46f29a [eval] save eventstream & llm completions for SWE-Bench run_infer (#3923) 2024-09-22 04:39:13 +00:00
Xingyao Wang
b13ed017d8 [eval] add git patch post-processing for SWE-Bench eval_infer (#3980) 2024-09-20 15:33:53 +00:00
tofarr
ad0b549d8b Feat Tightening up Timeouts and interrupt conditions. (#3926) 2024-09-18 20:50:42 +00:00
Xingyao Wang
5d7f2fd4ae [eval] Allow evaluation of SWE-Bench patches on RemoteRuntime (#3927)
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-09-18 16:07:34 -04:00
Engel Nyst
ef09f0fe37 Small fix in readme (#3912) 2024-09-17 14:33:25 +00:00
Xingyao Wang
f996b31d64 [eval] Fix multi-processing bug (again^3) & allow set EXP_NAME for each run_infer (#3907)
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-09-17 14:07:58 +00:00
tobitege
52c5abccbf (enh) Dockerfile.j2: improve env vars for bash and activate in .bashrc (#3871) 2024-09-17 08:49:04 +02:00
Xingyao Wang
3a1b8c093b [eval] yet another eval fixes on multi-processing (#3854)
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-09-13 15:51:22 +00:00
Xingyao Wang
78c5f58adc refactor & improve retry for the reliability of RemoteRuntime & evaluation (#3846) 2024-09-13 07:37:07 -04:00
Xingyao Wang
797f02ff6f rename huggingface evaluation benchmark (#3845) 2024-09-12 18:50:26 +00:00
Xingyao Wang
47d9621742 [eval] SWE-Bench eval usability fixes (#3836)
* [eval] increase timeout for swebench eval init/complete

* allow CmdRunAction to optionally block when .timeout is setted

* fix unit test for serialization

* fix unit tests for security analyzer

* fix integration tests

* add more timeout

* only check P2P when instances are non-empty;
convert P2P and F2P columns to string instead of list

---------

Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-09-12 16:33:51 +00:00
Xingyao Wang
2fe2f4c530 [eval] increase timeout for SWEBench eval init/complete (#3829)
* [eval] increase timeout for swebench eval init/complete

* allow CmdRunAction to optionally block when .timeout is setted

* fix unit test for serialization

* fix unit tests for security analyzer

* fix integration tests

* add more timeout
2024-09-12 15:20:58 +00:00
Jiayi Pan
43c4a7fff4 Allow Generalized SWE-Bench format for evaluation (#3752)
* allow generalized swe-bench format

* Update run_infer.py

* fix linter

---------

Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2024-09-06 13:05:00 +00:00
Xingyao Wang
688068a44e Fix issues for running RemoteRuntime in parallel on SWE-Bench (#3716)
* feat: add SWE-bench fullset support

* fix instance image list

* update eval script and documentation

* increase timeout for remote runtime

* add push script

* handle the case when ret push is an generator

* update pbar

* set SWE-Bench default to run SWE-Bench lite

* add script to cleanup remote runtime

* fix the cases when tag is too long

* update README

* update readme for cleanup

* rename od to oh

* Update evaluation/swe_bench/README.md

Co-authored-by: Graham Neubig <neubig@gmail.com>

* Update evaluation/swe_bench/README.md

Co-authored-by: Graham Neubig <neubig@gmail.com>

* Update evaluation/swe_bench/scripts/cleanup_remote_runtime.sh

Co-authored-by: Graham Neubig <neubig@gmail.com>

* Update evaluation/swe_bench/scripts/cleanup_remote_runtime.sh

Co-authored-by: Graham Neubig <neubig@gmail.com>

* Update evaluation/swe_bench/scripts/cleanup_remote_runtime.sh

Co-authored-by: Graham Neubig <neubig@gmail.com>

* gets API key and Runtime from env var

---------

Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-09-05 10:34:31 +08:00
Xingyao Wang
d8a87d7ccb [Eval] Make SWE-Bench run_infer.sh to default to run SWE-Bench Lite (#3704)
* feat: add SWE-bench fullset support

* fix instance image list

* update eval script and documentation

* increase timeout for remote runtime

* add push script

* handle the case when ret push is an generator

* update pbar

* set SWE-Bench default to run SWE-Bench lite
2024-09-04 00:58:14 +08:00
Xingyao Wang
d283420ac2 feat: add SWE-bench fullset support (#3477)
* feat: add SWE-bench fullset support

* fix instance image list

* update eval script and documentation

* add push script

* handle the case when ret push is an generator

* update pbar
2024-09-02 20:28:52 -04:00
Xingyao Wang
090c911a50 (refactor) Make Runtime class synchronous (#3661)
* change runtime to be synchronous

* fix test runtime with the new interface

* fix arg

* fix eval

* fix missing config attribute

* fix plugins

* fix on_event by revert it back to async

* update upload_file endpoint

* fix argument to upload file

* remove unncessary async for eval;
fix evaluation run in parallel

* use asyncio to run controller for eval

* revert file upload

* truncate eval test result output
2024-08-30 01:37:03 +00:00
Xingyao Wang
8b1f207d39 feat: support remote runtime (#3406)
* feat: refactor building logic into runtime builder

* return image name

* fix testcases

* use runtime builder for eventstream runtime

* have runtime builder return str

* add api_key to sandbox config

* draft remote runtime

* remove extra if clause

* initialize runtime based on box class

* add build logic

* use base64 for file upload

* get runtime image prefix from API

* replace ___ with _s_ to make it a valid image name

* use /build to start build and /build_status to check the build progress

* update logging

* fix exit code

* always use port

* add remote runtime

* rename runtime

* fix tests import

* make dir first if work_dir does not exists;

* update debug print to remote runtime

* fix exit close_sync

* update logging

* add retry for stop

* use all box class for test keep prompt

* fix test browsing

* add retry stop

* merge init commands to save startup time

* fix await

* remove sandbox url

* support execute through specific runtime url

* fix file ops

* simplify close

* factor out runtime retry code

* fix exception handling

* fix content type error (e.g., bad gateway when runtime is not ready)

* add retry for wait until alive;
add retry for check image exists

* Revert "add retry for wait until alive;"

This reverts commit dd013cd268.

* retry when wait until alive

* clean up msg

* directly save sdist to temp dir for _put_source_code_to_dir

* support running testcases in parallel

* tweak logging;
try to close session

* try to close session even on exception

* update poetry lock

* support remote to run integration tests

* add warning for workspace base on remote runtime

* set default runtime api

* remove server runtime

* update poetry lock

* support running swe-bench (n=1) eval on remoteruntime

* add a timeout of 30 min

* add todo for docker namespace

* update poetry loc
2024-08-29 15:53:37 +00:00
tobitege
9c39f07430 (enh) Aider-Bench: make resumable with skip_num arg (#3626)
* added optional START_ID env flag to resume from that instance id

* prepare_dataset: fix comparisons by using instance id's as int

* aider bench complete_runtime: close runtime to close container

* added matrix display of instance id for logging

* fix typo in summarize_results.py saying summarise_results

* changed start_id to skip_num to skip rows from dataset (start_id wasn't supportable)

* doc changes about huggingface spaces to temporarily point back to OD
2024-08-28 15:42:01 +00:00