Xingyao Wang
da548d308c
[agent] LLM-based editing ( #3985 )
...
Co-authored-by: Tim O'Farrell <tofarr@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-10-22 04:51:44 +08:00
Alejandro Cuadron Lafuente
a9a593bb21
[Fix] Added support to specify the platform on which the runtime image should be built. ( #4402 )
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: mamoodi <mamoodiha@gmail.com>
Co-authored-by: tofarr <tofarr@gmail.com>
Co-authored-by: Robert Brennan <contact@rbren.io>
2024-10-20 09:19:05 +08:00
Xingyao Wang
91308ba4dc
feat: clean-up retries RemoteRuntime & add FatalErrorObservation ( #4485 )
2024-10-18 17:23:13 +00:00
Jiayi Pan
c1b323a076
Show actual dataset name in swebench log directory ( #4417 )
2024-10-17 10:32:38 +08:00
Xingyao Wang
84a578ad20
[test] remove integration tests from CI & move them into evaluation ( #4447 )
2024-10-17 05:38:23 +08:00
mamoodi
6f2e678028
Fix eval output path in case of @ char ( #4416 )
2024-10-15 22:45:08 +00:00
Abhijeetsingh Meena
173018eb58
fix: Resolves HumanEval Inference by replacing task_id with instance_id ( #4364 )
...
Co-authored-by: Harshit Surana <surana.h@gmail.com>
2024-10-15 15:18:38 +00:00
Xingyao Wang
50c13aad98
[Eval] Improve SWE-Bench Eval harness: multi-run support & entry script simplification ( #4396 )
2024-10-15 21:34:52 +08:00
Xingyao Wang
25f9413965
[Eval] Fix eval stuck when result is too large for pbar ( #4361 )
2024-10-14 22:08:34 +08:00
Xingyao Wang
4dfc7a7ef0
[Eval] Add a more lightweight / easier-to-use SWE-Bench output visualizer ( #4360 )
...
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-10-14 02:09:01 +00:00
Xingyao Wang
b23c7aab5a
[eval] stop setting sid in eval ( #4311 )
2024-10-10 11:47:27 +08:00
Robert Brennan
45fb4fb9bc
allow reconnecting to a runtime ( #4223 )
2024-10-09 16:37:52 +00:00
Engel Nyst
e6847e9e61
Move agenthub within openhands ( #4130 )
2024-10-08 00:34:18 +00:00
Alejandro Cuadron Lafuente
a3571ec510
[Fix] Error when trying to pull all docker evaluation containers ( #4244 )
2024-10-08 05:03:36 +08:00
Aditya Bharat Soni
0809d26f4d
fix: Allow evaluation benchmarks to pass image urls in run_controller() instead of simply passing strings ( #4100 )
...
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2024-10-07 15:37:08 -04:00
Xingyao Wang
01ae54a69d
fix swebench repo/version being string ( #4241 )
2024-10-07 22:01:42 +08:00
Xingyao Wang
245334e89d
[eval] improve update output script for swe-bench ( #4180 )
2024-10-04 15:10:03 +00:00
Xingyao Wang
80a631361b
eval: update aiderbench readme ( #4209 )
2024-10-04 09:26:12 -04:00
Xingyao Wang
9cc9b19958
eval: improve swebench infer error handling and retry ( #4205 )
2024-10-04 07:09:56 -05:00
Xingyao Wang
0c2a35b256
[eval] update aider bench scripts ( #4203 )
2024-10-04 02:23:06 +00:00
tofarr
152f99c64f
Chore Bump python version ( #3545 )
2024-10-03 13:40:55 -04:00
Xingyao Wang
53a015f718
fix: make llm_completions optional to fix eval_infer.py ( #4148 )
2024-10-02 03:55:03 +08:00
mamoodi
0144caaf1f
Update eval doc for remote runtime ( #4145 )
2024-10-01 13:14:36 -04:00
Xingyao Wang
1109637efb
Update instruction for new version of eval runtime-api ( #4128 )
2024-09-30 23:48:38 +00:00
Xingyao Wang
8d6eda3623
fix eval_infer.sh to correctly copy SWE-Bench logs ( #4111 )
2024-09-29 18:39:18 -05:00
tobitege
c3bbe604eb
(fix) Fix logging in shared eval file to prevent key disclosure ( #4108 )
2024-09-28 19:33:16 +00:00
Xingyao Wang
81b3cd71b3
[eval] log evaluating warnings directly to console ( #4026 )
2024-09-26 03:42:32 +08:00
Xingyao Wang
1b1d8f0b02
[eval] Use imap_unordered for parallelizing evaluation ( #4040 )
2024-09-24 20:47:27 +00:00
Xingyao Wang
a66e738957
[eval] use mp Pool instead of ProcessPoolExecutor ( #4025 )
2024-09-24 23:59:06 +08:00
Ikko Eltociear Ashimine
c84495830e
[eval] update swe_bench/README.md ( #3990 )
2024-09-23 11:03:09 +02:00
Xingyao Wang
714e46f29a
[eval] save eventstream & llm completions for SWE-Bench run_infer ( #3923 )
2024-09-22 04:39:13 +00:00
Xingyao Wang
b13ed017d8
[eval] add git patch post-processing for SWE-Bench eval_infer ( #3980 )
2024-09-20 15:33:53 +00:00
Engel Nyst
8fdfece059
Refactor messages serialization ( #3832 )
...
Co-authored-by: Robert Brennan <accounts@rbren.io>
2024-09-18 23:48:58 +02:00
tofarr
ad0b549d8b
Feat Tightening up Timeouts and interrupt conditions. ( #3926 )
2024-09-18 20:50:42 +00:00
Xingyao Wang
5d7f2fd4ae
[eval] Allow evaluation of SWE-Bench patches on RemoteRuntime ( #3927 )
...
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-09-18 16:07:34 -04:00
Engel Nyst
ef09f0fe37
Small fix in readme ( #3912 )
2024-09-17 14:33:25 +00:00
Xingyao Wang
f996b31d64
[eval] Fix multi-processing bug (again^3) & allow set EXP_NAME for each run_infer ( #3907 )
...
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-09-17 14:07:58 +00:00
tobitege
52c5abccbf
(enh) Dockerfile.j2: improve env vars for bash and activate in .bashrc ( #3871 )
2024-09-17 08:49:04 +02:00
Graham Neubig
243cb492aa
Run pre-commit on all files ( #3884 )
2024-09-16 11:07:08 -04:00
Xingyao Wang
2b3925278d
[eval] refactor process instance logic into update_progress ( #3875 )
2024-09-15 18:47:15 -04:00
Engel Nyst
379f2b6f23
Fix queue length on Macs ( #3867 )
2024-09-14 01:11:29 +00:00
Xingyao Wang
3a1b8c093b
[eval] yet another round of eval fixes for multi-processing ( #3854 )
...
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-09-13 15:51:22 +00:00
Xingyao Wang
78c5f58adc
refactor & improve retry for the reliability of RemoteRuntime & evaluation ( #3846 )
2024-09-13 07:37:07 -04:00
Xingyao Wang
797f02ff6f
rename huggingface evaluation benchmark ( #3845 )
2024-09-12 18:50:26 +00:00
Xingyao Wang
47d9621742
[eval] SWE-Bench eval usability fixes ( #3836 )
...
* [eval] increase timeout for swebench eval init/complete
* allow CmdRunAction to optionally block when .timeout is set
* fix unit test for serialization
* fix unit tests for security analyzer
* fix integration tests
* add more timeout
* only check P2P when instances are non-empty;
convert P2P and F2P columns to string instead of list
---------
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-09-12 16:33:51 +00:00
Xingyao Wang
2fe2f4c530
[eval] increase timeout for SWEBench eval init/complete ( #3829 )
...
* [eval] increase timeout for swebench eval init/complete
* allow CmdRunAction to optionally block when .timeout is set
* fix unit test for serialization
* fix unit tests for security analyzer
* fix integration tests
* add more timeout
2024-09-12 15:20:58 +00:00
Jiayi Pan
43c4a7fff4
Allow Generalized SWE-Bench format for evaluation ( #3752 )
...
* allow generalized swe-bench format
* Update run_infer.py
* fix linter
---------
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2024-09-06 13:05:00 +00:00
Xingyao Wang
688068a44e
Fix issues for running RemoteRuntime in parallel on SWE-Bench ( #3716 )
...
* feat: add SWE-bench fullset support
* fix instance image list
* update eval script and documentation
* increase timeout for remote runtime
* add push script
* handle the case when ret push is a generator
* update pbar
* set SWE-Bench default to run SWE-Bench lite
* add script to cleanup remote runtime
* fix the cases when tag is too long
* update README
* update readme for cleanup
* rename od to oh
* Update evaluation/swe_bench/README.md
Co-authored-by: Graham Neubig <neubig@gmail.com>
* Update evaluation/swe_bench/README.md
Co-authored-by: Graham Neubig <neubig@gmail.com>
* Update evaluation/swe_bench/scripts/cleanup_remote_runtime.sh
Co-authored-by: Graham Neubig <neubig@gmail.com>
* Update evaluation/swe_bench/scripts/cleanup_remote_runtime.sh
Co-authored-by: Graham Neubig <neubig@gmail.com>
* Update evaluation/swe_bench/scripts/cleanup_remote_runtime.sh
Co-authored-by: Graham Neubig <neubig@gmail.com>
* gets API key and Runtime from env var
---------
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-09-05 10:34:31 +08:00
Xingyao Wang
d8a87d7ccb
[Eval] Make SWE-Bench run_infer.sh to default to run SWE-Bench Lite ( #3704 )
...
* feat: add SWE-bench fullset support
* fix instance image list
* update eval script and documentation
* increase timeout for remote runtime
* add push script
* handle the case when ret push is a generator
* update pbar
* set SWE-Bench default to run SWE-Bench lite
2024-09-04 00:58:14 +08:00
Xingyao Wang
d283420ac2
feat: add SWE-bench fullset support ( #3477 )
...
* feat: add SWE-bench fullset support
* fix instance image list
* update eval script and documentation
* add push script
* handle the case when ret push is a generator
* update pbar
2024-09-02 20:28:52 -04:00