Xingyao Wang
|
4ce3b9094a
|
Revert "(feat): Prompt engineering to remind o1 to generate a patch" (#4846)
|
2024-11-08 16:12:57 +00:00 |
|
Alejandro Cuadron Lafuente
|
a6810fa6ad
|
(feat): Prompt engineering to remind o1 to generate a patch (#4807)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: mamoodi <mamoodiha@gmail.com>
Co-authored-by: tofarr <tofarr@gmail.com>
Co-authored-by: Robert Brennan <contact@rbren.io>
|
2024-11-08 03:10:18 +00:00 |
|
Xingyao Wang
|
53390d9885
|
Fix issue #4583: [Bug]: Unable to pull the full SWE-Bench test set (#4813)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2024-11-07 22:35:20 +08:00 |
|
OpenHands
|
025dac5d8f
|
Fix issue #4776: [Bug]: Files are not uploaded to the environment (SWE-Bench) (#4795)
|
2024-11-06 19:05:06 +00:00 |
|
Engel Nyst
|
eeb2342509
|
Refactor history/event stream (#3808)
|
2024-11-05 03:36:14 +01:00 |
|
Xingyao Wang
|
1d2a616be7
|
Fix issue #4739: '[Bug]: The agent doesn't know its name' (#4740)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2024-11-04 21:24:35 +00:00 |
|
Xingyao Wang
|
966da7b7c8
|
feat(agent, CodeAct 2.2): native CodeAct support for Browsing (#4667)
Co-authored-by: tofarr <tofarr@gmail.com>
|
2024-11-05 00:27:27 +08:00 |
|
Abhijeetsingh Meena
|
8857f02083
|
[Eval] DiscoveryBench OpenHands Integration (#4627)
Signed-off-by: Abhijeetsingh Meena <abhijeet040403@gmail.com>
Co-authored-by: Harshit Surana <surana.h@gmail.com>
|
2024-11-02 07:24:34 -04:00 |
|
Ziru "Ron" Chen
|
db4e1dbbec
|
[eval] Add ScienceAgentBench. (#4645)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
|
2024-11-01 02:30:55 +08:00 |
|
Xingyao Wang
|
9c2b48ff5d
|
fix(eval): SWE-Bench instance with upper-case instance id (#4649)
|
2024-10-30 21:24:18 +00:00 |
|
Xingyao Wang
|
6d19c93d19
|
[eval] add evaluation workflow (#4489)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2024-10-29 13:52:25 +00:00 |
|
Xingyao Wang
|
ae13171194
|
feat(agent): CodeAct with function calling (#4537)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: tobitege <10787084+tobitege@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: tofarr <tofarr@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
|
2024-10-29 11:06:33 +08:00 |
|
Xingyao Wang
|
1f23dc89b6
|
fix(eval): add runtime.connect to all eval harness (#4565)
|
2024-10-26 00:41:30 +08:00 |
|
Xingyao Wang
|
7340b78962
|
feat(eval): rewrite log_completions to save completions to directory (#4566)
|
2024-10-25 16:36:11 +00:00 |
|
tofarr
|
c4f5c07be1
|
Refactor: shorter syntax (#4558)
|
2024-10-25 06:45:28 -06:00 |
|
Graham Neubig
|
ce2430180f
|
Update README.md to fix miniwob name (#4534)
|
2024-10-23 18:24:43 +00:00 |
|
Xingyao Wang
|
2d5b360505
|
refactor: re-organize different runtime implementations into an impl folder (#4346)
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2024-10-23 10:10:03 +00:00 |
|
Graham Neubig
|
54250e3fe2
|
Update evaluation README.md structure (#4516)
|
2024-10-22 14:42:22 +00:00 |
|
Xingyao Wang
|
da548d308c
|
[agent] LLM-based editing (#3985)
Co-authored-by: Tim O'Farrell <tofarr@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2024-10-22 04:51:44 +08:00 |
|
Alejandro Cuadron Lafuente
|
a9a593bb21
|
[Fix] Added support to specify the platform on which the runtime image should be built. (#4402)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: mamoodi <mamoodiha@gmail.com>
Co-authored-by: tofarr <tofarr@gmail.com>
Co-authored-by: Robert Brennan <contact@rbren.io>
|
2024-10-20 09:19:05 +08:00 |
|
Xingyao Wang
|
91308ba4dc
|
feat: clean-up retries RemoteRuntime & add FatalErrorObservation (#4485)
|
2024-10-18 17:23:13 +00:00 |
|
Jiayi Pan
|
c1b323a076
|
Show actual dataset name in swebench log directory (#4417)
|
2024-10-17 10:32:38 +08:00 |
|
Xingyao Wang
|
84a578ad20
|
[test] remove integration tests from CI & move them into evaluation (#4447)
|
2024-10-17 05:38:23 +08:00 |
|
mamoodi
|
6f2e678028
|
Fix eval output path in case of @ char (#4416)
|
2024-10-15 22:45:08 +00:00 |
|
Abhijeetsingh Meena
|
173018eb58
|
fix: Resolves HumanEval Inference by replacing task_id with instance_id (#4364)
Co-authored-by: Harshit Surana <surana.h@gmail.com>
|
2024-10-15 15:18:38 +00:00 |
|
Xingyao Wang
|
50c13aad98
|
[Eval] Improve SWE-Bench Eval harness: multi-run support & entry script simplification (#4396)
|
2024-10-15 21:34:52 +08:00 |
|
Xingyao Wang
|
25f9413965
|
[Eval] Fix eval stuck when result is too large for pbar (#4361)
|
2024-10-14 22:08:34 +08:00 |
|
Xingyao Wang
|
4dfc7a7ef0
|
[Eval] Add a more lightweight / easier-to-use SWE-Bench output visualizer (#4360)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2024-10-14 02:09:01 +00:00 |
|
Xingyao Wang
|
b23c7aab5a
|
[eval] stop set sid in eval (#4311)
|
2024-10-10 11:47:27 +08:00 |
|
Robert Brennan
|
45fb4fb9bc
|
allow reconnecting to a runtime (#4223)
|
2024-10-09 16:37:52 +00:00 |
|
Engel Nyst
|
e6847e9e61
|
Move agenthub within openhands (#4130)
|
2024-10-08 00:34:18 +00:00 |
|
Alejandro Cuadron Lafuente
|
a3571ec510
|
[Fix] Error when trying to pull all docker evaluation containers (#4244)
|
2024-10-08 05:03:36 +08:00 |
|
Aditya Bharat Soni
|
0809d26f4d
|
fix: Allow evaluation benchmarks to pass image urls in run_controller() instead of simply passing strings (#4100)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
|
2024-10-07 15:37:08 -04:00 |
|
Xingyao Wang
|
01ae54a69d
|
fix swebench repo/version being string (#4241)
|
2024-10-07 22:01:42 +08:00 |
|
Xingyao Wang
|
245334e89d
|
[eval] improve update output script for swe-bench (#4180)
|
2024-10-04 15:10:03 +00:00 |
|
Xingyao Wang
|
80a631361b
|
eval: update aiderbench readme (#4209)
|
2024-10-04 09:26:12 -04:00 |
|
Xingyao Wang
|
9cc9b19958
|
eval: improve swebench infer error handling and retry (#4205)
|
2024-10-04 07:09:56 -05:00 |
|
Xingyao Wang
|
0c2a35b256
|
[eval] update aider bench scripts (#4203)
|
2024-10-04 02:23:06 +00:00 |
|
tofarr
|
152f99c64f
|
Chore Bump python version (#3545)
|
2024-10-03 13:40:55 -04:00 |
|
Xingyao Wang
|
53a015f718
|
fix: make llm_completions optional to fix eval_infer.py (#4148)
|
2024-10-02 03:55:03 +08:00 |
|
mamoodi
|
0144caaf1f
|
Update eval doc for remote runtime (#4145)
|
2024-10-01 13:14:36 -04:00 |
|
Xingyao Wang
|
1109637efb
|
Update instruction for new version of eval runtime-api (#4128)
|
2024-09-30 23:48:38 +00:00 |
|
Xingyao Wang
|
8d6eda3623
|
fix eval_infer.sh to correctly copy SWE-Bench logs (#4111)
|
2024-09-29 18:39:18 -05:00 |
|
tobitege
|
c3bbe604eb
|
(fix) Fix logging in shared eval file to prevent key disclosure (#4108)
|
2024-09-28 19:33:16 +00:00 |
|
Xingyao Wang
|
81b3cd71b3
|
[eval] log evaluating warnings directly to console (#4026)
|
2024-09-26 03:42:32 +08:00 |
|
Xingyao Wang
|
1b1d8f0b02
|
[eval] Use imap_unorderd for parallizing evaluation (#4040)
|
2024-09-24 20:47:27 +00:00 |
|
Xingyao Wang
|
a66e738957
|
[eval] use mp Pool instead ProcessPoolExecutor (#4025)
|
2024-09-24 23:59:06 +08:00 |
|
Ikko Eltociear Ashimine
|
c84495830e
|
[eval] update swe_bench/README.md (#3990)
|
2024-09-23 11:03:09 +02:00 |
|
Xingyao Wang
|
714e46f29a
|
[eval] save eventstream & llm completions for SWE-Bench run_infer (#3923)
|
2024-09-22 04:39:13 +00:00 |
|
Xingyao Wang
|
b13ed017d8
|
[eval] add git patch post-processing for SWE-Bench eval_infer (#3980)
|
2024-09-20 15:33:53 +00:00 |
|