| Author | Commit | Message | Date |
|---|---|---|---|
| OpenHands | 8975fcd714 | Fix issue #5748: Rename "Ran a Jupyter Command" to "Ran a Python Command" in UI (#5749); Co-authored-by: Graham Neubig <neubig@gmail.com> | 2024-12-26 23:30:19 +08:00 |
| OpenHands | bfb191b5c7 | Fix issue #5739: [Bug]: Move ./evaluation/swe_bench/scripts/cleanup_remote_runtime.sh to general eval utils (#5740) | 2024-12-25 17:17:06 -05:00 |
| Boxuan Li | ecff5c67fb | Evaluation README: Add TheAgentCompany (#5777) | 2024-12-24 02:37:42 +00:00 |
| Boxuan Li | b1719bb3db | Add TheAgentCompany evaluation harness (#5731) | 2024-12-22 14:12:30 -05:00 |
| OpenHands | 21948fa81b | Fix issue #5735: [Bug]: Inconsistent command line arguments in evaluation directory (#5736) | 2024-12-22 04:41:39 +08:00 |
| Xingyao Wang | 581d5ec7a8 | feat(eval): increase resource factor for remote runtime when previous run failed due to resource (#5709) | 2024-12-21 01:47:06 +08:00 |
| Xingyao Wang | c333938384 | feat(eval): add standard error to swebench summarize outputs (#5700); Co-authored-by: openhands <openhands@all-hands.dev> | 2024-12-20 08:39:43 +08:00 |
| Xingyao Wang | e9cafb0372 | chore: Cleanup runtime exception handling (#5696) | 2024-12-19 17:28:29 +00:00 |
| Xingyao Wang | 9cdb8d06c0 | fix(eval): Use cp -r instead of mv for SWE-Bench Initialization (#5659) | 2024-12-17 21:21:27 +00:00 |
| Engel Nyst | 3297e4d5a8 | Use litellm's modify params (#5636) | 2024-12-17 21:32:49 +01:00 |
| OpenHands | 4998b5de32 | Fix issue #5559: The turn limit should be measured from the last user interaction (#5560); Co-authored-by: Graham Neubig <neubig@gmail.com>; Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> | 2024-12-16 16:28:23 -05:00 |
| Engel Nyst | b295f5775c | Revert "Fix issue #5609: Use litellm's modify_params with default True" (#5631) | 2024-12-16 20:39:57 +00:00 |
| OpenHands | 09735c7869 | Fix issue #5609: Use litellm's modify_params with default True (#5611); Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> | 2024-12-16 20:18:45 +01:00 |
| Engel Nyst | 4716955960 | Remove unused codeact-SWE agent (#5600); Co-authored-by: openhands <openhands@all-hands.dev> | 2024-12-14 20:49:44 +01:00 |
| Ryan H. Tran | 8ae2fb636e | Remove symlink use for swebench setup (#5549) | 2024-12-13 22:18:14 +08:00 |
| Engel Nyst | b11e905988 | Verify costs script (#5469) | 2024-12-10 14:20:53 +01:00 |
| Engel Nyst | 455e667739 | add cost to summary (#5473) | 2024-12-10 03:14:03 +08:00 |
| Cheng Yang | 8f47547b08 | docs: fix markdown linting and broken links (#5401) | 2024-12-05 01:28:04 +08:00 |
| Xingyao Wang | 9908e1b285 | [Evaluation]: Log openhands version in eval output folder, instead of agent version (#5394) | 2024-12-04 03:33:43 +00:00 |
| Xingyao Wang | 990f277132 | misc: Support folder-level exp analysis for SWE-Bench summarize_outputs.py; Handle CrashLoopBackoff for RemoteRuntime (#5385) | 2024-12-03 15:37:21 +00:00 |
| Engel Nyst | ea994b6209 | More integration tests info (#5319) | 2024-11-29 16:39:03 +01:00 |
| Cheng Yang | b808a639d9 | docs: improve evaluation README with proper links and formatting (#5221) | 2024-11-27 18:27:36 -05:00 |
| Xingyao Wang | 4d3b035e00 | feat(agent): add BrowseURLAction to CodeAct (produce markdown from URL) (#5285) | 2024-11-27 21:55:57 +00:00 |
| OpenHands | f0ca2239f3 | Fix issue #5076: Integration test github action (#5077); Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> | 2024-11-27 21:31:48 +01:00 |
| Graham Neubig | 12dd3352c5 | Add remote runtime support to agent_bench (#5280); Co-authored-by: openhands <openhands@all-hands.dev>; Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> | 2024-11-26 13:45:49 +00:00 |
| OpenHands | 678436da30 | Fix issue #5222: [Refactor]: Refactor the evaluation directory (#5223); Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> | 2024-11-25 08:35:52 -05:00 |
| Nan Jiang | 463d4e9a46 | eval: add commit0 benchmark (#5153); Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>; Co-authored-by: Xingyao Wang <xingyao@all-hands.dev> | 2024-11-22 19:49:45 +00:00 |
| Xingyao Wang | ff84a3eede | chore: remove specified sid (#5127) | 2024-11-19 16:41:27 +00:00 |
| Xingyao Wang | a531413d86 | fix(eval): support setting hard timeout per evaluation instance (#5110) | 2024-11-18 21:22:55 -05:00 |
| Xingyao Wang | bdc4513937 | fix(swebench): handle error in eval_infer and run_infer (#5017) | 2024-11-15 23:04:56 +08:00 |
| Graham Neubig | ce6f99d80e | Add GITHUB_USERNAME env var to resolver step (#4999); Co-authored-by: openhands <openhands@all-hands.dev> | 2024-11-14 18:42:59 +00:00 |
| Ketan Ramaneti | 852c90f64a | [fix eval] Fix issues with miniwob remote runtime evaluation (#5001) | 2024-11-14 18:00:48 +00:00 |
| Ketan Ramaneti | 42b49e6c43 | [fix eval] Fix issues with aider_bench remote runtime evaluation (#5000) | 2024-11-14 17:58:45 +00:00 |
| Xingyao Wang | 07f0d1ccb3 | feat(llm): convert function call request for non-funcall OSS model (#4711); Co-authored-by: Calvin Smith <email@cjsmith.io> | 2024-11-15 00:40:09 +08:00 |
| Robert Brennan | bc3f0ac24a | fix imports (#4974) | 2024-11-13 17:04:16 +00:00 |
| Calvin Smith | 50e7da9c3d | fix(evaluation): SWE-bench evaluation script supports multiprocessing (#4943) | 2024-11-12 12:19:57 -07:00 |
| Robert Brennan | 17f4c6e1a9 | Refactor sessions a bit, and fix issue where runtimes get killed (#4900) | 2024-11-12 16:20:36 +00:00 |
| Xingyao Wang | a07e8272da | fix: improve remote runtime reliability on large-scale evaluation (#4869) | 2024-11-09 20:17:10 +00:00 |
| Robert Brennan | be82832eb1 | Use keyword matching for CodeAct microagents (#4568); Co-authored-by: Xingyao Wang <xingyao@all-hands.dev> | 2024-11-09 11:25:02 -05:00 |
| Xingyao Wang | 4ce3b9094a | Revert "(feat): Prompt engineering to remind o1 to generate a patch" (#4846) | 2024-11-08 16:12:57 +00:00 |
| Alejandro Cuadron Lafuente | a6810fa6ad | (feat): Prompt engineering to remind o1 to generate a patch (#4807); Signed-off-by: dependabot[bot] <support@github.com>; Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>; Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>; Co-authored-by: mamoodi <mamoodiha@gmail.com>; Co-authored-by: tofarr <tofarr@gmail.com>; Co-authored-by: Robert Brennan <contact@rbren.io> | 2024-11-08 03:10:18 +00:00 |
| Xingyao Wang | 53390d9885 | Fix issue #4583: [Bug]: Unable to pull the full SWE-Bench test set (#4813); Co-authored-by: openhands <openhands@all-hands.dev> | 2024-11-07 22:35:20 +08:00 |
| OpenHands | 025dac5d8f | Fix issue #4776: [Bug]: Files are not uploaded to the environment (SWE-Bench) (#4795) | 2024-11-06 19:05:06 +00:00 |
| Engel Nyst | eeb2342509 | Refactor history/event stream (#3808) | 2024-11-05 03:36:14 +01:00 |
| Xingyao Wang | 1d2a616be7 | Fix issue #4739: '[Bug]: The agent doesn'"'"'t know its name' (#4740); Co-authored-by: openhands <openhands@all-hands.dev>; Co-authored-by: Graham Neubig <neubig@gmail.com> | 2024-11-04 21:24:35 +00:00 |
| Xingyao Wang | 966da7b7c8 | feat(agent, CodeAct 2.2): native CodeAct support for Browsing (#4667); Co-authored-by: tofarr <tofarr@gmail.com> | 2024-11-05 00:27:27 +08:00 |
| Abhijeetsingh Meena | 8857f02083 | [Eval] DiscoveryBench OpenHands Integration (#4627); Signed-off-by: Abhijeetsingh Meena <abhijeet040403@gmail.com>; Co-authored-by: Harshit Surana <surana.h@gmail.com> | 2024-11-02 07:24:34 -04:00 |
| Ziru "Ron" Chen | db4e1dbbec | [eval] Add ScienceAgentBench. (#4645); Co-authored-by: Xingyao Wang <xingyao@all-hands.dev> | 2024-11-01 02:30:55 +08:00 |
| Xingyao Wang | 9c2b48ff5d | fix(eval): SWE-Bench instance with upper-case instance id (#4649) | 2024-10-30 21:24:18 +00:00 |
| Xingyao Wang | 6d19c93d19 | [eval] add evaluation workflow (#4489); Co-authored-by: openhands <openhands@all-hands.dev>; Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> | 2024-10-29 13:52:25 +00:00 |