Xingyao Wang
|
391200510c
|
fix: revert #5506 for SWE-Bench performance regression (#6491)
Co-authored-by: Robert Brennan <accounts@rbren.io>
|
2025-01-28 22:52:57 +08:00 |
|
Aditya Bharat Soni
|
aebb583779
|
Support for VisualWebArena evaluation in OpenHands (#4773)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2025-01-23 20:18:30 +00:00 |
|
Engel Nyst
|
b9a3f1c753
|
Fix eval on remote runtime (#6398)
|
2025-01-21 20:49:30 +00:00 |
|
Engel Nyst
|
5b7fcfbe1a
|
Disable prompt extensions in SWE-bench (#6391)
|
2025-01-21 17:18:30 +00:00 |
|
louria
|
7f57dbebda
|
Update MiniWoB README (#6385)
|
2025-01-21 16:26:47 +01:00 |
|
Calvin Smith
|
a12087243a
|
Pydantic-based configuration and setting objects (#6321)
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2025-01-17 12:33:22 -07:00 |
|
Xingyao Wang
|
899c1f8360
|
fix(bash): also show timeout reminder when no_change_timeout is triggered (#6318)
Co-authored-by: Robert Brennan <accounts@rbren.io>
|
2025-01-18 03:31:23 +08:00 |
|
Xingyao Wang
|
72af7bbba2
|
feat(eval): misc SWE-Bench improvement - use different resources for different instances (#6313)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-01-17 02:48:41 +08:00 |
|
Xingyao Wang
|
0c961bfd8b
|
refactor(prompt): move runtime/repo info to user message and disable them in eval (#6291)
|
2025-01-16 17:53:10 +00:00 |
|
Xingyao Wang
|
0bed17758f
|
fix: incorrect soft-timeout implementation & fix hard-timeout follow-up command (#6280)
|
2025-01-17 01:27:00 +08:00 |
|
Boxuan Li
|
92b8d55c2d
|
Rename trajectories_path config to save_trajectory_path (#6216)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2025-01-14 04:32:45 +00:00 |
|
tofarr
|
23473070b9
|
Revert "Config objects as Pydantic BaseModels (#6176)" (#6214)
|
2025-01-13 07:36:25 -07:00 |
|
Calvin Smith
|
873dddb4e8
|
Config objects as Pydantic BaseModels (#6176)
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2025-01-12 15:09:45 -05:00 |
|
Calvin Smith
|
6e4ff56934
|
feature: Condenser Interface and Defaults (#5306)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2025-01-08 04:36:30 +08:00 |
|
Dmitry Kozlov
|
17d722f3b3
|
Update README.md (#6076)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
|
2025-01-06 17:31:19 +00:00 |
|
Xingyao Wang
|
ec70af9412
|
refactor: Replace pexpect with libtmux in BashSession (#4881)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
|
2025-01-04 05:22:13 +08:00 |
|
Xingyao Wang
|
61ebec9ff7
|
feat(eval): better visualization for comparing two swe-bench runs (#5993)
|
2025-01-03 02:36:51 +00:00 |
|
Xingyao Wang
|
9dd5463e06
|
Set default value of use_microagents to False to prevent breaking eval (#5976)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-01-03 05:39:17 +08:00 |
|
Robert Brennan
|
0e4e1b3316
|
Factor out ActionExecutionClient (#5796)
|
2024-12-30 15:32:13 +00:00 |
|
Boxuan Li
|
6a4442e590
|
[Evaluation] Add summarise_results script for TheAgentCompany benchmark (#5811)
|
2024-12-27 20:33:41 -08:00 |
|
Boxuan Li
|
5ed80b5c32
|
[doc] Fix link in TheAgentCompany benchmark's README.md (#5848)
|
2024-12-27 22:21:02 +08:00 |
|
OpenHands
|
8975fcd714
|
Fix issue #5748: Rename "Ran a Jupyter Command" to "Ran a Python Command" in UI (#5749)
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2024-12-26 23:30:19 +08:00 |
|
OpenHands
|
bfb191b5c7
|
Fix issue #5739: [Bug]: Move ./evaluation/swe_bench/scripts/cleanup_remote_runtime.sh to general eval utils (#5740)
|
2024-12-25 17:17:06 -05:00 |
|
Boxuan Li
|
b1719bb3db
|
Add TheAgentCompany evaluation harness (#5731)
|
2024-12-22 14:12:30 -05:00 |
|
OpenHands
|
21948fa81b
|
Fix issue #5735: [Bug]: Inconsistent command line arguments in evaluation directory (#5736)
|
2024-12-22 04:41:39 +08:00 |
|
Xingyao Wang
|
581d5ec7a8
|
feat(eval): increase resource factor for remote runtime when previous run failed due to resource (#5709)
|
2024-12-21 01:47:06 +08:00 |
|
Xingyao Wang
|
c333938384
|
feat(eval): add standard error to swebench summarize outputs (#5700)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2024-12-20 08:39:43 +08:00 |
|
Xingyao Wang
|
e9cafb0372
|
chore: Cleanup runtime exception handling (#5696)
|
2024-12-19 17:28:29 +00:00 |
|
Xingyao Wang
|
9cdb8d06c0
|
fix(eval): Use cp -r instead of mv for SWE-Bench Initialization (#5659)
|
2024-12-17 21:21:27 +00:00 |
|
Engel Nyst
|
3297e4d5a8
|
Use litellm's modify params (#5636)
|
2024-12-17 21:32:49 +01:00 |
|
OpenHands
|
4998b5de32
|
Fix issue #5559: The turn limit should be measured from the last user interaction (#5560)
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2024-12-16 16:28:23 -05:00 |
|
Engel Nyst
|
b295f5775c
|
Revert "Fix issue #5609: Use litellm's modify_params with default True" (#5631)
|
2024-12-16 20:39:57 +00:00 |
|
OpenHands
|
09735c7869
|
Fix issue #5609: Use litellm's modify_params with default True (#5611)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2024-12-16 20:18:45 +01:00 |
|
Engel Nyst
|
4716955960
|
Remove unused codeact-SWE agent (#5600)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2024-12-14 20:49:44 +01:00 |
|
Ryan H. Tran
|
8ae2fb636e
|
Remove symlink use for swebench setup (#5549)
|
2024-12-13 22:18:14 +08:00 |
|
Engel Nyst
|
b11e905988
|
Verify costs script (#5469)
|
2024-12-10 14:20:53 +01:00 |
|
Engel Nyst
|
455e667739
|
add cost to summary (#5473)
|
2024-12-10 03:14:03 +08:00 |
|
Cheng Yang
|
8f47547b08
|
docs: fix markdown linting and broken links (#5401)
|
2024-12-05 01:28:04 +08:00 |
|
Xingyao Wang
|
9908e1b285
|
[Evaluation]: Log openhands version in eval output folder, instead of agent version (#5394)
|
2024-12-04 03:33:43 +00:00 |
|
Xingyao Wang
|
990f277132
|
misc: Support folder-level exp analysis for SWE-Bench summarize_outputs.py; Handle CrashLoopBackoff for RemoteRuntime (#5385)
|
2024-12-03 15:37:21 +00:00 |
|
Graham Neubig
|
12dd3352c5
|
Add remote runtime support to agent_bench (#5280)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2024-11-26 13:45:49 +00:00 |
|
OpenHands
|
678436da30
|
Fix issue #5222: [Refactor]: Refactor the evaluation directory (#5223)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2024-11-25 08:35:52 -05:00 |
|