291 Commits

Author SHA1 Message Date
Xingyao Wang
391200510c fix: revert #5506 for SWE-Bench performance regression (#6491)
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-28 22:52:57 +08:00
Aditya Bharat Soni
aebb583779 Support for VisualWebArena evaluation in OpenHands (#4773)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-01-23 20:18:30 +00:00
Engel Nyst
b9a3f1c753 Fix eval on remote runtime (#6398) 2025-01-21 20:49:30 +00:00
Engel Nyst
5b7fcfbe1a Disable prompt extensions in SWE-bench (#6391) 2025-01-21 17:18:30 +00:00
louria
7f57dbebda Update MiniWoB README (#6385) 2025-01-21 16:26:47 +01:00
Xingyao Wang
2b04ee2e62 feat(eval): reliability improvement for SWE-Bench eval_infer (#6347) 2025-01-18 14:02:59 -05:00
Calvin Smith
a12087243a Pydantic-based configuration and setting objects (#6321)
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-01-17 12:33:22 -07:00
Xingyao Wang
899c1f8360 fix(bash): also show timeout reminder when no_change_timeout is triggered (#6318)
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-18 03:31:23 +08:00
Xingyao Wang
72af7bbba2 feat(eval): misc SWE-Bench improvement - use different resources for different instances (#6313)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-01-17 02:48:41 +08:00
Xingyao Wang
0c961bfd8b refactor(prompt): move runtime/repo info to user message and disable them in eval (#6291) 2025-01-16 17:53:10 +00:00
Xingyao Wang
0bed17758f fix: incorrect soft-timeout implementation & fix hard-timeout follow-up command (#6280) 2025-01-17 01:27:00 +08:00
Engel Nyst
b9a70c8d5c Delegation fixes (#6165) 2025-01-15 03:24:39 +00:00
Boxuan Li
92b8d55c2d Rename trajectories_path config to save_trajectory_path (#6216)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-01-14 04:32:45 +00:00
tofarr
23473070b9 Revert "Config objects as Pydantic BaseModels (#6176)" (#6214) 2025-01-13 07:36:25 -07:00
Calvin Smith
873dddb4e8 Config objects as Pydantic BaseModels (#6176)
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
2025-01-12 15:09:45 -05:00
Calvin Smith
6e4ff56934 feature: Condenser Interface and Defaults (#5306)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2025-01-08 04:36:30 +08:00
Dmitry Kozlov
17d722f3b3 Update README.md (#6076)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
2025-01-06 17:31:19 +00:00
Xingyao Wang
ec70af9412 refactor: Replace pexpect with libtmux in BashSession (#4881)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
2025-01-04 05:22:13 +08:00
Xingyao Wang
f14f75b064 feat: runtime improvements for rate-limit and 502/503/404 error (#5975) 2025-01-03 08:36:19 -07:00
Xingyao Wang
61ebec9ff7 feat(eval): better visualization for comparing two swe-bench runs (#5993) 2025-01-03 02:36:51 +00:00
Xingyao Wang
9dd5463e06 Set default value of use_microagents to False to prevent breaking eval (#5976)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-01-03 05:39:17 +08:00
Robert Brennan
0e4e1b3316 Factor out ActionExecutionClient (#5796) 2024-12-30 15:32:13 +00:00
Boxuan Li
6a4442e590 [Evaluation] Add summarise_results script for TheAgentCompany benchmark (#5811) 2024-12-27 20:33:41 -08:00
Boxuan Li
5ed80b5c32 [doc] Fix link in TheAgentCompany benchmark's README.md (#5848) 2024-12-27 22:21:02 +08:00
OpenHands
8975fcd714 Fix issue #5748: Rename "Ran a Jupyter Command" to "Ran a Python Command" in UI (#5749)
Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-12-26 23:30:19 +08:00
OpenHands
bfb191b5c7 Fix issue #5739: [Bug]: Move ./evaluation/swe_bench/scripts/cleanup_remote_runtime.sh to general eval utils (#5740) 2024-12-25 17:17:06 -05:00
Boxuan Li
ecff5c67fb Evaluation README: Add TheAgentCompany (#5777) 2024-12-24 02:37:42 +00:00
Boxuan Li
b1719bb3db Add TheAgentCompany evaluation harness (#5731) 2024-12-22 14:12:30 -05:00
OpenHands
21948fa81b Fix issue #5735: [Bug]: Inconsistent command line arguments in evaluation directory (#5736) 2024-12-22 04:41:39 +08:00
Xingyao Wang
581d5ec7a8 feat(eval): increase resource factor for remote runtime when previous run failed due to resource (#5709) 2024-12-21 01:47:06 +08:00
Xingyao Wang
c333938384 feat(eval): add standard error to swebench summarize outputs (#5700)
Co-authored-by: openhands <openhands@all-hands.dev>
2024-12-20 08:39:43 +08:00
Xingyao Wang
e9cafb0372 chore: Cleanup runtime exception handling (#5696) 2024-12-19 17:28:29 +00:00
Xingyao Wang
9cdb8d06c0 fix(eval): Use cp -r instead of mv for SWE-Bench Initialization (#5659) 2024-12-17 21:21:27 +00:00
Engel Nyst
3297e4d5a8 Use litellm's modify params (#5636) 2024-12-17 21:32:49 +01:00
OpenHands
4998b5de32 Fix issue #5559: The turn limit should be measured from the last user interaction (#5560)
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-12-16 16:28:23 -05:00
Engel Nyst
b295f5775c Revert "Fix issue #5609: Use litellm's modify_params with default True" (#5631) 2024-12-16 20:39:57 +00:00
OpenHands
09735c7869 Fix issue #5609: Use litellm's modify_params with default True (#5611)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-12-16 20:18:45 +01:00
Engel Nyst
4716955960 Remove unused codeact-SWE agent (#5600)
Co-authored-by: openhands <openhands@all-hands.dev>
2024-12-14 20:49:44 +01:00
Ryan H. Tran
8ae2fb636e Remove symlink use for swebench setup (#5549) 2024-12-13 22:18:14 +08:00
Engel Nyst
b11e905988 Verify costs script (#5469) 2024-12-10 14:20:53 +01:00
Engel Nyst
455e667739 add cost to summary (#5473) 2024-12-10 03:14:03 +08:00
Cheng Yang
8f47547b08 docs: fix markdown linting and broken links (#5401) 2024-12-05 01:28:04 +08:00
Xingyao Wang
9908e1b285 [Evaluation]: Log openhands version in eval output folder, instead of agent version (#5394) 2024-12-04 03:33:43 +00:00
Xingyao Wang
990f277132 misc: Support folder-level exp analysis for SWE-Bench summarize_outputs.py; Handle CrashLoopBackoff for RemoteRuntime (#5385) 2024-12-03 15:37:21 +00:00
Engel Nyst
ea994b6209 More integration tests info (#5319) 2024-11-29 16:39:03 +01:00
Cheng Yang
b808a639d9 docs: improve evaluation README with proper links and formatting (#5221) 2024-11-27 18:27:36 -05:00
Xingyao Wang
4d3b035e00 feat(agent): add BrowseURLAction to CodeAct (produce markdown from URL) (#5285) 2024-11-27 21:55:57 +00:00
OpenHands
f0ca2239f3 Fix issue #5076: Integration test github action (#5077)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-11-27 21:31:48 +01:00
Graham Neubig
12dd3352c5 Add remote runtime support to agent_bench (#5280)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-11-26 13:45:49 +00:00
OpenHands
678436da30 Fix issue #5222: [Refactor]: Refactor the evaluation directory (#5223)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-11-25 08:35:52 -05:00