Elena Chistova
|
38e866cde4
|
Fix official SWE-Bench docker image prefix (#7214)
|
2025-03-12 18:23:19 +00:00 |
|
juanmichelini
|
b36deca265
|
Added link to paper in commit0 README (#7221)
|
2025-03-12 17:17:22 +00:00 |
|
Xingyao Wang
|
a4908f9a75
|
[agent] system message + SWE-Bench instruction improvements (#7018)
|
2025-03-08 00:27:02 +08:00 |
|
Nan Jiang
|
ec087993f1
|
rename commit0_bench to commit0 (#7124)
|
2025-03-06 02:55:39 +00:00 |
|
Xingyao Wang
|
9f720a9d69
|
[eval] SWE-Gym Integration (#6651)
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2025-03-05 20:15:02 +00:00 |
|
Xingyao Wang
|
bbf40c6576
|
docs: cleanup and update SWE-Bench documentation; and remove the support of non-instance-level image (#7118)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2025-03-06 03:18:40 +08:00 |
|
Xingyao Wang
|
4be33a079b
|
Update SWE-Bench README.md about RemoteRuntime (#7108)
|
2025-03-05 23:00:54 +08:00 |
|
He Du
|
896d7b8b96
|
Openhands fix issue 7091 (#7092)
Co-authored-by: 杜贺 <duhe@duhedeMacBook-Pro-2.local>
|
2025-03-04 18:39:28 +01:00 |
|
Rohit Malhotra
|
5ffb1ef704
|
Fix typing (#7083)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-03-03 20:41:11 +00:00 |
|
Engel Nyst
|
395c1ea9e3
|
[Refactor] split runtime initialization (create, connect, init) in cli scripts (#7036)
|
2025-03-03 00:19:25 +01:00 |
|
Engel Nyst
|
660d1d1e64
|
Fix argument in swe-bench grading scripts (#7046)
|
2025-03-02 12:37:15 +08:00 |
|
Magic Mai
|
8a58e724c6
|
fix: Remove nested git repositories before adding files in SWE-bench (#6536)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
|
2025-02-28 01:19:33 +00:00 |
|
Xingyao Wang
|
33780f97d0
|
[eval] Upgrade SWE-Bench to use official image and latest harness (#6838)
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2025-02-27 08:15:05 -05:00 |
|
Engel Nyst
|
4f98bce6df
|
Add selected_repo to command line (#6949)
|
2025-02-26 20:42:59 +01:00 |
|
Mateusz Kwiatkowski
|
6562297615
|
Replace shebang with /usr/bin/env bash for improved portability (#6876)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
|
2025-02-24 18:07:28 +00:00 |
|
Xingyao Wang
|
e52aee168e
|
Docs: Clarify config.toml usage in evaluation harness (#6828)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-02-20 22:16:17 -08:00 |
|
Xingyao Wang
|
1a7003a705
|
Add sysbox support to remote runtime for eval; Add memory monitor, stress tests to help debug memory issue (#6684)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2025-02-18 20:02:28 +00:00 |
|
Boxuan Li
|
4443417c75
|
A few fixes for TAC evaluation harness (#6586)
|
2025-02-14 21:01:57 -08:00 |
|
Boxuan Li
|
ef12bc5381
|
Evaluation harness: Add agent config option (#6662)
|
2025-02-13 15:05:03 -05:00 |
|
Graham Neubig
|
e930cd0aef
|
Better error logging in posthog (#6346)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Ray Myers <ray.myers@gmail.com>
|
2025-02-06 20:16:37 +00:00 |
|
Xingyao Wang
|
90bbd4edbe
|
fix: initialize default metadata with all required fields (#6583)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-02-04 02:52:11 +08:00 |
|
tofarr
|
bbfdc62139
|
Fix for issue where retries continue on a closed runtime (#6564)
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
|
2025-02-03 08:44:09 -07:00 |
|
Boxuan Li
|
62402cd617
|
The-Agent-Company evaluation harness: Support splits (#6577)
|
2025-02-02 13:12:01 +08:00 |
|
Xingyao Wang
|
1a9971b1bf
|
misc: make RemoteRuntime API timeout configurable (#6518)
Co-authored-by: Robert Brennan <accounts@rbren.io>
|
2025-01-30 06:30:18 +08:00 |
|
Xingyao Wang
|
391200510c
|
fix: revert #5506 for SWE-Bench performance regression (#6491)
Co-authored-by: Robert Brennan <accounts@rbren.io>
|
2025-01-28 22:52:57 +08:00 |
|
Aditya Bharat Soni
|
aebb583779
|
Support for VisualWebArena evaluation in OpenHands (#4773)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2025-01-23 20:18:30 +00:00 |
|
Engel Nyst
|
b9a3f1c753
|
Fix eval on remote runtime (#6398)
|
2025-01-21 20:49:30 +00:00 |
|
Engel Nyst
|
5b7fcfbe1a
|
Disable prompt extensions in SWE-bench (#6391)
|
2025-01-21 17:18:30 +00:00 |
|
louria
|
7f57dbebda
|
Update MiniWoB README (#6385)
|
2025-01-21 16:26:47 +01:00 |
|
Xingyao Wang
|
2b04ee2e62
|
feat(eval): reliability improvement for SWE-Bench eval_infer (#6347)
|
2025-01-18 14:02:59 -05:00 |
|
Calvin Smith
|
a12087243a
|
Pydantic-based configuration and setting objects (#6321)
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2025-01-17 12:33:22 -07:00 |
|
Xingyao Wang
|
899c1f8360
|
fix(bash): also show timeout reminder when no_change_timeout is triggered (#6318)
Co-authored-by: Robert Brennan <accounts@rbren.io>
|
2025-01-18 03:31:23 +08:00 |
|
Xingyao Wang
|
72af7bbba2
|
feat(eval): misc SWE-Bench improvement - use different resources for different instances (#6313)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-01-17 02:48:41 +08:00 |
|
Xingyao Wang
|
0c961bfd8b
|
refactor(prompt): move runtime/repo info to user message and disable them in eval (#6291)
|
2025-01-16 17:53:10 +00:00 |
|
Xingyao Wang
|
0bed17758f
|
fix: incorrect soft-timeout implementation & fix hard-timeout follow-up command (#6280)
|
2025-01-17 01:27:00 +08:00 |
|
Engel Nyst
|
b9a70c8d5c
|
Delegation fixes (#6165)
|
2025-01-15 03:24:39 +00:00 |
|
Boxuan Li
|
92b8d55c2d
|
Rename trajectories_path config to save_trajectory_path (#6216)
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2025-01-14 04:32:45 +00:00 |
|
tofarr
|
23473070b9
|
Revert "Config objects as Pydantic BaseModels (#6176)" (#6214)
|
2025-01-13 07:36:25 -07:00 |
|
Calvin Smith
|
873dddb4e8
|
Config objects as Pydantic BaseModels (#6176)
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2025-01-12 15:09:45 -05:00 |
|
Calvin Smith
|
6e4ff56934
|
feature: Condenser Interface and Defaults (#5306)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2025-01-08 04:36:30 +08:00 |
|
Dmitry Kozlov
|
17d722f3b3
|
Update README.md (#6076)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
|
2025-01-06 17:31:19 +00:00 |
|
Xingyao Wang
|
ec70af9412
|
refactor: Replace pexpect with libtmux in BashSession (#4881)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Robert Brennan <accounts@rbren.io>
|
2025-01-04 05:22:13 +08:00 |
|
Xingyao Wang
|
f14f75b064
|
feat: runtime improvements for rate-limit and 502/503/404 error (#5975)
|
2025-01-03 08:36:19 -07:00 |
|
Xingyao Wang
|
61ebec9ff7
|
feat(eval): better visualization for comparing two swe-bench runs (#5993)
|
2025-01-03 02:36:51 +00:00 |
|
Xingyao Wang
|
9dd5463e06
|
Set default value of use_microagents to False to prevent breaking eval (#5976)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-01-03 05:39:17 +08:00 |
|
Robert Brennan
|
0e4e1b3316
|
Factor out ActionExecutionClient (#5796)
|
2024-12-30 15:32:13 +00:00 |
|
Boxuan Li
|
6a4442e590
|
[Evaluation] Add summarise_results script for TheAgentCompany benchmark (#5811)
|
2024-12-27 20:33:41 -08:00 |
|
Boxuan Li
|
5ed80b5c32
|
[doc] Fix link in TheAgentCompany benchmark's README.md (#5848)
|
2024-12-27 22:21:02 +08:00 |
|
OpenHands
|
8975fcd714
|
Fix issue #5748: Rename "Ran a Jupyter Command" to "Ran a Python Command" in UI (#5749)
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2024-12-26 23:30:19 +08:00 |
|
OpenHands
|
bfb191b5c7
|
Fix issue #5739: [Bug]: Move ./evaluation/swe_bench/scripts/cleanup_remote_runtime.sh to general eval utils (#5740)
|
2024-12-25 17:17:06 -05:00 |
|