| Author | Commit | Message | Date |
|---|---|---|---|
| Kevin Musgrave | 74ba21bad0 | feat(evaluation): Added INSTRUCTION_TEMPLATE_NAME to run_infer.py in swe_bench (#10270)<br>Co-authored-by: Xingyao Wang <xingyao@all-hands.dev><br>Co-authored-by: mamoodi <mamoodiha@gmail.com> | 2025-08-18 14:18:08 +00:00 |
| Zhonghao Jiang | 7229a16b45 | feat(evaluation): Add NoCode-bench evaluation script (#10229) | 2025-08-16 16:41:22 +00:00 |
| Engel Nyst | f7f4fcf98f | chore(eval): remove old, unused regression test framework under evaluation/regression (#10419) | 2025-08-16 01:08:23 +02:00 |
| Xingyao Wang | c2f46200c0 | chore(lint): Apply comprehensive linting and formatting fixes (#10287)<br>Co-authored-by: openhands <openhands@all-hands.dev> | 2025-08-13 21:13:19 +02:00 |
| Ibragim Badertdinov | 19a6b6b618 | feat(eval): Support evaluation on SWE-rebench (#10251) | 2025-08-12 14:05:43 +00:00 |
| Insop | 1d0d88d491 | Readability improvement & remove duplicated and unused prompts (#10241) | 2025-08-12 12:42:17 +08:00 |
| Ryan H. Tran | 758e30c9a8 | Remove SecretStr conversion in GAIA eval (#10204) | 2025-08-11 21:30:18 +08:00 |
| Xingyao Wang | 04ff4a025b | feat(cli): Use CLI to launch OpenHands UI server via Docker (#9783)<br>Co-authored-by: openhands <openhands@all-hands.dev> | 2025-08-09 02:04:07 +08:00 |
| Xingyao Wang | c4f303a07b | chore(eval): Remove eval_infer_remote.sh script and related references (#10157)<br>Co-authored-by: openhands <openhands@all-hands.dev> | 2025-08-07 20:46:59 +00:00 |
| Boxuan Li | 7af35ab827 | Evaluation: disable browser when NOT run_with_browsing (#9837) | 2025-07-22 01:45:52 +00:00 |
| juanmichelini | ea50fe4e3c | Fix: Continue evaluation when an instance fails after max retries (#8868)<br>Co-authored-by: openhands <openhands@all-hands.dev><br>Co-authored-by: Xingyao Wang <xingyaoww@gmail.com><br>Co-authored-by: Xingyao Wang <xingyao@all-hands.dev> | 2025-07-16 22:42:44 +00:00 |
| Engel Nyst | fba2218760 | Fix integration tests (#9746) | 2025-07-16 22:16:40 +02:00 |
| Boxuan Li | 5c3619bc48 | Add README for terminal_bench evaluation harness (#9700) | 2025-07-15 09:48:34 -04:00 |
| xhguo7 | 9388fef0ef | feat(eval): loc acc evaluation (#8515)<br>Co-authored-by: Xingyao Wang <xingyao@all-hands.dev><br>Co-authored-by: mamoodi <mamoodiha@gmail.com> | 2025-07-11 03:22:35 +08:00 |
| Xingyao Wang | cff5697456 | eval: remove gemini-specific swebench template (#9623) | 2025-07-08 18:34:23 +00:00 |
| Ryan H. Tran | dfa54673d2 | [OH-Versa] Add remaining browsing & GAIA eval improvement (#9015)<br>Co-authored-by: openhands <openhands@all-hands.dev><br>Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> | 2025-06-25 12:36:15 +07:00 |
| Maxim Evtush | 653a8a7ce2 | Refactor: Improve Consistency in Function Signatures and Regex Usage in compute_ism_pm_score.py (#9145) | 2025-06-18 04:22:16 +08:00 |
| Ryan H. Tran | ddaa186971 | [GAIA] Add prompt improvement to alleviate solution parsing issue & support Tavily search tools (#9057) | 2025-06-17 13:16:50 +07:00 |
| better629 | 432d8829dc | disable mcp in run_localize and install oh-aci[llama] for issue 9150 (#9151) | 2025-06-16 11:03:17 +00:00 |
| FT | e5bff91e8e | Fix Typo: Change "accurancy" to "accuracy" in Evaluation Benchmark Comments (#9139) | 2025-06-15 12:48:26 +00:00 |
| Linghao Zhang | a93b0457c6 | feat(eval): Support evaluation on SWE-bench-Live (#9137) | 2025-06-15 12:30:47 +00:00 |
| kilavvy | 4e99aabcb2 | Minor Code Comment Corrections and Clarifications (#9129) | 2025-06-14 18:57:14 +00:00 |
| Graham Neubig | 0c307ea12e | Lint all files in the repo (#9131)<br>Co-authored-by: openhands <openhands@all-hands.dev><br>Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> | 2025-06-14 16:25:59 +00:00 |
| ASTONE | be62ba6b35 | add_versicode (#8221) | 2025-06-14 13:17:18 +00:00 |
| leopardracer | 13c298d35f | Minor Typo Fixes in Comments and Documentation (#9058) | 2025-06-14 12:51:38 +00:00 |
| Engel Nyst | fd3b4ac8e6 | Refactor SWE-bench instruction (#8010) | 2025-06-13 23:27:52 +02:00 |
| Leander Maben | d84befe28f | Adding LLM Based Editing capability (#8677)<br>Co-authored-by: Xingyao Wang <xingyao@all-hands.dev><br>Co-authored-by: Engel Nyst <enyst@users.noreply.github.com><br>Co-authored-by: Engel Nyst <engel.nyst@gmail.com> | 2025-06-09 21:57:20 +08:00 |
| Sergey | 49939c1f02 | Fix typo in evaluation README.md (#8987) | 2025-06-08 14:14:07 +00:00 |
| llamantino | 880c05ed94 | Fix all broken docs links across the project (#8830)<br>Co-authored-by: llamantino <12345678+yourusername@users.noreply.github.com> | 2025-05-31 21:24:59 -04:00 |
| Robert Brennan | 205f0234e8 | Rename Conversation to ServerConversation and AppConfig to OpenHandsConfig (#8754)<br>Co-authored-by: openhands <openhands@all-hands.dev> | 2025-05-28 21:48:34 +02:00 |
| Xuhui Zhou | 14498c5e25 | Feature/swe run interact (#8714)<br>Co-authored-by: Xingyao Wang <xingyao@all-hands.dev> | 2025-05-27 19:35:21 +00:00 |
| Zhaoling Chen | efe287ce34 | integrate LocAgent into OpenHands (#7371)<br>Co-authored-by: czlll <gangda@huaihe.usc.edu><br>Co-authored-by: Hoang Tran <descience.thh10@gmail.com> | 2025-05-23 22:42:58 +07:00 |
| Ryan H. Tran | 3980ba53c9 | Add option to run patch evaluation on Modal (#8607)<br>Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> | 2025-05-23 00:45:45 +07:00 |
| Engel Nyst | 637cb0726a | specify condenser config for evals (#8177)<br>Co-authored-by: openhands <openhands@all-hands.dev> | 2025-05-21 22:08:57 +02:00 |
| luolin101 | 1a3cb16ba6 | add Visual SWE-bench benchmark (#7131)<br>Co-authored-by: tsukimi <yuailun@pku.edu.cn><br>Co-authored-by: Ryan H. Tran <descience.thh10@gmail.com> | 2025-05-19 12:08:46 +07:00 |
| Xingyao Wang | 2ecc39ffcc | [eval]: disable MCP for SWE-Bench evaluation (#8574)<br>Co-authored-by: openhands <openhands@all-hands.dev><br>Co-authored-by: Engel Nyst <enyst@users.noreply.github.com><br>Co-authored-by: Engel Nyst <engel.nyst@gmail.com> | 2025-05-19 01:32:46 +00:00 |
| Yueqi Song | 3ca585b79f | Update run_infer.py to incorporate selection of task based on repo (#8509) | 2025-05-15 12:27:28 +08:00 |
| omahs | 4bb6ec2ee5 | Fix typos (#8469) | 2025-05-13 09:34:21 +00:00 |
| Graham Neubig | f317c03b1b | Fix inconsistent max_iterations in SWE-bench evaluation (#8467)<br>Co-authored-by: openhands <openhands@all-hands.dev> | 2025-05-13 02:07:57 +00:00 |
| Graham Neubig | 689d3c9046 | Update pre-commit hook versions to most recent versions (#8343)<br>Co-authored-by: openhands <openhands@all-hands.dev> | 2025-05-08 03:59:13 +00:00 |
| Engel Nyst | 985e20d529 | [chore] Run full agent pre-commit (#8235) | 2025-05-03 11:24:03 -04:00 |
| Qi Liu | 3d22520992 | [Feat] add multi-swe-bench (#8174)<br>Co-authored-by: ByteDance User <tiger@bytedance.localdomain> | 2025-05-01 00:23:19 +00:00 |
| Michael Panchenko | 14564b25d6 | Fix linting (#7965) | 2025-04-21 06:34:40 +08:00 |
| Engel Nyst | a2c55cfdef | Refactor to clean up and move utility/legacy out of the agent (#7917) | 2025-04-19 01:53:33 +08:00 |
| Xingyao Wang | 7c23993344 | fix(eval): typo in SWE_Bench evaluation (#7930) | 2025-04-19 00:31:08 +08:00 |
| Engel Nyst | 9b9b1291fc | [chore] Just linting on swe-bench files (#7918) | 2025-04-18 22:12:01 +08:00 |
| Niels Mündler | 4b124d5906 | Add inference for SWT-Bench (#7201)<br>Co-authored-by: Xingyao Wang <xingyao@all-hands.dev><br>Co-authored-by: Graham Neubig <neubig@gmail.com><br>Co-authored-by: Engel Nyst <enyst@users.noreply.github.com><br>Co-authored-by: Calvin Smith <email@cjsmith.io> | 2025-04-17 14:49:42 -06:00 |
| juanmichelini | 6bcebd4b9d | Jetbrains CI Benchmark (#7811)<br>Co-authored-by: Xingyao Wang <xingyao@all-hands.dev> | 2025-04-17 15:10:20 +00:00 |
| Engel Nyst | 5e5bf23f9c | [Evaluation] Fix KeyError when the instance failed prematurely (#7864) | 2025-04-15 15:19:31 +00:00 |
| Engel Nyst | d05a6f30e1 | [Refactor] Rename codeact_* agent options to simple name (#7853) | 2025-04-15 00:14:13 +02:00 |