Kevin Musgrave
|
12d6da8130
|
feat(evaluation): Filter task ids by difficulty for SWE Gym rollouts (#11490)
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-10-30 02:30:19 +00:00 |
|
Xingyao Wang
|
b082ccc0fb
|
feat(llm): add support for deepseek and gpt-5-mini, util for token count (#10626)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-08-27 11:03:35 +08:00 |
|
Xingyao Wang
|
4507a25b85
|
Evaluation: redirect sessions to repo-local .eval_sessions via helper; apply across entrypoints; add tests (#10540)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-08-22 13:34:02 +00:00 |
|
Engel Nyst
|
91d3d1d20a
|
Fix: expose aggregated LLM metrics in State for evaluation scripts (#10537)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-08-21 17:43:09 +02:00 |
|
Kevin Musgrave
|
74ba21bad0
|
feat(evaluation): Added INSTRUCTION_TEMPLATE_NAME to run_infer.py in swe_bench (#10270)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: mamoodi <mamoodiha@gmail.com>
|
2025-08-18 14:18:08 +00:00 |
|
Xingyao Wang
|
c2f46200c0
|
chore(lint): Apply comprehensive linting and formatting fixes (#10287)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-08-13 21:13:19 +02:00 |
|
Ibragim Badertdinov
|
19a6b6b618
|
feat(eval): Support evaluation on SWE-rebench (#10251)
|
2025-08-12 14:05:43 +00:00 |
|
juanmichelini
|
ea50fe4e3c
|
Fix: Continue evaluation when an instance fails after max retries (#8868)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Xingyao Wang <xingyaoww@gmail.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
|
2025-07-16 22:42:44 +00:00 |
|
Ryan H. Tran
|
dfa54673d2
|
[OH-Versa] Add remaining browsing & GAIA eval improvement (#9015)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2025-06-25 12:36:15 +07:00 |
|
Linghao Zhang
|
a93b0457c6
|
feat(eval): Support evaluation on SWE-bench-Live (#9137)
|
2025-06-15 12:30:47 +00:00 |
|
Graham Neubig
|
689d3c9046
|
Update pre-commit hook versions to most recent versions (#8343)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-05-08 03:59:13 +00:00 |
|
Rohit Malhotra
|
9adfcede31
|
(Hotfix): Track reason for Error AgentState (#7584)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-03-31 21:24:42 +00:00 |
|
Xingyao Wang
|
01e0e29a9f
|
Reduce bash SOFT timeout from 30 to 10 seconds (#7423)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-03-22 22:42:24 +00:00 |
|
Xingyao Wang
|
33780f97d0
|
[eval] Upgrade SWE-Bench to use official image and latest harness (#6838)
Co-authored-by: Robert Brennan <accounts@rbren.io>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2025-02-27 08:15:05 -05:00 |
|
Mateusz Kwiatkowski
|
6562297615
|
Replace shebang with /usr/bin/env bash for improved portability (#6876)
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
|
2025-02-24 18:07:28 +00:00 |
|
Xingyao Wang
|
1a7003a705
|
Add sysbox support to remote runtime for eval; Add memory monitor, stress tests to help debug memory issue (#6684)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2025-02-18 20:02:28 +00:00 |
|
Boxuan Li
|
ef12bc5381
|
Evaluation harness: Add agent config option (#6662)
|
2025-02-13 15:05:03 -05:00 |
|
Xingyao Wang
|
2b04ee2e62
|
feat(eval): reliability improvement for SWE-Bench eval_infer (#6347)
|
2025-01-18 14:02:59 -05:00 |
|
Calvin Smith
|
a12087243a
|
Pydantic-based configuration and setting objects (#6321)
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2025-01-17 12:33:22 -07:00 |
|
Xingyao Wang
|
899c1f8360
|
fix(bash): also show timeout reminder when no_change_timeout is triggered (#6318)
Co-authored-by: Robert Brennan <accounts@rbren.io>
|
2025-01-18 03:31:23 +08:00 |
|
tofarr
|
23473070b9
|
Revert "Config objects as Pydantic BaseModels (#6176)" (#6214)
|
2025-01-13 07:36:25 -07:00 |
|
Calvin Smith
|
873dddb4e8
|
Config objects as Pydantic BaseModels (#6176)
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2025-01-12 15:09:45 -05:00 |
|
Calvin Smith
|
6e4ff56934
|
feature: Condenser Interface and Defaults (#5306)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Calvin Smith <calvin@all-hands.dev>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
|
2025-01-08 04:36:30 +08:00 |
|
Xingyao Wang
|
f14f75b064
|
feat: runtime improvements for rate-limit and 502/503/404 error (#5975)
|
2025-01-03 08:36:19 -07:00 |
|
OpenHands
|
bfb191b5c7
|
Fix issue #5739: [Bug]: Move ./evaluation/swe_bench/scripts/cleanup_remote_runtime.sh to general eval utils (#5740)
|
2024-12-25 17:17:06 -05:00 |
|
Xingyao Wang
|
581d5ec7a8
|
feat(eval): increase resource factor for remote runtime when previous run failed due to resource (#5709)
|
2024-12-21 01:47:06 +08:00 |
|
Xingyao Wang
|
e9cafb0372
|
chore: Cleanup runtime exception handling (#5696)
|
2024-12-19 17:28:29 +00:00 |
|
Xingyao Wang
|
9908e1b285
|
[Evaluation]: Log openhands version in eval output folder, instead of agent version (#5394)
|
2024-12-04 03:33:43 +00:00 |
|
Xingyao Wang
|
a531413d86
|
fix(eval): support setting hard timeout per evaluation instance (#5110)
|
2024-11-18 21:22:55 -05:00 |
|
Xingyao Wang
|
07f0d1ccb3
|
feat(llm): convert function call request for non-funcall OSS model (#4711)
Co-authored-by: Calvin Smith <email@cjsmith.io>
|
2024-11-15 00:40:09 +08:00 |
|
Calvin Smith
|
50e7da9c3d
|
fix(evaluation): SWE-bench evaluation script supports multiprocessing (#4943)
|
2024-11-12 12:19:57 -07:00 |
|
Engel Nyst
|
eeb2342509
|
Refactor history/event stream (#3808)
|
2024-11-05 03:36:14 +01:00 |
|
Xingyao Wang
|
966da7b7c8
|
feat(agent, CodeAct 2.2): native CodeAct support for Browsing (#4667)
Co-authored-by: tofarr <tofarr@gmail.com>
|
2024-11-05 00:27:27 +08:00 |
|
Xingyao Wang
|
ae13171194
|
feat(agent): CodeAct with function calling (#4537)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: tobitege <10787084+tobitege@users.noreply.github.com>
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
Co-authored-by: tofarr <tofarr@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
|
2024-10-29 11:06:33 +08:00 |
|
Xingyao Wang
|
7340b78962
|
feat(eval): rewrite log_completions to save completions to directory (#4566)
|
2024-10-25 16:36:11 +00:00 |
|
mamoodi
|
6f2e678028
|
Fix eval output path in case of @ char (#4416)
|
2024-10-15 22:45:08 +00:00 |
|
Xingyao Wang
|
25f9413965
|
[Eval] Fix eval stuck when result is too large for pbar (#4361)
|
2024-10-14 22:08:34 +08:00 |
|
Engel Nyst
|
e6847e9e61
|
Move agenthub within openhands (#4130)
|
2024-10-08 00:34:18 +00:00 |
|
Xingyao Wang
|
9cc9b19958
|
eval: improve swebench infer error handling and retry (#4205)
|
2024-10-04 07:09:56 -05:00 |
|
Xingyao Wang
|
53a015f718
|
fix: make llm_completions optional to fix eval_infer.py (#4148)
|
2024-10-02 03:55:03 +08:00 |
|
tobitege
|
c3bbe604eb
|
(fix) Fix logging in shared eval file to prevent key disclosure (#4108)
|
2024-09-28 19:33:16 +00:00 |
|
Xingyao Wang
|
81b3cd71b3
|
[eval] log evaluating warnings directly to console (#4026)
|
2024-09-26 03:42:32 +08:00 |
|
Xingyao Wang
|
1b1d8f0b02
|
[eval] Use imap_unordered for parallelizing evaluation (#4040)
|
2024-09-24 20:47:27 +00:00 |
|
Xingyao Wang
|
a66e738957
|
[eval] use mp Pool instead of ProcessPoolExecutor (#4025)
|
2024-09-24 23:59:06 +08:00 |
|
Xingyao Wang
|
714e46f29a
|
[eval] save eventstream & llm completions for SWE-Bench run_infer (#3923)
|
2024-09-22 04:39:13 +00:00 |
|
Xingyao Wang
|
5d7f2fd4ae
|
[eval] Allow evaluation of SWE-Bench patches on RemoteRuntime (#3927)
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2024-09-18 16:07:34 -04:00 |
|
Xingyao Wang
|
f996b31d64
|
[eval] Fix multi-processing bug (again^3) & allow set EXP_NAME for each run_infer (#3907)
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
|
2024-09-17 14:07:58 +00:00 |
|
Xingyao Wang
|
2b3925278d
|
[eval] refactor process instance logic into update_progress (#3875)
|
2024-09-15 18:47:15 -04:00 |
|
Engel Nyst
|
379f2b6f23
|
Fix queue length on Macs (#3867)
|
2024-09-14 01:11:29 +00:00 |
|
Xingyao Wang
|
3a1b8c093b
|
[eval] yet another eval fixes on multi-processing (#3854)
Co-authored-by: Graham Neubig <neubig@gmail.com>
|
2024-09-13 15:51:22 +00:00 |
|