OpenHands

mirror of https://github.com/All-Hands-AI/OpenHands.git synced 2026-01-06 21:44:00 -05:00

Author	SHA1	Message	Date
Aaron Sequeira	4c0f0a1e9b	feat: Support Tau-Bench and BFCL evaluation benchmarks (#11953) Co-authored-by: openhands <openhands@all-hands.dev>	2025-12-31 03:12:50 +00:00
Graham Neubig	089d9c1ee5	Add deprecation warning to evaluation README (#11997) Co-authored-by: openhands <openhands@all-hands.dev>	2025-12-16 00:21:13 +08:00
Jeffrey Ma	974bcdfd0b	SWE-fficiency benchmark implementation (#11716) Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Xingyao Wang <xingyao@all-hands.dev> Co-authored-by: enyst <engel.nyst@gmail.com>	2025-11-27 09:13:15 +01:00
John Eismeier	967e9e1891	Propose fix some typos and ignore emacs backup files (#11701) Signed-off-by: John E <jeis4wpi@outlook.com>	2025-11-11 09:20:42 -05:00
Engel Nyst	14807ed273	ci: remove outdated integration runner (#11653)	2025-11-10 15:51:40 +01:00
Kevin Musgrave	12d6da8130	feat(evaluation): Filter task ids by difficulty for SWE Gym rollouts (#11490) Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: openhands <openhands@all-hands.dev>	2025-10-30 02:30:19 +00:00
Zacharias Fisches	818f743dc7	Bugfix: respect config.tom system_prompt_filename when running swe-bench (#11091) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Graham Neubig <neubig@gmail.com>	2025-10-27 21:55:05 +00:00
Robert Brennan	b5e00f577c	Replace All-Hands-AI references with OpenHands (#11287) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Engel Nyst <engel.nyst@gmail.com> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2025-10-26 01:52:45 +02:00
Tim O'Farrell	4b303ec9b4	Fixes to unblock frontend (#11488) Co-authored-by: Ray Myers <ray.myers@gmail.com>	2025-10-23 14:43:45 -06:00
Kevin Musgrave	a237b578c0	feat(evaluation): Add multi-swe-bench dependency and fix rollout script (#11326) Co-authored-by: Graham Neubig <neubig@gmail.com>	2025-10-16 14:35:19 +00:00
Engel Nyst	3e645f8649	fix(integration-tests): accept --eval-num-workers and --eval-note in integration test runner (#11387)	2025-10-16 09:50:24 -04:00
juanmichelini	471d272c7c	Mint security eval fix (#11273) Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>	2025-10-16 01:42:05 +00:00
Kevin Musgrave	19bae5ac0f	feat(evaluation): Add placeholders to `swe_gpt4.j2` (#11228) Co-authored-by: Graham Neubig <neubig@gmail.com>	2025-10-13 22:15:05 +08:00
Xinyi He	7906eab6b1	Add inference generation of SWE-Perf Benchmark (#10246) Co-authored-by: mamoodi <mamoodiha@gmail.com> Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: openhands <openhands@all-hands.dev>	2025-09-22 20:35:30 +00:00
juanmichelini	547e1049f1	Multi swe gym (#10605) Co-authored-by: openhands <openhands@all-hands.dev>	2025-09-22 15:56:26 -04:00
Ryan H. Tran	df9320f8ab	Implement model routing support (#9738) Co-authored-by: openhands <openhands@all-hands.dev>	2025-09-08 16:19:34 +07:00
Haowei Lin	bd8b1bfa25	Add a new benchmark: AlgoTune (#10724) Co-authored-by: linhaowei <linhaowei@wizardquant.com> Co-authored-by: Graham Neubig <neubig@gmail.com>	2025-09-04 18:08:50 +00:00
Zacharias Fisches	20e5c40969	Fix swe-bench `run_infer.py` config parsing from config.toml (#10792)	2025-09-04 20:10:08 +08:00
Xingyao Wang	b082ccc0fb	feat(llm): add support for deepseek and gpt-5-mini, util for token count (#10626) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-27 11:03:35 +08:00
Xingyao Wang	4507a25b85	Evaluation: redirect sessions to repo-local .eval_sessions via helper; apply across entrypoints; add tests (#10540) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-22 13:34:02 +00:00
Engel Nyst	91d3d1d20a	Fix: expose aggregated LLM metrics in State for evaluation scripts (#10537) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-21 17:43:09 +02:00
Kevin Musgrave	74ba21bad0	feat(evaluation): Added INSTRUCTION_TEMPLATE_NAME to run_infer.py in swe_bench (#10270) Co-authored-by: Xingyao Wang <xingyao@all-hands.dev> Co-authored-by: mamoodi <mamoodiha@gmail.com>	2025-08-18 14:18:08 +00:00
Zhonghao Jiang	7229a16b45	feat(evaluation): Add NoCode-bench evaluation script (#10229)	2025-08-16 16:41:22 +00:00
Engel Nyst	f7f4fcf98f	chore(eval): remove old, unused regression test framework under evaluation/regression (#10419)	2025-08-16 01:08:23 +02:00
Xingyao Wang	c2f46200c0	chore(lint): Apply comprehensive linting and formatting fixes (#10287) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-13 21:13:19 +02:00
Ibragim Badertdinov	19a6b6b618	feat(eval): Support evaluation on SWE-rebench (#10251)	2025-08-12 14:05:43 +00:00
Insop	1d0d88d491	Readability improvement & remove duplicated and unused prompts (#10241)	2025-08-12 12:42:17 +08:00
Ryan H. Tran	758e30c9a8	Remove SecretStr conversion in GAIA eval (#10204)	2025-08-11 21:30:18 +08:00
Xingyao Wang	04ff4a025b	feat(cli): Use CLI to launch OpenHands UI server via Docker (#9783) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-09 02:04:07 +08:00
Xingyao Wang	c4f303a07b	chore(eval): Remove eval_infer_remote.sh script and related references (#10157) Co-authored-by: openhands <openhands@all-hands.dev>	2025-08-07 20:46:59 +00:00
Boxuan Li	7af35ab827	Evaluation: disable browser when NOT run_with_browsing (#9837)	2025-07-22 01:45:52 +00:00
juanmichelini	ea50fe4e3c	Fix: Continue evaluation when an instance fails after max retries (#8868) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Xingyao Wang <xingyaoww@gmail.com> Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>	2025-07-16 22:42:44 +00:00
Engel Nyst	fba2218760	Fix integration tests (#9746)	2025-07-16 22:16:40 +02:00
Boxuan Li	5c3619bc48	Add README for terminal_bench evaluation harness (#9700)	2025-07-15 09:48:34 -04:00
xhguo7	9388fef0ef	feat(eval): loc acc evaluation (#8515) Co-authored-by: Xingyao Wang <xingyao@all-hands.dev> Co-authored-by: mamoodi <mamoodiha@gmail.com>	2025-07-11 03:22:35 +08:00
Xingyao Wang	cff5697456	eval: remove gemini-specific swebench template (#9623)	2025-07-08 18:34:23 +00:00
Ryan H. Tran	dfa54673d2	[OH-Versa] Add remaining browsing & GAIA eval improvement (#9015) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2025-06-25 12:36:15 +07:00
Maxim Evtush	653a8a7ce2	Refactor: Improve Consistency in Function Signatures and Regex Usage in compute_ism_pm_score.py (#9145)	2025-06-18 04:22:16 +08:00
Ryan H. Tran	ddaa186971	[GAIA] Add prompt improvement to alleviate solution parsing issue & support Tavily search tools (#9057)	2025-06-17 13:16:50 +07:00
better629	432d8829dc	disable mcp in run_localize and install oh-aci[llama] for issue 9150 (#9151)	2025-06-16 11:03:17 +00:00
FT	e5bff91e8e	Fix Typo: Change "accurancy" to "accuracy" in Evaluation Benchmark Comments (#9139)	2025-06-15 12:48:26 +00:00
Linghao Zhang	a93b0457c6	feat(eval): Support evaluation on SWE-bench-Live (#9137)	2025-06-15 12:30:47 +00:00
kilavvy	4e99aabcb2	Minor Code Comment Corrections and Clarifications (#9129)	2025-06-14 18:57:14 +00:00
Graham Neubig	0c307ea12e	Lint all files in the repo (#9131) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2025-06-14 16:25:59 +00:00
ASTONE	be62ba6b35	add_versicode (#8221)	2025-06-14 13:17:18 +00:00
leopardracer	13c298d35f	Minor Typo Fixes in Comments and Documentation (#9058)	2025-06-14 12:51:38 +00:00
Engel Nyst	fd3b4ac8e6	Refactor SWE-bench instruction (#8010)	2025-06-13 23:27:52 +02:00
Leander Maben	d84befe28f	Adding LLM Based Editing capability (#8677) Co-authored-by: Xingyao Wang <xingyao@all-hands.dev> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Engel Nyst <engel.nyst@gmail.com>	2025-06-09 21:57:20 +08:00
Sergey	49939c1f02	Fix typo in evaluation README.md (#8987)	2025-06-08 14:14:07 +00:00
llamantino	880c05ed94	Fix all broken docs links across the project (#8830) Co-authored-by: llamantino <12345678+yourusername@users.noreply.github.com>	2025-05-31 21:24:59 -04:00

407 Commits