OpenHands

mirror of https://github.com/All-Hands-AI/OpenHands.git synced 2026-04-29 03:00:45 -04:00

Author	SHA1	Message	Date
Xingyao Wang	391200510c	fix: revert #5506 for SWE-Bench performance regression (#6491) Co-authored-by: Robert Brennan <accounts@rbren.io>	2025-01-28 22:52:57 +08:00
Aditya Bharat Soni	aebb583779	Support for VisualWebArena evaluation in OpenHands (#4773) Co-authored-by: Xingyao Wang <xingyao@all-hands.dev> Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Graham Neubig <neubig@gmail.com>	2025-01-23 20:18:30 +00:00
Engel Nyst	b9a3f1c753	Fix eval on remote runtime (#6398)	2025-01-21 20:49:30 +00:00
Engel Nyst	5b7fcfbe1a	Disable prompt extensions in SWE-bench (#6391)	2025-01-21 17:18:30 +00:00
louria	7f57dbebda	Update MiniWoB README (#6385)	2025-01-21 16:26:47 +01:00
Xingyao Wang	2b04ee2e62	feat(eval): reliability improvement for SWE-Bench eval_infer (#6347)	2025-01-18 14:02:59 -05:00
Calvin Smith	a12087243a	Pydantic-based configuration and setting objects (#6321) Co-authored-by: Calvin Smith <calvin@all-hands.dev> Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2025-01-17 12:33:22 -07:00
Xingyao Wang	899c1f8360	fix(bash): also show timeout reminder when no_change_timeout is triggered (#6318) Co-authored-by: Robert Brennan <accounts@rbren.io>	2025-01-18 03:31:23 +08:00
Xingyao Wang	72af7bbba2	feat(eval): misc SWE-Bench improvement - use different resources for different instances (#6313) Co-authored-by: openhands <openhands@all-hands.dev>	2025-01-17 02:48:41 +08:00
Xingyao Wang	0c961bfd8b	refactor(prompt): move runtime/repo info to user message and disable them in eval (#6291)	2025-01-16 17:53:10 +00:00
Xingyao Wang	0bed17758f	fix: incorrect soft-timeout implementation & fix hard-timeout follow-up command (#6280)	2025-01-17 01:27:00 +08:00
Engel Nyst	b9a70c8d5c	Delegation fixes (#6165)	2025-01-15 03:24:39 +00:00
Boxuan Li	92b8d55c2d	Rename `trajectories_path` config to `save_trajectory_path` (#6216) Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2025-01-14 04:32:45 +00:00
tofarr	23473070b9	Revert "Config objects as Pydantic BaseModels (#6176)" (#6214)	2025-01-13 07:36:25 -07:00
Calvin Smith	873dddb4e8	Config objects as Pydantic BaseModels (#6176) Co-authored-by: Calvin Smith <calvin@all-hands.dev> Co-authored-by: Graham Neubig <neubig@gmail.com>	2025-01-12 15:09:45 -05:00
Calvin Smith	6e4ff56934	feature: Condenser Interface and Defaults (#5306) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Calvin Smith <calvin@all-hands.dev> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2025-01-08 04:36:30 +08:00
Dmitry Kozlov	17d722f3b3	Update README.md (#6076) Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>	2025-01-06 17:31:19 +00:00
Xingyao Wang	ec70af9412	refactor: Replace pexpect with libtmux in BashSession (#4881) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com> Co-authored-by: Robert Brennan <accounts@rbren.io>	2025-01-04 05:22:13 +08:00
Xingyao Wang	f14f75b064	feat: runtime improvements for rate-limit and 502/503/404 error (#5975)	2025-01-03 08:36:19 -07:00
Xingyao Wang	61ebec9ff7	feat(eval): better visualization for comparing two swe-bench runs (#5993)	2025-01-03 02:36:51 +00:00
Xingyao Wang	9dd5463e06	Set default value of use_microagents to False to prevent breaking eval (#5976) Co-authored-by: openhands <openhands@all-hands.dev>	2025-01-03 05:39:17 +08:00
Robert Brennan	0e4e1b3316	Factor out ActionExecutionClient (#5796)	2024-12-30 15:32:13 +00:00
Boxuan Li	6a4442e590	[Evaluation] Add summarise_results script for TheAgentCompany benchmark (#5811)	2024-12-27 20:33:41 -08:00
Boxuan Li	5ed80b5c32	[doc] Fix link in TheAgentCompany benchmark's README.md (#5848)	2024-12-27 22:21:02 +08:00
OpenHands	8975fcd714	Fix issue #5748: Rename "Ran a Jupyter Command" to "Ran a Python Command" in UI (#5749) Co-authored-by: Graham Neubig <neubig@gmail.com>	2024-12-26 23:30:19 +08:00
OpenHands	bfb191b5c7	Fix issue #5739: [Bug]: Move ./evaluation/swe_bench/scripts/cleanup_remote_runtime.sh to general eval utils (#5740)	2024-12-25 17:17:06 -05:00
Boxuan Li	ecff5c67fb	Evaluation README: Add TheAgentCompany (#5777)	2024-12-24 02:37:42 +00:00
Boxuan Li	b1719bb3db	Add TheAgentCompany evaluation harness (#5731)	2024-12-22 14:12:30 -05:00
OpenHands	21948fa81b	Fix issue #5735: [Bug]: Inconsistent command line arguments in evaluation directory (#5736)	2024-12-22 04:41:39 +08:00
Xingyao Wang	581d5ec7a8	feat(eval): increase resource factor for remote runtime when previous run failed due to resource (#5709)	2024-12-21 01:47:06 +08:00
Xingyao Wang	c333938384	feat(eval): add standard error to swebench summarize outputs (#5700) Co-authored-by: openhands <openhands@all-hands.dev>	2024-12-20 08:39:43 +08:00
Xingyao Wang	e9cafb0372	chore: Cleanup runtime exception handling (#5696)	2024-12-19 17:28:29 +00:00
Xingyao Wang	9cdb8d06c0	fix(eval): Use cp -r instead of mv for SWE-Bench Initialization (#5659)	2024-12-17 21:21:27 +00:00
Engel Nyst	3297e4d5a8	Use litellm's modify params (#5636)	2024-12-17 21:32:49 +01:00
OpenHands	4998b5de32	Fix issue #5559: The turn limit should be measured from the last user interaction (#5560) Co-authored-by: Graham Neubig <neubig@gmail.com> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2024-12-16 16:28:23 -05:00
Engel Nyst	b295f5775c	Revert "Fix issue #5609: Use litellm's modify_params with default True" (#5631)	2024-12-16 20:39:57 +00:00
OpenHands	09735c7869	Fix issue #5609: Use litellm's modify_params with default True (#5611) Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2024-12-16 20:18:45 +01:00
Engel Nyst	4716955960	Remove unused codeact-SWE agent (#5600) Co-authored-by: openhands <openhands@all-hands.dev>	2024-12-14 20:49:44 +01:00
Ryan H. Tran	8ae2fb636e	Remove symlink use for swebench setup (#5549)	2024-12-13 22:18:14 +08:00
Engel Nyst	b11e905988	Verify costs script (#5469)	2024-12-10 14:20:53 +01:00
Engel Nyst	455e667739	add cost to summary (#5473)	2024-12-10 03:14:03 +08:00
Cheng Yang	8f47547b08	docs: fix markdown linting and broken links (#5401)	2024-12-05 01:28:04 +08:00
Xingyao Wang	9908e1b285	[Evaluation]: Log openhands version in eval output folder, instead of agent version (#5394)	2024-12-04 03:33:43 +00:00
Xingyao Wang	990f277132	misc: Support folder-level exp analysis for SWE-Bench `summarize_outputs.py`; Handle CrashLoopBackoff for RemoteRuntime (#5385)	2024-12-03 15:37:21 +00:00
Engel Nyst	ea994b6209	More integration tests info (#5319)	2024-11-29 16:39:03 +01:00
Cheng Yang	b808a639d9	docs: improve evaluation README with proper links and formatting (#5221)	2024-11-27 18:27:36 -05:00
Xingyao Wang	4d3b035e00	feat(agent): add BrowseURLAction to CodeAct (produce markdown from URL) (#5285)	2024-11-27 21:55:57 +00:00
OpenHands	f0ca2239f3	Fix issue #5076: Integration test github action (#5077) Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2024-11-27 21:31:48 +01:00
Graham Neubig	12dd3352c5	Add remote runtime support to agent_bench (#5280) Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2024-11-26 13:45:49 +00:00
OpenHands	678436da30	Fix issue #5222: [Refactor]: Refactor the evaluation directory (#5223) Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>	2024-11-25 08:35:52 -05:00

291 Commits