Evaluation
This folder contains code and resources to run experiments and evaluations.
Logistics
To keep the evaluation folder organized, please follow the rules below:
- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain all of its preprocessing, evaluation, and analysis scripts.
- Raw data and experimental records should not be stored in this repo.
- Model outputs should be stored in this huggingface space for visualization.
- Important data files of manageable size and analysis scripts (e.g., Jupyter notebooks) can be uploaded directly to this repo.
Supported Benchmarks
- SWE-Bench: `evaluation/swe_bench`
- ML-Bench: `evaluation/ml_bench`
- HumanEvalFix: `evaluation/humanevalfix`
- GAIA: `evaluation/gaia`
- Entity Deduction Arena (EDA): `evaluation/EDA`
- MINT: `evaluation/mint`
- AgentBench: `evaluation/agent_bench`
- BIRD: `evaluation/bird`
- LogicReasoning: `evaluation/logic_reasoning`
Result Visualization
Check this huggingface space for visualization of existing experimental results.
Upload your results
You can start your own fork of our huggingface evaluation outputs and submit your evaluation results to our hosted huggingface repo via a PR, following the guide here.
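If you prefer a programmatic route, the `huggingface_hub` client can open a pull request against a results repo directly. The snippet below is a minimal sketch, not the official submission flow; the repo id, local folder, destination path, and commit message are placeholders you would replace with the actual evaluation-outputs space and your own results directory.

```python
# Minimal sketch (assumptions marked below): open a PR on a huggingface results repo
# using huggingface_hub. Run `huggingface-cli login` first so the client has a token.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="evaluation/evaluation_outputs/my_run",  # placeholder: your local results folder
    path_in_repo="outputs/my_run",                        # placeholder: destination inside the repo
    repo_id="your-username/evaluation-outputs",           # placeholder: the hosted outputs space
    repo_type="space",                                    # the shared outputs live in a HF space
    create_pr=True,                                       # open a pull request instead of pushing directly
    commit_message="Add evaluation results for my_run",   # placeholder commit message
)
```

With `create_pr=True`, `upload_folder` creates a pull request on the target repo that maintainers can review and merge, which mirrors the manual fork-and-PR guide linked above.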