mirror of https://github.com/All-Hands-AI/OpenHands.git synced 2026-01-10 23:38:08 -05:00

Files

jigsawlabs-student fa6c12473e #2220, integrated aider style linting, currently passes related o… (#2489)

* WIP for integrate aider linter, see OpenDevin#2220

Updated aider linter to:
    * Always return text and line numbers
    * Moved extract line number more consistently
    * Changed pylint to stop after first linter detects errors
Updated agentskills
    * To get back a LintResult object and then use lines and text for error message and related line number
    * Moved code for extracting line number to aider linter
Tests:
* Added additional unit tests for aider to test for
* Return values from lint failures
* Confirm linter works for non-configured languages like Ruby

* move to agent_skills, fixes not seeing skills error

* format/lint to new code, fix failing tests, remove unused code from aider linter

* small changes (remove litellm, fix readme typo)

* fix failing sandbox test

* keep, change dumping of metadata

* remove duplication of tree-sitter, grep-ast and update poetry.lock

* revert to main branch poetry.lock version

* only update necessary package

* fix jupyter kernel wrong interpreter issue (only for swebench)

* fix failing lint tests

* update syntax error checks for flake

* update poetry lock file

* update poetry.lock file, which update content-hash

* add grep ast

* remove extra stuff caused by merge

* update pyproject

* remove extra pytest fixture, ruff styling fixes

* lint files

* update poetry.lock file

---------

Co-authored-by: Jeff Katzy <jeffreyerickatz@gmail.com>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
Co-authored-by: tobitege <tobitege@gmx.de>

2024-07-19 21:58:54 +08:00

| Name | Last commit | Date |
|---|---|---|
| agent_bench | fix: runtime test for mac (#3005) | 2024-07-19 03:03:55 +00:00 |
| biocoder | fix: runtime test for mac (#3005) | 2024-07-19 03:03:55 +00:00 |
| bird | fix: runtime test for mac (#3005) | 2024-07-19 03:03:55 +00:00 |
| browsing_delegation | fix eval api_key leak in metadata; fix llm config in run infer (#2998) | 2024-07-18 15:46:59 +00:00 |
| EDA | fix eval api_key leak in metadata; fix llm config in run infer (#2998) | 2024-07-18 15:46:59 +00:00 |
| gaia | fix eval api_key leak in metadata; fix llm config in run infer (#2998) | 2024-07-18 15:46:59 +00:00 |
| gorilla | docs: updated docstrings using ruff's autofix feature (#2923) | 2024-07-16 01:35:33 +00:00 |
| gpqa | fix: runtime test for mac (#3005) | 2024-07-19 03:03:55 +00:00 |
| humanevalfix | fix: runtime test for mac (#3005) | 2024-07-19 03:03:55 +00:00 |
| logic_reasoning | fix: runtime test for mac (#3005) | 2024-07-19 03:03:55 +00:00 |
| miniwob | fix eval api_key leak in metadata; fix llm config in run infer (#2998) | 2024-07-18 15:46:59 +00:00 |
| mint | docs: updated docstrings using ruff's autofix feature (#2923) | 2024-07-16 01:35:33 +00:00 |
| ml_bench | fix: runtime test for mac (#3005) | 2024-07-19 03:03:55 +00:00 |
| regression | docs: updated docstrings using ruff's autofix feature (#2923) | 2024-07-16 01:35:33 +00:00 |
| static | Add detailed tutorial for adding new evaluation benchmarks (#1827) | 2024-05-18 13:40:53 -04:00 |
| swe_bench | fix: runtime test for mac (#3005) | 2024-07-19 03:03:55 +00:00 |
| toolqa | #2220, integrated aider style linting, currently passes related o… (#2489) | 2024-07-19 21:58:54 +08:00 |
| utils | fix eval api_key leak in metadata; fix llm config in run infer (#2998) | 2024-07-18 15:46:59 +00:00 |
| webarena | fix eval api_key leak in metadata; fix llm config in run infer (#2998) | 2024-07-18 15:46:59 +00:00 |
| \_\_init\_\_.py | feat(SWE-Bench environment) integrate SWE-Bench sandbox (#1468) | 2024-05-15 16:15:55 +00:00 |
| README.md | Add ML-Bench Evaluation with OpenDevin (#2015) | 2024-06-05 01:56:39 +00:00 |
| TUTORIAL.md | fix: runtime test for mac (#3005) | 2024-07-19 03:03:55 +00:00 |

README.md

Evaluation

This folder contains code and resources to run experiments and evaluations.

Logistics

To better organize the evaluation folder, we should follow the rules below:

- Each subfolder contains a specific benchmark or experiment. For example, evaluation/swe_bench should contain all the preprocessing/evaluation/analysis scripts.
- Raw data and experimental records should not be stored within this repo.
  - Model outputs should be stored at this Hugging Face space for visualization.
- Important data files of manageable size and analysis scripts (e.g., Jupyter notebooks) can be uploaded directly to this repo.

Supported Benchmarks

- SWE-Bench: evaluation/swe_bench
- ML-Bench: evaluation/ml_bench
- HumanEvalFix: evaluation/humanevalfix
- GAIA: evaluation/gaia
- Entity Deduction Arena (EDA): evaluation/EDA
- MINT: evaluation/mint
- AgentBench: evaluation/agent_bench
- BIRD: evaluation/bird
- LogicReasoning: evaluation/logic_reasoning
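
As a rough illustration of the "one subfolder per benchmark" rule, the sketch below checks that each benchmark listed above has its folder in a local checkout. The benchmark-to-folder mapping is copied from the list above; the names `BENCHMARK_DIRS` and `missing_benchmark_dirs` are hypothetical helpers for this example and are not part of the repository.

```python
from pathlib import Path

# Benchmark -> folder, copied from the "Supported Benchmarks" list.
# The dict itself is an illustrative assumption, not repository code.
BENCHMARK_DIRS = {
    "SWE-Bench": "evaluation/swe_bench",
    "ML-Bench": "evaluation/ml_bench",
    "HumanEvalFix": "evaluation/humanevalfix",
    "GAIA": "evaluation/gaia",
    "EDA": "evaluation/EDA",
    "MINT": "evaluation/mint",
    "AgentBench": "evaluation/agent_bench",
    "BIRD": "evaluation/bird",
    "LogicReasoning": "evaluation/logic_reasoning",
}


def missing_benchmark_dirs(repo_root: str = ".") -> list[str]:
    """Return the benchmark names whose folder is absent under repo_root."""
    root = Path(repo_root)
    return [
        name
        for name, rel_path in BENCHMARK_DIRS.items()
        if not (root / rel_path).is_dir()
    ]


if __name__ == "__main__":
    # Run from the repository root; prints nothing if every folder exists.
    for name in missing_benchmark_dirs():
        print(f"missing folder for {name}: {BENCHMARK_DIRS[name]}")
```

A check like this could run before adding a new benchmark entry to the list, so the README and the folder layout stay in sync.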

Result Visualization

Check this Hugging Face space for visualizations of existing experimental results.

Upload your results

You can fork our Hugging Face evaluation outputs and submit a PR with your evaluation results to our hosted Hugging Face repo, following the guide here.