Files
OpenHands/evaluation
Xingyao Wang a6ba6c5277 Add SWEBench-docker eval (#2085)
* add initial version of swebench-docker eval

* update the branch of git repo

* add poetry run

* download dev set too and pre-load f2p and p2p

* update eval infer script

* increase timeout

* add poetry run

* install swebench from our fork

* update script

* update loc

* support single instance debug

* replace \r\n from model patch

* replace eval docker from namespace xingyaoww

* update script to auto detect swe-bench format jsonl

* support eval infer on single instance id

* change log output dir to logs

* update summarise result script

* update README

* update readme

* tweak branch

* Update evaluation/swe_bench/scripts/eval/prep_eval.sh

Co-authored-by: Graham Neubig <neubig@gmail.com>

---------

Co-authored-by: Graham Neubig <neubig@gmail.com>
2024-06-10 19:30:40 +00:00
..
2024-06-09 12:57:58 -07:00
2024-06-10 11:11:40 +08:00
2024-06-03 16:04:34 +00:00
2024-06-10 15:55:33 +00:00
2024-06-04 06:36:19 +00:00
2024-06-09 12:57:58 -07:00
2024-06-09 12:57:58 -07:00
2024-06-09 12:57:58 -07:00
2024-06-09 12:57:58 -07:00
2024-06-09 12:57:58 -07:00
2024-06-09 12:57:58 -07:00
2024-05-20 14:40:31 +00:00

Evaluation

This folder contains code and resources to run experiments and evaluations.

Logistics

To better organize the evaluation folder, we should follow the rules below:

  • Each subfolder contains a specific benchmark or experiment. For example, evaluation/swe_bench should contain all the preprocessing/evaluation/analysis scripts.
  • Raw data and experimental records should not be stored within this repo.
  • Important data files of manageable size and analysis scripts (e.g., jupyter notebooks) can be directly uploaded to this repo.

Supported Benchmarks

Result Visualization

Check this huggingface space for visualization of existing experimental results.

Upload your results

You can start your own fork of our huggingface evaluation outputs and submit a PR of your evaluation results to our hosted huggingface repo via PR following the guide here.