
Evaluation

This folder contains code and resources to run experiments and evaluations.

Logistics

To better organize the evaluation folder, we should follow the rules below:

  • Each subfolder contains a specific benchmark or experiment. For example, evaluation/SWE-bench should contain all the preprocessing/evaluation/analysis scripts.
  • Raw data and experimental records should not be stored in this repo; use external storage instead (e.g., Google Drive or Hugging Face Datasets).
  • Important data files of manageable size and analysis scripts (e.g., Jupyter notebooks) can be uploaded directly to this repo.

Tasks

SWE-bench

  • notebooks
    • devin_eval_analysis.ipynb: a notebook analyzing Devin's outputs
  • scripts
    • prepare_devin_outputs_for_evaluation.py: a script that fetches Devin's outputs and converts them into the JSON file expected for evaluation.
      • usage: python prepare_devin_outputs_for_evaluation.py <setting>, where <setting> can be passed, failed, or all
  • resources
    • Devin's outputs, processed for evaluation, are available on Hugging Face (see the loading sketch after this list):
      • get the predictions that passed the tests: wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_passed.json
      • get all predictions: wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_outputs.json
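
For a quick sanity check, the sketch below downloads one of the processed files and prints a few records. It assumes the converted outputs form a JSON list of SWE-bench-style prediction records (with fields such as instance_id, model_name_or_path, and model_patch); adjust the field names if the actual schema differs.

```python
# Minimal sketch: download the processed Devin outputs and preview a few records.
# Assumption: the file is a JSON list of SWE-bench-style prediction records
# (e.g., instance_id / model_name_or_path / model_patch); adjust if the schema differs.
import json
import urllib.request

URL = (
    "https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output"
    "/raw/main/devin_swe_passed.json"
)

with urllib.request.urlopen(URL) as resp:
    predictions = json.load(resp)

print(f"Loaded {len(predictions)} predictions")
for pred in predictions[:3]:
    # Show the available keys and the instance each prediction targets.
    print(sorted(pred.keys()), pred.get("instance_id"))
```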