# Evaluation
This folder contains code and resources to run experiments and evaluations.
## Logistics
To better organize the evaluation folder, we should follow the rules below:
- Each subfolder contains a specific benchmark or experiment. For example, `evaluation/swe_bench` should contain all the preprocessing/evaluation/analysis scripts (see the layout sketch below).
- Raw data and experimental records should not be stored within this repo.
- Model outputs should be stored in this Hugging Face space for visualization.
- Important data files of manageable size and analysis scripts (e.g., Jupyter notebooks) can be uploaded directly to this repo.
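As a concrete illustration, a benchmark subfolder might look like the sketch below. The file names follow the MINT integration and are illustrative; each benchmark defines its own scripts, so check the subfolder's README for what is actually present.

```
evaluation/
└── mint/                  # one subfolder per benchmark
    ├── README.md          # setup and usage instructions
    ├── requirements.txt   # extra packages installed inside the sandbox
    ├── run_infer.sh       # entry point for running inference
    └── run_infer.py       # benchmark-specific harness
```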
## Supported Benchmarks
- SWE-Bench: `evaluation/swe_bench`
- HumanEvalFix: `evaluation/humanevalfix`
- GAIA: `evaluation/gaia`
- Entity Deduction Arena (EDA): `evaluation/EDA`
- MINT: `evaluation/mint` (see the example invocation below)
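For example, the MINT integration exposes a `run_infer.sh` entry point with default values for `SUBSET` and `EVAL_LIMIT`. A minimal sketch follows, assuming those two settings are passed as environment variables; the actual interface is defined in `evaluation/mint/run_infer.sh`, so consult that script and the subfolder's README before running.

```bash
# A minimal sketch, assuming SUBSET (the MINT task subset, e.g. gsm8k or math)
# and EVAL_LIMIT (a cap on the number of evaluated instances) are read from
# the environment; both have defaults, so both are optional.
cd evaluation/mint
SUBSET=gsm8k EVAL_LIMIT=10 bash run_infer.sh
```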
## Result Visualization
Check this Hugging Face space for a visualization of existing experimental results.
## Upload your results
You can fork our Hugging Face evaluation outputs repo and submit your evaluation results to our hosted repo as a PR, following the guide here.
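A hypothetical walkthrough of that workflow using the plain git interface the Hugging Face Hub exposes; `<your-fork-url>` is a placeholder for your actual fork, and the hosted repo's guide takes precedence over this sketch.

```bash
# Hypothetical sketch: <your-fork-url> stands in for your fork of the
# hosted evaluation-outputs repo (created via the Hub UI).
git clone <your-fork-url>
cd <fork-name>
cp -r /path/to/your/outputs .          # add your evaluation results
git add . && git commit -m "Add evaluation results"
git push
# Finally, open a PR from your fork against the hosted repo in the Hub UI.
```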