Commit Graph

7 Commits

Author SHA1 Message Date
Xingyao Wang
50c13aad98 [Eval] Improve SWE-Bench Eval harness: multi-run support & entry script simplification (#4396) 2024-10-15 21:34:52 +08:00
Xingyao Wang
0c2a35b256 [eval] update aider bench scripts (#4203) 2024-10-04 02:23:06 +00:00
tobitege
c875a5fb77 (feat) Add Aider bench output visualizer (#3643)
* aider-bench: add visualization to summarize script and readme

* added example cost and actions histogram images for readme

* moved dependencies to evaluation section
2024-08-29 05:03:44 +00:00
tobitege
9c39f07430 (enh) Aider-Bench: make resumable with skip_num arg (#3626)
* added optional START_ID env flag to resume from that instance id

* prepare_dataset: fix comparisons by using instance id's as int

* aider bench complete_runtime: close runtime to close container

* added matrix display of instance id for logging

* fix typo in summarize_results.py saying summarise_results

* changed start_id to skip_num to skip rows from dataset (start_id wasn't supportable)

* doc changes about huggingface spaces to temporarily point back to OD
2024-08-28 15:42:01 +00:00
Raj Maheshwari
0cdeb83b17 Enabling of unittests in aider benchmark should be optional. (#3620) 2024-08-27 17:25:55 +00:00
tobitege
8fcf0817d4 (eval) Aider_bench: add eval_ids arg to run specific instance id's (#3592)
* add eval_ids arg to run specific instance id's; fix/extend README

* fix description in parser for --eval-ids

* fix test_arg_parser.py to account for added arg

* fix typo in README to say "summarize" instead of "summarise" for script
2024-08-27 00:49:26 +08:00
Raj Maheshwari
80f88e14cd [Feat] Aider Benchmark (#3507)
* [Feat] Aider Benchmark

* [Add] README.md
2024-08-21 18:05:41 +00:00