Graham Neubig
a081935fd8
Simplify eval code ( #2775 )
* Start simplifying eval code
* Update
* Add EDA
* Updated GAIA
* Update gpqa
* Add humanevalfix
* Fix logic_reasoning
* Add miniwob
* Add mint and ml_bench
* toolqa
* Added swe-bench
* Fixed webarena
* Refactor parameters
2024-07-05 19:33:08 +09:00
Graham Neubig
ffd3c7144c
Remove global args ( #2760 )
* Remove global args
* Remove global args
* Update files
* Update main
* Bug fixes
* Fix logging
2024-07-03 20:07:52 +09:00
Engel Nyst
2d9bb56763
Add ability to restore the cli session (optional) ( #2699 )
* add ability to restore the main session
* add quick log
* rename to cli session
2024-06-30 06:56:55 +00:00
Engel Nyst
874b4c9075
CLI concurrency ( #2695 )
* add session id in cli, evals
* fix main sid
2024-06-30 04:04:30 +02:00
Graham Neubig
cab7a288ca
Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings ( #2597 )
* Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings
* Update evaluation/webarena/scripts/run_infer.sh
---------
Co-authored-by: OpenDevin <opendevin@all-hands.dev>
2024-06-23 03:43:43 +00:00
Boxuan Li
feabc97aba
Evaluation time travel: build sandbox on the fly ( #2491 )
2024-06-20 20:22:02 -06:00
Boxuan Li
6f235937cf
Evaluation time travel: allow evaluation on a specific version ( #2356 )
* Time travel for evaluation
* Fix source script path
* Exit script if given version doesn't exist
* Exit on failure
* Update README
* Change scripts of all other benchmarks
* Modify README files
* Fix logic_reasoning README
2024-06-16 10:25:14 -04:00
Leo
9ada36e30b
fix: restore python linting. ( #2228 )
* fix: restore python linting.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* update: extend the Python lint check to evaluation.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Update evaluation/logic_reasoning/instruction.txt
---------
Signed-off-by: ifuryst <ifuryst@gmail.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-06-04 06:36:19 +00:00
finaltrip
05b84df9cb
chore: fix some comments ( #2234 )
Signed-off-by: finaltrip <finaltrip@qq.com>
2024-06-03 16:04:34 +00:00
Boxuan Li
538d1d85a2
evaluation: Reset configs in finally block ( #2214 )
2024-06-03 09:52:12 +08:00
Ryan H. Tran
22e8fb39b1
add cost metrics to evaluation outputs for all benchmarks ( #2199 )
2024-06-02 08:28:00 +00:00
RainRat
ed6dcc8381
fix typos ( #2187 )
* fix typos
no functional change
* fix typos
2024-06-01 20:40:30 +00:00
Boxuan Li
f188abd7a3
Delete evaluation outputs files ( #2152 )
* Delete evaluation outputs files
* Fix README
2024-05-31 03:12:27 +00:00
Ren Ma
a9823491e6
Support Logic Reasoning Benchmark ( #1973 )
2024-05-30 16:35:15 +08:00