7 Commits

Author SHA1 Message Date
Leo
9ada36e30b fix: restore python linting. (#2228)
* fix: restore python linting.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* update: extend the Python lint check to evaluation.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Update evaluation/logic_reasoning/instruction.txt

---------

Signed-off-by: ifuryst <ifuryst@gmail.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-06-04 06:36:19 +00:00
finaltrip
05b84df9cb chore: fix some comments (#2234)
Signed-off-by: finaltrip <finaltrip@qq.com>
2024-06-03 16:04:34 +00:00
Boxuan Li
538d1d85a2 evaluation: Reset configs in finally block (#2214) 2024-06-03 09:52:12 +08:00
Ryan H. Tran
22e8fb39b1 add cost metrics to evaluation outputs for all benchmarks (#2199) 2024-06-02 08:28:00 +00:00
RainRat
ed6dcc8381 fix typos (#2187)
* fix typos

no functional change

* fix typos
2024-06-01 20:40:30 +00:00
Xingyao Wang
28ab00946b update README for GAIA (#2054)
* update README for GAIA

* Update evaluation/gaia/README.md

* Update evaluation/gaia/README.md

* Update evaluation/gaia/README.md

---------

Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
2024-05-25 15:01:03 +00:00
Jiayi Pan
2d52298a1d Support GAIA benchmark (#1911)
* Add gaia test

* Improve gaia prompts

* Fix browser_env hang bug

* Fix gaia bugs

* add gaia to eval readme

* Fix gaia bugs

* minor fix

* add run_infer.sh and update readme

* set num eval worker to 1

* default to 2023 gaia level1 subset

* default to level 1

* add prompt to instruct model enclose answer within <solution> tag

* add missing break

---------

Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
2024-05-24 11:22:28 +00:00