12 Commits

Author SHA1 Message Date
Jiayi Pan
917d96e06f Fix doc error in evals (#2654) 2024-06-27 16:13:47 +00:00
Graham Neubig
cab7a288ca Add NUM_WORKERS variable to run_infer.sh scripts for configurable woker settings (#2597)
* Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings

* Update evaluation/webarena/scripts/run_infer.sh

---------

Co-authored-by: OpenDevin <opendevin@all-hands.dev>
2024-06-23 03:43:43 +00:00
Boxuan Li
feabc97aba Evaluation time travel: build sandbox on the fly (#2491) 2024-06-20 20:22:02 -06:00
Boxuan Li
6f235937cf Evaluation time travel: allow evaluation on a specific version (#2356)
* Time travel for evaluation

* Fix source script path

* Exit script if given version doesn't exist

* Exit on failure

* Update README

* Change scripts of all other benchmarks

* Modify README files

* Fix logic_reasoning README
2024-06-16 10:25:14 -04:00
Yufan Song
f4cb192ebe Fix llm key leaks bug (#2376)
* fix bug

* fix bug

* add
2024-06-10 15:55:33 +00:00
RainRat
745ae42a72 fix typos (#2352) 2024-06-09 12:57:58 -07:00
Leo
9ada36e30b fix: restore python linting. (#2228)
* fix: restore python linting.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* update: extend the Python lint check to evaluation.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Update evaluation/logic_reasoning/instruction.txt

---------

Signed-off-by: ifuryst <ifuryst@gmail.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-06-04 06:36:19 +00:00
finaltrip
05b84df9cb chore: fix some comments (#2234)
Signed-off-by: finaltrip <finaltrip@qq.com>
2024-06-03 16:04:34 +00:00
Ryan H. Tran
22e8fb39b1 add cost metrics to evaluation outputs for all benchmarks (#2199) 2024-06-02 08:28:00 +00:00
Yizhe Zhang
8d79c3edbc modify the exiting logic and reward calculation, delete unused function (#2198) 2024-06-02 06:38:09 +00:00
RainRat
ed6dcc8381 fix typos (#2187)
* fix typos

no functional change

* fix typos
2024-06-01 20:40:30 +00:00
Yizhe Zhang
0c829cd067 Support Entity-Deduction-Arena (EDA) Benchmark (#1931)
* adding draft evaluation code for EDA, using chatgpt as the temporal agent for now

* Update README.md

* Delete frontend/package.json

* reverse the irrelevant changes

* reverse package.json

* use chatgpt as the codeactagent

* integrate with opendevin

* Update evaluation/EDA/README.md

* Update evaluation/EDA/README.md

* Use poetry to manage packages

* integrate with opendevin

* minor update

* minor update

* update poetry

* update README

* clean-up infer scripts

* add run_infer script and improve readme

* log final success and final message & ground truth

---------

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Xingyao Wang <xingyao6@illinois.edu>
Co-authored-by: yufansong <yufan@risingwave-labs.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-05-25 23:17:04 +08:00