11 Commits

Author SHA1 Message Date
Graham Neubig
a081935fd8 Simplify eval code (#2775)
* Start simplifying eval code

* Update

* Add EDA

* Updated GAIA

* Update gpqa

* Add humanevalfix

* Fix logic_reasoning

* Add miniwob

* Add mint and ml_bench

* toolqa

* Added swe-bench

* Fixed webarena

* Refactor parameters
2024-07-05 19:33:08 +09:00
மனோஜ்குமார் பழனிச்சாமி
143f38d25a Refactored sandbox config and added fast boot (#2455)
* Refactored sandbox config and added fastboot

* added tests

* fixed tests

* fixed tests

* intimate user about breaking change

* remove default config from eval

* check for lowercase env

* add test

* Revert Migration

* migrate old sandbox configs

* resolve merge conflict

* revert migration 2

* Revert "remove default config from eval"

This reverts commit de57c588db.

* change type to box_type

* fix var name

* linted

* lint

* lint comments

* fix tests

* fix tests

* fix typo

* fix box_type, remove fast_boot

* add tests for sandbox config

* fix test

* update eval docs

* small removal comments

* adapt toml template

* old fields shouldn't be in the app dataclass

* fix old keys in app config

* clean up exec box

---------

Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
2024-07-05 03:30:21 +00:00
Graham Neubig
ffd3c7144c Remove global args (#2760)
* Remove global args

* Remove global args

* Update files

* Update main

* Bug fixes

* Fix logging
2024-07-03 20:07:52 +09:00
Engel Nyst
2d9bb56763 Add ability to restore the cli session (optional) (#2699)
* add ability to restore the main session

* add quick log

* rename to cli session
2024-06-30 06:56:55 +00:00
Engel Nyst
874b4c9075 CLI concurrency (#2695)
* add session id in cli, evals

* fix main sid
2024-06-30 04:04:30 +02:00
Graham Neubig
cab7a288ca Add NUM_WORKERS variable to run_infer.sh scripts for configurable woker settings (#2597)
* Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings

* Update evaluation/webarena/scripts/run_infer.sh

---------

Co-authored-by: OpenDevin <opendevin@all-hands.dev>
2024-06-23 03:43:43 +00:00
மனோஜ்குமார் பழனிச்சாமி
41564c2eac Use :main instead of :latest (#2539)
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-06-21 03:57:50 +00:00
Boxuan Li
feabc97aba Evaluation time travel: build sandbox on the fly (#2491) 2024-06-20 20:22:02 -06:00
Boxuan Li
6f235937cf Evaluation time travel: allow evaluation on a specific version (#2356)
* Time travel for evaluation

* Fix source script path

* Exit script if given version doesn't exist

* Exit on failure

* Update README

* Change scripts of all other benchmarks

* Modify README files

* Fix logic_reasoning README
2024-06-16 10:25:14 -04:00
RainRat
745ae42a72 fix typos (#2352) 2024-06-09 12:57:58 -07:00
Frank Xu
48151bdbb0 [feat] WebArena benchmark, MiniWoB++ benchmark and related arch changes (#2170)
* add webarena, and revamp messaging for webarena eval

* add changes for browsergym

* update infer script

* fix unit tests

* update

* add multiple run for miniwob

* update instruction, remove personal path

* update

* add code for getting final reward, fix integration, add results

* add avg cost calculation
2024-06-06 09:01:20 +08:00