Commit Graph

4 Commits

Author SHA1 Message Date
Graham Neubig
cab7a288ca Add NUM_WORKERS variable to run_infer.sh scripts for configurable woker settings (#2597)
* Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings

* Update evaluation/webarena/scripts/run_infer.sh

---------

Co-authored-by: OpenDevin <opendevin@all-hands.dev>
2024-06-23 03:43:43 +00:00
Boxuan Li
feabc97aba Evaluation time travel: build sandbox on the fly (#2491) 2024-06-20 20:22:02 -06:00
Boxuan Li
6f235937cf Evaluation time travel: allow evaluation on a specific version (#2356)
* Time travel for evaluation

* Fix source script path

* Exit script if given version doesn't exist

* Exit on failure

* Update README

* Change scripts of all other benchmarks

* Modify README files

* Fix logic_reasoning README
2024-06-16 10:25:14 -04:00
Leo
be251b11de Add AgentBench. (#2012)
* Add AgentBench.

* Load the datasets from HF.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Add helper functions.

* Add mock executor.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Add retriv agent answer cmd.

* Adjust the dataset.
* Refine test results.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Consolidate all AgentBench datasets and scripts into a single CSV dataset.

* Refactor dataset source.
* Update helper functions.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Fix the CRLF problem.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Separate the instance's workspace.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Add cleanup logic and error handling for sandbox closure.

* Normalized dataset

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Update README.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Update the prompt to capture the answer.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Refactor script execution paths to use absolute container workspace path.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Update AgentBench README.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Delete useless functions.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Update evaluation/agent_bench/README.md

* Add script to summarize test results from JSONL file in AgentBench

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Delete useless script and codes.

Signed-off-by: ifuryst <ifuryst@gmail.com>

* Update evaluation/agent_bench/scripts/summarise_results.py

---------

Signed-off-by: ifuryst <ifuryst@gmail.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
2024-06-01 07:58:14 +00:00