Boxuan Li
|
208b1461ca
|
[AgentBench evaluation] set run_as_devin to true (#2269)
Co-authored-by: Leo <ifuryst@gmail.com>
|
2024-06-05 07:53:33 +00:00 |
|
Leo
|
040d6bd806
|
fix: add an early exit check for agent answers in agent bench. (#2257)
Signed-off-by: ifuryst <ifuryst@gmail.com>
|
2024-06-04 18:45:07 -07:00 |
|
Ryan H. Tran
|
22e8fb39b1
|
add cost metrics to evaluation outputs for all benchmarks (#2199)
|
2024-06-02 08:28:00 +00:00 |
|
Leo
|
be251b11de
|
Add AgentBench. (#2012)
* Add AgentBench.
* Load the datasets from HF.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Add helper functions.
* Add mock executor.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Add retrieve agent answer cmd.
* Adjust the dataset.
* Refine test results.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Consolidate all AgentBench datasets and scripts into a single CSV dataset.
* Refactor dataset source.
* Update helper functions.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Fix the CRLF problem.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Separate the instance's workspace.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Add cleanup logic and error handling for sandbox closure.
* Normalized dataset
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Update README.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Update the prompt to capture the answer.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Refactor script execution paths to use absolute container workspace path.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Update AgentBench README.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Delete useless functions.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Update evaluation/agent_bench/README.md
* Add script to summarize test results from JSONL file in AgentBench
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Delete useless script and code.
Signed-off-by: ifuryst <ifuryst@gmail.com>
* Update evaluation/agent_bench/scripts/summarise_results.py
---------
Signed-off-by: ifuryst <ifuryst@gmail.com>
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
|
2024-06-01 07:58:14 +00:00 |
|