
Logic Reasoning Evaluation

This folder contains the evaluation harness for evaluating agents on the logic reasoning benchmarks ProntoQA and ProofWriter.
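To illustrate what these benchmarks test, here is a minimal sketch of the kind of forward-chaining deduction a ProntoQA-style problem requires: given a set of facts and "if X then Y" rules, derive new facts until the goal is (or is not) reached. The predicate names and helper function are illustrative, not taken from the benchmark itself.

```python
def forward_chain(facts, rules, goal):
    """facts: set of atoms; rules: list of (premise, conclusion) pairs.
    Repeatedly apply rules whose premise is already derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return goal in derived

# "Max is a cat. Every cat is a feline. Every feline is a mammal.
#  Is Max a mammal?"
facts = {"cat(max)"}
rules = [("cat(max)", "feline(max)"), ("feline(max)", "mammal(max)")]
print(forward_chain(facts, rules, "mammal(max)"))   # True
print(forward_chain(facts, rules, "reptile(max)"))  # False
```

The benchmarks evaluate whether an agent can carry out such multi-hop chains in natural language, not whether it can write this code.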

Configure OpenDevin and your LLM

Create a config.toml file at the root of the workspace if one does not already exist.

Add the following configurations:

[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
enable_auto_lint = true

# TODO: Change these to the model you want to evaluate
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0

[eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0

Run Inference on logic_reasoning

The following command runs inference on the first example of the ProntoQA dataset with the model gpt-4o, using OpenDevin version 0.6.2.

./evaluation/logic_reasoning/scripts/run_infer.sh ProntoQA gpt-4o 0.6.2 1
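To run the same evaluation over both datasets, the invocation can be assembled programmatically. This sketch assumes the argument order shown in the example above (dataset, model, OpenDevin version, number of examples); adjust it if the script's interface differs.

```python
import shlex

def build_cmd(dataset, model="gpt-4o", version="0.6.2", n_examples=1):
    """Assemble a run_infer.sh command line for one dataset.
    Argument order mirrors the example invocation above."""
    args = [
        "./evaluation/logic_reasoning/scripts/run_infer.sh",
        dataset, model, version, str(n_examples),
    ]
    return shlex.join(args)

for dataset in ("ProntoQA", "ProofWriter"):
    print(build_cmd(dataset))
```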