# **Localization Evaluation for SWE-Bench**

This folder implements localization evaluation at both the file and function levels to complement the assessment of agent inference on [SWE-Bench](https://www.swebench.com/).

## **1. Environment Setup**

- Python env: [Install python environment](../../../README.md#development-environment)
- LLM config: [Configure LLM config](../../../README.md#configure-openhands-and-your-llm)

## **2. Inference & Evaluation**

- Inference and evaluation follow the original `run_infer.sh` and `run_eval.sh` implementations
- You may refer to the instructions in [README.md](../README.md) for running inference and evaluation on SWE-Bench

## **3. Localization Evaluation**

- Localization evaluation computes localization accuracy at two levels, and also considers task success as an additional metric for overall evaluation (see the sketch at the end of this section for how these metrics relate to one another):
  - **File Localization Accuracy:** Accuracy of correctly localizing the target file
  - **Function Localization Accuracy:** Accuracy of correctly localizing the target function
  - **Resolve Rate** (auto-skipped if missing): Fraction of tasks that are successfully resolved
  - **File Localization Efficiency:** Average number of iterations taken to successfully localize the target file
  - **Function Localization Efficiency:** Average number of iterations taken to successfully localize the target function
  - **Task Success Efficiency:** Average number of iterations taken to resolve the task
  - **Resource Efficiency:** The API expenditure of the agent while running inference on SWE-Bench instances
- Run localization evaluation
  - Format:
    ```bash
    ./evaluation/benchmarks/swe_bench/scripts/eval_localization.sh [infer-dir] [split] [dataset] [max-infer-turn] [align-with-max]
    ```
    - `infer-dir`: inference directory containing inference outputs
    - `split`: SWE-Bench dataset split to use
    - `dataset`: SWE-Bench dataset name
    - `max-infer-turn`: the maximum number of iterations the agent was allowed when running inference
    - `align-with-max`: whether to align failure indices (e.g., incorrect localization, unresolved tasks) with `max_iter`
  - Example:
    ```bash
    # Example
    ./evaluation/benchmarks/swe_bench/scripts/eval_localization.sh \
      --infer-dir ./evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Verified-test/CodeActAgent/gpt_4o_100_N \
      --split test \
      --dataset princeton-nlp/SWE-bench_Verified \
      --max-infer-turn 100 \
      --align-with-max true
    ```
- Localization evaluation results will be automatically saved to `[infer-dir]/loc_eval`
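
For reference, the sketch below illustrates how the accuracy and efficiency metrics above could be aggregated from per-instance results. The record fields (`file_hit_iter`, `func_hit_iter`, `resolved`, `resolve_iter`) and their values are hypothetical and do not reflect the actual output schema of `eval_localization.sh`; the sketch also assumes that enabling `align-with-max` means counting failed localizations or unresolved tasks as `max-infer-turn` iterations in the efficiency averages.

```python
# Minimal sketch of the metric definitions above; field names and values are
# hypothetical, not the real schema produced by eval_localization.sh.
from statistics import mean

MAX_ITER = 100        # corresponds to [max-infer-turn]
ALIGN_WITH_MAX = True # corresponds to [align-with-max]

records = [
    # one entry per SWE-Bench instance (iteration = agent turn at which the event occurred)
    {"file_hit_iter": 3, "func_hit_iter": 5, "resolved": True, "resolve_iter": 12},
    {"file_hit_iter": 7, "func_hit_iter": None, "resolved": False, "resolve_iter": None},
]

def iterations(key):
    """Collect iteration counts for `key`; failures (None) count as MAX_ITER when aligning."""
    out = []
    for record in records:
        value = record[key]
        if value is not None:
            out.append(value)
        elif ALIGN_WITH_MAX:
            out.append(MAX_ITER)
    return out

metrics = {
    "file_loc_accuracy": mean(r["file_hit_iter"] is not None for r in records),
    "func_loc_accuracy": mean(r["func_hit_iter"] is not None for r in records),
    "resolve_rate": mean(r["resolved"] for r in records),
    "file_loc_efficiency": mean(iterations("file_hit_iter")),
    "func_loc_efficiency": mean(iterations("func_hit_iter")),
    "task_success_efficiency": mean(iterations("resolve_iter")),
}
print(metrics)
```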