{% extends "admin/base.html" %} {% block title %}Benchmark Report: {{ run.name }}{% endblock %} {% block content %} {% if is_pdf %}
<h1>{{ run.name }}</h1>
<p>Abstract: This document presents a comparative performance analysis of {{ run.models|length }} Large Language Models evaluated against a curated benchmark dataset. All completions were generated at a temperature of 0.0 to ensure deterministic output. Scores were calculated using {{ run.evaluator_config.type }} verification.</p>

<h2>1. Statistical Overview</h2>

<p>The table below summarizes metrics for all evaluated models. Accuracy is normalized as a percentage against the reference benchmark.</p>
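{# stats is assumed to be precomputed per model: avg and max of normalized scores, std across tasks, and the task count #}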
<table>
  <thead>
    <tr><th>Model Identifier</th><th>Avg. Accuracy</th><th>Max Score</th><th>Standard Deviation</th></tr>
  </thead>
  <tbody>
    {% for model, s in stats.items() %}
    <tr>
      <td>{{ model }}</td>
      <td>{{ (s.avg * 100) | round(1) }}%</td>
      <td>{{ (s.max * 100) | round(1) }}%</td>
      <td>±{{ s.std | round(3) }}</td>
    </tr>
    {% endfor %}
  </tbody>
</table>

{% if ai_analysis_html %}
<h2>2. Qualitative Analysis</h2>
{{ ai_analysis_html | safe }}
{% endif %}

{% if static_plot %}
<figure>
  {# static_plot is assumed to hold an image URL or data URI #}
  <img src="{{ static_plot }}" alt="Mean performance accuracy per model">
  <figcaption>Figure 1: Mean performance accuracy across the target model cluster.</figcaption>
</figure>
{% endif %}
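{# Full report only: task-by-task comparison tables follow #}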
{% if report_format == 'full' %}

<h2>{{ '3' if ai_analysis_html else '2' }}. Qualitative Comparison</h2>
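{# The section number shifts by one when the optional AI analysis section is present.
   Every model is assumed to share the same task list, so the first model's count drives the loop below. #}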

{% set first_model = stats.keys() | list | first %}
{% for i in range(stats[first_model].count) %}
<h3>Experimental Task #{{ i + 1 }} ({{ run.results[first_model][i].category }})</h3>
<p>Input Prompt: {{ run.results[first_model][i].prompt }}</p>
<table>
  <thead>
    <tr><th>Model &amp; Score</th><th>Response Output</th></tr>
  </thead>
  <tbody>
    {% for model in run.results.keys() %}
    <tr>
      <td>{{ model }}<br>SCORE: {{ (run.results[model][i].score * 100) | round(1) }}%</td>
      <td>
        {{ run.results[model][i].model_answer_html | safe }}
        {% if run.results[model][i].reasoning_html %}
        <p>Critique: {{ run.results[model][i].reasoning_html | safe }}</p>
        {% endif %}
      </td>
    </tr>
    {% endfor %}
  </tbody>
</table>
{% if not loop.last and i % 2 == 0 %}
{# Assumed: page break in the PDF layout (fires when the task index i is even) #}
<div style="page-break-after: always;"></div>
{% endif %}
{% endfor %}
{% endif %}
{% else %}
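{# Interactive on-screen view rendered inside the admin layout #}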
<p>Execution ID: #{{ run.id }}</p>

<h1>{{ run.name }}</h1>

<p>Evaluated using {{ run.evaluator_config.type }} logic on {{ run.created_at.strftime('%Y-%m-%d %H:%M') }}.</p>

{% for model, s in stats.items() %}
<section>
  <h3>Model Performance</h3>
  <h4>{{ model }}</h4>
  <p>Average Accuracy: {{ (s.avg * 100) | round(1) }}%</p>
</section>
{% endfor %}

<h2>Category Breakdown</h2>

<h2>Global Ranking</h2>

<h2>Individual Task Audit</h2>
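{# Same shared-task-list assumption as above: the first model's task count indexes every model's results #}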

{% set first_model = stats.keys() | list | first %}
{% for i in range(stats[first_model].count) %}
<h3>TASK #{{ i + 1 }}</h3>
<ul>
  {% for model in run.results.keys() %}
  <li>{{ model }}: {{ (run.results[model][i].score * 100) | round(1) }}%</li>
  {% endfor %}
</ul>
<p>{{ run.results[first_model][i].prompt }}</p>
<p>Reference: {{ run.results[first_model][i].reference_answer }}</p>
{% for model in run.results.keys() %}
<h4>{{ model }} Output</h4>
<p>{{ run.results[model][i].model_answer }}</p>
{% if run.results[model][i].reasoning %}
<details>
  <summary>View Audit Reasoning</summary>
  <p>{{ run.results[model][i].reasoning }}</p>
</details>
{% endif %}
{% endfor %}
{% endfor %}
{% endif %}
{% endblock %}