scaffolding for blog post

Waleed Latif
2026-02-11 01:06:19 -08:00
parent 2f492cacc1
commit c8f4df7683


@@ -0,0 +1,90 @@
---
slug: workflow-bench
title: 'Introducing Workflow Bench: Benchmarking Natural Language Workflow Building'
description: 'How we built a benchmark to measure how well AI models translate natural language instructions into executable workflows, and what we learned along the way'
date: 2026-02-11
updated: 2026-02-11
authors:
- sid
readingTime: 10
tags: [Benchmark, Evaluation, Workflows, Natural Language]
ogImage: /studio/workflow-bench/cover.png
ogAlt: 'Workflow Bench benchmark overview'
about: ['Benchmarking', 'Workflow Building', 'Natural Language']
timeRequired: PT10M
canonical: https://sim.ai/studio/workflow-bench
featured: false
draft: true
---
Building workflows from natural language sounds straightforward until you try to measure it. When a user says "send me a Slack message every morning with a summary of my unread emails," how do you evaluate whether the resulting workflow is correct? Is partial credit fair? What about workflows that are functionally equivalent but structurally different?
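
To make that concrete, here is a simplified sketch of the kind of artifact we are scoring for that Slack prompt. The field names are illustrative rather than the exact schema, but the shape is representative: a workflow is a set of typed blocks plus the connections between them.

```ts
// Illustrative only: field names are simplified, not the exact workflow schema.
interface Block {
  id: string
  type: string                    // e.g. 'schedule', 'gmail', 'llm', 'slack'
  config: Record<string, unknown> // block-specific parameters
}

interface Connection {
  from: string // source block id
  to: string   // target block id
}

interface Workflow {
  blocks: Block[]
  connections: Connection[]
}

// "Send me a Slack message every morning with a summary of my unread emails"
const example: Workflow = {
  blocks: [
    { id: 'trigger', type: 'schedule', config: { cron: '0 8 * * *' } },
    { id: 'fetch', type: 'gmail', config: { operation: 'list_unread' } },
    { id: 'summarize', type: 'llm', config: { prompt: 'Summarize these emails' } },
    { id: 'notify', type: 'slack', config: { channel: '@me' } },
  ],
  connections: [
    { from: 'trigger', to: 'fetch' },
    { from: 'fetch', to: 'summarize' },
    { from: 'summarize', to: 'notify' },
  ],
}
```

Two workflows can differ in ids, ordering, or even block choice and still do the same thing, which is exactly what makes scoring hard.
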
We built Workflow Bench to answer these questions. This post covers why we needed a dedicated benchmark, how we designed it, and what the results tell us about the current state of natural language workflow building.
## Why a Workflow Benchmark?
<!-- TODO: Motivation for building Workflow Bench -->
<!-- - Gap in existing benchmarks (code gen benchmarks don't capture workflow semantics) -->
<!-- - Need to track progress as we iterate on the copilot / natural language builder -->
<!-- - Workflows are structured artifacts, not just code — they have topology, block types, connections, configs -->
## What We're Measuring
<!-- TODO: Define the core evaluation dimensions -->
<!-- - Structural correctness (right blocks, right connections) -->
<!-- - Configuration accuracy (correct params, API mappings) -->
<!-- - Functional equivalence (does it do the same thing even if shaped differently?) -->
<!-- - Edge cases: loops, conditionals, parallel branches, error handling -->
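
In code terms, each of these dimensions becomes a component score that the evaluator reports per task. A simplified sketch (field names are illustrative, not the evaluator's actual output format):

```ts
// Simplified sketch of a per-task evaluation record; names are illustrative.
interface TaskEvaluation {
  taskId: string
  structural: number // 0-1: are the right blocks and connections present?
  config: number     // 0-1: do parameters and API mappings match the ground truth?
  functional: number // 0-1: does it do the same thing, even if shaped differently?
  notes: string[]    // e.g. 'missing error-handling branch', 'unnecessary loop'
}
```
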
## Benchmark Design
<!-- TODO: How the benchmark dataset is constructed -->
<!-- - Task categories and complexity tiers -->
<!-- - How ground truth workflows are defined -->
<!-- - Natural language prompt variations (terse vs. detailed, ambiguous vs. precise) -->
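
Roughly, each dataset entry pairs several phrasings of the same request with a ground-truth workflow and some metadata. A sketch of the shape (illustrative, not the final format):

```ts
// Sketch of a dataset entry; the concrete format may differ.
interface BenchmarkTask {
  id: string
  category: 'linear' | 'branching' | 'looping' | 'parallel' | 'multi-trigger'
  tier: 1 | 2 | 3                 // complexity tier
  prompts: string[]               // terse, detailed, and deliberately ambiguous phrasings
  groundTruth: Workflow           // reference workflow (see the Workflow sketch above)
  acceptableVariants?: Workflow[] // functionally equivalent alternatives
}
```
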
### Task Categories
<!-- TODO: Break down the types of workflows in the benchmark -->
<!-- - Simple linear (A → B → C) -->
<!-- - Branching / conditional -->
<!-- - Looping / iterative -->
<!-- - Parallel fan-out / fan-in -->
<!-- - Multi-trigger -->
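
The categories differ only in topology, and in the representation sketched earlier topology lives entirely in the connection list. For example, a parallel fan-out/fan-in (illustrative ids):

```ts
// Parallel fan-out/fan-in expressed purely through connections (illustrative ids).
const parallel: Connection[] = [
  { from: 'trigger', to: 'fetchCalendar' },
  { from: 'trigger', to: 'fetchEmail' }, // fan-out: two branches leave the trigger
  { from: 'fetchCalendar', to: 'merge' },
  { from: 'fetchEmail', to: 'merge' },   // fan-in: both branches rejoin
]
```
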
### Scoring
<!-- TODO: Explain the scoring methodology -->
<!-- - How partial credit works -->
<!-- - Structural similarity metrics -->
<!-- - Config-level accuracy -->
<!-- - Overall composite score -->
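
As a sketch of how the pieces combine: partial structural credit can be computed as the fraction of the expected structure that the generated workflow reproduces, and the composite is a weighted sum of the component scores. The weights below are placeholders for illustration, not the final ones:

```ts
// Placeholder weights for illustration; not the final weighting.
function compositeScore(e: TaskEvaluation): number {
  const w = { structural: 0.4, config: 0.3, functional: 0.3 }
  return w.structural * e.structural + w.config * e.config + w.functional * e.functional
}

// One simple form of partial credit: the fraction of expected block types
// that the generated workflow actually contains.
function structuralCredit(expected: Workflow, actual: Workflow): number {
  const want = new Set(expected.blocks.map((b) => b.type))
  const have = new Set(actual.blocks.map((b) => b.type))
  const matched = [...want].filter((t) => have.has(t)).length
  return want.size === 0 ? 1 : matched / want.size
}
```
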
## Evaluation Pipeline
<!-- TODO: How we run the benchmark end to end -->
<!-- - Prompt → model → workflow JSON → evaluator → score -->
<!-- - Automation and reproducibility -->
<!-- - How we handle non-determinism across runs -->
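
End to end, the loop is short. The sketch below uses stand-in functions for the generation and evaluation steps (they are assumptions about shape, not our actual module API), and repeats each prompt a few times to average over non-determinism:

```ts
// Stand-ins for the real calls; signatures are assumptions, not an actual API.
declare function generateWorkflow(model: string, prompt: string): Promise<Workflow>
declare function evaluate(workflow: Workflow, task: BenchmarkTask): TaskEvaluation

async function runBenchmark(tasks: BenchmarkTask[], model: string, runs = 3) {
  const results: { taskId: string; prompt: string; score: number }[] = []
  for (const task of tasks) {
    for (const prompt of task.prompts) {
      for (let i = 0; i < runs; i++) {
        const workflow = await generateWorkflow(model, prompt) // prompt -> workflow JSON
        const evaluation = evaluate(workflow, task)            // workflow -> component scores
        results.push({ taskId: task.id, prompt, score: compositeScore(evaluation) })
      }
    }
  }
  return results
}
```
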
## Results
<!-- TODO: Present the benchmark results -->
<!-- - Model comparisons -->
<!-- - Performance by task category -->
<!-- - Where models struggle most -->
<!-- - Trends over time as we iterate -->
## What We Learned
<!-- TODO: Key takeaways from running the benchmark -->
<!-- - Surprising strengths and weaknesses -->
<!-- - How benchmark results influenced product decisions -->
<!-- - Common failure modes -->
## What's Next
<!-- TODO: Future directions -->
<!-- - Expanding the benchmark (more tasks, more complexity) -->
<!-- - Community contributions / open-sourcing -->
<!-- - Using the benchmark to guide copilot improvements -->