From c8f4df7683a154ec9cbea0e5e4613b95e4f372f9 Mon Sep 17 00:00:00 2001
From: Waleed Latif
Date: Wed, 11 Feb 2026 01:06:19 -0800
Subject: [PATCH] scaffolding for blog post

---
 .../sim/content/blog/workflow-bench/index.mdx | 90 +++++++++++++++++++
 1 file changed, 90 insertions(+)
 create mode 100644 apps/sim/content/blog/workflow-bench/index.mdx

diff --git a/apps/sim/content/blog/workflow-bench/index.mdx b/apps/sim/content/blog/workflow-bench/index.mdx
new file mode 100644
index 000000000..d492e687e
--- /dev/null
+++ b/apps/sim/content/blog/workflow-bench/index.mdx
@@ -0,0 +1,90 @@
+---
+slug: workflow-bench
+title: 'Introducing Workflow Bench - Benchmarking Natural Language Workflow Building'
+description: 'How we built a benchmark to measure how well AI models translate natural language instructions into executable workflows, and what we learned along the way'
+date: 2026-02-11
+updated: 2026-02-11
+authors:
+  - sid
+readingTime: 10
+tags: [Benchmark, Evaluation, Workflows, Natural Language]
+ogImage: /studio/workflow-bench/cover.png
+ogAlt: 'Workflow Bench benchmark overview'
+about: ['Benchmarking', 'Workflow Building', 'Natural Language']
+timeRequired: PT10M
+canonical: https://sim.ai/studio/workflow-bench
+featured: false
+draft: true
+---
+
+Building workflows from natural language sounds straightforward until you try to measure it. When a user says "send me a Slack message every morning with a summary of my unread emails," how do you evaluate whether the resulting workflow is correct? Is partial credit fair? What about workflows that are functionally equivalent but structurally different?
+
+We built Workflow Bench to answer these questions. This post covers why we needed a dedicated benchmark, how we designed it, and what the results tell us about the current state of natural language workflow building.
+
+## Why a Workflow Benchmark?
+
+
+
+
+
+
+
+## What We're Measuring
+
+
+
+
+
+
+## Benchmark Design
+
+
+
+
+
+
+### Task Categories
+
+
+
+
+
+
+
+### Scoring
+
+
+
+
+
+
+## Evaluation Pipeline
+
+
+
+
+
+
+## Results
+
+
+
+
+
+
+
+## What We Learned
+
+
+
+
+
+
+## What's Next
+
+
+
+
+
+
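The draft's opening question about partial credit and structural equivalence lends itself to a concrete sketch. The TypeScript below assumes a simplified, purely illustrative workflow shape (`steps` carrying a `kind`, plus `edges`); it is not Sim's actual schema or Workflow Bench's real scorer. The idea: compare candidate and reference graphs by step kind and edge topology rather than by step id, so a workflow that is renamed but structurally equivalent still scores 1, while a workflow missing pieces earns proportional partial credit.

```typescript
// Illustrative shapes only; Workflow Bench's real schema may differ.
type Step = { id: string; kind: string };
type Edge = [from: string, to: string];
type Workflow = { steps: Step[]; edges: Edge[] };

// Multiset overlap: how many entries of `a` also appear in `b`,
// consuming each match so duplicates are not double-counted.
function overlap(a: string[], b: string[]): number {
  const pool = [...b];
  let hits = 0;
  for (const x of a) {
    const i = pool.indexOf(x);
    if (i !== -1) {
      hits++;
      pool.splice(i, 1);
    }
  }
  return hits;
}

// Partial credit: the fraction of the reference's steps and edges that
// the candidate reproduces. Steps compare by kind and edges by the kinds
// they connect, not by id, so structurally renamed but equivalent
// workflows still score 1.
function partialCredit(candidate: Workflow, reference: Workflow): number {
  const kinds = (w: Workflow) => w.steps.map((s) => s.kind).sort();
  const edgeKinds = (w: Workflow) => {
    const kindOf = new Map(w.steps.map((s) => [s.id, s.kind] as const));
    return w.edges.map(([a, b]) => `${kindOf.get(a)}->${kindOf.get(b)}`).sort();
  };
  const stepHits = overlap(kinds(reference), kinds(candidate));
  const edgeHits = overlap(edgeKinds(reference), edgeKinds(candidate));
  const total = kinds(reference).length + edgeKinds(reference).length;
  return total === 0 ? 1 : (stepHits + edgeHits) / total;
}

// The post's example: a morning unread-email summary sent to Slack.
// The step kinds here are made up for illustration.
const reference: Workflow = {
  steps: [
    { id: "trigger", kind: "schedule.cron" },
    { id: "read", kind: "gmail.unread" },
    { id: "notify", kind: "slack.send" },
  ],
  edges: [["trigger", "read"], ["read", "notify"]],
};

// A candidate that forgets the Slack step: 2 of 3 steps, 1 of 2 edges.
const missingNotify: Workflow = {
  steps: [
    { id: "t", kind: "schedule.cron" },
    { id: "r", kind: "gmail.unread" },
  ],
  edges: [["t", "r"]],
};

console.log(partialCredit(missingNotify, reference)); // 3 of 5 units -> 0.6
```

A candidate with entirely different ids but the same kinds and topology scores a full 1 under this metric, which is one answer to the "functionally equivalent but structurally different" question; it still under-credits equivalences that change the kinds involved (say, swapping Gmail for Outlook), which is where LLM-as-judge or execution-based checks would come in.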