---
title: Speech-to-Text
description: Convert speech to text using AI
---

import { BlockInfoCard } from "@/components/ui/block-info-card"

<BlockInfoCard
  type="stt"
  color="#181C1E"
/>

{/* MANUAL-CONTENT-START:intro */}
Transcribe speech to text using the latest AI models from world-class providers. Sim's Speech-to-Text (STT) tools turn audio and video into accurate, timestamped, and optionally translated transcripts, supporting a wide range of languages and advanced features such as speaker diarization and identification.

**Supported Providers & Models:**

- **[OpenAI Whisper](https://platform.openai.com/docs/guides/speech-to-text/overview)** (OpenAI):
  OpenAI’s Whisper is an open-source deep learning model renowned for its robustness across languages and audio conditions. It supports models such as `whisper-1`, excelling in transcription, translation, and tasks demanding strong generalization. Backed by OpenAI, the company behind ChatGPT and leading AI research, Whisper is widely used in research and as a baseline for comparative evaluation.

- **[Deepgram](https://deepgram.com/)** (Deepgram Inc.):
  Based in San Francisco, Deepgram offers scalable, production-grade speech recognition APIs for developers and enterprises. Its models include `nova-3`, `nova-2`, and `whisper-large`, providing real-time and batch transcription with industry-leading accuracy, multi-language support, automatic punctuation, intelligent diarization, call analytics, and features for use cases ranging from telephony to media production.

- **[ElevenLabs](https://elevenlabs.io/)** (ElevenLabs):
  A leader in voice AI, ElevenLabs is especially known for premium voice synthesis and recognition. Its STT product delivers highly accurate, natural understanding of numerous languages, dialects, and accents. Recent ElevenLabs STT models are optimized for clarity and speaker distinction, and are suitable for both creative and accessibility scenarios. ElevenLabs is recognized for cutting-edge advancements in AI-powered speech technologies.

- **[AssemblyAI](https://www.assemblyai.com/)** (AssemblyAI Inc.):
  AssemblyAI provides API-driven, highly accurate speech recognition, with features such as auto chaptering, topic detection, summarization, sentiment analysis, and content moderation alongside transcription. Its proprietary models, including the acclaimed `Conformer-2`, power some of the largest media, call center, and compliance applications in the industry. AssemblyAI is trusted by Fortune 500s and leading AI startups globally.

- **[Google Cloud Speech-to-Text](https://cloud.google.com/speech-to-text)** (Google Cloud):
  Google’s enterprise-grade Speech-to-Text API supports over 125 languages and variants, offering high accuracy and features such as real-time streaming, word-level confidence, speaker diarization, automatic punctuation, custom vocabulary, and domain-specific tuning. Models such as `latest_long`, `video`, and domain-optimized variants are available, powered by Google’s years of research and deployed for global scalability.

- **[AWS Transcribe](https://aws.amazon.com/transcribe/)** (Amazon Web Services):
  AWS Transcribe leverages Amazon’s cloud infrastructure to deliver robust speech recognition as an API. It supports multiple languages and features such as speaker identification, custom vocabulary, channel identification (for call center audio), and medical-specific transcription. Popular models include `standard` and domain-specific variations. AWS Transcribe is ideal for organizations already using Amazon’s cloud.

**How to Choose:**
Select the provider and model that fits your application—whether you need fast, enterprise-ready transcription with extra analytics (Deepgram, AssemblyAI, Google, AWS), high versatility and open-source access (OpenAI Whisper), or advanced speaker/contextual understanding (ElevenLabs). Consider the pricing, language coverage, accuracy, and any special features (like summarization, chaptering, or sentiment analysis) you might need.

For more details on capabilities, pricing, feature highlights, and fine-tuning options, refer to each provider’s official documentation via the links above.
{/* MANUAL-CONTENT-END */}

## Usage Instructions

Transcribe audio and video files to text using leading AI providers. Supports multiple languages, timestamps, and speaker diarization.

## Tools

### `stt_whisper`

Transcribe audio to text using OpenAI Whisper

#### Input

| Parameter | Type | Required | Description |
| --------- | ---- | -------- | ----------- |
| `provider` | string | Yes | STT provider \(whisper\) |
| `apiKey` | string | Yes | OpenAI API key |
| `model` | string | No | Whisper model to use \(default: whisper-1\) |
| `audioFile` | file | No | Audio or video file to transcribe |
| `audioFileReference` | file | No | Reference to audio/video file from previous blocks |
| `audioUrl` | string | No | URL to audio or video file |
| `language` | string | No | Language code \(e.g., "en", "es", "fr"\) or "auto" for auto-detection |
| `timestamps` | string | No | Timestamp granularity: none, sentence, or word |
| `translateToEnglish` | boolean | No | Translate audio to English |
| `prompt` | string | No | Optional text to guide the model's style or continue a previous audio segment. Helps with proper nouns and context. |
| `temperature` | number | No | Sampling temperature between 0 and 1. Higher values make output more random, lower values more focused and deterministic. |

#### Output

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `transcript` | string | Full transcribed text |
| `segments` | array | Timestamped segments |
| ↳ `text` | string | Transcribed text for this segment |
| ↳ `start` | number | Start time in seconds |
| ↳ `end` | number | End time in seconds |
| ↳ `speaker` | string | Speaker identifier \(if diarization enabled\) |
| ↳ `confidence` | number | Confidence score \(0-1\) |
| `language` | string | Detected or specified language |
| `duration` | number | Audio duration in seconds |
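
As a sketch of how the documented output might be consumed downstream, the block below formats sentence-level `segments` (shape taken from the output table above; the helper names and sample data are illustrative, not part of Sim) as SRT captions:

```typescript
// Shape of one timestamped segment, as documented in the stt_whisper output table.
interface SttSegment {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const rem = ms % 1000;
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(rem, 3)}`;
}

// Convert sentence-level segments into an SRT caption file.
function segmentsToSrt(segments: SttSegment[]): string {
  return segments
    .map((seg, i) => `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text}`)
    .join("\n\n");
}
```

Request `timestamps: "sentence"` for caption-friendly segment lengths; word-level granularity produces one segment per word.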
### `stt_deepgram`

Transcribe audio to text using Deepgram

#### Input

| Parameter | Type | Required | Description |
| --------- | ---- | -------- | ----------- |
| `provider` | string | Yes | STT provider \(deepgram\) |
| `apiKey` | string | Yes | Deepgram API key |
| `model` | string | No | Deepgram model to use \(nova-3, nova-2, whisper-large, etc.\) |
| `audioFile` | file | No | Audio or video file to transcribe |
| `audioFileReference` | file | No | Reference to audio/video file from previous blocks |
| `audioUrl` | string | No | URL to audio or video file |
| `language` | string | No | Language code \(e.g., "en", "es", "fr"\) or "auto" for auto-detection |
| `timestamps` | string | No | Timestamp granularity: none, sentence, or word |
| `diarization` | boolean | No | Enable speaker diarization |

#### Output

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `transcript` | string | Full transcribed text |
| `segments` | array | Timestamped segments with speaker labels |
| ↳ `text` | string | Transcribed text for this segment |
| ↳ `start` | number | Start time in seconds |
| ↳ `end` | number | End time in seconds |
| ↳ `speaker` | string | Speaker identifier \(if diarization enabled\) |
| ↳ `confidence` | number | Confidence score \(0-1\) |
| `language` | string | Detected or specified language |
| `duration` | number | Audio duration in seconds |
| `confidence` | number | Overall confidence score |
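
When `diarization` is enabled, each segment carries a `speaker` label, and consecutive segments from the same speaker are often easier to read once merged into turns. A minimal sketch (the `toSpeakerTurns` helper is illustrative, not part of Sim) against the segment shape documented above:

```typescript
// One diarized segment, following the stt_deepgram output table.
interface DiarizedSegment {
  text: string;
  start: number; // seconds
  end: number;   // seconds
  speaker?: string;
}

interface SpeakerTurn {
  speaker: string;
  text: string;
  start: number;
  end: number;
}

// Merge consecutive segments from the same speaker into a single turn,
// which is usually what you want for readable call transcripts.
function toSpeakerTurns(segments: DiarizedSegment[]): SpeakerTurn[] {
  const turns: SpeakerTurn[] = [];
  for (const seg of segments) {
    const speaker = seg.speaker ?? "unknown";
    const last = turns[turns.length - 1];
    if (last && last.speaker === speaker) {
      last.text += ` ${seg.text}`;
      last.end = seg.end;
    } else {
      turns.push({ speaker, text: seg.text, start: seg.start, end: seg.end });
    }
  }
  return turns;
}
```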
### `stt_elevenlabs`

Transcribe audio to text using ElevenLabs

#### Input

| Parameter | Type | Required | Description |
| --------- | ---- | -------- | ----------- |
| `provider` | string | Yes | STT provider \(elevenlabs\) |
| `apiKey` | string | Yes | ElevenLabs API key |
| `model` | string | No | ElevenLabs model to use \(scribe_v1, scribe_v1_experimental\) |
| `audioFile` | file | No | Audio or video file to transcribe |
| `audioFileReference` | file | No | Reference to audio/video file from previous blocks |
| `audioUrl` | string | No | URL to audio or video file |
| `language` | string | No | Language code \(e.g., "en", "es", "fr"\) or "auto" for auto-detection |
| `timestamps` | string | No | Timestamp granularity: none, sentence, or word |

#### Output

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `transcript` | string | Full transcribed text |
| `segments` | array | Timestamped segments |
| `language` | string | Detected or specified language |
| `duration` | number | Audio duration in seconds |
| `confidence` | number | Overall confidence score |
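
The overall `confidence` score can gate whether a transcript is accepted automatically or routed for human review. A minimal sketch (the `reviewTranscript` helper, its threshold, and the words-per-minute check are illustrative, not part of Sim) over the output fields documented above:

```typescript
// Minimal result shape, following the stt_elevenlabs output table.
interface SttResult {
  transcript: string;
  confidence: number; // overall score, 0-1
  duration: number;   // seconds
}

type Review = { status: "accepted" | "needs_review"; wordsPerMinute: number };

// Flag transcripts whose overall confidence falls below a threshold, and
// compute a rough words-per-minute figure as a sanity check on the audio.
function reviewTranscript(result: SttResult, minConfidence = 0.8): Review {
  const words = result.transcript.split(/\s+/).filter(Boolean).length;
  const wordsPerMinute = result.duration > 0 ? (words / result.duration) * 60 : 0;
  return {
    status: result.confidence >= minConfidence ? "accepted" : "needs_review",
    wordsPerMinute,
  };
}
```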
### `stt_assemblyai`

Transcribe audio to text using AssemblyAI with advanced NLP features

#### Input

| Parameter | Type | Required | Description |
| --------- | ---- | -------- | ----------- |
| `provider` | string | Yes | STT provider \(assemblyai\) |
| `apiKey` | string | Yes | AssemblyAI API key |
| `model` | string | No | AssemblyAI model to use \(default: best\) |
| `audioFile` | file | No | Audio or video file to transcribe |
| `audioFileReference` | file | No | Reference to audio/video file from previous blocks |
| `audioUrl` | string | No | URL to audio or video file |
| `language` | string | No | Language code \(e.g., "en", "es", "fr"\) or "auto" for auto-detection |
| `timestamps` | string | No | Timestamp granularity: none, sentence, or word |
| `diarization` | boolean | No | Enable speaker diarization |
| `sentiment` | boolean | No | Enable sentiment analysis |
| `entityDetection` | boolean | No | Enable entity detection |
| `piiRedaction` | boolean | No | Enable PII redaction |
| `summarization` | boolean | No | Enable automatic summarization |

#### Output

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `transcript` | string | Full transcribed text |
| `segments` | array | Timestamped segments with speaker labels |
| ↳ `text` | string | Transcribed text for this segment |
| ↳ `start` | number | Start time in seconds |
| ↳ `end` | number | End time in seconds |
| ↳ `speaker` | string | Speaker identifier \(if diarization enabled\) |
| ↳ `confidence` | number | Confidence score \(0-1\) |
| `language` | string | Detected or specified language |
| `duration` | number | Audio duration in seconds |
| `confidence` | number | Overall confidence score |
| `sentiment` | array | Sentiment analysis results |
| ↳ `text` | string | Text that was analyzed |
| ↳ `sentiment` | string | Sentiment \(POSITIVE, NEGATIVE, NEUTRAL\) |
| ↳ `confidence` | number | Confidence score |
| ↳ `start` | number | Start time in milliseconds |
| ↳ `end` | number | End time in milliseconds |
| `entities` | array | Detected entities |
| ↳ `entity_type` | string | Entity type \(e.g., person_name, location, organization\) |
| ↳ `text` | string | Entity text |
| ↳ `start` | number | Start time in milliseconds |
| ↳ `end` | number | End time in milliseconds |
| `summary` | string | Auto-generated summary |
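
Note that `sentiment` and `entities` spans are reported in milliseconds, while `segments` use seconds, so convert before comparing the two. A minimal sketch (the `summarizeSentiment` and `msToSeconds` helpers are illustrative, not part of Sim) that tallies sentiment labels per the output table above:

```typescript
// One sentiment result, following the stt_assemblyai output table.
interface SentimentResult {
  text: string;
  sentiment: "POSITIVE" | "NEGATIVE" | "NEUTRAL";
  confidence: number;
  start: number; // milliseconds
  end: number;   // milliseconds
}

// Count sentiment labels for a quick call summary.
function summarizeSentiment(results: SentimentResult[]): Record<string, number> {
  const counts: Record<string, number> = { POSITIVE: 0, NEGATIVE: 0, NEUTRAL: 0 };
  for (const r of results) counts[r.sentiment] += 1;
  return counts;
}

// Convert a millisecond span boundary to seconds so it can be compared
// against segment start/end times, which are in seconds.
const msToSeconds = (ms: number): number => ms / 1000;
```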
### `stt_gemini`

Transcribe audio to text using Google Gemini with multimodal capabilities

#### Input

| Parameter | Type | Required | Description |
| --------- | ---- | -------- | ----------- |
| `provider` | string | Yes | STT provider \(gemini\) |
| `apiKey` | string | Yes | Google API key |
| `model` | string | No | Gemini model to use \(default: gemini-2.5-flash\) |
| `audioFile` | file | No | Audio or video file to transcribe |
| `audioFileReference` | file | No | Reference to audio/video file from previous blocks |
| `audioUrl` | string | No | URL to audio or video file |
| `language` | string | No | Language code \(e.g., "en", "es", "fr"\) or "auto" for auto-detection |
| `timestamps` | string | No | Timestamp granularity: none, sentence, or word |

#### Output

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `transcript` | string | Full transcribed text |
| `segments` | array | Timestamped segments |
| `language` | string | Detected or specified language |
| `duration` | number | Audio duration in seconds |
| `confidence` | number | Overall confidence score |