Fabric/docs/Using-Speech-To-Text.md

# Using Speech-To-Text (STT) with Fabric

Fabric supports speech-to-text transcription of audio and video files using OpenAI's transcription models. This feature allows you to convert spoken content into text that can then be processed through Fabric's patterns.

## Overview

The STT feature integrates OpenAI's Whisper and GPT-4o transcription models to convert audio/video files into text. The transcribed text is automatically passed as input to your chosen pattern or chat session.

## Requirements

- OpenAI API key configured in Fabric
- For files larger than 25MB: `ffmpeg` installed on your system
- Supported audio/video formats: `.mp3`, `.mp4`, `.mpeg`, `.mpga`, `.m4a`, `.wav`, `.webm`

## Basic Usage

### Simple Transcription

To transcribe an audio file and send the result to a pattern:

```bash
fabric --transcribe-file /path/to/audio.mp3 --transcribe-model whisper-1 --pattern summarize
```

### Transcription Only

To just transcribe a file without applying a pattern:

```bash
fabric --transcribe-file /path/to/audio.mp3 --transcribe-model whisper-1
```

## Command Line Flags

### Required Flags

- `--transcribe-file`: Path to the audio or video file to transcribe
- `--transcribe-model`: Model to use for transcription (required when using transcription)

### Optional Flags

- `--split-media-file`: Automatically split files larger than 25MB into chunks using ffmpeg

## Available Models

You can list all available transcription models with:

```bash
fabric --list-transcription-models
```

Currently supported models:

- `whisper-1`: OpenAI's Whisper model
- `gpt-4o-mini-transcribe`: GPT-4o Mini transcription model
- `gpt-4o-transcribe`: GPT-4o transcription model

## File Size Handling

### Files Under 25MB

Files under the 25MB limit are processed directly without any special handling.

### Files Over 25MB

For files exceeding OpenAI's 25MB limit, you have two options:

1. **Manual handling**: The command will fail with an error message suggesting to use `--split-media-file`
2. **Automatic splitting**: Use the `--split-media-file` flag to automatically split the file into chunks

```bash
fabric --transcribe-file large_recording.mp4 --transcribe-model whisper-1 --split-media-file --pattern summarize
```

When splitting is enabled:

- Fabric uses `ffmpeg` to split the file into 10-minute segments initially
- If segments are still too large, it reduces the segment time by half repeatedly
- All segments are transcribed and the results are concatenated
- Temporary files are automatically cleaned up after processing

## Integration with Patterns

The transcribed text is seamlessly integrated into Fabric's workflow:

1. File is transcribed using the specified model
2. Transcribed text becomes the input message
3. Text is sent to the specified pattern or chat session

### Example Workflows

**Meeting transcription and summarization:**

```bash
fabric --transcribe-file meeting.mp4 --transcribe-model gpt-4o-transcribe --pattern summarize
```

**Interview analysis:**

```bash
fabric --transcribe-file interview.mp3 --transcribe-model whisper-1 --pattern extract_insights
```

**Large video file processing:**

```bash
fabric --transcribe-file presentation.mp4 --transcribe-model gpt-4o-transcribe --split-media-file --pattern create_summary
```

## Error Handling

Common error scenarios:

- **Unsupported format**: Only the listed audio/video formats are supported
- **File too large**: Use `--split-media-file` for files over 25MB
- **Missing ffmpeg**: Install ffmpeg for automatic file splitting
- **Invalid model**: Use `--list-transcription-models` to see available models
- **Missing model**: The `--transcribe-model` flag is required when using `--transcribe-file`

## Technical Details

### Implementation

- Transcription is handled in `internal/cli/transcribe.go:14`
- OpenAI-specific implementation in `internal/plugins/ai/openai/openai_audio.go:41`
- File splitting uses ffmpeg with configurable segment duration
- Supports any vendor that implements the `transcriber` interface

### Processing Pipeline

1. CLI validates file format and size
2. If file > 25MB and splitting enabled, file is split using ffmpeg
3. Each file/segment is sent to OpenAI's transcription API
4. Results are concatenated with spaces between segments
5. Transcribed text is passed as input to the main Fabric pipeline

### Vendor Support

Currently, only OpenAI is supported for transcription, but the interface allows for future expansion to other vendors that provide transcription capabilities.