16 Commits
v1.1.1 ... 310

Author SHA1 Message Date
Mahmoud Ashraf
ba812f55a2 Fix quotes for Python version in CI workflow 2025-10-30 21:14:30 +03:00
Mahmoud Ashraf
44466c7535 Upgrade Python version from 3.9 to 3.10 in CI 2025-10-30 21:12:36 +03:00
Mahmoud Ashraf
e3e46675b2 Update Python version requirements to 3.10 and 3.12 2025-10-30 21:11:50 +03:00
Mahmoud Ashraf
14ad587c98 Update Python version requirement to 3.10 or greater 2025-10-30 21:11:07 +03:00
Purfview
9090997d25 Fix a typo (#1377) 2025-10-22 15:51:56 +03:00
Mahmoud Ashraf
dea24cbcc6 Upgrade to Silero-VAD V6 (#1373)
Co-authored-by: sssshhhhhh <193317444+sssshhhhhh@users.noreply.github.com>
2025-10-14 15:29:56 +03:00
Mario
14ba1051f3 Fix: add <|nocaptions|> to suppressed tokens (#1338)
* Fix: Prevent <|nocaptions|> tokens in BatchedInferencePipeline

- Add nocaptions component tokens [1771, 496, 9799] to suppress_tokens list
- Add segment filtering to remove any remaining <|nocaptions|> segments
- Resolves issue where BatchedInferencePipeline would generate malformed
  special tokens during periods of silence or low-confidence transcription
- Includes comprehensive tests to verify the fix

The issue occurred because while bracket tokens ('<', '|', '>') were
already suppressed, the content tokens ('no', 'ca', 'ptions') were not,
leading to partial token generation that formed complete <|nocaptions|>
tags in the output.

Files changed:
- faster_whisper/transcribe.py: Core fix implementation
- test_nocaptions_comprehensive.py: Comprehensive test suite
- tests/test_nocaptions_fix.py: Unit tests

* removed

* Fix: Prevent <|nocaptions|> tokens in BatchedInferencePipeline

* Fix: Implement proper <|nocaptions|> token suppression using single token approach

* ci: trigger tests

* fix: remove trailing whitespace from blank lines

* Update faster_whisper/transcribe.py

Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>

* Update faster_whisper/tokenizer.py

Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>

* Update faster_whisper/tokenizer.py

Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>

* Rename no_speech to no_captions in tokenizer

* nocaptions has been renamed to nospeech

* break line

* line break

* Refactor no_speech method for improved readability by adjusting line breaks

---------

Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>
2025-10-10 21:56:54 +03:00
Mahmoud Ashraf
c26d609974 only merge when clip_timestamps are not provided (#1345)
fixes #1340 and allows batching multiple audio files that are each shorter than 30s
2025-08-16 14:30:50 +03:00
黑墨水鱼
4bd98d5c5b Update README.md to include whisper-fastapi (#1325) 2025-08-11 13:44:48 +03:00
Mahmoud Ashraf
93001a9438 bump version to 1.2.0 2025-08-06 03:31:36 +03:00
Mahmoud Ashraf
a0c3cb9802 Remove Silence in Batched transcription (#1297) 2025-08-06 03:30:59 +03:00
Mahmoud Ashraf
fbeb1ba731 get correct index for samples (#1336) 2025-08-06 03:17:45 +03:00
Rishil
d3bfd0a305 feat: Allow loading of private HF models (#1309)
* feat: add HuggingFace auth token support to model download

* Format
2025-06-02 14:12:34 +03:00
Mahmoud Ashraf
43d4163fe0 Support distil-large-v3.5 (#1311) 2025-06-02 14:09:20 +03:00
Felix Mosheev
700584b2e6 feat: allow passing specific revision to download (#1292) 2025-04-30 00:55:48 +03:00
David Jiménez
1383fd4d37 Update README.md with speaches instead of faster-whisper-server (#1267)
The project was previously named faster-whisper-server; it has been renamed to speaches as it has evolved to support more than just ASR.
2025-03-20 17:20:26 +03:00
14 changed files with 169 additions and 122 deletions

View File

@@ -17,10 +17,10 @@ jobs:
steps:
- uses: actions/checkout@v4
- name: Set up Python 3.9
- name: Set up Python 3.10
uses: actions/setup-python@v5
with:
python-version: 3.9
python-version: '3.10'
- name: Install module
run: |
@@ -47,10 +47,10 @@ jobs:
steps:
- uses: actions/checkout@v4
- name: Set up Python 3.9
- name: Set up Python 3.10
uses: actions/setup-python@v5
with:
python-version: 3.9
python-version: '3.10'
- name: Install module
run: |
@@ -69,10 +69,10 @@ jobs:
steps:
- uses: actions/checkout@v4
- name: Set up Python 3.9
- name: Set up Python 3.10
uses: actions/setup-python@v5
with:
python-version: 3.9
python-version: '3.10'
- name: Install dependencies
run: |
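A quick way to see why the quoting fix above matters: YAML reads a bare 3.10 as the number 3.1, so actions/setup-python would be asked for Python 3.1 instead of 3.10. A minimal illustration, assuming PyYAML is installed:

```python
import yaml

# Unquoted 3.10 is parsed as a YAML float and silently becomes 3.1.
print(yaml.safe_load("python-version: 3.10"))    # {'python-version': 3.1}

# Quoting keeps it a string, which is what setup-python expects.
print(yaml.safe_load("python-version: '3.10'"))  # {'python-version': '3.10'}
```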

View File

@@ -1,4 +1,3 @@
include faster_whisper/assets/silero_encoder_v5.onnx
include faster_whisper/assets/silero_decoder_v5.onnx
include faster_whisper/assets/silero_vad_v6.onnx
include requirements.txt
include requirements.conversion.txt

View File

@@ -56,7 +56,7 @@ For reference, here's the time and memory usage that are required to transcribe
## Requirements
* Python 3.9 or greater
* Python 3.10 or greater
Unlike openai-whisper, FFmpeg does **not** need to be installed on the system. The audio is decoded with the Python library [PyAV](https://github.com/PyAV-Org/PyAV) which bundles the FFmpeg libraries in its package.
@@ -237,7 +237,7 @@ See more model and transcription options in the [`WhisperModel`](https://github.
Here is a non exhaustive list of open-source projects using faster-whisper. Feel free to add your project to the list!
* [faster-whisper-server](https://github.com/fedirz/faster-whisper-server) is an OpenAI compatible server using `faster-whisper`. It's easily deployable with Docker, works with OpenAI SDKs/CLI, supports streaming, and live transcription.
* [speaches](https://github.com/speaches-ai/speaches) is an OpenAI compatible server using `faster-whisper`. It's easily deployable with Docker, works with OpenAI SDKs/CLI, supports streaming, and live transcription.
* [WhisperX](https://github.com/m-bain/whisperX) is an award-winning Python library that offers speaker diarization and accurate word-level timestamps using wav2vec2 alignment
* [whisper-ctranslate2](https://github.com/Softcatala/whisper-ctranslate2) is a command line client based on faster-whisper and compatible with the original client from openai/whisper.
* [whisper-diarize](https://github.com/MahmoudAshraf97/whisper-diarization) is a speaker diarization tool that is based on faster-whisper and NVIDIA NeMo.
@@ -250,6 +250,7 @@ Here is a non exhaustive list of open-source projects using faster-whisper. Feel
* [WhisperLive](https://github.com/collabora/WhisperLive) is a nearly-live implementation of OpenAI's Whisper which uses faster-whisper as the backend to transcribe audio in real-time.
* [Faster-Whisper-Transcriber](https://github.com/BBC-Esq/ctranslate2-faster-whisper-transcriber) is a simple but reliable voice transcriber that provides a user-friendly interface.
* [Open-dubbing](https://github.com/softcatala/open-dubbing) is open dubbing is an AI dubbing system which uses machine learning models to automatically translate and synchronize audio dialogue into different languages.
* [Whisper-FastAPI](https://github.com/heimoshuiyu/whisper-fastapi) whisper-fastapi is a very simple script that provides an API backend compatible with OpenAI, HomeAssistant, and Konele (Android voice typing) formats.
## Model conversion

Binary file not shown.

View File

@@ -67,6 +67,12 @@ class Tokenizer:
def no_timestamps(self) -> int:
return self.tokenizer.token_to_id("<|notimestamps|>")
@cached_property
def no_speech(self) -> int:
return self.tokenizer.token_to_id("<|nospeech|>") or self.tokenizer.token_to_id(
"<|nocaptions|>"
)
@property
def timestamp_begin(self) -> int:
return self.no_timestamps + 1
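A hedged sketch of how the new `no_speech` property is consumed, based on the tokenizer and `get_suppressed_tokens` diffs in this changeset. The model size is illustrative and not part of the diff; per the updated tests further down, the resolved id is 50361 for the English tokenizer.

```python
from faster_whisper import WhisperModel
from faster_whisper.tokenizer import Tokenizer
from faster_whisper.transcribe import get_suppressed_tokens

model = WhisperModel("tiny.en")  # illustrative model size
tokenizer = Tokenizer(model.hf_tokenizer, False)

# The new cached_property resolves <|nospeech|> first and falls back to <|nocaptions|>.
print("no-speech token id:", tokenizer.no_speech)

# get_suppressed_tokens() now includes tokenizer.no_speech, so the tag is blocked as a
# single token instead of relying on suppressing its bracket pieces.
suppressed = get_suppressed_tokens(tokenizer, [13])
assert tokenizer.no_speech in suppressed

# Why the old approach could leak fragments: a tag that is not registered as a single
# special token is split by BPE, and suppressing only '<', '|', '>' leaves the content
# pieces ('no', 'ca', 'ptions') free to appear in the output.
for tag in ("<|nospeech|>", "<|nocaptions|>"):
    print(tag, model.hf_tokenizer.encode(tag, add_special_tokens=False).tokens)
```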

View File

@@ -25,7 +25,6 @@ from faster_whisper.vad import (
VadOptions,
collect_chunks,
get_speech_timestamps,
merge_segments,
)
@@ -125,7 +124,7 @@ class BatchedInferencePipeline:
segmented_outputs = []
segment_sizes = []
for chunk_metadata, output in zip(chunks_metadata, outputs):
duration = chunk_metadata["end_time"] - chunk_metadata["start_time"]
duration = chunk_metadata["duration"]
segment_size = int(ceil(duration) * self.model.frames_per_second)
segment_sizes.append(segment_size)
(
@@ -135,7 +134,7 @@ class BatchedInferencePipeline:
) = self.model._split_segments_by_timestamps(
tokenizer=tokenizer,
tokens=output["tokens"],
time_offset=chunk_metadata["start_time"],
time_offset=chunk_metadata["offset"],
segment_size=segment_size,
segment_duration=duration,
seek=0,
@@ -153,7 +152,7 @@ class BatchedInferencePipeline:
tokenizer.decode(subsegment["tokens"])
),
seek=int(
chunk_metadata["start_time"] * self.model.frames_per_second
chunk_metadata["offset"] * self.model.frames_per_second
),
)
for subsegment in subsegments
@@ -409,8 +408,7 @@ class BatchedInferencePipeline:
**vad_parameters, max_speech_duration_s=chunk_length
)
active_segments = get_speech_timestamps(audio, vad_parameters)
clip_timestamps = merge_segments(active_segments, vad_parameters)
clip_timestamps = get_speech_timestamps(audio, vad_parameters)
# run the audio if it is less than 30 sec even without clip_timestamps
elif duration < chunk_length:
clip_timestamps = [{"start": 0, "end": audio.shape[0]}]
@@ -420,6 +418,27 @@ class BatchedInferencePipeline:
"Set 'vad_filter' to True or provide 'clip_timestamps'."
)
audio_chunks, chunks_metadata = collect_chunks(
audio, clip_timestamps, max_duration=chunk_length
)
else:
clip_timestamps = [
{k: int(v * sampling_rate) for k, v in segment.items()}
for segment in clip_timestamps
]
audio_chunks, chunks_metadata = [], []
for clip in clip_timestamps:
audio_chunks.append(audio[clip["start"] : clip["end"]])
chunks_metadata.append(
{
"offset": clip["start"] / sampling_rate,
"duration": (clip["end"] - clip["start"]) / sampling_rate,
"segments": [clip],
}
)
duration_after_vad = (
sum((segment["end"] - segment["start"]) for segment in clip_timestamps)
/ sampling_rate
@@ -430,7 +449,6 @@ class BatchedInferencePipeline:
format_timestamp(duration - duration_after_vad),
)
audio_chunks, chunks_metadata = collect_chunks(audio, clip_timestamps)
features = (
[self.model.feature_extractor(chunk)[..., :-1] for chunk in audio_chunks]
if duration_after_vad
@@ -541,6 +559,7 @@ class BatchedInferencePipeline:
options,
log_progress,
)
segments = restore_speech_timestamps(segments, clip_timestamps, sampling_rate)
return segments, info
@@ -596,6 +615,8 @@ class WhisperModel:
download_root: Optional[str] = None,
local_files_only: bool = False,
files: dict = None,
revision: Optional[str] = None,
use_auth_token: Optional[Union[str, bool]] = None,
**model_kwargs,
):
"""Initializes the Whisper model.
@@ -627,6 +648,11 @@ class WhisperModel:
files: Load model files from the memory. This argument is a dictionary mapping file names
to file contents as file-like or bytes objects. If this is set, model_path acts as an
identifier for this model.
revision:
An optional Git revision id which can be a branch name, a tag, or a
commit hash.
use_auth_token: HuggingFace authentication token or True to use the
token stored by the HuggingFace config folder.
"""
self.logger = get_logger()
@@ -642,6 +668,8 @@ class WhisperModel:
model_size_or_path,
local_files_only=local_files_only,
cache_dir=download_root,
revision=revision,
use_auth_token=use_auth_token,
)
self.model = ctranslate2.models.Whisper(
@@ -1750,7 +1778,7 @@ class WhisperModel:
Returns:
language: Detected language.
languege_probability: Probability of the detected language.
language_probability: Probability of the detected language.
all_language_probs: List of tuples with all language names and probabilities.
"""
assert (
@@ -1823,7 +1851,7 @@ def restore_speech_timestamps(
else:
segment.start = ts_map.get_original_time(segment.start)
segment.end = ts_map.get_original_time(segment.end)
segment.end = ts_map.get_original_time(segment.end, is_end=True)
yield segment
@@ -1858,6 +1886,7 @@ def get_suppressed_tokens(
tokenizer.sot,
tokenizer.sot_prev,
tokenizer.sot_lm,
tokenizer.no_speech,
]
)
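A minimal usage sketch of the new `clip_timestamps` path in `BatchedInferencePipeline`, mirroring the test added at the bottom of this changeset: user-supplied clips (in seconds) are no longer merged, so each clip comes back as its own segment. The audio path and model size are placeholders; the file is assumed to be at least 22 s long.

```python
from faster_whisper import BatchedInferencePipeline, WhisperModel, decode_audio

model = WhisperModel("tiny")  # placeholder model size
pipeline = BatchedInferencePipeline(model=model)

audio = decode_audio("audio.wav")  # placeholder path
clip_timestamps = [{"start": 0.0, "end": 11.0}, {"start": 11.0, "end": 22.0}]

segments, info = pipeline.transcribe(audio, clip_timestamps=clip_timestamps)
for segment in segments:
    # restore_speech_timestamps() maps the results back to the original audio timeline,
    # so each clip is returned as one segment with matching start/end times.
    print(segment.start, segment.end, segment.text)
```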

View File

@@ -2,7 +2,7 @@ import logging
import os
import re
from typing import List, Optional
from typing import List, Optional, Union
import huggingface_hub
import requests
@@ -26,6 +26,7 @@ _MODELS = {
"distil-medium.en": "Systran/faster-distil-whisper-medium.en",
"distil-small.en": "Systran/faster-distil-whisper-small.en",
"distil-large-v3": "Systran/faster-distil-whisper-large-v3",
"distil-large-v3.5": "distil-whisper/distil-large-v3.5-ct2",
"large-v3-turbo": "mobiuslabsgmbh/faster-whisper-large-v3-turbo",
"turbo": "mobiuslabsgmbh/faster-whisper-large-v3-turbo",
}
@@ -51,6 +52,8 @@ def download_model(
output_dir: Optional[str] = None,
local_files_only: bool = False,
cache_dir: Optional[str] = None,
revision: Optional[str] = None,
use_auth_token: Optional[Union[str, bool]] = None,
):
"""Downloads a CTranslate2 Whisper model from the Hugging Face Hub.
@@ -65,6 +68,10 @@ def download_model(
local_files_only: If True, avoid downloading the file and return the path to the local
cached file if it exists.
cache_dir: Path to the folder where cached files are stored.
revision: An optional Git revision id which can be a branch name, a tag, or a
commit hash.
use_auth_token: HuggingFace authentication token or True to use the
token stored by the HuggingFace config folder.
Returns:
The path to the downloaded model.
@@ -94,6 +101,7 @@ def download_model(
"local_files_only": local_files_only,
"allow_patterns": allow_patterns,
"tqdm_class": disabled_tqdm,
"revision": revision,
}
if output_dir is not None:
@@ -103,6 +111,9 @@ def download_model(
if cache_dir is not None:
kwargs["cache_dir"] = cache_dir
if use_auth_token is not None:
kwargs["token"] = use_auth_token
try:
return huggingface_hub.snapshot_download(repo_id, **kwargs)
except (
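A hedged usage sketch of the new `revision` and `use_auth_token` options; the repo id and revision below are placeholders, not values from this diff.

```python
from faster_whisper import WhisperModel
from faster_whisper.utils import download_model

# Pin a Git revision and authenticate for a private or gated CTranslate2 model repo.
model_path = download_model(
    "your-org/your-private-faster-whisper-ct2",  # placeholder repo id
    revision="main",                             # branch name, tag, or commit hash
    use_auth_token=True,  # True reuses the token stored by `huggingface-cli login`
)
model = WhisperModel(model_path)

# WhisperModel forwards the same options to download_model() internally.
model = WhisperModel("large-v3", revision="main", use_auth_token=True)
```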

View File

@@ -86,7 +86,7 @@ def get_speech_timestamps(
padded_audio = np.pad(
audio, (0, window_size_samples - audio.shape[0] % window_size_samples)
)
speech_probs = model(padded_audio.reshape(1, -1)).squeeze(0)
speech_probs = model(padded_audio)
triggered = False
speeches = []
@@ -184,25 +184,62 @@ def get_speech_timestamps(
def collect_chunks(
audio: np.ndarray, chunks: List[dict], sampling_rate: int = 16000
) -> Tuple[List[np.ndarray], List[Dict[str, int]]]:
"""Collects audio chunks."""
audio: np.ndarray,
chunks: List[dict],
sampling_rate: int = 16000,
max_duration: float = float("inf"),
) -> Tuple[List[np.ndarray], List[Dict[str, float]]]:
"""This function merges the chunks of audio into chunks of max_duration (s) length."""
if not chunks:
chunk_metadata = {
"start_time": 0,
"end_time": 0,
"offset": 0,
"duration": 0,
"segments": [],
}
return [np.array([], dtype=np.float32)], [chunk_metadata]
audio_chunks = []
chunks_metadata = []
current_segments = []
current_duration = 0
total_duration = 0
current_audio = np.array([], dtype=np.float32)
for chunk in chunks:
chunk_metadata = {
"start_time": chunk["start"] / sampling_rate,
"end_time": chunk["end"] / sampling_rate,
}
audio_chunks.append(audio[chunk["start"] : chunk["end"]])
chunks_metadata.append(chunk_metadata)
if (
current_duration + chunk["end"] - chunk["start"]
> max_duration * sampling_rate
):
audio_chunks.append(current_audio)
chunk_metadata = {
"offset": total_duration / sampling_rate,
"duration": current_duration / sampling_rate,
"segments": current_segments,
}
total_duration += current_duration
chunks_metadata.append(chunk_metadata)
current_segments = []
current_audio = audio[chunk["start"] : chunk["end"]]
current_duration = chunk["end"] - chunk["start"]
else:
current_segments.append(chunk)
current_audio = np.concatenate(
(current_audio, audio[chunk["start"] : chunk["end"]])
)
current_duration += chunk["end"] - chunk["start"]
audio_chunks.append(current_audio)
chunk_metadata = {
"offset": total_duration / sampling_rate,
"duration": current_duration / sampling_rate,
"segments": current_segments,
}
chunks_metadata.append(chunk_metadata)
return audio_chunks, chunks_metadata
@@ -229,15 +266,19 @@ class SpeechTimestampsMap:
self,
time: float,
chunk_index: Optional[int] = None,
is_end: bool = False,
) -> float:
if chunk_index is None:
chunk_index = self.get_chunk_index(time)
chunk_index = self.get_chunk_index(time, is_end)
total_silence_before = self.total_silence_before[chunk_index]
return round(total_silence_before + time, self.time_precision)
def get_chunk_index(self, time: float) -> int:
def get_chunk_index(self, time: float, is_end: bool = False) -> int:
sample = int(time * self.sampling_rate)
if sample in self.chunk_end_sample and is_end:
return self.chunk_end_sample.index(sample)
return min(
bisect.bisect(self.chunk_end_sample, sample),
len(self.chunk_end_sample) - 1,
@@ -247,13 +288,12 @@ class SpeechTimestampsMap:
@functools.lru_cache
def get_vad_model():
"""Returns the VAD model instance."""
encoder_path = os.path.join(get_assets_path(), "silero_encoder_v5.onnx")
decoder_path = os.path.join(get_assets_path(), "silero_decoder_v5.onnx")
return SileroVADModel(encoder_path, decoder_path)
path = os.path.join(get_assets_path(), "silero_vad_v6.onnx")
return SileroVADModel(path)
class SileroVADModel:
def __init__(self, encoder_path, decoder_path):
def __init__(self, path):
try:
import onnxruntime
except ImportError as e:
@@ -267,13 +307,8 @@ class SileroVADModel:
opts.enable_cpu_mem_arena = False
opts.log_severity_level = 4
self.encoder_session = onnxruntime.InferenceSession(
encoder_path,
providers=["CPUExecutionProvider"],
sess_options=opts,
)
self.decoder_session = onnxruntime.InferenceSession(
decoder_path,
self.session = onnxruntime.InferenceSession(
path,
providers=["CPUExecutionProvider"],
sess_options=opts,
)
@@ -281,92 +316,36 @@ class SileroVADModel:
def __call__(
self, audio: np.ndarray, num_samples: int = 512, context_size_samples: int = 64
):
assert audio.ndim == 1, "Input should be a 1D array"
assert (
audio.ndim == 2
), "Input should be a 2D array with size (batch_size, num_samples)"
assert (
audio.shape[1] % num_samples == 0
audio.shape[0] % num_samples == 0
), "Input size should be a multiple of num_samples"
batch_size = audio.shape[0]
state = np.zeros((2, batch_size, 128), dtype="float32")
h = np.zeros((1, 1, 128), dtype="float32")
c = np.zeros((1, 1, 128), dtype="float32")
context = np.zeros(
(batch_size, context_size_samples),
(1, context_size_samples),
dtype="float32",
)
batched_audio = audio.reshape(batch_size, -1, num_samples)
batched_audio = audio.reshape(-1, num_samples)
context = batched_audio[..., -context_size_samples:]
context[:, -1] = 0
context = np.roll(context, 1, 1)
batched_audio = np.concatenate([context, batched_audio], 2)
context[-1] = 0
context = np.roll(context, 1, 0)
batched_audio = np.concatenate([context, batched_audio], 1)
batched_audio = batched_audio.reshape(-1, num_samples + context_size_samples)
encoder_batch_size = 10000
num_segments = batched_audio.shape[0]
encoder_outputs = []
outputs = []
for i in range(0, num_segments, encoder_batch_size):
encoder_output = self.encoder_session.run(
None, {"input": batched_audio[i : i + encoder_batch_size]}
)[0]
encoder_outputs.append(encoder_output)
encoder_output = np.concatenate(encoder_outputs, axis=0)
encoder_output = encoder_output.reshape(batch_size, -1, 128)
decoder_outputs = []
for window in np.split(encoder_output, encoder_output.shape[1], axis=1):
out, state = self.decoder_session.run(
None, {"input": window.squeeze(1), "state": state}
output, h, c = self.session.run(
None,
{"input": batched_audio[i : i + encoder_batch_size], "h": h, "c": c},
)
decoder_outputs.append(out)
outputs.append(output)
out = np.concatenate(outputs, axis=0)
out = np.stack(decoder_outputs, axis=1).squeeze(-1)
return out
def merge_segments(segments_list, vad_options: VadOptions, sampling_rate: int = 16000):
if not segments_list:
return []
curr_end = 0
seg_idxs = []
merged_segments = []
edge_padding = vad_options.speech_pad_ms * sampling_rate // 1000
chunk_length = vad_options.max_speech_duration_s * sampling_rate
curr_start = segments_list[0]["start"]
for idx, seg in enumerate(segments_list):
# if any segment start timing is less than previous segment end timing,
# reset the edge padding. Similarly for end timing.
if idx > 0:
if seg["start"] < segments_list[idx - 1]["end"]:
seg["start"] += edge_padding
if idx < len(segments_list) - 1:
if seg["end"] > segments_list[idx + 1]["start"]:
seg["end"] -= edge_padding
if seg["end"] - curr_start > chunk_length and curr_end - curr_start > 0:
merged_segments.append(
{
"start": curr_start,
"end": curr_end,
"segments": seg_idxs,
}
)
curr_start = seg["start"]
seg_idxs = []
curr_end = seg["end"]
seg_idxs.append((seg["start"], seg["end"]))
# add final
merged_segments.append(
{
"start": curr_start,
"end": curr_end,
"segments": seg_idxs,
}
)
return merged_segments
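A small sketch of the revised `collect_chunks()`: speech segments (in samples) are now merged into chunks of at most `max_duration` seconds, and each merged chunk reports an `offset`, `duration`, and its constituent `segments` instead of the old `start_time`/`end_time`. The zero-filled audio and hand-written segments below are placeholders standing in for real `get_speech_timestamps()` output.

```python
import numpy as np

from faster_whisper.vad import collect_chunks

sampling_rate = 16000
audio = np.zeros(60 * sampling_rate, dtype=np.float32)  # placeholder signal

# Speech segments in samples, as get_speech_timestamps() would return them.
speech_chunks = [
    {"start": 0, "end": 20 * sampling_rate},
    {"start": 25 * sampling_rate, "end": 40 * sampling_rate},
    {"start": 45 * sampling_rate, "end": 55 * sampling_rate},
]

# Segments are merged greedily until adding the next one would exceed max_duration;
# offsets and durations are reported in seconds.
audio_chunks, chunks_metadata = collect_chunks(audio, speech_chunks, max_duration=30.0)
for chunk, metadata in zip(audio_chunks, chunks_metadata):
    print(round(len(chunk) / sampling_rate, 1), metadata["offset"], metadata["duration"])
```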

View File

@@ -1,3 +1,3 @@
"""Version information."""
__version__ = "1.1.1"
__version__ = "1.2.0"

View File

@@ -45,13 +45,13 @@ setup(
"License :: OSI Approved :: MIT License",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3 :: Only",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
],
keywords="openai whisper speech ctranslate2 inference quantization transformer",
python_requires=">=3.9",
python_requires=">=3.10",
install_requires=install_requires,
extras_require={
"conversion": conversion_requires,

View File

@@ -98,6 +98,7 @@ def test_suppressed_tokens_minus_1():
50358,
50359,
50360,
50361,
)
@@ -106,7 +107,7 @@ def test_suppressed_tokens_minus_value():
tokenizer = Tokenizer(model.hf_tokenizer, False)
tokens = get_suppressed_tokens(tokenizer, [13])
assert tokens == (13, 50257, 50357, 50358, 50359, 50360)
assert tokens == (13, 50257, 50357, 50358, 50359, 50360, 50361)
def test_split_on_unicode():

View File

@@ -71,7 +71,7 @@ def test_batched_transcribe(physcisworks_path):
{"start": segment.start, "end": segment.end, "text": segment.text}
)
# number of near 30 sec segments
assert len(segments) == 7
assert len(segments) == 6
result, info = batched_model.transcribe(
physcisworks_path,
@@ -269,3 +269,24 @@ def test_monotonic_timestamps(physcisworks_path):
assert word.start <= word.end
assert word.end <= segments[i].end
assert segments[-1].end <= info.duration
def test_cliptimestamps_segments(jfk_path):
model = WhisperModel("tiny")
pipeline = BatchedInferencePipeline(model=model)
audio = decode_audio(jfk_path)
audio = np.concatenate([audio, audio])
clip_timestamps = [{"start": 0.0, "end": 11.0}, {"start": 11.0, "end": 22.0}]
segments, info = pipeline.transcribe(audio, clip_timestamps=clip_timestamps)
segments = list(segments)
assert len(segments) == 2
for segment, clip in zip(segments, clip_timestamps):
assert segment.start == clip["start"]
assert segment.end == clip["end"]
assert segment.text == (
" And so my fellow Americans ask not what your country can do for you, "
"ask what you can do for your country."
)