9 Commits since v1.2.0

Author SHA1 Message Date
Mahmoud Ashraf
ba812f55a2 Fix quotes for Python version in CI workflow 2025-10-30 21:14:30 +03:00
Mahmoud Ashraf
44466c7535 Upgrade Python version from 3.9 to 3.10 in CI 2025-10-30 21:12:36 +03:00
Mahmoud Ashraf
e3e46675b2 Update Python version requirements to 3.10 and 3.12 2025-10-30 21:11:50 +03:00
Mahmoud Ashraf
14ad587c98 Update Python version requirement to 3.10 or greater 2025-10-30 21:11:07 +03:00
Purfview
9090997d25 Fix a typo (#1377) 2025-10-22 15:51:56 +03:00
Mahmoud Ashraf
dea24cbcc6 Upgrade to Silero-VAD V6 (#1373)
Co-authored-by: sssshhhhhh <193317444+sssshhhhhh@users.noreply.github.com>
2025-10-14 15:29:56 +03:00
Mario
14ba1051f3 Fix: add <|nocaptions|> to suppressed tokens (#1338)
* Fix: Prevent <|nocaptions|> tokens in BatchedInferencePipeline

- Add nocaptions component tokens [1771, 496, 9799] to suppress_tokens list
- Add segment filtering to remove any remaining <|nocaptions|> segments
- Resolves issue where BatchedInferencePipeline would generate malformed
  special tokens during periods of silence or low-confidence transcription
- Includes comprehensive tests to verify the fix

The issue occurred because while bracket tokens ('<', '|', '>') were
already suppressed, the content tokens ('no', 'ca', 'ptions') were not,
leading to partial token generation that formed complete <|nocaptions|>
tags in the output.

Files changed:
- faster_whisper/transcribe.py: Core fix implementation
- test_nocaptions_comprehensive.py: Comprehensive test suite
- tests/test_nocaptions_fix.py: Unit tests

* removed

* Fix: Prevent <|nocaptions|> tokens in BatchedInferencePipeline

* Fix: Implement proper <|nocaptions|> token suppression using single token approach

* ci: trigger tests

* fix: remove trailing whitespace from blank lines

* Update faster_whisper/transcribe.py

Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>

* Update faster_whisper/tokenizer.py

Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>

* Update faster_whisper/tokenizer.py

Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>

* Rename no_speech to no_captions in tokenizer

* nocaptions has been renamed to nospeech

* break line

* line break

* Refactor no_speech method for improved readability by adjusting line breaks

---------

Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>
2025-10-10 21:56:54 +03:00
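
The final approach, visible in the faster_whisper/tokenizer.py and faster_whisper/transcribe.py diffs below, suppresses the no-speech special token by its single id rather than by its sub-word pieces. A minimal sketch of that lookup, with a plain dict standing in for the real tokenizer (fake_vocab, token_to_id and no_speech_id are illustrative names, not the library's API):

# Resolve the no-speech special token once and suppress it by id (the
# "single token approach" described above). Newer vocabularies name it
# <|nospeech|>, older ones <|nocaptions|>; fake_vocab is only a stand-in.
fake_vocab = {"<|nospeech|>": 50361, "<|notimestamps|>": 50362}

def token_to_id(token):
    return fake_vocab.get(token)

def no_speech_id():
    # Same fallback chain as the new Tokenizer.no_speech property further down.
    return token_to_id("<|nospeech|>") or token_to_id("<|nocaptions|>")

suppress_tokens = [1, 2, 7]              # tokens the caller already suppresses
suppress_tokens.append(no_speech_id())   # suppress the whole tag, not its pieces
print(sorted(suppress_tokens))           # [1, 2, 7, 50361]
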
Mahmoud Ashraf
c26d609974 only merge when clip_timestamps are not provided (#1345)
fixes #1340 and allows for batching multiple audio files less than 30s each
2025-08-16 14:30:50 +03:00
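
A usage sketch of what this enables, mirroring the new test_cliptimestamps_segments test at the end of this diff (the file paths and the "tiny" model size are placeholders):

# Batch two short (< 30 s) recordings in a single call: concatenate them and
# pass explicit clip timestamps in seconds, one entry per file, so the clips
# are no longer merged into one 30 s chunk.
import numpy as np
from faster_whisper import BatchedInferencePipeline, WhisperModel, decode_audio

model = WhisperModel("tiny")
pipeline = BatchedInferencePipeline(model=model)

first = decode_audio("clip_a.wav")   # placeholder paths, decoded at 16 kHz
second = decode_audio("clip_b.wav")
audio = np.concatenate([first, second])

boundary = len(first) / 16000
clip_timestamps = [
    {"start": 0.0, "end": boundary},
    {"start": boundary, "end": len(audio) / 16000},
]

segments, info = pipeline.transcribe(audio, clip_timestamps=clip_timestamps)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
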
黑墨水鱼
4bd98d5c5b Update README.md to include whisper-fastapi (#1325) 2025-08-11 13:44:48 +03:00
12 changed files with 79 additions and 55 deletions

View File: CI workflow

@@ -17,10 +17,10 @@ jobs:
     steps:
       - uses: actions/checkout@v4
 
-      - name: Set up Python 3.9
+      - name: Set up Python 3.10
         uses: actions/setup-python@v5
         with:
-          python-version: 3.9
+          python-version: '3.10'
 
       - name: Install module
         run: |
@@ -47,10 +47,10 @@ jobs:
     steps:
       - uses: actions/checkout@v4
 
-      - name: Set up Python 3.9
+      - name: Set up Python 3.10
         uses: actions/setup-python@v5
         with:
-          python-version: 3.9
+          python-version: '3.10'
 
       - name: Install module
         run: |
@@ -69,10 +69,10 @@ jobs:
     steps:
      - uses: actions/checkout@v4
 
-      - name: Set up Python 3.9
+      - name: Set up Python 3.10
         uses: actions/setup-python@v5
         with:
-          python-version: 3.9
+          python-version: '3.10'
 
       - name: Install dependencies
         run: |
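
The quotes are needed because unquoted YAML scalars are typed: 3.10 is parsed as the float 3.1, so setup-python would be asked for the wrong version. A quick check of the two forms (assumes PyYAML is installed):

# Unquoted 3.10 round-trips as the float 3.1; the quoted form stays a string.
import yaml

print(yaml.safe_load("python-version: 3.10"))    # {'python-version': 3.1}
print(yaml.safe_load("python-version: '3.10'"))  # {'python-version': '3.10'}
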

View File: MANIFEST.in

@@ -1,4 +1,3 @@
-include faster_whisper/assets/silero_encoder_v5.onnx
-include faster_whisper/assets/silero_decoder_v5.onnx
+include faster_whisper/assets/silero_vad_v6.onnx
 include requirements.txt
 include requirements.conversion.txt

View File: README.md

@@ -56,7 +56,7 @@ For reference, here's the time and memory usage that are required to transcribe
 ## Requirements
 
-* Python 3.9 or greater
+* Python 3.10 or greater
 
 Unlike openai-whisper, FFmpeg does **not** need to be installed on the system. The audio is decoded with the Python library [PyAV](https://github.com/PyAV-Org/PyAV) which bundles the FFmpeg libraries in its package.
@@ -250,6 +250,7 @@ Here is a non exhaustive list of open-source projects using faster-whisper. Feel
 * [WhisperLive](https://github.com/collabora/WhisperLive) is a nearly-live implementation of OpenAI's Whisper which uses faster-whisper as the backend to transcribe audio in real-time.
 * [Faster-Whisper-Transcriber](https://github.com/BBC-Esq/ctranslate2-faster-whisper-transcriber) is a simple but reliable voice transcriber that provides a user-friendly interface.
 * [Open-dubbing](https://github.com/softcatala/open-dubbing) is open dubbing is an AI dubbing system which uses machine learning models to automatically translate and synchronize audio dialogue into different languages.
+* [Whisper-FastAPI](https://github.com/heimoshuiyu/whisper-fastapi) whisper-fastapi is a very simple script that provides an API backend compatible with OpenAI, HomeAssistant, and Konele (Android voice typing) formats.
 
 ## Model conversion

Binary file not shown.

View File: faster_whisper/tokenizer.py

@@ -67,6 +67,12 @@ class Tokenizer:
     def no_timestamps(self) -> int:
         return self.tokenizer.token_to_id("<|notimestamps|>")
 
+    @cached_property
+    def no_speech(self) -> int:
+        return self.tokenizer.token_to_id("<|nospeech|>") or self.tokenizer.token_to_id(
+            "<|nocaptions|>"
+        )
+
     @property
     def timestamp_begin(self) -> int:
         return self.no_timestamps + 1

View File: faster_whisper/transcribe.py

@@ -417,15 +417,27 @@ class BatchedInferencePipeline:
                     "No clip timestamps found. "
                     "Set 'vad_filter' to True or provide 'clip_timestamps'."
                 )
+            audio_chunks, chunks_metadata = collect_chunks(
+                audio, clip_timestamps, max_duration=chunk_length
+            )
         else:
             clip_timestamps = [
                 {k: int(v * sampling_rate) for k, v in segment.items()}
                 for segment in clip_timestamps
             ]
-        audio_chunks, chunks_metadata = collect_chunks(
-            audio, clip_timestamps, max_duration=chunk_length
-        )
+            audio_chunks, chunks_metadata = [], []
+            for clip in clip_timestamps:
+                audio_chunks.append(audio[clip["start"] : clip["end"]])
+                chunks_metadata.append(
+                    {
+                        "offset": clip["start"] / sampling_rate,
+                        "duration": (clip["end"] - clip["start"]) / sampling_rate,
+                        "segments": [clip],
+                    }
+                )
 
         duration_after_vad = (
             sum((segment["end"] - segment["start"]) for segment in clip_timestamps)
@@ -1766,7 +1778,7 @@ class WhisperModel:
         Returns:
             language: Detected language.
-            languege_probability: Probability of the detected language.
+            language_probability: Probability of the detected language.
             all_language_probs: List of tuples with all language names and probabilities.
         """
         assert (
@@ -1874,6 +1886,7 @@ def get_suppressed_tokens(
             tokenizer.sot,
             tokenizer.sot_prev,
             tokenizer.sot_lm,
+            tokenizer.no_speech,
         ]
     )

View File: faster_whisper/vad.py

@@ -86,7 +86,7 @@ def get_speech_timestamps(
     padded_audio = np.pad(
         audio, (0, window_size_samples - audio.shape[0] % window_size_samples)
     )
-    speech_probs = model(padded_audio.reshape(1, -1)).squeeze(0)
+    speech_probs = model(padded_audio)
 
     triggered = False
     speeches = []
@@ -288,13 +288,12 @@ class SpeechTimestampsMap:
 @functools.lru_cache
 def get_vad_model():
     """Returns the VAD model instance."""
-    encoder_path = os.path.join(get_assets_path(), "silero_encoder_v5.onnx")
-    decoder_path = os.path.join(get_assets_path(), "silero_decoder_v5.onnx")
-    return SileroVADModel(encoder_path, decoder_path)
+    path = os.path.join(get_assets_path(), "silero_vad_v6.onnx")
+    return SileroVADModel(path)
 
 
 class SileroVADModel:
-    def __init__(self, encoder_path, decoder_path):
+    def __init__(self, path):
         try:
             import onnxruntime
         except ImportError as e:
@@ -308,13 +307,8 @@
         opts.enable_cpu_mem_arena = False
         opts.log_severity_level = 4
 
-        self.encoder_session = onnxruntime.InferenceSession(
-            encoder_path,
-            providers=["CPUExecutionProvider"],
-            sess_options=opts,
-        )
-        self.decoder_session = onnxruntime.InferenceSession(
-            decoder_path,
+        self.session = onnxruntime.InferenceSession(
+            path,
             providers=["CPUExecutionProvider"],
             sess_options=opts,
         )
@@ -322,47 +316,36 @@
     def __call__(
         self, audio: np.ndarray, num_samples: int = 512, context_size_samples: int = 64
     ):
+        assert audio.ndim == 1, "Input should be a 1D array"
         assert (
-            audio.ndim == 2
-        ), "Input should be a 2D array with size (batch_size, num_samples)"
-        assert (
-            audio.shape[1] % num_samples == 0
+            audio.shape[0] % num_samples == 0
         ), "Input size should be a multiple of num_samples"
 
-        batch_size = audio.shape[0]
-
-        state = np.zeros((2, batch_size, 128), dtype="float32")
+        h = np.zeros((1, 1, 128), dtype="float32")
+        c = np.zeros((1, 1, 128), dtype="float32")
         context = np.zeros(
-            (batch_size, context_size_samples),
+            (1, context_size_samples),
             dtype="float32",
         )
 
-        batched_audio = audio.reshape(batch_size, -1, num_samples)
+        batched_audio = audio.reshape(-1, num_samples)
         context = batched_audio[..., -context_size_samples:]
-        context[:, -1] = 0
-        context = np.roll(context, 1, 1)
-        batched_audio = np.concatenate([context, batched_audio], 2)
+        context[-1] = 0
+        context = np.roll(context, 1, 0)
+        batched_audio = np.concatenate([context, batched_audio], 1)
 
         batched_audio = batched_audio.reshape(-1, num_samples + context_size_samples)
 
         encoder_batch_size = 10000
         num_segments = batched_audio.shape[0]
-        encoder_outputs = []
+        outputs = []
         for i in range(0, num_segments, encoder_batch_size):
-            encoder_output = self.encoder_session.run(
-                None, {"input": batched_audio[i : i + encoder_batch_size]}
-            )[0]
-            encoder_outputs.append(encoder_output)
-
-        encoder_output = np.concatenate(encoder_outputs, axis=0)
-        encoder_output = encoder_output.reshape(batch_size, -1, 128)
-
-        decoder_outputs = []
-        for window in np.split(encoder_output, encoder_output.shape[1], axis=1):
-            out, state = self.decoder_session.run(
-                None, {"input": window.squeeze(1), "state": state}
+            output, h, c = self.session.run(
+                None,
+                {"input": batched_audio[i : i + encoder_batch_size], "h": h, "c": c},
             )
-            decoder_outputs.append(out)
+            outputs.append(output)
 
-        out = np.stack(decoder_outputs, axis=1).squeeze(-1)
+        out = np.concatenate(outputs, axis=0)
+
         return out
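
For orientation, the window/context bookkeeping of the new single-model __call__ can be traced with NumPy alone (the zeroed buffer is illustrative and no ONNX session is created):

# Each 512-sample window is prefixed with the last 64 samples of the previous
# window (the first window gets zeros), then the whole batch goes through the
# v6 model in one run.
import numpy as np

num_samples, context_size = 512, 64
audio = np.zeros(512 * 3, dtype="float32")     # 1-D input, a multiple of num_samples

windows = audio.reshape(-1, num_samples)        # (3, 512)
context = windows[:, -context_size:].copy()     # tail of every window
context[-1] = 0
context = np.roll(context, 1, 0)                # shift tails down by one window
model_input = np.concatenate([context, windows], 1)
print(model_input.shape)                        # (3, 576)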

View File: setup.py

@@ -45,13 +45,13 @@ setup(
         "License :: OSI Approved :: MIT License",
         "Programming Language :: Python :: 3",
         "Programming Language :: Python :: 3 :: Only",
-        "Programming Language :: Python :: 3.9",
         "Programming Language :: Python :: 3.10",
         "Programming Language :: Python :: 3.11",
+        "Programming Language :: Python :: 3.12",
         "Topic :: Scientific/Engineering :: Artificial Intelligence",
     ],
     keywords="openai whisper speech ctranslate2 inference quantization transformer",
-    python_requires=">=3.9",
+    python_requires=">=3.10",
     install_requires=install_requires,
     extras_require={
         "conversion": conversion_requires,

View File: tests/test_tokenizer.py

@@ -98,6 +98,7 @@ def test_suppressed_tokens_minus_1():
         50358,
         50359,
         50360,
+        50361,
     )
@@ -106,7 +107,7 @@ def test_suppressed_tokens_minus_value():
     tokenizer = Tokenizer(model.hf_tokenizer, False)
     tokens = get_suppressed_tokens(tokenizer, [13])
-    assert tokens == (13, 50257, 50357, 50358, 50359, 50360)
+    assert tokens == (13, 50257, 50357, 50358, 50359, 50360, 50361)
 
 
 def test_split_on_unicode():

View File: tests/test_transcribe.py

@@ -269,3 +269,24 @@ def test_monotonic_timestamps(physcisworks_path):
             assert word.start <= word.end
             assert word.end <= segments[i].end
     assert segments[-1].end <= info.duration
+
+
+def test_cliptimestamps_segments(jfk_path):
+    model = WhisperModel("tiny")
+    pipeline = BatchedInferencePipeline(model=model)
+    audio = decode_audio(jfk_path)
+    audio = np.concatenate([audio, audio])
+    clip_timestamps = [{"start": 0.0, "end": 11.0}, {"start": 11.0, "end": 22.0}]
+    segments, info = pipeline.transcribe(audio, clip_timestamps=clip_timestamps)
+    segments = list(segments)
+
+    assert len(segments) == 2
+    for segment, clip in zip(segments, clip_timestamps):
+        assert segment.start == clip["start"]
+        assert segment.end == clip["end"]
+        assert segment.text == (
+            " And so my fellow Americans ask not what your country can do for you, "
+            "ask what you can do for your country."
+        )