16 Commits
v1.1.1 ... 310

Author SHA1 Message Date
Mahmoud Ashraf
ba812f55a2 Fix quotes for Python version in CI workflow 2025-10-30 21:14:30 +03:00
Mahmoud Ashraf
44466c7535 Upgrade Python version from 3.9 to 3.10 in CI 2025-10-30 21:12:36 +03:00
Mahmoud Ashraf
e3e46675b2 Update Python version requirements to 3.10 and 3.12 2025-10-30 21:11:50 +03:00
Mahmoud Ashraf
14ad587c98 Update Python version requirement to 3.10 or greater 2025-10-30 21:11:07 +03:00
Purfview
9090997d25 Fix a typo (#1377) 2025-10-22 15:51:56 +03:00
Mahmoud Ashraf
dea24cbcc6 Upgrade to Silero-VAD V6 (#1373)
Co-authored-by: sssshhhhhh <193317444+sssshhhhhh@users.noreply.github.com>
2025-10-14 15:29:56 +03:00
Mario
14ba1051f3 Fix: add <|nocaptions|> to suppressed tokens (#1338)
* Fix: Prevent <|nocaptions|> tokens in BatchedInferencePipeline

- Add nocaptions component tokens [1771, 496, 9799] to suppress_tokens list
- Add segment filtering to remove any remaining <|nocaptions|> segments
- Resolves issue where BatchedInferencePipeline would generate malformed
  special tokens during periods of silence or low-confidence transcription
- Includes comprehensive tests to verify the fix

The issue occurred because while bracket tokens ('<', '|', '>') were
already suppressed, the content tokens ('no', 'ca', 'ptions') were not,
leading to partial token generation that formed complete <|nocaptions|>
tags in the output.

Files changed:
- faster_whisper/transcribe.py: Core fix implementation
- test_nocaptions_comprehensive.py: Comprehensive test suite
- tests/test_nocaptions_fix.py: Unit tests

* removed

* Fix: Prevent <|nocaptions|> tokens in BatchedInferencePipeline

* Fix: Implement proper <|nocaptions|> token suppression using single token approach

* ci: trigger tests

* fix: remove trailing whitespace from blank lines

* Update faster_whisper/transcribe.py

Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>

* Update faster_whisper/tokenizer.py

Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>

* Update faster_whisper/tokenizer.py

Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>

* Rename no_speech to no_captions in tokenizer

* nocaptions has been renamed to nospeech

* break line

* line break

* Refactor no_speech method for improved readability by adjusting line breaks

---------

Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>
2025-10-10 21:56:54 +03:00
Mahmoud Ashraf
c26d609974 only merge when clip_timestamps are not provided (#1345)
fixes #1340 and allows batching multiple audio files that are each shorter than 30s
2025-08-16 14:30:50 +03:00
黑墨水鱼
4bd98d5c5b Update README.md to include whisper-fastapi (#1325) 2025-08-11 13:44:48 +03:00
Mahmoud Ashraf
93001a9438 bump version to 1.2.0 2025-08-06 03:31:36 +03:00
Mahmoud Ashraf
a0c3cb9802 Remove Silence in Batched transcription (#1297) 2025-08-06 03:30:59 +03:00
Mahmoud Ashraf
fbeb1ba731 get correct index for samples (#1336) 2025-08-06 03:17:45 +03:00
Rishil
d3bfd0a305 feat: Allow loading of private HF models (#1309)
* feat: add HuggingFace auth token support to model download

* Format
2025-06-02 14:12:34 +03:00
Mahmoud Ashraf
43d4163fe0 Support distil-large-v3.5 (#1311) 2025-06-02 14:09:20 +03:00
Felix Mosheev
700584b2e6 feat: allow passing specific revision to download (#1292) 2025-04-30 00:55:48 +03:00
David Jiménez
1383fd4d37 Update README.md with speaches instead of faster-whisper-server (#1267)
The project was previously named faster-whisper-server; it has been renamed to speaches as it has evolved to support more than just ASR.
2025-03-20 17:20:26 +03:00
14 changed files with 169 additions and 122 deletions

View File

@@ -17,10 +17,10 @@ jobs:
steps:
- uses: actions/checkout@v4
- name: Set up Python 3.9
- name: Set up Python 3.10
uses: actions/setup-python@v5
with:
python-version: 3.9
python-version: '3.10'
- name: Install module
run: |
@@ -47,10 +47,10 @@ jobs:
steps:
- uses: actions/checkout@v4
- name: Set up Python 3.9
- name: Set up Python 3.10
uses: actions/setup-python@v5
with:
python-version: 3.9
python-version: '3.10'
- name: Install module
run: |
@@ -69,10 +69,10 @@ jobs:
steps:
- uses: actions/checkout@v4
- name: Set up Python 3.9
- name: Set up Python 3.10
uses: actions/setup-python@v5
with:
python-version: 3.9
python-version: '3.10'
- name: Install dependencies
run: |
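A quick way to see why the quoting fix above matters: YAML reads a bare 3.10 as the number 3.1, so actions/setup-python would be asked for Python 3.1 instead of 3.10. A minimal illustration, assuming PyYAML is installed:

```python
import yaml

# Unquoted 3.10 is parsed as a YAML float and silently becomes 3.1.
print(yaml.safe_load("python-version: 3.10"))    # {'python-version': 3.1}

# Quoting keeps it a string, which is what setup-python expects.
print(yaml.safe_load("python-version: '3.10'"))  # {'python-version': '3.10'}
```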

View File

@@ -1,4 +1,3 @@
include faster_whisper/assets/silero_encoder_v5.onnx
include faster_whisper/assets/silero_decoder_v5.onnx
include faster_whisper/assets/silero_vad_v6.onnx
include requirements.txt
include requirements.conversion.txt

View File

@@ -56,7 +56,7 @@ For reference, here's the time and memory usage that are required to transcribe
## Requirements
* Python 3.9 or greater
* Python 3.10 or greater
Unlike openai-whisper, FFmpeg does **not** need to be installed on the system. The audio is decoded with the Python library [PyAV](https://github.com/PyAV-Org/PyAV) which bundles the FFmpeg libraries in its package.
@@ -237,7 +237,7 @@ See more model and transcription options in the [`WhisperModel`](https://github.
Here is a non exhaustive list of open-source projects using faster-whisper. Feel free to add your project to the list!
* [faster-whisper-server](https://github.com/fedirz/faster-whisper-server) is an OpenAI compatible server using `faster-whisper`. It's easily deployable with Docker, works with OpenAI SDKs/CLI, supports streaming, and live transcription.
* [speaches](https://github.com/speaches-ai/speaches) is an OpenAI compatible server using `faster-whisper`. It's easily deployable with Docker, works with OpenAI SDKs/CLI, supports streaming, and live transcription.
* [WhisperX](https://github.com/m-bain/whisperX) is an award-winning Python library that offers speaker diarization and accurate word-level timestamps using wav2vec2 alignment
* [whisper-ctranslate2](https://github.com/Softcatala/whisper-ctranslate2) is a command line client based on faster-whisper and compatible with the original client from openai/whisper.
* [whisper-diarize](https://github.com/MahmoudAshraf97/whisper-diarization) is a speaker diarization tool that is based on faster-whisper and NVIDIA NeMo.
@@ -250,6 +250,7 @@ Here is a non exhaustive list of open-source projects using faster-whisper. Feel
* [WhisperLive](https://github.com/collabora/WhisperLive) is a nearly-live implementation of OpenAI's Whisper which uses faster-whisper as the backend to transcribe audio in real-time.
* [Faster-Whisper-Transcriber](https://github.com/BBC-Esq/ctranslate2-faster-whisper-transcriber) is a simple but reliable voice transcriber that provides a user-friendly interface.
* [Open-dubbing](https://github.com/softcatala/open-dubbing) is open dubbing is an AI dubbing system which uses machine learning models to automatically translate and synchronize audio dialogue into different languages.
* [Whisper-FastAPI](https://github.com/heimoshuiyu/whisper-fastapi) whisper-fastapi is a very simple script that provides an API backend compatible with OpenAI, HomeAssistant, and Konele (Android voice typing) formats.
## Model conversion

Binary file not shown.

View File

@@ -67,6 +67,12 @@ class Tokenizer:
def no_timestamps(self) -> int:
return self.tokenizer.token_to_id("<|notimestamps|>")
@cached_property
def no_speech(self) -> int:
return self.tokenizer.token_to_id("<|nospeech|>") or self.tokenizer.token_to_id(
"<|nocaptions|>"
)
@property
def timestamp_begin(self) -> int:
return self.no_timestamps + 1
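A hedged sketch of how the new `no_speech` property is consumed, based on the tokenizer and `get_suppressed_tokens` diffs in this changeset. The model size is illustrative and not part of the diff; per the updated tests further down, the resolved id is 50361 for the English tokenizer.

```python
from faster_whisper import WhisperModel
from faster_whisper.tokenizer import Tokenizer
from faster_whisper.transcribe import get_suppressed_tokens

model = WhisperModel("tiny.en")  # illustrative model size
tokenizer = Tokenizer(model.hf_tokenizer, False)

# The new cached_property resolves <|nospeech|> first and falls back to <|nocaptions|>.
print("no-speech token id:", tokenizer.no_speech)

# get_suppressed_tokens() now includes tokenizer.no_speech, so the tag is blocked as a
# single token instead of relying on suppressing its bracket pieces.
suppressed = get_suppressed_tokens(tokenizer, [13])
assert tokenizer.no_speech in suppressed

# Why the old approach could leak fragments: a tag that is not registered as a single
# special token is split by BPE, and suppressing only '<', '|', '>' leaves the content
# pieces ('no', 'ca', 'ptions') free to appear in the output.
for tag in ("<|nospeech|>", "<|nocaptions|>"):
    print(tag, model.hf_tokenizer.encode(tag, add_special_tokens=False).tokens)
```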

View File

@@ -25,7 +25,6 @@ from faster_whisper.vad import (
VadOptions,
collect_chunks,
get_speech_timestamps,
merge_segments,
)
@@ -125,7 +124,7 @@ class BatchedInferencePipeline:
segmented_outputs = []
segment_sizes = []
for chunk_metadata, output in zip(chunks_metadata, outputs):
duration = chunk_metadata["end_time"] - chunk_metadata["start_time"]
duration = chunk_metadata["duration"]
segment_size = int(ceil(duration) * self.model.frames_per_second)
segment_sizes.append(segment_size)
(
@@ -135,7 +134,7 @@ class BatchedInferencePipeline:
) = self.model._split_segments_by_timestamps(
tokenizer=tokenizer,
tokens=output["tokens"],
time_offset=chunk_metadata["start_time"],
time_offset=chunk_metadata["offset"],
segment_size=segment_size,
segment_duration=duration,
seek=0,
@@ -153,7 +152,7 @@ class BatchedInferencePipeline:
tokenizer.decode(subsegment["tokens"])
),
seek=int(
chunk_metadata["start_time"] * self.model.frames_per_second
chunk_metadata["offset"] * self.model.frames_per_second
),
)
for subsegment in subsegments
@@ -409,8 +408,7 @@ class BatchedInferencePipeline:
**vad_parameters, max_speech_duration_s=chunk_length
)
active_segments = get_speech_timestamps(audio, vad_parameters)
clip_timestamps = merge_segments(active_segments, vad_parameters)
clip_timestamps = get_speech_timestamps(audio, vad_parameters)
# run the audio if it is less than 30 sec even without clip_timestamps
elif duration < chunk_length:
clip_timestamps = [{"start": 0, "end": audio.shape[0]}]
@@ -420,6 +418,27 @@ class BatchedInferencePipeline:
"Set 'vad_filter' to True or provide 'clip_timestamps'."
)
audio_chunks, chunks_metadata = collect_chunks(
audio, clip_timestamps, max_duration=chunk_length
)
else:
clip_timestamps = [
{k: int(v * sampling_rate) for k, v in segment.items()}
for segment in clip_timestamps
]
audio_chunks, chunks_metadata = [], []
for clip in clip_timestamps:
audio_chunks.append(audio[clip["start"] : clip["end"]])
chunks_metadata.append(
{
"offset": clip["start"] / sampling_rate,
"duration": (clip["end"] - clip["start"]) / sampling_rate,
"segments": [clip],
}
)
duration_after_vad = (
sum((segment["end"] - segment["start"]) for segment in clip_timestamps)
/ sampling_rate
@@ -430,7 +449,6 @@ class BatchedInferencePipeline:
format_timestamp(duration - duration_after_vad),
)
audio_chunks, chunks_metadata = collect_chunks(audio, clip_timestamps)
features = (
[self.model.feature_extractor(chunk)[..., :-1] for chunk in audio_chunks]
if duration_after_vad
@@ -541,6 +559,7 @@ class BatchedInferencePipeline:
options,
log_progress,
)
segments = restore_speech_timestamps(segments, clip_timestamps, sampling_rate)
return segments, info
@@ -596,6 +615,8 @@ class WhisperModel:
download_root: Optional[str] = None,
local_files_only: bool = False,
files: dict = None,
revision: Optional[str] = None,
use_auth_token: Optional[Union[str, bool]] = None,
**model_kwargs,
):
"""Initializes the Whisper model.
@@ -627,6 +648,11 @@ class WhisperModel:
files: Load model files from the memory. This argument is a dictionary mapping file names
to file contents as file-like or bytes objects. If this is set, model_path acts as an
identifier for this model.
revision:
An optional Git revision id which can be a branch name, a tag, or a
commit hash.
use_auth_token: HuggingFace authentication token or True to use the
token stored by the HuggingFace config folder.
"""
self.logger = get_logger()
@@ -642,6 +668,8 @@ class WhisperModel:
model_size_or_path,
local_files_only=local_files_only,
cache_dir=download_root,
revision=revision,
use_auth_token=use_auth_token,
)
self.model = ctranslate2.models.Whisper(
@@ -1750,7 +1778,7 @@ class WhisperModel:
Returns:
language: Detected language.
languege_probability: Probability of the detected language.
language_probability: Probability of the detected language.
all_language_probs: List of tuples with all language names and probabilities.
"""
assert (
@@ -1823,7 +1851,7 @@ def restore_speech_timestamps(
else:
segment.start = ts_map.get_original_time(segment.start)
segment.end = ts_map.get_original_time(segment.end)
segment.end = ts_map.get_original_time(segment.end, is_end=True)
yield segment
@@ -1858,6 +1886,7 @@ def get_suppressed_tokens(
tokenizer.sot,
tokenizer.sot_prev,
tokenizer.sot_lm,
tokenizer.no_speech,
]
)
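A minimal usage sketch of the new `clip_timestamps` path in `BatchedInferencePipeline`, mirroring the test added at the bottom of this changeset: user-supplied clips (in seconds) are no longer merged, so each clip comes back as its own segment. The audio path and model size are placeholders; the file is assumed to be at least 22 s long.

```python
from faster_whisper import BatchedInferencePipeline, WhisperModel, decode_audio

model = WhisperModel("tiny")  # placeholder model size
pipeline = BatchedInferencePipeline(model=model)

audio = decode_audio("audio.wav")  # placeholder path
clip_timestamps = [{"start": 0.0, "end": 11.0}, {"start": 11.0, "end": 22.0}]

segments, info = pipeline.transcribe(audio, clip_timestamps=clip_timestamps)
for segment in segments:
    # restore_speech_timestamps() maps the results back to the original audio timeline,
    # so each clip is returned as one segment with matching start/end times.
    print(segment.start, segment.end, segment.text)
```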

View File

@@ -2,7 +2,7 @@ import logging
import os
import re
from typing import List, Optional
from typing import List, Optional, Union
import huggingface_hub
import requests
@@ -26,6 +26,7 @@ _MODELS = {
"distil-medium.en": "Systran/faster-distil-whisper-medium.en",
"distil-small.en": "Systran/faster-distil-whisper-small.en",
"distil-large-v3": "Systran/faster-distil-whisper-large-v3",
"distil-large-v3.5": "distil-whisper/distil-large-v3.5-ct2",
"large-v3-turbo": "mobiuslabsgmbh/faster-whisper-large-v3-turbo",
"turbo": "mobiuslabsgmbh/faster-whisper-large-v3-turbo",
}
@@ -51,6 +52,8 @@ def download_model(
output_dir: Optional[str] = None,
local_files_only: bool = False,
cache_dir: Optional[str] = None,
revision: Optional[str] = None,
use_auth_token: Optional[Union[str, bool]] = None,
):
"""Downloads a CTranslate2 Whisper model from the Hugging Face Hub.
@@ -65,6 +68,10 @@ def download_model(
local_files_only: If True, avoid downloading the file and return the path to the local
cached file if it exists.
cache_dir: Path to the folder where cached files are stored.
revision: An optional Git revision id which can be a branch name, a tag, or a
commit hash.
use_auth_token: HuggingFace authentication token or True to use the
token stored by the HuggingFace config folder.
Returns:
The path to the downloaded model.
@@ -94,6 +101,7 @@ def download_model(
"local_files_only": local_files_only,
"allow_patterns": allow_patterns,
"tqdm_class": disabled_tqdm,
"revision": revision,
}
if output_dir is not None:
@@ -103,6 +111,9 @@ def download_model(
if cache_dir is not None:
kwargs["cache_dir"] = cache_dir
if use_auth_token is not None:
kwargs["token"] = use_auth_token
try:
return huggingface_hub.snapshot_download(repo_id, **kwargs)
except (
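A hedged usage sketch of the new `revision` and `use_auth_token` options; the repo id and revision below are placeholders, not values from this diff.

```python
from faster_whisper import WhisperModel
from faster_whisper.utils import download_model

# Pin a Git revision and authenticate for a private or gated CTranslate2 model repo.
model_path = download_model(
    "your-org/your-private-faster-whisper-ct2",  # placeholder repo id
    revision="main",                             # branch name, tag, or commit hash
    use_auth_token=True,  # True reuses the token stored by `huggingface-cli login`
)
model = WhisperModel(model_path)

# WhisperModel forwards the same options to download_model() internally.
model = WhisperModel("large-v3", revision="main", use_auth_token=True)
```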

View File

@@ -86,7 +86,7 @@ def get_speech_timestamps(
padded_audio = np.pad(
audio, (0, window_size_samples - audio.shape[0] % window_size_samples)
)
speech_probs = model(padded_audio.reshape(1, -1)).squeeze(0)
speech_probs = model(padded_audio)
triggered = False
speeches = []
@@ -184,25 +184,62 @@ def get_speech_timestamps(
def collect_chunks(
audio: np.ndarray, chunks: List[dict], sampling_rate: int = 16000
) -> Tuple[List[np.ndarray], List[Dict[str, int]]]:
"""Collects audio chunks."""
audio: np.ndarray,
chunks: List[dict],
sampling_rate: int = 16000,
max_duration: float = float("inf"),
) -> Tuple[List[np.ndarray], List[Dict[str, float]]]:
"""This function merges the chunks of audio into chunks of max_duration (s) length."""
if not chunks:
chunk_metadata = {
"start_time": 0,
"end_time": 0,
"offset": 0,
"duration": 0,
"segments": [],
}
return [np.array([], dtype=np.float32)], [chunk_metadata]
audio_chunks = []
chunks_metadata = []
current_segments = []
current_duration = 0
total_duration = 0
current_audio = np.array([], dtype=np.float32)
for chunk in chunks:
chunk_metadata = {
"start_time": chunk["start"] / sampling_rate,
"end_time": chunk["end"] / sampling_rate,
}
audio_chunks.append(audio[chunk["start"] : chunk["end"]])
chunks_metadata.append(chunk_metadata)
if (
current_duration + chunk["end"] - chunk["start"]
> max_duration * sampling_rate
):
audio_chunks.append(current_audio)
chunk_metadata = {
"offset": total_duration / sampling_rate,
"duration": current_duration / sampling_rate,
"segments": current_segments,
}
total_duration += current_duration
chunks_metadata.append(chunk_metadata)
current_segments = []
current_audio = audio[chunk["start"] : chunk["end"]]
current_duration = chunk["end"] - chunk["start"]
else:
current_segments.append(chunk)
current_audio = np.concatenate(
(current_audio, audio[chunk["start"] : chunk["end"]])
)
current_duration += chunk["end"] - chunk["start"]
audio_chunks.append(current_audio)
chunk_metadata = {
"offset": total_duration / sampling_rate,
"duration": current_duration / sampling_rate,
"segments": current_segments,
}
chunks_metadata.append(chunk_metadata)
return audio_chunks, chunks_metadata
@@ -229,15 +266,19 @@ class SpeechTimestampsMap:
self,
time: float,
chunk_index: Optional[int] = None,
is_end: bool = False,
) -> float:
if chunk_index is None:
chunk_index = self.get_chunk_index(time)
chunk_index = self.get_chunk_index(time, is_end)
total_silence_before = self.total_silence_before[chunk_index]
return round(total_silence_before + time, self.time_precision)
def get_chunk_index(self, time: float) -> int:
def get_chunk_index(self, time: float, is_end: bool = False) -> int:
sample = int(time * self.sampling_rate)
if sample in self.chunk_end_sample and is_end:
return self.chunk_end_sample.index(sample)
return min(
bisect.bisect(self.chunk_end_sample, sample),
len(self.chunk_end_sample) - 1,
@@ -247,13 +288,12 @@ class SpeechTimestampsMap:
@functools.lru_cache
def get_vad_model():
"""Returns the VAD model instance."""
encoder_path = os.path.join(get_assets_path(), "silero_encoder_v5.onnx")
decoder_path = os.path.join(get_assets_path(), "silero_decoder_v5.onnx")
return SileroVADModel(encoder_path, decoder_path)
path = os.path.join(get_assets_path(), "silero_vad_v6.onnx")
return SileroVADModel(path)
class SileroVADModel:
def __init__(self, encoder_path, decoder_path):
def __init__(self, path):
try:
import onnxruntime
except ImportError as e:
@@ -267,13 +307,8 @@ class SileroVADModel:
opts.enable_cpu_mem_arena = False
opts.log_severity_level = 4
self.encoder_session = onnxruntime.InferenceSession(
encoder_path,
providers=["CPUExecutionProvider"],
sess_options=opts,
)
self.decoder_session = onnxruntime.InferenceSession(
decoder_path,
self.session = onnxruntime.InferenceSession(
path,
providers=["CPUExecutionProvider"],
sess_options=opts,
)
@@ -281,92 +316,36 @@ class SileroVADModel:
def __call__(
self, audio: np.ndarray, num_samples: int = 512, context_size_samples: int = 64
):
assert audio.ndim == 1, "Input should be a 1D array"
assert (
audio.ndim == 2
), "Input should be a 2D array with size (batch_size, num_samples)"
assert (
audio.shape[1] % num_samples == 0
audio.shape[0] % num_samples == 0
), "Input size should be a multiple of num_samples"
batch_size = audio.shape[0]
state = np.zeros((2, batch_size, 128), dtype="float32")
h = np.zeros((1, 1, 128), dtype="float32")
c = np.zeros((1, 1, 128), dtype="float32")
context = np.zeros(
(batch_size, context_size_samples),
(1, context_size_samples),
dtype="float32",
)
batched_audio = audio.reshape(batch_size, -1, num_samples)
batched_audio = audio.reshape(-1, num_samples)
context = batched_audio[..., -context_size_samples:]
context[:, -1] = 0
context = np.roll(context, 1, 1)
batched_audio = np.concatenate([context, batched_audio], 2)
context[-1] = 0
context = np.roll(context, 1, 0)
batched_audio = np.concatenate([context, batched_audio], 1)
batched_audio = batched_audio.reshape(-1, num_samples + context_size_samples)
encoder_batch_size = 10000
num_segments = batched_audio.shape[0]
encoder_outputs = []
outputs = []
for i in range(0, num_segments, encoder_batch_size):
encoder_output = self.encoder_session.run(
None, {"input": batched_audio[i : i + encoder_batch_size]}
)[0]
encoder_outputs.append(encoder_output)
encoder_output = np.concatenate(encoder_outputs, axis=0)
encoder_output = encoder_output.reshape(batch_size, -1, 128)
decoder_outputs = []
for window in np.split(encoder_output, encoder_output.shape[1], axis=1):
out, state = self.decoder_session.run(
None, {"input": window.squeeze(1), "state": state}
output, h, c = self.session.run(
None,
{"input": batched_audio[i : i + encoder_batch_size], "h": h, "c": c},
)
decoder_outputs.append(out)
outputs.append(output)
out = np.concatenate(outputs, axis=0)
out = np.stack(decoder_outputs, axis=1).squeeze(-1)
return out
def merge_segments(segments_list, vad_options: VadOptions, sampling_rate: int = 16000):
if not segments_list:
return []
curr_end = 0
seg_idxs = []
merged_segments = []
edge_padding = vad_options.speech_pad_ms * sampling_rate // 1000
chunk_length = vad_options.max_speech_duration_s * sampling_rate
curr_start = segments_list[0]["start"]
for idx, seg in enumerate(segments_list):
# if any segment start timing is less than previous segment end timing,
# reset the edge padding. Similarly for end timing.
if idx > 0:
if seg["start"] < segments_list[idx - 1]["end"]:
seg["start"] += edge_padding
if idx < len(segments_list) - 1:
if seg["end"] > segments_list[idx + 1]["start"]:
seg["end"] -= edge_padding
if seg["end"] - curr_start > chunk_length and curr_end - curr_start > 0:
merged_segments.append(
{
"start": curr_start,
"end": curr_end,
"segments": seg_idxs,
}
)
curr_start = seg["start"]
seg_idxs = []
curr_end = seg["end"]
seg_idxs.append((seg["start"], seg["end"]))
# add final
merged_segments.append(
{
"start": curr_start,
"end": curr_end,
"segments": seg_idxs,
}
)
return merged_segments
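A small sketch of the revised `collect_chunks()`: speech segments (in samples) are now merged into chunks of at most `max_duration` seconds, and each merged chunk reports an `offset`, `duration`, and its constituent `segments` instead of the old `start_time`/`end_time`. The zero-filled audio and hand-written segments below are placeholders standing in for real `get_speech_timestamps()` output.

```python
import numpy as np

from faster_whisper.vad import collect_chunks

sampling_rate = 16000
audio = np.zeros(60 * sampling_rate, dtype=np.float32)  # placeholder signal

# Speech segments in samples, as get_speech_timestamps() would return them.
speech_chunks = [
    {"start": 0, "end": 20 * sampling_rate},
    {"start": 25 * sampling_rate, "end": 40 * sampling_rate},
    {"start": 45 * sampling_rate, "end": 55 * sampling_rate},
]

# Segments are merged greedily until adding the next one would exceed max_duration;
# offsets and durations are reported in seconds.
audio_chunks, chunks_metadata = collect_chunks(audio, speech_chunks, max_duration=30.0)
for chunk, metadata in zip(audio_chunks, chunks_metadata):
    print(round(len(chunk) / sampling_rate, 1), metadata["offset"], metadata["duration"])
```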

View File

@@ -1,3 +1,3 @@
"""Version information."""
__version__ = "1.1.1"
__version__ = "1.2.0"

View File

@@ -45,13 +45,13 @@ setup(
"License :: OSI Approved :: MIT License",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3 :: Only",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
],
keywords="openai whisper speech ctranslate2 inference quantization transformer",
python_requires=">=3.9",
python_requires=">=3.10",
install_requires=install_requires,
extras_require={
"conversion": conversion_requires,

View File

@@ -98,6 +98,7 @@ def test_suppressed_tokens_minus_1():
50358,
50359,
50360,
50361,
)
@@ -106,7 +107,7 @@ def test_suppressed_tokens_minus_value():
tokenizer = Tokenizer(model.hf_tokenizer, False)
tokens = get_suppressed_tokens(tokenizer, [13])
assert tokens == (13, 50257, 50357, 50358, 50359, 50360)
assert tokens == (13, 50257, 50357, 50358, 50359, 50360, 50361)
def test_split_on_unicode():

View File

@@ -71,7 +71,7 @@ def test_batched_transcribe(physcisworks_path):
{"start": segment.start, "end": segment.end, "text": segment.text}
)
# number of near 30 sec segments
assert len(segments) == 7
assert len(segments) == 6
result, info = batched_model.transcribe(
physcisworks_path,
@@ -269,3 +269,24 @@ def test_monotonic_timestamps(physcisworks_path):
assert word.start <= word.end
assert word.end <= segments[i].end
assert segments[-1].end <= info.duration
def test_cliptimestamps_segments(jfk_path):
model = WhisperModel("tiny")
pipeline = BatchedInferencePipeline(model=model)
audio = decode_audio(jfk_path)
audio = np.concatenate([audio, audio])
clip_timestamps = [{"start": 0.0, "end": 11.0}, {"start": 11.0, "end": 22.0}]
segments, info = pipeline.transcribe(audio, clip_timestamps=clip_timestamps)
segments = list(segments)
assert len(segments) == 2
for segment, clip in zip(segments, clip_timestamps):
assert segment.start == clip["start"]
assert segment.end == clip["end"]
assert segment.text == (
" And so my fellow Americans ask not what your country can do for you, "
"ask what you can do for your country."
)