Bump version to 1.0.2 (#816)

Allow av to include version 12. (#819)
Clarify documentation for hotwords (#817)
2026-01-12 23:18:06 -05:00 · 2024-05-06 09:02:54 +07:00 · 2024-05-06 08:57:35 +07:00 · 2024-05-06 08:52:59 +07:00 · 2024-05-04 15:12:59 +07:00 · 2024-05-04 15:12:43 +07:00
30 changed files with 3725 additions and 219 deletions
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -25,7 +25,7 @@ jobs:
      - name: Install module
        run: |
          pip install wheel
-          pip install .[dev] --extra-index-url https://download.pytorch.org/whl/cpu
+          pip install -e .[dev]

      - name: Check code format with Black
        run: |
@@ -55,8 +55,36 @@ jobs:
      - name: Install module
        run: |
          pip install wheel
-          pip install .[dev] --extra-index-url https://download.pytorch.org/whl/cpu
+          pip install -e .[dev]

      - name: Run pytest
        run: |
-          pytest -v tests/test.py
+          pytest -v tests/
+
+
+  build-and-push-package:
+    runs-on: ubuntu-latest
+    needs: [check-code-format, run-tests]
+
+    steps:
+      - uses: actions/checkout@v3
+
+      - name: Set up Python 3.8
+        uses: actions/setup-python@v4
+        with:
+          python-version: 3.8
+
+      - name: Install dependencies
+        run: |
+          pip install wheel
+
+      - name: Build package
+        run: |
+          python3 setup.py sdist bdist_wheel
+
+      - name: Push package on PyPI
+        if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags')
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          user: __token__
+          password: ${{ secrets.PYPI_API_TOKEN }}
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,15 @@
+# Byte-compiled / Optimized / DLL Files
+*.pyc
+*.pyo
+*.pyd
+__pycache__/
+
+# Distribution / Packaging
+venv/
+
+# Unit Test
+.pytest_cache/
+
+# Ignore IDE, Editor Files
+.idea/
+.vscode/
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -0,0 +1,31 @@
+# Contributing to faster-whisper
+
+Contributions are welcome! Here are some pointers to help you install the library for development and validate your changes before submitting a pull request.
+
+## Install the library for development
+
+We recommend installing the module in editable mode with the `dev` extra requirements:
+
+```bash
+git clone https://github.com/SYSTRAN/faster-whisper.git
+cd faster-whisper/
+pip install -e .[dev]
+```
+
+## Validate the changes before creating a pull request
+
+1. Make sure the existing tests are still passing (and consider adding new tests as well!):
+
+```bash
+pytest tests/
+```
+
+2. Reformat and validate the code with the following tools:
+
+```bash
+black .
+isort .
+flake8 .
+```
+
+These steps are also run automatically in the CI when you open the pull request.
--- a/2
+++ b/2
@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2023 Guillaume Klein
+Copyright (c) 2023 SYSTRAN

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -0,0 +1,3 @@
+include faster_whisper/assets/silero_vad.onnx
+include requirements.txt
+include requirements.conversion.txt
--- a/README.md
+++ b/README.md
@@ -1,16 +1,20 @@
+[![CI](https://github.com/SYSTRAN/faster-whisper/workflows/CI/badge.svg)](https://github.com/SYSTRAN/faster-whisper/actions?query=workflow%3ACI) [![PyPI version](https://badge.fury.io/py/faster-whisper.svg)](https://badge.fury.io/py/faster-whisper)
+
 # Faster Whisper transcription with CTranslate2

-This repository demonstrates how to implement the Whisper transcription using [CTranslate2](https://github.com/OpenNMT/CTranslate2/), which is a fast inference engine for Transformer models.
+**faster-whisper** is a reimplementation of OpenAI's Whisper model using [CTranslate2](https://github.com/OpenNMT/CTranslate2/), which is a fast inference engine for Transformer models.

 This implementation is up to 4 times faster than [openai/whisper](https://github.com/openai/whisper) for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

 ## Benchmark

-For reference, here's the time and memory usage that are required to transcribe **13 minutes** of audio using different implementations:
+### Whisper
+
+For reference, here's the time and memory usage that are required to transcribe [**13 minutes**](https://www.youtube.com/watch?v=0u7tTptBo9I) of audio using different implementations:

 * [openai/whisper](https://github.com/openai/whisper)@[6dea21fd](https://github.com/openai/whisper/commit/6dea21fd7f7253bfe450f1e2512a0fe47ee2d258)
 * [whisper.cpp](https://github.com/ggerganov/whisper.cpp)@[3b010f9](https://github.com/ggerganov/whisper.cpp/commit/3b010f9bed9a6068609e9faf52383aea792b0362)
-* [faster-whisper](https://github.com/guillaumekln/faster-whisper)@[cce6b53e](https://github.com/guillaumekln/faster-whisper/commit/cce6b53e4554f71172dad188c45f10fb100f6e3e)
+* [faster-whisper](https://github.com/SYSTRAN/faster-whisper)@[cce6b53e](https://github.com/SYSTRAN/faster-whisper/commit/cce6b53e4554f71172dad188c45f10fb100f6e3e)

 ### Large-v2 model on GPU

@@ -34,63 +38,119 @@ For reference, here's the time and memory usage that are required to transcribe

 *Executed with 8 threads on a Intel(R) Xeon(R) Gold 6226R.*

+
+### Distil-whisper
+
+| Implementation | Precision | Beam size | Time | Gigaspeech WER |
+| --- | --- | --- | --- | --- |
+| distil-whisper/distil-large-v2 | fp16 | 4 | - | 10.36 |
+| [faster-distil-large-v2](https://huggingface.co/Systran/faster-distil-whisper-large-v2) | fp16 | 5 | - | 10.28 |
+| distil-whisper/distil-medium.en | fp16 | 4 | - | 11.21 |
+| [faster-distil-medium.en](https://huggingface.co/Systran/faster-distil-whisper-medium.en) | fp16 | 5 | - | 11.21 |
+
+*Executed with CUDA 11.4 on an NVIDIA 3090.*
+
+<details>
+<summary>testing details (click to expand)</summary>
+
+For `distil-whisper/distil-large-v2`, the WER is tested with the code sample from this [link](https://huggingface.co/distil-whisper/distil-large-v2#evaluation). For `faster-distil-whisper`, the WER is tested with the following settings:
+```python
+from faster_whisper import WhisperModel
+
+model_size = "distil-large-v2"
+# model_size = "distil-medium.en"
+# Run on GPU with FP16
+model = WhisperModel(model_size, device="cuda", compute_type="float16")
+segments, info = model.transcribe("audio.mp3", beam_size=5, language="en")
+```
+</details>
+
+## Requirements
+
+* Python 3.8 or greater
+
+Unlike openai-whisper, FFmpeg does **not** need to be installed on the system. The audio is decoded with the Python library [PyAV](https://github.com/PyAV-Org/PyAV) which bundles the FFmpeg libraries in its package.
+
+### GPU
+
+GPU execution requires the following NVIDIA libraries to be installed:
+
+* [cuBLAS for CUDA 12](https://developer.nvidia.com/cublas)
+* [cuDNN 8 for CUDA 12](https://developer.nvidia.com/cudnn)
+
+**Note**: The latest versions of `ctranslate2` support CUDA 12 only. For CUDA 11, the current workaround is to downgrade to version `3.24.0` of `ctranslate2` (this can be done with `pip install --force-reinstall ctranslate2==3.24.0` or by specifying the version in a `requirements.txt`).
+
+There are multiple ways to install the NVIDIA libraries mentioned above. The recommended way is described in the official NVIDIA documentation, but we also suggest other installation methods below. 
+
+<details>
+<summary>Other installation methods (click to expand)</summary>
+
+
+**Note:** For all these methods below, keep in mind the above note regarding CUDA versions. Depending on your setup, you may need to install the _CUDA 11_ versions of libraries that correspond to the CUDA 12 libraries listed in the instructions below.
+
+#### Use Docker
+
+The libraries (cuBLAS, cuDNN) are installed in these official NVIDIA CUDA Docker images: `nvidia/cuda:12.0.0-runtime-ubuntu20.04` or `nvidia/cuda:12.0.0-runtime-ubuntu22.04`.
+
+#### Install with `pip` (Linux only)
+
+On Linux these libraries can be installed with `pip`. Note that `LD_LIBRARY_PATH` must be set before launching Python.
+
+```bash
+pip install nvidia-cublas-cu12 nvidia-cudnn-cu12
+
+export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
+```
+
+**Note**: Version 9+ of `nvidia-cudnn-cu12` appears to cause issues due to its reliance on cuDNN 9 (Faster-Whisper does not currently support cuDNN 9). Ensure your version of the Python package is for cuDNN 8.
+
+#### Download the libraries from Purfview's repository (Windows & Linux)
+
+Purfview's [whisper-standalone-win](https://github.com/Purfview/whisper-standalone-win) provides the required NVIDIA libraries for Windows & Linux in a [single archive](https://github.com/Purfview/whisper-standalone-win/releases/tag/libs). Decompress the archive and place the libraries in a directory included in the `PATH`.
+
+</details>
+
 ## Installation

-```bash
-pip install -e .[conversion]
-```
-
-The model conversion requires the modules `transformers` and `torch` which are installed by the `[conversion]` requirement. Once a model is converted, these modules are no longer needed and the installation could be simplified to:
+The module can be installed from [PyPI](https://pypi.org/project/faster-whisper/):

 ```bash
-pip install -e .
+pip install faster-whisper
 ```

-It is also possible to install the module without cloning the Git repository:
+<details>
+<summary>Other installation methods (click to expand)</summary>
+
+### Install the master branch

 ```bash
-# Install the master branch:
-pip install "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/refs/heads/master.tar.gz"
-
-# Install a specific commit:
-pip install "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz"
+pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/refs/heads/master.tar.gz"
 ```

-### GPU support
+### Install a specific commit

-GPU execution requires the NVIDIA libraries cuBLAS 11.x and cuDNN 8.x to be installed on the system. Please refer to the [CTranslate2 documentation](https://opennmt.net/CTranslate2/installation.html).
+```bash
+pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz"
+```
+
+</details>

 ## Usage

-### Model conversion
-
-A Whisper model should be first converted into the CTranslate2 format. We provide a script to download and convert models from the [Hugging Face model repository](https://huggingface.co/models?sort=downloads&search=whisper).
-
-For example the command below converts the "large-v2" Whisper model and saves the weights in FP16:
-
-```bash
-ct2-transformers-converter --model openai/whisper-large-v2 --output_dir whisper-large-v2-ct2 \
-    --copy_files tokenizer.json --quantization float16
-```
-
-If the option `--copy_files tokenizer.json` is not used, the tokenizer configuration is automatically downloaded when the model is loaded later.
-
-Models can also be converted from the code. See the [conversion API](https://opennmt.net/CTranslate2/python/ctranslate2.converters.TransformersConverter.html).
-
-### Transcription
+### Faster-whisper

 ```python
 from faster_whisper import WhisperModel

-model_path = "whisper-large-v2-ct2/"
+model_size = "large-v3"

 # Run on GPU with FP16
-model = WhisperModel(model_path, device="cuda", compute_type="float16")
+model = WhisperModel(model_size, device="cuda", compute_type="float16")

 # or run on GPU with INT8
-# model = WhisperModel(model_path, device="cuda", compute_type="int8_float16")
+# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
 # or run on CPU with INT8
-# model = WhisperModel(model_path, device="cpu", compute_type="int8")
+# model = WhisperModel(model_size, device="cpu", compute_type="int8")

 segments, info = model.transcribe("audio.mp3", beam_size=5)

@@ -100,7 +160,33 @@ for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
 ```

-#### Word-level timestamps
+**Warning:** `segments` is a *generator* so the transcription only starts when you iterate over it. The transcription can be run to completion by gathering the segments in a list or a `for` loop:
+
+```python
+segments, _ = model.transcribe("audio.mp3")
+segments = list(segments)  # The transcription will actually run here.
+```
+### Faster Distil-Whisper
+
+The Distil-Whisper checkpoints are compatible with the Faster-Whisper package. In particular, the latest [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)
+checkpoint is intrinsically designed to work with the Faster-Whisper transcription algorithm. The following code snippet 
+demonstrates how to run inference with distil-large-v3 on a specified audio file:
+
+```python
+from faster_whisper import WhisperModel
+
+model_size = "distil-large-v3"
+
+model = WhisperModel(model_size, device="cuda", compute_type="float16")
+segments, info = model.transcribe("audio.mp3", beam_size=5, language="en", condition_on_previous_text=False)
+
+for segment in segments:
+    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
+```
+
+For more information about the distil-large-v3 model, refer to the original [model card](https://huggingface.co/distil-whisper/distil-large-v3).
+
+### Word-level timestamps

 ```python
 segments, _ = model.transcribe("audio.mp3", word_timestamps=True)
@@ -110,7 +196,87 @@ for segment in segments:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))
 ```

-See more model and transcription options in the [`WhisperModel`](https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/transcribe.py) class implementation.
+### VAD filter
+
+The library integrates the [Silero VAD](https://github.com/snakers4/silero-vad) model to filter out parts of the audio without speech:
+
+```python
+segments, _ = model.transcribe("audio.mp3", vad_filter=True)
+```
+
+The default behavior is conservative and only removes silence longer than 2 seconds. See the available VAD parameters and default values in the [source code](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/vad.py). They can be customized with the dictionary argument `vad_parameters`:
+
+```python
+segments, _ = model.transcribe(
+    "audio.mp3",
+    vad_filter=True,
+    vad_parameters=dict(min_silence_duration_ms=500),
+)
+```
+
+### Logging
+
+The library logging level can be configured like this:
+
+```python
+import logging
+
+logging.basicConfig()
+logging.getLogger("faster_whisper").setLevel(logging.DEBUG)
+```
+
+### Going further
+
+See more model and transcription options in the [`WhisperModel`](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py) class implementation.
+
+## Community integrations
+
+Here is a non-exhaustive list of open-source projects using faster-whisper. Feel free to add your project to the list!
+
+
+* [WhisperX](https://github.com/m-bain/whisperX) is an award-winning Python library that offers speaker diarization and accurate word-level timestamps using wav2vec2 alignment.
+* [whisper-ctranslate2](https://github.com/Softcatala/whisper-ctranslate2) is a command line client based on faster-whisper and compatible with the original client from openai/whisper.
+* [whisper-diarize](https://github.com/MahmoudAshraf97/whisper-diarization) is a speaker diarization tool that is based on faster-whisper and NVIDIA NeMo.
+* [whisper-standalone-win](https://github.com/Purfview/whisper-standalone-win) provides standalone CLI executables of faster-whisper for Windows, Linux & macOS.
+* [asr-sd-pipeline](https://github.com/hedrergudene/asr-sd-pipeline) provides a scalable, modular, end to end multi-speaker speech to text solution implemented using AzureML pipelines.
+* [Open-Lyrics](https://github.com/zh-plus/Open-Lyrics) is a Python library that transcribes voice files using faster-whisper, and translates/polishes the resulting text into `.lrc` files in the desired language using OpenAI-GPT.
+* [wscribe](https://github.com/geekodour/wscribe) is a flexible transcript generation tool supporting faster-whisper. It can export word-level transcripts, which can then be edited with [wscribe-editor](https://github.com/geekodour/wscribe-editor).
+* [aTrain](https://github.com/BANDAS-Center/aTrain) is a graphical user interface implementation of faster-whisper developed at the BANDAS-Center at the University of Graz for transcription and diarization in Windows ([Windows Store App](https://apps.microsoft.com/detail/atrain/9N15Q44SZNS2)) and Linux.
+* [Whisper-Streaming](https://github.com/ufal/whisper_streaming) implements real-time mode for offline Whisper-like speech-to-text models with faster-whisper as the most recommended back-end. It implements a streaming policy with self-adaptive latency based on the actual source complexity, and demonstrates the state of the art.
+* [WhisperLive](https://github.com/collabora/WhisperLive) is a nearly-live implementation of OpenAI's Whisper which uses faster-whisper as the backend to transcribe audio in real-time.
+* [Faster-Whisper-Transcriber](https://github.com/BBC-Esq/ctranslate2-faster-whisper-transcriber) is a simple but reliable voice transcriber that provides a user-friendly interface.
+
+## Model conversion
+
+When loading a model from its size such as `WhisperModel("large-v3")`, the corresponding CTranslate2 model is automatically downloaded from the [Hugging Face Hub](https://huggingface.co/Systran).
+
+We also provide a script to convert any Whisper model compatible with the Transformers library. This can be one of the original OpenAI models or a user fine-tuned model.
+
+For example the command below converts the [original "large-v3" Whisper model](https://huggingface.co/openai/whisper-large-v3) and saves the weights in FP16:
+
+```bash
+pip install "transformers[torch]>=4.23"
+
+ct2-transformers-converter --model openai/whisper-large-v3 --output_dir whisper-large-v3-ct2 \
+    --copy_files tokenizer.json preprocessor_config.json --quantization float16
+```
+
+* The option `--model` accepts a model name on the Hub or a path to a model directory.
+* If the option `--copy_files tokenizer.json` is not used, the tokenizer configuration is automatically downloaded when the model is loaded later.
+
+Models can also be converted from the code. See the [conversion API](https://opennmt.net/CTranslate2/python/ctranslate2.converters.TransformersConverter.html).
+
+### Load a converted model
+
+1. Directly load the model from a local directory:
+```python
+model = faster_whisper.WhisperModel("whisper-large-v3-ct2")
+```
+
+2. [Upload your model to the Hugging Face Hub](https://huggingface.co/docs/transformers/model_sharing#upload-with-the-web-interface) and load it from its name:
+```python
+model = faster_whisper.WhisperModel("username/whisper-large-v3-ct2")
+```

 ## Comparing performance against other implementations

--- a/benchmark/benchmark.m4a
+++ b/benchmark/benchmark.m4a
--- a/benchmark/memory_benchmark.py
+++ b/benchmark/memory_benchmark.py
@@ -0,0 +1,94 @@
+import argparse
+import time
+
+from typing import Callable
+
+import py3nvml.py3nvml as nvml
+
+from memory_profiler import memory_usage
+from utils import MyThread, get_logger, inference
+
+logger = get_logger("faster-whisper")
+parser = argparse.ArgumentParser(description="Memory benchmark")
+parser.add_argument(
+    "--gpu_memory", action="store_true", help="Measure GPU memory usage"
+)
+parser.add_argument("--device-index", type=int, default=0, help="GPU device index")
+parser.add_argument(
+    "--interval",
+    type=float,
+    default=0.5,
+    help="Interval at which measurements are collected",
+)
+args = parser.parse_args()
+device_idx = args.device_index
+interval = args.interval
+
+
+def measure_memory(func: Callable[[], None]):
+    if args.gpu_memory:
+        logger.info(
+            "Measuring maximum GPU memory usage on GPU device."
+            " Make sure to not have additional processes running on the same GPU."
+        )
+        # init nvml
+        nvml.nvmlInit()
+        handle = nvml.nvmlDeviceGetHandleByIndex(device_idx)
+        gpu_name = nvml.nvmlDeviceGetName(handle)
+        gpu_memory_limit = nvml.nvmlDeviceGetMemoryInfo(handle).total >> 20
+        gpu_power_limit = nvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000.0
+        info = {"gpu_memory_usage": [], "gpu_power_usage": []}
+
+        def _get_gpu_info():
+            while True:
+                info["gpu_memory_usage"].append(
+                    nvml.nvmlDeviceGetMemoryInfo(handle).used >> 20
+                )
+                info["gpu_power_usage"].append(
+                    nvml.nvmlDeviceGetPowerUsage(handle) / 1000
+                )
+                time.sleep(interval)
+
+                if stop:
+                    break
+
+            return info
+
+        stop = False
+        thread = MyThread(_get_gpu_info, params=())
+        thread.start()
+        func()
+        stop = True
+        thread.join()
+        result = thread.get_result()
+
+        # shutdown nvml
+        nvml.nvmlShutdown()
+        max_memory_usage = max(result["gpu_memory_usage"])
+        max_power_usage = max(result["gpu_power_usage"])
+        print("GPU name: %s" % gpu_name)
+        print("GPU device index: %s" % device_idx)
+        print(
+            "Maximum GPU memory usage: %dMiB / %dMiB (%.2f%%)"
+            % (
+                max_memory_usage,
+                gpu_memory_limit,
+                (max_memory_usage / gpu_memory_limit) * 100,
+            )
+        )
+        print(
+            "Maximum GPU power usage: %dW / %dW (%.2f%%)"
+            % (
+                max_power_usage,
+                gpu_power_limit,
+                (max_power_usage / gpu_power_limit) * 100,
+            )
+        )
+    else:
+        logger.info("Measuring maximum increase of memory usage.")
+        max_usage = memory_usage(func, max_usage=True, interval=interval)
+        print("Maximum increase of RAM memory usage: %d MiB" % max_usage)
+
+
+if __name__ == "__main__":
+    measure_memory(inference)
--- a/benchmark/normalizer.json
+++ b/benchmark/normalizer.json
--- a/benchmark/requirements.benchmark.txt
+++ b/benchmark/requirements.benchmark.txt
@@ -0,0 +1,6 @@
+transformers
+jiwer
+evaluate
+datasets
+memory_profiler
+py3nvml
--- a/benchmark/speed_benchmark.py
+++ b/benchmark/speed_benchmark.py
@@ -0,0 +1,31 @@
+import argparse
+import timeit
+
+from typing import Callable
+
+from utils import inference
+
+parser = argparse.ArgumentParser(description="Speed benchmark")
+parser.add_argument(
+    "--repeat",
+    type=int,
+    default=3,
+    help="Times an experiment will be run.",
+)
+args = parser.parse_args()
+
+
+def measure_speed(func: Callable[[], None]):
+    # as written in https://docs.python.org/3/library/timeit.html#timeit.Timer.repeat,
+    # min should be taken rather than the average
+    runtimes = timeit.repeat(
+        func,
+        repeat=args.repeat,
+        number=10,
+    )
+    print(runtimes)
+    print("Min execution time: %.3fs" % (min(runtimes) / 10.0))
+
+
+if __name__ == "__main__":
+    measure_speed(inference)
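The `measure_speed` helper above takes the minimum over repetitions and divides by `number` to get a per-call time. A minimal standalone sketch of that pattern follows. `workload` and `min_time_per_call` are hypothetical names used only for illustration; `workload` stands in for `inference`, which needs a GPU and a downloaded model.

```python
import timeit


def workload() -> int:
    # Hypothetical stand-in for `inference`; any zero-argument callable works.
    return sum(i * i for i in range(1000))


def min_time_per_call(func, repeat: int = 3, number: int = 10) -> float:
    # timeit.repeat returns one total time per repetition, each covering
    # `number` calls, so divide by `number` to get a per-call figure.
    runtimes = timeit.repeat(func, repeat=repeat, number=number)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(runtimes) / number


if __name__ == "__main__":
    print("Min execution time: %.6fs" % min_time_per_call(workload))
```

Passing `number` as a parameter keeps the divisor in sync with the call count, whereas the script above hard-codes `10` in two places.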
--- a/benchmark/utils.py
+++ b/benchmark/utils.py
@@ -0,0 +1,39 @@
+import logging
+
+from threading import Thread
+from typing import Optional
+
+from faster_whisper import WhisperModel
+
+model_path = "large-v3"
+model = WhisperModel(model_path, device="cuda")
+
+
+def inference():
+    segments, info = model.transcribe("benchmark.m4a", language="fr")
+    for segment in segments:
+        print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
+
+
+def get_logger(name: Optional[str] = None) -> logging.Logger:
+    formatter = logging.Formatter("%(levelname)s: %(message)s")
+    logger = logging.getLogger(name)
+    logger.setLevel(logging.DEBUG)
+    handler = logging.StreamHandler()
+    handler.setFormatter(formatter)
+    logger.addHandler(handler)
+    return logger
+
+
+class MyThread(Thread):
+    def __init__(self, func, params):
+        super(MyThread, self).__init__()
+        self.func = func
+        self.params = params
+        self.result = None
+
+    def run(self):
+        self.result = self.func(*self.params)
+
+    def get_result(self):
+        return self.result
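`MyThread` stores the return value of its target so that `memory_benchmark.py` can collect samples in the background while the workload runs. The sketch below shows the same idea in isolation. It is not the benchmark's code: `ResultThread`, `sample_until_stopped`, and `run_with_sampler` are hypothetical names, the sampler records a counter instead of real GPU readings, and a `threading.Event` replaces the closure flag used in the benchmark.

```python
import threading
import time


class ResultThread(threading.Thread):
    """Thread that keeps the return value of its target (same idea as MyThread)."""

    def __init__(self, func, params=()):
        super().__init__()
        self.func = func
        self.params = params
        self.result = None

    def run(self):
        self.result = self.func(*self.params)

    def get_result(self):
        return self.result


def sample_until_stopped(stop_event, interval=0.01):
    # Hypothetical sampler: appends a counter value on every tick.
    samples = []
    while True:
        samples.append(len(samples))
        # wait() returns True as soon as the event is set, ending the loop.
        if stop_event.wait(interval):
            break
    return samples


def run_with_sampler(func):
    stop_event = threading.Event()
    thread = ResultThread(sample_until_stopped, params=(stop_event,))
    thread.start()
    func()  # the workload being measured
    stop_event.set()
    thread.join()
    return thread.get_result()


if __name__ == "__main__":
    print(run_with_sampler(lambda: time.sleep(0.05)))
```

Using an event lets the sampler stop immediately instead of finishing a full `time.sleep(interval)` first.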
--- a/benchmark/wer_benchmark.py
+++ b/benchmark/wer_benchmark.py
@@ -0,0 +1,61 @@
+import argparse
+import json
+
+from datasets import load_dataset
+from evaluate import load
+from tqdm import tqdm
+from transformers.models.whisper.english_normalizer import EnglishTextNormalizer
+
+from faster_whisper import WhisperModel
+
+parser = argparse.ArgumentParser(description="WER benchmark")
+parser.add_argument(
+    "--audio_numb",
+    type=int,
+    default=None,
+    help="Specify the number of validation audio files in the dataset."
+    " Set to None to retrieve all audio files.",
+)
+args = parser.parse_args()
+
+model_path = "large-v3"
+model = WhisperModel(model_path, device="cuda")
+
+# load the dataset with streaming mode
+dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
+
+# define the evaluation metric
+wer_metric = load("wer")
+normalizer = EnglishTextNormalizer(json.load(open("normalizer.json")))
+
+
+def inference(batch):
+    batch["transcription"] = []
+    for sample in batch["audio"]:
+        segments, info = model.transcribe(sample["array"], language="en")
+        batch["transcription"].append("".join([segment.text for segment in segments]))
+    batch["reference"] = batch["text"]
+    return batch
+
+
+dataset = dataset.map(function=inference, batched=True, batch_size=16)
+
+all_transcriptions = []
+all_references = []
+
+# iterate over the dataset and run inference
+for i, result in tqdm(enumerate(dataset), desc="Evaluating..."):
+    all_transcriptions.append(result["transcription"])
+    all_references.append(result["reference"])
+    if args.audio_numb and i == (args.audio_numb - 1):
+        break
+
+# normalize predictions and references
+all_transcriptions = [normalizer(transcription) for transcription in all_transcriptions]
+all_references = [normalizer(reference) for reference in all_references]
+
+# compute the WER metric
+wer = 100 * wer_metric.compute(
+    predictions=all_transcriptions, references=all_references
+)
+print("WER: %.3f" % wer)
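The script above delegates the metric to `evaluate.load("wer")`. For readers unfamiliar with the metric, here is a pure-Python sketch of corpus-level word error rate, assuming the usual definition (word-level edit distance summed over all pairs, divided by the total number of reference words). `word_edit_distance` and `corpus_wer` are hypothetical helpers written for illustration, not part of `evaluate` or faster-whisper.

```python
from typing import List


def word_edit_distance(reference: List[str], hypothesis: List[str]) -> int:
    # Classic dynamic-programming Levenshtein distance over words.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def corpus_wer(references: List[str], predictions: List[str]) -> float:
    # Sum errors and reference lengths over the whole corpus before dividing.
    errors = 0
    words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        errors += word_edit_distance(ref_words, prediction.split())
        words += len(ref_words)
    return errors / words


if __name__ == "__main__":
    wer = 100 * corpus_wer(["the cat sat on the mat"], ["the cat sat on mat"])
    print("WER: %.3f" % wer)
```

As in the script, the texts should be normalized before scoring, since casing and punctuation differences otherwise count as word errors.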
--- a/faster_whisper/__init__.py
+++ b/faster_whisper/__init__.py
@@ -1,9 +1,13 @@
 from faster_whisper.audio import decode_audio
 from faster_whisper.transcribe import WhisperModel
-from faster_whisper.utils import format_timestamp
+from faster_whisper.utils import available_models, download_model, format_timestamp
+from faster_whisper.version import __version__

 __all__ = [
+    "available_models",
    "decode_audio",
    "WhisperModel",
+    "download_model",
    "format_timestamp",
+    "__version__",
 ]
--- a/faster_whisper/assets/__init__.py
+++ b/faster_whisper/assets/__init__.py
--- a/faster_whisper/assets/silero_vad.onnx
+++ b/faster_whisper/assets/silero_vad.onnx
--- a/faster_whisper/audio.py
+++ b/faster_whisper/audio.py
@@ -6,6 +6,7 @@ system dependencies. FFmpeg does not need to be installed on the system.
 However, the API is quite low-level so we need to manipulate audio frames directly.
 """

+import gc
 import io
 import itertools

@@ -15,27 +16,36 @@ import av
 import numpy as np


-def decode_audio(input_file: Union[str, BinaryIO], sampling_rate: int = 16000):
+def decode_audio(
+    input_file: Union[str, BinaryIO],
+    sampling_rate: int = 16000,
+    split_stereo: bool = False,
+):
    """Decodes the audio.

    Args:
      input_file: Path to the input file or a file-like object.
      sampling_rate: Resample the audio to this sample rate.
+      split_stereo: Return separate left and right channels.

    Returns:
      A float32 Numpy array.
+
+      If `split_stereo` is enabled, the function returns a 2-tuple with the
+      separated left and right channels.
    """
    resampler = av.audio.resampler.AudioResampler(
        format="s16",
-        layout="mono",
+        layout="mono" if not split_stereo else "stereo",
        rate=sampling_rate,
    )

    raw_buffer = io.BytesIO()
    dtype = None

-    with av.open(input_file, metadata_errors="ignore") as container:
+    with av.open(input_file, mode="r", metadata_errors="ignore") as container:
        frames = container.decode(audio=0)
+        frames = _ignore_invalid_frames(frames)
        frames = _group_frames(frames, 500000)
        frames = _resample_frames(frames, resampler)

@@ -44,10 +54,34 @@ def decode_audio(input_file: Union[str, BinaryIO], sampling_rate: int = 16000):
            dtype = array.dtype
            raw_buffer.write(array)

+    # It appears that some objects related to the resampler are not freed
+    # unless the garbage collector is manually run.
+    del resampler
+    gc.collect()
+
    audio = np.frombuffer(raw_buffer.getbuffer(), dtype=dtype)

    # Convert s16 back to f32.
-    return audio.astype(np.float32) / 32768.0
+    audio = audio.astype(np.float32) / 32768.0
+
+    if split_stereo:
+        left_channel = audio[0::2]
+        right_channel = audio[1::2]
+        return left_channel, right_channel
+
+    return audio
+
+
+def _ignore_invalid_frames(frames):
+    iterator = iter(frames)
+
+    while True:
+        try:
+            yield next(iterator)
+        except StopIteration:
+            break
+        except av.error.InvalidDataError:
+            continue


 def _group_frames(frames, num_samples=None):
@@ -68,3 +102,18 @@ def _resample_frames(frames, resampler):
    # Add None to flush the resampler.
    for frame in itertools.chain(frames, [None]):
        yield from resampler.resample(frame)
+
+
+def pad_or_trim(array, length: int, *, axis: int = -1):
+    """
+    Pad or trim the audio array to N_SAMPLES, as expected by the encoder.
+    """
+    if array.shape[axis] > length:
+        array = array.take(indices=range(length), axis=axis)
+
+    if array.shape[axis] < length:
+        pad_widths = [(0, 0)] * array.ndim
+        pad_widths[axis] = (0, length - array.shape[axis])
+        array = np.pad(array, pad_widths)
+
+    return array
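To illustrate what the new `pad_or_trim` helper does, the sketch below copies its logic into a standalone snippet and applies it to a small array. The `(2, 5)` shape is an arbitrary example chosen for readability, not a shape taken from the library.

```python
import numpy as np


def pad_or_trim(array, length: int, *, axis: int = -1):
    # Same logic as the function added in the diff above.
    if array.shape[axis] > length:
        array = array.take(indices=range(length), axis=axis)

    if array.shape[axis] < length:
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - array.shape[axis])
        array = np.pad(array, pad_widths)

    return array


if __name__ == "__main__":
    features = np.ones((2, 5), dtype=np.float32)

    padded = pad_or_trim(features, 8)  # zero-padded on the right of the last axis
    trimmed = pad_or_trim(features, 3)  # keeps the first 3 entries of the last axis
    rows = pad_or_trim(features, 4, axis=0)  # pads along the first axis instead

    print(padded.shape, trimmed.shape, rows.shape)
```

Padding is appended only at the end of the chosen axis, and `np.pad` fills with zeros by default.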
--- a/faster_whisper/feature_extractor.py
+++ b/faster_whisper/feature_extractor.py
@@ -142,11 +142,15 @@ class FeatureExtractor:
            data[f] = np.fft.fft(fft_signal, axis=0)[:num_fft_bins]
        return data.T

-    def __call__(self, waveform, padding=True):
+    def __call__(self, waveform, padding=True, chunk_length=None):
        """
        Compute the log-Mel spectrogram of the provided audio, gives similar results
        whisper's original torch implementation with 1e-5 tolerance.
        """
+        if chunk_length is not None:
+            self.n_samples = chunk_length * self.sampling_rate
+            self.nb_max_frames = self.n_samples // self.hop_length
+
        if padding:
            waveform = np.pad(waveform, [(0, self.n_samples)])

--- a/faster_whisper/tokenizer.py
+++ b/faster_whisper/tokenizer.py
@@ -19,24 +19,42 @@ class Tokenizer:
        self.tokenizer = tokenizer

        if multilingual:
+            if task not in _TASKS:
+                raise ValueError(
+                    "'%s' is not a valid task (accepted tasks: %s)"
+                    % (task, ", ".join(_TASKS))
+                )
+
+            if language not in _LANGUAGE_CODES:
+                raise ValueError(
+                    "'%s' is not a valid language code (accepted language codes: %s)"
+                    % (language, ", ".join(_LANGUAGE_CODES))
+                )
+
            self.task = self.tokenizer.token_to_id("<|%s|>" % task)
-            if self.task is None:
-                raise ValueError("%s is not a valid task" % task)
-
-            self.language_code = language
            self.language = self.tokenizer.token_to_id("<|%s|>" % language)
-            if self.language is None:
-                raise ValueError("%s is not a valid language code" % language)
-
+            self.language_code = language
        else:
            self.task = None
            self.language = None
            self.language_code = "en"

+    @cached_property
+    def transcribe(self) -> int:
+        return self.tokenizer.token_to_id("<|transcribe|>")
+
+    @cached_property
+    def translate(self) -> int:
+        return self.tokenizer.token_to_id("<|translate|>")
+
    @cached_property
    def sot(self) -> int:
        return self.tokenizer.token_to_id("<|startoftranscript|>")

+    @cached_property
+    def sot_lm(self) -> int:
+        return self.tokenizer.token_to_id("<|startoflm|>")
+
    @cached_property
    def sot_prev(self) -> int:
        return self.tokenizer.token_to_id("<|startofprev|>")
@@ -90,7 +108,7 @@ class Tokenizer:
    def split_to_word_tokens(
        self, tokens: List[int]
    ) -> Tuple[List[str], List[List[int]]]:
-        if self.language_code in {"zh", "ja", "th", "lo", "my"}:
+        if self.language_code in {"zh", "ja", "th", "lo", "my", "yue"}:
            # These languages don't typically use spaces, so it is difficult to split words
            # without morpheme analysis. Here, we instead split words at any
            # position where the tokens are decoded as valid unicode points
@@ -113,10 +131,15 @@ class Tokenizer:
            current_tokens.append(token)
            decoded = self.decode_with_timestamps(current_tokens)

-            if (
-                replacement_char not in decoded
-                or decoded_full[unicode_offset + decoded.index(replacement_char)]
-                == replacement_char
+            try:
+                replacement_char_index = decoded.index(replacement_char)
+                replacement_char_index += unicode_offset
+            except ValueError:
+                replacement_char_index = None
+
+            if replacement_char_index is None or (
+                replacement_char_index < len(decoded_full)
+                and decoded_full[replacement_char_index] == replacement_char
            ):
                words.append(decoded)
                word_tokens.append(current_tokens)
@@ -144,3 +167,112 @@ class Tokenizer:
                word_tokens[-1].extend(subword_tokens)

        return words, word_tokens
+
+
+_TASKS = (
+    "transcribe",
+    "translate",
+)
+
+_LANGUAGE_CODES = (
+    "af",
+    "am",
+    "ar",
+    "as",
+    "az",
+    "ba",
+    "be",
+    "bg",
+    "bn",
+    "bo",
+    "br",
+    "bs",
+    "ca",
+    "cs",
+    "cy",
+    "da",
+    "de",
+    "el",
+    "en",
+    "es",
+    "et",
+    "eu",
+    "fa",
+    "fi",
+    "fo",
+    "fr",
+    "gl",
+    "gu",
+    "ha",
+    "haw",
+    "he",
+    "hi",
+    "hr",
+    "ht",
+    "hu",
+    "hy",
+    "id",
+    "is",
+    "it",
+    "ja",
+    "jw",
+    "ka",
+    "kk",
+    "km",
+    "kn",
+    "ko",
+    "la",
+    "lb",
+    "ln",
+    "lo",
+    "lt",
+    "lv",
+    "mg",
+    "mi",
+    "mk",
+    "ml",
+    "mn",
+    "mr",
+    "ms",
+    "mt",
+    "my",
+    "ne",
+    "nl",
+    "nn",
+    "no",
+    "oc",
+    "pa",
+    "pl",
+    "ps",
+    "pt",
+    "ro",
+    "ru",
+    "sa",
+    "sd",
+    "si",
+    "sk",
+    "sl",
+    "sn",
+    "so",
+    "sq",
+    "sr",
+    "su",
+    "sv",
+    "sw",
+    "ta",
+    "te",
+    "tg",
+    "th",
+    "tk",
+    "tl",
+    "tr",
+    "tt",
+    "uk",
+    "ur",
+    "uz",
+    "vi",
+    "yi",
+    "yo",
+    "zh",
+    "yue",
+)
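The change to `split_to_word_tokens` above guards the lookup into `decoded_full` so that an out-of-range index no longer raises. A minimal sketch of that check in isolation, assuming the replacement character is U+FFFD; the function name `is_word_boundary` is hypothetical:

```python
REPLACEMENT_CHAR = "\ufffd"


def is_word_boundary(decoded: str, decoded_full: str, unicode_offset: int) -> bool:
    """Return True when the tokens decoded so far can be emitted as a word."""
    try:
        index = decoded.index(REPLACEMENT_CHAR) + unicode_offset
    except ValueError:
        # No replacement character: the tokens decode to valid text.
        return True
    # Accept a replacement character only if the full text really contains it
    # at that position; the bounds check avoids an IndexError.
    return index < len(decoded_full) and decoded_full[index] == REPLACEMENT_CHAR


print(is_word_boundary("abc", "abc", 0))  # True
print(is_word_boundary("a\ufffd", "a\ufffd", 0))  # True
print(is_word_boundary("a\ufffd", "ab", 0))  # False
print(is_word_boundary("\ufffd", "ab", 5))  # False
```

The last call is the case the removed condition did not handle: the computed index lies past the end of `decoded_full`.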
--- a/faster_whisper/transcribe.py
+++ b/faster_whisper/transcribe.py
--- a/faster_whisper/utils.py
+++ b/faster_whisper/utils.py
@@ -1,3 +1,125 @@
+import logging
+import os
+import re
+
+from typing import List, Optional
+
+import huggingface_hub
+import requests
+
+from tqdm.auto import tqdm
+
+_MODELS = {
+    "tiny.en": "Systran/faster-whisper-tiny.en",
+    "tiny": "Systran/faster-whisper-tiny",
+    "base.en": "Systran/faster-whisper-base.en",
+    "base": "Systran/faster-whisper-base",
+    "small.en": "Systran/faster-whisper-small.en",
+    "small": "Systran/faster-whisper-small",
+    "medium.en": "Systran/faster-whisper-medium.en",
+    "medium": "Systran/faster-whisper-medium",
+    "large-v1": "Systran/faster-whisper-large-v1",
+    "large-v2": "Systran/faster-whisper-large-v2",
+    "large-v3": "Systran/faster-whisper-large-v3",
+    "large": "Systran/faster-whisper-large-v3",
+    "distil-large-v2": "Systran/faster-distil-whisper-large-v2",
+    "distil-medium.en": "Systran/faster-distil-whisper-medium.en",
+    "distil-small.en": "Systran/faster-distil-whisper-small.en",
+    "distil-large-v3": "Systran/faster-distil-whisper-large-v3",
+}
+
+
+def available_models() -> List[str]:
+    """Returns the names of available models."""
+    return list(_MODELS.keys())
+
+
+def get_assets_path():
+    """Returns the path to the assets directory."""
+    return os.path.join(os.path.dirname(os.path.abspath(__file__)), "assets")
+
+
+def get_logger():
+    """Returns the module logger."""
+    return logging.getLogger("faster_whisper")
+
+
+def download_model(
+    size_or_id: str,
+    output_dir: Optional[str] = None,
+    local_files_only: bool = False,
+    cache_dir: Optional[str] = None,
+):
+    """Downloads a CTranslate2 Whisper model from the Hugging Face Hub.
+
+    Args:
+      size_or_id: Size of the model to download from https://huggingface.co/Systran
+        (tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v1, large-v2,
+        large-v3, large), or a CTranslate2-converted model ID from the Hugging Face Hub
+        (e.g. Systran/faster-whisper-large-v3).
+      output_dir: Directory where the model should be saved. If not set, the model is saved in
+        the cache directory.
+      local_files_only: If True, avoid downloading the file and return the path to the local
+        cached file if it exists.
+      cache_dir: Path to the folder where cached files are stored.
+
+    Returns:
+      The path to the downloaded model.
+
+    Raises:
+      ValueError: if the model size is invalid.
+    """
+    if re.match(r".*/.*", size_or_id):
+        repo_id = size_or_id
+    else:
+        repo_id = _MODELS.get(size_or_id)
+        if repo_id is None:
+            raise ValueError(
+                "Invalid model size '%s', expected one of: %s"
+                % (size_or_id, ", ".join(_MODELS.keys()))
+            )
+
+    allow_patterns = [
+        "config.json",
+        "preprocessor_config.json",
+        "model.bin",
+        "tokenizer.json",
+        "vocabulary.*",
+    ]
+
+    kwargs = {
+        "local_files_only": local_files_only,
+        "allow_patterns": allow_patterns,
+        "tqdm_class": disabled_tqdm,
+    }
+
+    if output_dir is not None:
+        kwargs["local_dir"] = output_dir
+        kwargs["local_dir_use_symlinks"] = False
+
+    if cache_dir is not None:
+        kwargs["cache_dir"] = cache_dir
+
+    try:
+        return huggingface_hub.snapshot_download(repo_id, **kwargs)
+    except (
+        huggingface_hub.utils.HfHubHTTPError,
+        requests.exceptions.ConnectionError,
+    ) as exception:
+        logger = get_logger()
+        logger.warning(
+            "An error occurred while synchronizing the model %s from the Hugging Face Hub:\n%s",
+            repo_id,
+            exception,
+        )
+        logger.warning(
+            "Trying to load the model directly from the local cache, if it exists."
+        )
+
+        kwargs["local_files_only"] = True
+        return huggingface_hub.snapshot_download(repo_id, **kwargs)
+
+
 def format_timestamp(
    seconds: float,
    always_include_hours: bool = False,
@@ -19,3 +141,16 @@ def format_timestamp(
    return (
        f"{hours_marker}{minutes:02d}:{seconds:02d}{decimal_marker}{milliseconds:03d}"
    )
+
+
+class disabled_tqdm(tqdm):
+    def __init__(self, *args, **kwargs):
+        kwargs["disable"] = True
+        super().__init__(*args, **kwargs)
+
+
+def get_end(segments: List[dict]) -> Optional[float]:
+    return next(
+        (w["end"] for s in reversed(segments) for w in reversed(s["words"])),
+        segments[-1]["end"] if segments else None,
+    )
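The fallback order of `get_end` is easy to misread, so here is a small worked example. The function body is copied from the diff above; the segment dictionaries are made-up sample data.

```python
from typing import List, Optional


def get_end(segments: List[dict]) -> Optional[float]:
    # Last word end found when scanning segments and words backwards;
    # otherwise the last segment's own end; otherwise None.
    return next(
        (w["end"] for s in reversed(segments) for w in reversed(s["words"])),
        segments[-1]["end"] if segments else None,
    )


with_words = [
    {"end": 2.0, "words": [{"end": 1.9}]},
    {"end": 5.0, "words": []},
]
without_words = [{"end": 2.0, "words": []}, {"end": 5.0, "words": []}]

print(get_end(with_words))  # 1.9
print(get_end(without_words))  # 5.0
print(get_end([]))  # None
```

In the first case the result comes from an earlier segment, because the last segment has no words: word timestamps take priority over segment ends.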
--- a/faster_whisper/vad.py
+++ b/faster_whisper/vad.py
@@ -0,0 +1,291 @@
+import bisect
+import functools
+import os
+import warnings
+
+from typing import List, NamedTuple, Optional
+
+import numpy as np
+
+from faster_whisper.utils import get_assets_path
+
+
+# The code below is adapted from https://github.com/snakers4/silero-vad.
+class VadOptions(NamedTuple):
+    """VAD options.
+
+    Attributes:
+      threshold: Speech threshold. Silero VAD outputs speech probabilities for each audio chunk,
+        probabilities ABOVE this value are considered as SPEECH. It is better to tune this
+        parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.
+      min_speech_duration_ms: Final speech chunks shorter than min_speech_duration_ms are thrown out.
+      max_speech_duration_s: Maximum duration of speech chunks in seconds. Chunks longer
+        than max_speech_duration_s will be split at the timestamp of the last silence that
+        lasts more than 100ms (if any), to prevent aggressive cutting. Otherwise, they will be
+        split aggressively just before max_speech_duration_s.
+      min_silence_duration_ms: At the end of each speech chunk, wait for min_silence_duration_ms
+        before separating it.
+      window_size_samples: Audio chunks of window_size_samples size are fed to the silero VAD model.
+        WARNING! Silero VAD models were trained using 512, 1024, 1536 samples for 16000 sample rate.
+        Values other than these may affect model performance!!
+      speech_pad_ms: Final speech chunks are padded by speech_pad_ms on each side.
+    """
+
+    threshold: float = 0.5
+    min_speech_duration_ms: int = 250
+    max_speech_duration_s: float = float("inf")
+    min_silence_duration_ms: int = 2000
+    window_size_samples: int = 1024
+    speech_pad_ms: int = 400
+
+
+def get_speech_timestamps(
+    audio: np.ndarray,
+    vad_options: Optional[VadOptions] = None,
+    **kwargs,
+) -> List[dict]:
+    """Splits long audio into speech chunks using Silero VAD.
+
+    Args:
+      audio: One dimensional float array.
+      vad_options: Options for VAD processing.
+      kwargs: VAD options passed as keyword arguments for backward compatibility.
+
+    Returns:
+      List of dicts containing begin and end samples of each speech chunk.
+    """
+    if vad_options is None:
+        vad_options = VadOptions(**kwargs)
+
+    threshold = vad_options.threshold
+    min_speech_duration_ms = vad_options.min_speech_duration_ms
+    max_speech_duration_s = vad_options.max_speech_duration_s
+    min_silence_duration_ms = vad_options.min_silence_duration_ms
+    window_size_samples = vad_options.window_size_samples
+    speech_pad_ms = vad_options.speech_pad_ms
+
+    if window_size_samples not in [512, 1024, 1536]:
+        warnings.warn(
+            "Unusual window_size_samples! Supported window_size_samples:\n"
+            " - [512, 1024, 1536] for 16000 sampling_rate"
+        )
+
+    sampling_rate = 16000
+    min_speech_samples = sampling_rate * min_speech_duration_ms / 1000
+    speech_pad_samples = sampling_rate * speech_pad_ms / 1000
+    max_speech_samples = (
+        sampling_rate * max_speech_duration_s
+        - window_size_samples
+        - 2 * speech_pad_samples
+    )
+    min_silence_samples = sampling_rate * min_silence_duration_ms / 1000
+    min_silence_samples_at_max_speech = sampling_rate * 98 / 1000
+
+    audio_length_samples = len(audio)
+
+    model = get_vad_model()
+    state = model.get_initial_state(batch_size=1)
+
+    speech_probs = []
+    for current_start_sample in range(0, audio_length_samples, window_size_samples):
+        chunk = audio[current_start_sample : current_start_sample + window_size_samples]
+        if len(chunk) < window_size_samples:
+            chunk = np.pad(chunk, (0, int(window_size_samples - len(chunk))))
+        speech_prob, state = model(chunk, state, sampling_rate)
+        speech_probs.append(speech_prob)
+
+    triggered = False
+    speeches = []
+    current_speech = {}
+    neg_threshold = threshold - 0.15
+
+    # to save potential segment end (and tolerate some silence)
+    temp_end = 0
+    # to save potential segment limits in case of maximum segment size reached
+    prev_end = next_start = 0
+
+    for i, speech_prob in enumerate(speech_probs):
+        if (speech_prob >= threshold) and temp_end:
+            temp_end = 0
+            if next_start < prev_end:
+                next_start = window_size_samples * i
+
+        if (speech_prob >= threshold) and not triggered:
+            triggered = True
+            current_speech["start"] = window_size_samples * i
+            continue
+
+        if (
+            triggered
+            and (window_size_samples * i) - current_speech["start"] > max_speech_samples
+        ):
+            if prev_end:
+                current_speech["end"] = prev_end
+                speeches.append(current_speech)
+                current_speech = {}
+                # previously reached silence (< neg_thres) and is still not speech (< thres)
+                if next_start < prev_end:
+                    triggered = False
+                else:
+                    current_speech["start"] = next_start
+                prev_end = next_start = temp_end = 0
+            else:
+                current_speech["end"] = window_size_samples * i
+                speeches.append(current_speech)
+                current_speech = {}
+                prev_end = next_start = temp_end = 0
+                triggered = False
+                continue
+
+        if (speech_prob < neg_threshold) and triggered:
+            if not temp_end:
+                temp_end = window_size_samples * i
+            # condition to avoid cutting in very short silence
+            if (window_size_samples * i) - temp_end > min_silence_samples_at_max_speech:
+                prev_end = temp_end
+            if (window_size_samples * i) - temp_end < min_silence_samples:
+                continue
+            else:
+                current_speech["end"] = temp_end
+                if (
+                    current_speech["end"] - current_speech["start"]
+                ) > min_speech_samples:
+                    speeches.append(current_speech)
+                current_speech = {}
+                prev_end = next_start = temp_end = 0
+                triggered = False
+                continue
+
+    if (
+        current_speech
+        and (audio_length_samples - current_speech["start"]) > min_speech_samples
+    ):
+        current_speech["end"] = audio_length_samples
+        speeches.append(current_speech)
+
+    for i, speech in enumerate(speeches):
+        if i == 0:
+            speech["start"] = int(max(0, speech["start"] - speech_pad_samples))
+        if i != len(speeches) - 1:
+            silence_duration = speeches[i + 1]["start"] - speech["end"]
+            if silence_duration < 2 * speech_pad_samples:
+                speech["end"] += int(silence_duration // 2)
+                speeches[i + 1]["start"] = int(
+                    max(0, speeches[i + 1]["start"] - silence_duration // 2)
+                )
+            else:
+                speech["end"] = int(
+                    min(audio_length_samples, speech["end"] + speech_pad_samples)
+                )
+                speeches[i + 1]["start"] = int(
+                    max(0, speeches[i + 1]["start"] - speech_pad_samples)
+                )
+        else:
+            speech["end"] = int(
+                min(audio_length_samples, speech["end"] + speech_pad_samples)
+            )
+
+    return speeches
+
+
+def collect_chunks(audio: np.ndarray, chunks: List[dict]) -> np.ndarray:
+    """Collects and concatenates audio chunks."""
+    if not chunks:
+        return np.array([], dtype=np.float32)
+
+    return np.concatenate([audio[chunk["start"] : chunk["end"]] for chunk in chunks])
+
+
+class SpeechTimestampsMap:
+    """Helper class to restore original speech timestamps."""
+
+    def __init__(self, chunks: List[dict], sampling_rate: int, time_precision: int = 2):
+        self.sampling_rate = sampling_rate
+        self.time_precision = time_precision
+        self.chunk_end_sample = []
+        self.total_silence_before = []
+
+        previous_end = 0
+        silent_samples = 0
+
+        for chunk in chunks:
+            silent_samples += chunk["start"] - previous_end
+            previous_end = chunk["end"]
+
+            self.chunk_end_sample.append(chunk["end"] - silent_samples)
+            self.total_silence_before.append(silent_samples / sampling_rate)
+
+    def get_original_time(
+        self,
+        time: float,
+        chunk_index: Optional[int] = None,
+    ) -> float:
+        if chunk_index is None:
+            chunk_index = self.get_chunk_index(time)
+
+        total_silence_before = self.total_silence_before[chunk_index]
+        return round(total_silence_before + time, self.time_precision)
+
+    def get_chunk_index(self, time: float) -> int:
+        sample = int(time * self.sampling_rate)
+        return min(
+            bisect.bisect(self.chunk_end_sample, sample),
+            len(self.chunk_end_sample) - 1,
+        )
+
+
+@functools.lru_cache
+def get_vad_model():
+    """Returns the VAD model instance."""
+    path = os.path.join(get_assets_path(), "silero_vad.onnx")
+    return SileroVADModel(path)
+
+
+class SileroVADModel:
+    def __init__(self, path):
+        try:
+            import onnxruntime
+        except ImportError as e:
+            raise RuntimeError(
+                "Applying the VAD filter requires the onnxruntime package"
+            ) from e
+
+        opts = onnxruntime.SessionOptions()
+        opts.inter_op_num_threads = 1
+        opts.intra_op_num_threads = 1
+        opts.log_severity_level = 4
+
+        self.session = onnxruntime.InferenceSession(
+            path,
+            providers=["CPUExecutionProvider"],
+            sess_options=opts,
+        )
+
+    def get_initial_state(self, batch_size: int):
+        h = np.zeros((2, batch_size, 64), dtype=np.float32)
+        c = np.zeros((2, batch_size, 64), dtype=np.float32)
+        return h, c
+
+    def __call__(self, x, state, sr: int):
+        if len(x.shape) == 1:
+            x = np.expand_dims(x, 0)
+        if len(x.shape) > 2:
+            raise ValueError(
+                f"Too many dimensions for input audio chunk {len(x.shape)}"
+            )
+        if sr / x.shape[1] > 31.25:
+            raise ValueError("Input audio chunk is too short")
+
+        h, c = state
+
+        ort_inputs = {
+            "input": x,
+            "h": h,
+            "c": c,
+            "sr": np.array(sr, dtype="int64"),
+        }
+
+        out, h, c = self.session.run(None, ort_inputs)
+        state = (h, c)
+
+        return out, state
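To see how `SpeechTimestampsMap` maps times in the silence-stripped audio back to the original recording, the sketch below repeats the class from the diff and runs it on two made-up chunks (1 s to 2 s and 3 s to 4 s at an assumed 16 kHz sampling rate).

```python
import bisect
from typing import List, Optional


class SpeechTimestampsMap:
    """Same logic as in the diff above, repeated so the example is runnable."""

    def __init__(self, chunks: List[dict], sampling_rate: int, time_precision: int = 2):
        self.sampling_rate = sampling_rate
        self.time_precision = time_precision
        self.chunk_end_sample = []
        self.total_silence_before = []
        previous_end = 0
        silent_samples = 0
        for chunk in chunks:
            # Accumulate the silence removed before this chunk.
            silent_samples += chunk["start"] - previous_end
            previous_end = chunk["end"]
            self.chunk_end_sample.append(chunk["end"] - silent_samples)
            self.total_silence_before.append(silent_samples / sampling_rate)

    def get_original_time(self, time: float, chunk_index: Optional[int] = None) -> float:
        if chunk_index is None:
            chunk_index = self.get_chunk_index(time)
        return round(self.total_silence_before[chunk_index] + time, self.time_precision)

    def get_chunk_index(self, time: float) -> int:
        sample = int(time * self.sampling_rate)
        return min(
            bisect.bisect(self.chunk_end_sample, sample),
            len(self.chunk_end_sample) - 1,
        )


# Hypothetical VAD output: speech at 1-2 s and 3-4 s of the original audio.
chunks = [{"start": 16000, "end": 32000}, {"start": 48000, "end": 64000}]
ts_map = SpeechTimestampsMap(chunks, sampling_rate=16000)

print(ts_map.get_original_time(0.5))  # 1.5
print(ts_map.get_original_time(1.5))  # 3.5
```

A time of 1.5 s in the concatenated audio falls 0.5 s into the second chunk, which starts at 3 s in the original, hence 3.5.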
--- a/faster_whisper/version.py
+++ b/faster_whisper/version.py
@@ -0,0 +1,3 @@
+"""Version information."""
+
+__version__ = "1.0.2"
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,3 +1,5 @@
-av==10.*
-ctranslate2>=3.9,<4
-tokenizers==0.13.*
+av>=11.0,<13
+ctranslate2>=4.0,<5
+huggingface_hub>=0.13
+tokenizers>=0.13,<1
+onnxruntime>=1.14,<2
--- a/setup.py
+++ b/setup.py
@@ -11,6 +11,14 @@ def get_long_description():
        return readme_file.read()


+def get_project_version():
+    version_path = os.path.join(base_dir, "faster_whisper", "version.py")
+    version = {}
+    with open(version_path, encoding="utf-8") as fp:
+        exec(fp.read(), version)
+    return version["__version__"]
+
+
 def get_requirements(path):
    with open(path, encoding="utf-8") as requirements:
        return [requirement.strip() for requirement in requirements]
@@ -23,13 +31,13 @@ conversion_requires = get_requirements(

 setup(
    name="faster-whisper",
-    version="0.2.0",
+    version=get_project_version(),
    license="MIT",
    description="Faster Whisper transcription with CTranslate2",
    long_description=get_long_description(),
    long_description_content_type="text/markdown",
    author="Guillaume Klein",
-    url="https://github.com/guillaumekln/faster-whisper",
+    url="https://github.com/SYSTRAN/faster-whisper",
    classifiers=[
        "Development Status :: 4 - Beta",
        "Intended Audience :: Developers",
@@ -48,8 +56,7 @@ setup(
    install_requires=install_requires,
    extras_require={
        "conversion": conversion_requires,
-        "dev": conversion_requires
-        + [
+        "dev": [
            "black==23.*",
            "flake8==6.*",
            "isort==5.*",
@@ -57,4 +64,5 @@ setup(
        ],
    },
    packages=find_packages(),
+    include_package_data=True,
 )
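The `get_project_version` helper added to `setup.py` reads the version by executing `version.py` instead of importing the package. A minimal sketch of the same technique against a temporary file; the name `read_version` and the temporary path are illustrative only:

```python
import os
import tempfile


def read_version(version_path: str) -> str:
    """Execute a version file in an empty namespace and return __version__."""
    namespace = {}
    with open(version_path, encoding="utf-8") as fp:
        exec(fp.read(), namespace)
    return namespace["__version__"]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "version.py")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write('"""Version information."""\n\n__version__ = "1.0.2"\n')
    version = read_version(path)

print(version)  # 1.0.2
```

Reading the file this way means the build does not need the package's runtime dependencies to be importable.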
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -1,6 +1,5 @@
 import os

-import ctranslate2
 import pytest


@@ -12,20 +11,3 @@ def data_dir():
@pytest.fixture
 def jfk_path(data_dir):
    return os.path.join(data_dir, "jfk.flac")
-
-
-@pytest.fixture(scope="session")
-def tiny_model_dir(tmp_path_factory):
-    model_path = str(tmp_path_factory.mktemp("data") / "model")
-    convert_model("tiny", model_path)
-    return model_path
-
-
-def convert_model(size, output_dir):
-    name = "openai/whisper-%s" % size
-
-    ctranslate2.converters.TransformersConverter(
-        name,
-        copy_files=["tokenizer.json"],
-        load_as_float16=True,
-    ).convert(output_dir, quantization="float16")
--- a/tests/data/stereo_diarization.wav
+++ b/tests/data/stereo_diarization.wav
--- a/tests/test.py
+++ b/tests/test.py
@@ -1,25 +0,0 @@
-from faster_whisper import WhisperModel
-
-
-def test_transcribe(tiny_model_dir, jfk_path):
-    model = WhisperModel(tiny_model_dir)
-    segments, info = model.transcribe(jfk_path, word_timestamps=True)
-
-    assert info.language == "en"
-    assert info.language_probability > 0.9
-    assert info.duration == 11
-
-    segments = list(segments)
-
-    assert len(segments) == 1
-
-    segment = segments[0]
-
-    assert segment.text == (
-        " And so my fellow Americans ask not what your country can do for you, "
-        "ask what you can do for your country."
-    )
-
-    assert segment.text == "".join(word.word for word in segment.words)
-    assert segment.start == segment.words[0].start
-    assert segment.end == segment.words[-1].end
--- a/tests/test_transcribe.py
+++ b/tests/test_transcribe.py
@@ -0,0 +1,99 @@
+import os
+
+from faster_whisper import WhisperModel, decode_audio
+
+
+def test_supported_languages():
+    model = WhisperModel("tiny.en")
+    assert model.supported_languages == ["en"]
+
+
+def test_transcribe(jfk_path):
+    model = WhisperModel("tiny")
+    segments, info = model.transcribe(jfk_path, word_timestamps=True)
+    assert info.all_language_probs is not None
+
+    assert info.language == "en"
+    assert info.language_probability > 0.9
+    assert info.duration == 11
+
+    # Get top language info from all results, which should match the
+    # already existing metadata
+    top_lang, top_lang_score = info.all_language_probs[0]
+    assert info.language == top_lang
+    assert abs(info.language_probability - top_lang_score) < 1e-16
+
+    segments = list(segments)
+
+    assert len(segments) == 1
+
+    segment = segments[0]
+
+    assert segment.text == (
+        " And so my fellow Americans ask not what your country can do for you, "
+        "ask what you can do for your country."
+    )
+
+    assert segment.text == "".join(word.word for word in segment.words)
+    assert segment.start == segment.words[0].start
+    assert segment.end == segment.words[-1].end
+
+
+def test_prefix_with_timestamps(jfk_path):
+    model = WhisperModel("tiny")
+    segments, _ = model.transcribe(jfk_path, prefix="And so my fellow Americans")
+    segments = list(segments)
+
+    assert len(segments) == 1
+
+    segment = segments[0]
+
+    assert segment.text == (
+        " And so my fellow Americans ask not what your country can do for you, "
+        "ask what you can do for your country."
+    )
+
+    assert segment.start == 0
+    assert 10 < segment.end < 11
+
+
+def test_vad(jfk_path):
+    model = WhisperModel("tiny")
+    segments, info = model.transcribe(
+        jfk_path,
+        vad_filter=True,
+        vad_parameters=dict(min_silence_duration_ms=500, speech_pad_ms=200),
+    )
+    segments = list(segments)
+
+    assert len(segments) == 1
+    segment = segments[0]
+
+    assert segment.text == (
+        " And so my fellow Americans ask not what your country can do for you, "
+        "ask what you can do for your country."
+    )
+
+    assert 0 < segment.start < 1
+    assert 10 < segment.end < 11
+
+    assert info.vad_options.min_silence_duration_ms == 500
+    assert info.vad_options.speech_pad_ms == 200
+
+
+def test_stereo_diarization(data_dir):
+    model = WhisperModel("tiny")
+
+    audio_path = os.path.join(data_dir, "stereo_diarization.wav")
+    left, right = decode_audio(audio_path, split_stereo=True)
+
+    segments, _ = model.transcribe(left)
+    transcription = "".join(segment.text for segment in segments).strip()
+    assert transcription == (
+        "He began a confused complaint against the wizard, "
+        "who had vanished behind the curtain on the left."
+    )
+
+    segments, _ = model.transcribe(right)
+    transcription = "".join(segment.text for segment in segments).strip()
+    assert transcription == "The horizon seems extremely distant."
--- a/tests/test_utils.py
+++ b/tests/test_utils.py
@@ -0,0 +1,29 @@
+import os
+
+from faster_whisper import available_models, download_model
+
+
+def test_available_models():
+    models = available_models()
+    assert isinstance(models, list)
+    assert "tiny" in models
+
+
+def test_download_model(tmpdir):
+    output_dir = str(tmpdir.join("model"))
+
+    model_dir = download_model("tiny", output_dir=output_dir)
+
+    assert model_dir == output_dir
+    assert os.path.isdir(model_dir)
+    assert not os.path.islink(model_dir)
+
+    for filename in os.listdir(model_dir):
+        path = os.path.join(model_dir, filename)
+        assert not os.path.islink(path)
+
+
+def test_download_model_in_cache(tmpdir):
+    cache_dir = str(tmpdir.join("model"))
+    download_model("tiny", cache_dir=cache_dir)
+    assert os.path.isdir(cache_dir)