mirror of
https://github.com/SYSTRAN/faster-whisper.git
synced 2026-01-09 13:38:01 -05:00
* Fix: Prevent <|nocaptions|> tokens in BatchedInferencePipeline
- Add nocaptions component tokens [1771, 496, 9799] to suppress_tokens list
- Add segment filtering to remove any remaining <|nocaptions|> segments
- Resolves an issue where BatchedInferencePipeline generated malformed
special tokens during periods of silence or low-confidence transcription
- Includes comprehensive tests to verify the fix
The bracket tokens ('<', '|', '>') were already suppressed, but the
content tokens ('no', 'ca', 'ptions') were not, so the pieces could still
be generated one at a time and assemble into complete <|nocaptions|>
tags in the output.
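
The two parts of the fix described above can be sketched roughly as follows. This is a minimal illustration, not the code in faster_whisper/transcribe.py: the token ids are the ones quoted in this message, while `Segment`, `extend_suppress_tokens`, and `drop_nocaptions_segments` are hypothetical names introduced only for the example.

```python
from dataclasses import dataclass

# Token ids quoted in this commit message for the pieces 'no', 'ca', 'ptions'.
NOCAPTIONS_COMPONENT_TOKENS = [1771, 496, 9799]
NOCAPTIONS_TAG = "<|nocaptions|>"


@dataclass
class Segment:
    # Hypothetical stand-in for a transcription segment.
    start: float
    end: float
    text: str


def extend_suppress_tokens(suppress_tokens):
    """Return a sorted, de-duplicated list that also contains the component ids."""
    return sorted(set(suppress_tokens) | set(NOCAPTIONS_COMPONENT_TOKENS))


def drop_nocaptions_segments(segments):
    """Drop every segment whose text still contains the tag."""
    return [s for s in segments if NOCAPTIONS_TAG not in s.text]


if __name__ == "__main__":
    print(extend_suppress_tokens([-1, 1771]))
    segments = [
        Segment(0.0, 1.5, "hello there"),
        Segment(1.5, 4.0, " <|nocaptions|>"),
    ]
    print([s.text for s in drop_nocaptions_segments(segments)])
```

Filtering after decoding acts as a safety net in case suppression alone does not cover every path by which the tag can be produced.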
Files changed:
- faster_whisper/transcribe.py: Core fix implementation
- test_nocaptions_comprehensive.py: Comprehensive test suite
- tests/test_nocaptions_fix.py: Unit tests
* removed
* Fix: Prevent <|nocaptions|> tokens in BatchedInferencePipeline
* Fix: Implement proper <|nocaptions|> token suppression using single token approach
* ci: trigger tests
* fix: remove trailing whitespace from blank lines
* Update faster_whisper/transcribe.py
Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>
* Update faster_whisper/tokenizer.py
Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>
* Update faster_whisper/tokenizer.py
Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>
* Rename no_speech to no_captions in tokenizer
* nocaptions has been renamed to nospeech
* break line
* line break
* Refactor no_speech method for improved readability by adjusting line breaks
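
The later commits above replace the component-token list with a single special-token lookup. A rough sketch of that idea, under stated assumptions: `StubVocab` is a hypothetical stand-in for a vocabulary object with a `token_to_id(name)` method that returns None for unknown tokens, the ids are made up, and trying `<|nospeech|>` before `<|nocaptions|>` is an assumption of this example rather than something stated in the commits.

```python
class StubVocab:
    """Hypothetical vocabulary: token_to_id returns None for unknown tokens."""

    def __init__(self, mapping):
        self._mapping = dict(mapping)

    def token_to_id(self, token):
        return self._mapping.get(token)


def no_speech_token_id(vocab):
    """Return the id of the no-speech special token under either of its names."""
    # Assumed lookup order: newer name first, then the older one.
    for name in ("<|nospeech|>", "<|nocaptions|>"):
        token_id = vocab.token_to_id(name)
        if token_id is not None:
            return token_id
    raise KeyError("vocabulary defines neither <|nospeech|> nor <|nocaptions|>")


def with_no_speech_suppressed(suppress_tokens, vocab):
    """Return a sorted, de-duplicated suppress list that includes the single id."""
    return sorted(set(suppress_tokens) | {no_speech_token_id(vocab)})


if __name__ == "__main__":
    vocab = StubVocab({"<|nocaptions|>": 7})  # made-up id for illustration
    print(with_no_speech_suppressed([3, 1], vocab))
```

Suppressing one special-token id avoids blocking ordinary word pieces such as 'no', which the component-token approach would also have suppressed in normal text.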
---------
Co-authored-by: Mahmoud Ashraf <hassouna97.ma@gmail.com>