An open and multilingual ASR model
On 21 September 2022, OpenAI released Whisper under the MIT licence, together with the paper “Robust Speech Recognition via Large-Scale Weak Supervision”. Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of audio collected from the web under weak supervision: audio-text pairs of variable quality, but in far greater quantities than the curated datasets traditionally used in ASR.
The size of the dataset and the variety of its sources yield a model that is robust to noise, accents, technical terminology and non-ideal audio conditions, areas where earlier ASR systems showed significant limitations.
Architecture and tasks
Whisper adopts a standard encoder-decoder Transformer architecture. Audio is resampled to 16 kHz and converted into an 80-channel log-Mel spectrogram over 30-second windows; the encoder processes the spectrogram and the decoder generates text tokens, conditioned on special tokens that specify the requested task.
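The fixed-size input can be sketched with the preprocessing constants published in the open-source repository (16 kHz sample rate, a 160-sample hop between frames, 80 mel channels): padding or trimming every clip to exactly 30 seconds always produces a 3000-frame spectrogram for the encoder. The `pad_or_trim` helper below is a simplified re-implementation for illustration, not the library's own code.

```python
import numpy as np

SAMPLE_RATE = 16_000   # Whisper resamples all audio to 16 kHz
CHUNK_SECONDS = 30     # fixed window length
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per window
HOP_LENGTH = 160       # 10 ms hop between spectrogram frames
N_MELS = 80            # mel channels in the log-Mel spectrogram

def pad_or_trim(audio: np.ndarray, length: int = N_SAMPLES) -> np.ndarray:
    """Zero-pad or trim a waveform so it is exactly 30 s long."""
    if len(audio) > length:
        return audio[:length]
    return np.pad(audio, (0, length - len(audio)))

# A 7-second clip of silence stands in for real audio.
clip = np.zeros(7 * SAMPLE_RATE, dtype=np.float32)
fixed = pad_or_trim(clip)
print(fixed.shape)                        # (480000,)
print((N_MELS, N_SAMPLES // HOP_LENGTH))  # (80, 3000): encoder input shape
```

Whatever the original clip length, the encoder therefore always sees an 80 × 3000 spectrogram.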
The model natively handles three tasks: transcription in the original language, translation into English, and spoken-language identification. Task selection happens via control tokens in the decoder input sequence, without separate fine-tuning.
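As a sketch of this mechanism, the decoder prompt opens with a start token, a language token and a task token (spellings follow the open-source repository; the helper itself is a simplification for illustration):

```python
def decoder_prompt(language: str, task: str, timestamps: bool = False) -> list[str]:
    """Build the control-token prefix that tells the decoder which task to run.

    Simplified sketch: the real model maps these strings to token ids and
    can also predict the language token itself when it is not given.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

# Transcribe French audio in French:
print(decoder_prompt("fr", "transcribe"))
# Translate the same French audio into English:
print(decoder_prompt("fr", "translate"))
```

Swapping a single control token is all that switches the model between transcription and translation.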
Available sizes
Whisper was released as a family of variants: tiny, base, small, medium and large, with the large variant later joined by large-v2 (December 2022) and large-v3 (November 2023), which improved accuracy and language coverage. In October 2024 OpenAI released Whisper Turbo, an inference-optimised variant that keeps accuracy close to large while running significantly faster.
Licence and adoption
The MIT licence allows any commercial use, modification and redistribution. Whisper today sits at the core of many production ASR deployments, including optimised reimplementations such as whisper.cpp and faster-whisper that enable execution on CPUs and consumer hardware.
