Whisper: OpenAI's open-source multilingual ASR model

OpenAI released Whisper on 21 September 2022 under the MIT licence: a model trained on 680,000 hours of audio that handles multilingual transcription, translation into English and language identification.

An open and multilingual ASR model

On 21 September 2022 OpenAI released Whisper under the MIT licence, together with the paper “Robust Speech Recognition via Large-Scale Weak Supervision”. It is an Automatic Speech Recognition (ASR) system trained on 680,000 hours of audio collected from the web under weak supervision, meaning audio-text pairs of variable quality but available in far greater quantity than the curated datasets traditionally used in ASR.

The size of the dataset and the variety of its sources yield a model that is robust to noise, accents, technical terms and non-ideal audio conditions, areas where previous ASR systems showed significant limitations.

Architecture and tasks

Whisper adopts a standard encoder-decoder Transformer architecture. Audio is split into 30-second windows, each converted into an 80-channel log-Mel spectrogram; the encoder processes the spectrogram and the decoder generates text tokens, conditioned on special tokens that specify the required task.
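
To make the input dimensions concrete, the arithmetic behind a single 30-second window can be sketched as follows. The 16 kHz sample rate and the 10 ms frame stride come from the Whisper paper rather than from the text above, and the constant and function names are purely illustrative.

```python
# Input-shape arithmetic for one Whisper window.
# Values follow the paper; the names are illustrative, not a library API.
SAMPLE_RATE_HZ = 16_000   # audio is resampled to 16 kHz
WINDOW_SECONDS = 30       # each window covers 30 seconds of audio
HOP_SAMPLES = 160         # 10 ms stride between spectrogram frames
N_MELS = 80               # Mel channels per frame


def spectrogram_shape(seconds: int = WINDOW_SECONDS) -> tuple[int, int]:
    """Return (mel_channels, frames) for an audio window of the given length."""
    samples = seconds * SAMPLE_RATE_HZ
    frames = samples // HOP_SAMPLES
    return N_MELS, frames


print(spectrogram_shape())  # (80, 3000)
```

Under these assumptions the encoder receives an 80 x 3000 matrix for every 30 seconds of audio, regardless of how much speech the window actually contains.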

The model natively handles three tasks: transcription in the original language, translation into English and spoken language identification. The task is selected via control tokens in the decoder input sequence, so no separate fine-tuning per task is needed.
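
A minimal sketch of how such a control-token prefix might be assembled is shown here. The token spellings follow the Whisper paper; `build_prompt` is a hypothetical helper written for illustration, not a function of the Whisper library, and it works on token strings rather than on the integer ids a real tokenizer would produce.

```python
# Sketch of the decoder prefix that selects language and task.
# build_prompt is a hypothetical helper, not part of the Whisper package.
def build_prompt(language: str, task: str, timestamps: bool = False) -> list[str]:
    """Return the special tokens that precede the generated text."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Tells the decoder to emit plain text without timestamp tokens.
        prompt.append("<|notimestamps|>")
    return prompt


print(build_prompt("it", "translate"))
# ['<|startoftranscript|>', '<|it|>', '<|translate|>', '<|notimestamps|>']
```

For language identification the prefix is not supplied by the caller: the model itself predicts the language token that follows `<|startoftranscript|>`.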

Available sizes

Whisper is released as a family of five variants (tiny, base, small, medium and large), ranging from roughly 39 million to about 1.55 billion parameters and designed to cover different trade-offs between accuracy, memory and inference speed.
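
As an illustration of that trade-off, the sketch below picks the largest variant that fits a given GPU memory budget and then runs it. The parameter counts and memory figures are the approximate values listed in the project README and should be read as rough guidance; `pick_model` and `transcribe` are hypothetical helpers, while `whisper.load_model` and `model.transcribe` are the calls exposed by the `openai-whisper` package, which must be installed along with ffmpeg.

```python
# Approximate figures from the project README; rough guidance only.
MODELS = [
    # (name, parameters in millions, approximate VRAM in GB)
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("large", 1550, 10),
]


def pick_model(vram_gb: float) -> str:
    """Return the largest variant whose approximate VRAM need fits the budget."""
    fitting = [name for name, _, need in MODELS if need <= vram_gb]
    if not fitting:
        raise ValueError("no variant fits in the given memory budget")
    return fitting[-1]


def transcribe(path: str, vram_gb: float, task: str = "transcribe") -> str:
    """Transcribe (or translate into English) the audio file at `path`."""
    import whisper  # imported lazily so pick_model works without the package

    model = whisper.load_model(pick_model(vram_gb))
    return model.transcribe(path, task=task)["text"]
```

With a 4 GB budget, for example, `pick_model(4)` selects `small`; passing `task="translate"` to `transcribe` requests English output instead of a transcript in the source language.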

Licence and adoption

The MIT licence permits commercial use, modification and redistribution, provided the copyright and licence notice is retained, which makes Whisper usable as a base for open ASR implementations in research and production.

Link: github.com/openai/whisper
