Whisper is an automatic speech recognition (ASR) system that approaches human-level accuracy on English speech. The model is trained on 680,000 hours of multilingual and multitask supervised data collected from the internet. Training on such a large and varied dataset improves Whisper's robustness to accents, background noise, and technical language, and enables it to transcribe speech in multiple languages as well as translate speech from various languages into English.
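As an illustration of these capabilities, the minimal sketch below uses the open-source `whisper` Python package to transcribe an audio file and then translate it into English; the file name `audio.mp3` and the `base` model size are placeholder choices for the example, not part of the original description.

```python
import whisper

# Load one of the released checkpoints; "base" is a small, fast option
# (larger checkpoints such as "medium" or "large" trade speed for accuracy).
model = whisper.load_model("base")

# Transcribe speech in its original language.
# "audio.mp3" is a placeholder path for any local audio file.
transcription = model.transcribe("audio.mp3")
print(transcription["language"])  # detected language code, e.g. "fr"
print(transcription["text"])      # transcript in the source language

# Translate speech from the source language into English.
translation = model.transcribe("audio.mp3", task="translate")
print(translation["text"])        # English translation of the audio
```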
Whisper's architecture is an end-to-end encoder-decoder transformer. Audio is split into 30-second chunks, converted into a log-Mel spectrogram, and passed to the encoder. The decoder is trained to predict the corresponding text caption and to perform specific tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation. While models specialized for a single benchmark can achieve better speech recognition performance on that benchmark, Whisper's large and diverse training dataset makes it more robust across varied audio, allowing it to handle a wide range of transcription tasks with fewer errors overall. Roughly a third of Whisper's audio dataset is non-English. OpenAI has open-sourced the Whisper model weights and inference code.
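The sketch below mirrors that pipeline step by step using the lower-level functions exposed by the open-source inference code; it assumes the same `whisper` package and a placeholder file `audio.mp3`.

```python
import whisper

model = whisper.load_model("base")

# Load audio and pad/trim it to the fixed 30-second window the model expects.
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Convert the 30-second chunk into a log-Mel spectrogram, the encoder's input.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Language identification: the model scores language tokens for this chunk.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode the spectrogram into text (transcription by default).
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)
```

In practice, the high-level `transcribe` helper shown earlier runs this same chunking and decoding process over audio of arbitrary length by sliding the 30-second window across the file.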