NLP Specialist: BERT & Beyond
7. Speech-to-Text (Whisper)
1. Weak Supervision
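OpenAI trained Whisper on 680,000 hours of audio collected from the internet, paired with transcripts that were not hand-verified (hence "weak" supervision). The scale and diversity of this data make the model robust to accents and background noise. Because the data also includes non-English speech paired with English text, the same model learns to translate speech into English without a separate translation system.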
2. The Architecture
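Whisper is a standard Transformer encoder-decoder. Input: a log-Mel spectrogram computed from 30-second chunks of 16 kHz audio, which the encoder processes. Output: text tokens, which the decoder generates one at a time while attending to the encoder output.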
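To make the input side concrete, here is a minimal sketch of the shape arithmetic for one audio chunk. The numbers (16 kHz audio, 30-second chunks, a 10 ms hop, 80 Mel channels, and a stride-2 convolution ahead of the encoder) are assumptions taken from the originally published Whisper configuration; later checkpoints may differ. The names FrontEndConfig, spectrogram_shape, and encoder_positions are hypothetical helpers for illustration, not part of any Whisper library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrontEndConfig:
    """Assumed audio front-end settings (from the original Whisper release)."""
    sample_rate: int = 16_000   # samples per second
    chunk_seconds: int = 30     # audio is padded or trimmed to this length
    hop_length: int = 160       # samples between frames (10 ms at 16 kHz)
    n_mels: int = 80            # Mel channels per frame
    conv_stride: int = 2        # downsampling before the encoder layers


def spectrogram_shape(cfg: FrontEndConfig) -> tuple[int, int]:
    """Return (n_mels, n_frames) for one chunk of audio."""
    n_samples = cfg.sample_rate * cfg.chunk_seconds
    n_frames = n_samples // cfg.hop_length
    return (cfg.n_mels, n_frames)


def encoder_positions(cfg: FrontEndConfig) -> int:
    """Return how many sequence positions the encoder attends over."""
    _, n_frames = spectrogram_shape(cfg)
    return n_frames // cfg.conv_stride


cfg = FrontEndConfig()
print(spectrogram_shape(cfg))   # (80, 3000)
print(encoder_positions(cfg))   # 1500
```

Under these assumptions, each 30-second chunk becomes an 80 x 3000 spectrogram, and the encoder sees 1500 positions, so every position covers about 20 ms of audio.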