7. Speech-to-Text (Whisper)

1. Weak Supervision

OpenAI trained Whisper on 680,000 hours of internet audio. It solves accents, background noise, and even translation implicitly.

It is a standard Transformer Encoder-Decoder. Input: Log-Mel Spectrogram. Output: Text tokens.