10. Audio Processing

1. Images of Sound

CNNs are usually not fed raw waveforms. Instead, the audio is converted to a spectrogram, which turns the time-amplitude signal into a time-frequency representation that can be treated like an image.
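As a rough illustration of the waveform-to-spectrogram step, here is a minimal sketch using only NumPy. The function name `spectrogram`, the Hann window, and the values of `n_fft`, `hop`, and the sample rate are assumptions chosen for the example, not settings taken from these notes.

```python
import numpy as np

def spectrogram(wave, n_fft=256, hop=128):
    """Magnitude spectrogram from a short-time Fourier transform (illustrative)."""
    window = np.hanning(n_fft)                      # taper each frame to reduce leakage
    n_frames = 1 + (len(wave) - n_fft) // hop       # frames that fit without padding
    frames = np.stack([
        wave[i * hop : i * hop + n_fft] * window    # overlapping, windowed slices
        for i in range(n_frames)
    ])
    # rfft per frame -> (n_frames, n_fft//2 + 1); transpose to (frequency, time)
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Assumed test signal: one second of a 1000 Hz sine sampled at 8000 Hz.
sr, n_fft = 8000, 256
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(wave, n_fft=n_fft)
peak_bin = int(spec.mean(axis=1).argmax())          # strongest frequency row
peak_hz = peak_bin * sr / n_fft                     # convert bin index to Hz
```

The result is a 2D array with frequency on one axis and time on the other, which is why it can be passed to an image-style CNN. In practice a log or mel scaling is often applied on top, which this sketch leaves out.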

Self-supervised learning also works on audio: mask parts of the sound and train the model to predict the missing pieces.
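The masking idea can be sketched as below. This is an illustration under assumptions, not a specific published method: it masks whole time frames of a spectrogram by zeroing them, and scores a prediction only on the masked frames. The helper names `mask_time_frames` and `masked_mse`, the 30% mask ratio, and the random stand-in spectrogram are all hypothetical.

```python
import numpy as np

def mask_time_frames(spec, mask_ratio=0.3, seed=0):
    """Zero out a random subset of time frames (columns); return the copy and the mask."""
    rng = np.random.default_rng(seed)
    n_frames = spec.shape[1]
    n_masked = int(round(mask_ratio * n_frames))
    mask = np.zeros(n_frames, dtype=bool)
    mask[rng.choice(n_frames, size=n_masked, replace=False)] = True
    corrupted = spec.copy()
    corrupted[:, mask] = 0.0                        # assumption: masked frames become zeros
    return corrupted, mask

def masked_mse(prediction, target, mask):
    """Mean squared error computed only over the masked frames."""
    diff = prediction[:, mask] - target[:, mask]
    return float(np.mean(diff ** 2))

# Stand-in spectrogram (random values), shape (frequency bins, time frames).
spec = np.random.default_rng(1).random((129, 60))
corrupted, mask = mask_time_frames(spec)

# A real model would map `corrupted` to a prediction; two trivial baselines here:
loss_echo = masked_mse(corrupted, spec, mask)       # "model" that just returns its input
loss_perfect = masked_mse(spec, spec, mask)         # ideal reconstruction
```

Computing the loss only on masked positions means the model gets no credit for copying the visible frames, so it has to infer the hidden content from context.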