This talk by Sander Dieleman from DeepMind on "Generating music in the raw audio domain" was presented on 10th September 2018 at IDEA London as part of the Creative AI meetup.
Transcript:
Sander Dieleman - Generating music in the raw audio domain - Creative AI meetup
1. Generating music in the raw audio domain
Sander Dieleman - September 10th, 2018
2. PhD student, Ghent University, Belgium (2010 - 2016)
“Learning feature hierarchies for musical audio signals”
Machine learning intern, Spotify, NYC (summer 2014)
Scaling up content-based music recommendation
Research Scientist, DeepMind, London UK (since July 2015)
AlphaGo, WaveNet, ...
http://benanne.github.io
sanderdieleman@gmail.com
@sedielem
3. Overview
Why model music in the raw audio domain?
WaveNet: an autoregressive model of raw audio
Generating music with WaveNets
Modelling long-range correlations with WaveNets
Generating music with long-range consistency
8. Why raw audio?
Many instruments have complex action spaces
⇒ Rich palette of sounds and timbral variations
Guitar
● pick vs. finger
● picking position
● frets
● harmonics
● ...
11. Modelling raw audio: μ-law encoding / quantisation
[Figure: uniform quantisation vs. μ-law quantisation]
WaveNet models audio as a discrete time series
12. WaveNet: an autoregressive model of raw audio
[Figure: WaveNet architecture with an input layer, stacked hidden layers, and an output layer]
“WaveNet: A Generative Model for Raw Audio”, van den Oord et al. (2016)
22. NSynth: WaveNet autoencoders for timbre modelling
Collaboration with Google Brain (Magenta)
“Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders”, Engel et al. (ICML 2017)
27. … but memory usage ~ receptive field length
Required model depth is logarithmic in the desired receptive field length
Required memory usage during training is still linear in the desired receptive field length!
⇒ We cannot scale indefinitely using dilation
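To make the scaling argument concrete, here is a quick back-of-the-envelope sketch (my own, not from the talk), assuming kernel size 2 and the usual doubling dilation schedule:

def receptive_field(dilations, kernel_size=2):
    """Receptive field (in samples) of a stack of dilated causal convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)

# One WaveNet-style block: dilations double from 1 to 512 over 10 layers.
block = [2 ** i for i in range(10)]

print(receptive_field(block))      # 1024 samples from only 10 layers (depth grows logarithmically)
print(receptive_field(block * 6))  # 6139 samples from 60 layers, still under 1 second at 16 kHz

Memory during training, on the other hand, scales with the number of samples in the receptive field, which is why dilation alone does not solve the long-range problem.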
29. Autoregressive discrete autoencoders (ADAs)
“Neural Discrete Representation Learning”, van den Oord et al. (2017)
“The challenge of realistic music generation: modelling raw audio at scale”, Dieleman et al. (2018)
Two variants:
● VQ-VAE (vector quantisation variational autoencoder): learn a codebook, quantise the queries to elements of this codebook
● AMAE (argmax autoencoder): encourage the queries to be close to one-hot (e.g. using a softmax), and quantise them on the simplex
42. Code sequences are harder to model than audio
Training second-level ADAs is much harder:
● Train VQ-VAE with population-based training (PBT)
● Train AMAE, which is less unstable
[Figure: predictability (NLL) vs. receptive field length, for audio and for code sequences]
“Population Based Training of Neural Networks”, Jaderberg et al. (2017)
46.
Keynote speakers
Kenneth Stanley, University of Central Florida
David Ha, Google Brain
Allison Parrish, NYU ITP
Yaroslav Ganin, DeepMind
Yaniv Taigman, Facebook AI Research
Submission deadline (papers & art): October 28th
https://nips2018creativity.github.io/
What is WaveNet?
blog post / paper
WaveNet: generative model of raw audio waveforms: learn probability distribution of sound
This is challenging because sound is a very long time series: 16kHz -> 16000 samples / second
Very successful for text-to-speech: state of the art in terms of audio quality (used in the Google Assistant)
WaveNet is inspired by similarly structured models that can generate images, called PixelCNNs
We drew some lessons from this work when developing WaveNet: treat the data to be modelled as discrete.
(even though amplitude of an audio waveform, just like the intensity of a pixel in an image, is actually continuous in the real world.)
Output of the model: distribution over a number of discrete choices (softmax).
μ-law matches our logarithmic perception of loudness (not a new idea: it has long been used in telephony, for example).
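For illustration, here is a minimal μ-law companding and quantisation sketch in NumPy (my own; μ = 255 and 256 output levels are assumed, as in the WaveNet paper):

import numpy as np

def mu_law_encode(x, mu=255):
    """Map waveform samples in [-1, 1] to integer codes in [0, mu]."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(codes, mu=255):
    """Invert the encoding back to approximate waveform samples."""
    compressed = 2 * (codes.astype(np.float64) / mu) - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu

codes = mu_law_encode(np.linspace(-1, 1, 5))
print(codes)                   # -> 0, 16, 128, 239, 255
print(mu_law_decode(codes))    # approximately -1, -0.5, 0, 0.5, 1

The model then outputs a softmax over these 256 discrete values instead of regressing a continuous amplitude.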
WaveNet is a convolutional network, consisting of many layers of nonlinear processing, and the functions that these layers compute are learnt from data.
Causal convolutions: at each timestep, connections only to previous timesteps (so the model can’t see the future)
Black nodes represent input signal values
Red nodes represent internal values computed by the network
Grey nodes represent the output: the predicted distribution of the next signal value
We need large receptive fields so the model remembers what it generated before.
Dilated convolutions make it much cheaper to get large receptive fields.
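A toy sketch of what "causal" and "dilated" mean here (illustrative NumPy only, my own; the real model also uses learned filters, gated activations and residual connections):

import numpy as np

def causal_dilated_conv1d(x, weights, dilation=1):
    """x: (T,) signal; weights: (k,) filter taps, oldest tap first."""
    k = len(weights)
    pad = (k - 1) * dilation
    # Left-pad so output[t] only sees x[t], x[t - dilation], ... and never the future.
    x_padded = np.concatenate([np.zeros(pad), x])
    out = np.zeros_like(x, dtype=np.float64)
    for t in range(len(x)):
        taps = x_padded[t : t + pad + 1 : dilation]  # exactly k causal taps
        out[t] = np.dot(weights, taps)
    return out

x = np.random.randn(16000)                                      # one second of audio at 16 kHz
h = causal_dilated_conv1d(x, np.array([0.5, 0.5]), dilation=1)
h = causal_dilated_conv1d(h, np.array([0.5, 0.5]), dilation=2)  # deeper layers use larger dilations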
Samples: MTT
Samples from a model trained on a varied collection of music (contemporary, classical, non-western music)
The model drifts between different styles: the receptive field is not large enough (60 layers, receptive field < 1 second).
Music is harder to model than speech signals: much less constrained (in terms of timbre etc.), extremely long-range temporal correlations
Samples: piano
We reduced the complexity of the problem a bit by constructing single-instrument datasets from YouTube
Samples: guitar
Samples: string quartet
An ADA consists of a WaveNet (local model) with an encoder and a modulator attached to it.
The encoder takes the waveform and tries to compress it into a new representation that abstracts away local signal structure.
The modulator tries to influence the local model such that it reproduces the encoder input.
The local model reconstructs any local signal structure that the encoder removed when compressing the signal.
The quantisation module ensures that the compressed representation is discrete, limiting its information capacity and making it easier to model with another WaveNet.
Although it’s possible to interpret this as a VAE, there isn’t really any tangible benefit to doing this.
Remember: D-dimensional space, K codes in the codebook.
There are K codes; each code is represented by a D-dimensional vector!
Quantise to closest point in code space
Output the quantised vector for the decoder, and also an integer representation of the selected code.
In backprop, the backward pass takes a different path than the forward pass! We basically pretend the quantisation didn’t happen.
As long as q and q’ are close enough (the quantisation error is low), this is fine.
Three terms in the loss function (sketched in code below):
Reconstruction loss: here we do straight-through for q’->q
Codebook loss: only affects codebook (gradient w.r.t. query is stopped)
Commitment loss: only affects encoder (gradient w.r.t. code is stopped)
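A schematic NumPy sketch of this quantisation step and the loss terms just listed (my own illustration; shapes and names are assumptions, and since there is no autodiff here the stop-gradient placement is only indicated in comments):

import numpy as np

rng = np.random.default_rng(0)
K, D, T = 512, 64, 100                  # K codes, D-dimensional code space, T timesteps
codebook = rng.normal(size=(K, D))      # the learned codebook
q = rng.normal(size=(T, D))             # encoder queries (continuous)

# Quantise: snap each query to its nearest codebook vector.
dists = ((q[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
codes = dists.argmin(axis=1)            # integer codes, fed to the next-level WaveNet
q_quantised = codebook[codes]           # q', fed to the decoder/modulator

# Straight-through trick: the forward pass uses q', the backward pass pretends the
# quantisation didn't happen, i.e. q_st = q + stop_gradient(q_quantised - q).

# The three loss terms (stop_gradient placement shown in the comments):
# 1. Reconstruction: NLL of the decoder output, with straight-through for q' -> q.
codebook_loss = ((q_quantised - q) ** 2).mean()    # 2. gradients flow to the codebook only: ||q' - sg(q)||^2
commitment_loss = ((q - q_quantised) ** 2).mean()  # 3. gradients flow to the encoder only:  ||q - sg(q')||^2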
NLL and commitment are the same as in VQ-VAE. We have an additional “diversity loss” which ensures that all possible codes are used: their expectation over time/batch should be one over the number of codes C.
Note that the commitment term is an L2 loss; putting this on a softmax output doesn't really work. This is another reason for the ReLU + divisive normalisation (see the sketch below).
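For contrast, a rough sketch (my own; details assumed) of the argmax-autoencoder style of quantisation described above, with the diversity term that pushes average code usage towards 1/C:

import numpy as np

rng = np.random.default_rng(0)
C, T = 512, 100                                    # C possible codes, T timesteps
pre = rng.normal(size=(T, C))                      # encoder pre-activations

q = np.maximum(pre, 0.0)                           # ReLU ...
q = q / (q.sum(axis=1, keepdims=True) + 1e-8)      # ... plus divisive normalisation: queries lie on the simplex

codes = q.argmax(axis=1)                           # discrete codes
one_hot = np.eye(C)[codes]                         # quantised queries q'

commitment_loss = ((q - one_hot) ** 2).mean()      # L2 pulls the queries towards one-hot vectors
usage = q.mean(axis=0)                             # average usage of each code over time/batch
diversity_loss = ((usage - 1.0 / C) ** 2).sum()    # encourage all C codes to be used roughly equally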
28 respondents rated clips from 1 to 5 (calibrated using real excerpts from the dataset) in terms of audio fidelity and musicality.
For musicality, we clearly see that stacking more levels makes the score improve.
There is an upcoming blog post about the stacked model as well. Hopefully within the next month or so.