FastSpeech 1&2

FastSpeech 1: A non-autoregressive TTS model that focuses on synthesis speed, robustness, and controllability.

Link: https://arxiv.org/pdf/1905.09263.pdf

Github: https://github.com/xcmyz/FastSpeech

FastSpeech 2: Better addresses the one-to-many mapping problem (one text input can correspond to many possible speech utterances) and predicts more speech features, such as pitch, energy, and more accurate duration.

Link: https://arxiv.org/pdf/2006.04558.pdf

Github: https://github.com/ming024/FastSpeech2

FastSpeech 1

Challenges

  • Slow inference speed, because autoregressive models such as RNNs generate mel-spectrogram frames one by one.
  • Weak robustness (e.g. word skipping and repeating), due to error propagation in autoregressive generation and incorrect attention alignment between text and speech.
  • Lack of controllability over voice speed and prosody, because autoregressive generation offers no direct handle on the output length.

Therefore, FastSpeech uses a non-autoregressive model instead of an RNN or any other autoregressive model.

Solution

(Figure: overall FastSpeech architecture)

Slow inference speed

FastSpeech generates all mel-spectrogram frames in parallel with a non-autoregressive model, avoiding the slow frame-by-frame generation of autoregressive models.

The FFT (Feed-Forward Transformer) block, which combines self-attention with a 1D convolutional network, is the main building block of this network.
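A minimal PyTorch sketch of an FFT block: self-attention followed by a two-layer 1D conv network, each sub-layer with a residual connection and layer norm. Layer sizes and kernel widths here are illustrative, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """Feed-Forward Transformer block (sketch): multi-head self-attention,
    then a 2-layer 1D convolutional network, each with residual + LayerNorm."""
    def __init__(self, d_model=256, n_heads=2, d_conv=1024, kernel=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_conv, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(d_conv, d_model, kernel, padding=kernel // 2),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (batch, seq_len, d_model)
        a, _ = self.attn(x, x, x)             # self-attention over the sequence
        x = self.norm1(x + a)                 # residual + norm
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # conv works on (B, C, T)
        return self.norm2(x + c)              # residual + norm
```

The same block is stacked on both the phoneme side and the mel-spectrogram side of the network.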

(Figure: FFT block structure)

Weak robustness

FastSpeech uses a Length Regulator to solve the length mismatch between the phoneme sequence and the mel-spectrogram sequence; the core of the Length Regulator is the Duration Predictor.
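The Length Regulator itself is simple: it repeats each phoneme's hidden state according to that phoneme's duration, so the expanded sequence matches the mel-spectrogram length. A minimal sketch:

```python
import torch

def length_regulator(hidden, durations):
    """Expand each phoneme's hidden state by its duration (in frames).
    hidden: (seq_len, d_model) tensor; durations: list of non-negative ints.
    Returns a (sum(durations), d_model) tensor."""
    return torch.cat(
        [h.unsqueeze(0).repeat(int(d), 1)        # repeat row d times
         for h, d in zip(hidden, durations) if d > 0],
        dim=0,
    )
```

For example, three phoneme states with durations [2, 1, 3] expand to a six-frame sequence.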

(Figure: Length Regulator and Duration Predictor)

The Duration Predictor is trained with a pre-trained autoregressive Transformer TTS model acting as the teacher for alignment: phoneme durations are extracted from the teacher's attention alignments, and the MSE loss between the predicted and extracted durations is used to optimize the Duration Predictor.
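The training step can be sketched as a simple regression, assuming teacher-extracted durations are available (the predictor architecture and log-domain target below are illustrative; the paper regresses durations in the log domain):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the Duration Predictor (the real one is a
# small conv stack); it maps each phoneme hidden state to one scalar.
duration_predictor = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1)
)

phoneme_hidden = torch.randn(1, 5, 256)             # (batch, phonemes, d_model)
teacher_durations = torch.tensor([[2., 4., 1., 3., 2.]])  # extracted from teacher attention

pred = duration_predictor(phoneme_hidden).squeeze(-1)     # (batch, phonemes)
loss = nn.functional.mse_loss(pred, torch.log(teacher_durations))
loss.backward()                                     # optimize the predictor only
```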

Lack of controllability

FastSpeech uses a multiplication factor α applied to the predicted durations to control the duration of every phoneme, and thereby the voice speed.
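At inference time this is just a scalar multiply before the Length Regulator (the duration values below are made up; the rounding/clamping scheme is one reasonable choice, not necessarily the paper's exact one):

```python
# Durations predicted by the Duration Predictor (illustrative values, in frames)
durations = [2.0, 3.0, 1.0]

alpha = 0.5  # alpha < 1 -> faster speech, alpha > 1 -> slower speech
scaled = [max(1, round(d * alpha)) for d in durations]  # keep at least 1 frame
```

The scaled durations are then fed to the Length Regulator, so the whole utterance is sped up or slowed down uniformly.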

(Figure: duration control with α)

FastSpeech 2 & 2s

Challenges

  • Slow, complicated training and inaccurate duration labels, due to the two-stage teacher–student pipeline introduced by the teacher model.
  • FastSpeech 1 can only predict phoneme duration; it cannot predict other speech features such as pitch and energy.
  • Not a truly end-to-end model: it outputs mel-spectrograms and still needs a separate vocoder to produce waveforms.

Solution

(Figure: overall FastSpeech 2 architecture)

Slow training speed and inaccurate alignment

Train the Duration Predictor with phoneme durations extracted directly from the ground-truth speech by forced alignment (Montreal Forced Aligner), removing the teacher model from the pipeline.

(Figure: duration extraction from ground-truth speech)

Only predicts phoneme duration

Use a Variance Adaptor to ease the one-to-many problem with a simpler pipeline: it predicts variance information (duration, pitch, energy) from the hidden sequence, and during training each predictor computes the corresponding loss between the ground-truth and predicted feature.
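Each variance (duration, pitch, energy) gets its own small predictor; a sketch of one such predictor is below (layer sizes are illustrative, and the real model also quantizes pitch/energy and adds their embeddings back to the hidden sequence):

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Sketch of one variance predictor: a small conv stack that
    regresses a scalar (e.g. pitch or energy) per position."""
    def __init__(self, d_model=256, kernel=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(d_model, 1, kernel, padding=kernel // 2),
        )

    def forward(self, x):                      # x: (batch, len, d_model)
        return self.net(x.transpose(1, 2)).squeeze(1)  # (batch, len)

# Training uses MSE between the prediction and the ground-truth feature;
# at inference, predictions replace the missing ground truth.
```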

(Figure: Variance Adaptor)

Not a truly End-to-End model

Use a Waveform Decoder instead of the Mel-spectrogram Decoder so that FastSpeech 2s generates waveforms directly from text, making it truly end-to-end.

(Figure: Waveform Decoder)

Copyright © 2021 BakerBunker
