FastSpeech 1&2

FastSpeech 1: A non-autoregressive TTS model that focuses on synthesis speed, robustness, and controllability.

Link: https://arxiv.org/pdf/1905.09263.pdf

Github: https://github.com/xcmyz/FastSpeech

FastSpeech 2: Better addresses the one-to-many mapping problem (one text input can correspond to many possible speech utterances) and predicts more speech features, such as pitch, energy, and more accurate duration.

Link: https://arxiv.org/pdf/2006.04558.pdf

Github: https://github.com/ming024/FastSpeech2

FastSpeech 1

Challenges

  • Slow inference speed, because autoregressive models such as RNNs generate mel-spectrogram frames one by one.
  • Weak robustness (e.g. word skipping and repeating), due to error propagation in autoregressive generation and incorrect attention alignment between text and speech.
  • Lack of controllability over voice speed and prosody, because autoregressive generation offers no direct handle on the output length.

Therefore, FastSpeech uses a non-autoregressive model instead of an RNN or any other autoregressive model.

Solution

(Figure: overall FastSpeech architecture)

Slow inference speed

FastSpeech generates all mel-spectrogram frames in parallel with a non-autoregressive model, avoiding the slow frame-by-frame generation of autoregressive models.

The FFT (Feed-Forward Transformer) block, which combines self-attention with a 1D convolutional network, is the main building block of this network.
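A minimal PyTorch sketch of an FFT block: self-attention followed by a two-layer 1D conv network, each sub-layer with a residual connection and layer norm. Layer sizes and kernel widths here are illustrative, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """Feed-Forward Transformer block (sketch): multi-head self-attention,
    then a 2-layer 1D convolutional network, each with residual + LayerNorm."""
    def __init__(self, d_model=256, n_heads=2, d_conv=1024, kernel=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_conv, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(d_conv, d_model, kernel, padding=kernel // 2),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (batch, seq_len, d_model)
        a, _ = self.attn(x, x, x)             # self-attention over the sequence
        x = self.norm1(x + a)                 # residual + norm
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # conv works on (B, C, T)
        return self.norm2(x + c)              # residual + norm
```

The same block is stacked on both the phoneme side and the mel-spectrogram side of the network.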

(Figure: FFT block structure)

Weak robustness

FastSpeech uses a Length Regulator to solve the length mismatch between the phoneme sequence and the mel-spectrogram sequence; the core of the Length Regulator is the Duration Predictor.
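The Length Regulator itself is simple: it repeats each phoneme's hidden state according to that phoneme's duration, so the expanded sequence matches the mel-spectrogram length. A minimal sketch:

```python
import torch

def length_regulator(hidden, durations):
    """Expand each phoneme's hidden state by its duration (in frames).
    hidden: (seq_len, d_model) tensor; durations: list of non-negative ints.
    Returns a (sum(durations), d_model) tensor."""
    return torch.cat(
        [h.unsqueeze(0).repeat(int(d), 1)        # repeat row d times
         for h, d in zip(hidden, durations) if d > 0],
        dim=0,
    )
```

For example, three phoneme states with durations [2, 1, 3] expand to a six-frame sequence.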

(Figure: Length Regulator and Duration Predictor)

The Duration Predictor is trained with a pre-trained autoregressive Transformer TTS model acting as the teacher for alignment: phoneme durations are extracted from the teacher's attention alignments, and the MSE loss between the predicted and extracted durations is used to optimize the Duration Predictor.
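The training step can be sketched as a simple regression, assuming teacher-extracted durations are available (the predictor architecture and log-domain target below are illustrative; the paper regresses durations in the log domain):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the Duration Predictor (the real one is a
# small conv stack); it maps each phoneme hidden state to one scalar.
duration_predictor = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1)
)

phoneme_hidden = torch.randn(1, 5, 256)             # (batch, phonemes, d_model)
teacher_durations = torch.tensor([[2., 4., 1., 3., 2.]])  # extracted from teacher attention

pred = duration_predictor(phoneme_hidden).squeeze(-1)     # (batch, phonemes)
loss = nn.functional.mse_loss(pred, torch.log(teacher_durations))
loss.backward()                                     # optimize the predictor only
```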

Lack of controllability

FastSpeech uses a multiplication factor α applied to the predicted durations to control the duration of every phoneme, and thereby the voice speed.
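At inference time this is just a scalar multiply before the Length Regulator (the duration values below are made up; the rounding/clamping scheme is one reasonable choice, not necessarily the paper's exact one):

```python
# Durations predicted by the Duration Predictor (illustrative values, in frames)
durations = [2.0, 3.0, 1.0]

alpha = 0.5  # alpha < 1 -> faster speech, alpha > 1 -> slower speech
scaled = [max(1, round(d * alpha)) for d in durations]  # keep at least 1 frame
```

The scaled durations are then fed to the Length Regulator, so the whole utterance is sped up or slowed down uniformly.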

(Figure: duration control with α)

FastSpeech 2 & 2s

Challenges

  • Slow, complicated training and inaccurate duration labels, due to the two-stage teacher–student pipeline introduced by the teacher model.
  • FastSpeech 1 can only predict phoneme duration; it cannot predict other speech features such as pitch and energy.
  • Not a truly end-to-end model: it outputs mel-spectrograms and still needs a separate vocoder to produce waveforms.

Solution

(Figure: overall FastSpeech 2 architecture)

Slow training speed and inaccurate alignment

Train the Duration Predictor with phoneme durations extracted directly from the ground-truth speech by forced alignment (Montreal Forced Aligner), removing the teacher model from the pipeline.

(Figure: duration extraction from ground-truth speech)

Only predicts phoneme duration

Use a Variance Adaptor to ease the one-to-many problem with a simpler pipeline: it predicts variance information (duration, pitch, energy) from the hidden sequence, and during training each predictor computes the corresponding loss between the ground-truth and predicted feature.
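Each variance (duration, pitch, energy) gets its own small predictor; a sketch of one such predictor is below (layer sizes are illustrative, and the real model also quantizes pitch/energy and adds their embeddings back to the hidden sequence):

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Sketch of one variance predictor: a small conv stack that
    regresses a scalar (e.g. pitch or energy) per position."""
    def __init__(self, d_model=256, kernel=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(d_model, 1, kernel, padding=kernel // 2),
        )

    def forward(self, x):                      # x: (batch, len, d_model)
        return self.net(x.transpose(1, 2)).squeeze(1)  # (batch, len)

# Training uses MSE between the prediction and the ground-truth feature;
# at inference, predictions replace the missing ground truth.
```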

(Figure: Variance Adaptor)

Not a truly End-to-End model

Use a Waveform Decoder instead of the Mel-spectrogram Decoder so that FastSpeech 2s generates waveforms directly from text, making it truly end-to-end.

(Figure: Waveform Decoder)

Copyright © 2021 BakerBunker
