WaveNet

An autoregressive vocoder, closely related to flow-based models. Published in 2016.

Link: https://arxiv.org/pdf/1609.03499.pdf

Github: https://github.com/r9y9/wavenet_vocoder

Introduction

WaveNet is an autoregressive vocoder, which means that it takes its own previous outputs as input. Its formulation is closely related to flow-based models, so I will first briefly explain what a flow is.

Flow

A flow is a probabilistic modeling method: it represents a complex probability distribution as a simple base distribution transformed by a series of invertible (reversible) functions.

A flow is built on the normalization constraint $\int_{-\infty}^{\infty}p(x)\,dx=1$, so for each invertible function $y=f(x)$ we have

$$p_y(y)\cdot dy = p_x(x)\cdot dx$$

Transformation of probability distribution

Because we do not care whether $dx$ or $dy$ is positive or negative, and a probability density is always non-negative, we have

$$p_x(x)=p_y(y)\left|\frac{dy}{dx}\right|$$

Then we can move to the log domain, which gives

$$\log{p_x(x)}=\log{p_y(y)}+\log{\left|\frac{dy}{dx}\right|}$$

Now consider the multivariate case, where we replace $\frac{dy}{dx}$ with $\det{J(f(x))}$, with $J(\cdot)$ denoting the Jacobian matrix of a function. We can make this substitution because

The Jacobian matrix represents the differential of f at every point where f is differentiable.

​ ——Wikipedia

Then we can chain such transformations: assume $x_n=f_n(f_{n-1}(\dots f_1(x_0)))$, and we have

$$\log{p_n(x_n)}=\log{p_0(x_0)}-\sum_{i=1}^{n}\log{\left|\det{J(f_i(x_{i-1}))}\right|}$$

This is how the probability distribution "flows" through the chain of transformations, which is where the name comes from.
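As a quick single-variable example (my own illustration, not from the paper): if $x \sim \mathcal{N}(0,1)$ and $y=f(x)=e^{x}$, then $x=\log y$ and $\left|\frac{dy}{dx}\right|=e^{x}=y$, so the change-of-variables rule gives

$$\log p_y(y)=\log p_x(\log y)-\log y$$

which is exactly the log-normal density.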

Autoregressive Flow

As I have said above, autoregressive means that the model takes its previous outputs as input. Combining the autoregressive structure with a flow gives an autoregressive flow, which is the formulation connected to WaveNet.

In an autoregressive flow, we have

$$z_t = x_t \cdot\sigma_t(x_{<t};\upsilon)+\mu_t(x_{<t};\upsilon), \qquad x_t=\frac{z_t-\mu_t(x_{<t};\upsilon)}{\sigma_t(x_{<t};\upsilon)}$$

where $\sigma_t(x_{<t};\upsilon)$ and $\mu_t(x_{<t};\upsilon)$ are neural networks with parameters $\upsilon$ that take $x_{<t}$ (the outputs before time $t$) as input. Then we can easily compute the Jacobian matrix of $z(x)$, and we get

Jacobian matrix

Note that this Jacobian matrix is lower triangular, because $z_t$ depends only on $x_{\le t}$, so its determinant is easy to compute: it is just the product of the diagonal entries.
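Spelled out (my own elaboration of what the Jacobian-matrix figure shows), with $T$ time steps and positive scales, the Jacobian and its log-determinant are

$$J=\frac{\partial z}{\partial x}=
\begin{pmatrix}
\sigma_1 & 0 & \cdots & 0\\
\frac{\partial z_2}{\partial x_1} & \sigma_2 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial z_T}{\partial x_1} & \frac{\partial z_T}{\partial x_2} & \cdots & \sigma_T
\end{pmatrix},
\qquad
\log\left|\det J\right|=\sum_{t=1}^{T}\log\sigma_t(x_{<t};\upsilon)$$

since $\partial z_t/\partial x_t=\sigma_t(x_{<t};\upsilon)$ and $\sigma_t$, $\mu_t$ do not depend on $x_t$ itself.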

During training we have

Train

and during inference we have

Inference

We use $\exp(\alpha)$ because the scale is often modeled in the log domain, i.e., the network predicts $\alpha=\log\sigma$, which keeps $\sigma$ positive.
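The Train and Inference figures are not reproduced here; under the parameterization $\sigma_t=\exp(\alpha_t)$ just mentioned, the two directions would plausibly read as follows (my reconstruction from the transform defined above). During training, every $x_{<t}$ is observed, so all time steps can be evaluated in parallel:

$$z_t = x_t \cdot \exp\big(\alpha_t(x_{<t};\upsilon)\big)+\mu_t(x_{<t};\upsilon)$$

During inference, each $x_t$ needs the previously generated $x_{<t}$, so generation is sequential in $t$:

$$x_t = \big(z_t-\mu_t(x_{<t};\upsilon)\big)\cdot \exp\big(-\alpha_t(x_{<t};\upsilon)\big)$$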

Novelty

Dilated Causal Convolution

The main ingredient of WaveNet is the causal convolution.

By using causal convolutions, the model cannot depend on any of the future timesteps.

Causal convolution

But compared to an RNN, the receptive field of stacked causal convolutions grows slowly; in the figure above it is only 5 (layers + filter length - 1). So the paper uses dilated convolutions instead.

A dilated convolution (also called à trous, or convolution with holes) is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step.

Dilated convolution

By using dilated convolutions, the receptive field of the model in the figure grows to 16 ($2^{\text{layers}}$), and it increases exponentially as the number of layers increases.
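As a rough sketch of the idea (minimal PyTorch code of my own, not the actual r9y9/wavenet_vocoder implementation), a stack of kernel-size-2 causal convolutions with dilations 1, 2, 4, 8 reaches a receptive field of 16 samples:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv1d(nn.Module):
    """1-D convolution that only sees past samples (pads on the left only)."""

    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                           # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))   # left-pad keeps the output causal


# dilations 1, 2, 4, 8 -> receptive field 1 + (1 + 2 + 4 + 8) = 16 samples (2^layers)
stack = nn.Sequential(*[CausalConv1d(32, kernel_size=2, dilation=d) for d in (1, 2, 4, 8)])
out = stack(torch.randn(1, 32, 100))
print(out.shape)                                    # torch.Size([1, 32, 100])
```

Each layer left-pads its input so that output at time $t$ only depends on inputs at times $\le t$; doubling the dilation at every layer doubles the receptive field.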

Gated Activation Units

Same as PixelCNN.

$$z=\tanh(W_{f,k}\ast x)\odot \sigma(W_{g,k}\ast x)$$
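Here $\ast$ denotes convolution, $\odot$ element-wise multiplication, $\sigma(\cdot)$ the sigmoid, $k$ the layer index, and $f$/$g$ the filter and gate branches. A minimal PyTorch sketch of this unit (my own illustration, using the same causal padding as above; the repository's implementation differs in details):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedActivation(nn.Module):
    """z = tanh(W_f * x) ⊙ sigmoid(W_g * x) with a causal dilated convolution."""

    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        # one convolution produces both the filter and the gate branch
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, dilation=dilation)

    def forward(self, x):                           # x: (batch, channels, time)
        h = self.conv(F.pad(x, (self.pad, 0)))      # causal (left-only) padding
        f, g = h.chunk(2, dim=1)                    # split into filter / gate halves
        return torch.tanh(f) * torch.sigmoid(g)


out = GatedActivation(32)(torch.randn(1, 32, 100))
print(out.shape)                                    # torch.Size([1, 32, 100])
```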

Conditional WaveNets

WaveNet can take an additional input $\mathbf{h}$ and model the conditional distribution $p(x\mid\mathbf{h})$.

This can be used for multi-speaker generation (conditioning on the speaker identity) or to take extra information, such as linguistic or acoustic features, as input.
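For global conditioning (e.g., a speaker embedding $\mathbf{h}$), the paper adds a conditioning term inside the gated activation unit:

$$z=\tanh(W_{f,k}\ast x+V_{f,k}^{T}\mathbf{h})\odot \sigma(W_{g,k}\ast x+V_{g,k}^{T}\mathbf{h})$$

For local conditioning (features that vary over time), the $V_{f,k}^{T}\mathbf{h}$ terms are replaced by $1\times1$ convolutions over an upsampled feature sequence.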

