
Tacotron2 Model Explained

August 15, 2020 • Deep Learning

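1 Overview

Tacotron2 is an end-to-end speech synthesis framework proposed by Google Brain in 2017. From bottom to top, the model can be viewed as two parts:

  1. Spectrogram prediction network: an Encoder-Attention-Decoder network that maps the input character sequence to a sequence of mel-spectrogram frames
  2. Vocoder: a modified WaveNet that generates time-domain waveforms from the predicted mel-spectrogram frames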

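2 Encoder

The Encoder takes a batch of sentences as input, and the basic unit of each sentence is the character. For example:

  • English "hello world" is split into "h e l l o w o r l d" as input
  • Chinese "你好世界" is first converted to pinyin, "ni hao shi jie", and then split either into initials and finals, "n i h ao sh i j ie", or letter by letter as in English, "n i h a o s h i j i e"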
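
The two splitting schemes above can be sketched in a few lines of Python. This is only an illustration: the helper names (`split_characters`, `split_initial_final`, `split_pinyin`) are hypothetical, the initials table is an assumption (it treats y and w as initials), and spaces are dropped because that is what the example in the text does.

```python
# Hypothetical helpers for illustration; not part of any Tacotron2 code base.
# Two-letter initials come first so that "sh" wins over "s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def split_characters(text):
    """Split text into single characters, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def split_initial_final(syllable):
    """Split one pinyin syllable into [initial, final]."""
    for initial in INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]  # syllable without an initial, such as "ai"


def split_pinyin(sentence):
    """Split a space-separated pinyin sentence into initials and finals."""
    units = []
    for syllable in sentence.split():
        units.extend(split_initial_final(syllable))
    return units


print(split_characters("hello world"))
print(split_pinyin("ni hao shi jie"))
```

Each resulting unit would then be mapped to an integer id before it reaches the embedding layer.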

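The Encoder works as follows:

  1. The input has shape [batch_size, char_seq_length]
  2. A 512-dimensional character embedding maps each character to a 512-dimensional vector, giving an output of shape [batch_size, char_seq_length, 512]
  3. Three 1-D convolutions follow, each with 512 kernels of size 5*1 (that is, each kernel spans 5 characters). Each convolution is followed by BatchNorm, ReLU, and Dropout. The output shape is [batch_size, char_seq_length, 512]; padding is used so that the sequence length stays the same after every convolution
  4. The result is fed to a single-layer BiLSTM with hidden size 256. Because the LSTM is bidirectional, the final output shape is [batch_size, char_seq_length, 512]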

    import torch
    import torch.nn as nn
    import torch.nn.functional as F  # used by the later excerpts


    class Encoder(nn.Module):
        def __init__(self, hparams):
            super(Encoder, self).__init__()

            # Stack of Conv1d + BatchNorm blocks (forward() is not shown in this excerpt)
            convolutions = []
            for _ in range(hparams.encoder_n_convolutions):
                conv_layer = nn.Sequential(
                    ConvNorm(hparams.encoder_embedding_dim,
                             hparams.encoder_embedding_dim,
                             kernel_size=hparams.encoder_kernel_size, stride=1,
                             padding=int((hparams.encoder_kernel_size - 1) / 2),
                             dilation=1, w_init_gain='relu'),
                    nn.BatchNorm1d(hparams.encoder_embedding_dim))
                convolutions.append(conv_layer)
            self.convolutions = nn.ModuleList(convolutions)

            # Single-layer BiLSTM; each direction gets half of the embedding width
            self.lstm = nn.LSTM(hparams.encoder_embedding_dim,
                                int(hparams.encoder_embedding_dim / 2), 1,
                                batch_first=True, bidirectional=True)


    class ConvNorm(torch.nn.Module):
        def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
                     padding=None, dilation=1, bias=True, w_init_gain='linear'):
            super(ConvNorm, self).__init__()
            if padding is None:
                # "Same" padding is only well defined for odd kernel sizes
                assert(kernel_size % 2 == 1)
                padding = int(dilation * (kernel_size - 1) / 2)

            self.conv = torch.nn.Conv1d(in_channels, out_channels,
                                        kernel_size=kernel_size, stride=stride,
                                        padding=padding, dilation=dilation,
                                        bias=bias)

            # Xavier initialization, scaled for the activation that follows
            torch.nn.init.xavier_uniform_(
                self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))

        def forward(self, signal):
            # Added so the excerpt is usable on its own: just apply the convolution
            return self.conv(signal)
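
To see why the sequence length survives all three convolutions, the shapes can be traced with plain arithmetic. The sketch below assumes the standard Conv1d output-length formula and reports shapes as [batch, time, channels], as the text does (the convolution itself operates on [batch, channels, time]). The function names and the example length 151 are illustrative choices.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution (standard Conv1d formula)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def encoder_shapes(batch_size, seq_len, embed_dim=512, kernel_size=5, n_convs=3):
    """Trace the tensor shapes through embedding, convolutions, and BiLSTM."""
    shapes = [("input ids", (batch_size, seq_len)),
              ("embedding", (batch_size, seq_len, embed_dim))]
    padding = (kernel_size - 1) // 2  # 2 for kernel size 5
    length = seq_len
    for i in range(n_convs):
        length = conv1d_out_len(length, kernel_size, padding=padding)
        shapes.append((f"conv {i + 1}", (batch_size, length, embed_dim)))
    hidden = embed_dim // 2  # 256 per direction
    shapes.append(("BiLSTM", (batch_size, length, 2 * hidden)))
    return shapes


for name, shape in encoder_shapes(batch_size=2, seq_len=151):
    print(f"{name:10s} {shape}")
```

Without padding, each kernel-5 convolution would shorten the sequence by 4 steps, so three of them would lose 12 positions.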

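3 Attention Mechanism

The figure above shows the inputs and outputs of the first attention step. Here $y_0$ is the encoded representation of the PreNet's initial input <S>, and $c_0$ is the current "attention context". At the very first step, both $y_0$ and $c_0$ are initialized to all-zero vectors. They are then concatenated into a 768-dimensional vector $y_{0,c}$ (256 from the PreNet output plus 512 from the attention context), which is passed to the LSTMCell together with attention_hidden and attention_cell (attention_hidden is simply the LSTMCell's hidden_state, and attention_cell is its cell_state). The result is $h_1$ and an updated attention_cell. The attention_cell is not given a separate name here because it only plays a supporting role: nothing outside attention_rnn uses it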

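Attention_Layer takes five inputs:

  1. $h_1$, a variable related to the mel spectrogram
  2. $m$, the features extracted from the source character sequence by the Encoder
  3. $m'$, obtained by passing $m$ through a Linear layer
  4. attention_weights_cat, the concatenation of the previous step's attention_weights and the cumulative attention_weights_cum
  5. mask, which is all False here and has essentially no effect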

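The computation proceeds as follows:

The core of it is get_alignment_energies. This function brings in location features, which makes it a hybrid attention mechanism

Hybrid attention combines content-based attention (conventional attention) with location-based attention: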

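$$ e_{i,j}=\mathrm{score}(s_{i-1},\alpha_{i-1},h_j) $$

where $s_{i-1}$ is the previous decoder hidden state, $\alpha_{i-1}$ is the previous attention weights, and $h_j$ is the $j$-th encoder hidden state. Adding a bias $b$, the score function becomes:

$$ e_{i,j}=v_a^T\tanh(Ws_{i-1}+Vh_j+Uf_{i,j}+b) $$

where $v_a$, $W$, $V$, $U$, and $b$ are trainable parameters, and $f_{i,j}$ is the location feature obtained by convolving the previous attention weights $\alpha_{i-1}$: $f_i=F*\alpha_{i-1}$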

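Tacotron2's attention is essentially the same as hybrid attention, with a few differences:

$$ e_{i,j}=\mathrm{score}(s_i,c\alpha_{i-1},h_j)=v_a^T\tanh(Ws_i+Vh_j+Uf_{i,j}+b) $$

where $s_i$ is the current decoder hidden state rather than the previous one, the bias $b$ is initialized to 0, and the location feature $f_i$ is obtained by convolving the cumulative attention weights $c\alpha_{i-1}$:

$$ \begin{aligned} f_i&=F*c\alpha_{i-1}\\ c\alpha_{i-1}&=\sum_{j=1}^{i-1}\alpha_j \end{aligned} $$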
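
The score function can be worked through numerically in plain Python. This is a minimal sketch under simplifying assumptions: a single location filter $F$ (so $f_{i,j}$ is a scalar and $U$ is a vector), tiny made-up dimensions, and hand-picked parameter values. The function names are hypothetical.

```python
import math


def conv1d_same(signal, kernel):
    """Zero-padded 'same' 1-D convolution (cross-correlation form), odd kernel length."""
    pad = (len(kernel) - 1) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(k * padded[j + t] for t, k in enumerate(kernel))
            for j in range(len(signal))]


def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def alignment_energies(s_i, memory, cum_alpha_prev, W, V, U, b, v_a, F):
    """e_{i,j} = v_a^T tanh(W s_i + V h_j + U f_{i,j} + b) for every encoder step j."""
    f_i = conv1d_same(cum_alpha_prev, F)  # f_i = F * c_alpha_{i-1}
    Ws = matvec(W, s_i)                   # shared by all encoder steps
    energies = []
    for h_j, f_ij in zip(memory, f_i):
        Vh = matvec(V, h_j)
        hidden = [math.tanh(ws + vh + u * f_ij + bias)
                  for ws, vh, u, bias in zip(Ws, Vh, U, b)]
        energies.append(sum(v * x for v, x in zip(v_a, hidden)))
    return energies


# Toy setup: 3 encoder steps, 2-dim states, attention dimension 2 (values made up)
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_i = [0.5, -0.5]
W = [[0.1, 0.2], [0.3, 0.4]]
V = [[0.5, 0.0], [0.0, 0.5]]
U = [1.0, -1.0]
b = [0.0, 0.0]
v_a = [1.0, 1.0]
F = [0.25, 0.5, 0.25]
cum_alpha_prev = [0.9, 0.1, 0.0]  # sum of alpha_1 .. alpha_{i-1}

e_i = alignment_energies(s_i, memory, cum_alpha_prev, W, V, U, b, v_a, F)
print(e_i)
```

After the softmax, the new weights $\alpha_i$ would be added to `cum_alpha_prev` elementwise to form the cumulative weights for the next step.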

The get_alignment_energies function is illustrated below:

The code for the Location_Layer in the figure is:

    class LocationLayer(nn.Module):
        def __init__(self, attention_n_filters, attention_kernel_size,  # 32, 31
                     attention_dim):  # 128
            super(LocationLayer, self).__init__()
            padding = int((attention_kernel_size - 1) / 2)  # padding=15
            # 2 input channels: previous and cumulative attention weights
            self.location_conv = ConvNorm(2, attention_n_filters,
                                          kernel_size=attention_kernel_size,
                                          padding=padding, bias=False, stride=1,
                                          dilation=1)
            self.location_dense = LinearNorm(attention_n_filters, attention_dim,
                                             bias=False, w_init_gain='tanh')

        def forward(self, attention_weights_cat):  # [1, 2, 151]
            processed_attention = self.location_conv(attention_weights_cat)  # [1, 32, 151]
            processed_attention = processed_attention.transpose(1, 2)  # [1, 151, 32]
            processed_attention = self.location_dense(processed_attention)  # [1, 151, 128]
            return processed_attention
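
The shape comments in `forward` can be reproduced without PyTorch. The sketch below is a pure-Python stand-in for the two-channel convolution, not the real layer: `location_conv` is a hypothetical function, the filter values are arbitrary, and 151 is assumed to be the number of encoder time steps (`max_time`) in the author's example input.

```python
def location_conv(weights_cat, filters):
    """Same-padded multi-channel 1-D convolution on nested lists.

    weights_cat: [2][T] (previous and cumulative attention weights)
    filters:     [n_filters][2][K] with odd K
    returns:     [n_filters][T]
    """
    n_steps = len(weights_cat[0])
    kernel_size = len(filters[0][0])
    pad = (kernel_size - 1) // 2
    padded = [[0.0] * pad + list(channel) + [0.0] * pad for channel in weights_cat]
    out = []
    for filt in filters:
        row = []
        for j in range(n_steps):
            acc = 0.0
            for c, kernel in enumerate(filt):
                for t, w in enumerate(kernel):
                    acc += w * padded[c][j + t]
            row.append(acc)
        out.append(row)
    return out


T = 151                                  # encoder time steps in the example
prev_weights = [1.0] + [0.0] * (T - 1)   # all attention on the first step
cum_weights = [1.0] + [0.0] * (T - 1)
filters = [[[1.0 / 31] * 31 for _ in range(2)] for _ in range(32)]  # 32 filters, K=31

features = location_conv([prev_weights, cum_weights], filters)
transposed = [list(col) for col in zip(*features)]  # plays the role of transpose(1, 2)

print(len(features), len(features[0]))      # 32 151
print(len(transposed), len(transposed[0]))  # 151 32
```

The final dense layer then maps each of the 151 rows from 32 to 128 values, which gives the [1, 151, 128] shape in the last comment.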

The code for Attention is:

    class Attention(nn.Module):
        def __init__(self, attention_rnn_dim, embedding_dim, attention_dim,
                     attention_location_n_filters, attention_location_kernel_size):
            super(Attention, self).__init__()
            self.query_layer = LinearNorm(attention_rnn_dim, attention_dim,
                                          bias=False, w_init_gain='tanh')
            self.memory_layer = LinearNorm(embedding_dim, attention_dim, bias=False,
                                           w_init_gain='tanh')
            self.v = LinearNorm(attention_dim, 1, bias=False)
            self.location_layer = LocationLayer(attention_location_n_filters,
                                                attention_location_kernel_size,
                                                attention_dim)
            self.score_mask_value = -float("inf")

        def get_alignment_energies(self, query, processed_memory,
                                   attention_weights_cat):
            """
            PARAMS
            ------
            query: decoder output (batch, n_mel_channels * n_frames_per_step)
            processed_memory: processed encoder outputs (B, T_in, attention_dim)
            attention_weights_cat: cumulative and prev. att weights (B, 2, max_time)

            RETURNS
            -------
            alignment (batch, max_time)
            """
            processed_query = self.query_layer(query.unsqueeze(1))
            processed_attention_weights = self.location_layer(attention_weights_cat)
            energies = self.v(torch.tanh(
                processed_query + processed_attention_weights + processed_memory))

            energies = energies.squeeze(-1)
            return energies

        def forward(self, attention_hidden_state, memory, processed_memory,
                    attention_weights_cat, mask):
            """
            PARAMS
            ------
            attention_hidden_state: attention rnn last output
            memory: encoder outputs
            processed_memory: processed encoder outputs
            attention_weights_cat: previous and cumulative attention weights
            mask: binary mask for padded data
            """
            alignment = self.get_alignment_energies(
                attention_hidden_state, processed_memory, attention_weights_cat)

            if mask is not None:
                alignment.data.masked_fill_(mask, self.score_mask_value)

            attention_weights = F.softmax(alignment, dim=1)
            attention_context = torch.bmm(attention_weights.unsqueeze(1), memory)
            attention_context = attention_context.squeeze(1)

            return attention_context, attention_weights
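
The last few lines of `forward` (mask, softmax, weighted sum) can be illustrated for a single sequence in plain Python. This is a sketch with hypothetical function names; it assumes at least one position is left unmasked, since a fully masked row has no valid softmax.

```python
import math


def masked_softmax(energies, mask=None):
    """Softmax over energies; positions where mask is True receive weight 0."""
    if mask is not None:
        energies = [-math.inf if m else e for e, m in zip(energies, mask)]
    peak = max(energies)  # subtract the maximum for numerical stability
    exps = [math.exp(e - peak) for e in energies]
    total = sum(exps)
    return [x / total for x in exps]


def attention_context(weights, memory):
    """Weighted sum of encoder outputs, the single-sequence analogue of torch.bmm."""
    dim = len(memory[0])
    return [sum(w * h[d] for w, h in zip(weights, memory)) for d in range(dim)]


energies = [0.0, 0.0, 5.0]
mask = [False, False, True]          # the third position is padding
memory = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]

weights = masked_softmax(energies, mask)
context = attention_context(weights, memory)
print(weights)   # [0.5, 0.5, 0.0]
print(context)   # [0.5, 0.5]
```

Filling masked positions with negative infinity before the softmax is what guarantees that padded encoder steps contribute nothing to the context vector, however large their raw energy was.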

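4 Decoder

The decoder is autoregressive: it predicts the spectrogram from the encoded input sequence, one frame at a time

  1. The spectrogram frame predicted at the previous step is first passed through a PreNet made of two fully connected layers. The PreNet acts as an information bottleneck, which is necessary for learning attention
  2. The PreNet output is concatenated with the attention context vector and passed to a two-layer LSTM with 1024 units. The LSTM output is concatenated with the attention context vector again and passed through a linear projection to predict the target spectrogram frame
  3. Finally, the predicted frame goes through a 5-layer convolutional PostNet, and the PostNet output is added to the linear projection output (a residual connection) to form the final output
  4. In parallel, the concatenation of the LSTM output and the attention context vector is projected to a scalar and passed through a sigmoid to predict whether the output sequence is complete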
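
The autoregressive loop and the stop decision in step 4 can be sketched as follows. The `toy_step` function is a hypothetical stand-in for the whole PreNet + LSTM + projection stack, and the default values for `gate_threshold` and `max_decoder_steps` are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode(step_fn, first_frame, gate_threshold=0.5, max_decoder_steps=1000):
    """Generate frames one at a time until the stop gate fires or the step limit is hit."""
    frames = []
    prev_frame = first_frame
    for _ in range(max_decoder_steps):
        frame, gate_logit = step_fn(prev_frame)
        frames.append(frame)
        if sigmoid(gate_logit) > gate_threshold:
            break  # the model predicts that the utterance is complete
        prev_frame = frame  # feed the prediction back in
    return frames


def toy_step(prev_frame):
    """Hypothetical decoder step: returns (next frame, stop-gate logit)."""
    frame = [x + 1.0 for x in prev_frame]
    gate_logit = frame[0] - 3.5  # turns positive on the fourth step
    return frame, gate_logit


frames = decode(toy_step, first_frame=[0.0, 0.0])
print(len(frames), frames[-1])
```

The step limit matters in practice: if attention fails and the gate never fires, the loop would otherwise run forever.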

The PreNet is illustrated and implemented as follows:

    class LinearNorm(torch.nn.Module):
        def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'):
            super(LinearNorm, self).__init__()
            self.linear_layer = torch.nn.Linear(in_dim, out_dim, bias=bias)

            # Xavier initialization, scaled for the activation that follows
            torch.nn.init.xavier_uniform_(
                self.linear_layer.weight,
                gain=torch.nn.init.calculate_gain(w_init_gain))

        def forward(self, x):
            return self.linear_layer(x)


    class Prenet(nn.Module):
        def __init__(self, in_dim, sizes):
            super(Prenet, self).__init__()
            in_sizes = [in_dim] + sizes[:-1]
            self.layers = nn.ModuleList(
                [LinearNorm(in_size, out_size, bias=False)
                 for (in_size, out_size) in zip(in_sizes, sizes)])

        def forward(self, x):
            for linear in self.layers:
                # training=True keeps dropout active at inference time as well
                x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
            return x
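
Note that `Prenet.forward` passes `training=True` unconditionally, so dropout stays on during inference too. The effect can be seen with a small pure-Python version of inverted dropout; the `dropout` function here is an illustrative re-implementation, not the PyTorch one.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout: zero each value with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(values)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)
x = [1.0] * 8

print(dropout(x, 0.5, True, rng))   # random pattern of 0.0 and 2.0
print(dropout(x, 0.5, True, rng))   # a different random pattern
print(dropout(x, 0.5, False, rng))  # unchanged: what eval mode would normally do
```

With dropout always on, two synthesis runs of the same text pass different PreNet outputs to the decoder, so the generated spectrograms are not bit-for-bit identical.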

The PostNet is illustrated and implemented as follows:

    class Postnet(nn.Module):
        """Postnet
            - Five 1-d convolution with 512 channels and kernel size 5
        """

        def __init__(self, hparams):
            super(Postnet, self).__init__()
            self.convolutions = nn.ModuleList()

            # First layer: mel channels -> postnet width
            self.convolutions.append(
                nn.Sequential(
                    ConvNorm(hparams.n_mel_channels, hparams.postnet_embedding_dim,
                             kernel_size=hparams.postnet_kernel_size, stride=1,
                             padding=int((hparams.postnet_kernel_size - 1) / 2),
                             dilation=1, w_init_gain='tanh'),
                    nn.BatchNorm1d(hparams.postnet_embedding_dim))
            )

            # Middle layers keep the postnet width
            for i in range(1, hparams.postnet_n_convolutions - 1):
                self.convolutions.append(
                    nn.Sequential(
                        ConvNorm(hparams.postnet_embedding_dim,
                                 hparams.postnet_embedding_dim,
                                 kernel_size=hparams.postnet_kernel_size, stride=1,
                                 padding=int((hparams.postnet_kernel_size - 1) / 2),
                                 dilation=1, w_init_gain='tanh'),
                        nn.BatchNorm1d(hparams.postnet_embedding_dim))
                )

            # Last layer: postnet width -> mel channels, no tanh afterwards
            self.convolutions.append(
                nn.Sequential(
                    ConvNorm(hparams.postnet_embedding_dim, hparams.n_mel_channels,
                             kernel_size=hparams.postnet_kernel_size, stride=1,
                             padding=int((hparams.postnet_kernel_size - 1) / 2),
                             dilation=1, w_init_gain='linear'),
                    nn.BatchNorm1d(hparams.n_mel_channels))
            )

        def forward(self, x):
            for i in range(len(self.convolutions) - 1):
                x = F.dropout(torch.tanh(self.convolutions[i](x)), 0.5, self.training)
            x = F.dropout(self.convolutions[-1](x), 0.5, self.training)

            return x
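
The channel layout built by the constructor, and the residual addition described in the decoder overview, can be checked with a short sketch. The default values (80 mel channels, width 512, 5 convolutions) are taken from the model printout in the summary section; the function names are hypothetical.

```python
def postnet_channels(n_mel_channels=80, postnet_embedding_dim=512, n_convolutions=5):
    """(in_channels, out_channels) for each PostNet convolution, in order."""
    dims = ([n_mel_channels]
            + [postnet_embedding_dim] * (n_convolutions - 1)
            + [n_mel_channels])
    return list(zip(dims[:-1], dims[1:]))


def add_residual(mel, postnet_out):
    """Final output: linear-projection mel plus the PostNet correction, elementwise."""
    return [[m + r for m, r in zip(mel_row, res_row)]
            for mel_row, res_row in zip(mel, postnet_out)]


print(postnet_channels())

mel = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]             # toy [channels][time] values
correction = [[0.01, 0.0, -0.01], [0.0, 0.02, 0.0]]  # toy PostNet output
print(add_residual(mel, correction))
```

Because the last convolution maps back to the mel channel count, the PostNet output has the same shape as its input and the two can be added directly.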

The Decoder's initialization below shows that it consists of prenet, attention_rnn, attention_layer, decoder_rnn, linear_projection, and gate_layer

    class Decoder(nn.Module):
        def __init__(self, hparams):
            super(Decoder, self).__init__()
            self.n_mel_channels = hparams.n_mel_channels
            self.n_frames_per_step = hparams.n_frames_per_step
            self.encoder_embedding_dim = hparams.encoder_embedding_dim
            self.attention_rnn_dim = hparams.attention_rnn_dim
            self.decoder_rnn_dim = hparams.decoder_rnn_dim
            self.prenet_dim = hparams.prenet_dim
            self.max_decoder_steps = hparams.max_decoder_steps
            self.gate_threshold = hparams.gate_threshold
            self.p_attention_dropout = hparams.p_attention_dropout
            self.p_decoder_dropout = hparams.p_decoder_dropout

            # Two-layer bottleneck applied to the previous mel frame
            self.prenet = Prenet(
                hparams.n_mel_channels * hparams.n_frames_per_step,
                [hparams.prenet_dim, hparams.prenet_dim])

            # Input: PreNet output concatenated with the attention context
            self.attention_rnn = nn.LSTMCell(
                hparams.prenet_dim + hparams.encoder_embedding_dim,
                hparams.attention_rnn_dim)

            self.attention_layer = Attention(
                hparams.attention_rnn_dim, hparams.encoder_embedding_dim,
                hparams.attention_dim, hparams.attention_location_n_filters,
                hparams.attention_location_kernel_size)

            # Input: attention RNN output concatenated with the attention context
            self.decoder_rnn = nn.LSTMCell(
                hparams.attention_rnn_dim + hparams.encoder_embedding_dim,
                hparams.decoder_rnn_dim, 1)

            # Both heads read the decoder RNN output concatenated with the context
            self.linear_projection = LinearNorm(
                hparams.decoder_rnn_dim + hparams.encoder_embedding_dim,
                hparams.n_mel_channels * hparams.n_frames_per_step)

            self.gate_layer = LinearNorm(
                hparams.decoder_rnn_dim + hparams.encoder_embedding_dim, 1,
                bias=True, w_init_gain='sigmoid')
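
The layer widths follow from simple concatenation arithmetic. The sketch below recomputes them; the default hyperparameter values are read off the model printout in the summary section (with `n_frames_per_step=1` inferred from the 80-wide projection), and `decoder_dims` is a hypothetical helper.

```python
def decoder_dims(n_mel_channels=80, n_frames_per_step=1, prenet_dim=256,
                 encoder_embedding_dim=512, attention_rnn_dim=1024,
                 decoder_rnn_dim=1024):
    """Input/output widths of the Decoder sub-modules."""
    return {
        "prenet_in": n_mel_channels * n_frames_per_step,
        # PreNet output concatenated with the attention context
        "attention_rnn_in": prenet_dim + encoder_embedding_dim,
        # attention RNN output concatenated with the attention context
        "decoder_rnn_in": attention_rnn_dim + encoder_embedding_dim,
        # decoder RNN output concatenated with the attention context
        "linear_projection_in": decoder_rnn_dim + encoder_embedding_dim,
        "linear_projection_out": n_mel_channels * n_frames_per_step,
        "gate_layer_in": decoder_rnn_dim + encoder_embedding_dim,
    }


for name, width in decoder_dims().items():
    print(f"{name:22s} {width}")
```

The 768 here is the same 768-dimensional concatenated vector described in the attention section.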

5 Summary

The complete network structure of the Tacotron2 model:

    Tacotron2(
      (embedding): Embedding(148, 512)
      (encoder): Encoder(
        (convolutions): ModuleList(
          (0): Sequential(
            (0): ConvNorm(
              (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
          (1): Sequential(
            (0): ConvNorm(
              (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
          (2): Sequential(
            (0): ConvNorm(
              (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (lstm): LSTM(512, 256, batch_first=True, bidirectional=True)
      )
      (decoder): Decoder(
        (prenet): Prenet(
          (layers): ModuleList(
            (0): LinearNorm(
              (linear_layer): Linear(in_features=80, out_features=256, bias=False)
            )
            (1): LinearNorm(
              (linear_layer): Linear(in_features=256, out_features=256, bias=False)
            )
          )
        )
        (attention_rnn): LSTMCell(768, 1024)
        (attention_layer): Attention(
          (query_layer): LinearNorm(
            (linear_layer): Linear(in_features=1024, out_features=128, bias=False)
          )
          (memory_layer): LinearNorm(
            (linear_layer): Linear(in_features=512, out_features=128, bias=False)
          )
          (v): LinearNorm(
            (linear_layer): Linear(in_features=128, out_features=1, bias=False)
          )
          (location_layer): LocationLayer(
            (location_conv): ConvNorm(
              (conv): Conv1d(2, 32, kernel_size=(31,), stride=(1,), padding=(15,), bias=False)
            )
            (location_dense): LinearNorm(
              (linear_layer): Linear(in_features=32, out_features=128, bias=False)
            )
          )
        )
        (decoder_rnn): LSTMCell(1536, 1024, bias=1)
        (linear_projection): LinearNorm(
          (linear_layer): Linear(in_features=1536, out_features=80, bias=True)
        )
        (gate_layer): LinearNorm(
          (linear_layer): Linear(in_features=1536, out_features=1, bias=True)
        )
      )
      (postnet): Postnet(
        (convolutions): ModuleList(
          (0): Sequential(
            (0): ConvNorm(
              (conv): Conv1d(80, 512, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
          (1): Sequential(
            (0): ConvNorm(
              (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
          (2): Sequential(
            (0): ConvNorm(
              (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
          (3): Sequential(
            (0): ConvNorm(
              (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
          (4): Sequential(
            (0): ConvNorm(
              (conv): Conv1d(512, 80, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): BatchNorm1d(80, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
      )
    )
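
The printout also makes it easy to estimate where the parameters sit. The helpers below count weights for three layer types, assuming the usual PyTorch parameter layout (an LSTM cell stores four gates, each with input weights, recurrent weights, and two bias vectors). The function names are hypothetical.

```python
def linear_params(in_features, out_features, bias=True):
    """Weights (plus optional bias) of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def conv1d_params(in_channels, out_channels, kernel_size, bias=True):
    """Weights (plus optional bias) of a 1-D convolution."""
    return in_channels * out_channels * kernel_size + (out_channels if bias else 0)


def lstm_cell_params(input_size, hidden_size):
    """Four gates, each with input weights, recurrent weights, and two bias vectors."""
    return 4 * (input_size * hidden_size
                + hidden_size * hidden_size
                + 2 * hidden_size)


print("linear_projection:", linear_params(1536, 80))
print("encoder conv:     ", conv1d_params(512, 512, 5))
print("attention_rnn:    ", lstm_cell_params(768, 1024))
print("decoder_rnn:      ", lstm_cell_params(1536, 1024))
```

Comparing the numbers shows that the two LSTM cells in the decoder are far larger than any single convolution or projection layer.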

Last Modified: April 28, 2021

5 Comments
  1. QueSera QueSera

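    A question for the author: the attention section says y0 is the input to the Prenet, but in the figure y0 is Decoder_input. Is that a typo?

    1. mathor mathor

      @QueSera Those are two different y0s

    2. QueSera QueSera

      @mathor The overall architecture diagram in the original paper shows that the output of the 2 LSTM layers, after the Linear projection, is fed back into the Prenet, whereas in your detailed diagram the Prenet has only an output and no input. Could you explain?

    3. mathor mathor

      @QueSera Then go with the original paper

  2. AAA AAA

    Where does the 151 come from?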