1 Overview
Tacotron2 is an end-to-end speech synthesis framework proposed by Google Brain in 2017. From the bottom up, the model can be viewed as two parts:
- Spectrogram prediction network: an Encoder-Attention-Decoder network that predicts a sequence of mel-spectrogram frames from the input character sequence
- Vocoder: a modified WaveNet that generates the time-domain waveform from the predicted mel-spectrogram frames
2 Encoder
The Encoder's input is a batch of sentences, and the basic unit of each sentence is the character, for example:
- the English text "hello world" is split into "h e l l o w o r l d" as input
- for the Chinese text "你好世界", the pinyin "ni hao shi jie" is produced first; it is then either split into initials and finals, giving "n i h ao sh i j ie", or split character by character as in English, giving "n i h a o s h i j i e" (a rough sketch of both splitting schemes follows this list)
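As a rough illustration (a hypothetical helper, not part of the Tacotron2 code; a real text front end would use a pinyin/G2P library rather than a hard-coded table), the two splitting schemes could look like this:

```python
# Hypothetical illustration of the splitting described above.
def split_english(text):
    # "hello world" -> ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
    return list(text)

def split_pinyin(syllables):
    # "ni hao shi jie" -> ['n', 'i', 'h', 'ao', 'sh', 'i', 'j', 'ie']
    initials = ('zh', 'ch', 'sh', 'b', 'p', 'm', 'f', 'd', 't', 'n', 'l',
                'g', 'k', 'h', 'j', 'q', 'x', 'r', 'z', 'c', 's', 'y', 'w')
    units = []
    for syllable in syllables.split():
        for ini in initials:
            if syllable.startswith(ini):
                units.extend([ini, syllable[len(ini):]])
                break
        else:  # syllable with no initial, e.g. "ai"
            units.append(syllable)
    return units

print(split_english("hello world"))
print(split_pinyin("ni hao shi jie"))
```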
The Encoder pipeline is as follows:
- the input has shape
[batch_size, char_seq_length]
- a 512-dimensional character embedding maps each character to a 512-dimensional vector, giving an output of shape
[batch_size, char_seq_length, 512]
- three 1-D convolutions, each with 512 kernels of size 5×1 (i.e. each kernel spans 5 characters); each convolution is followed by BatchNorm, ReLU, and Dropout. The output shape is
[batch_size, char_seq_length, 512]
(the input is padded so that each convolution preserves the sequence length)
- this output is fed into a single-layer BiLSTM with a hidden size of 256; since the LSTM is bidirectional, the final output shape is
[batch_size, char_seq_length, 512]
The Encoder, and the ConvNorm helper it uses, are implemented as follows:

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Three Conv1d + BatchNorm blocks followed by a single-layer BiLSTM."""
    def __init__(self, hparams):
        super(Encoder, self).__init__()

        convolutions = []
        for _ in range(hparams.encoder_n_convolutions):
            conv_layer = nn.Sequential(
                ConvNorm(hparams.encoder_embedding_dim,
                         hparams.encoder_embedding_dim,
                         kernel_size=hparams.encoder_kernel_size, stride=1,
                         padding=int((hparams.encoder_kernel_size - 1) / 2),
                         dilation=1, w_init_gain='relu'),
                nn.BatchNorm1d(hparams.encoder_embedding_dim))
            convolutions.append(conv_layer)
        self.convolutions = nn.ModuleList(convolutions)

        self.lstm = nn.LSTM(hparams.encoder_embedding_dim,
                            int(hparams.encoder_embedding_dim / 2), 1,
                            batch_first=True, bidirectional=True)


class ConvNorm(torch.nn.Module):
    """Conv1d with 'same' padding and Xavier weight initialization."""
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
                 padding=None, dilation=1, bias=True, w_init_gain='linear'):
        super(ConvNorm, self).__init__()
        if padding is None:
            assert kernel_size % 2 == 1
            padding = int(dilation * (kernel_size - 1) / 2)

        self.conv = torch.nn.Conv1d(in_channels, out_channels,
                                    kernel_size=kernel_size, stride=stride,
                                    padding=padding, dilation=dilation,
                                    bias=bias)

        torch.nn.init.xavier_uniform_(
            self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))

    def forward(self, signal):
        return self.conv(signal)
```
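A minimal shape check of the pipeline above with dummy tensors (a sketch only; it bypasses the Encoder class and omits BatchNorm and Dropout, which do not change tensor shapes):

```python
import torch

batch_size, char_seq_length = 4, 100
embedded = torch.randn(batch_size, char_seq_length, 512)       # character embeddings

x = embedded.transpose(1, 2)                                   # Conv1d expects [B, C, T]
for _ in range(3):
    conv = torch.nn.Conv1d(512, 512, kernel_size=5, padding=2)
    x = torch.relu(conv(x))                                    # padding=2 keeps T unchanged
x = x.transpose(1, 2)                                          # back to [B, T, 512]

lstm = torch.nn.LSTM(512, 256, batch_first=True, bidirectional=True)
outputs, _ = lstm(x)
print(outputs.shape)                                           # torch.Size([4, 100, 512])
```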
3 Attention Mechanism
The figure above shows the inputs and outputs of the first attention step. Here $y_0$ is the encoded representation (the PreNet output) of the initial input <S>, and $c_0$ is the current "attention context". At the very first step, $y_0$ and $c_0$ are both initialized as all-zero vectors; they are concatenated into a 768-dimensional vector $y_{0,c}$, which is fed into an LSTMCell together with attention_hidden and attention_cell (attention_hidden is simply the LSTMCell's hidden state, and attention_cell its cell state). The outputs are $h_1$ and a new attention_cell; the latter is not given a separate name here because it mostly just tags along: apart from attention_rnn, nothing else uses attention_cell.
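A hedged sketch of this first step with dummy tensors (assuming prenet_dim = 256, encoder_embedding_dim = 512 and attention_rnn_dim = 1024, as listed in the summary in section 5; in the repo these tensors live as attributes of the Decoder):

```python
import torch

B = 1
y0 = torch.zeros(B, 256)                  # PreNet encoding of the initial <S> frame
c0 = torch.zeros(B, 512)                  # initial attention context
attention_hidden = torch.zeros(B, 1024)   # LSTMCell hidden state
attention_cell = torch.zeros(B, 1024)     # LSTMCell cell state

attention_rnn = torch.nn.LSTMCell(256 + 512, 1024)
y0_c = torch.cat((y0, c0), dim=-1)        # [1, 768]
h1, attention_cell = attention_rnn(y0_c, (attention_hidden, attention_cell))
print(h1.shape)                           # torch.Size([1, 1024])
```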
Attention_Layer takes five inputs in total:
- $h_1$, the variable tied to the mel spectrogram
- $m$, the features extracted from the source character sequence by the Encoder
- $m'$, obtained by passing $m$ through a Linear layer
- attention_weights_cat, the concatenation of the previous step's attention_weights and attention_weights_cum
- mask, which is all False here and essentially unused
The computation details are as follows:
The core of it is get_alignment_energies. Because this function introduces location features, the mechanism is a hybrid attention mechanism.
Hybrid attention combines content-based attention (the usual attention) with location-based attention:
$$ e_{ij}=score(s_{i-1},\alpha_{i-1},h_j) $$
where $s_{i-1}$ is the previous decoder hidden state, $\alpha_{i-1}$ are the previous attention weights, and $h_j$ is the $j$-th encoder hidden state. Adding a bias $b$, the final score function is computed as:
$$ e_{ij}=v_a^T\tanh(Ws_{i-1}+Vh_j+Uf_{i,j}+b) $$
where $v_a$, $W$, $V$, $U$ and $b$ are all trainable parameters, and $f_{i,j}$ is the location feature obtained by convolving the previous attention weights $\alpha_{i-1}$ with $F$: $f_i=F*\alpha_{i-1}$
Tacotron2's attention mechanism is essentially this hybrid attention, with a few differences:
$$ e_{i,j}=score(s_i,c\alpha_{i-1},h_j)=v_a^T\tanh(Ws_i+Vh_j+Uf_{i,j}+b) $$
where $s_i$ is the current decoder hidden state rather than the previous one, the bias $b$ is initialized to zero, and the location feature $f_i$ is computed by convolving the cumulative attention weights $c\alpha_{i-1}$:
$$ \begin{align*} f_i&=F*c\alpha_{i-1}\\ c\alpha_{i-1}&=\sum_{k=1}^{i-1}\alpha_k \end{align*} $$
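In code, the cumulative weights and the two-channel tensor attention_weights_cat are assembled roughly as follows (a sketch with dummy tensors; T_in = 151 is simply the example input length used in the shape comments of the code below):

```python
import torch

B, T_in = 1, 151                                   # T_in: number of encoder time steps
attention_weights = torch.softmax(torch.randn(B, T_in), dim=1)   # current alpha
attention_weights_cum = torch.zeros(B, T_in)       # running sum of past alphas

attention_weights_cum = attention_weights_cum + attention_weights
attention_weights_cat = torch.cat(
    (attention_weights.unsqueeze(1), attention_weights_cum.unsqueeze(1)), dim=1)
print(attention_weights_cat.shape)                 # torch.Size([1, 2, 151])
```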
The get_alignment_energies function is illustrated below:
The code for the Location_Layer in the figure is as follows:
```python
class LocationLayer(nn.Module):
    def __init__(self, attention_n_filters, attention_kernel_size,  # 32, 31
                 attention_dim):                                     # 128
        super(LocationLayer, self).__init__()
        padding = int((attention_kernel_size - 1) / 2)               # padding = 15
        self.location_conv = ConvNorm(2, attention_n_filters,
                                      kernel_size=attention_kernel_size,
                                      padding=padding, bias=False, stride=1,
                                      dilation=1)
        self.location_dense = LinearNorm(attention_n_filters, attention_dim,
                                         bias=False, w_init_gain='tanh')

    def forward(self, attention_weights_cat):                             # [1, 2, 151]
        processed_attention = self.location_conv(attention_weights_cat)   # [1, 32, 151]
        processed_attention = processed_attention.transpose(1, 2)         # [1, 151, 32]
        processed_attention = self.location_dense(processed_attention)    # [1, 151, 128]
        return processed_attention
```
The code for Attention is as follows:
```python
import torch.nn.functional as F


class Attention(nn.Module):
    def __init__(self, attention_rnn_dim, embedding_dim, attention_dim,
                 attention_location_n_filters, attention_location_kernel_size):
        super(Attention, self).__init__()
        self.query_layer = LinearNorm(attention_rnn_dim, attention_dim,
                                      bias=False, w_init_gain='tanh')
        self.memory_layer = LinearNorm(embedding_dim, attention_dim, bias=False,
                                       w_init_gain='tanh')
        self.v = LinearNorm(attention_dim, 1, bias=False)
        self.location_layer = LocationLayer(attention_location_n_filters,
                                            attention_location_kernel_size,
                                            attention_dim)
        self.score_mask_value = -float("inf")

    def get_alignment_energies(self, query, processed_memory,
                               attention_weights_cat):
        """
        PARAMS
        ------
        query: decoder output (batch, n_mel_channels * n_frames_per_step)
        processed_memory: processed encoder outputs (B, T_in, attention_dim)
        attention_weights_cat: cumulative and prev. att weights (B, 2, max_time)

        RETURNS
        -------
        alignment (batch, max_time)
        """
        processed_query = self.query_layer(query.unsqueeze(1))
        processed_attention_weights = self.location_layer(attention_weights_cat)
        energies = self.v(torch.tanh(
            processed_query + processed_attention_weights + processed_memory))

        energies = energies.squeeze(-1)
        return energies

    def forward(self, attention_hidden_state, memory, processed_memory,
                attention_weights_cat, mask):
        """
        PARAMS
        ------
        attention_hidden_state: attention rnn last output
        memory: encoder outputs
        processed_memory: processed encoder outputs
        attention_weights_cat: previous and cumulative attention weights
        mask: binary mask for padded data
        """
        alignment = self.get_alignment_energies(
            attention_hidden_state, processed_memory, attention_weights_cat)

        if mask is not None:
            alignment.data.masked_fill_(mask, self.score_mask_value)

        attention_weights = F.softmax(alignment, dim=1)
        attention_context = torch.bmm(attention_weights.unsqueeze(1), memory)
        attention_context = attention_context.squeeze(1)

        return attention_context, attention_weights
```
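A quick smoke test of the module (a sketch that assumes the ConvNorm, LinearNorm, and LocationLayer classes shown in this post are already in scope, with the dimensions listed in the section-5 summary):

```python
import torch

B, T_in = 1, 151
attention = Attention(attention_rnn_dim=1024, embedding_dim=512, attention_dim=128,
                      attention_location_n_filters=32,
                      attention_location_kernel_size=31)

h1 = torch.randn(B, 1024)                            # attention RNN output
memory = torch.randn(B, T_in, 512)                   # m: encoder outputs
processed_memory = attention.memory_layer(memory)    # m' = Linear(m)
attention_weights_cat = torch.zeros(B, 2, T_in)      # previous + cumulative weights

context, weights = attention(h1, memory, processed_memory,
                             attention_weights_cat, mask=None)
print(context.shape, weights.shape)                  # [1, 512] and [1, 151]
```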
4 Decoder
The decoder is an autoregressive structure that predicts the spectrogram from the encoded input sequence, one frame at a time:
- the spectrogram frame predicted at the previous step is first passed through a PreNet, a two-layer network; acting as an information bottleneck, the PreNet is essential for learning attention
- the PreNet output is concatenated with the attention context vector and fed into a two-layer LSTM with 1024 units; the LSTM output is concatenated with the attention context vector again and then passed through a linear projection to predict the target spectrogram frame
- finally, the target spectrogram frame is passed through a five-layer convolutional PostNet, whose output is added to the linear-projection output (a residual connection) to form the final output
- in parallel, the concatenation of the LSTM output and the attention context vector is projected to a scalar and passed through a sigmoid, which predicts whether the output sequence is complete
The PreNet layer is illustrated and implemented as follows:
```python
class LinearNorm(torch.nn.Module):
    def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'):
        super(LinearNorm, self).__init__()
        self.linear_layer = torch.nn.Linear(in_dim, out_dim, bias=bias)

        torch.nn.init.xavier_uniform_(
            self.linear_layer.weight,
            gain=torch.nn.init.calculate_gain(w_init_gain))

    def forward(self, x):
        return self.linear_layer(x)


class Prenet(nn.Module):
    def __init__(self, in_dim, sizes):
        super(Prenet, self).__init__()
        in_sizes = [in_dim] + sizes[:-1]
        self.layers = nn.ModuleList(
            [LinearNorm(in_size, out_size, bias=False)
             for (in_size, out_size) in zip(in_sizes, sizes)])

    def forward(self, x):
        for linear in self.layers:
            x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
        return x
```
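Note that F.dropout is called with training=True, so the PreNet's 50% dropout stays active even at inference time; the paper uses this to introduce variation in the autoregressive decoder's outputs. A quick usage sketch (dimensions taken from the section-5 summary):

```python
import torch

prenet = Prenet(in_dim=80, sizes=[256, 256])   # n_mel_channels -> 256 -> 256
frame = torch.randn(1, 80)                     # the previously predicted mel frame
print(prenet(frame).shape)                     # torch.Size([1, 256])
```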
The PostNet layer is illustrated and implemented as follows:
```python
class Postnet(nn.Module):
    """Postnet
    - Five 1-d convolutions with 512 channels and kernel size 5
    """

    def __init__(self, hparams):
        super(Postnet, self).__init__()
        self.convolutions = nn.ModuleList()

        self.convolutions.append(
            nn.Sequential(
                ConvNorm(hparams.n_mel_channels, hparams.postnet_embedding_dim,
                         kernel_size=hparams.postnet_kernel_size, stride=1,
                         padding=int((hparams.postnet_kernel_size - 1) / 2),
                         dilation=1, w_init_gain='tanh'),
                nn.BatchNorm1d(hparams.postnet_embedding_dim))
        )

        for i in range(1, hparams.postnet_n_convolutions - 1):
            self.convolutions.append(
                nn.Sequential(
                    ConvNorm(hparams.postnet_embedding_dim,
                             hparams.postnet_embedding_dim,
                             kernel_size=hparams.postnet_kernel_size, stride=1,
                             padding=int((hparams.postnet_kernel_size - 1) / 2),
                             dilation=1, w_init_gain='tanh'),
                    nn.BatchNorm1d(hparams.postnet_embedding_dim))
            )

        self.convolutions.append(
            nn.Sequential(
                ConvNorm(hparams.postnet_embedding_dim, hparams.n_mel_channels,
                         kernel_size=hparams.postnet_kernel_size, stride=1,
                         padding=int((hparams.postnet_kernel_size - 1) / 2),
                         dilation=1, w_init_gain='linear'),
                nn.BatchNorm1d(hparams.n_mel_channels))
        )

    def forward(self, x):
        for i in range(len(self.convolutions) - 1):
            x = F.dropout(torch.tanh(self.convolutions[i](x)), 0.5, self.training)
        x = F.dropout(self.convolutions[-1](x), 0.5, self.training)

        return x
```
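The residual connection described earlier can be sketched as follows (assuming the Postnet and ConvNorm classes above are in scope; HParams here is a hypothetical stand-in for the fields Postnet reads, and in the repo the addition itself happens outside Postnet, in the top-level model's forward pass):

```python
import torch

class HParams:                      # hypothetical minimal stand-in for the real hparams
    n_mel_channels = 80
    postnet_embedding_dim = 512
    postnet_kernel_size = 5
    postnet_n_convolutions = 5

postnet = Postnet(HParams())
mel_outputs = torch.randn(2, 80, 200)                       # decoder output, [B, n_mel, T]
mel_outputs_postnet = mel_outputs + postnet(mel_outputs)    # residual add -> final mel
print(mel_outputs_postnet.shape)                            # torch.Size([2, 80, 200])
```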
From the Decoder's initialization below, we can see that the Decoder consists of a prenet, attention_rnn, attention_layer, decoder_rnn, linear_projection, and gate_layer:
```python
class Decoder(nn.Module):
    def __init__(self, hparams):
        super(Decoder, self).__init__()
        self.n_mel_channels = hparams.n_mel_channels
        self.n_frames_per_step = hparams.n_frames_per_step
        self.encoder_embedding_dim = hparams.encoder_embedding_dim
        self.attention_rnn_dim = hparams.attention_rnn_dim
        self.decoder_rnn_dim = hparams.decoder_rnn_dim
        self.prenet_dim = hparams.prenet_dim
        self.max_decoder_steps = hparams.max_decoder_steps
        self.gate_threshold = hparams.gate_threshold
        self.p_attention_dropout = hparams.p_attention_dropout
        self.p_decoder_dropout = hparams.p_decoder_dropout

        self.prenet = Prenet(
            hparams.n_mel_channels * hparams.n_frames_per_step,
            [hparams.prenet_dim, hparams.prenet_dim])

        self.attention_rnn = nn.LSTMCell(
            hparams.prenet_dim + hparams.encoder_embedding_dim,
            hparams.attention_rnn_dim)

        self.attention_layer = Attention(
            hparams.attention_rnn_dim, hparams.encoder_embedding_dim,
            hparams.attention_dim, hparams.attention_location_n_filters,
            hparams.attention_location_kernel_size)

        self.decoder_rnn = nn.LSTMCell(
            hparams.attention_rnn_dim + hparams.encoder_embedding_dim,
            hparams.decoder_rnn_dim, 1)  # the trailing 1 is the (truthy) bias argument

        self.linear_projection = LinearNorm(
            hparams.decoder_rnn_dim + hparams.encoder_embedding_dim,
            hparams.n_mel_channels * hparams.n_frames_per_step)

        self.gate_layer = LinearNorm(
            hparams.decoder_rnn_dim + hparams.encoder_embedding_dim, 1,
            bias=True, w_init_gain='sigmoid')
```
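The __init__ above only registers the submodules. One autoregressive step roughly wires them together as sketched below (a simplified, hedged version of the flow described at the start of this section; the repo's Decoder additionally applies dropout to the RNN states and keeps these tensors as attributes rather than in a dict):

```python
import torch

def decode_step(decoder, prenet_output, states):
    """One simplified decode step; `states` holds the recurrent tensors
    (LSTM states, attention context, previous and cumulative attention weights)."""
    # 1) attention RNN: PreNet output concatenated with the previous attention context
    cell_input = torch.cat((prenet_output, states['attention_context']), dim=-1)
    states['attention_hidden'], states['attention_cell'] = decoder.attention_rnn(
        cell_input, (states['attention_hidden'], states['attention_cell']))

    # 2) location-sensitive attention over the encoder outputs
    attention_weights_cat = torch.cat(
        (states['attention_weights'].unsqueeze(1),
         states['attention_weights_cum'].unsqueeze(1)), dim=1)
    states['attention_context'], states['attention_weights'] = decoder.attention_layer(
        states['attention_hidden'], states['memory'], states['processed_memory'],
        attention_weights_cat, mask=None)
    states['attention_weights_cum'] = (states['attention_weights_cum']
                                       + states['attention_weights'])

    # 3) decoder RNN: attention RNN output concatenated with the new context
    decoder_input = torch.cat((states['attention_hidden'],
                               states['attention_context']), dim=-1)
    states['decoder_hidden'], states['decoder_cell'] = decoder.decoder_rnn(
        decoder_input, (states['decoder_hidden'], states['decoder_cell']))

    # 4) projections: mel frame(s) plus the stop-token logit
    hidden_and_context = torch.cat((states['decoder_hidden'],
                                    states['attention_context']), dim=1)
    mel_frame = decoder.linear_projection(hidden_and_context)   # [B, 80 * n_frames]
    gate_logit = decoder.gate_layer(hidden_and_context)         # sigmoid -> stop probability
    return mel_frame, gate_logit, states
```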
5 Summary
The complete network structure of the Tacotron2 model:
```
Tacotron2(
  (embedding): Embedding(148, 512)
  (encoder): Encoder(
    (convolutions): ModuleList(
      (0): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (2): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (lstm): LSTM(512, 256, batch_first=True, bidirectional=True)
  )
  (decoder): Decoder(
    (prenet): Prenet(
      (layers): ModuleList(
        (0): LinearNorm(
          (linear_layer): Linear(in_features=80, out_features=256, bias=False)
        )
        (1): LinearNorm(
          (linear_layer): Linear(in_features=256, out_features=256, bias=False)
        )
      )
    )
    (attention_rnn): LSTMCell(768, 1024)
    (attention_layer): Attention(
      (query_layer): LinearNorm(
        (linear_layer): Linear(in_features=1024, out_features=128, bias=False)
      )
      (memory_layer): LinearNorm(
        (linear_layer): Linear(in_features=512, out_features=128, bias=False)
      )
      (v): LinearNorm(
        (linear_layer): Linear(in_features=128, out_features=1, bias=False)
      )
      (location_layer): LocationLayer(
        (location_conv): ConvNorm(
          (conv): Conv1d(2, 32, kernel_size=(31,), stride=(1,), padding=(15,), bias=False)
        )
        (location_dense): LinearNorm(
          (linear_layer): Linear(in_features=32, out_features=128, bias=False)
        )
      )
    )
    (decoder_rnn): LSTMCell(1536, 1024, bias=1)
    (linear_projection): LinearNorm(
      (linear_layer): Linear(in_features=1536, out_features=80, bias=True)
    )
    (gate_layer): LinearNorm(
      (linear_layer): Linear(in_features=1536, out_features=1, bias=True)
    )
  )
  (postnet): Postnet(
    (convolutions): ModuleList(
      (0): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(80, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (2): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (3): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (4): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 80, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(80, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
  )
)
```