
A PyTorch Implementation of TextCNN

June 25, 2020 • Deep Learning

Video walkthrough on Bilibili (B 站)

This article introduces Convolutional Neural Networks for Sentence Classification, a paper that applies CNNs to NLP, and then gives a PyTorch implementation.

The paper is fairly short and the overall pipeline is not complicated. The key is the figure below: once you understand it, you know how to write the code. If you are not familiar with CNNs, please read my article CS231n 笔记:通俗理解 CNN first.

The feature map in the figure below is built by passing each word of a sentence through a word embedding: the width of the feature map is the embedding dimension, and its length is the number of words in the sentence. In the figure, each word is clearly encoded as a 6-dimensional vector, and the sentence contains 9 words.

As for why there are two feature maps, you can think of it as a batch size of 2.

The red box represents the convolution kernel, and it is clearly not square: its two sides have different lengths. Interestingly, the number of words the kernel covers can be read as the n of an n-gram. In the figure below the kernel spans 2 words, so it looks at the word vectors of "wait" and "for" at the same time, which makes this convolution behave rather like a bigram model.
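To make the n-gram analogy concrete, here is a tiny standalone sketch (separate from the model built later in this post) that slides kernels of several heights over a random 9-word, 6-dimensional "sentence"; the kernel heights 2, 3 and 4 and the random data are illustrative assumptions only:

import torch
import torch.nn as nn

embedding_dim = 6
sentence = torch.randn(1, 1, 9, embedding_dim)  # [batch=1, channel=1, 9 words, 6-dim embeddings]
for n in (2, 3, 4):                              # a kernel covering n words acts like an n-gram detector
    conv = nn.Conv2d(1, 1, (n, embedding_dim))   # the kernel spans the full embedding width
    out = conv(sentence)                         # [1, 1, 9 - n + 1, 1]: one value per n-word window
    print(n, out.shape)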

What follows is just the standard CNN pipeline: activation, pooling, and flattening, so there is not much more to say about it.

Code Implementation (PyTorch)

The source code comes from nlp-tutorial; I modified it on top of that (the original code seemed to have quite a few problems).

'''
code by Tae Hwan Jung(Jeff Jung) @graykode, modify by wmathor
'''
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data
import torch.nn.functional as F

dtype = torch.FloatTensor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

The code below defines the data and sets a few common parameters.

# 3-word sentences (sequence_length is 3)
sentences = ["i love you", "he loves me", "she likes baseball", "i hate you", "sorry for that", "this is awful"]
labels = [1, 1, 1, 0, 0, 0]  # 1 is good, 0 is not good.

# TextCNN Parameters
embedding_size = 2
sequence_length = len(sentences[0].split())  # every sentence contains sequence_length (=3) words
num_classes = len(set(labels))  # num_classes = 2
batch_size = 3

word_list = " ".join(sentences).split()
vocab = list(set(word_list))
word2idx = {w: i for i, w in enumerate(vocab)}
vocab_size = len(vocab)

Data Preprocessing

def make_data(sentences, labels):
    inputs = []
    for sen in sentences:
        inputs.append([word2idx[n] for n in sen.split()])

    targets = []
    for out in labels:
        targets.append(out)  # class indices, to use Torch's CrossEntropyLoss
    return inputs, targets

input_batch, target_batch = make_data(sentences, labels)
input_batch, target_batch = torch.LongTensor(input_batch), torch.LongTensor(target_batch)

dataset = Data.TensorDataset(input_batch, target_batch)
loader = Data.DataLoader(dataset, batch_size, True)

Building the Model

class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.W = nn.Embedding(vocab_size, embedding_size)
        output_channel = 3
        self.conv = nn.Sequential(
            # conv: [input_channel(=1), output_channel, (filter_height, filter_width), stride=1]
            nn.Conv2d(1, output_channel, (2, embedding_size)),
            nn.ReLU(),
            # pool: (filter_height, filter_width)
            nn.MaxPool2d((2, 1)),
        )
        # fc
        self.fc = nn.Linear(output_channel, num_classes)

    def forward(self, X):
        '''
        X: [batch_size, sequence_length]
        '''
        batch_size = X.shape[0]
        embedding_X = self.W(X)  # [batch_size, sequence_length, embedding_size]
        embedding_X = embedding_X.unsqueeze(1)  # add channel(=1): [batch_size, 1, sequence_length, embedding_size]
        conved = self.conv(embedding_X)  # [batch_size, output_channel, 1, 1]
        flatten = conved.view(batch_size, -1)  # [batch_size, output_channel*1*1]
        output = self.fc(flatten)
        return output

Let's now walk through in detail how the dimensions change as data flows through the network. The input is a matrix of shape [batch_size, sequence_length], and the numbers in it are the indices (positions) of the words in the vocabulary.

First the data passes through the Embedding layer, which is simply a table lookup that turns each index into a vector; for example, index 12 might become [0.3, 0.6, 0.12, ...]. This implicitly adds a dimension, and the data becomes [batch_size, sequence_length, embedding_size].
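As a quick sanity check of this lookup behavior, here is a tiny standalone example (the vocabulary size 10, the embedding dimension 4, and the indices are arbitrary assumptions, unrelated to the model above):

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)            # a lookup table with 10 rows of 4-dim vectors
idx = torch.LongTensor([[3, 7, 1]])  # [batch_size=1, sequence_length=3]
print(emb(idx).shape)                # torch.Size([1, 3, 4])
print(emb(idx)[0, 0])                # the 4-dim vector stored at row 3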

Then unsqueeze(1) adds one more dimension, giving [batch_size, 1, sequence_length, embedding_size]. Only now can the data be convolved, because in a conventional CNN the input is expected to have the shape [batch_size, in_channel, height, width].

Passing an input of shape [batch_size, 1, 3, 2] through nn.Conv2d(1, 3, (2, 2)) produces a tensor of shape [batch_size, 3, 2, 1]; the ReLU activation does not change the shape, so it is not drawn. Finally, nn.MaxPool2d((2, 1)) pooling yields [batch_size, 3, 1, 1].
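These shape changes can be reproduced with dummy data; the following is just a standalone sketch, with batch_size = 5 chosen arbitrarily:

import torch
import torch.nn as nn

x = torch.randn(5, 1, 3, 2)     # [batch_size=5, channel=1, sequence_length=3, embedding_size=2]
conv = nn.Conv2d(1, 3, (2, 2))
pool = nn.MaxPool2d((2, 1))
h = torch.relu(conv(x))         # [5, 3, 2, 1]; ReLU leaves the shape unchanged
out = pool(h)                   # [5, 3, 1, 1]
print(h.shape, out.shape)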

Training

model = TextCNN().to(device)
criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Training
for epoch in range(5000):
    for batch_x, batch_y in loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        pred = model(batch_x)
        loss = criterion(pred, batch_y)
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'loss =', '{:.6f}'.format(loss))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Testing

# Test
test_text = 'i hate me'
tests = [[word2idx[n] for n in test_text.split()]]
test_batch = torch.LongTensor(tests).to(device)
# Predict
model = model.eval()
predict = model(test_batch).data.max(1, keepdim=True)[1]
if predict[0][0] == 0:
    print(test_text, "is Bad Mean...")
else:
    print(test_text, "is Good Mean!!")

The complete code is as follows:

'''
code by Tae Hwan Jung(Jeff Jung) @graykode, modify by wmathor
'''
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data
import torch.nn.functional as F

dtype = torch.FloatTensor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 3-word sentences (sequence_length is 3)
sentences = ["i love you", "he loves me", "she likes baseball", "i hate you", "sorry for that", "this is awful"]
labels = [1, 1, 1, 0, 0, 0]  # 1 is good, 0 is not good.

# TextCNN Parameters
embedding_size = 2
sequence_length = len(sentences[0].split())  # every sentence contains sequence_length (=3) words
num_classes = len(set(labels))  # num_classes = 2 (0 or 1)
batch_size = 3

word_list = " ".join(sentences).split()
vocab = list(set(word_list))
word2idx = {w: i for i, w in enumerate(vocab)}
vocab_size = len(vocab)

def make_data(sentences, labels):
    inputs = []
    for sen in sentences:
        inputs.append([word2idx[n] for n in sen.split()])

    targets = []
    for out in labels:
        targets.append(out)  # class indices, to use Torch's CrossEntropyLoss
    return inputs, targets

input_batch, target_batch = make_data(sentences, labels)
input_batch, target_batch = torch.LongTensor(input_batch), torch.LongTensor(target_batch)

dataset = Data.TensorDataset(input_batch, target_batch)
loader = Data.DataLoader(dataset, batch_size, True)

class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.W = nn.Embedding(vocab_size, embedding_size)
        output_channel = 3
        self.conv = nn.Sequential(
            # conv: [input_channel(=1), output_channel, (filter_height, filter_width), stride=1]
            nn.Conv2d(1, output_channel, (2, embedding_size)),
            nn.ReLU(),
            # pool: (filter_height, filter_width)
            nn.MaxPool2d((2, 1)),
        )
        # fc
        self.fc = nn.Linear(output_channel, num_classes)

    def forward(self, X):
        '''
        X: [batch_size, sequence_length]
        '''
        batch_size = X.shape[0]
        embedding_X = self.W(X)  # [batch_size, sequence_length, embedding_size]
        embedding_X = embedding_X.unsqueeze(1)  # add channel(=1): [batch_size, 1, sequence_length, embedding_size]
        conved = self.conv(embedding_X)  # [batch_size, output_channel, 1, 1]
        flatten = conved.view(batch_size, -1)  # [batch_size, output_channel*1*1]
        output = self.fc(flatten)
        return output

model = TextCNN().to(device)
criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Training
for epoch in range(5000):
    for batch_x, batch_y in loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        pred = model(batch_x)
        loss = criterion(pred, batch_y)
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'loss =', '{:.6f}'.format(loss))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Test
test_text = 'i hate me'
tests = [[word2idx[n] for n in test_text.split()]]
test_batch = torch.LongTensor(tests).to(device)
# Predict
model = model.eval()
predict = model(test_batch).data.max(1, keepdim=True)[1]
if predict[0][0] == 0:
    print(test_text, "is Bad Mean...")
else:
    print(test_text, "is Good Mean!!")
Last Modified: April 29, 2021

23 Comments
  1. ngc

    I think what this does is convolve the data with several different kernels.

    1. mathor

      @ngc Yes, which amounts to implementing different n-grams.

  2. kd

    Can batch size be described as the number of words?

    1. mathor

      @kd No. In this problem, batch size should be understood as the number of sentences.

    2. crx

      @mathor Shouldn't it be the number of words in a sentence?

    3. mathor

      @crx No, sequence_length is the number of words in a sentence;
      batch size is the number of sentences.

  3. kou

    Does this only implement the single-channel input case from the paper? The paper also describes a multichannel architecture, like the one shown in the first figure of this post: unlike the single-channel case, one channel is kept unchanged during training while the other is fine-tuned through backpropagation (the channels hold word vectors). How would that be implemented?

    1. mathor

      @kou I'm not sure about that. If batch_size = 1 and there is only one sentence, the input should just be a single-channel matrix; I don't quite see what multiple channels would add.

  4. 五楼

    What happens if the sentences do not all have the same length?

    1. mathor

      @五楼 Then an error will be raised in the part where the Dataset is defined.

  5. crx

    Hi, could you share a way to contact you? There are a few questions I'd like to ask.

  6. frankye

    Why is there a ReLU after the convolution? I don't think I saw the authors use a ReLU in the paper.

    1. mathor

      @frankye The paper not mentioning it doesn't mean the authors didn't use it; not every implementation detail has to be written up in the paper.

  7. kaka

    If the sentence length is >= 4, won't this code stop working?

    With sentence length <= 3, the output after the convolution and maxpool2d is [batch_size, output_channel, 1, 1], and after reshaping, the fully connected layer's input dimension is exactly output_channel;

    with sentence length = 4, the output after the convolution and maxpool2d is [batch_size, output_channel, 2, 1], which no longer matches the fully connected layer's dimension (output_channel).

    1. mathor

      @kaka You will need to adapt it yourself to your specific case (one possible adjustment is sketched after the comments).

  8. 殇小气

    Line 74: what does lr=le-3 mean?

    1. 殇小气

      @殇小气 When I type it in PyCharm I get the error:
      Unresolved reference 'le'

    2. mathor

      @殇小气 That is not le-3, it is 1e-3; the first character is the digit 1.

  9. NLPer

    conved = self.conv(embedding_X) # [batch_size, output_channel, 1, 1]
    I think the first dimension in this comment is input_channel, not batch_size.

  10. dsq

    line 20: sequence_length = len(sentences[0]) seems to be missing a .split(); sequence_length is not used later on, though, so the overall result is unaffected.

  11. 杨森淇

    Hi, with sentence length = 4 the output after the convolution and maxpool2d is [batch_size, output_channel, 2, 1], which no longer matches the fully connected layer's dimension (output_channel). I ran into this problem as well and don't know how to change the code. Is there a way around it?

  12. 王

    "As for why there are two feature maps, you can think of it as a batch size of 2"
    Isn't the point in the original paper that they propose an extension of the basic model, using two kinds of channels: one whose parameters are kept fixed and one whose parameters are updated through backpropagation? What does that have to do with batch size...

  13. repairditch_dog

    I'm confused: in Li Mu's TextCNN the word-vector dimension is the height and the number of tokens is the width. Which way should it actually be?
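Two of the open questions in the comments above can be sketched in code. These are illustrative sketches, not code from the original post.

For the multichannel architecture asked about in comment 3 (one embedding channel kept fixed, one fine-tuned through backpropagation), a minimal sketch; the class name MultiChannelEmbedding is hypothetical, and in practice both channels would normally start from the same pretrained vectors:

import torch
import torch.nn as nn

class MultiChannelEmbedding(nn.Module):  # hypothetical helper, not part of the original code
    def __init__(self, vocab_size, embedding_size):
        super().__init__()
        self.static = nn.Embedding(vocab_size, embedding_size)      # fixed channel
        self.non_static = nn.Embedding(vocab_size, embedding_size)  # fine-tuned channel
        self.static.weight.requires_grad = False                    # keep this channel unchanged during training

    def forward(self, X):  # X: [batch_size, seq_len]
        # stack the two lookups as two input channels, so the first conv layer becomes Conv2d(2, ...)
        return torch.stack([self.static(X), self.non_static(X)], dim=1)  # [batch_size, 2, seq_len, embedding_size]

For the shape mismatch raised in comments 7 and 11 (sentences longer than 3 words no longer reduce to [batch_size, output_channel, 1, 1] after the fixed MaxPool2d), one common workaround is to max-pool over however many positions the convolution produces, so the fully connected input stays output_channel for any sentence length. A minimal sketch, where the class name TextCNNAnyLength and its constructor arguments are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNAnyLength(nn.Module):  # hypothetical variant, not from the original post
    def __init__(self, vocab_size, embedding_size, num_classes, output_channel=3):
        super().__init__()
        self.W = nn.Embedding(vocab_size, embedding_size)
        self.conv = nn.Conv2d(1, output_channel, (2, embedding_size))
        self.fc = nn.Linear(output_channel, num_classes)

    def forward(self, X):  # X: [batch_size, seq_len], seq_len >= 2
        emb = self.W(X).unsqueeze(1)                    # [batch_size, 1, seq_len, embedding_size]
        h = F.relu(self.conv(emb)).squeeze(3)           # [batch_size, output_channel, seq_len - 1]
        pooled = F.max_pool1d(h, h.size(2)).squeeze(2)  # max over the full length -> [batch_size, output_channel]
        return self.fc(pooled)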