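This post uses an English relation-extraction competition to show how to fine-tune a Huggingface pretrained model. It can also be read as a simple case study in relation extraction.
Data Preview
The training data has two columns. The first column is the text, in which <e1></e1> wraps the first entity and <e2></e2> wraps the second. The second column is the relation. Every relation except "Other" is directional. Take the first sample, which says roughly "the harm is caused by the invitation system": "harm" (e1) is the effect and "system" (e2) is the cause, so the final label is "Cause-Effect (e2,e1)".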
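To make the data format concrete, here is a minimal sketch that pulls the two entities out of a tagged sentence and splits a label into its relation and argument order. The helper names `extract_entities` and `parse_label` and the sample sentence are made up for illustration; only the tag and label formats come from the description above.

```python
import re

# <e1>...</e1> and <e2>...</e2> wrap the two entities
ENTITY_RE = re.compile(r"<e([12])>(.*?)</e\1>")
# Directed labels look like "Cause-Effect (e2,e1)"; "Other" has no argument order
LABEL_RE = re.compile(r"^(?P<relation>[\w-]+)\s*\((?P<first>e[12]),\s*(?P<second>e[12])\)$")

def extract_entities(sentence):
    """Return {'e1': ..., 'e2': ...} with the tags stripped."""
    return {f"e{idx}": text.strip() for idx, text in ENTITY_RE.findall(sentence)}

def parse_label(label):
    """Split a label into (relation, argument order); order is None for 'Other'."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        return label.strip(), None
    return match["relation"], (match["first"], match["second"])

# Made-up sentence in the same format as the training data
sentence = "The <e1>harm</e1> is caused by the invitation <e2>system</e2>."
print(extract_entities(sentence))
print(parse_label("Cause-Effect (e2,e1)"))
print(parse_label("Other"))
```

Reading the label this way makes the direction explicit: `("e2", "e1")` means the second tagged entity fills the first role of the relation (the cause) and the first tagged entity fills the second role (the effect).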
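Approach
All of the approaches fine-tune a Huggingface pretrained model, but there are several ways to set up the prediction.
Method 1
The labels are unusual: the model has to predict not only the relation between the two entities but also their order. One option is therefore to predict three values (the accompanying figure uses the first sentence as its example).
The output at the [CLS] position goes through a 10-way classifier, and the outputs for the two entities each go through a 3-way classifier. The 3-way classification is deliberate. Class 0 occurs in only one case: when the [CLS] output is predicted as "Other", because "Other" has no entity order. Otherwise, if "harm" is predicted as 1 and "system" as 2, then "harm" comes before "system" in the relation, and vice versa. Together, the three predictions uniquely determine a label.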
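The three-value scheme can be sketched as an encode/decode pair. This is only an illustration of the idea: the `RELATIONS` list is a shortened, assumed subset rather than the competition's full relation list, and `encode` and `decode` are hypothetical helper names.

```python
# Illustrative subset only; the real task has 10 undirected relation classes
RELATIONS = ["Other", "Cause-Effect", "Component-Whole"]

def encode(label):
    """Map a label string to (relation_id, e1_position, e2_position)."""
    if label == "Other":
        return RELATIONS.index("Other"), 0, 0      # class 0: no order to predict
    relation, order = label.split(" (")
    first = order.rstrip(")").split(",")[0]
    e1_pos, e2_pos = (1, 2) if first == "e1" else (2, 1)
    return RELATIONS.index(relation), e1_pos, e2_pos

def decode(relation_id, e1_pos, e2_pos):
    """Rebuild the label string from the three predicted values."""
    relation = RELATIONS[relation_id]
    if relation == "Other":
        return "Other"
    first, second = ("e1", "e2") if e1_pos < e2_pos else ("e2", "e1")
    return f"{relation} ({first},{second})"

print(encode("Cause-Effect (e2,e1)"))
print(decode(1, 2, 1))
```

A real model can produce inconsistent triples (for example a directed relation together with position 0), so a practical decoder would need an explicit rule for those cases; this sketch does not handle them specially.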
Method 2
Here the outputs of the two entities are concatenated and fed to a single 18-way classifier (see the accompanying figure).
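A minimal PyTorch sketch of such a head is shown below. It assumes each entity is represented by the hidden state at a single token position (for example its opening tag), which the post does not specify. `EntityPairHead`, `e1_index`, `e2_index`, and the hidden size of 768 are illustrative choices, and random tensors stand in for the encoder output.

```python
import torch
import torch.nn as nn

class EntityPairHead(nn.Module):
    """Concatenate the two entity vectors and classify them jointly."""

    def __init__(self, hidden_size, num_labels=18):
        super().__init__()
        self.classifier = nn.Linear(2 * hidden_size, num_labels)

    def forward(self, hidden_states, e1_index, e2_index):
        # hidden_states: (batch, seq_len, hidden); e*_index: (batch,) token positions
        rows = torch.arange(hidden_states.size(0), device=hidden_states.device)
        e1_vec = hidden_states[rows, e1_index]   # (batch, hidden)
        e2_vec = hidden_states[rows, e2_index]   # (batch, hidden)
        return self.classifier(torch.cat([e1_vec, e2_vec], dim=-1))

hidden_states = torch.randn(2, 96, 768)          # stand-in for the encoder output
head = EntityPairHead(hidden_size=768)
logits = head(hidden_states, torch.tensor([3, 10]), torch.tensor([8, 15]))
print(logits.shape)
```

Averaging all tokens of an entity instead of picking one position would be an equally reasonable variant.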
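Method 3
This is the simplest approach and the one used in this post: feed the [CLS] output directly into an 18-way classifier.
There are other options as well; feel free to experiment with them.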
Data Preprocessing
    def process_data(filename):
        # Read the CSV and drop the header row; each remaining row is (sentence, label)
        with open(filename, newline='') as f:
            rows = list(csv.reader(f))[1:]
        label_list = [label for _, label in rows]  # label strings
        global classes_list
        # Sort the distinct labels so the label -> index mapping is the same on every run
        classes_list = sorted(set(label_list))
        num_classes = len(classes_list)
        label_list = [classes_list.index(label) for label in label_list]  # label indices

        name_list, sentence_list = [], []
        for sentence, _ in rows:
            # Slice out each entity together with its surrounding tags
            begin = sentence.find('<e1>')
            end = sentence.find('</e1>')
            e1 = sentence[begin:end + 5]

            begin = sentence.find('<e2>')
            end = sentence.find('</e2>')
            e2 = sentence[begin:end + 5]

            name_list.append(e1 + " " + e2)
            sentence_list.append(sentence)
        print(num_classes)
        return name_list, sentence_list, label_list, classes_list, num_classes
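name_list is a flat list holding, for each row, the two entities (with their tags) separated by a space. sentence_list is a flat list of the raw sentences. label_list is a flat list of ints: each original string label converted to its index. classes_list is the deduplicated list of labels, and num_classes is len(classes_list).
I use the third approach here, so name_list is not needed later. I still extract it so that readers can use it if they want.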
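One detail worth illustrating is the label-to-index mapping. The iteration order of a set of strings can differ between Python runs because of hash randomization, so indices derived from an unsorted set may not match if training and prediction happen in separate processes. The sketch below sorts the labels first and uses a dict for lookup. `build_label_mapping` is a hypothetical helper and the sample labels are made up.

```python
def build_label_mapping(labels):
    """Return (classes, label_to_id) with a run-independent ordering."""
    classes = sorted(set(labels))
    label_to_id = {label: idx for idx, label in enumerate(classes)}
    return classes, label_to_id

# Made-up labels in the same format as the training data
labels = ["Other", "Cause-Effect (e2,e1)", "Other", "Cause-Effect (e1,e2)"]
classes, label_to_id = build_label_mapping(labels)
ids = [label_to_id[label] for label in labels]
print(classes)
print(ids)
```

The dict lookup also avoids the repeated linear scan of `list.index`, which matters little for 18 classes but costs nothing to adopt.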
XLNetTokenizer
Next, each sentence in sentence_list is passed through XLNetTokenizer to obtain the token ids, attention mask, and related inputs for that sentence.
    tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')

    def convert(names, sentences, target):  # name_list, sentence_list, label_list
        input_ids, token_type_ids, attention_mask = [], [], []
        for sentence in sentences:
            encoded_dict = tokenizer.encode_plus(
                sentence,                  # input text
                add_special_tokens=True,   # add XLNet's '<sep>' and '<cls>' tokens
                max_length=96,             # pad shorter inputs to this length
                padding='max_length',
                truncation=True,           # truncate longer inputs
                return_tensors='pt',       # return PyTorch tensors
            )
            input_ids.append(encoded_dict['input_ids'])
            token_type_ids.append(encoded_dict['token_type_ids'])
            attention_mask.append(encoded_dict['attention_mask'])

        # Each entry has shape (1, 96); concatenate into (num_sentences, 96) long tensors
        input_ids = torch.cat(input_ids, dim=0)
        token_type_ids = torch.cat(token_type_ids, dim=0)
        attention_mask = torch.cat(attention_mask, dim=0)
        target = torch.LongTensor(target)

        return input_ids, token_type_ids, attention_mask, target
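To show what padding, truncation, and the attention mask amount to, here is a toy pure-Python version that works on a plain list of token ids. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, `pad_id=0` is a placeholder value, and the left-padding default reflects my understanding of how the XLNet tokenizer pads, which is worth verifying against the library.

```python
def pad_or_truncate(ids, max_length, pad_id=0, padding_side="left"):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = ids[:max_length]                 # truncate inputs that are too long
    n_pad = max_length - len(ids)          # number of padding slots needed
    mask = [1] * len(ids)                  # 1 = real token, 0 = padding
    if padding_side == "left":
        return [pad_id] * n_pad + ids, [0] * n_pad + mask
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

print(pad_or_truncate([7, 8, 9], 5))
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))
```

The real tokenizer also keeps the special tokens in place when truncating, which this toy version ignores.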
Train Test Split
To measure how well the model is doing, we hold out part of the training data as a validation set by calling train_test_split from sklearn.
    train_inputs, val_inputs, train_labels, val_labels = train_test_split(input_ids, labels, random_state=1, test_size=0.1)
    train_token, val_token, _, _ = train_test_split(token_type_ids, labels, random_state=1, test_size=0.1)
    train_mask, val_mask, _, _ = train_test_split(attention_mask, labels, random_state=1, test_size=0.1)

    train_data = Data.TensorDataset(train_inputs, train_token, train_mask, train_labels)
    train_dataloader = Data.DataLoader(train_data, batch_size=batch_size, shuffle=True)

    # The validation set does not need shuffling
    validation_data = Data.TensorDataset(val_inputs, val_token, val_mask, val_labels)
    validation_dataloader = Data.DataLoader(validation_data, batch_size=batch_size, shuffle=False)
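Note that the ids, token type ids, and mask of each sample correspond one to one, so the three train_test_split calls must use the same random_state (and the same test_size); otherwise the resulting splits no longer line up.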
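The reason equal seeds keep things aligned is that every array ends up indexed by the same shuffled index list. The sketch below makes that explicit in plain Python; `split_together` is a hypothetical helper, not sklearn's implementation, and the data is made up. In practice, train_test_split also accepts several arrays in a single call and splits them consistently, which avoids having to repeat the seed.

```python
import random

def split_together(*arrays, test_size=0.1, seed=1):
    """Split several equal-length sequences with one shared index permutation."""
    n = len(arrays[0])
    assert all(len(a) == n for a in arrays), "all arrays must have the same length"
    indices = list(range(n))
    random.Random(seed).shuffle(indices)         # one permutation for every array
    n_val = max(1, round(n * test_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    result = []
    for a in arrays:
        result.append([a[i] for i in train_idx])
        result.append([a[i] for i in val_idx])
    return result

ids = list(range(10))                            # made-up sample ids
masks = [i * 10 for i in ids]                    # made-up values tied to each id
train_ids, val_ids, train_masks, val_masks = split_together(ids, masks)
print(train_ids, val_ids)
print(train_masks, val_masks)
```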
XLNetForSequenceClassification
Since I am using plain sentence classification here, the ready-made Huggingface API can be used directly (remember to set the number of classes). The code below is adapted from "Training and fine-tuning" in the Huggingface Docs.
    model = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=num_classes).to(device)

    param_optimizer = list(model.named_parameters())
    # No weight decay for biases and LayerNorm weights (XLNet names these modules 'layer_norm')
    no_decay = ['bias', 'layer_norm.weight']
    # The optimizer reads the 'weight_decay' key of each parameter group
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0}]

    optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)
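The grouping logic is just substring matching on parameter names, which the following sketch isolates from the model. The parameter names are illustrative examples of what XLNet-style names might look like, not a dump from the actual model, and `split_by_weight_decay` is a hypothetical helper.

```python
def split_by_weight_decay(param_names, no_decay=("bias", "layer_norm.weight")):
    """Return (names that get weight decay, names that are exempt)."""
    decay, exempt = [], []
    for name in param_names:
        # A parameter is exempt if any no_decay pattern occurs in its name
        (exempt if any(nd in name for nd in no_decay) else decay).append(name)
    return decay, exempt

# Illustrative names only
names = [
    "transformer.layer.0.ff.layer_1.weight",
    "transformer.layer.0.ff.layer_1.bias",
    "transformer.layer.0.ff.layer_norm.weight",
    "logits_proj.weight",
]
decay, exempt = split_by_weight_decay(names)
print(decay)
print(exempt)
```

Printing the two groups for the real model before training is a cheap way to confirm that the patterns match the parameters you expect.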
Train & Val
During training, just pass the model the arguments it expects.
    for _ in range(2):  # 2 epochs
        for i, batch in enumerate(train_dataloader):
            model.train()  # eval() switches the model to evaluation mode, so switch back
            batch = tuple(t.to(device) for t in batch)
            loss = model(batch[0], token_type_ids=batch[1], attention_mask=batch[2], labels=batch[3])[0]
            print(loss.item())

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if i % 10 == 0:  # validate every 10 batches
                eval(model, validation_dataloader)
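The model's return value is described in the XLNetForSequenceClassification API documentation. When labels are passed, the first value is the loss used for backpropagation and the second is the logits used for prediction. The hidden states of every layer and token are only returned when output_hidden_states=True is set; they come after the logits (the fourth value when mems are also returned), and their exact position depends on which optional outputs are enabled, so on recent transformers versions it is safer to read fields by name (outputs.loss, outputs.logits, outputs.hidden_states).
During training, the model is evaluated on the validation set every 10 batches. During validation and testing we do not pass the true labels to the model, so it does not return a loss; the first returned value is then the logits rather than the loss.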
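To make the relationship between logits, loss, and prediction concrete, here is a small sketch for a single example. It assumes the loss for single-label classification is standard softmax cross-entropy, which is my understanding of what the model computes when integer labels are passed. `cross_entropy` and `predict` are hypothetical helpers and the logits are made up.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class for one example."""
    m = max(logits)                                   # subtract the max for stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def predict(logits):
    """Index of the largest logit, i.e. the predicted class."""
    return max(range(len(logits)), key=logits.__getitem__)

logits = [0.1, 2.0, -1.0]                             # made-up scores for 3 classes
print(predict(logits))
print(cross_entropy(logits, target=1))
```

At prediction time only the argmax matters, which is why the evaluation code never needs the loss.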
    def flat_accuracy(preds, labels):
        # preds: (batch, num_classes) logits; labels: (batch,) class indices
        pred_flat = np.argmax(preds, axis=1).flatten()  # e.g. [3, 5, 8, 1, 2, ...]
        labels_flat = labels.flatten()
        return np.sum(pred_flat == labels_flat) / len(labels_flat)

    def eval(model, validation_dataloader):  # note: this name shadows Python's built-in eval
        model.eval()  # disable dropout for evaluation
        eval_accuracy, nb_eval_steps = 0, 0
        for batch in validation_dataloader:
            batch = tuple(t.to(device) for t in batch)
            with torch.no_grad():
                # No labels are passed, so the first output is the logits
                logits = model(batch[0], token_type_ids=batch[1], attention_mask=batch[2])[0]
                logits = logits.detach().cpu().numpy()
                label_ids = batch[3].cpu().numpy()
                tmp_eval_accuracy = flat_accuracy(logits, label_ids)
                eval_accuracy += tmp_eval_accuracy
                nb_eval_steps += 1
        print("Validation Accuracy: {}".format(eval_accuracy / nb_eval_steps))
        global best_score
        # Keep the checkpoint with the best validation accuracy so far
        if best_score < eval_accuracy / nb_eval_steps:
            best_score = eval_accuracy / nb_eval_steps
            save(model)
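The validation code above averages the per-batch accuracies. When the last batch is smaller than the others, that average can differ from the accuracy over all samples. The sketch below compares the two on made-up counts; both function names are hypothetical.

```python
def mean_of_batch_accuracies(batches):
    """Average the accuracy of each batch, as the validation loop does."""
    return sum(correct / size for correct, size in batches) / len(batches)

def exact_accuracy(batches):
    """Total correct predictions divided by total samples."""
    return sum(c for c, _ in batches) / sum(s for _, s in batches)

# (num_correct, batch_size) per batch; made-up numbers with a small final batch
batches = [(16, 32), (24, 32), (4, 4)]
print(mean_of_batch_accuracies(batches))
print(exact_accuracy(batches))
```

For model selection the difference is usually small, but accumulating correct and total counts gives the exact figure at no extra cost.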
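While computing the validation accuracy, we also save the parameters of the model with the highest validation accuracy so far; these are the parameters used later for the actual test. The code below handles saving the model.
    output_dir = './models/'
    output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
    output_config_file = os.path.join(output_dir, CONFIG_NAME)

    def save(model):
        # Create the output directory first; torch.save raises FileNotFoundError if it is missing
        os.makedirs(output_dir, exist_ok=True)
        torch.save(model.state_dict(), output_model_file)
        model.config.to_json_file(output_config_file)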
Loading the model takes a single line:
    # load model
    model = XLNetForSequenceClassification.from_pretrained(output_dir).to(device)
Test
    def pred():
        # Load the best checkpoint saved during training
        model = XLNetForSequenceClassification.from_pretrained(output_dir).to(device)

        with open('test.csv', newline='') as f:
            rows = list(csv.reader(f))[1:]  # skip the header row
        sentence_list = [text for idx, text in rows]

        # name_list and label_list are unused here, so pass placeholders
        input_ids, token_type_ids, attention_mask, _ = convert(['test'], sentence_list, [1])
        dataset = Data.TensorDataset(input_ids, token_type_ids, attention_mask)
        loader = Data.DataLoader(dataset, batch_size=32, shuffle=False)

        pred_label = []
        model.eval()
        for i, batch in enumerate(loader):
            batch = tuple(t.to(device) for t in batch)
            with torch.no_grad():
                logits = model(batch[0], token_type_ids=batch[1], attention_mask=batch[2])[0]
                logits = logits.detach().cpu().numpy()
                preds = np.argmax(logits, axis=1).flatten()
                pred_label.extend(preds)

        # Map class indices back to label strings (classes_list is filled in by process_data)
        for i in range(len(pred_label)):
            pred_label[i] = classes_list[pred_label[i]]

        pd.DataFrame(data=pred_label, index=range(len(pred_label))).to_csv('pred.csv')
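name_list and the true labels are not actually used here, but to reuse the earlier convert() function with the same signature, I simply pass two placeholder arguments, ['test'] and [1].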
Code
    import os
    import csv
    import torch
    import transformers
    import numpy as np
    import pandas as pd
    import torch.utils.data as Data
    from torch.optim import AdamW
    from transformers import (XLNetTokenizer, XLNetForSequenceClassification,
                              WEIGHTS_NAME, CONFIG_NAME)
    from sklearn.model_selection import train_test_split

    transformers.logging.set_verbosity_error()
    tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    best_score = 0
    batch_size = 32
    classes_list = []

    output_dir = './models/'
    output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
    output_config_file = os.path.join(output_dir, CONFIG_NAME)

    def process_data(filename):
        # Read the CSV and drop the header row; each remaining row is (sentence, label)
        with open(filename, newline='') as f:
            rows = list(csv.reader(f))[1:]
        label_list = [label for _, label in rows]  # label strings
        global classes_list
        # Sort the distinct labels so the label -> index mapping is the same on every run
        classes_list = sorted(set(label_list))
        num_classes = len(classes_list)
        label_list = [classes_list.index(label) for label in label_list]  # label indices

        name_list, sentence_list = [], []
        for sentence, _ in rows:
            # Slice out each entity together with its surrounding tags
            begin = sentence.find('<e1>')
            end = sentence.find('</e1>')
            e1 = sentence[begin:end + 5]

            begin = sentence.find('<e2>')
            end = sentence.find('</e2>')
            e2 = sentence[begin:end + 5]

            name_list.append(e1 + " " + e2)
            sentence_list.append(sentence)
        print(num_classes)
        return name_list, sentence_list, label_list, classes_list, num_classes

    def convert(names, sentences, target):  # name_list, sentence_list, label_list
        input_ids, token_type_ids, attention_mask = [], [], []
        for sentence in sentences:
            encoded_dict = tokenizer.encode_plus(
                sentence,                  # input text
                add_special_tokens=True,   # add XLNet's '<sep>' and '<cls>' tokens
                max_length=96,             # pad shorter inputs to this length
                padding='max_length',
                truncation=True,           # truncate longer inputs
                return_tensors='pt',       # return PyTorch tensors
            )
            input_ids.append(encoded_dict['input_ids'])
            token_type_ids.append(encoded_dict['token_type_ids'])
            attention_mask.append(encoded_dict['attention_mask'])

        # Each entry has shape (1, 96); concatenate into (num_sentences, 96) long tensors
        input_ids = torch.cat(input_ids, dim=0)
        token_type_ids = torch.cat(token_type_ids, dim=0)
        attention_mask = torch.cat(attention_mask, dim=0)
        target = torch.LongTensor(target)

        return input_ids, token_type_ids, attention_mask, target

    def flat_accuracy(preds, labels):
        # preds: (batch, num_classes) logits; labels: (batch,) class indices
        pred_flat = np.argmax(preds, axis=1).flatten()  # e.g. [3, 5, 8, 1, 2, ...]
        labels_flat = labels.flatten()
        return np.sum(pred_flat == labels_flat) / len(labels_flat)

    def save(model):
        # Create the output directory first; torch.save raises FileNotFoundError if it is missing
        os.makedirs(output_dir, exist_ok=True)
        torch.save(model.state_dict(), output_model_file)
        model.config.to_json_file(output_config_file)

    def eval(model, validation_dataloader):  # note: this name shadows Python's built-in eval
        model.eval()  # disable dropout for evaluation
        eval_accuracy, nb_eval_steps = 0, 0
        for batch in validation_dataloader:
            batch = tuple(t.to(device) for t in batch)
            with torch.no_grad():
                # No labels are passed, so the first output is the logits
                logits = model(batch[0], token_type_ids=batch[1], attention_mask=batch[2])[0]
                logits = logits.detach().cpu().numpy()
                label_ids = batch[3].cpu().numpy()
                tmp_eval_accuracy = flat_accuracy(logits, label_ids)
                eval_accuracy += tmp_eval_accuracy
                nb_eval_steps += 1
        print("Validation Accuracy: {}".format(eval_accuracy / nb_eval_steps))
        global best_score
        # Keep the checkpoint with the best validation accuracy so far
        if best_score < eval_accuracy / nb_eval_steps:
            best_score = eval_accuracy / nb_eval_steps
            save(model)

    def train_eval():
        name_list, sentence_list, label_list, _, num_classes = process_data('train.csv')
        input_ids, token_type_ids, attention_mask, labels = convert(name_list, sentence_list, label_list)

        # The same random_state and test_size keep the three splits aligned
        train_inputs, val_inputs, train_labels, val_labels = train_test_split(input_ids, labels, random_state=1, test_size=0.1)
        train_token, val_token, _, _ = train_test_split(token_type_ids, labels, random_state=1, test_size=0.1)
        train_mask, val_mask, _, _ = train_test_split(attention_mask, labels, random_state=1, test_size=0.1)

        train_data = Data.TensorDataset(train_inputs, train_token, train_mask, train_labels)
        train_dataloader = Data.DataLoader(train_data, batch_size=batch_size, shuffle=True)

        validation_data = Data.TensorDataset(val_inputs, val_token, val_mask, val_labels)
        validation_dataloader = Data.DataLoader(validation_data, batch_size=batch_size, shuffle=False)

        model = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=num_classes).to(device)

        param_optimizer = list(model.named_parameters())
        # No weight decay for biases and LayerNorm weights (XLNet names these modules 'layer_norm')
        no_decay = ['bias', 'layer_norm.weight']
        # The optimizer reads the 'weight_decay' key of each parameter group
        optimizer_grouped_parameters = [
            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
             'weight_decay': 0.01},
            {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
             'weight_decay': 0.0}]

        optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)

        for _ in range(2):  # 2 epochs
            for i, batch in enumerate(train_dataloader):
                model.train()  # eval() switches the model to evaluation mode, so switch back
                batch = tuple(t.to(device) for t in batch)
                loss = model(batch[0], token_type_ids=batch[1], attention_mask=batch[2], labels=batch[3])[0]
                print(loss.item())

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                if i % 10 == 0:  # validate every 10 batches
                    eval(model, validation_dataloader)

    def pred():
        # Load the best checkpoint saved during training
        model = XLNetForSequenceClassification.from_pretrained(output_dir).to(device)

        with open('test.csv', newline='') as f:
            rows = list(csv.reader(f))[1:]  # skip the header row
        sentence_list = [text for idx, text in rows]

        # name_list and label_list are unused here, so pass placeholders
        input_ids, token_type_ids, attention_mask, _ = convert(['test'], sentence_list, [1])
        dataset = Data.TensorDataset(input_ids, token_type_ids, attention_mask)
        loader = Data.DataLoader(dataset, batch_size=32, shuffle=False)

        pred_label = []
        model.eval()
        for i, batch in enumerate(loader):
            batch = tuple(t.to(device) for t in batch)
            with torch.no_grad():
                logits = model(batch[0], token_type_ids=batch[1], attention_mask=batch[2])[0]
                logits = logits.detach().cpu().numpy()
                preds = np.argmax(logits, axis=1).flatten()
                pred_label.extend(preds)

        # Map class indices back to label strings (classes_list is filled in by process_data)
        for i in range(len(pred_label)):
            pred_label[i] = classes_list[pred_label[i]]

        pd.DataFrame(data=pred_label, index=range(len(pred_label))).to_csv('pred.csv')

    if __name__ == '__main__':
        train_eval()
        pred()
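Comments
Can Huggingface's pretrained models reproduce the results reported at top conferences?
I think some of them can. That is not really Huggingface's goal, though; the goal is to make it easy for everyone to use all kinds of open-source pretrained models.
Thanks for sharing! Thinking about it more, the differences between a paper's released code and the Huggingface implementation probably do not affect the final results much :)
Right. In fact, many pretrained models are pushed straight to Huggingface, for example HIT's pretrained Chinese BERT.
It would be great if you could share the test results, to give a beginner like me a reference.
I do not remember the final test result exactly. It was roughly 70 to 80, because my approach is fairly simple, just plain classification.
Hi, if I do not use the API for the classification model, do I need to call XLNet's run_classifier.py, change the parameters, and train a model myself?
Yes.
I looked at the XLNet code and it also loads a pretrained model. Is there a version that does not use a pretrained model, that is, code that does classification directly with transformer-xl?
I am not sure about that. Try searching GitHub for the transformer-xl source code.
Why do I get this error? Is the model save path wrong? I followed your code exactly. Help! FileNotFoundError: [Errno 2] No such file or directory: './models/pytorch_model.bin'
... Solved it. How embarrassing...
Hi, I got the same error. How did you solve it?
Where can I download the dataset? I cannot open the link to the 研习社 (Yanxishe) competition.
Sorry, I did not use this dataset; I only used the code.
Shouldn't the second and third methods be 19-way classification?
Some relations do not come in pairs. If you have the dataset, take a close look.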
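One way to check this against a label column is to list, for each relation, which directions actually occur. The sketch below is only an illustration: `direction_coverage` is a hypothetical helper and the labels are made up, so it says nothing about which relation is unpaired in the real data.

```python
from collections import defaultdict

def direction_coverage(labels):
    """Map each directed relation to the argument orders seen in the data."""
    seen = defaultdict(set)
    for label in labels:
        if label == "Other":
            continue                              # "Other" has no direction
        relation, order = label.split(" (")
        seen[relation].add(order.rstrip(")"))
    return {relation: sorted(orders) for relation, orders in seen.items()}

# Made-up labels: one relation appears in both directions, one in only one
labels = [
    "Other",
    "Cause-Effect (e1,e2)",
    "Cause-Effect (e2,e1)",
    "Entity-Destination (e1,e2)",
]
coverage = direction_coverage(labels)
print(coverage)
print(len(set(labels)))
```

With two relations one might expect 2 * 2 + 1 = 5 classes, but only 4 distinct labels occur here because one direction is missing. The same effect would turn 19 expected classes into 18.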