# BART Explained

November 1, 2020

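The Encoder applies self-attention over the source and produces a representation for every token in the sentence. The classic Encoder-only architecture is BERT, which learns the relationships between tokens through a Masked Language Model objective; other examples include XLNet, RoBERTa, ALBERT, and DistilBERT. An Encoder-only architecture, however, is not well suited to generation tasks.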

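In the Decoder, the input and the output are offset by one position. This mimics inference, where the model must not see future tokens, and is called autoregressive modeling. Decoder-based models are typically used for sequence generation; examples include GPT and CTRL. A Decoder-only architecture, however, predicts each word from its left context only and cannot learn bidirectional interactions.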
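The one-position offset and the "no future tokens" rule can be sketched in a few lines of plain Python. This is only an illustration: the helper names `shift_right` and `causal_mask`, the token ids, and the use of `0` as a start-of-sequence id are assumptions made for the example, not part of any particular library.

```python
def shift_right(target_ids, bos_id):
    """Build decoder inputs: prepend a start token and drop the last target token."""
    return [bos_id] + target_ids[:-1]


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


target = [11, 12, 13, 14]                 # hypothetical token ids
decoder_input = shift_right(target, bos_id=0)

# At step i the decoder reads decoder_input[: i + 1] and must predict target[i].
for i, gold in enumerate(target):
    print(f"step {i}: sees {decoder_input[: i + 1]} -> predicts {gold}")

# The lower-triangular mask enforces the same rule inside self-attention.
for row in causal_mask(len(target)):
    print("".join("x" if allowed else "." for allowed in row))
```

An Encoder such as BERT would use a mask with every entry set to `True`, which is what lets it model both left and right context.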

#### BART vs Transformer

BART uses the standard Transformer architecture, with a few changes:

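1. Following GPT, the ReLU activation functions are replaced with GeLU, and parameters are initialized from the normal distribution $N(0, 0.02)$.
2. The BART base model has 6 layers in each of the Encoder and the Decoder; the large model increases this to 12 layers each.
3. Unlike BERT, each layer of the BART decoder additionally performs cross-attention over the final hidden layer of the encoder, as in the standard Transformer sequence-to-sequence model.
4. BERT uses an additional feed-forward layer before word prediction, whereas BART does not.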
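To make the first item concrete, here is a minimal sketch comparing ReLU with the exact form of GeLU, plus a normal initializer. Two points are assumptions: `init_weights` is a hypothetical helper, and `0.02` is read as the standard deviation, which the notation $N(0, 0.02)$ does not itself settle.

```python
import math
import random


def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def gelu(x):
    """Exact GeLU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_weights(n, std=0.02, seed=0):
    """Draw n weights from a zero-mean normal; std=0.02 is an assumed reading."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n)]


# GeLU is smooth and lets small negative inputs through, unlike ReLU.
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}")

print(init_weights(3))
```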

#### Pre-training BART

The BART authors experimented with several ways of corrupting the input:

- Token Masking: Following BERT (Devlin et al., 2019), random tokens are sampled and replaced with [MASK] elements.
- Sentence Permutation: A document is divided into sentences based on full stops, and these sentences are shuffled in a random order.
- Document Rotation: A token is chosen uniformly at random, and the document is rotated so that it begins with that token. This task trains the model to identify the start of the document.
- Token Deletion: Random tokens are deleted from the input. In contrast to token masking, the model must decide which positions are missing inputs.
- Text Infilling: A number of text spans are sampled, with span lengths drawn from a Poisson distribution ($\lambda=3$). Each span is replaced with a single [MASK] token. 0-length spans correspond to the insertion of [MASK] tokens. Text infilling teaches the model to predict how many tokens are missing from a span.
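The five corruption schemes can be sketched with the standard library alone. This is a simplified illustration and not the authors' implementation: the function names, the per-token probability `p`, and the `start_prob` rule for choosing where a span begins are all assumptions made for the example. Only the Poisson span length with $\lambda=3$ and the single [MASK] per span come from the description above.

```python
import math
import random

MASK = "[MASK]"


def token_masking(tokens, p, rng):
    """Replace each token with [MASK] independently with probability p."""
    return [MASK if rng.random() < p else t for t in tokens]


def token_deletion(tokens, p, rng):
    """Drop each token independently with probability p (no placeholder is left)."""
    return [t for t in tokens if rng.random() >= p]


def sentence_permutation(sentences, rng):
    """Return the sentences in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def document_rotation(tokens, rng):
    """Rotate the document so that it begins with a uniformly chosen token."""
    if not tokens:
        return []
    k = rng.randrange(len(tokens))
    return tokens[k:] + tokens[:k]


def sample_poisson(lam, rng):
    """Draw from a Poisson distribution using Knuth's multiplication method."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def text_infilling(tokens, start_prob, rng, lam=3.0):
    """Replace each sampled span with one [MASK]; a 0-length span inserts a [MASK]."""
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < start_prob:      # assumed rule for starting a span
            span = sample_poisson(lam, rng)
            out.append(MASK)
            if span == 0:
                out.append(tokens[i])      # pure insertion: keep the current token
                i += 1
            else:
                i += span                  # the whole span collapses into one [MASK]
        else:
            out.append(tokens[i])
            i += 1
    return out


rng = random.Random(0)
doc = "the cat sat on the mat . it was warm .".split()
print(token_masking(doc, 0.3, rng))
print(token_deletion(doc, 0.3, rng))
print(sentence_permutation(["the cat sat .", "it was warm ."], rng))
print(document_rotation(doc, rng))
print(text_infilling(doc, 0.2, rng))
```

Because text infilling emits one [MASK] regardless of span length, the corrupted sequence does not reveal how many tokens were removed, which is what forces the model to predict the span length.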

Last Modified: December 20, 2020
