Language Models are Unsupervised Multitask Learners
Paper: PDF
Noticed that the author list is pretty interesting; worth googling the names!
There are no two ways about it: this is the GPT-2 paper.
I think these lines from the abstract capture the core idea well:
We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText.
These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Background
Before this, the core ML approach was to train narrow experts, with the process being:
- collect data demoing desired behaviour
- train system to imitate this behaviour
- test its performance on IID examples
But this produces fragile experts and cannot give rise to “competent generalists”.
Enter multi-task learning. Even this is traditionally done by curating (dataset, objective) pairs, and the best results so far only scaled to 10 and 17 such pairs. For any real use we’ll need thousands of them, meaning current techniques demand endless dataset curation and objective design! Hence the motive to explore alternative approaches to multitask learning without these bottlenecks.
Things that were already working:
- pretraining + supervised fine-tuning: a task-agnostic architecture + attention is sufficient (see GPT-1)
- even with little supervised data, commonsense reasoning + sentiment analysis are still possible via language modeling
This paper marries both approaches and continues toward a more general method of transfer.
The prominent expression of auto-regressiveness arrives!
Core Idea:
McCann et al. (2018): language provides a flexible way to specify tasks, inputs, and outputs all as a sequence of symbols.
“Language modeling is also able to, in principle, learn the tasks of McCann et al. (2018) without the need for explicit supervision of which symbols are the outputs to be predicted. Since the supervised objective is the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective.”
Oh how right were they! (lol xD)
Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a language model is able to do this it will be, in effect, performing unsupervised multitask learning.
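The framing behind this speculation can be written compactly. A sketch of the two objectives as the paper sets them up:

```latex
% Unsupervised LM objective: factorize the joint over symbols
p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})

% A general multitask system should instead model the task-conditional
p(\mathrm{output} \mid \mathrm{input}, \mathrm{task})
% ...but the task itself can be specified in language, e.g.
% "translate to french, english text, french text",
% so it reduces to the same next-symbol prediction.
```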
Dataset
Dataset: not Common Crawl (too noisy even with filtering); instead they build a new web scrape from all outbound Reddit links with at least 3 karma. The resulting dataset is “WebText”. All Wikipedia documents are removed to avoid training-data overlap with the ablation studies and the evals of this experiment.
Input representation: using raw UTF-8 bytes as the vocab is convenient, but byte-level LMs underperform word-level LMs. Solution: Byte Pair Encoding (BPE) for tokenization.
BPE interpolates between the two: word-level tokens for frequent symbol sequences, character-level tokens for infrequent ones.
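A toy sketch of the BPE merge loop (not the paper’s byte-level implementation; just the core idea of repeatedly merging the most frequent adjacent pair into a new symbol):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair, new_symbol):
    """Replace every occurrence of `pair` with `new_symbol`."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Start from individual characters (GPT-2 starts from raw bytes instead).
tokens = list("low lower lowest")
for _ in range(3):
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair, "".join(pair))
```

After a few merges, frequent sequences like "low" become single tokens while rare characters stay character-level.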
Model
Similar architecture as GPT-1 with the following modifications:
- Move all layer norms to the start of each sub-block (pre-norm), and add an extra layer norm after the final self-attention block.
- Scale the weights of residual layers at initialization by 1/√N, where N is the number of residual layers.
- Vocab size is now 50,257
- larger context size 512 → 1024
- larger batch size = 512
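The first two tweaks above can be sketched in a few lines of NumPy (names and shapes here are illustrative, not the paper’s code):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last dimension to zero mean / unit variance."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_ln_block(x, attn, mlp):
    """GPT-2 style sub-blocks: the layer norm is applied to the *input* of
    each sub-block, with the residual added around it (GPT-1 normalized
    the output instead). A final extra layer norm follows the last block."""
    x = x + attn(layer_norm(x))
    x = x + mlp(layer_norm(x))
    return x

# Residual-path weights are scaled by 1/sqrt(N) at initialization,
# N = number of residual layers, to keep the residual stream's scale in check.
rng = np.random.default_rng(0)
d_model, n_layers = 64, 12  # illustrative sizes, not GPT-2's
w_resid_proj = rng.normal(0.0, 0.02, (d_model, d_model)) / np.sqrt(n_layers)
```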
Experiments
We trained and benchmarked four LMs with approximately log-uniformly spaced sizes.
Our largest model, which we call GPT-2, has over an order of magnitude more parameters than GPT.
Cinematic foreshadowing!
All models still underfit WebText, and held-out perplexity has so far kept improving with additional training time.
Note: they use n-gram-overlap-based de-duplication as a verification/sanity check for train-test overlap.
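A rough sketch of what such a contamination check could look like (the paper uses Bloom filters of token 8-grams; this is a simplified set-based stand-in):

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams in `text`."""
    toks = text.lower().split()
    return {tuple(toks[i : i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(train_text, test_text, n=8):
    """Fraction of the test set's n-grams that also occur in the training
    data. A high value flags likely train-test contamination."""
    test = ngrams(test_text, n)
    if not test:
        return 0.0
    return len(test & ngrams(train_text, n)) / len(test)
```

In practice one would stream n-grams through a Bloom filter rather than materialize full sets, but the overlap statistic reported is the same idea.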
😂 Can’t believe they kept this, love it!
GPT-2 is also able to write news articles about the discovery of talking unicorns. An example is provided in Table 13.
Related Work
Interesting :
(Ramachandran et al., 2016) demonstrated that seq2seq models benefit from being initialized with pre-trained language models as encoders and decoders
Conclusion
Scaling-Laws-FTW!
When a large language model is trained on a sufficiently large and diverse dataset it is able to perform well across many domains and datasets.
Good read; gives a general idea of the trend, both what came just before and what kicked off the scaling frenzy. One of the few papers where I read through each experiment. I had not really appreciated the “unsupervised” aspect of GPT even though it was known to be so. It really was an AHA moment for the community and the field.