arXiv PDF

For scaling large language models, performance measured by cross-entropy loss depends mostly on model size, dataset size, and training compute; other factors like architecture and network width/depth have very small effects.

Key findings

Performance depends strongly on scale and weakly on model shape.

Scale consists of:

  1. Number of model parameters ( N ) { excluding embeddings }
  2. Size of the dataset ( D ) in tokens
  3. Amount of compute used for training ( C )

Smooth Power Laws

N, D, and C each have a power-law relationship with performance, as long as none of them is bottlenecked by the others.
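A minimal sketch of these per-factor power laws; the constants below are the paper's approximate fits ( alpha_N ≈ 0.076, N_c ≈ 8.8e13; alpha_D ≈ 0.095, D_c ≈ 5.4e13 ) and are tied to their dataset and tokenizer, so treat them as illustrative:

```python
# Sketch of the paper's per-factor scaling laws. Constants are the paper's
# approximate fits and depend on dataset/tokenizer, so they are illustrative.

def loss_from_params(n, n_c=8.8e13, alpha_n=0.076):
    """L(N) = (N_c / N)^alpha_N, when D and C are not bottlenecks."""
    return (n_c / n) ** alpha_n

def loss_from_data(d, d_c=5.4e13, alpha_d=0.095):
    """L(D) = (D_c / D)^alpha_D, when N and C are not bottlenecks."""
    return (d_c / d) ** alpha_d

# Power law => doubling model size multiplies loss by a constant
# factor 2**-alpha_n, regardless of the starting size.
ratio = loss_from_params(2e9) / loss_from_params(1e9)
print(ratio)
```

The useful property is the constant multiplicative improvement per doubling, which is what makes extrapolation across orders of magnitude possible.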

Universality of Overfitting

Performance improves only when N and D are scaled in tandem; if one is held fixed while the other grows, we start seeing diminishing returns. Every time we increase model size by 8x, we need to scale the data by roughly 5x to avoid this penalty.
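The 8x/5x rule of thumb follows from the paper's fitted relation that dataset size should grow roughly as D ∝ N^0.74 to avoid an overfitting penalty; a one-line check (the 0.74 exponent is the paper's fit, not exact):

```python
# Rule of thumb from the paper's fit: D should grow ~ N^0.74 to avoid
# an overfitting penalty. The exponent 0.74 is approximate.
model_scale = 8
data_scale = model_scale ** 0.74
print(f"{model_scale}x model size -> {data_scale:.1f}x data")
```

So an 8x larger model needs only about 4.7x (roughly 5x) more data, i.e. data requirements grow sublinearly in model size.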

Universality of Training

Training curves follow predictable power-laws irrespective of model size, making future loss predictable if early loss is known.
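A toy illustration of this predictability: if the training curve follows L(S) ≈ a · S^(-b) + c, fitting the early steps recovers the law and extrapolates later loss. The numbers below are synthetic, not from the paper:

```python
# Toy example: fit a power law to early-training losses and extrapolate.
# Data is synthetic; "c" plays the role of an assumed irreducible loss.
import numpy as np

steps = np.array([100, 200, 400, 800, 1600], dtype=float)
a, b, c = 20.0, 0.3, 2.0
losses = a * steps ** (-b) + c  # pretend these were measured early in training

# Fit log(L - c) = log(a) - b * log(S) with a linear least-squares fit.
slope, log_a_fit = np.polyfit(np.log(steps), np.log(losses - c), 1)
pred_10k = np.exp(log_a_fit) * 10_000 ** slope + c
print(round(pred_10k, 3))
```

Because the synthetic data is an exact power law, the extrapolation to step 10,000 matches the true curve; on real training curves the fit is approximate but, per the paper, still predictive.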

Transfer improves with test performance

When a trained model is evaluated on a distribution different from the one it was trained on, it performs similarly to how it does on the training validation set, i.e. with a roughly constant offset in loss on the unseen evals. And the model's performance on both the train-val set and the unseen evals continues to improve as it is trained.

This is a strong signal of the generalisation ability the LLM learns during training.

Sample Efficiency

Larger models are more sample-efficient than smaller models: they reach the same level of performance with fewer optimisation steps and fewer datapoints.

This likely indicates that bigger models have more "knobs" available to tune than smaller ones, and hence more expressive and representational power.

Convergence is Inefficient

With fixed C and no restrictions on N and D, the optimal configuration is to train very large models and stop training significantly before convergence.
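The paper's compute-optimal recipe allocates a growing compute budget mostly to model size; a sketch using the paper's approximate exponents (N ∝ C^0.73, batch ∝ C^0.24, serial steps ∝ C^0.03):

```python
# Sketch of the paper's compute-optimal allocation. Exponents are the
# paper's approximate fits; they sum to ~1 since C ~ N * B * S.
def optimal_allocation(compute_scale):
    return {
        "model_size": compute_scale ** 0.73,
        "batch_size": compute_scale ** 0.24,
        "steps":      compute_scale ** 0.03,
    }

alloc = optimal_allocation(10.0)  # what changes with 10x more compute?
print({k: round(v, 2) for k, v in alloc.items()})
```

With 10x more compute the model should be ~5.4x bigger while serial steps barely grow (~1.07x), which is exactly why training stops well before convergence.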

Optimal Batch Size

The optimal batch size follows a power law in the loss and can be measured via the "gradient noise scale"; they use 1–2 million tokens per batch at convergence for their largest models.
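The paper's critical batch size takes the form B_crit(L) = B* / L^(1/alpha_B); plugging in their approximate fits ( B* ≈ 2e8 tokens, alpha_B ≈ 0.21 ) reproduces the ~1 million token scale at convergence-level losses:

```python
# Sketch of the paper's critical batch size B_crit(L) = B* / L^(1/alpha_B).
# B* ≈ 2e8 tokens and alpha_B ≈ 0.21 are the paper's approximate fits.
def critical_batch_tokens(loss, b_star=2e8, alpha_b=0.21):
    return b_star / loss ** (1 / alpha_b)

# Lower loss (a better model, later in training) -> larger optimal batch.
print(f"{critical_batch_tokens(3.0):.2e} tokens")
```

At a loss around 3 nats this gives roughly a million tokens per batch, consistent with the 1–2 million figure above.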

Together, all of the above make LLM training smooth and predictable.


Empirical Summary

The paper goes on to provide empirical evidence for how the scaling laws behave and fits those findings into equations whose tuned parameters are tied to the dataset, tokenisation, and vocabulary, which makes them not very transferable. But the main takeaway is:

As models grow larger, they become increasingly sample-efficient. In practice, researchers typically train smaller models for longer than would be maximally compute-efficient because of hardware constraints. Optimal performance depends on total compute as a power law.


Adding to reading list: "Universal Transformers", which seems like an interesting idea with its "transition functions".


Dataset

  • Dataset used: WebText2 (scraped using the Newspaper3k library)
    • 20.3M documents containing 96 GB of text and 1.62 × 10^10 words (as defined by wc)
  • BPE tokenizer
    • Yields 2.29 × 10^10 tokens
    • 6.6 × 10^8 of these tokens are reserved for use as a test set
  • vocabulary size = 50257
  • context = 1024 tokens
  • loss fn = cross-entropy
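A quick sanity check on the dataset numbers above: the BPE tokenizer yields about 1.4 tokens per word on this corpus, and the test set is under 3% of the tokens.

```python
# Sanity-check the dataset statistics listed above.
words = 1.62e10        # words, as defined by wc
tokens = 2.29e10       # BPE tokens
test_tokens = 6.6e8    # tokens held out for the test set

tokens_per_word = tokens / words
test_fraction = test_tokens / tokens
print(round(tokens_per_word, 2), f"{test_fraction:.1%}")
```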

Training Process

  • Adam Optimizer
  • Fixed step count of 2.5 × 10^5 steps
  • batch size = 512 sequences of 1024 tokens each

All training runs included in their data used a learning rate schedule with a 3000-step linear warmup followed by a cosine decay to zero.
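The schedule described above can be sketched as follows; the peak learning rate value here is illustrative, not from the paper:

```python
# Sketch of the schedule: 3000-step linear warmup, then cosine decay to
# zero over the remaining steps. The peak LR (6e-4) is an assumption.
import math

def lr_at(step, total_steps, warmup=3000, peak=6e-4):
    if step < warmup:
        return peak * step / warmup                      # linear warmup
    progress = (step - warmup) / (total_steps - warmup)  # in [0, 1]
    return peak * 0.5 * (1 + math.cos(math.pi * progress))

total = 250_000
print(lr_at(0, total), lr_at(3000, total), lr_at(total, total))
```

The LR rises linearly from 0 to the peak at step 3000, then follows a half-cosine down to exactly 0 at the final step.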


Section 6.3, "Contradictions and a Conjecture", is very interesting.