Experimental Setup

We will train the multi-layer decoder network auto-regressively on WikiText-2 and benchmark loss and perplexity across all experiments; these, along with training time and GPU memory, will be our main metrics.

Note that we will keep tokens-per-step constant across all runs. For some experiments, like the Memory & GPU Utilisation ones, we will also measure wall-clock time.

Check out Token_Budgeting to understand the dataset and process better.

Some quick additions to the metrics before we start these experiments:

  • Learning rate
  • Log gradient norm (global) + activation norm (one or two layers)
  • Step time
  • Peak GPU memory
  • Tokens per step (not as a graph, but in the config)
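
These per-step metrics can be collected in one place. A minimal sketch of how the logging dict might be built each step (the helper name `global_grad_norm` and the toy model are illustrative, not from the actual training code):

```python
import time
import torch
import torch.nn as nn

def global_grad_norm(model: nn.Module) -> float:
    # L2 norm over all parameter gradients, matching clip_grad_norm_'s definition
    norms = [p.grad.norm(2) for p in model.parameters() if p.grad is not None]
    if not norms:
        return 0.0
    return torch.norm(torch.stack(norms), 2).item()

# Toy stand-in for the decoder model
model = nn.Linear(8, 8)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

t0 = time.perf_counter()
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
opt.step()
step_time = time.perf_counter() - t0

metrics = {
    "lr": opt.param_groups[0]["lr"],
    "grad_norm": global_grad_norm(model),
    "step_time_s": step_time,
    # Peak GPU memory only makes sense on a CUDA device
    "gpu_peak_mb": torch.cuda.max_memory_allocated() / 2**20
                   if torch.cuda.is_available() else 0.0,
}
```

Activation norms would be captured similarly via forward hooks on one or two chosen layers.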

Hyper Param Sweep Experiments

LR vs Batch Size Empirical Sweep ✅

  • Learning rate (LR) controls how far you move per optimizer step.
  • Batch size (B) controls how noisy your gradient estimate is.

LR is related to batch_size in that it determines how much weight each optim.step() carries; a lower batch size combined with a higher LR can introduce too much stochasticity. We are looking for a sweep of LR vs batch_size to determine optimal values for both within our constraints.
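
The sweep itself is just a cross-product over the two axes. A sketch with placeholder values (the actual grid values in the runs may differ):

```python
import itertools

# Illustrative grid, not the exact values used in the experiment log
lrs = [1e-4, 3e-4, 1e-3]
batch_sizes = [8, 16, 32]

# Tokens-per-step stays constant across runs, so a larger batch size means
# proportionally fewer sequences elsewhere in the step budget.
configs = [
    {"lr": lr, "batch_size": bs}
    for lr, bs in itertools.product(lrs, batch_sizes)
]

# Each config gets a short training run; the (lr, batch_size) pair with the
# best validation loss/perplexity wins.
```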

Based on the findings above, if we really need a bigger batch size we will resort to gradient accumulation, as mentioned in the “Memory & GPU Utilisation Improvement Experiments” section below.

Experiment Details

Optimiser Sweep SGD+momentum vs Adam vs AdamW ✅

AdamW has long been the common choice, but running this for a few thousand steps should make it evident why.

Took a detour to dive a bit deeper in On Optimizers, but also continuing to experiment here.

*Note: Since LR and optimizer choice are interdependent, we will first assume the optimizer to be Adam and carry out the LR vs batch_size sweep to determine a batch size that fits and a stable LR range for AdamW. We will then sweep the different optimizers with different LR values (each appropriate to that optimizer's regime), conclude the best LR for each optimizer, and pick the optim<>LR pair with the lowest loss/perplexity.*
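
The three candidates differ mainly in construction; a sketch of how the sweep might instantiate them (LR values are illustrative starting points per regime, not the swept values from the log, and in a real sweep each optimizer would get a freshly initialised model):

```python
import torch
import torch.nn as nn

# Toy stand-in for the decoder model
model = nn.Linear(8, 8)

optimizers = {
    # SGD typically lives in a much higher LR regime than Adam-family optimizers
    "sgd_momentum": torch.optim.SGD(model.parameters(), lr=1e-1, momentum=0.9),
    # Adam couples weight decay (if any) into the adaptive update
    "adam": torch.optim.Adam(model.parameters(), lr=3e-4),
    # AdamW decouples weight decay from the gradient-based update
    "adamw": torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01),
}
```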

Side Note: Since the above two are already well established, I am treating them as a learning ground for setting up, documenting, and running experiments, getting used to cloud GPU environments, and getting the metrics tracking and logging right.


Memory & GPU Utilisation Improvement Experiments

Activation Checkpointing ✅

On an A100, a post-LN, baseline hand-coded MHA can only fit a batch size of 20 on the final model; this was verified both from estimates and empirically on a short training run. Memory utilisation is mostly attributable to activations stored for the backward pass. The optimisation here is activation checkpointing: we want to see how memory utilisation behaves before and after checkpointing.
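
In PyTorch this is a small change at the block level. A minimal sketch with a toy feed-forward block (the block itself is illustrative, not the actual MHA layer):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.ff(x)

block = Block(16)
x = torch.randn(2, 8, 16, requires_grad=True)

# Instead of storing all intermediate activations, only the block input is kept;
# the forward pass is recomputed during backward, trading compute for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```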

Experiment Log: Activation Checkpointing

Micro-batching + Gradient Accumulation ✅

Bigger batches might still be hard to support, so a gradient accumulation strategy can be employed with micro-batches to deploy compute for training optimally.
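
The core loop pattern: run several micro-batch backward passes, letting gradients accumulate in `.grad`, then take one optimizer step. A sketch with toy shapes (the model and accumulation count are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

accum_steps = 4            # micro-batches per optimizer step (illustrative)
data = torch.randn(16, 8)  # one "effective batch", split into micro-batches

opt.zero_grad()
for i, micro in enumerate(data.chunk(accum_steps)):
    # Scale the loss so accumulated grads average rather than sum
    loss = model(micro).pow(2).mean() / accum_steps
    loss.backward()  # grads accumulate in .grad across micro-batches
    if (i + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()
```

This keeps tokens-per-step (and the gradient statistics) matching a large batch while peak memory follows the micro-batch size.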

Experiment Log: Micro-batching + Gradient Accumulation

Datatype Appropriation ✅

We have already moved the datatype to BF16 instead of FP32.
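
For reference, a minimal autocast-style sketch of the BF16 setup (the toy model is illustrative; the actual runs keep FP32 master weights and run matmul-heavy ops in BF16):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)
x = torch.randn(4, 16)

# BF16 keeps FP32's exponent range, so unlike FP16 no loss scaling is needed;
# it halves activation memory for the ops run under autocast.
device_type = "cuda" if torch.cuda.is_available() else "cpu"
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    out = model(x)
```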

SDPA vs Hand-Coded Attention Ablation ✅

Finally, perform a training run with SDPA and check: 1. whether memory utilisation for attention drops, since this kernel should avoid materialising the intermediate score matrix; 2. training speed-up, both per-epoch and per-step time.

After this we could potentially increase the batch-size for future runs!

Will be implementing a config-driven way to swap between SDPA and the custom attention implementation, but the end goal of freeing up more memory to increase batch size is somewhat redundant, at least while we use the WikiText-2 dataset, since the gradients generated in a single micro-batch are already coherent enough to provide good-quality updates.
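
A sketch of what that config-driven swap might look like (the function name and flag are illustrative, not the actual config keys):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, use_sdpa: bool = True):
    """Config-driven switch between the fused SDPA kernel and the
    hand-coded reference path. Shapes: (batch, heads, seq, head_dim)."""
    if use_sdpa:
        # Fused kernel: avoids materialising the full (seq x seq) score matrix
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    # Hand-coded reference: materialises the scores explicitly
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    mask = torch.triu(torch.ones(q.size(-2), k.size(-2), dtype=torch.bool), 1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, -1) @ v

q = k = v = torch.randn(1, 2, 8, 16)
out_sdpa = attention(q, k, v, use_sdpa=True)
out_ref = attention(q, k, v, use_sdpa=False)
```

The two paths should agree numerically, which also makes this a convenient correctness check before the ablation run.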

Experiment Log: Experiment Assessing Gradient Quality With Accumulation


Architectural And Training Recipe Ablations

Pre vs Post LN ( Layer Norm in Transformers experiment) ✅

Post-LN vs (Pre-LN + final decoder LN): small dataset, deep model, short training run.

  • Look for grad scale and training stability.
  • See what actually happens in the final LN grads and weights.
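
The structural difference is just where the LayerNorm sits relative to the residual add. A toy sketch with a feed-forward sublayer standing in for the full attention+FFN block:

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    # Post-LN (original Transformer): normalise *after* the residual add,
    # so every layer's output is re-normalised — harder to train deep stacks.
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Linear(d, d)
        self.ln = nn.LayerNorm(d)

    def forward(self, x):
        return self.ln(x + self.ff(x))

class PreLNBlock(nn.Module):
    # Pre-LN: normalise *before* the sublayer; the residual path stays
    # un-normalised, which is why a final decoder LN is added after the stack.
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Linear(d, d)
        self.ln = nn.LayerNorm(d)

    def forward(self, x):
        return x + self.ff(self.ln(x))

x = torch.randn(2, 8, 16)
post_out = PostLNBlock(16)(x)
pre_out = PreLNBlock(16)(x)
```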

With and without scheduler warm-up ✅

With and without scheduler warm-up, compared across pre- and post-LN, measuring final loss and perplexity for each. The results should show that pre-LN learns better, converging to a lower loss even without warm-up, while post-LN reaches its lower loss on the warm-up trial and a comparatively higher one without warm-up.
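
The warm-up arm of the comparison can be a simple linear ramp via `LambdaLR`. A sketch (the warm-up length and LR are placeholders, not the experiment's values):

```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps = 100  # illustrative
sched = torch.optim.lr_scheduler.LambdaLR(
    # LR ramps linearly from lr/warmup_steps up to the full lr over the
    # first `warmup_steps` optimizer steps, then stays flat.
    opt, lambda step: min(1.0, (step + 1) / warmup_steps)
)

lr_before = opt.param_groups[0]["lr"]
opt.step()
sched.step()
lr_after = opt.param_groups[0]["lr"]
```

The no-warm-up arm simply drops the scheduler and starts at the full LR from step zero.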

The above two experiments are paired and completed here: Pre vs Post Norms, With & Without Warmups


Setting up an Experiment Template and proceeding with the runs! Done :)
