Preface

Deriving inspiration from some of the predecessors:

GPT-1

  • Layers = 12
  • Training Data = ~4.5 GB of raw text from ~7,000 unpublished books (~985 M words)
  • Param Count = 117 million (0.117 B)
  • Vocab = 40k BPE merges
  • Dataset = BookCorpus

GPT-2

GPT-2 is basically a 10x scale-up of GPT-1, with almost no other changes to the architecture.

  • Training Token Count = 40 GB of text, ~15B+ tokens
  • Param Count = 1.5 B
  • Vocab = ~50k token vocab
  • Dataset = WebText

The Plan

Smoke Test With Small Models

Using the basic decoder-only LM architecture from Vaswani et al., with only self-attention and FFN blocks (a minimal sketch follows the config below).

  • Dataset = Tiny Shakespeare (1.1 MB of text)
  • Vocab = 10k tokens with BPE
  • Layers = 2
  • d_model=128
  • n_heads=6

Will be running an overfit test to validate the plumbing and the learning/optimisation pipeline.
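For concreteness, here is a minimal sketch of one such block in PyTorch. The default sizes, and the post-LN + ReLU choices, follow Vaswani et al. and are hypothetical placeholders rather than the exact run config:

```python
import torch
from torch import nn

class DecoderBlock(nn.Module):
    """Causal self-attention + FFN, each with a residual and post-LN."""

    def __init__(self, d_model=128, n_heads=8, d_ff=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):  # x: [Batch, T, d_model]
        T = x.size(1)
        # Causal mask: True above the diagonal = future positions are hidden.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)
        x = self.ln1(x + attn_out)
        return self.ln2(x + self.ffn(x))
```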

The scaling laws paper (Kaplan et al.) and the Chinchilla paper (Hoffmann et al.) provide guidelines for optimally sizing a training run against compute and data budgets, but that only gives a macroscopic view and leaves many questions unanswered, which are especially important when training smaller models. How do you pick the optimal batch size, sequence length, learning rate and schedule? Do these hyper-params affect each other? And what about hardware dependency when picking them?
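There is no single answer, but for the learning-rate part one common recipe is linear warmup followed by cosine decay. A minimal sketch of that schedule (every number here is a hypothetical starting point for a sweep, not a value from the papers):

```python
import math

def lr_at(step, max_lr=3e-4, warmup=500, total=10_000, min_lr=3e-5):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```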

Maybe there are no fixed rules here; maybe it's all just a recipe for narrowing these things down experimentally.

Thinking from first principles, let me focus on just two variables: sequence length and batch size.

1. Sequence Length

Let's denote sequence length by T for ease.

So as T increases,

  • PRO : it allows the model to pay attention to a longer context and learn relationships over longer stretches of the sequence.

  • CON : the computation required to perform attention also increases; the complexity of attention is O(T^2 · d_model), so this will be costly to keep scaling (see the sketch after this list).

    Remember the attention scores are of shape [Batch, n_heads, T_q, T_k], with each entry a d_head-dim dot product, and for self-attn T_q = T_k = T, hence the complexity becomes O(T^2 · d_model).

    • In longer sequences, earlier tokens might be attributed larger loss values simply because they occur first and have less context to condition on.
  • Note : Also, it might not be true that all tokens are equally valuable; some tokens might carry sparse info while some might be critical, so a longer sequence might not always mean more information, or vice versa.
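To make the quadratic term concrete, a quick sketch of the score computation (all sizes are hypothetical):

```python
import torch

B, T, d_model, n_heads = 4, 128, 128, 8  # hypothetical sizes
d_head = d_model // n_heads

q = torch.randn(B, n_heads, T, d_head)
k = torch.randn(B, n_heads, T, d_head)

# One dot product per (query, key) pair -> scores of shape [B, n_heads, T, T].
scores = (q @ k.transpose(-2, -1)) / d_head**0.5
print(scores.shape)  # torch.Size([4, 8, 128, 128])

# Matmul cost of Q @ K^T alone: doubling T quadruples the work.
flops = 2 * B * n_heads * T * T * d_head  # ~ O(T^2 * d_model)
```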

2. Batch Size

Let's denote it by B for ease.

So as B increases:

  • Fewer gradient updates for the same amount of data (each epoch takes fewer steps); with smaller models (fewer params), the representations/weights might need more tuning steps to get right, vs a bigger model where things can be learnt over a larger number of params with less tuning.
  • More GPU memory required; if the batch doesn't fit in GPU memory, we might need to resort to micro-batching and gradient accumulation (sketched below).
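A minimal gradient-accumulation sketch (the model, data and numbers are hypothetical stand-ins): we get an effective batch of micro_batch × accum_steps while only one micro-batch lives in GPU memory at a time.

```python
import torch
from torch import nn

model = nn.Linear(16, 16)                 # stand-in for the real LM
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

accum_steps = 8                           # effective batch = micro-batch * accum_steps

optimizer.zero_grad()
for step in range(64):
    x = torch.randn(4, 16)                # one micro-batch
    # Scale the loss so the accumulated gradient equals the full-batch average.
    loss = loss_fn(model(x), torch.randn(4, 16)) / accum_steps
    loss.backward()                       # grads accumulate in .grad across calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```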

OK, so one connecting dot between T and B is that both are knobs controlling how much computation we perform per step: T dictates the cost of the attention op (which GPUs should be good at) and B dictates GPU memory utilisation. But they are not directly linked to each other in a significant way, as per my current view.

Also, practically, T needs to be meaningfully long for the selected corpus, and B sits at a trade-off between how often we want grad updates (how small B should be) vs how efficiently I can utilise my hardware (how big B should be). So I will be eye-balling T based on a sweep of the corpus and tokenizer, and B can be calibrated by watching hardware usage, so starting off with a medium/somewhat-high-ish value should be OK.


Work Log

Baseline Overfit Test

Run 1

Wandb run driven-sun-7 (Link) | Git Commit

Observations: the model definitely overfits, with the training loss dropping while the val loss and val perplexity go up. Some of the val perplexity might also be due to the nature of the data: we are using a sequential corpus, and it's possible that certain tokens never occur until the tail end, which we use as the validation slice.

Also, I think it's best to add <EOS> and <PAD> tokens to the vocab and to create training sequences based on <EOS> boundaries.

Also, on prompting, this model seems to just regurgitate random excerpts, as below:

you> greetings
bot> each my common mouth: I do despise them;
For they do prank them in authority,
Against all noble sufferance.

SICINIUS:
Pass no further.

CORIOLANUS:
Ha! what is that?

BRUTUS:
It will be dangerous to go on: no further.

CORIOLANUS:
What makes this change?

MENENIUS:
The matter?

COMINIUS:
Hath he not pass'd the noble and the common?

BRUTUS:
Cominius, no.

CORIOLANUS:
Have I had children's voices?

First Senator:
Tribunes, give way; he shall to the market-place.

BRUTUS:
The people are incensed against him.

SICINIUS:
Stop,
Or all will fall in broil.

CORIOLANUS:
Are these your herd?
Must these have voices, that can yield them now
And straight disclaim their tongues? What are
your offices?
You being

Run 2

wandb run: vibrant-salad-11

After a couple of fixes (here to here) with the following key changes:

  • including EOS and PAD tokens (which also meant adding a key-padding mask to MHA). New sequences are created as follows:
  segments = corpus.split("\n\n")
  ...
  for segment in segments:
      # Encode each paragraph segment, then mark its end with EOS.
      encoded_segment = [
          token_to_id[token] for token in tokenizer.encode(segment)
      ]
      token_stream.extend(encoded_segment)
      token_stream.append(eos_id)
      eos_inserted += 1
  • revamped the data pipeline; the overview is as follows (checkout data.py for the code, and a minimal sketch of the windowing steps after the table):
| Step | Description |
| --- | --- |
| 1 | Load raw text from file |
| 2 | Tokenize with BPE |
| 3 | Convert to token stream with EOS markers |
| 4 | Truncate if needed |
| 5 | Create sliding windows of size seq_len + 1 |
| 6 | For each window, split into input and target |
| 7 | Pad short windows and create masks |
| 8 | Wrap in Dataset → DataLoader for training |
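A minimal sketch of steps 5-7 (the names and the non-overlapping stride are my assumptions, not necessarily what data.py does):

```python
import torch

seq_len, pad_id = 8, 0
token_stream = list(range(1, 21))        # stand-in for the real BPE token stream

# Step 5: slide a window of seq_len + 1 over the stream (stride = seq_len here).
windows = [token_stream[i : i + seq_len + 1]
           for i in range(0, len(token_stream), seq_len)]

for w in windows:
    w = w + [pad_id] * (seq_len + 1 - len(w))  # step 7: pad short tail windows
    x = torch.tensor(w[:-1])                   # step 6: input  = tokens 0..T-1
    y = torch.tensor(w[1:])                    #         target = tokens 1..T
    key_padding_mask = x == pad_id             # True where MHA should ignore keys
```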

I also rented an RTX A5000 on vast.ai to get some CUDA power and speed things up (totally worth it!).

Retrained the model!

Now we see the classic overfitting picture: falling train_loss, rising val_loss and perplexity. But we were expecting exactly this, simply because we only have a small amount of data for an almost 2M-param model. We are now in the "data regime" of the scaling laws and Chinchilla, which say that for optimal training we need to scale data a lot more in order to efficiently train a model of this size; Tiny Shakespeare is nowhere close to enough.
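A back-of-the-envelope check, assuming the roughly 20-tokens-per-parameter heuristic often quoted from Chinchilla and ~3.5 characters per BPE token (both loose assumptions):

```python
params = 2e6                           # our baseline model
wanted_tokens = 20 * params            # ~40M tokens under the heuristic
have_tokens = 1.1e6 / 3.5              # 1.1 MB of text -> roughly 315k tokens
print(wanted_tokens / have_tokens)     # ~127x short of the data budget
```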

But what we also want to look at is how the gradients behave and the health of the signals flowing through our network, which we can see here:

Sample outputs:

% python repl-lm-chat.py --model-path baseline/models/run_20260220_193939/baseline_checkpoint.pt
Tokenizer vocab loaded from /Users/ashwinm4p/ash/code/transformer-room/baseline/tiny_shakespeare_bpe_vocab.txt
Loaded model on mps
Context window: 128
Prompt format: plain
Type a prompt and press enter. Type /quit to exit.
 
you> greeting sire!
bot> Five times, being the city;
and maids like
Which way there, and heavy. begin and one skull, they do the good it were
wholesome, where he hath done before.
 
you> I understand not what you utter but what thy intend ?
bot> Must I with ba--when you now see
He still have been sooth the heart of every flattered the matter between an pike
Shall remain.
 
you> why my lord? Why such grief?
bot> Before he should thus far havity was the trivial motion,
Yet I can honour
To banish'd,
And will follow.

Concluding Baseline Overfit

This is already a lot better! We now see better phrase formation and turn-based responses just from including EOS and PAD. We can also see that the model is no longer just regurgitating the play verbatim and is actually shuffling some things around. The training metrics show the model is learning, and in fact overfitting, with all indicators pointing to a bigger training run with more data, which is exactly what we'll try next.


Prepping for bigger data & the next phase

To actually start measuring and improving model behaviour we need to move beyond toy examples and setup a more realistic data corpus which will help us see healthy training runs as well us build a reasonably useful model which demonstrates some level of capabilities.