Tasks
- Finalise a base model for this stage of experiments.
- Set up a benchmark / test harness to properly measure model generalisation beyond a single training set.
- Set up a HuggingFace upload pipeline for post-training checkpoints.
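For the upload task, a minimal sketch of what a post-training step could look like, using `huggingface_hub` — the repo id, file filters, and paths here are placeholders, not the actual pipeline:

```python
# Hypothetical post-training upload step; repo id and suffix filter are
# illustrative assumptions, not the real pipeline config.
from pathlib import Path


def checkpoint_files(ckpt_dir):
    """Collect the artifacts worth publishing (weights, config, vocab)."""
    keep = {".pt", ".json", ".txt"}  # assumed artifact extensions
    return sorted(p for p in Path(ckpt_dir).iterdir() if p.suffix in keep)


def push_to_hub(ckpt_dir, repo_id):
    """Upload a finished checkpoint folder to the Hub.

    Requires a prior `huggingface-cli login`; imported lazily so the
    rest of the sketch runs without the package installed.
    """
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id, exist_ok=True)
    api.upload_folder(folder_path=str(ckpt_dir), repo_id=repo_id)
```

Keeping the file filter separate from the upload call makes it easy to exclude optimizer states and logs from the published repo.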
Finding Problems
I have already started to see some bottlenecks and compounding issues! I intend to switch to the OpenWebText dataset and train on at least 1B tokens, but the way I currently load, tokenise and use the data is very memory-bound and inefficient. The current code cannot handle things like:
- `"\n\n".join(segments)` → `BPETokenizer._train()` → `corpus = list(text.encode("utf-8"))` — at this scale that chain would probably overflow the RAM.
- The current dataset and pipeline fundamentally do not support streaming, which I need in order to scale up the experiments.
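For reference, a streaming-friendly alternative to materialising the whole corpus could look like the sketch below (path and chunk size are illustrative). Note that `list(text.encode("utf-8"))` stores one pointer per byte, so a 1B-byte corpus balloons to several GB of list alone before training even starts:

```python
# Sketch: yield the corpus as fixed-size byte chunks instead of one
# giant in-memory list. Path and chunk size are illustrative.
def iter_chunks(path, chunk_bytes=1 << 20):
    """Stream a file as ~1 MiB byte chunks; memory stays O(chunk_bytes)."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):
            yield chunk
```

Downstream code can consume the generator chunk by chunk, which is the same access pattern a streaming dataset loader would expose.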
But the good news is that the training pipeline shouldn't really be affected by this; most of the problem sits in the tokeniser code. As I see it, there are two issues:
- BPE training on a 1B corpus is just going to take too long; even running for 50-100k would be time-consuming, and might still not be in my best interest as I scale to bigger experiments.
- The corpus is currently tokenised and materialised in full at training time; there is no true streaming capability as of now.
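One possible way around the first issue — sketched here as an assumption, not the settled plan — is to train the tokeniser on a uniform random sample of documents rather than the full corpus, e.g. via reservoir sampling over the stream (works even when the total document count is unknown):

```python
import random


def reservoir_sample(docs, k, seed=0):
    """Keep a uniform sample of k documents from a stream of unknown length.

    Classic reservoir sampling: the first k items fill the reservoir,
    then item i replaces a random slot with probability k / (i + 1).
    """
    rng = random.Random(seed)  # fixed seed for reproducible tokeniser training
    sample = []
    for i, doc in enumerate(docs):
        if i < k:
            sample.append(doc)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = doc
    return sample
```

The BPE trainer would then see, say, tens of MB of representative text instead of the full 1B tokens, which keeps merge counting tractable while the model itself still trains on the whole streamed dataset.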