Abstract

In this tutorial, I will provide a comprehensive walk-through of the pipeline for training large language models, covering both the pre-training and post-training phases. My goal is to discuss current best practices at each stage of training—drawing from open models and public research papers—including data curation, training algorithms, and safety mitigations. The tutorial aims to serve as a starting point for discussions of the open research questions in training the next generation of large language models.