Build Your Own LLM: The Workshop
23 videos · 250 slides · 50 hands-on exercises · Free · No Math or ML Prerequisites.
Last month I hosted an in-person workshop about building your own large language model without any math or ML prerequisites. It covers everything from machine learning fundamentals and deep neural networks to transformer architecture and pre/post-training. I’m excited to release recordings and training materials for you to watch!
[Start Watching: Build Your Own LLM ]
[Read the slides] [Do the Exercises]
The workshop’s goal is to grok all parts of modern LLM development. So each workshop section builds intuition sequentially: starting from slides teaching the concepts, followed by Excel-by-hand exercises developing intuition for the math, and then coding tutorials. One participant, Emily HK, noted: “The best part of this workshop is that all the content is still available online for me to refresh my memory at any point.” Well, now you have access to these materials as well!
23 Workshop Videos @ https://go.JustinAngel.ai/playlist
250-page Slide Deck @ https://go.JustinAngel.ai/deck
50 Excel and Code Exercises @ https://go.JustinAngel.ai/drive
Why I Recorded the In-Person Workshop
We had over 200 registrations for the in-person workshop in San Francisco! Unfortunately, we only had physical space for ~30 attendees. Driven by guilt for having to turn away ~200 people, I ended up spending a week recording, editing and uploading all the materials to YouTube. If you happen to know of a space in SF that can host more than 30 people for the next workshop, let me know!
Video Links

Our first section focuses on sampling LLM token predictions. Searching every possible token sequence for the ideal short paragraph would take until the heat death of the universe, so we sample tokens instead. Sampling techniques like temperature, TopP and TopK are the shortcut.
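To make the three knobs concrete, here is a minimal sketch in plain Python. The function name `sample_next_token` and the toy logits are hypothetical, and this is one common way to combine the techniques rather than necessarily the workshop’s exact code.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token id from raw scores (logits). Assumes temperature > 0."""
    # Temperature: divide logits before softmax; <1 sharpens, >1 flattens.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept

    # Draw one survivor, weighted by its probability.
    weights = [probs[i] for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for a 4-token vocabulary
token = sample_next_token(toy_logits, temperature=0.8, top_k=3, top_p=0.9)
```

Setting `top_k=1` always returns the single most likely token, which is a quick way to check the function behaves.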
At this point in the workshop, we don’t really know how to build an LLM. So as the enterprising engineers we are, we’ll reverse-engineer a classic LLM (GPT-2 small), list out its parts, and create a roadmap to learn about each of them.

ChatGPT is just 100B+ perceptrons. Or said in reverse: a perceptron is the basic unit of work of a neural net. The math formula is straightforward: weight * input + bias. In this section we luxuriate in that simple formula, seeing it in code, in Excel demos, and in physical analog demos.
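In code, the formula is a one-liner. This sketch assumes a perceptron with several inputs, so each input gets its own weight; the numbers are made up for illustration.

```python
def perceptron(inputs, weights, bias):
    """weight * input + bias, summed over every input."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Two inputs, two weights, one bias (all hypothetical values).
result = perceptron([1.0, 2.0], [0.5, -0.25], 0.1)
```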
Perceptrons alone can’t easily detect edges (like the ending of sentences), understand non-linear values (like -5 and 5 being the same magnitude), or build some logical structures. Activation functions close that gap by introducing non-linearity into neural nets. Our key takeaway from this section is the ReLU^2 activation function, which will be the first component of our LLM.
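A minimal sketch of ReLU^2, assuming the usual definition (clamp negatives to zero, then square):

```python
def relu_squared(x):
    """ReLU^2: zero for negative inputs, x squared otherwise."""
    return max(0.0, x) ** 2

# The output bends at zero, which no single weight * input + bias can do.
outputs = [relu_squared(x) for x in (-5.0, -1.0, 0.0, 1.0, 5.0)]
```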
A typical CPU has around 8 physical cores that can run in parallel. So if all we’re computing is 8 neurons, that’s great. But when we have 32M perceptrons to compute in parallel, CPUs just can’t match the 15,000 parallel compute cores GPUs have. In this section we learn about GPU coding, starting from CPU code and moving through PyTorch modules, torch.compile() fused kernels, CUDA, and Triton.
A single perceptron can only learn one really simple rule. To learn language, we need complexity. That means we’ll need multiple perceptrons grouped together in layers. And then finally we’ll sequence those layers into a multi-layer perceptron (MLP). This section gives us another component for our final LLM: a growing-shrinking MLP.
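Here is a rough sketch of a growing-shrinking MLP in plain Python. The sizes (4 → 16 → 4), the helper names, and the random weights are all assumptions for illustration; a real model would use PyTorch and learned weights.

```python
import random

def relu_squared(x):
    return max(0.0, x) ** 2

def make_layer(n_in, n_out, rng):
    """One layer = n_out perceptrons, each with n_in weights and a bias."""
    weights = [[rng.gauss(0.0, 1.0 / n_in ** 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def layer_forward(x, layer):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def mlp_forward(x, grow, shrink):
    # Grow to a wider hidden layer, apply the activation, shrink back down.
    hidden = [relu_squared(h) for h in layer_forward(x, grow)]
    return layer_forward(hidden, shrink)

rng = random.Random(0)
grow = make_layer(4, 16, rng)    # 4 inputs -> 16 hidden perceptrons
shrink = make_layer(16, 4, rng)  # 16 hidden -> 4 outputs
out = mlp_forward([1.0, -2.0, 0.5, 3.0], grow, shrink)
```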
If our goal is to teach LLMs the English language, we need to know how far away from that goal we are with our randomly-initialized LLMs spewing gibberish. Loss functions allow us to quantify the distance from that goal. During this section we learn about the theory of traversing the loss landscape to minimize loss (i.e. improve English skills in LLMs). We also learn how the cross-entropy loss function works in problems without a definitive answer, like next word prediction.
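For a single prediction, cross-entropy boils down to the negative log of the probability the model gave to the token that actually came next. A minimal sketch, with hypothetical probabilities:

```python
import math

def cross_entropy(probs, target_index):
    """-log of the probability assigned to the correct next token."""
    return -math.log(probs[target_index])

probs = [0.7, 0.2, 0.1]  # hypothetical model output over a 3-token vocabulary
confident_and_right = cross_entropy(probs, 0)  # small loss
confident_and_wrong = cross_entropy(probs, 2)  # large loss
```

The loss is zero only when the model puts 100% on the right token, and it grows quickly as that probability shrinks.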
Before we can train a model on the English language, we’ll learn how to train it on predicting A+B%5. At first, this toy example seems trivial. But the same manifolds created inside deep neural networks to divide by five are how LLMs learn any sequential concept like days, months, colours, and word counting.
Now that we’ve successfully trained our first DNN, it’s time to save it to disk, upload it to Hugging Face and load it remotely.
GPT-2 small has 72 consecutive matrix multiplications, but activation functions tend to add zeros and small values. What happens when we multiply a bunch of zeros? We get more zeros. Training collapses. GPU time is wasted. People get fired. Recessions start. To avoid that, we learn that choosing a specific statistical distribution for our randomly-initialized values can prevent gradient collapse. Yay!
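The collapse is easy to reproduce on a small scale. This sketch pushes a random vector through 10 ReLU layers twice: once with tiny random weights, once with weights scaled by sqrt(2 / width). That second choice (often called He or Kaiming initialization) is an assumption here; the workshop may use a different distribution.

```python
import math
import random

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

def signal_after_layers(std, width=32, depth=10, seed=0):
    """Push a random vector through `depth` ReLU layers with N(0, std^2) weights."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        w = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        x = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w]
    return rms(x)

width = 32
tiny = signal_after_layers(std=0.01, width=width)                    # shrinks toward zero
scaled = signal_after_layers(std=math.sqrt(2.0 / width), width=width)  # stays in a usable range
```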
But in order to avoid training collapse over 72 consecutive matrix multiplications, we really need another trick up our sleeve: residuals. Residuals combine the outputs of the previous layers with the output of the current layer, allowing each layer to work independently for a bit. It’s like giving the DNN multiple scratchpads. The formula is straightforward: f(x)+x, and we’ll see it play out in Excel and in code.
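A minimal sketch of f(x)+x. To show why it helps, the layer here is a deliberately useless one that outputs only zeros (`dead_layer` is hypothetical): without the residual the signal would be wiped out, and with it the input survives all 72 steps.

```python
def residual_block(f, x):
    """Return f(x) + x, element-wise."""
    return [fx + xi for fx, xi in zip(f(x), x)]

def dead_layer(x):
    # Worst case: a layer that has collapsed and outputs only zeros.
    return [0.0 for _ in x]

x = [1.0, -2.0, 3.0]
for _ in range(72):
    x = residual_block(dead_layer, x)
```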
Residuals introduce a new problem, since f(x)+x can lead to gradient explosion. Intuitively, whenever f(x) is positive, f(x)+x>x, so x grows at every layer, which is how we get a gradient explosion. So we bring in normalization to make sure the values stay clustered around a central point.
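One common way to keep values clustered is layer normalization: subtract the mean and divide by the standard deviation. The sketch below assumes that variant and leaves out the learnable scale and shift a real layer would carry.

```python
import math

def layer_norm(x, eps=1e-5):
    """Recenter to mean 0 and rescale to roughly unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

exploded = [10.0, 200.0, 3000.0, -4000.0]  # hypothetical runaway activations
normalized = layer_norm(exploded)
```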
So far our DNN has only worked on deterministic data with clear answers (e.g., A*B%C has clear answers), but language doesn’t have definitive answers. We can complete the sentence “The capital of France is” with “Paris”, “full of cheese”, or “has a metro strike”, all factually correct. Once we introduce data without a definitive solution, the model will overfit on the training data, i.e., it’ll perform better on training data than on unseen test data. That’s a problem. Regularization techniques like dropout, gradient clipping, and weight decay can be used to make sure our DNN doesn’t overfit on training data.
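As one example of regularization, here is a sketch of dropout in its usual "inverted" form: during training each value is zeroed with probability p and the survivors are scaled up so the expected total stays the same. The function signature is an assumption for illustration.

```python
import random

def dropout(x, p, rng, training=True):
    """Zero each value with probability p; scale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(x)  # at inference time dropout is switched off
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in x]

rng = random.Random(0)
activations = [1.0] * 1000
dropped = dropout(activations, p=0.5, rng=rng)
```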

Ultimately, LLMs have to return a percentage probability for each possible next word, like saying “Paris” is 100% likely. But DNNs / LLMs operate by doing fun matrix multiplication with real numbers, like 1.234567. SoftMax converts those activation values into percentages that sum to 100%.
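A minimal softmax sketch: exponentiate every value, then divide by the total. Subtracting the maximum first is a standard numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract the max to avoid overflow
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.234567, 0.5, -2.0])   # hypothetical activation values
```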
LLMs operate on fun matrix multiplication with numbers, but language operates on strings. Tokenizers are our way of putting a square peg into a round hole by shaving off the edges. We’ll convert words, subword components, punctuation marks and emojis into integer tokens. But then we have to decide how best to tokenize. Should each character be a token? Should each word be a token? Or is there a compromise? Ultimately, we settle on the BPE compromise tokenizer.
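The core BPE loop fits in a few lines: start from single characters, find the most frequent adjacent pair, merge it into one token, repeat. This is a simplified character-level sketch with hypothetical helper names; production tokenizers work on bytes and pre-split the text first.

```python
from collections import Counter

def most_frequent_pair(tokens):
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])  # fuse the pair into one token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text, num_merges):
    tokens = list(text)  # start with one token per character
    merges = []
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        pair = most_frequent_pair(tokens)
        merges.append(pair)
        tokens = merge_pair(tokens, pair)
    return tokens, merges

tokens, merges = train_bpe("low lower lowest", num_merges=2)
```

After two merges on this toy string, the repeated stem "low" has become a single token while the rarer endings are still spelled out character by character.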
Once we’ve converted “Hello” to 995, what does the LLM know about 995? Token embeddings represent a learnable meaning in high-dimensional space for each token. We see how, in high-dimensional embedding space, the offset from woman to queen has the same distance and direction as the offset from man to king. These embeddings learn what each word/subword/token means. But they still lack positional information, which makes “dog bites man” and “man bites dog” equivalent. So we’ll learn about positional encoding and how to add positional information to each token embedding.
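The “dog bites man” problem can be shown in a few lines. This sketch uses a random embedding table and sinusoidal positional encoding; both are assumptions for illustration (GPT-2, for instance, learns its position vectors instead), and the three-word vocabulary is made up.

```python
import math
import random

def sinusoidal_position(pos, dim):
    """Sinusoidal encoding: even slots use sin, odd slots use cos."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** ((i - i % 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def embed(token_ids, table, with_positions):
    out = []
    for pos, tok in enumerate(token_ids):
        vec = list(table[tok])  # look up the token's embedding
        if with_positions:
            pe = sinusoidal_position(pos, len(vec))
            vec = [v + p for v, p in zip(vec, pe)]
        out.append(vec)
    return out

rng = random.Random(0)
vocab = {"dog": 0, "bites": 1, "man": 2}  # hypothetical toy vocabulary
dim = 8
table = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in vocab]

a = [vocab[w] for w in "dog bites man".split()]
b = [vocab[w] for w in "man bites dog".split()]

# Without positions, both sentences are the same bag of vectors.
same_bag = sorted(embed(a, table, False)) == sorted(embed(b, table, False))
# With positions, "dog" in slot 0 is no longer the same vector as "dog" in slot 2.
differs = sorted(embed(a, table, True)) != sorted(embed(b, table, True))
```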
Attention is how LLMs “read” sentences. First they read sentence fragments and understand word positions, then they identify subjects & verbs, read question words and understand transformations. It would be prohibitive to teach LLMs each one of those concepts by hand. Instead, attention just learns them through the power of math and backpropagation.
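Underneath, a single attention head is a weighted average: each token scores itself against the tokens before it, softmaxes those scores, and blends the corresponding value vectors. This is a sketch of causal scaled dot-product attention that takes queries, keys and values as given; in a real model they come from learned projection matrices, which are omitted here.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Scaled dot-product attention where token t may only look at tokens 0..t."""
    d = len(keys[0])
    outputs = []
    for t, q in enumerate(queries):
        # Score the current token against itself and every earlier token.
        scores = [sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        # Blend the visible value vectors using the attention weights.
        out = [sum(w * values[s][j] for s, w in enumerate(weights))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3-token sequence
attended = causal_attention(x, x, x)
```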

Bringing it all together, we build a Transformer architecture with MLPs, Attention Heads, Embeddings, Activation Functions, Regularization, Normalization, Residuals and everything else we learned about so far. This section is my favourite because it really shows how transformers work end-to-end.
Moving on from architecture to actual training, we first need to teach our LLMs the English language. We do that by taking textual information from the web, from books, from social media, code and other data sources. But extracting text from the internet isn’t straightforward. Do we prefer more lower-quality text, or less higher-quality text? In this section we build our pretraining script and let it run for hours to train our model on the English language.
How do we know if our models are any good? Should we use leaderboards? Should we have another LLM act as a judge? Good at what? In this section we review the full range of options for how LLMs are evaluated, and what they can be evaluated on.
After our LLM learns to speak English, it’s still pretty bad at following instructions. We can ask it “What is the capital of France?” and it’ll just answer “The capital of France is France.” because it doesn’t yet understand that it’s supposed to be a helpful AI assistant. Using instruction fine-tuning, we teach our LLM to write poems, generate lists, explain concepts and more.
Finally, we want our LLMs to be good at things. We want them to know all the knowledge, to correctly solve all the math, and to write the perfect code. RL is our tool to evaluate entire responses and backpropagate both negative and positive signals. In this section we implement the simplest offline RL preference optimization algorithm from scratch: SimPO. First in Excel, and then in code.
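For a single preference pair, the SimPO loss can be sketched as below. It rewards a response by its average log-probability per token, then asks the chosen response to beat the rejected one by a margin. The `beta` and `gamma` defaults and the log-probabilities are placeholder assumptions, and the function computes only the loss value, not the gradients.

```python
import math

def simpo_loss(chosen_logprobs, rejected_logprobs, beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair, given per-token log-probabilities."""
    # Reward = beta * average log-probability per token (length-normalized).
    chosen_reward = beta * sum(chosen_logprobs) / len(chosen_logprobs)
    rejected_reward = beta * sum(rejected_logprobs) / len(rejected_logprobs)
    # The chosen response should beat the rejected one by at least gamma.
    margin = chosen_reward - rejected_reward - gamma
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

liked = [-0.1, -0.2]           # hypothetical per-token log-probs of the preferred answer
disliked = [-2.0, -3.0, -2.5]  # hypothetical per-token log-probs of the rejected answer
good = simpo_loss(liked, disliked)  # model already prefers the right answer: small loss
bad = simpo_loss(disliked, liked)   # model prefers the wrong answer: large loss
```

Because the reward is just the model’s own length-normalized log-probability, no separate reference model is needed.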
We couldn’t cover the entirety of building frontier LLMs, but we can give you trailheads to what we didn’t cover: scaling training time to multiple GPUs, optimizing GPU usage at inference time, and making LLMs safe to use.
[Start Watching: Build Your Own LLM ]
[Read the slides] [Do the Exercises]