Andrej Karpathy begins by introducing ChatGPT and explaining its capabilities as a language model that can generate coherent text responses to prompts. He showcases examples of ChatGPT interactions and explains that at its core, ChatGPT is powered by the Transformer architecture from the 2017 paper "Attention Is All You Need".
He then outlines the goal of the video: to build a smaller, character-level language model based on the Transformer architecture and train it on Shakespeare's text. Rather than trying to reproduce the full complexity of ChatGPT, the focus will be on implementing the core mechanisms that make these models work and demonstrating how they can generate Shakespeare-like text.
- ChatGPT is a language model built on the Transformer architecture that generates text by predicting the next token in a sequence.
- The Transformer architecture, introduced in 2017, has become fundamental to modern AI systems beyond its original machine translation purpose.
- The video will focus on building a character-level language model trained on Shakespeare's works as a simpler demonstration of the technology.
- Understanding the basics of transformer-based language models provides insight into how systems like ChatGPT function.
This section covers the initial steps of preparing text data for training a language model. Karpathy downloads the Tiny Shakespeare dataset (about 1MB of text) and examines its contents, a concatenation of Shakespeare's works. He then implements a simple character-level tokenizer that maps each character in the text to a unique integer and back, yielding a vocabulary of 65 possible characters.
Karpathy explains that while his implementation uses character-level tokenization for simplicity, production systems like GPT typically use subword tokenizers (for example byte-pair encoding, as implemented in libraries such as SentencePiece or tiktoken) that trade off vocabulary size against sequence length. He also separates the dataset into training (90%) and validation (10%) splits to help measure model generalization during training.
- Tokenization converts text into sequences of integers that neural networks can process.
- Character-level tokenization is conceptually simple but creates longer sequences compared to subword methods used in production systems.
- The vocabulary size for this character-level model is 65 (including letters, spaces, and punctuation).
- Splitting data into training and validation sets is essential for evaluating model performance and preventing overfitting.
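As a rough sketch of this step, the character-level tokenizer and the 90/10 split can be written in a few lines of PyTorch; the filename `input.txt` and the variable names are illustrative rather than an exact transcription of the video's code:

```python
import torch

# read the Tiny Shakespeare file as one long string
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# build the vocabulary: every unique character that occurs in the text
chars = sorted(set(text))
vocab_size = len(chars)  # 65 for Tiny Shakespeare

# simple character <-> integer lookup tables
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]             # string -> list of ints
decode = lambda ids: ''.join(itos[i] for i in ids)  # list of ints -> string

# encode the whole dataset and split 90/10 into train and validation
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]
```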
This section explores how data is prepared for efficient training. Karpathy explains that models aren't trained on entire texts at once but on small chunks with a fixed maximum length (called "block size" or "context length"). He demonstrates how a single chunk of text actually contains multiple training examples - each position in the sequence provides a different context length for predicting the next character.
Karpathy then implements a data loader that creates batches of multiple independent sequences sampled randomly from the training data. These batches have dimensions of [batch_size, block_size], where each row is a separate sequence and each sequence contains multiple prediction examples. This approach maximizes computational efficiency by processing multiple examples in parallel.
- Training data is processed in chunks of fixed maximum length (block_size) rather than feeding the entire text at once.
- Each position in a sequence represents a distinct training example with different context lengths.
- Batching multiple independent sequences together improves computational efficiency.
- The model must learn to predict the next token given contexts of varying lengths, from 1 to block_size.
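A minimal sketch of such a data loader, assuming the `train_data` tensor from the previous sketch and illustrative values for the hyperparameters:

```python
import torch

batch_size = 4   # number of independent sequences processed in parallel
block_size = 8   # maximum context length for predictions

def get_batch(data):
    # sample batch_size random starting positions in the data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # inputs are chunks of block_size consecutive tokens...
    x = torch.stack([data[i:i + block_size] for i in ix])
    # ...and targets are the same chunks shifted one position to the right,
    # so position t in x is trained to predict position t in y
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y  # both of shape (batch_size, block_size)

xb, yb = get_batch(train_data)
```

The targets `y` are simply the inputs `x` shifted by one position, which is what packs block_size prediction examples into every row of the batch.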
Before diving into the complexity of Transformers, Karpathy implements a bigram language model as a simple baseline. This model predicts the next character based solely on the identity of the current character, without considering any additional context. He implements this using PyTorch's nn.Embedding table, which maps each character to a vector of scores (logits) representing the likelihood of each possible next character.
Karpathy explains how to calculate the loss using cross-entropy and how to reshape tensors to conform to PyTorch's expected dimensions. He also implements a generation function that allows the model to produce new text by sampling from the predicted probability distributions. Though the generated text is initially random, after training it begins to capture simple character-level patterns in Shakespeare's writing.
- A bigram model is the simplest language model, predicting the next token based only on the current token.
- The model uses an embedding table to map input tokens to output logits, representing scores for each possible next token.
- Cross-entropy loss is used to measure prediction quality and guide optimization.
- Even this simple model can capture basic patterns in text after training, though without understanding longer contexts.
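A sketch of a bigram model along these lines, assuming the `(batch, time)` integer tensors produced by the data loader above; details such as device handling are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        if targets is None:
            return logits, None
        # F.cross_entropy expects (N, C) logits and (N,) targets, so reshape
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)      # last time step only
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)          # append and continue
        return idx
```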
This critical section introduces the core concept of self-attention, starting with a simple approach where tokens communicate by averaging their past context. Karpathy demonstrates how matrix multiplication can be used as an efficient weighted aggregation mechanism, using a triangular mask to ensure tokens only attend to their past context (not future tokens).
He progressively builds up the self-attention mechanism, showing how to use softmax to normalize weights and create proper probability distributions for aggregation. This mathematical foundation is crucial for understanding how tokens in a sequence can communicate with each other in a data-dependent manner, which is the key innovation of the Transformer architecture.
- Self-attention allows tokens to communicate by gathering information from other tokens in a weighted manner.
- Matrix multiplication provides an efficient implementation for weighted aggregation of information.
- Triangular masking ensures that tokens only attend to previous tokens in the sequence, preserving the autoregressive property.
- Softmax normalization creates a proper probability distribution over attention weights, controlling how much information is gathered from each token.
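The masked-softmax averaging trick can be illustrated with a small stand-alone example (the shapes here are arbitrary); each output position ends up being a uniform average of the input positions up to and including it:

```python
import torch
import torch.nn.functional as F

B, T, C = 4, 8, 2                      # batch, time, channels
x = torch.randn(B, T, C)

# lower-triangular mask: position t may only look at positions <= t
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)
wei = wei.masked_fill(tril == 0, float('-inf'))  # block future tokens
wei = F.softmax(wei, dim=-1)           # each row is uniform over the past

# batched matrix multiply: every output token aggregates its past
out = wei @ x                          # (T, T) @ (B, T, C) -> (B, T, C)
```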
This section represents the crux of the video as Karpathy implements the complete self-attention mechanism. He explains how each token produces three vectors: queries (what information the token is looking for), keys (what information the token contains), and values (what information the token shares when attended to). The attention weights are calculated by taking the dot product between queries and keys, determining how much each token should attend to others.
The self-attention implementation includes scaling the attention scores by 1/√(head_size) to control their variance at initialization, masking to ensure the autoregressive property, and softmax normalization to create probability distributions. Karpathy then provides several key insights about attention as a communication mechanism that operates over sets with no inherent notion of space, explaining why positional encodings are necessary to give tokens information about their position in the sequence.
- Self-attention computes interactions between tokens using query, key, and value vectors produced by each token.
- The dot product between queries and keys determines attention weights, creating data-dependent communication patterns.
- Scaling attention scores by 1/√(head_size) helps control variance and prevents softmax from becoming too peaky at initialization.
- Attention has no inherent notion of space or position, requiring explicit positional encodings.
- Self-attention allows flexible communication patterns between tokens, making it applicable to various problems beyond sequence modeling.
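A single attention head along these lines might look like the following sketch (variable names follow common convention, and the dropout used in the video is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of masked (causal) self-attention."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size): what each token contains
        q = self.query(x)  # (B, T, head_size): what each token is looking for
        # attention scores, scaled by 1/sqrt(head_size) to keep their variance ~1
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5        # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)  # (B, T, head_size): what each token communicates
        return wei @ v     # (B, T, head_size)
```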
After implementing self-attention, Karpathy expands the model to include all the components of a full Transformer. He introduces multi-headed attention, which uses multiple independent attention mechanisms in parallel to capture different types of relationships between tokens. This is followed by feed-forward networks that process each token independently after the communication phase.
Karpathy then adds crucial architectural elements like residual connections (which create gradient superhighways during backpropagation) and layer normalization (which normalizes features within each token to stabilize training). He explains how these components work together in Transformer blocks that can be stacked to create deeper networks, and shows how scaling up the model size significantly improves performance on the Shakespeare text generation task.
- Multi-headed attention uses multiple attention mechanisms in parallel to capture different types of relationships between tokens.
- Feed-forward networks process tokens independently after they've gathered information through attention, adding computational power.
- Residual connections create paths for gradients to flow directly from the output to the input, making deep networks trainable.
- Layer normalization stabilizes training by normalizing features within each token.
- Transformers typically stack multiple blocks of attention and feed-forward layers to create deeper, more powerful models.
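Assembled into a block, these components might look like the following sketch, which reuses the `Head` module from the earlier attention sketch and applies layer norm before each sub-layer (the pre-norm formulation the video adopts):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Several attention heads in parallel, concatenated then projected."""
    def __init__(self, n_embd, num_heads, block_size):
        super().__init__()
        head_size = n_embd // num_heads
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, n_embd)
        return self.proj(out)

class FeedForward(nn.Module):
    """Per-token MLP applied after the communication (attention) phase."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Communication (attention) followed by computation (feed-forward),
    each wrapped in a residual connection with pre-layer-norm."""
    def __init__(self, n_embd, num_heads, block_size):
        super().__init__()
        self.sa = MultiHeadAttention(n_embd, num_heads, block_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # residual around attention
        x = x + self.ffwd(self.ln2(x))  # residual around feed-forward
        return x
```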
In the final section, Karpathy explains different variants of Transformers and how they relate to models like ChatGPT. He distinguishes between encoder-only, decoder-only, and encoder-decoder Transformers, clarifying that what he implemented is a decoder-only Transformer similar to GPT. He also gives a brief walkthrough of nanoGPT, his GitHub repository containing a more efficient implementation of the same architecture.
The section concludes with an explanation of how ChatGPT is created, starting with pre-training a large model (175 billion parameters for GPT-3) on massive datasets (roughly 300 billion tokens), followed by a fine-tuning stage that uses supervised training on example assistant conversations and reinforcement learning from human feedback (RLHF). This multi-stage process transforms a general text completion model into an aligned assistant that responds helpfully to questions.
- Decoder-only Transformers (like GPT) use masked self-attention and are suitable for text generation tasks.
- Encoder-only Transformers allow all tokens to attend to each other and are used for tasks like sentiment analysis.
- Encoder-decoder Transformers use cross-attention to condition generation on separate inputs and are used for translation.
- ChatGPT involves pre-training on large text corpora followed by fine-tuning with RLHF to align the model with human preferences.
- Modern language models are dramatically larger than the demonstration model (175B+ parameters vs. 10M) and trained on vastly more data.
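The video only builds the decoder-only variant, but to make the encoder-decoder distinction concrete, a hypothetical cross-attention head might look like this sketch: queries come from the decoder's tokens, keys and values come from the encoder's output, and no causal mask is applied because attending over the entire source sequence is allowed:

```python
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionHead(nn.Module):
    """One head of cross-attention (illustrative, not from the video)."""
    def __init__(self, n_embd, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x_dec, enc_out):
        q = self.query(x_dec)   # (B, T_dec, head_size): from the decoder side
        k = self.key(enc_out)   # (B, T_enc, head_size): from the encoder output
        v = self.value(enc_out) # (B, T_enc, head_size): from the encoder output
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T_dec, T_enc)
        wei = F.softmax(wei, dim=-1)  # no causal mask: full view of the source
        return wei @ v                # (B, T_dec, head_size)
```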
This comprehensive tutorial bridges the gap between theoretical understanding and practical implementation of transformer-based language models that power technologies like ChatGPT. By building a GPT model from scratch, Karpathy demystifies what might otherwise seem like an intimidating black box, revealing that the core architecture is surprisingly compact and understandable despite its powerful capabilities.
The video demonstrates how relatively simple components—self-attention, feed-forward networks, residual connections, and normalization—combine to create a system capable of learning complex patterns in text. While the model built in the video is orders of magnitude smaller than production systems like GPT-3, it illustrates the same fundamental principles and provides a foundation for understanding how these models scale.
So what? This knowledge empowers viewers to experiment with their own implementations, critically evaluate claims about AI capabilities, and better understand both the potential and limitations of large language models. As these technologies become increasingly integrated into our digital landscape, this deeper technical understanding becomes valuable not just for AI practitioners but for anyone seeking to navigate a world where generative AI is becoming ubiquitous.