Building GPT from Scratch - Part 3: Building the Complete Transformer
Aug 6, 2025
This is Part 3 of a 3-part series on building a Generative Pretrained Transformer from scratch. In Parts 1 and 2, we built the data pipeline and attention mechanism. Now we'll complete our transformer with the remaining key components.
Feed-Forward Networks
After the attention mechanism lets tokens communicate with one another, we need to give the model time to process what it has gathered. This is where feed-forward networks come in: simple MLPs with a ReLU activation, applied to each token independently, that let the model "think" about the information it just collected.
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)
We add this to our model after the attention:
# In our model:
self.sa_head = MultiHeadAttention(4, n_embd//4)  # 4 heads of self-attention
self.ffwd = FeedFoward(n_embd)  # MLP with ReLU

# In forward pass:
x = self.sa_head(x)   # communication between tokens
x = self.ffwd(x)      # per-token computation
logits = self.lm_head(x)
Transformer Blocks: Communication + Computation
The pattern of attention (communication) followed by feed-forward (computation) is so fundamental that we package it into reusable Transformer Blocks. These blocks can be stacked to create deeper networks that alternate between communication and computation phases.
n_head = 4
n_layer = 2

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)

    def forward(self, x):
        x = self.sa(x)
        x = self.ffwd(x)
        return x
We can then stack multiple blocks:
# In our model:
self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])

# In forward pass:
x = tok_emb + pos_emb  # (B,T,C)
x = self.blocks(x)
logits = self.lm_head(x)
At this point, our model starts generating recognizable English words!
RILELE
MAssell use you.
Nt I RYan lake off.
Ficink'd
hote carewtledeche quie to whanl Gatt Mejesery ely:
The if, bet leveses it theave be ry skit you file.
Kay wred tome dake stance, suks,
Adech JORo!
ALOUSBET:
Wis brake grourst and ald creapsss,
Andite
noat,
Amothery are doreast is
Training Optimizations
As our network gets deeper with multiple transformer blocks, we need optimizations to improve training stability and performance.
Residual Connections
The first optimization is residual (skip) connections, introduced in the Deep Residual Learning for Image Recognition paper. Because each sub-layer adds its output to its input, gradients always have a direct identity path back through the network, which makes deep stacks of blocks much easier to optimize.
Instead of 'x = self.sa(x)', we write 'x = x + self.sa(x)'. Each sub-layer now computes a residual update that gets added back onto the original input:
class Block(nn.Module):
    def forward(self, x):
        x = x + self.sa(x)    # residual connection
        x = x + self.ffwd(x)  # residual connection
        return x
Projection Layers
We also add a projection layer that maps the concatenated head outputs back into the embedding space after multi-head attention, and we widen the feed-forward network by a factor of four, as in the original Attention Is All You Need paper:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand by 4x as in the paper
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),  # project back down
        )

    def forward(self, x):
        return self.net(x)
Layer Normalization
Layer normalization helps stabilize training by normalizing each token's features to zero mean and unit variance (followed by a learned scale and shift). We apply it before the self-attention and before the feed-forward computation in each block (the "pre-norm" formulation), and once more after all the blocks:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
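The snippet above covers the per-block norms; the final layer norm mentioned above sits after the stack of blocks, just before the language-model head. A minimal sketch of that placement, reusing the names from the earlier snippets:

# In our model:
self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
self.ln_f = nn.LayerNorm(n_embd)  # final layer norm, applied after all blocks

# In forward pass:
x = self.blocks(x)
x = self.ln_f(x)
logits = self.lm_head(x)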
Dropout
Finally, we add dropout for regularization, which randomly sets some activations to zero during training to prevent overfitting:
class FeedFoward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),  # add dropout
        )

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out
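Dropout is also commonly applied to the attention weights inside each head (the reference nanoGPT code does this as well). As a sketch of what that looks like, assuming the Head class from Part 2 and the usual torch imports:

# assumes: import torch; import torch.nn as nn; from torch.nn import functional as F
class Head(nn.Module):
    """ one head of self-attention (sketch: the Part 2 version, plus dropout) """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B,T,head_size)
        q = self.query(x)  # (B,T,head_size)
        # compute attention scores, scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5   # (B,T,T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        wei = F.softmax(wei, dim=-1)  # (B,T,T)
        wei = self.dropout(wei)       # randomly drop some attention weights
        v = self.value(x)             # (B,T,head_size)
        out = wei @ v                 # (B,T,head_size)
        return out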
Final Model and Hyperparameters
We now have a complete decoder-only transformer ready to scale and train! Here are the hyperparameters for our final model:
# hyperparameters
batch_size = 32
block_size = 32
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
eval_iters = 200
n_embd = 64
n_head = 6
n_layer = 6
dropout = 0.2
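For reference, here is a minimal sketch of how the snippets above might assemble into the final model. It follows the structure of nanoGPT; the class name GPTLanguageModel is illustrative, and vocab_size plus the token/position embedding tables are assumed from Parts 1 and 2:

# assumes: import torch; import torch.nn as nn; from torch.nn import functional as F
class GPTLanguageModel(nn.Module):
    """ sketch: decoder-only transformer assembled from the components above """

    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)           # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T,C)
        x = tok_emb + pos_emb     # (B,T,C)
        x = self.blocks(x)        # (B,T,C)
        x = self.ln_f(x)          # (B,T,C)
        logits = self.lm_head(x)  # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
        return logits, loss

model = GPTLanguageModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

The optimizer line just shows where the learning_rate hyperparameter plugs in; the training loop from the earlier parts can then be reused as-is.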
Final Results
After training our complete transformer, we get much more structured output:
DOF GlLN:
They good, then then, bladgy bone not thindnes
I way Jeain, fainly!
ISABELLA:
I way, thou fourd to havary flown dews'm-sine.
NowZALLET:
Which here old thy's warring hiod
On dearys hory be wive to more; greseli!
But nighd Wart, prance:
Barch,
And prayem not welld she you, coldinger:
O I the oldst God somed:
Sirry is let never To be whith, I new'n be thy limpiny
Word: where deblitiss for give upon the conqueennifult,
And so pobeterl. by they she thy truge,
If you his let a brotess.
While still not perfect, the improvement is dramatic! The model now:
- Uses proper character names (ISABELLA, etc.)
- Maintains somewhat consistent dramatic structure
- Shows more coherent word formation
- Demonstrates longer-range dependencies
What We've Accomplished
In this three-part series, we've built a complete transformer from scratch:
- Part 1: Data loading, tokenization, and a simple bigram baseline
- Part 2: The attention mechanism - the core innovation of transformers
- Part 3: Complete transformer architecture with all the essential components
Key Components We've Implemented:
- Self-attention mechanisms for learning relationships between tokens
- Multi-head attention for capturing different types of relationships
- Feed-forward networks for computation after communication
- Transformer blocks that stack attention and computation
- Residual connections for stable training of deep networks
- Layer normalization for training stability
- Positional embeddings so the model understands sequence order
- Dropout for regularization and preventing overfitting
The Path Forward
Our small transformer demonstrates the core principles, but to achieve GPT-level performance, you'd need to scale up significantly:
- More parameters: Modern language models have billions of parameters
- More data: Training on much larger text corpora
- More compute: Training for weeks or months on powerful hardware
- Better tokenization: Using subword tokenizers like BPE or SentencePiece
- Longer context windows: Supporting much longer sequences
Key Takeaways
- Attention is the core innovation: The self-attention mechanism allows tokens to communicate and share information based on content rather than just position.
- Transformers are surprisingly simple: The architecture is just attention + feed-forward blocks stacked together with some normalization and residual connections.
- Scale matters: The same architecture that gives us semi-coherent Shakespeare can generate human-level text when scaled up with more parameters, data, and compute.
- Training stability is crucial: Residual connections, layer normalization, and proper weight initialization are essential for training deep networks.
Conclusion
This tutorial has shown how to build a transformer model from scratch using PyTorch. We've implemented all the key components of the decoder transformer architecture and seen how they work together to create a language model capable of generating structured text.
The progression from random gibberish (bigram model) to semi-coherent Shakespeare-like text (full transformer) demonstrates the power of the attention mechanism and proper architectural choices. While our small model doesn't generate fully coherent text yet, scaling up the parameters, training data, and computation would lead to increasingly capable language models.
The complete nanoGPT implementation, with additional optimizations and the ability to train larger models, can be found in Andrej Karpathy's repository. This serves as an excellent foundation for understanding and experimenting with transformer architectures.