Understanding Transformer Architecture
A comprehensive guide to how Transformers work: tokenization, embeddings, attention mechanisms, and the architecture that powers ChatGPT, Claude, and modern AI. Written for developers and curious learners who want to understand the fundamentals.
What is a Transformer?
A Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" by researchers at Google. It has become the foundation for virtually all modern large language models (LLMs) including GPT-4, Claude, LLaMA, and BERT.
Before Transformers, language models processed text sequentially—one word at a time, left to right. This was slow and made it hard to capture relationships between words that were far apart. Transformers solved this by processing all words simultaneously and using a mechanism called attention to understand how words relate to each other.
Why Should You Care?
Understanding Transformers helps you:
- Write better prompts by understanding how models "see" your input
- Understand why token limits exist and how to work within them
- Debug unexpected model behavior
- Make informed decisions about which models to use
- Appreciate why certain tasks are easy or hard for LLMs
Part 1: Tokenization — How Text Becomes Numbers
Neural networks can only process numbers, not text. The first step in any language model is converting text into numerical representations called tokens.
What is a Token?
A token is a chunk of text that the model treats as a single unit. Depending on the tokenization strategy, a token might be:
- A single character: H, e, l, l, o
- A whole word: Hello
- A subword (part of a word): Hello → Hel + lo
Tokenization Strategies
1. Character-Level Tokenization
Split text into individual characters.
Input: "Hello world"
Tokens: ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]
Count: 11 tokens
Pros: Small vocabulary (just ~100 characters), handles any word.
Cons: Very long sequences, hard to learn word meanings.
2. Word-Level Tokenization
Split text on whitespace and punctuation.
Input: "Hello world"
Tokens: ["Hello", "world"]
Count: 2 tokens
Pros: Intuitive, captures word meanings.
Cons: Huge vocabulary needed, can't handle new/misspelled words.
3. Subword Tokenization (What Modern LLMs Use)
The sweet spot: common words stay whole, rare words get split into pieces. This is what GPT, Claude, and most modern models use.
Input: "Hello world"
Tokens: ["Hello", " world"]
Count: 2 tokens
Input: "tokenization"
Tokens: ["token", "ization"]
Count: 2 tokens
Input: "unbelievable"
Tokens: ["un", "believable"]
Count: 2 tokens
Input: "Pneumonoultramicroscopicsilicovolcanoconiosis"
Tokens: ["P", "ne", "um", "ono", "ult", "ram", "icro", "scop", "ics", "il", "ico", "vol", "cano", "con", "iosis"]
Count: 15 tokens
How Subword Tokenization Works (BPE)
The most common algorithm is Byte Pair Encoding (BPE). Here's the intuition:
- Start with all individual characters as your vocabulary
- Count which pairs of tokens appear most frequently in your training data
- Merge the most frequent pair into a new token
- Repeat until you reach your desired vocabulary size (e.g., 50,000 tokens)
BPE Example
Training corpus: "low lower lowest low"
Step 1: Start with characters
Vocabulary: [l, o, w, e, r, s, t, ' ']
Step 2: Count pairs
"lo" appears 4 times (most frequent)
Step 3: Merge "lo" into new token
Vocabulary: [l, o, w, e, r, s, t, ' ', lo]
Corpus becomes: "lo w lo w e r lo w e s t lo w"
Step 4: Count pairs again
"low" appears 4 times (most frequent)
Step 5: Merge "low" into new token
Vocabulary: [l, o, w, e, r, s, t, ' ', lo, low]
...continue until vocabulary reaches target size
After training, common words like "the", "and", "is" become single tokens, while rare words get split into recognizable pieces.
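The merge loop described above can be sketched in a few lines of Python. This is a toy trainer for illustration only (real tokenizers like GPT's operate on bytes and train on far larger corpora):

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    # Represent each word as a tuple of symbols (characters to start).
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the corpus.
        merged = {}
        for word, freq in words.items():
            new_word, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    new_word.append(word[i] + word[i + 1])
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            merged[tuple(new_word)] = merged.get(tuple(new_word), 0) + freq
        words = Counter(merged)
    return merges

merges = train_bpe("low lower lowest low", num_merges=2)
print(merges)  # → [('l', 'o'), ('lo', 'w')]
```

Running this on the example corpus reproduces the two merges from the walk-through: first "l"+"o" → "lo", then "lo"+"w" → "low".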
Token IDs
Each token in the vocabulary gets a unique number (ID). The model only sees these numbers.
Vocabulary (simplified):
"Hello" → 15339
" world" → 1917
"the" → 1
"." → 13
Input: "Hello world."
Tokens: ["Hello", " world", "."]
IDs: [15339, 1917, 13]
Why Token Limits Matter
When a model says it has a "128K context window," it means it can process 128,000 tokens at once. Since tokens aren't exactly words:
- 1 token ≈ 4 characters in English (on average)
- 1 token ≈ 0.75 words in English (on average)
- 100 tokens ≈ 75 words
- Code uses more tokens than prose (symbols, indentation)
- Non-English languages often use more tokens per word
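These rules of thumb can be wrapped in a rough estimator. The 4-characters-per-token ratio is the English-prose average from above, not an exact count — only running the actual tokenizer gives exact numbers:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4 chars/token English average."""
    return max(1, round(len(text) / chars_per_token))

def fits_context(text: str, context_window: int = 128_000, reserve: int = 1_000) -> bool:
    """Check whether text likely fits, reserving room for the model's reply."""
    return estimate_tokens(text) + reserve <= context_window

print(estimate_tokens("Hello world."))  # → 3
```

For code or non-English text, lowering `chars_per_token` (e.g. to 3) gives a more conservative estimate.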
Part 2: Embeddings — Giving Tokens Meaning
Token IDs are just arbitrary numbers. The ID 15339 for "Hello" doesn't tell the model anything about what "Hello" means. We need to convert these IDs into embeddings—vectors that capture semantic meaning.
What is an Embedding?
An embedding is a list of numbers (a vector) that represents a token in a way that captures its meaning. Similar words have similar embeddings.
Embedding dimension: 4 (real models use 768-12288)
"king" → [0.2, 0.8, 0.1, 0.9]
"queen" → [0.3, 0.7, 0.1, 0.8] ← similar to "king"
"apple" → [0.9, 0.1, 0.6, 0.2] ← very different
The Famous Example: Word Arithmetic
Good embeddings capture relationships. The classic example:
king - man + woman ≈ queen
In vector space:
[0.2, 0.8, 0.1, 0.9] (king)
- [0.1, 0.9, 0.0, 0.5] (man)
+ [0.2, 0.8, 0.1, 0.4] (woman)
= [0.3, 0.7, 0.2, 0.8] ≈ queen
This works because embeddings encode concepts like "royalty" and "gender" in different dimensions.
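Using the toy 4-dimensional vectors above, the arithmetic and a cosine-similarity check take only a few lines (pure Python; real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

king  = [0.2, 0.8, 0.1, 0.9]
man   = [0.1, 0.9, 0.0, 0.5]
woman = [0.2, 0.8, 0.1, 0.4]
queen = [0.3, 0.7, 0.1, 0.8]
apple = [0.9, 0.1, 0.6, 0.2]

# king - man + woman, element-wise
result = [round(k - m + w, 2) for k, m, w in zip(king, man, woman)]
print(result)  # → [0.3, 0.7, 0.2, 0.8]

# The result points almost exactly at "queen", and nowhere near "apple"
print(round(cosine_similarity(result, queen), 2))  # → 1.0
print(round(cosine_similarity(result, apple), 2))  # → 0.5
```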
The Embedding Matrix
The model learns an embedding matrix during training. This is essentially a giant lookup table:
Embedding Matrix (vocabulary_size × embedding_dimension)
Token ID Embedding Vector (simplified to 4 dimensions)
────────────────────────────────────────────────────────
0 ("the") → [0.12, 0.45, 0.78, 0.23]
1 ("a") → [0.11, 0.43, 0.76, 0.25]
2 ("is") → [0.34, 0.56, 0.12, 0.89]
...
15339 → [0.67, 0.23, 0.91, 0.45] ("Hello")
...
50256 → [0.89, 0.12, 0.34, 0.67] (last token)
For GPT-3, this matrix has 50,257 tokens × 12,288 dimensions = 617 million parameters just for embeddings!
Part 3: Positional Encoding — Where Are You in the Sentence?
Unlike older models that read text left-to-right, Transformers process all tokens simultaneously. But word order matters! "Dog bites man" means something very different from "Man bites dog."
Positional encodings add information about each token's position in the sequence.
How It Works
Each position gets its own vector, which is added to the token's embedding:
Sentence: "The cat sat"
Token embeddings:
"The" → [0.1, 0.2, 0.3, 0.4]
"cat" → [0.5, 0.6, 0.7, 0.8]
"sat" → [0.2, 0.3, 0.4, 0.5]
Position encodings:
pos 0 → [0.01, 0.02, 0.01, 0.02]
pos 1 → [0.02, 0.01, 0.02, 0.01]
pos 2 → [0.01, 0.01, 0.02, 0.02]
Final input = embedding + position:
"The" at pos 0 → [0.11, 0.22, 0.31, 0.42]
"cat" at pos 1 → [0.52, 0.61, 0.72, 0.81]
"sat" at pos 2 → [0.21, 0.31, 0.42, 0.52]
The Original Positional Encoding Formula
The original Transformer paper used sine and cosine functions to generate position vectors. This clever approach means:
- Each position has a unique pattern
- The model can learn relative positions (how far apart tokens are)
- It can theoretically generalize to longer sequences than seen in training
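The paper's formula, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), can be sketched directly:

```python
import math

def positional_encoding(position: int, d_model: int):
    """Sinusoidal encoding from "Attention Is All You Need":
    even dimensions use sine, odd dimensions use cosine."""
    pe = []
    for i in range(d_model):
        # The frequency depends on the dimension pair index i // 2.
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Position 0 is [sin(0), cos(0), ...] = [0, 1, 0, 1]
print(positional_encoding(0, 4))  # → [0.0, 1.0, 0.0, 1.0]
```

Each position produces a distinct pattern of values in [-1, 1], which is what lets the model distinguish and relate positions.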
Modern models often use learned positional embeddings (like token embeddings, but for positions) or relative positional encodings (RoPE in LLaMA, ALiBi in others).
Part 4: Attention — The Core Innovation
Attention is what makes Transformers special. It allows every token to "look at" every other token and decide how much to pay attention to it.
The Intuition
Consider this sentence:
"The cat sat on the mat because it was tired."
What does "it" refer to? To understand, you need to look back at other words. A human reader connects "it" to "cat" (not "mat") based on context. Attention lets the model do the same thing.
Self-Attention: A Library Analogy
Imagine you're in a library researching a topic. For each question (Query) you have:
- Query (Q): What you're looking for — "I need information about cats"
- Key (K): The label on each book — "This book is about animals"
- Value (V): The actual content of the book
You compare your Query to each book's Key. Books with relevant Keys get more attention, and you extract more from their Values.
Self-Attention Step by Step
Let's trace through attention for a simple sentence:
Input: "The cat sat"
Step 1: Create Q, K, V for each token
────────────────────────────────────
Each token's embedding is multiplied by three learned weight matrices
to produce Query, Key, and Value vectors.
Token Query (Q) Key (K) Value (V)
───────────────────────────────────────────────────────
"The" [0.1, 0.2] [0.3, 0.1] [0.5, 0.2]
"cat" [0.4, 0.3] [0.2, 0.5] [0.1, 0.8]
"sat" [0.2, 0.5] [0.4, 0.3] [0.3, 0.4]
Step 2: Calculate attention scores
─────────────────────────────────
For each token, compute how much it should attend to every other token.
This is done by taking the dot product of its Query with all Keys.
For "cat" (Q = [0.4, 0.3]):
Score with "The": [0.4, 0.3] · [0.3, 0.1] = 0.12 + 0.03 = 0.15
Score with "cat": [0.4, 0.3] · [0.2, 0.5] = 0.08 + 0.15 = 0.23
Score with "sat": [0.4, 0.3] · [0.4, 0.3] = 0.16 + 0.09 = 0.25
Raw scores for "cat": [0.15, 0.23, 0.25]
Step 3: Apply softmax to get attention weights
─────────────────────────────────────────────
Softmax converts scores to probabilities that sum to 1.
Attention weights for "cat": [0.31, 0.34, 0.35]
↑ ↑ ↑
"The" "cat" "sat"
"cat" pays 31% attention to "The", 34% to itself, 35% to "sat"
Step 4: Compute weighted sum of Values
─────────────────────────────────────
Multiply each Value by its attention weight and sum.
Output for "cat" = 0.31 × V("The") + 0.34 × V("cat") + 0.35 × V("sat")
= 0.31 × [0.5, 0.2] + 0.34 × [0.1, 0.8] + 0.35 × [0.3, 0.4]
= [0.16, 0.06] + [0.03, 0.27] + [0.11, 0.14]
= [0.30, 0.47]
This output vector now contains information gathered from all tokens,
weighted by relevance.
Visualizing Attention
Attention patterns can be visualized as a heatmap showing which tokens attend to which:
Sentence: "The cat sat on the mat because it was tired"
Attention from "it":
The cat sat on the mat because it was tired
┌─────────────────────────────────────────────────────────────────┐
"it" │ 0.02 0.71 0.03 0.01 0.02 0.08 0.02 0.05 0.03 0.03 │
└─────────────────────────────────────────────────────────────────┘
↑ Strong attention
The model learns that "it" refers to "cat" (0.71) more than "mat" (0.08)
Why "Scaled" Dot-Product?
The actual formula divides by √(dimension) before softmax:
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
Without scaling, dot products of high-dimensional vectors can become very large, making softmax produce extreme values (nearly 0 or 1). Scaling keeps gradients healthy during training.
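Steps 1-4 plus the scaling term fit in one short function (pure Python over lists, reusing the toy Q/K/V vectors from the walk-through; because of the √d_k scaling, the weights come out slightly different from the unscaled numbers above):

```python
import math

def softmax(xs):
    """Convert raw scores to probabilities that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Dot product of this Query with every Key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the Value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
    return outputs

Q = [[0.1, 0.2], [0.4, 0.3], [0.2, 0.5]]   # "The", "cat", "sat"
K = [[0.3, 0.1], [0.2, 0.5], [0.4, 0.3]]
V = [[0.5, 0.2], [0.1, 0.8], [0.3, 0.4]]

for token, out in zip(["The", "cat", "sat"], attention(Q, K, V)):
    print(token, [round(x, 2) for x in out])
```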
Part 5: Multi-Head Attention
One attention pattern isn't enough. Different aspects of language require different types of attention:
- One head might track grammatical structure (subject-verb agreement)
- Another might track semantic relationships (what "it" refers to)
- Another might track positional patterns (nearby words)
How Multi-Head Attention Works
Instead of one attention calculation:
Input → [Single Attention] → Output
We run multiple in parallel:
Input → [Attention Head 1] ─┐
→ [Attention Head 2] ─┼→ Concatenate → Linear → Output
→ [Attention Head 3] ─┤
→ [Attention Head 4] ─┘
GPT-3 uses 96 attention heads
Each head has dimension 12288/96 = 128
Multi-Head Example
Sentence: "The bank by the river"
Head 1 (syntactic): "bank" attends to "The" (determiner relationship)
Head 2 (semantic): "bank" attends to "river" (semantic disambiguation → riverbank, not financial)
Head 3 (local): "bank" attends to "by" (nearby context)
Each head captures different relationships, and the model combines them.
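The split-per-head bookkeeping can be sketched as follows. This is a simplified illustration: the slices are used directly as Q, K, and V, whereas real models first apply learned projection matrices per head plus a final output projection:

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of small vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, V))
                        for j in range(len(V[0]))])
    return outputs

def multi_head_attention(X, num_heads):
    """Toy multi-head self-attention: slice each token vector into
    per-head pieces, run attention within each slice, concatenate."""
    d_model = len(X[0])
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        sl = [x[h * d_head:(h + 1) * d_head] for x in X]
        head_outputs.append(attention(sl, sl, sl))  # slice serves as Q, K and V
    # Concatenate per-head outputs back into full-width vectors.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(X))]

# 3 tokens, d_model = 4, 2 heads of dimension 2 each
X = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.2, 0.3, 0.4, 0.5]]
out = multi_head_attention(X, num_heads=2)
print(len(out), len(out[0]))  # → 3 4
```

Each head attends over its own slice independently, mirroring how, e.g., GPT-3 splits its 12,288 dimensions into 96 heads of 128.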
Part 6: The Full Transformer Block
A Transformer is made of stacked identical blocks. Each block has:
Architecture Diagram
┌─────────────────────────────────────────────────┐
│ Transformer Block │
│ │
│ Input │
│ ↓ │
│ ┌──────────────────────────────────────────┐ │
│ │ Multi-Head Attention │ │
│ └──────────────────────────────────────────┘ │
│ ↓ │
│ [Add & Normalize] ←── Residual Connection │
│ ↓ │
│ ┌──────────────────────────────────────────┐ │
│ │ Feed-Forward Network │ │
│ │ (Two linear layers with activation) │ │
│ └──────────────────────────────────────────┘ │
│ ↓ │
│ [Add & Normalize] ←── Residual Connection │
│ ↓ │
│ Output │
│ │
└─────────────────────────────────────────────────┘
GPT-3: 96 of these blocks stacked
Claude/GPT-4: Likely 100+ blocks
Component Breakdown
1. Multi-Head Attention
As described above — allows tokens to gather information from other tokens.
2. Add & Normalize (Residual Connection + Layer Normalization)
The residual connection adds the input back to the output:
output = LayerNorm(input + Attention(input))
This helps with training deep networks — gradients can flow directly through the addition, preventing the vanishing gradient problem.
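A minimal sketch of the Add & Normalize step (the LayerNorm here omits the learned scale and shift parameters that real implementations include):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

out = add_and_norm([0.5, 0.2, 0.3, 0.4], [0.1, 0.3, 0.2, 0.1])
print(out)  # values have mean ≈ 0 and variance ≈ 1
```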
3. Feed-Forward Network (FFN)
A simple two-layer neural network applied to each position independently:
FFN(x) = ReLU(x × W1 + b1) × W2 + b2
Typical dimensions:
Input: 768 (or 12288 for large models)
Hidden: 3072 (4× input, i.e., 49152 for large models)
Output: 768 (same as input)
The FFN is where much of the model's "knowledge" is stored. It processes each token's representation independently, adding learned transformations.
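The FFN formula can be sketched with plain lists (the weights below are random placeholders purely for shape; in a trained model they encode learned knowledge):

```python
import random

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: ReLU(x·W1 + b1)·W2 + b2."""
    def linear(v, W, b):
        # v: [d_in], W: [d_in][d_out], b: [d_out]
        return [sum(vi * W[i][j] for i, vi in enumerate(v)) + b[j]
                for j in range(len(b))]
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU activation
    return linear(hidden, W2, b2)

# Tiny example: d_model = 2, hidden = 4 (real models use e.g. 768 → 3072)
random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
b1 = [0.0] * 4
W2 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
b2 = [0.0] * 2
print(ffn([0.5, -0.3], W1, b1, W2, b2))  # a 2-dimensional output vector
```

Note the function sees one token's vector at a time: unlike attention, the FFN never mixes information between positions.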
Part 7: Encoder vs. Decoder
The original Transformer had two parts: an Encoder and a Decoder. Modern models often use just one or the other.
Architecture Types
┌────────────────────────────────────────────────────────────────┐
│ ENCODER-DECODER (Original Transformer, T5, BART) │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ ENCODER │ ──────→ │ DECODER │ │
│ │ │ context │ │ │
│ │ Sees full │ │ Generates │ │
│ │ input at │ │ output │ │
│ │ once │ │ left-to- │ │
│ │ │ │ right │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ Use case: Translation, summarization │
│ "Translate English to French" │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ DECODER-ONLY (GPT, Claude, LLaMA) │
│ │
│ ┌─────────────┐ │
│ │ DECODER │ │
│ │ │ │
│ │ Generates │ │
│ │ one token │ │
│ │ at a time │ │
│ │ (causal) │ │
│ └─────────────┘ │
│ │
│ Use case: Text generation, chat, code completion │
│ "The quick brown fox" → predicts "jumps" │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ ENCODER-ONLY (BERT, RoBERTa) │
│ │
│ ┌─────────────┐ │
│ │ ENCODER │ │
│ │ │ │
│ │ Sees full │ │
│ │ input │ │
│ │ (bi- │ │
│ │ directional)│ │
│ └─────────────┘ │
│ │
│ Use case: Classification, NER, embeddings │
│ "This movie was [MASK]" → predicts "great" │
└────────────────────────────────────────────────────────────────┘
Causal Masking in Decoder-Only Models
In decoder-only models (like GPT and Claude), each token can only attend to previous tokens, not future ones. This is called causal masking.
Sentence: "The cat sat on"
When processing "sat", it can only see:
✓ "The"
✓ "cat"
✓ "sat" (itself)
✗ "on" (future - masked out)
Attention mask:
The cat sat on
┌─────────────────────┐
The │ 1 0 0 0 │
cat │ 1 1 0 0 │
sat │ 1 1 1 0 │
on │ 1 1 1 1 │
└─────────────────────┘
1 = can attend, 0 = masked (set to -infinity before softmax)
This masking is essential for training: the model learns to predict the next token using only previous context, which is exactly what it needs to do during generation.
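The mask above can be built and applied in a few lines; setting masked scores to negative infinity makes their softmax weight exactly zero:

```python
import math

def causal_mask(n):
    """Lower-triangular mask: row i may attend to columns 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax with masked positions forced to zero weight."""
    masked = [s if m else float("-inf") for s, m in zip(scores, mask_row)]
    exps = [math.exp(x) for x in masked]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]

# Row for "sat" (position 2): the score for "on" gets weight exactly 0
print(masked_softmax([0.5, 0.2, 0.9, 0.1], causal_mask(4)[2]))
```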
Part 8: Putting It All Together — How Generation Works
Let's trace through how a decoder-only model generates text:
Step-by-Step Generation
Prompt: "The capital of France is"
┌─────────────────────────────────────────────────────────────────┐
│ Step 1: Tokenize │
├─────────────────────────────────────────────────────────────────┤
│ "The capital of France is" │
│ ↓ │
│ [464, 3139, 286, 4881, 318] (token IDs) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 2: Embed + Add Positions │
├─────────────────────────────────────────────────────────────────┤
│ Each token ID → embedding vector │
│ Add positional encoding for positions 0, 1, 2, 3, 4 │
│ ↓ │
│ Matrix of shape [5 tokens × 768 dimensions] │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 3: Process through Transformer Blocks │
├─────────────────────────────────────────────────────────────────┤
│ Block 1: Attention → Add&Norm → FFN → Add&Norm │
│ Block 2: Attention → Add&Norm → FFN → Add&Norm │
│ ... │
│ Block 96: Attention → Add&Norm → FFN → Add&Norm │
│ ↓ │
│ Output: [5 × 768] matrix with contextualized representations │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 4: Predict Next Token │
├─────────────────────────────────────────────────────────────────┤
│ Take the LAST token's representation ([1 × 768]) │
│ ↓ │
│ Multiply by output embedding matrix (768 × 50257) │
│ ↓ │
│ Get logits for every token in vocabulary ([1 × 50257]) │
│ ↓ │
│ Apply softmax → probability distribution │
│ │
│ Top probabilities: │
│ "Paris" → 0.92 │
│ "Lyon" → 0.03 │
│ "the" → 0.01 │
│ ... │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 5: Sample and Repeat │
├─────────────────────────────────────────────────────────────────┤
│ Select "Paris" (highest probability, or sample with temperature)│
│ ↓ │
│ New sequence: "The capital of France is Paris" │
│ ↓ │
│ Repeat steps 1-5 to generate next token │
│ ↓ │
│ "The capital of France is Paris." │
│ ↓ │
│ Continue until: max length, or stop token generated │
└─────────────────────────────────────────────────────────────────┘
Temperature and Sampling
The model doesn't always pick the highest-probability token. Temperature controls randomness:
Temperature = 0 (deterministic):
Always pick highest probability
"Paris" (0.92) → always selected
Temperature = 1 (balanced):
Sample proportional to probabilities
"Paris" selected 92% of the time
Temperature > 1 (creative):
Flatten probabilities, more randomness
Lower-probability tokens get picked more often
Temperature < 1 (focused):
Sharpen probabilities, less randomness
High-probability tokens even more likely
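A sketch of temperature adjustment (it assumes strictly positive probabilities, and handles temperature 0 as a pure argmax special case):

```python
import math
import random

def apply_temperature(probs, temperature):
    """Re-weight a distribution: T < 1 sharpens it, T > 1 flattens it."""
    if temperature == 0:
        # Deterministic: put all mass on the most likely token.
        best = max(range(len(probs)), key=probs.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(probs))]
    # Equivalent to softmax(log(p) / T)
    exps = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.92, 0.05, 0.03]  # "Paris", "Lyon", "the"
print([round(p, 3) for p in apply_temperature(probs, 1.0)])  # → [0.92, 0.05, 0.03]
print([round(p, 3) for p in apply_temperature(probs, 0.5)])  # sharper: "Paris" near 1.0
print([round(p, 3) for p in apply_temperature(probs, 2.0)])  # flatter

# Sample a token index from the adjusted distribution
random.seed(0)
idx = random.choices(range(3), weights=apply_temperature(probs, 1.0))[0]
```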
Part 9: Why Transformers Work So Well
1. Parallelization
RNNs process tokens one at a time. Transformers process all tokens simultaneously, making training on GPUs/TPUs extremely efficient.
RNN Training:
Token 1 → Token 2 → Token 3 → Token 4 (sequential, slow)
Transformer Training:
Token 1 ┐
Token 2 ├→ All processed at once (parallel, fast)
Token 3 │
Token 4 ┘
2. Long-Range Dependencies
In RNNs, information from early tokens gets "diluted" as it passes through many steps. Transformers let any token directly attend to any other token, regardless of distance.
Sentence: "The cat, which had been sleeping peacefully in the sun all
afternoon, suddenly jumped up."
RNN: Information about "cat" must pass through 15 steps to reach "jumped"
Each step loses some information.
Transformer: "jumped" directly attends to "cat" in one step.
No information loss from distance.
3. Scalability
Transformers scale remarkably well. Making them bigger consistently improves performance:
Model Parameters Layers Attention Heads
─────────────────────────────────────────────────────────
GPT-2 Small 117M 12 12
GPT-2 Large 774M 36 20
GPT-3 175B 96 96
GPT-4 ~1.7T* ~120* ~?*
* Estimated, not officially confirmed
4. Transfer Learning
A Transformer trained on a large corpus learns general language understanding. This can be fine-tuned for specific tasks with much less data than training from scratch.
Part 10: Modern Improvements
Since 2017, researchers have made many improvements to the original architecture:
Key Innovations
- RoPE (Rotary Position Embedding) — Better handling of relative positions; used in LLaMA, Mistral
- GQA (Grouped Query Attention) — More efficient attention by sharing keys/values across heads
- Flash Attention — Memory-efficient attention computation, enables longer contexts
- SwiGLU — Improved activation function for the feed-forward network
- RMSNorm — Faster normalization than LayerNorm
- KV Cache — Cache key/value computations during generation for speed
Context Length Evolution
Model Max Context Length
───────────────────────────────────
GPT-2 (2019) 1,024 tokens
GPT-3 (2020) 2,048 tokens
GPT-4 (2023) 8,192 / 32,768 tokens
Claude 2 (2023) 100,000 tokens
Claude 3 (2024) 200,000 tokens
GPT-4 Turbo 128,000 tokens
Gemini 1.5 1,000,000+ tokens
Summary: The Transformer at a Glance
┌─────────────────────────────────────────────────────────────────┐
│ TRANSFORMER │
├─────────────────────────────────────────────────────────────────┤
│ │
│ "Hello world" (text) │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Tokenizer │ Split into subwords │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ [15339, 995] (token IDs) │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Embedding │ Look up vectors │
│ │ + Position │ Add position info │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Transformer │ ┌─────────────────────┐ │
│ │ Block ×N │ │ Multi-Head Attention│──→ Which tokens │
│ │ │ │ Feed-Forward Network│──→ matter to which? │
│ │ │ │ Add & Normalize │──→ Transform meaning │
│ └─────────────┘ └─────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Output Head │ Project to vocabulary │
│ │ + Softmax │ Get probabilities │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ "next_token" probability distribution │
│ │
└─────────────────────────────────────────────────────────────────┘
Key Takeaways
- Tokens are chunks of text; subword tokenization balances vocabulary size and coverage
- Embeddings turn tokens into meaningful vectors that capture semantic relationships
- Positional encodings tell the model where each token is in the sequence
- Attention lets tokens gather relevant information from all other tokens
- Multi-head attention captures different types of relationships simultaneously
- Transformer blocks stack attention and feed-forward layers with residual connections
- Decoder-only models (GPT, Claude) use causal masking to generate text left-to-right
Further Reading
- Attention Is All You Need — The original 2017 paper by Vaswani et al.
- The Illustrated Transformer — Jay Alammar's visual explanation
- Language Models are Few-Shot Learners — GPT-3 paper
- BERT: Pre-training of Deep Bidirectional Transformers — The encoder-only approach
- The Annotated Transformer — Harvard NLP's code walkthrough