Understanding Transformer Architecture
A comprehensive guide to how Transformers work: tokenization, embeddings, attention mechanisms, and the architecture that powers ChatGPT, Claude, and modern AI. Written for developers and curious learners who want to understand the fundamentals.
What is a Transformer?
A Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" by researchers at Google. It has become the foundation for virtually all modern large language models (LLMs) including GPT-4, Claude, LLaMA, and BERT.
Before Transformers, language models processed text sequentially—one word at a time, left to right. This was slow and made it hard to capture relationships between words that were far apart. Transformers solved this by processing all words simultaneously and using a mechanism called attention to understand how words relate to each other.
Why Should You Care?
Understanding Transformers helps you:
- Write better prompts by understanding how models "see" your input
- Understand why token limits exist and how to work within them
- Debug unexpected model behavior
- Make informed decisions about which models to use
- Appreciate why certain tasks are easy or hard for LLMs
Part 1: Tokenization — How Text Becomes Numbers
Neural networks can only process numbers, not text. The first step in any language model is converting text into numerical representations called tokens.
What is a Token?
A token is a chunk of text that the model treats as a single unit. Depending on the tokenization strategy, a token might be:
- A single character: H, e, l, l, o
- A whole word: Hello
- A subword (part of a word): Hello → Hel + lo
Tokenization Strategies
1. Character-Level Tokenization
Split text into individual characters.
Input: "Hello world"
Tokens: ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]
Count: 11 tokens
Pros: Small vocabulary (just ~100 characters), handles any word.
Cons: Very long sequences, hard to learn word meanings.
2. Word-Level Tokenization
Split text on whitespace and punctuation.
Input: "Hello world"
Tokens: ["Hello", "world"]
Count: 2 tokens
Pros: Intuitive, captures word meanings.
Cons: Huge vocabulary needed, can't handle new/misspelled words.
3. Subword Tokenization (What Modern LLMs Use)
The sweet spot: common words stay whole, rare words get split into pieces. This is what GPT, Claude, and most modern models use.
Input: "Hello world"
Tokens: ["Hello", " world"]
Count: 2 tokens
Input: "tokenization"
Tokens: ["token", "ization"]
Count: 2 tokens
Input: "unbelievable"
Tokens: ["un", "believable"]
Count: 2 tokens
Input: "Pneumonoultramicroscopicsilicovolcanoconiosis"
Tokens: ["P", "ne", "um", "ono", "ult", "ram", "icro", "scop", "ics", "il", "ico", "vol", "cano", "con", "iosis"]
Count: 15 tokens
How Subword Tokenization Works (BPE)
The most common algorithm is Byte Pair Encoding (BPE). Here's the intuition:
- Start with all individual characters as your vocabulary
- Count which pairs of tokens appear most frequently in your training data
- Merge the most frequent pair into a new token
- Repeat until you reach your desired vocabulary size (e.g., 50,000 tokens)
BPE Example
Training corpus: "low lower lowest low"
Step 1: Start with characters
Vocabulary: [l, o, w, e, r, s, t, ' ']
Step 2: Count pairs
"lo" appears 4 times (most frequent)
Step 3: Merge "lo" into new token
Vocabulary: [l, o, w, e, r, s, t, ' ', lo]
Corpus becomes: "lo w lo w e r lo w e s t lo w"
Step 4: Count pairs again
"low" appears 4 times (most frequent)
Step 5: Merge "low" into new token
Vocabulary: [l, o, w, e, r, s, t, ' ', lo, low]
...continue until vocabulary reaches target size
After training, common words like "the", "and", "is" become single tokens, while rare words get split into recognizable pieces.
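The merge loop described above can be sketched in a few lines of Python. This is a toy trainer for illustration only (real tokenizers like GPT's operate on bytes and train on far larger corpora):

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    # Represent each word as a tuple of symbols (characters to start).
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the corpus.
        merged = {}
        for word, freq in words.items():
            new_word, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    new_word.append(word[i] + word[i + 1])
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            merged[tuple(new_word)] = merged.get(tuple(new_word), 0) + freq
        words = Counter(merged)
    return merges

merges = train_bpe("low lower lowest low", num_merges=2)
print(merges)  # → [('l', 'o'), ('lo', 'w')]
```

Running this on the example corpus reproduces the two merges from the walk-through: first "l"+"o" → "lo", then "lo"+"w" → "low".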
Token IDs
Each token in the vocabulary gets a unique number (ID). The model only sees these numbers.
Vocabulary (simplified):
"Hello" → 15339
" world" → 1917
"the" → 1
"." → 13
Input: "Hello world."
Tokens: ["Hello", " world", "."]
IDs: [15339, 1917, 13]
Why Token Limits Matter
When a model says it has a "128K context window," it means it can process 128,000 tokens at once. Since tokens aren't exactly words:
- 1 token ≈ 4 characters in English (on average)
- 1 token ≈ 0.75 words in English (on average)
- 100 tokens ≈ 75 words
- Code uses more tokens than prose (symbols, indentation)
- Non-English languages often use more tokens per word
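These rules of thumb can be wrapped in a rough estimator. The 4-characters-per-token ratio is the English-prose average from above, not an exact count — only running the actual tokenizer gives exact numbers:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4 chars/token English average."""
    return max(1, round(len(text) / chars_per_token))

def fits_context(text: str, context_window: int = 128_000, reserve: int = 1_000) -> bool:
    """Check whether text likely fits, reserving room for the model's reply."""
    return estimate_tokens(text) + reserve <= context_window

print(estimate_tokens("Hello world."))  # → 3
```

For code or non-English text, lowering `chars_per_token` (e.g. to 3) gives a more conservative estimate.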
Part 2: Embeddings — Giving Tokens Meaning
Token IDs are just arbitrary numbers. The ID 15339 for "Hello" doesn't tell the model anything about what "Hello" means. We need to convert these IDs into embeddings—vectors that capture semantic meaning.
What is an Embedding?
An embedding is a list of numbers (a vector) that represents a token in a way that captures its meaning. Similar words have similar embeddings.
Embedding dimension: 4 (real models use 768-12288)
"king" → [0.2, 0.8, 0.1, 0.9]
"queen" → [0.3, 0.7, 0.1, 0.8] ← similar to "king"
"apple" → [0.9, 0.1, 0.6, 0.2] ← very different
The Famous Example: Word Arithmetic
Good embeddings capture relationships. The classic example:
king - man + woman ≈ queen
In vector space:
[0.2, 0.8, 0.1, 0.9] (king)
- [0.1, 0.9, 0.0, 0.5] (man)
+ [0.2, 0.8, 0.1, 0.4] (woman)
= [0.3, 0.7, 0.2, 0.8] ≈ queen
This works because embeddings encode concepts like "royalty" and "gender" in different dimensions.
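Using the toy 4-dimensional vectors above, the arithmetic and a cosine-similarity check take only a few lines (pure Python; real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

king  = [0.2, 0.8, 0.1, 0.9]
man   = [0.1, 0.9, 0.0, 0.5]
woman = [0.2, 0.8, 0.1, 0.4]
queen = [0.3, 0.7, 0.1, 0.8]
apple = [0.9, 0.1, 0.6, 0.2]

# king - man + woman, element-wise
result = [round(k - m + w, 2) for k, m, w in zip(king, man, woman)]
print(result)  # → [0.3, 0.7, 0.2, 0.8]

# The result points almost exactly at "queen", and nowhere near "apple"
print(round(cosine_similarity(result, queen), 2))  # → 1.0
print(round(cosine_similarity(result, apple), 2))  # → 0.5
```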
The Embedding Matrix
The model learns an embedding matrix during training. This is essentially a giant lookup table:
Embedding Matrix (vocabulary_size × embedding_dimension)
Token ID Embedding Vector (simplified to 4 dimensions)
────────────────────────────────────────────────────────
0 ("the") → [0.12, 0.45, 0.78, 0.23]
1 ("a") → [0.11, 0.43, 0.76, 0.25]
2 ("is") → [0.34, 0.56, 0.12, 0.89]
...
15339 → [0.67, 0.23, 0.91, 0.45] ("Hello")
...
50256 → [0.89, 0.12, 0.34, 0.67] (last token)
For GPT-3, this matrix has 50,257 tokens × 12,288 dimensions = 617 million parameters just for embeddings!
Part 3: Positional Encoding — Where Are You in the Sentence?
Unlike older models that read text left-to-right, Transformers process all tokens simultaneously. But word order matters! "Dog bites man" means something very different from "Man bites dog."
Positional encodings add information about each token's position in the sequence.
How It Works
Each position gets its own vector, which is added to the token's embedding:
Sentence: "The cat sat"
Token embeddings:
"The" → [0.1, 0.2, 0.3, 0.4]
"cat" → [0.5, 0.6, 0.7, 0.8]
"sat" → [0.2, 0.3, 0.4, 0.5]
Position encodings:
pos 0 → [0.01, 0.02, 0.01, 0.02]
pos 1 → [0.02, 0.01, 0.02, 0.01]
pos 2 → [0.01, 0.01, 0.02, 0.02]
Final input = embedding + position:
"The" at pos 0 → [0.11, 0.22, 0.31, 0.42]
"cat" at pos 1 → [0.52, 0.61, 0.72, 0.81]
"sat" at pos 2 → [0.21, 0.31, 0.42, 0.52]
The Original Positional Encoding Formula
The original Transformer paper used sine and cosine functions to generate position vectors. This clever approach means:
- Each position has a unique pattern
- The model can learn relative positions (how far apart tokens are)
- It can theoretically generalize to longer sequences than seen in training
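The paper's formula, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), can be sketched directly:

```python
import math

def positional_encoding(position: int, d_model: int):
    """Sinusoidal encoding from "Attention Is All You Need":
    even dimensions use sine, odd dimensions use cosine."""
    pe = []
    for i in range(d_model):
        # The frequency depends on the dimension pair index i // 2.
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Position 0 is [sin(0), cos(0), ...] = [0, 1, 0, 1]
print(positional_encoding(0, 4))  # → [0.0, 1.0, 0.0, 1.0]
```

Each position produces a distinct pattern of values in [-1, 1], which is what lets the model distinguish and relate positions.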
Modern models often use learned positional embeddings (like token embeddings, but for positions) or relative positional encodings (RoPE in LLaMA, ALiBi in others).
Part 4: Attention — The Core Innovation
Attention is what makes Transformers special. It allows every token to "look at" every other token and decide how much to pay attention to it.
The Intuition
Consider this sentence:
"The cat sat on the mat because it was tired."
What does "it" refer to? To understand, you need to look back at other words. A human reader connects "it" to "cat" (not "mat") based on context. Attention lets the model do the same thing.
Self-Attention: A Library Analogy
Imagine you're in a library researching a topic. For each question (Query) you have:
- Query (Q): What you're looking for — "I need information about cats"
- Key (K): The label on each book — "This book is about animals"
- Value (V): The actual content of the book
You compare your Query to each book's Key. Books with relevant Keys get more attention, and you extract more from their Values.
Self-Attention Step by Step
Let's trace through attention for a simple sentence:
Input: "The cat sat"
Step 1: Create Q, K, V for each token
────────────────────────────────────
Each token's embedding is multiplied by three learned weight matrices
to produce Query, Key, and Value vectors.
Token Query (Q) Key (K) Value (V)
───────────────────────────────────────────────────────
"The" [0.1, 0.2] [0.3, 0.1] [0.5, 0.2]
"cat" [0.4, 0.3] [0.2, 0.5] [0.1, 0.8]
"sat" [0.2, 0.5] [0.4, 0.3] [0.3, 0.4]
Step 2: Calculate attention scores
─────────────────────────────────
For each token, compute how much it should attend to every other token.
This is done by taking the dot product of its Query with all Keys.
For "cat" (Q = [0.4, 0.3]):
Score with "The": [0.4, 0.3] · [0.3, 0.1] = 0.12 + 0.03 = 0.15
Score with "cat": [0.4, 0.3] · [0.2, 0.5] = 0.08 + 0.15 = 0.23
Score with "sat": [0.4, 0.3] · [0.4, 0.3] = 0.16 + 0.09 = 0.25
Raw scores for "cat": [0.15, 0.23, 0.25]
Step 3: Apply softmax to get attention weights
─────────────────────────────────────────────
Softmax converts scores to probabilities that sum to 1.
Attention weights for "cat": [0.31, 0.34, 0.35]
↑ ↑ ↑
"The" "cat" "sat"
"cat" pays 31% attention to "The", 34% to itself, 35% to "sat"
Step 4: Compute weighted sum of Values
─────────────────────────────────────
Multiply each Value by its attention weight and sum.
Output for "cat" = 0.31 × V("The") + 0.34 × V("cat") + 0.35 × V("sat")
= 0.31 × [0.5, 0.2] + 0.34 × [0.1, 0.8] + 0.35 × [0.3, 0.4]
= [0.16, 0.06] + [0.03, 0.27] + [0.11, 0.14]
= [0.30, 0.47]
This output vector now contains information gathered from all tokens,
weighted by relevance.
Visualizing Attention
Attention patterns can be visualized as a heatmap showing which tokens attend to which:
Sentence: "The cat sat on the mat because it was tired"
Attention from "it":
The cat sat on the mat because it was tired
┌─────────────────────────────────────────────────────────────────┐
"it" │ 0.02 0.71 0.03 0.01 0.02 0.08 0.02 0.05 0.03 0.03 │
└─────────────────────────────────────────────────────────────────┘
↑ Strong attention
The model learns that "it" refers to "cat" (0.71) more than "mat" (0.08)
Why "Scaled" Dot-Product?
The actual formula divides by √(dimension) before softmax:
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
Without scaling, dot products of high-dimensional vectors can become very large, making softmax produce extreme values (nearly 0 or 1). Scaling keeps gradients healthy during training.
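Steps 1-4 plus the scaling term fit in one short function (pure Python over lists, reusing the toy Q/K/V vectors from the walk-through; because of the √d_k scaling, the weights come out slightly different from the unscaled numbers above):

```python
import math

def softmax(xs):
    """Convert raw scores to probabilities that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Dot product of this Query with every Key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the Value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
    return outputs

Q = [[0.1, 0.2], [0.4, 0.3], [0.2, 0.5]]   # "The", "cat", "sat"
K = [[0.3, 0.1], [0.2, 0.5], [0.4, 0.3]]
V = [[0.5, 0.2], [0.1, 0.8], [0.3, 0.4]]

for token, out in zip(["The", "cat", "sat"], attention(Q, K, V)):
    print(token, [round(x, 2) for x in out])
```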
Part 5: Multi-Head Attention
One attention pattern isn't enough. Different aspects of language require different types of attention:
- One head might track grammatical structure (subject-verb agreement)
- Another might track semantic relationships (what "it" refers to)
- Another might track positional patterns (nearby words)
How Multi-Head Attention Works
Instead of one attention calculation:
Input → [Single Attention] → Output
We run multiple in parallel:
Input → [Attention Head 1] ─┐
→ [Attention Head 2] ─┼→ Concatenate → Linear → Output
→ [Attention Head 3] ─┤
→ [Attention Head 4] ─┘
GPT-3 uses 96 attention heads
Each head has dimension 12288/96 = 128
Multi-Head Example
Sentence: "The bank by the river"
Head 1 (syntactic): "bank" attends to "The" (determiner relationship)
Head 2 (semantic): "bank" attends to "river" (semantic disambiguation → riverbank, not financial)
Head 3 (local): "bank" attends to "by" (nearby context)
Each head captures different relationships, and the model combines them.
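The split-per-head bookkeeping can be sketched as follows. This is a simplified illustration: the slices are used directly as Q, K, and V, whereas real models first apply learned projection matrices per head plus a final output projection:

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of small vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, V))
                        for j in range(len(V[0]))])
    return outputs

def multi_head_attention(X, num_heads):
    """Toy multi-head self-attention: slice each token vector into
    per-head pieces, run attention within each slice, concatenate."""
    d_model = len(X[0])
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        sl = [x[h * d_head:(h + 1) * d_head] for x in X]
        head_outputs.append(attention(sl, sl, sl))  # slice serves as Q, K and V
    # Concatenate per-head outputs back into full-width vectors.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(X))]

# 3 tokens, d_model = 4, 2 heads of dimension 2 each
X = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.2, 0.3, 0.4, 0.5]]
out = multi_head_attention(X, num_heads=2)
print(len(out), len(out[0]))  # → 3 4
```

Each head attends over its own slice independently, mirroring how, e.g., GPT-3 splits its 12,288 dimensions into 96 heads of 128.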
Part 6: The Full Transformer Block
A Transformer is made of stacked identical blocks. Each block has:
Architecture Diagram
┌─────────────────────────────────────────────────┐
│ Transformer Block │
│ │
│ Input │
│ ↓ │
│ ┌──────────────────────────────────────────┐ │
│ │ Multi-Head Attention │ │
│ └──────────────────────────────────────────┘ │
│ ↓ │
│ [Add & Normalize] ←── Residual Connection │
│ ↓ │
│ ┌──────────────────────────────────────────┐ │
│ │ Feed-Forward Network │ │
│ │ (Two linear layers with activation) │ │
│ └──────────────────────────────────────────┘ │
│ ↓ │
│ [Add & Normalize] ←── Residual Connection │
│ ↓ │
│ Output │
│ │
└─────────────────────────────────────────────────┘
GPT-3: 96 of these blocks stacked
Claude/GPT-4: Likely 100+ blocks
Component Breakdown
1. Multi-Head Attention
As described above — allows tokens to gather information from other tokens.
2. Add & Normalize (Residual Connection + Layer Normalization)
The residual connection adds the input back to the output:
output = LayerNorm(input + Attention(input))
This helps with training deep networks — gradients can flow directly through the addition, preventing the vanishing gradient problem.
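A minimal sketch of the Add & Normalize step (the LayerNorm here omits the learned scale and shift parameters that real implementations include):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

out = add_and_norm([0.5, 0.2, 0.3, 0.4], [0.1, 0.3, 0.2, 0.1])
print(out)  # values have mean ≈ 0 and variance ≈ 1
```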
3. Feed-Forward Network (FFN)
A simple two-layer neural network applied to each position independently:
FFN(x) = ReLU(x × W1 + b1) × W2 + b2
Typical dimensions:
Input: 768 (or 12288 for large models)
Hidden: 3072 (4× input, i.e., 49152 for large models)
Output: 768 (same as input)
The FFN is where much of the model's "knowledge" is stored. It processes each token's representation independently, adding learned transformations.
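The FFN formula can be sketched with plain lists (the weights below are random placeholders purely for shape; in a trained model they encode learned knowledge):

```python
import random

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: ReLU(x·W1 + b1)·W2 + b2."""
    def linear(v, W, b):
        # v: [d_in], W: [d_in][d_out], b: [d_out]
        return [sum(vi * W[i][j] for i, vi in enumerate(v)) + b[j]
                for j in range(len(b))]
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU activation
    return linear(hidden, W2, b2)

# Tiny example: d_model = 2, hidden = 4 (real models use e.g. 768 → 3072)
random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
b1 = [0.0] * 4
W2 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
b2 = [0.0] * 2
print(ffn([0.5, -0.3], W1, b1, W2, b2))  # a 2-dimensional output vector
```

Note the function sees one token's vector at a time: unlike attention, the FFN never mixes information between positions.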
Part 7: Encoder vs. Decoder
The original Transformer had two parts: an Encoder and a Decoder. Modern models often use just one or the other.
Architecture Types
┌────────────────────────────────────────────────────────────────┐
│ ENCODER-DECODER (Original Transformer, T5, BART) │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ ENCODER │ ──────→ │ DECODER │ │
│ │ │ context │ │ │
│ │ Sees full │ │ Generates │ │
│ │ input at │ │ output │ │
│ │ once │ │ left-to- │ │
│ │ │ │ right │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ Use case: Translation, summarization │
│ "Translate English to French" │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ DECODER-ONLY (GPT, Claude, LLaMA) │
│ │
│ ┌─────────────┐ │
│ │ DECODER │ │
│ │ │ │
│ │ Generates │ │
│ │ one token │ │
│ │ at a time │ │
│ │ (causal) │ │
│ └─────────────┘ │
│ │
│ Use case: Text generation, chat, code completion │
│ "The quick brown fox" → predicts "jumps" │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ ENCODER-ONLY (BERT, RoBERTa) │
│ │
│ ┌─────────────┐ │
│ │ ENCODER │ │
│ │ │ │
│ │ Sees full │ │
│ │ input │ │
│ │ (bi- │ │
│ │ directional)│ │
│ └─────────────┘ │
│ │
│ Use case: Classification, NER, embeddings │
│ "This movie was [MASK]" → predicts "great" │
└────────────────────────────────────────────────────────────────┘
Causal Masking in Decoder-Only Models
In decoder-only models (like GPT and Claude), each token can only attend to previous tokens, not future ones. This is called causal masking.
Sentence: "The cat sat on"
When processing "sat", it can only see:
✓ "The"
✓ "cat"
✓ "sat" (itself)
✗ "on" (future - masked out)
Attention mask:
The cat sat on
┌─────────────────────┐
The │ 1 0 0 0 │
cat │ 1 1 0 0 │
sat │ 1 1 1 0 │
on │ 1 1 1 1 │
└─────────────────────┘
1 = can attend, 0 = masked (set to -infinity before softmax)
This masking is essential for training: the model learns to predict the next token using only previous context, which is exactly what it needs to do during generation.
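The mask above can be built and applied in a few lines; setting masked scores to negative infinity makes their softmax weight exactly zero:

```python
import math

def causal_mask(n):
    """Lower-triangular mask: row i may attend to columns 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax with masked positions forced to zero weight."""
    masked = [s if m else float("-inf") for s, m in zip(scores, mask_row)]
    exps = [math.exp(x) for x in masked]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]

# Row for "sat" (position 2): the score for "on" gets weight exactly 0
print(masked_softmax([0.5, 0.2, 0.9, 0.1], causal_mask(4)[2]))
```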
Part 8: Putting It All Together — How Generation Works
Let's trace through how a decoder-only model generates text:
Step-by-Step Generation
Prompt: "The capital of France is"
┌─────────────────────────────────────────────────────────────────┐
│ Step 1: Tokenize │
├─────────────────────────────────────────────────────────────────┤
│ "The capital of France is" │
│ ↓ │
│ [464, 3139, 286, 4881, 318] (token IDs) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 2: Embed + Add Positions │
├─────────────────────────────────────────────────────────────────┤
│ Each token ID → embedding vector │
│ Add positional encoding for positions 0, 1, 2, 3, 4 │
│ ↓ │
│ Matrix of shape [5 tokens × 768 dimensions] │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 3: Process through Transformer Blocks │
├─────────────────────────────────────────────────────────────────┤
│ Block 1: Attention → Add&Norm → FFN → Add&Norm │
│ Block 2: Attention → Add&Norm → FFN → Add&Norm │
│ ... │
│ Block 96: Attention → Add&Norm → FFN → Add&Norm │
│ ↓ │
│ Output: [5 × 768] matrix with contextualized representations │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 4: Predict Next Token │
├─────────────────────────────────────────────────────────────────┤
│ Take the LAST token's representation ([1 × 768]) │
│ ↓ │
│ Multiply by output embedding matrix (768 × 50257) │
│ ↓ │
│ Get logits for every token in vocabulary ([1 × 50257]) │
│ ↓ │
│ Apply softmax → probability distribution │
│ │
│ Top probabilities: │
│ "Paris" → 0.92 │
│ "Lyon" → 0.03 │
│ "the" → 0.01 │
│ ... │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 5: Sample and Repeat │
├─────────────────────────────────────────────────────────────────┤
│ Select "Paris" (highest probability, or sample with temperature)│
│ ↓ │
│ New sequence: "The capital of France is Paris" │
│ ↓ │
│ Repeat steps 1-5 to generate next token │
│ ↓ │
│ "The capital of France is Paris." │
│ ↓ │
│ Continue until: max length, or stop token generated │
└─────────────────────────────────────────────────────────────────┘
Temperature and Sampling
The model doesn't always pick the highest-probability token. Temperature controls randomness:
Temperature = 0 (deterministic):
Always pick highest probability
"Paris" (0.92) → always selected
Temperature = 1 (balanced):
Sample proportional to probabilities
"Paris" selected 92% of the time
Temperature > 1 (creative):
Flatten probabilities, more randomness
Lower-probability tokens get picked more often
Temperature < 1 (focused):
Sharpen probabilities, less randomness
High-probability tokens even more likely
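A sketch of temperature adjustment (it assumes strictly positive probabilities, and handles temperature 0 as a pure argmax special case):

```python
import math
import random

def apply_temperature(probs, temperature):
    """Re-weight a distribution: T < 1 sharpens it, T > 1 flattens it."""
    if temperature == 0:
        # Deterministic: put all mass on the most likely token.
        best = max(range(len(probs)), key=probs.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(probs))]
    # Equivalent to softmax(log(p) / T)
    exps = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.92, 0.05, 0.03]  # "Paris", "Lyon", "the"
print([round(p, 3) for p in apply_temperature(probs, 1.0)])  # → [0.92, 0.05, 0.03]
print([round(p, 3) for p in apply_temperature(probs, 0.5)])  # sharper: "Paris" near 1.0
print([round(p, 3) for p in apply_temperature(probs, 2.0)])  # flatter

# Sample a token index from the adjusted distribution
random.seed(0)
idx = random.choices(range(3), weights=apply_temperature(probs, 1.0))[0]
```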
Part 9: Why Transformers Work So Well
1. Parallelization
RNNs process tokens one at a time. Transformers process all tokens simultaneously, making training on GPUs/TPUs extremely efficient.
RNN Training:
Token 1 → Token 2 → Token 3 → Token 4 (sequential, slow)
Transformer Training:
Token 1 ┐
Token 2 ├→ All processed at once (parallel, fast)
Token 3 │
Token 4 ┘
2. Long-Range Dependencies
In RNNs, information from early tokens gets "diluted" as it passes through many steps. Transformers let any token directly attend to any other token, regardless of distance.
Sentence: "The cat, which had been sleeping peacefully in the sun all
afternoon, suddenly jumped up."
RNN: Information about "cat" must pass through 15 steps to reach "jumped"
Each step loses some information.
Transformer: "jumped" directly attends to "cat" in one step.
No information loss from distance.
3. Scalability
Transformers scale remarkably well. Making them bigger consistently improves performance:
Model Parameters Layers Attention Heads
─────────────────────────────────────────────────────────
GPT-2 Small 117M 12 12
GPT-2 Large 774M 36 20
GPT-3 175B 96 96
GPT-4 ~1.7T* ~120* ~?*
* Estimated, not officially confirmed
4. Transfer Learning
A Transformer trained on a large corpus learns general language understanding. This can be fine-tuned for specific tasks with much less data than training from scratch.
Part 10: Modern Improvements
Since 2017, researchers have made many improvements to the original architecture:
Key Innovations
- RoPE (Rotary Position Embedding) — Better handling of relative positions; used in LLaMA, Mistral
- GQA (Grouped Query Attention) — More efficient attention by sharing keys/values across heads
- Flash Attention — Memory-efficient attention computation, enables longer contexts
- SwiGLU — Improved activation function for the feed-forward network
- RMSNorm — Faster normalization than LayerNorm
- KV Cache — Cache key/value computations during generation for speed
Context Length Evolution
Model Max Context Length
───────────────────────────────────
GPT-2 (2019) 1,024 tokens
GPT-3 (2020) 2,048 tokens
GPT-4 (2023) 8,192 / 32,768 tokens
Claude 2 (2023) 100,000 tokens
Claude 3 (2024) 200,000 tokens
GPT-4 Turbo 128,000 tokens
Gemini 1.5 1,000,000+ tokens
Summary: The Transformer at a Glance
┌─────────────────────────────────────────────────────────────────┐
│ TRANSFORMER │
├─────────────────────────────────────────────────────────────────┤
│ │
│ "Hello world" (text) │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Tokenizer │ Split into subwords │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ [15339, 995] (token IDs) │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Embedding │ Look up vectors │
│ │ + Position │ Add position info │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Transformer │ ┌─────────────────────┐ │
│ │ Block ×N │ │ Multi-Head Attention│──→ Which tokens │
│ │ │ │ Feed-Forward Network│──→ matter to which? │
│ │ │ │ Add & Normalize │──→ Transform meaning │
│ └─────────────┘ └─────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Output Head │ Project to vocabulary │
│ │ + Softmax │ Get probabilities │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ "next_token" probability distribution │
│ │
└─────────────────────────────────────────────────────────────────┘
Key Takeaways
- Tokens are chunks of text; subword tokenization balances vocabulary size and coverage
- Embeddings turn tokens into meaningful vectors that capture semantic relationships
- Positional encodings tell the model where each token is in the sequence
- Attention lets tokens gather relevant information from all other tokens
- Multi-head attention captures different types of relationships simultaneously
- Transformer blocks stack attention and feed-forward layers with residual connections
- Decoder-only models (GPT, Claude) use causal masking to generate text left-to-right
Further Reading
- Attention Is All You Need — The original 2017 paper by Vaswani et al.
- The Illustrated Transformer — Jay Alammar's visual explanation
- Language Models are Few-Shot Learners — GPT-3 paper
- BERT: Pre-training of Deep Bidirectional Transformers — The encoder-only approach
- The Annotated Transformer — Harvard NLP's code walkthrough