Notes on Learning about Large Language Models
A running list of concepts and takeaways while learning how LLMs work, from tokenization to transformers
The world runs on Large Language Models now. ChatGPT revolutionized how we work, Claude helps us think through complex problems, and GPT-4 can write code that actually compiles. But for most of us, these systems remain mysterious black boxes. We know how to use them, but we don’t really understand how they work.
Over the past few months, I’ve been diving deep into the mechanics of LLMs, working through research papers, implementing my own toy models, and slowly building intuition for what’s happening under the hood. This post is my running notebook of key concepts, surprising insights, and “aha” moments from that journey.
Think of this as a practical field guide to LLM internals; not an academic survey, but the kind of explanations I wish I’d had when I started.
Tokenization: The Foundation Most People Skip
Before any “intelligence” happens, LLMs need to convert text into numbers. This process, tokenization, is far more nuanced than it appears.
The naive approach: Split text by spaces and assign each word a number. This breaks down immediately with punctuation, capitalization, and the infinite variety of human language.
The actual approach: Modern LLMs use subword tokenization, typically Byte Pair Encoding (BPE). The algorithm starts with individual characters and iteratively merges the most frequent pairs. The word “hello” might become tokens for “he,” “ll,” and “o.”
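To make the merge step concrete, here’s a toy sketch of the BPE training loop in Python; it’s nothing like a production tokenizer, just the core “count adjacent pairs, merge the most frequent one” idea on a tiny made-up corpus.

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols (starting from characters), with a count.
corpus = {
    ("h", "e", "l", "l", "o"): 5,
    ("h", "e", "l", "p"): 3,
    ("y", "e", "l", "l", "o", "w"): 2,
}

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

for step in range(3):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair}")
# First merge here is ('e', 'l'); real BPE runs tens of thousands of merges.
```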
Why this matters: Tokenization directly impacts what an LLM can understand. If “ChatGPT” gets split into weird tokens because it wasn’t in the training vocabulary, the model will struggle with it. This is why newer models often handle code better; they’re trained on tokenizers that treat programming syntax more intelligently.
Key insight: The tokenizer is part of the model. You can’t just swap in a different one without retraining everything.
Embeddings: Numbers That Capture Meaning
Once text becomes tokens, each token gets converted into a high-dimensional vector, typically 768, 1024, or even larger dimensions. These embeddings are learned during training and somehow capture semantic meaning.
The magic: Similar concepts end up close together in this high-dimensional space. “King” minus “man” plus “woman” approximately equals “queen.” This isn’t just a parlor trick; it demonstrates that the geometry of the embedding space reflects conceptual relationships.
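You can see the idea with a deliberately tiny example. The vectors below are invented 2-D toys chosen to make the directions visible; real embeddings live in hundreds of dimensions and only show this relationship approximately.

```python
import numpy as np

# Invented 2-D "embeddings" (illustration only, not from any real model).
vecs = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
    "apple": np.array([0.5, 0.5]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vecs["king"] - vecs["man"] + vecs["woman"]

# Exclude the words used in the arithmetic, as the classic word2vec demo does.
candidates = {w: v for w, v in vecs.items() if w not in {"king", "man", "woman"}}
print(max(candidates, key=lambda w: cosine(target, candidates[w])))  # -> queen
```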
The reality: Modern LLM embeddings are contextual. The embedding for “bank” is different when talking about rivers versus finance. This happens because embeddings get updated as they flow through the transformer layers.
Practical implication: Vector databases for retrieval-augmented generation (RAG) work because you can find semantically similar text by finding nearby embeddings, even if the exact words differ.
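The retrieval step itself is just nearest-neighbor search over embeddings. A minimal sketch, assuming some `embed()` function backed by whatever embedding model you happen to use:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: in practice this calls a sentence-embedding model of your choice."""
    raise NotImplementedError

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Rank stored text chunks by cosine similarity to the query embedding."""
    q = embed(query)
    scored = []
    for chunk in chunks:
        c = embed(chunk)
        sim = q @ c / (np.linalg.norm(q) * np.linalg.norm(c))
        scored.append((sim, chunk))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

Real vector databases add indexing structures so you don’t have to scan every chunk, but the ranking logic is exactly this.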
Attention: The Core Innovation
The transformer architecture revolutionized AI through the attention mechanism. Instead of processing text sequentially like RNNs, attention lets the model look at all positions simultaneously and decide what to focus on.
Self-attention in practice: For each word, the model creates three vectors: query, key, and value. It then computes how much each word should pay attention to every other word, including itself. This creates a matrix of attention weights.
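Here’s what that looks like as a minimal single-head sketch in NumPy, with random toy weights standing in for the learned projections; real implementations add causal masking, multiple heads, and a lot of engineering on top.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) attention logits
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))               # toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # (5, 8)
```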
Multi-head attention: The model runs multiple attention mechanisms in parallel (typically 8, 12, or 16 heads), each potentially learning different types of relationships: syntax, semantics, coreference, and more.
Why it works: Attention gives the model a flexible way to route information. Want to resolve a pronoun? Pay attention to previous nouns. Need to understand a complex sentence? Attend to the subject and verb across long distances.
The computational cost: Attention is quadratic in sequence length. This is why context windows were historically limited; a 1M token context requires massive computational resources.
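A quick back-of-the-envelope makes the problem obvious. This naively assumes the full score matrix is materialized in fp16 for a single head of a single layer, which modern kernels (FlashAttention and friends) avoid, but the quadratic compute remains either way.

```python
# Naive cost of the attention score matrix: seq_len x seq_len entries.
for seq_len in (4_096, 128_000, 1_000_000):
    entries = seq_len ** 2
    gib = entries * 2 / 2**30   # fp16 (2 bytes), one head, one layer
    print(f"{seq_len:>9} tokens -> {entries:.1e} scores, ~{gib:,.2f} GiB if fully materialized")
```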
Transformer Layers: Repeated Pattern Recognition
A transformer consists of many identical layers stacked on top of each other. Each layer has two main components:
Multi-head attention: Routes information between positions
Feed-forward network: Processes each position independently
The pattern: Attention mixes information across positions, then the feed-forward network transforms that information. Add some normalization and residual connections, and repeat 12, 24 or even 96 times.
What each layer learns: Early layers tend to focus on syntax and local patterns. Middle layers capture more abstract semantic relationships. Later layers often specialize in the specific task at hand.
Residual connections matter: Each layer adds to the previous representation rather than replacing it. This means information can flow directly from input to output, making very deep networks trainable.
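Putting the pieces together, here’s a minimal pre-norm transformer block in PyTorch. It’s a sketch of the pattern above, not any particular model’s architecture; real models differ in norm placement, activations, and attention details, and a decoder-style LLM would also apply a causal mask.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm transformer layer: attention mixes positions,
    the MLP transforms each position, residuals carry everything forward."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # self-attention (causal masking omitted for brevity)
        x = x + attn_out                   # residual: add, don't replace
        x = x + self.mlp(self.norm2(x))    # position-wise transform + residual
        return x

x = torch.randn(1, 10, 512)                           # (batch, seq_len, d_model)
model = nn.Sequential(*[Block() for _ in range(4)])   # a stack of identical layers
print(model(x).shape)                                 # torch.Size([1, 10, 512])
```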
Training: The Three-Stage Journey
Modern LLMs go through multiple training phases, each serving a different purpose:
Stage 1: Pretraining
The model learns to predict the next token on massive amounts of Internet text. This is where most of the “knowledge” gets absorbed: facts, patterns, relationships, and even some reasoning capabilities emerge from this simple objective.
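The objective itself is almost embarrassingly simple: shift the tokens by one and score the model’s predictions with cross-entropy. A sketch, with random logits standing in for a real model’s output:

```python
import torch
import torch.nn.functional as F

# Pretend logits from some language model: (batch, seq_len, vocab_size).
batch, seq_len, vocab = 2, 8, 1000
logits = torch.randn(batch, seq_len, vocab)
tokens = torch.randint(0, vocab, (batch, seq_len))

# Next-token prediction: position t's logits are scored against token t+1.
pred = logits[:, :-1, :].reshape(-1, vocab)
target = tokens[:, 1:].reshape(-1)
loss = F.cross_entropy(pred, target)
print(loss.item())  # random logits give roughly log(vocab) ≈ 6.9
```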
Scale matters: GPT-3 was trained on roughly 300B tokens. GPT-4 likely saw trillions. The relationship between compute, data, and capability follows surprisingly predictable scaling laws.
Stage 2: Supervised Fine-tuning
The model learns to follow instructions by training on human-written examples of questions and high-quality answers. This transforms a “text completion engine” into a “helpful assistant.”
Stage 3: Reinforcement Learning from Human Feedback (RLHF)
Humans rank different model outputs, and the model learns to optimize for human preferences. This is where models learn to be helpful, harmless, and honest, or at least appear to be.
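The reward-model half of this is typically trained with a pairwise preference loss (Bradley–Terry style): push the score of the preferred response above the rejected one. A minimal sketch with made-up reward scores; the full RLHF loop (PPO and friends) is considerably more involved.

```python
import torch
import torch.nn.functional as F

# Scalar scores from a reward model for the same prompt:
# one for the response humans preferred, one for the rejected response.
r_chosen = torch.tensor([1.3, 0.2, 2.1])
r_rejected = torch.tensor([0.4, 0.9, 1.5])

# Pairwise preference loss: maximize the margin between chosen and rejected.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(loss.item())
```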
The alignment problem: Each stage can conflict with the others. Making a model more “aligned” might make it less creative or knowledgeable. These trade-offs are still being figured out.
Emergence: When More Becomes Different
One of the most fascinating aspects of LLMs is emergence: capabilities that appear suddenly at certain scales rather than developing gradually.
Examples of emergent abilities:
Few-shot learning (learning from examples in the prompt)
Chain-of-thought reasoning
Code generation
Translation between languages never explicitly paired in training
Why emergence happens: Current theories suggest that many capabilities require the model to learn complex compositions of simpler patterns. These compositions only become possible once the model has sufficient capacity and has seen enough data.
The implication: We can’t easily predict what capabilities will emerge as models get larger. This makes AI development both exciting and unpredictable.
In-Context Learning: The Unexpected Superpower
Perhaps the most surprising capability of LLMs is their ability to learn new tasks from just a few examples provided in the prompt; no additional training required.
How it works: The model recognizes patterns in the examples and continues the pattern with new inputs. It’s performing a kind of meta-learning, using its training to figure out what task is being demonstrated.
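A toy example, with a hypothetical `complete()` standing in for whichever model API you’re using; notice the task is never stated, only demonstrated.

```python
# The task ("classify sentiment") is never stated, only demonstrated.
prompt = """\
Review: "The battery dies in an hour." -> negative
Review: "Gorgeous screen and great speakers." -> positive
Review: "Setup took forever and support never replied." ->"""

# `complete` is a placeholder for your model or API of choice;
# a capable model will continue the pattern with " negative".
# print(complete(prompt))
```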
The mystery: We still don’t fully understand the mechanisms behind in-context learning. Some research suggests the model is implementing simple learning algorithms (like gradient descent) within its forward pass.
Practical power: This is why prompt engineering works. By providing the right examples and context, you can steer the model toward almost any task without expensive retraining.
Scaling Laws: The Math of More
LLM development follows surprisingly predictable patterns described by scaling laws:
Compute scaling: Model performance improves predictably with more training compute, following a power law relationship.
Data scaling: More training data generally means better performance, but with diminishing returns.
Parameter scaling: Larger models (more parameters) generally perform better, but the relationship isn’t linear.
The efficiency frontier: For any given compute budget, there’s an optimal trade-off between model size and training time. This led to models like Chinchilla, which prioritized more training data over larger parameter counts.
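The Chinchilla paper fit a loss curve of the form L(N, D) = E + A/N^α + B/D^β, where N is parameter count and D is training tokens. Here it is with roughly the published constants (treat the numbers as approximate); it’s enough to see why a smaller model trained on more tokens can win at the same compute budget.

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Fitted pretraining loss from Hoffmann et al. (2022); constants approximate.
    N = parameter count, D = training tokens."""
    return E + A / N**alpha + B / D**beta

# Roughly the same compute budget (~6 * N * D FLOPs), split two different ways:
print(chinchilla_loss(280e9, 300e9))   # Gopher-style: big model, fewer tokens      -> ~1.99
print(chinchilla_loss(70e9, 1.4e12))   # Chinchilla-style: smaller model, more data -> ~1.94
```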
The Limits We’re Hitting
Despite their impressive capabilities, current LLMs face fundamental limitations:
Hallucination: Models confidently generate false information, especially when asked about topics outside their training data or requiring real-time information.
Context limitations: Even with extended context windows, models struggle to effectively use very long contexts. Information in the middle often gets “lost.”
Reasoning gaps: While LLMs can perform impressive reasoning, they often fail on problems that require systematic thinking or precise logical steps.
Training data cutoffs: Models are frozen in time, unable to learn new information without expensive retraining.
What’s Next: The Frontier
The field is evolving rapidly across multiple dimensions:
Architectural innovations: Mixture of Experts (MoE) models activate only relevant parameters for each input. State Space Models offer alternatives to attention that scale better with sequence length.
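Here’s a minimal sketch of the top-k routing idea behind MoE layers: a small gate scores the experts for each token, and only the top-scoring experts actually run. This ignores the hard parts (load balancing, capacity limits, parallelism), but it shows how parameter count can grow while per-token compute stays roughly constant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Top-2 routing over small expert MLPs; no load balancing or capacity limits."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.gate(x)                   # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e           # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(TinyMoE()(tokens).shape)  # torch.Size([16, 64])
```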
Training improvements: Techniques like Constitutional AI and other alignment methods aim to make models more reliable and truthful.
Multimodal integration: Models are learning to work with images, audio, and video, not just text.
Agent capabilities: Connecting LLMs to tools and APIs, and giving them the ability to take actions in the world.
Practical Takeaways
After diving deep into LLM internals, here’s what actually matters for practitioners:
Prompt engineering is real engineering: Understanding how attention and context work helps you write better prompts.
Model choice matters: Bigger isn’t always better. Choose models based on your specific use case, latency requirements, and cost constraints.
Context is king: The way you structure information in the prompt directly impacts model performance. Put important information early and late in the context.
Embrace iteration: LLMs are probabilistic systems. What works well today might need adjustment tomorrow as models evolve.
Understand the limitations: Know when you’re pushing against fundamental model constraints versus just needing better prompting.
The Bigger Picture
Learning about LLMs has been humbling. These systems are simultaneously more impressive and more limited than they appear from the outside. They’re not magic, but they’re also not simple pattern matching.
The more I understand about how they work, the more I appreciate both their current capabilities and their future potential. We’re still in the early days of this technology, but the foundations are becoming clearer.
If you’re building with AI or just trying to understand our rapidly changing world, I’d encourage you to look under the hood. The internals are more accessible than they seem, and understanding them changes how you think about what’s possible.
What aspect of LLM internals do you find most interesting or confusing? I’m always looking for new angles to explore in this space.