3Blue1Brown
June 7, 2026
TL;DR
Compression and prediction are mathematically equivalent; Shannon entropy defines the fundamental limits of how efficiently data can be compressed by measuring average information content per symbol.
“Prediction and compression are mathematically equivalent. They turn out to be two sides of the same coin.”
— Presenter
“Great definitions are often the residue of some kind of insight.”
— Presenter
“Nobody knows what entropy really is, so in an argument, you'll always have the advantage.”
— John von Neumann (advice reportedly given to Shannon about naming entropy)
“When his interviewees had at least 100 preceding letters of context, he estimated the entropy of English to be about one bit per character.”
— Presenter (describing Shannon's findings)
1. Introduction: The Compression Question
Explores why compression matters and establishes the core question: what is the fundamental limit on how efficiently text can be compressed? Introduces Claude Shannon's information theory and its surprising relevance to modern machine learning, particularly the connection between compression and intelligence.
2. The Robot Instructions Warm-Up Example
Uses a simple four-direction robot instruction set with non-uniform probabilities (50% up, 25% down, 12.5% each for left and right) to compare three encodings: a naive fixed two-bit encoding, a cleverer variable-length encoding, and the theoretically optimal encoding.
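As a rough sketch of that comparison, the snippet below computes the average bits per instruction for a fixed two-bit code and for one possible variable-length code, then compares both to the entropy of the distribution. The specific code words (`0`, `10`, `110`, `111`) are an assumption chosen for illustration; this summary does not list the ones used in the video.

```python
from math import log2

# Probabilities from the example: up 50%, down 25%, left and right 12.5% each.
probs = {"up": 0.5, "down": 0.25, "left": 0.125, "right": 0.125}

# Naive fixed-length code: two bits per direction.
fixed_code = {"up": "00", "down": "01", "left": "10", "right": "11"}

# Assumed variable-length code: shorter code words for likelier directions.
variable_code = {"up": "0", "down": "10", "left": "110", "right": "111"}


def average_length(code, probs):
    """Expected number of bits per symbol under the given probabilities."""
    return sum(probs[s] * len(code[s]) for s in probs)


def entropy(probs):
    """Shannon entropy in bits: sum of p * log2(1/p)."""
    return sum(p * log2(1 / p) for p in probs.values() if p > 0)


fixed_bits = average_length(fixed_code, probs)        # 2.0
variable_bits = average_length(variable_code, probs)  # 1.75
limit_bits = entropy(probs)                           # 1.75
```

Because every probability here is a power of 1/2, the assumed variable-length code lands exactly on the entropy; with other probabilities a per-symbol code would generally fall a little short of it.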
3. Prefix-Free Codes and Binary Trees
Explains how prefix-free codes work: no code word is a prefix of another, so decoding is never ambiguous. Visualizes this with a binary tree diagram in which allocating a code word claims a share of the space of all possible bitstrings, with shorter code words claiming proportionally larger shares.
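A minimal sketch of these ideas, assuming the same illustrative code words as a four-direction robot code (`0`, `10`, `110`, `111`, which are not taken from the video): the helper names `is_prefix_free`, `kraft_sum`, and `decode` are hypothetical. `kraft_sum` treats a code word of length n as claiming 1/2^n of the tree, so a total of 1 means the tree is fully used.

```python
def is_prefix_free(codewords):
    """True if no code word is a prefix of a different code word."""
    words = list(codewords)
    return not any(
        a != b and b.startswith(a) for a in words for b in words
    )


def kraft_sum(codewords):
    """Fraction of the binary tree used: a word of length n claims 1/2**n."""
    return sum(2 ** -len(w) for w in codewords)


def decode(bits, code):
    """Decode a bitstring left to right; unambiguous for a prefix-free code."""
    inverse = {word: symbol for symbol, word in code.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bitstring ends in the middle of a code word")
    return symbols


code = {"up": "0", "down": "10", "left": "110", "right": "111"}
message = decode("010110111", code)  # ['up', 'down', 'left', 'right']
```

The decoder can emit a symbol as soon as the bits read so far match a code word, precisely because no longer code word can begin with that same match.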
4. Random Noise and Perfect Compression
Establishes that perfect compression must produce output indistinguishable from random noise. Uses the insight that equally likely messages compressed to n bits must each have probability 1/2^n to derive the fundamental information formula: negative log base 2 of probability.
5. Shannon Information: From Formula to Intuition
Derives the expression -log₂(p) as the information content of an event, explaining why unlikely events contain more information. Demonstrates that in perfect compression, bits allocated equals information content, and shows how information values add when multiplying probabilities across symbols.
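The formula is short enough to check numerically. The sketch below uses a hypothetical helper named `information` and a few hand-picked probabilities to illustrate three points from this part: rarer events carry more bits, information adds when independent probabilities multiply, and one of 2^n equally likely messages costs n bits.

```python
from math import log2


def information(p):
    """Shannon information of an event with probability p, in bits."""
    return -log2(p)


# Unlikely events carry more information than likely ones.
likely = information(0.5)      # 1 bit
unlikely = information(0.125)  # 3 bits

# Independent events: probabilities multiply, information adds.
joint = information(0.5 * 0.125)  # 4 bits = 1 + 3

# One of 2**n equally likely messages has probability 1/2**n, so n bits.
n = 10
bits_for_message = information(1 / 2 ** n)  # 10 bits
```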
6. Natural Language and Fractional Bits
Extends information theory to realistic scenarios like English language where probabilities are context-dependent and rarely perfect powers of 2. Shows how GPT models produce fractional information values per letter and how these sum across messages for overall compression limits.
7. Shannon's Experimental Methodology
Describes Shannon's empirical approaches: analyzing n-grams from text, conducting character-guessing experiments with his wife Betty, and later surveying multiple people to estimate entropy of English. Emphasizes probing intelligent language models rather than pure statistical data analysis.
8. Entropy: Measuring Average Information
Formalizes entropy as the weighted average of information across a probability distribution: Σ p·log₂(1/p). Visualizes entropy as total area of rectangles (probability × information height). Shows entropy is minimized by skewed distributions and maximized by uniform ones.
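To make the uniform-versus-skewed contrast concrete, here is a small sketch of the entropy formula applied to four distributions over four outcomes. Only the second distribution comes from the robot example; the others are made up for illustration. Each term in the sum corresponds to one rectangle in the visualization: width p, height log₂(1/p).

```python
from math import log2


def entropy(probs):
    """Weighted average information: sum of p * log2(1/p), in bits."""
    # Outcomes with p == 0 contribute nothing and are skipped.
    return sum(p * log2(1 / p) for p in probs if p > 0)


uniform = entropy([0.25, 0.25, 0.25, 0.25])  # 2.0, the maximum for 4 outcomes
robot = entropy([0.5, 0.25, 0.125, 0.125])   # 1.75
skewed = entropy([0.97, 0.01, 0.01, 0.01])   # well under 1 bit
certain = entropy([1.0, 0.0, 0.0, 0.0])      # 0.0, no uncertainty at all
```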
9. The Noiseless Coding Theorem
Presents Shannon's 1948 theorem, which states that no lossless encoding can use fewer bits per symbol on average than the entropy, and that encodings arbitrarily close to this limit always exist. Establishes entropy as the theoretical minimum average number of bits per symbol for messages following a given probability distribution.
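One way to see the bound in action is to build a concrete code and compare its average length to the entropy. The sketch below uses Huffman coding, which this summary does not mention and is swapped in here purely as a well-known way to construct an optimal prefix-free code; the distribution is hypothetical. A per-symbol Huffman code is only guaranteed to land within one bit of the entropy; getting arbitrarily close generally requires encoding longer blocks of symbols at once.

```python
import heapq
from math import log2


def huffman_lengths(probs):
    """Return the code-word length Huffman coding assigns to each symbol."""
    # Heap entries: (probability, unique tie-breaker, symbols in this subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for s in group1 + group2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, group1 + group2))
        counter += 1
    return lengths


def entropy(probs):
    """Shannon entropy in bits."""
    return sum(p * log2(1 / p) for p in probs.values() if p > 0)


def average_length(lengths, probs):
    """Expected code-word length in bits per symbol."""
    return sum(probs[s] * lengths[s] for s in probs)


# Hypothetical distribution whose probabilities are not powers of 1/2.
probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
lengths = huffman_lengths(probs)
avg = average_length(lengths, probs)
h = entropy(probs)
```

For this distribution the average length comes out slightly above the entropy, never below it, which is the direction the theorem requires.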
10. Entropy Rate and Language Compression
Extends entropy to stochastic processes where symbol probabilities vary by context (entropy rate). Explains why exact calculation is impossible for natural language and notes Shannon's estimate of ~1 bit per character for English with 100+ character context, hinting at radical compression potential.
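The effect of context can be sketched with a simple n-gram count, in the spirit of the statistical analysis mentioned earlier. The function name `conditional_entropy` and the toy input are assumptions for illustration; this is a plug-in estimate from raw counts, not Shannon's guessing experiment, and on real text with long contexts such counts become too sparse to trust.

```python
from collections import Counter, defaultdict
from math import log2


def conditional_entropy(text, k):
    """Empirical entropy (bits) of the next character given the previous k."""
    next_counts = defaultdict(Counter)
    for i in range(k, len(text)):
        next_counts[text[i - k:i]][text[i]] += 1
    total = len(text) - k
    bits = 0.0
    for counts in next_counts.values():
        context_total = sum(counts.values())
        for c in counts.values():
            # P(context, next) * log2(1 / P(next | context))
            bits += (c / total) * log2(context_total / c)
    return bits


# Toy input: a strictly alternating string, not real English.
sample = "ab" * 50
no_context = conditional_entropy(sample, 0)  # 1.0: 'a' and 'b' equally common
one_char = conditional_entropy(sample, 1)    # 0.0: previous letter decides the next
```

With no context each character looks like a coin flip, but a single character of context removes all uncertainty. English is far less extreme, yet the same mechanism is why estimates drop as more preceding context is allowed.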