Information Theory Fundamentals: Compression and Entropy

Reinventing Entropy | Compression is Intelligence Part 1

TL;DR

Compression and prediction are mathematically equivalent; Shannon entropy defines the fundamental limits of how efficiently data can be compressed by measuring average information content per symbol.

Key Takeaways

1. Perfect compression produces output indistinguishable from random noise: every bit carries new information and no predictable pattern remains
2. Shannon information content (the negative log base 2 of an event's probability) quantifies surprise and sets the theoretical minimum number of bits needed to encode a symbol; Shannon entropy is the average of that quantity over the distribution
3. Compression efficiency is directly tied to understanding probability distributions; the more evenly distributed the probabilities are, the higher the entropy
4. Information theory bridges compression and prediction: more predictable data compresses better, forming the mathematical foundation for modern language model training
5. English with 100+ characters of preceding context has an estimated entropy of about 1 bit per character, which suggests text can be compressed far more than seems intuitively possible
6. Prefix-free codes (where no code word is a prefix of another) are necessary for unambiguous decoding without delimiters between symbols

Notable Quotes

“Prediction and compression are mathematically equivalent. They turn out to be two sides of the same coin.”
— Presenter

“Great definitions are often the residue of some kind of insight.”
— Presenter

“Nobody knows what entropy really is, so in an argument, you'll always have the advantage.”
— John von Neumann (reportedly said to Shannon when suggesting the name "entropy")

“When his interviewees had at least 100 preceding letters of context, he estimated the entropy of English to be about one bit per character.”
— Presenter (describing Shannon's findings)

Chapters

1. Introduction: The Compression Question

Explores why compression matters and establishes the core question: what is the fundamental limit on how efficiently text can be compressed? Introduces Claude Shannon's information theory and its surprising relevance to modern machine learning, particularly the connection between compression and intelligence.

2. The Robot Instructions Warm-Up Example

Uses a simple four-direction robot instruction set with non-uniform probabilities (50% up, 25% down, 12.5% each for left and right) to demonstrate three encoding approaches: naive two-bit fixed encoding, clever variable-length encoding, and theoretically optimal encoding.
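The comparison in this chapter can be sketched numerically. The probabilities below come from the robot example; the specific code words are an assumption chosen for illustration, since the summary does not list the ones used in the video.

```python
import math

# Probabilities from the robot example.
probs = {"up": 0.5, "down": 0.25, "left": 0.125, "right": 0.125}

# Assumed code words (illustrative only; not taken from the source).
fixed_code = {"up": "00", "down": "01", "left": "10", "right": "11"}
variable_code = {"up": "0", "down": "10", "left": "110", "right": "111"}

def expected_length(code, probs):
    """Average number of bits per instruction under the given code."""
    return sum(probs[s] * len(code[s]) for s in probs)

def entropy(probs):
    """Shannon entropy in bits: sum of p * log2(1/p)."""
    return sum(p * math.log2(1 / p) for p in probs.values() if p > 0)

print(expected_length(fixed_code, probs))     # 2.0
print(expected_length(variable_code, probs))  # 1.75
print(entropy(probs))                         # 1.75
```

Because every probability here is a power of 1/2, the variable-length code can match the entropy exactly; that would not generally hold for other distributions.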

3. Prefix-Free Codes and Binary Trees

Explains how prefix-free codes work by ensuring no code word is a prefix of another, preventing ambiguity during decoding. Visualizes this through a binary tree diagram where allocating a code word of length n consumes a 1/2^n share of the space of all possible bitstrings.
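A minimal sketch of these ideas, assuming a hypothetical four-symbol code (the code words and function names are illustrative, not taken from the video). It checks the prefix-free property, sums the share of bitstring space each code word consumes, and decodes a bitstream without delimiters.

```python
def is_prefix_free(codewords):
    """True if no code word is a prefix of another.

    After sorting, any word that is a prefix of another word is also
    a prefix of its immediate successor, so adjacent checks suffice.
    """
    words = sorted(codewords)
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

def space_used(codewords):
    """Share of the binary tree consumed: each word of length n takes 1/2^n."""
    return sum(2.0 ** -len(w) for w in codewords)

def decode(bits, code):
    """Decode a bitstring by emitting a symbol as soon as a code word matches."""
    inverse = {word: symbol for symbol, word in code.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code word")
    return symbols

# Assumed example code (illustrative only).
code = {"up": "0", "down": "10", "left": "110", "right": "111"}
print(is_prefix_free(code.values()))  # True
print(space_used(code.values()))      # 1.0
print(decode("0101100111", code))     # ['up', 'down', 'left', 'up', 'right']
```

A total of 1.0 means the tree is fully allocated: no further code word could be added without breaking the prefix-free property.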

4. Random Noise and Perfect Compression

Establishes that perfect compression must produce output indistinguishable from random noise. Uses the insight that equally likely messages compressed to n bits must each have probability 1/2^n to derive the fundamental information formula: negative log base 2 of probability.
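The step from "n-bit outputs look like random noise" to the information formula can be written out as follows, where m is a message, p(m) its probability, and n the number of bits it compresses to (the symbol names are chosen here for illustration):

```latex
p(m) = \frac{1}{2^{n}}
\quad\Longrightarrow\quad
n = \log_2 \frac{1}{p(m)} = -\log_2 p(m)
```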

5. Shannon Information: From Formula to Intuition

Derives the expression -log₂(p) as the information content of an event, explaining why unlikely events contain more information. Demonstrates that in perfect compression, bits allocated equals information content, and shows how information values add when multiplying probabilities across symbols.
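As a small illustration of the two properties described here (the function name is an assumption for this sketch): rarer events carry more bits, and for independent events the probabilities multiply while the information values add.

```python
import math

def information(p):
    """Shannon information content, in bits, of an event with probability p."""
    return -math.log2(p)

print(information(0.5))    # 1.0
print(information(0.125))  # 3.0

# Two independent events: probabilities multiply, information adds.
joint = information(0.5 * 0.125)
separate = information(0.5) + information(0.125)
print(joint, separate)     # 4.0 4.0
```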

6. Natural Language and Fractional Bits

Extends information theory to realistic scenarios like English language where probabilities are context-dependent and rarely perfect powers of 2. Shows how GPT models produce fractional information values per letter and how these sum across messages for overall compression limits.
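To make the idea of fractional bits concrete, here is a sketch using made-up per-character probabilities. The numbers are hypothetical stand-ins for what a predictive model might assign given the preceding context; they do not come from any real model.

```python
import math

# Hypothetical (character, probability) pairs; the probabilities are
# invented for illustration and not produced by an actual model.
predicted = [("h", 0.05), ("e", 0.30), ("l", 0.20), ("l", 0.60), ("o", 0.70)]

# Information per character is generally not a whole number of bits.
bits_per_char = [-math.log2(p) for _, p in predicted]
total_bits = sum(bits_per_char)

# The same total follows from the probability of the whole message.
message_probability = math.prod(p for _, p in predicted)
total_from_product = -math.log2(message_probability)

for (char, _), bits in zip(predicted, bits_per_char):
    print(f"{char}: {bits:.3f} bits")
print(f"total: {total_bits:.3f} bits")
```

The total is the number of bits an ideal compressor driven by these predictions would need for the message; better predictions (higher probabilities on the actual characters) give a smaller total.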

7. Shannon's Experimental Methodology

Describes Shannon's empirical approaches: analyzing n-grams from text, conducting character-guessing experiments with his wife Betty, and later surveying multiple people to estimate entropy of English. Emphasizes probing intelligent language models rather than pure statistical data analysis.
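The n-gram approach can be approximated with simple counting. The sketch below is an assumption about how such an estimate might be computed, not a reconstruction of Shannon's procedure: it measures the average bits per character given the previous `order` characters, using frequencies from the text itself.

```python
import math
from collections import Counter

def conditional_entropy(text, order):
    """Estimate bits per character given the previous `order` characters."""
    context_counts = Counter()
    ngram_counts = Counter()
    for i in range(order, len(text)):
        context = text[i - order:i]
        context_counts[context] += 1
        ngram_counts[(context, text[i])] += 1
    total = sum(ngram_counts.values())
    h = 0.0
    for (context, _), count in ngram_counts.items():
        p_joint = count / total                  # P(context, next char)
        p_cond = count / context_counts[context] # P(next char | context)
        h -= p_joint * math.log2(p_cond)
    return h

sample = "abababab"
print(conditional_entropy(sample, 0))  # 1.0 (no context: 'a' and 'b' equally likely)
print(conditional_entropy(sample, 1))  # 0.0 (previous character determines the next)
```

On real text, estimates like this become unreliable as `order` grows, because most long contexts appear too rarely to give trustworthy frequencies. That limitation is one reason guessing experiments with human readers were useful.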

8. Entropy: Measuring Average Information

Formalizes entropy as the weighted average of information across a probability distribution: Σ p·log₂(1/p). Visualizes entropy as the total area of rectangles (width = probability, height = information). Shows entropy falls as distributions become more skewed, reaching zero when one outcome is certain, and is maximized by uniform distributions.
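The formula and its behavior on uniform versus skewed distributions can be checked directly. The example distributions are chosen here for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: sum of p * log2(1/p), skipping zero terms."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.97, 0.01, 0.01, 0.01]
certain = [1.0, 0.0, 0.0, 0.0]

print(entropy(uniform))  # 2.0
print(entropy(skewed))   # well below 2 bits
print(entropy(certain))  # 0.0
```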

9. The Noiseless Coding Theorem

Presents Shannon's 1948 theorem: no lossless encoding can average fewer bits per symbol than the entropy, and encodings that come arbitrarily close to that limit always exist. Establishes entropy as the theoretical minimum average bits per symbol for any message following a given probability distribution.
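One way to see the theorem in action is Huffman coding, used here as an illustration; the summary does not say which construction the video uses. A Huffman code over single symbols has an average length L satisfying H ≤ L < H + 1, where H is the entropy. Coding longer blocks of symbols narrows the gap further.

```python
import heapq
import math

def huffman_lengths(probs):
    """Return the code word length assigned to each symbol by Huffman coding."""
    # Heap entries: (probability, unique tie-breaker, symbol indices in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

probs = [0.4, 0.3, 0.2, 0.1]  # example distribution chosen for illustration
lengths = huffman_lengths(probs)
average = sum(p * n for p, n in zip(probs, lengths))

print(lengths)  # [1, 2, 3, 3]
print(f"entropy = {entropy(probs):.3f}, average length = {average:.3f}")
```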

10. Entropy Rate and Language Compression

Extends entropy to stochastic processes where symbol probabilities vary by context (entropy rate). Explains why exact calculation is impossible for natural language and notes Shannon's estimate of ~1 bit per character for English with 100+ character context, hinting at radical compression potential.
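Natural language is far too complex for an exact calculation, but the concept can be illustrated on a toy process. The sketch below assumes a two-state Markov chain (a hypothetical stand-in, not a model of English) in which each symbol tends to repeat. Its entropy rate is the average entropy of the next-symbol distribution, weighted by how often each state occurs.

```python
import math

def stationary_distribution(transition, iterations=1000):
    """Approximate the long-run state frequencies by repeated multiplication.

    Assumes the chain is well behaved (irreducible and aperiodic) so the
    iteration converges.
    """
    n = len(transition)
    dist = [1.0 / n] * n
    for _ in range(iterations):
        dist = [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return dist

def entropy_rate(transition):
    """Average bits per symbol for a Markov chain with the given transition matrix."""
    pi = stationary_distribution(transition)
    rate = 0.0
    for i, row in enumerate(transition):
        row_entropy = sum(p * math.log2(1 / p) for p in row if p > 0)
        rate += pi[i] * row_entropy
    return rate

# Each state repeats with probability 0.9 (toy numbers for illustration).
transition = [[0.9, 0.1],
              [0.1, 0.9]]
rate = entropy_rate(transition)
print(f"{rate:.3f} bits per symbol")
```

Taken in isolation, the two symbols are equally likely, which would suggest 1 bit per symbol. Accounting for context brings the rate below half a bit, the same effect that drives the low estimate for English.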

Key People & Entities

Claude Shannon: Mathematician and engineer who founded information theory in the 1940s with work on compression limits and entropy
John von Neumann: Mathematician credited (possibly apocryphally) with suggesting Shannon use the term 'entropy' for the average information quantity
Betty Shannon: Shannon's wife; participated in character-guessing experiments to estimate entropy of English language

Glossary

Prefix-free code: An encoding scheme where no code word is a prefix of any other code word, ensuring unambiguous decoding without requiring delimiters between symbols.
Information content: The measure of surprise or uncertainty in an event, calculated as -log₂(p) where p is the probability; measured in bits.
Shannon entropy: The average information per symbol in a probability distribution, calculated as Σ p·log₂(1/p); represents the theoretical minimum average number of bits per symbol needed to encode messages from that distribution.
Entropy rate: The extension of Shannon entropy to stochastic processes where symbol probabilities are context-dependent, measuring average information per symbol across all possible messages.
Cross-entropy loss: A measure used in machine learning that quantifies the difference between predicted and actual probability distributions; rooted in information theory and used in training language models.
Noiseless coding theorem: Shannon's 1948 theorem stating that no lossless encoding can average fewer bits per symbol than the entropy, and that encodings can get arbitrarily close to this theoretical limit.
N-gram: A sequence of n consecutive symbols from a text; used in Shannon's analysis to estimate probability distributions for character prediction.
Stochastic process: A sequence of random variables with probabilities that may depend on previous values; used to model systems like natural language where context affects future outcomes.
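The cross-entropy entry above can be tied back to compression with a short sketch. The function and the example distributions are assumptions for illustration. Cross-entropy gives the average bits per symbol when data drawn from the true distribution is encoded with a code built for the predicted distribution, so it equals the entropy only when the prediction is exact.

```python
import math

def cross_entropy(true_probs, predicted_probs):
    """Average bits per symbol when coding `true_probs` data with a code
    designed for `predicted_probs`: sum of p * log2(1/q)."""
    return sum(
        p * math.log2(1 / q)
        for p, q in zip(true_probs, predicted_probs)
        if p > 0
    )

true_probs = [0.5, 0.25, 0.125, 0.125]
perfect_prediction = true_probs
uniform_prediction = [0.25, 0.25, 0.25, 0.25]

print(cross_entropy(true_probs, perfect_prediction))  # 1.75 (equals the entropy)
print(cross_entropy(true_probs, uniform_prediction))  # 2.0 (0.25 bits wasted per symbol)
```

Machine learning libraries commonly compute this loss with the natural logarithm, which changes the unit from bits to nats but not the underlying idea.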
