100 Years of Computer Science: The 10 Papers That Built AI

I read every major CS paper of the last 100 years...

TL;DR

A retrospective of 10 landmark computer science papers from Turing's 1936 work through OpenAI's GPT-3, tracing how foundational ideas in computation, information theory, neural networks, and scaling converged to create modern AI.

Key Takeaways

1. Turing's 1936 paper on computable numbers defined the abstract machine that serves as the blueprint for modern computers and proved that some problems cannot be solved by any algorithm.
2. Claude Shannon's 1948 information theory reduced communication to bits and entropy, providing the mathematical foundation for the loss functions used in AI.
3. The perceptron (1958) introduced trainable neural networks, but a 1969 proof of its limitations nearly killed the field until backpropagation made multi-layer networks trainable.
4. Lamport's distributed systems paper laid the groundwork for the coordination across many GPUs that training large-scale neural networks requires.
5. AlexNet (2012), trained on the ImageNet dataset, demonstrated that deep learning works when you combine sufficient data, compute, and the right architecture.
6. The Transformer architecture (2017) replaced sequential processing with attention, solving long-range dependency problems in language models.
7. GPT-3 (2020) suggested that intelligence emerges at scale: simply making the model enormous and training it on internet-scale data unlocked few-shot and zero-shot generalization.

Notable Quotes

“Shannon wasn't trying to build artificial intelligence, but he gave us the math for uncertainty, prediction, and compression and accidentally wrote the spiritual ancestor to the loss function.”
— Narrator

“OpenAI takes the transformer and then asks the dumbest question possible. What if we just make it enormous?”
— Narrator

“Intelligence isn't some secret algorithm we're missing, but rather it simply emerges once you cross a threshold of scale.”
— Narrator

Chapters

1. The Birth of Computing: Turing and the Halting Problem

Alan Turing's 1936 paper on computable numbers answered Hilbert's decision problem (the Entscheidungsproblem) by proving that not all mathematical problems can be solved algorithmically. In doing so, he invented the Turing machine, the abstract blueprint for all modern computers.
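To make the idea of a Turing machine concrete, here is a minimal sketch of a simulator in Python. The function name `run_turing_machine`, the rule format, and the binary-increment rule table are all illustrative assumptions, not anything taken from Turing's paper. The step budget hints at the halting problem: the simulator cannot tell in general whether a machine will ever stop, so it gives up after a fixed number of steps.

```python
from collections import defaultdict


def run_turing_machine(rules, tape, state="start", blank="_", max_steps=10_000):
    """Simulate a one-tape Turing machine until it enters the 'halt' state.

    rules maps (state, symbol) -> (symbol_to_write, head_move, next_state),
    where head_move is -1 (left) or +1 (right).
    """
    cells = defaultdict(lambda: blank, enumerate(tape))  # unbounded tape
    head = 0
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            # In general there is no way to know whether the machine would halt.
            raise RuntimeError("step budget exhausted")
        symbol_to_write, head_move, state = rules[(state, cells[head])]
        cells[head] = symbol_to_write
        head += head_move
        steps += 1
    if not cells:
        return ""
    return "".join(cells[i] for i in range(min(cells), max(cells) + 1)).strip(blank)


# Hypothetical rule table: add 1 to a binary number written on the tape.
INCREMENT_RULES = {
    ("start", "0"): ("0", +1, "start"),  # scan right to the end of the number
    ("start", "1"): ("1", +1, "start"),
    ("start", "_"): ("_", -1, "carry"),  # step back onto the last digit
    ("carry", "1"): ("0", -1, "carry"),  # 1 + 1 = 0, carry the 1 leftward
    ("carry", "0"): ("1", -1, "halt"),   # absorb the carry and stop
    ("carry", "_"): ("1", -1, "halt"),   # the number grows by one digit
}

print(run_turing_machine(INCREMENT_RULES, "1011"))  # prints 1100
```

The whole machine is just a tape, a head position, and a lookup table, which is the sense in which it is a blueprint: any program can in principle be expressed as such a table.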

2. Information Theory: Shannon's Bits and Entropy

Claude Shannon's 1948 paper 'A Mathematical Theory of Communication' reduced all human communication to ones and zeros, introducing the bit as a unit of information and entropy as a measure of uncertainty. This framework later became the spiritual ancestor to modern AI loss functions.
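As a rough illustration of how entropy connects to loss functions, the sketch below computes Shannon entropy in bits and the cross-entropy between a true distribution and a model's predicted distribution. The function names and the example distributions are assumptions made for this example. Cross-entropy is the quantity that many modern models minimize during training.

```python
import math
from collections import Counter


def entropy_bits(message):
    """Average uncertainty per symbol, in bits, from observed symbol frequencies."""
    counts = Counter(message)
    total = len(message)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cross_entropy_bits(true_dist, predicted_dist):
    """Average bits needed to encode symbols from true_dist using predicted_dist."""
    return -sum(
        p * math.log2(predicted_dist[symbol])
        for symbol, p in true_dist.items()
        if p > 0
    )


print(entropy_bits("aaaa"))  # a fully predictable message carries no information
print(entropy_bits("abab"))  # two equally likely symbols: one bit per symbol

truth = {"a": 0.5, "b": 0.5}
good_model = {"a": 0.5, "b": 0.5}
bad_model = {"a": 0.9, "b": 0.1}
print(cross_entropy_bits(truth, good_model))
print(cross_entropy_bits(truth, bad_model))  # a worse prediction costs more bits
```

Cross-entropy is never lower than the entropy of the true distribution, so driving it down means making the model's predictions match reality more closely.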

3. Neural Networks Emerge: The Perceptron and First AI Winter

The perceptron (1958), inspired by biological neurons, was one of the first machine learning algorithms. However, a 1969 proof by MIT researchers Marvin Minsky and Seymour Papert showed that single-layer perceptrons cannot learn exclusive-or (XOR). The result killed AI funding for years, even though stacking layers was already understood to solve the problem.
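The XOR limitation is easy to reproduce. The following sketch trains a single-layer perceptron with the classic error-driven update rule; the function names, learning rate, and epoch count are assumptions chosen for illustration. It learns AND perfectly but can never classify all four XOR cases, because no single straight line separates them.

```python
def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Train a two-input perceptron; samples are ((x1, x2), target) pairs."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            prediction = 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0
            error = target - prediction  # -1, 0, or +1
            weights[0] += learning_rate * error * x1
            weights[1] += learning_rate * error * x2
            bias += learning_rate * error
    return weights, bias


def accuracy(samples, weights, bias):
    """Fraction of samples the trained perceptron classifies correctly."""
    correct = 0
    for (x1, x2), target in samples:
        prediction = 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0
        correct += prediction == target
    return correct / len(samples)


AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(accuracy(AND_SAMPLES, *train_perceptron(AND_SAMPLES)))  # prints 1.0
print(accuracy(XOR_SAMPLES, *train_perceptron(XOR_SAMPLES)))  # always below 1.0
```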

4. Distributed Systems: Lamport's Logical Clocks

Leslie Lamport's 1978 paper 'Time, Clocks, and the Ordering of Events in a Distributed System' solved the synchronization problem for multiple computers without shared clocks by ordering events according to causality. The ideas became part of the foundation for coordinating thousands of GPUs in modern AI training.
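Causality-based ordering can be sketched in a few lines. The class below is a minimal logical clock, with class and method names chosen for this example: each process increments a counter on every local event, stamps outgoing messages with it, and on receipt jumps ahead of the timestamp it was sent. This guarantees that a message's send always carries a smaller timestamp than its receive, with no wall clock involved.

```python
class LamportClock:
    """A logical clock that orders events by causality, not wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Record a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Record a send event and return the timestamp to attach to the message."""
        return self.tick()

    def receive(self, message_time):
        """Record a receive event, moving past the sender's timestamp."""
        self.time = max(self.time, message_time) + 1
        return self.time


node_a = LamportClock()
node_b = LamportClock()

node_a.tick()                         # local work on A: A's clock is 1
sent_at = node_a.send()               # A sends a message stamped 2
received_at = node_b.receive(sent_at) # B was at 0, so it jumps to 3
print(sent_at, received_at)           # prints 2 3
```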

5. Backpropagation: Training Deep Networks

After 17 years of AI winter, researchers including Geoffrey Hinton popularized backpropagation in 1986: run data forward, measure the error, and push it backward through the layers using the chain rule from calculus to adjust the weights. The work showed that hidden layers automatically learn features like edges and shapes.
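The forward-then-backward procedure can be sketched for a tiny network with two inputs, two hidden units, and one output. Everything here (the network size, sigmoid activations, squared-error loss, and the specific weights) is an assumption chosen to keep the example small. The last function checks the backpropagated gradient against a brute-force numerical estimate, which is a common way to confirm the chain-rule bookkeeping is right.

```python
import copy
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Run data forward; returns hidden activations and the output."""
    W1, b1, W2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output


def loss(params, x, target):
    """Measure the error (squared error, halved for a cleaner derivative)."""
    _, output = forward(params, x)
    return 0.5 * (output - target) ** 2


def backprop(params, x, target):
    """Push the error backward through the layers using the chain rule."""
    W1, b1, W2, b2 = params
    hidden, output = forward(params, x)
    delta_out = (output - target) * output * (1 - output)  # dLoss/d(output pre-activation)
    grad_W2 = [delta_out * h for h in hidden]
    grad_b2 = delta_out
    # Each hidden unit's share of the blame flows back through its outgoing weight.
    delta_hidden = [delta_out * w * h * (1 - h) for w, h in zip(W2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    grad_b1 = delta_hidden
    return grad_W1, grad_b1, grad_W2, grad_b2


def numerical_grad_W1(params, x, target, i, j, eps=1e-5):
    """Estimate dLoss/dW1[i][j] by nudging that one weight up and down."""
    plus, minus = copy.deepcopy(params), copy.deepcopy(params)
    plus[0][i][j] += eps
    minus[0][i][j] -= eps
    return (loss(plus, x, target) - loss(minus, x, target)) / (2 * eps)


# Hypothetical weights and a single training example.
PARAMS = ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1], [0.7, -0.4], 0.05)
X, TARGET = [1.0, 0.5], 1.0

grad_W1, _, _, _ = backprop(PARAMS, X, TARGET)
print(grad_W1[0][1], numerical_grad_W1(PARAMS, X, TARGET, 0, 1))  # the two should agree closely
```

A training loop would subtract a small multiple of each gradient from the matching weight and repeat; that step is omitted here to keep the sketch short.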

6. Web Scale Data: PageRank and Google

Larry Page and Sergey Brin's 1998 PageRank algorithm ranked web pages by link votes weighted by voter trustworthiness. Google's resulting web index created the largest structured corpus of human text ever assembled, which became training data for future AI models.
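The "votes weighted by voter trustworthiness" idea can be approximated with a short power-iteration loop. This is a simplified sketch rather than Google's production algorithm; the function name, the damping factor of 0.85, the iteration count, and the four-page link graph are assumptions for illustration. Each page repeatedly splits its current rank among the pages it links to, so a link from a highly ranked page is worth more.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by link votes; links maps each page to the pages it links to.

    Every link target must itself appear as a key in links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline, as if a reader sometimes jumps at random.
        new_rank = {page: (1 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: C receives links from A, B, and D.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
RANKS = pagerank(LINKS)
print(max(RANKS, key=RANKS.get))  # prints C
```

Page A ends up ranked second despite having only one incoming link, because that link comes from C, the most trusted page in this small graph.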

7. Deep Learning Breakthrough: AlexNet and ImageNet

In 2012, Alex Krizhevsky, working with Ilya Sutskever and Geoffrey Hinton, trained a deep convolutional neural network on ImageNet (millions of labeled photos) using consumer GPUs. AlexNet cut image classification error by roughly 10 percentage points in a single year, proving deep learning works at scale with the right data and compute.

8. The Transformer Revolution: Attention Is All You Need

The 2017 'Attention Is All You Need' paper introduced the Transformer architecture, replacing sequential token processing with self-attention that lets every word attend to every other word simultaneously. This solved long-range dependency problems and became the foundation for all modern LLMs including GPT.
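The core of self-attention is scaled dot-product attention, which can be sketched without any libraries. This simplified version skips the learned query, key, and value projections and the multiple heads that the full Transformer uses, and feeds the raw token vectors in directly; the function names and toy vectors are assumptions for illustration. Each token scores every other token, turns the scores into weights with a softmax, and outputs a weighted blend of all tokens' values.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for query in queries:
        # How relevant is each key to this query?
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys
        ]
        weights = softmax(scores)
        # Blend the value vectors according to those relevance weights.
        blended = [
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(blended)
    return outputs


# Three toy token vectors attending to one another (no learned projections).
TOKENS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in attention(TOKENS, TOKENS, TOKENS):
    print(row)
```

Because every query is compared against every key in the same pass, distant tokens interact as directly as adjacent ones, which is why this design handles long-range dependencies well.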

9. Scaling Laws: GPT-3 and Emergent Intelligence

OpenAI's 2020 'Language Models are Few-Shot Learners' paper scaled the Transformer to 175 billion parameters on internet-scale data. GPT-3 demonstrated that intelligence emerges at sufficient scale, enabling zero-shot and few-shot translation, summarization, and code generation without task-specific fine-tuning.

10. The AI Era: From Theory to Trillion-Dollar Products

The evolution from GPT-3 to ChatGPT showed how scaling insights turned into trillion-dollar products. Modern AI fundamentally performs the same next-token prediction that Shannon described in 1948, but at an incomprehensibly larger scale.

Key People & Entities

Alan Turing: Mathematician who defined the Turing machine and proved the halting problem undecidable in 1936, establishing the theoretical foundations of computing
Claude Shannon: Founder of information theory who reduced communication to bits and introduced entropy in his 1948 paper
David Hilbert: Mathematician who posed the Entscheidungsproblem (decision problem) that Turing answered
Geoffrey Hinton: Godfather of neural networks who helped popularize backpropagation and pioneered deep learning
Frank Rosenblatt: Psychologist who created the perceptron, an early machine learning algorithm inspired by neurons
Marvin Minsky and Seymour Papert: MIT researchers whose 1969 proof of the limitations of single-layer perceptrons triggered the first AI winter
Leslie Lamport: Computer scientist who solved distributed systems synchronization with logical clocks, essential for large-scale AI training
Larry Page and Sergey Brin: Founders of Google who developed the PageRank algorithm and assembled massive web-scale text data

Glossary

Turing Machine: An abstract computational model consisting of an infinite tape, read-write head, and table of rules; the theoretical blueprint for all modern computers.
Bit: The basic unit of information, established by Claude Shannon's 1948 paper (which credits the name to John Tukey); it represents a single binary choice (one or zero).
Entropy: In information theory, a measure of the average uncertainty or surprise in a message; borrowed from thermodynamics by Shannon.
Perceptron: An early machine learning algorithm that takes weighted inputs and adjusts weights to classify patterns, serving as the building block for modern neural networks.
Backpropagation: A training algorithm for neural networks that computes gradients by propagating error backward through layers using the chain rule, enabling deep learning.
Transformer: A neural network architecture introduced in 2017 that uses self-attention mechanisms to process all tokens in parallel, replacing sequential processing.
Self-Attention: A mechanism in Transformers that allows each token to attend to and weigh the relevance of every other token simultaneously.
Parameter: A learnable weight in a neural network; GPT-3 has 175 billion parameters.
Logical Clocks: A mechanism by Lamport for ordering events in a distributed system using causality rather than wall-clock time.
PageRank Algorithm: Google's algorithm that ranks web pages based on the number and quality of links pointing to them, treating links as votes.
