Why Agentic AI Is Extremely Expensive: The Token Problem

Extreme Token Use of Agentic AI - Computerphile

TL;DR

AI models bill per token (a word or word piece) for both input and output. Coding agents are extraordinarily expensive because they must re-read the entire conversation context on every request, so even a simple task can consume anywhere from thousands to millions of tokens.

Key Takeaways

1. Tokens are the fundamental unit of cost for AI models; a simple sentence like 'the cat sat on the mat' is split into multiple tokens, including spaces and punctuation
2. AI models process tokens inefficiently by re-reading the entire context for every single new token generated, unlike humans, who can hold information in memory
3. Coding agents dramatically increase token usage by reading multiple files, generating internal thoughts, making tool calls, and adding all previous context to each new query
4. A simple bug-fix request can easily consume 50,000–60,000 tokens across input and output, with costs climbing steeply if multiple files are read or follow-up questions are asked
5. GitHub Copilot's shift from monthly subscriptions to token-based billing revealed unsustainable costs; the previous flat-fee model masked the true expense of agentic AI
6. KV caching reduces redundant computations by storing intermediate network values, but cache lifespans are limited by GPU memory and user response times
7. Token-based billing incentivizes longer prompts and more expensive models, creating perverse incentives that accelerate cost growth rather than encouraging efficient AI usage

Notable Quotes

“language models don't work that way. Everything goes in every single time.”
— Host

“It's like measuring my quality as a driver by how quickly I wear through my tires. That is an unsustainable practice.”
— Host

“When you have an incentive like that, of course people are going to ask really long-winded questions... and in a surprise to absolutely no one, that is completely unsustainable in terms of cost.”
— Host

Chapters

1. What Is a Token?

Tokens are the basic units of AI processing: words, word fragments, punctuation, spaces, and special characters. How text splits into tokens depends on the language and the tokenizer; a Chinese character may take 1–2 tokens, and modern tokenizers use vocabularies of roughly 100,000 distinct tokens, covering code symbols and Unicode characters.
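To make the idea of splitting text into tokens concrete, here is a minimal sketch. It uses greedy longest-match against a tiny hand-written vocabulary, which is only a stand-in: real tokenizers learn their vocabularies from data and use more involved algorithms. `TOY_VOCAB` and `toy_tokenize` are hypothetical names invented for this illustration.

```python
# Hypothetical toy vocabulary. Real vocabularies are learned from data
# and are far larger (on the order of 100,000 entries).
TOY_VOCAB = {"the", " the", " cat", " sat", " on", " mat", "."}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text by greedy longest match, falling back to single characters."""
    longest = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for length in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens


tokens = toy_tokenize("the cat sat on the mat.")
print(len(tokens), tokens)
```

Note that the leading space is part of most tokens here, and that text outside the vocabulary falls apart into single characters, which is one reason unfamiliar text tends to cost more tokens.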

2. How Models Generate Output

AI models work auto-regressively: they take an input context, make many decisions, and output one token at a time. Each new token generation requires the model to process the entire previous context again, making the process computationally expensive.
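The loop described above can be sketched in a few lines. This is an illustration only: the "model" is a canned stub that replays a fixed reply, and `generate`, `make_canned_model`, and the `<eos>` stop marker are hypothetical names. What the sketch does show is the bookkeeping: every step takes the whole context as input, so the number of token positions read grows much faster than the number of tokens produced.

```python
def make_canned_model(prompt_len, reply):
    """Return a stub 'model' that replays a fixed reply, one token per call."""
    def next_token(context):
        index = len(context) - prompt_len
        return reply[index] if index < len(reply) else "<eos>"
    return next_token


def generate(prompt_tokens, next_token, max_new_tokens, stop_token="<eos>"):
    """Generate auto-regressively and count the token positions fed in."""
    context = list(prompt_tokens)
    positions_read = 0
    for _ in range(max_new_tokens):
        positions_read += len(context)   # the whole context is the input
        token = next_token(context)
        if token == stop_token:
            break
        context.append(token)            # the output joins the next input
    return context[len(prompt_tokens):], positions_read


prompt = ["the", "cat", "sat"]
model = make_canned_model(len(prompt), ["on", "the", "mat"])
output, positions_read = generate(prompt, model, max_new_tokens=10)
print(output, positions_read)
```

Three new tokens (plus the stop marker) require four passes over a context of 3, 4, 5, and then 6 tokens.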

3. Context Window Growth and Cost Escalation

As a conversation continues, the input context grows. The first request carries the system prompt (1,000 tokens) plus the initial query (100), or 1,100 input tokens. The follow-up must resend all of that plus the model's first thought (10,000) and the new query (200), or 11,300 input tokens, roughly ten times the first request.
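The arithmetic behind this growth can be sketched as follows, using the figures quoted in this chapter. The function name `request_input_sizes` is hypothetical, and the output size of the final turn is set to 0 only because it does not affect the input sizes being computed.

```python
def request_input_sizes(system_prompt, turns):
    """Return the input token count of each request in a conversation.

    turns is a list of (query_tokens, output_tokens) pairs.
    """
    sizes = []
    context = system_prompt
    for query_tokens, output_tokens in turns:
        context += query_tokens
        sizes.append(context)      # everything so far is sent as input
        context += output_tokens   # the model's output joins the history
    return sizes


# System prompt 1,000; first query 100 with a 10,000-token thought;
# follow-up query 200.
sizes = request_input_sizes(1_000, [(100, 10_000), (200, 0)])
print(sizes, sum(sizes))
```

The second request is dominated by the model's own earlier output, which is billed again as input.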

4. KV Caching and System Optimization

KV caching stores intermediate network representations to avoid recalculating relationships between tokens. However, caches have short lifespans because of GPU memory constraints and user delays, so when a user returns after a pause the cache must be rebuilt by 'pre-filling' the whole context again.
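A rough sketch of the effect of cache expiry is shown below. It tracks only how many token positions need their key-value entries computed, not real tensors, and it is not how any particular serving system is implemented. `ToyKVCache` and the 300-second lifetime are assumptions made up for this example.

```python
class ToyKVCache:
    """Counts positions that must be computed, given a cache lifetime."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds   # hypothetical cache lifetime
        self.cached_positions = 0
        self.last_used = None

    def positions_to_compute(self, context_len, now):
        expired = (
            self.last_used is None
            or now - self.last_used > self.ttl_seconds
        )
        if expired:
            self.cached_positions = 0    # cache lost: pre-fill from scratch
        to_compute = context_len - self.cached_positions
        self.cached_positions = context_len
        self.last_used = now
        return to_compute


cache = ToyKVCache(ttl_seconds=300)
first = cache.positions_to_compute(1_100, now=0)        # cold start
quick = cache.positions_to_compute(11_300, now=60)      # cache still warm
late = cache.positions_to_compute(11_500, now=1_000)    # user came back late
print(first, quick, late)
```

A prompt reply only pays for the new tokens, while a late reply pays for the entire context again.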

5. Chatbots vs. Coding Agents

Simple chatbots receive brief questions and return brief responses, so token costs stay manageable. Coding agents have autonomy, read files via tool calls, generate internal thoughts, and require all previous context in every query, causing token usage to explode.

6. Real-World Example: Bug-Fix Request

A simple bug-fix prompt (4,200 input tokens) triggers multiple file reads (~5,000 tokens each), internal thoughts (~2,000 tokens each), tool calls (~100 tokens), and code patches (~1,500 tokens), totaling ~55,000–60,000 tokens for one task.
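How such a tally accumulates can be sketched with a small loop. The per-item sizes come from the figures above, but the sequence of steps (two file reads, then a patch) is a hypothetical one chosen for illustration, so the totals are not expected to match the video's figure. `run_agent` is likewise an invented name.

```python
def run_agent(initial_context, steps):
    """Tally tokens for an agent loop.

    steps is a list of (output_tokens, tool_result_tokens) pairs; each
    request sends the whole accumulated context as input.
    """
    context = initial_context
    total_input = 0
    total_output = 0
    for output_tokens, tool_result_tokens in steps:
        total_input += context
        total_output += output_tokens
        # Both the model's output and the tool result join the history.
        context += output_tokens + tool_result_tokens
    return total_input, total_output


THOUGHT, TOOL_CALL, FILE_READ, PATCH = 2_000, 100, 5_000, 1_500

# Hypothetical run: read two files, then think and write a patch.
steps = [
    (THOUGHT + TOOL_CALL, FILE_READ),
    (THOUGHT + TOOL_CALL, FILE_READ),
    (THOUGHT + PATCH, 0),
]
total_input, total_output = run_agent(4_200, steps)
print(total_input, total_output, total_input + total_output)
```

Input tokens dominate the total because each file read is paid for again on every later request.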

7. GitHub Copilot Pricing Model Change

GitHub Copilot switched from flat-rate monthly subscriptions to per-token billing, revealing that simple agentic tasks consume millions of tokens. A six-prompt starfield code example used 2 million input tokens and 47,000 output tokens.
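To turn token counts like these into money, a simple calculation is enough. The prices below are placeholder assumptions, not figures from the video or from any price list, and the sketch ignores discounts such as cheaper cached input. `task_cost` is a hypothetical helper.

```python
def task_cost(input_tokens, output_tokens,
              input_price_per_million, output_price_per_million):
    """Cost of a task when input and output tokens are priced separately."""
    return (
        input_tokens / 1_000_000 * input_price_per_million
        + output_tokens / 1_000_000 * output_price_per_million
    )


# Token counts from the starfield example; prices are assumed placeholders.
cost = task_cost(
    input_tokens=2_000_000,
    output_tokens=47_000,
    input_price_per_million=3.0,     # assumption
    output_price_per_million=15.0,   # assumption
)
print(round(cost, 2))
```

Even with output priced several times higher per token, the input side accounts for most of the bill in this example because there is so much more of it.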

8. The Perverse Incentive Problem

Token-based billing encourages users to ask longer, more complex questions and use more expensive models with longer thinking times, creating a death spiral of cost escalation that is unsustainable for non-tech companies.

9. Sustainable Use Cases and Future Outlook

Efficient uses include small, succinct bug fixes, code completion (finishing half-written loops), and quick-fix scenarios requiring minimal context. Full agentic AI remains worryingly costly; companies must prove immediate ROI to justify expenses.

Key People & Entities

GitHub Copilot: AI coding assistant that switched from flat-rate billing to per-token billing, causing public backlash over hidden token costs
Anthropic: AI company implementing token usage caps for premium users due to cost concerns
Claude (Sonnet): Anthropic's language model used to generate the starfield screensaver example in the video

Glossary

Token: A unit of text processing in AI models; can be a word, word fragment, punctuation mark, space, or special character. Modern tokenizers use vocabularies of roughly 100,000 distinct tokens, covering code, Unicode, and multiple languages.
Tokenizer: A string parser that converts raw text into tokens before it is fed to a language model and converts the model's output tokens back into text; its vocabulary is built by frequency analysis and is specific to each model.
Embedding: A high-dimensional vector representation of a token learned during model training; represents semantic meaning and relationships between tokens in vector space.
Auto-regressive: A generation method that produces output one token at a time by processing the entire input context with each new token, then adding that token to the context for the next iteration.
Context Window: The maximum number of tokens a model can process in a single forward pass, including system prompts, user queries, previous responses, and file contents.
Tool Call: An output generated by an agentic AI that instructs the system to perform an action (e.g., read a file, delete a file) rather than responding directly to the user.
KV Caching: An optimization technique that stores key-value pairs (intermediate network representations) to avoid recalculating relationships between previously processed tokens when generating new tokens.
Pre-filling: The process of computing the key-value pairs for the whole input context before generation can start; it must be repeated when a cache has expired due to GPU memory constraints or user delay.
Agentic AI: An AI system with autonomy to make decisions, read files, and execute actions via tool calls rather than just responding to prompts; used in code assistants like GitHub Copilot.
System Prompt: Hidden instructions given to an AI model that define its role, behavior constraints, and context (e.g., 'You are a coding agent'); typically 1,000–4,000+ tokens.
