Ask any large language model to tell you a joke.
Go ahead. Open ChatGPT, Claude, Gemini — pick your favorite. Type "tell me a joke."
You'll get something like this: "Why don't scientists trust atoms? Because they make up everything."
Heard that one before? Of course you have. You heard it in 2022 when ChatGPT first launched. You heard it in 2023. In 2024. And you're still hearing it today, in 2026.
Andrej Karpathy — one of the founding minds behind OpenAI, former head of AI at Tesla — pointed this out on the No Priors podcast. Despite years of extraordinary improvement in code generation, mathematical reasoning, and scientific analysis, LLMs still tell the same three jokes they told when they first appeared.
“"Even though the models have improved tremendously, you ask for a joke and it has a stupid joke, a crappy joke from five years ago."”
— Andrej Karpathy, No Priors podcast
That observation stuck with me. Not because it's funny — it's the opposite of funny — but because it explains something I've been watching unfold from a front-row seat.
Since 2023, I've been giving a talk called "Past, Present, and Future of Generative AI." I gave it at conferences. I gave it in corporate boardrooms. I gave it for free, as often as I could, because most people genuinely didn't realize the magnitude of what was coming.
Every time, I updated the slides. New models. New predictions. New benchmarks crushed.
Now it's 2026. Almost two years since the last recorded version of that talk. The landscape has shifted in ways that would have seemed impossible even twelve months ago.
But some things haven't changed at all. Like that joke.
The fact that AI can now write production-quality code for hours without human supervision but still can't improvise a decent punchline isn't a quirky anecdote. It's a window into the most important question in artificial intelligence right now: Where is the boundary between what AI can genuinely do and what it merely appears to do?
That boundary has a name. Researchers call it the jagged frontier.
Understanding it is the difference between organizations that succeed with AI and the 95 percent that don't.
The Cast of Characters
In my talk, I traced the story of generative AI through five key figures. Each one represents a different piece of the puzzle. Looking back now, their predictions — and where those predictions landed — tell us more than any benchmark score.
Ray Kurzweil published The Singularity Is Near in 2005, predicting that AI would surpass human intelligence by 2029 and that humans would merge with AI by 2045. When he self-evaluated his predictions in 2010, he claimed 86 percent accuracy. His sequel, The Singularity Is Nearer, arrived in 2024 with the dates barely adjusted — 2029 became 2030.
The most human detail about Kurzweil has nothing to do with timelines. His father died in 1970. Since then, Ray has collected every letter, every document, every photograph his father left behind. He believes that one day, an AI will know everything about his father — and in a way, be more like him than his father himself was.
That's not a prediction about technology. It's a prayer dressed up as one.
Ben Goertzel popularized the term AGI and has spent his career arguing for artificial intelligence that is not just smart but compassionate. He co-created Sophia the robot, built the SingularityNET decentralized AI marketplace, and predicts human-level AGI by the early 2030s.
“"If we want machines to be our partners, we must raise them to be our friends."”
— Ben Goertzel
The idea sounds almost quaint in 2026, when benchmark scores and enterprise contracts drive most AI development. But Goertzel's question — what kind of intelligence are we building? — is more relevant now than ever.
Demis Hassabis was a chess master at 13, built the hit game Theme Park at 17, and co-founded DeepMind in 2010. Google acquired it for $500 million in 2014. Under Hassabis's leadership, DeepMind produced AlphaGo, AlphaFold, and something less flashy but more important: a framework for measuring AGI itself.
I'll come back to that framework. It turns out to be one of the most useful tools for cutting through the noise.
Then there's the friendship that broke apart.
Elon Musk and Larry Page were close friends in 2012. During their late-night conversations, Musk became alarmed that Page — the co-founder of Google, the company that had just acquired DeepMind — wasn't taking AI safety seriously enough. Page called Musk a "speciesist" for caring more about humans than about AI as a new form of intelligence.
That was the breaking point. In 2015, Musk poached AI researcher Ilya Sutskever from Google and co-founded OpenAI as a direct counterbalance to Google's growing AI power.
Which brings us to Sam Altman, who took over as CEO of OpenAI in 2019 and oversaw its transformation from a nonprofit research lab into the most consequential AI company in the world. GPT-3, DALL-E, GPT-4, the reasoning models that changed everything — all under Altman's leadership.
Also under his leadership: the shift to a "capped-profit" model, growing concerns about transparency, and the bizarre November 2023 boardroom coup that briefly ousted him before he returned days later.
These five arcs — the visionary, the idealist, the builder, the protector, and the strategist — set the stage for what came next.
What came next moved faster than any of them predicted.
The Scorecard
When I gave the talk in June 2024, the AI arms race was in full swing. Google had launched Gemini. Meta had pivoted from the Metaverse to open-source AGI. Microsoft had Copilot everywhere. Anthropic had Claude 3. And Elon Musk's xAI had just raised $6 billion to build the Gigafactory of Compute.
In my talk, I shared Musk's prediction — made just weeks earlier — that AGI would arrive "next year." That would be 2025.
It didn't.
Musk built Colossus, a supercomputer in Memphis, Tennessee. It went from concept to operational with 100,000 GPUs in 122 days, then doubled to 200,000 GPUs 92 days later. An engineering feat by any measure.
But having the world's largest AI supercomputer didn't produce AGI. In late 2025, Musk told xAI staff there was a "10 percent likelihood" of achieving it with Grok 5. By early 2026, he'd quietly shifted his prediction to next year. Again.
This isn't about Musk being wrong. It's about the gap between infrastructure and intelligence. You can build the biggest computer in the world and pack it with more GPUs than anyone imagined possible. Intelligence — real, general intelligence — doesn't emerge just because you throw more hardware at the problem.
For years, the industry operated on a simple faith: scaling laws. In 2020, OpenAI researchers showed that model performance improves predictably with more compute, data, and parameters. DeepMind refined the math in 2022. The implication was intoxicating — just make it bigger and it gets smarter.
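The DeepMind refinement was the 2022 Chinchilla paper, which fits a loss curve that falls smoothly and predictably as parameters and training tokens grow. Here is a minimal sketch of that curve, using the paper's published constants purely as illustrative values:

```python
# Minimal sketch of a Chinchilla-style scaling law (Hoffmann et al., 2022).
# The constants are the paper's published fit, used here only for illustration.

def predicted_loss(n_params: float, n_tokens: float) -> float:
    E, A, B = 1.69, 406.4, 410.7   # irreducible loss plus fitted coefficients
    alpha, beta = 0.34, 0.28       # fitted exponents for parameters and data
    return E + A / n_params**alpha + B / n_tokens**beta

# Each doubling of model size and data lowers the predicted loss,
# but by less than the doubling before it.
print(predicted_loss(70e9, 1.4e12))    # roughly Chinchilla scale
print(predicted_loss(140e9, 2.8e12))   # twice the parameters, twice the data
```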
This faith drove the billions pouring into GPU clusters. It's why Colossus exists.
Then the returns started diminishing.
OpenAI's Orion — internally planned as GPT-5 — was quietly downgraded to GPT-4.5 after it failed to deliver the expected leap. It hit GPT-4-level performance after just 20 percent of training, but the remaining 80 percent showed diminishing returns. Fortune reported in February 2025 that Altman "effectively acknowledged the scaling technique was no longer producing a big enough performance boost."
At NeurIPS 2024, Ilya Sutskever — the same researcher Musk poached from Google to co-found OpenAI — delivered a stark verdict:
“"Pre-training as we know it will unquestionably end. The data is the fossil fuel of AI. We have but one internet."”
— Ilya Sutskever, NeurIPS 2024
The scaling laws didn't break. But the direction of scaling shifted.
Instead of bigger models trained on more data, the breakthrough came from giving models more time to think. OpenAI researcher Noam Brown discovered that letting a model reason for 20 seconds produced the same performance gain as scaling the model by 100,000 times. "I literally thought it was a bug," he said.
This is why o1 and o3 were breakthroughs — not because of larger pre-training runs, but because of a new dimension: inference-time compute. Letting models reason at the point of use rather than stuffing more knowledge into them during training.
The lab leaders still can't agree on what this means. Altman says scaling laws "absolutely" still hold. Dario Amodei says Anthropic does "not see a wall." Hassabis thinks scaling gets us about 50 percent of the way to AGI — one or two additional breakthroughs are needed. Sutskever says the "age of scaling" is over and we're "back to the age of research, just with big computers."
The Epoch AI data suggests a resolution: capability improvement is accelerating — 1.85 times faster since April 2024 — but the acceleration comes from reasoning and reinforcement learning, not bigger pre-training runs. Something is still scaling. It's just not the same thing that was scaling before.
Meanwhile, here's what else happened between then and now.
Reasoning models delivered. That shift in scaling direction produced real results. By December 2024, o3 scored 75.7 percent on ARC-AGI-1 — a test GPT-4o could barely scratch five percent on.
AI agents became real. Not a prediction anymore — an actual shift in how software gets built. Karpathy describes going from writing 80 percent of his own code to writing essentially none of it since December 2025. "I don't think I've typed a line of code probably since December," he said. I can verify this from my own experience. At Orange Hill, our entire engineering team has made the same shift — we delegate to agents, review their output, and orchestrate rather than type. The change was sudden and it was total.
Costs collapsed. The performance equivalent of GPT-3.5 dropped from $20 per million tokens in November 2022 to $0.07 per million tokens by October 2024. A 280-times reduction in under two years. This matters more than most benchmark scores because it's what makes AI deployable at scale.
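To make that concrete, here is the arithmetic on a hypothetical workload. The per-token prices are the ones cited above; the document counts are invented for illustration:

```python
# Back-of-the-envelope math on the price collapse described above.
# Prices are the cited figures; the workload is a made-up example.

price_nov_2022 = 20.00   # $ per million tokens, GPT-3.5-class quality
price_oct_2024 = 0.07    # $ per million tokens, equivalent quality

print(f"{price_nov_2022 / price_oct_2024:.0f}x cheaper")   # ~286x

# Hypothetical workload: summarize 10,000 documents of ~2,000 tokens each.
tokens = 10_000 * 2_000
print(f"at 2022 prices: ${tokens / 1e6 * price_nov_2022:,.2f}")   # $400.00
print(f"at 2024 prices: ${tokens / 1e6 * price_oct_2024:,.2f}")   # $1.40
```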
Chart: The AI Paradox (sources: MIT State of AI in Business 2025, Epoch AI, OpenAI Enterprise Data)
Enterprise spending tripled. Companies poured $37 billion into generative AI in 2025, up from $11.5 billion the year before.
But here's the number that matters most: 95 percent of enterprise AI pilots failed to deliver measurable business impact, according to MIT's 2025 State of AI in Business report.
Read that again. $37 billion spent. 95 percent failure rate.
These companies aren't failing because the technology doesn't work. They're failing because they don't understand where it works and where it doesn't.
The Jagged Frontier
In 2023, researchers from Harvard and Boston Consulting Group ran a field experiment. They gave 758 BCG consultants access to GPT-4 and measured what happened.
The results were strange.
On some tasks — creative ideation, data analysis, persuasive writing — consultants using AI were dramatically more productive. Higher-quality work, less time.
On other tasks — tasks that seemed equally complex to a human observer — AI made them worse. Consultants who relied on the AI's output performed below those who didn't use it at all.
Ethan Mollick, the Wharton professor who co-led the study, coined the term "jagged technological frontier" to describe what they found. AI doesn't have a smooth capability curve that rises predictably. It has a jagged one — superhuman at some things, embarrassingly bad at others, with no reliable way to predict which is which.
“"AI can do some work incredibly well and other work incredibly badly, in ways that didn't map very well to our human intuition of the difficulty of the task."”
— Ethan Mollick, Wharton School
By 2026, the jagged frontier hasn't disappeared. But we understand its shape much better.
Karpathy put it viscerally: "I simultaneously feel like I'm talking to an extremely brilliant PhD student who's been a systems programmer for their entire life and a 10-year-old."
That metaphor lands because everyone who uses these tools has lived it. You're amazed by what the AI just accomplished — genuinely, jaw-on-the-floor amazed — and then five minutes later it does something so obviously wrong that you question whether it understood anything at all.
Why? Why would a system that solves advanced mathematics, writes elegant code, and passes medical licensing exams also fail at tasks a child could handle?
For a long time, nobody had a satisfying answer. Researchers described the jagged frontier but never explained it.
Then Karpathy explained it. And the explanation is simple.
Why the Joke Never Improves
Here's the mechanism.
Modern LLMs are trained using reinforcement learning. After the initial pre-training phase — where the model learns language from vast amounts of text — labs use RL to make the model better at specific tasks.
"It's because it's outside of the RL," Karpathy said. "It's outside of the reinforcement learning. It's outside of what's being improved."
This is why the atoms joke persists. No lab is optimizing for joke quality. It's not that they can't — it's that the optimization machinery doesn't know how to measure "funny." What it can't measure, it can't improve.
In a separate interview on the Dwarkesh Podcast, Karpathy described RL as "sucking supervision through a straw." The entire trajectory of a solution — every decision the model made along the way — gets compressed into a single binary signal: right or wrong. Every step toward the correct answer is treated as correct, even when the reasoning was flawed and the model just got lucky.
The consequence: "All of the samples you get from models are silently collapsed. They occupy a very tiny manifold of the possible space."
The model's outputs look diverse on the surface but are squeezed into a narrow band. That's why every LLM tells the same jokes, uses the same turns of phrase, and converges on the same patterns. The RL optimization carves deep grooves, and the model runs in those grooves even when you want it to explore.
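Here is a toy sketch of the mechanism Karpathy is describing. It is not any lab's actual training code, just the shape of outcome-only reinforcement learning: a single pass/fail verdict at the end of a trajectory gets smeared back over every step that produced it.

```python
# Toy sketch of "supervision through a straw": outcome-only RL credits
# every step of a trajectory with the same final pass/fail signal.
# This illustrates the idea, not any lab's real pipeline.

def outcome_reward(final_answer: str, correct_answer: str) -> float:
    return 1.0 if final_answer.strip() == correct_answer.strip() else 0.0

def per_step_credit(trajectory: list[str], reward: float) -> list[float]:
    # Every step gets identical credit, flawed reasoning included.
    # A lucky guess reinforces the whole path; a near miss reinforces nothing.
    return [reward] * len(trajectory)

steps = ["try dividing", "that fails", "guess a factor", "it works", "answer: 42"]
print(per_step_credit(steps, outcome_reward("42", "42")))   # [1.0, 1.0, 1.0, 1.0, 1.0]
```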
This is the engine behind the jagged frontier.
Inside the RL boundary — math, code, benchmarks, anything with verifiable answers — models improve at a breathtaking pace. On the tasks that RL can optimize, these systems are climbing faster than ever.
Outside the RL boundary — humor, common sense, creative diversity, social judgment — progress is essentially zero.
“"You're either on rails and you're part of the superintelligence circuits, or you're not on rails and you're outside of the verifiable domains and suddenly everything kind of just meanders."”
— Andrej Karpathy
The jagged frontier isn't random. It's a map of what reinforcement learning can and cannot reach.
The Benchmark Graveyard
The RL boundary leads to a problem the industry doesn't talk about enough.
When GPT-3 first took the MMLU test — a broad multiple-choice exam covering 57 academic subjects — it scored 35 percent. The test was designed to measure general knowledge and reasoning, and 35 percent was barely above random guessing.
By 2026, the latest models score 99 percent.
The test is functionally useless.
Chart: The Benchmark Graveyard (sources: Epoch AI, ARC Prize Foundation)
The same story repeats across the board. HumanEval, which tests code generation, is saturated at 91 to 95 percent. GPQA, a graduate-level science exam, jumped 48.9 percentage points in a single year. SWE-bench, which measures real-world software engineering, gained 67.3 points.
Models conquer benchmarks designed to last years in a matter of months.
This looks like extraordinary progress. And some of it is. But Karpathy identified a critical problem: "Benchmarks are almost by construction verifiable environments and are therefore immediately susceptible to RLVR."
RLVR — reinforcement learning with verifiable rewards — is the exact mechanism behind the jagged frontier. Benchmarks have clear right answers, which makes them perfect targets for RL optimization. Labs construct training environments adjacent to benchmark problems through synthetic data generation. They're not literally training on the test. But they're training on tasks so similar that benchmark performance skyrockets regardless.
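What "verifiable" means in practice is simply that a program can do the grading. A hedged sketch of the idea, with a tiny run-the-unit-tests grader standing in for whatever harness a lab actually uses:

```python
# Sketch of a verifiable reward in the RLVR sense: if a program can grade the
# output, the output can be optimized. This unit-test grader is a hypothetical
# stand-in, not any lab's actual evaluation harness.

def verifiable_reward(candidate_code: str, test_code: str) -> float:
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the model's proposed solution
        exec(test_code, namespace)        # run the asserts against it
        return 1.0                        # every test passed
    except Exception:
        return 0.0                        # any failure collapses to zero

candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 2) == 4\nassert add(-1, 1) == 0"
print(verifiable_reward(candidate, tests))   # 1.0

# When a grader like this becomes the target, scores rise whether or not the
# underlying skill does. There is no equivalent grader for "funny".
```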
This is Goodhart's Law: When a measure becomes a target, it ceases to be a good measure.
There's a revealing case study. Researchers found that GPT-4 could solve Codeforces programming problems released before its training cutoff — but failed on problems released after. It wasn't reasoning through the problems. It had memorized the patterns.
So the industry creates harder benchmarks. Models crush those too. And the cycle repeats.
The latest entry in this arms race is ARC-AGI-2, created by François Chollet — the same researcher who built Keras, one of the most widely used deep learning libraries.
ARC-AGI-2 tests something specific: the ability to solve novel visual puzzles that require genuine reasoning, not pattern matching. Each puzzle involves interpreting unfamiliar rules from examples and applying them — something humans do naturally.
The results are sobering.
Pure LLMs score 0 percent on ARC-AGI-2. Zero.
The best system — a heavily engineered solution using Gemini 3 with extensive scaffolding — scored 54 percent at $31 per task. Humans, by contrast, handle these puzzles comfortably: every task in the set was solved by at least two human participants in fewer than two attempts.
For context: On the previous version, OpenAI's o3 had scored 87.5 percent with $4,560 per task in compute. When a harder version of the test dropped, performance collapsed.
Chollet's assessment: "Current AI reasoning performance is tied to model knowledge." When the test falls outside what the model has seen variations of, it fails.
As I used to say in my talk: If an LLM can solve a test, it doesn't mean it has solved what that test measures.
Level One of Five
So where are we, really?
In 2023, DeepMind published a paper that did something useful. Instead of treating AGI as a binary — you either have it or you don't — they created a framework with five levels.
| Performance × Generality | Narrow (clearly scoped task or set of tasks) | General (wide range of non-physical tasks, including metacognitive abilities like learning new skills) |
|---|---|---|
| Level 0: No AI | Calculator software; compiler | Human-in-the-loop computing (e.g., Amazon Mechanical Turk) |
| Level 1: Emerging (equal to or somewhat better than an unskilled human) | GOFAI; simple rule-based systems (e.g., SHRDLU) | ChatGPT, Bard, Llama 2 |
| Level 2: Competent (at least 50th percentile of skilled adults) | Siri, Alexa, Google Assistant; Watson; SOTA LLMs on a subset of tasks (e.g., short essays, simple coding) | Not yet achieved |
| Level 3: Expert (at least 90th percentile of skilled adults) | Grammarly; generative image models (e.g., Imagen, DALL-E 2) | Not yet achieved |
| Level 4: Virtuoso (at least 99th percentile of skilled adults) | Deep Blue, AlphaGo | Not yet achieved |
| Level 5: Superhuman (outperforms 100% of humans) | AlphaFold, AlphaZero, Stockfish | Artificial Superintelligence (ASI); not yet achieved |
These levels are measured across two dimensions: performance and generality.
In narrow tasks, we're already at Level 5. AlphaFold predicts protein structures better than any human. AI systems diagnose certain cancers more accurately than expert radiologists. Code generation models outperform most programmers on well-defined tasks.
But for general intelligence — the ability to handle whatever cognitive task you throw at it — current frontier models sit at Level 1.
Emerging.
Level 1 means: about as good as someone who doesn't really know what they're doing, but across a broad range of tasks.
That's the honest assessment. The most sophisticated AI systems in the world have reached the cognitive level of a well-meaning beginner who sometimes gets lucky.
This framing cuts through the hype in a way that raw benchmark scores never could.
When Musk says "AGI next year," he's implying we'll jump from Level 1 to...what? Level 3? Level 5? In twelve months?
When Kurzweil predicts AGI by 2030, he's talking about an AI that is at least expert-level across the full range of human cognitive tasks. We're at Level 1 general and Level 5 on a handful of narrow ones.
Karpathy's assessment is more measured. Throughout 2025, he consistently described AGI as roughly a decade away. "The problems are tractable," he said. "They're surmountable. But they're still difficult."
He distinguishes between the "year of agents" — which the industry declared in 2024 — and the "decade of agents," which is what he thinks we're living through. The tools work. They're transformative. But the gap between "impressive demo" and "reliable autonomous system" is measured in years of engineering, not months of hype.
The ARC-AGI-2 results reinforce this. Strip away the scaffolding, the prompt engineering, the compute-heavy reasoning loops. Test the raw model on novel problems.
Zero percent.
The jagged frontier isn't just jagged horizontally — good at some tasks, bad at others. It's jagged vertically too. Level 5 narrow, Level 1 general.
The AI is simultaneously superhuman and sub-competent. The challenge is knowing which one you're getting at any given moment.
What Actually Gets Better
I've spent most of this article on what AI can't do. That's deliberate. Understanding the limits is the most valuable thing I can offer, because the capabilities are obvious — you experience them every day.
But those capabilities are real, and they're accelerating.
Epoch AI's Capabilities Index shows a clear breakpoint in April 2024. Before that date, capabilities improved at 8.3 ECI points per year. After it, the rate jumped to 15.5 points per year — a 1.85-times acceleration.
What drove it? Reasoning models.
When OpenAI released o1, it introduced a fundamentally different approach: Instead of generating an answer immediately, the model first generates a chain of reasoning, thinking through the problem step by step. This sounds simple. It wasn't. It required new training methodologies, new infrastructure, and new ways of scoring model outputs.
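In pseudocode terms, the shift looks roughly like this. The generate function is a placeholder for any model call; the prompts and token budgets are illustrative assumptions, not any vendor's actual API:

```python
# Sketch of inference-time compute: spend tokens thinking before answering.
# `generate` is a placeholder for an LLM call, not a real client library.

def generate(prompt: str, max_tokens: int) -> str:
    raise NotImplementedError("plug in a model call of your choice here")

def answer_directly(question: str) -> str:
    # The pre-reasoning pattern: one pass, straight to the answer.
    return generate(question, max_tokens=200)

def answer_with_reasoning(question: str, thinking_budget: int = 4_000) -> str:
    # First spend a budget of tokens on a private chain of reasoning...
    thoughts = generate(f"Think step by step about: {question}",
                        max_tokens=thinking_budget)
    # ...then condition a short final answer on that reasoning.
    return generate(f"{question}\n\nReasoning:\n{thoughts}\n\nFinal answer:",
                    max_tokens=200)
```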
The result was dramatic improvement on tasks requiring multi-step logical thinking. By 2026, every major lab has reasoning models. Claude thinks before it responds. Gemini has its reasoning mode. DeepSeek R1 proved you could achieve frontier-level reasoning at a fraction of the compute cost.
Then there are agents.
In my 2024 talk, I predicted that AI agents would be the next major development — systems that complete multi-step tasks autonomously. That prediction landed.
Karpathy describes the shift: He now runs multiple AI agents in parallel — one researching, one writing code, one planning — and orchestrates them like a team. "It's not about a single session with your agent," he said. "Multiple agents, how do they collaborate, and how do you move in much larger macro actions."
This isn't science fiction. It's happening right now, across thousands of engineering teams. The move from "AI as autocomplete" to "AI as autonomous collaborator" is the most significant practical change since ChatGPT launched.
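Structurally, the pattern is simpler than it sounds, even if the engineering behind real agent runtimes is not. A toy sketch of the orchestration loop, with run_agent as a stand-in rather than any particular framework's API:

```python
# Toy sketch of agent orchestration: several agents with different roles run
# concurrently, and a human (or supervisor loop) reviews and merges the output.
# `run_agent` is a stand-in, not any specific framework's API.
import asyncio

async def run_agent(role: str, task: str) -> str:
    # Placeholder: call your agent runtime of choice here.
    return f"[{role}] draft result for: {task}"

async def orchestrate(request: str) -> dict[str, str]:
    roles = {
        "researcher": f"Survey prior art for: {request}",
        "coder": f"Implement: {request}",
        "planner": f"Break down and sequence: {request}",
    }
    drafts = await asyncio.gather(*(run_agent(r, t) for r, t in roles.items()))
    return dict(zip(roles, drafts))   # review and merge happens from here

print(asyncio.run(orchestrate("add audit logging to the billing service")))
```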
And the cost curve makes all of it more accessible every month. When I started giving my talk, using GPT-4 for enterprise work was prohibitively expensive. Today, equivalent performance costs a small fraction of what it did then, thanks to the 280-times collapse described earlier.
Here's what I think comes next.
The jagged frontier won't disappear. But it will flatten — gradually, unevenly, in ways that matter. Every time researchers figure out how to make a new domain verifiable, how to create reward signals for tasks that previously had none, the RL machinery kicks in and that domain improves rapidly.
Mollick's insight is useful here: "Don't watch the benchmarks. Watch the bottlenecks."
When a bottleneck breaks — when a capability that was stuck suddenly leaps forward — it can unlock entire categories of applications overnight. Google's breakthrough in image generation quality didn't just produce better images. It upgraded every tool that uses images: presentations, documents, design workflows.
The race isn't about whether AI will keep getting better. It will. The question is: Which bottlenecks break next? And are you positioned to move when they do?
The Question at the End
I've always ended my talk the same way.
After walking through the characters, the arms race, the predictions, the impacts on companies, society, and individuals — after covering job displacement, universal basic income, AI embodied in robots, and the theoretical futures of superintelligence — I close with two questions.
Is intelligence an emergent property of all matter?
Is love an emergent property of all intelligence?
I've never changed those questions. Not once, through every update of every version of the talk. Because they're the questions that still don't have answers, even as everything else shifts beneath our feet.
They feel more pointed now.
We've built systems that score 99 percent on tests designed to measure intelligence. Those same systems score 0 percent when the test changes in ways any human could handle. They reason through complex mathematics but can't tell a joke that hasn't been told a thousand times before.
The intelligence we've built is real. It's also radically incomplete.
It crushes benchmarks but can't operate outside its training rails. It accelerates human capability by orders of magnitude but fails 95 percent of the time when organizations try to deploy it at scale. It sits at Level 1 of a five-level framework that its own creators designed — powerful enough to transform industries, nowhere close to what any of us mean when we say "intelligence."
If intelligence is an emergent property of all matter — if it arises naturally from sufficient complexity — then we should expect AI to eventually develop the qualities we associate with general intelligence. Humor. Common sense. The ability to reason about things never seen before. Creative surprise. Judgment.
But the evidence from the jagged frontier suggests that intelligence doesn't emerge uniformly. It emerges in the directions you optimize for. Everywhere else, it stays frozen.
And love? Compassion? The qualities Ben Goertzel has spent his career arguing must be built into AI from the ground up?
Those are even further outside the RL boundary than jokes.
Which raises a question I wish the frontier labs would take seriously. We've seen what happens when you optimize for math — models become superhuman at math. We've seen what happens when you optimize for code — models write better code than most humans. Every time RL is pointed at a domain, that domain explodes with capability.
So what would happen if we pointed it at empathy? At nuance? At the kind of moral reasoning that requires holding two conflicting truths at once? We've never tried — not with the same intensity, the same compute budgets, the same engineering focus. We have no idea what emergent properties might appear if we optimized for compassion with the same rigor we optimize for code completion.
Maybe nothing. Or maybe something we can't predict — the way nobody predicted that training on code would improve reasoning about logic, or that chain-of-thought would unlock mathematical ability that brute-force scaling couldn't reach.
The labs won't do it on their own. There's no leaderboard for kindness. No benchmark for wisdom. But if this article has shown anything, it's that the shape of AI is not inevitable — it's a direct reflection of what we choose to measure and reward. Right now, we're building intelligence in our own image: brilliant at the things we can score, indifferent to everything else.
We're at an extraordinary moment. The tools are more powerful than they've ever been and more limited than most people realize. The organizations that thrive will be the ones that understand both sides of that equation — that see the jagged frontier clearly enough to know where to step and where not to.
I've spent the last three years helping organizations navigate this terrain. The gap between what AI can do in a demo and what it can do in your business is where the real work happens.
That gap is where I live. It's why I keep updating the talk.
And somewhere out there, an LLM is still telling the same joke about atoms.