“It was doing so well… until it stopped trying.”
I haven’t been able to stop thinking about Apple’s new paper, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” since I first saw all of the chatter about it yesterday on LinkedIn.
One LinkedIn poster simply asked: “Are your models actually thinking, or just faking it?” Another poster, Tatyana Mamut, PhD, took a different take on it: “Apple's paper is a critique of LRMs, not primary research into AI. And despite its clickbaity title, the paper does not prove that reasoning models don't think. TLDR: They set up experiments which showed that when presented with some highly complex tasks, reasoning models do not work harder to figure out more complex tasks beyond a certain point. For a parent, this is an obvious result: if you give a 4th grader college calculus to solve, they will immediately give up, whereas if you give them a slightly harder arithmetic, they will try hard to figure it out. This is what humans do, too.”
Here’s my hot take…
This isn’t a story about bad prompts or misaligned objectives. It’s a story about a fundamental behavioral ceiling in modern AI. And it might be the most important thing no one in the industry wants to talk about.
🧩 Act I: The Rise of the Reasoning Model
For years, the frontier of AI progress has been measured not just in tokens or parameters, but in the ability to reason. Chain-of-thought (CoT) prompting—urging models to “think step by step”—was heralded as a breakthrough. Soon came “reasoning-optimized” models, billed as more analytical, more systematic, and more aligned with human logic.
These systems now anchor flagship products. They sit behind coding assistants, medical advisors, and planning tools. Their ability to handle complexity is central to their pitch.
But what happens when we stop feeding them prompt-engineered softball questions—and hand them real, hard problems?
🧱 Act II: The Collapse, Documented
Apple researchers took a different approach. They built a test set from classic logic puzzles—Tower of Hanoi, River Crossing, Checker Jumping, and Blocks World—and scaled up the difficulty. Leading reasoning models, including Claude 3.7 Sonnet (thinking), DeepSeek-R1, and OpenAI’s o-series, were given puzzles of increasing complexity.
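It’s worth pausing on why “scaling up the difficulty” is so punishing in these puzzles. For Tower of Hanoi, the optimal solution doubles (plus one) with every disk you add, so each step up in size is an exponential jump, not a linear one. A minimal sketch:

```python
# Minimal sketch: why "one more disk" is never a small step.
# The optimal Tower of Hanoi solution has exactly 2^n - 1 moves for n disks.

def hanoi_moves(n, src="A", aux="B", dst="C", moves=None):
    """Return the optimal move list (src_peg, dst_peg) for n disks."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi_moves(n - 1, src, dst, aux, moves)  # clear the way
    moves.append((src, dst))                  # move the largest disk
    hanoi_moves(n - 1, aux, src, dst, moves)  # rebuild on top
    return moves

for n in range(1, 11):
    print(n, len(hanoi_moves(n)))  # 1, 3, 7, 15, ... doubling each time
```

So a model that solves the 7-disk puzzle (127 moves) and fails at 9 disks (511 moves) hasn’t been asked a “slightly” harder question—the solution space has quadrupled.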
At first, performance was strong.
Then came the break point.
As the puzzles became only modestly more complex, accuracy didn’t just decline—it collapsed. Models went from consistent answers to near-zero correctness.
But the most chilling finding? They stopped trying.
Instead of producing longer, more detailed outputs—using more inference tokens, as one might expect from a harder problem—they used fewer. In some cases, they cut reasoning effort nearly in half, despite having token room to spare.
It’s like watching a student stare at a hard exam question and turn in a blank page—not because they couldn’t try, but because they chose not to.
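The measurement behind this finding is simple to sketch. Nothing below is Apple’s actual harness—`query_model` and `make_puzzle` are hypothetical stand-ins for whatever API and puzzle generator you use—but it shows the idea: track output length as a crude proxy for reasoning effort while difficulty climbs. If the model were working harder on harder puzzles, the curve should rise; the paper found it falls.

```python
# Hypothetical harness sketch. `query_model` and `make_puzzle` are stand-ins
# (assumptions, not a real library API). Output length serves as a rough
# proxy for reasoning effort.

def reasoning_effort_curve(query_model, make_puzzle, sizes):
    """Return (puzzle_size, output_word_count) pairs for rising difficulty."""
    curve = []
    for n in sizes:
        prompt = make_puzzle(n)
        reply = query_model(prompt)
        curve.append((n, len(reply.split())))  # whitespace words as a token proxy
    return curve
```

A real harness would count tokens with the provider’s tokenizer rather than splitting on whitespace, but the shape of the curve is the point.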
🧠 Act III: Why Would AI Quit?
The results point to a deeper issue than just algorithmic failure. They suggest a behavioral bias in large language models:
* LLMs are trained on massive corpora where brevity correlates with correctness.
* Most training tasks reward fluency and coherence, not persistence under pressure.
* They don’t plan—at least not in the way humans do. Their “thinking” is reactive, not goal-directed.
In short, these models aren’t thinkers. They’re probabilistic mimics. Give them a problem they haven’t seen before, and they don’t extrapolate—they default to not trying.
🔄 Act IV: The Mirror We Didn’t Want
The Apple study does more than just critique model behavior—it critiques our expectations.
* We’ve conflated fluent language with deep understanding.
* We’ve assumed more parameters equals more cognitive resilience.
* We’ve measured intelligence with benchmarks that never asked the models to persist.
If our “thinking machines” quit under pressure, then maybe we’re the ones being fooled—by the very fluency we designed.
🔮 Act V: Where Do We Go From Here?
This study reopens a question long buried beneath the hype: What does it mean for an AI to think?
The future may not lie in ever-larger transformers, but in hybrid systems—blending symbolic planning, metacognitive scaffolding, or goal-based reasoning. Or perhaps we’ll build models that monitor their own effort and choose to persist, rather than bow out early.
If we want true reasoning, we’ll need to design for struggle, not just success. Intelligence isn’t revealed in how you answer the easy questions—it’s in what you do when the hard ones come.
📬 Final Reflection
We’ve been seduced by the eloquence of our machines. But as this study shows, elegance isn’t effort. We thought we were building thinkers—when really, we were training poets to bluff.
In a world where AI decisions increasingly shape lives, we can’t afford systems that bow out when the stakes rise. We need AIs that wrestle with complexity, not recoil from it.
Because if our most advanced machines refuse to think—maybe it’s time we did.
What do you think?
🧰 Vocabulary Key
* Inference Tokens: The tokens a model generates while working toward an answer at inference time. Fewer tokens often indicates shallower reasoning effort.
* Large Reasoning Model (LRM): A language model optimized for complex, multi-step reasoning, beyond surface-level fluency.
* Chain-of-Thought (CoT): A prompting method encouraging step-by-step reasoning.
* Accuracy Collapse: A sudden drop in performance as problem complexity increases.
* Early Quitting: When a model stops reasoning before reaching a meaningful conclusion, despite having the resources to continue.
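To make the Chain-of-Thought entry concrete, here is a minimal illustration. The wording is my own, not from the Apple paper; the only difference between the two prompts is the explicit invitation to reason step by step before answering.

```python
# Illustrative only: example prompt wording is an assumption, not Apple's.
direct_prompt = (
    "A farmer has 17 sheep; all but 9 run away. "
    "How many remain? Answer:"
)
cot_prompt = (
    "A farmer has 17 sheep; all but 9 run away. How many remain?\n"
    "Let's think step by step before answering."
)
```

That one extra line is the entire CoT technique—and, per the study, it helps on easy problems but does not prevent the collapse on hard ones.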
❓FAQs
Do all AI models collapse like this? Every reasoning model in the Apple study collapsed once complexity crossed a threshold. The study covered only a handful of frontier models, not all LLMs, but the pattern held across each one tested.
Why would a model quit instead of trying? They seem to have internalized shortcuts: if it’s hard and unfamiliar, go short and simple. It’s learned, not rational.
Can prompting fix this? Prompting helps on simple tasks—but once complexity increases, even CoT prompting fails to prevent collapse.
Is this the end of LLMs for reasoning? Not necessarily—but it signals a need for architectural change, not just better prompts or training data.
What’s the real takeaway? Fluency ≠ intelligence. If your model’s reasoning fails under pressure, it’s not reasoning—it’s guessing beautifully.