This post reviews the fascinating and sobering research paper The Illusion of Thinking by researchers at Apple. It digs into the inner workings of modern Large Reasoning Models (LRMs) like Claude, DeepSeek-R1, and OpenAI's o-series, and asks a tough question: do these models truly reason, or do they just look like they do?
The study introduces controlled puzzle environments—like Tower of Hanoi and River Crossing—to test reasoning models beyond benchmark accuracy. The result? A clear-eyed view of when these models help, when they fail, and where we may have overestimated their capabilities.
Why Reasoning Needs Better Evaluation
Most LLM evaluations focus on math and coding benchmarks, where models output a final answer. But these tests are often contaminated by training data and don't show how models reach their conclusions.
This paper breaks new ground by introducing puzzle-based environments that vary in difficulty and structure. By analyzing both the answers and the intermediate "thoughts" (reasoning traces), the researchers reveal much deeper insights into model behavior.
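To make this concrete, here is a minimal sketch of what such a controlled environment might look like, using Tower of Hanoi as an example. The names and structure below are illustrative assumptions rather than the authors' code; the key idea is that a single knob (the number of disks) controls complexity, and every proposed move can be checked deterministically.

```python
# A minimal sketch of a puzzle environment in the spirit of the paper's setup.
# Complexity is controlled by one parameter: the number of disks.

def initial_state(n_disks):
    """Three pegs; all disks start on peg 0, largest at the bottom."""
    return [list(range(n_disks, 0, -1)), [], []]

def is_legal(state, move):
    """A move (src, dst) is legal if src is non-empty and the moved disk
    is smaller than the disk currently on top of dst."""
    src, dst = move
    if not state[src]:
        return False
    return not state[dst] or state[src][-1] < state[dst][-1]

def apply_move(state, move):
    src, dst = move
    state[dst].append(state[src].pop())

def is_solved(state, n_disks):
    """Solved when every disk has reached the last peg."""
    return len(state[2]) == n_disks

def validate_solution(n_disks, moves):
    """Deterministically check a model-proposed move sequence, step by step."""
    state = initial_state(n_disks)
    for move in moves:
        if not is_legal(state, move):
            return False
        apply_move(state, move)
    return is_solved(state, n_disks)

# Example: the optimal 3-disk solution (7 moves) passes the check.
print(validate_solution(3, [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]))
```

Because the checker is exact, both the final answer and every intermediate move in a reasoning trace can be graded without any risk of benchmark contamination.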
Three Regimes of Performance
The researchers categorize performance into three distinct complexity regimes:
- Low Complexity: Standard LLMs (no "thinking") are faster, more accurate, and more efficient than LRMs.
- Medium Complexity: LRMs shine, thanks to their ability to reflect and revise through longer reasoning chains.
- High Complexity: Both types of models collapse completely—no correct answers, no meaningful reasoning.
This suggests that LRMs provide a benefit only within a narrow band of complexity, raising doubts about their general reasoning abilities.
The Counterintuitive Scaling Limit
One of the most surprising findings is that beyond a certain difficulty, LRMs actually begin to think less: their use of reasoning tokens declines even though ample token budget remains.
This suggests a limit to how well current models can scale reasoning effort in line with problem complexity—despite having the tools to do so.
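Conceptually, the measurement behind this finding is simple. The sketch below assumes a hypothetical query function that returns a model's reasoning trace for a puzzle of a given size, and a tokenizer for counting tokens; both are placeholder names for illustration, not part of the paper's code.

```python
# Illustrative sketch only: query() and tokenizer() are hypothetical stand-ins
# for "ask the model to solve an n-disk puzzle and return its reasoning trace"
# and "count the tokens in that trace".

def thinking_tokens_by_complexity(query, tokenizer, complexities):
    """Map puzzle complexity (e.g. number of disks) to reasoning-token usage."""
    usage = {}
    for n in complexities:
        trace = query(n)                  # the model's chain of thought for size n
        usage[n] = len(tokenizer(trace))  # how much "thinking" it actually did
    return usage

# The counterintuitive result: beyond a certain complexity, this curve bends
# downward -- the model spends fewer reasoning tokens on harder puzzles,
# well before exhausting its budget.
```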
What the Models Actually Think
By inspecting reasoning traces, the researchers observe three patterns:
- Overthinking (Low Complexity): Correct answers appear early but are buried under unnecessary steps.
- Late Discovery (Medium Complexity): Correct answers come after many incorrect ones—reasoning is inefficient but can work.
- Collapse (High Complexity): No correct reasoning appears at all; the models effectively give up, even with plenty of token budget remaining.
This offers a rare, internal view of how models behave cognitively—and exposes deep limitations.
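One way to picture this analysis: extract the candidate solutions a model proposes inside its trace, then ask where, if anywhere, the first correct one appears. The sketch below assumes such a list of candidates and a checker like the validate_solution sketch earlier; the extraction step and the names are assumptions for illustration, not the authors' pipeline.

```python
# A rough sketch of the trace analysis described above. `candidates` is assumed
# to be the ordered list of solutions a model proposed inside its reasoning
# trace; `is_correct` is a checker such as validate_solution() from the earlier
# sketch, applied to a fixed puzzle instance.

def first_correct_position(candidates, is_correct):
    """Return the relative position (0.0 = start of trace, 1.0 = end) of the
    first correct candidate, or None if no candidate is correct."""
    for i, candidate in enumerate(candidates):
        if is_correct(candidate):
            return i / max(len(candidates) - 1, 1)
    return None

# Reading the result, roughly in the paper's terms:
#   near 0.0 -> overthinking: the answer was found early, then re-litigated
#   near 1.0 -> late discovery: the answer emerged only after many wrong tries
#   None     -> collapse: no correct solution appears anywhere in the trace
```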
Even Algorithms Don’t Help
Perhaps the most damning result: even when the exact algorithm for a puzzle (like Tower of Hanoi) is provided, the models still fail at higher complexity.
This indicates that LRMs don't just struggle with finding solutions—they can’t reliably execute step-by-step instructions either.
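For context, the algorithm in question is not exotic. Below is the textbook recursive Tower of Hanoi procedure, shown here for illustration; the paper supplies the algorithm to the models in prompt form, whose exact wording is not reproduced here.

```python
# The standard recursive Tower of Hanoi procedure: the kind of explicit,
# step-by-step algorithm the models were given and still failed to execute
# at higher disk counts.

def hanoi_moves(n, src=0, dst=2, aux=1):
    """Yield the optimal sequence of (source_peg, target_peg) moves for n disks."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, aux, dst)  # park the n-1 smaller disks
    yield (src, dst)                              # move the largest disk
    yield from hanoi_moves(n - 1, aux, dst, src)  # bring the smaller disks over

# Executing it is purely mechanical: 2**n - 1 moves, no search required.
print(len(list(hanoi_moves(10))))  # 1023
```

Following such a procedure requires no insight, only faithful bookkeeping, which is what makes the failure at higher disk counts so striking.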
Why This Matters
LRMs promise to bring us closer to true artificial reasoning. But this paper shows that today's models:
- Struggle to generalize beyond training data
- Fail to scale reasoning with problem complexity
- Break down even when following known algorithms
If we want real progress, we may need new architectures or training strategies—not just more data or longer chains of thought.
For those building AI systems today, this paper is a must-read caution. Models that look like they are thinking are not necessarily doing the thinking. Reasoning may be the final frontier for AI, but we’re not there yet.
Limitations of the Study
While this research offers valuable insights, the authors acknowledge several limitations. The controlled puzzle environments—though useful for isolating reasoning complexity—represent only a narrow subset of possible reasoning tasks. They may not reflect the full diversity of real-world or knowledge-intensive problems where reasoning often interacts with memory, perception, and domain-specific knowledge.
Additionally, most experiments rely on black-box API access to proprietary models, limiting the researchers’ ability to probe internal model states or understand architectural causes of failure. The use of deterministic simulators also assumes that each step of reasoning can be perfectly validated. This clean evaluation setup may not translate well to messier, real-world settings where ambiguity, noise, or subjectivity play a role.
As such, while the findings highlight critical shortcomings in today’s LLM-based reasoning models, the generalizability of those insights beyond structured puzzles remains an open question.