Apple’s AI efforts have been methodical rather than headline-grabbing. On one side, Apple Machine Learning Research publishes its work—from computer vision to privacy—on machinelearning.apple.com, seeding the community with models, data, and insights. On the other, Apple Intelligence—introduced in October 2024—has quietly added generative features like writing assistance, notification summaries, image “Clean Up,” and a smarter Siri to iPhone, iPad, and Mac, all powered by a mix of on-device inference and Private Cloud Compute to keep data private. Neither front has delivered a blockbuster breakthrough yet, but together they establish a solid research foundation that could be the springboard for real AI impact in the future.
Apple Machine Learning Research’s recent paper “The Illusion of Thinking” is worth unpacking.
Large Language Models (LLMs) have evolved significantly, and recent variants designed specifically for complex tasks are often referred to as Large Reasoning Models (LRMs). These models, such as Claude 3.7 Sonnet Thinking and DeepSeek-R1, incorporate mechanisms like long Chain-of-Thought (CoT) generation and self-reflection, and they have posted promising results on various reasoning benchmarks. Some researchers have even framed them as meaningful steps toward more general artificial intelligence.
However, the fundamental capabilities and limitations of these LRMs remain poorly understood, and their performance gains may not be due solely to enhanced reasoning. Current evaluations rely predominantly on established mathematical and coding benchmarks. While valuable, this paradigm has significant drawbacks: it often suffers from data contamination and offers limited insight into the actual process or quality of the reasoning traces the models produce.
To address these limitations and gain a more rigorous understanding of how LRMs “think” and where their reasoning breaks down, the researchers adopted controllable puzzle environments.
Perspective: This shift in methodology is crucial. It reframes the core question we should be asking about AI. Instead of just asking “How smart is it?”, this approach allows us to ask, “How fragile is it?” By testing the models at the absolute limits of a task’s complexity, we can see not whether they succeed, but how and when they fail.
These environments allow for the precise manipulation of problem complexity while keeping the core logical structures consistent. This systematic approach enables not only the evaluation of final answers but also a detailed analysis of the intermediate reasoning steps, or “thoughts,” generated by the models (a minimal code sketch of such an environment follows the puzzle list below). The puzzles used include:
- Tower of Hanoi: A classic recursive puzzle involving moving disks between pegs.
- Checker Jumping: A one-dimensional puzzle to swap positions of red and blue checkers.
- River Crossing: A constraint satisfaction planning puzzle involving actors and agents crossing a river.
- Blocks World: A puzzle requiring rearrangement of blocks from an initial to a goal configuration.
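To make the setup concrete, here is a minimal sketch of what such a controllable environment could look like, using Tower of Hanoi as the example. This is an illustration under my own assumptions, not the paper’s code: `HanoiEnv` and `count_correct_prefix` are hypothetical names, but they capture the two properties the paper relies on, namely a single complexity knob (the number of disks) and a validator that can score a model’s moves step by step rather than only the final answer.

```python
class HanoiEnv:
    def __init__(self, n_disks: int):
        self.n = n_disks
        # Pegs 0, 1, 2; disks numbered 1 (smallest) to n (largest), all starting on peg 0.
        self.pegs = [list(range(n_disks, 0, -1)), [], []]

    def apply(self, src: int, dst: int) -> bool:
        """Apply one move if it is legal; return False on an illegal move."""
        if not self.pegs[src]:
            return False
        disk = self.pegs[src][-1]
        if self.pegs[dst] and self.pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        self.pegs[dst].append(self.pegs[src].pop())
        return True

    def solved(self) -> bool:
        return len(self.pegs[2]) == self.n


def count_correct_prefix(n_disks: int, moves: list[tuple[int, int]]) -> int:
    """Length of the valid prefix of a model's proposed move list."""
    env = HanoiEnv(n_disks)
    for i, (src, dst) in enumerate(moves):
        if not env.apply(src, dst):
            return i
    return len(moves)
```

A validator like this is what lets the study talk about where in a long solution a model first goes wrong, not merely whether the final state is reached.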
By comparing LRMs (like Claude 3.7 Sonnet with thinking and DeepSeek-R1) with their standard LLM counterparts (Claude 3.7 Sonnet and DeepSeek-V3) under equivalent inference compute, the study revealed a nuanced and sometimes surprising performance landscape:
The Accuracy Collapse is a Universal Limit: The study found that all state-of-the-art reasoning models tested exhibit a systematic decline in accuracy as problem complexity increases, culminating in a complete collapse to zero accuracy beyond a certain problem-specific threshold. This suggests a failure to develop generalizable problem-solving capabilities for these tasks, despite “sophisticated self-reflection mechanisms”.
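As a rough illustration of the kind of sweep behind this finding (my sketch, not the paper’s evaluation harness; `ask_model` and `is_correct` are placeholders for a model call and a full-solution checker):

```python
from typing import Callable, Dict, List, Tuple

Move = Tuple[int, int]

def accuracy_by_complexity(
    ask_model: Callable[[int], List[Move]],         # complexity N -> proposed move list
    is_correct: Callable[[int, List[Move]], bool],  # does the move list fully solve size N?
    complexities: range,
    trials: int = 25,
) -> Dict[int, float]:
    """Fraction of fully correct solutions at each complexity level."""
    results: Dict[int, float] = {}
    for n in complexities:
        solved = sum(is_correct(n, ask_model(n)) for _ in range(trials))
        results[n] = solved / trials
    return results
```

Plotted against N, the reported curves stay near-perfect at low complexity and then drop to zero past a threshold, rather than tapering off gradually.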
Perspective: The concept of an “Accuracy Collapse” is powerful. It shows that current models don’t degrade gracefully; they hit a wall and fall off a cliff. This points to a core limitation: they are masters of interpolation within their known data space, but they cannot reliably extrapolate to solve novel, more complex versions of the same problem structure.
A Counter-intuitive Scaling Limit in Effort: Perhaps most strikingly, as problems approach the complexity level where accuracy collapses, LRMs counter-intuitively begin to reduce their reasoning effort. This effort is measured by the usage of thinking tokens during the inference process. This reduction happens despite the models operating well below their maximum generation length limits, meaning they have ample token budget available but fail to use it for more complex problems.
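One way to picture the effort metric (again a sketch under assumptions of mine: the `<think>...</think>` delimiters and whitespace token counting stand in for whatever trace format and tokenizer the authors actually used):

```python
import re

def thinking_tokens(response: str) -> int:
    """Approximate token count of the thinking trace in one response."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    return len(match.group(1).split()) if match else 0

def mean_effort(responses_by_n: dict[int, list[str]]) -> dict[int, float]:
    """Average thinking-token usage per complexity level N."""
    return {
        n: sum(thinking_tokens(r) for r in rs) / max(len(rs), 1)
        for n, rs in responses_by_n.items()
    }
```

The counter-intuitive result is the shape of this curve: it rises with N as expected, then bends downward just before the accuracy collapse, well under the models’ generation limits.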
Perspective: To me, this “Paradox of Effort” is the most damning finding in the paper and the heart of the “Illusion of Thinking” argument. A genuine thinker, when faced with a harder problem, digs deeper and applies more mental effort. These models do the opposite. They give up before their resource budget is even taxed, strongly implying their “reasoning” is not a deliberate process but an artifact of pattern-matching that simply breaks down when patterns become too complex.
Limitations in Exact Computation and Algorithmic Execution: The study uncovered surprising limitations in the LRMs’ ability to perform exact computations and follow explicit instructions. Specifically, for the Tower of Hanoi puzzle, providing the explicit, recursive solution algorithm in the prompt did not significantly improve performance or prevent the accuracy collapse at higher complexities.
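For reference, the algorithm in question is the textbook recursion below (my rendering, not the paper’s prompt text verbatim). Executing it mechanically produces a correct move list for any N, which is precisely what the models could not do reliably at higher complexities even with the recipe in front of them.

```python
def hanoi(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Optimal Tower of Hanoi move sequence as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, src, dst, aux)    # park the n-1 smaller disks on the spare peg
        + [(src, dst)]                 # move the largest disk to the target peg
        + hanoi(n - 1, aux, src, dst)  # stack the n-1 smaller disks on top of it
    )

assert len(hanoi(10)) == 2**10 - 1  # 1,023 moves for N = 10
```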
Perspective: This finding exposes the vast gap between pattern correlation and true logical execution. The models’ inability to reliably follow a provided algorithm is a huge red flag. It demonstrates they can’t robustly handle symbolic manipulation or maintain state, which are foundational to programming and reasoning. They can generate output that looks right, but they can’t necessarily “run” the logic in their own “minds.”
Inconsistent Reasoning Across Puzzle Types: The models demonstrated very different behaviors and failure patterns across puzzle types. For instance, Claude 3.7 Sonnet with thinking could produce sequences of over 100 correct moves for the Tower of Hanoi with N=10 (a solution whose optimal length is 1,023 moves), while failing to provide more than a few correct moves in the River Crossing puzzle even for N=3, which requires only 11 moves. This inconsistency suggests that performance may be shaped by how much exposure the training data gave the models to specific problem types or structures.
These findings highlight both the advancements and the significant barriers facing current LRMs. The systematic collapse at higher complexity, the paradoxical reduction in effort, and the struggles with exact execution suggest that current approaches may be encountering fundamental limitations.
Final Perspective: This research provides powerful validation for Apple’s methodical AI strategy. If Apple understands these fundamental limitations, that would help explain why it isn’t rushing to ship a “do-everything” LLM. Focusing instead on a solid foundation of on-device processing for practical, well-defined tasks looks like a far more robust long-term strategy: playing the long game, prioritizing reliability over headline-grabbing but ultimately fragile capabilities. The path to more robust AI may require advances beyond simply scaling up current paradigms, potentially including new architectures for symbolic manipulation and algorithmic consistency.