As models got better at maths and logic, a fair question grew louder: are they reasoning, or just recognising problems they've effectively seen before? A 2024 paper from Apple researchers, *GSM-Symbolic*, ran a clever test to find out.
(Our plain-language summary of the study; the paper is linked.)
The experiment
The team took a standard set of grade-school maths problems (the GSM8K benchmark) and turned them into templates, swapping the names and numbers for different ones while keeping the underlying problem identical. A model that truly understands the maths should be unaffected.
They also tried adding a single irrelevant but related sentence to each problem — the kind of detail a person would simply ignore.
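To make the setup concrete, here is a minimal sketch of the idea in Python. It is not the paper's code: the template, the list of names, the number ranges and the distractor sentence are all made up for illustration. It shows how one problem can become many surface variants that share the same arithmetic, with an optional irrelevant sentence added.

```python
import random

# Hypothetical template, names and distractor, invented for this sketch
# (not taken from the paper's dataset).
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{name} then gives away {c} apples. "
    "How many apples does {name} have left?"
)
NAMES = ["Sophie", "Liam", "Amara", "Kenji"]
DISTRACTOR = "Five of the apples are a bit smaller than average."


def make_variant(rng, with_distractor=False):
    """Return (question_text, correct_answer) for one random instantiation."""
    name = rng.choice(NAMES)
    a = rng.randint(5, 50)
    b = rng.randint(5, 50)
    c = rng.randint(1, a + b)  # never give away more than were picked
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    if with_distractor:
        # Slot the irrelevant sentence in just before the final question.
        body, final_question = question.rsplit(". ", 1)
        question = f"{body}. {DISTRACTOR} {final_question}"
    # The answer depends only on the numbers, never on the name or distractor.
    return question, a + b - c


rng = random.Random(0)  # fixed seed so the variants are reproducible
for _ in range(3):
    text, answer = make_variant(rng, with_distractor=True)
    print(text, "->", answer)
```

The point of the design is that the correct answer is computed from the same formula every time, so any change in a model's accuracy across variants comes from the surface wording, not the maths.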
What they found
- Just changing the numbers caused measurable drops in accuracy across many leading models. A genuine reasoner shouldn't care what the numbers are.
- Adding one irrelevant sentence caused large accuracy drops (the paper reports up to 65% for some models). Models were pulled off course by information a child would dismiss.
- Performance got worse, and less consistent, as problems gained more steps.
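One way to picture "less reliable" is to score a model on several reshuffled versions of the same test and look at how much the accuracy moves. The sketch below is an assumption about how you might do that yourself, not the paper's evaluation code; `accuracy_spread` is a hypothetical helper and the pass/fail lists are invented placeholders, not results from the study.

```python
from statistics import mean, pstdev


def accuracy_spread(results_per_set):
    """Summarise accuracy across several variant sets.

    results_per_set: one list of booleans per variant set, where each
    boolean records whether the model answered that question correctly.
    """
    accuracies = [sum(results) / len(results) for results in results_per_set]
    return {
        "mean": mean(accuracies),
        "min": min(accuracies),
        "max": max(accuracies),
        "stdev": pstdev(accuracies),  # spread across the variant sets
    }


# Invented placeholder outcomes for three variant sets of four questions each.
example = [
    [True, True, True, False],
    [True, False, True, False],
    [True, True, True, True],
]
print(accuracy_spread(example))
```

A model that reasons robustly should show a small gap between `min` and `max`; a wide gap suggests its score depends on which surface version of the questions it happened to get.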
The interpretation: a lot of what looks like reasoning is closer to sophisticated pattern-matching against training data — impressive, but more fragile than it appears.
The models aren't "thinking" the way the demos suggest. They're extraordinary pattern machines — and patterns break in predictable ways.
Why this matters for real products
This isn't a reason to avoid AI — it's a reason to design around its limits:
- Don't assume a confident, well-formatted answer is a correct one.
- Test on your edge cases and reworded inputs, not just the happy path.
- Keep a human in the loop wherever an error is expensive.
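As a rough illustration of the checklist above, the sketch below asks the same question in several rewordings and flags it for human review when the answers disagree. Everything here is an assumption for demonstration: `check_consistency` is a hypothetical helper, `ask_model` stands for whatever function calls your model, and `toy_model` is a deliberately naive stand-in that adds up every number it sees.

```python
import re
from collections import Counter


def check_consistency(ask_model, rewordings, min_agreement=1.0):
    """Ask every rewording and report how often the answers agree."""
    answers = [ask_model(question) for question in rewordings]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    agreement = top_count / len(answers)
    return {
        "answer": top_answer,
        "agreement": agreement,
        # Disagreement across rewordings is a cue to involve a person.
        "needs_review": agreement < min_agreement,
    }


def toy_model(question):
    """Naive stand-in for a real model call: sums every number in the text."""
    return sum(int(n) for n in re.findall(r"\d+", question))


rewordings = [
    "Ana has 3 pens and buys 4 more. How many pens does she have?",
    "Ana buys 4 pens, having started with 3. How many pens does she have?",
    # Same problem plus one irrelevant detail.
    "Ana has 3 pens and buys 4 more. 2 of the pens are blue. "
    "How many pens does she have?",
]
print(check_consistency(toy_model, rewordings))
```

Agreement across rewordings does not prove an answer is right, since a model can be consistently wrong, so this works best alongside known-answer test cases rather than instead of them.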
The same theme explains why so many agentic AI projects underdeliver: capability is real, but reliability has to be engineered, not assumed.
Sources
- Mirzadeh et al. / Apple (2024) — *GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in LLMs*