Research Papers·June 23, 2026·5 min read

Do “reasoning” models actually reason? What Apple’s GSM-Symbolic paper found

A 2024 Apple study probed whether top models truly reason or just pattern-match. Changing the numbers in a maths problem — or adding an irrelevant sentence — made accuracy drop.

As models got better at maths and logic, a fair question grew louder: are they reasoning, or just recognising problems they've effectively seen before? A 2024 paper from Apple researchers, *GSM-Symbolic*, ran a clever test to find out.

(Our plain-language summary of the study; the paper is linked.)

The experiment

The team took a standard set of grade-school maths problems (the GSM8K benchmark) and turned them into templates, so that names and numbers could be swapped for different ones while the underlying problem stayed identical. A model that truly understands the maths should be unaffected by such meaning-preserving changes.

They also tried adding a single sentence to each problem that sounds related but has no bearing on the answer: the kind of detail a person would simply ignore.
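The two perturbations described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the template, the name list, the number ranges, the distractor sentence and the `make_variant` helper are all made up for this example.

```python
import random

# Hypothetical template: the names and numbers vary, the arithmetic does not.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{name} then gives away {c} apples. How many apples does {name} have left?"
)
NAMES = ["Sophie", "Omar", "Mei", "Lucas"]

# Illustrative distractor: sounds related, but does not change the answer.
DISTRACTOR = "Five of the apples were a bit smaller than average."


def make_variant(rng: random.Random, add_distractor: bool = False):
    """Return (question, correct_answer) for one random instantiation."""
    name = rng.choice(NAMES)
    a = rng.randint(5, 50)
    b = rng.randint(5, 50)
    c = rng.randint(1, a + b)  # never give away more apples than were picked
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    if add_distractor:
        # Slot the irrelevant sentence in just before the final question.
        body, final_question = question.rsplit(". ", 1)
        question = f"{body}. {DISTRACTOR} {final_question}"
    return question, a + b - c


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(make_variant(rng, add_distractor=True))
```

Every variant has a known correct answer, so a model's accuracy on the original wording can be compared directly with its accuracy on the reworded or distracted versions.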

What they found

  • Just changing the numbers caused measurable drops in accuracy across many leading models. A genuine reasoner shouldn't care what the numbers are.
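  • Adding one irrelevant sentence caused large accuracy drops, up to 65% for some models by the paper's figures. Models were pulled off course by information a child would dismiss.
  • Accuracy fell and became more variable as problems grew more complex, for instance when extra clauses were added.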

The authors' interpretation: a lot of what looks like reasoning is closer to sophisticated pattern-matching against training data. That is impressive, but more fragile than it appears.

The models aren't "thinking" the way the demos suggest. They're extraordinary pattern machines — and patterns break in predictable ways.

Why this matters for real products

This isn't a reason to avoid AI — it's a reason to design around its limits:

  • Don't assume a confident, well-formatted answer is a correct one.
  • Test on your edge cases and reworded inputs, not just the happy path.
  • Keep a human in the loop wherever an error is expensive.
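One way to act on the second point is to score the same system on original and reworded inputs and look at the gap. The sketch below assumes a hypothetical `ask_model` callable that takes a question and returns an integer answer. The `memorising_stub` is a toy stand-in that only knows one exact phrasing, not a real model, and the test cases are invented.

```python
from typing import Callable, Iterable, Tuple

Case = Tuple[str, int]  # (question, expected answer)


def accuracy(ask_model: Callable[[str], int], cases: Iterable[Case]) -> float:
    """Fraction of cases where the model's answer matches the expected one."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one test case")
    correct = sum(1 for question, expected in cases if ask_model(question) == expected)
    return correct / len(cases)


def robustness_gap(
    ask_model: Callable[[str], int],
    original: Iterable[Case],
    perturbed: Iterable[Case],
) -> float:
    """Accuracy on the original inputs minus accuracy on the perturbed ones."""
    return accuracy(ask_model, original) - accuracy(ask_model, perturbed)


def memorising_stub(question: str) -> int:
    """Toy 'model' that has memorised a single phrasing (illustration only)."""
    known = {"Tom has 3 pens and buys 4 more. How many pens does Tom have?": 7}
    return known.get(question, -1)


original = [("Tom has 3 pens and buys 4 more. How many pens does Tom have?", 7)]
perturbed = [("Ana has 5 pens and buys 9 more. How many pens does Ana have?", 14)]

if __name__ == "__main__":
    print("gap:", robustness_gap(memorising_stub, original, perturbed))
```

A gap near zero suggests the system copes with rewording. A large gap is a signal to add review steps or guardrails before relying on it where errors are expensive.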

It's the same theme behind why so many agentic AI projects underdeliver: capability is real, but reliability has to be engineered, not assumed.

Sources

Written by ivector