Do “reasoning” models actually reason? What Apple’s GSM-Symbolic paper found

As models got better at maths and logic, a fair question grew louder: are they reasoning, or just recognising problems they've effectively seen before? A 2024 paper from Apple researchers, *GSM-Symbolic*, ran a clever test to find out.

(This is our plain-language summary of the study; the paper is listed under Sources.)

The experiment

The team took GSM8K, a standard benchmark of grade-school maths problems, and turned its questions into templates, so that names and numbers could be swapped for different ones while the underlying problem stayed identical. A model that truly understands the maths should be unaffected.

They also tried adding a single sentence that sounds relevant but has no bearing on the answer, the kind of detail a person would simply ignore.
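To make the setup concrete, here is a minimal sketch of how one such template might work. The template text, the name list, the number ranges, and the distractor sentence are all invented for illustration and are not taken from the paper; the point is only that every variant has the same structure and an answer that can be computed from its values.

```python
import random

# Hypothetical template: names and numbers become placeholders, and the
# correct answer is computed from whichever values get filled in.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{name} gives away {c} apples. How many apples does {name} have left?"
)
NAMES = ["Sophie", "Liam", "Amara", "Kenji"]

# Illustrative distractor: it sounds related but does not change the answer.
DISTRACTOR = "Five of the apples are a bit smaller than average."


def make_variant(rng, add_distractor=False):
    """Return (question, answer) for one randomly filled-in variant."""
    a = rng.randint(5, 50)
    b = rng.randint(5, 50)
    c = rng.randint(1, a + b)  # never give away more apples than were picked
    name = rng.choice(NAMES)
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    if add_distractor:
        # Slot the irrelevant sentence in just before the final question.
        body, final_question = question.rsplit(". ", 1)
        question = f"{body}. {DISTRACTOR} {final_question}"
    return question, a + b - c


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the variants are reproducible
    for _ in range(3):
        question, answer = make_variant(rng, add_distractor=True)
        print(question, "->", answer)
```

Because the answer is derived from the same values that fill the template, any number of variants can be generated and scored automatically, which is what makes it possible to compare a model's accuracy across rewordings of one problem.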

What they found

Just changing the numbers caused measurable drops in accuracy across many leading models. A genuine reasoner shouldn't care what the numbers are.
Adding one irrelevant sentence caused large accuracy drops, up to 65% for some models. They were pulled off course by information a child would dismiss.
Accuracy fell, and became more variable, as problems gained more clauses.

The interpretation: a lot of what looks like reasoning is closer to sophisticated pattern-matching against training data — impressive, but more fragile than it appears.

The models aren't "thinking" the way the demos suggest. They're extraordinary pattern machines — and patterns break in predictable ways.

Why this matters for real products

This isn't a reason to avoid AI — it's a reason to design around its limits:

Don't assume a confident, well-formatted answer is a correct one.
Test on your edge cases and reworded inputs, not just the happy path.
Keep a human in the loop wherever an error is expensive.
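The second and third points can be turned into a small check, sketched below under stated assumptions. `ask_model` is a hypothetical stand-in for however your product calls its model, and exact string matching is a deliberately crude scoring rule; real answers usually need normalising first.

```python
from typing import Callable, Iterable, Tuple


def consistency_report(
    ask_model: Callable[[str], str],
    variants: Iterable[Tuple[str, str]],
) -> dict:
    """Score a model on reworded variants of the same underlying problem.

    ask_model is a hypothetical callable: prompt in, answer text out.
    variants is an iterable of (prompt, expected_answer) pairs.
    """
    results = []
    for prompt, expected in variants:
        answer = ask_model(prompt).strip()
        results.append((prompt, answer == expected))
    passed = sum(ok for _, ok in results)
    return {
        "accuracy": passed / len(results) if results else 0.0,
        "failures": [prompt for prompt, ok in results if not ok],
        # Any disagreement across rewordings is a signal to escalate.
        "needs_human_review": passed < len(results),
    }


if __name__ == "__main__":
    # Stub model for demonstration: it is thrown off by the word "smaller".
    def fake_model(prompt: str) -> str:
        return "2" if "smaller" in prompt else "7"

    variants = [
        ("Ana has 3 pens and buys 4 more. How many pens?", "7"),
        ("Ben has 3 pens and buys 4 more. Two are smaller. How many pens?", "7"),
    ]
    print(consistency_report(fake_model, variants))
```

The design choice here is to treat disagreement between rewordings as the trigger for human review, rather than relying on how confident any single answer sounds.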

It's the same theme behind why so many agentic AI projects underdeliver: capability is real, but reliability has to be engineered, not assumed.

Sources

Mirzadeh et al. / Apple (2024) — *GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in LLMs*

Written by ivector


