LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation

Adam Mahdi; Andrew M. Bean; Harry Mayne; Jude Khouja; Karolina Korgul; Lingyi Yang; Ryan Othniel Kearns; Simeon Hellsten; Vlad A. Neacsu

arxiv: 2503.02972 · v7 · submitted 2025-03-04 · 💻 cs.CL · cs.AI

Jude Khouja, Lingyi Yang, Karolina Korgul, Simeon Hellsten, Vlad A. Neacsu, Harry Mayne, Ryan Othniel Kearns, Andrew M. Bean, Adam Mahdi

Pith reviewed 2026-05-23 01:02 UTC · model grok-4.3

keywords: reasoning benchmark · language models · obfuscation · linguistics olympiad · model shortcuts · memorization · disentangling reasoning · LINGOLY-TOO

The pith

Obfuscating Linguistics Olympiad problems drops the best reasoning models' scores from about 0.59 to 0.48, which the authors attribute to blocked knowledge shortcuts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Frontier language models often solve reasoning problems by drawing on stored knowledge or memorised patterns rather than step-by-step logic. LINGOLY-TOO applies expert-designed templatised orthographic obfuscations to 1,203 Linguistics Olympiad questions, altering surface forms while keeping the required solution steps intact. Experiments show consistent performance drops on the obfuscated set, with the strongest models falling from roughly 0.59 to 0.48. The benchmark therefore supplies a cleaner signal of genuine reasoning ability by reducing the chance that high scores reflect prior exposure or factual recall.

Core claim

The paper introduces LINGOLY-TOO, a benchmark of 1,203 questions and 6,995 sub-questions created by applying templatised orthographic obfuscations to original Linguistics Olympiad problems. These changes preserve the underlying solution logic but lower the chance that models can solve items through knowledge or memorisation. Experiments demonstrate that models exploit shortcuts on the original questions, producing markedly lower scores once the obfuscations are applied, even for the best reasoning models.

What carries the argument

Templatised orthographic obfuscation: expert-designed, repeatable modifications to the spelling and writing conventions of each problem that keep the required reasoning steps unchanged.
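
To make the idea concrete, here is a minimal sketch of an orthographic substitution. It assumes a random one-to-one letter permutation as a stand-in for a template; the paper's templates are expert-designed, and nothing here reproduces them. The names `make_permutation` and `obfuscate` and the toy word list are hypothetical.

```python
import random
import string


def make_permutation(seed: int, alphabet: str = string.ascii_lowercase) -> dict:
    """Build a reproducible one-to-one letter mapping.

    Assumption: a random permutation stands in for an expert-designed template.
    """
    rng = random.Random(seed)
    shuffled = list(alphabet)
    rng.shuffle(shuffled)
    return dict(zip(alphabet, shuffled))


def obfuscate(text: str, mapping: dict) -> str:
    """Substitute letters one by one, keeping case and leaving other characters alone."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in mapping:
            sub = mapping[lower]
            out.append(sub.upper() if ch.isupper() else sub)
        else:
            out.append(ch)
    return "".join(out)


mapping = make_permutation(seed=0)
words = ["kala", "kalat", "talo", "talot"]  # toy data, not taken from the benchmark
obfuscated = [obfuscate(w, mapping) for w in words]
print(list(zip(words, obfuscated)))
```

Because the mapping is a bijection, the structural pattern in the toy data (one suffix marks the second form of each pair) survives the substitution, while the surface strings no longer match anything a model could have seen verbatim. A random permutation ignores phonological constraints that a real template would presumably respect.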

If this is right

Existing reasoning benchmarks overestimate model capabilities because they allow knowledge-based shortcuts.
Performance on LINGOLY-TOO supplies a stricter test of whether a model performs genuine reasoning.
Even the strongest current models remain sensitive to the obfuscations, indicating they still rely on non-reasoning cues.
The benchmark can be used to track whether future training methods improve robustness to surface changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar obfuscation techniques could be applied to other reasoning domains to test whether reported gains reflect logic or pattern matching.
Training procedures might need explicit pressure against surface-form memorisation to close the observed gap.
The size of the performance drop offers a quantitative signal that could guide model selection for tasks requiring novel problem solving.
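
As a worked example of that signal, the sketch below computes the absolute and relative drop from the two approximate scores quoted in the abstract. The function name `score_drop` is hypothetical, and the inputs are the rounded headline figures, not per-model results.

```python
def score_drop(original: float, obfuscated: float) -> tuple:
    """Return (absolute drop, drop as a fraction of the original score)."""
    if original <= 0:
        raise ValueError("original score must be positive")
    absolute = original - obfuscated
    relative = absolute / original
    return absolute, relative


# Approximate headline scores from the abstract.
absolute, relative = score_drop(0.59, 0.48)
print(f"absolute drop: {absolute:.2f}, relative drop: {relative:.1%}")
# prints "absolute drop: 0.11, relative drop: 18.6%"
```

On these rounded figures the best models lose about 0.11 points, or a little under a fifth of their original score.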

Load-bearing premise

The obfuscations preserve the underlying solution logic while reducing the chance that problems can be solved through knowledge or memorisation.

What would settle it

If a model scored equally on the original and obfuscated versions of the same problems, that would indicate it solves them through the preserved logic rather than through surface shortcuts.

Original abstract

Frontier language models demonstrate increasing ability at solving reasoning problems, but their performance is often inflated by circumventing reasoning and instead relying on their expanding knowledge and memorisation capacity. We introduce LINGOLY-TOO, a challenging reasoning benchmark of 1,203 questions and a total of 6,995 sub-questions that counters these shortcuts by applying expert-designed obfuscations to Linguistics Olympiad problems. These obfuscations preserve the underlying solution logic while reducing the likelihood problems are solvable with via knowledge or memorisation. Our experiments show that models exploit shortcuts on the original question as performance markedly drop upon obfuscation. Even the best reasoning models remain highly sensitive, with scores dropping from around 0.59 on original problems to 0.48 after obfuscation. LINGOLY-TOO disentangles reasoning from knowledge, offering a clearer measure of true reasoning capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LINGOLY-TOO shows a clear performance drop after obfuscation but lacks the human baseline needed to confirm the drop isolates shortcut removal rather than added difficulty.

The main takeaway is that this benchmark applies expert-designed orthographic obfuscations to 1,203 linguistics olympiad problems and records a drop in model scores from roughly 0.59 to 0.48. The authors treat that gap as evidence that models were leaning on memorization or surface shortcuts in the original set. The work is new in its specific combination of templatised obfuscation with these particular problems, and the scale (nearly 7,000 sub-questions) gives it some weight as a test set.

The paper does a straightforward job laying out the construction process and reporting the aggregate results across several models. That part is useful for anyone tracking how current systems handle altered surface forms.

The central assumption, however, is that the obfuscations leave the solution logic unchanged for any competent solver. The abstract states the obfuscations are expert-designed to preserve that logic, but the provided details do not include a human accuracy comparison on the obfuscated items. Without that check, the performance drop could simply reflect harder pattern detection rather than blocked knowledge shortcuts. The stress-test note identifies this exact gap, and it holds up on the information given. Minor issues include the lack of error bars or statistical tests in the summary, though those may be present in the full tables.

This paper is aimed at people building or critiquing reasoning benchmarks for language models. Readers who care about evaluation design will find the concrete example worth examining, even if they end up wanting more validation data. It is solid enough on its own terms to merit a serious referee rather than a desk reject, mainly because the question it raises is live and the benchmark itself is reproducible in principle. I would send it out for review with the expectation that the human baseline point would need addressing.

Referee Report

2 major / 1 minor

Summary. The paper introduces LINGOLY-TOO, a benchmark of 1,203 Linguistics Olympiad questions (6,995 sub-questions) that applies expert-designed templatised orthographic obfuscations to reduce reliance on knowledge or memorisation while preserving solution logic. Experiments on frontier models report a performance drop from ~0.59 on original problems to ~0.48 on obfuscated versions, interpreted as evidence that models exploit shortcuts on the originals and that the benchmark better isolates true reasoning.

Significance. If the obfuscations are shown to preserve solution procedures without increasing intrinsic difficulty, LINGOLY-TOO would provide a useful addition to reasoning benchmarks by offering a concrete way to measure sensitivity to surface-form changes. The scale (over 1,200 problems) and focus on orthographic templatisation are strengths for reproducibility and targeted testing.

major comments (2)

[Abstract] The central interpretation (that the 0.59→0.48 drop demonstrates models were using knowledge shortcuts on the original Linguistics Olympiad problems) rests on the untested claim that obfuscations 'preserve the underlying solution logic.' No human accuracy baseline on the obfuscated subset is reported, leaving open the possibility that the drop reflects harder parsing or pattern detection for any solver rather than removal of memorisation.
[Abstract] Reported performance drops lack error bars, details on the number of models or runs, or any statistical tests, making it impossible to assess whether the observed difference is reliable or load-bearing for the disentanglement claim.
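
One way the second comment could be addressed is a paired bootstrap over per-question scores. The sketch below is an illustration under stated assumptions, not the authors' analysis: the function name `paired_bootstrap_ci` is hypothetical, and the scores are synthetic values generated in the script, not results from the paper.

```python
import random
import statistics


def paired_bootstrap_ci(original, obfuscated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-question difference (original - obfuscated)."""
    if len(original) != len(obfuscated) or not original:
        raise ValueError("need two non-empty, equally long score lists")
    diffs = [o - b for o, b in zip(original, obfuscated)]
    rng = random.Random(seed)
    # Resample the paired differences with replacement and record each resample mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), low, high


# Synthetic per-question scores in [0, 1]; NOT results from the paper.
data_rng = random.Random(1)
orig_scores = [data_rng.random() for _ in range(200)]
obf_scores = [max(0.0, s - 0.1 * data_rng.random()) for s in orig_scores]

mean_diff, ci_low, ci_high = paired_bootstrap_ci(orig_scores, obf_scores)
print(f"mean drop {mean_diff:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

Pairing matters here because each obfuscated item has an original counterpart; resampling the differences rather than the two score lists independently keeps that correspondence. An interval that excludes zero would support the reliability of the drop, though it would not by itself answer the first comment about intrinsic difficulty.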

minor comments (1)

[Abstract] Abstract contains a clear typo: 'solvable with via knowledge or memorisation.'

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and note the revisions we will make to the manuscript.

Referee: [Abstract] The central interpretation (that the 0.59→0.48 drop demonstrates models were using knowledge shortcuts on the original Linguistics Olympiad problems) rests on the untested claim that obfuscations 'preserve the underlying solution logic.' No human accuracy baseline on the obfuscated subset is reported, leaving open the possibility that the drop reflects harder parsing or pattern detection for any solver rather than removal of memorisation.

Authors: The obfuscations were constructed via expert-designed templates that replace surface forms while leaving the required reasoning steps unchanged; this construction process is described in Section 3. We agree that the absence of a human baseline on the obfuscated problems leaves the preservation claim untested in the current manuscript. We will revise the abstract to qualify the interpretation accordingly and add an explicit limitation statement. Human evaluation on the obfuscated set is planned for a follow-up release of the benchmark.
revision: partial

Referee: [Abstract] Reported performance drops lack error bars, details on the number of models or runs, or any statistical tests, making it impossible to assess whether the observed difference is reliable or load-bearing for the disentanglement claim.

Authors: The abstract is a high-level summary; the full paper reports results across multiple frontier models with repeated runs and presents error bars in the figures and tables of Section 4, along with statistical comparisons. We will update the abstract to include approximate error ranges, state the number of models and runs, and note that the observed difference reaches statistical significance per the analysis in the main text.
revision: yes

Circularity Check

0 steps flagged

No circularity; benchmark is externally constructed and evaluated

The paper presents LINGOLY-TOO as an externally designed benchmark using expert obfuscations on existing Linguistics Olympiad problems. The central result is an empirical performance comparison (original vs. obfuscated scores) with no equations, fitted parameters, or derivation steps that reduce to self-defined inputs. No load-bearing self-citations or uniqueness theorems are invoked in the provided text to justify the obfuscation preservation claim; the benchmark is treated as an independent test set. This is the standard case of a self-contained empirical evaluation with no internal circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that obfuscations preserve logic without introducing new fitted parameters or entities.

axioms (1)

domain assumption: Expert-designed obfuscations preserve the underlying solution logic while reducing knowledge-based solvability.
Directly stated in the abstract as the mechanism enabling disentanglement.
