
arxiv: 2503.02972 · v7 · submitted 2025-03-04 · 💻 cs.CL · cs.AI

LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation

Pith reviewed 2026-05-23 01:02 UTC · model grok-4.3

keywords: reasoning benchmark · language models · obfuscation · linguistics olympiad · model shortcuts · memorization · disentangling reasoning · LINGOLY-TOO

The pith

Obfuscating Linguistics Olympiad problems drops the best reasoning models' scores from around 0.59 to 0.48, which the authors attribute to blocked knowledge shortcuts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Frontier language models often solve reasoning problems by drawing on stored knowledge or memorised patterns rather than step-by-step logic. LINGOLY-TOO applies expert-designed templatised orthographic obfuscations to 1,203 Linguistics Olympiad questions, altering surface forms while keeping the required solution steps intact. Experiments show consistent performance drops on the obfuscated set, with the strongest models falling from roughly 0.59 to 0.48. The benchmark therefore supplies a cleaner signal of genuine reasoning ability by reducing the chance that high scores reflect prior exposure or factual recall.

Core claim

The paper introduces LINGOLY-TOO, a benchmark of 1,203 questions and 6,995 sub-questions created by applying templatised orthographic obfuscations to original Linguistics Olympiad problems. These changes preserve the underlying solution logic but lower the chance that models can solve items through knowledge or memorisation. Experiments demonstrate that models exploit shortcuts on the original questions, producing markedly lower scores once the obfuscations are applied, even for the best reasoning models.

What carries the argument

Templatised orthographic obfuscation: expert-designed, repeatable modifications to the spelling and writing conventions of each problem that keep the required reasoning steps unchanged.
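
To make the idea concrete, here is a minimal sketch of what a templatised orthographic obfuscation could look like. It assumes an obfuscation is a one-to-one remapping of graphemes within classes (vowels to vowels, consonants to consonants); that assumption, the grapheme inventory, the example words, and the function names `make_grapheme_map` and `obfuscate` are all hypothetical and are not taken from the paper.

```python
import random

# Hypothetical grapheme inventory for an invented toy language.
VOWELS = ["a", "e", "i", "o", "u"]
CONSONANTS = ["k", "m", "n", "p", "t"]

def make_grapheme_map(classes, seed):
    """Build a one-to-one map by shuffling graphemes within each class."""
    rng = random.Random(seed)
    mapping = {}
    for graphemes in classes:
        shuffled = list(graphemes)
        rng.shuffle(shuffled)
        mapping.update(zip(graphemes, shuffled))
    return mapping

def obfuscate(text, mapping):
    """Rewrite text grapheme by grapheme, trying longer graphemes first."""
    keys = sorted(mapping, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for g in keys:
            if text.startswith(g, i):
                out.append(mapping[g])
                i += len(g)
                break
        else:
            # Characters outside the inventory (spaces, punctuation) are kept.
            out.append(text[i])
            i += 1
    return "".join(out)

# Different seeds yield different surface forms of the same problem.
mapping = make_grapheme_map([VOWELS, CONSONANTS], seed=7)
words = ["kamu", "kamuti", "nepo"]  # invented word forms
obfuscated = [obfuscate(w, mapping) for w in words]
print(list(zip(words, obfuscated)))
```

Because the map is one-to-one, any pattern a solver could find in the original forms (shared stems, recurring affixes) is still present in the rewritten forms, while the spellings no longer match anything a model might have seen for a real language.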

If this is right

  • Existing reasoning benchmarks overestimate model capabilities because they allow knowledge-based shortcuts.
  • Performance on LINGOLY-TOO supplies a stricter test of whether a model performs genuine reasoning.
  • Even the strongest current models remain sensitive to the obfuscations, indicating they still rely on non-reasoning cues.
  • The benchmark can be used to track whether future training methods improve robustness to surface changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar obfuscation techniques could be applied to other reasoning domains to test whether reported gains reflect logic or pattern matching.
  • Training procedures might need explicit pressure against surface-form memorisation to close the observed gap.
  • The size of the performance drop offers a quantitative signal that could guide model selection for tasks requiring novel problem solving.
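
As a small worked example of that quantitative signal, the headline figures can be turned into absolute and relative drops. The two inputs are the rounded values quoted in the abstract ("around 0.59" and 0.48), so the derived numbers are approximate; the variable names are illustrative.

```python
# Rounded headline scores quoted in the abstract (approximate values).
original = 0.59
obfuscated = 0.48

absolute_drop = original - obfuscated
relative_drop = absolute_drop / original

print(f"absolute drop: {absolute_drop:.2f}")  # absolute drop: 0.11
print(f"relative drop: {relative_drop:.1%}")  # relative drop: 18.6%
```

On these rounded figures, obfuscation removes a little under a fifth of the best models' original score.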

Load-bearing premise

The obfuscations preserve the underlying solution logic while reducing the chance that problems can be solved through knowledge or memorisation.

What would settle it

A model that scores equally on the original and obfuscated versions of the same problems would indicate it is solving them through preserved logic rather than surface shortcuts.

Original abstract

Frontier language models demonstrate increasing ability at solving reasoning problems, but their performance is often inflated by circumventing reasoning and instead relying on their expanding knowledge and memorisation capacity. We introduce LINGOLY-TOO, a challenging reasoning benchmark of 1,203 questions and a total of 6,995 sub-questions that counters these shortcuts by applying expert-designed obfuscations to Linguistics Olympiad problems. These obfuscations preserve the underlying solution logic while reducing the likelihood problems are solvable with via knowledge or memorisation. Our experiments show that models exploit shortcuts on the original question as performance markedly drop upon obfuscation. Even the best reasoning models remain highly sensitive, with scores dropping from around 0.59 on original problems to 0.48 after obfuscation. LINGOLY-TOO disentangles reasoning from knowledge, offering a clearer measure of true reasoning capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LINGOLY-TOO, a benchmark of 1,203 Linguistics Olympiad questions (6,995 sub-questions) that applies expert-designed templatised orthographic obfuscations to reduce reliance on knowledge or memorisation while preserving solution logic. Experiments on frontier models report a performance drop from ~0.59 on original problems to ~0.48 on obfuscated versions, interpreted as evidence that models exploit shortcuts on the originals and that the benchmark better isolates true reasoning.

Significance. If the obfuscations are shown to preserve solution procedures without increasing intrinsic difficulty, LINGOLY-TOO would provide a useful addition to reasoning benchmarks by offering a concrete way to measure sensitivity to surface-form changes. The scale (over 1,200 problems) and focus on orthographic templatisation are strengths for reproducibility and targeted testing.

major comments (2)
  1. [Abstract] Abstract: The central interpretation, that the 0.59→0.48 drop demonstrates models were using knowledge shortcuts on the original Linguistics Olympiad problems, rests on the untested claim that obfuscations 'preserve the underlying solution logic.' No human accuracy baseline on the obfuscated subset is reported, leaving open the possibility that the drop reflects harder parsing or pattern detection for any solver rather than removal of memorisation.
  2. [Abstract] Abstract: Reported performance drops lack error bars, details on the number of models or runs, or any statistical tests, making it impossible to assess whether the observed difference is reliable or load-bearing for the disentanglement claim.
minor comments (1)
  1. [Abstract] Abstract contains a clear typo: 'solvable with via knowledge or memorisation.'
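
One common way to attach the kind of uncertainty estimate requested in major comment 2 is a paired bootstrap over per-question score differences. The sketch below is an illustration only: the per-question scores are synthetic placeholders generated in the code, not data from the paper, and `paired_bootstrap_ci` is a hypothetical helper name rather than anything the authors describe.

```python
import random

def paired_bootstrap_ci(orig_scores, obf_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (orig - obf)."""
    if len(orig_scores) != len(obf_scores):
        raise ValueError("scores must be paired per question")
    rng = random.Random(seed)
    diffs = [o - b for o, b in zip(orig_scores, obf_scores)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper

# Synthetic placeholder scores for 200 imaginary questions (NOT paper data).
data_rng = random.Random(1)
orig = [data_rng.random() for _ in range(200)]
obf = [max(0.0, s - data_rng.uniform(0.0, 0.2)) for s in orig]

mean_diff, lower, upper = paired_bootstrap_ci(orig, obf)
print(f"mean drop {mean_diff:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

Resampling whole questions keeps each original score paired with its obfuscated counterpart, which matters because the two scores for one problem are not independent. An interval that excludes zero would support reading the drop as reliable.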

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and note the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central interpretation, that the 0.59→0.48 drop demonstrates models were using knowledge shortcuts on the original Linguistics Olympiad problems, rests on the untested claim that obfuscations 'preserve the underlying solution logic.' No human accuracy baseline on the obfuscated subset is reported, leaving open the possibility that the drop reflects harder parsing or pattern detection for any solver rather than removal of memorisation.

    Authors: The obfuscations were constructed via expert-designed templates that replace surface forms while leaving the required reasoning steps unchanged; this construction process is described in Section 3. We agree that the absence of a human baseline on the obfuscated problems leaves the preservation claim untested in the current manuscript. We will revise the abstract to qualify the interpretation accordingly and add an explicit limitation statement. Human evaluation on the obfuscated set is planned for a follow-up release of the benchmark.
    revision: partial

  2. Referee: [Abstract] Abstract: Reported performance drops lack error bars, details on the number of models or runs, or any statistical tests, making it impossible to assess whether the observed difference is reliable or load-bearing for the disentanglement claim.

    Authors: The abstract is a high-level summary; the full paper reports results across multiple frontier models with repeated runs and presents error bars in the figures and tables of Section 4, along with statistical comparisons. We will update the abstract to include approximate error ranges, state the number of models and runs, and note that the observed difference reaches statistical significance per the analysis in the main text.
    revision: yes

Circularity Check

0 steps flagged

No circularity; benchmark is externally constructed and evaluated

Full rationale

The paper presents LINGOLY-TOO as an externally designed benchmark using expert obfuscations on existing Linguistics Olympiad problems. The central result is an empirical performance comparison (original vs. obfuscated scores) with no equations, fitted parameters, or derivation steps that reduce to self-defined inputs. No load-bearing self-citations or uniqueness theorems are invoked in the provided text to justify the obfuscation preservation claim; the benchmark is treated as an independent test set. This is the standard case of a self-contained empirical evaluation with no internal circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that obfuscations preserve logic without introducing new fitted parameters or entities.

axioms (1)
  • domain assumption: Expert-designed obfuscations preserve the underlying solution logic while reducing knowledge-based solvability.
    Directly stated in the abstract as the mechanism enabling disentanglement.

pith-pipeline@v0.9.0 · 5719 in / 1128 out tokens · 36942 ms · 2026-05-23T01:02:44.460097+00:00 · methodology
