
arxiv: 2605.07053 · v2 · pith:TNLFRSXEnew · submitted 2026-05-08 · 💻 cs.CL · cs.AI

GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

Pith reviewed 2026-05-11 00:50 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords GSM8K · mathematical reasoning · LLM robustness · semantic perturbation · benchmark augmentation · data generation · reasoning evaluation

The pith

GSM-SEM creates fresh math problem variants by changing facts and entities while preserving the original answers and calculations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GSM-SEM, a stochastic framework that generates new benchmark versions by perturbing entities, attributes, and relationships in problem statements. These changes often alter the underlying facts and force recomputation under different conditions, yet the framework constrains outputs to keep the same calculations, final answer, and roughly the same difficulty. Applied to GSM8K, GSM-Symbolic, and GSM-Plus, the resulting datasets produce consistent performance drops across 14 state-of-the-art language models, with larger declines when semantic changes combine with symbolic or plus-style variations. The approach runs fresh each time without new human annotation, lowering the risk that models simply memorize fixed public test sets. The same generation method is shown to work on non-math benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.

Core claim

GSM-SEM is a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. It perturbs problem statements by modifying entities, attributes, and relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations, answer, and approximate problem difficulty. When applied to GSM8K and existing variation suites, the resulting GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM datasets reveal consistent performance drops in 14 SOTA LLMs, with an average drop rate of 28 percent in the maximum-

What carries the argument

The GSM-SEM framework, a stochastic generator that applies constrained modifications to entities, attributes, and relationships in problem statements to produce new variants on each run.
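The idea of changing the scenario while holding the arithmetic fixed can be illustrated with a toy sketch. This is not the paper's pipeline, whose enforcement mechanism is not described on this page; the scenario templates, the `NAMES` list, and the helpers `numbers_in`, `generate_variant`, and `keeps_quantities` are all hypothetical. Each template describes different facts but requires the same computation (`a * b`).

```python
import random
import re

# Hypothetical scenario templates: different facts, same calculation (a * b).
SCENARIOS = [
    "{who} packs {a} boxes with {b} apples in each box. How many apples are packed in total?",
    "{who} drives {a} trips and carries {b} passengers on each trip. How many passengers are carried in total?",
    "{who} plants {a} rows with {b} seedlings in each row. How many seedlings are planted in total?",
]
NAMES = ["Ava", "Ravi", "Mei"]  # hypothetical entity pool


def numbers_in(text):
    """Return the numeric literals found in a problem statement, sorted."""
    return sorted(re.findall(r"\d+(?:\.\d+)?", text))


def generate_variant(a, b, rng):
    """Sample a scenario and an entity while keeping the quantities fixed."""
    scenario = rng.choice(SCENARIOS)
    return scenario.format(who=rng.choice(NAMES), a=a, b=b)


def keeps_quantities(original, variant):
    """Necessary (not sufficient) check: both statements use the same numbers."""
    return numbers_in(original) == numbers_in(variant)


original = SCENARIOS[0].format(who="Tom", a=4, b=3)
rng = random.Random()  # unseeded, so each run draws fresh variants
variants = [generate_variant(4, 3, rng) for _ in range(5)]
for v in variants:
    print(keeps_quantities(original, v), v)
```

Matching numbers do not by themselves guarantee a matching answer; in this toy the answer is preserved only because every template was written to require the same product.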

If this is right

  • Models achieving high scores on static GSM8K versions may exhibit lower accuracy when facts change but the underlying math remains identical.
  • Fresh variants generated on each run reduce the long-term value of memorizing any single public test set.
  • Combining semantic perturbations with symbolic or plus-style changes produces larger performance declines than either type alone.
  • The same generation process can be reused on other reasoning benchmarks without requiring new human annotation for each release.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integrating stochastic variant generation into model training loops could encourage learning of reasoning patterns that generalize across changed contexts rather than surface forms.
  • If the preserved difficulty claim holds, the observed drops point to limits in how current models handle recomputation under altered problem conditions.
  • Extending the approach to domains with different reasoning structures, such as code or planning tasks, would test whether similar memorization vulnerabilities exist outside math word problems.

Load-bearing premise

Modifications to entities, attributes, and relationships can alter underlying facts and require recomputation while still preserving the original calculations, answer, and approximate problem difficulty.

What would settle it

A set of generated variants where human validators confirm that the required calculations or final answer have changed, or where the 14 evaluated LLMs show no measurable accuracy drop relative to the original problems.
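One way to make "no measurable accuracy drop" concrete is a paired significance test on per-item correctness. The sketch below uses an exact McNemar test; it assumes each variant can be paired with its source problem, and the function name `mcnemar_exact` and the toy correctness lists are hypothetical. The page does not say which test the paper uses.

```python
from math import comb


def mcnemar_exact(orig_correct, var_correct):
    """Exact two-sided McNemar test on paired per-item correctness flags.

    Returns (b, c, p) where b counts items solved only on the original,
    c counts items solved only on the variant, and p is the p-value under
    the null hypothesis that both discordant outcomes are equally likely.
    """
    if len(orig_correct) != len(var_correct):
        raise ValueError("inputs must be paired item by item")
    b = sum(1 for o, v in zip(orig_correct, var_correct) if o and not v)
    c = sum(1 for o, v in zip(orig_correct, var_correct) if v and not o)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Made-up flags for one model on ten paired items (illustration only).
orig = [True] * 10
var = [False] * 8 + [True] * 2
print(mcnemar_exact(orig, var))
```

A large p-value across all evaluated models would be the kind of evidence the paragraph above describes as settling the question against the paper.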

Figures

Figures reproduced from arXiv: 2605.07053 by Amit Agarwal, Aziza Mirsaidova, Dan Roth, Fang Tu, Graham Horwood, Hitesh Laxmichand Patel, Jyotika Singh, Karan Dua, Miguel Ballesteros, Sandip Ghoshal, Sujith Ravi, Tao Sheng, Weiyi Sun, Yassine Benajiba.

Figure 1. Example perturbations and per-run accuracy. Top: original GSM8K problem. Middle: GSM-Symbolic and GSM-Plus rewrites. Bottom: GSM-SEM variants (orange highlights indicate edited spans). For each panel, accuracy across five independent runs is shown (✓ correct, x incorrect), illustrating higher failures on SEM variants. L3.1 = Llama-3.1-405B-Ins; GPT-5 uses medium/default reasoning effort. … often interpreted…
Figure 2. GSM-SEM: Semantic variant generation pipeline. …serve the original answer and approximate difficulty. This addresses two practical limitations of many existing benchmarks: (i) released variants are static and can become memorization targets over time; and (ii) extending them with fresh, comparable perturbations is often infeasible or requires re-annotating ground-truth answers, which is hard to do reliabl…
Figure 3. Cosine similarity distribution of GSM8K variants with respect to GSM8K, where GSM8K-SEM shows higher semantic divergence than other variants. … show the count-based cosine similarity distribution between GSM-Symbolic and GSM8K, and between GSM-Plus and GSM8K. As GSM-Symbolic essentially only swaps entities from the original dataset, its similarity tends to be higher than paraphrased queries. GSM-Plus show…
Figure 4. Effect of Strictness Filter (Section 3) on PDR% (relative to GSM8K) and statistical significance. Filter settings: none (all samples kept; [α, β] 0–1), min ([α, β] 0.30–0.70), min-med (0.35–0.65), med (0.40–0.60), med-max (0.45–0.55), and max (all such samples filtered out). Variant dataset sizes across filters are shared in…
Figure 7. Delta in performance for GSM-variants com…
Figure 6. Average accuracy across models for GSM8K…
Figure 8. Strictness filter configurations. Each setting…
Figure 9. Cosine similarity distribution of GSM8K vari…
Figure 10. Cosine similarity distribution using all…
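The similarity figures compare variants to GSM8K by cosine similarity over token counts. A minimal sketch of such a measure follows, assuming lowercase alphanumeric tokenization and raw counts; the paper's exact tokenizer and weighting are not given on this page, and the three example sentences are invented.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphanumeric tokens (an assumed tokenization)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two token-count vectors."""
    va, vb = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


original = "Tom has 4 apples and buys 3 more apples."
renamed = "Sam has 4 apples and buys 3 more apples."          # entity swap only
rewritten = "A bus carries 4 riders and picks up 3 more riders."  # new scenario
print(cosine_similarity(original, renamed))
print(cosine_similarity(original, rewritten))
```

Under this measure an entity swap stays close to the source text while a scenario change moves further away, which is the pattern the Figure 3 caption attributes to GSM-Symbolic versus GSM8K-SEM.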
Original abstract

Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without requiring re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering the bias of memorization. We apply GSM-SEM on GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus), producing GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Evaluating 14 SOTA LLMs, we observe consistent performance drops with larger decline when semantic perturbations are coupled with symbolic/plus variations (average drop rate 28% in maximum strictness configuration of GSM-SEM). We publicly release the three SEM variants as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, we apply GSM-SEM to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.
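The "drop rate" quoted above is not defined on this page. A common convention, used for example by GSM-Plus, is the relative loss in accuracy with respect to the original benchmark; the sketch below assumes that definition. The function names and the accuracy values are hypothetical and are not results from the paper.

```python
def accuracy(correct_flags):
    """Fraction of items answered correctly."""
    if not correct_flags:
        raise ValueError("need at least one item")
    return sum(1 for flag in correct_flags if flag) / len(correct_flags)


def performance_drop_rate(original_acc, variant_acc):
    """Relative accuracy drop in percent (assumed definition of PDR)."""
    if original_acc <= 0:
        raise ValueError("original accuracy must be positive")
    return 100.0 * (original_acc - variant_acc) / original_acc


# Made-up accuracies for illustration only.
print(performance_drop_rate(0.8, 0.6))
```

Because the measure is relative, the same absolute loss in accuracy yields a larger drop rate for a model that started from a lower baseline.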

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GSM-SEM, a reusable stochastic framework for generating semantically variant augmentations of math reasoning benchmarks (GSM8K, GSM-Symbolic, GSM-Plus) by perturbing entities/attributes/relationships to increase semantic variance while constraining generation to preserve original calculations, answers, and approximate difficulty. It produces three new human-validated datasets, evaluates 14 SOTA LLMs showing consistent performance drops (larger when semantic perturbations are combined with symbolic/plus variations, averaging 28% in the maximum-strictness configuration), and demonstrates extension to other benchmarks such as BigBenchHard, LogicBench, and NLR-BIRD.

Significance. If the preservation constraints hold, GSM-SEM provides a practical, on-demand method for creating dynamic benchmarks that reduce memorization bias and better isolate true reasoning generalization; the reported drops when semantic changes are layered on symbolic variations would constitute useful evidence of current model limitations. The public release of validated datasets and the framework's applicability beyond GSM-style problems are concrete strengths.

major comments (2)
  1. [§3] GSM-SEM framework description: the central claim that perturbations 'frequently alter underlying facts and require models to recompute solutions' while 'constraining generation to preserve the original calculations/answer and approximate problem difficulty' is load-bearing for interpreting the 28% drop as semantic-robustness evidence rather than difficulty inflation or answer mismatch; the manuscript provides no concrete description of the enforcement mechanism (template rules, symbolic equivalence checks, post-generation filtering, or verification steps).
  2. [Results] Evaluation on 14 LLMs and the 28% figure: without explicit reporting of how answer equivalence and difficulty preservation were measured or validated on the generated sets (beyond the high-level human validation statement), the cross-configuration drops cannot be unambiguously attributed to semantic variance.
minor comments (2)
  1. [Abstract] The human-validation claim would be strengthened by a brief statement of validation criteria or inter-annotator statistics.
  2. [Throughout] Notation for the three released variants (GSM8K-SEM, etc.) should be introduced once and used consistently; a small number of figure captions could be expanded to clarify what 'maximum strictness configuration' entails.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify how to strengthen the presentation of the GSM-SEM framework and results. We address each major comment below and will incorporate the requested details into the revised manuscript.

Point-by-point responses
  1. Referee: [§3] GSM-SEM framework description: the central claim that perturbations 'frequently alter underlying facts and require models to recompute solutions' while 'constraining generation to preserve the original calculations/answer and approximate problem difficulty' is load-bearing for interpreting the 28% drop as semantic-robustness evidence rather than difficulty inflation or answer mismatch; the manuscript provides no concrete description of the enforcement mechanism (template rules, symbolic equivalence checks, post-generation filtering, or verification steps).

    Authors: We agree that Section 3 would benefit from a more explicit account of the enforcement mechanisms. In the revision we will expand the framework description to detail the template rules governing entity/attribute/relationship perturbations, the symbolic equivalence checks that verify answer preservation, the post-generation filtering criteria, and the verification steps used to maintain approximate problem difficulty. These additions will directly support the interpretation of performance drops as arising from increased semantic variance.

    Revision: yes

  2. Referee: [Results] Evaluation on 14 LLMs and the 28% figure: without explicit reporting of how answer equivalence and difficulty preservation were measured or validated on the generated sets (beyond the high-level human validation statement), the cross-configuration drops cannot be unambiguously attributed to semantic variance.

    Authors: We acknowledge that the Results section should report the validation procedures more explicitly. We will add a dedicated subsection describing the automated answer-equivalence checks (exact numerical match after recomputation), the human validation protocol (including inter-annotator agreement on answer correctness and difficulty), and how these steps were applied across the GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM datasets. This will allow readers to attribute the observed drops unambiguously to semantic variance.

    Revision: yes
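The "exact numerical match" check mentioned in this response could look like the sketch below. It assumes solutions end with a GSM8K-style `#### <number>` marker; the helper names `extract_final_answer` and `answer_preserved` are hypothetical, and the paper's actual validation procedure is not described on this page.

```python
import re
from decimal import Decimal

# GSM8K gold solutions put the final answer after "####"; reusing that
# marker for generated variants is an assumption of this sketch.
_FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(solution_text):
    """Return the final answer as a Decimal, or None if no marker is found."""
    match = _FINAL_ANSWER.search(solution_text)
    if match is None:
        return None
    return Decimal(match.group(1).replace(",", ""))


def answer_preserved(original_solution, variant_solution):
    """True only if both solutions expose the same numeric final answer."""
    a = extract_final_answer(original_solution)
    b = extract_final_answer(variant_solution)
    return a is not None and a == b


print(answer_preserved("4 * 3 = 12\n#### 12", "3 * 4 = 12\n#### 12.0"))
print(answer_preserved("4 * 3 = 12\n#### 12", "4 * 3 + 1 = 13\n#### 13"))
```

Comparing `Decimal` values rather than strings treats `12` and `12.0` as equal while avoiding binary floating-point rounding.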

Circularity Check

0 steps flagged

No significant circularity in GSM-SEM framework or empirical evaluation

Full rationale

The paper introduces a stochastic generation framework that perturbs entities/attributes/relationships in existing benchmarks while enforcing preservation of answers and difficulty, applies it to produce new variant sets (GSM8K-SEM etc.), and reports empirical LLM performance drops on those sets. This chain relies on external model evaluations and human validation rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. The central claims are observational results from applying the defined process to independent test items, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that semantic perturbations can be generated to change facts while preserving calculations and difficulty; no free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: semantic perturbations can be applied to alter underlying facts while preserving original calculations, answers, and approximate difficulty.
    This constraint is invoked as the core mechanism of GSM-SEM in the abstract.

pith-pipeline@v0.9.0 · 5630 in / 1347 out tokens · 63201 ms · 2026-05-11T00:50:17.185012+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.