GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar · 2024 · cs.LG · arXiv 2410.05229

Mixed citation behavior. Most common role is background (57%).

44 Pith papers citing it

abstract

Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.
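
The abstract's two core ideas, instantiating a question from a symbolic template and adding a clause that looks relevant but does not affect the answer, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the template text, the placeholder names, the value ranges, and the helper names `sample_instance` and `add_distractor` are hypothetical and are not taken from the GSM-Symbolic release.

```python
import random

# Hypothetical template: placeholder names and value ranges are illustrative,
# not taken from the GSM-Symbolic release.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{name} gives away {z} apples. How many apples does {name} have left?"
)
NAMES = ["Ava", "Liam", "Noor", "Kenji"]


def sample_instance(rng: random.Random) -> tuple[str, int]:
    """Draw one question and its ground-truth answer from the template."""
    x = rng.randint(5, 40)
    y = rng.randint(5, 40)
    z = rng.randint(1, x + y)  # constraint: cannot give away more than picked
    question = TEMPLATE.format(name=rng.choice(NAMES), x=x, y=y, z=z)
    return question, x + y - z


def add_distractor(question: str, rng: random.Random) -> str:
    """Insert a clause that looks relevant but does not change the answer."""
    clause = f"{rng.randint(2, 9)} of the apples are smaller than average. "
    # Place the clause just before the final "How many ..." sentence.
    head, sep, tail = question.rpartition("How many")
    return head + clause + sep + tail


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        q, a = sample_instance(rng)
        print(add_distractor(q, rng), "->", a)
```

Because the answer is computed from the sampled variables rather than stored with the text, every instantiation carries its own ground truth, and the distractor variant keeps that answer unchanged by construction.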

citation-role summary

background 6 · dataset 1

citation-polarity summary

background 4 · support 1 · unclear 1 · use dataset 1

representative citing papers

SEVerA: Verified Synthesis of Self-Evolving Agents

cs.LG · 2026-03-26 · unverdicted · novelty 8.0

SEVerA uses Formally Guarded Generative Models and a three-stage Search-Verification-Learning process to synthesize self-evolving agents that satisfy hard formal constraints while improving task performance.

Algorithmic Task Capture, Computational Complexity, and Inductive Bias of Infinite Transformers

cs.LG · 2026-03-11 · unverdicted · novelty 8.0

Infinite-width transformers exhibit an inductive bias against high-complexity polynomial-time algorithms, with derived upper bounds on capturable tasks like sorting and string matching.

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

cs.AI · 2026-06-15 · unverdicted · novelty 7.0

MemTrace shows that evidence utilization, not retrieval, is the dominant failure mode in LLM long-term memory systems across tested configurations.

Robust Reasoning Benchmark

cs.LG · 2026-03-26 · unverdicted · novelty 7.0 · 2 refs

The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems due to attention dilution during chain-of-thought.

Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya

cs.AI · 2026-02-14 · conditional · novelty 7.0

Fine-tuning LLMs on Navya-Nyaya's six-phase reasoning structure yields 100% semantic correctness on held-out logical problems despite only 40% strict format adherence.

Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

cs.CL · 2026-02-11 · unverdicted · novelty 7.0

LLMs show heterogeneous robustness to five types of chain-of-thought perturbations, with MathError causing 50-60% accuracy loss in small models but scaling benefits, UnitConversion remaining hard across sizes, and ExtraSteps causing minimal degradation.

OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

cs.CL · 2026-01-09 · accept · novelty 7.0

OPT-Engine shows pure-text chain-of-thought reasoning in LLMs loses robustness as optimization complexity grows, external tools fix only local arithmetic, and solver-integrated methods are bottlenecked by automated constraint formulation.

CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning

cs.AI · 2025-12-21 · unverdicted · novelty 7.0

CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.

BEAVER: An Efficient Deterministic LLM Verifier

cs.AI · 2025-12-05 · unverdicted · novelty 7.0

BEAVER is the first practical deterministic verifier that maintains sound probability bounds on LLM safety properties using token tries and frontier data structures, finding 2-3x more violations than sampling at 1/10 the compute.

Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.

Tracing Uncertainty in Language Model "Reasoning"

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Uncertainty trace profiles from LM reasoning traces predict correct final answers with AUROC up to 0.807 and enable early error detection using only initial tokens.

Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems

cs.AI · 2026-04-13 · unverdicted · novelty 7.0

A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.

RoMathExam: A Longitudinal Dataset of Romanian Math Exams (1895-2025) with a Seven-Decade Core (1957-2025)

cs.CY · 2026-03-28 · unverdicted · novelty 7.0

RoMathExam supplies a century-long collection of Romanian math exams together with a new intrinsic complexity metric that correlates across frontier models at r > 0.72.

From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs

cs.CL · 2026-01-07 · unverdicted · novelty 6.0

FSLR explicitly supervises the initial logical planning step in math problems, boosting LLM accuracy by 3-5% while using 80% fewer training tokens than standard CoT fine-tuning.

Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs

cs.AI · 2025-12-09 · unverdicted · novelty 6.0

State-of-the-art MLLMs show substantial inconsistency when reasoning over the same information presented in image, text, or mixed modalities, even after accounting for OCR errors, with inconsistency linked to visual factors and modality gap.

Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

cs.AI · 2025-08-02 · unverdicted · novelty 6.0

CoT reasoning is a brittle mirage governed by distribution discrepancy between training and test data, demonstrated via controlled experiments in the new DataAlchemy environment.

League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

cs.AI · 2025-07-30 · unverdicted · novelty 6.0

League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.

Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

cs.CR · 2026-05-05 · unverdicted · novelty 6.0

Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.

When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models

cs.CL · 2026-05-04 · conditional · novelty 6.0

AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.

Agentic Frameworks for Reasoning Tasks: An Empirical Study

cs.AI · 2026-04-17 · unverdicted · novelty 6.0

An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate

cs.MA · 2026-04-03 · unverdicted · novelty 6.0

HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.

citing papers explorer

Showing 44 of 44 citing papers.

SEVerA: Verified Synthesis of Self-Evolving Agents cs.LG · 2026-03-26 · unverdicted · none · ref 31 · internal anchor
SEVerA uses Formally Guarded Generative Models and a three-stage Search-Verification-Learning process to synthesize self-evolving agents that satisfy hard formal constraints while improving task performance.
Algorithmic Task Capture, Computational Complexity, and Inductive Bias of Infinite Transformers cs.LG · 2026-03-11 · unverdicted · none · ref 9 · internal anchor
Infinite-width transformers exhibit an inductive bias against high-complexity polynomial-time algorithms, with derived upper bounds on capturable tasks like sorting and string matching.
MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs cs.LG · 2026-05-08 · unverdicted · none · ref 44
MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.
MemTrace: Probing What Final Accuracy Misses in Long-Term Memory cs.AI · 2026-06-15 · unverdicted · none · ref 23 · internal anchor
MemTrace shows that evidence utilization, not retrieval, is the dominant failure mode in LLM long-term memory systems across tested configurations.
Robust Reasoning Benchmark cs.LG · 2026-03-26 · unverdicted · none · ref 32 · 2 links · internal anchor
The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems due to attention dilution during chain-of-thought.
Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya cs.AI · 2026-02-14 · conditional · none · ref 1 · internal anchor
Fine-tuning LLMs on Navya-Nyaya's six-phase reasoning structure yields 100% semantic correctness on held-out logical problems despite only 40% strict format adherence.
Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations cs.CL · 2026-02-11 · unverdicted · none · ref 21 · internal anchor
LLMs show heterogeneous robustness to five types of chain-of-thought perturbations, with MathError causing 50-60% accuracy loss in small models but scaling benefits, UnitConversion remaining hard across sizes, and ExtraSteps causing minimal degradation.
OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling cs.CL · 2026-01-09 · accept · none · ref 35 · internal anchor
OPT-Engine shows pure-text chain-of-thought reasoning in LLMs loses robustness as optimization complexity grows, external tools fix only local arithmetic, and solver-integrated methods are bottlenecked by automated constraint formulation.
CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning cs.AI · 2025-12-21 · unverdicted · none · ref 15 · internal anchor
CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.
BEAVER: An Efficient Deterministic LLM Verifier cs.AI · 2025-12-05 · unverdicted · none · ref 34 · internal anchor
BEAVER is the first practical deterministic verifier that maintains sound probability bounds on LLM safety properties using token tries and frontier data structures, finding 2-3x more violations than sampling at 1/10 the compute.
Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers cs.AI · 2026-05-12 · unverdicted · none · ref 12
LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.
Tracing Uncertainty in Language Model "Reasoning" cs.LG · 2026-05-08 · unverdicted · none · ref 24
Uncertainty trace profiles from LM reasoning traces predict correct final answers with AUROC up to 0.807 and enable early error detection using only initial tokens.
Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems cs.AI · 2026-04-13 · unverdicted · none · ref 30
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
RoMathExam: A Longitudinal Dataset of Romanian Math Exams (1895-2025) with a Seven-Decade Core (1957-2025) cs.CY · 2026-03-28 · unverdicted · none · ref 20
RoMathExam supplies a century-long collection of Romanian math exams together with a new intrinsic complexity metric that correlates across frontier models at r > 0.72.
From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs cs.CL · 2026-01-07 · unverdicted · none · ref 4 · internal anchor
FSLR explicitly supervises the initial logical planning step in math problems, boosting LLM accuracy by 3-5% while using 80% fewer training tokens than standard CoT fine-tuning.
Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs cs.AI · 2025-12-09 · unverdicted · none · ref 27 · internal anchor
State-of-the-art MLLMs show substantial inconsistency when reasoning over the same information presented in image, text, or mixed modalities, even after accounting for OCR errors, with inconsistency linked to visual factors and modality gap.
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens cs.AI · 2025-08-02 · unverdicted · none · ref 7 · internal anchor
CoT reasoning is a brittle mirage governed by distribution discrepancy between training and test data, demonstrated via controlled experiments in the new DataAlchemy environment.
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models cs.AI · 2025-07-30 · unverdicted · none · ref 32 · internal anchor
League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.
Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis cs.CR · 2026-05-05 · unverdicted · none · ref 16
Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.
When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models cs.CL · 2026-05-04 · conditional · none · ref 13
AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering cs.AI · 2026-04-22 · unverdicted · none · ref 90
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks cs.CL · 2026-04-20 · unverdicted · none · ref 24
QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.
Agentic Frameworks for Reasoning Tasks: An Empirical Study cs.AI · 2026-04-17 · unverdicted · none · ref 60
An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.
Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate cs.MA · 2026-04-03 · unverdicted · none · ref 4
HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.
Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework cs.CR · 2026-01-12 · unverdicted · none · ref 41 · internal anchor
CyberOps-Bots is a hierarchical LLM-empowered multi-agent RL framework that reports 68.5% higher network availability and 34.7% better jumpstart performance in new scenarios without retraining on real cloud datasets.
Semantic-Aware Logical Reasoning via a Semiotic Framework cs.AI · 2025-09-29 · conditional · none · ref 26 · internal anchor
LogicAgent uses a semiotic-square-guided approach to enhance logical reasoning in LLMs on the new RepublicQA benchmark and others, reporting average gains of 6.25% and 7.05% respectively.
Self-Aligned Reward: Towards Effective and Efficient Reasoners cs.LG · 2025-09-05 · unverdicted · none · ref 31 · internal anchor
Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.
Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts? cs.AI · 2025-03-23 · conditional · none · ref 5 · internal anchor
LLMs show accuracy drops of 0.3% to 5.9% on GSM8K math problems when culturally adapted to six countries while keeping math operations identical, with statistical significance confirmed by McNemar tests.
Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities cs.AI · 2026-05-10 · unverdicted · none · ref 12
Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.
NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning cs.LG · 2026-05-06 · unverdicted · none · ref 19
Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.
One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement cs.CL · 2026-04-28 · unverdicted · none · ref 4
ReQueR trains a single RL-based query refiner with an adaptive curriculum to decompose raw queries into structured logic, delivering 1.7-7.2% absolute gains on reasoning tasks across diverse LLMs and generalizing to unseen models.
Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity cs.AI · 2026-04-24 · unverdicted · none · ref 19
An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.
A pragmatic approach to regulating AI agents cs.CY · 2026-04-16 · unverdicted · none · ref 16
AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.
The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes cs.CL · 2026-06-09 · unverdicted · none · ref 172 · internal anchor
A literature survey that introduces a taxonomy for LLM reasoning paradigms, analyzes methodological trends, and synthesizes failure modes from over 300 papers.
Gemma 3 Technical Report cs.CL · 2025-03-25 · accept · none · ref 33 · internal anchor
Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and the 27B version comparable to Gemini-1.5-Pro.
Large Language Models for Multi-Robot Systems: A Survey cs.RO · 2025-02-06 · unverdicted · none · ref 87 · internal anchor
A survey that categorizes LLM uses in multi-robot systems across task allocation, motion planning, action generation, and human interaction, while noting challenges and future research opportunities.
Measuring AI Reasoning: A Guide for Researchers cs.AI · 2026-05-04 · unverdicted · none · ref 140
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
EMS: Multi-Agent Voting via Efficient Majority-then-Stopping cs.AI · 2026-04-03 · unverdicted · none · ref 4
EMS reduces the average number of agents invoked for majority voting by 32% via reliability-aware prioritization and early stopping on six benchmarks.
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning cs.CL · 2025-02-05 · unverdicted · none · ref 129 · internal anchor
Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.
TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability cs.LG · 2026-05-14 · unreviewed · ref 8 · internal anchor
Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions cs.CY · 2026-02-27 · unreviewed · ref 149 · internal anchor
Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards cs.LG · 2025-09-26 · unreviewed · ref 22 · internal anchor
Too long; didn't solve cs.AI · 2026-04-08 · unreviewed · ref 6
The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning cs.CL · 2026-03-30 · unreviewed · ref 11