GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar · 2024 · arXiv 2410.05229

18 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.

Tracing Uncertainty in Language Model "Reasoning"

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Uncertainty trace profiles from LM reasoning traces predict correct final answers with AUROC up to 0.807 and enable early error detection using only initial tokens.

Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems

cs.AI · 2026-04-13 · unverdicted · novelty 7.0

A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.

Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

cs.CR · 2026-05-05 · unverdicted · novelty 6.0

Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.

When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models

cs.CL · 2026-05-04 · conditional · novelty 6.0

AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.

Agentic Frameworks for Reasoning Tasks: An Empirical Study

cs.AI · 2026-04-17 · unverdicted · novelty 6.0

An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate

cs.MA · 2026-04-03 · unverdicted · novelty 6.0

HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.

Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities

cs.AI · 2026-05-10 · unverdicted · novelty 5.0

Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.

NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

cs.LG · 2026-05-06 · unverdicted · novelty 5.0

Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.

One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement

cs.CL · 2026-04-28 · unverdicted · novelty 5.0

ReQueR trains a single RL-based query refiner with an adaptive curriculum to decompose raw queries into structured logic, delivering 1.7-7.2% absolute gains on reasoning tasks across diverse LLMs and generalizing to unseen models.

Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

cs.AI · 2026-04-24 · unverdicted · novelty 5.0

An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.

A pragmatic approach to regulating AI agents

cs.CY · 2026-04-16 · unverdicted · novelty 5.0

AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.

Too long; didn't solve

cs.AI · 2026-04-08 · unverdicted · novelty 5.0

Longer prompts and solutions in a new expert-authored math dataset correlate with higher failure rates across LLMs, with length linked to empirical difficulty after difficulty adjustment.

Measuring AI Reasoning: A Guide for Researchers

cs.AI · 2026-05-04 · unverdicted · novelty 4.0

Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

EMS: Multi-Agent Voting via Efficient Majority-then-Stopping

cs.AI · 2026-04-03 · unverdicted · novelty 4.0

EMS reduces the average number of agents invoked for majority voting by 32% via reliability-aware prioritization and early stopping on six benchmarks.
