hub

Are NLP Models really able to Solve Simple Math Word Problems?

Arkil Patel, Satwik Bhattamishra, Navin Goyal · 2021 · cs.CL · DOI 10.18653/v1/2021.naacl-main.168 · arXiv 2103.07191

42 Pith papers cite this work, alongside 148 external citations. Polarity classification is still indexing.

42 Pith papers citing it

148 external citations · Crossref

open full Pith review browse 42 citing papers arXiv PDF

abstract

The problem of designing NLP solvers for math word problems (MWP) has seen sustained research activity and steady gains in the test accuracy. Since existing solvers achieve high performance on the benchmark datasets for elementary level MWPs containing one-unknown arithmetic word problems, such problems are often considered "solved" with the bulk of research attention moving to more complex MWPs. In this paper, we restrict our attention to English MWPs taught in grades four and lower. We provide strong evidence that the existing MWP solvers rely on shallow heuristics to achieve high performance on the benchmark datasets. To this end, we show that MWP solvers that do not have access to the question asked in the MWP can still solve a large fraction of MWPs. Similarly, models that treat MWPs as bag-of-words can also achieve surprisingly high accuracy. Further, we introduce a challenge dataset, SVAMP, created by applying carefully chosen variations over examples sampled from existing datasets. The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP, thus showing that much remains to be done even for the simplest of the MWPs.

hub tools

JSON dossier citing papers JSON publisher DOI arXiv source

citation-role summary

background 2

citation-polarity summary

background 1 unclear 1

representative citing papers

PAL: Program-aided Language Models

cs.CL · 2022-11-18 · conditional · novelty 8.0

PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.

SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design

cs.MA · 2026-06-18 · unverdicted · novelty 7.0

SIGMA introduces skill-incidence graphs to compose agents from reusable skills, yielding higher average performance and robustness than topology-only baselines on reasoning and coding benchmarks.

BOOKMARKS: Efficient Active Storyline Memory for Role-playing

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.

Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

cs.LG · 2026-04-24 · unverdicted · novelty 7.0

A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.

How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

Answer tokens show forward drift and key-anchor focus when reading correct reasoning traces; a geometric-plus-semantic SRQ steering method boosts quantitative reasoning accuracy without training.

The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?

cs.CL · 2026-03-11 · unverdicted · novelty 7.0

The Stepwise Informativeness Assumption explains the correlation between LLM entropy dynamics and reasoning correctness by positing that correct traces accumulate answer-relevant information stepwise during generation.

DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs

cs.CL · 2026-01-07 · unverdicted · novelty 7.0

DiffCoT applies diffusion-style iterative denoising to chain-of-thought steps with a causal noise schedule, outperforming standard CoT optimization methods on multi-step reasoning benchmarks.

CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning

cs.AI · 2025-12-21 · unverdicted · novelty 7.0

CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.

EDUMATH: Generating Standards-aligned Educational Math Word Problems

cs.CL · 2025-10-08 · conditional · novelty 7.0

EDUMATH introduces the first teacher-annotated dataset for standards-aligned math word problem generation and demonstrates that it enables smaller open LLMs to match larger models while producing problems students prefer over human-written ones.

CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

cs.CL · 2025-02-28 · unverdicted · novelty 7.0

CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.

Automated Design of Agentic Systems

cs.AI · 2024-08-15 · conditional · novelty 7.0

Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

Enhancing Multilingual Reasoning via Steerable Model Merging

cs.CL · 2026-06-17 · unverdicted · novelty 6.0

ST-Merge uses gated cross-attention to adaptively weight source models during merging, outperforming baselines on multilingual reasoning tasks across 21 languages.

Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

cs.LG · 2026-06-08 · unverdicted · novelty 6.0

Dropout-GRPO uses structured dropout to generate trajectory variance for GRPO in latent-reasoning models like Coconut, raising GSM8K pass@1 from 27.29% to 29.01%.

Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

Post-hoc model-based compression of reasoning traces cuts training tokens to 12-30% and speeds training 2-7.6x while retaining up to 96% of raw-trace accuracy, though raw traces remain superior at every scale.

Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models

cs.CR · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

A hierarchical genetic algorithm induces overthinking in black-box large reasoning models by perturbing logical structure, achieving up to 26.1x longer outputs on the MATH benchmark.

The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

cs.LG · 2026-04-07 · unverdicted · novelty 6.0

The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MATH when transferring CoT from 14B to 7B models.

Factored Causal Representation Learning for Robust Reward Modeling in RLHF

cs.LG · 2026-01-29 · unverdicted · novelty 6.0

A factored causal representation learning method improves robustness of reward models in RLHF by isolating causal factors from biases like length and sycophancy using adversarial gradient reversal.

From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs

cs.CL · 2026-01-07 · unverdicted · novelty 6.0

FSLR explicitly supervises the initial logical planning step in math problems, boosting LLM accuracy by 3-5% while using 80% fewer training tokens than standard CoT fine-tuning.

PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

cs.CL · 2025-11-26 · unverdicted · novelty 6.0

PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, accompanied by the PSCP metric that penalizes based on trainable parameters, inference speed, and training memory.

Empowering Multi-Turn Tool-Integrated Agentic Reasoning with Group Turn Policy Optimization

cs.LG · 2025-11-18 · unverdicted · novelty 6.0

GTPO improves multi-turn tool-integrated reasoning in LLMs over GRPO by using turn-level rewards, return-based advantage estimation, and self-supervised reward shaping from generated code, yielding 3.0% gains on math benchmarks and 3.9% on commonsense and synthesis tasks.

LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?

cs.CL · 2025-10-09 · unverdicted · novelty 6.0

LightReasoner distills supervision signals from SLM-LLM behavioral divergence to improve LLM reasoning on math benchmarks with up to 28.1% accuracy gains and 90-99% reductions in resources.

HyperAdapt: Simple High-Rank Adaptation

cs.LG · 2025-09-23 · unverdicted · novelty 6.0

HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.

Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models

cs.CL · 2024-11-02 · unverdicted · novelty 6.0

DIP interleaves English word translations into non-English prompts to boost multilingual reasoning on synthetic benchmarks spanning 10-200 languages.

citing papers explorer

Showing 42 of 42 citing papers.

PAL: Program-aided Language Models cs.CL · 2022-11-18 · conditional · none · ref 30 · internal anchor
PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.
SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design cs.MA · 2026-06-18 · unverdicted · none · ref 53 · internal anchor
SIGMA introduces skill-incidence graphs to compose agents from reusable skills, yielding higher average performance and robustness than topology-only baselines on reasoning and coding benchmarks.
BOOKMARKS: Efficient Active Storyline Memory for Role-playing cs.CL · 2026-05-13 · unverdicted · none · ref 44 · internal anchor
BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.
Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning cs.LG · 2026-04-24 · unverdicted · none · ref 23 · internal anchor
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.
How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning cs.CL · 2026-04-21 · unverdicted · none · ref 21 · internal anchor
Answer tokens show forward drift and key-anchor focus when reading correct reasoning traces; a geometric-plus-semantic SRQ steering method boosts quantitative reasoning accuracy without training.
The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs? cs.CL · 2026-03-11 · unverdicted · none · ref 19 · internal anchor
The Stepwise Informativeness Assumption explains the correlation between LLM entropy dynamics and reasoning correctness by positing that correct traces accumulate answer-relevant information stepwise during generation.
DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs cs.CL · 2026-01-07 · unverdicted · none · ref 5 · internal anchor
DiffCoT applies diffusion-style iterative denoising to chain-of-thought steps with a causal noise schedule, outperforming standard CoT optimization methods on multi-step reasoning benchmarks.
CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning cs.AI · 2025-12-21 · unverdicted · none · ref 17 · internal anchor
CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.
EDUMATH: Generating Standards-aligned Educational Math Word Problems cs.CL · 2025-10-08 · conditional · none · ref 5 · internal anchor
EDUMATH introduces the first teacher-annotated dataset for standards-aligned math word problem generation and demonstrates that it enables smaller open LLMs to match larger models while producing problems students prefer over human-written ones.
CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation cs.CL · 2025-02-28 · unverdicted · none · ref 118 · internal anchor
CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.
Automated Design of Agentic Systems cs.AI · 2024-08-15 · conditional · none · ref 193 · internal anchor
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 131 · internal anchor
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
Enhancing Multilingual Reasoning via Steerable Model Merging cs.CL · 2026-06-17 · unverdicted · none · ref 11 · internal anchor
ST-Merge uses gated cross-attention to adaptively weight source models during merging, outperforming baselines on multilingual reasoning tasks across 21 languages.
Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning cs.LG · 2026-06-08 · unverdicted · none · ref 21 · internal anchor
Dropout-GRPO uses structured dropout to generate trajectory variance for GRPO in latent-reasoning models like Coconut, raising GSM8K pass@1 from 27.29% to 29.01%.
Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation cs.LG · 2026-06-04 · unverdicted · none · ref 28 · internal anchor
Post-hoc model-based compression of reasoning traces cuts training tokens to 12-30% and speeds training 2-7.6x while retaining up to 96% of raw-trace accuracy, though raw traces remain superior at every scale.
Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models cs.CR · 2026-05-13 · unverdicted · none · ref 25 · 2 links · internal anchor
A hierarchical genetic algorithm induces overthinking in black-box large reasoning models by perturbing logical structure, achieving up to 26.1x longer outputs on the MATH benchmark.
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment cs.LG · 2026-04-07 · unverdicted · none · ref 50 · internal anchor
The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MATH when transferring CoT from 14B to 7B models.
Factored Causal Representation Learning for Robust Reward Modeling in RLHF cs.LG · 2026-01-29 · unverdicted · none · ref 21 · internal anchor
A factored causal representation learning method improves robustness of reward models in RLHF by isolating causal factors from biases like length and sycophancy using adversarial gradient reversal.
From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs cs.CL · 2026-01-07 · unverdicted · none · ref 5 · internal anchor
FSLR explicitly supervises the initial logical planning step in math problems, boosting LLM accuracy by 3-5% while using 80% fewer training tokens than standard CoT fine-tuning.
PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark cs.CL · 2025-11-26 · unverdicted · none · ref 41 · internal anchor
PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, accompanied by the PSCP metric that penalizes based on trainable parameters, inference speed, and training memory.
Empowering Multi-Turn Tool-Integrated Agentic Reasoning with Group Turn Policy Optimization cs.LG · 2025-11-18 · unverdicted · none · ref 3 · internal anchor
GTPO improves multi-turn tool-integrated reasoning in LLMs over GRPO by using turn-level rewards, return-based advantage estimation, and self-supervised reward shaping from generated code, yielding 3.0% gains on math benchmarks and 3.9% on commonsense and synthesis tasks.
LightReasoner: Can Small Language Models Teach Large Language Models Reasoning? cs.CL · 2025-10-09 · unverdicted · none · ref 14 · internal anchor
LightReasoner distills supervision signals from SLM-LLM behavioral divergence to improve LLM reasoning on math benchmarks with up to 28.1% accuracy gains and 90-99% reductions in resources.
HyperAdapt: Simple High-Rank Adaptation cs.LG · 2025-09-23 · unverdicted · none · ref 30 · internal anchor
HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.
Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models cs.CL · 2024-11-02 · unverdicted · none · ref 24 · internal anchor
DIP interleaves English word translations into non-English prompts to boost multilingual reasoning on synthetic benchmarks spanning 10-200 languages.
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving cs.CL · 2023-09-29 · conditional · none · ref 32 · internal anchor
ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning cs.CL · 2023-09-11 · conditional · none · ref 34 · internal anchor
MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models cs.CL · 2023-08-03 · unverdicted · none · ref 87 · internal anchor
Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes cs.CL · 2023-05-03 · conditional · none · ref 92 · internal anchor
Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.
ART: Automatic multi-step reasoning and tool-use for large language models cs.CL · 2023-03-16 · unverdicted · none · ref 22 · internal anchor
ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
Automatic Chain of Thought Prompting in Large Language Models cs.CL · 2022-10-07 · conditional · none · ref 10 · internal anchor
Auto-CoT automatically builds chain-of-thought demonstrations by sampling diverse questions and letting the LLM generate reasoning chains, matching manual CoT performance on ten reasoning tasks with GPT-3.
PaLM: Scaling Language Modeling with Pathways cs.CL · 2022-04-05 · accept · none · ref 109 · internal anchor
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity cs.LG · 2026-05-01 · unverdicted · none · ref 5 · 2 links · internal anchor
EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.
PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models cs.CL · 2025-12-02 · unverdicted · none · ref 54 · internal anchor
PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language models.
Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective cs.AI · 2025-11-01 · conditional · none · ref 27 · internal anchor
The paper analyzes CPU bottlenecks in agentic AI serving, selects representative workloads, and demonstrates that CPU-aware scheduling optimizations COMB and MAS can reduce P50 latency by up to 1.7x and total latency by up to 2.49x on two hardware systems.
AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning cs.CL · 2024-10-17 · unverdicted · none · ref 18 · internal anchor
AdaSwitch improves small local LLM performance on reasoning tasks by adaptively switching to a large cloud LLM upon detected errors, sometimes matching cloud results with far less overhead.
MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution cs.LG · 2026-05-29 · unverdicted · none · ref 17 · internal anchor
MetaEvo is a two-stage framework using preference optimization for principle abstraction followed by modular reuse to enable continual improvement of LLM agents on reasoning tasks.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model cs.CV · 2025-02-14 · unverdicted · none · ref 260 · internal anchor
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models cs.LG · 2026-05-15 · unreviewed · ref 103 · internal anchor
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards cs.AI · 2026-05-05 · unreviewed · ref 21 · 2 links · internal anchor
ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation cs.LG · 2026-04-25 · unreviewed · ref 54 · internal anchor
Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization cs.CL · 2026-03-17 · unreviewed · ref 8 · internal anchor
Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression cs.LG · 2026-02-09 · unreviewed · ref 24 · internal anchor

Are NLP Models really able to Solve Simple Math Word Problems?

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer