Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems

URL https://aclanthology · 2024 · DOI 10.18653/v1/p17-1015

19 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · dataset: 1

citation-polarity summary

background: 1 · unclear: 1 · use dataset: 1

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.

Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

cs.LG · 2024-01-19 · conditional · novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

The GAIA benchmark shows that humans, at 92% accuracy on simple real-world questions, far outperform current AI systems at 15%, and proposes closing this gap as a key milestone for general AI.

Measuring Faithfulness in Chain-of-Thought Reasoning

cs.AI · 2023-07-17 · conditional · novelty 7.0

Chain-of-Thought reasoning in LLMs is often unfaithful: how much a model relies on it varies by task, and reliance decreases as models scale.

Breaking the Likelihood Trap: Variance-Calibrated Modulation for Large Language Model Decoding

cs.CL · 2026-06-21 · unverdicted · novelty 6.0

VCM is a training-free decoding intervention that applies PMI-driven token elevation and variance-adaptive penalization to reduce repetitive degeneration in LLM open-ended generation.

Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

Nexa learns a response-conditioned policy that starts with parallel agent execution and adds at most one round of sequential message passing via a predicted sparse DAG, strictly subsuming pure parallel mode.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

cs.CL · 2023-09-11 · conditional · novelty 6.0

MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

cs.CL · 2023-05-23 · conditional · novelty 6.0

UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.

Large Language Models Can Self-Improve

cs.CL · 2022-10-20 · unverdicted · novelty 6.0

A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.

Automatic Chain of Thought Prompting in Large Language Models

cs.CL · 2022-10-07 · conditional · novelty 6.0

Auto-CoT automatically builds chain-of-thought demonstrations by sampling diverse questions and letting the LLM generate reasoning chains, matching manual CoT performance on ten reasoning tasks with GPT-3.

PaLM: Scaling Language Modeling with Pathways

cs.CL · 2022-04-05 · accept · novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

More Yap Less Meaning: Uncovering Self-Improvement Behavior in SLMs

cs.CL · 2026-06-07 · unverdicted · novelty 5.0

SLMs achieve only a 4.4% accuracy gain from self-generated hints on reasoning benchmarks, fail to semantically distinguish useful feedback, and perform worse with longer hints.

PRISM: Gauge-Invariant Tangent-Space Differentially Private LoRA

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

PRISM is a gauge-invariant DP mechanism for LoRA that avoids bilinear noise amplification via tangent-space sampling, supplies a closed-form noise characterization on Z, and includes a DP-aware adaptive update rule.

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

cs.MA · 2026-05-28 · unverdicted · novelty 5.0

Proposes temporal and structural credit assignment plus a discrete verbalized block coordinate descent algorithm to optimize prompts in LLM multi-agent systems, claiming reduced query complexity and better performance on reasoning benchmarks.

Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection

cs.CL · 2026-04-22 · unverdicted · novelty 4.0

Multilingual pooling for quality classifiers outperforms monolingual baselines in rank stability and accuracy for LLM pretraining data selection across high- and low-resource languages.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Breaking the Likelihood Trap: Variance-Calibrated Modulation for Large Language Model Decoding

cs.CL · 2026-06-21 · unverdicted · none · ref 8

VCM is a training-free decoding intervention that applies PMI-driven token elevation and variance-adaptive penalization to reduce repetitive degeneration in LLM open-ended generation.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · none · ref 109

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
