mega hub Mixed citations

Training Verifiers to Solve Math Word Problems

Heewoo Jun, Karl Cobbe, Lukasz Kaiser, Mark Chen, Mohammad Bavarian, Vineet Kosaraju · 2021 · cs.LG · arXiv 2110.14168

Mixed citation behavior. Most common role is background (48%).

1466 Pith papers citing it

Background 48% of classified citations

open full Pith review browse 1466 citing papers more from Heewoo Jun arXiv PDF

abstract

State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 126 dataset 101 method 7 baseline 4 other 2

citation-polarity summary

background 114 use dataset 99 unclear 16 use method 7 baseline 4

claims ledger

abstract State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we ge

authors

Heewoo Jun Karl Cobbe Lukasz Kaiser Mark Chen Mohammad Bavarian Vineet Kosaraju

mega hub controls

export citing contexts JSON export graph JSON export full bundle JSON open full Pith review annotated reader queued

Recognition alignment

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

co-cited works

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback

cs.LG · 2026-06-29 · unverdicted · novelty 8.0

Noisy expert imitation learning requires exponential samples for offline methods but polynomial for a variant of on-policy distillation under a noise condition.

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Analyzing the Narration Gap in LLM-Solver Loops

cs.AI · 2026-06-17 · unverdicted · novelty 8.0

The narration step in LLM-solver loops is vulnerable to prompt injection that inverts verified solver conclusions, and hardened prompts reduce but do not eliminate the risk under adaptive attacks.

Sumi: Open Uniform Diffusion Language Model from Scratch

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

cs.AI · 2026-06-17 · conditional · novelty 8.0

DeFAb is a large-scale, formally verifiable benchmark for defeasible abduction derived from 18 knowledge bases, demonstrating that frontier LLMs achieve 7.8-65% accuracy versus 100% for a rule-based solver with polynomial-time checks.

Entropy-Gated Latent Recursion

cs.LG · 2026-06-15 · unverdicted · novelty 8.0 · 2 refs

EGLR adds a deterministic layer-recursion axis gated by entropy that is complementary to temperature sampling, raising joint oracle accuracy on MATH-500 from 83.4% to 91.6% for a 3B model.

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

cs.AI · 2026-05-15 · unverdicted · novelty 8.0 · 2 refs

Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

cs.CL · 2026-05-13 · unverdicted · novelty 8.0 · 2 refs

Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

cs.LG · 2026-05-12 · accept · novelty 8.0

Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies

cs.LG · 2026-05-11 · accept · novelty 8.0 · 2 refs

Corruption studies of CoT faithfulness largely measure explicit answer placement in prompt format rather than computational importance of reasoning steps.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

cs.AI · 2026-04-20 · accept · novelty 8.0

MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmentation yielding up to 12% gains.

Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

cs.CL · 2026-04-08 · unverdicted · novelty 8.0

A nine-dimension algebraic complexity framework shows that LLMs suffer a scale-invariant working memory bottleneck, collapsing at 20-30 parallel branches regardless of parameter count from 8B to 235B.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

Self-Calibrating Language Models via Test-Time Discriminative Distillation

cs.CL · 2026-03-18 · unverdicted · novelty 8.0

SECL reduces expected calibration error in language models by 56-78% via test-time discriminative distillation from the model's own P(True) signal, adapting on only 6-26% of inputs.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

cs.CL · 2025-04-15 · conditional · novelty 8.0

DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

citing papers explorer

Showing 50 of 101 citing papers after filters.

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models cs.LG · 2026-05-12 · accept · none · ref 8 · internal anchor
Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models cs.CV · 2024-09-25 · accept · none · ref 24 · internal anchor
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 23 · 2 links · internal anchor
TABOM is a trajectory-aligned Boltzmann modeling framework that turns self-distilled inference paths into a pairwise ranking loss to close the training-inference gap in diffusion language models and expand their effective capabilities.
EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales cs.AI · 2026-05-11 · unverdicted · none · ref 6 · internal anchor
EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.
MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs cs.AI · 2026-05-11 · unverdicted · none · ref 7 · internal anchor
MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations cs.AI · 2026-05-11 · unverdicted · none · ref 11 · internal anchor
Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM cs.CL · 2026-05-10 · unverdicted · none · ref 21 · internal anchor
TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
BadDLM: Backdooring Diffusion Language Models with Diverse Targets cs.CR · 2026-05-10 · unverdicted · none · ref 11 · internal anchor
BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 8 · internal anchor
Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards cs.LG · 2026-05-08 · unverdicted · none · ref 6 · internal anchor
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play cs.AI · 2026-05-08 · unverdicted · none · ref 7 · internal anchor
Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2.5 models.
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms cs.LG · 2026-05-08 · unverdicted · none · ref 5 · internal anchor
Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.
MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning cs.CL · 2026-05-08 · unverdicted · none · ref 5 · internal anchor
MatryoshkaLoRA inserts a crafted diagonal matrix P into LoRA to learn accurate nested low-rank adapters that support dynamic rank selection with minimal performance drop.
LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification cs.CL · 2026-05-08 · unverdicted · none · ref 20 · internal anchor
LaTER reduces LLM token usage 16-33% on reasoning benchmarks by exploring in latent space then switching to explicit CoT verification, with gains like 70% to 73.3% on AIME 2025 in the training-free version.
Theoretical Limits of Language Model Alignment cs.LG · 2026-05-08 · unverdicted · none · ref 16 · internal anchor
The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients cs.CL · 2026-05-07 · unverdicted · none · ref 6 · internal anchor
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.
Echo-LoRA: Parameter-Efficient Fine-Tuning via Cross-Layer Representation Injection cs.LG · 2026-05-05 · unverdicted · none · ref 5 · internal anchor
Echo-LoRA raises average performance on eight commonsense reasoning benchmarks by 3.0 to 5.7 points over standard LoRA by using a training-only cross-layer echo representation that is discarded after training.
VOW: Verifiable and Oblivious Watermark Detection for Large Language Models cs.CR · 2026-04-30 · unverdicted · none · ref 9 · internal anchor
VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.
BoostLoRA: Growing Effective Rank by Boosting Adapters cs.LG · 2026-04-30 · unverdicted · none · ref 8 · internal anchor
BoostLoRA grows effective adapter rank linearly via iterative boosting on hard examples with orthogonal low-rank updates, outperforming both single-shot ultra-low-rank adapters and full fine-tuning on math and code tasks with zero added inference overhead.
Weak-Link Optimization for Multi-Agent Reasoning and Collaboration cs.AI · 2026-04-17 · unverdicted · none · ref 53 · internal anchor
WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
DMax: Aggressive Parallel Decoding for dLLMs cs.LG · 2026-04-09 · conditional · none · ref 20 · 2 links · internal anchor
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
Unlocking Prompt Infilling Capability for Diffusion Language Models cs.CL · 2026-04-04 · unverdicted · none · ref 3 · internal anchor
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning cs.AI · 2025-10-08 · unverdicted · none · ref 31 · internal anchor
Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.
Training Large Language Models to Reason in a Continuous Latent Space cs.CL · 2024-12-09 · unverdicted · none · ref 4 · internal anchor
Coconut lets LLMs perform reasoning directly in continuous latent space by recycling hidden states as inputs, outperforming standard chain-of-thought on search-intensive logical tasks with better accuracy-efficiency trade-offs.
TextGrad: Automatic "Differentiation" via Text cs.CL · 2024-06-11 · unverdicted · none · ref 52 · internal anchor
TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.
Jamba: A Hybrid Transformer-Mamba Language Model cs.CL · 2024-03-28 · conditional · none · ref 11 · internal anchor
Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.
Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent cs.LG · 2026-05-04 · unverdicted · none · ref 6
Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating coverage, variance, and other terms.
Distribution Corrected Offline Data Distillation for Large Language Models cs.CL · 2026-05-13 · unverdicted · none · ref 8 · internal anchor
A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.
GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation cs.LG · 2026-05-12 · unverdicted · none · ref 40 · 2 links · internal anchor
GEAR adaptively reweights GRPO advantages in LLM RL by using divergence spikes from self-distillation to define semantic segments and modulate local credit.
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel cs.AI · 2026-05-12 · unverdicted · none · ref 9 · internal anchor
CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head cs.CL · 2026-05-12 · unverdicted · none · ref 32 · internal anchor
PRISM supplies a geometric upper bound on LLM variant risk that splits drift into scale, shape, and head axes and doubles as a differentiable regularizer against forgetting.
Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning cs.LG · 2026-05-12 · conditional · none · ref 5 · internal anchor
ProFIL trains an activation probe on a frozen base model to zero advantages on theatrical post-commitment rollouts in GRPO, cutting theater 11-100%, raising faithful fractions, and shortening chains 4-19% without accuracy loss.
CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration cs.LG · 2026-05-11 · unverdicted · none · ref 51 · internal anchor
CATS achieves up to 5.08x wall-clock speedup for LLM generation on edge devices via memory-matched cascaded tree speculation, outperforming prior methods by 1.45x with no quality loss.
Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration cs.LG · 2026-05-11 · unverdicted · none · ref 14 · 2 links · internal anchor
SPEX delivers 1.2-3x speedup on ToT algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding.
Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing cs.AI · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
EditRisk-Bench demonstrates that malicious knowledge editing reliably induces incorrect or unsafe reasoning in LLMs while largely preserving general capabilities.
Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training cs.LG · 2026-05-10 · unverdicted · none · ref 63 · internal anchor
Forgetting in LLM continual post-training is a geometry conflict between task-induced covariance structures and the evolving model state, controlled by gating Wasserstein barycenter merging on measured conflict.
Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness cs.LG · 2026-05-10 · unverdicted · none · ref 9 · internal anchor
VPG-EA applies variational posterior guidance and efficiency-aware distillation to compress LLM reasoning chains while preserving performance.
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation cs.LG · 2026-05-09 · unverdicted · none · ref 9 · internal anchor
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning cs.LG · 2026-05-09 · unverdicted · none · ref 41 · 2 links · internal anchor
Muon-OGD introduces a spectral-norm constrained orthogonal projection method solved via dual iterations and Newton-Schulz approximations to improve stability-plasticity trade-off in sequential LLM adaptation.
AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation cs.LG · 2026-05-09 · unverdicted · none · ref 7 · internal anchor
AdaPreLoRA pairs the Adafactor diagonal Kronecker preconditioner on the full weight matrix with a closed-form factor-space solve that selects the update minimizing an H_t-weighted imbalance, yielding competitive results on GPT-2, Mistral-7B, Qwen2-7B and diffusion personalization tasks.
PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding cs.CL · 2026-05-09 · unverdicted · none · ref 10 · internal anchor
PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph cs.LG · 2026-05-08 · unverdicted · none · ref 17 · internal anchor
GraphDPO generalizes pairwise DPO to a graph-structured Plackett-Luce objective over DAGs induced by rollout rankings, enforcing transitivity with linear complexity and recovering DPO as a special case.
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation cs.CL · 2026-05-08 · unverdicted · none · ref 48 · 2 links · internal anchor
SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable cs.AI · 2026-05-08 · unverdicted · none · ref 6 · internal anchor
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training cs.AI · 2026-05-08 · unverdicted · none · ref 33 · internal anchor
ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and knowledge benchmarks.
GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations cs.CL · 2026-05-08 · unverdicted · none · ref 5 · 2 links · internal anchor
GSM-SEM is a reusable framework for creating semantically variant augmentations of math benchmarks like GSM8K that alter facts but preserve answers and difficulty, with evaluations showing LLM performance drops of up to 28% on the new variants.
Gradient Extrapolation-Based Policy Optimization cs.LG · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.
Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems cs.MA · 2026-05-07 · unverdicted · none · ref 45 · 2 links · internal anchor
An ensemble-based information-theoretic active learning method using ensemble Kalman inversion selects valuable tasks to optimize communication structures in LLM multi-agent systems more reliably than random sampling under limited training budgets.
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering cs.AI · 2026-05-07 · unverdicted · none · ref 42 · internal anchor
Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference cs.DC · 2026-05-04 · unverdicted · none · ref 14 · 2 links · internal anchor
SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.