Training Verifiers to Solve Math Word Problems

Heewoo Jun, Karl Cobbe, Lukasz Kaiser, Mark Chen, Mohammad Bavarian, Vineet Kosaraju · 2021 · cs.LG · arXiv 2110.14168

Mixed citation behavior. The most common role is background (47% of classified citations).

1030 Pith papers cite it.

abstract

State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
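The test-time procedure described in the abstract (sample many candidate solutions, then keep the one the verifier ranks highest) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `select_best_of_n`, `make_toy_sampler`, and `toy_verifier` are hypothetical names, the default `n=8` is arbitrary, and the toy sampler and scorer stand in for a real generator and a trained verifier.

```python
import itertools
from typing import Callable, Iterable, List, Tuple


def select_best_of_n(
    problem: str,
    sample_solution: Callable[[str], str],
    verifier_score: Callable[[str, str], float],
    n: int = 8,  # arbitrary default; the abstract only says "many"
) -> Tuple[str, float]:
    """Sample n candidate solutions and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [sample_solution(problem) for _ in range(n)]
    scored = [(verifier_score(problem, c), c) for c in candidates]
    # Rank by verifier score only; ties keep the earliest candidate.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def make_toy_sampler(answers: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a generator: cycles through fixed strings."""
    cycle = itertools.cycle(answers)
    return lambda problem: next(cycle)


def toy_verifier(problem: str, solution: str) -> float:
    """Hypothetical stand-in for a trained verifier: rewards one known answer."""
    return 1.0 if solution.endswith("Answer: 5") else 0.0


if __name__ == "__main__":
    sampler = make_toy_sampler(
        ["3 + 1 = 4. Answer: 4", "3 + 2 = 5. Answer: 5", "3 * 2 = 6. Answer: 6"]
    )
    best, score = select_best_of_n(
        "Ann has 3 apples and buys 2 more. How many does she have?",
        sampler,
        toy_verifier,
        n=3,
    )
    print(best, score)
```

In a real setting the verifier would be a learned model that outputs a correctness score for each completion; only the selection step is shown here.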

citation-role summary

background 125 · dataset 100 · method 7 · baseline 4 · other 2

citation-polarity summary

background 113 · use dataset 98 · unclear 16 · use method 7 · baseline 4

Recognition alignment

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

cs.CL · 2026-05-13 · unverdicted · novelty 8.0 · 2 refs

Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

cs.LG · 2026-05-12 · accept · novelty 8.0

Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies

cs.LG · 2026-05-11 · accept · novelty 8.0 · 2 refs

Corruption studies of CoT faithfulness largely measure explicit answer placement in prompt format rather than computational importance of reasoning steps.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

cs.AI · 2026-04-20 · accept · novelty 8.0

MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmentation yielding up to 12% gains.

Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

cs.CL · 2026-04-08 · unverdicted · novelty 8.0

A nine-dimension algebraic complexity framework shows that LLMs suffer a scale-invariant working memory bottleneck, collapsing at 20-30 parallel branches regardless of parameter count from 8B to 235B.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

Self-Calibrating Language Models via Test-Time Discriminative Distillation

cs.CL · 2026-03-18 · unverdicted · novelty 8.0

SECL reduces expected calibration error in language models by 56-78% via test-time discriminative distillation from the model's own P(True) signal, adapting on only 6-26% of inputs.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

cs.CL · 2025-04-15 · conditional · novelty 8.0

DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Training Software Engineering Agents and Verifiers with SWE-Gym

cs.SE · 2024-12-30 · conditional · novelty 8.0

SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

PAL: Program-aided Language Models

cs.CL · 2022-11-18 · conditional · novelty 8.0

PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.

citing papers explorer

Showing 50 of 1030 citing papers.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models cs.CL · 2022-01-28 · accept · none · ref 16 · internal anchor
Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding cs.CL · 2026-05-13 · unverdicted · none · ref 6 · 2 links · internal anchor
Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.
HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts cs.LG · 2026-05-13 · unverdicted · none · ref 10 · internal anchor
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.
FlowCompile: An Optimizing Compiler for Structured LLM Workflows cs.CL · 2026-05-13 · unverdicted · none · ref 8 · internal anchor
FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
Grid Games: The Power of Multiple Grids for Quantizing Large Language Models cs.LG · 2026-05-12 · accept · none · ref 8 · internal anchor
Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.
The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies cs.LG · 2026-05-11 · accept · none · ref 7 · 2 links · internal anchor
Corruption studies of CoT faithfulness largely measure explicit answer placement in prompt format rather than computational importance of reasoning steps.
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs cs.CL · 2026-05-09 · unverdicted · none · ref 11 · 2 links · internal anchor
Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs cs.LG · 2026-05-08 · unverdicted · none · ref 14 · internal anchor
MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.
The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits cs.LG · 2026-05-08 · unverdicted · none · ref 23 · internal anchor
Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval cs.AI · 2026-04-20 · accept · none · ref 4 · internal anchor
MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmentation yielding up to 12% gains.
Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions cs.CL · 2026-04-08 · unverdicted · none · ref 2 · internal anchor
A nine-dimension algebraic complexity framework shows that LLMs suffer a scale-invariant working memory bottleneck, collapsing at 20-30 parallel branches regardless of parameter count from 8B to 235B.
Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models cs.AI · 2026-04-02 · unverdicted · none · ref 4 · internal anchor
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
Self-Calibrating Language Models via Test-Time Discriminative Distillation cs.CL · 2026-03-18 · unverdicted · none · ref 1 · internal anchor
SECL reduces expected calibration error in language models by 56-78% via test-time discriminative distillation from the model's own P(True) signal, adapting on only 6-26% of inputs.
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages cs.LG · 2026-03-13 · unverdicted · none · ref 4 · internal anchor
Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 24 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning cs.CL · 2025-04-15 · conditional · none · ref 5 · internal anchor
DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.
Why Do Multi-Agent LLM Systems Fail? cs.AI · 2025-03-17 · unverdicted · none · ref 84 · internal anchor
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Large Language Diffusion Models cs.CL · 2025-02-14 · unverdicted · none · ref 118 · internal anchor
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Training Software Engineering Agents and Verifiers with SWE-Gym cs.SE · 2024-12-30 · conditional · none · ref 2 · internal anchor
SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection cs.CL · 2024-10-06 · unverdicted · none · ref 11 · internal anchor
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models cs.CV · 2024-09-25 · accept · none · ref 24 · internal anchor
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
LiveBench: A Challenging, Contamination-Limited LLM Benchmark cs.CL · 2024-06-27 · unverdicted · none · ref 8 · internal anchor
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines cs.CL · 2023-10-05 · conditional · none · ref 13 · internal anchor
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
PAL: Program-aided Language Models cs.CL · 2022-11-18 · conditional · none · ref 10 · internal anchor
PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.
Code as Policies: Language Model Programs for Embodied Control cs.RO · 2022-09-16 · accept · none · ref 45 · internal anchor
Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.
Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs cs.AI · 2026-05-30 · unverdicted · none · ref 25 · internal anchor
REP elicits hidden LLM reasoning traces via in-context shadow demonstrations, raising similarity to internal traces while retaining distillation utility across datasets and models.
TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding cs.AI · 2026-05-30 · unverdicted · none · ref 36 · internal anchor
TAPS converts diffusion marginal probabilities into path-conditioned acceptance estimates to select prefix-closed subtrees under a fixed verification budget, achieving up to 7.9x end-to-end speedup over autoregressive decoding.
D³: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training cs.CL · 2026-05-29 · unverdicted · none · ref 7 · internal anchor
D³ introduces a dynamic directional graph-constrained framework that models sample interactions via loss dependencies to derive an optimized training sequence for LLMs.
The Shape of Addition: Geometric Structures of Arithmetic in Large Language Models cs.LG · 2026-05-29 · unverdicted · none · ref 4 · internal anchor
LLM residual streams during addition form an Iso-Raw-Sum Trajectory anchored by digit semantics and modulated by continuous carry signals, with errors arising as geometric slippages across quantization thresholds in a noisy model.
Probing the Prompt KV Cache: Where It Becomes Dispensable cs.CL · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
Prompt KV cache redundancy in LLMs is primarily in chat template form rather than user content, as neutral scaffold replacement in upper layers preserves accuracy across Qwen3, Gemma 3, and Llama 3.
Unlocking the Working Memory of Large Language Models for Latent Reasoning cs.CL · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
RiM trains LLMs to perform latent reasoning via fixed memory blocks processed in one forward pass using a two-stage curriculum, matching or exceeding prior latent methods on benchmarks.
How's it going? Reinforcement learning in language models recruits a functional welfare axis cs.LG · 2026-05-28 · unverdicted · none · ref 7 · internal anchor
Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.
Latent Performance Profiling of Large Language Models cs.CL · 2026-05-28 · unverdicted · none · ref 9 · internal anchor
Introduces Latent Performance Profiling (LPP) as a task-agnostic framework deriving scalar metrics from LLM latent representations and dynamics to complement benchmark evaluations.
Formalizing Mathematics at Scale cs.AI · 2026-05-28 · accept · none · ref 17 · internal anchor
A multi-agent framework called AutoformBot autoformalized 26 textbooks spanning analysis, algebra, topology, combinatorics and probability into a verified Lean 4 library of 45k declarations, demonstrating scalable formalization of graduate math.
Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting cs.LG · 2026-05-28 · unverdicted · none · ref 14 · internal anchor
BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.
BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base cs.CL · 2026-05-28 · unverdicted · none · ref 6 · internal anchor
BrahmicTokenizer-131K is a 131K-vocab tokenizer constructed via script-prune crop and linear-programming retrofit to o200k_base, achieving 26.7% fewer tokens on Indic text while matching o200k_base on English fertility and outperforming alternatives on code/math benchmarks.
EvoGM: Learning to Merge LLMs via Evolutionary Generative Optimization cs.NE · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
EvoGM uses a dual-generator architecture with cycle-consistent learning on winner-loser pairs from search history to optimize LLM merging coefficients inside a multi-round evolutionary pipeline and reports outperformance over baselines on seen and unseen tasks.
Towards Cost-effective LLMs Routing with Batch Prompting cs.DB · 2026-05-27 · unverdicted · none · ref 9 · internal anchor
RoBatch is a two-stage framework that formulates and solves the joint Route with Batching Problem via a batch-aware proxy utility model and greedy scheduling, outperforming separate routing or batching baselines on six benchmarks.
Training-Free Looped Transformers cs.LG · 2026-05-22 · unverdicted · none · ref 22 · internal anchor
Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.
Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning cs.CL · 2026-05-22 · unverdicted · none · ref 5 · internal anchor
Representational convergence across 16 LLMs on 800 reasoning problems is stronger for failed tasks and pre-decision stages but shows minimal causal influence on predictions, pointing to shared processing constraints over shared reasoning.
Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks cs.CL · 2026-05-22 · conditional · none · ref 5 · internal anchor
Audits find that no reasoning benchmark jointly controls position, filler, and length; CRE shows LLMs drop up to 88 pp on middle-position tasks at 64K context, with a diagnostic probe supporting a positional cause.
Learnability-Informed Fine-Tuning of Diffusion Language Models cs.CL · 2026-05-21 · unverdicted · none · ref 6 · internal anchor
LIFT is a learnability-informed SFT algorithm for diffusion LMs that aligns token difficulty with diffusion time steps, yielding up to 3x gains on AIME'24 and AIME'25 over standard SFT baselines.
GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving cs.LG · 2026-05-21 · unverdicted · none · ref 3 · internal anchor
GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.
Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems cs.AI · 2026-05-20 · unverdicted · none · ref 12 · internal anchor
Proposes EpG and OOI metrics showing agentic workflows use 4.33x more energy per successful goal than linear baselines due to orchestration structure.
Residual Skill Optimization for Text-to-SQL Ensembles cs.CL · 2026-05-20 · unverdicted · none · ref 7 · internal anchor
Residual skill optimization creates complementary Text-to-SQL agents by training each new skill on prior ensemble failures, yielding accuracy gains on Spider2-Lite and transfer to other dialects and tasks.
RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator cs.CL · 2026-05-20 · unverdicted · none · ref 9 · internal anchor
RankJudge creates paired multi-turn conversations with isolated single-turn flaws to generate unambiguous benchmarks for LLM-as-a-judge systems across ML, biomedicine, and finance domains.
X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation cs.LG · 2026-05-20 · conditional · none · ref 14 · internal anchor
X-Token proposes projection-guided P-KL and H-KL losses to fix uncommon-token suppression and over-conservative matching in logit-based cross-tokenizer distillation, yielding gains over GOLD on Llama-3.2-1B.
What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema cs.LG · 2026-05-20 · accept · none · ref 13 · internal anchor
Pilot audit of twelve LLM benchmark papers finds mean disclosure score of 0.38/1.0 for agent benchmarks versus 0.66 for classical ones, with zero papers disclosing inference costs or full harness specs, and releases an open JSON schema plus scoring CSV.
TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization cs.CL · 2026-05-20 · unverdicted · none · ref 44 · internal anchor
TextReg mitigates prompt distributional overfitting via regularized text-space optimization, reporting up to +16.5% OOD accuracy gains over prior methods on reasoning benchmarks.
The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models cs.LG · 2026-05-20 · unverdicted · none · ref 6 · internal anchor
In 1-3B instruction-tuned LMs on GSM8K, arithmetic CoT readout is dominated by positional copying of the trailing number before the answer delimiter, accounting for 54-92 percentage points of accuracy.