citation dossier
DeepMath-103K: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning
why this work matters in Pith
Pith has found this work in 16 reviewed papers. Its strongest current cluster is cs.LG (8 papers). The largest review-status bucket among citing papers is UNVERDICTED (15 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
years
2026: 16
representative citing papers
EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issue rather than loss scaling.
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
Forge trains LLMs on NP-hard optimization via quality-aware RLVR, achieving a 93.1% success rate and 46.6% quality ratio with Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, then drops the agents at inference.
LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
RLVR exhibits implicit reward overfitting to its training data, and its weight updates follow heavy-tailed singular spectra that concentrate into a near-rank-1 component tied to reasoning capability.
Mixed-complexity procedural datasets provide up to 5x sample efficiency for RLVR on small models in low-data regimes, with low-to-high complexity generalization observed across counting, graph, and spatial tasks.
PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.
The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.
citing papers explorer
- Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning
OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.
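The capsule above does not spell out the distillation mechanics, but the generic on-policy recipe it gestures at is easy to sketch: the student samples responses, the harness-augmented teacher scores the same tokens, and the student minimizes a per-token reverse KL. A minimal sketch assuming PyTorch causal LMs over a shared vocabulary; this is the standard on-policy distillation loss, not necessarily OPHSD's actual objective.

```python
# Hedged sketch of a generic on-policy distillation step, NOT the OPHSD
# algorithm itself (the paper's harness construction is not reproduced).
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits, mask):
    """Token-level reverse KL on student-sampled sequences.

    student_logits, teacher_logits: [batch, seq, vocab], both scored on the
    SAME sequences, which were sampled from the student (hence "on-policy").
    mask: [batch, seq] with 1 on response tokens, 0 on prompt/padding.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # Reverse KL per token: E_student[log p_student - log p_teacher].
    kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)  # [batch, seq]
    return (kl * mask).sum() / mask.sum().clamp(min=1)
```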
- Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issue rather than loss scaling.
- Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
- DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
- Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
Forge trains LLMs on NP-hard optimization via quality-aware RLVR, achieving a 93.1% success rate and 46.6% quality ratio with Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
- AIPO: Learning to Reason from Active Interaction
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, then drops the agents at inference.
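The "importance sampling and clipping to handle feedback" in this capsule most plausibly refers to the standard PPO-style clipped ratio for correcting off-policy samples; AIPO's exact estimator may differ, so read the following as the generic form such corrections take, with the behavior policy assumed to be the agent-in-the-loop generation distribution.

```latex
% Standard clipped importance-sampling objective (PPO-style); the behavior
% policy pi_behav is assumed here to be the agent-augmented sampler.
\[
\mathcal{L}(\theta)=\mathbb{E}_t\!\left[\min\!\Big(\rho_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}\big(\rho_t(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_t\Big)\right],
\qquad
\rho_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{behav}}(a_t\mid s_t)}.
\]
```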
- Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling
LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.
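The construction in this capsule can be worked out directly. Assuming a constant per-token reward of -1 and discount $\gamma \in (0,1)$ (our derivation from the capsule, not notation taken from the paper), the value of a state with $H$ tokens left to generate is

```latex
\[
V(s_t)=\sum_{k=0}^{H-1}\gamma^{k}(-1)
      =-\frac{1-\gamma^{H}}{1-\gamma}
      \in\Bigl(-\tfrac{1}{1-\gamma},\,0\Bigr],
\qquad
H=\frac{\ln\!\bigl(1+(1-\gamma)V\bigr)}{\ln\gamma}.
\]
```

The value is bounded no matter how long the remaining generation is and strictly decreasing in $H$, so it can be learned like any value function and then inverted into a remaining-length estimate.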
- Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
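The failure mode being fixed is visible in one numpy snippet: under the standard group-relative (GRPO-style) advantage, a fully saturated rollout group carries no learning signal. This illustrates the problem only; CUTS itself is not reproduced here.

```python
# Why saturated prompts stall GRPO, assuming the standard
# group-relative advantage (reward minus group mean, over group std).
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(group_relative_advantages([1, 1, 0, 1]))  # mixed group: non-zero signal
print(group_relative_advantages([1, 1, 1, 1]))  # saturated group: all ~0
# Fully solved ("too correct") prompts contribute no gradient, so updates
# concentrate on the remaining prompts -- one route to mode collapse.
```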
- Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
- Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
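The spectrum-preservation claim rests on a standard linear-algebra fact: multiplying $W$ on either side by orthogonal matrices leaves its singular values unchanged. A quick numeric check of that property (a demonstration, not Pion's update rule, which the capsule does not spell out):

```python
# Singular values are invariant under W -> U W V^T for orthogonal U, V.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
U, _ = np.linalg.qr(rng.standard_normal((6, 6)))  # random orthogonal U
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal V

sv_before = np.linalg.svd(W, compute_uv=False)
sv_after = np.linalg.svd(U @ W @ V.T, compute_uv=False)
print(np.allclose(sv_before, sv_after))  # True: spectrum preserved
```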
- Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
- On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
RLVR exhibits implicit reward overfitting to its training data, and its weight updates follow heavy-tailed singular spectra that concentrate into a near-rank-1 component tied to reasoning capability.
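One way to probe a rank-1 claim of this kind, sketched with hypothetical placeholder matrices rather than the paper's actual checkpoints: take the weight update between checkpoints and measure how much of its energy sits in the top singular direction.

```python
# Diagnostic in the spirit of the claim above; delta_w would be
# W_after - W_before for some layer (placeholders, not the paper's code).
import numpy as np

def top1_energy_fraction(delta_w):
    s = np.linalg.svd(delta_w, compute_uv=False)
    return s[0] ** 2 / (s ** 2).sum()

# A heavy-tailed, near-rank-1 update pushes this ratio toward 1.
rng = np.random.default_rng(0)
u, v = rng.standard_normal(64), rng.standard_normal(64)
delta = 5.0 * np.outer(u, v) / 64 + 0.01 * rng.standard_normal((64, 64))
print(top1_energy_fraction(delta))  # close to 1 for a near-rank-1 update
```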
- Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
Mixed-complexity procedural datasets provide up to 5x sample efficiency for RLVR on small models in low-data regimes, with low-to-high complexity generalization observed across counting, graph, and spatial tasks.
- PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.
- Your Model Diversity, Not Method, Determines Reasoning Strategy
The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.
- Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level