DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
19 Pith papers cite this work. Polarity classification is still indexing.
citation roles: background 1
citation polarities: background 1
representative citing papers
IRDS selects RLVR data via verifier-coupled SAE cluster coverage using greedy log-determinant maximization, reporting accuracy gains over baselines on math benchmarks.
Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.
Retrieving structured thinking traces as a corpus improves reasoning performance on AIME, LiveCodeBench, and GPQA over standard RAG or no retrieval.
CoTEvol evolves CoT trajectories via reflective crossover and uncertainty-guided mutation to synthesize more accurate and diverse math reasoning data, outperforming distillation and search-based methods.
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Dropout-GRPO uses structured dropout to generate trajectory variance for GRPO in latent-reasoning models like Coconut, raising GSM8K pass@1 from 27.29% to 29.01%.
Double preconditioning (DoPr) improves downstream task performance in test-time feedback settings without consistent gains in validation loss.
Merging breaks MoE routing via softmax sensitivity; HARC uses Hessian curvature for closed-form router calibration that improves merged model performance without retraining.
Interaction-layer antidistillation watermarks use system-prompt-induced behavioral markers like explicit follow-up questions that transfer to distilled student models at 45-89% relative fidelity and can be audited via black-box LLM-as-judge queries.
STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
A cooperative system with one SLM distilling stepwise hints from a large model to guide another SLM's math reasoning yields consistent accuracy gains on benchmarks.
ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
Static depth-staggered Fibonacci sparse attention improves perplexity over fixed/learned variants and extrapolates to 4x context while dense attention fails.
LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.
Mimir is a 1.6B multilingual concept model pretrained on 38.9 billion sentences across 46 languages and instruction-tuned on 66.8 million sentences across 35 languages, then compared to a token-based LM of similar size.
citing papers explorer
- ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.