
arxiv: 2411.15124 · v5 · submitted 2024-11-22 · 💻 cs.CL

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Pith reviewed 2026-05-11 05:03 UTC · model grok-4.3

classification 💻 cs.CL
keywords language model post-training, supervised fine-tuning, direct preference optimization, reinforcement learning, open source models, benchmark evaluation, data decontamination

The pith

Fully open post-training on Llama 3.1 bases yields models that surpass several closed systems on benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Tulu 3, a family of models refined from Llama 3.1 bases through supervised finetuning, direct preference optimization, and a new method called reinforcement learning with verifiable rewards. These models achieve higher scores than the official instruct versions of Llama 3.1, Qwen 2.5, and Mistral, as well as closed models including GPT-4o-mini and Claude 3.5-Haiku. The work supplies complete datasets, code, infrastructure, and a multi-task evaluation scheme that includes development and unseen splits along with decontamination of training data. A sympathetic reader would care because post-training steps have long remained opaque, and an open recipe that reaches competitive performance removes a major barrier to further progress. The authors also report which training approaches failed to deliver reliable gains.

Core claim

Tulu 3 demonstrates that applying supervised finetuning, direct preference optimization, and reinforcement learning with verifiable rewards to Llama 3.1 base models, using carefully curated and decontaminated data, produces results that exceed those of Llama 3.1 instruct models, Qwen 2.5 instruct, Mistral instruct, GPT-4o-mini, and Claude 3.5-Haiku on the multi-task benchmarks.

What carries the argument

Reinforcement Learning with Verifiable Rewards (RLVR), which uses automatically verifiable signals to guide reinforcement learning instead of relying only on preference data or model judges.
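
To make the idea of a verifiable reward concrete, here is a minimal Python sketch, assuming a math-style task where the last number in the completion is treated as the answer. The function names, the answer-extraction rule, and the `reward_value` parameter are illustrative assumptions, not the paper's released implementation.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number that appears in the completion, if any.

    Assumption for this sketch: the final number is the model's answer.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None


def verifiable_reward(completion: str, ground_truth: str,
                      reward_value: float = 1.0) -> float:
    """Give a fixed reward only when the extracted answer matches the label.

    `reward_value` is a hypothetical knob; the check is fully automatic,
    so no preference data or model judge is involved.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return reward_value if float(answer) == float(ground_truth) else 0.0


if __name__ == "__main__":
    print(verifiable_reward("2 + 2 = 4, so the answer is 4.", "4"))
    print(verifiable_reward("I believe the answer is 5.", "4"))
```

The reward is binary by design: a completion either passes the automatic check or it does not, which is what distinguishes this signal from a learned reward model's score.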

If this is right

  • Post-training can be fully reproduced and adapted to new domains using the released data, code, and procedures.
  • A combination of SFT, DPO, and RLVR reliably improves over base models on the tested benchmarks.
  • Decontamination and separate unseen splits provide a stricter test than standard benchmark reporting.
  • Some common training techniques do not produce consistent improvements and can be deprioritized.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The open release of the full pipeline could allow independent groups to match or exceed current closed-model performance on similar tasks.
  • RLVR may extend naturally to any domain where correctness can be checked automatically, such as code generation or mathematical reasoning.
  • Widespread adoption of the decontamination and multi-split evaluation approach could raise the bar for future post-training papers.

Load-bearing premise

The multi-task evaluation scheme with decontamination and unseen splits accurately measures real generalization instead of overfitting to known benchmark distributions.

What would settle it

A run of the released Tulu 3 models, either on a new collection of tasks assembled after the training data cutoff or on live user queries, that shows no performance edge over the original instruct baselines.

Original abstract

Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce Tulu 3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. Tulu 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With Tulu 3, we introduce a multi-task evaluation scheme for post-training recipes with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance. In addition to the Tulu 3 model weights and demo, we release the complete recipe -- including datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the Tulu 3 approach to more domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Tulu 3, a family of fully open post-trained models built on Llama 3.1 base models. It applies supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and a novel Reinforcement Learning with Verifiable Rewards (RLVR) method, claiming superior performance over Llama 3.1 instruct, Qwen 2.5, Mistral, GPT-4o-mini, and Claude 3.5-Haiku. The work releases all training data, code, recipes, and models, while introducing a multi-task evaluation protocol with development/unseen splits and substantial decontamination of open benchmark datasets.

Significance. If the benchmark results hold under rigorous decontamination, the primary contribution is the complete, reproducible open recipe for modern post-training that includes both established methods and RLVR, plus analysis of approaches that failed to improve performance. Releasing the full data, code, infrastructure, and detailed report enables independent verification and adaptation, which is a substantial advance for the open-source community.

major comments (2)
  1. [Abstract / Evaluation section] Abstract and evaluation description: the claim of surpassing closed models rests on benchmark results after 'substantial decontamination,' yet no concrete method is specified (e.g., n-gram overlap thresholds, embedding similarity cutoffs, model-based detection, or paraphrase handling). Without these details, residual leakage on MMLU, GSM8K, or HumanEval cannot be ruled out, directly affecting the validity of the generalization claims.
  2. [Results] Results presentation: the abstract reports benchmark wins, but the manuscript must include full tables with per-task scores, error bars or multiple seeds, and explicit ablations isolating the contribution of RLVR versus SFT+DPO to substantiate the performance frontier claim.
minor comments (1)
  1. [Evaluation] The multi-task evaluation scheme with dev/unseen splits is a positive design choice; clarify how the unseen split is constructed and whether it overlaps with any training data beyond the stated decontamination.
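
As an illustration of one option the referee names, n-gram overlap thresholds, here is a minimal Python sketch of an overlap-based contamination check. The whitespace tokenization, the default `n`, and the `threshold` value are assumptions chosen for the example; they are not the procedure or settings reported in the paper.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercased whitespace-token n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(train_text: str, benchmark_text: str, n: int = 8) -> float:
    """Fraction of the training example's n-grams that also occur in the benchmark item."""
    train = ngrams(train_text, n)
    if not train:
        return 0.0  # text shorter than n tokens: nothing to compare
    return len(train & ngrams(benchmark_text, n)) / len(train)


def is_contaminated(train_text: str, benchmark_texts: Iterable[str],
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a training example whose overlap with any benchmark item reaches the threshold."""
    return any(overlap_fraction(train_text, b, n) >= threshold
               for b in benchmark_texts)


if __name__ == "__main__":
    bench = ["what is the capital of france"]
    print(is_contaminated("what is the capital of france and why", bench, n=3))
    print(is_contaminated("write a poem about the sea in autumn", bench, n=3))
```

Exact n-gram matching misses paraphrased leakage, which is why the referee also lists embedding similarity and model-based detection as alternatives.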

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications and additions.

Point-by-point responses
  1. Referee: [Abstract / Evaluation section] Abstract and evaluation description: the claim of surpassing closed models rests on benchmark results after 'substantial decontamination,' yet no concrete method is specified (e.g., n-gram overlap thresholds, embedding similarity cutoffs, model-based detection, or paraphrase handling). Without these details, residual leakage on MMLU, GSM8K, or HumanEval cannot be ruled out, directly affecting the validity of the generalization claims.

    Authors: We agree that explicit details on the decontamination procedure are necessary to support the generalization claims. In the revised manuscript, we will add a dedicated subsection in the evaluation protocol describing the exact decontamination methods, including the n-gram overlap thresholds applied, embedding similarity cutoffs, any model-based detection steps, and handling of paraphrases. This will allow readers to assess residual leakage risks on MMLU, GSM8K, HumanEval, and other benchmarks. revision: yes

  2. Referee: [Results] Results presentation: the abstract reports benchmark wins, but the manuscript must include full tables with per-task scores, error bars or multiple seeds, and explicit ablations isolating the contribution of RLVR versus SFT+DPO to substantiate the performance frontier claim.

    Authors: We will strengthen the results section to meet this requirement. The revised paper will include comprehensive tables with per-task scores across all evaluated benchmarks, report error bars or multi-seed averages where computationally feasible, and add explicit ablation experiments that isolate the contribution of RLVR relative to the SFT+DPO baseline. These changes will more clearly substantiate the performance claims and the value of the novel RLVR method. revision: yes
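
For the multi-seed reporting mentioned above, a minimal Python sketch of one common convention: the mean across seeds with the standard error of the mean as the error bar. The function name is hypothetical and the scores in the usage example are placeholder values, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize_seeds(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate an error bar")
    mean = statistics.mean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


if __name__ == "__main__":
    # Placeholder per-seed accuracies, for illustration only.
    mean, sem = summarize_seeds([0.70, 0.72, 0.74])
    print(f"{mean:.3f} +/- {sem:.3f}")
```

With only a handful of seeds the standard error is itself a rough estimate, so small gaps between an SFT+DPO baseline and an RLVR run should be read with that uncertainty in mind.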

Circularity Check

0 steps flagged

No circularity: purely empirical post-training results with released artifacts

full rationale

The paper reports experimental outcomes from SFT, DPO, and the introduced RLVR on Llama 3.1 bases, evaluated via multi-task benchmarks with decontamination and unseen splits. No derivation chain, equations, or first-principles predictions exist that could reduce to inputs by construction; performance claims rest on direct training runs and external verification via released data, code, and models rather than self-referential metrics or fitted parameters renamed as predictions. Self-citations to prior Tulu work are present but non-load-bearing for the central empirical claims, which remain independently falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on standard supervised learning assumptions (i.i.d. data, gradient descent convergence) and benchmark validity; no new invented entities or ad-hoc axioms are introduced in the abstract. Free parameters are the usual training hyperparameters (learning rates, batch sizes, reward scales) whose specific values are not detailed here.

pith-pipeline@v0.9.0 · 5693 in / 1154 out tokens · 23646 ms · 2026-05-11T05:03:07.115505+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RECIPE: Procedural Planning via Grounding in Instructional Video

    cs.CV 2026-05 unverdicted novelty 7.0

    RECIPE improves visual procedural planners by rewarding plans according to their grounding quality in ASR transcripts via GRPO, yielding +7–8 in-domain and up to +16 zero-shot macro-accuracy gains over base models and...

  2. A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAMΔ Integration into Upcycled MoE

    cs.CL 2026-05 unverdicted novelty 7.0

    PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.

  3. Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

    cs.LG 2026-05 unverdicted novelty 7.0

    Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable ...

  4. MeMo: Memory as a Model

    cs.CL 2026-05 unverdicted novelty 7.0

    MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...

  5. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  6. CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

    cs.CV 2026-05 unverdicted novelty 7.0

    CurveBench is a new benchmark for recovering rooted containment trees from images of nested Jordan curves, where the strongest model reaches only 19.1% accuracy on hard cases and fine-tuning lifts an open model to 33....

  7. CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

    cs.CV 2026-05 unverdicted novelty 7.0

    CurveBench benchmark reveals that even leading VLMs like Gemini 3.1 Pro reach only 71.1% accuracy recovering containment trees on easy nested-curve images and 19.1% on hard versions, while fine-tuning lifts an open 8B...

  8. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  9. No More, No Less: Task Alignment in Terminal Agents

    cs.LG 2026-05 unverdicted novelty 7.0

    The TAB benchmark reveals that frontier terminal agents achieve high task completion but low selective alignment with relevant environmental cues over distractors, and prompt-injection defenses block both.

  10. Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  11. Variance-aware Reward Modeling with Anchor Guidance

    stat.ML 2026-05 unverdicted novelty 7.0

    Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, ...

  12. K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

    cs.CL 2026-05 conditional novelty 7.0

    K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.

  13. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  14. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  15. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 7.0

    MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

  16. Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

    cs.LG 2026-05 unverdicted novelty 7.0

    Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...

  17. Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

    cs.CL 2026-04 unverdicted novelty 7.0

    Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.

  18. Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

    cs.AI 2026-04 unverdicted novelty 7.0

    SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.

  19. Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

  20. You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

    cs.CV 2026-04 unverdicted novelty 7.0

    A multi-response discriminative reward model scores N candidates in one pass via concatenation and cross-entropy, achieving SOTA on multimodal benchmarks and improving RL policies over single-response baselines.

  21. SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

    cs.AI 2026-04 unverdicted novelty 7.0

    SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.

  22. ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

    cs.LG 2026-04 unverdicted novelty 7.0

    ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneo...

  23. What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

    cs.LG 2026-03 unverdicted novelty 7.0

    SCRL adds selective positive pseudo-labeling and entropy-gated negative pseudo-labeling to test-time RL, reducing noise from weak consensus and improving LLM reasoning on benchmarks.

  24. Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing

    cs.LG 2026-02 unverdicted novelty 7.0

    Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.

  25. HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments

    cs.DC 2025-12 unverdicted novelty 7.0

    HetRL delivers up to 9.17x higher throughput for LLM RL training on heterogeneous GPUs by using hybrid and ILP-based schedulers to solve a joint optimization problem over computation and data dependencies.

  26. ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

    cs.CL 2025-10 unverdicted novelty 7.0

    ProfBench is a new multi-domain benchmark with human-expert rubrics for judging LLM responses on professional tasks, showing top models reach only 65.9% performance while providing cheap LLM judges that reduce evaluat...

  27. Decision Potential Surface: A Theoretical and Practical Approximation of Large Language Model Decision Boundary

    cs.LG 2025-09 unverdicted novelty 7.0

    Defines Decision Potential Surface (DPS) whose zero isohypse equals an LLM decision boundary and supplies a K-sample approximation algorithm with derived upper bounds on absolute, expected, and concentration errors.

  28. Task-Dependent Evaluation of LLM Output Homogenization: A Taxonomy-Guided Framework

    cs.CL 2025-09 conditional novelty 7.0

    Proposes a task taxonomy for functional diversity in LLM outputs, validates it via user study, introduces targeted sampling to boost diversity only where needed, and presents evidence that the diversity-quality tradeo...

  29. Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

    cs.LG 2025-08 unverdicted novelty 7.0

    TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.

  30. Reinforcement Learning for Reasoning in Large Language Models with One Training Example

    cs.LG 2025-04 accept novelty 7.0

    One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.

  31. EVE-Agent: Evidence-Verifiable Self-Evolving Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    EVE-Agent adds an evidence verifier to the proposer-solver loop that rewards spans by marginal accuracy gain, producing self-generated but inspectable training examples for search agents.

  32. Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    MOOD benchmark shows guard models fail to generalize to OOD alignment failures in LLMs, but combining them with Mahalanobis and perplexity OOD detectors improves recall from 39% to 45% with better scaling than larger ...

  33. TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health

    cs.LG 2026-05 unverdicted novelty 6.0

    TimeSRL uses semantic abstractions from time-series data optimized via reinforcement learning to achieve better cross-dataset generalization than standard ML or LLM baselines in mental health prediction.

  34. Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficien...

  35. SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

    cs.AI 2026-05 unverdicted novelty 6.0

    SAPO computes per-reasoning-step group-relative advantages in RL to improve credit assignment for structured generation of semantic identifiers in recommendation systems.

  36. Self-Supervised On-Policy Distillation for Reasoning Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIM...

  37. PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

    cs.AI 2026-05 unverdicted novelty 6.0

    PopuLoRA shows that co-evolving populations of LoRA adapters through cross-evaluated self-play can outperform compute-matched single-agent baselines on multiple code and math reasoning benchmarks.

  38. Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

    cs.AI 2026-05 unverdicted novelty 6.0

    NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on...

  39. SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    SAGE reshapes the reverse-KL anchor via guide function q(x,y) for controllable empirical support expansion, yielding gains in both pass@1 and pass@k on math reasoning benchmarks.

  40. BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

    cs.AI 2026-05 conditional novelty 6.0

    BEAM uses binary expert activation masks trained end-to-end to achieve dynamic sparsity in MoE models, cutting FLOPs by 85% with over 98% performance retention.

  41. PreFT: Prefill-only finetuning for efficient inference

    cs.LG 2026-05 accept novelty 6.0

    Prefill-only adaptation of LLMs yields 1.9x higher throughput for 512 adapters on Llama 3.1 70B with near-parity performance on RL tasks and recoverable loss on SFT.

  42. N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation

    cs.LG 2026-05 unverdicted novelty 6.0

    N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.

  43. Bayesian Model Merging

    cs.LG 2026-05 unverdicted novelty 6.0

    Bayesian Model Merging introduces a bi-level optimization framework that merges task-specific models via closed-form Bayesian regression with an anchor prior and global hyperparameter search, outperforming baselines a...

  44. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  45. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...

  46. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...

  47. Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

  48. Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.

  49. Annotations Mitigate Post-Training Mode Collapse

    cs.CL 2026-05 unverdicted novelty 6.0

    Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.

  50. dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

    cs.LG 2026-05 unverdicted novelty 6.0

    dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.

  51. DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

    cs.CL 2026-05 unverdicted novelty 6.0

    DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.

  52. Reinforcing Multimodal Reasoning Against Visual Degradation

    cs.CV 2026-05 unverdicted novelty 6.0

    ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.

  53. Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.

  54. Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

    cs.AI 2026-05 unverdicted novelty 6.0

    SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at...

  55. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  56. Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.

  57. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 6.0

    MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...

  58. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 6.0

    TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

  59. Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    Kernel smoothing enables accurate low-variance value and gradient estimates for policy optimization in LLM reasoning under tight sampling constraints per prompt.

  60. Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    Kernel smoothing yields accurate value and gradient estimates for low-variance policy learning in LLM reasoning under tight per-prompt sampling budgets.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 120 Pith papers · 2 internal anchors

  1. [1]

    URL https://openreview.net/forum?id=Ep0TtjVoap. D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. Jha, H. Ivison, I. Magnusson, Y. Wang, S. Arora, D. Atkinson, R. Authur, K. R. Chandu, A. Cohan, J. Dumas, Y. Elazar, Y. Gu, J. Hessel, T. Khot, W. Merrill, J. D. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, V. Pyatkin, A. ...

  2. [2]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    URL https://openreview.net/forum?id=1qvx610Cu7. Y. Liu. Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692 , 364, 2019. S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, et al. The flan collection: Designing data and methods for effective instruction tuning.arXiv preprint ar...

  3. [3]

    {% "{% "{{ '<|system|>\n' + message['content'] + '\n' }}

    Association for Computational Linguistics. URL https://aclanthology.org/2024.emnlp-main.79. C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and D. Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023. H. Xu, B. Liu, L. Shu, and P. Yu. BERT post-training for review reading com...

  4. [4]

    The above example is not tied to any particular persona, but you should create one that is unique and specific to the given persona

  5. [5]

    The instruction should contain all the following verifiable constraint(s):{constraints}

  6. [6]

    User instruction:

    Your output should start with "User instruction:". Your output should not include an answer to the instruction. Figure 30 Prompt used to generate precise instruction following instances. {persona} are borrowed from Chan et al. (2024). We use the set of {constraints} defined in Zhou et al. (2023). Example seeds are manually written by authors for each constr...

  7. [7]

    You should rewrite the instruction coherently while relaxing one of the following constraint categories:{constraints}

  8. [8]

    Remember to entirely relax one of the constraint category

  9. [9]

    User instruction:

    Your output should start with "User instruction:". Your output should not include an answer to the instruction. Figure 32 Prompt used to modify an instruction following query minimally such that the answer to the rewritten prompt does not satisfy the original query and thus can be used as a rejected response for preference data construction. Hard ...

  10. [10]

    Only top talents can solve it correctly

    The math problem should be challenging and involve advanced mathematical skills and knowledge. Only top talents can solve it correctly

  11. [11]

    You should make full use of the persona description to create the math problem to ensure that the math problem is unique and specific to the persona

  12. [12]

    Math problem:

    Your response should always start with "Math problem:". Your response should not include a solution to the created math problem

  13. [13]

    Figure 33 Prompt used to generate hard math word problems. {persona} are borrowed from Chan et al

    Your created math problem should include no more than 2 sub-problems. Figure 33 Prompt used to generate hard math word problems. {persona} are borrowed from Chan et al. (2024). Hard Math Problems (response) Provide solution to the given math problem. Problem: {generated_math_problem} Note: Provide your solution step-by-step, and end your solution in a n...

  14. [14]

    Your question should be solvable by entry- to medium-level python programmers

  15. [15]

    Your question should clearly specify the type of input, expected output and an optional example

  16. [16]

    Question: Write a python function to

    Your response should always start with "Question: Write a python function to"

  17. [17]

    Figure 35 Prompt used to generate code completion instances. {persona} are borrowed from Chan et al

    Your response should not include a solution to the created coding problem. Figure 35 Prompt used to generate code completion instances. {persona} are borrowed from Chan et al. (2024). Code Completion (response) Provide solution to the given python programming question. Question: {generated_code_problem} Note:

  18. [18]

    Your response should always start with the function definition and end with the final return statement

  19. [19]

    Instruction

    Your response should only and only include python function. Figure 36 Prompt used to generate code completion. System prompt for LLM-as-a-judge Your role is to evaluate text quality based on given criteria. You’ll receive an instructional description (“Instruction”) and text outputs (“Text”). Understand and interpret instructions to evaluate effectivel...

  20. [20]

    Irrelevant: No alignment

  21. [21]

    Partial Focus: Addresses one aspect poorly

  22. [22]

    - (2) Acknowledges both but slight deviations

    Partial Compliance: - (1) Meets goal or restrictions, neglecting other. - (2) Acknowledges both but slight deviations

  23. [23]

    Almost There: Near alignment, minor deviations

  24. [24]

    Figure 39 Guideline for rating a model response using the Instruction Following aspect given an instruction and a list of completions, adapted from Cui et al

    Comprehensive Compliance: Fully aligns, meets all requirements. Figure 39 Guideline for rating a model response using the Instruction Following aspect given an instruction and a list of completions, adapted from Cui et al. (2023). Informativeness or Helpfulness Aspect (prompt) # Informativeness / Helpfulness Assessment Evaluate if model’s outputs fulfil...

  25. [25]

    Clarity and Relevance: Ensure response relates to the task and seek clarifications if needed

  26. [26]

    Useful and Comprehensive Information: Provide relevant background, reasoning steps, or detailed description

  27. [27]

    Score 1 to 5 based on extent of helpfulness, regarding both informativeness and correctness:

    Not Lengthy, No Repetition: Avoid verbosity or recycling content. Score 1 to 5 based on extent of helpfulness, regarding both informativeness and correctness:

  28. [28]

    Severely Incorrect: Contains significant inaccuracies or fabricated content, even if comprehensive information is provided

  29. [29]

    Partially Incorrect: Contains errors that may cause confusion, even though comprehensive information is present

  30. [30]

    Correct: Accurate and provides useful information that meets the task’s requirements

  31. [31]

    Highly Informative: Accurate and extensive, providing valuable insights and detailed information

  32. [32]

    Figure 40 Guideline for rating a model response using the Helpfulness aspect given an instruction and a list of completions, adapted from Cui et al

    Outstandingly Helpful: Both accurate and in-depth, offering profound insights and comprehensive information. Figure 40 Guideline for rating a model response using the Helpfulness aspect given an instruction and a list of completions, adapted from Cui et al. (2023). Honesty Aspect (prompt) # Honesty and Uncertainty Expression Assessment Assess how well t...

  33. [33]

    Weakeners: e.g., ‘I guess,’ ‘probably.’

  34. [34]

    - No uncertainty expression indicate confidence

    Verbalized confidence scores: [0, 20] low; (20, 40] uncertain; (40, 60] moderate; (60, 80] leaning confident; (80, 100] high. - No uncertainty expression indicate confidence. - Response Correctness: Align with ground truth, or provide accurate content without fabrication. Scoring: Rate outputs 1 to 5 (or “N/A”):

  35. [35]

    Confidently Incorrect: Confident but entirely wrong

  36. [36]

    - Unconfident and entirely wrong

    Confident with Significant Mistakes / Unconfident Incorrect: - Confident but contains major errors. - Unconfident and entirely wrong

  37. [37]

    - Confident but contains minor errors

    Uncertain / ‘I Don’t Know’ / Subtle Mistakes: - ‘I don’t know’ or declines. - Confident but contains minor errors. - Unconfident and contains significant mistakes

  38. [38]

    - Makes subtle mistakes but expresses uncertainty without specifying the exact area of doubt

    Correct but Uncertain / Expressed Subtle Mistakes: - Correct but unconfident. - Makes subtle mistakes but expresses uncertainty without specifying the exact area of doubt

  39. [39]

    - Makes mistakes, but precisely acknowledges minor errors and indicates uncertainty on potential mistakes

    Correct and Confident / Precisely Express Uncertainty: - Correct and confident. - Makes mistakes, but precisely acknowledges minor errors and indicates uncertainty on potential mistakes. N/A. Not Applicable: For creative writing tasks. Figure 41 Guideline for rating a model response using the Honesty aspect given an instruction and a list of completions, a...

  40. [40]

    Contradictory with the World (Factual Error): Entities, locations, concepts, or events that conflict with established knowledge

  41. [41]

    Contradictory with Instruction and Input: Responses diverge, introducing new facts not aligned with instructions or inputs

  42. [42]

    Scoring: Rate outputs 1 to 5 based on extent of hallucination:

    Self-Contradictory / Logical Error: Responses contain internal contradictions or logical errors within each independent text. Scoring: Rate outputs 1 to 5 based on extent of hallucination:

  43. [43]

    Completely Hallucinated: Entirely unreliable due to hallucinations

  44. [44]

    Severe Hallucination: Nearly half contains hallucinations, severe deviation from main points

  45. [45]

    Therefore, the answer is (ANSWER_LETTER)

    Partial Hallucination / Misunderstanding: Overall truthful, partial misunderstanding due to hallucinations. 4. Insignificant Hallucination: Mostly truthful, slight hallucination not affecting main points. 5. No Hallucination: Free of hallucinations. Figure 42 Guideline for rating a model response using the Truthfulness aspect given an instruction and a li...