The TAB benchmark reveals that frontier terminal agents achieve high task completion but low selective alignment with relevant environmental cues over distractors, and that prompt-injection defenses block both.
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
62 Pith papers cite this work. Polarity classification is still indexing.
abstract
Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce Tulu 3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. Tulu 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With Tulu 3, we introduce a multi-task evaluation scheme for post-training recipes with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance. In addition to the Tulu 3 model weights and demo, we release the complete recipe -- including datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the Tulu 3 approach to more domains.
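The abstract names three training stages; the third, Reinforcement Learning with Verifiable Rewards (RLVR), swaps a learned reward model for a programmatic check of the final answer. The sketch below is a minimal illustration of that idea, assuming a simple last-number extraction heuristic; the function names are ours, not code from the Tulu 3 release.

```python
import re
from typing import Optional

def extract_answer(completion: str) -> Optional[str]:
    """Heuristically pull the last number out of a model completion."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 only if the extracted answer matches the reference."""
    predicted = extract_answer(completion)
    return 1.0 if predicted is not None and predicted == gold_answer.strip() else 0.0

# A policy-gradient update on sampled completions would then consume this reward
# in place of a reward-model score:
#   reward = verifiable_reward(sampled_completion, example["answer"])
```

The design choice is that the training signal comes from a deterministic check rather than a learned preference model, which is what "verifiable" refers to in the abstract.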
co-cited works
roles: background (1)
polarities: background (1)
representative citing papers
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.
K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks (e.g., 36.67% on AIME 2025).
Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating coverage, variance, and other terms (the underlying equivalence is sketched after this list).
Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.
SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.
CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
A multi-response discriminative reward model scores N candidates in one pass via concatenation and cross-entropy, achieving SOTA on multimodal benchmarks and improving RL policies over single-response baselines (a minimal scoring sketch follows this list).
SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.
ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.
N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.
Bayesian Model Merging introduces a bi-level optimization framework that merges task-specific models via closed-form Bayesian regression with an anchor prior and global hyperparameter search, outperforming baselines and nearly matching expert averages on up to 20-task vision and 5-task language merging benchmarks.
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.
Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
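The weighted-SFT entry above rests on a standard identity between fixed-reference KL-regularized reward maximization and reward-weighted maximum likelihood. The sketch below states that identity in generic notation chosen here (pi_theta, pi_ref, r, beta); it is background for the claim, not a reproduction of the cited paper's BOLT estimator or its error decomposition.

```latex
% Fixed-reference KL-regularized objective and its closed-form optimum.
\begin{align}
  \max_{\pi_\theta}\;
    \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\big[r(x,y)\big]
    \;-\; \beta\,\mathrm{KL}\!\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)
  &\;\Longrightarrow\;
  \pi^{\star}(y \mid x) \;=\; \frac{\pi_{\mathrm{ref}}(y \mid x)\,e^{r(x,y)/\beta}}{Z(x)}, \\
  Z(x) &\;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,e^{r(x,y)/\beta}.
\end{align}
% Weighted SFT on reference samples with prompt-normalized Boltzmann weights
% w(x,y) = e^{r(x,y)/\beta} / Z(x) maximizes
%   E_{y ~ pi_ref}[ w(x,y) log pi_theta(y|x) ] = E_{y ~ pi*}[ log pi_theta(y|x) ],
% so both objectives share the same optimum pi_theta = pi*.
```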
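Similarly, the multi-response reward-model entry above describes scoring N candidates in a single forward pass; the minimal PyTorch sketch below shows one way such a scorer could be wired up. The concatenation layout, pooling at candidate end positions, and the class and argument names are assumptions made here for illustration, not the cited paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResponseScorer(nn.Module):
    """Scores N concatenated candidate responses with one encoder pass."""

    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder                      # any module mapping ids -> per-token states
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, candidate_ends: torch.Tensor) -> torch.Tensor:
        # input_ids:      (batch, seq_len)  holding "prompt + candidate_1 + ... + candidate_N"
        # candidate_ends: (batch, N)        index of the last token of each candidate
        hidden = self.encoder(input_ids)            # (batch, seq_len, hidden_size)
        index = candidate_ends.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
        pooled = torch.gather(hidden, 1, index)     # (batch, N, hidden_size)
        return self.score_head(pooled).squeeze(-1)  # (batch, N): one score per candidate

def multi_response_loss(scores: torch.Tensor, best_index: torch.Tensor) -> torch.Tensor:
    # Cross-entropy over the N scores: the annotated best response must outrank
    # its siblings, all judged jointly in the same pass rather than one at a time.
    return F.cross_entropy(scores, best_index)
```

Judging the candidates jointly is what separates this from a standard reward model that scores each response independently, which is presumably why the summary emphasizes the single-pass concatenation.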
citing papers explorer
- High-Dimensional Statistics: Reflections on Progress and Open Problems
A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.