pith. sign in

hub Canonical reference

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Canonical reference. 89% of citing Pith papers cite this work as background.

23 Pith papers citing it
Background 89% of classified citations
abstract

Deep research agents perform multi-step research to produce long-form, well-attributed answers. However, most open deep research agents are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards, which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), where rubrics are constructed and maintained to co-evolve with the policy model during training. This allows the rubrics to incorporate newly explored information from search and contrasting model responses, enabling better fact checking and more discriminative on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first fully open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare, and general domains, DR Tulu substantially outperforms existing open deep research agents (by 15.6% over Tongyi DR on average) and matches or exceeds proprietary deep research agents (by 0.7% over OpenAI DR on average), while being significantly smaller and cheaper per query (1000x cheaper than OpenAI DR per query).

hub tools

citation-role summary

background 8 method 1

citation-polarity summary

years

2026 22 2025 1

representative citing papers

Rubric-based On-policy Distillation

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.

Deep Research as Rubric for Reinforcement Learning

cs.CL · 2026-05-31 · unverdicted · novelty 6.0

DR-rubric is a two-stage framework using iterative agentic search to generate atomic verifiable constraints for GRPO-based RL, achieving competitive performance on 6 benchmarks with 1K-3K examples via bootstrap or frontier-model rubrics.

Reward Hacking in Rubric-Based Reinforcement Learning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.

Differentiable Evolutionary Reinforcement Learning

cs.AI · 2025-12-15 · unverdicted · novelty 6.0

DERL is a differentiable bi-level method that evolves optimal reward structures for RL policies by composing atomic primitives and using meta-gradients from validation performance.

citing papers explorer

Showing 23 of 23 citing papers.