A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.
hub Mixed citations
ResearchRubrics : A benchmark of prompts and rubrics for evaluating deep research agents
Mixed citation behavior. Most common role is background (67%).
hub tools
citation-role summary
citation-polarity summary
years
2026 19verdicts
UNVERDICTED 19representative citing papers
RGSD distills rubric-conditioned teacher distributions into base policies token-by-token, matching GRPO rubric satisfaction on Qwen models with one rollout and zero verifier calls.
FORT synthesizes shortcut-resistant search tasks by controlling four identified shortcut risks across entity selection, graph construction, question formulation, and refinement, producing training data that yields agents with longer search trajectories and top performance among open-source models on
VibeSearchBench provides 200 tasks across 20 domains with progressive-disclosure simulation and graph-matching evaluation, showing frontier LLM agents achieve at most 30.30 F1 on long-horizon proactive search.
REFLECT benchmark shows current LLM judges achieve below 55% accuracy detecting failures in evidence-based research agents, especially on evidence verification.
RubricRefine is a training-free pre-execution method that creates rubrics to score and fix inter-tool contract violations in agent code, reaching 0.86 average on M3ToolEval across seven models with zero executions and lower latency.
ICBCBench is a new consortium-built benchmark that jointly measures retrieval-reasoning accuracy and end-to-end report quality for deep research agents in finance.
BlueFin is a new benchmark for LLM agents on financial spreadsheets showing frontier models score below 50% with weaknesses in dynamic correctness.
Meta-Team is a collaborative self-evolution framework that turns multi-agent execution experience into reusable improvements at agent, coordination, and team levels, outperforming baselines on six benchmarks.
Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.
SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.
Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.
RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.
A tuned Bioptic Agent achieves 79.7% F1 on a new multilingual benchmark for global drug asset scouting, outperforming Gemini, Claude, GPT, and other models.
Closed-loop self-evolution on LLMs improves reasoning on Knights and Knaves tasks but plateaus short of oracle-supervised levels, with multi-turn revision nearly matching it for large models.
MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.
Seed2.0 model series reports gains in reasoning, visual understanding, search, and reliability on intricate long-horizon tasks via an internal evaluation system.
citing papers explorer
-
Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.
-
Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers
RGSD distills rubric-conditioned teacher distributions into base policies token-by-token, matching GRPO rubric satisfaction on Qwen models with one rollout and zero verifier calls.
-
FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents
FORT synthesizes shortcut-resistant search tasks by controlling four identified shortcut risks across entity selection, graph construction, question formulation, and refinement, producing training data that yields agents with longer search trajectories and top performance among open-source models on
-
VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild
VibeSearchBench provides 200 tasks across 20 domains with progressive-disclosure simulation and graph-matching evaluation, showing frontier LLM agents achieve at most 30.30 F1 on long-horizon proactive search.
-
Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
REFLECT benchmark shows current LLM judges achieve below 55% accuracy detecting failures in evidence-based research agents, especially on evidence verification.
-
RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement
RubricRefine is a training-free pre-execution method that creates rubrics to score and fix inter-tool contract violations in agent code, reaching 0.86 average on M3ToolEval across seven models with zero executions and lower latency.
-
ICBCBench: An Industry Consortium Benchmark for Financial Deep Research
ICBCBench is a new consortium-built benchmark that jointly measures retrieval-reasoning accuracy and end-to-end report quality for deep research agents in finance.
-
BlueFin: Benchmarking LLM Agents on Financial Spreadsheets
BlueFin is a new benchmark for LLM agents on financial spreadsheets showing frontier models score below 50% with weaknesses in dynamic correctness.
-
Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems
Meta-Team is a collaborative self-evolution framework that turns multi-agent execution experience into reusable improvements at agent, coordination, and team levels, outperforming baselines on six benchmarks.
-
Reward Hacking in Rubric-Based Reinforcement Learning
Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.
-
SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution
SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.
-
Self-Optimizing Multi-Agent Systems for Deep Research
Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.
-
Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks
RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.
-
Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence
A tuned Bioptic Agent achieves 79.7% F1 on a new multilingual benchmark for global drug asset scouting, outperforming Gemini, Claude, GPT, and other models.
-
On the Generalization Gap in Self-Evolving Language Model Reasoning
Closed-loop self-evolution on LLMs improves reasoning on Knights and Knaves tasks but plateaus short of oracle-supervised levels, with multi-turn revision nearly matching it for large models.
-
Mind DeepResearch Technical Report
MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
-
Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization
Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.
-
Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity
Seed2.0 model series reports gains in reasoning, visual understanding, search, and reliability on intricate long-horizon tasks via an internal evaluation system.