hub Mixed citations

ResearchRubrics: A benchmark of prompts and rubrics for evaluating deep research agents

Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, Aishwarya H · 2025 · arXiv 2511.07685

Mixed citation behavior. Most common role is background (67%).

19 Pith papers citing it

Background 67% of classified citations

citation-role summary

background: 5 · baseline: 1

citation-polarity summary

background: 4 · baseline: 1 · unclear: 1

representative citing papers

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17 · unverdicted · novelty 8.0

A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.

Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

RGSD distills rubric-conditioned teacher distributions into base policies token-by-token, matching GRPO rubric satisfaction on Qwen models with one rollout and zero verifier calls.

FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

FORT synthesizes shortcut-resistant search tasks by controlling four identified shortcut risks across entity selection, graph construction, question formulation, and refinement, producing training data that yields agents with longer search trajectories and top performance among open-source models on

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

VibeSearchBench provides 200 tasks across 20 domains with progressive-disclosure simulation and graph-matching evaluation, showing frontier LLM agents achieve at most 30.30 F1 on long-horizon proactive search.

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

REFLECT benchmark shows current LLM judges achieve below 55% accuracy detecting failures in evidence-based research agents, especially on evidence verification.

RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement

cs.LG · 2026-05-10 · unverdicted · novelty 7.0 · 3 refs

RubricRefine is a training-free pre-execution method that creates rubrics to score and fix inter-tool contract violations in agent code, reaching 0.86 average on M3ToolEval across seven models with zero executions and lower latency.

ICBCBench: An Industry Consortium Benchmark for Financial Deep Research

cs.CE · 2026-06-16 · unverdicted · novelty 6.0

ICBCBench is a new consortium-built benchmark that jointly measures retrieval-reasoning accuracy and end-to-end report quality for deep research agents in finance.

BlueFin: Benchmarking LLM Agents on Financial Spreadsheets

cs.SE · 2026-05-29 · unverdicted · novelty 6.0

BlueFin is a new benchmark for LLM agents on financial spreadsheets showing frontier models score below 50% with weaknesses in dynamic correctness.

Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems

cs.MA · 2026-05-28 · unverdicted · novelty 6.0

Meta-Team is a collaborative self-evolution framework that turns multi-agent execution experience into reusable improvements at agent, coordination, and team levels, outperforming baselines on six benchmarks.

Reward Hacking in Rubric-Based Reinforcement Learning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.

SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.

Self-Optimizing Multi-Agent Systems for Deep Research

cs.IR · 2026-04-03 · unverdicted · novelty 6.0

Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.

Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

cs.CL · 2026-04-03 · unverdicted · novelty 6.0

RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.

Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence

cs.AI · 2026-02-16 · unverdicted · novelty 6.0

A tuned Bioptic Agent achieves 79.7% F1 on a new multilingual benchmark for global drug asset scouting, outperforming Gemini, Claude, GPT, and other models.

On the Generalization Gap in Self-Evolving Language Model Reasoning

cs.CL · 2026-05-31 · unverdicted · novelty 5.0

Closed-loop self-evolution on LLMs improves reasoning on Knights and Knaves tasks but plateaus short of oracle-supervised levels, with multi-turn revision nearly matching it for large models.

Mind DeepResearch Technical Report

cs.AI · 2026-04-16 · unverdicted · novelty 5.0

MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

cs.AI · 2026-06-08 · unverdicted · novelty 4.0

Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.

Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity

cs.AI · 2026-06-30 · unverdicted · novelty 2.0

Seed2.0 model series reports gains in reasoning, visual understanding, search, and reliability on intricate long-horizon tasks via an internal evaluation system.

citing papers explorer

Showing 19 of 19 citing papers.

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps · cs.AI · 2026-05-17 · unverdicted · none · ref 21
Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers · cs.LG · 2026-06-10 · unverdicted · none · ref 23
FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents · cs.CL · 2026-06-10 · unverdicted · none · ref 44
VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild · cs.CL · 2026-05-27 · unverdicted · none · ref 7
Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents? · cs.CL · 2026-05-18 · unverdicted · none · ref 48
RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement · cs.LG · 2026-05-10 · unverdicted · none · ref 30 · 3 links
ICBCBench: An Industry Consortium Benchmark for Financial Deep Research · cs.CE · 2026-06-16 · unverdicted · none · ref 33
BlueFin: Benchmarking LLM Agents on Financial Spreadsheets · cs.SE · 2026-05-29 · unverdicted · none · ref 30
Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems · cs.MA · 2026-05-28 · unverdicted · none · ref 59
Reward Hacking in Rubric-Based Reinforcement Learning · cs.AI · 2026-05-12 · unverdicted · none · ref 28
SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution · cs.LG · 2026-05-08 · unverdicted · none · ref 21
Self-Optimizing Multi-Agent Systems for Deep Research · cs.IR · 2026-04-03 · unverdicted · none · ref 14
Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks · cs.CL · 2026-04-03 · unverdicted · none · ref 13
Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence · cs.AI · 2026-02-16 · unverdicted · none · ref 6
On the Generalization Gap in Self-Evolving Language Model Reasoning · cs.CL · 2026-05-31 · unverdicted · none · ref 32
Mind DeepResearch Technical Report · cs.AI · 2026-04-16 · unverdicted · none · ref 29
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges · cs.LG · 2026-04-15 · unverdicted · none · ref 151
Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization · cs.AI · 2026-06-08 · unverdicted · none · ref 189
Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity · cs.AI · 2026-06-30 · unverdicted · none · ref 94
