Pith Number

pith:VJ25UGWN

pith:2025:VJ25UGWNXNXDEFM2XM7YODMSRU

not attested · not anchored · not stored · refs resolved

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Benfeng Xu, Chiwei Zhu, Mingxuan Du, Xiaorui Wang, Zhendong Mao

DeepResearch Bench supplies 100 PhD-level tasks across 22 fields plus two evaluation methods that align with human judgment for deep research agents.

arxiv:2506.11763 v1 · 2025-06-13 · cs.CL · cs.IR

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{VJ25UGWNXNXDEFM2XM7YODMSRU}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
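The page does not describe the merge algorithm itself, so the following is only a generic sketch of what a deterministic, order-independent merge can look like. Everything in it is an assumption made for illustration: the event fields (`id`, `ts`, `field`, `value`), the last-writer-wins rule, and the function name `merge_state` are hypothetical and are not taken from the Pith schema. Signature checking is left out entirely.

```python
from typing import Any


def merge_state(canonical: dict[str, Any], events: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold events into a copy of the canonical record (hypothetical rule set).

    Events are sorted by (ts, id) before being applied, so any mirror that
    holds the same set of events computes the same state, regardless of the
    order in which it received them.
    """
    state = dict(canonical)
    seen_ids = set()
    for event in sorted(events, key=lambda e: (e["ts"], e["id"])):
        if event["id"] in seen_ids:
            continue  # ignore duplicate deliveries of the same event
        seen_ids.add(event["id"])
        state[event["field"]] = event["value"]  # last writer wins
    return state


# Two mirrors that receive the same (made-up) events in different orders.
events = [
    {"id": "e1", "ts": "2026-05-17T23:38:48Z", "field": "citations", "value": "open"},
    {"id": "e2", "ts": "2026-05-18T00:00:00Z", "field": "citations", "value": "closed"},
]
mirror_a = merge_state({"schema_version": "1.0"}, events)
mirror_b = merge_state({"schema_version": "1.0"}, list(reversed(events)))
print(mirror_a == mirror_b)
```

Sorting on a total order before folding is what makes the result independent of delivery order; a real implementation would also need to verify each event's signature before applying it.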

Claims

C1 strongest claim

We present DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks... We therefore propose two novel methodologies that achieve strong alignment with human judgment.

C2 weakest assumption

The 100 tasks crafted by domain experts across 22 fields are representative of real deep-research challenges and the two proposed evaluation methodologies genuinely align with human judgment without introducing systematic bias or requiring undisclosed tuning.

C3 one-line summary

DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

References

66 extracted · 66 resolved · 19 Pith anchors

[1] arXiv:2408.07055 · 2024

[2] MLE-bench: Evaluating machine learning agents on machine learning engineering · 2025 · arXiv:2410.07095

[3] ScienceAgentBench: Toward rigorous assessment of language agents for data-driven scientific discovery · 2025

[4] deepseek-ai/DeepSeek-V3-0324 · Hugging Face · March 2025

[5] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning · 2025 · arXiv:2501.12948

Formal links

2 machine-checked theorem links

Cited by

33 papers in Pith

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

An AI system to help scientists write expert-level empirical software

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

Receipt and verification

First computed	2026-05-17T23:38:48.555402Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

aa75da1acdbb6e32159abb3f870d928d33da2195ace29d4b094e865f5e65104b

Aliases

arxiv: 2506.11763 · arxiv_version: 2506.11763v1 · doi: 10.48550/arxiv.2506.11763 · pith_short_12: VJ25UGWNXNXD · pith_short_16: VJ25UGWNXNXDEFM2 · pith_short_8: VJ25UGWN
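The short aliases listed above are prefixes of the full 26-character identifier. For this record, the identifier also coincides with the unpadded Base32 (RFC 4648) encoding of the first 16 bytes of the canonical hash. That second relationship is an observation about the values printed on this page, not a rule stated by the schema, and the helper names below are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "aa75da1acdbb6e32159abb3f870d928d33da2195ace29d4b094e865f5e65104b"
PITH_NUMBER = "VJ25UGWNXNXDEFM2XM7YODMSRU"


def base32_prefix_of_hash(hex_digest: str, n_bytes: int = 16) -> str:
    """Base32-encode the first n_bytes of a hex digest and drop '=' padding."""
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> dict[str, str]:
    """Build the pith_short_N aliases as plain prefixes of the identifier."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


print(base32_prefix_of_hash(CANONICAL_HASH) == PITH_NUMBER)
print(short_aliases(PITH_NUMBER))
```

Whether other records follow the same derivation would need to be checked against the schema or the builder; the sketch only confirms it for the values shown here.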

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/VJ25UGWNXNXDEFM2XM7YODMSRU \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: aa75da1acdbb6e32159abb3f870d928d33da2195ace29d4b094e865f5e65104b

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "ac435ac616e289a2223f5bfea0c46dd657fc5aa9999a47cf319bbb3cdc7134f9",
    "cross_cats_sorted": [
      "cs.IR"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2025-06-13T13:17:32Z",
    "title_canon_sha256": "3a96bff25666a3568df2e4fd406d47bb61a953c95d6d9a9afbd2665556103b76"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2506.11763",
    "kind": "arxiv",
    "version": 1
  }
}
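The same check can be sketched offline in Python, using the record printed above instead of fetching it. The canonicalization settings (`sort_keys=True`, compact separators, `ensure_ascii=False`, UTF-8 encoding) are copied from the shell one-liner in the verification section; the function name `canonical_sha256` is a hypothetical helper. The comparison only holds if the JSON shown here is a faithful copy of what the resolver returns, so the script prints the result rather than assuming it.

```python
import hashlib
import json

# Canonical record as printed on this page.
CANONICAL_RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "ac435ac616e289a2223f5bfea0c46dd657fc5aa9999a47cf319bbb3cdc7134f9",
    "cross_cats_sorted": ["cs.IR"],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2025-06-13T13:17:32Z",
    "title_canon_sha256": "3a96bff25666a3568df2e4fd406d47bb61a953c95d6d9a9afbd2665556103b76"
  },
  "schema_version": "1.0",
  "source": {"id": "2506.11763", "kind": "arxiv", "version": 1}
}
"""

EXPECTED = "aa75da1acdbb6e32159abb3f870d928d33da2195ace29d4b094e865f5e65104b"


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no whitespace, then hash the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = json.loads(CANONICAL_RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches expected hash:", digest == EXPECTED)
```

Because keys are sorted and whitespace is fixed before hashing, the digest does not depend on how the JSON was pretty-printed or in what order the keys arrived.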