Pith Number

pith:VJ25UGWN

pith:2025:VJ25UGWNXNXDEFM2XM7YODMSRU
not attested · not anchored · not stored · refs resolved

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Benfeng Xu, Chiwei Zhu, Mingxuan Du, Xiaorui Wang, Zhendong Mao

DeepResearch Bench provides 100 PhD-level research tasks across 22 fields, plus two methods for evaluating deep research agents that align with human judgment.

arxiv:2506.11763 v1 · 2025-06-13 · cs.CL · cs.IR

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{VJ25UGWNXNXDEFM2XM7YODMSRU}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

We present DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks... We therefore propose two novel methodologies that achieve strong alignment with human judgment.

C2 weakest assumption

The 100 tasks crafted by domain experts across 22 fields are representative of real deep-research challenges, and the two proposed evaluation methodologies genuinely align with human judgment without introducing systematic bias or requiring undisclosed tuning.

C3 one-line summary

DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

References

66 extracted · 66 resolved · 19 Pith anchors

[1] 2024 · arXiv:2408.07055
[2] Mle-bench: Evaluating machine learning agents on machine learning engineering 2025 · arXiv:2410.07095
[3] ScienceAgentBench: Toward rigorous assessment of language agents for data-driven scientific discovery 2025
[4] deepseek-ai/DeepSeek-V3-0324 · Hugging Face, March 2025
[5] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning 2025 · arXiv:2501.12948

Formal links

2 machine-checked theorem links

Cited by

33 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:48.555402Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

aa75da1acdbb6e32159abb3f870d928d33da2195ace29d4b094e865f5e65104b

Aliases

arxiv: 2506.11763 · arxiv_version: 2506.11763v1 · doi: 10.48550/arxiv.2506.11763 · pith_short_12: VJ25UGWNXNXD · pith_short_16: VJ25UGWNXNXDEFM2 · pith_short_8: VJ25UGWN
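The identifiers on this page appear to be related in a simple way: the 26-character Pith Number matches the unpadded Base32 encoding of the first 16 bytes of the canonical hash, and the `pith_short_*` aliases are prefixes of it. This is an observation from this one record, not a documented specification, and the function names below are hypothetical. A minimal Python sketch under that assumption:

```python
import base64

# Canonical hash exactly as shown on this page.
CANONICAL_HASH = "aa75da1acdbb6e32159abb3f870d928d33da2195ace29d4b094e865f5e65104b"


def pith_number_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of the digest and drop '=' padding.

    Assumption: inferred from this record only, not from a published spec.
    """
    first_16_bytes = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(first_16_bytes).decode("ascii").rstrip("=")


def short_aliases(pith_number: str) -> dict:
    """Build the pith_short_8 / _12 / _16 aliases as plain prefixes."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


pith = pith_number_from_hash(CANONICAL_HASH)
print(pith)                 # VJ25UGWNXNXDEFM2XM7YODMSRU
print(short_aliases(pith))
```

If the relationship holds in general, any alias can be checked against the canonical hash without a network call.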
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/VJ25UGWNXNXDEFM2XM7YODMSRU \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: aa75da1acdbb6e32159abb3f870d928d33da2195ace29d4b094e865f5e65104b
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "ac435ac616e289a2223f5bfea0c46dd657fc5aa9999a47cf319bbb3cdc7134f9",
    "cross_cats_sorted": [
      "cs.IR"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2025-06-13T13:17:32Z",
    "title_canon_sha256": "3a96bff25666a3568df2e4fd406d47bb61a953c95d6d9a9afbd2665556103b76"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2506.11763",
    "kind": "arxiv",
    "version": 1
  }
}
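The verification one-liner above fetches the record over the network. The same canonicalization can be sketched offline in Python against the record as printed here. The serialization settings (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) are taken from that one-liner; `RECORD` and `canonical_sha256` are illustrative names. Whether the digest equals the canonical hash depends on the printed record being complete, which this sketch assumes rather than guarantees.

```python
import hashlib
import json

# Canonical record copied from this page (assumed complete).
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "ac435ac616e289a2223f5bfea0c46dd657fc5aa9999a47cf319bbb3cdc7134f9",
        "cross_cats_sorted": ["cs.IR"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2025-06-13T13:17:32Z",
        "title_canon_sha256": "3a96bff25666a3568df2e4fd406d47bb61a953c95d6d9a9afbd2665556103b76",
    },
    "schema_version": "1.0",
    "source": {"id": "2506.11763", "kind": "arxiv", "version": 1},
}


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Compare this value with the "Canonical hash" listed on the page.
print(canonical_sha256(RECORD))
```

Because keys are sorted before hashing, the digest does not depend on the order in which the JSON fields happen to be written.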