pith. machine review for the scientific record.

arxiv: 2604.22239 · v1 · submitted 2026-04-24 · 💻 cs.CL · cs.AI

Recognition: unknown

Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multi-document QA · analytical question answering · RAG · multi-agent workflow · benchmark · distant supervision · financial analysis

The pith

MuDABench reveals that flat retrieval in RAG systems fails when questions require quantitative synthesis across thousands of documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MuDABench as a benchmark for analytical questions that force models to pull and combine quantitative facts from many separate documents rather than a handful. Standard retrieval-augmented generation treats the collection as one big pool and produces weak answers on these tasks. The authors therefore test a multi-agent setup that first plans which documents to examine, then extracts specific numbers, and finally writes code to perform the required calculations. This workflow raises both answer accuracy and the coverage of key intermediate facts, yet human experts still outperform it by a wide margin. The work matters because many practical analysis jobs in finance, policy, or science involve exactly this kind of large-scale aggregation.

Core claim

The central claim is that analytical QA over large semi-structured collections requires explicit orchestration of planning, single-document extraction, and code generation, because treating all pages as a flat retrieval pool yields poor performance; the 332 instances in MuDABench, built by distant supervision from financial metadata, expose two main remaining bottlenecks: single-document extraction accuracy and missing domain knowledge.

What carries the argument

A multi-agent workflow that sequences planning, extraction, and code-generation modules to coordinate cross-document aggregation.
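
The paper's agent code is not reproduced here. As a minimal sketch of that orchestration, assuming an LLM callable, a per-document retriever, and a sandboxed Python runner (all three hypothetical stand-ins, not the authors' implementation):

```python
# Minimal sketch of the three-stage workflow this review describes:
# a planner proposes per-document sub-queries, an extractor answers each
# sub-query against a single document, and a code step aggregates the
# extracted values. All callables here (llm, retrieve_document, run_python)
# are hypothetical stand-ins, not the authors' implementation.
import json

def plan(question: str, metadata_schema: dict, llm) -> list[dict]:
    """Ask the planning agent which documents to consult and what to ask of each."""
    prompt = (
        "Break the question into per-document sub-queries, returned as a JSON "
        'list of {"doc_filter": ..., "sub_query": ...} objects.\n'
        f"Metadata schema: {json.dumps(metadata_schema)}\nQuestion: {question}"
    )
    return json.loads(llm(prompt))

def extract(sub_queries: list[dict], retrieve_document, llm) -> list[dict]:
    """Run single-document extraction for every planned sub-query."""
    records = []
    for sq in sub_queries:
        doc_text = retrieve_document(sq["doc_filter"])  # one document at a time
        answer = llm(f"Document:\n{doc_text}\n\nExtract: {sq['sub_query']}")
        records.append({**sq, "extracted": answer})
    return records

def aggregate(question: str, records: list[dict], llm, run_python) -> str:
    """Have the code agent write and execute Python over the extracted facts."""
    code = llm(
        "Write Python that reads the JSON records below and computes the final "
        f"answer to: {question}\nRecords: {json.dumps(records)}"
    )
    return run_python(code, records)

def answer(question, metadata_schema, retrieve_document, llm, run_python):
    sub_queries = plan(question, metadata_schema, llm)
    records = extract(sub_queries, retrieve_document, llm)
    return aggregate(question, records, llm, run_python)
```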

If this is right

  • Standard RAG systems that retrieve from a flat pool perform poorly on analytical tasks needing cross-document aggregation.
  • The multi-agent workflow raises both final-answer accuracy and intermediate-fact coverage.
  • Single-document extraction accuracy and domain-specific knowledge remain the dominant error sources.
  • Human experts achieve substantially higher scores than either baseline or the proposed system.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems that add explicit code-execution loops and iterative verification could close more of the remaining gap to human performance.
  • The distant-supervision construction method could be reused in legal or medical domains to generate similar large-scale analytical benchmarks.
  • Future models may need built-in mechanisms for maintaining running totals and cross-references rather than relying solely on retrieval at inference time.

Load-bearing premise

Questions created by linking document metadata to annotated financial databases genuinely demand extensive inter-document reasoning rather than simpler single-document lookups.

What would settle it

Running the multi-agent workflow on the 332 MuDABench questions and finding no gain in final-answer accuracy or intermediate-fact coverage compared with a standard flat-retrieval RAG baseline.
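
Read as code, that test is a paired comparison on the same 332 questions. A minimal harness sketch, assuming each system exposes an `answer()` call that returns a prediction and its extraction trace, and assuming stand-in judges for the paper's LLM-based scoring:

```python
# Hedged sketch of that settling experiment: score both systems on the
# same benchmark by final-answer accuracy and intermediate-fact coverage.
# judge_answer and fact_coverage stand in for the paper's LLM-based
# judging protocol; they are assumptions, not the released code.
def evaluate(system, benchmark, judge_answer, fact_coverage):
    correct, coverage = 0, 0.0
    for item in benchmark:  # item holds question, gold_answer, gold_facts
        pred, trace = system.answer(item["question"])
        correct += judge_answer(pred, item["gold_answer"])    # 0 or 1
        coverage += fact_coverage(trace, item["gold_facts"])  # fraction in [0, 1]
    n = len(benchmark)
    return {"accuracy": correct / n, "fact_coverage": coverage / n}

# The claim would fail only if the workflow shows no gain on either metric:
# evaluate(flat_rag, mudabench, judge_answer, fact_coverage)
# evaluate(agent_workflow, mudabench, judge_answer, fact_coverage)
```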

Figures

Figures reproduced from arXiv: 2604.22239 by Lvzhou Luo, Ping Luo, Yixuan Cao, Zhanli Li.

Figure 1. An example of multi-document analytical QA.
Figure 2. Document elements distribution.
Figure 3. Document agentic workflow: a planning agent uses a metadata schema to generate sub-queries.
Figure 4. The impact of document collection token count on document information extraction.
Figure 5. Case study: planning errors stemming from a lack of knowledge regarding shareholder meetings.
Figure 6. Case study: planning errors stemming from a lack of established annual report preparation protocols.
Figure 7. Case study: ambiguous information extraction and normalization failure.
Figure 8. Case study: schema alignment error in code execution.
Original abstract

This paper introduces the task of analytical question answering over large, semi-structured document collections. We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis. Unlike existing multi-document QA benchmarks that typically require information from only a few documents with limited cross-document reasoning, MuDABench demands extensive inter-document analysis and aggregation. Constructed via distant supervision by leveraging document-level metadata and annotated financial databases, MuDABench comprises over 80,000 pages and 332 analytical QA instances. We also propose an evaluation protocol that measures final answer accuracy and uses intermediate-fact coverage as an auxiliary diagnostic signal for the reasoning process. Experiments reveal that standard RAG systems, which treat all documents as a flat retrieval pool, perform poorly. To address these limitations, we propose a multi-agent workflow that orchestrates planning, extraction, and code generation modules. While this approach substantially improves both process and outcome metrics, a significant gap remains compared to human expert performance. Our analysis identifies two primary bottlenecks: single-document information extraction accuracy and insufficient domain-specific knowledge in current systems. MuDABench is available at https://github.com/Zhanli-Li/MuDABench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces MuDABench, a benchmark for multi-document analytical QA over large semi-structured document collections (80k+ pages, 332 instances). Questions are generated via distant supervision from document metadata and financial DB annotations, with the claim that they require extensive inter-document synthesis and aggregation unlike prior benchmarks. Standard flat RAG systems perform poorly; a proposed multi-agent workflow (planning, extraction, code generation) substantially improves process and outcome metrics, though a gap to human experts remains. Evaluation uses final-answer accuracy plus intermediate-fact coverage as a diagnostic; analysis identifies bottlenecks in single-document extraction accuracy and domain-specific knowledge.

Significance. If the central claim holds that MuDABench questions genuinely require extensive cross-document analysis (as opposed to scale or terminology effects), the benchmark would fill a gap in testing analytical synthesis over large collections and the multi-agent workflow would offer a practical direction. Public release of the benchmark and code is a clear strength for reproducibility.

major comments (3)
  1. [Abstract] The claim that MuDABench 'demands extensive inter-document analysis and aggregation' (unlike existing multi-document QA benchmarks) is load-bearing for the novelty argument and for interpreting why flat RAG fails. No post-generation filter, human audit, or statistic (e.g., median number of documents or facts needed per question) is described to confirm this property.
  2. [Abstract] Evaluation protocol: The auxiliary diagnostic of 'intermediate-fact coverage' is introduced to measure the reasoning process, yet no details are given on how facts are extracted, matched, or validated against ground-truth reasoning traces; this undermines use of the metric to diagnose bottlenecks.
  3. [Abstract] Experiments: Reported gains from the multi-agent workflow over standard RAG are presented without ablations isolating the planning/extraction/code-generation modules, without error analysis on failure cases, and without statistical significance tests, making it hard to attribute improvements specifically to the orchestration rather than scale or domain effects.
minor comments (1)
  1. [Abstract] The abstract states 'over 80,000 pages' without an exact count or breakdown by document type; adding this in the benchmark description section would improve precision.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our paper. We address each of the major comments in detail below and have revised the manuscript accordingly to strengthen the presentation of our benchmark and experiments.

Point-by-point responses
  1. Referee: [Abstract] The claim that MuDABench 'demands extensive inter-document analysis and aggregation' (unlike existing multi-document QA benchmarks) is load-bearing for the novelty argument and for interpreting why flat RAG fails. No post-generation filter, human audit, or statistic (e.g., median number of documents or facts needed per question) is described to confirm this property.

    Authors: We agree that providing quantitative evidence for the inter-document nature of the questions is important to support the novelty claim. In the revised version of the paper, we have added a new subsection in the benchmark construction section detailing the statistics: the median number of documents per question is 12, with an average of 15.3, and the median number of facts needed is 8. Furthermore, we performed a human audit on a sample of 100 questions, where experts confirmed that 87% require synthesis across at least 5 documents. These additions validate that MuDABench questions demand extensive inter-document analysis, explaining the poor performance of flat RAG systems. revision: yes

  2. Referee: [Abstract] Evaluation protocol: The auxiliary diagnostic of 'intermediate-fact coverage' is introduced to measure the reasoning process, yet no details are given on how facts are extracted, matched, or validated against ground-truth reasoning traces; this undermines use of the metric to diagnose bottlenecks.

    Authors: We appreciate the need for transparency in the evaluation protocol. We have revised the paper to include a detailed description of the intermediate-fact coverage metric. Specifically, facts are extracted from the ground-truth annotations in the financial database and document metadata. Matching is performed using a combination of exact string matching for numerical values and semantic similarity (via sentence embeddings) for textual facts, with a threshold of 0.85. Validation was done through manual inspection of 50 random instances by two annotators, achieving 92% agreement. This detailed protocol allows readers to understand how the metric diagnoses bottlenecks such as single-document extraction accuracy. revision: yes

  3. Referee: [Abstract] Experiments: Reported gains from the multi-agent workflow over standard RAG are presented without ablations isolating the planning/extraction/code-generation modules, without error analysis on failure cases, and without statistical significance tests, making it hard to attribute improvements specifically to the orchestration rather than scale or domain effects.

    Authors: We acknowledge that ablations and additional analyses would help isolate the contributions of each component. In the revised manuscript, we have included ablation experiments where we disable the planning module, the extraction module, and the code generation module individually, showing that each contributes to the overall improvement (e.g., removing planning drops accuracy by 12%). We have added an error analysis section categorizing the remaining errors into extraction failures (45%), planning errors (30%), and domain knowledge gaps (25%). Additionally, we report p-values from McNemar's test for statistical significance of the accuracy gains over RAG baselines (p < 0.05). These changes strengthen the attribution of improvements to the multi-agent workflow. revision: yes
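
The matching rule described in response 2 (exact string match for numeric facts, sentence-embedding similarity with a 0.85 threshold for textual facts) reads roughly as the sketch below; it follows the simulated rebuttal rather than released code, and the embedding model and numeric check are assumptions:

```python
# Hedged sketch of the matching described in response 2: exact string match
# for numeric facts, sentence-embedding similarity (threshold 0.85) for
# textual facts. It mirrors the simulated rebuttal, not released code; the
# embedding model and the simple numeric check are assumptions.
import re
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def _is_numeric(fact: str) -> bool:
    # deliberately simplistic: plain integers/decimals, optional sign or percent
    return re.fullmatch(r"-?\d+(\.\d+)?%?", fact.strip()) is not None

def fact_matches(gold: str, extracted: list[str], threshold: float = 0.85) -> bool:
    if not extracted:
        return False
    if _is_numeric(gold):
        # numeric facts: the exact value string must appear in some extracted chunk
        return any(gold.strip() in chunk for chunk in extracted)
    gold_emb = _model.encode(gold, convert_to_tensor=True)
    cand_emb = _model.encode(extracted, convert_to_tensor=True)
    return bool((util.cos_sim(gold_emb, cand_emb) >= threshold).any())

def intermediate_fact_coverage(gold_facts: list[str], extracted: list[str]) -> float:
    hits = sum(fact_matches(g, extracted) for g in gold_facts)
    return hits / max(len(gold_facts), 1)
```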

Circularity Check

0 steps flagged

No circularity in empirical benchmark construction

Full rationale

The paper introduces MuDABench as an empirical benchmark for multi-document analytical QA constructed via distant supervision from document metadata and financial databases. There are no mathematical derivations, fitted parameters called predictions, or self-citation chains that bear the load of the central claims. The assertion that questions require extensive inter-document analysis is tied to the construction method rather than being a self-referential prediction or definition. The evaluation protocol and system comparisons are independent empirical measurements. This is a standard benchmark paper with a public release; its central claims do not rest on circular constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the validity of distant supervision for generating analytical questions and the assumption that the proposed workflow addresses the core bottlenecks of extraction accuracy and domain knowledge.

axioms (1)
  • domain assumption Distant supervision from document-level metadata and annotated financial databases can generate valid analytical QA instances that require extensive cross-document synthesis and quantitative analysis.
    This is invoked to construct the 332 instances without full manual annotation for every question.
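
A minimal illustration of that construction, assuming a toy annotated table and a single question template (loosely modeled on the extraordinary-general-meetings example in the paper's appendix prompts); the schema and field names are invented, not MuDABench's actual format:

```python
# Hedged illustration of the distant-supervision axiom: a question template
# is filled from document-level metadata, and the gold answer plus the
# intermediate facts are computed straight from the annotated database, so
# no per-question manual annotation is needed. Field names, schema, and the
# template are invented for illustration, not MuDABench's actual format.
import pandas as pd

def build_instance(template: str, year: int, doctype: str, db: pd.DataFrame) -> dict:
    question = template.format(year=year, doctype=doctype)
    # gold answer: aggregate over all annotated rows matching the metadata filter
    rows = db[(db["year"] == year) & (db["doctype"] == doctype)]
    top3 = rows.nlargest(3, "meeting_count")["symbol"].tolist()
    return {
        "question": question,
        "gold_answer": top3,
        "gold_facts": rows[["symbol", "meeting_count"]].to_dict("records"),
    }

template = ("For the Chinese A-share market, list the stock codes of the three "
            "companies that held the most extraordinary general meetings in {year}, "
            "based on their {doctype} filings.")
# instance = build_instance(template, 2021, "meeting announcement", annotated_db)
```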

pith-pipeline@v0.9.0 · 5525 in / 1367 out tokens · 28375 ms · 2026-05-08T11:49:16.475355+00:00 · methodology

