ACL-Verbatim: hallucination-free question answering for research

Ádám Kovács; Gábor Recski; István Boros; Nadia Verdha; Szilveszter Tóth

arXiv: 2605.21102 · v1 · pith:Y2R3JKBYnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI · cs.SE

Pith reviewed 2026-05-21 04:50 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.SE

keywords extractive question answering · hallucination-free retrieval · research papers · ACL Anthology · ModernBERT · VerbatimRAG · synthetic query generation

The pith

A small token classifier maps researcher queries to exact text spans in papers more accurately than large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies VerbatimRAG, an extractive system that answers questions about research papers by returning verbatim text spans rather than generating new sentences. The authors build a ground-truth dataset by generating synthetic queries with a ScIRGen-based pipeline, retrieving paper chunks, and having NLP researchers annotate the relevant spans. They train and compare extractive models on this data, showing that a 150-million-parameter ModernBERT token classifier reaches the highest word-level F1 (53.6), ahead of the strongest LLM extractor evaluated (48.7). This matters to researchers because it supplies a concrete route to reliable, hallucination-free information retrieval from trusted academic sources.
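
To make the headline metric concrete, here is a minimal sketch of a word-level F1 between predicted and gold spans. The exact scoring procedure used in the paper is not given in the text above, so the details here (lowercasing, whitespace tokenization, multiset overlap pooled over all spans of one example) are assumptions, and `word_level_f1` is a hypothetical helper name.

```python
from collections import Counter


def word_level_f1(predicted_spans, gold_spans):
    """Word-overlap F1 between two lists of text spans.

    Assumption: words are lowercased, whitespace-separated tokens, and the
    spans of one example are pooled into a single multiset of words.
    """
    pred = Counter(w for span in predicted_spans for w in span.lower().split())
    gold = Counter(w for span in gold_spans for w in span.lower().split())

    # Convention chosen here: nothing predicted and nothing expected is a perfect match.
    if not pred and not gold:
        return 1.0

    # Multiset intersection counts each shared word at most as often as it
    # appears on both sides.
    overlap = sum((pred & gold).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["a 150M-parameter ModernBERT token classifier"]
    pred = ["ModernBERT token classifier"]
    print(round(word_level_f1(pred, gold), 3))
```

Under this definition, a prediction that returns half of the gold words and nothing else has precision 1.0 and recall 0.5, which gives an F1 of about 0.67.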

Core claim

We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).

What carries the argument

VerbatimRAG, the extractive question-answering pipeline that directly returns verbatim text spans from retrieved research-paper chunks instead of generating answers.
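
The guarantee that matters here is that every returned span is an exact substring of a retrieved chunk. The sketch below illustrates how such a check could look; it is not VerbatimRAG's API, and `locate_verbatim_spans` is a hypothetical helper. It uses a case-sensitive match with no whitespace normalization, which is the strictest reading of "verbatim" and an assumption on my part.

```python
from typing import List, Optional, Tuple


def locate_verbatim_spans(
    document: str, spans: List[str]
) -> List[Optional[Tuple[int, int]]]:
    """Return (start, end) character offsets for each span that occurs
    verbatim in the document, or None for spans that do not.

    Only the first occurrence of each span is reported.
    """
    results: List[Optional[Tuple[int, int]]] = []
    for span in spans:
        start = document.find(span) if span else -1
        if start == -1:
            # Empty, paraphrased, or otherwise altered text fails the check.
            results.append(None)
        else:
            results.append((start, start + len(span)))
    return results


if __name__ == "__main__":
    chunk = "We train a ModernBERT token classifier. It reaches 53.6 F1."
    candidates = ["ModernBERT token classifier", "a BERT-style classifier"]
    for span, offsets in zip(candidates, locate_verbatim_spans(chunk, candidates)):
        print(repr(span), "->", offsets)
```

The returned offsets are also what an interface would need in order to highlight a span in the source text, and a `None` entry flags output that must not be shown as a quotation.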

If this is right

Researchers obtain source text directly rather than paraphrased or invented content.
A compact model can be trained to perform reliable span extraction on academic documents.
The new annotated dataset supports further development of extractive tools for scholarly search.
The approach extends to other publication corpora once similar query-generation and annotation pipelines are built.
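
The second point rests on turning per-token relevance predictions into text spans. The sketch below shows one simple way to do that decoding step, assuming one binary label per whitespace-separated word; a real ModernBERT classifier works on subword tokens and would need an alignment step that is omitted here. `labels_to_spans` is a hypothetical name, not a function from the paper.

```python
import re
from typing import List


def labels_to_spans(text: str, labels: List[int]) -> List[str]:
    """Merge runs of consecutive words labeled 1 into verbatim spans.

    Assumption: `labels` holds one 0/1 value per whitespace-separated word.
    Spans are sliced from the original text, so internal whitespace and
    punctuation are preserved exactly.
    """
    words = list(re.finditer(r"\S+", text))
    if len(words) != len(labels):
        raise ValueError("expected exactly one label per word")

    spans: List[str] = []
    start = end = None
    for match, label in zip(words, labels):
        if label:
            if start is None:
                start = match.start()  # open a new span
            end = match.end()          # extend the current span
        elif start is not None:
            spans.append(text[start:end])  # close the span at the last kept word
            start = None
    if start is not None:
        spans.append(text[start:end])
    return spans


if __name__ == "__main__":
    chunk = "We use ModernBERT.  It has 150M parameters."
    print(labels_to_spans(chunk, [0, 1, 1, 0, 0, 1, 1]))
```

Because the spans are slices of the input rather than re-joined words, the output stays verbatim by construction, which is the property a generative extractor cannot guarantee.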

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Interfaces could highlight the exact returned span in the original PDF for immediate verification.
The same extraction layer could be added to existing retrieval systems to reduce factual errors in generated summaries.
Performance gaps between small classifiers and LLMs may widen or shrink on domains with different writing styles and terminology.

Load-bearing premise

Synthetic queries generated by the custom pipeline and paired with retrieved chunks can be annotated by human NLP researchers to produce ground truth that reflects real researcher information needs.

What would settle it

A follow-up evaluation in which actual NLP researchers submit their own open-ended queries about ACL papers and rate the relevance and completeness of the extracted spans versus those produced by an LLM baseline.

Figures

Figures reproduced from arXiv: 2605.21102 by Ádám Kovács, Gábor Recski, István Boros, Nadia Verdha, Szilveszter Tóth.

**Figure 2.** Generation of synthetic user queries, based on the ScIRGen methodology. The example shows a chunk

Original abstract

Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper offers a new benchmark for verbatim QA on ACL papers using synthetic queries, but validation against real needs is missing and results are modest.

The key takeaway is that this work introduces a new benchmark dataset for extractive question answering over research papers in the ACL Anthology, where models map queries directly to verbatim text spans instead of generating answers. A ModernBERT token classifier trained on their silver data gets 53.6 word-level F1, beating the best LLM extractor at 48.7.

They start with VerbatimRAG to retrieve relevant chunks from papers, then use a custom pipeline inspired by ScIRGen to create synthetic user queries. NLP researchers annotate which spans answer those queries, creating the ground truth. That ground truth is then used to train and compare various extractive models.

The paper does a solid job of framing the problem around hallucination-free access to trusted sources and providing concrete evaluation numbers on a domain-specific collection. Creating a human-annotated dataset in this area is a positive step, especially since it's done by people familiar with the field.

Where it gets shaky is the reliance on synthetic queries. The stress test points out that without evidence these queries reflect real researcher information needs in terms of phrasing, complexity, or coverage, the results might not translate to practical use. The abstract does not report dataset size, annotation guidelines, inter-annotator agreement, or any comparison to authentic queries, which makes it hard to judge how robust the benchmark is. The performance gap is small, and overall F1 scores suggest the task remains challenging.

This paper would interest people working on retrieval-augmented systems for scientific literature or those developing benchmarks for faithful QA. A reader focused on improving reliability in AI research tools could pick up useful ideas from the dataset construction. It has enough of a novel element with the new data and application to warrant a serious referee, who could push for more validation details and error analysis. I recommend putting it through peer review rather than desk rejecting it, as the core idea of verbatim mapping has merit even if the current evidence is preliminary.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ACL-Verbatim, which applies the VerbatimRAG extractive QA system to ACL Anthology papers to enable hallucination-free research question answering by directly mapping queries to verbatim text spans. The authors contribute a new ground truth dataset created from synthetic user queries (generated via a custom ScIRGen-based pipeline on VerbatimRAG-retrieved chunks) that are annotated by NLP researchers. They train and evaluate extractive models on this benchmark, reporting that a 150M-parameter ModernBERT token classifier trained with silver supervision achieves the highest word-level F1 of 53.6, outperforming the strongest evaluated LLM extractor (48.7).

Significance. If the synthetic-query benchmark is shown to be a valid proxy for real researcher information needs, the work would supply a useful annotated resource and evidence that a modest-sized fine-tuned classifier can outperform LLM-based extractors on verbatim span prediction. The reliance on an existing VerbatimRAG system and silver supervision from the pipeline is a positive for reproducibility.

major comments (2)

[Abstract] The central performance claim (ModernBERT word-level F1 53.6 vs. LLM extractor 48.7) is presented at a high level with no dataset size, number of annotated examples, annotation guidelines, inter-annotator agreement statistics, or significance tests. These omissions leave the empirical result without the supporting evidence required to interpret the reported F1 scores.
[Abstract] Dataset construction (as described in the abstract): Ground truth is produced by NLP researchers annotating spans for synthetic queries generated by a custom ScIRGen pipeline on VerbatimRAG chunks. No validation study is reported that compares the distribution, specificity, complexity, or coverage of these synthetic queries against authentic researcher queries; without such evidence the benchmark may measure performance on an artificial task rather than the claimed hallucination-free research QA setting.

minor comments (1)

[Abstract] The abstract would be clearer if it stated the total number of papers, chunks, and queries in the contributed dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our manuscript. We address the major comments point by point below, indicating where revisions will be made to strengthen the paper.

Referee: [Abstract] The central performance claim (ModernBERT word-level F1 53.6 vs. LLM extractor 48.7) is presented at a high level with no dataset size, number of annotated examples, annotation guidelines, inter-annotator agreement statistics, or significance tests. These omissions leave the empirical result without the supporting evidence required to interpret the reported F1 scores.

Authors: We agree with the referee that the abstract would benefit from more specific details to allow readers to better interpret the results. In the revised manuscript, we will update the abstract to include the dataset size (number of synthetic queries and annotated spans), a summary of the annotation process by NLP researchers, and report inter-annotator agreement statistics. Regarding significance tests, we will include a note that the performance difference was assessed for statistical significance using appropriate methods such as paired t-tests or bootstrap resampling on the test set. Full annotation guidelines will be provided in the appendix or supplementary material. This revision will make the empirical claims more transparent.
revision: yes

Referee: [Abstract] Dataset construction (as described in the abstract): Ground truth is produced by NLP researchers annotating spans for synthetic queries generated by a custom ScIRGen pipeline on VerbatimRAG chunks. No validation study is reported that compares the distribution, specificity, complexity, or coverage of these synthetic queries against authentic researcher queries; without such evidence the benchmark may measure performance on an artificial task rather than the claimed hallucination-free research QA setting.

Authors: We recognize the importance of validating that the synthetic queries reflect real researcher information needs. Our pipeline uses ScIRGen to generate queries from VerbatimRAG-retrieved chunks of ACL papers, with the goal of creating queries that are relevant to research content. Annotations by domain-expert NLP researchers further ensure the quality of the ground truth spans. However, we did not include a comparative study with authentic researcher queries in the current work. We will add a dedicated limitations section discussing this aspect and outlining plans for future validation studies involving real user queries collected from researchers. This will clarify the scope of the current benchmark as a controlled proxy for the task.
revision: partial
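
The bootstrap resampling mentioned in the first response can be sketched as a paired test over per-example scores. This is a generic illustration, not the authors' procedure: it assumes each system has one score per test example (for instance a per-example word-level F1) and compares mean scores, whereas the paper may aggregate F1 differently. `paired_bootstrap` and its parameters are hypothetical names.

```python
import random
from typing import Sequence


def paired_bootstrap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Fraction of bootstrap resamples in which system A does NOT beat system B.

    Both systems are scored on the same examples; each resample draws example
    indices with replacement and compares summed scores on that resample.
    A small return value suggests that A's advantage is stable.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(scores_a)
    not_better = 0
    for _ in range(n_resamples):
        indices = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in indices)
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples


if __name__ == "__main__":
    # Made-up per-example scores, for illustration only.
    classifier = [0.62, 0.48, 0.71, 0.35, 0.58, 0.66, 0.41, 0.53]
    llm = [0.55, 0.50, 0.60, 0.30, 0.52, 0.61, 0.45, 0.47]
    print(paired_bootstrap(classifier, llm, n_resamples=2000))
```

Pairing matters because both systems are evaluated on the same queries: resampling examples jointly removes variance that comes from some queries simply being harder than others.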

Circularity Check

0 steps flagged

No significant circularity; empirical results are direct measurements on a new benchmark

The paper introduces a custom benchmark consisting of synthetic queries generated via a ScIRGen-based pipeline, VerbatimRAG-retrieved paper chunks, and human annotations by NLP researchers. It reports measured word-level F1 scores (e.g., ModernBERT at 53.6 outperforming LLM extractors at 48.7) for models trained on silver supervision from the same pipeline and evaluated against the human ground truth. No equations, derivations, or fitted parameters are presented that reduce the reported performance numbers to quantities defined by the inputs themselves. The use of VerbatimRAG is an application of an existing retrieval system to construct the dataset rather than a load-bearing self-citation that justifies the central claim. The results are self-contained performance metrics on the introduced benchmark and do not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that synthetic queries and VerbatimRAG-retrieved chunks yield annotations that generalize to real researcher needs, plus standard supervised learning assumptions for token classification.

axioms (1)

domain assumption: Human annotations performed by NLP researchers on synthetic queries provide a reliable proxy for real-world query-to-span relevance.
This premise underpins both the training data and the benchmark used to claim model superiority.

pith-pipeline@v0.9.0 · 5729 in / 1437 out tokens · 55444 ms · 2026-05-21T04:50:08.789775+00:00 · methodology
