ACL-Verbatim: hallucination-free question answering for research
Pith reviewed 2026-05-21 04:50 UTC · model grok-4.3
The pith
A small token classifier maps researcher queries to exact text spans in papers more accurately than large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).
What carries the argument
VerbatimRAG, the extractive question-answering pipeline that directly returns verbatim text spans from retrieved research-paper chunks instead of generating answers.
If this is right
- Researchers obtain source text directly rather than paraphrased or invented content.
- A compact model can be trained to perform reliable span extraction on academic documents.
- The new annotated dataset supports further development of extractive tools for scholarly search.
- The approach extends to other publication corpora once similar query-generation and annotation pipelines are built.
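To make the second point concrete, here is a minimal sketch of how per-token relevance labels from a token classifier could be turned into verbatim spans. It is not the authors' implementation: the function names `whitespace_offsets` and `labels_to_spans` are hypothetical, and whitespace splitting stands in for a real subword tokenizer that would supply character offsets.

```python
from typing import List, Tuple


def whitespace_offsets(text: str) -> List[Tuple[int, int]]:
    """Character offsets of whitespace-separated tokens (stand-in for a real tokenizer)."""
    offsets, pos = [], 0
    for word in text.split():
        begin = text.index(word, pos)
        offsets.append((begin, begin + len(word)))
        pos = begin + len(word)
    return offsets


def labels_to_spans(text: str, offsets: List[Tuple[int, int]], labels: List[int]) -> List[str]:
    """Merge each run of tokens labelled 1 into one substring copied from `text`."""
    spans = []
    start = end = None
    for (tok_start, tok_end), label in zip(offsets, labels):
        if label == 1:
            if start is None:
                start = tok_start  # open a new span
            end = tok_end          # extend the current span
        elif start is not None:
            spans.append(text[start:end])  # close the span at the last relevant token
            start = None
    if start is not None:
        spans.append(text[start:end])
    return spans


chunk = "We train on silver labels. Results improve by two points."
offsets = whitespace_offsets(chunk)
# Hypothetical classifier output: the second sentence is marked relevant.
print(labels_to_spans(chunk, offsets, [0] * 5 + [1] * 5))
```

Because every span is sliced out of the source string, the output cannot contain text that is absent from the chunk, which is the property the extractive approach relies on.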
Where Pith is reading between the lines
- Interfaces could highlight the exact returned span in the original PDF for immediate verification.
- The same extraction layer could be added to existing retrieval systems to reduce factual errors in generated summaries.
- Performance gaps between small classifiers and LLMs may widen or shrink on domains with different writing styles and terminology.
Load-bearing premise
Synthetic queries generated by the custom pipeline and paired with retrieved chunks can be annotated by human NLP researchers to produce ground truth that reflects real researcher information needs.
What would settle it
A follow-up evaluation in which actual NLP researchers submit their own open-ended queries about ACL papers and rate the relevance and completeness of the extracted spans versus those produced by an LLM baseline.
Original abstract
Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).
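The abstract reports word-level F1 without defining it here. One plausible reading, sketched below as an assumption rather than the paper's definition, scores the overlap between the word positions covered by predicted spans and those covered by gold spans. The function name `word_f1` and the convention that two empty sets score 1.0 are choices made for this illustration.

```python
from typing import Set


def word_f1(pred_words: Set[int], gold_words: Set[int]) -> float:
    """F1 over word positions covered by predicted versus gold spans."""
    if not pred_words and not gold_words:
        return 1.0  # assumed convention: correctly predicting "nothing relevant"
    overlap = len(pred_words & gold_words)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_words)
    recall = overlap / len(gold_words)
    return 2 * precision * recall / (precision + recall)


# Predicted span covers words 0-3, gold span covers words 2-5.
print(word_f1({0, 1, 2, 3}, {2, 3, 4, 5}))
```

Under this reading, a prediction that is verbatim but too long or too short is penalised through precision or recall rather than counted as a complete miss.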
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ACL-Verbatim, which applies the VerbatimRAG extractive QA system to ACL Anthology papers to enable hallucination-free research question answering by directly mapping queries to verbatim text spans. The authors contribute a new ground truth dataset created from synthetic user queries (generated via a custom ScIRGen-based pipeline on VerbatimRAG-retrieved chunks) that are annotated by NLP researchers. They train and evaluate extractive models on this benchmark, reporting that a 150M-parameter ModernBERT token classifier trained with silver supervision achieves the highest word-level F1 of 53.6, outperforming the strongest evaluated LLM extractor (48.7).
Significance. If the synthetic-query benchmark is shown to be a valid proxy for real researcher information needs, the work would supply a useful annotated resource and evidence that a modest-sized fine-tuned classifier can outperform LLM-based extractors on verbatim span prediction. The reliance on an existing VerbatimRAG system and silver supervision from the pipeline is a positive for reproducibility.
major comments (2)
- [Abstract] Abstract: The central performance claim (ModernBERT word-level F1 53.6 vs. LLM extractor 48.7) is presented at a high level with no dataset size, number of annotated examples, annotation guidelines, inter-annotator agreement statistics, or significance tests. These omissions leave the empirical result without the supporting evidence required to interpret the reported F1 scores.
- [Abstract] Dataset construction (as described in the abstract): Ground truth is produced by NLP researchers annotating spans for synthetic queries generated by a custom ScIRGen pipeline on VerbatimRAG chunks. No validation study is reported that compares the distribution, specificity, complexity, or coverage of these synthetic queries against authentic researcher queries; without such evidence the benchmark may measure performance on an artificial task rather than the claimed hallucination-free research QA setting.
minor comments (1)
- [Abstract] The abstract would be clearer if it stated the total number of papers, chunks, and queries in the contributed dataset.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments on our manuscript. We address the major comments point by point below, indicating where revisions will be made to strengthen the paper.
- Referee: [Abstract] Abstract: The central performance claim (ModernBERT word-level F1 53.6 vs. LLM extractor 48.7) is presented at a high level with no dataset size, number of annotated examples, annotation guidelines, inter-annotator agreement statistics, or significance tests. These omissions leave the empirical result without the supporting evidence required to interpret the reported F1 scores.
  Authors: We agree with the referee that the abstract would benefit from more specific details to allow readers to better interpret the results. In the revised manuscript, we will update the abstract to include the dataset size (number of synthetic queries and annotated spans), a summary of the annotation process by NLP researchers, and inter-annotator agreement statistics. Regarding significance tests, we will include a note that the performance difference was assessed for statistical significance using appropriate methods such as paired t-tests or bootstrap resampling on the test set. Full annotation guidelines will be provided in the appendix or supplementary material. This revision will make the empirical claims more transparent.
  Revision: yes
- Referee: [Abstract] Dataset construction (as described in the abstract): Ground truth is produced by NLP researchers annotating spans for synthetic queries generated by a custom ScIRGen pipeline on VerbatimRAG chunks. No validation study is reported that compares the distribution, specificity, complexity, or coverage of these synthetic queries against authentic researcher queries; without such evidence the benchmark may measure performance on an artificial task rather than the claimed hallucination-free research QA setting.
  Authors: We recognize the importance of validating that the synthetic queries reflect real researcher information needs. Our pipeline uses ScIRGen to generate queries from VerbatimRAG-retrieved chunks of ACL papers, with the goal of creating queries that are relevant to research content. Annotations by domain-expert NLP researchers further ensure the quality of the ground truth spans. However, we did not include a comparative study with authentic researcher queries in the current work. We will add a dedicated limitations section discussing this aspect and outlining plans for future validation studies involving real user queries collected from researchers. This will clarify the scope of the current benchmark as a controlled proxy for the task.
  Revision: partial
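The first response mentions bootstrap resampling as one way to test the gap between the two systems. A minimal sketch of a paired bootstrap follows. It assumes per-example scores are available for both systems on the same test items and compares their sums, which may differ from how the paper aggregates word-level F1. The function name `paired_bootstrap` and its parameters are hypothetical.

```python
import random
from typing import Sequence


def paired_bootstrap(scores_a: Sequence[float], scores_b: Sequence[float],
                     n_resamples: int = 10000, seed: int = 0) -> float:
    """Fraction of resampled test sets on which system A does not beat system B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same examples")
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping the A/B pairing intact.
        indices = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in indices)
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples


# Toy per-example scores, invented for illustration only.
classifier = [0.6, 0.7, 0.4, 0.9, 0.5]
llm = [0.5, 0.7, 0.5, 0.6, 0.4]
print(paired_bootstrap(classifier, llm, n_resamples=2000))
```

A small returned fraction would suggest the advantage is stable across resampled test sets; the toy scores above carry no information about the actual benchmark.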
Circularity Check
No significant circularity; empirical results are direct measurements on a new benchmark
The paper introduces a custom benchmark consisting of synthetic queries generated via a ScIRGen-based pipeline, VerbatimRAG-retrieved paper chunks, and human annotations by NLP researchers. It reports measured word-level F1 scores (e.g., ModernBERT at 53.6 outperforming LLM extractors at 48.7) for models trained on silver supervision from the same pipeline and evaluated against the human ground truth. No equations, derivations, or fitted parameters are presented that reduce the reported performance numbers to quantities defined by the inputs themselves. The use of VerbatimRAG is an application of an existing retrieval system to construct the dataset rather than a load-bearing self-citation that justifies the central claim. The results are self-contained performance metrics on the introduced benchmark and do not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human annotations performed by NLP researchers on synthetic queries provide a reliable proxy for real-world query-to-span relevance.
Prompts from the paper's appendix
Question types:
- Verification: questions seeking a simple yes/no confirmation
- Disjunctive: questions presenting multiple alternatives
- Concept Completion: questions starting with Who/What/When/Where
- Example: questions asking for instances of a concept
- Feature Specification: questions about properties or characteristics
- Quantification: questions seeking numerical or measurable information
- Definition: questions asking for the meaning of a term or concept
- Comparison: questions asking for similarities or differences
- Interpretation: questions asking for inference over observed patterns
- Causal Antecedent: questions about causes or reasons
- Causal Consequence: questions about outcomes or results
- Goal Orientation: questions about objectives or intentions
- Instrumental/Procedural: questions asking how to achieve a goal
- Enablement: questions about conditions enabling an action
- Expectation: questions about anticipated or missing outcomes
- Judgmental: questions asking for evaluation or opinion
- Assertion: statements indicating lack of knowledge
- Request/Directive: requests to summarize, analyze, or search.
Task: Based on the following text from a research paper, return the most appropriate 3 question types that could be answered by this text. Give me the name of each type and not other information. Return ONLY valid JSON -- an array of objects, no markdown or explanations.
Text: {chunk}
C.2 Quest...
- Only return a question without any other information
- The question should be short and simple, resembling what a user might type into a search engine
- The question should be answerable based on the text above.
C.3 Query rewriting prompt
You are a researcher using a search engine to find information. Your question: {question} Please generate a search query that you would use to find the answer to this question.
Instructions:
- Only return a search query without any other information
- The query should be short and simple, resembling what a user might type into a search engine
- The query does not need to be grammatical.
D Prompts for extraction
D.1 Default VerbatimRAG extraction prompt
Extract EXACT verbatim text spans from multiple documents that answer the question.
Rules
- Extract only text that explicitly addresses the question
- Never paraphrase, modify, or add to the original text
- Order spans within each document by relevance, most relevant first
- Include complete sentences or paragraphs for context.
Output format
Return a JSON object mapping document IDs to span arrays ordered by relevance:
{ "doc_0": ["most relevant span", "next most relevant span"], "doc_1": ["most relevant from doc 1"], "doc_2": [] }
If no relevant information exists in a document, use an empty array.
Your task
Question: {{ que...
- Use EXACT text from the document; no paraphrasing or edits
- Preserve original wording, capitalization, and punctuation
- If no passage in the document supports the answer, return an empty array
- Order spans within each document by relevance, most relevant first.
Output format
Return JSON mapping document IDs to arrays:
{ "doc_0": ["first supporting passage", "second supporting passage"], "doc_1": ["passage from doc 1"], "doc_2": [] }
Your task
Question: {{ question }}
Documents: {{ documents }}
Extract supporting passages from each document:
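The output format in the extraction prompts above, a JSON object mapping document IDs to arrays of spans, lends itself to a simple post-hoc check. The sketch below parses such a response and keeps only spans that occur as exact substrings of their source document. It illustrates the idea only and is not taken from VerbatimRAG; the function name `filter_verbatim` and the sample documents are hypothetical.

```python
import json
from typing import Dict, List


def filter_verbatim(raw_output: str, documents: Dict[str, str]) -> Dict[str, List[str]]:
    """Parse an extractor's JSON response and drop spans not found verbatim in the source."""
    parsed = json.loads(raw_output)
    kept = {}
    for doc_id, text in documents.items():
        spans = parsed.get(doc_id, [])
        # Keep a span only if it is a non-empty string copied exactly from the document.
        kept[doc_id] = [s for s in spans if isinstance(s, str) and s and s in text]
    return kept


# Invented example documents and model response.
docs = {
    "doc_0": "ModernBERT is an encoder. It has 150M parameters.",
    "doc_1": "Nothing relevant here.",
}
response = '{"doc_0": ["It has 150M parameters.", "It has 150 million parameters."], "doc_1": []}'
print(filter_verbatim(response, docs))
```

The paraphrased second span is discarded because it does not appear character for character in `doc_0`, which mirrors the "never paraphrase" rule stated in the prompts.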