
arXiv: 2605.21102 · v1 · pith:Y2R3JKBYnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI · cs.SE

ACL-Verbatim: hallucination-free question answering for research

Pith reviewed 2026-05-21 04:50 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.SE
keywords extractive question answering · hallucination-free retrieval · research papers · ACL Anthology · ModernBERT · VerbatimRAG · synthetic query generation

The pith

A small token classifier maps researcher queries to exact text spans in papers more accurately than large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies VerbatimRAG, an extractive system that answers questions about research papers by returning verbatim text spans rather than generating new sentences. The authors build a ground-truth dataset by generating synthetic queries with a ScIRGen-based pipeline, retrieving paper chunks, and having NLP researchers annotate the relevant spans. They train and compare extractive models on this data, showing that a 150-million-parameter ModernBERT token classifier reaches the highest word-level F1 (53.6) and beats the strongest evaluated LLM extractor (48.7). This matters to researchers because it supplies a concrete route to reliable, hallucination-free information retrieval from trusted academic sources.
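The headline numbers are word-level F1 scores. The page does not give the exact scoring procedure, so the following is only a minimal sketch of one common definition, assuming lowercased whitespace tokenization and multiset overlap between predicted and gold text (SQuAD-style). The function name `word_f1` and these tokenization choices are assumptions made for illustration, not the authors' evaluation code.

```python
from collections import Counter


def word_f1(predicted: str, gold: str) -> float:
    """Word-level F1 between a predicted extraction and a gold extraction.

    Assumption: words are lowercased, whitespace-separated tokens and overlap
    is counted as a multiset intersection.
    """
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()

    # Both empty: the system correctly extracted nothing.
    if not pred_tokens and not gold_tokens:
        return 1.0
    # Exactly one side empty: no overlap is possible.
    if not pred_tokens or not gold_tokens:
        return 0.0

    # Counter & Counter keeps the minimum count of each shared word.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a prediction that shares three of its four words with a four-word gold span scores 0.75. Reported scores such as 53.6 would then be this value averaged over examples and multiplied by 100, again assuming per-example averaging.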

Core claim

We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).

What carries the argument

VerbatimRAG, the extractive question-answering pipeline that directly returns verbatim text spans from retrieved research-paper chunks instead of generating answers.
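The "verbatim" guarantee can be checked mechanically: an answer is acceptable only if every returned span occurs character for character in the retrieved chunk. The sketch below illustrates that check and is not part of VerbatimRAG's actual API; the helper name `locate_verbatim_spans` is hypothetical.

```python
from typing import List, Tuple


def locate_verbatim_spans(chunk: str, spans: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of each span inside the chunk.

    Hypothetical helper: raises ValueError if a span is not an exact
    substring of the chunk, i.e. if it was paraphrased or invented.
    """
    offsets = []
    for span in spans:
        start = chunk.find(span)  # exact, case-sensitive match
        if start == -1:
            raise ValueError(f"span is not verbatim: {span!r}")
        offsets.append((start, start + len(span)))
    return offsets
```

The returned offsets are also what an interface would need in order to highlight the span in the source text. Only the first occurrence of a repeated span is located here, which is a simplification.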

If this is right

  • Researchers obtain source text directly rather than paraphrased or invented content.
  • A compact model can be trained to perform reliable span extraction on academic documents.
  • The new annotated dataset supports further development of extractive tools for scholarly search.
  • The approach extends to other publication corpora once similar query-generation and annotation pipelines are built.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Interfaces could highlight the exact returned span in the original PDF for immediate verification.
  • The same extraction layer could be added to existing retrieval systems to reduce factual errors in generated summaries.
  • Performance gaps between small classifiers and LLMs may widen or shrink on domains with different writing styles and terminology.

Load-bearing premise

Synthetic queries generated by the custom pipeline and paired with retrieved chunks can be annotated by human NLP researchers to produce ground truth that reflects real researcher information needs.

What would settle it

A follow-up evaluation in which actual NLP researchers submit their own open-ended queries about ACL papers and rate the relevance and completeness of the extracted spans versus those produced by an LLM baseline.

Figures

Figures reproduced from arXiv: 2605.21102 by Ádám Kovács, Gábor Recski, István Boros, Nadia Verdha, Szilveszter Tóth.

Figure 1: Example of conversion from PDF to markdown using Docling

Figure 2: Generation of synthetic user queries, based on the ScIRGen methodology. The example shows a chunk
Abstract

Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ACL-Verbatim, which applies the VerbatimRAG extractive QA system to ACL Anthology papers to enable hallucination-free research question answering by directly mapping queries to verbatim text spans. The authors contribute a new ground truth dataset created from synthetic user queries (generated via a custom ScIRGen-based pipeline on VerbatimRAG-retrieved chunks) that are annotated by NLP researchers. They train and evaluate extractive models on this benchmark, reporting that a 150M-parameter ModernBERT token classifier trained with silver supervision achieves the highest word-level F1 of 53.6, outperforming the strongest evaluated LLM extractor (48.7).

Significance. If the synthetic-query benchmark is shown to be a valid proxy for real researcher information needs, the work would supply a useful annotated resource and evidence that a modest-sized fine-tuned classifier can outperform LLM-based extractors on verbatim span prediction. The reliance on an existing VerbatimRAG system and silver supervision from the pipeline is a positive for reproducibility.

major comments (2)
  1. [Abstract] Abstract: The central performance claim (ModernBERT word-level F1 53.6 vs. LLM extractor 48.7) is presented at a high level with no dataset size, number of annotated examples, annotation guidelines, inter-annotator agreement statistics, or significance tests. These omissions leave the empirical result without the supporting evidence required to interpret the reported F1 scores.
  2. [Abstract] Dataset construction (as described in the abstract): Ground truth is produced by NLP researchers annotating spans for synthetic queries generated by a custom ScIRGen pipeline on VerbatimRAG chunks. No validation study is reported that compares the distribution, specificity, complexity, or coverage of these synthetic queries against authentic researcher queries; without such evidence the benchmark may measure performance on an artificial task rather than the claimed hallucination-free research QA setting.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it stated the total number of papers, chunks, and queries in the contributed dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our manuscript. We address the major comments point by point below, indicating where revisions will be made to strengthen the paper.

  1. Referee: [Abstract] Abstract: The central performance claim (ModernBERT word-level F1 53.6 vs. LLM extractor 48.7) is presented at a high level with no dataset size, number of annotated examples, annotation guidelines, inter-annotator agreement statistics, or significance tests. These omissions leave the empirical result without the supporting evidence required to interpret the reported F1 scores.

    Authors: We agree with the referee that the abstract would benefit from more specific details to allow readers to better interpret the results. In the revised manuscript, we will update the abstract to include the dataset size (number of synthetic queries and annotated spans), a summary of the annotation process by NLP researchers, and report inter-annotator agreement statistics. Regarding significance tests, we will include a note that the performance difference was assessed for statistical significance using appropriate methods such as paired t-tests or bootstrap resampling on the test set. Full annotation guidelines will be provided in the appendix or supplementary material. This revision will make the empirical claims more transparent. revision: yes

  2. Referee: [Abstract] Dataset construction (as described in the abstract): Ground truth is produced by NLP researchers annotating spans for synthetic queries generated by a custom ScIRGen pipeline on VerbatimRAG chunks. No validation study is reported that compares the distribution, specificity, complexity, or coverage of these synthetic queries against authentic researcher queries; without such evidence the benchmark may measure performance on an artificial task rather than the claimed hallucination-free research QA setting.

    Authors: We recognize the importance of validating that the synthetic queries reflect real researcher information needs. Our pipeline uses ScIRGen to generate queries from VerbatimRAG-retrieved chunks of ACL papers, with the goal of creating queries that are relevant to research content. Annotations by domain-expert NLP researchers further ensure the quality of the ground truth spans. However, we did not include a comparative study with authentic researcher queries in the current work. We will add a dedicated limitations section discussing this aspect and outlining plans for future validation studies involving real user queries collected from researchers. This will clarify the scope of the current benchmark as a controlled proxy for the task. revision: partial
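The first response above mentions bootstrap resampling as one way to test whether the gap between 53.6 and 48.7 is significant. A minimal sketch of a paired bootstrap over per-example scores follows. It assumes each system has one score per test example (for instance a per-example word-level F1) and that the systems are compared by mean score; the function name and defaults are illustrative and are not taken from the paper.

```python
import random
from typing import Sequence


def paired_bootstrap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which system A does NOT beat system B.

    A small value suggests that A's advantage is unlikely to be an artifact
    of which examples happened to land in the test set.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same examples")
    if not scores_a:
        raise ValueError("at least one example is required")

    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping scores paired.
        indices = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in indices)
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples
```

Pairing matters because both extractors are scored on the same queries: resampling indices rather than independent score lists preserves the per-example correlation between systems.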

Circularity Check

0 steps flagged

No significant circularity; empirical results are direct measurements on a new benchmark


The paper introduces a custom benchmark consisting of synthetic queries generated via a ScIRGen-based pipeline, VerbatimRAG-retrieved paper chunks, and human annotations by NLP researchers. It reports measured word-level F1 scores (e.g., ModernBERT at 53.6 outperforming LLM extractors at 48.7) for models trained on silver supervision from the same pipeline and evaluated against the human ground truth. No equations, derivations, or fitted parameters are presented that reduce the reported performance numbers to quantities defined by the inputs themselves. The use of VerbatimRAG is an application of an existing retrieval system to construct the dataset rather than a load-bearing self-citation that justifies the central claim. The results are self-contained performance metrics on the introduced benchmark and do not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that synthetic queries and VerbatimRAG-retrieved chunks yield annotations that generalize to real researcher needs, plus standard supervised learning assumptions for token classification.

axioms (1)
  • domain assumption: Human annotations performed by NLP researchers on synthetic queries provide a reliable proxy for real-world query-to-span relevance.
    This premise underpins both the training data and the benchmark used to claim model superiority.
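Framing extraction as token classification means the annotated spans have to be turned into per-token labels before a classifier can be trained or scored. The sketch below shows one simple way to do that at the word level, assuming whitespace-delimited words and character-offset spans. A real ModernBERT model operates on subword tokens, and the page does not describe the paper's labeling scheme, so the function name and the binary 0/1 labels are assumptions.

```python
import re
from typing import List, Tuple


def spans_to_word_labels(
    chunk: str, spans: List[Tuple[int, int]]
) -> List[Tuple[str, int]]:
    """Label each whitespace-delimited word 1 if it overlaps any annotated
    (start, end) character span, else 0."""
    labeled = []
    for match in re.finditer(r"\S+", chunk):
        # Two half-open intervals overlap when each starts before the other ends.
        inside = any(
            match.start() < end and match.end() > start for start, end in spans
        )
        labeled.append((match.group(), int(inside)))
    return labeled
```

Predictions can be mapped back to text by joining consecutive words labeled 1, which keeps the output verbatim by construction.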

pith-pipeline@v0.9.0 · 5729 in / 1437 out tokens · 55444 ms · 2026-05-21T04:50:08.789775+00:00 · methodology

