ASTRA-QA: A Benchmark for Abstract Question Answering over Documents
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 02:51 UTC · model grok-4.3
The pith
ASTRA-QA supplies explicit topic annotations so abstract document answers can be scored directly for required coverage and unsupported content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ASTRA-QA is a benchmark of 869 QA instances over academic papers and news documents equipped with explicit evaluation annotations that include answer topic sets, curated unsupported topics, and aligned evidence. It assesses generated answers by directly scoring how well they cover the required key points and how much they include unsupported content, thereby enabling scalable, reference-grounded evaluation without exhaustive head-to-head comparisons.
What carries the argument
Explicit evaluation annotations consisting of answer topic sets, curated unsupported topics, and aligned evidence, which permit direct scoring of topic coverage and detection of unsupported content.
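To make the scoring idea concrete, here is a minimal sketch of how coverage and unsupported-content scores could be computed against such annotations. It assumes an instance carrying answer_topics and unsupported_topics lists and uses a naive lexical matcher as a stand-in; the paper's actual matching procedure (human, LLM judge, or similarity-based) is not specified on this page, so every name below is illustrative rather than the benchmark's API.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    answer_topics: list[str]       # required key points from the annotation
    unsupported_topics: list[str]  # curated unsupported (distractor) topics

def topic_mentioned(topic: str, answer: str) -> bool:
    # Naive lexical stand-in for whatever matcher (human, LLM judge,
    # or embedding similarity) the benchmark actually uses.
    return topic.lower() in answer.lower()

def score_answer(answer: str, inst: Instance) -> dict[str, float]:
    covered = sum(topic_mentioned(t, answer) for t in inst.answer_topics)
    flagged = sum(topic_mentioned(t, answer) for t in inst.unsupported_topics)
    return {
        "topic_coverage": covered / max(len(inst.answer_topics), 1),
        "unsupported_rate": flagged / max(len(inst.unsupported_topics), 1),
    }

# Example with invented topics and a toy answer.
inst = Instance(
    answer_topics=["graph-based retrieval", "hierarchical indexing"],
    unsupported_topics=["reinforcement learning fine-tuning"],
)
print(score_answer("The paper compares graph-based retrieval with vanilla RAG.", inst))
```

The point of the sketch is only that, once the topic sets exist, both coverage and unsupported content reduce to per-topic matching decisions aggregated into two scores, with no pairwise answer comparisons required.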
Load-bearing premise
The manually curated answer topic sets, unsupported topics, and aligned evidence accurately and without bias capture what constitutes a high-quality abstract answer.
What would settle it
A controlled study in which human raters assign quality rankings to a sample of answers: if those rankings diverge substantially from the benchmark's topic-coverage and unsupported-content scores, the reliability of the evaluation method is falsified.
Original abstract
Document-based question answering (QA) increasingly includes abstract questions that require synthesizing scattered information from long documents or across multiple documents into coherent answers. However, this setting is still poorly supported by existing benchmarks and evaluation methods, which often lack stable abstract references or rely on coarse similarity metrics and unstable head-to-head comparisons. To alleviate this issue, we introduce ASTRA-QA, a benchmark for AbSTRAct Question Answering over documents. ASTRA-QA contains 869 QA instances over academic papers and news documents, covering five abstract question types and three controlled retrieval scopes. Each instance is equipped with explicit evaluation annotations, including answer topic sets, curated unsupported topics, and aligned evidence. Building on these annotations, ASTRA-QA assesses whether answers cover required key points and avoid unsupported content by directly scoring topic coverage and curated unsupported content, enabling scalable evaluation without exhaustive head-to-head comparisons. Experiments with representative Retrieval-Augmented Generation (RAG) methods spanning vanilla, graph-based, and hierarchical retrieval settings show that ASTRA-QA provides reference-grounded diagnostics for coverage, hallucination, and retrieval-scope robustness. Our dataset and code are available at https://xinyangsally.github.io/astra-benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ASTRA-QA, a benchmark of 869 QA instances over academic papers and news documents spanning five abstract question types and three controlled retrieval scopes. Each instance includes explicit annotations consisting of answer topic sets, curated unsupported topics, and aligned evidence. The benchmark enables direct scoring of topic coverage and avoidance of unsupported content in generated answers, supporting scalable evaluation of RAG methods without exhaustive head-to-head comparisons. Experiments with vanilla, graph-based, and hierarchical RAG approaches illustrate its use for diagnosing coverage, hallucination, and retrieval-scope robustness. The dataset and code are publicly released.
Significance. If the annotations are shown to be reliable, ASTRA-QA would address a clear gap in evaluating abstract QA over long or multi-document settings, where existing benchmarks often depend on coarse similarity metrics or unstable comparisons. The public release of the full dataset with annotations and code is a clear strength that supports reproducibility and further research. The approach could enable more stable, reference-grounded diagnostics for coverage and hallucination in RAG systems.
major comments (2)
- §3 (Benchmark Construction): The description of how answer topic sets and curated unsupported topics were created for the 869 instances across five question types provides no inter-annotator agreement statistics, no expert re-validation on a held-out sample, and no analysis of topic granularity control. These annotations are load-bearing for the central claim that direct scoring of coverage and unsupported content yields stable, reference-grounded evaluation without head-to-head comparisons.
- §4 (Experiments): The reported results with representative RAG methods demonstrate diagnostic utility but contain no validation of the automatic topic-coverage and unsupported-content scores against independent human judgments on even a small sample of outputs. This leaves open whether the metrics align with expert notions of answer quality.
minor comments (3)
- Abstract and §1: The three controlled retrieval scopes are mentioned but not defined until later; a brief upfront characterization would improve readability.
- §2 (Related Work): Ensure all cited QA benchmarks are compared on the specific dimensions of abstract synthesis and annotation stability rather than only on dataset size.
- Table 1 / data statistics: Report the distribution of instances per question type and retrieval scope so readers can assess balance.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of ASTRA-QA's potential to address gaps in abstract QA evaluation. We address each major comment below with specific plans for revision where appropriate.
Point-by-point responses
- Referee, §3 (Benchmark Construction): The description of how answer topic sets and curated unsupported topics were created for the 869 instances across five question types provides no inter-annotator agreement statistics, no expert re-validation on a held-out sample, and no analysis of topic granularity control. These annotations are load-bearing for the central claim that direct scoring of coverage and unsupported content yields stable, reference-grounded evaluation without head-to-head comparisons.
Authors: We agree that inter-annotator agreement (IAA) statistics, re-validation, and granularity analysis would strengthen the presentation of the annotations. The topic sets and unsupported topics were constructed using explicit guidelines and domain-expert curation across the 869 instances, but these supporting statistics were not included in the initial submission. In the revised manuscript, we will add IAA results computed on a held-out sample of 100 instances using a second independent annotator, reporting Cohen's kappa for topic overlap and unsupported-topic identification (a minimal kappa sketch follows this list). We will also include expert re-validation on a separate 50-instance sample and an analysis of topic granularity control, reporting average topic set sizes, variance, and distributions stratified by question type and document domain. These elements will be incorporated into Section 3. Revision: yes.
- Referee, §4 (Experiments): The reported results with representative RAG methods demonstrate diagnostic utility but contain no validation of the automatic topic-coverage and unsupported-content scores against independent human judgments on even a small sample of outputs. This leaves open whether the metrics align with expert notions of answer quality.
Authors: We acknowledge that direct validation of the automatic scores against human judgments would provide stronger evidence of metric reliability. The current experiments in Section 4 focus on using the benchmark to diagnose RAG behaviors across retrieval scopes, but no human correlation study was reported. In the revision, we will add a targeted human validation: two experts will independently rate a random sample of 100 generated answers (drawn from the reported RAG runs) for topic coverage and unsupported content using a 5-point scale. We will then compute and report Pearson and Spearman correlations between these human ratings and the automatic scores (see the correlation sketch after this list). This analysis will be added to Section 4. Revision: yes.
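The proposed IAA analysis reduces to agreement over binary topic-inclusion decisions. A minimal sketch, assuming two annotators each mark whether a candidate topic belongs in the answer topic set; the labels below are invented purely for illustration, not drawn from the dataset.

```python
from sklearn.metrics import cohen_kappa_score

# One binary label per (instance, candidate topic) pair, per annotator:
# 1 = include the topic in the answer topic set, 0 = exclude it.
annotator_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa for topic inclusion: {kappa:.3f}")
```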
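Likewise, the proposed human validation in §4 amounts to correlating paired scores for the same sampled answers. A minimal sketch with hypothetical ratings, assuming 5-point human coverage ratings alongside automatic coverage scores in [0, 1].

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical paired scores for the same sampled answers.
human_ratings = [5, 4, 2, 3, 1, 4, 5, 2, 3, 4]
auto_scores   = [0.92, 0.80, 0.35, 0.55, 0.10, 0.75, 0.88, 0.40, 0.60, 0.70]

r, r_p = pearsonr(human_ratings, auto_scores)
rho, rho_p = spearmanr(human_ratings, auto_scores)
print(f"Pearson r = {r:.3f} (p = {r_p:.3g})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3g})")
```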
Circularity Check
No circularity: benchmark defined via independent new annotations
full rationale
The paper introduces ASTRA-QA as a new benchmark consisting of 869 instances with explicitly created answer topic sets, curated unsupported topics, and aligned evidence annotations. The evaluation method scores topic coverage and unsupported content directly against these annotations by construction, which is the standard, non-circular setup for a reference-based benchmark rather than a derivation that reduces to prior fitted quantities or self-citations. No equations, parameter fits, uniqueness theorems, or load-bearing self-citations appear in the provided text; the central claim of scalable evaluation without head-to-head comparisons follows directly from supplying the reference annotations as new inputs. The evaluation is thus self-contained and does not depend on external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard assumptions in NLP benchmark creation, such as representative sampling of documents and questions.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · "ASTRA-QA assesses whether answers cover required key points and avoid unsupported content by directly scoring topic coverage and curated unsupported content"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear · "topic-based evaluation method that directly scores topic coverage and hallucinated content"
Reference graph
Works this paper leans on
- [1] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- [2] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.
- [3] Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. LightRAG: Simple and fast retrieval-augmented generation. arXiv e-prints, 2024.
- [4] Shu Wang, Yixiang Fang, Yingli Zhou, Xilin Liu, and Yuchi Ma. ArchRAG: Attributed community-based hierarchical retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 15868–15876, 2026.
- [5] Yingli Zhou, Yaodong Su, Youran Sun, Shu Wang, Taotao Wang, Runyuan He, Yongwei Zhang, Sicong Liang, Xilin Liu, Yuchi Ma, et al. In-depth analysis of graph-based RAG in a unified framework. arXiv preprint arXiv:2503.04338, 2025.
- [6] Boyu Chen, Zirui Guo, Zidan Yang, Yuluo Chen, Junze Chen, Zhenghao Liu, Chuan Shi, and Cheng Yang. PathRAG: Pruning graph-based retrieval augmented generation with relational paths. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 30183–30191, 2026.
- [7] Haoyu Huang, Yongfeng Huang, Junjie Yang, Zhenyu Pan, Yongqiang Chen, Kaili Ma, Hongzhi Chen, and James Cheng. Retrieval-augmented generation with hierarchical knowledge. arXiv preprint arXiv:2503.10150, 2025.
- [8] Tianchi Cai, Zhiwen Tan, Xierui Song, Tao Sun, Jiyan Jiang, Yunqi Xu, Yinger Zhang, and Jinjie Gu. FoRAG: Factuality-optimized retrieval augmented generation for web-enhanced long-form question answering. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 199–210, 2024.
- [9] Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011, 2021.
- [10] Tim Baumgärtner, Ted Briscoe, and Iryna Gurevych. PeerQA: A scientific question answering dataset from peer reviews. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 508–544, 2025.
- [11] Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long-form answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8273–8288, 2022.
- [12] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.
- [13] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024.
- [14] Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze D Gui, Ziran W Jiang, Ziyu Jiang, et al. CRAG: Comprehensive RAG benchmark. Advances in Neural Information Processing Systems, 37:10470–10490, 2024.
- [15] Rujun Han, Yuhao Zhang, Peng Qi, Yumo Xu, Jenyuan Wang, Lan Liu, William Yang Wang, Bonan Min, and Vittorio Castelli. RAG-QA Arena: Evaluating domain robustness for long-form retrieval augmented question answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4354–4374, 2024.
- [16] David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Alex Shtoff, Oren Somekh, and Ran Tavory. LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation. arXiv preprint arXiv:2511.14531, 2025.
- [17] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
- [18] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019.
- [19] Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. QuestEval: Summarization asks for fact-based evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594–6604, 2021.
- [20] Alexander Richard Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. QAFactEval: Improved QA-based factual consistency evaluation for summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2587–2601, 2022.
- [21] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023.
- [22] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, 2023.
- [23] Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, 2021.
- [24] Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. arXiv preprint arXiv:2405.14831, 2024.
- [25] Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D Manning. RAPTOR: Recursive abstractive processing for tree-organized retrieval. arXiv preprint arXiv:2401.18059, 2024.
- [26] Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, 2019.
- [27] Esin Durmus, He He, and Mona Diab. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, 2020.
- [28] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
- [29] David Soergel, Adam Saunders, and Andrew McCallum. Open scholarship and peer review: a time for experimentation. In ICML 2013 Workshop on Peer Reviewing and Publishing Models, 2013.
- [30] https://openreview.net/forum?id=xf0zSBd2iufMg
- [31] Daniel A Epstein, Clara Caldeira, Mayara Costa Figueiredo, Xi Lu, Lucas M Silva, Lucretia Williams, Jong Ho Lee, Qingyang Li, Simran Ahuja, Qiuer Chen, et al. Mapping and taking stock of the personal informatics literature. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(4):1–38, 2020.
- [32] Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. MinerU: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839, 2024.
- [33] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [34] Yiqian Huang, Shiqi Zhang, and Xiaokui Xiao. KET-RAG: A cost-efficient multi-granular indexing framework for Graph-RAG. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 1003–1012, 2025.
- [35] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [36] Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. Nomic Embed: Training a reproducible long context text embedder, 2024.
- [37] OpenAI. GPT-5.1 Instant and GPT-5.1 Thinking system card addendum. https://openai.com/index/gpt-5-system-card-addendum-gpt-5-1/, 2025. Accessed: 2026-04-14.
- [38] Rui Han, Xiaoyi Lu, and Jiangtao Xu. On big data benchmarking. In Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, pages 3–18. Springer, 2014.
- [39] Ella Rabinovich, Samuel Ackerman, Orna Raz, Eitan Farchi, and Ateret Anaby Tavor. Predicting question-answering performance of large language models through semantic consistency. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 138–154, 2023.
- [40] Yixuan Tang and Yi Yang. MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391, 2024.
- [41] Hongjin Qian, Zheng Liu, Peitian Zhang, Kelong Mao, Defu Lian, Zhicheng Dou, and Tiejun Huang. MemoRAG: Boosting long context processing with global memory-enhanced retrieval augmentation. In Proceedings of the ACM on Web Conference 2025, pages 2366–2377, 2025.
- [42] Chantal Shaib, Venkata S Govindarajan, Joe Barrow, Jiuding Sun, Alexa F Siu, Byron C Wallace, and Ani Nenkova. Standardizing the measurement of text diversity: A tool and a comparative analysis of scores. arXiv preprint arXiv:2403.00553, 2024.
- [43] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
- [44] Ollama. https://github.com/ollama/ollama, 2024. Accessed: 2026-05-03.