ASTRA-QA: A Benchmark for Abstract Question Answering over Documents
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 02:51 UTC · model grok-4.3
The pith
ASTRA-QA supplies explicit topic annotations so abstract document answers can be scored directly for required coverage and unsupported content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ASTRA-QA is a benchmark of 869 QA instances over academic papers and news documents equipped with explicit evaluation annotations that include answer topic sets, curated unsupported topics, and aligned evidence. It assesses generated answers by directly scoring how well they cover the required key points and how much they include unsupported content, thereby enabling scalable, reference-grounded evaluation without exhaustive head-to-head comparisons.
What carries the argument
Explicit evaluation annotations consisting of answer topic sets, curated unsupported topics, and aligned evidence, which permit direct scoring of topic coverage and detection of unsupported content.
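To make the scoring idea concrete, here is a minimal sketch of how coverage and unsupported-content scores could be computed against such annotations. It assumes an instance carrying answer_topics and unsupported_topics lists and uses a naive lexical matcher as a stand-in; the paper's actual matching procedure (human, LLM judge, or similarity-based) is not specified on this page, so every name below is illustrative rather than the benchmark's API.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    answer_topics: list[str]       # required key points from the annotation
    unsupported_topics: list[str]  # curated unsupported (distractor) topics

def topic_mentioned(topic: str, answer: str) -> bool:
    # Naive lexical stand-in for whatever matcher (human, LLM judge,
    # or embedding similarity) the benchmark actually uses.
    return topic.lower() in answer.lower()

def score_answer(answer: str, inst: Instance) -> dict[str, float]:
    covered = sum(topic_mentioned(t, answer) for t in inst.answer_topics)
    flagged = sum(topic_mentioned(t, answer) for t in inst.unsupported_topics)
    return {
        "topic_coverage": covered / max(len(inst.answer_topics), 1),
        "unsupported_rate": flagged / max(len(inst.unsupported_topics), 1),
    }

# Example with invented topics and a toy answer.
inst = Instance(
    answer_topics=["graph-based retrieval", "hierarchical indexing"],
    unsupported_topics=["reinforcement learning fine-tuning"],
)
print(score_answer("The paper compares graph-based retrieval with vanilla RAG.", inst))
```

The point of the sketch is only that, once the topic sets exist, both coverage and unsupported content reduce to per-topic matching decisions aggregated into two scores, with no pairwise answer comparisons required.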
Load-bearing premise
The manually curated answer topic sets, unsupported topics, and aligned evidence accurately and without bias capture what constitutes a high-quality abstract answer.
What would settle it
A controlled study in which human raters assign quality rankings to a sample of answers: if those rankings diverge substantially from the benchmark's topic-coverage and unsupported-content scores, the reliability of the evaluation method is falsified.
Original abstract
Document-based question answering (QA) increasingly includes abstract questions that require synthesizing scattered information from long documents or across multiple documents into coherent answers. However, this setting is still poorly supported by existing benchmarks and evaluation methods, which often lack stable abstract references or rely on coarse similarity metrics and unstable head-to-head comparisons. To alleviate this issue, we introduce ASTRA-QA, a benchmark for AbSTRAct Question Answering over documents. ASTRA-QA contains 869 QA instances over academic papers and news documents, covering five abstract question types and three controlled retrieval scopes. Each instance is equipped with explicit evaluation annotations, including answer topic sets, curated unsupported topics, and aligned evidence. Building on these annotations, ASTRA-QA assesses whether answers cover required key points and avoid unsupported content by directly scoring topic coverage and curated unsupported content, enabling scalable evaluation without exhaustive head-to-head comparisons. Experiments with representative Retrieval-Augmented Generation (RAG) methods spanning vanilla, graph-based, and hierarchical retrieval settings show that ASTRA-QA provides reference-grounded diagnostics for coverage, hallucination, and retrieval-scope robustness. Our dataset and code are available at https://xinyangsally.github.io/astra-benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ASTRA-QA, a benchmark of 869 QA instances over academic papers and news documents spanning five abstract question types and three controlled retrieval scopes. Each instance includes explicit annotations consisting of answer topic sets, curated unsupported topics, and aligned evidence. The benchmark enables direct scoring of topic coverage and avoidance of unsupported content in generated answers, supporting scalable evaluation of RAG methods without exhaustive head-to-head comparisons. Experiments with vanilla, graph-based, and hierarchical RAG approaches illustrate its use for diagnosing coverage, hallucination, and retrieval-scope robustness. The dataset and code are publicly released.
Significance. If the annotations are shown to be reliable, ASTRA-QA would address a clear gap in evaluating abstract QA over long or multi-document settings, where existing benchmarks often depend on coarse similarity metrics or unstable comparisons. The public release of the full dataset with annotations and code is a clear strength that supports reproducibility and further research. The approach could enable more stable, reference-grounded diagnostics for coverage and hallucination in RAG systems.
major comments (2)
- §3 (Benchmark Construction): The description of how answer topic sets and curated unsupported topics were created for the 869 instances across five question types provides no inter-annotator agreement statistics, no expert re-validation on a held-out sample, and no analysis of topic granularity control. These annotations are load-bearing for the central claim that direct scoring of coverage and unsupported content yields stable, reference-grounded evaluation without head-to-head comparisons.
- §4 (Experiments): The reported results with representative RAG methods demonstrate diagnostic utility but contain no validation of the automatic topic-coverage and unsupported-content scores against independent human judgments on even a small sample of outputs. This leaves open whether the metrics align with expert notions of answer quality.
minor comments (3)
- Abstract and §1: The three controlled retrieval scopes are mentioned but not defined until later; a brief upfront characterization would improve readability.
- §2 (Related Work): Ensure all cited QA benchmarks are compared on the specific dimensions of abstract synthesis and annotation stability rather than only on dataset size.
- Table 1 / data statistics: Report the distribution of instances per question type and retrieval scope so readers can assess balance.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of ASTRA-QA's potential to address gaps in abstract QA evaluation. We address each major comment below with specific plans for revision where appropriate.
Point-by-point responses
- Referee, §3 (Benchmark Construction): The description of how answer topic sets and curated unsupported topics were created for the 869 instances across five question types provides no inter-annotator agreement statistics, no expert re-validation on a held-out sample, and no analysis of topic granularity control. These annotations are load-bearing for the central claim that direct scoring of coverage and unsupported content yields stable, reference-grounded evaluation without head-to-head comparisons.
Authors: We agree that inter-annotator agreement (IAA) statistics, re-validation, and granularity analysis would strengthen the presentation of the annotations. The topic sets and unsupported topics were constructed using explicit guidelines and domain-expert curation across the 869 instances, but these supporting statistics were not included in the initial submission. In the revised manuscript, we will add IAA results computed on a held-out sample of 100 instances using a second independent annotator, reporting Cohen's kappa for topic overlap and unsupported-topic identification (a minimal kappa sketch follows this list). We will also include expert re-validation on a separate 50-instance sample and an analysis of topic granularity control, reporting average topic set sizes, variance, and distributions stratified by question type and document domain. These elements will be incorporated into Section 3. Revision: yes.
- Referee, §4 (Experiments): The reported results with representative RAG methods demonstrate diagnostic utility but contain no validation of the automatic topic-coverage and unsupported-content scores against independent human judgments on even a small sample of outputs. This leaves open whether the metrics align with expert notions of answer quality.
Authors: We acknowledge that direct validation of the automatic scores against human judgments would provide stronger evidence of metric reliability. The current experiments in Section 4 focus on using the benchmark to diagnose RAG behaviors across retrieval scopes, but no human correlation study was reported. In the revision, we will add a targeted human validation: two experts will independently rate a random sample of 100 generated answers (drawn from the reported RAG runs) for topic coverage and unsupported content using a 5-point scale. We will then compute and report Pearson and Spearman correlations between these human ratings and the automatic scores (see the correlation sketch after this list). This analysis will be added to Section 4. Revision: yes.
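The proposed IAA analysis reduces to agreement over binary topic-inclusion decisions. A minimal sketch, assuming two annotators each mark whether a candidate topic belongs in the answer topic set; the labels below are invented purely for illustration, not drawn from the dataset.

```python
from sklearn.metrics import cohen_kappa_score

# One binary label per (instance, candidate topic) pair, per annotator:
# 1 = include the topic in the answer topic set, 0 = exclude it.
annotator_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa for topic inclusion: {kappa:.3f}")
```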
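Likewise, the proposed human validation in §4 amounts to correlating paired scores for the same sampled answers. A minimal sketch with hypothetical ratings, assuming 5-point human coverage ratings alongside automatic coverage scores in [0, 1].

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical paired scores for the same sampled answers.
human_ratings = [5, 4, 2, 3, 1, 4, 5, 2, 3, 4]
auto_scores   = [0.92, 0.80, 0.35, 0.55, 0.10, 0.75, 0.88, 0.40, 0.60, 0.70]

r, r_p = pearsonr(human_ratings, auto_scores)
rho, rho_p = spearmanr(human_ratings, auto_scores)
print(f"Pearson r = {r:.3f} (p = {r_p:.3g})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3g})")
```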
Circularity Check
No circularity: benchmark defined via independent new annotations
full rationale
The paper introduces ASTRA-QA as a new benchmark consisting of 869 instances with explicitly created answer topic sets, curated unsupported topics, and aligned evidence annotations. The evaluation method scores topic coverage and unsupported content directly against these annotations by construction, which is the standard, non-circular setup for a reference-based benchmark rather than a derivation that reduces to prior fitted quantities or self-citations. No equations, parameter fits, uniqueness theorems, or load-bearing self-citations appear in the provided text; the central claim of scalable evaluation without head-to-head comparisons follows directly from supplying the reference annotations as new inputs. The evaluation is thus self-contained and does not depend on external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard assumptions in NLP benchmark creation, such as representative sampling of documents and questions.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · "ASTRA-QA assesses whether answers cover required key points and avoid unsupported content by directly scoring topic coverage and curated unsupported content"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear · "topic-based evaluation method that directly scores topic coverage and hallucinated content"
Reference graph
Works this paper leans on
- [1] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- [2] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.
- [3] Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. LightRAG: Simple and fast retrieval-augmented generation. arXiv e-prints, 2024.
- [4] Shu Wang, Yixiang Fang, Yingli Zhou, Xilin Liu, and Yuchi Ma. ArchRAG: Attributed community-based hierarchical retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 15868–15876, 2026.
- [5] Yingli Zhou, Yaodong Su, Youran Sun, Shu Wang, Taotao Wang, Runyuan He, Yongwei Zhang, Sicong Liang, Xilin Liu, Yuchi Ma, et al. In-depth analysis of graph-based RAG in a unified framework. arXiv preprint arXiv:2503.04338, 2025.
- [6] Boyu Chen, Zirui Guo, Zidan Yang, Yuluo Chen, Junze Chen, Zhenghao Liu, Chuan Shi, and Cheng Yang. PathRAG: Pruning graph-based retrieval augmented generation with relational paths. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 30183–30191, 2026.
- [7] Haoyu Huang, Yongfeng Huang, Junjie Yang, Zhenyu Pan, Yongqiang Chen, Kaili Ma, Hongzhi Chen, and James Cheng. Retrieval-augmented generation with hierarchical knowledge. arXiv preprint arXiv:2503.10150, 2025.
- [8] Tianchi Cai, Zhiwen Tan, Xierui Song, Tao Sun, Jiyan Jiang, Yunqi Xu, Yinger Zhang, and Jinjie Gu. FoRAG: Factuality-optimized retrieval augmented generation for web-enhanced long-form question answering. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 199–210, 2024.
- [9] Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011, 2021.
- [10] Tim Baumgärtner, Ted Briscoe, and Iryna Gurevych. PeerQA: A scientific question answering dataset from peer reviews. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 508–544, 2025.
- [11] Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long-form answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8273–8288, 2022.
- [12] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.
- [13] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024.
- [14] Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze D Gui, Ziran W Jiang, Ziyu Jiang, et al. CRAG: Comprehensive RAG benchmark. Advances in Neural Information Processing Systems, 37:10470–10490, 2024.
- [15] Rujun Han, Yuhao Zhang, Peng Qi, Yumo Xu, Jenyuan Wang, Lan Liu, William Yang Wang, Bonan Min, and Vittorio Castelli. RAG-QA Arena: Evaluating domain robustness for long-form retrieval augmented question answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4354–4374, 2024.
- [16] David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Alex Shtoff, Oren Somekh, and Ran Tavory. LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation. arXiv preprint arXiv:2511.14531, 2025.
- [17] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
- [18] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019.
- [19] Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. QuestEval: Summarization asks for fact-based evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594–6604, 2021.
- [20] Alexander Richard Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. QAFactEval: Improved QA-based factual consistency evaluation for summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2587–2601, 2022.
- [21] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023.
- [22] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, 2023.
- [23] Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, 2021.
- [24] Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. arXiv preprint arXiv:2405.14831, 2024.
- [25] Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D Manning. RAPTOR: Recursive abstractive processing for tree-organized retrieval. arXiv preprint arXiv:2401.18059, 2024.
- [26] Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, 2019.
- [27] Esin Durmus, He He, and Mona Diab. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, 2020.
- [28] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
- [29] David Soergel, Adam Saunders, and Andrew McCallum. Open scholarship and peer review: a time for experimentation. In ICML 2013 Workshop on Peer Reviewing and Publishing Models, 2013.
- [30] https://openreview.net/forum?id=xf0zSBd2iufMg
- [31] Daniel A Epstein, Clara Caldeira, Mayara Costa Figueiredo, Xi Lu, Lucas M Silva, Lucretia Williams, Jong Ho Lee, Qingyang Li, Simran Ahuja, Qiuer Chen, et al. Mapping and taking stock of the personal informatics literature. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(4):1–38, 2020.
- [32] Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. MinerU: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839, 2024.
- [33] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [34] Yiqian Huang, Shiqi Zhang, and Xiaokui Xiao. KET-RAG: A cost-efficient multi-granular indexing framework for Graph-RAG. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 1003–1012, 2025.
- [35] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [36] Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. Nomic Embed: Training a reproducible long context text embedder, 2024.
- [37] OpenAI. GPT-5.1 Instant and GPT-5.1 Thinking system card addendum. https://openai.com/index/gpt-5-system-card-addendum-gpt-5-1/, 2025. Accessed: 2026-04-14.
- [38] Rui Han, Xiaoyi Lu, and Jiangtao Xu. On big data benchmarking. In Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, pages 3–18. Springer, 2014.
- [39] Ella Rabinovich, Samuel Ackerman, Orna Raz, Eitan Farchi, and Ateret Anaby Tavor. Predicting question-answering performance of large language models through semantic consistency. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 138–154, 2023.
- [40] Yixuan Tang and Yi Yang. MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391, 2024.
- [41] Hongjin Qian, Zheng Liu, Peitian Zhang, Kelong Mao, Defu Lian, Zhicheng Dou, and Tiejun Huang. MemoRAG: Boosting long context processing with global memory-enhanced retrieval augmentation. In Proceedings of the ACM on Web Conference 2025, pages 2366–2377, 2025.
- [42] Chantal Shaib, Venkata S Govindarajan, Joe Barrow, Jiuding Sun, Alexa F Siu, Byron C Wallace, and Ani Nenkova. Standardizing the measurement of text diversity: A tool and a comparative analysis of scores. arXiv preprint arXiv:2403.00553, 2024.
- [43] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
- [44] Ollama. https://github.com/ollama/ollama, 2024. Accessed: 2026-05-03.