Recognition: 2 theorem links · Lean Theorem
MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering
Pith reviewed 2026-05-13 04:16 UTC · model grok-4.3
The pith
MedHopQA provides a benchmark of 1,000 expert-curated questions that each require synthesizing information from two distinct Wikipedia articles to answer open-ended biomedical queries about diseases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs. Each question requires synthesis of information across two distinct Wikipedia articles and is supplied in open-ended free-text format, augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy for lexical and concept-level evaluation. The dataset was constructed through a structured process of human annotation, triage, iterative verification, and LLM-as-a-judge validation, then embedded within a publicly available collection of 10,000 questions with answers withheld on a CodaBench leaderboard to limit contamination.
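To make the evaluation setup concrete, the sketch below shows one plausible shape for an ontology-augmented gold record and for lexical versus concept-level scoring. The field names, the example item, and the exact-match rule are illustrative assumptions, not the released MedHopQA schema or official scorer.

```python
# Minimal sketch, assuming a simple record layout and normalized exact matching.
# Not the released MedHopQA schema or the official evaluation script.
from dataclasses import dataclass, field


@dataclass
class GoldItem:
    question: str
    answer: str                                 # canonical free-text answer
    synonyms: set = field(default_factory=set)  # ontology-grounded synonyms (MONDO / NCBI Gene / NCBI Taxonomy labels)
    ontology_id: str = ""                       # e.g. a MONDO or NCBI identifier


def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def lexical_match(prediction: str, gold: GoldItem) -> bool:
    # Surface-level check: prediction must match the canonical answer string.
    return normalize(prediction) == normalize(gold.answer)


def concept_match(prediction: str, gold: GoldItem) -> bool:
    # Concept-level check: any ontology-grounded synonym of the gold answer counts.
    candidates = {normalize(gold.answer)} | {normalize(s) for s in gold.synonyms}
    return normalize(prediction) in candidates


# Hypothetical item for illustration only.
item = GoldItem(
    question="Which gene, when mutated, causes the disease first described in ...?",
    answer="cystic fibrosis transmembrane conductance regulator",
    synonyms={"CFTR"},
    ontology_id="NCBIGene:1080",
)
print(lexical_match("CFTR", item), concept_match("CFTR", item))  # False True
```

Under this reading, concept-level scoring forgives surface variation while lexical scoring rewards exact phrasing; the ontology synonym sets are what make the former possible.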
What carries the argument
The requirement that each question integrates facts from exactly two distinct Wikipedia articles, enforced by expert curation and a hidden-answer leaderboard structure that separates the scored items from the larger public set.
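As a rough illustration of the hidden-answer design (the 1,000-of-10,000 split comes from the abstract; the sampling and output layout below are assumptions), one way to realize it is:

```python
# Sketch of embedding a hidden scored subset in a larger public release.
# Split sizes follow the abstract; sampling and output format are assumed.
import random


def build_release(all_items, n_scored=1_000, seed=0):
    """all_items: list of dicts with 'id', 'question', 'answer' (10,000 items)."""
    rng = random.Random(seed)
    scored_ids = set(rng.sample([it["id"] for it in all_items], n_scored))
    # Participants see every question but no answers, and cannot tell which items are scored.
    public = [{"id": it["id"], "question": it["question"]} for it in all_items]
    # Organizers retain the answer key for the scored subset only.
    answer_key = {it["id"]: it["answer"] for it in all_items if it["id"] in scored_ids}
    return public, answer_key
```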
If this is right
- Models must demonstrate cross-article inference instead of single-document lookup or elimination strategies to perform well.
- The construction process can be reused to generate additional biomedical or domain-specific multi-hop datasets that maintain discriminative power.
- Evaluation at both surface and concept levels is supported by the provided ontology synonym sets.
- The benchmark can serve as a more durable test for clinically relevant capabilities such as literature-based discovery and hypothesis generation.
- Embedding scored items in a larger withheld-answer set reduces the risk that high performance stems from training-data contamination.
Where Pith is reading between the lines
- If the two-article requirement holds, training objectives that reward explicit cross-document chaining could close the gap between current model performance and the benchmark's demands.
- The same curation-plus-hidden-set method could be applied to create multi-hop tests in non-biomedical domains where contamination is also a concern.
- Persistent low performance on MedHopQA would point to specific limits in how current LLMs integrate distributed factual knowledge rather than isolated recall.
- Extending the framework to questions that draw on more than two sources would create a natural next test of deeper compositional reasoning.
Load-bearing premise
The expert-curated questions genuinely require synthesis across two distinct sources rather than being solvable by single-article lookup or surface pattern matching.
What would settle it
Either a model reaching high accuracy on the scored questions when shown only one of the two source articles per question, or the top leaderboard scores saturating within a short period of release.
Original abstract
Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MedHopQA, a benchmark of 1,000 expert-curated open-ended QA pairs for disease-centered multi-hop reasoning in biomedicine. Each question is constructed to require synthesis of information from two distinct Wikipedia articles, with gold answers augmented by ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy. The dataset is built via a process of human annotation, triage, iterative verification, and LLM-as-a-judge validation, and is embedded within a public set of 10,000 questions (answers withheld) on a CodaBench leaderboard to reduce contamination risk. The work also presents a reusable framework for future biomedical QA datasets that emphasize compositional reasoning, saturation resistance, and contamination resistance.
Significance. If the multi-hop property holds and the questions cannot be solved via single-article lookup or surface patterns, MedHopQA would address a clear gap in existing biomedical QA benchmarks, which often permit success through elimination or memorization rather than inference. The ontology-augmented evaluation and the 10k-withheld-answers design are concrete strengths that support lexical/concept-level scoring and long-term leaderboard utility. The reusable framework could help future dataset creators enforce similar constraints. These elements would be valuable for advancing LLM evaluation in clinically relevant tasks such as diagnostic support and literature-based discovery.
major comments (2)
- [Abstract] The assertion that 'each question requires synthesis of information across two distinct Wikipedia articles' is presented without any quantitative validation, such as single-article retrieval accuracy, inter-annotator agreement on source necessity, or ablation results showing performance degradation when one article is withheld. This directly undermines the central claim that the benchmark tests multi-hop reasoning rather than single-source lookup or pattern matching.
- [Construction process] No inter-annotator agreement statistics, example question breakdowns, or empirical checks (e.g., human or model performance on single vs. dual articles) are reported to confirm that the human annotation plus LLM-as-a-judge pipeline produces genuinely compositional items. This is load-bearing for both the benchmark's validity and the reusable framework's claimed properties.
minor comments (2)
- [Abstract] The abstract would benefit from one or two concrete example questions to illustrate the multi-hop requirement and the open-ended answer format.
- [Dataset release] Clarify how the 1,000 scored questions are selected from the 10,000 public set and whether any leakage-prevention measures (e.g., temporal or source filtering) are applied beyond answer withholding.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which correctly identify the need for stronger empirical support of the multi-hop claims. We address each major comment below. Where the manuscript lacks quantitative validation, we agree that revisions are required and will incorporate the suggested analyses.
Point-by-point responses
- Referee: [Abstract] The assertion that 'each question requires synthesis of information across two distinct Wikipedia articles' is presented without any quantitative validation, such as single-article retrieval accuracy, inter-annotator agreement on source necessity, or ablation results showing performance degradation when one article is withheld. This directly undermines the central claim that the benchmark tests multi-hop reasoning rather than single-source lookup or pattern matching.
Authors: We agree that the current manuscript presents the multi-hop requirement as a design property without accompanying quantitative evidence. The claim originates from the annotation guidelines, which explicitly required questions to draw non-redundant information from two distinct articles, followed by human triage and LLM-as-a-judge consistency checks. However, we did not report single-article ablations, retrieval accuracy, or source-necessity agreement. In the revision we will add a dedicated validation subsection that includes: (i) LLM performance on each question when provided with only the first article, only the second article, or both; (ii) retrieval accuracy of the two source articles given the question; and (iii) inter-annotator agreement on whether both articles are necessary, measured on a 100-question subset. These results will be reported in the main text and will directly test whether single-article lookup suffices. revision: yes
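A minimal sketch of the single- versus dual-article ablation proposed in (i), assuming a generic ask_model(context, question) interface and a caller-supplied correctness check; this is not the authors' evaluation code.

```python
# Sketch of the proposed ablation. ask_model and is_correct are assumed interfaces.
from statistics import mean


def run_ablation(items, ask_model, is_correct):
    """items: dicts with 'question', 'gold', 'article_a', 'article_b'.
    ask_model(context, question) -> str; is_correct(prediction, gold) -> bool."""
    conditions = {
        "article_a_only": lambda it: it["article_a"],
        "article_b_only": lambda it: it["article_b"],
        "both_articles": lambda it: it["article_a"] + "\n\n" + it["article_b"],
    }
    items = list(items)
    results = {}
    for name, build_context in conditions.items():
        scores = [
            is_correct(ask_model(build_context(it), it["question"]), it["gold"])
            for it in items
        ]
        results[name] = mean(scores) if scores else 0.0
    return results
```

If the questions are genuinely multi-hop, accuracy under either single-article condition should sit well below the both-articles condition; comparable accuracy across conditions would support the referee's concern.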
- Referee: [Construction process] No inter-annotator agreement statistics, example question breakdowns, or empirical checks (e.g., human or model performance on single vs. dual articles) are reported to confirm that the human annotation plus LLM-as-a-judge pipeline produces genuinely compositional items. This is load-bearing for both the benchmark's validity and the reusable framework's claimed properties.
Authors: We acknowledge that the construction section currently lacks these supporting statistics and examples. The process consisted of expert biomedical annotators, iterative triage, and LLM-as-a-judge validation, but specific inter-annotator agreement figures and per-question breakdowns were omitted. In the revised manuscript we will: (1) provide 2–3 fully worked example questions with explicit mapping of required facts to each Wikipedia article; (2) report inter-annotator agreement (Cohen’s kappa) on question validity and source necessity for a randomly sampled subset of 200 items; and (3) include the single-vs-dual article model performance results described in the response to the first comment. These additions will also illustrate how the reusable framework can enforce compositional checks in future datasets. revision: yes
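For reference, the Cohen's kappa the authors commit to reporting is the standard chance-corrected agreement statistic; the definition below is textbook material, not a result from the paper.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_o = \frac{1}{N}\sum_{k} n_{kk},
\qquad
p_e = \frac{1}{N^{2}}\sum_{k} n_{k\cdot}\, n_{\cdot k}
```

where N is the number of doubly annotated items (200 in the proposed subset), n_kk is the count of items both annotators assign to label k, and n_k· and n_·k are the two annotators' marginal counts for label k.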
Circularity Check
No circularity: dataset construction with no derivations or self-referential reductions
Full rationale
The paper describes the manual curation, triage, verification, and LLM-assisted validation of a 1,000-question multi-hop QA benchmark drawn from Wikipedia articles. No equations, fitted parameters, or predictive derivations appear in the provided text or abstract. The central claim—that questions require synthesis across two distinct sources—is asserted via the construction protocol itself rather than being defined in terms of any output or prior self-result. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The work is therefore self-contained as a descriptive benchmark-creation effort.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Expert human annotation combined with LLM-as-a-judge validation produces questions that require genuine cross-article synthesis.
- domain assumption: Ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy enable reliable concept-level evaluation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "Each question requires synthesis of information across two distinct Wikipedia articles... constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.