pith. machine review for the scientific record.

arxiv: 2605.12918 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: no theorem link

CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:29 UTC · model grok-4.3

classification 💻 cs.CL
keywords commonsense reasoning · causal reasoning · large language models · knowledge graph question answering · entity-based reasoning · why questions · hallucinations · abductive reasoning

The pith

CommonWhy introduces 15,000 why questions that test whether LLMs can combine specific entity facts with causal commonsense inference

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates CommonWhy to measure how well large language models perform entity-based causal commonsense reasoning. Existing benchmarks use only true/false or multiple-choice formats and do not assess abductive reasoning or explanation quality. The new dataset supplies all required facts from Wikidata, so correct answers demand both knowledge retrieval and causal inference. Experiments on current models show frequent factual errors and breakdowns in identifying causes and effects. This matters because real-world interaction with language models depends on exactly this form of integrated reasoning.

Core claim

The paper presents CommonWhy, a dataset of 15,000 why questions that evaluate LLMs on entity-based commonsense reasoning about causal relationships. Every query is answerable from information already present in the Wikidata knowledge graph, turning the task into a KGQA benchmark that targets causal inference rather than simple fact lookup. Tests on state-of-the-art LLMs and LLM-based KGQA methods show frequent factual hallucinations together with failures to perform the required causal reasoning.

What carries the argument

The CommonWhy dataset of why questions, which forces models to retrieve entity facts from Wikidata and then apply causal commonsense to generate answers and explanations

If this is right

  • LLMs will continue to generate factually incorrect answers on causal why questions until the underlying retrieval and inference failures are addressed.
  • KGQA systems built on LLMs will underperform on tasks that require causal chaining rather than direct fact lookup.
  • Explanation quality will remain low because models cannot reliably trace causal links between entities.
  • New evaluation protocols for LLMs must include open-ended why questions to expose gaps hidden by true/false formats.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid systems that explicitly connect LLMs to knowledge graphs may need separate modules for causal chaining to reach reliable performance.
  • The dataset could be extended to track whether improvements on CommonWhy also improve performance on other forms of abductive reasoning outside Wikidata.
  • Future work could measure how much training data volume is required before models stop hallucinating the causal relations tested here.

Load-bearing premise

The questions require genuine integration of entity facts with causal commonsense reasoning instead of being solvable through superficial patterns learned in training.

What would settle it

A model that produces correct answers and explanations on most CommonWhy questions while avoiding factual hallucinations would show that current shortcomings are not as widespread as claimed.

Figures

Figures reproduced from arXiv: 2605.12918 by Armin Toroghi, Faeze Moradi Kalarde, Scott Sanner.

Figure 1: Overview of the CommonWhy dataset construction pipeline. (1) Commonsense axioms extracted from existing datasets are fed to GPT-5.1 to produce additional similar commonsense axioms, after which human annotators filter out invalid ones. (2) Commonsense axioms are rewritten as lifted question–answer pairs, and entities extracted from Wikidata are substituted into their variables to generate grounded pairs. (…)
Figure 2: Distribution of reasoning skills in CommonWhy.
Figure 3: Entity popularities in the head and long-tail splits.
Figure 4: FActScores obtained by different LLMs across head (…)
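The template-grounding step described in the Figure 1 caption (lifted question–answer pairs whose variables are filled with KG entities) can be sketched as follows; the template wording and entity bindings are hypothetical, not taken from CommonWhy itself:

```python
# Sketch of grounding a lifted question–answer pair: substitute concrete
# entities (as Wikidata would supply them) into template variables. The
# template text and bindings here are illustrative, not from the dataset.

lifted = {
    "question": "Why did {person} move to {city}?",
    "answer": "{person} took a position at {institution}, located in {city}.",
}

def ground(template: dict, bindings: dict) -> dict:
    """Fill every field of a lifted pair with the same entity bindings."""
    return {field: text.format(**bindings) for field, text in template.items()}

pair = ground(lifted, {
    "person": "Albert Einstein",
    "city": "Princeton",
    "institution": "the Institute for Advanced Study",
})
print(pair["question"])  # → Why did Albert Einstein move to Princeton?
```

One lifted pair can thus yield many grounded questions, one per entity binding, which is how a modest pool of human-validated axioms scales to 15,000 questions.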
read the original abstract

To effectively interact with the real world, Large Language Models (LLMs) require entity-based commonsense reasoning, a challenging task that necessitates integrating factual knowledge about specific entities with commonsense inference. Existing datasets for evaluating LLM entity-based commonsense reasoning have largely focused on True/False or multiple-choice questions, leaving the explicit assessment of the model's ability in abductive reasoning about causes and effects and generating explanations largely unexamined. In this work, we introduce CommonWhy, a dataset of 15,000 why questions designed to evaluate entity-based commonsense reasoning about causal relationships in LLMs. CommonWhy also serves as a Knowledge Graph Question Answering (KGQA) benchmark, as all supporting knowledge required to answer its queries is available in the Wikidata knowledge graph. Unlike existing KGQA datasets, which primarily test fact retrieval, CommonWhy targets causal commonsense reasoning, establishing a new paradigm for KGQA evaluation. Experiments with state-of-the-art LLMs and LLM-based KGQA methods reveal their significant shortcomings, including frequent factual hallucinations and failures in causal reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces CommonWhy, a dataset of 15,000 why-questions for assessing entity-based causal commonsense reasoning in LLMs. The questions are designed to require integration of specific entity facts from Wikidata with abductive causal inference, serving also as a KGQA benchmark beyond simple fact retrieval. Experiments with SOTA LLMs and KGQA methods highlight shortcomings like hallucinations and causal reasoning failures.

Significance. Should the dataset construction and evaluation hold up under scrutiny, this work could provide a useful new benchmark for probing LLMs on causal reasoning tasks that combine factual knowledge with commonsense, addressing a gap in existing True/False or multiple-choice datasets. The emphasis on generating explanations is a positive aspect.

major comments (3)
  1. [Dataset Construction] Dataset Construction section: insufficient detail is provided on the question generation process from Wikidata entities and the validation steps used to confirm that questions require genuine integration of entity facts with causal commonsense rather than surface-level patterns.
  2. [Experiments] Experiments section: no ablation studies (e.g., with vs. without KG access, or performance on paraphrased vs. original questions) are reported to demonstrate that failures cannot be explained by memorized patterns or superficial cues, which is load-bearing for interpreting results as evidence of reasoning deficits.
  3. [Results] Results and Analysis: quantitative breakdowns of error types (hallucinations vs. causal failures) and inter-annotator agreement for any human validation are missing, weakening support for the claim of significant shortcomings.
minor comments (1)
  1. [Abstract] Abstract: the exact train/test split and any filtering criteria for the 15,000 questions should be stated explicitly for reproducibility.
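For concreteness, the ablation grid the second major comment calls for might look like the following; the condition names are illustrative, not the paper's:

```python
# Illustrative ablation grid for referee comment 2: crossing KG access with
# question phrasing isolates competing explanations for failure. Condition
# names are invented here, not taken from the paper.
from itertools import product

kg_access = ["with_kg", "without_kg"]   # does retrieval help at all?
phrasing = ["original", "paraphrased"]  # do surface cues carry the score?

conditions = [f"{kg}/{ph}" for kg, ph in product(kg_access, phrasing)]
print(conditions)
# → ['with_kg/original', 'with_kg/paraphrased',
#    'without_kg/original', 'without_kg/paraphrased']
```

A with/without-KG gap points to retrieval as the bottleneck; an original/paraphrased gap points to memorized surface patterns rather than genuine causal reasoning.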

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight areas where additional detail and analysis will strengthen the paper. We will revise the manuscript to expand the dataset construction description, incorporate ablation studies, and provide quantitative error breakdowns along with inter-annotator agreement metrics. Below we address each major comment.

read point-by-point responses
  1. Referee: [Dataset Construction] Dataset Construction section: insufficient detail is provided on the question generation process from Wikidata entities and the validation steps used to confirm that questions require genuine integration of entity facts with causal commonsense rather than surface-level patterns.

    Authors: We agree that the Dataset Construction section would benefit from greater detail. In the revised manuscript we will expand this section with a step-by-step account of entity selection from Wikidata, the hybrid template- and LLM-assisted question generation procedure, and the multi-stage validation protocol (including explicit criteria and examples) used to verify that each question necessitates integration of specific entity facts with abductive causal reasoning rather than surface-level lexical patterns. revision: yes

  2. Referee: [Experiments] Experiments section: no ablation studies (e.g., with vs. without KG access, or performance on paraphrased vs. original questions) are reported to demonstrate that failures cannot be explained by memorized patterns or superficial cues, which is load-bearing for interpreting results as evidence of reasoning deficits.

    Authors: We concur that ablation studies are important for ruling out alternative explanations. We will add these experiments to the revised manuscript, specifically reporting performance with and without KG retrieval access as well as results on paraphrased question variants. These additions will help isolate whether observed shortcomings stem from reasoning deficits rather than memorization or superficial cues. revision: yes

  3. Referee: [Results] Results and Analysis: quantitative breakdowns of error types (hallucinations vs. causal failures) and inter-annotator agreement for any human validation are missing, weakening support for the claim of significant shortcomings.

    Authors: We will revise the Results and Analysis section to include quantitative error-type breakdowns derived from manual inspection of a representative sample of model outputs, explicitly separating factual hallucinations from causal reasoning failures. We will also report inter-annotator agreement statistics for the human validation steps performed during dataset construction and error categorization. revision: yes

Circularity Check

0 steps flagged

No significant circularity in dataset creation and evaluation

full rationale

The paper introduces the CommonWhy dataset of 15,000 why-questions grounded in Wikidata entity facts and causal commonsense, then reports empirical results on LLMs and KGQA methods. No equations, derivations, parameter fitting, or load-bearing self-citations appear in the provided text. All claims rest on direct experimental outcomes rather than any reduction of outputs to inputs by construction, so the work is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation framework rests on the assumption that Wikidata supplies all necessary facts and that the why-questions isolate causal commonsense reasoning.

axioms (1)
  • domain assumption All supporting knowledge required to answer the queries is available in the Wikidata knowledge graph
    Explicitly stated in the abstract as the basis for the KGQA benchmark design.
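Operationally, that axiom amounts to an answerability filter applied at construction time; a minimal sketch, with toy triples standing in for Wikidata statements:

```python
# Minimal sketch of the answerability axiom as a dataset filter: a question
# is kept only if every supporting triple it needs exists in the KG. The
# triples below are toy stand-ins for Wikidata statements.

def answerable(required_triples, kg) -> bool:
    """True iff all facts a question depends on are present in the KG."""
    return all(triple in kg for triple in required_triples)

kg = {
    ("Marie Curie", "field_of_work", "radioactivity"),
    ("Marie Curie", "award_received", "Nobel Prize in Physics"),
}

keep = answerable([("Marie Curie", "field_of_work", "radioactivity")], kg)
drop = answerable([("Marie Curie", "spouse", "Pierre Curie")], kg)
# The second triple is a true real-world fact, but it is absent from this
# toy KG, so the question depending on it would be filtered out.
print(keep, drop)  # → True False
```

If the filter is leaky, i.e. some questions depend on facts missing from Wikidata, observed "hallucinations" could partly reflect unanswerable questions rather than model failure, which is why this premise is load-bearing.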

pith-pipeline@v0.9.0 · 5490 in / 1121 out tokens · 33285 ms · 2026-05-14T20:29:27.984212+00:00 · methodology

