pith. machine review for the scientific record.

arxiv: 2605.12918 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: no theorem link

CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:29 UTC · model grok-4.3

classification 💻 cs.CL
keywords commonsense reasoning · causal reasoning · large language models · knowledge graph question answering · entity-based reasoning · why questions · hallucinations · abductive reasoning

The pith

CommonWhy introduces 15,000 why questions that test whether LLMs can combine specific entity facts with causal commonsense inference

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates CommonWhy to measure how well large language models perform entity-based causal commonsense reasoning. Existing benchmarks use only true/false or multiple-choice formats and do not assess abductive reasoning or explanation quality. The new dataset supplies all required facts from Wikidata, so correct answers demand both knowledge retrieval and causal inference. Experiments on current models show frequent factual errors and breakdowns in identifying causes and effects. This matters because real-world interaction with language models depends on exactly this form of integrated reasoning.

Core claim

The paper presents CommonWhy, a dataset of 15,000 why questions that evaluate LLMs on entity-based commonsense reasoning about causal relationships. Every query is answerable from information already present in the Wikidata knowledge graph, turning the task into a KGQA benchmark that targets causal inference rather than simple fact lookup. Tests on state-of-the-art LLMs and LLM-based KGQA methods show frequent factual hallucinations together with failures to perform the required causal reasoning.

What carries the argument

The CommonWhy dataset of why questions, which forces models to retrieve entity facts from Wikidata and then apply causal commonsense to generate answers and explanations

If this is right

  • LLMs will continue to generate factually incorrect answers on causal why questions until the underlying retrieval and inference failures are addressed.
  • KGQA systems built on LLMs will underperform on tasks that require causal chaining rather than direct fact lookup.
  • Explanation quality will remain low because models cannot reliably trace causal links between entities.
  • New evaluation protocols for LLMs must include open-ended why questions to expose gaps hidden by true/false formats.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid systems that explicitly connect LLMs to knowledge graphs may need separate modules for causal chaining to reach reliable performance.
  • The dataset could be extended to track whether improvements on CommonWhy also improve performance on other forms of abductive reasoning outside Wikidata.
  • Future work could measure how much training data volume is required before models stop hallucinating the causal relations tested here.

Load-bearing premise

The questions require genuine integration of entity facts with causal commonsense reasoning instead of being solvable through superficial patterns learned in training.

What would settle it

A model that produces correct answers and explanations on most CommonWhy questions while avoiding factual hallucinations would show that current shortcomings are not as widespread as claimed.

Figures

Figures reproduced from arXiv: 2605.12918 by Armin Toroghi, Faeze Moradi Kalarde, Scott Sanner.

Figure 1: Overview of the CommonWhy dataset construction pipeline. (1) Commonsense axioms extracted from existing datasets are fed to GPT-5.1 to produce additional similar commonsense axioms, after which human annotators filter out invalid ones. (2) Commonsense axioms are rewritten as lifted question–answer pairs, and entities extracted from Wikidata are substituted into their variables to generate grounded pairs. (…)
Figure 2: Distribution of reasoning skills in CommonWhy.
Figure 3: Entity popularities in the head and long-tail splits.
Figure 4: FActScores obtained by different LLMs across head (…)
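The template-grounding step described in the Figure 1 caption (lifted question–answer pairs whose variables are filled with KG entities) can be sketched as follows; the template wording and entity bindings are hypothetical, not taken from CommonWhy itself:

```python
# Sketch of grounding a lifted question–answer pair: substitute concrete
# entities (as Wikidata would supply them) into template variables. The
# template text and bindings here are illustrative, not from the dataset.

lifted = {
    "question": "Why did {person} move to {city}?",
    "answer": "{person} took a position at {institution}, located in {city}.",
}

def ground(template: dict, bindings: dict) -> dict:
    """Fill every field of a lifted pair with the same entity bindings."""
    return {field: text.format(**bindings) for field, text in template.items()}

pair = ground(lifted, {
    "person": "Albert Einstein",
    "city": "Princeton",
    "institution": "the Institute for Advanced Study",
})
print(pair["question"])  # → Why did Albert Einstein move to Princeton?
```

One lifted pair can thus yield many grounded questions, one per entity binding, which is how a modest pool of human-validated axioms scales to 15,000 questions.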
read the original abstract

To effectively interact with the real world, Large Language Models (LLMs) require entity-based commonsense reasoning, a challenging task that necessitates integrating factual knowledge about specific entities with commonsense inference. Existing datasets for evaluating LLM entity-based commonsense reasoning have largely focused on True/False or multiple-choice questions, leaving the explicit assessment of the model's ability in abductive reasoning about causes and effects and generating explanations largely unexamined. In this work, we introduce CommonWhy, a dataset of 15,000 why questions designed to evaluate entity-based commonsense reasoning about causal relationships in LLMs. CommonWhy also serves as a Knowledge Graph Question Answering (KGQA) benchmark, as all supporting knowledge required to answer its queries is available in the Wikidata knowledge graph. Unlike existing KGQA datasets, which primarily test fact retrieval, CommonWhy targets causal commonsense reasoning, establishing a new paradigm for KGQA evaluation. Experiments with state-of-the-art LLMs and LLM-based KGQA methods reveal their significant shortcomings, including frequent factual hallucinations and failures in causal reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces CommonWhy, a dataset of 15,000 why-questions for assessing entity-based causal commonsense reasoning in LLMs. The questions are designed to require integration of specific entity facts from Wikidata with abductive causal inference, serving also as a KGQA benchmark beyond simple fact retrieval. Experiments with SOTA LLMs and KGQA methods highlight shortcomings like hallucinations and causal reasoning failures.

Significance. Should the dataset construction and evaluation hold up under scrutiny, this work could provide a useful new benchmark for probing LLMs on causal reasoning tasks that combine factual knowledge with commonsense, addressing a gap in existing True/False or multiple-choice datasets. The emphasis on generating explanations is a positive aspect.

major comments (3)
  1. [Dataset Construction] Dataset Construction section: insufficient detail is provided on the question generation process from Wikidata entities and the validation steps used to confirm that questions require genuine integration of entity facts with causal commonsense rather than surface-level patterns.
  2. [Experiments] Experiments section: no ablation studies (e.g., with vs. without KG access, or performance on paraphrased vs. original questions) are reported to demonstrate that failures cannot be explained by memorized patterns or superficial cues, which is load-bearing for interpreting results as evidence of reasoning deficits.
  3. [Results] Results and Analysis: quantitative breakdowns of error types (hallucinations vs. causal failures) and inter-annotator agreement for any human validation are missing, weakening support for the claim of significant shortcomings.
minor comments (1)
  1. [Abstract] Abstract: the exact train/test split and any filtering criteria for the 15,000 questions should be stated explicitly for reproducibility.
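For concreteness, the ablation grid the second major comment calls for might look like the following; the condition names are illustrative, not the paper's:

```python
# Illustrative ablation grid for referee comment 2: crossing KG access with
# question phrasing isolates competing explanations for failure. Condition
# names are invented here, not taken from the paper.
from itertools import product

kg_access = ["with_kg", "without_kg"]   # does retrieval help at all?
phrasing = ["original", "paraphrased"]  # do surface cues carry the score?

conditions = [f"{kg}/{ph}" for kg, ph in product(kg_access, phrasing)]
print(conditions)
# → ['with_kg/original', 'with_kg/paraphrased',
#    'without_kg/original', 'without_kg/paraphrased']
```

A with/without-KG gap points to retrieval as the bottleneck; an original/paraphrased gap points to memorized surface patterns rather than genuine causal reasoning.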

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight areas where additional detail and analysis will strengthen the paper. We will revise the manuscript to expand the dataset construction description, incorporate ablation studies, and provide quantitative error breakdowns along with inter-annotator agreement metrics. Below we address each major comment.

read point-by-point responses
  1. Referee: [Dataset Construction] Dataset Construction section: insufficient detail is provided on the question generation process from Wikidata entities and the validation steps used to confirm that questions require genuine integration of entity facts with causal commonsense rather than surface-level patterns.

    Authors: We agree that the Dataset Construction section would benefit from greater detail. In the revised manuscript we will expand this section with a step-by-step account of entity selection from Wikidata, the hybrid template- and LLM-assisted question generation procedure, and the multi-stage validation protocol (including explicit criteria and examples) used to verify that each question necessitates integration of specific entity facts with abductive causal reasoning rather than surface-level lexical patterns. revision: yes

  2. Referee: [Experiments] Experiments section: no ablation studies (e.g., with vs. without KG access, or performance on paraphrased vs. original questions) are reported to demonstrate that failures cannot be explained by memorized patterns or superficial cues, which is load-bearing for interpreting results as evidence of reasoning deficits.

    Authors: We concur that ablation studies are important for ruling out alternative explanations. We will add these experiments to the revised manuscript, specifically reporting performance with and without KG retrieval access as well as results on paraphrased question variants. These additions will help isolate whether observed shortcomings stem from reasoning deficits rather than memorization or superficial cues. revision: yes

  3. Referee: [Results] Results and Analysis: quantitative breakdowns of error types (hallucinations vs. causal failures) and inter-annotator agreement for any human validation are missing, weakening support for the claim of significant shortcomings.

    Authors: We will revise the Results and Analysis section to include quantitative error-type breakdowns derived from manual inspection of a representative sample of model outputs, explicitly separating factual hallucinations from causal reasoning failures. We will also report inter-annotator agreement statistics for the human validation steps performed during dataset construction and error categorization. revision: yes

Circularity Check

0 steps flagged

No significant circularity in dataset creation and evaluation

full rationale

The paper introduces the CommonWhy dataset of 15,000 why-questions grounded in Wikidata entity facts and causal commonsense, then reports empirical results on LLMs and KGQA methods. No equations, derivations, parameter fitting, or load-bearing self-citations appear in the provided text. All claims rest on direct experimental outcomes rather than any reduction of outputs to inputs by construction, so the work is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation framework rests on the assumption that Wikidata supplies all necessary facts and that the why-questions isolate causal commonsense reasoning.

axioms (1)
  • domain assumption All supporting knowledge required to answer the queries is available in the Wikidata knowledge graph
    Explicitly stated in the abstract as the basis for the KGQA benchmark design.
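Operationally, that axiom amounts to an answerability filter applied at construction time; a minimal sketch, with toy triples standing in for Wikidata statements:

```python
# Minimal sketch of the answerability axiom as a dataset filter: a question
# is kept only if every supporting triple it needs exists in the KG. The
# triples below are toy stand-ins for Wikidata statements.

def answerable(required_triples, kg) -> bool:
    """True iff all facts a question depends on are present in the KG."""
    return all(triple in kg for triple in required_triples)

kg = {
    ("Marie Curie", "field_of_work", "radioactivity"),
    ("Marie Curie", "award_received", "Nobel Prize in Physics"),
}

keep = answerable([("Marie Curie", "field_of_work", "radioactivity")], kg)
drop = answerable([("Marie Curie", "spouse", "Pierre Curie")], kg)
# The second triple is a true real-world fact, but it is absent from this
# toy KG, so the question depending on it would be filtered out.
print(keep, drop)  # → True False
```

If the filter is leaky, i.e. some questions depend on facts missing from Wikidata, observed "hallucinations" could partly reflect unanswerable questions rather than model failure, which is why this premise is load-bearing.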

pith-pipeline@v0.9.0 · 5490 in / 1121 out tokens · 33285 ms · 2026-05-14T20:29:27.984212+00:00 · methodology

