arxiv: 1904.09728 · v3 · submitted 2019-04-22 · 💻 cs.CL

SocialIQA: Commonsense Reasoning about Social Interactions

Maarten Sap , Hannah Rashkin , Derek Chen , Ronan LeBras , Yejin Choi

Pith reviewed 2026-05-13 12:18 UTC · model grok-4.3

classification 💻 cs.CL

keywords social commonsense · benchmark · question answering · emotional intelligence · transfer learning · Winograd schemas · COPA

The pith

Social IQa is a 38,000-question benchmark that exposes a performance gap of more than 20 percent between humans and pretrained language models on social commonsense reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Social IQa as the first large-scale multiple-choice dataset for testing how well systems understand everyday social interactions and the emotions behind them. Questions cover situations such as why someone might lean in to share a secret, with correct and incorrect answers collected through crowdsourcing. A new collection method asks workers to supply the right answer to a related question in order to generate plausible but wrong options, reducing superficial cues that models could exploit. Existing question-answering systems built on pretrained language models fall more than 20 percent behind human accuracy, yet fine-tuning on Social IQa raises state-of-the-art results on established commonsense tasks including Winograd Schemas and COPA.

Core claim

Social IQa contains 38,000 multiple-choice questions that probe emotional and social intelligence across ordinary situations. The dataset is constructed by crowdsourcing both questions and answers while using a framework that mitigates stylistic artifacts in the incorrect options. Pretrained language-model-based question-answering systems show a performance gap exceeding 20 percent relative to humans. When used for transfer learning, the same resource produces state-of-the-art results on multiple other commonsense reasoning benchmarks such as Winograd Schemas and COPA.

What carries the argument

The Social IQa benchmark, constructed via a crowdsourcing framework that generates incorrect answers by soliciting correct answers to related questions.
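
As a rough illustration of that construction, the sketch below assembles one multiple-choice item by reusing answers written for related questions as the incorrect options. It is not the authors' pipeline: the names `Item` and `build_item`, the split of the example into a separate context and question, and the two related answers are all assumptions made for illustration. Only the Jordan/Tracy question and its correct answer come from the abstract.

```python
import random
from dataclasses import dataclass


@dataclass
class Item:
    context: str
    question: str
    options: list
    label: int  # index of the correct option within `options`


def build_item(context, question, correct, related_answers, seed=0):
    """Combine the correct answer with answers written for related questions.

    `related_answers` holds answers that are right for a *different* question
    about the same context; here they are reused as incorrect options
    (hypothetical input format, assumed for this sketch).
    """
    # Drop any related answer that happens to match the correct one.
    distractors = [a for a in related_answers if a != correct]
    options = [correct] + distractors
    # Shuffle with a seeded generator so the correct position is reproducible.
    random.Random(seed).shuffle(options)
    return Item(context, question, options, options.index(correct))


item = build_item(
    context="Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy.",
    question="Why did Jordan do this?",
    correct="Make sure no one else could hear",
    # Invented answers to a hypothetical related question about Tracy's reaction.
    related_answers=["Curious about the secret", "Trusted by Jordan"],
)
print(item.options, item.label)
```

Because the incorrect options were written as genuine answers by the same kind of worker, they should share the style of the correct answer, which is the property the framework relies on.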

If this is right

Pretrained language models lack robust representations of social and emotional reasoning.
Fine-tuning on social interaction data can improve performance on other commonsense benchmarks.
Future systems will need explicit mechanisms for social intelligence to close the observed gap.
The benchmark supplies a concrete testbed for measuring progress in social reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Social commonsense may not emerge reliably from standard language modeling objectives alone.
The collection method could be adapted to create similar benchmarks for physical or temporal commonsense.
Models might benefit from pairing the dataset with explicit social knowledge representations.

Load-bearing premise

The crowdsourced questions and answers capture genuine social commonsense rather than new biases that models can exploit without true understanding.

What would settle it

A model that reaches human-level accuracy on Social IQa questions without any training on the dataset itself would show that the claimed gap and transfer benefit do not hold.

Original abstract

We introduce Social IQa, the first large-scale benchmark for commonsense reasoning about social situations. Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations (e.g., Q: "Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy. Why did Jordan do this?" A: "Make sure no one else could hear"). Through crowdsourcing, we collect commonsense questions along with correct and incorrect answers about social interactions, using a new framework that mitigates stylistic artifacts in incorrect answers by asking workers to provide the right answer to a different but related question. Empirical results show that our benchmark is challenging for existing question-answering models based on pretrained language models, compared to human performance (>20% gap). Notably, we further establish Social IQa as a resource for transfer learning of commonsense knowledge, achieving state-of-the-art performance on multiple commonsense reasoning tasks (Winograd Schemas, COPA).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SocialIQA, a crowdsourced benchmark of 38,000 multiple-choice questions targeting commonsense reasoning about social and emotional situations. It reports that pretrained LM-based QA models lag human performance by more than 20% and demonstrates that fine-tuning on SocialIQA yields state-of-the-art transfer results on the Winograd Schema Challenge and COPA.

Significance. If the questions genuinely probe social commonsense rather than collection artifacts, the benchmark would be a valuable addition for evaluating and improving AI social reasoning, with the transfer gains providing concrete evidence of utility. The scale and the explicit transfer experiments are strengths.

major comments (3)

[Data Collection] Data Collection section: The mitigation framework (workers supply correct answers to related questions to generate distractors) is described as reducing stylistic artifacts, yet no quantitative analysis is provided on whether residual patterns (e.g., answer distributions correlated with prompt surface features or generation-specific meta-patterns) remain exploitable by models. This directly affects the validity of both the >20% human-model gap and the transfer claims.
[Experiments] Experiments section (results tables): The reported model accuracies, human baseline, and transfer SOTA numbers lack details on statistical significance testing, variance across runs, or error analysis broken down by question type. Without these, the robustness of the central difficulty and transfer claims cannot be fully assessed.
[Transfer Learning] Transfer experiments: The SOTA results on Winograd Schemas and COPA are presented without ablations isolating the contribution of SocialIQA data versus other factors, and without comparison to more recent strong baselines available at the time of submission.
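
The residual-artifact concern in the first major comment can be probed with an answer-only baseline: a model that never sees the context or question and picks an option from word statistics alone. The sketch below is one minimal way to do that with smoothed token log-odds. It is not an analysis from the paper, and the function names and the toy data (which deliberately plant the cue word "not" in incorrect options) are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def fit_answer_only(items):
    """Count tokens in correct vs. incorrect options; context is never used."""
    pos, neg = Counter(), Counter()
    for options, gold in items:
        for i, option in enumerate(options):
            (pos if i == gold else neg).update(tokenize(option))
    return pos, neg


def score(option, pos, neg):
    """Sum of add-one-smoothed log-odds that each token marks a correct option."""
    vocab = len(set(pos) | set(neg))
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    total = 0.0
    for tok in tokenize(option):
        p = (pos[tok] + 1) / (n_pos + vocab)
        q = (neg[tok] + 1) / (n_neg + vocab)
        total += math.log(p / q)
    return total


def predict(options, pos, neg):
    scores = [score(o, pos, neg) for o in options]
    return scores.index(max(scores))


def accuracy(items, pos, neg):
    hits = sum(predict(options, pos, neg) == gold for options, gold in items)
    return hits / len(items)


# Toy data only: each entry is (answer options, index of the correct option).
train = [
    (["feel happy", "not care", "not know"], 0),
    (["not sure", "be relieved", "not ready"], 1),
]
held_out = [(["not listen", "not stay", "say thanks"], 2)]

pos, neg = fit_answer_only(train)
print(accuracy(held_out, pos, neg))
```

On a real split, accuracy well above chance for such a baseline would suggest exploitable surface cues, while accuracy near chance would support the claim that the collection framework limits them.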

minor comments (2)

[Abstract] Abstract: Specify example model families (e.g., BERT, GPT) when referring to 'pretrained language models' for immediate clarity.
[Related Work] Related Work: Add explicit comparison to contemporaneous social reasoning datasets to sharpen the novelty claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our SocialIQA benchmark paper. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Referee: [Data Collection] Data Collection section: The mitigation framework (workers supply correct answers to related questions to generate distractors) is described as reducing stylistic artifacts, yet no quantitative analysis is provided on whether residual patterns (e.g., answer distributions correlated with prompt surface features or generation-specific meta-patterns) remain exploitable by models. This directly affects the validity of both the >20% human-model gap and the transfer claims.

Authors: We agree that a quantitative analysis of residual artifacts would further validate the benchmark. The framework was designed to reduce stylistic biases by using workers' correct answers to related questions as the incorrect options, so that distractors are written in the same style as correct answers. However, we did not include such an analysis in the original submission. In revision, we will add a section quantifying answer distributions, correlations with surface features, and simple model exploitability tests (e.g., using bag-of-words baselines) to check whether residual patterns explain the performance gap. revision: yes
Referee: [Experiments] Experiments section (results tables): The reported model accuracies, human baseline, and transfer SOTA numbers lack details on statistical significance testing, variance across runs, or error analysis broken down by question type. Without these, the robustness of the central difficulty and transfer claims cannot be fully assessed.

Authors: We acknowledge this limitation in the original presentation. The reported numbers reflect single-run results from standard fine-tuning procedures, but we agree that variance and significance testing are important for robustness. In the revision, we will rerun key experiments with multiple random seeds to report means and standard deviations, include statistical significance tests (e.g., McNemar's test for comparisons), and add an error analysis section breaking down performance by question categories such as emotional vs. social inference. revision: yes
Referee: [Transfer Learning] Transfer experiments: The SOTA results on Winograd Schemas and COPA are presented without ablations isolating the contribution of SocialIQA data versus other factors, and without comparison to more recent strong baselines available at the time of submission.

Authors: The transfer results compare models fine-tuned on SocialIQA against their non-fine-tuned counterparts and prior SOTA at submission time (e.g., BERT-based models). We did not include exhaustive ablations isolating every factor, which is a fair critique. For recent baselines, the paper was submitted in 2019 and used the strongest available methods then; we will update the transfer section with additional comparisons to contemporaneous strong models and add a simple ablation table showing performance with and without SocialIQA fine-tuning to better isolate its contribution. revision: partial
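
The significance testing proposed in the second response could take the form of an exact McNemar test on paired per-question outcomes. The sketch below is a minimal standard-library version, offered as an illustration and not as the authors' procedure. The function name `mcnemar_exact` and the correctness flags are hypothetical toy values, not results from the paper.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-question correctness flags.

    Returns (b, c, p_value), where b counts questions only system A got right
    and c counts questions only system B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same questions")
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the systems never disagree
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy flags for eight questions (1 = answered correctly); not real results.
system_a = [1, 1, 1, 1, 0, 1, 1, 0]
system_b = [1, 0, 0, 1, 0, 0, 1, 1]
print(mcnemar_exact(system_a, system_b))
```

Only the discordant questions enter the test, so it requires the two systems to be evaluated on the same question set, which matches a comparison on a fixed benchmark split.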

Circularity Check

0 steps flagged

No circularity; benchmark and results rest on new crowdsourced data collection and direct empirical evaluation

The paper introduces Social IQa via a described crowdsourcing framework that generates questions and answers about social situations, then reports direct model evaluations (pretrained QA models vs. humans) and transfer experiments on Winograd/COPA. No equations, fitted parameters, or derivations are present. Claims do not reduce to self-citations, prior fits, or self-definitions; they are independent empirical measurements on the newly collected 38k-question resource. Minor prior-work citations exist for context but are not load-bearing for the performance gap or transfer results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that crowdsourced human judgments reliably encode social commonsense and that the benchmark questions isolate social intelligence from other factors.

axioms (1)

domain assumption: Crowdsourced annotations from the described framework accurately reflect genuine social commonsense without residual artifacts.
Invoked in the data collection process described in the abstract.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion
cs.LG 2026-05 unverdicted novelty 7.0

Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...
Moshi: a speech-text foundation model for real-time dialogue
eess.AS 2024-09 accept novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Scaling and evaluating sparse autoencoders
cs.LG 2024-06 unverdicted novelty 7.0

K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Self-Rewarding Language Models
cs.CL 2024-01 conditional novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models
cs.LG 2026-05 unverdicted novelty 6.0

ADMM-Q is a new post-training quantization method using ADMM operator splitting that reduces WikiText-2 perplexity compared to GPTQ on Qwen3-8B across W3A16, W4A8, and W2A4KV4 settings.
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
cs.LG 2026-05 unverdicted novelty 6.0

MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
Mixture of Heterogeneous Grouped Experts for Language Modeling
cs.CL 2026-04 unverdicted novelty 6.0

MoHGE achieves standard MoE performance with 20% fewer parameters and balanced GPU utilization via grouped heterogeneous experts, two-level routing, and specialized auxiliary losses.
SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning
cs.CL 2026-04 unverdicted novelty 6.0

SAMoRA is a parameter-efficient fine-tuning framework that uses semantic-aware routing and task-adaptive scaling within a Mixture of LoRA Experts to improve multi-task performance and generalization over prior methods.
TLoRA: Task-aware Low Rank Adaptation of Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
cs.LG 2026-04 unverdicted novelty 6.0

DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
cs.AI 2024-08 unverdicted novelty 6.0

A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
Chameleon: Mixed-Modal Early-Fusion Foundation Models
cs.CL 2024-05 unverdicted novelty 6.0

Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
cs.CL 2019-05 accept novelty 6.0

SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.
Post-Optimization Adaptive Rank Allocation for LoRA
cs.AI 2026-04 unverdicted novelty 5.0

PARA uses post-optimization SVD with a global singular-value threshold to allocate non-uniform ranks to LoRA layers, cutting parameters 75-90% with no loss in benchmark performance.
Generating Place-Based Compromises Between Two Points of View
cs.CL 2026-04 unverdicted novelty 5.0

Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
cs.LG 2026-04 unverdicted novelty 5.0

ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
cs.CL 2025-03 unverdicted novelty 5.0

Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
cs.LG 2024-03 accept novelty 4.0

A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
Gemma: Open Models Based on Gemini Research and Technology
cs.CL 2024-03 accept novelty 4.0

Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
cs.CL 2024-01 unverdicted novelty 4.0

DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
Gemma 2: Improving Open Language Models at a Practical Size
cs.CL 2024-07 conditional novelty 3.0

Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
