SocialIQA: Commonsense Reasoning about Social Interactions
Pith reviewed 2026-05-13 12:18 UTC · model grok-4.3
The pith
Social IQa is a 38,000-question benchmark that exposes a greater than 20 percent performance gap between humans and pretrained language models on social commonsense reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Social IQa contains 38,000 multiple-choice questions that probe emotional and social intelligence across ordinary situations. The dataset is constructed by crowdsourcing both questions and answers while using a framework that mitigates stylistic artifacts in the incorrect options. Pretrained language-model-based question-answering systems show a performance gap exceeding 20 percent relative to humans. When used for transfer learning, the same resource produces state-of-the-art results on multiple other commonsense reasoning benchmarks such as Winograd Schemas and COPA.
What carries the argument
The Social IQa benchmark, constructed via a crowdsourcing framework that generates incorrect answers by soliciting correct answers to related questions.
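The collection idea above can be sketched as a small data structure: the wrong options for one question are drawn from correct answers that workers wrote for a different but related question about the same context. This is a minimal illustration, not the authors' pipeline. The names `Item` and `build_item`, the choice of two distractors per item, and the related answers in the usage note are all assumptions introduced here.

```python
import random
from dataclasses import dataclass


@dataclass
class Item:
    context: str
    question: str
    answers: list  # candidate answers in shuffled order
    label: int     # index of the correct answer within `answers`


def build_item(context, question, correct, related_answers,
               n_distractors=2, seed=0):
    """Assemble one multiple-choice item (hypothetical helper).

    `related_answers` are correct answers to a different but related
    question, reused here as incorrect options for `question`.
    Raises ValueError if fewer than `n_distractors` usable answers exist.
    """
    rng = random.Random(seed)
    # Drop any related answer that happens to equal the true answer.
    pool = [a for a in related_answers if a != correct]
    distractors = rng.sample(pool, n_distractors)
    candidates = [correct] + distractors
    rng.shuffle(candidates)
    return Item(context, question, candidates, candidates.index(correct))
```

For instance, with the abstract's example context and question, one might pass `correct="Make sure no one else could hear"` and a list of made-up related answers such as `["listen closely", "keep the secret", "thank Jordan"]`. Because the distractors were written as right answers to something else, they should share the style of genuine answers, which is the property the framework relies on.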
If this is right
- Pretrained language models lack robust representations of social and emotional reasoning.
- Fine-tuning on social interaction data can improve performance on other commonsense benchmarks.
- Future systems will need explicit mechanisms for social intelligence to close the observed gap.
- The benchmark supplies a concrete testbed for measuring progress in social reasoning.
Where Pith is reading between the lines
- Social commonsense may not emerge reliably from standard language modeling objectives alone.
- The collection method could be adapted to create similar benchmarks for physical or temporal commonsense.
- Models might benefit from pairing the dataset with explicit social knowledge representations.
Load-bearing premise
The crowdsourced questions and answers capture genuine social commonsense rather than new biases that models can exploit without true understanding.
What would settle it
A model that reaches human-level accuracy on Social IQa questions without any training on the dataset itself would show that the claimed gap and transfer benefit do not hold.
Original abstract
We introduce Social IQa, the first large-scale benchmark for commonsense reasoning about social situations. Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations (e.g., Q: "Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy. Why did Jordan do this?" A: "Make sure no one else could hear"). Through crowdsourcing, we collect commonsense questions along with correct and incorrect answers about social interactions, using a new framework that mitigates stylistic artifacts in incorrect answers by asking workers to provide the right answer to a different but related question. Empirical results show that our benchmark is challenging for existing question-answering models based on pretrained language models, compared to human performance (>20% gap). Notably, we further establish Social IQa as a resource for transfer learning of commonsense knowledge, achieving state-of-the-art performance on multiple commonsense reasoning tasks (Winograd Schemas, COPA).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SocialIQA, a crowdsourced benchmark of 38,000 multiple-choice questions targeting commonsense reasoning about social and emotional situations. It reports that pretrained LM-based QA models lag human performance by more than 20% and demonstrates that fine-tuning on SocialIQA yields state-of-the-art transfer results on the Winograd Schema Challenge and COPA.
Significance. If the questions genuinely probe social commonsense rather than collection artifacts, the benchmark would be a valuable addition for evaluating and improving AI social reasoning, with the transfer gains providing concrete evidence of utility. The scale and the explicit transfer experiments are strengths.
major comments (3)
- [Data Collection] Data Collection section: The mitigation framework (workers supply correct answers to related questions to generate distractors) is described as reducing stylistic artifacts, yet no quantitative analysis is provided on whether residual patterns (e.g., answer distributions correlated with prompt surface features or generation-specific meta-patterns) remain exploitable by models. This directly affects the validity of both the >20% human-model gap and the transfer claims.
- [Experiments] Experiments section (results tables): The reported model accuracies, human baseline, and transfer SOTA numbers lack details on statistical significance testing, variance across runs, or error analysis broken down by question type. Without these, the robustness of the central difficulty and transfer claims cannot be fully assessed.
- [Transfer Learning] Transfer experiments: The SOTA results on Winograd Schemas and COPA are presented without ablations isolating the contribution of SocialIQA data versus other factors, and without comparison to more recent strong baselines available at the time of submission.
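The first major comment asks whether residual answer patterns remain exploitable. One common way to probe this is an answer-only baseline: a classifier that never sees the context or question and guesses from the wording of the options alone. The sketch below is an assumption-laden illustration in pure Python (a unigram log-ratio scorer with add-alpha smoothing), not an analysis from the paper. The function names and the `(answers, label)` item layout are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train_answer_only(items):
    """Count tokens in correct vs. incorrect options.

    `items` is a list of (answers, label) pairs; context and question
    are deliberately ignored.
    """
    pos, neg = Counter(), Counter()
    for answers, label in items:
        for i, ans in enumerate(answers):
            (pos if i == label else neg).update(tokenize(ans))
    return pos, neg


def score(answer, pos, neg, alpha=1.0):
    """Sum of smoothed log-ratios: higher means 'looks like a correct answer'."""
    vocab = len(set(pos) | set(neg)) or 1
    pos_total = sum(pos.values()) + alpha * vocab
    neg_total = sum(neg.values()) + alpha * vocab
    total = 0.0
    for tok in tokenize(answer):
        total += math.log((pos[tok] + alpha) / pos_total)
        total -= math.log((neg[tok] + alpha) / neg_total)
    return total


def answer_only_accuracy(train, test):
    """Accuracy of picking the highest-scoring option per test item."""
    pos, neg = train_answer_only(train)
    hits = 0
    for answers, label in test:
        scores = [score(a, pos, neg) for a in answers]
        hits += scores.index(max(scores)) == label
    return hits / len(test)
```

If such a baseline scores well above chance (one over the number of options) on held-out items, the options carry stylistic signal independent of social reasoning. Accuracy near chance would be consistent with the mitigation working, although it would not rule out subtler artifacts that a stronger model could find.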
minor comments (2)
- [Abstract] Abstract: Specify example model families (e.g., BERT, GPT) when referring to 'pretrained language models' for immediate clarity.
- [Related Work] Related Work: Add explicit comparison to contemporaneous social reasoning datasets to sharpen the novelty claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our SocialIQA benchmark paper. We address each major comment below with honest responses and indicate where revisions will be made to strengthen the manuscript.
Point-by-point responses
-
Referee: [Data Collection] Data Collection section: The mitigation framework (workers supply correct answers to related questions to generate distractors) is described as reducing stylistic artifacts, yet no quantitative analysis is provided on whether residual patterns (e.g., answer distributions correlated with prompt surface features or generation-specific meta-patterns) remain exploitable by models. This directly affects the validity of both the >20% human-model gap and the transfer claims.
Authors: We agree that a quantitative analysis of residual artifacts would further validate the benchmark. The framework was specifically designed to reduce stylistic biases by using workers' correct answers to a different but related question as the incorrect options, which we believe minimizes common patterns. However, we did not include such an analysis in the original submission. In revision, we will add a section quantifying answer distributions, correlations with surface features, and simple model exploitability tests (e.g., using bag-of-words baselines) to demonstrate that residual patterns do not explain the performance gap. revision: yes
-
Referee: [Experiments] Experiments section (results tables): The reported model accuracies, human baseline, and transfer SOTA numbers lack details on statistical significance testing, variance across runs, or error analysis broken down by question type. Without these, the robustness of the central difficulty and transfer claims cannot be fully assessed.
Authors: We acknowledge this limitation in the original presentation. The reported numbers reflect single-run results from standard fine-tuning procedures, but we agree that variance and significance testing are important for robustness. In the revision, we will rerun key experiments with multiple random seeds to report means and standard deviations, include statistical significance tests (e.g., McNemar's test for comparisons), and add an error analysis section breaking down performance by question categories such as emotional vs. social inference. revision: yes
-
Referee: [Transfer Learning] Transfer experiments: The SOTA results on Winograd Schemas and COPA are presented without ablations isolating the contribution of SocialIQA data versus other factors, and without comparison to more recent strong baselines available at the time of submission.
Authors: The transfer results compare models fine-tuned on SocialIQA against their non-fine-tuned counterparts and prior SOTA at submission time (e.g., BERT-based models). We did not include exhaustive ablations isolating every factor, which is a fair critique. For recent baselines, the paper was submitted in 2019 and used the strongest available methods then; we will update the transfer section with additional comparisons to contemporaneous strong models and add a simple ablation table showing performance with and without SocialIQA fine-tuning to better isolate its contribution. revision: partial
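The second response proposes reporting means and standard deviations over random seeds and using McNemar's test for paired model comparisons. A minimal sketch of both, using only the Python standard library, might look as follows. The function names are hypothetical, the exact (binomial) form of McNemar's test is one of several variants, and any numbers passed in would come from the reader's own runs rather than from the paper.

```python
import math
from statistics import mean, stdev


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-question correctness.

    `correct_a` and `correct_b` are equal-length sequences of booleans,
    one entry per test question, for two systems on the same questions.
    Only discordant pairs (one right, the other wrong) carry information.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the systems never disagree
    k = min(b, c)
    # Binomial tail under the null that each discordant pair is a fair coin.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def summarize_seeds(accuracies):
    """Mean and sample standard deviation over runs (needs >= 2 runs)."""
    return mean(accuracies), stdev(accuracies)
```

A paired test suits this setting because both systems answer the same questions, so per-question agreement should not be treated as independent samples. For the with and without SocialIQA fine-tuning ablation mentioned in the third response, the same two helpers could be applied to the two conditions' per-question outputs and per-seed accuracies.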
Circularity Check
No circularity; benchmark and results rest on new crowdsourced data collection and direct empirical evaluation
full rationale
The paper introduces Social IQa via a described crowdsourcing framework that generates questions and answers about social situations, then reports direct model evaluations (pretrained QA models vs. humans) and transfer experiments on Winograd/COPA. No equations, fitted parameters, or derivations are present. Claims do not reduce to self-citations, prior fits, or self-definitions; they are independent empirical measurements on the newly collected 38k-question resource. Minor prior-work citations exist for context but are not load-bearing for the performance gap or transfer results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Crowdsourced annotations from the described framework accurately reflect genuine social commonsense without residual artifacts.
Forward citations
Cited by 23 Pith papers
-
Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion
Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
Scaling and evaluating sparse autoencoders
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models
ADMM-Q is a new post-training quantization method using ADMM operator splitting that reduces WikiText-2 perplexity compared to GPTQ on Qwen3-8B across W3A16, W4A8, and W2A4KV4 settings.
-
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
-
Mixture of Heterogeneous Grouped Experts for Language Modeling
MoHGE achieves standard MoE performance with 20% fewer parameters and balanced GPU utilization via grouped heterogeneous experts, two-level routing, and specialized auxiliary losses.
-
SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning
SAMoRA is a parameter-efficient fine-tuning framework that uses semantic-aware routing and task-adaptive scaling within a Mixture of LoRA Experts to improve multi-task performance and generalization over prior methods.
-
TLoRA: Task-aware Low Rank Adaptation of Large Language Models
TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
-
Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
-
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
-
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
-
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.
-
Post-Optimization Adaptive Rank Allocation for LoRA
PARA uses post-optimization SVD with a global singular-value threshold to allocate non-uniform ranks to LoRA layers, cutting parameters 75-90% with no loss in benchmark performance.
-
Generating Place-Based Compromises Between Two Points of View
Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.
-
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
-
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
-
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
-
Gemma: Open Models Based on Gemini Research and Technology
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
-
Gemma 2: Improving Open Language Models at a Practical Size
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.