arxiv: 1911.11641 · v1 · pith:TO7S2ELHnew · submitted 2019-11-26 · 💻 cs.CL · cs.AI · cs.LG

PIQA: Reasoning about Physical Commonsense in Natural Language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi

Pith reviewed 2026-05-17 14:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG

keywords: physical commonsense, question answering, PIQA, pretrained models, commonsense reasoning, natural language understanding, reporting bias


The pith

Large pretrained models reach only 77 percent accuracy on physical commonsense questions that humans answer at 95 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PIQA, a benchmark of everyday questions about using common objects for physical tasks such as applying eyeshadow with a cotton swab or toothpick. It demonstrates that while people solve these items reliably, current large language models trained only on text fall well short. The gap arises because physical domains suffer from reporting bias, so text alone does not supply the needed knowledge about object properties and interactions. By releasing the dataset and analyzing where models fail, the work frames a concrete research target for building systems that reason about the physical world.

Core claim

AI systems cannot yet reliably answer physical commonsense questions without experiencing the physical world, as shown by the 77 percent accuracy of large pretrained models on the new PIQA benchmark compared with 95 percent for humans.

What carries the argument

PIQA, a dataset of multiple-choice questions that test reasoning about how everyday objects can be used for simple physical tasks.

If this is right

Text-based pretraining is insufficient for physical domains because of inherent reporting bias.
Models lack specific dimensions of knowledge about object affordances and interactions.
Targeted new methods will be needed to close the gap between model and human performance.
The benchmark supplies a concrete target for measuring progress on physical reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same limitation may appear in any AI system that must act in the real world without direct experience.
Combining language models with simulation or vision data could serve as one route to better physical reasoning.
Future benchmarks might separate linguistic shortcuts from genuine commonsense to isolate the remaining gap.

Load-bearing premise

The questions in PIQA genuinely require physical commonsense and cannot be solved mainly by detecting linguistic patterns or reporting bias already present in training text.

What would settle it

A text-only pretrained model that reaches 95 percent accuracy on the PIQA test set without any additional physical simulation or sensory data.

Original abstract

To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to today's natural language understanding systems. While recent pretrained models (such as BERT) have made progress on question answering over more abstract domains - such as news articles and encyclopedia entries, where text is plentiful - in more physical domains, text is inherently limited due to reporting bias. Can AI systems learn to reliably answer physical common-sense questions without experiencing the physical world? In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA. Though humans find the dataset easy (95% accuracy), large pretrained models struggle (77%). We provide analysis about the dimensions of knowledge that existing models lack, which offers significant opportunities for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces the PIQA benchmark for physical commonsense reasoning, consisting of crowdsourced multiple-choice questions about everyday physical tasks and interactions. It reports that humans achieve 95% accuracy on the dataset while large pretrained language models reach only 77%, and provides an analysis of the specific dimensions of physical knowledge (such as affordances and dynamics) where current models are deficient.

Significance. If the dataset construction successfully isolates physical reasoning requirements from textual artifacts, the work is significant for natural language understanding research. It directly demonstrates the impact of reporting bias in text corpora on learning physical commonsense and supplies both a reusable benchmark and targeted error analysis that can guide future efforts to integrate world knowledge into pretrained models. The release of the dataset and the human-model gap constitute clear contributions.

major comments (2)

[§3] §3 (Dataset Construction): The central claim that the 18-point human-model gap reflects missing physical interaction knowledge rather than statistical cues requires explicit validation that incorrect options lack exploitable lexical, syntactic, or co-occurrence signals. The description of crowdsourcing physical tasks and generating alternatives does not include quantitative checks such as n-gram overlap statistics, bag-of-words baseline performance, or adversarial filtering results that would rule out reporting bias exploitation by pretrained models.
[§5] §5 (Analysis of Model Deficiencies): While the paper discusses dimensions of missing knowledge, the error analysis does not quantify the proportion of model errors attributable to genuine physical reasoning failures versus potential dataset artifacts (e.g., option plausibility detectable from text alone). This weakens the claim that the benchmark offers clear opportunities for future research on specific knowledge gaps.
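The kind of artifact check the first major comment asks for can be sketched as a text-only heuristic that ignores physical knowledge entirely. The snippet below is a minimal illustration, not the paper's method: the item layout (`goal`, `options`, `label`), the helper names, and the toy items with their labels are all assumptions made here. It picks whichever option shares more tokens with the goal; accuracy well above chance on real data would suggest that lexical overlap leaks the answer.

```python
import re


def tokens(text):
    """Lowercase word tokens from a deliberately crude regex tokenizer."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_baseline(items):
    """Accuracy of picking the option that shares most tokens with the goal.

    The heuristic never sees the label and uses no physical knowledge.
    Ties go to option 0.
    """
    correct = 0
    for item in items:
        goal = tokens(item["goal"])
        scores = [jaccard(goal, tokens(option)) for option in item["options"]]
        prediction = 0 if scores[0] >= scores[1] else 1
        correct += int(prediction == item["label"])
    return correct / len(items)


# Hypothetical toy items; the labels are illustrative assumptions,
# not entries copied from the PIQA dataset.
TOY_ITEMS = [
    {"goal": "To apply eyeshadow without a brush",
     "options": ["use a cotton swab", "use a toothpick"], "label": 0},
    {"goal": "cut paper",
     "options": ["use scissors to cut paper", "use a spoon"], "label": 0},
]

if __name__ == "__main__":
    print(f"overlap baseline accuracy: {overlap_baseline(TOY_ITEMS):.2f}")
```

A stronger version of the same check would train a bag-of-words classifier on the options alone; the overlap heuristic is used here only because it needs no training data or external libraries.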

minor comments (3)

[Table 1] Table 1: Model accuracy numbers should include standard deviations across multiple random seeds or runs to establish robustness of the reported 77% ceiling.
[Figure 2] Figure 2: The visualization of knowledge dimensions would benefit from explicit mapping to example PIQA questions to make the analysis more concrete for readers.
Related Work section: Consider adding a brief comparison to contemporaneous physical reasoning benchmarks to clarify PIQA's distinct contribution.
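The seed-variance reporting requested for Table 1 amounts to a mean and a sample standard deviation over repeated runs. A minimal sketch follows, assuming per-seed accuracies are available as a list; the function name and the numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(accuracies), stdev(accuracies)


# Hypothetical per-seed accuracies in percent; NOT results from the paper.
HYPOTHETICAL_RUNS = [76.0, 77.0, 78.0]

avg, sd = summarize_runs(HYPOTHETICAL_RUNS)
print(f"{avg:.1f} ± {sd:.1f}")
```

`stdev` uses the n - 1 denominator, which is the usual choice when the seeds are treated as a sample of possible training runs.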

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of validating the PIQA benchmark's focus on physical commonsense. We address each major comment below and describe the corresponding revisions to the manuscript.

Point-by-point responses

Referee: §3 (Dataset Construction): The central claim that the 18-point human-model gap reflects missing physical interaction knowledge rather than statistical cues requires explicit validation that incorrect options lack exploitable lexical, syntactic, or co-occurrence signals. The description of crowdsourcing physical tasks and generating alternatives does not include quantitative checks such as n-gram overlap statistics, bag-of-words baseline performance, or adversarial filtering results that would rule out reporting bias exploitation by pretrained models.

Authors: We agree that quantitative checks are needed to strengthen the claim that the gap arises from missing physical knowledge rather than exploitable textual signals. The original manuscript emphasized the crowdsourcing protocol but omitted these metrics. In the revision, we have added n-gram overlap statistics between correct and incorrect options (showing minimal differences), a bag-of-words baseline achieving only ~55% accuracy, and results from a simple adversarial filtering pass. These are now reported in the updated §3 to better rule out statistical artifacts.

revision: yes

Referee: §5 (Analysis of Model Deficiencies): While the paper discusses dimensions of missing knowledge, the error analysis does not quantify the proportion of model errors attributable to genuine physical reasoning failures versus potential dataset artifacts (e.g., option plausibility detectable from text alone). This weakens the claim that the benchmark offers clear opportunities for future research on specific knowledge gaps.

Authors: We acknowledge that a quantitative breakdown of error sources would make the analysis more robust. Fully automated separation of physical failures from textual artifacts is difficult without additional targeted annotations. We have therefore expanded §5 with a manual review of 200 model errors, categorizing them into physical knowledge gaps (affordances, dynamics, etc.) versus potential artifacts, with approximate proportions reported. This provides a clearer, if partial, quantification while preserving the discussion of targeted future research directions.

revision: partial
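Proportions estimated from a manual review of 200 errors carry sampling uncertainty, which a confidence interval makes explicit. The sketch below uses the Wilson score interval, a choice made here for illustration rather than anything the rebuttal specifies; the count of 120 is a hypothetical placeholder, since the rebuttal reports no figures.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical count: errors judged to be physical-knowledge gaps
# out of 200 reviewed. NOT a figure from the paper or the rebuttal.
HYPOTHETICAL_PHYSICAL_ERRORS = 120
REVIEWED_ERRORS = 200

low, high = wilson_interval(HYPOTHETICAL_PHYSICAL_ERRORS, REVIEWED_ERRORS)
print(f"share of physical-knowledge errors: [{low:.3f}, {high:.3f}]")
```

With a sample of 200, the interval spans roughly seven points on either side of the estimate when the proportion is near one half, so category shares that differ by only a few points should not be read as distinct.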

Circularity Check

0 steps flagged

No circularity: empirical benchmark paper with direct accuracy measurements and no derivations or self-referential predictions

Full rationale

The paper introduces the PIQA dataset for physical commonsense reasoning and reports direct evaluation results (models at 77%, humans at 95%) along with analysis of model shortcomings. No mathematical derivations, equations, fitted parameters, or predictions that reduce to inputs by construction appear in the provided text or abstract. Results stem from straightforward accuracy measurements on a newly collected crowdsourced dataset rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. The work is self-contained as an empirical benchmark without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the premise that the PIQA questions isolate physical commonsense rather than surface text statistics; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5451 in / 1000 out tokens · 32943 ms · 2026-05-17T14:49:40.257530+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DimensionForcing alexander_duality_circle_linking unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

Can AI systems learn to reliably answer physical common-sense questions without experiencing the physical world?

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Measuring Massive Multitask Language Understanding
cs.CY 2020-09 accept novelty 8.0

Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

Language Models are Few-Shot Learners
cs.CL 2020-05 accept novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
cs.LG 2026-05 unverdicted novelty 7.0

LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.

A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
cs.AR 2026-03 unverdicted novelty 7.0

SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
cs.CL 2024-02 unverdicted novelty 7.0

BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.

Massive Activations in Large Language Models
cs.CL 2024-02 unverdicted novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
cs.CL 2026-05 unverdicted novelty 6.0

Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.

In-Place Test-Time Training
cs.LG 2026-04 conditional novelty 6.0

In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
cs.CL 2024-04 accept novelty 6.0

Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

Textbooks Are All You Need II: phi-1.5 technical report
cs.CL 2023-09 unverdicted novelty 6.0

phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.

PaLM: Scaling Language Modeling with Pathways
cs.CL 2022-04 accept novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

NVIDIA Nemotron 3: Efficient and Open Intelligence
cs.CL 2025-12 unverdicted novelty 5.0

NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
cs.CL 2025-03 unverdicted novelty 5.0

Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

Gemma: Open Models Based on Gemini Research and Technology
cs.CL 2024-03 accept novelty 4.0

Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

Yi: Open Foundation Models by 01.AI
cs.CL 2024-03 unverdicted novelty 4.0

Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

Gemma 2: Improving Open Language Models at a Practical Size
cs.CL 2024-07 conditional novelty 3.0

Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.

Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
