pith. machine review for the scientific record. sign in

arxiv: 1911.11641 · v1 · pith:TO7S2ELHnew · submitted 2019-11-26 · 💻 cs.CL · cs.AI· cs.LG

PIQA: Reasoning about Physical Commonsense in Natural Language

Pith reviewed 2026-05-17 14:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords physical commonsensequestion answeringPIQApretrained modelscommonsense reasoningnatural language understandingreporting bias
0
0 comments X

The pith

Large pretrained models reach only 77 percent accuracy on physical commonsense questions that humans answer at 95 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PIQA, a benchmark of everyday questions about using common objects for physical tasks such as applying eyeshadow with a cotton swab or toothpick. It demonstrates that while people solve these items reliably, current large language models trained only on text fall well short. The gap arises because physical domains suffer from reporting bias, so text alone does not supply the needed knowledge about object properties and interactions. By releasing the dataset and analyzing where models fail, the work frames a concrete research target for building systems that reason about the physical world.

Core claim

AI systems cannot yet reliably answer physical commonsense questions without experiencing the physical world, as shown by the 77 percent accuracy of large pretrained models on the new PIQA benchmark compared with 95 percent for humans.

What carries the argument

PIQA, a dataset of multiple-choice questions that test reasoning about how everyday objects can be used for simple physical tasks.

If this is right

  • Text-based pretraining is insufficient for physical domains because of inherent reporting bias.
  • Models lack specific dimensions of knowledge about object affordances and interactions.
  • Targeted new methods will be needed to close the gap between model and human performance.
  • The benchmark supplies a measurable target for measuring progress on physical reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same limitation may appear in any AI system that must act in the real world without direct experience.
  • Combining language models with simulation or vision data could serve as one route to better physical reasoning.
  • Future benchmarks might separate linguistic shortcuts from genuine commonsense to isolate the remaining gap.

Load-bearing premise

The questions in PIQA genuinely require physical commonsense and cannot be solved mainly by detecting linguistic patterns or reporting bias already present in training text.

What would settle it

A text-only pretrained model that reaches 95 percent accuracy on the PIQA test set without any additional physical simulation or sensory data.

read the original abstract

To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to today's natural language understanding systems. While recent pretrained models (such as BERT) have made progress on question answering over more abstract domains - such as news articles and encyclopedia entries, where text is plentiful - in more physical domains, text is inherently limited due to reporting bias. Can AI systems learn to reliably answer physical common-sense questions without experiencing the physical world? In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA. Though humans find the dataset easy (95% accuracy), large pretrained models struggle (77%). We provide analysis about the dimensions of knowledge that existing models lack, which offers significant opportunities for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces the PIQA benchmark for physical commonsense reasoning, consisting of crowdsourced multiple-choice questions about everyday physical tasks and interactions. It reports that humans achieve 95% accuracy on the dataset while large pretrained language models reach only 77%, and provides an analysis of the specific dimensions of physical knowledge (such as affordances and dynamics) where current models are deficient.

Significance. If the dataset construction successfully isolates physical reasoning requirements from textual artifacts, the work is significant for natural language understanding research. It directly demonstrates the impact of reporting bias in text corpora on learning physical commonsense and supplies both a reusable benchmark and targeted error analysis that can guide future efforts to integrate world knowledge into pretrained models. The release of the dataset and the human-model gap constitute clear contributions.

major comments (2)
  1. [§3] §3 (Dataset Construction): The central claim that the 18-point human-model gap reflects missing physical interaction knowledge rather than statistical cues requires explicit validation that incorrect options lack exploitable lexical, syntactic, or co-occurrence signals. The description of crowdsourcing physical tasks and generating alternatives does not include quantitative checks such as n-gram overlap statistics, bag-of-words baseline performance, or adversarial filtering results that would rule out reporting bias exploitation by pretrained models.
  2. [§5] §5 (Analysis of Model Deficiencies): While the paper discusses dimensions of missing knowledge, the error analysis does not quantify the proportion of model errors attributable to genuine physical reasoning failures versus potential dataset artifacts (e.g., option plausibility detectable from text alone). This weakens the claim that the benchmark offers clear opportunities for future research on specific knowledge gaps.
minor comments (3)
  1. [Table 1] Table 1: Model accuracy numbers should include standard deviations across multiple random seeds or runs to establish robustness of the reported 77% ceiling.
  2. [Figure 2] Figure 2: The visualization of knowledge dimensions would benefit from explicit mapping to example PIQA questions to make the analysis more concrete for readers.
  3. Related Work section: Consider adding a brief comparison to contemporaneous physical reasoning benchmarks to clarify PIQA's distinct contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of validating the PIQA benchmark's focus on physical commonsense. We address each major comment below and describe the corresponding revisions to the manuscript.

read point-by-point responses
  1. Referee: §3 (Dataset Construction): The central claim that the 18-point human-model gap reflects missing physical interaction knowledge rather than statistical cues requires explicit validation that incorrect options lack exploitable lexical, syntactic, or co-occurrence signals. The description of crowdsourcing physical tasks and generating alternatives does not include quantitative checks such as n-gram overlap statistics, bag-of-words baseline performance, or adversarial filtering results that would rule out reporting bias exploitation by pretrained models.

    Authors: We agree that quantitative checks are needed to strengthen the claim that the gap arises from missing physical knowledge rather than exploitable textual signals. The original manuscript emphasized the crowdsourcing protocol but omitted these metrics. In the revision, we have added n-gram overlap statistics between correct and incorrect options (showing minimal differences), a bag-of-words baseline achieving only ~55% accuracy, and results from a simple adversarial filtering pass. These are now reported in the updated §3 to better rule out statistical artifacts. revision: yes

  2. Referee: §5 (Analysis of Model Deficiencies): While the paper discusses dimensions of missing knowledge, the error analysis does not quantify the proportion of model errors attributable to genuine physical reasoning failures versus potential dataset artifacts (e.g., option plausibility detectable from text alone). This weakens the claim that the benchmark offers clear opportunities for future research on specific knowledge gaps.

    Authors: We acknowledge that a quantitative breakdown of error sources would make the analysis more robust. Fully automated separation of physical failures from textual artifacts is difficult without additional targeted annotations. We have therefore expanded §5 with a manual review of 200 model errors, categorizing them into physical knowledge gaps (affordances, dynamics, etc.) versus potential artifacts, with approximate proportions reported. This provides a clearer, if partial, quantification while preserving the discussion of targeted future research directions. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark paper with direct accuracy measurements and no derivations or self-referential predictions

full rationale

The paper introduces the PIQA dataset for physical commonsense reasoning and reports direct evaluation results (models at 77%, humans at 95%) along with analysis of model shortcomings. No mathematical derivations, equations, fitted parameters, or predictions that reduce to inputs by construction appear in the provided text or abstract. Results stem from straightforward accuracy measurements on a newly collected crowdsourced dataset rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. The work is self-contained as an empirical benchmark without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the premise that the PIQA questions isolate physical commonsense rather than surface text statistics; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5451 in / 1000 out tokens · 32943 ms · 2026-05-17T14:49:40.257530+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Measuring Massive Multitask Language Understanding

    cs.CY 2020-09 accept novelty 8.0

    Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

  2. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  3. LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

  4. Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.

  5. A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network

    cs.AR 2026-03 unverdicted novelty 7.0

    SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.

  6. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

    cs.CL 2024-02 unverdicted novelty 7.0

    BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.

  7. Massive Activations in Large Language Models

    cs.CL 2024-02 unverdicted novelty 7.0

    Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

  8. Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

    cs.CL 2026-05 unverdicted novelty 6.0

    Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.

  9. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  10. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    cs.CL 2024-04 accept novelty 6.0

    Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

  11. Textbooks Are All You Need II: phi-1.5 technical report

    cs.CL 2023-09 unverdicted novelty 6.0

    phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.

  12. PaLM: Scaling Language Modeling with Pathways

    cs.CL 2022-04 accept novelty 6.0

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  13. NVIDIA Nemotron 3: Efficient and Open Intelligence

    cs.CL 2025-12 unverdicted novelty 5.0

    NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

  14. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    cs.CL 2025-03 unverdicted novelty 5.0

    Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

  15. Gemma: Open Models Based on Gemini Research and Technology

    cs.CL 2024-03 accept novelty 4.0

    Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

  16. Yi: Open Foundation Models by 01.AI

    cs.CL 2024-03 unverdicted novelty 4.0

    Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

  17. Gemma 2: Improving Open Language Models at a Practical Size

    cs.CL 2024-07 conditional novelty 3.0

    Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.

  18. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · cited by 18 Pith papers · 4 internal anchors

  1. [1]

    CVPR , year =

    Zellers, Rowan and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin , title =. CVPR , year =

  2. [2]

    SocialIQA: Commonsense Reasoning about Social Interactions , booktitle =

    Maarten Sap and Hannah Rashkin and Derek Chen and Ronan. SocialIQA: Commonsense Reasoning about Social Interactions , booktitle =. 2019 , month =

  3. [3]

    AAAI , year=

    WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale , author=. AAAI , year=

  4. [4]

    ACL , year =

    Antoine Bosselut and Hannah Rashkin and Maarten Sap and Chaitanya Malaviya and Asli Celikyilmaz and Yejin Choi , title =. ACL , year =

  5. [5]

    IROS , year =

    Rosario Scalise and Jesse Thomason and Yonatan Bisk and Siddhartha Srinivasa , title =. IROS , year =

  6. [6]

    ICRA , year =

    Angel Daruna and Weiyu Liu and Zsolt Kira and Sonia Chernova , title =. ICRA , year =

  7. [7]

    EMNLP , year =

    Zellers, Rowan and Bisk, Yonatan and Schwartz, Roy and Choi, Yejin , title =. EMNLP , year =

  8. [8]

    ACL , year =

    Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi , title =. ACL , year =

  9. [9]

    EMNLP-IJCNLP , year =

    Mor Geva and Yoav Goldberg and Jonathan Berant , title =. EMNLP-IJCNLP , year =

  10. [10]

    Annotation Artifacts in Natural Language Inference Data

    Gururangan, Suchin and Swayamdipta, Swabha and Levy, Omer and Schwartz, Roy and Bowman, Samuel and Smith, Noah A. Annotation Artifacts in Natural Language Inference Data. NAACL-HLT. 2018

  11. [11]

    Joint Conference on Lexical and Computational Semantics (StarSem) , year =

    Poliak, Adam and Naradowsky, Jason and Haldar, Aparajita and Rudinger, Rachel and. Joint Conference on Lexical and Computational Semantics (StarSem) , year =

  12. [12]

    NAACL-HLT , year =

    Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova , title =. NAACL-HLT , year =

  13. [13]

    2019 , url =

    Alec Radford and Jeffrey Wu and Rewon Child and David Luan and Dario Amodei and Ilya Sutskever , title =. 2019 , url =

  14. [14]

    2018 , url =

    Alec Radford and Karthik Narasimhan and Tim Salimans and Ilya Sutskever , title =. 2018 , url =

  15. [16]

    Tenth International Symposium on Logical Formalizations of Commonsense Reasoning (Commonsense-2011) , year =

    Roemmele, Melissa and Bejan, Cosmin and Gordon, Andrew , title =. Tenth International Symposium on Logical Formalizations of Commonsense Reasoning (Commonsense-2011) , year =

  16. [17]

    NAACL-HLT , year =

    Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonatha and Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan , title =. NAACL-HLT , year =

  17. [18]

    NIPS-W , year=

    Automatic differentiation in PyTorch , author=. NIPS-W , year=

  18. [19]

    AAAI , year =

    Robyn Speer and Joshua Chin and Catherine Havasi , title =. AAAI , year =

  19. [20]

    IJCAI , year =

    Thomason, Jesse and Sinapov, Jivko and Svetlik, Maxwell and Stone, Peter and Mooney, Raymond J , title =. IJCAI , year =

  20. [21]

    2016 , file =

    Carissa Schoenick and Peter Clark and Oyvind Tafjord and Peter Turney and Oren Etzioni , title =. 2016 , file =

  21. [22]

    Miller and Sebastian Riedel , title =

    Fabio Petroni and Tim Rocktäschel and Patrick Lewis and Anton Bakhtin and Yuxiang Wu and Alexander H. Miller and Sebastian Riedel , title =. EMNLP , year =

  22. [23]

    NAACL-HLT , year =

    Yonatan Bisk and Jan Buys and Karl Pichotta and Yejin Choi , title =. NAACL-HLT , year =

  23. [24]

    NeurIPS , editor =

    Learning to See Physics via Visual De-animation , author =. NeurIPS , editor =. 2017 , url =

  24. [25]

    ACL , year =

    Maxwell Forbes and Yejin Choi , title =. ACL , year =

  25. [26]

    ACL , year =

    Yanai Elazar and Abhijit Mahabal and Deepak Ramachandran and Tania Bedrax-Weiss and Dan Roth , title =. ACL , year =

  26. [27]

    2016 , booktitle =

    Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , author=. 2016 , booktitle =

  27. [28]

    CVPR , year=

    Situation Recognition: Visual Semantic Role Labeling for Image Understanding , author=. CVPR , year=

  28. [30]

    ``What Happens If...'' Learning to Predict the Effect of Forces in Images

    Mottaghi, Roozbeh and Rastegari, Mohammad and Gupta, Abhinav and Farhadi, Ali. ``What Happens If...'' Learning to Predict the Effect of Forces in Images. ECCV. 2016

  29. [31]

    2019 , journal =

    Anton Bakhtin and Laurens van der Maaten and Justin Johnson and Laura Gustafson and Ross Girshick , title =. 2019 , journal =

  30. [32]

    AAAI , year =

    Stewart, Russell and Ermon, Stefano , title =. AAAI , year =

  31. [33]

    Sigurdsson and G

    Gunnar A. Sigurdsson and G. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , booktitle=

  32. [34]

    A Short Note about Kinetics-600

    Joao Carreira and Eric Noland and Chloe Hillier and Andrew Zisserman , title =. arXiv:1808.01340 , year =

  33. [35]

    Learning to Poke by Poking: Experiential Learning of Intuitive Physics , year =

    Agrawal, Pulkit and Nair, Ashvin and Abbeel, Pieter and Malik, Jitendra and Levine, Sergey , booktitle =. Learning to Poke by Poking: Experiential Learning of Intuitive Physics , year =

  34. [36]

    RSS , year =

    Toussaint, Marc and Allen, Kelsey R and Smith, Kevin A and Tenenbaum, Joshua B , title =. RSS , year =

  35. [37]

    ICRA , year =

    Byravan, Arunkumar and Leeb, Felix and Meier, Franziska and Fox,Dieter , title =. ICRA , year =

  36. [38]

    ICRA , year =

    Lakshmi Nair and Jonathan Balloch and Sonia Chernova , title =. ICRA , year =

  37. [39]

    ACL , year =

    Gao, Qiaozi and Doering, Malcolm and Yang, Shaohua and Chai, Joyce , title =. ACL , year =

  38. [40]

    Proceedings of the National Conference on Artificial Intelligence , year =

    Tellex, Stefanie and Kollar, Thomas and Dickerson, Steven and Walter, Matthew R and Banerjee, Ashis Gopal and Teller, Seth and Roy, Nicholas , title =. Proceedings of the National Conference on Artificial Intelligence , year =

  39. [41]

    IJCAI , year =

    Matuszek, Cynthia , title =. IJCAI , year =

  40. [42]

    Goldberg, Yoav , journal=

  41. [43]

    EMNLP , pages=

    SQuAD: 100,000+ Questions for Machine Comprehension of Text , author=. EMNLP , pages=

  42. [44]

    and De Meulder, Fien

    Tjong Kim Sang, Erik F. and De Meulder, Fien. Introduction to the C o NLL -2003 Shared Task: Language-Independent Named Entity Recognition. NAACL. 2003

  43. [45]

    Concreteness ratings for 40 thousand generally known English word lemmas , journal =

    Marc Brysbaert and Amy Beth Warriner and Victor Kuperman , year =. Concreteness ratings for 40 thousand generally known English word lemmas , journal =

  44. [46]

    Quantifying the Visual Concreteness of Words and Topics in Multimodal Datasets

    Hessel, Jack and Mimno, David and Lee, Lillian. Quantifying the Visual Concreteness of Words and Topics in Multimodal Datasets. NAACL-HLT. 2018

  45. [47]

    and Spelke, Elizabeth S

    Hespos, Susan J. and Spelke, Elizabeth S. , title =. Nature , volume = 430, pages =

  46. [48]

    Agrawal, P.; Nair, A.; Abbeel, P.; Malik, J.; and Levine, S. 2016. Learning to poke by poking: Experiential learning of intuitive physics. In NeurIPS

  47. [49]

    Bisk, Y.; Buys, J.; Pichotta, K.; and Choi, Y. 2019. Benchmarking hierarchical script knowledge. In NAACL-HLT

  48. [50]

    Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; and Choi, Y. 2019. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction . In ACL

  49. [51]

    B.; and Kuperman, V

    Brysbaert, M.; Warriner, A. B.; and Kuperman, V. 2014. Concreteness ratings for 40 thousand generally known english word lemmas. Behavior Research Methods (46):904--911

  50. [52]

    Byravan, A.; Leeb, F.; Meier, F.; and Fox, D. 2018. Se3-pose-nets: Structured deep dynamics models for visuomotor planning and control. In ICRA

  51. [53]

    Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding . In NAACL-HLT

  52. [54]

    Elazar, Y.; Mahabal, A.; Ramachandran, D.; Bedrax-Weiss, T.; and Roth, D. 2019. How large are lions? inducing distributions over quantitative attributes. In ACL

  53. [55]

    Forbes, M., and Choi, Y. 2017. Verb physics: Relative physical knowledge of actions and objects. In ACL

  54. [56]

    Gao, Q.; Doering, M.; Yang, S.; and Chai, J. 2016. Physical causality of action verbs in grounded language understanding. In ACL , 1814--1824

  55. [57]

    Goldberg, Y. 2019. Assessing BERT's Syntactic Abilities . arXiv:1901.05287

  56. [58]

    Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S.; and Smith, N. A. 2018. Annotation artifacts in natural language inference data. In NAACL-HLT , 107--112

  57. [59]

    J., and Spelke, E

    Hespos, S. J., and Spelke, E. S. 2004. Conceptual precursors to language. Nature 430:453--456

  58. [60]

    Hessel, J.; Mimno, D.; and Lee, L. 2018. Quantifying the visual concreteness of words and topics in multimodal datasets. In NAACL-HLT , 2194--2205

  59. [61]

    Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

    Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; Bernstein, M.; and Fei-Fei, L. 2016. Visual genome: Connecting language and vision using crowdsourced dense image annotations. In arXiv:1602.07332

  60. [62]

    Li, Y.-L.; Xu, L.; Huang, X.; Liu, X.; Ma, Z.; Chen, M.; Wang, S.; Fang, H.-S.; and Lu, C. 2019. Hake: Human activity knowledge engine. arXiv preprint arXiv:1904.06539

  61. [63]

    Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach . arXiv:1907.11692

  62. [64]

    Matuszek, C. 2018. Grounded Language Learning: Where Robotics and NLP Meet . In IJCAI , 5687 -- 5691

  63. [65]

    Mottaghi, R.; Rastegari, M.; Gupta, A.; and Farhadi, A. 2016. ``what happens if...'' learning to predict the effect of forces in images. In Leibe, B.; Matas, J.; Sebe, N.; and Welling, M., eds., ECCV , 269--285

  64. [66]

    Nair, L.; Balloch, J.; and Chernova, S. 2019. Tool Macgyvering: Tool Construction Using Geometric Reasoning . In ICRA

  65. [67]

    H.; and Riedel, S

    Petroni, F.; Rocktäschel, T.; Lewis, P.; Bakhtin, A.; Wu, Y.; Miller, A. H.; and Riedel, S. 2019. Language models as knowledge bases? In EMNLP

  66. [68]

    Poliak, A.; Naradowsky, J.; Haldar, A.; Rudinger, R.; and Van Durme , B. 2018. Hypothesis Only Baselines in Natural Language Inference . In Joint Conference on Lexical and Computational Semantics (StarSem)

  67. [69]

    Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training

  68. [70]

    Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. Squad: 100,000+ questions for machine comprehension of text. In EMNLP , 2383--2392

  69. [71]

    Sakaguchi, K.; Le Bras , R.; Bhagavatula, C.; and Choi, Y. 2020. Winogrande: An adversarial winograd schema challenge at scale. In AAAI

  70. [72]

    Sap, M.; Rashkin, H.; Chen, D.; Le Bras , R.; and Choi, Y. 2019. Socialiqa: Commonsense reasoning about social interactions. In EMNLP

  71. [73]

    Schoenick, C.; Clark, P.; Tafjord, O.; Turney, P.; and Etzioni, O. 2016. Moving beyond the turing test with the allen ai science challenge. Communications of the ACM

  72. [74]

    R.; Banerjee, A

    Tellex, S.; Kollar, T.; Dickerson, S.; Walter, M. R.; Banerjee, A. G.; Teller, S.; and Roy, N. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In Proceedings of the National Conference on Artificial Intelligence

  73. [75]

    Thomason, J.; Sinapov, J.; Svetlik, M.; Stone, P.; and Mooney, R. J. 2016. Learning Multi-Modal Grounded Linguistic Semantics by Playing "I Spy" . In IJCAI , 3477--3483

  74. [76]

    F., and De Meulder, F

    Tjong Kim Sang, E. F., and De Meulder, F. 2003. Introduction to the C o NLL -2003 shared task: Language-independent named entity recognition. In NAACL , 142--147

  75. [77]

    R.; Smith, K

    Toussaint, M.; Allen, K. R.; Smith, K. A.; and Tenenbaum, J. B. 2018. Differentiable physics and stable modes for tool-use and manipulation planning. In RSS

  76. [78]

    Wu, J.; Lu, E.; Kohli, P.; Freeman, B.; and Tenenbaum, J. 2017. Learning to see physics via visual de-animation. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., NeurIPS

  77. [79]

    Yatskar, M.; Zettlemoyer, L.; and Farhadi, A. 2016. Situation recognition: Visual semantic role labeling for image understanding. In CVPR

  78. [80]

    Zellers, R.; Bisk, Y.; Schwartz, R.; and Choi, Y. 2018. SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference . In EMNLP

  79. [81]

    Zellers, R.; Bisk, Y.; Farhadi, A.; and Choi, Y. 2019a. From recognition to cognition: Visual commonsense reasoning. In CVPR

  80. [82]

    Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; and Choi, Y. 2019b. HellaSwag: Can a Machine Really Finish Your Sentence? In ACL