pith. machine review for the scientific record.

arxiv: 1712.05474 · v4 · submitted 2017-12-14 · 💻 cs.CV · cs.AI · cs.LG

Recognition: no theorem link

AI2-THOR: An Interactive 3D Environment for Visual AI

Abhinav Gupta, Ali Farhadi, Alvaro Herrasti, Aniruddha Kembhavi, Daniel Gordon, Eli VanderBilt, Eric Kolve, Kiana Ehsani, Luca Weihs, Matt Deitke, Roozbeh Mottaghi, Winson Han, Yuke Zhu

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:20 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords AI2-THOR · 3D indoor scenes · visual AI · embodied agents · reinforcement learning · interactive simulation · object manipulation

The pith

AI2-THOR supplies near photo-realistic 3D indoor scenes where AI agents navigate and interact with objects to complete tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AI2-THOR as a new framework designed to advance visual AI research. It supplies a collection of near photo-realistic 3D indoor scenes in which agents can move through rooms and manipulate objects while performing tasks. This environment is intended to support work across reinforcement learning, imitation learning, planning, visual question answering, and several other domains. The authors position the framework as a tool that can help close the gap between simulated training and real-world visual intelligence.
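The interaction model such a framework exposes — an agent issuing discrete actions and receiving observations until a task completes — can be sketched with a toy stand-in. The scene layout, action names, and goal below are invented for illustration; this is plain Python with no AI2-THOR dependency, not the framework's actual API.

```python
# Toy stand-in for an interactive-environment loop: the agent issues discrete
# actions and the environment returns an observation plus a task-success flag.
# All names here (scene, actions, goal) are illustrative, not AI2-THOR's API.

class ToyKitchen:
    """A 1-D 'room': the agent starts at cell 0, the fridge sits at cell 3."""
    def __init__(self):
        self.agent, self.fridge, self.fridge_open = 0, 3, False

    def step(self, action):
        if action == "MoveAhead":
            self.agent = min(self.agent + 1, self.fridge)
        elif action == "OpenObject" and self.agent == self.fridge:
            self.fridge_open = True
        # Observation: agent position and whether the task (open fridge) is done.
        return {"agent": self.agent, "done": self.fridge_open}

env = ToyKitchen()
obs = {"agent": 0, "done": False}
trajectory = []
# Scripted policy: walk to the fridge, then open it.
while not obs["done"]:
    action = "MoveAhead" if obs["agent"] < 3 else "OpenObject"
    trajectory.append(action)
    obs = env.step(action)

print(trajectory)  # ['MoveAhead', 'MoveAhead', 'MoveAhead', 'OpenObject']
```

In AI2-THOR itself, the analogous loop runs through a Python controller driving a Unity renderer, with a far richer action space (navigation, pickup, open/close, state changes) and photo-realistic observations in place of the integer position above.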

Core claim

AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks. The goal of AI2-THOR is to facilitate building visually intelligent models and push the research forward in this domain.

What carries the argument

The AI2-THOR framework of interactive 3D indoor scenes that allow agent navigation and object manipulation.

Load-bearing premise

That the simulated interactions and near photo-realistic visuals are representative enough of real-world conditions to support transferable learning, and that the research community will widely adopt and extend the framework.

What would settle it

Train a policy in AI2-THOR on a navigation or object-interaction task and test the same policy on equivalent tasks in a physical room; substantially lower success rates would indicate limited transfer.
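The settling experiment reduces to comparing per-episode success rates for the same policy in the two settings. A minimal sketch of that bookkeeping follows; the episode outcomes are fabricated for illustration, and a real study would log them per trial on matched tasks.

```python
# Sketch of the proposed settling experiment: run one policy on matched
# episodes in simulation and in a physical room, then compare success rates.
# Outcome lists are fabricated (1 = episode succeeded, 0 = failed).

def success_rate(outcomes):
    """Fraction of successful episodes."""
    return sum(outcomes) / len(outcomes)

sim_outcomes  = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # 10 episodes in simulation
real_outcomes = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # same policy, physical room

gap = success_rate(sim_outcomes) - success_rate(real_outcomes)
print(f"sim={success_rate(sim_outcomes):.0%} "
      f"real={success_rate(real_outcomes):.0%} gap={gap:.0%}")
# A large positive gap would indicate limited sim-to-real transfer.
```

With more than a handful of episodes per condition, the comparison should also carry confidence intervals on each rate, since a 10-episode gap like the one above is within sampling noise.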

read the original abstract

We introduce The House Of inteRactions (THOR), a framework for visual AI research, available at http://ai2thor.allenai.org. AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks. AI2-THOR enables research in many different domains including but not limited to deep reinforcement learning, imitation learning, learning by interaction, planning, visual question answering, unsupervised representation learning, object detection and segmentation, and learning models of cognition. The goal of AI2-THOR is to facilitate building visually intelligent models and push the research forward in this domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces AI2-THOR (The House Of inteRactions), a publicly available framework for visual AI research consisting of near photo-realistic 3D indoor scenes. Agents can navigate these scenes and interact with objects to perform tasks. The framework is presented as a platform to support research across domains including deep reinforcement learning, imitation learning, planning, visual question answering, unsupervised representation learning, object detection and segmentation, and models of cognition.

Significance. If the described environment functions as outlined, the work offers a useful, open-source simulation platform that advances embodied visual AI by moving beyond static datasets toward interactive 3D settings. The explicit public release supports reproducibility and community extensions, which are concrete strengths for a systems-style contribution in this area.

minor comments (3)
  1. The abstract and title use an unconventional capitalization in the acronym expansion ('inteRactions'); a consistent typographic treatment would improve readability.
  2. The manuscript would benefit from a dedicated section or appendix providing a minimal usage example (e.g., agent action API calls or scene loading code) to lower the barrier for new users.
  3. Related-work discussion could more explicitly contrast the interaction fidelity and scene variety against contemporaneous simulators such as AI2-THOR's predecessors or other 3D environments referenced in the field.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending acceptance. The referee's summary correctly identifies AI2-THOR as a publicly available framework providing near photo-realistic 3D indoor environments that support navigation and object interaction for a range of visual AI tasks.

Circularity Check

0 steps flagged

No significant circularity; the paper introduces a new simulation framework without derivations or predictions.

full rationale

The paper's central contribution is the direct presentation of the AI2-THOR framework consisting of near photo-realistic 3D indoor scenes supporting navigation and object interactions. No equations, fitted parameters, predictions, or derivation chains appear in the abstract or described content. The work is self-contained as a platform introduction rather than a solved transfer problem or model derivation, so no steps reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a framework introduction and does not rely on any free parameters, mathematical axioms, or invented scientific entities. The simulated scenes and objects are software constructs rather than postulated physical entities with independent evidence.

pith-pipeline@v0.9.0 · 5462 in / 1066 out tokens · 56256 ms · 2026-05-12T05:20:48.177440+00:00 · methodology

discussion (0)


Forward citations

Cited by 42 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 accept novelty 8.0

    SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

  2. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 unverdicted novelty 8.0

    SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

  3. Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

    cs.CV 2021-09 accept novelty 8.0

    HM3D offers 1000 building-scale 3D environments that are larger and higher-fidelity than existing datasets, enabling better-performing embodied AI agents for tasks like PointGoal navigation.

  4. Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    Ego2World turns real egocentric cooking videos into hidden symbolic world graphs for evaluating belief-state planning and memory in embodied agents.

  5. Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    VeGAS improves MLLM-based embodied agents by sampling action ensembles and using a verifier trained on LLM-synthesized failure cases, yielding up to 36% relative gains on hard multi-object long-horizon tasks in Habita...

  6. 3D-Belief: Embodied Belief Inference via Generative 3D World Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.

  7. EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    EnactToM benchmark reveals frontier AI models achieve 0% on functional Theory of Mind task completion in embodied multi-agent settings despite 45% average on literal belief probes.

  8. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    VIGIL decouples world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps in B for models with similar W across 20 systems on 1000 episodes.

  9. MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

    cs.RO 2026-05 unverdicted novelty 7.0

    MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...

  10. ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue

    cs.RO 2026-05 unverdicted novelty 7.0

    ESARBench is the first unified benchmark for MLLM-driven UAV agents that must explore, locate clues, and decide on victim positions in photorealistic simulated SAR environments.

  11. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 accept novelty 7.0

    3D generation for embodied AI is shifting from visual realism toward interaction readiness, organized into data generation, simulation environments, and sim-to-real bridging roles.

  12. KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

    cs.RO 2026-04 unverdicted novelty 7.0

    KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.

  13. SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

    cs.CV 2026-04 unverdicted novelty 7.0

    SpaMEM benchmark shows multimodal LLMs succeed at spatial tasks with text histories but sharply fail at long-horizon belief maintenance from raw visual streams alone.

  14. Exploring Spatial Intelligence from a Generative Perspective

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

  15. ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

    cs.AI 2026-04 unverdicted novelty 7.0

    ADAPT augments planners with affordance reasoning to raise task success in environments with unspecified and time-varying object affordances, and a LoRA-finetuned VLM backend beats GPT-4o on the new DynAfford benchmark.

  16. EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates

    cs.CV 2026-04 unverdicted novelty 7.0

    EgoFun3D creates a new task, 271-video dataset, and pipeline using function templates to model interactive 3D objects from egocentric videos for simulation.

  17. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  18. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  19. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    VIGIL separates world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps between models with similar execution on 1000 episodes across 20 systems.

  20. How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

    cs.CR 2026-05 unverdicted novelty 6.0

    VLMs show consistent deficits in identifying sensitive items in cluttered scenes, adapting to social contexts, and resolving conflicts between commands and privacy constraints in a new physical simulator benchmark.

  21. How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

    cs.CR 2026-05 unverdicted novelty 6.0

    Vision-language models exhibit perceptual fragility and fail to consistently respect privacy constraints when operating in simulated physical environments, with performance declining in cluttered scenes and under conf...

  22. Assistance Without Interruption: A Benchmark and LLM-based Framework for Non-Intrusive Human-Robot Assistance

    cs.RO 2026-05 unverdicted novelty 6.0

    The work creates NIABench and an LLM-plus-scoring-model framework that enables robots to deliver proactive assistance during human multi-step activities while avoiding interruptions and reducing human effort.

  23. SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.

  24. ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation

    cs.CV 2026-04 unverdicted novelty 6.0

    ESCAPE combines spatio-temporal fusion mapping for depth-free 3D memory with a memory-driven grounding module and adaptive execution policy to reach 65.09% success on ALFRED test-seen long-horizon mobile manipulation tasks.

  25. Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting

    cs.RO 2026-04 unverdicted novelty 6.0

    Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.

  26. Scalable Trajectory Generation for Whole-Body Mobile Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    AutoMoMa unifies AKR kinematic modeling with parallel trajectory optimization to produce 500k+ valid coordinated trajectories across 330 scenes and multiple robot embodiments, 80x faster than prior CPU methods.

  27. HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation

    cs.AI 2026-04 unverdicted novelty 6.0

    HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.

  28. RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    cs.RO 2024-06 unverdicted novelty 6.0

    RoboCasa supplies a large-scale kitchen simulator, generative assets, 100 tasks, and automated data pipelines that produce a clear scaling trend in imitation learning for generalist robots.

  29. On Evaluation of Embodied Navigation Agents

    cs.AI 2018-07 accept novelty 6.0

    Consensus recommendations for standardized evaluation measures, problem statements, and benchmarking scenarios in embodied navigation research.

  30. Cross-Modal Navigation with Multi-Agent Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    CRONA is a MARL framework that uses modality-specialized agents with auxiliary beliefs and a centralized multi-modal critic to achieve better performance and efficiency than single-agent baselines on visual-acoustic n...

  31. ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

    cs.AI 2026-04 unverdicted novelty 5.0

    ReCAPA adds predictive correction and multi-level semantic alignment to VLA models, plus two new metrics for tracking error spread and recovery, yielding competitive benchmark results over LLM baselines.

  32. ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

    cs.AI 2026-04 unverdicted novelty 5.0

    ReCAPA uses multi-level predictive correction and semantic alignment modules to reduce cascading failures in VLA systems, with new metrics for tracking error propagation and recovery on embodied benchmarks.

  33. Environmental Understanding Vision-Language Model for Embodied Agent

    cs.CV 2026-04 unverdicted novelty 5.0

    EUEA fine-tunes VLMs on object perception, task planning, action understanding and goal recognition, with recovery and GRPO, to raise ALFRED success rates by 11.89% over behavior cloning.

  34. EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development

    cs.RO 2026-04 unverdicted novelty 5.0

    EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.

  35. RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

    cs.RO 2026-04 unverdicted novelty 5.0

    RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.

  36. Pre-Execution Safety Gate & Task Safety Contracts for LLM-Controlled Robot Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    SafeGate adds a deterministic pre-execution gate and runtime contracts with Z3 SMT solving to block unsafe LLM commands for robots while passing safe ones.

  37. From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments

    cs.AI 2026-03 unverdicted novelty 5.0

    An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.

  38. robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

    cs.RO 2020-09 unverdicted novelty 5.0

    The paper presents robosuite v1.5, a MuJoCo-based modular simulation framework with benchmark environments for reproducible robot learning research.

  39. Leveraging VR Robot Games to Facilitate Data Collection for Embodied Intelligence Tasks

    cs.RO 2026-04 unverdicted novelty 4.0

    A VR gamified data collection system in Unity for humanoid robots demonstrates broad state-action coverage in pick-and-place tasks, with higher difficulty increasing motion intensity and workspace exploration.

  40. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

  41. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    The survey organizes 3D generation for embodied AI into data generators for assets, simulation environments for interaction, and sim-to-real bridges, noting a shift toward interaction readiness and listing bottlenecks...

  42. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 unverdicted novelty 2.0

    The paper surveys 3D generation techniques for embodied AI and robotics, categorizing them into data generation, simulation environments, and sim-to-real bridging while identifying bottlenecks in physical validity and...

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 36 Pith papers

  1. [1]

    Robothor: An open simulation-to-real embodied ai platform

    Matt Deitke, Winson Han, Alvaro Herrasti, Aniruddha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, Luca Weihs, Mark Yatskar, and Ali Farhadi. Robothor: An open simulation-to-real embodied ai platform. In CVPR, 2020. 2, 3, 6, 7

  2. [2]

    Procthor: Large-scale embodied ai using procedural generation

    Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Jordi Salvador, Kiana Ehsani, Winson Han, Eric Kolve, Ali Farhadi, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation. arXiv, 2022. 2, 3, 6, 7, 8, 12

  3. [3]

    Learning object relation graph and tentative policy for visual navigation

    Heming Du, Xin Yu, and Liang Zheng. Learning object relation graph and tentative policy for visual navigation. In ECCV, 2020. 6

  4. [4]

    What do navigation agents learn about their environment?

    Kshitij Dwivedi, Gemma Roig, Aniruddha Kembhavi, and Roozbeh Mottaghi. What do navigation agents learn about their environment? In CVPR, 2022. 7, 8

  5. [5]

    Manipulathor: A framework for visual object manipulation

    Kiana Ehsani, Winson Han, Alvaro Herrasti, Eli VanderBilt, Luca Weihs, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Manipulathor: A framework for visual object manipulation. In CVPR, 2021. 3, 8

  6. [6]

    Segan: Segmenting and generating the invisible

    Kiana Ehsani, Roozbeh Mottaghi, and Ali Farhadi. Segan: Segmenting and generating the invisible. In CVPR, 2018. 8

  7. [7]

    Threedworld: A platform for interactive multi-modal physical simulation

    Chuang Gan, Jeremy Schwartz, Seth Alter, Martin Schrimpf, James Traer, Julian De Freitas, Jonas Kubilius, Abhishek Bhandwaldar, Nick Haber, Megumi Sano, et al. Threedworld: A platform for interactive multi-modal physical simulation. In Neural Information Processing Systems Datasets and Benchmarks Track (Round 1) , 2020. 2, 8

  8. [8]

    Look, listen, and act: Towards audio-visual embodied navigation

    Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, and Joshua B Tenenbaum. Look, listen, and act: Towards audio-visual embodied navigation. In ICRA, 2020. 6, 7

  9. [9]

    Dialfred: Dialogue-enabled agents for embodied instruction following

    Xiaofeng Gao, Qiaozi Gao, Ran Gong, Kaixiang Lin, Govind Thattai, and Gaurav S Sukhatme. Dialfred: Dialogue-enabled agents for embodied instruction following. IEEE Robotics and Automation Letters, 2022. 6

  10. [10]

    Iqa: Visual question answering in interactive environments

    Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali Farhadi. Iqa: Visual question answering in interactive environments. In CVPR, 2018. 6

  11. [11]

    A cordial sync: Going beyond marginal policies for multi-agent embodied tasks

    Unnat Jain, Luca Weihs, Eric Kolve, Ali Farhadi, Svetlana Lazebnik, Aniruddha Kembhavi, and Alexander G. Schwing. A cordial sync: Going beyond marginal policies for multi-agent embodied tasks. In ECCV, 2020. 6

  12. [12]

    Two body problem: Collaborative visual task completion

    Unnat Jain, Luca Weihs, Eric Kolve, Mohammad Rastegari, Svetlana Lazebnik, Ali Farhadi, Alexander G. Schwing, and Aniruddha Kembhavi. Two body problem: Collaborative visual task completion. In CVPR, 2019. 6, 7

  13. [13]

    Learning adaptive language interfaces through decomposition

    Siddharth Karamcheti, Dorsa Sadigh, and Percy Liang. Learning adaptive language interfaces through decomposition. arXiv, 2020. 6

  14. [14]

    The design of stretch: A compact, lightweight mobile manipulator for indoor human environments

    Charles C Kemp, Aaron Edsinger, Henry M Clever, and Blaine Matulevich. The design of stretch: A compact, lightweight mobile manipulator for indoor human environments. In ICRA, 2022. 3

  15. [15]

    Simple but effective: Clip embeddings for embodied ai

    Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, and Aniruddha Kembhavi. Simple but effective: Clip embeddings for embodied ai. In CVPR, 2022. 6

  16. [16]

    Contrasting contrastive self- supervised representation learning pipelines

    Klemen Kotar, Gabriel Ilharco, Ludwig Schmidt, Kiana Ehsani, and Roozbeh Mottaghi. Contrasting contrastive self- supervised representation learning pipelines. In ICCV, 2021. 8

  17. [17]

    Interactron: Embodied adaptive object detection

    Klemen Kotar and Roozbeh Mottaghi. Interactron: Embodied adaptive object detection. In CVPR, 2022. 7, 8

  18. [18]

    igibson 2.0: Object-centric simulation for robot learning of everyday household tasks

    Chengshu Li, Fei Xia, Roberto Martín-Martín, Michael Lingelbach, Sanjana Srivastava, Bokui Shen, Kent Elliott Vainio, Cem Gokmen, Gokul Dharan, Tanish Jain, Andrey Kurenkov, Karen Liu, Hyowon Gweon, Jiajun Wu, Li Fei-Fei, and Silvio Savarese. igibson 2.0: Object-centric simulation for robot learning of everyday household tasks. In CoRL, 2021. 2, 8

  19. [19]

    Ifr-explore: Learning inter-object functional relationships in 3d indoor scenes

    Qi Li, Kaichun Mo, Yanchao Yang, Hang Zhao, and Leonidas Guibas. Ifr-explore: Learning inter-object functional relationships in 3d indoor scenes. In ICLR, 2022. 7

  20. [20]

    Multi-agent embodied visual semantic navigation with scene prior knowledge

    Xinzhu Liu, Di Guo, Huaping Liu, and Fuchun Sun. Multi-agent embodied visual semantic navigation with scene prior knowledge. IEEE Robotics and Automation Letters , 2022. 6

  21. [21]

    Learning about objects by learning to interact with them

    Martin Lohmann, Jordi Salvador, Aniruddha Kembhavi, and Roozbeh Mottaghi. Learning about objects by learning to interact with them. In NeurIPS, 2020. 8

  22. [22]

    Mgrl: Graph neural network based inference in a markov network with reinforcement learning for visual navigation

    Yi Lu, Yaran Chen, Dongbin Zhao, and Dong Li. Mgrl: Graph neural network based inference in a markov network with reinforcement learning for visual navigation. Neurocomputing, 2021. 6

  23. [23]

    Film: Following instructions in language with modular methods

    So Yeon Min, Devendra Singh Chaplot, Pradeep Ravikumar, Yonatan Bisk, and Ruslan Salakhutdinov. Film: Following instructions in language with modular methods. In ICLR, 2022. 6

  24. [24]

    Pyrobot: An open-source robotics framework for research and benchmarking

    Adithyavairavan Murali, Tao Chen, Kalyan Vasudev Alwala, Dhiraj Gandhi, Lerrel Pinto, Saurabh Gupta, and Abhinav Gupta. Pyrobot: An open-source robotics framework for research and benchmarking. arXiv, 2019. 3

  25. [25]

    Learning affordance landscapes for interaction exploration in 3d environments

    Tushar Nagarajan and Kristen Grauman. Learning affordance landscapes for interaction exploration in 3d environments. In NeurIPS, 2020. 7

  26. [26]

    Shaping embodied agent behavior with activity-context priors from egocentric video

    Tushar Nagarajan and Kristen Grauman. Shaping embodied agent behavior with activity-context priors from egocentric video. In NeurIPS, 2021. 7

  27. [27]

    Teach: Task-driven embodied agents that chat

    Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. Teach: Task-driven embodied agents that chat. In AAAI,

  28. [28]

    Episodic transformer for vision-and-language navigation

    Alexander Pashevich, Cordelia Schmid, and Chen Sun. Episodic transformer for vision-and-language navigation. In ICCV, 2021. 6 9

  29. [29]

    Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI

    Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI. In Neural Information Processing Systems Dataset...

  30. [30]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. In ICCV,

  31. [31]

    Alfred: A benchmark for interpreting grounded instructions for everyday tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In CVPR, 2020. 6, 7

  32. [32]

    Habitat 2.0: Training home assistants to rearrange their habitat

    Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir V ondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel X. Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Traini...

  33. [33]

    Multi-agent embodied question answering in interactive environments

    Sinan Tan, Weilai Xiang, Huaping Liu, Di Guo, and Fuchun Sun. Multi-agent embodied question answering in interactive environments. In ECCV, 2020. 6

  34. [34]

    Visual room rearrangement

    Luca Weihs, Matt Deitke, Aniruddha Kembhavi, and Roozbeh Mottaghi. Visual room rearrangement. In CVPR, 2021. 7, 8

  35. [35]

    Learning generalizable visual representations via interactive gameplay

    Luca Weihs, Aniruddha Kembhavi, Kiana Ehsani, Sarah M Pratt, Winson Han, Alvaro Herrasti, Eric Kolve, Dustin Schwenk, Roozbeh Mottaghi, and Ali Farhadi. Learning generalizable visual representations via interactive gameplay. In ICLR, 2021. 6, 8

  36. [36]

    Allenact: A framework for embodied AI research

    Luca Weihs, Jordi Salvador, Klemen Kotar, Unnat Jain, Kuo-Hao Zeng, Roozbeh Mottaghi, and Aniruddha Kembhavi. Allenact: A framework for embodied AI research. arXiv, 2020. 11

  37. [37]

    Learning to learn how to learn: Self-adaptive visual navigation using meta-learning

    Mitchell Wortsman, Kiana Ehsani, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Learning to learn how to learn: Self-adaptive visual navigation using meta-learning. In CVPR, 2019. 6

  38. [38]

    Communicative learning with natural gestures for embodied navigation agents with human-in-the-scene

    Qi Wu, Cheng-Ju Wu, Yixin Zhu, and Jungseock Joo. Communicative learning with natural gestures for embodied navigation agents with human-in-the-scene. In IROS, 2021. 6, 7

  39. [39]

    Sapien: A simulated part-based interactive environment

    Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, and Hao Su. Sapien: A simulated part-based interactive environment. In CVPR, 2020. 8

  40. [40]

    Visual semantic navigation using scene priors

    Wei Yang, Xiaolong Wang, Ali Farhadi, Abhinav Gupta, and Roozbeh Mottaghi. Visual semantic navigation using scene priors. In ICLR, 2019. 6

  41. [41]

    Piglet: Language grounding through neuro-symbolic interaction in a 3d world

    Rowan Zellers, Ari Holtzman, Matthew E. Peters, Roozbeh Mottaghi, Aniruddha Kembhavi, Ali Farhadi, and Yejin Choi. Piglet: Language grounding through neuro-symbolic interaction in a 3d world. In ACL, 2021. 6

  42. [42]

    Visual reaction: Learning to play catch with your drone

    Kuo-Hao Zeng, Roozbeh Mottaghi, Luca Weihs, and Ali Farhadi. Visual reaction: Learning to play catch with your drone. In CVPR, 2020. 3

  43. [43]

    Luminous: Indoor scene generation for embodied ai challenges

    Yizhou Zhao, Kaixiang Lin, Zhiwei Jia, Qiaozi Gao, Govind Thattai, Jesse Thomason, and Gaurav S Sukhatme. Luminous: Indoor scene generation for embodied ai challenges. arXiv, 2021. 8

  44. [44]

    Towards optimal correlational object search

    Kaiyu Zheng, Rohan Chitnis, Yoonchang Sung, George Konidaris, and Stefanie Tellex. Towards optimal correlational object search. In ICRA, 2022. 6

  45. [45]

    Target-driven visual navigation in indoor scenes using deep reinforcement learning

    Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, 2017. 6, 7, 11

  46. [46]

    AI2-THOR supports many different types of agents, including the ManipulaTHOR, Abstract, and LoCoBot agents

    Different simulators support different agents, each with their own action spaces and capabilities, with little standardization across simulators. AI2-THOR supports many different types of agents, including the ManipulaTHOR, Abstract, and LoCoBot agents. The ManipulaTHOR agent is often slower to simulate than a navigation-only LoCoBot agent as it is more...

  47. [47]

    AI2-THOR

    Some simulators are relatively slow when run on a single process but can be easily parallelized with many processes running on a single GPU, e.g. AI2-THOR. Thus single-process simulation speeds may be highly deceptive as they do not capture the ease of scalability

  48. [48]

    These factors include: (a) Model forward pass when computing agent rollouts

    When training agents via reinforcement learning, there are a large number of factors that bottleneck training speed and so the value of raw simulator speed is substantially reduced. These factors include: (a) Model forward pass when computing agent rollouts. (b) Model backward pass when computing gradients for RL losses. (c) Environment resets - for many ...