pith. machine review for the scientific record.

arxiv: 1712.05474 · v4 · submitted 2017-12-14 · 💻 cs.CV · cs.AI · cs.LG

Recognition: no theorem link

AI2-THOR: An Interactive 3D Environment for Visual AI

Abhinav Gupta, Ali Farhadi, Alvaro Herrasti, Aniruddha Kembhavi, Daniel Gordon, Eli VanderBilt, Eric Kolve, Kiana Ehsani, Luca Weihs, Matt Deitke, Roozbeh Mottaghi, Winson Han, Yuke Zhu

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:20 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords AI2-THOR · 3D indoor scenes · visual AI · embodied agents · reinforcement learning · interactive simulation · object manipulation

The pith

AI2-THOR supplies near photo-realistic 3D indoor scenes where AI agents navigate and interact with objects to complete tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AI2-THOR as a new framework designed to advance visual AI research. It supplies a collection of near photo-realistic 3D indoor scenes in which agents can move through rooms and manipulate objects while performing tasks. This environment is intended to support work across reinforcement learning, imitation learning, planning, visual question answering, and several other domains. The authors position the framework as a tool that can help close the gap between simulated training and real-world visual intelligence.
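The interaction model such a framework exposes — an agent issuing discrete actions and receiving observations until a task completes — can be sketched with a toy stand-in. The scene layout, action names, and goal below are invented for illustration; this is plain Python with no AI2-THOR dependency, not the framework's actual API.

```python
# Toy stand-in for an interactive-environment loop: the agent issues discrete
# actions and the environment returns an observation plus a task-success flag.
# All names here (scene, actions, goal) are illustrative, not AI2-THOR's API.

class ToyKitchen:
    """A 1-D 'room': the agent starts at cell 0, the fridge sits at cell 3."""
    def __init__(self):
        self.agent, self.fridge, self.fridge_open = 0, 3, False

    def step(self, action):
        if action == "MoveAhead":
            self.agent = min(self.agent + 1, self.fridge)
        elif action == "OpenObject" and self.agent == self.fridge:
            self.fridge_open = True
        # Observation: agent position and whether the task (open fridge) is done.
        return {"agent": self.agent, "done": self.fridge_open}

env = ToyKitchen()
obs = {"agent": 0, "done": False}
trajectory = []
# Scripted policy: walk to the fridge, then open it.
while not obs["done"]:
    action = "MoveAhead" if obs["agent"] < 3 else "OpenObject"
    trajectory.append(action)
    obs = env.step(action)

print(trajectory)  # ['MoveAhead', 'MoveAhead', 'MoveAhead', 'OpenObject']
```

In AI2-THOR itself, the analogous loop runs through a Python controller driving a Unity renderer, with a far richer action space (navigation, pickup, open/close, state changes) and photo-realistic observations in place of the integer position above.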

Core claim

AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks. The goal of AI2-THOR is to facilitate building visually intelligent models and push the research forward in this domain.

What carries the argument

The AI2-THOR framework of interactive 3D indoor scenes that allow agent navigation and object manipulation.

Load-bearing premise

That the simulated interactions and near photo-realistic visuals are representative enough of real-world conditions to support transferable learning, and that the research community will widely adopt and extend the framework.

What would settle it

Train a policy in AI2-THOR on a navigation or object-interaction task and test the same policy on equivalent tasks in a physical room; substantially lower success rates would indicate limited transfer.
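The settling experiment reduces to comparing per-episode success rates for the same policy in the two settings. A minimal sketch of that bookkeeping follows; the episode outcomes are fabricated for illustration, and a real study would log them per trial on matched tasks.

```python
# Sketch of the proposed settling experiment: run one policy on matched
# episodes in simulation and in a physical room, then compare success rates.
# Outcome lists are fabricated (1 = episode succeeded, 0 = failed).

def success_rate(outcomes):
    """Fraction of successful episodes."""
    return sum(outcomes) / len(outcomes)

sim_outcomes  = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # 10 episodes in simulation
real_outcomes = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # same policy, physical room

gap = success_rate(sim_outcomes) - success_rate(real_outcomes)
print(f"sim={success_rate(sim_outcomes):.0%} "
      f"real={success_rate(real_outcomes):.0%} gap={gap:.0%}")
# A large positive gap would indicate limited sim-to-real transfer.
```

With more than a handful of episodes per condition, the comparison should also carry confidence intervals on each rate, since a 10-episode gap like the one above is within sampling noise.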

read the original abstract

We introduce The House Of inteRactions (THOR), a framework for visual AI research, available at http://ai2thor.allenai.org. AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks. AI2-THOR enables research in many different domains including but not limited to deep reinforcement learning, imitation learning, learning by interaction, planning, visual question answering, unsupervised representation learning, object detection and segmentation, and learning models of cognition. The goal of AI2-THOR is to facilitate building visually intelligent models and push the research forward in this domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces AI2-THOR (The House Of inteRactions), a publicly available framework for visual AI research consisting of near photo-realistic 3D indoor scenes. Agents can navigate these scenes and interact with objects to perform tasks. The framework is presented as a platform to support research across domains including deep reinforcement learning, imitation learning, planning, visual question answering, unsupervised representation learning, object detection and segmentation, and models of cognition.

Significance. If the described environment functions as outlined, the work offers a useful, open-source simulation platform that advances embodied visual AI by moving beyond static datasets toward interactive 3D settings. The explicit public release supports reproducibility and community extensions, which are concrete strengths for a systems-style contribution in this area.

minor comments (3)
  1. The abstract and title use an unconventional capitalization in the acronym expansion ('inteRactions'); a consistent typographic treatment would improve readability.
  2. The manuscript would benefit from a dedicated section or appendix providing a minimal usage example (e.g., agent action API calls or scene loading code) to lower the barrier for new users.
  3. Related-work discussion could more explicitly contrast the interaction fidelity and scene variety against contemporaneous simulators such as AI2-THOR's predecessors or other 3D environments referenced in the field.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending acceptance. The referee's summary correctly identifies AI2-THOR as a publicly available framework providing near photo-realistic 3D indoor environments that support navigation and object interaction for a range of visual AI tasks.

Circularity Check

0 steps flagged

No significant circularity; the paper introduces a new simulation framework without derivations or predictions.

full rationale

The paper's central contribution is the direct presentation of the AI2-THOR framework consisting of near photo-realistic 3D indoor scenes supporting navigation and object interactions. No equations, fitted parameters, predictions, or derivation chains appear in the abstract or described content. The work is self-contained as a platform introduction rather than a solved transfer problem or model derivation, so no steps reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a framework introduction and does not rely on any free parameters, mathematical axioms, or invented scientific entities. The simulated scenes and objects are software constructs rather than postulated physical entities with independent evidence.

pith-pipeline@v0.9.0 · 5462 in / 1066 out tokens · 56256 ms · 2026-05-12T05:20:48.177440+00:00 · methodology

discussion (0)


Forward citations

Cited by 42 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 accept novelty 8.0

    SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

  2. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 unverdicted novelty 8.0

    SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

  3. Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

    cs.CV 2021-09 accept novelty 8.0

    HM3D offers 1000 building-scale 3D environments that are larger and higher-fidelity than existing datasets, enabling better-performing embodied AI agents for tasks like PointGoal navigation.

  4. Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    Ego2World turns real egocentric cooking videos into hidden symbolic world graphs for evaluating belief-state planning and memory in embodied agents.

  5. Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    VeGAS improves MLLM-based embodied agents by sampling action ensembles and using a verifier trained on LLM-synthesized failure cases, yielding up to 36% relative gains on hard multi-object long-horizon tasks in Habita...

  6. 3D-Belief: Embodied Belief Inference via Generative 3D World Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.

  7. EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    EnactToM benchmark reveals frontier AI models achieve 0% on functional Theory of Mind task completion in embodied multi-agent settings despite 45% average on literal belief probes.

  8. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    VIGIL decouples world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps in B for models with similar W across 20 systems on 1000 episodes.

  9. MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

    cs.RO 2026-05 unverdicted novelty 7.0

    MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...

  10. ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue

    cs.RO 2026-05 unverdicted novelty 7.0

    ESARBench is the first unified benchmark for MLLM-driven UAV agents that must explore, locate clues, and decide on victim positions in photorealistic simulated SAR environments.

  11. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 accept novelty 7.0

    3D generation for embodied AI is shifting from visual realism toward interaction readiness, organized into data generation, simulation environments, and sim-to-real bridging roles.

  12. KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

    cs.RO 2026-04 unverdicted novelty 7.0

    KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.

  13. SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

    cs.CV 2026-04 unverdicted novelty 7.0

    SpaMEM benchmark shows multimodal LLMs succeed at spatial tasks with text histories but sharply fail at long-horizon belief maintenance from raw visual streams alone.

  14. Exploring Spatial Intelligence from a Generative Perspective

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

  15. ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

    cs.AI 2026-04 unverdicted novelty 7.0

    ADAPT augments planners with affordance reasoning to raise task success in environments with unspecified and time-varying object affordances, and a LoRA-finetuned VLM backend beats GPT-4o on the new DynAfford benchmark.

  16. EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates

    cs.CV 2026-04 unverdicted novelty 7.0

    EgoFun3D creates a new task, 271-video dataset, and pipeline using function templates to model interactive 3D objects from egocentric videos for simulation.

  17. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  18. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  19. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    VIGIL separates world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps between models with similar execution on 1000 episodes across 20 systems.

  20. How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

    cs.CR 2026-05 unverdicted novelty 6.0

    VLMs show consistent deficits in identifying sensitive items in cluttered scenes, adapting to social contexts, and resolving conflicts between commands and privacy constraints in a new physical simulator benchmark.

  21. How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

    cs.CR 2026-05 unverdicted novelty 6.0

    Vision-language models exhibit perceptual fragility and fail to consistently respect privacy constraints when operating in simulated physical environments, with performance declining in cluttered scenes and under conf...

  22. Assistance Without Interruption: A Benchmark and LLM-based Framework for Non-Intrusive Human-Robot Assistance

    cs.RO 2026-05 unverdicted novelty 6.0

    The work creates NIABench and an LLM-plus-scoring-model framework that enables robots to deliver proactive assistance during human multi-step activities while avoiding interruptions and reducing human effort.

  23. SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.

  24. ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation

    cs.CV 2026-04 unverdicted novelty 6.0

    ESCAPE combines spatio-temporal fusion mapping for depth-free 3D memory with a memory-driven grounding module and adaptive execution policy to reach 65.09% success on ALFRED test-seen long-horizon mobile manipulation tasks.

  25. Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting

    cs.RO 2026-04 unverdicted novelty 6.0

    Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.

  26. Scalable Trajectory Generation for Whole-Body Mobile Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    AutoMoMa unifies AKR kinematic modeling with parallel trajectory optimization to produce 500k+ valid coordinated trajectories across 330 scenes and multiple robot embodiments, 80x faster than prior CPU methods.

  27. HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation

    cs.AI 2026-04 unverdicted novelty 6.0

    HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.

  28. RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    cs.RO 2024-06 unverdicted novelty 6.0

    RoboCasa supplies a large-scale kitchen simulator, generative assets, 100 tasks, and automated data pipelines that produce a clear scaling trend in imitation learning for generalist robots.

  29. On Evaluation of Embodied Navigation Agents

    cs.AI 2018-07 accept novelty 6.0

    Consensus recommendations for standardized evaluation measures, problem statements, and benchmarking scenarios in embodied navigation research.

  30. Cross-Modal Navigation with Multi-Agent Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    CRONA is a MARL framework that uses modality-specialized agents with auxiliary beliefs and a centralized multi-modal critic to achieve better performance and efficiency than single-agent baselines on visual-acoustic n...

  31. ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

    cs.AI 2026-04 unverdicted novelty 5.0

    ReCAPA adds predictive correction and multi-level semantic alignment to VLA models, plus two new metrics for tracking error spread and recovery, yielding competitive benchmark results over LLM baselines.

  32. ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

    cs.AI 2026-04 unverdicted novelty 5.0

    ReCAPA uses multi-level predictive correction and semantic alignment modules to reduce cascading failures in VLA systems, with new metrics for tracking error propagation and recovery on embodied benchmarks.

  33. Environmental Understanding Vision-Language Model for Embodied Agent

    cs.CV 2026-04 unverdicted novelty 5.0

    EUEA fine-tunes VLMs on object perception, task planning, action understanding and goal recognition, with recovery and GRPO, to raise ALFRED success rates by 11.89% over behavior cloning.

  34. EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development

    cs.RO 2026-04 unverdicted novelty 5.0

    EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.

  35. RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

    cs.RO 2026-04 unverdicted novelty 5.0

    RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.

  36. Pre-Execution Safety Gate & Task Safety Contracts for LLM-Controlled Robot Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    SafeGate adds a deterministic pre-execution gate and runtime contracts with Z3 SMT solving to block unsafe LLM commands for robots while passing safe ones.

  37. From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments

    cs.AI 2026-03 unverdicted novelty 5.0

    An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.

  38. robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

    cs.RO 2020-09 unverdicted novelty 5.0

    The paper presents robosuite v1.5, a MuJoCo-based modular simulation framework with benchmark environments for reproducible robot learning research.

  39. Leveraging VR Robot Games to Facilitate Data Collection for Embodied Intelligence Tasks

    cs.RO 2026-04 unverdicted novelty 4.0

    A VR gamified data collection system in Unity for humanoid robots demonstrates broad state-action coverage in pick-and-place tasks, with higher difficulty increasing motion intensity and workspace exploration.

  40. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

  41. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    The survey organizes 3D generation for embodied AI into data generators for assets, simulation environments for interaction, and sim-to-real bridges, noting a shift toward interaction readiness and listing bottlenecks...

  42. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 unverdicted novelty 2.0

    The paper surveys 3D generation techniques for embodied AI and robotics, categorizing them into data generation, simulation environments, and sim-to-real bridging while identifying bottlenecks in physical validity and...

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 36 Pith papers

  1. [1]

    Robothor: An open simulation-to-real embodied ai platform

    Matt Deitke, Winson Han, Alvaro Herrasti, Aniruddha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, Luca Weihs, Mark Yatskar, and Ali Farhadi. Robothor: An open simulation-to-real embodied ai platform. In CVPR, 2020. 2, 3, 6, 7

  2. [2]

    Procthor: Large-scale embodied ai using procedural generation

    Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Jordi Salvador, Kiana Ehsani, Winson Han, Eric Kolve, Ali Farhadi, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation. arXiv, 2022. 2, 3, 6, 7, 8, 12

  3. [3]

    Learning object relation graph and tentative policy for visual navigation

    Heming Du, Xin Yu, and Liang Zheng. Learning object relation graph and tentative policy for visual navigation. In ECCV, 2020. 6

  4. [4]

    What do navigation agents learn about their environment?

    Kshitij Dwivedi, Gemma Roig, Aniruddha Kembhavi, and Roozbeh Mottaghi. What do navigation agents learn about their environment? In CVPR, 2022. 7, 8

  5. [5]

    Manipulathor: A framework for visual object manipulation

    Kiana Ehsani, Winson Han, Alvaro Herrasti, Eli VanderBilt, Luca Weihs, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Manipulathor: A framework for visual object manipulation. In CVPR, 2021. 3, 8

  6. [6]

    Segan: Segmenting and generating the invisible

    Kiana Ehsani, Roozbeh Mottaghi, and Ali Farhadi. Segan: Segmenting and generating the invisible. In CVPR, 2018. 8

  7. [7]

    Threedworld: A platform for interactive multi-modal physical simulation

    Chuang Gan, Jeremy Schwartz, Seth Alter, Martin Schrimpf, James Traer, Julian De Freitas, Jonas Kubilius, Abhishek Bhandwaldar, Nick Haber, Megumi Sano, et al. Threedworld: A platform for interactive multi-modal physical simulation. In Neural Information Processing Systems Datasets and Benchmarks Track (Round 1) , 2020. 2, 8

  8. [8]

    Look, listen, and act: Towards audio-visual embodied navigation

    Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, and Joshua B Tenenbaum. Look, listen, and act: Towards audio-visual embodied navigation. In ICRA, 2020. 6, 7

  9. [9]

    Dialfred: Dialogue-enabled agents for embodied instruction following

    Xiaofeng Gao, Qiaozi Gao, Ran Gong, Kaixiang Lin, Govind Thattai, and Gaurav S Sukhatme. Dialfred: Dialogue-enabled agents for embodied instruction following. IEEE Robotics and Automation Letters, 2022. 6

  10. [10]

    Iqa: Visual question answering in interactive environments

    Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali Farhadi. Iqa: Visual question answering in interactive environments. In CVPR, 2018. 6

  11. [11]

    A cordial sync: Going beyond marginal policies for multi-agent embodied tasks

    Unnat Jain, Luca Weihs, Eric Kolve, Ali Farhadi, Svetlana Lazebnik, Aniruddha Kembhavi, and Alexander G. Schwing. A cordial sync: Going beyond marginal policies for multi-agent embodied tasks. In ECCV, 2020. 6

  12. [12]

    Two body problem: Collaborative visual task completion

    Unnat Jain, Luca Weihs, Eric Kolve, Mohammad Rastegari, Svetlana Lazebnik, Ali Farhadi, Alexander G. Schwing, and Aniruddha Kembhavi. Two body problem: Collaborative visual task completion. In CVPR, 2019. 6, 7

  13. [13]

    Learning adaptive language interfaces through decomposition

    Siddharth Karamcheti, Dorsa Sadigh, and Percy Liang. Learning adaptive language interfaces through decomposition. arXiv, 2020. 6

  14. [14]

    The design of stretch: A compact, lightweight mobile manipulator for indoor human environments

    Charles C Kemp, Aaron Edsinger, Henry M Clever, and Blaine Matulevich. The design of stretch: A compact, lightweight mobile manipulator for indoor human environments. In ICRA, 2022. 3

  15. [15]

    Simple but effective: Clip embeddings for embodied ai

    Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, and Aniruddha Kembhavi. Simple but effective: Clip embeddings for embodied ai. In CVPR, 2022. 6

  16. [16]

    Contrasting contrastive self- supervised representation learning pipelines

    Klemen Kotar, Gabriel Ilharco, Ludwig Schmidt, Kiana Ehsani, and Roozbeh Mottaghi. Contrasting contrastive self- supervised representation learning pipelines. In ICCV, 2021. 8

  17. [17]

    Interactron: Embodied adaptive object detection

    Klemen Kotar and Roozbeh Mottaghi. Interactron: Embodied adaptive object detection. In CVPR, 2022. 7, 8

  18. [18]

    igibson 2.0: Object-centric simulation for robot learning of everyday household tasks

    Chengshu Li, Fei Xia, Roberto Martín-Martín, Michael Lingelbach, Sanjana Srivastava, Bokui Shen, Kent Elliott Vainio, Cem Gokmen, Gokul Dharan, Tanish Jain, Andrey Kurenkov, Karen Liu, Hyowon Gweon, Jiajun Wu, Li Fei-Fei, and Silvio Savarese. igibson 2.0: Object-centric simulation for robot learning of everyday household tasks. In CoRL, 2021. 2, 8

  19. [19]

    Ifr-explore: Learning inter-object functional relationships in 3d indoor scenes

    Qi Li, Kaichun Mo, Yanchao Yang, Hang Zhao, and Leonidas Guibas. Ifr-explore: Learning inter-object functional relationships in 3d indoor scenes. In ICLR, 2022. 7

  20. [20]

    Multi-agent embodied visual semantic navigation with scene prior knowledge

    Xinzhu Liu, Di Guo, Huaping Liu, and Fuchun Sun. Multi-agent embodied visual semantic navigation with scene prior knowledge. IEEE Robotics and Automation Letters , 2022. 6

  21. [21]

    Learning about objects by learning to interact with them

    Martin Lohmann, Jordi Salvador, Aniruddha Kembhavi, and Roozbeh Mottaghi. Learning about objects by learning to interact with them. In NeurIPS, 2020. 8

  22. [22]

    Mgrl: Graph neural network based inference in a markov network with reinforcement learning for visual navigation

    Yi Lu, Yaran Chen, Dongbin Zhao, and Dong Li. Mgrl: Graph neural network based inference in a markov network with reinforcement learning for visual navigation. Neurocomputing, 2021. 6

  23. [23]

    Film: Following instructions in language with modular methods

    So Yeon Min, Devendra Singh Chaplot, Pradeep Ravikumar, Yonatan Bisk, and Ruslan Salakhutdinov. Film: Following instructions in language with modular methods. In ICLR, 2022. 6

  24. [24]

    Pyrobot: An open-source robotics framework for research and benchmarking

    Adithyavairavan Murali, Tao Chen, Kalyan Vasudev Alwala, Dhiraj Gandhi, Lerrel Pinto, Saurabh Gupta, and Abhinav Gupta. Pyrobot: An open-source robotics framework for research and benchmarking. arXiv, 2019. 3

  25. [25]

    Learning affordance landscapes for interaction exploration in 3d environments

    Tushar Nagarajan and Kristen Grauman. Learning affordance landscapes for interaction exploration in 3d environments. In NeurIPS, 2020. 7

  26. [26]

    Shaping embodied agent behavior with activity-context priors from egocentric video

    Tushar Nagarajan and Kristen Grauman. Shaping embodied agent behavior with activity-context priors from egocentric video. In NeurIPS, 2021. 7

  27. [27]

    Teach: Task-driven embodied agents that chat

    Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. Teach: Task-driven embodied agents that chat. In AAAI,

  28. [28]

    Episodic transformer for vision-and-language navigation

    Alexander Pashevich, Cordelia Schmid, and Chen Sun. Episodic transformer for vision-and-language navigation. In ICCV, 2021. 6 9

  29. [29]

    Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI

    Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI. In Neural Information Processing Systems Dataset...

  30. [30]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. In ICCV,

  31. [31]

    Alfred: A benchmark for interpreting grounded instructions for everyday tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In CVPR, 2020. 6, 7

  32. [32]

    Habitat 2.0: Training home assistants to rearrange their habitat

    Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir V ondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel X. Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Traini...

  33. [33]

    Multi-agent embodied question answering in interactive environments

    Sinan Tan, Weilai Xiang, Huaping Liu, Di Guo, and Fuchun Sun. Multi-agent embodied question answering in interactive environments. In ECCV, 2020. 6

  34. [34]

    Visual room rearrangement

    Luca Weihs, Matt Deitke, Aniruddha Kembhavi, and Roozbeh Mottaghi. Visual room rearrangement. In CVPR, 2021. 7, 8

  35. [35]

    Learning generalizable visual representations via interactive gameplay

    Luca Weihs, Aniruddha Kembhavi, Kiana Ehsani, Sarah M Pratt, Winson Han, Alvaro Herrasti, Eric Kolve, Dustin Schwenk, Roozbeh Mottaghi, and Ali Farhadi. Learning generalizable visual representations via interactive gameplay. In ICLR, 2021. 6, 8

  36. [36]

    Allenact: A framework for embodied AI research

    Luca Weihs, Jordi Salvador, Klemen Kotar, Unnat Jain, Kuo-Hao Zeng, Roozbeh Mottaghi, and Aniruddha Kembhavi. Allenact: A framework for embodied AI research. arXiv, 2020. 11

  37. [37]

    Learning to learn how to learn: Self-adaptive visual navigation using meta-learning

    Mitchell Wortsman, Kiana Ehsani, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Learning to learn how to learn: Self-adaptive visual navigation using meta-learning. In CVPR, 2019. 6

  38. [38]

    Communicative learning with natural gestures for embodied navigation agents with human-in-the-scene

    Qi Wu, Cheng-Ju Wu, Yixin Zhu, and Jungseock Joo. Communicative learning with natural gestures for embodied navigation agents with human-in-the-scene. In IROS, 2021. 6, 7

  39. [39]

    Sapien: A simulated part-based interactive environment

    Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, and Hao Su. Sapien: A simulated part-based interactive environment. In CVPR, 2020. 8

  40. [40]

    Visual semantic navigation using scene priors

    Wei Yang, Xiaolong Wang, Ali Farhadi, Abhinav Gupta, and Roozbeh Mottaghi. Visual semantic navigation using scene priors. In ICLR, 2019. 6

  41. [41]

    Piglet: Language grounding through neuro-symbolic interaction in a 3d world

    Rowan Zellers, Ari Holtzman, Matthew E. Peters, Roozbeh Mottaghi, Aniruddha Kembhavi, Ali Farhadi, and Yejin Choi. Piglet: Language grounding through neuro-symbolic interaction in a 3d world. In ACL, 2021. 6

  42. [42]

    Visual reaction: Learning to play catch with your drone

    Kuo-Hao Zeng, Roozbeh Mottaghi, Luca Weihs, and Ali Farhadi. Visual reaction: Learning to play catch with your drone. In CVPR, 2020. 3

  43. [43]

    Luminous: Indoor scene generation for embodied ai challenges

    Yizhou Zhao, Kaixiang Lin, Zhiwei Jia, Qiaozi Gao, Govind Thattai, Jesse Thomason, and Gaurav S Sukhatme. Luminous: Indoor scene generation for embodied ai challenges. arXiv, 2021. 8

  44. [44]

    Towards optimal correlational object search

    Kaiyu Zheng, Rohan Chitnis, Yoonchang Sung, George Konidaris, and Stefanie Tellex. Towards optimal correlational object search. In ICRA, 2022. 6

  45. [45]

    Target-driven visual navigation in indoor scenes using deep reinforcement learning

    Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, 2017. 6, 7, 11

  46. [46]

    AI2-THOR supports many different types of agents, including the ManipulaTHOR, Abstract, and LoCoBot agents

    Different simulators support different agents, each with their own action spaces and capabilities, with little standardization across simulators. AI2-THOR supports many different types of agents, including the ManipulaTHOR, Abstract, and LoCoBot agents. The ManipulaTHOR agent is often slower to simulate than a navigation-only LoCoBot agent as it is more...

  47. [47]

    AI2-THOR

    Some simulators are relatively slow when run on a single process but can be easily parallelized with many processes running on a single GPU, e.g. AI2-THOR. Thus single-process simulation speeds may be highly deceptive as they do not capture the ease of scalability

  48. [48]

    These factors include: (a) Model forward pass when computing agent rollouts

    When training agents via reinforcement learning, there are a large number of factors that bottleneck training speed and so the value of raw simulator speed is substantially reduced. These factors include: (a) Model forward pass when computing agent rollouts. (b) Model backward pass when computing gradients for RL losses. (c) Environment resets - for many ...