pith. machine review for the scientific record.

arxiv: 2010.03768 · v2 · submitted 2020-10-08 · 💻 cs.CL · cs.AI · cs.CV · cs.LG · cs.RO

Recognition: 2 theorem links

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Adam Trischler, Marc-Alexandre Côté, Matthew Hausknecht, Mohit Shridhar, Xingdi Yuan, Yonatan Bisk

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 06:33 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV · cs.LG · cs.RO
keywords ALFWorld · embodied agents · TextWorld · ALFRED · generalization · text policies · visual execution · interactive learning

The pith

Text-based policy learning in simulators improves generalization for visual embodied agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Humans plan tasks in abstract terms without visuals, then adapt those plans once they see the environment. Embodied agents need the same split but lack the infrastructure to learn abstract reasoning separately from concrete execution. ALFWorld creates that split by letting agents master text-based policies in TextWorld before transferring them to ALFRED's visual household scenes. The resulting BUTLER agent achieves higher success on new tasks and scenes than agents trained only in the visual setting. Its modular pipeline isolates language understanding, planning, navigation, and visual recognition so each piece can be upgraded independently.

Core claim

ALFWorld aligns the abstract action space of TextWorld with the concrete visual actions of ALFRED, so that policies trained entirely in text transfer directly to visual execution; the BUTLER agent built on this alignment demonstrates higher generalization than agents trained solely inside the visually grounded ALFRED environment.

What carries the argument

The ALFWorld simulator that maps abstract TextWorld policies onto ALFRED's visual action and state representations, enabling the modular BUTLER agent's transfer.
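The alignment can be pictured as a translation layer from abstract text commands to structured actions a visual controller can execute. The command templates, API names, and action schema below are illustrative assumptions, not ALFWorld's actual interface:

```python
# Hypothetical sketch of the kind of text-to-visual alignment ALFWorld
# provides: each high-level text command maps to a structured action a
# visual controller could execute. All names here are stand-ins.

def parse_text_action(command: str) -> dict:
    """Map an abstract text command to a structured, executable action."""
    tokens = command.lower().split()
    if tokens[0] == "goto":                        # e.g. "goto fridge 1"
        return {"api": "NavigateTo", "target": " ".join(tokens[1:])}
    if tokens[0] == "take" and "from" in tokens:   # e.g. "take apple 1 from table 2"
        i = tokens.index("from")
        return {"api": "PickupObject",
                "object": " ".join(tokens[1:i]),
                "receptacle": " ".join(tokens[i + 1:])}
    if tokens[0] == "open":                        # e.g. "open fridge 1"
        return {"api": "OpenObject", "object": " ".join(tokens[1:])}
    raise ValueError(f"unrecognized command: {command!r}")

action = parse_text_action("take apple 1 from table 2")
```

A text-trained policy emits the abstract command; only this translation layer and the downstream controller need to know anything about the visual scene, which is what lets each piece be upgraded independently.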

If this is right

  • Agents reach higher success on multi-step household goals by first mastering abstract plans in text.
  • The modular split lets researchers improve language understanding or visual navigation in isolation.
  • Generalization across unseen room layouts and object placements increases without extra visual samples.
  • The same alignment pattern can be reused to combine other text planners with other visual simulators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pre-training abstract policies in cheap text simulators could cut the data cost of physical robot learning.
  • Large language models could slot directly into the abstract planning slot of the same pipeline.
  • Similar text-to-visual alignments for other benchmarks would let the field test transfer at scale.

Load-bearing premise

Abstract policies learned in TextWorld map directly enough to ALFRED's visual actions and state space that the transfer produces reliable gains without major domain adaptation.

What would settle it

Run identical BUTLER-style agents on the ALFRED test set with and without the TextWorld pre-training phase; if success rates show no consistent advantage for the text-pretrained version, the claimed generalization benefit is refuted.
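That decisive comparison can be sketched in a few lines; the agents and episode outcomes below are stand-ins for illustration, not real BUTLER runs on the ALFRED test set:

```python
# Sketch of the settling experiment: identical held-out episodes, two
# training regimes, compare success rates. Agents are stand-in callables
# returning True on success; real runs would execute BUTLER on ALFRED.

def success_rate(agent, episodes):
    """Fraction of episodes the agent completes successfully."""
    return sum(1 for ep in episodes if agent(ep)) / len(episodes)

def pretraining_gain(text_pretrained, visual_only, episodes):
    """Positive gain supports the transfer claim; ~0 or negative refutes it."""
    return success_rate(text_pretrained, episodes) - success_rate(visual_only, episodes)

# Stand-in agents with fixed per-episode outcomes, for illustration only.
episodes = list(range(100))
text_pretrained = lambda ep: ep % 10 < 4   # hypothetical 40% success
visual_only = lambda ep: ep % 10 < 3       # hypothetical 30% success
gain = pretraining_gain(text_pretrained, visual_only, episodes)  # ≈ 0.10
```

A real version of this test would also need enough evaluation episodes per condition for the gap to clear noise, which is exactly the detail the referee report asks the authors to make explicit.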

read the original abstract

Given a simple request like Put a washed apple in the kitchen fridge, humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does not yet provide the infrastructure necessary for both reasoning abstractly and executing concretely. We address this limitation by introducing ALFWorld, a simulator that enables agents to learn abstract, text-based policies in TextWorld (Côté et al., 2018) and then execute goals from the ALFRED benchmark (Shridhar et al., 2020) in a rich visual environment. ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions. In turn, as we demonstrate empirically, this fosters better agent generalization than training only in the visually grounded environment. BUTLER's simple, modular design factors the problem to allow researchers to focus on models for improving every piece of the pipeline (language understanding, planning, navigation, and visual scene understanding).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ALFWorld, a simulator aligning text-based environments from TextWorld with the visual ALFRED benchmark to enable agents to learn abstract policies in text and transfer them to grounded visual execution. It presents the BUTLER agent, whose modular design separates language understanding, planning, navigation, and visual scene understanding, and reports an empirical demonstration that this text-to-visual transfer yields better generalization than training solely in the visual environment.

Significance. If the empirical transfer results hold under rigorous controls, the work supplies a practical testbed for studying abstract-to-grounded policy transfer in interactive agents. The modular factoring of the pipeline is a clear strength that could accelerate targeted progress on individual subproblems such as planning or visual grounding.

major comments (2)
  1. [BUTLER agent and transfer mechanism] The central claim that abstract policies learned in TextWorld produce reliable gains when transferred to ALFRED rests on the assumption that state and action spaces align sufficiently without major domain adaptation; the manuscript should provide explicit analysis or ablations quantifying any residual domain gap (e.g., differences in object affordances or navigation granularity).
  2. [Experimental results] The empirical demonstration of improved generalization lacks visible details on exact baselines, success metrics, number of evaluation episodes, and ablation controls in the provided description; without these, the strength of the cross-environment claim cannot be fully assessed.
minor comments (2)
  1. [Abstract] The abstract's human-reasoning vignette is illustrative but could be tied more directly to the concrete technical pipeline (e.g., how abstract scoring maps to the agent's planning module).
  2. [Simulator description] Notation for environment states and action spaces should be defined consistently when first introduced to aid readers comparing TextWorld and ALFRED representations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the potential of ALFWorld as a testbed for abstract-to-grounded policy transfer. We address the two major comments point by point below and will revise the manuscript to incorporate the suggested clarifications and additional analyses.

read point-by-point responses
  1. Referee: [BUTLER agent and transfer mechanism] The central claim that abstract policies learned in TextWorld produce reliable gains when transferred to ALFRED rests on the assumption that state and action spaces align sufficiently without major domain adaptation; the manuscript should provide explicit analysis or ablations quantifying any residual domain gap (e.g., differences in object affordances or navigation granularity).

    Authors: We agree that an explicit quantification of any residual domain gap would strengthen the presentation of the central claim. ALFWorld is constructed so that TextWorld states and actions map directly onto the corresponding ALFRED visual observations and executable actions (detailed in Section 3), thereby minimizing differences in object affordances and navigation granularity by design. Nevertheless, we will add a new ablation subsection in the revised manuscript that reports performance under controlled perturbations of the alignment (e.g., altered object affordance mappings and coarser navigation grids) to quantify the size of any remaining gap and its effect on transfer gains. revision: yes

  2. Referee: [Experimental results] The empirical demonstration of improved generalization lacks visible details on exact baselines, success metrics, number of evaluation episodes, and ablation controls in the provided description; without these, the strength of the cross-environment claim cannot be fully assessed.

    Authors: We apologize that these details were not sufficiently prominent. The manuscript already specifies success rate as the primary metric, compares against visual-only training and random baselines, and evaluates on the standard ALFRED test episodes. In the revision we will expand the experimental section with an explicit table listing the exact number of evaluation episodes, full baseline descriptions, and all ablation controls so that the transfer results can be assessed rigorously. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical comparison is independent of inputs

full rationale

The paper's central claim rests on an empirical demonstration that agents trained via ALFWorld (abstract policies in TextWorld transferred to ALFRED visuals) generalize better than those trained only in the visual environment. This is evaluated through direct experimental comparisons on held-out tasks, without any derivation, fitted parameter renamed as prediction, or self-citation that reduces the result to its own inputs by construction. Prior citations (e.g., to TextWorld and ALFRED) supply the benchmarks and infrastructure rather than justifying the transfer result itself. The work is self-contained against external task performance metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper introduces new simulation infrastructure and an agent architecture rather than new physical entities or fitted constants. It relies on standard reinforcement learning assumptions for policy training in text environments.

axioms (1)
  • domain assumption Standard reinforcement learning assumptions hold for learning policies in TextWorld.
    The approach depends on RL training succeeding in the text environment to produce transferable abstract knowledge.
invented entities (2)
  • ALFWorld simulator no independent evidence
    purpose: To provide aligned tasks between text and visual embodied environments
    New infrastructure created by the authors to enable the transfer experiments.
  • BUTLER agent no independent evidence
    purpose: Modular agent that factors language, planning, navigation, and visual understanding
    New agent design presented in the paper.

pith-pipeline@v0.9.0 · 5557 in / 1323 out tokens · 103049 ms · 2026-05-12T06:33:39.346176+00:00 · methodology

discussion (0)


Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

    cs.AI 2026-05 conditional novelty 8.0

    MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...

  2. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 unverdicted novelty 8.0

    SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

  3. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 accept novelty 8.0

    SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

  4. SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

    cs.SE 2026-05 unverdicted novelty 7.0

    SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...

  5. Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

  6. EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

    cs.AI 2026-05 unverdicted novelty 7.0

    EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...

  7. MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

    cs.RO 2026-05 unverdicted novelty 7.0

    MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...

  8. Belief Memory: Agent Memory Under Partial Observability

    cs.AI 2026-05 unverdicted novelty 7.0

    BeliefMem stores multiple candidate conclusions with probabilities in agent memory and updates them via Noisy-OR rules to preserve uncertainty under partial observability.

  9. Belief Memory: Agent Memory Under Partial Observability

    cs.AI 2026-05 unverdicted novelty 7.0

    BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines o...

  10. ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

  11. TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

    cs.LG 2026-04 unverdicted novelty 7.0

    TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.

  12. EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

    cs.LG 2026-04 unverdicted novelty 7.0

    EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.

  13. MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...

  14. Hölder Policy Optimisation

    cs.LG 2026-05 unverdicted novelty 6.0

    HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.

  15. Verifiable Process Rewards for Agentic Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    Verifiable Process Rewards (VPR) converts symbolic oracles into dense turn-level supervision for reinforcement learning in agentic reasoning, outperforming outcome-only rewards and transferring to general benchmarks.

  16. Kintsugi: Learning Policies by Repairing Executable Knowledge Bases

    cs.LG 2026-05 unverdicted novelty 6.0

    Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.

  17. The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory

    cs.LG 2026-05 unverdicted novelty 6.0

    Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.

  18. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  19. ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

    cs.DC 2026-05 unverdicted novelty 6.0

    ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.

  20. From History to State: Constant-Context Skill Learning for LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebS...

  21. T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.

  22. ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.

  23. Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

  24. KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving

    cs.DC 2026-04 unverdicted novelty 6.0

    KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.

  25. From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.

  26. Visually-grounded Humanoid Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.

  27. HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation

    cs.AI 2026-04 unverdicted novelty 6.0

    HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.

  28. MIMIC-Py: An Extensible Tool for Personality-Driven Automated Game Testing with Large Language Models

    cs.SE 2026-04 unverdicted novelty 6.0

    MIMIC-Py provides a modular Python framework that turns personality-driven LLM agents into an extensible system for automated game testing via configurable traits, decoupled components, and multiple interaction methods.

  29. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  30. Cross-Modal Navigation with Multi-Agent Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    CRONA is a MARL framework that uses modality-specialized agents with auxiliary beliefs and a centralized multi-modal critic to achieve better performance and efficiency than single-agent baselines on visual-acoustic n...

  31. From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.

  32. ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

    cs.AI 2026-04 unverdicted novelty 5.0

    ReCAPA uses multi-level predictive correction and semantic alignment modules to reduce cascading failures in VLA systems, with new metrics for tracking error propagation and recovery on embodied benchmarks.

  33. ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

    cs.AI 2026-04 unverdicted novelty 5.0

    ReCAPA adds predictive correction and multi-level semantic alignment to VLA models, plus two new metrics for tracking error spread and recovery, yielding competitive benchmark results over LLM baselines.

  34. Environmental Understanding Vision-Language Model for Embodied Agent

    cs.CV 2026-04 unverdicted novelty 5.0

    EUEA fine-tunes VLMs on object perception, task planning, action understanding and goal recognition, with recovery and GRPO, to raise ALFRED success rates by 11.89% over behavior cloning.

  35. DORA Explorer: Improving the Exploration Ability of LLMs Without Training

    cs.CL 2026-04 unverdicted novelty 5.0

    DORA Explorer boosts LLM agent exploration without training by ranking diverse actions using log-probabilities and a tunable parameter, yielding UCB-competitive results on multi-armed bandits and gains on text adventu...

  36. StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 4.0

    StepPO argues that LLM agents should optimize at the step level rather than token level to better handle delayed rewards and long contexts in agentic RL.

  37. Understanding the planning of LLM agents: A survey

    cs.AI 2024-02 accept novelty 4.0

    A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 33 Pith papers · 1 internal anchor

  1. [1]

    Adhikari, A., Yuan, X., Côté, M.-A., Zelinka, M., Rondeau, M.-A., Laroche, R., Poupart, P., Tang, J., Trischler, A., and Hamilton, W. L. (2020). Learning dynamic belief graphs to generalize on text-based games. In Neural Information Processing Systems (NeurIPS). Ammanabrolu, P. and Hausknecht, M. (2020). Graph constrained reinforcement learning for natura...

  2. [2]

    Deep recurrent q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527.

    Hausknecht, M. and Stone, P. (2015). Deep recurrent q-learning for partially observable mdps. arXiv preprint arXiv:1507.06527. Hausknecht, M. J., Ammanabrolu, P., Côté, M.-A., and Yuan, X. (2020). Interactive fiction games: A colossal adventure. In AAAI. He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE inter...

  3. [3]

    Ross, S., Gordon, G., and Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics. Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). Distilbert, a distilled version of bert: smaller, faster, cheape...

  4. [4]

    Y., and Zhang, L.

    Wang, X., Huang, Q., Celikyilmaz, A., Gao, J., Shen, D., Wang, Y.-F., Wang, W. Y., and Zhang, L. (2019). Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Wu, J., Lu, E., Kohli, P., Freeman, B., and Tenenbaum, J. (2017)...

  5. [5]

    We use L to refer to a linear transformation, and Lf means it is followed by a non-linear activation function f

    Published as a conference paper at ICLR 2021. A Details of BUTLER::Brain. In this section, we use o_t to denote the text observation at game step t and g to denote the goal description provided by a game. We use L to refer to a linear transformation, and Lf means it is followed by a non-linear activation function f. Brackets [⋅;⋅] denote vector concatenation, ⊙ denote...

  6. [6]

    Specifically, embeddings are initialized by pre-trained 768-dimensional BERT embeddings (Sanh et al., 2019)

    A.2 Encoder. We use a transformer-based encoder, which consists of an embedding layer and a transformer block (Vaswani et al., 2017). Specifically, embeddings are initialized by pre-trained 768-dimensional BERT embeddings (Sanh et al., 2019). The embeddings are fixed during training in all settings. The transformer block consists of a stack of 5 convolution...

  7. [7]

    Following standard transformer training, we add positional encodings into each block’s input

    is applied after each component inside the block. Following standard transformer training, we add positional encodings into each block's input. At every game step t, we use the same encoder to process text observation o_t and goal description g. The resulting representations are h_ot ∈ R^(L_ot×H) and h_g ∈ R^(L_g×H), where L_ot is the number of tokens in o_t, L_g denotes the...

  8. [8]

    We first obtain the source representation by concatenating h_og and h_t, resulting in h_src ∈ R^(L_o×2H)

    A.4 Decoder. Our decoder consists of an embedding layer, a transformer block and a pointer softmax mechanism (Gulcehre et al., 2016). We first obtain the source representation by concatenating h_og and h_t, resulting in h_src ∈ R^(L_o×2H). Similar to the encoder, the embedding layer is frozen after initializing it with pre-trained BERT embeddings. The transformer bloc...

  9. [9]

    Taking h_tgt as input, a linear layer with tanh activation projects the target representation into the same space as the embeddings (with dimensionality of 768), then the pre-trained embedding matrix E generates output logits (Press and Wolf, 2016), where the output size is the same as the vocabulary size. The resulting logits are then normalized by a softmax t...

  10. [10]

    We gather a sequence of transitions from each game episode, and push each sequence into a replay buffer, which has a capacity of 500K episodes

    During training with DAgger, we use a batch size of 10 to collect transitions (tuples of {o_0, o_t, g, â_t}) at each game step t, where â_t is the ground-truth action provided by the rule-based expert (see Section E). We gather a sequence of transitions from each game episode, and push each sequence into a repl...

  11. [11]

    We linearly anneal the fraction of the expert’s assistance from 100% to 1% across a window of 50K episodes

    If the agent uses up this budget, the game episode is forced to terminate. We linearly anneal the fraction of the expert’s assistance from 100% to 1% across a window of 50K episodes. The agent is updated after every 5 steps of data collection. We sample a batch of 64 data points from the replay buffer. In the setting with the recurrent aggregator, every s...

  12. [12]

    By default unless mentioned otherwise (ablations), we use all available training games in each of the task types

    We take the best performing checkpoints and then apply this heuristic during evaluation and report the resulting scores in tables (e.g., Table 2). By default unless mentioned otherwise (ablations), we use all available training games in each of the task types. We use an observation queue length of 5 and use a recurrent aggregator. The model is trained...

  13. [13]

    Then, the template gets realized using the predicates found in the current state

    When needed, the engine will sample a template given some context, i.e., the current state and the last action. Then, the template gets realized using the predicates found in the current state. D Mask R-CNN Detector. We use a Mask R-CNN detector (He et al.,

  14. [14]

    and fine-tune it with additional labels from ALFRED training scenes. To generate additional labels, we replay the expert demonstrations from ALFRED and record ground-truth image and instance segmentation pairs from the simulator (THOR) after completing each high-level action e.g., goto, pickup etc. We generate a dataset of 50K images, and fine-tune the dete...

  15. [15]

    put a plate on the coffee table

    or populate a list of command templates (Ammanabrolu and Hausknecht, 2020). We initially trained our agents with candidate commands from the TextWorld Engine, but they quickly overfit without learning affordances, commonsense, or pre-conditions, and had zero performance on embodied transfer. In the embodied se...
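The expert-assistance schedule quoted in entries [10] and [11] above (DAgger training with expert help linearly annealed from 100% to 1% across 50K episodes) can be sketched as follows; the function name and default constants are assumptions for illustration:

```python
# Minimal sketch of the linear expert-assistance annealing described in
# the quoted appendix: the fraction of steps where the rule-based expert
# supplies the action falls from 100% to 1% over a 50K-episode window.

def expert_fraction(episode: int, start: float = 1.0, end: float = 0.01,
                    window: int = 50_000) -> float:
    """Linearly anneal expert assistance from `start` to `end` over `window` episodes."""
    if episode >= window:
        return end
    return start + (end - start) * episode / window
```

In a training loop, each step would consult `expert_fraction(episode)` to decide whether the expert or the learning agent chooses the next action, so early episodes are almost fully expert-driven and late episodes are almost fully agent-driven.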