hub Mixed citations

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

· 2020 · cs.CL · arXiv 2010.03768

Mixed citation behavior. Most common role is background (52%).

70 Pith papers citing it

Background 52% of classified citations

open full Pith review browse 70 citing papers arXiv PDF

abstract

Given a simple request like Put a washed apple in the kitchen fridge, humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does not yet provide the infrastructure necessary for both reasoning abstractly and executing concretely. We address this limitation by introducing ALFWorld, a simulator that enables agents to learn abstract, text based policies in TextWorld (C\^ot\'e et al., 2018) and then execute goals from the ALFRED benchmark (Shridhar et al., 2020) in a rich visual environment. ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions. In turn, as we demonstrate empirically, this fosters better agent generalization than training only in the visually grounded environment. BUTLER's simple, modular design factors the problem to allow researchers to focus on models for improving every piece of the pipeline (language understanding, planning, navigation, and visual scene understanding).

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 12 dataset 8 method 2 baseline 1

citation-polarity summary

background 12 use dataset 8 baseline 1 unclear 1 use method 1

claims ledger

abstract Given a simple request like Put a washed apple in the kitchen fridge, humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does not yet provide the infrastructure necessary for both reasoning abstractly and executing concretely. We address this limitation by introducing ALFWorld, a simulator that enables agents to learn abstr

co-cited works

representative citing papers

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

cs.AI · 2026-05-14 · unverdicted · novelty 8.0 · 2 refs

LongAct benchmark evaluates long-horizon household task execution from free-form instructions; HoloMind agent raises performance but top VLMs still reach only 59% goal completion and 16% full-task success.

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

cs.AI · 2026-05-12 · conditional · novelty 8.0

MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

ElasticMem: Latent Memory as a Learnable Resource for LLM Agents

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

ElasticMem enables LLM agents to learn adaptive latent memory retrieval and elastic budget allocation, improving QA accuracy by 24-26% and ALFWorld success by 27-66% over baselines with lower token cost.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero library-time LLM cost.

Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.

MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

cs.RO · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.

Belief Memory: Agent Memory Under Partial Observability

cs.AI · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines on LoCoMo and ALFWorld.

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

cs.LG · 2026-04-27 · unverdicted · novelty 7.0

TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.

EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.

ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs

cs.RO · 2026-02-09 · unverdicted · novelty 7.0

ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.

SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision

cs.AI · 2026-05-31 · unverdicted · novelty 6.0

SkillRevise iteratively refines initial LLM-generated agent skills using execution traces to diagnose defects and apply repairs, raising success rates from 36.05% to 61.63% on SkillsBench across three benchmarks and five LLMs.

CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space

cs.CV · 2026-05-30 · unverdicted · novelty 6.0

CodeCytos is a code-augmented reasoning agent framework for dynamic, programmable exploration of custom spatial cellular features in molecular imaging data across four tissue types.

ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

ExpGraph builds a graph of summarized agent experiences and uses graph diffusion plus an RL-trained retrieval copilot to improve frozen LLM executors on QA, math, code, and agentic tasks without parameter updates.

Self-Evolving Multi-Agent Systems via Decentralized Memory

cs.MA · 2026-05-21 · unverdicted · novelty 6.0

DecentMem is a decentralized dual-pool memory framework for self-evolving multi-agent systems that provides O(log T) regret guarantees and yields up to 23.8% accuracy gains over centralized baselines.

Mem-$\pi$: Adaptive Memory through Learning When and What to Generate

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.

Training Language Agents to Learn from Experience

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

Introduces the ICT framework and an RL pipeline to train language agent reflectors that distill experience into reusable prompts, outperforming baselines on held-out tasks in ALFWorld and MiniHack.

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficiency than GRPO on ALFWorld and WebShop.

MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-2K dataset.

Holder Policy Optimisation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.

Kintsugi: Learning Policies by Repairing Executable Knowledge Bases

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.

citing papers explorer

Showing 38 of 38 citing papers after filters.

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution cs.AI · 2026-05-14 · unverdicted · none · ref 16 · 2 links · internal anchor
LongAct benchmark evaluates long-horizon household task execution from free-form instructions; HoloMind agent raises performance but top VLMs still reach only 59% goal completion and 16% full-task success.
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare cs.AI · 2026-05-12 · conditional · none · ref 28 · internal anchor
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning cs.AI · 2026-05-10 · accept · none · ref 72 · 2 links · internal anchor
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows cs.AI · 2026-05-18 · unverdicted · none · ref 40 · internal anchor
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents cs.AI · 2026-05-11 · unverdicted · none · ref 23 · internal anchor
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium cs.AI · 2026-05-10 · unverdicted · none · ref 57 · internal anchor
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
Belief Memory: Agent Memory Under Partial Observability cs.AI · 2026-05-07 · unverdicted · none · ref 13 · 2 links · internal anchor
BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines on LoCoMo and ALFWorld.
SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision cs.AI · 2026-05-31 · unverdicted · none · ref 19 · internal anchor
SkillRevise iteratively refines initial LLM-generated agent skills using execution traces to diagnose defects and apply repairs, raising success rates from 36.05% to 61.63% on SkillsBench across three benchmarks and five LLMs.
MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning cs.AI · 2026-05-13 · unverdicted · none · ref 27 · internal anchor
MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-2K dataset.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces cs.AI · 2026-05-09 · unverdicted · none · ref 140 · internal anchor
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
From History to State: Constant-Context Skill Learning for LLM Agents cs.AI · 2026-05-06 · unverdicted · none · ref 29 · internal anchor
Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebShop, and 66.4% on SciWorld with Qwen3-8B while reducing prompt tokens 2-7x.
T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning cs.AI · 2026-05-04 · unverdicted · none · ref 17 · internal anchor
T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.
From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning cs.AI · 2026-04-12 · unverdicted · none · ref 15 · internal anchor
EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.
HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation cs.AI · 2026-04-09 · unverdicted · none · ref 28 · internal anchor
HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.
WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents cs.AI · 2026-03-05 · unverdicted · none · ref 23 · internal anchor
WebFactory is a fully automated RL pipeline that compresses LLM-encoded internet knowledge into grounded web agents, achieving performance comparable to human-annotated training but using synthetic data from only 10 websites.
HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents cs.AI · 2026-03-01 · unverdicted · none · ref 46 · internal anchor
HiMAC decomposes LLM agent tasks into macro planning and micro execution using critic-free hierarchical RL and iterative co-evolution, outperforming baselines on ALFWorld, WebShop, and Sokoban.
AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction cs.AI · 2026-02-05 · unverdicted · none · ref 17 · internal anchor
AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.
No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning cs.AI · 2026-01-11 · unverdicted · none · ref 12 · internal anchor
ECHO jointly optimizes policy and critic via co-evolution, cascaded rollouts, and saturation-aware shaping to deliver non-stale feedback and higher success in open-world LLM agent RL.
Differentiable Evolutionary Reinforcement Learning cs.AI · 2025-12-15 · unverdicted · none · ref 20 · internal anchor
DERL is a differentiable bi-level method that evolves optimal reward structures for RL policies by composing atomic primitives and using meta-gradients from validation performance.
Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning cs.AI · 2025-08-05 · unverdicted · none · ref 11 · internal anchor
BPO framework achieves state-of-the-art performance with improved token efficiency on ALFWorld, ScienceWorld, and WebShop by bootstrapping efficient reasoning, extrapolating via curriculum, and refining on reward-selected experiences.
Cognitive Architectures for Language Agents cs.AI · 2023-09-05 · accept · none · ref 73 · internal anchor
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents cs.AI · 2023-02-03 · conditional · none · ref 43 · internal anchor
DEPS combines LLM-based interactive planning with a trainable goal selector to create a zero-shot multi-task agent that completes 70+ Minecraft tasks and nearly doubles prior performance.
Echo: Learning from Experience Data via User-Driven Refinement cs.AI · 2026-05-21 · unverdicted · none · ref 20 · internal anchor
Echo is a framework that harvests user-driven refinements of agent proposals as training signals to align models with real-world needs, demonstrated by raising code completion acceptance from 25.7% to 35.7% in production.
TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning cs.AI · 2026-05-18 · unverdicted · none · ref 37 · internal anchor
TaskGround introduces a Ground-Infer-Execute framework for full-scene household reasoning that improves success rates on the FullHome benchmark and enables compact models to match larger ones at up to 18x lower token cost.
Interactive Evaluation Requires a Design Science cs.AI · 2026-05-18 · unverdicted · none · ref 51 · internal anchor
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.
From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents cs.AI · 2026-04-25 · unverdicted · none · ref 18 · internal anchor
AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.
ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures cs.AI · 2026-04-23 · unverdicted · none · ref 19 · 2 links · internal anchor
ReCAPA adds predictive correction and multi-level semantic alignment to VLA models, plus two new metrics for tracking error spread and recovery, yielding competitive benchmark results over LLM baselines.
Specification-Driven Generation and Evaluation of Discrete-Event World Models via the DEVS Formalism cs.AI · 2026-03-04 · unverdicted · none · ref 8 · internal anchor
A staged LLM pipeline synthesizes verifiable discrete-event world models from natural language specifications using the DEVS formalism for long-horizon consistency in LLM agents.
End-to-end PDDL Planning with Hardcoded and Dynamic Agents cs.AI · 2025-12-10 · unverdicted · none · ref 31 · internal anchor
An end-to-end LLM framework refines natural language into valid PDDL domains and problems via hardcoded and dynamic agents, generates plans with standard engines, and returns readable output.
Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective cs.AI · 2025-11-01 · conditional · none · ref 33 · internal anchor
The paper analyzes CPU bottlenecks in agentic AI serving, selects representative workloads, and demonstrates that CPU-aware scheduling optimizations COMB and MAS can reduce P50 latency by up to 1.7x and total latency by up to 2.49x on two hardware systems.
Agentic Reasoning for Large Language Models cs.AI · 2026-01-18 · unverdicted · none · ref 48 · internal anchor
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
Understanding the planning of LLM agents: A survey cs.AI · 2024-02-05 · accept · none · ref 37 · internal anchor
A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
Agent AI: Surveying the Horizons of Multimodal Interaction cs.AI · 2024-01-07 · unverdicted · none · ref 216 · internal anchor
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
Survey on Evaluation of LLM-based Agents cs.AI · 2025-03-20 · unverdicted · none · ref 7 · internal anchor
A survey of evaluation methods for LLM-based agents from five perspectives, identifying trends toward realistic benchmarks and gaps in safety, cost-efficiency, and robustness.
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models cs.AI · 2025-01-16 · unverdicted · none · ref 133 · internal anchor
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
A Survey on the Memory Mechanism of Large Language Model based Agents cs.AI · 2024-04-21 · accept · none · ref 141 · internal anchor
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents cs.AI · 2026-05-21 · unreviewed · ref 48 · internal anchor
Verifiable Process Rewards for Agentic Reasoning cs.AI · 2026-05-11 · unreviewed · ref 24 · internal anchor

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer