Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.
hub Canonical reference
Language models as zero-shot planners: Extracting actionable knowledge for embodied agents
Canonical reference. 86% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
FaSTA* combines LLM fast planning with A* search and inductive subroutine mining to create an efficient agent for multi-turn image editing tasks.
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.
SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.
Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.
DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over task-only baselines.
LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and generalization tasks.
OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
ProgPrompt generates situated robot task plans by prompting LLMs with program-like specifications of actions, objects, and executable examples, achieving state-of-the-art success in VirtualHome tasks and physical robot deployment.
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.
MORN augments frozen VLM-based object navigation agents with a System 2 meta-controller using Potentiality Index, Persistence Gating, and Evidence Accumulation to improve goal completion rate from 0.23 to 0.30 and reduce wasted steps on the HM3D dataset.
citing papers explorer
-
Emergent Abilities of Large Language Models
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
-
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.