Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation

Alois Knoll; Dhruv Shah; Edward Johns; Hongkuan Zhou; Jean Oh; Jesse Thomason; Joyce Chai; Kai Huang; Mohit Shridhar; Oier Mees

arxiv: 2312.10807 · v7 · pith:ASCYOB4Bnew · submitted 2023-12-17 · 💻 cs.RO


Xiangtong Yao, Hongkuan Zhou, Oier Mees, Yuan Meng, Ted Xiao, Yonatan Bisk, Jean Oh, Edward Johns, Mohit Shridhar, Dhruv Shah, Jesse Thomason, Kai Huang, Joyce Chai, Zhenshan Bing, Alois Knoll


classification 💻 cs.RO

keywords: language, robot, language-conditioned, manipulation, action, field, instructions, policy



Language-conditioned robot manipulation is an emerging field aimed at enabling seamless communication and cooperation between humans and robotic agents by teaching robots to comprehend and execute instructions conveyed in natural language. This interdisciplinary area integrates scene understanding, language processing, and policy learning to bridge the gap between human instructions and robot actions. In this comprehensive survey, we systematically explore recent advancements in language-conditioned robot manipulation. We categorize existing methods by the primary way language is integrated into the robot system: language for state evaluation, language as a policy condition, language for cognitive planning and reasoning, and language in unified vision-language-action models. We further analyze state-of-the-art techniques along five axes: action granularity, data and supervision regimes, system cost and latency, environments and evaluations, and task specification. Additionally, we highlight the key debates in the field. Finally, we discuss open challenges and future research directions, focusing on enhancing generalization capabilities and addressing safety issues in language-conditioned robot manipulators.


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
cs.RO 2025-02 accept novelty 6.0

OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
cs.RO 2026-05 unverdicted novelty 5.0

DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.

When control meets large language models: From words to dynamics
eess.SY 2026-02 unverdicted novelty 3.0

The paper proposes a bidirectional continuum between LLMs and control systems, covering LLM-assisted controller design, control-based LLM steering, and state-space modeling of LLMs.