{"total":16,"items":[{"citing_arxiv_id":"2605.17077","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning","primary_cat":"cs.RO","submitted_at":"2026-05-16T16:52:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over task-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16932","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MORN: Metacognitive Object-Goal Regulation for Resource-Rational Long-Horizon Navigation","primary_cat":"cs.RO","submitted_at":"2026-05-16T10:59:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MORN augments frozen VLM-based object navigation agents with a System 2 meta-controller using Potentiality Index, Persistence Gating, and Evidence Accumulation to improve goal completion rate from 0.23 to 0.30 and reduce wasted steps on the HM3D dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05846","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LoopTrap: Termination Poisoning Attacks on LLM Agents","primary_cat":"cs.CR","submitted_at":"2026-05-07T08:21:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 
agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Red-Teaming LLM Multi-Agent Systems via Communication Attacks. arXiv:2502.14847 [cs.CR] https://arxiv.org/abs/2502.14847 [14] Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. 2022. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. arXiv:2201.07207 [cs.LG] https://arxiv.org/abs/2201.07207 [15] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. 2022. Inner Monologue: Embodied Reasoning through Planning with Language Models. arXiv:2207.05608 [cs.RO] https://arxiv."},{"citing_arxiv_id":"2602.13193","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control","primary_cat":"cs.RO","submitted_at":"2026-02-13T18:57:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and generalization tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.20911","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing","primary_cat":"cs.CV","submitted_at":"2025-06-26T00:33:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FaSTA* combines LLM fast planning with A* 
search and inductive subroutine mining to create an efficient agent for multi-turn image editing tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.19645","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","primary_cat":"cs.RO","submitted_at":"2025-02-27T00:30:29+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. [15] Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. In Proceedings of the International Conference on Machine Learning (ICML), 2024. [16] Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022. URL https://arxiv.org/abs/2201.07207. 
[17] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah"},{"citing_arxiv_id":"2410.02644","ref_index":104,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents","primary_cat":"cs.CR","submitted_at":"2024-10-03T16:30:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.10639","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models","primary_cat":"cs.RO","submitted_at":"2023-10-16T17:57:23+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2303.17760","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CAMEL: Communicative Agents for \"Mind\" Exploration of Large Language Model Society","primary_cat":"cs.AI","submitted_at":"2023-03-31T01:09:00+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CAMEL proposes 
a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[48] Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689, 2022. [49] Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. A simple language model for task-oriented dialogue. Advances in Neural Information Processing Systems, 33:20179-20191, 2020. [50] Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207, 2022. [51] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. 
Inner monologue: Embodied reasoning through"},{"citing_arxiv_id":"2209.11302","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ProgPrompt: Generating Situated Robot Task Plans using Large Language Models","primary_cat":"cs.RO","submitted_at":"2022-09-22T20:29:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ProgPrompt generates situated robot task plans by prompting LLMs with program-like specifications of actions, objects, and executable examples, achieving state-of-the-art success in VirtualHome tasks and physical robot deployment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2209.07753","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Code as Policies: Language Model Programs for Embodied Control","primary_cat":"cs.RO","submitted_at":"2022-09-16T07:17:23+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"More recent methods learn the grounding end-to-end (language to action) [8]-[10], but they require copious amounts of training data, which can be expensive to obtain on real robots. Meanwhile, recent progress in natural language processing shows that large language models (LLMs) pretrained on Internet-scale data [11]-[13] exhibit out-of-the-box capabilities [14]-[16] that can be applied to language-using robots e.g., planning a sequence of steps from natural language instructions [16]-[18] without additional model finetuning. 
These steps can be grounded in real robot affordances from value functions among a fixed set of skills i.e., policies pretrained with behavior cloning or reinforcement learning [19]-[21]."},{"citing_arxiv_id":"2206.07682","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Emergent Abilities of Large Language Models","primary_cat":"cs.CL","submitted_at":"2022-06-15T17:32:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2205.06175","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Generalist Agent","primary_cat":"cs.AI","submitted_at":"2022-05-12T16:03:26+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The tokenized result is a sequence of integers within the range of [0, 1024). • Continuous values, e.g. proprioceptive inputs or joint torques, are first flattened into sequences of floating point values in row-major order. The values are mu-law encoded to the range [−1, 1] if not already there (see Figure 14 for details), then discretized to 1024 uniform bins. The discrete integers are then shifted to the range of [32000, 33024). After converting data into tokens, we use the following canonical sequence ordering. • Text tokens in the same order as the raw input text. • Image patch tokens in raster order. 
• Tensors in row-major order. • Nested structures in lexicographical order by key. • Agent timesteps as observation tokens followed by a separator, then action tokens. • Agent episodes as timesteps in time order. Further details on tokenizing agent data are presented in the supplementary material (Section B). 2.2 Embedding input tokens and setting output targets After tokenization and sequencing, we apply a parameterized embedding function f(·; θe) to each token (i.e. it is applied to both observations and actions) to produce the final model input. To enable efficient learning from our multi-modal input sequence s_{1:L} the embedding function performs different operations depending on the modality the token stems from: 3 Published in Transactions on Machine Learning Research (11/2022) • Tokens belonging to text, discrete- or continuous-valued observations or actions for any time-step are embedded via a lookup table into a learned vector embedding space. Learnable position encodings are added for all tokens based on their local token position within their corresponding time-step. • Tokens belonging to image patches for any time-step are embedded using a single ResNet (He et al., 2016a) block to obtain a vector per patch. For image patch token embeddings, we also add a learnable within-image position encoding vector. 
We refer to appendix Section C.3 for full details on th"},{"citing_arxiv_id":"2204.06745","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GPT-NeoX-20B: An Open-Source Autoregressive Language Model","primary_cat":"cs.CL","submitted_at":"2022-04-14T04:00:27+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2204.01691","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Do As I Can, Not As I Say: Grounding Language in Robotic Affordances","primary_cat":"cs.RO","submitted_at":"2022-04-04T17:57:11+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Table 2 illustrates the necessity of the affordance grounding. We compare PaLM-SayCan to (1) No VF, which removes the value function grounding (i.e., choosing the maximum language score skill) and to (2) Generative, which uses the generative output of the LLM and then projects each planned skill to its maximal cosine similarity skill via USE embeddings. The latter in effect compares to [23], which loses the explicit option probabilities, and thus is less interpretable and cannot be combined with affordance probabilities. For Generative we also tried BERT embeddings [3], but found poor performance. 
The No VF and Generative approaches performed similarly, achieving 67% and 74% planning success rate respectively, and worse than PaLM-SayCan's 84%."},{"citing_arxiv_id":"2204.00598","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language","primary_cat":"cs.CV","submitted_at":"2022-04-01T17:43:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning. 1 Introduction Large pretrained models (e.g., BERT [1], GPT-3 [2], CLIP [3]) have enabled impressive capabilities [4]: from zero-shot image classification [3, 5], to high-level planning [6, 7]. Their capabilities depend on their training data - while they may be broadly crawled from the web, their distributions remain distinct across domains. For example, in terms of linguistic data, visual-language models (VLMs) [8, 9] are trained on image and video captions, but large language models (LMs) [1, 10, 11] are additionally trained on a large corpora of other data such as spreadsheets, fictional novels, and"}],"limit":50,"offset":0}