{"total":12,"items":[{"citing_arxiv_id":"2606.26006","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation","primary_cat":"cs.RO","submitted_at":"2026-06-24T16:23:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FORCE is a 3-stage RL fine-tuning method for VLA models that stabilizes Q-function via on-policy warm-up and filters high-value actions for updates, claiming 79% success rate gains and 32.5% faster training without human intervention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01036","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Position: Good Embodied Reward Models Need Bad Behavior Data","primary_cat":"cs.RO","submitted_at":"2026-05-31T05:56:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Embodied reward models systematically over-reward unsafe, suboptimal, and shortcut robot behaviors due to training on successful data only, and modest inclusion of bad behavior data improves alignment with human preferences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30660","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies","primary_cat":"cs.LG","submitted_at":"2026-05-28T23:39:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BOKBO is the first conformal abstention method for K-sample VLA policies that supplies finite-sample distribution-free guarantees on executed violation rates, with global and Mondrian 
per-task variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12620","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents","primary_cat":"cs.AI","submitted_at":"2026-05-12T18:08:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VeGAS improves MLLM-based embodied agents by sampling action ensembles and using a verifier trained on LLM-synthesized failure cases, yielding up to 36% relative gains on hard multi-object long-horizon tasks in Habitat and ALFRED.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10094","ref_index":13,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs","primary_cat":"cs.RO","submitted_at":"2026-05-11T07:11:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and accumulated closed-loop errors during deployment, limiting their stability and local adaptability. Test-Time Policy Steering. Recent work has explored test-time policy steering or scaling to improve VLA deployment stability [10, 5, 13, 27]. These methods enhance current action decisions through additional sampling, external evaluators, or internal confidence signals. 
For example, RoboMonkey [13] selects among perturbed action candidates with a VLM-based verifier, MG-Select [10] uses condition-masking confidence for verifier-free selection, and TACO [27] constrains generation toward stable successful modes via pseudo-count estimation. While effective, these methods follow a generate-then-select paradigm, which incurs extra inference overhead and discards"},{"citing_arxiv_id":"2605.01194","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model","primary_cat":"cs.RO","submitted_at":"2026-05-02T02:13:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"It is lightweight but powerful, as it can leverage the internal features of the VLM to reduce computation. 4.3.1 Input Representation The RAC takes four distinct inputs: two action chunks to be compared, a_i and a_j; their difference a_i − a_j; and the current proprioceptive state s_t. Each input is independently mapped by a dedicated MLP into the model's embedding dimension d_{model}: e_i = MLP_i(a_i), e_j = MLP_j(a_j) (10); e_{diff} = MLP_{diff}(a_i − a_j), e_s = MLP_s(s_t) (11). These four embedding vectors are then concatenated to form the initial hidden state for the RAC's Transformer tower: X_0 = Concat[e_i; e_j; e_{diff}; e_s]. 
4.3.2 Hierarchical Context Condition To equip the RAC with task-relevant context, we introduce a set of N_q randomly initialized, learnable query tokens"},{"citing_arxiv_id":"2604.19730","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FASTER: Value-Guided Sampling for Fast RL","primary_cat":"cs.LG","submitted_at":"2026-04-21T17:52:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"class for a wide variety of domains because they can represent rich, multimodal data distributions [5, 6, 9, 10, 11, 12, 13, 14]. Several recent works study how to improve these policies with value functions or reduce their inference cost. Value-Guided Policy Steering (V-GPS) re-ranks actions from frozen robot policies with a learned value function [15], RoboMonkey scales test-time sampling and verification for vision-language-action policies [16], and EXPO couples an expressive base policy with a lightweight edit policy and selects value-maximizing actions online [5]. 
Other work amortizes or changes the policy itself: One-Step Diffusion Policy distills multi-step denoising into a faster actor [17], while DSRL post-trains a behavior-cloned diffusion policy by running RL in its latent-noise space [18]."},{"citing_arxiv_id":"2604.18107","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-04-20T11:25:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PDF improves VLA success rates on LIBERO and Atari by applying test-time perturbation learning with delayed feedback to correct trajectory overfitting and overconfidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05672","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model","primary_cat":"cs.RO","submitted_at":"2026-04-07T10:18:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In 8th Annual Conference on Robot Learning. [17] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025. [18] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success, 2025. https://arxiv.org/abs/2502.19645. 
[19] Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, and Marco Pavone. Robomonkey: Scaling test-time sampling and verification for vision-language-action models, 2025. https://arxiv.org/abs/2506.17811. [20] Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang,"},{"citing_arxiv_id":"2602.22474","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering","primary_cat":"cs.RO","submitted_at":"2026-02-25T23:23:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UPS framework uses conformal prediction to calibrate VLM verifiers for choosing between high-confidence action execution, natural language task queries, or policy interventions, then applies residual learning from interventions to continually improve the base policy with minimal feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.13193","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control","primary_cat":"cs.RO","submitted_at":"2026-02-13T18:57:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and generalization tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.14093","ref_index":82,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Vision-Language-Action 
Models for Embodied AI","primary_cat":"cs.RO","submitted_at":"2024-05-23T01:43:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"extracts \"semantic\" information from the RGB image, while the Transporter network extracts \"spatial\" information from the RGB-D image. The CLIP sentence encoder encodes the language instruction and guides the output SE(2) action, consisting of paired pick and place end-effector poses. It represents an early demonstration of language-conditioned pick-and-place capabilities. BC-Z [82] processes two types of task instructions: a language instruction or a human demonstration video. The environment is presented to the model in the form of an RGB image. Then the instruction embedding and the image embedding are combined through the FiLM layer, culminating in the generation of actions. This conditional policy is asserted to exhibit zero-shot task generalization to unseen tasks."}],"limit":50,"offset":0}