pith. machine review for the scientific record.

arxiv: 2605.11817 · v1 · submitted 2026-05-12 · 💻 cs.RO · cs.CV

Recognition: 1 theorem link

See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model


Pith reviewed 2026-05-13 05:37 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords vision language action · token compression · differentiable sampling · robotic manipulation · efficient vla · grid sampler · visual tokens

The pith

Differentiable grid sampling lets VLA models compress visual tokens to under 10 percent of the original count while keeping full manipulation success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models promise to let robots follow natural language instructions but require too much computation for practical use. Existing ways to reduce visual information by pruning tokens lose key details such as exact contact points, which breaks performance. The paper proposes replacing pruning with a continuous resampling process that learns which coordinates matter and pulls out features smoothly from the original grid. This change supports far more aggressive compression. On both simulation tests and real hardware, the resulting models use 76 percent less computation yet match the success rates of the original full models.

Core claim

The Differentiable Grid Sampler module adaptively predicts a minimal set of salient coordinates and extracts features via differentiable interpolation to perform task-aware continuous resampling of visual tokens in the vision encoder, achieving compression to fewer than 10 percent of original tokens without performance loss.

What carries the argument

Differentiable Grid Sampler (GridS) that predicts salient coordinates adaptively and uses differentiable interpolation for continuous feature extraction instead of discrete pruning.
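
The mechanism is easy to sketch. Below is a minimal, hypothetical PyTorch rendering of continuous token resampling in this spirit: a small head predicts normalized coordinates, and torch.nn.functional.grid_sample performs the differentiable bilinear lookup. The module structure, pooling choice, and all names are illustrative assumptions, not the paper's released implementation (see the GitHub link in the abstract for that).

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousTokenSampler(nn.Module):
    """Hypothetical GridS-style module: predict K continuous (x, y)
    coordinates, then bilinearly sample the dense feature grid there."""

    def __init__(self, dim: int, num_samples: int = 16):
        super().__init__()
        self.num_samples = num_samples
        # Small head mapping a pooled feature summary to 2K coordinates;
        # tanh keeps them in grid_sample's normalized [-1, 1] range.
        self.coord_head = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_samples * 2)
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) dense visual feature grid from the encoder.
        B, C, H, W = feat.shape
        pooled = feat.mean(dim=(2, 3))                     # (B, C)
        coords = torch.tanh(self.coord_head(pooled))       # (B, 2K)
        grid = coords.view(B, self.num_samples, 1, 2)      # (B, K, 1, 2)
        # Bilinear grid_sample interpolates each coordinate's four nearest
        # neighbors; gradients flow to the features AND the coordinates.
        sampled = F.grid_sample(feat, grid, mode="bilinear",
                                align_corners=False)       # (B, C, K, 1)
        return sampled.squeeze(-1).transpose(1, 2)         # (B, K, C)

# Example: compress a 16x16 grid (256 tokens) down to 16 tokens per view.
sampler = ContinuousTokenSampler(dim=768, num_samples=16)
tokens = sampler(torch.randn(2, 768, 16, 16))
print(tokens.shape)  # torch.Size([2, 16, 768])

Because grid_sample is differentiable in the grid argument, the coordinate head receives gradients from the task loss alone; that is the property that separates this approach from hard top-k token selection.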

If this is right

  • Delivers a 76 percent reduction in FLOPs for VLA models (a back-of-envelope scaling check follows this list).
  • Maintains identical success rates on manipulation tasks.
  • Operates as a plug-and-play addition to existing VLA architectures.
  • Reports the lowest visual token counts achieved to date for these models.
  • Shows consistent results on the LIBERO benchmark and physical robot platforms.
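
A back-of-envelope check on the 76 percent figure, under the standard approximation that a transformer layer costs roughly 12·n·d² FLOPs for projections and MLP plus 2·n²·d for attention at sequence length n and width d. The width and non-visual token count below are hypothetical placeholders, not the paper's configuration:

def layer_flops(n: int, d: int) -> int:
    # Rough per-layer transformer cost: projections/MLP + attention.
    return 12 * n * d**2 + 2 * n**2 * d

d = 2048                  # hypothetical hidden width
other = 120               # hypothetical language + action token count
dense = other + 2 * 256   # two views at 256 visual tokens each (Figure 1)
sparse = other + 2 * 16   # GridS budget of 16 tokens per view

saving = 1 - layer_flops(sparse, d) / layer_flops(dense, d)
print(f"~{saving:.0%} FLOPs saved per layer")  # ~77% with these placeholders

The point is only that a sub-10-percent visual token budget plausibly yields a saving of the reported magnitude once the fixed cost of non-visual tokens is accounted for; the paper's exact 76 percent depends on its architecture.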

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar continuous resampling could improve efficiency in other spatial reasoning models like those for navigation or scene understanding.
  • Combining GridS with model quantization or distillation might push real-time performance even further on limited hardware.
  • Testing the method on a wider range of manipulation tasks with varying lighting or clutter would clarify its robustness limits.

Load-bearing premise

Adaptively predicting minimal salient coordinates and using differentiable interpolation to extract features will always retain the geometric details essential for task success.

What would settle it

A controlled experiment on a contact-heavy manipulation task where GridS with reduced tokens shows lower success rate than the full-token baseline.

Figures

Figures reproduced from arXiv: 2605.11817 by Chang Xu, Chengbin Du, Chenghao Xia, Yanxiang Ma, Yixu Feng, Yunke Wang, Zinan Zhao.

Figure 1
Figure 1: Motivation and Performance of GridS. (a) Standard VLAs process images with dense, uniform token representations (2 × 256), leading to high computational redundancy in irrelevant background areas (100% Compute). (b) Our Grid Sample (GridS) prunes non-essential tokens, focusing only on salient regions. This reduces the token count to 2 × 16, requiring only 6.25% of the original compute. (c) Real-world Experim… view at source ↗
Figure 2
Figure 2: Discrete Selection vs. Differentiable Sampling. (a) Traditional approaches operate on a fixed grid. When the target region (yellow cross) falls between patches, the model is forced to perform discrete selection, leading to spatial quantization errors and a loss of fidelity. (b) Our approach predicts continuous coordinates and utilizes differentiable bilinear sampling to interpolate features from the four… view at source ↗
Figure 3
Figure 3: Overview of the GridS Token Pruning framework. (a) Standard Dense Representation: An input image (HR and WR denote the original image resolution) is processed by a visual encoder with ViT embeddings (Dosovitskiy et al., 2021) to generate dense visual tokens (H × W × C), capturing full spatial details. (b) GridS Token Pruning Module: This module identifies salient regions to sample a sparse set of visual to… view at source ↗
Figure 4
Figure 4: Differentiable Bilinear Sampling. To extract features at a continuous coordinate P(x, y), the module computes a weighted interpolation of the four nearest integer neighbors. This operation enables sub-pixel feature extraction and ensures the sampling process is differentiable. … view at source ↗
Figure 5
Figure 5: Real-world evaluation on the SO100 robot arm. (a) Execution rollouts of three language-conditioned tasks: Pick & Place, Stack Cubes, and Transfer Pen. (b) The corresponding Out-of-Distribution (OOD) test scenarios, featuring unseen distractor objects and variable spatial arrangements. We designed 21 different OOD scenarios. (c) Quantitative comparison of Success Rate (%) and Execution Time (s). Our proposed… view at source ↗
Figure 6
Figure 6: Performance Analysis. We compare the inference latency (left) and computational cost (right) of the baseline method versus our proposed GridS pruning (16 tokens) across varying batch sizes. Solid and dashed lines denote the absolute values (left y-axis), while the dotted lines indicate the relative speedup and efficiency ratios (right y-axis). … view at source ↗
Figure 7
Figure 7: Visualization of Information Retention and Sampling Efficiency. We evaluate GridS on LIBERO, ALOHA, and Real-World data. Left: Information Retention Maps demonstrate that our sampling strategy maintains high information retention (green), effectively covering the original feature space. Right: Token Self-Similarity matrices reveal that original features suffer from high spatial redundancy… view at source ↗
Figure 8
Figure 8: Real-World Hardware Setup. The image displays the LeRobot SO-100 follower arm used for policy execution. Visual inputs come from a fixed Intel RealSense D435 providing global scene context and a wrist-mounted Intel RealSense D405 capturing fine-grained local details. The policy operates using only RGB streams from these sensors. … view at source ↗
Figure 9
Figure 9: Visualization of OOD Scenarios. We selected seven examples per task to demonstrate how we evaluated the strategy across over 20 scenarios categorized into seven types of perturbations, including cluttered backgrounds, novel objects, removed training scenes, and unseen spatial layouts. Across these settings, GridS demonstrated stronger robustness compared to baseline models. … view at source ↗
Figure 10
Figure 10: Additional Information Retention maps on the LIBERO dataset. view at source ↗
Figure 11
Figure 11: Additional Information Retention maps on the ALOHA dataset. view at source ↗
Figure 12
Figure 12: Continuous Information Retention maps for the Real-World Stacking task (Steps 0–47). The visualization demonstrates that the model consistently maintains a retention score of 0.8 ∼ 0.9, effectively filtering background distractors while focusing on the relative geometry between the gripper and cubes. view at source ↗
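
The rule in Figure 4 can be written out explicitly. A standard bilinear form consistent with the caption, with (x_0, y_0) the integer floor of the continuous coordinate, x_1 = x_0 + 1, y_1 = y_0 + 1, \alpha = x - x_0, and \beta = y - y_0 (notation mine, not the paper's):

F_{\text{sampled}}(x, y) = (1-\alpha)(1-\beta)\,F(x_0, y_0) + \alpha(1-\beta)\,F(x_1, y_0) + (1-\alpha)\beta\,F(x_0, y_1) + \alpha\beta\,F(x_1, y_1)

All four weights vary smoothly with (x, y), so gradients reach the coordinate predictor; this is exactly what the discrete selection in Figure 2(a) cannot provide.
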
Original abstract

Vision-Language-Action (VLA) models have shown remarkable promise in robotics manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive compression using pruning inevitably discards critical geometric details like contact points, leading to severe performance degradation. This forces a compromise, limiting the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as a geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLA. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (with fewer than 10% original visual tokens). Experiments on both LIBERO benchmark and a real robotic platform demonstrate that validating the lowest feasible visual token count reported to date, GridS achieves a 76% reduction in FLOPs with no degradation in the success rate. The code is available at https://github.com/Fediory/Grid-Sampler.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes the Differentiable Grid Sampler (GridS), a plug-and-play module inserted into the vision encoder of vision-language-action (VLA) models. GridS adaptively predicts a minimal set of salient coordinates and extracts features via differentiable interpolation to perform continuous, geometry-aware resampling of visual tokens. The central claim is that this approach reduces visual tokens to fewer than 10% of the original count, yielding a 76% reduction in FLOPs while preserving success rates with no degradation on the LIBERO benchmark and real-robot manipulation tasks.

Significance. If the empirical results hold under scrutiny, the work would be significant for enabling real-time deployment of VLA models in robotics by resolving the compression-performance trade-off that prior token-pruning methods face. The plug-and-play design, end-to-end differentiability, and public code release are strengths that support reproducibility and adoption.

major comments (2)
  1. [Abstract] The central performance claims (76% FLOPs reduction, <10% visual tokens, and zero success-rate degradation) are stated without any reference to baselines, number of evaluation runs, error bars, or task breakdowns. This absence makes it impossible to assess whether the results substantiate the claim of breaking the compression trade-off, as the manuscript provides no experimental details, tables, or statistical analysis.
  2. [Abstract] The weakest assumption—that adaptive coordinate prediction plus differentiable interpolation will always retain critical geometric details (contact points, spatial relations) across tasks and environments—is load-bearing for the no-degradation claim but is not supported by any ablation or failure-case analysis in the provided text.
minor comments (1)
  1. [Abstract] The abstract sentence beginning 'Experiments on both LIBERO benchmark...' contains an awkward phrasing ('demonstrate that validating the lowest feasible...') that should be revised for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of GridS's potential impact. We address the two major comments point by point below, agreeing where the abstract is overly concise and committing to revisions that strengthen the manuscript without misrepresenting our existing results.

Point-by-point responses
  1. Referee: [Abstract] The central performance claims (76% FLOPs reduction, <10% visual tokens, and zero success-rate degradation) are stated without any reference to baselines, number of evaluation runs, error bars, or task breakdowns. This absence makes it impossible to assess whether the results substantiate the claim of breaking the compression trade-off, as the manuscript provides no experimental details, tables, or statistical analysis.

    Authors: We agree that the abstract's brevity omits explicit pointers to the supporting evidence. The full manuscript (Section 4 and associated tables) reports comparisons against baseline token-pruning methods on the LIBERO benchmark, including per-task success rates, real-robot manipulation results, and the reported 76% FLOPs reduction with under 10% tokens retained. We will revise the abstract to add a concise reference such as 'with no degradation relative to full-token baselines on LIBERO tasks (see Table 1 and Section 4)' while retaining the word limit. Detailed task breakdowns, run counts, and any error statistics remain in the main text and supplement, as space constraints prevent their inclusion in the abstract itself. revision: yes

  2. Referee: [Abstract] The weakest assumption—that adaptive coordinate prediction plus differentiable interpolation will always retain critical geometric details (contact points, spatial relations) across tasks and environments—is load-bearing for the no-degradation claim but is not supported by any ablation or failure-case analysis in the provided text.

    Authors: We acknowledge that the current manuscript does not contain dedicated ablations isolating the contribution of differentiable interpolation to geometric fidelity or explicit failure-case studies. The no-degradation result is evidenced by unchanged success rates on contact-rich LIBERO tasks and real-robot trials, which implicitly require preservation of spatial relations. To directly address the concern, we will add (i) an ablation comparing GridS with and without the differentiable sampling step and (ii) qualitative visualizations of predicted coordinates on representative tasks, plus a short discussion of observed edge cases, in the revised version. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is an independent architectural contribution

Full rationale

The paper introduces Differentiable Grid Sampler (GridS) as a new plug-and-play module that adaptively predicts salient coordinates and uses differentiable interpolation for token resampling in the vision encoder. The claimed 76% FLOPs reduction with no success-rate degradation is presented as an empirical outcome measured on external LIBERO benchmarks and real-robot tasks, not derived from any self-referential equations, fitted parameters renamed as predictions, or self-citation chains. No load-bearing step reduces the result to its own inputs by construction; the derivation chain consists of standard architectural design choices validated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the existence of a learned coordinate predictor whose outputs, when interpolated, retain all task-critical spatial information; no free parameters or invented entities are quantified in the abstract.

axioms (1)
  • domain assumption: Differentiable interpolation can extract sufficient geometric features from a sparse set of predicted coordinates without loss of manipulation-relevant information.
    Invoked when claiming that drastic token reduction preserves success rate.
invented entities (1)
  • Differentiable Grid Sampler (GridS) module · no independent evidence
    purpose: Performs task-aware continuous resampling of visual tokens by predicting salient coordinates and differentiable interpolation.
    New plug-and-play component introduced to replace standard pruning.

pith-pipeline@v0.9.0 · 5524 in / 1317 out tokens · 75105 ms · 2026-05-13T05:37:23.048495+00:00 · methodology


Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 11 internal anchors

  1. Action-aware Dynamic Pruning for Efficient Vision-Language-Action Manipulation. The Fourteenth International Conference on Learning Representations.
  2. Towards a Unified Understanding of Robot Manipulation: A Comprehensive Survey. arXiv preprint arXiv:2510.10903.
  3. π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164, 2024.
  4. OpenVLA: An Open-Source Vision-Language-Action Model. Proceedings of The 8th Conference on Robot Learning, 2025.
  5. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Conference on Robot Learning, 2023.
  6. DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping. Proceedings of the AAAI Conference on Artificial Intelligence.
  7. FAST: Efficient Action Tokenization for Vision-Language-Action Models. arXiv preprint arXiv:2501.09747.
  8. Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning. arXiv preprint arXiv:2508.10399.
  9. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations, 2021.
  10. SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model. arXiv preprint arXiv:2501.15830.
  11. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  12. SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Proceedings of the AAAI Conference on Artificial Intelligence.
  13. EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models (Yang, Yantai; Wang, Yuhao; Wen, Zichen; et al.).
  14. VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
  15. Rethinking Visual Token Reduction in LVLMs under Cross-Modal Misalignment. Proceedings of the AAAI Conference on Artificial Intelligence.
  16. SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference (Zhang, Yuan; Fan, Chun-Kai; Ma, Junpeng; et al.). 2025.
  17. RoboPoint: A Vision-Language Model for Spatial Affordance Prediction in Robotics. Proceedings of The 8th Conference on Robot Learning, 2025.
  18. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success. arXiv preprint arXiv:2502.19645.
  19. π0.5: A Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054, 2025.
  20. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics. arXiv preprint arXiv:2506.01844.
  21. RT-1: Robotics Transformer for Real-World Control at Scale. arXiv preprint arXiv:2212.06817.
  22. PaLM-E: An Embodied Multimodal Language Model (Driess, Danny; Xia, Fei; Sajjadi, Mehdi S. M.; Lynch, Corey; et al.).
  23. Real-Time Execution of Action Chunking Flow Policies. Advances in Neural Information Processing Systems.
  24. VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference. arXiv preprint arXiv:2512.01031.
  25. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification. Advances in Neural Information Processing Systems.
  26. Deformable Convolutional Networks. Proceedings of the IEEE International Conference on Computer Vision.
  27. Deformable DETR: Deformable Transformers for End-to-End Object Detection (Zhu, Xizhou; Su, Weijie; Lu, Lewei; Li, Bin; Wang, Xiaogang; Dai, Jifeng). 2021.
  28. Vision Transformer with Deformable Attention. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  29. Learning Continuous Image Representation with Local Implicit Image Function. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  30. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Communications of the ACM, 2021.
  31. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems.
  32. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. Advances in Neural Information Processing Systems.
  33. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation. arXiv preprint arXiv:2411.19650.
  34. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models. European Conference on Computer Vision, 2024.
  35. MuJoCo: A Physics Engine for Model-Based Control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
  36. LeRobot (Cadene, Remi; Alibert, Simon; Soare, Alexander; et al.). GitHub repository.
  37. JAX: Composable Transformations of Python+NumPy Programs, version 0.3.13, 2018. Online at http://github.com/jax-ml/jax (accessed October 31, 2025).
  38. Diffusion-Based Imaginative Coordination for Bimanual Manipulation. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  39. DINOv2: Learning Robust Visual Features without Supervision (Oquab, Maxime; Darcet, Timothée; et al.). Transactions on Machine Learning Research, 2024.
  40. Segment Anything. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  41. LoRA: Low-Rank Adaptation of Large Language Models (Hu, Edward J.; Shen, Yelong; Wallis, Phillip; Allen-Zhu, Zeyuan; Li, Yuanzhi; Wang, Shean; Wang, Lu; Chen, Weizhu). 2022.
  42. Parameter-Efficient Transfer Learning for NLP. International Conference on Machine Learning, 2019.
  43. Learning is Forgetting (Conklin, Henry; Hosking, Tom; Yi-Chern, Tan; Cohen, Jonathan D.; Leslie, Sarah-Jane; Griffiths, Thomas L.; Bartolo, Max; Goldfarb-Tarrant, Seraphina). 2026.
  44. LIBERO-Plus: In-Depth Robustness Analysis of Vision-Language-Action Models. arXiv preprint arXiv:2510.13626.
  45. RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins. Proceedings of the Computer Vision and Pattern Recognition Conference.
  46. X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model. arXiv preprint arXiv:2510.10274.
  47. RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation. arXiv preprint arXiv:2506.18088.
  48. Sigmoid Loss for Language Image Pre-Training. Proceedings of the IEEE/CVF International Conference on Computer Vision.