
arxiv: 2606.27872 · v1 · pith:Q7E5PPRSnew · submitted 2026-06-26 · 💻 cs.RO · cs.AI

S²-VLA: State-Space Guided Vision-Language-Action Models for Long-Horizon Manipulation

Pith reviewed 2026-06-29 04:38 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords: vision-language-action models · long-horizon manipulation · adaptive attention · state-space models · robotic manipulation · belief state · dynamic gating · feature fusion

The pith

A state-space belief mechanism enables compact 2B-parameter vision-language-action models to outperform 7B models on long-horizon robotic manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces S²-VLA, which uses a State-Space Guided Adaptive Attention mechanism to maintain a belief state that tracks task progression in robotic manipulation. This belief state generates dynamic gating weights to adaptively combine visual features, task intents from language, and action sequences, allowing the model to adjust its focus as the task evolves through different stages. Static fusion in prior models leads to error accumulation over long horizons, but this adaptive approach aligns information sources with current task needs. The result is that a smaller model achieves better performance than larger ones on benchmarks like LIBERO and SimplerEnv.

Core claim

S²-VLA maintains a belief state inside the SSGAA module that tracks task progression and produces dynamic gating weights. These weights adaptively fuse visual features for spatial perception, task intents for high-level planning, and temporal action sequences for execution consistency. This dynamic fusion replaces fixed-weight combinations and reduces cumulative errors in long-horizon tasks.
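
The fusion described above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not give the belief-state update rule, so the tanh recurrence, the softmax gate, the matrices `A`, `B`, `W`, and the class name `BeliefGatedFusion` are all hypothetical stand-ins rather than the authors' SSGAA module.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

class BeliefGatedFusion:
    """Hypothetical stand-in for SSGAA: a recurrent belief state drives three gates."""

    def __init__(self, feat_dim, state_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed parameters; the paper's actual parameterization is not given.
        self.A = 0.1 * rng.standard_normal((state_dim, state_dim))
        self.B = 0.1 * rng.standard_normal((state_dim, 3 * feat_dim))
        self.W = 0.1 * rng.standard_normal((3, state_dim))
        self.h = np.zeros(state_dim)  # belief state, assumed zero-initialized

    def step(self, visual, intent, action):
        # Update the belief state from the previous state and the current inputs.
        x = np.concatenate([visual, intent, action])
        self.h = np.tanh(self.A @ self.h + self.B @ x)
        # Turn the belief state into three non-negative gates that sum to 1.
        gates = softmax(self.W @ self.h)
        # Dynamic fusion: a convex combination of the three feature sources.
        fused = gates[0] * visual + gates[1] * intent + gates[2] * action
        return fused, gates

def static_fusion(visual, intent, action):
    # Fixed-weight baseline: the same uniform weights at every step.
    return (visual + intent + action) / 3.0
```

Calling `step` once per control step yields gates that depend on the history carried in `h`, whereas `static_fusion` applies the same weights regardless of task stage.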

What carries the argument

The State-Space Guided Adaptive Attention (SSGAA) mechanism, which maintains a belief state to generate dynamic gating weights for fusing visual, language, and action representations.

Load-bearing premise

The belief state inside the state-space module accurately tracks task progression and generates gating weights that correctly adapt the fusion of visual, language, and action information at each stage.

What would settle it

Running the model on long-horizon tasks where the belief state fails to update correctly, such as when task stages are ambiguous, and observing whether performance drops below that of static-fusion baselines of similar size.
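
Whether performance "drops below" a baseline can only be judged relative to the noise from a finite number of rollouts. A minimal sketch, assuming success is a binary per-rollout outcome, is to compare Wilson score intervals; the function name and the counts in the usage lines are made-up placeholders, not results from the paper.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# Placeholder counts for illustration only (NOT results from the paper).
adaptive = wilson_interval(successes=41, trials=50)
static = wilson_interval(successes=33, trials=50)
# True only if the adaptive model's lower bound clears the baseline's upper bound.
separated = adaptive[0] > static[1]
```

Non-overlapping intervals are a conservative criterion; a paired comparison on identical seeds and initial states would be sharper if the benchmark supports it.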

Figures

Figures reproduced from arXiv: 2606.27872 by Jing Zhao, Shiliang Sun, Xiangyi Wei, Yang Li, Zhipeng Xie, Zongyi Han.

Figure 1: Illustrative examples comparing our S²-VLA, which incorporates State-Space Guided Adaptive Attention, with static fusion approaches, highlighting its ability to mitigate early-stage bias.
Figure 2: Architecture and workflow of the S²-VLA framework. The workflow processes multimodal inputs through a vision-language backbone, with the core State-Space Guided Adaptive Attention (SSGAA) module adaptively fusing spatial, semantic, and temporal information under the dynamic guidance of the belief state.
Figure 3: Real-world performance of our S²-VLA compared with ACT and π0-FAST.
Figure 4: Gating weight trajectories and stage-aligned keyframes for a representative LIBERO-Long rollout.
Original abstract

Vision-Language-Action (VLA) models have demonstrated strong capabilities in robotic manipulation, but their performance degrades significantly in long-horizon tasks due to cumulative error propagation. This limitation largely arises from static feature fusion mechanisms that rely on fixed weights to combine visual, language, and action representations, preventing the model from adapting to different phases of task execution. To address this limitation, we propose S$^2$-VLA, a framework that introduces a State-Space Guided Adaptive Attention (SSGAA) mechanism. SSGAA maintains a belief state that tracks task progression and generates dynamic gating weights to adaptively fuse information from three complementary sources: visual features for spatial perception, task intents for high-level task planning, and temporal action sequences for execution consistency. This adaptive fusion allows the model to shift its focus throughout task execution, aligning with the evolving requirements of different task stages. Despite its compact 2B parameter size, S$^2$-VLA consistently outperforms larger 7B-scale models and achieves state-of-the-art performance on long-horizon manipulation benchmarks, including LIBERO and SimplerEnv, highlighting the importance of adaptive feature fusion for long-horizon robotic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes S²-VLA, a 2B-parameter Vision-Language-Action model equipped with a State-Space Guided Adaptive Attention (SSGAA) mechanism. SSGAA maintains an internal belief state that tracks task progression and produces dynamic gating weights to adaptively fuse visual features, language-based task intents, and temporal action sequences. The authors claim this enables the compact model to outperform larger 7B-scale VLAs and reach state-of-the-art results on long-horizon benchmarks including LIBERO and SimplerEnv.

Significance. If the performance claims hold under proper controls and the belief state is shown to produce meaningful, stage-aligned gates, the work would demonstrate that state-space-guided adaptive fusion can mitigate error accumulation in long-horizon manipulation more effectively than scale alone. This would be a useful contribution to efficient VLA design for robotics.

major comments (3)
  1. [Abstract] Abstract: the headline claim that the 2B model 'consistently outperforms larger 7B-scale models' and achieves SOTA on LIBERO/SimplerEnv is presented with no quantitative results, tables, error bars, or baseline comparisons. This prevents any evaluation of whether the gains exist or are attributable to SSGAA.
  2. [Method (SSGAA)] SSGAA description: no update rule, state dimension, initialization, or training objective for the belief state is supplied. Without these, it is impossible to determine whether the state actually tracks task phases or collapses to a near-constant gate, rendering the adaptive-fusion explanation unsupported.
  3. [Experiments] Experiments: no ablation that disables or freezes the belief-state gating (e.g., fixed-weight fusion baseline) and no diagnostic plots of gate trajectories versus ground-truth stage boundaries are referenced. These are required to establish causality between the adaptive mechanism and the reported performance.
minor comments (2)
  1. [Notation] The three input sources and the resulting gated representation should be given explicit mathematical notation and an equation for the fusion step.
  2. [Benchmarks] Clarify the precise task suites, success metrics, and number of trials used for the LIBERO and SimplerEnv results to support reproducibility.
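
As an illustration of the kind of notation minor comment 1 requests, one possible formulation is given below. Every symbol is an assumption made for this sketch: $v_t$, $\ell_t$, $a_t$ denote the visual, language-intent, and action features at step $t$, $b_t$ the belief state, $f_\theta$ an unspecified state-space update, and $W_g$ a gate projection. None of these are taken from the paper.

```latex
\begin{aligned}
b_t &= f_\theta\bigl(b_{t-1},\, [v_t;\, \ell_t;\, a_t]\bigr), \\
g_t &= \operatorname{softmax}(W_g\, b_t), \qquad
       g_t = \bigl(g_t^{(v)},\, g_t^{(\ell)},\, g_t^{(a)}\bigr), \quad
       g_t^{(v)} + g_t^{(\ell)} + g_t^{(a)} = 1, \\
z_t &= g_t^{(v)}\, v_t + g_t^{(\ell)}\, \ell_t + g_t^{(a)}\, a_t .
\end{aligned}
```

Under this notation, static fusion is the special case in which $g_t$ is a constant vector independent of $t$, for example $g_t = (1/3, 1/3, 1/3)$, which is the fixed-weight baseline that major comment 3 asks for.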

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight areas where additional detail and controls would strengthen the presentation of our claims regarding the SSGAA mechanism and its empirical benefits. We address each major comment below and commit to the indicated revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim that the 2B model 'consistently outperforms larger 7B-scale models' and achieves SOTA on LIBERO/SimplerEnv is presented with no quantitative results, tables, error bars, or baseline comparisons. This prevents any evaluation of whether the gains exist or are attributable to SSGAA.

    Authors: We agree that the abstract would benefit from explicit quantitative support for the headline claims. In the revised manuscript we will insert concise performance highlights (e.g., mean success rates on LIBERO and SimplerEnv together with direct comparisons to the 7B baselines) and will reference the main result tables and error-bar reporting that already appear in the Experiments section. revision: yes

  2. Referee: [Method (SSGAA)] SSGAA description: no update rule, state dimension, initialization, or training objective for the belief state is supplied. Without these, it is impossible to determine whether the state actually tracks task phases or collapses to a near-constant gate, rendering the adaptive-fusion explanation unsupported.

    Authors: The referee correctly notes that these implementation details are insufficiently specified. We will expand the Method section with a new subsection that supplies the exact state-update recurrence, the belief-state dimensionality, the initialization procedure, and the composite training objective (task loss plus auxiliary term encouraging stage-discriminative states). revision: yes

  3. Referee: [Experiments] Experiments: no ablation that disables or freezes the belief-state gating (e.g., fixed-weight fusion baseline) and no diagnostic plots of gate trajectories versus ground-truth stage boundaries are referenced. These are required to establish causality between the adaptive mechanism and the reported performance.

    Authors: We accept that these controls are necessary to substantiate the causal role of the adaptive gating. The revised Experiments section will add (i) an ablation replacing the learned belief-state gates with fixed uniform weights and (ii) diagnostic plots that overlay the learned gate trajectories against annotated task-stage boundaries on representative LIBERO and SimplerEnv rollouts. revision: yes
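
The diagnostic promised in response 3 could be summarized numerically as well as plotted. A minimal sketch, assuming gate weights are logged as a `(T, 3)` array and each step carries an annotated integer stage label; the function name `gate_diagnostics` and the returned quantities are hypothetical choices for illustration, not part of the paper.

```python
import numpy as np

def gate_diagnostics(gates, stage_ids):
    """Summarize how gate weights vary over time and across annotated stages.

    gates:     (T, 3) array-like, one row of gate weights per control step.
    stage_ids: (T,) array-like of integer stage labels (assumed annotations).
    """
    gates = np.asarray(gates, dtype=float)
    stage_ids = np.asarray(stage_ids)
    # Mean gate vector within each annotated stage.
    per_stage_mean = {
        int(s): gates[stage_ids == s].mean(axis=0) for s in np.unique(stage_ids)
    }
    # Per-gate standard deviation over the whole rollout.
    temporal_std = gates.std(axis=0)
    # Largest difference between any two stage means, over all three gates.
    # A value near zero suggests the gates have collapsed to a near-constant.
    means = np.stack(list(per_stage_mean.values()))
    stage_spread = float((means.max(axis=0) - means.min(axis=0)).max())
    return per_stage_mean, temporal_std, stage_spread
```

A `stage_spread` close to zero on real rollouts would support the referee's concern about collapsed gates, while a large spread alone would not establish causality without the fixed-weight ablation.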

Circularity Check

0 steps flagged

No circularity; architecture proposal with external benchmark evaluation

full rationale

The paper describes an architectural addition (SSGAA belief state and gating) and reports empirical results on LIBERO and SimplerEnv. No equations, parameter-fitting steps, or derivation chains appear in the supplied text. Claims rest on benchmark comparisons rather than any self-referential reduction or self-citation load-bearing step. The central performance assertion is therefore not forced by construction from its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review supplies no information on free parameters, background axioms, or independent evidence for the belief state.

invented entities (1)
  • belief state (no independent evidence)
    purpose: tracks task progression to generate dynamic gating weights for feature fusion
    Central new component of the SSGAA mechanism described in the abstract

pith-pipeline@v0.9.1-grok · 5755 in / 1163 out tokens · 48535 ms · 2026-06-29T04:38:52.251272+00:00 · methodology
