pith · machine review for the scientific record

arxiv: 2605.10993 · v1 · submitted 2026-05-09 · 💻 cs.RO

Recognition: no theorem link

ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models

Yanbin Hu, Jin Cui, Jiayi Lu, Ruixuan Yang, Jun Ye, Boran Zhao, Xingyu Chen, Xuguang Lan, Pengju Ren


Pith reviewed 2026-05-13 07:08 UTC · model grok-4.3

classification 💻 cs.RO
keywords hierarchical memory · vision-language-action · hyperbolic embedding · long-horizon manipulation · semantic memory tree · experience consolidation · robot generalization

The pith

ECHO maps VLA hidden states into a continuous hierarchical space with a hyperbolic autoencoder to organize experiences into a retrievable semantic memory tree.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ECHO as a memory framework for vision-language-action models that addresses limited capacity in long-horizon tasks by moving beyond flat storage. It employs a hyperbolic autoencoder to embed model states into a continuous hierarchical space, then applies entailment constraints to build a semantic memory tree that enables top-down retrieval. A background consolidation process continuously refines this tree through geometric interpolation and structural splitting, allowing synthesis of virtual experiences. When integrated into the π0 model, the approach produces measurable gains on standard benchmarks for manipulation tasks. The central demonstration is improved execution success and better handling of novel task compositions.

Core claim

ECHO maps VLA hidden states into a continuous hierarchical space via a hyperbolic autoencoder. Entailment constraints organize the resulting vectors into a semantic memory tree that supports efficient top-down retrieval. A background consolidation mechanism refines the tree over time by geometric interpolation and structural splitting, enabling virtual memory synthesis. Added to the π0 foundation model, this produces a 12.8 percent absolute increase in execution success rate on LIBERO-Long tasks together with stronger compositional generalization on cross-suite unseen long-horizon problems.
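
A minimal sketch of the geometry this claim leans on, under the assumption that the encoder uses the standard exponential map at the origin of the Poincaré ball; the paper's autoencoder is surely richer, and the linear projection below is only a placeholder:

```python
# Hedged sketch: projecting a Euclidean hidden state into the Poincare ball via
# the exponential map at the origin, exp_0(v) = tanh(sqrt(c)*||v||) * v / (sqrt(c)*||v||).
# This is an assumption about mechanism, not the authors' exact architecture.
import numpy as np

def expmap0(v: np.ndarray, c: float = 1.0) -> np.ndarray:
    """Map a tangent vector at the origin into the Poincare ball of curvature -c."""
    norm = np.linalg.norm(v)
    if norm < 1e-9:                      # the origin maps to itself
        return np.zeros_like(v)
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

rng = np.random.default_rng(0)
hidden = rng.standard_normal(2048)                    # stand-in for a VLA hidden state
W = rng.standard_normal((32, 2048)) / np.sqrt(2048)   # placeholder linear encoder
z = expmap0(W @ hidden)                               # ||z|| < 1: a point in the unit ball
print(np.linalg.norm(z))  # in Poincare embeddings, radius tends to track hierarchy depth
```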

What carries the argument

The hyperbolic autoencoder with entailment constraints that embeds VLA states into continuous hierarchical space and constructs an organizable semantic memory tree for retrieval and refinement.
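
The entailment constraints are not spelled out in the material above; a reasonable reading, given the paper's lineage, is the hyperbolic entailment-cone construction of Ganea et al. (2018). The sketch below follows that formulation as an assumption, not the authors' loss: a parent entails a child when the child lies inside the cone rooted at the parent, and the positive angle excess acts as a penalty.

```python
# Hedged sketch of a hyperbolic entailment-cone check in the unit Poincare ball,
# after Ganea et al. (2018). E(parent, child) = 0 means the child falls inside
# the parent's cone, i.e. the parent "entails" the child.
import numpy as np

def cone_aperture(x: np.ndarray, K: float = 0.1) -> float:
    """Half-aperture psi(x) of the entailment cone rooted at x (needs ||x|| >= K)."""
    r = np.linalg.norm(x)
    return np.arcsin(np.clip(K * (1 - r**2) / r, -1.0, 1.0))

def cone_angle(x: np.ndarray, y: np.ndarray) -> float:
    """Angle Xi(x, y) between the cone axis at x and the direction to y."""
    xy, x2, y2 = x @ y, x @ x, y @ y
    num = xy * (1 + x2) - x2 * (1 + y2)
    den = np.linalg.norm(x) * np.linalg.norm(x - y) * np.sqrt(1 + x2 * y2 - 2 * xy)
    return np.arccos(np.clip(num / den, -1.0, 1.0))

def entailment_energy(parent: np.ndarray, child: np.ndarray) -> float:
    """Constraint penalty: positive exactly when the child exits the parent's cone."""
    return max(0.0, cone_angle(parent, child) - cone_aperture(parent))

parent, child = np.array([0.3, 0.0]), np.array([0.6, 0.05])
print(entailment_energy(parent, child))  # ~0.0: child sits inside parent's cone
```

Cones open away from the origin, so deeper (larger-norm) points can only be entailed by shallower ones; that asymmetry is what lets a soft constraint induce a tree.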

If this is right

  • Enables efficient top-down retrieval from the semantic memory tree during action generation (see the retrieval sketch after this list).
  • Supports continuous refinement and virtual synthesis of experiences through geometric operations.
  • Delivers improved compositional generalization on long-horizon tasks not seen during training.
  • Provides a structural prior for manipulation categories that flat memory lacks.
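
A hedged sketch of the top-down retrieval in the first point: greedy descent from the root, choosing at each level the child nearest the query under the Poincaré metric. The tree layout and routing rule here are illustrative assumptions, not the paper's algorithm.

```python
# Hedged sketch of top-down retrieval over a semantic memory tree: descend
# greedily toward the child nearest the query under the Poincare distance.
import numpy as np

def poincare_dist(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance in the unit Poincare ball."""
    d2 = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u**2)) * (1 - np.sum(v**2))
    return np.arccosh(1 + 2 * d2 / denom)

class Node:
    def __init__(self, z, children=()):
        self.z, self.children = np.asarray(z, float), list(children)

def retrieve(root: Node, query: np.ndarray) -> Node:
    node = root
    while node.children:                       # O(depth * branching) visits,
        node = min(node.children,              # vs. O(N) for a flat memory scan
                   key=lambda c: poincare_dist(c.z, query))
    return node

tree = Node([0.0, 0.0], [Node([0.5, 0.0], [Node([0.8, 0.1])]),
                         Node([-0.5, 0.2])])
print(retrieve(tree, np.array([0.7, 0.1])).z)  # -> leaf [0.8, 0.1]
```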

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Hyperbolic geometry may prove broadly useful for representing nested task hierarchies in sequential decision systems.
  • The consolidation mechanism could reduce the need for ever-larger discrete memory buffers as task horizons grow (see the interpolation sketch after this list).
  • The same continuous-space organization might transfer to other embodied domains that require long-term experience reuse.
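
The consolidation point has a concrete geometric reading. A hedged sketch of interpolation-based merging, assuming the standard gyrovector geodesic in the unit Poincaré ball (curvature c = 1); the paper's actual merge and split policy is not specified in the material above.

```python
# Hedged sketch of consolidation by geodesic interpolation: two stored memories
# are merged into one "virtual" memory at the midpoint of the Poincare geodesic
# joining them. gamma(t) = x (+) (t (*) ((-x) (+) y)) in gyrovector notation.
import numpy as np

def mobius_add(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Mobius addition in the unit Poincare ball (c = 1)."""
    xy, x2, y2 = x @ y, x @ x, y @ y
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

def mobius_scale(r: float, v: np.ndarray) -> np.ndarray:
    """Mobius scalar multiplication r (*) v."""
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n < 1e-9 else np.tanh(r * np.arctanh(n)) * v / n

def geodesic(x: np.ndarray, y: np.ndarray, t: float) -> np.ndarray:
    """Point at fraction t along the geodesic from x to y (t = 0.5: midpoint)."""
    return mobius_add(x, mobius_scale(t, mobius_add(-x, y)))

m1, m2 = np.array([0.6, 0.1]), np.array([0.5, -0.3])
virtual = geodesic(m1, m2, 0.5)           # consolidated stand-in for m1 and m2
print(virtual, np.linalg.norm(virtual))   # stays strictly inside the unit ball
```

Because the geodesic never leaves the ball, interpolated "virtual experiences" remain valid points of the latent space by construction, which is the property that makes this kind of synthesis plausible at all.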

Load-bearing premise

The hyperbolic autoencoder with entailment constraints will reliably embed VLA states into a useful hierarchy that supports retrieval without introducing distortions or overfitting to the training distribution.

What would settle it

Replace the hyperbolic autoencoder and entailment constraints with a standard Euclidean autoencoder of similar capacity, retrain the full system on the same data, and check whether the 12.8 percent success-rate gain on LIBERO-Long disappears.
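
A sketch of what that control isolates (an illustration of the experimental design, not code from the paper): the two conditions share data, training, and parameter count, and differ only in whether the latent is left flat or exp-mapped into the ball.

```python
# Hedged sketch of the matched-capacity control: identical encoder weights,
# identical latent width; the only point of variation is the latent geometry.
import numpy as np

def make_encoder(dim_in: int, dim_z: int, hyperbolic: bool, seed: int = 0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((dim_z, dim_in)) / np.sqrt(dim_in)  # same capacity

    def encode(h: np.ndarray) -> np.ndarray:
        v = W @ h
        if not hyperbolic:
            return v                        # Euclidean control: flat latent
        n = np.linalg.norm(v)               # hyperbolic arm: exp-map into ball
        return np.tanh(n) * v / n if n > 1e-9 else v
    return encode

enc_hyp = make_encoder(2048, 32, hyperbolic=True)
enc_euc = make_encoder(2048, 32, hyperbolic=False)
```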

Figures

Figures reproduced from arXiv: 2605.10993 by Boran Zhao, Jiayi Lu, Jin Cui, Jun Ye, Pengju Ren, Ruixuan Yang, Xingyu Chen, Xuguang Lan, Yanbin Hu.

Figure 1: Conceptual overview of the ECHO framework. (a) Manipulation demonstrations exhibit hierarchical semantics, where high-level task instructions are grounded through sub-tasks, action primitives, and micro-level controls. (b) Instead of organizing experiences as a flat linear memory, ECHO embeds them into a continuous hyperbolic hierarchical space, enabling coarse-to-fine memory organization, entailment-aware…
Figure 2: Overall architecture of the ECHO framework. The system integrates a hierarchical memory module with the π0 backbone. Visual observations are processed by a pre-adapted external VLM, frozen during ECHO training, and mapped into a unified hyperbolic latent space. A gating mechanism fuses the short-term context (z_short) with the retrieved long-term memory (z_mem). The decoded Euclidean context vector (z_ctx) is…
Figure 3: Cross-suite compositional generalization on the…
Figure 4: Retrieval-path evidence for cross-suite compositional memory reuse. (A) Semantic…
Figure 5: Global organization of the consolidated ECHO memory bank. (a) Icicle visualization of the depth-truncated cone memory tree, where bands denote discrete tree depths and segment widths indicate subtree memory counts. (b) Node counts and average subtree sizes across depths, showing how consolidation refines broad clusters into finer semantic branches. (c) Retrieval visits over tree depth, showing that ECHO pe…
Figure 6: System analysis. (a) LIBERO-Long success rate as a function of the base memory injection…
Figure 7: Real-world experimental setup.
Figure 8: Real-world execution sequences of ECHO on three tabletop manipulation tasks: pick-and-place, block stacking, and ring insertion. Each row shows one complete rollout on the Franka platform.
Original abstract

Memory capacity is a critical factor determining the performance of Vision-Language-Action (VLA) models in long-horizon manipulation tasks. Existing memory-augmented architectures primarily rely on linear or flat storage, lacking structural priors for manipulation categories and hierarchical organization. This deficiency hinders efficient experience retrieval and limits generalization to unseen long-horizon task compositions. Inspired by the hierarchical organization of human experience, we propose ECHO (Experience Consolidation and Hierarchical Organization), a novel memory framework operating within a Continuous Hierarchical Space. By employing a hyperbolic autoencoder, ECHO maps VLA hidden states into this space. Leveraging hyperbolic metrics and entailment constraint mechanisms, experience vectors are organized into a semantic memory tree that supports efficient top-down retrieval. In parallel, a background consolidation mechanism continuously refines the memory tree through geometric interpolation and structural splitting, supporting virtual memory synthesis in the continuous space. We integrate ECHO into the $\pi_0$ foundation model. Evaluations on LIBERO and preliminary real-world experiments demonstrate the effectiveness of our approach, notably achieving a 12.8% absolute improvement in execution success rate over the $\pi_0$ baseline on LIBERO-Long, while improving compositional generalization on cross-suite unseen long-horizon tasks.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ECHO, a memory framework for Vision-Language-Action (VLA) models that maps hidden states into a continuous hierarchical space using a hyperbolic autoencoder. It organizes experience vectors into a semantic memory tree via hyperbolic metrics and entailment constraints, supports top-down retrieval, and performs background consolidation through geometric interpolation and structural splitting. Integrated into the π0 foundation model, the work claims a 12.8% absolute improvement in execution success rate over the π0 baseline on LIBERO-Long tasks, along with improved compositional generalization on cross-suite unseen long-horizon tasks.

Significance. If the reported gains are shown to arise specifically from the hierarchical geometry rather than added capacity or regularization, the approach could provide a useful structural prior for efficient retrieval and virtual memory synthesis in long-horizon manipulation, addressing a recognized limitation of flat memory in current VLA architectures.

major comments (3)
  1. [Abstract, §4 Experiments] The central claim of a 12.8% absolute success-rate gain on LIBERO-Long and improved cross-suite generalization is presented without ablations, error bars, or controls that isolate the contribution of the hyperbolic tree versus other factors such as increased model capacity or the entailment loss; this leaves open whether the improvement is load-bearing on the claimed continuous hierarchical space.
  2. [§3.2 Hyperbolic Autoencoder] The assumption that the Poincaré-ball metric plus entailment constraints will produce artifact-free embeddings aligned with manipulation semantics is not verified; no quantitative checks (e.g., hierarchy-preservation metrics, embedding-collapse analysis, or retrieval precision on held-out states) are reported to confirm that the geometry supports the claimed top-down retrieval and interpolation without introducing spurious states.
  3. [§3.3 Background Consolidation] The mechanism of continuous refinement via interpolation and splitting is described at a high level, but the manuscript supplies no analysis of how these operations affect policy execution stability or whether they risk degrading performance on the original training distribution.
minor comments (2)
  1. [§3] Notation for the entailment constraint and the Poincaré-ball radius is introduced without an explicit equation reference in the methods section, making it difficult to reproduce the exact loss formulation.
  2. [§4] The preliminary real-world experiments are mentioned only briefly; adding quantitative metrics and task descriptions would strengthen the generalization claim.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We provide point-by-point responses below and outline revisions to address the concerns regarding empirical validation of the hierarchical memory contributions.

Point-by-point responses
  1. Referee: [Abstract, §4 Experiments] The central claim of a 12.8% absolute success-rate gain on LIBERO-Long and improved cross-suite generalization is presented without ablations, error bars, or controls that isolate the contribution of the hyperbolic tree versus other factors such as increased model capacity or the entailment loss; this leaves open whether the improvement is load-bearing on the claimed continuous hierarchical space.

    Authors: We agree that the current manuscript reports the performance gains without explicit ablations isolating the hyperbolic tree from capacity or regularization effects. In the revised version we will add: (1) a Euclidean-space control with matched capacity, (2) an ablation removing the entailment loss, and (3) error bars from multiple random seeds. These experiments will demonstrate that the reported 12.8% gain is attributable to the continuous hierarchical geometry. revision: yes

  2. Referee: [§3.2 Hyperbolic Autoencoder] The assumption that the Poincaré-ball metric plus entailment constraints will produce artifact-free embeddings aligned with manipulation semantics is not verified; no quantitative checks (e.g., hierarchy-preservation metrics, embedding-collapse analysis, or retrieval precision on held-out states) are reported to confirm that the geometry supports the claimed top-down retrieval and interpolation without introducing spurious states.

    Authors: We acknowledge the absence of quantitative embedding validation. The revision will include hierarchy-preservation metrics (entailment satisfaction rate across the tree), embedding-collapse diagnostics (variance and norm statistics in the Poincaré ball), and retrieval precision/recall on held-out manipulation states; a sketch of such diagnostics follows these responses. These checks will confirm that the geometry supports artifact-free top-down retrieval and interpolation. revision: yes

  3. Referee: [§3.3 Background Consolidation] The mechanism of continuous refinement via interpolation and splitting is described at a high level, but the manuscript supplies no analysis of how these operations affect policy execution stability or whether they risk degrading performance on the original training distribution.

    Authors: The referee correctly identifies the missing stability analysis. We will add experiments in the revision that measure policy success rates on the original training tasks before and after multiple consolidation rounds, together with execution-stability metrics. These results will show that geometric interpolation and structural splitting do not degrade performance on the seen distribution. revision: yes
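
A hedged sketch of the diagnostics promised in response 2. The function names, thresholds, and collapse heuristics are assumptions; the per-pair energy function is of the entailment-cone kind sketched earlier in this review.

```python
# Hedged sketch of embedding diagnostics: entailment satisfaction rate over
# (parent, child) pairs, plus radius statistics that would reveal collapse
# toward the origin or pile-up at the rim of the Poincare ball.
import numpy as np

def satisfaction_rate(pairs, energy_fn, tol: float = 1e-3) -> float:
    """Fraction of (parent, child) pairs whose cone-energy penalty is ~zero."""
    return float(np.mean([energy_fn(p, c) <= tol for p, c in pairs]))

def collapse_stats(Z: np.ndarray) -> dict:
    """Radius statistics in the ball; near-constant radii suggest collapse."""
    r = np.linalg.norm(Z, axis=1)
    return {"mean_radius": float(r.mean()),
            "std_radius": float(r.std()),
            "frac_near_rim": float(np.mean(r > 0.99))}

rng = np.random.default_rng(1)
Z = rng.uniform(-0.3, 0.3, size=(100, 8))   # toy embeddings inside the ball
print(collapse_stats(Z))
```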

Circularity Check

0 steps flagged

No circularity detected; ECHO is presented as an additive architectural module with empirical validation.

Full rationale

The paper introduces ECHO as a new memory framework that maps VLA hidden states via a hyperbolic autoencoder into a continuous hierarchical space, with organization via entailment constraints and background consolidation. No equations, derivations, or self-referential definitions appear that reduce the reported 12.8% success-rate gain or generalization improvements to fitted parameters defined by the same data or to self-citations whose load-bearing content is unverified. The framework is explicitly described as integrated into the existing π0 model, and performance claims rest on evaluations on LIBERO and real-world experiments rather than any closed mathematical chain. This is the standard case of an additive proposal whose central claims remain externally falsifiable through the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified premise that hyperbolic geometry plus entailment constraints will produce a semantically meaningful memory tree for manipulation experiences; no independent evidence for this mapping is supplied in the abstract.

axioms (1)
  • domain assumption: Hyperbolic space with entailment constraints organizes VLA hidden states into a useful semantic hierarchy
    Invoked to justify the memory tree construction and retrieval mechanism.
invented entities (1)
  • Continuous Hierarchical Space: no independent evidence
    purpose: To embed and organize experience vectors hierarchically for retrieval and synthesis
    New space introduced via the hyperbolic autoencoder; no external falsifiable handle provided.

pith-pipeline@v0.9.0 · 5533 in / 1331 out tokens · 53748 ms · 2026-05-13T07:08:01.430739+00:00 · methodology

