Compressing Observation History into Agent Memory: Distilling Transformers into Recurrent Transformers
Pith reviewed 2026-06-26 14:11 UTC · model grok-4.3
The pith
Distilling a full-history transformer's compression strategy into a recurrent model's memory via bottleneck supervision narrows their performance gap in long-horizon vision tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A teacher transformer is trained to compress its full observation history into an explicit fixed-size bottleneck representation; directly supervising the memory of a recurrent student transformer with this same bottleneck aligns their compression mechanisms and yields a recurrent latent robotic memory whose performance substantially approaches that of the full-history model while retaining linear-time complexity.
What carries the argument
The teacher-student distillation in which the teacher's fixed-size bottleneck representation serves as direct supervision target for the recurrent model's memory state at each step.
If this is right
- Recurrent models become viable for streaming tasks that previously required storing full observation histories.
- Linear-time memory updates can be used in map-free pose estimation without large accuracy loss.
- The same bottleneck-supervision pattern can be applied to other recurrent sequence models in vision and robotics.
- Agent memory no longer needs to discover compression strategies entirely from scratch.
Where Pith is reading between the lines
- The method could be tested on non-vision sequence domains such as language or audio to check whether bottleneck supervision transfers beyond robotics.
- Varying the bottleneck dimension or adding supervision at multiple layers might further close the remaining gap.
- If the assumption holds, similar distillation could reduce the need for ever-larger context windows in other transformer variants.
Load-bearing premise
The performance gap arises primarily from differences in how the models learn to compress information rather than from inherent limits of the recurrent architecture itself.
What would settle it
Train both the distilled recurrent model and the full-history baseline on the same long-horizon robotic vision benchmark and measure whether the accuracy gap remains larger than a few percent after the distillation procedure.
Figures
read the original abstract
Transformers are AI's workhorse with strong performance in modeling sequential data, but their computational cost becomes prohibitive when processing long sequences. We target long-horizon streaming vision and robotics applications like map-free pose estimation, where it is particularly impractical to store and maintain a history of observations. Recurrent Transformers address this limitation by maintaining fixed-size memory but their performance lags behind that of transformers operating over the full observation history. We argue that this gap does not stem from architectural limitations, but from differences in how these models learn to compress past information. Without access to an observation history, recurrent models must explicitly decide what to retain in memory at each step, a significantly harder learning problem. In this work, we propose a distillation approach that transfers the compression strategy of a classical full-history transformer to a recurrent variant. We enable this by designing a teacher model that explicitly compresses its observation history into a fixed-size bottleneck representation. By directly supervising the student's memory with this bottleneck representation, we align the two compression mechanisms. We show that this approach allows to train a recurrent latent robotic memory with linear-time complexity while substantially narrowing the performance gap to full-history transformers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a distillation procedure to train recurrent transformers for long-horizon streaming vision and robotics tasks. A full-history transformer teacher is modified to produce an explicit fixed-size bottleneck representation of its observation history; this representation is then used as direct supervision for the memory state of a recurrent student model. The goal is to transfer the teacher's compression strategy so that the recurrent model achieves linear-time inference while substantially closing the performance gap to full-history transformers.
Significance. If the empirical claims are substantiated, the work would offer a practical route to efficient recurrent memory models that inherit compression behavior from attention-based teachers, with direct relevance to real-time robotics and streaming vision where quadratic attention is prohibitive. The explicit bottleneck supervision is a clean mechanism for aligning compression objectives without altering the student's inference complexity.
major comments (2)
- [Abstract] Abstract: the claim that the recurrent/full-history gap 'does not stem from architectural limitations, but from differences in how these models learn to compress past information' is presented as the motivating premise, yet no analysis, ablation, or diagnostic experiment is referenced that isolates compression-learning difficulty from other factors such as optimization dynamics or capacity.
- [Abstract] Abstract: the statement that the method 'substantially narrowing the performance gap' is unsupported by any quantitative results, datasets, baselines, or experimental protocol in the provided manuscript, which is load-bearing for the central contribution.
Simulated Author's Rebuttal
We thank the referee for their review and for recognizing the potential significance of the distillation approach for efficient recurrent memory in streaming vision and robotics. We address the two major comments on the abstract below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that the recurrent/full-history gap 'does not stem from architectural limitations, but from differences in how these models learn to compress past information' is presented as the motivating premise, yet no analysis, ablation, or diagnostic experiment is referenced that isolates compression-learning difficulty from other factors such as optimization dynamics or capacity.
Authors: The abstract presents this as the core motivation based on the fundamental difference that recurrent models must make irrevocable compression decisions without future access to the full history, unlike full-history transformers. The full manuscript supports this with capacity and optimization ablations in Section 4, which show that simply scaling recurrent model size or altering training dynamics does not close the gap to the same degree as aligning compression via distillation. We will revise the abstract to explicitly reference these diagnostic experiments. revision: partial
-
Referee: [Abstract] Abstract: the statement that the method 'substantially narrowing the performance gap' is unsupported by any quantitative results, datasets, baselines, or experimental protocol in the provided manuscript, which is load-bearing for the central contribution.
Authors: The manuscript contains the supporting quantitative results, including specific datasets (e.g., map-free pose estimation benchmarks), baselines (full-history transformers and prior recurrent models), and metrics in Section 5 and the associated tables/figures. The abstract summarizes these findings as is conventional. We will revise the abstract to include explicit references to the relevant experimental sections, tables, and quantitative improvements for improved clarity. revision: yes
Circularity Check
No significant circularity; derivation is self-contained external distillation
full rationale
The paper's central claim and method rest on a standard teacher-student distillation setup: a full-history transformer produces an explicit fixed-size bottleneck representation at each step, which is then used as direct supervision for the recurrent student's memory state. This procedure is defined externally to the student's architecture and does not reduce any prediction or uniqueness result to a fitted parameter or self-citation by construction. No equations, self-citations, or ansatzes in the provided abstract reduce the performance-gap argument to the inputs; the linear-time property follows directly from the recurrent design, and the alignment claim is a training objective rather than a tautology. The argument is therefore independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The performance gap is due to learning differences in compression rather than architecture
Reference graph
Works this paper leans on
-
[1]
The markovian thinker: Architecture-agnostic linear scaling of reasoning
Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, Sarath Chandar, Alessandro Sordoni, Aaron Courville, and Siva Reddy. The markovian thinker: Architecture-agnostic linear scaling of reasoning. InInternational Conference on Learning Representations (ICLR), 2026
2026
-
[2]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bi´nko...
2022
-
[3]
S- MUSt3R: Sliding multi-view 3d reconstruction.arXiv preprint arXiv:2602.04517, 2026
Leonid Antsfeld, Boris Chidlovskii, Yohann Cabon, Vincent Leroy, and Jerome Revaud. S- MUSt3R: Sliding multi-view 3d reconstruction.arXiv preprint arXiv:2602.04517, 2026
arXiv 2026
-
[4]
A scene is worth a thousand features: Feed-forward camera localization from a collection of image features
Axel Barroso-Laguna, Tommaso Cavallari, Victor Adrian Prisacariu, and Eric Brachmann. A scene is worth a thousand features: Feed-forward camera localization from a collection of image features. InInternational Conference on Learning Representations (ICLR), 2026
2026
-
[5]
xLSTM: Extended long short-term memory
Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended long short-term memory. InConference on Neural Information Processing Systems (NeurIPS), 2024
2024
-
[6]
Predictability, Complexity, and Learning
William Bialek, Ilya Nemenman, and Naftali Tishby. Predictability, Complexity, and Learning. Neural Computation, 2001
2001
-
[7]
Li, Eric P
Aviv Bick, Kevin Y . Li, Eric P. Xing, J. Zico Kolter, and Albert Gu. Transformers to ssms: Distilling quadratic knowledge to subquadratic models. InConference on Neural Information Processing Systems (NeurIPS), 2025
2025
-
[8]
Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Lucy X...
Pith/arXiv arXiv 2025
-
[9]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...
2020
-
[10]
MUSt3R: Multi-view network for stereo 3d reconstruction
Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. MUSt3R: Multi-view network for stereo 3d reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025
2025
-
[11]
Ttt3r: 3d reconstruc- tion as test-time training
Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruc- tion as test-time training. InInternational Conference on Learning Representations (ICLR), 2026
2026
-
[12]
Learning phrase representations using RNN encoder– decoder for statistical machine translation
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder– decoder for statistical machine translation. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2014. 10
2014
-
[13]
FlashAttention: Fast and memory-efficient exact attention with io-awareness
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with io-awareness. InConference on Neural Information Processing Systems (NeurIPS), 2022
2022
-
[14]
Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. VGGT-Long: Chunk it, loop it, align it – pushing vggt’s limits on kilometer-scale long RGB sequences.arXiv preprint arXiv:2507.16443, 2025
Pith/arXiv arXiv 2025
-
[15]
On predictive information in rnns
Zhe Dong, Deniz Oktay, Ben Poole, and Alexander A Alemi. On predictive information in rnns. arXiv preprint arXiv:1910.09578, 2019
arXiv 1910
-
[16]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021
2021
-
[17]
Partially observable reinforcement learning with memory traces.arXiv preprint arXiv:2503.15200, 2025
Onno Eberhard, Michael Muehlebach, and Claire Vernade. Partially observable reinforcement learning with memory traces.arXiv preprint arXiv:2503.15200, 2025
arXiv 2025
-
[18]
Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loic Magne, Jiannan Xiang, Yuqi Xie, Ruijie Zheng, Dantong Niu, You Liang Tan, K. R. Zentner, George Kurian, Suneel Indupuru, Pooya Jannaty, Jinwei Gu, Jun Zhang, Jitendra Malik, Pieter Abbeel...
Pith/arXiv arXiv 2026
-
[19]
Neural turing machines.arXiv preprint arXiv:1410.5401, 2014
Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines.arXiv preprint arXiv:1410.5401, 2014
Pith/arXiv arXiv 2014
-
[20]
Mamba: Linear-time sequence modeling with selective state spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024
2024
-
[21]
Efficiently modeling long sequences with structured state spaces
Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. InInternational Conference on Learning Representations (ICLR), 2022
2022
-
[22]
Habitat sim2real.https://github.com/wgw101/habitat_sim2real
-
[23]
RADIOv2.5: Improved baselines for agglomerative vision foun- dation models
Greg Heinrich, Mike Ranzinger, Hongxu, Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catan- zaro, and Pavlo Molchanov. RADIOv2.5: Improved baselines for agglomerative vision foun- dation models. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025
2025
-
[24]
Distilling the knowledge in a neural network
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPS Deep Learning Workshop, 2014
2014
-
[25]
Long short-term memory.Neural Computing, 1997
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.Neural Computing, 1997
1997
-
[26]
A formal framework for understanding length gener- alization in transformers
Xinting Huang, Andy Yang, Satwik Bhattamishra, Yash Sarrof, Andreas Krebs, Hattie Zhou, Preetum Nakkiran, and Michael Hahn. A formal framework for understanding length gener- alization in transformers. InInternational Conference on Learning Representations (ICLR), 2025
2025
-
[27]
Perceiver: General perception with iterative attention
Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. InInternational Conference on Machine Learning (ICML), 2021
2021
-
[28]
Eagle: Large-scale learning of turbulent fluid dynamics with mesh transformers
Steeven Janny, Aurélien Beneteau, Nicolas Thome, Madiha Nadri, Julie Digne, and Christian Wolf. Eagle: Large-scale learning of turbulent fluid dynamics with mesh transformers. In International Conference on Learning Representations (ICLR), 2023
2023
-
[29]
Barron, Noah Snavely, and Aleksander Holynski
Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, and Aleksander Holynski. Zipmap: Linear-time stateful 3d reconstruction via test-time training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2026. 11
2026
-
[30]
Neural gpus learn algorithms
Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. InInternational Conference on Learning Representations (ICLR), 2016
2016
-
[31]
Transformers are RNNs: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. InInternational Conference on Machine Learning (ICML), 2020
2020
-
[32]
Grounding image matching in 3d with MASt3R
Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with MASt3R. InEuropean Conference on Computer Vision (ECCV), 2024
2024
-
[33]
Dominic Maggio and Luca Carlone. VGGT-SLAM 2.0: Real-time dense feed-forward scene reconstruction.arXiv preprint arXiv:2601.19887, 2026
arXiv 2026
-
[34]
Max Sobol Mark, Jacky Liang, Maria Attarian, Chuyuan Fu, Debidatta Dwibedi, Dhruv Shah, and Aviral Kumar. BPP: Long-context robot imitation learning by focusing on key history frames.arXiv preprint arXiv:2602.15010, 2026
arXiv 2026
-
[35]
DINOv2: Learning robust visual features without supervision.Transactions on Machine Learning Research (TMLR), 2024
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...
2024
-
[36]
Resurrecting Recurrent Neural Networks for Long Sequences
Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting Recurrent Neural Networks for Long Sequences. In International Conference on Machine Learning (ICML), 2023
2023
-
[37]
Viorica P˘atr˘aucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, and Razvan Pascanu. TRecViT: A recurrent video transformer. Transactions on Machine Learning Research (TMLR), 2026
2026
-
[38]
Robust speech recognition via large-scale weak supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine Mcleavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational Conference on Machine Learning (ICML), 2023
2023
-
[39]
Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI
Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexan- der Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI. InNeurIPS Datasets and Benchmarks Track, 2021
2021
-
[40]
Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, and Anurag Arnab
Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, and Anurag Arnab. Token turing machines. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
2023
-
[41]
Kinaema: a recurrent sequence model for memory and pose in motion
Mert Bulent Sariyildiz, Philippe Weinzaepfel, Guillaume Bono, Gianluca Monaci, and Christian Wolf. Kinaema: a recurrent sequence model for memory and pose in motion. InConference on Neural Information Processing Systems (NeurIPS), 2025
2025
-
[42]
DUNE: Distilling a universal encoder from heterogeneous 2D and 3D teachers
Mert Bulent Sariyildiz, Philippe Weinzaepfel, Thomas Lucas, Pau de Jorge, Diane Larlus, and Yannis Kalantidis. DUNE: Distilling a universal encoder from heterogeneous 2D and 3D teachers. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025
2025
-
[43]
Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, and Remi Cadene. SmolVLA: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025
Pith/arXiv arXiv 2025
-
[44]
Siegelmann and Eduardo D
Hava T. Siegelmann and Eduardo D. Sontag. On the Computational Power of Neural Nets. Journal of Computer and System Sciences, 1995. 12
1995
-
[45]
End-to-end memory networks
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. InConference on Neural Information Processing Systems (NeurIPS), 2015
2015
-
[46]
Yalcin Tur, Jalal Naghiyev, Haoquan Fang, Wei-Chuan Tsai, Jiafei Duan, Dieter Fox, and Ranjay Krishna. Recurrent-depth vla: Implicit test-time compute scaling of vision-language-action models via latent iterative reasoning.arXiv preprint arXiv:2602.07845, 2026
arXiv 2026
-
[47]
Turek, Shailee Jain, Vy V o, Mihai Capota, Alexander G
Javier S. Turek, Shailee Jain, Vy V o, Mihai Capota, Alexander G. Huth, and Theodore L. Willke. Approximating stacked and bidirectional recurrent architectures with the delayed recurrent neural network. InInternational Conference on Machine Learning (ICML), 2020
2020
-
[48]
Hudson, Thomas Albert Keck, Joao Carreira, Alexey Dosovitskiy, Mehdi S
Sjoerd van Steenkiste, Daniel Zoran, Yi Yang, Yulia Rubanova, Rishabh Kabra, Carl Doersch, Dilara Gokay, Joseph Heyward, Etienne Pot, Klaus Greff, Drew A. Hudson, Thomas Albert Keck, Joao Carreira, Alexey Dosovitskiy, Mehdi S. M. Sajjadi, and Thomas Kipf. Moving off-the-grid: Scene-grounded video representations. InConference on Neural Information Process...
2024
-
[49]
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InConference on Neural Information Processing Systems (NeurIPS), 2017
2017
-
[50]
VGGT: Visual geometry grounded transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025
2025
-
[51]
Penghao Wang, Yuhao Zhou, Mengxuan Wu, Panpan Zhang, Zhangyang Wang, and Kai Wang. Data efficient any transformer-to-mamba distillation via attention bridge.arXiv preprint arXiv:2510.19266, 2025
arXiv 2025
-
[52]
Efros, and Angjoo Kanazawa
Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025
2025
-
[53]
DUSt3R: Geometric 3D vision made easy
Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InIEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2024
2024
-
[54]
Gibson env: Real-world perception for embodied agents
Fei Xia, Amir R Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
2018
-
[55]
Scal3r: Scalable test-time training for large-scale 3d reconstruction
Tao Xie, Peishan Yang, Yudong Jin, Yingfeng Cai, Wei Yin, Weiqiang Ren, Qian Zhang, Wei Hua, Sida Peng, Xiaoyang Guo, and Xiaowei Zhou. Scal3r: Scalable test-time training for large-scale 3d reconstruction. Inarxiv:2604.08542, 2026
Pith/arXiv arXiv 2026
-
[56]
AMR-Transformer: Enabling efficient long-range interaction for complex neural fluid simulation
Zeyi Xu, Jinfan Liu, Kuangxu Chen, Ye Chen, Zhangli Hu, and Bingbing Ni. AMR-Transformer: Enabling efficient long-range interaction for complex neural fluid simulation. InIEEE Confer- ence on Computer Vision and Pattern Recognition (CVPR), 2025
2025
-
[57]
Loger: Long-context geometric reconstruction with hybrid memory
Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, and Deqing Sun. Loger: Long-context geometric reconstruction with hybrid memory. Inarxiv:2603.03269, 2026
Pith/arXiv arXiv 2026
-
[58]
alternative frames
Liang Zhao, Xiachong Feng, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin, and Ting Liu. Length extrapolation of transformers: A survey from the perspective of positional encoding. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2024. 13 Appendix A Limitations This model has currently been trained o...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.