pith. machine review for the scientific record.

arxiv: 2604.26488 · v1 · submitted 2026-04-29 · 💻 cs.CV · cs.LG

Recognition: unknown

Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

David Joseph Tan, Federico Tombari, Hidenobu Matsuki, Martin Sundermeyer, Nikita Araslanov

Pith reviewed 2026-05-07 13:34 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords pixel feature descriptors · linear in-context learning · video representation learning · dynamic scenes · video object segmentation · surface normal estimation · semantic segmentation · spatio-temporal cues

The pith

Linear in-context learning on noisy depth and motion cues from uncurated videos produces pixel descriptors that embed semantic and geometric properties consistently over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LILA, a framework that trains pixel-accurate feature descriptors directly from video sequences of dynamic scenes. Its central training mechanism is linear in-context learning, which operates on cue maps for depth and motion supplied by existing off-the-shelf networks. The method trains on raw, uncurated video collections even though the input cues are noisy, and the resulting descriptors remain stable across frames while encoding both semantic and geometric structure. A reader would care because existing pretraining either ignores scene dynamics or targets coarse action-level reasoning, leaving a gap for dense pixel-level prediction on moving scenes. If the approach holds, it supplies a scalable route to representations that support tasks requiring precise spatial and temporal understanding from ordinary video.

Core claim

We present a framework that learns pixel-accurate feature descriptors from videos, LILA. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps -- depth and motion -- estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner.

What carries the argument

Linear in-context learning applied to noisy spatio-temporal cue maps: this pretext task embeds semantic and geometric properties into temporally consistent pixel-level feature descriptors.
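
As a concrete reading of that mechanism, the sketch below (PyTorch) fits a linear map from context-frame features to context-frame cues in closed form and scores how well that same map reconstructs the query frame's cues. The ridge term, the flattened per-pixel shapes and the plain MSE loss are assumptions layered on the figure captions, not the authors' implementation.

  # Sketch of the linear in-context pretext task; ridge term, shapes and loss are assumptions.
  import torch
  import torch.nn.functional as F

  def linear_incontext_loss(feats_ctx, cues_ctx, feats_qry, cues_qry, ridge=1e-4):
      """feats_*: (N, d) per-pixel features; cues_*: (N, c) stacked depth/motion cues."""
      d = feats_ctx.shape[1]
      # Closed-form ridge regression on the context frame: W* = argmin ||F_ctx W - G_ctx||^2
      A = feats_ctx.T @ feats_ctx + ridge * torch.eye(d, device=feats_ctx.device)
      W = torch.linalg.solve(A, feats_ctx.T @ cues_ctx)  # (d, c); differentiable w.r.t. feats_ctx
      # Reconstruction loss on the temporally offset query frame, using the context-fitted W*
      return F.mse_loss(feats_qry @ W, cues_qry)

Because the closed-form solve is differentiable, gradients reach the encoder through both frames, which is presumably what pushes the features toward temporal consistency.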

If this is right

  • The descriptors directly improve video object segmentation accuracy by supplying frame-to-frame consistent pixel features for label propagation.
  • Surface normal estimation benefits from the geometric structure encoded in the same features.
  • Semantic segmentation on dynamic scenes gains from the temporal consistency without requiring additional video-specific training.
  • The same training procedure scales to dense prediction tasks that current action-level video models cannot address.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reliance on off-the-shelf cues suggests that similar linear learners could bootstrap representations for other modalities once basic estimators exist.
  • Temporally consistent pixel features could support downstream 3D reconstruction pipelines that currently require explicit multi-view geometry.
  • If the linear mechanism generalizes, it may reduce the need for large-scale supervised video annotation in favor of cue-driven self-supervision.

Load-bearing premise

Noisy spatio-temporal cues estimated by off-the-shelf networks are sufficient for the linear in-context learner to embed semantic and geometric properties in a temporally consistent manner on uncurated video data.
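
To make the premise concrete, here is a minimal sketch of how such cue maps might be assembled. depth_net and flow_net are hypothetical wrappers around off-the-shelf estimators (the Figure 7 caption names SEA-RAFT and DepthAnythingV2), and the per-channel normalisation is an assumption.

  # Sketch of building depth + motion cue maps; the estimator interfaces and
  # normalisation are assumptions, not the real APIs of the cited models.
  import torch

  def build_cue_map(frame_t, frame_t_next, depth_net, flow_net):
      """frame_*: (3, H, W) RGB tensors. Returns a (3, H, W) cue map: depth + 2-channel flow."""
      with torch.no_grad():                                       # cues are fixed supervision
          depth = depth_net(frame_t[None])[0]                     # assumed output: (1, H, W)
          flow = flow_net(frame_t[None], frame_t_next[None])[0]   # assumed output: (2, H, W)
      depth = (depth - depth.mean()) / (depth.std() + 1e-6)       # per-frame normalisation (assumption)
      flow = flow / (flow.abs().amax() + 1e-6)
      return torch.cat([depth, flow], dim=0)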

What would settle it

A controlled test on held-out uncurated video: if the learned pixel features show no measurable gain in temporal consistency or video object segmentation accuracy over standard image-pretrained descriptors, the central claim fails.
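
One way to operationalise that test, sketched under assumptions: track pixels with off-the-shelf optical flow and compare feature similarity along the tracks for LILA against an image-pretrained baseline. The warp-based metric below is illustrative, not the paper's protocol.

  # Sketch of a flow-warped temporal-consistency metric; grid_sample conventions
  # follow PyTorch, everything else (feature sources, flow source) is assumed.
  import torch
  import torch.nn.functional as F

  def temporal_consistency(feat_t, feat_t1, flow_t_to_t1):
      """feat_*: (C, H, W) pixel features; flow_t_to_t1: (2, H, W) forward flow in pixels."""
      C, H, W = feat_t.shape
      ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
      gx = 2 * (xs + flow_t_to_t1[0]) / (W - 1) - 1          # normalised x coords in frame t+1
      gy = 2 * (ys + flow_t_to_t1[1]) / (H - 1) - 1          # normalised y coords in frame t+1
      grid = torch.stack([gx, gy], dim=-1)[None]             # (1, H, W, 2)
      warped = F.grid_sample(feat_t1[None], grid, align_corners=True)[0]
      return F.cosine_similarity(feat_t, warped, dim=0).mean()   # higher = more consistent

Reporting this score alongside VOS accuracy for both descriptor sets on held-out uncurated video would give the comparison the test calls for.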

Figures

Figures reproduced from arXiv: 2604.26488 by David Joseph Tan, Federico Tombari, Hidenobu Matsuki, Martin Sundermeyer, Nikita Araslanov.

Figure 1
Figure 1: Overview. We train an encoder-decoder model from unlabelled videos to produce high-resolution (HR), temporally consistent feature maps. The core novelty of our training approach is linear in-context learning, or LILA. Trained on noisy cue maps, such as those provided by off-the-shelf optical flow and monocular depth networks, LILA enhances the low-resolution (LR) encoder features with pixel-level geometry … view at source ↗
Figure 2
Figure 2: Training overview. (a) We train an encoder-decoder network (LILA) to produce pixel-level feature maps (here, x0 and x∆). As the pretext task, we propose linear in-context learning. Solving for a linear projection W∗ w.r.t. context frame It and the corresponding cue maps Gcontext (see (b)), we train the network to minimise the reconstruction loss w.r.t. the features produced for a query frame It+∆ and the c… view at source ↗
Figure 3
Figure 3: Approximating cue maps. LILA learns to approximate the query cues (flow, depth and encoder features) with high fidelity in a temporally consistent fashion. Providing a glimpse into LILA's training process, the predictions here are derived from an optimal linear mapping w.r.t. the context frame that is distinct from the query frame, thereby enforcing spatio-temporal coherence. Note that Eq. (2) has a cl… view at source ↗
Figure 4
Figure 4: Qualitative results on VOS. LILA provides a high degree of spatial detail and strong temporal consistency, allowing for a high-quality segmentation of fine and dynamic structures (e.g. a bicycle). fit a 1 × 1 convolution that maps a feature representation to the segmentation logits. Then, we apply the linear projection to the features extracted from the other frames. Local k-NN. Following previous work [2,… view at source ↗
Figure 5
Figure 5: Qualitative results: surface normal estimation and semantic segmentation. (a) Despite pre-training on dynamic scenes, LILA reveals a finer surface structure than the baseline (DINOv2 [1] / ViT-B14) in predominantly static scenes. (b) LILA enables more accurate semantic segmentation, providing more accurate semantic boundaries and background detail. by 4.3% J&F. This is remarkable, since LILA did not use m… view at source ↗
Figure 6
Figure 6: Further qualitative analysis. (a) A qualitative comparison between LILA and ERM distillation — a distillation baseline based on the external cues without linear in-context learning. LILA yields noticeably sharper feature maps, suggesting that linear in-context learning is effective at handling inherent noise in the external cues. (b) We test LILA on out-of-domain images — here, chest X-rays and aerial im… view at source ↗
Figure 7
Figure 7: The temporal gap trade-off. We report J&F on VOS (DAVIS 2017) with k-NN inference for different LILA models trained with varying sizes of the sampling window ∆. As ∆ increases, cue predictability becomes more challenging. Accuracy degrades gracefully for large ∆. Limitations. LILA relies on cue maps, which derive from state-of-the-art off-the-shelf models (SEA-RAFT [50] for optical flow; DepthAnythingV2 [51]… view at source ↗
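
The Figure 4 caption outlines the linear-probe VOS protocol: fit a 1 × 1 convolution from first-frame features to segmentation logits, then reuse that projection on later frames. A minimal sketch follows; fitting by ridge least squares (rather than gradient descent) and the tensor shapes are assumptions.

  # Sketch of the linear-probe VOS inference mentioned in the Figure 4 caption;
  # the ridge least-squares fit and shapes are assumptions, not the paper's code.
  import torch
  import torch.nn.functional as F

  def fit_linear_probe(feat0, mask0, num_classes, ridge=1e-3):
      """feat0: (C, H, W) first-frame features; mask0: (H, W) integer labels."""
      C = feat0.shape[0]
      X = feat0.flatten(1).T                                   # (H*W, C) per-pixel features
      Y = F.one_hot(mask0.flatten(), num_classes).float()      # (H*W, K) one-hot targets
      A = X.T @ X + ridge * torch.eye(C, device=X.device)
      return torch.linalg.solve(A, X.T @ Y)                    # (C, K): weights of a 1x1 conv

  def propagate(feat_t, W):
      """Apply the first-frame projection to another frame's features; returns (H, W) labels."""
      return torch.einsum("chw,ck->khw", feat_t, W).argmax(0)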
read the original abstract

One of the most exciting applications of vision models involves pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction. We present a framework that learns pixel-accurate feature descriptors from videos, LILA. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps -- depth and motion -- estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner. We demonstrate compelling empirical benefits of the learned representation across a diverse suite of vision tasks: video object segmentation, surface normal estimation and semantic segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 0 minor

Summary. The manuscript introduces LILA, a framework for learning pixel-accurate feature descriptors from dynamic 3D scenes in videos. The core mechanism is linear in-context learning that ingests noisy spatio-temporal cue maps (depth and motion) produced by off-the-shelf networks; training occurs on uncurated video data and is claimed to embed semantic and geometric properties in a temporally consistent manner. The learned representations are evaluated on video object segmentation, surface normal estimation, and semantic segmentation.

Significance. If the empirical claims hold, the approach would supply a scalable route to dense, temporally coherent pixel representations from large-scale uncurated video without requiring curated labels or action-level supervision. The explicit use of linear in-context learners to tolerate noisy external cues is a distinctive technical choice that could influence future video representation learning.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of LILA. The review accurately captures the core contribution: using linear in-context learning on off-the-shelf depth and motion cues to obtain temporally consistent pixel features from uncurated video. We have no major comments to address, as none were raised.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central framework LILA learns pixel features via linear in-context learning applied to external depth and motion cue maps produced by off-the-shelf networks, then trains on uncurated video data. No derivation step reduces to a self-definition, a fitted input renamed as prediction, or a load-bearing self-citation chain; the method is explicitly positioned as using noisy external cues rather than its own outputs, and empirical benefits are demonstrated on separate downstream tasks (video object segmentation, surface normal estimation, semantic segmentation) without the core claim collapsing into its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review based on abstract only; the central claim rests on the assumption that noisy off-the-shelf cues suffice for effective training and on the effectiveness of linear in-context learning for embedding consistent features.

axioms (1)
  • domain assumption: Off-the-shelf networks produce usable (though noisy) depth and motion cues that can supervise pixel feature learning.
    The framework explicitly leverages these cues despite their noise.
invented entities (1)
  • LILA training framework: no independent evidence
    purpose: To learn pixel-accurate, temporally consistent feature descriptors from videos
    New method introduced by the paper; no independent evidence provided in the abstract.

pith-pipeline@v0.9.0 · 5474 in / 1280 out tokens · 45083 ms · 2026-05-07T13:34:04.357888+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 8 canonical work pages · 6 internal anchors

  2. [2]

    DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, and Huy Vo et al. DINOv2: Learning robust visual features without supervision. arXiv:2304.07193 [cs.CV], 2023

  3. [3]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021

  4. [4]

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien ...

  5. [5]

    LoftUp: Learning a coordinate-based feature upsampler for vision foundation models

    Haiwen Huang, Anpei Chen, Volodymyr Havrylov, Andreas Geiger, and Dan Zhang. LoftUp: Learning a coordinate-based feature upsampler for vision foundation models. In ICCV, 2025

  6. [6]

    FeatUp: A model-agnostic framework for features at any resolution

    Stephanie Fu, Mark Hamilton, Laura E. Brandt, Axel Feldman, Zhoutong Zhang, and William T. Freeman. FeatUp: A model-agnostic framework for features at any resolution. In ICLR, 2024

  7. [7]

    FlowFeat: Pixel-dense embedding of motion profiles

    Nikita Araslanov, Anna Sonnenweber, and Daniel Cremers. FlowFeat: Pixel-dense embedding of motion profiles. In NeurIPS, 2025

  8. [8]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. In ICCV, pages 3992--4003, 2023

  9. [9]

    VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In NeurIPS, 2022

  10. [10]

    Revisiting feature prediction for learning visual representations from video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael G. Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. TMLR, 2024

  11. [11]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoencoders are scalable vision learners. In CVPR, pages 15979--15988, 2022

  12. [12]

    Noise-contrastive estimation: A new estimation principle for unnormalized statistical models

    Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, volume 9, pages 297--304, 2010

  13. [13]

    Representation Learning with Contrastive Predictive Coding

    A \" a ron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748 [cs.LG], 2018

  14. [14]

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 1597--1607, 2020

  15. [15]

    An empirical study of training self-supervised vision transformers

    Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, pages 9620--9629, 2021

  16. [16]

    Exploring simple siamese representation learning

    Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, pages 15750--15758, 2021

  17. [17]

    Bootstrap your own latent - A new approach to self-supervised learning

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Ávila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent - A new approach to self-supervised learning. In NeurIPS, 2020

  18. [18]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael G. Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In CVPR, pages 15619--15629, 2023

  19. [19]

    BEiT: BERT pre-training of image transformers

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In ICLR, 2022

  20. [20]

    Image BERT pre-training with online tokenizer

    Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan L. Yuille, and Tao Kong. Image BERT pre-training with online tokenizer. In ICLR, 2022

  21. [21]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, volume 139, pages 8748--8763, 2021

  22. [22]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11941--11952, 2023

  23. [23]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey A. Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier J. Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense f...

  24. [24]

    AM-RADIO: Agglomerative vision foundation model reduce all domains into one

    Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. AM-RADIO: Agglomerative vision foundation model reduce all domains into one. In CVPR, pages 12490--12500, 2024

  25. [25]

    Unsupervised learning of dense visual representations

    Pedro O. Pinheiro, Amjad Almahairi, Ryan Y. Benmalek, Florian Golemo, and Aaron C. Courville. Unsupervised learning of dense visual representations. In NeurIPS, 2020

  26. [26]

    Object discovery and representation networks

    Olivier J. Hénaff, Skanda Koppula, Evan Shelhamer, Daniel Zoran, Andrew Jaegle, Andrew Zisserman, João Carreira, and Relja Arandjelovic. Object discovery and representation networks. In ECCV, volume 13687, pages 123--143, 2022

  27. [27]

    CrIBo: Self-supervised learning via cross-image object-level bootstrapping

    Tim Lebailly, Thomas Stegmüller, Behzad Bozorgtabar, Jean-Philippe Thiran, and Tinne Tuytelaars. CrIBo: Self-supervised learning via cross-image object-level bootstrapping. In ICLR, 2024

  28. [28]

    Unsupervised learning of visual representations by solving jigsaw puzzles

    Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, volume 9910, pages 69--84, 2016

  29. [29]

    Location-aware self-supervised transformers for semantic segmentation

    Mathilde Caron, Neil Houlsby, and Cordelia Schmid. Location-aware self-supervised transformers for semantic segmentation. In WACV, pages 116--126, 2024

  30. [30]

    Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, volume 9907, pages 649--666, 2016

  31. [31]

    Colorization as a proxy task for visual understanding

    Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In CVPR, pages 840--849, 2017

  32. [32]

    CroCo: Self-supervised pre-training for 3D vision tasks by cross-view completion

    Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier, Yohann Cabon, Vaibhav Arora, Leonid Antsfeld, Boris Chidlovskii, Gabriela Csurka, and Jérôme Revaud. CroCo: Self-supervised pre-training for 3D vision tasks by cross-view completion. In NeurIPS, 2022

  33. [33]

    Allan Jabri, Andrew Owens, and Alexei A. Efros. Space-time correspondence as a contrastive random walk. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, NeurIPS, 2020

  34. [34]

    Dense unsupervised learning for video segmentation

    Nikita Araslanov, Simone Schaub-Meyer, and Stefan Roth. Dense unsupervised learning for video segmentation. In NeurIPS, pages 25308--25319, 2021

  35. [35]

    Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, and Olivier J. Hénaff. Self-supervised video pretraining yields robust and more human-aligned visual representations. In NeurIPS, 2023

  36. [36]

    Self-supervised representation learning from flow equivariance

    Yuwen Xiong, Mengye Ren, Wenyuan Zeng, and Raquel Urtasun. Self-supervised representation learning from flow equivariance. In ICCV, pages 10171--10180, 2021

  37. [37]

    PooDLe: Pooled and dense self-supervised learning from naturalistic videos

    Alex N. Wang, Christopher Hoang, Yuwen Xiong, Yann LeCun, and Mengye Ren. PooDLe: Pooled and dense self-supervised learning from naturalistic videos. In ICLR, 2025

  38. [38]

    Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video

    Shashanka Venkataramanan, Mamshad Nayeem Rizve, João Carreira, Yuki M. Asano, and Yannis Avrithis. Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video. In ICLR, 2024

  39. [39]

    SAVi++: Towards end-to-end object-centric learning from real-world videos

    Gamaleldin F. Elsayed, Aravindh Mahendran, Sjoerd van Steenkiste, Klaus Greff, Michael C. Mozer, and Thomas Kipf. SAVi++: Towards end-to-end object-centric learning from real-world videos. In NeurIPS, 2022

  40. [40]

    ZoeDepth: Zero-shot transfer by combining relative and metric depth

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. ZoeDepth: Zero-shot transfer by combining relative and metric depth. arXiv:2302.12288 [cs.CV], 2023

  41. [41]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, pages 12159--12168, 2021

  42. [42]

    Single-stage semantic segmentation from image labels

    Nikita Araslanov and Stefan Roth. Single-stage semantic segmentation from image labels. In CVPR, pages 4252--4261, 2020

  43. [43]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019

  44. [44]

    Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas S. Huang. YouTube-VOS: A large-scale video object segmentation benchmark. arXiv:1809.03327 [cs.CV], 2018

  45. [45]

    The Kinetics Human Action Video Dataset

    Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The Kinetics human action video dataset. arXiv:1705.06950 [cs.CV], 2017

  46. [46]

    The 2017 DAVIS Challenge on Video Object Segmentation

    Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675 [cs.CV], 2017

  47. [47]

    Holger Caesar, Jasper R. R. Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In CVPR, pages 1209--1218, 2018

  48. [48]

    Indoor segmentation and support inference from RGBD images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, pages 746--760, 2012

  49. [49]

    Probing the 3D awareness of visual foundation models

    Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas J. Guibas, Justin Johnson, and Varun Jampani. Probing the 3D awareness of visual foundation models. In CVPR, pages 21795--21806, 2024

  50. [50]

    Scene parsing through ADE20K dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In CVPR, pages 5122--5130, 2017

  51. [51]

    SEA-RAFT: Simple, efficient, accurate RAFT for optical flow

    Yihan Wang, Lahav Lipson, and Jia Deng. SEA-RAFT: Simple, efficient, accurate RAFT for optical flow. In ECCV, pages 36--54, 2024

  52. [52]

    Depth Anything V2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2. In NeurIPS, 2024

  53. [53]

    Vision transformers need registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024

  54. [54]

    Semantic projection network for zero- and few-label semantic segmentation

    Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt Schiele, and Zeynep Akata. Semantic projection network for zero- and few-label semantic segmentation. In CVPR, 2019