Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
LinStereo uses Position-Aware Linear Attention, Hierarchical Semantic Cost Volumes, and Depth Prior Initialization to enable global aggregation in iterative stereo matching at linear complexity, showing improved performance on standard and underwater benchmarks.
citing papers explorer
- Modality Forcing for Scalable Spatial Generation
  Modality Forcing lets a single DiT produce image and depth outputs in any order after training on sparse real-world depth, with larger image-pretrained models yielding better depth accuracy and a 57% AbsRel reduction versus prior joint generative baselines.
- LinStereo: Linear-Complexity Global Attention for Multi-Scale Iterative Stereo Matching
  LinStereo uses Position-Aware Linear Attention, Hierarchical Semantic Cost Volumes, and Depth Prior Initialization to enable global aggregation in iterative stereo matching at linear complexity, showing improved performance on standard and underwater benchmarks.