Multi-view Pyramid Transformer: Look Coarser to See Broader

Eunbyung Park; Gyeongjin Kang; Jungwoo Kim; Seungkwon Yang; Seungtae Nam; Younggeun Lee

arxiv: 2512.07806 · v2 · pith:NMYXMJIFnew · submitted 2025-12-08 · 💻 cs.CV

Gyeongjin Kang, Seungkwon Yang, Seungtae Nam, Younggeun Lee, Jungwoo Kim, Eunbyung Park

classification 💻 cs.CV

keywords hierarchy, multi-view, transformer, achieves, broader, efficiency, large, looking

We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
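
As a rough illustration of why pairing the two hierarchies can keep attention cost manageable, the sketch below counts the token pairs that self-attention would score at each stage. It is not the authors' implementation: the `Stage` type, the `attention_pairs` and `dual_hierarchy` helpers, the factor-of-4 pooling and grouping schedule, and the 64-view, 1024-token setting are all hypothetical choices made for illustration, and each stage is treated as a single attention layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    group_size: int       # views that attend jointly (hypothetical value)
    tokens_per_view: int  # spatial tokens kept per view (hypothetical value)


def attention_pairs(num_views: int, stage: Stage) -> int:
    """Token pairs scored by one self-attention layer when views are grouped."""
    if num_views % stage.group_size != 0:
        raise ValueError("num_views must be divisible by group_size")
    num_groups = num_views // stage.group_size
    seq_len = stage.group_size * stage.tokens_per_view
    return num_groups * seq_len ** 2


def dual_hierarchy(num_views: int, base_tokens: int,
                   pool: int = 4, widen: int = 4) -> list[Stage]:
    """Assumed schedule: widen the view group while pooling each view's tokens."""
    stages = []
    group, tokens = 1, base_tokens
    while True:
        stages.append(Stage(group, tokens))
        if group == num_views:
            break
        group = min(group * widen, num_views)  # local-to-global across views
        tokens = max(tokens // pool, 1)        # fine-to-coarse within a view
    return stages


num_views, base_tokens = 64, 1024  # illustrative setting, not from the paper
stages = dual_hierarchy(num_views, base_tokens)
hierarchical_cost = sum(attention_pairs(num_views, s) for s in stages)
# Baseline: every view attends to every other view at full resolution.
flat_cost = attention_pairs(num_views, Stage(num_views, base_tokens))

for s in stages:
    print(s, attention_pairs(num_views, s))
print("hierarchical:", hierarchical_cost, "flat:", flat_cost)
```

Under these assumed factors, every stage attends over the same sequence length per group while the number of groups shrinks, so the final full-scene stage is the cheapest rather than the most expensive. Other pooling or grouping ratios would change that balance.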

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AdaptSplat: Adapting Vision Foundation Models for Feed-Forward 3D Gaussian Splatting
cs.CV 2026-05 unverdicted novelty 7.0

AdaptSplat adds a Frequency-Preserving Adapter to vision foundation models to boost high-frequency fidelity and cross-domain performance in feed-forward 3D Gaussian Splatting.

PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis
cs.CV 2026-05 unverdicted novelty 6.0

PanoWorld autoregressively generates consistent multi-room 360-degree panoramas for whole-house VR using a floorplan-derived 3D shell as geometric proxy and a dynamic 3DGS cache for spatial memory.

AdaptSplat: Adapting Vision Foundation Models for Feed-Forward 3D Gaussian Splatting
cs.CV 2026-05 unverdicted novelty 5.0

AdaptSplat adds a lightweight Frequency-Preserving Adapter to vision foundation models that extracts direction-aware high-frequency priors and integrates them via positional encodings and residual modulation to improv...