AnyUp: Universal Feature Upsampling
5 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV 5
verdicts: UNVERDICTED 5
roles: background 1
polarities: background 1
citing papers explorer
- What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
  Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
- DINO Soars: DINOv3 for Open-Vocabulary Semantic Segmentation of Remote Sensing Imagery
  CAFe-DINO achieves SOTA open-vocabulary semantic segmentation on remote sensing datasets by leveraging DINOv3 features with cost aggregation and upsampling, fine-tuned solely on an RS-targeted COCO-Stuff subset.
- HD-VGGT: High-Resolution Visual Geometry Transformer
  HD-VGGT achieves state-of-the-art high-resolution 3D reconstruction from image collections via a dual-branch architecture that predicts coarse geometry at low resolution and refines details at high resolution while modulating unreliable features.
- C3G: Learning Compact 3D Representations with 2K Gaussians
  C3G creates compact 3D Gaussian representations with 2K points by guiding placement via learnable tokens that aggregate multi-view features through attention, yielding better efficiency and performance than dense methods.
- MapSR: Prompt-Driven Land Cover Map Super-Resolution via Vision Foundation Models
  MapSR achieves 59.64% mIoU on land cover super-resolution from low-resolution labels alone by prompting frozen vision foundation models and applying training-free inference plus graph refinement.