Mixed citations
Frozen in time: A joint video and image encoder for end-to-end retrieval
Mixed citation behavior. Most common role is background (60%).
verdicts
UNVERDICTED 22
representative citing papers
citing papers explorer
- Trust It or Not: Evidential Uncertainty for Feed-Forward 3D Reconstruction with Trust3R
  Trust3R introduces a gated residual refinement plus Normal-Inverse-Wishart evidential head that produces closed-form multivariate Student-t uncertainty for per-point geometry in feed-forward 3D reconstruction and improves uncertainty ranking metrics on indoor and outdoor benchmarks.
- Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction
  AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.
- Monocular Depth Estimation via Neural Network with Learnable Algebraic Group and Ring Structures
  LAGRNet embeds learnable algebraic group, ring, and sheaf structures into a neural network to improve accuracy and generalization in monocular depth estimation.
- LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation
  A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.
- Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation
  Low-rank decoder adaptation enables efficient test-time optimization for zero-shot depth completion by updating only the subspace containing depth-relevant information.
- Materialist: Physically Based Editing Using Single-Image Inverse Rendering
  Materialist performs single-image inverse rendering via neural-initialized progressive differentiable rendering to enable physically consistent material editing, object insertion, relighting, and transparency edits without full scene geometry.
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
  Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.
- Stabilizing Streaming Video Geometry via Dynamic Feature Normalization
  DyFN is a lightweight recurrent module that dynamically normalizes latent feature statistics to remove scale-shift drift and achieve state-of-the-art temporal consistency in streaming monocular geometry estimation while updating only 2% of parameters.
- LUMEN: Low-light Unified Multi-stage Enhancement Network using depth-guided flash, clustering, and attention-based Transformers
  LUMEN enhances low-light images via depth estimation, soft clustering for virtual flash simulation, and attention-based transformer fusion, reporting state-of-the-art results on LOL-v1 and LOL-v2 benchmarks.
- GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
  GemDepth adds explicit camera-pose geometry embeddings and an alternating spatio-temporal transformer to produce sharper, more temporally consistent video depth maps than prior smoothing-based methods.
- SS3D: End2End Self-Supervised 3D from Web Videos
  SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior self-supervised baselines.
- Materialistic RIR: Material Conditioned Realistic RIR Generation
  A two-module neural model disentangles spatial layout from material properties to generate controllable and more realistic room impulse responses, reporting gains of up to 16% on acoustic metrics and 70% on material metrics plus better human ratings.
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
  VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
- Depth Anything V2
  Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.
- Towards Consistent Video Geometry Estimation
  ViGeo is a feed-forward transformer for video geometry that introduces dynamic chunking attention and a completion-based data refinement framework to achieve SOTA on depth, normals, and point map estimation.
- JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search
  JetViT uses post-training attention search to hybridize full-attention ViTs with linear and window attention blocks, achieving up to 1.79x throughput gains on high-res images while preserving accuracy on DINOv3 and DepthAnythingV2.
- The Midas Touch for Metric Depth
  MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.
- Low-Cost Stereo Vision for Robust 3D Positioning of Thin Radiata Pine Branches in Autonomous Drone Pruning
  A drone-mounted stereo camera pipeline with YOLO segmentation, deep stereo depth, centroid triangulation, and MAD outlier rejection achieves robust 3D positioning of thin pine branches at 1-2 m distances.
- MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
  MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
- Shape2Animal: Creative Animal Generation from Natural Silhouettes
  Shape2Animal converts natural object silhouettes into plausible animal images via open-vocabulary segmentation, vision-language interpretation, text-to-image diffusion, and scene blending.
- Open-Sora Plan: Open-Source Large Video Generation Model
  Open-Sora Plan presents an open-source large video generation model that combines a Wavelet-Flow VAE, Joint Image-Video Skiparse Denoiser, and multi-dimensional data curation to achieve high-quality video outputs with public code and weights.
- Positioning radiata pine branches requiring pruning by drone stereo vision
  Drone stereo vision pipeline segments pine branches with YOLO variants and estimates depth with deep stereo networks, yielding more coherent maps than SGBM at 1-2 m distances.