Frozen in time: A joint video and image encoder for end-to-end retrieval

Reiner Birkl, Diana Wofk, Matthias Müller · 2023 · arXiv 2307.14460

Mixed citation behavior. Most common role is background (60%).

22 Pith papers citing it

Background 60% of classified citations

citation-role summary

background 3 · baseline 1 · method 1

citation-polarity summary

background 3 · baseline 1 · use method 1

representative citing papers

Trust It or Not: Evidential Uncertainty for Feed-Forward 3D Reconstruction with Trust3R

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Trust3R introduces a gated residual refinement plus Normal-Inverse-Wishart evidential head that produces closed-form multivariate Student-t uncertainty for per-point geometry in feed-forward 3D reconstruction and improves uncertainty ranking metrics on indoor and outdoor benchmarks.

Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.

Monocular Depth Estimation via Neural Network with Learnable Algebraic Group and Ring Structures

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

LAGRNet embeds learnable algebraic group, ring, and sheaf structures into a neural network to improve accuracy and generalization in monocular depth estimation.

LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.

Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation

cs.CV · 2026-03-02 · unverdicted · novelty 7.0

Low-rank decoder adaptation enables efficient test-time optimization for zero-shot depth completion by updating only the subspace containing depth-relevant information.

Materialist: Physically Based Editing Using Single-Image Inverse Rendering

cs.CV · 2025-01-07 · unverdicted · novelty 7.0

Materialist performs single-image inverse rendering via neural-initialized progressive differentiable rendering to enable physically consistent material editing, object insertion, relighting, and transparency edits without full scene geometry.

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

cs.CV · 2024-06-24 · unverdicted · novelty 7.0

Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.

Stabilizing Streaming Video Geometry via Dynamic Feature Normalization

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

DyFN is a lightweight recurrent module that dynamically normalizes latent feature statistics to remove scale-shift drift and achieve state-of-the-art temporal consistency in streaming monocular geometry estimation while updating only 2% of parameters.

LUMEN: Low-light Unified Multi-stage Enhancement Network using depth-guided flash, clustering, and attention-based Transformers

eess.IV · 2026-05-18 · unverdicted · novelty 6.0

LUMEN enhances low-light images via depth estimation, soft clustering for virtual flash simulation, and attention-based transformer fusion, reporting state-of-the-art results on LOL-v1 and LOL-v2 benchmarks.

GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

cs.CV · 2026-05-11 · unverdicted · novelty 6.0 · 4 refs

GemDepth adds explicit camera-pose geometry embeddings and an alternating spatio-temporal transformer to produce sharper, more temporally consistent video depth maps than prior smoothing-based methods.

SS3D: End2End Self-Supervised 3D from Web Videos

cs.CV · 2026-04-24 · unverdicted · novelty 6.0 · 3 refs

SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior self-supervised baselines.

Materialistic RIR: Material Conditioned Realistic RIR Generation

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

A two-module neural model disentangles spatial layout from material properties to generate controllable and more realistic room impulse responses, reporting gains of up to 16% on acoustic metrics and 70% on material metrics plus better human ratings.

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

cs.CV · 2024-12-18 · unverdicted · novelty 6.0

VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.

Depth Anything V2

cs.CV · 2024-06-13 · unverdicted · novelty 6.0

Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.

Towards Consistent Video Geometry Estimation

cs.CV · 2026-05-28 · unverdicted · novelty 5.0

ViGeo is a feed-forward transformer for video geometry that introduces dynamic chunking attention and a completion-based data refinement framework to achieve SOTA on depth, normals, and point map estimation.

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

cs.CV · 2026-05-26 · unverdicted · novelty 5.0

JetViT uses post-training attention search to hybridize full-attention ViTs with linear and window attention blocks, achieving up to 1.79x throughput gains on high-res images while preserving accuracy on DINOv3 and DepthAnythingV2.

The Midas Touch for Metric Depth

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.

Low-Cost Stereo Vision for Robust 3D Positioning of Thin Radiata Pine Branches in Autonomous Drone Pruning

cs.CV · 2026-05-06 · unverdicted · novelty 5.0

A drone-mounted stereo camera pipeline with YOLO segmentation, deep stereo depth, centroid triangulation, and MAD outlier rejection achieves robust 3D positioning of thin pine branches at 1-2 m distances.

MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

cs.CV · 2025-07-03 · unverdicted · novelty 5.0

MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.

Shape2Animal: Creative Animal Generation from Natural Silhouettes

cs.CV · 2025-06-25 · unverdicted · novelty 5.0

Shape2Animal converts natural object silhouettes into plausible animal images via open-vocabulary segmentation, vision-language interpretation, text-to-image diffusion, and scene blending.

Open-Sora Plan: Open-Source Large Video Generation Model

cs.CV · 2024-11-28 · unverdicted · novelty 4.0

Open-Sora Plan presents an open-source large video generation model that combines a Wavelet-Flow VAE, Joint Image-Video Skiparse Denoiser, and multi-dimensional data curation to achieve high-quality video outputs with public code and weights.

Positioning radiata pine branches requiring pruning by drone stereo vision

cs.CV · 2026-04-12 · unverdicted · novelty 3.0

Drone stereo vision pipeline segments pine branches with YOLO variants and estimates depth with deep stereo networks, yielding more coherent maps than SGBM at 1-2 m distances.
