citation dossier

ZoeDepth: Zero-shot transfer by combining relative and metric depth

Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller · 2023 · arXiv 2302.12288

18 Pith papers citing it

20 reference links

cs.CV top field · 17 papers

UNVERDICTED top verdict bucket · 16 papers

This arXiv-backed work is queued for full Pith review when it crosses the high-inbound sweep. That review runs reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.

why this work matters in Pith

Pith has found this work in 18 reviewed papers. Its strongest current cluster is cs.CV (17 papers). The largest review-status bucket among citing papers is UNVERDICTED (16 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

representative citing papers

LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

LAMP tracks 3D human motion from moving multi-camera headsets by converting 2D detections to a unified metric 3D world frame via device localization and fitting with an end-to-end spatio-temporal transformer.

DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity

cs.CV · 2026-05-03 · unverdicted · novelty 7.0

Dual-pixel defocus blur enables absolute scale estimation in SfM without reference objects or calibration.

Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.

LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

LiftFormer transforms monocular depth prediction into depth-oriented geometric and edge-aware subspace representations via lifting and frame theory, achieving state-of-the-art results on standard datasets.

3D-VLA: A 3D Vision-Language-Action Generative World Model

cs.CV · 2024-03-14 · unverdicted · novelty 7.0

3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.

Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentation, surface normal estimation, and semantic segmentation.

Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

Layer analysis of DINOv3 shows non-uniform 3D geometric knowledge concentrated in deeper layers, enabling a last-layer-centric recombination module that improves monocular depth estimation accuracy to state-of-the-art levels.

In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

A selective regularization framework lets scale-ambiguous monocular depth priors improve Gaussian Splatting geometry and rendering by isolating and supervising only ill-posed regions.

Depth Anything V2

cs.CV · 2024-06-13 · unverdicted · novelty 6.0

Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.

Pose-Aware Diffusion for 3D Generation

cs.CV · 2026-05-01 · unverdicted · novelty 5.0

PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.

Learning from the Unseen: Generative Data Augmentation for Geometric-Semantic Accident Anticipation

cs.CV · 2026-04-29 · unverdicted · novelty 5.0

A generative video synthesis pipeline paired with a semantic graph neural network yields gains in accident anticipation accuracy and lead time on driving datasets, accompanied by a new benchmark release.

Enhancing Hazy Wildlife Imagery: AnimalHaze3k and IncepDehazeGan

cs.CV · 2026-04-17 · conditional · novelty 5.0

A new wildlife-specific hazy image dataset and IncepDehazeGan model that reports state-of-the-art dehazing metrics and more than doubles downstream animal detection performance.

Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction

cs.CV · 2026-04-03 · unverdicted · novelty 5.0

A multilevel perceptual CRF model using Swin Transformer, HPF fusion, HA adapters, and dynamic scaling attention achieves state-of-the-art monocular depth estimation on NYU Depth v2, KITTI, and MatterPort3D with reduced error and fast inference.

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

cs.RO · 2025-01-27 · unverdicted · novelty 5.0

SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 million real-world episodes.

AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation

cs.CV · 2026-05-10 · unverdicted · novelty 4.0

AtteConDA adds attention-based conflict suppression to multi-condition diffusion models so that generated driving-scene images retain richer structural cues from the original annotations.

ELoG-GS: Dual-Branch Gaussian Splatting with Luminance-Guided Enhancement for Extreme Low-light 3D Reconstruction

cs.CV · 2026-04-14 · unverdicted · novelty 4.0 · 2 refs

ELoG-GS integrates geometry-aware initialization and luminance-guided photometric adaptation into Gaussian Splatting, achieving PSNR 18.66 and SSIM 0.69 on the NTIRE 2026 Track 1 low-light 3D reconstruction benchmark.

Step1X-Edit: A Practical Framework for General Image Editing

cs.CV · 2025-04-24 · unverdicted · novelty 4.0

Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models on the new GEdit-Bench.

SS3D: End2End Self-Supervised 3D from Web Videos

cs.CV · 2026-04-24 · 2 refs

citing papers explorer

Showing 18 of 18 citing papers.

LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World · cs.CV · 2026-05-06 · unverdicted · none · ref 4
LAMP tracks 3D human motion from moving multi-camera headsets by converting 2D detections to a unified metric 3D world frame via device localization and fitting with an end-to-end spatio-temporal transformer.
DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity · cs.CV · 2026-05-03 · unverdicted · none · ref 12
Dual-pixel defocus blur enables absolute scale estimation in SfM without reference objects or calibration.
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors · cs.CV · 2026-04-14 · unverdicted · none · ref 2
A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation · cs.CV · 2026-04-08 · unverdicted · none · ref 54
LiftFormer transforms monocular depth prediction into depth-oriented geometric and edge-aware subspace representations via lifting and frame theory, achieving state-of-the-art results on standard datasets.
3D-VLA: A 3D Vision-Language-Action Generative World Model · cs.CV · 2024-03-14 · unverdicted · none · ref 2
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners · cs.CV · 2026-04-29 · unverdicted · none · ref 40
LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentation, surface normal estimation, and semantic segmentation.
Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation · cs.CV · 2026-04-29 · unverdicted · none · ref 4
Layer analysis of DINOv3 shows non-uniform 3D geometric knowledge concentrated in deeper layers, enabling a last-layer-centric recombination module that improves monocular depth estimation accuracy to state-of-the-art levels.
In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting · cs.CV · 2026-04-07 · unverdicted · none · ref 3
A selective regularization framework lets scale-ambiguous monocular depth priors improve Gaussian Splatting geometry and rendering by isolating and supervising only ill-posed regions.
Depth Anything V2 · cs.CV · 2024-06-13 · unverdicted · none · ref 6
Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.
Pose-Aware Diffusion for 3D Generation · cs.CV · 2026-05-01 · unverdicted · none · ref 2
PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.
Learning from the Unseen: Generative Data Augmentation for Geometric-Semantic Accident Anticipation · cs.CV · 2026-04-29 · unverdicted · none · ref 50
A generative video synthesis pipeline paired with a semantic graph neural network yields gains in accident anticipation accuracy and lead time on driving datasets, accompanied by a new benchmark release.
Enhancing Hazy Wildlife Imagery: AnimalHaze3k and IncepDehazeGan · cs.CV · 2026-04-17 · conditional · none · ref 1
A new wildlife-specific hazy image dataset and IncepDehazeGan model that reports state-of-the-art dehazing metrics and more than doubles downstream animal detection performance.
Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction · cs.CV · 2026-04-03 · unverdicted · none · ref 6
A multilevel perceptual CRF model using Swin Transformer, HPF fusion, HA adapters, and dynamic scaling attention achieves state-of-the-art monocular depth estimation on NYU Depth v2, KITTI, and MatterPort3D with reduced error and fast inference.
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model · cs.RO · 2025-01-27 · unverdicted · none · ref 4
SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 million real-world episodes.
AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation · cs.CV · 2026-05-10 · unverdicted · none · ref 3
AtteConDA adds attention-based conflict suppression to multi-condition diffusion models so that generated driving-scene images retain richer structural cues from the original annotations.
ELoG-GS: Dual-Branch Gaussian Splatting with Luminance-Guided Enhancement for Extreme Low-light 3D Reconstruction · cs.CV · 2026-04-14 · unverdicted · none · ref 1 · 2 links
ELoG-GS integrates geometry-aware initialization and luminance-guided photometric adaptation into Gaussian Splatting, achieving PSNR 18.66 and SSIM 0.69 on the NTIRE 2026 Track 1 low-light 3D reconstruction benchmark.
Step1X-Edit: A Practical Framework for General Image Editing · cs.CV · 2025-04-24 · unverdicted · none · ref 4
Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models on the new GEdit-Bench.
SS3D: End2End Self-Supervised 3D from Web Videos · cs.CV · 2026-04-24 · unreviewed · ref 4 · 2 links
