
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu · 2022 · cs.CV · arXiv 2203.03605

Mixed citation behavior. Most common role is background (45%).

79 Pith papers citing it

Background 45% of classified citations


abstract

We present DINO (DETR with Improved deNoising anchOr boxes), a state-of-the-art end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look forward twice scheme for box prediction. DINO achieves 49.4 AP in 12 epochs and 51.3 AP in 24 epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of +6.0 AP and +2.7 AP, respectively, compared to DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO val2017 (63.2 AP) and test-dev (63.3 AP). Compared to other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results. Our code will be available at https://github.com/IDEACVR/DINO.


citation-role summary

method 6 · background 4 · baseline 1

citation-polarity summary

background 5 · use method 5 · baseline 1

representative citing papers

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.

CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

CV-Arena is a new 12K-pair benchmark for instruction-guided real-image editing with 16 task types, CogRetriever curation, and Active Elo mixed human-AI evaluation that finds gaps in 21 models and presents CV-Agent.

FlowOVD: Learning Generative Latent Flows for Zero-shot Open-vocabulary Detection

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

FlowOVD applies rectified flow to generate continuous latent query dynamics for text-conditioned open-vocabulary detection, reporting 49.5 AP on COCO and 31.5 AP on LVIS.

Towards UAV Detection in the Real World: A New Multispectral Dataset UAVNet-MS and a New Method

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Presents the first multispectral dataset for fine-grained small-UAV detection and a dual-stream MFDNet baseline that gains 6.2% AP50 over RGB-only detectors by using spectral material cues.

Best Segmentation Buddies for Image-Shape Correspondence

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

The work defines Best Segmentation Buddies as vertices on a 3D shape whose nearest image pixel under distilled features falls inside a given 2D segment, then uses the same features to segment the shape in 3D.

HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

HL-OutPaint enables high-resolution outpainting of long video sequences via a coarse-to-fine pipeline that first builds Global Coarse Guidance through global-local frame swapping then synthesizes details.

WD-FQDet: Multispectral Detection Transformer via Wavelet Decomposition and Frequency-aware Query Learning

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

WD-FQDet decouples modality-shared and modality-specific features in infrared-visible images via wavelet-based frequency decomposition and frequency-aware query selection to achieve state-of-the-art detection performance.

Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Thermal-Det is the first LLM-supervised open-vocabulary thermal object detector, created via synthetic data conversion from GroundingCap-1M and RGB-to-thermal distillation, yielding 2-4% AP gains on benchmarks.

Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport

cs.CV · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

OT-Bridge Editor uses geometrically constrained entropic optimal transport to synthesize CAG images with precise stenosis, improving downstream detection by 27.8% on ARCADE and 23.0% on a multi-center dataset.

SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters

cs.CV · 2026-05-04 · unverdicted · novelty 7.0

SpectraDINO adapts frozen DINOv2 backbones to multispectral data via per-modality adapters and staged distillation with cosine, contrastive, patch, and neighborhood-structure losses, achieving SOTA on object detection and segmentation benchmarks.

DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

DouC fuses an OG-CLIP branch for patch reliability via inference-time token gating with an FADE-CLIP branch for structural priors via proxy attention, outperforming prior training-free methods on eight benchmarks.

VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

cs.CV · 2026-04-23 · unverdicted · novelty 7.0 · 2 refs

VFM4SDG is a dual-prior framework that distills cross-domain stable relations from VFMs into DETR encoders and injects semantic-contextual priors into decoder queries to reduce missed detections in single-domain generalized object detection.

Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation

cs.CL · 2026-04-14 · unverdicted · novelty 7.0

DeP mitigates MLLM hallucinations by dynamically perturbing text prompts to identify and reinforce stable visual evidence regions while counteracting language prior biases using attention variance and logit statistics.

WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.

RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.

SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection

cs.CV · 2026-04-05 · unverdicted · novelty 7.0

SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.

TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding

cs.CV · 2026-03-02 · unverdicted · novelty 7.0

TopoMaskV3 adds dense offset and height heads to produce standalone 3D road centerlines from masks and reports 28.5 OLS on a new geographically disjoint long-range benchmark.

SAM 3: Segment Anything with Concepts

cs.CV · 2025-11-20 · unverdicted · novelty 7.0

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.

Simple Supervision Is Hard to Beat: A Bitter Lesson from Sparse Target Labels in Domain-Adaptive Object Detection

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

RTSM improves SFDA-OD by 1.7-18.3 AP50 across methods and detectors, and ten sparse-label feedback plugins give only limited method-dependent gains over it.

Flow Matching in Feature Space for Stochastic World Modeling

cs.CV · 2026-06-27 · unverdicted · novelty 6.0

FlowWM applies flow matching directly in pretrained feature space with a one-step projection mechanism, improving perception accuracy, mode coverage, and horizon robustness on synthetic and real-world benchmarks.

From Spatial to Spectral: An Efficient, Frequency-Guided Feature Representation Learner for Small Object Detection

cs.CV · 2026-06-22 · unverdicted · novelty 6.0

Proposes DERNet with Decompose-Enhance-Reconstruct operator and three plug-and-play modules to shift small object detection from spatial to spectral feature processing, claiming better performance than YOLOv11 with 1/6 the parameters.

Modular Diffusion Models for Structured Visual Recognition

cs.CV · 2026-06-21 · unverdicted · novelty 6.0

Modular Diffusion Models decompose diffusion into task-specific modules to model distributions over structured visual outputs for detection, segmentation, and scene graph generation.

Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

Introduces HOI-Edit benchmark with HOI-Eval metric and SCPE self-correcting framework leveraging I2V models for competitive HOI image editing performance.

Tac-DINO: Learning Vision-Tactile Features with Patch Alignment

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

Tac-DINO constructs a large tactile dataset and Vis-Tac Holographic Matching Benchmark, then proposes Vision-Tactile Patch Alignment (VTPA) methods that outperform non-aligned baselines on local-to-global feature matching.

citing papers explorer

Showing 50 of 62 citing papers after filters.

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation cs.CV · 2026-05-31 · unverdicted · none · ref 63 · internal anchor
SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.
CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences cs.CV · 2026-05-30 · unverdicted · none · ref 79 · internal anchor
CV-Arena is a new 12K-pair benchmark for instruction-guided real-image editing with 16 task types, CogRetriever curation, and Active Elo mixed human-AI evaluation that finds gaps in 21 models and presents CV-Agent.
FlowOVD: Learning Generative Latent Flows for Zero-shot Open-vocabulary Detection cs.CV · 2026-05-30 · unverdicted · none · ref 32 · internal anchor
FlowOVD applies rectified flow to generate continuous latent query dynamics for text-conditioned open-vocabulary detection, reporting 49.5 AP on COCO and 31.5 AP on LVIS.
Towards UAV Detection in the Real World: A New Multispectral Dataset UAVNet-MS and a New Method cs.CV · 2026-05-20 · unverdicted · none · ref 33 · internal anchor
Presents the first multispectral dataset for fine-grained small-UAV detection and a dual-stream MFDNet baseline that gains 6.2% AP50 over RGB-only detectors by using spectral material cues.
Best Segmentation Buddies for Image-Shape Correspondence cs.CV · 2026-05-18 · unverdicted · none · ref 66 · internal anchor
The work defines Best Segmentation Buddies as vertices on a 3D shape whose nearest image pixel under distilled features falls inside a given 2D segment, then uses the same features to segment the shape in 3D.
HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos cs.CV · 2026-05-17 · unverdicted · none · ref 4 · internal anchor
HL-OutPaint enables high-resolution outpainting of long video sequences via a coarse-to-fine pipeline that first builds Global Coarse Guidance through global-local frame swapping then synthesizes details.
WD-FQDet: Multispectral Detection Transformer via Wavelet Decomposition and Frequency-aware Query Learning cs.CV · 2026-05-13 · unverdicted · none · ref 51 · internal anchor
WD-FQDet decouples modality-shared and modality-specific features in infrared-visible images via wavelet-based frequency decomposition and frequency-aware query selection to achieve state-of-the-art detection performance.
Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection cs.CV · 2026-05-11 · unverdicted · none · ref 45 · internal anchor
Thermal-Det is the first LLM-supervised open-vocabulary thermal object detector, created via synthetic data conversion from GroundingCap-1M and RGB-to-thermal distillation, yielding 2-4% AP gains on benchmarks.
Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport cs.CV · 2026-05-09 · unverdicted · none · ref 12 · 2 links · internal anchor
OT-Bridge Editor uses geometrically constrained entropic optimal transport to synthesize CAG images with precise stenosis, improving downstream detection by 27.8% on ARCADE and 23.0% on a multi-center dataset.
SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters cs.CV · 2026-05-04 · unverdicted · none · ref 73 · internal anchor
SpectraDINO adapts frozen DINOv2 backbones to multispectral data via per-modality adapters and staged distillation with cosine, contrastive, patch, and neighborhood-structure losses, achieving SOTA on object detection and segmentation benchmarks.
DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation cs.CV · 2026-04-27 · unverdicted · none · ref 5 · internal anchor
DouC fuses an OG-CLIP branch for patch reliability via inference-time token gating with an FADE-CLIP branch for structural priors via proxy attention, outperforming prior training-free methods on eight benchmarks.
VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection cs.CV · 2026-04-23 · unverdicted · none · ref 49 · 2 links · internal anchor
VFM4SDG is a dual-prior framework that distills cross-domain stable relations from VFMs into DETR encoders and injects semantic-contextual priors into decoder queries to reduce missed detections in single-domain generalized object detection.
WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects cs.CV · 2026-04-09 · unverdicted · none · ref 37 · internal anchor
WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details cs.CV · 2026-04-08 · unverdicted · none · ref 50 · internal anchor
RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.
SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection cs.CV · 2026-04-05 · unverdicted · none · ref 11 · internal anchor
SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.
TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding cs.CV · 2026-03-02 · unverdicted · none · ref 45 · internal anchor
TopoMaskV3 adds dense offset and height heads to produce standalone 3D road centerlines from masks and reports 28.5 OLS on a new geographically disjoint long-range benchmark.
Simple Supervision Is Hard to Beat: A Bitter Lesson from Sparse Target Labels in Domain-Adaptive Object Detection cs.CV · 2026-06-29 · unverdicted · none · ref 31 · internal anchor
RTSM improves SFDA-OD by 1.7-18.3 AP50 across methods and detectors, and ten sparse-label feedback plugins give only limited method-dependent gains over it.
Flow Matching in Feature Space for Stochastic World Modeling cs.CV · 2026-06-27 · unverdicted · none · ref 36 · internal anchor
FlowWM applies flow matching directly in pretrained feature space with a one-step projection mechanism, improving perception accuracy, mode coverage, and horizon robustness on synthetic and real-world benchmarks.
From Spatial to Spectral: An Efficient, Frequency-Guided Feature Representation Learner for Small Object Detection cs.CV · 2026-06-22 · unverdicted · none · ref 44 · internal anchor
Proposes DERNet with Decompose-Enhance-Reconstruct operator and three plug-and-play modules to shift small object detection from spatial to spectral feature processing, claiming better performance than YOLOv11 with 1/6 the parameters.
Modular Diffusion Models for Structured Visual Recognition cs.CV · 2026-06-21 · unverdicted · none · ref 19 · internal anchor
Modular Diffusion Models decompose diffusion into task-specific modules to model distributions over structured visual outputs for detection, segmentation, and scene graph generation.
Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework cs.CV · 2026-06-17 · unverdicted · none · ref 31 · internal anchor
Introduces HOI-Edit benchmark with HOI-Eval metric and SCPE self-correcting framework leveraging I2V models for competitive HOI image editing performance.
Tac-DINO: Learning Vision-Tactile Features with Patch Alignment cs.CV · 2026-06-10 · unverdicted · none · ref 185 · internal anchor
Tac-DINO constructs a large tactile dataset and Vis-Tac Holographic Matching Benchmark, then proposes Vision-Tactile Patch Alignment (VTPA) methods that outperform non-aligned baselines on local-to-global feature matching.
Leveraging Morphology for Historical Script Metrological Analysis cs.CV · 2026-06-08 · unverdicted · none · ref 53 · internal anchor
A new deep architecture learns character prototypes from line transcriptions to produce scalable paleographic measurements on historical scripts.
Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders cs.CV · 2026-05-30 · unverdicted · none · ref 41 · internal anchor
C-GSPN scales 2D spatial propagation to foundation vision encoders via a fast CUDA kernel, compressed blocks, and two-stage distillation, matching ViT performance with 15% fewer parameters and 4x block speedup at 2K resolution.
T-CLIP: Enabling Thermal Perception for Contrastive Language-Image Pretraining cs.CV · 2026-05-30 · unverdicted · none · ref 78 · internal anchor
T-CLIP introduces a physics-aware thermal captioning dataset (IR-Cap) and a decoupled dual-LoRA adaptation of CLIP that improves cross-modal retrieval on thermal benchmarks by separating scene-level and object-level thermal understanding.
Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning cs.CV · 2026-05-29 · unverdicted · none · ref 17 · internal anchor
DetAS-X uses an MLLM agent to adaptively compose detection workflows from restoration modules and expert detectors, enhanced by self-evolving experience harvesting, achieving substantial F1 score gains on challenging benchmarks.
CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation cs.CV · 2026-05-25 · unverdicted · none · ref 54 · internal anchor
A multi-teacher distillation framework that packs 50 effect LoRAs and fast sampling into a single adapter while aiming to avoid concept interference.
DisDop: Distillation with Domain Priors for Open-Vocabulary Aerial Object Detection cs.CV · 2026-05-23 · unverdicted · none · ref 38 · internal anchor
DisDop distills complementary priors from RemoteCLIP and DINOv3 via teacher fusion and semantic modeling to reach new state-of-the-art results on open-vocabulary aerial detection benchmarks.
SparseSAM: Structured Sparsification of Activations in Segment Anything Models cs.CV · 2026-05-17 · unverdicted · none · ref 34 · internal anchor
SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.
SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding cs.CV · 2026-05-14 · unverdicted · none · ref 33 · internal anchor
SceneParser introduces hierarchical scene parsing as object-part-affordance chains, a VLM trained with pseudo labels and curriculum learning, and SceneParser-Bench with 1.74M affordance annotations, showing better structure-aware results than existing MLLMs.
SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection cs.CV · 2026-05-13 · unverdicted · none · ref 63 · internal anchor
SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.
Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding cs.CV · 2026-05-09 · unverdicted · none · ref 53 · internal anchor
A new framework combines self-attention on the Oblique manifold with bidirectional geodesic cross-attention on the Lorentz hyperboloid to improve both localization accuracy and descriptive coherence in 3D dense captioning.
Reference-based Category Discovery: Unsupervised Object Detection with Category Awareness cs.CV · 2026-05-06 · unverdicted · none · ref 11 · internal anchor
RefCD enables unsupervised category-aware object detection by using feature similarity between predicted objects and unlabeled reference images to guide category learning.
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective cs.CV · 2026-04-15 · unverdicted · none · ref 214 · internal anchor
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
ZoomSpec: A Physics-Guided Coarse-to-Fine Framework for Wideband Spectrum Sensing cs.CV · 2026-04-15 · unverdicted · none · ref 28 · internal anchor
ZoomSpec achieves 78.1 mAP@0.5:0.95 on the SpaceNet dataset by combining log-space STFT, a coarse proposal net, adaptive heterodyne filtering, and dual-domain fine recognition to improve narrowband visibility in wideband spectrum sensing.
Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization cs.CV · 2026-04-13 · unverdicted · none · ref 16 · internal anchor
VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.
VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis cs.CV · 2026-04-08 · unverdicted · none · ref 49 · internal anchor
VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.
Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection cs.CV · 2026-04-07 · unverdicted · none · ref 56 · internal anchor
Telescope uses learnable hyperbolic foveation to deliver a 76% relative mAP gain (0.185 to 0.326) for objects beyond 250 meters while keeping overhead low.
PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training cs.CV · 2026-04-01 · unverdicted · none · ref 33 · internal anchor
PET-DINO unifies visual and text prompts in Grounding DINO via an alignment-friendly generation module and prompt-enriched training strategies to improve zero-shot open-set object detection.
RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization cs.CV · 2026-03-13 · unverdicted · none · ref 43 · internal anchor
A dual-tower 4D embodied world model called RoboStereo reduces geometric hallucinations and delivers over 97% relative improvement on manipulation tasks via test-time augmentation, imitative learning, and open exploration.
Vision Transformers Need More Than Registers cs.CV · 2026-02-25 · unverdicted · none · ref 43 · internal anchor
ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, text-, and self-supervision.
Focus on What Really Matters in Low-Altitude Governance: A Management-Centric Multi-Modal Benchmark with Implicitly Coordinated Vision-Language Reasoning Framework cs.CV · 2026-01-27 · unverdicted · none · ref 30 · internal anchor
Presents the first management-oriented multi-modal benchmark GovLA-10K and a vision-language reasoning framework GovLA-Reasoner with a spatially-aware adapter for low-altitude aerial perception.
Efficient RGB-T Object Detection via Sparse Cross-Modality Fusion cs.CV · 2026-06-29 · unverdicted · none · ref 35 · internal anchor
A two-stage RGB-T detector performs lightweight modality-specific proposal generation followed by sparse fusion-based refinement to match accuracy of heavier models at lower parameter and compute cost.
Improving Reasoning in Vision-Language Models via Perception Verified Self-Training cs.CV · 2026-06-20 · unverdicted · none · ref 32 · 2 links · internal anchor
Perception-verified self-training with PerceptEval and two-stage curriculum learning improves VLM reasoning by up to 16% over standard self-training baselines.
Training-Free Metrics for Synthetic Object Detection Data: A Proxy for Detector Performance cs.CV · 2026-06-18 · unverdicted · none · ref 53 · internal anchor
CCDM metrics achieve perfect Spearman correlation of 1.0 with YOLOv8 mAP on VisDrone-DET synthetic sets, outperforming prior synthetic-image metrics.
VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection cs.CV · 2026-06-10 · unverdicted · none · ref 40 · internal anchor
VL-DINO improves open-vocabulary object detection by adding QPSC, VSE, and ORSA modules that inject CLIP knowledge into DINO, reaching 36.3 and 38.1 AP zero-shot on LVIS.
IMPose: Interactive Multi-person Pose Estimation with Dynamic Correction Propagation cs.CV · 2026-06-03 · unverdicted · none · ref 37 · internal anchor
IMPose introduces dual-level (keypoint and instance) correction propagation with a trajectory bank to turn sparse annotations into dense multi-person pose trajectories in videos.
EIVE: End-to-End Instance-Specific Visual Explanations for Detection Transformers cs.CV · 2026-06-01 · unverdicted · none · ref 42 · internal anchor
EIVE reformulates decoder cross-attention in Detection Transformers to produce instance-specific saliency maps via cross-layer fusion and attention-aware training, matching post-hoc methods in quality while improving speed.
GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection cs.CV · 2026-05-28 · unverdicted · none · ref 28 · internal anchor
GiPL uses iterative pseudo-label self-training on support sets plus generative augmentation from VLMs to improve CD-FSOD performance on RUOD, CARPK, and CarDD under 1/5/10-shot regimes.
LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images cs.CV · 2026-05-22 · unverdicted · none · ref 53 · internal anchor
LangFlash introduces a feed-forward model for 3D language Gaussian splatting from sparse unposed images, claiming superior novel view synthesis and semantic consistency via enriched training data and sparse semantic encoding.
