NMS Strikes Back
4 Pith papers cite this work. Polarity classification is still indexing.
citation summary
fields: cs.CV (4)
verdicts: UNVERDICTED (4)
roles: baseline (1)
polarities: baseline (1)
citing papers explorer
- MVDGC: Joint 3D and 2D Multi-view Pedestrian Detection via Dual Geometric Constraints
  MVDGC unifies BEV and image-view pedestrian localization into one task via 3D cylindrical queries that enforce dual geometric constraints between views.
- Simple Supervision Is Hard to Beat: A Bitter Lesson from Sparse Target Labels in Domain-Adaptive Object Detection
  RTSM improves SFDA-OD by 1.7-18.3 AP50 across methods and detectors, and ten sparse-label feedback plugins give only limited method-dependent gains over it.
- MDS-DETR: DETR with Masked Duplicate Suppressor
  MDS-DETR introduces a masked duplicate suppressor in self-attention to enable one-to-many supervision inside a single decoder, yielding +2.8 mAP over Deformable-DETR on COCO with 5% more training time and outperforming MR.DETR by 0.3 mAP while training 20% faster.
- Perception Encoder: The best visual embeddings are not at the output of the network
  Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.