Deformable DETR: Deformable Transformers for End-to-End Object Detection
44 Pith papers cite this work.
abstract
DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we proposed Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10 times less training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code is released at https://github.com/fundamentalvision/Deformable-DETR.
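The key idea in the abstract is that each query attends only to a small set of sampled points around a reference location, rather than to every position in the feature map. The following is a minimal, single-head numpy sketch of that sampling scheme under simplifying assumptions (one feature level, one query, hypothetical names like `bilinear_sample` and `W_v`); it is an illustration of the mechanism, not the authors' implementation.

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Bilinearly sample a (H, W, C) feature map at fractional coords (y, x)."""
    H, W, _ = feature_map.shape
    y, x = np.clip(y, 0, H - 1), np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feature_map[y0, x0]
            + (1 - wy) * wx * feature_map[y0, x1]
            + wy * (1 - wx) * feature_map[y1, x0]
            + wy * wx * feature_map[y1, x1])

def deformable_attention(feature_map, reference, offsets, attn_logits, W_v):
    """Attend to K sampled points around `reference` instead of all H*W keys.

    feature_map : (H, W, C) image features
    reference   : (2,) reference point (y, x) in pixel coordinates
    offsets     : (K, 2) learned sampling offsets relative to the reference
    attn_logits : (K,) learned attention logits (softmaxed directly,
                  not computed as query-key dot products)
    W_v         : (C, C) value projection matrix
    """
    # Softmax over the K sampling points only -- this is what makes the
    # cost independent of the H*W size of the feature map.
    weights = np.exp(attn_logits - attn_logits.max())
    weights /= weights.sum()

    out = np.zeros(feature_map.shape[-1])
    for k in range(len(offsets)):
        y, x = reference + offsets[k]          # deformed sampling location
        value = bilinear_sample(feature_map, y, x) @ W_v
        out += weights[k] * value              # convex combination of K values
    return out
```

Because the attention weights are predicted directly and applied to only K points (K is typically small, e.g. 4 per head per level), the per-query cost is O(K·C) instead of O(H·W·C), which is where the convergence and resolution benefits described above come from.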
representative citing papers
- Introduces OW-SED paradigm and WOOT transformer framework to detect known sounds, identify unseen events, and incrementally learn in open audio environments.
- A YOLO26 model trained on four leaf segmentation datasets reaches 83.9% mean mAP50-95 on their test sets but only 40.2% on a new 23-species benchmark, revealing substantial cross-domain generalization gaps.
- ConFusion reaches 59.1 mAP and 65.6 NDS on nuScenes validation by combining heterogeneous queries with QMix cross-attention and QSwap feature exchange.
- URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, tracking, and depth estimation.
- CoEvoer is a new cross-dependency transformer framework for upper-body expressive human pose and shape estimation that achieves state-of-the-art performance by enabling mutual enhancement between body parts.
- HELP uses heatmap-guided positional embeddings and a gradient mask to suppress background noise in queries, enabling efficient small-object detection with fewer decoder layers and parameters.
- Topology-preserving synthetic P&IDs generated by seeding from real drawings enable models trained solely on synthetics to achieve 63.8% edge mAP on real P&ID benchmarks, closing most of the gap to real-data training.
- The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
- YUV20K is a complexity-driven VCOD benchmark with 24k annotated frames, paired with a model using Motion Feature Stabilization via semantic primitives and Trajectory-Aware Alignment via deformable sampling that outperforms prior methods.
- DinoRADE reports a radar-centered multi-class detection pipeline that fuses dense radar tensors with DINOv3 features via deformable attention and outperforms prior radar-camera methods by 12.1% on the K-Radar dataset across weather conditions.
- Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
- WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
- A probabilistic unfolding network with stable likelihood projection and dual-domain Mamba achieves state-of-the-art reconstruction in quantized compressive sensing.
- A new framework combines self-attention on the Oblique manifold with bidirectional geodesic cross-attention on the Lorentz hyperboloid to improve both localization accuracy and descriptive coherence in 3D dense captioning.
- A graph-regulated disentangling Mamba model with sparse tokens achieves 93.94% accuracy classifying tree species from MODIS time series in Alberta and outperforms twelve prior models.
- Non-overlapping RGB-T adversarial patterns on clothing, optimized with spatial discrete-continuous optimization, achieve high attack success rates against multiple RGB-T detector fusion architectures in both digital and physical evaluations.
- InterMesh improves multi-person human mesh recovery accuracy by explicitly enriching DETR-style queries with structured interaction semantics from a human-object detector.
- FUN is an end-to-end Focal U-Net that performs joint hyperspectral image reconstruction and object detection via multi-task learning with focal modulation, achieving SOTA results with 40% fewer parameters and a new 363-image dataset.
- ViCrop-Det uses spatial attention entropy from the decoder to dynamically crop and refine small-object regions in transformer detectors during inference.
- GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.
- OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
- WSRVOS enables referring video object segmentation with text-only supervision by combining MLLM-based expression augmentation, multimodal feature interaction, pseudo-mask fusion, and temporal ranking constraints.
- HiProto uses hierarchical prototypes with RPC-Loss, PR-Loss, and SPLGS to deliver competitive, interpretable object detection on low-quality datasets like ExDark and RTTS.
citing papers explorer
- Towards Open World Sound Event Detection