Deformable DETR: Deformable Transformers for End-to-End Object Detection
44 Pith papers cite this work.
abstract
DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we proposed Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10 times less training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code is released at https://github.com/fundamentalvision/Deformable-DETR.
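The key idea in the abstract is that each query attends only to a small set of sampled points around a reference location, rather than to every position in the feature map. The following is a minimal, single-head numpy sketch of that sampling scheme under simplifying assumptions (one feature level, one query, hypothetical names like `bilinear_sample` and `W_v`); it is an illustration of the mechanism, not the authors' implementation.

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Bilinearly sample a (H, W, C) feature map at fractional coords (y, x)."""
    H, W, _ = feature_map.shape
    y, x = np.clip(y, 0, H - 1), np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feature_map[y0, x0]
            + (1 - wy) * wx * feature_map[y0, x1]
            + wy * (1 - wx) * feature_map[y1, x0]
            + wy * wx * feature_map[y1, x1])

def deformable_attention(feature_map, reference, offsets, attn_logits, W_v):
    """Attend to K sampled points around `reference` instead of all H*W keys.

    feature_map : (H, W, C) image features
    reference   : (2,) reference point (y, x) in pixel coordinates
    offsets     : (K, 2) learned sampling offsets relative to the reference
    attn_logits : (K,) learned attention logits (softmaxed directly,
                  not computed as query-key dot products)
    W_v         : (C, C) value projection matrix
    """
    # Softmax over the K sampling points only -- this is what makes the
    # cost independent of the H*W size of the feature map.
    weights = np.exp(attn_logits - attn_logits.max())
    weights /= weights.sum()

    out = np.zeros(feature_map.shape[-1])
    for k in range(len(offsets)):
        y, x = reference + offsets[k]          # deformed sampling location
        value = bilinear_sample(feature_map, y, x) @ W_v
        out += weights[k] * value              # convex combination of K values
    return out
```

Because the attention weights are predicted directly and applied to only K points (K is typically small, e.g. 4 per head per level), the per-query cost is O(K·C) instead of O(H·W·C), which is where the convergence and resolution benefits described above come from.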
representative citing papers
- Introduces OW-SED paradigm and WOOT transformer framework to detect known sounds, identify unseen events, and incrementally learn in open audio environments.
- A YOLO26 model trained on four leaf segmentation datasets reaches 83.9% mean mAP50-95 on their test sets but only 40.2% on a new 23-species benchmark, revealing substantial cross-domain generalization gaps.
- ConFusion reaches 59.1 mAP and 65.6 NDS on nuScenes validation by combining heterogeneous queries with QMix cross-attention and QSwap feature exchange.
- URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, tracking, and depth estimation.
- CoEvoer is a new cross-dependency transformer framework for upper-body expressive human pose and shape estimation that achieves state-of-the-art performance by enabling mutual enhancement between body parts.
- HELP uses heatmap-guided positional embeddings and a gradient mask to suppress background noise in queries, enabling efficient small-object detection with fewer decoder layers and parameters.
- Topology-preserving synthetic P&IDs generated by seeding from real drawings enable models trained solely on synthetics to achieve 63.8% edge mAP on real P&ID benchmarks, closing most of the gap to real-data training.
- The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
- YUV20K is a complexity-driven VCOD benchmark with 24k annotated frames, paired with a model using Motion Feature Stabilization via semantic primitives and Trajectory-Aware Alignment via deformable sampling that outperforms prior methods.
- DinoRADE reports a radar-centered multi-class detection pipeline that fuses dense radar tensors with DINOv3 features via deformable attention and outperforms prior radar-camera methods by 12.1% on the K-Radar dataset across weather conditions.
- Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
- WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
- A probabilistic unfolding network with stable likelihood projection and dual-domain Mamba achieves state-of-the-art reconstruction in quantized compressive sensing.
- A new framework combines self-attention on the Oblique manifold with bidirectional geodesic cross-attention on the Lorentz hyperboloid to improve both localization accuracy and descriptive coherence in 3D dense captioning.
- A graph-regulated disentangling Mamba model with sparse tokens achieves 93.94% accuracy classifying tree species from MODIS time series in Alberta and outperforms twelve prior models.
- Non-overlapping RGB-T adversarial patterns on clothing, optimized with spatial discrete-continuous optimization, achieve high attack success rates against multiple RGB-T detector fusion architectures in both digital and physical evaluations.
- InterMesh improves multi-person human mesh recovery accuracy by explicitly enriching DETR-style queries with structured interaction semantics from a human-object detector.
- FUN is an end-to-end Focal U-Net that performs joint hyperspectral image reconstruction and object detection via multi-task learning with focal modulation, achieving SOTA results with 40% fewer parameters and a new 363-image dataset.
- ViCrop-Det uses spatial attention entropy from the decoder to dynamically crop and refine small-object regions in transformer detectors during inference.
- GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.
- OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
- WSRVOS enables referring video object segmentation with text-only supervision by combining MLLM-based expression augmentation, multimodal feature interaction, pseudo-mask fusion, and temporal ranking constraints.
- HiProto uses hierarchical prototypes with RPC-Loss, PR-Loss, and SPLGS to deliver competitive, interpretable object detection on low-quality datasets like ExDark and RTTS.
citing papers explorer
- Towards Open World Sound Event Detection