FlowOVD applies rectified flow to generate continuous latent query dynamics for text-conditioned open-vocabulary detection, reporting 49.5 AP on COCO and 31.5 AP on LVIS.
hub Mixed citations
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Mixed citation behavior. Most common role is background (62%).
abstract
DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we proposed Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10 times less training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code is released at https://github.com/fundamentalvision/Deformable-DETR.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we proposed Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10 times less training epochs. Extensive e
co-cited works
representative citing papers
Presents the first multispectral dataset for fine-grained small-UAV detection and a dual-stream MFDNet baseline that gains 6.2% AP50 over RGB-only detectors by using spectral material cues.
UniTopo unifies lane detection and topology reasoning into a single perception model, outperforming prior methods on OpenLane-V2 benchmarks with TOP_ll scores of 30.1% and 31.8%.
InterMesh explicitly incorporates human-object interaction semantics into multi-person mesh recovery via a detector and two lightweight modules, delivering up to 9.9% MPJPE reduction on interaction-heavy datasets.
A YOLO26 model trained on four leaf segmentation datasets reaches 83.9% mean mAP50-95 on their test sets but only 40.2% on a new 23-species benchmark, revealing substantial cross-domain generalization gaps.
ConFusion reaches 59.1 mAP and 65.6 NDS on nuScenes validation by combining heterogeneous queries with QMix cross-attention and QSwap feature exchange.
URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, tracking, and depth estimation.
CoEvoer is a new cross-dependency transformer framework for upper-body expressive human pose and shape estimation that achieves state-of-the-art performance by enabling mutual enhancement between body parts.
HELP uses heatmap-guided positional embeddings and a gradient mask to suppress background noise in queries, enabling efficient small-object detection with fewer decoder layers and parameters.
Topology-preserving synthetic P&IDs generated by seeding from real drawings enable models trained solely on synthetics to achieve 63.8% edge mAP on real P&ID benchmarks, closing most of the gap to real-data training.
The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
YUV20K is a complexity-driven VCOD benchmark with 24k annotated frames, paired with a model using Motion Feature Stabilization via semantic primitives and Trajectory-Aware Alignment via deformable sampling that outperforms prior methods.
DinoRADE reports a radar-centered multi-class detection pipeline that fuses dense radar tensors with DINOv3 features via deformable attention and outperforms prior radar-camera methods by 12.1% on the K-Radar dataset across weather conditions.
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
MoCA3D formulates monocular 3D box prediction as dense pixel-space tasks using corner heatmaps and depth maps, with a new PAG metric for image-plane evaluation.
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
A hierarchical spiking transformer using Q-K attention achieves 85.65% top-1 accuracy on ImageNet-1K, the first direct-trained SNN to exceed 85%.
Deformba introduces context-adaptive state fusion to vision SSMs for better spatial augmentation and cross-stream interactions, showing strong results on 2D classification/detection/segmentation and 3D BEV perception benchmarks.
VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.
Invaria trains point cloud encoders with next-resolution prediction to learn scale and density invariant features, yielding higher mIoU on ScanNet under lower resolution and scaled objects while using a smaller model.
SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.
A probabilistic unfolding network with stable likelihood projection and dual-domain Mamba achieves state-of-the-art reconstruction in quantized compressive sensing.
A new framework combines self-attention on the Oblique manifold with bidirectional geodesic cross-attention on the Lorentz hyperboloid to improve both localization accuracy and descriptive coherence in 3D dense captioning.
citing papers explorer
-
Unified Modeling of Lane and Lane Topology for Driving Scene Reasoning
UniTopo unifies lane detection and topology reasoning into a single perception model, outperforming prior methods on OpenLane-V2 benchmarks with TOP_ll scores of 30.1% and 31.8%.
-
PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
PET-DINO unifies visual and text prompts in Grounding DINO via an alignment-friendly generation module and prompt-enriched training strategies to improve zero-shot open-set object detection.