Deformable DETR: Deformable Transformers for End-to-End Object Detection

Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai · 2020 · cs.CV · arXiv 2010.04159

Mixed citation behavior. Most common role is background (62%).

66 Pith papers citing it

Background 62% of classified citations

abstract

DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we proposed Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10 times less training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code is released at https://github.com/fundamentalvision/Deformable-DETR.
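The abstract's central idea, attending only to a small set of sampling points around a reference, can be illustrated with a minimal NumPy sketch. This is not the released implementation: it assumes a single head and a single feature scale, and it takes the sampling offsets and attention logits as plain inputs, whereas the paper predicts them from the query features. The function names `bilinear_sample` and `deformable_attention` are hypothetical.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly interpolate feat (H, W, C) at continuous pixel coords (x, y).

    Neighbours that fall outside the map contribute zeros.
    """
    H, W, C = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    out = np.zeros(C, dtype=float)
    for yi, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xi < W and 0 <= yi < H:
                out += wx * wy * feat[yi, xi]
    return out

def deformable_attention(value, ref_point, offsets, logits):
    """Single-head, single-scale sketch (illustrative assumption).

    value:     (H, W, C) feature map
    ref_point: (x, y) reference location in pixel coordinates
    offsets:   (K, 2) sampling offsets (dx, dy) relative to ref_point
    logits:    (K,) unnormalised attention scores, one per sampling point
    """
    logits = np.asarray(logits, dtype=float)
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    out = np.zeros(value.shape[-1], dtype=float)
    for (dx, dy), w in zip(offsets, weights):
        out += w * bilinear_sample(value, ref_point[0] + dx, ref_point[1] + dy)
    return out

# Toy feature map whose two channels store the x and y coordinate of each cell.
H, W = 6, 6
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
value = np.stack([xs, ys], axis=-1).astype(float)

offsets = np.array([[0.5, 0.0], [-0.5, 0.0], [0.0, 0.5], [0.0, -0.5]])
out = deformable_attention(value, (2.0, 3.0), offsets, np.zeros(4))
```

Each query here touches only K = 4 sampled locations instead of all H × W keys, which is the source of the efficiency the abstract describes. With uniform weights and symmetric offsets on this coordinate-valued map, the output is the reference point itself.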

citation-role summary

background 5 · method 2 · baseline 1

citation-polarity summary

background 5 · use method 2 · baseline 1

representative citing papers

FlowOVD: Learning Generative Latent Flows for Zero-shot Open-vocabulary Detection

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

FlowOVD applies rectified flow to generate continuous latent query dynamics for text-conditioned open-vocabulary detection, reporting 49.5 AP on COCO and 31.5 AP on LVIS.

Towards UAV Detection in the Real World: A New Multispectral Dataset UAVNet-MS and a New Method

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Presents the first multispectral dataset for fine-grained small-UAV detection and a dual-stream MFDNet baseline that gains 6.2% AP50 over RGB-only detectors by using spectral material cues.

Unified Modeling of Lane and Lane Topology for Driving Scene Reasoning

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

UniTopo unifies lane detection and topology reasoning into a single perception model, outperforming prior methods on OpenLane-V2 benchmarks with TOP_ll scores of 30.1% and 31.8%.

InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery

cs.CV · 2026-05-06 · conditional · novelty 7.0 · 2 refs

InterMesh explicitly incorporates human-object interaction semantics into multi-person mesh recovery via a detector and two lightweight modules, delivering up to 9.9% MPJPE reduction on interaction-heavy datasets.

ReLeaf: Benchmarking Leaf Segmentation across Domains and Species

cs.CV · 2026-05-05 · unverdicted · novelty 7.0

A YOLO26 model trained on four leaf segmentation datasets reaches 83.9% mean mAP50-95 on their test sets but only 40.2% on a new 23-species benchmark, revealing substantial cross-domain generalization gaps.

Control Your Queries: Heterogeneous Query Interaction for Camera-Radar Fusion

cs.CV · 2026-04-28 · unverdicted · novelty 7.0

ConFusion reaches 59.1 mAP and 65.6 NDS on nuScenes validation by combining heterogeneous queries with QMix cross-attention and QSwap feature exchange.

URoPE: Universal Relative Position Embedding across Geometric Spaces

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, tracking, and depth estimation.

Chatting about Upper-Body Expressive Human Pose and Shape Estimation

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

CoEvoer is a new cross-dependency transformer framework for upper-body expressive human pose and shape estimation that achieves state-of-the-art performance by enabling mutual enhancement between body parts.

Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

HELP uses heatmap-guided positional embeddings and a gradient mask to suppress background noise in queries, enabling efficient small-object detection with fewer decoder layers and parameters.

SynthPID: P&ID digitization from Topology-Preserving Synthetic Data

cs.CV · 2026-04-15 · conditional · novelty 7.0

Topology-preserving synthetic P&IDs generated by seeding from real drawings enable models trained solely on synthetics to achieve 63.8% edge mAP on real P&ID benchmarks, closing most of the gap to real-data training.

Online Reasoning Video Object Segmentation

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.

YUV20K: A Complexity-Driven Benchmark and Trajectory-Aware Alignment Model for Video Camouflaged Object Detection

cs.CV · 2026-04-11 · unverdicted · novelty 7.0

YUV20K is a complexity-driven VCOD benchmark with 24k annotated frames, paired with a model using Motion Feature Stabilization via semantic primitives and Trajectory-Aware Alignment via deformable sampling that outperforms prior methods.

DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

DinoRADE reports a radar-centered multi-class detection pipeline that fuses dense radar tensors with DINOv3 features via deformable attention and outperforms prior radar-camera methods by 12.1% on the K-Radar dataset across weather conditions.

Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.

WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.

MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane

cs.CV · 2026-03-20 · unverdicted · novelty 7.0

MoCA3D formulates monocular 3D box prediction as dense pixel-space tasks using corner heatmaps and depth maps, with a new PAG metric for image-plane evaluation.

SAM 3: Segment Anything with Concepts

cs.CV · 2025-11-20 · unverdicted · novelty 7.0

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.

QKFormer: Hierarchical Spiking Transformer using Q-K Attention

cs.NE · 2024-03-25 · conditional · novelty 7.0

A hierarchical spiking transformer using Q-K attention achieves 85.65% top-1 accuracy on ImageNet-1K, the first direct-trained SNN to exceed 85%.

Deformba: Vision State Space Model with Adaptive State Fusion

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

Deformba introduces context-adaptive state fusion to vision SSMs for better spatial augmentation and cross-stream interactions, showing strong results on 2D classification/detection/segmentation and 3D BEV perception benchmarks.

Vision Foundation Models as Generalist Tokenizers for Image Generation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.

Invaria: Learning Scale and Density Invariance in Point Clouds via Next-Resolution Prediction

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

Invaria trains point cloud encoders with next-resolution prediction to learn scale and density invariant features, yielding higher mIoU on ScanNet under lower resolution and scaled objects while using a smaller model.

SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.

Deep Probabilistic Unfolding for Quantized Compressive Sensing

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

A probabilistic unfolding network with stable likelihood projection and dual-domain Mamba achieves state-of-the-art reconstruction in quantized compressive sensing.

Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

A new framework combines self-attention on the Oblique manifold with bidirectional geodesic cross-attention on the Lorentz hyperboloid to improve both localization accuracy and descriptive coherence in 3D dense captioning.

citing papers explorer

Showing 50 of 66 citing papers.

FlowOVD: Learning Generative Latent Flows for Zero-shot Open-vocabulary Detection cs.CV · 2026-05-30 · unverdicted · none · ref 33 · internal anchor
FlowOVD applies rectified flow to generate continuous latent query dynamics for text-conditioned open-vocabulary detection, reporting 49.5 AP on COCO and 31.5 AP on LVIS.
Towards UAV Detection in the Real World: A New Multispectral Dataset UAVNet-MS and a New Method cs.CV · 2026-05-20 · unverdicted · none · ref 32 · internal anchor
Presents the first multispectral dataset for fine-grained small-UAV detection and a dual-stream MFDNet baseline that gains 6.2% AP50 over RGB-only detectors by using spectral material cues.
Unified Modeling of Lane and Lane Topology for Driving Scene Reasoning cs.CV · 2026-05-09 · unverdicted · none · ref 17 · internal anchor
UniTopo unifies lane detection and topology reasoning into a single perception model, outperforming prior methods on OpenLane-V2 benchmarks with TOP_ll scores of 30.1% and 31.8%.
InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery cs.CV · 2026-05-06 · conditional · none · ref 11 · 2 links · internal anchor
InterMesh explicitly incorporates human-object interaction semantics into multi-person mesh recovery via a detector and two lightweight modules, delivering up to 9.9% MPJPE reduction on interaction-heavy datasets.
ReLeaf: Benchmarking Leaf Segmentation across Domains and Species cs.CV · 2026-05-05 · unverdicted · none · ref 45 · internal anchor
A YOLO26 model trained on four leaf segmentation datasets reaches 83.9% mean mAP50-95 on their test sets but only 40.2% on a new 23-species benchmark, revealing substantial cross-domain generalization gaps.
Control Your Queries: Heterogeneous Query Interaction for Camera-Radar Fusion cs.CV · 2026-04-28 · unverdicted · none · ref 72 · internal anchor
ConFusion reaches 59.1 mAP and 65.6 NDS on nuScenes validation by combining heterogeneous queries with QMix cross-attention and QSwap feature exchange.
URoPE: Universal Relative Position Embedding across Geometric Spaces cs.CV · 2026-04-20 · unverdicted · none · ref 44 · internal anchor
URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, tracking, and depth estimation.
Chatting about Upper-Body Expressive Human Pose and Shape Estimation cs.CV · 2026-04-20 · unverdicted · none · ref 35 · internal anchor
CoEvoer is a new cross-dependency transformer framework for upper-body expressive human pose and shape estimation that achieves state-of-the-art performance by enabling mutual enhancement between body parts.
Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection cs.CV · 2026-04-16 · unverdicted · none · ref 42 · internal anchor
HELP uses heatmap-guided positional embeddings and a gradient mask to suppress background noise in queries, enabling efficient small-object detection with fewer decoder layers and parameters.
SynthPID: P&ID digitization from Topology-Preserving Synthetic Data cs.CV · 2026-04-15 · conditional · none · ref 22 · internal anchor
Topology-preserving synthetic P&IDs generated by seeding from real drawings enable models trained solely on synthetics to achieve 63.8% edge mAP on real P&ID benchmarks, closing most of the gap to real-data training.
Online Reasoning Video Object Segmentation cs.CV · 2026-04-13 · unverdicted · none · ref 58 · internal anchor
The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
YUV20K: A Complexity-Driven Benchmark and Trajectory-Aware Alignment Model for Video Camouflaged Object Detection cs.CV · 2026-04-11 · unverdicted · none · ref 10 · internal anchor
YUV20K is a complexity-driven VCOD benchmark with 24k annotated frames, paired with a model using Motion Feature Stabilization via semantic primitives and Trajectory-Aware Alignment via deformable sampling that outperforms prior methods.
DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather cs.CV · 2026-04-09 · unverdicted · none · ref 55 · internal anchor
DinoRADE reports a radar-centered multi-class detection pipeline that fuses dense radar tensors with DINOv3 features via deformable attention and outperforms prior radar-camera methods by 12.1% on the K-Radar dataset across weather conditions.
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding cs.CV · 2026-04-09 · unverdicted · none · ref 78 · internal anchor
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects cs.CV · 2026-04-09 · unverdicted · none · ref 13 · internal anchor
WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane cs.CV · 2026-03-20 · unverdicted · none · ref 59 · internal anchor
MoCA3D formulates monocular 3D box prediction as dense pixel-space tasks using corner heatmaps and depth maps, with a new PAG metric for image-plane evaluation.
SAM 3: Segment Anything with Concepts cs.CV · 2025-11-20 · unverdicted · none · ref 168 · internal anchor
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
QKFormer: Hierarchical Spiking Transformer using Q-K Attention cs.NE · 2024-03-25 · conditional · none · ref 7 · internal anchor
A hierarchical spiking transformer using Q-K attention achieves 85.65% top-1 accuracy on ImageNet-1K, the first direct-trained SNN to exceed 85%.
Deformba: Vision State Space Model with Adaptive State Fusion cs.CV · 2026-05-20 · unverdicted · none · ref 11 · internal anchor
Deformba introduces context-adaptive state fusion to vision SSMs for better spatial augmentation and cross-stream interactions, showing strong results on 2D classification/detection/segmentation and 3D BEV perception benchmarks.
Vision Foundation Models as Generalist Tokenizers for Image Generation cs.CV · 2026-05-18 · unverdicted · none · ref 104 · internal anchor
VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.
Invaria: Learning Scale and Density Invariance in Point Clouds via Next-Resolution Prediction cs.CV · 2026-05-15 · unverdicted · none · ref 17 · internal anchor
Invaria trains point cloud encoders with next-resolution prediction to learn scale and density invariant features, yielding higher mIoU on ScanNet under lower resolution and scaled objects while using a smaller model.
SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection cs.CV · 2026-05-13 · unverdicted · none · ref 66 · internal anchor
SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.
Deep Probabilistic Unfolding for Quantized Compressive Sensing cs.CV · 2026-05-12 · unverdicted · none · ref 50 · internal anchor
A probabilistic unfolding network with stable likelihood projection and dual-domain Mamba achieves state-of-the-art reconstruction in quantized compressive sensing.
Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding cs.CV · 2026-05-09 · unverdicted · none · ref 56 · internal anchor
A new framework combines self-attention on the Oblique manifold with bidirectional geodesic cross-attention on the Lorentz hyperboloid to improve both localization accuracy and descriptive coherence in 3D dense captioning.
A Novel Graph-Regulated Disentangling Mamba Model with Sparse Tokens for Enhanced Tree Species Classification from MODIS Time Series cs.CV · 2026-05-07 · unverdicted · none · ref 30 · internal anchor
A graph-regulated disentangling Mamba model with sparse tokens achieves 93.94% accuracy classifying tree species from MODIS time series in Alberta and outperforms twelve prior models.
Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern cs.CV · 2026-05-06 · unverdicted · none · ref 47 · internal anchor
Non-overlapping RGB-T adversarial patterns on clothing, optimized with spatial discrete-continuous optimization, achieve high attack success rates against multiple RGB-T detector fusion architectures in both digital and physical evaluations.
FUN: A Focal U-Net Combining Reconstruction and Object Detection for Snapshot Spectral Imaging cs.CV · 2026-04-30 · unverdicted · none · ref 49 · internal anchor
FUN is an end-to-end Focal U-Net that performs joint hyperspectral image reconstruction and object detection via multi-task learning with focal modulation, achieving SOTA results with 40% fewer parameters and a new 363-image dataset.
ViCrop-Det: Spatial Attention Entropy Guided Cropping for Training-Free Small-Object Detection cs.CV · 2026-04-29 · unverdicted · none · ref 9 · internal anchor
ViCrop-Det uses spatial attention entropy from the decoder to dynamically crop and refine small-object regions in transformer detectors during inference.
GateMOT: Q-Gated Attention for Dense Object Tracking cs.CV · 2026-04-29 · unverdicted · none · ref 95 · internal anchor
GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models cs.CV · 2026-04-20 · unverdicted · none · ref 62 · internal anchor
OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
Weakly-Supervised Referring Video Object Segmentation through Text Supervision cs.CV · 2026-04-20 · unverdicted · none · ref 54 · internal anchor
WSRVOS enables referring video object segmentation with text-only supervision by combining MLLM-based expression augmentation, multimodal feature interaction, pseudo-mask fusion, and temporal ranking constraints.
HiProto: Hierarchical Prototype Learning for Interpretable Object Detection Under Low-quality Conditions cs.CV · 2026-04-15 · unverdicted · none · ref 50 · internal anchor
HiProto uses hierarchical prototypes with RPC-Loss, PR-Loss, and SPLGS to deliver competitive, interpretable object detection on low-quality datasets like ExDark and RTTS.
Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization cs.CV · 2026-04-13 · unverdicted · none · ref 14 · internal anchor
VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.
Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection cs.CV · 2026-04-07 · unverdicted · none · ref 60 · internal anchor
Telescope uses learnable hyperbolic foveation to deliver a 76% relative mAP gain (0.185 to 0.326) for objects beyond 250 meters while keeping overhead low.
Geometrical Cross-Attention and Nonvoid Voxelization for Efficient 3D Medical Image Segmentation cs.CV · 2026-04-07 · unverdicted · none · ref 18 · internal anchor
GCNV-Net achieves state-of-the-art accuracy on multiple 3D medical segmentation benchmarks while cutting FLOPs by 56% and inference latency by 68% through dynamic nonvoid voxelization and geometric attention.
PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training cs.CV · 2026-04-01 · unverdicted · none · ref 38 · internal anchor
PET-DINO unifies visual and text prompts in Grounding DINO via an alignment-friendly generation module and prompt-enriched training strategies to improve zero-shot open-set object detection.
Focus on What Really Matters in Low-Altitude Governance: A Management-Centric Multi-Modal Benchmark with Implicitly Coordinated Vision-Language Reasoning Framework cs.CV · 2026-01-27 · unverdicted · none · ref 34 · internal anchor
Presents the first management-oriented multi-modal benchmark GovLA-10K and a vision-language reasoning framework GovLA-Reasoner with a spatially-aware adapter for low-altitude aerial perception.
ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding cs.CV · 2025-12-03 · unverdicted · none · ref 96 · internal anchor
ShelfGaussian achieves state-of-the-art zero-shot semantic occupancy prediction on Occ3D-nuScenes by jointly supervising Gaussian representations with vision foundation model features at 2D image and 3D scene levels.
YOLOv12: Attention-Centric Real-Time Object Detectors cs.CV · 2025-02-18 · unverdicted · none · ref 72 · internal anchor
YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations cs.CV · 2025-01-13 · unverdicted · none · ref 51 · internal anchor
Introduces TimberVision dataset and multi-task framework for log-component segmentation, detection, and tracking in forestry operations using RGB images.
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos cs.CV · 2025-01-07 · conditional · none · ref 125 · internal anchor
Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.
Uncertainty Quantification in Detection Transformers: Object-Level Calibration and Image-Level Reliability cs.CV · 2024-12-02 · unverdicted · none · ref 14 · internal anchor
DETRs learn an optimal specialist strategy via the Hungarian loss, motivating the new Object-level Calibration Error (OCE) metric and an image-level post-hoc uncertainty quantification framework.
GigaCheck: Detecting LLM-generated Content via Object-Centric Span Localization cs.CL · 2024-10-31 · unverdicted · none · ref 93 · internal anchor
GigaCheck detects LLM-generated text at both document and span levels by combining fine-tuned language-model embeddings with a DETR-like architecture that treats generated intervals as detectable objects.
STAR-IOD: Scale-decoupled Topology Alignment with Pseudo-label Refinement for Remote Sensing Incremental Object Detection cs.CV · 2026-05-20 · unverdicted · none · ref 71 · internal anchor
STAR-IOD applies scale-decoupled topology alignment and K-Means-based pseudo-label refinement to reduce catastrophic forgetting in remote sensing incremental object detection, reporting 1.7% and 2.1% mAP gains on new DIOR-IOD and DOTA-IOD datasets.
MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes cs.CV · 2026-05-14 · conditional · none · ref 48 · internal anchor
MR2-ByteTrack maintains high accuracy in video object detection on MCUs by combining multi-resolution processing, ByteTrack for frame linking, and Rescore for confidence aggregation, achieving up to 55% energy savings and real-time performance for both CNN and Transformer models.
Towards Open World Sound Event Detection cs.SD · 2026-05-05 · unverdicted · none · ref 14 · 2 links · internal anchor
Introduces OW-SED paradigm and WOOT framework with deformable attention for detecting known and unseen sound events in open-world settings.
Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation cs.CV · 2026-05-04 · unverdicted · none · ref 43 · 2 links · internal anchor
HeroCrystal achieves 33.4% mAP on cross-domain multi-camera object detection by combining one-shot diffusion-based synthetic data generation, probabilistic federated Faster R-CNN, and inconsistent-category distillation, outperforming prior privacy-preserving baselines by 2.1%.
Investigation of cardinality classification for bacterial colony counting using explainable artificial intelligence cs.CV · 2026-04-21 · unverdicted · none · ref 59 · internal anchor
XAI analysis identifies high visual similarity across colony cardinality classes as the primary limit on MicrobiaNet performance in bacterial colony counting, revising prior model assessments.
Learning Class Difficulty in Imbalanced Histopathology Segmentation via Dynamic Focal Attention eess.IV · 2026-04-15 · unverdicted · none · ref 23 · internal anchor
Dynamic Focal Attention learns class-specific difficulty via per-class biases in attention logits, improving Dice and IoU on imbalanced histopathology segmentation benchmarks.
Hypergraph-State Collaborative Reasoning for Multi-Object Tracking cs.CV · 2026-04-14 · unverdicted · none · ref 74 · internal anchor
HyperSSM integrates hypergraphs and state space models to let correlated objects mutually refine motion estimates, stabilizing trajectories under noise and occlusion for state-of-the-art multi-object tracking.
