NERVE is a new 600GB multi-sensor dataset with DVS, RGB-D, and 24/77GHz radar plus baselines showing DVS+77GHz radar fusion improves human detection to 47.5% mAP with sub-1.8m distance error.
YOLOX: Exceeding YOLO Series in 2021
Canonical reference. 70% of citing Pith papers cite this work as background.
abstract
In this report, we present some experienced improvements to YOLO series, forming a new high-performance detector -- YOLOX. We switch the YOLO detector to an anchor-free manner and conduct other advanced detection techniques, i.e., a decoupled head and the leading label assignment strategy SimOTA to achieve state-of-the-art results across a large scale range of models: For YOLO-Nano with only 0.91M parameters and 1.08G FLOPs, we get 25.3% AP on COCO, surpassing NanoDet by 1.8% AP; for YOLOv3, one of the most widely used detectors in industry, we boost it to 47.3% AP on COCO, outperforming the current best practice by 3.0% AP; for YOLOX-L with roughly the same amount of parameters as YOLOv4-CSP, YOLOv5-L, we achieve 50.0% AP on COCO at a speed of 68.9 FPS on Tesla V100, exceeding YOLOv5-L by 1.8% AP. Further, we won the 1st Place on Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) using a single YOLOX-L model. We hope this report can provide useful experience for developers and researchers in practical scenes, and we also provide deploy versions with ONNX, TensorRT, NCNN, and Openvino supported. Source code is at https://github.com/Megvii-BaseDetection/YOLOX.
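The abstract names the switch to an anchor-free detector but does not spell out how predictions become boxes. As a minimal sketch, assuming a common anchor-free parameterization (center offsets added to the grid-cell index, sizes passed through an exponential, both scaled by the feature-map stride), a single prediction could be decoded as below. The function name, argument names, and the parameterization itself are illustrative assumptions, not details confirmed by this page.

```python
import math


def decode_anchor_free(raw, grid_x, grid_y, stride):
    """Decode one per-cell prediction into a box in input-image pixels.

    raw     -- (tx, ty, tw, th): hypothetical raw head outputs for one cell
    grid_x  -- column index of the cell on the feature map
    grid_y  -- row index of the cell on the feature map
    stride  -- downsampling factor between the input image and the feature map
    Returns (cx, cy, w, h). No anchor boxes are involved: the cell position
    and stride alone fix the reference frame (assumed parameterization).
    """
    tx, ty, tw, th = raw
    cx = (tx + grid_x) * stride   # center x: offset from the cell origin
    cy = (ty + grid_y) * stride   # center y: offset from the cell origin
    w = math.exp(tw) * stride     # width: exp keeps it positive
    h = math.exp(th) * stride     # height: exp keeps it positive
    return cx, cy, w, h


# One prediction at cell (3, 2) on a stride-8 feature map.
box = decode_anchor_free((0.5, 0.5, 0.0, 0.0), grid_x=3, grid_y=2, stride=8)
print(box)
```

Because each cell emits one box relative to its own position, there are no per-anchor width and height priors to tune, which is the practical appeal of the anchor-free design the abstract refers to.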
representative citing papers
CUTAL scores multi-frame clips for uncertainty and enforces temporal diversity to train transformer MOT models to near full-supervision performance with 50% of the labels.
LAMP tracks 3D human motion from moving multi-camera headsets by converting 2D detections to a unified metric 3D world frame via device localization and fitting with an end-to-end spatio-temporal transformer.
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.
WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
E³C is a video diffusion model that disentangles persistent 3D scene structure via point-cloud memory from human dynamics via ego-exo pose controls for improved egocentric video generation on the Nymeria dataset.
A framework with new metrics and train-time/post-hoc calibrators aligns probabilistic object detectors to annotator disagreement distributions for classification and localization without ground truth.
TrajVAD shows that bounding-box trajectories modeled via normalizing flows can serve as a primary cue for video anomaly detection, with the trajectory-only variant achieving 87.7% AP on ShanghaiTech and best results on MSAD.
Contrastive pretraining on mammography atlas image-text pairs improves BI-RADS classification F1 by 1-14% especially in low-label regimes, outperforming equivalent numbers of direct labels in some settings.
SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.
A deterministic queue-based matching algorithm using geometric overlaps and virtual lane discretization enables 99.8% handover success rate for continuous identity persistence in multi-UAV vehicle tracking.
Clear2Fog generates realistic synthetic fog from clear scenes, enabling mixed-density training that outperforms full fixed-density data and improves real-world performance by 1.67 mAP after learning-rate adjustment.
CalibFree enables calibration-free multi-camera tracking via self-supervised feature separation through single-view distillation and cross-view reconstruction, reporting 3% higher accuracy and 7.5% better F1 on tested datasets.
FUN is an end-to-end Focal U-Net that performs joint hyperspectral image reconstruction and object detection via multi-task learning with focal modulation, achieving SOTA results with 40% fewer parameters and a new 363-image dataset.
GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.
CAM3DNet outperforms prior camera-based 3D detectors on nuScenes, Waymo and Argoverse by using three new modules to better mine multi-scale spatiotemporal features from 2D queries and pyramid maps.
VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.
Scale-Gest creates a runtime-selectable family of tiny-YOLO models with device-calibrated ACE profiles and an ROI gate that cuts per-frame energy by 4x while holding event-level F1 at 0.8-0.9 on a new driving-gesture dataset.
Presents the first management-oriented multi-modal benchmark GovLA-10K and a vision-language reasoning framework GovLA-Reasoner with a spatially-aware adapter for low-altitude aerial perception.
AHCQ-SAM introduces ACNR, HLUQ, CAG, and LNQ quantization techniques that deliver 15.2% mAP gain on 4-bit SAM-B and 14.01% J&F gain on 4-bit SAM2-Tiny versus prior PTQ methods.
Dual-head knowledge distillation partitions the linear classifier into separate heads for logit and probability losses to exploit logits without causing classification head collapse.
DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.
PS-Track sets a new state-of-the-art for point-supervised multi-object tracking by converting point seeds into temporally consistent pseudo-labels via Temporal-Feedback Prompting, Point-Excited Wavelet Attention, and Uncertainty-Guided Gaussian Learning.
A two-stage RGB-T detector performs lightweight modality-specific proposal generation followed by sparse fusion-based refinement to match accuracy of heavier models at lower parameter and compute cost.
citing papers explorer
- Portable Active Learning for Object Detection
  PAL is a portable active learning method for object detection that uses class-specific logistic classifiers for uncertainty and image-level diversity to select annotation batches, showing better label efficiency than baselines on COCO, VOC, and BDD100K.
- Hierarchical Prompting with Dual LLM Modules for Robotic Task and Motion Planning
  A dual-LLM hierarchical framework for robotic task and motion planning, integrating object detection, achieves 86% success across 24 test scenarios ranging from simple spatial commands to infeasible requests.