MMDetection: Open MMLab Detection Toolbox and Benchmark

· 2019 · cs.CV · arXiv 1906.07155

Tool reference. 88% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.

46 Pith papers citing it

abstract

We present MMDetection, an object detection toolbox that contains a rich set of object detection and instance segmentation methods as well as related components and modules. The toolbox started from a codebase of MMDet team who won the detection track of COCO Challenge 2018. It gradually evolves into a unified platform that covers many popular detection methods and contemporary modules. It not only includes training and inference codes, but also provides weights for more than 200 network models. We believe this toolbox is by far the most complete detection toolbox. In this paper, we introduce the various features of this toolbox. In addition, we also conduct a benchmarking study on different methods, components, and their hyper-parameters. We wish that the toolbox and benchmark could serve the growing research community by providing a flexible toolkit to reimplement existing methods and develop their own new detectors. Code and models are available at https://github.com/open-mmlab/mmdetection. The project is under active development and we will keep this document updated.

citation-role summary

method 7 · background 1

citation-polarity summary

use method 7 · background 1
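
The 88% method share quoted near the top of this page appears consistent with the role counts in the citation-role summary (7 method, 1 background). The page does not say how the percentage is derived, so the sketch below assumes it is the method count divided by all classified citations, rounded to a whole number. The function name `role_share` is hypothetical and used only for illustration.

```python
# Counts copied from the citation-role summary on this page.
role_counts = {"method": 7, "background": 1}


def role_share(counts, role):
    """Return the percentage of classified citations that carry `role`.

    Assumption: the share is computed over classified citations only
    (8 here), not over all 46 citing papers.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no classified citations")
    return 100 * counts[role] / total


method_share = role_share(role_counts, "method")
print(f"method share: {method_share}%")
```

Under that assumption the unrounded share is 87.5%, which matches the displayed 88% once rounded to a whole number.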

representative citing papers

VMamba: Visual State Space Model

cs.CV · 2024-01-18 · conditional · novelty 8.0

VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

cs.CV · 2021-03-25 · accept · novelty 8.0

Swin Transformer reaches 87.3% ImageNet accuracy and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical pyramid.

Adversarial Attack and Disturbance Detection by Hadamard-Coded Output Representations for Object Detection and Semantic Segmentation

cs.CV · 2026-06-08 · unverdicted · novelty 7.0

HadamardNet applies Hadamard-coded outputs to segmentation and detection, with a novel projection-based decoder that supplies inconsistency measures for SOTA perturbation detection while preserving clean-data performance.

Grounding Surgical Action Triplets with Instrument Instance Segmentation: A Dataset and Target-Aware Fusion Approach

cs.CV · 2025-11-01 · unverdicted · novelty 7.0

Introduces the triplet segmentation task, CholecTriplet-Seg dataset with over 30,000 frames, and TargetFusionNet architecture extending Mask2Former for instance-level grounding of surgical <instrument, verb, target> triplets.

OD3: Optimization-free Dataset Distillation for Object Detection

cs.CV · 2025-06-02 · unverdicted · novelty 7.0

OD3 presents an optimization-free dataset distillation framework for object detection that reports new state-of-the-art accuracy on COCO and VOC at compression ratios from 0.25% to 5%.

FractalMamba++: Scaling Vision Mamba Across Resolutions via Hilbert Fractal Geometry

cs.CV · 2025-05-20 · unverdicted · novelty 7.0

FractalMamba++ scales Vision Mamba across resolutions by using Hilbert fractal serialization, hierarchy-based skip connections, and fractal-aware 2D rotary position encoding.

Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression

cs.LG · 2026-04-30 · unverdicted · novelty 7.0

Auto-FlexSwitch achieves efficient dynamic model merging by decomposing task vectors into sparse masks, signs, and scalars, then making the compression learnable via gating and adaptive bit selection with KNN-based retrieval.

KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition

cs.CV · 2026-04-25 · unverdicted · novelty 7.0

KAConvNet introduces a Kolmogorov-Arnold Convolutional Layer to build networks competitive with ViTs and CNNs while offering stronger theoretical interpretability.

UHR-DETR: Efficient End-to-End Small Object Detection for Ultra-High-Resolution Remote Sensing Imagery

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

UHR-DETR delivers 2.8% higher mAP and 10x faster inference than sliding-window baselines for small object detection in UHR remote sensing imagery on a single 24GB GPU.

FRTSearch: Unified Detection and Parameter Inference of Fast Radio Transients using Instance Segmentation

astro-ph.IM · 2026-04-14 · unverdicted · novelty 7.0

FRTSearch reframes fast radio transient detection as instance segmentation on dynamic spectra and uses the segmented shapes to infer dispersion measure and time of arrival, achieving 98% recall with over 99.9% fewer false positives than traditional methods.

DroneFINE: Domain-Aware Parameter-Efficient Fine-Tuning of Vision-Language Detectors for Drone Images

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

DroneFINE is a domain-aware PEFT approach for VLM-based drone detectors using foreground-aware multi-path adaptation and text-conditioned background suppression, outperforming standard PEFT and matching full fine-tuning on VisDrone and UAVDT with fewer trainable parameters.

CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection

cs.CV · 2026-06-05 · unverdicted · novelty 6.0

CL-CLIP uses CLIP image-text cost volumes to create class-specific pathways processed by a multi-expert RoI head, improving continual object detection on VOC and COCO over the F-ViT baseline.

Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders

cs.CV · 2026-05-30 · unverdicted · novelty 6.0

C-GSPN scales 2D spatial propagation to foundation vision encoders via a fast CUDA kernel, compressed blocks, and two-stage distillation, matching ViT performance with 15% fewer parameters and 4x block speedup at 2K resolution.

DisDop: Distillation with Domain Priors for Open-Vocabulary Aerial Object Detection

cs.CV · 2026-05-23 · unverdicted · novelty 6.0

DisDop distills complementary priors from RemoteCLIP and DINOv3 via teacher fusion and semantic modeling to reach new state-of-the-art results on open-vocabulary aerial detection benchmarks.

Deformba: Vision State Space Model with Adaptive State Fusion

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

Deformba introduces context-adaptive state fusion to vision SSMs for better spatial augmentation and cross-stream interactions, showing strong results on 2D classification/detection/segmentation and 3D BEV perception benchmarks.

Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Text embeddings in MM-DiTs encode a detectable omission signal for missing concepts; amplifying it via OSI reduces concept omission in text-to-image outputs on FLUX.1-Dev and SD3.5-Medium.

Focus on What Really Matters in Low-Altitude Governance: A Management-Centric Multi-Modal Benchmark with Implicitly Coordinated Vision-Language Reasoning Framework

cs.CV · 2026-01-27 · unverdicted · novelty 6.0

Presents the first management-oriented multi-modal benchmark GovLA-10K and a vision-language reasoning framework GovLA-Reasoner with a spatially-aware adapter for low-altitude aerial perception.

A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks

cs.CV · 2025-12-03 · unverdicted · novelty 6.0

NN-RAG extracts 1,289 candidate neural modules from 19 PyTorch repositories, validates 941 of them, and supplies roughly 72% of the novel structures in the LEMUR dataset while enabling cross-repository migration.

Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

cs.CV · 2025-05-08 · unverdicted · novelty 6.0

Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.

Spectral-Adaptive Modulation Networks for Visual Perception

cs.CV · 2025-03-31 · unverdicted · novelty 6.0

SPANetV2 is a vision backbone built around a new spectral-adaptive modulation mixer that outperforms prior models on ImageNet-1K classification, COCO detection, and ADE20K segmentation.

TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations

cs.CV · 2025-01-13 · unverdicted · novelty 6.0

Introduces TimberVision dataset and multi-task framework for log-component segmentation, detection, and tracking in forestry operations using RGB images.

Excretion Detection in Pigsties Using Convolutional and Transformerbased Deep Neural Networks

cs.CV · 2024-11-29 · unverdicted · novelty 6.0

Four object detection models achieve over 90% average precision detecting excretions in pigsties from thermal images and remain reasonably robust on out-of-distribution data from different barns.

UniISP: A Unified ISP Framework for Both Human and Machine Vision

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

UniISP unifies ISP processing with a Hybrid Attention Module and Feature Adapter to produce images that are both visually pleasing for humans and informative for computer vision models.

Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy

cs.CV · 2026-05-06 · unverdicted · novelty 6.0 · 2 refs

HiPR improves 3D occupancy prediction by reparameterizing image-to-voxel projections using LiDAR-derived height priors to adapt sampling ranges to scene sparsity and height variations.

citing papers explorer

Showing 31 of 31 citing papers after filters.

Adversarial Attack and Disturbance Detection by Hadamard-Coded Output Representations for Object Detection and Semantic Segmentation cs.CV · 2026-06-08 · unverdicted · none · ref 3 · internal anchor
HadamardNet applies Hadamard-coded outputs to segmentation and detection, with a novel projection-based decoder that supplies inconsistency measures for SOTA perturbation detection while preserving clean-data performance.
Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression cs.LG · 2026-04-30 · unverdicted · none · ref 2
Auto-FlexSwitch achieves efficient dynamic model merging by decomposing task vectors into sparse masks, signs, and scalars, then making the compression learnable via gating and adaptive bit selection with KNN-based retrieval.
KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition cs.CV · 2026-04-25 · unverdicted · none · ref 51
KAConvNet introduces a Kolmogorov-Arnold Convolutional Layer to build networks competitive with ViTs and CNNs while offering stronger theoretical interpretability.
UHR-DETR: Efficient End-to-End Small Object Detection for Ultra-High-Resolution Remote Sensing Imagery cs.CV · 2026-04-23 · unverdicted · none · ref 44
UHR-DETR delivers 2.8% higher mAP and 10x faster inference than sliding-window baselines for small object detection in UHR remote sensing imagery on a single 24GB GPU.
FRTSearch: Unified Detection and Parameter Inference of Fast Radio Transients using Instance Segmentation astro-ph.IM · 2026-04-14 · unverdicted · none · ref 10
FRTSearch reframes fast radio transient detection as instance segmentation on dynamic spectra and uses the segmented shapes to infer dispersion measure and time of arrival, achieving 98% recall with over 99.9% fewer false positives than traditional methods.
DroneFINE: Domain-Aware Parameter-Efficient Fine-Tuning of Vision-Language Detectors for Drone Images cs.CV · 2026-07-01 · unverdicted · none · ref 1 · internal anchor
DroneFINE is a domain-aware PEFT approach for VLM-based drone detectors using foreground-aware multi-path adaptation and text-conditioned background suppression, outperforming standard PEFT and matching full fine-tuning on VisDrone and UAVDT with fewer trainable parameters.
CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection cs.CV · 2026-06-05 · unverdicted · none · ref 5 · internal anchor
CL-CLIP uses CLIP image-text cost volumes to create class-specific pathways processed by a multi-expert RoI head, improving continual object detection on VOC and COCO over the F-ViT baseline.
Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders cs.CV · 2026-05-30 · unverdicted · none · ref 36 · internal anchor
C-GSPN scales 2D spatial propagation to foundation vision encoders via a fast CUDA kernel, compressed blocks, and two-stage distillation, matching ViT performance with 15% fewer parameters and 4x block speedup at 2K resolution.
DisDop: Distillation with Domain Priors for Open-Vocabulary Aerial Object Detection cs.CV · 2026-05-23 · unverdicted · none · ref 3 · internal anchor
DisDop distills complementary priors from RemoteCLIP and DINOv3 via teacher fusion and semantic modeling to reach new state-of-the-art results on open-vocabulary aerial detection benchmarks.
Deformba: Vision State Space Model with Adaptive State Fusion cs.CV · 2026-05-20 · unverdicted · none · ref 1 · internal anchor
Deformba introduces context-adaptive state fusion to vision SSMs for better spatial augmentation and cross-stream interactions, showing strong results on 2D classification/detection/segmentation and 3D BEV perception benchmarks.
Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers cs.CV · 2026-05-14 · unverdicted · none · ref 18 · internal anchor
Text embeddings in MM-DiTs encode a detectable omission signal for missing concepts; amplifying it via OSI reduces concept omission in text-to-image outputs on FLUX.1-Dev and SD3.5-Medium.
Focus on What Really Matters in Low-Altitude Governance: A Management-Centric Multi-Modal Benchmark with Implicitly Coordinated Vision-Language Reasoning Framework cs.CV · 2026-01-27 · unverdicted · none · ref 8 · internal anchor
Presents the first management-oriented multi-modal benchmark GovLA-10K and a vision-language reasoning framework GovLA-Reasoner with a spatially-aware adapter for low-altitude aerial perception.
UniISP: A Unified ISP Framework for Both Human and Machine Vision cs.CV · 2026-05-08 · unverdicted · none · ref 3
UniISP unifies ISP processing with a Hybrid Attention Module and Feature Adapter to produce images that are both visually pleasing for humans and informative for computer vision models.
Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy cs.CV · 2026-05-06 · unverdicted · none · ref 6 · 2 links
HiPR improves 3D occupancy prediction by reparameterizing image-to-voxel projections using LiDAR-derived height priors to adapt sampling ranges to scene sparsity and height variations.
SignDATA: Data Pipeline for Sign Language Translation cs.CV · 2026-04-22 · unverdicted · none · ref 8
SignDATA provides a reproducible, config-driven preprocessing toolkit that converts heterogeneous sign language corpora into standardized pose or video outputs using interchangeable backends and privacy-aware options.
Granularity-Aware Transfer for Tree Instance Segmentation in Synthetic and Real Forests cs.CV · 2026-04-15 · unverdicted · none · ref 1
Granularity-aware distillation improves tree instance segmentation accuracy on real forest images by merging logits and unifying masks from fine-grained synthetic teachers despite coarse real labels.
Attention-Guided Dual-Stream Learning for Group Engagement Recognition: Fusing Transformer-Encoded Motion Dynamics with Scene Context via Adaptive Gating cs.CV · 2026-04-11 · unverdicted · none · ref 4
DualEngage fuses transformer-encoded student motion dynamics with 3D scene features via softmax-gated fusion to recognize group engagement in classroom videos, reporting 96.21% average accuracy on a university dataset.
Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection cs.CV · 2026-04-07 · unverdicted · none · ref 8
Telescope uses learnable hyperbolic foveation to deliver a 76% relative mAP gain (0.185 to 0.326) for objects beyond 250 meters while keeping overhead low.
Enhancing Layer Interaction Using Key-Correlated Layer Attention cs.CV · 2026-06-24 · unverdicted · none · ref 44 · internal anchor
KCLA is a linear-complexity layer attention mechanism that exploits high key cosine similarity to preserve dynamic updates and long-range cross-layer connections.
A Turbo-Inference Strategy for Object Detection and Instance Segmentation cs.CV · 2026-06-10 · unverdicted · none · ref 10 · internal anchor
A turbo-inference method with two new heads enables iterative communication between detection and segmentation tasks, improving accuracy on COCO, iFLYTEK, and Cityscapes datasets at higher computational cost.
Making the Discrete Continuous: Synthetic RAW Augmentations for Fine-Grained Evaluation of Person Detection Performance in Low Light cs.CV · 2026-05-21 · unverdicted · none · ref 5 · internal anchor
Synthetic RAW augmentations create continuous low-light samples matching sensor noise, enabling fine-grained evaluation of person detection performance where real data is sparse.
Agentic Pipeline for Self-Synchronized Multiview Joint Angle Monitoring in Uncalibrated Environments cs.CV · 2026-05-14 · unverdicted · none · ref 19 · internal anchor
An agentic pipeline combines multimodal LLMs for self-synchronization and verification with monocular pose estimation and geometric optimization to achieve 5.97° MAE joint angle monitoring from uncalibrated multi-view videos, validated against Vicon.
Portable Active Learning for Object Detection cs.CV · 2026-05-11 · unverdicted · none · ref 5
PAL is a portable active learning method for object detection that uses class-specific logistic classifiers for uncertainty and image-level diversity to select annotation batches, showing better label efficiency than baselines on COCO, VOC, and BDD100K.
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay cs.CV · 2026-05-02 · unverdicted · none · ref 35
Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.
A Real-time Scale-robust Network for Glottis Segmentation in Nasal Transnasal Intubation eess.IV · 2026-04-30 · unverdicted · none · ref 61
A scale-robust lightweight CNN for glottis segmentation achieves 92.9% mDice at over 170 FPS with a 19 MB model size on three datasets.
Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization cs.CV · 2026-04-29 · unverdicted · none · ref 7
Bridge learns low-rank bases for front-door causal adjustment to remove spurious correlations from domain shifts and integrates the approach with vision foundation models for improved object detection generalization.
A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction cs.CV · 2026-04-11 · unverdicted · none · ref 54
A3-FPN augments multi-scale representations with asymptotic global interaction and content-aware resampling, delivering gains such as 49.6 mask AP on MS COCO when paired with OneFormer and Swin-L.
Scaling Datasets for Multi-Sensor, Multi-Agent, and Multi-Domain Learning in Autonomous Systems eess.IV · 2026-06-03 · unverdicted · none · ref 14 · internal anchor
Introduces a modular dataset generation pipeline using CARLA and AVstack to produce terabyte-scale ground-truth data for ground, aerial, and infrastructure autonomy in single- and multi-agent setups.
A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline cs.CV · 2026-05-12 · unverdicted · none · ref 59 · 2 links · internal anchor
Clear2Fog simulates fog on 270k Waymo images; mixed-density fog at 75% scale matches full fixed-density training performance, and adjusted learning rates improve sim-to-real transfer by up to 1.17 mAP.
Advancing Vision Transformer with Enhanced Spatial Priors cs.CV · 2026-04-20 · unverdicted · none · ref 91
EVT improves Vision Transformers by using Euclidean distance decay for spatial priors and simpler grouping, achieving 86.6% top-1 accuracy on ImageNet-1k.
The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and Results cs.CV · 2026-04-13 · unverdicted · none · ref 11
The NTIRE 2026 CD-FSOD Challenge report details innovative methods and performance results from 19 teams on cross-domain few-shot object detection in open- and closed-source tracks.
