Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection

Feng Li; Han Gao; Hao Zhang; Hongjie Huang; Kent Yu; Lei Zhang; Peijun Tang; Qing Jiang; Shilong Liu; Tianhe Ren

arxiv: 2405.10300 · v2 · pith:EVLFUC5Unew · submitted 2024-05-16 · 💻 cs.CV

Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, Yuda Xiong, Hao Zhang, Feng Li, Peijun Tang, Kent Yu, Lei Zhang

classification 💻 cs.CV

keywords: grounding, dino, model, edge, detection, object, open-set, benchmark

This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for faster speed demanded in many applications requiring edge deployment. The Grounding DINO 1.5 Pro model advances its predecessor by scaling up the model architecture, integrating an enhanced vision backbone, and expanding the training dataset to over 20 million images with grounding annotations, thereby achieving a richer semantic understanding. The Grounding DINO 1.5 Edge model, while designed for efficiency with reduced feature scales, maintains robust detection capabilities by being trained on the same comprehensive dataset. Empirical results demonstrate the effectiveness of Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP on the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records for open-set object detection. Furthermore, the Grounding DINO 1.5 Edge model, when optimized with TensorRT, achieves a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP on the LVIS-minival benchmark, making it more suitable for edge computing scenarios. Model examples and demos with API will be released at https://github.com/IDEA-Research/Grounding-DINO-1.5-API

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Comprehensive Ecosystem for Open-Domain Customized Video Generation
cs.CV 2026-06 unverdicted novelty 7.0

Introduces PexelsCustom-1M dataset, CustoMDiT parameter-efficient model, and OpenCustom benchmark for open-domain customized video generation.
WHU-Infra3D: A Full-stack Multi-modal Dataset and Benchmark for 3D Roadside Infrastructure Inventory
cs.CV 2026-06 unverdicted novelty 7.0

WHU-Infra3D is a new large-scale multi-modal dataset and benchmark for 3D roadside infrastructure inventory, providing over 175k 2D boxes, thousands of 3D instances, and 181k annotations across five core tasks while e...
FlowOVD: Learning Generative Latent Flows for Zero-shot Open-vocabulary Detection
cs.CV 2026-05 unverdicted novelty 7.0

FlowOVD applies rectified flow to generate continuous latent query dynamics for text-conditioned open-vocabulary detection, reporting 49.5 AP on COCO and 31.5 AP on LVIS.
Vision Harnessing Agent for Open Ad-hoc Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model
cs.CV 2026-02 unverdicted novelty 7.0

DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in ...
SAM 3: Segment Anything with Concepts
cs.CV 2025-11 unverdicted novelty 7.0

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding
cs.CV 2026-05 unverdicted novelty 6.0

SceneParser introduces hierarchical scene parsing as object-part-affordance chains, a VLM trained with pseudo labels and curriculum learning, and SceneParser-Bench with 1.74M affordance annotations, showing better str...
MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

MIRAGE introduces a benchmark for multi-instance image editing and a training-free framework that uses vision-language parsing and parallel regional denoising to achieve precise edits without altering backgrounds.
Unify Robot Actions in Camera Frame
cs.RO 2025-11 conditional novelty 6.0

CalibAll estimates camera extrinsics on existing datasets to convert robot actions into a unified camera-frame representation, enabling stronger cross-embodiment pretraining.
Inferring Dynamic Physical Properties from Video Foundation Models
cs.CV 2025-10 unverdicted novelty 6.0

Video foundation models infer dynamic physical properties such as elasticity, viscosity, and friction from videos at levels close to classical oracles while outperforming current MLLMs with suitable prompting.
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
cs.CV 2025-04 unverdicted novelty 6.0

VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
RelAfford6D: Relational 6D Affordance Graphs for Constraint-Driven Robotic Manipulation
cs.RO 2026-06 unverdicted novelty 5.0

RelAfford6D constructs relational 6D affordance graphs from instructions, uses vision foundation models for metric poses, and executes via closed-loop kinematic constraint tracking to achieve claimed superior zero-sho...
VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection
cs.CV 2026-06 unverdicted novelty 5.0

VL-DINO improves open-vocabulary object detection by adding QPSC, VSE, and ORSA modules that inject CLIP knowledge into DINO, reaching 36.3 and 38.1 AP zero-shot on LVIS.
TrackRef3D: Multi-View Consistent Track-then-Label for Open-World Referring Segmentation in 3D Gaussian Splatting
cs.CV 2026-05 unverdicted novelty 5.0

TrackRef3D proposes a fully automatic multi-view consistent track-then-label method for open-world referring segmentation in 3D Gaussian Splatting using TSCM, visibility-aware descriptions, and hybrid contrastive training.
RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos
cs.CV 2026-05 unverdicted novelty 5.0

RHINO recovers 3D human, novel manipulated object, and static scene from monocular video by stabilizing SfM with foundation models, separating motions, and refining with compositional neural SDFs plus contact priors.
DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer
cs.CV 2026-05 unverdicted novelty 5.0

DetRefiner fuses global and local features with a Transformer to refine OVOD confidence scores, delivering up to +10.1 AP gains on novel categories across multiple datasets.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
cs.RO 2025-07 unverdicted novelty 5.0

The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
Qwen2.5-VL Technical Report
cs.CV 2025-02 unverdicted novelty 5.0

Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level lo...
Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving
cs.CV 2025-01 unverdicted novelty 5.0

Vision foundation model embeddings with density modeling outperform state-of-the-art methods for unsupervised semantic and covariate shift detection in autonomous driving inputs.
Lightweight Neural Framework for Robust 3D Volume and Surface Estimation from Multi-View Images
cs.CV 2026-06 unverdicted novelty 4.0

A lightweight neural model fuses 3D point cloud reconstructions with view-aligned 2D features via a graph decoder to regress volume, surface area, and uncertainties from multi-view images without iterative optimization.
Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models
cs.CV 2026-06 unverdicted novelty 4.0

YOLO26 presents a unified real-time vision model family with dual-head end-to-end design, new training components, and task-specific heads that reports improved mAP-latency tradeoffs on COCO and LVIS benchmarks across...
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
cs.CV 2026-04 unverdicted novelty 3.0

This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...