Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Chunyuan Li; Feng Li; Hang Su; Hao Zhang; Jianwei Yang; Jie Yang; Jun Zhu; Lei Zhang; Qing Jiang; Shilong Liu

arxiv: 2303.05499 · v5 · submitted 2023-03-09 · 💻 cs.CV

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang

Pith reviewed 2026-05-11 11:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords: open-set object detection · grounded pre-training · vision-language fusion · zero-shot detection · referring expression comprehension · DINO detector · transformer-based detection

The pith

Grounding DINO marries DINO with grounded pre-training to detect arbitrary objects from language inputs, without training on the target dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes an open-set object detector by integrating language into the DINO model through grounded pre-training. It conceptually divides a closed-set detector into three phases and proposes tight fusion modules to handle both category names and referring expressions. This matters because it lets a detector recognize novel objects from text alone, as the zero-shot results indicate. The model reaches 52.5 AP on COCO without using COCO training data and sets a record on the ODinW zero-shot benchmark with 26.1 mean AP.

Core claim

By marrying the Transformer-based DINO detector with grounded pre-training, Grounding DINO introduces language to enable detection of arbitrary objects given inputs such as category names or referring expressions. The solution involves tight cross-modality fusion using a feature enhancer, language-guided query selection, and a cross-modality decoder. This leads to strong results across COCO, LVIS, ODinW, and RefCOCO benchmarks, including 52.5 AP zero-shot on COCO and a record 26.1 mean AP on ODinW.

What carries the argument

Tight fusion modules including a feature enhancer, language-guided query selection, and cross-modality decoder that fuse language and vision for open-set generalization in the DINO architecture.
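Of the three fusion pieces, language-guided query selection is the easiest to illustrate in isolation. The sketch below is a minimal pure-Python illustration, not the authors' implementation: it assumes each image token is scored by its largest dot product with any text token, and the top-scoring tokens are kept as decoder queries. The function name `select_queries`, the toy feature vectors, and `num_query` are hypothetical.

```python
def select_queries(image_feats, text_feats, num_query):
    """Return indices of the image tokens most similar to any text token.

    Assumption for illustration: similarity is a plain dot product and
    each image token is scored by its best-matching text token.
    """
    scores = []
    for img in image_feats:
        # Dot product of this image token with every text token.
        sims = [sum(a * b for a, b in zip(img, txt)) for txt in text_feats]
        # Keep only the best match across text tokens.
        scores.append(max(sims))
    # Rank image tokens by score, highest first (ties keep input order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:num_query]


# Toy data (hypothetical): 4 image tokens and 2 text tokens, dimension 2.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.0, 2.0]]
selected = select_queries(image_feats, text_feats, num_query=2)
print(selected)  # [1, 0]
```

In a real detector the same idea would run on batched tensors; the loop form here only makes the scoring rule explicit.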

Load-bearing premise

The proposed tight fusion of language and vision generalizes to open-set concepts from pre-training data without overfitting to specific training distributions or needing per-dataset adjustments.

What would settle it

Observing that the model's performance on unseen object classes falls to near zero when the language encoder is replaced with a different one or when inputs are from a domain far from pre-training would indicate the fusion does not truly generalize.

Original abstract

In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO.
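To make "human inputs such as category names" concrete, the sketch below shows one way a list of category names could be turned into a single text prompt for a language-conditioned detector. This is an assumption-laden illustration: the helper `build_prompt`, the lower-casing, and the period separator are hypothetical choices made here, not an API or format confirmed by the abstract.

```python
def build_prompt(categories, sep=" . "):
    """Join category names into one text prompt.

    Assumption for illustration: names are lower-cased, stripped of
    surrounding whitespace, and separated by a period.
    """
    cleaned = [c.strip().lower() for c in categories if c.strip()]
    if not cleaned:
        return ""
    return sep.join(cleaned) + " ."


prompt = build_prompt(["Person", "traffic light ", "dog"])
print(prompt)  # person . traffic light . dog .
```

A referring expression such as "the dog on the left" would be passed as free text instead of a joined category list.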

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Grounding DINO gives a practical open-set detector by adding tight language-vision fusion to DINO, with strong zero-shot numbers and released code.

Grounding DINO takes the DINO detector and adds grounded pre-training through three specific fusion pieces: a feature enhancer, language-guided query selection, and a cross-modality decoder. The headline results are 52.5 AP zero-shot on COCO with no COCO training data at all, plus a new record of 26.1 mean AP on the ODinW benchmark. They also test referring expression comprehension on RefCOCO, which checks how well the model handles attributes and descriptions beyond plain category names. This is positioned as an incremental step from GLIP and similar grounded detectors, with straightforward citations to prior work. The experiments use standard public benchmarks and a clean zero-shot protocol, and the code is released, which makes the claims easier to verify. The performance tables line up with the stated improvements, and there is no sign of data leakage or self-referential metrics. The main soft spot is that the ablations are not as detailed as they could be, so it is harder to measure exactly how much each of the three fusion modules moves the needle. The central assumption that the tight fusion generalizes open-set concepts without heavy overfitting holds up in the reported numbers, but more targeted analysis would make the case tighter. This paper is useful for anyone working on open-vocabulary or grounded object detection who wants a concrete, high-performing recipe rather than a new paradigm. It shows clear engagement with the existing literature and delivers reproducible results. I would bring it to a reading group to discuss the fusion choices and the referring-expression results. It is worth citing if you need a strong baseline in this area. Send it to peer review; the empirical grounding is solid enough to justify referee time.

Referee Report

0 major / 0 minor

Summary. The paper introduces Grounding DINO, an open-set object detector formed by integrating the DINO transformer-based detector with grounded pre-training. It enables detection of arbitrary objects from language inputs (category names or referring expressions) via three proposed tight fusion modules: a feature enhancer, language-guided query selection, and cross-modality decoder. The work evaluates the model on zero-shot detection (COCO, LVIS, ODinW) and referring expression comprehension (RefCOCO/+/g), reporting 52.5 AP zero-shot on COCO (no COCO training data used) and a new record of 26.1 mean AP on ODinW.

Significance. If the results hold, the paper advances open-set object detection by demonstrating that language-vision fusion in a transformer detector can yield strong generalization from grounded pre-training. The zero-shot COCO and ODinW results, combined with the additional referring-expression evaluation protocol, provide concrete evidence of practical open-vocabulary capability. The promised code release supports reproducibility and further research.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work and for recommending acceptance. We appreciate the recognition of the contributions of Grounding DINO to open-set object detection through tight language-vision fusion and the strong empirical results on zero-shot and referring-expression benchmarks.

Circularity Check

0 steps flagged

No significant circularity

The paper reports empirical results from training and evaluating an open-set detector on independent public benchmarks (COCO zero-shot, ODinW, LVIS, RefCOCO). No mathematical derivation chain exists that reduces claimed performance or architectural choices to fitted inputs or self-referential quantities by construction. The tight fusion modules are presented as design decisions motivated by modality fusion needs, not as predictions derived from prior equations within the paper. Self-citations to DINO and grounded pre-training are external and do not bear the load of the reported AP numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the effectiveness of grounded pre-training for open-set generalization and on the assumption that the proposed fusion modules do not introduce harmful modality misalignment; no new physical entities or ad-hoc constants are introduced.

axioms (1)

domain assumption Grounded pre-training on large-scale image-text data transfers to zero-shot detection on held-out categories and referring expressions.
Invoked when the paper states that marrying DINO with grounded pre-training enables open-set concept generalization.

pith-pipeline@v0.9.0 · 5565 in / 1259 out tokens · 46032 ms · 2026-05-11T11:05:09.992353+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CANSURF: An ASV-View Can Dataset and Benchmark for Detection and Tracking of Surface-Level Debris
cs.CV 2026-05 unverdicted novelty 8.0

Presents the CANSURF dataset for surface-level aluminum can detection from ASV viewpoints and shows that training YOLOv11 on it yields a 12x performance boost over generic datasets along with stable tracking results.
VISTA: Video Interaction Spatio-Temporal Analysis Benchmark
cs.CV 2026-05 unverdicted novelty 8.0

VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.
CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration
cs.CV 2026-05 unverdicted novelty 7.0

CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations
cs.RO 2026-05 unverdicted novelty 7.0

GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.
RelWitness: Open-Vocabulary 3D Scene Graph Generation with Visual-Geometric Relation Witnesses
cs.CV 2026-05 unverdicted novelty 7.0

RelWitness introduces relation witnesses from visual and geometric cues to learn open-vocabulary 3D scene graphs under incomplete supervision using a positive-unlabeled objective.
Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models
cs.CV 2026-05 conditional novelty 7.0

Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.
Vision Harnessing Agent for Open Ad-hoc Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization
cs.RO 2026-05 conditional novelty 7.0

CosFlyTrack supplies 2.4 million timesteps of aligned RGB, depth, segmentation, pose, target state, and bilingual instructions from expert UAV trajectories, with experiments showing 53-69 point gains in SR@1m after fi...
CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization
cs.RO 2026-05 conditional novelty 7.0

CosFlyTrack provides 12,000 expert UAV trajectories with aligned RGB, depth, segmentation, pose, target state, and bilingual instructions to train visual tracking agents, yielding 53-69 point gains in success rate aft...
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
cs.CV 2026-05 unverdicted novelty 7.0

Urban-ImageNet is a 2-million-image multi-modal dataset with HUSIC 10-class taxonomy enabling benchmarks for urban scene classification, cross-modal retrieval, and instance segmentation.
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
cs.CV 2026-05 unverdicted novelty 7.0

Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection
cs.CV 2026-05 unverdicted novelty 7.0

ScriptHOI decomposes HOI phrases into state slots and uses script coverage, conflict, interval partial-label learning, and counterfactual contrast to improve rare and unseen interaction detection while cutting afforda...
LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment
cs.CV 2026-05 unverdicted novelty 7.0

LAGO achieves state-of-the-art zero-shot performance with fewer image regions by using class-agnostic object discovery followed by confidence-controlled language-guided refinement and dual-channel aggregation.
Multimodal Data Curation Through Ranked Retrieval
cs.IR 2026-05 unverdicted novelty 7.0

Symmetric Nucleus Subsampling and Expert Embedding Engine reduce modality gaps in multimodal embeddings by over 90% and outperform baselines in data curation for downstream models.
Grounding Video Reasoning in Physical Signals
cs.CV 2026-04 unverdicted novelty 7.0

A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robust...
PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving
cs.CV 2026-04 unverdicted novelty 7.0

PanDA is the first UDA method for multimodal 3D panoptic segmentation that improves robustness to single-modality degradation and pseudo-label completeness via asymmetric augmentation and dual-expert refinement.
Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection
cs.CV 2026-04 conditional novelty 7.0

Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
WildDet3D: Scaling Promptable 3D Detection in the Wild
cs.CV 2026-04 unverdicted novelty 7.0

WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
cs.RO 2026-04 unverdicted novelty 7.0

KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
SPRITE: From Static Mockups to Engine-Ready Game UI
cs.HC 2026-03 unverdicted novelty 7.0

SPRITE converts static game UI screenshots into editable engine-ready assets by using VLMs to parse complex layouts into a YAML intermediate representation.
Towards Generalizable Robotic Manipulation in Dynamic Environments
cs.CV 2026-03 unverdicted novelty 7.0

DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.
ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos
cs.CV 2025-12 conditional novelty 7.0

ProcObject-10K is the first benchmark for object-centric procedural reasoning in videos that exposes a large gap where models answer questions plausibly but fail to ground their answers in the correct video segments.
RoofNet: A Global Multimodal Dataset for Roof Material Identification from Earth Observation
cs.CE 2025-05 conditional novelty 7.0

RoofNet is a multimodal dataset pairing high-resolution Earth observation imagery with roof material annotations from diverse global locations to support vision-language models for hazard exposure mapping.
VACE: All-in-One Video Creation and Editing
cs.CV 2025-03 unverdicted novelty 7.0

VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.
Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement
cs.CV 2024-11 unverdicted novelty 7.0

VideoRepair detects text-video misalignments via MLLM-generated questions and performs localized, region-preserving refinement to improve alignment in existing T2V diffusion models.
RoboDreamer: Learning Compositional World Models for Robot Imagination
cs.RO 2024-04 unverdicted novelty 7.0

RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.
Visual Instruction Tuning
cs.CV 2023-04 unverdicted novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
Action with Visual Primitives
cs.RO 2026-05 unverdicted novelty 6.0

AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.
Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations
cs.RO 2026-05 unverdicted novelty 6.0

A framework learns invariant symbolic reward functions from few demonstrations that generalize zero-shot to variations in robotic manipulation tasks.
UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation
cs.CV 2026-05 unverdicted novelty 6.0

UniVL unifies vision and language into one mask-rendered input processed by an OCR backbone to condition diffusion models for spatially grounded image generation without a standalone text encoder.
RelWitness: Open-Vocabulary 3D Scene Graph Generation with Visual-Geometric Relation Witnesses
cs.CV 2026-05 unverdicted novelty 6.0

RelWitness uses concrete visual-geometric cues to verify and learn from missing relation labels in open-vocabulary 3D scene graph generation.
TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation
cs.CV 2026-05 unverdicted novelty 6.0

TRACE builds structured text timelines from videos via OCR and detection, then applies text-only LLM evidence localization before LVLM claim generation, raising MiRAGE F1 from 0.705 to 0.811 on MAGMaR.
ReactiveGWM: Steering NPC in Reactive Game World Models
cs.CV 2026-05 unverdicted novelty 6.0

ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, an...
ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection
cs.CV 2026-05 unverdicted novelty 6.0

ScriptHOI improves rare and unseen HOI recognition by decomposing phrases into state slots, using visual tokenization and slot-wise matching for script coverage and conflict to calibrate predictions and constrain trai...
Local Intrinsic Dimension Unveils Hallucinations in Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

Hallucinations in diffusion models are driven by local intrinsic dimension instabilities on the manifold, which Intrinsic Quenching corrects by deflating it.
Approaching human parity in the quality of automated organoid image segmentation
cs.CV 2026-05 conditional novelty 6.0

A composite SAM-based method segments organoid images with accuracy matching or approaching inter-observer variability among human annotators.
GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning
cs.RO 2026-04 unverdicted novelty 6.0

GS-Playground delivers a high-throughput photorealistic simulator for vision-informed robot learning via parallel physics integrated with batch 3D Gaussian Splatting at 10^4 FPS and an automated Real2Sim workflow for ...
WildLIFT: Lifting monocular drone video to 3D for species-agnostic wildlife monitoring
cs.CV 2026-04 unverdicted novelty 6.0

WildLIFT lifts monocular drone video to 3D for species-agnostic wildlife detection, tracking, and viewpoint analysis by integrating scene geometry with open-vocabulary segmentation.
Pi-HOC: Pairwise 3D Human-Object Contact Estimation
cs.CV 2026-04 unverdicted novelty 6.0

Pi-HOC predicts dense 3D semantic contacts for all human-object pairs in an image via instance-aware tokens and an InteractionFormer, achieving higher accuracy and 20x throughput than prior methods.
Pi-HOC: Pairwise 3D Human-Object Contact Estimation
cs.CV 2026-04 unverdicted novelty 6.0

Pi-HOC is a new instance-aware framework that predicts dense 3D semantic contacts between all human-object pairs in an image via dedicated HO tokens, InteractionFormer refinement, and a SAM decoder, achieving higher a...
GS4City: Hierarchical Semantic Gaussian Splatting via City-Model Priors
cs.CV 2026-04 unverdicted novelty 6.0

GS4City derives geometry-grounded semantic masks from LoD3 CityGML models via raycasting and fuses them with 2D foundation model outputs to supervise identity encodings on Gaussians, improving coarse and fine semantic...
Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization
cs.CV 2026-04 unverdicted novelty 6.0

VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.
WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations
cs.RO 2026-04 unverdicted novelty 6.0

WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...
Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.
Indoor Asset Detection in Large Scale 360° Drone-Captured Imagery via 3D Gaussian Splatting
cs.CV 2026-04 unverdicted novelty 6.0

A 3D object codebook leveraging mask semantics and Gaussian spatial information enables multi-view mask association for indoor asset detection in 3DGS scenes, yielding 65% F1 and 11% mAP gains on two large indoor scenes.
From Local Matches to Global Masks: Template-Guided Instance Detection and Segmentation in Open-World Scenes
cs.CV 2026-03 unverdicted novelty 6.0

L2G-Det detects and segments novel object instances in open scenes by using local template patch matches to generate points that prompt an augmented SAM for global masks.
ESPADA: Execution Speedup via Semantics Aware Demonstration Data Downsampling for Imitation Learning
cs.RO 2025-12 conditional novelty 6.0

ESPADA uses semantic segmentation from VLMs and LLMs plus DTW to downsample non-critical segments in demonstrations, delivering about 2x faster robot execution in behavior cloning while maintaining task success rates.
Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles
cs.CV 2025-12 unverdicted novelty 6.0

ThinkDeeper introduces a world-model-based reasoning step that predicts future spatial states to improve multimodal visual grounding for autonomous vehicles, achieving top results on Talk2Car and other benchmarks.
RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models
cs.CV 2025-11 unverdicted novelty 6.0

RADSeg adapts the RADIO model with targeted enhancements to deliver 6-30% higher mIoU in zero-shot OVSS while using 2.5x fewer parameters and running 3.95x faster than prior large-model combinations.
Eevee: Towards Close-up High-resolution Video-based Virtual Try-on
cs.CV 2025-11 unverdicted novelty 6.0

A new dataset with high-fidelity close-up garment images and full/close-up try-on videos plus the VGID metric enables better texture and structure preservation in high-resolution video virtual try-on.
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
cs.RO 2025-11 unverdicted novelty 6.0

SPEAR-1 combines a 3D-enriched VLM with embodied control to match or exceed existing robotic foundation models using 20 times fewer robot demonstrations.
Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt
cs.CV 2025-10 unverdicted novelty 6.0

Memory-SAM retrieves similar prior cases via DINOv3 features and FAISS to generate point prompts for SAM2, achieving mIoU 0.9863 on 600 tongue images without training or human prompts.
Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
cs.RO 2025-07 unverdicted novelty 6.0

RIGVid shows that filtered AI-generated videos can serve as effective supervision for complex robotic manipulation tasks without any real demonstrations.
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
cs.RO 2025-05 unverdicted novelty 6.0

GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP
cs.CV 2025-02 conditional novelty 6.0

Grad-ECLIP produces gradient-based visual and textual explanation heatmaps for CLIP by applying channel and spatial weights to token features instead of relying on sparse self-attention maps.
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks
cs.RO 2024-12 unverdicted novelty 6.0

Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-w...
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
cs.CL 2024-10 unverdicted novelty 6.0

OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
cs.CL 2024-01 conditional novelty 6.0

Mobile-Agent is a vision-centric autonomous agent that uses MLLMs to perceive UI elements, plan complex multi-step tasks, and operate mobile apps without relying on XML or system metadata, showing strong results on th...

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 85 Pith papers · 3 internal anchors

[1]

computer vision and pattern recognition (2017)

Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. computer vision and pattern recognition (2017)

work page 2017
[2]

In: European Conference on Computer Vision

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European Conference on Computer Vision. pp. 213–229. Springer (2020)

work page 2020
[3]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., et al.: Hybrid task cascade for instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4974–4983 (2019)

work page 2019
[4]

Chen, Q., Chen, X., Wang, J., Feng, H., Han, J., Ding, E., Zeng, G., Wang, J.: Group DETR: Fast detr training with group-wise one-to-many assignment (2022)

work page 2022
[5]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Dai, X., Chen, Y., Xiao, B., Chen, D., Liu, M., Yuan, L., Zhang, L.: Dynamic head: Unifying object detection heads with attentions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7373–7382 (2021)

work page 2021
[6]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Dai, X., Chen, Y., Yang, J., Zhang, P., Yuan, L., Zhang, L.: Dynamic DETR: End-to-end object detection with dynamic attention. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 2988–2997 (October 2021)

work page 2021
[7]

arXiv: Computer Vision and Pattern Recognition (2021)

Deng, J., Yang, Z., Chen, T., Zhou, W., Li, H.: Transvg: End-to-end visual grounding with transformers. arXiv: Computer Vision and Pattern Recognition (2021)

work page 2021
[8]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[9]

Dong, N., Zhang, Y., Ding, M., Lee, G.H.: Boosting long-tailed object detection via step-wise learning on smooth-tail data (2023), https://arxiv.org/abs/2305.12833

work page 2023
[10]

Du, Y., Fu, Z., Liu, Q., Wang, Y.: Visual grounding with transformers. (2021)

work page 2021
[11]

neural information processing systems (2020)

Gan, Z., Chen, Y.C., Li, L., Zhu, C., Cheng, Y., Liu, J.: Large-scale adversarial training for vision-and-language representation learning. neural information processing systems (2020)

work page 2020
[12]

Clip-adapter: Better vision-language models with feature adapters,

Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y.: Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544 (2021)

work page arXiv 2021
[13]

Fast convergence of detr with spatially modulated co-attention

Gao, P., Zheng, M., Wang, X., Dai, J., Li, H.: Fast convergence of DETR with spatially modulated co-attention. arXiv preprint arXiv:2101.07448 (2021)

work page arXiv 2021
[14]

Learning (2021)

Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. Learning (2021)

work page 2021
[15]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Gupta, A., Dollar, P., Girshick, R.: Lvis: A dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5356–5364 (2019)

[16]

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (2016)

[17]

Jia, D., Yuan, Y., He, H., Wu, X., Yu, H., Lin, W., Sun, L., Zhang, C., Hu, H.: DETRs with hybrid matching (2022)

[18]

Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: MDETR - modulated detection for end-to-end multi-modal understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1780–1790 (2021)

[19]

Krasin, I., Duerig, T., Alldrin, N., Ferrari, V., Abu-El-Haija, S., Kuznetsova, A., Rom, H., Uijlings, J., Popov, S., Veit, A., et al.: OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages 2(3), 18 (2017)

[20]

Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., Bernstein, M.S., Fei-Fei, L.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (2017)

[21]

Kuo, W., Bertsch, F., Li, W., Piergiovanni, A., Saffar, M., Angelova, A.: FindIt: Generalized localization with natural language queries (2022)

[22]

Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., Duerig, T., Ferrari, V.: The Open Images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv: Computer Vision and Pattern Recognition (2018)

[23]

Li, C., Liu, H., Li, L.H., Zhang, P., Aneja, J., Yang, J., Jin, P., Lee, Y.J., Hu, H., Liu, Z., Gao, J.: ELEVATER: A benchmark and toolkit for evaluating language-augmented visual models (2022)

[24]

Li, F., Zhang, H., Liu, S., Guo, J., Ni, L.M., Zhang, L.: DN-DETR: Accelerate DETR training by introducing query denoising. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13619–13627 (2022)

[25]

Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. arXiv preprint arXiv:2112.03857 (2021)

[26]

Li, M., Sigal, L.: Referring transformer: A one-step approach to multi-task visual grounding. arXiv: Computer Vision and Pattern Recognition (2021)

[27]

Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., Lee, Y.J.: GLIGEN: Open-set grounded text-to-image generation. CVPR (2023)

[28]

Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980–2988 (2017)

[29]

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740–755. Springer (2014)

[30]

Liu, J., Wang, L., Yang, M.H.: Referring expression generation and comprehension via attributes. International Conference on Computer Vision (2017)

[31]

Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., Zhang, L.: DAB-DETR: Dynamic anchor boxes are better queries for DETR. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=oMI9PjOb9Jl

[32]

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)

[33]

Meng, D., Chen, X., Fan, Z., Zeng, G., Li, H., Yuan, Y., Sun, L., Wang, J.: Conditional DETR for fast training convergence. arXiv preprint arXiv:2108.06152 (2021)

[34]

Miao, P., Su, W., Wang, L., Fu, Y., Li, X.: Referring expression comprehension via cross-level multi-modal fusion. ArXiv abs/2204.09957 (2022)

[35]

Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., Wang, X., Zhai, X., Kipf, T., Houlsby, N.: Simple open-vocabulary object detection with vision transformers (2022)

[36]

Ordonez, V., Kulkarni, G., Berg, T.L.: Im2Text: Describing images using 1 million captioned photographs. Neural Information Processing Systems (2011)

[37]

Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2641–2649 (2015)

[38]

Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. International Journal of Computer Vision (2015)

[39]

Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(6), 1137–1149 (2017)

[40]

Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: A metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 658–666 (2019)

[41]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)

[42]

Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. Meeting of the Association for Computational Linguistics (2015)

[43]

Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J.: Objects365: A large-scale, high-quality dataset for object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 8430–8439 (2019)

[44]

Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. Meeting of the Association for Computational Linguistics (2018)

[45]

Shilong, L., Yaoyuan, L., Shijia, H., Feng, L., Hao, Z., Hang, S., Jun, Z., Lei, Z.: DQ-DETR: Dual query detection transformer for phrase extraction and grounding. In: Proceedings of the AAAI Conference on Artificial Intelligence (2023)

[46]

Tan, M., Le, Q.V.: EfficientNet: Rethinking model scaling for convolutional neural networks. International Conference on Machine Learning (2019)

[47]

Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D.N., Borth, D., Li, L.J.: YFCC100M: The new data in multimedia research. Communications of the ACM (2016)

[48]

Wang, Y., Zhang, X., Yang, T., Sun, J.: Anchor DETR: Query design for transformer-based detector. National Conference on Artificial Intelligence (2021)

[49]

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)

[50]

Xu, M., Zhang, Z., Hu, H., Wang, J., Wang, L., Wei, F., Bai, X., Liu, Z.: End-to-end semi-supervised object detection with soft teacher. arXiv preprint arXiv:2106.09018 (2021)

[51]

Yao, L., Han, J., Liang, X., Xu, D., Zhang, W., Li, Z., Xu, H.: DetCLIPv2: Scalable open-vocabulary object detection pre-training via word-region alignment (2023)

[52]

Yao, L., Han, J., Wen, Y., Liang, X., Xu, D., Zhang, W., Li, Z., Xu, C., Xu, H.: DetCLIP: Dictionary-enriched visual-concept paralleled pre-training for open-world detection (2022)

[53]

Yu, L., Lin, Z., Shen, X., Yang, J., Lu, X., Bansal, M., Berg, T.L.: MAttNet: Modular attention network for referring expression comprehension. Computer Vision and Pattern Recognition (2018)

[54]

Yuan, L., Chen, D., Chen, Y.L., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., Liu, C., Liu, M., Liu, Z., Lu, Y., Shi, Y., Wang, L., Wang, J., Xiao, B., Xiao, Z., Yang, J., Zeng, M., Zhou, L., Zhang, P.: Florence: A new foundation model for computer vision (2022)

[55]

Zang, Y., Li, W., Zhou, K., Huang, C., Loy, C.C.: Open-vocabulary DETR with conditional matching (2022)

[56]

Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.F.: Open-vocabulary object detection using captions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14393–14402 (2021)

[57]

Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., Shum, H.Y.: DINO: DETR with improved denoising anchor boxes for end-to-end object detection (2022)

[58]

Zhang, H., Zhang, P., Hu, X., Chen, Y.C., Li, L.H., Dai, X., Wang, L., Yuan, L., Hwang, J.N., Gao, J.: GLIPv2: Unifying localization and vision-language understanding (2022)

[59]

Zhang, H., Zhang, P., Hu, X., Chen, Y.C., Li, L.H., Dai, X., Wang, L., Yuan, L., Hwang, J.N., Gao, J.: GLIPv2: Unifying localization and vision-language understanding (2022)

[60]

Zhao, T., Liu, P., Lu, X., Lee, K.: OmDet: Language-aware object detection with large-scale vision-language multi-dataset pre-training (2022)

[61]

Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., Gao, J.: RegionCLIP: Region-based language-image pretraining (2022)

[62]

Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty-thousand classes using image-level supervision. In: ECCV (2022)

[63]

Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)

[64]

Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable transformers for end-to-end object detection. In: ICLR 2021: The Ninth International Conference on Learning Representations (2021)

A More Implementation Details

A.1 Hyperparameters

Table 8 presents the hyperparameters used in our main experiments.

Item Value optimizer ...

"""
Input:
    image_feat: (bs, num_img_tokens, ndim)
    text_feat: (bs, num_text_tokens, ndim)
    num_query: int

Output:
    topk_idx: (bs, num_query)
"""
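The docstring fragment above (inputs `image_feat`, `text_feat`, `num_query`; output `topk_idx`) matches the interface of language-guided query selection. The following is a minimal sketch of that step, not the authors' code. It assumes each image token is scored by its largest dot product with any text token and that the `num_query` highest-scoring tokens are kept. The function name is hypothetical, and plain Python lists stand in for tensors so the example has no dependencies.

```python
def language_guided_query_selection(image_feat, text_feat, num_query):
    """Pick, per sample, the image tokens most similar to the text.

    image_feat: nested lists, shape (bs, num_img_tokens, ndim)
    text_feat:  nested lists, shape (bs, num_text_tokens, ndim)
    num_query:  number of image-token indices to keep per sample
    Returns topk_idx: nested lists, shape (bs, num_query)
    """
    topk_idx = []
    for img_tokens, txt_tokens in zip(image_feat, text_feat):
        # Score each image token by its best dot-product match over all text tokens.
        scores = [
            max(sum(a * b for a, b in zip(img, txt)) for txt in txt_tokens)
            for img in img_tokens
        ]
        # Indices sorted by descending score; ties keep their original order.
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        topk_idx.append(order[:num_query])
    return topk_idx


# Toy example: one sample, four image tokens, two text tokens, ndim = 2.
image_feat = [[[1.0, 0.0], [0.0, 1.0], [0.2, 0.2], [-1.0, 0.0]]]
text_feat = [[[1.0, 0.0], [0.0, 2.0]]]
selected = language_guided_query_selection(image_feat, text_feat, num_query=2)
print(selected)
```

Taking the maximum over text tokens means an image token is kept if it matches any single word in the prompt well, which suits prompts that list several unrelated category names.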

Detection data. Following GLIP [25], we reformulate the object detection task as a phrase grounding task by concatenating the category names into text prompts. We use COCO [29], O365 [43], and OpenImage (OI) [19] for model pre-training. To simulate different text inputs, we randomly sample category names from all categories in a dataset on the fly during training.
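The prompt construction described above can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function name `build_detection_prompt`, the `num_negatives` parameter, the " . " separator, and the policy of always keeping the categories present in the image while sampling extra absent ones are all hypothetical choices, since the text only states that category names are concatenated and randomly sampled on the fly.

```python
import random


def build_detection_prompt(all_categories, present_categories, num_negatives, rng,
                           separator=" . "):
    """Build a grounding-style text prompt from category names.

    all_categories:     every category name in the dataset
    present_categories: names annotated in the current image (always kept)
    num_negatives:      how many absent categories to sample (assumed knob)
    rng:                a random.Random instance, for reproducible sampling
    Returns (prompt, names), where names lists the categories in prompt order.
    """
    present = list(dict.fromkeys(present_categories))  # de-duplicate, keep order
    present_set = set(present)
    absent = [c for c in all_categories if c not in present_set]
    # Sample a different subset of absent categories on every call.
    negatives = rng.sample(absent, min(num_negatives, len(absent)))
    names = present + negatives
    rng.shuffle(names)  # vary the order so position carries no label information
    prompt = separator.join(names) + " ."
    return prompt, names


rng = random.Random(0)
categories = ["cat", "dog", "desk", "person", "mouse", "table"]
prompt, names = build_detection_prompt(categories, ["dog", "person"], 2, rng)
print(prompt)
```

Calling the function once per training sample yields a different prompt each time, which is one way to expose the model to varied text inputs built from the same label set.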

Grounding data. We use the GoldG and RefC data as grounding data. Both GoldG and RefC are preprocessed by MDETR [18], so they can be fed into Grounding DINO directly. GoldG contains images from Flickr30k Entities [37,38] and Visual Genome [20]. RefC contains images from RefCOCO, RefCOCO+, and RefCOCOg.

Caption data. To enhance the model performance on novel categories, we feed semantically rich caption data to our model. Following GLIP, we use pseudo-labeled caption data for model training. In our experiments, we use the same data as GLIP under comparable settings. More specifically, we use GLIP-T annotated caption data for Grounding DINO T, while ...

[Figure: model overview, with a feature enhancer layer and a cross-modality decoder layer. Overview labels: Input Text, Input Image, Model Outputs, Cross-Modality Queries, Text Features, Image Features. Cross-modality decoder layer: Cross-Modality Query, Self-Attention, Image Cross-Attention, Text Cross-Attention, FFN, Updated Cross-Modality Query. Feature enhancer layer: Self-Attention (text), Deformable Self-Attention (image), Image-to-text Cross-Attention, Text-to-image Cross-Attention, FFN, Updated Image Features, Updated Text Features. Loss labels: Contrastive loss, Localizat...]

[6] [6]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Dai, X., Chen, Y., Yang, J., Zhang, P., Yuan, L., Zhang, L.: Dynamic detr: End-to- end object detection with dynamic attention. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 2988–2997 (October 2021)

work page 2021

[7] [7]

arXiv: Computer Vision and Pattern Recognition (2021)

Deng, J., Yang, Z., Chen, T., Zhou, W., Li, H.: Transvg: End-to-end visual grounding with transformers. arXiv: Computer Vision and Pattern Recognition (2021)

work page 2021

[8] [8]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi- rectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[9] [9]

Dong, N., Zhang, Y., Ding, M., Lee, G.H.: Boosting long-tailed object detection via step-wise learning on smooth-tail data (2023),https://arxiv.org/abs/2305. 12833

work page 2023

[10] [10]

Du, Y., Fu, Z., Liu, Q., Wang, Y.: Visual grounding with transformers. (2021)

work page 2021

[11] [11]

neural information processing systems (2020)

Gan, Z., Chen, Y.C., Li, L., Zhu, C., Cheng, Y., Liu, J.: Large-scale adversar- ial training for vision-and-language representation learning. neural information processing systems (2020)

work page 2020

[12] [12]

Clip-adapter: Better vision-language models with feature adapters,

Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y.: Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544 (2021)

work page arXiv 2021

[13] [13]

Fast convergence of detr with spatially modulated co-attention

Gao, P., Zheng, M., Wang, X., Dai, J., Li, H.: Fast convergence of DETR with spatially modulated co-attention. arXiv preprint arXiv:2101.07448 (2021)

work page arXiv 2021

[14] [14]

Learning (2021)

Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. Learning (2021)

work page 2021

[15] [15]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Gupta, A., Dollar, P., Girshick, R.: Lvis: A dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5356–5364 (2019)

work page 2019

[16] [16]

In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (2016)

work page 2016

[17] [17]

Jia, D., Yuan, Y., He, H., Wu, X., Yu, H., Lin, W., Sun, L., Zhang, C., Hu, H.: DETRs with hybrid matching (2022)

work page 2022

[18] [18]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: MDETR- modulated detection for end-to-end multi-modal understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1780–1790 (2021)

work page 2021

[19] [19]

Dataset available from https://github

Krasin, I., Duerig, T., Alldrin, N., Ferrari, V., Abu-El-Haija, S., Kuznetsova, A., Rom, H., Uijlings, J., Popov, S., Veit, A., et al.: Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github. com/openimages 2(3), 18 (2017)

work page 2017

[20] [20]

International Journal of Computer Vision (2017)

Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., Bernstein, M.S., Fei-Fei, L.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (2017)

work page 2017

[21] [21]

Liu et al

Kuo, W., Bertsch, F., Li, W., Piergiovanni, A., Saffar, M., Angelova, A.: Findit: Generalized localization with natural language queries (2022) 16 S. Liu et al

work page 2022

[22] [22]

arXiv: Computer Vision and Pattern Recognition (2018)

Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., Duerig, T., Ferrari, V.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv: Computer Vision and Pattern Recognition (2018)

work page 2018

[23] [23]

Li, C., Liu, H., Li, L.H., Zhang, P., Aneja, J., Yang, J., Jin, P., Lee, Y.J., Hu, H., Liu, Z., Gao, J.: Elevater: A benchmark and toolkit for evaluating language-augmented visual models (2022)

work page 2022

[24] [24]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, F., Zhang, H., Liu, S., Guo, J., Ni, L.M., Zhang, L.: DN-DETR: Accelerate DETR training by introducing query denoising. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13619–13627 (2022)

work page 2022

[25] [25]

Grounded language-image pre-training

Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. arXiv preprint arXiv:2112.03857 (2021)

work page arXiv 2021

[26] [26]

arXiv: Computer Vision and Pattern Recognition (2021)

Li, M., Sigal, L.: Referring transformer: A one-step approach to multi-task visual grounding. arXiv: Computer Vision and Pattern Recognition (2021)

work page 2021

[27] [27]

CVPR (2023)

Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., Lee, Y.J.: GLIGEN: Open-set grounded text-to-image generation. CVPR (2023)

work page 2023

[28] [28]

In: Proceedings of the IEEE international conference on computer vision

Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980–2988 (2017)

work page 2017

[29] [29]

In: European conference on computer vision

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

work page 2014

[30] [30]

international conference on computer vision (2017)

Liu, J., Wang, L., Yang, M.H.: Referring expression generation and comprehension via attributes. international conference on computer vision (2017)

work page 2017

[31] [31]

In: International Conference on Learning Representations (2022),https://openreview.net/forum? id=oMI9PjOb9Jl

Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., Zhang, L.: DAB- DETR: Dynamic anchor boxes are better queries for DETR. In: International Conference on Learning Representations (2022),https://openreview.net/forum? id=oMI9PjOb9Jl

work page 2022

[32] [32]

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)

work page internal anchor Pith review arXiv 2021

[33] [33]

Conditional detr for fast training convergence

Meng, D., Chen, X., Fan, Z., Zeng, G., Li, H., Yuan, Y., Sun, L., Wang, J.: Conditional DETR for fast training convergence. arXiv preprint arXiv:2108.06152 (2021)

work page arXiv 2021

[34] [34]

ArXivabs/2204.09957 (2022)

Miao, P., Su, W., Wang, L., Fu, Y., Li, X.: Referring expression comprehension via cross-level multi-modal fusion. ArXivabs/2204.09957 (2022)

work page arXiv 2022

[35] [35]

Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., Wang, X., Zhai, X., Kipf, T., Houlsby, N.: Simple open-vocabulary object detection with vision transformers (2022)

work page 2022

[36] [36]

neural information processing systems (2011)

Ordonez, V., Kulkarni, G., Berg, T.L.: Im2text: Describing images using 1 million captioned photographs. neural information processing systems (2011)

work page 2011

[37] [37]

In: Proceedings of the IEEE international conference on computer vision

Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazeb- nik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE international conference on computer vision. pp. 2641–2649 (2015)

work page 2015

[38] [38]

International Journal of Computer Vision (2015) Grounding DINO 17

Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazeb- nik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. International Journal of Computer Vision (2015) Grounding DINO 17

work page 2015

[39] [39]

IEEE Transactions on Pattern Analysis and Machine Intelligence39(6), 1137–1149 (2017)

Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence39(6), 1137–1149 (2017)

work page 2017

[40] [40]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Gener- alized intersection over union: A metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 658–666 (2019)

work page 2019

[41] [41]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)

work page 2021

[42] [42]

meeting of the association for computational linguistics (2015)

Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. meeting of the association for computational linguistics (2015)

work page 2015

[43] [43]

In: Proceedings of the IEEE international conference on computer vision

Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J.: Objects365: A large-scale, high-quality dataset for object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 8430–8439 (2019)

work page 2019

[44] [44]

meeting of the association for computational linguistics (2018)

Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. meeting of the association for computational linguistics (2018)

work page 2018

[45] [45]

In: Proceedings of the AAAI Conference on Artificial Intelligence (2023)

Shilong, L., Yaoyuan, L., Shijia, H., Feng, L., Hao, Z., Hang, S., Jun, Z., Lei, Z.: DQ-DETR: Dual query detection transformer for phrase extraction and grounding. In: Proceedings of the AAAI Conference on Artificial Intelligence (2023)

work page 2023

[46] [46]

international conference on machine learning (2019)

Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neural networks. international conference on machine learning (2019)

work page 2019

[47] [47]

Communications of The ACM (2016)

Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D.N., Borth, D., Li, L.J.: Yfcc100m: the new data in multimedia research. Communications of The ACM (2016)

work page 2016

[48] [48]

national conference on artificial intelligence (2021)

Wang,Y.,Zhang,X.,Yang,T.,Sun,J.:AnchorDETR:Querydesignfortransformer- based detector. national conference on artificial intelligence (2021)

work page 2021

[49] [49]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-the- art natural language processing. arXiv preprint arXiv:1910.03771 (2019)

work page internal anchor Pith review arXiv 1910

[50] [50]

arXiv preprint arXiv:2106.09018 (2021)

Xu, M., Zhang, Z., Hu, H., Wang, J., Wang, L., Wei, F., Bai, X., Liu, Z.: End-to-end semi-supervised object detection with soft teacher. arXiv preprint arXiv:2106.09018 (2021)

work page arXiv 2021

[51] [51]

Yao, L., Han, J., Liang, X., Xu, D., Zhang, W., Li, Z., Xu, H.: DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment (2023)

work page 2023

[52] [52]

Yao, L., Han, J., Wen, Y., Liang, X., Xu, D., Zhang, W., Li, Z., Xu, C., Xu, H.: Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection (2022)

work page 2022

[53] [53]

computer vision and pattern recognition (2018)

Yu, L., Lin, Z., Shen, X., Yang, J., Lu, X., Bansal, M., Berg, T.L.: Mattnet: Modular attention network for referring expression comprehension. computer vision and pattern recognition (2018)

work page 2018

[54] [54]

Yuan, L., Chen, D., Chen, Y.L., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., Liu, C., Liu, M., Liu, Z., Lu, Y., Shi, Y., Wang, L., Wang, J., Xiao, B., Xiao, Z., Yang, J., Zeng, M., Zhou, L., Zhang, P.: Florence: A new foundation model for computer vision (2022)

work page 2022

[55] [55]

Zang, Y., Li, W., Zhou, K., Huang, C., Loy, C.C.: Open-vocabulary DETR with conditional matching (2022)

work page 2022

[56] [56]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.F.: Open-vocabulary object detection using captions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14393–14402 (2021) 18 S. Liu et al

work page 2021

[57] [57]

Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., Shum, H.Y.: DINO: DETR with improved denoising anchor boxes for end-to-end object detection (2022)

work page 2022

[58] [58]

Zhang, H., Zhang, P., Hu, X., Chen, Y.C., Li, L.H., Dai, X., Wang, L., Yuan, L., Hwang, J.N., Gao, J.: GLIPv2: Unifying Localization and Vision-Language Understanding (2022)

work page 2022

[59] [59]

Zhang, H., Zhang, P., Hu, X., Chen, Y.C., Li, L.H., Dai, X., Wang, L., Yuan, L., Hwang, J.N., Gao, J.: Glipv2: Unifying localization and vision-language understand- ing (2022)

work page 2022

[60] [60]

Zhao, T., Liu, P., Lu, X., Lee, K.: Omdet: Language-aware object detection with large-scale vision-language multi-dataset pre-training (2022)

work page 2022

[61] [61]

Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., Gao, J.: Regionclip: Region-based language-image pretraining (2022)

work page 2022

[62] [62]

In: ECCV (2022)

Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty- thousand classes using image-level supervision. In: ECCV (2022)

work page 2022

[63] [63]

Objects as Points

Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)

[64]

"""
Input:
    image_feat: (bs, num_img_tokens, ndim)
    text_feat: (bs, num_text_tokens, ndim)
    num_query: int

Output:
    topk_idx: (bs, num_query)
"""
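The signature above (image and text features in, per-image top-k token indices out) corresponds to the language-guided query selection step. The following is a minimal, dependency-free Python sketch of one way such a selection could work. The scoring rule is not visible in this fragment, so dot-product similarity with a maximum over text tokens is an assumption here, and the function name is illustrative only. A practical implementation would likely use batched tensor operations rather than nested lists.

```python
def language_guided_query_selection(image_feat, text_feat, num_query):
    """Select, per batch item, the image tokens most similar to the text.

    image_feat: nested list, shape (bs, num_img_tokens, ndim)
    text_feat:  nested list, shape (bs, num_text_tokens, ndim)
    num_query:  number of image tokens to keep per batch item
    Returns topk_idx: nested list, shape (bs, num_query)
    """
    topk_idx = []
    for img_tokens, txt_tokens in zip(image_feat, text_feat):
        # Assumed scoring: best dot-product of each image token
        # against any text token.
        scores = [
            max(sum(a * b for a, b in zip(img, txt)) for txt in txt_tokens)
            for img in img_tokens
        ]
        # Sort indices by descending score; sorted() is stable, so
        # tied tokens keep their original order.
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        topk_idx.append(order[:num_query])
    return topk_idx


if __name__ == "__main__":
    image_feat = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]  # bs=1, 3 image tokens
    text_feat = [[[1.0, 0.0], [2.0, 0.0]]]               # bs=1, 2 text tokens
    print(language_guided_query_selection(image_feat, text_feat, num_query=2))
```

The selected indices would then be used to pick image features that initialize the decoder queries; that step is outside this sketch.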

Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable transformers for end-to-end object detection. In: ICLR 2021: The Ninth International Conference on Learning Representations (2021)

A More Implementation Details

A.1 Hyperparameters

Table 8 presents the hyperparameters used in our main experiments.

Item Value optimizer ...

[65]

Detection data. Following GLIP [25], we reformulate the object detection task as a phrase grounding task by concatenating the category names into text prompts. We use COCO [29], O365 [43], and OpenImage (OI) [19] for model pre-training. To simulate different text inputs, we randomly sample category names from all categories in a dataset on the fly during training.
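As a rough illustration of this reformulation, the sketch below samples category names and joins them into a single text prompt, recording the character span of each name so that box labels can be mapped to positions in the text. The function name, the " . " separator, and the span bookkeeping are assumptions made for illustration; the text above does not specify them. The sketch also does not model which categories are present in a given image.

```python
import random


def build_category_prompt(all_categories, num_sampled, rng, separator=" . "):
    """Sample category names and concatenate them into one text prompt.

    all_categories: list of unique category name strings
    num_sampled:    how many names to draw for this training sample
    rng:            a random.Random instance (passed in for reproducibility)
    separator:      string placed between names (hypothetical choice)
    Returns (prompt, spans), where spans maps name -> (start, end) in prompt.
    """
    sampled = rng.sample(all_categories, num_sampled)
    spans = {}
    cursor = 0
    for name in sampled:
        # Record where this name will sit inside the joined prompt.
        spans[name] = (cursor, cursor + len(name))
        cursor += len(name) + len(separator)
    prompt = separator.join(sampled)
    return prompt, spans


if __name__ == "__main__":
    categories = ["cat", "dog", "person", "table"]
    prompt, spans = build_category_prompt(categories, 3, random.Random(0))
    print(prompt)
    print(spans)
```

Drawing a fresh sample on every call mimics the on-the-fly sampling described above, so the same image can be paired with different prompts across epochs.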

[66]

Grounding data. We use the GoldG and RefC data as grounding data. Both GoldG and RefC are preprocessed by MDETR [18]. These data can be fed into Grounding DINO directly. GoldG contains images in Flickr30k entities [37,38] and Visual Genome [20]. RefC contains images in RefCOCO, RefCOCO+, and RefCOCOg

[67]

Caption data. To enhance model performance on novel categories, we feed semantically rich caption data to our model. Following GLIP, we use the pseudo-labeled caption data for model training. In our experiments, we use the same data as GLIP under comparable settings. More specifically, we use GLIP-T annotated caption data for Grounding DINO T, while ...

[68]

[Figure: overall model framework. Inputs: text and image; output: model outputs. Panel "A Cross-Modality Decoder Layer": cross-modality query, self-attention, image cross-attention, text cross-attention, FFN, updated cross-modality query. Other labels: keys and values, cross-modality queries, text features, image features, vanilla text features.]

[69]

[Figure: panel "A Feature Enhancer Layer". Labels: self-attention (text), deformable self-attention (image), image-to-text cross-attention, text-to-image cross-attention, FFN (per modality), vanilla image features, image features, text features, updated image features, updated text features. Example text tokens: cat, dog, desk, person, dog, mouse, table sets. Remaining labels (truncated in the source): Contrastive loss, Localizat...]
