Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Pith reviewed 2026-05-11 06:15 UTC · model grok-4.3
The pith
Combining an open-set detector with a segment-anything model enables text-prompted detection and segmentation of arbitrary regions and opens a path to other vision tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By using an open-set detector to detect objects from text and feeding its outputs as prompts to the segment-anything model, the system achieves detection and segmentation of any regions based on arbitrary text inputs. This opens a door to connecting various vision models for diverse tasks, including automatic annotation with captioning models, controllable editing with diffusion models, and promptable 3D human motion analysis. On the SegInW zero-shot benchmark, the combination attains 48.7 mean AP.
What carries the argument
The pipeline that uses bounding boxes from the open-set detector as spatial prompts to guide the promptable segmenter.
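The interface is simple enough to sketch in Python. Here, detect_boxes is a hypothetical stand-in for Grounding DINO inference (not the authors' exact interface), while SamPredictor and sam_model_registry are the real entry points of the segment_anything package; the checkpoint name is illustrative.

    import numpy as np
    from segment_anything import SamPredictor, sam_model_registry

    def detect_boxes(image_rgb: np.ndarray, text: str) -> np.ndarray:
        """Hypothetical wrapper around Grounding DINO: returns an (N, 4)
        array of XYXY pixel boxes for regions matching the text prompt."""
        raise NotImplementedError("stand-in for open-set detector inference")

    def grounded_sam(image_rgb: np.ndarray, text: str, predictor: SamPredictor):
        """Text -> boxes -> masks: detector boxes become spatial prompts
        for the promptable segmenter, with no joint training."""
        boxes = detect_boxes(image_rgb, text)      # (N, 4) XYXY pixel boxes
        predictor.set_image(image_rgb)             # embed the image once
        masks = []
        for box in boxes:
            # SAM accepts an XYXY box directly as a spatial prompt
            mask, _, _ = predictor.predict(box=box, multimask_output=False)
            masks.append(mask[0])                  # (H, W) boolean mask
        return boxes, masks

    # Usage (checkpoint name is illustrative):
    # sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    # boxes, masks = grounded_sam(image, "a running dog", SamPredictor(sam))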
If this is right
- Automatic annotation pipelines become possible using only input images and added captioning models (sketched after this list).
- Controllable image editing is enabled by linking with diffusion models.
- Promptable 3D human motion analysis is supported through integration with specialized motion models.
- High performance is achieved on open-vocabulary segmentation tasks in zero-shot settings without fine-tuning.
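The first of these consequences is concrete enough to sketch. The caption_tags callable below is an assumed wrapper around a captioning or tagging model such as BLIP or Recognize Anything, and grounded_sam is the pipeline from the earlier sketch (with its predictor argument pre-bound); neither is the authors' exact interface.

    from typing import Callable
    import numpy as np

    def auto_annotate(image_rgb: np.ndarray,
                      caption_tags: Callable[[np.ndarray], list],
                      grounded_sam: Callable) -> list:
        """Image-only annotation: a captioner/tagger proposes text labels,
        then the detector+segmenter pair localizes and masks each label.
        Both callables are assumed wrappers, not the authors' interfaces."""
        annotations = []
        for tag in caption_tags(image_rgb):   # labels proposed from pixels alone
            boxes, masks = grounded_sam(image_rgb, tag)
            for box, mask in zip(boxes, masks):
                annotations.append({"label": tag, "box_xyxy": box, "mask": mask})
        return annotations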
Where Pith is reading between the lines
- Model composition like this could reduce the reliance on training separate systems for each visual task.
- It implies that compatibility in prompt formats between models is key to seamless assembly.
- Extensions to other modalities or more complex tasks might be feasible by adding appropriate models.
- The zero-shot performance suggests potential for broader applications in real-world scenarios where labeled data is scarce.
Load-bearing premise
The bounding box proposals from the open-set detector are sufficiently accurate and compatible to directly guide the segmenter without requiring refinement or additional training of the combined system.
What would settle it
A demonstration that the combined system fails to produce accurate segmentations for text-described objects that the detector correctly identifies, or that performance does not exceed what the individual models achieve separately.
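One way to operationalize that test, as a hedged sketch: score the assembled pipeline's masks against ground truth and against a detector-only baseline that simply fills each correctly detected box. The helper names are illustrative, not from the paper.

    import numpy as np

    def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
        """IoU between two boolean masks of the same shape."""
        union = np.logical_or(a, b).sum()
        return float(np.logical_and(a, b).sum()) / union if union else 0.0

    def box_as_mask(box_xyxy: np.ndarray, shape) -> np.ndarray:
        """Detector-only baseline: treat the filled box as the 'segmentation'."""
        x0, y0, x1, y1 = box_xyxy.astype(int)
        m = np.zeros(shape, dtype=bool)
        m[y0:y1, x0:x1] = True
        return m

    # For each text-described object the detector localizes correctly,
    # the paper's claim predicts
    #     mask_iou(sam_mask, gt) > mask_iou(box_as_mask(box, gt.shape), gt)
    # Systematic violations of this inequality would be the refuting evidence.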
read the original abstract
We introduce Grounded SAM, which uses Grounding DINO as an open-set object detector to combine with the segment anything model (SAM). This integration enables the detection and segmentation of any regions based on arbitrary text inputs and opens a door to connecting various vision models. As shown in Fig.1, a wide range of vision tasks can be achieved by using the versatile Grounded SAM pipeline. For example, an automatic annotation pipeline based solely on input images can be realized by incorporating models such as BLIP and Recognize Anything. Additionally, incorporating Stable-Diffusion allows for controllable image editing, while the integration of OSX facilitates promptable 3D human motion analysis. Grounded SAM also shows superior performance on open-vocabulary benchmarks, achieving 48.7 mean AP on SegInW (Segmentation in the wild) zero-shot benchmark with the combination of Grounding DINO-Base and SAM-Huge models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Grounded SAM, an assembly of Grounding DINO (open-set detector) with SAM (segmentation model) to enable text-prompted detection and segmentation of arbitrary regions in open-world settings. It illustrates versatility through integrations with models such as BLIP for automatic annotation, Stable Diffusion for controllable editing, and OSX for 3D motion analysis, and reports a zero-shot result of 48.7 mean AP on the SegInW benchmark using the Grounding DINO-Base + SAM-Huge combination.
Significance. If the direct interface between detector outputs and SAM prompts holds under open-vocabulary conditions, the work provides a practical, training-free template for composing existing foundation models into more capable systems. This could lower barriers for open-world vision applications and encourage further model-assembly research; the reported SegInW score, once properly documented, would serve as a useful reference point for zero-shot segmentation performance.
major comments (2)
- [Abstract / Experiments] The headline 48.7 mAP on SegInW is stated without any description of the evaluation protocol, comparison baselines, ablations on prompt quality or model-size variants, or error-propagation analysis, leaving the central empirical claim unsupported by verifiable evidence (a sketch of the kind of protocol at issue follows these comments).
- [Method] The assumption that Grounding DINO bounding boxes and labels can be used directly as drop-in prompts for SAM is presented without specifying the exact prompt construction (e.g., box-to-point conversion, label text formatting), any post-processing, or robustness measures against localization noise or label errors; this interface assumption is load-bearing for the claimed compatibility and performance.
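For reference, a minimal sketch of the kind of evaluation protocol the first comment asks for, using the standard pycocotools mask-AP machinery; the per-dataset file layout is assumed here, not taken from the paper or the benchmark's official tooling.

    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    def seginw_mean_ap(dataset_specs):
        """dataset_specs: list of (gt_json, pred_json) pairs, one per
        SegInW dataset (paths are placeholders). Returns the mean of
        per-dataset mask AP, i.e. the quantity a '48.7 mean AP' claim
        would need to document."""
        aps = []
        for gt_json, pred_json in dataset_specs:
            coco_gt = COCO(gt_json)               # ground-truth annotations
            coco_dt = coco_gt.loadRes(pred_json)  # pipeline's predicted masks
            ev = COCOeval(coco_gt, coco_dt, iouType="segm")
            ev.evaluate()
            ev.accumulate()
            ev.summarize()
            aps.append(ev.stats[0])               # AP @ IoU 0.50:0.95
        return sum(aps) / len(aps)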
minor comments (1)
- [Figure 1] Figure 1 caption and surrounding text could more explicitly label the data flow arrows between Grounding DINO and SAM to clarify the interface for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We address each major point below and will revise the manuscript to improve clarity and verifiability of the presented results and method.
read point-by-point responses
-
Referee: [Abstract / Experiments] The headline 48.7 mAP on SegInW is stated without any description of the evaluation protocol, comparison baselines, ablations on prompt quality or model-size variants, or error-propagation analysis, leaving the central empirical claim unsupported by verifiable evidence.
Authors: We agree that the abstract presents the 48.7 mAP result in a concise manner without accompanying details. The Experiments section of the manuscript describes the zero-shot evaluation on SegInW using the Grounding DINO-Base + SAM-Huge combination, but to fully address the concern we will revise both the abstract and Experiments section. Revisions will include a brief statement of the evaluation protocol in the abstract, explicit description of how text prompts are derived from the benchmark, comparison to relevant zero-shot baselines, ablations across model-size variants and prompt strategies, and a short analysis of error propagation from detection to segmentation outputs. These additions will make the central claim fully supported by documented evidence. revision: yes
-
Referee: [Method] The assumption that Grounding DINO bounding boxes and labels can be used directly as drop-in prompts for SAM is presented without specifying the exact prompt construction (e.g., box-to-point conversion, label text formatting), any post-processing, or robustness measures against localization noise or label errors; this interface assumption is load-bearing for the claimed compatibility and performance.
Authors: We concur that the Method section would benefit from greater specificity on the detector-to-segmenter interface. The current description focuses on the overall pipeline; we will expand it to detail prompt construction, including conversion of bounding boxes to center-point prompts (or direct box prompts when supported by SAM), formatting of class labels into text prompts, and any filtering or post-processing steps such as confidence thresholding. We will also add discussion of robustness, noting that SAM's promptable design tolerates moderate localization noise and that Grounding DINO's open-set training reduces label errors, together with a brief error-propagation analysis. These clarifications will be added without changing the core training-free assembly approach. revision: yes
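The interface the response describes can be made concrete with a short, hedged sketch: Grounding DINO-style detectors emit normalized cx, cy, w, h boxes, while SAM's box prompt expects XYXY pixel coordinates. The threshold value below is illustrative, not the authors' setting.

    import numpy as np

    def boxes_to_sam_prompts(boxes_cxcywh, scores, image_hw,
                             box_threshold=0.35, use_center_points=False):
        """Confidence filtering plus normalized-cxcywh -> pixel-XYXY
        conversion, with an optional center-point fallback, as outlined
        in the response above. boxes_cxcywh: (N, 4) array; scores: (N,)."""
        h, w = image_hw
        keep = scores >= box_threshold          # drop low-confidence detections
        cx, cy, bw, bh = boxes_cxcywh[keep].T
        if use_center_points:
            return np.stack([cx * w, cy * h], axis=1)   # (M, 2) point prompts
        return np.stack([(cx - bw / 2) * w, (cy - bh / 2) * h,
                         (cx + bw / 2) * w, (cy + bh / 2) * h], axis=1)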
Circularity Check
No circularity: empirical model assembly with no derivations or self-referential predictions
full rationale
The paper presents Grounded SAM as a pipeline that assembles existing pre-trained models (Grounding DINO for detection, SAM for segmentation, plus optional models like BLIP or Stable Diffusion) to enable text-prompted open-world tasks. No equations, parameter fitting, or derivations are described; the 48.7 mAP on SegInW is reported as an empirical benchmark result for the Base+Huge combination. All load-bearing elements are external model capabilities rather than internally derived quantities that reduce to the paper's own inputs by construction. This is a standard non-circular engineering assembly.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Grounding DINO outputs can be used directly as effective prompts for SAM without compatibility issues or performance degradation.
Forward citations
Cited by 47 Pith papers
-
Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles
A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.
-
Local Conformal Calibration of Dynamics Uncertainty from Semantic Images
OCULAR calibrates dynamics uncertainty using perception from similar environments to give guaranteed prediction regions for unseen test conditions.
-
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
-
EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras
EgoEV-HandPose uses stereo event cameras and a bird's-eye-view fusion module to achieve 30.54 mm MPJPE and 86.87% gesture accuracy on a new large-scale egocentric dataset, outperforming prior RGB and event methods esp...
-
Is Your Driving World Model an All-Around Player?
WorldLens benchmark reveals no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, with contributions of a 26K human-annotated dataset and a distilled vision-language evaluator.
-
OpenSGA: Efficient 3D Scene Graph Alignment in the Open World
OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene g...
-
From Pixels to Concepts: Do Segmentation Models Understand What They Segment?
CAFE benchmark reveals that promptable segmentation models often produce correct masks for misleading prompts, showing a gap between localization accuracy and true concept understanding.
-
ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring
ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.
-
Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.
-
Anny-Fit: All-Age Human Mesh Recovery
Anny-Fit jointly optimizes all-age multi-person 3D human meshes in camera coordinates using complementary signals from off-the-shelf depth, segmentation, keypoint, and VLM networks, yielding better reprojection, depth...
-
Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.
-
DockAnywhere: Data-Efficient Visuomotor Policy Learning for Mobile Manipulation via Novel Demonstration Generation
DockAnywhere lifts single demonstrations to diverse docking points via structure-preserving augmentation and point-cloud spatial editing to improve viewpoint generalization in visuomotor policies for mobile manipulation.
-
ROSE: Retrieval-Oriented Segmentation Enhancement
ROSE is a retrieval-augmented plug-in that improves MLLM segmentation on novel and emerging entities by fetching web text and images and deciding when to use them.
-
AmodalSVG: Amodal Image Vectorization via Semantic Layer Peeling
AmodalSVG produces semantically separate and geometrically complete SVG layers from natural images by using VLM-guided semantic layer peeling for amodal completion followed by adaptive vectorization.
-
VLN-NF: Feasibility-Aware Vision-and-Language Navigation with False-Premise Instructions
VLN-NF benchmark adds false-premise instructions to VLN and ROAM hybrid agent improves REV-SPL by combining room navigation with evidence-gathering exploration.
-
YUV20K: A Complexity-Driven Benchmark and Trajectory-Aware Alignment Model for Video Camouflaged Object Detection
YUV20K is a complexity-driven VCOD benchmark with 24k annotated frames, paired with a model using Motion Feature Stabilization via semantic primitives and Trajectory-Aware Alignment via deformable sampling that outper...
-
Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation
Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.
-
Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction
ADM-GS decomposes static background appearance into traversal-invariant material and traversal-dependent illumination via a frequency-separated neural light field, yielding +0.98 dB PSNR gains and better cross-travers...
-
Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse
Chorus accelerates video DiT serving up to 45% via inter-request caching reuse in a three-stage denoising strategy with token-guided attention amplification.
-
3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI a...
-
Training a Student Expert via Semi-Supervised Foundation Model Distillation
A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.
-
Generalized Small Object Detection: A Point-Prompted Paradigm and Benchmark
TinySet-9M dataset and DEAL point-prompted framework deliver 31.4% relative AP75 gain over supervised baselines for small object detection with one click at inference and generalization to unseen categories.
-
Relit-LiVE: Relight Video by Jointly Learning Environment Video
Relit-LiVE jointly predicts relit videos and viewpoint-aligned environment maps inside a single diffusion process to achieve physically consistent video relighting without camera pose input.
-
Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation
PLMD applies a denoising diffusion model to predict labels for unknown map regions, allowing goal localization in unexplored environments by substituting completed labels into existing navigation pipelines.
-
Approaching human parity in the quality of automated organoid image segmentation
A composite SAM-based method segments organoid images with accuracy matching or approaching inter-observer variability among human annotators.
-
Sparse-View 3D Gaussian Splatting in the Wild
A new sparse-view 3D Gaussian splatting method for unconstrained scenes with distractors combines diffusion-based reference-guided refinement and sparsity-aware Gaussian replication to achieve better rendering quality.
-
WildLIFT: Lifting monocular drone video to 3D for species-agnostic wildlife monitoring
WildLIFT lifts monocular drone video to 3D for species-agnostic wildlife detection, tracking, and viewpoint analysis by integrating scene geometry with open-vocabulary segmentation.
-
PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics
PhysLayer is a framework that decomposes images into depth layers, simulates physics with depth awareness, and synthesizes videos guided by language for more plausible animations.
-
Wiggle and Go! System Identification for Zero-Shot Dynamic Rope Manipulation
Wiggle and Go! uses system identification from rope motion observations to predict parameters that enable zero-shot goal-conditioned dynamic manipulation, achieving 3.55 cm accuracy on 3D target striking versus 15.34 ...
-
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation
Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
-
SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation
SpaCeFormer delivers 11.1 zero-shot mAP on ScanNet200 (2.8x prior proposal-free best) and runs 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines by using spatial window attention and Morton-curve seriali...
-
AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion
A two-stage method synthesizes multi-view 2D motion data from internet video keypoints and trains a camera-conditioned diffusion model to recover globally consistent 3D human motion and HOI in world space.
-
DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior
DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.
-
OVAL: Open-Vocabulary Augmented Memory Model for Lifelong Object Goal Navigation
OVAL introduces an open-vocabulary memory model with structured descriptors and multi-value frontier scoring to enable efficient lifelong object goal navigation in unseen settings.
-
Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions
The model uses dense visuo-tactile feature interactions and material-diversity pairing on expanded datasets to generate tactile saliency maps for material segmentation, outperforming prior global-alignment methods.
-
Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting
A scene-agnostic object codebook learned via unsupervised object-centric learning provides consistent identity-anchored representations for 3D Gaussians across multiple scenes.
-
ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration
ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.
-
Visually-grounded Humanoid Agents
A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.
-
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection
VL-SAM-v3 retrieves visual prototypes from memory to generate sparse spatial and dense contextual priors that refine detection prompts, yielding gains on rare categories in LVIS for both open-vocabulary and open-ended...
-
CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers
CreatiParser decomposes raster graphic designs into editable text, background, and sticker layers via a hybrid VLM-diffusion model with ParserReward and GRPO optimization, reporting 23.7% average metric gains on Parse...
-
LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment
LIDEA bridges the human-robot embodiment gap via implicit feature distillation in 2D and explicit geometry alignment in 3D, enabling human data to substitute up to 80% of robot demonstrations with improved out-of-dist...
-
MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation
MV3DIS uses 3D-guided mask matching and depth consistency to produce more consistent multi-view 2D masks that refine into accurate zero-shot 3D instances.
-
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...
-
Visual Prompt Based Reasoning for Offroad Mapping using Multimodal LLMs
A zero-shot pipeline uses SAM2 segmentation plus numeric-label prompting of a VLM to identify drivable off-road areas and enable navigation without task-specific training or datasets.
-
Empowering NPC Dialogue with Environmental Context Using LLMs and Panoramic Images
NPCs gain spatial awareness via panoramic images turned into JSON scene data for LLMs, enabling dynamic references to nearby objects and improving player preference in user studies.
-
Selective Aggregation of Attention Maps Improves Diffusion-Based Visual Interpretation
Selective aggregation of cross-attention maps from the most relevant heads in diffusion-based T2I models yields higher mean IoU for visual interpretation than standard aggregation methods like DAAM.
-
World Simulation with Video Foundation Models for Physical AI
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
Reference graph
Works this paper leans on
- [1] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended Latent Diffusion, Jun 2022.
- [2] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended Diffusion for Text-driven Editing of Natural Images. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [3] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond, 2023.
- [4] Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, et al. SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
- [5] Junming Chen, Yunfei Liu, Jianan Wang, Ailing Zeng, Yu Li, and Qifeng Chen. DiffSHEG: A Diffusion-based Approach for Real-time Speech-driven Holistic 3D Expression and Gesture Generation. arXiv preprint arXiv:2401.04747, 2024.
- [6] Ling-Hao Chen, Jiawei Zhang, Yewen Li, Yiren Pang, Xiaobo Xia, and Tongliang Liu. HumanMAC: Masked Motion Completion for Human Motion Prediction. 2023.
- [7] Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A Language Modeling Framework for Object Detection. arXiv preprint arXiv:2109.10852, 2021.
- [8] Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, and Alexander G. Schwing. Mask2Former for Video Instance Segmentation. 2022.
- [9] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-Pixel Classification is Not All You Need for Semantic Segmentation. 2021.
- [10] Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, and Joon-Young Lee. Tracking Anything with Decoupled Video Segmentation. In ICCV, 2023.
- [11] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022.
- [12] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, 2023.
- [13] Luciano Floridi and Massimo Chiriatti. GPT-3: Its Nature, Scope, Limits, and Consequences. Minds and Machines, 30:681–694, 2020.
- [14] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors.
- [15] Roberto Gozalo-Brizuela and Eduardo C Garrido-Merchan. ChatGPT is Not All You Need: A State of the Art Review of Large Generative AI Models. arXiv preprint arXiv:2301.04655, 2023.
- [16] Jie Hu, Linyan Huang, Tianhe Ren, Shengchuan Zhang, Rongrong Ji, and Liujuan Cao. You Only Segment Once: Towards Real-Time Panoptic Segmentation, 2023.
- [17] Xinyu Huang, Yi-Jie Huang, Youcai Zhang, Weiwei Tian, Rui Feng, Yuejie Zhang, Yanchun Xie, Yaqian Li, and Lei Zhang. Open-Set Image Tagging with Multi-Grained Text Supervision, 2023.
- [18] Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2Text: Guiding Vision-Language Model via Image Tagging, 2023.
- [19] Ding Jia, Yuhui Yuan, Haodi He, Xiaopei Wu, Haojun Yu, Weihong Lin, Lei Sun, Chao Zhang, and Han Hu. DETRs with Hybrid Matching. arXiv preprint arXiv:2207.13080, 2022.
- [20] Qing Jiang, Feng Li, Tianhe Ren, Shilong Liu, Zhaoyang Zeng, Kent Yu, and Lei Zhang. T-Rex: Counting by Visual Prompting, 2023.
- [21] Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code. arXiv preprint arXiv:2310.01506, 2023.
- [22] Xuan Ju, Ailing Zeng, Chenchen Zhao, Jianan Wang, Lei Zhang, and Qiang Xu. HumanSD: A Native Skeleton-guided Diffusion Model for Human Image Generation. 2023.
- [23] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up GANs for Text-to-Image Synthesis.
- [24] Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Segment Anything in High Quality. arXiv preprint arXiv:2306.01567, 2023.
- [25] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment Anything. arXiv preprint arXiv:2304.02643, 2023.
- [26] Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, and Jianfeng Gao. Visual In-Context Prompting, 2023.
- [27] Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. In Computer Vision and Pattern Recognition (CVPR), 2022.
- [28] Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-SAM: Segment and Recognize Anything at Any Granularity, 2023.
- [29] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum. Mask DINO: Towards a Unified Transformer-based Framework for Object Detection and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023.
- [30] Hongyang Li, Hao Zhang, Zhaoyang Zeng, Shilong Liu, Feng Li, Tianhe Ren, and Lei Zhang. DFA3D: 3D Deformable Attention for 2D-to-3D Feature Lifting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6684–6693, October 2023.
- [31] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- [32] Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
- [33] Jing Lin, Ailing Zeng, Haoqian Wang, Lei Zhang, and Yu Li. One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- [34] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. arXiv preprint arXiv:2304.08485, 2023.
- [35] Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, and Chunyuan Li. LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents, 2023.
- [36] Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. In International Conference on Learning Representations, 2022.
- [37] Shilong Liu, Yaoyuan Liang, Feng Li, Shijia Huang, Hao Zhang, Hang Su, Jun Zhu, and Lei Zhang. DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding, 2022.
- [38] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv preprint arXiv:2303.05499, 2023.
- [39] Shunlin Lu, Ling-Hao Chen, Ailing Zeng, Jing Lin, Ruimao Zhang, Lei Zhang, and Heung-Yeung Shum. HumanTOMATO: Text-aligned Whole-body Motion Generation. arXiv preprint arXiv:2310.12978, 2023.
- [40] Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models, 2023.
- [41] Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Liujuan Cao, Chenglin Wu, Cheng Deng, and Rongrong Ji. Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation, 2020.
- [42] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Image Synthesis and Editing with Stochastic Differential Equations, Aug 2021.
- [43] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional DETR for Fast Training Convergence. arXiv preprint arXiv:2108.06152, 2021.
- [44] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models, Feb 2023.
- [45] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models.
- [46]
- [47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models from Natural Language Supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [48] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents.
- [49] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
- [50] Tianhe Ren, Shilong Liu, Feng Li, Hao Zhang, Ailing Zeng, Jie Yang, Xingyu Liao, Ding Jia, Hongyang Li, He Cao, et al. detrex: Benchmarking Detection Transformers. arXiv preprint arXiv:2306.07265, 2023.
- [51] Tianhe Ren, Jianwei Yang, Shilong Liu, Ailing Zeng, Feng Li, Hao Zhang, Hongyang Li, Zhaoyang Zeng, and Lei Zhang. A Strong and Reproducible Object Detector with Only Public Datasets, 2023.
- [52] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [53] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, Aug 2022.
- [54] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding.
- [55] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv preprint arXiv:2303.17580, 2023.
- [56] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust Large Mask Inpainting with Fourier Convolutions. arXiv preprint arXiv:2109.07161, 2021.
- [57] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language Models for Dialog Applications. arXiv preprint arXiv:2201.08239, 2022.
- [58] Jiaqi Wang, Pan Zhang, Tao Chu, Yuhang Cao, Yujie Zhou, Tong Wu, Bin Wang, Conghui He, and Dahua Lin. V3Det: Vast Vocabulary Visual Detection Dataset. arXiv preprint arXiv:2304.03752, 2023.
- [59] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. In ICML, 2022.
- [60] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. CogVLM: Visual Expert for Pretrained Language Models, 2023.
- [61] Yinhuai Wang, Jing Lin, Ailing Zeng, Zhengyi Luo, Jian Zhang, and Lei Zhang. PhysHOI: Physics-based Imitation of Dynamic Human-Object Interaction. arXiv preprint arXiv:2312.04393, 2023.
- [62] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv preprint arXiv:2303.04671, 2023.
- [63] Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest Iandola, et al. EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything. arXiv preprint arXiv:2312.00863, 2023.
- [64] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models. arXiv preprint arXiv:2303.04803, 2023.
- [65] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side Adapter Network for Open-Vocabulary Semantic Segmentation, 2023.
- [66] Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Zehuan Yuan, Ping Luo, and Huchuan Lu. Universal Instance Perception as Object Discovery and Retrieval. In CVPR, 2023.
- [67] Feng Yan, Weixin Luo, Yujie Zhong, Yiyang Gan, and Lin Ma. Bridging the Gap Between End-to-end and Non-End-to-end Multi-Object Tracking, 2023.
- [68] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by Example: Exemplar-based Image Editing with Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023.
- [69] Jie Yang, Bingliang Li, Fengyu Yang, Ailing Zeng, Lei Zhang, and Ruimao Zhang. Boosting Human-Object Interaction Detection with Text-to-Image Diffusion Model. arXiv preprint arXiv:2305.12252, 2023.
- [70] Jie Yang, Chaoqun Wang, Zhen Li, Junle Wang, and Ruimao Zhang. Semantic Human Parsing via Scalable Semantic Transfer over Multiple Label Domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19424–19433, 2023.
- [71] Jie Yang, Ailing Zeng, Feng Li, Shilong Liu, Ruimao Zhang, and Lei Zhang. Neural Interactive Keypoint Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15122–15132, 2023.
- [72] Jie Yang, Ailing Zeng, Shilong Liu, Feng Li, Ruimao Zhang, and Lei Zhang. Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation. In International Conference on Learning Representations, 2023.
- [73] Jie Yang, Ailing Zeng, Ruimao Zhang, and Lei Zhang. UniPose: Detecting Any Keypoints. arXiv preprint arXiv:2310.08530, 2023.
- [74] Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Effective Whole-body Pose Estimation with Two-stages Distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4210–4220, 2023.
- [75] Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-Tau Yih. Retrieval-Augmented Multimodal Language Modeling.
- [76] Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv preprint arXiv:2306.14289, 2023.
- [77] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, 2022.
- [78] Hao Zhang, Feng Li, Huaizhe Xu, Shijia Huang, Shilong Liu, Lionel M Ni, and Lei Zhang. MP-Former: Mask-Piloted Transformer for Image Segmentation. arXiv preprint arXiv:2303.07336, 2023.
- [79] Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, and Lei Zhang. A Simple Framework for Open-Vocabulary Segmentation and Detection. arXiv preprint arXiv:2303.08131, 2023.
- [80] Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, and Jianwei Yang. LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models, 2023.