pith. machine review for the scientific record.

arxiv: 2401.14159 · v1 · submitted 2024-01-25 · 💻 cs.CV

Recognition: no theorem link

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

Ailing Zeng, Feng Li, Feng Yan, Hao Zhang, He Cao, Hongyang Li, Jiayu Chen, Jie Yang, Jing Lin, Kunchang Li, Lei Zhang, Qing Jiang, Shilong Liu, Tianhe Ren, Xinyu Huang, Yukang Chen, Zhaoyang Zeng

Authors on Pith no claims yet

Pith reviewed 2026-05-11 06:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords Grounded SAM · open-set detection · segment anything · zero-shot segmentation · text-prompted segmentation · model composition · open-world vision · visual task pipelines

The pith

Assembling an open-set detector with a segment-anything model enables text-prompted segmentation of arbitrary regions and connects to other vision tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how to combine an open-set object detector with a promptable segmentation model to handle a wide range of visual tasks using only text descriptions as input. The resulting system can detect and outline objects or regions based on natural language without additional training for each new task. It further allows plugging in other models to create pipelines for automatic image labeling, controlled editing, and even 3D motion analysis. Readers should care because such composition turns specialized tools into a flexible platform for open-world vision problems. The approach reaches 48.7 mean average precision on a challenging zero-shot segmentation benchmark.

Core claim

By using an open-set detector to detect objects from text and feeding its outputs as prompts to the segment-anything model, the system achieves detection and segmentation of any regions based on arbitrary text inputs. This opens a door to connecting various vision models for diverse tasks, including automatic annotation with captioning models, controllable editing with diffusion models, and promptable 3D human motion analysis. On the SegInW zero-shot benchmark, the combination attains 48.7 mean AP.

What carries the argument

The pipeline that uses bounding boxes from the open-set detector as spatial prompts to guide the promptable segmenter.
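The carrying mechanism is thin enough to sketch directly: detector outputs become spatial prompts, with no retraining of either model. The following is an illustrative stand-in, not the authors' implementation; `toy_detect` and `toy_segment` are hypothetical placeholders for Grounding DINO and SAM, and `box_threshold` is an assumed filtering parameter.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

@dataclass
class Detection:
    box: Box
    label: str
    score: float

def grounded_segment(
    image,
    text_prompt: str,
    detect: Callable[[object, str], List[Detection]],
    segment: Callable[[object, Box], object],
    box_threshold: float = 0.35,
):
    """Text -> boxes -> masks: each surviving detector box is fed to the
    segmenter as a spatial prompt, without training the combined system."""
    detections = [d for d in detect(image, text_prompt) if d.score >= box_threshold]
    return [(d.label, d.score, segment(image, d.box)) for d in detections]

# Toy stand-ins for the two pre-trained models.
def toy_detect(image, text):
    return [Detection((10, 10, 50, 50), "cat", 0.9),
            Detection((0, 0, 5, 5), "cat", 0.1)]   # low confidence, filtered out

def toy_segment(image, box):
    x0, y0, x1, y1 = box
    return {"area": (x1 - x0) * (y1 - y0)}          # placeholder "mask"

results = grounded_segment(None, "cat", toy_detect, toy_segment)
print(results)  # only the high-confidence detection survives the threshold
```

The point of the sketch is that the interface is purely data-shaped: any detector that emits scored boxes and any segmenter that accepts box prompts can be swapped in.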

If this is right

  • Automatic annotation pipelines become possible using only input images and added captioning models.
  • Controllable image editing is enabled by linking with diffusion models.
  • Promptable 3D human motion analysis is supported through integration with specialized motion models.
  • High performance is achieved on open-vocabulary segmentation tasks in zero-shot settings without fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model composition like this could reduce the reliance on training separate systems for each visual task.
  • It implies that compatibility in prompt formats between models is key to seamless assembly.
  • Extensions to other modalities or more complex tasks might be feasible by adding appropriate models.
  • The zero-shot performance suggests potential for broader applications in real-world scenarios where labeled data is scarce.

Load-bearing premise

The bounding box proposals from the open-set detector are sufficiently accurate and compatible to directly guide the segmenter without requiring refinement or additional training of the combined system.
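One way to probe this premise is to quantify box quality with intersection-over-union (IoU), since a poorly localized box prompt constrains the segmenter to the wrong region. A self-contained sketch; the boxes below are illustrative values, not figures from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

# A slightly shifted detector box still covers the true extent heavily...
print(iou((10, 10, 50, 50), (12, 12, 52, 52)))  # ≈ 0.82
# ...while a badly localized box gives the segmenter little to work with.
print(iou((10, 10, 50, 50), (40, 40, 80, 80)))  # ≈ 0.03
```

An error-propagation analysis of this kind (mask quality as a function of prompt-box IoU) is exactly what would test whether the premise holds without refinement.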

What would settle it

A demonstration that the combined system fails to produce accurate segmentations for text-described objects that the detector correctly identifies, or that performance does not exceed what the individual models achieve separately.

read the original abstract

We introduce Grounded SAM, which uses Grounding DINO as an open-set object detector to combine with the segment anything model (SAM). This integration enables the detection and segmentation of any regions based on arbitrary text inputs and opens a door to connecting various vision models. As shown in Fig.1, a wide range of vision tasks can be achieved by using the versatile Grounded SAM pipeline. For example, an automatic annotation pipeline based solely on input images can be realized by incorporating models such as BLIP and Recognize Anything. Additionally, incorporating Stable-Diffusion allows for controllable image editing, while the integration of OSX facilitates promptable 3D human motion analysis. Grounded SAM also shows superior performance on open-vocabulary benchmarks, achieving 48.7 mean AP on SegInW (Segmentation in the wild) zero-shot benchmark with the combination of Grounding DINO-Base and SAM-Huge models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Grounded SAM, an assembly of Grounding DINO (open-set detector) with SAM (segmentation model) to enable text-prompted detection and segmentation of arbitrary regions in open-world settings. It illustrates versatility through integrations with models such as BLIP for automatic annotation, Stable Diffusion for controllable editing, and OSX for 3D motion analysis, and reports a zero-shot result of 48.7 mean AP on the SegInW benchmark using the Grounding DINO-Base + SAM-Huge combination.

Significance. If the direct interface between detector outputs and SAM prompts holds under open-vocabulary conditions, the work provides a practical, training-free template for composing existing foundation models into more capable systems. This could lower barriers for open-world vision applications and encourage further model-assembly research; the reported SegInW score, once properly documented, would serve as a useful reference point for zero-shot segmentation performance.

major comments (2)
  1. [Abstract / Experiments] The headline 48.7 mAP on SegInW is reported without any description of the evaluation protocol, comparison baselines, ablations on prompt quality or model-size variants, or error-propagation analysis. The central empirical claim is therefore unsupported by verifiable evidence.
  2. [Method] The assumption that Grounding DINO bounding boxes and labels can serve directly as drop-in prompts for SAM is presented without specifying the exact prompt construction (e.g., box-to-point conversion, label text formatting), any post-processing, or robustness measures against localization noise or label errors. This interface is load-bearing for the claimed compatibility and performance.
minor comments (1)
  1. [Figure 1] Figure 1 caption and surrounding text could more explicitly label the data flow arrows between Grounding DINO and SAM to clarify the interface for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each major point below and will revise the manuscript to improve clarity and verifiability of the presented results and method.

read point-by-point responses
  1. Referee: [Abstract / Experiments] The headline 48.7 mAP on SegInW is reported without any description of the evaluation protocol, comparison baselines, ablations on prompt quality or model-size variants, or error-propagation analysis. The central empirical claim is therefore unsupported by verifiable evidence.

    Authors: We agree that the abstract presents the 48.7 mAP result in a concise manner without accompanying details. The Experiments section of the manuscript describes the zero-shot evaluation on SegInW using the Grounding DINO-Base + SAM-Huge combination, but to fully address the concern we will revise both the abstract and Experiments section. Revisions will include a brief statement of the evaluation protocol in the abstract, explicit description of how text prompts are derived from the benchmark, comparison to relevant zero-shot baselines, ablations across model-size variants and prompt strategies, and a short analysis of error propagation from detection to segmentation outputs. These additions will make the central claim fully supported by documented evidence. revision: yes

  2. Referee: [Method] The assumption that Grounding DINO bounding boxes and labels can serve directly as drop-in prompts for SAM is presented without specifying the exact prompt construction (e.g., box-to-point conversion, label text formatting), any post-processing, or robustness measures against localization noise or label errors. This interface is load-bearing for the claimed compatibility and performance.

    Authors: We concur that the Method section would benefit from greater specificity on the detector-to-segmenter interface. The current description focuses on the overall pipeline; we will expand it to detail prompt construction, including conversion of bounding boxes to center-point prompts (or direct box prompts when supported by SAM), formatting of class labels into text prompts, and any filtering or post-processing steps such as confidence thresholding. We will also add discussion of robustness, noting that SAM's promptable design tolerates moderate localization noise and that Grounding DINO's open-set training reduces label errors, together with a brief error-propagation analysis. These clarifications will be added without changing the core training-free assembly approach. revision: yes
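The interface steps the rebuttal promises to document (confidence thresholding, conversion of boxes to center-point prompts, label pairing) can be sketched as follows. The function name and defaults are hypothetical, introduced only to make the described steps concrete, not the paper's actual API.

```python
def box_to_prompts(boxes, scores, labels, score_threshold=0.3, use_center_points=False):
    """Filter detections and convert them to segmenter-style prompts.

    Returns either box prompts (x0, y0, x1, y1) or center-point prompts
    (x, y), each paired with its label text, after confidence thresholding.
    """
    prompts = []
    for box, score, label in zip(boxes, scores, labels):
        if score < score_threshold:
            continue  # drop low-confidence detections before they reach the segmenter
        if use_center_points:
            x0, y0, x1, y1 = box
            prompt = ((x0 + x1) / 2, (y0 + y1) / 2)  # box center as a point prompt
        else:
            prompt = box  # pass the box through directly
        prompts.append((label, prompt))
    return prompts

boxes = [(10, 10, 50, 50), (0, 0, 4, 4)]
scores = [0.9, 0.1]
labels = ["dog", "dog"]
print(box_to_prompts(boxes, scores, labels, use_center_points=True))
# [('dog', (30.0, 30.0))]
```

Whether box prompts or derived center points are used is exactly the kind of detail the referee asks the Method section to pin down.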

Circularity Check

0 steps flagged

No circularity: empirical model assembly with no derivations or self-referential predictions

full rationale

The paper presents Grounded SAM as a pipeline that assembles existing pre-trained models (Grounding DINO for detection, SAM for segmentation, plus optional models like BLIP or Stable Diffusion) to enable text-prompted open-world tasks. No equations, parameter fitting, or derivations are described; the 48.7 mAP on SegInW is reported as an empirical benchmark result for the Base+Huge combination. All load-bearing elements are external model capabilities rather than internally derived quantities that reduce to the paper's own inputs by construction. This is a standard non-circular engineering assembly.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests on the interoperability of two pre-trained models and the validity of their individual capabilities rather than new theoretical elements.

axioms (1)
  • domain assumption: Grounding DINO outputs can be used directly as effective prompts for SAM without compatibility issues or performance degradation.
    This is the core premise of the Grounded SAM integration described in the abstract.

pith-pipeline@v0.9.0 · 5504 in / 1267 out tokens · 59225 ms · 2026-05-11T06:15:23.620937+00:00 · methodology

discussion (0)


Forward citations

Cited by 49 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles

    cs.CY 2026-05 unverdicted novelty 7.0

    A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.

  2. Local Conformal Calibration of Dynamics Uncertainty from Semantic Images

    cs.RO 2026-05 unverdicted novelty 7.0

    OCULAR calibrates dynamics uncertainty using perception from similar environments to give guaranteed prediction regions for unseen test conditions.

  3. Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.

  4. EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras

    cs.CV 2026-05 unverdicted novelty 7.0

    EgoEV-HandPose uses stereo event cameras and a bird's-eye-view fusion module to achieve 30.54 mm MPJPE and 86.87% gesture accuracy on a new large-scale egocentric dataset, outperforming prior RGB and event methods esp...

  5. Is Your Driving World Model an All-Around Player?

    cs.CV 2026-05 unverdicted novelty 7.0

    WorldLens benchmark reveals no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, with contributions of a 26K human-annotated dataset and a distilled vision-language evaluator.

  6. OpenSGA: Efficient 3D Scene Graph Alignment in the Open World

    cs.CV 2026-05 conditional novelty 7.0

    OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene g...

  7. From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

    cs.CV 2026-05 unverdicted novelty 7.0

    CAFE benchmark reveals that promptable segmentation models often produce correct masks for misleading prompts, showing a gap between localization accuracy and true concept understanding.

  8. ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring

    cs.CV 2026-05 unverdicted novelty 7.0

    ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.

  9. Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

    cs.CV 2026-05 unverdicted novelty 7.0

    Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.

  10. Anny-Fit: All-Age Human Mesh Recovery

    cs.CV 2026-05 unverdicted novelty 7.0

    Anny-Fit jointly optimizes all-age multi-person 3D human meshes in camera coordinates using complementary signals from off-the-shelf depth, segmentation, keypoint, and VLM networks, yielding better reprojection, depth...

  11. Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion

    cs.CV 2026-04 unverdicted novelty 7.0

    3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.

  12. DockAnywhere: Data-Efficient Visuomotor Policy Learning for Mobile Manipulation via Novel Demonstration Generation

    cs.RO 2026-04 unverdicted novelty 7.0

    DockAnywhere lifts single demonstrations to diverse docking points via structure-preserving augmentation and point-cloud spatial editing to improve viewpoint generalization in visuomotor policies for mobile manipulation.

  13. ROSE: Retrieval-Oriented Segmentation Enhancement

    cs.CV 2026-04 unverdicted novelty 7.0

    ROSE is a retrieval-augmented plug-in that improves MLLM segmentation on novel and emerging entities by fetching web text and images and deciding when to use them.

  14. AmodalSVG: Amodal Image Vectorization via Semantic Layer Peeling

    cs.CV 2026-04 unverdicted novelty 7.0

    AmodalSVG produces semantically separate and geometrically complete SVG layers from natural images by using VLM-guided semantic layer peeling for amodal completion followed by adaptive vectorization.

  15. VLN-NF: Feasibility-Aware Vision-and-Language Navigation with False-Premise Instructions

    cs.RO 2026-04 unverdicted novelty 7.0

    VLN-NF benchmark adds false-premise instructions to VLN and ROAM hybrid agent improves REV-SPL by combining room navigation with evidence-gathering exploration.

  16. YUV20K: A Complexity-Driven Benchmark and Trajectory-Aware Alignment Model for Video Camouflaged Object Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    YUV20K is a complexity-driven VCOD benchmark with 24k annotated frames, paired with a model using Motion Feature Stabilization via semantic primitives and Trajectory-Aware Alignment via deformable sampling that outper...

  17. Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.

  18. Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    ADM-GS decomposes static background appearance into traversal-invariant material and traversal-dependent illumination via a frequency-separated neural light field, yielding +0.98 dB PSNR gains and better cross-travers...

  19. Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse

    cs.CV 2026-04 unverdicted novelty 7.0

    Chorus accelerates video DiT serving up to 45% via inter-request caching reuse in a three-stage denoising strategy with token-guided attention amplification.

  20. 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

    cs.CV 2026-04 unverdicted novelty 7.0

    3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI a...

  21. Training a Student Expert via Semi-Supervised Foundation Model Distillation

    cs.CV 2026-04 conditional novelty 7.0

    A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.

  22. Generalized Small Object Detection: A Point-Prompted Paradigm and Benchmark

    cs.CV 2026-04 unverdicted novelty 7.0

    TinySet-9M dataset and DEAL point-prompted framework deliver 31.4% relative AP75 gain over supervised baselines for small object detection with one click at inference and generalization to unseen categories.

  23. Relit-LiVE: Relight Video by Jointly Learning Environment Video

    cs.CV 2026-05 unverdicted novelty 6.0

    Relit-LiVE jointly predicts relit videos and viewpoint-aligned environment maps inside a single diffusion process to achieve physically consistent video relighting without camera pose input.

  24. Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation

    cs.RO 2026-05 unverdicted novelty 6.0

    PLMD applies a denoising diffusion model to predict labels for unknown map regions, allowing goal localization in unexplored environments by substituting completed labels into existing navigation pipelines.

  25. Approaching human parity in the quality of automated organoid image segmentation

    cs.CV 2026-05 conditional novelty 6.0

    A composite SAM-based method segments organoid images with accuracy matching or approaching inter-observer variability among human annotators.

  26. Sparse-View 3D Gaussian Splatting in the Wild

    cs.CV 2026-04 unverdicted novelty 6.0

    A new sparse-view 3D Gaussian splatting method for unconstrained scenes with distractors combines diffusion-based reference-guided refinement and sparsity-aware Gaussian replication to achieve better rendering quality.

  27. WildLIFT: Lifting monocular drone video to 3D for species-agnostic wildlife monitoring

    cs.CV 2026-04 unverdicted novelty 6.0

    WildLIFT lifts monocular drone video to 3D for species-agnostic wildlife detection, tracking, and viewpoint analysis by integrating scene geometry with open-vocabulary segmentation.

  28. PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics

    cs.CV 2026-04 unverdicted novelty 6.0

    PhysLayer is a framework that decomposes images into depth layers, simulates physics with depth awareness, and synthesizes videos guided by language for more plausible animations.

  29. Wiggle and Go! System Identification for Zero-Shot Dynamic Rope Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    Wiggle and Go! uses system identification from rope motion observations to predict parameters that enable zero-shot goal-conditioned dynamic manipulation, achieving 3.55 cm accuracy on 3D target striking versus 15.34 ...

  30. Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.

  31. SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

    cs.CV 2026-04 unverdicted novelty 6.0

    SpaCeFormer delivers 11.1 zero-shot mAP on ScanNet200 (2.8x prior proposal-free best) and runs 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines by using spatial window attention and Morton-curve seriali...

  32. AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion

    cs.CV 2026-04 unverdicted novelty 6.0

    A two-stage method synthesizes multi-view 2D motion data from internet video keypoints and trains a camera-conditioned diffusion model to recover globally consistent 3D human motion and HOI in world space.

  33. DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior

    cs.CV 2026-04 unverdicted novelty 6.0

    DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.

  34. OVAL: Open-Vocabulary Augmented Memory Model for Lifelong Object Goal Navigation

    cs.RO 2026-04 unverdicted novelty 6.0

    OVAL introduces an open-vocabulary memory model with structured descriptors and multi-value frontier scoring to enable efficient lifelong object goal navigation in unseen settings.

  35. Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions

    cs.CV 2026-04 unverdicted novelty 6.0

    The model uses dense visuo-tactile feature interactions and material-diversity pairing on expanded datasets to generate tactile saliency maps for material segmentation, outperforming prior global-alignment methods.

  36. Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting

    cs.CV 2026-04 unverdicted novelty 6.0

    A scene-agnostic object codebook learned via unsupervised object-centric learning provides consistent identity-anchored representations for 3D Gaussians across multiple scenes.

  37. ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration

    cs.RO 2026-04 unverdicted novelty 6.0

    ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.

  38. Visually-grounded Humanoid Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.

  39. VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

    cs.CV 2026-05 unverdicted novelty 5.0

    VL-SAM-v3 retrieves visual prototypes from memory to generate sparse spatial and dense contextual priors that refine detection prompts, yielding gains on rare categories in LVIS for both open-vocabulary and open-ended...

  40. VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

    cs.CV 2026-05 unverdicted novelty 5.0

    VL-SAM-v3 augments open-world object detection with retrieval from a visual memory bank to generate instance-level spatial and class-aware contextual priors that improve performance on rare categories in zero-shot LVIS tests.

  41. VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

    cs.CV 2026-05 unverdicted novelty 5.0

    VL-SAM-v3 improves open-world object detection on LVIS by retrieving visual prototypes from a memory bank to generate sparse spatial and dense contextual priors that are fused into detection prompts.

  42. CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers

    cs.CV 2026-04 unverdicted novelty 5.0

    CreatiParser decomposes raster graphic designs into editable text, background, and sticker layers via a hybrid VLM-diffusion model with ParserReward and GRPO optimization, reporting 23.7% average metric gains on Parse...

  43. LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment

    cs.RO 2026-04 unverdicted novelty 5.0

    LIDEA bridges the human-robot embodiment gap via implicit feature distillation in 2D and explicit geometry alignment in 3D, enabling human data to substitute up to 80% of robot demonstrations with improved out-of-dist...

  44. MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    MV3DIS uses 3D-guided mask matching and depth consistency to produce more consistent multi-view 2D masks that refine into accurate zero-shot 3D instances.

  45. CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment

    cs.RO 2026-04 unverdicted novelty 5.0

    CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...

  46. Visual Prompt Based Reasoning for Offroad Mapping using Multimodal LLMs

    cs.RO 2026-04 unverdicted novelty 5.0

    A zero-shot pipeline uses SAM2 segmentation plus numeric-label prompting of a VLM to identify drivable off-road areas and enable navigation without task-specific training or datasets.

  47. Empowering NPC Dialogue with Environmental Context Using LLMs and Panoramic Images

    cs.GR 2026-04 unverdicted novelty 4.0

    NPCs gain spatial awareness via panoramic images turned into JSON scene data for LLMs, enabling dynamic references to nearby objects and improving player preference in user studies.

  48. Selective Aggregation of Attention Maps Improves Diffusion-Based Visual Interpretation

    cs.CV 2026-04 unverdicted novelty 4.0

    Selective aggregation of cross-attention maps from the most relevant heads in diffusion-based T2I models yields higher mean IoU for visual interpretation than standard aggregation methods like DAAM.

  49. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 47 Pith papers · 5 internal anchors

  1. [1]

    Blended Latent Diffusion, Jun 2022

    Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended Latent Diffusion, Jun 2022. 2

  2. [2]

    Blended Diffusion for Text-driven Editing of Natural Images

    Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended Diffusion for Text-driven Editing of Natural Images. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Sep 2022. 2

  3. [3]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond, 2023

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond, 2023. 2

  4. [4]

    Smpler-x: Scaling up expressive human pose and shape estimation

    Zhongang Cai, Wanqi Yin, Ailing Zeng, CHEN WEI, SUN Qingping, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, et al. Smpler-x: Scaling up expressive human pose and shape estimation. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 2

  5. [5]

    Diffsheg: A diffusion-based approach for real-time speech-driven holistic 3d expression and gesture generation

    Junming Chen, Yunfei Liu, Jianan Wang, Ailing Zeng, Yu Li, and Qifeng Chen. Diffsheg: A diffusion-based approach for real-time speech-driven holistic 3d expression and gesture generation. arXiv preprint arXiv:2401.04747, 2024. 2

  6. [6]

    HumanMAC: Masked Motion Completion for Human Motion Prediction

    Ling-Hao Chen, Jiawei Zhang, Yewen Li, Yiren Pang, Xiaobo Xia, and Tongliang Liu. HumanMAC: Masked Motion Completion for Human Motion Prediction. 2023. 2

  7. [7]

    Pix2seq: A Language Modeling Framework for Object Detection

    Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A Language Modeling Framework for Object Detection. arXiv preprint arXiv:2109.10852, 2021. 2

  8. [8]

    Mask2Former for Video Instance Segmentation

    Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, and Alexander G. Schwing. Mask2Former for Video Instance Segmentation. 2022. 2

  9. [9]

    Per-Pixel Classification is Not All You Need for Semantic Segmentation

    Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-Pixel Classification is Not All You Need for Semantic Segmentation. 2021. 2

  10. [10]

    Tracking Anything with Decoupled Video Segmentation

    Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, and Joon-Young Lee. Tracking Anything with Decoupled Video Segmentation. In ICCV, 2023. 7

  11. [11]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 2

  12. [12]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, 2023. 2

  13. [13]

    GPT-3: Its nature, scope, limits, and consequences

    Luciano Floridi and Massimo Chiriatti. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020. 2

  14. [14]

    Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

    Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. 2

  15. [15]

    ChatGPT is not all you need

    Roberto Gozalo-Brizuela and Eduardo C Garrido-Merchan. ChatGPT is not all you need. A State of the Art Review of large Generative AI models. arXiv preprint arXiv:2301.04655, 2023.

  16. [16]

    You Only Segment Once: Towards Real-Time Panoptic Segmentation, 2023

    Jie Hu, Linyan Huang, Tianhe Ren, Shengchuan Zhang, Rongrong Ji, and Liujuan Cao. You Only Segment Once: Towards Real-Time Panoptic Segmentation, 2023. 2

  17. [17]

    Open-Set Image Tagging with Multi-Grained Text Supervision, 2023

    Xinyu Huang, Yi-Jie Huang, Youcai Zhang, Weiwei Tian, Rui Feng, Yuejie Zhang, Yanchun Xie, Yaqian Li, and Lei Zhang. Open-Set Image Tagging with Multi-Grained Text Supervision, 2023. 2

  18. [18]

    Tag2Text: Guiding Vision-Language Model via Image Tagging, 2023

    Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2Text: Guiding Vision-Language Model via Image Tagging, 2023. 2, 4

  19. [19]

    DETRs with Hybrid Matching

    Ding Jia, Yuhui Yuan, Haodi He, Xiaopei Wu, Haojun Yu, Weihong Lin, Lei Sun, Chao Zhang, and Han Hu. DETRs with Hybrid Matching. arXiv preprint arXiv:2207.13080, 2022.

  20. [20]

    T-Rex: Counting by Visual Prompting, 2023

    Qing Jiang, Feng Li, Tianhe Ren, Shilong Liu, Zhaoyang Zeng, Kent Yu, and Lei Zhang. T-Rex: Counting by Visual Prompting, 2023. 2

  21. [21]

    Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code

    Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv preprint arXiv:2310.01506, 2023. 2

  22. [22]

    HumanSD: A native skeleton-guided diffusion model for human image generation

    Xuan Ju, Ailing Zeng, Chenchen Zhao, Jianan Wang, Lei Zhang, and Qiang Xu. HumanSD: A native skeleton-guided diffusion model for human image generation. 2023. 2

  23. [23]

    Scaling up GANs for Text-to-Image Synthesis

    Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up GANs for Text-to-Image Synthesis. 2

  24. [24]

    Segment Anything in High Quality

    Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Segment Anything in High Quality. arXiv:2306.01567, 2023. 6, 7

  25. [25]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment Anything. arXiv preprint arXiv:2304.02643, 2023. 1, 2, 3, 5

  26. [26]

    Visual In-Context Prompting, 2023

    Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, and Jianfeng Gao. Visual In-Context Prompting, 2023. 2

  27. [27]

    DN-DETR: Accelerate DETR Training by Introducing Query DeNoising

    Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. In Computer Vision and Pattern Recognition (CVPR), 2022. 2

  28. [28]

    Semantic-SAM: Segment and Recognize Anything at Any Granularity, 2023

    Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-SAM: Segment and Recognize Anything at Any Granularity, 2023. 2

  29. [29]

    Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation

    Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum. Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2023. 2

  30. [30]

    DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting

    Hongyang Li, Hao Zhang, Zhaoyang Zeng, Shilong Liu, Feng Li, Tianhe Ren, and Lei Zhang. DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6684–6693, October 2023. 2

  31. [31]

    BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022. 2, 3, 4

  32. [32]

    Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset

    Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 2

  33. [33]

    One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer

    Jing Lin, Ailing Zeng, Haoqian Wang, Lei Zhang, and Yu Li. One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 2, 3, 5, 6

  34. [34]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. arXiv preprint arXiv:2304.08485, 2023.

  35. [35]

    LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents, 2023

    Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, and Chunyuan Li. LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents, 2023. 2

  36. [36]

    DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR

    Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. In International Conference on Learning Representations, 2022. 2

  37. [37]

    DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding, 2022

    Shilong Liu, Yaoyuan Liang, Feng Li, Shijia Huang, Hao Zhang, Hang Su, Jun Zhu, and Lei Zhang. DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding, 2022. 2

  38. [38]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv preprint arXiv:2303.05499, 2023. 1, 2, 3

  39. [39]

    HumanTOMATO: Text-aligned Whole-body Motion Generation

    Shunlin Lu, Ling-Hao Chen, Ailing Zeng, Jing Lin, Ruimao Zhang, Lei Zhang, and Heung-Yeung Shum. HumanTOMATO: Text-aligned Whole-body Motion Generation. arXiv preprint arXiv:2310.12978, 2023. 2

  40. [40]

    Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

    Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models. 2

  41. [41]

    Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation, 2020

    Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Liujuan Cao, Chenglin Wu, Cheng Deng, and Rongrong Ji. Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation, 2020. 2

  42. [42]

    SDEdit: Image Synthesis and Editing with Stochastic Differential Equations, Aug 2021

    Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun- Yan Zhu, and Stefano Ermon. SDEdit: Image Synthesis and Editing with Stochastic Differential Equations, Aug 2021. 2

  43. [43]

    Conditional DETR for Fast Training Convergence

    Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional DETR for Fast Training Convergence. arXiv preprint arXiv:2108.06152, 2021. 2

  44. [44]

    T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models, Feb 2023

    Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models, Feb 2023. 2

  45. [45]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. 2

  46. [46]

    GPT-4 Technical Report, 2023

    OpenAI. GPT-4 Technical Report, 2023. 3

  47. [47]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 2

  48. [48]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. 2

  49. [49]

    Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017. 2

  50. [50]

    detrex: Benchmarking Detection Transformers

    Tianhe Ren, Shilong Liu, Feng Li, Hao Zhang, Ailing Zeng, Jie Yang, Xingyu Liao, Ding Jia, Hongyang Li, He Cao, et al. detrex: Benchmarking Detection Transformers. arXiv preprint arXiv:2306.07265, 2023. 2

  51. [51]

    A Strong and Reproducible Object Detector with Only Public Datasets, 2023

    Tianhe Ren, Jianwei Yang, Shilong Liu, Ailing Zeng, Feng Li, Hao Zhang, Hongyang Li, Zhaoyang Zeng, and Lei Zhang. A Strong and Reproducible Object Detector with Only Public Datasets, 2023. 2

  52. [52]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 2, 3, 7

  53. [53]

    DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, Aug 2022

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, Aug 2022. 2

  54. [54]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. 2

  55. [55]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv preprint arXiv:2303.17580, 2023. 2, 3

  56. [56]

    Resolution-robust Large Mask Inpainting with Fourier Convolutions

    Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lem- pitsky. Resolution-robust Large Mask Inpainting with Fourier Convolutions. arXiv preprint arXiv:2109.07161, 2021. 7

  57. [57]

    LaMDA: Language Models for Dialog Applications

    Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language Models for Dialog Applications. arXiv preprint arXiv:2201.08239, 2022. 2

  58. [58]

    V3Det: Vast Vocabulary Visual Detection Dataset

    Jiaqi Wang, Pan Zhang, Tao Chu, Yuhang Cao, Yujie Zhou, Tong Wu, Bin Wang, Conghui He, and Dahua Lin. V3Det: Vast Vocabulary Visual Detection Dataset. arXiv preprint arXiv:2304.03752, 2023. 4

  59. [59]

    OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

    Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. In ICML, 2022. 2

  60. [60]

    CogVLM: Visual Expert for Pretrained Language Models, 2023

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. CogVLM: Visual Expert for Pretrained Language Models, 2023. 2

  61. [61]

    PhysHOI: Physics-based Imitation of Dynamic Human-Object Interaction

    Yinhuai Wang, Jing Lin, Ailing Zeng, Zhengyi Luo, Jian Zhang, and Lei Zhang. PhysHOI: Physics-based Imitation of Dynamic Human-Object Interaction. arXiv preprint arXiv:2312.04393, 2023. 2

  62. [62]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv preprint arXiv:2303.04671, 2023. 2, 3

  63. [63]

    EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

    Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xi- ang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest Iandola, et al. EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything. arXiv preprint arXiv:2312.00863, 2023. 6

  64. [64]

    Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

    Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models. arXiv preprint arXiv:2303.04803, 2023. 7

  65. [65]

    Side Adapter Network for Open-Vocabulary Semantic Segmentation, 2023

    Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side Adapter Network for Open-Vocabulary Semantic Segmentation, 2023. 7

  66. [66]

    Universal Instance Perception as Object Discovery and Retrieval

    Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Zehuan Yuan, Ping Luo, and Huchuan Lu. Universal Instance Perception as Object Discovery and Retrieval. In CVPR, 2023. 2, 7

  67. [67]

    Bridging the Gap Between End-to-end and Non-End-to-end Multi-Object Tracking, 2023

    Feng Yan, Weixin Luo, Yujie Zhong, Yiyang Gan, and Lin Ma. Bridging the Gap Between End-to-end and Non-End-to-end Multi-Object Tracking, 2023. 2

  68. [68]

    Paint by Example: Exemplar-based Image Editing with Diffusion Models

    Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by Example: Exemplar-based Image Editing with Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023. 7

  69. [69]

    Boosting human-object interaction detection with text-to-image diffusion model

    Jie Yang, Bingliang Li, Fengyu Yang, Ailing Zeng, Lei Zhang, and Ruimao Zhang. Boosting human-object interaction detection with text-to-image diffusion model. arXiv preprint arXiv:2305.12252, 2023. 2

  70. [70]

    Semantic human parsing via scalable semantic transfer over multiple label domains

    Jie Yang, Chaoqun Wang, Zhen Li, Junle Wang, and Ruimao Zhang. Semantic human parsing via scalable semantic transfer over multiple label domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19424–19433, 2023. 2

  71. [71]

    Neural Interactive Keypoint Detection

    Jie Yang, Ailing Zeng, Feng Li, Shilong Liu, Ruimao Zhang, and Lei Zhang. Neural Interactive Keypoint Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15122–15132, 2023. 2

  72. [72]

    Explicit box detection unifies end-to-end multi-person pose estimation

    Jie Yang, Ailing Zeng, Shilong Liu, Feng Li, Ruimao Zhang, and Lei Zhang. Explicit box detection unifies end-to-end multi-person pose estimation. In International Conference on Learning Representations, 2023. 2

  73. [73]

    Unipose: Detecting any keypoints

    Jie Yang, Ailing Zeng, Ruimao Zhang, and Lei Zhang. Unipose: Detecting any keypoints. arXiv preprint arXiv:2310.08530, 2023. 2

  74. [74]

    Effective whole-body pose estimation with two-stages distillation

    Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Effective whole-body pose estimation with two-stages distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4210–4220, 2023. 2

  75. [75]

    Retrieval-Augmented Multimodal Language Modeling

    Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-Tau Yih. Retrieval-Augmented Multimodal Language Modeling. 2

  76. [76]

    Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

    Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv preprint arXiv:2306.14289, 2023. 6

  77. [77]

    DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

    Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, 2022. 2

  78. [78]

    MP-Former: Mask-Piloted Transformer for Image Segmentation

    Hao Zhang, Feng Li, Huaizhe Xu, Shijia Huang, Shilong Liu, Lionel M Ni, and Lei Zhang. MP-Former: Mask-Piloted Transformer for Image Segmentation. arXiv preprint arXiv:2303.07336, 2023. 2

  79. [79]

    A Simple Framework for Open-Vocabulary Segmentation and Detection

    Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, and Lei Zhang. A Simple Framework for Open-Vocabulary Segmentation and Detection. arXiv preprint arXiv:2303.08131, 2023. 2, 3, 7

  80. [80]

    LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models, 2023

    Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, and Jianwei Yang. LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models, 2023. 2
