Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Pith reviewed 2026-05-11 06:15 UTC · model grok-4.3
The pith
Combining an open-set detector with a segment-anything model enables text-prompted detection and segmentation of arbitrary regions and opens a path to other vision tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By using an open-set detector to detect objects from text and feeding its outputs as prompts to the segment-anything model, the system achieves detection and segmentation of any regions based on arbitrary text inputs. This opens a door to connecting various vision models for diverse tasks, including automatic annotation with captioning models, controllable editing with diffusion models, and promptable 3D human motion analysis. On the SegInW zero-shot benchmark, the combination attains 48.7 mean AP.
What carries the argument
The pipeline that uses bounding boxes from the open-set detector as spatial prompts to guide the promptable segmenter.
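The interface is simple enough to sketch in Python. Here, detect_boxes is a hypothetical stand-in for Grounding DINO inference (not the authors' exact interface), while SamPredictor and sam_model_registry are the real entry points of the segment_anything package; the checkpoint name is illustrative.

    import numpy as np
    from segment_anything import SamPredictor, sam_model_registry

    def detect_boxes(image_rgb: np.ndarray, text: str) -> np.ndarray:
        """Hypothetical wrapper around Grounding DINO: returns an (N, 4)
        array of XYXY pixel boxes for regions matching the text prompt."""
        raise NotImplementedError("stand-in for open-set detector inference")

    def grounded_sam(image_rgb: np.ndarray, text: str, predictor: SamPredictor):
        """Text -> boxes -> masks: detector boxes become spatial prompts
        for the promptable segmenter, with no joint training."""
        boxes = detect_boxes(image_rgb, text)      # (N, 4) XYXY pixel boxes
        predictor.set_image(image_rgb)             # embed the image once
        masks = []
        for box in boxes:
            # SAM accepts an XYXY box directly as a spatial prompt
            mask, _, _ = predictor.predict(box=box, multimask_output=False)
            masks.append(mask[0])                  # (H, W) boolean mask
        return boxes, masks

    # Usage (checkpoint name is illustrative):
    # sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    # boxes, masks = grounded_sam(image, "a running dog", SamPredictor(sam))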
If this is right
- Automatic annotation pipelines become possible using only input images and added captioning models (sketched after this list).
- Controllable image editing is enabled by linking with diffusion models.
- Promptable 3D human motion analysis is supported through integration with specialized motion models.
- High performance is achieved on open-vocabulary segmentation tasks in zero-shot settings without fine-tuning.
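The first of these consequences is concrete enough to sketch. The caption_tags callable below is an assumed wrapper around a captioning or tagging model such as BLIP or Recognize Anything, and grounded_sam is the pipeline from the earlier sketch (with its predictor argument pre-bound); neither is the authors' exact interface.

    from typing import Callable
    import numpy as np

    def auto_annotate(image_rgb: np.ndarray,
                      caption_tags: Callable[[np.ndarray], list],
                      grounded_sam: Callable) -> list:
        """Image-only annotation: a captioner/tagger proposes text labels,
        then the detector+segmenter pair localizes and masks each label.
        Both callables are assumed wrappers, not the authors' interfaces."""
        annotations = []
        for tag in caption_tags(image_rgb):   # labels proposed from pixels alone
            boxes, masks = grounded_sam(image_rgb, tag)
            for box, mask in zip(boxes, masks):
                annotations.append({"label": tag, "box_xyxy": box, "mask": mask})
        return annotations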
Where Pith is reading between the lines
- Model composition like this could reduce the reliance on training separate systems for each visual task.
- It implies that compatibility in prompt formats between models is key to seamless assembly.
- Extensions to other modalities or more complex tasks might be feasible by adding appropriate models.
- The zero-shot performance suggests potential for broader applications in real-world scenarios where labeled data is scarce.
Load-bearing premise
The bounding box proposals from the open-set detector are sufficiently accurate and compatible to directly guide the segmenter without requiring refinement or additional training of the combined system.
What would settle it
A demonstration that the combined system fails to produce accurate segmentations for text-described objects that the detector correctly identifies, or that performance does not exceed what the individual models achieve separately.
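One way to operationalize that test, as a hedged sketch: score the assembled pipeline's masks against ground truth and against a detector-only baseline that simply fills each correctly detected box. The helper names are illustrative, not from the paper.

    import numpy as np

    def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
        """IoU between two boolean masks of the same shape."""
        union = np.logical_or(a, b).sum()
        return float(np.logical_and(a, b).sum()) / union if union else 0.0

    def box_as_mask(box_xyxy: np.ndarray, shape) -> np.ndarray:
        """Detector-only baseline: treat the filled box as the 'segmentation'."""
        x0, y0, x1, y1 = box_xyxy.astype(int)
        m = np.zeros(shape, dtype=bool)
        m[y0:y1, x0:x1] = True
        return m

    # For each text-described object the detector localizes correctly,
    # the paper's claim predicts
    #     mask_iou(sam_mask, gt) > mask_iou(box_as_mask(box, gt.shape), gt)
    # Systematic violations of this inequality would be the refuting evidence.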
read the original abstract
We introduce Grounded SAM, which uses Grounding DINO as an open-set object detector to combine with the segment anything model (SAM). This integration enables the detection and segmentation of any regions based on arbitrary text inputs and opens a door to connecting various vision models. As shown in Fig.1, a wide range of vision tasks can be achieved by using the versatile Grounded SAM pipeline. For example, an automatic annotation pipeline based solely on input images can be realized by incorporating models such as BLIP and Recognize Anything. Additionally, incorporating Stable-Diffusion allows for controllable image editing, while the integration of OSX facilitates promptable 3D human motion analysis. Grounded SAM also shows superior performance on open-vocabulary benchmarks, achieving 48.7 mean AP on SegInW (Segmentation in the wild) zero-shot benchmark with the combination of Grounding DINO-Base and SAM-Huge models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Grounded SAM, an assembly of Grounding DINO (open-set detector) with SAM (segmentation model) to enable text-prompted detection and segmentation of arbitrary regions in open-world settings. It illustrates versatility through integrations with models such as BLIP for automatic annotation, Stable Diffusion for controllable editing, and OSX for 3D motion analysis, and reports a zero-shot result of 48.7 mean AP on the SegInW benchmark using the Grounding DINO-Base + SAM-Huge combination.
Significance. If the direct interface between detector outputs and SAM prompts holds under open-vocabulary conditions, the work provides a practical, training-free template for composing existing foundation models into more capable systems. This could lower barriers for open-world vision applications and encourage further model-assembly research; the reported SegInW score, once properly documented, would serve as a useful reference point for zero-shot segmentation performance.
major comments (2)
- [Abstract / Experiments] The headline 48.7 mAP on SegInW is stated without any description of the evaluation protocol, comparison baselines, ablations on prompt quality or model-size variants, or error-propagation analysis, leaving the central empirical claim unsupported by verifiable evidence (a sketch of the kind of protocol at issue follows these comments).
- [Method] The assumption that Grounding DINO bounding boxes and labels can be used directly as drop-in prompts for SAM is presented without specifying the exact prompt construction (e.g., box-to-point conversion, label text formatting), any post-processing, or robustness measures against localization noise or label errors; this interface assumption is load-bearing for the claimed compatibility and performance.
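For reference, a minimal sketch of the kind of evaluation protocol the first comment asks for, using the standard pycocotools mask-AP machinery; the per-dataset file layout is assumed here, not taken from the paper or the benchmark's official tooling.

    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    def seginw_mean_ap(dataset_specs):
        """dataset_specs: list of (gt_json, pred_json) pairs, one per
        SegInW dataset (paths are placeholders). Returns the mean of
        per-dataset mask AP, i.e. the quantity a '48.7 mean AP' claim
        would need to document."""
        aps = []
        for gt_json, pred_json in dataset_specs:
            coco_gt = COCO(gt_json)               # ground-truth annotations
            coco_dt = coco_gt.loadRes(pred_json)  # pipeline's predicted masks
            ev = COCOeval(coco_gt, coco_dt, iouType="segm")
            ev.evaluate()
            ev.accumulate()
            ev.summarize()
            aps.append(ev.stats[0])               # AP @ IoU 0.50:0.95
        return sum(aps) / len(aps)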
minor comments (1)
- [Figure 1] Figure 1 caption and surrounding text could more explicitly label the data flow arrows between Grounding DINO and SAM to clarify the interface for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We address each major point below and will revise the manuscript to improve clarity and verifiability of the presented results and method.
read point-by-point responses
-
Referee: [Abstract / Experiments] The headline 48.7 mAP on SegInW is stated without any description of the evaluation protocol, comparison baselines, ablations on prompt quality or model-size variants, or error-propagation analysis, leaving the central empirical claim unsupported by verifiable evidence.
Authors: We agree that the abstract presents the 48.7 mAP result in a concise manner without accompanying details. The Experiments section of the manuscript describes the zero-shot evaluation on SegInW using the Grounding DINO-Base + SAM-Huge combination, but to fully address the concern we will revise both the abstract and Experiments section. Revisions will include a brief statement of the evaluation protocol in the abstract, explicit description of how text prompts are derived from the benchmark, comparison to relevant zero-shot baselines, ablations across model-size variants and prompt strategies, and a short analysis of error propagation from detection to segmentation outputs. These additions will make the central claim fully supported by documented evidence. revision: yes
-
Referee: [Method] The assumption that Grounding DINO bounding boxes and labels can be used directly as drop-in prompts for SAM is presented without specifying the exact prompt construction (e.g., box-to-point conversion, label text formatting), any post-processing, or robustness measures against localization noise or label errors; this interface assumption is load-bearing for the claimed compatibility and performance.
Authors: We concur that the Method section would benefit from greater specificity on the detector-to-segmenter interface. The current description focuses on the overall pipeline; we will expand it to detail prompt construction, including conversion of bounding boxes to center-point prompts (or direct box prompts when supported by SAM), formatting of class labels into text prompts, and any filtering or post-processing steps such as confidence thresholding. We will also add discussion of robustness, noting that SAM's promptable design tolerates moderate localization noise and that Grounding DINO's open-set training reduces label errors, together with a brief error-propagation analysis. These clarifications will be added without changing the core training-free assembly approach. revision: yes
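The interface the response describes can be made concrete with a short, hedged sketch: Grounding DINO-style detectors emit normalized cx, cy, w, h boxes, while SAM's box prompt expects XYXY pixel coordinates. The threshold value below is illustrative, not the authors' setting.

    import numpy as np

    def boxes_to_sam_prompts(boxes_cxcywh, scores, image_hw,
                             box_threshold=0.35, use_center_points=False):
        """Confidence filtering plus normalized-cxcywh -> pixel-XYXY
        conversion, with an optional center-point fallback, as outlined
        in the response above. boxes_cxcywh: (N, 4) array; scores: (N,)."""
        h, w = image_hw
        keep = scores >= box_threshold          # drop low-confidence detections
        cx, cy, bw, bh = boxes_cxcywh[keep].T
        if use_center_points:
            return np.stack([cx * w, cy * h], axis=1)   # (M, 2) point prompts
        return np.stack([(cx - bw / 2) * w, (cy - bh / 2) * h,
                         (cx + bw / 2) * w, (cy + bh / 2) * h], axis=1)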
Circularity Check
No circularity: empirical model assembly with no derivations or self-referential predictions
full rationale
The paper presents Grounded SAM as a pipeline that assembles existing pre-trained models (Grounding DINO for detection, SAM for segmentation, plus optional models like BLIP or Stable Diffusion) to enable text-prompted open-world tasks. No equations, parameter fitting, or derivations are described; the 48.7 mAP on SegInW is reported as an empirical benchmark result for the Base+Huge combination. All load-bearing elements are external model capabilities rather than internally derived quantities that reduce to the paper's own inputs by construction. This is a standard non-circular engineering assembly.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Grounding DINO outputs can be used directly as effective prompts for SAM without compatibility issues or performance degradation.
Forward citations
Cited by 47 Pith papers
-
Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles
A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.
-
Local Conformal Calibration of Dynamics Uncertainty from Semantic Images
OCULAR calibrates dynamics uncertainty using perception from similar environments to give guaranteed prediction regions for unseen test conditions.
-
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
-
EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras
EgoEV-HandPose uses stereo event cameras and a bird's-eye-view fusion module to achieve 30.54 mm MPJPE and 86.87% gesture accuracy on a new large-scale egocentric dataset, outperforming prior RGB and event methods esp...
-
Is Your Driving World Model an All-Around Player?
WorldLens benchmark reveals no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, with contributions of a 26K human-annotated dataset and a distilled vision-language evaluator.
-
OpenSGA: Efficient 3D Scene Graph Alignment in the Open World
OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene g...
-
From Pixels to Concepts: Do Segmentation Models Understand What They Segment?
CAFE benchmark reveals that promptable segmentation models often produce correct masks for misleading prompts, showing a gap between localization accuracy and true concept understanding.
-
ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring
ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.
-
Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.
-
Anny-Fit: All-Age Human Mesh Recovery
Anny-Fit jointly optimizes all-age multi-person 3D human meshes in camera coordinates using complementary signals from off-the-shelf depth, segmentation, keypoint, and VLM networks, yielding better reprojection, depth...
-
Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.
-
DockAnywhere: Data-Efficient Visuomotor Policy Learning for Mobile Manipulation via Novel Demonstration Generation
DockAnywhere lifts single demonstrations to diverse docking points via structure-preserving augmentation and point-cloud spatial editing to improve viewpoint generalization in visuomotor policies for mobile manipulation.
-
ROSE: Retrieval-Oriented Segmentation Enhancement
ROSE is a retrieval-augmented plug-in that improves MLLM segmentation on novel and emerging entities by fetching web text and images and deciding when to use them.
-
AmodalSVG: Amodal Image Vectorization via Semantic Layer Peeling
AmodalSVG produces semantically separate and geometrically complete SVG layers from natural images by using VLM-guided semantic layer peeling for amodal completion followed by adaptive vectorization.
-
VLN-NF: Feasibility-Aware Vision-and-Language Navigation with False-Premise Instructions
VLN-NF benchmark adds false-premise instructions to VLN and ROAM hybrid agent improves REV-SPL by combining room navigation with evidence-gathering exploration.
-
YUV20K: A Complexity-Driven Benchmark and Trajectory-Aware Alignment Model for Video Camouflaged Object Detection
YUV20K is a complexity-driven VCOD benchmark with 24k annotated frames, paired with a model using Motion Feature Stabilization via semantic primitives and Trajectory-Aware Alignment via deformable sampling that outper...
-
Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation
Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.
-
Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction
ADM-GS decomposes static background appearance into traversal-invariant material and traversal-dependent illumination via a frequency-separated neural light field, yielding +0.98 dB PSNR gains and better cross-travers...
-
Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse
Chorus accelerates video DiT serving up to 45% via inter-request caching reuse in a three-stage denoising strategy with token-guided attention amplification.
-
3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI a...
-
Training a Student Expert via Semi-Supervised Foundation Model Distillation
A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.
-
Generalized Small Object Detection: A Point-Prompted Paradigm and Benchmark
TinySet-9M dataset and DEAL point-prompted framework deliver 31.4% relative AP75 gain over supervised baselines for small object detection with one click at inference and generalization to unseen categories.
-
Relit-LiVE: Relight Video by Jointly Learning Environment Video
Relit-LiVE jointly predicts relit videos and viewpoint-aligned environment maps inside a single diffusion process to achieve physically consistent video relighting without camera pose input.
-
Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation
PLMD applies a denoising diffusion model to predict labels for unknown map regions, allowing goal localization in unexplored environments by substituting completed labels into existing navigation pipelines.
-
Approaching human parity in the quality of automated organoid image segmentation
A composite SAM-based method segments organoid images with accuracy matching or approaching inter-observer variability among human annotators.
-
Sparse-View 3D Gaussian Splatting in the Wild
A new sparse-view 3D Gaussian splatting method for unconstrained scenes with distractors combines diffusion-based reference-guided refinement and sparsity-aware Gaussian replication to achieve better rendering quality.
-
WildLIFT: Lifting monocular drone video to 3D for species-agnostic wildlife monitoring
WildLIFT lifts monocular drone video to 3D for species-agnostic wildlife detection, tracking, and viewpoint analysis by integrating scene geometry with open-vocabulary segmentation.
-
PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics
PhysLayer is a framework that decomposes images into depth layers, simulates physics with depth awareness, and synthesizes videos guided by language for more plausible animations.
-
Wiggle and Go! System Identification for Zero-Shot Dynamic Rope Manipulation
Wiggle and Go! uses system identification from rope motion observations to predict parameters that enable zero-shot goal-conditioned dynamic manipulation, achieving 3.55 cm accuracy on 3D target striking versus 15.34 ...
-
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation
Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
-
SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation
SpaCeFormer delivers 11.1 zero-shot mAP on ScanNet200 (2.8x prior proposal-free best) and runs 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines by using spatial window attention and Morton-curve seriali...
-
AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion
A two-stage method synthesizes multi-view 2D motion data from internet video keypoints and trains a camera-conditioned diffusion model to recover globally consistent 3D human motion and HOI in world space.
-
DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior
DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.
-
OVAL: Open-Vocabulary Augmented Memory Model for Lifelong Object Goal Navigation
OVAL introduces an open-vocabulary memory model with structured descriptors and multi-value frontier scoring to enable efficient lifelong object goal navigation in unseen settings.
-
Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions
The model uses dense visuo-tactile feature interactions and material-diversity pairing on expanded datasets to generate tactile saliency maps for material segmentation, outperforming prior global-alignment methods.
-
Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting
A scene-agnostic object codebook learned via unsupervised object-centric learning provides consistent identity-anchored representations for 3D Gaussians across multiple scenes.
-
ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration
ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.
-
Visually-grounded Humanoid Agents
A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.
-
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection
VL-SAM-v3 retrieves visual prototypes from memory to generate sparse spatial and dense contextual priors that refine detection prompts, yielding gains on rare categories in LVIS for both open-vocabulary and open-ended...
-
CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers
CreatiParser decomposes raster graphic designs into editable text, background, and sticker layers via a hybrid VLM-diffusion model with ParserReward and GRPO optimization, reporting 23.7% average metric gains on Parse...
-
LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment
LIDEA bridges the human-robot embodiment gap via implicit feature distillation in 2D and explicit geometry alignment in 3D, enabling human data to substitute up to 80% of robot demonstrations with improved out-of-dist...
-
MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation
MV3DIS uses 3D-guided mask matching and depth consistency to produce more consistent multi-view 2D masks that refine into accurate zero-shot 3D instances.
-
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...
-
Visual Prompt Based Reasoning for Offroad Mapping using Multimodal LLMs
A zero-shot pipeline uses SAM2 segmentation plus numeric-label prompting of a VLM to identify drivable off-road areas and enable navigation without task-specific training or datasets.
-
Empowering NPC Dialogue with Environmental Context Using LLMs and Panoramic Images
NPCs gain spatial awareness via panoramic images turned into JSON scene data for LLMs, enabling dynamic references to nearby objects and improving player preference in user studies.
-
Selective Aggregation of Attention Maps Improves Diffusion-Based Visual Interpretation
Selective aggregation of cross-attention maps from the most relevant heads in diffusion-based T2I models yields higher mean IoU for visual interpretation than standard aggregation methods like DAAM.
-
World Simulation with Video Foundation Models for Physical AI
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
Reference graph
Works this paper leans on
- [1] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended Latent Diffusion, Jun 2022.
- [2] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended Diffusion for Text-driven Editing of Natural Images. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [3] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond, 2023.
- [4] Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, et al. SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
- [5] Junming Chen, Yunfei Liu, Jianan Wang, Ailing Zeng, Yu Li, and Qifeng Chen. DiffSHEG: A Diffusion-based Approach for Real-time Speech-driven Holistic 3D Expression and Gesture Generation. arXiv preprint arXiv:2401.04747, 2024.
- [6] Ling-Hao Chen, Jiawei Zhang, Yewen Li, Yiren Pang, Xiaobo Xia, and Tongliang Liu. HumanMAC: Masked Motion Completion for Human Motion Prediction. 2023.
- [7] Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A Language Modeling Framework for Object Detection. arXiv preprint arXiv:2109.10852, 2021.
- [8] Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, and Alexander G. Schwing. Mask2Former for Video Instance Segmentation. 2022.
- [9] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-Pixel Classification is Not All You Need for Semantic Segmentation. 2021.
- [10] Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, and Joon-Young Lee. Tracking Anything with Decoupled Video Segmentation. In ICCV, 2023.
- [11] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022.
- [12] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, 2023.
- [13] Luciano Floridi and Massimo Chiriatti. GPT-3: Its Nature, Scope, Limits, and Consequences. Minds and Machines, 30:681–694, 2020.
- [14] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors.
- [15] Roberto Gozalo-Brizuela and Eduardo C Garrido-Merchan. ChatGPT is Not All You Need: A State of the Art Review of Large Generative AI Models. arXiv preprint arXiv:2301.04655, 2023.
- [16] Jie Hu, Linyan Huang, Tianhe Ren, Shengchuan Zhang, Rongrong Ji, and Liujuan Cao. You Only Segment Once: Towards Real-Time Panoptic Segmentation, 2023.
- [17] Xinyu Huang, Yi-Jie Huang, Youcai Zhang, Weiwei Tian, Rui Feng, Yuejie Zhang, Yanchun Xie, Yaqian Li, and Lei Zhang. Open-Set Image Tagging with Multi-Grained Text Supervision, 2023.
- [18] Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2Text: Guiding Vision-Language Model via Image Tagging, 2023.
- [19] Ding Jia, Yuhui Yuan, Haodi He, Xiaopei Wu, Haojun Yu, Weihong Lin, Lei Sun, Chao Zhang, and Han Hu. DETRs with Hybrid Matching. arXiv preprint arXiv:2207.13080, 2022.
- [20] Qing Jiang, Feng Li, Tianhe Ren, Shilong Liu, Zhaoyang Zeng, Kent Yu, and Lei Zhang. T-Rex: Counting by Visual Prompting, 2023.
- [21] Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code. arXiv preprint arXiv:2310.01506, 2023.
- [22] Xuan Ju, Ailing Zeng, Chenchen Zhao, Jianan Wang, Lei Zhang, and Qiang Xu. HumanSD: A Native Skeleton-guided Diffusion Model for Human Image Generation. 2023.
- [23] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up GANs for Text-to-Image Synthesis.
- [24] Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Segment Anything in High Quality. arXiv preprint arXiv:2306.01567, 2023.
- [25] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment Anything. arXiv preprint arXiv:2304.02643, 2023.
- [26] Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, and Jianfeng Gao. Visual In-Context Prompting, 2023.
- [27] Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. In Computer Vision and Pattern Recognition (CVPR), 2022.
- [28] Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-SAM: Segment and Recognize Anything at Any Granularity, 2023.
- [29] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum. Mask DINO: Towards a Unified Transformer-based Framework for Object Detection and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023.
- [30] Hongyang Li, Hao Zhang, Zhaoyang Zeng, Shilong Liu, Feng Li, Tianhe Ren, and Lei Zhang. DFA3D: 3D Deformable Attention for 2D-to-3D Feature Lifting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6684–6693, October 2023.
- [31] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- [32] Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
- [33] Jing Lin, Ailing Zeng, Haoqian Wang, Lei Zhang, and Yu Li. One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- [34] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. arXiv preprint arXiv:2304.08485, 2023.
- [35] Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, and Chunyuan Li. LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents, 2023.
- [36] Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. In International Conference on Learning Representations, 2022.
- [37] Shilong Liu, Yaoyuan Liang, Feng Li, Shijia Huang, Hao Zhang, Hang Su, Jun Zhu, and Lei Zhang. DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding, 2022.
- [38] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv preprint arXiv:2303.05499, 2023.
- [39] Shunlin Lu, Ling-Hao Chen, Ailing Zeng, Jing Lin, Ruimao Zhang, Lei Zhang, and Heung-Yeung Shum. HumanTOMATO: Text-aligned Whole-body Motion Generation. arXiv preprint arXiv:2310.12978, 2023.
- [40] Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models, 2023.
- [41] Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Liujuan Cao, Chenglin Wu, Cheng Deng, and Rongrong Ji. Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation, 2020.
- [42] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Image Synthesis and Editing with Stochastic Differential Equations, Aug 2021.
- [43] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional DETR for Fast Training Convergence. arXiv preprint arXiv:2108.06152, 2021.
- [44] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models, Feb 2023.
- [45] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models.
- [46]
- [47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models from Natural Language Supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [48] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents.
- [49] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
- [50] Tianhe Ren, Shilong Liu, Feng Li, Hao Zhang, Ailing Zeng, Jie Yang, Xingyu Liao, Ding Jia, Hongyang Li, He Cao, et al. detrex: Benchmarking Detection Transformers. arXiv preprint arXiv:2306.07265, 2023.
- [51] Tianhe Ren, Jianwei Yang, Shilong Liu, Ailing Zeng, Feng Li, Hao Zhang, Hongyang Li, Zhaoyang Zeng, and Lei Zhang. A Strong and Reproducible Object Detector with Only Public Datasets, 2023.
- [52] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [53] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, Aug 2022.
- [54] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding.
- [55] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv preprint arXiv:2303.17580, 2023.
- [56] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust Large Mask Inpainting with Fourier Convolutions. arXiv preprint arXiv:2109.07161, 2021.
- [57] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language Models for Dialog Applications. arXiv preprint arXiv:2201.08239, 2022.
- [58] Jiaqi Wang, Pan Zhang, Tao Chu, Yuhang Cao, Yujie Zhou, Tong Wu, Bin Wang, Conghui He, and Dahua Lin. V3Det: Vast Vocabulary Visual Detection Dataset. arXiv preprint arXiv:2304.03752, 2023.
- [59] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. In ICML, 2022.
- [60] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. CogVLM: Visual Expert for Pretrained Language Models, 2023.
- [61] Yinhuai Wang, Jing Lin, Ailing Zeng, Zhengyi Luo, Jian Zhang, and Lei Zhang. PhysHOI: Physics-based Imitation of Dynamic Human-Object Interaction. arXiv preprint arXiv:2312.04393, 2023.
- [62] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv preprint arXiv:2303.04671, 2023.
- [63] Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest Iandola, et al. EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything. arXiv preprint arXiv:2312.00863, 2023.
- [64] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models. arXiv preprint arXiv:2303.04803, 2023.
- [65] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side Adapter Network for Open-Vocabulary Semantic Segmentation, 2023.
- [66] Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Zehuan Yuan, Ping Luo, and Huchuan Lu. Universal Instance Perception as Object Discovery and Retrieval. In CVPR, 2023.
- [67] Feng Yan, Weixin Luo, Yujie Zhong, Yiyang Gan, and Lin Ma. Bridging the Gap Between End-to-end and Non-End-to-end Multi-Object Tracking, 2023.
- [68] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by Example: Exemplar-based Image Editing with Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023.
- [69] Jie Yang, Bingliang Li, Fengyu Yang, Ailing Zeng, Lei Zhang, and Ruimao Zhang. Boosting Human-Object Interaction Detection with Text-to-Image Diffusion Model. arXiv preprint arXiv:2305.12252, 2023.
- [70] Jie Yang, Chaoqun Wang, Zhen Li, Junle Wang, and Ruimao Zhang. Semantic Human Parsing via Scalable Semantic Transfer over Multiple Label Domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19424–19433, 2023.
- [71] Jie Yang, Ailing Zeng, Feng Li, Shilong Liu, Ruimao Zhang, and Lei Zhang. Neural Interactive Keypoint Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15122–15132, 2023.
- [72] Jie Yang, Ailing Zeng, Shilong Liu, Feng Li, Ruimao Zhang, and Lei Zhang. Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation. In International Conference on Learning Representations, 2023.
- [73] Jie Yang, Ailing Zeng, Ruimao Zhang, and Lei Zhang. UniPose: Detecting Any Keypoints. arXiv preprint arXiv:2310.08530, 2023.
- [74] Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Effective Whole-body Pose Estimation with Two-stages Distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4210–4220, 2023.
- [75] Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-Tau Yih. Retrieval-Augmented Multimodal Language Modeling.
- [76] Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv preprint arXiv:2306.14289, 2023.
- [77] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, 2022.
- [78] Hao Zhang, Feng Li, Huaizhe Xu, Shijia Huang, Shilong Liu, Lionel M Ni, and Lei Zhang. MP-Former: Mask-Piloted Transformer for Image Segmentation. arXiv preprint arXiv:2303.07336, 2023.
- [79] Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, and Lei Zhang. A Simple Framework for Open-Vocabulary Segmentation and Detection. arXiv preprint arXiv:2303.08131, 2023.
- [80] Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, and Jianwei Yang. LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models, 2023.