pith. machine review for the scientific record. sign in

arxiv: 2508.07833 · v3 · submitted 2025-08-11 · 💻 cs.CV

MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

classification 💻 cs.CV
keywords inversionmimicmodelmultimodalvisualalignmentconceptsconceptualization
0
0 comments X
read the original abstract

Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limit transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework that inverts the internal encodings of VLMs. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We evaluate MIMIC both quantitatively and qualitatively by inverting visual concepts across a range of free-form VLM outputs of varying length. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TRANSPORTER: Transferring Visual Semantics from VLM Manifolds

    cs.CV 2025-11 unverdicted novelty 7.0

    TRANSPORTER generates videos from VLM logits using optimal transport to interpret model predictions on object attributes, actions, and scenes.