pith. machine review for the scientific record.

arXiv: 2511.18359 · v2 · submitted 2025-11-23 · 💻 cs.CV

TRANSPORTER: Transferring Visual Semantics from VLM Manifolds

Pith reviewed 2026-05-17 05:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · model interpretability · optimal transport · text-to-video generation · logits to video · embedding coupling · semantic visualization

The pith

TRANSPORTER learns an optimal transport map from VLM embedding spaces to text-to-video generators so that logit scores steer the creation of videos showing the visual rules behind model predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a logits-to-video task in which high-semantic embeddings from vision-language models are coupled to text-to-video generators through optimal transport. Logit scores then serve as directions that condition the generation of short videos whose visual content changes when object attributes, action adverbs, or scene context are altered in the original caption. The resulting videos are offered as a direct, high-fidelity visualization of the decision processes that VLMs use to assign scores, addressing the difficulty of inspecting internal reasoning in these models. If the coupling succeeds, changes in generated video content become a readable proxy for shifts in VLM predictions without requiring access to model weights or gradients.

Core claim

Given a VLM and a text-to-video model, TRANSPORTER learns an optimal transport coupling between the VLM's high-semantic embedding space and the conditioning space of the generative model; logit scores then define embedding directions that drive conditional video synthesis, producing videos whose content reflects caption variations over object attributes, action adverbs, and scene context.
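The idea of a logit-conditioned embedding direction can be illustrated with a minimal sketch. It assumes the simplest possible reading: a concept vector is the difference between two condition embeddings, and a scalar divergence in [0, 1] scales how far along it to move. The function names `concept_direction` and `modulate`, and the linear form itself, are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def concept_direction(emb_minus, emb_plus):
    """Difference vector pointing from the original concept to the target concept."""
    emb_minus = np.asarray(emb_minus, dtype=float)
    emb_plus = np.asarray(emb_plus, dtype=float)
    return emb_plus - emb_minus

def modulate(emb_minus, emb_plus, delta_omega):
    """Move the original embedding along the concept direction.

    delta_omega is a hypothetical scalar divergence: 0 keeps the original
    concept, 1 reaches the target concept, values in between interpolate.
    """
    q = concept_direction(emb_minus, emb_plus)
    return np.asarray(emb_minus, dtype=float) + delta_omega * q
```

Sweeping `delta_omega` from 0 to 1 yields a sequence of conditioning vectors, each of which would be handed to the generator to produce one video in the modulation series.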

What carries the argument

Optimal transport coupling between VLM high-semantic embedding spaces and the latent conditioning space of a text-to-video generative model, with logit scores supplying the transport directions.
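For readers unfamiliar with optimal transport couplings, the sketch below computes a generic entropic OT plan between two small point clouds using Sinkhorn iterations. It is not the paper's learnable OT module: it assumes both embedding sets already live in the same dimension, uses uniform weights and a squared Euclidean cost, and the names `sinkhorn_plan` and `barycentric_map` are hypothetical.

```python
import numpy as np

def sinkhorn_plan(x, y, eps=0.5, n_iters=500):
    """Entropic OT plan between point clouds x (n, d) and y (m, d)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    a = np.full(n, 1.0 / n)  # uniform source weights (assumption)
    b = np.full(m, 1.0 / m)  # uniform target weights (assumption)
    # Pairwise squared Euclidean cost, rescaled to [0, 1] for stability.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    scale = cost.max()
    if scale > 0:
        cost = cost / scale
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)    # rescale rows towards marginal a
        v = b / (kernel.T @ u)  # rescale columns towards marginal b
    return u[:, None] * kernel * v[None, :]

def barycentric_map(plan, y):
    """Send each source point to the plan-weighted average of target points."""
    y = np.asarray(y, dtype=float)
    return (plan @ y) / plan.sum(axis=1, keepdims=True)
```

The plan's row and column sums approximate the two marginals, and the barycentric projection gives one simple way to read a point-to-point map off a coupling. Smaller `eps` gives a sharper plan at the cost of slower convergence.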

If this is right

  • Altering a single attribute in the input caption produces a corresponding visual change in the generated video that can be inspected to understand which visual cues the VLM used.
  • The same coupling can be applied across multiple VLMs to compare how different models encode the same scene elements.
  • Logit-driven video generation supplies a new form of interpretability that operates at the level of full visual sequences rather than attention maps or feature visualizations.
  • The method is model-independent once the transport map is learned, allowing reuse with any VLM whose embeddings can be extracted.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the transport map generalizes across different T2V backbones, it could become a standard post-hoc inspection tool for any caption-conditioned vision model.
  • The generated videos could serve as training data for further alignment between generative models and the semantic distinctions learned by VLMs.
  • Extending the coupling to video-to-video translation might allow direct editing of real footage to match VLM decision boundaries.

Load-bearing premise

The learned optimal transport coupling will faithfully capture and transfer the underlying rules that produce the VLM's logit predictions.

What would settle it

Generate videos for a fixed set of caption variants that produce known logit shifts in the VLM; if the visual differences in the videos do not align with the magnitude or direction of those logit shifts under controlled human or automated evaluation, the transport map has failed to transfer the decision rules.
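One way such a test could be scored is sketched here, assuming each caption variant yields one scalar logit shift and one scalar visual-difference score (for example, an embedding distance between generated videos). The rank-correlation choice, the function names, and the 0.5 pass threshold are all illustrative assumptions rather than anything the paper specifies.

```python
import numpy as np

def rankdata_average(values):
    """Ranks starting at 1; tied values share their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx = rankdata_average(x)
    ry = rankdata_average(y)
    rx -= rx.mean()
    ry -= ry.mean()
    denom = np.sqrt((rx ** 2).sum() * (ry ** 2).sum())
    if denom == 0:
        raise ValueError("correlation undefined for constant input")
    return float((rx * ry).sum() / denom)

def alignment_check(logit_shifts, visual_shifts, threshold=0.5):
    """Return (rho, passed); threshold is a hypothetical cut-off."""
    rho = spearman(logit_shifts, visual_shifts)
    return rho, rho >= threshold
```

A low or negative correlation over a controlled set of caption variants would count as evidence against the transfer claim; a high one would be necessary but not sufficient, since it says nothing about which visual factors changed.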

Figures

Figures reproduced from arXiv: 2511.18359 by Alexandros Stergiou.

Figure 1: Generated videos representing VLM logit modulations with TRANSPORTER. Videos corresponding to different logit predictions are obtained by coupling VLM embeddings to generative representations. Given a modulation of a VLM caption over an object, action, or scene attribute, TRANSPORTER guides the video generation process to reflect changes made in the token logit scores. Resulting embeddings are decoded int…
Figure 2: (a) L2V with TRANSPORTER: Embeddings $z_\Xi \in \mathbb{R}^{\Xi}$ are coupled with network $\Phi$ and concept bank $Q$. (b) Coupling network $\Phi$ initially projects $z_\Xi$ with condition $\pi_\Xi$ to $\hat{z}_{\Omega_1} = \Phi_{\Omega_1}(z_\Xi, \pi_\Xi)$. Latents $\hat{z}_{\Omega_2} \in \mathbb{R}^{\Omega}$ are obtained with $\Phi_{\Omega_2}$ over decoder $D_\Xi$ and encoder $E_\Omega$ latents. The Learnable Optimal Transport (ρ-OT) module uses updatable projection vectors $p_{\Omega_1}, p_{\Omega_2}$ to transport embeddings to $\tilde{z}_\Omega$. The divergence between pa…
Figure 3: Concept attribute control with TRANSPORTER given the caption: A close up shot of a attr bowling ball hitting the pins in a bowling alley. Initially, red is used to obtain generator/VLM encodings $\pi^{-}_\Xi$, $\pi^{-}_\Omega$. Vector $q_{\text{red}\to\text{blue}}$ is added to $\pi^{-}_\Xi$ based on divergence $\Delta\omega$ to $\pi^{+}$ = blue. As shown, generated videos are of high visual fidelity while they also preserve scene dynamics across modulations; e.g. camer…
Figure 4: Flow path modulation. Given latents $z'_{\Xi,t}$ two velocity fields are predicted for conditions $\pi^{-}$, $\pi^{+}$ at step $t$. Their latent divergence $\Delta v$ corresponds to concept/attribute directions. to coupling network $\Phi$ (Fig. 2b) so attribute modulations can be explored across both $\mathbb{R}^{\Xi}$ and $\mathbb{R}^{\Omega}$. Generator modulations. Initially, a video caption containing concept $\pi^{-}$ is tokenized by $\pi^{-}_\Xi = T_\Xi(\pi)$ and $\pi^{-}_\Omega = T_\Omega(\pi)$ respe…
Figure 5: Preferred input generation with AM (top) and proposed L2V (bottom) based on VideoLLaMA 3 logits corresponding to walk. Beyond visualizing single logits, TRANSPORTER further enables generating videos to explore intermediate modulations of the logit distribution when shifting towards run. Multi-metric results. Tab. 2 compares embeddings of generated and real videos across VLM encoders. As shown, the ba…
Figure 6: Generated video modulations with TRANSPORTER across VLMs. Concept vectors can visualize videos corresponding to logit distributions over a variety of video attributes which can relate to (top) active objects and affordances, such as juggling balls or clubs, (middle) changes or details in the performance of actions, with front and back handspring, and (bottom) fine-grained scene details, such as holding a …
Figure 7: Generated videos with Phi 4 MM logits over alternative settings. (a) Divergence modulations can be done over combined attributes, such as cutting two peppers over thin strips ($\Delta\omega = 0$) to cutting one pepper over thick strips ($\Delta\omega = 1$). (b) TRANSPORTER modulations can be introduced at different generation steps to highlight differences of attribute modulations ($\Delta\omega$) with larger divergence, as that of thin and …
Figure 8: Examples of active object modulations. Frame quality is compressed due to filesize (best viewed digitally)
Figure 9: Examples of action modulations. Frame quality is compressed due to filesize (best viewed digitally)
Figure 10: Examples of scene modulations. Frame quality is compressed due to filesize (best viewed digitally)
Figure 11: Examples of multiple modulations. Frame quality is compressed due to filesize (best viewed digitally)
Original abstract

How do video understanding models acquire their answers? Although current Vision Language Models (VLMs) reason over complex scenes with diverse objects, action performances, and scene dynamics, understanding and controlling their internal processes remains an open challenge. Motivated by recent advancements in text-to-video (T2V) generative models, this paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos that capture the underlying rules behind VLMs' predictions. Given the high-visual-fidelity produced by T2V models, TRANSPORTER learns an optimal transport coupling to VLM's high-semantic embedding spaces. In turn, logit scores define embedding directions for conditional video generation. TRANSPORTER generates videos that reflect caption changes over diverse object attributes, action adverbs, and scene context. Quantitative and qualitative evaluations across VLMs demonstrate that L2V can provide a fidelity-rich, novel direction for model interpretability that has not been previously explored.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TRANSPORTER for a logits-to-video (L2V) task. It learns an optimal transport coupling between high-semantic embedding spaces of VLMs and text-to-video (T2V) generative models; logit scores then define directions for conditional video generation. The resulting videos are claimed to reflect caption changes over object attributes, action adverbs, and scene context, with quantitative and qualitative evaluations across VLMs supporting a novel interpretability direction.

Significance. If the transport mapping holds, the work offers a high-fidelity generative route to VLM interpretability that has not been previously explored, leveraging T2V priors to visualize internal prediction rules. This could meaningfully advance understanding of how VLMs reason over complex video scenes.

major comments (2)
  1. [§3] §3 (Optimal Transport Coupling): The central claim requires that the OT plan, when conditioned on logit scores, produces videos whose changes faithfully reflect the specific rules or features driving VLM logit predictions. However, OT minimizes a global Wasserstein cost between manifolds and does not enforce local alignment with the VLM's decision boundary or attribution; if manifold curvature differs or the T2V prior injects unrelated factors, the generated videos may be plausible yet misaligned with the VLM's actual reasoning. A concrete verification (e.g., controlled ablation on known decision features) is needed.
  2. [§5] §5 (Quantitative Evaluation): The abstract states that evaluations 'demonstrate' the L2V approach, yet no specific metrics, baselines, statistical tests, or effect sizes are referenced in the description of results. Without these, it is impossible to assess whether the transport mapping outperforms simpler alternatives or supports the fidelity-rich interpretability claim.
minor comments (2)
  1. [§3] Notation for the embedding spaces, cost function, and conditioning on logits should be introduced with a single consistent diagram early in the method section to aid readability.
  2. [Abstract] The abstract claims results 'across VLMs' but does not name the specific models or datasets used; this detail belongs in the abstract or a dedicated table.
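The first major comment turns on what an optimal transport objective actually optimises. For reference, the standard Kantorovich formulation is sketched below. The symbols $\mu$, $\nu$, $c$, and $\gamma$ are generic textbook notation, not taken from the paper, and the paper's learnable OT module may differ from this form.

```latex
\mathrm{OT}_c(\mu, \nu) \;=\; \inf_{\gamma \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, \mathrm{d}\gamma(x, y),
\qquad
\Pi(\mu, \nu) \;=\; \bigl\{ \gamma \ge 0 \;:\; \gamma(\cdot, \mathcal{Y}) = \mu,\ \gamma(\mathcal{X}, \cdot) = \nu \bigr\}.
```

The objective aggregates the cost over the entire plan $\gamma$ and constrains only its marginals. Nothing in it refers to the VLM's decision function, which is the gap the referee points to.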

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. We address each major comment point by point below, providing clarifications and committing to revisions where appropriate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Optimal Transport Coupling): The central claim requires that the OT plan, when conditioned on logit scores, produces videos whose changes faithfully reflect the specific rules or features driving VLM logit predictions. However, OT minimizes a global Wasserstein cost between manifolds and does not enforce local alignment with the VLM's decision boundary or attribution; if manifold curvature differs or the T2V prior injects unrelated factors, the generated videos may be plausible yet misaligned with the VLM's actual reasoning. A concrete verification (e.g., controlled ablation on known decision features) is needed.

    Authors: We agree that the global nature of optimal transport does not inherently guarantee local alignment with VLM decision boundaries, and that manifold differences or T2V priors could introduce misalignments. Our current evidence relies on qualitative demonstrations where logit-driven shifts produce videos reflecting targeted attribute, action, and scene changes across VLMs. To provide the requested concrete verification, we will add a controlled ablation study in the revised manuscript: we will intervene on known VLM decision features (e.g., by altering specific object attributes in input frames) and measure the fidelity of corresponding changes in the generated videos under the OT coupling. revision: yes

  2. Referee: [§5] §5 (Quantitative Evaluation): The abstract states that evaluations 'demonstrate' the L2V approach, yet no specific metrics, baselines, statistical tests, or effect sizes are referenced in the description of results. Without these, it is impossible to assess whether the transport mapping outperforms simpler alternatives or supports the fidelity-rich interpretability claim.

    Authors: We acknowledge that the results description would benefit from greater explicitness on quantitative aspects. While the manuscript includes quantitative evaluations across VLMs in §5, we will revise this section to explicitly detail the metrics (e.g., semantic consistency and generation fidelity measures), baselines compared, statistical tests applied, and effect sizes observed. This will allow readers to better evaluate the transport mapping's performance relative to alternatives. revision: yes

Circularity Check

0 steps flagged

No circularity: method uses external OT and T2V without self-referential reduction

full rationale

The derivation introduces TRANSPORTER as an optimal transport coupling between VLM embeddings and T2V generative latents, with logit scores used to define directions for conditional generation. No equations, fitted parameters, or self-citations in the abstract reduce the generated videos or interpretability claims to inputs defined by the same data or prior author work. The L2V task and evaluations rely on external models and standard OT, remaining self-contained and falsifiable against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based only on the abstract, the central claim rests on the existence of a learnable optimal transport coupling that preserves semantic directions; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5465 in / 1025 out tokens · 22842 ms · 2026-05-17T05:56:08.482343+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

113 extracted references · 113 canonical work pages · 10 internal anchors

  1. [1]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harri- son, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv:2412.08905, 2024. 1, 5

  2. [2]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkin- son, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, et al. Phi-4-mini technical report: Compact yet pow- erful multimodal language models via mixture-of-loras. arXiv:2503.01743, 2025. 1

  3. [3]

    Getting vit in shape: Scaling laws for compute-optimal model design.NeurIPS, 2023

    Ibrahim M Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, and Lucas Beyer. Getting vit in shape: Scaling laws for compute-optimal model design.NeurIPS, 2023. 1

  4. [4]

    Building nor- malizing flows with stochastic interpolants.ICLR, 2023

    Michael S Albergo and Eric Vanden-Eijnden. Building nor- malizing flows with stochastic interpolants.ICLR, 2023. 2

  5. [5]

    Re- fusal in language models is mediated by a single direction

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Re- fusal in language models is mediated by a single direction. NeurIPS, 2024. 1

  6. [6]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wen- bin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv:2502.13923, 2025. 1

  7. [7]

    Network dissection: Quantifying inter- pretability of deep visual representations

    David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying inter- pretability of deep visual representations. InCVPR, 2017. 2

  8. [8]

    Continuous, subject-specific attribute control in t2i models by identifying semantic directions

    Stefan Andreas Baumann, Felix Krause, Michael Neumayr, Nick Stracke, Melvin Sevi, Vincent Tao Hu, and Björn Om- mer. Continuous, subject-specific attribute control in t2i models by identifying semantic directions. InCVPR, 2025. 2, 4, 6

  9. [9]

    Legrad: An explain- ability method for vision transformers via feature formation sensitivity

    Walid Bousselham, Angie Boggust, Sofian Chaybouti, Hendrik Strobelt, and Hilde Kuehne. Legrad: An explain- ability method for vision transformers via feature formation sensitivity. InICCV, 2025. 2

  10. [10]

    Labeling neural representations with inverse recognition

    Kirill Bykov, Laura Kopf, Shinichi Nakajima, Marius Kloft, and Marina Höhne. Labeling neural representations with inverse recognition. InNeurIPS, 2023. 2

  11. [11]

    Unsupervised learn- ing of visual features by contrasting cluster assignments

    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learn- ing of visual features by contrasting cluster assignments. NeurIPS, 2020. 3

  12. [12]

    Grad-CAM++: General- ized gradient-based visual explanations for deep convolu- tional networks

    Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-CAM++: General- ized gradient-based visual explanations for deep convolu- tional networks. InWACV, 2018. 2

  13. [13]

    Generic attention- model explainability for interpreting bi-modal and encoder- decoder transformers

    Hila Chefer, Shir Gur, and Lior Wolf. Generic attention- model explainability for interpreting bi-modal and encoder- decoder transformers. InICCV, 2021. 2

  14. [14]

    Plot: Prompt learning with optimal transport for vision-language models.ICLR, 2023

    Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. Plot: Prompt learning with optimal transport for vision-language models.ICLR, 2023. 2

  15. [15]

    Interpreting and controlling vision foundation models via text explanations.arXiv:2310.10591, 2023

    Haozhe Chen, Junfeng Yang, Carl V ondrick, and Chengzhi Mao. Interpreting and controlling vision foundation models via text explanations.arXiv:2310.10591, 2023. 2

  16. [16]

    Selfie: self-interpretation of large language model embeddings

    Haozhe Chen, Carl V ondrick, and Chengzhi Mao. Selfie: self-interpretation of large language model embeddings. In ICML, 2024. 1

  17. [17]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodal- ity, long context, and next generation agentic capabilities. arXiv:2507.06261, 2025. 1

  18. [18]

    Fluxspace: Disentangled semantic editing in rectified flow models

    Yusuf Dalva, Kavana Venkatesh, and Pinar Yanardag. Fluxspace: Disentangled semantic editing in rectified flow models. InCVPR, 2025. 3

  19. [19]

    Im- plicit inversion turns clip into a decoder.arXiv:2505.23161,

    Antonio D’Orazio, Maria Rosaria Briglia, Donato Crisos- tomi, Dario Loi, Emanuele Rodolà, and Iacopo Masi. Im- plicit inversion turns clip into a decoder.arXiv:2505.23161,

  20. [20]

    An image is worth 16x16 words: Trans- formers for image recognition at scale.ICLR, 2021

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale.ICLR, 2021. 5

  21. [21]

    Weakly supervised semantic segmentation by pixel-to-prototype contrast

    Ye Du, Zehua Fu, Qingjie Liu, and Yunhong Wang. Weakly supervised semantic segmentation by pixel-to-prototype contrast. InCVPR, 2022. 2

  22. [22]

    Deep insights into convolutional net- works for video recognition.IJCV, 2020

    Christoph Feichtenhofer, Axel Pinz, Richard P Wildes, and Andrew Zisserman. Deep insights into convolutional net- works for video recognition.IJCV, 2020. 2

  23. [23]

    Unlock- ing feature visualization for deep network with magnitude constrained optimization.NeurIPS, 2023

    Thomas Fel, Thibaut Boissin, Victor Boutin, Agustin Pi- card, Paul Novello, Julien Colin, Drew Linsley, Tom Rousseau, Rémi Cadène, Lore Goetschalckx, et al. Unlock- ing feature visualization for deep network with magnitude constrained optimization.NeurIPS, 2023. 2, 5, 6

  24. [24]

    A holistic approach to unifying automatic concept extraction and concept importance estimation

    Thomas Fel, Victor Boutin, Louis Béthune, Rémi Cadène, Mazda Moayeri, Léo Andéol, Mathieu Chalvidal, and Thomas Serre. A holistic approach to unifying automatic concept extraction and concept importance estimation. In NeurIPS, 2023. 2

  25. [25]

    Craft: Concept recursive activation factor- ization for explainability

    Thomas Fel, Agustin Picard, Louis Bethune, Thibaut Boissin, David Vigouroux, Julien Colin, Rémi Cadène, and Thomas Serre. Craft: Concept recursive activation factor- ization for explainability. InCVPR, 2023. 2

  26. [26]

    Interpretable explana- tions of black boxes by meaningful perturbation

    Ruth C Fong and Andrea Vedaldi. Interpretable explana- tions of black boxes by meaningful perturbation. InICCV,

  27. [27]

    Direct ascent syn- thesis: Revealing hidden generative capabilities in discrim- inative models.arXiv:2502.07753, 2025

    Stanislav Fort and Jonathan Whitaker. Direct ascent syn- thesis: Revealing hidden generative capabilities in discrim- inative models.arXiv:2502.07753, 2025. 2

  28. [28]

    Interpreting clip’s image representation via text-based de- composition

    Yossi Gandelsman, Alexei A Efros, and Jacob Steinhardt. Interpreting clip’s image representation via text-based de- composition. InICLR, 2024. 2

  29. [29]

    Concept sliders: Lora adap- tors for precise control in diffusion models

    Rohit Gandikota, Joanna Materzy ´nska, Tingrui Zhou, An- tonio Torralba, and David Bau. Concept sliders: Lora adap- tors for precise control in diffusion models. InECCV, 2024. 2

  30. [30]

    Image style transfer using convolutional neural networks

    Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. InCVPR, 2016. 3

  31. [31]

    Plug-in inversion: Model-agnostic inversion for vision with data augmenta- tions

    Amin Ghiasi, Hamid Kazemi, Steven Reich, Chen Zhu, Micah Goldblum, and Tom Goldstein. Plug-in inversion: Model-agnostic inversion for vision with data augmenta- tions. InICML, 2022. 2

  32. [32]

    Arcee’s mergekit: A toolkit for merging large language models

    Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. Arcee’s mergekit: A toolkit for merging large language models. In EMNLP, 2024. 3

  33. [33]

    Boosting the visual interpretability of clip via adversarial fine-tuning

    Shizhan Gong, LEI Haoyu, Qi Dou, and Farzan Farnia. Boosting the visual interpretability of clip via adversarial fine-tuning. InICLR, 2025. 2

  34. [34]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR, 2022. 5

  35. [35]

    Uncovering unique concept vectors through latent space decomposition

    Mara Graziani, Laura O’ Mahony, An-Phi Nguyen, Hen- ning Müller, and Vincent Andrearczyk. Uncovering unique concept vectors through latent space decomposition. TMLR, 2023. 2

  36. [36]

    Gradvit: Gradient inversion of vision transformers

    Ali Hatamizadeh, Hongxu Yin, Holger R Roth, Wenqi Li, Jan Kautz, Daguang Xu, and Pavlo Molchanov. Gradvit: Gradient inversion of vision transformers. InCVPR, 2022. 2, 5, 6

  37. [37]

    Clip knows image aesthetics.FAI, 2022

    Simon Hentschel, Konstantin Kobs, and Andreas Hotho. Clip knows image aesthetics.FAI, 2022. 6

  38. [38]

    Natu- ral language descriptions of deep visual features

    Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natu- ral language descriptions of deep visual features. InICLR,

  39. [39]

    In- specting and editing knowledge representations in language models.COLM, 2024

    Evan Hernandez, Belinda Z Li, and Jacob Andreas. In- specting and editing knowledge representations in language models.COLM, 2024. 1

  40. [40]

    Prompt-to-prompt im- age editing with cross attention control.ICLR, 2023

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt im- age editing with cross attention control.ICLR, 2023. 2

  41. [41]

    Clipscore: A reference-free eval- uation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free eval- uation metric for image captioning. InEMNLP, 2021. 6

  42. [42]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 2022. 2

  43. [43]

    How do vision- language models process conflicting information across modalities?arXiv:2507.01790, 2025

    Tianze Hua, Tian Yun, and Ellie Pavlick. How do vision- language models process conflicting information across modalities?arXiv:2507.01790, 2025. 2

  44. [44]

    MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

    Animesh Jain and Alexandros Stergiou. Mimic: Multi- modal inversion for model interpretation and conceptual- ization.arXiv:2508.07833, 2025. 2

  45. [45]

    Pyramidal flow matching for efficient video generative modeling.ICLR, 2025

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling.ICLR, 2025. 2

  46. [46]

    Auto-encoding vari- ational bayes.ICLR, 2014

    Diederik P Kingma and Max Welling. Auto-encoding vari- ational bayes.ICLR, 2014. 3

  47. [47]

    The sinkhorn–knopp algorithm: conver- gence and applications.SIMAX, 2008

    Philip A Knight. The sinkhorn–knopp algorithm: conver- gence and applications.SIMAX, 2008. 1

  48. [48]

    Concept bottleneck models

    Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. InICML, 2020. 2

  49. [49]

    Visual concept connectome (vcc): Open world concept discovery and their interlayer connections in deep models

    Matthew Kowal, Richard P Wildes, and Konstantinos G Derpanis. Visual concept connectome (vcc): Open world concept discovery and their interlayer connections in deep models. InCVPR, 2024. 2

  50. [50]

    Interpretable generative models through post-hoc concept bottlenecks

    Akshay Kulkarni, Ge Yan, Chung-En Sun, Tuomas Oikari- nen, and Tsui-Wei Weng. Interpretable generative models through post-hoc concept bottlenecks. InCVPR, 2025. 2

  51. [51]

    Beyond concept bottleneck models: How to make black boxes intervenable?NeurIPS, 2024

    Sonia Laguna, Ri ˇcards Marcinkeviˇcs, Moritz Vandenhirtz, and Julia V ogt. Beyond concept bottleneck models: How to make black boxes intervenable?NeurIPS, 2024. 2

  52. [52]

    Clearclip: De- composing clip representations for dense vision-language inference

    Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Clearclip: De- composing clip representations for dense vision-language inference. InECCV, 2024. 2

  53. [53]

    Demystifying neural style transfer

    Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Demystifying neural style transfer. In IJCAI, 2017. 3

  54. [54]

    Towards visually explaining video understanding networks with perturbation

    Zhenqiang Li, Weimin Wang, Zuoyue Li, Yifei Huang, and Yoichi Sato. Towards visually explaining video understanding networks with perturbation. In WACV, 2021. 2

  55. [55]

    Flow matching for generative modeling. ICLR, 2023

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. ICLR, 2023. 2

  56. [56]

    Decoupled weight decay regularization. ICLR, 2019

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2019. 5

  57. [57]

    Doubly right object recognition: A why prompt for visual rationales

    Chengzhi Mao, Revant Teotia, Amrutha Sundar, Sachit Menon, Junfeng Yang, Xin Wang, and Carl Vondrick. Doubly right object recognition: A why prompt for visual rationales. In CVPR, 2023. 2

  58. [58]

    Visual classification via description from large language models

    Sachit Menon and Carl Vondrick. Visual classification via description from large language models. In ICLR, 2023. 2

  59. [59]

    Text-to-concept (and back) via cross-model alignment

    Mazda Moayeri, Keivan Rezaei, Maziar Sanjabi, and Soheil Feizi. Text-to-concept (and back) via cross-model alignment. In ICML, 2023. 2

  60. [60]

    Gromov-wasserstein autoencoders

    Nao Nakagawa, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Gromov-wasserstein autoencoders. In ICLR,

  61. [61]

    Synthesizing the preferred inputs for neurons in neural networks via deep generator networks

    Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. NeurIPS, 2016. 2

  62. [62]

    Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks

    Anh Nguyen, Jason Yosinski, and Jeff Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. In ICMLw, 2016. 2

  63. [63]

    Zoom in: An introduction to circuits. Distill, 2020

    Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. 2

  64. [64]

    Sparse autoencoders learn monosemantic features in vision-language models

    Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, and Zeynep Akata. Sparse autoencoders learn monosemantic features in vision-language models. NeurIPS, 2025. 2

  65. [65]

    Future lens: Anticipating subsequent tokens from a single hidden state. CoNLL, 2023

    Koyena Pal, Jiuding Sun, Andrew Yuan, Byron C Wallace, and David Bau. Future lens: Anticipating subsequent tokens from a single hidden state. CoNLL, 2023. 1

  66. [66]

    Normalizing flows for probabilistic modeling and inference. JMLR, 2021

    George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. JMLR, 2021. 2

  67. [67]

    Precisecontrol: Enhancing text-to-image diffusion models with fine-grained attribute control

    Rishubh Parihar, VS Sachidanand, Sabariswaran Mani, Tejan Karmali, and R Venkatesh Babu. Precisecontrol: Enhancing text-to-image diffusion models with fine-grained attribute control. In ECCV, 2024. 2

  68. [68]

    Rise: Randomized input sampling for explanation of black-box models

    Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models. In BMVC, 2018. 2

  69. [69]

    A novel sliced fused gromov-wasserstein distance. arXiv:2508.02364, 2025

    Moritz Piening and Robert Beinert. A novel sliced fused gromov-wasserstein distance. arXiv:2508.02364, 2025. 3

  70. [70]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 2

  71. [71]

    Learning important features through propagating activation differences

    Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In ICML, 2017. 2

  72. [72]

    What does clip know about a red circle? visual prompt engineering for vlms

    Aleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms. In ICCV, 2023. 2

  73. [73]

    Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

    Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034,

  74. [74]

    Vlg-cbm: Training concept bottleneck models with vision-language guidance. NeurIPS, 2024

    Divyansh Srivastava, Ge Yan, and Lily Weng. Vlg-cbm: Training concept bottleneck models with vision-language guidance. NeurIPS, 2024. 2

  75. [75]

    Lavib: A large-scale video interpolation benchmark

    Alexandros Stergiou. Lavib: A large-scale video interpolation benchmark. In NeurIPS, 2024. 5

  76. [76]

    Leaping into memories: Space-time deep feature synthesis

    Alexandros Stergiou and Nikos Deligiannis. Leaping into memories: Space-time deep feature synthesis. In ICCV,

  77. [77]

    About time: Advances, challenges, and outlooks of action understanding

    Alexandros Stergiou and Ronald Poppe. About time: Advances, challenges, and outlooks of action understanding. IJCV, 2025. 1

  78. [78]

    Saliency tubes: Visual explanations for spatio-temporal convolutions

    Alexandros Stergiou, Georgios Kapidis, Grigorios Kalliatakis, Christos Chrysoulas, Remco Veltkamp, and Ronald Poppe. Saliency tubes: Visual explanations for spatio-temporal convolutions. In ICIP, 2019. 2

  79. [79]

    Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models

    Elaine Sui, Xiaohan Wang, and Serena Yeung-Levy. Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models. In WACV, 2025. 2

  80. [80]

    Axiomatic attribution for deep networks

    Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In ICML, 2017. 2

Showing first 80 references.