arxiv: 2511.18359 · v2 · submitted 2025-11-23 · 💻 cs.CV

TRANSPORTER: Transferring Visual Semantics from VLM Manifolds

Alexandros Stergiou

Pith reviewed 2026-05-17 05:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language models · model interpretability · optimal transport · text-to-video generation · logits to video · embedding coupling · semantic visualization

The pith

TRANSPORTER learns an optimal transport map from VLM embedding spaces to text-to-video generators so that logit scores steer the creation of videos showing the visual rules behind model predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a logits-to-video task in which high-semantic embeddings from vision-language models are coupled to text-to-video generators through optimal transport. Logit scores then serve as directions that condition the generation of short videos whose visual content changes when object attributes, action adverbs, or scene context are altered in the original caption. The resulting videos are offered as a direct, high-fidelity visualization of the decision processes that VLMs use to assign scores, addressing the difficulty of inspecting internal reasoning in these models. If the coupling succeeds, changes in generated video content become a readable proxy for shifts in VLM predictions without requiring access to model weights or gradients.

Core claim

Given a VLM and a text-to-video model, TRANSPORTER learns an optimal transport coupling between the VLM's high-semantic embedding space and the conditioning space of the generative model; logit scores then define embedding directions that drive conditional video synthesis, producing videos whose content reflects caption variations over object attributes, action adverbs, and scene context.

What carries the argument

Optimal transport coupling between VLM high-semantic embedding spaces and the latent conditioning space of a text-to-video generative model, with logit scores supplying the transport directions.
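
As a rough illustration of what an optimal transport coupling between two embedding sets looks like, the sketch below computes an entropy-regularised transport plan with plain Sinkhorn iterations in NumPy. It is not the paper's ρ-OT module: the embeddings are random stand-ins, and the names `sinkhorn_coupling`, `vlm_emb`, `t2v_emb`, and the value of `tau` are assumptions made only for this example.

```python
import numpy as np


def sinkhorn_coupling(cost, a, b, tau=0.5, n_iters=200):
    """Entropy-regularised transport plan between histograms a and b."""
    kernel = np.exp(-cost / tau)      # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (kernel @ v)          # rescale rows towards marginal a
        v = b / (kernel.T @ u)        # rescale columns towards marginal b
    return u[:, None] * kernel * v[None, :]


rng = np.random.default_rng(0)
vlm_emb = rng.normal(size=(6, 4))     # hypothetical VLM embeddings
t2v_emb = rng.normal(size=(5, 4))     # hypothetical generator conditioning vectors

# Squared Euclidean cost, normalised to [0, 1] for numerical stability.
cost = ((vlm_emb[:, None, :] - t2v_emb[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()

a = np.full(6, 1 / 6)                 # uniform mass on the VLM side
b = np.full(5, 1 / 5)                 # uniform mass on the generator side
plan = sinkhorn_coupling(cost, a, b)

# Barycentric projection: each VLM embedding becomes a plan-weighted
# average of generator-side vectors.
transported = (plan / plan.sum(axis=1, keepdims=True)) @ t2v_emb
```

The plan's row and column sums recover `a` and `b`, which is what makes it a coupling rather than an arbitrary similarity matrix. The barycentric projection is one common way to turn a plan into a point-to-point map.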

If this is right

Altering a single attribute in the input caption produces a corresponding visual change in the generated video that can be inspected to understand which visual cues the VLM used.
The same coupling can be applied across multiple VLMs to compare how different models encode the same scene elements.
Logit-driven video generation supplies a new form of interpretability that operates at the level of full visual sequences rather than attention maps or feature visualizations.
The method is model-independent once the transport map is learned, allowing reuse with any VLM whose embeddings can be extracted.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the transport map generalizes across different T2V backbones, it could become a standard post-hoc inspection tool for any caption-conditioned vision model.
The generated videos could serve as training data for further alignment between generative models and the semantic distinctions learned by VLMs.
Extending the coupling to video-to-video translation might allow direct editing of real footage to match VLM decision boundaries.

Load-bearing premise

The learned optimal transport coupling will faithfully capture and transfer the underlying rules that produce the VLM's logit predictions.

What would settle it

Generate videos for a fixed set of caption variants that produce known logit shifts in the VLM; if the visual differences in the videos do not align with the magnitude or direction of those logit shifts under controlled human or automated evaluation, the transport map has failed to transfer the decision rules.
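
The automated half of this check can be sketched as a rank correlation between logit shifts and measured visual differences. The numbers below are invented for illustration, and `logit_shift`, `video_shift`, and the helper names are hypothetical; in practice the visual differences would come from some embedding distance or human rating that the paper does not fix here.

```python
import numpy as np


def rank(values):
    """Ranks 0..n-1 of the values (assumes no ties)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(x), rank(y))[0, 1])


# One entry per caption variant (made-up values).
logit_shift = [0.05, 0.20, 0.35, 0.60, 0.90]   # VLM logit change vs. base caption
video_shift = [0.10, 0.25, 0.18, 0.55, 0.80]   # visual change in generated video

rho = spearman(logit_shift, video_shift)
print(f"rank agreement between logit and video shifts: {rho:.2f}")
```

A value near 1 would indicate that larger logit shifts come with larger visual changes. A value near 0 or below would count against the claim that the coupling transfers the decision rules.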

Figures

Figures reproduced from arXiv: 2511.18359 by Alexandros Stergiou.

**Figure 1.** Generated videos representing VLM logit modulations with TRANSPORTER. Videos corresponding to different logit predictions are obtained by coupling VLM embeddings to generative representations. Given a modulation of a VLM caption over an object, action, or scene attribute, TRANSPORTER guides the video generation process to reflect changes made in the token logit scores. Resulting embeddings are decoded int…

**Figure 2.** (a) L2V with TRANSPORTER: Embeddings z_Ξ ∈ ℝ^Ξ are coupled with network Φ and concept bank Q. (b) Coupling network Φ initially projects z_Ξ with condition π_Ξ to ẑ_Ω1 = Φ_Ω1(z_Ξ, π_Ξ). Latents ẑ_Ω2 ∈ ℝ^Ω are obtained with Φ_Ω2 over decoder D_Ξ and encoder E_Ω latents. The Learnable Optimal Transport (ρ-OT) module uses updatable projection vectors p_Ω1, p_Ω2 to transport embeddings to z̃_Ω. The divergence between pa…

**Figure 3.** Concept attribute control with TRANSPORTER given the caption: "A close up shot of a [attr] bowling ball hitting the pins in a bowling alley." Initially, red is used to obtain generator/VLM encodings π⁻_Ξ, π⁻_Ω. Vector q_red→blue is added to π⁻_Ξ based on divergence Δω to π⁺ = blue. As shown, generated videos are of high visual fidelity while they also preserve scene dynamics across modulations; e.g. camer…

**Figure 4.** Flow path modulation. Given latents z′_Ξ,t, two velocity fields are predicted for conditions π⁻, π⁺ at step t. Their latent divergence Δv corresponds to concept/attribute directions. … to coupling network Φ (Fig. 2b) so attribute modulations can be explored across both ℝ^Ξ and ℝ^Ω. Generator modulations. Initially, a video caption containing concept π⁻ is tokenized by π⁻_Ξ = T_Ξ(π) and π⁻_Ω = T_Ω(π) respe…

**Figure 5.** Preferred input generation with AM (top) and proposed L2V (bottom) based on VideoLLaMA 3 logits corresponding to walk. Beyond visualizing single logits, TRANSPORTER further enables generating videos to explore intermediate modulations of the logit distribution when shifting towards run. Multi-metric results. Tab. 2 compares embeddings of generated and real videos across VLM encoders. As shown, the ba…

**Figure 6.** Generated video modulations with TRANSPORTER across VLMs. Concept vectors can visualize videos corresponding to logit distributions over a variety of video attributes which can relate to (top) active objects and affordances, such as juggling balls or clubs, (middle) changes or details in the performance of actions, with front and back handspring, and (bottom) fine-grained scene details, such as holding a …

**Figure 7.** Generated videos with Phi 4 MM logits over alternative settings. (a) Divergence modulations can be done over combined attributes, such as cutting two peppers over thin strips (∆ω = 0) to cutting one pepper over thick strips (∆ω = 1). (b) TRANSPORTER modulations can be introduced at different generation steps to highlight differences of attribute modulations (∆ω) with larger divergence, as that of thin and …

**Figure 8.** Examples of active object modulations. Frame quality is compressed due to filesize (best viewed digitally)

**Figure 9.** Examples of action modulations. Frame quality is compressed due to filesize (best viewed digitally)

**Figure 10.** Examples of scene modulations. Frame quality is compressed due to filesize (best viewed digitally)

**Figure 11.** Examples of multiple modulations. Frame quality is compressed due to filesize (best viewed digitally)

Abstract

How do video understanding models acquire their answers? Although current Vision Language Models (VLMs) reason over complex scenes with diverse objects, action performances, and scene dynamics, understanding and controlling their internal processes remains an open challenge. Motivated by recent advancements in text-to-video (T2V) generative models, this paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos that capture the underlying rules behind VLMs' predictions. Given the high-visual-fidelity produced by T2V models, TRANSPORTER learns an optimal transport coupling to VLM's high-semantic embedding spaces. In turn, logit scores define embedding directions for conditional video generation. TRANSPORTER generates videos that reflect caption changes over diverse object attributes, action adverbs, and scene context. Quantitative and qualitative evaluations across VLMs demonstrate that L2V can provide a fidelity-rich, novel direction for model interpretability that has not been previously explored.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper introduces an L2V task that uses optimal transport to map VLM embeddings to T2V generators for producing videos from logit scores, but the alignment between the transport plan and the model's actual decision rules remains unproven.

The core contribution is a logits-to-video task paired with TRANSPORTER, which learns an optimal transport coupling between VLM high-semantic embeddings and text-to-video latent spaces so that logit scores can condition video generation. This lets the method produce videos that change with object attributes, action adverbs, and scene context, aiming for a more visual form of interpretability than typical attribution maps.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TRANSPORTER for a logits-to-video (L2V) task. It learns an optimal transport coupling between high-semantic embedding spaces of VLMs and text-to-video (T2V) generative models; logit scores then define directions for conditional video generation. The resulting videos are claimed to reflect caption changes over object attributes, action adverbs, and scene context, with quantitative and qualitative evaluations across VLMs supporting a novel interpretability direction.

Significance. If the transport mapping holds, the work offers a high-fidelity generative route to VLM interpretability that has not been previously explored, leveraging T2V priors to visualize internal prediction rules. This could meaningfully advance understanding of how VLMs reason over complex video scenes.

major comments (2)

[§3] §3 (Optimal Transport Coupling): The central claim requires that the OT plan, when conditioned on logit scores, produces videos whose changes faithfully reflect the specific rules or features driving VLM logit predictions. However, OT minimizes a global Wasserstein cost between manifolds and does not enforce local alignment with the VLM's decision boundary or attribution; if manifold curvature differs or the T2V prior injects unrelated factors, the generated videos may be plausible yet misaligned with the VLM's actual reasoning. A concrete verification (e.g., controlled ablation on known decision features) is needed.
[§5] §5 (Quantitative Evaluation): The abstract states that evaluations 'demonstrate' the L2V approach, yet no specific metrics, baselines, statistical tests, or effect sizes are referenced in the description of results. Without these, it is impossible to assess whether the transport mapping outperforms simpler alternatives or supports the fidelity-rich interpretability claim.

minor comments (2)

[§3] Notation for the embedding spaces, cost function, and conditioning on logits should be introduced with a single consistent diagram early in the method section to aid readability.
[Abstract] The abstract claims results 'across VLMs' but does not name the specific models or datasets used; this detail belongs in the abstract or a dedicated table.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. We address each major comment point by point below, providing clarifications and committing to revisions where appropriate to strengthen the manuscript.

Referee: [§3] §3 (Optimal Transport Coupling): The central claim requires that the OT plan, when conditioned on logit scores, produces videos whose changes faithfully reflect the specific rules or features driving VLM logit predictions. However, OT minimizes a global Wasserstein cost between manifolds and does not enforce local alignment with the VLM's decision boundary or attribution; if manifold curvature differs or the T2V prior injects unrelated factors, the generated videos may be plausible yet misaligned with the VLM's actual reasoning. A concrete verification (e.g., controlled ablation on known decision features) is needed.

Authors: We agree that the global nature of optimal transport does not inherently guarantee local alignment with VLM decision boundaries, and that manifold differences or T2V priors could introduce misalignments. Our current evidence relies on qualitative demonstrations where logit-driven shifts produce videos reflecting targeted attribute, action, and scene changes across VLMs. To provide the requested concrete verification, we will add a controlled ablation study in the revised manuscript: we will intervene on known VLM decision features (e.g., by altering specific object attributes in input frames) and measure the fidelity of corresponding changes in the generated videos under the OT coupling. revision: yes
Referee: [§5] §5 (Quantitative Evaluation): The abstract states that evaluations 'demonstrate' the L2V approach, yet no specific metrics, baselines, statistical tests, or effect sizes are referenced in the description of results. Without these, it is impossible to assess whether the transport mapping outperforms simpler alternatives or supports the fidelity-rich interpretability claim.

Authors: We acknowledge that the results description would benefit from greater explicitness on quantitative aspects. While the manuscript includes quantitative evaluations across VLMs in §5, we will revise this section to explicitly detail the metrics (e.g., semantic consistency and generation fidelity measures), baselines compared, statistical tests applied, and effect sizes observed. This will allow readers to better evaluate the transport mapping's performance relative to alternatives. revision: yes

Circularity Check

0 steps flagged

No circularity: method uses external OT and T2V without self-referential reduction

The derivation introduces TRANSPORTER as an optimal transport coupling between VLM embeddings and T2V generative latents, with logit scores used to define directions for conditional generation. No equations, fitted parameters, or self-citations in the abstract reduce the generated videos or interpretability claims to inputs defined by the same data or prior author work. The L2V task and evaluations rely on external models and standard OT, remaining self-contained and falsifiable against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based only on the abstract, the central claim rests on the existence of a learnable optimal transport coupling that preserves semantic directions; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5465 in / 1025 out tokens · 22842 ms · 2026-05-17T05:56:08.482343+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

TRANSPORTER learns an optimal transport coupling to VLM’s high-semantic embedding spaces... ρ-OT uses {p_Ω1,ρ} and {p_Ω2,ρ} sets of P learnable projection vectors... min_{γ_ρ} ∫ M dγ_ρ − τ ∫ γ_ρ log(γ_ρ)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

logit difference Δω computed using Hellinger distance... concept bank Q = {q_o : o ∈ O}
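
For readers unfamiliar with the Hellinger distance named in this passage, the sketch below computes it between two softmax distributions and uses the result to scale a concept vector. Only the Hellinger formula itself is standard. The `steer` function, the linear scaling by Δω, and all vectors are assumptions for illustration, loosely following the Figure 3 caption rather than reproducing the paper's procedure.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def hellinger(p, q):
    """Hellinger distance between discrete distributions, in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def steer(embedding, concept_vector, delta_omega):
    """Shift an embedding along a concept direction (assumed linear scaling)."""
    return np.asarray(embedding, dtype=float) + delta_omega * np.asarray(concept_vector, dtype=float)


# Hypothetical token logits for a base caption and a modulated caption.
p_minus = softmax([2.0, 0.5, -1.0])
p_plus = softmax([0.5, 2.0, -1.0])
delta_omega = hellinger(p_minus, p_plus)

# Hypothetical conditioning embedding and concept-bank entry q_o.
pi_minus = np.array([0.2, -0.1, 0.4, 0.0])
q_o = np.array([1.0, 0.0, -1.0, 0.5])
pi_plus = steer(pi_minus, q_o, delta_omega)
```

Because the distance is bounded between 0 (identical distributions) and 1 (disjoint support), it is a convenient scalar for interpolating between two caption variants.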

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

113 extracted references · 113 canonical work pages · 10 internal anchors

[1]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harri- son, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv:2412.08905, 2024. 1, 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkin- son, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, et al. Phi-4-mini technical report: Compact yet pow- erful multimodal language models via mixture-of-loras. arXiv:2503.01743, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Getting vit in shape: Scaling laws for compute-optimal model design.NeurIPS, 2023

Ibrahim M Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, and Lucas Beyer. Getting vit in shape: Scaling laws for compute-optimal model design.NeurIPS, 2023. 1

work page 2023
[4]

Building nor- malizing flows with stochastic interpolants.ICLR, 2023

Michael S Albergo and Eric Vanden-Eijnden. Building nor- malizing flows with stochastic interpolants.ICLR, 2023. 2

work page 2023
[5]

Re- fusal in language models is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Re- fusal in language models is mediated by a single direction. NeurIPS, 2024. 1

work page 2024
[6]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wen- bin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv:2502.13923, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Network dissection: Quantifying inter- pretability of deep visual representations

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying inter- pretability of deep visual representations. InCVPR, 2017. 2

work page 2017
[8]

Continuous, subject-specific attribute control in t2i models by identifying semantic directions

Stefan Andreas Baumann, Felix Krause, Michael Neumayr, Nick Stracke, Melvin Sevi, Vincent Tao Hu, and Björn Om- mer. Continuous, subject-specific attribute control in t2i models by identifying semantic directions. InCVPR, 2025. 2, 4, 6

work page 2025
[9]

Legrad: An explain- ability method for vision transformers via feature formation sensitivity

Walid Bousselham, Angie Boggust, Sofian Chaybouti, Hendrik Strobelt, and Hilde Kuehne. Legrad: An explain- ability method for vision transformers via feature formation sensitivity. InICCV, 2025. 2

work page 2025
[10]

Labeling neural representations with inverse recognition

Kirill Bykov, Laura Kopf, Shinichi Nakajima, Marius Kloft, and Marina Höhne. Labeling neural representations with inverse recognition. InNeurIPS, 2023. 2

work page 2023
[11]

Unsupervised learn- ing of visual features by contrasting cluster assignments

Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learn- ing of visual features by contrasting cluster assignments. NeurIPS, 2020. 3

work page 2020
[12]

Grad-CAM++: General- ized gradient-based visual explanations for deep convolu- tional networks

Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-CAM++: General- ized gradient-based visual explanations for deep convolu- tional networks. InWACV, 2018. 2

work page 2018
[13]

Generic attention- model explainability for interpreting bi-modal and encoder- decoder transformers

Hila Chefer, Shir Gur, and Lior Wolf. Generic attention- model explainability for interpreting bi-modal and encoder- decoder transformers. InICCV, 2021. 2

work page 2021
[14]

Plot: Prompt learning with optimal transport for vision-language models.ICLR, 2023

Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. Plot: Prompt learning with optimal transport for vision-language models.ICLR, 2023. 2

work page 2023
[15]

Interpreting and controlling vision foundation models via text explanations.arXiv:2310.10591, 2023

Haozhe Chen, Junfeng Yang, Carl V ondrick, and Chengzhi Mao. Interpreting and controlling vision foundation models via text explanations.arXiv:2310.10591, 2023. 2

work page arXiv 2023
[16]

Selfie: self-interpretation of large language model embeddings

Haozhe Chen, Carl V ondrick, and Chengzhi Mao. Selfie: self-interpretation of large language model embeddings. In ICML, 2024. 1

work page 2024
[17]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodal- ity, long context, and next generation agentic capabilities. arXiv:2507.06261, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Fluxspace: Disentangled semantic editing in rectified flow models

Yusuf Dalva, Kavana Venkatesh, and Pinar Yanardag. Fluxspace: Disentangled semantic editing in rectified flow models. InCVPR, 2025. 3

work page 2025
[19]

Im- plicit inversion turns clip into a decoder.arXiv:2505.23161,

Antonio D’Orazio, Maria Rosaria Briglia, Donato Crisos- tomi, Dario Loi, Emanuele Rodolà, and Iacopo Masi. Im- plicit inversion turns clip into a decoder.arXiv:2505.23161,

work page arXiv
[20]

An image is worth 16x16 words: Trans- formers for image recognition at scale.ICLR, 2021

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale.ICLR, 2021. 5

work page 2021
[21]

Weakly supervised semantic segmentation by pixel-to-prototype contrast

Ye Du, Zehua Fu, Qingjie Liu, and Yunhong Wang. Weakly supervised semantic segmentation by pixel-to-prototype contrast. InCVPR, 2022. 2

work page 2022
[22]

Deep insights into convolutional net- works for video recognition.IJCV, 2020

Christoph Feichtenhofer, Axel Pinz, Richard P Wildes, and Andrew Zisserman. Deep insights into convolutional net- works for video recognition.IJCV, 2020. 2

work page 2020
[23]

Unlock- ing feature visualization for deep network with magnitude constrained optimization.NeurIPS, 2023

Thomas Fel, Thibaut Boissin, Victor Boutin, Agustin Pi- card, Paul Novello, Julien Colin, Drew Linsley, Tom Rousseau, Rémi Cadène, Lore Goetschalckx, et al. Unlock- ing feature visualization for deep network with magnitude constrained optimization.NeurIPS, 2023. 2, 5, 6

work page 2023
[24]

A holistic approach to unifying automatic concept extraction and concept importance estimation

Thomas Fel, Victor Boutin, Louis Béthune, Rémi Cadène, Mazda Moayeri, Léo Andéol, Mathieu Chalvidal, and Thomas Serre. A holistic approach to unifying automatic concept extraction and concept importance estimation. In NeurIPS, 2023. 2

work page 2023
[25]

Craft: Concept recursive activation factor- ization for explainability

Thomas Fel, Agustin Picard, Louis Bethune, Thibaut Boissin, David Vigouroux, Julien Colin, Rémi Cadène, and Thomas Serre. Craft: Concept recursive activation factor- ization for explainability. InCVPR, 2023. 2

work page 2023
[26]

Interpretable explana- tions of black boxes by meaningful perturbation

Ruth C Fong and Andrea Vedaldi. Interpretable explana- tions of black boxes by meaningful perturbation. InICCV,

work page
[27]

Direct ascent syn- thesis: Revealing hidden generative capabilities in discrim- inative models.arXiv:2502.07753, 2025

Stanislav Fort and Jonathan Whitaker. Direct ascent syn- thesis: Revealing hidden generative capabilities in discrim- inative models.arXiv:2502.07753, 2025. 2

work page arXiv 2025
[28]

Interpreting clip’s image representation via text-based de- composition

Yossi Gandelsman, Alexei A Efros, and Jacob Steinhardt. Interpreting clip’s image representation via text-based de- composition. InICLR, 2024. 2

work page 2024
[29]

Concept sliders: Lora adap- tors for precise control in diffusion models

Rohit Gandikota, Joanna Materzy ´nska, Tingrui Zhou, An- tonio Torralba, and David Bau. Concept sliders: Lora adap- tors for precise control in diffusion models. InECCV, 2024. 2

work page 2024
[30]

Image style transfer using convolutional neural networks

Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. InCVPR, 2016. 3

work page 2016
[31]

Plug-in inversion: Model-agnostic inversion for vision with data augmenta- tions

Amin Ghiasi, Hamid Kazemi, Steven Reich, Chen Zhu, Micah Goldblum, and Tom Goldstein. Plug-in inversion: Model-agnostic inversion for vision with data augmenta- tions. InICML, 2022. 2

work page 2022
[32]

Arcee’s mergekit: A toolkit for merging large language models

Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. Arcee’s mergekit: A toolkit for merging large language models. In EMNLP, 2024. 3

work page 2024
[33]

Boosting the visual interpretability of clip via adversarial fine-tuning

Shizhan Gong, LEI Haoyu, Qi Dou, and Farzan Farnia. Boosting the visual interpretability of clip via adversarial fine-tuning. InICLR, 2025. 2

work page 2025
[34]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR, 2022. 5

work page 2022
[35]

Uncovering unique concept vectors through latent space decomposition

Mara Graziani, Laura O’ Mahony, An-Phi Nguyen, Hen- ning Müller, and Vincent Andrearczyk. Uncovering unique concept vectors through latent space decomposition. TMLR, 2023. 2

work page 2023
[36]

Gradvit: Gradient inversion of vision transformers

Ali Hatamizadeh, Hongxu Yin, Holger R Roth, Wenqi Li, Jan Kautz, Daguang Xu, and Pavlo Molchanov. Gradvit: Gradient inversion of vision transformers. InCVPR, 2022. 2, 5, 6

work page 2022
[37]

Clip knows image aesthetics.FAI, 2022

Simon Hentschel, Konstantin Kobs, and Andreas Hotho. Clip knows image aesthetics.FAI, 2022. 6

work page 2022
[38]

Natu- ral language descriptions of deep visual features

Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natu- ral language descriptions of deep visual features. InICLR,

work page
[39]

In- specting and editing knowledge representations in language models.COLM, 2024

Evan Hernandez, Belinda Z Li, and Jacob Andreas. In- specting and editing knowledge representations in language models.COLM, 2024. 1

work page 2024
[40]

Prompt-to-prompt im- age editing with cross attention control.ICLR, 2023

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt im- age editing with cross attention control.ICLR, 2023. 2

work page 2023
[41]

Clipscore: A reference-free eval- uation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free eval- uation metric for image captioning. InEMNLP, 2021. 6

work page 2021
[42]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 2022. 2

work page 2022
[43]

How do vision- language models process conflicting information across modalities?arXiv:2507.01790, 2025

Tianze Hua, Tian Yun, and Ellie Pavlick. How do vision- language models process conflicting information across modalities?arXiv:2507.01790, 2025. 2

work page arXiv 2025
[44]

MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

Animesh Jain and Alexandros Stergiou. Mimic: Multi- modal inversion for model interpretation and conceptual- ization.arXiv:2508.07833, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Pyramidal flow matching for efficient video generative modeling.ICLR, 2025

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling.ICLR, 2025. 2

work page 2025
[46]

Auto-encoding vari- ational bayes.ICLR, 2014

Diederik P Kingma and Max Welling. Auto-encoding vari- ational bayes.ICLR, 2014. 3

work page 2014
[47]

The sinkhorn–knopp algorithm: conver- gence and applications.SIMAX, 2008

Philip A Knight. The sinkhorn–knopp algorithm: conver- gence and applications.SIMAX, 2008. 1

work page 2008
[48]

Concept bottleneck models

Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. InICML, 2020. 2

work page 2020
[49]

Visual concept connectome (vcc): Open world concept discovery and their interlayer connections in deep models

Matthew Kowal, Richard P Wildes, and Konstantinos G Derpanis. Visual concept connectome (vcc): Open world concept discovery and their interlayer connections in deep models. InCVPR, 2024. 2

work page 2024
[50]

Interpretable generative models through post-hoc concept bottlenecks

Akshay Kulkarni, Ge Yan, Chung-En Sun, Tuomas Oikari- nen, and Tsui-Wei Weng. Interpretable generative models through post-hoc concept bottlenecks. InCVPR, 2025. 2

work page 2025
[51]

Beyond concept bottleneck models: How to make black boxes intervenable?NeurIPS, 2024

Sonia Laguna, Ri ˇcards Marcinkeviˇcs, Moritz Vandenhirtz, and Julia V ogt. Beyond concept bottleneck models: How to make black boxes intervenable?NeurIPS, 2024. 2

work page 2024
[52]

Clearclip: De- composing clip representations for dense vision-language inference

Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Clearclip: De- composing clip representations for dense vision-language inference. InECCV, 2024. 2

work page 2024
[53]

Demystifying neural style transfer

Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Demystifying neural style transfer. InIJCAI, 2017. 3

work page 2017
[54]

Towards visually explaining video understand- ing networks with perturbation

Zhenqiang Li, Weimin Wang, Zuoyue Li, Yifei Huang, and Yoichi Sato. Towards visually explaining video understand- ing networks with perturbation. InWACV, 2021. 2

[55] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. ICLR, 2023. 2

[56] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2019. 5

[57] Chengzhi Mao, Revant Teotia, Amrutha Sundar, Sachit Menon, Junfeng Yang, Xin Wang, and Carl Vondrick. Doubly right object recognition: A why prompt for visual rationales. In CVPR, 2023. 2

[58] Sachit Menon and Carl Vondrick. Visual classification via description from large language models. In ICLR, 2023. 2

[59] Mazda Moayeri, Keivan Rezaei, Maziar Sanjabi, and Soheil Feizi. Text-to-concept (and back) via cross-model alignment. In ICML, 2023. 2

[60] Nao Nakagawa, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Gromov-wasserstein autoencoders. In ICLR,

[61] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. NeurIPS, 2016. 2

[62] Anh Nguyen, Jason Yosinski, and Jeff Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. In ICMLw, 2016. 2

[63] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. 2

[64] Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, and Zeynep Akata. Sparse autoencoders learn monosemantic features in vision-language models. NeurIPS, 2025. 2

[65] Koyena Pal, Jiuding Sun, Andrew Yuan, Byron C Wallace, and David Bau. Future lens: Anticipating subsequent tokens from a single hidden state. CoNLL, 2023. 1

[66] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. JMLR, 2021. 2

[67] Rishubh Parihar, VS Sachidanand, Sabariswaran Mani, Tejan Karmali, and R Venkatesh Babu. Precisecontrol: Enhancing text-to-image diffusion models with fine-grained attribute control. In ECCV, 2024. 2

[68] Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models. In BMVC, 2018. 2

[69] Moritz Piening and Robert Beinert. A novel sliced fused gromov-wasserstein distance. arXiv:2508.02364, 2025. 3

[70] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 2

[71] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In ICML, 2017. 2

[72] Aleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms. In ICCV, 2023. 2

[73] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034,

[74] Divyansh Srivastava, Ge Yan, and Lily Weng. Vlg-cbm: Training concept bottleneck models with vision-language guidance. NeurIPS, 2024. 2

[75] Alexandros Stergiou. Lavib: A large-scale video interpolation benchmark. In NeurIPS, 2024. 5

[76] Alexandros Stergiou and Nikos Deligiannis. Leaping into memories: Space-time deep feature synthesis. In ICCV,

[77] Alexandros Stergiou and Ronald Poppe. About time: Advances, challenges, and outlooks of action understanding. IJCV, 2025. 1

[78] Alexandros Stergiou, Georgios Kapidis, Grigorios Kalliatakis, Christos Chrysoulas, Remco Veltkamp, and Ronald Poppe. Saliency tubes: Visual explanations for spatio-temporal convolutions. In ICIP, 2019. 2

[79] Elaine Sui, Xiaohan Wang, and Serena Yeung-Levy. Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models. In WACV, 2025. 2

[80] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In ICML, 2017. 2

