The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models

Alfio Ferrara; Elisabetta Rocchetti; Sergio Picascia

arxiv: 2507.23313 · v1 · pith:LLULRNTFnew · submitted 2025-07-31 · 💻 cs.CV

Pith reviewed 2026-05-21 23:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords text-to-image diffusion models · content and style separation · cross-attention analysis · artistic image generation · prompt interpretation

The pith

Text-to-image diffusion models show an internal separation of content and style when creating artworks from prompts, even without being taught this distinction explicitly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how transformer-based text-to-image diffusion models handle concepts of content and style in artistic images. By using cross-attention heatmaps, it attributes parts of the generated image to specific words in the prompt that describe either the subject or the artistic style. The results indicate that these models often separate these aspects naturally, with object details tied to content words and textures or backgrounds to style words. This matters because it suggests the models develop an understanding of artistic composition from vast training data alone.

Core claim

Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction.

What carries the argument

Cross-attention heatmaps, which attribute pixels in the generated image to individual prompt tokens, allowing isolation of regions influenced by content-describing tokens versus style-describing tokens.
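As a rough illustration of this attribution step, the sketch below groups per-token heatmaps into one content map and one style map and marks the pixels where the content group dominates. It assumes the heatmaps are already available as a NumPy array of shape (tokens, height, width); the toy values, the token indices, and the helper name `group_heatmap` are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def group_heatmap(token_maps, indices):
    """Sum the per-token heatmaps of one token group and rescale to [0, 1]."""
    combined = token_maps[indices].sum(axis=0)
    span = combined.max() - combined.min()
    if span == 0:
        return np.zeros_like(combined)
    return (combined - combined.min()) / span

# Toy stand-in for real heatmaps: 3 prompt tokens on a 4x4 image.
token_maps = np.zeros((3, 4, 4))
token_maps[0, 1:3, 1:3] = 1.0  # token 0 (content, e.g. "cow"): image centre
token_maps[1, 0, :] = 0.5      # token 1 (style): top row
token_maps[2, :, 0] = 0.5      # token 2 (style): left column

content_map = group_heatmap(token_maps, [0])
style_map = group_heatmap(token_maps, [1, 2])

# Pixels where the content group has strictly more influence than the style group.
content_dominant = content_map > style_map
```

Summing and min-max rescaling is only one possible way to aggregate a token group; a mean or a maximum over tokens would be equally defensible choices.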

If this is right

These models may generate more consistent artistic outputs when prompts clearly separate content and style elements.
The observed separation could enable better debugging and control of generated images by targeting specific prompt parts.
Understanding this internal representation contributes to explaining how large generative models handle complex concepts without direct supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could design prompts or fine-tuning strategies that exploit this separation for more precise artistic control.
This finding connects to broader questions about whether large models learn human-like conceptual distinctions from data patterns alone.

Load-bearing premise

That cross-attention heatmaps accurately reflect the model's internal conceptual separation of content from style rather than just correlating with surface features in the image.

What would settle it

Generating images from prompts where content and style descriptions are swapped and observing whether the attributed regions remain tied to the original categories or shift accordingly.

Original abstract

Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper maps content versus style tokens via cross-attention in artistic prompts and sees objects light up for content and backgrounds for style, but the support stays qualitative.

The main thing here is that the authors use cross-attention heatmaps to track how different tokens in an artistic prompt influence regions in the generated image. Content tokens tend to light up the main objects, while style tokens hit the backgrounds and textures. This is framed as showing an emergent grasp of the content-style split in diffusion models.

What the paper does well is to zero in on artistic prompts and share everything needed to look at the maps yourself. The GitHub link with code, dataset, and tool is practical and lets the community dig in directly. The abstract makes clear that this internal representation question had not been tackled this way before.

The soft spots come from the method being observational only. The findings depend on reading the heatmaps by eye across examples, without reported numbers on how reliable or strong the separation is. There is also no description of tests that would break the usual associations, like rephrasing prompts or swapping token types, to see if the pattern changes. This leaves open the possibility that the maps are reflecting common ways objects and styles are described in training data rather than a deeper conceptual split. The assumption that these attributions reveal internal understanding needs more backing to hold up firmly. Since the degree of separation varies with the prompt, it is not presented as a universal property anyway.

This work is for researchers focused on interpretability in text-to-image systems and how models organize artistic ideas without direct supervision. It offers a practical way to visualize prompt influence that could help with debugging or controlling generations. I would take it to the next reading group to discuss attention-based probing methods. I would not cite it in my own work right now because the results are still at the exploratory stage.

A serious editor should send this to peer review, as the question is relevant and the approach is accessible, even if revisions would be needed to add quantitative and causal elements.

Referee Report

3 major / 3 minor

Summary. The paper examines how transformer-based text-to-image diffusion models internally encode content and style when generating artworks from artistic prompts. It uses cross-attention heatmaps to attribute image pixels to individual prompt tokens, reporting that content-describing tokens predominantly influence object regions while style-describing tokens affect backgrounds and textures. The authors interpret these patterns as evidence of an emergent content-style separation without explicit supervision during training, and they release code, a dataset, and a visualization tool.

Significance. If the attention-map patterns can be shown to reflect conceptual separation rather than prompt co-occurrence statistics, the work would contribute to interpretability research in generative models by providing a concrete case study of unsupervised concept disentanglement in artistic domains. The public release of code and an exploratory visualization tool strengthens reproducibility and enables follow-up experiments.

major comments (3)

[§4] §4 (Results and Analysis): the central claim that the observed heatmap separation indicates an 'emergent understanding of the content-style distinction' rests on qualitative visual inspection alone. No quantitative metrics (e.g., region overlap with semantic segmentation masks or attention entropy scores) or statistical tests across a controlled prompt set are reported, making it impossible to assess the reliability or generality of the separation.
[§3 and §4] §3 (Methodology) and §4: the attribution of pixels to content versus style tokens via cross-attention assumes that heatmap activation isolates conceptual influence. No causal interventions—such as token ablation, prompt swapping while preserving semantics, or controlled variants that break surface correlations—are described. This leaves open the possibility that the patterns reflect training-data co-occurrence statistics (nouns as foreground objects, adjectives as texture descriptors) rather than an internal content-style ontology.
[§4] §4, Figure 3–5 examples: the reported 'varying degrees of content-style separation' are illustrated with selected prompts; the manuscript does not specify the total number of prompts tested, the selection criteria, or failure cases where separation collapses. Without this information the strength of the 'in many cases' qualifier cannot be evaluated.

minor comments (3)

[§1 and §2] The abstract and introduction use 'content-style separation' and 'content-style distinction' interchangeably; a brief explicit definition in §2 would improve clarity.
[Figures] Figure captions for the attention-map visualizations should state the exact prompt tokens highlighted and the diffusion timestep at which the maps were extracted.
[§2] The related-work section omits recent papers on prompt-based editing and attention visualization in diffusion models (e.g., Prompt-to-Prompt, Attend-and-Excite); adding 2–3 citations would better situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important opportunities to strengthen the empirical support for our claims. We address each major comment below and describe the revisions we intend to incorporate.

Referee: [§4] §4 (Results and Analysis): the central claim that the observed heatmap separation indicates an 'emergent understanding of the content-style distinction' rests on qualitative visual inspection alone. No quantitative metrics (e.g., region overlap with semantic segmentation masks or attention entropy scores) or statistical tests across a controlled prompt set are reported, making it impossible to assess the reliability or generality of the separation.

Authors: We agree that quantitative support is needed to substantiate the generality of the observed patterns. In the revised manuscript we will add two quantitative analyses: (1) average spatial overlap between content-token attention maps and object regions obtained from an off-the-shelf semantic segmentation model, and (2) per-token attention entropy scores comparing content versus style tokens. Both metrics will be reported with means, standard deviations, and statistical significance tests across the full prompt set. revision: yes
Referee: [§3 and §4] §3 (Methodology) and §4: the attribution of pixels to content versus style tokens via cross-attention assumes that heatmap activation isolates conceptual influence. No causal interventions—such as token ablation, prompt swapping while preserving semantics, or controlled variants that break surface correlations—are described. This leaves open the possibility that the patterns reflect training-data co-occurrence statistics (nouns as foreground objects, adjectives as texture descriptors) rather than an internal content-style ontology.

Authors: We acknowledge that attention-map correlations alone cannot rule out surface-level co-occurrence statistics. Our current contribution is an observational study of the model's internal representations; we will explicitly discuss this limitation in a new subsection. In addition, we will include a limited set of causal checks (token ablation on a subset of prompts and controlled prompt swaps that preserve semantics while altering surface statistics) to provide initial evidence that the separation is not solely attributable to training-data regularities. revision: partial
Referee: [§4] §4, Figure 3–5 examples: the reported 'varying degrees of content-style separation' are illustrated with selected prompts; the manuscript does not specify the total number of prompts tested, the selection criteria, or failure cases where separation collapses. Without this information the strength of the 'in many cases' qualifier cannot be evaluated.

Authors: We will revise the text and add an appendix that fully documents the experimental corpus: 150 prompts drawn from our released dataset, selected for balanced coverage of art movements and content categories while ensuring explicit separation between content and style descriptors. We will also describe representative failure cases (e.g., prompts with highly entangled tokens) and quantify how often clear separation occurs versus collapses. revision: yes
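The two quantitative analyses proposed in the first response (overlap with segmentation masks and per-token attention entropy) could be prototyped roughly as follows. This is a minimal sketch under stated assumptions: heatmaps and object masks are plain NumPy arrays, the relative threshold `tau` and both function names are hypothetical, and nothing here is taken from the authors' released code.

```python
import numpy as np

def attention_entropy(heatmap, eps=1e-12):
    """Shannon entropy (in nats) of a heatmap normalised to sum to 1.

    Low values mean the attention is concentrated on few pixels,
    high values mean it is spread over the image.
    """
    p = np.asarray(heatmap, dtype=float).ravel()
    total = p.sum()
    if total <= 0:
        raise ValueError("heatmap must contain positive mass")
    p = p / total
    return float(-(p * np.log(p + eps)).sum())

def mask_overlap(heatmap, object_mask, tau=0.5):
    """Fraction of above-threshold heatmap pixels that fall inside object_mask."""
    h = np.asarray(heatmap, dtype=float)
    if h.max() <= 0:
        return 0.0
    hot = h >= tau * h.max()  # threshold relative to the peak (an assumption)
    inside = hot & np.asarray(object_mask, dtype=bool)
    return float(inside.sum() / hot.sum())

# Toy data: a focused "content" map, a diffuse "style" map, and an object mask.
focused = np.zeros((4, 4))
focused[1:3, 1:3] = 1.0
diffuse = np.ones((4, 4))
object_mask = np.zeros((4, 4), dtype=bool)
object_mask[1:3, 1:3] = True

entropy_focused = attention_entropy(focused)
entropy_diffuse = attention_entropy(diffuse)
overlap_focused = mask_overlap(focused, object_mask)
overlap_diffuse = mask_overlap(diffuse, object_mask)
```

In a real evaluation the object mask would come from a segmentation model applied to the generated image, and the scores would be aggregated over the whole prompt set before any significance test.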

Circularity Check

0 steps flagged

No circularity: empirical heatmap analysis is self-contained

The paper performs an observational study by extracting and visualizing cross-attention maps from a standard text-to-image diffusion model during artwork generation. Region attribution to content versus style tokens follows directly from the model's existing attention computation without any parameter fitting, redefinition of inputs, or load-bearing self-citations that presuppose the reported separation. The observed patterns (content tokens influencing object regions, style tokens influencing backgrounds) are presented as empirical outcomes rather than derived quantities that reduce to the analysis method itself, rendering the central claim independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work relies on standard assumptions about attention mechanisms in transformer-based diffusion models and the traditional computer-vision premise that content and style can be treated as separable.

axioms (1)

domain assumption Cross-attention heatmaps in diffusion models can be used to attribute influence from specific prompt tokens to regions in the generated image.
This premise is invoked to justify isolating content versus style effects via pixel attribution.

pith-pipeline@v0.9.0 · 5744 in / 1218 out tokens · 53617 ms · 2026-05-21T23:38:00.690520+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens... compute the Intersection over Union (IoU) between the content token mask DIτC and the style token mask DIτS

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: IoUCS remains consistently and significantly lower than mIoUB... a positive Δ value suggests that content and style tokens attend to distinct spatial regions
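The IoU between a thresholded content mask and a thresholded style mask, as described in the quoted passages, can be sketched as below. The peak normalisation before thresholding, the toy heatmaps, and the function names are assumptions made for illustration. The baseline mIoUB and the difference Δ are not defined in the excerpts on this page, so the sketch stops at the content-style IoU.

```python
import numpy as np

def binarize(heatmap, tau):
    """Binary mask of pixels whose peak-normalised value is at least tau."""
    h = np.asarray(heatmap, dtype=float)
    peak = h.max()
    if peak <= 0:
        return np.zeros(h.shape, dtype=bool)
    return (h / peak) >= tau

def iou(mask_a, mask_b):
    """Intersection over Union of two boolean masks (0.0 if both are empty)."""
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(mask_a, mask_b).sum() / union)

# Toy heatmaps on a 4x4 image.
content = np.zeros((4, 4))
content[1:3, 1:3] = 1.0   # content token: image centre
style = np.zeros((4, 4))
style[0, :] = 1.0         # style token: top row
style[1, 1] = 0.8         # plus a weaker response on one centre pixel

iou_cs_low_tau = iou(binarize(content, 0.5), binarize(style, 0.5))
iou_cs_high_tau = iou(binarize(content, 0.9), binarize(style, 0.9))
```

In this toy case the overlap is 1/8 at τ = 0.5 and 0 at τ = 0.9, which shows why the measured separation has to be reported across several threshold settings rather than a single one.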

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 2 internal anchors

[1] The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models. INTRODUCTION: Nowadays, we employ models that take a textual prompt as input and generate an image as output, an image which, in most cases, closely reflects the description provided in the prompt. These text-to-image (txt2img) models are trained on billions of images sourced from the internet, including artworks from open-access, labeled online reposit...

[2] RELATED WORK: In this section, we begin by introducing key concepts related to txt2img models and their underlying mechanisms. We then review relevant literature on the interpretability of these models. Finally, we introduce the task of content–style disentanglement and discuss prior work that addresses this challenge. txt2img generation and diffusion...

[3] METHODOLOGY: The objective of this study is to quantify how transformer-based txt2img diffusion models distinguish between content and style concepts when generating paintings. Specifically, we analyze the spatial distribution of cross-attention values assigned to image pixels from content and style tokens in the conditioning prompt. We systematically con...

[4] EXPERIMENTAL SETUP: In this section, we describe how we construct the prompts to generate images, the chosen txt2img model and the choices of threshold τ for binary heatmap construction. Prompt construction. To populate the templates described in Section 3 with diverse and representative content elements, we utilize the 80 object class labels from the MS...

[5] RESULTS: In this section, we present both quantitative and qualitative results illustrating how SDXL behaves when selecting between content and style tokens as sources of information during the image generation process. Quantitative evaluation. Figure 2 illustrates how IoUCS, mIoUB, and their difference ∆ vary across different threshold (τ) configurati...

[6] CONCLUSIONS: This work investigates how transformer-based txt2img diffusion models convey the concepts of content and style when generating paintings. We analyse IoU scores computed on DAAM corresponding to the content and style components in the input prompt. Results indicate that, on average, content and style components tend to influence complementa...
[7] Babak Saleh and Ahmed M. Elgammal, “Large-scale classification of fine-art paintings: Learning the right metric on the right feature,” CoRR, vol. abs/1505.00855, 2015

[8] Antonio Somaini, “Algorithmic images: Artificial intelligence and visual culture,” Grey Room, no. 93, pp. 74–115, 10 2023

[9] Yankun Wu, Yuta Nakashima, and Noa Garcia, “Not only generative art: Stable diffusion for content-style disentanglement in art analysis,” in Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, ICMR 2023, Thessaloniki, Greece, June 12-15, 2023, pp. 199–208, ACM

[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” in Advances in Neural Information Processing Systems. 2020, vol. 33, pp. 6840–6851, Curran Associates, Inc

[12] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 Jul...

[13] Diederik P. Kingma and Max Welling, “Auto-encoding variational bayes,” in 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014

[14] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015 - 18th International Conference Munich, Germany, October 5 - 9, 2015, Proceedings, Part III. 2015, vol. 9351 of Lecture Notes in Computer Science, pp. 234–241...

[15] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-resolution image synthesis with latent diffusion models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. 2022, pp. 10674–10685, IEEE

[16] Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, and Ferhan Ture, “What the DAAM: Interpreting Stable Diffusion Using Cross Attention,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023, pp. 5644–5659, Association for Co...

[17] Alec Helbling, Tuna Han Salih Meral, Benjamin Hoover, Pinar Yanardag, and Duen Horng Chau, “Conceptattention: Diffusion transformers learn highly interpretable features,” CoRR, vol. abs/2502.04320, 2025

[18] Gabriela Ben Melech Stan, Estelle Aflalo, Raanan Yehezkel Rohekar, Anahita Bhiwandiwalla, Shao-Yen Tseng, Matthew Lyle Olson, Yaniv Gurwicz, Chenfei Wu, Nan Duan, and Vasudev Lal, “Lvlm-interpret: An interpretability tool for large vision-language models,” in The 3rd Explainable AI for Computer Vision (XAI4CV) Workshop, XAI4CV 2024, Workshop at CVPR 202...

[19] Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, and Mar González-Franco, “Diffuse, attend, and segment: Unsupervised zero-shot segmentation using stable diffusion,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. 2024, pp. 3554–3563, IEEE

[20] Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, and Dong Xu, “Diffusion model is secretly a training-free open vocabulary semantic segmenter,” IEEE Trans. Image Process., vol. 34, pp. 1895–1907, 2025

[21] Chaofan Ma, Yuhuan Yang, Chen Ju, Fei Zhang, Jinxiang Liu, Yu Wang, Ya Zhang, and Yanfeng Wang, “Diffusionseg: Adapting diffusion towards unsupervised object discovery,” CoRR, vol. abs/2303.09813, 2023

[22] Daniel J. Graham, James M. Hughes, Helmut Leder, and Daniel N. Rockmore, “Statistics, vision, and the analysis of artistic style,” WIREs Computational Statistics, vol. 4, no. 2, pp. 115–123, 2012

[23] Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Abhinav Shrivastava, and Tom Goldstein, “Measuring style similarity in diffusion models,” CoRR, vol. abs/2404.01292, 2024

[24] Dmytro Kotovenko, Artsiom Sanakoyeu, Sabine Lang, and Bjorn Ommer, “Content and style disentanglement for artistic style transfer,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 4422–4431

[25] Hadi Kazemi, Seyed Mehdi Iranmanesh, and Nasser Nasrabadi, “Style and content disentanglement in generative adversarial networks,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 848–856

[26] Aviv Gabbay and Yedid Hoshen, “Improving style-content disentanglement in image-to-image translation,” CoRR, vol. abs/2007.04964, 2020

[27] Jonas Oppenlaender, “A taxonomy of prompt modifiers for text-to-image generation,” Behav. Inf. Technol., vol. 43, no. 15, pp. 3763–3776, 2024

[28] Vivian Liu and Lydia B Chilton, “Design Guidelines for Prompt Engineering Text-to-Image Generative Models,” in Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 2022, CHI ’22, pp. 1–23, Association for Computing Machinery

[29] Yu Bai, Heyan Huang, Cesare Spinoso-Di Piano, Marc-Antoine Rondeau, Sanxing Chen, Yang Gao, and Jackie Chi Kit Cheung, “Identifying and analyzing performance-critical tokens in large language models,” 2025

[30] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, “Microsoft COCO: Common Objects in Context,” in Computer Vision – ECCV 2014. 2014, pp. 740–755, Springer International Publishing

[31] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” in The Twelfth International Conference on Learning Representations, 2024


work page 2023

[17] [17]

Con- ceptattention: Diffusion transformers learn highly inter- pretable features,

Alec Helbling, Tuna Han Salih Meral, Benjamin Hoover, Pinar Yanardag, and Duen Horng Chau, “Con- ceptattention: Diffusion transformers learn highly inter- pretable features,” CoRR, vol. abs/2502.04320, 2025

work page arXiv 2025

[18] [18]

Lvlm-intrepret: An interpretability tool for large vision-language models,

Gabriela Ben Melech Stan, Estelle Aflalo, Raanan Yehezkel Rohekar, Anahita Bhiwandiwalla, Shao-Yen Tseng, Matthew Lyle Olson, Yaniv Gur- wicz, Chenfei Wu, Nan Duan, and Vasudev Lal, “Lvlm-intrepret: An interpretability tool for large vision-language models,” in The 3rd Explainable AI for Computer Vision (XAI4CV) Workshop, XAI4CV 2024, Workshop at CVPR 202...

work page 2024

[19] [19]

Diffuse, attend, and segment: Unsupervised zero-shot segmentation using stable diffusion,

Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, and Mar González-Franco, “Diffuse, attend, and segment: Unsupervised zero-shot segmentation using stable diffusion,” in IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, CVPR 2024, Seat- tle, WA, USA, June 16-22, 2024 . 2024, pp. 3554–3563, IEEE

work page 2024

[20] [20]

Diffu- sion model is secretly a training-free open vocabulary semantic segmenter,

Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, and Dong Xu, “Diffu- sion model is secretly a training-free open vocabulary semantic segmenter,” IEEE Trans. Image Process., vol. 34, pp. 1895–1907, 2025

work page 1907

[21] [21]

Dif- fusionseg: Adapting diffusion towards unsupervised ob- ject discovery,

Chaofan Ma, Yuhuan Yang, Chen Ju, Fei Zhang, Jinxi- ang Liu, Yu Wang, Ya Zhang, and Yanfeng Wang, “Dif- fusionseg: Adapting diffusion towards unsupervised ob- ject discovery,” CoRR, vol. abs/2303.09813, 2023

work page arXiv 2023

[22] [22]

Statistics, vision, and the anal- ysis of artistic style,

Daniel J. Graham, James M. Hughes, Helmut Leder, and Daniel N. Rockmore, “Statistics, vision, and the anal- ysis of artistic style,” WIREs Computational Statistics, vol. 4, no. 2, pp. 115–123, 2012

work page 2012

[23] [23]

Measuring style similarity in diffusion models.arXiv preprint arXiv:2404.01292, 2024

Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Ab- hinav Shrivastava, and Tom Goldstein, “Measuring style similarity in diffusion models,” CoRR, vol. abs/2404.01292, 2024

work page arXiv 2024

[24] [24]

Content and style disentangle- ment for artistic style transfer,

Dmytro Kotovenko, Artsiom Sanakoyeu, Sabine Lang, and Bjorn Ommer, “Content and style disentangle- ment for artistic style transfer,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 4422–4431

work page 2019

[25] [25]

Style and content disentanglement in gener- ative adversarial networks,

Hadi Kazemi, Seyed Mehdi Iranmanesh, and Nasser Nasrabadi, “Style and content disentanglement in gener- ative adversarial networks,” in 2019 IEEE Winter Con- ference on Applications of Computer Vision (WACV) . IEEE, 2019, pp. 848–856

work page 2019

[26] [26]

Improving style- content disentanglement in image-to-image translation,

Aviv Gabbay and Yedid Hoshen, “Improving style- content disentanglement in image-to-image translation,” CoRR, vol. abs/2007.04964, 2020

work page arXiv 2007

[27] [27]

A taxonomy of prompt modifiers for text-to-image generation,

Jonas Oppenlaender, “A taxonomy of prompt modifiers for text-to-image generation,” Behav. Inf. Technol., vol. 43, no. 15, pp. 3763–3776, 2024

work page 2024

[28] [28]

Design Guidelines for Prompt Engineering Text-to-Image Generative Models,

Vivian Liu and Lydia B Chilton, “Design Guidelines for Prompt Engineering Text-to-Image Generative Models,” in Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 2022, CHI ’22, pp. 1–23, Association for Computing Machinery

work page 2022

[29] [29]

Identifying and analyzing performance-critical tokens in large language models,

Yu Bai, Heyan Huang, Cesare Spinoso-Di Piano, Marc- Antoine Rondeau, Sanxing Chen, Yang Gao, and Jackie Chi Kit Cheung, “Identifying and analyzing performance-critical tokens in large language models,” 2025

work page 2025

[30] [30]

Microsoft COCO: Common Ob- jects in Context,

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, “Microsoft COCO: Common Ob- jects in Context,” in Computer Vision – ECCV 2014 . 2014, pp. 740–755, Springer International Publishing

work page 2014

[31] [31]

Sdxl: Improving latent diffu- sion models for high-resolution image synthesis,

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach, “Sdxl: Improving latent diffu- sion models for high-resolution image synthesis,” in The Twelfth International Conference on Learning Rep- resentations, 2024

work page 2024