pith. machine review for the scientific record.

arxiv: 2604.20936 · v1 · submitted 2026-04-22 · 💻 cs.MM · cs.CV · cs.HC

Recognition: unknown

AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

Adam Cole, Mick Grierson

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 22:41 UTC · model grok-4.3

classification 💻 cs.MM · cs.CV · cs.HC
keywords: attentionbender · cross-attention · video · probe · aesthetics · artists · beyond control

The pith

AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a software tool called AttentionBender. It takes the internal cross-attention maps that video diffusion transformers use to connect text prompts to image regions and applies simple 2D image operations such as rotating, scaling, or shifting those maps. These altered maps are then fed back into the model during generation. The authors generated more than 4,500 short videos using different prompts, different transforms, and different layers of the transformer. By watching the outputs, they observed that changing attention in one place rarely stays local. Instead, the changes spread across the video in complex ways, often creating glitchy or distorted visuals rather than clean edits. This suggests the model's attention is highly interconnected. The work follows a research-through-design method in which building and using the tool itself is the main way of learning about the system. The resulting videos serve two purposes: they help artists understand and work around the model's default behaviors, and they produce new visual styles that the model would not generate from prompts alone.
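The intercept-transform-reinject loop lends itself to a short sketch. Below is a minimal, illustrative PyTorch rendering of that loop as described above and in Figure 4; the tensor shapes, the translation example, and the function name are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch of the AttentionBender loop: intercept a cross-attention
# map, reshape its query axis into the latent video grid (Frames, Height,
# Width), apply a 2D transform per frame, flatten back, and finish the
# attention step against V. Shapes and the translate example are assumptions.
import torch

def bent_cross_attention(Q, K, V, frames, height, width, dy=0, dx=0):
    """Scaled dot-product cross-attention with the attention map translated
    in latent (H, W) space before it weights the Values.

    Q: (B, heads, frames*height*width, d)  latent video queries
    K, V: (B, heads, n_text_tokens, d)     text-conditioning keys/values
    """
    scale = Q.shape[-1] ** -0.5
    attn = torch.softmax(Q @ K.transpose(-2, -1) * scale, dim=-1)  # (B, h, FHW, T)

    B, h, _, T = attn.shape
    maps = attn.view(B, h, frames, height, width, T)       # query axis -> 3D grid
    maps = torch.roll(maps, shifts=(dy, dx), dims=(3, 4))  # 2D translate (wraps)
    attn = maps.reshape(B, h, frames * height * width, T)  # back to a sequence
    return attn @ V  # (B, h, FHW, d); continues through the DiT block as usual
```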

Core claim

Our results suggest that cross-attention is highly entangled: targeted manipulations often resist clean, localized control, producing distributed distortions and glitch aesthetics over linear edits.

Load-bearing premise

That applying 2D geometric transforms to attention maps isolates the effect of cross-attention on generation without introducing unrelated artifacts from the manipulation process itself.

Figures

Figures reproduced from arXiv: 2604.20936 by Adam Cole, Mick Grierson.

Figure 1: Probing the internal material of generative video with …
Figure 2: AI video generation user interface for three major …
Figure 3: Attention maps visualized for the token "horse" …
Figure 4: Technical diagram of the AttentionBender pipeline. Cross-attention maps are intercepted, reshaped into the 3D latent video shape (Frames × Height × Width), transformed (translate, rotate, etc.), flattened back to a sequence, multiplied by the Values (V) to complete the attention process, and then continued through the standard DiT process.
Figure 5: The Comparative Visualization Interface. A grid …
Figure 6: Filter Menu. Enables targeted navigation of the …
Figure 7: Translation consistently shifts the position of the …
Figure 8: Scaling changes the spatial footprint of the prompt's …
Figure 9: A small scale increase (1.04×) shows a subtle but legible difference in subject size relative to the baseline, illustrating the narrow range where scaling yields a meaningful edit. Flip operations manipulate spatial layout, but not canonical orientation. The flip transformation makes the gap between attention-map geometry and semantic structure especially visible …
Figure 10: Rotate increasingly degrades coherence with …
Figure 12: Sharpen increases perceived detail and definition …
Figure 13: Blur reduces fine-grained texture without behav…
Figure 14: Amplify increases subject detail and visible repre…
Figure 16: DiT-layer targeting. Outputs remain comparatively …
Figure 17: Exploding Poetic Pixels. Extreme geometric in…
Figure 18: Extreme and Ambiguous Representations. Filter …
Figure 19: Parametric axis as creative structure. Increasing …
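The captions above span the tool's transform vocabulary (translate, scale, flip, rotate, sharpen, blur, amplify). Below is a hedged sketch of how such per-frame 2D operations could be dispatched and swept along a parametric axis, in the spirit of the comparative grids; the operation set, parameter ranges, and torchvision-based implementations are illustrative assumptions, not the authors' exact tool.

```python
# Hypothetical dispatch of per-frame 2D operations on one token's attention
# maps, swept along a parametric axis. The operation set and ranges are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torchvision.transforms.functional as TF

def apply_op(maps: torch.Tensor, op: str, strength: float) -> torch.Tensor:
    """maps: (Frames, H, W) attention maps for one text token."""
    x = maps.unsqueeze(1)  # (F, 1, H, W): torchvision ops expect a channel dim
    if op == "rotate":     # strength in degrees
        return TF.rotate(x, angle=strength).squeeze(1)
    if op == "scale":      # strength as a zoom factor about the center
        return TF.affine(x, angle=0.0, translate=[0, 0],
                         scale=strength, shear=[0.0]).squeeze(1)
    if op == "flip":       # horizontal flip of the map's layout
        return torch.flip(maps, dims=[-1])
    if op == "blur":       # strength controls an odd Gaussian kernel size
        return TF.gaussian_blur(x, kernel_size=2 * int(strength) + 1).squeeze(1)
    if op == "amplify":    # simple gain on attention mass
        return maps * strength
    raise ValueError(f"unknown op: {op}")

# Sweep one operation along a parametric axis, e.g. scale around 1.0:
maps = torch.rand(16, 32, 32)  # placeholder attention maps
variants = [apply_op(maps, "scale", s) for s in (0.96, 1.00, 1.04)]
```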
read the original abstract

We present AttentionBender, a tool that manipulates cross-attention in Video Diffusion Transformers to help artists probe the internal mechanics of black-box video generation. While generative outputs are increasingly realistic, prompt-only control limits artists' ability to build intuition for the model's material process or to work beyond its default tendencies. Using an autobiographical research-through-design approach, we built on Network Bending to design AttentionBender, which applies 2D transforms (rotation, scaling, translation, etc.) to cross-attention maps to modulate generation. We assess AttentionBender by visualizing 4,500+ video generations across prompts, operations, and layer targets. Our results suggest that cross-attention is highly entangled: targeted manipulations often resist clean, localized control, producing distributed distortions and glitch aesthetics over linear edits. AttentionBender contributes a tool that functions both as an Explainable AI style probe of transformer attention mechanisms, and as a creative technique for producing novel aesthetics beyond the model's learned representational space.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Circularity Check

0 steps flagged

No circularity: empirical observations from tool application are independent of inputs

full rationale

The paper's derivation chain consists of designing AttentionBender by applying 2D geometric transforms to cross-attention maps, then reporting observations from 4,500+ generated videos. No equations, fitted parameters, or predictions appear; the claim of entanglement follows directly from visualized distributed distortions rather than reducing to self-definition, self-citation chains, or renamed known results. The approach is self-contained as research-through-design with no load-bearing steps that equate outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper contains no mathematical derivations, fitted parameters, or postulated entities; it is an empirical design exploration whose claims rest on the assumption that the chosen transforms meaningfully expose model behavior.

pith-pipeline@v0.9.0 · 5471 in / 1185 out tokens · 35653 ms · 2026-05-09T22:41:50.259601+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

83 extracted references · 43 canonical work pages · 8 internal anchors

  1. [1]

    Ahmed M. Abuzuraiq and Philippe Pasquier. 2025. Explainability-in-Action: Enabling Expressive Manipulation and Tacit Understanding by Bending Diffusion Models in ComfyUI. arXiv:2508.07183 [cs] doi:10.48550/arXiv.2508.07183

  2. [2]

Philip E. Agre. 1997. Toward a Critical Technical Practice: Lessons Learned in Trying to Reform AI. In Social Science, Technical Systems, and Cooperative Work. Psychology Press

  3. [3]

Giacomo Aldegheri, Alina Rogalska, Ahmed Youssef, and Eugenia Iofinova. 2023. Hacking Generative Models with Differentiable Network Bending. arXiv:2310.04816 [cs] doi:10.48550/arXiv.2310.04816

  4. [4]

    Hacking Generative Models with Differentiable Network Bending. arXiv:2310.04816 [cs] doi:10.48550/arXiv.2310.04816

  5. [5]

    Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. 2021. ViViT: A Video Vision Transformer. arXiv:2103.15691 [cs] doi:10.48550/arXiv.2103.15691

  6. [6]

Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. 2019. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI. arXiv:1910...

  7. [7]

    Jordan Belson. 1967. Samadhi

  8. [8]

    Stan Brakhage. 1963. Mothlight

  9. [9]

Terence Broad. 2024. Using Generative AI as an Artistic Material: A Hacker's Guide. Proceedings of Explainable AI for the Arts Workshop 2024 (XAIxArts 2024) 1, 1 (2024)

  10. [10]

Terence Broad, Frederic Fol Leymarie, and Mick Grierson. 2021. Network Bending: Expressive Manipulation of Deep Generative Models. In Artificial Intelligence in Music, Sound, Art and Design: 10th International Conference, EvoMUSART 2021, Held as Part of EvoStar 2021, Virtual Event, April 7–9, 2021, Proceedings. Springer-Verlag, Berlin, Heidelberg, 20–36. ...

  11. [11]

Nick Bryan-Kinns, Berker Banar, Corey Ford, Courtney N. Reed, Yixiao Zhang, Simon Colton, and Jack Armitage. 2023. Exploring XAI for the Arts: Explaining Latent Space in Generative Music. arXiv:2308.05496 [cs] doi:10.48550/arXiv.2308.05496

  12. [12]

Nick Bryan-Kinns, Corey Ford, Alan Chamberlain, Steven David Benford, Helen Kennedy, Zijin Li, Wu Qiong, Gus G. Xia, and Jeba Rezwana. 2023. Explainable AI for the Arts: XAIxArts. In Proceedings of the 15th Conference on Creativity and Cognition (C&C ’23). Association for Computing Machinery, New York, NY, USA, 1–7. doi:10.1145/3591196.3593517

  13. [13]

    Nick Bryan-Kinns, Shuoyang Jasper Zheng, Francisco Castro, Makayla Lewis, Jia-Rey Chang, Gabriel Vigliensoni, Terence Broad, Michael Clemens, and Elizabeth Wilson. 2025. XAIxArts Manifesto: Explainable AI for the Arts. arXiv:2502.21220 [cs] doi:10.1145/3706599.3716227

  14. [14]

Jenna Burrell. 2016. How the Machine ‘Thinks’: Understanding Opacity in Machine Learning Algorithms. Big Data & Society 3(1) (2016). doi:10.1177/2053951715622512

  15. [15]

Antoine Caillon and Philippe Esling. 2022. RAVE: A Variational Autoencoder for Fast and High-Quality Neural Audio Synthesis. In International Conference on Learning Representations, Vol. 10

  16. [16]

David Casacuberta and Ariel Guersenzvaig. 2025. Disembodied Creativity in Generative AI: Prima Facie Challenges and Limitations of Prompting in Creative Practice. Frontiers in Artificial Intelligence 8 (Aug. 2025). doi:10.3389/frai.2025.1651354

  17. [17]

Eva Cetinic and James She. 2022. Understanding and Creating Art with AI: Review and Outlook. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) (2022)

  18. [18]

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. 2023. Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models. arXiv:2301.13826 [cs] doi:10.48550/arXiv.2301.13826

  19. [19]

    Hila Chefer, Shir Gur, and Lior Wolf. 2021. Transformer Interpretability Beyond Attention Visualization. arXiv:2012.09838 [cs] doi:10.48550/arXiv.2012.09838

  20. [20]

Nicolas Collins. 2006. Handmade Electronic Music: The Art of Hardware Hacking. Taylor & Francis

  21. [21]

    Comfy-Org. 2023. ComfyUI: The Most Powerful and Modular Stable Diffusion GUI and Backend. https://github.com/comfy-org/comfyui

  22. [22]

Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc., 8780–8794

  23. [23]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2020. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations

  24. [24]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2020. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)

  25. [25]

    Luke Dzwonczyk, Carmine Emanuele Cella, and David Ban. 2024. Network Bending of Diffusion Models for Audio-Visual Generation. arXiv:2406.19589 [cs] doi:10.48550/arXiv.2406.19589

  26. [26]

Luke Dzwonczyk, Carmine-Emanuele Cella, and David Ban. 2025. Generating Music Reactive Videos by Applying Network Bending to Stable Diffusion. Journal of the Audio Engineering Society 73, 6 (2025), 388–398. doi:10.17743/jaes.2022.0210

  27. [27]

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. 2024. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. In Proceedings of the 41st International Conference on Machine Learning...

  28. [28]

Angus Finney, Brian Tarran, and Rishi Coupland. 2025. AI in the Screen Sector: Perspectives and Paths Forward. Technical Report. CoSTAR Foresight Lab. doi:10.5281/zenodo.15601301

  29. [29]

Reed Ghazala. 2005. Circuit-Bending: Build Your Own Alien Instruments. Wiley

  30. [30]

Drew Hemment, Dave Murray-Rust, Vaishak Belle, Ruth Aylett, Matjaz Vidmar, and Frank Broz. 2024. Experiential AI: Between Arts and Explainable AI. Leonardo 57, 3 (June 2024), 298–306. doi:10.1162/leon_a_02524

  31. [31]

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2022. Prompt-to-Prompt Image Editing with Cross Attention Control. arXiv:2208.01626 [cs] doi:10.48550/arXiv.2208.01626

  32. [32]

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models (DDPM). In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 6840–6851

  33. [33]

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. 2022. Video Diffusion Models. arXiv:2204.03458 [cs] doi:10.48550/arXiv.2204.03458

  34. [34]

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations

  35. [35]

    Tom Hume. 2025. Meet Flow: AI-powered Filmmaking with Veo 3

  36. [36]

Tero Karras, Samuli Laine, and Timo Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4396–4405. doi:10.1109/CVPR.2019.00453

  37. [37]

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of StyleGAN. arXiv:1912.04958 [cs, eess, stat] doi:10.48550/arXiv.1912.04958

  38. [38]

Diederik P. Kingma and Max Welling. 2013. Auto-Encoding Variational Bayes. In International Conference on Learning Representations

  39. [39]

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

  40. [40]

Vivian Liu and Lydia B Chilton. 2022. Design Guidelines for Prompt Engineering Text-to-Image Generative Models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI ’22). Association for Computing Machinery, New York, NY, USA, 1–23. doi:10.1145/3491102.3501825

  41. [41]

    Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, and Mingyu Ding. 2023. VDT: General-purpose Video Diffusion Transformers via Mask Modeling. arXiv:2305.13311 [cs] doi:10.48550/arXiv.2305.13311

  42. [42]

Yao Lyu, He Zhang, Shuo Niu, and Jie Cai. 2024. A Preliminary Exploration of YouTubers’ Use of Generative-AI in Content Creation. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA ’24). Association for Computing Machinery, New York, NY, USA, 1–7. doi:10.1145/3613905.3651057

  43. [43]

Daniel Manz and Mick Grierson. 2025. Brave: Designing an Embedded Network-Bending Instrument, Manifesting Output Diversity in Neural Audio Systems. In International Conference on Computational Creativity

  44. [44]

Louis McCallum and Matthew Yee-King. 2020. Network Bending Neural Vocoders. In 4th Workshop on Machine Learning for Creativity and Design at NeurIPS 2020, Vancouver, Canada. Goldsmiths, University of London

  45. [45]

Rosa Menkman. 2011. The Glitch Moment(Um). Institute of Network Cultures

  46. [46]

    Melkamu Mersha, Khang Lam, Joseph Wood, Ali AlShami, and Jugal Kalita

  47. [47]

Melkamu Mersha, Khang Lam, Joseph Wood, Ali AlShami, and Jugal Kalita. 2024. Explainable Artificial Intelligence: A Survey of Needs, Techniques, Applications, and Future Direction. Neurocomputing 599 (Sept. 2024), 128111. arXiv:2409.00265 [cs] doi:10.1016/j.neucom.2024.128111

  48. [48]

Carman Neustaedter and Phoebe Sengers. 2012. Autobiographical Design in HCI Research: Designing and Learning through Use-It-Yourself. In Proceedings of the Designing Interactive Systems Conference (DIS ’12). Association for Computing Machinery, New York, NY, USA, 514–523. doi:10.1145/2317956.2318034

  49. [49]

    OpenAI. [n. d.]. Video Generation Models as World Simulators | OpenAI

  50. [50]

    OpenAI. 2024. Sora (Blogpost). https://openai.com/index/sora/

  51. [51]

    OpenAI. 2025. Sora 2 Is Here

  52. [52]

Jonas Oppenlaender, Rhema Linder, and Johanna Silvennoinen. 2025. Prompting AI Art: An Investigation into the Creative Skill of Prompt Engineering. International Journal of Human–Computer Interaction 41, 16 (Aug. 2025), 10207–10229. doi:10.1080/10447318.2024.2431761

  53. [53]

    Nam June Paik. 1965. Magnet TV

  54. [54]

William Peebles and Saining Xie. 2023. Scalable Diffusion Models with Transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, Paris, France, 4172–4182. doi:10.1109/ICCV51070.2023.00387

  55. [55]

William Peebles and Saining Xie. 2023. Scalable Diffusion Models with Transformers (Preprint). In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4195–4205

  56. [56]

    Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, Yuhui Wang, Anbang Ye, Gang Ren, Qianran Ma, Wanying Liang, Xiang Lian, Xiwen Wu, Yuting Zhong, Zhuangyan Li, Chaoyu Gong, Guojun Lei, Leijun Cheng, Limin Zhang, Minghao Li, Ruijie Zhang, Silan Hu, Shijie Huang, Xiaokang Wang, Yu...

  57. [57]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision (CLIP). In Proceedings of the 38th International Conference on Machine Learning. PMLR, 8748–8763

  58. [58]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2023. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683 doi:10.48550/arXiv.1910.10683

  59. [59]

Nina Rajcic, Maria Teresa Llano Rodriguez, and Jon McCormack. 2024. Towards a Diffractive Analysis of Prompt-Based Generative AI. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI ’24). Association for Computing Machinery, New York, NY, USA, 1–15. doi:10.1145/3613904.3641971

  60. [60]

Matt Ratto. 2011. Critical Making: Conceptual and Material Studies in Technology and Social Life. The Information Society 27, 4 (July 2011), 252–260. doi:10.1080/01972243.2011.583819

  61. [61]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis With Latent Diffusion Models (Stable Diffusion). In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695

  62. [62]

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi (Eds.). Springer International Publishing, Cham, 234–241. doi:10.1007/978-3-319-2...

  63. [63]

    Runway. 2025. Runway Research | Introducing Runway Gen-4.5

  64. [64]

Johannes Schneider. 2024. Explainable Generative AI (GenXAI): A Survey, Conceptualization, and Research Agenda. Artificial Intelligence Review 57, 11 (Sept. 2024), 289. doi:10.1007/s10462-024-10916-x

  65. [65]

Renee Shelby, Shalaleh Rismani, and Negar Rostamzadeh. 2024. Generative AI in Creative Practice: ML-Artist Folk Theories of T2I Use, Harm, and Harm-Reduction. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI ’24). Association for Computing Machinery, New York, NY, USA, 1–17. doi:10.1145/3613904.3642461

  66. [66]

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. 2022. Make-A-Video: Text-to-Video Generation without Text-Video Data. In The Eleventh International Conference on Learning Representations

  67. [67]

P. Adams Sitney. 2002. Visionary Film: The American Avant-garde, 1943–2000. Oxford University Press

  68. [68]

Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, and Ferhan Ture. 2022. What the DAAM: Interpreting Stable Diffusion Using Cross Attention. arXiv:2210.04885 [cs] doi:10.48550/arXiv.2210.04885

  69. [69]

Maddalena Torricelli, Mauro Martino, Andrea Baronchelli, and Luca Maria Aiello. 2024. The Role of Interface Design on Prompt-mediated Creativity in Generative AI. In Proceedings of the 16th ACM Web Science Conference (WEBSCI ’24). Association for Computing Machinery, New York, NY, USA, 235–240. doi:10.1145/3614419.3644000

  70. [70]

Luke Tredinnick and Claire Laybats. 2023. Black-Box Creativity and Generative Artificial Intelligence. Business Information Review 40, 3 (Sept. 2023), 98–102. doi:10.1177/02663821231195131

  71. [71]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc

  72. [72]

    WanTeam, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Ti...

  73. [73]

    Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. 2024. Diffusion Models: A Comprehensive Survey of Methods and Applications. arXiv:2209.00796 [cs] doi:10.48550/arXiv.2209.00796

  74. [74]

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, and Jie Tang. 2024. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer. arXiv:2408.06072 [cs] doi:10.48550/arXiv.2408.06072

  75. [75]

Shamim Yazdani, Akansha Singh, Nripsuta Saxena, Zichong Wang, Avash Palikhe, Deng Pan, Umapada Pal, Jie Yang, and Wenbin Zhang. 2025. Generative AI in Depth: A Survey of Recent Advances, Model Variants, and Real-World Applications. arXiv:2510.21887 [cs] doi:10.48550/arXiv.2510.21887

  76. [76]

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. 2023. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv:2308.06721 [cs] doi:10.48550/arXiv.2308.06721

  77. [77]

Gene Youngblood. 1970. Expanded Cinema. Dutton

  78. [78]

    J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang

  79. [79]

J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang. 2023. Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, 1–21. doi:10.1145/3544548.3581388

  80. [80]

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3836–3847

Showing first 80 references.