DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
Pith reviewed 2026-05-08 13:02 UTC · model grok-4.3
The pith
A training-free repulsion mechanism steers diffusion models toward rare image compositions by counteracting their default common completions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Default completion bias arises when denoising trajectories are attracted to high-frequency semantic configurations. DCR constructs a counterfactual attractor by relaxing the rare compositional factor, defines the discrepancy between the target and attractor denoising trajectories as counterfactual drift, and applies projection-based repulsion to remove guidance components aligned with this drift. The result is improved adherence to rare compositions during standard sampling.
What carries the argument
Default Completion Repulsion (DCR): a projection-based mechanism that removes guidance components aligned with the counterfactual drift direction, where the drift is the difference between the target and attractor denoising trajectories.
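A minimal sketch of how such a projection-based repulsion step could look inside a sampling loop. The tensor names (`eps_target`, `eps_attractor`) and the strength knob `lambda_rep` are illustrative assumptions, not the paper's exact formulation:

```python
import torch

def dcr_repel(eps_target: torch.Tensor,
              eps_attractor: torch.Tensor,
              guidance: torch.Tensor,
              lambda_rep: float = 1.0) -> torch.Tensor:
    """Remove the component of the guidance signal aligned with the
    counterfactual drift (target minus attractor noise prediction).

    All tensors share the latent shape, e.g. (B, C, H, W). lambda_rep
    is a hypothetical strength parameter; the paper's parameterization
    is not specified in this review.
    """
    # Counterfactual drift: the pull toward the default completion.
    drift = eps_target - eps_attractor
    drift_flat = drift.flatten(1)
    g_flat = guidance.flatten(1)

    # Unit drift direction per sample (epsilon avoids division by zero).
    d_hat = drift_flat / (drift_flat.norm(dim=1, keepdim=True) + 1e-8)

    # Project the guidance onto the drift direction and subtract it.
    coeff = (g_flat * d_hat).sum(dim=1, keepdim=True)
    g_repelled = g_flat - lambda_rep * coeff * d_hat
    return g_repelled.view_as(guidance)
```

In a classifier-free-guidance loop, `guidance` would be the usual conditional-minus-unconditional term, with the repelled version fed back into the denoising update in place of the raw guidance.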
If this is right
- Compositional fidelity rises on rare but plausible prompts while visual quality is preserved.
- The method runs inside the standard diffusion sampling loop with no retraining or architecture changes.
- Intrinsic model biases toward frequent completions are exposed and can be counteracted at inference time.
- Controllable generation can proceed by suppressing competing tendencies rather than adding explicit constraints.
- The same sampling process now supports both common and rare compositions without separate models.
Where Pith is reading between the lines
- The repulsion idea could be applied to other generative models that exhibit similar default biases, such as text or audio generators.
- Combining DCR with classifier-free guidance might produce additive control effects on both rarity and style.
- Ablating the strength of the repulsion term would reveal how much bias suppression is needed for different prompt types.
- The framework suggests a general route to debiasing generative trajectories by constructing and repelling from model-preferred alternatives.
Load-bearing premise
Relaxing the rare compositional factor while preserving surrounding semantics produces a counterfactual attractor whose trajectory difference isolates only the undesired default completion without introducing new artifacts or unintended semantic shifts.
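If this premise holds, the relaxation can be as simple as a prompt edit that touches only the rare factor. A toy sketch under that assumption; the substitution rule below is hypothetical, since the paper's actual editing rule is not given in this review:

```python
def relax_prompt(prompt: str, rare_factor: str, neutral: str = "") -> str:
    """Illustrative relaxation rule: drop or neutralize only the rare
    compositional factor, keeping every other token intact so the
    surrounding semantics are preserved."""
    relaxed = prompt.replace(rare_factor, neutral).strip()
    return " ".join(relaxed.split())  # collapse any doubled spaces

# "a snowy beach with waves" -> "a beach with waves"
attractor_prompt = relax_prompt("a snowy beach with waves", "snowy")
```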
What would settle it
Generate images from a fixed set of rare compositional prompts both with and without DCR, then score them for exact presence of the rare element and for overall visual realism; a clear rise in rare-element scores with no drop in realism scores would support the claim.
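One way such a protocol could be scored, sketched with an off-the-shelf CLIP model as a stand-in rare-element detector; the model choice and scoring rule are assumptions, not the paper's protocol:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def rare_element_score(images, rare_text: str) -> torch.Tensor:
    """Hypothetical check: CLIP similarity between each generated image
    (a list of PIL images) and a caption naming the rare element, e.g.
    'a snowy beach'. Higher means the rare element is more likely present."""
    inputs = processor(text=[rare_text], images=images,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    return out.logits_per_image.squeeze(-1)  # one score per image
```

Realism would be scored separately (e.g., FID or human ratings); the claim is supported if the mean rare-element score rises under DCR while the realism score stays flat.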
Original abstract
Diffusion models generate realistic visual content, yet often fail to produce rare but plausible compositions. When prompted with combinations that are valid but underrepresented in training data, such as a snowy beach or a rainbow at night, the generation process frequently collapses toward more common alternatives. We identify this failure mode as default completion bias, where denoising trajectories are implicitly attracted toward high-frequency semantic configurations. Existing guidance mechanisms do not explicitly model this competing tendency and therefore struggle to prevent such collapse. We introduce Default Completion Repulsion (DCR), a training-free framework that explicitly models and suppresses default completion behavior. DCR constructs a counterfactual attractor by relaxing the rare compositional factor while preserving surrounding semantics, inducing an alternative denoising trajectory reflecting the model's preferred completion. We define the discrepancy between target and attractor trajectories as a counterfactual drift, and propose a projection-based repulsion mechanism that removes guidance components aligned with this drift direction. This suppresses undesired frequent completions while preserving other semantic components. DCR operates entirely within the standard diffusion sampling process without retraining or architectural modification. Experiments on rare compositional prompts show that DCR improves compositional fidelity while maintaining visual quality. Our analysis further shows that the framework exposes and counteracts intrinsic model biases, offering a new perspective on controllable generation beyond explicit constraint enforcement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that diffusion models exhibit default completion bias on rare but valid compositional prompts (e.g., snowy beach), collapsing toward high-frequency alternatives during denoising. It introduces Default Completion Repulsion (DCR), a training-free method that constructs a counterfactual attractor by relaxing the rare compositional factor while preserving surrounding semantics, defines counterfactual drift as the trajectory discrepancy, and applies projection-based repulsion to suppress drift-aligned components, thereby improving compositional fidelity without retraining.
Significance. If the core assumption holds—that the attractor construction isolates only default-completion drift without semantic side-effects—DCR would offer a meaningful advance in training-free controllable generation by directly modeling and counteracting intrinsic model biases. The approach is notable for its generality beyond explicit constraints and potential to expose model preferences, which could benefit applications needing faithful rare compositions.
Major comments (1)
- [Abstract] The central claim that relaxing the rare compositional factor 'while preserving surrounding semantics' yields a counterfactual attractor whose trajectory difference 'isolates only the undesired default completion' is load-bearing but under-specified. No concrete relaxation procedure, prompt-editing rule, or validation (e.g., semantic similarity checks or trajectory visualizations) is provided to rule out unintended attribute shifts or the introduction of new high-probability modes; this is the central stress-test concern.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The primary concern is the under-specification of the counterfactual attractor construction in the abstract. We address this directly below and propose targeted revisions to improve clarity without altering the core claims or results.
Point-by-point responses
- Referee: [Abstract] The central claim that relaxing the rare compositional factor 'while preserving surrounding semantics' yields a counterfactual attractor whose trajectory difference 'isolates only the undesired default completion' is load-bearing but under-specified. No concrete relaxation procedure, prompt-editing rule, or validation (e.g., semantic similarity checks or trajectory visualizations) is provided to rule out unintended attribute shifts or the introduction of new high-probability modes; this is the central stress-test concern.
Authors: We agree that the abstract's brevity leaves the load-bearing claim under-specified and that explicit details would strengthen the presentation. The relaxation procedure is a prompt-editing rule that removes or neutralizes only the rare compositional factor (e.g., replacing 'snowy' with a neutral term) while retaining all other tokens and structure, preserving the surrounding semantics. We will revise the abstract to state this rule concisely. The manuscript's experiments on rare compositional prompts already demonstrate improved fidelity without quality loss, which indirectly supports isolation of the default-completion drift. To directly address the stress-test concern, we will add semantic similarity checks (via CLIP embeddings) between original and relaxed prompts, plus trajectory visualizations, in the revision, confirming that no unintended attribute shifts or new modes are introduced. Revision: yes.
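As a hedged illustration of the promised CLIP-based check; the model name and scoring rule are assumptions, not the authors' protocol:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def prompt_similarity(original: str, relaxed: str) -> float:
    """Cosine similarity between CLIP text embeddings of the original and
    relaxed prompts. A high score (apart from the removed factor) would
    support the 'surrounding semantics preserved' premise."""
    tokens = tokenizer([original, relaxed], padding=True, return_tensors="pt")
    emb = model.get_text_features(**tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float(emb[0] @ emb[1])
```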
Circularity Check
No circularity: procedural method without self-referential derivations or fitted predictions
Full rationale
The paper introduces DCR as a training-free guidance technique based on constructing a counterfactual attractor via prompt relaxation and applying projection-based repulsion on trajectory drift. No equations, parameter fits, or derivations are described that reduce the output to the input by construction. The framework is defined procedurally from model-internal sampling trajectories, with claims supported by experimental results on rare prompts rather than self-definition or self-citation chains. No load-bearing uniqueness theorems or ansatzes from prior author work are invoked.