pith. machine review for the scientific record.

arxiv: 2605.06512 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 13:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · compositional generation · counterfactual guidance · default completion bias · rare prompts · training-free method · image synthesis · guidance mechanisms

The pith

A training-free repulsion mechanism steers diffusion models toward rare image compositions by counteracting their default common completions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion models often collapse rare but valid prompts like "a snowy beach" into more common alternatives because their denoising trajectories are pulled toward frequent patterns in the training data. The paper names this failure default completion bias and introduces Default Completion Repulsion (DCR) to model and suppress it explicitly. DCR builds a counterfactual attractor by relaxing the rare compositional element while preserving the other semantics, then defines the difference between the two denoising trajectories as counterfactual drift. A projection-based repulsion step removes the guidance components aligned with that drift, steering the main generation away from the undesired default. Experiments show this raises compositional fidelity on rare prompts without hurting visual quality and without any retraining.

Core claim

Default completion bias arises when denoising trajectories are attracted to high-frequency semantic configurations. DCR constructs a counterfactual attractor by relaxing the rare compositional factor, computes the discrepancy with the target trajectory as counterfactual drift, and applies projection-based repulsion to remove guidance components aligned with this drift. The result is improved adherence to rare compositions during standard sampling.

What carries the argument

Default Completion Repulsion (DCR) via a projection-based mechanism that removes guidance components aligned with the counterfactual drift direction, where drift is the difference between target and attractor denoising trajectories.
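
The page describes this mechanism only verbally, so the following is a minimal sketch of what one projection-based repulsion step could look like inside a PyTorch denoising loop. The three denoiser branches follow the overview in Figure 2; the drift sign convention, the non-negativity clamp, the weight w_attr, and the function name are assumptions made for illustration, not the paper's exact update.

    import torch

    def dcr_step(eps_uncond, eps_target, eps_attr, cfg_scale=7.5, w_attr=1.0):
        # eps_uncond: denoiser output for the unconditional / empty prompt
        # eps_target: denoiser output for the rare compositional prompt p
        # eps_attr:   denoiser output for the attractor prompt p_attr
        # All tensors share one shape, e.g. (B, C, T, H, W) video latents.

        # Standard classifier-free guidance direction toward the target prompt.
        guidance = eps_target - eps_uncond

        # Counterfactual drift: the direction in which the trajectory would move
        # toward the model's preferred default completion (sign convention assumed).
        drift = eps_attr - eps_target

        # Per-sample projection of the guidance onto the drift direction.
        g = guidance.flatten(1)
        d = drift.flatten(1)
        coeff = (g * d).sum(dim=1, keepdim=True) / (d.pow(2).sum(dim=1, keepdim=True) + 1e-8)
        coeff = coeff.clamp(min=0.0)  # only repel components pointing toward the default
        proj = (coeff * d).view_as(guidance)

        # Remove the drift-aligned component; everything orthogonal to it is untouched.
        repelled = guidance - w_attr * proj
        return eps_uncond + cfg_scale * repelled

The point of the projection form is that only the part of the guidance aligned with the default-completion direction is removed, so semantic components orthogonal to the drift pass through unchanged.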

If this is right

  • Compositional fidelity rises on rare but plausible prompts while visual quality is preserved.
  • The method runs inside the standard diffusion sampling loop with no retraining or architecture changes.
  • Intrinsic model biases toward frequent completions are exposed and can be counteracted at inference time.
  • Controllable generation can proceed by suppressing competing tendencies rather than adding explicit constraints.
  • The same sampling process now supports both common and rare compositions without separate models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The repulsion idea could be applied to other generative models that exhibit similar default biases, such as text or audio generators.
  • Combining DCR with classifier-free guidance might produce additive control effects on both rarity and style.
  • Ablating the strength of the repulsion term would reveal how much bias suppression is needed for different prompt types (a minimal sweep sketch follows this list).
  • The framework suggests a general route to debiasing generative trajectories by constructing and repelling from model-preferred alternatives.
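
On the ablation point above, a strength sweep is straightforward to script. In the sketch below, generate_fn, score_fn, and the weight grid are placeholders rather than anything the paper specifies; Figure 5 reports sensitivity to w_attr, η, and a schedule interval, so a real sweep would cover those axes as well.

    import numpy as np

    def ablate_repulsion_strength(prompts, generate_fn, score_fn,
                                  weights=(0.0, 0.25, 0.5, 1.0, 2.0)):
        # generate_fn(prompt, w_attr) -> a generated sample from a DCR-style sampler
        # score_fn(sample, prompt)    -> scalar rare-element adherence score
        results = {}
        for w in weights:
            scores = [score_fn(generate_fn(p, w_attr=w), p) for p in prompts]
            results[w] = float(np.mean(scores))
        return results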

Load-bearing premise

Relaxing the rare compositional factor while preserving surrounding semantics produces a counterfactual attractor whose trajectory difference isolates only the undesired default completion without introducing new artifacts or unintended semantic shifts.
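
How the relaxation is performed is exactly what the premise leaves open. As a concrete illustration only, a rule-based relaxation could look like the sketch below; the paper's pipeline reportedly has an LLM pick the nearest frequent counterpart (Figure 2), so the lookup table and fallback here are stand-ins.

    # Illustrative relaxation table; the "a clay pot under natural lighting"
    # example is taken from the Figure 2 caption, the others are guesses.
    RARE_TO_DEFAULT = {
        "a snowy beach with waves": "a beach with waves",
        "a rainbow at night in the sky": "a rainbow in the sky",
        "a glowing clay pot": "a clay pot under natural lighting",
    }

    def relax_prompt(prompt: str) -> str:
        # Return an attractor prompt p_attr: drop or neutralize the rare factor,
        # keep the surrounding wording verbatim.
        if prompt in RARE_TO_DEFAULT:
            return RARE_TO_DEFAULT[prompt]
        # In the paper's pipeline an LLM would rewrite the prompt here; returning it
        # unchanged makes the drift vanish and simply disables the repulsion.
        return prompt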

What would settle it

Generate images from a fixed set of rare compositional prompts both with and without DCR, then score them for exact presence of the rare element and for overall visual realism; a clear rise in rare-element scores with no drop in realism scores would support the claim.
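
One way to run that protocol with off-the-shelf tools: score each generated frame against the full rare prompt with CLIP and compare means with and without DCR. The checkpoint choice and the use of a CLIP score as the rare-element proxy are assumptions here; the paper's own metrics (CLIPScore, CLIP-attr, BLIP in Figure 5) may be computed differently, and realism still needs a separate check such as FID or human ratings.

    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_score(image, prompt: str) -> float:
        # Image-text compatibility for one frame (PIL image) and one prompt.
        inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits_per_image  # shape (1, 1)
        return logits.item()

    def compare_rare_element(frames_dcr, frames_baseline, rare_prompt: str) -> dict:
        # Mean adherence to the rare prompt with and without DCR.
        mean = lambda frames: sum(clip_score(f, rare_prompt) for f in frames) / len(frames)
        return {"dcr": mean(frames_dcr), "baseline": mean(frames_baseline)}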

Figures

Figures reproduced from arXiv: 2605.06512 by Matthias Zwicker, Taewon Kang.

Figure 1: Default completion bias in video diffusion models. We compare our method (DCR) against Mochi on three representative rare compositional prompts spanning Temporal Misalignment (TEMP), Attribute Rebinding (ATTR), and Environment Recomposition (ENV). Baseline models collapse toward statistically dominant completions—generating a daytime rainbow instead of a nocturnal one, a glowing sphere instead of a glowing… view at source ↗
Figure 2: Overview of Default Completion Repulsion (DCR). Given a rare but plausible compositional prompt p (e.g., “a glowing clay pot”), an LLM generates an attractor prompt p_attr representing the nearest frequent counterpart (e.g., “a clay pot under natural lighting”). Both prompts enter the text-to-video diffusion process, where DCR is applied at each denoising step. (a) Guidance Pipeline: Three denoiser branche… view at source ↗
Figure 3: Representative qualitative comparison (ATTR). We compare our method with state-of-the-art diffusion baselines (Mochi, HunyuanVideo, CogVideoX) on Attribute Rebinding (ATTR): “a glowing clay pot.” While baselines often alter object identity or material properties (e.g., generating molten objects or unrelated glowing containers), our method correctly binds the luminous attribute to the clay pot, preserving b… view at source ↗
Figure 4: Ablation study on rare compositional generation. We compare our full model against four ablations on the prompt “a rainbow at night in the sky.” Negative Prompt replaces the attractor prompt p_attr into the negative slot of standard CFG, producing dark, near-black frames that entirely fail to render the rainbow—confirming that naive negation via CFG cannot encode compositional rarity. w/o Attractor Prompt c… view at source ↗
Figure 5: Attractor trajectory visualization and hyperparameter sensitivity. (Left) Comparison of Ours (DCR), Mochi (baseline), and p_attr only (attractor) for a snowy beach with waves (ENV). The p_attr-only output closely resembles the Mochi baseline, confirming that p_attr captures the model’s default completion tendency. (Right) Sensitivity of CLIPScore, CLIP-attr, and BLIP to w_attr, η, and schedule interval (r_s, r_e… view at source ↗
Figure 6: Qualitative comparison across diverse compositional scenarios (Set 1). We evaluate our method against state-of-the-art video diffusion models (Mochi, HunyuanVideo, CogVideoX) on eight rare compositional categories: ENV (Environment Recomposition), TEMP (Temporal Misalignment), OBJ (Object Relocation), ATTR (Attribute Rebinding), SCALE (Scale Shift), CTX (Contextual Relocation), MAT (Material and State Conf… view at source ↗
Figure 7: Qualitative comparison across diverse compositional scenarios (Set 2). This figure presents a second set of representative prompts covering the same eight compositional categories as in… view at source ↗
Figure 8: Qualitative comparison across diverse compositional scenarios (Set 3). This figure presents a third set of prompts spanning all eight compositional categories, including a lake shore with snow and blooming flowers, a surfboard standing in a grassy field, a solid cloud shaped like a rock, and a wet fire burning with visible flames. These cases probe more subtle compositional conflicts, such as simultaneous… view at source ↗
Figure 9: Attractor trajectory visualization across three compositional categories. For each prompt, we compare videos generated by Ours (DCR), Mochi (standard CFG baseline), and p_attr only (standard CFG applied to the attractor prompt). In all three cases, the p_attr-only output reflects the model’s default completion tendency—a sunny beach instead of a snowy one (ENV), a daytime rainbow instead of a nocturnal one (… view at source ↗
Figure 10: ENV (Environment Recomposition). Prompt: “a snowy beach with waves.” Our method preserves both snow and beach simultaneously, while baselines collapse to a standard beach. view at source ↗
Figure 11: TEMP (Temporal Misalignment). Prompt: “a rainbow at night in the sky.” Our method generates a nocturnal rainbow, whereas baselines revert to daytime scenes. view at source ↗
Figure 12: OBJ (Object Relocation). Prompt: “a lighthouse located in a grassy meadow.” Our method maintains the rare object-environment pairing, while baselines relocate the object to coastal regions. view at source ↗
Figure 13: ATTR (Attribute Rebinding). Prompt: “a glowing clay pot.” Our method correctly binds the glowing attribute to the clay material, while baselines alter object identity or material. view at source ↗
Figure 14: SCALE (Scale Shift). Prompt: “a giant cat larger than a building.” Our method realizes extreme scale relationships, while baselines revert to typical object sizes. view at source ↗
Figure 15: CTX (Contextual Relocation). Prompt: “a kitchen in the middle of a highway.” Our method integrates both contexts coherently, while baselines generate only one dominant scene. view at source ↗
Figure 16: MAT (Material-State Conflict). Prompt: “a melting ice sculpture in a snowy field.” Our method preserves both melting and snowy conditions, while baselines collapse to a single state. view at source ↗
Figure 17: DENS (Density Variation). Prompt: “a lightly crowded subway platform during peak hours.” Our method maintains low density despite strong priors for crowded scenes. view at source ↗
Figure 18: ENV. Prompt: “a desert oasis surrounded by snow.” view at source ↗
Figure 19: TEMP. Prompt: “a sunrise with stars still visible in the sky.” view at source ↗
Figure 20: OBJ. Prompt: “a canoe placed in a dry canyon.” view at source ↗
Figure 21: ATTR. Prompt: “a reflective wooden table like a mirror.” view at source ↗
Figure 22: SCALE. Prompt: “a tiny chair placed on a finger.” view at source ↗
Figure 23: CTX. Prompt: “a library on a beach.” view at source ↗
Figure 24: MAT. Prompt: “a steaming ice cube on a snowy ground.” view at source ↗
Figure 25: DENS. Prompt: “a lightly crowded shopping street during sales season.” view at source ↗
Figure 26: ENV. Prompt: “a lake shore with snow and blooming flowers.” view at source ↗
Figure 27: TEMP. Prompt: “a sunrise sky with a visible rainbow and moon together.” view at source ↗
Figure 28: OBJ. Prompt: “a surfboard standing in a grassy field.” view at source ↗
Figure 29: ATTR. Prompt: “a solid cloud shaped like a rock.” view at source ↗
Figure 30: SCALE. Prompt: “a tiny bench on a coin.” view at source ↗
Figure 31: CTX. Prompt: “a restaurant dining area in the middle of a highway.” view at source ↗
Figure 32: MAT. Prompt: “a wet fire burning with visible flames.” view at source ↗
Figure 33: DENS. Prompt: “a lightly crowded bus stop during commute time.” view at source ↗
read the original abstract

Diffusion models generate realistic visual content, yet often fail to produce rare but plausible compositions. When prompted with combinations that are valid but underrepresented in training data, such as a snowy beach or a rainbow at night, the generation process frequently collapses toward more common alternatives. We identify this failure mode as default completion bias, where denoising trajectories are implicitly attracted toward high-frequency semantic configurations. Existing guidance mechanisms do not explicitly model this competing tendency and therefore struggle to prevent such collapse. We introduce Default Completion Repulsion (DCR), a training-free framework that explicitly models and suppresses default completion behavior. DCR constructs a counterfactual attractor by relaxing the rare compositional factor while preserving surrounding semantics, inducing an alternative denoising trajectory reflecting the model's preferred completion. We define the discrepancy between target and attractor trajectories as a counterfactual drift, and propose a projection-based repulsion mechanism that removes guidance components aligned with this drift direction. This suppresses undesired frequent completions while preserving other semantic components. DCR operates entirely within the standard diffusion sampling process without retraining or architectural modification. Experiments on rare compositional prompts show that DCR improves compositional fidelity while maintaining visual quality. Our analysis further shows that the framework exposes and counteracts intrinsic model biases, offering a new perspective on controllable generation beyond explicit constraint enforcement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that diffusion models exhibit default completion bias on rare but valid compositional prompts (e.g., snowy beach), collapsing toward high-frequency alternatives during denoising. It introduces Default Completion Repulsion (DCR), a training-free method that constructs a counterfactual attractor by relaxing the rare compositional factor while preserving surrounding semantics, defines counterfactual drift as the trajectory discrepancy, and applies projection-based repulsion to suppress drift-aligned components, thereby improving compositional fidelity without retraining.

Significance. If the core assumption holds—that the attractor construction isolates only default-completion drift without semantic side-effects—DCR would offer a meaningful advance in training-free controllable generation by directly modeling and counteracting intrinsic model biases. The approach is notable for its generality beyond explicit constraints and potential to expose model preferences, which could benefit applications needing faithful rare compositions.

major comments (1)
  1. [Abstract] The central claim that relaxing the rare compositional factor 'while preserving surrounding semantics' yields a counterfactual attractor whose trajectory difference 'accurately isolates only the undesired default completion' is load-bearing but under-specified. No concrete relaxation procedure, prompt-editing rule, or validation (e.g., semantic similarity checks or trajectory visualizations) is provided to rule out unintended attribute shifts or new high-probability modes, which bears directly on the stress-test concern.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. The primary concern is the under-specification of the counterfactual attractor construction in the abstract. We address this directly below and propose targeted revisions to improve clarity without altering the core claims or results.

read point-by-point responses
  1. Referee: [Abstract] The central claim that relaxing the rare compositional factor 'while preserving surrounding semantics' yields a counterfactual attractor whose trajectory difference 'accurately isolates only the undesired default completion' is load-bearing but under-specified. No concrete relaxation procedure, prompt-editing rule, or validation (e.g., semantic similarity checks or trajectory visualizations) is provided to rule out unintended attribute shifts or new high-probability modes, which bears directly on the stress-test concern.

    Authors: We agree that the abstract's brevity leaves the load-bearing claim under-specified and that explicit details would strengthen the presentation. The relaxation procedure consists of a prompt-editing rule that removes or neutralizes only the rare compositional factor (e.g., replacing 'snowy' with a neutral term) while retaining all other tokens and structure to preserve surrounding semantics. We will revise the abstract to state this rule concisely. The manuscript's experiments on rare compositional prompts already demonstrate improved fidelity without quality loss, which indirectly supports isolation of default-completion drift. To directly address the stress-test concern, we will add semantic similarity checks (via CLIP embeddings) between original and relaxed prompts plus trajectory visualizations in the revision, confirming no unintended attribute shifts or new modes are introduced. revision: yes
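
As an illustration of the semantic-similarity check the response proposes (not something in the current manuscript), cosine similarity between CLIP text embeddings of the original and relaxed prompts could be computed as below; the checkpoint and any acceptance threshold are assumptions.

    import torch
    from transformers import CLIPTokenizer, CLIPTextModelWithProjection

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

    def prompt_similarity(original: str, relaxed: str) -> float:
        # Cosine similarity between projected CLIP text embeddings of the two prompts.
        tokens = tokenizer([original, relaxed], padding=True, return_tensors="pt")
        with torch.no_grad():
            embeds = text_encoder(**tokens).text_embeds  # (2, projection_dim)
        embeds = embeds / embeds.norm(dim=-1, keepdim=True)
        return float(embeds[0] @ embeds[1])

    # e.g. prompt_similarity("a snowy beach with waves", "a beach with waves") should
    # stay high, indicating the relaxation changed little besides the rare factor.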

Circularity Check

0 steps flagged

No circularity: procedural method without self-referential derivations or fitted predictions

full rationale

The paper introduces DCR as a training-free guidance technique based on constructing a counterfactual attractor via prompt relaxation and applying projection-based repulsion on trajectory drift. No equations, parameter fits, or derivations are described that reduce the output to the input by construction. The framework is defined procedurally from model-internal sampling trajectories, with claims supported by experimental results on rare prompts rather than self-definition or self-citation chains. No load-bearing uniqueness theorems or ansatzes from prior author work are invoked.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; method appears to operate on existing diffusion trajectories.

pith-pipeline@v0.9.0 · 5515 in / 1080 out tokens · 34773 ms · 2026-05-08T13:02:48.301862+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

49 extracted references · 23 canonical work pages · 10 internal anchors

  1. [1]

    Mochi-1 preview. https://huggingface.co/genmo/mochi-1-preview, October 2024

    Genmo AI. Mochi-1 preview. https://huggingface.co/genmo/mochi-1-preview, October 2024

  2. [2]

    Vision-language models do not understand negation

    Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip HS Torr, Yoon Kim, and Marzyeh Ghassemi. Vision-language models do not understand negation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29612–29622, 2025

  3. [3]

    Universal guidance for diffusion models

    Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 843–852, 2023

  4. [4]

    Lumiere: A space-time diffusion model for video generation

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

  5. [5]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  6. [6]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023

  7. [7]

    Attend-and-excite: Attention- based semantic guidance for text-to-image diffusion models.ACM transactions on Graphics (TOG), 42(4):1–10, 2023

    Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention- based semantic guidance for text-to-image diffusion models.ACM transactions on Graphics (TOG), 42(4):1–10, 2023

  8. [8]

    CFG++: Manifold-Constrained Classifier-Free Guidance for Diffusion Models

    Hyungjin Chung, Jeongsol Kim, Geon Yeong Park, Hyelin Nam, and Jong Chul Ye. CFG++: Manifold-constrained classifier-free guidance for diffusion models. arXiv preprint arXiv:2406.08070, 2024

  9. [9]

    Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

    Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022

  10. [10]

    Erasing concepts from diffusion models

    Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, and David Bau. Erasing concepts from diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 2426–2436, 2023

  11. [11]

    Unified concept editing in diffusion models

    Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzy´nska, and David Bau. Unified concept editing in diffusion models. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 5111–5120, 2024

  12. [12]

    Preserve your own correlation: A noise prior for video diffusion models

    Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia- Bin Huang, Ming-Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22930–22941, 2023

  13. [13]

    Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

    Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023

  14. [14]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation.arXiv preprint arXiv:2211.13221, 2022

  15. [15]

    StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

    Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. StreamingT2V: Consistent, dynamic, and extendable long video generation from text. arXiv preprint arXiv:2403.14773, 2024

  16. [16]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022

  17. [17]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  18. [18]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  19. [19]

    Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

  20. [20]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

  21. [21]

    Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in neural information processing systems, 36:31096–31116, 2023

    Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in neural information processing systems, 36:31096–31116, 2023

  22. [22]

    Negate: Constrained semantic guidance for linguistic negation in text-to-video diffusion. arXiv preprint arXiv:2603.06533, 2026

    Taewon Kang and Ming C Lin. Negate: Constrained semantic guidance for linguistic negation in text-to-video diffusion. arXiv preprint arXiv:2603.06533, 2026

  23. [23]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  24. [24]

    LLM -grounded video diffusion models

    Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, and Boyi Li. Llm-grounded video diffusion models. arXiv preprint arXiv:2309.17444, 2023

  25. [25]

    Compositional visual generation with composable diffusion models

    Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. InEuropean conference on computer vision, pages 423–439. Springer, 2022

  26. [26]

    Video generation models as world simulators

    OpenAI. Video generation models as world simulators. https://openai.com/index/ video-generation-models-as-world-simulators/, February 2024

  27. [27]

    Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petrovic, and Yuming Du

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Sing...

  28. [28]

    Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169,

    Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023

  29. [29]

    Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment

    Royi Rassin, Eran Hirsch, Daniel Glickman, Shauli Ravfogel, Yoav Goldberg, and Gal Chechik. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment. Advances in Neural Information Processing Systems, 36:3536–3559, 2023

  30. [30]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  31. [31]

    Photorealistic text-to- image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to- image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

  32. [32]

    Abhishek Sharma, Adams Yu, Ali Razavi, Andeep Toor, Andrew Pierson, Ankush Gupta, Austin Waters, Aäron van den Oord, Daniel Tanis, Dumitru Erhan, Eric Lau, Eleni Shaw, Gabe Barth-Maron, Greg Shaw, Han Zhang, Henna Nandwani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jakob Bauer, Jeff Donahue, Junyoung Chung, Kory Mathewson, Kurtis David, Lasse Espeholt, Mar...

  33. [33]

    Freeu: Free lunch in diffusion u-net

    Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. Freeu: Free lunch in diffusion u-net. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4733–4743, 2024

  34. [34]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792, 2022

  35. [35]

    Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations. arXiv preprint arXiv:2403.20312, 2024

    Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, and Aparna Bharati. Learn "no" to say "yes" better: Improving vision-language models via negations. arXiv preprint arXiv:2403.20312, 2024

  36. [36]

    Winoground: Probing vision and language models for visio-linguistic compositionality

    Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022

  37. [37]

    Veo-Team, :, Agrim Gupta, Ali Razavi, Andeep Toor, Ankush Gupta, Dumitru Erhan, Eleni Shaw, Eric Lau, Frank Belletti, Gabe Barth-Maron, Gregory Shaw, Hakan Erdogan, Hakim Sidahmed, Henna Nandwani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jeff Donahue, José Lezama, Kory Mathewson, Kurtis David, Matthieu Kim Lorrain, Marc van Zee, Medhini Narasimhan, Miaose...

  38. [38]

    Phenaki: Variable length video generation from open domain textual description.arXiv preprint arXiv:2210.02399, 2022

    Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Moham- mad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description.arXiv preprint arXiv:2210.02399, 2022

  39. [39]

    Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising. arXiv preprint arXiv:2305.18264, 2023

    Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-L-Video: Multi-text to long video generation via temporal co-denoising. arXiv preprint arXiv:2305.18264, 2023

  40. [40]

    ModelScope Text-to-Video Technical Report

    Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023

  41. [41]

    Lavie: High-quality video generation with cascaded latent diffusion models.International Journal of Computer Vision, pages 1–20, 2024

    Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models.International Journal of Computer Vision, pages 1–20, 2024

  42. [42]

    HunyuanVideo 1.5 Technical Report

    Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. Hunyuanvideo 1.5 technical report.arXiv preprint arXiv:2511.18870, 2025

  43. [43]

    Not just what’s there: Enabling clip to comprehend negated visual descriptions without fine-tuning

    Junhao Xiao, Zhiyu Wu, Hao Lin, Yahui Liu, Xiaoran Zhao, Zixu Wang, Zejiang He, and Yi Chen. Not just what’s there: Enabling clip to comprehend negated visual descriptions without fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2026). AAAI Press, January 2026. Main Conference

  44. [44]

    Dymo: Training-free diffusion model alignment with dynamic multi-objective scheduling

    Xin Xie and Dong Gong. Dymo: Training-free diffusion model alignment with dynamic multi-objective scheduling. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13220– 13230, 2025

  45. [45]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

  46. [46]

    When and why vision-language models behave like bags-of-words, and what to do about it?

    Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936, 2022

  47. [47]

    Training-free constrained generation with stable diffusion models.arXiv preprint arXiv:2502.05625, 2025

    Stefano Zampini, Jacob K Christopher, Luca Oneto, Davide Anguita, and Ferdinando Fioretto. Training- free constrained generation with stable diffusion models.arXiv preprint arXiv:2502.05625, 2025

  48. [48]

    Constrained diffusers for safe planning and control

    Jichen Zhang, Liqun Zhao, Antonis Papachristodoulou, and Jack Umenberger. Constrained diffusers for safe planning and control. arXiv preprint arXiv:2506.12544, 2025

  49. [49]

    NegVQA: Can vision language models understand negation?

    Yuhui Zhang, Yuchang Su, Yiming Liu, and Serena Yeung-Levy. NegVQA: Can vision language models understand negation? In Findings of the Association for Computational Linguistics: ACL 2025, pages 3707–3716, 2025