pith. machine review for the scientific record.

arxiv: 2604.18575 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

ReCap: Lightweight Referential Grounding for Coherent Story Visualization

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 04:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords story visualization · character consistency · diffusion models · referential grounding · anaphora resolution · image sequence generation · lightweight models · semantic drift correction

The pith

ReCap maintains character identity across story images by activating previous-frame conditioning only on pronouns and aligning features to visual embeddings during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReCap as a lightweight addition to existing diffusion models for turning text stories into consistent sequences of images. It focuses on preventing characters from changing appearance from one frame to the next without adding memory banks, auxiliary language models, or large parameter counts. The CORE module triggers visual reference to the prior frame only when the text uses a pronoun, while SemDrift corrects semantic drift by matching internal features to stable visual embeddings, and does so only during training. This selective design yields higher character accuracy on standard cartoon benchmarks and extends to narratives drawn from real films. A sympathetic reader would care because it suggests that heavy architectural changes may not be required for coherent visual narratives.
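To make the selectivity concrete, here is a minimal sketch of such a gating rule in Python. The pronoun list, the tokenizer, and the `use_previous_frame` helper are illustrative assumptions; the paper's actual Text-Conditioned Gating, which per Figure 2 activates when no character name appears in the prompt, is not specified at this level of detail.

```python
# A minimal sketch of pronoun-triggered gating, not the paper's actual
# Text-Conditioned Gating: the pronoun list, tokenization, and the name
# check are illustrative assumptions.

PRONOUNS = {"he", "she", "they", "him", "her", "them", "his", "hers", "their"}

def use_previous_frame(prompt: str, character_names: set) -> bool:
    """Activate previous-frame conditioning only for anaphoric prompts:
    a pronoun is present and no explicit character name appears."""
    tokens = {tok.strip('.,!?"\'').lower() for tok in prompt.split()}
    has_pronoun = bool(tokens & PRONOUNS)
    has_name = bool(tokens & {name.lower() for name in character_names})
    return has_pronoun and not has_name

# Frame-by-frame gating for a two-frame story about Fred:
print(use_previous_frame("Fred walks into the kitchen.", {"Fred"}))  # False
print(use_previous_frame("He opens the fridge.", {"Fred"}))          # True
```

In the paper's framing, this check is what keeps cross-frame conditioning from firing on name-anchored frames, where the text alone already pins identity.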

Core claim

ReCap's central claim is twofold: treating pronouns as visual anchors via the CORE module allows selective conditioning on the preceding frame to propagate character identity, and SemDrift's training-time alignment of denoiser representations with DINOv3 embeddings enforces stability when text references are vague. Together, the paper argues, these achieve new state-of-the-art character consistency on the main benchmarks without unconditional cross-frame links or added inference cost.

What carries the argument

The CORE module for conditional frame referencing, which activates only on pronouns and uses them as anchors to propagate visual identity from the prior frame, paired with SemDrift's training-time alignment to DINOv3 embeddings to correct semantic drift.
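A hedged PyTorch sketch of what a training-only alignment term of this kind could look like. The projection head, the cosine objective, and the tensor shapes are assumptions; the review only establishes that denoiser representations are matched to frozen DINOv3 embeddings during training and that the term disappears at inference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentLoss(nn.Module):
    """Training-only alignment of denoiser features to a frozen visual
    encoder, in the spirit of SemDrift. The linear projection, cosine
    objective, and shapes are assumptions; the paper's exact formulation
    is not given in this review."""

    def __init__(self, denoiser_dim: int, encoder_dim: int):
        super().__init__()
        # Map denoiser tokens into the visual encoder's embedding space.
        self.proj = nn.Linear(denoiser_dim, encoder_dim)

    def forward(self, denoiser_feats: torch.Tensor,
                encoder_feats: torch.Tensor) -> torch.Tensor:
        # denoiser_feats: (B, N, denoiser_dim) intermediate denoiser tokens.
        # encoder_feats:  (B, N, encoder_dim) frozen DINOv3-style targets.
        pred = self.proj(denoiser_feats)
        target = encoder_feats.detach()  # no gradients into the encoder
        # 1 - cosine similarity, averaged over tokens and the batch.
        return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()

# Shapes are illustrative only:
loss_fn = AlignmentLoss(denoiser_dim=1536, encoder_dim=768)
loss = loss_fn(torch.randn(2, 196, 1536), torch.randn(2, 196, 768))
```

Because the projection and loss exist only in the training graph, dropping them at inference matches the zero-extra-cost property the review highlights.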

If this is right

  • It achieves new state-of-the-art character accuracy on the FlintstonesSV and PororoSV benchmarks.
  • The full approach adds only 149K parameters and incurs zero extra cost at inference time.
  • Story visualization extends successfully to human-centric scenes drawn from real films.
  • Unconditional cross-frame conditioning and large auxiliary components become unnecessary for basic consistency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pronoun selectivity may generalize if augmented with other referential cues such as proper names for broader narrative coverage.
  • Training-only alignment offers a template for stabilizing other sequential generative tasks without runtime overhead.
  • The design could reduce parameter demands in related domains like consistent character generation for comics or short videos.
  • If pronouns prove sufficient anchors, full memory banks may be replaceable in many identity-tracking scenarios.

Load-bearing premise

That activating consistency mechanisms only on pronouns and correcting drift solely at training time will suffice to prevent identity changes across all narrative cases without full cross-frame attention or memory structures.

What would settle it

A new benchmark of story texts that mostly use character names instead of pronouns: if ReCap's character accuracy there failed to exceed that of prior methods like StoryGPT-V, the pronoun-anchor premise would not hold.
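One way to build that benchmark without new annotation would be to stratify existing story texts by referential style and compare character accuracy per stratum. A minimal sketch, assuming a dictionary story format, a crude token-level counter, and a 0.5 threshold, all hypothetical choices rather than the paper's:

```python
# Sketch of assembling the proposed test set by stratifying an existing
# benchmark by referential style. Story layout ("sentences", "characters"),
# the pronoun list, and the threshold are hypothetical.

PRONOUNS = {"he", "she", "they", "him", "her", "them"}

def pronoun_ratio(story: dict) -> float:
    """Fraction of character references realized as pronouns."""
    names = {n.lower() for n in story["characters"]}
    pron = name = 0
    for sentence in story["sentences"]:
        for tok in sentence.lower().replace(",", " ").replace(".", " ").split():
            if tok in PRONOUNS:
                pron += 1
            elif tok in names:
                name += 1
    total = pron + name
    return pron / total if total else 0.0

def split_by_reference_style(stories, threshold=0.5):
    """Name-heavy stories probe the premise; pronoun-heavy ones favor it."""
    name_heavy = [s for s in stories if pronoun_ratio(s) < threshold]
    pronoun_heavy = [s for s in stories if pronoun_ratio(s) >= threshold]
    return name_heavy, pronoun_heavy
```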

Figures

Figures reproduced from arXiv: 2604.18575 by Aditya Arora, Akshita Gupta, Marcus Rohrbach, Pau Rodriguez.

Figure 1: Diffusion models condition each frame on text alone, failing when narrative references like “They” or “He” carry no appearance information. ReCap addresses this through a lightweight pipeline: its CORE (COnditional frame REferencing) module activates selectively via Text-Conditioned Gating, conditioning generation on the preceding frame only when the narrative references a character anaphorically, as in f…

Figure 2: Overview of ReCap Architecture. Our method extends Stable Diffusion 3 with two components: (1) CORE (bottom): encodes the previous frame It−1 into context embedding ct−1 (Eq. 2) via a lightweight convolutional module with Guidance Attention Blocks, and injects it as a residual into each transformer block (Eq. 4), activated only when no character name appears in the current text prompt via Text-Conditioned…

Figure 3: Visual quality and consistency on the FlintstonesSV [10] and PororoSV [23] datasets, evaluated using VBench [16]. Subj-Cons. = Subject Consistency, BG-Cons. = Background Consistency, Aes-Qual. = Aesthetic Quality, Img-Qual. = Imaging Quality. Higher scores are better for all metrics.

Figure 4: Qualitative comparison on FlintstonesSV [10] (top) and PororoSV [23] (bottom). Pink tokens indicate anaphoric references (pronouns) in the narrative. On FlintstonesSV (top left), SD3 fails to preserve character appearance across pronoun-containing frames (2–4), generating inconsistent clothing and visual features, while ReCap closely matches the ground truth throughout. On FlintstonesSV (top right), SD3 f…

Figure 5: Qualitative comparison on VWP [14], featuring a naturalistic narrative about a race car driver referred to by pronouns. In frame 2, the story uses “he” but we deliberately keep CORE off to isolate whether SemDrift regularization alone preserves similar scene cues (for example, the collar in the foreground and the car door in the background). In frame 4, the story again uses “he” and we keep CORE on (the d…

Figure 6: Effect of temporal context length. Foreground and background accuracy as the number of conditioning frames increases. Performance remains stable across longer temporal windows, indicating that increasing the number of frames does not degrade generation quality.

Figure 7: Qualitative comparison on FlintstonesSV [10].

Figure 8: Qualitative comparison on FlintstonesSV [10].

Figure 9: Qualitative comparison on the PororoSV [23] dataset with method order from top to bottom: SD3, StoryGPT-V [43], ReCap (ours), and Ground Truth.

Figure 10: Qualitative comparison on the PororoSV [23] dataset with method order from top to bottom: SD3, StoryGPT-V [43], ReCap (ours), and Ground Truth.

Figure 11: Qualitative comparison on VWP [14]. We employ our pronoun replacing strategy on top of the text taken directly from the dataset. The scene depicts a crowded party where the narrative focuses on a conversation between two individuals. SD3 generates inconsistent participants across frames and fails to ground the speakers within the crowd. Our method produces a more coherent sequence that better reflects t…

Figure 12: Qualitative results on VWP [14]. We employ our pronoun replacing strategy on top of the text taken directly from the dataset. The narrative escalates from a meeting (frame 1) to a gun threat (frame 2). In frame 2 (“he pointed a gun at the man”), SD3 fails to depict the described interaction and produces unrelated visual content. With contextual guidance from the CORE module, our method better reflects the…

Figure 13: Qualitative results on VWP [14]. We employ our pronoun replacing strategy on top of the text taken directly from the dataset. The story transitions between different narrative moments, beginning with parents reading a letter and continuing with the son’s prison experiences. SD3 produces inconsistent characters and fails to reflect the narrative events described in the text. In contrast, our method better…

Figure 14: Comparison of feature activation maps from DINOv3 and CLIP en…
read the original abstract

Story Visualization aims to generate a sequence of images that faithfully depicts a textual narrative that preserve character identity, spatial configuration, and stylistic coherence as the narratives unfold. Maintaining such cross-frame consistency has traditionally relied on explicit memory banks, architectural expansion, or auxiliary language models, resulting in substantial parameter growth and inference overhead. We introduce ReCap, a lightweight consistency framework that improves character stability and visual fidelity without modifying the base diffusion backbone. ReCap's CORE (COnditional frame REferencing) module treats anaphors, in our case pronouns, as visual anchors, activating only when characters are referred to by a pronoun and conditioning on the preceding frame to propagate visual identity. This selective design avoids unconditional cross-frame conditioning and introduces only 149K additional parameters, a fraction of the cost of memory-bank and LLM-augmented approaches. To further stabilize identity, we incorporate SemDrift (Guided Semantic Drift Correction) applied only during training. When text is vague or referential, the denoiser lacks a visual anchor for identity-defining attributes, causing character appearance to drift across frames, SemDrift corrects this by aligning denoiser representations with pretrained DINOv3 visual embeddings, enforcing semantic identity stability at zero inference cost. ReCap outperforms previous state-of-the-art, StoryGPT-V, on the two main benchmarks for story visualization by 2.63% Character-Accuracy on FlintstonesSV and by 5.65% on PororoSV, establishing a new state-of-the-art character consistency on both benchmarks. Furthermore, we extend story visualization to human-centric narratives derived from real films, demonstrating the capability of ReCap beyond stylized cartoon domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ReCap, a lightweight consistency framework for story visualization. It features the CORE module, which selectively activates on pronouns to condition the current frame on the previous one for identity propagation while adding only 149K parameters, and SemDrift, which aligns denoiser features to DINOv3 embeddings solely during training to prevent semantic drift. The paper claims new state-of-the-art character accuracy on FlintstonesSV (a 2.63% improvement over StoryGPT-V) and PororoSV (a 5.65% improvement), and extends to human-centric film narratives.

Significance. If the reported improvements hold under scrutiny, ReCap represents a significant advance in efficient story visualization by avoiding the parameter bloat of memory banks or auxiliary models while maintaining or improving consistency. The training-only alignment is particularly attractive for deployment. The lightweight design and zero-inference-cost component are clear strengths that could influence practical applications in coherent image sequence generation.

major comments (3)
  1. [Experiments] Experiments section: The headline gains of 2.63% Character-Accuracy on FlintstonesSV and 5.65% on PororoSV are presented without error bars, standard deviations, or results across multiple random seeds or data splits, preventing assessment of whether the new SOTA is statistically reliable or reproducible.
  2. [Method (CORE)] CORE module description: The selective pronoun-only activation is claimed to propagate identity without unconditional conditioning or memory banks, but no experiments or analysis evaluate performance on narratives with low pronoun frequency, use of proper names, or descriptive references; this directly tests the weakest assumption underlying the consistency claim.
  3. [Method (SemDrift)] SemDrift section: Alignment to DINOv3 embeddings occurs only at training time with zero inference cost, yet no ablation isolates its contribution or shows that train-time alignment alone suffices to block drift at inference when visual attributes diverge from DINOv3 pretraining or when pronouns are absent.
minor comments (2)
  1. [Abstract] Abstract and method: The phrase 'in our case pronouns' is used without detailing the pronoun detection implementation, its accuracy, or failure modes.
  2. [Experiments] Related work or experiments: A table comparing parameter counts, inference latency, and memory usage against all baselines (including StoryGPT-V) would clarify the lightweight advantage; a parameter-counting sketch follows these comments.
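Producing the parameter column of such a table is mechanical in PyTorch; a minimal counting sketch, where the add-on module below is a hypothetical stand-in rather than ReCap's actual CORE design:

```python
import torch.nn as nn

def count_parameters(module: nn.Module) -> int:
    """Total trainable parameters of an added component."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Hypothetical stand-in for a lightweight add-on module; the paper's
# CORE design is not reproduced here.
addon = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.SiLU(),
    nn.Conv2d(32, 16, kernel_size=3, padding=1),
)
print(f"add-on parameters: {count_parameters(addon):,}")
```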

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments raise valid points about experimental rigor and the need for further validation of our design choices. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: The headline gains of 2.63% Character-Accuracy on FlintstonesSV and 5.65% on PororoSV are presented without error bars, standard deviations, or results across multiple random seeds or data splits, preventing assessment of whether the new SOTA is statistically reliable or reproducible.

    Authors: We agree that reporting variability measures is important for assessing the reliability of the reported gains. In the revised manuscript, we will add standard deviations computed over at least three random seeds for the character-accuracy metrics on both FlintstonesSV and PororoSV, together with the per-seed results, to demonstrate reproducibility (a sketch of such reporting follows these responses). revision: yes

  2. Referee: [Method (CORE)] CORE module description: The selective pronoun-only activation is claimed to propagate identity without unconditional conditioning or memory banks, but no experiments or analysis evaluate performance on narratives with low pronoun frequency, use of proper names, or descriptive references; this directly tests the weakest assumption underlying the consistency claim.

    Authors: The CORE module selectively conditions on pronouns because they serve as common referential anchors in the story visualization benchmarks. To directly address the concern, we will add a new analysis subsection with quantitative results on narrative subsets stratified by pronoun frequency, as well as qualitative and quantitative evaluations on cases using proper names and descriptive references. revision: yes

  3. Referee: [Method (SemDrift)] SemDrift section: Alignment to DINOv3 embeddings occurs only at training time with zero inference cost, yet no ablation isolates its contribution or shows that train-time alignment alone suffices to block drift at inference when visual attributes diverge from DINOv3 pretraining or when pronouns are absent.

    Authors: We will include an ablation study in the revised paper that isolates the contribution of SemDrift by comparing performance with and without the alignment loss. We will also add analysis and results on scenarios with absent pronouns and potential divergence from DINOv3 pretraining to demonstrate that the training-time correction helps maintain consistency at inference. revision: yes
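Both promised analyses, the multi-seed variance and the SemDrift ablation, reduce to the same reporting harness: train each variant over several seeds and summarize character accuracy as mean ± standard deviation. A minimal sketch, where `train_and_evaluate` and all numbers are hypothetical placeholders, not results from the paper:

```python
import statistics

def report(variant: str, train_and_evaluate, seeds=(0, 1, 2)) -> str:
    """Mean ± std of character accuracy for one model variant."""
    scores = [train_and_evaluate(variant, seed) for seed in seeds]
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    return f"{variant}: {mean:.2f} ± {std:.2f} over {len(scores)} seeds"

# Placeholder numbers for illustration only:
fake_runs = {("full", 0): 87.1, ("full", 1): 86.8, ("full", 2): 87.4,
             ("no-semdrift", 0): 84.0, ("no-semdrift", 1): 84.9,
             ("no-semdrift", 2): 83.6}
for variant in ("full", "no-semdrift"):
    print(report(variant, lambda v, s: fake_runs[(v, s)]))
```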

Circularity Check

0 steps flagged

No circularity: empirical benchmark claims with no derivations or self-referential reductions

full rationale

The paper describes ReCap's CORE module (pronoun-selective conditioning on prior frames) and SemDrift (training-only DINOv3 alignment) as design choices, then reports direct empirical gains (2.63% and 5.65% Character-Accuracy) versus StoryGPT-V on FlintstonesSV and PororoSV. No equations, first-principles derivations, fitted-parameter predictions, or self-citation chains appear in the abstract or described text. Claims rest on benchmark comparisons rather than any tautological reduction of outputs to inputs by construction. The method's sufficiency assumptions are empirical hypotheses, not circular definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no explicit free parameters, axioms, or invented entities are described. The approach implicitly assumes that DINOv3 embeddings reliably encode character identity and that selective conditioning suffices without side effects.

pith-pipeline@v0.9.0 · 5607 in / 1130 out tokens · 45643 ms · 2026-05-10T04:46:10.512566+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 4 canonical work pages · 4 internal anchors

  1. [1] Ahn, D., Kim, D., Song, G., Kim, S.H., Lee, H., Kang, D., Choi, J.: Story visualization by online text augmentation with context memory. In: ICCV (2023)

  2. [2] Avrahami, O., Hertz, A., Vinker, Y., Arar, M., Fruchter, S., Fried, O., Cohen-Or, D., Lischinski, D.: The chosen one: Consistent characters in text-to-image diffusion models. In: SIGGRAPH (2024)

  3. [3] Barsellotti, L., Bianchi, L., Messina, N., Carrara, F., Cornia, M., Baraldi, L., Falchi, F., Cucchiara, R.: Talking to DINO: Bridging self-supervised vision backbones with language for open-vocabulary segmentation. In: ICCV (2025)

  4. [4] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)

  5. [5] Chen, H., Han, R., Wu, T.L., Nakayama, H., Peng, N.: Character-centric story visualization via visual planning and token alignment. In: EMNLP (2022)

  6. [6] Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J.M., Parikh, D., Batra, D.: Visual dialog. In: CVPR (2017)

  7. [7] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML (2024)

  8. [8] Goel, A., Fernando, B., Keller, F., Bilen, H.: Who are you referring to? Coreference resolution in image narrations. In: ICCV (2023)

  9. [9] Gong, Y., Guo, Z., Gao, D., Xu, R., Zhang, W., He, X., Shen, Y.: Interactive story visualization with multiple characters. In: SIGGRAPH Asia (2023)

  10. [10] Gupta, T., Schwenk, D., Farhadi, A., Hoiem, D., Kembhavi, A.: Imagine this! Scripts to compositions to videos. In: ECCV (2018)

  11. [11] He, H., Yang, H., Tuo, Z., Zhou, Y., Wang, Q., Zhang, Y., Liu, Z., Huang, W., Chao, H., Yin, J.: Dreamstory: Open-domain story visualization by LLM-guided multi-subject consistent diffusion. TPAMI (2025)

  12. [12] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: CVPR (2022)

  13. [13] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS (2020)

  14. [14] Hong, X., Sayeed, A., Mehra, K., Demberg, V., Schiele, B.: Visual writing prompts: Character-grounded story generation with curated image sequences. TACL (2023)

  15. [15] Huang, R., Long, Y., Han, J., Xu, H., Liang, X., Xu, C., Liang, X.: NLIP: Noise-robust language-image pre-training. In: AAAI (2023)

  16. [16] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: VBench: Comprehensive benchmark suite for video generative models. In: CVPR (2024)

  17. [17] Jose, C., Moutakanni, T., Kang, D., Baldassarre, F., Darcet, T., Xu, H., Li, D., Szafraniec, M., Ramamonjisoa, M., Oquab, M., et al.: DINOv2 meets text: A unified framework for image- and pixel-level vision-language alignment. In: CVPR (2025)

  18. [18] Joshi, M., Levy, O., Zettlemoyer, L., Weld, D.: BERT for coreference resolution: Baselines and analysis. In: EMNLP-IJCNLP (2019)

  19. [19] Kottur, S., Moura, J.M.F., Parikh, D., Batra, D., Rohrbach, M.: Visual coreference resolution in visual dialog using neural module networks. In: ECCV (2018)

  20. [20] Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., et al.: FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)

  21. [21] Lee, K., He, L., Lewis, M., Zettlemoyer, L.: End-to-end neural coreference resolution. In: EMNLP (2017)

  22. [22] Li, B., Lukasiewicz, T.: Learning to model multimodal semantic alignment for story visualization. In: EMNLP Findings (2022)

  23. [23] Li, Y., Gan, Z., Shen, Y., Liu, J., Cheng, Y., Wu, Y., Carin, L., Carlson, D., Gao, J.: StoryGAN: A sequential conditional GAN for story visualization. In: CVPR (2019)

  24. [24] Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: ICLR (2023)

  25. [25] Liu, C., Wu, H., Zhong, Y., Zhang, X., Wang, Y., Xie, W.: Intelligent Grimm - open-ended visual storytelling via latent diffusion models. In: CVPR (2024)

  26. [26] Maharana, A., Bansal, M.: Integrating visuospatial, linguistic and commonsense structure into story visualization. In: EMNLP (2021)

  27. [27] Maharana, A., Hannan, D., Bansal, M.: Improving generation and evaluation of visual stories via semantic consistency. In: NAACL (2021)

  28. [28] Maharana, A., Hannan, D., Bansal, M.: StoryDALL-E: Adapting pretrained text-to-image transformers for story continuation. In: ECCV (2022)

  29. [29] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv:2304.07193 (2023)

  30. [30] Pan, X., Qin, P., Li, Y., Xue, H., Chen, W.: Synthesizing coherent story with auto-regressive latent diffusion models. In: WACV (2024)

  31. [31] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: ICCV (2023)

  32. [32] Qin, Q., Zhuo, L., Xin, Y., Du, R., Li, Z., Fu, B., Lu, Y., Li, X., Liu, D., Zhu, X., et al.: Lumina-Image 2.0: A unified and efficient image generative framework. In: ICCV (2025)

  33. [33] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)

  34. [34] Rahman, T., Lee, H.Y., Ren, J., Tulyakov, S., Mahajan, S., Sigal, L.: Make-a-story: Visual memory conditioned consistent story generation. In: CVPR (2023)

  35. [36] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)

  36. [37] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zero-shot text-to-image generation. In: ICML (2021)

  37. [38] Rohrbach, A., Rohrbach, M., Tang, S., Joon Oh, S., Schiele, B.: Generating descriptions with grounded and co-referenced people. In: CVPR (2017)

  38. [39] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

  39. [40] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS (2022)

  40. [41] Santurkar, S., Dubois, Y., Taori, R., Liang, P., Hashimoto, T.: Is a caption worth a thousand images? A controlled study for representation learning. In: ICLR (2023)

  41. [42] Shen, F., Ye, H., Liu, S., Zhang, J., Wang, C., Han, X., Wei, Y.: Boosting consistency in story visualization with rich-contextual conditional diffusion models. In: AAAI (2025)

  42. [43] Shen, X., Elhoseiny, M.: StoryGPT-V: Large language models as consistent story visualizers. In: CVPR (2025)

  43. [44] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025)

  44. [45] Song, Y.Z., Tam, Z.R., Chen, H.J., Lu, H.H., Shuai, H.H.: Character-preserving coherent story visualization. In: ECCV (2020)

  45. [46] Tao, M., Bao, B.K., Tang, H., Wang, Y., Xu, C.: StoryImager: A unified and efficient framework for coherent story visualization and completion. In: ECCV (2024)

  46. [47] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)

  47. [48] Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., Change Loy, C.: ESRGAN: Enhanced super-resolution generative adversarial networks. In: ECCVW (2018)

  48. [49] Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: CBAM: Convolutional block attention module. In: ECCV (2018)

  49. [50] Wysoczańska, M., Siméoni, O., Ramamonjisoa, M., Bursuc, A., Trzciński, T., Pérez, P.: CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation. In: ECCV (2024)

  50. [51] Yu, X., Zhang, H., Song, Y., Song, Y., Zhang, C.: What you see is what you get: Visual pronoun coreference resolution in dialogues. In: EMNLP-IJCNLP (2019)

  51. [52] Zhang, J., Herrmann, C., Hur, J., Polania Cabrera, L., Jampani, V., Sun, D., Yang, M.H.: A tale of two features: Stable diffusion complements DINO for zero-shot semantic correspondence. In: NeurIPS (2023)

  52. [53] Zhou, Y., Zhou, D., Cheng, M.M., Feng, J., Hou, Q.: StoryDiffusion: Consistent self-attention for long-range image and video generation. NeurIPS (2024)