pith. machine review for the scientific record.

arxiv: 2604.16958 · v1 · submitted 2026-04-18 · 💻 cs.CV

Recognition: unknown

Self-Reasoning Agentic Framework for Narrative Product Grid-Collage Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords narrative product photography · grid collage generation · agentic framework · self-reasoning · image generation · visual storytelling · prompt refinement · unified collage synthesis

The pith

A self-reasoning agentic framework builds a narrative plan from a single product image, then generates and refines the entire multi-grid collage as one unified image.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that an automated system can produce stronger narrative product photography collages than direct prompting by first extracting an explicit story structure from the product, enforcing that structure across panels, and then using internal checks to spot and fix weaknesses. Current generators often produce disconnected scenes or weak storytelling when asked for multi-panel product visuals, which limits their use in marketing. The proposed method treats the collage as a single coordinated image, applies shared style rules, and loops through self-evaluation and targeted prompt changes until quality gates pass. If the approach works as described, it would let users obtain coherent visual stories from minimal input without manual iteration or post-editing.

Core claim

Given only a product packshot and its name, the framework first constructs a Product Narrative Framework that captures the product's identity, usage context, and situational environment, then translates these elements into complementary grids under one visual style. Constraint-aware prompts drive a generator to synthesize the full collage as a single image rather than independent panels. The output is scored on content validity and photography quality; failed gates trigger failure attribution and prompt refinement, allowing iterative self-correction. Experiments show the resulting collages score higher on aesthetic quality, narrative richness, and visual coherence than direct-prompting runs.
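The pipeline described above (ideation, constraint-aware prompting, unified generation, gated evaluation, failure attribution, refinement) amounts to a control loop, which can be sketched as follows. This is a hypothetical reconstruction, not the paper's API: every function and threshold below is an illustrative stand-in, and only the five-iteration cap comes from the paper (its Figure 5 setup).

```python
from dataclasses import dataclass
import random

QUALITY_GATE = 0.8   # assumed gate threshold; the paper does not state its values
MAX_ITERS = 5        # the paper reports a cap of five refinement iterations

@dataclass
class Report:
    content_valid: bool
    quality: float
    failures: list

# --- Stubs standing in for the paper's unspecified components ---

def build_narrative(packshot, name):
    # Product Narrative Framework: identity, usage context, environment.
    return {"identity": name, "usage": "daily use", "environment": "home"}

def compile_prompts(narrative):
    # Constraint-aware prompts for the whole grid under one visual style.
    return [f"{narrative['identity']} in {narrative['environment']}"]

def synthesize(prompts):
    # Placeholder for joint synthesis of the collage as one unified image.
    return {"prompts": tuple(prompts)}

def evaluate(collage, rng):
    # Stand-in for the content-validity and photography-quality gates.
    q = rng.random()
    return Report(content_valid=q > 0.2, quality=q,
                  failures=[] if q >= QUALITY_GATE else ["weak narrative link"])

def attribute_and_refine(prompts, failures):
    # Failure attribution -> targeted prompt changes.
    return prompts + [f"fix: {f}" for f in failures]

def generate_collage(packshot, name, seed=0):
    rng = random.Random(seed)
    prompts = compile_prompts(build_narrative(packshot, name))
    collage = None
    for _ in range(MAX_ITERS):
        collage = synthesize(prompts)
        report = evaluate(collage, rng)
        if report.content_valid and report.quality >= QUALITY_GATE:
            return collage, True   # both gates passed: accept
        prompts = attribute_and_refine(prompts, report.failures)
    return collage, False          # iteration cap reached, best effort
```

The key structural point the sketch captures is that generation is always of the whole collage at once; refinement edits the prompt set, never individual panels.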

What carries the argument

The self-reasoning agentic framework, which builds an explicit Product Narrative Framework to coordinate story and style across grids, then routes generation through evaluation gates that trigger targeted refinement when content or quality checks fail.

If this is right

  • Multi-grid collages are synthesized as single unified images, which enforces cross-panel consistency that separate generation cannot guarantee.
  • The explicit Product Narrative Framework supplies shared constraints on identity, context, and style that direct prompts rarely maintain across panels.
  • Failure attribution plus targeted prompt changes allow automatic progressive improvement instead of requiring user intervention after each generation.
  • The same pipeline produces outputs that measurably outperform direct prompting on aesthetic quality, narrative richness, and coherence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The unified-generation step could reduce compositing artifacts that appear when independent panels are later stitched together.
  • Similar agentic loops with narrative planning and self-evaluation might apply to other multi-view storytelling tasks such as instructional diagrams or short comic sequences.
  • If the gates prove stable across product categories, the method offers a template for adding structured reasoning to other generative image workflows that currently rely on one-shot prompting.

Load-bearing premise

The internal gates can reliably detect narrative and photographic shortcomings, and each refinement round will produce measurable gains without introducing new inconsistencies or entering endless loops.
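One way to operationalize that premise is to accept a refinement round only when the gate score measurably improves, and to stop otherwise. A minimal guard sketch (hypothetical, not from the paper; `score_fn` and `refine_fn` are whatever gate and refinement operators the system uses):

```python
def refine_with_guard(score_fn, refine_fn, state, max_iters=5, min_gain=1e-3):
    """Accept a refinement only if the gate score measurably improves.

    Stopping when a round yields no measurable gain rules out both
    endless loops and silently accepted regressions.
    """
    best, best_score = state, score_fn(state)
    for _ in range(max_iters):
        candidate = refine_fn(best)
        cand_score = score_fn(candidate)
        if cand_score - best_score < min_gain:
            break  # no measurable improvement: terminate
        best, best_score = candidate, cand_score
    return best, best_score
```

Without such a monotonicity check, the premise reduces to trusting that the refinement operator never regresses, which is exactly what the paper leaves unvalidated.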

What would settle it

Human raters score the framework's collages as equal to or lower than direct-prompt baselines in narrative richness and visual coherence on a held-out set of at least twenty diverse products.

Figures

Figures reproduced from arXiv: 2604.16958 by Fuzhang Wu, Minyan Luo, Oliver Deussen, Tong-Yee Lee, Weiming Dong, Xincan Wang, Yifei Li, Yuxin Zhang.

Figure 1: Examples of narrative product grid collage generation. Rather than merely depicting a product in isolation, we generate grid collages that incorporate narrative elements, establishing associations with functionality, affordances, or lifestyle enhancement. Our method can further transfer the photographic direction from a reference image while being specifically tailored to the target product.

Figure 2: With identical inputs, our method yields improved aesthetic quality, stronger narrative richness, and higher visual coherence than Nano Banana Pro direct generation. (a) Product packshot. (b) Ground-truth four-panel campaign grid staged and photographed by professional photographers. Creation mode: (c) Nano Banana Pro direct output using the in-figure prompt caption; (d) our result built on Nano Banana Pro.

Figure 3: The architecture of our proposed self-reasoning agentic framework for narrative product grid collage generation.

Figure 5: Results of iterative refinement under creation mode. Experiments are conducted with a maximum of five refinement iterations, and terminate early when the generated grid satisfies the predefined evaluation thresholds.

Figure 4: Comparison between our method and direct generation baselines under creation mode. Results are evaluated on 400 product images, with up to 3 iterations.

Figure 6: Iteration evaluation under creation mode. We report scores of direct generation and our iterative outputs after 1/2/3 refinement rounds (Iter1–Iter3). Panels: Trader Joe's Canvas Tote Bag, Flower Knows Cosmetics Set, Straw Hat with Daisies, Fujifilm Camera.

Figure 7: Our framework supports diverse layouts (e.g., 3×3 and 1×3), enabling flexible grid collages.

Figure 9: Results of the user study.

Figure 10: Comparison with enhanced-prompt direct generation.

Figure 11: Results for two products from distinct categories.

Figure 12: Visualization of inter-grid relation matrices.
Original abstract

Narrative-driven product photography has become a prevalent paradigm in modern marketing, as coherent visual storytelling helps convey product value and establishes emotional engagement with consumers. However, existing image generation methods do not support structured narrative planning or cross-panel coordination, often resulting in weak storytelling and visual incoherence. In practice, narrative product photography is commonly presented as multi-grid collages, where multiple views or scenes jointly communicate a product narrative. To ensure visual consistency across grids and aesthetic harmony of the overall composition, we generate the collage as a single unified image rather than composing independently synthesized panels. We propose a self-reasoning agentic framework for narrative product grid collage generation. Given a product packshot and its name, the system first constructs a Product Narrative Framework that explicitly represents the product's identity, usage context, and situational environment, and translates it into complementary grids governed by a shared visual style. Constraint-aware prompts are then compiled and fed to a generation model that synthesizes the collage jointly. The generated output is evaluated on both content validity and photography quality, with explicit gates determining whether to proceed or refine. When evaluation fails, the system performs failure attribution and applies targeted refinement, enabling progressive improvement through iterative self-reflection. Experiments demonstrate that our framework consistently improves aesthetic quality, narrative richness, and visual coherence, compared to direct prompting baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a self-reasoning agentic framework for narrative product grid-collage generation. Given a product packshot and name, it first builds a Product Narrative Framework capturing identity, usage context, and environment; translates this into complementary grids with shared visual style; compiles constraint-aware prompts to synthesize the entire collage as one unified image; and applies iterative self-evaluation via content-validity and photography-quality gates, with failure attribution and targeted prompt refinement when gates fail. The abstract states that experiments show consistent improvements in aesthetic quality, narrative richness, and visual coherence over direct-prompting baselines.

Significance. If the self-reasoning loop demonstrably yields progressive, externally validated gains in narrative coherence without circular self-assessment, the approach could advance structured agentic methods for multi-panel image synthesis in marketing applications. The unified-image generation strategy for cross-grid consistency is a reasonable design choice, but the current lack of quantitative evidence prevents a firm assessment of significance.

major comments (2)
  1. Abstract: the claim that 'Experiments demonstrate that our framework consistently improves aesthetic quality, narrative richness, and visual coherence, compared to direct prompting baselines' is unsupported by any reported quantitative metrics, baseline implementations, sample sizes, or statistical tests, rendering the central empirical claim unverifiable from the manuscript.
  2. Framework description (self-evaluation and refinement loop): the content-validity and photography-quality gates are presented as reliable triggers for refinement, yet the text provides no implementation details (e.g., whether gates are LLM-prompted judgments) or external validation such as human correlation studies or inter-rater agreement, leaving open the risk that reported improvements are internally defined rather than independently measured.
minor comments (2)
  1. Abstract and introduction: the invented term 'Product Narrative Framework' is used without an explicit definition, formalization, or citation to related narrative-planning literature, which hinders reproducibility.
  2. Overall manuscript: inclusion of qualitative examples (e.g., side-by-side collage outputs before and after refinement iterations) would help illustrate the claimed progressive improvement and failure-attribution process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, acknowledging where the current version falls short and outlining the revisions we will make.

read point-by-point responses
  1. Referee: Abstract: the claim that 'Experiments demonstrate that our framework consistently improves aesthetic quality, narrative richness, and visual coherence, compared to direct prompting baselines' is unsupported by any reported quantitative metrics, baseline implementations, sample sizes, or statistical tests, rendering the central empirical claim unverifiable from the manuscript.

    Authors: We agree that the abstract's empirical claim is not supported by quantitative metrics, statistical tests, or detailed baseline descriptions in the current manuscript. The experiments section provides only qualitative visual examples and narrative descriptions of improvements. In the revised version we will tone down the abstract claim to accurately describe the qualitative evidence presented and add a dedicated quantitative evaluation subsection. This will include user-study metrics for aesthetic quality, narrative richness, and visual coherence; explicit baseline implementations; sample sizes; and statistical significance tests. revision: yes

  2. Referee: Framework description (self-evaluation and refinement loop): the content-validity and photography-quality gates are presented as reliable triggers for refinement, yet the text provides no implementation details (e.g., whether gates are LLM-prompted judgments) or external validation such as human correlation studies or inter-rater agreement, leaving open the risk that reported improvements are internally defined rather than independently measured.

    Authors: We acknowledge that the manuscript lacks sufficient implementation details and external validation for the evaluation gates. The gates are implemented as LLM-prompted judgments, but the specific prompts, decision criteria, and any human correlation were omitted. We will expand the framework section with the exact prompt templates used for content-validity and photography-quality gates. We will also add a human validation study reporting correlation between the automated gates and human raters, including inter-rater agreement statistics, to demonstrate that refinements are not circular. revision: yes
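The human-correlation study promised here needs an agreement statistic between the automated gate's pass/fail decisions and human raters'. A minimal sketch of Cohen's kappa for that purpose (illustrative, not from the paper; the sample labels are invented):

```python
def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items where the raters coincide.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labeled independently.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Example: automated gate vs. one human rater on 8 collages (1 = pass).
gate  = [1, 1, 0, 1, 0, 0, 1, 1]
human = [1, 1, 0, 1, 0, 1, 1, 1]
# cohens_kappa(gate, human) ≈ 0.71
```

Reporting kappa (or a multi-rater variant such as Fleiss' kappa) rather than raw percent agreement is what would distinguish a gate that tracks human judgment from one that merely matches the majority class.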

Circularity Check

0 steps flagged

No circularity detected in framework or experimental claims

full rationale

The paper proposes an agentic self-reasoning framework involving narrative construction, prompt compilation, generation, and iterative refinement via internal gates for content validity and photography quality. No mathematical derivation chain, equations, or first-principles results are present that reduce any claimed prediction or improvement to the inputs by construction. Experimental results are reported as comparisons against direct prompting baselines on aesthetic quality, narrative richness, and visual coherence. While the self-evaluation mechanism could introduce bias in practice, the paper does not exhibit any of the enumerated circularity patterns (self-definitional, fitted-input-as-prediction, self-citation load-bearing, etc.), and the central claims remain independent of any internal redefinition or renaming of results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on untested assumptions about LLM planning accuracy and image model prompt adherence; full text would be needed to identify fitted parameters or additional axioms.

axioms (2)
  • domain assumption Large language models can reliably construct a Product Narrative Framework and perform accurate failure attribution from generated images.
    This underpins the entire self-reasoning and refinement loop described in the abstract.
  • domain assumption The underlying image generation model can synthesize a single coherent collage that respects multiple complementary scene constraints simultaneously.
    Required for the unified generation step to succeed as claimed.
invented entities (1)
  • Product Narrative Framework no independent evidence
    purpose: To explicitly encode product identity, usage context, and environment for guiding complementary grid generation.
    New structured representation introduced to organize the narrative planning stage.

pith-pipeline@v0.9.0 · 5559 in / 1431 out tokens · 46899 ms · 2026-05-10T06:35:39.799853+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 12 canonical work pages · 5 internal anchors

  1. [1]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Black Forest Labs: Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., and Smith, L.

  2. [2]

    ArtiMuse: Fine-grained Image Aesthetics Assessment with Joint Scoring and Expert-level Understanding

    Cao, S., Ma, N., Li, J., Li, X., Shao, L., Zhu, K., Zhou, Y., Pu, Y., Wu, J., Wang, J., et al. arXiv preprint arXiv:2507.14533.

  3. [3]

    StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration

    Hu, P., Jiang, J., Chen, J., Han, M., Liao, S., Chang, X., and Liang, X. arXiv preprint arXiv:2411.04925, 2024.

  4. [4]

    Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation

    Li, Y., Shi, H., Hu, B., Wang, L., Zhu, J., Xu, J., Zhao, Z., and Zhang, M. In SIGGRAPH Asia 2024 Conference Papers, New York, NY, USA.

  5. [5]

    LLMs for Customized Marketing Content Generation and Evaluation at Scale

    Liu, H., Tahmasbi, A., Haque, E. S., and Jain, P. arXiv preprint arXiv:2506.17863.

  6. [6]

    PosterVerse: A Full-Workflow Framework for Commercial-Grade Poster Generation with HTML-Based Scalable Typography

    Liu, J., Zhang, P., Zhang, Y., Yan, P., Zhou, H., Zhou, X., Guo, F., and Jin, L. arXiv preprint arXiv:2601.03993.

  7. [7]

    CRMAgent: A Multi-Agent LLM System for E-commerce CRM Message Template Generation

    Quan, Y., Li, X., and Chen, Y. arXiv preprint arXiv:2507.08325.

  8. [8]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. arXiv preprint arXiv:2204.06125.

  9. [9]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Seedream Team: Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al. arXiv preprint arXiv:2509.20427.

  10. [10]

    AniMaker: Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation

    Shi, H., Li, Y., Chen, X., Wang, L., Hu, B., and Zhang, M. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, New York, NY, USA.

  11. [11]

    Generate E-commerce Product Background by Integrating Category Commonality and Personalized Style

    Wang, H., Feng, W., Li, Y., Zhang, Z., Lv, J., Shen, J., Lin, Z., and Shao, J. In ICASSP 2025, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1–5. IEEE.

  12. [12]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., and Duan, N. arXiv preprint arXiv:2303.04671.

  13. [13]

    Qwen-Image Technical Report

    Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.-m., Bai, S., Xu, X., Chen, Y., et al. arXiv preprint arXiv:2508.02324.

  14. [14]

    MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio

    Xu, X., Mei, J., Li, C., Wu, Y., Yan, M., Lai, S., Zhang, J., and Wu, M. arXiv preprint arXiv:2503.05242.

  15. [15]

    DreamPainter: Image Background Inpainting for E-commerce Scenarios

    Zhao, S., Cheng, J., Wu, Y., Xu, H., and Jiao, S. arXiv preprint arXiv:2508.02155.

  16. [16]

    AgentStory: A Multi-Agent System for Story Visualization with Multi-Subject Consistent Text-to-Image Generation

    Zhou, T., Duan, Z., Chen, C., Zhou, W., Wang, Y., and Li, Y. In International Conference on Multimedia Retrieval (ICMR), pp. 1894–1902, New York, NY, USA. ISBN 978-1-57735-897-8.