
arxiv: 2310.07749 · v2 · pith:QCVOTRY2new · submitted 2023-10-11 · 💻 cs.CV

OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation

classification 💻 cs.CV
keywords interleaved · generation · evaluation · image-text · models · framework · images · open-domain

This work investigates a challenging task named open-domain interleaved image-text generation, which generates interleaved texts and images following an input query. We propose a new interleaved generation framework based on prompting large-language models (LLMs) and pre-trained text-to-image (T2I) models, namely OpenLEAF. In OpenLEAF, the LLM generates textual descriptions, coordinates T2I models, creates visual prompts for generating images, and incorporates global contexts into the T2I models. This global context improves the entity and style consistencies of images in the interleaved generation. For model assessment, we first propose to use large multi-modal models (LMMs) to evaluate the entity and style consistencies of open-domain interleaved image-text sequences. According to the LMM evaluation on our constructed evaluation set, the proposed interleaved generation framework can generate high-quality image-text content for various domains and applications, such as how-to question answering, storytelling, graphical story rewriting, and webpage/poster generation tasks. Moreover, we validate the effectiveness of the proposed LMM evaluation technique with human assessment. We hope our proposed framework, benchmark, and LMM evaluation could help establish the intriguing interleaved image-text generation task.
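The division of labor described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Segment` type, the `generate_interleaved` function, and the three callables (`write_segments`, `summarize_context`, `text_to_image`) are hypothetical names introduced here, and the stand-in functions return canned strings where a real system would call an LLM and a T2I model. The one idea it tries to show is that a single global context string is appended to every visual prompt, so all images share the same entities and style.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Segment:
    text: str           # text shown to the reader
    visual_prompt: str  # per-segment prompt intended for the T2I model


def generate_interleaved(
    query: str,
    write_segments: Callable[[str], List[Segment]],
    summarize_context: Callable[[str, List[Segment]], str],
    text_to_image: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Assumed pipeline shape: text first, then one shared context, then images."""
    segments = write_segments(query)              # LLM writes text + visual prompts
    context = summarize_context(query, segments)  # LLM describes shared entities/style
    output = []
    for seg in segments:
        # Append the same global context to every visual prompt.
        prompt = f"{seg.visual_prompt}. {context}"
        output.append(
            {"text": seg.text, "prompt": prompt, "image": text_to_image(prompt)}
        )
    return output


# Hypothetical stand-ins for the LLM and T2I model (canned outputs only).
def fake_write_segments(query: str) -> List[Segment]:
    return [
        Segment("Mix the batter.", "a bowl of pancake batter"),
        Segment("Fry until golden.", "a pancake in a pan"),
    ]


def fake_summarize_context(query: str, segments: List[Segment]) -> str:
    return "Same kitchen, warm watercolor style"


def fake_text_to_image(prompt: str) -> str:
    return f"<image for: {prompt}>"


result = generate_interleaved(
    "How do I make pancakes?",
    fake_write_segments,
    fake_summarize_context,
    fake_text_to_image,
)
for item in result:
    print(item["text"])
    print(item["image"])
```

Passing the models in as callables keeps the sketch independent of any particular LLM or T2I API; swapping in real models would only require functions with the same signatures.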



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Pareto LoRA: Mitigating Modality Imbalance in Unified Multimodal Models via Pareto-Optimal Gradient Integration

    cs.CV 2026-06 unverdicted novelty 6.0

    Pareto LoRA applies Pareto-optimal gradient integration to balance text and image objectives in LoRA-based fine-tuning of unified multimodal models, reporting up to 44.9% gains in image quality on the CoMM benchmark w...

  2. Toward Native Multimodal Modeling: A Roadmap

    cs.CV 2026-05 unverdicted novelty 3.0

    A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-...