pith. sign in

hub Canonical reference

Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

Canonical reference. 78% of citing Pith papers cite this work as background.

18 Pith papers citing it
Background 78% of classified citations
abstract

Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5 and Midjourney, still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions, which adeptly manages text prompt in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances the accuracy and richness for image description. Particularly, Seedream is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enable it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Beside, Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score. In addition, it can be readily adapted to an instruction-based image editing model, such as SeedEdit, with strong editing capability that balances instruction-following and image consistency.

hub tools

citation-role summary

background 7 extension 1 method 1

citation-polarity summary

fields

cs.CV 18

years

2026 7 2025 11

representative citing papers

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

cs.CV · 2025-11-24 · conditional · novelty 6.0

DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while closing the gap to latent diffusion methods.

DanceGRPO: Unleashing GRPO on Visual Generation

cs.CV · 2025-05-12 · unverdicted · novelty 6.0

DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.

Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

cs.CV · 2025-05-08 · unverdicted · novelty 6.0

Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.

A Systematic Post-Train Framework for Video Generation

cs.CV · 2026-04-28 · unverdicted · novelty 5.0

A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.

LongCat-Image Technical Report

cs.CV · 2025-12-08 · unverdicted · novelty 5.0

LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.

Qwen-Image Technical Report

cs.CV · 2025-08-04 · unverdicted · novelty 5.0

Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive encoding for editing consistency.

Qwen-Image-2.0 Technical Report

cs.CV · 2026-05-11 · unverdicted · novelty 4.0

Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.

Seedream 3.0 Technical Report

cs.CV · 2025-04-15 · unverdicted · novelty 4.0

Seedream 3.0 improves bilingual image generation through doubled defect-aware data, mixed-resolution training, cross-modality RoPE, representation alignment, aesthetic SFT, VLM reward modeling, and importance-aware timestep sampling for 4-8x faster inference at up to 2K resolution.

Seedream 4.0: Toward Next-generation Multimodal Image Generation

cs.CV · 2025-09-24 · unverdicted · novelty 3.0

Seedream 4.0 unifies text-to-image synthesis, image editing, and multi-image composition in an efficient diffusion transformer pretrained on billions of pairs and accelerated to 1.8 seconds for 2K output.

citing papers explorer

Showing 18 of 18 citing papers.

  • Flow-GRPO: Training Flow Matching Models via Online RL cs.CV · 2025-05-08 · unverdicted · none · ref 58 · internal anchor

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  • Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards cs.CV · 2026-03-01 · unverdicted · none · ref 22 · internal anchor

    SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

  • Leveraging Verifier-Based Reinforcement Learning in Image Editing cs.CV · 2026-04-30 · unverdicted · none · ref 18 · 2 links · internal anchor

    Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.

  • PixelGen: Improving Pixel Diffusion with Perceptual Supervision cs.CV · 2026-02-02 · accept · none · ref 6 · internal anchor

    PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.

  • Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model cs.CV · 2025-12-15 · unverdicted · none · ref 5 · internal anchor

    Seedance 1.5 pro is a joint audio-visual generation model achieving high synchronization via dual-branch diffusion transformer and post-training optimizations.

  • DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation cs.CV · 2025-11-24 · conditional · none · ref 15 · internal anchor

    DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while closing the gap to latent diffusion methods.

  • DanceGRPO: Unleashing GRPO on Visual Generation cs.CV · 2025-05-12 · unverdicted · none · ref 8 · internal anchor

    DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.

  • Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation cs.CV · 2025-05-08 · unverdicted · none · ref 25 · internal anchor

    Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.

  • A Systematic Post-Train Framework for Video Generation cs.CV · 2026-04-28 · unverdicted · none · ref 6 · internal anchor

    A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.

  • LongCat-Image Technical Report cs.CV · 2025-12-08 · unverdicted · none · ref 2 · internal anchor

    LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.

  • Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing cs.CV · 2025-09-02 · unverdicted · none · ref 7 · internal anchor

    Rebalancing designer-painter roles by assigning design to the understanding module via the new DIM dataset yields SOTA image editing performance with a 4.6B model.

  • Qwen-Image Technical Report cs.CV · 2025-08-04 · unverdicted · none · ref 11 · internal anchor

    Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive encoding for editing consistency.

  • Qwen-Image-2.0 Technical Report cs.CV · 2026-05-11 · unverdicted · none · ref 9 · internal anchor

    Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.

  • MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings cs.CV · 2026-04-21 · unverdicted · none · ref 29 · internal anchor

    MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.

  • Seedance 1.0: Exploring the Boundaries of Video Generation Models cs.CV · 2025-06-10 · unverdicted · none · ref 8 · internal anchor

    Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.

  • Seedream 3.0 Technical Report cs.CV · 2025-04-15 · unverdicted · none · ref 4 · internal anchor

    Seedream 3.0 improves bilingual image generation through doubled defect-aware data, mixed-resolution training, cross-modality RoPE, representation alignment, aesthetic SFT, VLM reward modeling, and importance-aware timestep sampling for 4-8x faster inference at up to 2K resolution.

  • Seedance 2.0: Advancing Video Generation for World Complexity cs.CV · 2026-04-15 · unverdicted · none · ref 7 · internal anchor

    Seedance 2.0 is an updated multi-modal model for generating 4-15 second audio-video content at 480p/720p with support for up to 3 video, 9 image, and 3 audio references.

  • Seedream 4.0: Toward Next-generation Multimodal Image Generation cs.CV · 2025-09-24 · unverdicted · none · ref 4 · internal anchor

    Seedream 4.0 unifies text-to-image synthesis, image editing, and multi-image composition in an efficient diffusion transformer pretrained on billions of pairs and accelerated to 1.8 seconds for 2K output.