hub Baseline reference

HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan · 2025 · cs.CV · arXiv 2505.22705

Baseline reference. 73% of citing Pith papers use this work as a benchmark or comparison.

26 Pith papers citing it

Baseline 73% of classified citations

open full Pith review browse 26 citing papers arXiv PDF

abstract

Recent advancements in image generative foundation models have prioritized quality improvements but often at the cost of increased computational complexity and inference latency. To address this critical trade-off, we introduce HiDream-I1, a new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds. HiDream-I1 is constructed with a new sparse Diffusion Transformer (DiT) structure. Specifically, it starts with a dual-stream decoupled design of sparse DiT with dynamic Mixture-of-Experts (MoE) architecture, in which two separate encoders are first involved to independently process image and text tokens. Then, a single-stream sparse DiT structure with dynamic MoE architecture is adopted to trigger multi-model interaction for image generation in a cost-efficient manner. To support flexiable accessibility with varied model capabilities, we provide HiDream-I1 in three variants: HiDream-I1-Full, HiDream-I1-Dev, and HiDream-I1-Fast. Furthermore, we go beyond the typical text-to-image generation and remould HiDream-I1 with additional image conditions to perform precise, instruction-based editing on given images, yielding a new instruction-based image editing model namely HiDream-E1. Ultimately, by integrating text-to-image generation and instruction-based image editing, HiDream-I1 evolves to form a comprehensive image agent (HiDream-A1) capable of fully interactive image creation and refinement. To accelerate multi-modal AIGC research, we have open-sourced all the codes and model weights of HiDream-I1-Full, HiDream-I1-Dev, HiDream-I1-Fast, HiDream-E1 through our project websites: https://github.com/HiDream-ai/HiDream-I1 and https://github.com/HiDream-ai/HiDream-E1. All features can be directly experienced via https://vivago.ai/studio.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

baseline 11 background 3 method 1

citation-polarity summary

baseline 11 background 3 use method 1

representative citing papers

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.

PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

cs.CV · 2026-02-06 · unverdicted · novelty 7.0

PlanViz is a new benchmark with three sub-tasks and PlanScore metric to evaluate planning-oriented image generation and editing by unified multimodal models for computer-use tasks.

Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

An image-semantic guided method enhances MLLMs for detecting AI-generated modern Chinese poetry by combining poem text with visual representations of content, achieving 85.65% Macro-F1 with Gemini and outperforming text baselines and RoBERTa.

Training-Free Occluded Text Rendering via Glyph Priors and Attention-Guided Semantic Blending

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

A restarted dual-stream inference approach with glyph priors and attention-guided masks improves occluded text rendering in pretrained diffusion models without fine-tuning.

When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.

HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B to over 200B parameters.

DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

cs.CV · 2026-04-28 · unverdicted · novelty 6.0

DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.

Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.

Self-Adversarial One Step Generation via Condition Shifting

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.

Nucleus-Image: Sparse MoE for Image Generation

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.

BLK-Assist: A Methodological Framework for Artist-Led Co-Creation with Generative AI Models

cs.CY · 2026-03-10 · unverdicted · novelty 6.0

BLK-Assist is a three-part framework (Conceptor for sketches, Stencil for transparent assets, Upscale for high-res outputs) that fine-tunes public diffusion models on one artist's proprietary corpus for style-faithful generative co-creation.

Emu3.5: Native Multimodal Models are World Learners

cs.CV · 2025-10-30 · unverdicted · novelty 6.0

Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.

EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

cs.CV · 2025-09-24 · unverdicted · novelty 6.0

EditVerse unifies image and video editing and generation in one transformer model via unified token sequences and in-context learning, trained jointly on curated video editing data plus image/video corpora and evaluated on a new instruction-based benchmark.

UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

eess.AS · 2026-05-29 · unverdicted · novelty 5.0

UNISON introduces a unified latent diffusion framework with layer-wise LLM fusion and channel-mask task encoding for multiple speech and sound generation and editing tasks.

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

Lens is a 3.8B-parameter text-to-image model that reaches competitive or superior performance to >6B-parameter systems using 19.3% of the training compute of Z-Image through a densely captioned 800M dataset, multi-resolution batching, semantic VAE, strong language encoder, RL fine-tuning, and 4-step

MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.

Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing

cs.CV · 2026-05-16 · unverdicted · novelty 5.0

Edit-GRPO decouples editing and preservation objectives via region-specific signals in a policy optimization framework to improve locality in image editing tasks.

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

LongCat-Image Technical Report

cs.CV · 2025-12-08 · unverdicted · novelty 5.0

LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

cs.CV · 2025-08-28 · unverdicted · novelty 5.0

Pref-GRPO stabilizes T2I RL training by using pairwise win rates from preference models as rewards instead of normalized pointwise scores, while UniGenBench enables finer-grained model evaluation across themes and criteria.

Qwen-Image Technical Report

cs.CV · 2025-08-04 · unverdicted · novelty 5.0

Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive encoding for editing consistency.

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

cs.GR · 2026-05-05 · unverdicted · novelty 4.0 · 2 refs

JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.

Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

cs.CV · 2026-05-04 · unverdicted · novelty 4.0 · 2 refs

Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

citing papers explorer

Showing 1 of 1 citing paper after filters.

UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion eess.AS · 2026-05-29 · unverdicted · none · ref 60 · internal anchor
UNISON introduces a unified latent diffusion framework with layer-wise LLM fusion and channel-mask task encoding for multiple speech and sound generation and editing tasks.

HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer