arxiv: 2506.15742 · v2 · submitted 2025-06-17 · 💻 cs.GR

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, Luke Smith

Pith reviewed 2026-05-10 16:30 UTC · model grok-4.3

classification 💻 cs.GR

keywords: image generation, image editing, flow matching, in-context generation, latent space, character consistency, multi-turn editing, benchmark

The pith

A flow matching model unifies image generation and editing by concatenating text and image inputs in latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FLUX.1 Kontext as a single architecture that generates new images or edits existing ones by incorporating semantic context from both text prompts and reference images. It relies on a straightforward sequence concatenation step inside a flow matching process to manage local edits and broader generative tasks together. This design yields stronger preservation of specific objects and characters when edits are applied repeatedly, avoiding the consistency loss seen in prior systems. The model matches leading quality benchmarks while running substantially faster, which supports interactive use. The claims rest on results from the new KontextBench dataset of 1026 image-prompt pairs spanning five editing categories.

Core claim

FLUX.1 Kontext is a generative flow matching model that unifies image generation and editing within one architecture. Using sequence concatenation to incorporate semantic context from text and image inputs, it handles both local editing and generative in-context tasks. It demonstrates improved preservation of objects and characters across multiple turns compared to models that degrade in consistency, while delivering competitive performance and significantly faster generation times.

What carries the argument

Sequence concatenation of text and image inputs inside the flow matching model operating in latent space, which unifies local and generative editing tasks.
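As a toy illustration of the mechanism named above, the sketch below joins text tokens, context-image tokens, and noisy target tokens into one sequence and integrates a velocity field with Euler steps. Everything here is an assumption made for illustration: tokens are plain floats, the interpolation path is taken to be x_t = (1 - t) * x_0 + t * noise, only the target positions are updated, and `toy_velocity` is a hypothetical stand-in for the trained network (it simply treats the context tokens as the clean target). It is not the paper's model or sampler.

```python
import random

def build_sequence(text_tokens, context_tokens, target_tokens):
    """Concatenate the three groups; return the sequence and the slice holding the target."""
    sequence = list(text_tokens) + list(context_tokens) + list(target_tokens)
    start = len(text_tokens) + len(context_tokens)
    return sequence, slice(start, start + len(target_tokens))

def toy_velocity(sequence, context_slice, target_slice, t):
    """Hypothetical stand-in for the network.

    For the path x_t = (1 - t) * x_0 + t * noise, the exact velocity toward a
    known x_0 is (x_t - x_0) / t. Here x_0 is read from the context positions
    of the concatenated sequence; the text tokens are carried but unused.
    """
    context = sequence[context_slice]
    target = sequence[target_slice]
    return [(x - c) / t for x, c in zip(target, context)]

def sample(text_tokens, context_tokens, steps=8, seed=0):
    """Euler integration from t = 1 (noise) to t = 0, updating target tokens only."""
    rng = random.Random(seed)
    target = [rng.gauss(0.0, 1.0) for _ in context_tokens]
    context_slice = slice(len(text_tokens), len(text_tokens) + len(context_tokens))
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        sequence, target_slice = build_sequence(text_tokens, context_tokens, target)
        velocity = toy_velocity(sequence, context_slice, target_slice, t)
        target = [x - dt * v for x, v in zip(target, velocity)]
    return target
```

Because the toy velocity is exact for its single-point target, `sample([0.1, 0.2], [1.0, -2.0, 0.5])` ends at the context values up to rounding. A real model would learn the velocity from data and would also use the text tokens.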

Load-bearing premise

The 1026 image-prompt pairs in KontextBench represent typical real-world editing tasks without selection bias that favors the new model.

What would settle it

A larger independent test set of editing tasks where FLUX.1 Kontext shows equal or greater degradation in character consistency and slower speeds than existing models.
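One way such a test could quantify degradation in character consistency is sketched below. It assumes that each edit turn's output has already been mapped to an identity embedding by some encoder of the evaluator's choice. The function names and the use of cosine similarity with a least-squares slope are illustrative assumptions, not the paper's metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def consistency_curve(reference_embedding, turn_embeddings):
    """Similarity of each edit turn's output to the original reference."""
    return [cosine_similarity(reference_embedding, e) for e in turn_embeddings]

def drift_slope(curve):
    """Least-squares slope of similarity versus turn index (needs at least two turns).

    A more negative slope means consistency degrades faster across turns.
    """
    n = len(curve)
    mean_x = (n - 1) / 2
    mean_y = sum(curve) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(curve))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator
```

Comparing `drift_slope` values for two models on the same edit sequences gives a single number per model. Per-turn curves remain more informative when degradation is not linear.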

Original abstract

We present evaluation results for FLUX.1 Kontext, a generative flow matching model that unifies image generation and editing. The model generates novel output views by incorporating semantic context from text and image inputs. Using a simple sequence concatenation approach, FLUX.1 Kontext handles both local editing and generative in-context tasks within a single unified architecture. Compared to current editing models that exhibit degradation in character consistency and stability across multiple turns, we observe that FLUX.1 Kontext improved preservation of objects and characters, leading to greater robustness in iterative workflows. The model achieves competitive performance with current state-of-the-art systems while delivering significantly faster generation times, enabling interactive applications and rapid prototyping workflows. To validate these improvements, we introduce KontextBench, a comprehensive benchmark with 1026 image-prompt pairs covering five task categories: local editing, global editing, character reference, style reference and text editing. Detailed evaluations show the superior performance of FLUX.1 Kontext in terms of both single-turn quality and multi-turn consistency, setting new standards for unified image processing models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FLUX.1 Kontext unifies generation and editing through latent flow matching plus sequence concatenation and reports stronger multi-turn consistency on its new KontextBench, but the benchmark's construction details are missing from the abstract.

The core contribution here is a single flow-matching model that treats both fresh image generation and reference-based editing as the same sequence-concatenation task in latent space. That is a clean architectural move and explains why the authors can claim one model handles local edits, global changes, character reference, style reference, and text edits without separate pipelines. The speed advantage over prior editing systems also follows naturally from the flow-matching backbone they already use in FLUX.1, so that part of the story is believable on its face. Black Forest Labs has shipped reproducible models before, which gives the implementation side some credibility even without code in the abstract.

The main weakness is that every performance claim (better object preservation, less drift across turns, competitive quality) rests on the 1026-pair KontextBench the authors introduce. The abstract says the benchmark was built “to validate these improvements” but supplies no information on prompt sourcing, difficulty calibration, or whether pairs were selected after seeing FLUX.1 Kontext outputs. That is exactly the selection-bias risk the stress-test flagged, and it is not minor when the headline result is “greater robustness in iterative workflows.” Without those details or at least a clear description of baseline re-implementations and statistical tests, the empirical section cannot be evaluated on its own terms.

The paper is aimed at researchers and practitioners who want a single fast model for creative iteration rather than separate generation and editing stacks. It is worth sending to peer review because the unification idea is straightforward to test and the new benchmark categories are useful even if the current numbers need more scrutiny on data provenance. A referee can ask for the missing curation protocol and re-runs on an external test set; that is normal revision work rather than a reason to desk-reject.

Referee Report

3 major / 2 minor

Summary. The paper presents FLUX.1 Kontext, a flow-matching model that unifies image generation and editing via simple sequence concatenation in latent space. It claims improved character/object preservation and multi-turn robustness over prior editing models, competitive performance with current SOTA systems, significantly faster inference, and introduces KontextBench (1026 author-curated image-prompt pairs across local editing, global editing, character reference, style reference, and text editing) to validate these advantages.

Significance. If the empirical claims hold on an independently constructed test distribution, the unified latent-space concatenation approach and observed stability in iterative workflows would constitute a practical advance for interactive image editing pipelines, with the reported speed advantage enabling new prototyping use cases. The flow-matching backbone and single-architecture design are clear strengths that could be extended.

major comments (3)

[KontextBench description] The section introducing KontextBench provides no information on prompt sourcing, difficulty calibration against existing models, or whether the 1026 pairs were selected after observing FLUX.1 Kontext behavior; this directly undermines the central claim of superior multi-turn consistency because selection bias favoring the concatenation mechanism cannot be excluded.
[Evaluation / Results] The evaluation section reports comparisons to SOTA editing models but supplies no details on baseline implementations, versions, hyperparameter choices, or the exact protocol for selecting reference images and prompts; without these, the asserted performance advantage and 'greater robustness' cannot be independently verified.
[Results] No statistical significance tests, confidence intervals, or per-category variance are reported for the metrics on KontextBench; this weakens the headline assertion of 'superior performance' and 'new standards' given that the benchmark is newly introduced.

minor comments (2)

[Abstract] The abstract states that the model 'achieves competitive performance' but does not name the quantitative metrics (e.g., CLIP score, LPIPS, or human preference rates) used to support this.
[Figures and Tables] Figure captions and table headers could more explicitly indicate whether results are single-turn or multi-turn to aid quick reading.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve transparency, reproducibility, and statistical rigor.

Referee: [KontextBench description] The section introducing KontextBench provides no information on prompt sourcing, difficulty calibration against existing models, or whether the 1026 pairs were selected after observing FLUX.1 Kontext behavior; this directly undermines the central claim of superior multi-turn consistency because selection bias favoring the concatenation mechanism cannot be excluded.

Authors: We agree that additional details on benchmark construction are essential to address potential selection bias concerns. The 1026 pairs were assembled from a mix of public image datasets and manually designed prompts targeting known multi-turn failure modes (e.g., character drift, style inconsistency). Difficulty calibration was performed by evaluating preliminary versions of several models on candidate pairs to ensure coverage across easy-to-hard cases. The final set was locked before running the complete FLUX.1 Kontext evaluation suite. We have added a new subsection 'KontextBench Construction' that documents sourcing criteria, the calibration procedure, and the timeline confirming the benchmark was fixed independently of the final model results. This revision directly mitigates the bias concern raised.

revision: yes

Referee: [Evaluation / Results] The evaluation section reports comparisons to SOTA editing models but supplies no details on baseline implementations, versions, hyperparameter choices, or the exact protocol for selecting reference images and prompts; without these, the asserted performance advantage and 'greater robustness' cannot be independently verified.

Authors: We concur that missing implementation details hinder independent verification. The revised manuscript now includes an expanded 'Baseline and Evaluation Protocol' section specifying: exact model versions and checkpoints used for all baselines, inference hyperparameters (steps, guidance scales, schedulers), and the reference selection protocol (fixed benchmark inputs with no post-hoc filtering or cherry-picking; prompts applied verbatim). We have also added a link to the evaluation codebase and configuration files in the supplementary material to enable exact reproduction of the reported numbers and robustness observations.

revision: yes

Referee: [Results] No statistical significance tests, confidence intervals, or per-category variance are reported for the metrics on KontextBench; this weakens the headline assertion of 'superior performance' and 'new standards' given that the benchmark is newly introduced.

Authors: The referee correctly notes that statistical analysis would strengthen the empirical claims. We have updated the results section with per-category means and standard deviations across the 1026 pairs, 95% bootstrap confidence intervals, and paired statistical tests (t-tests and Wilcoxon signed-rank) comparing FLUX.1 Kontext against each baseline. A new table reports these values together with p-values, confirming statistically significant gains in multi-turn consistency metrics. These additions provide the quantitative support for the performance claims.

revision: yes
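The kind of interval the referee asks for can be sketched as a paired percentile bootstrap over per-item score differences. This is a minimal illustration under stated assumptions: the function name, the resample count, and the choice of the percentile method are hypothetical, and nothing here reproduces an analysis from the paper or from the simulated rebuttal.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-item difference (a - b).

    scores_a and scores_b must be aligned: index i refers to the same
    benchmark item evaluated under model A and model B.
    Returns (mean_difference, lower_bound, upper_bound).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each pair together.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

If the resulting interval excludes zero, the per-item advantage is unlikely to be a resampling artifact. It says nothing about whether the items themselves were selected in a biased way, which is the separate concern raised in the first major comment.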

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external comparisons and new benchmark without reduction to inputs

The paper presents FLUX.1 Kontext as a flow-matching model using simple sequence concatenation in latent space to unify generation and editing. No equations, derivations, or first-principles results are shown that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Performance claims (improved multi-turn consistency, competitive SOTA results, faster inference) are supported by direct comparisons to prior editing models on the newly introduced KontextBench. The benchmark is described as created to validate observed improvements, but its results are not forced by the model's internal definitions or training procedure. This is a standard empirical evaluation setup with external benchmarks and SOTA baselines, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the validity of the new benchmark tasks, the assumption that flow matching in latent space supports in-context conditioning via concatenation, and the standard supervised training assumptions for large generative models.

free parameters (1)

model architecture and training hyperparameters
The flow matching model contains a large number of learned parameters whose values are determined by training on unspecified data.

axioms (1)

Domain assumption: Flow matching can be extended to conditional generation by simple sequence concatenation of text and image tokens in latent space.
Invoked to justify the unified architecture without additional adapters or encoders.
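For orientation, the assumption can be written against a standard rectified-flow style objective. The symbols below are illustrative choices rather than notation taken from this page: x is the target image latent, y the context image latent, c the text prompt, and ε Gaussian noise. Conditioning by concatenation means that y and c enter only as extra tokens in the input sequence of the velocity network v_θ.

```latex
z_t = (1 - t)\,x + t\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I), \quad t \in [0, 1]

\mathcal{L}(\theta) = \mathbb{E}_{t,\,x,\,y,\,c,\,\varepsilon}
  \left\| v_\theta(z_t, t, y, c) - (\varepsilon - x) \right\|_2^2
```

Under this form, pure text-to-image generation corresponds to an empty context y, which is how a single objective can cover both generation and editing.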

pith-pipeline@v0.9.0 · 5561 in / 1286 out tokens · 60744 ms · 2026-05-10T16:30:33.008737+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
cs.CV 2026-04 unverdicted novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
cs.CV 2026-05 unverdicted novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
cs.CV 2026-05 unverdicted novelty 7.0

Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
Inline Critic Steers Image Editing
cs.CV 2026-05 conditional novelty 7.0

Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
cs.CV 2026-05 unverdicted novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
cs.CV 2026-05 unverdicted novelty 7.0

Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
cs.CV 2026-05 unverdicted novelty 7.0

UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.
LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR
cs.CV 2026-05 unverdicted novelty 7.0

LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.
Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing
cs.CR 2026-05 unverdicted novelty 7.0

Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.
Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision
cs.CV 2026-05 unverdicted novelty 7.0

Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.
BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing
cs.CV 2026-05 unverdicted novelty 7.0

BRIDGE uses separate main and subject paths plus a discrete gate on positional embeddings to improve local edits with coarse masks, raising local SigLIP2-T from 0.39 to 0.50 on its benchmark.
Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
cs.CV 2026-05 unverdicted novelty 7.0

Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.
MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing
cs.CV 2026-05 unverdicted novelty 7.0

MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.
Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection
cs.CV 2026-05 unverdicted novelty 7.0

MPFM uses flow matching with a Gaussian mixture prior on the velocity field and a mutual information maximizer to improve open-set anomaly detection over unimodal prototype methods.
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
cs.LG 2026-04 unverdicted novelty 7.0

FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.
Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing
cs.CV 2026-04 unverdicted novelty 7.0

A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.
AI-Gram: When Visual Agents Interact in a Social Network
cs.AI 2026-04 unverdicted novelty 7.0

Autonomous visual AI agents spontaneously form image reply chains, maintain stable individual styles, and produce richer style-diverse conversations than single agents can achieve alone.
GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds
cs.CV 2026-04 unverdicted novelty 7.0

GSCompleter completes sparse 3D Gaussian Splatting scenes via a distillation-free generate-then-register pipeline using Stereo-Anchor lifting and Ray-Constrained Registration, delivering SOTA results on three benchmarks.
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
cs.CV 2026-04 unverdicted novelty 7.0

ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
cs.CV 2026-04 unverdicted novelty 7.0

HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
Generative Texture Filtering
cs.CV 2026-04 unverdicted novelty 7.0

A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.
View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity
cs.CV 2026-04 unverdicted novelty 7.0

A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.
Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification
cs.AI 2026-04 unverdicted novelty 7.0

Rule-VLN is the first large-scale benchmark injecting 177 regulatory categories into an urban environment, and the proposed SNRM module equips pre-trained VLN agents with zero-shot semantic reasoning and detour planni...
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
cs.CV 2026-04 unverdicted novelty 7.0

UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
cs.CV 2026-04 unverdicted novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes
cs.CV 2026-04 unverdicted novelty 7.0

Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.
OneHOI: Unifying Human-Object Interaction Generation and Editing
cs.CV 2026-04 unverdicted novelty 7.0

OneHOI unifies HOI generation and editing in one conditional diffusion transformer using role-aware tokens, structured attention, and joint training on mixed datasets to reach SOTA on both tasks.
LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
cs.CV 2026-04 unverdicted novelty 7.0

LottieGPT tokenizes Lottie animations into compact sequences and fine-tunes Qwen-VL to autoregressively generate coherent vector animations from natural language or visual prompts, outperforming prior SVG models.
HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement
cs.CV 2026-04 unverdicted novelty 7.0

A diffusion-based pipeline creates a 27M-annotation dataset of object placements that outperforms human annotations and baselines on image editing tasks, then distills it into a fast model.
RewardFlow: Generate Images by Optimizing What You Reward
cs.CV 2026-04 unverdicted novelty 7.0

RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.
Personalizing Text-to-Image Generation to Individual Taste
cs.CV 2026-04 unverdicted novelty 7.0

PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
cs.CV 2026-04 unverdicted novelty 7.0

RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.
PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space
cs.LG 2026-04 unverdicted novelty 7.0

PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.
HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits
cs.CV 2026-04 unverdicted novelty 7.0

HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
cs.CV 2026-05 unverdicted novelty 6.0

V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
cs.CV 2026-05 unverdicted novelty 6.0

A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
ELF: Embedded Language Flows
cs.CL 2026-05 unverdicted novelty 6.0

ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
cs.CV 2026-05 unverdicted novelty 6.0

A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency
cs.CV 2026-05 unverdicted novelty 6.0

LimeCross enables text-guided editing of individual layers in composite images by conditioning on cross-layer context via bi-stream attention while preserving layer integrity and introducing the LayerEditBench benchmark.
Beyond Thinking: Imagining in 360° for Humanoid Visual Search
cs.CV 2026-05 unverdicted novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
cs.AI 2026-05 unverdicted novelty 6.0

Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...
From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data
cs.CV 2026-05 unverdicted novelty 6.0

The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world perfor...
BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing
cs.CV 2026-05 unverdicted novelty 6.0

BRIDGE improves coarse-mask local image editing in DiT models by routing background and subject paths separately and using a discrete geometric gate on positional embeddings to reduce mask-shape bias.
Implicit Preference Alignment for Human Image Animation
cs.CV 2026-05 unverdicted novelty 6.0

IPA aligns animation models for superior hand quality via implicit reward maximization on self-generated samples plus hand-focused local optimization, avoiding expensive paired data.
EditTransfer++: Toward Faithful and Efficient Visual-Prompt-Guided Image Editing
cs.CV 2026-05 unverdicted novelty 6.0

EditTransfer++ delivers state-of-the-art faithfulness to visual editing examples and faster inference by removing text conditioning during fine-tuning and applying best-worst contrastive refinement plus condition compression.
Not All Tokens Need 40 Steps: Heterogeneous Step Allocation in Diffusion Transformers for Efficient Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

HSA assigns variable denoising steps to spatiotemporal tokens in DiTs based on velocity dynamics, with KV-cache sync and cached Euler updates, outperforming prior caching methods on quality-runtime tradeoffs for T2V a...
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
Efficient Geometry-Controlled High-Resolution Satellite Image Synthesis
cs.CV 2026-05 unverdicted novelty 6.0

A windowed cross-attention control method on skip features enables geometry-controlled high-resolution satellite image synthesis from pre-trained diffusion models with better alignment to control maps than prior techniques.
DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning
cs.CV 2026-05 unverdicted novelty 6.0

DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.
Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
cs.RO 2026-05 unverdicted novelty 6.0

A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...
Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection
cs.CV 2026-05 unverdicted novelty 6.0

MPFM transforms normal features into a structured Gaussian mixture prototype space via a mixture velocity field and mutual information regularization to achieve state-of-the-art open-set supervised anomaly detection.
3D-ReGen: A Unified 3D Geometry Regeneration Framework
cs.CV 2026-04 unverdicted novelty 6.0

3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
BurstGP: Enhancing Raw Burst Image Super Resolution with Generative Priors
cs.CV 2026-04 unverdicted novelty 6.0

BurstGP enhances raw burst image super-resolution by integrating pretrained video diffusion priors through a multiframe-aware model, degradation-aware conditioning, and color-space conversion, outperforming prior meth...
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
cs.CV 2026-04 unverdicted novelty 6.0

LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
FluSplat: Sparse-View 3D Editing without Test-Time Optimization
cs.CV 2026-04 unverdicted novelty 6.0

FluSplat trains a model with geometric alignment constraints on multi-view edits to produce consistent 3D scene edits from sparse views in a single forward pass without test-time optimization.
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
ReCap: Lightweight Referential Grounding for Coherent Story Visualization
cs.CV 2026-04 unverdicted novelty 6.0

ReCap improves character consistency in story visualization by 2.63% on FlintstonesSV and 5.65% on PororoSV using a selective pronoun-based conditioning module and training-only semantic drift correction.
MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling
cs.CV 2026-04 unverdicted novelty 6.0

MetaEarth3D is the first generative foundation model for spatially consistent, unbounded 3D scene generation at planetary scale using optical Earth observation data.
LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing
cs.CV 2026-04 unverdicted novelty 6.0

LIVE achieves state-of-the-art instruction-based video editing by jointly training on image and video data with a frame-wise token noise strategy to bridge domain gaps and a new benchmark of over 60 tasks.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 89 Pith papers · 7 internal anchors

[1]

Building normalizing flows with stochastic interpolants

Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants, 2022. 6

work page 2022
[2]

Improving image generation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3), 2023. 1

work page 2023
[3]

Retrieval-augmented diffusion models

Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. Retrieval-augmented diffusion models. Advances in Neural Information Processing Systems, 35:15309–15324, 2022. 3

work page 2022
[4]

Improving image editing models with generative data refinement

Frederic Boesel and Robin Rombach. Improving image editing models with generative data refinement. In Tiny Papers @ ICLR, 2024. URL https://api.semanticscholar.org/CorpusID:271461432. 3

work page 2024
[5]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023. 3, 7

work page 2023
[6]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 3

work page 2020
[7]

Re-imagen: Retrieval-augmented text-to-image generator

Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022. 3

work page arXiv 2022
[8]

Scaling vision transformers to 22 billion parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023. 4

work page 2023
[9]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Arcface: Additive angular margin loss for deep face recognition

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019. 9

work page 2019
[11]

Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis

Patrick Esser, Robin Rombach, Andreas Blattmann, and Bjorn Ommer. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. Advances in neural information processing systems, 34:3518–3532, 2021. 1

work page 2021
[12]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403...

work page internal anchor Pith review arXiv 2024
[13]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 3

work page internal anchor Pith review arXiv 2022
[14]

Digital image processing

Rafael C Gonzalez. Digital image processing. Pearson education india, 2009. 1

work page 2009
[15]

Hidream-e1: Instruction-based image editing model

HiDream-ai. Hidream-e1: Instruction-based image editing model, 2025. URL https://github.com/HiDream-ai/HiDream-E1. 3

work page 2025
[16]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. 6

work page 2022
[17]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020. 1

work page 2020
[18]

Simple diffusion: End-to-end diffusion for high resolution images

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images, 2023. 14

work page 2023
[19]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1 (2):3, 2022. 3

work page 2022
[20]

In-context lora for diffusion transformers

Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775, 2024. 3

work page arXiv 2024
[21]

Imagen-Team-Google: Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-Rosen, Hongliang Fei, Nando de Freitas, Yilin Gao, Evgeny Gladchenko, Sergio Gómez Colmenarejo, Mandy Guo, Alex Haig, Will Hawkins, Hexiang Hu, Huilian Huang, Tobenna Pete...

work page arXiv 2024
[22]

Introducing auraface: Open-source face recognition and identity preservation models

isidentical. Introducing auraface: Open-source face recognition and identity preservation models. https://huggingface.co/blog/isidentical/auraface, 2024. Accessed: 2025-05-26. 9

work page 2024
[23]

Experiment with gemini 2.0 flash native image generation

Kat Kampf and Nicole Brichtova. Experiment with gemini 2.0 flash native image generation, 2025. URL https://developers.googleblog.com/en/experiment-with-gemini-20-flash-native-image-generation/. 3

work page 2025
[24]

Imagic: Text-based real image editing with diffusion models

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models, 2023. URL https://arxiv.org/abs/2210.09276. 3

work page arXiv 2023
[25]

Understanding diffusion objectives as the elbo with simple data augmentation

Diederik P Kingma and Ruiqi Gao. Understanding diffusion objectives as the elbo with simple data augmentation. In Thirty-seventh Conference on Neural Information Processing Systems,

work page
[26]

Reducing activation recomputation in large transformer models

Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5:341–353, 2023. 7

work page 2023
[27]

Multi-concept customization of text-to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023. 3

work page 2023
[28]

Genai-bench: Evaluating and improving compositional text-to-visual generation

Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, et al. Genai-bench: Evaluating and improving compositional text-to-visual generation. arXiv preprint arXiv:2406.13743, 2024. 8

work page arXiv 2024
[29]

Torchtitan: One-stop pytorch native solution for production ready llm pre-training

Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, et al. Torchtitan: One-stop pytorch native solution for production ready llm pre-training. arXiv preprint arXiv:2410.06511, 2024. 6

work page arXiv 2024
[30]

Flow matching for generative modeling

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t. 4, 6, 14

work page 2023
[31]

Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025. 7

work page internal anchor Pith review arXiv 2025
[32]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. 4, 14

work page 2022
[33]

Repaint: Inpainting using denoising diffusion probabilistic models

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11461–11471,

work page
[34]

On distillation of guided diffusion models

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297–14306, 2023. 6

work page 2023
[35]

Midjourney

Midjourney. Midjourney, 2025. URL https://www.midjourney.com/home. 3, 12

work page 2025
[36]

Introducing 4o image generation

OpenAI. Introducing 4o image generation, 2025. URL https://openai.com/index/introducing-4o-image-generation/. 3

work page 2025
[37]

Drag your gan: Interactive point-based manipulation on the generative image manifold

Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag your gan: Interactive point-based manipulation on the generative image manifold. In ACM SIGGRAPH 2023 conference proceedings, pages 1–11, 2023. 1

work page 2023
[38]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2023. doi: 10.1109/iccv51070.2023.00387. URL http://dx.doi.org/10.1109/ICCV51070.2023.00387. 4

work page doi:10.1109/iccv51070.2023.00387 2023
[39]

Dreambench++: A human-aligned benchmark for personalized image generation

Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation, 2025. URL https://arxiv.org/abs/2406.16855. 7

work page arXiv 2025
[40]

Sdxl: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. 1, 3, 5

work page 2023
[41]

Hierarchical text-conditional image generation with clip latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022. 1, 7

work page 2022
[42]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022. doi: 10.1109/cvpr52688.2022.01042. URL http://dx.doi.org/10.1109/CVPR52688.2022.01042. 1, 3, 4

work page doi:10.1109/cvpr52688.2022.01042 2022
[43]

DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023. URL https://arxiv.org/abs/2208.12242. 3

work page arXiv 2023
[44]

Runway | tools for human imagination

Runway AI, Inc. Runway | tools for human imagination, 2025. URL https://runwayml.com/. 3

work page 2025
[45]

Palette: Image-to-image diffusion models

Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022. 1

work page 2022
[46]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022. 8

work page 2022
[47]

Projected gans converge faster

Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected gans converge faster. Advances in Neural Information Processing Systems, 2021. 6

work page 2021
[48]

Adversarial diffusion distillation

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023. 6

work page arXiv 2023
[49]

Fast high-resolution image synthesis with latent adversarial diffusion distillation

Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation. URL https://arxiv.org/abs/2403.12015. 1, 6

work page arXiv
[51]

Flashattention-3: Fast and accurate attention with asynchrony and low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024. 7

work page 2024
[52]

Emu edit: Precise image editing via recognition and generation tasks

Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. arXiv preprint arXiv:2311.10089, 2023. 3, 7

work page arXiv 2023
[53]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 15

work page internal anchor Pith review Pith/arXiv arXiv 2014
[54]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. 4

work page 2024
[55]

Resolution-robust large mask inpainting with fourier convolutions

Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2149–2159, 2022. 1

work page 2022
[56]

Computer vision: algorithms and applications

Richard Szeliski. Computer vision: algorithms and applications. Springer Nature, 2022. 1

work page 2022
[57]

Omnigen: Unified image generation

Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. arXiv preprint arXiv:2409.11340, 2024. 3

work page arXiv 2024
[58]

Paint by example: Exemplar-based image editing with diffusion models

Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18381–18391, 2023. 1

work page 2023
[59]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023. 3, 12

work page internal anchor Pith review arXiv 2023
[60]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022. URL https://arxiv.org/abs/2206.10789. 8

work page internal anchor Pith review arXiv 2022
[61]

Magicbrush: A manually annotated dataset for instruction-guided image editing

Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing, 2024. URL https://arxiv.org/abs/2306.10012. 7

work page arXiv 2024
[62]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023. 1

work page 2023
[63]

In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer

Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer. arXiv preprint arXiv:2504.20690, 2025. 3

work page arXiv 2025