FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Andreas Blattmann; Axel Sauer; Black Forest Labs; Cheng Li; Cyril Diagne; Dominik Lorenz; Dustin Podell; Frederic Boesel; Harry Saini; Jack English

arxiv: 2506.15742 · v2 · submitted 2025-06-17 · 💻 cs.GR

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, Luke Smith

Pith reviewed 2026-05-10 16:30 UTC · model grok-4.3

classification 💻 cs.GR

keywords image generation, image editing, flow matching, in-context generation, latent space, character consistency, multi-turn editing, benchmark

The pith

A flow matching model unifies image generation and editing by concatenating text and image inputs in latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FLUX.1 Kontext as a single architecture that generates new images or edits existing ones by incorporating semantic context from both text prompts and reference images. It relies on a straightforward sequence concatenation step inside a flow matching process to manage local edits and broader generative tasks together. This design yields stronger preservation of specific objects and characters when edits are applied repeatedly, avoiding the consistency loss seen in prior systems. The model matches leading quality benchmarks while running substantially faster, which supports interactive use. The claims rest on results from the new KontextBench dataset of 1026 image-prompt pairs spanning five editing categories.

Core claim

FLUX.1 Kontext is a generative flow matching model that unifies image generation and editing within one architecture. Using sequence concatenation to incorporate semantic context from text and image inputs, it handles both local editing and generative in-context tasks. It demonstrates improved preservation of objects and characters across multiple turns compared to models that degrade in consistency, while delivering competitive performance and significantly faster generation times.

What carries the argument

Sequence concatenation of text and image inputs inside the flow matching model operating in latent space, which unifies local and generative editing tasks.
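
The concatenation idea can be illustrated with a toy token layout. This is a minimal sketch and not the authors' implementation: the `Token` class, the `patchify` and `build_sequence` helpers, the patch size, and the ordering (text, then target image tokens, then context image tokens) are all assumptions chosen for illustration. It only shows that generation and editing differ in whether context tokens are appended to the same sequence.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    kind: str   # "text", "target" (image being generated) or "context" (reference image)
    index: int  # position inside its own stream


def patchify(height: int, width: int, patch: int, kind: str) -> List[Token]:
    """Flatten a latent grid into patch tokens in row-major order (hypothetical helper)."""
    if height % patch or width % patch:
        raise ValueError("latent size must be divisible by the patch size")
    count = (height // patch) * (width // patch)
    return [Token(kind, i) for i in range(count)]


def build_sequence(n_text: int, latent_hw: Tuple[int, int], patch: int,
                   has_context: bool) -> List[Token]:
    """Concatenate text, target and (optionally) context tokens into one sequence."""
    text = [Token("text", i) for i in range(n_text)]
    target = patchify(latent_hw[0], latent_hw[1], patch, "target")
    # Assumption: the reference image shares the target's latent size.
    context = patchify(latent_hw[0], latent_hw[1], patch, "context") if has_context else []
    return text + target + context


# Text-to-image: no reference image, so the context stream is empty.
gen = build_sequence(n_text=8, latent_hw=(4, 4), patch=2, has_context=False)
# Editing: the same model input, with reference tokens appended.
edit = build_sequence(n_text=8, latent_hw=(4, 4), patch=2, has_context=True)
print(len(gen), len(edit))
```

Under these assumptions the only difference between the two modes is sequence length, which is why a single backbone could in principle serve both tasks.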

Load-bearing premise

The 1026 image-prompt pairs in KontextBench represent typical real-world editing tasks without selection bias that favors the new model.

What would settle it

A larger independent test set of editing tasks where FLUX.1 Kontext shows equal or greater degradation in character consistency and slower speeds than existing models.
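
One way such a test could quantify degradation is to track how similar each edited turn stays to the original reference. The sketch below is an assumption-laden illustration, not the paper's metric: it supposes each image has already been mapped to an identity embedding by some external encoder (not shown), and the function names `cosine`, `consistency_curve`, and `drift_slope` are hypothetical. The toy vectors are made up.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def consistency_curve(reference: Sequence[float],
                      turns: Sequence[Sequence[float]]) -> List[float]:
    """Similarity of each edit turn to the original reference embedding."""
    return [cosine(reference, turn) for turn in turns]


def drift_slope(values: Sequence[float]) -> float:
    """Least-squares slope of similarity against turn index (negative = degradation)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two turns")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Made-up 2-D embeddings: the subject rotates away from the reference over three turns.
reference = [1.0, 0.0]
turns = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
sims = consistency_curve(reference, turns)
drift = drift_slope(sims)
print(sims, drift)
```

Comparing the slope across models on the same independent inputs would indicate which one degrades faster, assuming the chosen encoder is a fair proxy for identity.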

Original abstract

We present evaluation results for FLUX.1 Kontext, a generative flow matching model that unifies image generation and editing. The model generates novel output views by incorporating semantic context from text and image inputs. Using a simple sequence concatenation approach, FLUX.1 Kontext handles both local editing and generative in-context tasks within a single unified architecture. Compared to current editing models that exhibit degradation in character consistency and stability across multiple turns, we observe that FLUX.1 Kontext improved preservation of objects and characters, leading to greater robustness in iterative workflows. The model achieves competitive performance with current state-of-the-art systems while delivering significantly faster generation times, enabling interactive applications and rapid prototyping workflows. To validate these improvements, we introduce KontextBench, a comprehensive benchmark with 1026 image-prompt pairs covering five task categories: local editing, global editing, character reference, style reference and text editing. Detailed evaluations show the superior performance of FLUX.1 Kontext in terms of both single-turn quality and multi-turn consistency, setting new standards for unified image processing models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FLUX.1 Kontext unifies generation and editing through latent flow matching plus sequence concatenation and reports stronger multi-turn consistency on its new KontextBench, but the benchmark's construction details are missing from the abstract.

The core contribution here is a single flow-matching model that treats both fresh image generation and reference-based editing as the same sequence-concatenation task in latent space. That is a clean architectural move and explains why the authors can claim one model handles local edits, global changes, character reference, style reference, and text edits without separate pipelines. The speed advantage over prior editing systems also follows naturally from the flow-matching backbone they already use in FLUX.1, so that part of the story is believable on its face. Black Forest Labs has shipped reproducible models before, which gives the implementation side some credibility even without code in the abstract.

The main weakness is that every performance claim (better object preservation, less drift across turns, competitive quality) rests on the 1026-pair KontextBench the authors introduce. The abstract says the benchmark was built “to validate these improvements” but supplies no information on prompt sourcing, difficulty calibration, or whether pairs were selected after seeing FLUX.1 Kontext outputs. That is exactly the selection-bias risk the stress-test flagged, and it is not minor when the headline result is “greater robustness in iterative workflows.” Without those details or at least a clear description of baseline re-implementations and statistical tests, the empirical section cannot be evaluated on its own terms.

The paper is aimed at researchers and practitioners who want a single fast model for creative iteration rather than separate generation and editing stacks. It is worth sending to peer review because the unification idea is straightforward to test and the new benchmark categories are useful even if the current numbers need more scrutiny on data provenance. A referee can ask for the missing curation protocol and re-runs on an external test set; that is normal revision work rather than a reason to desk-reject.

Referee Report

3 major / 2 minor

Summary. The paper presents FLUX.1 Kontext, a flow-matching model that unifies image generation and editing via simple sequence concatenation in latent space. It claims improved character/object preservation and multi-turn robustness over prior editing models, competitive performance with current SOTA systems, significantly faster inference, and introduces KontextBench (1026 author-curated image-prompt pairs across local editing, global editing, character reference, style reference, and text editing) to validate these advantages.

Significance. If the empirical claims hold on an independently constructed test distribution, the unified latent-space concatenation approach and observed stability in iterative workflows would constitute a practical advance for interactive image editing pipelines, with the reported speed advantage enabling new prototyping use cases. The flow-matching backbone and single-architecture design are clear strengths that could be extended.

major comments (3)

[KontextBench description] The section introducing KontextBench provides no information on prompt sourcing, difficulty calibration against existing models, or whether the 1026 pairs were selected after observing FLUX.1 Kontext behavior; this directly undermines the central claim of superior multi-turn consistency because selection bias favoring the concatenation mechanism cannot be excluded.
[Evaluation / Results] The evaluation section reports comparisons to SOTA editing models but supplies no details on baseline implementations, versions, hyperparameter choices, or the exact protocol for selecting reference images and prompts; without these, the asserted performance advantage and 'greater robustness' cannot be independently verified.
[Results] No statistical significance tests, confidence intervals, or per-category variance are reported for the metrics on KontextBench; this weakens the headline assertion of 'superior performance' and 'new standards' given that the benchmark is newly introduced.
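
The per-category variance requested in the last comment is cheap to report. The following is a minimal sketch under stated assumptions: the `(category, score)` record format and the function name `per_category_summary` are hypothetical, and the scores are invented placeholders rather than values from the paper.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, Iterable, Tuple


def per_category_summary(records: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    """Group scores by benchmark category and report count, mean and sample std."""
    grouped = defaultdict(list)
    for category, score in records:
        grouped[category].append(score)
    return {
        category: {
            "n": len(scores),
            "mean": mean(scores),
            # Sample standard deviation is undefined for a single item; report 0.0.
            "std": stdev(scores) if len(scores) > 1 else 0.0,
        }
        for category, scores in grouped.items()
    }


# Placeholder scores; category names follow the five KontextBench task types.
records = [
    ("local editing", 0.8),
    ("local editing", 0.6),
    ("text editing", 0.5),
]
summary = per_category_summary(records)
for category, stats in summary.items():
    print(category, stats)
```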

minor comments (2)

[Abstract] The abstract states that the model 'achieves competitive performance' but does not name the quantitative metrics (e.g., CLIP score, LPIPS, or human preference rates) used to support this.
[Figures and Tables] Figure captions and table headers could more explicitly indicate whether results are single-turn or multi-turn to aid quick reading.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve transparency, reproducibility, and statistical rigor.

Referee: [KontextBench description] The section introducing KontextBench provides no information on prompt sourcing, difficulty calibration against existing models, or whether the 1026 pairs were selected after observing FLUX.1 Kontext behavior; this directly undermines the central claim of superior multi-turn consistency because selection bias favoring the concatenation mechanism cannot be excluded.

Authors: We agree that additional details on benchmark construction are essential to address potential selection bias concerns. The 1026 pairs were assembled from a mix of public image datasets and manually designed prompts targeting known multi-turn failure modes (e.g., character drift, style inconsistency). Difficulty calibration was performed by evaluating preliminary versions of several models on candidate pairs to ensure coverage across easy-to-hard cases. The final set was locked before running the complete FLUX.1 Kontext evaluation suite. We have added a new subsection 'KontextBench Construction' that documents sourcing criteria, the calibration procedure, and the timeline confirming the benchmark was fixed independently of the final model results. This revision directly mitigates the bias concern raised. revision: yes
Referee: [Evaluation / Results] The evaluation section reports comparisons to SOTA editing models but supplies no details on baseline implementations, versions, hyperparameter choices, or the exact protocol for selecting reference images and prompts; without these, the asserted performance advantage and 'greater robustness' cannot be independently verified.

Authors: We concur that missing implementation details hinder independent verification. The revised manuscript now includes an expanded 'Baseline and Evaluation Protocol' section specifying: exact model versions and checkpoints used for all baselines, inference hyperparameters (steps, guidance scales, schedulers), and the reference selection protocol (fixed benchmark inputs with no post-hoc filtering or cherry-picking; prompts applied verbatim). We have also added a link to the evaluation codebase and configuration files in the supplementary material to enable exact reproduction of the reported numbers and robustness observations. revision: yes
Referee: [Results] No statistical significance tests, confidence intervals, or per-category variance are reported for the metrics on KontextBench; this weakens the headline assertion of 'superior performance' and 'new standards' given that the benchmark is newly introduced.

Authors: The referee correctly notes that statistical analysis would strengthen the empirical claims. We have updated the results section with per-category means and standard deviations across the 1026 pairs, 95% bootstrap confidence intervals, and paired statistical tests (t-tests and Wilcoxon signed-rank) comparing FLUX.1 Kontext against each baseline. A new table reports these values together with p-values, confirming statistically significant gains in multi-turn consistency metrics. These additions provide the quantitative support for the performance claims. revision: yes
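
The bootstrap interval mentioned in the last response could be computed along the following lines. This is a sketch only, assuming per-item scores for two models on the same benchmark inputs; it uses the percentile bootstrap on paired differences, and the function name `paired_bootstrap_ci`, the resample count, and the score values are illustrative assumptions, not the authors' code or data. The t-test and Wilcoxon signed-rank test are not reproduced here.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(scores_a: Sequence[float], scores_b: Sequence[float],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Percentile-bootstrap CI for the mean paired difference (a minus b)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Invented per-item scores for two models evaluated on the same five inputs.
model_a = [0.80, 0.70, 0.90, 0.75, 0.85]
model_b = [0.60, 0.65, 0.70, 0.70, 0.60]
observed, ci_low, ci_high = paired_bootstrap_ci(model_a, model_b)
print(observed, ci_low, ci_high)
```

An interval that excludes zero suggests a consistent difference on the tested items, though it says nothing about whether the items themselves were selected fairly.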

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external comparisons and new benchmark without reduction to inputs

The paper presents FLUX.1 Kontext as a flow-matching model using simple sequence concatenation in latent space to unify generation and editing. No equations, derivations, or first-principles results are shown that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Performance claims (improved multi-turn consistency, competitive SOTA results, faster inference) are supported by direct comparisons to prior editing models on the newly introduced KontextBench. The benchmark is described as created to validate observed improvements, but its results are not forced by the model's internal definitions or training procedure. This is a standard empirical evaluation setup with external benchmarks and SOTA baselines, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the validity of the new benchmark tasks, the assumption that flow matching in latent space supports in-context conditioning via concatenation, and the standard supervised training assumptions for large generative models.

free parameters (1)

model architecture and training hyperparameters
The flow matching model contains a large number of learned parameters whose values are determined by training on unspecified data.

axioms (1)

Domain assumption: Flow matching can be extended to conditional generation by simple sequence concatenation of text and image tokens in latent space.
Invoked to justify the unified architecture without additional adapters or encoders.
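
For readers unfamiliar with the objective this assumption builds on, here is a toy sketch of flow matching with a straight-line interpolant. The convention used (z_t = (1 - t) x + t eps, target velocity eps - x, sampling from t = 1 to t = 0) is one common choice and is an assumption here, as are all function names. A real model would replace the oracle velocity with a network that also receives the concatenated text and image tokens.

```python
import random
from typing import Callable, List, Sequence


def interpolate(x: Sequence[float], eps: Sequence[float], t: float) -> List[float]:
    """Point on the straight path between data x (t = 0) and noise eps (t = 1)."""
    return [(1 - t) * xi + t * ei for xi, ei in zip(x, eps)]


def target_velocity(x: Sequence[float], eps: Sequence[float]) -> List[float]:
    """Time derivative of the path above, which is constant in t."""
    return [ei - xi for xi, ei in zip(x, eps)]


def fm_loss(pred: Sequence[float], x: Sequence[float], eps: Sequence[float]) -> float:
    """Mean squared error between a predicted and the target velocity."""
    target = target_velocity(x, eps)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def euler_sample(velocity_fn: Callable[[List[float], float], List[float]],
                 eps: Sequence[float], steps: int) -> List[float]:
    """Integrate dz/dt = v from t = 1 (noise) down to t = 0 (data) with Euler steps."""
    z = list(eps)
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        v = velocity_fn(z, t)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z


x = [0.5, -1.0, 2.0]                      # stand-in for a clean latent
rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in x]    # Gaussian noise sample


def oracle(z: List[float], t: float) -> List[float]:
    # A perfect "model" that already knows the target velocity.
    return target_velocity(x, eps)


sample = euler_sample(oracle, eps, steps=10)
print(sample)
```

With the oracle velocity the sampler recovers the clean latent up to floating-point error, which is the behaviour a trained network is meant to approximate.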

pith-pipeline@v0.9.0 · 5561 in / 1286 out tokens · 60744 ms · 2026-05-10T16:30:33.008737+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
cs.CV 2026-04 unverdicted novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
From Activation to Causality: Discovery of Causal Visual Representations in the Human Brain
cs.CV 2026-05 unverdicted novelty 7.0

BrainCause recovers known visual localizations and finds new candidate representations by validating causal specificity via counterfactual stimuli and encoding models, showing activation alone produces many false positives.
VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset
cs.CV 2026-05 unverdicted novelty 7.0

VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.
Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion
cs.LG 2026-05 unverdicted novelty 7.0

CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.
CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration
cs.CV 2026-05 unverdicted novelty 7.0

CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation
cs.CV 2026-05 unverdicted novelty 7.0

GenEvolve proposes a self-evolving agent framework for open-ended image generation that uses tool-orchestrated trajectories and visual experience distillation from best-worst differences to achieve reported state-of-t...
MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling
cs.CV 2026-05 conditional novelty 7.0

MetaEarth-MM unifies multi-modal remote sensing image generation and any-to-any translation across five modalities via scene-centered joint modeling on the new EarthMM dataset.
Accelerating Rectified Flow Models via Trajectory-Aware Caching
cs.CV 2026-05 unverdicted novelty 7.0

TACache accelerates rectified flow sampling up to 4.14x for text-to-image and 2.11x for text-to-video via offline skip scheduling from cumulative variation thresholds and online velocity reconstruction using historica...
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
cs.CV 2026-05 unverdicted novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
MiVE: Multiscale Vision-language features for reference-guided video Editing
cs.CV 2026-05 unverdicted novelty 7.0

MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer, achieving top human preference in reference-guided video editing.
UniTriGen: Unified Triplet Generation of Aligned Visible-Infrared-Label for Few-Shot RGB-T Semantic Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

UniTriGen uses unified diffusion in a shared latent space plus lightweight adapters and scene-balanced sampling to produce high-quality aligned VIS-IR-Label triplets from limited paired data, improving few-shot RGB-T ...
PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting
cs.CV 2026-05 unverdicted novelty 7.0

PanoPlane achieves up to 17.8% PSNR gains in sparse-view indoor novel view synthesis by using training-free plane-aware panoramic completion to supervise 3D Gaussian Splatting.
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
cs.CV 2026-05 unverdicted novelty 7.0

Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
Inline Critic Steers Image Editing
cs.CV 2026-05 conditional novelty 7.0

Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
cs.CV 2026-05 unverdicted novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
cs.CV 2026-05 unverdicted novelty 7.0

Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
cs.CV 2026-05 unverdicted novelty 7.0

UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.
LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR
cs.CV 2026-05 unverdicted novelty 7.0

LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.
Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing
cs.CR 2026-05 unverdicted novelty 7.0

Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.
Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision
cs.CV 2026-05 unverdicted novelty 7.0

Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.
BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing
cs.CV 2026-05 unverdicted novelty 7.0

BRIDGE uses separate main and subject paths plus a discrete gate on positional embeddings to improve local edits with coarse masks, raising local SigLIP2-T from 0.39 to 0.50 on its benchmark.
Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
cs.CV 2026-05 unverdicted novelty 7.0

Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 7.0

D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...
MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing
cs.CV 2026-05 unverdicted novelty 7.0

MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.
Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection
cs.CV 2026-05 unverdicted novelty 7.0

MPFM models flow matching velocity as a Gaussian mixture prior per normal class plus a mutual information regularizer to improve open-set anomaly detection over unimodal prototypes.
Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection
cs.CV 2026-05 unverdicted novelty 7.0

MPFM uses flow matching with a Gaussian mixture prior on the velocity field and a mutual information maximizer to improve open-set anomaly detection over unimodal prototype methods.
VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching
cs.CV 2026-04 unverdicted novelty 7.0

VeraRetouch is a 0.5B VLM-based framework with a differentiable Retouch Renderer and a new million-scale AetherRetouch-1M+ dataset that claims state-of-the-art results in reasoning photo retouching while enabling mobi...
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
cs.LG 2026-04 unverdicted novelty 7.0

FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.
Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing
cs.CV 2026-04 unverdicted novelty 7.0

A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.
AI-Gram: When Visual Agents Interact in a Social Network
cs.AI 2026-04 unverdicted novelty 7.0

Autonomous visual AI agents spontaneously form image reply chains, maintain stable individual styles, and produce richer style-diverse conversations than single agents can achieve alone.
GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds
cs.CV 2026-04 unverdicted novelty 7.0

GSCompleter completes sparse 3D Gaussian Splatting scenes via a distillation-free generate-then-register pipeline using Stereo-Anchor lifting and Ray-Constrained Registration, delivering SOTA results on three benchmarks.
GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds
cs.CV 2026-04 unverdicted novelty 7.0

GSCompleter completes 3DGS scenes from sparse viewpoints using a generate-then-register workflow with stereo-anchor view selection and ray-constrained registration to achieve metric-aware results and SOTA performance ...
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
cs.CV 2026-04 unverdicted novelty 7.0

ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
cs.CV 2026-04 unverdicted novelty 7.0

HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
Generative Texture Filtering
cs.CV 2026-04 unverdicted novelty 7.0

A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.
View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity
cs.CV 2026-04 unverdicted novelty 7.0

A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.
Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification
cs.AI 2026-04 unverdicted novelty 7.0

Rule-VLN is the first large-scale benchmark injecting 177 regulatory categories into an urban environment, and the proposed SNRM module equips pre-trained VLN agents with zero-shot semantic reasoning and detour planni...
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
cs.CV 2026-04 unverdicted novelty 7.0

UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
cs.CV 2026-04 unverdicted novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes
cs.CV 2026-04 unverdicted novelty 7.0

Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.
OneHOI: Unifying Human-Object Interaction Generation and Editing
cs.CV 2026-04 unverdicted novelty 7.0

OneHOI unifies HOI generation and editing in one conditional diffusion transformer using role-aware tokens, structured attention, and joint training on mixed datasets to reach SOTA on both tasks.
LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
cs.CV 2026-04 unverdicted novelty 7.0

LottieGPT tokenizes Lottie animations into compact sequences and fine-tunes Qwen-VL to autoregressively generate coherent vector animations from natural language or visual prompts, outperforming prior SVG models.
HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement
cs.CV 2026-04 unverdicted novelty 7.0

A diffusion-based pipeline creates a 27M-annotation dataset of object placements that outperforms human annotations and baselines on image editing tasks, then distills it into a fast model.
RewardFlow: Generate Images by Optimizing What You Reward
cs.CV 2026-04 unverdicted novelty 7.0

RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.
Personalizing Text-to-Image Generation to Individual Taste
cs.CV 2026-04 unverdicted novelty 7.0

PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
cs.CV 2026-04 unverdicted novelty 7.0

RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.
PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space
cs.LG 2026-04 unverdicted novelty 7.0

PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.
HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits
cs.CV 2026-04 unverdicted novelty 7.0

HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.
SPRITE: From Static Mockups to Engine-Ready Game UI
cs.HC 2026-03 unverdicted novelty 7.0

SPRITE converts static game UI screenshots into editable engine-ready assets by using VLMs to parse complex layouts into a YAML intermediate representation.
Reflective Flow Sampling Enhancement
cs.CV 2026-03 unverdicted novelty 7.0

RF-Sampling enhances flow matching models by implicitly performing gradient ascent on text-image alignment scores via linear textual combinations and flow inversion.
EvoDiagram: Agentic Editable Diagram Creation via Design Expertise Evolution
cs.HC 2026-02 unverdicted novelty 7.0

EvoDiagram uses a coordinated multi-agent system and design knowledge evolution to generate editable diagrams via canvas schema, with a new CanvasBench benchmark showing strong performance over baselines.
A Unified and Controllable Framework for Layered Image Generation with Visual Effects
cs.CV 2026-01 unverdicted novelty 7.0

LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.
ATATA: One Algorithm to Align Them All
cs.CV 2026-01 unverdicted novelty 7.0

ATATA enables fast joint inference of structurally aligned pairs using Rectified Flow models via segment transport, improving state-of-the-art for image and video generation while matching 3D quality at much higher speed.
InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation
cs.CV 2025-12 unverdicted novelty 7.0

InstructMoLE replaces per-token routing with instruction-guided global routing for mixture-of-low-rank-experts in diffusion transformers and adds an output-space orthogonality loss to improve multi-conditional image g...
Do-Undo Bench: Reversibility for Action Understanding in Image Generation
cs.CV 2025-12 unverdicted novelty 7.0

Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.
Setting the Stage: Text-Driven Scene-Consistent Image Generation
cs.CV 2025-12 conditional novelty 7.0

A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
cs.CV 2025-12 unverdicted novelty 7.0

Omni-Attribute is a new open-vocabulary image attribute encoder trained on semantically linked pairs with dual objectives to produce disentangled representations for personalization and compositional generation.
AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
cs.CV 2025-12 unverdicted novelty 7.0

AVI-Edit enables precise audio-synchronized instance-level video editing via a granularity-aware mask refiner, a self-feedback audio agent, and a new large-scale annotated dataset.
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
cs.CV 2025-12 unverdicted novelty 7.0

LivingSwap is the first video reference-guided face swapping model that uses keyframe conditioning and temporal stitching to preserve source video realism with high fidelity across long sequences.
From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity
cs.LG 2025-12 conditional novelty 7.0

Flow matching models follow a two-stage process of navigation across data modes then refinement to nearest samples, revealed by exact computation of the oracle marginal velocity field.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 143 Pith papers · 8 internal anchors

[1]

Building normalizing flows with stochastic interpolants

Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants, 2022. 6

work page 2022
[2]

Improving image generation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3), 2023. 1

work page 2023
[3]

Retrieval-augmented diffusion models

Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. Retrieval-augmented diffusion models. Advances in Neural Information Processing Systems, 35:15309–15324, 2022. 3

work page 2022
[4]

Improving image editing models with generative data refinement

Frederic Boesel and Robin Rombach. Improving image editing models with generative data refinement. In Tiny Papers @ ICLR, 2024. URL https://api.semanticscholar.org/CorpusID:271461432. 3

work page 2024
[5]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023. 3, 7

work page 2023
[6]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 3

work page 2020
[7]

Re-imagen: Retrieval-augmented text-to-image generator

Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022. 3

work page arXiv 2022
[8]

Scaling vision transformers to 22 billion parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023. 4

work page 2023
[9]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Arcface: Additive angular margin loss for deep face recognition

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019. 9

work page 2019
[11]

Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis

Patrick Esser, Robin Rombach, Andreas Blattmann, and Bjorn Ommer. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. Advances in neural information processing systems, 34:3518–3532, 2021. 1

work page 2021
[12]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403...

work page internal anchor Pith review arXiv 2024
[13]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 3

work page internal anchor Pith review arXiv 2022
[14]

Digital image processing

Rafael C Gonzalez. Digital image processing. Pearson education india, 2009. 1

work page 2009
[15]

Hidream-e1: Instruction-based image editing model, 2025

HiDream-ai. Hidream-e1: Instruction-based image editing model, 2025. URL https://github.com/HiDream-ai/HiDream-E1. 3

work page 2025
[16]

Classifier-free diffusion guidance, 2022

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. 6

work page 2022
[17]

Denoising diffusion probabilistic models, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020. 1

work page 2020
[18]

Simple diffusion: End-to-end diffusion for high resolution images, 2023

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images, 2023. 14

work page 2023
[19]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1 (2):3, 2022. 3

work page 2022
[20]

In-context lora for diffusion transformers

Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775, 2024. 3

work page arXiv 2024
[21]

Imagen-Team-Google, Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-Rosen, Hongliang Fei, Nando de Freitas, Yilin Gao, Evgeny Gladchenko, Sergio Gómez Colmenarejo, Mandy Guo, Alex Haig, Will Hawkins, Hexiang Hu, Huilian Huang, Tobenna Pete...

work page arXiv 2024
[22]

Introducing auraface: Open-source face recognition and identity preservation models

isidentical. Introducing auraface: Open-source face recognition and identity preservation models. https://huggingface.co/blog/isidentical/auraface, 2024. Accessed: 2025-05-26. 9

work page 2024
[23]

Experiment with gemini 2.0 flash native image generation, 2025

Kat Kampf and Nicole Brichtova. Experiment with gemini 2.0 flash native image generation, 2025. URL https://developers.googleblog.com/en/experiment-with-gemini-20-flash-native-image-generation/. 3

work page 2025
[24]

Imagic: Text-based real image editing with diffusion models

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models, 2023. URL https://arxiv.org/abs/2210.09276. 3

work page arXiv 2023
[25]

Understanding diffusion objectives as the elbo with simple data augmentation

Diederik P Kingma and Ruiqi Gao. Understanding diffusion objectives as the elbo with simple data augmentation. In Thirty-seventh Conference on Neural Information Processing Systems,

work page
[26]

Reducing activation recomputation in large transformer models

Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5:341–353, 2023. 7

work page 2023
[27]

Multi-concept customization of text-to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023. 3

work page 2023
[28]

Genai-bench: Evaluating and improving compositional text-to-visual generation

Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, et al. Genai-bench: Evaluating and improving compositional text-to-visual generation. arXiv preprint arXiv:2406.13743, 2024. 8

work page arXiv 2024
[29]

Torchtitan: One-stop pytorch native solution for production ready llm pre-training

Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, et al. Torchtitan: One-stop pytorch native solution for production ready llm pre-training. arXiv preprint arXiv:2410.06511, 2024. 6

work page arXiv 2024
[30]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t. 4, 6, 14

work page 2023
[31]

Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025. 7

work page internal anchor Pith review arXiv 2025
[32]

Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. 4, 14

work page 2022
[33]

Repaint: Inpainting using denoising diffusion probabilistic models

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11461–11471,

work page
[34]

On distillation of guided diffusion models

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297–14306, 2023. 6

work page 2023
[35]

Midjourney, 2025

Midjourney. Midjourney, 2025. URL https://www.midjourney.com/home. 3, 12

work page 2025
[36]

Introducing 4o image generation, 2025

OpenAI. Introducing 4o image generation, 2025. URL https://openai.com/index/introducing-4o-image-generation/. 3

work page 2025
[37]

Drag your gan: Interactive point-based manipulation on the generative image manifold

Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag your gan: Interactive point-based manipulation on the generative image manifold. In ACM SIGGRAPH 2023 conference proceedings, pages 1–11, 2023. 1

work page 2023
[38]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2023. doi: 10.1109/iccv51070.2023.00387. URL http://dx.doi.org/10.1109/ICCV51070.2023.00387. 4

work page doi:10.1109/iccv51070.2023.00387 2023
[39]

Dreambench++: A human-aligned benchmark for personalized image generation

Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation, 2025. URL https://arxiv.org/abs/2406.16855. 7

work page arXiv 2025
[40]

Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. 1, 3, 5

work page 2023
[41]

Hierarchical text-conditional image generation with clip latents, 2022

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022. 1, 7

work page 2022
[42]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022. doi: 10.1109/cvpr52688.2022.01042. URL http://dx.doi.org/10.1109/CVPR52688.2022.01042. 1, 3, 4

work page doi:10.1109/cvpr52688.2022.01042 2022
[43]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023. URL https://arxiv.org/abs/2208.12242. 3

work page arXiv 2023
[44]

Runway | tools for human imagination

Runway AI, Inc. Runway | tools for human imagination, 2025. URL https://runwayml.com/. 3

work page 2025
[45]

Palette: Image-to-image diffusion models

Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022. 1

work page 2022
[46]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022. 8

work page 2022
[47]

Projected gans converge faster

Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected gans converge faster. Advances in Neural Information Processing Systems, 2021. 6

work page 2021
[48]

Adversarial diffusion distillation

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023. 6

work page arXiv 2023
[49]

Fast high-resolution image synthesis with latent adversarial diffusion distillation

Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation, 2024. URL https://arxiv.org/abs/2403.12015. 1, 6

work page arXiv 2024
[51]

Flashattention-3: Fast and accurate attention with asynchrony and low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024. 7

work page 2024
[52]

Emu edit: Precise image editing via recognition and generation tasks

Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. arXiv preprint arXiv:2311.10089, 2023. 3, 7

work page arXiv 2023
[53]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 15

work page internal anchor Pith review Pith/arXiv arXiv 2014
[54]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. 4

work page 2024
[55]

Resolution-robust large mask inpainting with fourier convolutions

Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2149–2159, 2022. 1

work page 2022
[56]

Computer vision: algorithms and applications

Richard Szeliski. Computer vision: algorithms and applications. Springer Nature, 2022. 1

work page 2022
[57]

OmniGen: Unified Image Generation

Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. arXiv preprint arXiv:2409.11340, 2024. 3

work page arXiv 2024
[58]

Paint by example: Exemplar-based image editing with diffusion models

Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18381–18391, 2023. 1

work page 2023
[59]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023. 3, 12

work page internal anchor Pith review arXiv 2023
[60]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022. URL https://arxiv.org/abs/2206.10789. 8

work page internal anchor Pith review arXiv 2022
[61]

Magicbrush: A manually annotated dataset for instruction-guided image editing

Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing, 2024. URL https://arxiv.org/abs/2306.10012. 7

work page arXiv 2024
[62]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023. 1

work page 2023
[63]

In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer. arXiv preprint arXiv:2504.20690, 2025. 3

work page internal anchor Pith review arXiv 2025

[1] [1]

Albergo and Eric Vanden-Eijnden

Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants, 2022. 6

work page 2022

[2] [2]

Improving image generation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, 2(3), 2023. 1

work page 2023

[3] [3]

Retrieval- augmented diffusion models

Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. Retrieval- augmented diffusion models. Advances in Neural Information Processing Systems, 35:15309– 15324, 2022. 3

work page 2022

[4] [4]

Improving image editing models with generative data refinement

Frederic Boesel and Robin Rombach. Improving image editing models with generative data refinement. In Tiny Papers @ ICLR, 2024. URL https://api.semanticscholar.org/CorpusID: 271461432. 3

work page 2024

[5] [5]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023. 3, 7

work page 2023

[6] [6]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 3

work page 1901

[7] [7]

Re-imagen: Retrieval-augmented text-to-image generator

Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval- augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022. 3

work page arXiv 2022

[8] [8]

Scaling vision transformers to 22 billion parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023. 4

work page 2023

[9] [9]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. 7

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Arcface: Additive angular margin loss for deep face recognition

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019. 9

work page 2019

[11] [11]

Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis

Patrick Esser, Robin Rombach, Andreas Blattmann, and Bjorn Ommer. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. Advances in neural information processing systems, 34:3518–3532, 2021. 1

work page 2021

[12] [12]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403...

work page internal anchor Pith review arXiv 2024

[13] [13]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 3

work page internal anchor Pith review arXiv 2022

[14] [14]

Digital image processing

Rafael C Gonzalez. Digital image processing. Pearson education india, 2009. 1

work page 2009

[15] [15]

Hidream-e1: Instruction-based image editing model, 2025

HiDream-ai. Hidream-e1: Instruction-based image editing model, 2025. URL https://github. com/HiDream-ai/HiDream-E1. 3

work page 2025

[16] [16]

Classifier-free diffusion guidance, 2022

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. 6

work page 2022

[17] [17]

Denoising diffusion probabilistic models, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020. 1

work page 2020

[18] [18]

Simple diffusion: End-to-end diffusion for high resolution images, 2023

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images, 2023. 14

work page 2023

[19] [19]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1 (2):3, 2022. 3

work page 2022

[20] [20]

In-context lora for diffusion transformers.arXiv preprint arXiv:2410.23775, 2024a

Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775, 2024. 3

work page arXiv 2024

[21] [21]

Imagen-Team-Google, :, Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-Rosen, Hongliang Fei, Nando de Freitas, Yilin Gao, Evgeny Gladchenko, Ser- gio Gómez Colmenarejo, Mandy Guo, Alex Haig, Will Hawkins, Hexiang Hu, Huilian Huang, Tobenna Pete...

work page arXiv 2024

[22] [22]

Introducing auraface: Open-source face recognition and identity preservation models

isidentical. Introducing auraface: Open-source face recognition and identity preservation models. https://huggingface.co/blog/isidentical/auraface, 2024. Accessed: 2025-05-26. 9

work page 2024

[23] [23]

Experiment with gemini 2.0 flash na- tive image generation, 2025

Kat Kampf and Nicole Brichtova. Experiment with gemini 2.0 flash na- tive image generation, 2025. URL https://developers.googleblog.com/en/ experiment-with-gemini-20-flash-native-image-generation/. 3

work page 2025

[24] [24]

arXiv preprint arXiv:2210.09276 , year=

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models, 2023. URL https://arxiv.org/abs/2210.09276. 3

work page arXiv 2023

[25] [25]

Understanding diffusion objectives as the elbo with simple data augmentation

Diederik P Kingma and Ruiqi Gao. Understanding diffusion objectives as the elbo with simple data augmentation. In Thirty-seventh Conference on Neural Information Processing Systems,

work page

[26] [26]

Reducing activation recomputation in large transformer models

Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Ander- sch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5:341–353, 2023. 7

work page 2023

[27] [27]

Multi- concept customization of text-to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi- concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023. 3

work page 1931

[28] [28]

Genai-bench: Evaluating and improving compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, et al. Genai-bench: Evaluating and improving compositional text-to-visual generation. arXiv preprint arXiv:2406.13743, 2024. 8

work page arXiv 2024

[29] [29]

Liang, T

Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, et al. Torchtitan: One-stop pytorch native solution for production ready llm pre-training. arXiv preprint arXiv:2410.06511, 2024. 6

work page arXiv 2024

[30] [30]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t. 4, 6, 14

work page 2023

[31] [31]

Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025. 7

work page internal anchor Pith review arXiv 2025

[32] [32]

Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. 4, 14

work page 2022

[33] [33]

Repaint: Inpainting using denoising diffusion probabilistic models

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11461–11471,

work page

[34] [34]

On distillation of guided diffusion models

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297–14306, 2023. 6

work page 2023

[35] [35]

Midjourney, 2025

Midjourney. Midjourney, 2025. URL https://www.midjourney.com/home. 3, 12

work page 2025

[36] [36]

Introducing 4o image generation, 2025

OpenAI. Introducing 4o image generation, 2025. URL https://openai.com/index/ introducing-4o-image-generation/. 3 17

work page 2025

[37] [37]

Drag your gan: Interactive point-based manipulation on the generative image manifold

Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag your gan: Interactive point-based manipulation on the generative image manifold. In ACM SIGGRAPH 2023 conference proceedings, pages 1–11, 2023. 1

work page 2023

[38] [38]

In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

William Peebles and Saining Xie. Scalable diffusion models with transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2023. doi: 10.1109/ iccv51070.2023.00387. URL http://dx.doi.org/10.1109/ICCV51070.2023.00387. 4

work page doi:10.1109/iccv51070.2023.00387 2023

[39] Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. DreamBench++: A human-aligned benchmark for personalized image generation, 2025. URL https://arxiv.org/abs/2406.16855. 7

[40] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis, 2023. 1, 3, 5

[41] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents, 2022. 1, 7

[42] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022. doi: 10.1109/cvpr52688.2022.01042. URL http://dx.doi.org/10.1109/CVPR52688.2022.01042. 1, 3, 4

[43] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023. URL https://arxiv.org/abs/2208.12242. 3

[44] Runway AI, Inc. Runway | tools for human imagination, 2025. URL https://runwayml.com/. 3

[45] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022. 1

[46] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022. 8

[47] Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected GANs converge faster. Advances in Neural Information Processing Systems, 2021. 6

[48] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023. 6

[49] Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation. URL https://arxiv.org/abs/2403.12015. 1, 6

[51] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024. 7

[52] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu Edit: Precise image editing via recognition and generation tasks. arXiv preprint arXiv:2311.10089, 2023. 3, 7

[53] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 15

[54] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. 4

[55] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with Fourier convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2149–2159, 2022. 1

[56] Richard Szeliski. Computer vision: algorithms and applications. Springer Nature, 2022. 1

[57] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. arXiv preprint arXiv:2409.11340, 2024. 3

[58] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023. 1

[59] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023. 3, 12

[60] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022. URL https://arxiv.org/abs/2206.10789. 8

[61] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A manually annotated dataset for instruction-guided image editing, 2024. URL https://arxiv.org/abs/2306.10012. 7

[62] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023. 1

[63] Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. In-Context Edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer. arXiv preprint arXiv:2504.20690, 2025. 3