
arxiv: 2506.15742 · v2 · submitted 2025-06-17 · 💻 cs.GR

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Pith reviewed 2026-05-10 16:30 UTC · model grok-4.3

classification 💻 cs.GR
keywords image generation · image editing · flow matching · in-context generation · latent space · character consistency · multi-turn editing · benchmark

The pith

A flow matching model unifies image generation and editing by concatenating text and image inputs in latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FLUX.1 Kontext as a single architecture that generates new images or edits existing ones by incorporating semantic context from both text prompts and reference images. It relies on a straightforward sequence concatenation step inside a flow matching process to manage local edits and broader generative tasks together. This design yields stronger preservation of specific objects and characters when edits are applied repeatedly, avoiding the consistency loss seen in prior systems. The model matches leading quality benchmarks while running substantially faster, which supports interactive use. The claims rest on results from the new KontextBench dataset of 1026 image-prompt pairs spanning five editing categories.

Core claim

FLUX.1 Kontext is a generative flow matching model that unifies image generation and editing within one architecture. Using sequence concatenation to incorporate semantic context from text and image inputs, it handles both local editing and generative in-context tasks. It demonstrates improved preservation of objects and characters across multiple turns compared to models that degrade in consistency, while delivering competitive performance and significantly faster generation times.

What carries the argument

Sequence concatenation of text and image inputs inside the flow matching model operating in latent space, which unifies local and generative editing tasks.
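
The concatenation idea can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the token order, the two-dimensional tokens, and `toy_velocity`, which is a hypothetical stand-in for the learned network and not the paper's model. The sketch only shows that text tokens, optional reference-image tokens, and the tokens being generated share one sequence, and that the same sampling loop covers editing (context present) and generation (context empty).

```python
from typing import List, Tuple

Token = List[float]  # one token = one feature vector (toy dimensionality)


def build_sequence(text: List[Token], context: List[Token],
                   target: List[Token]) -> Tuple[List[Token], int]:
    """Concatenate along the sequence axis; return the sequence and the
    index at which the target tokens start."""
    conditioning = text + context  # context may be empty (pure generation)
    return conditioning + target, len(conditioning)


def toy_velocity(sequence: List[Token], start: int, t: float) -> List[Token]:
    """Hypothetical stand-in for the learned network. It reads the whole
    sequence but only returns velocities for the target positions."""
    cond = sequence[:start]
    dim = len(cond[0])
    mean = [sum(tok[d] for tok in cond) / len(cond) for d in range(dim)]
    # A straight path z_t = (1 - t) * x + t * noise has velocity (z_t - x) / t;
    # here the "clean" x is simply the mean of the conditioning tokens.
    return [[(tok[d] - mean[d]) / t for d in range(dim)]
            for tok in sequence[start:]]


def sample(text: List[Token], context: List[Token],
           noise: List[Token], steps: int = 8) -> List[Token]:
    """Euler integration from t = 1 (noise) down to t = 0 (data)."""
    target = [tok[:] for tok in noise]  # copy so the caller's noise is untouched
    dt = 1.0 / steps
    for i in range(steps):
        t = (steps - i) / steps
        sequence, start = build_sequence(text, context, target)
        velocity = toy_velocity(sequence, start, t)
        target = [[z - dt * v for z, v in zip(tok, vel)]
                  for tok, vel in zip(target, velocity)]
    return target


text = [[1.0, 0.0]]
context = [[0.0, 1.0], [2.0, 2.0]]
noise = [[5.0, -3.0], [-4.0, 7.0]]
edited = sample(text, context, noise)   # editing: text plus reference-image tokens
generated = sample(text, [], noise)     # generation: text only, no image context
```

In this toy setup the only difference between the two calls is whether the context list is empty, which is the sense in which a single architecture can serve both tasks.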

Load-bearing premise

The 1026 image-prompt pairs in KontextBench represent typical real-world editing tasks without selection bias that favors the new model.

What would settle it

A larger independent test set of editing tasks where FLUX.1 Kontext shows equal or greater degradation in character consistency and slower speeds than existing models.

Original abstract

We present evaluation results for FLUX.1 Kontext, a generative flow matching model that unifies image generation and editing. The model generates novel output views by incorporating semantic context from text and image inputs. Using a simple sequence concatenation approach, FLUX.1 Kontext handles both local editing and generative in-context tasks within a single unified architecture. Compared to current editing models that exhibit degradation in character consistency and stability across multiple turns, we observe that FLUX.1 Kontext improved preservation of objects and characters, leading to greater robustness in iterative workflows. The model achieves competitive performance with current state-of-the-art systems while delivering significantly faster generation times, enabling interactive applications and rapid prototyping workflows. To validate these improvements, we introduce KontextBench, a comprehensive benchmark with 1026 image-prompt pairs covering five task categories: local editing, global editing, character reference, style reference and text editing. Detailed evaluations show the superior performance of FLUX.1 Kontext in terms of both single-turn quality and multi-turn consistency, setting new standards for unified image processing models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents FLUX.1 Kontext, a flow-matching model that unifies image generation and editing via simple sequence concatenation in latent space. It claims improved character/object preservation and multi-turn robustness over prior editing models, competitive performance with current SOTA systems, significantly faster inference, and introduces KontextBench (1026 author-curated image-prompt pairs across local editing, global editing, character reference, style reference, and text editing) to validate these advantages.

Significance. If the empirical claims hold on an independently constructed test distribution, the unified latent-space concatenation approach and observed stability in iterative workflows would constitute a practical advance for interactive image editing pipelines, with the reported speed advantage enabling new prototyping use cases. The flow-matching backbone and single-architecture design are clear strengths that could be extended.

major comments (3)
  1. [KontextBench description] The section introducing KontextBench provides no information on prompt sourcing, difficulty calibration against existing models, or whether the 1026 pairs were selected after observing FLUX.1 Kontext behavior; this directly undermines the central claim of superior multi-turn consistency because selection bias favoring the concatenation mechanism cannot be excluded.
  2. [Evaluation / Results] The evaluation section reports comparisons to SOTA editing models but supplies no details on baseline implementations, versions, hyperparameter choices, or the exact protocol for selecting reference images and prompts; without these, the asserted performance advantage and 'greater robustness' cannot be independently verified.
  3. [Results] No statistical significance tests, confidence intervals, or per-category variance are reported for the metrics on KontextBench; this weakens the headline assertion of 'superior performance' and 'new standards' given that the benchmark is newly introduced.
minor comments (2)
  1. [Abstract] The abstract states that the model 'achieves competitive performance' but does not name the quantitative metrics (e.g., CLIP score, LPIPS, or human preference rates) used to support this.
  2. [Figures and Tables] Figure captions and table headers could more explicitly indicate whether results are single-turn or multi-turn to aid quick reading.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve transparency, reproducibility, and statistical rigor.

Point-by-point responses
  1. Referee: [KontextBench description] The section introducing KontextBench provides no information on prompt sourcing, difficulty calibration against existing models, or whether the 1026 pairs were selected after observing FLUX.1 Kontext behavior; this directly undermines the central claim of superior multi-turn consistency because selection bias favoring the concatenation mechanism cannot be excluded.

    Authors: We agree that additional details on benchmark construction are essential to address potential selection bias concerns. The 1026 pairs were assembled from a mix of public image datasets and manually designed prompts targeting known multi-turn failure modes (e.g., character drift, style inconsistency). Difficulty calibration was performed by evaluating preliminary versions of several models on candidate pairs to ensure coverage across easy-to-hard cases. The final set was locked before running the complete FLUX.1 Kontext evaluation suite. We have added a new subsection 'KontextBench Construction' that documents sourcing criteria, the calibration procedure, and the timeline confirming the benchmark was fixed independently of the final model results. This revision directly mitigates the bias concern raised. revision: yes

  2. Referee: [Evaluation / Results] The evaluation section reports comparisons to SOTA editing models but supplies no details on baseline implementations, versions, hyperparameter choices, or the exact protocol for selecting reference images and prompts; without these, the asserted performance advantage and 'greater robustness' cannot be independently verified.

    Authors: We concur that missing implementation details hinder independent verification. The revised manuscript now includes an expanded 'Baseline and Evaluation Protocol' section specifying: exact model versions and checkpoints used for all baselines, inference hyperparameters (steps, guidance scales, schedulers), and the reference selection protocol (fixed benchmark inputs with no post-hoc filtering or cherry-picking; prompts applied verbatim). We have also added a link to the evaluation codebase and configuration files in the supplementary material to enable exact reproduction of the reported numbers and robustness observations. revision: yes

  3. Referee: [Results] No statistical significance tests, confidence intervals, or per-category variance are reported for the metrics on KontextBench; this weakens the headline assertion of 'superior performance' and 'new standards' given that the benchmark is newly introduced.

    Authors: The referee correctly notes that statistical analysis would strengthen the empirical claims. We have updated the results section with per-category means and standard deviations across the 1026 pairs, 95% bootstrap confidence intervals, and paired statistical tests (t-tests and Wilcoxon signed-rank) comparing FLUX.1 Kontext against each baseline. A new table reports these values together with p-values, confirming statistically significant gains in multi-turn consistency metrics. These additions provide the quantitative support for the performance claims. revision: yes
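
For readers unfamiliar with the analysis named in this response, a paired percentile bootstrap can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' evaluation code; the function name `paired_bootstrap_ci` and the example scores are made up for the sketch.

```python
import random
import statistics
from typing import List, Tuple


def paired_bootstrap_ci(a: List[float], b: List[float],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for a - b,
    where a[i] and b[i] are two models' scores on the same benchmark item."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equally long, non-empty score lists")
    diffs = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    # Resample the per-item differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper


# Made-up consistency scores for two models on the same six items.
model_a = [0.80, 0.70, 0.90, 0.85, 0.75, 0.80]
model_b = [0.60, 0.65, 0.70, 0.80, 0.60, 0.70]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(model_a, model_b)
```

An interval that excludes zero suggests the gap is unlikely to be resampling noise, though it says nothing about whether the benchmark items themselves were selected fairly.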

Circularity Check

0 steps flagged

No significant circularity: the empirical claims rest on external comparisons and a new benchmark, and do not reduce to the model's own inputs.

full rationale

The paper presents FLUX.1 Kontext as a flow-matching model using simple sequence concatenation in latent space to unify generation and editing. No equations, derivations, or first-principles results are shown that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Performance claims (improved multi-turn consistency, competitive SOTA results, faster inference) are supported by direct comparisons to prior editing models on the newly introduced KontextBench. The benchmark is described as created to validate observed improvements, but its results are not forced by the model's internal definitions or training procedure. This is a standard empirical evaluation setup with external benchmarks and SOTA baselines, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the validity of the new benchmark tasks, the assumption that flow matching in latent space supports in-context conditioning via concatenation, and the standard supervised training assumptions for large generative models.

free parameters (1)
  • model architecture and training hyperparameters
    The flow matching model contains a large number of learned parameters whose values are determined by training on unspecified data.
axioms (1)
  • domain assumption: Flow matching can be extended to conditional generation by simple sequence concatenation of text and image tokens in latent space.
    Invoked to justify the unified architecture without additional adapters or encoders.
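
One common way to write this assumption down uses the straight-line (rectified-flow) form of flow matching. The notation below is illustrative and not taken from this page: x is a clean target latent, ε is Gaussian noise, y stands for reference-image tokens, c for text tokens, and the bracket denotes concatenation along the sequence axis in an assumed order.

```latex
% Straight-line interpolation between a clean latent x and noise
z_t = (1 - t)\,x + t\,\varepsilon, \qquad t \in [0, 1], \quad \varepsilon \sim \mathcal{N}(0, I)

% Regress the network output onto the constant velocity of that path;
% conditioning enters only through the concatenated sequence [c ; y ; z_t]
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x,\,\varepsilon}
  \left\| v_\theta\big([c ; y ; z_t],\, t\big) - (\varepsilon - x) \right\|_2^2
```

Under this reading, pure generation corresponds to an empty y, so no separate adapter or encoder path is needed to switch between tasks.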

pith-pipeline@v0.9.0 · 5561 in / 1286 out tokens · 60744 ms · 2026-05-10T16:30:33.008737+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

    cs.CV 2026-04 unverdicted novelty 8.0

    OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

  2. From Activation to Causality: Discovery of Causal Visual Representations in the Human Brain

    cs.CV 2026-05 unverdicted novelty 7.0

    BrainCause recovers known visual localizations and finds new candidate representations by validating causal specificity via counterfactual stimuli and encoding models, showing activation alone produces many false positives.

  3. VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

    cs.CV 2026-05 unverdicted novelty 7.0

    VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.

  4. Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion

    cs.LG 2026-05 unverdicted novelty 7.0

    CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.

  5. CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration

    cs.CV 2026-05 unverdicted novelty 7.0

    CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.

  6. GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

    cs.CV 2026-05 unverdicted novelty 7.0

    GenEvolve proposes a self-evolving agent framework for open-ended image generation that uses tool-orchestrated trajectories and visual experience distillation from best-worst differences to achieve reported state-of-t...

  7. MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling

    cs.CV 2026-05 conditional novelty 7.0

    MetaEarth-MM unifies multi-modal remote sensing image generation and any-to-any translation across five modalities via scene-centered joint modeling on the new EarthMM dataset.

  8. Accelerating Rectified Flow Models via Trajectory-Aware Caching

    cs.CV 2026-05 unverdicted novelty 7.0

    TACache accelerates rectified flow sampling up to 4.14x for text-to-image and 2.11x for text-to-video via offline skip scheduling from cumulative variation thresholds and online velocity reconstruction using historica...

  9. From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

  10. MiVE: Multiscale Vision-language features for reference-guided video Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer, achieving top human preference in reference-guided video editing.

  11. UniTriGen: Unified Triplet Generation of Aligned Visible-Infrared-Label for Few-Shot RGB-T Semantic Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    UniTriGen uses unified diffusion in a shared latent space plus lightweight adapters and scene-balanced sampling to produce high-quality aligned VIS-IR-Label triplets from limited paired data, improving few-shot RGB-T ...

  12. PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting

    cs.CV 2026-05 unverdicted novelty 7.0

    PanoPlane achieves up to 17.8% PSNR gains in sparse-view indoor novel view synthesis by using training-free plane-aware panoramic completion to supervise 3D Gaussian Splatting.

  13. Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

  14. Inline Critic Steers Image Editing

    cs.CV 2026-05 conditional novelty 7.0

    Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.

  15. Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

  16. Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...

  17. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.

  18. LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR

    cs.CV 2026-05 unverdicted novelty 7.0

    LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.

  19. Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing

    cs.CR 2026-05 unverdicted novelty 7.0

    Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.

  20. Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision

    cs.CV 2026-05 unverdicted novelty 7.0

    Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.

  21. BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    BRIDGE uses separate main and subject paths plus a discrete gate on positional embeddings to improve local edits with coarse masks, raising local SigLIP2-T from 0.39 to 0.50 on its benchmark.

  22. Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

    cs.CV 2026-05 unverdicted novelty 7.0

    Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.

  23. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...

  24. MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.

  25. Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection

    cs.CV 2026-05 unverdicted novelty 7.0

    MPFM models flow matching velocity as a Gaussian mixture prior per normal class plus a mutual information regularizer to improve open-set anomaly detection over unimodal prototypes.

  26. Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection

    cs.CV 2026-05 unverdicted novelty 7.0

    MPFM uses flow matching with a Gaussian mixture prior on the velocity field and a mutual information maximizer to improve open-set anomaly detection over unimodal prototype methods.

  27. VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching

    cs.CV 2026-04 unverdicted novelty 7.0

    VeraRetouch is a 0.5B VLM-based framework with a differentiable Retouch Renderer and a new million-scale AetherRetouch-1M+ dataset that claims state-of-the-art results in reasoning photo retouching while enabling mobi...

  28. How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

    cs.LG 2026-04 unverdicted novelty 7.0

    FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.

  29. Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.

  30. AI-Gram: When Visual Agents Interact in a Social Network

    cs.AI 2026-04 unverdicted novelty 7.0

    Autonomous visual AI agents spontaneously form image reply chains, maintain stable individual styles, and produce richer style-diverse conversations than single agents can achieve alone.

  31. GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds

    cs.CV 2026-04 unverdicted novelty 7.0

    GSCompleter completes sparse 3D Gaussian Splatting scenes via a distillation-free generate-then-register pipeline using Stereo-Anchor lifting and Ray-Constrained Registration, delivering SOTA results on three benchmarks.

  32. GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds

    cs.CV 2026-04 unverdicted novelty 7.0

    GSCompleter completes 3DGS scenes from sparse viewpoints using a generate-then-register workflow with stereo-anchor view selection and ray-constrained registration to achieve metric-aware results and SOTA performance ...

  33. ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.

  34. HP-Edit: A Human-Preference Post-Training Framework for Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.

  35. Generative Texture Filtering

    cs.CV 2026-04 unverdicted novelty 7.0

    A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.

  36. View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity

    cs.CV 2026-04 unverdicted novelty 7.0

    A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.

  37. Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification

    cs.AI 2026-04 unverdicted novelty 7.0

    Rule-VLN is the first large-scale benchmark injecting 177 regulatory categories into an urban environment, and the proposed SNRM module equips pre-trained VLN agents with zero-shot semantic reasoning and detour planni...

  38. UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.

  39. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

    cs.CV 2026-04 unverdicted novelty 7.0

    LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

  40. Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes

    cs.CV 2026-04 unverdicted novelty 7.0

    Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.

  41. OneHOI: Unifying Human-Object Interaction Generation and Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    OneHOI unifies HOI generation and editing in one conditional diffusion transformer using role-aware tokens, structured attention, and joint training on mixed datasets to reach SOTA on both tasks.

  42. LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    LottieGPT tokenizes Lottie animations into compact sequences and fine-tunes Qwen-VL to autoregressively generate coherent vector animations from natural language or visual prompts, outperforming prior SVG models.

  43. HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement

    cs.CV 2026-04 unverdicted novelty 7.0

    A diffusion-based pipeline creates a 27M-annotation dataset of object placements that outperforms human annotations and baselines on image editing tasks, then distills it into a fast model.

  44. RewardFlow: Generate Images by Optimizing What You Reward

    cs.CV 2026-04 unverdicted novelty 7.0

    RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.

  45. Personalizing Text-to-Image Generation to Individual Taste

    cs.CV 2026-04 unverdicted novelty 7.0

    PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

  46. RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

    cs.CV 2026-04 unverdicted novelty 7.0

    RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.

  47. PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space

    cs.LG 2026-04 unverdicted novelty 7.0

    PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.

  48. HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits

    cs.CV 2026-04 unverdicted novelty 7.0

    HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.

  49. SPRITE: From Static Mockups to Engine-Ready Game UI

    cs.HC 2026-03 unverdicted novelty 7.0

    SPRITE converts static game UI screenshots into editable engine-ready assets by using VLMs to parse complex layouts into a YAML intermediate representation.

  50. Reflective Flow Sampling Enhancement

    cs.CV 2026-03 unverdicted novelty 7.0

    RF-Sampling enhances flow matching models by implicitly performing gradient ascent on text-image alignment scores via linear textual combinations and flow inversion.

  51. EvoDiagram: Agentic Editable Diagram Creation via Design Expertise Evolution

    cs.HC 2026-02 unverdicted novelty 7.0

    EvoDiagram uses a coordinated multi-agent system and design knowledge evolution to generate editable diagrams via canvas schema, with a new CanvasBench benchmark showing strong performance over baselines.

  52. A Unified and Controllable Framework for Layered Image Generation with Visual Effects

    cs.CV 2026-01 unverdicted novelty 7.0

    LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.

  53. ATATA: One Algorithm to Align Them All

    cs.CV 2026-01 unverdicted novelty 7.0

    ATATA enables fast joint inference of structurally aligned pairs using Rectified Flow models via segment transport, improving state-of-the-art for image and video generation while matching 3D quality at much higher speed.

  54. InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation

    cs.CV 2025-12 unverdicted novelty 7.0

    InstructMoLE replaces per-token routing with instruction-guided global routing for mixture-of-low-rank-experts in diffusion transformers and adds an output-space orthogonality loss to improve multi-conditional image g...

  55. Do-Undo Bench: Reversibility for Action Understanding in Image Generation

    cs.CV 2025-12 unverdicted novelty 7.0

    Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.

  56. Setting the Stage: Text-Driven Scene-Consistent Image Generation

    cs.CV 2025-12 conditional novelty 7.0

    A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.

  57. Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

    cs.CV 2025-12 unverdicted novelty 7.0

    Omni-Attribute is a new open-vocabulary image attribute encoder trained on semantically linked pairs with dual objectives to produce disentangled representations for personalization and compositional generation.

  58. AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

    cs.CV 2025-12 unverdicted novelty 7.0

    AVI-Edit enables precise audio-synchronized instance-level video editing via a granularity-aware mask refiner, a self-feedback audio agent, and a new large-scale annotated dataset.

  59. Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality

    cs.CV 2025-12 unverdicted novelty 7.0

    LivingSwap is the first video reference-guided face swapping model that uses keyframe conditioning and temporal stitching to preserve source video realism with high fidelity across long sequences.

  60. From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity

    cs.LG 2025-12 conditional novelty 7.0

    Flow matching models follow a two-stage process of navigation across data modes then refinement to nearest samples, revealed by exact computation of the oracle marginal velocity field.


  9. [9]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. 7

  10. [10]

    Arcface: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019. 9

  11. [11]

    Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis

    Patrick Esser, Robin Rombach, Andreas Blattmann, and Bjorn Ommer. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. Advances in neural information processing systems, 34:3518–3532, 2021. 1

  12. [12]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403...

  13. [13]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 3

  14. [14]

    Digital image processing

    Rafael C Gonzalez. Digital image processing. Pearson education india, 2009. 1

  15. [15]

    Hidream-e1: Instruction-based image editing model, 2025

    HiDream-ai. Hidream-e1: Instruction-based image editing model, 2025. URL https://github. com/HiDream-ai/HiDream-E1. 3

  16. [16]

    Classifier-free diffusion guidance, 2022

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. 6

  17. [17]

    Denoising diffusion probabilistic models, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020. 1

  18. [18]

    Simple diffusion: End-to-end diffusion for high resolution images, 2023

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images, 2023. 14

  19. [19]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1 (2):3, 2022. 3

  20. [20]

    In-context lora for diffusion transformers.arXiv preprint arXiv:2410.23775, 2024a

    Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775, 2024. 3

  21. [21]

    Imagen-Team-Google, :, Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-Rosen, Hongliang Fei, Nando de Freitas, Yilin Gao, Evgeny Gladchenko, Ser- gio Gómez Colmenarejo, Mandy Guo, Alex Haig, Will Hawkins, Hexiang Hu, Huilian Huang, Tobenna Pete...

  22. [22]

    Introducing auraface: Open-source face recognition and identity preservation models

    isidentical. Introducing auraface: Open-source face recognition and identity preservation models. https://huggingface.co/blog/isidentical/auraface, 2024. Accessed: 2025-05-26. 9

  23. [23]

    Experiment with gemini 2.0 flash na- tive image generation, 2025

    Kat Kampf and Nicole Brichtova. Experiment with gemini 2.0 flash na- tive image generation, 2025. URL https://developers.googleblog.com/en/ experiment-with-gemini-20-flash-native-image-generation/. 3

  24. [24]

    arXiv preprint arXiv:2210.09276 , year=

    Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models, 2023. URL https://arxiv.org/abs/2210.09276. 3

  25. [25]

    Understanding diffusion objectives as the elbo with simple data augmentation

    Diederik P Kingma and Ruiqi Gao. Understanding diffusion objectives as the elbo with simple data augmentation. In Thirty-seventh Conference on Neural Information Processing Systems,

  26. [26]

    Reducing activation recomputation in large transformer models

    Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Ander- sch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5:341–353, 2023. 7

  27. [27]

    Multi- concept customization of text-to-image diffusion

    Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi- concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023. 3

  28. [28]

    Genai-bench: Evaluating and improving compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

    Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, et al. Genai-bench: Evaluating and improving compositional text-to-visual generation. arXiv preprint arXiv:2406.13743, 2024. 8

  29. [29]

    Liang, T

    Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, et al. Torchtitan: One-stop pytorch native solution for production ready llm pre-training. arXiv preprint arXiv:2410.06511, 2024. 6

  30. [30]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t. 4, 6, 14

  31. [31]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025. 7

  32. [32]

    Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. 4, 14

  33. [33]

    Repaint: Inpainting using denoising diffusion probabilistic models

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11461–11471,

  34. [34]

    On distillation of guided diffusion models

    Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297–14306, 2023. 6

  35. [35]

    Midjourney, 2025

    Midjourney. Midjourney, 2025. URL https://www.midjourney.com/home. 3, 12

  36. [36]

    Introducing 4o image generation, 2025

    OpenAI. Introducing 4o image generation, 2025. URL https://openai.com/index/ introducing-4o-image-generation/. 3 17

  37. [37]

    Drag your gan: Interactive point-based manipulation on the generative image manifold

    Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag your gan: Interactive point-based manipulation on the generative image manifold. In ACM SIGGRAPH 2023 conference proceedings, pages 1–11, 2023. 1

  38. [38]

    In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2023. doi: 10.1109/ iccv51070.2023.00387. URL http://dx.doi.org/10.1109/ICCV51070.2023.00387. 4

  39. [39]

    Dreambench++: A human-aligned bench- mark for personalized image generation.arXiv preprint arXiv:2406.16855, 2024

    Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation, 2025. URL https://arxiv.org/abs/2406.16855. 7

  40. [40]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. 1, 3, 5

  41. [41]

    Hierarchical text-conditional image generation with clip latents, 2022

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022. 1, 7

  42. [42]

    A ConvNet for the 2020s

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High- resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022. doi: 10.1109/cvpr52688.2022. 01042. URL http://dx.doi.org/10.1109/CVPR52688.2022.01042. 1, 3, 4

  43. [43]

    Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023. URL https://arxiv.org/abs/2208.12242. 3

  44. [44]

    Runway AI

    Inc. Runway AI. Runway | tools for human imagination, 2025. URL https://runwayml.com/. 3

  45. [45]

    Palette: Image-to-image diffusion models

    Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022. 1

  46. [46]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022. 8

  47. [47]

    Projected gans converge faster

    Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected gans converge faster. Advances in Neural Information Processing Systems, 2021. 6

  48. [48]

    Adversarial diffusion distillation

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023. 6

  49. [49]

    Fast high-resolution image synthesis with latent adversarial diffusion distillation,

    Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation,

  50. [50]

    URL https://arxiv.org/abs/2403.12015. 1, 6

  51. [51]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024. 7

  52. [52]

    Emu edit: Precise image editing via recognition and generation tasks.arXiv preprint arXiv:2311.10089, 2023

    Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. arXiv preprint arXiv:2311.10089, 2023. 3, 7

  53. [53]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 15

  54. [54]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. 4 18

  55. [55]

    Resolution-robust large mask inpainting with fourier convolutions

    Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2149–2159, 2022. 1

  56. [56]

    Computer vision: algorithms and applications

    Richard Szeliski. Computer vision: algorithms and applications. Springer Nature, 2022. 1

  57. [57]

    OmniGen: Unified Image Generation

    Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. arXiv preprint arXiv:2409.11340, 2024. 3

  58. [58]

    Paint by example: Exemplar-based image editing with diffusion models

    Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18381–18391, 2023. 1

  59. [59]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023. 3, 12

  60. [60]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022. URL https://arxiv.org/abs/2206.10789. 8

  61. [61]

    Magicbrush: A manually annotated dataset for instruction-guided image editing.ArXiv, abs/2306.10012, 2023

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing, 2024. URL https://arxiv.org/abs/2306.10012. 7

  62. [62]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023. 1

  63. [63]

    In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

    Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer.arXiv preprint arXiv:2504.20690, 2025. 3 19