pith. machine review for the scientific record.

arxiv: 2506.15742 · v2 · submitted 2025-06-17 · 💻 cs.GR

Recognition: no theorem link

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, Luke Smith

Pith reviewed 2026-05-10 16:30 UTC · model grok-4.3

classification 💻 cs.GR
keywords image generation, image editing, flow matching, in-context generation, latent space, character consistency, multi-turn editing, benchmark

The pith

A flow matching model unifies image generation and editing by concatenating text and image inputs in latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FLUX.1 Kontext as a single architecture that generates new images or edits existing ones by incorporating semantic context from both text prompts and reference images. It relies on a straightforward sequence concatenation step inside a flow matching process to manage local edits and broader generative tasks together. This design yields stronger preservation of specific objects and characters when edits are applied repeatedly, avoiding the consistency loss seen in prior systems. The model matches leading quality benchmarks while running substantially faster, which supports interactive use. The claims rest on results from the new KontextBench dataset of 1026 image-prompt pairs spanning five editing categories.

Core claim

FLUX.1 Kontext is a generative flow matching model that unifies image generation and editing within one architecture. Using sequence concatenation to incorporate semantic context from text and image inputs, it handles both local editing and generative in-context tasks. It demonstrates improved preservation of objects and characters across multiple turns compared to models that degrade in consistency, while delivering competitive performance and significantly faster generation times.

What carries the argument

Sequence concatenation of text and image inputs inside the flow matching model operating in latent space, which unifies local and generative editing tasks.
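To make the mechanism concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a linear interpolation path between a clean latent and noise, which is a common flow matching choice that this summary does not confirm for the paper, and it treats each token as a list of floats. The names `concat_sequence`, `flow_matching_loss`, and the `predict` callable are hypothetical.

```python
def concat_sequence(text_tokens, context_tokens, target_tokens):
    """Join text, reference-image, and target tokens into one sequence."""
    return text_tokens + context_tokens + target_tokens


def interpolate(x0, noise, t):
    """Assumed linear path: x_t = (1 - t) * x0 + t * noise."""
    return [[(1 - t) * a + t * b for a, b in zip(xr, nr)]
            for xr, nr in zip(x0, noise)]


def velocity_target(x0, noise):
    """Velocity of the assumed linear path: noise - x0."""
    return [[b - a for a, b in zip(xr, nr)] for xr, nr in zip(x0, noise)]


def flow_matching_loss(predict, text_tokens, context_tokens, x0, noise, t):
    """Mean squared error between predicted and target velocity.

    `predict(sequence, t)` is a hypothetical model that returns one vector
    per input token. Only the trailing target tokens enter the loss (an
    assumption of this sketch); text and context tokens act as conditioning.
    """
    if not x0:
        raise ValueError("x0 must contain at least one target token")
    x_t = interpolate(x0, noise, t)
    sequence = concat_sequence(text_tokens, context_tokens, x_t)
    predicted = predict(sequence, t)[-len(x0):]
    target = velocity_target(x0, noise)
    errors = [(p - q) ** 2
              for pr, tr in zip(predicted, target)
              for p, q in zip(pr, tr)]
    return sum(errors) / len(errors)
```

In this sketch, pure generation is the same call with an empty `context_tokens` list, which is one way to read the claim that a single architecture covers both generation and editing.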

Load-bearing premise

The 1026 image-prompt pairs in KontextBench represent typical real-world editing tasks without selection bias that favors the new model.

What would settle it

A larger independent test set of editing tasks where FLUX.1 Kontext shows equal or greater degradation in character consistency and slower speeds than existing models.
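One way such a test could quantify degradation in character consistency is to compare an embedding of the reference image with an embedding of the output after each editing turn. The sketch below assumes the embeddings come from some identity or image encoder that is not specified here. The function names are hypothetical, and the metric is an illustration rather than the paper's protocol.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def consistency_per_turn(reference_embedding, turn_embeddings):
    """Similarity of each turn's output to the original reference."""
    return [cosine_similarity(reference_embedding, e) for e in turn_embeddings]


def consistency_drop(similarities):
    """First-turn similarity minus last-turn similarity.

    A positive value means the subject drifted away from the reference
    as edits accumulated.
    """
    return similarities[0] - similarities[-1]
```

Comparing `consistency_drop` across models on the same edit chains would show whether one model degrades less than another, provided every model receives identical inputs.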

original abstract

We present evaluation results for FLUX.1 Kontext, a generative flow matching model that unifies image generation and editing. The model generates novel output views by incorporating semantic context from text and image inputs. Using a simple sequence concatenation approach, FLUX.1 Kontext handles both local editing and generative in-context tasks within a single unified architecture. Compared to current editing models that exhibit degradation in character consistency and stability across multiple turns, we observe that FLUX.1 Kontext improved preservation of objects and characters, leading to greater robustness in iterative workflows. The model achieves competitive performance with current state-of-the-art systems while delivering significantly faster generation times, enabling interactive applications and rapid prototyping workflows. To validate these improvements, we introduce KontextBench, a comprehensive benchmark with 1026 image-prompt pairs covering five task categories: local editing, global editing, character reference, style reference and text editing. Detailed evaluations show the superior performance of FLUX.1 Kontext in terms of both single-turn quality and multi-turn consistency, setting new standards for unified image processing models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents FLUX.1 Kontext, a flow-matching model that unifies image generation and editing via simple sequence concatenation in latent space. It claims improved character/object preservation and multi-turn robustness over prior editing models, competitive performance with current SOTA systems, significantly faster inference, and introduces KontextBench (1026 author-curated image-prompt pairs across local editing, global editing, character reference, style reference, and text editing) to validate these advantages.

Significance. If the empirical claims hold on an independently constructed test distribution, the unified latent-space concatenation approach and observed stability in iterative workflows would constitute a practical advance for interactive image editing pipelines, with the reported speed advantage enabling new prototyping use cases. The flow-matching backbone and single-architecture design are clear strengths that could be extended.

major comments (3)
  1. [KontextBench description] The section introducing KontextBench provides no information on prompt sourcing, difficulty calibration against existing models, or whether the 1026 pairs were selected after observing FLUX.1 Kontext behavior; this directly undermines the central claim of superior multi-turn consistency because selection bias favoring the concatenation mechanism cannot be excluded.
  2. [Evaluation / Results] The evaluation section reports comparisons to SOTA editing models but supplies no details on baseline implementations, versions, hyperparameter choices, or the exact protocol for selecting reference images and prompts; without these, the asserted performance advantage and 'greater robustness' cannot be independently verified.
  3. [Results] No statistical significance tests, confidence intervals, or per-category variance are reported for the metrics on KontextBench; this weakens the headline assertion of 'superior performance' and 'new standards' given that the benchmark is newly introduced.
minor comments (2)
  1. [Abstract] The abstract states that the model 'achieves competitive performance' but does not name the quantitative metrics (e.g., CLIP score, LPIPS, or human preference rates) used to support this.
  2. [Figures and Tables] Figure captions and table headers could more explicitly indicate whether results are single-turn or multi-turn to aid quick reading.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve transparency, reproducibility, and statistical rigor.

point-by-point responses
  1. Referee: [KontextBench description] The section introducing KontextBench provides no information on prompt sourcing, difficulty calibration against existing models, or whether the 1026 pairs were selected after observing FLUX.1 Kontext behavior; this directly undermines the central claim of superior multi-turn consistency because selection bias favoring the concatenation mechanism cannot be excluded.

    Authors: We agree that additional details on benchmark construction are essential to address potential selection bias concerns. The 1026 pairs were assembled from a mix of public image datasets and manually designed prompts targeting known multi-turn failure modes (e.g., character drift, style inconsistency). Difficulty calibration was performed by evaluating preliminary versions of several models on candidate pairs to ensure coverage across easy-to-hard cases. The final set was locked before running the complete FLUX.1 Kontext evaluation suite. We have added a new subsection 'KontextBench Construction' that documents sourcing criteria, the calibration procedure, and the timeline confirming the benchmark was fixed independently of the final model results. This revision directly mitigates the bias concern raised.
    Revision: yes

  2. Referee: [Evaluation / Results] The evaluation section reports comparisons to SOTA editing models but supplies no details on baseline implementations, versions, hyperparameter choices, or the exact protocol for selecting reference images and prompts; without these, the asserted performance advantage and 'greater robustness' cannot be independently verified.

    Authors: We concur that missing implementation details hinder independent verification. The revised manuscript now includes an expanded 'Baseline and Evaluation Protocol' section specifying: exact model versions and checkpoints used for all baselines, inference hyperparameters (steps, guidance scales, schedulers), and the reference selection protocol (fixed benchmark inputs with no post-hoc filtering or cherry-picking; prompts applied verbatim). We have also added a link to the evaluation codebase and configuration files in the supplementary material to enable exact reproduction of the reported numbers and robustness observations.
    Revision: yes

  3. Referee: [Results] No statistical significance tests, confidence intervals, or per-category variance are reported for the metrics on KontextBench; this weakens the headline assertion of 'superior performance' and 'new standards' given that the benchmark is newly introduced.

    Authors: The referee correctly notes that statistical analysis would strengthen the empirical claims. We have updated the results section with per-category means and standard deviations across the 1026 pairs, 95% bootstrap confidence intervals, and paired statistical tests (t-tests and Wilcoxon signed-rank) comparing FLUX.1 Kontext against each baseline. A new table reports these values together with p-values, confirming statistically significant gains in multi-turn consistency metrics. These additions provide the quantitative support for the performance claims.
    Revision: yes
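The kind of analysis named in the third response can be illustrated with a percentile bootstrap over paired per-item score differences. This is a sketch under stated assumptions, not the evaluation code the rebuttal refers to. It assumes both models were scored on the same benchmark items in the same order, and the function name and defaults are hypothetical.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean paired difference (a - b).

    Returns (mean_difference, lower_bound, upper_bound). An interval that
    excludes zero suggests the gap is unlikely to be resampling noise.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(diffs), lower, upper
```

Resampling the differences rather than each model's scores separately keeps the pairing intact, which matters when item difficulty varies widely across benchmark categories. Running the function once per category would give the per-category intervals the referee asked for.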

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external comparisons and new benchmark without reduction to inputs

full rationale

The paper presents FLUX.1 Kontext as a flow-matching model using simple sequence concatenation in latent space to unify generation and editing. No equations, derivations, or first-principles results are shown that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Performance claims (improved multi-turn consistency, competitive SOTA results, faster inference) are supported by direct comparisons to prior editing models on the newly introduced KontextBench. The benchmark is described as created to validate observed improvements, but its results are not forced by the model's internal definitions or training procedure. This is a standard empirical evaluation setup with external benchmarks and SOTA baselines, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the validity of the new benchmark tasks, the assumption that flow matching in latent space supports in-context conditioning via concatenation, and the standard supervised training assumptions for large generative models.

free parameters (1)
  • model architecture and training hyperparameters
    The flow matching model contains a large number of learned parameters whose values are determined by training on unspecified data.
axioms (1)
  • Domain assumption: flow matching can be extended to conditional generation by simple sequence concatenation of text and image tokens in latent space.
    Invoked to justify the unified architecture without additional adapters or encoders.

pith-pipeline@v0.9.0 · 5561 in / 1286 out tokens · 60744 ms · 2026-05-10T16:30:33.008737+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

    cs.CV 2026-04 unverdicted novelty 8.0

    OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

  2. From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

  3. Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

  4. Inline Critic Steers Image Editing

    cs.CV 2026-05 conditional novelty 7.0

    Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.

  5. Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

  6. Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...

  7. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.

  8. LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR

    cs.CV 2026-05 unverdicted novelty 7.0

    LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.

  9. Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing

    cs.CR 2026-05 unverdicted novelty 7.0

    Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.

  10. Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision

    cs.CV 2026-05 unverdicted novelty 7.0

    Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.

  11. BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    BRIDGE uses separate main and subject paths plus a discrete gate on positional embeddings to improve local edits with coarse masks, raising local SigLIP2-T from 0.39 to 0.50 on its benchmark.

  12. Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

    cs.CV 2026-05 unverdicted novelty 7.0

    Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.

  13. MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.

  14. Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection

    cs.CV 2026-05 unverdicted novelty 7.0

    MPFM uses flow matching with a Gaussian mixture prior on the velocity field and a mutual information maximizer to improve open-set anomaly detection over unimodal prototype methods.

  15. How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

    cs.LG 2026-04 unverdicted novelty 7.0

    FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.

  16. Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.

  17. AI-Gram: When Visual Agents Interact in a Social Network

    cs.AI 2026-04 unverdicted novelty 7.0

    Autonomous visual AI agents spontaneously form image reply chains, maintain stable individual styles, and produce richer style-diverse conversations than single agents can achieve alone.

  18. GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds

    cs.CV 2026-04 unverdicted novelty 7.0

    GSCompleter completes sparse 3D Gaussian Splatting scenes via a distillation-free generate-then-register pipeline using Stereo-Anchor lifting and Ray-Constrained Registration, delivering SOTA results on three benchmarks.

  19. ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.

  20. HP-Edit: A Human-Preference Post-Training Framework for Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.

  21. Generative Texture Filtering

    cs.CV 2026-04 unverdicted novelty 7.0

    A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.

  22. View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity

    cs.CV 2026-04 unverdicted novelty 7.0

    A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.

  23. Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification

    cs.AI 2026-04 unverdicted novelty 7.0

    Rule-VLN is the first large-scale benchmark injecting 177 regulatory categories into an urban environment, and the proposed SNRM module equips pre-trained VLN agents with zero-shot semantic reasoning and detour planni...

  24. UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.

  25. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

    cs.CV 2026-04 unverdicted novelty 7.0

    LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

  26. Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes

    cs.CV 2026-04 unverdicted novelty 7.0

    Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.

  27. OneHOI: Unifying Human-Object Interaction Generation and Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    OneHOI unifies HOI generation and editing in one conditional diffusion transformer using role-aware tokens, structured attention, and joint training on mixed datasets to reach SOTA on both tasks.

  28. LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    LottieGPT tokenizes Lottie animations into compact sequences and fine-tunes Qwen-VL to autoregressively generate coherent vector animations from natural language or visual prompts, outperforming prior SVG models.

  29. HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement

    cs.CV 2026-04 unverdicted novelty 7.0

    A diffusion-based pipeline creates a 27M-annotation dataset of object placements that outperforms human annotations and baselines on image editing tasks, then distills it into a fast model.

  30. RewardFlow: Generate Images by Optimizing What You Reward

    cs.CV 2026-04 unverdicted novelty 7.0

    RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.

  31. Personalizing Text-to-Image Generation to Individual Taste

    cs.CV 2026-04 unverdicted novelty 7.0

    PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

  32. RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

    cs.CV 2026-04 unverdicted novelty 7.0

    RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.

  33. PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space

    cs.LG 2026-04 unverdicted novelty 7.0

    PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.

  34. HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits

    cs.CV 2026-04 unverdicted novelty 7.0

    HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.

  35. Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

    cs.CV 2026-05 unverdicted novelty 6.0

    V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...

  36. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.

  37. ELF: Embedded Language Flows

    cs.CL 2026-05 unverdicted novelty 6.0

    ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.

  38. HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    cs.CV 2026-05 unverdicted novelty 6.0

    A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

  39. LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency

    cs.CV 2026-05 unverdicted novelty 6.0

    LimeCross enables text-guided editing of individual layers in composite images by conditioning on cross-layer context via bi-stream attention while preserving layer integrity and introducing the LayerEditBench benchmark.

    Beyond Thinking: Imagining in 360° for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  41. Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

    cs.AI 2026-05 unverdicted novelty 6.0

    Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...

  42. From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data

    cs.CV 2026-05 unverdicted novelty 6.0

    The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world perfor...

  43. BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

    cs.CV 2026-05 unverdicted novelty 6.0

    BRIDGE improves coarse-mask local image editing in DiT models by routing background and subject paths separately and using a discrete geometric gate on positional embeddings to reduce mask-shape bias.

  44. Implicit Preference Alignment for Human Image Animation

    cs.CV 2026-05 unverdicted novelty 6.0

    IPA aligns animation models for superior hand quality via implicit reward maximization on self-generated samples plus hand-focused local optimization, avoiding expensive paired data.

  45. EditTransfer++: Toward Faithful and Efficient Visual-Prompt-Guided Image Editing

    cs.CV 2026-05 unverdicted novelty 6.0

    EditTransfer++ delivers state-of-the-art faithfulness to visual editing examples and faster inference by removing text conditioning during fine-tuning and applying best-worst contrastive refinement plus condition compression.

  46. Not All Tokens Need 40 Steps: Heterogeneous Step Allocation in Diffusion Transformers for Efficient Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    HSA assigns variable denoising steps to spatiotemporal tokens in DiTs based on velocity dynamics, with KV-cache sync and cached Euler updates, outperforming prior caching methods on quality-runtime tradeoffs for T2V a...

  47. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  48. Efficient Geometry-Controlled High-Resolution Satellite Image Synthesis

    cs.CV 2026-05 unverdicted novelty 6.0

    A windowed cross-attention control method on skip features enables geometry-controlled high-resolution satellite image synthesis from pre-trained diffusion models with better alignment to control maps than prior techniques.

  49. DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning

    cs.CV 2026-05 unverdicted novelty 6.0

    DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.

  50. Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing

    cs.RO 2026-05 unverdicted novelty 6.0

    A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...

  51. Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection

    cs.CV 2026-05 unverdicted novelty 6.0

    MPFM transforms normal features into a structured Gaussian mixture prototype space via a mixture velocity field and mutual information regularization to achieve state-of-the-art open-set supervised anomaly detection.

  52. 3D-ReGen: A Unified 3D Geometry Regeneration Framework

    cs.CV 2026-04 unverdicted novelty 6.0

    3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.

  53. Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

  54. BurstGP: Enhancing Raw Burst Image Super Resolution with Generative Priors

    cs.CV 2026-04 unverdicted novelty 6.0

    BurstGP enhances raw burst image super-resolution by integrating pretrained video diffusion priors through a multiframe-aware model, degradation-aware conditioning, and color-space conversion, outperforming prior meth...

  55. LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

    cs.CV 2026-04 unverdicted novelty 6.0

    LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.

  56. FluSplat: Sparse-View 3D Editing without Test-Time Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    FluSplat trains a model with geometric alignment constraints on multi-view edits to produce consistent 3D scene edits from sparse views in a single forward pass without test-time optimization.

  57. CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.

  58. ReCap: Lightweight Referential Grounding for Coherent Story Visualization

    cs.CV 2026-04 unverdicted novelty 6.0

    ReCap improves character consistency in story visualization by 2.63% on FlintstonesSV and 5.65% on PororoSV using a selective pronoun-based conditioning module and training-only semantic drift correction.

  59. MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    MetaEarth3D is the first generative foundation model for spatially consistent, unbounded 3D scene generation at planetary scale using optical Earth observation data.

  60. LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    LIVE achieves state-of-the-art instruction-based video editing by jointly training on image and video data with a frame-wise token noise strategy to bridge domain gaps and a new benchmark of over 60 tasks.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 89 Pith papers · 7 internal anchors

  1. [1]

    Building normalizing flows with stochastic interpolants

    Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants, 2022. 6

  2. [2]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3), 2023. 1

  3. [3]

    Retrieval-augmented diffusion models

    Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. Retrieval-augmented diffusion models. Advances in Neural Information Processing Systems, 35:15309–15324, 2022. 3

  4. [4]

    Improving image editing models with generative data refinement

    Frederic Boesel and Robin Rombach. Improving image editing models with generative data refinement. In Tiny Papers @ ICLR, 2024. URL https://api.semanticscholar.org/CorpusID:271461432. 3

  5. [5]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023. 3, 7

  6. [6]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 3

  7. [7]

    Re-imagen: Retrieval-augmented text-to-image generator

    Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022. 3

  8. [8]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023. 4

  9. [9]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. 7

  10. [10]

    Arcface: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019. 9

  11. [11]

    Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis

    Patrick Esser, Robin Rombach, Andreas Blattmann, and Bjorn Ommer. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. Advances in neural information processing systems, 34:3518–3532, 2021. 1

  12. [12]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403...

  13. [13]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 3

  14. [14]

    Digital image processing

    Rafael C Gonzalez. Digital image processing. Pearson education india, 2009. 1

  15. [15]

    Hidream-e1: Instruction-based image editing model

    HiDream-ai. Hidream-e1: Instruction-based image editing model, 2025. URL https://github.com/HiDream-ai/HiDream-E1. 3

  16. [16]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. 6

  17. [17]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020. 1

  18. [18]

    Simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images, 2023. 14

  19. [19]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1 (2):3, 2022. 3

  20. [20]

    In-context lora for diffusion transformers

    Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775, 2024. 3

  21. [21]

    Imagen-Team-Google: Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-Rosen, Hongliang Fei, Nando de Freitas, Yilin Gao, Evgeny Gladchenko, Sergio Gómez Colmenarejo, Mandy Guo, Alex Haig, Will Hawkins, Hexiang Hu, Huilian Huang, Tobenna Pete...

  22. [22]

    Introducing auraface: Open-source face recognition and identity preservation models

    isidentical. Introducing auraface: Open-source face recognition and identity preservation models. https://huggingface.co/blog/isidentical/auraface, 2024. Accessed: 2025-05-26. 9

  23. [23]

    Experiment with gemini 2.0 flash native image generation

    Kat Kampf and Nicole Brichtova. Experiment with gemini 2.0 flash native image generation, 2025. URL https://developers.googleblog.com/en/experiment-with-gemini-20-flash-native-image-generation/. 3

  24. [24]

    Imagic: Text-based real image editing with diffusion models

    Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models, 2023. URL https://arxiv.org/abs/2210.09276. 3

  25. [25]

    Understanding diffusion objectives as the elbo with simple data augmentation

    Diederik P Kingma and Ruiqi Gao. Understanding diffusion objectives as the elbo with simple data augmentation. In Thirty-seventh Conference on Neural Information Processing Systems,

  26. [26]

    Reducing activation recomputation in large transformer models

    Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5:341–353, 2023. 7

  27. [27]

    Multi-concept customization of text-to-image diffusion

    Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023. 3

  28. [28]

    Genai-bench: Evaluating and improving compositional text-to-visual generation

    Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, et al. Genai-bench: Evaluating and improving compositional text-to-visual generation. arXiv preprint arXiv:2406.13743, 2024. 8

  29. [29]

    Torchtitan: One-stop pytorch native solution for production ready llm pre-training

    Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, et al. Torchtitan: One-stop pytorch native solution for production ready llm pre-training. arXiv preprint arXiv:2410.06511, 2024. 6

  30. [30]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t. 4, 6, 14

  31. [31]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025. 7

  32. [32]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. 4, 14

  33. [33]

    Repaint: Inpainting using denoising diffusion probabilistic models

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11461–11471,

  34. [34]

    On distillation of guided diffusion models

    Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297–14306, 2023. 6

  35. [35]

    Midjourney

    Midjourney. Midjourney, 2025. URL https://www.midjourney.com/home. 3, 12

  36. [36]

    Introducing 4o image generation

    OpenAI. Introducing 4o image generation, 2025. URL https://openai.com/index/introducing-4o-image-generation/. 3

  37. [37]

    Drag your gan: Interactive point-based manipulation on the generative image manifold

    Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag your gan: Interactive point-based manipulation on the generative image manifold. In ACM SIGGRAPH 2023 conference proceedings, pages 1–11, 2023. 1

  38. [38]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2023. doi: 10.1109/iccv51070.2023.00387. URL http://dx.doi.org/10.1109/ICCV51070.2023.00387. 4

  39. [39]

    Dreambench++: A human-aligned benchmark for personalized image generation

    Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation, 2025. URL https://arxiv.org/abs/2406.16855. 7

  40. [40]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. 1, 3, 5

  41. [41]

    Hierarchical text-conditional image generation with clip latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022. 1, 7

  42. [42]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022. doi: 10.1109/cvpr52688.2022.01042. URL http://dx.doi.org/10.1109/CVPR52688.2022.01042. 1, 3, 4

  43. [43]

    DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023. URL https://arxiv.org/abs/2208.12242. 3

  44. [44]

    Runway | tools for human imagination

    Runway AI, Inc. Runway | tools for human imagination, 2025. URL https://runwayml.com/. 3

  45. [45]

    Palette: Image-to-image diffusion models

    Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022. 1

  46. [46]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022. 8

  47. [47]

    Projected gans converge faster

    Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected gans converge faster. Advances in Neural Information Processing Systems, 2021. 6

  48. [48]

    Adversarial diffusion distillation

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023. 6

  49. [49]

    Fast high-resolution image synthesis with latent adversarial diffusion distillation

    Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation,

  50. [50]

    URL https://arxiv.org/abs/2403.12015. 1, 6

  51. [51]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024. 7

  52. [52]

    Emu edit: Precise image editing via recognition and generation tasks

    Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. arXiv preprint arXiv:2311.10089, 2023. 3, 7

  53. [53]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 15

  54. [54]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. 4

  55. [55]

    Resolution-robust large mask inpainting with fourier convolutions

    Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2149–2159, 2022. 1

  56. [56]

    Computer vision: algorithms and applications

    Richard Szeliski. Computer vision: algorithms and applications. Springer Nature, 2022. 1

  57. [57]

    Omnigen: Unified image generation

    Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. arXiv preprint arXiv:2409.11340, 2024. 3

  58. [58]

    Paint by example: Exemplar-based image editing with diffusion models

    Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18381–18391, 2023. 1

  59. [59]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023. 3, 12

  60. [60]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022. URL https://arxiv.org/abs/2206.10789. 8

  61. [61]

    Magicbrush: A manually annotated dataset for instruction-guided image editing

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing, 2024. URL https://arxiv.org/abs/2306.10012. 7

  62. [62]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023. 1

  63. [63]

    In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer

    Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer. arXiv preprint arXiv:2504.20690, 2025. 3