hub Mixed citations

Flux.https://github.com/black-forest-labs/flux

Black Forest Labs · 2024

Mixed citation behavior. Most common role is background (64%).

49 Pith papers citing it

Background 64% of classified citations

browse 49 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 9 baseline 3 method 2

citation-polarity summary

background 9 baseline 3 use method 2

representative citing papers

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.

DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

DecQ uses detail-condensing queries on shallow and deep VFM features to improve both reconstruction PSNR and generative convergence/FID in RAEs without fine-tuning the encoder.

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.

ImageAttributionBench: How Far Are We from Generalizable Attribution?

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.

Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.

Knowledge Visualization: A Benchmark and Method for Knowledge-Intensive Text-to-Image Generation

cs.CV · 2026-04-24 · conditional · novelty 7.0

KVBench reveals major gaps in current T2I models for knowledge-intensive tasks, and KE-Check narrows the gap between open- and closed-source models by adding structured knowledge and enforcing constraints.

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation

cs.CV · 2025-12-25 · unverdicted · novelty 7.0

InstructMoLE replaces per-token routing with instruction-guided global routing for mixture-of-low-rank-experts in diffusion transformers and adds an output-space orthogonality loss to improve multi-conditional image generation.

Delta Rectified Flow Sampling for Text-to-Image Editing

cs.CV · 2025-09-01 · unverdicted · novelty 7.0

DRFS is a new inversion-free editing technique for rectified flow models that models source-target velocity discrepancies and applies a time-dependent shift to improve fidelity and unify prior methods like DDS and FlowEdit.

Spectral Tail Auxiliary Learning for AI-Generated Image Detection

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

STAL transfers spectral tail uplift cues via a frequency teacher to train a spatial detector for AI-generated images, discarding frequency modules at inference for strong cross-generator generalization.

SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

SEGA adaptively scales RoPE attention components using spectral-energy guidance from the latent to improve structural coherence and fine details in high-resolution DiT synthesis.

FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

FullFlow adds LoRA adapters and discrete text insertion to pretrained rectified-flow text-to-image models, achieving bidirectional generation with major gains in FID, CIDEr, VRAM, and throughput over Dual Diffusion baselines.

Improved Baselines with Representation Autoencoders

cs.CV · 2026-05-18 · conditional · novelty 6.0

RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.

Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.

Latent Action Control for Reasoning-Guided Unified Image Generation

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

Latent Action Control learns unobserved action trajectories via variational alignment and GRPO to inject reasoning into flow-based image generation, yielding gains on compositional benchmarks.

Registers Matter for Pixel-Space Diffusion Transformers

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.

RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

RaPD enables resolution-agnostic image generation by diffusing in a semantics-enriched continuous Neural Image Field latent space using semantic guidance and a coordinate-queried attention renderer.

Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

cs.CV · 2026-05-14 · conditional · novelty 6.0

InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.

Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.

PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

PRISM improves text image super-resolution by rectifying global priors with flow-matching and modeling local structural uncertainty in a single diffusion pass, achieving SOTA results at millisecond inference.

citing papers explorer

Showing 49 of 49 citing papers.

Flow-GRPO: Training Flow Matching Models via Online RL cs.CV · 2025-05-08 · unverdicted · none · ref 5
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion cs.LG · 2026-05-22 · unverdicted · none · ref 36
CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.
DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders cs.CV · 2026-05-21 · unverdicted · none · ref 2
DecQ uses detail-condensing queries on shallow and deep VFM features to improve both reconstruction PSNR and generative convergence/FID in RAEs without fine-tuning the encoder.
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning cs.CV · 2026-05-20 · unverdicted · none · ref 22 · 2 links
Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
ImageAttributionBench: How Far Are We from Generalizable Attribution? cs.CV · 2026-05-13 · unverdicted · none · ref 41
ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation cs.CV · 2026-05-12 · unverdicted · none · ref 12
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR cs.CV · 2026-05-11 · unverdicted · none · ref 25
LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.
Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models cs.CV · 2026-05-07 · unverdicted · none · ref 3
ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.
Knowledge Visualization: A Benchmark and Method for Knowledge-Intensive Text-to-Image Generation cs.CV · 2026-04-24 · conditional · none · ref 35
KVBench reveals major gaps in current T2I models for knowledge-intensive tasks, and KE-Check narrows the gap between open- and closed-source models by adding structured knowledge and enforcing constraints.
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories cs.CV · 2026-04-16 · unverdicted · none · ref 18
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation cs.CV · 2025-12-25 · unverdicted · none · ref 15
InstructMoLE replaces per-token routing with instruction-guided global routing for mixture-of-low-rank-experts in diffusion transformers and adds an output-space orthogonality loss to improve multi-conditional image generation.
Delta Rectified Flow Sampling for Text-to-Image Editing cs.CV · 2025-09-01 · unverdicted · none · ref 21
DRFS is a new inversion-free editing technique for rectified flow models that models source-target velocity discrepancies and applies a time-dependent shift to improve fidelity and unify prior methods like DDS and FlowEdit.
Spectral Tail Auxiliary Learning for AI-Generated Image Detection cs.CV · 2026-05-21 · unverdicted · none · ref 23
STAL transfers spectral tail uplift cues via a frequency teacher to train a spatial detector for AI-generated images, discarding frequency modules at inference for strong cross-generator generalization.
SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers cs.CV · 2026-05-21 · unverdicted · none · ref 1
SEGA adaptively scales RoPE attention components using spectral-energy guidance from the latent to improve structural coherence and fine details in high-resolution DiT synthesis.
FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation cs.CV · 2026-05-19 · unverdicted · none · ref 25
FullFlow adds LoRA adapters and discrete text insertion to pretrained rectified-flow text-to-image models, achieving bidirectional generation with major gains in FID, CIDEr, VRAM, and throughput over Dual Diffusion baselines.
Improved Baselines with Representation Autoencoders cs.CV · 2026-05-18 · conditional · none · ref 31
RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.
Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models cs.CV · 2026-05-17 · unverdicted · none · ref 18
Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.
Latent Action Control for Reasoning-Guided Unified Image Generation cs.CV · 2026-05-16 · unverdicted · none · ref 3
Latent Action Control learns unobserved action trajectories via variational alignment and GRPO to inject reasoning into flow-based image generation, yielding gains on compositional benchmarks.
Registers Matter for Pixel-Space Diffusion Transformers cs.CV · 2026-05-15 · unverdicted · none · ref 46
Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.
RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations cs.CV · 2026-05-15 · unverdicted · none · ref 40
RaPD enables resolution-agnostic image generation by diffusing in a semantics-enriched continuous Neural Image Field latent space using semantic guidance and a coordinate-queried attention renderer.
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning cs.CV · 2026-05-14 · unverdicted · none · ref 23
CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.
InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation cs.CV · 2026-05-14 · conditional · none · ref 18
InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation cs.CV · 2026-05-13 · unverdicted · none · ref 16
Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.
PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution cs.CV · 2026-05-13 · unverdicted · none · ref 28
PRISM improves text image super-resolution by rectifying global priors with flow-matching and modeling local structural uncertainty in a single diffusion pass, achieving SOTA results at millisecond inference.
G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models cs.CV · 2026-05-12 · conditional · none · ref 11 · 2 links
G²TR reduces visual tokens and prefill compute by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency, balanced selection, and merging, while preserving reasoning accuracy and editing quality.
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy cs.CV · 2026-05-12 · unverdicted · none · ref 33
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping cs.CV · 2026-05-11 · unverdicted · none · ref 3
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion cs.CV · 2026-05-08 · unverdicted · none · ref 45
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models cs.CV · 2026-05-07 · unverdicted · none · ref 16
DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling cs.LG · 2026-04-08 · unverdicted · none · ref 12
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images cs.CV · 2026-03-02 · unverdicted · none · ref 6
HiFi-Inpaint delivers state-of-the-art detail-preserving human-product images by adding Shared Enhancement Attention and Detail-Aware Loss to reference-based inpainting on a new 40K dataset.
RefTon: Reference person shot assist virtual Try-on cs.CV · 2025-11-02 · unverdicted · none · ref 50
RefTon is a flux-based virtual try-on method that uses unpaired reference images of the target garment on different people to guide texture and detail preservation in a streamlined person-to-person pipeline without body parsing or masks.
Adversarial Concept Distillation for One-Step Diffusion Personalization cs.CV · 2025-10-23 · unverdicted · none · ref 41
OPAD enables reliable high-quality personalization of one-step diffusion models via multi-step teacher distillation combined with adversarial alignment losses.
Towards Continual Expansion of Data Coverage: Automatic Text-guided Edge-case Synthesis cs.CV · 2025-09-30 · unverdicted · none · ref 16
Automated LLM-based prompt engineering for text-to-image edge-case synthesis improves object detection robustness on the FishEye8K benchmark over naive augmentation and manual prompts.
DanceGRPO: Unleashing GRPO on Visual Generation cs.CV · 2025-05-12 · unverdicted · none · ref 23
DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.
Efficient One-Step Diffusion Restoration Model with Compact Token Compression and Linear Attention cs.CV · 2026-05-22 · unverdicted · none · ref 18
SANA-SR uses 32x deep compression autoencoding and linear-attention DiT to deliver competitive real-world image super-resolution at 0.019s inference after pruning.
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset cs.CV · 2026-05-20 · unverdicted · none · ref 49
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
Stable Audio 3 cs.SD · 2026-05-18 · unverdicted · none · ref 80
Stable Audio 3 develops fast latent diffusion models for variable-length audio generation and editing via a semantic-acoustic autoencoder and adversarial post-training.
FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery cs.CV · 2026-05-14 · unverdicted · none · ref 32
FactorizedHMR recovers 3D human meshes from video by deterministically anchoring the torso-root then probabilistically completing distal articulations via flow-matching with geometry-aware supervision and a synthetic data pipeline.
InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model cs.CV · 2026-03-12 · unverdicted · none · ref 18
InSpatio-WorldFM is a frame-independent generative model that uses explicit 3D anchors and spatial memory to deliver real-time multi-view consistent spatial intelligence via a three-stage training pipeline from pretrained diffusion models.
Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning cs.CV · 2025-05-25 · unverdicted · none · ref 37
DiT-ST converts complete-text captions into split-text primitives via LLMs and injects them hierarchically across denoising stages to reduce semantic confusion in DiT-based text-to-image generation.
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation cs.GR · 2026-05-05 · unverdicted · none · ref 42 · 2 links
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings cs.CV · 2026-04-21 · unverdicted · none · ref 15
MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
Motif-Video 2B: Technical Report cs.CV · 2026-04-14 · unverdicted · none · ref 20 · 2 links
Motif-Video 2B reaches 83.76% on VBench, outperforming a 14B-parameter model with 7x fewer parameters and far less training data through shared cross-attention and a three-part backbone.
Show-o2: Improved Native Unified Multimodal Models cs.CV · 2025-06-18 · unverdicted · none · ref 54
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
Rethinking Cross-Layer Information Routing in Diffusion Transformers cs.CV · 2026-05-20 · unreviewed · ref 27
Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models cs.CV · 2026-05-19 · unreviewed · ref 43 · 2 links
Asymmetric Flow Models cs.CV · 2026-05-13 · unreviewed · ref 5
PermuQuant: Lowering Per-Group Quantization Error by Reordering Channels for Diffusion Models cs.CV · 2026-05-10 · unreviewed · ref 15

Flux.https://github.com/black-forest-labs/flux

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer