Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Flux. https://github.com/black-forest-labs/flux
Mixed citation behavior. Most common role is background (64%).
representative citing papers
CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.
DecQ uses detail-condensing queries on shallow and deep VFM features to improve both reconstruction PSNR and generative convergence/FID in RAEs without fine-tuning the encoder.
Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.
ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.
KVBench reveals major gaps in current T2I models for knowledge-intensive tasks, and KE-Check narrows the gap between open- and closed-source models by adding structured knowledge and enforcing constraints.
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
InstructMoLE replaces per-token routing with instruction-guided global routing for mixture-of-low-rank-experts in diffusion transformers and adds an output-space orthogonality loss to improve multi-conditional image generation.
DRFS is a new inversion-free editing technique for rectified flow models that models source-target velocity discrepancies and applies a time-dependent shift to improve fidelity and unify prior methods like DDS and FlowEdit.
STAL transfers spectral tail uplift cues via a frequency teacher to train a spatial detector for AI-generated images, discarding frequency modules at inference for strong cross-generator generalization.
SEGA adaptively scales RoPE attention components using spectral-energy guidance from the latent to improve structural coherence and fine details in high-resolution DiT synthesis.
FullFlow adds LoRA adapters and discrete text insertion to pretrained rectified-flow text-to-image models, achieving bidirectional generation with major gains in FID, CIDEr, VRAM, and throughput over Dual Diffusion baselines.
RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.
Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.
Latent Action Control learns unobserved action trajectories via variational alignment and GRPO to inject reasoning into flow-based image generation, yielding gains on compositional benchmarks.
Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.
RaPD enables resolution-agnostic image generation by diffusing in a semantics-enriched continuous Neural Image Field latent space using semantic guidance and a coordinate-queried attention renderer.
The CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merging to improve complex text-to-image generation over single-step or unverified multi-step baselines.
InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.
PRISM improves text image super-resolution by rectifying global priors with flow matching and modeling local structural uncertainty in a single diffusion pass, achieving state-of-the-art results at millisecond-scale inference.
citing papers explorer
- When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
In flow-matching models, policy entropy stays constant during RLHF because the noise schedule is fixed, yet perceptual diversity collapses under mode-seeking policy gradients. Perceptual entropy constraints are introduced to preserve diversity and improve quality.
- DanceGRPO: Unleashing GRPO on Visual Generation
DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.