Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Hierarchical Text-Conditional Image Generation with CLIP Latents
Canonical reference. 77% of citing Pith papers cite this work as background.
abstract
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
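The two-stage structure described in the abstract (a prior that maps a caption to an image embedding, then a decoder that maps the embedding to an image) can be illustrated with a toy sketch. This is not the paper's implementation: every function name, the embedding size, and the noise scales below are hypothetical, the text encoder is a seeded pseudo-random stand-in for CLIP, and simple Gaussian noise replaces the autoregressive or diffusion models the paper uses. The sketch shows only the data flow and why decoding one embedding repeatedly yields variations.

```python
import math
import random

EMBED_DIM = 8  # toy embedding size (assumption; real CLIP embeddings are far larger)


def _unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def toy_text_embedding(caption, dim=EMBED_DIM):
    """Hypothetical stand-in for a CLIP text encoder: a deterministic
    pseudo-random unit vector seeded by the caption string."""
    rng = random.Random(caption)
    return _unit([rng.gauss(0.0, 1.0) for _ in range(dim)])


def toy_prior(text_emb, rng, noise_scale=0.1):
    """Stage 1: sample an image embedding given the text embedding.
    Gaussian perturbation stands in for a learned prior."""
    return _unit([t + noise_scale * rng.gauss(0.0, 1.0) for t in text_emb])


def toy_decoder(image_emb, rng, size=4, detail_scale=0.05):
    """Stage 2: produce a size x size 'image' conditioned on the embedding.
    The added noise plays the role of details absent from the embedding."""
    dim = len(image_emb)
    return [
        [image_emb[(r * size + c) % dim] + detail_scale * rng.gauss(0.0, 1.0)
         for c in range(size)]
        for r in range(size)
    ]


def generate(caption, seed=0):
    """Caption -> text embedding -> image embedding -> image."""
    rng = random.Random(seed)
    text_emb = toy_text_embedding(caption)
    image_emb = toy_prior(text_emb, rng)
    image = toy_decoder(image_emb, rng)
    return text_emb, image_emb, image


def variations(image_emb, n, seed=0):
    """Decode the same image embedding n times; only the details change."""
    rng = random.Random(seed)
    return [toy_decoder(image_emb, rng) for _ in range(n)]
```

Calling `generate("a corgi playing a trumpet", seed=1)` with an arbitrary example caption returns the three intermediate objects, and passing the resulting image embedding to `variations` gives several toy images that share the embedding but differ in the sampled details.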
representative citing papers
Score entropy loss enables discrete diffusion models (SEDD) that cut perplexity 25-75% versus prior diffusion methods and outperform GPT-2 on language modeling while supporting infilling and compute-quality tradeoffs.
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
Normalizing flows are constructed by learning the velocity of a stochastic interpolant via a quadratic loss derived from its probability current, yielding an efficient ODE-based alternative to diffusion models.
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.
Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.
PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.
SD-MIA is a black-box membership inference attack that detects pre-training data in diffusion models via cross-modal perturbations on images and textual instructions.
PURE builds forget and retain bases from per-layer cross-attention activations along a short denoising trajectory and applies a single linear projector to cross-attention weights, yielding the best forget-retain trade-off on a ten-concept benchmark.
Mosaic is a framework for compositional multi-concept erasure in flow-based T2I models via spatial vector field blending without extra optimization, evaluated on the new CoME-Bench benchmark covering intra- and cross-category cases.
VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
GeoDiff-SAR II proposes a 3D-driven decoupled diffusion framework using GECM and ControlNet on a FLUX backbone for controllable SAR image generation across large viewpoint gaps.
ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.
PMF-CL derives Pareto-minimal-forgetting algorithms for linear/basis-function regression and quadratic-bounded losses like logistic regression, achieving static O(d²) memory for d-parameter models.
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
A multimodal diffusion model generates controllable alternative streetscapes from street-view imagery using visual metrics and text, shown on Chicago and Orlando data with gains in semantic consistency.
CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.
A multi-exposure video model predicts bracketed linear SDR sequences from single nonlinear SDR input, which a merging model combines into HDR video preserving shadow and highlight detail.
HIR-ALIGN augments limited target data for hyperspectral restoration by creating proxy clean images, synthesizing aligned HSIs with blur-robust diffusion and warp-based transfer, then finetuning models to lower target-domain risk.
ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
citing papers explorer
- Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
  HDP3 is a pocket-scale 3D diffusion policy with a Diffusion Mixer decoder that achieves state-of-the-art visuomotor control using two-step DDIM inference and under 1% of the parameters of prior 3D diffusion policies.
- Learning Interactive Real-World Simulators
  UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
- Prop-Chromeleon: Adaptive Haptic Props in Mixed Reality through Generative Artificial Intelligence
  A generative-AI pipeline dynamically generates and anchors virtual assets to match the shape of physical props, enabling adaptive passive haptics in MR that users rate higher in realism, immersion, and enjoyment than static baselines.
- Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding
  A meta-optimized in-context learning approach enables training-free cross-subject semantic visual decoding from fMRI by inferring individual neural encoding patterns via hierarchical inference on a few examples.
- DustNET: enabling machine learning and AI models of dusty plasmas
  DustNET is proposed as a shared dataset to train machine learning models that complement traditional physics equations for predictive modeling of dusty plasmas across laboratory and natural scales.