super hub Mixed citations

Flow Matching for Generative Modeling

Heli Ben-Hamu, Maximilian Nickel, Ricky TQ Chen, Yaron Lipman · 2022 · cs.LG · arXiv 2210.02747

Mixed citation behavior. Most common role is method (47%).

480 Pith papers citing it

Method 47% of classified citations

open full Pith review browse 480 citing papers more from Heli Ben-Hamu arXiv PDF

abstract

We introduce a new paradigm for generative modeling built on Continuous Normalizing Flows (CNFs), allowing us to train CNFs at unprecedented scale. Specifically, we present the notion of Flow Matching (FM), a simulation-free approach for training CNFs based on regressing vector fields of fixed conditional probability paths. Flow Matching is compatible with a general family of Gaussian probability paths for transforming between noise and data samples -- which subsumes existing diffusion paths as specific instances. Interestingly, we find that employing FM with diffusion paths results in a more robust and stable alternative for training diffusion models. Furthermore, Flow Matching opens the door to training CNFs with other, non-diffusion probability paths. An instance of particular interest is using Optimal Transport (OT) displacement interpolation to define the conditional probability paths. These paths are more efficient than diffusion paths, provide faster training and sampling, and result in better generalization. Training CNFs using Flow Matching on ImageNet leads to consistently better performance than alternative diffusion-based methods in terms of both likelihood and sample quality, and allows fast and reliable sample generation using off-the-shelf numerical ODE solvers.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 77 method 73 baseline 5

citation-polarity summary

use method 73 background 69 unclear 6 baseline 5 support 2

claims ledger

abstract We introduce a new paradigm for generative modeling built on Continuous Normalizing Flows (CNFs), allowing us to train CNFs at unprecedented scale. Specifically, we present the notion of Flow Matching (FM), a simulation-free approach for training CNFs based on regressing vector fields of fixed conditional probability paths. Flow Matching is compatible with a general family of Gaussian probability paths for transforming between noise and data samples -- which subsumes existing diffusion paths as specific instances. Interestingly, we find that employing FM with diffusion paths results in a more

authors

and Matt Le Heli Ben-Hamu Maximilian Nickel Ricky TQ Chen Yaron Lipman

co-cited works

representative citing papers

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

cs.CV · 2026-05-13 · unverdicted · novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

cs.CV · 2026-05-12 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

What Time Is It? How Data Geometry Makes Time Conditioning Optional for Flow Matching

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Data geometry makes time identifiable from noisy interpolants at rate O(1/sqrt(d-k)), rendering the time-blindness gap asymptotically negligible relative to coupling variance.

Generative Modeling with Flux Matching

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Flux Matching generalizes score-based generative modeling by using a weaker objective that admits infinitely many non-conservative vector fields with the data as stationary distribution, enabling new design choices beyond traditional score matching.

A-CODE: Fully Atomic Protein Co-Design with Unified Multimodal Diffusion

q-bio.QM · 2026-05-05 · unverdicted · novelty 8.0

A-CODE presents a fully atomic one-stage multimodal diffusion model for protein co-design that claims superior unconditional generation performance over prior one- and two-stage models plus a tenfold success-rate gain on hard binder-design tasks.

ReConText3D: Replay-based Continual Text-to-3D Generation

cs.CV · 2026-04-15 · conditional · novelty 8.0

ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.

Query Lower Bounds for Diffusion Sampling

cs.LG · 2026-04-12 · unverdicted · novelty 8.0

Diffusion sampling from d-dimensional distributions requires at least ~sqrt(d) adaptive score queries when score estimates have polynomial accuracy.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Generative models on phase space

hep-ph · 2026-04-02 · unverdicted · novelty 8.0

Generative diffusion and flow models are constructed to remain exactly on the Lorentz-invariant massless N-particle phase space manifold during sampling for particle physics applications.

FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

cs.CV · 2026-03-30 · unverdicted · novelty 8.0

FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indistinguishable from normal ones.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Building Normalizing Flows with Stochastic Interpolants

cs.LG · 2022-09-30 · conditional · novelty 8.0 · 2 refs

Normalizing flows are constructed by learning the velocity of a stochastic interpolant via a quadratic loss derived from its probability current, yielding an efficient ODE-based alternative to diffusion models.

Geo-Align: Video Generation Alignment via Metric Geometry Reward

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

GenRecon lifts object-level generative priors to scene-scale reconstruction by chunking scenes and using projection-based conditioning on multi-view features, claiming 16% better results than prior methods.

VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.

VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.

Increasing the Precision of Surrogate Models for Weak Lensing Mass Maps with Flow Matching

astro-ph.CO · 2026-05-22 · unverdicted · novelty 7.0

A flow matching generative model produces weak lensing mass maps with fidelity improved to below 1% and 5% on basic and higher-order statistics relative to GAN benchmarks.

Let EEG Models Learn EEG

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

JET is a conditional flow matching framework that generates EEG as continuous raw sequences with added constraints for spectral and temporal properties, achieving over 40% lower TS-FID than prior discrete denoising methods on three benchmarks.

Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.

Beyond the Bellman Recursion: A Pontryagin-Guided Framework for Non-Exponential Discounting

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

PG-DPO is a new variational framework that replaces Bellman recursion with a Pontryagin-guided adjoint-MC projection for RL under non-exponential discounting and shows gains on hyperbolic and survival benchmarks.

CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

CAdam reinterprets densification in generative 3DGS as signal verification via gradient-moment interference, quantile context, and SNR gating to achieve large reductions in primitive count with comparable quality.

MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling

cs.CV · 2026-05-19 · conditional · novelty 7.0

MetaEarth-MM unifies multi-modal remote sensing image generation and any-to-any translation across five modalities via scene-centered joint modeling on the new EarthMM dataset.

Probability-Conserving Flow Guidance

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

AdaMaG is a guidance rule for generative models derived from decomposing continuity-equation effects into divergence and score-parallel terms, with a proof that divergence diverges near the manifold and a time-dependent bound that improves realism at no extra cost.

Inference-Time Scaling in Diffusion Models through Iterative Partial Refinement

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

IPR improves valid solution rates on MNIST Sudoku from 55.8% to 75.0% by iteratively refining partial regions in sequential diffusion models without external verifiers or reward models.

citing papers explorer

Showing 50 of 73 citing papers after filters.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking cs.CV · 2026-05-12 · unverdicted · none · ref 48 · internal anchor
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
What Time Is It? How Data Geometry Makes Time Conditioning Optional for Flow Matching cs.LG · 2026-05-08 · unverdicted · none · ref 18 · internal anchor
Data geometry makes time identifiable from noisy interpolants at rate O(1/sqrt(d-k)), rendering the time-blindness gap asymptotically negligible relative to coupling variance.
OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models cs.CV · 2026-04-05 · unverdicted · none · ref 19 · internal anchor
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
Morphologically Equivariant Flow Matching for Bimanual Mobile Manipulation cs.RO · 2026-05-12 · conditional · none · ref 9 · internal anchor
A morphologically equivariant flow matching policy for bimanual robots enforces reflective symmetry to improve sample efficiency and enable zero-shot generalization to mirrored task configurations.
SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation cs.RO · 2026-05-10 · unverdicted · none · ref 15 · internal anchor
SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.
Generative Actor-Critic with Soft Bridge Policies cs.LG · 2026-05-09 · unverdicted · none · ref 11 · internal anchor
SoftGAC defines a stochastic bridge from base to action latent that converts the MaxEnt objective into a tractable relative-entropy term reducible to control energy, achieving competitive returns with one-pass sampling.
From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation cs.CV · 2026-05-09 · unverdicted · none · ref 51 · internal anchor
A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models cs.CV · 2026-05-09 · unverdicted · none · ref 20 · 2 links · internal anchor
ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across rigid, kinematic, deformable, and particle dynamics.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy cs.CV · 2026-05-08 · conditional · none · ref 31 · 3 links · internal anchor
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation cs.RO · 2026-05-07 · unverdicted · none · ref 44 · internal anchor
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Arbitrarily Conditioned Hierarchical Flows for Spatiotemporal Events cs.LG · 2026-05-02 · unverdicted · none · ref 1 · internal anchor
ARCH is a hierarchical flow-based generative model that enables tractable conditional intensity computation and arbitrary conditioning for spatiotemporal event distributions.
Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 36 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models cs.CV · 2026-04-19 · unverdicted · none · ref 48 · 2 links · internal anchor
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches cs.CV · 2026-04-15 · unverdicted · none · ref 27 · internal anchor
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 43 · internal anchor
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching cs.RO · 2026-04-13 · unverdicted · none · ref 12 · internal anchor
ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on locomotion and manipulation benchmarks.
Novel View Synthesis as Video Completion cs.CV · 2026-04-09 · unverdicted · none · ref 23 · internal anchor
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
MoRight: Motion Control Done Right cs.CV · 2026-04-08 · unverdicted · none · ref 48 · internal anchor
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
Action Images: End-to-End Policy Learning via Multiview Video Generation cs.CV · 2026-04-07 · unverdicted · none · ref 39 · internal anchor
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control cs.CV · 2026-04-07 · unverdicted · none · ref 25 · internal anchor
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos cs.CV · 2026-01-15 · unverdicted · none · ref 50 · internal anchor
CoMoVi co-generates 3D human motions and 2D videos synchronously in a single diffusion denoising loop using 3D-to-2D projection and dual-branch diffusion with 3D-2D cross attentions.
DiffusionNFT: Online Diffusion Reinforcement with Forward Process cs.LG · 2025-09-19 · unverdicted · none · ref 13 · internal anchor
DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-process methods.
FD-Bench: A Modular and Fair Benchmark for Data-driven Fluid Simulation physics.flu-dyn · 2025-05-25 · unverdicted · none · ref 63 · internal anchor
FD-Bench supplies the first modular, reproducible benchmark and leaderboard for comparing neural PDE solvers on fluid dynamics tasks with direct numerical solver baselines.
SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning cs.CV · 2026-05-12 · unverdicted · none · ref 25 · internal anchor
SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments cs.CV · 2026-05-12 · unverdicted · none · ref 28 · internal anchor
Introduces GAP datasets of synthetic heavily eroded irregular puzzle pieces modeled on archaeological fragments and PuzzleFlow, a ViT and flow-matching framework that outperforms prior jigsaw solvers on these datasets.
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion cs.CL · 2026-05-12 · unverdicted · none · ref 15 · internal anchor
BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation q-bio.BM · 2026-05-11 · unverdicted · none · ref 25 · internal anchor
Yeti is a compact tokenizer for protein structures that delivers strong codebook use, token diversity, and reconstruction while enabling from-scratch multimodal generation of plausible sequences and structures with 10x fewer parameters than ESM3.
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation cs.CV · 2026-05-09 · unverdicted · none · ref 18 · internal anchor
Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.
Slowly Annealed Langevin Dynamics: Theory and Applications to Training-Free Guided Generation cs.LG · 2026-05-08 · unverdicted · none · ref 22 · internal anchor
Slowly Annealed Langevin Dynamics provides non-asymptotic KL-based convergence guarantees for tracking moving targets and enables training-free guided generation via a velocity-aware correction that accounts for pretrained marginals.
Toward Better Geometric Representations for Molecule Generative Models cs.LG · 2026-05-08 · unverdicted · none · ref 13 · internal anchor
LENSEs improves representation-conditioned molecule generation by jointly training a multi-level representation head, perceptual loss, and REPA alignment on pretrained encoders, yielding 97.28% validity and 98.51% stability on GEOM-DRUG.
Excluding the Target Domain Improves Extrapolation: Deconfounded Hierarchical Physics Constraints cs.LG · 2026-05-08 · unverdicted · none · ref 18 · internal anchor
Deconfounded Hierarchical Gate with counterfactual estimation and hierarchical constraints achieves 46% better RMSE on out-of-distribution battery temperature extrapolation, with excluding target data from pretraining outperforming inclusion.
BRICKS: Compositional Neural Markov Kernels for Zero-Shot Radiation-Matter Simulation cs.LG · 2026-05-07 · unverdicted · none · ref 34 · internal anchor
BRICKS creates compositional neural Markov kernels via hybrid transformers and Riemannian Flow Matching on product manifolds to enable zero-shot simulation of radiation-matter interactions across arbitrary material distributions.
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models cs.CV · 2026-05-03 · unverdicted · none · ref 25 · internal anchor
M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and long-term consistency in multi-modal video generation.
Fisher Decorator: Refining Flow Policy via a Local Transport Map cs.LG · 2026-04-20 · unverdicted · none · ref 27 · internal anchor
Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation cs.RO · 2026-04-20 · unverdicted · none · ref 25 · internal anchor
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
Lyra 2.0: Explorable Generative 3D Worlds cs.CV · 2026-04-14 · unverdicted · none · ref 61 · internal anchor
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation cs.CV · 2026-04-13 · unverdicted · none · ref 40 · internal anchor
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks cs.CV · 2026-04-13 · unverdicted · none · ref 14 · internal anchor
CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving cs.RO · 2026-04-12 · unverdicted · none · ref 49 · internal anchor
The primary OL-CL gap in end-to-end autonomous driving arises from objective mismatch creating structural inability to model reactive behaviors, which a test-time adaptation method can mitigate.
PhyMix: Towards Physically Consistent Single-Image 3D Indoor Scene Generation with Implicit--Explicit Optimization cs.CV · 2026-04-11 · unverdicted · none · ref 19 · internal anchor
PhyMix unifies a new multi-aspect physics evaluator with implicit policy optimization and explicit test-time correction to produce single-image 3D indoor scenes that are both visually faithful and physically plausible.
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis cs.RO · 2026-04-10 · unverdicted · none · ref 40 · internal anchor
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks cs.CV · 2026-04-09 · unverdicted · none · ref 19 · internal anchor
ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
LPM 1.0: Video-based Character Performance Model cs.CV · 2026-04-09 · unverdicted · none · ref 61 · internal anchor
LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.
Monte Carlo Event Generation with Continuous Normalizing Flows hep-ph · 2026-04-03 · conditional · none · ref 34 · internal anchor
Continuous normalizing flows improve unweighting efficiency in Monte Carlo event generation for high-jet-multiplicity collider processes by factors up to 184, with wall-time gains of about ten when combined with coupling-layer flows.
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model cs.RO · 2026-04-03 · conditional · none · ref 22 · internal anchor
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
World Action Models are Zero-shot Policies cs.RO · 2026-02-17 · unverdicted · none · ref 66 · internal anchor
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion cs.CV · 2026-02-08 · unverdicted · none · ref 61 · internal anchor
Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
$\pi^{*}_{0.6}$: a VLA That Learns From Experience cs.LG · 2025-11-18 · unverdicted · none · ref 80 · internal anchor
RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.
Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 56 · internal anchor
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
DanceGRPO: Unleashing GRPO on Visual Generation cs.CV · 2025-05-12 · unverdicted · none · ref 5 · internal anchor
DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.

Flow Matching for Generative Modeling

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer