{"total":58,"items":[{"citing_arxiv_id":"2607.01693","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Mathematical Introduction to Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-07-02T04:37:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":0.0,"formal_verification":"none","one_line_summary":"An educational exposition that layers core definitions, simplified estimates, and research-level theorems on diffusion sampling for probability-background graduate students.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31576","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Introduction to Stochastic Differential Equations for Generative Machine Learning: A Variational Perspective","primary_cat":"cs.LG","submitted_at":"2026-06-30T12:34:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":1.0,"formal_verification":"none","one_line_summary":"An expository tutorial deriving the ELBO for SDE-based generative models and presenting diffusion, score, and flow matching as variational parameterizations illustrated on a 1D example.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31340","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scenario-conditioned flow matching for probabilistic generation of three-component ground-motion waveforms","primary_cat":"physics.geo-ph","submitted_at":"2026-06-30T08:41:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"WaveFlowGMM generates scenario-conditioned three-component ground-motion waveforms by using symbolic learning for PGA amplitude and AlphaFlow for normalized wavelet-packet waveforms that are later 
rescaled.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30376","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FlowAWR: Online Adaptive Flow Reinforcement via Advantage-Weighted Rectification","primary_cat":"cs.LG","submitted_at":"2026-06-29T14:37:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FlowAWR derives an advantage-weighted rectification for optimal velocity fields in flow models, claiming 2-5x faster convergence than DiffusionNFT on SD3.5-Medium.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30230","ref_index":126,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Distributionally Robust Framework for Learned Reconstructions in Inverse Problems","primary_cat":"math.OC","submitted_at":"2026-06-29T12:43:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces structured DRO for learned inverse problem reconstructions with ambiguity sets aligned to the forward operator, yielding explicit dual representations and a worst-case bound that induces Tikhonov regularization on the operator Lipschitz constant.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06309","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling","primary_cat":"cs.CV","submitted_at":"2026-06-04T15:49:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RhymeFlow is a training-free acceleration framework that decouples denoising trajectories across video 
frames by dense processing of semantic keyframes and asynchronous skipping for non-keyframes, augmented by a latent trajectory projection module to maintain consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05327","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multimarginal flow matching with optimal transport potentials","primary_cat":"cs.LG","submitted_at":"2026-06-03T18:11:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OTP-FM extends conditional flow matching by incorporating dynamic optimal transport potentials to enable efficient multimarginal transport learning with intermediate observed marginals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05254","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Flash-WAM: Modality-Aware Distillation for World Action Models","primary_cat":"cs.LG","submitted_at":"2026-06-03T15:29:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Flash-WAM introduces modality-specific consistency parametrizations to distill joint video-action diffusion models to single-step inference, delivering 23x speedup with preserved benchmark performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04165","ref_index":67,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CaloTrilogy: Toward a Breakthrough in One-Step, End-to-End, Physics-Guided Shower Generation for Modern 
Calorimeters","primary_cat":"hep-ex","submitted_at":"2026-06-02T19:27:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Presents CaloTrilogy, a unified one-step generative model for high-granularity calorimeter showers that combines velocity field integration, learned priors, and physics losses to match SOTA quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00133","ref_index":189,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"World Models: A Comprehensive Survey of Architectures, Methodologies, Reasoning Paradigms, and Applications","primary_cat":"cs.LG","submitted_at":"2026-05-28T21:23:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper delivers a multi-axis taxonomy for world models that maps architectures, training families, reasoning strategies, and domains from early cognitive foundations through systems such as Dreamer, MuZero, and Sora while noting evaluation gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24870","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trajectory-Consistent Calibration for Cache-Accelerated Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-24T05:00:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TCC calibrates cached representations in diffusion sampling via an offline iterative procedure that accounts for trajectory shifts, improving FID from 29.83 to 27.35 on PixArt-alpha while preserving reuse 
policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21573","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Lens is a 3.8B-parameter text-to-image model that reaches competitive or superior performance to >6B-parameter systems using 19.3% of the training compute of Z-Image through a densely captioned 800M dataset, multi-resolution batching, semantic VAE, strong language encoder, RL fine-tuning, and 4-step","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21489","ref_index":77,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Variance Reduction for Expectations with Diffusion Teachers","primary_cat":"cs.LG","submitted_at":"2026-05-20T17:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CARV amortizes upstream diffusion teacher costs over noise resamples with timestep importance sampling and stratified-inverse-CDF sampling, delivering 2-3x effective compute gains in text-to-3D experiments and order-of-magnitude variance cuts in single-step distillation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21466","ref_index":69,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StreamEdit: Training-Free Video Editing via Few-Step Streaming Video 
Generation","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:52:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StreamEdit enables high-quality training-free video editing by adapting streaming video generation models with dual-branch fast sampling, self-attention bridge, cross-attention grounding, source-oriented guidance, and visual prompting, outperforming prior methods in few-step regimes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20780","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment","primary_cat":"cs.LG","submitted_at":"2026-05-20T06:22:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"REPA-P aligns intermediate representations in diffusion models with physical states using first-principles PDE residuals to accelerate convergence and boost out-of-distribution robustness on PDE tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17899","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DCFold: Efficient Protein Structure Generation with Single Forward Pass","primary_cat":"cs.LG","submitted_at":"2026-05-18T06:05:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DCFold achieves AlphaFold3-level protein structure prediction accuracy in a single forward pass using Dual Consistency training and a Temporal Geodesic Matching scheduler, delivering 15x inference 
acceleration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17019","ref_index":59,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StreamingEffect: Real-Time Human-Centric Video Effect Generation","primary_cat":"cs.CV","submitted_at":"2026-05-16T14:45:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18878","ref_index":223,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis","primary_cat":"eess.SP","submitted_at":"2026-05-16T02:49:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15592","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Efficient Image Synthesis with Sphere Latent Encoder","primary_cat":"cs.CV","submitted_at":"2026-05-15T04:03:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Decouples Sphere Encoder into fixed pretrained encoder and spherical latent denoiser, yielding higher quality and faster inference than the joint original on Animal-Faces, 
Oxford-Flowers and ImageNet-1K.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15055","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-05-14T16:49:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DiffusionOPD applies online policy distillation from per-task teachers to a unified diffusion student, with a derived closed-form per-step KL objective that unifies SDE and ODE sampling via mean matching.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14876","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-14T14:22:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11347","ref_index":27,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Gradient-Free Noise Optimization for Reward Alignment in Generative Models","primary_cat":"cs.LG","submitted_at":"2026-05-12T00:05:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ZeNO frames noise optimization as a path-integral control problem solvable from zeroth-order reward evaluations, connecting 
to implicit Langevin dynamics for reward-tilted distributions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09291","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models","primary_cat":"cs.LG","submitted_at":"2026-05-10T03:36:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07327","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Teacher-Feature Drifting: One-Step Diffusion Distillation with Pretrained Diffusion Representations","primary_cat":"cs.CV","submitted_at":"2026-05-08T06:33:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A simplified one-step diffusion distillation uses pretrained teacher features directly for drifting loss plus a mode coverage term, achieving FID 1.58 on ImageNet-64 and 18.4 on SDXL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06916","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Tyche: One Step Flow for Efficient Probabilistic Weather Forecasting","primary_cat":"cs.LG","submitted_at":"2026-05-07T20:25:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tyche achieves competitive probabilistic weather forecasting skill and calibration using a single-step flow model with 
JVP-regularized training and rollout finetuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06829","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Unified Measure-Theoretic View of Diffusion, Score-Based, and Flow Matching Generative Models","primary_cat":"cs.LG","submitted_at":"2026-05-07T18:32:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Diffusion, score-based, and flow matching models are unified as instances of learning time-dependent vector fields inducing marginal distributions governed by continuity and Fokker-Planck equations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"valid whenever ρ_t is positive and sufficiently smooth. Substituting (36) into the Fokker-Planck equation gives ∂_t ρ_t = −∇·(ρ_t f) + (1/2) g(t)^2 ∇·(ρ_t ∇ log ρ_t). (37) Factoring out the divergence yields ∂_t ρ_t = −∇·(ρ_t f − (1/2) g(t)^2 ρ_t ∇ log ρ_t). (38) Comparing (38) with the continuity equation (35), one is led to define the deterministic velocity field v_PF(x, t) = f(x, t) − (1/2) g(t)^2 ∇_x log ρ_t(x). (39) B.4 Probability-flow ODE. The corresponding deterministic dynamics are dX_t/dt = f(X_t, t) − (1/2) g(t)^2 ∇_x log ρ_t(X_t). (40) By construction, the continuity equation associated with (40) is exactly (38), which is the same density evolution equation satisfied by the forward SDE. 
Therefore the ODE (40) and the SDE (31) share the same one-time marginals."},{"citing_arxiv_id":"2605.05689","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GCCM: Enhancing Generative Graph Prediction via Contrastive Consistency Model","primary_cat":"cs.AI","submitted_at":"2026-05-07T05:29:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GCCM prevents shortcut collapse in consistency models for graph prediction by using contrastive negative pairs and input feature perturbation, leading to better performance than deterministic baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[7] Wei Huang, Hanchen Wang, Dong Wen, Shaozhen Ma, Wenjie Zhang, and Xuemin Lin. Towards unsupervised training of matching-based graph edit distance solver via preference-aware gan, 2025. URL https://arxiv.org/abs/2506.01977. [8] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models, 2023. URL https://arxiv.org/abs/2303.01469. [9] Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=WNzy9bRDvG. [10] Yang Li, Jiale Ma, Yebin Yang, Qitian Wu, Hongyuan Zha, and Junchi Yan. 
Generative modeling reinvents supervised learning: Label repurposing with predictive consistency learning."},{"citing_arxiv_id":"2605.04569","ref_index":192,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LIVEditor-14B: Lightning Unified Video Editing via In-Context Sparse Attention","primary_cat":"cs.CV","submitted_at":"2026-05-06T07:15:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LIVEditor-14B applies a new sparse attention method (ISA) that prunes context and uses query-sharpness routing to cut attention latency ~60% with no loss in editing quality on standard benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02464","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ExpoCM: Exposure-Aware One-Step Generative Single-Image HDR Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-05-04T11:06:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ExpoCM enables fast one-step single-image HDR reconstruction via exposure-dependent perturbations and region-conditioned consistency trajectories derived from a probability flow ODE.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00329","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation","primary_cat":"cs.SD","submitted_at":"2026-05-01T01:13:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on 
AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with competitive quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"AudioTurbo ≈2000 1.1B 5 22.18 - 1.30 8.88 - - - - AudioTurbo ≈2000 1.1B 10 20.65 - 1.29 9.40 - - - - AUDIODEAR w/o Dist. 1700 191M 1 22.09 3.82 1.22 8.07 0.298 - - 2.61 AUDIODEAR 1700 191M 1 18.67 2.79 1.06 9.66 0.334 4.27±0.04 3.27±0.06 2.61 energy-scoring head are provided in Appendix E. During training, we apply a masking rate randomly sampled from the range [70,100) to the audio latents, enabling masked generative modeling with the energy-distance objective. For representation distillation, we adopt the transformer backbone of the diffusion-based state-of-the-art model IMPACT (Huang et al., 2025) as the teacher, and integrate the distillation loss with the energy-distance objective using a distillation weight λ = 1000, as defined in Equation 5. Unless otherwise specified, we train with a batch size of 2048 and a learning rate of 1e−3. At inference time, we follow IMPACT by setting the number of decoding iterations to 64. Following related work (Ma et al., 2025), we apply classifier-free guidance during inference, with CFG scale set to 4.0. Ablation studies and implementation details on CFG can be found in Appendix F. 4.3. Evaluation We evaluate our proposed TTA generation framework using both objective and subjective metrics. For objective assessment, we report Fréchet distance (FD; Heusel et al. 2017), Fréchet audio distance (FAD; Kilgour et al. 2018), Kullback-Leibler divergence (KL), and inception score (IS; Salimans et al. 2016) following the AudioLDM evaluation protocol 2, and CLAP similarity (Wu et al., 2023) using the same pre-trained CLAP model employed by IMPACT. The CLAP model used for training 3 is different from the one used for evaluation4 to avoid taking advantage of training and evaluating with the same model. 
Subjective evaluation is conducted on 90 generated audio samples conditioned on the AudioCaps evaluation set prompts, using the user interface and rating criteria defined in AudioBox (Vyas et al., 2023). Each sample receiv"},{"citing_arxiv_id":"2604.27147","ref_index":24,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance","primary_cat":"cs.LG","submitted_at":"2026-04-29T19:56:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"FMRG reformulates guidance as deterministic optimal control, deriving a single-trajectory method using the flow map that matches or exceeds baselines on reward-guided generation and inverse problems with 3 NFEs at text-to-image scale.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"(pages 2, 3, 10, and 21) [22] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency Models, May 2023. arXiv:2303.01469 [cs, stat]. (pages 2, 4, and 10) [23] Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, and Kaiming He. Mean Flows for One-step Generative Modeling, May 2025. arXiv:2505.13447 [cs]. (pages 2, 4, 10, and 21) [24] Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion, March 2024. arXiv:2310.02279 [cs, stat]. 
(pages 2, 4, and 10) [25] Peter Holderrieth, Uriel Singer, Tommi Jaakkola, Ricky T."},{"citing_arxiv_id":"2604.26244","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution","primary_cat":"cs.CV","submitted_at":"2026-04-29T02:58:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MetaSR adaptively orchestrates metadata in a DiT-based generative SR model to deliver up to 1 dB PSNR gains and 50% bitrate savings across diverse content and degradations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Fig. 2: Overview of MetaSR. Left: schematic of how metadata information is projected into the denoising process through native DiT modules. Right: representative qualitative image SR examples under Canny/depth guidance. Consistency Models learn mappings that enable one-step generation while allowing multi-step refinement [27]. Latent Consistency Models extend consistency-style distillation to latent diffusion backbones [17], and Adversarial Diffusion Distillation combines score distillation with adversarial objectives to improve quality at 1-4 steps [23]. 
In MetaSR's draft framing, such distillation methods are enabling techniques for deployment: they make it feasible to run a powerful"},{"citing_arxiv_id":"2604.24447","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment","primary_cat":"cs.RO","submitted_at":"2026-04-27T13:12:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with marginal task degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20130","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Pairing Regularization for Mitigating Many-to-One Collapse in GANs","primary_cat":"cs.LG","submitted_at":"2026-04-22T02:57:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pairing regularization mitigates intra-mode collapse in GANs by penalizing redundant latent-to-sample mappings, improving recall under collapse-prone conditions or precision under stabilized training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17706","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL","primary_cat":"cs.RO","submitted_at":"2026-04-20T01:36:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"OmniVLA-RL uses a mix-of-transformers architecture and 
flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15948","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance","primary_cat":"cs.CV","submitted_at":"2026-04-17T11:10:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CoEdit is a zero-shot coopetitive framework for text-guided image editing that uses dual-entropy attention manipulation and entropic latent refinement to improve editing harmony and structural preservation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10857","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Query Lower Bounds for Diffusion Sampling","primary_cat":"cs.LG","submitted_at":"2026-04-12T23:47:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Diffusion sampling from d-dimensional distributions requires at least ~sqrt(d) adaptive score queries when score estimates have polynomial accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09168","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ELT: Elastic Looped Transformers for Visual Generation","primary_cat":"cs.CV","submitted_at":"2026-04-10T09:53:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter 
reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"lightweight Stable Diffusion variant, and refines the step distillation process of Latent Consistency Model. MaGNeTS [24] trains a family of nested transformers [41, 42] with schedules of model sizes over the generation process, without increasing the parameter count. Elastic Visual Generation: The paradigm of Any-Time or elastic generation focuses on decoupling model's parameter count from its computational depth. E-DiT [70] introduces adaptive block skipping and MLP width reduction, allowing a single model to traverse varying computational budgets without retraining. In the context of visual reasoning, LoopViT [63] uses a weight-tied recursive architecture, employing a parameter-free dynamic exit mechanism to halt inference based on uncertainty of"},{"citing_arxiv_id":"2604.08837","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Discrete Meanflow Training Curriculum","primary_cat":"cs.LG","submitted_at":"2026-04-10T00:25:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04491","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Isokinetic Flow Matching for Pathwise Straightening of Generative 
Flows","primary_cat":"cs.LG","submitted_at":"2026-04-06T07:32:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Isokinetic Flow Matching adds a lightweight regularization term to flow matching that penalizes acceleration along paths via self-guided finite differences, yielding straighter trajectories and large gains in few-step sampling quality on CIFAR-10.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03225","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VOSR: A Vision-Only Generative Model for Image Super-Resolution","primary_cat":"cs.CV","submitted_at":"2026-04-03T17:50:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a restoration-oriented sampling strategy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025. 2, 4 [31] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. 1, 3 [32] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023. 3 [33] Lingchen Sun, Rongyuan Wu, Zhengqiang Zhang, Hongwei Yong, and Lei Zhang. Improving the stability of diffusion models for content consistent super-resolution. arXiv preprint arXiv:2401.00877, 2023. 
1, 3 [34] Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu,"},{"citing_arxiv_id":"2603.07514","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Unified View of Score-Based and Drifting Models","primary_cat":"cs.LG","submitted_at":"2026-03-08T07:41:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Drifting with Gaussian kernels exactly matches score-matching on smoothed distributions via Tweedie's formula, while Laplace kernels approximate this closely in high dimensions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.13357","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AdaCorrection: Adaptive Offset Cache Correction for Accurate Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-02-13T08:11:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AdaCorrection adaptively corrects offset caches in DiT inference via on-the-fly spatio-temporal validity checks to maintain near-original FID with moderate acceleration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02340","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-02-04T13:04:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality 
loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.20540","ref_index":65,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Advancing Open-source World Models","primary_cat":"cs.CV","submitted_at":"2026-01-28T12:37:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.23980","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution","primary_cat":"cs.CV","submitted_at":"2025-09-28T17:08:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OASIS reduces redundancy in diffusion models for real-world video super-resolution via attention specialization routing and progressive training, delivering state-of-the-art quality with 6.2x faster inference than prior one-step baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.16344","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Diff-ANO: Towards Fast High-Resolution Ultrasound Computed Tomography via Conditional Consistency Models and Adjoint Neural Operators","primary_cat":"math.NA","submitted_at":"2025-07-22T08:24:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Diff-ANO uses conditional consistency models and adjoint neural operator surrogates to enable fast, high-quality 
USCT reconstructions under sparse and partial views by replacing slow PDE solvers and enabling few-step sampling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.22020","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2025-03-27T22:23:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.00200","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unified Video Action Model","primary_cat":"cs.RO","submitted_at":"2025-02-28T21:38:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without performance loss versus task-specific methods.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"implementation for all simulation tasks. For real-world tasks, we used an improved Diffusion Policy [11], which is optimized for UMI data. It leverages a CLIP-pretrained [33] ViT-B/16 [1] vision encoder, significantly improving visual understanding. We refer to it as [DP-UMI]. • OpenVLA [25] is a state-of-the-art vision-language-action (VLA) built on 7B Llama 2 [41] for multi-task setting. 
It is trained on a diverse dataset encompassing a wide range of robots, tasks, and environments. We fine-tune OpenVLA on each task to optimize its performance. • π0 [2] is an open-source VLA model designed for general-purpose robot control. It employs a flow matching based architecture to generate continuous action sequences."},{"citing_arxiv_id":"2501.09732","ref_index":73,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps","primary_cat":"cs.CV","submitted_at":"2025-01-16T18:30:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Diffusion models improve generation quality via inference-time search over noise candidates guided by verifiers and algorithms, yielding gains beyond denoising step scaling on class- and text-conditioned benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}