{"total":28,"items":[{"citing_arxiv_id":"2606.30705","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Why Do Few-Step Text Latents Fail When Image Latents Work? Non-Commitment at Sharp Categorical Readouts","primary_cat":"cs.LG","submitted_at":"2026-06-29T14:20:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Few-step deterministic maps on continuous text latents fail because they cannot resolve discrete branch choices before sharp categorical readouts, with failure governed by decoder sharpness rather than transport accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29215","ref_index":42,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multi-Block Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-06-28T05:53:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MBD-LMs raise average tokens per forward pass from 3.47 to 6.19 (and to 9.34 with DMax) via multi-block teacher forcing and optimized parallel decoding while holding or slightly improving accuracy on math and code tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29150","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Flow Reasoning Models: Scaling Reasoning Through Iterative Self-Refinement","primary_cat":"cs.AI","submitted_at":"2026-06-28T02:10:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Flow models reach 99.2% Sudoku accuracy in 7 passes and 96.1% on out-of-distribution Sudoku-Extreme by selecting dynamically stable candidates and training with self-conditioning plus DPO to avoid failed outputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27732","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bifocal Diffusion Language Models: Asymmetric Bidirectional Context for Parallel Generation","primary_cat":"cs.IR","submitted_at":"2026-06-26T05:26:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"R2LM combines causal attention with a reverse Mamba SSM sidecar to supply right-side context in dLLMs, claiming 2.4x-12.9x throughput gains over bidirectional dLLMs and 1.9x-2.9x over AR baselines while matching or exceeding quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00295","ref_index":83,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adaptive Order Policies for Masked Diffusion","primary_cat":"cs.LG","submitted_at":"2026-05-29T19:26:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A policy network learns to choose unmasking order in masked diffusion by reweighting the loss, outperforming random and heuristic baselines on ordering-sensitive tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31215","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fixed-Point Masked Generative Modeling","primary_cat":"cs.LG","submitted_at":"2026-05-29T12:19:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FP-MGMs with consistency loss and three-state reuse (CoFRe) reduce parameters by up to 38.8% and improve low-budget perplexity and FID versus standard masked generative models on text and images.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25820","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-25T13:16:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VRCD prioritizes visually complementary positions during parallel decoding in dMLLMs by measuring attention overlap with the new Visual Redundancy Index, yielding accuracy gains over confidence-based baselines on M^3CoT and MMBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00091","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-24T16:14:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DLLM-JEPA pairs JEPA with masked diffusion LMs to enable single-pass self-supervised fine-tuning that improves task accuracy, lowers held-out loss, and preserves base-model performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22967","ref_index":3,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learned Relay Representations for Forward-Thinking Discrete Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-05-21T18:53:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Learned Relay Representations add a differentiable per-token channel to masked diffusion models so they can propagate latent information across iterative denoising steps, yielding better coding performance and up to 32% lower latency on Fast-dLLM v2 than standard supervised finetuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20316","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:59:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FullFlow adds LoRA adapters and discrete text insertion to pretrained rectified-flow text-to-image models, achieving bidirectional generation with major gains in FID, CIDEr, VRAM, and throughput over Dual Diffusion baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16829","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Constrained Code Generation with Discrete Diffusion","primary_cat":"cs.CL","submitted_at":"2026-05-16T06:15:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Constrained Diffusion for Code (CDC) integrates constraint satisfaction into the reverse denoising process of discrete diffusion models via constraint-aware operators that use optimization and program analysis to steer generation toward feasible programs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15676","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dynamic Chunking for Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-15T06:56:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11726","ref_index":46,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-12T08:09:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model.NeurIPS, 2023. [45] K. Rojas, J. Lin, K. Rasul, A. Schneider, Y . Nevmyvaka, M. Tao, and W. Deng. Improving reasoning for diffusion language models via group diffusion policy optimization. InICLR, 2026. [46] S. S. Sahoo, M. Arriola, Y . Schiff, A. Gokaslan, E. Marroquin, J. T. Chiu, A. Rush, and V . Kuleshov. Simple and Effective Masked Diffusion Language Models.arXiv preprint arXiv:2406.07524, 2024. [47] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust Region Policy Optimization. InICML, 2015. [48] J. Schulman, F. Wolski, P. Dhariwal, A."},{"citing_arxiv_id":"2605.06366","ref_index":17,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Layer Collapse in Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-07T14:39:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00354","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VQ-SAD: Vector Quantized Structure Aware Diffusion For Molecule Generation","primary_cat":"cs.LG","submitted_at":"2026-05-01T02:32:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VQ-SAD combines a pretrained VQ-VAE with diffusion models by using its codebooks as discrete tokenizers for atoms and bonds, yielding slight improvements over prior diffusion methods on QM9 and ZINC250k.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26841","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Language Diffusion Models are Associative Memories Capable of Retrieving Unseen Data","primary_cat":"cs.LG","submitted_at":"2026-04-29T16:06:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Uniform-based discrete diffusion models behave as associative memories that retrieve unseen data, with a dataset-size-driven memorization-to-generalization transition detectable via conditional entropy of token predictions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26985","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Simple Self-Conditioning Adaptation for Masked Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-04-28T19:34:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SCMDM is a post-training self-conditioning adaptation for masked diffusion models that reduces generative perplexity by nearly 50% on OWT and improves performance on images, molecules, and genomics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06779","ref_index":8,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion","primary_cat":"cs.AI","submitted_at":"2026-04-08T07:50:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and MCTS methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Lin Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing, Xiaolu Zhang, Junbo Zhao, Da Zheng, Jun Zhou, Junlin Zhou, Zhanchao Zhou, Liwang Zhu, and Yihong Zhuang. Llada2.0: Scaling up diffusion language models to 100b, 2025. URLhttps://arxiv.org/abs/2512.15745. [8] Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and V olodymyr Kuleshov. Simple and effective masked diffusion language models, 2024. URLhttps://arxiv.org/abs/2406.07524. [9] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning, 2024."},{"citing_arxiv_id":"2604.03677","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unlocking Prompt Infilling Capability for Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-04T10:26:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02718","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generative Frontiers: Why Evaluation Matters for Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-04-03T04:21:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Generative perplexity and entropy are shown to be the two additive components of KL divergence to a reference distribution, motivating generative frontiers as a principled evaluation method for diffusion language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02340","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-02-04T13:04:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.02193","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference","primary_cat":"cs.CL","submitted_at":"2025-08-04T08:43:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.17298","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mercury: Ultra-Fast Language Models Based on Diffusion","primary_cat":"cs.CL","submitted_at":"2025-06-17T17:06:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mercury Coder diffusion LLMs achieve throughputs of 1109 and 737 tokens per second on H100 GPUs, up to 10x faster than frontier models with comparable quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.22618","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding","primary_cat":"cs.CL","submitted_at":"2025-05-28T17:39:15+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Fast-dLLM adds reusable KV cache blocks and selective parallel decoding to diffusion LLMs, closing most of the speed gap with autoregressive models without retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.16933","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning","primary_cat":"cs.LG","submitted_at":"2025-05-22T17:23:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2023. [38] A. Lou, C. Meng, and S. Ermon, \"Discrete diffusion modeling by estimating the ratios of the data distribution,\" inForty-first International Conference on Machine Learning, 2024. [39] J. Shi, K. Han, Z. Wang, A. Doucet, and M. K. Titsias, \"Simplified and generalized masked diffusion for discrete data,\"arXiv preprint arXiv:2406.04329, 2024. [40] S. S. Sahoo, M. Arriola, Y . Schiff, A. Gokaslan, E. Marroquin, J. T. Chiu, A. Rush, and V . Kuleshov, \"Simple and effective masked diffusion language models,\" arXiv preprint arXiv:2406.07524, 2024. [41] J. Ou, S. Nie, K. Xue, F. Zhu, J. Sun, Z. Li, and C. Li, \"Your absorbing discrete diffusion secretly models the conditional distributions of clean data,\"arXiv preprint arXiv:2406."},{"citing_arxiv_id":"2502.17119","ref_index":135,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diffusion and Flow Matching Models for Tabular Data: A Survey","primary_cat":"cs.LG","submitted_at":"2025-02-24T13:01:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"First dedicated survey organizing diffusion and flow matching models for tabular data synthesis, imputation, anomaly detection, and related tasks, covering literature from 2015 to 2026 and highlighting open problems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"PERFORMANCE EVALUATION METRICS A. Diffusion Models for Discrete Data. Diffusion models for discrete data can be divided into two categories [134]: 1) the first category of works converts discrete structures into a continuous latent space and then directly applies Gaussian diffusion model in the latent space. This line of works include [97], [100], [135], [136], [137], [138], [139], [140]; 2) the second category of works directly defines the diffusion process on discrete structures, and we will focus on this category in the remaining of this section. 1) Binary Diffusion Model: This line of work is for the first time studied in [22], which explored the scenario of binary random features. 2) Multinomial Diffusion Model: Multinomial Diffusion"},{"citing_arxiv_id":"2502.09992","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Diffusion Models","primary_cat":"cs.CL","submitted_at":"2025-02-14T08:23:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Nevertheless, the diffusion sampling (i.e., the reverse generation process) yields the best performance and is adopted as the default throughout this paper, especially for all experiments presented in Sec. 3. For conditional likelihood evaluation, we can naturally utilize the upper bound in Eq. (5). However, we find that the following equivalent form [20] exhibits lower variance and is more stable: −El,r0,rl \" L l LX i=1 1[ri l =M] logp θ(ri 0|p0, rl) # ,(6) where L is the sequence length of r0, l is uniformly sampled from {1,2, . . . , L}, and rl is obtained by uniformly samplingltokens fromr 0 without replacement for masking. We present the training and inference algorithms, along with theoretical details, in Appendix A."},{"citing_arxiv_id":"2409.00588","ref_index":83,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diffusion Policy Policy Optimization","primary_cat":"cs.RO","submitted_at":"2024-09-01T02:47:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}