{"total":21,"items":[{"citing_arxiv_id":"2605.22765","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation","primary_cat":"cs.LG","submitted_at":"2026-05-21T17:27:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Uniform diffusion models rely on a leave-one-out denoiser rather than the usual denoising posterior, with exact conversions derived; an absorbing-state reformulation is introduced that matches or exceeds masked diffusion on language modeling while preserving the original joint distribution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21484","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:59:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Fixed-Point Distillation constructs one-step correction targets for discrete diffusion generators via partial corruption and single teacher refinement, lifted into continuous features with a multi-bandwidth drift loss and straight-through estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20813","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-20T07:06:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PulseCol introduces periodically refreshed 
column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19470","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Drifting Objectives for Refining Discrete Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-19T07:22:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TokenDrift refines discrete diffusion language models by applying anti-symmetric drifting to soft-token features during training, yielding large reductions in generation perplexity at low NFEs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19376","ref_index":44,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Generative Recursive Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-19T05:20:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19262","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Backdooring Masked Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-19T02:20:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SHADOWMASK backdoors MDLMs by replacing the all-mask terminal distribution with a trigger-mask mixture prior, achieving 
near-100% attack success on DiT and LLaDA-8B models across multiple datasets while resisting fine-tuning and some defenses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18253","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Machine Unlearning for Masked Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-18T11:54:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MDU minimizes forward KL divergence from prompt-conditional to prompt-masked unconditional predictions at masked positions to unlearn knowledge in MDLMs while trading off privacy and utility via temperature scaling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16836","ref_index":44,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"HYVINT: Intensity-Driven Hypergraph Generation with Variational Representations","primary_cat":"stat.ML","submitted_at":"2026-05-16T06:38:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HYVINT introduces an intensity-driven incidence mechanism and tractable variational estimator for hypergraph generation, with error bounds and empirical gains in fidelity, novelty, and diversity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09981","ref_index":28,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation","primary_cat":"q-bio.BM","submitted_at":"2026-05-11T04:49:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Yeti is a compact tokenizer for 
protein structures that delivers strong codebook use, token diversity, and reconstruction while enabling from-scratch multimodal generation of plausible sequences and structures with 10x fewer parameters than ESM3.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"generation of protein sequence and structure? We address the challenging task of jointly generating atomic structures and amino acid sequences. If successful, this could unlock fine-grained control over functional sites and allow protein design tasks, such as atomistic motif scaffolding. Towards this, we trained a Masked Diffusion Model (MDM) over an absorbing state [28, 29] with Transformer architecture, jointly over Yeti's structure tokens (x) and amino acid sequences (s). We utilize the DA+E dataset. The unconditional co-generation task (pθ(s, x)) directly explores whether the model has learned the intrinsic coupling between amino acid 6 (a) Length: 100. (b) Length: 200. (c) Length: 300. (d) Length: 400. (e) Length: 500."},{"citing_arxiv_id":"2605.09536","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM","primary_cat":"cs.CL","submitted_at":"2026-05-10T13:38:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"that prioritizes accuracy and a Speed model that favors more aggressive acceleration. Experiments show that TAD consistently improves the accuracy-parallelism trade-off. 
On LLaDA, it raises average accuracy from 46.2% to 51.6% with the Quality model and average AUP from 46.2 to 257.1 with the Speed model. Our code is available at: https://github.com/BHmingyang/TAD. 1 Introduction Diffusion large language models (dLLMs) [1, 2, 3, 4, 5] have recently emerged as a promising alternative to Autoregressive (AR) language models. Unlike AR models that generate tokens strictly from left to right, dLLMs inherently support bidirectional attention and parallel generation of multiple tokens. Despite this theoretical potential, achieving high parallelism in practice remains a challenge [6]."},{"citing_arxiv_id":"2605.10971","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-08T18:52:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Limitations: Our evaluation relies on classifier-based metrics that may not capture nuanced attribute expression. Generated sequences are short (64-1024 tokens); steering at longer lengths remains unexplored. Diversity decreases at high steering strengths in SAE-based methods. Extending to finer-grained attributes and learning steering schedules from feature dynamics are open directions. References [1] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. 
Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34:17981-17993, 2021. [2] Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov."},{"citing_arxiv_id":"2605.07048","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Unlocking High-Fidelity Molecular Generation from Mass Spectra via Dual-Stream Line Graph Diffusion","primary_cat":"cs.LG","submitted_at":"2026-05-07T23:56:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DualLGD reformulates molecular graph denoising as alternating atom and bond subproblems in separate streams, achieving 34.37% and 23.89% top-1 accuracy on NPLIB1 and MassSpecGym benchmarks, roughly 3x prior state of the art.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"updated through edge-modulated node attention layers, and consistency between atom and bond representations is established only indirectly through successive layers. This observation motivates the dual-stream design of DualLGD. Discrete graph diffusion. Denoising diffusion probabilistic models [21], originally developed for continuous data, have been extended to discrete state-spaces through structured categorical transition matrices [2]. DiGress [37] adapts this framework to graph generation by defining a discrete diffusion process that jointly corrupts and denoises categorical node and edge attributes, using a graph transformer as the denoising network with marginal-preserving noise schedules. 
DiffMS [4] and MBGen [36] build upon DiGress for spectrum-conditioned molecular generation, inheriting its cosine"},{"citing_arxiv_id":"2605.06553","ref_index":4,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Diverse Sampling in Diffusion Models with Marginal Preserving Particle Guidance","primary_cat":"cs.LG","submitted_at":"2026-05-07T16:49:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EDDY adds diversity to diffusion-model samples by using kernel-based anti-symmetric pairwise drifts that preserve marginal distributions via Fokker-Planck symmetries, with practical approximations for expensive cases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06548","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Continuous Latent Diffusion Language Model","primary_cat":"cs.CL","submitted_at":"2026-05-07T16:44:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03360","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A-CODE: Fully Atomic Protein Co-Design with Unified Multimodal Diffusion","primary_cat":"q-bio.QM","submitted_at":"2026-05-05T04:41:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A-CODE presents a fully atomic one-stage multimodal diffusion 
model for protein co-design that claims superior unconditional generation performance over prior one- and two-stage models plus a tenfold success-rate gain on hard binder-design tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26841","ref_index":21,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Language Diffusion Models are Associative Memories Capable of Retrieving Unseen Data","primary_cat":"cs.LG","submitted_at":"2026-04-29T16:06:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Uniform-based discrete diffusion models behave as associative memories that retrieve unseen data, with a dataset-size-driven memorization-to-generalization transition detectable via conditional entropy of token predictions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22152","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model","primary_cat":"cs.RO","submitted_at":"2026-04-24T01:50:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08302","ref_index":4,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DMax: Aggressive Parallel Decoding for dLLMs","primary_cat":"cs.LG","submitted_at":"2026-04-09T14:35:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DMax uses On-Policy 
Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08121","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator","primary_cat":"cs.CV","submitted_at":"2026-04-09T11:41:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose generative knowledge for discriminative tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.22241","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MemDLM: Memory-Enhanced DLM Training","primary_cat":"cs.CL","submitted_at":"2026-03-23T17:39:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.15809","ref_index":109,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MMaDA: Multimodal Large Diffusion Language 
Models","primary_cat":"cs.CV","submitted_at":"2025-05-21T17:59:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-image tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023. [108] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in neural information processing systems, 34:12454-12465, 2021. [109] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34:17981-17993, 2021. [110] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707."}],"limit":50,"offset":0}