pith. machine review for the scientific record.

arxiv: 2605.14530 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion vision-language models · mask prior drift · positional attention collapse · repetitive generation · visual grounding · training-free inference · RoPE scaling · iterative unmasking

The pith

Mask token prior drift and positional attention misalignment cause repetitive generation and weak visual grounding in large diffusion vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large diffusion vision-language models initialize generation with mask tokens and decode iteratively, yet this setup leads to hidden states drifting toward a shared mask direction and to positional biases that steer attention away from image content. The result is repetitive text and outputs that ignore visual details, especially in longer descriptions. The authors trace both failures to the interaction between the mask initialization and the unmasking schedule. They introduce two training-free corrections applied only during decoding: one suppresses the accumulating mask prior, and the other monotonically scales rotary position embeddings to keep attention on visual tokens. Experiments show these changes cut repetition and lift grounding scores on standard multimodal and long-form benchmarks without retraining the models.
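As a concrete illustration of the first correction, a minimal sketch of what mask prior suppression could look like at decoding time is given below. The function name, the default strength beta, and the way the prior direction is estimated are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mask_prior_suppression(hidden: torch.Tensor, prior_dir: torch.Tensor, beta: float = 0.4) -> torch.Tensor:
    """Suppress the component of generation-token hidden states lying along the
    shared mask-prior direction (illustrative sketch only, not the paper's code).

    hidden:    (num_gen_tokens, d) final-layer hidden states of still-masked tokens
    prior_dir: (d,) estimated mask-prior direction, e.g. the hidden state the model
               assigns to an uncontextualized mask token (assumed estimator)
    beta:      suppression strength (assumed hyperparameter)
    """
    u = F.normalize(prior_dir, dim=-1)
    proj = hidden @ u                                    # signed length along the prior direction
    cos = F.cosine_similarity(hidden, u.unsqueeze(0), dim=-1)
    # adaptive suppression: remove more of the prior component for tokens whose
    # hidden state has drifted further toward the prior direction
    scale = beta * cos.clamp(min=0.0)
    return hidden - (scale * proj).unsqueeze(-1) * u
```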

Core claim

Existing LDVLMs suffer from repetitive generation because generation tokens initialized as masks have hidden representations that progressively drift toward a shared prior direction, and from degraded visual grounding because positional attention biases misalign with the iterative unmasking process and therefore suppress attention to informative visual tokens. These two mechanisms are mitigated by Mask Prior Suppression, which counters the drift, and Monotonic RoPE Scaling, which restores attention to visual content across decoding steps.

What carries the argument

Mask Prior Suppression and Monotonic RoPE Scaling, applied at inference time to counteract mask-token prior accumulation and to realign positional attention with the unmasking schedule.
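For the second intervention, one plausible reading of the monotonic RoPE scaling described above is to stretch the low-frequency rotary components, which control long-range positional decay, more strongly than the high-frequency ones. The sketch below is an editorial illustration under that assumption; the linear ramp and the parameter eta are not taken from the paper.

```python
import torch

def monotonic_rope_scaling(inv_freq: torch.Tensor, eta: float = 8.0) -> torch.Tensor:
    """Rescale RoPE inverse frequencies so low-frequency (long-range) components
    rotate more slowly, flattening the positional decay toward distant visual
    tokens (illustrative sketch; eta and the ramp shape are assumed).

    inv_freq: (d/2,) standard RoPE inverse frequencies theta_i = base**(-2*i/d),
              ordered from high frequency (i = 0) to low frequency (i = d/2 - 1)
    """
    ramp = torch.linspace(0.0, 1.0, inv_freq.shape[0], device=inv_freq.device)
    scale = 1.0 + (eta - 1.0) * ramp        # 1 at the highest frequency, eta at the lowest
    return inv_freq / scale                 # slower rotation -> weaker long-range attention decay
```

Because the rescaled frequencies would be swapped in only while decoding, the change stays consistent with the training-free, plug-and-play claim above.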

If this is right

  • Repetition rates drop and visual grounding improves on general multimodal and long-form description tasks.
  • The fixes require no retraining and can be added to any existing LDVLM architecture.
  • Performance gains hold across diverse model sizes and training regimes.
  • The interventions remain effective even as generation length increases.
  • The same two mechanisms explain why current LDVLMs underperform autoregressive models on extended outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar prior-drift effects may appear in other iterative diffusion generators that begin from a shared mask or noise state.
  • Training objectives for diffusion VLMs could be revised to penalize mask-token convergence explicitly.
  • The fixes might extend to non-vision diffusion models that use iterative unmasking or progressive revelation.
  • Longer coherent multimodal outputs become feasible once these inference-time adjustments are standard.

Load-bearing premise

The observed repetition and grounding failures are driven primarily by mask prior drift and positional attention misalignment rather than by deeper problems in the training objective or model architecture.

What would settle it

Measure whether applying both Mask Prior Suppression and Monotonic RoPE Scaling to a baseline LDVLM measurably lowers repetition rate and raises visual grounding accuracy on long-form description benchmarks relative to the unmodified model.
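The repetition half of that test can be scored with standard surface statistics. A minimal sketch follows, assuming token-level distinct-n and a repeated-n-gram ratio; the paper's exact metric definitions may differ.

```python
from collections import Counter

def distinct_n(tokens, n=2):
    """Fraction of unique n-grams; lower values indicate more repetitive text."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def repetition_ratio(tokens, n=4):
    """Fraction of n-gram positions occupied by an n-gram that occurs more than once."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / max(len(ngrams), 1)

# usage: score baseline and intervention outputs on the same prompts and compare
print(distinct_n("the cat sat on the mat the cat sat".split(), n=2))
```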

Figures

Figures reproduced from arXiv: 2605.14530 by Chanyong Yoon, Seongjae Hwang, Sujung Hong.

Figure 1
Figure 1: Failure case of LDVLMs. Under parallel decoding with 64 generation tokens and 16 generation steps, LLaDA-V produces highly repetitive phrases as highlighted in red, and exhibits degraded visual grounding as highlighted in gray. view at source ↗
Figure 2
Figure 2: Visualization of token repetition and mask prior drift. (a) Distinct-n (left) and repetition ratio (right) across different numbers of generation steps. Fewer generation steps lead to lower distinct-n and higher repetition. (b) 3D PCA trajectories of hidden states for the vocabulary mean embedding and the uncontextualized mask token, which converge to a similar region at the final layer (L31). (c) Cosine s… view at source ↗
Figure 3
Figure 3: Visualization of positional attention collapse. (a) Mean attention weight (log scale) across relative distance, showing stronger attention to mask tokens than visual tokens at similar distances and an overall decreasing trend in attention to visual tokens as relative distance increases. (b) Sum of attention to visual and mask tokens per generation token across generation steps, revealing a persistent alloc… view at source ↗
Figure 4
Figure 4: Overview of the proposed model. (a) Mask prior suppression. The final hidden state h_j^L is decomposed along the prior direction û^L, and prior components are adaptively suppressed based on cosine similarity. (b) Monotonic RoPE scaling. Low-frequency RoPE components, which govern long-range positional interactions, are scaled more strongly than high-frequency components to preserve attention to distant… view at source ↗
Figure 5
Figure 5: Qualitative comparison on visual grounding and long-form generation. (a) RefCOCOg results. A red box indicates the target region. The baseline model LLaDA-V produces descriptions referring to an incorrect object shown in gray, whereas our method correctly grounds the description to the target location and achieves more accurate visual grounding shown in blue. (b) MIA results. The baseline model exhibits re… view at source ↗
Figure 6
Figure 6: Results of LaViDa on DetailCaps with varying generation steps. Dashed lines: LaViDa, solid lines: Ours. (a) Distinct-n scores of our method exhibit an increasing trend across the evaluated settings and remain consistently higher than those of the baseline. (b) The repetition ratio under our method shows a decreasing trend across the evaluated settings and remains consistently lower than that of the base… view at source ↗
Figure 7
Figure 7: Visualization of result analysis. (a) Box plot of cosine similarity between contextualized mask tokens and the vocabulary mean, showing consistent reduction across generation steps. (b) Relative change in attention with respect to relative distance, where attention to distant visual tokens increases compared to the baseline, while attention to mask tokens is preserved or reduced. view at source ↗
Figure 8
Figure 8: Visualization of mask prior drift on LaViDa. (a) 3D PCA trajectories of hidden states for the vocabulary mean embedding and the uncontextualized mask token, which converge to a similar region at the final layer (L31). (b) Cosine similarity between contextualized mask token embeddings and the vocabulary mean, showing consistently stronger alignment than random embeddings, especially with fewer generation st… view at source ↗
Figure 9
Figure 9: Visualization of positional attention collapse on LaViDa. (a) Mean attention weight across relative distance (log scale), showing stronger attention to mask tokens than visual tokens at similar distances and a monotonic decay for visual tokens. (b) Sum of attention to visual and mask tokens per generation token across generation steps, revealing a persistent allocation of comparable attention weights to mas… view at source ↗
Figure 10
Figure 10: Visualization of result analysis on LaViDa. (a) Box plot of cosine similarity between contextualized mask tokens and the vocabulary mean, showing consistent reduction across generation steps. (b) Relative change in attention with respect to relative distance, where attention to distant visual tokens increases compared to the baseline, while attention to mask tokens is preserved or reduced. view at source ↗
Figure 11
Figure 11: Relative performance changes on DetailCaps across generation steps using LLaDA-V. Dashed lines: LLaDA-V, solid lines: Ours. (a) ∆Distinct-n (Ours – Base) shows consistent gains, with larger improvements at moderate to larger generation steps. (b) ∆Repetition ratio (Base – Ours) remains positive across most steps, indicating reduced repetition, with the strongest reductions observed at intermediate steps. … view at source ↗
Figure 12
Figure 12: Generation step analysis and DetailCaps performance on LaViDa. (a) Top-5 logits of the uncontextualized mask token M, where the |eot| token consistently receives the highest logit. (b) CAPTURE scores on the DetailCaps benchmark as a function of generation steps. Contrary to a standard speed–quality trade-off, performance peaks at 16 steps and degrades with additional steps. (c) Qualitative examples showin… view at source ↗
Figure 13
Figure 13: Qualitative results on RefCOCOg using LLaDA-V. The red bounding boxes indicate the target regions in the image. view at source ↗
Figure 14
Figure 14: Qualitative results on Ferret using LLaDA-V. The red bounding boxes indicate the target regions in the image. What type of fruit is this? LLaDA-V: The fruit in the image is a mangoes. Mangoes are known for their sweet, juicy flesh and are commonly found in tropical and subtropical regions. They are typically eaten raw but can also be used in desserts, smoothies, and other dishes. The white flesh of mangoe… view at source ↗
Figure 15
Figure 15: Qualitative results on LLaVA-Bench using LLaDA-V. Describe the image using a simile. LLaDA-V: The image is a picture of of two flower shops, one on a brick building and the other on a wooden building ❌, both with green awnings. The flower shops are like two friends standing side by side by side, each with their own unique charm, attracting customers and adding to the beauty of the street LLaDA-V+Ours: The… view at source ↗
Figure 16
Figure 16: Qualitative results on MIA using LLaDA-V. Provide a short description for this region. LaViDa: The net of a tennis court. LaViDa+Ours: A woman holding a tennis racket. Provide a short description for this region. LaViDa: Red plate LaViDa+Ours: Glass of wine on the table view at source ↗
Figure 17
Figure 17: Qualitative results on RefCOCOg using LaViDa. The red bounding boxes indicate the target regions in the image. view at source ↗
Figure 18
Figure 18: Qualitative results on Ferret using LaViDa. The red bounding boxes indicate the target regions in the image. Show the detailed recipe for this dish. LaViDa: None LaViDa+Ours: The image provided a visual representation of the dish, but without textual information or additional context, it is not possible to provide a detailed recipe. Typically, a recipe for a dish would include the ingredients, cooking met… view at source ↗
Figure 19
Figure 19: Qualitative results on LLaVA-Bench using LaViDa. Express the feelings that might be elicited by this image using a first-person perspective, specify any author's name visible, while implying a sense of nostalgia. LaViDa: This, this image evokes a sense of nostalgia and the. The, the stack of the,, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the, the the. … view at source ↗
Figure 20
Figure 20: Qualitative results on MIA using LaViDa. view at source ↗
original abstract

Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that large diffusion vision-language models (LDVLMs) exhibit repetitive generation and degraded visual grounding during long-form decoding. It attributes repetitive generation to progressive drift of mask-token hidden states toward a shared prior direction, and grounding degradation to misalignment between positional attention biases (RoPE) and the iterative unmasking schedule. The authors introduce two training-free interventions—Mask Prior Suppression and Monotonic RoPE Scaling—to counteract these mechanisms and report improved performance on general multimodal and visual-grounding benchmarks, with larger gains on long-form description tasks.

Significance. If the proposed mechanisms are shown to be causal and the interventions robust, the work supplies a lightweight, architecture-agnostic decoding fix that avoids retraining. This would be practically valuable for deploying LDVLMs on extended multimodal outputs. The training-free character and claimed generalization across LDVLM families are notable strengths.

major comments (2)
  1. [§4] The central causal claim—that mask-token prior drift is the origin of repetitive generation—is supported only by correlation: the paper documents the drift and shows that Mask Prior Suppression reduces repetition, but does not report a controlled intervention that artificially induces equivalent drift (while fixing the diffusion schedule and other components) to verify that repetition increases. Without this isolation, drift could be a downstream symptom rather than the root driver (§4, experimental analysis of hidden-state trajectories).
  2. [Table 2, Figure 4] The effectiveness of Monotonic RoPE Scaling is presented as directly addressing positional attention collapse, yet the manuscript provides limited ablation isolating its contribution from Mask Prior Suppression and from changes in the unmasking schedule. Quantitative attention maps or grounding metrics before/after scaling alone would be needed to substantiate the misalignment diagnosis (Table 2 and Figure 4).
minor comments (2)
  1. [Abstract] The abstract states improvements on “general multimodal benchmarks” without naming the exact datasets or reporting absolute scores; adding these numbers would improve reproducibility.
  2. [§3] Notation for the mask-prior direction and the monotonic scaling factor is introduced without an explicit equation reference in the main text; a numbered equation would clarify the implementation.
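One plausible numbered form of the two operations, offered here only as an editorial sketch consistent with the Figure 4 description (the symbols λ and s_i are assumed names, not the paper's notation):

$$\tilde{h}^{L}_{j} \;=\; h^{L}_{j} \;-\; \lambda \,\max\!\bigl(0,\;\cos(h^{L}_{j},\hat{u}^{L})\bigr)\,\langle h^{L}_{j},\hat{u}^{L}\rangle\,\hat{u}^{L}, \qquad \theta'_{i} \;=\; \theta_{i}/s_{i}, \quad 1 \le s_{i} \le s_{\max},$$

with s_i increasing monotonically as the rotary frequency θ_i decreases.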

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below with point-by-point responses, indicating where revisions have been made or will be incorporated in the next version of the manuscript.

point-by-point responses
  1. Referee: [§4] The central causal claim—that mask-token prior drift is the origin of repetitive generation—is supported only by correlation: the paper documents the drift and shows that Mask Prior Suppression reduces repetition, but does not report a controlled intervention that artificially induces equivalent drift (while fixing the diffusion schedule and other components) to verify that repetition increases. Without this isolation, drift could be a downstream symptom rather than the root driver (§4, experimental analysis of hidden-state trajectories).

    Authors: We agree that the evidence is primarily correlational and interventional via suppression rather than direct induction. Artificially inducing equivalent drift while strictly fixing the diffusion schedule and all other components is non-trivial, as the drift arises organically from the mask-token initialization and iterative unmasking dynamics. To strengthen the argument, we have expanded the analysis in §4 with additional hidden-state trajectory plots across multiple models and generation lengths, demonstrating that drift onset reliably precedes measurable increases in repetition. We also include a dose-response study varying the strength of Mask Prior Suppression, which shows a consistent monotonic relationship between residual drift magnitude and repetition rate. These additions provide stronger support for the proposed mechanism without requiring an artificial induction that would risk confounding the diffusion process itself. revision: partial

  2. Referee: [Table 2, Figure 4] The effectiveness of Monotonic RoPE Scaling is presented as directly addressing positional attention collapse, yet the manuscript provides limited ablation isolating its contribution from Mask Prior Suppression and from changes in the unmasking schedule. Quantitative attention maps or grounding metrics before/after scaling alone would be needed to substantiate the misalignment diagnosis (Table 2 and Figure 4).

    Authors: We acknowledge the need for clearer isolation of Monotonic RoPE Scaling. In the revised manuscript we have added a dedicated ablation that applies Monotonic RoPE Scaling in isolation (i.e., without Mask Prior Suppression) while keeping the original unmasking schedule fixed. The updated Table 2 now reports grounding metrics and repetition rates for this isolated setting, and we include new quantitative attention-map visualizations (added to Figure 4) that compare attention distributions toward visual tokens before and after scaling. These results show measurable improvements in visual grounding attributable to the scaling alone, thereby substantiating the positional misalignment diagnosis. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is observational and intervention-based

full rationale

The paper identifies mask prior drift and positional attention misalignment through direct observation of hidden-state behavior and attention patterns during iterative unmasking. It then introduces two training-free corrections (Mask Prior Suppression and Monotonic RoPE Scaling) as explicit countermeasures. No parameter is fitted to a data subset and then re-labeled as a prediction; no core premise reduces to a self-citation chain; no ansatz is smuggled via prior work; and no known empirical pattern is merely renamed. The derivation chain therefore remains independent of its own outputs and does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard domain assumptions about diffusion decoding and attention mechanisms in VLMs, with no free parameters or new entities introduced.

axioms (2)
  • domain assumption LDVLMs initialize generation tokens as mask tokens whose representations progressively drift toward a shared prior direction.
    Presented as an observed property of the generation process in these models.
  • domain assumption Positional attention bias remains fixed and misaligns with the iterative unmasking schedule.
    Derived from analysis of the model's attention architecture.

pith-pipeline@v0.9.0 · 5513 in / 1301 out tokens · 49534 ms · 2026-05-15T02:12:13.889671+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 7 internal anchors

  1. [1]

    Arif, H., et al. PAINT: Paying attention to INformed tokens to mitigate hallucination in large vision-language models. arXiv preprint arXiv:2501.12835.

  2. [2]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.

  3. [3]

    Language Models are Few-Shot Learners

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877–1901.

  4. [4]

    Allava: Harnessing gpt4v-synthesized data for lite vision-language models

    Chen, G. H., Chen, S., Zhang, R., Chen, J., Wu, X., Zhang, Z., Chen, Z., Li, J., Wan, X., and Wang, B. Allava: Harnessing gpt4v-synthesized data for lite vision-language models. arXiv preprint arXiv:2402.11684.

  5. [5]

    Dpad: Efficient diffusion language models with suffix dropout

    Chen, X., Huang, S., Guo, C., Wei, C., He, Y., Zhang, J., Li, H., Chen, Y., et al. Dpad: Efficient diffusion language models with suffix dropout. arXiv preprint arXiv:2508.14148.

  6. [6]

    Benchmarking and improving detail image caption

    Dong, H., Li, J., Wu, B., Wang, J., Zhang, Y., and Guo, H. Benchmarking and improving detail image caption. arXiv preprint arXiv:2405.19092.

  7. [7]

    Visualwebinstruct: Scaling up multimodal instruction data through web search

    Jia, Y., Li, J., Yue, X., Li, B., Nie, P., Zou, K., and Chen, W. Visualwebinstruct: Scaling up multimodal instruction data through web search. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP).

  8. [8]

    Referitgame: Referring to objects in photographs of natural scenes

    Kazemzadeh, S., Ordonez, V., Matten, M., and Berg, T. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 787–798.

  9. [9]

    A comprehensive survey of accelerated generation techniques in large language models

    Khoshnoodi, M., Jain, V., Gao, M., Srikanth, M., and Chadha, A. A comprehensive survey of accelerated generation techniques in large language models. arXiv preprint arXiv:2405.13019.

  10. [10]

    Hope: Hybrid of position embedding for length generalization in vision-language models

    Li, H., Qin, Y., Ou, B., Xu, L., and Xu, R. Hope: Hybrid of position embedding for length generalization in vision-language models. arXiv preprint arXiv:2505.20444, 2025a. Li, K., Patel, O., Viégas, F., Pfister, H., and Wattenberg, M. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Proce...

  11. [11]

    A survey on diffusion language models

    Li, S., Kallidromitis, K., Bansal, H., Gokul, A., Kato, Y., Kozuka, K., Kuen, J., Lin, Z., Chang, K.-W., and Grover, A. Lavida: A large diffusion model for vision-language understanding. Advances in Neural Information Processing Systems, 2025b. Li, T., Chen, M., Guo, B., and Shen, Z. A survey on diffusion language models. arXiv preprint arXiv:2508.1087...

  12. [12]

    Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. Llava-next: Improved reasoning, OCR, and world knowledge, January 2024a. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/. Liu, X., Yan, H., An, C., Qiu, X., and Lin, D. Scaling laws of RoPE-based extrapolation. In International Conference on Learning Representations, volu...

  13. [13]

    Large Language Diffusion Models

    Nie, S., Zhu, F., Du, C., Pang, T., Liu, Q., Zeng, G., Lin, M., and Li, C. Scaling up masked diffusion models on text. In The Thirteenth International Conference on Learning Representations, 2025a. Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. arXiv preprint arXiv:2502.09992...

  14. [14]

    Instruction Tuning with GPT-4

    Peng, B., Li, C., He, P., Galley, M., and Gao, J. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277.

  15. [15]

    Object hallucination in image captioning

    Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T., and Saenko, K. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

  16. [16]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

  17. [17]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786.

  18. [18]

    Circle-rope: Cone-like decoupled rotary positional embedding for large vision-language models

    Wang, C., Guo, J., Li, H., Tian, Y., Nie, Y., Xu, C., and Han, K. Circle-rope: Cone-like decoupled rotary positional embedding for large vision-language models. arXiv preprint arXiv:2505.16416, 2025a. Wang, J., Wang, Y., Xu, G., Zhang, J., Gu, Y., Jia, H., Yan, M., Zhang, J., and Sang, J. AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallu...

  19. [19]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191.

  20. [20]

    Semantics-adaptive activation intervention for LLMs via dynamic steering vectors

    Wang, W., Yang, J., and Peng, W. Semantics-adaptive activation intervention for LLMs via dynamic steering vectors. In The Thirteenth International Conference on Learning Representations (ICLR), 2025b. Wang, X., Xu, C., Jin, Y., Jin, J., Zhang, H., and Deng, Z. Diffusion LLMs can do faster-than-AR inference via discrete diffusion forcing. arXiv preprint a...

  21. [21]

    Lumina-DiMOO: An omni diffusion large language model for multimodal generation and understanding

    Xin, Y., Qin, Q., Luo, S., Zhu, K., Yan, J., Tai, Y., Lei, J., Cao, Y., Wang, K., Wang, Y., et al. Lumina-DiMOO: An omni diffusion large language model for multimodal generation and understanding. arXiv preprint arXiv:2510.06308.

  22. [22]

    Dream 7B: Diffusion Large Language Models

    Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487.

  23. [23]

    LLaDA-V: Large language diffusion models with visual instruction tuning

    You, Z., Nie, S., Zhang, X., Hu, J., Zhou, J., Lu, Z., Wen, J.-R., and Li, C. LLaDA-V: Large language diffusion models with visual instruction tuning. arXiv preprint arXiv:2505.16933, 2025.

  24. [24]

    InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models

    URL https://arxiv.org/abs/2407.12772. Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479.

  25. [25]

    AMBER is an LLM-free hallucination benchmark covering both generative (AMBER-G) … Table 6. Evaluation setup: evaluation splits, inference steps, and generation length L for each benchmark (dataset / split / steps / L): MME test 2 2; Ferret test 48 96; Detai...

  26. [26]

    To ensure a rigorous and fair comparison, we evaluate models under identical random seeds whenever reported results are unavailable

    For autoregressive baselines, including LLaVA-One-Vision-7B, Qwen2.5-VL-7B, InternVL3-8B, and LLaVA-1.6, we use the default evaluation setups provided by the same framework. To ensure a rigorous and fair comparison, we evaluate models under identical random seeds whenever reported results are unavailable. Notably, for LaViDa, we conduct a re-evaluation ...

  27. [27]

    Subsequent methods further refine rotary scaling to enhance extrapolation stability and efficiency in LLMs (Ding et al., 2024)

    introduces a piecewise frequency rescaling scheme that preserves high-frequency components while smoothly extrapolating to longer sequences. Subsequent methods further refine rotary scaling to enhance extrapolation stability and efficiency in LLMs (Ding et al., 2024). While these approaches are effective for extending context length under causal decoding,...

  28. [28]

    In our experiments, we use LaViDa-L only, as it shares the same language backbone as LLaDA-V , enabling a fair comparison

    and an LLM backbone based on LLaDA-8B or Dream-7B (Ye et al., 2025). In our experiments, we use LaViDa-L only, as it shares the same language backbone as LLaDA-V, enabling a fair comparison. LaViDa introduces a complementary masking strategy during training. Instead of learning from a single masked version of a response, two complementary masked variants...

  29. [29]

    For MMaDA (Yang et al., 2025), we set λ = 0.1, β = 0.4, k = 3, η = 8.0, and τ0 = 0.6

    under a consistent evaluation protocol. For MMaDA (Yang et al., 2025), we set λ = 0.1, β = 0.4, k = 3, η = 8.0, and τ0 = 0.6. For Lumina-DiMOO (Xin et al., 2025), we set λ = 0.1, β = 0.4, k = 3, η = 12.0, and τ0 = 0.6. In both cases, our method consistently outperforms the corresponding baselines, as shown in Table