LingoLoop traps MLLMs into generating up to 367 times more tokens by applying POS-aware attention adjustments to postpone EOS tokens and pruning generative paths to sustain repetitive loops.
V2PE: improving multimodal long-context capability of vision-language models with variable visual position encoding
4 Pith papers cite this work. Polarity classification is still indexing.
4
Pith papers citing it
citation-role summary
background 1
method 1
citation-polarity summary
representative citing papers
VPSG corrects predictable directional coordinate biases in MLLMs by shuffling visual positional encodings to isolate unconditioned tendencies and steering digit decoding with a lightweight finite-state machine, yielding accuracy gains on ScreenSpot-Pro without retraining.
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
citing papers explorer
-
LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops
LingoLoop traps MLLMs into generating up to 367 times more tokens by applying POS-aware attention adjustments to postpone EOS tokens and pruning generative paths to sustain repetitive loops.