pith. machine review for the scientific record.

arxiv: 2604.12582 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords Video Large Language Models · Hallucinations · Anchor Frame · Attention Rebalancing · Temporal Evidence · Decoder-side Methods · Inference-time Mitigation · Video Understanding

The pith

Video LLMs over-rely on one dominant frame in attention, and relaxing that dominance through decoder rebalancing reduces hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video large language models generate hallucinations when their decoder attention concentrates on a single anchor frame that holds the highest share of frame-level attention mass. This concentration pattern is largely independent of the specific video content and instead reflects a persistent model bias. The paper introduces Decoder-side Temporal Rebalancing, a training-free adjustment applied selectively in middle-to-late decoder layers that redistributes attention so that under-attended frames contribute more. The approach matters because it improves output reliability without altering visual encoding, degrading overall model performance, or requiring additional training or auxiliary components.

Core claim

Video-LLMs exhibit a temporally imbalanced concentration pattern in which one anchor frame dominates aggregated attention mass; this bias appears model-specific and structural rather than input-dependent, and its over-dominance correlates with hallucination-prone generation. Decoder-side Temporal Rebalancing counters the imbalance by adaptively calibrating visual attention in selected decoder layers, guiding the model to ground responses in temporally broader evidence.
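
To make "aggregated frame-level attention mass" concrete (the exact aggregation the paper uses is not reproduced here), a minimal sketch of locating the anchor frame might look like the following. The function name, the tensor layout, and the assumption that visual tokens are grouped contiguously by frame are all illustrative assumptions, not the authors' code.

```python
# Minimal sketch, not the authors' implementation: locate the "anchor frame"
# as the frame with the highest aggregated frame-level attention mass.
# attn: [num_gen_tokens, num_heads, num_visual_tokens], attention from generated
# tokens to visual tokens in one decoder layer; frame-contiguous visual tokens assumed.
import torch

def find_anchor_frame(attn: torch.Tensor, num_frames: int) -> tuple[int, torch.Tensor]:
    tokens_per_frame = attn.shape[-1] // num_frames
    per_token = attn.mean(dim=(0, 1))                           # average over queries and heads
    frame_mass = per_token.reshape(num_frames, tokens_per_frame).sum(dim=1)
    return int(frame_mass.argmax()), frame_mass                 # anchor index, per-frame mass
```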

What carries the argument

The anchor frame, defined as the video frame with the highest aggregated frame-level attention mass; DTR relaxes its dominance through layer-selective attention calibration.
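
The review does not spell out the calibration rule, so the sketch below is only one plausible way to "relax" anchor dominance inside a selected middle-to-late decoder layer: cap the anchor frame's share of visual attention at a multiple of the uniform share and hand the excess mass to the remaining frames. The `cap_ratio` knob and the frame-contiguous token layout are assumptions for illustration, not the paper's notation.

```python
# Hypothetical relaxation sketch (not the authors' published rule), applied only in
# middle-to-late decoder layers. attn: [num_heads, num_visual_tokens], attention over
# the visual tokens at the current generation step; frame-contiguous tokens assumed.
import torch

def relax_anchor_dominance(attn: torch.Tensor, num_frames: int,
                           cap_ratio: float = 1.5) -> torch.Tensor:
    tokens_per_frame = attn.shape[-1] // num_frames
    frame_mass = attn.mean(dim=0).reshape(num_frames, tokens_per_frame).sum(dim=1)
    anchor = int(frame_mass.argmax())
    total = float(frame_mass.sum())
    cap = cap_ratio * total / num_frames         # anchor keeps at most cap_ratio x the uniform share
    excess = float(frame_mass[anchor]) - cap
    if excess <= 0:
        return attn                              # no over-dominance to relax
    # Scale the anchor's tokens down and every other frame's tokens up, preserving total mass.
    scale = torch.full((num_frames,),
                       1.0 + excess / max(total - float(frame_mass[anchor]), 1e-8))
    scale[anchor] = cap / float(frame_mass[anchor])
    return attn * scale.repeat_interleave(tokens_per_frame)
```

Layer selectivity would then amount to calling such a hook only for a chosen range of decoder layers, leaving early layers and the visual encoder untouched, consistent with the paper's stated constraint of not altering visual encoding.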

If this is right

  • DTR improves hallucination robustness across multiple Video-LLM families on existing benchmarks.
  • Video understanding performance remains competitive after the adjustment.
  • Inference efficiency stays high because the method requires no training or auxiliary models.
  • The decoder is guided to draw on a wider set of temporal frames when forming responses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The finding suggests hallucinations in these models stem more from decoder positional biases than from the visual encoder itself.
  • Similar selective rebalancing could be tested on other sequential modalities such as audio or long text to reduce analogous attention skews.
  • The approach may generalize to other forms of attention imbalance in multimodal models without requiring parameter updates.

Load-bearing premise

The anchor-frame bias is largely independent of the input video and instead reflects a persistent, model-specific structural or positional bias whose over-dominance is closely associated with hallucination-prone generation.

What would settle it

A controlled test in which the identified anchor frame changes substantially across different input videos, or in which DTR produces no measurable reduction in hallucination rates on standard benchmarks, would falsify the central claim.
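
Under the same illustrative assumptions as the sketches above, the anchor-stability half of that test reduces to collecting the anchor index over many videos and checking how concentrated its distribution is. The helper below is hypothetical and operates only on the anchor indices, however they were obtained.

```python
# Sketch of the stability check: how concentrated is the anchor position across videos?
from collections import Counter
import math

def anchor_concentration(anchor_indices: list[int]) -> tuple[float, float]:
    """Return (share of the most common anchor position, entropy of the distribution in bits)."""
    counts = Counter(anchor_indices)
    n = len(anchor_indices)
    top_share = counts.most_common(1)[0][1] / n
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return top_share, entropy

# A top_share near 1.0 with entropy near 0 bits would support an input-independent bias;
# anchors spread widely across frames (high entropy) would count against the central claim.
```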

Figures

Figures reproduced from arXiv: 2604.12582 by Caiyan Qin, Chaoning Zhang, Jiwei Wei, Kuien Liu, Pengcheng Zheng, Sihan Cao, Xiaolin Qin, Zijian Liu.

Figure 1: Anchor-frame dominance in two Video-LLMs. Video-LLaVA focuses on frame 1, while LLaVA-NeXT-Video focuses on frame 7, leading […]
Figure 2: Anchor-frame behavior under black-frame intervention.
Figure 3: Layer-wise visual attention ratio in Video-LLaVA-7B. […]
Figure 4: Overview of the proposed decoder-side temporal rebalancing (DTR) framework. Given a video and a text prompt, DTR operates in […]
Figure 5: Mid-layer attention before and after DTR on the same example. DTR redistributes attention across frames and corrects the answer […]
Figure 6: Ablation study on Video-LLaVA-7B (VideoHallucer). Each plot shows Overall Accuracy (%) and gain […]
Figure 7: Anchor-position statistics under normal inputs and black […]
Figure 8: Attention visualizations under 16-frame sampling for […]
Figure 9: Black-frame analysis on Qwen2.5-VL-7B, aggregated over […]
Figure 10: Anchor-frame distribution of InternVL3.5-8B, aggregated […]
Original abstract

Recent Video Large Language Models (Video-LLMs) have demonstrated strong capability in video understanding, yet they still suffer from hallucinations. Existing mitigation methods typically rely on training, input modification, auxiliary guidance, or additional decoding procedures, while largely overlooking a more fundamental challenge. During generation, Video-LLMs tend to over-rely on a limited portion of temporal evidence, leading to temporally imbalanced evidence aggregation across the video. To address this issue, we investigate a decoder-side phenomenon in which the model exhibits a temporally imbalanced concentration pattern. We term the frame with the highest aggregated frame-level attention mass the anchor frame. We find that this bias is largely independent of the input video and instead appears to reflect a persistent, model-specific structural or positional bias, whose over-dominance is closely associated with hallucination-prone generation. Motivated by this insight, we propose Decoder-side Temporal Rebalancing (DTR), a training-free, layer-selective inference method that rebalances temporal evidence allocation in middle-to-late decoder layers without altering visual encoding or requiring auxiliary models. DTR adaptively calibrates decoder-side visual attention to alleviate temporally imbalanced concentration and encourage under-attended frames to contribute more effectively to response generation. In this way, DTR guides the decoder to ground its outputs in temporally broader and more balanced video evidence. Extensive experiments on hallucination and video understanding benchmarks show that DTR consistently improves hallucination robustness across diverse Video-LLM families, while preserving competitive video understanding performance and high inference efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies a decoder-side 'anchor-frame' phenomenon in Video-LLMs, in which attention mass concentrates on a single frame that is largely independent of video content and instead reflects a model-specific structural bias; this over-dominance is posited to drive hallucination-prone generation. The authors introduce Decoder-side Temporal Rebalancing (DTR), a training-free, layer-selective inference intervention that adaptively rebalances visual attention in middle-to-late decoder layers to promote temporally broader evidence aggregation. Extensive experiments on hallucination and video-understanding benchmarks are reported to show consistent gains in hallucination robustness across multiple Video-LLM families while preserving competitive understanding performance and inference speed.

Significance. If the claimed causal link between anchor-frame dominance and hallucinations is substantiated, the work supplies a lightweight, training-free, and architecture-agnostic mitigation strategy that avoids auxiliary models or input modifications. The emphasis on decoder-side rebalancing without retraining or efficiency loss is a practical strength for deployment of existing Video-LLMs.

major comments (2)
  1. [§3] §3 (Motivation and Observations): The central assertion that the anchor-frame bias 'is largely independent of the input video' and 'reflects a persistent, model-specific structural or positional bias' is presented as an empirical finding but is not supported by quantitative measurements (e.g., variance or entropy of the anchor index across videos, mutual information between anchor location and video features, or controlled ablations that isolate dominance from other attention statistics). Without such metrics, it remains unclear whether DTR's benefit arises specifically from relaxing the claimed model-specific bias or from generic attention smoothing.
  2. [§4] §4 (Experiments): The abstract and experimental claims state that DTR 'consistently improves hallucination robustness' and 'preserves competitive video understanding performance,' yet no numerical values, baseline comparisons, statistical significance tests, or ablation results isolating the rebalancing component are provided in the summary sections. This absence makes it impossible to assess the magnitude or robustness of the reported gains relative to the weakest-assumption concern.
minor comments (2)
  1. [§3] Notation for 'aggregated frame-level attention mass' and the precise definition of the anchor frame should be formalized with an equation early in §3 to avoid ambiguity when describing the rebalancing operation.
  2. [§4] The paper would benefit from a short table summarizing the exact Video-LLM families, benchmark names, and hallucination metrics used, even if full results appear later.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our empirical observations and experimental results. We address each major comment below and outline targeted revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Motivation and Observations): The central assertion that the anchor-frame bias 'is largely independent of the input video' and 'reflects a persistent, model-specific structural or positional bias' is presented as an empirical finding but is not supported by quantitative measurements (e.g., variance or entropy of the anchor index across videos, mutual information between anchor location and video features, or controlled ablations that isolate dominance from other attention statistics). Without such metrics, it remains unclear whether DTR's benefit arises specifically from relaxing the claimed model-specific bias or from generic attention smoothing.

    Authors: We agree that additional quantitative support would strengthen the claim. In the current manuscript, §3 presents the observation through consistent anchor-frame patterns across diverse video inputs (visualized in Figure 2 and accompanying examples), which appear independent of content. However, we did not report explicit metrics such as variance/entropy of the anchor index or mutual information with video features. We will add these quantitative measurements in the revised §3, along with a controlled ablation comparing DTR against generic attention smoothing baselines (one such baseline is sketched after these responses). This will better isolate the effect of relaxing the model-specific bias and substantiate the motivation for DTR. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract and experimental claims state that DTR 'consistently improves hallucination robustness' and 'preserves competitive video understanding performance,' yet no numerical values, baseline comparisons, statistical significance tests, or ablation results isolating the rebalancing component are provided in the summary sections. This absence makes it impossible to assess the magnitude or robustness of the reported gains relative to the weakest-assumption concern.

    Authors: The detailed numerical results, baseline comparisons, and ablations isolating the rebalancing component are provided in §4 (including Tables 1–4 and Figure 4). However, we acknowledge that the abstract and high-level summary sections contain only qualitative claims without specific numbers or significance tests. In the revision, we will incorporate key quantitative improvements (e.g., hallucination reduction percentages and understanding benchmark deltas) into the abstract and add a concise summary of statistical significance and ablation outcomes in the introduction to allow readers to assess magnitude and robustness without immediately consulting the full experimental section. revision: yes
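
For concreteness, the "generic attention smoothing" baseline mentioned in the first response could be as simple as temperature-flattening the frame-level distribution with no reference to the anchor at all. The sketch below is a hypothetical ablation baseline under the same token-layout assumptions as earlier, not something the paper reports.

```python
# Hypothetical generic-smoothing baseline (no anchor logic), for ablation comparison only.
import torch

def smooth_frame_attention(attn: torch.Tensor, num_frames: int,
                           temperature: float = 2.0) -> torch.Tensor:
    tokens_per_frame = attn.shape[-1] // num_frames
    frame_mass = attn.mean(dim=0).reshape(num_frames, tokens_per_frame).sum(dim=1)
    total = frame_mass.sum()
    # Flatten the frame-level distribution by tempering it, then renormalize to the same total.
    flattened = frame_mass.clamp_min(1e-12) ** (1.0 / temperature)
    flattened = flattened * (total / flattened.sum())
    scale = (flattened / frame_mass.clamp_min(1e-12)).repeat_interleave(tokens_per_frame)
    return attn * scale
```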

Circularity Check

0 steps flagged

No circularity: empirical intervention validated externally

Full rationale

The paper identifies an observed decoder-side attention pattern (anchor frame), proposes a training-free rebalancing heuristic DTR applied at inference time, and reports performance gains on independent hallucination and video-understanding benchmarks. No equations, fitted parameters, or self-citations are shown that would make the claimed improvement or the bias-independence statement equivalent to the authors' own inputs by construction. The derivation chain is therefore validated against external test sets rather than being circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5597 in / 1071 out tokens · 33729 ms · 2026-05-10T14:51:03.354287+00:00 · methodology

discussion (0)

