Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 18:43 UTC · model grok-4.3
The pith
Vision-language models hallucinate less when low-attention tokens are suppressed during the focus phase of visual processing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hallucination behavior in LVLMs is particularly sensitive to tokens that receive low attention during the focus phase of a consistent three-phase attention structure (diffusion, focus, rediffusion) in vision encoders. Selectively suppressing such tokens during the focus phase via a Determinantal Point Process reduces hallucination metrics while maintaining competitive caption quality, in a lightweight, training-free manner.
What carries the argument
The three-phase attention structure (diffusion, focus, rediffusion) in vision encoders, with selective suppression of low-attention tokens in the focus phase using a Determinantal Point Process to preserve visual diversity.
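The DPP-based filtering step can be sketched in code. The paper does not publish its implementation, so the function names, the greedy MAP selection, and the quantile-based low-attention cutoff below are assumptions: a minimal illustration of "suppress low-attention tokens while keeping a diverse subset," not the authors' method.

```python
import numpy as np

def greedy_dpp_select(features: np.ndarray, k: int) -> list[int]:
    """Greedy MAP inference for a DPP with kernel L = F F^T.

    Repeatedly picks the token whose addition most increases the
    determinant of the selected submatrix, i.e. the most diverse
    remaining token (fast greedy algorithm in the style of Chen et
    al., 2018 -- an assumption; the paper's exact DPP inference is
    not specified here).
    """
    n = features.shape[0]
    L = features @ features.T                      # similarity kernel
    d2 = np.diag(L).astype(float).copy()           # marginal gains
    cis = np.zeros((k, n))
    selected: list[int] = []
    for i in range(k):
        j = int(np.argmax(d2))
        if d2[j] < 1e-10:
            break
        selected.append(j)
        # Cholesky-style update of the remaining marginal gains.
        ci = (L[j] - cis[:i].T @ cis[:i, j]) / np.sqrt(d2[j])
        cis[i] = ci
        d2 = d2 - ci ** 2
        d2[j] = -np.inf                            # never reselect
    return selected

def suppress_low_attention(features, attn, keep_ratio=0.5, dpp_keep=0.25):
    """Hypothetical focus-phase filter: tokens below the attention
    quantile `keep_ratio` are suppression candidates; a DPP keeps a
    diverse fraction `dpp_keep` of them and the rest are masked out.
    Both thresholds are illustrative free parameters."""
    thresh = np.quantile(attn, keep_ratio)
    low = np.where(attn < thresh)[0]
    mask = np.ones(len(attn), dtype=bool)
    if len(low) == 0:
        return mask
    keep_n = max(1, int(len(low) * dpp_keep))
    kept = [low[i] for i in greedy_dpp_select(features[low], keep_n)]
    mask[low] = False
    mask[kept] = True
    return mask
```

High-attention tokens always survive; among low-attention tokens, the DPP trades redundancy for coverage, which is what the review credits for the preserved caption quality.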
If this is right
- Hallucination reduction improves reliability of generated descriptions for downstream tasks such as image captioning and visual question answering.
- The inference-time, training-free design allows direct application to existing deployed models without retraining costs.
- Comparable hallucination mitigation to heavier adversarial methods is achieved with negligible extra latency.
- The approach generalizes across multiple vision-language model architectures and different decoding strategies.
Where Pith is reading between the lines
- If the phase structure generalizes, similar phase-aware suppression could be tested on other multimodal architectures such as audio-language or video-language models.
- Task-specific tuning of the suppression threshold might further improve results on visual reasoning benchmarks beyond captioning.
- Combining this focus-phase filter with encoder-level uncertainty methods could yield additive gains in hallucination control.
Load-bearing premise
The three-phase attention structure is consistent across LVLM backbones and decoding strategies, and suppressing low-attention tokens in the focus phase reduces hallucinations without introducing new errors or degrading visual reasoning capabilities.
What would settle it
Apply the suppression method to an untested LVLM backbone, verify whether the diffusion-focus-rediffusion phases remain identifiable, and check if hallucination rates drop without loss in caption quality or introduction of new errors.
read the original abstract
Large Vision-Language Models (LVLMs) have achieved impressive progress in multimodal reasoning, yet they remain prone to object hallucinations, generating descriptions of objects that are not present in the input image. Recent approaches attempt to mitigate hallucinations by suppressing unreliable visual signals in the vision encoder, but many rely on iterative optimization for each input, resulting in substantial inference latency. In this work, we investigate the internal attention dynamics of vision encoders in LVLMs and identify a consistent three-phase structure of visual information processing: diffusion, focus, and rediffusion. Our analysis reveals that hallucination behavior is particularly sensitive to tokens receiving low attention during the focus phase. Motivated by this observation, we propose a lightweight inference-time intervention that selectively suppresses such tokens during the focus phase. The method operates in a training-free manner using statistics from a single forward pass and employs a Determinantal Point Process (DPP) to preserve diverse visual cues while filtering redundant tokens. Extensive experiments across multiple LVLM backbones and decoding strategies demonstrate that the proposed approach consistently reduces hallucination metrics while maintaining competitive caption quality. Moreover, compared to adversarial uncertainty estimation methods, our approach achieves comparable hallucination mitigation with negligible additional inference latency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that vision encoders in LVLMs exhibit a consistent three-phase attention structure (diffusion, focus, rediffusion) identifiable from a single forward pass. Hallucination behavior is particularly sensitive to low-attention tokens during the focus phase. The authors introduce a training-free inference-time intervention that applies Determinantal Point Process (DPP) suppression to those tokens in the focus phase, reporting consistent reductions in hallucination metrics while preserving caption quality across multiple backbones and decoding strategies, with negligible added latency compared to optimization-based baselines.
Significance. If the phase structure and intervention prove robust, the work offers a practical, efficient advance in hallucination mitigation for LVLMs. Notable strengths include the training-free design, reliance on single-forward-pass input statistics rather than iterative optimization, and the use of DPP to preserve visual diversity, which supports competitive caption quality.
major comments (2)
- [§3] The demarcation of the focus phase lacks a formal, reproducible definition. No explicit criterion (e.g., layer range, attention entropy threshold, or head selection rule) is stated for identifying the focus phase from the attention maps computed in a single forward pass, rendering the central mechanism difficult to verify or transfer.
- [§4] The experimental claims of consistent gains across backbones rest on reported metric reductions, but the manuscript provides insufficient detail on statistical significance testing, run-to-run variance, and uniform application of data exclusion rules, which are necessary to substantiate the robustness assertions.
minor comments (2)
- [Figure 2] Attention visualizations would be clearer with explicit annotations marking the identified phase boundaries.
- A brief limitations paragraph discussing cases where the three-phase structure may not hold would improve completeness.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity, reproducibility, and statistical detail where appropriate.
read point-by-point responses
-
Referee: [§3] The demarcation of the focus phase lacks a formal, reproducible definition. No explicit criterion (e.g., layer range, attention entropy threshold, or head selection rule) is stated for identifying the focus phase from the attention maps computed in a single forward pass, rendering the central mechanism difficult to verify or transfer.
Authors: We agree that an explicit, reproducible definition is required. The phases were identified by consistent patterns in per-layer attention entropy computed from a single forward pass, with the focus phase corresponding to the middle layers exhibiting a sharp drop in entropy relative to early (diffusion) and late (rediffusion) layers. In the revision we will add a formal criterion: the focus phase is the contiguous layer range where normalized attention entropy falls below 0.5 and remains stable across heads (averaged over all heads). We will include pseudocode in §3 and report the exact layer indices per backbone for verification. revision: yes
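The rebuttal's criterion (focus phase = contiguous middle layers whose normalized attention entropy drops below 0.5) can be sketched directly. This is a reconstruction from the rebuttal text, not the authors' code; the function names and the longest-run tie-breaking rule are assumptions.

```python
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """Normalized Shannon entropy of an attention map of shape
    (heads, queries, keys), averaged over heads and queries;
    1.0 means uniform attention, 0.0 means one-hot."""
    p = np.clip(attn, 1e-12, None)
    p = p / p.sum(axis=-1, keepdims=True)
    h = -(p * np.log(p)).sum(axis=-1)              # entropy per query
    return float(h.mean() / np.log(p.shape[-1]))   # normalize to [0, 1]

def find_focus_phase(per_layer_attn, threshold=0.5):
    """Return (start, end) layer indices of the longest contiguous
    range whose normalized entropy falls below `threshold`. Layers
    before it would be the diffusion phase, layers after it the
    rediffusion phase. Returns None if no layer qualifies."""
    ent = [attention_entropy(a) for a in per_layer_attn]
    best, cur = None, None
    for i, e in enumerate(ent):
        if e < threshold:
            cur = (cur[0], i) if cur else (i, i)
            if best is None or cur[1] - cur[0] > best[1] - best[0]:
                best = cur
        else:
            cur = None
    return best
```

On synthetic maps (uniform early layers, sharply peaked middle layers, uniform late layers), this recovers the peaked middle range, which is the shape of the claimed three-phase structure.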
-
Referee: [§4] The experimental claims of consistent gains across backbones rest on reported metric reductions, but the manuscript provides insufficient detail on statistical significance testing, run-to-run variance, and uniform application of data exclusion rules, which are necessary to substantiate the robustness assertions.
Authors: We acknowledge that additional statistical reporting is needed to support the robustness claims. The original experiments used fixed random seeds and followed the standard data splits and exclusion rules of POPE and CHAIR without further filtering. In the revision we will report standard deviations over three independent runs for all metrics, include p-values from paired t-tests for claimed improvements, and explicitly state the data exclusion protocol in §4. These additions will be placed in the experimental results tables and text. revision: yes
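The promised paired t-tests over matched seeds amount to a short computation. The sketch below uses only the stdlib; the metric values in the test are hypothetical placeholders, not results from the paper.

```python
import math
import statistics

def paired_t_statistic(baseline: list[float], method: list[float]):
    """Paired t-statistic for per-run metric pairs, e.g. CHAIR scores
    from matched random seeds. Returns (mean difference, sample std of
    differences, t). In practice the p-value would come from a t CDF
    with n-1 degrees of freedom (e.g. scipy.stats.ttest_rel)."""
    diffs = [b - m for b, m in zip(baseline, method)]
    n = len(diffs)
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)          # ddof=1 sample std
    t = mean_d / (sd_d / math.sqrt(n))
    return mean_d, sd_d, t
```

With three runs (n = 3, so 2 degrees of freedom), the two-tailed 5% critical value is about 4.303, so the t-statistic must clear a high bar, which is exactly why reporting run-to-run variance matters for the robustness claim.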
Circularity Check
No significant circularity; the derivation is empirical observation plus a direct intervention.
full rationale
The paper identifies a three-phase attention structure via direct analysis of internal dynamics from a single forward pass on the input, then applies a DPP-based suppression motivated by that observation. No step reduces by construction to a fitted parameter, self-referential definition, or load-bearing self-citation chain. The intervention uses per-input statistics and does not rename known results or smuggle ansatzes. The central claim remains independent of its own outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- low-attention threshold
axioms (1)
- domain assumption: Vision encoders in LVLMs exhibit a consistent three-phase attention structure (diffusion, focus, rediffusion) across inputs and models.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
"we identify a consistent three-phase structure of visual information processing: diffusion, focus, and rediffusion... hallucination behavior is particularly sensitive to tokens receiving low attention during the focus phase"
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · J_uniquely_calibrated_via_higher_derivative · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
"The method operates in a training-free manner using statistics from a single forward pass and employs a Determinantal Point Process (DPP)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
[1] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)
[2] Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195 (2023)
[3] Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)
[4] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., et al.: Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023) 2(3), 6 (2023)
[5] Chuang, Y.S., Xie, Y., Luo, H., Kim, Y., Glass, J.R., He, P.: DoLa: Decoding by contrasting layers improves factuality in large language models. In: The Twelfth International Conference on Learning Representations (2024)
[6] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: The Ninth International Conference on Learning Representations (2021)
[7] Fu, Y., Xie, R., Sun, X., Kang, Z., Li, X.: Mitigating hallucination in multimodal large language model via hallucination-targeted direct preference optimization. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 16563–16577 (2025)
[8] Huang, Q., Dong, X., Zhang, P., Wang, B., He, C., Wang, J., Lin, D., Zhang, W., Yu, N.: OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13418–13427 (2024)
[9] Jiang, Z., Chen, J., Zhu, B., Luo, T., Shen, Y., Yang, X.: Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 25004–25014 (2025)
[10] Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13872–13882 (2024)
[11] Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 292–305. Association for Computational Linguistics (2023)
[12] Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., Peng, W.: A survey on hallucination in large vision-language models (2024)
[13] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning (2023)
[14] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
[15] Liu, S., Zheng, K., Chen, W.: Paying more attention to image: A training-free method for alleviating hallucination in LVLMs. In: European Conference on Computer Vision. pp. 125–140. Springer (2024)
[16] Macchi, O.: The coincidence approach to stochastic point processes. Advances in Applied Probability 7(1), 83–122 (1975)
[17] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: The Sixth International Conference on Learning Representations (2018)
[18] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
[19] Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 4035–4045 (2018)
[20] Seo, H., Kang, D.U., Cho, H., Lee, J., Chun, S.Y.: On epistemic uncertainty of visual tokens for object hallucinations in large vision-language models. Advances in Neural Information Processing Systems (2025)
[21] Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L., Wang, Y.X., Yang, Y., et al.: Aligning large multimodal models with factually augmented RLHF. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 13088–13110 (2024)
[22] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
[23] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
[24] Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., Gao, H., Ma, Y., Wu, C., Wang, B., Xie, Z., Wu, Y., Hu, K., Wang, J., Sun, Y., Li, Y., Piao, Y., Guan, K., Liu, A., Xie, X., You, Y., Dong, K., Yu, X., Zhang, H., Zhao, L., Wang, Y., Ruan, C.: DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding (2024)
[25] Xie, Y., Li, G., Xu, X., Kan, M.Y.: V-DPO: Mitigating hallucination in large vision language models via vision-guided direct preference optimization. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 13258–13273 (2024)
[26] Yang, Z., Luo, X., Han, D., Xu, Y., Li, D.: Mitigating hallucinations in large vision-language models via DPO: On-policy data hold the key. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10610–10620 (2025)
[27] Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11975–11986 (2023)
[28] Zhao, Z., Wang, B., Ouyang, L., Dong, X., Wang, J., He, C.: Beyond hallucinations: Enhancing LVLMs through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839 (2023)