
arxiv: 2605.29956 · v1 · pith:WEWLPRJYnew · submitted 2026-05-28 · 💻 cs.IR

Uncertainty Quantification for Multimodal Retrieval Augmented Generation

Pith reviewed 2026-06-29 05:19 UTC · model grok-4.3

classification 💻 cs.IR
keywords uncertainty quantification · multimodal RAG · vision-language models · token probability signals · LeMUQ · AUROC · retrieval augmented generation · input ablation

The pith

LeMUQ quantifies uncertainty in multimodal RAG by feeding token probabilities from modality and context removals into a finetuned model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard uncertainty methods fail in multimodal retrieval-augmented generation because they ignore interactions between vision, text, and retrieved passages. LeMUQ addresses this by computing token probabilities on modified inputs that drop either the image, the retrieved context, or both, then encoding those probability sequences as special tokens for a separate finetuned predictor. A sympathetic reader would care because better uncertainty estimates could flag unreliable answers before they reach users in visual question-answering pipelines. The work reports a 3.8 percent average AUROC gain over prior baselines across multiple datasets, retrievers, and vision-language models. It also claims the learned signals generalize across retrieval setups while showing mixed transfer when the underlying vision-language model changes.

Core claim

LeMUQ improves uncertainty estimation in multimodal RAG by analyzing token-level probabilities under controlled input ablations that remove visual information, retrieved passages, or both, then finetuning a separate model on these probability sequences to predict answer correctness.

What carries the argument

LeMUQ encodes probability sequences obtained after modality or context removal as probability tokens and passes them through a finetuned model that learns to combine multimodal and retrieval-aware uncertainty signals.
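
As a rough illustration of the encoding step described above, the sketch below turns per-token probabilities from four input variants into discrete probability tokens. The variant names, the bin count, and the `<p_k>` token format are assumptions made for this example; the abstract does not specify how LeMUQ performs the encoding, so this should not be read as the authors' implementation.

```python
from typing import Dict, List

# Hypothetical variant names: the abstract only says that modalities
# or retrieved context are removed, not how the variants are labeled.
VARIANTS = ["full", "no_image", "no_context", "no_image_no_context"]


def to_probability_tokens(probs: List[float], num_bins: int = 10) -> List[str]:
    """Map each token probability in [0, 1] to a discrete bin token."""
    tokens = []
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        # p == 1.0 falls into the last bin instead of overflowing.
        bin_index = min(int(p * num_bins), num_bins - 1)
        tokens.append(f"<p_{bin_index}>")
    return tokens


def build_predictor_input(signals: Dict[str, List[float]]) -> List[str]:
    """Concatenate the probability tokens of all variants, each
    introduced by a marker token naming the variant."""
    sequence = []
    for name in VARIANTS:
        sequence.append(f"<{name}>")
        sequence.extend(to_probability_tokens(signals[name]))
    return sequence


# Made-up probabilities of the same answer tokens under each variant.
example = {
    "full": [0.92, 0.85],
    "no_image": [0.40, 0.35],
    "no_context": [0.88, 0.15],
    "no_image_no_context": [0.10, 0.05],
}
print(build_predictor_input(example))
```

In a setup like this, the resulting token sequence would be the input to the finetuned predictor, with the answer's correctness as the training label. Binning is only one possible encoding; it was chosen here because it keeps the example free of external dependencies.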

If this is right

  • LeMUQ yields consistent AUROC gains over both non-learned and finetuned uncertainty baselines across evaluated datasets and retrievers.
  • The method generalizes reliably when the retriever or dataset changes but shows mixed results when the underlying vision-language model is swapped.
  • The approach isolates uncertainty contributions from visual understanding, retrieval quality, and generation by design.
  • The resulting uncertainty scores can be used to decide whether to trust or reject a generated answer in multimodal RAG pipelines.
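
The last point above, using uncertainty scores to trust or reject an answer, can be sketched as a simple threshold rule. The function name, the threshold value, and the scores below are illustrative assumptions and do not come from the paper.

```python
from typing import List, Tuple


def selective_answering(
    uncertainties: List[float], correct: List[bool], threshold: float
) -> Tuple[float, float]:
    """Return (coverage, accuracy on answered queries) when answers with
    uncertainty above `threshold` are rejected."""
    if len(uncertainties) != len(correct):
        raise ValueError("inputs must have the same length")
    # Keep the correctness labels of the answers that pass the threshold.
    kept = [c for u, c in zip(uncertainties, correct) if u <= threshold]
    coverage = len(kept) / len(correct) if correct else 0.0
    accuracy = sum(kept) / len(kept) if kept else 0.0
    return coverage, accuracy


# Made-up scores: higher means less trustworthy.
scores = [0.1, 0.2, 0.7, 0.9]
labels = [True, True, False, True]
print(selective_answering(scores, labels, threshold=0.5))  # (0.5, 1.0)
```

Lowering the threshold trades coverage for accuracy on the answers that remain, which is the trade-off a deployment would need to tune.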

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The ablation-based signals may also help diagnose which stage of the RAG pipeline is most responsible for an error on any given query.
  • Similar probability-token encodings could be applied to non-RAG multimodal tasks such as visual captioning or document understanding.
  • Combining LeMUQ signals with calibration techniques that operate on the final answer distribution might further tighten the uncertainty estimates.

Load-bearing premise

Signals from simply removing modalities or retrieved context are sufficient to expose the uncertainty interactions that matter, and a model trained on those signals will generalize to new retrieval setups and vision-language models.

What would settle it

A test on a previously unseen vision-language model or retrieval corpus where LeMUQ produces no AUROC improvement over a strong baseline that uses only the original token probabilities.
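
A comparison of this kind comes down to computing AUROC for two scorers on the same labeled answers. The sketch below uses the pairwise-ranking definition of AUROC in plain Python; the two score lists are invented for illustration and are not results from the paper.

```python
from typing import List


def auroc(confidences: List[float], correct: List[bool]) -> float:
    """Probability that a random correct answer gets a higher confidence
    than a random incorrect one (ties count as one half)."""
    pos = [s for s, c in zip(confidences, correct) if c]
    neg = [s for s, c in zip(confidences, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [True, True, False, False]
# Hypothetical confidences from the original token probabilities only.
baseline = [0.9, 0.4, 0.6, 0.2]
# Hypothetical confidences from an ablation-aware predictor.
candidate = [0.9, 0.7, 0.6, 0.2]
print(auroc(baseline, labels), auroc(candidate, labels))  # 0.75 1.0
```

The falsifying outcome described above corresponds to the candidate's AUROC being no higher than the baseline's on the unseen model or corpus. The quadratic loop is fine for small evaluation sets; a rank-based implementation would be preferable at scale.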

Figures

Figures reproduced from arXiv: 2605.29956 by Faegheh Hasibi, Heydar Soudani, Simon Binz.

Figure 1: Overview of the LeMUQ pipeline. (1) A VLM receives the query, input image, and retrieved context, and generates a ... (caption truncated in source)
Figure 2: Recall@1 performance of retrievers on the train ... (caption truncated in source)

Original abstract

Retrieval Augmented Generation (RAG) improves the question answering capabilities of Large Language Models (LLMs) by incorporating external knowledge and has recently been extended to multimodal settings through Vision-Language Models (VLMs) that integrate visual and textual information. Despite these advances, generated answers can still be incorrect or misleading. Uncertainty Quantification (UQ) methods aim to estimate the reliability of model outputs, but most existing approaches are designed for text-only models and perform poorly in multimodal RAG scenarios. A key challenge is capturing uncertainty arising from multiple stages of the pipeline, including retrieval, visual understanding, and generation. In this work, we show that modeling uncertainty using multimodal and retrieval-aware probability signals improves estimation in multimodal RAG systems. We introduce LeMUQ, a Learnable Multimodal UQ method that analyzes token probabilities under input modifications, such as removing modalities or retrieved context. By encoding these signals as probability tokens and processing them with a finetuned model, our approach captures interactions between modalities and retrieval. Experiments across datasets, retrievers, and VLMs show consistent improvements over baseline and finetuned UQ methods. Our proposed LeMUQ increases the AUROC metric by 3.8% on average. Additionally, our method shows strong generalization performance across different retrieval setups and datasets with mixed results when transferring across different VLMs. Our findings highlight the importance of modeling multimodal uncertainty and provide a step toward more reliable and safer multimodal RAG systems. Code is available on GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces LeMUQ, a learnable multimodal uncertainty quantification method for RAG with VLMs. It extracts token-probability signals by ablating modalities or retrieved context, encodes them as probability tokens, and trains a small model on these signals to estimate answer reliability. The central claim is that this captures modality-retrieval interactions better than baselines, yielding a 3.8% average AUROC gain, consistent improvements across datasets/retrievers/VLMs, and strong generalization to new retrieval setups and datasets (with mixed cross-VLM transfer).

Significance. If the empirical gains are robust, the work would be a useful step toward UQ tailored to multimodal RAG pipelines, where standard token-probability or text-only methods are known to underperform. The use of held-out data for evaluation and public code release are positive for reproducibility.

major comments (2)
  1. [Abstract] The claim that the finetuned model 'captures interactions between modalities and retrieval' is undermined by the immediately following statement of 'mixed results when transferring across different VLMs'. If the learned signals are VLM-specific rather than reflecting general multimodal uncertainty structure, the 3.8% AUROC improvement and generalization narrative hold only within-VLM and do not support the modeling contribution for multimodal RAG systems at large.
  2. [Experiments, transfer results] The mixed cross-VLM transfer performance directly contradicts the assumption that token-probability signals obtained by modality/context removal are sufficient to capture relevant uncertainty interactions that generalize reliably across VLMs; this is load-bearing for the central claim that LeMUQ improves estimation in multimodal RAG systems.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address the two major comments point-by-point below, providing clarifications on the scope of our claims and the interpretation of the transfer results while remaining faithful to the reported experiments.

  1. Referee: [Abstract] The claim that the finetuned model 'captures interactions between modalities and retrieval' is undermined by the immediately following statement of 'mixed results when transferring across different VLMs'. If the learned signals are VLM-specific rather than reflecting general multimodal uncertainty structure, the 3.8% AUROC improvement and generalization narrative hold only within-VLM and do not support the modeling contribution for multimodal RAG systems at large.

    Authors: The abstract claim refers to LeMUQ's design, which explicitly constructs probability signals via modality and context ablations and trains a model to learn from their joint patterns; this is what enables it to capture modality-retrieval interactions within a given VLM pipeline, as demonstrated by the consistent AUROC gains when training and evaluating on the same VLM. The paper already reports mixed cross-VLM transfer transparently, indicating that while the underlying signals are informative, the learned mapping can exhibit VLM-specific characteristics due to differences in token probability distributions across models. This does not undermine the within-VLM contribution or the 3.8% average improvement, which holds across multiple VLMs, datasets, and retrievers. We agree that the abstract wording could be tightened to avoid any implication of universal cross-VLM generalization.

    revision: partial

  2. Referee: [Experiments, transfer results] The mixed cross-VLM transfer performance directly contradicts the assumption that token-probability signals obtained by modality/context removal are sufficient to capture relevant uncertainty interactions that generalize reliably across VLMs; this is load-bearing for the central claim that LeMUQ improves estimation in multimodal RAG systems.

    Authors: The signals from modality and context removal are sufficient to capture relevant interactions for a given VLM, as shown by the strong within-VLM results and the fact that LeMUQ outperforms baselines that do not model these signals jointly. The mixed transfer results are consistent with the expectation that different VLMs produce distinct probability behaviors, so a model trained on one VLM's signals may not transfer perfectly; this is why the paper explicitly states 'mixed results when transferring across different VLMs' rather than claiming broad cross-VLM generalization. The core contribution remains the learnable multimodal UQ approach that improves estimation in multimodal RAG, with the transfer experiments serving to delineate its practical scope rather than contradict the method's validity.

    revision: no

Circularity Check

0 steps flagged

Empirical method evaluated on held-out data; no circularity in derivation


The paper introduces LeMUQ as a finetuned model trained on token-probability signals obtained by input modifications (modality or context removal). Performance is reported via AUROC gains measured on held-out data across datasets, retrievers, and VLMs. No load-bearing step reduces a claimed prediction or result to the same fitted quantities by construction, nor relies on self-citation chains or self-definitional mappings. The method is a standard supervised learning pipeline whose outputs are falsifiable on external test distributions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond standard machine-learning assumptions such as the existence of a learnable mapping from probability signals to uncertainty scores.


