pith · machine review for the scientific record

arxiv: 2605.00957 · v1 · submitted 2026-05-01 · 💻 cs.IR · cs.AI

Recognition: unknown

"I Don't Know" -- Towards Appropriate Trust with Certainty-Aware Retrieval Augmented Generation

Daan Di Scala, Maaike de Boer, Pınar Yolum

Pith reviewed 2026-05-09 18:27 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords certainty-aware RAG · appropriate trust · LLM uncertainty · retrieval augmented generation · non-objective questions · self-reflection · moral judgments

The pith

CERTA uses relevance between question, context, and answer to express uncertainty in RAG outputs and reduce overconfidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CERTA, a Retrieval Augmented Generation system that computes relevance scores across the input question, retrieved context, and generated answer to determine how certain the model should appear. It also presents the Certainty Benchmark, a set of 90 non-objective question-context pairs spanning factuality, preference, sycophancy, and morality, each paired with one of three context types (relevant, incomplete, or irrelevant). Experiments compare a baseline RAG system against CERTA variants on two LLMs and show the method flags uncertain answers, cuts over-agreement, and produces more cautious replies on moral prompts. The aim is to support appropriate trust by letting the model reflect on its own limitations rather than defaulting to confident-sounding text.

Core claim

CERTA enhances standard RAG by incorporating relevance scores computed between the question, the retrieved context, and the generated answer to modulate the certainty conveyed in the final response, enabling more reliable and honest behavior on non-objective questions.

What carries the argument

CERTA (Certainty Enhanced RAG for Trustworthy Answers), a system that uses relevance scores between question, context, and answer to determine and express uncertainty levels.
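
The paper, as summarized here, does not spell out how these relevance scores are computed, so the following is a minimal sketch of one plausible instantiation using sentence-embedding cosine similarity; the encoder name and the pairwise-similarity choice are assumptions, not details taken from the paper. The TruLens RAG Triad the authors build on [45] typically uses LLM-judged scores instead, so treat this purely as an illustration of the signal's shape.

    # Illustrative sketch only: embedding-based triad relevance, not the authors' implementation.
    # The encoder choice and the use of cosine similarity are assumptions.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def triad_scores(question: str, context: str, answer: str) -> dict[str, float]:
        """Pairwise relevance over the question-context-answer triad."""
        q, c, a = encoder.encode([question, context, answer])
        return {
            "context_relevance": cosine(q, c),  # does the retrieved context address the question?
            "groundedness": cosine(c, a),       # is the answer supported by the context?
            "answer_relevance": cosine(q, a),   # does the answer address the question?
        }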

If this is right

  • CERTA identifies answers that are uncertain.
  • It decreases cases of over-agreeing with the user.
  • It produces cautious behavior when prompted for moral judgments.
  • It supports appropriate trust levels across the four benchmark categories of factuality, preference, sycophancy, and morality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same relevance-based certainty signal could be tested on objective factual questions to see whether it still improves calibration.
  • Integration of CERTA-style scoring into production RAG pipelines might reduce user over-reliance on incorrect outputs.
  • The approach suggests that explicit uncertainty signaling could be combined with other self-reflection techniques already used in LLMs.

Load-bearing premise

Relevance scores computed between question, context, and answer serve as a reliable proxy for the actual uncertainty or correctness of the LLM-generated answer on non-objective questions.

What would settle it

A direct comparison on the Certainty Benchmark where human raters score answer correctness and trustworthiness independently of CERTA's relevance-based certainty ratings, checking whether the two measures align or diverge.
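
In code, that settling test is simply an agreement check between two independent per-item ratings. Below is a minimal sketch, assuming each benchmark item carries CERTA's relevance-based certainty and an independent mean human rating; both field names are invented for illustration, and no such human study is reported in the paper.

    # Hypothetical alignment check between CERTA's certainty proxy and human judgments.
    # Field names are invented; the paper reports no such human-rating study.
    from scipy.stats import spearmanr

    def proxy_alignment(items: list[dict]) -> tuple[float, float]:
        """Rank correlation between relevance-based certainty and human scores."""
        certa = [item["certa_certainty"] for item in items]     # proxy, e.g. in [0, 1]
        human = [item["human_trust_score"] for item in items]   # independent human rating
        rho, p_value = spearmanr(certa, human)
        return rho, p_value  # high rho: the proxy tracks human judgment; near zero: it diverges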

Figures

Figures reproduced from arXiv: 2605.00957 by Daan Di Scala, Maaike de Boer, Pınar Yolum.

Figure 1. Non-objective question ("Is it more important to be honest or kind?") answered confidently by ChatGPT-4o.
Figure 2. RAG setup with RAG Triad components. Yellow boxes indicate LLM components (retriever and generator); dark blue arrows indicate Triad components (question-context-answer relevance).
Figure 3. Architecture of RAG approaches: baseline RAG and three CERTA approaches that act as certainty-including extensions.
Figure 4. Dashboard with CERTA options and chat window.
Figure 5. Dashboard showing the CERTA-2 (Mistral) answer to a Factuality question with relevant context, including confidence scores.
Figure 6. Dashboard showing the RAG (GPT) answer to a Personal Preference question with irrelevant context, including confidence scores.
Figure 7. Dashboard showing the CERTA-1 (GPT) answer to a Moral Choices question with incomplete context, including confidence scores.
read the original abstract

Achieving the right amount of trust in AI systems is important, but challenging. The problem is exacerbated with the rise of Large Language Models (LLMs) as they provide human-level communication capabilities, but potentially hallucinate in the content that they generate. Moreover, they express over-confidence in their answers, making it difficult for users to judge their truthfulness. An important human value that users seek is benevolence, which can be met by LLM's self-reflection leading to reliable and honest answers. Accordingly, this paper proposes conveying appropriate levels of self-reflected certainty to build appropriate trust. Our contributions are twofold: 1) We develop CERTA (Certainty Enhanced RAG for Trustworthy Answers), a specialized Retrieval Augmented Generation (RAG) system that incorporates the relevance between question, context, and answer to reflect its uncertainty in answering questions; 2) We create the Certainty Benchmark with 90 question-context pairs of non-objective questions, divided over four categories (factuality, preference, sycophancy, morality) and three types of contexts (relevant, incomplete, irrelevant). We run experiments with a baseline RAG system and three CERTA settings using two LLMs. Our evaluations indicate that CERTA helps identify answers that are uncertain, decreases the cases of over-agreeing, and provides cautious behavior when prompted for moral judgments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes CERTA, a specialized RAG system that uses relevance scores computed between a question, retrieved context, and generated answer as a proxy for the LLM's uncertainty. It introduces a new Certainty Benchmark consisting of 90 question-context pairs spanning four categories (factuality, preference, sycophancy, morality) and three context types (relevant, incomplete, irrelevant). Experiments compare a baseline RAG system against three CERTA variants on two LLMs, claiming that CERTA better identifies uncertain answers, reduces over-agreement, and elicits more cautious responses on moral judgments.

Significance. If the relevance-based certainty proxy can be shown to correlate with actual epistemic uncertainty or correctness, the work would offer a lightweight, retrieval-grounded mechanism for calibrating LLM confidence on non-objective queries. The dedicated benchmark for certainty assessment across subjective categories would also be a useful contribution to evaluation practices in trustworthy AI.

major comments (3)
  1. [Abstract] Abstract and evaluation description: the claim that CERTA 'helps identify answers that are uncertain' and 'decreases the cases of over-agreeing' is presented without any quantitative metrics, error bars, statistical tests, or raw counts of agreement rates, making the magnitude and reliability of the reported improvements impossible to assess.
  2. [Benchmark and Method] Benchmark construction and proxy definition (implicit in §3–4): relevance scores between question, context, and answer are treated as a direct proxy for certainty without any external calibration against human uncertainty judgments or known hallucination cases. For the preference, sycophancy, and morality categories, where no objective ground truth exists, this proxy risks conflating context mismatch with epistemic uncertainty.
  3. [Method] Implementation details (method section): no equations, pseudocode, or parameter settings are supplied for how raw relevance scores (embedding similarity or LLM judge) are thresholded or combined into the final certainty signal used to modulate the answer, preventing reproduction or sensitivity analysis.
minor comments (2)
  1. [Benchmark] The benchmark size of 90 pairs is modest; reporting per-category breakdowns and inter-annotator agreement (if any human validation was performed) would strengthen the evaluation.
  2. [Experiments] Clarify whether the same LLM is used both for generation and for relevance judging, as this could introduce circularity in the certainty estimation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate to improve clarity, reproducibility, and discussion of limitations.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation description: the claim that CERTA 'helps identify answers that are uncertain' and 'decreases the cases of over-agreeing' is presented without any quantitative metrics, error bars, statistical tests, or raw counts of agreement rates, making the magnitude and reliability of the reported improvements impossible to assess.

    Authors: We agree that the abstract and high-level evaluation description summarize results only qualitatively. The manuscript reports experimental comparisons between baseline RAG and CERTA variants across the 90-question Certainty Benchmark, but we acknowledge the absence of specific metrics in the abstract. We will revise the abstract to include key quantitative results, such as identification rates for uncertain answers and reductions in over-agreement, along with raw counts. Given the benchmark's modest size, we did not compute error bars or conduct formal statistical tests; we will add a brief note on result reliability in the revised text. revision: yes

  2. Referee: [Benchmark and Method] Benchmark construction and proxy definition (implicit in §3–4): relevance scores between question, context, and answer are treated as a direct proxy for certainty without any external calibration against human uncertainty judgments or known hallucination cases. For the preference, sycophancy, and morality categories, where no objective ground truth exists, this proxy risks conflating context mismatch with epistemic uncertainty.

    Authors: The relevance proxy is designed to flag cases where retrieved context is insufficient to support a confident response, serving as a practical, retrieval-grounded signal for uncertainty in RAG settings. We did not perform external calibration against human judgments or hallucination benchmarks, which is a limitation we recognize. For subjective categories, we view context mismatch as a valid trigger for expressing uncertainty rather than overconfident answers. We will expand the discussion in §3–4 to explicitly address this risk of conflation, clarify the proxy's scope as a lightweight heuristic rather than a calibrated epistemic measure, and note the absence of human validation as future work. revision: partial

  3. Referee: [Method] Implementation details (method section): no equations, pseudocode, or parameter settings are supplied for how raw relevance scores (embedding similarity or LLM judge) are thresholded or combined into the final certainty signal used to modulate the answer, preventing reproduction or sensitivity analysis.

    Authors: We agree that the method section omits the requested implementation details, which hinders reproducibility. We will add formal equations describing relevance score computation and combination into the certainty signal, pseudocode for the full CERTA pipeline (including how the signal modulates the generated answer), and the specific parameter settings and thresholds used for the three CERTA variants in our experiments with both LLMs. These additions will support reproduction and enable sensitivity analyses by readers. revision: yes
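
The promised equations are not in the text available here. Purely as an illustration of the kind of rule the rebuttal commits to specifying, one could imagine thresholding the weakest triad score into a certainty level that selects a hedging instruction for the generation prompt; the thresholds, the min-based combination, and the hedge wording below are invented, not the authors' method. The prompt shape echoes the CERTA-0 prompt shown in Figure 3.

    # Invented combination rule, shown only to illustrate what the promised pseudocode
    # might specify; thresholds and hedge wording are not taken from the paper.
    def certainty_level(scores: dict[str, float], low: float = 0.4, high: float = 0.7) -> str:
        combined = min(scores.values())  # weakest link in the triad; one possible choice
        if combined >= high:
            return "high"
        if combined >= low:
            return "medium"
        return "low"

    HEDGES = {
        "high": "Answer directly.",
        "medium": "Answer, but state that you are only moderately certain.",
        "low": "Say that you do not know or that the context is insufficient.",
    }

    def certainty_aware_prompt(question: str, context: str, scores: dict[str, float]) -> str:
        # Mirrors the CERTA-0 prompt shape from Figure 3, with a hedging instruction appended.
        return (
            f"Answer the following question: {question}, "
            f"based solely on the following context: {context}. "
            f"{HEDGES[certainty_level(scores)]}"
        )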

Circularity Check

0 steps flagged

No circularity in CERTA derivation or evaluations

full rationale

The paper introduces CERTA as a RAG variant that incorporates relevance scores between question, context, and answer to convey uncertainty, then evaluates it empirically on a newly created 90-pair Certainty Benchmark across four categories and three context types. No equations, fitted parameters, or self-citation chains appear in the provided text. The central claims rest on direct comparisons to a baseline RAG system using the external benchmark, with no reduction of predictions or uniqueness claims back to the method's own inputs by construction. This is a standard empirical systems paper with independent evaluation content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on an ad-hoc assumption that relevance between question, context, and answer accurately reflects answer uncertainty; no free parameters or invented physical entities are mentioned.

axioms (1)
  • [ad hoc to paper] Relevance between question, context, and answer can serve as a proxy for the certainty of the generated answer
    Core mechanism of CERTA; not derived from prior validated methods or first principles in the abstract.
invented entities (1)
  • CERTA [no independent evidence]
    purpose: Certainty-enhanced RAG system
    Newly proposed method without external validation or falsifiable handle outside the paper.

pith-pipeline@v0.9.0 · 5550 in / 1369 out tokens · 54296 ms · 2026-05-09T18:27:51.220359+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 17 canonical work pages · 5 internal anchors

  [1] Aydoğan, R., Kafali, Ö., Arslan, F., Jonker, C.M., Singh, M.P.: Nova: Value-based negotiation of norms. ACM Transactions on Intelligent Systems and Technology (TIST) 12(4), 1–29 (2021)
  [2] Aynetdinov, A., Akbik, A.: SemScore: Automated evaluation of instruction-tuned LLMs based on semantic textual similarity. arXiv preprint arXiv:2401.17072 (2024)
  [3] Bardi, A., Schwartz, S.H.: Values and behavior: Strength and structure of relations. Personality and Social Psychology Bulletin 29(10), 1207–1220 (2003)
  [4] Carro, M.V.: Flattering to deceive: The impact of sycophantic behavior on user trust in large language model. arXiv preprint arXiv:2412.02802 (2024)
  [5] Chowdhury, M., He, Y.V., Higham, A., Lim, E.: ASTRID: An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems. arXiv preprint arXiv:2501.08208 (2025)
  [6] Cima, L.R., De Jonge, D., Osman, N.: Towards the incorporation of social values in automated negotiation strategies. In: International Workshop on Value Engineering in AI. pp. 193–207. Springer (2024)
  [7] Cranefield, S., Winikoff, M., Dignum, V., Dignum, F.: No Pizza for You: Value-based plan selection in BDI agents. In: Twenty-Sixth IJCAI. pp. 178–184 (2017)
  [8] Deng, Y., Zhao, Y., Li, M., Ng, S.K., Chua, T.S.: Don't Just Say "I don't know"! Self-aligning large language models for responding to unknown questions with explanations. arXiv preprint arXiv:2402.15062 (2024)
  [9] Dey, P., Merugu, S., Kaveri, S.: Uncertainty-aware fusion: An ensemble framework for mitigating hallucinations in large language models. In: Companion Proceedings of the ACM on Web Conference 2025. pp. 947–951 (2025)
  [10] Erdogan, E., Aydın, H., Dignum, F., Verbrugge, R., Yolum, P.: Mitigating privacy conflicts with computational theory of mind. In: Proceedings of AAMAS. pp. 695–703 (2025)
  [11] Fischer, R., Schwartz, S.: Whence differences in value priorities? Individual, cultural, or artifactual sources. Journal of Cross-Cultural Psychology 42(7), 1127–1144 (2011)
  [12] Friedman, B., Kahn, P.H., Borning, A., Huldtgren, A.: Value sensitive design and information systems. Early engagement and new technologies: Opening up the laboratory, pp. 55–95 (2013)
  [13] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, H.: Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023)
  [14] Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., et al.: A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594 (2024)
  [15] Guerrero, E., Tzeng, S.T., Pastrav, C., Dignum, F.: Value-based decision-making in software agents: A systematic literature review. In: VALE. pp. 137–154. Springer (2025)
  [16] Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., Steinhardt, J.: Aligning AI with shared human values. arXiv preprint arXiv:2008.02275 (2020)
  [17] Huang, Y., Sun, L., Wang, H., Wu, S., Zhang, Q., Li, Y., Gao, C., Huang, Y., Lyu, W., Zhang, Y., et al.: Position: TrustLLM: Trustworthiness in large language models. In: International Conference on Machine Learning. pp. 20166–20270. PMLR (2024)
  [18] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Computing Surveys 55(12) (2023)
  [19] Kayal, A., Brinkman, W.P., Neerincx, M.A., Riemsdijk, M.B.V.: Automatic resolution of normative conflicts in supportive technology based on user values. ACM TOIT 18(4), 1–21 (2018)
  [20] Kim, S.S., Liao, Q.V., Vorvoreanu, M., Ballard, S., Vaughan, J.W.: "I'm Not Sure, But...": Examining the impact of large language models' uncertainty expression on user reliance and trust. In: Proceedings of the 2024 ACM FAccT. pp. 822–835 (2024)
  [21] Klingbeil, A., Grützner, C., Schreck, P.: Trust and reliance on AI: An experimental study on the extent and costs of overreliance on AI. Computers in Human Behavior 160, 108352 (2024)
  [22] König, S.: Automated evaluation and parameter optimisation for retrieval augmented generation systems. Master's thesis, TU Wien (2025)
  [23] Lin, S., Hilton, J., Evans, O.: TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958 (2021)
  [24] Liscio, E., van der Meer, M., Siebert, L.C., Jonker, C.M., Murukannaiah, P.K.: What values should an agent align with? An empirical comparison of general and context-specific values. Autonomous Agents and Multi-Agent Systems 36(1), 23 (2022)
  [25] Lyu, Y., Li, Z., Niu, S., Xiong, F., Tang, B., Wang, W., Wu, H., Liu, H., Xu, T., Chen, E.: CRUD-RAG: A comprehensive Chinese benchmark for retrieval-augmented generation of large language models. ACM Transactions on Information Systems 43(2), 1–32 (2025)
  [26] Mehrotra, S., Degachi, C., Vereschak, O., Jonker, C.M., Tielman, M.L.: A systematic review on fostering appropriate trust in human-AI interaction: Trends, opportunities and challenges. ACM Journal on Responsible Computing 1(4), 1–45 (2024)
  [27] Mehrotra, S., Jonker, C.M., Tielman, M.L.: More similar values, more trust? The effect of value similarity on trust in human-agent interaction. In: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. pp. 777–783 (2021)
  [28] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
  [29] Mistral Small 3.1 Model. https://mistral.ai/news/mistral-small-3-1 [Accessed 10-10-2025]
  [30] Mortaheb, M., Khojastepour, M.A.A., Chakradhar, S.T., Ulukus, S.: RAG-Check: Evaluating multimodal retrieval augmented generation performance. arXiv preprint arXiv:2501.03995 (2025)
  [31] OpenAI Platform ChatGPT-4o Latest. https://platform.openai.com/docs/models/chatgpt-4o-latest [Accessed 10-10-2025]
  [32] OpenAI Platform GPT-4.1 Model. https://platform.openai.com/docs/models/gpt-4.1/ [Accessed 10-10-2025]
  [33] OpenAI Platform Vector Embedding Models. https://platform.openai.com/docs/guides/embeddings [Accessed 10-10-2025]
  [34] Padhi, I., Ramamurthy, K.N., Sattigeri, P., Nagireddy, M., Dognin, P., Varshney, K.R.: Value alignment from unstructured text. arXiv preprint arXiv:2408.10392 (2024)
  [35] Passi, S., Vorvoreanu, M.: Overreliance on AI literature review. Microsoft Research 339, 340 (2022)
  [36] Van de Poel, I.: Translating values into design requirements. Philosophy and engineering: Reflections on practice, principles and process, pp. 253–266 (2013)
  [37] Saad-Falcon, J., Khattab, O., Potts, C., Zaharia, M.: ARES: An automated evaluation framework for retrieval-augmented generation systems. arXiv preprint arXiv:2311.09476 (2023)
  [38] Schwartz, S.H.: An overview of the Schwartz theory of basic values. Online Readings in Psychology and Culture 2(1), 11 (2012)
  [39] Schwartz, S.H., Cieciuch, J., Vecchione, M., Davidov, E., Fischer, R., Beierlein, C., Ramos, A., Verkasalo, M., Lönnqvist, J.E., Demirutku, K., et al.: Refining the theory of basic individual values. Journal of Personality and Social Psychology 103(4), 663 (2012)
  [40] Serramia, M., Rodriguez-Soto, M., Lopez-Sanchez, M., Rodriguez-Aguilar, J.A., Bistaffa, F., Boddington, P., Wooldridge, M., Ansotegui, C.: Encoding ethics to compute value-aligned norms. Minds and Machines 33(4), 761–790 (2023)
  [41] Shen, T., Jin, R., Huang, Y., Liu, C., Dong, W., Guo, Z., Wu, X., Liu, Y., Xiong, D.: Large language model alignment: A survey. arXiv preprint arXiv:2309.15025 (2023)
  [42] Siegrist, M., Cvetkovich, G., Roth, C.: Salient value similarity, social trust, and risk/benefit perception. Risk Analysis 20(3), 353–362 (2000)
  [43] Sun, Z., Shen, Y., Zhou, Q., Zhang, H., Chen, Z., Cox, D., Yang, Y., Gan, C.: Principle-driven self-alignment of language models from scratch with minimal human supervision. Advances in Neural Information Processing Systems 36, 2511–2565 (2023)
  [44] Tao, L., Yeh, Y.F., Dong, M., Huang, T., Torr, P., Xu, C.: Revisiting uncertainty estimation and calibration of large language models. arXiv preprint arXiv:2505.23854 (2025)
  [45] The RAG Triad. https://www.trulens.org/getting_started/core_concepts/rag_triad/ (2023) [Accessed 10-10-2025]
  [46] Von Eschenbach, W.J.: Transparency and the black box problem: Why we do not trust AI. Philosophy & Technology 34(4), 1607–1622 (2021)
  [47] Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., Hooi, B.: Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063 (2023)
  [48] Yokoi, R., Nakayachi, K.: The effect of value similarity on trust in the automation systems: A case of transportation and medical care. International Journal of Human-Computer Interaction 37(13), 1269–1282 (2021)
  [49] Yona, G., Aharoni, R., Geva, M.: Can large language models faithfully express their intrinsic uncertainty in words? arXiv preprint arXiv:2405.16908 (2024)
  [50] Yu, H., Gan, A., Zhang, K., Tong, S., Liu, Q., Liu, Z.: Evaluation of retrieval-augmented generation: A survey. In: CCF Conference on Big Data. pp. 102–120. Springer (2024)
  [51] Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y., Saied, A., Chen, W., Duan, N.: AGIEval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364 (2023)