pith. machine review for the scientific record.

arxiv: 2308.05374 · v2 · pith:ZGTKTZFWnew · submitted 2023-08-10 · 💻 cs.AI · cs.LG

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

Pith reviewed 2026-05-17 22:26 UTC · model grok-4.3

keywords LLM alignment · trustworthiness evaluation · large language models · reliability · safety · fairness · robustness · social norms

The pith

A survey finds that more aligned LLMs generally achieve higher trustworthiness, though the gains differ across categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys seven key categories of trustworthiness in large language models, expanding them into twenty-nine sub-categories to provide evaluation guidance. It selects eight sub-categories for concrete measurement experiments on several common LLMs. Results show that models with more alignment work tend to score better across trustworthiness measures overall. Yet this improvement is inconsistent, stronger in some areas than in others. This variation underscores the value of detailed, ongoing testing rather than assuming broad alignment fixes everything.

Core claim

The authors organize trustworthiness into seven categories (reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness) and measure eight sub-areas. They report that greater alignment correlates with better overall performance, but that its effectiveness depends on the category, which calls for finer-grained analysis and continued alignment refinement.

What carries the argument

The seven-category taxonomy with twenty-nine sub-categories that structures the survey and directs the selection of measurement studies.

If this is right

  • More aligned models can be expected to deliver higher overall trustworthiness in practice.
  • Alignment efforts must address variation by targeting specific categories separately.
  • Evaluation should include fine-grained tests rather than relying on general alignment metrics.
  • Deployment decisions benefit from checking performance across multiple trustworthiness dimensions.
  • Practitioners gain a structured guideline for iterating on LLM alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending measurements to additional sub-categories could confirm or refine the observed patterns.
  • The framework might apply to assessing trustworthiness in multimodal or other AI models beyond text-based LLMs.
  • Prioritizing categories where alignment shows weaker effects could improve overall system reliability.
  • Real-world deployment might reveal gaps not captured by the current sub-category selections.

Load-bearing premise

That the chosen seven categories, twenty-nine sub-categories, and the eight selected for measurement accurately represent the full scope of trustworthiness in real-world LLM use.

What would settle it

A replication study that applies alternative trustworthiness categories or different evaluation methods to the same models and finds no general advantage for aligned models, or uniform effects across categories, would challenge the main findings.

Original abstract

Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications. For instance, OpenAI devoted six months to iteratively aligning GPT-4 before its release [3]. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. This obstacle hinders systematic iteration and deployment of LLMs. To address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. Each major category is further divided into several sub-categories, resulting in a total of 29 sub-categories. Additionally, a subset of 8 sub-categories is selected for further investigation, where corresponding measurement studies are designed and conducted on several widely-used LLMs. The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered. This highlights the importance of conducting more fine-grained analyses, testing, and making continuous improvements on LLM alignment. By shedding light on these key dimensions of LLM trustworthiness, this paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial in achieving reliable and ethically sound deployment of LLMs in various applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper surveys seven major categories of LLM trustworthiness (reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness), subdivided into 29 sub-categories in total. It then selects a subset of 8 sub-categories, designs corresponding measurements, and applies them to several widely-used LLMs. The central empirical finding is that more aligned models tend to perform better overall in trustworthiness, although the effectiveness of alignment varies across categories. The work positions itself as providing guidance for systematic evaluation and improvement of LLM alignment.

Significance. If the directional findings hold after methodological clarification, the survey offers a structured taxonomy that consolidates key trustworthiness dimensions and supplies concrete measurement examples. The explicit enumeration of 29 sub-categories and the cross-model comparisons add practical value for practitioners seeking to iterate on alignment. The observation that alignment success is uneven across categories is a useful falsifiable pointer for future targeted work.

major comments (2)
  1. [Results / Empirical evaluation] Results section (and abstract): the claim that 'more aligned models tend to perform better in terms of overall trustworthiness' rests on an implicit ordering of the tested LLMs by alignment strength. The manuscript does not state an a-priori, externally validated ranking (e.g., base models vs. RLHF-tuned vs. further safety-tuned) constructed independently of the eight trustworthiness metrics; without this separation the reported positive trend risks circularity rather than confirmation.
  2. [Measurement studies] Measurement studies section: no explicit criteria are given for choosing the 8 sub-categories out of the 29, nor are the exact test implementations, prompt templates, or statistical controls described. These omissions leave the support for the directional claims only moderately strong and make replication or extension difficult.
minor comments (2)
  1. [Abstract] The abstract refers to 'several widely-used LLMs' without naming them; listing the specific models (and their versions) would improve immediate clarity.
  2. [Results figures/tables] Table or figure captions for the measurement results could more explicitly note the source of the alignment ordering used for the 'more aligned' vs. 'less aligned' comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and suggestions. We address each of the major comments below and indicate how we plan to revise the manuscript accordingly.

  1. Referee: [Results / Empirical evaluation] Results section (and abstract): the claim that 'more aligned models tend to perform better in terms of overall trustworthiness' rests on an implicit ordering of the tested LLMs by alignment strength. The manuscript does not state an a-priori, externally validated ranking (e.g., base models vs. RLHF-tuned vs. further safety-tuned) constructed independently of the eight trustworthiness metrics; without this separation the reported positive trend risks circularity rather than confirmation.

    Authors: We agree with the referee that an explicit, a-priori ordering of the models based on their alignment efforts, independent of our evaluation metrics, would strengthen the claim and avoid any appearance of circularity. In the revised manuscript, we will add a dedicated paragraph in the Results section (and update the abstract if necessary) that describes the alignment levels of the tested LLMs based on external information, such as their training procedures documented in official papers and announcements (e.g., distinguishing base models from those fine-tuned with RLHF or additional safety measures). This ordering will be presented prior to reporting the trustworthiness scores. revision: yes

  2. Referee: [Measurement studies] Measurement studies section: no explicit criteria are given for choosing the 8 sub-categories out of the 29, nor are the exact test implementations, prompt templates, or statistical controls described. These omissions leave the support for the directional claims only moderately strong and make replication or extension difficult.

    Authors: We acknowledge that the selection criteria for the 8 sub-categories and the detailed experimental setups were not sufficiently elaborated. We will revise the Measurement studies section to include explicit criteria for selection, such as coverage of different major categories, feasibility of automated evaluation, and importance for real-world applications. Furthermore, we will provide the exact prompt templates, evaluation protocols, and any statistical methods used in an appendix to enable full replication and extension by other researchers. revision: yes
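
The separation requested in major comment 1 can be sketched in a few lines of Python: fix an alignment ordering from training documentation before any scoring, then compute a per-category Spearman rank correlation against the independently measured scores. This is a minimal sketch, not the authors' protocol. The model names, tier values, category names, and scores below are hypothetical placeholders, not results from the paper, and the helper names `average_ranks`, `spearman`, and `trend_by_category` are introduced only for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical a-priori tiers, fixed from external training documentation
# BEFORE any trustworthiness score is looked at (higher = more alignment work).
ALIGNMENT_TIER = {"model_a": 0, "model_b": 1, "model_c": 2, "model_d": 3}

# Placeholder scores for illustration only; these are NOT the paper's results.
SCORES = {
    "category_x": {"model_a": 0.2, "model_b": 0.4, "model_c": 0.6, "model_d": 0.8},
    "category_y": {"model_a": 0.5, "model_b": 0.7, "model_c": 0.4, "model_d": 0.6},
}


def trend_by_category(tiers, scores):
    """Correlate the fixed tier ordering with each category's scores separately."""
    models = sorted(tiers)
    tier_values = [tiers[m] for m in models]
    return {
        category: spearman(tier_values, [per_model[m] for m in models])
        for category, per_model in scores.items()
    }


if __name__ == "__main__":
    for category, rho in trend_by_category(ALIGNMENT_TIER, SCORES).items():
        print(f"{category}: rho = {rho:+.2f}")
```

With these placeholder inputs the first category follows the tier ordering exactly and the second shows no monotonic trend. A per-category breakdown exposes that kind of category-dependent pattern, whereas a single aggregate score would hide it.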

Circularity Check

0 steps flagged

No circularity in survey review or empirical measurements


The paper is a literature survey that organizes LLM trustworthiness into seven categories and 29 sub-categories drawn from prior work, then performs new measurements on a selected subset of eight sub-categories across several LLMs. The central observation that more aligned models tend to perform better is an empirical comparison between models whose alignment status is established by external training history (e.g., base models versus those that received RLHF or safety tuning) and the independently collected trustworthiness scores. No equations, fitted parameters, or self-referential definitions are present; the ordering of models by alignment degree is not derived from the paper's own metrics. Self-citations exist as part of normal survey practice but are not load-bearing for any uniqueness claim or ansatz. The work is therefore self-contained against external benchmarks and contains no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the domain assumption that trustworthiness can be decomposed into the listed categories and that human intentions provide a stable reference for alignment; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Alignment refers to making models behave in accordance with human intentions.
    Explicitly stated in the opening sentence of the abstract as the definition of the central task.

pith-pipeline@v0.9.0 · 5620 in / 1157 out tokens · 36983 ms · 2026-05-17T22:26:47.271196+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LawOfExistence law_of_existence echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  2. Math Education Digital Shadows for facilitating learning with LLMs: Math performance, anxiety and confidence in simulated students and AIs

    cs.AI 2026-04 unverdicted novelty 7.0

    MEDS is a dataset of 28,000 LLM personas performing high-school math tasks alongside psychometric tests and cognitive networks that capture math anxiety, self-efficacy, and confidence to support safer AI tutors.

  3. Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

    cs.LG 2026-04 conditional novelty 7.0

    Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

  4. VoiceBench: Benchmarking LLM-Based Voice Assistants

    cs.CL 2024-10 unverdicted novelty 7.0

    VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.

  5. Domain Restriction via Multi SAE Layer Transitions

    cs.AI 2026-05 unverdicted novelty 6.0

    Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.

  6. Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

    cs.AI 2026-05 unverdicted novelty 6.0

    Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.

  7. Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    cs.MM 2026-05 unverdicted novelty 6.0

    LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

  8. Common-agency Games for Multi-Objective Test-Time Alignment

    cs.GT 2026-05 unverdicted novelty 6.0

    CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.

  9. Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...

  10. AlignCultura: Towards Culturally Aligned Large Language Models?

    cs.CL 2026-04 unverdicted novelty 6.0

    Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.

  11. The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

    cs.CR 2026-04 unverdicted novelty 6.0

    ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.

  12. OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

    cs.LG 2025-11 unverdicted novelty 6.0

    OutSafe-Bench supplies the first large-scale four-modality safety dataset and evaluation framework that exposes persistent unsafe outputs in nine leading multimodal LLMs.

  13. Mapping how LLMs debate societal issues when shadowing human personality traits, sociodemographics and social media behavior

    cs.CL 2026-04 unverdicted novelty 5.0

    CDS is a new synthetic corpus of LLM-generated texts on vaccines, disinformation, gender gaps, and STEM stereotypes, linked to persona attributes to enable bias and alignment audits.

  14. Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.

  15. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  16. Large Language Model-Based Agents for Software Engineering: A Survey

    cs.SE 2024-09 unverdicted novelty 4.0

    A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.

  17. A Survey on the Memory Mechanism of Large Language Model based Agents

    cs.AI 2024-04 accept novelty 3.0

    A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.

  18. A Survey on Knowledge Distillation of Large Language Models

    cs.CL 2024-02 accept novelty 3.0

    A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 18 Pith papers · 30 internal anchors

  1. [1]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  2. [2]

    Alignment of language agents

    Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents. arXiv preprint arXiv:2103.14659, 2021

  3. [3]

    OpenAI. GPT-4. https://openai.com/research/gpt-4, 2023

  4. [4]

    On the dangers of stochastic parrots: Can language models be too big?

    Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021

  5. [5]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

  6. [6]

    GPT-4 system card

    OpenAI. GPT-4 system card. https://cdn.openai.com/papers/gpt-4-system-card.pdf, 2023

  7. [7]

    Andrew R. Chow. How chatgpt managed to grow faster than tiktok or instagram. https://time.com/6253615/chatgpt-fastest-growing

  8. [8]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  9. [9]

    A systematic review of the relationship between internet use, self-harm and suicidal behaviour in young people: The good, the bad and the unknown

    Amanda Marchant, Keith Hawton, Ann Stewart, Paul Montgomery, Vinod Singaravelu, Keith Lloyd, Nicola Purdy, Kate Daine, and Ann John. A systematic review of the relationship between internet use, self-harm and suicidal behaviour in young people: The good, the bad and the unknown. PloS one, 12(8):e0181722, 2017

  10. [10]

    The regulation of pornography and child pornography on the internet

    Yaman Akdeniz. The regulation of pornography and child pornography on the internet. Available at SSRN 41684, 1997

  11. [11]

    Dynamics of hate based internet user networks

    Pawel Sobkowicz and Antoni Sobkowicz. Dynamics of hate based internet user networks. The European Physical Journal B, 73(4):633–643, 2010

  12. [12]

    Zikun Liu, Chen Luo, and Jia Lu. Hate speech in the internet context: Unpacking the roles of internet penetration, online legal regulation, and online opinion polarization from a transnational perspective. Information Development, page 02666669221148487, 2023

  13. [13]

    Is the internet causing political polarization? evidence from demographics

    Levi Boxell, Matthew Gentzkow, and Jesse M Shapiro. Is the internet causing political polarization? evidence from demographics. Technical report, National Bureau of Economic Research, 2017

  14. [14]

    Regulating the internet of things: first steps toward managing discrimination, privacy, security and consent

    Scott R Peppet. Regulating the internet of things: first steps toward managing discrimination, privacy, security and consent. Tex. L. Rev., 93:85, 2014

  15. [15]

    Normative challenges of identification in the internet of things: Privacy, profiling, discrimination, and the gdpr

    Sandra Wachter. Normative challenges of identification in the internet of things: Privacy, profiling, discrimination, and the gdpr. Computer law & security review, 34(3):436–449, 2018

  16. [16]

    Misuse of the internet by pedophiles: Implications for law enforcement and probation practice

    Keith F Durkin. Misuse of the internet by pedophiles: Implications for law enforcement and probation practice. Fed. Probation, 61:14, 1997

  17. [17]

    Controversies and legal issues of prescribing and dispensing medications using the internet

    Constance H Fung, Hawkin E Woo, and Steven M Asch. Controversies and legal issues of prescribing and dispensing medications using the internet. In Mayo Clinic Proceedings, volume 79, pages 188–194. Elsevier, 2004

  18. [18]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  19. [19]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017

  20. [20]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021

  21. [21]

    Ethical and social risks of harm from Language Models

    Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021

  22. [22]

    Evaluating the social impact of generative ai systems in systems and society

    Irene Solaiman, Zeerak Talat, William Agnew, Lama Ahmad, Dylan Baker, Su Lin Blodgett, Hal Daumé III, Jesse Dodge, Ellie Evans, Sara Hooker, et al. Evaluating the social impact of generative ai systems in systems and society. arXiv preprint arXiv:2306.05949, 2023

  23. [23]

    Holistic Evaluation of Language Models

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022

  24. [24]

    Eight things to know about large language models

    Samuel R Bowman. Eight things to know about large language models. arXiv preprint arXiv:2304.00612, 2023

  25. [25]

    Deep learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, 2016. http://www.deeplearningbook.org

  26. [26]

    The curious case of neural text degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020

  27. [27]

    Six Challenges for Neural Machine Translation

    Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872, 2017

  28. [28]

    Emergent Abilities of Large Language Models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022

  29. [29]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  30. [30]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

  31. [31]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  32. [32]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  33. [33]

    Universal Language Model Fine-tuning for Text Classification

    Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018

  34. [34]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018

  35. [35]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

  36. [36]

    GLM-130B: An Open Bilingual Pre-trained Model

    Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022

  37. [37]

    Dialogpt: Large-scale generative pre-training for conversational response generation

    Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536, 2019

  38. [38]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  39. [39]

    RRHF: Rank responses to align language models with human feedback without tears

    Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023

  40. [40]

    RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

    Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023

  41. [41]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023

  42. [42]

    Training socially aligned language models in simulated human society

    Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M Dai, Diyi Yang, and Soroush Vosoughi. Training socially aligned language models in simulated human society. arXiv preprint arXiv:2305.16960, 2023

  43. [43]

    Large language models and software as a medical device

    Johan Ordish. Large language models and software as a medical device. https://medregs.blog.gov.uk/2023/03/03/large-language-models-and-software-as-a-medical-device/

  44. [44]

    Are large language models ready for healthcare? a comparative study on clinical language understanding

    Yuqing Wang, Yun Zhao, and Linda Petzold. Are large language models ready for healthcare? a comparative study on clinical language understanding, 2023

  45. [45]

    How well do large language models support clinician information needs?

    Dev Dash, Eric Horvitz, and Nigam Shah. How well do large language models support clinician information needs? https://hai.stanford.edu/news/how-well-do-large-language-models-support-clinician-information-needs

  46. [46]

    Bloomberggpt: A large language model for finance

    Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance, 2023

  47. [47]

    Fingpt: Open-source financial large language models

    Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. Fingpt: Open-source financial large language models, 2023

  48. [48]

    Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664, 2023

  49. [49]

    A categorical archive of chatgpt failures

    Ali Borji. A categorical archive of chatgpt failures. arXiv preprint arXiv:2302.03494, 2023

  50. [50]

    Chatgpt and software testing education: Promises & perils

    Sajed Jalil, Suzzana Rafi, Thomas D LaToza, Kevin Moran, and Wing Lam. Chatgpt and software testing education: Promises & perils. In 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pages 4130–4137. IEEE, 2023

  51. [51]

    Fake news detection on social media: A data mining perspective

    Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. Fake news detection on social media: A data mining perspective. ACM SIGKDD explorations newsletter, 19(1):22–36, 2017

  52. [52]

    Some Like it Hoax: Automated Fake News Detection in Social Networks

    Eugenio Tacchini, Gabriele Ballarin, Marco L Della Vedova, Stefano Moret, and Luca De Alfaro. Some like it hoax: Automated fake news detection in social networks. arXiv preprint arXiv:1704.07506, 2017

  53. [53]

    Quantifying Memorization Across Neural Language Models

    Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022

  54. [54]

    A closer look at memorization in deep networks

    Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In International conference on machine learning, pages 233–242. PMLR, 2017

  55. [55]

    Measuring causal effects of data statistics on language model’s ‘factual’ predictions

    Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, and Yoav Goldberg. Measuring causal effects of data statistics on language model’s ‘factual’ predictions. arXiv preprint arXiv:2207.14251, 2022

  56. [56]

    When not to trust language models: Investigating effectiveness of parametric and non-parametric memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 2022

  57. [57]

    Unsupervised dense information retrieval with contrastive learning

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. 2022

  58. [58]

    Prompting gpt-3 to be reliable

    Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, and Lijuan Wang. Prompting gpt-3 to be reliable. arXiv preprint arXiv:2210.09150, 2022

  59. [59]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020

  60. [60]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023

  61. [61]

    Artificial hallucinations in chatgpt: implications in scientific writing

    Hussam Alkaissi and Samy I McFarlane. Artificial hallucinations in chatgpt: implications in scientific writing. Cureus, 15(2), 2023

  62. [62]

    A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

    Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023

  63. [63]

    False memories and confabulation

    Marcia K Johnson and Carol L Raye. False memories and confabulation. Trends in cognitive sciences, 2(4):137–145, 1998

  64. [64]

    Calibrated language model fine-tuning for in-and out-of-distribution data

    Lingkai Kong, Haoming Jiang, Yuchen Zhuang, Jie Lyu, Tuo Zhao, and Chao Zhang. Calibrated language model fine-tuning for in-and out-of-distribution data. arXiv preprint arXiv:2010.11506, 2020

  65. [65]

    Increasing faithfulness in knowledge-grounded dialogue with controllable features

    Hannah Rashkin, David Reitter, Gaurav Singh Tomar, and Dipanjan Das. Increasing faithfulness in knowledge-grounded dialogue with controllable features. arXiv preprint arXiv:2107.06963, 2021

  66. [66]

    Why does chatgpt fall short in answering questions faithfully?

    Shen Zheng, Jie Huang, and Kevin Chen-Chuan Chang. Why does chatgpt fall short in answering questions faithfully? arXiv preprint arXiv:2304.10513, 2023

  67. [67]

    Modeling fluency and faithfulness for diverse neural machine translation

    Yang Feng, Wanying Xie, Shuhao Gu, Chenze Shao, Wen Zhang, Zhengxin Yang, and Dong Yu. Modeling fluency and faithfulness for diverse neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 59–66, 2020

  68. [68]

    Ensure the correctness of the summary: Incorporate entailment knowledge into abstractive sentence summarization

    Haoran Li, Junnan Zhu, Jiajun Zhang, and Chengqing Zong. Ensure the correctness of the summary: Incorporate entailment knowledge into abstractive sentence summarization. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1430–1441, 2018

  69. [69]

    Neural path hunter: Reducing hallucination in dialogue systems via path grounding

    Nouha Dziri, Andrea Madotto, Osmar Zaiane, and Avishek Joey Bose. Neural path hunter: Reducing hallucination in dialogue systems via path grounding. arXiv preprint arXiv:2104.08455, 2021

  70. [70]

    Entity-based knowledge conflicts in question answering

    Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity-based knowledge conflicts in question answering. arXiv preprint arXiv:2109.05052, 2021

  71. [71]

    SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

    Potsawee Manakul, Adian Liusie, and Mark JF Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896, 2023

  72. [72]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004

  73. [73]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  74. [74]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021

  75. [75]

    Rome was built in 1776: A case study on factual correctness in knowledge-grounded response generation

    Sashank Santhanam, Behnam Hedayatnia, Spandana Gella, Aishwarya Padmakumar, Seokhwan Kim, Yang Liu, and Dilek Hakkani-Tur. Rome was built in 1776: A case study on factual correctness in knowledge-grounded response generation. arXiv preprint arXiv:2110.05456, 2021

  76. [76]

    Q2: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering

    Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, and Omri Abend. Q2: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering. arXiv preprint arXiv:2104.08202, 2021

  77. [77]

    Improving faithfulness in abstractive summarization with contrast candidate generation and selection

    Sihao Chen, Fan Zhang, Kazoo Sone, and Dan Roth. Improving faithfulness in abstractive summarization with contrast candidate generation and selection. arXiv preprint arXiv:2104.09061, 2021

  78. [78]

    A simple recipe towards reducing hallucination in neural surface realisation

    Feng Nie, Jin-Ge Yao, Jinpeng Wang, Rong Pan, and Chin-Yew Lin. A simple recipe towards reducing hallucination in neural surface realisation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2673–2679, 2019

  79. [79]

    Faithful to the original: Fact aware neural abstractive summarization

    Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. Faithful to the original: Fact aware neural abstractive summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

  80. [80]

    Totto: A controlled table-to-text generation dataset

    Ankur P Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. Totto: A controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373, 2020

Showing first 80 references.