pith. machine review for the scientific record.

arxiv: 2401.05561 · v6 · submitted 2024-01-10 · 💻 cs.CL

TrustLLM: Trustworthiness in Large Language Models

Pith reviewed 2026-05-18 11:12 UTC · model grok-4.3

classification 💻 cs.CL
keywords: trustworthiness, large language models, LLMs, benchmark, evaluation, safety, fairness, privacy

The pith

Proprietary large language models generally outperform open-source ones on trustworthiness measures, and trustworthiness tracks closely with overall utility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out principles for trustworthy LLMs across eight dimensions and builds a benchmark covering six of them: truthfulness, safety, fairness, robustness, privacy, and machine ethics. It runs the benchmark on sixteen mainstream models using more than thirty datasets. The results indicate that trustworthiness and functional performance rise together, proprietary models lead most open-source ones, and a handful of open-source models nearly match the leaders. Some models prove overly cautious, refusing harmless requests in the name of safety and thereby lowering their usefulness. The work also highlights the need for transparency about the specific techniques used to build trustworthiness.

Core claim

By defining eight principles and applying a six-dimension benchmark to sixteen LLMs, the study finds that trustworthiness and utility are positively correlated, proprietary models generally lead open-source ones on the tested dimensions, a few open-source models approach proprietary performance, and some models over-calibrate by refusing benign prompts.

What carries the argument

The TrustLLM benchmark, which applies standardized tests across truthfulness, safety, fairness, robustness, privacy, and machine ethics to rank models.
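A rough sketch of how per-dimension results could be folded into a single ranking follows. It is an illustration only, not the paper's aggregation method: the model names and scores are invented, and averaging per-dimension ranks is an assumed choice.

```python
from statistics import mean

# The six benchmarked dimensions named in the paper.
DIMENSIONS = ["truthfulness", "safety", "fairness",
              "robustness", "privacy", "machine_ethics"]

# Hypothetical scores in [0, 1], higher is better. NOT taken from the paper.
scores = {
    "model_a": dict(zip(DIMENSIONS, [0.80, 0.90, 0.70, 0.85, 0.75, 0.80])),
    "model_b": dict(zip(DIMENSIONS, [0.70, 0.95, 0.60, 0.80, 0.65, 0.70])),
    "model_c": dict(zip(DIMENSIONS, [0.60, 0.50, 0.65, 0.55, 0.60, 0.75])),
}


def mean_rank(scores, dimensions):
    """Return {model: average rank over the dimensions}; rank 1 is best.

    Ties are broken by insertion order because sorted() is stable.
    """
    ranks = {model: [] for model in scores}
    for dim in dimensions:
        ordered = sorted(scores, key=lambda m: scores[m][dim], reverse=True)
        for position, model in enumerate(ordered, start=1):
            ranks[model].append(position)
    return {model: mean(r) for model, r in ranks.items()}


def leaderboard(scores, dimensions):
    """Models sorted from best (lowest mean rank) to worst."""
    avg = mean_rank(scores, dimensions)
    return sorted(avg, key=avg.get)


print(leaderboard(scores, DIMENSIONS))
```

Rank averaging keeps one dimension with a wide score range from dominating the result, but it discards the size of the gaps between models; a weighted mean of raw scores would be an equally defensible assumption.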

If this is right

  • Higher trustworthiness tends to accompany stronger performance on standard tasks.
  • Widespread use of open-source LLMs may carry elevated risk relative to proprietary alternatives, since most open-source models score lower on the tested dimensions.
  • Overly strict safety tuning can reduce model utility by blocking safe user requests.
  • Transparency about the specific methods used to improve trustworthiness enables better analysis of their effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Closing the trustworthiness gap between open-source and proprietary models could require targeted improvements in training data or alignment techniques.
  • The observed correlation between trustworthiness and utility suggests that general capability advances may bring trustworthiness gains as a side effect.
  • Developers should monitor refusal rates on safe inputs as a routine check when adding safety features.
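The refusal-rate check in the last point could look roughly like the sketch below. It is not the paper's evaluation procedure: the keyword list, the function names, and the 5% threshold are all assumptions made for illustration, and a keyword match is only a crude proxy for a real refusal classifier.

```python
# Hypothetical refusal markers; a production check would likely use a classifier.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable", "as an ai")


def is_refusal(response: str) -> bool:
    """Heuristically flag a response as a refusal via keyword matching."""
    text = response.strip().lower().replace("\u2019", "'")  # normalise curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def benign_refusal_rate(responses):
    """Fraction of responses to benign prompts that look like refusals."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_refusal(r) for r in responses) / len(responses)


def check_over_refusal(responses, max_rate=0.05):
    """Return (passed, rate); max_rate is an assumed tolerance, not a standard."""
    rate = benign_refusal_rate(responses)
    return rate <= max_rate, rate


# Responses a model might give to four harmless prompts (made-up examples).
benign_responses = [
    "Sure, here is a recipe.",
    "I'm sorry, but I can't help with that.",
    "The capital of France is Paris.",
    "I cannot assist with this request.",
]
print(check_over_refusal(benign_responses))
```

Running such a check on a fixed set of benign prompts before and after each safety change would show whether the change pushed the refusal rate up.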

Load-bearing premise

The selected datasets and evaluation methods for the six dimensions capture the main real-world trustworthiness risks without major gaps or biases.

What would settle it

An open-source model that scores higher than leading proprietary models on all six benchmark dimensions while correctly answering every benign prompt would contradict the reported pattern.

Original abstract

Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and utility (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing the specific trustworthy technologies that have been employed is crucial for analyzing their effectiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces TrustLLM as a comprehensive study of trustworthiness in LLMs. It proposes a set of principles spanning eight dimensions, constructs a benchmark across six dimensions (truthfulness, safety, fairness, robustness, privacy, and machine ethics) using more than 30 datasets, evaluates 16 mainstream LLMs, and reports three primary findings: a positive correlation between trustworthiness and utility, general outperformance by proprietary models over open-source counterparts, and over-calibration in some models that leads to refusal of benign prompts. The work concludes with discussion of open challenges and the need for transparency in trustworthiness technologies.

Significance. If the central empirical claims hold after methodological clarification, the paper would make a useful contribution by providing one of the larger-scale multi-dimensional evaluations of LLM trustworthiness to date. The explicit linkage of findings to model accessibility (proprietary vs. open-source) and the utility-trustworthiness trade-off supplies concrete observations that can inform deployment decisions and future alignment research. The scale (>30 datasets, 16 models) is a clear strength that distinguishes it from narrower prior benchmarks.

major comments (2)
  1. [§3 and §4] §3 (Benchmark Construction) and §4 (Evaluation): The mapping from the eight proposed principles to the six benchmark dimensions and the specific dataset choices lack an explicit coverage or gap analysis. Without this, it is unclear whether the observed proprietary-model advantage and positive trustworthiness-utility correlation are robust to alternative task selections (e.g., long-context privacy or culturally varied ethics scenarios). This directly affects the load-bearing claim that proprietary LLMs generally outperform open-source ones.
  2. [§4] §4 (Evaluation Methodology): The manuscript provides insufficient detail on prompt templates, exact scoring rubrics (especially for subjective dimensions such as machine ethics and fairness), and any inter-annotator or inter-model consistency checks. These choices are central to the reported rankings and the over-calibration observation; their omission prevents independent verification of whether the differences are intrinsic or protocol-dependent.
minor comments (3)
  1. [Abstract] The abstract states principles across eight dimensions but a benchmark across six; a single clarifying sentence would remove potential reader confusion.
  2. [Results] Correlation plots in the results section would be strengthened by reporting confidence intervals or statistical significance for the trustworthiness-utility relationship.
  3. [Related Work] A small number of citations to prior multi-dimensional LLM safety benchmarks (e.g., HELM, DecodingTrust) appear to be missing from the related-work discussion.
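The confidence intervals requested in minor comment 2 could be obtained with a percentile bootstrap, sketched below under stated assumptions. The per-model utility and trustworthiness scores are invented, and the choice of Pearson correlation with 2000 resamples is an assumption rather than anything the paper specifies.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def bootstrap_ci(xs, ys, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Pearson correlation."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    while len(estimates) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # resample models with replacement
        sx = [xs[i] for i in idx]
        sy = [ys[i] for i in idx]
        if len(set(sx)) < 2 or len(set(sy)) < 2:
            continue  # skip degenerate resamples with zero variance
        estimates.append(pearson(sx, sy))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-model scores; NOT values from the paper.
utility = [0.42, 0.55, 0.61, 0.48, 0.70, 0.66, 0.81, 0.77]
trust = [0.50, 0.52, 0.65, 0.45, 0.68, 0.72, 0.79, 0.83]

r = pearson(utility, trust)
ci_low, ci_high = bootstrap_ci(utility, trust)
print(f"r = {r:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

With only sixteen models, as in the paper, such an interval would likely be wide, which is part of why reporting it matters.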

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the benchmark's scope and improve methodological transparency. We address each point below and commit to revisions that strengthen the paper without altering its core claims.

Point-by-point responses
  1. Referee: [§3 and §4] §3 (Benchmark Construction) and §4 (Evaluation): The mapping from the eight proposed principles to the six benchmark dimensions and the specific dataset choices lack an explicit coverage or gap analysis. Without this, it is unclear whether the observed proprietary-model advantage and positive trustworthiness-utility correlation are robust to alternative task selections (e.g., long-context privacy or culturally varied ethics scenarios). This directly affects the load-bearing claim that proprietary LLMs generally outperform open-source ones.

    Authors: We agree that an explicit mapping and gap analysis would improve transparency. In the revised version we will add a table in §3 that maps each of the eight principles to the six benchmark dimensions and lists the datasets chosen for each, together with a short discussion of coverage and acknowledged gaps (e.g., limited long-context privacy scenarios and culturally specific ethics tasks). Our dataset selection follows prior literature for each dimension; the proprietary-model advantage and trustworthiness-utility correlation hold consistently across the >30 datasets we include. We will nevertheless add a limitations paragraph noting that results may vary under alternative task distributions and flag long-context and culturally varied evaluations as important future work.

    Revision: partial

  2. Referee: [§4] §4 (Evaluation Methodology): The manuscript provides insufficient detail on prompt templates, exact scoring rubrics (especially for subjective dimensions such as machine ethics and fairness), and any inter-annotator or inter-model consistency checks. These choices are central to the reported rankings and the over-calibration observation; their omission prevents independent verification of whether the differences are intrinsic or protocol-dependent.

    Authors: We accept that additional methodological detail is required for reproducibility. In the revision we will expand §4 (and add an appendix) with: (i) the full prompt templates used for each dimension, (ii) precise scoring rubrics including how human or automated judgments were applied to machine ethics and fairness, and (iii) inter-annotator agreement statistics for any human-evaluated subsets together with consistency checks across model outputs. These additions will allow readers to verify that reported differences are not artifacts of the evaluation protocol.

    Revision: yes
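    For concreteness, one common inter-annotator agreement statistic of the kind promised in (iii) is Cohen's kappa, sketched below. Nothing in the paper or the rebuttal says this particular statistic will be used; the labels and annotator data are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments on ten model responses ("ok" = acceptable, "bad" = not).
annotator_1 = ["ok", "ok", "bad", "bad", "ok", "bad", "ok", "ok", "bad", "ok"]
annotator_2 = ["ok", "ok", "bad", "ok", "ok", "bad", "ok", "bad", "bad", "ok"]

print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

    Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than a simple match rate when one label dominates, as is often the case with refusal or harm judgments.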

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study

Full rationale

The paper conducts an empirical evaluation of 16 LLMs across six trustworthiness dimensions using over 30 external datasets. Central claims (proprietary models outperforming open-source ones, positive trustworthiness-utility correlation, over-calibration) derive directly from model outputs on these datasets rather than from any internal derivation, fitted parameters, or self-referential definitions. No equations, predictions, or uniqueness theorems are presented that reduce to the authors' own inputs by construction. The work is self-contained against external benchmarks, with dataset selection serving as an operationalization step rather than a circular fit.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the selected datasets and metrics are representative proxies for the eight trustworthiness principles; no new physical or mathematical entities are introduced.

axioms (1)
  • Domain assumption: existing NLP datasets can serve as valid proxies for real-world trustworthiness failures in LLMs.
    The benchmark construction in the abstract relies on this without independent validation of coverage.

pith-pipeline@v0.9.0 · 6093 in / 1239 out tokens · 26937 ms · 2026-05-18T11:12:58.510044+00:00 · methodology



Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

    cs.CR 2026-05 conditional novelty 8.0

    Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

  2. VoxSafeBench: Not Just What Is Said, but Who, How, and Where

    cs.SD 2026-04 unverdicted novelty 8.0

    VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

  3. AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction

    cs.MM 2026-04 unverdicted novelty 8.0

    AVID is the first large-scale benchmark for audio-visual inconsistency detection, grounding, classification, and reasoning in long videos, constructed via agent-driven methods and showing that state-of-the-art models ...

  4. Trustworthy AI: Ensuring Reliability and Accountability from Models to Agents

    cs.LG 2026-05 unverdicted novelty 6.0

    The thesis presents a kernel method for multiaccuracy across overlooked subpopulations, information-theoretic optimal watermarking for LLMs, and a simulator showing LLM agents outperforming humans in supply chains whi...

  5. Profiling for Pennies: Unveiling the Privacy Iceberg of LLM Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    LLM agents can reconstruct high-fidelity personal profiles from minimal PII seeds with over 90% accuracy in under 10 minutes at less than $3 cost, exposing three escalating tiers of privacy risks.

  6. Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

    cs.AI 2026-05 unverdicted novelty 6.0

    PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.

  7. Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Large reasoning models show measurable hidden-state dynamics that a new statistic can use to distinguish correct reasoning trajectories without labels.

  8. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  9. Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression

    cs.CL 2026-04 unverdicted novelty 6.0

    CoT compression frequently introduces trustworthiness regressions with method-specific degradation profiles; a proposed normalized efficiency score and alignment-aware DPO variant reduce length by 19.3% with smaller t...

  10. OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

    cs.LG 2025-11 unverdicted novelty 6.0

    OutSafe-Bench supplies the first large-scale four-modality safety dataset and evaluation framework that exposes persistent unsafe outputs in nine leading multimodal LLMs.

  11. Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning

    cs.LG 2025-10 conditional novelty 6.0

    Downgrading optimizers to lower-information variants during LLM unlearning yields more robust forgetting on MUSE and WMDP benchmarks by converging to harder-to-perturb loss basins.

  12. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    cs.CR 2024-03 accept novelty 6.0

    JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...

  13. Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.

  14. Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

    cs.CL 2025-10 unverdicted novelty 5.0

    ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.

  15. Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction

    cs.CL 2026-05 unverdicted novelty 4.0

    A multi-view evidential framework combines semantic and reasoning information to improve accuracy and provide trustworthy uncertainty estimates for mental health prediction on text data.

  16. A Multi-Dimensional Audit of Politically Aligned Large Language Models

    cs.CL 2026-04 unverdicted novelty 4.0

    A multi-dimensional audit framework for politically aligned LLMs finds consistent trade-offs: larger models are more effective and truthful but less fair with higher bias, while fine-tuned models reduce bias but incre...

  17. Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work

    cs.AI 2026-04 unverdicted novelty 4.0

    Vibe Medicine proposes directing AI agents via natural language for end-to-end biomedical workflows using LLMs, agent frameworks, and a curated collection of over 1,000 medical skills.

  18. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 18 Pith papers · 33 internal anchors

  1. [1]

    A toolkit for text extraction and analysis for natural language processing tasks

    Tshephisho Joseph Sefara, Mahlatse Mbooi, Katlego Mashile, Thompho Rambuda, and Mapitsi Rangata. A toolkit for text extraction and analysis for natural language processing tasks. In 2022 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD), pages 1–6, 2022

  2. [2]

    Natural language processing: State of the art, current trends and challenges

    Diksha Khurana, Aditya Koli, Kiran Khatter, and Sukhdev Singh. Natural language processing: State of the art, current trends and challenges. Multimedia tools and applications, 82(3):3713–3744, 2023

  3. [3]

    Wordcraft: story writing with large language models

    Ann Yuan, Andy Coenen, Emily Reif, and Daphne Ippolito. Wordcraft: story writing with large language models. In 27th International Conference on Intelligent User Interfaces, pages 841–852, 2022

  4. [4]

    Multilingual machine translation with large language models: Empirical results and analysis, 2023

    Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. Multilingual machine translation with large language models: Empirical results and analysis, 2023

  5. [5]

    https://blogs.microsoft.com/blog/2023/02/07/ reinventing-search-with-a-new-ai-powered-microsoft-bing-and-edge-your-copilot-for-the-web/

    Reinventing search with a new ai-powered microsoft bing and edge, your copilot for the web, 2023. https://blogs.microsoft.com/blog/2023/02/07/ reinventing-search-with-a-new-ai-powered-microsoft-bing-and-edge-your-copilot-for-the-web/

  6. [6]

    https://medium.com/whatnot-engineering/ enhancing-search-using-large-language-models-f9dcb988bdb9

    Enhancing search using large language models, 2023. https://medium.com/whatnot-engineering/ enhancing-search-using-large-language-models-f9dcb988bdb9

  7. [7]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

  8. [8]

    https://www.projectpro.io/article/ large-language-model-use-cases-and-applications/887

    7 top large language model use cases and applications, 2023. https://www.projectpro.io/article/ large-language-model-use-cases-and-applications/887

  9. [9]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

  10. [10]

    Large language models: The future of b2b software, 2023

    MintMesh. Large language models: The future of b2b software, 2023

  11. [11]

    Bloomberggpt: A large language model for finance, 2023

    Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance, 2023

  12. [12]

    Scientific discovery in the age of artificial intelligence

    Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023

  13. [13]

    Hofgard, Aria Mansouri Tehrani, Rui Wang, Ameya Daigavane, Montgomery Bohde, Jerry Kurtin, Qian Huang, Tuong Phung, Minkai Xu, Chaitanya K

    Xuan Zhang, Limei Wang, Jacob Helwig, Youzhi Luo, Cong Fu, Yaochen Xie, Meng Liu, Yuchao Lin, Zhao Xu, Keqiang Yan, Keir Adams, Maurice Weiler, Xiner Li, Tianfan Fu, Yucheng Wang, Haiyang Yu, YuQing Xie, Xiang Fu, Alex Strasser, Shenglong Xu, Yi Liu, Yuanqi Du, Alexandra Saxton, Hongyi Ling, Hannah Lawrence, Hannes Stärk, Shurui Gui, Carl Edwards, Nichola...

  14. [14]

    The impact of large language models on scientific discovery: a preliminary study using gpt-4, 2023

    Microsoft Research AI4Science and Microsoft Azure Quantum. The impact of large language models on scientific discovery: a preliminary study using gpt-4, 2023

  15. [15]

    Pllama: An open-source large language model for plant science, 2024

    Xianjun Yang, Junfeng Gao, Wenxin Xue, and Erik Alexandersson. Pllama: An open-source large language model for plant science, 2024

  16. [16]

    The future landscape of large language models in medicine

    Jan Clusmann, Fiona R Kolbinger, Hannah Sophie Muti, Zunamys I Carrero, Jan-Niklas Eckardt, Narmin Ghaffari Laleh, Chiara Maria Lavinia Löffler, Sophie-Caroline Schwarzkopf, Michaela Unger, 80 TRUST LLM Gregory P Veldhuizen, et al. The future landscape of large language models in medicine. Communica- tions Medicine, 3(1):141, 2023

  17. [17]

    ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences

    Yuanhe Tian, Ruyi Gan, Yan Song, Jiaxing Zhang, and Yongdong Zhang. ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences. arXiv preprint arXiv:2311.06025, 2023

  18. [18]

    Alpacare:instruction-tuned large language models for medical application, 2023

    Xinlu Zhang, Chenxin Tian, Xianjun Yang, Lichang Chen, Zekun Li, and Linda Ruth Petzold. Alpacare:instruction-tuned large language models for medical application, 2023

  19. [19]

    Davison, Quanzheng Li, Yong Chen, Hongfang Liu, and Lichao Sun

    Kai Zhang, Jun Yu, Zhiling Yan, Yixin Liu, Eashan Adhikarla, Sunyang Fu, Xun Chen, Chen Chen, Yuyin Zhou, Xiang Li, Lifang He, Brian D. Davison, Quanzheng Li, Yong Chen, Hongfang Liu, and Lichao Sun. Biomedgpt: A unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks, 2023

  20. [20]

    Bianque: Balancing the questioning and suggestion ability of health llms with multi-turn health conversations polished by chatgpt, 2023

    Yirong Chen, Zhenyu Wang, Xiaofen Xing, huimin zheng, Zhipei Xu, Kai Fang, Junhong Wang, Sihang Li, Jieling Wu, Qi Liu, and Xiangmin Xu. Bianque: Balancing the questioning and suggestion ability of health llms with multi-turn health conversations polished by chatgpt, 2023

  21. [21]

    Huatuogpt, towards taming language models to be a doctor

    Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, Xiang Wan, Benyou Wang, and Haizhou Li. Huatuogpt, towards taming language models to be a doctor. arXiv preprint arXiv:2305.15075, 2023

  22. [22]

    Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge

    Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, Steve Jiang, and You Zhang. Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge. Cureus, 15(6), 2023

  23. [23]

    Medicalgpt: Training medical gpt model

    Ming Xu. Medicalgpt: Training medical gpt model. https://github.com/shibing624/MedicalGPT, 2023

  24. [24]

    A domain-specific next-generation large language model (llm) or chatgpt is required for biomedical engineering and research

    Soumen Pal, Manojit Bhattacharya, Sang-Soo Lee, and Chiranjib Chakraborty. A domain-specific next-generation large language model (llm) or chatgpt is required for biomedical engineering and research. Annals of Biomedical Engineering, pages 1–4, 2023

  25. [25]

    Towards generalist biomedical ai

    Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Chuck Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist biomedical ai. arXiv preprint arXiv:2307.14334, 2023

  26. [26]

    Large language models and political science

    Mitchell Linegar, Rafal Kocielnik, and R Michael Alvarez. Large language models and political science. Frontiers in Political Science, 5:1257092, 2023

  27. [27]

    https://github.com/irlab-sdu/fuzi.mingcha, 2023

    fuzi.mingcha. https://github.com/irlab-sdu/fuzi.mingcha, 2023

  28. [28]

    Disc-lawllm: Fine-tuning large language models for intelligent legal services, 2023

    Shengbin Yue, Wei Chen, Siyuan Wang, Bingxuan Li, Chenchen Shen, Shujun Liu, Yuxuan Zhou, Yao Xiao, Song Yun, Xuanjing Huang, and Zhongyu Wei. Disc-lawllm: Fine-tuning large language models for intelligent legal services, 2023

  29. [29]

    Chawla, Olaf Wiest, and Xiangliang Zhang

    Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh V . Chawla, Olaf Wiest, and Xiangliang Zhang. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. In NeurIPS, 2023

  30. [30]

    Structured chemistry reasoning with large language models

    Siru Ouyang, Zhuosheng Zhang, Bing Yan, Xuan Liu, Jiawei Han, and Lianhui Qin. Structured chemistry reasoning with large language models. arXiv preprint arXiv:2311.09656, 2023

  31. [31]

    Marinegpt: Unlocking secrets of "ocean" to the public, 2023

    Ziqiang Zheng, Jipeng Zhang, Tuan-Anh Vu, Shizhe Diao, Yue Him Wong Tim, and Sai-Kit Yeung. Marinegpt: Unlocking secrets of "ocean" to the public, 2023

  32. [32]

    Oceangpt: A large language model for ocean science tasks, 2023

    Zhen Bi, Ningyu Zhang, Yida Xue, Yixin Ou, Daxiong Ji, Guozhou Zheng, and Huajun Chen. Oceangpt: A large language model for ocean science tasks, 2023

  33. [33]

    Taoli llama

    Jingsi Yu, Junhui Zhu, Yujie Wang, Yang Liu, Hongxiang Chang, Jinran Nie, Cunliang Kong, Ruining Chong, XinLiu, Jiyuan An, Luming Lu, Mingwei Fang, and Lin Zhu. Taoli llama. https://github.com/ blcuicall/taoli, 2023

  34. [34]

    Artgpt-4: Artistic vision-language understanding with adapter-enhanced minigpt-4, 2023

    Zhengqing Yuan, Huiwen Xue, Xinyi Wang, Yongming Liu, Zhuanzhe Zhao, and Kun Wang. Artgpt-4: Artistic vision-language understanding with adapter-enhanced minigpt-4, 2023. 81 TRUST LLM

  35. [35]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vin- odkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bra...

  36. [36]

    Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Sia- mak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H

    Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Sia- mak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing...

  37. [37]

    Palm: Efficiently training massive language models, 2023

    Towards Data Science. Palm: Efficiently training massive language models, 2023

  38. [38]

    How chatgpt works: A look inside large language models, 2023

    Wired. How chatgpt works: A look inside large language models, 2023

  39. [39]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  40. [40]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023

  41. [41]

    Pathways: Asynchronous distributed dataflow for ml

    Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Daniel Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, et al. Pathways: Asynchronous distributed dataflow for ml. Proceedings of Machine Learning and Systems, 4:430–449, 2022

  42. [42]

    Ai alignment: A comprehensive survey, 2023

    Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Kwan Yee Ng, Juntao Dai, Xuehai Pan, Aidan O’Gara, Yingshan Lei, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Yike Guo, and Wen Gao. Ai alignment: A comprehensive survey, 2023

  43. [43]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  44. [44]

    Improving language model negotiation with self-play and in-context learning from ai feedback

    Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv preprint arXiv:2305.10142, 2023

  45. [45]

    Principle-driven self-alignment of language models from scratch with minimal human supervision

    Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047, 2023

  46. [46]

    Rl4f: Generating natural language feedback with reinforcement learning for repairing model outputs, 2023

    Afra Feyza Akyürek, Ekin Akyürek, Aman Madaan, Ashwin Kalyan, Peter Clark, Derry Wijaya, and Niket Tandon. Rl4f: Generating natural language feedback with reinforcement learning for repairing model outputs, 2023

  47. [47]

    Measuring Progress on Scalable Oversight for Large Language Models

    Samuel R Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, et al. Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540, 2022

  48. [48]

    Discovering Language Model Behaviors with Model-Written Evaluations

    Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251, 2022

  49. [49]

    Improving Factuality and Reasoning in Language Models through Multiagent Debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023

  50. [50]

    Characterizing manipulation from ai systems

    Micah Carroll, Alan Chan, Henry Ashton, and David Krueger. Characterizing manipulation from ai systems. arXiv preprint arXiv:2303.09387, 2023

  51. [51]

    RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

    Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023

  52. [52]

    A Generalist Agent

    Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022

  53. [53]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  54. [54]

    The effects of reward misspecification: Mapping and mitigating misaligned models, 2022

    Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv preprint arXiv:2201.03544, 2022

  55. [55]

    Cooperative inverse reinforcement learning

    Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. Advances in neural information processing systems, 29, 2016

  56. [56]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023

  57. [57]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023

  58. [58]

    Factuality challenges in the era of large language models

    Isabelle Augenstein, Timothy Baldwin, Meeyoung Cha, Tanmoy Chakraborty, Giovanni Luca Ciampaglia, David Corney, Renee DiResta, Emilio Ferrara, Scott Hale, Alon Halevy, et al. Factuality challenges in the era of large language models. arXiv preprint arXiv:2310.05189, 2023

  59. [59]

    Combating misinformation in the age of llms: Opportunities and challenges

    Canyu Chen and Kai Shu. Combating misinformation in the age of llms: Opportunities and challenges. arXiv preprint arXiv:2311.05656, 2023

  60. [60]

    10 ways cybercriminals can abuse large language models, 2023

    Forbes Tech Council. 10 ways cybercriminals can abuse large language models, 2023

  61. [61]

    Jailbroken: How Does LLM Safety Training Fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023

  62. [62]

    Unraveling the link between translations and gender bias in llms, 2023

    Appen. Unraveling the link between translations and gender bias in llms, 2023

  63. [63]

    Navigating the biases in llm generative ai: A guide to responsible implementation, 2023

    Forbes Tech Council. Navigating the biases in llm generative ai: A guide to responsible implementation, 2023

  64. [64]

    Large language models may leak personal data, 2022

    Slator. Large language models may leak personal data, 2022. https://slator.com/large-language-models-may-leak-personal-data/

  65. [65]

    Deid-gpt: Zero-shot medical text de-identification by gpt-4, 2023

    Zhengliang Liu, Xiaowei Yu, Lu Zhang, Zihao Wu, Chao Cao, Haixing Dai, Lin Zhao, Wei Liu, Dinggang Shen, Quanzheng Li, Tianming Liu, Dajiang Zhu, and Xiang Li. Deid-gpt: Zero-shot medical text de-identification by gpt-4, 2023

  66. [66]

    What does it mean to align ai with human values?, 2022

    Quanta Magazine. What does it mean to align ai with human values?, 2022

  67. [67]

    Openai, 2023

    OpenAI. Openai, 2023. https://www.openai.com

  68. [68]

    Ai at meta, 2023

    Meta. Ai at meta, 2023. https://ai.meta.com

  69. [69]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  70. [70]

    Holistic Evaluation of Language Models

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022

  71. [71]

    Decodingtrust: A comprehensive assessment of trustworthiness in gpt models

    Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. arXiv preprint arXiv:2306.11698, 2023

  72. [72]

    Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

    Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. arXiv preprint arXiv:2308.05374, 2023

  73. [73]

    Do-not-answer: A dataset for evaluating safeguards in llms

    Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not-answer: A dataset for evaluating safeguards in llms. arXiv preprint arXiv:2308.13387, 2023

  74. [74]

    Chatbot arena leaderboard week 8: Introducing mt-bench and vicuna-33b

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, and Hao Zhang. Chatbot arena leaderboard week 8: Introducing mt-bench and vicuna-33b. https://lmsys.org/chatbot-arena-leaderboard-week-8-introducing-mt-bench-and-vicuna-33b/, 2023

  75. [75]

    The big benchmarks collection - a open-llm-leaderboard collection

    Hugging Face. The big benchmarks collection - a open-llm-leaderboard collection. https://huggingface.co/spaces/OpenLLMBenchmark/The-Big-Benchmarks-Collection

  76. [76]

    https://platform.openai.com/docs/guides/moderation

    Openai moderation api, 2023. https://platform.openai.com/docs/guides/moderation

  77. [77]

    The foundation model transparency index, 2023

    Rishi Bommasani, Kevin Klyman, Shayne Longpre, Sayash Kapoor, Nestor Maslej, Betty Xiong, Daniel Zhang, and Percy Liang. The foundation model transparency index, 2023

  78. [78]

    Zhengzhong Liu, Aurick Qiao, Willie Neiswanger, Hongyi Wang, Bowen Tan, Tianhua Tao, Junbo Li, Yuqi Wang, Suqi Sun, Omkar Pangarkar, Richard Fan, Yi Gu, Victor Miller, Yonghao Zhuang, Guowei He, Haonan Li, Fajri Koto, Liping Tang, Nikhil Ranjan, Zhiqiang Shen, Xuguang Ren, Roberto Iriondo, Cun Mu, Zhiting Hu, Mark Schulze, Preslav Nakov, Tim Baldwin, and ...

  79. [79]

    Ernie - baidu yiyan, 2023

    Baidu. Ernie - baidu yiyan, 2023. https://yiyan.baidu.com/

  80. [80]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

Showing first 80 references.