pith. machine review for the scientific record.

arxiv: 2604.07655 · v1 · submitted 2026-04-08 · 💻 cs.LG · cs.CL

Recognition: unknown

Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

Han Bao, Hang Hua, Haomin Zhuang, Jiayi Ye, Pin-Yu Chen, Siyuan Wu, Xiangliang Zhang, Yanbo Wang, Yue Huang

Pith reviewed 2026-05-10 17:22 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords: Guardian-as-an-Advisor (GaaA) · LLM safety · over-refusal · model spec compliance · soft-gating · GuardSet · advisory workflow

The pith

A guardian model advises base LLMs by prepending risk labels and explanations, steering outputs to match the model spec while cutting over-refusal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Guardian-as-an-Advisor, a pipeline in which a separate guardian model first labels a query as risky or safe and adds a short explanation. This advice is then prepended to the original query so the base LLM can re-infer its answer under its original rules. The approach replaces hard safety gates that block too many harmless requests. A new dataset called GuardSet supplies the training data, including slices for robustness and honesty. When the guardian is trained with supervised fine-tuning followed by reinforcement learning for label-explanation consistency, the resulting system keeps safety levels intact yet produces fewer refusals on safe inputs.

Core claim

Guardian-as-an-Advisor (GaaA) is a soft-gating method in which a guardian model outputs a binary risk label plus a concise explanation and prepends both to the user query before the base model generates its response. This keeps the base model operating under its original model spec rather than overriding it with a hard gate. Experiments show that the augmented prompts yield responses that better comply with the spec, preserve safety on harmful inputs, and reduce over-refusal on harmless ones. The guardian itself reaches competitive detection accuracy at low added compute cost.

What carries the argument

Guardian-as-an-Advisor (GaaA) soft-gating pipeline: a guardian predicts a binary risk label and a concise explanation, then prepends this advice to the original query for re-inference by the base model.
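To make the mechanism concrete, here is a minimal sketch of the soft-gating flow, assuming generic chat-completion callables; the prompt template and function names are ours for illustration, not the paper's.

```python
# Minimal sketch of the GaaA soft-gating flow described above.
# `guardian` and `base_model` stand in for any chat-completion callable;
# the prompt wording is a hypothetical rendering, not the paper's format.

def guardian_advise(guardian, query: str) -> tuple[str, str]:
    """Ask the guardian for a binary risk label plus a concise explanation."""
    out = guardian(
        "Classify the user query as RISKY or SAFE and explain briefly.\n"
        f"Query: {query}\nAnswer as: <label> | <explanation>"
    )
    label, _, explanation = out.partition("|")
    return label.strip(), explanation.strip()

def gaa_respond(guardian, base_model, query: str) -> str:
    """Soft gating: prepend the advice, then let the base model answer
    under its own model spec -- no hard block ever occurs."""
    label, explanation = guardian_advise(guardian, query)
    advice = (
        f"[Guardian advice] risk={label}. {explanation} "
        "Respond according to your model spec."
    )
    return base_model(f"{advice}\n\n{query}")
```

The design point is that the guardian never overrides the base model; even a RISKY label reaches the base model only as context, leaving the final decision to the original spec.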

If this is right

  • Responses from the base model improve over unaugmented prompts on the same inputs.
  • Safety is maintained while over-refusal drops.
  • Advisor inference consumes under 5 percent of base-model compute and adds only 2-10 percent end-to-end latency at realistic harmful-input rates (a back-of-envelope model appears after this list).
  • The same advisory workflow works across multiple domains covered in GuardSet.
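A back-of-envelope model of the latency claim, with invented constants; only the <5% compute and 2-10% end-to-end targets come from the abstract, everything else is assumed for illustration.

```python
# Back-of-envelope latency model for the advisory pipeline.
# All constants below are illustrative assumptions; only the <5% compute
# and 2-10% end-to-end targets come from the abstract.

def end_to_end_overhead(harmful_rate: float,
                        advisor_ms: float = 40.0,       # assumed guardian latency
                        base_ms_safe: float = 1200.0,   # assumed base-model latency, safe query
                        base_ms_harmful: float = 600.0  # assumed: refusal-style answers are shorter
                        ) -> float:
    """Fractional added latency from running the advisor on every query."""
    base = (1 - harmful_rate) * base_ms_safe + harmful_rate * base_ms_harmful
    return advisor_ms / base

for rate in (0.01, 0.10, 0.50):
    print(f"harmful rate {rate:4.0%}: +{end_to_end_overhead(rate):.1%} latency")
# Under these assumptions the overhead stays in the low single digits at
# realistic harmful-input rates, consistent with the claimed 2-10% band.
```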

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prepend-advice pattern could be applied to other alignment goals such as honesty or fairness checks.
  • GuardSet's robustness and honesty slices offer a ready benchmark for testing whether other guardian designs also reduce side effects.
  • Because the base model stays under its original spec, the method may transfer more easily across vendors than methods that retrain the base model itself.

Load-bearing premise

Prepending the guardian's risk label and explanation will reliably improve the base model's compliance and output quality without introducing new inconsistencies or harming performance on safe queries.

What would settle it

A controlled test on standard safety and helpfulness benchmarks checking whether the GaaA-augmented model produces either more unsafe outputs or more over-refusals than the unaugmented base model.

read the original abstract

Hard-gated safety checkers often over-refuse and misalign with a vendor's model spec; prevailing taxonomies also neglect robustness and honesty, yielding safer-on-paper yet less useful systems. This work introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline where a guardian predicts a binary risk label plus a concise explanation and prepends this advice to the original query for re-inference, keeping the base model operating under its original spec. To support training and evaluation, GuardSet is constructed, a 208k+ multi-domain dataset unifying harmful and harmless cases with targeted robustness and honesty slices. GuardAdvisor is trained via SFT followed by RL to enforce label-explanation consistency. GuardAdvisor attains competitive detection accuracy while enabling the advisory workflow; when used to augment inputs, responses improve over unaugmented prompts. A latency study shows advisor inference uses below 5% of base-model compute and adds only 2-10% end-to-end overhead under realistic harmful-input rates. Overall, GaaA steers models to comply with the model spec, maintaining safety while reducing over-refusal.
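The SFT-then-RL recipe above enforces label-explanation consistency; the abstract does not spell out the reward, but one plausible shape is sketched here, with a hypothetical judge callable standing in for whatever scorer the authors use.

```python
# Hypothetical consistency reward for the RL stage described in the abstract.
# `explanation_supports` is an assumed judge callable -- the abstract does
# not specify how label-explanation consistency is actually scored.

def consistency_reward(pred_label: str, explanation: str, gold_label: str,
                       explanation_supports) -> float:
    """Reward only when the label is right AND the explanation argues for it."""
    correct = float(pred_label == gold_label)
    supports = float(explanation_supports(explanation, pred_label))
    return correct * supports  # inconsistent pairs earn no reward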

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline in which a guardian model outputs a binary risk label plus concise explanation that is prepended to the user's query before re-inference by the base LLM. The goal is to steer the base model toward better compliance with a vendor's model spec while preserving safety and reducing over-refusal. The authors construct GuardSet, a 208k+ multi-domain dataset that unifies harmful and harmless cases with targeted robustness and honesty slices, and train GuardAdvisor via SFT followed by RL to enforce label-explanation consistency. The abstract claims competitive detection accuracy, improved responses relative to unaugmented prompts, and low latency overhead (advisor inference <5% of base-model compute, 2-10% end-to-end).

Significance. If the empirical claims are substantiated, GaaA offers a practical alternative to hard-gated safety filters by treating the guardian as an advisor rather than a gatekeeper. The construction of GuardSet and the use of RL for consistency are concrete contributions that could be reused by others working on spec-compliant LLMs. The reported latency profile, if verified, would make the method deployable at scale.

major comments (2)
  1. [Abstract] The claims of 'competitive detection accuracy' and 'responses improve over unaugmented prompts' are stated without any numerical results, baselines, error bars, or ablation tables. Because the central contribution is empirical, the absence of these data prevents verification of the safety-utility tradeoff that the paper asserts.
  2. [Abstract] The core mechanism (prepending the guardian's binary label and explanation) is asserted to steer the base model toward spec compliance and reduced over-refusal, yet no quantitative evidence is supplied on steering reliability, performance stratified by query type (harmful vs. borderline), or the effect of erroneous labels. This leaves the weakest assumption untested.
minor comments (1)
  1. [Abstract] The latency study is summarized but the experimental setup (input lengths, batch sizes, hardware) is not described even at a high level, making the overhead numbers difficult to interpret.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical substantiation in the abstract. We address each major comment below and will revise the manuscript to incorporate key quantitative results from the body of the paper.

read point-by-point responses
  1. Referee: [Abstract] The claims of 'competitive detection accuracy' and 'responses improve over unaugmented prompts' are stated without any numerical results, baselines, error bars, or ablation tables. Because the central contribution is empirical, the absence of these data prevents verification of the safety-utility tradeoff that the paper asserts.

    Authors: We agree that the abstract would be strengthened by including specific numerical results. The full manuscript reports these details in the Experiments section, with tables showing GuardAdvisor's detection accuracy on GuardSet (competitive with baselines), compliance and over-refusal improvements when using the advisory prepend, ablations, and error bars. We will revise the abstract to include representative metrics (e.g., accuracy percentages, improvement deltas, and the stated latency overhead of <5% advisor compute and 2-10% end-to-end) while referencing the relevant tables and figures. revision: yes

  2. Referee: [Abstract] The core mechanism (prepending the guardian's binary label and explanation) is asserted to steer the base model toward spec compliance and reduced over-refusal, yet no quantitative evidence is supplied on steering reliability, performance stratified by query type (harmful vs. borderline), or the effect of erroneous labels. This leaves the weakest assumption untested.

    Authors: The manuscript provides quantitative evidence for the steering effect through measured improvements in spec compliance and reduced over-refusal rates when the label-explanation advice is prepended. GuardSet explicitly includes targeted slices for robustness and honesty, with results stratified across harmful, harmless, and borderline query types. However, a dedicated ablation on the impact of erroneous guardian labels is not present. We will add this analysis (e.g., performance under simulated label noise) and update the abstract to summarize the steering reliability findings with specific metrics. revision: partial
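The erroneous-label ablation promised above could take roughly the following shape, flipping guardian labels with some probability before they are prepended. All names here are hypothetical stand-ins; this is a sketch of the experiment, not the authors' code.

```python
import random

# Sketch of the erroneous-label ablation promised in the rebuttal.
# `gaa_respond_with_label` and `score_response` are hypothetical stand-ins
# for the GaaA pipeline and whatever evaluation metric the paper uses.

def flip(label: str) -> str:
    return "SAFE" if label == "RISKY" else "RISKY"

def noisy_label_run(examples, noise: float, gaa_respond_with_label, score_response):
    """Measure response quality when a fraction `noise` of guardian labels is wrong."""
    scores = []
    for query, true_label in examples:
        label = flip(true_label) if random.random() < noise else true_label
        response = gaa_respond_with_label(query, label)
        scores.append(score_response(query, response))
    return sum(scores) / len(scores)

# e.g. compare noisy_label_run(data, 0.0, ...) against noisy_label_run(data, 0.2, ...)
```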

Circularity Check

0 steps flagged

No circularity: purely empirical pipeline with no derivations

full rationale

The paper introduces GaaA as an empirical soft-gating pipeline, constructs the GuardSet dataset (208k+ multi-domain examples), trains GuardAdvisor via SFT then RL for label-explanation consistency, and reports experimental outcomes on detection accuracy, response improvement, and latency. No equations, mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. Central claims rest on direct comparisons (augmented vs. unaugmented prompts) rather than any reduction of outputs to inputs by construction. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. Standard ML training assumptions (e.g., that SFT+RL produces consistent label-explanation pairs) are implicit but not detailed.

pith-pipeline@v0.9.0 · 5520 in / 1161 out tokens · 43600 ms · 2026-05-10T17:22:11.935024+00:00 · methodology

