pith. machine review for the scientific record.

arxiv: 2604.07655 · v1 · submitted 2026-04-08 · 💻 cs.LG · cs.CL

Recognition: unknown

Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

Han Bao, Hang Hua, Haomin Zhuang, Jiayi Ye, Pin-Yu Chen, Siyuan Wu, Xiangliang Zhang, Yanbo Wang, Yue Huang

Pith reviewed 2026-05-10 17:22 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords: Guardian-as-an-Advisor (GaaA) · LLM safety · over-refusal · model spec compliance · soft-gating · GuardSet · advisory workflow

The pith

A guardian model advises base LLMs by prepending risk labels and explanations, steering outputs to match the model spec while cutting over-refusal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Guardian-as-an-Advisor, a pipeline in which a separate guardian model first labels a query as risky or safe and adds a short explanation. This advice is then prepended to the original query so the base LLM can re-infer its answer under its original rules. The approach replaces hard safety gates that block too many harmless requests. A new dataset called GuardSet supplies the training data, including slices for robustness and honesty. When the guardian is trained with supervised fine-tuning followed by reinforcement learning for label-explanation consistency, the resulting system keeps safety levels intact yet produces fewer refusals on safe inputs.

Core claim

Guardian-as-an-Advisor (GaaA) is a soft-gating method in which a guardian model outputs a binary risk label plus a concise explanation and prepends both to the user query before the base model generates its response. This keeps the base model operating under its original model spec rather than overriding it with a hard gate. Experiments show that the augmented prompts yield responses that better comply with the spec, preserve safety on harmful inputs, and reduce over-refusal on harmless ones. The guardian itself reaches competitive detection accuracy at low added compute cost.

What carries the argument

Guardian-as-an-Advisor (GaaA) soft-gating pipeline: a guardian predicts a binary risk label and a concise explanation, then prepends this advice to the original query for re-inference by the base model.
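To make the mechanism concrete, here is a minimal sketch of the soft-gating flow, assuming generic chat-completion callables; the prompt template and function names are ours for illustration, not the paper's.

```python
# Minimal sketch of the GaaA soft-gating flow described above.
# `guardian` and `base_model` stand in for any chat-completion callable;
# the prompt wording is a hypothetical rendering, not the paper's format.

def guardian_advise(guardian, query: str) -> tuple[str, str]:
    """Ask the guardian for a binary risk label plus a concise explanation."""
    out = guardian(
        "Classify the user query as RISKY or SAFE and explain briefly.\n"
        f"Query: {query}\nAnswer as: <label> | <explanation>"
    )
    label, _, explanation = out.partition("|")
    return label.strip(), explanation.strip()

def gaa_respond(guardian, base_model, query: str) -> str:
    """Soft gating: prepend the advice, then let the base model answer
    under its own model spec -- no hard block ever occurs."""
    label, explanation = guardian_advise(guardian, query)
    advice = (
        f"[Guardian advice] risk={label}. {explanation} "
        "Respond according to your model spec."
    )
    return base_model(f"{advice}\n\n{query}")
```

The design point is that the guardian never overrides the base model; even a RISKY label reaches the base model only as context, leaving the final decision to the original spec.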

If this is right

  • Responses from the base model improve over unaugmented prompts on the same inputs.
  • Safety is maintained while over-refusal drops.
  • Advisor inference consumes under 5 percent of base-model compute and adds only 2-10 percent end-to-end latency at realistic harmful-input rates (a back-of-envelope model appears after this list).
  • The same advisory workflow works across multiple domains covered in GuardSet.
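A back-of-envelope model of the latency claim, with invented constants; only the <5% compute and 2-10% end-to-end targets come from the abstract, everything else is assumed for illustration.

```python
# Back-of-envelope latency model for the advisory pipeline.
# All constants below are illustrative assumptions; only the <5% compute
# and 2-10% end-to-end targets come from the abstract.

def end_to_end_overhead(harmful_rate: float,
                        advisor_ms: float = 40.0,       # assumed guardian latency
                        base_ms_safe: float = 1200.0,   # assumed base-model latency, safe query
                        base_ms_harmful: float = 600.0  # assumed: refusal-style answers are shorter
                        ) -> float:
    """Fractional added latency from running the advisor on every query."""
    base = (1 - harmful_rate) * base_ms_safe + harmful_rate * base_ms_harmful
    return advisor_ms / base

for rate in (0.01, 0.10, 0.50):
    print(f"harmful rate {rate:4.0%}: +{end_to_end_overhead(rate):.1%} latency")
# Under these assumptions the overhead stays in the low single digits at
# realistic harmful-input rates, consistent with the claimed 2-10% band.
```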

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prepend-advice pattern could be applied to other alignment goals such as honesty or fairness checks.
  • GuardSet's robustness and honesty slices offer a ready benchmark for testing whether other guardian designs also reduce side effects.
  • Because the base model stays under its original spec, the method may transfer more easily across vendors than methods that retrain the base model itself.

Load-bearing premise

Prepending the guardian's risk label and explanation will reliably improve the base model's compliance and output quality without introducing new inconsistencies or harming performance on safe queries.

What would settle it

A controlled test on standard safety and helpfulness benchmarks checking whether the GaaA-augmented model produces either more unsafe outputs or more over-refusals than the unaugmented base model.

read the original abstract

Hard-gated safety checkers often over-refuse and misalign with a vendor's model spec; prevailing taxonomies also neglect robustness and honesty, yielding safer-on-paper yet less useful systems. This work introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline where a guardian predicts a binary risk label plus a concise explanation and prepends this advice to the original query for re-inference, keeping the base model operating under its original spec. To support training and evaluation, GuardSet is constructed, a 208k+ multi-domain dataset unifying harmful and harmless cases with targeted robustness and honesty slices. GuardAdvisor is trained via SFT followed by RL to enforce label-explanation consistency. GuardAdvisor attains competitive detection accuracy while enabling the advisory workflow; when used to augment inputs, responses improve over unaugmented prompts. A latency study shows advisor inference uses below 5% of base-model compute and adds only 2-10% end-to-end overhead under realistic harmful-input rates. Overall, GaaA steers models to comply with the model spec, maintaining safety while reducing over-refusal.
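The SFT-then-RL recipe above enforces label-explanation consistency; the abstract does not spell out the reward, but one plausible shape is sketched here, with a hypothetical judge callable standing in for whatever scorer the authors use.

```python
# Hypothetical consistency reward for the RL stage described in the abstract.
# `explanation_supports` is an assumed judge callable -- the abstract does
# not specify how label-explanation consistency is actually scored.

def consistency_reward(pred_label: str, explanation: str, gold_label: str,
                       explanation_supports) -> float:
    """Reward only when the label is right AND the explanation argues for it."""
    correct = float(pred_label == gold_label)
    supports = float(explanation_supports(explanation, pred_label))
    return correct * supports  # inconsistent pairs earn no reward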

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline in which a guardian model outputs a binary risk label plus concise explanation that is prepended to the user's query before re-inference by the base LLM. The goal is to steer the base model toward better compliance with a vendor's model spec while preserving safety and reducing over-refusal. The authors construct GuardSet, a 208k+ multi-domain dataset that unifies harmful and harmless cases with targeted robustness and honesty slices, and train GuardAdvisor via SFT followed by RL to enforce label-explanation consistency. The abstract claims competitive detection accuracy, improved responses relative to unaugmented prompts, and low latency overhead (advisor inference <5% of base-model compute, 2-10% end-to-end).

Significance. If the empirical claims are substantiated, GaaA offers a practical alternative to hard-gated safety filters by treating the guardian as an advisor rather than a gatekeeper. The construction of GuardSet and the use of RL for consistency are concrete contributions that could be reused by others working on spec-compliant LLMs. The reported latency profile, if verified, would make the method deployable at scale.

major comments (2)
  1. [Abstract] The claims of 'competitive detection accuracy' and 'responses improve over unaugmented prompts' are stated without any numerical results, baselines, error bars, or ablation tables. Because the central contribution is empirical, the absence of these data prevents verification of the safety-utility tradeoff that the paper asserts.
  2. [Abstract] The core mechanism (prepending the guardian's binary label and explanation) is asserted to steer the base model toward spec compliance and reduced over-refusal, yet no quantitative evidence is supplied on steering reliability, performance stratified by query type (harmful vs. borderline), or the effect of erroneous labels. This leaves the weakest assumption untested.
minor comments (1)
  1. [Abstract] The latency study is summarized but the experimental setup (input lengths, batch sizes, hardware) is not described even at a high level, making the overhead numbers difficult to interpret.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical substantiation in the abstract. We address each major comment below and will revise the manuscript to incorporate key quantitative results from the body of the paper.

read point-by-point responses
  1. Referee: [Abstract] The claims of 'competitive detection accuracy' and 'responses improve over unaugmented prompts' are stated without any numerical results, baselines, error bars, or ablation tables. Because the central contribution is empirical, the absence of these data prevents verification of the safety-utility tradeoff that the paper asserts.

    Authors: We agree that the abstract would be strengthened by including specific numerical results. The full manuscript reports these details in the Experiments section, with tables showing GuardAdvisor's detection accuracy on GuardSet (competitive with baselines), compliance and over-refusal improvements when using the advisory prepend, ablations, and error bars. We will revise the abstract to include representative metrics (e.g., accuracy percentages, improvement deltas, and the stated latency overhead of <5% advisor compute and 2-10% end-to-end) while referencing the relevant tables and figures. revision: yes

  2. Referee: [Abstract] The core mechanism (prepending the guardian's binary label and explanation) is asserted to steer the base model toward spec compliance and reduced over-refusal, yet no quantitative evidence is supplied on steering reliability, performance stratified by query type (harmful vs. borderline), or the effect of erroneous labels. This leaves the weakest assumption untested.

    Authors: The manuscript provides quantitative evidence for the steering effect through measured improvements in spec compliance and reduced over-refusal rates when the label-explanation advice is prepended. GuardSet explicitly includes targeted slices for robustness and honesty, with results stratified across harmful, harmless, and borderline query types. However, a dedicated ablation on the impact of erroneous guardian labels is not present. We will add this analysis (e.g., performance under simulated label noise) and update the abstract to summarize the steering reliability findings with specific metrics. revision: partial
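The erroneous-label ablation promised above could take roughly the following shape, flipping guardian labels with some probability before they are prepended. All names here are hypothetical stand-ins; this is a sketch of the experiment, not the authors' code.

```python
import random

# Sketch of the erroneous-label ablation promised in the rebuttal.
# `gaa_respond_with_label` and `score_response` are hypothetical stand-ins
# for the GaaA pipeline and whatever evaluation metric the paper uses.

def flip(label: str) -> str:
    return "SAFE" if label == "RISKY" else "RISKY"

def noisy_label_run(examples, noise: float, gaa_respond_with_label, score_response):
    """Measure response quality when a fraction `noise` of guardian labels is wrong."""
    scores = []
    for query, true_label in examples:
        label = flip(true_label) if random.random() < noise else true_label
        response = gaa_respond_with_label(query, label)
        scores.append(score_response(query, response))
    return sum(scores) / len(scores)

# e.g. compare noisy_label_run(data, 0.0, ...) against noisy_label_run(data, 0.2, ...)
```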

Circularity Check

0 steps flagged

No circularity: purely empirical pipeline with no derivations

full rationale

The paper introduces GaaA as an empirical soft-gating pipeline, constructs the GuardSet dataset (208k+ multi-domain examples), trains GuardAdvisor via SFT then RL for label-explanation consistency, and reports experimental outcomes on detection accuracy, response improvement, and latency. No equations, mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. Central claims rest on direct comparisons (augmented vs. unaugmented prompts) rather than any reduction of outputs to inputs by construction. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. Standard ML training assumptions (e.g., that SFT+RL produces consistent label-explanation pairs) are implicit but not detailed.

pith-pipeline@v0.9.0 · 5520 in / 1161 out tokens · 43600 ms · 2026-05-10T17:22:11.935024+00:00 · methodology

