Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
Pith reviewed 2026-05-10 17:22 UTC · model grok-4.3
The pith
A guardian model advises base LLMs by prepending risk labels and explanations, steering outputs to match the model spec while cutting over-refusal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Guardian-as-an-Advisor (GaaA) is a soft-gating method in which a guardian model outputs a binary risk label plus a concise explanation and prepends both to the user query before the base model generates its response. This keeps the base model operating under its original model spec rather than overriding it with a hard gate. Experiments show that the augmented prompts yield responses that better comply with the spec, preserve safety on harmful inputs, and reduce over-refusal on harmless ones. The guardian itself reaches competitive detection accuracy at low added compute cost.
What carries the argument
Guardian-as-an-Advisor (GaaA) soft-gating pipeline: a guardian predicts a binary risk label and explanation then prepends this advice to the original query for re-inference by the base model.
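As a concrete sketch, the advisory loop reduces to a few lines. The guardian and base model below are hypothetical toy stand-ins (a keyword check and an echoing stub), not the paper's GuardAdvisor or any real LLM:

```python
# Minimal sketch of the GaaA soft-gating loop. `toy_guardian` and
# `toy_base_model` are hypothetical stand-ins for illustration only.

def toy_guardian(query: str) -> tuple[str, str]:
    """Binary risk label plus a concise explanation (toy keyword check)."""
    risky = any(w in query.lower() for w in ("weapon", "exploit", "malware"))
    label = "risky" if risky else "safe"
    explanation = ("The query requests potentially harmful content."
                   if risky else "The query appears benign.")
    return label, explanation

def toy_base_model(prompt: str) -> str:
    """Stand-in for re-inference by the base LLM under its original spec."""
    return f"<response to: {prompt}>"

def gaaa_respond(query: str, guardian=toy_guardian,
                 base_model=toy_base_model) -> str:
    """Soft gating: advice is prepended; the base model still answers
    every query, so nothing is hard-blocked."""
    label, explanation = guardian(query)
    advice = f"[Guardian advice] risk={label}. {explanation}\n\n"
    return base_model(advice + query)

print(gaaa_respond("How do I bake bread?"))
```

A hard gate would return a canned refusal instead of calling the base model on risky inputs; the advisory variant leaves that decision to the base model's own spec.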
If this is right
- Responses from the base model improve over unaugmented prompts on the same inputs.
- Safety is maintained while over-refusal drops.
- Advisor inference consumes under 5% of base-model compute and adds only 2-10% end-to-end latency at realistic harmful-input rates.
- The same advisory workflow works across multiple domains covered in GuardSet.
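The compute and latency bullets above admit a quick back-of-envelope check. The accounting below is our own illustration, not the paper's measurement: the advisor is assumed to run on every query at a fixed fraction of base-model cost, with longer prepended explanations paid only on the harmful fraction of traffic:

```python
def end_to_end_overhead(advisor_cost_frac: float,
                        extra_prompt_frac: float,
                        harmful_rate: float) -> float:
    """Illustrative overhead model (assumed, not from the paper):
    the advisor always runs; harmful queries additionally pay for a
    longer prepended explanation during re-inference."""
    return advisor_cost_frac + harmful_rate * extra_prompt_frac

# Assumed numbers: 4% advisor cost, 30% longer prompts on harmful
# inputs, 5% harmful-input rate -> roughly 5.5% end-to-end overhead,
# inside the paper's reported 2-10% band.
print(round(end_to_end_overhead(0.04, 0.30, 0.05), 3))  # 0.055
```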
Where Pith is reading between the lines
- The same prepend-advice pattern could be applied to other alignment goals such as honesty or fairness checks.
- GuardSet's robustness and honesty slices offer a ready benchmark for testing whether other guardian designs also reduce side effects.
- Because the base model stays under its original spec, the method may transfer more easily across vendors than methods that retrain the base model itself.
Load-bearing premise
Prepending the guardian's risk label and explanation will reliably improve the base model's compliance and output quality without introducing new inconsistencies or harming performance on safe queries.
What would settle it
A controlled test on standard safety and helpfulness benchmarks: the claim would be refuted if the GaaA-augmented model produced either more unsafe outputs or more over-refusals than the unaugmented base model.
read the original abstract
Hard-gated safety checkers often over-refuse and misalign with a vendor's model spec; prevailing taxonomies also neglect robustness and honesty, yielding safer-on-paper yet less useful systems. This work introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline where a guardian predicts a binary risk label plus a concise explanation and prepends this advice to the original query for re-inference, keeping the base model operating under its original spec. To support training and evaluation, GuardSet is constructed, a 208k+ multi-domain dataset unifying harmful and harmless cases with targeted robustness and honesty slices. GuardAdvisor is trained via SFT followed by RL to enforce label-explanation consistency. GuardAdvisor attains competitive detection accuracy while enabling the advisory workflow; when used to augment inputs, responses improve over unaugmented prompts. A latency study shows advisor inference uses below 5% of base-model compute and adds only 2-10% end-to-end overhead under realistic harmful-input rates. Overall, GaaA steers models to comply with the model spec, maintaining safety while reducing over-refusal.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline in which a guardian model outputs a binary risk label plus concise explanation that is prepended to the user's query before re-inference by the base LLM. The goal is to steer the base model toward better compliance with a vendor's model spec while preserving safety and reducing over-refusal. The authors construct GuardSet, a 208k+ multi-domain dataset that unifies harmful and harmless cases with targeted robustness and honesty slices, and train GuardAdvisor via SFT followed by RL to enforce label-explanation consistency. The abstract claims competitive detection accuracy, improved responses relative to unaugmented prompts, and low latency overhead (advisor inference <5% of base-model compute, 2-10% end-to-end).
Significance. If the empirical claims are substantiated, GaaA offers a practical alternative to hard-gated safety filters by treating the guardian as an advisor rather than a gatekeeper. The construction of GuardSet and the use of RL for consistency are concrete contributions that could be reused by others working on spec-compliant LLMs. The reported latency profile, if verified, would make the method deployable at scale.
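The "RL to enforce label-explanation consistency" step could, in its simplest form, score each guardian output with a consistency reward. The keyword heuristic below is a hypothetical proxy for whatever judge the authors actually use:

```python
def consistency_reward(label: str, explanation: str) -> float:
    """Hypothetical reward: +1 when the explanation's stance agrees with
    the binary label, -1 otherwise. A toy keyword check stands in for a
    real consistency judge."""
    risky_cues = ("harmful", "unsafe", "risk", "dangerous", "illegal")
    explanation_risky = any(cue in explanation.lower() for cue in risky_cues)
    label_risky = (label == "risky")
    return 1.0 if explanation_risky == label_risky else -1.0

print(consistency_reward("risky", "The request seeks harmful instructions."))  # 1.0
print(consistency_reward("safe", "This content is dangerous."))                # -1.0
```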
major comments (2)
- [Abstract] The claims of 'competitive detection accuracy' and 'responses improve over unaugmented prompts' are stated without numerical results, baselines, error bars, or ablation tables. Because the central contribution is empirical, the absence of these data prevents verification of the safety-utility tradeoff the paper asserts.
- [Abstract] The core mechanism (prepending the guardian's binary label and explanation) is asserted to steer the base model toward spec compliance and reduced over-refusal, yet no quantitative evidence is supplied on steering reliability, performance stratified by query type (harmful vs. borderline), or the effect of erroneous labels. This leaves the paper's weakest assumption untested.
minor comments (1)
- [Abstract] The latency study is summarized but the experimental setup (input lengths, batch sizes, hardware) is not described even at a high level, making the overhead numbers difficult to interpret.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger empirical substantiation in the abstract. We address each major comment below and will revise the manuscript to incorporate key quantitative results from the body of the paper.
read point-by-point responses
- Referee: [Abstract] The claims of 'competitive detection accuracy' and 'responses improve over unaugmented prompts' are stated without numerical results, baselines, error bars, or ablation tables. Because the central contribution is empirical, the absence of these data prevents verification of the safety-utility tradeoff the paper asserts.
  Authors: We agree that the abstract would be strengthened by specific numerical results. The full manuscript reports these details in the Experiments section, with tables showing GuardAdvisor's detection accuracy on GuardSet (competitive with baselines), compliance and over-refusal improvements from the advisory prepend, ablations, and error bars. We will revise the abstract to include representative metrics (e.g., accuracy percentages, improvement deltas, and the stated latency overhead of <5% advisor compute and 2-10% end-to-end) and reference the relevant tables and figures. Revision: yes
- Referee: [Abstract] The core mechanism (prepending the guardian's binary label and explanation) is asserted to steer the base model toward spec compliance and reduced over-refusal, yet no quantitative evidence is supplied on steering reliability, performance stratified by query type (harmful vs. borderline), or the effect of erroneous labels. This leaves the paper's weakest assumption untested.
  Authors: The manuscript provides quantitative evidence for the steering effect through measured improvements in spec compliance and reduced over-refusal rates when the label-explanation advice is prepended. GuardSet explicitly includes targeted slices for robustness and honesty, with results stratified across harmful, harmless, and borderline query types. However, a dedicated ablation on the impact of erroneous guardian labels is absent. We will add this analysis (e.g., performance under simulated label noise) and update the abstract to summarize the steering-reliability findings with specific metrics. Revision: partial
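The label-noise ablation proposed in the rebuttal could start from a simple corruption of the guardian's binary labels, then re-measure compliance and over-refusal at each noise level. This is entirely our sketch, not the authors' protocol:

```python
import random

def flip_labels(labels: list[str], noise_rate: float, seed: int = 0) -> list[str]:
    """Simulate guardian errors: flip each binary label ('safe'/'risky')
    independently with probability `noise_rate`."""
    rng = random.Random(seed)
    flip = {"safe": "risky", "risky": "safe"}
    return [flip[l] if rng.random() < noise_rate else l for l in labels]

clean = ["safe"] * 90 + ["risky"] * 10
for rate in (0.0, 0.1, 0.3):
    noisy = flip_labels(clean, rate)
    corrupted = sum(a != b for a, b in zip(clean, noisy))
    print(rate, corrupted)  # corruption count grows with the noise rate
```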
Circularity Check
No circularity: purely empirical pipeline with no derivations
full rationale
The paper introduces GaaA as an empirical soft-gating pipeline, constructs the GuardSet dataset (208k+ multi-domain examples), trains GuardAdvisor via SFT then RL for label-explanation consistency, and reports experimental outcomes on detection accuracy, response improvement, and latency. No equations, mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. Central claims rest on direct comparisons (augmented vs. unaugmented prompts) rather than any reduction of outputs to inputs by construction. This matches the default expectation for non-circular empirical work.