AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions
Pith reviewed 2026-05-23 21:51 UTC · model grok-4.3
The pith
A three-perspective framework divides AI safety into trustworthy, responsible, and safe AI to analyze risks in large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors propose a novel architectural framework for understanding and analyzing AI Safety by defining its characteristics from three perspectives: Trustworthy AI, Responsible AI, and Safe AI. They conduct an extensive review of current research and advancements from these perspectives, highlight key challenges and mitigation approaches, and illustrate innovative mechanisms, methodologies, and techniques for designing and testing AI safety through state-of-the-art examples from large language models, with the aim of promoting further research and enhancing trust in digital transformation.
What carries the argument
The three-perspective architectural framework that partitions AI safety into Trustworthy AI, Responsible AI, and Safe AI.
If this is right
- Safety research on LLMs can be organized by assigning each challenge and mitigation to one of the three perspectives.
- Reviews of state-of-the-art techniques gain consistency when grouped under Trustworthy AI, Responsible AI, or Safe AI.
- Development of new testing and design methods can target specific perspectives to address identified gaps.
- Policy and deployment decisions can reference the framework to balance the three aspects when scaling LLMs.
Where Pith is reading between the lines
- The taxonomy could serve as a template for classifying safety issues in non-language AI systems such as vision or robotics models.
- Empirical validation might involve mapping a fixed set of documented LLM incidents onto the three categories to check for coverage.
- The framework implies that future work should produce separate roadmaps or benchmarks for each perspective rather than a single undifferentiated list.
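The coverage check suggested in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the incident labels, the per-incident judgments, and the `coverage_report` helper are hypothetical and do not come from the paper; only the three perspective names do.

```python
from collections import Counter

# The three perspectives named by the paper.
PERSPECTIVES = ("Trustworthy AI", "Responsible AI", "Safe AI")


def coverage_report(assignments):
    """Summarize how well a set of incidents fits the three perspectives.

    `assignments` maps an incident label to the set of perspectives a
    reviewer judged applicable (an empty set means none fit).
    """
    uncovered = sorted(k for k, v in assignments.items() if not v)
    overlapping = sorted(k for k, v in assignments.items() if len(v) > 1)
    counts = Counter(p for v in assignments.values() for p in v)
    total = len(assignments)
    return {
        # Share of incidents that fit at least one perspective.
        "coverage": (total - len(uncovered)) / total if total else 0.0,
        # Incidents outside the framework (evidence against comprehensiveness).
        "uncovered": uncovered,
        # Incidents fitting several perspectives (evidence of redundancy).
        "overlapping": overlapping,
        "per_perspective": {p: counts.get(p, 0) for p in PERSPECTIVES},
    }


# Hypothetical incidents and judgments, for illustration only.
example = {
    "incident_a": {"Safe AI"},
    "incident_b": {"Trustworthy AI", "Responsible AI"},
    "incident_c": set(),
}
report = coverage_report(example)
```

A high `uncovered` count would speak against comprehensiveness, while a high `overlapping` count would speak against non-redundancy; how incidents are judged remains the hard part and is not addressed here.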
Load-bearing premise
The assumption that partitioning AI safety into the three perspectives of Trustworthy AI, Responsible AI, and Safe AI provides a comprehensive and non-redundant structure that usefully organizes the entire field and its challenges for LLMs.
What would settle it
Discovery of a major LLM safety issue, such as an emergent failure mode in generation or alignment, that cannot be assigned to any one of the three perspectives without creating substantial overlap or leaving it outside the framework.
Original abstract
AI Safety is an emerging area of critical importance to the safe adoption and deployment of AI systems. With the rapid proliferation of AI and especially with the recent advancement of Generative AI (or GAI), the technology ecosystem behind the design, development, adoption, and deployment of AI systems has drastically changed, broadening the scope of AI Safety to address impacts on public safety and national security. In this paper, we propose a novel architectural framework for understanding and analyzing AI Safety, defining its characteristics from three perspectives: Trustworthy AI, Responsible AI, and Safe AI. We provide an extensive review of current research and advancements in AI safety from these perspectives, highlighting their key challenges and mitigation approaches. Through examples from state-of-the-art technologies, particularly Large Language Models (LLMs), we present innovative mechanisms, methodologies, and techniques for designing and testing AI safety. Our goal is to promote advancement in AI safety research, and ultimately enhance people's trust in digital transformation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a novel architectural framework for AI Safety, structured around three perspectives—Trustworthy AI, Responsible AI, and Safe AI—and delivers an extensive literature review of state-of-the-art research, key challenges, and mitigation strategies, with particular emphasis on examples from LLMs.
Significance. A well-bounded taxonomy could help organize the fragmented AI safety literature for LLMs and surface gaps for future work; the paper's value therefore hinges on whether the three perspectives are shown to be both comprehensive and non-overlapping.
major comments (1)
- [Abstract, §1] Abstract and §1 (Introduction): the three perspectives are presented as distinct without operational definitions, assignment criteria, or exclusion rules (e.g., where alignment, robustness, or bias mitigation belong). This directly affects the central claim that the framework supplies a “comprehensive and non-redundant structure.”
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the framework's clarity.
- Referee: [Abstract, §1] Abstract and §1 (Introduction): the three perspectives are presented as distinct without operational definitions, assignment criteria, or exclusion rules (e.g., where alignment, robustness, or bias mitigation belong). This directly affects the central claim that the framework supplies a “comprehensive and non-redundant structure.”
Authors: We acknowledge that the current presentation in the abstract and §1 relies on descriptive categorization and literature examples rather than formal operational definitions or explicit assignment/exclusion rules. To directly support the claim of a comprehensive and non-redundant structure, we will revise §1 to add: (1) concise operational definitions for each perspective (Trustworthy AI: emphasis on technical reliability and verifiability; Responsible AI: emphasis on ethical, societal, and governance aspects; Safe AI: emphasis on preventing harm and ensuring alignment with human intent); (2) assignment criteria with examples (e.g., robustness and explainability assigned to Trustworthy AI, bias/fairness to Responsible AI, and alignment/harm prevention to Safe AI); and (3) exclusion rules noting that while minor overlaps exist, primary objectives determine placement. A new summary table will map key topics such as alignment, robustness, and bias mitigation to perspectives.
Revision: yes
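The assignment criteria in the rebuttal can be written down as a small lookup, which makes the "primary objective determines placement" rule concrete. This is a sketch, not the authors' table: the topic-to-perspective pairs are the examples given in the rebuttal, while the `PRIMARY_PERSPECTIVE` name, the `assign` helper, and the choice to return `None` for unlisted topics are assumptions made for illustration.

```python
# Example assignments taken from the rebuttal text; the structure is hypothetical.
PRIMARY_PERSPECTIVE = {
    "robustness": "Trustworthy AI",
    "explainability": "Trustworthy AI",
    "bias": "Responsible AI",
    "fairness": "Responsible AI",
    "alignment": "Safe AI",
    "harm prevention": "Safe AI",
}


def assign(topic):
    """Return the single primary perspective for a topic, or None if unlisted."""
    key = topic.strip().lower()
    # One value per key enforces exactly one primary placement per topic.
    return PRIMARY_PERSPECTIVE.get(key)
```

Storing a single value per topic mirrors the stated exclusion rule; a topic that returns `None` is one the stated criteria do not yet place, which is the kind of gap the referee's comment asks the authors to close.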
Circularity Check
No circularity: taxonomy proposal with no derivations or self-referential reductions
The paper is a literature review and taxonomy proposal. It introduces a three-perspective framework (Trustworthy AI, Responsible AI, Safe AI) as its central contribution but contains no equations, fitted parameters, predictions, or derivation chains. No load-bearing steps reduce by construction to inputs, self-citations, or prior author work. The framework is presented as an organizational ansatz without claiming to derive results from itself or external benchmarks in a circular way. This matches the default expectation for non-circular review papers.
Forward citations
Cited by 1 Pith paper
- Limitations on Accurate, Trusted, Human-level Reasoning
An accurate and trusted AI system cannot achieve human-level reasoning because there exist tasks easily solvable by humans but not by the system.