pith. machine review for the scientific record.

arxiv: 2109.13916 · v5 · submitted 2021-09-28 · 💻 cs.LG · cs.AI · cs.CL · cs.CV

Recognition: 1 theorem link

· Lean Theorem

Unsolved Problems in ML Safety

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:42 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.CV
keywords machine learning safety · robustness · monitoring · alignment · systemic safety · AI deployment risks · high-stakes applications

The pith

Machine learning safety should focus on four research areas as models scale and are deployed in critical settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper lays out a roadmap for machine learning safety research by identifying four central problems. As ML systems grow larger, gain new capabilities, and enter high-stakes applications, the authors argue that safety should be a leading research priority, as it is for other powerful technologies. They break the challenges into robustness for withstanding hazards, monitoring for identifying them, alignment for reducing hazards inherent to models, and systemic safety for addressing broader deployment-scale risks, and they give concrete research directions for each to guide future work.

Core claim

We present four problems ready for research, namely withstanding hazards (Robustness), identifying hazards (Monitoring), reducing inherent model hazards (Alignment), and reducing systemic hazards (Systemic Safety). Throughout, we clarify each problem's motivation and provide concrete research directions.

What carries the argument

A four-category framework dividing ML safety into Robustness, Monitoring, Alignment, and Systemic Safety.
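To make the partition concrete, here is an editorial sketch, not anything defined in the paper, that renders the four categories and the hazard role each targets using the wording of the abstract; the example directions and the toy `classify` helper are illustrative assumptions drawn from this review.

```python
# Editorial illustration of the four-way ML safety partition. The category names and
# hazard roles follow the paper's abstract; EXAMPLE_DIRECTIONS and classify() are
# hypothetical additions for illustration only.
from enum import Enum

class SafetyArea(Enum):
    ROBUSTNESS = "withstanding hazards"
    MONITORING = "identifying hazards"
    ALIGNMENT = "reducing inherent model hazards"
    SYSTEMIC_SAFETY = "reducing systemic hazards"

# Example directions mentioned in this review, keyed by area (illustrative, not exhaustive).
EXAMPLE_DIRECTIONS = {
    SafetyArea.ROBUSTNESS: {"adversarial examples", "distribution shift"},
    SafetyArea.MONITORING: {"uncertainty estimation", "anomaly detection"},
    SafetyArea.ALIGNMENT: {"matching objectives to human intent"},
    SafetyArea.SYSTEMIC_SAFETY: {"deployment-scale and multi-system risks"},
}

def classify(project_tags: set[str]) -> list[SafetyArea]:
    """Toy helper: map a project's keyword tags onto areas whose example topics they overlap."""
    return [area for area, topics in EXAMPLE_DIRECTIONS.items() if project_tags & topics]
```

A mapping like this is only a sketch: as the referee report below notes, some topics (for instance certain adversarial robustness issues) could plausibly sit in more than one category.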

If this is right

  • Research can target concrete directions for withstanding hazards such as adversarial examples and distribution shifts.
  • Methods can be developed to identify hazards through uncertainty estimation and anomaly detection (a minimal sketch follows this list).
  • Work on alignment can reduce unintended behaviors by better matching model objectives to human intent.
  • Systemic safety efforts can address risks arising from widespread deployment and interactions with other systems.
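As a concrete instance of the monitoring bullet above, the sketch below shows a maximum-softmax-probability baseline for flagging low-confidence or out-of-distribution inputs, in the spirit of the detection baseline cited in the reference list (Hendrycks and Gimpel). It is an editorial illustration, not code from the paper; the classifier `model`, the input batch, and the 0.5 threshold are hypothetical placeholders.

```python
# Minimal sketch (assumed PyTorch): score inputs by maximum softmax probability (MSP)
# and flag those whose confidence falls below a tunable threshold as potential anomalies.
import torch
import torch.nn.functional as F

@torch.no_grad()
def msp_scores(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Return the maximum softmax probability per input; low scores suggest anomalous inputs."""
    logits = model(x)                    # shape: (batch, num_classes)
    probs = F.softmax(logits, dim=-1)
    return probs.max(dim=-1).values      # shape: (batch,)

def flag_anomalies(model: torch.nn.Module, x: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Boolean mask over the batch marking inputs the monitor would escalate for review."""
    return msp_scores(model, x) < threshold
```

In practice the threshold would be calibrated on held-out in-distribution data, and stronger monitors (calibrated confidences, dedicated anomaly detectors) would replace this baseline.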

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This structure could help researchers classify ongoing projects and spot under-explored areas within the four categories.
  • It might support cross-disciplinary efforts by linking technical fixes to broader societal deployment concerns.
  • The framework could be revisited as new model capabilities appear to check whether the categories still separate cleanly.

Load-bearing premise

That these four categories cover the main safety challenges in ML without significant gaps, and without overlaps that would call for a different organizing structure.

What would settle it

Discovery of a major safety issue in deployed large models that fits none of the four categories, or evidence that reorganizing the problems would accelerate progress more effectively.

read the original abstract

Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the technical problems that the field needs to address. We present four problems ready for research, namely withstanding hazards ("Robustness"), identifying hazards ("Monitoring"), reducing inherent model hazards ("Alignment"), and reducing systemic hazards ("Systemic Safety"). Throughout, we clarify each problem's motivation and provide concrete research directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that ML systems are rapidly scaling in size and capabilities while being deployed in high-stakes settings, making safety a leading priority. It refines the technical problems in the field into four research-ready categories—withstanding hazards (Robustness), identifying hazards (Monitoring), reducing inherent model hazards (Alignment), and reducing systemic hazards (Systemic Safety)—and supplies motivations drawn from observed large-model behaviors along with concrete research directions for each.

Significance. If the taxonomy holds as a useful organizing lens, the paper provides a coherent roadmap that could help prioritize and structure ML safety research around scaling trends and documented failure modes. Its conceptual clarity and focus on actionable directions represent a strength for guiding community efforts, though the framework's durability will depend on subsequent research outputs validating or refining the partition.

minor comments (2)
  1. [Abstract] The abstract and introduction could briefly note potential boundary cases between categories (e.g., whether certain adversarial robustness issues fall under Robustness or Alignment) to preempt reader questions about overlaps, even if the paper does not claim exhaustiveness.
  2. [Systemic Safety] Some research directions listed under Systemic Safety would benefit from one or two additional citations to contemporaneous work on deployment risks to strengthen the motivation section.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and assessment of significance. We appreciate the recognition that the four-problem taxonomy offers a coherent and actionable roadmap for ML safety research.

Circularity Check

0 steps flagged

No significant circularity in proposed taxonomy

full rationale

The paper is a high-level roadmap that organizes ML safety into four categories (Robustness, Monitoring, Alignment, Systemic Safety) motivated by scaling trends and deployment contexts. No equations, derivations, fitted parameters, or predictions appear anywhere in the manuscript. The central claim is an organizing lens rather than a technical result that could reduce to its own inputs by construction. No self-citations function as load-bearing uniqueness theorems, and no ansatzes or renamings of known results are smuggled in. The structure is presented as a useful research agenda, not a provably minimal or derived partition, making the paper self-contained with zero circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on domain assumptions about ML capability trends and the necessity of safety research; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Machine learning systems are rapidly increasing in size, acquiring new capabilities, and being deployed in high-stakes settings
    Stated directly in the abstract as the premise motivating the roadmap.
  • domain assumption: Safety for ML should be a leading research priority
    Presented as the guiding stance for the work.

pith-pipeline@v0.9.0 · 5411 in / 1200 out tokens · 39709 ms · 2026-05-16T20:42:44.186395+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Discovering Latent Knowledge in Language Models Without Supervision

    cs.CL 2022-12 conditional novelty 8.0

    An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...

  2. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

    cs.LG 2022-11 conditional novelty 8.0

    GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.

  3. Benchmarking Sensor-Fault Robustness in Forecasting

    cs.LG 2026-05 conditional novelty 7.0

    SensorFault-Bench is a new CPS-grounded benchmark showing that clean-MSE rankings of forecasting models often disagree with their robustness under standardized sensor-fault scenarios across four real datasets.

  4. Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.

  5. Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

    cs.LG 2026-04 unverdicted novelty 7.0

    TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.

  6. Red Teaming Language Models with Language Models

    cs.CL 2022-02 conditional novelty 7.0

    One language model can generate diverse test cases to automatically uncover tens of thousands of harmful behaviors, including offensive replies and privacy leaks, in a large target language model.

  7. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  8. SARC: A Governance-by-Architecture Framework for Agentic AI Systems

    cs.SE 2026-05 unverdicted novelty 6.0

    SARC compiles constraint specifications into Pre-Action Gate, Action-Time Monitor, Post-Action Auditor, and Escalation Router components, achieving zero hard violations and 89.5% fewer soft overages than policy-as-cod...

  9. EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems

    cs.RO 2026-04 unverdicted novelty 6.0

    EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.

  10. AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction

    cs.AI 2026-02 unverdicted novelty 6.0

    AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.

  11. Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    cs.LG 2023-09 conditional novelty 6.0

    Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.

  12. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  13. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  14. U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning

    cs.AI 2026-05 unverdicted novelty 5.0

    U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.

  15. Think Before You Act -- A Neurocognitive Governance Model for Autonomous AI Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    A neurocognitive governance model formalizes a Pre-Action Governance Reasoning Loop that consults global, workflow, agent, and situational rules before each action, yielding 95% compliance accuracy with zero false esc...

  16. Face Density as a Proxy for Data Complexity: Quantifying the Hardness of Instance Count

    cs.CV 2026-04 unverdicted novelty 5.0

    Higher face density causes monotonic performance degradation in models and acts as a domain shift, even under balanced sampling.

  17. Beyond Context: Large Language Models' Failure to Grasp Users' Intent

    cs.AI 2025-12 unverdicted novelty 3.0

    LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.

Reference graph

Works this paper leans on

228 extracted references · 228 canonical work pages · cited by 17 Pith papers · 6 internal anchors

  1. [1]

    Asilomar AI Principles

    Signed by approximately 2000 AI researchers. “Asilomar AI Principles”. In: (2017)

  2. [2]

    Autonomous Weapons: An Open Letter from AI and Robotics Researchers

    Signed by 30000+ people. “Autonomous Weapons: An Open Letter from AI and Robotics Researchers”. In: (2015)

  3. [3]

    Deep Learning with Differential Privacy

    Martín Abadi, Andy Chu, I. Goodfellow, H. B. McMahan, Ilya Mironov, Kunal Talwar, and L. Zhang. “Deep Learning with Differential Privacy”. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016)

  4. [4]

    Network intrusion detection system: A systematic study of machine learning and deep learning approaches

    Zeeshan Ahmad, A. Khan, W. Cheah, J. Abdullah, and Farhan Ahmad. “Network intrusion detection system: A systematic study of machine learning and deep learning approaches”. In: Trans. Emerg. Telecommun. Technol. (2021)

  5. [5]

    Concrete Problems in AI Safety

    Dario Amodei, Christopher Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dandelion Mané. “Concrete Problems in AI Safety”. In: ArXiv (2016)

  6. [6]

    Programming Satan’s Computer

    Ross J. Anderson and Roger Needham. “Programming Satan’s Computer”. In: Computer Science Today. 1995

  7. [7]

    Drago Anguelov. Machine Learning for Autonomous Driving. 2019. URL: https://www.youtube.com/watch?v=Q0nGo2-y0xY

  8. [8]

    Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples

    Anish Athalye, Nicholas Carlini, and David A. Wagner. “Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples”. In: ICML. 2018

  9. [9]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc V. Le, and Charles Sutton. “Program Synthesis with Large Language Models”. In: ArXiv (2021)

  10. [10]

    Blind Backdoors in Deep Learning Models

    Eugene Bagdasaryan and Vitaly Shmatikov. “Blind Backdoors in Deep Learning Models”. In: USENIX Security Symposium. 2021

  11. [11]

    Towards Open Set Deep Networks

    Abhijit Bendale and Terrance Boult. “Towards Open Set Deep Networks”. In: CVPR (2016)

  12. [12]

    On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

    Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (2021)

  13. [13]

    Alien Dreams: An Emerging Art Scene

    Machine Learning at Berkeley. Alien Dreams: An Emerging Art Scene. URL: https://ml.berkeley.edu/blog/posts/clip-art/

  14. [14]

    Triggering Failures: Out-Of-Distribution detection by learning from local adversarial attacks in Semantic Segmentation

    Victor Besnier, Andrei Bursuc, David Picard, and Alexandre Briot. “Triggering Failures: Out-Of-Distribution detection by learning from local adversarial attacks in Semantic Segmentation”. In: ArXiv abs/2108.01634 (2021)

  15. [15]

    Evasion attacks against machine learning at test time

    Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. “Evasion attacks against machine learning at test time”. In: Joint European conference on machine learning and knowledge discovery in databases. Springer. 2013, pp. 387–402

  16. [16]

    The Values Encoded in Machine Learning Research

    Abeba Birhane, Pratyusha Kalluri, D. Card, William Agnew, Ravit Dotan, and Michelle Bao. “The Values Encoded in Machine Learning Research”. In: ArXiv (2021)

  17. [17]

    Certifiably Adversarially Robust Detection of Out-of-Distribution Data

    Julian Bitterwolf, Alexander Meinke, and Matthias Hein. “Certifiably Adversarially Robust Detection of Out-of-Distribution Data”. In: NeurIPS (2020)

  18. [18]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani et al. “On the Opportunities and Risks of Foundation Models”. In: ArXiv (2021)

  19. [19]

    The Vulnerable World Hypothesis

    Nick Bostrom. “The Vulnerable World Hypothesis”. In: Global Policy (2019)

  20. [20]

    Smoking behavior of adolescents exposed to cigarette advertising

    G. Botvin, C. Goldberg, E. M. Botvin, and L. Dusenbury. “Smoking behavior of adolescents exposed to cigarette advertising”. In: Public health reports (1993)

  21. [21]

    Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models

    Wieland Brendel, Jonas Rauber, and Matthias Bethge. “Decision-based adversarial attacks: Reliable attacks against black-box machine learning models”. In: arXiv preprint arXiv:1712.04248 (2017)

  22. [22]

    Hedonic relativism and planning the good society

    Philip Brickman and Donald Campbell. “Hedonic relativism and planning the good society”. In: 1971

  23. [23]

    Language Models are Few-Shot Learners

    T. Brown, B. Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, G. Krüger, T. Henighan, R. Child, Aditya Ramesh, D. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, E. Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, J....

  24. [24]

    The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation

    Miles Brundage, Shahar Avin, Jack Clark, H. Toner, P. Eckersley, Ben Garfinkel, A. Dafoe, P. Scharre, T. Zeitzoff, Bobby Filar, H. Anderson, Heather Roff, Gregory C. Allen, J. Steinhardt, Carrick Flynn, Seán Ó hÉigeartaigh, S. Beard, Haydn Belfield, Sebastian Farquhar, Clare Lyle, Rebecca Crootof, Owain Evans, Michael Page, Joanna Bryson, Roman Yampolskiy, ...

  25. [25]

    Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims

    Miles Brundage, Shahar Avin, Jasmine Wang, Haydn Belfield, Gretchen Krueger, Gillian K. Hadfield, Heidy Khlaaf, Jingying Yang, H. Toner, Ruth Fong, Tegan Maharaj, P. W. Koh, Sara Hooker, J. Leung, Andrew Trask, Emma Bluemke, Jonathan Lebensbold, Cullen O’Keefe, Mark Koren, T. Ryffel, J. Rubinovitz, T. Besiroglu, F. Carugati, Jack Clark, P. Eckersley, Sarah ...

  26. [26]

    What the GDP Gets Wrong (Why Managers Should Care)

    Erik Brynjolfsson and Adam Saunders. “What the GDP Gets Wrong (Why Managers Should Care)”. In: MIT Sloan Management Review (2009)

  27. [27]

    Automating Cyber Attacks

    Ben Buchanan, John Bansemer, Dakota Cary, Jack Lucas, and Micah Musser. “Automating Cyber Attacks”. In: 2021

  28. [28]

    Truth, Lies, and Automation

    Ben Buchanan, Andrew Lohn, Micah Musser, and Katerina Sedova. “Truth, Lies, and Automation”. In: 2021

  29. [29]

    Poisoning and Backdooring Contrastive Learning

    Nicholas Carlini and A. Terzis. “Poisoning and Backdooring Contrastive Learning”. In: ArXiv abs/2106.09667 (2021)

  30. [30]

    Towards evaluating the robustness of neural networks

    Nicholas Carlini and David Wagner. “Towards evaluating the robustness of neural networks”. In: 2017 ieee symposium on security and privacy (sp). IEEE. 2017, pp. 39–57

  31. [31]

    Unlabeled Data Improves Adversarial Robustness

    Y. Carmon, Aditi Raghunathan, Ludwig Schmidt, Percy Liang, and John C. Duchi. “Unlabeled Data Improves Adversarial Robustness”. In: NeurIPS. 2019

  32. [32]

    Emerging Properties in Self-Supervised Vision Transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. “Emerging Properties in Self-Supervised Vision Transformers”. In: Proceedings of the International Conference on Computer Vision (ICCV). 2021

  33. [33]

    Destructive Cyber Operations and Machine Learning

    Dakota Cary and Daniel Cebul. “Destructive Cyber Operations and Machine Learning”. In: 2020

  34. [34]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, J. Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter,...

  35. [35]

    Stateful Detection of Black-Box Adversarial Attacks

    Steven Chen, Nicholas Carlini, and David A. Wagner. “Stateful Detection of Black-Box Adversarial Attacks”. In: Proceedings of the 1st ACM Workshop on Security and Privacy on Artificial Intelligence (2019)

  36. [36]

    Faulty Reward Functions in the Wild

    Jack Clark and Dario Amodei. “Faulty Reward Functions in the Wild”. In: OpenAI (2016)

  37. [37]

    Quantifying Generalization in Reinforcement Learning

    Karl Cobbe, Oleg Klimov, Christopher Hesse, Taehoon Kim, and J. Schulman. “Quantifying Generalization in Reinforcement Learning”. In: ICML. 2019

  38. [38]

    Certified Adversarial Robustness via Randomized Smoothing

    Jeremy M. Cohen, Elan Rosenfeld, and J. Z. Kolter. “Certified Adversarial Robustness via Randomized Smoothing”. In: ICML. 2019

  39. [39]

    Northern Command Public Affairs

    North American Aerospace Defense Command and U.S. Northern Command Public Affairs. 2021. URL: https://www.af.mil/News/Article-Display/Article/2703548/norad-usnorthcom-lead-3rd-global-information-dominance-experiment/

  40. [40]

    AI Research Considerations for Human Existential Safety (ARCHES)

    Andrew Critch and David Krueger. “AI Research Considerations for Human Existential Safety (ARCHES)”. In: ArXiv (2020)

  41. [41]

    RobustBench: a standardized adversarial robustness benchmark

    Francesco Croce, Maksym Andriushchenko, V. Sehwag, Nicolas Flammarion, M. Chiang, Prateek Mittal, and Matthias Hein. “RobustBench: a standardized adversarial robustness benchmark”. In: ArXiv abs/2010.09670 (2020)

  42. [42]

    Monitor alarm fatigue: an integrative review

    Maria Cvach. “Monitor alarm fatigue: an integrative review”. In: Biomedical instrumentation & technology (2012)

  43. [43]

    AI governance: a research agenda

    Allan Dafoe. “AI governance: a research agenda”. In: Governance of AI Program, Future of Humanity Institute, University of Oxford: Oxford, UK (2018)

  44. [44]

    Open Problems in Cooperative AI

    Allan Dafoe, Edward Hughes, Yoram Bachrach, Tantum Collins, Kevin R. McKee, Joel Z. Leibo, Kate Larson, and Thore Graepel. “Open Problems in Cooperative AI”. In: ArXiv (2020)

  45. [45]

    Out-of-Distribution Dynamics Detection: RL-Relevant Benchmarks and Results

    Mohamad H. Danesh and Alan Fern. “Out-of-Distribution Dynamics Detection: RL-Relevant Benchmarks and Results”. In: ArXiv abs/2107.04982 (2021)

  46. [46]

    Quadrennial Defense Review Report

    Department of Defense. “Quadrennial Defense Review Report”. In: (2001)

  47. [47]

    A history of internet security

    Laura DeNardis. “A history of internet security”. In: The history of information security. Elsevier, 2007

  48. [48]

    Robust artificial intelligence and robust human organizations

    Thomas G. Dietterich. “Robust artificial intelligence and robust human organizations”. In: Frontiers of Computer Science (2018)

  49. [49]

    Reinforcement Learning Under Moral Uncertainty

    Adrien Ecoffet and Joel Lehman. “Reinforcement Learning Under Moral Uncertainty”. In: ArXiv abs/2006.04734 (2021)

  50. [50]

    Measuring and Improving Consistency in Pretrained Language Models

    Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, E. Hovy, Hinrich Schütze, and Yoav Goldberg. “Measuring and Improving Consistency in Pretrained Language Models”. In: ArXiv (2021)

  51. [51]

    A rotation and a translation suffice: Fooling cnns with simple transformations

    Logan Engstrom, Brandon Tran, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. “A rotation and a translation suffice: Fooling cnns with simple transformations”. In: arXiv (2018)

  52. [52]

    Bringing People Closer Together

    Facebook. Bringing People Closer Together. URL: https://about.fb.com/news/2018/01/news-feed-fyi-bringing-people-closer-together/

  53. [53]

    Inequality and Violent Crime

    Pablo Fajnzylber, Daniel Lederman, and Norman V. Loayza. “Inequality and Violent Crime”. In: The Journal of Law and Economics (2002)

  54. [54]

    Assessment results regarding Organization Designation Authorization (ODA) Unit Member (UM) Independence

    Wendi Folkert. “Assessment results regarding Organization Designation Authorization (ODA) Unit Member (UM) Independence”. In: Aviation Safety (2021)

  55. [55]

    System Safety in Aircraft Acquisition

    F. R. Frola and C. O. Miller. “System Safety in Aircraft Acquisition”. In: 1984

  56. [56]

    Artificial Intelligence, Values and Alignment

    Iason Gabriel. “Artificial Intelligence, Values and Alignment”. In: ArXiv (2020)

  57. [57]

    Systemantics: How Systems Work and Especially How They Fail

    John Gall. “Systemantics: How Systems Work and Especially How They Fail”. In: 1977

  58. [58]

    Augmenting Decision Making via Interactive What-If Analysis

    Sneha Gathani, Madelon Hulsebos, James Gale, P. Haas, and Çağatay Demiralp. “Augmenting Decision Making via Interactive What-If Analysis”. In: 2021

  59. [59]

    What drives tropical deforestation?: a meta-analysis of proximate and underlying causes of deforestation based on subnational case study evidence

    Helmut Geist and Eric Lambin. “What drives tropical deforestation?: a meta-analysis of proximate and underlying causes of deforestation based on subnational case study evidence”. In: 2001

  60. [60]

    A 20-Year Community Roadmap for Artificial Intelligence Research in the US

    Yolanda Gil and Bart Selman. “A 20-Year Community Roadmap for Artificial Intelligence Research in the US”. In: ArXiv abs/1908.02624 (2019)

  61. [61]

    Motivating the Rules of the Game for Adversarial Example Research

    J. Gilmer, Ryan P. Adams, I. Goodfellow, David G. Andersen, and George E. Dahl. “Motivating the Rules of the Game for Adversarial Example Research”. In: ArXiv (2018)

  62. [62]

    Explaining Explanations: An Overview of Interpretability of Machine Learning

    Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael A. Specter, and Lalana Kagal. “Explaining Explanations: An Overview of Interpretability of Machine Learning”. In: (2018)

  63. [63]

    Adversarial Policies: Attacking Deep Reinforcement Learning

    Adam Gleave, Michael Dennis, Neel Kant, Cody Wild, Sergey Levine, and Stuart J. Russell. “Adversarial Policies: Attacking Deep Reinforcement Learning”. In: ICLR (2020)

  64. [64]

    Problems of Monetary Management: The UK Experience

    Charles Goodhart. “Problems of Monetary Management: The UK Experience”. In: 1984

  65. [65]

    The third industrial revolution: Technology, productivity, and income inequality

    Jeremy Greenwood. The third industrial revolution: Technology, productivity, and income inequality

  66. [66]

    American Enterprise Institute, 1997

  67. [67]

    There Is No Turning Back: A Self-Supervised Approach for Reversibility-Aware Reinforcement Learning

    Nathan Grinsztajn, Johan Ferret, O. Pietquin, P. Preux, and M. Geist. “There Is No Turning Back: A Self-Supervised Approach for Reversibility-Aware Reinforcement Learning”. In: ArXiv abs/2106.04480 (2021)

  68. [68]

    BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

    Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. “Badnets: Identifying vulnerabilities in the machine learning model supply chain”. In: arXiv preprint arXiv:1708.06733 (2017)

  69. [69]

    On Calibration of Modern Neural Networks

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. “On Calibration of Modern Neural Networks”. In: ICML (2017)

  70. [70]

    The Off-Switch Game

    Dylan Hadfield-Menell, A. Dragan, P. Abbeel, and Stuart J. Russell. “The Off-Switch Game”. In: IJCAI (2017)

  71. [71]

    Cooperative Inverse Reinforcement Learning

    Dylan Hadfield-Menell, Stuart J. Russell, P. Abbeel, and A. Dragan. “Cooperative Inverse Reinforcement Learning”. In: NIPS. 2016

  72. [72]

    Richard Harang and Ethan M. Rudd. SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection. 2020

  73. [73]

    Equality of Opportunity in Supervised Learning

    Moritz Hardt, Eric Price, and Nathan Srebro. “Equality of Opportunity in Supervised Learning”. In: NIPS. 2016

  74. [74]

    Map-Colour Theorem

    P. J. Heawood. “Map-Colour Theorem”. In: Proceedings of The London Mathematical Society (1949), pp. 161–175

  75. [75]

    Risky business: safety regulations, risk compensation, and individual behavior

    James Hedlund. “Risky business: safety regulations, risk compensation, and individual behavior”. In: Injury Prevention (2000)

  76. [76]

    The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. “The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization”. In: ICCV (2021)

  77. [77]

    Aligning AI With Shared Human Values

    Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. “Aligning AI With Shared Human Values”. In: ICLR (2021)

  78. [78]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. “Measuring Massive Multitask Language Understanding”. In: ICLR (2021)

  79. [79]

    Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

    Dan Hendrycks and Thomas Dietterich. “Benchmarking Neural Network Robustness to Common Corruptions and Perturbations”. In: Proceedings of the International Conference on Learning Representations (2019)

  80. [80]

    A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

    Dan Hendrycks and Kevin Gimpel. “A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks”. In: ICLR (2017)

Showing first 80 references.