The Case for Model Science: Verify, Explore, Steer, Refine

Andreas Holzinger; Jianlong Zhou; Luca Longo; Przemyslaw Biecek; Thomas Fel; Wojciech Samek

arxiv: 2606.01189 · v1 · pith:LRQKUYV4new · submitted 2026-05-31 · 💻 cs.AI

Pith reviewed 2026-06-28 17:22 UTC · model grok-4.3

classification 💻 cs.AI

keywords Model Science · AI model analysis · benchmark limitations · model verification · model exploration · single case studies · AI infrastructure · explainable AI

The pith

AI research should consolidate scattered model analysis into a new discipline called Model Science built on four perspectives: Verify, Explore, Steer, and Refine.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that benchmarks have driven progress but cannot explain why models succeed or fail and routinely miss critical issues such as hallucinations and shortcuts. It proposes moving to Model Science by adapting lessons from cognitive science on multiple levels of analysis, neuroscience on single-case depth, medicine on paired training and research, and agriculture on shared infrastructure. The resulting discipline rests on three foundations: the four functional perspectives that address complementary questions about model behaviour; catalogues of datasets, models, and findings for cumulative knowledge; and detailed study of individual model instances. A sympathetic reader would care because current methods leave deployed systems that serve billions of users poorly understood. The proposal treats these elements as ready to be assembled into systematic practice.

Core claim

We argue that the AI community is now ready to move beyond benchmarking and consolidate scattered efforts in model analysis into a systematic discipline, a direction we term Model Science. Precedents from cognitive science, neuroscience, medicine, and agriculture show that complex systems require complementary levels of analysis, single-case depth, specialised training alongside research, and shared infrastructure. These lessons support three foundations: consolidation around the four perspectives Verify, Explore, Steer, and Refine; catalogues of datasets, models, and findings; and deep analysis of individual model instances rather than only model families.

What carries the argument

The four functional perspectives Verify, Explore, Steer, and Refine that together address complementary questions about model behaviour and form one of the three foundations for Model Science.

If this is right

Benchmarks will be supplemented by methods that identify why models succeed or fail rather than only measuring performance.
Shared catalogues of datasets, models, and findings will enable cumulative progress instead of repeated isolated studies.
Deep analysis of single model instances will reveal patterns that population-level studies across model families miss.
Specialised training in model analysis will develop in parallel with research practice.
Complementary levels of analysis will become standard for understanding complex model behaviours.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The four perspectives could provide a common language for integrating existing scattered tools for model inspection and control.
Infrastructure for Model Science might extend to regulatory requirements that demand evidence from Verify and Steer activities before large-scale deployment.
Single-instance analysis could become routine for high-stakes applications where aggregated metrics are known to overlook rare but severe failure modes.

Load-bearing premise

That precedents and practices from cognitive science, neuroscience, medicine, and agriculture can be transferred directly to create a viable new discipline for AI models.

What would settle it

A sustained effort to build the proposed catalogues and apply the four perspectives that produces no new explanations of model failures beyond existing benchmarks would undermine the case for Model Science.

Figures

Figures reproduced from arXiv: 2606.01189 by Andreas Holzinger, Jianlong Zhou, Luca Longo, Przemyslaw Biecek, Thomas Fel, Wojciech Samek.

Original abstract

We argue that the AI community is now ready to move beyond benchmarking and consolidate scattered efforts in model analysis into a systematic discipline, a direction we term Model Science. Complex AI models now serve billions of users, yet our understanding of how they work lags far behind our ability to deploy them. Decades of benchmark-driven research have delivered remarkable progress: extensive leaderboards, a wide range of performance metrics, tracking capability gains across diverse tasks; yet this success has also revealed the limits of benchmarks as they tell us whether models perform but not why they succeed or fail, they miss critical failure modes, such as hallucinations or shortcuts. Precedents from established sciences point the way forward: cognitive science shows that understanding complex systems requires complementary levels of analysis; neuroscience demonstrates that deep study of single cases reveals what population studies miss; medicine teaches that specialised training must develop alongside research practice; and agriculture models how shared infrastructure and principles enable cumulative progress. These lessons inform three foundations for Model Science. First, we propose to consolidate research around four functional perspectives: Verify, Explore, Steer, and Refine that address complementary questions about model behaviour. Second, we discuss the required infrastructure for cumulative knowledge: catalogues of datasets, models and findings. Third, we highlight the need for deep analysis of individual model instances, not just model families, because single cases can reveal what population studies miss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Position paper proposes 'Model Science' with four perspectives but rests on untested analogies without adaptation details or supporting evidence.

The main takeaway is that this is a position paper arguing the AI field should consolidate model analysis efforts into a new discipline called Model Science, organized around Verify, Explore, Steer, and Refine perspectives, plus shared catalogs and single-instance studies.

The paper does a solid job naming real benchmark weaknesses: they track performance gains but leave out why models succeed or fail on specific cases, including shortcuts and hallucinations. It pulls together threads from XAI work and gives them a functional framing that could reduce duplication. The single-case angle draws a fair parallel to how neuroscience sometimes spots patterns that population averages miss.

The soft spot is the direct transfer assumption. The argument invokes cognitive science, neuroscience, medicine, and agriculture to justify the move, yet supplies no mapping of how those practices would adapt to fast-changing computational models or any pilot showing they would surface new failure modes. Shared infrastructure sounds useful in principle, but the paper does not address AI-specific issues like rapid obsolescence or the cost of maintaining model catalogs.

This is aimed at researchers already active in interpretability who want a higher-level organizational lens. Readers seeking new methods, code, or empirical tests will not find them. It deserves a serious referee because the benchmark critique is grounded in observed practice and the proposed structure is coherent enough to generate useful discussion, even if the analogies require more work to hold.

Referee Report

3 major / 2 minor

Summary. The paper argues that the AI community should move beyond benchmark-driven research to establish a new systematic discipline termed 'Model Science.' It draws on precedents from cognitive science (complementary levels of analysis), neuroscience (value of single-case studies), medicine (specialized training alongside research), and agriculture (shared infrastructure for cumulative progress) to propose three foundations: (1) four functional perspectives—Verify, Explore, Steer, and Refine—for analyzing model behavior; (2) shared infrastructure including catalogues of datasets, models, and findings; and (3) deep analysis of individual model instances rather than only model families.

Significance. If the proposed framework gains traction, it could help organize scattered model analysis efforts and address benchmark limitations such as failure to explain why models succeed or fail on tasks like hallucination detection. The paper correctly identifies that current leaderboards track performance gains but provide limited insight into internal mechanisms. However, the significance is limited by the absence of any concrete mappings, pilot implementations, or falsifiable predictions demonstrating that the cited field analogies can be adapted to computational models without substantial modification.

major comments (3)

[Abstract / Foundations section] Abstract and the section outlining the three foundations: the central readiness claim—that precedents 'point the way forward' and that the community is 'now ready' to consolidate into Model Science—rests on an untested transferability assumption. No specific argument is given showing why single-case neuroscience methods would reveal LLM internals differently from population benchmarks, why medicine-style training would scale to model analysis, or how agriculture-style catalogues would overcome rapid model obsolescence.
[Precedents from neuroscience / single-instance analysis paragraph] Discussion of the neuroscience precedent for single-instance analysis: the manuscript states that 'single cases can reveal what population studies miss' but supplies no mapping or example demonstrating how this would apply to trained neural networks, where population-level benchmarks are the dominant evaluation paradigm due to the statistical nature of learned parameters.
[Infrastructure discussion] Infrastructure foundation (catalogues of datasets, models, and findings): the proposal assumes such shared resources would enable cumulative progress, yet the text does not address or provide evidence against the risk that fast iteration cycles in AI would render catalogues obsolete faster than in agriculture, undermining the cumulative-knowledge goal.

minor comments (2)

[Four perspectives section] The four perspectives (Verify, Explore, Steer, Refine) are introduced at a high level; concrete operational definitions or example workflows for each would improve clarity.
[Precedents paragraphs] The manuscript would benefit from additional citations to specific methodological papers in the referenced fields (e.g., single-case studies in neuroscience) to ground the analogies.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our position paper. We address each major comment below, clarifying the manuscript's scope as an argument for establishing Model Science rather than an empirical validation of the proposed analogies.

Referee: [Abstract / Foundations section] Abstract and the section outlining the three foundations: the central readiness claim—that precedents 'point the way forward' and that the community is 'now ready' to consolidate into Model Science—rests on an untested transferability assumption. No specific argument is given showing why single-case neuroscience methods would reveal LLM internals differently from population benchmarks, why medicine-style training would scale to model analysis, or how agriculture-style catalogues would overcome rapid model obsolescence.

Authors: We agree that the paper presents the precedents as suggestive rather than demonstrating specific transferability arguments or mappings. As a position paper, the intent is to outline why the community should pursue such consolidation, with the task of adapting and testing these ideas left to future work in the proposed discipline. We will revise the abstract and foundations section to explicitly frame the analogies as hypotheses to be investigated rather than established transfers.
revision: yes
Referee: [Precedents from neuroscience / single-instance analysis paragraph] Discussion of the neuroscience precedent for single-instance analysis: the manuscript states that 'single cases can reveal what population studies miss' but supplies no mapping or example demonstrating how this would apply to trained neural networks, where population-level benchmarks are the dominant evaluation paradigm due to the statistical nature of learned parameters.

Authors: The neuroscience reference is used to illustrate the potential value of single-instance analysis alongside population methods. We acknowledge the absence of a concrete mapping to neural networks. In revision, we will expand this paragraph with a short note on how methods such as circuit analysis on individual models could serve an analogous role to single-case studies, while recognizing that population benchmarks remain central due to the statistical nature of training.
revision: partial
Referee: [Infrastructure discussion] Infrastructure foundation (catalogues of datasets, models, and findings): the proposal assumes such shared resources would enable cumulative progress, yet the text does not address or provide evidence against the risk that fast iteration cycles in AI would render catalogues obsolete faster than in agriculture, undermining the cumulative-knowledge goal.

Authors: The manuscript does not discuss the risk of rapid obsolescence in AI relative to slower-moving fields like agriculture. This is a substantive concern that merits direct engagement. We will add a dedicated paragraph to the infrastructure section acknowledging this challenge and outlining potential mitigations, such as maintaining versioned catalogues focused on general principles and failure modes rather than transient model specifics.
revision: yes
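
The mitigation mentioned in the last response, versioned catalogues keyed on failure modes rather than on transient model specifics, can be illustrated with a small data-structure sketch. This is only one possible reading and not a schema from the paper: the names `Finding`, `Catalogue`, `by_failure_mode`, and every field are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field

# The four perspective names come from the paper; everything else here
# (class names, fields, methods) is a hypothetical illustration.
PERSPECTIVES = ("Verify", "Explore", "Steer", "Refine")


@dataclass(frozen=True)
class Finding:
    """One catalogued observation about a single model instance."""
    model_id: str       # which individual model instance was studied
    model_version: str  # recorded so the entry stays meaningful after model updates
    perspective: str    # one of PERSPECTIVES
    failure_mode: str   # general failure mode, e.g. "hallucination" or "shortcut"
    summary: str        # short free-text description of the finding


@dataclass
class Catalogue:
    """A minimal in-memory catalogue of findings."""
    findings: list = field(default_factory=list)

    def add(self, finding: Finding) -> None:
        # Reject entries that do not use one of the four perspectives.
        if finding.perspective not in PERSPECTIVES:
            raise ValueError(f"unknown perspective: {finding.perspective}")
        self.findings.append(finding)

    def by_failure_mode(self, failure_mode: str) -> list:
        """Return all findings for a failure mode, across models and versions."""
        return [f for f in self.findings if f.failure_mode == failure_mode]


catalogue = Catalogue()
catalogue.add(Finding("model-a", "1.0", "Verify", "shortcut", "relies on a spurious cue"))
catalogue.add(Finding("model-a", "2.0", "Explore", "shortcut", "cue persists after update"))
catalogue.add(Finding("model-b", "1.0", "Verify", "hallucination", "fabricated citation"))
shortcut_findings = catalogue.by_failure_mode("shortcut")
```

Indexing by failure mode rather than by model means the two `shortcut` entries remain retrievable together even after `model-a` moves from version 1.0 to 2.0, which is the kind of persistence the response has in mind.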

Circularity Check

0 steps flagged

No circularity; proposal is a conceptual argument resting on external analogies without self-referential reductions.

The paper advances a disciplinary proposal by invoking precedents from cognitive science, neuroscience, medicine, and agriculture to motivate four perspectives, shared infrastructure, and single-instance analysis. No equations, fitted parameters, or 'predictions' appear that reduce to inputs by construction. No self-citation chains or uniqueness theorems are invoked to justify the framework. The load-bearing step is the transferability assumption itself, which is external and falsifiable rather than tautological. This is a normal non-finding for a position paper whose derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

No free parameters, mathematical axioms, or invented entities are introduced; the paper is a high-level conceptual argument relying on domain assumptions about the limitations of benchmarks and transferability of practices from other sciences.

axioms (2)

domain assumption: Benchmarks reveal whether models perform but not why they succeed or fail, and miss critical failure modes such as hallucinations or shortcuts.
Stated in the abstract as the core motivation for moving beyond benchmarking.
domain assumption: Lessons from cognitive science, neuroscience, medicine, and agriculture can inform the foundations of a new discipline for AI model analysis.
Invoked to justify the three proposed foundations without further justification in the abstract.

Reference graph

Works this paper leans on

283 extracted references · 56 canonical work pages

[1]

Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES) , pages =

Slack, Dylan and Hilgard, Sophie and Jia, Emily and Singh, Sameer and Lakkaraju, Himabindu , title =. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES) , pages =. 2020 , doi =

2020
[2]

and Applebaum, Andy and Miller, Doug P

Strom, Blake E. and Applebaum, Andy and Miller, Doug P. and Nickels, Kathryn C. and Pennington, Adam G. and Thomas, Cody B. , title =. 2018 , url =

2018
[3]

Information Fusion , volume =

Woźnica, Katarzyna and Wilczyński, Piotr and Biecek, Przemysław , title =. Information Fusion , volume =. 2026 , doi =

2026
[4]

Lindsey, Jack and Gurnee, Wes and Ameisen, Emmanuel and Chen, Brian and Pearce, Adam and Turner, Nicholas L. and Citro, Craig and Abrahams, David and Carter, Shan and Hosmer, Basil and Marcus, Jonathan and Sklar, Michael and Templeton, Adly and Bricken, Trenton and McDougall, Callum and Cunningham, Hoagy and Henighan, Thomas and Jermyn, Adam and Jones, An...

2025
[5]

and Creel, Kathleen A

Bommasani, Rishi and Soylu, Dilara and Liao, Thomas I. and Creel, Kathleen A. and Liang, Percy , title =. arXiv preprint arXiv:2303.15772 , year =

arXiv
[6]

Nature , volume =

Rahwan, Iyad and Cebrian, Manuel and Obradovich, Nick and others , title =. Nature , volume =. 2019 , doi =

2019
[7]

Ullman and Fernando Martinez-Plumed and Joshua B

Ryan Burnell and Wout Schellaert and John Burden and Tomer D. Ullman and Fernando Martinez-Plumed and Joshua B. Tenenbaum and Danaja Rutar and Lucy G. Cheke and Jascha Sohl-Dickstein and Melanie Mitchell and Douwe Kiela and Murray Shanahan and Ellen M. Voorhees and Anthony G. Cohn and Joel Z. Leibo and Jose Hernandez-Orallo , title =. Science , volume =. ...

2023
[8]

Beyond the Leaderboard: A Survey of the Science of Evaluation, Benchmarking, and Methodologies for Large Language Models , rights=

Sheikhi, Saeid and Loven, Lauri and Kostakos, Panos , year=. Beyond the Leaderboard: A Survey of the Science of Evaluation, Benchmarking, and Methodologies for Large Language Models , rights=. doi:10.1109/ACCESS.2026.3686088 , journal=

work page doi:10.1109/access.2026.3686088 2026
[9]

and Hanna, Alex and Paullada, Amandalynne , booktitle =

Raji, Deborah and Denton, Emily and Bender, Emily M. and Hanna, Alex and Paullada, Amandalynne , booktitle =
[10]

2017 , institution =

Kelly, Markelle and Longjohn, Rachel and Nottingham, Kolby , title =. 2017 , institution =

2017
[11]

, title =

Fisher, Ronald A. , title =. Annals of Eugenics , volume =. 1936 , doi =

1936
[12]

Gradient-Based Learning Applied to Document Recognition , journal =

LeCun, Yann and Bottou, L. Gradient-Based Learning Applied to Document Recognition , journal =. 1998 , doi =

1998
[13]

and Santorini, Beatrice and Marcinkiewicz, Mary Ann , title =

Marcus, Mitchell P. and Santorini, Beatrice and Marcinkiewicz, Mary Ann , title =. Computational Linguistics , volume =
[14]

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages =

Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy , title =. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages =. 2016 , doi =

2016
[15]

, title =

Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R. , title =. Proceedings of the 2018 EMNLP Workshop BlackboxNLP , pages =. 2018 , doi =

2018
[16]

, title =

Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R. , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =
[17]

and Harman, Donna K

Voorhees, Ellen M. and Harman, Donna K. , title =. TREC: Experiment and Evaluation in Information Retrieval , publisher =
[18]

Proceedings of KDD Cup and Workshop , year =

Bennett, James and Lanning, Stan , title =. Proceedings of KDD Cup and Workshop , year =
[19]

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages =

Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li , title =. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages =. 2009 , doi =

2009
[20]

and Fei-Fei, Li , title =

Russakovsky, Olga and Deng, Jia and Su, Hao and Krause, Jonathan and Satheesh, Sanjeev and Ma, Sean and Huang, Zhiheng and Karpathy, Andrej and Khosla, Aditya and Bernstein, Michael and Berg, Alexander C. and Fei-Fei, Li , title =. International Journal of Computer Vision , volume =. 2015 , doi =

2015
[21]

, title =

Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E. , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =
[22]

Big Data & Society , volume =

Denton, Emily and Hanna, Alex and Amironesei, Razvan and Smart, Andrew and Nicole, Hilary , title =. Big Data & Society , volume =. 2021 , doi =

2021
[23]

and Dahl, George , title =

Bowman, Samuel R. and Dahl, George , title =. arXiv preprint arXiv:2104.02145 , year =

arXiv
[24]

arXiv preprint arXiv:2310.18018 , year =

Sainz, Oscar and Campos, Jon Ander and Garc. arXiv preprint arXiv:2310.18018 , year =

arXiv
[25]

2021 , note =

Ruder, Sebastian , title =. 2021 , note =

2021
[26]

Shortcut Learning in Deep Neural Networks , booktitle =

Geirhos, Robert and Jacobsen, J. Shortcut Learning in Deep Neural Networks , booktitle =. 2020 , doi =

2020
[27]

and others , title =

Wilkinson, Mark D. and others , title =. Scientific Data , volume =. 2016 , doi =

2016
[28]

, title =

Seyhan, Attila A. , title =. Translational Medicine Communications , volume =. 2019 , doi =

2019
[29]

, title =

Fuster, Valentin and Sweeny, Joseph M. , title =. Circulation , volume =. 2011 , doi =

2011
[30]

2025 , note =

Bourdois, Loïck , title =. 2025 , note =

2025
[31]

and Adeli, Ehsan and others , title =

Bommasani, Rishi and Hudson, Drew A. and Adeli, Ehsan and others , title =. arXiv preprint arXiv:2108.07258 , year =

Pith/arXiv arXiv
[32]

2026 , url =

Artificial Intelligence Index Report 2026 , institution =. 2026 , url =

2026
[33]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =

Erasing Concepts from Diffusion Models , author =. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =
[34]

Li and Ann-Kathrin Dombrowski and Shashwat Goel and Long Phan and Gabriel Mukobi and Nathan Helm-Burger and Rassin Lababidi and Lennart Justen and Andrew B

Nathaniel Li and Alexander Pan and Anjali Gopal and Summer Yue and Daniel Berrios and Alice Gatti and Justin D. Li and Ann-Kathrin Dombrowski and Shashwat Goel and Long Phan and Gabriel Mukobi and Nathan Helm-Burger and Rassin Lababidi and Lennart Justen and Andrew B. Liu and Michael Chen and Isabelle Barrass and Oliver Zhang and Xiaoyuan Zhu and Rishub T...
[35]

arXiv preprint arXiv:2502.18969 , year =

(Mis)Fitting: A Survey of Scaling Laws , author =. arXiv preprint arXiv:2502.18969 , year =

arXiv
[36]

arXiv preprint arXiv:2001.08361 , year =

Scaling Laws for Neural Language Models , author =. arXiv preprint arXiv:2001.08361 , year =

Pith/arXiv arXiv 2001
[37]

Bo and Tianyu Xu and Ishan Chatterjee and Katrina Passarella-Ward and Achin Kulshrestha and D Shin , journal =

Jessica Y. Bo and Tianyu Xu and Ishan Chatterjee and Katrina Passarella-Ward and Achin Kulshrestha and D Shin , journal =. Steerable Chatbots: Personalizing
[38]

Alireza Salemi and Sheshera Mysore and Michael Bendersky and Hamed Zamani , booktitle =
[39]

Emergent Misalignment: Narrow finetuning can produce broadly misaligned

Jan Betley and Daniel Chee Hian Tan and Niels Warncke and Anna Sztyber-Betley and Xuchan Bao and Mart. Emergent Misalignment: Narrow finetuning can produce broadly misaligned. Forty-second International Conference on Machine Learning , year=
[40]

Training language models to follow instructions with human feedback , volume =

Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul F and Leike, Jan and Lowe,...
[41]

International Conference on Machine Learning , pages =

Scaling Laws for Reward Model Overoptimization , author =. International Conference on Machine Learning , pages =. 2023 , journal =

2023
[42]

Advances in Neural Information Processing Systems , volume =

Training Language Models to Follow Instructions with Human Feedback , author =. Advances in Neural Information Processing Systems , volume =
[43]

Will We Run Out of Data? Limits of

Villalobos, Pablo and Ho, Anson and Hallawa, Jaime and Atkinson, Tamay and Sevilla, Jaime , journal =. Will We Run Out of Data? Limits of. 2024 , note =

2024
[44]

Nature , year =

The Curse of Recursion: Training on Generated Data Makes Models Forget , author =. Nature , year =
[45]

arXiv preprint arXiv:2306.11644 , year =

Textbooks Are All You Need , author =. arXiv preprint arXiv:2306.11644 , year =

Pith/arXiv arXiv
[46]

arXiv preprint , year =

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance , author =. arXiv preprint , year =
[47]

Proceedings of the National Academy of Sciences , volume =

Overcoming Catastrophic Forgetting in Neural Networks , author =. Proceedings of the National Academy of Sciences , volume =
[48]

Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency , pages =

The Fallacy of AI Functionality , author =. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency , pages =. 2022 , doi =

2022
[49]

arXiv preprint arXiv:2209.07858 , year =

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned , author=. arXiv preprint arXiv:2209.07858 , year =

Pith/arXiv arXiv
[50]

Auditing

Sobieski, Bart. Auditing. arXiv preprint arXiv:2602.02560 , year =

Pith/arXiv arXiv
[51]

2024 , booktitle =

Groves, Lara and Metcalf, Jacob and Kennedy, Alayna and Vecchione, Briana and Strait, Andrew , title =. 2024 , booktitle =

2024
[52]

Casper, Stephen and Ezell, Carson and Siegmann, Charlotte and Kolt, Noam and et. al. , booktitle =. Black-Box Access is Insufficient for Rigorous. 2024 , doi =

2024
[53]

2024 , isbn =

Lam, Khoa and Lange, Benjamin and Blili-Hamelin, Borhane and Davidovic, Jovana and Brown, Shea and Hasan, Ali , title =. 2024 , isbn =. doi:10.1145/3630106.3658957 , booktitle =

work page doi:10.1145/3630106.3658957 2024
[54]

Elena, Mihaela and Valentin, Mihai and Rafaela, Coman and Codrut, Turcan , journal =. An
[55]

Proceedings of the 41st International Conference on Machine Learning , pages =

Position: Explain to Question not to Justify , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , volume =

2024
[56]

Model Science: Getting Serious About Verification, Explanation and Control of AI Systems , ISBN=

Biecek, Przemyslaw and Samek, Wojciech , year=. Model Science: Getting Serious About Verification, Explanation and Control of AI Systems , ISBN=. doi:10.3233/FAIA250784 , booktitle=

work page doi:10.3233/faia250784
[57]

Explainable

Holzinger, Andreas and Saranti, Anna and Molnar, Christoph and Biecek, Przemyslaw and Samek, Wojciech , booktitle =. Explainable. 2022 , publisher =

2022
[58]

ACM Computing Surveys , volume =

A Survey of Methods for Explaining Black Box Models , author =. ACM Computing Surveys , volume =. 2019 , doi =

2019
[59]

Proceedings of the 34th International Conference on Machine Learning , pages =

Axiomatic Attribution for Deep Networks , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , volume =

2017
[60]

and Cogswell, Michael and Das, Abhishek and Vedantam, Ramakrishna and Parikh, Devi and Batra, Dhruv , booktitle =

Selvaraju, Ramprasaath R. and Cogswell, Michael and Das, Abhishek and Vedantam, Ramakrishna and Parikh, Devi and Batra, Dhruv , booktitle =. Grad-
[61]

PLOS ONE , volume =

On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation , author =. PLOS ONE , volume =. 2015 , doi =

2015
[62]

Explainable AI: Interpreting, Explaining and Visualizing Deep Learning , pages =

Layer-Wise Relevance Propagation: An Overview , author =. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning , pages =. 2019 , publisher =

2019
[63]

Ribeiro, Marco Tulio and Singh, Sameer and Guestrin, Carlos , booktitle =. ``. 2016 , doi =

2016
[64]

Advances in Neural Information Processing Systems , volume =

A Unified Approach to Interpreting Model Predictions , author =. Advances in Neural Information Processing Systems , volume =
[65]

Unmasking

Lapuschkin, Sebastian and W. Unmasking. Nature Communications , volume =. 2019 , doi =

2019
[66]

Advances in Neural Information Processing Systems , volume =

Sanity Checks for Saliency Maps , author =. Advances in Neural Information Processing Systems , volume =
[67]

Transformer Circuits Thread , year =

In-context Learning and Induction Heads , author =. Transformer Circuits Thread , year =
[68]

Transformer Circuits Thread , year =

A Mathematical Framework for Transformer Circuits , author =. Transformer Circuits Thread , year =
[69]

Transformer Circuits Thread , year =


