On the Opportunities and Risks of Foundation Models
Recognition: 1 theorem link · Lean theorem
Pith reviewed 2026-05-10 16:10 UTC · model grok-4.3
The pith
Foundation models trained on broad data at scale develop emergent capabilities that incentivize their use across many tasks while passing any defects to all downstream adaptations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Foundation models are models trained on broad data at scale that prove adaptable to a wide range of downstream tasks. Their scale produces emergent capabilities beyond those of smaller models, while their versatility across tasks drives homogenization of the AI ecosystem. As a result, any flaws in the foundation model are inherited by every adapted system built on it. Despite impending widespread use, there is still no clear account of how these models function, when they break, or what they can ultimately do.
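The inheritance claim can be made concrete with a toy sketch (all names and values below are hypothetical, not from the paper): a flaw baked into a shared base representation reappears in every downstream adaptation, regardless of how the adaptation layer is built.

```python
# Toy illustration of defect inheritance: a frozen "foundation" embedding
# with a baked-in flaw propagates that flaw to every adapted system.

def base_embed(text: str) -> tuple[int, int]:
    """Stand-in for a frozen foundation model. Its defect: 'cat' and 'dog'
    collapse to the same representation and become indistinguishable."""
    token = "animal" if text in {"cat", "dog"} else text
    return (len(token), sum(map(ord, token)))

def adapt(labels: dict[str, str]):
    """Build a downstream classifier on top of the frozen base embeddings."""
    table = {base_embed(text): label for text, label in labels.items()}
    return lambda text: table[base_embed(text)]

vet = adapt({"cat": "feline", "dog": "canine"})
moderator = adapt({"cat": "allowed", "dog": "blocked"})

# Both adaptations inherit the base defect: 'cat' and 'dog' share an
# embedding, so the later table entry silently wins for both inputs.
print(vet("cat"), moderator("cat"))  # -> canine blocked
```

No amount of work at the adaptation layer can separate inputs the base has already collapsed, which is the sense in which base-model defects are inherited downstream.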
What carries the argument
Foundation models, defined as models pretrained on broad data at scale and then adapted for downstream tasks, carry the argument: they link scale to emergence and show how a shared base transmits strengths and weaknesses to every use.
If this is right
- Capabilities in language understanding, image generation, and robotics improve as models grow larger.
- Applications in law, healthcare, and education gain efficiency from shared bases but must account for inherited limitations.
- Societal issues such as bias, misuse, and environmental costs become concentrated rather than distributed.
- Technical work on evaluation, security, and theory must address the full model rather than isolated adaptations.
- Interdisciplinary efforts are needed to study both technical behavior and broader impacts.
Where Pith is reading between the lines
- If homogenization holds, oversight could shift from regulating individual applications to auditing and updating the small number of base models.
- Developers might test whether fine-tuning or prompt changes can reliably isolate or correct base-model defects without full retraining.
- The pattern suggests similar dynamics could appear in other scaled systems, such as large simulation models or scientific foundation models.
- A practical next step would be systematic comparison of multiple independent foundation models to measure how much their defect profiles actually overlap.
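The defect-overlap comparison in the last point could be operationalized as a simple set statistic. A minimal sketch, assuming each model's defects are recorded as a set of failed benchmark-item IDs (the model names and failure sets here are made up for illustration):

```python
# Hypothetical defect profiles: IDs of benchmark items each base model fails.
defects = {
    "model_a": {3, 7, 19, 42, 58},
    "model_b": {7, 19, 42, 60},
    "model_c": {1, 7, 88},
}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two failure sets (1.0 = identical defects)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Pairwise overlap of defect profiles across independent foundation models.
names = sorted(defects)
for i, m in enumerate(names):
    for n in names[i + 1:]:
        print(f"{m} vs {n}: {jaccard(defects[m], defects[n]):.2f}")
```

High pairwise overlap would support the homogenization worry (shared defects across supposedly independent bases); low overlap would suggest defects are model-specific and diversifiable.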
Load-bearing premise
That training at current scales reliably produces capabilities that cannot be foreseen from smaller models, and that most future AI work will converge on a few shared foundation models without separate safeguards.
What would settle it
A controlled experiment showing that performance gains on downstream tasks can be fully predicted by scaling laws fitted to smaller models, with no new qualitative behaviors appearing at foundation scale.
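The proposed test amounts to fitting a power law on small models and checking large-model performance against its extrapolation. A minimal sketch with illustrative loss numbers (the data points are invented, not from any real model family):

```python
import numpy as np

# Illustrative (parameter_count, validation_loss) points from small models.
small = np.array([(1e7, 4.2), (1e8, 3.5), (1e9, 2.9)])

# Fit L(N) = a * N^(-alpha) by linear regression in log-log space.
logN, logL = np.log(small[:, 0]), np.log(small[:, 1])
slope, intercept = np.polyfit(logN, logL, 1)
alpha, a = -slope, np.exp(intercept)

def predicted_loss(n_params: float) -> float:
    """Loss predicted by the scaling law fitted only to small models."""
    return a * n_params ** (-alpha)

# If a 100B-parameter model's measured performance deviates sharply from
# this extrapolation, that gap is the candidate "emergent" behavior; if it
# falls on the curve, scaling laws alone suffice and the claim fails.
print(f"alpha={alpha:.3f}, predicted loss at 1e11 params: {predicted_loss(1e11):.2f}")
```

The controlled experiment described above would extend this check from aggregate loss to per-task metrics, since emergence claims typically concern task-level behavior rather than the smooth loss curve.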
Original abstract
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces foundation models as large-scale models trained on broad data that adapt to diverse downstream tasks (e.g., BERT, DALL-E, GPT-3). It surveys capabilities in language, vision, robotics, reasoning, and human interaction; technical principles including architectures, training, data, systems, security, evaluation, and theory; applications in law, healthcare, and education; and societal impacts such as inequity, misuse, economic/environmental effects, and legal/ethical issues. The central thesis is that, while rooted in standard deep learning and transfer learning, scale produces emergent capabilities and incentivizes homogenization, yielding leverage but also risks since defects propagate to all adapted downstream models. It notes the current lack of understanding of their mechanisms, failure modes, and capabilities, and calls for interdisciplinary research.
Significance. If the observations on emergence and homogenization hold, the report is significant as a timely, broad synthesis that frames the foundation-model paradigm and its sociotechnical implications. It consolidates existing work, identifies gaps in understanding emergence and failures, and advocates caution plus collaboration, serving as a reference point for the field at a time of rapid deployment.
Major comments (1)
- [Abstract] Abstract: the claim that scale 'results in new emergent capabilities' is load-bearing for the paradigm-shift framing yet is presented without quantitative evidence, formal bounds, or specific citations to studies showing behaviors unpredictable from smaller models; this weakens the distinction from standard scaling laws.
Minor comments (2)
- [Abstract] Abstract: the long compound sentence enumerating topics reduces readability; splitting into shorter sentences would improve clarity.
- [Throughout] Throughout: the term 'homogenization' is used repeatedly but receives no explicit initial definition or scope clarification, which could confuse readers unfamiliar with the concept.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful review. We address the single major comment below and have revised the manuscript to strengthen the abstract's claim while remaining faithful to the empirical evidence and open questions discussed in the body of the paper.
Point-by-point responses
Referee: [Abstract] Abstract: the claim that scale 'results in new emergent capabilities' is load-bearing for the paradigm-shift framing yet is presented without quantitative evidence, formal bounds, or specific citations to studies showing behaviors unpredictable from smaller models; this weakens the distinction from standard scaling laws.
Authors: We agree that the abstract statement is central to the paper's framing and benefits from explicit support. The body of the manuscript already cites and discusses empirical evidence for emergence, including in-context learning and other capabilities in GPT-3 (Brown et al., 2020) that were not observed or predictable from smaller-scale models, as well as scaling-law analyses (Kaplan et al., 2020). Formal bounds on emergence are indeed unavailable and are highlighted as an open research question throughout the paper. To directly address the referee's concern, we have revised the abstract to include a specific citation to this literature and a brief qualifier noting the empirical nature of the observed behaviors. This revision preserves the distinction from standard scaling laws without overstating theoretical guarantees.
Revision: yes
Circularity Check
No circularity: position paper with no derivations or fitted predictions
full rationale
The document is a discursive survey and position paper that defines foundation models, surveys capabilities/risks, and calls for interdisciplinary research. It contains no equations, no parameter fitting, no 'predictions' of quantities, and no derivation chain. All technical claims are framed as observations from external literature or as open questions (e.g., 'we currently lack a clear understanding'). No step reduces to a self-definition, self-citation load-bearing premise, or renamed known result by construction. The central statements about emergence and homogenization are presented as motivating observations rather than proven results, consistent with the paper's explicit acknowledgment of incomplete understanding.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Training scale produces emergent capabilities not present or predictable in smaller models.
- domain assumption Widespread effectiveness will drive homogenization of AI development around a few base models.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction : reality_from_one_distinction (link strength: unclear). Linked passage: "Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream."
Forward citations
Cited by 60 Pith papers
-
Contrastive Identification and Generation in the Limit
Contrastive pair presentations yield exact identifiability characterizations via a geometric refinement of Angluin's condition, a new contrastive closure dimension for generation, mutual incomparability with text iden...
-
Prism: Symbolic Superoptimization of Tensor Programs
Prism is the first symbolic superoptimizer for tensor programs that uses sGraph for compact representation of program families, two-level search, e-graph equivalence checking, and auto-tuning to achieve up to 2.2x spe...
-
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
-
Generative Agents: Interactive Simulacra of Human Behavior
Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.
-
Decoupled and Divergence-Conditioned Prompt for Multi-domain Dynamic Graph Foundation Models
DyGFM introduces decoupled pre-training and divergence-conditioned prompts to create the first multi-domain dynamic graph foundation model that outperforms baselines on node classification and link prediction.
-
Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study
Contrastive predictive coding pretraining combined with structured state space models yields the strongest ECG foundation models, with continued gains from scaling data to 11 million samples.
-
AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment
AcuityBench harmonizes five datasets into a four-level acuity framework to evaluate LLMs on clinical urgency identification, error patterns, and uncertainty alignment across QA and conversational formats.
-
TokaMind for Power Grid: Cross-Domain Transfer from Fusion Plasma
TokaMind, pre-trained on MAST tokamak data, transfers to power grid PMU data for severe event classification with F1 0.837, where difficulty depends on grid topology and CSD indicators boost early-warning performance ...
-
BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence
BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63....
-
Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.
-
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
-
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
-
On the Safety of Graph Representation Learning
GRL-Safety benchmark shows that safety in graph representation learning depends on interactions between method design and specific graph stresses rather than broad method families.
-
Graphlets as Building Blocks for Structural Vocabulary in Knowledge Graph Foundation Models
Graphlets mined as structural tokens improve zero-shot inductive and transductive link prediction in knowledge graph foundation models across 51 diverse graphs.
-
LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems
LUCAS-MEGA fuses 68 soil-environment datasets into a 70k-sample multimodal resource that supports self-supervised pretraining of SoilFormer, whose representations align with known soil processes.
-
LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems
LUCAS-MEGA fuses 68 heterogeneous soil datasets into a 70k-sample multimodal collection and demonstrates its value by pretraining a tabular transformer whose representations align with established soil processes.
-
Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models
CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.
-
CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning
CUE mitigates concept confusion in long-tailed visual recognition by expanding supervision with multi-label concept sets from zero-shot CLIP and LLMs, using auxiliary Binary Logit-Adjustment losses to achieve stronger...
-
Self-Evolving Software Agents
Self-evolving agents use BDI-LLM architecture to autonomously discover new goals and generate executable code from minimal prior knowledge in dynamic multi-agent settings.
-
LLM-Assisted Empirical Software Engineering: Systematic Literature Review and Research Agenda
A systematic review of 50 studies identifies 69 LLM-assisted tasks in empirical software engineering, concentrated in data processing and analysis with gaps in human-centered integration and reproducibility reporting.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Participatory provenance as representational auditing for AI-mediated public consultation
Participatory provenance auditing of Canada's AI strategy consultation shows official AI summaries exclude 15-17% of participants more than random baselines, with 33-88% exclusion for dissent clusters.
-
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
-
Faster by Design: Interactive Aerodynamics via Neural Surrogates Trained on Expert-Validated CFD
A graph-based neural operator trained on expert-validated race-car CFD data reaches accuracy levels usable for early-stage interactive aerodynamic design exploration.
-
TacticGen: Grounding Adaptable and Scalable Generation of Football Tactics
TacticGen generates realistic, adaptable football tactics via a multi-agent diffusion transformer trained on 3.3M events and 100M frames, supporting rule-, language-, or model-based guidance at inference time.
-
From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping
PlantXpert benchmark shows fine-tuned VLMs reach up to 78% accuracy on plant phenotyping but scaling gains plateau and quantitative biological reasoning remains weak.
-
SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.
-
The Shrinking Lifespan of LLMs in Science
LLM adoption in science follows a compressing inverted-U trajectory where release year predicts time-to-peak and lifespan better than model attributes.
-
A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis
A mixture-of-experts transformer foundation model pretrained on diverse SEM images enables generalization across materials and outperforms SOTA on unsupervised defocus-to-focus restoration.
-
AI Agents Under EU Law
AI agent providers face an exhaustive inventory requirement for actions and data flows, as high-risk systems with untraceable behavioral drift cannot meet the AI Act's essential requirements.
-
Many Preferences, Few Policies: Towards Scalable Language Model Personalization
PALM produces a small portfolio of LLMs that contains a near-optimal model for any user preference weight vector, with theoretical bounds on portfolio size and approximation quality.
-
Training a Student Expert via Semi-Supervised Foundation Model Distillation
A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.
-
Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements
PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.
-
TextGrad: Automatic "Differentiation" via Text
TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.
-
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
-
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
Segment Anything
A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.
-
A Time Series is Worth 64 Words: Long-term Forecasting with Transformers
PatchTST uses subseries patching and channel-independent Transformers to deliver significantly better long-term multivariate time series forecasting and strong self-supervised transfer performance.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
BIG-bench is a 204-task benchmark that measures scaling trends, calibration, and absolute limitations of language models across knowledge, reasoning, and social domains.
-
A Generalist Agent
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.
-
Multitask Prompted Training Enables Zero-Shot Task Generalization
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
No One Knows the State of the Art in Geospatial Foundation Models
An audit of 152 papers reveals that geospatial foundation models lack standardized evaluations, training controls, and weight releases, so no one knows the state of the art.
-
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...
-
A CAP-like Trilemma for Large Language Models: Correctness, Non-bias, and Utility under Semantic Underdetermination
Under semantic underdetermination, LLMs cannot always guarantee strong correctness, strict non-bias, and high utility at once.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
-
Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime
In the low-temperature regime, the token distribution in mean-field transformers concentrates onto the push-forward under a key-query-value projection with Wasserstein distance scaling as √(log(β+1)/β) exp(Ct) + exp(-ct).
-
When Should Teachers Control AI Generation for Mathematics Visuals?
Post-generation control in AI-assisted math visual creation yields higher teacher ratings for predictability and correctness than pre- or mid-generation control, with qualitative trade-offs in agency and effort.
-
Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization
Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.
-
Biosignal Fingerprinting: A Cross-Modal PPG-ECG Foundation Model
A cross-modal masked autoencoder creates reusable biosignal fingerprints that match or exceed specialist models on seven cardiovascular tasks using only single-modality input.
-
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
LC-MAPF is a decentralized MAPF solver that uses a learnable multi-round communication module among nearby agents to outperform prior IL and RL methods while preserving scalability.
-
Post-training makes large language models less human-like
Post-training reduces LLMs' behavioral alignment with humans across families and sizes, with the misalignment increasing in newer generations while persona induction fails to improve individual-level predictions.
-
Revisiting Transformer Layer Parameterization Through Causal Energy Minimization
CEM recasts Transformer layers as energy minimization steps, enabling constrained parameterizations like weight sharing and low-rank interactions that match standard baselines in 100M-scale language modeling.
-
WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation
WeatherSyn is the first instruction-tuned MLLM for weather forecasting report generation, outperforming closed-source models on a new dataset of 31 US cities across 8 weather aspects.
-
A renormalization-group inspired lattice-based framework for piecewise generalized linear models
RG-inspired lattice models for piecewise GLMs provide explicit interpretable partitions and a replica-analysis-derived scaling law for regularization that allows increasing complexity without expected rise in generali...
-
The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias
Causal analysis of LLMs finds standard bias metrics overestimate demographic effects due to context toxicity, with Western models showing higher refusal rates for certain groups and Eastern models showing targeted reg...