Recognition: 2 theorem links · Lean theorem
Measuring Massive Multitask Language Understanding
Pith reviewed 2026-05-10 12:39 UTC · model grok-4.3
The pith
Current language models, including the largest GPT-3, still require substantial improvements to reach expert-level accuracy on a new 57-task test of knowledge and problem solving.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper introduces a 57-task test designed to assess models' extensive world knowledge and problem-solving ability, and shows that even the most advanced models fall short of expert performance on every one of these tasks, with particular weaknesses in socially important domains.
What carries the argument
A new test of 57 multiple-choice tasks, spanning subjects from elementary mathematics to professional-level material in areas such as history, computer science, and law.
If this is right
- Models exhibit lopsided performance across the different tasks.
- Models frequently do not know when they are wrong.
- Models achieve near-random accuracy on socially important subjects such as morality and law.
- The test can be used to analyze models across many tasks and identify important shortcomings.
Where Pith is reading between the lines
- This kind of broad test allows tracking of how model performance changes as models increase in size.
- Task-by-task results could help focus additional training on areas where models are weakest.
- The approach provides a consistent way to compare models on a shared set of academic and professional questions.
Load-bearing premise
The 57 chosen tasks and their expert-level thresholds accurately capture extensive world knowledge and problem solving ability without selection bias or overly narrow definitions of expertise.
What would settle it
A model that attained expert-level accuracy on all 57 tasks would settle the question by showing that the substantial improvements the paper calls for have been achieved.
read the original abstract
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
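The scoring the abstract describes — per-task multiple-choice accuracy, macro-averaged over tasks, compared against a 25% four-option guessing baseline — can be sketched as follows. This is a minimal illustration: the toy tasks, the `always_second` model, and all names here are hypothetical stand-ins, not the released dataset or evaluation harness.

```python
import random

# Hypothetical miniature stand-in for the 57-task benchmark: each task maps
# to a list of (question, four options, index of the correct option).
TASKS = {
    "elementary_mathematics": [("What is 2 + 3?", ["4", "5", "6", "7"], 1)],
    "us_history": [("Year of US independence?", ["1776", "1789", "1812", "1865"], 0)],
}

def accuracy(model, task_items):
    """Fraction of questions where the model picks the correct option index."""
    correct = sum(model(q, opts) == ans for q, opts, ans in task_items)
    return correct / len(task_items)

def macro_average(model, tasks):
    """Average of per-task accuracies, weighting every task equally."""
    scores = [accuracy(model, items) for items in tasks.values()]
    return sum(scores) / len(scores)

def random_guesser(question, options):
    """Guessing baseline: with four options, expected accuracy is 25%."""
    return random.randrange(len(options))

def always_second(question, options):
    """Degenerate illustrative model that always picks option index 1."""
    return 1
```

Macro-averaging means a model strong in a few subjects but near chance elsewhere — the "lopsided performance" the paper reports — still averages far below expert level.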
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Massive Multitask Language Understanding (MMLU) benchmark consisting of 57 multiple-choice tasks drawn from academic and professional domains such as mathematics, history, computer science, and law. The authors evaluate a range of language models and report that most achieve near-random accuracy of approximately 25%, while the largest GPT-3 model reaches an average of 43.9% (a nearly 20-point gain over random). All evaluated models remain substantially below the stated expert-level accuracy of roughly 89% on every task, with notably weak performance on morality and law; models also exhibit lopsided subject performance and poor calibration regarding their own errors. The benchmark is positioned as a tool for measuring breadth of world knowledge and problem-solving ability.
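The gaps quoted in the summary are simple differences; a quick restatement of the arithmetic, using the figures as reported in the review rather than recomputed from the paper:

```python
random_chance = 25.0   # four-option multiple choice, uniform guessing
gpt3_average = 43.9    # reported macro-average accuracy of the largest GPT-3 model
expert_level = 89.0    # approximate expert-level accuracy threshold

gain_over_chance = gpt3_average - random_chance  # ≈ 18.9, i.e. "almost 20 points"
gap_to_expert = expert_level - gpt3_average      # ≈ 45.1 points still remaining
```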
Significance. If the reported measurements hold, this work supplies a valuable, broad-coverage benchmark that enables systematic tracking of language-model progress across many domains simultaneously. Notable strengths include the careful sourcing of questions from real exams and textbooks, the public release of the full dataset for reproducibility, the consistent evaluation protocol applied to multiple model families, and the inclusion of clear random-chance baselines. These elements allow the community to replicate and extend the results, and the empirical gaps documented have already shaped subsequent scaling and evaluation research.
minor comments (3)
- [Section 3] Additional quantitative detail on how expert-level accuracy thresholds were estimated for each task (e.g., number of experts, agreement statistics) would help readers evaluate the size of the reported gaps to expert performance.
- [Table 1, Section 4] The random baseline is uniformly listed near 25%; explicitly confirming that every task uses four options and noting any exceptions would remove minor ambiguity.
- [Section 5] The discussion of lopsided performance and poor self-knowledge would be strengthened by reporting per-subject standard deviations or statistical tests of imbalance.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our work and for recommending minor revision. The referee's summary accurately captures the MMLU benchmark, its construction from real academic and professional sources, the evaluation results across model families, and the documented gaps relative to expert performance. We are pleased that the strengths of careful sourcing, public data release, consistent protocols, and random baselines are highlighted. As the report lists no specific major comments or requested changes, we have no points requiring detailed rebuttal or disagreement.
Circularity Check
No significant circularity identified
full rationale
The paper constructs a new benchmark (MMLU) with 57 tasks drawn from existing exams and reports direct empirical accuracy measurements for language models against random-chance baselines and stated expert thresholds. No equations, derivations, or first-principles predictions appear; the central claims consist solely of observed performance numbers on the released test set. Self-citations, if present, are incidental background references and do not serve as load-bearing justification for any result. The evaluation chain is therefore self-contained and externally verifiable via the dataset.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/PhiForcing.lean · phi_equation · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
- IndisputableMonolith/Foundation/DimensionForcing.lean · dimension_forced · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Unsteady Metrics and Benchmarking Cultures of AI Model Builders
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
-
HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-wei...
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
Crafting Reversible SFT Behaviors in Large Language Models
LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
-
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
-
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
-
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights
TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4...
-
Inducing Artificial Uncertainty in Language Models
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
-
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
-
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
-
Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs
Semantic consensus on model outputs for public prompts enables federated LLM fine-tuning that matches parameter-aggregation baselines with orders-of-magnitude lower communication.
-
Task-Aware Calibration: Provably Optimal Decoding in LLMs
Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
-
Skill Description Deception Attack against Task Routing in Internet of Agents
Malicious agents can deceive LLM-based task routers in Internet of Agents systems by generating fake skill descriptions, achieving up to 98% success rate across nine domains.
-
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
-
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.
-
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
-
BadDLM: Backdooring Diffusion Language Models with Diverse Targets
BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
-
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deploy...
-
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
-
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
-
Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic
Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.
-
Dataset Watermarking for Closed LLMs with Provable Detection
A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tunin...
-
Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages
Nsanku benchmark shows current LLMs achieve only modest zero-shot translation scores on 43 Ghanaian languages, with no model reaching both high average performance and high cross-language consistency.
-
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
SplitZip is a new GPU-friendly lossless compressor for KV cache tensors that exploits exponent redundancy to achieve over 600 GB/s compression throughput and up to 1.32x faster transfers in disaggregated LLM serving.
-
SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials
SciEval is a new benchmark of expert-annotated K-12 science lessons for LLM-based automatic evaluation, where zero-shot models perform poorly but fine-tuning yields up to 11% gains.
-
Analysis and Explainability of LLMs Via Evolutionary Methods
Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.
-
Green Shielding: A User-Centric Approach Towards Trustworthy AI
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...
-
Breaking the Secret: Economic Interventions for Combating Collusion in Embodied Multi-Agent Systems
A mutagenic incentive mechanism reshapes payoffs in embodied MAS to induce strategic defection from collusion, achieving performance comparable to non-collusion baselines in simulations and real-world tests.
-
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
-
How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them
Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.
-
Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options
Scaling multiple-choice questions to 100 options on a Korean error detection task shows that LLM performance on conventional benchmarks overstates true competence due to shortcut strategies.
-
AgileLog: A Forkable Shared Log for Agents on Data Streams
AgileLog introduces forkable shared logs with cheap forking and isolation to support AI agents on data streams.
-
Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations
Counterfactual Routing awakens dormant experts in MoE models via layer-wise perturbation and a new CEI metric, raising factual accuracy 3.1% on average across TruthfulQA, FACTOR, and TriviaQA without extra inference cost.
-
Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning
GRIP integrates retrieval into autoregressive generation through self-triggered control tokens for dynamic query planning, outperforming RAG baselines on QA benchmarks with fewer parameters than GPT-4o.
-
EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.
-
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
-
Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition
MATU quantifies uncertainty in LLM multi-agent systems by turning reasoning trajectories into embedding matrices, stacking runs into a tensor, and decomposing it to separate sources of variability.
-
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
Social dynamics in LLM collectives cause representative agents to make less accurate decisions as peer pressure increases through larger adversarial groups, more capable peers, longer arguments, and persuasive styles.
-
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
-
DeonticBench: A Benchmark for Reasoning over Rules
DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
-
PolyReal: A Benchmark for Real-World Polymer Science Workflows
PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.
-
Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models
A new Latent Imagination Module uses cross-attention to predict latent visual embeddings from text, improving accuracy and calibration of vision-language models on text-only inputs.
-
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
-
Path-Constrained Mixture-of-Experts
PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
-
KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context
KMMMU benchmark demonstrates that leading multimodal models achieve at most 52.42% accuracy on hard Korean exam questions, highlighting limitations in non-English multimodal understanding.
-
PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.
-
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE
EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
WizardLM: Empowering large pre-trained language models to follow complex instructions
WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
-
Capabilities of GPT-4 on Medical Challenge Problems
GPT-4 exceeds the USMLE passing score by more than 20 points and outperforms both GPT-3.5 and the medically fine-tuned Med-PaLM on the MultiMedQA benchmarks.
-
LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning
LEMON trains an LLM orchestrator with counterfactual-augmented GRPO to produce deployable multi-agent specifications that reach state-of-the-art results on six reasoning and coding benchmarks.
-
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection
MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
Reference graph
Works this paper leans on
- [1] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents (extended abstract). J. Artif. Intell. Res., 47:253--279, 2013.
- [2] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: Reasoning about physical commonsense in natural language, 2019.
- [3] Y. Bisk, A. Holtzman, J. Thomason, J. Andreas, Y. Bengio, J. Chai, M. Lapata, A. Lazaridou, J. May, A. Nisnevich, N. Pinto, and J. Turian. Experience grounds language, 2020.
- [4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amod... 2020.
- [5] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. ArXiv, abs/1803.05457, 2018.
- [6] P. Clark, O. Etzioni, D. Khashabi, T. Khot, B. D. Mishra, K. Richardson, A. Sabharwal, C. Schoenick, O. Tafjord, N. Tandon, S. Bhakthavatsalam, D. Groeneveld, M. Guerquin, and M. Schmitz. From 'F' to 'A' on the N.Y. Regents science exams: An overview of the Aristo project. ArXiv, abs/1909.01958, 2019.
- [7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805, 2019.
- [8] R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann. Shortcut learning in deep neural networks, 2020.
- [9] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. ICML, 2017.
- [10] D. Hendrycks, M. Mazeika, and T. Dietterich. Deep anomaly detection with outlier exposure. ICLR, 2019.
- [11] D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song. Natural adversarial examples. ArXiv, abs/1907.07174, 2019.
- [12] D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt. Aligning AI with shared human values, 2020.
- [13] L. Huang, R. L. Bras, C. Bhagavatula, and Y. Choi. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning, 2019.
- [14] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models, 2020.
- [15] D. Khashabi, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi. UnifiedQA: Crossing format boundaries with a single QA system, 2020.
- [16] T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal. QASC: A dataset for question answering via sentence composition, 2019.
- [17] A. Kumar, P. Liang, and T. Ma. Verified uncertainty calibration, 2019.
- [18] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy. RACE: Large-scale reading comprehension dataset from examinations, 2017.
- [19] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942, 2020.
- [20] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692, 2019.
- [21] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP, 2018.
- [22] Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. V. Dillon, B. Lakshminarayanan, and J. Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. NeurIPS, 2019.
- [23] F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel. Language models as knowledge bases?, 2019.
- [24] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.
- [25] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019.
- [26] M. Richardson, C. J. Burges, and E. Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193--203, Seattle, Washington, USA, Oct. 2013. Association for Computational Linguistics.
- [27] A. B. Sai, A. K. Mohankumar, and M. M. Khapra. A survey of evaluation metrics used for NLG systems. 2020.
- [28] A. Turing. Computing machinery and intelligence. 1950.
- [29] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding, 2018.
- [30] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems, 2019.
- [31] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence?, 2019.
- [32] R. Zellers, A. Holtzman, E. Clark, L. Qin, A. Farhadi, and Y. Choi. Evaluating machines by their real-world language use, 2020.
discussion (0)