2004.03607
2 Pith papers cite this work, both from 2020.
citing papers explorer
- Measuring Massive Multitask Language Understanding
  Introduces the MMLU benchmark of 57 tasks and shows that the models evaluated, including GPT-3, score far below expert-level accuracy across academic and professional domains.
- Help! Need Advice on Identifying Advice
  Introduces a new English dataset from r/AskParents and r/needadvice annotated for advice sentences, along with preliminary models showing that pre-trained LMs outperform rule-based systems, though the task remains challenging.