
arXiv: 2206.04615 · v3 · submitted 2022-06-09 · 💻 cs.CL · cs.AI · cs.CY · cs.LG · stat.ML

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro
Aditya Gupta Adrià Garriga-Alonso Agnieszka Kluska Aitor Lewkowycz Akshat Agarwal Alethea Power Alex Ray Alex Warstadt Alexander W. Kocurek Ali Safaya Ali Tazarv Alice Xiang Alicia Parrish Allen Nie Aman Hussain Amanda Askell Amanda Dsouza Ambrose Slone Ameet Rahane Anantharaman S. Iyer Anders Andreassen Andrea Madotto Andrea Santilli Andreas Stuhlmüller Andrew Dai Andrew La Andrew Lampinen Andy Zou Angela Jiang Angelica Chen Anh Vuong Animesh Gupta Anna Gottardi Antonio Norelli Anu Venkatesh Arash Gholamidavoodi Arfa Tabassum Arul Menezes Arun Kirubarajan Asher Mullokandov Ashish Sabharwal Austin Herrick Avia Efrat Aykut Erdem Ayla Karakaş B. Ryan Roberts Bao Sheng Loe Barret Zoph Bartłomiej Bojanowski Batuhan Özyurt Behnam Hedayatnia Behnam Neyshabur Benjamin Inden Benno Stein Berk Ekmekci Bill Yuchen Lin Blake Howald Bryan Orinion Cameron Diao Cameron Dour Catherine Stinson Cedrick Argueta César Ferri Ramírez Chandan Singh Charles Rathkopf Chenlin Meng Chitta Baral Chiyu Wu Chris Callison-Burch Chris Waites Christian Voigt Christopher D. Manning Christopher Potts Cindy Ramirez Clara E. Rivera Clemencia Siro Colin Raffel Courtney Ashcraft Cristina Garbacea Damien Sileo Dan Garrette Dan Hendrycks Dan Kilman Dan Roth Daniel Freeman Daniel Khashabi Daniel Levy Daniel Moseguí González Danielle Perszyk Danny Hernandez Danqi Chen Daphne Ippolito Dar Gilboa David Dohan David Drakard David Jurgens Debajyoti Datta Deep Ganguli Denis Emelin Denis Kleyko Deniz Yuret Derek Chen Derek Tam Dieuwke Hupkes Diganta Misra Dilyar Buzan Dimitri Coelho Mollo Diyi Yang Dong-Ho Lee Dylan Schrader Ekaterina Shutova Ekin Dogus Cubuk Elad Segal Eleanor Hagerman Elizabeth Barnes Elizabeth Donoway Ellie Pavlick Emanuele Rodola Emma Lam Eric Chu Eric Tang Erkut Erdem Ernie Chang Ethan A. Chi Ethan Dyer Ethan Jerzak Ethan Kim Eunice Engefu Manyasi Evgenii Zheltonozhskii Fanyue Xia Fatemeh Siar Fernando Martínez-Plumed Francesca Happé Francois Chollet Frieda Rong Gaurav Mishra Genta Indra Winata Gerard de Melo Germán Kruszewski Giambattista Parascandolo Giorgio Mariani Gloria Wang Gonzalo Jaimovitch-López Gregor Betz Guy Gur-Ari Hana Galijasevic Hannah Kim Hannah Rashkin Hannaneh Hajishirzi Harsh Mehta Hayden Bogar Henry Shevlin Hinrich Schütze Hiromu Yakura Hongming Zhang Hugh Mee Wong Ian Ng Isaac Noble Jaap Jumelet Jack Geissinger Jackson Kernion Jacob Hilton Jaehoon Lee Jaime Fernández Fisac James B. Simon James Koppel James Zheng James Zou Jan Kocoń Jana Thompson Janelle Wingfield Jared Kaplan Jarema Radom Jascha Sohl-Dickstein Jason Phang Jason Wei Jason Yosinski Jekaterina Novikova Jelle Bosscher Jennifer Marsh Jeremy Kim Jeroen Taal Jesse Engel Jesujoba Alabi Jiacheng Xu Jiaming Song Jillian Tang Joan Waweru John Burden John Miller John U. Balis Jonathan Batchelder Jonathan Berant Jörg Frohberg Jos Rozen Jose Hernandez-Orallo Joseph Boudeman Joseph Guerr Joseph Jones Joshua B. Tenenbaum Joshua S. Rule Joyce Chua Kamil Kanclerz Karen Livescu Karl Krauth Karthik Gopalakrishnan Katerina Ignatyeva Katja Markert Kaustubh D. Dhole Kevin Gimpel Kevin Omondi Kory Mathewson Kristen Chiafullo Ksenia Shkaruta Kumar Shridhar Kyle McDonell Kyle Richardson Laria Reynolds Leo Gao Li Zhang Liam Dugan Lianhui Qin Lidia Contreras-Ochando Louis-Philippe Morency Luca Moschella Lucas Lam Lucy Noble Ludwig Schmidt Luheng He Luis Oliveros Colón Luke Metz Lütfi Kerem Şenel Maarten Bosma Maarten Sap Maartje ter Hoeve Maheen Farooqi Manaal Faruqui Mantas Mazeika Marco Baturan Marco Marelli Marco Maru Maria Jose Ramírez Quintana Marie Tolkiehn Mario Giulianelli Martha Lewis Martin Potthast Matthew L. Leavitt Matthias Hagen Mátyás Schubert Medina Orduna Baitemirova Melody Arnaud Melvin McElrath Michael A. Yee Michael Cohen Michael Gu Michael Ivanitskiy Michael Starritt Michael Strube Michał Swędrowski Michele Bevilacqua Michihiro Yasunaga Mihir Kale Mike Cain Mimee Xu Mirac Suzgun Mitch Walker Mo Tiwari Mohit Bansal Moin Aminnaseri Mor Geva Mozhdeh Gheini Mukund Varma T Nanyun Peng Nathan A. Chi Nayeon Lee Neta Gur-Ari Krakover Nicholas Cameron Nicholas Roberts Nick Doiron Nicole Martinez Nikita Nangia Niklas Deckers Niklas Muennighoff Nitish Shirish Keskar Niveditha S. Iyer Noah Constant Noah Fiedel Nuan Wen Oliver Zhang Omar Agha Omar Elbaghdadi Omer Levy Owain Evans Pablo Antonio Moreno Casares Parth Doshi Pascale Fung Paul Pu Liang Paul Vicol Pegah Alipoormolabashi Peiyuan Liao Percy Liang Peter Chang Peter Eckersley Phu Mon Htut Pinyu Hwang Piotr Miłkowski Piyush Patil Pouya Pezeshkpour Priti Oli Qiaozhu Mei Qing Lyu Qinlang Chen Rabin Banjade Rachel Etta Rudolph Raefer Gabriel Rahel Habacker Ramon Risco Raphaël Millière Rhythm Garg Richard Barnes Rif A. Saurous Riku Arakawa Robbe Raymaekers Robert Frank Rohan Sikand Roman Novak Roman Sitelew Ronan LeBras Rosanne Liu Rowan Jacobs Rui Zhang Ruslan Salakhutdinov Ryan Chi Ryan Lee Ryan Stovall Ryan Teehan Rylan Yang Sahib Singh Saif M. Mohammad Sajant Anand Sam Dillavou Sam Shleifer Sam Wiseman Samuel Gruetter Samuel R. Bowman Samuel S. Schoenholz Sanghyun Han Sanjeev Kwatra Sarah A. Rous Sarik Ghazarian Sayan Ghosh Sean Casey Sebastian Bischoff Sebastian Gehrmann Sebastian Schuster Sepideh Sadeghi Shadi Hamdan Sharon Zhou Shashank Srivastava Sherry Shi Shikhar Singh Shima Asaadi Shixiang Shane Gu Shubh Pachchigar Shubham Toshniwal Shyam Upadhyay Shyamolima (Shammie) Debnath Siamak Shakeri Simon Thormeyer Simone Melzi Siva Reddy Sneha Priscilla Makini Soo-Hwan Lee Spencer Torene Sriharsha Hatwar Stanislas Dehaene Stefan Divic Stefano Ermon Stella Biderman Stephanie Lin Stephen Prasad Steven T. Piantadosi Stuart M. Shieber Summer Misherghi Svetlana Kiritchenko Swaroop Mishra Tal Linzen Tal Schuster Tao Li Tao Yu Tariq Ali Tatsu Hashimoto Te-Lin Wu Théo Desbordes Theodore Rothschild Thomas Phan Tianle Wang Tiberius Nkinyili Timo Schick Timofei Kornev Titus Tunduny Tobias Gerstenberg Trenton Chang Trishala Neeraj Tushar Khot Tyler Shultz Uri Shaham Vedant Misra Vera Demberg Victoria Nyamai Vikas Raunak Vinay Ramasesh Vinay Uday Prabhu Vishakh Padmakumar Vivek Srikumar William Fedus William Saunders William Zhang Wout Vossen Xiang Ren Xiaoyu Tong Xinran Zhao Xinyi Wu Xudong Shen Yadollah Yaghoobzadeh Yair Lakretz Yangqiu Song Yasaman Bahri Yejin Choi Yichi Yang Yiding Hao Yifu Chen Yonatan Belinkov Yu Hou Yufang Hou Yuntao Bai Zachary Seid Zhuoye Zhao Zijian Wang Zijie J. Wang Zirui Wang Ziyi Wu

Pith reviewed 2026-05-10 23:21 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY · cs.LG · stat.ML
keywords language models · scaling · benchmarks · BIG-bench · emergent abilities · model evaluation · social bias · calibration

The pith

Scale brings gradual gains on knowledge tasks but sudden breakthroughs on complex ones in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BIG-bench, a collection of 204 tasks designed to test abilities believed to lie beyond current language models, spanning linguistics, reasoning, science, and social domains. It evaluates transformer models ranging from millions to hundreds of billions of parameters and compares them with human expert raters, who performed every task. Performance and calibration both improve with size yet remain far below human levels across architectures. Tasks heavy on knowledge or memorization scale smoothly and predictably, while those involving multiple steps or scored with brittle metrics show abrupt jumps at certain sizes. Social bias often grows with scale when the context is ambiguous but can be reduced through prompting.

Core claim

BIG-bench evaluations demonstrate that model performance and calibration improve with scale across dense and sparse transformers, yet stay poor in absolute terms relative to human raters. Tasks improve gradually and predictably when they center on knowledge or memorization; tasks show sudden breakthroughs at critical scales when they involve multiple components or brittle metrics. Performance patterns are similar across model classes with some gains from sparsity, and social bias typically rises with scale under ambiguous conditions though prompting mitigates it.
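
One way to see why a brittle metric can produce an apparent breakthrough is sketched below. It assumes, purely for illustration, that per-token accuracy follows a smooth logistic curve in the logarithm of parameter count and that a task is scored by exact match over a fixed-length answer, so every token must be right. The curve, its midpoint, the answer length, and the function names are hypothetical choices, not values or code from the paper.

```python
import math


def per_token_accuracy(log10_params: float) -> float:
    """Hypothetical smooth trend: logistic in log10(parameter count), midpoint at 1e9."""
    return 1.0 / (1.0 + math.exp(-(log10_params - 9.0)))


def exact_match_rate(log10_params: float, answer_len: int) -> float:
    """Exact match needs every token correct; token errors are assumed independent."""
    return per_token_accuracy(log10_params) ** answer_len


# Compare the smooth underlying quantity with the all-or-nothing score.
for log10_params in range(7, 13):
    p = per_token_accuracy(log10_params)
    em = exact_match_rate(log10_params, answer_len=10)
    print(f"1e{log10_params} params: per-token={p:.3f}  exact-match={em:.3f}")
```

Under these assumptions the per-token curve rises steadily across the whole range, while the exact-match score stays near zero for small models and climbs only at the largest sizes, which is the qualitative pattern the claim attributes to brittle metrics.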

What carries the argument

BIG-bench, a suite of 204 diverse tasks contributed by 450 authors that probes capabilities beyond those of current models and tracks how performance changes across model sizes.

If this is right

  • Larger models will show predictable improvement on knowledge-based tasks but may suddenly gain new abilities on multi-step tasks at certain sizes.
  • Calibration of model outputs will continue to improve with size yet remain unreliable compared to human judgments.
  • Sparse model architectures will retain a modest edge over dense ones at equivalent scales.
  • Social biases in model outputs will tend to increase with scale unless addressed by techniques such as prompting.
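
To make the calibration point concrete, here is a minimal sketch of expected calibration error (ECE), a standard way to compare a model's stated confidence with how often it is actually right. This page does not say which calibration metric the paper uses, so treating ECE as the measure is an assumption, and the function name, binning scheme, and default of 10 bins are illustrative.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average, over confidence bins, of |mean confidence - accuracy|."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return ece
```

A model that says 0.9 and is right nine times out of ten scores close to zero; a model that says 1.0 and is always wrong scores 1.0. Lower is better.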

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers may need to design new tasks focused on multi-step reasoning to better anticipate when abrupt capability jumps will occur.
  • The observed patterns imply that simple extrapolation from small-model trends will underestimate sudden changes in what models can do.
  • Maintaining human expert baselines will require ongoing updates as model performance approaches or crosses them on individual tasks.

Load-bearing premise

The 204 tasks chosen represent the capabilities that will matter for future models, and human rater performance gives a stable, unbiased ceiling for comparison.

What would settle it

A follow-up evaluation on the same tasks where models exceed human raters on a majority of them or where no clear split appears between gradual and breakthrough scaling behaviors.

Original abstract

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The manuscript introduces the Beyond the Imitation Game benchmark (BIG-bench) with 204 tasks contributed by 450 authors across 132 institutions, spanning linguistics, math, reasoning, biology, social bias and other domains. It evaluates OpenAI GPT models, Google-internal dense transformers and Switch-style sparse transformers across scales from millions to hundreds of billions of parameters, supplies human expert rater baselines on all tasks, and reports that model performance and calibration improve with scale yet remain poor in absolute terms relative to humans; tasks with gradual scaling tend to involve knowledge or memorization while breakthrough scaling appears in multi-step or brittle-metric tasks; social bias tends to increase with scale under ambiguous context but can be mitigated by prompting.

Significance. If the reported empirical patterns hold, the work supplies a valuable large-scale characterization of current language-model capabilities and limitations that can inform scaling research, capability forecasting and harm mitigation. Credit is due for the multi-institutional task collection, the provision of human baselines, the explicit separation of gradual versus breakthrough scaling behaviors, and the absence of fitted parameters or circular reductions in the analysis.

minor comments (4)
  1. [Abstract] Abstract: the list of findings is presented as a single dense sentence; reformatting the key observations as bullets would improve immediate readability for readers scanning the paper.
  2. [Evaluation] Evaluation protocol: the manuscript should state the precise prompting templates, number of shots, and decoding parameters used for each model family so that the reported scores can be reproduced by independent groups.
  3. [Results] Results section: performance curves are shown without error bars or statistical tests; adding these would allow readers to assess whether observed differences between model classes or scales are reliable.
  4. [Analysis] Task categorization: the distinction between 'gradual' and 'breakthrough' tasks is described qualitatively; a short appendix listing the specific tasks falling into each category with their scaling exponents would make the claim more concrete.
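
The error bars requested in comment 3 could be produced, for instance, with a percentile bootstrap over per-example scores. The sketch below assumes each task yields a list of 0/1 correctness values for a given model; the function name, the resample count, and the 95% level are illustrative assumptions rather than anything specified in the manuscript.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_accuracy_ci(
    correct: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap interval."""
    if len(correct) == 0:
        raise ValueError("need at least one scored example")

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct)
    point = sum(correct) / n

    # Resample examples with replacement and record each resample's accuracy.
    means: List[float] = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return point, lower, upper
```

Calling `bootstrap_accuracy_ci(scores)` once per model and task gives an interval that can be drawn around each point on a scaling curve; overlapping intervals between two model classes would suggest the difference is not reliable at that scale.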

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work, the recognition of its significance for scaling research and capability forecasting, and the recommendation of minor revision. The report raises no major comments, so there is nothing to address point by point. We are prepared to incorporate any minor suggestions or clarifications if supplied by the editor or referee.

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark

Full rationale

The paper introduces the BIG-bench dataset of 204 tasks and reports direct empirical measurements of model performance across scales, model classes, and human raters. No mathematical derivations, parameter fits, or predictions are claimed; scaling trends, gradual vs. breakthrough behaviors, and bias observations are presented as descriptive results from the evaluations themselves. The central claims rest on the contributed tasks and rater baselines without reduction to prior fits or self-citation chains. This is the expected non-finding for a large-scale benchmarking effort.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new mathematical axioms, free parameters, or invented entities; it relies on standard transformer architectures and human evaluation protocols already established in the field.

pith-pipeline@v0.9.0 · 7756 in / 1136 out tokens · 49013 ms · 2026-05-10T23:21:35.284290+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

    econ.EM 2026-05 accept novelty 8.0

    EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

  2. Do generative video models understand physical principles?

    cs.CV 2025-01 unverdicted novelty 8.0

    Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.

  3. LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    cs.CL 2024-06 unverdicted novelty 8.0

    LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

  4. TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

    cs.CL 2023-05 conditional novelty 8.0

    Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

  5. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    cs.CL 2023-04 accept novelty 8.0

    Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

  6. Progress measures for grokking via mechanistic interpretability

    cs.LG 2023-01 accept novelty 8.0

    Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.

  7. NARRA-Gym for Evaluating Interactive Narrative Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that stati...

  8. TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation

    cs.CV 2026-04 accept novelty 7.0

    TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.

  9. FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

    cs.CL 2026-04 unverdicted novelty 7.0

    FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

  10. Graph Property Inference in Small Language Models: Effects of Representation and Reasoning Strategy

    cs.LG 2026-02 conditional novelty 7.0

    Small instruction-tuned language models cannot reliably estimate graph-theoretic properties from textual encodings, though adjacency-list formats and multi-branch reasoning reduce errors relative to edge lists and sin...

  11. Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild

    cs.SE 2026-01 conditional novelty 7.0

    Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.

  12. DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

    cs.AI 2025-11 unverdicted novelty 7.0

    DecompSR is a large, symbolically verified benchmark dataset and generation framework that independently varies productivity, substitutivity, overgeneralisation, and systematicity to probe compositional multihop spati...

  13. The Art of Scaling Reinforcement Learning Compute for LLMs

    cs.LG 2025-10 unverdicted novelty 7.0

    A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale prediction...

  14. Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

    cs.CV 2025-04 conditional novelty 7.0

    Consensus Entropy measures inter-VLM output agreement to verify OCR reliability and enable self-improving ensembles, yielding 42.1% F1 gains over single-model judging.

  15. Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

    cs.CL 2025-02 unverdicted novelty 7.0

    KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation...

  16. A ghost mechanism: An analytical model of abrupt learning in recurrent networks

    cs.LG 2025-01 unverdicted novelty 7.0

    The ghost mechanism derives a 1D canonical model of abrupt learning in RNNs from ghost points of saddle-node bifurcations, predicting an inverse-power-law critical learning rate and gradient-based failure modes.

  17. KTO: Model Alignment as Prospect Theoretic Optimization

    cs.LG 2024-02 conditional novelty 7.0

    KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.

  18. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  19. C-Pack: Packed Resources For General Chinese Embeddings

    cs.CL 2023-09 accept novelty 7.0

    C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

  20. Large Language Models as Optimizers

    cs.LG 2023-09 unverdicted novelty 7.0

    Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...

  21. gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy

    gr-qc 2026-05 unverdicted novelty 6.0

    LLM coding agents cannot reach the 10^{-4} relative accuracy required for gravitational wave modeling tasks and show systematic failures including metric misuse and result fabrication.

  22. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  23. Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...

  24. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  25. Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

    cs.AI 2026-05 unverdicted novelty 6.0

    The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for cont...

  26. AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum

    cs.AI 2026-04 unverdicted novelty 6.0

    AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specif...

  27. QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks

    cs.CL 2026-04 unverdicted novelty 6.0

    QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.

  28. Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier

    cs.AI 2026-04 unverdicted novelty 6.0

    ISOPro replaces learned reward models with deterministic verifiers in a continuous evaluation setup for LLMs, delivering larger average capability gains than GRPO-LoRA across small models in scheduling and MBPP domain...

  29. Parcae: Scaling Laws For Stable Looped Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

  30. The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment

    cs.AI 2026-04 unverdicted novelty 6.0

    Execution and refusal in tool-using LLM agents form separable behavioral dimensions whose joint distribution shifts systematically with normative regimes and autonomy scaffolding.

  31. Measuring Representation Robustness in Large Language Models for Geometry

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...

  32. Memory in the Age of AI Agents

    cs.CL 2025-12 unverdicted novelty 6.0

    The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.

  33. Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

    cs.AI 2025-10 unverdicted novelty 6.0

    A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.

  34. Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners

    cs.HC 2025-09 unverdicted novelty 6.0

    A human-centered design workshop with journalism practitioners yields an evaluation cookbook and design requirements for contextualized, value-aligned generative AI benchmarks.

  35. Enabling Transparent Cyber Threat Intelligence Combining Large Language Models and Domain Ontologies

    cs.CR 2025-08 unverdicted novelty 6.0

    Integrates LLMs with domain ontologies and SHACL constraints to produce accurate, explainable structured outputs from cybersecurity logs for threat intelligence.

  36. PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention

    cs.CL 2025-06 unverdicted novelty 6.0

    PrefixMemory-Tuning decouples the prefix from attention to overcome performance limits of traditional prefix-tuning and reaches competitive results with modern PEFT methods on LLM adaptation benchmarks.

  37. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  38. Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models

    cs.CL 2024-11 unverdicted novelty 6.0

    DIP interleaves English word translations into non-English prompts to boost multilingual reasoning on synthetic benchmarks spanning 10-200 languages.

  39. Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

    cs.AI 2024-08 conditional novelty 6.0

    Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

  40. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    cs.CL 2024-06 conditional novelty 6.0

    MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

  41. Lessons from the Trenches on Reproducible Evaluation of Language Models

    cs.CL 2024-05 accept novelty 6.0

    The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

  42. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    cs.CL 2024-04 accept novelty 6.0

    Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

  43. LLM Evaluators Recognize and Favor Their Own Generations

    cs.CL 2024-04 unverdicted novelty 6.0

    LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.

  44. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  45. Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

    cs.CL 2024-02 conditional novelty 6.0

    DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.

  46. Gemini: A Family of Highly Capable Multimodal Models

    cs.CL 2023-12 conditional novelty 6.0

    Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.

  47. Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

    cs.CL 2023-10 conditional novelty 6.0

    FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention hea...

  48. Reinforced Self-Training (ReST) for Language Modeling

    cs.CL 2023-08 unverdicted novelty 6.0

    ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.

  49. Simple synthetic data reduces sycophancy in large language models

    cs.CL 2023-08 unverdicted novelty 6.0

    Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.

  50. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    cs.CL 2023-06 accept novelty 6.0

    GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching the level of agreement between humans and providing a scalable evaluation method.

  51. Scaling Data-Constrained Language Models

    cs.CL 2023-05 conditional novelty 6.0

    Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

  52. Improving Factuality and Reasoning in Language Models through Multiagent Debate

    cs.CL 2023-05 unverdicted novelty 6.0

    Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.

  53. Towards Expert-Level Medical Question Answering with Large Language Models

    cs.CL 2023-05 unverdicted novelty 6.0

    Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks, while physicians rate its answers to consumer questions above human-written answers.

  54. BloombergGPT: A Large Language Model for Finance

    cs.LG 2023-03 conditional novelty 6.0

    BloombergGPT is a 50B-parameter LLM trained on a 708B-token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  55. ART: Automatic multi-step reasoning and tool-use for large language models

    cs.CL 2023-03 unverdicted novelty 6.0

    ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.

  56. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

    cs.CL 2023-02 accept novelty 6.0

    ChatGPT outperforms zero-shot LLMs on most tasks and improves with interaction, but it scores only 63.41% on reasoning categories and generates extrinsic hallucinations from its training data.

  57. The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

    cs.AI 2023-01 conditional novelty 6.0

    The Flan Collection demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.

  58. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    cs.CL 2022-11 unverdicted novelty 6.0

    BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

  59. Large Language Models Are Human-Level Prompt Engineers

    cs.LG 2022-11 unverdicted novelty 6.0

    APE generates instruction candidates with an LLM and selects the best one by the zero-shot performance of a second LLM, matching or beating human prompts on 19 of 24 NLP tasks.

  60. Large Language Models Can Self-Improve

    cs.CL 2022-10 unverdicted novelty 6.0

    A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.
