Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Pith reviewed 2026-05-10 23:21 UTC · model grok-4.3
The pith
In language models, scale brings gradual gains on knowledge-heavy tasks but sudden breakthroughs on multi-step ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BIG-bench evaluations show that model performance and calibration improve with scale across dense and sparse transformers, yet stay poor in absolute terms and relative to human raters. Tasks that improve gradually and predictably commonly center on knowledge or memorization; tasks that show sudden breakthroughs at a critical scale often involve multiple steps or components, or brittle metrics. Performance is similar across model classes, with some gains from sparsity. Social bias typically rises with scale in settings with ambiguous context, though prompting can reduce it.
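One way to see why multi-step tasks scored with an all-or-nothing metric can look like a breakthrough is a toy model. The sketch below is an illustration only, not the paper's analysis: the logistic curve, its midpoint at 10^9 parameters, and the function names `per_step_accuracy` and `exact_match_accuracy` are all assumptions chosen for the example. It assumes each step succeeds independently and that the task counts as solved only when every step is right.

```python
import math


def per_step_accuracy(log10_params: float) -> float:
    """Hypothetical smooth per-step accuracy: a logistic in log10(parameter count).

    The midpoint (10^9 parameters) is an arbitrary assumption for illustration.
    """
    return 1.0 / (1.0 + math.exp(-(log10_params - 9.0)))


def exact_match_accuracy(log10_params: float, num_steps: int) -> float:
    """All-or-nothing score: all num_steps independent steps must be correct."""
    return per_step_accuracy(log10_params) ** num_steps


if __name__ == "__main__":
    # Compare the smooth per-step curve with the brittle 10-step score.
    for log_n in range(6, 13):
        p = per_step_accuracy(log_n)
        em = exact_match_accuracy(log_n, num_steps=10)
        print(f"10^{log_n} params: per-step={p:.3f}  exact-match(10 steps)={em:.3f}")
```

Under these assumptions the per-step curve rises smoothly, while the 10-step exact-match score stays near zero for smaller models and climbs only once per-step accuracy is already high. A smooth underlying trend can therefore coexist with an apparently abrupt jump in the reported metric.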
What carries the argument
BIG-bench, a suite of 204 diverse tasks contributed by 450 authors that probes capabilities beyond those of current models and tracks how performance changes across model sizes.
If this is right
- Larger models will show predictable improvement on knowledge-based tasks but may suddenly gain new abilities on multi-step tasks at certain sizes.
- Calibration of model outputs will continue to improve with size yet remain unreliable compared to human judgments.
- Sparse model architectures will retain a modest edge over dense ones at equivalent scales.
- Social biases in model outputs will tend to increase with scale unless addressed by techniques such as prompting.
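The calibration point above can be made concrete with a standard measure. The sketch below computes expected calibration error (ECE) with equal-width confidence bins. It is a generic formulation offered as an assumption, not necessarily the exact metric or binning used in the paper, and the function name `expected_calibration_error` is hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: model probability assigned to its chosen answer, in [0, 1].
    correct: whether each chosen answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to an equal-width bin; confidence 1.0 goes in the last bin.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A model that reports 0.9 confidence on four answers and gets three right has a gap of about 0.15 in that bin. A perfectly calibrated model scores 0.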
Where Pith is reading between the lines
- Developers may need to design new tasks focused on multi-step reasoning to better anticipate when abrupt capability jumps will occur.
- The observed patterns imply that simple extrapolation from small-model trends will underestimate sudden changes in what models can do.
- Maintaining human expert baselines will require ongoing updates as model performance approaches or crosses them on individual tasks.
Load-bearing premise
The 204 tasks chosen represent the capabilities that will matter for future models, and human rater performance gives a stable, unbiased ceiling for comparison.
What would settle it
A follow-up evaluation on the same tasks where models exceed human raters on a majority of them or where no clear split appears between gradual and breakthrough scaling behaviors.
Original abstract
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Beyond the Imitation Game benchmark (BIG-bench) with 204 tasks contributed by 450 authors across 132 institutions, spanning linguistics, math, reasoning, biology, social bias and other domains. It evaluates OpenAI GPT models, Google-internal dense transformers and Switch-style sparse transformers across scales from millions to hundreds of billions of parameters, supplies human expert rater baselines on all tasks, and reports that model performance and calibration improve with scale yet remain poor in absolute terms relative to humans; tasks with gradual scaling tend to involve knowledge or memorization while breakthrough scaling appears in multi-step or brittle-metric tasks; social bias tends to increase with scale under ambiguous context but can be mitigated by prompting.
Significance. If the reported empirical patterns hold, the work supplies a valuable large-scale characterization of current language-model capabilities and limitations that can inform scaling research, capability forecasting and harm mitigation. Credit is due for the multi-institutional task collection, the provision of human baselines, the explicit separation of gradual versus breakthrough scaling behaviors, and the absence of fitted parameters or circular reductions in the analysis.
Minor comments (4)
- [Abstract] Abstract: the list of findings is presented as a single dense sentence; reformatting the key observations as bullets would improve immediate readability for readers scanning the paper.
- [Evaluation] Evaluation protocol: the manuscript should state the precise prompting templates, number of shots, and decoding parameters used for each model family so that the reported scores can be reproduced by independent groups.
- [Results] Results section: performance curves are shown without error bars or statistical tests; adding these would allow readers to assess whether observed differences between model classes or scales are reliable.
- [Analysis] Task categorization: the distinction between 'gradual' and 'breakthrough' tasks is described qualitatively; a short appendix listing the specific tasks falling into each category with their scaling exponents would make the claim more concrete.
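The request for error bars on performance curves could be met in several ways. One common option, sketched here as an assumption rather than as the authors' method, is a percentile bootstrap over per-example correctness for a single task and model size. The function name `bootstrap_accuracy_ci` and its defaults are hypothetical.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(
    correct: Sequence[bool],
    num_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    correct: per-example outcomes for one task at one model size.
    alpha: 0.05 gives an approximate 95% interval.
    """
    if not correct:
        raise ValueError("need at least one example")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    values = [1.0 if ok else 0.0 for ok in correct]
    n = len(values)
    point = sum(values) / n

    # Resample examples with replacement and record each resample's accuracy.
    means = []
    for _ in range(num_resamples):
        sample = rng.choices(values, k=n)
        means.append(sum(sample) / n)
    means.sort()

    lower_index = int((alpha / 2) * num_resamples)
    upper_index = min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))
    return point, means[lower_index], means[upper_index]
```

Plotting the lower and upper bounds around each point on a scaling curve would let readers judge whether a gap between two model classes, or a jump between two sizes, exceeds resampling noise. This treats examples as independent, which may not hold for tasks with shared contexts.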
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our work, the recognition of its significance for scaling research and capability forecasting, and the recommendation of minor revision. The report raises no major comments, so there is nothing to address point by point. We are prepared to incorporate the minor suggestions and any further clarifications requested by the editor or referee.
Circularity Check
No significant circularity; purely empirical benchmark
The paper introduces the BIG-bench dataset of 204 tasks and reports direct empirical measurements of model performance across scales, model classes, and human raters. No mathematical derivations, parameter fits, or predictions are claimed; scaling trends, gradual vs. breakthrough behaviors, and bias observations are presented as descriptive results from the evaluations themselves. The central claims rest on the contributed tasks and rater baselines without reduction to prior fits or self-citation chains. This is the expected non-finding for a large-scale benchmarking effort.
Forward citations
Cited by 60 Pith papers
-
EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data
EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.
-
Do generative video models understand physical principles?
Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.
-
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
-
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
-
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
-
Progress measures for grokking via mechanistic interpretability
Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
-
NARRA-Gym for Evaluating Interactive Narrative Agents
NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that stati...
-
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
-
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
-
Graph Property Inference in Small Language Models: Effects of Representation and Reasoning Strategy
Small instruction-tuned language models cannot reliably estimate graph-theoretic properties from textual encodings, though adjacency-list formats and multi-branch reasoning reduce errors relative to edge lists and sin...
-
Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild
Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
-
DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning
DecompSR is a large, symbolically verified benchmark dataset and generation framework that independently varies productivity, substitutivity, overgeneralisation, and systematicity to probe compositional multihop spati...
-
The Art of Scaling Reinforcement Learning Compute for LLMs
A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale prediction...
-
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
Consensus Entropy measures inter-VLM output agreement to verify OCR reliability and enable self-improving ensembles, yielding 42.1% F1 gains over single-model judging.
-
Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation...
-
A ghost mechanism: An analytical model of abrupt learning in recurrent networks
The ghost mechanism derives a 1D canonical model of abrupt learning in RNNs from ghost points of saddle-node bifurcations, predicting an inverse-power-law critical learning rate and gradient-based failure modes.
-
KTO: Model Alignment as Prospect Theoretic Optimization
KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
C-Pack: Packed Resources For General Chinese Embeddings
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
-
Large Language Models as Optimizers
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...
-
gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy
LLM coding agents cannot reach the 10^{-4} relative accuracy required for gravitational wave modeling tasks and show systematic failures including metric misuse and result fabrication.
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework
The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for cont...
-
AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum
AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specif...
-
QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks
QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.
-
Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier
ISOPro replaces learned reward models with deterministic verifiers in a continuous evaluation setup for LLMs, delivering larger average capability gains than GRPO-LoRA across small models in scheduling and MBPP domain...
-
Parcae: Scaling Laws For Stable Looped Language Models
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
-
The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment
Execution and refusal in tool-using LLM agents form separable behavioral dimensions whose joint distribution shifts systematically with normative regimes and autonomy scaffolding.
-
Measuring Representation Robustness in Large Language Models for Geometry
LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...
-
Memory in the Age of AI Agents
The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
-
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
-
Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners
A human-centered design workshop with journalism practitioners yields an evaluation cookbook and design requirements for contextualized, value-aligned generative AI benchmarks.
-
Enabling Transparent Cyber Threat Intelligence Combining Large Language Models and Domain Ontologies
Integrates LLMs with domain ontologies and SHACL constraints to produce accurate, explainable structured outputs from cybersecurity logs for threat intelligence.
-
PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention
PrefixMemory-Tuning decouples the prefix from attention to overcome performance limits of traditional prefix-tuning and reaches competitive results with modern PEFT methods on LLM adaptation benchmarks.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models
DIP interleaves English word translations into non-English prompts to boost multilingual reasoning on synthetic benchmarks spanning 10-200 languages.
-
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
-
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
-
Lessons from the Trenches on Reproducible Evaluation of Language Models
The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.
-
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
-
LLM Evaluators Recognize and Favor Their Own Generations
LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
-
Gemini: A Family of Highly Capable Multimodal Models
Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
-
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention hea...
-
Reinforced Self-Training (ReST) for Language Modeling
ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.
-
Simple synthetic data reduces sycophancy in large language models
Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
-
Scaling Data-Constrained Language Models
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
-
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.
-
Towards Expert-Level Medical Question Answering with Large Language Models
Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
ART: Automatic multi-step reasoning and tool-use for large language models
ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
-
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
ChatGPT outperforms zero-shot LLMs on most tasks and improves with interaction but scores only 63.41 percent on reasoning categories and generates extrinsic hallucinations from its training data.
-
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
The Flan Collection demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.
-
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
-
Large Language Models Are Human-Level Prompt Engineers
APE generates instruction candidates via LLM and selects the best by zero-shot performance of a second LLM, matching or beating human prompts on 19 of 24 NLP tasks.
-
Large Language Models Can Self-Improve
A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.
Reference graph
Works this paper leans on
-
[1]
MathQA: Towards interpretable math word problem solving with operation-based formalisms
URL https://arxiv.org/abs/1808.01400. (cited on p. 30) Uri Alon, Roy Sadaka, Omer Levy, and Eran Yahav. Structural language models of code. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pp. 245–256. PMLR, 13–18 July 2020. URLhttps://proc...
-
[2]
URL https://arxiv.org/abs/1606.06565. (cited on p. 40) Brandon Amos and J. Zico Kolter. Optnet: Differentiable optimization as a layer in neural networks, 2017. URLhttps: //arxiv.org/abs/1703.00443. (cited on p. 38) Philip W. Anderson. More is different.Science, 177(4047):393–396, 1972. doi: 10.1126/science.177.4047.393. URLhttps: //www.science.org/doi/ab...
-
[3]
URL https://arxiv.org/abs/2001.08435. (cited on p. 39) Nihat Bayat and Gökhan Çetinkaya. The relationship between inference skills and reading comprehension.TED EĞİTİM VE BİLİM (Education and Science), 45(203):177–190, 2020. doi: 10.15390/EB.2020.8782. URLhttp://egitimvebilim.ted.org. tr/index.php/EB/article/view/8782. (cited on p. 34) Mayur J. Bency, Ahm...
-
[4]
On the Opportunities and Risks of Foundation Models
URL https://arxiv.org/abs/2108.07258. (cited on p. 4) Shikha Bordia and Samuel R. Bowman. Identifying and reducing gender bias in word-level language models, 2019. URL https://arxiv.org/abs/1904.03035. (cited on p. 33) Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. COMET: Commonsense transformers for a...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/tvcg.2011.185 2019
-
[5]
Association for Computational Linguistics. doi: 10.18653/v1/W18-6433. URLhttps://aclanthology.org/W18-6433. (cited on p. 39) Corrado Böhm. On a family of Turing machines and the related programming language.ICC Bulletin, 3:187–194, 1964. (cited on p. 38) Kate Cain and Jane V. Oakhill. Inference making ability and its relation to comprehension failure.Read...
-
[6]
Simplicity: a unifying principle in cognitive science? , volume =
doi: 10.1016/S1364-6613(02)00005-0. URL https://doi.org/10.1016/S1364-6613(02)00005-0. (cited on p. 38) Antonio Chella, Arianna Pipitone, Alain Morin, and Famira Racy. Developing self-awareness in robots via inner speech.Frontiers in Robotics and AI, 7, 2020. doi: 10.3389/frobt.2020.00016. URLhttps://www.frontiersin.org/article/10.3389/frobt. 2020.00016. ...
-
[7]
Association for Computational Linguistics. doi: 10.18653/v1/W19-3824. URLhttps://aclanthology.org/W19-3824. (cited on p. 31) Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. QuAC: Question answering in context, 2018. URLhttps://arxiv.org/abs/1808.07036. (cited on p. 40) François Chollet. On the mea...
-
[8]
doi: 10.1007/978-3-319-40566-7_4
Springer. doi: 10.1007/978-3-319-40566-7_4. URL https://doi.org/10.1007/978-3-319-40566-7_4. (cited on p. 36) Andrew Cropper, Rolf Morel, and Stephen Muggleton. Learning higher-order logic programs.Machine Learning, 109:1289–1322,
-
[9]
doi: 10.1007/s10994-019-05862-7. URL https://doi.org/10.1007/s10994-019-05862-7. (cited on p. 34) Joe Cruse. Emoji usage in TV conversation.Twitter blog, 18 Nov 2015. URLhttps://blog.twitter.com/en_us/a/2015/emoji- usage-in-tv-conversation. (cited on p. 31) Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore,...
-
[10]
URL https://arxiv.org/abs/1707.03904. (cited on p. 33) Kaustubh Dhole, Gurdeep Singh, Priyadarshini P. Pai, and Sukanta Mondal. Sequence-based prediction of protein–protein interaction sites with l1-logreg classifier.Journal of Theoretical Biology, 348:47–54, 2014. doi: 10.1016/j.jtbi.2014.01.028. URL https://pubmed.ncbi.nlm.nih.gov/24486250/. (cited on p...
-
[11]
URL https://arxiv.org/abs/1910.02227. (cited on p. 32) Matan Eyal, Tal Baumel, and Michael Elhadad. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short ...
-
[12]
Association for Computational Linguistics. doi: 10.18653/v1/N19-1395. URLhttps://aclanthology.org/N19-1395. (cited on p. 32) Felix Faltings, Michel Galley, Gerold Hintz, Chris Brockett, Chris Quirk, Jianfeng Gao, and Bill Dolan. Text editing by command,
-
[13]
URL https://arxiv.org/abs/2010.12826. (cited on p. 39) Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P...
-
[14]
doi: https://doi.org/10.1016/0010-0277(88)90031-5. URL https://www.sciencedirect.com/science/article/pii/ 0010027788900315. (cited on p. 30) 63 Mark Forsyth.The Elements of Eloquence: Secrets of the Perfect Turn of Phrase. Berkley, New York, 2014. (cited on p. 33) Lea Frermann, Shay B. Cohen, and Mirella Lapata. Whodunnit? Crime drama as a case for natura...
-
[15]
Morgan Kaufmann. doi: 10.5555/1625275.1625535. URLhttps://dl.acm.org/doi/10.5555/1625275.1625535. (cited on p. 36) Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova DasSarma, Tom Henighan, Andy Jones, Nicholas Joseph, Jackson Kernion, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Za...
-
[16]
URL https://arxiv.org/abs/2109.06838. (cited on p. 28) Edward Gibson, Richard Futrell, Julian Jara-Ettinger, Kyle Mahowald, Leon Bergen, Sivalogeswaran Ratnasingam, Mitchell Gibson, Steven T. Piantadosi, and Bevil R. Conway. Color naming across languages reflects color use.Proceedings of the National Academy of Sciences, 114(40):10785–10790, 2017. doi: 10...
-
[17]
Association for Computational Linguistics. doi: 10.18653/v1/N19-1061. URLhttps://aclanthology.org/N19-1061. (cited on p. 33) Roberto González-Ibáñez, Smaranda Muresan, and Nina Wacholder. Identifying sarcasm in Twitter: A closer look. InProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp...
-
[18]
URL https://doi.org/10.35111/0z6y-q265
doi: 10.35111/0z6y-q265. URL https://doi.org/10.35111/0z6y-q265. (cited on p. 5) Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines, 2014. URLhttps://arxiv.org/abs/1410.5401. (cited on pp. 34 and 38) Alex Graves, Greg Wayne, Malcom Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Col- menarejo, Edward Grefenst...
-
[19]
URL https://doi.org/10.1145/1925844.1926423
doi: 10.1145/1925844.1926423. URL https://doi.org/10.1145/1925844.1926423. (cited on p. 36) Sumit Gulwani, William R. Harris, and Rishabh Singh. Spreadsheet data manipulation using examples.Commun. ACM, 55(8): 97–105, Aug. 2012. doi: 10.1145/2240236.2240260. URLhttps://doi.org/10.1145/2240236.2240260. (cited on p. 36) Sumit Gulwani, José Hernández-Orallo,...
-
[20]
URL https://link.springer.com/article/10.1007/BF02172093. (cited on p. 39) F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context.ACM Trans. Interact. Intell. Syst., 5 (4), Dec. 2015. doi: 10.1145/2827872. URLhttps://doi.org/10.1145/2827872. (cited on p. 36) Behnam Hedayatnia, Karthik Gopalakrishnan, Seokhwan Kim, Yang Liu, M...
-
[21]
Springer. URL https://www.ecva.net/papers/eccv_2018/papers_ECCV/papers/Lisa_Anne_Hendricks_Women_also_ Snowboard_ECCV_2018_paper.pdf. (cited on p. 37) Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017. URL https://openrev...
-
[22]
29) China Household Management Research Center, Ministry of Public Security
(cited on p. 29) China Household Management Research Center, Ministry of Public Security. National name report 2018. 2019. http: //news.cpd.com.cn/n18151/201901/t20190130_830962.html (Accessed 3 March 2021). (cited on p. 33) China Household Management Research Center, Ministry of Public Security. National name report 2019. 2020. https: //www.mps.gov.cn/n2...
-
[23]
doi: 10.18653/v1/2020.acl-main.164
URL https://instagram-engineering.com/emojineering-part-1-machine-learning-for-emoji-trendsmachine-learning-for-emoji-trends-7f5f9cb979ad. (cited on p. 31) Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. Automatic detection of generated text is easiest when humans are fooled. In Proceedings of the 58th Annual Meeting of the Assoc...
-
[24]
URL https://arxiv.org/abs/2007.01282. (cited on p. 38) Kushal Jain, Adwait Deshpande, Kumar Shridhar, Felix Laumann, and Ayushman Dash. Indic-transformers: An analysis of transformer language models for indian languages, 2020. URL https://arxiv.org/abs/2011.02323. (cited on p. 33) Mario Jarmasz. Roget’s Thesaurus as a lexical resource for natural langua...
-
[25]
URL https://arxiv.org/abs/2005.01229. (cited on p. 41) Aditya Joshi, Vinita Sharma, and Pushpak Bhattacharyya. Harnessing context incongruity for sarcasm detection. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp...
-
[26]
The NarrativeQA reading comprehension challenge
doi: 10.1162/tacl_a_00023. URL https://aclanthology.org/Q18-1023. (cited on p. 35) Jan Kocoń, Piotr Miłkowski, and Kamil Kanclerz. MultiEmo: Multilingual, multilevel, multidomain sentiment analysis corpus of consumer reviews. In Maciej Paszynski, Dieter Kranzlmüller, Valeria V. Krzhizhanovskaya, Jack J. Dongarra, and Peter M. A. Sloot (eds.), Computational...
-
[27]
doi: 10.1007/s10992-020-09581-6. URL https://doi.org/10.1007/s10992-020-09581-6. (cited on p. 29) Alexander W. Kocurek, Ethan Jerzak, and Rachel Etta Rudolph. Against conventional wisdom. Philosophers’ Imprint, 20(22):1–27, 2020. URL http://hdl.handle.net/2027/spo.3521354.0020.022. (cited on p. 29) Moshe Koppel and Jonathan Schler. Authorship verification ...
-
[28]
URL https://arxiv.org/abs/2101.00379. (cited on p. 30) Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. MLQA: Evaluating cross-lingual extractive question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7315–7330, Online, July 2020a. Association for Computational Ling...
-
[29]
Association for Computational Linguistics. doi: 10.18653/v1/W19-3005. URL https://aclanthology.org/W19-3005. (cited on p. 39) Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational...
-
[30]
Association for Computational Linguistics. doi: 10.18653/v1/N19-1063. URL https://aclanthology.org/N19-1063. (cited on p. 31) Andrew Mayne. OpenAI API alchemy: Emoji storytelling. Andrew Mayne blog, 24 June 2020. URL https://andrewmayneblog.wordpress.com/2020/06/24/open-ai-alchemy-emoji-storytelling/. (cited on p. 31) Joshua Maynez, Shashi Narayan, Bernd Bo...
-
[31]
URL https://arxiv.org/abs/2005.00661. (cited on pp. 30 and 40) Eric Mays, Fred J. Damerau, and Robert L. Mercer. Context based spelling correction. Information Processing & Management, 27(5):517–522, 1991. doi: https://doi.org/10.1016/0306-4573(91)90066-U. URL https://www.sciencedirect.com/science/article/pii/030645739190066U. (cited on p. 41) Momoh Karmah...
-
[32]
(cited on p. 31) David Milne and Ian H. Witten. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceeding of AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pp. 25–30, Menlo Park,
-
[33]
Association for the Advancement of Artificial Intelligence. URL https://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf. (cited on p. 36) Republic of China Ministry of the Interior. National name statistical analysis, 2018. https://www.ris.gov.tw/documents/data/5/2/107namestat.pdf (Accessed 3 March 2021). (cited on p. 33) Swaroop Mishra, Danie...
-
[34]
(cited on p. 14) Preetum Nakkiran, Behnam Neyshabur, and Hanie Sedghi. The deep bootstrap framework: Good online learners are good offline generalizers. arXiv preprint arXiv:2010.08127, 2020. (cited on p. 14) Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hur...
-
[35]
(cited on p. 34) Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1797–1807, Brussels, Belgium, October-November 2018. Association for Computationa...
-
[36]
Association for Computational Linguistics. doi: 10.18653/v1/P19-1442. URL https://aclanthology.org/P19-1442. (cited on p. 31) Marilyn Nippold, Melissa Allen, and Dixon Kirsch. Proverb comprehension as a function of reading proficiency in preadolescents. Language Speech and Hearing Services in Schools, 32:90, 04 2001. doi: 10.1044/0161-1461(2001/009). URL ...
-
[37]
doi: 10.1080/02724980443000566. URL https://doi.org/10.1080/02724980443000566. (cited on p. 35) The Working Committee on the Revision of the National Standard Occupational Classification. Standard Occupational Classification of the People’s Republic of China. China Labour and Social Security Publishing House, 2015. http://www.jiangmen.gov.cn/bmpd/jmsrlzyh...
-
[38]
(cited on p. 32) Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, 2000. (cited on p. 30) Devin Pelser and Hugh Murrell. Deep and dense sarcasm detection, 2019. URL https://arxiv.org/abs/1911.07474. (cited on p. 39) Ethan Perez, Douwe Kiela, and Kyunghyun Cho. True few-shot learning with language models, 2021. ...
-
[39]
(cited on p. 29) Tony A. Plate. Holographic Reduced Representations: Distributed Representation for Cognitive Structures. CSLI, Stanford, CA,
-
[40]
(cited on p. 29) Robert Plutchik. A general psychoevolutionary theory of emotion. In Robert Plutchik and Henry Kellerman (eds.), Theories of Emotion, pp. 3–33. Academic Press, 1980. doi: https://doi.org/10.1016/B978-0-12-558701-3.50007-7. URL https://www.sciencedirect.com/science/article/pii/B9780125587013500077. (cited on p. 32) Nadia Polikarpova, Ivan K...
-
[41]
European Language Resources Association. URL https://aclanthology.org/2020.lrec-1.125. (cited on p. 31) Damien Sileo, Wout Vossen, and Robbe Raymaekers. Zero-shot recommendation as language modeling. In Matthias Hagen, Suzan Verberne, Craig Macdonald, Christin Seifert, Krisztian Balog, Kjetil Nørvåg, and Vinay Setty (eds.), Advances in Information Retrieval...
-
[42]
John Benjamins, Amsterdam, 2010. (cited on p. 35) Bernd Steinbach and Roman Kohut. Neural networks – a model of boolean functions. 5th International Workshop on Boolean Problems, Freiburg, Sept. 2002. URL https://www.researchgate.net/publication/246931125_Neural_Networks_-_A_Model_of_Boolean_Functions. (cited on p. 29) Nisan Stiennon, Long Ouyang, ...
-
[43]
(cited on p. 38) Zijian Wang and David Jurgens. It’s going to be okay: Measuring access to support in online communities. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 33–45, Brussels, Belgium, October-November
-
[44]
assessing BERT’s syntactic abilities
Association for Computational Linguistics. doi: 10.18653/v1/D18-1004. URL https://aclanthology.org/D18-1004. (cited on p. 39) Zijian Wang and Christopher Potts. TalkDown: A corpus for condescension detection in context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Nat...
-
[45]
URL https://huggingface.co/bert-syntax/extending-bert-syntax.pdf. (cited on p. 39) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, M...
-
[46]
URL https://arxiv.org/abs/1705.10272. (cited on p. 38) Diyi Yang, Alon Lavie, Chris Dyer, and Eduard Hovy. Humor recognition and humor anchor extraction. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2367–2376, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-...
-
[47]
URL https://arxiv.org/abs/2002.04326. (cited on pp. 29 and 35) Xiang Yu, Ngoc Thang Vu, and Jonas Kuhn. Learning the Dyck language with attention-based Seq2Seq models. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 138–146, Florence, Italy, August 2019c. Association for Computational Linguistics...
-
[48]
(cited on p. 31) Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. ActivityNet-QA: A dataset for understanding complex web videos via question answering, 2019d. URL https://arxiv.org/abs/1906.02467. (cited on p. 32) Eliezer Yudkowsky. Artificial intelligence as a positive and negative factor in global risk. In Nick Bostrom an...
-
[49]
Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods
Association for Computational Linguistics. doi: 10.18653/v1/N18-2003. URL https://aclanthology.org/N18-2003. (cited on pp. 31 and 41) Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models, 2021. URL https://arxiv.org/abs/2102.09690. (cited on p. 41) Ben Zhou, Daniel Khashab...