pith. machine review for the scientific record.

arxiv: 2311.16867 · v2 · submitted 2023-11-28 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

The Falcon Series of Open Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:42 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords open language models · Falcon · large language models · pretraining · decoder-only · web data · benchmark evaluation

The pith

Falcon-180B, trained on 3.5 trillion tokens from web data, nears PaLM-2-Large performance at lower pretraining and inference cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Falcon series of causal decoder-only language models in 7B, 40B, and 180B sizes, trained predominantly on high-quality web corpora. The largest model, Falcon-180B, was trained on over 3.5 trillion tokens, the largest openly documented pretraining run to date. It outperforms models such as PaLM and Chinchilla, improves on LLaMA 2 and Inflection-1, and approaches PaLM-2-Large while requiring less pretraining and inference compute, placing it among the three strongest models alongside GPT-4 and PaLM-2-Large. The authors release a 600B-token extract of the web dataset and the models themselves under a permissive license to support open development of large language models.

Core claim

Falcon-180B significantly outperforms models such as PaLM or Chinchilla, improves upon concurrently developed models such as LLaMA 2 or Inflection-1, and nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it one of the three best language models in the world along with GPT-4 and PaLM-2-Large.

What carries the argument

Causal decoder-only transformer models trained on diverse high-quality web corpora using a custom distributed training codebase that scales efficiently to 4,096 A100 GPUs on cloud infrastructure with limited interconnect.
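The "causal decoder-only" design named here can be sketched in a few lines. This is a toy single-head illustration of the masking that makes such models autoregressive, not Falcon's actual implementation (which uses multiquery attention and other refinements):

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention: position i may attend only
    to positions <= i, which makes the decoder autoregressive.
    Toy sketch of the architecture class, not Falcon's codebase."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (T, T) attention logits
    mask = np.triu(np.ones_like(scores), k=1)      # 1s strictly above diagonal
    scores = np.where(mask == 1, -np.inf, scores)  # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Causality check: the first token's output cannot depend on later tokens.
T, D = 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(T, D))
w = [rng.normal(size=(D, D)) for _ in range(3)]
out_full = causal_self_attention(x, *w)
out_prefix = causal_self_attention(x[:1], *w)
assert np.allclose(out_full[0], out_prefix[0])
```

The masking is the whole trick: training can score every next-token prediction in a sequence in one forward pass, which is what makes trillion-token pretraining runs tractable.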

If this is right

  • Open release of the 600B-token web data extract and the models under permissive license enables community replication and extension of the training approach.
  • Custom distributed training on limited-interconnect cloud hardware demonstrates a practical path for large-scale pretraining without specialized clusters.
  • High performance from filtered web data indicates that scale and quality curation can substitute for exclusive data sources in building competitive models.
  • Lower inference cost relative to peers supports broader deployment of near-frontier capabilities in open settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The emphasis on web-data filtering may generalize to show that careful curation matters more than proprietary data origins for frontier-level performance.
  • Releasing both models and training data at this scale could accelerate independent verification of scaling laws in open environments.
  • Efficiency gains on commodity cloud hardware might lower barriers for academic or smaller-team reproduction of similar models.

Load-bearing premise

The reported benchmark results reflect genuine capability gains rather than differences in evaluation protocols, data contamination, or undisclosed advantages in testing conditions.

What would settle it

Independent re-evaluation of Falcon-180B on the same benchmarks as PaLM-2-Large, using identical protocols and explicit checks for data overlap or contamination, would show whether performance truly nears that level.
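The data-overlap check described above can be sketched as set intersection over token n-grams. A minimal illustration: the window size n=5 here is arbitrary for the demo (the rebuttal mentions 13-gram rates), and real pipelines add text normalization and fuzzy matching:

```python
def ngrams(tokens, n):
    """All contiguous length-n token windows of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_examples, train_index, n):
    """Fraction of benchmark examples sharing at least one
    n-gram with the training-corpus index."""
    hits = sum(1 for ex in benchmark_examples
               if ngrams(ex.split(), n) & train_index)
    return hits / len(benchmark_examples)

# Toy corpus and benchmark; a real check would index the full pretraining set.
corpus = ["the quick brown fox jumps over the lazy dog every single day"]
train_index = set()
for doc in corpus:
    train_index |= ngrams(doc.split(), n=5)

benchmark = [
    "the quick brown fox jumps over the lazy dog every single day",   # leaked
    "which element has the chemical symbol Fe in the periodic table", # clean
]
rate = contamination_rate(benchmark, train_index, n=5)  # → 0.5
```

Reporting this rate per benchmark, for both the model under test and its baselines, is the kind of evidence that would separate capability from leakage.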

read the original abstract

We introduce the Falcon series: 7B, 40B, and 180B parameters causal decoder-only models trained on a diverse high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text--the largest openly documented pretraining run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1. It nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it, to our knowledge, one of the three best language models in the world along with GPT-4 and PaLM-2-Large. We report detailed evaluations, as well as a deep dive into the methods and custom tooling employed to pretrain Falcon. Notably, we report on our custom distributed training codebase, allowing us to efficiently pretrain these models on up to 4,096 A100s on cloud AWS infrastructure with limited interconnect. We release a 600B tokens extract of our web dataset, as well as the Falcon-7/40/180B models under a permissive license to foster open-science and accelerate the development of an open ecosystem of large language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Falcon series of open causal decoder-only language models (7B, 40B, and 180B parameters) trained on up to 3.5 trillion tokens of predominantly web-derived data. It details a custom distributed training framework enabling efficient pretraining on up to 4096 A100 GPUs with limited interconnect, reports benchmark results claiming Falcon-180B outperforms PaLM and Chinchilla while approaching PaLM-2-Large at lower cost, and releases the models plus a 600B-token data extract under a permissive license.

Significance. If the benchmark comparisons prove robust, the work would be significant for open LLM research by documenting one of the largest openly detailed pretraining runs, providing a competitive 180B model, and releasing tooling and data that could accelerate reproducible scaling studies and reduce dependence on closed models.

major comments (2)
  1. [Evaluation] Evaluation section (main results tables): the direct comparisons to closed models such as PaLM-2-Large and GPT-4 do not specify the exact few-shot templates, answer normalization procedures, or decontamination filters applied to the baselines. This detail is load-bearing for the central claim that Falcon-180B 'nears the performance of PaLM-2-Large' given the web-crawled training corpus.
  2. [Data] Data section (corpus construction): while the 3.5T-token web corpus is described at high level, the manuscript provides no quantitative overlap statistics or explicit decontamination pipeline for standard benchmarks (MMLU, HellaSwag, etc.). Without these, the reported gains cannot be confidently attributed to capability rather than leakage.
minor comments (2)
  1. [Training] Figure captions in the training infrastructure section could more clearly label scaling curves with exact token counts and hardware configurations for reproducibility.
  2. [Abstract] The abstract's phrasing 'one of the three best language models in the world' is subjective; a more precise qualifier such as 'among the highest-performing openly documented models' would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps strengthen the clarity and rigor of our work. We address each major comment below and will revise the manuscript to incorporate additional details where appropriate.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section (main results tables): the direct comparisons to closed models such as PaLM-2-Large and GPT-4 do not specify the exact few-shot templates, answer normalization procedures, or decontamination filters applied to the baselines. This detail is load-bearing for the central claim that Falcon-180B 'nears the performance of PaLM-2-Large' given the web-crawled training corpus.

    Authors: We agree that explicit specification of evaluation protocols is essential for reproducibility and fair comparison. While high-level descriptions appear in the evaluation section and appendix, we will expand the main text in the revised manuscript to list the precise few-shot templates, answer normalization procedures (e.g., log-likelihood vs. probability normalization), and decontamination filters applied to all baselines including PaLM-2-Large and GPT-4. This will directly support the performance claims. revision: yes

  2. Referee: [Data] Data section (corpus construction): while the 3.5T-token web corpus is described at high level, the manuscript provides no quantitative overlap statistics or explicit decontamination pipeline for standard benchmarks (MMLU, HellaSwag, etc.). Without these, the reported gains cannot be confidently attributed to capability rather than leakage.

    Authors: We acknowledge the value of quantitative decontamination evidence. In the revision, we will add a dedicated subsection detailing our decontamination pipeline (including n-gram overlap filtering against common benchmarks) and report overlap statistics (e.g., 13-gram contamination rates) for MMLU, HellaSwag, and similar suites. The released 600B-token data extract will further enable independent verification, allowing readers to confirm that gains reflect capability rather than leakage. revision: yes
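The answer-normalization choice flagged in the first response is not cosmetic: it can flip which option a harness selects. A toy sketch with made-up per-token log-probabilities (no real evaluation harness or model is invoked here):

```python
def score_options(option_logprobs, normalize=True):
    """Rank multiple-choice answers by summed token log-likelihood.
    With normalize=True the sum is divided by token count, so longer
    answers are not penalized merely for containing more tokens.
    Toy illustration of the protocol choice, not any paper's harness."""
    scores = {}
    for option, logprobs in option_logprobs.items():
        total = sum(logprobs)
        scores[option] = total / len(logprobs) if normalize else total
    return max(scores, key=scores.get)

# A long answer with a good per-token fit vs. a short mediocre one.
opts = {
    "short": [-1.2],                    # sum -1.2, mean -1.2
    "long":  [-0.4, -0.4, -0.4, -0.4],  # sum -1.6, mean -0.4
}
assert score_options(opts, normalize=False) == "short"
assert score_options(opts, normalize=True) == "long"
```

Unless every baseline is scored under the same convention, cross-model comparisons on multiple-choice benchmarks are not apples to apples, which is the referee's underlying concern.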

Circularity Check

0 steps flagged

No circularity: purely empirical training and benchmark reporting

full rationale

The paper reports the training of decoder-only models on a 3.5T-token web corpus and their benchmark scores against external models. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text. Claims rest on direct training runs and standard benchmark comparisons rather than any step that reduces by construction to inputs defined inside the paper. No self-citation chain, ansatz smuggling, or renaming of known results is present. The work is self-contained as an empirical description.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on the standard transformer decoder architecture and the empirical practice of scaling language model training on filtered web text; no new mathematical axioms or invented physical entities are introduced.

axioms (2)
  • standard math: Causal decoder-only transformer architecture supports next-token prediction at scale
    Invoked implicitly throughout the model description and training setup.
  • domain assumption: High-quality web data, filtered appropriately, yields capable language models
    Central to the data assembly claim in the abstract.

pith-pipeline@v0.9.0 · 5596 in / 1394 out tokens · 35066 ms · 2026-05-16T09:42:50.408969+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ORPO: Monolithic Preference Optimization without Reference Model

    cs.CL 2024-03 conditional novelty 8.0

    ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

  2. How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

    cs.CL 2026-04 unverdicted novelty 7.0

    Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.

  3. From Competition to Collaboration: Designing Sustainable Mechanisms Between LLMs and Online Forums

    cs.AI 2026-02 unverdicted novelty 7.0

    A new sequential interaction framework lets LLMs propose questions to forums, with simulations on real Stack Exchange data showing players can reach roughly half the utility of an ideal full-information scenario despi...

  4. Moshi: a speech-text foundation model for real-time dialogue

    eess.AS 2024-09 accept novelty 7.0

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  5. Massive Activations in Large Language Models

    cs.CL 2024-02 unverdicted novelty 7.0

    Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

  6. Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.

  7. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 conditional novelty 6.0

    DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.

  8. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 unverdicted novelty 6.0

    DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.

  9. LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.

  10. LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.

  11. Language models recognize dropout and Gaussian noise applied to their activations

    cs.AI 2026-04 unverdicted novelty 6.0

    Language models detect, localize, and distinguish dropout from Gaussian noise applied to their activations, often with high accuracy.

  12. Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting

    cs.CL 2026-04 unverdicted novelty 6.0

    CoT2Edit trains LLMs to reason over edited knowledge using agent-generated CoTs, SFT, GRPO, and RAG, achieving generalization across six editing scenarios on three models.

  13. SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference

    cs.AI 2026-02 unverdicted novelty 6.0

    SweetSpot is an analytical model from Transformer computational and memory complexity that identifies energy minima at short-to-moderate inputs and medium outputs, achieving 1.79% MAPE on H100 GPU measurements across ...

  14. Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling

    cs.CV 2026-05 unverdicted novelty 5.0

    A closed-loop system couples LLM-based 3D scene generation with RL optimization and VR user interactions to produce adaptive, immersive environments, claiming SOTA results on the ALFRED benchmark.

  15. Instruction-Tuned LLMs for Parsing and Mining Unstructured Logs on Leadership HPC Systems

    cs.AI 2026-04 unverdicted novelty 5.0

    An instruction-tuned 8B LLaMA model parses HPC logs with accuracy matching larger models and processes 600 million Frontier supercomputer logs to reveal temporal patterns and anomalies.

  16. SUMMIR: A Hallucination-Aware Framework for Ranking Sports Insights from LLMs

    cs.IR 2026-03 conditional novelty 5.0

    SUMMIR is a multimetric ranking model that orders LLM-generated sports insights by importance while incorporating hallucination detection to improve factual reliability across cricket, soccer, basketball, and baseball...

  17. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  18. InternLM2 Technical Report

    cs.CL 2024-03 unverdicted novelty 5.0

    InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.

  19. AgriIR: A Scalable Framework for Domain-Specific Knowledge Retrieval

    cs.IR 2026-03 unverdicted novelty 3.0

    AgriIR is a configurable RAG framework using modular stages and 1B-parameter models to deliver grounded, citable answers for Indian agricultural information access.

  20. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

Reference graph

Works this paper leans on

269 extracted references · 269 canonical work pages · cited by 18 Pith papers · 64 internal anchors
