pith. machine review for the scientific record.

arxiv: 2311.16867 · v2 · submitted 2023-11-28 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

The Falcon Series of Open Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:42 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords open language models · Falcon · large language models · pretraining · decoder-only · web data · benchmark evaluation

The pith

Falcon-180B, trained on 3.5 trillion tokens from web data, nears PaLM-2-Large performance at lower pretraining and inference cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Falcon series of causal decoder-only language models in 7B, 40B, and 180B sizes, trained predominantly on high-quality web corpora. The largest model, Falcon-180B, was trained on over 3.5 trillion tokens, the largest openly documented pretraining run to date. It outperforms models such as PaLM and Chinchilla, improves on LLaMA 2 and Inflection-1, and approaches PaLM-2-Large while requiring less pretraining and inference compute, placing it among the three strongest models alongside GPT-4 and PaLM-2-Large. The authors release a 600B-token extract of the web dataset and the models themselves under a permissive license to support open development of large language models.

Core claim

Falcon-180B significantly outperforms models such as PaLM or Chinchilla, improves upon concurrently developed models such as LLaMA 2 or Inflection-1, and nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it one of the three best language models in the world along with GPT-4 and PaLM-2-Large.

What carries the argument

Causal decoder-only transformer models trained on diverse high-quality web corpora using a custom distributed training codebase that scales efficiently to 4,096 A100 GPUs on cloud infrastructure with limited interconnect.
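The "causal decoder-only" design named here can be sketched in a few lines. This is a toy single-head illustration of the masking that makes such models autoregressive, not Falcon's actual implementation (which uses multiquery attention and other refinements):

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention: position i may attend only
    to positions <= i, which makes the decoder autoregressive.
    Toy sketch of the architecture class, not Falcon's codebase."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (T, T) attention logits
    mask = np.triu(np.ones_like(scores), k=1)      # 1s strictly above diagonal
    scores = np.where(mask == 1, -np.inf, scores)  # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Causality check: the first token's output cannot depend on later tokens.
T, D = 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(T, D))
w = [rng.normal(size=(D, D)) for _ in range(3)]
out_full = causal_self_attention(x, *w)
out_prefix = causal_self_attention(x[:1], *w)
assert np.allclose(out_full[0], out_prefix[0])
```

The masking is the whole trick: training can score every next-token prediction in a sequence in one forward pass, which is what makes trillion-token pretraining runs tractable.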

If this is right

  • Open release of the 600B-token web data extract and the models under permissive license enables community replication and extension of the training approach.
  • Custom distributed training on limited-interconnect cloud hardware demonstrates a practical path for large-scale pretraining without specialized clusters.
  • High performance from filtered web data indicates that scale and quality curation can substitute for exclusive data sources in building competitive models.
  • Lower inference cost relative to peers supports broader deployment of near-frontier capabilities in open settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The emphasis on web-data filtering may generalize to show that careful curation matters more than proprietary data origins for frontier-level performance.
  • Releasing both models and training data at this scale could accelerate independent verification of scaling laws in open environments.
  • Efficiency gains on commodity cloud hardware might lower barriers for academic or smaller-team reproduction of similar models.

Load-bearing premise

The reported benchmark results reflect genuine capability gains rather than differences in evaluation protocols, data contamination, or undisclosed advantages in testing conditions.

What would settle it

Independent re-evaluation of Falcon-180B on the same benchmarks as PaLM-2-Large, using identical protocols and explicit checks for data overlap or contamination, would show whether performance truly nears that level.
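The data-overlap check described above can be sketched as set intersection over token n-grams. A minimal illustration: the window size n=5 here is arbitrary for the demo (the rebuttal mentions 13-gram rates), and real pipelines add text normalization and fuzzy matching:

```python
def ngrams(tokens, n):
    """All contiguous length-n token windows of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_examples, train_index, n):
    """Fraction of benchmark examples sharing at least one
    n-gram with the training-corpus index."""
    hits = sum(1 for ex in benchmark_examples
               if ngrams(ex.split(), n) & train_index)
    return hits / len(benchmark_examples)

# Toy corpus and benchmark; a real check would index the full pretraining set.
corpus = ["the quick brown fox jumps over the lazy dog every single day"]
train_index = set()
for doc in corpus:
    train_index |= ngrams(doc.split(), n=5)

benchmark = [
    "the quick brown fox jumps over the lazy dog every single day",   # leaked
    "which element has the chemical symbol Fe in the periodic table", # clean
]
rate = contamination_rate(benchmark, train_index, n=5)  # → 0.5
```

Reporting this rate per benchmark, for both the model under test and its baselines, is the kind of evidence that would separate capability from leakage.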

read the original abstract

We introduce the Falcon series: 7B, 40B, and 180B parameters causal decoder-only models trained on a diverse high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text--the largest openly documented pretraining run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1. It nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it, to our knowledge, one of the three best language models in the world along with GPT-4 and PaLM-2-Large. We report detailed evaluations, as well as a deep dive into the methods and custom tooling employed to pretrain Falcon. Notably, we report on our custom distributed training codebase, allowing us to efficiently pretrain these models on up to 4,096 A100s on cloud AWS infrastructure with limited interconnect. We release a 600B tokens extract of our web dataset, as well as the Falcon-7/40/180B models under a permissive license to foster open-science and accelerate the development of an open ecosystem of large language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Falcon series of open causal decoder-only language models (7B, 40B, and 180B parameters) trained on up to 3.5 trillion tokens of predominantly web-derived data. It details a custom distributed training framework enabling efficient pretraining on up to 4096 A100 GPUs with limited interconnect, reports benchmark results claiming Falcon-180B outperforms PaLM and Chinchilla while approaching PaLM-2-Large at lower cost, and releases the models plus a 600B-token data extract under a permissive license.

Significance. If the benchmark comparisons prove robust, the work would be significant for open LLM research by documenting one of the largest openly detailed pretraining runs, providing a competitive 180B model, and releasing tooling and data that could accelerate reproducible scaling studies and reduce dependence on closed models.

major comments (2)
  1. [Evaluation] Evaluation section (main results tables): the direct comparisons to closed models such as PaLM-2-Large and GPT-4 do not specify the exact few-shot templates, answer normalization procedures, or decontamination filters applied to the baselines. This detail is load-bearing for the central claim that Falcon-180B 'nears the performance of PaLM-2-Large' given the web-crawled training corpus.
  2. [Data] Data section (corpus construction): while the 3.5T-token web corpus is described at high level, the manuscript provides no quantitative overlap statistics or explicit decontamination pipeline for standard benchmarks (MMLU, HellaSwag, etc.). Without these, the reported gains cannot be confidently attributed to capability rather than leakage.
minor comments (2)
  1. [Training] Figure captions in the training infrastructure section could more clearly label scaling curves with exact token counts and hardware configurations for reproducibility.
  2. [Abstract] The abstract's phrasing 'one of the three best language models in the world' is subjective; a more precise qualifier such as 'among the highest-performing openly documented models' would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps strengthen the clarity and rigor of our work. We address each major comment below and will revise the manuscript to incorporate additional details where appropriate.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section (main results tables): the direct comparisons to closed models such as PaLM-2-Large and GPT-4 do not specify the exact few-shot templates, answer normalization procedures, or decontamination filters applied to the baselines. This detail is load-bearing for the central claim that Falcon-180B 'nears the performance of PaLM-2-Large' given the web-crawled training corpus.

    Authors: We agree that explicit specification of evaluation protocols is essential for reproducibility and fair comparison. While high-level descriptions appear in the evaluation section and appendix, we will expand the main text in the revised manuscript to list the precise few-shot templates, answer normalization procedures (e.g., log-likelihood vs. probability normalization), and decontamination filters applied to all baselines including PaLM-2-Large and GPT-4. This will directly support the performance claims. revision: yes

  2. Referee: [Data] Data section (corpus construction): while the 3.5T-token web corpus is described at high level, the manuscript provides no quantitative overlap statistics or explicit decontamination pipeline for standard benchmarks (MMLU, HellaSwag, etc.). Without these, the reported gains cannot be confidently attributed to capability rather than leakage.

    Authors: We acknowledge the value of quantitative decontamination evidence. In the revision, we will add a dedicated subsection detailing our decontamination pipeline (including n-gram overlap filtering against common benchmarks) and report overlap statistics (e.g., 13-gram contamination rates) for MMLU, HellaSwag, and similar suites. The released 600B-token data extract will further enable independent verification, allowing readers to confirm that gains reflect capability rather than leakage. revision: yes
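The answer-normalization choice flagged in the first response is not cosmetic: it can flip which option a harness selects. A toy sketch with made-up per-token log-probabilities (no real evaluation harness or model is invoked here):

```python
def score_options(option_logprobs, normalize=True):
    """Rank multiple-choice answers by summed token log-likelihood.
    With normalize=True the sum is divided by token count, so longer
    answers are not penalized merely for containing more tokens.
    Toy illustration of the protocol choice, not any paper's harness."""
    scores = {}
    for option, logprobs in option_logprobs.items():
        total = sum(logprobs)
        scores[option] = total / len(logprobs) if normalize else total
    return max(scores, key=scores.get)

# A long answer with a good per-token fit vs. a short mediocre one.
opts = {
    "short": [-1.2],                    # sum -1.2, mean -1.2
    "long":  [-0.4, -0.4, -0.4, -0.4],  # sum -1.6, mean -0.4
}
assert score_options(opts, normalize=False) == "short"
assert score_options(opts, normalize=True) == "long"
```

Unless every baseline is scored under the same convention, cross-model comparisons on multiple-choice benchmarks are not apples to apples, which is the referee's underlying concern.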

Circularity Check

0 steps flagged

No circularity: purely empirical training and benchmark reporting

full rationale

The paper reports the training of decoder-only models on a 3.5T-token web corpus and their benchmark scores against external models. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text. Claims rest on direct training runs and standard benchmark comparisons rather than any step that reduces by construction to inputs defined inside the paper. No self-citation chain, ansatz smuggling, or renaming of known results is present. The work is self-contained as an empirical description.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on the standard transformer decoder architecture and the empirical practice of scaling language model training on filtered web text; no new mathematical axioms or invented physical entities are introduced.

axioms (2)
  • standard math: Causal decoder-only transformer architecture supports next-token prediction at scale
    Invoked implicitly throughout the model description and training setup.
  • domain assumption: High-quality web data, filtered appropriately, yields capable language models
    Central to the data assembly claim in the abstract.

pith-pipeline@v0.9.0 · 5596 in / 1394 out tokens · 35066 ms · 2026-05-16T09:42:50.408969+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ORPO: Monolithic Preference Optimization without Reference Model

    cs.CL 2024-03 conditional novelty 8.0

    ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

  2. How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

    cs.CL 2026-04 unverdicted novelty 7.0

    Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.

  3. From Competition to Collaboration: Designing Sustainable Mechanisms Between LLMs and Online Forums

    cs.AI 2026-02 unverdicted novelty 7.0

    A new sequential interaction framework lets LLMs propose questions to forums, with simulations on real Stack Exchange data showing players can reach roughly half the utility of an ideal full-information scenario despi...

  4. Moshi: a speech-text foundation model for real-time dialogue

    eess.AS 2024-09 accept novelty 7.0

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  5. Massive Activations in Large Language Models

    cs.CL 2024-02 unverdicted novelty 7.0

    Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

  6. Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.

  7. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 conditional novelty 6.0

    DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.

  8. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 unverdicted novelty 6.0

    DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.

  9. LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.

  10. LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.

  11. Language models recognize dropout and Gaussian noise applied to their activations

    cs.AI 2026-04 unverdicted novelty 6.0

    Language models detect, localize, and distinguish dropout from Gaussian noise applied to their activations, often with high accuracy.

  12. Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting

    cs.CL 2026-04 unverdicted novelty 6.0

    CoT2Edit trains LLMs to reason over edited knowledge using agent-generated CoTs, SFT, GRPO, and RAG, achieving generalization across six editing scenarios on three models.

  13. SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference

    cs.AI 2026-02 unverdicted novelty 6.0

    SweetSpot is an analytical model from Transformer computational and memory complexity that identifies energy minima at short-to-moderate inputs and medium outputs, achieving 1.79% MAPE on H100 GPU measurements across ...

  14. Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling

    cs.CV 2026-05 unverdicted novelty 5.0

    A closed-loop system couples LLM-based 3D scene generation with RL optimization and VR user interactions to produce adaptive, immersive environments, claiming SOTA results on the ALFRED benchmark.

  15. Instruction-Tuned LLMs for Parsing and Mining Unstructured Logs on Leadership HPC Systems

    cs.AI 2026-04 unverdicted novelty 5.0

    An instruction-tuned 8B LLaMA model parses HPC logs with accuracy matching larger models and processes 600 million Frontier supercomputer logs to reveal temporal patterns and anomalies.

  16. SUMMIR: A Hallucination-Aware Framework for Ranking Sports Insights from LLMs

    cs.IR 2026-03 conditional novelty 5.0

    SUMMIR is a multimetric ranking model that orders LLM-generated sports insights by importance while incorporating hallucination detection to improve factual reliability across cricket, soccer, basketball, and baseball...

  17. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  18. InternLM2 Technical Report

    cs.CL 2024-03 unverdicted novelty 5.0

    InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.

  19. AgriIR: A Scalable Framework for Domain-Specific Knowledge Retrieval

    cs.IR 2026-03 unverdicted novelty 3.0

    AgriIR is a configurable RAG framework using modular stages and 1B-parameter models to deliver grounded, citable answers for Indian agricultural information access.

  20. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

Reference graph

Works this paper leans on

269 extracted references · 269 canonical work pages · cited by 18 Pith papers · 64 internal anchors
