arxiv: 2403.04652 · v3 · submitted 2024-03-07 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Yi: Open Foundation Models by 01.AI

01.AI: Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Guoyin Wang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Pengcheng Nie, Peng Liu, Qiang Liu, Senbin Yang, Shawn Yue, Shiming Yang, Wenhao Huang, Wen Xie, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai

Pith reviewed 2026-05-13 05:39 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords foundation modelsdata qualitydata engineeringlanguage modelsmultimodal modelslong contexttransformer architecture

0 comments

The pith

The Yi models achieve strong benchmark results primarily through high-quality data curation from a cascaded deduplication and filtering pipeline on 3.1 trillion tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the Yi family of 6B and 34B language models gains its capabilities mainly from data quality produced by targeted engineering rather than architectural innovations. A classical transformer backbone is pretrained on English and Chinese corpora assembled through successive rounds of deduplication and quality filtering. This base supports extensions to chat models with high human preference scores, 200K-token context lengths, depth-upscaled variants, and vision-language models that align visual features to language semantics. Finetuning relies on a small set of instruction examples each verified by engineers. If the attribution holds, it implies that optimized data pipelines can deliver competitive foundation models and that further scaling of parameters with similarly refined data should yield stronger results.

Core claim

The Yi model family, built on 6B and 34B pretrained language models, demonstrates strong performance across benchmarks and human evaluations primarily because of data quality resulting from data-engineering efforts. Pretraining uses 3.1 trillion tokens of English and Chinese text constructed via a cascaded deduplication and quality filtering pipeline. Chat models are created from a small, multi-iteration polished instruction dataset under 10K examples where every instance is verified by machine learning engineers. Vision-language models combine the chat language model with a vision transformer encoder to align visual representations to the language model's semantic space. Context length is扩展

What carries the argument

The cascaded data deduplication and quality filtering pipeline that assembles the 3.1 trillion token pretraining corpus from English and Chinese sources.

If this is right

Base models reach strong results on a wide range of benchmarks including MMLU.
Finetuned chat models attain high human preference rates on major evaluation platforms.
Lightweight continual pretraining extends context length to 200K tokens while preserving strong needle-in-a-haystack retrieval.
Continual pretraining to increase model depth further improves performance on the same data.
Continued scaling of model parameters with similarly optimized data is expected to produce stronger frontier models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If data quality dominates, applying comparable filtering pipelines to existing public datasets could raise performance of other models without architecture changes.
The reliance on a small verified instruction set implies that careful human review may matter more for alignment than large volumes of unverified data.
The straightforward attachment of a vision encoder to an already strong language model suggests multimodal extension is mainly a matter of alignment training rather than joint pretraining from scratch.

Load-bearing premise

The performance improvements arise chiefly from the superior quality of data produced by the custom deduplication and filtering pipeline rather than from other training details or scale factors.

What would settle it

Train identical 6B and 34B transformer models on publicly available unfiltered English and Chinese corpora and measure whether the resulting MMLU and other benchmark scores fall substantially below the reported Yi results.

read the original abstract

We introduce the Yi model family, a series of language and multimodal models that demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and 34B pretrained language models, then we extend them to chat models, 200K long context models, depth-upscaled models, and vision-language models. Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver strong human preference rate on major evaluation platforms like AlpacaEval and Chatbot Arena. Building upon our scalable super-computing infrastructure and the classical transformer architecture, we attribute the performance of Yi models primarily to its data quality resulting from our data-engineering efforts. For pretraining, we construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline. For finetuning, we polish a small scale (less than 10K) instruction dataset over multiple iterations such that every single instance has been verified directly by our machine learning engineers. For vision-language, we combine the chat language model with a vision transformer encoder and train the model to align visual representations to the semantic space of the language model. We further extend the context length to 200K through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. We show that extending the depth of the pretrained checkpoint through continual pretraining further improves performance. We believe that given our current results, continuing to scale up model parameters using thoroughly optimized data will lead to even stronger frontier models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Yi family of open foundation models, including 6B and 34B pretrained transformers, extended to chat, 200K-context, depth-upscaled, and vision-language variants. It reports strong results on MMLU and human-preference platforms (AlpacaEval, Chatbot Arena) and attributes these gains primarily to a cascaded deduplication/quality-filtering pipeline that produced 3.1T English/Chinese tokens plus a <10K polished instruction set; the architecture is described as a standard transformer.

Significance. If the data-quality attribution were substantiated by controlled evidence, the work would usefully illustrate how targeted data engineering can yield competitive performance with classical architectures and modest parameter counts, complementing scaling-law studies. The open release of models and the long-context and multimodal extensions could enable follow-on research, but the current manuscript supplies only final benchmark numbers without isolating the pipeline's causal contribution.

major comments (3)

[Abstract] Abstract: the central claim that 'we attribute the performance of Yi models primarily to its data quality resulting from our data-engineering efforts' is unsupported by any ablation, baseline comparison, or quantitative isolation of the pipeline. No results are shown for identical 6B/34B models trained on RedPajama, Dolma, or unfiltered Common Crawl, nor are perplexity curves or token-level statistics provided.
[Abstract] The manuscript supplies no error bars, variance estimates, or multiple-run statistics for the reported MMLU and preference scores, making it impossible to assess whether the claimed gains over prior models are statistically reliable.
[Pretraining description] No head-to-head dataset comparison or controlled experiment is described that would test the weakest assumption: that the specific cascaded deduplication and quality-filtering choices outperform standard high-quality corpora under matched compute.

minor comments (2)

[Abstract] The abstract and main text should explicitly state the exact MMLU scores, baseline models, and evaluation settings rather than using the phrase 'strong performance.'
[Finetuning description] Clarify whether the <10K instruction examples were used only for the chat models or also influenced the base pretrained checkpoints.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional evidence would strengthen the manuscript's claims about data quality. We respond to each major comment below, indicating planned revisions where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that 'we attribute the performance of Yi models primarily to its data quality resulting from our data-engineering efforts' is unsupported by any ablation, baseline comparison, or quantitative isolation of the pipeline. No results are shown for identical 6B/34B models trained on RedPajama, Dolma, or unfiltered Common Crawl, nor are perplexity curves or token-level statistics provided.

Authors: We acknowledge that the manuscript lacks direct ablations or baseline comparisons with models trained on alternative corpora such as RedPajama, Dolma, or unfiltered Common Crawl. Performing such controlled experiments at the full 3.1T token scale would require prohibitive additional compute resources. Our attribution rests on the iterative design of the cascaded deduplication and quality-filtering pipeline, which was specifically engineered to produce high-quality English and Chinese data, combined with the observed benchmark performance. In the revised manuscript, we will expand the pretraining section with further details on pipeline components and add an explicit limitations paragraph noting the absence of these controlled comparisons. revision: partial
Referee: [Abstract] The manuscript supplies no error bars, variance estimates, or multiple-run statistics for the reported MMLU and preference scores, making it impossible to assess whether the claimed gains over prior models are statistically reliable.

Authors: We agree that the lack of error bars and variance estimates limits statistical assessment. Large-scale pretraining runs are typically performed once due to extreme computational cost, consistent with reporting practices in comparable works. The MMLU scores reflect single training runs, while preference scores on AlpacaEval and Chatbot Arena aggregate user interactions without per-run variance. We will revise the evaluation sections to explicitly state the single-run nature of the results and include any available statistics from internal smaller-scale validations or related literature. revision: partial
Referee: [Pretraining description] No head-to-head dataset comparison or controlled experiment is described that would test the weakest assumption: that the specific cascaded deduplication and quality-filtering choices outperform standard high-quality corpora under matched compute.

Authors: The pretraining description emphasizes the end-to-end pipeline tailored for our multilingual corpus rather than comparative experiments. Head-to-head tests under matched compute were not conducted, as the focus was on releasing performant models; internal development iterations guided the pipeline choices but are not documented. We will update the pretraining section to provide additional rationale for the cascaded approach and include a limitations statement acknowledging the lack of direct head-to-head comparisons. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical attribution stands on external verification

full rationale

The paper contains no mathematical derivations, equations, or first-principles claims that could reduce to self-definition or fitted inputs. The central assertion—that Yi performance stems primarily from a cascaded deduplication/quality-filtering pipeline on 3.1T tokens and a polished <10K instruction set—is presented as an empirical observation tied to the classical transformer architecture and compute infrastructure. No self-citations, uniqueness theorems, ansatzes, or renamings of known results appear in the provided text. The claim is falsifiable via independent replication or ablations and does not internally equate its output to its inputs by construction. This is the expected non-finding for a model-release paper whose load-bearing content is benchmark numbers rather than a closed-form derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on empirical training outcomes using the classical transformer architecture and standard scaling practices; no new mathematical axioms or fitted parameters are introduced beyond routine hyperparameter choices.

axioms (1)

standard math Classical transformer architecture and scaling laws hold for the described model sizes and data volumes
The paper explicitly builds upon the classical transformer without new proofs or modifications to core assumptions.

pith-pipeline@v0.9.0 · 5691 in / 1293 out tokens · 131343 ms · 2026-05-13T05:39:05.639941+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.DimensionForcing dimension_forced unclear
Building upon our scalable super-computing infrastructure and the classical transformer architecture

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
cs.CL 2024-09 accept novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
RULER: What's the Real Context Size of Your Long-Context Language Models?
cs.CL 2024-04 accept novelty 8.0

RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them
cs.CL 2026-04 unverdicted novelty 7.0

Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.
Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice
cs.CL 2026-04 unverdicted novelty 7.0

TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.
Refusal in Language Models Is Mediated by a Single Direction
cs.LG 2024-06 accept novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
cs.LG 2026-05 unverdicted novelty 6.0

Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.
An Interpretable and Scalable Framework for Evaluating Large Language Models
stat.ML 2026-05 unverdicted novelty 6.0

A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.
SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs
cs.SE 2026-05 unverdicted novelty 6.0

SynConfRoute routes code completions using syntax validation and token confidence, improving pass@1 by up to 31% on hard tasks and reducing accelerator usage by 58% versus always using the largest model.
CAP: Controllable Alignment Prompting for Unlearning in LLMs
cs.LG 2026-04 unverdicted novelty 6.0

CAP optimizes prompts via reinforcement learning to selectively unlearn target knowledge in LLMs while preserving general capabilities, without any parameter updates and with reversible revocation.
CAP: Controllable Alignment Prompting for Unlearning in LLMs
cs.LG 2026-04 unverdicted novelty 6.0

CAP enables reversible unlearning of targeted knowledge in LLMs through optimized prompts generated via reinforcement learning, without any parameter updates.
Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
cs.CV 2026-04 unverdicted novelty 6.0

Evidence for cross-modal representational convergence weakens substantially at scale and in realistic many-to-many settings, indicating models learn rich but distinct representations.
GroupDPO: Memory efficient Group-wise Direct Preference Optimization
cs.CL 2026-04 unverdicted novelty 6.0

GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.
RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving
cs.NI 2026-04 unverdicted novelty 6.0

Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.
BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning
cs.CR 2026-04 unverdicted novelty 6.0

BadSkill poisons embedded models in agent skills to achieve up to 99.5% attack success rate on triggered tasks with only 3% poison rate while preserving normal behavior on non-trigger inputs.
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
cs.CL 2024-06 unverdicted novelty 6.0

FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
cs.CL 2024-06 conditional novelty 6.0

MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
cs.CL 2024-04 conditional novelty 6.0

MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
Are We on the Right Way for Evaluating Large Vision-Language Models?
cs.CV 2024-03 conditional novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
Benchmarking LLM-Based Static Analysis for Secure Smart Contract Development: Reliability, Limitations, and Potential Hybrid Solutions
cs.CR 2026-05 unverdicted novelty 5.0

LLMs for smart contract security analysis show lexical bias from identifier names causing high false positives, with prompting creating precision-recall trade-offs, positioning them as complements rather than replacem...
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
cs.CL 2025-02 unverdicted novelty 5.0

SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
cs.CV 2024-08 conditional novelty 5.0

MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
cs.CL 2026-04 unverdicted novelty 4.0

A survey that taxonomizes efficiency methods for LVLMs across the full inference pipeline, decouples the problem into information density, long-context attention, and memory limits, and outlines four future research f...
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
cs.CV 2024-06 unverdicted novelty 4.0

VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · cited by 28 Pith papers · 28 internal anchors

[1]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program Synthesis With lLarge Language Models. arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

work page
[4]

URL https://arxiv.org/pdf/2309.16609.pdf

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about Physical Commonsense in Natural Language. ArXiv, abs/1911.11641, 2019. URL https://api.semanticscholar.org/CorpusID:208290939

work page arXiv 1911
[6]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901, 2020

work page 1901
[7]

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023

work page internal anchor Pith review arXiv 2023
[8]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavar...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

QuAC : Question Answering in Context, 2018

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. QuAC : Question Answering in Context, 2018

work page 2018
[10]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions, 2019

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions, 2019

work page 2019
[12]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, 2018

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, 2018

work page 2018
[13]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168, 2021. 20

work page internal anchor Pith review Pith/arXiv arXiv 2021
[14]

Redpajama: an open dataset for training large language models, 2023

Together Computer. Redpajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data

work page 2023
[15]

FlashAttention-2: Faster attention with better parallelism and work partitioning

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. 2023

work page 2023
[16]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022

work page 2022
[17]

FiDO: Fusion-in-Decoder Optimized for Stronger Performance and Faster Inference

Michiel de Jong, Yury Zemlyanskiy, Joshua Ainslie, Nicholas FitzGerald, Sumit Sanghai, Fei Sha, and William Cohen. FiDO: Fusion-in-Decoder Optimized for Stronger Performance and Faster Inference. arXiv preprint arXiv:2212.08153, 2022

work page arXiv 2022
[18]

DeepSeek-AI, :, Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y . K. Li, Wenfeng Liang, Fangyun Lin, A....

work page 2024
[19]

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm. int8 (): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022

work page internal anchor Pith review arXiv 2022
[20]

Enhancing chat language models by scaling high-quality instructional conversations

Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023

work page arXiv 2023
[21]

How abilities in large language models are affected by supervised fine-tuning data composition, 2023

Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou. How abilities in large language models are affected by supervised fine-tuning data composition, 2023

work page 2023
[22]

Hashimoto

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback, 2023

work page 2023
[23]

Data engineering for scaling language models to 128k context

Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171, 2024

work page arXiv 2024
[24]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Improving alignment of dialogue agents via targeted human judgements

Amelia Glaese, Nat McAleese, Maja Tr˛ ebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022

work page internal anchor Pith review arXiv 2022
[26]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 6904–6913, 2017

work page 2017
[27]

Vizwiz grand challenge: Answering visual questions from blind people

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 3608–3617, 2018. 21

work page 2018
[29]

URL https://arxiv.org/abs/2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2009
[30]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring Mathematical Problem Solving With the MATH Dataset. arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[31]

Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling. 2020

work page 2020
[32]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[33]

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023

work page arXiv 2023
[34]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 6700–6709, 2019

work page 2019
[35]

Openclip, July 2021

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, July 2021. URL https://doi.org/10.5281/ zenodo.5143773

work page 2021
[36]

Neftune: Noisy embeddings improve instruction finetuning

Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, et al. Neftune: Noisy embeddings improve instruction finetuning. arXiv preprint arXiv:2310.05914, 2023

work page arXiv 2023
[37]

Beavertails: Towards improved safety alignment of llm via a human-preference dataset, 2023

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset, 2023

work page 2023
[38]

Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. 2020

work page 2020
[39]

Referitgame: Referring to objects in photographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages 787–798, 2014

work page 2014
[40]

Solar 10.7b: Scaling large language models with simple yet effective depth up-scaling

Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeon- woo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, Changbae Ahn, Seonghoon Yang, Sukyung Lee, Hyunbyung Park, Gyoungjin Gim, Mikyoung Cha, Hwalsuk Lee, and Sunghun Kim. Solar 10.7b: Scaling large language models with simple yet effective depth up-scaling. 2023

work page 2023
[41]

Visual genome: Connecting language and vision using crowdsourced dense image annotations

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017

work page 2017
[42]

S entence P iece: A simple and language independent subword tokenizer and detokenizer for neural text processing

Taku Kudo and John Richardson. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. arXiv preprint arXiv:1808.06226, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[43]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv preprint arXiv:2309.06180, 2023. 22

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

CMMLU: Measuring Massive Multitask Language Understanding in Chinese.arXiv:2306.09212, 2023a

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. CMMLU: Measuring Massive Multitask Language Understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023

work page arXiv 2023
[45]

Sequence paral- lelism: Long sequence training from system perspective

Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence paral- lelism: Long sequence training from system perspective. arXiv preprint arXiv:2105.13120 , 2021

work page arXiv 2021
[46]

Hashimoto

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023

work page 2023
[47]

Chinese llava

LinkSoul-AI. Chinese llava. https://github.com/LinkSoul-AI/Chinese-LLaVA, 2023

work page 2023
[48]

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning

Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, and Junxian He. What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning. arXiv preprint arXiv:2312.15685, 2023

work page arXiv 2023
[51]

#instag: Instruction tagging for analyzing supervised fine-tuning of large language models, 2023

Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, Chang Zhou, and Jingren Zhou. #instag: Instruction tagging for analyzing supervised fine-tuning of large language models, 2023

work page 2023
[52]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering, 2018

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering, 2018

work page 2018
[53]

Ocr-vqa: Visual question answering by reading text in images

Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE, 2019

work page 2019
[54]

arXiv preprint arXiv:2309.09400 , year=

Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A Rossi, and Thien Huu Nguyen. CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages. arXiv preprint arXiv:2309.09400, 2023

work page arXiv 2023
[55]

ChatML, 2022

OpenAI. ChatML, 2022. URL https://github.com/openai/openai-python/blob/ e389823ba013a24b4c32ce38fa0bd87e6bccae94/chatml.md

work page 2022
[56]

Training Language Models to Follow Instructions with Human Feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

work page 2022
[57]

Testing language models on a held-out high school national finals exam

Keiran Paster. Testing language models on a held-out high school national finals exam. https: //huggingface.co/datasets/keirp/hungarian_national_hs_finals_exam, 2023

work page 2023
[58]

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only, 2023

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only, 2023

work page 2023
[59]

Efficiently Scaling Transformer Inference

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently Scaling Transformer Inference. Proceedings of Machine Learning and Systems , 5, 2023

work page 2023
[60]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. arXiv preprint arXiv:2112.11446, 2021. 23

work page internal anchor Pith review Pith/arXiv arXiv 2021
[61]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[62]

ZeRO: Memory Opti- mizations Toward Training Trillion Parameter Models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory Opti- mizations Toward Training Trillion Parameter Models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis , pages 1–16. IEEE, 2020

work page 2020
[63]

SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016

work page 2016
[64]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale, 2019

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An Adversarial Winograd Schema Challenge at Scale, 2019

work page 2019
[65]

SocialIQA: Commonsense Reasoning about Social Interactions, 2019

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. SocialIQA: Commonsense Reasoning about Social Interactions, 2019

work page 2019
[66]

Beyond chinchilla-optimal: Accounting for inference in language model scaling laws

Nikhil Sardana and Jonathan Frankle. Beyond chinchilla-optimal: Accounting for inference in language model scaling laws. arXiv preprint arXiv:2401.00448, 2023

work page arXiv 2023
[67]

Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems , 36, 2024

Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems , 36, 2024

work page 2024
[68]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[69]

Fast Transformer Decoding: One Write-Head is All You Need

Noam Shazeer. Fast Transformer Decoding: One Write-Head is All You Need. arXiv preprint arXiv:1911.02150, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911
[70]

GLU Variants Improve Transformer

Noam Shazeer. GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2002
[71]

Byte Pair Encoding: A Text Compression Scheme That Accelerates Pattern Matching

Yusuxke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, and Setsuo Arikawa. Byte Pair Encoding: A Text Compression Scheme That Accelerates Pattern Matching. Technical report, Technical Report DOI-TR-161, Department of Informatics, Kyushu University, 1999

work page 1999
[72]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[73]

Textcaps: a dataset for image captioning with reading comprehension

Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16 , pages 742–758. Springer, 2020

work page 2020
[74]

Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Ag- nieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Ag- nieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Ama...

work page 2023
[75]

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[76]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging Big- Bench Tasks and Whether Chain-of-Thought can Solve Them.arXiv preprint arXiv:2210.09261, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[77]

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge, 2019

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge, 2019

work page 2019
[78]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[79]

Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

work page 2023
[80]

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need.Advances in Neural Information Processing Systems, 06 2017. URL https://arxiv.org/pdf/1706.03762.pdf

work page internal anchor Pith review Pith/arXiv arXiv 2017
[82]

CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. arXiv preprint arXiv:1911.00359, 2019

work page arXiv 1911

Showing first 80 references.