pith. machine review for the scientific record.

arxiv: 2407.21783 · v3 · submitted 2024-07-31 · 💻 cs.AI · cs.CL · cs.CV

Recognition: 3 Lean theorem links

The Llama 3 Herd of Models

Aaditya Singh, Aaron Grattafiori, Aayushi Srivastava, Abha Jain, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahmad Al-Dahle, Ahuva Goldstand, Aiesha Letman, Ajay Menon, Ajay Sharma, Akhil Mathur, Alan Schelten, Alex Boesenberg, Alexei Baevski, Alex Vaughan, Allie Feinstein, Amanda Kallet, Amit Sangani, Amos Teo, Amy Yang, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Angela Fan, Anirudh Goyal, Ankit Ramchandani, Annie Dong, Annie Franco, Anthony Hartshorn, Anuj Goyal, Aobo Yang, Aparajita Saraf, Archie Sravankumar, Archi Mitra, Arkabandhu Chowdhury, Artem Korenev, Arthur Hinsvark, Arun Rao, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Azadeh Yazdan, Baptiste Roziere, Beau James, Benjamin Leonhardi, Ben Maurer, Bernie Huang, Bethany Biron, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Binh Tang, Bobbie Chern, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Ce Liu, Changhan Wang, Changkyu Kim, Chao Zhou, Charlotte Caucheteux, Chaya Nayak, Chester Hu, Ching-Hsiang Chu, Chloe Bi, Chris Cai, Chris Marra, Chris McConnell, Christian Keller, Chris Tindal, Christophe Touret, Christoph Feichtenhofer, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cynthia Gao, Cyrus Nikolaidis, Damien Allonsius, Damon Civin, Dana Beaty, Daniel Kreymer, Danielle Pintz, Daniel Li, Daniel Song, Danny Livshits, Danny Wyatt, David Adkins, David Esiobu, Davide Testuggine, David Xu, Delia David, Devi Parikh, Dhruv Choudhary, Dhruv Mahajan, Diana Liskovich, Didem Foss, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Egor Lakomkin, Ehab AlBadawy, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Elina Lobanova, Emily 
Dinan, Emily Hahn, Emily Wood, Eric Michael Smith, Eric-Tuan Le, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Filippos Kokkinos, Filip Radenovic, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayet, Frank Seide, Frank Zhang, Gabriela Medina Florez, Gabriella Schwarz, Gabrielle Lee, Gabriel Synnaeve, Gada Badeer, Georgia Lewis Anderson, Georgia Swee, Gil Halpern, Govind Thattai, Graeme Nail, Grant Herman, Gregoire Mialon, Grigory Sizov, Guangyi (Jack) Zhang, Guan Pang, Guillem Cucurell, Guna Lakshminarayanan, Hailey Nguyen, Hakan Inan, Hamid Shojanazeri, Hannah Korevaar, Hannah Wang, Hanwen Zha, Han Zou, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hongyuan Zhan, Hugo Touvron, Hunter Goldman, Hu Xu, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Ilias Leontiadis, Iliyan Zarov, Imanol Arrieta Ibarra, Irina-Elena Veliche, Isabel Kloumann, Ishan Misra, Itai Gat, Ivan Evtimov, Jack Zhang, Jade Copet, Jaewon Lee, Jake Weissman, James Geboski, James Kohli, Jana Vranes, Jan Geffert, Janice Lam, Japhet Asher, Jason Park, Jay Mahadeokar, Jean-Baptiste Gaya, Jeet Shah, Jeff Marcus, Jeff Tang, Jelmer van der Linde, Jennifer Billock, Jennifer Chan, Jenny Hong, Jenny Zhen, Jenya Lee, Jeremy Fu, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jianfeng Chi, Jian Jin, Jianyu Huang, Jiawen Liu, Jiecao Yu, Jie Wang, Jingyi Yang, Joanna Bitton, Joe Cummings, Joe Spisak, Jonathan McPhie, Jonathan Torres, Jon Carvill, Jongsoo Park, Jon Shepard, Joseph Rocca, Josh Ginsburg, Joshua Johnstun, Joshua Saxe, Junjie Wang, Junteng Jia, Kai Wu, Kalyan Vasuden Alwala, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Kartikeya Upasani, Katayoun Zand, Kate Plawiak, Kathy Matosich, Kaushik Veeraraghavan, Ke Li, Kelly Michelena, Kenneth Heafield, Keqian Li, Kevin Stone, Khalid El-Arini, Kiran Jagadeesh, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Kunal Chawla, Kun Huang, Kushal Lakhotia, Kyle 
Huang, Lailin Chen, Lakshya Garg, Lauren Rantala-Yeary, Laurens van der Maaten, Lavender A, Lawrence Chen, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Liang Tan, Licheng Yu, Liron Moshkovich, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Luca Wehrstedt, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Madian Khabsa, Mahesh Pasupuleti, Manav Avalani, Manish Bhatt, Mannat Singh, Manohar Paluri, Marcin Kardas, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Mathew Oldham, Mathieu Rita, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Maya Pavlova, Meghan Keneally, Melanie Kambadur, Miao Liu, Michael L. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mikayel Samvelyan, Mike Clark, Mike Lewis, Mike Macey, Mike Wang, Mik Vyatskov, Min Si, Miquel Jubert Hermoso, Mitesh Kumar Singh, Mohammad Rastegari, Mo Metanat, Mona Hassan, Munish Bansal, Naman Goyal, Nandhini Santhanam, Narjes Torabi, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikhil Mehta, Nikolay Bashlykov, Nikolay Bogoychev, Nikolay Pavlovich Laptev, Niladri Chatterji, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Olivier Duchenne, Omkar Salpekar, Onur Çelebi, Ozlem Kalinli, Parkin Kent, Parth Parekh, Patrick Alrassy, Paul Saab, Pavan Balaji, Pedro Rittner, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prajjwal Bhargava, Prashant Ratanchandani, Pratik Dubal, Praveen Krishnan, Pritish Yuvraj, Punit Singh Koura, Puxin Xu, Qian Liang, Qing He, Qingxiao Dong, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Ragavan Srinivasan, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raj Ganapathy, Ramon Calderer, Rangaprabhu Parthasarathy, Raymond Li, Rebekkah Hogan, Ricardo Silveira Cabral, Roberta Raileanu, Robert Stojnic, Robin Battey, Rocky Wang, Rohan Maheswari, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan 
Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Russ Howes, Ruty Rinott, Sachin Mehta, Sachin Siby, Saghar Hosseini, Sahana Chennabasappa, Sai Jayesh Bondu, Samyak Datta, Sanjay Singh, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Mahajan, Saurabh Verma, Sean Bell, Seiji Yamamoto, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharadh Ramaswamy, Sharan Narang, Sharath Raparthy, Shaun Lindsay, Sheng Feng, Shenghao Lin, Sheng Shen, Shengxin Cindy Zha, Shengye Wan, Shishir Patil, Shiva Shankar, Shruti Bhosale, Shun Zhang, Shuqiang Zhang, Simon Vandenhende, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Suchin Gururangan, Sudarshan Govindaprasad, Sumit Gupta, Summer Deng, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Borodinsky, Sydney Goldman, Tal Remez, Tamara Best, Tamar Glaser, Tamar Herman, Tara Fowler, Tarek Sheasha, Thilo Koehler, Thomas Georgiou, Thomas Robinson, Thomas Scialom, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Tzook Shaked, Ujjwal Karn, Varun Vontimitta, Vedanuj Goswami, Vibhor Gupta, Victoria Ajayi, Victoria Montanez, Vignesh Ramanathan, Vijai Mohan, Viktor Kerkez, Vinay Satish Kumar, Vincent Gonguet, Virginie Do, Vishal Mangla, Vish Vogeti, Vítor Albiero, Vladan Petrovic, Vladimir Ivanov, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Wei Li, Weiwei Chu, Wenchen Wang, Wenhan Xiong, Wenwen Jiang, Wenyin Fu, Wes Bouaziz, Whitney Meers, Will Constable, Xavier Martinet, Xiaocheng Tang, Xiaodong Wang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xiaoqing Ellen Tan, Xide Xia, Xilun Wu, Xinbo Gao, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yaniv Kleinman, Yanjun Chen, Yashesh Gaur, Yasmine Babaei, Ye Hu, Ye Jia, Yenda Li, Ye Qi, Yilin Zhang, Ying Zhang, Yi Wen, Yiwen Song, Yossi Adi, 
Youngjin Nam, Yuchen Hao, Yuchen Zhang, Yue Li, Yundi Qian, Yuning Mao, Yunlu Li, Yu (Sid) Wang, Yu Zhao, Yuzi He, Zacharie Delpierre Coudert, Zachary DeVito, Zach Rait, Zef Rosnbrick, Zhaoduo Wen, Zhengxing Chen, Zheng Yan, Zhenyu Yang, Zhiwei Zhao, Zhiyu Ma, Zoe Papakipos

Pith reviewed 2026-05-08 21:44 UTC · model claude-opus-4-7

classification 💻 cs.AI · cs.CL · cs.CV

keywords foundation models · large language models · dense Transformer · long context · multilinguality · tool use · compositional multimodality · open weights

The pith

A 405B-parameter dense Transformer with a 128K context matches GPT-4-class quality across language, code, reasoning, and tool use, and reaches competitive multimodal performance by attaching image, video, and speech encoders rather than training a single fused multimodal model from scratch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a family of foundation language models topped by a 405B-parameter dense Transformer with a 128K-token context window, and reports an empirical case that this open model is roughly on par with the leading closed models on multilingual text, coding, reasoning, and tool-use benchmarks. It defends a deliberately conservative architectural choice — a single dense Transformer rather than a mixture of experts or other sparsity scheme — and argues that careful data curation, scaling, and post-training pipelines (supervised fine-tuning, preference optimization, safety tuning) are what carry the quality. For multimodality, it advances a compositional thesis: train modality encoders for image, video, and speech and attach them to the language model through adapters, rather than pretraining a single fused multimodal model from scratch. The paper claims this composition is already competitive with state-of-the-art on standard perception benchmarks. The release of weights for both pretrained and post-trained versions, along with a separate input/output safety classifier, is part of the contribution: it puts a frontier-scale model in third-party hands so others can reproduce or contest the parity claim.

Core claim

A 405-billion-parameter dense Transformer with a 128K-token context window, trained and post-trained at scale, reaches quality comparable to the strongest closed language models on a wide span of tasks: multilingual text, code, mathematical and general reasoning, and tool use. The paper further argues that you do not need a single end-to-end multimodal model to be competitive on images, video, and speech: bolting modality-specific encoders onto the frozen-or-lightly-adapted language model — a compositional rather than fused design — already lands near state-of-the-art on standard benchmarks for those modalities.

What carries the argument

A 405B dense Transformer scaled with disciplined data, long-context (128K) training, and a multi-stage post-training stack (SFT + preference optimization + safety tuning), combined with a compositional multimodal recipe in which separately trained image, video, and speech encoders are attached via adapters to the language model rather than co-trained from scratch.
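The compositional recipe can be caricatured in a few lines: at minimum, an adapter is a learned projection that maps encoder features into the language model's embedding space so they can be consumed as soft tokens. The sketch below is illustrative only; the paper's actual adapters (their depth, attention structure, and training) are not reproduced here, and all shapes, weights, and names are invented.

```python
# Toy compositional adapter: a linear projection from encoder-feature space
# into the language model's embedding space. Everything here is hypothetical.

import random

def make_linear_adapter(d_in, d_out, seed=0):
    """Return a toy adapter mapping encoder features (d_in) to LM width (d_out)."""
    rng = random.Random(seed)
    # Small random weights stand in for trained parameters.
    w = [[rng.uniform(-0.01, 0.01) for _ in range(d_in)] for _ in range(d_out)]
    def adapter(feature):
        return [sum(w[i][j] * feature[j] for j in range(d_in)) for i in range(d_out)]
    return adapter

# The frozen language model consumes token embeddings of width d_model; the
# adapter lets an image-encoder feature be prepended as a "soft token".
d_vision, d_model = 4, 8
adapter = make_linear_adapter(d_vision, d_model)
vision_feature = [1.0, 0.5, -0.5, 2.0]   # one patch embedding (hypothetical)
soft_token = adapter(vision_feature)
assert len(soft_token) == d_model        # now compatible with the LM input width
```

A design consequence worth noting: upgrading the image encoder only requires retraining this projection, not the language core, which is what makes the compositional choice cheap to iterate on.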

If this is right

  • An openly released 405B model with 128K context lets outside groups reproduce, audit, and red-team frontier-scale behavior, including contamination checks the paper itself cannot fully rule out.
  • If a plain dense Transformer at this scale really matches the mixture-of-experts and other architectural variants used by competitors, the marginal value of architectural novelty over data and post-training is smaller than commonly assumed.
  • The compositional multimodal result implies that strong vision, video, and speech capability can be attached to a text backbone through adapters, without fused multimodal pretraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Editorial inference: the parity-with-GPT-4 framing is partly a claim about ceilings — that a dense, well-curated recipe at 405B is near a plateau where further gains from scale-and-data alone are sublinear, and that the next frontier is post-training, tools, and modalities rather than parameter count.
  • Editorial inference: the compositional multimodal choice is also a hedging strategy — encoders can be swapped or upgraded without retraining the language core, which matters more for a release pipeline than for a single benchmark number.
  • Editorial inference: open-weighting a 405B model effectively externalizes evaluation: third parties, not the authors, will end up confirming or contesting the parity claim.

Load-bearing premise

That the public benchmark scores used to claim parity with the strongest closed models actually reflect general capability, rather than overlap between evaluation sets and the (undisclosed) pretraining corpus or evaluation choices that flatter the released model.

What would settle it

A controlled head-to-head evaluation on tasks constructed after Llama 3's training cutoff and verified to be absent from its training data — covering multilingual reasoning, code, math, and long-context retrieval — in which the 405B model trails the named frontier models by a wide margin would directly undermine the parity claim.

read the original abstract

Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

4 major / 4 minor

Summary. The manuscript presents the Llama 3 family of foundation models, headlined by a dense 405B-parameter Transformer with a 128K-token context window, alongside 8B and 70B variants. The authors describe pretraining, post-training (SFT and preference optimization), a safety model (Llama Guard 3), and a compositional approach for adding image, video, and speech encoders to the language backbone. The empirical claim is that Llama 3 reaches quality "comparable to leading language models such as GPT-4" across multilingual text, code, reasoning, and tool-use benchmarks, and that the multimodal extensions are competitive with the state of the art on their respective tasks. Pretrained and post-trained 405B weights and Llama Guard 3 are released; the multimodal extensions are not.

Significance. If the parity claim holds, this is a significant artifact contribution: an openly released 405B dense model with a 128K context window narrows the gap between open-weights and closed frontier models and enables third-party scientific work (interpretability, fine-tuning, contamination audits, red-teaming) that is impossible on closed APIs. The release of Llama Guard 3 as a separate safety classifier and the description of a working compositional multimodal pipeline are independently useful. The paper's strengths that should be credited explicitly: (i) the weights themselves are released, so the headline inference claims are independently verifiable in a way that closed-model papers are not; (ii) the scope of the empirical evaluation is unusually broad; (iii) the compositional rather than end-to-end multimodal recipe is a concrete, reproducible design choice. The weakness, addressed below, is that the relative claim against GPT-4 is the load-bearing scientific assertion and is the part hardest to verify from the paper alone.

major comments (4)
  1. [Abstract / Evaluation sections] The central comparative claim — 'comparable quality to leading language models such as GPT-4' — is a relative claim against a closed, moving target. The manuscript should make explicit, in one place, for every headline benchmark: (a) whether the GPT-4 number was re-run by the authors or quoted, (b) the API snapshot/date, (c) prompt template, system message, few-shot exemplars, decoding parameters, and CoT policy used for both models, and (d) whether these were held identical across systems. Several percentage points on MMLU/GSM8K/MATH/HumanEval/MBPP can be moved by these choices alone, and the claimed parity gaps are of that order. Without this matrix the parity claim cannot be audited.
  2. [Pretraining data / decontamination] The decontamination methodology (typically n-gram overlap with eval sets in releases of this kind) catches near-duplicates but not paraphrases, translations, or solutions discussed in web text. Because the pretraining corpus is not disclosed at document level, an independent contamination probe is needed to support the headline benchmark numbers: e.g., performance on freshly constructed or post-cutoff held-out variants, perplexity gap between benchmark items and matched controls, or membership-inference-style tests on benchmark instances. Please add at least one such probe, or qualify the parity framing accordingly.
  3. [Multimodal experiments] The abstract states the compositional image/video/speech approach 'performs competitively with the state-of-the-art,' but the corresponding models are 'not yet being broadly released.' For a non-released system the burden on evaluation transparency is higher, not lower: please ensure the multimodal sections specify exactly which baselines, checkpoints, and protocols are compared, and which numbers are taken from prior work versus re-run.
  4. [Scope of contribution] It would help the reader if the manuscript stated which elements are intended as scientific contributions (e.g., scaling-law analyses, post-training recipe ablations, the compositional multimodal recipe, Llama Guard 3 design) versus engineering/release documentation. As written, the paper mixes both, and reviewers cannot easily identify which claims are meant to be defended on methodological grounds.
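Major comment 2's worry can be made concrete: the n-gram overlap check the referee describes fits in a few lines, which also makes its blind spot to paraphrase obvious. This is an illustrative sketch; the choice of n, the tokenization, and the normalization are hypothetical, not Llama 3's actual pipeline.

```python
# Hedged sketch of n-gram decontamination: flag a training document if it
# shares any 8-gram with an evaluation item. Parameters are illustrative.

def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc, eval_items, n=8):
    eval_grams = set().union(*(ngrams(item, n) for item in eval_items))
    return bool(ngrams(train_doc, n) & eval_grams)

eval_items = ["the quick brown fox jumps over the lazy dog near the river bank"]
clean_doc = "an entirely unrelated training document about transformer scaling laws"
leaky_doc = "blog post quoting the quick brown fox jumps over the lazy dog today"

assert is_contaminated(leaky_doc, eval_items)      # verbatim 8-gram overlap
assert not is_contaminated(clean_doc, eval_items)
# The referee's point: a paraphrase ("a fast brown fox leaps over ...") shares
# no 8-gram with the eval item and passes this filter untouched.
```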
minor comments (4)
  1. [Abstract] 'comparable quality to leading language models such as GPT-4' would be more precise as a quantified statement (e.g., 'within X points on benchmark suite Y under matched protocol Z'). The current phrasing invites overreading.
  2. [Abstract] Clarify what 'compositional approach' means at the abstract level (frozen LM + trained adapter + modality encoder, or otherwise), since this is the multimodal design contribution being claimed.
  3. [Release] Specify in the abstract or introduction the license under which Llama 3 and Llama Guard 3 are released, as this materially affects the artifact's scientific value (third-party reproducibility, contamination audits, fine-tuning studies).
  4. [Terminology] The phrase 'natively support multilinguality, coding, reasoning, and tool usage' conflates capability with training emphasis; consider rewording to indicate that these are explicitly targeted in the data mixture and post-training, not architectural features.
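The protocol matrix requested in major comment 1 is mechanical to represent. The sketch below, with entirely hypothetical field names and values, shows the shape of such a table and the rule under which a row pair would count as auditable parity evidence.

```python
# Hypothetical per-benchmark evaluation-protocol matrix (major comment 1).
# Field names and values are placeholders, not the paper's actual settings.

MATCH_FIELDS = ("prompt_id", "shots", "cot", "temperature")

matrix = {
    ("MMLU", "Llama-3-405B"): {"source": "re-run", "model_id": "released weights",
                               "prompt_id": "p1", "shots": 5, "cot": False, "temperature": 0.0},
    ("MMLU", "GPT-4"):        {"source": "re-run", "model_id": "api snapshot + date window",
                               "prompt_id": "p1", "shots": 5, "cot": False, "temperature": 0.0},
    ("MATH", "Llama-3-405B"): {"source": "re-run", "model_id": "released weights",
                               "prompt_id": "p2", "shots": 4, "cot": True, "temperature": 0.0},
    ("MATH", "GPT-4"):        {"source": "quoted", "model_id": "vendor-reported",
                               "prompt_id": None, "shots": None, "cot": None, "temperature": None},
}

def parity_auditable(matrix, benchmark, a, b):
    """Count a row pair as parity evidence only if both systems were re-run
    under an identical protocol (prompt, shots, CoT policy, decoding)."""
    ra, rb = matrix[(benchmark, a)], matrix[(benchmark, b)]
    if ra["source"] != "re-run" or rb["source"] != "re-run":
        return False  # quoted vendor numbers cannot support a parity claim
    return all(ra[f] == rb[f] for f in MATCH_FIELDS)

assert parity_auditable(matrix, "MMLU", "Llama-3-405B", "GPT-4")
assert not parity_auditable(matrix, "MATH", "Llama-3-405B", "GPT-4")
```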

Simulated Author's Rebuttal

4 responses · 2 unresolved

We thank the referee for a careful and constructive report, and in particular for crediting the open release of weights as the mechanism by which our headline inference claims can be independently audited. We agree with the central thrust of the major comments: the load-bearing scientific assertion in the manuscript is the parity claim against closed frontier models, and that claim deserves a more explicit evaluation-protocol matrix and an independent contamination probe than the current draft provides. We will revise accordingly. Below we respond point by point, indicate where the manuscript will be amended, and note one item (closed-model evaluation transparency) where our ability to comply is intrinsically limited by the closed nature of the comparator.

read point-by-point responses
  1. Referee: The 'comparable to GPT-4' claim needs an explicit per-benchmark matrix: re-run vs. quoted, API snapshot/date, prompt template, system message, few-shot exemplars, decoding parameters, CoT policy, and whether these were held identical across systems.

    Authors: We agree and will add such a matrix as an appendix table covering every headline benchmark in the main text (MMLU, MMLU-Pro, GSM8K, MATH, HumanEval, MBPP, GPQA, IFEval, multilingual and tool-use suites). For each row we will state: (a) re-run by us vs. quoted from the source paper/leaderboard; (b) for re-runs of GPT-4/GPT-4o/Claude/Gemini, the exact API model identifier and the date window in which calls were made; (c) the full prompt, system message, k-shot exemplars, temperature/top-p/max-tokens, and CoT/no-CoT setting; and (d) an explicit indicator of whether the protocol was held identical across systems. Where we quoted vendor-reported numbers (because re-running was not possible or not faithful, e.g. tool-use harnesses we do not control) we will mark this and avoid framing those rows as parity evidence. We will also soften abstract language from 'comparable quality' to a more precise statement keyed to the matrix, and flag that the residual gaps are within the range that prompt/decoding choices alone can move. We acknowledge that some transparency limits are intrinsic: we cannot disclose the internals of closed comparators, only our calling conditions. revision: yes

  2. Referee: n-gram decontamination misses paraphrases/translations/solutions in web text; an independent contamination probe is needed (post-cutoff variants, perplexity-gap, membership-inference) or the parity framing should be qualified.

    Authors: This is a fair point. Our released decontamination procedure is indeed n-gram-based and we agree it does not bound paraphrase or translated leakage. In revision we will add at least two probes: (i) evaluation on post-training-cutoff held-out variants — we will report results on benchmarks released after our data cutoff (e.g. recent contest-math and code competitions, post-cutoff GPQA-style items, and freshly authored multilingual items) and contrast with the headline numbers; and (ii) a perplexity-gap analysis comparing model NLL on benchmark items vs. matched controls drawn from the same source distribution but not in any benchmark. Where the gap is non-trivial we will flag the affected benchmarks and weaken the parity framing for those specific tasks rather than the overall claim. A full membership-inference study at 405B is more involved; we will scope what is feasible and report it, and otherwise will explicitly qualify the framing as the referee suggests. revision: yes

  3. Referee: For the unreleased multimodal models the bar on evaluation transparency is higher: specify baselines, checkpoints, protocols, and which numbers are quoted vs. re-run.

    Authors: We accept this. The multimodal sections will be revised so that every reported comparison lists: the baseline model and exact checkpoint/version, whether the number is taken from the original publication or re-run by us, and the evaluation protocol (prompt, decoding, frame-sampling for video, audio preprocessing for speech, scoring script). We will also add a per-task table separating 're-run by us under matched protocol' from 'quoted from prior work' rows, mirroring the language-model matrix described above. We will additionally weaken 'competitively with the state-of-the-art' in the abstract to a task-conditional statement, since the unreleased status of the multimodal models means readers cannot independently verify these numbers and we should not lean on them as if they could. revision: yes

  4. Referee: State which elements are intended as scientific contributions versus engineering/release documentation, so reviewers can identify which claims are defended on methodological grounds.

    Authors: We agree this clarification will help readers. We will add a short 'Scope of contributions' subsection in the introduction that explicitly classifies the components. Our intended scientific contributions are: the scaling-law analysis used to choose the 405B compute/data point and its predictive validation; the post-training recipe (rejection sampling + SFT + DPO iteration) and its ablations; the compositional multimodal recipe; and the Llama Guard 3 taxonomy and classifier design. The remaining material — infrastructure, parallelism, data-pipeline engineering, and the benchmark suite itself — is release/engineering documentation supporting reproducibility of the released weights, and we will label it as such rather than as methodological claims to be defended. Headline benchmark numbers are evidence about the released artifact, not standalone scientific claims, and we will frame them that way. revision: yes
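The perplexity-gap probe promised in response 2 can be sketched as a simple permutation test on per-item negative log-likelihoods. The NLL values below are invented for illustration; a real probe would score both benchmark items and matched controls with the model under audit.

```python
# Sketch of a perplexity-gap contamination probe (rebuttal response 2).
# All numbers are made up; only the test's shape is the point.

import random
import statistics

def perplexity_gap_p_value(bench_nll, control_nll, iters=2000, seed=0):
    """One-sided permutation test: is mean NLL on benchmark items suspiciously
    lower (more memorized-looking) than on matched controls?"""
    rng = random.Random(seed)
    observed = statistics.mean(control_nll) - statistics.mean(bench_nll)
    pooled = list(bench_nll) + list(control_nll)
    k, hits = len(bench_nll), 0
    for _ in range(iters):
        rng.shuffle(pooled)
        gap = statistics.mean(pooled[k:]) - statistics.mean(pooled[:k])
        if gap >= observed:
            hits += 1
    return hits / iters

bench_nll   = [2.1, 2.0, 1.9, 2.2, 2.0, 1.8]   # hypothetical: benchmark items
control_nll = [2.6, 2.5, 2.7, 2.4, 2.8, 2.5]   # hypothetical: matched controls
p = perplexity_gap_p_value(bench_nll, control_nll)
# A small p flags the benchmark as possibly leaked into pretraining.
```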

standing simulated objections not resolved
  • Full transparency on the comparator side of the GPT-4 parity claim is intrinsically bounded: we can disclose our API snapshot, prompts, and decoding settings, but not the closed model's internals, version drift between snapshots, or any server-side prompt processing. We will document our side completely and qualify the parity claim accordingly, but cannot eliminate this asymmetry.
  • A complete membership-inference contamination study at 405B scale across all headline benchmarks may exceed what we can include in revision; we will report what is feasible and qualify the remainder rather than over-claim coverage.

Circularity Check

0 steps flagged

No significant circularity: Llama 3 is an empirical engineering report whose central claim is benchmarked against external systems and externally reproducible weights, not a self-derivation.

full rationale

The paper's load-bearing claim — that the 405B dense Transformer with 128K context delivers "comparable quality to leading language models such as GPT-4 on a plethora of tasks," and that compositional image/video/speech encoders are competitive with SOTA — is a relative empirical claim against external systems on external benchmarks (MMLU, GSM8K, MATH, HumanEval, MBPP, etc.). It is not derived from a chain of equations, fitted parameters, or a uniqueness theorem; there is therefore no structural way for the conclusion to reduce to its own inputs by definition.

The artifact itself (open weights, Llama Guard 3) is externally reproducible: any third party can rerun the released model and check the numbers, which satisfies the "code-reproduced / externally falsifiable" exception to circularity. Self-citation to prior Llama work is bibliographic rather than load-bearing for the parity claim.

The genuine concerns flagged by the reader — benchmark contamination given undisclosed pretraining data, asymmetric evaluation harnesses against a closed GPT-4 API, n-gram-only decontamination missing paraphrases — are real, but they are measurement-validity / correctness-risk issues, not circularity in the technical sense used here. They would show up as "the benchmark numbers may not measure what the paper says they measure," not as "the prediction equals the input by construction." Per the analyzer's hard rule #5, "this is not standard consensus" or "the eval protocol is suspect" belongs under correctness risk, not circularity.

Only the abstract is available as text, so a thorough section-by-section walk is not possible, but nothing in the abstract describes a derivation step that fits the seven circularity patterns. Score: 1, reflecting routine self-citation to prior Llama models without any load-bearing reduction.

Axiom & Free-Parameter Ledger

4 free parameters · 3 axioms · 1 invented entity

Llama 3 introduces no new physical or theoretical entities. The 'ledger' is dominated by engineering free parameters (scale, context, data mixture, post-training settings) plus standard benchmark-validity assumptions. The single new released artifact beyond the LLMs themselves is Llama Guard 3, which is empirically falsifiable because its weights are public.

free parameters (4)
  • Model scale (8B, 70B, 405B parameters) = 405B flagship
    Chosen via internal scaling-law experiments described in the report; not derived from theory.
  • Context length = 128K tokens
    Engineering choice based on long-context training and evaluation.
  • Data mixture weights across domains and languages = not disclosed at document level
    Tuned empirically; central to the parity claim but not externally auditable.
  • Post-training hyperparameters (SFT/preference optimization) = internal
    Tuned against internal eval suites.
axioms (3)
  • domain assumption Public benchmarks validly measure the capabilities they name (MMLU, HumanEval, GSM8K, MATH, multilingual, ASR, etc.).
    The parity claim is benchmark-mediated.
  • domain assumption Pretraining data is adequately decontaminated against evaluation sets.
    Required to interpret benchmark parity as capability parity.
  • domain assumption Compositional multimodality (frozen-ish encoders + adapters + LLM) is a fair comparator to end-to-end multimodal systems on the chosen tasks.
    Underlies the 'competitive with state-of-the-art' multimodal claim.
invented entities (1)
  • Llama Guard 3 independent evidence
    purpose: Classifier for unsafe model inputs and outputs, released alongside the language models.
    Released with weights, so its behavior is directly testable by third parties; this is a new artifact rather than an unfalsifiable postulate.

pith-pipeline@v0.9.0 · 14829 in / 5495 out tokens · 89954 ms · 2026-05-08T21:44:46.085270+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Large Language Models Lack Temporal Awareness of Medical Knowledge

    cs.LG 2026-05 unverdicted novelty 8.0

    LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.

  2. Inference-Time Machine Unlearning via Gated Activation Redirection

    cs.LG 2026-05 conditional novelty 8.0

    GUARD-IT performs machine unlearning in LLMs via inference-time gated activation redirection, matching or exceeding gradient-based baselines on TOFU and MUSE while preserving utility and working under quantization.

  3. Pretraining Exposure Explains Popularity Judgments in Large Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.

  4. Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

    cs.LG 2026-05 accept novelty 8.0

    Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

  5. SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 8.0

    SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

  6. Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

    cs.CR 2026-05 unverdicted novelty 8.0

    Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this pro...

  7. Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

    cs.AI 2026-05 unverdicted novelty 8.0

    Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

  8. OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

    cs.LG 2026-05 unverdicted novelty 8.0

    OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...

  9. MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

    cs.LG 2026-05 unverdicted novelty 8.0

    MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

  10. Narrow Secret Loyalty Dodges Black-Box Audits

    cs.CR 2026-05 unverdicted novelty 8.0

    Narrow secret loyalties implanted via fine-tuning in LLMs at multiple scales evade black-box audits unless the auditor knows the target principal.

  11. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  12. DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

    cs.LG 2026-05 conditional novelty 8.0

    INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.

  13. Efficient Training on Multiple Consumer GPUs with RoundPipe

    cs.DC 2026-04 conditional novelty 8.0

    RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...

  14. Architecture Determines Observability of Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.

  15. HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

    cs.SD 2026-04 unverdicted novelty 8.0

    HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...

  16. ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

    cs.CL 2026-04 unverdicted novelty 8.0

    ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

  17. Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

    cs.CL 2026-04 conditional novelty 8.0

    Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

  18. Backdoor Attacks on Decentralised Post-Training

    cs.CR 2026-03 conditional novelty 8.0

    An adversary controlling an intermediate pipeline stage in decentralized LLM post-training can inject a backdoor that reduces alignment from 80% to 6%, with the backdoor persisting in 60% of cases even after subsequen...

  19. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  20. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  21. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    cs.AI 2024-08 unverdicted novelty 8.0

    The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.

  22. Inducing Artificial Uncertainty in Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

  23. LongBEL: Long-Context and Document-Consistent Biomedical Entity Linking

    cs.CL 2026-05 unverdicted novelty 7.0

    LongBEL improves biomedical entity linking consistency by combining full-document context with memory of previous predictions trained via cross-validation rather than gold labels.

  24. IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

    cs.CL 2026-05 unverdicted novelty 7.0

    A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.

  25. Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

    cs.LG 2026-05 conditional novelty 7.0

    ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...

  26. From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

    cs.LG 2026-05 conditional novelty 7.0

    AutoSelection discovers data recipes from a 90K instruction pool that outperform full-data training and other selectors on reasoning tasks for SFT across multiple models.

  27. Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization

    math.OC 2026-05 conditional novelty 7.0

    Symmetries in next-token prediction targets induce corresponding geometric symmetries such as circulant matrices and equiangular tight frames in the optimal weights and embeddings of a layer-peeled LLM surrogate model.

  28. Simulating Students or Sycophantic Problem Solving? On Misconception Faithfulness of LLM Simulators

    cs.CL 2026-05 conditional novelty 7.0

    LLM simulators exhibit near-zero selective response to targeted misconception feedback and behave sycophantically, but SFT and SFS-aligned RL improve this property.

  29. Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  30. BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts

    cs.AI 2026-05 conditional novelty 7.0

    BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.

  31. From Noise to Diversity: Random Embedding Injection in LLM Reasoning

    cs.AI 2026-05 conditional novelty 7.0

    Random Soft Prompts (RSPs) sampled from the embedding distribution improve Pass@N on reasoning benchmarks by increasing early-stage token diversity without any training.

  32. Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    LoRA adapters fix collapsed visual CLS token attention in CLIP for superior cross-domain few-shot learning, and the new Semantic Probe framework revives prompt methods to reach state-of-the-art on four benchmarks.

  33. Much of Geospatial Web Search Is Beyond Traditional GIS

    cs.IR 2026-05 unverdicted novelty 7.0

    Analysis of 1.01 million unfiltered Bing queries identifies 18% as geospatial, dominated by transactional categories like costs (15.3%) that exceed traditional GIS scope.

  34. Count Anything at Any Granularity

    cs.CV 2026-05 unverdicted novelty 7.0

    Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...

  35. Uniform Scaling Limits in AdamW-Trained Transformers

    stat.ML 2026-05 unverdicted novelty 7.0

    AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H ...

  36. SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

    cs.LG 2026-05 unverdicted novelty 7.0

    SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.

  37. ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    ConQuR is a post-training rotation calibration technique that aligns activations to hypercube corners via Procrustes optimization and online updates, delivering competitive LLM quantization performance without end-to-...

  38. MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    MASS-DPO derives a Plackett-Luce-specific log-determinant Fisher information objective to select non-redundant negative samples, matching or exceeding multi-negative DPO performance with substantially fewer negatives ...

  39. Locking Pretrained Weights via Deep Low-Rank Residual Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via mo...

  40. BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization

    cs.LG 2026-05 unverdicted novelty 7.0

    BCJR-QAT makes trellis quantization differentiable via BCJR soft decoding at finite temperature, allowing QAT to improve 2-bit LLM perplexity over PTQ with a fused GPU kernel and a drift-budget escape condition.

  41. Infinite Mask Diffusion for Few-Step Distillation

    cs.CL 2026-05 unverdicted novelty 7.0

    Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.

  42. Privacy-preserving Chunk Scheduling in a BitTorrent Implementation of Federated Learning

    cs.DC 2026-05 unverdicted novelty 7.0

    FLTorrent achieves within-round source unlinkability in decentralized federated learning via a BitTorrent warm-up with pre-round obfuscation, randomized lags, and coordination-only non-owner-first scheduling, reaching...

  43. SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

    cs.LG 2026-05 unverdicted novelty 7.0

    SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

  44. Unsupervised Process Reward Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

  45. GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation

    cs.SI 2026-05 unverdicted novelty 7.0

    GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...

  46. HapticLDM: A Diffusion Model for Text-to-Vibrotactile Generation

    cs.HC 2026-05 unverdicted novelty 7.0

    HapticLDM is the first latent diffusion model that generates vibrotactile signals directly from text, using dynamic text curation and global denoising to improve realism and semantic alignment over autoregressive baselines.

  47. From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

  48. Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

    cs.AI 2026-05 unverdicted novelty 7.0

    Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.

  49. Pretraining large language models with MXFP4 on Native FP4 Hardware

    cs.LG 2026-05 unverdicted novelty 7.0

    Weight-gradient quantization drives most convergence problems in MXFP4 pretraining of Llama 3.1-8B; deterministic Hadamard rotations stabilize training by correcting structured micro-scaling errors.

  50. K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

    cs.CL 2026-05 conditional novelty 7.0

    K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.

  51. EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

    cs.AI 2026-05 unverdicted novelty 7.0

    EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.

  52. EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

    cs.AI 2026-05 conditional novelty 7.0

    EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.

  53. From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

    cs.DC 2026-05 unverdicted novelty 7.0

    Production logs from a 504-GPU LLM training cluster show 100% failure detection via multi-metric analysis, NFS saturation limiting bandwidth to 1.4-10.4% of link speed, and auto-retry achieving 33.3% success versus 12...

  54. Test-Time Speculation

    cs.CL 2026-05 unverdicted novelty 7.0

    Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.

  55. Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success

    cs.CR 2026-05 accept novelty 7.0

    Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.

  56. Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration

    cs.LG 2026-05 unverdicted novelty 7.0

    Partial orthogonalization from power iteration accelerates zeroth-order Muon by 1.5x-4x on LLM fine-tuning tasks while maintaining competitive accuracy.

  57. Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

    cs.CR 2026-05 unverdicted novelty 7.0

    Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utili...

  58. AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.

  59. DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

  60. CDS4RAG: Cyclic Dual-Sequential Hyperparameter Optimization for RAG

    cs.LG 2026-05 unverdicted novelty 7.0

    CDS4RAG cyclically optimizes full RAG hyperparameters by distinguishing and alternating between retriever and generator components, boosting performance up to 1.54x over prior methods on benchmarks.