The Llama 3 Herd of Models
Recognition: 3 theorem links · Lean Theorem
Pith reviewed 2026-05-08 21:44 UTC · model claude-opus-4-7
The pith
A 405B-parameter dense Transformer with a 128K context matches GPT-4-class quality across language, code, reasoning, and tool use, and reaches competitive multimodal performance by attaching image, video, and speech encoders rather than training a single fused multimodal model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A 405-billion-parameter dense Transformer with a 128K-token context window, trained and post-trained at scale, reaches quality comparable to the strongest closed language models on a wide span of tasks: multilingual text, code, mathematical and general reasoning, and tool use. The paper further argues that you do not need a single end-to-end multimodal model to be competitive on images, video, and speech: bolting modality-specific encoders onto the frozen-or-lightly-adapted language model — a compositional rather than fused design — already lands near state-of-the-art on standard benchmarks for those modalities.
What carries the argument
A 405B dense Transformer scaled with disciplined data, long-context (128K) training, and a multi-stage post-training stack (SFT + preference optimization + safety tuning), combined with a compositional multimodal recipe in which separately trained image, video, and speech encoders are attached via adapters to the language model rather than co-trained from scratch.
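The compositional recipe described above can be made concrete with a short sketch. This is a minimal illustration assuming a PyTorch-style interface; the module shape (learned query pooling from a frozen encoder into the LM's embedding space) is one common way to realize "encoders attached via adapters," not the paper's exact design, and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Trainable bridge from a frozen modality encoder into a frozen LM's
    embedding space. Names and sizes are illustrative, not the paper's."""
    def __init__(self, enc_dim: int, lm_dim: int, n_tokens: int = 64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )
        # Learned queries pool a variable-length encoder output into a
        # fixed number of "modality tokens" the LM can consume.
        self.queries = nn.Parameter(torch.randn(n_tokens, lm_dim) * 0.02)
        self.attn = nn.MultiheadAttention(lm_dim, num_heads=8, batch_first=True)

    def forward(self, enc_feats: torch.Tensor) -> torch.Tensor:
        # enc_feats: (batch, seq, enc_dim) from the frozen encoder
        kv = self.proj(enc_feats)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        pooled, _ = self.attn(q, kv, kv)
        return pooled  # (batch, n_tokens, lm_dim), prepended to text embeddings

# Training sketch: only the adapter receives gradients; the encoder and
# the language model stay frozen (or lightly adapted):
# for p in image_encoder.parameters(): p.requires_grad_(False)
# for p in language_model.parameters(): p.requires_grad_(False)
```

The design choice the section highlights is visible in the last commented lines: gradients flow only through the adapter, so an encoder can be swapped or upgraded without touching the 405B language core.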
If this is right
- An openly released 405B model with 128K context lets outside groups reproduce, audit, and red-team frontier-scale behavior, including contamination checks the paper itself cannot fully rule out.
- If a plain dense Transformer at this scale really matches mixture-of-experts and other architectural variants used by competitors, the marginal value of architectural novelty over data and post-training is smaller than commonly assumed.
- The compositional multimodal result implies that strong vision, video, and speech capabilities can be added to a text-only backbone through adapters, without co-training a fused model from scratch.
Where Pith is reading between the lines
- Editorial inference: the parity-with-GPT-4 framing is partly a claim about ceilings — that a dense, well-curated recipe at 405B is near a plateau where further gains from scale-and-data alone are sublinear, and that the next frontier is post-training, tools, and modalities rather than parameter count.
- Editorial inference: the compositional multimodal choice is also a hedging strategy — encoders can be swapped or upgraded without retraining the language core, which matters more for a release pipeline than for a single benchmark number.
- Editorial inference: open-weighting a 405B model effectively externalizes evaluation, shifting the contamination audits and red-teaming the paper cannot fully perform onto the third parties the release enables.
Load-bearing premise
That the public benchmark scores used to claim parity with the strongest closed models actually reflect general capability, rather than overlap between evaluation sets and the (undisclosed) pretraining corpus or evaluation choices that flatter the released model.
What would settle it
A controlled head-to-head evaluation on tasks constructed after Llama 3's training cutoff and verified to be absent from its training data — covering multilingual reasoning, code, math, and long-context retrieval — in which the 405B model trails the named frontier models by a wide margin would directly undermine the parity claim.
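A minimal sketch of what that settling experiment looks like operationally, assuming item-level release dates are available; the types, cutoff date, and function names below are hypothetical scaffolding, not an existing harness.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class EvalItem:
    prompt: str
    answer: str
    released: date  # date the item was first published

TRAINING_CUTOFF = date(2023, 12, 1)  # placeholder, not the paper's actual cutoff

def post_cutoff_slice(items: list[EvalItem]) -> list[EvalItem]:
    """Keep only items that could not have appeared in pretraining."""
    return [it for it in items if it.released > TRAINING_CUTOFF]

def head_to_head(items, model_a, model_b, score) -> tuple[float, float]:
    """Run both models under one identical prompt/decoding protocol.
    model_a / model_b are callables prompt -> completion; score grades
    a completion against the reference answer."""
    fresh = post_cutoff_slice(items)
    acc_a = sum(score(model_a(it.prompt), it.answer) for it in fresh) / len(fresh)
    acc_b = sum(score(model_b(it.prompt), it.answer) for it in fresh) / len(fresh)
    return acc_a, acc_b
```

The point of the construction is that a wide gap on the post-cutoff slice cannot be explained away by contamination, which is what makes it a direct test of the parity claim.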
Original abstract
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the Llama 3 family of foundation models, headlined by a dense 405B-parameter Transformer with a 128K-token context window, alongside 8B and 70B variants. The authors describe pretraining, post-training (SFT and preference optimization), a safety model (Llama Guard 3), and a compositional approach for adding image, video, and speech encoders to the language backbone. The empirical claim is that Llama 3 reaches quality "comparable to leading language models such as GPT-4" across multilingual text, code, reasoning, and tool-use benchmarks, and that the multimodal extensions are competitive with the state of the art on their respective tasks. Pretrained and post-trained 405B weights and Llama Guard 3 are released; the multimodal extensions are not.
Significance. If the parity claim holds, this is a significant artifact contribution: an openly released 405B dense model with a 128K context window narrows the gap between open-weights and closed frontier models and enables third-party scientific work (interpretability, fine-tuning, contamination audits, red-teaming) that is impossible on closed APIs. The release of Llama Guard 3 as a separate safety classifier and the description of a working compositional multimodal pipeline are independently useful. Three strengths should be credited explicitly: (i) the weights themselves are released, so the headline inference claims are independently verifiable in a way that closed-model papers are not; (ii) the scope of the empirical evaluation is unusually broad; (iii) the compositional rather than end-to-end multimodal recipe is a concrete, reproducible design choice. The weakness, addressed below, is that the relative claim against GPT-4 is the load-bearing scientific assertion and is the part hardest to verify from the paper alone.
major comments (4)
- [Abstract / Evaluation sections] The central comparative claim — 'comparable quality to leading language models such as GPT-4' — is a relative claim against a closed, moving target. The manuscript should make explicit, in one place, for every headline benchmark: (a) whether the GPT-4 number was re-run by the authors or quoted, (b) the API snapshot/date, (c) prompt template, system message, few-shot exemplars, decoding parameters, and CoT policy used for both models, and (d) whether these were held identical across systems. Several percentage points on MMLU/GSM8K/MATH/HumanEval/MBPP can be moved by these choices alone, and the claimed parity gaps are of that order. Without this matrix the parity claim cannot be audited. (A minimal sketch of one such matrix row appears after this list.)
- [Pretraining data / decontamination] The decontamination methodology (typically n-gram overlap with eval sets in releases of this kind) catches near-duplicates but not paraphrases, translations, or solutions discussed in web text. Because the pretraining corpus is not disclosed at document level, an independent contamination probe is needed to support the headline benchmark numbers: e.g., performance on freshly constructed or post-cutoff held-out variants, perplexity gap between benchmark items and matched controls, or membership-inference-style tests on benchmark instances. Please add at least one such probe, or qualify the parity framing accordingly.
- [Multimodal experiments] The abstract states the compositional image/video/speech approach 'performs competitively with the state-of-the-art,' but the corresponding models are 'not yet being broadly released.' For a non-released system the burden on evaluation transparency is higher, not lower: please ensure the multimodal sections specify exactly which baselines, checkpoints, and protocols are compared, and which numbers are taken from prior work versus re-run.
- [Scope of contribution] It would help the reader if the manuscript stated which elements are intended as scientific contributions (e.g., scaling-law analyses, post-training recipe ablations, the compositional multimodal recipe, Llama Guard 3 design) versus engineering/release documentation. As written, the paper mixes both, and reviewers cannot easily identify which claims are meant to be defended on methodological grounds.
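As flagged in the first major comment, here is a minimal sketch of what one row of the requested protocol matrix would need to record, assuming a Python-style schema; every field name and example value is illustrative, not a number or identifier taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProtocolRow:
    """One row of the per-benchmark evaluation-protocol matrix.
    Field names mirror items (a)-(d) in the comment above."""
    benchmark: str
    rerun_by_authors: bool              # (a) re-run vs. quoted
    comparator_snapshot: Optional[str]  # (b) API model id + call-date window
    prompt_template: str                # (c) prompt and system message
    n_shot: int
    temperature: float
    top_p: float
    max_tokens: int
    cot: bool                           # chain-of-thought allowed?
    protocol_matched: bool              # (d) identical across systems?

# Hypothetical example row; all values are placeholders.
row = ProtocolRow(
    benchmark="MMLU",
    rerun_by_authors=True,
    comparator_snapshot="gpt-4-YYYY-MM-DD (calls made YYYY-MM)",
    prompt_template="<system>...</system> answer-letter-only",
    n_shot=5,
    temperature=0.0,
    top_p=1.0,
    max_tokens=32,
    cot=False,
    protocol_matched=True,
)
```

Holding fields (c) and (d) fixed across systems is what turns a leaderboard comparison into an auditable parity claim.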
minor comments (4)
- [Abstract] 'comparable quality to leading language models such as GPT-4' would be more precise as a quantified statement (e.g., 'within X points on benchmark suite Y under matched protocol Z'). The current phrasing invites overreading.
- [Abstract] Clarify what 'compositional approach' means at the abstract level (frozen LM + trained adapter + modality encoder, or otherwise), since this is the multimodal design contribution being claimed.
- [Release] Specify in the abstract or introduction the license under which Llama 3 and Llama Guard 3 are released, as this materially affects the artifact's scientific value (third-party reproducibility, contamination audits, fine-tuning studies).
- [Terminology] The phrase 'natively support multilinguality, coding, reasoning, and tool usage' conflates capability with training emphasis; consider rewording to indicate that these are explicitly targeted in the data mixture and post-training, not architectural features.
Simulated Author's Rebuttal
We thank the referee for a careful and constructive report, and in particular for crediting the open release of weights as the mechanism by which our headline inference claims can be independently audited. We agree with the central thrust of the major comments: the load-bearing scientific assertion in the manuscript is the parity claim against closed frontier models, and that claim deserves a more explicit evaluation-protocol matrix and an independent contamination probe than the current draft provides. We will revise accordingly. Below we respond point by point, indicate where the manuscript will be amended, and note one item (closed-model evaluation transparency) where our ability to comply is intrinsically limited by the closed nature of the comparator.
Point-by-point responses
- Referee: The 'comparable to GPT-4' claim needs an explicit per-benchmark matrix: re-run vs. quoted, API snapshot/date, prompt template, system message, few-shot exemplars, decoding parameters, CoT policy, and whether these were held identical across systems.
Authors: We agree and will add such a matrix as an appendix table covering every headline benchmark in the main text (MMLU, MMLU-Pro, GSM8K, MATH, HumanEval, MBPP, GPQA, IFEval, multilingual and tool-use suites). For each row we will state: (a) re-run by us vs. quoted from the source paper/leaderboard; (b) for re-runs of GPT-4/GPT-4o/Claude/Gemini, the exact API model identifier and the date window in which calls were made; (c) the full prompt, system message, k-shot exemplars, temperature/top-p/max-tokens, and CoT/no-CoT setting; and (d) an explicit indicator of whether the protocol was held identical across systems. Where we quoted vendor-reported numbers (because re-running was not possible or not faithful, e.g. tool-use harnesses we do not control) we will mark this and avoid framing those rows as parity evidence. We will also soften abstract language from 'comparable quality' to a more precise statement keyed to the matrix, and flag that the residual gaps are within the range that prompt/decoding choices alone can move. We acknowledge that some transparency limits are intrinsic: we cannot disclose the internals of closed comparators, only our calling conditions. revision: yes
- Referee: n-gram decontamination misses paraphrases/translations/solutions in web text; an independent contamination probe is needed (post-cutoff variants, perplexity-gap, membership-inference) or the parity framing should be qualified.
Authors: This is a fair point. Our released decontamination procedure is indeed n-gram-based and we agree it does not bound paraphrase or translated leakage. In revision we will add at least two probes: (i) evaluation on post-training-cutoff held-out variants — we will report results on benchmarks released after our data cutoff (e.g. recent contest-math and code competitions, post-cutoff GPQA-style items, and freshly authored multilingual items) and contrast with the headline numbers; and (ii) a perplexity-gap analysis comparing model NLL on benchmark items vs. matched controls drawn from the same source distribution but not in any benchmark (a minimal sketch of probe (ii) appears after this list). Where the gap is non-trivial we will flag the affected benchmarks and weaken the parity framing for those specific tasks rather than the overall claim. A full membership-inference study at 405B is more involved; we will scope what is feasible and report it, and otherwise will explicitly qualify the framing as the referee suggests. revision: yes
- Referee: For the unreleased multimodal models the bar on evaluation transparency is higher: specify baselines, checkpoints, protocols, and which numbers are quoted vs. re-run.
Authors: We accept this. The multimodal sections will be revised so that every reported comparison lists: the baseline model and exact checkpoint/version, whether the number is taken from the original publication or re-run by us, and the evaluation protocol (prompt, decoding, frame-sampling for video, audio preprocessing for speech, scoring script). We will also add a per-task table separating 're-run by us under matched protocol' from 'quoted from prior work' rows, mirroring the language-model matrix described above. We will additionally weaken 'competitively with the state-of-the-art' in the abstract to a task-conditional statement, since the unreleased status of the multimodal models means readers cannot independently verify these numbers and we should not lean on them as if they could. revision: yes
- Referee: State which elements are intended as scientific contributions versus engineering/release documentation, so reviewers can identify which claims are defended on methodological grounds.
Authors: We agree this clarification will help readers. We will add a short 'Scope of contributions' subsection in the introduction that explicitly classifies the components. Our intended scientific contributions are: the scaling-law analysis used to choose the 405B compute/data point and its predictive validation; the post-training recipe (rejection sampling + SFT + DPO iteration) and its ablations; the compositional multimodal recipe; and the Llama Guard 3 taxonomy and classifier design. The remaining material — infrastructure, parallelism, data-pipeline engineering, and the benchmark suite itself — is release/engineering documentation supporting reproducibility of the released weights, and we will label it as such rather than as methodological claims to be defended. Headline benchmark numbers are evidence about the released artifact, not standalone scientific claims, and we will frame them that way. revision: yes
remaining limitations (2)
- Full transparency on the comparator side of the GPT-4 parity claim is intrinsically bounded: we can disclose our API snapshot, prompts, and decoding settings, but not the closed model's internals, version drift between snapshots, or any server-side prompt processing. We will document our side completely and qualify the parity claim accordingly, but cannot eliminate this asymmetry.
- A complete membership-inference contamination study at 405B scale across all headline benchmarks may exceed what we can include in revision; we will report what is feasible and qualify the remainder rather than over-claim coverage.
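As referenced in the contamination response above, here is a minimal sketch of probe (ii), the perplexity-gap analysis. It assumes a hypothetical `model.token_nll` scoring call; nothing here is the authors' actual pipeline, and the framework-specific scoring call would need to be substituted.

```python
import math

def mean_nll(model, texts) -> float:
    """Average per-token negative log-likelihood under the model.
    model.token_nll(text) is a hypothetical call returning (summed NLL,
    token count); replace with your framework's scoring equivalent."""
    total_nll, total_tokens = 0.0, 0
    for t in texts:
        nll, n = model.token_nll(t)
        total_nll += nll
        total_tokens += n
    return total_nll / total_tokens

def perplexity_gap(model, benchmark_items, matched_controls) -> float:
    """Probe (ii): if the model has memorized benchmark items, its
    perplexity on them will sit below its perplexity on matched controls
    drawn from the same source distribution but absent from any benchmark."""
    ppl_bench = math.exp(mean_nll(model, benchmark_items))
    ppl_ctrl = math.exp(mean_nll(model, matched_controls))
    return ppl_ctrl - ppl_bench  # large positive gap => contamination signal
```

The control set is the load-bearing part of this probe: it must match the benchmark items in source, length, and style, so that any residual gap is attributable to eval-set exposure rather than domain familiarity.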
Circularity Check
No significant circularity: Llama 3 is an empirical engineering report whose central claim is benchmarked against external systems and externally reproducible weights, not a self-derivation.
Full rationale
The paper's load-bearing claim — that the 405B dense Transformer with 128K context delivers "comparable quality to leading language models such as GPT-4 on a plethora of tasks," and that compositional image/video/speech encoders are competitive with SOTA — is a relative empirical claim against external systems on external benchmarks (MMLU, GSM8K, MATH, HumanEval, MBPP, etc.). It is not derived from a chain of equations, fitted parameters, or a uniqueness theorem; there is therefore no structural way for the conclusion to reduce to its own inputs by definition.

The artifact itself (open weights, Llama Guard 3) is externally reproducible: any third party can rerun the released model and check the numbers, which satisfies the "code-reproduced / externally falsifiable" exception to circularity. Self-citation to prior Llama work is bibliographic rather than load-bearing for the parity claim.

The genuine concerns flagged by the reader — benchmark contamination given undisclosed pretraining data, asymmetric evaluation harnesses against a closed GPT-4 API, n-gram-only decontamination missing paraphrases — are real, but they are measurement-validity / correctness-risk issues, not circularity in the technical sense used here. They would show up as "the benchmark numbers may not measure what the paper says they measure," not as "the prediction equals the input by construction." Per the analyzer's hard rule #5, "this is not standard consensus" or "the eval protocol is suspect" belongs under correctness risk, not circularity.

Only the abstract text is available, so a thorough section-by-section walk is not possible, but nothing in the abstract describes a derivation step that fits the seven circularity patterns. Score: 1, reflecting routine self-citation to prior Llama models without any load-bearing reduction.
Axiom & Free-Parameter Ledger
free parameters (4)
- Model scale (8B, 70B, 405B parameters) = 405B flagship
- Context length = 128K tokens
- Data mixture weights across domains and languages = not disclosed at document level
- Post-training hyperparameters (SFT/preference optimization) = internal
axioms (3)
- domain assumption Public benchmarks validly measure the capabilities they name (MMLU, HumanEval, GSM8K, MATH, multilingual, ASR, etc.).
- domain assumption Pretraining data is adequately decontaminated against evaluation sets.
- domain assumption Compositional multimodality (frozen-ish encoders + adapters + LLM) is a fair comparator to end-to-end multimodal systems on the chosen tasks.
invented entities (1)
- Llama Guard 3 (independent evidence)
Lean theorems connected to this paper
- Foundation.PhiForcing / Foundation.DimensionForcing (phi_equation; eight_tick_forces_D3): match unclear. Anchor text: "Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens."
Forward citations
Cited by 60 Pith papers
-
Large Language Models Lack Temporal Awareness of Medical Knowledge
LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
-
Inference-Time Machine Unlearning via Gated Activation Redirection
GUARD-IT performs machine unlearning in LLMs via inference-time gated activation redirection, matching or exceeding gradient-based baselines on TOFU and MUSE while preserving utility and working under quantization.
-
Pretraining Exposure Explains Popularity Judgments in Large Language Models
LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
-
Grid Games: The Power of Multiple Grids for Quantizing Large Language Models
Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.
-
SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
-
Acceptance Cards:A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims
Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this pro...
-
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
-
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents
OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...
-
MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs
MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.
-
Narrow Secret Loyalty Dodges Black-Box Audits
Narrow secret loyalties implanted via fine-tuning in LLMs at multiple scales evade black-box audits unless the auditor knows the target principal.
-
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
-
DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning
INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.
-
Efficient Training on Multiple Consumer GPUs with RoundPipe
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...
-
Architecture Determines Observability of Transformers
Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.
-
HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...
-
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
-
Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
-
Backdoor Attacks on Decentralised Post-Training
An adversary controlling an intermediate pipeline stage in decentralized LLM post-training can inject a backdoor that reduces alignment from 80% to 6%, with the backdoor persisting in 60% of cases even after subsequen...
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.
-
Inducing Artificial Uncertainty in Language Models
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
-
LongBEL: Long-Context and Document-Consistent Biomedical Entity Linking
LongBEL improves biomedical entity linking consistency by combining full-document context with memory of previous predictions trained via cross-validation rather than gold labels.
-
IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages
A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.
-
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...
-
From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning
AutoSelection discovers data recipes from a 90K instruction pool that outperform full-data training and other selectors on reasoning tasks for SFT across multiple models.
-
Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization
Symmetries in next-token prediction targets induce corresponding geometric symmetries such as circulant matrices and equiangular tight frames in the optimal weights and embeddings of a layer-peeled LLM surrogate model.
-
Simulating Students or Sycophantic Problem Solving? On Misconception Faithfulness of LLM Simulators
LLM simulators exhibit near-zero selective response to targeted misconception feedback and behave sycophantically, but SFT and SFS-aligned RL improve this property.
-
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
-
BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts
BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.
-
From Noise to Diversity: Random Embedding Injection in LLM Reasoning
Random Soft Prompts (RSPs) sampled from the embedding distribution improve Pass@N on reasoning benchmarks by increasing early-stage token diversity without any training.
-
Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning
LoRA adapters fix collapsed visual CLS token attention in CLIP for superior cross-domain few-shot learning, and the new Semantic Probe framework revives prompt methods to reach state-of-the-art on four benchmarks.
-
Much of Geospatial Web Search Is Beyond Traditional GIS
Analysis of 1.01 million unfiltered Bing queries identifies 18% as geospatial, dominated by transactional categories like costs (15.3%) that exceed traditional GIS scope.
-
Count Anything at Any Granularity
Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...
-
Uniform Scaling Limits in AdamW-Trained Transformers
AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H ...
-
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
-
ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs
ConQuR is a post-training rotation calibration technique that aligns activations to hypercube corners via Procrustes optimization and online updates, delivering competitive LLM quantization performance without end-to-...
-
MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization
MASS-DPO derives a Plackett-Luce-specific log-determinant Fisher information objective to select non-redundant negative samples, matching or exceeding multi-negative DPO performance with substantially fewer negatives ...
-
Locking Pretrained Weights via Deep Low-Rank Residual Distillation
DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via mo...
-
BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization
BCJR-QAT makes trellis quantization differentiable via BCJR soft decoding at finite temperature, allowing QAT to improve 2-bit LLM perplexity over PTQ with a fused GPU kernel and a drift-budget escape condition.
-
Infinite Mask Diffusion for Few-Step Distillation
Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.
-
Privacy-preserving Chunk Scheduling in a BitTorrent Implementation of Federated Learning
FLTorrent achieves within-round source unlinkability in decentralized federated learning via a BitTorrent warm-up with pre-round obfuscation, randomized lags, and coordination-only non-owner-first scheduling, reaching...
-
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
-
Unsupervised Process Reward Models
Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
-
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...
-
HapticLDM: A Diffusion Model for Text-to-Vibrotactile Generation
HapticLDM is the first latent diffusion model that generates vibrotactile signals directly from text, using dynamic text curation and global denoising to improve realism and semantic alignment over autoregressive baselines.
-
From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
-
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.
-
Pretraining large language models with MXFP4 on Native FP4 Hardware
Weight-gradient quantization drives most convergence problems in MXFP4 pretraining of Llama 3.1-8B; deterministic Hadamard rotations stabilize training by correcting structured micro-scaling errors.
-
K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs
K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.
-
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.
-
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
-
From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs
Production logs from a 504-GPU LLM training cluster show 100% failure detection via multi-metric analysis, NFS saturation limiting bandwidth to 1.4-10.4% of link speed, and auto-retry achieving 33.3% success versus 12...
-
Test-Time Speculation
Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.
-
Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success
Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.
-
Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration
Partial orthogonalization from power iteration accelerates zeroth-order Muon by 1.5x-4x on LLM fine-tuning tasks while maintaining competitive accuracy.
-
Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off
Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utili...
-
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.
-
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
-
CDS4RAG: Cyclic Dual-Sequential Hyperparameter Optimization for RAG
CDS4RAG cyclically optimizes full RAG hyperparameters by distinguishing and alternating between retriever and generator components, boosting performance up to 1.54x over prior methods on benchmarks.