
arxiv: 2302.13971 · v1 · submitted 2023-02-27 · 💻 cs.CL

LLaMA: Open and Efficient Foundation Language Models

Pith reviewed 2026-05-24 09:34 UTC · model grok-4.3

classification 💻 cs.CL
keywords foundation language models · public datasets · model scaling · benchmark performance · open release · GPT-3 comparison

The pith

LLaMA-13B, trained only on public data, outperforms GPT-3 (175B) on most benchmarks despite having far fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLaMA, a family of foundation language models ranging from 7B to 65B parameters. These models are trained on trillions of tokens drawn exclusively from publicly available datasets. The central result is that the 13B version surpasses GPT-3 (175B) on most benchmarks, while the 65B version is competitive with Chinchilla-70B and PaLM-540B. The authors release all models to the research community. The paper presents this as evidence that state-of-the-art performance is achievable without access to proprietary data collections.

Core claim

It is possible to train state-of-the-art foundation language models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.

What carries the argument

The LLaMA collection of foundation language models trained on public data mixtures to reach competitive benchmark scores.

Load-bearing premise

The evaluation benchmarks used are fair and representative measures of capability that do not favor models trained on the specific public data mixtures chosen by the authors.

What would settle it

A new held-out benchmark or evaluation set on which LLaMA-13B scores below GPT-3 or LLaMA-65B falls behind Chinchilla-70B.

Original abstract

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the LLaMA collection of foundation language models with 7B to 65B parameters, trained on trillions of tokens from publicly available datasets only. It reports that the 13B model outperforms GPT-3 (175B) on most benchmarks and the 65B model is competitive with Chinchilla-70B and PaLM-540B, and releases the models openly.

Significance. If the benchmark results hold after accounting for potential data issues, the work is significant in demonstrating that competitive large language models can be developed using exclusively public data, which has implications for accessibility and reproducibility in the field. The open release of models is a notable strength that enables community verification and extension.

major comments (2)
  1. [§2.1] §2.1 Pre-training data: The description of the data mixture (CommonCrawl, C4, GitHub, Wikipedia, books, ArXiv, StackExchange) does not report quantitative decontamination statistics or n-gram/document overlap measures with the test splits of the benchmarks used in §3; this directly bears on whether the headline claim that LLaMA-13B outperforms GPT-3 (175B) reflects genuine capability gains rather than memorization.
  2. [§3] §3 Main results, Tables 2–4: The reported benchmark scores lack accompanying error bars, details on the exact evaluation protocol (e.g., number of shots, prompt templates, decontamination steps per task), or multiple-run statistics, which undermines the robustness of the cross-model comparisons that form the central empirical claim.
minor comments (2)
  1. [Figure 1] Figure 1 and associated text: axis labels and legend entries for model-size scaling could be made more consistent across subplots for clarity.
  2. [Abstract] The abstract states performance claims without previewing the exact token count or data breakdown; while the body supplies these, a brief quantitative summary in the abstract would improve readability.
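
Major comment 1 asks for n-gram overlap measures between the training data and benchmark test splits. A minimal sketch of one such measure follows, assuming whitespace tokenization, lowercasing, and an arbitrary flagging threshold. The function names (ngrams, overlap_fraction, contamination_rate) and the defaults n=8 and threshold=0.5 are illustrative assumptions, not the procedure used in the paper.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, train_ngrams: Set[NGram], n: int) -> float:
    """Fraction of the test item's n-grams that also occur in the training set."""
    item = ngrams(test_item, n)
    if not item:
        return 0.0
    return len(item & train_ngrams) / len(item)


def contamination_rate(
    test_items: List[str],
    train_docs: Iterable[str],
    n: int = 8,             # assumed n-gram length
    threshold: float = 0.5  # assumed cutoff for flagging an item
) -> float:
    """Share of test items whose n-gram overlap with training data meets the threshold."""
    train: Set[NGram] = set()
    for doc in train_docs:
        train |= ngrams(doc, n)
    if not test_items:
        return 0.0
    flagged = sum(overlap_fraction(t, train, n) >= threshold for t in test_items)
    return flagged / len(test_items)
```

Reporting this rate per benchmark, together with scores on the unflagged subset, is one way to separate capability from memorization. An exact-match measure like this one misses paraphrased or reformatted test items.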

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comments point by point below, indicating where revisions have been made to the manuscript.

Point-by-point responses
  1. Referee: [§2.1] The description of the data mixture does not report quantitative decontamination statistics or n-gram/document overlap measures with the test splits of the benchmarks used in §3; this directly bears on whether the headline claim that LLaMA-13B outperforms GPT-3 (175B) reflects genuine capability gains rather than memorization.

    Authors: We agree that quantitative decontamination statistics would provide additional reassurance. The manuscript describes the removal of overlapping documents but does not include specific overlap measures. We have revised the text in §2.1 to provide more information on the decontamination process employed during data preparation. However, detailed n-gram overlap percentages were not computed or recorded at the time, so we cannot provide them. We believe the description of the method used sufficiently addresses the spirit of the comment. revision: partial

  2. Referee: [§3] The reported benchmark scores lack accompanying error bars, details on the exact evaluation protocol (e.g., number of shots, prompt templates, decontamination steps per task), or multiple-run statistics, which undermines the robustness of the cross-model comparisons that form the central empirical claim.

    Authors: We acknowledge the value of including more evaluation details. We have updated §3 with additional information on the evaluation protocol, specifying the number of shots and referencing the prompt templates from the original benchmark papers for each task. Decontamination is now explicitly linked to the procedure in §2.1. Error bars and multiple-run statistics are not provided because the models were evaluated in a single run, which is standard practice given the computational expense of large-scale training and inference. We have added a note in the revised manuscript acknowledging this limitation. revision: partial
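
Even with a single evaluation run, an interval on each benchmark score can be estimated from per-example results without retraining. The sketch below applies a percentile bootstrap to a list of 0/1 correctness flags. The function name bootstrap_accuracy_ci and its defaults are illustrative assumptions, not part of the paper's evaluation protocol.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_ci(
    correct: List[int],        # 1 if the model answered the example correctly, else 0
    n_resamples: int = 1000,   # assumed number of bootstrap resamples
    alpha: float = 0.05,       # 1 - alpha is the nominal coverage
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap over test examples."""
    if not correct:
        raise ValueError("need at least one scored example")
    rng = random.Random(seed)
    n = len(correct)
    point = sum(correct) / n

    # Resample the test set with replacement and recompute accuracy each time.
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper
```

This captures only sampling noise over test examples. It does not capture variation across training seeds or prompt templates, which would still require multiple runs.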

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark reporting

Full rationale

The paper presents an empirical study of training LLaMA models on public datasets and evaluating them on standard benchmarks. No derivation chain, equations, or first-principles predictions exist that could reduce to fitted inputs, self-definitions, or self-citation load-bearing steps. The central claims (e.g., LLaMA-13B outperforming GPT-3 on benchmarks) are direct performance measurements, not constructed outputs from the paper's own methods. This is a standard non-circular empirical report.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work rests on the standard transformer architecture and scaling behavior established in prior literature; no new mathematical axioms or invented entities are introduced. Training involves many unstated hyperparameters typical of large-scale language model training.

free parameters (2)
  • model parameter counts
    7B, 13B, 33B, 65B sizes chosen to explore scaling regime.
  • training token count
    1.0T tokens for the 7B and 13B models and 1.4T tokens for the 33B and 65B models, selected to reach competitive performance.
axioms (2)
  • standard math · Transformer architecture is sufficient for language modeling at these scales
    Invoked implicitly by training standard decoder-only transformers.
  • domain assumption · Publicly available datasets contain sufficient high-quality text for SOTA performance
    Central to the claim that proprietary data is unnecessary.

pith-pipeline@v0.9.0 · 5675 in / 1352 out tokens · 26828 ms · 2026-05-24T09:34:01.387284+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

    cs.CV 2026-05 unverdicted novelty 8.0

    VLMs fail to detect semantically different image swaps up to 60% of the time despite self-reflective statements, with thinking models more vulnerable and attention analysis showing self-reflection does not increase vi...

  2. Privacy Auditing with Zero (0) Training Run

    cs.CR 2026-05 unverdicted novelty 8.0

    Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.

  3. Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

    cs.LG 2026-05 unverdicted novelty 8.0

    Fragmentation strictly raises optimal finite-context log-loss on Markov sources while tokenization can make a short token window equivalent to a longer source window under reliability and compression conditions.

  4. Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

    cs.LG 2026-05 accept novelty 8.0

    Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

  5. Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...

  6. When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

    cs.LG 2026-05 unverdicted novelty 8.0

    SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

  7. Backdoor Attacks on Decentralised Post-Training

    cs.CR 2026-03 conditional novelty 8.0

    An adversary controlling an intermediate pipeline stage in decentralized LLM post-training can inject a backdoor that reduces alignment from 80% to 6%, with the backdoor persisting in 60% of cases even after subsequen...

  8. Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers

    cs.SE 2025-06 conditional novelty 8.0

    First study of 1,899 MCP servers finds eight distinct vulnerabilities (only three traditional), 7.2% with general issues, 5.5% with tool poisoning, and 66% with code smells, urging MCP-specific security practices.

  9. BEAVER: An Enterprise Benchmark for Text-to-SQL

    cs.CL 2024-09 unverdicted novelty 8.0

    BEAVER is the first text-to-SQL benchmark from private enterprise data warehouses, revealing SOTA agentic frameworks achieve only 10.8% accuracy on complex real-world queries.

  10. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    cs.CV 2024-08 conditional novelty 8.0

    MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

  11. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    cs.CR 2024-06 unverdicted novelty 8.0

    AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

  12. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

    cs.HC 2024-05 conditional novelty 8.0

    AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences acros...

  13. ORPO: Monolithic Preference Optimization without Reference Model

    cs.CL 2024-03 conditional novelty 8.0

    ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

  14. Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

    cs.IR 2024-03 unverdicted novelty 8.0

    BLaIR is a new benchmark and 570M-review dataset showing that LLM performance rankings on recommendation tasks have little correlation with rankings on general embedding benchmarks like MTEB.

  15. Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    cs.LG 2023-12 unverdicted novelty 8.0

    Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

  16. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  17. Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    cs.CL 2023-05 accept novelty 8.0

    Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

  18. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

    cs.CL 2023-04 conditional novelty 8.0

    API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

  19. Instruction Tuning with GPT-4

    cs.CL 2023-04 unverdicted novelty 8.0

    GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

  20. Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

    cs.AI 2026-05 unverdicted novelty 7.0

    Introduces Latent Adversarial Robustification and Rank-Constrained Subspace Learning to enable robust generalization in multimodal knowledge editing through adversarial subspace alignment.

  21. CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference

    cs.CR 2026-05 unverdicted novelty 7.0

    CachePrune enables fine-grained, token-level KV cache reuse across LLM requests by masking sensitive segments, eliminating direct side-channel leakage while cutting TTFT by 4.5x and raising hit rates by 44% versus pri...

  22. Brain-LLM Alignment Tracks Training Data, Not Typology

    cs.CL 2026-05 unverdicted novelty 7.0

    Training-language dominance, not English inherent properties, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic...

  23. A mathematical theory of balancing relational generalization and memorization

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces transitive inference with exceptions task and analytically shows kernel ridge regression balances relational generalization and memorization depending on representational geometry, with validation in finetu...

  24. ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    ST-SimDiff is a training-free method using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.

  25. Generative Conversational Recommender System

    cs.IR 2026-05 unverdicted novelty 7.0

    A single autoregressive model for conversational recommendation that uses semantic item IDs, predicts response intent and target first, then generates the response, reporting up to 29% Recall@1 gains.

  26. Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training

    cs.CV 2026-05 unverdicted novelty 7.0

    AutoScale is a closed-loop data engine using Graph-RAE for scene representation and Cluster-GA for importance-based retrieval to improve real-synthetic co-training for autonomous driving.

  27. On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective

    cs.LG 2026-05 unverdicted novelty 7.0

    Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.

  28. Provable Joint Decontamination for Benchmarking Multiple Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.

  29. Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

    cs.CV 2026-05 conditional novelty 7.0

    HeadKV compresses KV cache for autoregressive image generation via head-aware budget allocation, early head-type identification from consistent patterns, and stratified token eviction.

  30. RECIPE: Procedural Planning via Grounding in Instructional Video

    cs.CV 2026-05 unverdicted novelty 7.0

    RECIPE improves visual procedural planners by rewarding plans according to their grounding quality in ASR transcripts via GRPO, yielding +7–8 in-domain and up to +16 zero-shot macro-accuracy gains over base models and...

  31. Modality-Decoupled Online Recursive Editing

    cs.LG 2026-05 conditional novelty 7.0

    M-ORE decouples text and visual update statistics in MLLMs and applies recursive low-rank edits in an orthogonal subspace to reduce cross-modal conflict and long-horizon interference.

  32. 4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    4DLidarOpen is a new open dataset providing synchronized 4D FMCW Lidar velocity measurements, multi-Lidar and camera data, and 3D bounding-box annotations with track IDs to support benchmarks on 3D detection, BEV segm...

  33. SNLP: Layer-Parallel Inference via Structured Newton Corrections

    cs.LG 2026-05 unverdicted novelty 7.0

    SNLP enables layer-parallel Transformer inference by replacing sequential layer execution with structured Newton corrections and SNLP-aware training regularization, yielding up to 2.3x wall-clock speedup on 0.5B model...

  34. Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

    cs.CV 2026-05 unverdicted novelty 7.0

    A cross-modal alignment attack achieves AUC 0.821 for single-sample black-box membership inference on VLMs such as LLaVA-1.5 by quantifying image-generated caption similarity.

  35. How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study

    cs.AI 2026-05 unverdicted novelty 7.0

    EEG study of 27 participants reveals distinct neural patterns for AI-generated hallucinations, with misjudged ones failing to trigger standard fact verification pathways.

  36. MO-CAPO: Multi-Objective Cost-Aware Prompt Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    MO-CAPO introduces a budget-aware multi-objective optimizer that jointly tunes LLM prompt performance and inference cost, producing diverse Pareto fronts more efficiently than standard NSGA-II.

  37. EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy

    cs.CV 2026-05 unverdicted novelty 7.0

    EntropyScan detects backdoored LVLMs by quantifying structural anomalies in visual attention distributions on benign samples via Tsallis entropy and reference-anchored Z-score normalization.

  38. Dynamic Chunking for Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.

  39. Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis

    cs.LG 2026-05 unverdicted novelty 7.0

    QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.

  40. MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

    cs.CR 2026-05 unverdicted novelty 7.0

    MetaBackdoor shows that LLMs can be backdoored using positional triggers like sequence length, enabling stealthy activation on clean inputs to leak system prompts or trigger malicious behavior.

  41. From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.

  42. Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

    cs.CL 2026-05 unverdicted novelty 7.0

    The paper introduces Manta-LM, which approximates the Hamilton-Jacobi-Bellman optimal policy via Flow Matching in a rectified latent control space to enable high-fidelity parallel language generation.

  43. Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.

  44. Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

    cs.LG 2026-05 conditional novelty 7.0

    A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.

  45. SurF: A Generative Model for Multivariate Irregular Time Series Forecasting

    cs.LG 2026-05 unverdicted novelty 7.0

    SurF applies the Time Rescaling Theorem as a learnable bijection to create a single generative model for forecasting irregular multivariate event streams that outperforms or matches baselines on six benchmarks.

  46. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  47. BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

    cs.RO 2026-05 unverdicted novelty 7.0

    BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.

  48. IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

    cs.CL 2026-05 unverdicted novelty 7.0

    A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.

  49. CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large VIsion-Language Models

    cs.CV 2026-05 conditional novelty 7.0

    LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without an...

  50. The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Probabilistic circuits have an output bottleneck with convex probability combinations and a context bottleneck limited to fixed vtree-aligned partitions, making them less expressive than transformers for language data...

  51. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  52. Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.

  53. GKnow: Measuring the Entanglement of Gender Bias and Factual Gender

    cs.CL 2026-05 unverdicted novelty 7.0

    Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, making neuron ablation an unreliable method for debiasing.

  54. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  55. Efficient and Adaptive Human Activity Recognition via LLM Backbones

    cs.LG 2026-05 unverdicted novelty 7.0

    Pretrained LLMs adapted via convolutional projections and LoRA act as efficient frozen backbones for sensor-based human activity recognition, delivering strong data efficiency and cross-dataset transfer.

  56. DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction

    cs.CV 2026-05 unverdicted novelty 7.0

    DistractMIA performs output-only black-box membership inference on vision-language models by inserting semantic distractors and measuring shifts in generated text responses.

  57. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  58. Variance-aware Reward Modeling with Anchor Guidance

    stat.ML 2026-05 unverdicted novelty 7.0

    Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, ...

  59. Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

    cs.CL 2026-05 unverdicted novelty 7.0

    Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

  60. V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.