pith. machine review for the scientific record.

DataComp-LM: In search of the next generation of training sets for language models

14 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background (1)

citation-polarity summary: still indexing

years: 2026 (12) · 2025 (2)

roles: background (1)

polarities: background (1)

representative citing papers

Projection-Free Transformers via Gaussian Kernel Attention

cs.LG · 2026-05-04 · unverdicted · novelty 7.0

Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.
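
The summary above only says that attention scores come from a Gaussian RBF kernel on per-head token features; the sketch below is one minimal reading of that idea, assuming each head works on a plain slice of the residual stream, a fixed bandwidth sigma, and the same per-head features reused as values. The function name, the bandwidth handling, and the absence of any output projection are assumptions for illustration, not details taken from the paper.

    import torch
    import torch.nn.functional as F

    def gaussian_kernel_attention(x: torch.Tensor, n_heads: int, sigma: float = 1.0) -> torch.Tensor:
        # x: (batch, seq, dim); each head attends over a slice of the features.
        b, t, d = x.shape
        h = x.view(b, t, n_heads, d // n_heads).transpose(1, 2)      # (b, heads, t, head_dim)
        # Pairwise squared Euclidean distances between token features within a head.
        diff = h.unsqueeze(-2) - h.unsqueeze(-3)                     # (b, heads, t, t, head_dim)
        scores = -diff.pow(2).sum(-1) / (2.0 * sigma ** 2)           # log of the RBF kernel
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        weights = F.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
        out = weights @ h                                            # per-head features double as values
        return out.transpose(1, 2).reshape(b, t, d)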

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

Compute Optimal Tokenization

cs.CL · 2026-05-02 · unverdicted · novelty 6.0

Compute-optimal language models require parameter count to scale with data bytes rather than tokens, with optimal token compression rate decreasing as compute budget grows.
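
As a back-of-the-envelope view of why the byte/token distinction matters, the snippet below converts a compute budget into raw bytes covered at a given compression rate (bytes per token). The C ≈ 6·N·D(tokens) FLOPs rule and every number in it are generic assumptions of this sketch, not formulas or fits from the cited paper.

    # Back-of-the-envelope sketch; the 6*N*D FLOPs rule and all numbers below
    # are generic assumptions, not values taken from the cited paper.

    def tokens_seen(compute_flops: float, n_params: float) -> float:
        return compute_flops / (6.0 * n_params)

    def bytes_seen(compute_flops: float, n_params: float, bytes_per_token: float) -> float:
        return tokens_seen(compute_flops, n_params) * bytes_per_token

    # Same 1e21-FLOP budget and 1B-parameter model at two compression rates.
    for r in (4.0, 2.0):  # bytes per token; smaller r means weaker compression
        print(f"compression {r} B/token -> {bytes_seen(1e21, 1e9, r):.2e} bytes covered")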

Parcae: Scaling Laws For Stable Looped Language Models

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling with training FLOPs and saturating-exponential scaling with test-time compute that improves quality over fixed-depth baselines under fixed parameter budgets.
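
The entry does not say how the spectral norm constraint is enforced, so the sketch below shows a generic version of such a constraint: rescale a 2D injection weight whenever its largest singular value, estimated with a few power-iteration steps, exceeds a cap. The function name, the cap, and the iteration count are assumptions, not the paper's formulation.

    import torch
    import torch.nn.functional as F

    def clip_spectral_norm(w: torch.Tensor, max_sigma: float = 1.0, iters: int = 5) -> torch.Tensor:
        # Estimate the largest singular value of a 2D weight via power iteration,
        # then shrink the whole matrix if that value exceeds the cap.
        v = F.normalize(torch.randn(w.shape[1]), dim=0)
        for _ in range(iters):
            u = F.normalize(w @ v, dim=0)
            v = F.normalize(w.t() @ u, dim=0)
        sigma = torch.dot(u, w @ v)
        return w * torch.clamp(max_sigma / (sigma + 1e-12), max=1.0)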

Muon is Scalable for LLM Training

cs.LG · 2025-02-24 · unverdicted · novelty 6.0

Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
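
For readers unfamiliar with Muon, the sketch below shows one optimizer step in the shape this line describes: momentum, Newton-Schulz orthogonalization of the update, decoupled weight decay, and an RMS-matching update scale. The Newton-Schulz coefficients follow the public Muon reference implementation, and the 0.2 * sqrt(max(fan_in, fan_out)) scale is stated here as an assumption about the paper's adjustment rather than its verbatim recipe.

    import torch

    def newton_schulz(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
        # Approximately orthogonalize a 2D update; coefficients follow the
        # public Muon reference implementation.
        a, b, c = 3.4445, -4.7750, 2.0315
        x = g / (g.norm() + 1e-7)
        tall = x.shape[0] > x.shape[1]
        if tall:
            x = x.t()
        for _ in range(steps):
            s = x @ x.t()
            x = a * x + (b * s + c * (s @ s)) @ x
        return x.t() if tall else x

    def muon_step(w, g, buf, lr=2e-2, mu=0.95, weight_decay=0.1):
        # One update for a single 2D weight: momentum, orthogonalize, then apply
        # decoupled weight decay and an RMS-matching scale (assumptions here).
        buf.mul_(mu).add_(g)
        update = newton_schulz(buf)
        scale = 0.2 * max(w.shape) ** 0.5
        w.mul_(1.0 - lr * weight_decay).add_(update, alpha=-lr * scale)
        return w, buf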

XekRung Technical Report

cs.CR · 2026-04-30 · unverdicted · novelty 3.0

XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.

citing papers explorer

Showing 14 of 14 citing papers.

  • Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models cs.CL · 2026-05-10 · conditional · none · ref 57

    Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

  • Projection-Free Transformers via Gaussian Kernel Attention cs.LG · 2026-05-04 · unverdicted · none · ref 16

    Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.

  • Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts cs.LG · 2026-04-21 · unverdicted · none · ref 29 · 2 links

    Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute (a minimal sketch of the duplication step appears after this list).

  • Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data cs.CL · 2026-05-11 · unverdicted · none · ref 42

    Synthetic pre-pre-training on structured data improves LLM robustness to noisy pre-training, matching baseline loss with up to 49% fewer natural tokens for a 1B model.

  • ZAYA1-8B Technical Report cs.AI · 2026-05-06 · unverdicted · none · ref 117

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  • Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code cs.SE · 2026-05-06 · accept · none · ref 61

    A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

  • NH-CROP: Robust Pricing for Governed Language Data Assets under Cost Uncertainty cs.AI · 2026-05-03 · unverdicted · none · ref 3

    NH-CROP introduces a robust online pricing method for governed language data with uncertain costs, using a selective verification gate that improves or matches baselines without relying heavily on paid information acquisition.

  • Compute Optimal Tokenization cs.CL · 2026-05-02 · unverdicted · none · ref 14

    Compute-optimal language models require parameter count to scale with data bytes rather than tokens, with optimal token compression rate decreasing as compute budget grows.

  • Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling cs.CL · 2026-04-23 · unverdicted · none · ref 6

    X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction, gaining up to 4.4 accuracy points over vanilla backbones and 3.2 points over retrieval baselines at 0.73B-1.15B scales while using 50% smaller embedding tables.

  • Parcae: Scaling Laws For Stable Looped Language Models cs.LG · 2026-04-14 · unverdicted · none · ref 47

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling with training FLOPs and saturating-exponential scaling with test-time compute that improves quality over fixed-depth baselines under fixed parameter budgets.

  • Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion cs.CL · 2026-04-07 · conditional · none · ref 22

    Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.

  • Muon is Scalable for LLM Training cs.LG · 2025-02-24 · unverdicted · none · ref 49

    Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

  • SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model cs.CL · 2025-02-04 · unverdicted · none · ref 194

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  • XekRung Technical Report cs.CR · 2026-04-30 · unverdicted · none · ref 33

    XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.
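
Sketch referenced in the expert-upcycling entry above: it duplicates each expert's weights in a checkpoint and widens the router so continued pre-training can start from the copies. The state-dict key layout, the parameter names, and the tiny symmetry-breaking noise are assumptions for illustration, not details from the cited paper.

    import copy
    import torch

    def upcycle_experts(state_dict: dict, layer: int, n_experts: int) -> dict:
        # Double the expert count of one MoE layer by copying each expert and
        # stacking a second copy of the router rows (hypothetical key names).
        new_sd = copy.deepcopy(state_dict)
        for e in range(n_experts):
            for name in ("w_in.weight", "w_out.weight"):
                src = f"layers.{layer}.moe.experts.{e}.{name}"
                dst = f"layers.{layer}.moe.experts.{e + n_experts}.{name}"
                # Tiny noise breaks the tie so the copies can diverge during
                # continued pre-training (the noise itself is an assumption).
                new_sd[dst] = new_sd[src].clone() + 1e-4 * torch.randn_like(new_sd[src])
        router_key = f"layers.{layer}.moe.router.weight"
        new_sd[router_key] = torch.cat([new_sd[router_key], new_sd[router_key].clone()], dim=0)
        return new_sd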