pith. sign in

hub Mixed citations

DataComp-LM: In search of the next generation of training sets for language models

Mixed citation behavior. Most common role is background (60%).

20 Pith papers citing it
Background 60% of classified citations
abstract

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.

hub tools

citation-role summary

background 4 dataset 1

citation-polarity summary

years

2026 17 2025 3

representative citing papers

PithTrain: A Compact and Agent-Native MoE Training System

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

PithTrain is a compact agent-native MoE training system that matches production throughput and improves agent-task efficiency by up to 62% fewer turns and 64% less GPU time on the new ATE-Bench.

Projection-Free Transformers via Gaussian Kernel Attention

cs.LG · 2026-05-04 · unverdicted · novelty 7.0

Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.

CoFrGeNet: Continued Fraction Architectures for Language Generation

cs.CL · 2026-01-29 · unverdicted · novelty 7.0 · 2 refs

CoFrGeNets implement a continued-fraction function class as plug-in replacements for transformer blocks, delivering competitive or superior downstream performance on GPT2-xl and Llama3-scale models with one-half to two-thirds the parameters.

Scaling Laws for Mixture Pretraining Under Data Constraints

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

Parcae: Scaling Laws For Stable Looped Language Models

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.

Muon is Scalable for LLM Training

cs.LG · 2025-02-24 · unverdicted · novelty 6.0

Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

Demystifying Data Organization for Enhanced LLM Training

cs.AI · 2026-05-28 · unverdicted · novelty 5.0

Four guidelines for data organization and two new ordering methods (STR and SAW) improve LLM training stability and performance across scales when reusing pre-computed scores.

XekRung Technical Report

cs.CR · 2026-04-30 · unverdicted · novelty 3.0

XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.

citing papers explorer

Showing 20 of 20 citing papers.

  • PithTrain: A Compact and Agent-Native MoE Training System cs.LG · 2026-05-29 · unverdicted · none · ref 18 · internal anchor

    PithTrain is a compact agent-native MoE training system that matches production throughput and improves agent-task efficiency by up to 62% fewer turns and 64% less GPU time on the new ATE-Bench.

  • Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models cs.CL · 2026-05-10 · conditional · none · ref 57 · internal anchor

    Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

  • Projection-Free Transformers via Gaussian Kernel Attention cs.LG · 2026-05-04 · unverdicted · none · ref 16 · internal anchor

    Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.

  • Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts cs.LG · 2026-04-21 · unverdicted · none · ref 29 · 2 links · internal anchor

    Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.

  • CoFrGeNet: Continued Fraction Architectures for Language Generation cs.CL · 2026-01-29 · unverdicted · none · ref 23 · 2 links · internal anchor

    CoFrGeNets implement a continued-fraction function class as plug-in replacements for transformer blocks, delivering competitive or superior downstream performance on GPT2-xl and Llama3-scale models with one-half to two-thirds the parameters.

  • Scaling Laws for Mixture Pretraining Under Data Constraints cs.LG · 2026-05-12 · unverdicted · none · ref 24 · internal anchor

    Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.

  • Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data cs.CL · 2026-05-11 · unverdicted · none · ref 42 · internal anchor

    Synthetic pre-pre-training on structured data improves LLM robustness to noisy pre-training, matching baseline loss with up to 49% fewer natural tokens for a 1B model.

  • ZAYA1-8B Technical Report cs.AI · 2026-05-06 · unverdicted · none · ref 117 · internal anchor

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  • Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code cs.SE · 2026-05-06 · accept · none · ref 61 · internal anchor

    A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

  • NH-CROP: Robust Pricing for Governed Language Data Assets under Cost Uncertainty cs.AI · 2026-05-03 · unverdicted · none · ref 3 · internal anchor

    NH-CROP introduces a robust online pricing method for governed language data with uncertain costs, using a selective verification gate that improves or matches baselines without relying heavily on paid information acquisition.

  • Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling cs.CL · 2026-04-23 · unverdicted · none · ref 6 · internal anchor

    X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scales using 50% smaller tables.

  • Parcae: Scaling Laws For Stable Looped Language Models cs.LG · 2026-04-14 · unverdicted · none · ref 47 · internal anchor

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.

  • Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion cs.CL · 2026-04-07 · conditional · none · ref 22 · internal anchor

    Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.

  • Muon is Scalable for LLM Training cs.LG · 2025-02-24 · unverdicted · none · ref 49 · internal anchor

    Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

  • Demystifying Data Organization for Enhanced LLM Training cs.AI · 2026-05-28 · unverdicted · none · ref 4 · internal anchor

    Four guidelines for data organization and two new ordering methods (STR and SAW) improve LLM training stability and performance across scales when reusing pre-computed scores.

  • Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework cs.CL · 2025-11-26 · unverdicted · none · ref 10 · internal anchor

    Matrix provides a peer-to-peer multi-agent system for synthetic data generation that scales to tens of thousands of workflows and delivers 2-15x higher throughput than centralized designs without quality loss.

  • SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model cs.CL · 2025-02-04 · unverdicted · none · ref 194 · internal anchor

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  • XekRung Technical Report cs.CR · 2026-04-30 · unverdicted · none · ref 33 · internal anchor

    XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.

  • Tokenization with Split Trees cs.CL · 2026-05-21 · unreviewed · ref 57 · internal anchor
  • Compute Optimal Tokenization cs.CL · 2026-05-02 · unreviewed · ref 14 · internal anchor