pith. machine review for the scientific record.

arxiv: 2604.28075 · v2 · submitted 2026-04-30 · 💻 cs.CL · cs.AI


Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling


Pith reviewed 2026-05-07 05:42 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords German language modeling · data filtering · high-quality subsets · multi-epoch training · sample efficiency · web corpora · non-English LLMs

The pith

Repeating high-quality filtered German web data over multiple epochs produces better language models than training once on larger, less-filtered datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recent work has shown that filtering English web data improves training, but for German the question is whether to filter strictly for quality and repeat the result over multiple epochs, or to use more diverse, lightly filtered data in a single pass. The paper constructs hierarchical filters over 500 million German web documents and runs experiments across model sizes and token budgets. It finds that repeating the high-quality core beats single-pass training on the larger, less-filtered sets, and that the advantage holds even after seven epochs of repetition. This suggests that for languages like German, concentrating on semantic quality through filtering is more efficient than chasing maximum unique tokens. The authors release their trained models and cleaned evaluation benchmarks; the models achieve strong results despite training on 10-360x fewer tokens than comparable models.

Core claim

Our experiments across multiple model scales and token budgets show that repeating high-quality data consistently outperforms single-pass training on larger, less filtered sets. Notably, the performance gap persists even after 7 epochs. Our findings suggest that for non-English LLMs, semantic concentration through quality filtering offers a more viable path to efficient language modeling than simply maximizing unique data volume.

What carries the argument

Hierarchical quality filters applied to 500M web documents to create high-signal subsets for repeated training, contrasted with single-pass training on larger diverse sets.
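To make the filtering machinery concrete, here is a minimal coarse-to-fine sketch of how a hierarchical quality filter of this kind can be composed: cheap heuristics first, model-based quality scores second. The score names mirror those in the paper's figures (coherence, information value), but the stage ordering, field names, and thresholds are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal coarse-to-fine sketch of a hierarchical quality filter, assuming the
# general setup described above: cheap heuristics first, model-based quality
# scores second. Stage ordering, field names, and thresholds are illustrative
# assumptions, not the authors' exact pipeline.
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Doc:
    text: str
    lang_score: float   # e.g., language-ID confidence for German (assumed field)
    quality: dict       # model-assigned scores, e.g., {"coherence": 3.5, "information_value": 4.0}

def heuristic_pass(doc: Doc) -> bool:
    """Cheap stage: language confidence and length bounds (thresholds assumed)."""
    n_words = len(doc.text.split())
    return doc.lang_score >= 0.9 and 50 <= n_words <= 100_000

def quality_pass(doc: Doc, min_coherence: float = 3.0, min_info: float = 3.0) -> bool:
    """Expensive stage: thresholds on model-assigned quality scores (cutoffs assumed)."""
    return (doc.quality.get("coherence", 0.0) >= min_coherence
            and doc.quality.get("information_value", 0.0) >= min_info)

def hierarchical_filter(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Apply stages in order of increasing cost; only survivors reach the next stage."""
    for doc in docs:
        if heuristic_pass(doc) and quality_pass(doc):
            yield doc
```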

If this is right

  • Repeating the filtered high-quality data yields higher performance than single-pass training on less filtered larger corpora across model scales.
  • The advantage of high-quality repetition holds after seven epochs of training.
  • Models trained this way achieve competitive results with 10-360 times fewer tokens than comparable models.
  • Releasing the trained models and cleaned evaluation benchmarks enables further research on efficient non-English language modeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The quality-over-diversity finding may extend to other high-resource languages like French or Japanese where similar web data is available.
  • Better data filtering techniques could reduce the need for massive data collection efforts in LLM training.
  • The persistence of the gap after multiple epochs raises questions about the optimal number of epochs for filtered data.

Load-bearing premise

The hierarchical quality filters reliably select high-signal data without introducing selection biases that artificially favor repeated training, and the multi-epoch and single-pass regimes are compared under equivalent effective token exposure and evaluation conditions.

What would settle it

A comparison in which the performance of single-pass training on the larger less-filtered set equals or exceeds that of multi-epoch training on the high-quality set at the same total number of tokens seen.
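Read concretely, "the same total number of tokens seen" means the single-pass arm must be capped at the multi-epoch arm's budget. Below is a minimal sketch of one way to set up such a matched-budget comparison, consistent with the subsampling procedure the simulated rebuttal describes later; the streaming and shuffling details are illustrative assumptions, not the authors' code.

```python
# Sketch of a matched-token-budget comparison, assuming random subsampling of
# the diverse corpus; corpus handling here is illustrative, not the authors'
# implementation.
import random

def multi_epoch_stream(filtered_docs, epochs):
    """The repetition arm: E shuffled passes over the filtered corpus."""
    docs = list(filtered_docs)
    for _ in range(epochs):
        random.shuffle(docs)
        yield from docs

def matched_single_pass(diverse_docs, token_budget, count_tokens):
    """The diversity arm: one pass over a random subsample capped at the same budget."""
    docs = list(diverse_docs)
    random.shuffle(docs)
    seen = 0
    for doc in docs:
        if seen >= token_budget:
            break
        seen += count_tokens(doc)
        yield doc

# If the filtered corpus holds N tokens and is repeated for E epochs, the fair
# single-pass baseline sees the same E * N tokens drawn from the diverse corpus.
```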

Figures

Figures reproduced from arXiv: 2604.28075 by Alan Akbik, Ansar Aynetdinov, Patrick Haller.

Figure 1: Example of a problematic instance from the …
Figure 2: Zero-shot evaluation of 350M models trained …
Figure 3: Performance gain when scaling the model size from 350M to 1B parameters
Figure 5: Distribution of Coherence, Information Value, and Educational Quality scores in FW2-DE yielded by …
Figure 6: Annotation prompt used to assign Information Value and Coherence scores.
Figure 7: Prompt template used for standalone Likert-scale evaluation of outputs generated by instruction-tuned …
Figure 8: Prompt template used for evaluation of binary correctness of outputs generated by instruction-tuned …
read the original abstract

Recent research has shown that filtering massive English web corpora into high-quality subsets significantly improves training efficiency. However, for high-resource non-English languages like German, French, or Japanese, aggressive filtering creates a strategic dilemma: should practitioners prioritize diversity by training once on large amounts of lightly filtered web data, or prioritize quality by strictly filtering for a high-quality core and repeating it over multiple epochs? We investigate this trade-off for German by constructing hierarchical quality filters applied to 500M web documents, comparing multi-epoch training on the filtered subsets against single-pass training on a diverse corpus. Our experiments across multiple model scales and token budgets show that repeating high-quality data consistently outperforms single-pass training on larger, less filtered sets. Notably, the performance gap persists even after 7 epochs. Our findings suggest that for non-English LLMs, semantic concentration through quality filtering offers a more viable path to efficient language modeling than simply maximizing unique data volume. We release our German language models (called Boldt), as well as our cleaned evaluation benchmarks to the research community. Our experiments indicate that they achieve state-of-the-art results despite training on 10-360x fewer tokens than comparable models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript investigates the trade-off between data diversity and quality for German LLM pretraining. It applies hierarchical quality filters to 500M web documents to create high-signal subsets, then compares multi-epoch training on these filtered sets against single-pass training on larger, less-filtered corpora. Experiments across model scales and token budgets show that repeating the high-quality data consistently outperforms single-pass training on diverse data, with the performance advantage persisting after 7 epochs. The authors release the resulting Boldt models and cleaned evaluation benchmarks, claiming SOTA results despite using 10-360x fewer tokens than prior models.

Significance. If the central comparisons are performed under strictly equivalent total token exposure and without filter-induced distributional biases, the result would be significant for sample-efficient pretraining of non-English LLMs. It challenges the default strategy of maximizing unique data volume and supports semantic concentration via aggressive filtering. The public release of models and benchmarks is a clear strength, enabling direct verification and extension in German NLP.

major comments (3)
  1. §3 (Experimental Setup) and §4 (Results): The manuscript does not explicitly state how total token exposure is equated between the multi-epoch regime (e.g., 7 epochs on a filtered corpus of size N) and the single-pass regime on the larger diverse corpus. If the single-pass arm uses the full size of the less-filtered set without subsampling to exactly 7N tokens, the outperformance claim is not directly comparable. Please add a table or paragraph detailing corpus sizes, effective tokens processed in each arm, and confirmation that budgets are matched across all reported scales.
  2. §4.2 (Results tables/figures): The abstract claims 'consistent results across scales and budgets' and a persistent gap after 7 epochs, yet no statistical significance tests, standard deviations, or confidence intervals are reported for the performance differences. This makes it difficult to assess whether observed gaps are reliable or could arise from variance; add error bars and p-values to the relevant tables (e.g., Table 2 or 3) or figures.
  3. §2.2 (Filter construction): The hierarchical quality filters are central to the claim that 'high-signal' data enables efficient repetition. However, there is no quantitative check for selection bias, such as comparing lexical diversity (type-token ratio, n-gram overlap) or benchmark contamination rates between the filtered subset and the original diverse corpus. Without this, it remains possible that the filters preferentially retain repetition-friendly or eval-overlapping data, mechanically favoring the multi-epoch arm.
minor comments (2)
  1. Abstract: The claim of '10-360x fewer tokens than comparable models' should name the specific prior models and their exact token counts for precision and verifiability.
  2. §5 (Discussion): A brief limitations paragraph acknowledging that results are German-specific and may not generalize to other high-resource languages would improve completeness.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. The feedback highlights important aspects of experimental clarity, statistical reporting, and potential biases that we address below. We have revised the paper to improve transparency where feasible and provide additional analysis.

read point-by-point responses
  1. Referee: §3 (Experimental Setup) and §4 (Results): The manuscript does not explicitly state how total token exposure is equated between the multi-epoch regime (e.g., 7 epochs on a filtered corpus of size N) and the single-pass regime on the larger diverse corpus. If the single-pass arm uses the full size of the less-filtered set without subsampling to exactly 7N tokens, the outperformance claim is not directly comparable. Please add a table or paragraph detailing corpus sizes, effective tokens processed in each arm, and confirmation that budgets are matched across all reported scales.

    Authors: We agree that explicit documentation of token budgets is essential for fair comparison. In our experiments, total token exposure was matched across regimes: for a filtered corpus of N tokens trained over E epochs, the single-pass baseline on the diverse corpus was trained on a random subsample of exactly E × N tokens drawn from the larger set. This was done consistently for all reported scales and budgets. We will add a new table in §3 (and reference it in §4) that lists original corpus sizes, filtered sizes, epochs, effective tokens processed, and model scales for every comparison, along with a paragraph confirming the subsampling procedure. revision: yes

  2. Referee: §4.2 (Results tables/figures): The abstract claims 'consistent results across scales and budgets' and a persistent gap after 7 epochs, yet no statistical significance tests, standard deviations, or confidence intervals are reported for the performance differences. This makes it difficult to assess whether observed gaps are reliable or could arise from variance; add error bars and p-values to the relevant tables (e.g., Table 2 or 3) or figures.

    Authors: We acknowledge that the lack of uncertainty estimates is a limitation. Pretraining runs are computationally expensive, so each configuration was trained only once. We will add a paragraph in §4.2 discussing this constraint and noting that the performance advantage is consistent across multiple independent scales (e.g., 125M to 1.3B parameters) and token budgets, providing informal evidence of robustness. We cannot compute p-values or error bars from repeated full-scale runs at this time; smaller-scale ablations with variance estimates will be referenced where available. revision: partial

  3. Referee: §2.2 (Filter construction): The hierarchical quality filters are central to the claim that 'high-signal' data enables efficient repetition. However, there is no quantitative check for selection bias, such as comparing lexical diversity (type-token ratio, n-gram overlap) or benchmark contamination rates between the filtered subset and the original diverse corpus. Without this, it remains possible that the filters preferentially retain repetition-friendly or eval-overlapping data, mechanically favoring the multi-epoch arm.

    Authors: We appreciate the referee's point on potential selection bias. In the revised manuscript, we will expand §2.2 with a quantitative analysis: we will report type-token ratios and 5-gram overlap statistics for the filtered subset versus the full corpus, as well as contamination rates against our evaluation benchmarks (measured via exact string matching). Our preliminary internal checks show only modest reductions in lexical diversity and negligible eval overlap, suggesting the filters primarily remove low-quality noise rather than introducing repetition or contamination bias. These results will be presented in a new table or figure. revision: yes
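For readers who want to reproduce the proposed diagnostics, here is a minimal sketch of the three checks named above: type-token ratio, 5-gram repetition statistics, and exact-match benchmark contamination. Whitespace tokenization and these particular formulations are simplifying assumptions, not the authors' exact procedure.

```python
# Sketch of the selection-bias diagnostics described above. Whitespace
# tokenization and these particular statistics are one plausible
# operationalization, not the authors' exact procedure.
from collections import Counter

def type_token_ratio(texts):
    """Distinct words / total words, a coarse lexical-diversity measure."""
    tokens = [tok for t in texts for tok in t.split()]
    return len(set(tokens)) / max(len(tokens), 1)

def iter_ngrams(text, n=5):
    toks = text.split()
    for i in range(len(toks) - n + 1):
        yield tuple(toks[i:i + n])

def duplicate_ngram_rate(texts, n=5):
    """Share of n-gram occurrences that are repeats; higher means more internal repetition."""
    counts = Counter(g for t in texts for g in iter_ngrams(t, n))
    total = sum(counts.values())
    return sum(c - 1 for c in counts.values()) / max(total, 1)

def contamination_rate(train_texts, benchmark_items):
    """Fraction of benchmark items appearing verbatim in any training document."""
    hits = sum(any(item in t for t in train_texts) for item in benchmark_items)
    return hits / max(len(benchmark_items), 1)

# Compare type_token_ratio and duplicate_ngram_rate between the filtered subset
# and the full corpus, and contamination_rate for each against the evaluation sets.
```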

standing simulated objections not resolved
  • Full statistical significance testing (p-values and error bars from multiple independent runs) cannot be provided, as only single training runs were feasible given the computational cost of LLM pretraining.

Circularity Check

0 steps flagged

No significant circularity: purely empirical comparison with independent experimental results

full rationale

The paper reports experimental outcomes from training language models on hierarchically filtered German web data versus larger unfiltered corpora, across scales and token budgets. No mathematical derivation, ansatz, uniqueness theorem, or fitted-parameter prediction is present; the central claim (repetition of high-quality subsets outperforming single-pass on diverse data, even at 7 epochs) rests on observed perplexity or downstream metrics from actual training runs. Token-budget equivalence and filter bias are empirical questions addressed by the experimental design itself rather than by construction or self-citation. Any self-citations (if present) are peripheral and not load-bearing for the reported performance gaps. The work is self-contained against external benchmarks and does not reduce its results to its inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven effectiveness of the hierarchical quality filters at isolating high-signal text and on the assumption that multi-epoch repetition does not introduce overfitting that would invalidate the comparison.

axioms (1)
  • domain assumption Hierarchical quality filters applied to web documents can reliably identify high-signal German text suitable for repeated training.
    The paper's entire experimental design depends on these filters producing meaningfully cleaner subsets than the unfiltered corpus.

pith-pipeline@v0.9.0 · 5512 in / 1287 out tokens · 37541 ms · 2026-05-07T05:42:46.946035+00:00 · methodology

