Not All Synthetic Data Is Yours to Learn From

Li Chen; Richard G. Baraniuk; Sina Alemohammad; Zhangyang Wang

arXiv: 2605.31126 · v1 · pith:JZZMABKJnew · submitted 2026-05-29 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-28 22:47 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords synthetic data · self-training · language models · capability amplification · memorization decoupling · compatibility · unconditional generation

The pith

Synthetic data improves a language model only when it matches the model's existing capabilities in a relational way.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a language model can improve by training on plain text it generates from the BOS token alone, with no prompts, rewards, or external signals. It finds success only when the source data is compatible with the student model, a property of the pair rather than the data itself. Under this condition, self-training amplifies capabilities already latent in the pretrained model. Standard proxies such as semantic similarity or token likelihood do not predict which sources will help. In controlled experiments the same process also preserves or raises benchmark scores while cutting verbatim extraction of held-out text by more than 95 percent.
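The generation step described above, sampling plain text from the BOS token alone, can be illustrated with a toy model. This is a minimal sketch, not the paper's pipeline: `next_logits`, `toy_logits`, the bigram table, and the token names are all hypothetical stand-ins for a real language model, and the fine-tuning step on the resulting corpus is not shown.

```python
import math
import random

def sample_from_bos(next_logits, bos, eos, temperature=1.0, max_len=32, rng=None):
    """Sample one sequence starting from the BOS token alone (no prompt).

    `next_logits` is a hypothetical callable: it takes the token list so far
    and returns a dict {token: logit} for the next token.
    """
    rng = rng or random.Random(0)
    tokens = [bos]
    while len(tokens) < max_len:
        logits = next_logits(tokens)
        # Temperature scaling followed by a numerically stable softmax.
        scaled = {t: l / temperature for t, l in logits.items()}
        m = max(scaled.values())
        weights = {t: math.exp(v - m) for t, v in scaled.items()}
        choices = list(weights)
        tok = rng.choices(choices, weights=[weights[t] for t in choices], k=1)[0]
        tokens.append(tok)
        if tok == eos:
            break
    return tokens

def toy_logits(prefix):
    # Toy stand-in for a language model: a fixed bigram table (assumption).
    table = {
        "<bos>": {"a": 2.0, "b": 0.0},
        "a": {"b": 1.0, "<eos>": 0.0},
        "b": {"a": 0.5, "<eos>": 1.5},
    }
    return table[prefix[-1]]

# Build a small unconditional corpus; a student would then fine-tune on it.
corpus = [
    sample_from_bos(toy_logits, "<bos>", "<eos>", temperature=1.25,
                    rng=random.Random(seed))
    for seed in range(100)
]
```

Raising `temperature` flattens the next-token distribution, which is the knob varied across the τ=0.75, 1.0, and 1.25 conditions in the figure captions.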

Core claim

In prompt-free unconditional self-training, synthetic data utility is relational rather than intrinsic: self-generated data works best, same-lineage transfer outperforms stronger but differently trained sources, and cross-family transfer is substantially weaker. Neither benchmark-level semantic similarity nor average per-token likelihood under the student predicts which corpora help. In controlled Pythia experiments, benchmark utility is preserved or improved while held-out exact-match extraction drops by over 95 percent with no forget set, privacy objective, or targeted unlearning. These results indicate that prompt-free self-training works by amplifying what the student already knows, not by importing structure from the data.
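The average per-token likelihood proxy mentioned in the core claim can be written down in a few lines. This is a sketch under assumptions: the input format (a list of per-token probabilities assigned by the scoring model) and the function names are hypothetical, and the paper's exact scoring setup may differ. The toy numbers are made up to show how two corpora with different token-level behavior can share the same mean.

```python
import math

def mean_token_nll(token_probs):
    """Average negative log-likelihood per token, in nats.

    `token_probs` holds the probability the scoring model assigned to each
    observed token of one sequence (hypothetical input format).
    """
    if not token_probs:
        raise ValueError("sequence must contain at least one token")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def corpus_mean_nll(corpus_probs):
    """Mean over sequences of each sequence's per-token NLL."""
    return sum(mean_token_nll(seq) for seq in corpus_probs) / len(corpus_probs)

# Made-up toy data: uniformly uncertain tokens vs. one confident, one unlikely.
own = [[0.5, 0.5]]
cross = [[0.25, 1.0]]
own_nll = corpus_mean_nll(own)
cross_nll = corpus_mean_nll(cross)
```

Both toy corpora score the same mean NLL (ln 2) despite different per-token profiles, which is one way a scalar likelihood summary can hide differences between sources.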

What carries the argument

The latent capability resurfacing hypothesis, which states that weak self-training amplifies capabilities already present in the pretrained model only when the synthetic source is compatible with the student.

If this is right

Self-generated data is the most effective source for this form of self-training.
Same-lineage synthetic corpora transfer better than stronger but differently trained sources.
Cross-family synthetic data yields substantially weaker gains.
Benchmark gains can occur while verbatim memorization of the training text falls sharply without any explicit unlearning step.
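The verbatim memorization in the last point is reported as exact-match extraction under a prefix-completion attack. A minimal sketch of that kind of count follows; the `complete` callable, the lookup-table "model", the prefix and suffix lengths, and the toy sequences are all assumptions for illustration, not the paper's protocol.

```python
def count_extracted(sequences, complete, prefix_len, suffix_len):
    """Count sequences whose true suffix is reproduced exactly.

    `complete(prefix, n)` is a hypothetical callable that returns the model's
    n-token continuation of `prefix` as a list of tokens.
    """
    extracted = 0
    for seq in sequences:
        if len(seq) < prefix_len + suffix_len:
            continue  # too short to split into prefix and suffix
        prefix = seq[:prefix_len]
        target = seq[prefix_len:prefix_len + suffix_len]
        if complete(prefix, suffix_len) == target:
            extracted += 1
    return extracted

def make_lookup_model(memorized, prefix_len=2):
    # Toy stand-in for a model: it "remembers" exactly the given sequences.
    table = {tuple(s[:prefix_len]): s[prefix_len:] for s in memorized}
    def complete(prefix, n):
        return list(table.get(tuple(prefix), []))[:n]
    return complete

held_out = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
before = count_extracted(held_out, make_lookup_model(held_out), 2, 2)
after = count_extracted(held_out, make_lookup_model(held_out[:1]), 2, 2)
relative_drop = (before - after) / before
```

A "drop by over 95 percent" would correspond to `relative_drop > 0.95` when the same held-out sequences are scored against the base and the self-trained checkpoint.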

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Compatibility between source and student might be estimated in advance from training history or architecture to select useful synthetic corpora.
The observed separation of capability gains from memorization could be useful in privacy-sensitive settings where reduced extraction is desirable.
The same relational pattern may appear in other self-improvement loops such as iterative preference optimization on model-generated outputs.

Load-bearing premise

The controlled Pythia setup and chosen benchmarks accurately isolate the relational compatibility effect and the memorization decoupling without unstated data selection or controls.

What would settle it

A cross-family synthetic corpus that improves the student model as much as same-lineage data while still producing the 95 percent drop in exact-match extraction would falsify the relational-compatibility claim.

Figures

Figures reproduced from arXiv: 2605.31126 by Li Chen, Richard G. Baraniuk, Sina Alemohammad, Zhangyang Wang.

**Figure 1.** Synthetic utility is relational, not intrinsic. We study a prompt-free BOS-only pipeline in which a source model samples unconditional text and a student fine-tunes on it. (A) Transfer tracks student–source compatibility: self-generated data is strongest, same-lineage transfer is next, and cross-family transfer is weakest. (B) Intrinsic corpus proxies, including benchmark proximity and mean likelihood unde…

**Figure 2.** Self-generated data produces transient gains on structured reasoning, math, and code for Qwen2.5-0.5B, while generic replay does not. ∆ performance relative to the frozen base model over 40 epochs. Synthetic self-generated corpora at three temperatures are compared against a matched Common Corpus replay baseline. Shaded bands show ±1 std. across subsets. All synthetic corpora undergo 8-gram decontamination…

**Figure 3.** The identical protocol on LLaMA-3.2-1B yields only narrow comprehension gains, with no improvement on math or code. ∆ performance relative to the frozen base model over 40 epochs. Unlike Qwen, LLaMA shows no GSM8K or Minerva-MATH gains under any condition, and HumanEval degrades substantially. Shaded bands show ±1 std. across subsets.

**Figure 4.** Benchmark proximity does not explain the gains: training on semantically distant samples produces the same GSM8K improvement as training on higher-similarity samples. Left three panels: max cosine similarity distributions between synthetic corpus and benchmark items. Right panel: GSM8K ∆ accuracy for Qwen τ=1.25 subsets partitioned below vs. above the 0.35 similarity threshold. Ruling out benchmark contami…

**Figure 5.** Synthetic utility is relational: self-generated data is strongest, same-lineage transfer outperforms a larger but differently trained model, and cross-family transfer is weakest. ∆ performance of the Qwen2.5-0.5B student trained on synthetic data from four source models over 40 epochs. Dashed lines: self-generated (Qwen2.5-0.5B). Qwen3-8B is larger and more capable than Qwen2.5-7B, yet transfers worse, in…

**Figure 6.** Likelihood under the student does not predict utility: two corpora with nearly identical mean NLL (µ=9.10 vs µ=9.01) produce opposite downstream effects. KDE of average NLL per token for synthetic corpora generated by Qwen2.5-0.5B (left, own) and Llama-3.2-1B (right, cross), all scored by Qwen2.5-0.5B. Dashed lines indicate distribution means. Despite converging at τ=1.25, the own corpus substantially impr…

**Figure 7.** Benchmark capability under self-training is preserved or improved across four evaluations. ∆ accuracy relative to the frozen base model over 40 epochs on ARC-Challenge (25-shot), HellaSwag (5-shot), ARC-Easy (0-shot), and WinoGrande (5-shot).

**Figure 8.** Capability and memorization move in opposite directions. Left two panels: number of verbatim sequences extracted via the prefix-completion attack of Carlini et al. [2021] on text corpora (Enron, PileCC, Wikipedia) and code (GitHub). Right two panels: average log-probability difference $\log p_{\theta_t}(x \mid y) - \log p_{\theta_{\text{base}}}(x \mid y)$ of the true continuation under each checkpoint relative to the base model, evaluated on…

**Figure 9.** Figure 9 shows that the few-shot results mirror the zero-shot pattern qualitatively. On ARC-Challenge and HellaSwag, self-generated data at τ=1.25 produces the largest sustained gains, while the Common Corpus baseline is competitive but weaker. On GSM8K, synthetic data at τ=0.75 and τ=1.0 produces early transient gains that decay over continued training, reproducing the transient-then-degrade pattern observed in t…

**Figure 10.** Qwen data is not a generic upgrade: a LLaMA student trained on Qwen teacher data fails to outperform LLaMA-self, confirming that utility is tied to student–source compatibility. ∆ performance of the Llama-3.2-1B student trained on self-generated data vs. Qwen2.5-0.5B teacher data at three temperatures over 40 epochs. Dashed line: self-generated (Llama-3.2-1B, τ=1.25). Shaded bands show ±1 std.

**Figure 11.** The NLL convergence pattern mirrors the Qwen-scorer case: own and cross corpora become indistinguishable in likelihood at high temperature yet produce different downstream effects. KDE of average NLL per token for synthetic corpora generated by Llama-3.2-1B (left, own) and Qwen2.5-0.5B (right, cross), all scored by Llama-3.2-1B. Dashed lines indicate distribution means. The pattern mirrors the Qwen-scorer…
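The Figure 2 caption mentions 8-gram decontamination of the synthetic corpora. A minimal sketch of one common way to do this follows, assuming whitespace tokenization, lowercasing, and a policy of dropping any sample that shares an 8-gram with a benchmark item. The caption specifies none of these details, and the function names and example strings are hypothetical.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(samples, benchmark_items, n=8):
    """Drop every sample that shares at least one n-gram with the benchmark."""
    banned = set()
    for item in benchmark_items:
        banned |= ngrams(item.lower().split(), n)
    return [s for s in samples if not (ngrams(s.lower().split(), n) & banned)]

benchmark = ["what is the sum of two and three plus four"]
samples = [
    "Question: what is the sum of two and three plus four today",  # overlaps
    "the sum of two small numbers is always a number too",         # clean
]
clean = decontaminate(samples, benchmark)
```

Samples shorter than `n` tokens yield no n-grams and are always kept under this policy.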

Original abstract

Can a language model improve from plain text sampled from itself, with no prompts, no teacher, no verifier, and no reward model? Yes, but only when the synthetic corpus is compatible with the student, a relational property of the source-student pair rather than an intrinsic property of the data. We call this the latent capability resurfacing hypothesis: weak self-training can amplify capabilities already present in the pretrained model, but only under this compatibility condition. We study this in the minimal setting of prompt-free unconditional self-training, where base language models are fine-tuned on text generated from the BOS token alone, with no task specification or external supervision. We report three findings. First, synthetic utility is relational rather than intrinsic: self-generated data is the most effective source, same-lineage transfer outperforms stronger but differently trained sources, and cross-family transfer is substantially weaker. Second, common intrinsic proxies fail: neither benchmark-level semantic similarity nor average per-token likelihood under the student predicts which corpora help. Third, this regime produces a surprising byproduct. In controlled Pythia experiments, capability and verbatim memorization decouple: benchmark utility is preserved or improved while held-out exact-match extraction drops by over 95 percent, with no forget set, privacy objective, or targeted unlearning. Together, these results suggest that prompt-free self-training works by amplifying what the student already knows, not by importing structure from the data. They also reveal a regime in which capability and verbatim memorization can be separated without any explicit unlearning objective.
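The benchmark-proximity proxy in the second finding can be made concrete using the setup described in the Figure 4 caption: each synthetic sample is scored by its maximum cosine similarity to any benchmark item, and samples are split at a 0.35 threshold. The sketch below assumes embeddings are already computed as plain lists of floats. The embedding model, the function names, and the tie-breaking rule (ties go to "above") are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split_by_max_similarity(sample_vecs, benchmark_vecs, threshold=0.35):
    """Return (below, above): sample indices partitioned by max similarity."""
    below, above = [], []
    for i, s in enumerate(sample_vecs):
        max_sim = max(cosine(s, b) for b in benchmark_vecs)
        (above if max_sim >= threshold else below).append(i)
    return below, above

# Toy 2-d embeddings: sample 1 is orthogonal to the single benchmark item.
sample_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
benchmark_vecs = [[1.0, 0.0]]
below, above = split_by_max_similarity(sample_vecs, benchmark_vecs)
```

Training separately on the `below` and `above` subsets and comparing downstream accuracy is the kind of comparison the Figure 4 caption describes.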

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper argues synthetic data utility in prompt-free self-training is relational to the student model and that capability gains can separate from verbatim memorization without unlearning, but the abstract leaves the experimental controls too vague to verify the decoupling claim.

The central observation here is that self-generated text helps a model more than text from stronger but unrelated models, and that in their Pythia setup this process improves benchmarks while cutting exact-match extraction on held-out sequences by over 95 percent with no forget set or unlearning loss. That separation is the part worth paying attention to if the numbers hold up.

The work does a clean job of framing the minimal unconditional case—no prompts, no verifier—and testing the relational claim across lineages. Showing that benchmark similarity and per-token likelihood are poor predictors is useful negative evidence. The latent capability resurfacing idea is stated plainly as a hypothesis rather than overclaimed.

The main softness is in the experimental reporting. The abstract gives no dataset sizes, generation lengths, number of runs, or details on how the held-out exact-match sequences were sampled and filtered. If those sequences were chosen or post-processed in ways that make verbatim matches easier to suppress, the 95 percent drop could be less general than presented. The stress-test concern about unstated controls on the held-out set is reasonable to raise until the full methods section is checked.

This is for people already running self-training or synthetic data loops who want to think about compatibility and memorization trade-offs. It is not yet a finished result, but the questions are concrete enough that a serious referee should see it. The authors have isolated a regime worth probing further rather than just fitting another curve.

Referee Report

2 major / 2 minor

Summary. The paper examines prompt-free unconditional self-training of language models on text generated from the BOS token. It claims that synthetic data utility is a relational property of the source-student pair (self-generated data most effective, same-lineage transfer outperforms cross-family), that standard intrinsic proxies like semantic similarity or per-token likelihood fail to predict utility, and that in controlled Pythia experiments benchmark utility is preserved while held-out exact-match extraction drops over 95% with no forget set or unlearning objective. This supports the 'latent capability resurfacing hypothesis' that weak self-training amplifies preexisting capabilities under a compatibility condition rather than importing structure.

Significance. If the relational utility and decoupling results hold under rigorous controls, the work would provide a concrete empirical basis for viewing self-training as capability amplification rather than data importation, with direct implications for privacy-preserving model improvement and synthetic data selection. The absence of any external supervision or reward model makes the minimal setting a useful testbed for isolating mechanisms in LLM self-improvement.

major comments (2)

[Abstract and experimental sections describing Pythia setup] The decoupling result (benchmark utility preserved while exact-match extraction on held-out data falls >95%) is load-bearing for the latent capability resurfacing hypothesis, yet the manuscript provides no description of how the held-out sequences are constructed (sampling method, length distribution, diversity criteria, or post-generation filtering). Without this, it is impossible to exclude the possibility that the drop is an artifact of distributional shift between the synthetic corpus and the held-out set rather than evidence of memorization decoupling.
[Findings 1 and 2] The claim that 'synthetic utility is relational rather than intrinsic' rests on comparisons across self-generated, same-lineage, and cross-family sources, but the manuscript does not report the number of runs, statistical tests, or controls for model scale and training duration that would be needed to establish that same-lineage transfer is reliably superior to stronger but differently trained sources.

minor comments (2)

[Abstract] The abstract states three findings and a hypothesis but supplies no dataset sizes, benchmark lists, or statistical details; these should be summarized even at the abstract level for a self-contained empirical paper.
[Introduction and hypothesis statement] Notation for 'compatibility condition' is introduced informally; a precise definition or operationalization (e.g., via a measurable distance between source and student distributions) would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript to incorporate additional details where needed.

Referee: [Abstract and experimental sections describing Pythia setup] The decoupling result (benchmark utility preserved while exact-match extraction on held-out data falls >95%) is load-bearing for the latent capability resurfacing hypothesis, yet the manuscript provides no description of how the held-out sequences are constructed (sampling method, length distribution, diversity criteria, or post-generation filtering). Without this, it is impossible to exclude the possibility that the drop is an artifact of distributional shift between the synthetic corpus and the held-out set rather than evidence of memorization decoupling.

Authors: We agree that the construction details for the held-out set are essential to substantiate the decoupling result. The sequences were generated unconditionally from the BOS token using identical sampling parameters to the synthetic corpus (temperature 1.0, no nucleus sampling) but with disjoint random seeds, lengths sampled from the empirical distribution of the training data, and post-filtering to remove any exact matches or high-similarity sequences to the training set. In the revised manuscript we will add an explicit subsection describing the sampling method, length distribution, diversity criteria (minimum unique n-gram count), and filtering steps, along with a control analysis confirming comparable perplexity and embedding distributions between the held-out and synthetic sets to address distributional shift concerns.
revision: yes

Referee: [Findings 1 and 2] The claim that 'synthetic utility is relational rather than intrinsic' rests on comparisons across self-generated, same-lineage, and cross-family sources, but the manuscript does not report the number of runs, statistical tests, or controls for model scale and training duration that would be needed to establish that same-lineage transfer is reliably superior to stronger but differently trained sources.

Authors: We acknowledge that explicit reporting of run counts and statistical tests would strengthen the relational utility claim. The reported trends were observed consistently across five model scales (Pythia 70M–2.8B) with fixed training duration and token count for all conditions; however, the main text presents single-run results. In revision we will report results from three independent seeds per configuration, add error bars, and include paired statistical tests (e.g., Wilcoxon) for the key same-lineage vs. cross-family comparisons while retaining the existing scale and duration controls.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical study with independent experimental observations

The paper is an empirical study reporting observations from controlled experiments on prompt-free self-training of language models. No equations or derivations are present that reduce any result to a fitted parameter defined by the same data or to a self-citation chain. The latent capability resurfacing hypothesis is an interpretive label for experimental outcomes (relational utility, failure of intrinsic proxies, and capability-memorization decoupling in Pythia setups), not a mathematical claim derived from inputs by construction. The 95% exact-match drop is presented as a measured experimental byproduct, not a prediction forced by fitting. Any self-citations are not load-bearing for the central claims, which rest on the reported experimental controls and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the newly introduced latent capability resurfacing hypothesis and on the validity of unreported experimental measurements; no free parameters are introduced and no new physical entities are postulated.

axioms (1)

domain assumption: Language models can be fine-tuned on text generated unconditionally from the BOS token.
Invoked when defining the minimal self-training regime.

invented entities (1)

latent capability resurfacing hypothesis (no independent evidence)
purpose: To explain why self-training succeeds only under source-student compatibility
Newly proposed to account for the reported relational property and the decoupling observation.

pith-pipeline@v0.9.1-grok · 5808 in / 1343 out tokens · 27208 ms · 2026-06-28T22:47:32.706801+00:00 · methodology
