Industrial-scale LLMs require over 150B tokens for long-context continual pre-training to reach intrinsic saturation, with perplexity and retrieval-head attention providing stronger signals than needle-in-a-haystack tests.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
- Revealing the Learning Dynamics of Long-Context Continual Pre-training