Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

· 2026 · cs.AI · arXiv 2604.10480

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of \textbf{data lineage} to the LLM ecosystem and propose an automated multi-agent framework to reconstruct the evolutionary graph of dataset development. Through large-scale lineage analysis, we characterize domain-specific structural patterns, such as vertical refinement in math-oriented datasets and horizontal aggregation in general-domain corpora. Moreover, we uncover pervasive systemic issues, including \textit{structural redundancy} induced by implicit dataset intersections and the \textit{propagation of benchmark contamination} along lineage paths. To demonstrate the practical value of lineage analysis for data construction, we leverage the reconstructed lineage graph to create a \textit{lineage-aware diversity-oriented dataset}. By anchoring instruction sampling at upstream root sources, this approach mitigates downstream homogenization and hidden redundancy, yielding a more diverse post-training corpus. We further highlight lineage-centric analysis as an efficient and robust topological alternative to sample-level dataset comparison for large-scale data ecosystems. By grounding data construction in explicit lineage structures, our work advances post-training data curation toward a more systematic and controllable paradigm.

representative citing papers

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

ModSleuth reconstructs dependency graphs from public artifacts for four LLM releases, recovering 1,060 source-verified dependencies and exposing license issues, train-evaluation coupling, and documentation gaps.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs cs.CL · 2026-06-10 · unverdicted · none · ref 34 · internal anchor
ModSleuth reconstructs dependency graphs from public artifacts for four LLM releases, recovering 1,060 source-verified dependencies and exposing license issues, train-evaluation coupling, and documentation gaps.

Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

fields

years

verdicts

representative citing papers

citing papers explorer