An AI-ready, Polarized Electron-Positron Collision Dataset

Alaettin Serhan Mete; Benjamin Nachman; Chi Lung Cheng; Simon Corrodi; T. J. Hobbs

arxiv: 2606.00224 · v1 · pith:4P3VIF5Bnew · submitted 2026-05-29 · ✦ hep-ex · hep-ph

Pith reviewed 2026-06-28 19:29 UTC · model grok-4.3

classification ✦ hep-ex hep-ph

keywords: SLD experiment · polarized electron positron collisions · AI ready dataset · legacy data modernization · particle physics data release · machine learning applications · SLAC Linear Collider

The pith

A dataset of 660,000 polarized electron-positron collisions from the SLD experiment is now available in modern formats.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper releases a modernized version of data from the SLD experiment at the SLAC Linear Collider. It includes about 660,000 reconstructed events from 1996 to 1998 at a center-of-mass energy of 91.2 GeV with a polarized electron beam. The legacy data has been converted to current file formats using AI agents, and internal documents have been digitized. This effort aims to make the data usable for today's machine learning and particle physics research. The release comes with physics validation examples to show its reliability.

Core claim

The central claim is the presentation of an AI-ready dataset from the SLD experiment consisting of approximately 660,000 reconstructed events collected with a highly polarized electron beam at sqrt(s) ≈ 91.2 GeV, translated from legacy formats with the help of AI agents and accompanied by a corpus of newly digitized documentation.

What carries the argument

The AI-assisted translation of legacy data formats into modern widely-used file formats that preserves the original physics content and reconstruction quality.

If this is right

The dataset allows machine learning models to be trained and tested on real polarized collider data.
It enables new physics analyses that leverage both the original polarization and modern computational tools.
The digitized documentation supports detailed studies of the original reconstruction methods.
This provides a benchmark dataset for AI applications in high-energy physics with known beam polarization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Releasing other legacy datasets in similar AI-ready forms could broaden the scope of training data available for particle physics AI models.
The approach of using AI for data format translation might be applied to convert data from additional historical experiments.
Validation of the dataset against original publications could lead to improved methods for ensuring fidelity in data modernization projects.

Load-bearing premise

The translation of the data from legacy formats to modern ones by AI agents maintains the accuracy and completeness of the original physics measurements and event reconstruction.

What would settle it

Performing the same physics analysis, such as measuring the left-right asymmetry, on the new dataset and obtaining results consistent with the original SLD publications would support the claim; inconsistency would falsify it.
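The check described above can be sketched in a few lines. This is a minimal illustration, assuming the standard definitions: the raw asymmetry is (N_L − N_R)/(N_L + N_R), the left-right asymmetry is the raw asymmetry divided by the beam polarization P_e, and A_LR = 2x/(1 + x²) with x = 1 − 4 sin²θ_eff. The event counts and polarization value are hypothetical placeholders, not numbers from the paper or the dataset, and the uncertainty on P_e is ignored.

```python
import math

def left_right_asymmetry(n_left, n_right, polarization):
    """Return (A_LR, statistical uncertainty) from left/right event counts."""
    total = n_left + n_right
    a_meas = (n_left - n_right) / total
    # Binomial uncertainty on the raw asymmetry
    sigma_meas = math.sqrt((1.0 - a_meas**2) / total)
    # Dividing by the polarization scales both value and uncertainty
    return a_meas / polarization, sigma_meas / polarization

def sin2_theta_eff(a_lr):
    """Invert A_LR = 2x / (1 + x^2), with x = 1 - 4 sin^2(theta_eff)."""
    if a_lr == 0.0:
        return 0.25
    # Root of a_lr*x^2 - 2x + a_lr = 0 that satisfies |x| <= 1
    x = (1.0 - math.sqrt(1.0 - a_lr**2)) / a_lr
    return (1.0 - x) / 4.0

# Hypothetical counts and polarization, for illustration only
a_lr, sigma = left_right_asymmetry(56_000, 44_000, 0.75)
s2 = sin2_theta_eff(a_lr)
print(f"A_LR = {a_lr:.4f} +/- {sigma:.4f}, sin^2(theta_eff) = {s2:.5f}")
```

A comparison with the original SLD publications would then amount to asking whether the value obtained from the released files agrees with the published one within the combined uncertainties.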

Figures

Figures reproduced from arXiv: 2606.00224 by Alaettin Serhan Mete, Benjamin Nachman, Chi Lung Cheng, Simon Corrodi, T. J. Hobbs.

**Figure 2.** Reconstructed visible invariant-mass spectra: (a) hadronic …

**Figure 3.** Distribution of the event-shape variable …

**Figure 4.** Polar-angle distributions for the three leptonic channels in the 1997–1998 subset, separated by the sign of the …

**Figure 5.** Extracted values for the effective weak mixing, …

**Figure 6.** t-SNE projection of OmniLearned jet …

**Figure 7.** Representative pages illustrating the distinct failure modes of the four extractors. Each panel pairs a …

Original abstract

We present a modernized, AI-ready release of reconstructed data from the SLD experiment at the SLAC Linear Collider (SLC). The dataset comprises approximately 660{,}000 reconstructed events collected at $\sqrt{s}\approx 91.2$~GeV with a highly polarized electron beam from 1996--1998. The data have been translated from legacy formats into modern, widely-used file formats with the help of AI agents. The release also includes a corpus of newly digitized SLD internal documentation. We describe the contents of both components and provide physics validation demonstrations along with illustrations of their utility for physics and machine learning research in particle physics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This releases old SLD polarized data in modern formats via AI translation, which is practically useful but rests on unshown fidelity checks.

The main takeaway is a public release of roughly 660,000 reconstructed events from the 1996-1998 SLD run at the Z pole, now in current file formats, along with digitized internal documentation. The work uses AI agents to handle the legacy-to-modern conversion and includes some physics validation examples plus notes on ML use cases.

What stands out is the practical step of making polarized electron data available in AI-ready shape. Polarized samples at this energy are not common in public modern releases, so this could help groups testing algorithms on real detector output rather than simulation. Digitizing the old notes is also a straightforward win for reproducibility.

The weaker part is the translation step itself. The abstract claims validation demonstrations, yet the description does not include direct quantitative checks such as event-by-event agreement on track parameters, energy clusters, or standard observables like thrust between the original DST files and the new versions. Without those numbers it is hard to judge whether the AI process introduced offsets or losses. If the full paper has those comparisons, they need to be front and center; if not, the claim that the physics content is preserved stays untested at the level that matters for downstream analysis.

This paper is aimed at the intersection of high-energy physics and machine learning, especially people who want real polarized data for training or benchmarking. It is not a major theoretical advance, but data resources like this can be worth referee time if the validation is solid. I would send it out for review so the community can examine the fidelity evidence and decide on its reliability for actual use.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a modernized, AI-ready release of approximately 660,000 reconstructed events from the SLD experiment at the SLC, collected at √s ≈ 91.2 GeV with a highly polarized electron beam during 1996–1998. The data have been translated from legacy formats to modern file formats using AI agents, accompanied by a corpus of newly digitized SLD internal documentation; the paper describes the dataset contents and provides physics validation demonstrations for use in particle physics and machine learning research.

Significance. If the AI-assisted translation is demonstrated to preserve the original reconstruction fidelity without introducing unquantified systematics, the release would provide a rare publicly available polarized e⁺e⁻ dataset at the Z pole, enabling new AI/ML studies on polarization observables and legacy data revival that are otherwise inaccessible due to outdated formats.

major comments (1)

[Abstract and validation section] The central claim that the released dataset is 'AI-ready' and retains the original physics content rests on the fidelity of the AI-driven format translation, yet the described physics validation demonstrations do not include quantitative event-by-event comparisons (e.g., matching of track parameters, calorimeter clusters, thrust, or acollinearity) against the original SLD DST or micro-DST records; without such metrics, any systematic offsets introduced by the AI agents remain unquantified and undermine usability for precision analyses.
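To make the requested check concrete, here is a minimal sketch of an event-by-event comparison. It assumes, hypothetically, that both the legacy records and the translated records can be loaded as dictionaries keyed by (run, event) with per-event observables as values; the function name `compare_events`, the field names, and the numbers are illustrative only and do not come from the paper or its tooling.

```python
def compare_events(original, translated, fields, tol=1e-6):
    """Compare two {(run, event): {field: value}} maps event by event."""
    missing = sorted(set(original) - set(translated))  # lost in translation
    extra = sorted(set(translated) - set(original))    # not in the original
    max_diff = {f: 0.0 for f in fields}
    mismatched = []
    for key in sorted(set(original) & set(translated)):
        for f in fields:
            diff = abs(original[key][f] - translated[key][f])
            max_diff[f] = max(max_diff[f], diff)
            if diff > tol:
                mismatched.append((key, f, diff))
    return {"missing": missing, "extra": extra,
            "max_diff": max_diff, "mismatched": mismatched}

# Hypothetical toy records: one event is shifted, one is missing
original = {(1, 1): {"thrust": 0.95}, (1, 2): {"thrust": 0.80}, (1, 3): {"thrust": 0.70}}
translated = {(1, 1): {"thrust": 0.95}, (1, 2): {"thrust": 0.81}}
report = compare_events(original, translated, ["thrust"])
print(report)
```

Reporting the missing-event list and the per-field maximum difference separately distinguishes losses from offsets, which are the two failure modes the comment raises.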

minor comments (1)

[Abstract] The abstract uses non-standard comma formatting in '660{,}000'; adopt conventional scientific notation such as 660000 or 6.6×10^5.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback emphasizing the need to rigorously demonstrate fidelity of the AI-assisted translation. We address the single major comment below.

Referee: [Abstract and validation section] The central claim that the released dataset is 'AI-ready' and retains the original physics content rests on the fidelity of the AI-driven format translation, yet the described physics validation demonstrations do not include quantitative event-by-event comparisons (e.g., matching of track parameters, calorimeter clusters, thrust, or acollinearity) against the original SLD DST or micro-DST records; without such metrics, any systematic offsets introduced by the AI agents remain unquantified and undermine usability for precision analyses.

Authors: We agree that event-by-event quantitative comparisons to the original SLD DST/micro-DST records would provide the strongest possible evidence that no systematic offsets were introduced during translation. However, the legacy reconstruction chain and original data formats are no longer executable on modern systems, precluding such direct matching. The existing validation instead demonstrates consistency of key physics observables (thrust, acollinearity, lepton identification) with published SLD results and Monte Carlo expectations. We will revise the validation section and abstract to (i) add quantitative distribution-level comparisons (means, widths, and efficiencies) against historical SLD publications for track parameters and calorimeter clusters, and (ii) explicitly state the limitations on event-by-event fidelity metrics. Revision: partial.
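A distribution-level comparison of the kind proposed in this response could look like the sketch below. It is an illustration under stated assumptions, not the authors' procedure: the helper names `summarize` and `pull`, the sample values, and the reference mean are all hypothetical, and the pull treats the sample and reference uncertainties as independent.

```python
import math
import statistics

def summarize(values):
    """Mean, width (sample standard deviation), and error on the mean."""
    n = len(values)
    mean = statistics.fmean(values)
    width = statistics.stdev(values)
    return {"n": n, "mean": mean, "width": width,
            "mean_err": width / math.sqrt(n)}

def pull(summary, reference_mean, reference_err=0.0):
    """Difference from a reference value in units of combined uncertainty."""
    combined = math.hypot(summary["mean_err"], reference_err)
    return (summary["mean"] - reference_mean) / combined

# Hypothetical observable values from the translated files
sample = [0.90, 0.92, 0.94, 0.96, 0.98]
stats = summarize(sample)
# Hypothetical published reference value, taken here with no uncertainty
p = pull(stats, reference_mean=0.92)
print(stats, p)
```

A pull much larger than a few units for any observable would flag a possible offset introduced in translation, although agreement at this level cannot exclude event-level errors that cancel on average.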

standing simulated objections not resolved

Direct event-by-event matching to the original SLD DST records remains outstanding; the authors state that it is impossible because the legacy reconstruction software is no longer executable on modern systems.

Circularity Check

0 steps flagged

Data release paper contains no derivations, predictions, or fitted quantities

The manuscript is a data-release note describing translation of legacy SLD files into modern formats with AI assistance and supplying validation demonstrations. No equations, parameter fits, or predictive claims appear that could reduce to their own inputs by construction. The central assertion (fidelity of the translated dataset) is supported by external validation steps rather than by self-referential definitions or self-citations. Consequently the circularity score is 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data release and format conversion paper; it introduces no free parameters, mathematical axioms, or invented physical entities.

Reference graph

Works this paper leans on

18 extracted references · 9 canonical work pages · 3 internal anchors

SLC Design Group, SLC Design Handbook, Tech. Rep. SLAC-R-714, SLAC National Accelerator Laboratory, 1984.
SLD Collaboration, K. Abe et al., A High precision measurement of the left-right Z boson cross-section asymmetry, Phys. Rev. Lett. 84 (2000) 5945–5949, [hep-ex/0004026].
SLD Collaboration, SLD Design Report, Tech. Rep. SLAC-R-273, SLAC National Accelerator Laboratory, 1984.
SLD Collaboration, K. Abe et al., Design and performance of the SLD vertex detector: A 307 Mpixel tracking system, Nucl. Instrum. Meth. A 400 (1997) 287–343.
G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová, Machine learning and the physical sciences, Rev. Mod. Phys. 91 (2019), no. 4 045002, [arXiv:1903.10563].
C. L. Cheng, B. Nachman, A. S. Mete, T. Hobbs, and S. Corrodi, “SLD reconstructed mini-DSTs from the 1996–1997 SLC runs.” Zenodo dataset, 2026. https://zenodo.org/records/19925960
C. L. Cheng, “jazelle_reader: A Python toolkit for translating SLD Jazelle binaries to AI-friendly formats.” https://github.com/HEP-KE/jazelle_reader, 2026.
J. Pivarski et al., “Awkward Array: Manipulating JSON-like data with NumPy-like idioms.” Zenodo, 2024.
SLD Collaboration, K. Abe et al., An improved direct measurement of leptonic coupling asymmetries with polarized Z bosons, Phys. Rev. Lett. 86 (2001) 1162–1166, [hep-ex/0010015].
E. Farhi, A QCD Test for Jets, Phys. Rev. Lett. 39 (1977) 1587–1588.
Particle Data Group Collaboration, S. Navas et al., Review of particle physics, Phys. Rev. D 110 (2024), no. 3 030001.
W. Bhimji, C. Harris, V. Mikuni, and B. Nachman, Foundation model framework for all tasks involving jet physics, Phys. Rev. D 113 (2026), no. 3 032020, [arXiv:2510.24066].
S. Catani, Y. L. Dokshitzer, M. Olsson, G. Turnock, and B. R. Webber, New clustering algorithm for multi-jet cross sections in e+e− annihilation, Phys. Lett. B 269 (1991) 432–438.
Y. Chen, L. Heinrich, R. Tornqvist, et al., Open data from the ALEPH experiment at the LEP e+e− collider, arXiv preprint (2021) [arXiv:2107.10847].
H1 Collaboration, Long-term data preservation and analysis at H1, EPJ Web Conf. 251 (2021) 02001.
H. Qu, C. Li, and S. Qian, Particle Transformer for jet tagging, Proc. Mach. Learn. Res. 162 (2022) 18281–18292, [arXiv:2202.03772].
L. van der Maaten and G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.
V. Paruchuri, “Marker: Fast and highly accurate pdf to markdown/json.” https://github.com/datalab-to/marker, 2023.
C. Auer, M. Lysak, A. Nassar, M. Dolfi, N. Livathinos, P. Vagenas, C. B. Ramis, M. Omenetti, F. Lindlbauer, K. Dinkla, L. Mishra, Y. Kim, S. Gupta, R. T. de Lima, V. Weber, L. Morin, I. Meijer, V. Kuropiatnyk, and P. W. J. Staar, Docling technical report, arXiv preprint arXiv:2408.09869 (2024).
L. Blecher, G. Cucurull, T. Scialom, and R. Stojnic, Nougat: Neural optical understanding for academic documents, arXiv preprint arXiv:2308.13418 (2023).
Microsoft, “Azure ai document intelligence.” https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/, 2024.
Anthropic, Model context protocol, 2024. Accessed: March 29, 2025.
