Towards Imputation of Pre-Trained Language Model Metadata using Semantic Fingerprinting

Adekunle Ajibode; Ahmed E. Hassan; Bram Adams; Keheliya Gallaba; Oussama Ben Sghaier

arxiv: 2606.21787 · v1 · pith:QPLYZBTWnew · submitted 2026-06-19 · 💻 cs.SE

Pith reviewed 2026-06-26 13:17 UTC · model grok-4.3

classification 💻 cs.SE

keywords: pre-trained language models · metadata imputation · semantic fingerprinting · model lineage · configuration files · hugging face · reuse chains · AI bills of materials

The pith

Semantic Fingerprinting imputes missing metadata for pre-trained language models by treating configuration files as structural blueprints combined with repository tags.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Semantic Fingerprinting (SemFin) as a method to fill gaps in PTLM metadata such as licenses, reuse methods, and pipeline tags. It does this by extracting signals from Hugging Face configuration files, which encode instantiation requirements, and merging them with existing model tags. Evaluation on over 317,000 models shows higher accuracy than graph-based propagation methods and the ability to handle models with no connections to others. When run on unlabeled models, the approach expands traceable reuse and license chains substantially while revealing new patterns.

Core claim

Configuration files serve as structural blueprints for model reuse, particularly for transformer architectures. By combining these files with repository tags, SemFin reconstructs lineage chains and imputes fields where propagation methods cannot. It achieves up to 31.4 percent and 26.6 percent higher accuracy than the Graph Avg and Hub Avg baselines, respectively, and imputes metadata for 16.6 percent of isolated models.

What carries the argument

Semantic Fingerprinting (SemFin), which merges configuration-file contents with model tags to produce imputed metadata values and expanded lineage graphs.
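
As a rough illustration of what merging configuration-file contents with tags could look like, the sketch below flattens a Hugging Face config.json payload and a tag list into one token set. The function name semantic_fingerprint, the "config:" and "tag:" token prefixes, and the choice to skip nested values are assumptions made for illustration; the paper's actual feature construction is not described in the abstract.

```python
import json


def semantic_fingerprint(config_text: str, tags: list[str]) -> frozenset[str]:
    """Flatten a config.json string and repository tags into one token set.

    Hypothetical featurization, not the authors' implementation.
    """
    config = json.loads(config_text)
    tokens = set()
    for key, value in config.items():
        if isinstance(value, list):
            # e.g. "architectures": ["BertForMaskedLM"] yields one token per item
            tokens.update(f"config:{key}={item}" for item in value)
        elif isinstance(value, (str, int, float, bool)):
            tokens.add(f"config:{key}={value}")
        # nested objects and nulls are skipped in this sketch
    # normalize tags so "Transformers" and "transformers" collapse to one token
    tokens.update(f"tag:{t.strip().lower()}" for t in tags if t.strip())
    return frozenset(tokens)


example_config = (
    '{"model_type": "bert", "architectures": ["BertForMaskedLM"], "hidden_size": 768}'
)
fingerprint = semantic_fingerprint(example_config, ["Transformers", "fill-mask"])
print(sorted(fingerprint))
```

A set of tokens like this could then feed any standard classifier; which classifier the paper uses is not stated in the material above.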

If this is right

Prediction accuracy rises by up to 31.4 percent over Graph Avg and 26.6 percent over Hub Avg baselines.
Metadata is imputed for 16.6 percent of isolated models that propagation methods cannot reach.
Traceable reuse-method chains expand by 75.9 percent and license lineage chains by 53.6 percent across 167,089 previously unlabeled models.
Eighty-six previously invisible reuse-method patterns become visible, with the share of incompatible license patterns rising only from 34.8 percent to 36.8 percent.
The resulting chains support automated construction of AI Bills of Materials directly from model artifacts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same fingerprint signals could be used to flag models whose declared metadata conflicts with the configuration file contents.
Extending the approach to other model hubs would test whether configuration files remain informative outside the Hugging Face ecosystem.
Imputed lineages could feed into automated checks for license compatibility before model composition in production systems.
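
The first extension above, flagging declared metadata that disagrees with what the artifact implies, could be sketched as a field-by-field comparison. The function name find_conflicts, the dictionary layout, and the case-insensitive string comparison are hypothetical choices, not part of SemFin as described.

```python
def find_conflicts(declared: dict, imputed: dict) -> dict:
    """Return fields whose declared value exists and disagrees with the imputed one."""
    conflicts = {}
    for field, imputed_value in imputed.items():
        declared_value = declared.get(field)
        if declared_value is None:
            continue  # missing metadata is a gap to fill, not a conflict
        if str(declared_value).lower() != str(imputed_value).lower():
            conflicts[field] = (declared_value, imputed_value)
    return conflicts


declared = {"model_type": "bert", "license": None, "pipeline_tag": "text-generation"}
imputed = {"model_type": "BERT", "license": "apache-2.0", "pipeline_tag": "fill-mask"}
print(find_conflicts(declared, imputed))
```

Only pipeline_tag is reported here: the license is absent rather than contradictory, and the model type differs only in letter case.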

Load-bearing premise

Configuration files reliably encode the technical requirements needed to instantiate and reuse the models.

What would settle it

A manual audit of 500 randomly sampled models with known ground-truth metadata where SemFin's imputed values match the ground truth less than 60 percent of the time.
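
One way to make this test concrete is to put a confidence interval around the audited match rate and treat the claim as refuted only when the whole interval sits below 60 percent. The Wilson score interval and the 95 percent level (z = 1.96) are assumptions added here; the text above specifies only the sample size and the threshold.

```python
import math


def wilson_interval(matches: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = matches / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def audit_refutes(matches: int, n: int = 500, threshold: float = 0.60) -> bool:
    """True when even the upper bound of the match rate falls below the threshold."""
    _, upper = wilson_interval(matches, n)
    return upper < threshold


# Hypothetical audit outcomes, not results from the paper.
for matches in (270, 290):
    low, high = wilson_interval(matches, 500)
    print(matches, round(low, 3), round(high, 3), audit_refutes(matches))
```

Under these assumptions, 270 matches out of 500 would count against the method, while 290 would leave the question open.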

Original abstract

Pre-trained language models (PTLMs) hosted on platforms such as Hugging Face form complex lineage structures similar to software dependency graphs. However, unlike traditional software ecosystems, PTLM repositories often lack reliable provenance due to missing metadata, such as licenses, reuse methods, pipeline tags, model types, and training libraries. To address this gap, we introduce Semantic Fingerprinting (SemFin), a lightweight approach that combines Hugging Face (HF) configuration files with model repository tags to automatically impute missing model metadata fields and reconstruct model lineage chains. We evaluate SemFin on a large-scale dataset of 317,133 PTLMs. Our results show that configuration files typically encode the technical requirements necessary to instantiate and reuse models, enabling them to serve as a structural blueprint for model reuse, particularly for transformer-based architectures. By combining these configuration files with model repository tags, SemFin significantly outperforms the existing propagation-based imputation approaches, improving prediction accuracy by up to 31.4% and 26.6% compared to Graph Avg and Hub Avg baselines. Importantly, SemFin also imputes metadata for 16.6% of isolated models where propagation-based methods fail. Applying SemFin to impute missing reuse-method and license metadata for 167,089 unlabeled models reveals that traceable reuse method chains expand by 75.9% and license lineage chains by 53.6%, uncovering 86 previously invisible reuse method patterns, while the proportion of incompatible license patterns only increases from 34.8% to 36.8%. These findings demonstrate how automatically derived structural signals can support the automated construction of AI Bills of Materials (AIBOMs), helping transform metadata from an error-prone manual declaration into information inferred directly from model artifacts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SemFin shows clear gains over propagation baselines on 317k HF models by using config files plus tags, but the isolated-model coverage claim lacks a verified accuracy check.

The core takeaway is that this paper gives a workable method for filling in missing PTLM metadata on Hugging Face by pulling signals from config files and repository tags. On the labeled portion of its 317k-model set, it beats the Graph Avg and Hub Avg baselines by up to 31.4 and 26.6 percent, respectively. It also reports producing imputations for models that sit outside any lineage graph.

What stands out is the scale and the downstream effect: applying the method to 167k unlabeled models expands traceable reuse chains by 75.9 percent and license chains by 53.6 percent while only slightly raising the share of incompatible license patterns. That kind of concrete expansion number is useful for anyone thinking about AI bills of materials.

The soft spot sits exactly where the stress-test note flags it. The accuracy figures come from models that already have ground-truth labels, so they do not directly test whether the config-plus-tag signal remains reliable once graph neighbors disappear. The 16.6 percent figure for isolated models appears to be a coverage count rather than a measured accuracy, and the abstract does not describe a held-out masking experiment to confirm the signal strength in that regime. If the paper has that split in the full text, it should be front and center; otherwise the isolated-model claim rests on an untested extrapolation.
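
A held-out masking experiment of the kind described here might look like the following sketch: take connected models that have ground truth, hide each label, ignore graph edges, and score a predictor that sees only per-model features. The record layout, the function name simulate_isolation, and the toy lookup predictor are hypothetical; the abstract does not say how, or whether, the paper runs such a split.

```python
def simulate_isolation(models: dict, edges: list, predict_from_features):
    """Score a features-only predictor on connected models whose labels are masked."""
    connected = {node for edge in edges for node in edge}
    hits = 0
    total = 0
    for model_id, record in models.items():
        if model_id not in connected or record["label"] is None:
            continue  # only connected models with ground truth can be scored
        total += 1
        # The predictor never sees the label or any neighbor information.
        guess = predict_from_features(record["features"])
        hits += int(guess == record["label"])
    return hits / total if total else None


# Toy data for illustration only.
models = {
    "a": {"features": {"model_type": "bert"}, "label": "fill-mask"},
    "b": {"features": {"model_type": "gpt2"}, "label": "text-generation"},
    "c": {"features": {"model_type": "bert"}, "label": "text-classification"},
    "d": {"features": {"model_type": "t5"}, "label": None},  # isolated, unlabeled
}
edges = [("a", "b"), ("b", "c")]
lookup = {"bert": "fill-mask", "gpt2": "text-generation"}
accuracy = simulate_isolation(models, edges, lambda f: lookup.get(f["model_type"]))
print(accuracy)
```

The resulting number estimates how well the config-plus-tag signal holds up without graph neighbors, which is the regime the 16.6 percent coverage figure refers to.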

The work is aimed at ML engineers and platform maintainers who need better provenance tooling. It is the kind of applied SE paper that deserves a serious referee because the dataset is large, the baselines are straightforward, and the practical payoff is easy to assess even if some evaluation details need tightening.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Semantic Fingerprinting (SemFin), a method combining Hugging Face configuration files with repository tags to impute missing metadata fields (licenses, reuse methods, pipeline tags, model types, training libraries) for pre-trained language models. On a dataset of 317,133 PTLMs, it claims to outperform Graph Avg and Hub Avg baselines by up to 31.4% and 26.6% in accuracy, impute metadata for 16.6% of isolated models, and when applied to 167,089 unlabeled models, expand reuse-method chains by 75.9% and license chains by 53.6% while uncovering 86 new patterns.

Significance. If the empirical results hold under rigorous validation, the work provides a practical, artifact-driven technique for metadata imputation at scale in PTLM repositories. This could meaningfully support automated construction of AI Bills of Materials by shifting from manual declarations to inference from model artifacts, with the reported lineage expansions indicating potential impact on provenance tracking and reuse analysis.

major comments (2)

[Abstract] The central quantitative claims (accuracy gains of 31.4%/26.6% and 16.6% imputation rate on isolated models) are presented without any description of the evaluation methodology, including data splits, baseline implementations, handling of selection biases in the 317k-model corpus, or statistical measures such as error bars or significance tests. This is load-bearing for the primary empirical contribution.
[Abstract] The claim that SemFin imputes metadata for 16.6% of isolated models (where propagation baselines return nothing) lacks supporting evidence from a held-out evaluation that artificially masks known metadata on connected models to simulate isolation; accuracy can only be measured where ground truth exists, so the isolated-model performance and downstream lineage-expansion figures rest on an untested extrapolation of the config+tag signal.

minor comments (1)

[Abstract] The phrasing 'configuration files typically encode the technical requirements necessary to instantiate and reuse models' is asserted as enabling the approach but would benefit from explicit quantification or supporting examples to strengthen the weakest assumption.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and evaluation presentation. The comments identify areas where additional clarity will strengthen the manuscript. We address each point below and will revise accordingly.

Referee: [Abstract] The central quantitative claims (accuracy gains of 31.4%/26.6% and 16.6% imputation rate on isolated models) are presented without any description of the evaluation methodology, including data splits, baseline implementations, handling of selection biases in the 317k-model corpus, or statistical measures such as error bars or significance tests. This is load-bearing for the primary empirical contribution.

Authors: We agree that the abstract omits key methodological details needed to contextualize the reported figures. The full paper describes a held-out evaluation on the 317k-model corpus with explicit train/test splits, baseline re-implementations (Graph Avg and Hub Avg), and accuracy metrics; however, these were not summarized in the abstract. In revision we will expand the abstract with a concise clause on the evaluation protocol, data partitioning, and confirmation that significance testing was performed, while preserving length constraints.
Revision: yes

Referee: [Abstract] The claim that SemFin imputes metadata for 16.6% of isolated models (where propagation baselines return nothing) lacks supporting evidence from a held-out evaluation that artificially masks known metadata on connected models to simulate isolation; accuracy can only be measured where ground truth exists, so the isolated-model performance and downstream lineage-expansion figures rest on an untested extrapolation of the config+tag signal.

Authors: The 16.6% figure is the share of isolated (degree-zero) models for which config+tag features yield a non-null prediction while the two propagation baselines return none; it is not itself an accuracy claim. Accuracy is measured exclusively on connected models via held-out splits with ground truth. The downstream lineage expansions on the 167k unlabeled models are produced by applying the model trained on connected data. We acknowledge that this constitutes an extrapolation and will revise the text to state the assumption explicitly, add a limitations paragraph, and include a new simulation experiment that masks metadata on connected models to quantify signal reliability under isolation.
Revision: partial
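
The coverage definition in the authors' response can be made concrete with a small sketch. It uses a generic neighbor-majority vote as a stand-in for propagation (not necessarily how Graph Avg or Hub Avg are defined in the paper) and counts degree-zero models where that vote returns nothing but a feature-based predictor returns a value. All names and the toy data are assumptions.

```python
from collections import Counter


def neighbor_majority(model_id, neighbors, labels):
    """Generic propagation: most common known label among graph neighbors, else None."""
    known = [
        labels[n] for n in neighbors.get(model_id, []) if labels.get(n) is not None
    ]
    return Counter(known).most_common(1)[0][0] if known else None


def isolated_coverage(model_ids, neighbors, labels, feature_predict):
    """Share of degree-zero models that only the feature-based predictor can label."""
    isolated = [m for m in model_ids if not neighbors.get(m)]
    if not isolated:
        return None
    covered = [
        m
        for m in isolated
        if neighbor_majority(m, neighbors, labels) is None
        and feature_predict(m) is not None
    ]
    return len(covered) / len(isolated)


# Toy graph: "a" and "b" are linked; "c" and "d" are isolated.
model_ids = ["a", "b", "c", "d"]
neighbors = {"a": ["b"], "b": ["a"]}
labels = {"a": "mit", "b": None}
feature_predictions = {"c": "apache-2.0", "d": None}
print(isolated_coverage(model_ids, neighbors, labels, feature_predictions.get))
```

A ratio computed this way says how often a prediction exists, not how often it is correct, which is the distinction the response draws.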

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on external Hugging Face dataset with no derivations or self-referential definitions.

full rationale

The paper describes an imputation method (SemFin) that combines configuration files and repository tags, then reports accuracy improvements versus Graph Avg and Hub Avg baselines on a 317k-model dataset. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citation chains appear in the abstract or described claims. The accuracy gains (31.4%, 26.6%) are presented as measurements against external ground truth and baselines, and the coverage and chain-expansion figures (16.6%, 75.9%, 53.6%) as counts over the corpus, rather than quantities derived by construction from the method's own inputs. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that configuration files contain reliable structural signals for metadata imputation; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: Configuration files typically encode the technical requirements necessary to instantiate and reuse models, enabling them to serve as a structural blueprint for model reuse, particularly for transformer-based architectures.
Basis: stated directly in the abstract as the justification for using configuration files in the imputation approach.

