The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

Longpre, S · 2023 · arXiv 2310.16787

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

Rank-Aware Hyperbolic Alignment for Vision-Language Dataset Distillation

cs.CV · 2026-06-28 · unverdicted · novelty 6.0

RAHA applies rank-aware hyperbolic alignment to vision-language dataset distillation by enforcing geodesic alignment in the shared low-rank range and regularizing the residual subspace for improved transfer.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

StarCoder 2 and The Stack v2: The Next Generation

cs.SE · 2024-02-29 · accept · novelty 6.0

StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

LicenseGPT: A Fine-tuned Foundation Model for Publicly Available Dataset License Compliance

cs.SE · 2024-12-30 · unverdicted · novelty 5.0

LicenseGPT fine-tuned on 500 expert-annotated licenses raises prediction agreement to 64.30% and cuts per-license analysis time by 94.44% from 108s to 6s in lawyer user studies.

The ATOM Report: Measuring the Open Language Model Ecosystem

cs.CY · 2026-04-08

citing papers explorer

Showing 5 of 5 citing papers.

Rank-Aware Hyperbolic Alignment for Vision-Language Dataset Distillation · cs.CV · 2026-06-28 · unverdicted · none · ref 42
RAHA applies rank-aware hyperbolic alignment to vision-language dataset distillation by enforcing geodesic alignment in the shared low-rank range and regularizing the residual subspace for improved transfer.
DataComp-LM: In search of the next generation of training sets for language models · cs.LG · 2024-06-17 · unverdicted · none · ref 114
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
StarCoder 2 and The Stack v2: The Next Generation · cs.SE · 2024-02-29 · accept · none · ref 227
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
LicenseGPT: A Fine-tuned Foundation Model for Publicly Available Dataset License Compliance · cs.SE · 2024-12-30 · unverdicted · none · ref 60
LicenseGPT fine-tuned on 500 expert-annotated licenses raises prediction agreement to 64.30% and cuts per-license analysis time by 94.44% from 108s to 6s in lawyer user studies.
The ATOM Report: Measuring the Open Language Model Ecosystem · cs.CY · 2026-04-08 · unreviewed · ref 1
