DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
The data provenance initiative: A large scale audit of dataset licensing & attribution in ai
4 Pith papers cite this work. Polarity classification is still indexing.
4
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
roles
background 1polarities
unclear 1representative citing papers
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
LicenseGPT fine-tuned on 500 expert-annotated licenses raises prediction agreement to 64.30% and cuts per-license analysis time by 94.44% from 108s to 6s in lawyer user studies.