DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
Comparing bad apples to good oranges: Aligning large language models via joint preference optimization
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.
Automated hate speech detectors show poor alignment with heterogeneous in-group judgments on reclaimed slur usage, driven by low inter-annotator agreement and contextual features like derogatory intent.
citing papers explorer
-
DataComp-LM: In search of the next generation of training sets for language models
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
-
VideoPhy: Evaluating Physical Commonsense for Video Generation
VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.
-
IYKYK (But AI Doesn't): Automated Content Moderation Does Not Capture Communities' Heterogeneous Attitudes Towards Reclaimed Language
Automated hate speech detectors show poor alignment with heterogeneous in-group judgments on reclaimed slur usage, driven by low inter-annotator agreement and contextual features like derogatory intent.