Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

Abinaya Mahendiran; Ahmet Üstün; Aisha Alaagib; Börje F. Karlsson; Daniel Dsouza; Deividas Mataciunas; Dominik Krzemiński; Emad A. Alghamdi; Freddie Vargus; Hakimeh Fadaei

arxiv: 2402.06619 · v1 · pith:EXHXNSIJnew · submitted 2024-02-09 · 💻 cs.CL · cs.AI

Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem Ergün, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya Guthikonda, Emad A. Alghamdi, Sebastian Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer, Ahmet Üstün, Marzieh Fadaee, Sara Hooker

keywords: datasets, language, collection, dataset, languages, bridge, existing, instances

Abstract

Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories
cs.CL 2026-06 unverdicted novelty 7.0

BabelJudge introduces a perturbation-based framework to audit LLM judges for position bias, verbosity bias, order inconsistency, and cross-lingual degradation without human preference labels.
Bayesian Model Merging
cs.LG 2026-05 unverdicted novelty 6.0

Bayesian Model Merging introduces a bi-level optimization framework that merges task-specific models via closed-form Bayesian regression with an anchor prior and global hyperparameter search, outperforming baselines a...
DynaMiCS: Fine-tuning LLMs with Performance Constraints using Dynamic Mixtures
cs.LG 2026-05 unverdicted novelty 6.0

DynaMiCS uses short probing runs to build a slope matrix of cross-domain effects and solves a constrained optimization over mixture weights to improve targets while respecting performance bounds on constrained domains.
COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
cs.LG 2026-04 unverdicted novelty 6.0

COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
DataComp-LM: In search of the next generation of training sets for language models
cs.LG 2024-06 unverdicted novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
StarCoder 2 and The Stack v2: The Next Generation
cs.SE 2024-02 accept novelty 6.0

StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.