Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures
7 Pith papers cite this work.
citing papers explorer
- MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment
  MinGram is a simplified Unigram tokenizer training method that prioritizes token count minimization to deliver higher compression than BPE and standard Unigram while retaining competitive morphological alignment and superior bits-per-byte performance in language model training.
- "Chi nas dal soch el sent de legn" -- Auditing Text Corpora for Lombard
  Manual audit shows web-scraped Lombard corpora are largely noisy and biased toward Western varieties over Eastern ones.
- StarCoder 2 and The Stack v2: The Next Generation
  StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
- The Falcon Series of Open Language Models
  Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
  BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
- PortBERT: Navigating the Depths of Portuguese Language Models
  PortBERT releases two RoBERTa models for Portuguese that match or beat prior monolingual and multilingual models on translated GLUE/SuperGLUE tasks while reporting training and inference times.
- Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild
  Hy-MT2 presents three new multilingual translation models that claim to outperform listed open-source and commercial systems on diverse tasks while enabling low-storage on-device use.