Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures

Pedro Javier Ortiz Suárez, Benoît Sagot, Laurent Romary · 2019 · DOI 10.14618/ids-pub-9021

7 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

MinGram is a simplified Unigram tokenizer training method that prioritizes token count minimization to deliver higher compression than BPE and standard Unigram while retaining competitive morphological alignment and superior bits-per-byte performance in language model training.

"Chi nas dal soch el sent de legn" -- Auditing Text Corpora for Lombard

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Manual audit shows web-scraped Lombard corpora are largely noisy and biased toward Western varieties over Eastern ones.

StarCoder 2 and The Stack v2: The Next Generation

cs.SE · 2024-02-29 · accept · novelty 6.0

StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

cs.CL · 2022-11-09 · unverdicted · novelty 6.0

BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

PortBERT: Navigating the Depths of Portuguese Language Models

cs.CL · 2026-06-01 · unverdicted · novelty 3.0

PortBERT releases two RoBERTa models for Portuguese that match or beat prior monolingual and multilingual models on translated GLUE/SuperGLUE tasks while reporting training and inference times.

Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild

cs.CL · 2026-05-21 · unverdicted · novelty 3.0

Hy-MT2 presents three new multilingual translation models that claim to outperform listed open-source and commercial systems on diverse tasks while enabling low-storage on-device use.

citing papers explorer

Showing 7 of 7 citing papers.

MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment · cs.CL · 2026-06-25 · unverdicted · none · ref 26
"Chi nas dal soch el sent de legn" -- Auditing Text Corpora for Lombard · cs.CL · 2026-06-04 · unverdicted · none · ref 73
StarCoder 2 and The Stack v2: The Next Generation · cs.SE · 2024-02-29 · accept · none · ref 248
The Falcon Series of Open Language Models · cs.CL · 2023-11-28 · conditional · none · ref 26
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model · cs.CL · 2022-11-09 · unverdicted · none · ref 295
PortBERT: Navigating the Depths of Portuguese Language Models · cs.CL · 2026-06-01 · unverdicted · none · ref 56
Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild · cs.CL · 2026-05-21 · unverdicted · none · ref 63
