MimeLens uses position-agnostic BERT encoders pretrained on random-offset binary windows to output one of 125 libmagic MIME labels, beating Magika on full files and enabling accurate classification on mid-file fragments.
ByT5: Towards a token-free future with pre-trained byte-to-byte models
4 Pith papers cite this work.
citing papers explorer
- YOMI-Bench: A Benchmark for Evaluating Kanji Reading and Phonological Understanding of LLMs for Japanese
YOMI-Bench is a new benchmark of four tasks for kanji reading and phonological understanding in LLMs, showing low performance even for Japanese-specific and commercial models.
- The Tokenizer Tax Across 25 European Languages: Domain Invariance, Cross-Lingual Few-Shot Effects, and the Ukrainian Penalty
Tokenizer fertility varies 2.5x across 25 European languages with domain-invariant rankings, morphological fragmentation in high-fertility cases, and a Ukrainian penalty from pre-training underrepresentation.
- PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming average human performance on BIG-bench.