Schmidt, Chris Tanner, and Yuval Pinter

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Tokenization with Split Trees

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

ToaST uses vocabulary-independent split trees and integer programming to produce tokenizers with over 11% fewer tokens than BPE, WordPiece, and UnigramLM while improving 1.5B-parameter LM scores on CORE.

citing papers explorer

Showing 2 of 2 citing papers.

  • MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment cs.CL · 2026-06-25 · unverdicted · none · ref 2

    MinGram is a simplified Unigram tokenizer training method that prioritizes token count minimization to deliver higher compression than BPE and standard Unigram while retaining competitive morphological alignment and superior bits-per-byte performance in language model training.

  • Tokenization with Split Trees cs.CL · 2026-05-21 · unverdicted · none · ref 73

    ToaST uses vocabulary-independent split trees and integer programming to produce tokenizers with over 11% fewer tokens than BPE, WordPiece, and UnigramLM while improving 1.5B-parameter LM scores on CORE.