Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models

Land, Sander, Bartolo, Max · 2024 · DOI 10.18653/v1/2024.emnlp-main.649

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

other 1

citation-polarity summary

unclear 1

representative citing papers

MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

MinGram is a simplified Unigram tokenizer training method that prioritizes token count minimization to deliver higher compression than BPE and standard Unigram while retaining competitive morphological alignment and superior bits-per-byte performance in language model training.

Chatbots Output Meaningful (but Problematic) Language

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

LLM outputs are meaningful according to standard theories of human language, without requiring anthropomorphic assumptions about the models.

Tokenization with Split Trees

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

ToaST uses vocabulary-independent split trees and integer programming to produce tokenizers with over 11% fewer tokens than BPE, WordPiece, and UnigramLM while improving 1.5B-parameter LM scores on CORE.

ReTokSync: Self-Synchronizing Tokenization Disambiguation for Generative Linguistic Steganography

cs.CR · 2026-04-28 · unverdicted · novelty 7.0

ReTokSync resolves tokenization ambiguity in generative linguistic steganography via targeted self-synchronizing resets, achieving over 99.7% extraction accuracy and 100% recovery with an auxiliary channel while matching baseline security and quality.

Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

The paper characterizes deductive stereotyping in LLMs and introduces Fair-GCG to discover injection phrases that improve fairness across benchmarks, reasoning, and real-world tasks.

Inside the LLM Word Factory

cs.CL · 2026-06-07 · unverdicted · novelty 6.0

Activation patching localizes English detokenization in Llama2-7B to a two-stage attention-then-MLP process at layer 1 that generalizes to 12 models from 8 families, with depth varying by positional encoding, plus an early-layer probe achieving 0.94-0.97 AUROC.

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

Vocabulary adaptation via targeted token addition and replacement improves semantic similarity, domain word usage, and training efficiency for LLM summarization in legal and medical domains.

Toten: A Knowledge-Based System For Structure-Preserving Representation Of Physical Quantities And Technical Notation In Brazilian Portuguese

cs.AI · 2026-06-17 · unverdicted · novelty 5.0

TOTEN is a knowledge-based system for structure-preserving representation of physical quantities and technical notation in Brazilian Portuguese using an ontology of engineering entities and external authorities, outperforming statistical baselines in atomicity and reconstruction.

Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

cs.CL · 2026-06-12 · unverdicted · novelty 5.0

A 355M-parameter byte-level LM on 80B multilingual tokens exhibits UTF-8 validity converging after 4.2B tokens versus 2.1B for perplexity, with higher validity on rare characters than common ones.

citing papers explorer

Showing 9 of 9 citing papers.

MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment cs.CL · 2026-06-25 · unverdicted · none · ref 29
Chatbots Output Meaningful (but Problematic) Language cs.CL · 2026-06-02 · unverdicted · none · ref 98
Tokenization with Split Trees cs.CL · 2026-05-21 · unverdicted · none · ref 56
ReTokSync: Self-Synchronizing Tokenization Disambiguation for Generative Linguistic Steganography cs.CR · 2026-04-28 · unverdicted · none · ref 12
Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG cs.CL · 2026-06-30 · unverdicted · none · ref 92
Inside the LLM Word Factory cs.CL · 2026-06-07 · unverdicted · none · ref 17
Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization cs.CL · 2026-05-17 · unverdicted · none · ref 64
Toten: A Knowledge-Based System For Structure-Preserving Representation Of Physical Quantities And Technical Notation In Brazilian Portuguese cs.AI · 2026-06-17 · unverdicted · none · ref 22
Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models cs.CL · 2026-06-12 · unverdicted · none · ref 9
