pith. sign in

Cross-Tokenizer LLM Distillation through a Byte-Level Interface

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it
abstract

Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with--and on several benchmarks surpasses--significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.

citation-role summary

background 1

citation-polarity summary

fields

cs.CL 1 cs.LG 1

years

2026 2

roles

background 1

polarities

background 1

representative citing papers

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

cs.LG · 2026-05-20 · conditional · novelty 7.0

X-Token proposes projection-guided P-KL and H-KL losses to fix uncommon-token suppression and over-conservative matching in logit-based cross-tokenizer distillation, yielding gains over GOLD on Llama-3.2-1B.

citing papers explorer

Showing 2 of 2 citing papers.

  • X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation cs.LG · 2026-05-20 · conditional · none · ref 20 · internal anchor

    X-Token proposes projection-guided P-KL and H-KL losses to fix uncommon-token suppression and over-conservative matching in logit-based cross-tokenizer distillation, yielding gains over GOLD on Llama-3.2-1B.

  • SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation cs.CL · 2026-05-08 · unverdicted · none · ref 26 · 2 links · internal anchor

    SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.