Tag: bpe
All the articles with the tag "bpe".
- From Code Points to Subwords: Building a Byte-Level BPE Tokenizer
A walkthrough of tokenization as taught in Stanford's CS336, from Unicode primitives to a full byte-level BPE tokenizer with encode and decode. Covers why UTF-8 wins, how BPE, originally a compression algorithm, generalizes to language modeling, and the data structures that make the merge loop tractable on a 10 GB corpus.