Archives

All the articles I've archived.

2026 1
April 1
  • From Code Points to Subwords: Building a Byte-Level BPE Tokenizer

    A walk through tokenization as taught in Stanford's CS336, from Unicode primitives to a full byte-level BPE tokenizer with encode/decode. Covers why UTF-8 wins out over other Unicode encodings, how BPE, originally a compression algorithm, carries over to language modeling, and the data structures that make the merge loop tractable on a 10 GB corpus.

2025 3
August 3