Choosing a Tokenization Algorithm

Brot supports five tokenization algorithms. The three subword algorithms (BPE, WordPiece, and Unigram) handle open vocabularies by splitting rare words into smaller pieces; word-level and character-level tokenization are simpler alternatives.

BPE (Byte Pair Encoding)

BPE starts with individual characters and iteratively merges the most frequent adjacent pairs. The merge rules, learned during training, define how text is split. Used by GPT-2, GPT-3/4, RoBERTa, and LLaMA.
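
To make the merge process concrete, here is a small, self-contained OCaml sketch of how an ordered list of merge rules turns a word's characters into subwords. It is illustrative only: it is not Brot's implementation, skips frequency counting and tie-breaking, and the merge rules shown are hypothetical.

(* Merge rules in the order they were learned (earlier = higher priority). *)
let merges = [ ("h", "e"); ("l", "l"); ("ll", "o"); ("he", "llo") ]

(* Apply one merge rule everywhere it matches in the symbol list. *)
let rec apply_merge (a, b) = function
  | x :: y :: rest when x = a && y = b -> (a ^ b) :: apply_merge (a, b) rest
  | x :: rest -> x :: apply_merge (a, b) rest
  | [] -> []

(* Encode a word: start from single characters, then fold the merges in
   learned order. *)
let bpe_encode word =
  let chars = List.init (String.length word) (fun i -> String.make 1 word.[i]) in
  List.fold_left (fun symbols rule -> apply_merge rule symbols) chars merges

let () = bpe_encode "hello" |> String.concat " " |> print_endline
(* prints: hello   (h e -> he, l l -> ll, ll o -> llo, he llo -> hello) *)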

Constructor: Brot.bpe. Trainer: Brot.train_bpe.

Key parameters:

  • vocab_size — target vocabulary size (default: 30000)
  • min_frequency — minimum pair frequency for merging (default: 0)
  • dropout — probability of skipping merges, for data augmentation (see the sketch after the example below)
  • byte_fallback — fall back to <0x00>-style byte tokens instead of the unknown token
  • continuing_subword_prefix — prefix for non-initial subwords
  • end_of_word_suffix — suffix marking word boundaries (e.g., </w>)

Constructing a BPE tokenizer from a fixed vocabulary and merge rules:

open Brot

let tokenizer =
  bpe
    ~vocab:
      [ ("h", 0); ("e", 1); ("l", 2); ("o", 3); (" ", 4); ("w", 5);
        ("r", 6); ("d", 7); ("he", 8); ("ll", 9); ("llo", 10);
        ("hello", 11); ("wo", 12); ("rl", 13); ("rld", 14); ("world", 15) ]
    ~merges:
      [ ("h", "e"); ("l", "l"); ("ll", "o"); ("he", "llo");
        ("w", "o"); ("r", "l"); ("rl", "d"); ("wo", "rld") ]
    ()

let enc = encode tokenizer "hello world"
let tokens = Encoding.tokens enc (* [| "hello"; " "; "world" |] *)
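
The optional parameters listed above combine with the same vocabulary and merges. The following is a sketch of enabling BPE-dropout for data augmentation; it assumes the bpe constructor accepts ~dropout directly, as listed under key parameters.

open Brot

(* Sketch only: assumes ~dropout is an optional argument of bpe.
   With probability 0.1 each merge is skipped at encoding time, so the
   same word can receive different subword splits across passes. *)
let augmenting_tokenizer =
  bpe ~dropout:0.1
    ~vocab:[ (* same vocabulary as above *) ]
    ~merges:[ (* same merges as above *) ]
    ()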

Training BPE:

open Brot

let tokenizer =
  train_bpe ~vocab_size:80 ~min_frequency:1 ~show_progress:false
    (`Seq (List.to_seq
       [ "The quick brown fox jumps over the lazy dog";
         "The dog barked at the brown fox";
         "Quick brown foxes are rare and beautiful" ]))

let size = vocab_size tokenizer
let enc = encode tokenizer "The brown fox"
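
Continuing from the training example above, printing the token pieces shows how the learned merges segment the sample; the exact splits depend on the tiny training corpus, so treat the output as indicative.

let () =
  Encoding.tokens enc |> Array.to_list |> String.concat " | " |> print_endline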

WordPiece

WordPiece uses a greedy longest-match-first algorithm. For each word, it finds the longest prefix in the vocabulary, then continues with the remainder prefixed by a continuation marker (default: ##). Used by BERT, DistilBERT, and Electra.
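
To make the matching rule concrete, here is a standalone OCaml sketch of greedy longest-match segmentation over a toy vocabulary. It is illustrative only and independent of Brot's actual implementation.

let wordpiece_split vocab word =
  let in_vocab s = List.mem s vocab in
  let n = String.length word in
  let rec go start acc =
    if start >= n then Some (List.rev acc)
    else
      (* Find the longest vocabulary piece starting at [start];
         non-initial pieces carry the ## continuation prefix. *)
      let rec longest len =
        if len = 0 then None
        else
          let piece = String.sub word start len in
          let piece = if start = 0 then piece else "##" ^ piece in
          if in_vocab piece then Some (piece, len) else longest (len - 1)
      in
      match longest (n - start) with
      | None -> None (* no piece matches: the whole word becomes unknown *)
      | Some (piece, len) -> go (start + len) (piece :: acc)
  in
  match go 0 [] with Some pieces -> pieces | None -> [ "[UNK]" ]

let () =
  wordpiece_split [ "play"; "##ing"; "un"; "##happy"; "##ly" ] "unhappily"
  |> String.concat " " |> print_endline
(* prints: un ##happy ##ly *)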

Constructor: Brot.wordpiece. Trainer: Brot.train_wordpiece.

Key parameters:

  • vocab_size — target vocabulary size (default: 30000)
  • continuing_subword_prefix — prefix for non-initial subwords (default: ##)
  • max_input_chars_per_word — words longer than this are mapped to the unknown token (default: 100)

Constructing a WordPiece tokenizer with an explicit vocabulary and decoder:

open Brot

let tokenizer =
  wordpiece
    ~vocab:
      [ ("[UNK]", 0); ("the", 1); ("cat", 2); ("play", 3);
        ("##ing", 4); ("##ed", 5); ("##s", 6); ("un", 7);
        ("##happy", 8); ("##ly", 9) ]
    ~pre:(Pre_tokenizer.whitespace ())
    ~decoder:(Decoder.wordpiece ())
    ~unk_token:"[UNK]" ()

let enc = encode tokenizer "the cat playing unhappily"
let tokens = Encoding.tokens enc
(* [| "the"; "cat"; "play"; "##ing"; "un"; "##happy"; "##ly" |] *)
let decoded = decode tokenizer (Encoding.ids enc)
(* "the cat playing unhappily" *)

Training WordPiece:

open Brot

let tokenizer =
  train_wordpiece ~vocab_size:80 ~show_progress:false
    (`Seq (List.to_seq
       [ "The quick brown fox jumps over the lazy dog";
         "The dog barked at the brown fox";
         "Quick brown foxes are rare and beautiful" ]))

let size = vocab_size tokenizer
let enc = encode tokenizer "The brown fox"

Unigram

Unigram uses probabilistic segmentation: given a vocabulary of subwords with log-probabilities, it finds the segmentation that maximizes the total likelihood. Training uses the EM algorithm to iteratively prune the vocabulary. Used by T5, ALBERT, mBART, and XLNet.
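
To illustrate the objective, here is a standalone OCaml sketch that finds the maximum-likelihood segmentation of a string under a toy (token, log-probability) vocabulary using dynamic programming. It is illustrative only and independent of Brot's implementation.

let best_segmentation vocab text =
  let n = String.length text in
  (* best.(i) holds (score, reversed pieces) for the best split of text[0..i) *)
  let best = Array.make (n + 1) None in
  best.(0) <- Some (0.0, []);
  for i = 0 to n - 1 do
    match best.(i) with
    | None -> ()
    | Some (score, pieces) ->
        List.iter
          (fun (piece, logp) ->
            let len = String.length piece in
            if i + len <= n && String.sub text i len = piece then
              let cand = (score +. logp, piece :: pieces) in
              match best.(i + len) with
              | Some (s, _) when s >= fst cand -> ()
              | _ -> best.(i + len) <- Some cand)
          vocab
  done;
  Option.map (fun (score, pieces) -> (score, List.rev pieces)) best.(n)

let () =
  let vocab = [ ("the", -1.0); ("th", -2.0); ("e", -2.5);
                ("cat", -1.5); ("c", -3.0); ("a", -3.0);
                ("t", -3.0); ("at", -2.0) ] in
  match best_segmentation vocab "thecat" with
  | Some (_, pieces) -> print_endline (String.concat " " pieces)
  | None -> print_endline "<no segmentation>"
(* prints: the cat   (score -2.5 beats every character-level split) *)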

Constructor: Brot.unigram. Trainer: Brot.train_unigram.

Key parameters:

  • vocab_size — target vocabulary size (default: 8000)
  • shrinking_factor — fraction of vocabulary to retain per pruning round (default: 0.75)
  • max_piece_length — maximum subword length (default: 16)
  • n_sub_iterations — EM sub-iterations per pruning round (default: 2)

Vocabulary entries are (token, score) pairs where scores are log probabilities (typically negative; scores closer to zero mean more likely tokens):

open Brot

let tokenizer =
  unigram
    ~vocab:
      [ ("<unk>", 0.0); ("the", -1.0); ("cat", -1.5);
        ("th", -2.0); ("e", -2.5); ("c", -3.0); ("a", -3.0);
        ("t", -3.0); ("at", -2.0); ("he", -2.0);
        ("sat", -1.8); ("on", -1.5) ]
    ~unk_token:"<unk>" ()

let enc = encode tokenizer "the cat sat on"

Training Unigram:

open Brot

let tokenizer =
  train_unigram ~vocab_size:60 ~show_progress:false
    (`Seq (List.to_seq
       [ "The quick brown fox jumps over the lazy dog";
         "The dog barked at the brown fox";
         "Quick brown foxes are rare and beautiful" ]))

let size = vocab_size tokenizer
let enc = encode tokenizer "The brown fox"
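
The pruning-related parameters can be supplied at training time. The following is a sketch only; it assumes train_unigram accepts the parameters listed above as optional labelled arguments.

open Brot

(* Sketch: assumes ~shrinking_factor, ~max_piece_length and
   ~n_sub_iterations are optional arguments of train_unigram. *)
let tuned =
  train_unigram ~vocab_size:60 ~shrinking_factor:0.75 ~max_piece_length:8
    ~n_sub_iterations:2 ~show_progress:false
    (`Seq (List.to_seq
       [ "The quick brown fox jumps over the lazy dog";
         "The dog barked at the brown fox" ]))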

Word-level

Word-level tokenization maps each word directly to a token ID. No subword splitting is performed — words not in the vocabulary are replaced by the unknown token.

Constructor: Brot.word_level. Trainer: Brot.train_wordlevel.

Best suited for small controlled vocabularies and prototyping. For production use with open vocabulary, prefer a subword algorithm.

When no pre-tokenizer is specified, word_level defaults to Pre_tokenizer.whitespace.

open Brot

let tokenizer =
  word_level
    ~vocab:
      [ ("[UNK]", 0); ("the", 1); ("cat", 2); ("sat", 3);
        ("on", 4); ("a", 5); ("mat", 6) ]
    ~unk_token:"[UNK]" ()

(* Known words get their IDs, unknown words become [UNK] *)
let enc = encode tokenizer "the cat sat on a rug"
let tokens = Encoding.tokens enc
(* [| "the"; "cat"; "sat"; "on"; "a"; "[UNK]" |] *)
let ids = Encoding.ids enc
(* [| 1; 2; 3; 4; 5; 0 |] *)
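
The section above shows only the constructor; the following training sketch assumes train_wordlevel takes the same arguments as the other train_* functions shown earlier.

open Brot

(* Sketch: assumes train_wordlevel mirrors train_bpe / train_wordpiece. *)
let trained =
  train_wordlevel ~vocab_size:50 ~show_progress:false
    (`Seq (List.to_seq
       [ "the cat sat on a mat";
         "the dog sat on the mat" ]))

let trained_enc = encode trained "the dog sat on a mat"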

Character-level

Character-level tokenization maps each byte to a token with ID equal to its ordinal value. No vocabulary or training is needed.

Constructor: Brot.chars.

Useful as a byte-level fallback or for models that operate directly on characters:

open Brot

let tokenizer = chars ()

let enc = encode tokenizer "Hi!"
let tokens = Encoding.tokens enc (* [| "H"; "i"; "!" |] *)
let ids = Encoding.ids enc       (* [| 72; 105; 33 |] *)

Quick Reference

Algorithm        Splitting strategy                         Typical vocab  Notable models             Constructor  Trainer
BPE              Iterative merge of frequent pairs          30K-50K        GPT-2, RoBERTa, LLaMA      bpe          train_bpe
WordPiece        Greedy longest-match with ## prefix        30K            BERT, DistilBERT, Electra  wordpiece    train_wordpiece
Unigram          Probabilistic max-likelihood segmentation  8K-32K         T5, ALBERT, mBART, XLNet   unigram      train_unigram
Word-level       Whole words, no splitting                  Varies         Simple models              word_level   train_wordlevel
Character-level  Each byte is a token                       256            Byte-level models          chars        (none)