Choosing a Tokenization Algorithm
Brot supports five tokenization algorithms. The three subword algorithms (BPE, WordPiece, Unigram) handle an open vocabulary by splitting rare words into smaller pieces. Word-level and character-level tokenization are simpler alternatives.
BPE (Byte Pair Encoding)
BPE starts with individual characters and iteratively merges the most frequent adjacent pairs. The merge rules, learned during training, define how text is split. Used by GPT-2, GPT-3/4, RoBERTa, and LLaMA.
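To make the merge procedure concrete, here is a minimal standalone sketch of how an ordered list of merge rules can be applied to a string. It uses only the OCaml standard library and is not Brot's implementation: the name bpe_encode is hypothetical, the input is treated as a sequence of bytes, and there is no pre-tokenization, dropout, or unknown-token handling.

```ocaml
(* Illustrative sketch only, not Brot's implementation.
   Merge rules are ordered by priority: earlier rules are applied first. *)
let merges =
  [ ("h", "e"); ("l", "l"); ("ll", "o"); ("he", "llo");
    ("w", "o"); ("r", "l"); ("rl", "d"); ("wo", "rld") ]

let bpe_encode (merges : (string * string) list) (text : string) : string list =
  (* Rank of each merge rule; a lower rank wins. *)
  let rank = Hashtbl.create 16 in
  List.iteri (fun i pair -> Hashtbl.replace rank pair i) merges;
  (* Find the adjacent pair with the lowest rank, if any pair has a rule. *)
  let rec best_pair best = function
    | a :: (b :: _ as rest) ->
        let best =
          match Hashtbl.find_opt rank (a, b), best with
          | Some r, Some (r', _) when r < r' -> Some (r, (a, b))
          | Some r, None -> Some (r, (a, b))
          | _ -> best
        in
        best_pair best rest
    | _ -> best
  in
  (* Replace every occurrence of the adjacent pair (a, b) by a ^ b. *)
  let rec merge_pair (a, b) = function
    | x :: y :: rest when x = a && y = b -> (a ^ b) :: merge_pair (a, b) rest
    | x :: rest -> x :: merge_pair (a, b) rest
    | [] -> []
  in
  (* Keep merging until no adjacent pair matches a rule. *)
  let rec loop symbols =
    match best_pair None symbols with
    | None -> symbols
    | Some (_, pair) -> loop (merge_pair pair symbols)
  in
  (* Start from single bytes. *)
  loop (List.init (String.length text) (fun i -> String.make 1 text.[i]))

let pieces = bpe_encode merges "hello world"
(* [ "hello"; " "; "world" ] *)
```

The sketch always picks the applicable rule that was learned earliest, which is why the order of the merge list matters as much as its contents.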
Constructor: Brot.bpe. Trainer: Brot.train_bpe.
Key parameters:
- vocab_size: target vocabulary size (default: 30000)
- min_frequency: minimum pair frequency for merging (default: 0)
- dropout: probability of skipping merges for data augmentation
- byte_fallback: use <0x00> byte tokens instead of the unknown token
- continuing_subword_prefix: prefix for non-initial subwords
- end_of_word_suffix: suffix marking word boundaries (e.g., </w>)
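As a rough illustration of what byte fallback produces, the sketch below spells out each byte of a string as a <0xNN> token. The helper name byte_fallback_tokens is hypothetical, and the two-digit uppercase hex format is an assumption based on the <0x00> example above; Brot's exact token spelling may differ.

```ocaml
(* Illustrative sketch only: render each byte of [s] as a "<0xNN>" token.
   The uppercase, two-digit hex format is an assumption. *)
let byte_fallback_tokens (s : string) : string list =
  List.init (String.length s) (fun i ->
      Printf.sprintf "<0x%02X>" (Char.code s.[i]))

(* "\xC3\xA9" is the UTF-8 encoding of U+00E9, so it yields two byte tokens. *)
let tokens = byte_fallback_tokens "\xC3\xA9"
(* [ "<0xC3>"; "<0xA9>" ] *)
```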
open Brot
let tokenizer =
  bpe
    ~vocab:
      [ ("h", 0); ("e", 1); ("l", 2); ("o", 3); (" ", 4); ("w", 5);
        ("r", 6); ("d", 7); ("he", 8); ("ll", 9); ("llo", 10);
        ("hello", 11); ("wo", 12); ("rl", 13); ("rld", 14); ("world", 15) ]
    ~merges:
      [ ("h", "e"); ("l", "l"); ("ll", "o"); ("he", "llo");
        ("w", "o"); ("r", "l"); ("rl", "d"); ("wo", "rld") ]
    ()
let enc = encode tokenizer "hello world"
let tokens = Encoding.tokens enc (* [| "hello"; " "; "world" |] *)
Training BPE:
open Brot
let tokenizer =
  train_bpe ~vocab_size:80 ~min_frequency:1 ~show_progress:false
    (`Seq (List.to_seq
       [ "The quick brown fox jumps over the lazy dog";
         "The dog barked at the brown fox";
         "Quick brown foxes are rare and beautiful" ]))
let size = vocab_size tokenizer
let enc = encode tokenizer "The brown fox"
WordPiece
WordPiece uses a greedy longest-match-first algorithm. For each word, it
finds the longest prefix in the vocabulary, then continues with the
remainder prefixed by a continuation marker (default: ##). Used by BERT,
DistilBERT, and Electra.
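The greedy longest-match procedure can be sketched in a few lines of standard-library OCaml. This is an illustration under simplifying assumptions, not Brot's implementation: wordpiece_encode is a hypothetical name, the vocabulary is a plain string list, the input is a single pre-split word, and a word with any unmatched part becomes the unknown token as a whole.

```ocaml
(* Illustrative sketch only, not Brot's implementation. *)
let vocab = [ "[UNK]"; "the"; "cat"; "play"; "##ing"; "##ed"; "##s" ]

(* Greedy longest-match-first segmentation of one word. *)
let wordpiece_encode ?(prefix = "##") ?(unk = "[UNK]") vocab word =
  let n = String.length word in
  let in_vocab piece = List.mem piece vocab in
  (* Longest vocabulary piece starting at [start], shrinking from the right.
     Non-initial pieces are looked up with the continuation prefix. *)
  let rec longest start stop =
    if stop <= start then None
    else
      let piece = String.sub word start (stop - start) in
      let piece = if start > 0 then prefix ^ piece else piece in
      if in_vocab piece then Some (piece, stop) else longest start (stop - 1)
  in
  let rec go start acc =
    if start >= n then List.rev acc
    else
      match longest start n with
      | None -> [ unk ] (* any unmatched part makes the whole word unknown *)
      | Some (piece, stop) -> go stop (piece :: acc)
  in
  if n = 0 then [] else go 0 []

let playing = wordpiece_encode vocab "playing" (* [ "play"; "##ing" ] *)
let cats = wordpiece_encode vocab "cats"       (* [ "cat"; "##s" ] *)
let dog = wordpiece_encode vocab "dog"         (* [ "[UNK]" ] *)
```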
Constructor: Brot.wordpiece. Trainer: Brot.train_wordpiece.
Key parameters:
- vocab_size: target vocabulary size (default: 30000)
- continuing_subword_prefix: prefix for non-initial subwords (default: ##)
- max_input_chars_per_word: words longer than this become unknown (default: 100)
open Brot
let tokenizer =
  wordpiece
    ~vocab:
      [ ("[UNK]", 0); ("the", 1); ("cat", 2); ("play", 3);
        ("##ing", 4); ("##ed", 5); ("##s", 6); ("un", 7);
        ("##happi", 8); ("##ly", 9) ]
    ~pre:(Pre_tokenizer.whitespace ())
    ~decoder:(Decoder.wordpiece ())
    ~unk_token:"[UNK]" ()
let enc = encode tokenizer "the cat playing unhappily"
let tokens = Encoding.tokens enc
(* [| "the"; "cat"; "play"; "##ing"; "un"; "##happi"; "##ly" |] *)
let decoded = decode tokenizer (Encoding.ids enc)
(* "the cat playing unhappily" *)
Training WordPiece:
open Brot
let tokenizer =
  train_wordpiece ~vocab_size:80 ~show_progress:false
    (`Seq (List.to_seq
       [ "The quick brown fox jumps over the lazy dog";
         "The dog barked at the brown fox";
         "Quick brown foxes are rare and beautiful" ]))
let size = vocab_size tokenizer
let enc = encode tokenizer "The brown fox"
Unigram
Unigram uses probabilistic segmentation: given a vocabulary of subwords with log-probabilities, it finds the segmentation that maximizes the total likelihood. Training uses the EM algorithm to iteratively prune the vocabulary. Used by T5, ALBERT, mBART, and XLNet.
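The maximum-likelihood segmentation is typically found with a Viterbi-style dynamic program. The sketch below shows that idea with the standard library only; it is an assumption about the general technique rather than a description of Brot's internals. The name unigram_segment is hypothetical, the vocabulary is an association list of (piece, log-probability) pairs, and there is no unknown-token fallback.

```ocaml
(* Illustrative sketch only, not Brot's implementation.
   Returns the pieces and total log-probability of the best segmentation,
   or None if the text cannot be covered by vocabulary pieces. *)
let unigram_segment (vocab : (string * float) list) (text : string) =
  let n = String.length text in
  (* best.(i): best score for the first i bytes.
     back.(i): start offset of the last piece in that best segmentation. *)
  let best = Array.make (n + 1) neg_infinity in
  let back = Array.make (n + 1) (-1) in
  best.(0) <- 0.0;
  for stop = 1 to n do
    for start = 0 to stop - 1 do
      if best.(start) > neg_infinity then begin
        let piece = String.sub text start (stop - start) in
        match List.assoc_opt piece vocab with
        | Some score when best.(start) +. score > best.(stop) ->
            best.(stop) <- best.(start) +. score;
            back.(stop) <- start
        | _ -> ()
      end
    done
  done;
  if n > 0 && back.(n) < 0 then None
  else
    (* Walk the back-pointers from the end to recover the pieces. *)
    let rec collect stop acc =
      if stop = 0 then acc
      else
        let start = back.(stop) in
        collect start (String.sub text start (stop - start) :: acc)
    in
    Some (collect n [], best.(n))

let vocab =
  [ ("the", -1.0); ("th", -2.0); ("he", -2.0); ("at", -2.0);
    ("e", -2.5); ("t", -3.0); ("a", -3.0) ]

let whole = unigram_segment vocab "the"  (* Some ([ "the" ], -1.0) *)
let split = unigram_segment vocab "that" (* Some ([ "th"; "at" ], -4.0) *)
```

In the first call the single piece "the" (score -1.0) beats "th" + "e" (-4.5) and "t" + "he" (-5.0), which is the likelihood comparison the algorithm performs at every position.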
Constructor: Brot.unigram. Trainer: Brot.train_unigram.
Key parameters:
- vocab_size: target vocabulary size (default: 8000)
- shrinking_factor: fraction of vocabulary to retain per pruning round (default: 0.75)
- max_piece_length: maximum subword length (default: 16)
- n_sub_iterations: EM sub-iterations per pruning round (default: 2)
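To get a feel for how shrinking_factor interacts with vocab_size, the sketch below lists the vocabulary sizes a pruning loop would pass through if each round kept exactly that fraction and stopped at the target. This is a simplified model of the arithmetic, assumed for illustration: pruning_schedule is a hypothetical helper, and the real trainer's rounding and stopping rules may differ.

```ocaml
(* Illustrative sketch only: sizes after each pruning round, assuming every
   round keeps [shrinking_factor] of the current vocabulary (rounded down)
   and never goes below [target]. *)
let pruning_schedule ~initial ~target ~shrinking_factor =
  if shrinking_factor <= 0.0 || shrinking_factor >= 1.0 || target < 1 then
    invalid_arg "pruning_schedule";
  let rec go size acc =
    if size <= target then List.rev acc
    else
      let kept = int_of_float (float_of_int size *. shrinking_factor) in
      let next = max target kept in
      go next (next :: acc)
  in
  go initial []

let rounds = pruning_schedule ~initial:1000 ~target:500 ~shrinking_factor:0.75
(* [ 750; 562; 500 ] *)
```

Under this model a smaller factor reaches the target in fewer rounds but discards more candidates per round.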
Vocabulary entries are (token, score) pairs where each score is a log
probability, so scores are at most zero and a higher score marks a more likely piece:
open Brot
let tokenizer =
  unigram
    ~vocab:
      [ ("<unk>", 0.0); ("the", -1.0); ("cat", -1.5);
        ("th", -2.0); ("e", -2.5); ("c", -3.0); ("a", -3.0);
        ("t", -3.0); ("at", -2.0); ("he", -2.0);
        ("sat", -1.8); ("on", -1.5) ]
    ~unk_token:"<unk>" ()
let enc = encode tokenizer "the cat sat on"
Training Unigram:
open Brot
let tokenizer =
  train_unigram ~vocab_size:60 ~show_progress:false
    (`Seq (List.to_seq
       [ "The quick brown fox jumps over the lazy dog";
         "The dog barked at the brown fox";
         "Quick brown foxes are rare and beautiful" ]))
let size = vocab_size tokenizer
let enc = encode tokenizer "The brown fox"
Word-level
Word-level tokenization maps each word directly to a token ID. No subword splitting is performed — words not in the vocabulary are replaced by the unknown token.
Constructor: Brot.word_level. Trainer: Brot.train_wordlevel.
Best suited for small controlled vocabularies and prototyping. For production use with open vocabulary, prefer a subword algorithm.
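Conceptually, word-level tokenization is a dictionary lookup with a fallback. The standalone sketch below shows that idea with the standard library only; word_level_encode is a hypothetical name, words are split on single spaces rather than with Brot's pre-tokenizer, and the unknown token is assumed to be present in the vocabulary.

```ocaml
(* Illustrative sketch only, not Brot's implementation. *)
let vocab =
  [ ("[UNK]", 0); ("the", 1); ("cat", 2); ("sat", 3);
    ("on", 4); ("a", 5); ("mat", 6) ]

(* Map each space-separated word to its ID, or to the ID of [unk].
   Raises Not_found if [unk] itself is missing from the vocabulary. *)
let word_level_encode ~unk vocab text =
  let unk_id = List.assoc unk vocab in
  String.split_on_char ' ' text
  |> List.filter (fun w -> w <> "")
  |> List.map (fun w ->
         match List.assoc_opt w vocab with
         | Some id -> id
         | None -> unk_id)

let ids = word_level_encode ~unk:"[UNK]" vocab "the cat sat on a rug"
(* [ 1; 2; 3; 4; 5; 0 ] *)
```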
When no pre-tokenizer is specified, word_level defaults to
Pre_tokenizer.whitespace.
open Brot
let tokenizer =
  word_level
    ~vocab:
      [ ("[UNK]", 0); ("the", 1); ("cat", 2); ("sat", 3);
        ("on", 4); ("a", 5); ("mat", 6) ]
    ~unk_token:"[UNK]" ()
(* Known words get their IDs, unknown words become [UNK] *)
let enc = encode tokenizer "the cat sat on a rug"
let tokens = Encoding.tokens enc
(* [| "the"; "cat"; "sat"; "on"; "a"; "[UNK]" |] *)
let ids = Encoding.ids enc
(* [| 1; 2; 3; 4; 5; 0 |] *)
Character-level
Character-level tokenization maps each byte to a token with ID equal to its ordinal value. No vocabulary or training is needed.
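Because the mapping is defined per byte, the IDs can be reproduced with the standard library alone. The sketch below is an illustration of that mapping, with a hypothetical helper name byte_ids; it also shows that, under the per-byte reading of the description above, a character encoded as several UTF-8 bytes yields several IDs.

```ocaml
(* Illustrative sketch only: one ID per byte, equal to the byte's value. *)
let byte_ids (s : string) : int list =
  List.init (String.length s) (fun i -> Char.code s.[i])

let ascii = byte_ids "Hi!"          (* [ 72; 105; 33 ] *)
let accented = byte_ids "\xC3\xA9"  (* UTF-8 bytes of U+00E9: [ 195; 169 ] *)
```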
Constructor: Brot.chars.
Useful as a byte-level fallback or for models that operate directly on characters:
open Brot
let tokenizer = chars ()
let enc = encode tokenizer "Hi!"
let tokens = Encoding.tokens enc (* [| "H"; "i"; "!" |] *)
let ids = Encoding.ids enc (* [| 72; 105; 33 |] *)
Quick Reference
| Algorithm | Splitting strategy | Typical vocab | Notable models | Constructor | Trainer |
|---|---|---|---|---|---|
| BPE | Iterative merge of frequent pairs | 30K-50K | GPT-2, RoBERTa, LLaMA | bpe | train_bpe |
| WordPiece | Greedy longest-match with ## prefix | 30K | BERT, DistilBERT, Electra | wordpiece | train_wordpiece |
| Unigram | Probabilistic max-likelihood segmentation | 8K-32K | T5, ALBERT, mBART, XLNet | unigram | train_unigram |
| Word-level | Whole words, no splitting | Varies | Simple models | word_level | train_wordlevel |
| Character-level | Each byte is a token | 256 | Byte-level models | chars | none |