# The Tokenization Pipeline
Brot processes text through up to 5 stages, each optional and independently configurable:
```text
text
 │
 ├─ 1. Normalizer     — clean and transform text
 ├─ 2. Pre-tokenizer  — split into pieces with byte offsets
 ├─ 3. Algorithm      — map pieces to token IDs (BPE, WordPiece, …)
 ├─ 4. Post-processor — add special tokens, set type IDs
 └─ 5. Decoder        — reverse the encoding back to text
 │
 ▼
Encoding.t (ids, tokens, offsets, masks, …)
```
Each stage is set when constructing the tokenizer. Omit any stage and it is skipped.
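For instance, a minimal sketch (using the `wordpiece` constructor and the `special` helper that appear in the examples later in this guide): only the pre-tokenizer and algorithm stages are configured, so normalization, post-processing, and decoding are skipped.

```ocaml
open Brot

(* Only stages 2 (pre-tokenizer) and 3 (algorithm) are configured;
   the other arguments are omitted, so those stages are skipped. *)
let tokenizer =
  wordpiece
    ~vocab:[ ("[UNK]", 0); ("hello", 1); ("world", 2) ]
    ~specials:(List.map special [ "[UNK]" ])
    ~pre:(Pre_tokenizer.whitespace ())
    ~unk_token:"[UNK]" ()

let enc = encode tokenizer "hello world"
let tokens = Encoding.tokens enc
(* No post-processor is configured, so no [CLS]/[SEP] are added *)
```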
## Normalization
Normalizers transform text before tokenization. They handle lowercasing, accent removal, Unicode normalization, whitespace cleanup, and model-specific preprocessing.
Available normalizers:
- Unicode: `nfc`, `nfd`, `nfkc`, `nfkd`
- Text transforms: `lowercase`, `strip_accents`, `strip`, `replace`, `prepend`
- Byte-level: `byte_level` (GPT-2 style byte-to-Unicode mapping)
- Model-specific: `bert` (clean text, CJK padding, optional lowercasing and accent stripping)
Compose normalizers with `sequence`:

```ocaml
open Brot

let n =
  Normalizer.sequence
    [ Normalizer.nfd; Normalizer.strip_accents; Normalizer.lowercase ]

let r1 = Normalizer.apply n "Café Résumé" (* "cafe resume" *)
let r2 = Normalizer.apply n "HELLO" (* "hello" *)
```
The BERT normalizer combines several transforms:
```ocaml
open Brot

let n = Normalizer.bert ~lowercase:true ()
(* Lowercases, cleans control characters, pads CJK *)

let r1 = Normalizer.apply n "Hello World" (* "hello world" *)
let r2 = Normalizer.apply n "Café" (* "cafe" *)
```
## Pre-tokenization
Pre-tokenizers split text into pieces before the algorithm runs. Each piece carries byte offsets into the original text. The algorithm then tokenizes each piece independently.
Available pre-tokenizers:
| Pre-tokenizer | Description |
|---|---|
| `whitespace ()` | Split on `\w+\|[^\w\s]+` (word chars grouped, non-word grouped) |
| `whitespace_split ()` | Split on whitespace (simplest) |
| `bert ()` | BERT-style: whitespace + punctuation isolation + CJK separation |
| `byte_level ()` | GPT-2 style byte-level encoding with regex splitting |
| `punctuation ()` | Separate punctuation from alphanumeric content |
| `split ~pattern ()` | Split on a literal string pattern |
| `char_delimiter c` | Split on a single character |
| `digits ()` | Split on digit boundaries |
| `metaspace ()` | Replace whitespace with a visible marker (SentencePiece) |
| `unicode_scripts ()` | Split on Unicode script boundaries |
| `fixed_length n` | Fixed-size character chunks |
Use `pre_tokenize` to inspect how a pre-tokenizer splits text. It returns a list of `(piece, (start_offset, end_offset))` pairs:
```ocaml
open Brot

let text = "Hello, world! How's it going?"

let whitespace_pieces =
  Pre_tokenizer.pre_tokenize (Pre_tokenizer.whitespace ()) text
(* [("Hello", (0, 5)); (",", (5, 6)); ("world", (7, 12)); ("!", (12, 13)); ...] *)

let bert_pieces =
  Pre_tokenizer.pre_tokenize (Pre_tokenizer.bert ()) text

let punct_pieces =
  Pre_tokenizer.pre_tokenize (Pre_tokenizer.punctuation ()) text
```
Compose pre-tokenizers with `sequence`. Each pre-tokenizer in the chain processes the pieces from the previous one:

```ocaml
open Brot

let pre =
  Pre_tokenizer.sequence
    [ Pre_tokenizer.whitespace_split (); Pre_tokenizer.digits () ]

let pieces = Pre_tokenizer.pre_tokenize pre "order 42 shipped"
(* [("order", _); ("4", _); ("2", _); ("shipped", _)] *)
```
## Tokenization Algorithms
The algorithm maps pre-tokenized pieces to token IDs using the vocabulary. Brot supports 5 algorithms:
| Algorithm | How it splits | Notable models |
|---|---|---|
| BPE | Iterative merge of most frequent pairs | GPT-2, GPT-3/4, RoBERTa, LLaMA |
| WordPiece | Greedy longest-match with `##` prefix | BERT, DistilBERT, Electra |
| Unigram | Probabilistic segmentation (max likelihood) | T5, ALBERT, mBART, XLNet |
| Word-level | Whole words, no subword splitting | Simple models, prototyping |
| Character-level | Each byte is a token | Byte-level fallback |
See Choosing an Algorithm for details on each algorithm, when to use it, and how to configure training.
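To make the WordPiece row concrete, here is a self-contained sketch of greedy longest-match segmentation with the `##` continuation prefix. This is illustrative plain OCaml, not Brot's actual implementation; a real WordPiece tokenizer also caps word length and makes the unknown token configurable.

```ocaml
(* Greedy longest-match WordPiece segmentation (illustrative sketch).
   At each position, take the longest vocabulary entry that matches;
   continuation pieces carry a "##" prefix. *)
let wordpiece_split vocab word =
  let n = String.length word in
  let rec go start acc =
    if start >= n then Some (List.rev acc)
    else
      (* Try the longest remaining substring first, shrinking until a match. *)
      let rec try_len len =
        if len = 0 then None
        else
          let piece = String.sub word start len in
          let cand = if start = 0 then piece else "##" ^ piece in
          if List.mem cand vocab then Some (cand, len) else try_len (len - 1)
      in
      match try_len (n - start) with
      | None -> None (* no piece matches: the whole word becomes [UNK] *)
      | Some (cand, len) -> go (start + len) (cand :: acc)
  in
  match go 0 [] with
  | Some pieces -> pieces
  | None -> [ "[UNK]" ]

let () =
  let vocab = [ "play"; "##ing"; "##ed" ] in
  assert (wordpiece_split vocab "playing" = [ "play"; "##ing" ]);
  assert (wordpiece_split vocab "played" = [ "play"; "##ed" ])
```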
## Post-processing
Post-processors add special tokens and set type IDs after tokenization. They handle model-specific requirements like `[CLS]`/`[SEP]` for BERT or `<s>`/`</s>` for RoBERTa.
Available post-processors:
- `bert ~sep ~cls ()` — `[CLS] A [SEP]` or `[CLS] A [SEP] B [SEP]`, type IDs 0/1
- `roberta ~sep ~cls ()` — `<s> A </s>` or `<s> A </s> </s> B </s>`, all type IDs 0
- `byte_level ()` — adjust offsets for byte-level encoding
- `template ~single ()` — custom template with `$A`, `$B`, and literal token placeholders
- `sequence processors` — chain multiple post-processors

For example, the BERT post-processor wraps both single sentences and sentence pairs:
```ocaml
open Brot

let tokenizer =
  wordpiece
    ~vocab:
      [ ("[UNK]", 0); ("[CLS]", 1); ("[SEP]", 2);
        ("the", 3); ("cat", 4); ("sat", 5); ("how", 6); ("are", 7); ("you", 8) ]
    ~specials:(List.map special [ "[UNK]"; "[CLS]"; "[SEP]" ])
    ~pre:(Pre_tokenizer.whitespace ())
    ~post:(Post_processor.bert ~cls:("[CLS]", 1) ~sep:("[SEP]", 2) ())
    ~decoder:(Decoder.wordpiece ())
    ~unk_token:"[UNK]" ()

(* Single sentence: [CLS] the cat sat [SEP] *)
let single = encode tokenizer "the cat sat"

(* Sentence pair: [CLS] the cat sat [SEP] how are you [SEP] *)
let pair = encode tokenizer ~pair:"how are you" "the cat sat"

(* type_ids: 0 for first sentence + [CLS]/[SEP], 1 for second + [SEP] *)
let type_ids = Encoding.type_ids pair
```
The `template` post-processor gives full control over the format. Use `$A` and `$B` as sequence placeholders, and literal token names in brackets. Append `:N` to set type IDs:
```ocaml
open Brot

let tokenizer =
  word_level
    ~vocab:[ ("[BOS]", 0); ("[EOS]", 1); ("hello", 2); ("world", 3) ]
    ~specials:(List.map special [ "[BOS]"; "[EOS]" ])
    ~pre:(Pre_tokenizer.whitespace ())
    ~post:
      (Post_processor.template
         ~single:"[BOS]:0 $A:0 [EOS]:0"
         ~pair:"[BOS]:0 $A:0 [EOS]:0 $B:1 [EOS]:1"
         ~special_tokens:[ ("[BOS]", 0); ("[EOS]", 1) ]
         ())
    ~unk_token:"[UNK]" ()

let enc = encode tokenizer "hello world"
let tokens = Encoding.tokens enc (* [| "[BOS]"; "hello"; "world"; "[EOS]" |] *)
let type_ids = Encoding.type_ids enc (* [| 0; 0; 0; 0 |] *)
```
## Decoding
Decoders reverse encoding-specific transformations to produce natural text from token strings. They operate on token strings (looked up from the vocabulary), not IDs.
Decoders fall into two categories:
- Per-token — transform each token independently: `bpe`, `byte_fallback`, `metaspace`
- Collapsing — process the entire token list as a whole: `byte_level`, `wordpiece`, `replace`, `strip`, `fuse`
This distinction matters when composing with `sequence`: per-token decoders pass a list of transformed tokens to the next decoder, while collapsing decoders produce a single result.
Available decoders:
| Decoder | Type | Description |
|---|---|---|
| `bpe ()` | Per-token | Strip end-of-word suffix, insert spaces |
| `byte_fallback ()` | Per-token | Convert `<0x41>`-style hex tokens to bytes |
| `metaspace ()` | Per-token | Convert metaspace markers to spaces |
| `byte_level ()` | Collapsing | Reverse GPT-2 byte-to-Unicode encoding |
| `wordpiece ()` | Collapsing | Strip `##` prefix, join subwords |
| `replace ~pattern ~by ()` | Collapsing | Replace literal pattern in joined text |
| `strip ()` | Collapsing | Remove leading/trailing characters |
| `fuse ()` | Collapsing | Concatenate all tokens with no delimiter |
| `ctc ()` | Per-token | CTC output decoding (deduplication, pad removal) |
```ocaml
open Brot

(* WordPiece decoder: strips ## prefix and joins subwords *)
let wp = Decoder.wordpiece ()

let text = Decoder.decode wp [ "[CLS]"; "play"; "##ing"; "cat"; "##s"; "[SEP]" ]
(* "[CLS] playing cats [SEP]" *)

(* Sequence of decoders *)
let seq =
  Decoder.sequence [ Decoder.fuse (); Decoder.replace ~pattern:"_" ~by:" " () ]

let text2 = Decoder.decode seq [ "_Hello"; "_world" ]
(* " Hello world" *)
```
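The same `sequence` combinator can also put a per-token decoder in front of a collapsing one. This sketch is hedged: it assumes the behavior described in the table above, namely that `byte_fallback` rewrites each `<0xNN>` token to its byte independently while leaving other tokens unchanged, and that `fuse` then concatenates the transformed list:

```ocaml
open Brot

(* byte_fallback (per-token) maps each <0xNN> token to its raw byte;
   fuse (collapsing) joins the transformed tokens into one string. *)
let seq = Decoder.sequence [ Decoder.byte_fallback (); Decoder.fuse () ]

let text = Decoder.decode seq [ "He"; "<0x6C>"; "<0x6C>"; "o" ]
(* "Hello" — 0x6C is the byte for 'l' *)
```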
When using `Brot.decode`, the tokenizer looks up token strings from the vocabulary and then applies the configured decoder automatically.
## Complete Example
Here is a complete BERT-style tokenizer using all 5 pipeline stages:
```ocaml
open Brot

let tokenizer =
  wordpiece
    (* 1. Normalizer: lowercase and clean text *)
    ~normalizer:(Normalizer.bert ~lowercase:true ())
    (* 2. Pre-tokenizer: BERT-style splitting *)
    ~pre:(Pre_tokenizer.bert ())
    (* 3. Algorithm: WordPiece with ## prefix *)
    ~vocab:
      [ ("[PAD]", 0); ("[UNK]", 1); ("[CLS]", 2); ("[SEP]", 3);
        ("the", 4); ("cat", 5); ("sat", 6); ("on", 7); ("mat", 8);
        ("play", 9); ("##ing", 10); ("##ed", 11); ("a", 12) ]
    ~specials:(List.map special [ "[PAD]"; "[UNK]"; "[CLS]"; "[SEP]" ])
    ~unk_token:"[UNK]" ~pad_token:"[PAD]"
    (* 4. Post-processor: add [CLS] and [SEP] *)
    ~post:(Post_processor.bert ~cls:("[CLS]", 2) ~sep:("[SEP]", 3) ())
    (* 5. Decoder: strip ## and join *)
    ~decoder:(Decoder.wordpiece ())
    ()

(* "The Cat Played On A Mat" is normalized to "the cat played on a mat" *)
let enc = encode tokenizer "The Cat Played On A Mat"

let tokens = Encoding.tokens enc
(* [| "[CLS]"; "the"; "cat"; "play"; "##ed"; "on"; "a"; "mat"; "[SEP]" |] *)

(* Decode back, skipping special tokens *)
let text = decode tokenizer ~skip_special_tokens:true (Encoding.ids enc)
(* "the cat played on a mat" *)
```