The Tokenization Pipeline

Brot processes text through up to 5 stages, each optional and independently configurable:

text
 │
 ├─ 1. Normalizer       — clean and transform text
 ├─ 2. Pre-tokenizer    — split into pieces with byte offsets
 ├─ 3. Algorithm        — map pieces to token IDs (BPE, WordPiece, …)
 ├─ 4. Post-processor   — add special tokens, set type IDs
 ├─ 5. Decoder          — reverse the encoding back to text
 │
 ▼
Encoding.t (ids, tokens, offsets, masks, …)

Each stage is set when constructing the tokenizer. Omit any stage and it is skipped.
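
For example, here is a minimal sketch of a tokenizer that configures only the pre-tokenizer and the algorithm, using the word_level constructor and accessors shown later in this guide; the vocabulary and the expected IDs are illustrative:

open Brot

(* Only the pre-tokenizer and algorithm stages are configured; the
   normalizer, post-processor and decoder are omitted and skipped. *)
let bare =
  word_level
    ~vocab:[ ("[UNK]", 0); ("hello", 1); ("world", 2) ]
    ~specials:(List.map special [ "[UNK]" ])
    ~pre:(Pre_tokenizer.whitespace ())
    ~unk_token:"[UNK]" ()

let enc = encode bare "hello world"
let ids = Encoding.ids enc (* likely [| 1; 2 |]; no special tokens are added *)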

Normalization

Normalizers transform text before tokenization. They handle lowercasing, accent removal, Unicode normalization, whitespace cleanup, and model-specific preprocessing.

Available normalizers:

  • Unicode: nfc, nfd, nfkc, nfkd
  • Text transforms: lowercase, strip_accents, strip, replace, prepend
  • Byte-level: byte_level (GPT-2 style byte-to-Unicode mapping)
  • Model-specific: bert (clean text, CJK padding, optional lowercasing and accent stripping)

Compose normalizers with sequence:

open Brot

let n =
  Normalizer.sequence
    [ Normalizer.nfd; Normalizer.strip_accents; Normalizer.lowercase ]

let r1 = Normalizer.apply n "Café Résumé" (* "cafe resume" *)
let r2 = Normalizer.apply n "HELLO"        (* "hello" *)

The BERT normalizer combines several transforms:

open Brot

let n = Normalizer.bert ~lowercase:true ()
(* Lowercases, cleans control characters, pads CJK *)
let r1 = Normalizer.apply n "Hello World" (* "hello world" *)
let r2 = Normalizer.apply n "Café"        (* "cafe" *)

Pre-tokenization

Pre-tokenizers split text into pieces before the algorithm runs. Each piece carries byte offsets into the original text. The algorithm then tokenizes each piece independently.

Available pre-tokenizers:

Pre-tokenizer            Description
whitespace ()            Split on \w+|[^\w\s]+ (word chars grouped, non-word grouped)
whitespace_split ()      Split on whitespace (simplest)
bert ()                  BERT-style: whitespace + punctuation isolation + CJK separation
byte_level ()            GPT-2 style byte-level encoding with regex splitting
punctuation ()           Separate punctuation from alphanumeric content
split ~pattern ()        Split on a literal string pattern
char_delimiter c         Split on a single character
digits ()                Split on digit boundaries
metaspace ()             Replace whitespace with a visible marker (SentencePiece)
unicode_scripts ()       Split on Unicode script boundaries
fixed_length n           Fixed-size character chunks

Use pre_tokenize to inspect how a pre-tokenizer splits text. It returns a list of (piece, (start_offset, end_offset)) pairs:

open Brot

let text = "Hello, world! How's it going?"

let whitespace_pieces =
  Pre_tokenizer.pre_tokenize (Pre_tokenizer.whitespace ()) text
(* [("Hello", (0,5)); (",", (5,6)); ("world", (7,12)); ("!", (12,13)); ...] *)

let bert_pieces =
  Pre_tokenizer.pre_tokenize (Pre_tokenizer.bert ()) text

let punct_pieces =
  Pre_tokenizer.pre_tokenize (Pre_tokenizer.punctuation ()) text
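
The metaspace pre-tokenizer from the table above replaces whitespace with a visible marker, SentencePiece-style. A rough sketch, assuming the default ▁ marker (the exact marker and offsets may differ):

open Brot

let meta_pieces =
  Pre_tokenizer.pre_tokenize (Pre_tokenizer.metaspace ()) "Hello world"
(* likely [("▁Hello", _); ("▁world", _)] *)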

Compose pre-tokenizers with sequence. Each pre-tokenizer in the chain processes the pieces from the previous one:

open Brot

let pre =
  Pre_tokenizer.sequence
    [ Pre_tokenizer.whitespace_split (); Pre_tokenizer.digits () ]

let pieces = Pre_tokenizer.pre_tokenize pre "order 42 shipped"
(* [("order", _); ("4", _); ("2", _); ("shipped", _)] *)

Tokenization Algorithms

The algorithm maps pre-tokenized pieces to token IDs using the vocabulary. Brot supports 5 algorithms:

Algorithm        How it splits                                 Notable models
BPE              Iterative merge of most frequent pairs        GPT-2, GPT-3/4, RoBERTa, LLaMA
WordPiece        Greedy longest-match with ## prefix           BERT, DistilBERT, Electra
Unigram          Probabilistic segmentation (max likelihood)   T5, ALBERT, mBART, XLNet
Word-level       Whole words, no subword splitting             Simple models, prototyping
Character-level  Each byte is a token                          Byte-level fallback

See Choosing an Algorithm for details on each algorithm, when to use it, and how to configure training.
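
As a quick illustration of subword splitting, here is a tiny WordPiece tokenizer built with the wordpiece constructor used elsewhere in this guide; the vocabulary and expected output are illustrative:

open Brot

(* WordPiece greedily matches the longest known piece, marking
   word-internal pieces with the ## prefix. *)
let tok =
  wordpiece
    ~vocab:[ ("[UNK]", 0); ("play", 1); ("##ing", 2); ("##ed", 3) ]
    ~specials:(List.map special [ "[UNK]" ])
    ~pre:(Pre_tokenizer.whitespace ())
    ~unk_token:"[UNK]" ()

let enc = encode tok "playing played"
let tokens = Encoding.tokens enc (* likely [| "play"; "##ing"; "play"; "##ed" |] *)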

Post-processing

Post-processors add special tokens and set type IDs after tokenization. They handle model-specific requirements like [CLS]/[SEP] for BERT or <s>/</s> for RoBERTa.

Available post-processors:

  • bert ~sep ~cls () — [CLS] A [SEP] or [CLS] A [SEP] B [SEP], type IDs 0/1
  • roberta ~sep ~cls () — <s> A </s> or <s> A </s> </s> B </s>, all type IDs 0
  • byte_level () — adjust offsets for byte-level encoding
  • template ~single () — custom template with $A, $B, and literal token placeholders
  • sequence processors — chain multiple post-processors

For example, a WordPiece tokenizer configured with the bert post-processor:

open Brot

let tokenizer =
  wordpiece
    ~vocab:
      [ ("[UNK]", 0); ("[CLS]", 1); ("[SEP]", 2);
        ("the", 3); ("cat", 4); ("sat", 5); ("how", 6); ("are", 7); ("you", 8) ]
    ~specials:(List.map special [ "[UNK]"; "[CLS]"; "[SEP]" ])
    ~pre:(Pre_tokenizer.whitespace ())
    ~post:(Post_processor.bert ~cls:("[CLS]", 1) ~sep:("[SEP]", 2) ())
    ~decoder:(Decoder.wordpiece ())
    ~unk_token:"[UNK]" ()

(* Single sentence: [CLS] the cat sat [SEP] *)
let single = encode tokenizer "the cat sat"

(* Sentence pair: [CLS] the cat sat [SEP] how are you [SEP] *)
let pair = encode tokenizer ~pair:"how are you" "the cat sat"
(* type_ids: 0 for first sentence + [CLS]/[SEP], 1 for second + [SEP] *)
let type_ids = Encoding.type_ids pair

The template post-processor gives full control over the format. Use $A and $B as sequence placeholders, and literal token names in brackets. Append :N to set type IDs:

open Brot

let tokenizer =
  word_level
    ~vocab:
      [ ("[BOS]", 0); ("[EOS]", 1); ("hello", 2); ("world", 3) ]
    ~specials:(List.map special [ "[BOS]"; "[EOS]" ])
    ~pre:(Pre_tokenizer.whitespace ())
    ~post:
      (Post_processor.template
         ~single:"[BOS]:0 $A:0 [EOS]:0"
         ~pair:"[BOS]:0 $A:0 [EOS]:0 $B:1 [EOS]:1"
         ~special_tokens:[ ("[BOS]", 0); ("[EOS]", 1) ]
         ())
    ~unk_token:"[UNK]" ()

let enc = encode tokenizer "hello world"
let tokens = Encoding.tokens enc   (* [| "[BOS]"; "hello"; "world"; "[EOS]" |] *)
let type_ids = Encoding.type_ids enc (* [| 0; 0; 0; 0 |] *)
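
Continuing the same example, encoding a pair follows the ~pair template; the expected arrays below are read directly off the template definition:

let pair_enc = encode tokenizer ~pair:"world" "hello"
let pair_tokens = Encoding.tokens pair_enc     (* [| "[BOS]"; "hello"; "[EOS]"; "world"; "[EOS]" |] *)
let pair_type_ids = Encoding.type_ids pair_enc (* [| 0; 0; 0; 1; 1 |] *)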

Decoding

Decoders reverse encoding-specific transformations to produce natural text. They operate on token strings (looked up from the vocabulary), not on IDs.

Decoders fall into two categories:

  • Per-token — transform each token independently: bpe, byte_fallback, metaspace
  • Collapsing — process the entire token list as a whole: byte_level, wordpiece, replace, strip, fuse

This distinction matters when composing with sequence: per-token decoders pass a list of transformed tokens to the next decoder, while collapsing decoders produce a single result.

Available decoders:

Decoder                  Type        Description
bpe ()                   Per-token   Strip end-of-word suffix, insert spaces
byte_fallback ()         Per-token   Convert <0x41> hex tokens to bytes
metaspace ()             Per-token   Convert metaspace markers to spaces
byte_level ()            Collapsing  Reverse GPT-2 byte-to-Unicode encoding
wordpiece ()             Collapsing  Strip ## prefix, join subwords
replace ~pattern ~by ()  Collapsing  Replace literal pattern in joined text
strip ()                 Collapsing  Remove leading/trailing characters
fuse ()                  Collapsing  Concatenate all tokens with no delimiter
ctc ()                   Per-token   CTC output decoding (deduplication, pad removal)

Use Decoder.decode to apply a decoder to a list of token strings:

open Brot

(* WordPiece decoder: strips ## prefix and joins subwords *)
let wp = Decoder.wordpiece ()
let text = Decoder.decode wp [ "[CLS]"; "play"; "##ing"; "cat"; "##s"; "[SEP]" ]
(* "[CLS] playing cats [SEP]" *)

(* Sequence of decoders *)
let seq = Decoder.sequence [ Decoder.fuse (); Decoder.replace ~pattern:"_" ~by:" " () ]
let text2 = Decoder.decode seq [ "_Hello"; "_world" ]
(* " Hello world" *)

When using Brot.decode, the tokenizer looks up token strings from the vocabulary and then applies the configured decoder automatically.

Complete Example

Here is a complete BERT-style tokenizer using all 5 pipeline stages:

open Brot

let tokenizer =
  wordpiece
    (* 1. Normalizer: lowercase and clean text *)
    ~normalizer:(Normalizer.bert ~lowercase:true ())
    (* 2. Pre-tokenizer: BERT-style splitting *)
    ~pre:(Pre_tokenizer.bert ())
    (* 3. Algorithm: WordPiece with ## prefix *)
    ~vocab:
      [ ("[PAD]", 0); ("[UNK]", 1); ("[CLS]", 2); ("[SEP]", 3);
        ("the", 4); ("cat", 5); ("sat", 6); ("on", 7); ("mat", 8);
        ("play", 9); ("##ing", 10); ("##ed", 11); ("a", 12) ]
    ~specials:(List.map special [ "[PAD]"; "[UNK]"; "[CLS]"; "[SEP]" ])
    ~unk_token:"[UNK]" ~pad_token:"[PAD]"
    (* 4. Post-processor: add [CLS] and [SEP] *)
    ~post:(Post_processor.bert ~cls:("[CLS]", 2) ~sep:("[SEP]", 3) ())
    (* 5. Decoder: strip ## and join *)
    ~decoder:(Decoder.wordpiece ())
    ()

(* "The Cat" is normalized to "the cat" before tokenization *)
let enc = encode tokenizer "The Cat Played On A Mat"
let tokens = Encoding.tokens enc
(* [| "[CLS]"; "the"; "cat"; "play"; "##ed"; "on"; "a"; "mat"; "[SEP]" |] *)

(* Decode back, skipping special tokens *)
let text = decode tokenizer ~skip_special_tokens:true (Encoding.ids enc)
(* "the cat played on a mat" *)