Brot vs. HuggingFace Tokenizers -- A Practical Comparison

This guide explains how Brot relates to Python's HuggingFace Tokenizers, focusing on:

  • How core concepts map (tokenizer types, pipeline stages, encoding results)
  • Where the APIs feel similar vs. deliberately different
  • How to translate common HuggingFace patterns into Brot

If you already use HuggingFace Tokenizers, this should be enough to become productive in Brot quickly.


1. Big-Picture Differences

Aspect | HuggingFace Tokenizers (Python) | Brot (OCaml)
Language | Python bindings over Rust | Native OCaml
Core type | tokenizers.Tokenizer | Brot.t
Encoding result | tokenizers.Encoding | Encoding.t
Algorithms | BPE, WordPiece, Unigram, WordLevel | Brot.bpe, Brot.wordpiece, Brot.unigram, Brot.word_level, Brot.chars
Pipeline stages | Mutable properties on the Tokenizer object | Immutable ~normalizer, ~pre, ~post, ~decoder arguments
Mutability | Mutable (set properties after creation) | Immutable after creation
HuggingFace compat | Native format | Full tokenizer.json read/write via from_file / save_pretrained
Training | Trainer objects passed to tokenizer.train() | Brot.train_bpe, Brot.train_wordpiece, etc.
Padding config | tokenizer.enable_padding() | ~padding argument on encode / encode_batch
Truncation config | tokenizer.enable_truncation() | ~truncation argument on encode / encode_batch

Brot semantics to know (read once; the sketch below puts them together):

  • Tokenizers are immutable. Pipeline components are set at construction time, not mutated after.
  • from_file returns (t, string) result. Handle errors explicitly.
  • Padding and truncation are per-call parameters, not global tokenizer state.
  • Special tokens use a record type (Brot.special) with explicit control over stripping and normalization.
  • encode returns Encoding.t; use encode_ids when you only need the ID array.
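
A minimal sketch tying these together (assumes a tokenizer.json on disk; every call used here is covered in detail below):

let () =
  match Brot.from_file "tokenizer.json" with
  | Error msg -> prerr_endline ("load failed: " ^ msg)
  | Ok tokenizer ->
      (* Padding is a per-call argument, not tokenizer state. *)
      let pad = Brot.padding ~pad_id:0 ~pad_token:"[PAD]" (`Fixed 16) in
      let enc = Brot.encode tokenizer ~padding:pad "Hello world!" in
      Array.iter (Printf.printf "%d ") (Encoding.ids enc);
      print_newline ();
      (* encode_ids when only the IDs matter. *)
      ignore (Brot.encode_ids tokenizer "Hello world!")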

2. Loading Pretrained Tokenizers

2.1 From a tokenizer.json file

HuggingFace

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

Brot

let tokenizer = Brot.from_file "tokenizer.json" |> Result.get_ok

Both read the same tokenizer.json format. Brot's from_file returns a result instead of raising an exception.
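
If you prefer explicit handling over Result.get_ok, match on the result:

let tokenizer =
  match Brot.from_file "tokenizer.json" with
  | Ok t -> t
  | Error msg -> failwith ("could not load tokenizer: " ^ msg)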

2.2 From vocabulary and merges files

HuggingFace

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE.from_file("vocab.json", "merges.txt"))

Brot

let tokenizer =
  Brot.from_model_file
    ~vocab:"vocab.json"
    ~merges:"merges.txt"
    ()

When ~merges is omitted, Brot infers WordPiece instead of BPE.
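
For example, a minimal sketch loading a WordPiece vocabulary on its own:

let wp_tokenizer =
  Brot.from_model_file ~vocab:"vocab.json" ()  (* no ~merges: WordPiece *)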

2.3 Saving

HuggingFace

tokenizer.save("tokenizer.json")

Brot

Brot.save_pretrained tokenizer ~path:"./my_tokenizer"

save_pretrained creates path/tokenizer.json in HuggingFace format. Use to_json when you need the JSON value directly.
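
A round-trip sketch (assuming save_pretrained returns unit and writes path/tokenizer.json as described):

let () = Brot.save_pretrained tokenizer ~path:"./my_tokenizer"
let reloaded =
  Brot.from_file "./my_tokenizer/tokenizer.json" |> Result.get_ok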


3. Encoding Text

3.1 Basic encoding

HuggingFace

output = tokenizer.encode("Hello world!")
output.ids          # [101, 7592, 2088, 999, 102]
output.tokens       # ['[CLS]', 'hello', 'world', '!', '[SEP]']
output.offsets      # [(0, 0), (0, 5), (6, 11), (11, 12), (0, 0)]
output.type_ids     # [0, 0, 0, 0, 0]
output.attention_mask  # [1, 1, 1, 1, 1]

Brot

let enc = Brot.encode tokenizer "Hello world!"
let ids   = Encoding.ids enc            (* int array *)
let toks  = Encoding.tokens enc         (* string array *)
let offs  = Encoding.offsets enc        (* (int * int) array *)
let types = Encoding.type_ids enc       (* int array *)
let mask  = Encoding.attention_mask enc (* int array *)

3.2 IDs only

HuggingFace

ids = tokenizer.encode("Hello world!").ids

Brot

let ids = Brot.encode_ids tokenizer "Hello world!"

encode_ids is a shortcut that avoids constructing the full Encoding.t when you only need token IDs.
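
It pairs naturally with decode (Section 4) for a quick round trip:

let ids  = Brot.encode_ids tokenizer "Hello world!"
let text = Brot.decode tokenizer ids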

3.3 Without special tokens

HuggingFace

output = tokenizer.encode("Hello world!", add_special_tokens=False)

Brot

let enc = Brot.encode tokenizer ~add_special_tokens:false "Hello world!"

4. Decoding

4.1 Basic decoding

HuggingFace

text = tokenizer.decode([101, 7592, 2088, 999, 102])
text_clean = tokenizer.decode([101, 7592, 2088, 999, 102], skip_special_tokens=True)

Brot

let text = Brot.decode tokenizer [| 101; 7592; 2088; 999; 102 |]
let text_clean =
  Brot.decode tokenizer ~skip_special_tokens:true
    [| 101; 7592; 2088; 999; 102 |]

4.2 Batch decoding

HuggingFace

texts = tokenizer.decode_batch([[101, 7592, 102], [101, 2088, 102]])

Brot

let texts =
  Brot.decode_batch tokenizer
    [ [| 101; 7592; 102 |]; [| 101; 2088; 102 |] ]

5. Batch Encoding

HuggingFace

outputs = tokenizer.encode_batch(["Hello world!", "How are you?"])
# outputs is a list of Encoding objects
for enc in outputs:
    print(enc.ids)

Brot

let encodings =
  Brot.encode_batch tokenizer
    [ "Hello world!"; "How are you?" ]

let () =
  List.iter
    (fun enc ->
      let ids = Encoding.ids enc in
      Array.iter (Printf.printf "%d ") ids;
      print_newline ())
    encodings

Both return a list of encoding objects, one per input.


6. Padding and Truncation

6.1 Padding

In HuggingFace, padding is global state on the tokenizer. In Brot, it is a per-call parameter.

HuggingFace

tokenizer.enable_padding(
    direction="right",
    pad_id=0,
    pad_token="[PAD]",
    length=128,         # fixed length
)
output = tokenizer.encode("Hello")
# output.attention_mask shows 0s for padding positions

Brot

let pad = Brot.padding ~pad_id:0 ~pad_token:"[PAD]" (`Fixed 128)
let enc = Brot.encode tokenizer ~padding:pad "Hello"
(* Encoding.attention_mask enc has 0s for padding positions *)

Padding strategies (HuggingFace → Brot):

length=None (pad to longest in batch) → `Batch_longest
length=128 (fixed length) → `Fixed 128
pad_to_multiple_of=8 → `To_multiple 8
direction="left" → ~direction:`Left
direction="right" (default) → ~direction:`Right (default)

6.2 Truncation

HuggingFace

tokenizer.enable_truncation(max_length=512, direction="right")
output = tokenizer.encode("Very long text ...")

Brot

let trunc = Brot.truncation 512
let enc = Brot.encode tokenizer ~truncation:trunc "Very long text ..."

Truncation direction defaults to `Right in both libraries.

6.3 Combined padding and truncation

HuggingFace

tokenizer.enable_padding(length=512, pad_token="[PAD]", pad_id=0)
tokenizer.enable_truncation(max_length=512)
outputs = tokenizer.encode_batch(texts)

Brot

let pad = Brot.padding ~pad_token:"[PAD]" ~pad_id:0 (`Fixed 512)
let trunc = Brot.truncation 512
let encodings =
  Brot.encode_batch tokenizer ~padding:pad ~truncation:trunc texts

The key difference: Brot passes these as arguments, so different calls can use different settings without mutating the tokenizer.
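
For instance, one tokenizer can serve two length budgets side by side:

let enc_short = Brot.encode tokenizer ~truncation:(Brot.truncation 128) text
let enc_long  = Brot.encode tokenizer ~truncation:(Brot.truncation 512) text
(* Neither call changes the tokenizer itself. *)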


7. Sentence Pairs

HuggingFace

# Single pair
output = tokenizer.encode("premise", "hypothesis")
output.type_ids  # [0, 0, 0, 0, 1, 1, 1]  (with BERT post-processor)

# Batch of pairs
outputs = tokenizer.encode_batch([("premise1", "hyp1"), ("premise2", "hyp2")])

Brot

(* Single pair *)
let enc = Brot.encode tokenizer ~pair:"hypothesis" "premise"
let type_ids = Encoding.type_ids enc  (* 0s for first, 1s for second *)

(* Batch of pairs *)
let encodings =
  Brot.encode_pairs_batch tokenizer
    [ ("premise1", "hyp1"); ("premise2", "hyp2") ]

Brot uses the ~pair optional argument on encode for single pairs and a dedicated encode_pairs_batch for batches, instead of overloading the same function with tuples.


8. Special Tokens

8.1 Defining special tokens

HuggingFace

from tokenizers import AddedToken

tokenizer.add_special_tokens([
    AddedToken("[CLS]", single_word=False, lstrip=False, rstrip=False),
    AddedToken("[SEP]", single_word=False, lstrip=False, rstrip=False),
    AddedToken("[PAD]", single_word=False, lstrip=False, rstrip=False),
])

Brot

let tokenizer =
  Brot.bpe
    ~specials:[
      Brot.special "[CLS]";
      Brot.special "[SEP]";
      Brot.special "[PAD]";
    ]
    ~pad_token:"[PAD]"
    ~bos_token:"[CLS]"
    ~eos_token:"[SEP]"
    ()

In HuggingFace, special tokens are added after construction. In Brot, they are part of construction since tokenizers are immutable. The special function accepts optional ~single_word, ~lstrip, ~rstrip, and ~normalized parameters matching AddedToken.
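
For example, a sketch of a mask token that absorbs whitespace to its left (assuming the optional flags are given before the token string):

let mask_tok = Brot.special ~lstrip:true ~normalized:false "[MASK]"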

8.2 Role tokens

HuggingFace

tokenizer.pad_token       # "[PAD]"
tokenizer.cls_token       # "[CLS]"
tokenizer.sep_token       # "[SEP]"
tokenizer.unk_token       # "[UNK]"

Brot

let pad = Brot.pad_token tokenizer  (* string option *)
let bos = Brot.bos_token tokenizer  (* string option *)
let eos = Brot.eos_token tokenizer  (* string option *)
let unk = Brot.unk_token tokenizer  (* string option *)

Brot uses bos_token/eos_token instead of cls_token/sep_token, since these are model-agnostic roles, and returns option instead of a possibly-missing value. (On the Python side, attributes like pad_token come from transformers' PreTrainedTokenizer; the bare tokenizers.Tokenizer does not expose them.)
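
Since these return option, supplying a fallback is one Option.value away:

let pad = Option.value (Brot.pad_token tokenizer) ~default:"[PAD]"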

8.3 Special tokens mask

Both libraries provide a mask distinguishing special tokens from content tokens in the encoding:

HuggingFace

output.special_tokens_mask  # [1, 0, 0, 0, 1]

Brot

let mask = Encoding.special_tokens_mask enc  (* int array: 1 for special, 0 for content *)

9. Pipeline Components

Both libraries use the same four-stage pipeline: normalizer, pre-tokenizer, post-processor, decoder. The difference is how they are configured.

9.1 Normalizer

HuggingFace

from tokenizers import normalizers

tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.StripAccents(),
    normalizers.Lowercase(),
])

Brot

let norm =
  Normalizer.sequence
    [ Normalizer.nfd; Normalizer.strip_accents; Normalizer.lowercase ]

let tokenizer = Brot.bpe ~normalizer:norm ()

Common normalizers (HuggingFace → Brot):

normalizers.NFC() → Normalizer.nfc
normalizers.NFD() → Normalizer.nfd
normalizers.NFKC() → Normalizer.nfkc
normalizers.NFKD() → Normalizer.nfkd
normalizers.Lowercase() → Normalizer.lowercase
normalizers.StripAccents() → Normalizer.strip_accents
normalizers.Strip() → Normalizer.strip ()
normalizers.Replace(pattern, rep) → Normalizer.replace ~pattern ~replacement
normalizers.Prepend(s) → Normalizer.prepend s
normalizers.BertNormalizer() → Normalizer.bert ()
normalizers.ByteLevel() → Normalizer.byte_level ()
normalizers.Sequence([...]) → Normalizer.sequence [...]

9.2 Pre-tokenizer

HuggingFace

from tokenizers import pre_tokenizers

tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),
    pre_tokenizers.Punctuation(),
])

Brot

let pre =
  Pre_tokenizer.sequence
    [ Pre_tokenizer.whitespace_split ();
      Pre_tokenizer.punctuation () ]

let tokenizer = Brot.bpe ~pre ()

Common pre-tokenizers (HuggingFace → Brot):

pre_tokenizers.Whitespace() → Pre_tokenizer.whitespace ()
pre_tokenizers.WhitespaceSplit() → Pre_tokenizer.whitespace_split ()
pre_tokenizers.BertPreTokenizer() → Pre_tokenizer.bert ()
pre_tokenizers.ByteLevel() → Pre_tokenizer.byte_level ()
pre_tokenizers.Punctuation() → Pre_tokenizer.punctuation ()
pre_tokenizers.Digits() → Pre_tokenizer.digits ()
pre_tokenizers.Metaspace() → Pre_tokenizer.metaspace ()
pre_tokenizers.UnicodeScripts() → Pre_tokenizer.unicode_scripts ()
pre_tokenizers.CharDelimiterSplit(c) → Pre_tokenizer.char_delimiter c
pre_tokenizers.Split(pattern, ...) → Pre_tokenizer.split ~pattern ()
pre_tokenizers.Sequence([...]) → Pre_tokenizer.sequence [...]

9.3 Post-processor

HuggingFace

from tokenizers import processors

tokenizer.post_processor = processors.BertProcessing(
    sep=("[SEP]", 102),
    cls=("[CLS]", 101),
)

Brot

let post =
  Post_processor.bert
    ~sep:("[SEP]", 102)
    ~cls:("[CLS]", 101)
    ()

let tokenizer = Brot.bpe ~post ()

Common post-processors (HuggingFace → Brot):

processors.BertProcessing(sep, cls) → Post_processor.bert ~sep ~cls ()
processors.RobertaProcessing(sep, cls) → Post_processor.roberta ~sep ~cls ()
processors.ByteLevel() → Post_processor.byte_level ()
processors.TemplateProcessing(single, pair, special_tokens) → Post_processor.template ~single ?pair ~special_tokens ()
processors.Sequence([...]) → Post_processor.sequence [...]
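
As a sketch, a BERT-style template via Post_processor.template, assuming Brot accepts HuggingFace's "$A"/"$B" template strings:

let post =
  Post_processor.template
    ~single:"[CLS] $A [SEP]"
    ~pair:"[CLS] $A [SEP] $B:1 [SEP]"
    ~special_tokens:[ ("[CLS]", 101); ("[SEP]", 102) ]
    ()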

9.4 Decoder

HuggingFace

from tokenizers import decoders

tokenizer.decoder = decoders.WordPiece(prefix="##")

Brot

let dec = Decoder.wordpiece ~prefix:"##" ()
let tokenizer = Brot.wordpiece ~decoder:dec ()

Common decoders (HuggingFace → Brot):

decoders.BPEDecoder(suffix) → Decoder.bpe ~suffix ()
decoders.ByteLevel() → Decoder.byte_level ()
decoders.ByteFallback() → Decoder.byte_fallback ()
decoders.WordPiece(prefix) → Decoder.wordpiece ~prefix ()
decoders.Metaspace() → Decoder.metaspace ()
decoders.CTC() → Decoder.ctc ()
decoders.Replace(pattern, by) → Decoder.replace ~pattern ~by ()
decoders.Strip() → Decoder.strip ()
decoders.Fuse() → Decoder.fuse ()
decoders.Sequence([...]) → Decoder.sequence [...]

9.5 Inspecting the pipeline

HuggingFace

tokenizer.normalizer
tokenizer.pre_tokenizer
tokenizer.post_processor
tokenizer.decoder

Brot

let norm = Brot.normalizer tokenizer     (* Normalizer.t option *)
let pre  = Brot.pre_tokenizer tokenizer  (* Pre_tokenizer.t option *)
let post = Brot.post_processor tokenizer (* Post_processor.t option *)
let dec  = Brot.decoder tokenizer        (* Decoder.t option *)

Brot returns option for each stage, since any stage can be absent.


10. Training Tokenizers

10.1 BPE training

HuggingFace

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
trainer = BpeTrainer(
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"],
)
tokenizer.train(["corpus.txt"], trainer)

Brot

let tokenizer =
  Brot.train_bpe
    (`Files [ "corpus.txt" ])
    ~vocab_size:30000
    ~min_frequency:2
    ~specials:[
      Brot.special "[UNK]";
      Brot.special "[CLS]";
      Brot.special "[SEP]";
      Brot.special "[PAD]";
    ]
    ~unk_token:"[UNK]"
    ~pad_token:"[PAD]"

Brot combines the Tokenizer + Trainer pattern into a single function call. Training data is passed as `Files (file paths) or `Seq (string sequence).

10.2 WordPiece training

HuggingFace

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
trainer = WordPieceTrainer(vocab_size=30000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(["corpus.txt"], trainer)

Brot

let tokenizer =
  Brot.train_wordpiece
    (`Files [ "corpus.txt" ])
    ~vocab_size:30000
    ~unk_token:"[UNK]"
    ~specials:[ Brot.special "[UNK]"; Brot.special "[PAD]" ]
    ~pad_token:"[PAD]"

10.3 Unigram training

HuggingFace

from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())
trainer = UnigramTrainer(vocab_size=8000, special_tokens=["<unk>", "<pad>"])
tokenizer.train(["corpus.txt"], trainer)

Brot

let tokenizer =
  Brot.train_unigram
    (`Files [ "corpus.txt" ])
    ~vocab_size:8000
    ~unk_token:"<unk>"
    ~specials:[ Brot.special "<unk>"; Brot.special "<pad>" ]
    ~pad_token:"<pad>"

10.4 Training from in-memory data

HuggingFace

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
trainer = BpeTrainer(vocab_size=1000)
tokenizer.train_from_iterator(
    ["Hello world", "How are you?", "Hello again"],
    trainer,
)

Brot

let texts = [ "Hello world"; "How are you?"; "Hello again" ]
let tokenizer =
  Brot.train_bpe (`Seq (List.to_seq texts)) ~vocab_size:1000

10.5 Extending an existing tokenizer

HuggingFace

from tokenizers import Tokenizer
from tokenizers.trainers import BpeTrainer

# Load, then retrain with more data
tokenizer = Tokenizer.from_file("tokenizer.json")
trainer = BpeTrainer(vocab_size=50000)
tokenizer.train(["more_data.txt"], trainer)

Brot

let base = Brot.from_file "tokenizer.json" |> Result.get_ok
let tokenizer =
  Brot.train_bpe ~init:base (`Files [ "more_data.txt" ]) ~vocab_size:50000

The ~init parameter on training functions lets you extend an existing tokenizer with additional data.


11. Vocabulary Inspection

HuggingFace

tokenizer.get_vocab()                 # dict: token -> id
tokenizer.get_vocab_size()            # int
tokenizer.token_to_id("[CLS]")        # int or None
tokenizer.id_to_token(101)            # str or None

Brot

let v     = Brot.vocab tokenizer         (* (string * int) list *)
let size  = Brot.vocab_size tokenizer    (* int *)
let id    = Brot.token_to_id tokenizer "[CLS]"  (* int option *)
let token = Brot.id_to_token tokenizer 101       (* string option *)

vocab returns an association list instead of a dictionary. token_to_id and id_to_token return option instead of nullable values.
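
Because vocab is an ordinary association list, the standard List functions apply:

let cls_id = List.assoc_opt "[CLS]" (Brot.vocab tokenizer)  (* int option *)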


12. Quick Cheat Sheet

Task | HuggingFace Tokenizers | Brot
Load from file | Tokenizer.from_file("tokenizer.json") | Brot.from_file "tokenizer.json"
Save to file | tokenizer.save("tokenizer.json") | Brot.save_pretrained tokenizer ~path:"./out"
Encode text | tokenizer.encode("Hello") | Brot.encode tokenizer "Hello"
Encode IDs only | tokenizer.encode("Hello").ids | Brot.encode_ids tokenizer "Hello"
Encode batch | tokenizer.encode_batch(["a", "b"]) | Brot.encode_batch tokenizer ["a"; "b"]
Encode pair | tokenizer.encode("a", "b") | Brot.encode tokenizer ~pair:"b" "a"
Encode pairs batch | tokenizer.encode_batch([("a","b"), ...]) | Brot.encode_pairs_batch tokenizer [("a","b"); ...]
Decode | tokenizer.decode(ids) | Brot.decode tokenizer ids
Decode batch | tokenizer.decode_batch([ids1, ids2]) | Brot.decode_batch tokenizer [ids1; ids2]
Get token IDs | output.ids | Encoding.ids enc
Get tokens | output.tokens | Encoding.tokens enc
Get attention mask | output.attention_mask | Encoding.attention_mask enc
Get type IDs | output.type_ids | Encoding.type_ids enc
Get offsets | output.offsets | Encoding.offsets enc
Padding | tokenizer.enable_padding(length=128) | Brot.encode tokenizer ~padding:(Brot.padding (`Fixed 128)) ...
Truncation | tokenizer.enable_truncation(max_length=512) | Brot.encode tokenizer ~truncation:(Brot.truncation 512) ...
Vocab size | tokenizer.get_vocab_size() | Brot.vocab_size tokenizer
Token to ID | tokenizer.token_to_id("[CLS]") | Brot.token_to_id tokenizer "[CLS]"
ID to token | tokenizer.id_to_token(101) | Brot.id_to_token tokenizer 101
Train BPE | tokenizer.train(files, BpeTrainer(...)) | Brot.train_bpe (`Files files) ~vocab_size:30000
Train WordPiece | tokenizer.train(files, WordPieceTrainer(...)) | Brot.train_wordpiece (`Files files) ~vocab_size:30000
Train Unigram | tokenizer.train(files, UnigramTrainer(...)) | Brot.train_unigram (`Files files) ~vocab_size:8000
Train from iterator | tokenizer.train_from_iterator(iter, trainer) | Brot.train_bpe (`Seq seq) ~vocab_size:1000
Set normalizer | tokenizer.normalizer = normalizers.Lowercase() | Brot.bpe ~normalizer:Normalizer.lowercase ()
Set pre-tokenizer | tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel() | Brot.bpe ~pre:(Pre_tokenizer.byte_level ()) ()
Set post-processor | tokenizer.post_processor = processors.BertProcessing(...) | Brot.bpe ~post:(Post_processor.bert ~sep ~cls ()) ()
Set decoder | tokenizer.decoder = decoders.WordPiece() | Brot.bpe ~decoder:(Decoder.wordpiece ()) ()
Add special tokens | tokenizer.add_special_tokens([AddedToken(...)]) | Pass ~specials:[Brot.special "..."; ...] at construction