ᚨ brot
Tokenization for OCaml
Brot tokenizes text into token IDs for language models and reverses the
process. It supports BPE, WordPiece, Unigram, word-level, and
character-level algorithms, loads and saves HuggingFace tokenizer.json
files, and is 1.3-6x faster than HuggingFace tokenizers on most
benchmarks.
Features
- Tokenization algorithms: BPE, WordPiece, Unigram, word-level, character-level
- HuggingFace compatible: load and save tokenizer.json, load vocab/merges model files
- Composable pipeline: normalizer, pre-tokenizer, post-processor, decoder; each stage is independently configurable
- Rich encoding output: token IDs, string tokens, byte offsets, attention masks, type IDs, word IDs, special token masks
- Training: train BPE, WordPiece, Unigram, and word-level tokenizers from scratch
- Performance: 1.3-6x faster than the Rust-native HuggingFace tokenizers library on most benchmarks
Quick Start
Build a BPE tokenizer from a vocabulary and merge rules, encode text, and decode it back:
open Brot
let tokenizer =
  bpe
    ~vocab:
      [ ("h", 0); ("e", 1); ("l", 2); ("o", 3); (" ", 4); ("w", 5);
        ("r", 6); ("d", 7); ("he", 8); ("ll", 9); ("llo", 10);
        ("hello", 11); ("wo", 12); ("rl", 13); ("rld", 14); ("world", 15) ]
    ~merges:
      [ ("h", "e"); ("l", "l"); ("ll", "o"); ("he", "llo");
        ("w", "o"); ("r", "l"); ("rl", "d"); ("wo", "rld") ]
    ()
let encoding = encode tokenizer "hello world"
let ids = Encoding.ids encoding (* [| 11; 4; 15 |] *)
let tokens = Encoding.tokens encoding (* [| "hello"; " "; "world" |] *)
let decoded = decode tokenizer ids (* "hello world" *)
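To see why these merge rules turn "hello world" into three tokens, it can help to apply them by hand. The following is a minimal, library-free sketch of textbook BPE merging, assuming that earlier rules take priority and that no pre-tokenizer splits the input first. It is not Brot's implementation, and the helper names (`rank`, `best_pair`, `merge_pair`, `apply_merges`, `split_chars`) are hypothetical.

```ocaml
(* Illustrative sketch only: textbook BPE merging, not Brot's internals. *)

(* Vocabulary and merge rules copied from the Quick Start example. *)
let vocab =
  [ ("h", 0); ("e", 1); ("l", 2); ("o", 3); (" ", 4); ("w", 5);
    ("r", 6); ("d", 7); ("he", 8); ("ll", 9); ("llo", 10);
    ("hello", 11); ("wo", 12); ("rl", 13); ("rld", 14); ("world", 15) ]

let merges =
  [ ("h", "e"); ("l", "l"); ("ll", "o"); ("he", "llo");
    ("w", "o"); ("r", "l"); ("rl", "d"); ("wo", "rld") ]

(* Rank of a pair is its position in [merges]; None if no rule covers it. *)
let rank pair =
  let rec go i = function
    | [] -> None
    | p :: rest -> if p = pair then Some i else go (i + 1) rest
  in
  go 0 merges

(* Find the adjacent pair with the lowest rank, if any pair is mergeable. *)
let best_pair symbols =
  let rec go best = function
    | a :: (b :: _ as rest) ->
      let best =
        match rank (a, b), best with
        | Some r, Some (r', _) when r < r' -> Some (r, (a, b))
        | Some r, None -> Some (r, (a, b))
        | _ -> best
      in
      go best rest
    | _ -> best
  in
  Option.map snd (go None symbols)

(* Replace every adjacent occurrence of (a, b) with the string a ^ b. *)
let rec merge_pair (a, b) = function
  | x :: y :: rest when x = a && y = b -> (a ^ b) :: merge_pair (a, b) rest
  | x :: rest -> x :: merge_pair (a, b) rest
  | [] -> []

(* Keep merging until no adjacent pair matches a rule. *)
let rec apply_merges symbols =
  match best_pair symbols with
  | None -> symbols
  | Some pair -> apply_merges (merge_pair pair symbols)

(* Start from one symbol per character. *)
let split_chars s = List.init (String.length s) (fun i -> String.make 1 s.[i])

let tokens = apply_merges (split_chars "hello world")
(* ["hello"; " "; "world"] *)

let ids = List.map (fun t -> List.assoc t vocab) tokens
(* [11; 4; 15] *)
```

The merges fire in rank order: `h e`, `l l`, `ll o`, `he llo`, then the four rules that build `world`. The space has no rule attached to it, so it survives as its own token.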
Load a pretrained tokenizer from a HuggingFace tokenizer.json file:
open Brot
(* Result.get_ok raises Invalid_argument if loading returns an Error *)
let tokenizer = from_file "tokenizer.json" |> Result.get_ok
let encoding = encode tokenizer "Hello world!"
let ids = Encoding.ids encoding
Train a tokenizer from a text corpus:
open Brot
let tokenizer =
  train_bpe ~vocab_size:100 ~show_progress:false
    (`Seq (List.to_seq
      [ "The quick brown fox jumps over the lazy dog";
        "The dog barked at the fox";
        "Quick brown foxes are rare" ]))
let size = vocab_size tokenizer
let ids = encode_ids tokenizer "The quick fox"
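For intuition about what BPE training does with such a corpus, here is a rough, library-free sketch of the first step of the textbook algorithm: count adjacent character pairs within each word and pick the most frequent one as the first merge rule. It assumes simple splitting on spaces and case-sensitive characters. How `train_bpe` actually pre-tokenizes, counts, and breaks ties is not specified here, so treat this as an illustration and not a description of Brot's trainer. The names `words`, `pair_counts`, and `best` are hypothetical.

```ocaml
(* Illustrative sketch only: one step of textbook BPE training. *)

let corpus =
  [ "The quick brown fox jumps over the lazy dog";
    "The dog barked at the fox";
    "Quick brown foxes are rare" ]

(* Assumption: words are separated by single spaces. *)
let words = List.concat_map (String.split_on_char ' ') corpus

(* Count every adjacent character pair inside each word. *)
let pair_counts =
  let tbl = Hashtbl.create 64 in
  List.iter
    (fun w ->
      for i = 0 to String.length w - 2 do
        let pair = (w.[i], w.[i + 1]) in
        let n = Option.value (Hashtbl.find_opt tbl pair) ~default:0 in
        Hashtbl.replace tbl pair (n + 1)
      done)
    words;
  tbl

(* The most frequent pair would become the first merge rule. *)
let best =
  Hashtbl.fold
    (fun pair n acc ->
      match acc with
      | Some (_, m) when m >= n -> acc
      | _ -> Some (pair, n))
    pair_counts None
(* Some (('h', 'e'), 4) *)
```

A full trainer would merge that pair throughout the corpus, recount, and repeat until the vocabulary reaches the requested size.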
Next Steps
- Getting Started — encode, decode, pipeline basics, training
- The Tokenization Pipeline — how the 5 pipeline stages work
- Pretrained Tokenizers — loading, saving, and building known model pipelines
- Batch Processing — padding, truncation, encoding metadata
- Choosing an Algorithm — BPE vs WordPiece vs Unigram and when to use each