Module Brot
Tokenization for OCaml.
Brot tokenizes text into token IDs for language models and reverses the process. Tokenization proceeds through configurable stages:
- Normalization: clean and normalize text (lowercase, accent removal, Unicode normalization). See Normalizer.
- Pre-tokenization: split text into words or sub-words. See Pre_tokenizer.
- Tokenization: apply vocabulary-based encoding (BPE, WordPiece, Unigram, word-level, or character-level).
- Post-processing: add special tokens and set type IDs. See Post_processor.
- Padding/Truncation: adjust sequence lengths for batching.
Each stage is optional and configurable. Open the module to use it; it brings only modules into your scope.
Quick start
Load a pretrained tokenizer:
let tokenizer = Brot.from_file "tokenizer.json" |> Result.get_ok in
let encoding = Brot.encode tokenizer "Hello world!" in
let _ids = Encoding.ids encoding

Create a BPE tokenizer from scratch:
let tokenizer =
Brot.bpe
~vocab:[("hello", 0); ("world", 1); ("[PAD]", 2)]
~merges:[]
()
in
let encoding = Brot.encode tokenizer "hello world" in
let _text = Brot.decode tokenizer (Encoding.ids encoding)

Train a new tokenizer:
let texts = [ "Hello world"; "How are you?"; "Hello again" ] in
let tokenizer =
Brot.train_bpe (`Seq (List.to_seq texts)) ~vocab_size:1000
in
Brot.save_pretrained tokenizer ~path:"./my_tokenizer"
module Normalizer : sig ... endText normalization.
module Pre_tokenizer : sig ... endPre-tokenization.
module Post_processor : sig ... endPost-processing.
module Decoder : sig ... endToken decoding.
module Encoding : sig ... endTokenization encodings.
Types
type direction = [ `Left | `Right ]

The type for padding and truncation directions. `Left operates at the beginning of the sequence, `Right at the end.
type special = {
  token : string;  (* The token text (e.g., "<pad>", "<unk>"). *)
  single_word : bool;  (* Whether this token must match whole words only. *)
  lstrip : bool;  (* Whether to strip whitespace on the left. *)
  rstrip : bool;  (* Whether to strip whitespace on the right. *)
  normalized : bool;  (* Whether to apply normalization to this token. *)
}

The type for special token configurations.
Special tokens are never split during tokenization and can be skipped during decoding. Token IDs are assigned automatically when added to the vocabulary. The semantic role (pad, unk, bos, etc.) is contextual, not encoded in the type.
type pad_length = [ `Batch_longest | `Fixed of int | `To_multiple of int ]

The type for padding length strategies.

- `Batch_longest: pad to the longest sequence in the batch.
- `Fixed n: pad every sequence to exactly n tokens.
- `To_multiple n: pad to the smallest multiple of n that is at least the sequence length.
type padding = {
  length : pad_length;
  direction : direction;
  pad_id : int option;
  pad_type_id : int option;
  pad_token : string option;
}

The type for padding configurations.
When pad_id, pad_type_id, or pad_token are None, the tokenizer's configured padding token is used. Raises Invalid_argument at padding time if no padding token is configured and these fields are None.
type truncation = { max_length : int; direction : direction }

The type for truncation configurations. Sequences exceeding max_length tokens are trimmed from the given direction.
type data = [ `Files of string list | `Seq of string Seq.t ]

The type for training data sources.

- `Files paths: read training text from files, one line per example.
- `Seq seq: use a sequence of strings.
val special :
?single_word:bool ->
?lstrip:bool ->
?rstrip:bool ->
?normalized:bool ->
string ->
special

special token is a special token configuration for token.

single_word, lstrip, rstrip, and normalized all default to false.
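For instance, a padding token with the defaults and a mask token that absorbs whitespace on its left (a minimal sketch; the token names are illustrative):

let pad = Brot.special "[PAD]" in
let mask = Brot.special ~lstrip:true "<mask>" in
let _tokenizer = Brot.bpe ~specials:[ pad; mask ] ~pad_token:"[PAD]" ()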
val padding :
?direction:direction ->
?pad_id:int ->
?pad_type_id:int ->
?pad_token:string ->
pad_length ->
padding

padding length is a padding configuration for the given length strategy.

direction defaults to `Right. The other fields default to None, falling back to the tokenizer's configured padding token.
val truncation : ?direction:direction -> int -> truncation

truncation max_length is a truncation configuration limiting sequences to max_length tokens. direction defaults to `Right.
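Putting both constructors together with encode (a sketch; it assumes a tokenizer whose vocabulary contains "[PAD]"):

let pad = Brot.padding ~pad_token:"[PAD]" (`Fixed 32) in
let trunc = Brot.truncation ~direction:`Left 32 in
let _encoding = Brot.encode tokenizer ~padding:pad ~truncation:trunc "Hello world!"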
Constructors
val bpe :
?normalizer:Normalizer.t ->
?pre:Pre_tokenizer.t ->
?post:Post_processor.t ->
?decoder:Decoder.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab:(string * int) list ->
?merges:(string * string) list ->
?cache_capacity:int ->
?dropout:float ->
?continuing_subword_prefix:string ->
?end_of_word_suffix:string ->
?fuse_unk:bool ->
?byte_fallback:bool ->
?ignore_merges:bool ->
unit ->
t

bpe () is a BPE (Byte Pair Encoding) tokenizer. Used by GPT-2, GPT-3, and RoBERTa.
- normalizer: text normalization. Default: none.
- pre: pre-tokenization strategy. Default: none.
- post: post-processor for special tokens. Default: none.
- decoder: decoding strategy. Default: none.
- specials: special tokens to add to the vocabulary. Default: [].
- bos_token, eos_token, pad_token: role markers; added to the vocabulary if not already present. Default: none.
- unk_token: token for unknown characters. Configures both the role and the BPE model's unknown handling. Default: none.
- vocab: initial vocabulary as (token, id) pairs. Default: [].
- merges: merge rules as (left, right) pairs learned during training. Default: [].
- cache_capacity: LRU cache size for tokenization results. Default: 10000.
- dropout: probability in [0;1] of skipping merges (data augmentation). Default: none (no dropout).
- continuing_subword_prefix: prefix for non-initial subwords (e.g., "##"). Default: none.
- end_of_word_suffix: suffix marking word boundaries (e.g., "</w>"). Default: none.
- fuse_unk: merge consecutive unknown tokens. Default: false.
- byte_fallback: use byte-level fallback ("<0x00>") instead of the unknown token. Default: false.
- ignore_merges: skip merge application (character-level output). Default: false.
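A minimal sketch of a working BPE tokenizer built from an explicit vocabulary and merge list (the tiny vocabulary is illustrative):

let tokenizer =
  Brot.bpe
    ~pre:Pre_tokenizer.whitespace
    ~unk_token:"<unk>"
    ~vocab:[ ("<unk>", 0); ("l", 1); ("o", 2); ("w", 3); ("lo", 4); ("low", 5) ]
    ~merges:[ ("l", "o"); ("lo", "w") ]
    ()
in
let _encoding = Brot.encode tokenizer "low"
(* Applying merges ("l","o") then ("lo","w") turns "low" into the single token "low" (ID 5). *)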
val wordpiece :
?normalizer:Normalizer.t ->
?pre:Pre_tokenizer.t ->
?post:Post_processor.t ->
?decoder:Decoder.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab:(string * int) list ->
?continuing_subword_prefix:string ->
?max_input_chars_per_word:int ->
unit ->
t

wordpiece () is a WordPiece tokenizer. Used by BERT, DistilBERT, and ELECTRA.
WordPiece uses a greedy longest-match-first algorithm to split words into subword pieces prefixed with a continuation marker (e.g., "running" becomes ["run"; "##ning"]).
- vocab: initial vocabulary as (token, id) pairs. Default: [].
- unk_token: token for out-of-vocabulary words. Default: "[UNK]".
- continuing_subword_prefix: prefix for non-initial subwords. Default: "##".
- max_input_chars_per_word: words longer than this are replaced with unk_token. Default: 100.
Pipeline parameters (normalizer, pre, post, decoder, specials, bos_token, eos_token, pad_token) are as in bpe.
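A sketch matching the "running" example above (the three-token vocabulary is illustrative):

let tokenizer =
  Brot.wordpiece
    ~pre:Pre_tokenizer.whitespace
    ~vocab:[ ("[UNK]", 0); ("run", 1); ("##ning", 2) ]
    ()
in
let _encoding = Brot.encode tokenizer "running"
(* Greedy longest-match-first splits "running" into ["run"; "##ning"]. *)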
val word_level :
?normalizer:Normalizer.t ->
?pre:Pre_tokenizer.t ->
?post:Post_processor.t ->
?decoder:Decoder.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab:(string * int) list ->
unit ->
t

word_level () is a word-level tokenizer.

Maps each word directly to a token ID. No subword splitting is performed. Words not in the vocabulary map to unk_token.
Note. When pre is not provided, Pre_tokenizer.whitespace is used by default.
- vocab: initial vocabulary as (word, id) pairs. Default: [].
- unk_token: token for out-of-vocabulary words. Default: "<unk>".
Pipeline parameters (normalizer, pre, post, decoder, specials, bos_token, eos_token, pad_token) are as in bpe.
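A sketch relying on the default whitespace pre-tokenizer (the vocabulary is illustrative):

let tokenizer =
  Brot.word_level
    ~vocab:[ ("<unk>", 0); ("hello", 1); ("world", 2) ]
    ()
in
let _ids = Brot.encode_ids tokenizer "hello there world"
(* "there" is out of vocabulary and maps to "<unk>": [| 1; 0; 2 |]. *)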
val unigram :
?normalizer:Normalizer.t ->
?pre:Pre_tokenizer.t ->
?post:Post_processor.t ->
?decoder:Decoder.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab:(string * float) list ->
unit ->
t

unigram () is a Unigram tokenizer. Used by ALBERT, T5, and mBART.
Unigram uses probabilistic segmentation to find optimal subword splits based on token log-probabilities.
- vocab: initial vocabulary as (token, score) pairs where scores are negative log probabilities. Default: [].
- unk_token: token for unknown characters. Default: none.
Pipeline parameters (normalizer, pre, post, decoder, specials, bos_token, eos_token, pad_token) are as in bpe.
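A sketch with an illustrative vocabulary; the scores here follow the HuggingFace convention (log-probability-style floats, where pieces with higher, less negative scores are preferred during segmentation):

let tokenizer =
  Brot.unigram
    ~pre:Pre_tokenizer.whitespace
    ~vocab:[ ("h", -3.0); ("e", -3.0); ("l", -3.0); ("o", -3.0); ("hello", -1.0) ]
    ()
in
let _encoding = Brot.encode tokenizer "hello"
(* Segmentation keeps "hello" whole: one well-scored piece beats five characters. *)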
val chars :
?normalizer:Normalizer.t ->
?pre:Pre_tokenizer.t ->
?post:Post_processor.t ->
?decoder:Decoder.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
unit ->
t

chars () is a character-level tokenizer.

Each byte in the input becomes a separate token with an ID equal to its ordinal value. No vocabulary is required.
Pipeline parameters (normalizer, pre, post, decoder, specials, bos_token, eos_token, pad_token) are as in bpe.
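With no normalizer or pre-tokenizer configured, encoding is a direct byte-to-ID mapping:

let tokenizer = Brot.chars () in
let _ids = Brot.encode_ids tokenizer "Hi!"
(* Each byte maps to its ordinal value: [| 72; 105; 33 |]. *)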
val from_model_file :
vocab:string ->
?merges:string ->
?normalizer:Normalizer.t ->
?pre:Pre_tokenizer.t ->
?post:Post_processor.t ->
?decoder:Decoder.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
unit ->
t

from_model_file ~vocab () loads a tokenizer from HuggingFace model files.
The model type is inferred from the arguments: if merges is provided, a BPE tokenizer is created; otherwise WordPiece.
- vocab: path to the vocabulary file (vocab.json). Expected format: a JSON object mapping tokens to IDs ({"hello": 0, "world": 1}).
- merges: path to the merges file (merges.txt). One merge per line as space-separated token pairs. Lines starting with "#version" are skipped.
Raises Sys_error if a file cannot be read.
Pipeline parameters (normalizer, pre, post, decoder, specials, bos_token, eos_token, pad_token, unk_token) are as in bpe.
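A sketch loading a GPT-2-style BPE model (the file paths are hypothetical; providing merges selects BPE):

let _tokenizer =
  Brot.from_model_file
    ~vocab:"vocab.json"
    ~merges:"merges.txt"
    ~unk_token:"<unk>"
    ()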
val add_tokens : t -> string list -> t

add_tokens t tokens is t with tokens added to the vocabulary. Only supported for word-level tokenizers.
Raises Invalid_argument if the tokenizer does not support dynamic vocabulary extension.
Accessors
val normalizer : t -> Normalizer.t option

normalizer t is t's normalizer, if any.

val pre_tokenizer : t -> Pre_tokenizer.t option

pre_tokenizer t is t's pre-tokenizer, if any.

val post_processor : t -> Post_processor.t option

post_processor t is t's post-processor, if any.

val bos_token : t -> string option

bos_token t is t's beginning-of-sequence token, if any.

val eos_token : t -> string option

eos_token t is t's end-of-sequence token, if any.

val pad_token : t -> string option

pad_token t is t's padding token, if any.

val unk_token : t -> string option

unk_token t is t's unknown token, if any.
Vocabulary
val vocab : t -> (string * int) list

vocab t is t's vocabulary as (token, id) pairs.

val vocab_size : t -> int

vocab_size t is the number of tokens in t's vocabulary.

val token_to_id : t -> string -> int option

token_to_id t token is the ID of token in t, if any.

val id_to_token : t -> int -> string option

id_to_token t id is the token string for id in t, if any.
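For example, with the Quick start vocabulary [("hello", 0); ("world", 1); ("[PAD]", 2)]:

let _ = Brot.token_to_id tokenizer "hello" (* Some 0 *) in
let _ = Brot.id_to_token tokenizer 1 (* Some "world" *) in
assert (Brot.vocab_size tokenizer = List.length (Brot.vocab tokenizer))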
Encoding and decoding
val encode :
t ->
?pair:string ->
?add_special_tokens:bool ->
?padding:padding ->
?truncation:truncation ->
string ->
Encoding.t

encode t text is the encoding of text by t.
- pair: a second sentence for sentence-pair tasks. The post-processor merges both sequences with appropriate type IDs. Default: none.
- add_special_tokens: whether to insert special tokens via the post-processor. Default: true.
- padding: padding configuration. Default: none (no padding).
- truncation: truncation configuration. Default: none (no truncation).
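A sentence-pair sketch (it assumes a post-processor that handles pairs, e.g. a BERT-style template, is configured):

let encoding =
  Brot.encode tokenizer
    ~pair:"How are you?"
    ~truncation:(Brot.truncation 128)
    "Hello world"
in
let _ids = Encoding.ids encoding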
val encode_batch :
t ->
?add_special_tokens:bool ->
?padding:padding ->
?truncation:truncation ->
string list ->
Encoding.t list

encode_batch t texts is the encoding of each text in texts.
Optional parameters are as in encode. For sentence-pair tasks, use encode_pairs_batch.
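A batching sketch that pads every encoding to the longest sequence in the batch (it assumes "[PAD]" is in the vocabulary):

let pad = Brot.padding ~pad_token:"[PAD]" `Batch_longest in
let _encodings =
  Brot.encode_batch tokenizer ~padding:pad
    [ "a short text"; "a somewhat longer text" ]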
val encode_pairs_batch :
t ->
?add_special_tokens:bool ->
?padding:padding ->
?truncation:truncation ->
(string * string) list ->
Encoding.t list

encode_pairs_batch t pairs encodes a batch of sentence pairs. Each element is (primary, secondary).
Optional parameters are as in encode.
val encode_ids :
t ->
?pair:string ->
?add_special_tokens:bool ->
?padding:padding ->
?truncation:truncation ->
string ->
int array

encode_ids t text is Encoding.ids (encode t text).
Optional parameters are as in encode.
val decode : t -> ?skip_special_tokens:bool -> int array -> string

decode t ids is the text obtained by decoding ids through t's vocabulary and decoder.
skip_special_tokens defaults to false.
val decode_batch :
t ->
?skip_special_tokens:bool ->
int array list ->
string list

decode_batch t ids_list decodes each element of ids_list.
skip_special_tokens defaults to false.
Training
val train_bpe :
?init:t ->
?normalizer:Normalizer.t ->
?pre:Pre_tokenizer.t ->
?post:Post_processor.t ->
?decoder:Decoder.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab_size:int ->
?min_frequency:int ->
?limit_alphabet:int ->
?initial_alphabet:string list ->
?continuing_subword_prefix:string ->
?end_of_word_suffix:string ->
?show_progress:bool ->
?max_token_length:int ->
data ->
t

train_bpe data trains a BPE tokenizer from data.
Learns merge rules by iteratively merging the most frequent adjacent pairs until reaching the target vocabulary size.
- init: existing tokenizer to extend. Default: create new.
- vocab_size: target vocabulary size including special tokens. Default: 30000.
- min_frequency: minimum pair frequency to be merged. Default: 0.
- limit_alphabet: maximum number of initial characters to keep. Default: none (keep all).
- initial_alphabet: characters to include regardless of frequency. Default: [].
- continuing_subword_prefix: prefix for non-initial subwords. Default: none.
- end_of_word_suffix: suffix marking word boundaries. Default: none.
- show_progress: display progress bar. Default: true.
- max_token_length: maximum token length. Default: none.
Pipeline parameters (normalizer, pre, post, decoder, specials, bos_token, eos_token, pad_token, unk_token) are as in bpe.
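A training sketch reading from files (the corpus paths are hypothetical):

let _tokenizer =
  Brot.train_bpe
    ~pre:Pre_tokenizer.whitespace
    ~specials:[ Brot.special "[PAD]"; Brot.special "[UNK]" ]
    ~pad_token:"[PAD]" ~unk_token:"[UNK]"
    ~vocab_size:8000 ~min_frequency:2
    (`Files [ "corpus_a.txt"; "corpus_b.txt" ])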
val train_wordpiece :
?init:t ->
?normalizer:Normalizer.t ->
?pre:Pre_tokenizer.t ->
?post:Post_processor.t ->
?decoder:Decoder.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab_size:int ->
?min_frequency:int ->
?limit_alphabet:int ->
?initial_alphabet:string list ->
?continuing_subword_prefix:string ->
?end_of_word_suffix:string ->
?show_progress:bool ->
data ->
t

train_wordpiece data trains a WordPiece tokenizer from data.

Learns a subword vocabulary by maximizing language-model likelihood.
- init: existing tokenizer to extend. Default: create new.
- vocab_size: target vocabulary size including special tokens. Default: 30000.
- min_frequency: minimum frequency for a subword to be included. Default: 0.
- limit_alphabet: maximum number of initial characters to keep. Default: none (keep all).
- initial_alphabet: characters to include regardless of frequency. Default: [].
- continuing_subword_prefix: prefix for non-initial subwords. Default: "##".
- end_of_word_suffix: suffix marking word boundaries. Default: none.
- show_progress: display progress bar. Default: true.
Pipeline parameters (normalizer, pre, post, decoder, specials, bos_token, eos_token, pad_token, unk_token) are as in bpe.
val train_wordlevel :
?init:t ->
?normalizer:Normalizer.t ->
?pre:Pre_tokenizer.t ->
?post:Post_processor.t ->
?decoder:Decoder.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab_size:int ->
?min_frequency:int ->
?show_progress:bool ->
data ->
t

train_wordlevel data trains a word-level tokenizer from data.

Builds a vocabulary by collecting unique words, optionally filtering by frequency. No subword splitting is performed.
- init: existing tokenizer to extend. Default: create new.
- vocab_size: target vocabulary size including special tokens. Default: 30000.
- min_frequency: minimum frequency for a word to be included. Default: 0.
- show_progress: display progress bar. Default: true.
Pipeline parameters (normalizer, pre, post, decoder, specials, bos_token, eos_token, pad_token, unk_token) are as in bpe.
val train_unigram :
?init:t ->
?normalizer:Normalizer.t ->
?pre:Pre_tokenizer.t ->
?post:Post_processor.t ->
?decoder:Decoder.t ->
?specials:special list ->
?bos_token:string ->
?eos_token:string ->
?pad_token:string ->
?unk_token:string ->
?vocab_size:int ->
?show_progress:bool ->
?shrinking_factor:float ->
?max_piece_length:int ->
?n_sub_iterations:int ->
data ->
t

train_unigram data trains a Unigram tokenizer from data.

Learns a probabilistic subword vocabulary using the EM algorithm.
- init: existing tokenizer to extend. Default: create new.
- vocab_size: target vocabulary size including special tokens. Default: 8000.
- show_progress: display progress bar. Default: true.
- shrinking_factor: fraction of the vocabulary to retain in each pruning iteration. Default: 0.75.
- max_piece_length: maximum subword length. Default: 16.
- n_sub_iterations: number of EM sub-iterations per pruning round. Default: 2.
Pipeline parameters (normalizer, pre, post, decoder, specials, bos_token, eos_token, pad_token, unk_token) are as in bpe.
Model files
val export_tiktoken : t -> merges_path:string -> vocab_path:string -> unit

export_tiktoken t ~merges_path ~vocab_path exports t's BPE merges and vocabulary in tiktoken-compatible format.
Warning. Only BPE tokenizers are supported. Raises Failure for other model types.
val save_model_files :
t ->
folder:string ->
?prefix:string ->
unit ->
string list

save_model_files t ~folder ?prefix () saves t's underlying model files (vocabulary and merges) to folder and returns the list of created file paths.
prefix defaults to "".
HuggingFace compatibility
val from_file : string -> (t, string) Stdlib.result

from_file path is a tokenizer loaded from a HuggingFace tokenizer.json file. Errors if the file cannot be read or has an invalid format.
val from_json : Jsont.json -> (t, string) Stdlib.result

from_json json is a tokenizer deserialized from HuggingFace JSON format. Errors if json has a missing or unknown model type, or invalid parameters.
val to_json : t -> Jsont.json

to_json t is t serialized to HuggingFace JSON format.
val save_pretrained : t -> path:string -> unit

save_pretrained t ~path saves t to path in HuggingFace format. Creates path/tokenizer.json.
Raises Sys_error if path cannot be written.
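Saving and reloading round-trips through tokenizer.json:

let () =
  Brot.save_pretrained tokenizer ~path:"./my_tokenizer";
  match Brot.from_file "./my_tokenizer/tokenizer.json" with
  | Ok t -> Format.printf "%a@." Brot.pp t
  | Error e -> prerr_endline e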
Formatting
val pp : Stdlib.Format.formatter -> t -> unit

pp formats a tokenizer for inspection. Shows the algorithm type, vocabulary size, and configured pipeline stages.