Module Brot.Pre_tokenizer
Pre-tokenization.
Pre-tokenizers split raw text into pieces before vocabulary-based tokenization (BPE, WordPiece, etc.) is applied. Each piece carries byte offsets into the original text.
See Brot for the full tokenization pipeline.
Constructors
val whitespace : unit -> t

whitespace () splits on whitespace using the pattern \w+|[^\w\s]+.
Groups word characters (letters, digits, underscore) together and groups non-word, non-space characters together. Whitespace is used as delimiter but not included in output.
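A sketch of the described behavior (the output shape follows the documented pattern and the byte-offset contract of pre_tokenize; it is illustrative, not a verified run):

```ocaml
(* Word runs and punctuation runs become separate pieces; the space
   between "Hello," and "world!" is used as a delimiter and dropped. *)
let pieces =
  Brot.Pre_tokenizer.(pre_tokenize (whitespace ()) "Hello, world!")
(* Expected shape:
   [("Hello", (0, 5)); (",", (5, 6)); ("world", (7, 12)); ("!", (12, 13))] *)
```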
val whitespace_split : unit -> t

whitespace_split () splits on any whitespace characters.
Removes whitespace from output. Simplest and fastest pre-tokenizer.
val bert : unit -> t

bert () applies BERT-style pre-tokenization.
Splits on whitespace, isolates punctuation, and separates CJK characters individually.
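For instance, punctuation inside a word is isolated from the surrounding word characters (a sketch of the described semantics, with offsets elided):

```ocaml
(* The apostrophe in "don't" becomes its own piece. *)
let pieces =
  Brot.Pre_tokenizer.(pre_tokenize (bert ()) "don't stop")
(* Expected shape: [("don", _); ("'", _); ("t", _); ("stop", _)] *)
```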
val byte_level :
?add_prefix_space:bool ->
?use_regex:bool ->
?trim_offsets:bool ->
unit ->
t

byte_level () is a byte-level pre-tokenizer. Used by GPT-2, GPT-3, and RoBERTa.
Converts text to byte representation and applies GPT-2's regex pattern for splitting.
add_prefix_space: add a space at the beginning if the text does not start with whitespace. Default: true.
use_regex: use GPT-2's regex pattern for splitting. Default: true.
trim_offsets: adjust offsets for byte-level encoding. Default: true.
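A usage sketch. Byte-level encoding maps each byte to a printable Unicode codepoint, so in GPT-2-style tokenizers a leading space typically surfaces as the visible "Ġ" marker in the piece text (that rendering is a property of the GPT-2 scheme, not verified here against this library):

```ocaml
(* Spaces are encoded into the pieces themselves rather than dropped,
   so the original text can be reconstructed from the pieces. *)
let bl = Brot.Pre_tokenizer.byte_level ~add_prefix_space:false ()
let pieces = Brot.Pre_tokenizer.pre_tokenize bl "Hello world"
(* Expected shape: [("Hello", _); ("Ġworld", _)] *)
```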
type behavior = [
  | `Isolated  (* Keep delimiter as a separate piece *)
  | `Removed  (* Remove the delimiter *)
  | `Merged_with_previous  (* Merge delimiter with the previous piece *)
  | `Merged_with_next  (* Merge delimiter with the next piece *)
  | `Contiguous  (* Group consecutive delimiters together *)
]

Delimiter handling behavior for splitting operations.
punctuation () separates punctuation from alphanumeric content.
behavior defaults to `Isolated.
split ~pattern () splits on a literal string pattern.
behavior defaults to `Removed. When invert is true, splits on everything except the pattern; defaults to false.
val char_delimiter : char -> t

char_delimiter c splits on character c, removing it from output.
Equivalent to split ~pattern:(String.make 1 c) ~behavior:`Removed ().
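A sketch contrasting two behavior values on the same input (expected shapes follow the behavior descriptions above, not a verified run):

```ocaml
(* With char_delimiter (i.e. `Removed), the '-' delimiters vanish:
   expected shape [("a", (0, 1)); ("b", (2, 3)); ("c", (4, 5))]. *)
let removed =
  Brot.Pre_tokenizer.(pre_tokenize (char_delimiter '-') "a-b-c")

(* With `Isolated, each '-' is kept as its own piece:
   expected shape [("a", _); ("-", _); ("b", _); ("-", _); ("c", _)]. *)
let isolated =
  Brot.Pre_tokenizer.(
    pre_tokenize (split ~pattern:"-" ~behavior:`Isolated ()) "a-b-c")
```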
val digits : ?individual_digits:bool -> unit -> t

digits () splits on digit boundaries.
When individual_digits is true, each digit is a separate piece; when false (default), consecutive digits are grouped.
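A sketch of both modes (output shapes follow the description above):

```ocaml
(* Default: consecutive digits stay grouped.
   Expected shape: [("abc", (0, 3)); ("123", (3, 6))] *)
let grouped =
  Brot.Pre_tokenizer.(pre_tokenize (digits ()) "abc123")

(* individual_digits:true splits every digit out.
   Expected shape: [("abc", _); ("1", _); ("2", _); ("3", _)] *)
let individual =
  Brot.Pre_tokenizer.(
    pre_tokenize (digits ~individual_digits:true ()) "abc123")
```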
type prepend_scheme = [
  | `First  (* Only prepend to the first piece *)
  | `Never  (* Never prepend *)
  | `Always  (* Always prepend if not starting with a space *)
]

Controls when metaspace prepends the replacement character.
val metaspace :
?replacement:char ->
?prepend_scheme:prepend_scheme ->
?split:bool ->
unit ->
t

metaspace () replaces whitespace with a visible marker. Used by SentencePiece models.
replacement: character to replace spaces with. Default: '_'.
prepend_scheme: when to prepend the replacement character. Default: `Always.
split: whether to split on the replacement character. Default: true.
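With the documented defaults, spaces become '_' and a marker is prepended to the text since it does not start with a space. A sketch of the expected shape:

```ocaml
(* "Hello world" -> "_Hello_world", then split on the marker. *)
let m = Brot.Pre_tokenizer.metaspace ()
let pieces = Brot.Pre_tokenizer.pre_tokenize m "Hello world"
(* Expected shape: [("_Hello", _); ("_world", _)] *)
```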
val unicode_scripts : unit -> t

unicode_scripts () splits on Unicode script boundaries.
Separates text when the writing system changes (e.g., Latin to Cyrillic, Latin to Han).
val fixed_length : int -> t

fixed_length n splits into fixed-length character chunks.
The last chunk may be shorter than n.
sequence ts chains multiple pre-tokenizers left-to-right.
Each pre-tokenizer processes the pieces from the previous one. Offsets are composed correctly through the chain.
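A sketch of chaining: the second pre-tokenizer runs on each piece produced by the first, and offsets still point into the original string:

```ocaml
(* First split on whitespace, then split digits off within each piece. *)
let seq =
  Brot.Pre_tokenizer.(sequence [ whitespace_split (); digits () ])
let pieces = Brot.Pre_tokenizer.pre_tokenize seq "item42 done"
(* Expected shape: [("item", (0, 4)); ("42", (4, 6)); ("done", (7, 11))] *)
```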
Operations
val pre_tokenize : t -> string -> (string * (int * int)) list

pre_tokenize t text splits text into pieces with byte offsets.
Returns a list of (piece, (start, end_)) where start and end_ are byte positions in the original text. Offsets are non-overlapping and in ascending order.
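Because offsets index into the original text, each piece can be recovered with String.sub for pre-tokenizers that do not rewrite the piece text (byte_level does rewrite it, so it is excluded). A sketch of that invariant:

```ocaml
(* Check that every piece equals the substring its offsets point at. *)
let check_offsets pt text =
  Brot.Pre_tokenizer.pre_tokenize pt text
  |> List.for_all (fun (piece, (start, end_)) ->
       String.sub text start (end_ - start) = piece)

let ok = check_offsets (Brot.Pre_tokenizer.whitespace ()) "Hello, world!"
```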
Formatting
val pp : Stdlib.Format.formatter -> t -> unit

pp ppf t formats t for inspection.
Byte-level decoding
byte_level_decode s reverses byte-level encoding by converting the special Unicode codepoints back to original byte values.
Serialization
val to_json : t -> Jsont.json

to_json t serializes t to HuggingFace JSON format.
val of_json : Jsont.json -> (t, string) Stdlib.result

of_json json is a pre-tokenizer from HuggingFace JSON format. Errors if json is not an object, has a missing or unknown "type" field, or has invalid parameters.
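A round-trip sketch through the JSON form, handling the error case explicitly:

```ocaml
(* Serialize a pre-tokenizer and rebuild it from its JSON encoding. *)
let pt = Brot.Pre_tokenizer.whitespace ()
let json = Brot.Pre_tokenizer.to_json pt
let pt' =
  match Brot.Pre_tokenizer.of_json json with
  | Ok p -> p
  | Error msg -> failwith msg
```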