Module Brot.Normalizer
Text normalization.
Normalizers transform text before tokenization: lowercasing, accent removal, Unicode normalization, whitespace cleanup, and model-specific preprocessing. They are the first stage in the tokenization pipeline, applied before Pre_tokenizer and vocabulary-based encoding.
Compose normalizers with sequence:
  let n =
    Normalizer.sequence
      [ Normalizer.nfd; Normalizer.strip_accents; Normalizer.lowercase ]
  in
  Normalizer.apply n "Caf\u{00E9}"
  (* "cafe" *)
See Brot for the full tokenization pipeline.
Normalizers
Unicode normalization
val nfc : t
nfc is Unicode NFC normalization (canonical composition).
val nfd : t
nfd is Unicode NFD normalization (canonical decomposition).
val nfkc : t
nfkc is Unicode NFKC normalization (compatibility composition).
val nfkd : t
nfkd is Unicode NFKD normalization (compatibility decomposition).
Text transforms
val lowercase : t
lowercase is Unicode case folding to lowercase.
val strip_accents : t
strip_accents removes combining marks after NFD decomposition. Applies nfd before stripping.
val strip : ?left:bool -> ?right:bool -> unit -> t
strip ?left ?right () is a normalizer that strips Unicode whitespace from text boundaries. left and right default to true.
val replace : pattern:string -> replacement:string -> t
replace ~pattern ~replacement is a normalizer that replaces all pattern matches with replacement. pattern is a PCRE regular expression, compiled once at construction time.
Raises Re.Pcre.Parse_error if pattern is not valid PCRE.
val prepend : string -> t
prepend s is a normalizer that prepends s to non-empty text. Empty text is returned unchanged.
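The text transforms above can be combined with sequence. The following is a minimal usage sketch built only from the signatures documented on this page; it assumes the library is linked and that open Brot brings Normalizer into scope, as the example at the top of the page implies. The names tidy, left_only and prefixed are illustrative only.

```ocaml
open Brot

(* Trim both boundaries (the default for strip), then collapse runs of
   spaces into one. The pattern is compiled when replace is called, so an
   invalid pattern would raise at this point rather than at apply time. *)
let tidy =
  Normalizer.sequence
    [ Normalizer.strip ();
      Normalizer.replace ~pattern:" +" ~replacement:" " ]

(* Strip only the left boundary; trailing whitespace is kept. *)
let left_only = Normalizer.strip ~right:false ()

(* Prepend a marker to non-empty text; empty text passes through. *)
let prefixed = Normalizer.prepend ">"

let () =
  print_endline (Normalizer.apply tidy "  hello   world  ");
  print_endline (Normalizer.apply left_only "  hi  ");
  print_endline (Normalizer.apply prefixed "hi")
```

Building each normalizer once and reusing it avoids recompiling the regular expression for every input.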
Byte-level encoding
val byte_level : ?add_prefix_space:bool -> unit -> t
byte_level ?add_prefix_space () is GPT-2 style byte-level encoding. Each byte is mapped to a printable Unicode codepoint using the GPT-2 byte-to-unicode table.
add_prefix_space adds a space prefix when the text does not start with whitespace. Defaults to false.
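To make the byte-to-unicode mapping concrete, here is a standalone sketch of the table as it is defined in the original GPT-2 encoder. It uses only the OCaml standard library and is not Brot's implementation; the names is_printable, byte_to_codepoint and encode are hypothetical, and the add_prefix_space option is not modelled.

```ocaml
(* Bytes that GPT-2 keeps as-is: '!'..'~', U+00A1..U+00AC, U+00AE..U+00FF. *)
let is_printable b =
  (b >= 0x21 && b <= 0x7E)
  || (b >= 0xA1 && b <= 0xAC)
  || (b >= 0xAE && b <= 0xFF)

(* Every other byte is assigned the next free codepoint starting at 256,
   in increasing byte order. *)
let byte_to_codepoint : int array =
  let table = Array.make 256 0 in
  let next = ref 256 in
  for b = 0 to 255 do
    if is_printable b then table.(b) <- b
    else begin
      table.(b) <- !next;
      incr next
    end
  done;
  table

(* Map each input byte to its codepoint and emit the result as UTF-8. *)
let encode (s : string) : string =
  let buf = Buffer.create (2 * String.length s) in
  String.iter
    (fun c ->
      Buffer.add_utf_8_uchar buf
        (Uchar.of_int byte_to_codepoint.(Char.code c)))
    s;
  Buffer.contents buf
```

Under this table the space byte (0x20) becomes U+0120 and the newline byte (0x0A) becomes U+010A, which is why GPT-2 vocabularies show tokens beginning with "Ġ".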
Model-specific
val bert :
  ?clean_text:bool ->
  ?handle_chinese_chars:bool ->
  ?strip_accents:bool option ->
  ?lowercase:bool ->
  unit ->
  t
bert () is a BERT normalizer.
- clean_text: remove control characters and normalize whitespace. Default: true.
- handle_chinese_chars: pad CJK ideographs with spaces. Default: true.
- strip_accents: strip accents after NFD decomposition. When None, accents are stripped iff lowercase is true. Default: None.
- lowercase: lowercase text via Unicode case folding. Default: true.
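The interaction between strip_accents and lowercase can be sketched as a small standalone function. This only restates the rule described above and is not Brot's code; resolve_strip_accents is a hypothetical helper whose optional arguments mirror the ones in the bert signature.

```ocaml
(* Hypothetical helper, not part of Brot: computes whether accents end up
   being stripped for a given combination of optional arguments. *)
let resolve_strip_accents ?(strip_accents : bool option = None)
    ?(lowercase = true) () : bool =
  match strip_accents with
  | Some explicit -> explicit (* an explicit choice always wins *)
  | None -> lowercase (* otherwise follow the lowercase flag *)
```

With all defaults the result is true. Passing ~lowercase:false alone turns accent stripping off, while ~strip_accents:(Some true) keeps it on regardless of lowercase.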
Composition
Applying
val apply : t -> string -> string
apply n s is s normalized by n.
Formatting
val pp : Stdlib.Format.formatter -> t -> unit
pp ppf n formats n for inspection.
Serialization
val to_json : t -> Jsont.json
to_json n is n serialized to HuggingFace-compatible JSON.
val of_json : Jsont.json -> (t, string) Stdlib.result
of_json json is a normalizer deserialized from HuggingFace JSON. Errors if json is not an object, has a missing or unknown "type" field, or has invalid parameters.