Module Brot.Post_processor
Post-processing.
Post-processing tokenization output with special tokens.
Post-processors add special tokens and type IDs to tokenized sequences after core tokenization. They handle model-specific requirements like [CLS] and [SEP] for BERT, sentence pair formatting, and byte-level offset adjustments.
Constructors
bert ~sep ~cls () is a BERT-style post-processor.
Single: [CLS] A [SEP]. Pair: [CLS] A [SEP] B [SEP]. Type IDs: 0 for the first sequence, 1 for the second.
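The layout above can be modelled with a small standalone sketch. It does not call Brot: bert_layout is a hypothetical helper working on plain string lists, and it assumes the usual BERT convention that [CLS] and the first [SEP] carry type ID 0 while the final [SEP] carries type ID 1.

```ocaml
(* Standalone model of the BERT layout; [bert_layout] is illustrative
   and is not part of Brot. Assumption: special tokens take the type ID
   of the sequence they close ([CLS] counts with the first sequence). *)
let bert_layout ?pair ~cls ~sep a =
  let first = (cls :: a) @ [ sep ] in
  match pair with
  | None -> (first, List.map (fun _ -> 0) first)
  | Some b ->
      let second = b @ [ sep ] in
      ( first @ second,
        List.map (fun _ -> 0) first @ List.map (fun _ -> 1) second )

let () =
  let tokens, type_ids =
    bert_layout ~cls:"[CLS]" ~sep:"[SEP]" ~pair:[ "world" ] [ "hello" ]
  in
  (* Prints each token with its type ID, e.g. "[CLS]:0". *)
  List.iter2 (fun tok id -> Printf.printf "%s:%d " tok id) tokens type_ids;
  print_newline ()
```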
roberta ~sep ~cls () is a RoBERTa-style post-processor.
Single: <s> A </s>. Pair: <s> A </s> </s> B </s>. All type IDs are 0.
trim_offsets defaults to true. add_prefix_space defaults to true.
val byte_level : ?trim_offsets:bool -> unit -> t
byte_level () is a byte-level post-processor that adjusts character offsets for byte-level encoding.
When trim_offsets is true, token offsets are narrowed so that they exclude leading and trailing whitespace. Defaults to true.
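As a rough illustration of what offset trimming means, the sketch below narrows a single offset pair. It is not Brot's implementation: trim_offset and is_space are hypothetical helpers, and offsets are assumed to be half-open byte ranges (start, stop) into the original text.

```ocaml
(* Illustrative only; not part of Brot. Assumption: an offset is a
   half-open byte range [start, stop) into [text]. *)
let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r'

let trim_offset text (start, stop) =
  let start = ref start and stop = ref stop in
  (* Skip whitespace at the front of the range. *)
  while !start < !stop && is_space text.[!start] do
    incr start
  done;
  (* Skip whitespace at the back of the range. *)
  while !stop > !start && is_space text.[!stop - 1] do
    decr stop
  done;
  (!start, !stop)

let () =
  (* A byte-level token covering " world" spans (5, 11) in this text. *)
  let start, stop = trim_offset "hello world" (5, 11) in
  Printf.printf "(%d, %d)\n" start stop
```

With trimming, the range for " world" shrinks to cover only "world".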
template ~single () is a template-based post-processor.
Templates use $A and $B as sequence placeholders and literal special token names (e.g. [CLS]). Type IDs can be specified with a colon suffix: $A:0, [SEP]:1.
special_tokens defaults to [].
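To make the template syntax concrete, here is a minimal standalone parser for template pieces. It is a sketch, not Brot's parser: the piece type, parse_piece and parse_template are hypothetical names, and it assumes pieces are separated by spaces. A missing type ID is kept as None instead of guessing a default.

```ocaml
(* Illustrative parser for template pieces; not part of Brot.
   Assumption: pieces are separated by single spaces. *)
type piece =
  | Sequence_a of int option (* $A with optional type ID *)
  | Sequence_b of int option (* $B with optional type ID *)
  | Special of string * int option (* literal special token *)

let parse_piece s =
  (* Split off a ":<int>" suffix if there is one. *)
  let name, type_id =
    match String.rindex_opt s ':' with
    | None -> (s, None)
    | Some i -> (
        let suffix = String.sub s (i + 1) (String.length s - i - 1) in
        match int_of_string_opt suffix with
        | Some id -> (String.sub s 0 i, Some id)
        | None -> (s, None))
  in
  match name with
  | "$A" -> Sequence_a type_id
  | "$B" -> Sequence_b type_id
  | tok -> Special (tok, type_id)

let parse_template s =
  String.split_on_char ' ' s
  |> List.filter (fun p -> p <> "")
  |> List.map parse_piece

(* A BERT-like pair template written in the syntax described above. *)
let bert_pair = parse_template "[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1"
```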
Processing
val process :
t ->
?pair:Encoding.t ->
Encoding.t ->
add_special_tokens:bool ->
Encoding.t
process t enc ~add_special_tokens adds special tokens and sets type IDs on enc.
When ~pair is provided, both sequences are merged into a single encoding with appropriate type IDs. When ~add_special_tokens is false, special token insertion is skipped but byte-level offset trimming still applies.
val added_tokens : t -> is_pair:bool -> int
added_tokens t ~is_pair is the number of special tokens t adds. Useful for calculating the truncation budget.
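The truncation budget is the maximum length minus the special tokens the post-processor will add. The sketch below shows the arithmetic with a stand-in for added_tokens; bert_added_tokens and truncation_budget are hypothetical names, and the counts 2 and 3 are read off the BERT layouts [CLS] A [SEP] and [CLS] A [SEP] B [SEP].

```ocaml
(* Stand-in for [added_tokens t ~is_pair] on a BERT-style post-processor;
   not part of Brot. Single adds [CLS] and [SEP]; pair adds one more [SEP]. *)
let bert_added_tokens ~is_pair = if is_pair then 3 else 2

(* Tokens left for real content once special tokens are reserved. *)
let truncation_budget ~added_tokens ~max_length ~is_pair =
  max 0 (max_length - added_tokens ~is_pair)

let () =
  let budget =
    truncation_budget ~added_tokens:bert_added_tokens ~max_length:512
  in
  Printf.printf "single: %d, pair: %d\n" (budget ~is_pair:false)
    (budget ~is_pair:true)
```

With the real module, the stand-in would presumably be replaced by a partial application such as added_tokens t, which has the same shape (is_pair:bool -> int) according to the signature above.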
Formatting
val pp : Stdlib.Format.formatter -> t -> unit
pp formats a post-processor for inspection.
Serialization
val of_json : Jsont.json -> (t, string) Stdlib.result
of_json json is a post-processor from HuggingFace tokenizer.json format. Errors if json is not an object, has a missing or unknown "type" field, or has invalid parameters.
val to_json : t -> Jsont.json
to_json t is t serialized to HuggingFace tokenizer.json format.