Module Brot.Encoding
Tokenization encodings.
An encoding bundles token IDs for model input with alignment metadata: byte offsets, word indices, segment type IDs, attention masks, and special-token flags.
Encodings are produced by Brot.encode and post-processed with truncate and pad. All parallel arrays (ids, type_ids, tokens, word_ids, offsets, special_tokens_mask, attention_mask) share the same length, which is the value returned by length.
Construction
val empty : t
empty is the encoding with no tokens.
val create :
ids:int array ->
type_ids:int array ->
tokens:string array ->
words:int option array ->
offsets:(int * int) array ->
special_tokens_mask:int array ->
attention_mask:int array ->
?overflowing:t list ->
unit ->
t
create ~ids ~type_ids ~tokens ~words ~offsets ~special_tokens_mask ~attention_mask () is an encoding from the given arrays.
All arrays must have the same length; no validation is performed. overflowing defaults to [].
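Because create performs no validation, a caller may want to check the lengths before building an encoding. The following is a minimal standalone sketch of such a check; same_length is a hypothetical helper, not part of Brot.Encoding, and its labels simply mirror the arguments of create.

```ocaml
(* Hypothetical helper: returns true iff all seven parallel arrays
   have the same length as [ids]. Not part of Brot.Encoding. *)
let same_length ~ids ~type_ids ~tokens ~words ~offsets ~special_tokens_mask
    ~attention_mask =
  let n = Array.length ids in
  Array.length type_ids = n
  && Array.length tokens = n
  && Array.length words = n
  && Array.length offsets = n
  && Array.length special_tokens_mask = n
  && Array.length attention_mask = n
```

Calling the check first and raising Invalid_argument on failure keeps a mismatched array from surfacing later as an out-of-bounds access.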
val token :
id:int ->
token:string ->
offset:(int * int) ->
type_id:int ->
special:bool ->
t
token ~id ~token ~offset ~type_id ~special is a single-token encoding. When special is true, special_tokens_mask is 1 and word_ids is None; otherwise special_tokens_mask is 0. attention_mask is always 1.
val from_tokens : (int * string * (int * int)) list -> type_id:int -> t
from_tokens tokens ~type_id is an encoding from a list of (id, token_string, (start, end_offset)) triples. Every token gets the given type_id, attention_mask 1, special_tokens_mask 0, and word_ids None.
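To make the field defaults of from_tokens concrete, here is a standalone model that builds the same parallel arrays from a list of triples. The record type model and the function model_from_tokens are illustrative names only; the real type t is abstract and may be represented differently.

```ocaml
(* Illustrative model of the parallel arrays; not the real Brot.Encoding.t. *)
type model = {
  ids : int array;
  tokens : string array;
  offsets : (int * int) array;
  type_ids : int array;
  attention_mask : int array;
  special_tokens_mask : int array;
  word_ids : int option array;
}

(* Mirrors the documented defaults: the given type_id everywhere,
   attention_mask 1, special_tokens_mask 0, word_ids None. *)
let model_from_tokens triples ~type_id =
  let arr = Array.of_list triples in
  let n = Array.length arr in
  {
    ids = Array.map (fun (id, _, _) -> id) arr;
    tokens = Array.map (fun (_, tok, _) -> tok) arr;
    offsets = Array.map (fun (_, _, off) -> off) arr;
    type_ids = Array.make n type_id;
    attention_mask = Array.make n 1;
    special_tokens_mask = Array.make n 0;
    word_ids = Array.make n None;
  }
```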
concat a b is the encoding with a's tokens followed by b's. overflowing and sequence ranges are taken from a.
concat_list encs is the concatenation of encs in order. overflowing and sequence ranges are taken from the first element. Allocates once rather than creating intermediate arrays per pair.
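The allocation difference between repeated pairwise concatenation and concat_list can be illustrated on plain arrays with the standard library. This is only an analogy for one of the parallel arrays, not the Brot implementation; concat_pairwise and concat_once are hypothetical names.

```ocaml
(* Pairwise: each Array.append allocates a new intermediate array. *)
let concat_pairwise (arrays : int array list) : int array =
  List.fold_left Array.append [||] arrays

(* Single pass: Array.concat allocates only the result array. *)
let concat_once (arrays : int array list) : int array = Array.concat arrays
```

Both functions return the same contents; they differ only in how many intermediate arrays are allocated along the way.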
Accessors
val ids : t -> int array
ids enc is the token ID array.
val type_ids : t -> int array
type_ids enc is the segment ID array. Typically 0 for the first sequence and 1 for the second in sentence-pair tasks.
val tokens : t -> string array
tokens enc is the string representation of each token.
val word_ids : t -> int option array
word_ids enc maps each token to its source word index, or None for special tokens.
val offsets : t -> (int * int) array
offsets enc is the (start, end_) byte offset spans into the original text for each token.
val special_tokens_mask : t -> int array
special_tokens_mask enc is 1 for special tokens (CLS, SEP, padding) and 0 for content tokens.
val attention_mask : t -> int array
attention_mask enc is 1 for real tokens and 0 for padding tokens.
overflowing enc is the list of overflow encodings produced by truncate when the input exceeds max_length. Each element is a sliding window over the excess tokens.
val is_empty : t -> bool
is_empty enc is true iff enc has no tokens.
val length : t -> int
length enc is the number of tokens in enc.
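The arrays returned by offsets and attention_mask are plain OCaml arrays, so they can be consumed with the standard library. The sketch below works on arrays of those shapes directly; slice_by_offsets and real_token_count are hypothetical helpers, and the code assumes that the end offset is exclusive, which this page does not state.

```ocaml
(* Recover the source text of each token from (start, end_) byte spans.
   Assumption: end_ is exclusive, so the span length is end_ - start. *)
let slice_by_offsets (text : string) (offsets : (int * int) array) :
    string array =
  Array.map (fun (start, stop) -> String.sub text start (stop - start)) offsets

(* Count non-padding tokens by summing the 0/1 attention mask. *)
let real_token_count (attention_mask : int array) : int =
  Array.fold_left ( + ) 0 attention_mask
```

Since the offsets are byte offsets, String.sub (which also indexes by byte) is the matching operation even for multi-byte UTF-8 input.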
Operations
truncate enc ~max_length ~stride ~direction limits enc to at most max_length tokens.
Excess tokens are split into sliding windows of size max_length with overlap stride and stored in overflowing. If length enc <= max_length, enc is returned unchanged.
stride must be strictly less than max_length. When max_length is 0, all tokens move to overflowing and empty is returned.
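The sliding-window behaviour can be modelled on a bare ID array. This is a sketch under stated assumptions, not Brot's implementation: truncate_right is a hypothetical function, it only covers truncation from the right, it assumes consecutive windows advance by max_length - stride (so the first overflow window repeats the last stride kept tokens), and it represents the max_length = 0 case as a single overflow window.

```ocaml
(* Returns (kept, overflow_windows) for right-side truncation.
   Assumption: each window starts (max_length - stride) after the
   previous one, so neighbouring windows share [stride] tokens. *)
let truncate_right ~max_length ~stride (ids : int array) :
    int array * int array list =
  let n = Array.length ids in
  if n <= max_length then (ids, [])
  else if max_length = 0 then
    (* Assumption: everything moves to a single overflow window. *)
    ([||], [ ids ])
  else if stride >= max_length then
    invalid_arg "stride must be strictly less than max_length"
  else begin
    let step = max_length - stride in
    let rec windows start acc =
      let len = min max_length (n - start) in
      let acc = Array.sub ids start len :: acc in
      if start + len >= n then List.rev acc else windows (start + step) acc
    in
    (Array.sub ids 0 max_length, windows step [])
  end
```

With seven IDs, max_length 4 and stride 1, this model keeps the first four IDs and produces one overflow window covering the fourth through seventh, sharing one token with the kept part.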
val pad :
t ->
target_length:int ->
pad_id:int ->
pad_type_id:int ->
pad_token:string ->
direction:[ `Left | `Right ] ->
t
pad enc ~target_length ~pad_id ~pad_type_id ~pad_token ~direction extends enc to exactly target_length tokens.
Padding tokens have attention_mask 0 and special_tokens_mask 1. If length enc >= target_length, enc is returned unchanged. Padding is applied recursively to overflowing encodings. When direction is `Left, offsets and sequence ranges are shifted accordingly.
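The effect of padding on the mask arrays can be sketched with a reduced model that tracks only ids, attention_mask and special_tokens_mask. The record type enc and the function pad_model are illustrative names; offsets, sequence ranges, pad_type_id, pad_token and the recursion into overflowing are deliberately left out.

```ocaml
(* Reduced illustrative model; not the real Brot.Encoding.t. *)
type enc = {
  ids : int array;
  attention_mask : int array;
  special_tokens_mask : int array;
}

(* Extends the three arrays to [target_length]. Padding positions get
   pad_id, attention_mask 0 and special_tokens_mask 1, as documented. *)
let pad_model ~target_length ~pad_id ~direction e =
  let n = Array.length e.ids in
  if n >= target_length then e
  else begin
    let k = target_length - n in
    let join fill xs =
      let padding = Array.make k fill in
      match direction with
      | `Right -> Array.append xs padding
      | `Left -> Array.append padding xs
    in
    {
      ids = join pad_id e.ids;
      attention_mask = join 0 e.attention_mask;
      special_tokens_mask = join 1 e.special_tokens_mask;
    }
  end
```

Padding on the left moves every real token to a higher index, which is why position-based metadata has to be shifted in that case.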