Module Brot.Decoder
Token decoding.
Decoding tokens back to text.
Decoders convert token strings back into natural text by reversing encoding-specific transformations (prefix/suffix removal, byte-level decoding, whitespace normalization, etc.).
Decoders operate on token strings, not IDs. Convert IDs to strings via vocabulary first, then apply decode.
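The two-step flow can be sketched as follows. This is a minimal illustration that assumes the Brot library is installed and linked; the `vocab` array and `ids` list are hypothetical stand-ins for a real vocabulary and model output, and only `fuse` and `decode` are taken from the signatures on this page.

```ocaml
(* Hypothetical vocabulary: index = token ID, value = token string. *)
let vocab = [| "Hel"; "lo"; ","; " wor"; "ld" |]

(* Hypothetical model output, as a list of token IDs. *)
let ids = [ 0; 1; 2; 3; 4 ]

let () =
  (* Step 1: map IDs to token strings via the vocabulary. *)
  let tokens = List.map (fun id -> vocab.(id)) ids in
  (* Step 2: apply a decoder to the token strings, never to the IDs. *)
  let decoder = Brot.Decoder.fuse () in
  print_endline (Brot.Decoder.decode decoder tokens)
```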
Some decoders transform each token independently (per-token: bpe, metaspace, replace, strip, byte_fallback), while others collapse the entire token list into a single result (collapsing: byte_level, wordpiece, fuse). This distinction matters when composing decoders with sequence.
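To see why the distinction matters, the following standard-library sketch models a decoder stage as a function from token list to token list. It is an illustration only, not Brot's internal representation; `stage`, `per_token`, `collapsing`, and `chain` are hypothetical names.

```ocaml
(* Illustrative model of a decoder stage; not Brot's actual type. *)
type stage = string list -> string list

(* Per-token stage: the output has one element per input token. *)
let per_token (f : string -> string) : stage = List.map f

(* Collapsing stage: the output is always a single-element list. *)
let collapsing (f : string list -> string) : stage =
 fun tokens -> [ f tokens ]

(* Chain stages left to right, feeding each output into the next stage. *)
let chain (stages : stage list) : stage =
 fun tokens -> List.fold_left (fun acc stage -> stage acc) tokens stages

let upper = per_token String.uppercase_ascii
let join = collapsing (String.concat "")

(* Number of tokens that a following stage would receive. *)
let seen_after stages tokens = List.length (chain stages tokens)

let () =
  let tokens = [ "ab"; "cd"; "ef" ] in
  Printf.printf "after per-token stage: %d tokens\n"
    (seen_after [ upper ] tokens);
  Printf.printf "after collapsing stage: %d token\n"
    (seen_after [ join ] tokens)
```

In this model, any stage placed after a collapsing stage sees exactly one token, so logic that depends on token boundaries (such as "first token only") has to run before the collapse.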
Constructors
val bpe : ?suffix:string -> unit -> t
bpe ~suffix () is a per-token decoder for BPE-encoded tokens. Strips suffix from end-of-word tokens and inserts spaces between words. suffix defaults to "".
val byte_level : unit -> t
byte_level () is a collapsing decoder that reverses GPT-2 style byte-to-Unicode encoding back to the original bytes.
val byte_fallback : unit -> t
byte_fallback () is a per-token decoder for byte fallback tokens. Converts hex byte tokens (e.g. "<0x41>") back to their byte values, accumulating consecutive byte tokens into strings. Non-byte tokens pass through unchanged.
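The behaviour described above can be approximated with the standard library alone. The sketch below is an assumption about the shape of the output (one string per run of consecutive byte tokens), not Brot's implementation; `is_hex`, `byte_of_token`, and `decode_byte_fallback` are hypothetical names.

```ocaml
let is_hex c =
  match c with '0' .. '9' | 'a' .. 'f' | 'A' .. 'F' -> true | _ -> false

(* Parse a token of the exact form "<0xHH>" into its byte, if it matches. *)
let byte_of_token (tok : string) : char option =
  if String.length tok = 6
     && String.sub tok 0 3 = "<0x"
     && is_hex tok.[3] && is_hex tok.[4]
     && tok.[5] = '>'
  then Some (Char.chr (int_of_string ("0x" ^ String.sub tok 3 2)))
  else None

(* Merge each run of byte tokens into one string; pass other tokens through. *)
let decode_byte_fallback (tokens : string list) : string list =
  let buf = Buffer.create 16 in
  (* Push any pending bytes onto the (reversed) accumulator as one string. *)
  let flush acc =
    if Buffer.length buf = 0 then acc
    else begin
      let s = Buffer.contents buf in
      Buffer.clear buf;
      s :: acc
    end
  in
  let acc =
    List.fold_left
      (fun acc tok ->
        match byte_of_token tok with
        | Some c -> Buffer.add_char buf c; acc
        | None -> tok :: flush acc)
      [] tokens
  in
  List.rev (flush acc)

let () =
  decode_byte_fallback [ "<0x48>"; "<0x69>"; "!" ]
  |> List.iter (Printf.printf "%S\n")
```

Accumulating bytes before emitting them matters for multi-byte UTF-8 characters, whose bytes arrive as several consecutive tokens and only form valid text once joined.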
val wordpiece : ?prefix:string -> ?cleanup:bool -> unit -> t
wordpiece ~prefix ~cleanup () is a collapsing decoder for WordPiece tokens. Strips the continuation prefix (default "##") from non-initial subwords and joins tokens into words. When cleanup is true (default), normalizes whitespace in the result.
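As a rough standard-library sketch of the prefix-stripping and joining step (the cleanup pass is deliberately left out), assuming that words are separated by a single space; `wordpiece_sketch` is a hypothetical name and not part of Brot:

```ocaml
(* Join WordPiece tokens: continuation tokens are glued to the previous
   token with their prefix removed; other tokens start a new word. *)
let wordpiece_sketch ?(prefix = "##") (tokens : string list) : string =
  let plen = String.length prefix in
  let buf = Buffer.create 64 in
  List.iteri
    (fun i tok ->
      let is_continuation =
        i > 0
        && String.length tok >= plen
        && String.sub tok 0 plen = prefix
      in
      if is_continuation then
        Buffer.add_string buf (String.sub tok plen (String.length tok - plen))
      else begin
        if i > 0 then Buffer.add_char buf ' ';
        Buffer.add_string buf tok
      end)
    tokens;
  Buffer.contents buf

let () =
  print_endline (wordpiece_sketch [ "un"; "##afford"; "##able"; "price" ])
```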
val metaspace : ?replacement:char -> ?add_prefix_space:bool -> unit -> t
metaspace ~replacement ~add_prefix_space () is a per-token decoder that converts metaspace markers back to regular spaces. replacement defaults to '_'. When add_prefix_space is true (default), the leading replacement character on the first token is stripped.
val ctc :
?pad_token:string ->
?word_delimiter_token:string ->
?cleanup:bool ->
unit ->
t
ctc ~pad_token ~word_delimiter_token ~cleanup () is a per-token decoder for CTC (Connectionist Temporal Classification) output. Deduplicates consecutive tokens, removes pad_token (default "<pad>"), and, when cleanup is true (default), replaces word_delimiter_token (default "|") with spaces.
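The three steps can be sketched with the standard library as below. This is a hedged illustration, not Brot's implementation: it assumes the delimiter is matched as a whole token, and `dedup` and `ctc_sketch` are hypothetical names.

```ocaml
(* Drop consecutive duplicates: ["a"; "a"; "b"; "a"] -> ["a"; "b"; "a"]. *)
let rec dedup = function
  | a :: (b :: _ as rest) when a = b -> dedup rest
  | a :: rest -> a :: dedup rest
  | [] -> []

let ctc_sketch ?(pad_token = "<pad>") ?(word_delimiter_token = "|")
    ?(cleanup = true) (tokens : string list) : string list =
  tokens
  (* 1. Collapse runs of identical tokens. *)
  |> dedup
  (* 2. Remove padding, which only served to separate repeats. *)
  |> List.filter (fun tok -> tok <> pad_token)
  (* 3. Optionally turn word delimiters into spaces. *)
  |> List.map (fun tok ->
         if cleanup && tok = word_delimiter_token then " " else tok)

let () =
  [ "h"; "h"; "<pad>"; "i"; "<pad>"; "<pad>"; "|"; "|"; "y"; "o" ]
  |> ctc_sketch
  |> String.concat ""
  |> print_endline
```

Deduplicating before removing the pad token is what lets a genuine double letter survive: in `[ "l"; "<pad>"; "l" ]` the pad keeps the two `"l"` tokens from being merged.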
val sequence : t list -> t
sequence decoders chains decoders left to right. Each decoder's output token list is the input to the next.
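A possible use of sequence, assuming the Brot library is installed and that sequence accepts a list of decoders as its description suggests. The token list is made up for illustration.

```ocaml
(* Resolve byte fallback tokens first, then concatenate everything. *)
let decoder =
  Brot.Decoder.sequence
    [ Brot.Decoder.byte_fallback (); Brot.Decoder.fuse () ]

let () =
  print_endline (Brot.Decoder.decode decoder [ "<0x48>"; "<0x69>"; "!" ])
```

The per-token stage is placed first so that it still sees individual tokens; after fuse, only a single token remains.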
val replace : pattern:string -> by:string -> unit -> t
replace ~pattern ~by () is a collapsing decoder that joins the token list, replaces all literal occurrences of pattern with by in the result, and returns a single-element list.
val strip : ?left:bool -> ?right:bool -> ?content:char -> unit -> t
strip ~left ~right ~content () is a collapsing decoder that joins the token list and removes leading (when left is true) and/or trailing (when right is true) occurrences of content from the result. left and right default to false; content defaults to ' '.
val fuse : unit -> t
fuse () is a collapsing decoder that concatenates all tokens into a single string with no delimiter.
Operations
val decode : t -> string list -> string
decode decoder tokens applies decoder to tokens and returns the decoded text.
Formatting
val pp : Stdlib.Format.formatter -> t -> unit
pp ppf decoder formats decoder for debugging.
Serialization
val to_json : t -> Jsont.json
to_json decoder serializes decoder to HuggingFace JSON format.
val of_json : Jsont.json -> (t, string) Stdlib.result
of_json json is a decoder from HuggingFace JSON format. Errors if json is not an object, has a missing or unknown "type" field, or has invalid parameters.
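A round trip through the two functions might look like this. The sketch assumes the Brot library is installed and uses only the signatures listed on this page; the choice of a wordpiece decoder is arbitrary.

```ocaml
let () =
  let decoder = Brot.Decoder.wordpiece ~prefix:"##" ~cleanup:true () in
  (* Serialize to a Jsont.json value in HuggingFace format. *)
  let json = Brot.Decoder.to_json decoder in
  (* Parsing can fail, so of_json returns a result that must be matched. *)
  match Brot.Decoder.of_json json with
  | Ok decoder' ->
      Format.printf "round-tripped: %a@." Brot.Decoder.pp decoder'
  | Error msg -> prerr_endline ("of_json failed: " ^ msg)
```

Handling the Error case explicitly is useful when the JSON comes from an external tokenizer file, where an unknown "type" field is a realistic failure.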