saga

Hugging Face quality. OCaml efficiency. Production-ready tokenization.


why saga?

modern tokenizers

BPE and WordPiece out of the box. Compatible with pretrained models from Hugging Face.

unicode done right

Proper handling of multilingual text, emoji, and special characters. No surprises.

direct tensor encoding

Encode straight to Nx tensors. Batch processing with padding and truncation built in.

type-safe vocabularies

Vocabularies that can't go out of sync. Special tokens that always exist.


show me the code

PYTHON

from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased"
)

# Tokenize
tokens = tokenizer.tokenize("Hello world")

# Encode to IDs
input_ids = tokenizer.encode(
    "Hello world",
    max_length=128,
    padding="max_length"
)

# Batch encode
batch = tokenizer(
    ["Hello", "World"],
    padding=True,
    return_tensors="pt"
)

SAGA

open Saga

(* Load tokenizer *)
let tokenizer = 
  Wordpiece.from_files 
    ~vocab:"vocab.txt"

(* Tokenize *)
let tokens = Wordpiece.tokenize tokenizer "Hello world"

(* Encode to IDs, with a vocab built or loaded earlier *)
let input_ids =
  encode ~vocab
    ~max_len:128
    "Hello world"

(* Batch encode - returns Nx tensor *)
let batch = 
  encode_batch ~vocab
    ~pad:true
    ["Hello"; "World"]

tokenization that works

(* Simple tokenization *)
let tokens = tokenize "Hello world! 你好世界"
(* ["Hello"; "world"; "!"; "你好世界"] *)

(* Build vocabulary from corpus *)
let vocab = 
  vocab ~max_size:30000 ~min_freq:2
    (List.concat_map tokenize corpus)

(* BPE tokenization *)
let bpe = Bpe.from_files ~vocab:"vocab.json" ~merges:"merges.txt" in
let tokens = Bpe.tokenize bpe "unrecognizable"
(* subword pieces, e.g. ["un"; "recogn"; "izable"], depending on merges.txt *)

(* Unicode normalization *)
let clean = 
  normalize ~lowercase:true ~strip_accents:true 
    "Café RÉSUMÉ"
(* "cafe resume" *)
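
Put together, preprocessing is plain function composition. A minimal sketch built only from the calls above, where corpus is any string list:

(* Normalize the corpus, build a vocab from it, then encode new text
   through the same normalization *)
let cleaned =
  List.map (normalize ~lowercase:true ~strip_accents:true) corpus

let vocab =
  vocab ~max_size:30000 ~min_freq:2
    (List.concat_map tokenize cleaned)

let ids =
  encode ~vocab ~max_len:128
    (normalize ~lowercase:true ~strip_accents:true "Café RÉSUMÉ")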

the good parts

Pretrained compatibility
Load vocabularies from Hugging Face models. Your BERT tokenizer works out of the box.
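
In practice that means pointing Wordpiece.from_files at the vocab.txt shipped with a BERT checkpoint. A sketch, with an illustrative path:

(* Sketch: vocab.txt downloaded from a BERT checkpoint; the path is
   illustrative *)
let bert = Wordpiece.from_files ~vocab:"bert-base-uncased/vocab.txt"

let pieces = Wordpiece.tokenize bert "unaffable"
(* WordPiece marks continuation pieces with a "##" prefix *)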

Batch processing that scales
Encode thousands of texts efficiently. Automatic padding and truncation. Direct tensor output.
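
A sketch, assuming encode_batch takes the same ~max_len option as encode (that combination is an assumption, not confirmed API):

(* Pad short texts and truncate long ones to a fixed length;
   ~max_len on encode_batch is assumed here, mirroring encode *)
let batch =
  encode_batch ~vocab
    ~pad:true
    ~max_len:128
    ["a short text"; "a much longer text that gets truncated"]
(* batch is an Nx tensor, one row per input *)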

Unicode that doesn't break
Proper grapheme clustering. CJK text handled correctly. Emoji that don't corrupt your tokens.
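
The tokenize example above already shows this; the same holds with emoji in the mix (a sketch; exact boundaries depend on the segmentation rules):

let tokens = tokenize "naïve café 😀 你好"
(* Accented Latin, emoji, and CJK all pass through intact *)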

Type safety throughout
Vocabularies that can't get out of sync. Special tokens that are always defined. No string keys to typo.
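
For illustration, with hypothetical accessor names (Vocab.pad_id and Vocab.unk_id are not confirmed API): special tokens are ordinary values, so a missing token is a compile error, not a runtime lookup failure.

(* Hypothetical accessors, for illustration only *)
let pad = Vocab.pad_id vocab   (* an int, guaranteed to exist *)
let unk = Vocab.unk_id vocab   (* no string key to typo *)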


what's implemented

(✓ implemented, ○ planned)

tokenizers

  • ✓ Word-level
  • ✓ Character-level
  • ✓ BPE
  • ✓ WordPiece
  • ○ SentencePiece

preprocessing

  • ✓ Unicode normalization
  • ✓ Accent stripping
  • ✓ Case folding
  • ✓ Whitespace cleanup
  • ✓ Control char removal

features

  • ✓ Vocabulary management
  • ✓ Special tokens
  • ✓ Batch encoding
  • ✓ Padding/truncation
  • ○ Training tokenizers

integration

  • ✓ Nx tensor output
  • ✓ File I/O
  • ✓ Token caching
  • ○ Rust backend
  • ○ Model zoo

get started

Saga is part of the Raven ecosystem. When it's released:

# Install
opam install saga

# Try it
open Saga

let () =
  let text = "Hello world! How are you?" in
  let tokens = tokenize text in
  List.iter print_endline tokens

For now, check out the documentation to learn more.