saga
Hugging Face quality. OCaml efficiency. Production-ready tokenization.
why saga?
modern tokenizers
BPE and WordPiece out of the box. Compatible with pretrained models from Hugging Face.
unicode done right
Proper handling of multilingual text, emoji, and special characters. No surprises.
direct tensor encoding
Encode straight to Nx tensors. Batch processing with padding and truncation built in.
type-safe vocabularies
Vocabularies that can't go out of sync. Special tokens that always exist.
show me the code
PYTHON
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize
tokens = tokenizer.tokenize("Hello world")

# Encode to IDs
input_ids = tokenizer.encode(
    "Hello world",
    max_length=128,
    padding="max_length",
)

# Batch encode
batch = tokenizer(
    ["Hello", "World"],
    padding=True,
    return_tensors="pt",
)
SAGA
open Saga

(* Load tokenizer *)
let tokenizer = Wordpiece.from_files ~vocab:"vocab.txt"

(* Tokenize *)
let tokens = Wordpiece.tokenize tokenizer "Hello world"

(* Encode to IDs *)
let input_ids = encode ~vocab ~max_len:128 "Hello world"

(* Batch encode - returns Nx tensor *)
let batch = encode_batch ~vocab ~pad:true ["Hello"; "World"]
tokenization that works
(* Simple tokenization *)
let tokens = tokenize "Hello world! 你好世界"
(* ["Hello"; "world"; "!"; "你好世界"] *)

(* Build vocabulary from corpus *)
let vocab =
  vocab ~max_size:30000 ~min_freq:2 (List.concat_map tokenize corpus)

(* BPE tokenization *)
let bpe = Bpe.from_files ~vocab:"vocab.json" ~merges:"merges.txt"
let tokens = Bpe.tokenize bpe "unrecognizable"
(* ["un"; "##rec"; "##ogn"; "##iz"; "##able"] *)

(* Unicode normalization *)
let clean = normalize ~lowercase:true ~strip_accents:true "Café RÉSUMÉ"
(* "cafe resume" *)
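Those pieces compose into an end-to-end flow: build a vocabulary once, then reuse it for encoding. A minimal sketch based on the calls above; `decode` is a hypothetical inverse of `encode`, and saga's actual name for it may differ:

open Saga

(* Build a vocabulary from a tiny corpus *)
let corpus = [ "Hello world!"; "Hello again, world." ]
let vocab = vocab ~max_size:30000 ~min_freq:1 (List.concat_map tokenize corpus)

(* Encode a single text to IDs, capped at 128 tokens *)
let ids = encode ~vocab ~max_len:128 "Hello world!"

(* Hypothetical: [decode] as the inverse of [encode], for a sanity check *)
let round_trip = decode ~vocab ids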
the good parts
Pretrained compatibility
Load vocabularies from Hugging Face models. Your BERT tokenizer works out of the box.
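For example, driving WordPiece with a vocab.txt downloaded from a BERT checkpoint. A sketch assuming you have already fetched the file from the bert-base-uncased repository; the path is a placeholder for wherever you saved it:

open Saga

(* vocab.txt as shipped with the bert-base-uncased checkpoint *)
let tokenizer = Wordpiece.from_files ~vocab:"bert-base-uncased/vocab.txt"

(* Same subword splits the pretrained model was trained on *)
let tokens = Wordpiece.tokenize tokenizer "unrecognizable"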
Batch processing that scales
Encode thousands of texts efficiently. Automatic padding and truncation. Direct tensor output.
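A sketch using `encode_batch` from the example above; the `~max_len` truncation argument is an assumption carried over from `encode`:

open Saga

let vocab =
  vocab ~max_size:1000 ~min_freq:1
    (List.concat_map tokenize [ "a short text"; "a much longer text indeed" ])

(* Pads the shorter text, truncates anything past 128 tokens,
   and returns a single Nx tensor with one row per input *)
let batch =
  encode_batch ~vocab ~pad:true ~max_len:128
    [ "a short text"; "a much longer text indeed" ]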
Unicode that doesn't break
Proper grapheme clustering. CJK text handled correctly. Emoji that don't corrupt your tokens.
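For instance (the output comment is illustrative, not confirmed saga output; exact splits depend on which tokenizer you use):

open Saga

(* A ZWJ emoji sequence stays one grapheme cluster; CJK stays intact *)
let tokens = tokenize "family: 👨‍👩‍👧 in 東京"
(* Illustrative: ["family"; ":"; "👨‍👩‍👧"; "in"; "東京"] *)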
Type safety throughout
Vocabularies that can't get out of sync. Special tokens that are always defined. No string keys to typo.
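Concretely, special tokens come from the vocabulary value itself rather than from string literals. A hedged sketch with hypothetical accessors; `Vocab.pad_id` and `Vocab.unk_id` are illustrative names, not confirmed saga API:

open Saga

let vocab = vocab ~max_size:1000 ~min_freq:1 (tokenize "hello world")

(* Hypothetical accessors: if this compiles, the tokens exist;
   there is no "[PAD]" string literal to mistype *)
let pad_id = Vocab.pad_id vocab
let unk_id = Vocab.unk_id vocab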
what's implemented
tokenizers
- ✓ Word-level
- ✓ Character-level
- ✓ BPE
- ✓ WordPiece
- ○ SentencePiece
preprocessing
- ✓ Unicode normalization
- ✓ Accent stripping
- ✓ Case folding
- ✓ Whitespace cleanup
- ✓ Control char removal
features
- ✓ Vocabulary management
- ✓ Special tokens
- ✓ Batch encoding
- ✓ Padding/truncation
- ○ Training tokenizers
integration
- ✓ Nx tensor output
- ✓ File I/O
- ✓ Token caching
- ○ Rust backend
- ○ Model zoo
get started
Saga is part of the Raven ecosystem. When it's released:
# Install
opam install saga

# Try it
open Saga

let () =
  let text = "Hello world! How are you?" in
  let tokens = tokenize text in
  List.iter print_endline tokens
For now, check out the documentation to learn more.