Module Talon.Col

Column creation and manipulation for heterogeneous data types.

Columns are the fundamental building blocks of dataframes, each storing a homogeneous array of values with consistent null handling.

type t =
  | P : ('a, 'b) Nx.dtype * ('a, 'b) Nx.t * bool array option -> t
  | S : string option array -> t
  | B : bool option array -> t
    (*

    Heterogeneous column representation with explicit null support.

    Variants:

    • P (dtype, tensor, mask): Numeric data stored as 1D Nx tensors with an optional null mask indicating which entries are missing
    • S arr: String data with explicit None for nulls
    • B arr: Boolean data with explicit None for nulls

    Null representation:

    • Float columns: the optional mask determines which rows are null. The underlying tensor may contain any float values (including NaN), but these are treated as data unless masked out.
    • Integer columns: the optional mask determines null rows. Extreme integer values such as Int32.min_int remain valid data unless explicitly masked.
    • String/Boolean columns: None values indicate nulls

    Invariants:

    • Numeric tensors must be 1D
    • All values in a column have the same length
    • Null semantics are preserved across operations

    Performance:

    • Numeric operations leverage vectorized Nx computations
    • String/Boolean operations use standard OCaml array operations
    *)
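The null rules above can be sketched with a simplified stand-in type. This is not Talon's definition: model, Num, Str, Bool, and is_null are hypothetical names, a plain float array replaces the Nx tensor, and the mask polarity (true marks a null row) is an assumption.

```ocaml
(* Simplified stand-in for Col.t: a float array replaces the Nx tensor.
   Assumption: mask.(i) = true means row i is null. *)
type model =
  | Num of float array * bool array option
  | Str of string option array
  | Bool of bool option array

(* A row is null only if the mask says so (numeric) or the cell is None. *)
let is_null col i =
  match col with
  | Num (_, None) -> false
  | Num (_, Some mask) -> mask.(i)
  | Str arr -> arr.(i) = None
  | Bool arr -> arr.(i) = None

(* Same payload, with and without a mask. *)
let unmasked = Num ([| 1.0; nan |], None)
let masked = Num ([| 1.0; nan |], Some [| false; true |])
```

In this model the NaN in unmasked is ordinary data, while the same payload in masked is null at row 1 because the mask says so.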

From arrays (non-nullable)

Create columns from arrays without introducing null masks. Values are taken literally; to represent missing data, use the _opt constructors instead.

val float32 : float array -> t

float32 arr creates a float32 column from array.

The resulting column has no null mask. All values, including nan, are treated as regular data. Use float32_opt to create a nullable column.

Time complexity: O(n) where n is array length.

val float64 : float array -> t

float64 arr creates a float64 column from array.

The resulting column has no null mask. All values, including nan, are treated as regular data. Use float64_opt for nullable columns.

Time complexity: O(n) where n is array length.

val int32 : int32 array -> t

int32 arr creates an int32 column from array.

The resulting column has no null mask. Values equal to Int32.min_int are treated as ordinary data. Use int32_opt to represent missing values.

Time complexity: O(n) where n is array length.

val int64 : int64 array -> t

int64 arr creates an int64 column from array.

The resulting column has no null mask. Values equal to Int64.min_int remain ordinary data. Use int64_opt to represent missing values.

Time complexity: O(n) where n is array length.

val bool : bool array -> t

bool arr creates a non-nullable boolean column from array.

All values are wrapped as Some value, creating a column with no nulls. Use bool_opt if you need explicit null support.

Time complexity: O(n) where n is array length.

val string : string array -> t

string arr creates a non-nullable string column from array.

All values are wrapped as Some value, creating a column with no nulls. Use string_opt if you need explicit null support.

Time complexity: O(n) where n is array length.

From option arrays (nullable)

Create columns from option arrays with explicit null representation. Numeric types attach a null mask (while storing placeholder values in the tensor), whereas string/bool types preserve the option structure.
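As a rough illustration of how an option array can be separated into placeholder-filled storage plus a mask, here is a stdlib-only sketch. split_opt is a hypothetical helper, not part of Talon, and it assumes that true in the mask marks a null row.

```ocaml
(* Split an option array into a payload array and a null mask.
   Assumption: mask.(i) = true marks row i as null. *)
let split_opt ~placeholder (arr : 'a option array) : 'a array * bool array =
  let payload =
    Array.map (function Some v -> v | None -> placeholder) arr
  in
  let mask = Array.map (function None -> true | Some _ -> false) arr in
  (payload, mask)

(* None and Some Int32.min_int produce the same payload value;
   only the mask tells them apart. *)
let payload, mask =
  split_opt ~placeholder:Int32.min_int
    [| Some 42l; None; Some Int32.min_int |]
```

The last two payload cells are identical, which is why the mask, not the placeholder, has to be the authority on nullness.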

val float32_opt : float option array -> t

float32_opt arr creates a nullable float32 column.

None values are recorded in the null mask. A placeholder nan is stored in the tensor for each null, but callers must rely on the mask (via option accessors or Agg helpers) to detect nulls.

Example:

  let col = Col.float32_opt [| Some 1.0; None; Some 3.14 |] in
  assert (Col.null_count col = 1)

Time complexity: O(n) where n is array length.

val float64_opt : float option array -> t

float64_opt arr creates a nullable float64 column.

None values are recorded in the null mask. Placeholder nan values are stored in the tensor but callers must rely on the mask to detect nulls.

Time complexity: O(n) where n is array length.

val int32_opt : int32 option array -> t

int32_opt arr creates a nullable int32 column.

None values are recorded in the null mask. The tensor stores a placeholder value (Int32.min_int) for efficiency, but the mask is authoritative when checking for nulls.

Example:

  let col = Col.int32_opt [| Some 42l; None; Some (-1l) |] in
  assert (Col.null_count col = 1)

Time complexity: O(n) where n is array length.

val int64_opt : int64 option array -> t

int64_opt arr creates a nullable int64 column.

None values are recorded in the null mask. The tensor stores a placeholder value (Int64.min_int) for efficiency, but the mask is authoritative when checking for nulls.

Time complexity: O(n) where n is array length.

val bool_opt : bool option array -> t

bool_opt arr creates a nullable boolean column.

Unlike numeric columns, boolean columns preserve the option type directly for more precise null semantics.

Time complexity: O(1) - no array copying required.

val string_opt : string option array -> t

string_opt arr creates a nullable string column.

The option array is used directly without conversion, preserving exact null semantics.

Time complexity: O(1) - no array copying required.
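The difference between the O(n) non-nullable constructors and the O(1) nullable ones for string and boolean data can be illustrated with two hypothetical helpers, wrap and direct. They model the behaviour described above and are not Talon functions.

```ocaml
(* Non-nullable path: each value is wrapped, producing a fresh array (O(n)). *)
let wrap (arr : 'a array) : 'a option array = Array.map (fun v -> Some v) arr

(* Nullable path: the option array is stored as given (O(1)). *)
let direct (arr : 'a option array) : 'a option array = arr

let source = [| Some "a"; None |]
let stored = direct source
```

In this model stored is physically the same array as source. If the real constructors share the array in the same way, later mutation of the source array would be visible through the column; the documentation does not say either way, so copying first is the cautious choice.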

From lists

Convenience functions for creating columns from lists. Each is equivalent to converting the list with Array.of_list and then calling the corresponding array constructor.

val float32_list : float list -> t

float32_list lst creates a float32 column from list.

Time complexity: O(n) where n is list length.

val float64_list : float list -> t

float64_list lst creates a float64 column from list.

Time complexity: O(n) where n is list length.

val int32_list : int32 list -> t

int32_list lst creates an int32 column from list.

Time complexity: O(n) where n is list length.

val int64_list : int64 list -> t

int64_list lst creates an int64 column from list.

Time complexity: O(n) where n is list length.

val bool_list : bool list -> t

bool_list lst creates a boolean column from list.

Time complexity: O(n) where n is list length.

val string_list : string list -> t

string_list lst creates a string column from list.

Time complexity: O(n) where n is list length.

val null_mask : t -> bool array option

null_mask col returns the explicit null mask tracked for numeric columns constructed via nullable builders.

Returns Some mask when an explicit mask exists, None otherwise.
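Callers of a function with this return type need to handle both cases. The sketch below shows one way to consume a bool array option; null_rows is a hypothetical helper, and it assumes that true in the mask marks a null row.

```ocaml
(* Indices of null rows, given the result of a null_mask-style lookup.
   Assumption: true in the mask marks a null row. *)
let null_rows (mask : bool array option) : int list =
  match mask with
  | None -> [] (* no explicit mask: no nulls are tracked *)
  | Some m ->
      let acc = ref [] in
      Array.iteri (fun i is_null -> if is_null then acc := i :: !acc) m;
      List.rev !acc
```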

From tensors

Direct integration with Nx tensors for efficient column creation.

val of_tensor : ('a, 'b) Nx.t -> t

of_tensor t creates a column from a 1D tensor.

The tensor's dtype is preserved in the resulting column. The column is treated as non-nullable: existing payload values (including NaNs or extremal integers) remain regular data. Use the _opt builders to attach null masks.

  • raises Invalid_argument

    if tensor is not 1D.

Time complexity: O(1) - tensor is used directly without copying.

Null handling

Functions for detecting and manipulating null values in columns. Numeric columns track nulls with a mask and string/boolean columns use None, but has_nulls, null_count, and drop_nulls accept any column type.

val has_nulls : t -> bool

has_nulls col returns true if column contains any null values.

Checks for nulls according to column type:

  • Numeric columns: consult the null mask (if present)
  • String/Boolean columns: scan for None values

Time complexity: O(n) in worst case (scans entire column).

val null_count : t -> int

null_count col returns the number of null values in column.

Counts nulls according to column type. Use this when you need the exact count; has_nulls is sufficient when you only need to know whether any null exists.

Time complexity: O(n) - must scan entire column.
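The complexity difference between the two checks can be seen in a stdlib-only sketch over an option array. has_nulls_opt and null_count_opt are hypothetical names that model the string/boolean case; they are not Talon's implementation.

```ocaml
(* Stops at the first None, so it can return before scanning everything. *)
let has_nulls_opt (arr : 'a option array) : bool =
  Array.exists (function None -> true | Some _ -> false) arr

(* Always visits every cell to produce the exact count. *)
let null_count_opt (arr : 'a option array) : int =
  Array.fold_left
    (fun n cell -> match cell with None -> n + 1 | Some _ -> n)
    0 arr
```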

val drop_nulls : t -> t

drop_nulls col returns a new column with null values removed.

The result contains only the non-null values, so it may be shorter than the input. The column type is preserved (numeric columns remain numeric, etc.).

Example:

  let col = Col.float32_opt [| Some 1.0; None; Some 3.0 |] in
  let clean = Col.drop_nulls col in
  (* clean now contains [1.0; 3.0] *)
  assert (Col.null_count clean = 0)

Time complexity: O(n) where n is the original column length.
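For mask-backed numeric data, the filtering step might look like the following sketch. drop_masked is a hypothetical helper over a plain array and mask, not Talon's code, and it assumes that true in the mask marks a null row.

```ocaml
(* Keep only the rows whose mask entry is false.
   Assumption: mask.(i) = true marks row i as null. *)
let drop_masked (payload : 'a array) (mask : bool array) : 'a array =
  if Array.length payload <> Array.length mask then
    invalid_arg "drop_masked: payload and mask lengths differ";
  let kept = ref [] in
  Array.iteri (fun i v -> if not mask.(i) then kept := v :: !kept) payload;
  Array.of_list (List.rev !kept)
```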

val fill_nulls_float32 : t -> value:float -> t

fill_nulls_float32 col ~value replaces null values with the given float value.

Works only on float32 columns. NaN values are treated as nulls and replaced with the specified value.

  • parameter value

    The replacement value for null entries

  • raises Invalid_argument

    if column is not float32 type

Time complexity: O(n) where n is the column length.

val fill_nulls_float64 : t -> value:float -> t

fill_nulls_float64 col ~value replaces null values with the given float value.

Works only on float64 columns. NaN values are treated as nulls and replaced with the specified value.

  • parameter value

    The replacement value for null entries

  • raises Invalid_argument

    if column is not float64 type

Time complexity: O(n) where n is the column length.

val fill_nulls_int32 : t -> value:int32 -> t

fill_nulls_int32 col ~value replaces null values with the given int32 value.

Works only on int32 columns. Int32.min_int values are treated as nulls and replaced with the specified value.

  • parameter value

    The replacement value for null entries

  • raises Invalid_argument

    if column is not int32 type

Time complexity: O(n) where n is the column length.

val fill_nulls_int64 : t -> value:int64 -> t

fill_nulls_int64 col ~value replaces null values with the given int64 value.

Works only on int64 columns. Int64.min_int values are treated as nulls and replaced with the specified value.

  • parameter value

    The replacement value for null entries

  • raises Invalid_argument

    if column is not int64 type

Time complexity: O(n) where n is the column length.

val fill_nulls_string : t -> value:string -> t

fill_nulls_string col ~value replaces null values with the given string value.

Works only on string columns. None values are treated as nulls and replaced with the specified value.

  • parameter value

    The replacement value for null entries

  • raises Invalid_argument

    if column is not string type

Time complexity: O(n) where n is the column length.

val fill_nulls_bool : t -> value:bool -> t

fill_nulls_bool col ~value replaces null values with the given boolean value.

Works only on boolean columns. None values are treated as nulls and replaced with the specified value.

  • parameter value

    The replacement value for null entries

  • raises Invalid_argument

    if column is not boolean type

Time complexity: O(n) where n is the column length.
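For the option-backed column types, the fill-and-type-check pattern can be sketched as follows. opt_col and fill_string are hypothetical stand-ins that model fill_nulls_string on a reduced column type; they are not Talon's definitions.

```ocaml
(* Stand-in for the option-backed variants of a column. *)
type opt_col =
  | Str of string option array
  | Bool of bool option array

(* Replace every None in a string column; reject any other column type. *)
let fill_string col ~value =
  match col with
  | Str arr ->
      Str (Array.map (function None -> Some value | some -> some) arr)
  | Bool _ -> invalid_arg "fill_string: column is not string type"
```

The result is a new array of the same length with no None cells, and the input array is left unchanged.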