Module Talon.Agg

Column-wise aggregation and transformation operations.

This module provides efficient aggregations that operate on entire columns, producing scalar results or new columns. All operations preserve type safety through dedicated submodules for different data types.

Performance: Operations use vectorized Nx computations where possible, which is typically faster than iterating over rows manually. Prefer cumsum, diff, and similar functions over hand-written loops for column transformations.

Type-specific aggregations

Each submodule ensures operations are only applied to compatible column types, preventing runtime type errors and providing clear semantics.

module Float : sig ... end

Float aggregations - work on any numeric column (int or float types).

module Int : sig ... end

Integer aggregations - work on any integer column type.

module String : sig ... end

String aggregations - work on string columns only.

module Bool : sig ... end

Boolean aggregations - work on boolean columns only.

Generic aggregations

These functions work on any column type and provide information about the data structure rather than mathematical operations.

val count : t -> string -> int

count df name returns the number of non-null values in column name.

Null definition varies by column type:

  • Float columns: NaN values
  • Integer columns: Int32.min_int, Int64.min_int sentinel values
  • String/Bool columns: None values

This is the complement of null_count from the Col module.

Time complexity: O(n) where n is the number of rows.
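The float-column case of the null definition above can be sketched with the standard library alone. This is a model of the semantics, not Talon's implementation: the column is assumed to be a plain float array, and count_non_null is a hypothetical name.

```ocaml
(* Hypothetical model, not Talon's implementation: a float column is
   represented as a plain float array in which NaN marks a null. *)
let count_non_null (col : float array) : int =
  Array.fold_left
    (fun acc x -> if Float.is_nan x then acc else acc + 1)
    0 col

(* Two of the three entries are non-null. *)
let non_null = count_non_null [| 1.0; Float.nan; 3.0 |]
```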

val nunique : t -> string -> int

nunique df name returns the count of unique non-null values for any column type.

Null values are excluded from the unique count. Memory use grows with the number of distinct values, so high-cardinality columns in large datasets can be expensive.

Time complexity: O(n) for simple types, O(n * m) for strings where m is average length.
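The memory and time costs described above follow from tracking every distinct value seen so far. A minimal sketch of that idea, assuming a string column modeled as a string option array (nunique_model is a hypothetical name, and Talon may use a different strategy internally):

```ocaml
(* Hypothetical model: None marks a null, as described for string columns. *)
let nunique_model (col : string option array) : int =
  (* The table holds one entry per distinct non-null value. *)
  let seen : (string, unit) Hashtbl.t = Hashtbl.create 16 in
  Array.iter
    (function
      | Some s -> Hashtbl.replace seen s ()
      | None -> ())
    col;
  Hashtbl.length seen
```

Hashing a string touches each of its characters, which is where the m factor in the string case comes from.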

val value_counts : t -> string -> Col.t * int array

value_counts df name returns unique non-null values and their counts.

Returns a tuple (unique_values_column, counts_array). The column and the array have the same length, and the count at index i is the number of occurrences of the unique value at index i. Useful for frequency analysis and building histograms.

The order of values is not guaranteed.

Time complexity: O(n) for simple types, O(n * m) for strings where m is the average string length.
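The pairing of values and counts by index can be illustrated as follows. This is a sketch under assumptions: the column is modeled as a string option array, the result is a pair of plain arrays rather than a Col.t, and value_counts_model is a hypothetical name.

```ocaml
(* Hypothetical model: returns unique non-null values and, at the same
   index, how many times each occurs. Order follows the hash table and is
   therefore unspecified, matching the "not guaranteed" note above. *)
let value_counts_model (col : string option array) : string array * int array =
  let counts : (string, int) Hashtbl.t = Hashtbl.create 16 in
  Array.iter
    (function
      | Some s ->
          let n = Option.value (Hashtbl.find_opt counts s) ~default:0 in
          Hashtbl.replace counts s (n + 1)
      | None -> ())
    col;
  let pairs = Hashtbl.fold (fun k v acc -> (k, v) :: acc) counts [] in
  (Array.of_list (List.map fst pairs), Array.of_list (List.map snd pairs))
```

Because the order is unspecified, callers that need a stable presentation should sort the pairs themselves.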

val is_null : t -> string -> bool array

is_null df name returns a boolean array in which true marks a null value at the same row.

Null definition varies by column type:

  • Float columns: NaN values
  • Integer columns: Int32.min_int, Int64.min_int sentinel values
  • String/Bool columns: None values

Useful for conditional operations and null-aware filtering.

Time complexity: O(n) where n is the number of rows.
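The three null definitions listed above can be modeled side by side. The sketch below uses plain arrays instead of Talon columns, and the function names are hypothetical.

```ocaml
(* Hypothetical models of the per-type null test; not Talon's code. *)

(* Float columns: NaN is null. *)
let is_null_float (col : float array) : bool array =
  Array.map Float.is_nan col

(* Int32 columns: the Int32.min_int sentinel is null. *)
let is_null_int32 (col : int32 array) : bool array =
  Array.map (fun x -> Int32.equal x Int32.min_int) col

(* String/Bool columns: None is null. *)
let is_null_opt (col : 'a option array) : bool array =
  Array.map Option.is_none col
```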

Column transformations

These operations return new columns, preserving the input column's dtype where possible. They are efficient alternatives to row-wise computations for common column transformations.

val cumsum : t -> string -> Col.t

cumsum df name returns cumulative sum preserving column dtype.

Computes the running total from the first row to the current row. Null values contribute 0 to the running total, but a null row stays null in the output; later rows continue from the total accumulated so far.

The result column has the same dtype as the input column.

  • raises Invalid_argument

    if column is not numeric.

Time complexity: O(n) where n is the number of rows.
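One reading of the null handling described above is sketched below: nulls add nothing to the running total, and a null input row yields a null output row. The float-array representation (NaN as null) and the name cumsum_model are assumptions for illustration.

```ocaml
(* Hypothetical model of cumsum on a float column (NaN = null). *)
let cumsum_model (col : float array) : float array =
  let n = Array.length col in
  (* Start with every slot null; only non-null inputs overwrite it. *)
  let out = Array.make n Float.nan in
  let total = ref 0.0 in
  for i = 0 to n - 1 do
    if not (Float.is_nan col.(i)) then begin
      total := !total +. col.(i);
      out.(i) <- !total
    end
  done;
  out
```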

val cumprod : t -> string -> Col.t

cumprod df name returns cumulative product preserving column dtype.

Computes the running product from the first row to the current row. Null values propagate through the computation (null * value = null).

  • raises Invalid_argument

    if column is not numeric.

Time complexity: O(n) where n is the number of rows.
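Propagation can be illustrated on a float array, where NaN stands for null and IEEE multiplication already carries NaN forward. The sketch assumes that "propagate" means every row after the first null is also null; cumprod_model is a hypothetical name.

```ocaml
(* Hypothetical model of cumprod on a float column (NaN = null). *)
let cumprod_model (col : float array) : float array =
  let n = Array.length col in
  let out = Array.make n Float.nan in
  let acc = ref 1.0 in
  for i = 0 to n - 1 do
    (* NaN *. x is NaN, so once a null is seen the product stays null. *)
    acc := !acc *. col.(i);
    out.(i) <- !acc
  done;
  out
```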

val diff : t -> string -> ?periods:int -> unit -> Col.t

diff df name ?periods () returns difference between elements.

Computes value[i] - value[i-periods] for each element. The first periods elements are null because no element exists periods positions before them.

  • parameter periods

    Number of periods to shift for difference (default 1).

  • raises Invalid_argument

    if column is not numeric.

Time complexity: O(n) where n is the number of rows.
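The formula above can be written out directly on a float array. This is a sketch with assumed names (diff_model) and an assumed representation (NaN as null); any row whose source index falls outside the array is set to null, which for negative periods is a choice of this sketch and not a documented behavior.

```ocaml
(* Hypothetical model of diff on a float column (NaN = null). *)
let diff_model ?(periods = 1) (col : float array) : float array =
  let n = Array.length col in
  Array.init n (fun i ->
      let j = i - periods in
      (* No element at index j: the result is null. *)
      if j < 0 || j >= n then Float.nan else col.(i) -. col.(j))
```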

val pct_change : t -> string -> ?periods:int -> unit -> Col.t

pct_change df name ?periods () returns percentage change between elements.

Computes (value[i] - value[i-periods]) / value[i-periods] for each element. The first periods elements will be null. Division by zero produces null.

  • parameter periods

    Number of periods to shift for comparison (default 1).

  • raises Invalid_argument

    if column is not numeric.

Time complexity: O(n) where n is the number of rows.
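A minimal sketch of the computation above, including the division-by-zero rule, on a float array with NaN as null. The name pct_change_model is hypothetical, and the handling of out-of-range indices for negative periods is an assumption of the sketch.

```ocaml
(* Hypothetical model of pct_change on a float column (NaN = null). *)
let pct_change_model ?(periods = 1) (col : float array) : float array =
  let n = Array.length col in
  Array.init n (fun i ->
      let j = i - periods in
      if j < 0 || j >= n then Float.nan
      else
        let prev = col.(j) in
        (* A zero denominator produces null instead of infinity. *)
        if prev = 0.0 then Float.nan else (col.(i) -. prev) /. prev)
```

Note that the result is a fraction: a change from 2.0 to 3.0 yields 0.5, not 50.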

val shift : t -> string -> periods:int -> Col.t

shift df name ~periods shifts values by periods.

Positive periods shift forward (values move down, nulls fill the top). Negative periods shift backward (values move up, nulls fill the bottom).

  • parameter periods

    Number of positions to shift. Positive values shift forward, negative values shift backward.

Time complexity: O(n) where n is the number of rows.
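The direction convention above can be checked with a small model. The sketch assumes a column represented as an option array (None as null); shift_model is a hypothetical name.

```ocaml
(* Hypothetical model of shift: output row i takes input row i - periods,
   or null when that index does not exist. *)
let shift_model ~(periods : int) (col : 'a option array) : 'a option array =
  let n = Array.length col in
  Array.init n (fun i ->
      let j = i - periods in
      if j < 0 || j >= n then None else col.(j))

(* Positive periods: values move down, nulls fill the top. *)
let forward = shift_model ~periods:1 [| Some 1; Some 2; Some 3 |]

(* Negative periods: values move up, nulls fill the bottom. *)
let backward = shift_model ~periods:(-1) [| Some 1; Some 2; Some 3 |]
```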

val fillna : t -> string -> value:Col.t -> Col.t

fillna df name ~value replaces the null values in column name with values taken from value.

The value column must either:

  • Have exactly one element (broadcast to all null positions)
  • Have the same length as the target column (element-wise replacement)

The value column must have the same type as the target column.

  • parameter value

    Column containing replacement values for nulls.

  • raises Invalid_argument

    if value type doesn't match column type or if value column has incompatible length.

Time complexity: O(n) where n is the number of rows.
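The two accepted shapes of value (one element, or the same length as the target) can be sketched as below. This models the rules on float arrays with NaN as null; fillna_model and its error message are hypothetical, and the type check on value has no counterpart here because both arguments are already float arrays.

```ocaml
(* Hypothetical model of fillna on a float column (NaN = null). *)
let fillna_model ~(value : float array) (col : float array) : float array =
  let n = Array.length col in
  (* Choose how a replacement is looked up for row i. *)
  let pick =
    match Array.length value with
    | 1 -> fun _ -> value.(0) (* broadcast the single element *)
    | m when m = n -> fun i -> value.(i) (* element-wise replacement *)
    | _ ->
        invalid_arg
          "fillna_model: value must have length 1 or match the column"
  in
  Array.mapi (fun i x -> if Float.is_nan x then pick i else x) col
```

Non-null rows are left untouched in both modes; only null positions read from value.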