Module Talon.Agg
Column-wise aggregation and transformation operations.
This module provides efficient aggregations that operate on entire columns, producing scalar results or new columns. All operations preserve type safety through dedicated submodules for different data types.
Performance: Operations leverage vectorized Nx computations where possible, providing significant speedups over manual iteration. Use cumsum, diff, and similar functions for efficient column transformations.
Type-specific aggregations
Each submodule ensures operations are only applied to compatible column types, preventing runtime type errors and providing clear semantics.
module Float : sig ... end
Float aggregations - work on any numeric column (int or float types).
module Int : sig ... end
Integer aggregations - work on any integer column type.
module String : sig ... end
String aggregations - work on string columns only.
module Bool : sig ... end
Boolean aggregations - work on boolean columns only.
Generic aggregations
These functions work on any column type and provide information about the data structure rather than mathematical operations.
val count : t -> string -> int
count df name returns the number of non-null values.
Null definition varies by column type:
- Float columns: NaN values
- Integer columns: Int32.min_int, Int64.min_int sentinel values
- String/Bool columns: None values
This is the complement of null_count from the Col module.
Time complexity: O(n) where n is the number of rows.
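The per-type null conventions listed above can be sketched on plain OCaml arrays. The helpers below (count_float, count_int32, count_option) are hypothetical names used only for illustration. They are not Talon functions and do not reflect how Talon stores columns.

```ocaml
(* Hypothetical helpers, not part of Talon: count non-null entries
   under the null conventions described for each column type. *)

(* Float columns: NaN marks a null. *)
let count_float (col : float array) : int =
  Array.fold_left (fun n x -> if Float.is_nan x then n else n + 1) 0 col

(* Int32 columns: Int32.min_int is the null sentinel. *)
let count_int32 (col : int32 array) : int =
  Array.fold_left
    (fun n x -> if Int32.equal x Int32.min_int then n else n + 1)
    0 col

(* String/Bool columns: None marks a null. *)
let count_option (col : 'a option array) : int =
  Array.fold_left (fun n x -> if Option.is_some x then n + 1 else n) 0 col
```

Each helper makes a single pass over the data, which matches the stated O(n) cost.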
val nunique : t -> string -> int
nunique df name returns the count of unique non-null values for any column type.
Works with any column type. Null values are excluded from the unique count. For large datasets, this operation may use significant memory to track unique values.
Time complexity: O(n) for simple types, O(n * m) for strings where m is average length.
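The memory note above follows from having to remember every distinct value seen so far. A minimal sketch of that idea for a string column modelled as a string option array is given below. nunique_strings is a hypothetical helper, and the hash-set approach is an assumption rather than Talon's documented implementation.

```ocaml
(* Hypothetical helper, not part of Talon: count distinct non-null strings.
   The table grows with the number of distinct values, which is where the
   memory cost on large columns comes from. *)
let nunique_strings (col : string option array) : int =
  let seen = Hashtbl.create 16 in
  Array.iter
    (function
      | Some s -> Hashtbl.replace seen s () (* one binding per distinct key *)
      | None -> ())                         (* nulls are excluded *)
    col;
  Hashtbl.length seen
```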
value_counts df name returns unique non-null values and their counts.
Returns a tuple of (unique_values_column, counts_array) where the arrays have the same length and corresponding indices match. Useful for frequency analysis and building histograms.
The order of values is not guaranteed.
Time complexity: O(n) for simple types, O(n * m) for strings.
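To illustrate the shape of the result (two arrays of equal length with matching indices), here is a sketch on a string option array. value_counts_strings is a hypothetical stand-in and not the Talon function. As in the description above, the output order is unspecified.

```ocaml
(* Hypothetical helper, not part of Talon: tally non-null strings and
   return parallel arrays of values and counts. *)
let value_counts_strings (col : string option array) : string array * int array =
  let tbl = Hashtbl.create 16 in
  Array.iter
    (function
      | Some s ->
        let c = Option.value (Hashtbl.find_opt tbl s) ~default:0 in
        Hashtbl.replace tbl s (c + 1)
      | None -> ())
    col;
  (* Index i of the first array pairs with index i of the second. *)
  let pairs = Array.of_seq (Hashtbl.to_seq tbl) in
  (Array.map fst pairs, Array.map snd pairs)
```

Because the order is not guaranteed, callers should look counts up by value rather than rely on position.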
val is_null : t -> string -> bool array
is_null df name returns a boolean array where true indicates null values.
Null definition varies by column type:
- Float columns: NaN values
- Integer columns: Int32.min_int, Int64.min_int sentinel values
- String/Bool columns: None values
Useful for conditional operations and null-aware filtering.
Time complexity: O(n) where n is the number of rows.
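As a sketch of how such a mask supports null-aware filtering, the example below builds masks for plain arrays and uses one to drop null entries. is_null_float, is_null_int64 and drop_nulls are hypothetical helpers, not Talon functions.

```ocaml
(* Hypothetical helpers, not part of Talon. *)

(* Float columns: true where the value is NaN. *)
let is_null_float (col : float array) : bool array =
  Array.map Float.is_nan col

(* Int64 columns: true where the value equals the Int64.min_int sentinel. *)
let is_null_int64 (col : int64 array) : bool array =
  Array.map (fun x -> Int64.equal x Int64.min_int) col

(* Null-aware filtering: keep only the entries whose mask bit is false. *)
let drop_nulls (mask : bool array) (col : 'a array) : 'a array =
  col
  |> Array.to_list
  |> List.filteri (fun i _ -> not mask.(i))
  |> Array.of_list
```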
Column transformations
These operations return new columns, preserving the input column's dtype where possible. They are efficient alternatives to row-wise computations for common column transformations.
cumsum df name returns cumulative sum preserving column dtype.
Computes running total from first row to current row. Null values are treated as 0 for the cumulative operation but preserved in the output (i.e., null + value = null in the result).
The result column has the same dtype as the input column.
Time complexity: O(n) where n is the number of rows.
cumprod df name returns cumulative product preserving column dtype.
Computes running product from first row to current row. Null values propagate through the computation (null * value = null).
Time complexity: O(n) where n is the number of rows.
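A minimal sketch of a running product on a float array, assuming nulls are represented as NaN: under that assumption, ordinary floating-point multiplication already gives the propagation described above. cumprod_float is a hypothetical helper and is not Talon's implementation.

```ocaml
(* Hypothetical helper, not part of Talon: running product over floats.
   Assumes NaN stands for null, so once a NaN is multiplied in, every
   later entry is NaN as well. *)
let cumprod_float (col : float array) : float array =
  let n = Array.length col in
  let out = Array.make n Float.nan in
  let acc = ref 1.0 in
  for i = 0 to n - 1 do
    acc := !acc *. col.(i);
    out.(i) <- !acc
  done;
  out
```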
diff df name ?periods () returns difference between elements.
Computes value[i] - value[i-periods] for each element. The first periods elements will be null since there are no previous values.
Time complexity: O(n) where n is the number of rows.
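The formula above can be sketched on a float array as follows. diff_float is a hypothetical helper. The default of periods = 1, the restriction to positive periods, and the use of NaN for null are all assumptions made for the example.

```ocaml
(* Hypothetical helper, not part of Talon: value[i] - value[i - periods].
   Assumes periods > 0 and uses NaN for the leading null entries. *)
let diff_float ?(periods = 1) (col : float array) : float array =
  Array.mapi
    (fun i x -> if i < periods then Float.nan else x -. col.(i - periods))
    col
```

For example, diff_float [| 1.0; 4.0; 9.0 |] leaves the first entry null and yields 3.0 and 5.0 for the remaining two.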
pct_change df name ?periods () returns percentage change between elements.
Computes (value[i] - value[i-periods]) / value[i-periods] for each element. The first periods elements will be null. Division by zero produces null.
Time complexity: O(n) where n is the number of rows.
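A sketch of the same computation on a float array, with the division-by-zero rule made explicit. pct_change_float is a hypothetical helper. The default of periods = 1, the restriction to positive periods, and NaN as the null marker are assumptions.

```ocaml
(* Hypothetical helper, not part of Talon:
   (value[i] - value[i - periods]) / value[i - periods].
   Assumes periods > 0 and uses NaN for null. *)
let pct_change_float ?(periods = 1) (col : float array) : float array =
  Array.mapi
    (fun i x ->
      if i < periods then Float.nan
      else
        let prev = col.(i - periods) in
        (* A zero denominator yields null rather than infinity. *)
        if prev = 0.0 then Float.nan else (x -. prev) /. prev)
    col
```

The explicit zero check matters because plain float division would return infinity instead of a null.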
shift df name ~periods shifts values by periods.
Positive periods shift forward (values move down, nulls fill the top). Negative periods shift backward (values move up, nulls fill the bottom).
Time complexity: O(n) where n is the number of rows.
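Both directions can be sketched with a single index computation on a float array. shift_float is a hypothetical helper, with NaN assumed as the null fill value.

```ocaml
(* Hypothetical helper, not part of Talon: output position i takes its
   value from input position i - periods, or null (NaN here) when that
   source index falls outside the array. *)
let shift_float ~periods (col : float array) : float array =
  let n = Array.length col in
  Array.init n (fun i ->
      let src = i - periods in
      if src >= 0 && src < n then col.(src) else Float.nan)
```

With periods = 1 the array [| 1.0; 2.0; 3.0 |] becomes null, 1.0, 2.0. With periods = -1 it becomes 2.0, 3.0, null.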
fillna df name ~value fills null/missing values with the provided value.
The value column must either:
- Have exactly one element (broadcast to all null positions)
- Have the same length as the target column (element-wise replacement)
The value column must have the same type as the target column.
Time complexity: O(n) where n is the number of rows.
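The two accepted shapes for the value column can be sketched on float arrays as below. fillna_float is a hypothetical helper that takes plain arrays rather than Talon columns. NaN as the null marker and the Invalid_argument error for any other length are assumptions of the example.

```ocaml
(* Hypothetical helper, not part of Talon: replace NaN entries in col.
   A one-element value is broadcast to every null position. A value of
   the same length as col is applied element-wise. *)
let fillna_float ~(value : float array) (col : float array) : float array =
  let n = Array.length col in
  match Array.length value with
  | 1 -> Array.map (fun x -> if Float.is_nan x then value.(0) else x) col
  | m when m = n ->
    Array.mapi (fun i x -> if Float.is_nan x then value.(i) else x) col
  | _ -> invalid_arg "fillna_float: value must have length 1 or match the column"
```

In the element-wise case, only the positions that are null in the target are read from value. Non-null entries are left unchanged.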