Module Fehu.Gae

Generalized Advantage Estimation.

Distinguishes terminated from truncated episodes when bootstrapping. On termination, the bootstrap value is zero. On truncation, the bootstrap value is taken from next_values.

GAE

val compute : rewards:float array -> values:float array -> terminated:bool array -> truncated:bool array -> next_values:float array -> gamma:float -> lambda:float -> float array * float array

compute ~rewards ~values ~terminated ~truncated ~next_values ~gamma ~lambda is (advantages, returns).

next_values.(t) is V(s_{t+1}). When terminated.(t) is true, the bootstrap value is zero and the GAE trace resets. When truncated.(t) is true, the bootstrap value is next_values.(t) and the trace resets for the new episode. Otherwise, continuation uses the next step's value.

Raises Invalid_argument if array lengths differ.
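
The backward recurrence described above can be sketched in plain OCaml. This is a minimal illustration, not Fehu's implementation: the name gae_sketch is hypothetical, and two points are assumptions rather than documented behavior, namely that non-terminal steps bootstrap from next_values.(t) and that returns are computed as advantages.(t) +. values.(t).

```ocaml
(* Hypothetical stand-alone sketch of the GAE recurrence; not Fehu's code. *)
let gae_sketch ~rewards ~values ~terminated ~truncated ~next_values ~gamma
    ~lambda =
  let n = Array.length rewards in
  if
    Array.length values <> n
    || Array.length terminated <> n
    || Array.length truncated <> n
    || Array.length next_values <> n
  then invalid_arg "gae_sketch: array lengths differ";
  let advantages = Array.make n 0.0 in
  let returns = Array.make n 0.0 in
  let trace = ref 0.0 in
  for t = n - 1 downto 0 do
    (* Termination bootstraps from zero. Assumption: every other step,
       truncated or not, bootstraps from next_values.(t). *)
    let bootstrap = if terminated.(t) then 0.0 else next_values.(t) in
    let delta = rewards.(t) +. (gamma *. bootstrap) -. values.(t) in
    (* An episode boundary at step t stops the trace from step t+1
       from leaking backwards into step t. *)
    let carried = if terminated.(t) || truncated.(t) then 0.0 else !trace in
    trace := delta +. (gamma *. lambda *. carried);
    advantages.(t) <- !trace;
    (* Assumption: returns are advantages plus the value baseline. *)
    returns.(t) <- !trace +. values.(t)
  done;
  (advantages, returns)

(* A three-step episode that terminates on its last step. *)
let advantages, returns =
  gae_sketch
    ~rewards:[| 1.0; 1.0; 1.0 |]
    ~values:[| 0.5; 0.5; 0.5 |]
    ~terminated:[| false; false; true |]
    ~truncated:[| false; false; false |]
    ~next_values:[| 0.5; 0.5; 0.9 |]
    ~gamma:0.5 ~lambda:0.5
(* advantages = [| 0.96875; 0.875; 0.5 |]
   returns    = [| 1.46875; 1.375; 1.0 |] *)
```

In this example next_values.(2) = 0.9 is ignored because step 2 is terminated, so its advantage is 1.0 -. 0.5 = 0.5.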

val compute_from_values : rewards:float array -> values:float array -> terminated:bool array -> truncated:bool array -> last_value:float -> gamma:float -> lambda:float -> float array * float array

compute_from_values ~rewards ~values ~terminated ~truncated ~last_value ~gamma ~lambda is (advantages, returns).

Convenience wrapper around compute that builds next_values from values and last_value: next_values.(t) = values.(t+1) for t < n-1, and next_values.(n-1) = last_value.

Raises Invalid_argument if array lengths differ.
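
The construction of next_values described above can be sketched as follows. The helper name build_next_values is hypothetical and is not part of Fehu's API; it only illustrates the indexing rule.

```ocaml
(* Hypothetical helper: shift values left by one step and append last_value. *)
let build_next_values ~values ~last_value =
  let n = Array.length values in
  Array.init n (fun t -> if t < n - 1 then values.(t + 1) else last_value)

let next_values =
  build_next_values ~values:[| 1.0; 2.0; 3.0 |] ~last_value:9.0
(* next_values = [| 2.0; 3.0; 9.0 |] *)
```

One consequence of this rule: if truncated.(t) is true for some t < n-1, the bootstrap value is values.(t+1). When the exact V(s_{t+1}) at a truncation point matters, calling compute with an explicit next_values array may be the better fit.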

Monte Carlo returns

val returns : rewards:float array -> terminated:bool array -> truncated:bool array -> gamma:float -> float array

returns ~rewards ~terminated ~truncated ~gamma computes the discounted cumulative return at each step. The accumulation resets at steps where terminated.(t) or truncated.(t) is true.
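
A minimal sketch of the backward accumulation with resets, assuming that a step flagged as terminated or truncated is the last step of its episode. The name returns_sketch is hypothetical; this is not Fehu's implementation.

```ocaml
(* Hypothetical stand-alone sketch of discounted returns with episode resets. *)
let returns_sketch ~rewards ~terminated ~truncated ~gamma =
  let n = Array.length rewards in
  let out = Array.make n 0.0 in
  let acc = ref 0.0 in
  for t = n - 1 downto 0 do
    (* Step t ends an episode: drop whatever was accumulated from later steps. *)
    if terminated.(t) || truncated.(t) then acc := 0.0;
    acc := rewards.(t) +. (gamma *. !acc);
    out.(t) <- !acc
  done;
  out

(* Two episodes of two steps each: the first terminates, the second is truncated. *)
let discounted =
  returns_sketch
    ~rewards:[| 1.0; 1.0; 1.0; 1.0 |]
    ~terminated:[| false; true; false; false |]
    ~truncated:[| false; false; false; true |]
    ~gamma:0.5
(* discounted = [| 1.5; 1.0; 1.5; 1.0 |] *)
```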

Normalization

val normalize : ?eps:float -> float array -> float array

normalize arr is a copy of arr rescaled to zero mean and unit variance. eps (default 1e-8) guards against division by zero when the variance is zero, for example when all elements are equal.
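
One common way to implement this kind of normalization is sketched below. It assumes the population standard deviation and a denominator of std +. eps; the source does not say whether Fehu uses this form or another (such as sqrt (var +. eps)), so treat normalize_sketch as a hypothetical illustration only.

```ocaml
(* Hypothetical sketch: (x - mean) / (std + eps), using population variance. *)
let normalize_sketch ?(eps = 1e-8) arr =
  let n = float_of_int (Array.length arr) in
  let mean = Array.fold_left ( +. ) 0.0 arr /. n in
  let var =
    Array.fold_left (fun s x -> s +. ((x -. mean) ** 2.0)) 0.0 arr /. n
  in
  let std = sqrt var in
  (* Array.map allocates a new array, so the input is left unchanged. *)
  Array.map (fun x -> (x -. mean) /. (std +. eps)) arr

let normalized = normalize_sketch [| 1.0; 2.0; 3.0; 4.0 |]

(* With a constant input the variance is zero; eps keeps the result finite. *)
let constant = normalize_sketch [| 5.0; 5.0; 5.0 |]
(* constant = [| 0.0; 0.0; 0.0 |] *)
```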