Fehu vs. Gymnasium -- A Practical Comparison

This guide explains how Fehu's reinforcement learning API relates to Python's Gymnasium (and Stable Baselines3 for collection/buffer/GAE), focusing on:

  • How core concepts map (Env, Space, step loop, wrappers)
  • Where the APIs feel similar vs. deliberately different
  • How to translate common Gymnasium patterns into Fehu

If you already use Gymnasium, this should be enough to become productive in Fehu quickly.


1. Big-Picture Differences

Aspect Gymnasium (Python) Fehu (OCaml)
Language Dynamic, interpreted Statically typed, compiled
Environment type gymnasium.Env ('obs, 'act, 'render) Env.t
Observation/action Untyped (np.ndarray, int, etc.) Parametric: 'obs and 'act tracked in the type
Spaces gymnasium.spaces.* 'a Space.t with typed modules (Space.Discrete, Space.Box, ...)
Step result Tuple (obs, reward, terminated, truncated, info) Record Env.step with named fields
Wrappers Subclassing gymnasium.Wrapper Env.wrap or composable combinators (map_observation, etc.)
Vectorized envs gymnasium.vector.SyncVectorEnv Vec_env.create
Trajectory collection External (Stable Baselines3, TorchRL) Built-in: Collect.rollout, Collect.episodes
Replay buffers External (Stable Baselines3, TorchRL) Built-in: Buffer.create, Buffer.add, Buffer.sample
GAE External (Stable Baselines3) Built-in: Gae.compute, Gae.returns, Gae.normalize
Policy evaluation Manual loop or SB3 evaluate_policy Built-in: Eval.run
RNG np.random / seed passed to env.reset(seed=...) Implicit scope via Nx.Rng.run ~seed
Rendering String mode "human", "rgb_array" Polymorphic variants `Human, `Rgb_array, etc.
Mutability Environments are mutable objects Environments are immutable handles; state is internal

Fehu semantics to know (read once):

  • Env.reset must be called before Env.step. After a terminal step, another reset is required.
  • Spaces validate observations and actions automatically -- Env.step raises if an action is outside the action space.
  • RNG is scoped: wrap your code in Nx.Rng.run ~seed:42 (fun () -> ...) instead of passing seeds to individual calls.
  • Trajectory collection, replay buffers, GAE, and evaluation are built into Fehu, not external libraries.

2. Spaces

2.1 Discrete

Gymnasium

import gymnasium as gym

space = gym.spaces.Discrete(5)           # {0, 1, 2, 3, 4}
space = gym.spaces.Discrete(5, start=1)  # {1, 2, 3, 4, 5}

sample = space.sample()
assert space.contains(sample)

Fehu

open Fehu

let space = Space.Discrete.create 5            (* {0, 1, 2, 3, 4} *)
let space = Space.Discrete.create ~start:1 5   (* {1, 2, 3, 4, 5} *)

let sample = Space.sample space
let valid  = Space.contains space sample

let n     = Space.Discrete.n space      (* 5 *)
let start = Space.Discrete.start space  (* 1 *)

(* Convert between discrete elements and ints *)
let action = Space.Discrete.of_int 3
let value  = Space.Discrete.to_int action

Discrete elements are (int32, Nx.int32_elt) Nx.t scalars, not bare OCaml ints.

2.2 Box (continuous)

Gymnasium

import numpy as np

space = gym.spaces.Box(
    low=np.array([-1.0, -2.0]),
    high=np.array([1.0, 2.0]),
    dtype=np.float32,
)
sample = space.sample()

Fehu

let space =
  Space.Box.create
    ~low:[| -1.0; -2.0 |]
    ~high:[| 1.0; 2.0 |]

let sample = Space.sample space
let (low, high) = Space.Box.bounds space

Box elements are (float, Nx.float32_elt) Nx.t tensors. Infinite bounds are allowed; sampling falls back to uniform draws in [-1e6, 1e6] clamped to bounds.

2.3 Multi_binary

Gymnasium

space = gym.spaces.MultiBinary(4)  # {0,1}^4

Fehu

let space = Space.Multi_binary.create 4

Elements are (int32, Nx.int32_elt) Nx.t vectors with values 0 or 1.

2.4 Multi_discrete

Gymnasium

space = gym.spaces.MultiDiscrete([3, 5, 2])  # 3 axes: {0..2}, {0..4}, {0..1}

Fehu

let space = Space.Multi_discrete.create [| 3; 5; 2 |]

2.5 Composite spaces

Gymnasium

space = gym.spaces.Tuple((
    gym.spaces.Discrete(3),
    gym.spaces.Box(low=0.0, high=1.0, shape=(2,)),
))

space = gym.spaces.Dict({
    "position": gym.spaces.Box(low=-10.0, high=10.0, shape=(3,)),
    "velocity": gym.spaces.Box(low=-1.0, high=1.0, shape=(3,)),
})

Fehu

let space =
  Space.Tuple.create [
    Space.Pack (Space.Discrete.create 3);
    Space.Pack (Space.Box.create ~low:[| 0.0; 0.0 |] ~high:[| 1.0; 1.0 |]);
  ]

let space =
  Space.Dict.create [
    ("position", Space.Pack (Space.Box.create ~low:[| -10.; -10.; -10. |] ~high:[| 10.; 10.; 10. |]));
    ("velocity", Space.Pack (Space.Box.create ~low:[| -1.; -1.; -1. |] ~high:[| 1.; 1.; 1. |]));
  ]

Composite space elements use Value.t for heterogeneous data: Tuple.element = Value.t list, Dict.element = (string * Value.t) list.

2.6 Sequence and Text

Gymnasium

space = gym.spaces.Sequence(gym.spaces.Discrete(5), seed=42)
space = gym.spaces.Text(max_length=32, charset="abcdef")

Fehu

let space = Space.Sequence.create ~max_length:10 (Space.Discrete.create 5)
let space = Space.Text.create ~charset:"abcdef" ~max_length:32 ()

2.7 Common operations

All space types share the same interface:

let sample = Space.sample space           (* random element *)
let valid  = Space.contains space sample  (* membership test *)
let spec   = Space.spec space             (* structural description *)
let shape  = Space.shape space            (* dimensionality, if defined *)

(* Serialization via Value.t *)
let packed   = Space.pack space sample
let unpacked = Space.unpack space packed  (* (element, string) result *)

(* Edge cases for testing *)
let edges = Space.boundary_values space

3. Creating Environments

3.1 From a registry

Gymnasium

env = gym.make("CartPole-v1", render_mode="human")

Fehu does not have a global registry. Environments are constructed directly:

let env =
  Env.create
    ~id:"CartPole-v1"
    ~observation_space:(Space.Box.create
      ~low:[| -4.8; Float.neg_infinity; -0.418; Float.neg_infinity |]
      ~high:[| 4.8; Float.infinity; 0.418; Float.infinity |])
    ~action_space:(Space.Discrete.create 2)
    ~render_mode:`Human
    ~render_modes:["human"; "rgb_array"]
    ~reset:(fun _env ?options:_ () ->
      let obs = (* initial state *) in
      (obs, Info.empty))
    ~step:(fun _env action ->
      let obs = (* next state *) in
      Env.step_result ~observation:obs ~reward:1.0 ())
    ()

Env.create takes the observation space, action space, and two callbacks: reset and step. Optional render and close callbacks handle visualization and cleanup.

3.2 Step result construction

Gymnasium returns a flat tuple from env.step():

obs, reward, terminated, truncated, info = env.step(action)

Fehu uses a record with named fields, and provides a convenience constructor with defaults:

(* Inside a step callback *)
Env.step_result
  ~observation:obs
  ~reward:1.0
  ~terminated:false
  ~truncated:false
  ~info:Info.empty
  ()

(* Defaults: reward=0., terminated=false, truncated=false, info=Info.empty *)
Env.step_result ~observation:obs ()

4. Step Loop

4.1 Basic episode

Gymnasium

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)

total_reward = 0.0
while True:
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        break

env.close()

Fehu

let () =
  Nx.Rng.run ~seed:42 (fun () ->
    let env = (* create environment *) in
    let (obs, _info) = Env.reset env () in
    let obs = ref obs in
    let total_reward = ref 0.0 in
    let done_ = ref false in
    while not !done_ do
      let action = Space.sample (Env.action_space env) in
      let step = Env.step env action in
      obs := step.observation;
      total_reward := !total_reward +. step.reward;
      done_ := step.terminated || step.truncated
    done;
    Env.close env)

Key differences:

  • RNG is scoped with Nx.Rng.run ~seed:42 rather than passed to reset.
  • Step results are accessed by field name (step.observation, step.reward).
  • Env.step raises Invalid_argument if called without a prior reset or after a terminal step without resetting.

4.2 Multiple episodes

Gymnasium

for episode in range(10):
    obs, info = env.reset()
    done = False
    while not done:
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

Fehu -- manual loop or use Collect.episodes:

(* Manual *)
let () =
  Nx.Rng.run ~seed:0 (fun () ->
    let env = (* create environment *) in
    for _ep = 0 to 9 do
      let (obs, _info) = Env.reset env () in
      let obs = ref obs in
      let done_ = ref false in
      while not !done_ do
        let action = policy !obs in
        let step = Env.step env action in
        obs := step.observation;
        done_ := step.terminated || step.truncated
      done
    done;
    Env.close env)

(* Or use Collect.episodes directly *)
let trajs =
  Nx.Rng.run ~seed:0 (fun () ->
    let env = (* create environment *) in
    Collect.episodes env
      ~policy:(fun obs -> (policy obs, None, None))
      ~n_episodes:10 ())

5. Wrappers

5.1 Gymnasium approach: subclassing

Gymnasium

class NormalizeObservation(gym.Wrapper):
    def __init__(self, env, mean, std):
        super().__init__(env)
        self.mean = mean
        self.std = std

    def observation(self, obs):
        return (obs - self.mean) / self.std

env = NormalizeObservation(env, mean=0.0, std=1.0)

5.2 Fehu approach: composable functions

Fehu provides Env.wrap for full control and specialized combinators for common patterns.

map_observation -- transform observations:

let normalized_env =
  Env.map_observation
    ~observation_space:obs_space
    ~f:(fun obs _info ->
      let normalized = (* normalize obs *) in
      (normalized, Info.empty))
    env

map_action -- transform actions before passing to the inner env:

let remapped_env =
  Env.map_action
    ~action_space:new_action_space
    ~f:(fun new_action -> (* convert to inner action *))
    env

map_reward -- transform rewards:

let scaled_env =
  Env.map_reward
    ~f:(fun ~reward ~info -> (reward *. 0.1, info))
    env

clip_action -- clamp continuous actions to bounds:

(* Gymnasium *)
(* from gymnasium.wrappers import ClipAction *)
(* env = ClipAction(env) *)

(* Fehu *)
let clipped_env = Env.clip_action env

clip_observation -- clamp observations:

let clipped_env =
  Env.clip_observation
    ~low:[| -5.0; -5.0 |]
    ~high:[| 5.0; 5.0 |]
    env

time_limit -- enforce maximum episode length:

(* Gymnasium *)
(* from gymnasium.wrappers import TimeLimit *)
(* env = TimeLimit(env, max_episode_steps=200) *)

(* Fehu *)
let limited_env = Env.time_limit ~max_episode_steps:200 env

5.3 Full custom wrapper with Env.wrap

When the combinators are not enough, use Env.wrap directly:

let custom_env =
  Env.wrap
    ~observation_space:new_obs_space
    ~action_space:new_act_space
    ~reset:(fun inner ?options () ->
      let (obs, info) = Env.reset inner ?options () in
      (transform_obs obs, info))
    ~step:(fun inner action ->
      let step = Env.step inner (transform_action action) in
      { step with observation = transform_obs step.observation })
    env

Env.wrap receives the inner environment as the first argument to reset, step, render, and close. Guards (closed check, needs-reset check, space validation) are enforced automatically.

5.4 Composing wrappers

Wrappers compose by chaining:

let env =
  base_env
  |> Env.time_limit ~max_episode_steps:500
  |> Env.clip_action
  |> Env.map_reward ~f:(fun ~reward ~info -> (reward *. 0.01, info))

6. Vectorized Environments

6.1 Synchronous vectorization

Gymnasium

envs = gym.vector.SyncVectorEnv([
    lambda: gym.make("CartPole-v1") for _ in range(4)
])

obs, infos = envs.reset()
actions = envs.action_space.sample()  # batch of 4 actions
obs, rewards, terminated, truncated, infos = envs.step(actions)
envs.close()

Fehu

let venv =
  Vec_env.create [env1; env2; env3; env4]

let n = Vec_env.num_envs venv  (* 4 *)

let (observations, infos) = Vec_env.reset venv ()
let actions = Array.init n (fun _ -> Space.sample (Vec_env.action_space venv)) in
let step = Vec_env.step venv actions

(* step.observations : 'obs array      -- one per env *)
(* step.rewards      : float array     -- one per env *)
(* step.terminated   : bool array      -- one per env *)
(* step.truncated    : bool array      -- one per env *)
(* step.infos        : Info.t array    -- one per env *)

Vec_env.close venv

Key differences:

  • Vec_env.create takes a list of already-constructed environments. All must have structurally identical spaces.
  • Terminated or truncated environments are automatically reset. The terminal observation is stored in the step's info under "final_observation" (as a packed Value.t), and the terminal info under "final_info".
  • The step result is a record with named arrays, not a tuple.

7. Trajectory Collection

7.1 Fixed-step rollout

Gymnasium + Stable Baselines3

from stable_baselines3.common.buffers import RolloutBuffer

# Manual loop or SB3 internals
obs, _ = env.reset()
for step in range(2048):
    action, log_prob, value = policy(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    buffer.add(obs, action, reward, ...)
    if terminated or truncated:
        obs, _ = env.reset()

Fehu -- built-in:

let trajectory =
  Collect.rollout env
    ~policy:(fun obs ->
      let action = (* select action *) in
      let log_prob = (* optional log probability *) in
      let value = (* optional value estimate *) in
      (action, Some log_prob, Some value))
    ~n_steps:2048

Collect.rollout handles resets on episode boundaries automatically and returns a Collect.t record:

(* Collect.t fields: *)
trajectory.observations       (* 'obs array *)
trajectory.actions            (* 'act array *)
trajectory.rewards            (* float array *)
trajectory.next_observations  (* 'obs array *)
trajectory.terminated         (* bool array *)
trajectory.truncated          (* bool array *)
trajectory.infos              (* Info.t array *)
trajectory.log_probs          (* float array option *)
trajectory.values             (* float array option *)

let n = Collect.length trajectory

7.2 Complete episodes

Gymnasium + manual

episodes = []
for _ in range(10):
    obs, _ = env.reset()
    episode = []
    done = False
    while not done:
        action = policy(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        episode.append((obs, action, reward, next_obs, terminated, truncated))
        obs = next_obs
        done = terminated or truncated
    episodes.append(episode)

Fehu -- built-in:

let episode_list =
  Collect.episodes env
    ~policy:(fun obs -> (policy obs, None, None))
    ~n_episodes:10
    ~max_steps:1000
    ()
(* episode_list : ('obs, 'act) Collect.t list *)

Each element is one episode as a Collect.t. Concatenate them with Collect.concat:

let all_transitions = Collect.concat episode_list

8. Replay Buffers

8.1 Standard replay buffer

Stable Baselines3

from stable_baselines3.common.buffers import ReplayBuffer

buffer = ReplayBuffer(buffer_size=100_000, observation_space=..., action_space=...)
buffer.add(obs, next_obs, action, reward, done, infos)
batch = buffer.sample(batch_size=256)

Fehu -- built-in:

let buf = Buffer.create ~capacity:100_000

let () =
  Buffer.add buf {
    Buffer.observation = obs;
    action;
    reward = 1.0;
    next_observation = next_obs;
    terminated = false;
    truncated = false;
  }

(* Uniform random sampling *)
let batch = Buffer.sample buf ~batch_size:256
(* batch : ('obs, 'act) Buffer.transition array *)

(* Structure-of-arrays form for training loops *)
let (observations, actions, rewards, next_observations, terminated, truncated) =
  Buffer.sample_arrays buf ~batch_size:256

8.2 Buffer queries

let n       = Buffer.size buf       (* current number of stored transitions *)
let cap     = Buffer.capacity buf   (* maximum capacity *)
let full    = Buffer.is_full buf    (* true when size = capacity *)
let ()      = Buffer.clear buf      (* remove all transitions, keep storage *)

9. GAE and Returns

9.1 Generalized Advantage Estimation

Stable Baselines3 (internal)

# SB3 computes GAE internally in on-policy algorithms
# or manually:
import numpy as np

def compute_gae(rewards, values, dones, next_values, gamma=0.99, lam=0.95):
    advantages = np.zeros_like(rewards)
    last_gae = 0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_values[t] * (1 - dones[t]) - values[t]
        advantages[t] = last_gae = delta + gamma * lam * (1 - dones[t]) * last_gae
    returns = advantages + values
    return advantages, returns

Fehu -- built-in, with correct terminated/truncated handling:

let (advantages, returns) =
  Gae.compute
    ~rewards:trajectory.rewards
    ~values:(Option.get trajectory.values)
    ~terminated:trajectory.terminated
    ~truncated:trajectory.truncated
    ~next_values  (* float array: V(s_{t+1}) for each t *)
    ~gamma:0.99
    ~lambda:0.95

When you have values from a rollout and a final bootstrap value:

let (advantages, returns) =
  Gae.compute_from_values
    ~rewards:trajectory.rewards
    ~values:(Option.get trajectory.values)
    ~terminated:trajectory.terminated
    ~truncated:trajectory.truncated
    ~last_value:0.0
    ~gamma:0.99
    ~lambda:0.95

compute_from_values builds next_values from values and last_value automatically: next_values.(t) = values.(t+1) for t < n-1, and next_values.(n-1) = last_value.

9.2 Monte Carlo returns

Manual Python

def discounted_returns(rewards, dones, gamma=0.99):
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running * (1 - dones[t])
        returns[t] = running
    return returns

Fehu

let mc_returns =
  Gae.returns
    ~rewards:trajectory.rewards
    ~terminated:trajectory.terminated
    ~truncated:trajectory.truncated
    ~gamma:0.99

9.3 Normalization

let normalized_advantages = Gae.normalize advantages
let normalized_custom     = Gae.normalize ~eps:1e-5 advantages

10. Policy Evaluation

Gymnasium + Stable Baselines3

from stable_baselines3.common.evaluation import evaluate_policy

mean_reward, std_reward = evaluate_policy(
    model, env, n_eval_episodes=10, deterministic=True
)

Fehu -- built-in:

let stats =
  Eval.run env
    ~policy:(fun obs -> (* deterministic action *))
    ~n_episodes:10
    ~max_steps:1000
    ()

(* stats.mean_reward  : float *)
(* stats.std_reward   : float *)
(* stats.mean_length  : float *)
(* stats.n_episodes   : int   *)

Eval.run resets the environment between episodes and collects total reward and episode length across all episodes.


11. Rendering

11.1 Render modes

Gymnasium

env = gym.make("CartPole-v1", render_mode="human")
env.reset()
env.step(action)
frame = env.render()  # None for "human", np.ndarray for "rgb_array"

Fehu

let env =
  Env.create
    ~render_mode:`Human
    ~render_modes:["human"; "rgb_array"]
    ~render:(fun () -> (* return 'render option *))
    (* ... *)
    ()

let frame = Env.render env  (* 'render option *)

Render modes are polymorphic variants: `Human, `Rgb_array, `Ansi, `Svg, `Custom of string.

11.2 Frame type

For Rgb_array environments, Fehu uses Render.image:

(* Render.image fields: *)
(* width        : int                              *)
(* height       : int                              *)
(* pixel_format : Render.Pixel.format (Rgb|Rgba|Gray) *)
(* data         : uint8 bigarray                   *)

11.3 Recording rendered rollouts

Gymnasium

from gymnasium.wrappers import RecordVideo
env = RecordVideo(env, video_folder="./videos")

Fehu -- use Render.rollout or Render.on_render:

(* Run a policy and feed frames to a sink *)
Render.rollout env
  ~policy:(fun obs -> (* action *))
  ~steps:500
  ~sink:(fun frame -> (* save or display frame *))
  ()

(* Or wrap the env to capture every rendered frame *)
let recording_env =
  Render.on_render
    ~sink:(fun frame -> (* process frame *))
    env

12. Info Dictionaries

Gymnasium uses plain Python dicts for info:

obs, info = env.reset()
print(info.get("elapsed_steps", 0))

Fehu uses typed Info.t dictionaries with Value.t values:

let info = Info.of_list [
  ("elapsed_steps", Info.int 42);
  ("success", Info.bool true);
]

let steps = Info.find "elapsed_steps" info  (* Value.t option *)
let steps = Info.find_exn "elapsed_steps" info  (* Value.t, raises on missing *)

let info' = Info.set "custom_key" (Info.float 3.14) info
let info' = Info.merge info1 info2  (* info2 wins on conflicts *)
let is_empty = Info.is_empty info

13. Quick Cheat Sheet

Task Gymnasium / SB3 Fehu
Create env gym.make("CartPole-v1") Env.create ~observation_space ~action_space ~reset ~step ()
Reset obs, info = env.reset(seed=42) let (obs, info) = Env.reset env ()
Step obs, r, term, trunc, info = env.step(a) let s = Env.step env a (record fields)
Close env.close() Env.close env
Discrete space gym.spaces.Discrete(5) Space.Discrete.create 5
Box space gym.spaces.Box(low, high) Space.Box.create ~low ~high
Sample from space space.sample() Space.sample space
Contains check space.contains(x) Space.contains space x
Observation wrapper class W(gym.ObservationWrapper) Env.map_observation ~observation_space ~f env
Action wrapper class W(gym.ActionWrapper) Env.map_action ~action_space ~f env
Reward wrapper class W(gym.RewardWrapper) Env.map_reward ~f env
Clip actions ClipAction(env) Env.clip_action env
Time limit TimeLimit(env, max_episode_steps=N) Env.time_limit ~max_episode_steps:N env
Vectorize gym.vector.SyncVectorEnv([...]) Vec_env.create [env1; env2; ...]
Rollout N steps Manual loop / SB3 internal Collect.rollout env ~policy ~n_steps
Collect N episodes Manual loop Collect.episodes env ~policy ~n_episodes ()
Replay buffer ReplayBuffer(buffer_size=N, ...) Buffer.create ~capacity:N
Add to buffer buffer.add(obs, next_obs, ...) Buffer.add buf transition
Sample from buffer buffer.sample(batch_size=B) Buffer.sample buf ~batch_size:B
GAE SB3 internal / manual Gae.compute ~rewards ~values ~terminated ~truncated ~next_values ~gamma ~lambda
Discounted returns Manual loop Gae.returns ~rewards ~terminated ~truncated ~gamma
Normalize advantages (adv - mean) / std Gae.normalize advantages
Evaluate policy evaluate_policy(model, env, n_eval_episodes=10) Eval.run env ~policy ~n_episodes:10 ()
Render env.render() Env.render env
Record frames RecordVideo(env, ...) Render.on_render ~sink env
Seed RNG env.reset(seed=42) Nx.Rng.run ~seed:42 (fun () -> ...)