To improve at reinforcement learning, people are often told that one good way is to find a research paper, read it, and then implement the algorithm described in it.
This would be relatively straightforward if papers clearly outlined their training procedure (which they sometimes don't) and all of their implementation details (which they often don't).
Unfortunately, it isn't.
Consequently, implementations are riddled with little tricks that are unrelated to the algorithm itself and exist just to get it to work.
Trying to refer back to such an implementation, especially after some time, results in confusion about how the algorithm relates to the code, even with the original paper at hand.
This website is an attempt to aid with the last problem by laying out various popular RL algorithms, transcribed exactly from their papers, side-by-side and line-by-line with their exact implementation in code.
Each algorithm corresponds to a code file in the source repository on GitHub that can be used to explore any non-algorithm-related details and to test it.
However, these implementations are the minimum possible to work in basic environments and should only be used as reference.
They are not performant or stable and most implementation details are not considered.
REINFORCE is an on-policy policy gradient learning algorithm, presented in
Simple statistical gradient-following algorithms for connectionist reinforcement learning
by Williams in 1992 (though the term "policy gradient algorithm" was only introduced in 2000 in
Policy Gradient Methods for Reinforcement Learning with Function Approximation
by Sutton et al.). The version shown here is from
Reinforcement Learning: An Introduction
(second edition) by Sutton & Barto.
The algorithm simply learns a policy that directly maximizes the expected return, following the policy gradient theorem. The algorithm is also augmented by comparing the return to a baseline, such as an estimate of the state value. This does not change the expected value of the update (leaving it unbiased), but it can reduce variance and thus improve learning speed.
REINFORCE with Baseline (episodic), for estimating $\pi_{\boldsymbol{\theta}} \approx \pi_*$
Input: a differentiable policy parameterization $\pi(a \mid s, \boldsymbol{\theta})$
Input: a differentiable state-value function parameterization $\hat{v}(s, \mathbf{w})$
Algorithm parameters: step sizes $\alpha^{\boldsymbol{\theta}} > 0$, $\alpha^{\mathbf{w}} > 0$
Initialize policy parameter $\boldsymbol{\theta} \in \mathbb{R}^{d'}$ and state-value weights $\mathbf{w} \in \mathbb{R}^{d}$ (e.g., to $\mathbf{0}$)
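As a rough sketch of how these update rules translate to PyTorch (this is not the repository's implementation; policy_net, value_net, their optimizers, batched episode tensors, and a discrete action space are all assumed, and the $\gamma^t$ factor on the policy update is omitted, as is common in practice), a single episode's update might look like:

import torch

def reinforce_with_baseline_update(
    policy_net, value_net, policy_optimizer, value_optimizer,
    observations, actions, rewards, discount_rate
):
    # G_t: discounted return from each step of the episode.
    returns = []
    g = 0.0
    for reward in reversed(rewards):
        g = reward + discount_rate * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # delta = G_t - v_hat(S_t, w)
    values = value_net(observations).squeeze(-1)
    deltas = returns - values

    # State-value update: minimize delta^2 with respect to w.
    value_loss = deltas.pow(2).mean()
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()

    # Policy update: gradient ascent on delta * ln pi(A_t | S_t, theta).
    log_probs = torch.distributions.Categorical(
        logits=policy_net(observations)
    ).log_prob(actions)
    policy_loss = -(deltas.detach() * log_probs).mean()
    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()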
Actor-critic methods have been studied since as early as 1977
(An adaptive optimal controller for discrete-time Markov environments
by Witten). The version shown here is from
Reinforcement Learning: An Introduction
(second edition) by Sutton & Barto.
Actor-critic methods introduce a critic that is applied not only to the initial state of a transition, but also to the resulting state after the action is applied.
The algorithm shown here is an on-policy policy gradient algorithm similar to REINFORCE, but it changes the objective by replacing the full return with the one-step temporal-difference residual. It is also a fully online, incremental algorithm, where each transition is processed as it occurs and then never revisited.
One-step Actor-Critic (episodic), for estimating $\pi_{\boldsymbol{\theta}} \approx \pi_*$
Input: a differentiable policy parameterization $\pi(a \mid s, \boldsymbol{\theta})$
Input: a differentiable state-value function parameterization $\hat{v}(s, \mathbf{w})$
Algorithm parameters: step sizes $\alpha^{\boldsymbol{\theta}} > 0$, $\alpha^{\mathbf{w}} > 0$
Initialize policy parameter $\boldsymbol{\theta} \in \mathbb{R}^{d'}$ and state-value weights $\mathbf{w} \in \mathbb{R}^{d}$ (e.g., to $\mathbf{0}$)
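As a rough sketch of the per-transition update (again, not the repository's implementation; policy_net, value_net, their optimizers, and a discrete action space are assumed, and the discounting argument plays the role of $I = \gamma^t$):

import torch

def one_step_actor_critic_update(
    policy_net, value_net, policy_optimizer, value_optimizer,
    observation, action, reward, next_observation, terminal,
    discount_rate, discounting
):
    # One-step TD residual: delta = R + gamma * v_hat(S') - v_hat(S),
    # where v_hat(S') is taken to be 0 if S' is terminal.
    value = value_net(observation).squeeze(-1)
    with torch.no_grad():
        next_value = (
            torch.zeros(()) if terminal
            else value_net(next_observation).squeeze(-1)
        )
    delta = reward + discount_rate * next_value - value

    # Critic update: minimize delta^2 with respect to w.
    value_loss = delta.pow(2)
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()

    # Actor update: gradient ascent on I * delta * ln pi(A | S, theta).
    log_prob = torch.distributions.Categorical(
        logits=policy_net(observation)
    ).log_prob(action)
    policy_loss = -discounting * delta.detach() * log_prob
    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()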
Deep Q-Network (DQN) is an off-policy algorithm, introduced in
Playing Atari with Deep Reinforcement Learning
by Mnih et al. in 2013.
DQNs follow the Q-learning training procedure, using a deep neural network to estimate the action-value function Q. The behavior policy is ε-greedy. DQN training also utilizes experience replay, which increases data efficiency and, by sampling transitions randomly, breaks the correlation between consecutive samples, thus reducing the variance of the updates. Because DQN takes the maximum over its action values, it can only be used for environments with discrete action spaces.
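As a rough sketch of a single learning step under this procedure (not the repository's implementation; q_net, target_q_net, the optimizer, and an already-sampled minibatch of tensors are assumed):

import torch
import torch.nn.functional as F

def dqn_update(q_net, target_q_net, optimizer, batch, discount_rate):
    observations, actions, rewards, next_observations, dones = batch

    # Q-learning target: r + gamma * max_a' Q_target(s', a'),
    # with no bootstrap on terminal states.
    with torch.no_grad():
        next_q = target_q_net(next_observations).max(dim=1).values
        targets = rewards + discount_rate * next_q * (1 - dones.float())

    # Q-values of the actions that were actually taken.
    q_values = q_net(observations).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Minimize the TD error between predictions and targets.
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()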
Advantage actor-critic was introduced in
Asynchronous Methods for Deep Reinforcement Learning
by Mnih et al. in 2016 as the asynchronous advantage actor-critic algorithm (A3C), but it is generally implemented synchronously, without loss of performance (source).
It is an on-policy policy gradient algorithm in which the policy gradient is weighted by the advantage function rather than the full return.
Asynchronous advantage actor-critic - pseudocode for each actor-learner thread.
// Assume global shared parameter vectors $\theta$ and $\theta_v$ and global shared counter $T = 0$
// Assume thread-specific parameter vectors $\theta'$ and $\theta'_v$
Initialize thread step counter $t \leftarrow 1$
repeat
    Reset gradients: $d\theta \leftarrow 0$ and $d\theta_v \leftarrow 0$.
    Synchronize thread-specific parameters $\theta' = \theta$ and $\theta'_v = \theta_v$
    $t_{start} = t$
    Get state $s_t$
    repeat
        Perform $a_t$ according to policy $\pi(a_t \mid s_t; \theta')$
        Receive reward $r_t$ and new state $s_{t+1}$
        $t \leftarrow t + 1$
        $T \leftarrow T + 1$
    until terminal $s_t$ or $t - t_{start} == t_{max}$
    $R = \begin{cases} 0 & \text{for terminal } s_t \\ V(s_t, \theta'_v) & \text{for non-terminal } s_t \text{ // Bootstrap from last state} \end{cases}$
    for $i \in \{t - 1, \ldots, t_{start}\}$ do
        $R \leftarrow r_i + \gamma R$
        Accumulate gradients wrt $\theta'$: $d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi(a_i \mid s_i; \theta')\left(R - V(s_i; \theta'_v)\right)$
        Accumulate gradients wrt $\theta'_v$: $d\theta_v \leftarrow d\theta_v + \partial\left(R - V(s_i; \theta'_v)\right)^2 / \partial\theta'_v$
    end for
    Perform asynchronous update of $\theta$ using $d\theta$ and of $\theta_v$ using $d\theta_v$.
until $T > T_{max}$
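As a rough, synchronous sketch of the inner-loop update (the asynchronous machinery and shared parameters are omitted; policy_net, value_net, a single optimizer, batched rollout tensors, and a discrete action space are assumed):

import torch

def advantage_actor_critic_update(
    policy_net, value_net, optimizer,
    observations, actions, rewards, bootstrap_value, discount_rate
):
    # n-step returns: R <- r_i + gamma * R, seeded with V(s_t)
    # for a non-terminal final state (0 otherwise).
    returns = []
    r = bootstrap_value
    for reward in reversed(rewards):
        r = reward + discount_rate * r
        returns.insert(0, r)
    returns = torch.tensor(returns)

    values = value_net(observations).squeeze(-1)
    advantages = returns - values

    # Policy gradient weighted by the advantage, plus the value regression term.
    log_probs = torch.distributions.Categorical(
        logits=policy_net(observations)
    ).log_prob(actions)
    policy_loss = -(advantages.detach() * log_probs).mean()
    value_loss = advantages.pow(2).mean()

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()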
PPO is an on-policy policy gradient algorithm introduced in
Proximal Policy Optimization Algorithms
by Schulman et al. in 2017.
PPO maximizes a surrogate objective consisting of the advantage multiplied by the probability ratio between the new policy and the old policy that collected the data (the two differ because multiple optimization steps are taken on each batch). This ratio is clipped to prevent large, unstable policy changes. (An alternative to clipping is to add a penalty proportional to the KL divergence, but this performed worse in the paper.)
When the neural network architecture shares parameters between the policy and value function, an extra value function error term is added to the loss function.
An extra entropy bonus is also added to encourage exploration.
PPO is trained by running a number of actors for a small, fixed number of timesteps, and then optimizing the surrogate objective with minibatch stochastic gradient descent for a fixed number of epochs.
The algorithm shown here is based on the algorithm presented in the paper, with an expanded definition of the objective.
PPO, Actor-Critic Style
for iteration $= 1, 2, \ldots$ do
    for actor $= 1, 2, \ldots, N$ do
        Run policy $\pi_{\theta_{old}}$ in environment for $T$ timesteps
        Compute advantage estimates $\hat{A}_1, \ldots, \hat{A}_T$
    end for
    Optimize surrogate $L$ wrt $\theta$, with $K$ epochs and minibatch size $M \leq NT$
    $\theta_{old} \leftarrow \theta$
end for
for _ in range(policy_iteration_steps):
for batch in Batch(
observations=rollout.observations,
actions=rollout.actions,
old_log_probs=old_log_probs,
advantages=advantages,
returns=returns,
batch_size=batch_size
):
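The body of this loop (not shown above) computes the clipped surrogate loss for each minibatch. As a rough sketch (assumed names, not the repository's implementation; a discrete action space is assumed, minibatch fields are accessed as attributes, and clip_range, value_loss_coefficient, and entropy_coefficient correspond to the paper's $\epsilon$, $c_1$, and $c_2$):

import torch

def ppo_batch_loss(policy_net, value_net, batch,
                   clip_range, value_loss_coefficient, entropy_coefficient):
    # Probability ratio r(theta) = pi_theta(a|s) / pi_theta_old(a|s),
    # computed from the stored old log-probabilities.
    distribution = torch.distributions.Categorical(
        logits=policy_net(batch.observations)
    )
    log_probs = distribution.log_prob(batch.actions)
    ratios = (log_probs - batch.old_log_probs).exp()

    # Clipped surrogate objective (negated, since optimizers minimize).
    unclipped = ratios * batch.advantages
    clipped = ratios.clamp(1 - clip_range, 1 + clip_range) * batch.advantages
    policy_loss = -torch.minimum(unclipped, clipped).mean()

    # Value function error term and entropy bonus.
    value_loss = (value_net(batch.observations).squeeze(-1) - batch.returns).pow(2).mean()
    entropy = distribution.entropy().mean()

    return policy_loss + value_loss_coefficient * value_loss - entropy_coefficient * entropy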
DDPG is an off-policy actor-critic algorithm introduced in
Continuous control with deep reinforcement learning
by Lillicrap et al. in 2015. It is based on
Deterministic Policy Gradient Algorithms
by Silver et al.
DDPG adapts the DQN training procedure to environments with high-dimensional continuous action spaces using an actor-critic method. The behavior policy augments the actor policy with a noise process; the paper uses an Ornstein–Uhlenbeck process. The training procedure is otherwise identical to DQN, including the use of a replay buffer and target networks, though it uses "soft" target updates rather than directly copying the weights. DDPG can only be used for environments with continuous action spaces.
The paper also mentions using batch normalization to normalize the scale of different environment observation spaces, but that is not pertinent to the training algorithm itself.
DDPG algorithm
Randomly initialize critic network $Q(s, a \mid \theta^Q)$ and actor $\mu(s \mid \theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$.
Initialize target network $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
Initialize replay buffer $R$
for episode $= 1, M$ do
    Initialize a random process $\mathcal{N}$ for action exploration
    Receive initial observation state $s_1$
    for $t = 1, T$ do
        Select action $a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise
        Execute action $a_t$ and observe reward $r_t$ and observe new state $s_{t+1}$
        Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
        Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
        Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
        Update critic by minimizing the loss: $L = \frac{1}{N} \sum_i \left(y_i - Q(s_i, a_i \mid \theta^Q)\right)^2$
        Update the actor policy using the sampled policy gradient:
        $\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s = s_i, a = \mu(s_i)} \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s_i}$
        Update the target networks:
        $\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau)\theta^{Q'}$
        $\theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau)\theta^{\mu'}$
    end for
end for
for network, target_network in (
(self.policy, target_policy),
(self.critic, target_critic)
):
target_network.load_state_dict({
key: (
target_network_update_rate * val
+ (1 - target_network_update_rate)
* target_network.state_dict()[key]
)
for key, val in network.state_dict().items()
})
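As a rough sketch of the critic and actor updates that precede this soft target update (not the repository's implementation; the networks, their optimizers, an already-sampled minibatch of tensors, and a critic taking (state, action) pairs are all assumed):

import torch
import torch.nn.functional as F

def ddpg_update(policy, critic, target_policy, target_critic,
                policy_optimizer, critic_optimizer, batch, discount_rate):
    observations, actions, rewards, next_observations, dones = batch

    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})), with no bootstrap on terminal states.
    with torch.no_grad():
        next_actions = target_policy(next_observations)
        targets = rewards + discount_rate * (1 - dones.float()) * target_critic(
            next_observations, next_actions
        ).squeeze(-1)

    # Critic update: minimize the mean squared error against the targets.
    critic_loss = F.mse_loss(critic(observations, actions).squeeze(-1), targets)
    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()

    # Actor update: follow the sampled policy gradient by maximizing Q(s, mu(s)).
    policy_loss = -critic(observations, policy(observations)).mean()
    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()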
C51 is an off-policy Q-learning algorithm introduced in
A Distributional Perspective on Reinforcement Learning
by Bellemare, Dabney and Munos in 2017.
C51 is a modification of DQN where, instead of learning a single value for each state-action pair, it learns the distribution of the value. For tractability, the value distribution is modelled by a discrete distribution whose support is an equally-spaced set of atoms, which makes the learning procedure similar to multiclass classification. Otherwise, the training procedure is identical to DQN, including the use of an ε-greedy behavior policy, experience replay and a target network. Like DQN, C51 can only be used for environments with discrete action spaces.
Categorical Algorithm
input A transition $x_t, a_t, r_t, x_{t+1}$, $\gamma_t \in [0, 1]$
    $Q(x_{t+1}, a) := \sum_i z_i p_i(x_{t+1}, a)$
    $a^* \leftarrow \arg\max_a Q(x_{t+1}, a)$
    $m_i = 0$, $i \in 0, \ldots, N - 1$
    for $j \in 0, \ldots, N - 1$ do
        # Compute the projection of $\hat{\mathcal{T}} z_j$ onto the support $\{z_i\}$
        $\hat{\mathcal{T}} z_j \leftarrow [r_t + \gamma_t z_j]_{V_{MIN}}^{V_{MAX}}$
        $b_j \leftarrow (\hat{\mathcal{T}} z_j - V_{MIN}) / \Delta z$  # $b_j \in [0, N - 1]$
        $l \leftarrow \lfloor b_j \rfloor$, $u \leftarrow \lceil b_j \rceil$
        # Distribute probability of $\hat{\mathcal{T}} z_j$
        $m_l \leftarrow m_l + p_j(x_{t+1}, a^*)(u - b_j)$
        $m_u \leftarrow m_u + p_j(x_{t+1}, a^*)(b_j - l)$
    end for
    output $-\sum_i m_i \log p_i(x_t, a_t)$  # Cross-entropy loss
class C51(Algorithm):
...
def learn(
self,
replay_capacity: int,
learning_rate: float,
epochs: int,
epsilon: float,
batch_size: int,
discount_rate: float,
target_network_update_frequency: int
) -> None:
replay_memory = ReplayBuffer(replay_capacity)
self.q_net = self.initialize_networks()
q_net_optimizer = torch.optim.Adam(
self.q_net.parameters(),
lr=learning_rate
)
target_q_net = copy.deepcopy(self.q_net).requires_grad_(False)
for _ in range(epochs):
observation = torch.as_tensor(self.env.reset()[0])
done = False
while not done:
if torch.rand(1) < epsilon:
action = torch.as_tensor(self.env.action_space.sample())
else:
with torch.no_grad():
action = self.policy(observation)
(
next_observation,
reward,
done
) = self.env_step(action)
replay_memory.store(
(observation, action, reward, next_observation, done)
)
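The excerpt above ends before the update itself. As a rough sketch of the categorical projection and cross-entropy loss that a sampled minibatch might then be fed through (assumed names, not the repository's implementation; q_net and target_q_net are assumed to return per-action log-probabilities over the num_atoms support points in support):

import torch

def categorical_loss(q_net, target_q_net, batch, support,
                     v_min, v_max, discount_rate):
    observations, actions, rewards, next_observations, dones = batch
    num_atoms = support.numel()
    delta_z = (v_max - v_min) / (num_atoms - 1)
    batch_size = observations.shape[0]

    with torch.no_grad():
        # Greedy next action under the target network's expected values.
        next_probs = target_q_net(next_observations).exp()        # [B, A, N]
        next_q = (next_probs * support).sum(dim=-1)               # [B, A]
        next_dist = next_probs[
            torch.arange(batch_size), next_q.argmax(dim=-1)
        ]                                                         # [B, N]

        # Project T z_j = r + gamma * z_j onto the fixed support.
        tz = (
            rewards.unsqueeze(1)
            + discount_rate * (1 - dones.float()).unsqueeze(1) * support
        ).clamp(v_min, v_max)
        b = (tz - v_min) / delta_z                                # in [0, N - 1]
        lower, upper = b.floor().long(), b.ceil().long()
        # When b lands exactly on an atom, l == u and both weights below
        # would be zero; nudge the pair apart so no probability is lost.
        upper = torch.where(upper == lower, lower + 1, upper).clamp(max=num_atoms - 1)
        lower = torch.where(upper == lower, upper - 1, lower)

        # Distribute each atom's probability to its two neighbours.
        m = torch.zeros(batch_size, num_atoms)
        m.scatter_add_(1, lower, next_dist * (upper.float() - b))
        m.scatter_add_(1, upper, next_dist * (b - lower.float()))

    # Cross-entropy between the projected distribution and the prediction.
    log_probs = q_net(observations)[torch.arange(batch_size), actions]
    return -(m * log_probs).sum(dim=1).mean()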