Side-by-side RL

A common piece of advice for getting better at reinforcement learning is to find a research paper, read it, and implement the algorithm it describes. This would be relatively straightforward if papers clearly outlined their training procedure (which they sometimes don't) and all of their implementation details (which they often don't). As a result, implementations end up riddled with small tricks that have nothing to do with the algorithm itself and exist only to make it work. Coming back to such code later, especially after some time, it can be hard to see how the algorithm relates to the implementation, even with the paper in hand.

This website is an attempt to aid with the last problem by laying out various popular RL algorithms, transcribed exactly from their papers, side-by-side and line-by-line with their exact implementation in code. Each algorithm corresponds to a code file in the source repository on GitHub that can be used to explore any non-algorithm-related details and to test it.

However, these implementations are the minimum needed to work in basic environments and should only be used as a reference. They are neither performant nor particularly stable, and most implementation details are not addressed.

REINFORCE

REINFORCE is an on-policy policy gradient learning algorithm, presented in Simple statistical gradient-following algorithms for connectionist reinforcement learning by Williams in 1992 (though the term "policy gradient algorithm" was only introduced in 2000 in Policy Gradient Methods for Reinforcement Learning with Function Approximation by Sutton et al.). The version shown here is from Reinforcement Learning: An Introduction (second edition) by Sutton & Barto.
The algorithm learns a policy by directly maximizing its expected return, following the policy gradient theorem. The version shown here is also augmented by comparing the return to a baseline, such as an estimate of the state value. Subtracting a baseline does not change the expected value of the update (it remains unbiased), but it can reduce its variance and thus improve learning speed.

REINFORCE with Baseline (episodic), for estimating πθ ≈ π*
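
To make the update concrete, here is a minimal sketch of a REINFORCE-with-baseline update for one completed episode, assuming PyTorch networks over a discrete action space. The names and the batched mean over timesteps (which omits the γ^t factor from the book's pseudocode, as many implementations do) are illustrative, not taken from the repository's code:

```python
import torch
from torch.distributions import Categorical

def reinforce_with_baseline_update(policy_net, value_net, policy_opt, value_opt,
                                   states, actions, rewards, gamma=0.99):
    """One REINFORCE-with-baseline update from a single completed episode.

    states:  float tensor of shape (T, obs_dim)
    actions: long tensor of shape (T,)
    rewards: list of floats of length T
    """
    # Compute the return G_t at every timestep (discounted sum of future rewards).
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    # Baseline: state-value estimates; delta_t = G_t - v(S_t).
    values = value_net(states).squeeze(-1)
    deltas = returns - values

    # Baseline (value) update: regress v(S_t) toward G_t.
    value_loss = deltas.pow(2).mean()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # Policy update: gradient ascent on delta_t * log pi(A_t | S_t).
    log_probs = Categorical(logits=policy_net(states)).log_prob(actions)
    policy_loss = -(deltas.detach() * log_probs).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```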

Actor-Critic

Actor-critic methods have been studied since as early as 1977 (An adaptive optimal controller for discrete-time Markov environments by Witten). The version shown here is from Reinforcement Learning: An Introduction (second edition) by Sutton & Barto.
Actor-critic methods introduce a critic that is applied not only to the initial state of a transition, but also to the resulting state after the action is taken.
The algorithm shown here is an on-policy policy gradient algorithm similar to REINFORCE, but it replaces the full return in the objective with the one-step temporal-difference residual. It is also a fully online, incremental algorithm: each transition is processed as it occurs and never revisited.

One-step Actor-Critic (episodic), for estimating πθ ≈ π*
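
As an illustration, a fully online one-step actor-critic update for a single transition might look like the following sketch, assuming PyTorch and discrete actions (the I = γ^t factor from the book's pseudocode is omitted, as is common in implementations; names are illustrative):

```python
import torch
from torch.distributions import Categorical

def actor_critic_step(policy_net, value_net, policy_opt, value_opt,
                      state, action, reward, next_state, done, gamma=0.99):
    """Online one-step actor-critic update for a single transition.

    state, next_state: float tensors of shape (obs_dim,); action: long scalar tensor.
    """
    value = value_net(state).squeeze(-1)
    with torch.no_grad():
        next_value = 0.0 if done else value_net(next_state).squeeze(-1)

    # One-step TD residual: delta = R + gamma * v(S') - v(S).
    delta = reward + gamma * next_value - value

    # Critic update: minimize the squared TD residual.
    value_loss = delta.pow(2)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # Actor update: gradient ascent on delta * log pi(A | S).
    log_prob = Categorical(logits=policy_net(state)).log_prob(action)
    policy_loss = -delta.detach() * log_prob
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```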

Deep Q-Networks (DQN)

The Deep Q-Network (DQN) is an off-policy algorithm, introduced in Playing Atari with Deep Reinforcement Learning by Mnih et al. in 2013.
DQN follows the Q-learning training procedure, using a deep neural network to estimate the action-value function Q. The behavior policy is ε-greedy. DQN training also uses experience replay, which increases data efficiency and, by drawing samples at random, breaks the correlation between consecutive samples, reducing the variance of the updates. Because DQN takes the maximum over its action values, it can only be used in environments with discrete action spaces.

An extended version of the DQN paper was published in 2015 as Human-level control through deep reinforcement learning by Mnih et al.
It introduces the use of a target Q network to make training more stable, as well as gradient clipping.

Deep Q-learning with Experience Replay
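
A minimal sketch of the core update, assuming PyTorch, discrete actions, and a replay buffer that yields batches of tensors (the names, batch layout, and hyperparameters are illustrative, not the repository's):

```python
import torch
import torch.nn.functional as F

def epsilon_greedy(q_net, state, epsilon, num_actions):
    """The ε-greedy behavior policy: random action with probability ε, else greedy."""
    if torch.rand(1).item() < epsilon:
        return torch.randint(num_actions, (1,)).item()
    with torch.no_grad():
        return q_net(state).argmax().item()

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One Q-learning update on a minibatch sampled from the replay buffer.

    batch: (states, actions, rewards, next_states, dones) tensors;
    dones is a float tensor of 0/1 terminal flags.
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target: r + gamma * max_a' Q_target(s', a'), zero at terminal states.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically (every C updates), copy the online weights into the target network:
# target_net.load_state_dict(q_net.state_dict())
```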

Advantage Actor-Critic (A2C)

Advantage actor-critic was introduced in Asynchronous Methods for Deep Reinforcement Learning by Mnih et al. in 2016 as the asynchronous advantage actor-critic algorithm (A3C), but it is generally implemented synchronously, without loss of performance (source).
It is an on-policy policy gradient algorithm in which the policy gradient is weighted by an estimate of the advantage function.

Asynchronous advantage actor-critic - pseudocode for each actor-learner thread.
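
For illustration, the combined loss over a rollout collected by the (synchronous) actors might be sketched as follows in PyTorch, assuming discrete actions and precomputed n-step returns; the loss coefficients are illustrative defaults rather than values prescribed by the paper:

```python
import torch
from torch.distributions import Categorical

def a2c_loss(policy_logits, values, returns, actions,
             value_coef=0.5, entropy_coef=0.01):
    """A2C loss for a flattened rollout of N transitions.

    policy_logits: (N, num_actions); values, returns: (N,); actions: (N,) long tensor.
    """
    dist = Categorical(logits=policy_logits)
    advantages = returns - values

    # Policy gradient term weighted by the advantage (no gradient through the critic here).
    policy_loss = -(advantages.detach() * dist.log_prob(actions)).mean()
    # Critic regression toward the n-step returns.
    value_loss = advantages.pow(2).mean()
    # Entropy bonus to encourage exploration.
    entropy = dist.entropy().mean()

    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```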

Proximal Policy Optimization (PPO)

PPO is an on-policy policy gradient algorithm introduced in Proximal Policy Optimization Algorithms by Schulman et al. in 2017.
PPO maximizes a surrogate objective consisting of the advantage multiplied by the probability ratio between the new policy and the old policy, which allows taking multiple optimization steps on the same batch of data. This ratio is clipped to prevent large, unstable policy changes. (An alternative to clipping is to add a penalty proportional to the KL divergence between the old and new policies, but this performed worse in the paper.)
When the neural network architecture shares parameters between the policy and value function, an extra value function error term is added to the loss function. An extra entropy bonus is also added to encourage exploration.
PPO is trained by running a number of actors for a small, fixed number of timesteps, and then optimizing the surrogate objective with minibatch stochastic gradient descent for a fixed number of epochs.
The algorithm shown here is based on the algorithm presented in the paper, with an expanded definition of the objective.

PPO, Actor-Critic Style
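
As a sketch, the per-minibatch PPO loss (clipped surrogate plus value error term and entropy bonus) can be written as follows in PyTorch, assuming discrete actions and precomputed advantage estimates, returns, and old-policy log-probabilities; ε = 0.2 follows the paper, while the other coefficients are illustrative:

```python
import torch
from torch.distributions import Categorical

def ppo_loss(policy_logits, values, actions, old_log_probs, advantages, returns,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """Clipped surrogate objective with value and entropy terms, for one minibatch."""
    dist = Categorical(logits=policy_logits)
    log_probs = dist.log_prob(actions)

    # Probability ratio between the current policy and the policy that collected the data.
    ratio = torch.exp(log_probs - old_log_probs)

    # Clipped surrogate objective: pessimistic minimum of clipped and unclipped terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value function error term (for shared-parameter architectures) and entropy bonus.
    value_loss = (returns - values).pow(2).mean()
    entropy = dist.entropy().mean()

    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```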

Deep Deterministic Policy Gradient (DDPG)

DDPG is an off-policy actor-critic algorithm introduced in Continuous control with deep reinforcement learning by Lillicrap et al. in 2015. It is based on Deterministic Policy Gradient Algorithms by Silver et al.
DDPG adapts the DQN training procedure to environments with high-dimensional continuous action spaces using an actor-critic method. The behavior policy augments the actor policy with a noise process; the paper uses an Ornstein–Uhlenbeck process. The training procedure is otherwise identical to DQN, including the use of a replay buffer and target networks, though it uses "soft" target updates rather than directly copying the weights. DDPG can only be used for environments with continuous action spaces.
The paper also mentions using batch normalization to normalize the scale of different environment observation spaces, but that is not pertinent to the training algorithm itself.

DDPG algorithm
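
A minimal sketch of one DDPG update step in PyTorch, assuming an actor μ(s), a critic Q(s, a) that takes state-action pairs, their target copies, and a replay buffer yielding tensors (names are illustrative; τ = 0.001 follows the paper):

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.001):
    """One DDPG update on a replay minibatch, followed by soft target updates.

    batch: (states, actions, rewards, next_states, dones) tensors; dones is 0/1 float.
    """
    states, actions, rewards, next_states, dones = batch

    # Critic target: r + gamma * Q'(s', mu'(s')), computed with the target networks.
    with torch.no_grad():
        next_actions = target_actor(next_states)
        next_q = target_critic(next_states, next_actions).squeeze(-1)
        targets = rewards + gamma * (1.0 - dones) * next_q

    critic_loss = F.mse_loss(critic(states, actions).squeeze(-1), targets)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: ascend the critic's value of the actor's own actions.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # "Soft" target updates: theta' <- tau * theta + (1 - tau) * theta'.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```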

Categorical DQN (C51)

C51 is an off-policy Q-learning algorithm introduced in A Distributional Perspective on Reinforcement Learning by Bellemare, Dabney and Munos in 2017.
C51 is a modification of DQN where, instead of learning a single value for each state-action pair, it learns the distribution of the value. For tractability, the value distribution is modelled by a discrete distribution whose support is an equally-spaced set of atoms, which makes the learning procedure similar to multiclass classification. Otherwise, the training procedure is identical to DQN, including the use of an ε-greedy behavior policy, experience replay and a target network. Like DQN, C51 can only be used in environments with discrete action spaces.

Categorical Algorithm
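
A sketch of the categorical loss in PyTorch, assuming the distributional network returns per-action log-probabilities over the atoms; the atom count and support bounds below follow the paper's Atari setup, and the projection handles the case where a projected atom lands exactly on a support point:

```python
import torch

def c51_loss(q_dist_net, target_dist_net, batch, num_atoms=51,
             v_min=-10.0, v_max=10.0, gamma=0.99):
    """Categorical (C51) loss for one minibatch: project the target distribution onto
    the fixed support and minimize cross-entropy with the predicted distribution.

    q_dist_net(s) is assumed to return log-probabilities of shape (B, num_actions, num_atoms).
    batch: (states, actions, rewards, next_states, dones) tensors; dones is 0/1 float.
    """
    states, actions, rewards, next_states, dones = batch
    batch_size = states.shape[0]
    delta_z = (v_max - v_min) / (num_atoms - 1)
    support = torch.linspace(v_min, v_max, num_atoms)            # atoms z_i

    with torch.no_grad():
        # Greedy next action under the target network's expected action values.
        next_probs = target_dist_net(next_states).exp()           # (B, A, N)
        next_q = (next_probs * support).sum(dim=2)                # (B, A)
        next_actions = next_q.argmax(dim=1)                       # (B,)
        next_dist = next_probs[torch.arange(batch_size), next_actions]  # (B, N)

        # Bellman update of each atom, clipped to the support, then projected onto it.
        tz = rewards.unsqueeze(1) + gamma * (1 - dones).unsqueeze(1) * support
        tz = tz.clamp(v_min, v_max)
        b = (tz - v_min) / delta_z                                # fractional atom index
        lower, upper = b.floor().long(), b.ceil().long()

        # Split each atom's probability mass between its two neighbouring indices.
        target_dist = torch.zeros(batch_size, num_atoms)
        target_dist.scatter_add_(1, lower, next_dist * (upper.float() - b))
        target_dist.scatter_add_(1, upper, next_dist * (b - lower.float()))
        # If b is an integer, both weights above are zero, so add the full mass once.
        target_dist.scatter_add_(1, lower, next_dist * (upper == lower).float())

    # Cross-entropy between the projected target and the predicted distribution.
    log_probs = q_dist_net(states)[torch.arange(batch_size), actions]   # (B, N)
    return -(target_dist * log_probs).sum(dim=1).mean()
```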