To improve at reinforcement learning, people are often told that one good way is to find a research paper, read it, and then implement the algorithm described in it.
This would be relatively straightforward if papers clearly outlined their training procedure (which they sometimes don't) and all of their implementation details (which they often don't).
Unfortunately, it isn't.
Consequently, implementations are riddled with little tricks that are unrelated to the algorithm itself and exist just to get it to work.
Trying to refer back to such an implementation, especially after some time, results in confusion about how the algorithm relates to the code, even with the original paper at hand.
This website is an attempt to aid with the last problem by laying out various popular RL algorithms, transcribed exactly from their papers, side-by-side and line-by-line with their exact implementation in code.
Each algorithm corresponds to a code file in the source repository on GitHub that can be used to explore any non-algorithm-related details and to test it.
However, these implementations are the minimum possible to work in basic environments and should only be used as reference.
They are not performant or stable and most implementation details are not considered.
REINFORCE is an on-policy policy gradient learning algorithm, presented in
Simple statistical gradient-following algorithms for connectionist reinforcement learning
by Williams in 1992 (though the term "policy gradient algorithm" was only introduced in 2000 in
Policy Gradient Methods for Reinforcement Learning with Function Approximation
by Sutton et al.). The version shown here is from
Reinforcement Learning: An Introduction
(second edition) by Sutton & Barto.
The algorithm simply learns a policy that directly maximizes the expected return, following the policy gradient theorem. The algorithm is also augmented by comparing the return to a baseline, such as an estimate of the state value. This does not change the expected value of the update (leaving it unbiased), but it can reduce variance and thus improve learning speed.
REINFORCE with Baseline (episodic), for estimating $\pi_{\boldsymbol{\theta}} \approx \pi_*$
Input: a differentiable policy parameterization $\pi(a \mid s, \boldsymbol{\theta})$
Input: a differentiable state-value function parameterization $\hat{v}(s, \mathbf{w})$
Algorithm parameters: step sizes $\alpha^{\boldsymbol{\theta}} > 0$, $\alpha^{\mathbf{w}} > 0$
Initialize policy parameter $\boldsymbol{\theta} \in \mathbb{R}^{d'}$ and state-value weights $\mathbf{w} \in \mathbb{R}^{d}$ (e.g., to $\mathbf{0}$)
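As a rough sketch of how these update rules translate to PyTorch (this is not the repository's implementation; policy_net, value_net, their optimizers, batched episode tensors, and a discrete action space are all assumed, and the $\gamma^t$ factor on the policy update is omitted, as is common in practice), a single episode's update might look like:

import torch

def reinforce_with_baseline_update(
    policy_net, value_net, policy_optimizer, value_optimizer,
    observations, actions, rewards, discount_rate
):
    # G_t: discounted return from each step of the episode.
    returns = []
    g = 0.0
    for reward in reversed(rewards):
        g = reward + discount_rate * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # delta = G_t - v_hat(S_t, w)
    values = value_net(observations).squeeze(-1)
    deltas = returns - values

    # State-value update: minimize delta^2 with respect to w.
    value_loss = deltas.pow(2).mean()
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()

    # Policy update: gradient ascent on delta * ln pi(A_t | S_t, theta).
    log_probs = torch.distributions.Categorical(
        logits=policy_net(observations)
    ).log_prob(actions)
    policy_loss = -(deltas.detach() * log_probs).mean()
    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()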
Actor-critic methods have been studied since as early as 1977
(An adaptive optimal controller for discrete-time Markov environments
by Witten). The version shown here is from
Reinforcement Learning: An Introduction
(second edition) by Sutton & Barto.
Actor-critic methods introduce a critic that is applied not only to the initial state of a transition, but also to the resulting state after the action is applied.
The algorithm shown here is an on-policy policy gradient algorithm similar to REINFORCE, but it changes the objective by replacing the full return with the one-step temporal-difference residual. It is also a fully online, incremental algorithm, where each transition is processed as it occurs and then never revisited.
One-step Actor-Critic (episodic), for estimating $\pi_{\boldsymbol{\theta}} \approx \pi_*$
Input: a differentiable policy parameterization $\pi(a \mid s, \boldsymbol{\theta})$
Input: a differentiable state-value function parameterization $\hat{v}(s, \mathbf{w})$
Algorithm parameters: step sizes $\alpha^{\boldsymbol{\theta}} > 0$, $\alpha^{\mathbf{w}} > 0$
Initialize policy parameter $\boldsymbol{\theta} \in \mathbb{R}^{d'}$ and state-value weights $\mathbf{w} \in \mathbb{R}^{d}$ (e.g., to $\mathbf{0}$)
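As a rough sketch of the per-transition update (again, not the repository's implementation; policy_net, value_net, their optimizers, and a discrete action space are assumed, and the discounting argument plays the role of $I = \gamma^t$):

import torch

def one_step_actor_critic_update(
    policy_net, value_net, policy_optimizer, value_optimizer,
    observation, action, reward, next_observation, terminal,
    discount_rate, discounting
):
    # One-step TD residual: delta = R + gamma * v_hat(S') - v_hat(S),
    # where v_hat(S') is taken to be 0 if S' is terminal.
    value = value_net(observation).squeeze(-1)
    with torch.no_grad():
        next_value = (
            torch.zeros(()) if terminal
            else value_net(next_observation).squeeze(-1)
        )
    delta = reward + discount_rate * next_value - value

    # Critic update: minimize delta^2 with respect to w.
    value_loss = delta.pow(2)
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()

    # Actor update: gradient ascent on I * delta * ln pi(A | S, theta).
    log_prob = torch.distributions.Categorical(
        logits=policy_net(observation)
    ).log_prob(action)
    policy_loss = -discounting * delta.detach() * log_prob
    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()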
Deep Q-Network (DQN) is an off-policy algorithm, introduced in
Playing Atari with Deep Reinforcement Learning
by Mnih et al. in 2013.
DQNs follow the Q-learning training procedure, using a deep neural network to estimate the action-value function Q. The behavior policy is ε-greedy. DQN training also utilizes experience replay, which increases data efficiency and, by sampling transitions randomly, breaks the correlation between consecutive samples, thus reducing the variance of the updates. Because DQN takes the maximum over its action values, it can only be used for environments with discrete action spaces.
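As a rough sketch of a single learning step under this procedure (not the repository's implementation; q_net, target_q_net, the optimizer, and an already-sampled minibatch of tensors are assumed):

import torch
import torch.nn.functional as F

def dqn_update(q_net, target_q_net, optimizer, batch, discount_rate):
    observations, actions, rewards, next_observations, dones = batch

    # Q-learning target: r + gamma * max_a' Q_target(s', a'),
    # with no bootstrap on terminal states.
    with torch.no_grad():
        next_q = target_q_net(next_observations).max(dim=1).values
        targets = rewards + discount_rate * next_q * (1 - dones.float())

    # Q-values of the actions that were actually taken.
    q_values = q_net(observations).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Minimize the TD error between predictions and targets.
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()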
Advantage actor-critic was introduced in
Asynchronous Methods for Deep Reinforcement Learning
by Mnih et al. in 2016 as the asynchronous advantage actor-critic algorithm (A3C), but it is generally implemented synchronously, without loss of performance (source).
It is an on-policy policy gradient algorithm in which the policy gradient is weighted by the advantage function rather than the full return.
Asynchronous advantage actor-critic - pseudocode for each actor-learner thread.
// Assume global shared parameter vectors $\theta$ and $\theta_v$ and global shared counter $T = 0$
// Assume thread-specific parameter vectors $\theta'$ and $\theta'_v$
Initialize thread step counter $t \leftarrow 1$
repeat
    Reset gradients: $d\theta \leftarrow 0$ and $d\theta_v \leftarrow 0$.
    Synchronize thread-specific parameters $\theta' = \theta$ and $\theta'_v = \theta_v$
    $t_{start} = t$
    Get state $s_t$
    repeat
        Perform $a_t$ according to policy $\pi(a_t \mid s_t; \theta')$
        Receive reward $r_t$ and new state $s_{t+1}$
        $t \leftarrow t + 1$
        $T \leftarrow T + 1$
    until terminal $s_t$ or $t - t_{start} == t_{max}$
    $R = \begin{cases} 0 & \text{for terminal } s_t \\ V(s_t, \theta'_v) & \text{for non-terminal } s_t \text{ // Bootstrap from last state} \end{cases}$
    for $i \in \{t - 1, \ldots, t_{start}\}$ do
        $R \leftarrow r_i + \gamma R$
        Accumulate gradients wrt $\theta'$: $d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi(a_i \mid s_i; \theta')\left(R - V(s_i; \theta'_v)\right)$
        Accumulate gradients wrt $\theta'_v$: $d\theta_v \leftarrow d\theta_v + \partial\left(R - V(s_i; \theta'_v)\right)^2 / \partial\theta'_v$
    end for
    Perform asynchronous update of $\theta$ using $d\theta$ and of $\theta_v$ using $d\theta_v$.
until $T > T_{max}$
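As a rough, synchronous sketch of the inner-loop update (the asynchronous machinery and shared parameters are omitted; policy_net, value_net, a single optimizer, batched rollout tensors, and a discrete action space are assumed):

import torch

def advantage_actor_critic_update(
    policy_net, value_net, optimizer,
    observations, actions, rewards, bootstrap_value, discount_rate
):
    # n-step returns: R <- r_i + gamma * R, seeded with V(s_t)
    # for a non-terminal final state (0 otherwise).
    returns = []
    r = bootstrap_value
    for reward in reversed(rewards):
        r = reward + discount_rate * r
        returns.insert(0, r)
    returns = torch.tensor(returns)

    values = value_net(observations).squeeze(-1)
    advantages = returns - values

    # Policy gradient weighted by the advantage, plus the value regression term.
    log_probs = torch.distributions.Categorical(
        logits=policy_net(observations)
    ).log_prob(actions)
    policy_loss = -(advantages.detach() * log_probs).mean()
    value_loss = advantages.pow(2).mean()

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()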
PPO is an on-policy policy gradient algorithm introduced in
Proximal Policy Optimization Algorithms
by Schulman et al. in 2017.
PPO maximizes a surrogate objective consisting of the advantage multiplied by the probability ratio between the new policy and the old policy that collected the data (the two differ because multiple optimization steps are taken on each batch). This ratio is clipped to prevent large, unstable policy changes. (An alternative to clipping is to add a penalty proportional to the KL divergence, but this performed worse in the paper.)
When the neural network architecture shares parameters between the policy and value function, an extra value function error term is added to the loss function.
An extra entropy bonus is also added to encourage exploration.
PPO is trained by running a number of actors for a small, fixed number of timesteps, and then optimizing the surrogate objective with minibatch stochastic gradient descent for a fixed number of epochs.
The algorithm shown here is based on the algorithm presented in the paper, with an expanded definition of the objective.
PPO, Actor-Critic Style
for iteration $= 1, 2, \ldots$ do
    for actor $= 1, 2, \ldots, N$ do
        Run policy $\pi_{\theta_{old}}$ in environment for $T$ timesteps
        Compute advantage estimates $\hat{A}_1, \ldots, \hat{A}_T$
    end for
    Optimize surrogate $L$ wrt $\theta$, with $K$ epochs and minibatch size $M \leq NT$
    $\theta_{old} \leftarrow \theta$
end for
for _ in range(policy_iteration_steps):
for batch in Batch(
observations=rollout.observations,
actions=rollout.actions,
old_log_probs=old_log_probs,
advantages=advantages,
returns=returns,
batch_size=batch_size
):
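The body of this loop (not shown above) computes the clipped surrogate loss for each minibatch. As a rough sketch (assumed names, not the repository's implementation; a discrete action space is assumed, minibatch fields are accessed as attributes, and clip_range, value_loss_coefficient, and entropy_coefficient correspond to the paper's $\epsilon$, $c_1$, and $c_2$):

import torch

def ppo_batch_loss(policy_net, value_net, batch,
                   clip_range, value_loss_coefficient, entropy_coefficient):
    # Probability ratio r(theta) = pi_theta(a|s) / pi_theta_old(a|s),
    # computed from the stored old log-probabilities.
    distribution = torch.distributions.Categorical(
        logits=policy_net(batch.observations)
    )
    log_probs = distribution.log_prob(batch.actions)
    ratios = (log_probs - batch.old_log_probs).exp()

    # Clipped surrogate objective (negated, since optimizers minimize).
    unclipped = ratios * batch.advantages
    clipped = ratios.clamp(1 - clip_range, 1 + clip_range) * batch.advantages
    policy_loss = -torch.minimum(unclipped, clipped).mean()

    # Value function error term and entropy bonus.
    value_loss = (value_net(batch.observations).squeeze(-1) - batch.returns).pow(2).mean()
    entropy = distribution.entropy().mean()

    return policy_loss + value_loss_coefficient * value_loss - entropy_coefficient * entropy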
DDPG is an off-policy actor-critic algorithm introduced in
Continuous control with deep reinforcement learning
by Lillicrap et al. in 2015. It is based on
Deterministic Policy Gradient Algorithms
by Silver et al.
DDPG adapts the DQN training procedure to environments with high-dimensional continuous action spaces using an actor-critic method. The behavior policy augments the actor policy with a noise process; the paper uses an Ornstein–Uhlenbeck process. The training procedure is otherwise identical to DQN, including the use of a replay buffer and target networks, though it uses "soft" target updates rather than directly copying the weights. DDPG can only be used for environments with continuous action spaces.
The paper also mentions using batch normalization to normalize the scale of different environment observation spaces, but that is not pertinent to the training algorithm itself.
DDPG algorithm
Randomly initialize critic network $Q(s, a \mid \theta^Q)$ and actor $\mu(s \mid \theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$.
Initialize target network $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
Initialize replay buffer $R$
for episode $= 1, M$ do
    Initialize a random process $\mathcal{N}$ for action exploration
    Receive initial observation state $s_1$
    for $t = 1, T$ do
        Select action $a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise
        Execute action $a_t$ and observe reward $r_t$ and observe new state $s_{t+1}$
        Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
        Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
        Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
        Update critic by minimizing the loss: $L = \frac{1}{N} \sum_i \left(y_i - Q(s_i, a_i \mid \theta^Q)\right)^2$
        Update the actor policy using the sampled policy gradient:
        $\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s = s_i, a = \mu(s_i)} \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s_i}$
        Update the target networks:
        $\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau)\theta^{Q'}$
        $\theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau)\theta^{\mu'}$
    end for
end for
for network, target_network in (
(self.policy, target_policy),
(self.critic, target_critic)
):
target_network.load_state_dict({
key: (
target_network_update_rate * val
+ (1 - target_network_update_rate)
* target_network.state_dict()[key]
)
for key, val in network.state_dict().items()
})
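As a rough sketch of the critic and actor updates that precede this soft target update (not the repository's implementation; the networks, their optimizers, an already-sampled minibatch of tensors, and a critic taking (state, action) pairs are all assumed):

import torch
import torch.nn.functional as F

def ddpg_update(policy, critic, target_policy, target_critic,
                policy_optimizer, critic_optimizer, batch, discount_rate):
    observations, actions, rewards, next_observations, dones = batch

    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})), with no bootstrap on terminal states.
    with torch.no_grad():
        next_actions = target_policy(next_observations)
        targets = rewards + discount_rate * (1 - dones.float()) * target_critic(
            next_observations, next_actions
        ).squeeze(-1)

    # Critic update: minimize the mean squared error against the targets.
    critic_loss = F.mse_loss(critic(observations, actions).squeeze(-1), targets)
    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()

    # Actor update: follow the sampled policy gradient by maximizing Q(s, mu(s)).
    policy_loss = -critic(observations, policy(observations)).mean()
    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()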
C51 is an off-policy Q-learning algorithm introduced in
A Distributional Perspective on Reinforcement Learning
by Bellemare, Dabney and Munos in 2017.
C51 is a modification of DQN where, instead of learning a single value for each state-action pair, it learns the distribution of the value. For tractability, the value distribution is modelled by a discrete distribution whose support is an equally-spaced set of atoms, which makes the learning procedure similar to multiclass classification. Otherwise, the training procedure is identical to DQN, including the use of an ε-greedy behavior policy, experience replay and a target network. Like DQN, C51 can only be used for environments with discrete action spaces.
Categorical Algorithm
input A transition $x_t, a_t, r_t, x_{t+1}$, $\gamma_t \in [0, 1]$
    $Q(x_{t+1}, a) := \sum_i z_i p_i(x_{t+1}, a)$
    $a^* \leftarrow \arg\max_a Q(x_{t+1}, a)$
    $m_i = 0$, $i \in 0, \ldots, N - 1$
    for $j \in 0, \ldots, N - 1$ do
        # Compute the projection of $\hat{\mathcal{T}} z_j$ onto the support $\{z_i\}$
        $\hat{\mathcal{T}} z_j \leftarrow [r_t + \gamma_t z_j]_{V_{MIN}}^{V_{MAX}}$
        $b_j \leftarrow (\hat{\mathcal{T}} z_j - V_{MIN}) / \Delta z$  # $b_j \in [0, N - 1]$
        $l \leftarrow \lfloor b_j \rfloor$, $u \leftarrow \lceil b_j \rceil$
        # Distribute probability of $\hat{\mathcal{T}} z_j$
        $m_l \leftarrow m_l + p_j(x_{t+1}, a^*)(u - b_j)$
        $m_u \leftarrow m_u + p_j(x_{t+1}, a^*)(b_j - l)$
    end for
    output $-\sum_i m_i \log p_i(x_t, a_t)$  # Cross-entropy loss
class C51(Algorithm):
...
def learn(
self,
replay_capacity: int,
learning_rate: float,
epochs: int,
epsilon: float,
batch_size: int,
discount_rate: float,
target_network_update_frequency: int
) -> None:
replay_memory = ReplayBuffer(replay_capacity)
self.q_net = self.initialize_networks()
q_net_optimizer = torch.optim.Adam(
self.q_net.parameters(),
lr=learning_rate
)
target_q_net = copy.deepcopy(self.q_net).requires_grad_(False)
for _ in range(epochs):
observation = torch.as_tensor(self.env.reset()[0])
done = False
while not done:
if torch.rand(1) < epsilon:
action = torch.as_tensor(self.env.action_space.sample())
else:
with torch.no_grad():
action = self.policy(observation)
(
next_observation,
reward,
done
) = self.env_step(action)
replay_memory.store(
(observation, action, reward, next_observation, done)
)
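The excerpt above ends before the update itself. As a rough sketch of the categorical projection and cross-entropy loss that a sampled minibatch might then be fed through (assumed names, not the repository's implementation; q_net and target_q_net are assumed to return per-action log-probabilities over the num_atoms support points in support):

import torch

def categorical_loss(q_net, target_q_net, batch, support,
                     v_min, v_max, discount_rate):
    observations, actions, rewards, next_observations, dones = batch
    num_atoms = support.numel()
    delta_z = (v_max - v_min) / (num_atoms - 1)
    batch_size = observations.shape[0]

    with torch.no_grad():
        # Greedy next action under the target network's expected values.
        next_probs = target_q_net(next_observations).exp()        # [B, A, N]
        next_q = (next_probs * support).sum(dim=-1)               # [B, A]
        next_dist = next_probs[
            torch.arange(batch_size), next_q.argmax(dim=-1)
        ]                                                         # [B, N]

        # Project T z_j = r + gamma * z_j onto the fixed support.
        tz = (
            rewards.unsqueeze(1)
            + discount_rate * (1 - dones.float()).unsqueeze(1) * support
        ).clamp(v_min, v_max)
        b = (tz - v_min) / delta_z                                # in [0, N - 1]
        lower, upper = b.floor().long(), b.ceil().long()
        # When b lands exactly on an atom, l == u and both weights below
        # would be zero; nudge the pair apart so no probability is lost.
        upper = torch.where(upper == lower, lower + 1, upper).clamp(max=num_atoms - 1)
        lower = torch.where(upper == lower, upper - 1, lower)

        # Distribute each atom's probability to its two neighbours.
        m = torch.zeros(batch_size, num_atoms)
        m.scatter_add_(1, lower, next_dist * (upper.float() - b))
        m.scatter_add_(1, upper, next_dist * (b - lower.float()))

    # Cross-entropy between the projected distribution and the prediction.
    log_probs = q_net(observations)[torch.arange(batch_size), actions]
    return -(m * log_probs).sum(dim=1).mean()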