This blog post is not yet complete. Stay tuned!

Want more information? Check out my master’s thesis and preprint paper.

This post is an adaptation of my Master of Science thesis and my preprint paper *Empirical Design in Reinforcement Learning*.

# What Even is Soft Actor-Critic?

Soft Actor-Critic (SAC) is an actor-critic algorithm that many people consider state-of-the-art. A quick search on Google Scholar turns up many sources that make this claim. The algorithm can solve complicated simulated problems and can even control robots [1,2]. The original two SAC papers ([1] and [2]) showed that SAC could outperform many of the best algorithms, such as DDPG [3] and TD3 [4], in terms of achieving the highest episodic return on a number of environments from the MuJoCo suite in OpenAI Gym.

So, SAC seems like a fairly good algorithm, and a lot of people seem to agree! But, what even is SAC exactly?

Soft Actor-Critic is an off-policy actor-critic algorithm. It’s off-policy because it uses experience replay to store transitions and update with them later. Furthermore, the algorithm works in the entropy-regularized setting, where we learn *soft* state- and action-value functions by augmenting the reward with the entropy of the policy, scaled by an **entropy scale** $\tau$. The effective reward at timestep $(t+1)$ is:

\[R_{t+1} + \tau \mathscr{H}_t\]

where $\mathscr{H}_t = \mathscr{H}(π(\cdot \mid S_t))$ is the Shannon entropy of the policy in state $S_t$, defined as:

\[\mathscr{H}(\pi(\cdot \mid S_t)) = - \mathbb{E}_{A \sim \pi(\cdot \mid S_t)} \left[ \ln \pi(A \mid S_t) \right]\]The entropy scale $\tau$ determines the relative importance of rewards and entropy. In the next few sections, we will represent the true action-value function with entropy scale $\tau$ as $Q^\tau$ and the true state-value function as $V^\tau$. If the value functions are approximated by a critic, then their parameters will be subscripts: $Q_\theta^\tau$ and $V_\psi^\tau$.
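To make the entropy bonus concrete, here is a minimal sketch for a policy over a small discrete action set. The discrete setting and the function names `shannon_entropy` and `entropy_augmented_reward` are assumptions made for illustration; SAC itself is usually applied to continuous actions, where the entropy is estimated from samples.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution given as probabilities."""
    # Terms with zero probability contribute nothing, so they are skipped.
    return sum(-p * math.log(p) for p in probs if p > 0.0)


def entropy_augmented_reward(reward, probs, tau):
    """Reward plus tau times the entropy of the policy in the current state."""
    return reward + tau * shannon_entropy(probs)


# Hypothetical action probabilities in a single state.
uniform_policy = [0.25, 0.25, 0.25, 0.25]
greedy_policy = [1.0, 0.0, 0.0, 0.0]

# The uniform policy earns the largest bonus; the deterministic one earns none.
print(entropy_augmented_reward(1.0, uniform_policy, tau=0.1))
print(entropy_augmented_reward(1.0, greedy_policy, tau=0.1))
```

With a larger `tau`, the same amount of policy randomness is worth more relative to the environment reward.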

Furthermore, the algorithm is **not** a policy gradient algorithm since it does not use the policy gradient theorem [5]. Instead, the algorithm performs approximate (soft) policy iteration (API) which consists of two steps: **policy evaluation** and **policy improvement**. These steps correspond to the critic and the actor respectively.

The algorithm has two different widely used variants described in [1] and [2]. The major difference between these two variants is in the critic. In either case, SAC uses a double Q critic. This means that the critic is composed of two soft action-value functions. Whenever the algorithm requires an action’s value, the critic will use each of these two functions to predict the action’s value, and then return the lower of the two predictions.
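The double Q critic described above can be sketched in a few lines. The two critics here are hypothetical closed-form functions standing in for learned networks, and the name `double_q_value` is an assumption for illustration.

```python
def double_q_value(q1, q2, state, action):
    """Query both action-value functions and return the lower prediction."""
    return min(q1(state, action), q2(state, action))


# Two hypothetical critics that disagree about the same state-action pair.
def critic_one(state, action):
    return 1.0 * state + 2.0 * action


def critic_two(state, action):
    return 0.5 * state + 2.0 * action


value = double_q_value(critic_one, critic_two, state=1.0, action=0.5)
print(value)  # the smaller of the two estimates
```

Taking the minimum is a pessimistic choice: it guards against either critic overestimating the value of an action.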

## Policy Evaluation: The Critic

Because SAC is an approximate (soft) policy iteration algorithm, it uses two phases: **policy evaluation** and **policy improvement**. The policy evaluation phase is conducted by the critic, while the policy improvement phase is conducted by the actor.

### Soft Actor-Critic

In [1], the critic also has a state-value function, which is used to construct the update target for the soft action-value functions:

\[J_{Q_\theta^\tau}(\theta) \doteq \mathbb{E}_{S_t, A_t \sim \mathscr{D}} \left[ \frac{1}{2} \left( Q^{\tau}_{\theta}(S_t, A_t) - \hat Q^\tau(S_t, A_t) \right)^2 \right]\]where $J$ is an estimate of the value error (the objective to minimize), $\theta$ are the parameters of the soft action-value function $Q_\theta^\tau$ with entropy scale $\tau$, $\mathscr{D}$ is the distribution from which states and actions are sampled (in this case, the distribution is induced by a replay buffer), and

\[\hat Q^\tau(S_t, A_t) = R_t + \gamma \mathbb{E}_{S_{t+1} \sim \mathscr{P}}[V^\tau_{\bar \psi}(S_{t+1})]\]where $\gamma$ is the discount factor, $\mathscr{P}$ is the transition dynamics, and $\bar \psi$ are the parameters of a **target** soft state-value function $V^\tau_{\bar \psi}$. The parameters of the target soft state-value function are learned using a Polyak average over the parameters $\psi$ of the soft state-value function $V_\psi^\tau$:

\[\bar \psi \leftarrow (1 - \beta) \bar \psi + \beta \psi\]where $\beta \in (0, 1]$ is a small step size.
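As a rough sketch of the Polyak average, the target parameters take a small step toward the online parameters after each update. Parameters are plain lists of floats here, and the names `polyak_update` and `step_size` are assumptions for illustration rather than anything taken from the SAC codebase.

```python
def polyak_update(target_params, online_params, step_size):
    """Move each target parameter a fraction step_size toward its online counterpart."""
    return [
        (1.0 - step_size) * target + step_size * online
        for target, online in zip(target_params, online_params)
    ]


# Hypothetical parameter vectors for the target and online value functions.
target = [0.0, 1.0]
online = [1.0, 1.0]

for _ in range(3):
    target = polyak_update(target, online, step_size=0.5)
    print(target)  # the first entry moves toward 1.0, the second stays put
```

A step size of 1 copies the online parameters outright, while a small step size makes the target change slowly, which keeps the bootstrap target more stable.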

The soft state-value function $V^\tau_\psi$ is learned by minimizing the following objective:

\[J_{V_\psi^\tau}(\psi) \doteq \mathbb{E}_{S \sim \mathscr{D}} \left[ \frac{1}{2} \left( V^\tau_\psi (S) - \mathbb{E}_{A \sim \pi_\phi(\cdot \mid S)} [Q^\tau_\theta(S, A) - \tau \ln (\pi_\phi(A \mid S)) ] \right)^2 \right]\]where $\phi$ are the parameters of the current policy $\pi_\phi$. To estimate this objective, we usually only use a single state and action sample.
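The single-sample estimate can be sketched as follows, treating the objective as a squared error between the state value and its soft target. All inputs are plain numbers standing in for network outputs, and the function names are assumptions for illustration.

```python
def soft_state_value_target(q_value, log_prob, tau):
    """Single-sample estimate of E_A[Q(S, A) - tau * ln pi(A | S)]."""
    return q_value - tau * log_prob


def soft_state_value_loss(v_value, q_value, log_prob, tau):
    """Half the squared error between V(S) and its single-sample target."""
    target = soft_state_value_target(q_value, log_prob, tau)
    return 0.5 * (v_value - target) ** 2


# Hypothetical values for one sampled state and one action drawn from the policy.
loss = soft_state_value_loss(v_value=2.0, q_value=2.0, log_prob=-1.0, tau=0.5)
print(loss)
```

Because the log-probability of an unlikely action is very negative, the target rewards states in which the policy still spreads probability over many actions.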

### Modern Soft Actor-Critic

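[2] has been called *modern SAC* because it includes a number of tricks to improve performance, such as automatic entropy regularization, and it is considered to outperform the original SAC in [1]. This variant of SAC also does not utilize a state-value function $V$. Instead, the update to the action-value functions minimizes the following objective:

\[J_{Q_\theta^\tau}(\theta) \doteq \mathbb{E}_{S_t, A_t \sim \mathscr{D}} \left[ \frac{1}{2} \left( Q^\tau_\theta (S_t, A_t) - \left( R_t + \gamma \mathbb{E}_{S_{t+1} \sim \mathscr{P}} \left[ \hat V^\tau(S_{t+1}) \right] \right) \right)^2 \right]\]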

where

\[\hat V^\tau(s) = \mathbb{E}_{A \sim \pi} \left[ Q^\tau_\theta (s, A) - \tau \ln(\pi(A \mid s)) \right]\]Usually, a single action is used to estimate this objective, resulting in:

\[J_{Q_\theta^\tau}(\theta) \doteq \mathbb{E}_{S_t, A_t \sim \mathscr{D}} \left[ \frac{1}{2} \left( Q^\tau_\theta (S_t, A_t) - \left( R_t + \gamma \left( Q^\tau_\theta (S_{t+1}, A_{t+1}) - \tau \ln(\pi(A_{t+1} \mid S_{t+1})) \right) \right) \right)^2 \right]\]where $A_{t+1}$ is sampled from $\pi(\cdot \mid S_{t+1})$. You might recognize this as the well-known Sarsa update! If instead of using a single action sample, we were to use the complete expectation, then we would get an Expected Sarsa update. Of course, performing an Expected Sarsa update is infeasible in the continuous-action setting. We could use any number of action samples to estimate this objective, but we usually use only one.
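Here is a minimal sketch of the Sarsa-style target, along with the variant that averages over several sampled next actions. The inputs are plain numbers standing in for critic and policy outputs, and the function names are assumptions for illustration.

```python
def soft_sarsa_target(reward, next_q, next_log_prob, tau, gamma):
    """Bootstrap target built from a single sampled next action."""
    return reward + gamma * (next_q - tau * next_log_prob)


def soft_sarsa_target_multi(reward, next_qs, next_log_probs, tau, gamma):
    """Same target, averaging the soft value over several sampled next actions."""
    soft_values = [q - tau * lp for q, lp in zip(next_qs, next_log_probs)]
    return reward + gamma * sum(soft_values) / len(soft_values)


# Hypothetical critic outputs and log-probabilities for sampled next actions.
single = soft_sarsa_target(1.0, next_q=2.0, next_log_prob=-1.0, tau=0.5, gamma=0.5)
several = soft_sarsa_target_multi(
    1.0, next_qs=[2.0, 4.0], next_log_probs=[-1.0, -1.0], tau=0.5, gamma=0.5
)
print(single, several)
```

As the number of sampled actions grows, the averaged target approaches the Expected Sarsa target, at the cost of more critic evaluations per update.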

## Policy Improvement: The Actor

Unlike the critic, both versions of SAC use the same policy improvement operator. This operator uses the Boltzmann distribution over action values, defined as

\[\mathscr{B}Q^\tau(s, a) \doteq \frac{\exp(Q^\tau(s, a) \tau^{-1})}{Z} \qquad \text{for } Z ≐ \int_{\mathscr{A}} \exp(Q^\tau(s, b) τ^{-1}) \ db\]in the continuous action setting, and where $\mathscr{A}$ is the action space. In the discrete action setting, we replace integrals by summations. Because we usually don’t know $Q^\tau$ exactly, we generally approximate $\mathscr{B}Q^\tau$ using a critic $Q^\tau_\theta$:

\[\mathscr{B}Q_\theta^\tau(s, a) \doteq \frac{\exp(Q_\theta^\tau(s, a) \tau^{-1})}{Z} \qquad \text{for } Z ≐ \int_{\mathscr{A}} \exp(Q_\theta^\tau(s, b) τ^{-1}) \ db\]Then, to update the policy, we minimize

\[J_{\pi_\phi}(\phi) ≐ \mathbb{KL}\left(\pi_\phi \mid \mid \mathscr{B}Q^\tau \right)\]where $\phi$ are the parameters of the policy $\pi_\phi$ and $\mathbb{KL}$ is the KL-Divergence. Again, because we usually don’t know $Q^τ$ exactly, we approximate it with a critic and the above equation becomes

\[J_{\pi_\phi}(\phi) ≐ \mathbb{KL}\left(\pi_\phi \mid \mid \mathscr{B}Q_\theta^\tau \right)\]There are two different ways to construct the gradient for this equation. The first way uses the likelihood-ratio trick, which we won’t discuss here (see [6] for more information on this method). The second way, which is used by SAC, uses the reparameterization trick:

\[J_{\pi_\phi}(\phi) = \mathbb{E}_{S_t \sim \mathscr{D}, \varepsilon_t \sim P} \left[ \tau \ln(\pi_\phi(f_\phi(\varepsilon_t, S_t) \mid S_t)) - Q^\tau_\theta (S_t, f_\phi(\varepsilon_t, S_t)) \right]\]where $\varepsilon_t$ is random noise sampled from the distribution $P$ and $A_t = f_\phi(\varepsilon_t, S_t)$. Up to multiplication by $\tau$ and an additive term that does not depend on $\phi$, this is the KL divergence above averaged over states, so both have the same minimizer. For example, $f_\phi$ might be the quantile function for the policy distribution and $P$ the uniform distribution on $[0, 1]$.

At its original introduction, SAC used an ArctanhNormal policy (an ArctanhNormal distribution is the distribution of a random variable whose hyperbolic arctangent is normally distributed; this is sometimes called a *Squashed Gaussian* distribution). For this form of policy, we have $f_\phi(\varepsilon_t, S_t) \doteq \tanh(\mu_{S_t} + \varepsilon_t \sigma_{S_t})$, where $\varepsilon_t$ is drawn from a standard normal distribution. Here, $\mu_s$ and $\sigma_s$ are the location and scale parameters of the policy in state $s$.
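A small sketch of the reparameterized squashed Gaussian may help here. It assumes a one-dimensional action and takes the location and scale as plain numbers rather than network outputs; the function names are hypothetical. The log-density uses the change-of-variables formula for $\tanh$, written in a numerically stable form.

```python
import math
import random


def softplus(x):
    """Numerically stable ln(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def squashed_action(mu, sigma, eps):
    """The reparameterized action f(eps, s) = tanh(mu + eps * sigma)."""
    return math.tanh(mu + sigma * eps)


def squashed_log_prob(mu, sigma, eps):
    """Log-density of the squashed action under the tanh-transformed Gaussian."""
    u = mu + sigma * eps  # pre-squash Gaussian sample
    log_normal = -0.5 * eps * eps - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    # ln(1 - tanh(u)^2), computed without evaluating tanh(u) directly
    log_det = 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
    return log_normal - log_det


rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)  # noise from a standard normal distribution
print(squashed_action(0.3, 0.5, eps), squashed_log_prob(0.3, 0.5, eps))
```

Because the noise is sampled independently of the policy parameters, the action is a deterministic, differentiable function of the location and scale, which is what makes the reparameterization trick applicable.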

This results in the following gradient approximation:

\[\hat \nabla_\phi J_{\pi_\phi}(\phi) = \tau \nabla_\phi \ln(\pi_\phi(A_t \mid S_t)) + \left( \tau \nabla_{A_t} \ln(\pi_\phi(A_t \mid S_t)) - \nabla_{A_t} Q^\tau_\theta(S_t, A_t) \right) \nabla_\phi f_\phi(\varepsilon_t, S_t)\]This gradient is an approximation because it is a sample of the true gradient (which is an expectation itself). Furthermore, if we know $Q^\tau$ exactly, we can replace $Q^\tau_\theta$ with $Q^\tau$.
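To build some intuition for the Boltzmann target and the KL objective used in this section, here is a sketch in the discrete-action setting, where the integral becomes a sum. The action values are made up, and the function names are assumptions for illustration.

```python
import math


def boltzmann(q_values, tau):
    """Boltzmann distribution over a finite set of action values."""
    # Subtracting the maximum leaves the distribution unchanged but avoids overflow.
    highest = max(q_values)
    weights = [math.exp((q - highest) / tau) for q in q_values]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical action values in one state and a uniform policy.
target = boltzmann([1.0, 2.0, 3.0], tau=1.0)
policy = [1.0 / 3.0] * 3
print(target)
print(kl_divergence(policy, target))  # the quantity the actor tries to reduce
```

Lowering `tau` concentrates the target on the highest-valued action, while raising it pushes the target toward the uniform distribution.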

## Putting this all Together

Putting the last two sections together, we get the Soft Actor-Critic algorithm. In Algorithm 1, we see the modern version of SAC, while Algorithm 2 shows the original SAC algorithm. Here, we’ve used lowercase letters to denote value functions: $q$ denotes action-value functions while $v$ denotes state-value functions.

These images are taken from my master’s thesis.

# Considering Original Results

Let’s review the results published in the original SAC paper [1], since this is the version of SAC which was published as a conference paper. The second version [2] has never been published. Here are the learning curves of SAC on six different MuJoCo environments from OpenAI Gym.

*The Learning Curves Originally Reported in the SAC Paper*

These learning curves were constructed over 5 runs. The solid line denotes the mean performance, while shaded regions denote the range between the minimum and maximum performance. Performance is measured every 1,000 steps by recording the average episodic return over a number of episodes where only the mean action of the policy is selected in each state. No learning is done during this evaluation phase. See [1] for more details on the experimental process.
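As a sketch of how such a curve is assembled, the mean line and the shaded band can be computed per evaluation point across runs. The returns below are made-up numbers, not results from any paper, and the name `aggregate_runs` is an assumption for illustration.

```python
def aggregate_runs(runs):
    """Return the per-evaluation-point mean, minimum, and maximum across runs."""
    # zip(*runs) groups together the values recorded at the same evaluation point.
    mean = [sum(values) / len(values) for values in zip(*runs)]
    lowest = [min(values) for values in zip(*runs)]
    highest = [max(values) for values in zip(*runs)]
    return mean, lowest, highest


# Hypothetical evaluation returns: one list per run, one entry per evaluation point.
runs = [
    [10.0, 20.0, 30.0],
    [12.0, 18.0, 36.0],
    [8.0, 25.0, 33.0],
]
mean, lowest, highest = aggregate_runs(runs)
print(mean)     # the solid line
print(lowest)   # lower edge of the shaded region
print(highest)  # upper edge of the shaded region
```

With only a handful of runs, the minimum and maximum are determined by single runs, so the band says little about how variable the algorithm is in general.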

As you can see, we have four baseline algorithms: DDPG, PPO, SQL, and TD3. On nearly all these environments, SAC outperforms the baseline algorithms, but perhaps not statistically significantly. Nevertheless, the following conclusion is drawn:

> SAC performs comparably to the baseline methods on the easier tasks and outperforms them on the harder tasks with a large margin both in terms of learning speed and the final performance.

It’s left up to the reader to guess which tasks are *easier* and which are *harder*. These results convinced much of the community that SAC was a state-of-the-art algorithm, the best in continuous-action control.

## Wait, What About DDPG and TD3?

If you search the internet a bit for deep actor-critic algorithms, it won’t take long before you come across two resources: Spinning Up [7] and the TD3 paper [4]. Here are some learning curves reported by Spinning Up on HalfCheetah:

*The Learning Curves on HalfCheetah Reported by SpinningUp*

And here are some learning curves reported by the TD3 paper. Pay close attention to the learning curves on HalfCheetah:

*Learning Curves Reported by the TD3 paper*

Something is up… these two results **don’t at all match those learning curves reported by the SAC paper on HalfCheetah**. In these results, DDPG and TD3 look much better, while SAC looks much worse. How can this be?

# Reproducing the Original Experiments

# Is the Consensus Correct?

# So What?

# References

[1] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. 2018.

[2] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, Sergey Levine. Soft Actor-Critic Algorithms and Applications. 2019.

[3] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra. Continuous Control with Deep Reinforcement Learning. 2016.

[4] Scott Fujimoto, Herke van Hoof, David Meger. Addressing Function Approximation Error in Actor-Critic Methods. 2018.

[5] Richard S. Sutton, David McAllester, Satinder Singh, Yshay Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. 1999.

[6] Alan Chan, Hugo Silva, Sungsu Lim, Tadashi Kozuno, A. Rupam Mahmood, Martha White. Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences. 2021.

[7] Joshua Achiam. Spinning Up in Deep Reinforcement Learning. 2018.