\[ \newcommand{\softmax}{\mathop{\rm softmax}\nolimits} \newcommand{\qed}{\tag*{$\Box$}} \]

Building a Numerically Stable Softmax

You may have noticed that when calculating the softmax, many neural network toolkits first subtract the maximum element from each element before exponentiating (the softmax being defined as exponentiation of each element followed by normalization by the sum of the exponentiated elements). Questions about why this is done tend to be met with canned responses about numerical stability and brushed off. The concern about numerical stability is valid: softmax has an \(e^x\) term in it, so if one of your elements is really big, its exponential will be even bigger (and the sum bigger still), which can overflow. Subtracting the maximum from all elements means the largest element becomes \(0\), so the exponentiation and the subsequent sum will never overflow. The problem is that no one ever explains why we can subtract the max and still get the same values out of the softmax. Let's look at how that works.
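Before the proof, here is a minimal NumPy sketch of the overflow problem and the fix. The function names softmax_naive and softmax_stable are my own, purely for illustration, and not taken from any particular toolkit.

```python
import numpy as np

def softmax_naive(x):
    # Direct definition: exponentiate, then normalize by the sum.
    exps = np.exp(x)
    return exps / exps.sum()

def softmax_stable(x):
    # Subtract the max first, so the largest exponent is e^0 = 1
    # and nothing can overflow.
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

x = np.array([1000.0, 1000.5, 1001.0])
print(softmax_naive(x))   # [nan nan nan] -- np.exp(1000) overflows to inf
print(softmax_stable(x))  # roughly [0.1863 0.3072 0.5065]
```

Both functions compute the same mathematical quantity; the proof below shows why.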

\[ \begin{align} \cssId{show-line-1}\softmax(x) &= \frac{e^{x_i}}{\sum_j e^{x_j}} \\ \cssId{show-line-2}\softmax(x) &= \softmax(x - \alpha) \\ \cssId{show-line-3}\alpha &= \max_i x_i \end{align} \]
\[ \begin{align} \cssId{show-line-4}\softmax(x-\alpha) &= \frac{e^{x_i - \alpha}}{\sum_j e^{x_j - \alpha}} \\ \cssId{show-line-5}\softmax(x) &= \frac{e^{x_i}}{\sum_j e^{x_j}} \\ &= \cssId{show-line-6}{ \frac{e^{-\alpha}}{e^{-\alpha}} \cdot \frac{e^{x_i}}{\sum_j e^{x_j}} }\\ &= \cssId{show-line-7}{ \frac{e^{-\alpha}e^{x_i}}{e^{-\alpha}\sum_j e^{x_j}} }\\ &= \cssId{show-line-8}{ \frac{e^{-\alpha}e^{x_i}}{\sum_j e^{-\alpha} e^{x_j}} }\\ &= \cssId{show-line-9}{ \frac{e^{-\alpha + x_i}}{\sum_j e^{-\alpha + x_j}} }\\ &= \cssId{show-line-10}{ \frac{e^{x_i -\alpha}}{\sum_j e^{x_j -\alpha}} }\\ \cssId{show-line-11} \softmax(x) &= \softmax(x - \alpha) \qed \end{align} \]
  • This is the definition of softmax; it takes a vector \(x\) and exponentiates each element. It then normalizes each element by the sum of all the exponentiated elements.
  • This is the equality we are trying to prove.
  • This is our definition of \( \alpha \). We are using the max of all elements in \(x\) because that is what people do for numerical stability, but this could be any arbitrary number.
  • This is the same as the definition of softmax, but we include the subtraction of \(\alpha\) from each element. This is the value we want to transform the softmax equation into.
  • Again, this is our definition of the softmax.
  • Anything divided by itself is \(1\) (because this is the same as multiplying by the multiplicative inverse which yields the multiplicative identity) and any number multiplied by \(1\) is itself (because \(1\) is the multiplicative identity). This means we can introduce this \(\frac{e^{-\alpha}}{e^{-\alpha}}\) via multiplication without changing any values.
  • The product of two fractions is the product of their numerators divided by the product of their denominators.
  • Multiplication distributes over the addition inside the summation, so we can push the \( e^{-\alpha} \) into the sum.
  • The rules of exponents state that when multiplying terms with the same base, we can add the exponents.
  • Addition is commutative, so we can move the \( -\alpha \) after the \( x_i \).
  • We showed that both \(\softmax(x)\) and \(\softmax(x - \alpha)\) are equal to the same value so by the transitive property they are equal to each other.

We've shown that we can subtract the max without affecting our result (in fact, we can subtract any number we choose). This means we can keep subtracting the maximum for numerical stability without worrying about getting different results.
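If you want to sanity-check the algebra numerically, here is a small sketch (again assuming NumPy) showing that subtracting the max, or any other constant, leaves the output unchanged up to floating-point rounding.

```python
import numpy as np

def softmax(x):
    # Plain definition; fine here because the inputs are small.
    exps = np.exp(x)
    return exps / exps.sum()

x = np.array([0.5, 1.2, -0.3, 2.0])

# Shifting by the max -- or by any arbitrary constant -- gives the same result.
for alpha in (np.max(x), 0.0, 42.0, -7.3):
    assert np.allclose(softmax(x), softmax(x - alpha))
```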