Problem

Given a set of numbers, each representing the value of a given item, you want to transform them into a metric that identifies the item with the highest value. The metric should have the following properties:

  • the resulting metric should normalize all values, i.e., the transformed values should sum to 1
  • the metric should favor a single item among the numbers (the one with the highest original value), boosting it so that it stands apart more clearly

Use case

Activation function in the last layer of a classification network. Each item stands for a certain class, and exactly one class is to be selected. The previous layer can produce activations of any magnitude, but at the end we want a normalized output that tells us which class has been activated.
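As a sketch of that use case (the logits array below is made up for illustration): exponentiating the raw activations and normalizing them yields per-class probabilities, and the predicted class is simply the index of the largest one.

```python
import numpy as np

# Hypothetical raw activations (logits) from the previous layer,
# one entry per class -- the values are made up for illustration.
logits = np.array([0.5, 2.1, -1.3, 0.7])

exps = np.exp(logits)        # make all activations positive
probs = exps / exps.sum()    # normalize so the outputs sum to 1
predicted_class = int(np.argmax(probs))  # index of the winning class
```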

Two-step approach

  1. Make all values positive (and boost higher values).
  2. Normalize all values so that they sum to 1.
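The two steps can also be wrapped into a single function. A minimal sketch; subtracting the maximum before exponentiating is an extra not covered in the walkthrough below (it avoids overflow for large inputs, and since the shift cancels out in the normalization, the result is unchanged):

```python
import numpy as np

def softmax(x):
    # Shift by the maximum to avoid overflow in np.exp for large
    # inputs; the constant shift cancels out during normalization.
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()
```

For example, `softmax(np.array([1000.0, 1001.0]))` returns finite values, whereas the naive `np.exp(x) / np.exp(x).sum()` would overflow.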
import numpy as np
original_values = np.random.randn(10)
original_values
array([ 0.02614146,  1.45368858, -0.36014279,  0.81388848, -0.28236045,
       -1.2516738 ,  0.7559367 ,  1.185065  , -0.96293044,  0.65885819])

Step 1: Make everything positive while keeping the order of elements constant (monotonicity)

There are various ways to do this, but a very convenient one is to use each value as the exponent of e (i.e., apply np.exp):

step1 = np.exp(original_values)
step1
array([1.02648614, 4.2788684 , 0.69757671, 2.25666596, 0.75400185,
       0.28602565, 2.12960538, 3.27089941, 0.38177248, 1.93258444])
# check that the ranks in both arrays are the same (order is preserved)
from numpy.testing import assert_array_equal
assert_array_equal(np.argsort(original_values), np.argsort(step1))

# check if all values are positive
assert all(step1 > 0)

Step 2: Normalize the values so that they lie between 0 and 1 and sum to 1

step2 = step1/step1.sum()
step2
array([0.06033013, 0.25148384, 0.04099899, 0.13263203, 0.04431529,
       0.01681071, 0.12516425, 0.19224203, 0.02243808, 0.11358465])
# check if all values are between 0 and 1
softmax_values = step2
assert all(0 <= softmax_values)
assert all(softmax_values <= 1)

# check if the values sum up to 1
from numpy.testing import assert_almost_equal
assert_almost_equal(softmax_values.sum(), 1)
# Plot the original_values versus the softmax values. 
# We sort both arrays in increasing order. You can see that the 
# line for the softmax_values is slightly steeper, 
# thus indicating the boost of higher values.
import matplotlib.pyplot as plt

with plt.xkcd():
    fig, (ax1, ax2) = plt.subplots(2, 1)
    ax1.plot(sorted(original_values));
    ax1.set(ylabel="Original values");
    ax2.plot(sorted(softmax_values));
    ax2.set(ylabel="Softmax values");

Usage in Deep Learning frameworks

All Deep Learning frameworks have softmax functions. Here we show the Keras and PyTorch versions.

import keras
from keras import backend as K
keras_result = keras.activations.softmax(
    K.variable(value=original_values.reshape(1, -1)), axis=-1).numpy().flatten()
Using TensorFlow backend.
# Keras computes the softmax in single precision (float32),
# so the results can differ slightly and we compare with limited precision
from numpy.testing import assert_array_almost_equal
assert_array_almost_equal(keras_result, softmax_values)
import torch
pytorch_result = torch.nn.functional.softmax(
    torch.tensor(original_values.reshape(1, -1)), dim=1).numpy().flatten()
pytorch_result
array([0.06033013, 0.25148384, 0.04099899, 0.13263203, 0.04431529,
       0.01681071, 0.12516425, 0.19224203, 0.02243808, 0.11358465])
assert_array_almost_equal(pytorch_result, softmax_values)

Caveat

Since softmax boosts the item with the highest value (winner takes all), you shouldn't use softmax when you want more than one element to be active in the output (e.g., in multi-label classification scenarios).
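To make this concrete, here is a small sketch (with made-up logits) comparing softmax to an element-wise sigmoid, the usual choice for multi-label outputs: when two classes are strongly activated at the same time, softmax forces them to split the probability mass, while the sigmoid scores each class independently.

```python
import numpy as np

# Two classes are strongly activated at the same time,
# as might happen in a multi-label problem (values made up).
logits = np.array([4.0, 4.0, -2.0])

exps = np.exp(logits)
softmax_out = exps / exps.sum()          # classes compete: each "winner" gets only ~0.5
sigmoid_out = 1 / (1 + np.exp(-logits))  # independent scores: both winners near 1
```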