Softmax Regression

Motivating Softmax Regression

  • Up until now, the only type of classification we've gone over is binary classification
  • In other words, we're only able to classify two possible labels
  • Sometimes, we may want to classify an observation where multiple classifications exist
  • Said another way, we may be interested in multi-class classification
  • For example, we could predict if an image is a cat, dog, or neither
  • We wouldn't use standard regression or binary classification
  • Instead, we would want to use softmax regression

Describing Softmax Regression

  • Softmax regression refers to a network whose output layer uses a softmax activation function
  • Up until now, our output layer has only included a single neuron
  • We do this so we can output a single number, instead of a vector
  • For softmax regression, our network's output layer includes c neurons
  • Here, c represents the number of classes
  • For example, we would set c = 3 if we're trying to predict whether an image is a cat, dog, or neither
  • In summary, a network whose output layer uses the softmax activation function will typically return a vector of predictions, instead of a single-number prediction

Defining Softmax Regression

  1. Receive some input

    • Typically, the input to our output layer is the vector of activations from the previous layer a^{[l-1]}
    • Here, a^{[l-1]} is a 3 \times 1 vector
  2. Compute our weighted input z^{[l]}:

    • Here, W^{[l]} is a 4 \times 3 matrix
    • And, b^{[l]} is a 4 \times 1 vector
    • Then, z^{[l]} is a 4 \times 1 vector
z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}
  3. Compute the softmax activations a^{[l]}:

    • Here, a^{[l]} is a 4 \times 1 vector
a_{i}^{[l]} = \frac{e^{z_{i}^{[l]}}}{\sum_{j=1}^{n} e^{z_{j}^{[l]}}} = \frac{e^{z_{i}^{[l]}}}{\sum_{j=1}^{4} e^{z_{j}^{[l]}}}
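The two steps above can be sketched in NumPy. The weight and input values below are made up for illustration, but the shapes match the dimensions described in the text (W^{[l]} is 4x3, a^{[l-1]} is 3x1):

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; softmax is
    # invariant to adding a constant to every entry of z.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

# Hypothetical values; shapes match the text: W is 4x3,
# a_prev is 3x1, b is 4x1, so z and a are 4x1.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
b = np.zeros((4, 1))
a_prev = np.array([[0.5], [-1.0], [2.0]])

z = W @ a_prev + b   # weighted input, 4x1
a = softmax(z)       # softmax activations, 4x1

print(a.shape)       # (4, 1)
print(a.sum())       # ≈ 1.0, so the outputs form a distribution
```

Subtracting the max before exponentiating avoids overflow for large weighted inputs without changing the result.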

[Figure: example network with a softmax output layer]

Example of using a Softmax Activation Function

  • Let's keep using our example network listed above
  • Let's define z^{[l]} as the following:
z^{[l]} = \begin{bmatrix} 5 \cr 2 \cr -1 \cr 3 \end{bmatrix}
  • Before computing the softmax activations, we should define each individual e^{z_{i}^{[l]}}:
e^{z^{[l]}} = \begin{bmatrix} e^{z_{1}^{[l]}} \cr e^{z_{2}^{[l]}} \cr e^{z_{3}^{[l]}} \cr e^{z_{4}^{[l]}} \end{bmatrix} = \begin{bmatrix} e^{5} \cr e^{2} \cr e^{-1} \cr e^{3} \end{bmatrix} = \begin{bmatrix} 148.4 \cr 7.4 \cr 0.4 \cr 20.1 \end{bmatrix}
  • We still need to sum those values together:
\sum_{i=1}^{4} e^{z_{i}^{[l]}} = 148.4 + 7.4 + 0.4 + 20.1 = 176.3
  • Now, let's compute each individual softmax activation a_{i}^{[l]}:
a^{[l]} = \begin{bmatrix} \frac{e^{5}}{176.3} \cr \cr \frac{e^{2}}{176.3} \cr \cr \frac{e^{-1}}{176.3} \cr \cr \frac{e^{3}}{176.3} \end{bmatrix} = \begin{bmatrix} 0.842 \cr 0.042 \cr 0.002 \cr 0.114 \end{bmatrix}
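The arithmetic above can be checked directly in NumPy (a quick verification sketch, not part of the original notes):

```python
import numpy as np

z = np.array([5.0, 2.0, -1.0, 3.0])

exp_z = np.exp(z)      # elementwise e^{z_i}
total = exp_z.sum()    # ≈ 176.3 after rounding
a = exp_z / total      # softmax activations

print(np.round(exp_z, 1))   # matches [148.4, 7.4, 0.4, 20.1]
print(np.round(a, 3))       # matches [0.842, 0.042, 0.002, 0.114]
```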

Intuition behind Softmax Learning

  • In a network without any hidden layers, an output layer using a softmax activation function creates linear decision boundaries
  • There would be c - 1 linear decision boundaries
  • As we add hidden layers to our network, we would no longer have linear decision boundaries
  • Instead, our network would use non-linear decision boundaries
  • These non-linear decision boundaries are useful for capturing non-linear relationships between the c classes
  • Also, we should notice that each value in our softmax output vector represents a probability
  • Specifically, these values sum to 1
  • Essentially, softmax converts the values given by z^{[l]} into probabilities represented by a^{[l]}
  • Therefore, softmax reduces to logistic regression when c = 2
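The last point can be illustrated numerically: with two classes, the softmax probability of the first class equals a sigmoid applied to the difference of the two weighted inputs. A small sketch with made-up values:

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# With c = 2, softmax over [z1, z2] gives the same probability
# for class 1 as a sigmoid applied to the difference z1 - z2:
#   e^{z1} / (e^{z1} + e^{z2}) = 1 / (1 + e^{-(z1 - z2)})
z1, z2 = 2.0, -0.5   # hypothetical weighted inputs
p_softmax = softmax(np.array([z1, z2]))[0]
p_sigmoid = sigmoid(z1 - z2)

print(p_softmax, p_sigmoid)  # both ≈ 0.924
```

This is why binary logistic regression is the c = 2 special case of softmax regression.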

tldr

  • Softmax regression is used when we're interested in multi-class classification
  • Softmax regression refers to regression on a network whose output layer uses the softmax activation function
  • Such a network will typically return a vector of predictions, instead of a single-number prediction
  • The number of neurons in the output layer will be equal to the number of classes c that we're predicting on
