12. Softmax Activation: Understanding Probability Distributions in Neural Networks

What is Softmax?

Softmax is a function that converts a vector of raw scores (logits) into a vector of probabilities. The output of the softmax function is a probability distribution—each element in the output vector is between 0 and 1, and the sum of all elements is 1.

Mathematically

For a vector z of logits, the softmax function is defined as:

softmax(z_i) = exp(z_i) / Σ_j exp(z_j)

Where:

  • z_i is the raw score (logit) for class i.
  • The numerator exp(z_i) is the exponential of the raw score.
  • The denominator is the sum of the exponentials of all the logits in the vector, ensuring that the output probabilities sum to 1.
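Before moving to PyTorch, here is a minimal plain-Python sketch of that formula; the function name and the example logits are only for illustration:

```python
import math

def softmax(logits):
    # Exponentiate each raw score, then normalize by the sum of exponentials
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))  # ≈ [0.659, 0.242, 0.099]
```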

Applying Softmax in PyTorch

In PyTorch, F.softmax(output, dim=1) applies the softmax function to the output tensor along the specified dimension dim=1.

1. Output Tensor (Logits)

The output tensor is typically a 2D tensor of shape (batch_size, num_classes), where:

  • batch_size is the number of samples in the batch.
  • num_classes is the number of classes your model is predicting.

For example, if output has the shape (1, 3), it might look like this:

[[2.0, 1.0, 0.1]]

Here, the output represents the logits for three classes.

2. Softmax Transformation

F.softmax(output, dim=1) converts these logits into probabilities along the num_classes dimension (dim=1). After applying softmax, the output might look like this:

[[0.659, 0.242, 0.099]]

This means that the model is 65.9% confident in class 0, 24.2% confident in class 1, and 9.9% confident in class 2.
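The same example can be reproduced with a short PyTorch sketch (the logits are the ones from above; the printed decimals are rounded):

```python
import torch
import torch.nn.functional as F

output = torch.tensor([[2.0, 1.0, 0.1]])   # shape (1, 3): one sample, three classes
probs = F.softmax(output, dim=1)           # convert logits to probabilities per sample

print(probs)             # tensor([[0.6590, 0.2424, 0.0986]])
print(probs.sum(dim=1))  # each row sums to 1 (up to floating point)
```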

3. Dimension Argument (dim=1)

dim=1 specifies that the softmax function should be applied across the num_classes dimension. This ensures that for each sample in the batch, the logits are converted into probabilities that sum to 1 across all classes.
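To see why the dimension matters, here is a small sketch with a batch of two samples (the second row of logits is made up for illustration); softmax along dim=1 normalizes each row independently:

```python
import torch
import torch.nn.functional as F

batch = torch.tensor([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 1.0]])    # shape (2, 3): two samples, three classes

probs = F.softmax(batch, dim=1)            # normalize across classes for each sample
print(probs.sum(dim=1))                    # tensor([1., 1.]) — one distribution per row
```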

Why Use Softmax?

  • Probabilities: Softmax transforms raw scores into probabilities, making the output interpretable in terms of likelihood for each class.
  • Multi-Class Classification: Softmax is typically used in the last layer of a neural network for multi-class classification tasks, where the model needs to assign a probability to each class.
  • Loss Calculation: The output of softmax is often used with the negative log-likelihood loss or cross-entropy loss, which compares the predicted probabilities with the true labels.
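In practice, PyTorch's nn.CrossEntropyLoss expects raw logits and applies log-softmax internally, so an explicit softmax is usually reserved for interpreting predictions rather than for computing the loss. A minimal sketch, with a made-up target label:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1]])   # raw model output, shape (1, 3)
target = torch.tensor([0])                  # true class index for the single sample

# CrossEntropyLoss combines log-softmax and negative log-likelihood,
# so it is applied directly to the logits, not to softmax probabilities.
loss = nn.CrossEntropyLoss()(logits, target)
print(loss)                       # ≈ -log(0.659) ≈ 0.417

# Softmax is still useful when you want interpretable probabilities:
print(F.softmax(logits, dim=1))   # tensor([[0.6590, 0.2424, 0.0986]])
```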

Example Calculation

For a clearer picture, let’s calculate the softmax manually:

Assume the logits are [2.0, 1.0, 0.1]:

  • Calculate the exponentials:
    e^2.0 ≈ 7.389
    e^1.0 ≈ 2.718
    e^0.1 ≈ 1.105
  • Sum the exponentials:
    7.389 + 2.718 + 1.105 ≈ 11.212
  • Compute softmax for each class:
    softmax(2.0) ≈ 7.389 / 11.212 ≈ 0.659
    softmax(1.0) ≈ 2.718 / 11.212 ≈ 0.242
    softmax(0.1) ≈ 1.105 / 11.212 ≈ 0.099

So the resulting probabilities are approximately [0.659, 0.242, 0.099].

Summary

  • Softmax converts logits to a probability distribution.
  • dim=1 indicates that softmax is applied across the class scores for each sample.
  • The output probabilities sum to 1 and represent the model’s confidence in each class for a given input.