A Geometric Interpretation of Neural Networks
February 20, 2024 / 8 min read
Last Updated: February 20, 2024

The Canonical Neural Network
Let's start by reviewing the basic multi-layer perceptron (MLP) neural network.
Given an input $\mathbf{x} \in \mathbb{R}^{n}$, a single layer computes

$$\mathbf{h} = \sigma(W\mathbf{x} + \mathbf{b})$$

where $W$ is the weight matrix, $\mathbf{b}$ is the bias vector, and $\sigma(\cdot)$ is a non-linear activation function (e.g. $\tanh$) applied element-wise. A full MLP simply stacks several of these layers.
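As a minimal sketch (the layer sizes and the choice of tanh here are illustrative, not tied to any particular architecture), a two-layer MLP in PyTorch looks like this:

```python
import torch
import torch.nn as nn

# A tiny two-layer MLP: each nn.Linear computes W x + b,
# and Tanh applies the element-wise non-linearity sigma.
mlp = nn.Sequential(
    nn.Linear(2, 2),   # W1 x + b1
    nn.Tanh(),         # sigma(.)
    nn.Linear(2, 1),   # W2 h + b2
)

x = torch.randn(4, 2)   # a batch of 4 two-dimensional inputs
y = mlp(x)
print(y.shape)          # torch.Size([4, 1])
```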
Geometric Interpretation
But what do those matrix multiplications actually do?
A neural network basically does two things: rotating and twisting in high-dimensional space. "Rotating" is, in mathematical jargon, a linear transformation ($W\mathbf{x}$), and "twisting" is what the element-wise non-linearities (e.g. $\tanh$) do.
Linear Transformations
So what is a linear transformation, and why does it matter for neural networks? When we multiply a matrix $W$ by a vector $\mathbf{x}$, we get a new vector $W\mathbf{x}$. This transformation can be thought of as a combination of rotation, scaling, reflection, and/or shearing of the input vector $\mathbf{x}$:
- Rotation (when $W$ is orthonormal, i.e. $W^\top W = I$ with $\det(W) = 1$). E.g., $W = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$ rotates by 90 degrees in the counter-clockwise direction.
- Scaling (when $W$ is diagonal). E.g., $W = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}$ scales by 2 in the x-axis and 3 in the y-axis.
- Reflection (when $\det(W)$ is negative). E.g., $W = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$ reflects across the x-axis.
- Shearing (when $W$ has non-zero off-diagonal entries). E.g., $W = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$ shears along the x-axis.
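To make the four cases concrete, here is a small sketch that applies each of the example matrices above to an arbitrary test point (the point itself is just a placeholder):

```python
import torch

x = torch.tensor([1.0, 1.0])  # an arbitrary test point

rotation   = torch.tensor([[0., -1.], [1., 0.]])   # 90-degree counter-clockwise rotation
scaling    = torch.tensor([[2., 0.], [0., 3.]])    # stretch x by 2, y by 3
reflection = torch.tensor([[1., 0.], [0., -1.]])   # flip across the x-axis (det = -1)
shear      = torch.tensor([[1., 1.], [0., 1.]])    # shear along the x-axis

for name, W in [('rotation', rotation), ('scaling', scaling),
                ('reflection', reflection), ('shear', shear)]:
    print(f'{name:10s} W @ x = {W @ x}')
```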
Let's visualize these linear transformations on an arbitrary 2D cloud of points.
Creating Random 2D Cloud
```python
import torch
import matplotlib.pyplot as plt
from res.plot_lib import set_default, show_scatterplot, plot_bases

# use the GPU if one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# generate some points in 2-D space
n_points = 1_000
X = torch.randn(n_points, 2).to(device)
colors = X[:, 0]  # color each point by its initial x-coordinate so we can track it
show_scatterplot(X, title='X')

# the origin (two zero rows) stacked on the identity matrix: the basis vectors
OI = torch.cat((torch.zeros(2, 2), torch.eye(2))).to(device)
plot_bases(OI)  # plot the basis vectors
```
Now let's apply some linear transformations to this cloud of points. We first generate a random matrix $W$, factor it with the singular value decomposition $W = U\,\mathrm{diag}(S)\,V^\top$, and then apply $V$, $\mathrm{diag}(S)$, and $U$ to the cloud one at a time.
Applying Linear Transformations
```python
# reproducibility
torch.manual_seed(2024)

# create a random matrix
W = torch.randn(2, 2).to(device)
# singular value decomposition: W = U @ diag(S) @ V.T
U, S, V = torch.svd(W)

# define the original basis vectors
OI = torch.cat((torch.zeros(2, 2), torch.eye(2))).to(device)

# apply the transformations sequentially
Y1 = X @ V               # note: torch.svd returns V itself (not V.T), so no transpose is needed
Y2 = Y1 @ torch.diag(S)  # diag(S) is the diagonal scaling matrix
Y3 = Y2 @ U

# transform the basis vectors for each step
new_OI_Y1 = OI @ V
new_OI_Y2 = new_OI_Y1 @ torch.diag(S)
new_OI_Y3 = new_OI_Y2 @ U

# titles and data for the plots
titles = [
    'X (Original)',
    'Y1 = XV\nV = [{:.3f}, {:.3f}], [{:.3f}, {:.3f}]'.format(
        V[0, 0].item(), V[0, 1].item(), V[1, 0].item(), V[1, 1].item()),
    'Y2 = SY1\nS = [{:.3f}, {:.3f}]'.format(S[0].item(), S[1].item()),
    'Y3 = UY2\nU = [{:.3f}, {:.3f}], [{:.3f}, {:.3f}]'.format(
        U[0, 0].item(), U[0, 1].item(), U[1, 0].item(), U[1, 1].item()),
]
Ys = [X, Y1, Y2, Y3]
new_OIs = [OI, new_OI_Y1, new_OI_Y2, new_OI_Y3]

# plot the sequential transformations
plt.figure(figsize=(15, 5))
for i in range(4):
    plt.subplot(1, 4, i + 1)
    show_scatterplot(Ys[i], colors, title=titles[i], axis=True)
    plot_bases(new_OIs[i])
plt.show()
```
Did you notice anything? The cloud of points was rotated, scaled, and rotated again. Let's analyze the transformations:
The first transformation, $Y_1 = XV$, rotated the cloud by the matrix $V$. To find how much the cloud was rotated, we first note that, in general, a 2D rotation matrix can be written as:

$$R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

which means we can extract the angle of rotation from the matrix $V$ using the formula $\theta = \operatorname{atan2}(V_{21}, V_{11})$ radians. For this run, $V$ rotated the cloud by about $-75$ degrees; in other words, the cloud was rotated by 75 degrees in the clockwise direction. However, a 75-degree clockwise rotation alone would put the blue cloud in the 4th quadrant, yet the cloud sits in the 3rd quadrant. This is because the matrix $V$ also contains a reflection! We can check that $\det(V) < 0$, which means that $V$ reflects the cloud as well, landing the blue cloud in the 3rd quadrant (though it is outside the scope of this article, the axis of reflection can be found via eigen-decomposition).

The second transformation, $Y_2 = Y_1\,\mathrm{diag}(S)$, scaled the cloud by the diagonal matrix $\mathrm{diag}(S)$: a factor of 2.3 along the x-axis and 0.5933 along the y-axis. We can check that, indeed, the cloud has become both longer and thinner.

Finally, the third transformation, $Y_3 = Y_2 U$, rotated the cloud again, this time by the matrix $U$: a rotation of $-137.59$ degrees, i.e. a clockwise rotation.
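As a rough sketch of how these numbers can be read off programmatically (it reuses U, S, V from the code above; the helper rotation_angle_deg is my own, and the commented values are the ones reported for this run), we can extract the angle with atan2 and check the determinant:

```python
import math
import torch

# Read the rotation angle off a 2x2 matrix R, assuming it is close to
# [[cos t, -sin t], [sin t, cos t]]  (hypothetical helper, not from the notebook)
def rotation_angle_deg(R):
    return math.degrees(math.atan2(R[1, 0].item(), R[0, 0].item()))

print('angle of V:', rotation_angle_deg(V))   # ~ -75 degrees in this run
print('det of V:  ', torch.det(V).item())     # negative, so V also reflects
print('angle of U:', rotation_angle_deg(U))   # ~ -137.59 degrees in this run
print('singular values S:', S)                # ~ [2.3, 0.5933] in this run
```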
Singular Value Decomposition (SVD)
What we just did is called the singular value decomposition (SVD) of the matrix $W$:

$$W = U\,\mathrm{diag}(S)\,V^\top$$

$U$ and $V$ are the rotation-reflection matrices we just saw, and $S$ holds the singular values: $S_1$ is the scaling factor for the x dimension, and $S_2$ is the scaling factor for the y dimension. The larger the scaling factor, the more stretched the space is in that direction, and vice versa.
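Reusing W, U, S, and V from the code above, a quick numerical sanity check of this factorization might look like:

```python
# torch.svd returns U, S, V such that W = U @ diag(S) @ V.T
W_rebuilt = U @ torch.diag(S) @ V.T
print(torch.allclose(W, W_rebuilt, atol=1e-6))  # True, up to floating-point error
print('singular values:', S)                    # how much each direction is stretched
```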
Non-linear Transformations
Linear transformations can rotate, reflect, stretch, and compress space, but they cannot curve or squash it. We need non-linearities for that. To visualize this, we can use the tanh function, which squashes its input into the range $(-1, 1)$.
There are a couple of famous non-linear functions that are used in neural networks:
- Hyperbolic Tangent (tanh)
  - Takes input anywhere between $-\infty$ and $+\infty$
  - Squashes the output to be between $-1$ and $+1$
  - Inputs (and outputs) that are already in the range $[-1, 1]$ don't change much, since $\tanh(x) \approx x$ near zero
  - Two "kinks" where the curve bends into saturation: one near $-1$ and the other near $+1$
- Sigmoid
  - Takes input anywhere between $-\infty$ and $+\infty$
  - Squashes the output to be between $0$ and $1$
  - Commonly used in the output layer of binary classifiers

Both behaviors are easy to see by plotting the two functions, as sketched below.
Applying Non-linear Transformations
```python
plt.figure(figsize=(15, 5))
plt.subplot(1, 4, 1)
show_scatterplot(Y3, colors, title='h = Y3')
plot_bases(OI)

# scale the input before tanh: larger s pushes more points into saturation
for s in range(1, 4):
    plt.subplot(1, 4, s + 1)
    Y = torch.tanh(s * Y3).data   # scale & apply the non-linearity
    show_scatterplot(Y, colors, title=f'Y = tanh({s} * h)')
    plot_bases(OI, width=0.01)

plt.show()
```
And that's it! We have seen how a neural network can be thought of as a series of linear and non-linear transformations in high-dimensional space. This kind of flexibility allows neural networks to model complex relationships in the data, and it is the reason why they are so powerful.
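To tie it all together, here is a small sketch (the number of layers and their width are arbitrary choices) that stacks a few random linear maps with tanh in between, i.e. repeated rotate-scale-rotate steps followed by squashing:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A few (linear -> tanh) blocks: each one rotates/scales/reflects the 2-D
# cloud and then squashes it into the (-1, 1) square.
layers = []
for _ in range(3):
    layers += [nn.Linear(2, 2, bias=False), nn.Tanh()]
net = nn.Sequential(*layers)

X = torch.randn(1_000, 2)
with torch.no_grad():
    Y = net(X)
print(Y.min().item(), Y.max().item())  # everything ends up strictly inside (-1, 1)
```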
Have a wonderful day.
– Frank
It's not a black box, it's a high-dimensional space transformer!