A feedforward neural network is a neural network in which connections between layers run in one direction only.
The outputs of one layer become the inputs (x) to the next.
Conveniently, matrix multiplication expresses the same one-way relationship.
Consider the following, which multiplies a layer by the weights that map it to the next layer.
f \left(
\underbrace{\begin{bmatrix}
1 & x_{i,1} & \cdots & x_{i,n}
\end{bmatrix}_{(1 \times (n+1))}}_{layer_{i}}
\cdot
\underbrace{\begin{bmatrix}
w_{0,0} & \cdots & w_{0,m-1} \\
\vdots & \ddots & \vdots \\
w_{n,0} & \cdots & w_{n,m-1}
\end{bmatrix}_{((n+1) \times m)}}_{weights_{i \rightarrow i+1}}
\right)_{(1 \times m)}
=
\underbrace{\begin{bmatrix}
x_{i+1,1} & \cdots & x_{i+1,m}
\end{bmatrix}_{(1\times m)}}_{layer_{i+1}}
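As a concrete sketch of this forward step, here is a minimal NumPy version (the function name `layer_forward` and the shapes are illustrative assumptions, not part of the derivation):

```python
import numpy as np

def layer_forward(x, w, f):
    """Map layer i (length n) to layer i+1 (length m).

    x : (n,) activations of layer i
    w : (n + 1, m) weights; row 0 holds the bias weights
    f : activation function, applied element-wise
    """
    x_b = np.concatenate(([1.0], x))  # prepend the constant bias input "1"
    z = x_b @ w                       # (1 x (n+1)) . ((n+1) x m) -> (m,)
    return f(z)                       # this becomes layer i+1
```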
The goal of gradient descent is to calculate an update to the weights such that the next pass through the network produces a lower error when estimating the output.
In other words, we want the rate of change of the error with respect to the weights, which tells us how much to change each weight:
\frac{\partial L}{\partial w_{i \rightarrow i+1}}
Since the product of a layer and its weights appears repeatedly, we abbreviate it as z.
z_{i(1 \times m)}=\underbrace{\begin{bmatrix}
1 & x_{i,1} & \cdots & x_{i,n}
\end{bmatrix}_{(1 \times (n+1))}}_{layer_{i}}
\cdot
\underbrace{\begin{bmatrix}
w_{0,0} & \cdots & w_{0,m-1} \\
\vdots & \ddots & \vdots \\
w_{n,0} & \cdots & w_{n,m-1}
\end{bmatrix}_{((n+1) \times m)}}_{weights_{i \rightarrow i+1}}
This notation can be summarized as follows:
z = x \cdot w
which lets us apply the following matrix differentiation rules:
\frac{\partial L}{\partial w} = x^T \frac{\partial L}{\partial z}
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot w^T
\frac{\partial z}{\partial x} = w^T
\frac{\partial z}{\partial w} = x
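These rules can be spot-checked numerically. Below is a small sketch (the shapes, the toy loss L = \frac{1}{2}\sum z^2, and the variable names are all illustrative assumptions) that compares the rule for \frac{\partial L}{\partial w} against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))   # a layer with n + 1 = 4 entries (bias included)
w = rng.normal(size=(4, 3))   # weights mapping to m = 3 outputs

# Any scalar loss of z works for checking the rules; use L = sum(z^2) / 2,
# whose gradient with respect to z is simply z.
z = x @ w
dL_dz = z

dL_dw = x.T @ dL_dz   # rule: dL/dw = x^T . dL/dz  -> shape (4, 3)
dL_dx = dL_dz @ w.T   # rule: dL/dx = dL/dz . w^T  -> shape (1, 4)

# Finite-difference check of one weight entry.
eps = 1e-6
w_pert = w.copy()
w_pert[2, 1] += eps
numeric = (np.sum((x @ w_pert) ** 2) - np.sum(z ** 2)) / (2 * eps)
print(np.isclose(numeric, dL_dw[2, 1], atol=1e-4))  # True
```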
This is important when we consider the function f(z).
This could be any of a number of functions, typically from the sigmoid family: logistic, hyperbolic tangent, arctangent, etc.
Sigmoid functions are bounded, S-shaped curves (the logistic, for example, stays between 0 and 1 without ever reaching either bound) and are both invertible and differentiable.
One such sigmoid function is the logistic function:
f \left(z \right ) = \frac{1}{1 + e^{-z}}
which can be differentiated as follows
f' \left(z \right ) = \frac{\partial }{\partial z}f(z) = \frac{e^{-z}}{\left(1 + e^{-z} \right )^2}
and expanded
\frac{e^{-z}}{\left(1 + e^{-z} \right )^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
further expands to
\frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \frac{1}{1 + e^{-z}} \cdot \left(1 - \frac{1}{1 + e^{-z}} \right)
which simplifies to
\frac{1}{1 + e^{-z}} \cdot \left(1 - \frac{1}{1 + e^{-z}} \right) = f(z) \cdot \left(1- f(z) \right )
such that
f'(z)_{(1 \times m)} = f(z)_{(1 \times m)} * \left(1- f(z) \right )_{(1 \times m)}
where * is the Schur/Hadamard product: the element-wise product, not the matrix product.
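In code, both the function and its derivative reduce to element-wise expressions; a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def logistic(z):
    """Element-wise logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_prime(z):
    """Element-wise derivative f(z) * (1 - f(z)) -- the Hadamard product above."""
    fz = logistic(z)
    return fz * (1.0 - fz)
```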
In the same manner, the logistic function can be inverted:
f^{-1} \left(z \right ) = -\ln \left(\frac{1}{z} - 1\right) = \ln \left(\frac{z}{1-z}\right)
This, however, assumes that its argument never reaches 0 or 1 exactly.
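A short sketch of the inverse (often called the logit), again with illustrative names; it round-trips with the logistic as long as its argument stays strictly between 0 and 1:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_inverse(y):
    """Inverse of the logistic; defined only for 0 < y < 1."""
    return np.log(y / (1.0 - y))  # equivalently -ln(1/y - 1)

z = np.array([-2.0, 0.0, 3.0])
print(np.allclose(logistic_inverse(logistic(z)), z))  # True
```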
We now turn to what the network aims to minimize: the error, residual, or loss.
The loss function is often defined as half the squared residual,
L = \frac{1}{2}\cdot \left( \underbrace{\tilde{x}_{(1 \times m)}}_{measured} - \underbrace{\hat{x}_{(1 \times m)}}_{estimated} \right )^2
so that it can be conveniently differentiated with respect to the estimate:
\frac{\partial L}{\partial \hat{x}} = \underbrace{\hat{x}_{(1 \times m)}}_{estimated} - \underbrace{\tilde{x}_{(1 \times m)}}_{measured}
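As a brief sketch (names are illustrative), the loss and its gradient with respect to the estimate are:

```python
import numpy as np

def loss(x_measured, x_estimated):
    """Half the squared residual, summed over the output vector."""
    return 0.5 * np.sum((x_measured - x_estimated) ** 2)

def loss_grad(x_measured, x_estimated):
    """dL/d(x_estimated) = estimated - measured."""
    return x_estimated - x_measured
```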
Since the estimate is itself the output of the activation, \hat{x} = f(z), we also know
\frac{\partial}{\partial z}\hat{x} = \frac{\partial}{\partial z}f(z) = f'(z) = f(z) * (1-f(z))
the chain rule can then be applied
\frac{\partial L}{\partial z_i} = \frac{\partial L}{\partial \hat{x}} \frac{\partial \hat{x}}{\partial z_i} =
\left(\underbrace{\hat{x}_{i+1,(1 \times m)}}_{estimated} - \underbrace{\tilde{x}_{i+1,(1 \times m)}}_{measured} \right ) * f'(z_{i})_{(1 \times m)}
and substituting in we see
\frac{\partial L}{\partial w_{i \rightarrow i+1}} = x^T_{i} \cdot \frac{\partial L}{\partial z_i}
=
\begin{bmatrix}
1 \\
x_{i,1} \\
\vdots \\
x_{i,n}
\end{bmatrix}_{((n + 1) \times 1)}
\cdot
\begin{bmatrix}
\left(\underbrace{\hat{x}_{i+1,(1 \times m)}}_{estimated} - \underbrace{\tilde{x}_{i+1,(1 \times m)}}_{measured} \right )
*
f' \left(z_{i} \right )_{(1 \times m)}
\end{bmatrix}_{(1 \times m)}
=
\begin{bmatrix}
1 \\
x_{i,1} \\
\vdots \\
x_{i,n}
\end{bmatrix}_{((n + 1) \times 1)}
\cdot
\begin{bmatrix}
\left(\underbrace{\hat{x}_{i+1,(1 \times m)}}_{estimated} - \underbrace{\tilde{x}_{i+1,(1 \times m)}}_{measured} \right )
*
f' \left(x_{i,(1 \times (n+1))} \cdot w_{i \rightarrow i+1, ((n+1) \times m)} \right )_{(1 \times m)}
\end{bmatrix}_{(1 \times m)}
The last step is to set a learning rate \gamma, so that the weights are nudged against the gradient:
\delta w_{i \rightarrow i+1} = -\gamma \cdot \frac{\partial L}{\partial w_{i \rightarrow i+1}}
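Putting the output-layer gradient and the update step together, a minimal NumPy sketch might look as follows (the sizes, the learning rate, and all variable names are illustrative assumptions):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, m = 2, 3                                            # illustrative layer sizes
x_i = np.hstack([[1.0], rng.normal(size=n)])[None, :]  # (1, n+1), bias prepended
w = rng.normal(size=(n + 1, m))                        # weights_{i -> i+1}
x_measured = rng.uniform(size=(1, m))                  # target for layer i+1
gamma = 0.1                                            # learning rate

z_i = x_i @ w                                          # (1, m)
x_hat = logistic(z_i)                                  # estimated layer i+1
dL_dz = (x_hat - x_measured) * x_hat * (1.0 - x_hat)   # (1, m), element-wise products
dL_dw = x_i.T @ dL_dz                                  # ((n+1) x m), the outer product above

w -= gamma * dL_dw                                     # gradient-descent step
```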
Great, but what about the weights of the previous layers?
First we need to know how L varies with x_{i}.
\frac{\partial L}{\partial x_{i}} = \frac{\partial L}{\partial z_{i}} \frac{\partial z_{i}}{\partial x_{i}}
and from the above rules of differentiation
\frac{\partial L}{\partial z_{i}}\frac{\partial z_{i}}{\partial x_{i}} = \frac{\partial L}{\partial z_{i}} \cdot w^T_{i \rightarrow i+1}
Now, again, we can use the chain rule.
Since x_{i} = f(z_{i-1}), the loss varies with z_{i-1} as
\frac{\partial L}{\partial z_{i-1}} = \frac{\partial L}{\partial x_{i}} * \frac{\partial x_{i}}{\partial z_{i-1}} = \frac{\partial L}{\partial x_{i}} * f'(z_{i-1})
and, applying the matrix rule to z_{i-1} = x_{i-1} \cdot w_{i-1 \rightarrow i},
\frac{\partial L}{\partial w_{i-1 \rightarrow i}} = x^T_{i-1} \cdot \frac{\partial L}{\partial z_{i-1}}
so now
\frac{\partial L}{\partial w_{i-1 \rightarrow i}} =
x^T_{i-1} \cdot
\left(
\underbrace{\left(
\left(
\left(\hat{x}_{i+1} - \tilde{x}_{i+1} \right ) * f'(x_i \cdot w_{i \rightarrow i+1}) \right) \cdot w^T_{i \rightarrow i+1}
\right )}_{\frac{\partial L}{\partial x_{i}}} *
f'(x_{i-1} \cdot w_{i-1 \rightarrow i})
\right )
This is what is referred to as back-propagation.
It is performed from the final layer back to the initial layer, before the next forward pass calculates a new residual.
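To tie everything together, here is a minimal end-to-end sketch: one small two-layer network trained on a single sample with the formulas above. All sizes, names, the learning rate, and the iteration count are illustrative assumptions, not prescriptions.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
n0, n1, n2 = 4, 5, 3                               # illustrative layer widths
w01 = rng.normal(scale=0.5, size=(n0 + 1, n1))     # weights_{0 -> 1}, bias row included
w12 = rng.normal(scale=0.5, size=(n1 + 1, n2))     # weights_{1 -> 2}, bias row included
gamma = 0.5                                        # learning rate

x0 = rng.uniform(size=(1, n0))                     # input layer
target = rng.uniform(size=(1, n2))                 # the "measured" output

for _ in range(1000):
    # Forward pass: prepend the bias input "1" to each layer.
    x0b = np.hstack([np.ones((1, 1)), x0])
    z0 = x0b @ w01
    x1 = logistic(z0)
    x1b = np.hstack([np.ones((1, 1)), x1])
    z1 = x1b @ w12
    x2 = logistic(z1)                              # estimated output

    # Backward pass, from the final layer toward the input.
    dL_dz1 = (x2 - target) * x2 * (1.0 - x2)       # dL/dz_1
    dL_dw12 = x1b.T @ dL_dz1                       # dL/dw_{1 -> 2}
    dL_dx1 = dL_dz1 @ w12[1:, :].T                 # dL/dx_1 (the bias row carries no gradient back)
    dL_dz0 = dL_dx1 * x1 * (1.0 - x1)              # dL/dz_0
    dL_dw01 = x0b.T @ dL_dz0                       # dL/dw_{0 -> 1}

    # Gradient-descent updates before the next residual is calculated.
    w12 -= gamma * dL_dw12
    w01 -= gamma * dL_dw01

print(0.5 * np.sum((target - x2) ** 2))            # loss shrinks toward zero
```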