Backpropagation

1D Networks

Preliminary: Calculus

Error

Error \(e\):

\[e = \frac{1}{2} (t-o)^2\]

Derivative with respect to network output \(o\) given target \(t\):

\[\begin{align} \frac{de}{do} &= \lim_{do \to 0} \frac{1}{2} \frac{(t - (o + do))^2 - (t-o)^2}{do} \\ \\ &= \frac{1}{2} \lim_{do \to 0} \frac{(t^2 + o^2 + do^2 - 2to + 2o\,do - 2t\,do) - (t^2 + o^2 - 2to)}{do} \\ \\ &= \frac{1}{2} \lim_{do \to 0} \frac{do^2 + 2o\,do - 2t\,do}{do} \\ \\ &= \frac{1}{2} \lim_{do \to 0} (do + 2o - 2t) \\ &= \frac{1}{2} (2o - 2t) \\ &= o - t \end{align}\]
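As a quick numerical sanity check, the finite-difference quotient of \(e\) should approach \(o - t\) as \(do\) shrinks. A minimal sketch in Python; the values of \(t\), \(o\), and the step \(do\) are arbitrary.

```python
# Finite-difference check of de/do = o - t for e = 1/2 (t - o)^2.
# The values of t, o and the step size do are arbitrary example inputs.

def error(t, o):
    return 0.5 * (t - o) ** 2

t, o, do = 0.7, 0.3, 1e-6
numeric = (error(t, o + do) - error(t, o)) / do
analytic = o - t
print(numeric, analytic)  # both approximately -0.4
```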

Sigmoid

Sigmoid:

\[s(x) = \frac{1}{1 + e^{-x}}\]

Derivative:

\[\frac{d s(x)}{dx} = s(x) (1 - s(x))\]
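The identity can be checked numerically with a central difference. A minimal sketch, with \(x = 0.5\) as an arbitrary test point.

```python
import math

def s(x):
    return 1.0 / (1.0 + math.exp(-x))

# Central-difference estimate of s'(x) versus the identity s(x)(1 - s(x)).
x, dx = 0.5, 1e-6
numeric = (s(x + dx) - s(x - dx)) / (2 * dx)
analytic = s(x) * (1 - s(x))
print(numeric, analytic)  # both approximately 0.235
```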

Linear-Sigmoid

Network:

\[\begin{align} x &\;\rightarrow\; y = ax + b \\ &\;\rightarrow\; s(y) = o \end{align}\]

Error:

\[e = \frac{1}{2}(t - o)^2\]

Weight delta:

\[\begin{align} \frac{de}{da} &= \frac{de}{do} \cdot \frac{do}{dy} \cdot \frac{dy}{da} \\ \\ &= (o-t) \cdot \frac{d s(y)}{dy} \cdot \frac{d (ax + b)}{da} \\ \\ &= (o-t) \cdot s(y)(1 - s(y)) \cdot x \end{align}\]

Bias delta:

\[\begin{align} \frac{de}{db} &= \frac{de}{do} \cdot \frac{do}{dy} \cdot \frac{dy}{db} \\ \\ &= (o-t) \cdot \frac{d s(y)}{dy} \cdot \frac{d (ax + b)}{db} \\ \\ &= (o-t) \cdot s(y)(1 - s(y)) \cdot 1 \end{align}\]
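Putting the two deltas into code, a minimal sketch of one forward and backward pass; the parameter values \(a\), \(b\) and the sample \((x, t)\) are arbitrary illustrative choices.

```python
import math

def s(v):
    return 1.0 / (1.0 + math.exp(-v))

a, b = 0.8, -0.2   # weight and bias (arbitrary)
x, t = 1.5, 0.6    # input and target (arbitrary)

# Forward pass.
y = a * x + b
o = s(y)

# Backward pass, following de/da and de/db above.
de_do = o - t
do_dy = s(y) * (1 - s(y))
de_da = de_do * do_dy * x
de_db = de_do * do_dy * 1
print(de_da, de_db)
```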

Linear-Sigmoid-Linear-Sigmoid

Network:

\[\begin{align} x_1 &\;\rightarrow\; y = ax_1 + b \\ &\;\rightarrow\; s(y) = x_2 \\ &\;\rightarrow\; z = cx_2 + d \\ &\;\rightarrow\; s(z) = o \end{align}\]

Error:

\[e = \frac{1}{2}(t - o)^2\]

Second layer (output layer):

\[\begin{align} \frac{de}{dc} &= \frac{de}{do} \cdot \frac{do}{dz} \cdot \frac{dz}{dc} = (o-t) \cdot s(z)(1 - s(z)) \cdot x_2 \end{align}\]

\[\begin{align} \frac{de}{dd} &= \frac{de}{do} \cdot \frac{do}{dz} \cdot \frac{dz}{dd} = (o-t) \cdot s(z)(1 - s(z)) \cdot 1 \end{align}\]

First layer (hidden layer):

\[\begin{align} \frac{de}{da} &= \frac{de}{do} \cdot \frac{do}{dz} \cdot \frac{dz}{dx_2} \cdot \frac{dx_2}{dy} \cdot \frac{dy}{da} \\ &= (o-t) \cdot s(z)(1 - s(z)) \cdot c \cdot s(y)(1 - s(y)) \cdot x_1 \end{align}\]

\[\begin{align} \frac{de}{db} &= \frac{de}{do} \cdot \frac{do}{dz} \cdot \frac{dz}{dx_2} \cdot \frac{dx_2}{dy} \cdot \frac{dy}{db} \\ &= (o-t) \cdot s(z)(1 - s(z)) \cdot c \cdot s(y)(1 - s(y)) \cdot 1 \end{align}\]
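The same pattern in code for the two-layer network, again with arbitrary parameters and sample; note how the hidden-layer deltas reuse the output-layer factor and multiply in \(c\) and \(s(y)(1 - s(y))\).

```python
import math

def s(v):
    return 1.0 / (1.0 + math.exp(-v))

a, b, c, d = 0.8, -0.2, 1.1, 0.3   # arbitrary parameters
x1, t = 1.5, 0.6                   # arbitrary input and target

# Forward pass.
y = a * x1 + b
x2 = s(y)
z = c * x2 + d
o = s(z)

# Output layer deltas.
delta_out = (o - t) * s(z) * (1 - s(z))
de_dc = delta_out * x2
de_dd = delta_out * 1

# Hidden layer deltas.
delta_hidden = delta_out * c * s(y) * (1 - s(y))
de_da = delta_hidden * x1
de_db = delta_hidden * 1
print(de_dc, de_dd, de_da, de_db)
```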

Nth Linear Layer

Corollary from everything so far:

Given the Nth linear layer:

\[y_n = a_nx_n + b_n\]

Layer delta common to weight and bias deltas:

\[\frac{de}{dy_n} = \frac{de}{do} \cdot \frac{do}{dy_n}\]

Where \(\frac{do}{dy_n}\) is accumulated from subsequent layers.

Weight delta:

\[\frac{de}{da_n} = \frac{de}{dy_n} \cdot \frac{dy_n}{da_n} = \frac{de}{dy_n} \cdot x_n\]

Bias delta:

\[\frac{de}{db_n} = \frac{de}{dy_n} \cdot \frac{dy_n}{db_n} = \frac{de}{dy_n} \cdot 1\]
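In code, the accumulation of \(\frac{de}{dy_n}\) becomes a single backward loop. A minimal sketch for a stack of the linear-sigmoid layers used above; the layer parameters and the sample are arbitrary.

```python
import math

def s(v):
    return 1.0 / (1.0 + math.exp(-v))

layers = [(0.8, -0.2), (1.1, 0.3), (-0.5, 0.1)]   # (a_n, b_n) per layer (arbitrary)
x, t = 1.5, 0.6                                   # input and target (arbitrary)

# Forward pass, remembering each layer's input x_n and pre-activation y_n.
inputs, preacts = [], []
for a_n, b_n in layers:
    inputs.append(x)
    y_n = a_n * x + b_n
    preacts.append(y_n)
    x = s(y_n)
o = x

# Backward pass: de/dy_n accumulated from the layers after it.
de_dy = o - t                                      # de/do
grads = []
for (a_n, b_n), x_n, y_n in reversed(list(zip(layers, inputs, preacts))):
    de_dy = de_dy * s(y_n) * (1 - s(y_n))          # through this layer's sigmoid: de/dy_n
    grads.append((de_dy * x_n, de_dy * 1))         # (de/da_n, de/db_n)
    de_dy = de_dy * a_n                            # through the linear map: de/dx_n
grads.reverse()
print(grads)
```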

ND Networks

Error

Error:

\[\vec{e} = \frac{1}{2}(\vec{t} - \vec{o})^2\]

Derivative:

\[\begin{align} \frac{\partial \vec{e}}{\partial \vec{o}} &= \begin{bmatrix} \frac{de_1}{do_1} & \frac{de_2}{do_1} & \cdots & \frac{de_N}{do_1} \\ \\ \frac{de_1}{do_2} & \frac{de_2}{do_2} & \cdots & \frac{de_N}{do_2} \\ \\ \vdots & \vdots & \ddots & \vdots \\ \\ \frac{de_1}{do_N} & \frac{de_2}{do_N} & \cdots & \frac{de_N}{do_N} \end{bmatrix} \\ \\ &= \begin{bmatrix} \frac{d(\frac{1}{2}(t_1 - o_1)^2)}{do_1} & \frac{d(\frac{1}{2}(t_2 - o_2)^2)}{do_1} & \cdots & \frac{d(\frac{1}{2}(t_N - o_N)^2)}{do_1} \\ \\ \frac{d(\frac{1}{2}(t_1 - o_1)^2)}{do_2} & \frac{d(\frac{1}{2}(t_2 - o_2)^2)}{do_2} & \cdots & \frac{d(\frac{1}{2}(t_N - o_N)^2)}{do_2} \\ \\ \vdots & \vdots & \ddots & \vdots \\ \\ \frac{d(\frac{1}{2}(t_1 - o_1)^2)}{do_N} & \frac{d(\frac{1}{2}(t_2 - o_2)^2)}{do_N} & \cdots & \frac{d(\frac{1}{2}(t_N - o_N)^2)}{do_N} \end{bmatrix} \\ \\ &= \begin{bmatrix} o_1 - t_1 & 0 & \cdots & 0 \\ 0 & o_2 - t_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & o_N - t_N \end{bmatrix} \end{align}\]

Sigmoid

Sigmoid:

\[\vec{o} = s(\vec{y})\]

Derivative:

\[\begin{align} \frac{\partial \vec{o}}{\partial \vec{y}} &= \begin{bmatrix} \frac{do_1}{dy_1} & \frac{do_2}{dy_1} & \cdots & \frac{do_N}{dy_1} \\ \\ \frac{do_1}{dy_2} & \frac{do_2}{dy_2} & \cdots & \frac{do_N}{dy_2} \\ \\ \vdots & \vdots & \ddots & \vdots \\ \\ \frac{do_1}{dy_N} & \frac{do_2}{dy_N} & \cdots & \frac{do_N}{dy_N} \end{bmatrix} \\ \\ &= \begin{bmatrix} \frac{d(s(y_1))}{dy_1} & \frac{d(s(y_2))}{dy_1} & \cdots & \frac{d(s(y_N))}{dy_1} \\ \\ \frac{d(s(y_1))}{dy_2} & \frac{d(s(y_2))}{dy_2} & \cdots & \frac{d(s(y_N))}{dy_2} \\ \\ \vdots & \vdots & \ddots & \vdots \\ \\ \frac{d(s(y_1))}{dy_N} & \frac{d(s(y_2))}{dy_N} & \cdots & \frac{d(s(y_N))}{dy_N} \end{bmatrix} \\ \\ &= \begin{bmatrix} s(y_1)(1 - s(y_1)) & 0 & \cdots & 0 \\ 0 & s(y_2)(1 - s(y_2)) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s(y_N)(1 - s(y_N)) \end{bmatrix} \end{align}\]
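Since both Jacobians are diagonal, they can be built directly with `np.diag` from the element-wise formulas. A minimal NumPy sketch; the vectors \(\vec{t}\) and \(\vec{y}\) are arbitrary example values.

```python
import numpy as np

def s(v):
    return 1.0 / (1.0 + np.exp(-v))

t = np.array([0.6, 0.1, 0.9])    # arbitrary target
y = np.array([0.4, -1.2, 2.0])   # arbitrary pre-activation
o = s(y)

de_do = np.diag(o - t)               # diagonal Jacobian of the error
do_dy = np.diag(s(y) * (1 - s(y)))   # diagonal Jacobian of the sigmoid
print(de_do)
print(do_dy)
```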

Linear-Sigmoid

Network:

\[\begin{align} \vec{x} &\;\rightarrow\; \vec{y} = \vec{x}A + \vec{b} \\ &\;\rightarrow\; s(\vec{y}) = \vec{o} \end{align}\]

Notation: \(\vec{x}\) is a \(1 \times N\) row vector; \(A\) is an \(N \times M\) weight matrix with entries \(a_{ij}\); and \(\vec{b}\), \(\vec{y}\), \(\vec{o}\), \(\vec{t}\), \(\vec{e}\) are \(1 \times M\) row vectors.

Error:

\[\vec{e} = \frac{1}{2}(\vec{t} - \vec{o})^2\]

Weight deltas:

\[\frac{\partial \vec{e}}{\partial a_{ij}} = \frac{\partial \vec{y}}{\partial a_{ij}} \cdot \frac{\partial \vec{o}}{\partial \vec{y}} \cdot \frac{\partial \vec{e}}{\partial \vec{o}}\]

Error and activation:

\[\begin{align} \frac{\partial \vec{o}}{\partial \vec{y}} \cdot \frac{\partial \vec{e}}{\partial \vec{o}} &= \begin{bmatrix} s(y_1)(1 - s(y_1)) & 0 & \cdots & 0 \\ 0 & s(y_2)(1 - s(y_2)) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s(y_M)(1 - s(y_M)) \end{bmatrix} \begin{bmatrix} o_1 - t_1 & 0 & \cdots & 0 \\ 0 & o_2 - t_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & o_M - t_M \end{bmatrix} \\ \\ &= \begin{bmatrix} s(y_1)(1 - s(y_1))(o_1 - t_1) & 0 & \cdots & 0 \\ 0 & s(y_2)(1 - s(y_2))(o_2 - t_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s(y_M)(1 - s(y_M))(o_M - t_M) \end{bmatrix} \\ \\ &= \begin{bmatrix} \delta_1 & 0 & \cdots & 0 \\ 0 & \delta_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \delta_M \end{bmatrix} \;\; \textrm{(introducing $\delta_i = s(y_i)(1 - s(y_i))(o_i - t_i)$ as shorthand)} \end{align}\]

Linear component:

\[\begin{align} \frac{\partial \vec{y}}{\partial a_{ij}} &= \frac{\partial (\vec{x}A + \vec{b})}{\partial a_{ij}} \\ \\ &= \frac{\partial (\vec{x}A)}{\partial a_{ij}} \\ &= \frac{\partial}{\partial a_{ij}} \begin{bmatrix}x_1 & x_2 & \ldots & x_N\end{bmatrix} \begin{bmatrix}a_{11} & a_{12} & \ldots & a_{1M} \\ a_{21} & a_{22} & \ldots & a_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \ldots & a_{NM}\end{bmatrix} \\ \\ &= \begin{bmatrix} \frac{\partial}{\partial a_{ij}} \sum_k x_k a_{k1} \;, & \frac{\partial}{\partial a_{ij}} \sum_k x_k a_{k2} \;, & \cdots \;, & \frac{\partial}{\partial a_{ij}} \sum_k x_k a_{kM} \end{bmatrix} \\ \\ &= \begin{bmatrix}0 & \ldots & x_i & \ldots & 0\end{bmatrix}_j \;\; \textrm{($x_i$ in column $j$)} \end{align}\]
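The structure of this row vector is easy to verify numerically: nudging a single entry \(a_{ij}\) changes only \(y_j\), and by \(x_i\) times the nudge. A small sketch with arbitrary values.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 4, 3
x = rng.normal(size=N)
A = rng.normal(size=(N, M))
b = rng.normal(size=M)

# Nudge a single weight a_ij and observe the change in y = xA + b.
i, j, eps = 2, 1, 1e-6
A_pert = A.copy()
A_pert[i, j] += eps

dy = (x @ A_pert + b) - (x @ A + b)
print(dy / eps)   # approximately zero everywhere except x_i in column j
print(x[i])
```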

Together:

\[\begin{align} \frac{\partial \vec{e}}{\partial a_{ij}} &= \frac{\partial \vec{y}}{\partial a_{ij}} \cdot \frac{\partial \vec{o}}{\partial \vec{y}} \cdot \frac{\partial \vec{e}}{\partial \vec{o}} \\ \\ &= \begin{bmatrix}0 & \ldots & x_i & \ldots & 0\end{bmatrix}_j \begin{bmatrix} \delta_1 & 0 & \cdots & 0 \\ 0 & \delta_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \delta_M \end{bmatrix} \\ \\ &= \begin{bmatrix}0 & \ldots & x_i \delta_j & \ldots & 0\end{bmatrix}_j \;\; \textrm{($x_i \delta_j$ in column $j$)} \end{align}\]

Therefore (collapsing the rank-3 tensor \(\frac{\partial \vec{e}}{\partial A}\), one row vector per entry \(a_{ij}\), into a matrix):

\[\begin{align} \frac{\partial \vec{e}}{\partial A} &= \begin{bmatrix} x_1 \delta_1 & x_1 \delta_2 & \cdots & x_1 \delta_M \\ x_2 \delta_1 & x_2 \delta_2 & \cdots & x_2 \delta_M \\ \vdots & \vdots & \ddots & \vdots \\ x_N \delta_1 & x_N \delta_2 & \cdots & x_N \delta_M \end{bmatrix} \\ \\ &= \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\ x_N\end{bmatrix} \begin{bmatrix}\delta_1 & \delta_2 & \cdots & \delta_M\end{bmatrix} \\ \\ &= \vec{x}^T \vec{\delta} \end{align}\]
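The outer-product form can be checked against a finite-difference estimate of a single gradient entry. A minimal NumPy sketch; the dimensions and all concrete values of \(\vec{x}\), \(\vec{t}\), \(A\), \(\vec{b}\) are arbitrary, and the scalar error used for the check is the sum of the components of \(\vec{e}\).

```python
import numpy as np

def s(v):
    return 1.0 / (1.0 + np.exp(-v))

def error(x, A, b, t):
    o = s(x @ A + b)
    return 0.5 * np.sum((t - o) ** 2)

rng = np.random.default_rng(0)
N, M = 4, 3
x, t = rng.normal(size=N), rng.normal(size=M)
A, b = rng.normal(size=(N, M)), rng.normal(size=M)

# Analytic gradient: outer(x, delta) with delta_j = s(y_j)(1 - s(y_j))(o_j - t_j).
y = x @ A + b
o = s(y)
delta = s(y) * (1 - s(y)) * (o - t)
grad_A = np.outer(x, delta)

# Finite-difference check of one entry (i, j) = (1, 2).
eps = 1e-6
A_pert = A.copy()
A_pert[1, 2] += eps
numeric = (error(x, A_pert, b, t) - error(x, A, b, t)) / eps
print(grad_A[1, 2], numeric)   # approximately equal
```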

Bias delta:

\[\frac{\partial \vec{e}}{\partial \vec{b}} = \frac{\partial \vec{y}}{\partial \vec{b}} \cdot \frac{\partial \vec{o}}{\partial \vec{y}} \cdot \frac{\partial \vec{e}}{\partial \vec{o}}\]

Bias component:

\[\begin{align} \frac{\partial \vec{y}}{\partial \vec{b}} &= \frac{\partial (\vec{x}A + \vec{b})}{\partial \vec{b}} \\ \\ &= \frac{\partial \vec{b}}{\partial \vec{b}} \\ \\ &= I_{M \times M} \end{align}\]

Therefore:

\[\begin{align} \frac{\partial \vec{e}}{\partial \vec{b}} &= \frac{\partial \vec{y}}{\partial \vec{b}} \cdot \frac{\partial \vec{o}}{\partial \vec{y}} \cdot \frac{\partial \vec{e}}{\partial \vec{o}} \\ \\ &= I_{M \times M} \begin{bmatrix} \delta_1 & 0 & \cdots & 0 \\ 0 & \delta_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \delta_M \end{bmatrix} \\ \\ &= \begin{bmatrix} \delta_1 & 0 & \cdots & 0 \\ 0 & \delta_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \delta_M \end{bmatrix} \end{align}\]
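Collapsing this diagonal matrix the same way (each bias component \(b_j\) only affects \(e_j\)) gives the bias gradient \(\vec{\delta}\) itself. A minimal NumPy check under the same arbitrary setup as the previous sketch.

```python
import numpy as np

def s(v):
    return 1.0 / (1.0 + np.exp(-v))

def error(x, A, b, t):
    o = s(x @ A + b)
    return 0.5 * np.sum((t - o) ** 2)

rng = np.random.default_rng(0)
N, M = 4, 3
x, t = rng.normal(size=N), rng.normal(size=M)
A, b = rng.normal(size=(N, M)), rng.normal(size=M)

# Bias gradient: delta_j = s(y_j)(1 - s(y_j))(o_j - t_j), one entry per output.
y = x @ A + b
o = s(y)
delta = s(y) * (1 - s(y)) * (o - t)

# Finite-difference check of de/db_1.
eps = 1e-6
b_pert = b.copy()
b_pert[1] += eps
numeric = (error(x, A, b_pert, t) - error(x, A, b, t)) / eps
print(delta[1], numeric)   # approximately equal
```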