Defining a Residual Block
- The classical building block in ResNet is the residual block
- ResNet introduces a so-called identity shortcut connection
- This connection skips one or more layers
- We can define a residual block as the following:
  - z[l+1] = W[l+1] a[l] + b[l+1] and a[l+1] = g(z[l+1])
  - z[l+2] = W[l+2] a[l+1] + b[l+2] and a[l+2] = g(z[l+2] + a[l])
- We can simplify the above to look like the following:
  - a[l+2] = g(W[l+2] a[l+1] + b[l+2] + a[l])
- We can visualize the chain of operations as the following:
  - a[l] → linear → relu → a[l+1] → linear → add a[l] → relu → a[l+2]
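The chain of operations above can be sketched as a forward pass in NumPy. This is a minimal illustration, not a full implementation: the helper names (`relu`, `residual_block`) and the layer dimensions are assumptions made for the example.

```python
import numpy as np

def relu(x):
    # g(z): the ReLU activation used after each linear step
    return np.maximum(0.0, x)

def residual_block(a_l, W1, b1, W2, b2):
    """Two-layer residual block: a[l+2] = g(z[l+2] + a[l])."""
    a_l1 = relu(W1 @ a_l + b1)   # a[l+1] = g(z[l+1])
    z_l2 = W2 @ a_l1 + b2        # z[l+2]
    return relu(z_l2 + a_l)      # shortcut adds a[l] before the activation

# Illustrative shapes: a 4-unit block with random weights
rng = np.random.default_rng(0)
n = 4
a_l = rng.standard_normal(n)
W1, b1 = rng.standard_normal((n, n)), np.zeros(n)
W2, b2 = rng.standard_normal((n, n)), np.zeros(n)
out = residual_block(a_l, W1, b1, W2, b2)
print(out.shape)  # (4,)
```

Note that the shortcut is added to z[l+2] before the final activation, not after it, which matches the simplified form above.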
Benefit of ResNet
- Theoretically, the training error should continue to decrease as we increase the number of layers in a plain network
- Realistically, the training error begins to increase as the number of layers reaches a certain point
- This is an issue caused by the vanishing gradient problem
- ResNet is able to avoid this problem
- Specifically, ResNet is able to increase accuracy as the number of layers increases
Why Do ResNets Work?
- Suppose we begin to observe the vanishing gradient problem during training
- In this case, the parameters W[l+2] and b[l+2] shrink toward 0, so z[l+2] vanishes
- Therefore, the ResNet will observe the following: a[l+2] = g(z[l+2] + a[l]) ≈ g(a[l]) = a[l], since a[l] is non-negative after ReLU
- Each layer will still learn something from the identity mapping even in the worst-case scenario
- In other words, accuracy will generally improve even if some layers learn nothing beyond the identity function
- This helps prevent the vanishing gradient problem to some degree
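The worst-case behavior is easy to check numerically. In this sketch (same assumed `relu` and `residual_block` helpers as before), setting the second layer's weights and bias to zero makes the block fall back to the identity on its non-negative input:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(a_l, W1, b1, W2, b2):
    # a[l+2] = g(W[l+2] a[l+1] + b[l+2] + a[l])
    a_l1 = relu(W1 @ a_l + b1)
    return relu(W2 @ a_l1 + b2 + a_l)

n = 4
a_l = np.array([0.5, 1.2, 0.0, 3.1])    # non-negative, as after a ReLU
W1, b1 = np.ones((n, n)), np.zeros(n)   # arbitrary first layer
W2, b2 = np.zeros((n, n)), np.zeros(n)  # "collapsed" second layer: W, b → 0
out = residual_block(a_l, W1, b1, W2, b2)
print(np.allclose(out, a_l))  # True: the block reduces to the identity
```

A plain (shortcut-free) block with collapsed weights would instead output g(0) = 0 and destroy the signal, which is the contrast the argument above relies on.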
tl;dr
- The classical building block in ResNet is the residual block
- ResNet introduces a so-called identity shortcut connection
- This connection skips one or more layers
- Theoretically, the training error should continue to decrease as we increase the number of layers in a plain network
- Realistically, the training error begins to increase as the number of layers reaches a certain point
- This is an issue caused by the vanishing gradient problem
- ResNet is able to avoid this problem
- Specifically, ResNet is able to increase accuracy as the number of layers increases