Feedforward Networks

Introducing Feedforward Networks

  • Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer
  • These multilayer perceptrons are feedforward neural networks
  • These models are called feedforward because information from x flows through the intermediate computations defined by a function f to produce the output y
  • In other words, information is only fed forward and never fed back
  • Networks that include feedback connections are called recurrent neural networks

Motivating Feedforward Networks

  • We assume that data is created by a data-generating process
  • A data-generating process is the unknown, underlying phenomenon that creates the data
  • This data-generating process is modeled by a true function
  • We'll refer to this true function as f*

Defining Feedforward Networks

  • The goal of a feedforward network is to approximate some unknown function f* with f
  • In this case, f* is the unknown, optimal classifier that maps an input x to a category y
  • In this case, f is a known, approximated classifier that maps an input x to a category y
  • In other words, we estimate f*(x) with f(x; θ)
  • A feedforward network defines a mapping y = f(x; θ) and learns the values of the parameters θ that result in the best function approximation
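
As a minimal sketch of this mapping (the true function f*, the cubic parametric family, and the sample points below are all made-up choices for illustration), we can search for the θ that makes f(x; θ) best match samples of f*:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the unknown data-generating function f* (assumed here).
def f_star(x):
    return np.sin(x)

# A simple parametric family f(x; theta): a cubic polynomial in x.
def f(x, theta):
    return theta[0] + theta[1] * x + theta[2] * x**2 + theta[3] * x**3

# Evaluate f* at some sample points.
x_train = rng.uniform(-3, 3, size=50)
y_train = f_star(x_train)

# Pick the theta that best approximates f* on these points (least squares).
X = np.stack([np.ones_like(x_train), x_train, x_train**2, x_train**3], axis=1)
theta, *_ = np.linalg.lstsq(X, y_train, rcond=None)

print("f(1.0; theta) =", f(1.0, theta), "   f*(1.0) =", f_star(1.0))
```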

Representing Feedforward Networks

  • Feedforward networks are typically represented by composing together many different functions
  • For example, the output values of our feedforward network may be defined as a chain of three functions:
f(x) = f^3(f^2(f^1(x)))
  • Where f^1 is called the first hidden layer
  • Where f^2 is called the second hidden layer
  • Where f^3 is called the output layer
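
A minimal sketch of this composition (the layer widths, random weights, and ReLU nonlinearity below are arbitrary illustrative choices, not part of the definition):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Randomly initialized parameters for three layers.
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)   # first hidden layer
W2, b2 = rng.standard_normal((4, 4)), np.zeros(4)   # second hidden layer
W3, b3 = rng.standard_normal((1, 4)), np.zeros(1)   # output layer

def f1(x): return relu(W1 @ x + b1)   # f^1: first hidden layer
def f2(h): return relu(W2 @ h + b2)   # f^2: second hidden layer
def f3(h): return W3 @ h + b3         # f^3: output layer

x = np.array([0.5, -1.0, 2.0])
y = f3(f2(f1(x)))                     # f(x) = f^3(f^2(f^1(x)))
print(y)
```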

Training Feedforward Networks

  • f models the training data
  • f does not model the data-generating process
  • f* models the data-generating process
  • We want f(x) to match f*(x)
  • The training data provides us with noisy, approximate examples of f*(x) evaluated at different training points
  • Each example x is accompanied by a label y ≈ f*(x)
  • In other words, we hope that each training example x comes with a training label y that is close to f*(x)
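
A small illustration of this setup (the choice of f* and the Gaussian label noise below are assumptions made for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

def f_star(x):                                     # unknown true function (assumed here)
    return np.sin(2 * x)

x_train = rng.uniform(-1, 1, size=100)             # training points
noise = rng.normal(scale=0.1, size=x_train.shape)  # label noise
y_train = f_star(x_train) + noise                  # labels y ≈ f*(x)
```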

Training Layers of Feedforward Networks

  • The training data directly specifies what the output layer must do at each point x
  • The goal of the output layer is to produce a value that is close to y
  • We don't want the output layer to produce a value that is always equal to y, since we would be overfitting the noise
  • The behavior of the other layers (i.e. hidden layers) is not directly specified by the training data
  • The goal of the hidden layers is not to produce a value that is close to y
  • Instead, the goal of the hidden layers is to help the output layer produce a value that is close to y
  • In other words, the learning algorithm must decide how to use these hidden layers to best implement an approximation of f*
  • These layers are called hidden layers because the training data does not show the desired output for each of these layers
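
A rough sketch of this point (the three-layer network, random weights, and squared-error loss below are illustrative choices): the training data supplies a target only for the final output, while the hidden activations h1 and h2 have no targets of their own.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((4, 4)), np.zeros(4)
W3, b3 = rng.standard_normal((1, 4)), np.zeros(1)

x = np.array([0.5, -1.0, 2.0])   # training example
y = np.array([1.0])              # its label from the training data

h1 = relu(W1 @ x + b1)           # hidden layer 1: no target specified for h1
h2 = relu(W2 @ h1 + b2)          # hidden layer 2: no target specified for h2
y_hat = W3 @ h2 + b3             # output layer: compared directly against y

# The loss only measures how far the output layer is from y; the hidden
# layers are shaped indirectly, through their effect on y_hat.
loss = np.sum((y_hat - y) ** 2)
print(loss)
```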

Learning Nonlinear Functions

  • Sometimes y is a nonlinear function of x
  • In this case, we will want to transform x so that y becomes a linear function of the transformed input
  • We will usually want to apply the linear model to a transformed input φ(x), instead of applying a linear model to x itself

    • Here, φ is a nonlinear transformation
  • We can think of φ(x) as a new representation of x (see the sketch after this list)
  • We can choose the mapping φ by using:

    1. A generic feature mapping φ, like the one implicitly used by kernel functions

      • These generic feature mappings are highly general and encode little knowledge about the specific problem
      • As a result, they usually produce poor predictions on a test set
    2. Manually engineered φ functions

      • Until the advent of deep learning, this was the dominant approach
      • It required decades of human effort for each separate task
    3. A learned φ, as used in deep learning

      • The strategy of deep learning is to learn φ:
      y = f(x; θ, w) = φ(x; θ)ᵀw
      • In this approach, we now have the following:

        • Parameters θ that we use to learn φ
        • Parameters w that map from φ(x) to the desired output
      • This is an example of a deep feedforward network
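
As a small illustration of the representation idea above (the target y = x1·x2 and the hand-designed feature map below are made-up examples in the spirit of option 2): y is not a linear function of x, but it is a linear function of φ(x).

```python
import numpy as np

rng = np.random.default_rng(0)

# y = x1 * x2 is nonlinear in x = (x1, x2).
x_train = rng.uniform(-1, 1, size=(200, 2))
y_train = x_train[:, 0] * x_train[:, 1]

def phi(x):
    # Hand-designed nonlinear feature map: phi(x) = (x1, x2, x1 * x2).
    x1, x2 = x[:, 0], x[:, 1]
    return np.stack([x1, x2, x1 * x2], axis=1)

# A linear model applied to phi(x) can now fit the data exactly.
w, *_ = np.linalg.lstsq(phi(x_train), y_train, rcond=None)
print("learned weights:", np.round(w, 3))   # approximately [0, 0, 1]
```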

Feature Mapping using Activation Functions

  • This approach is the only one of the three that gives up on the convexity of the training problem, but the benefits outweigh the harms
  • In this approach, we parametrize the representation as φ(x; θ)
  • And, we use the optimization algorithm to find the θ that corresponds to a good representation
  • If we wish, this approach can capture the benefit of the first approach by being highly generic

    • We do this by using a very broad family φ(x; θ)
  • Deep learning can also capture the benefit of the second approach by providing model customization

    • Human practitioners can encode their knowledge to help generalization by designing families φ(x; θ) that they expect will perform well
    • The advantage is that the human designer only needs to find the right general function family, rather than precisely the right function
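
A minimal sketch of learning φ (the one-hidden-layer architecture, tanh nonlinearity, toy target, and plain gradient descent below are illustrative assumptions, not a prescribed recipe): θ = (W, b) parameterizes the representation φ(x; θ), and w maps φ(x) to the output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: approximate f*(x) = sin(3x) on [-1, 1].
x = rng.uniform(-1, 1, size=(256, 1))
y = np.sin(3 * x)

# theta = (W, b) parameterizes the learned representation phi(x; theta).
W = rng.standard_normal((1, 16)) * 0.5
b = np.zeros(16)
w = rng.standard_normal((16, 1)) * 0.5   # maps phi(x) to the output

lr = 0.1
for step in range(2000):
    # Forward pass: y_hat = phi(x; theta) @ w, with phi = tanh(x W + b).
    phi = np.tanh(x @ W + b)
    y_hat = phi @ w

    err = y_hat - y
    loss = np.mean(err ** 2)

    # Backward pass: manual gradients for this small model.
    n = x.shape[0]
    grad_w = phi.T @ (2 * err / n)
    grad_phi = (2 * err / n) @ w.T
    grad_z = grad_phi * (1 - phi ** 2)   # derivative of tanh
    grad_W = x.T @ grad_z
    grad_b = grad_z.sum(axis=0)

    # Gradient descent on both the representation (W, b) and the output map w.
    w -= lr * grad_w
    W -= lr * grad_W
    b -= lr * grad_b

print("final mean squared error:", loss)
```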

tldr

  • When we want to find nonlinear decision boundaries, we transform x with a nonlinear φ so that y becomes a linear function of φ(x)
  • We will usually want to apply the linear model to a transformed input φ(x), instead of applying a linear model to x itself
  • Feedforward networks (multilayer perceptrons) are neural networks in which the input x flows through a chain of functions f to produce the output y
  • In feedforward networks, information is never fed backward
