Learning XOR

Introducing the XOR Function

  • We can model the XOR function using a feedforward network
  • The XOR function takes in two binary values x_{1} and x_{2}
  • The XOR function returns 1 when exactly one of its binary inputs is equal to 1
  • Otherwise, the function returns 0
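  • As a concrete reference, here is a minimal Python sketch of the XOR truth table (the helper name xor is only illustrative):
```python
# A minimal sketch of the XOR function itself.
def xor(x1: int, x2: int) -> int:
    """Return 1 when exactly one of the two binary inputs is 1, else 0."""
    return 1 if x1 + x2 == 1 else 0

# Enumerate all four binary input pairs (the full domain of XOR).
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor(x1, x2))
```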

Illustrating an XOR Problem

  • The classic example of a linearly inseparable pattern is the logical exclusive-or (XOR) function
  • In such a problem, the two classes of data points cannot be separated by a single line (or hyperplane) in the original input space
  • For example, we may have the following data:

[Figure: XOR data points]

  • These data points correspond to the truth table of an XOR gate
  • A linear model applied directly to the original input cannot implement the XOR function
  • However, a linear model can solve the problem in our transformed space, where the transformed space is represented by the features extracted by a neural network
  • In our sample solution, the two points that must have output 1 have been collapsed into overlapping points in feature space
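  • To make the inseparability claim concrete, here is a minimal sketch that brute-force searches a coarse, arbitrary grid of linear threshold units and finds none that classifies all four XOR points correctly:
```python
import itertools

# The four XOR data points: ((x1, x2), label).
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Try every linear threshold unit y = 1[w1*x1 + w2*x2 + b > 0] over a coarse,
# arbitrary grid of parameters; none classifies all four points correctly.
grid = [i / 2 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
found = any(
    all((w1 * x1 + w2 * x2 + b > 0) == bool(y) for (x1, x2), y in data)
    for w1, w2, b in itertools.product(grid, repeat=3)
)
print("linear separator found:", found)  # False: XOR is not linearly separable
```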

Representing XOR as a Feedforward Network

  • In most cases, the true target function f^{*} is unknown
  • However, we know the target function is the XOR function in this case:
f^{*} : \lbrace 0,1 \rbrace^{2} \to \lbrace 0,1 \rbrace, \quad (x,y) \mapsto f^{*}(x,y) = \begin{cases} 1 &\text{if } x \oplus y = 1 \cr 0 &\text{if } x \oplus y = 0 \end{cases}
  • In this example, there are only four possible inputs:
\chi = \lbrace (0,0), (0,1), (1,0), (1,1) \rbrace
  • Therefore, we will train the network on all four of these points
  • To introduce the concept of loss functions, we'll treat this problem as a regression problem
  • Since we're treating this problem as a regression problem, we can use the mean squared error as our loss function
  • However, in practical applications, MSE is usually not an appropriate loss function for modeling binary data
  • The MSE loss function in our case is the following:
J(\theta) = \frac{1}{4}\sum_{x \in \chi}\left(f^{*}(x)-f(x;\theta)\right)^{2}
  • If we assume our model is a linear model, then we could define the form of our model f(x;\theta) as the following:
f(x;\theta) = f(x;w,b) = x^{T}w + b
  • We can minimize J(\theta) with respect to w and b iteratively using gradient descent, or in closed form using the normal equations
  • Solving the normal equations yields w = 0 and b = \tfrac{1}{2}, so the best linear model predicts 0.5 for every input and cannot fit XOR
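  • As a sketch of the closed-form route (assuming NumPy), the following solves the normal equations for the linear model on the four XOR points and confirms that the best fit predicts 0.5 everywhere:
```python
import numpy as np

# Four XOR inputs with a trailing bias column, and their targets.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

# Normal equations: (X^T X) theta = X^T y, with theta = (w1, w2, b).
theta = np.linalg.solve(X.T @ X, X.T @ y)
w, b = theta[:2], theta[2]

predictions = X @ theta
mse = np.mean((y - predictions) ** 2)

print("w =", w, "b =", b)            # w = [0. 0.], b = 0.5
print("predictions =", predictions)  # 0.5 for every input
print("MSE =", mse)                  # 0.25
```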

Defining the XOR Architecture

  • We may decide to model the XOR function using a three-layer neural network
  • Our network could have two inputs, a hidden layer with two nodes, and an output layer with one node
  • Each layer applies an affine transformation, mapping the input vector x to a hidden vector h and then h to the output y, followed by a nonlinear activation; without the nonlinearity, the composition would still be linear and could not represent XOR
  • We need a matrix of weights w_{1} connecting our inputs x to the hidden layer and a vector of weights w_{2} connecting the hidden layer outputs h to the output:
w_{1} = \begin{bmatrix} 1 & 1 \cr 1 & 1 \end{bmatrix}; \quad w_{2} = \begin{bmatrix} 3 \cr -2 \end{bmatrix}
  • We need a vector of biases b_{1} for our hidden layer and another vector of biases b_{2} for our output layer:
b_{1} = \begin{bmatrix} 0 \cr -1 \end{bmatrix}; \quad b_{2} = \begin{bmatrix} -2 \end{bmatrix}
  • Since we only have four data points, our network's inputs, hidden values, and output for each point look like the following, assuming a unit step activation at each layer (see the sketch after the table):

[Figure: XOR network architecture]

x_{1}   x_{2}   h_{1}   h_{2}   y
0       0       0       0       0
0       1       1       0       1
1       0       1       0       1
1       1       1       1       0
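  • The following sketch reproduces the table above, assuming a unit step activation at both layers, which is one activation choice consistent with the weights and biases given:
```python
import numpy as np

# Weights and biases from above; the unit step activation is an assumption
# chosen because it reproduces the table values.
w1 = np.array([[1, 1],
               [1, 1]])     # input -> hidden weights
b1 = np.array([0, -1])      # hidden biases
w2 = np.array([3, -2])      # hidden -> output weights
b2 = -2                     # output bias

def step(z):
    """Unit step activation: 1 where z > 0, else 0."""
    return (np.asarray(z) > 0).astype(int)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array(x)
    h = step(w1 @ x + b1)   # hidden values h1, h2
    y = step(w2 @ h + b2)   # network output
    print(x, h, int(y))
```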

TL;DR

  • Neural networks can model many types of functions, including ones that no linear model can represent
  • These functions include AND, OR, XOR, etc.
