Gated Recurrent Units

Introducing Gated Recurrent Units

  • A GRU is a type of RNN
  • GRUs help preserve information from earlier time steps in the sequence
  • Said another way, GRUs decide how to update the hidden state
  • Vanilla RNNs tend to lose relevant information over long sequences of words
  • GRUs, by contrast, are much better at retaining that information

Defining the Architecture of Gated Recurrent Units

  • In general, a GRU cell consists of 4 components (a minimal code sketch of one cell update follows the figure below):

    • Reset gate: r_{t} = \sigma(W_{r}x_{t} + U_{r}h_{t-1})

      • x_{t} represents the current information
      • h_{t-1} represents the previous information
      • W_{r} represents how much current information is weighted
      • U_{r} represents how much previous information is weighted
    • Update gate: z_{t} = \sigma(W_{z}x_{t} + U_{z}h_{t-1})

      • x_{t} represents the current information
      • h_{t-1} represents the previous information
      • W_{z} represents how much current information is weighted
      • U_{z} represents how much previous information is weighted
    • New memory content: h'_{t} = \tanh(W_{h}x_{t} + r_{t} \circ U_{h}h_{t-1})

      • x_{t} represents the current information
      • h_{t-1} represents the previous information
      • r_{t} represents how much previous information should be forgotten
      • W_{h} represents how much current information is weighted
      • U_{h} represents how much previous information is weighted
    • Final memory content: h_{t} = z_{t} \circ h'_{t} + (1-z_{t}) \circ h_{t-1}

      • z_{t} represents how much the unit updates its information with the current information
      • h_{t-1} represents the previous information
      • h'_{t} represents the current information with some degree of dependency on the previous information

[Figure: a GRU cell]
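
To make these formulas concrete, here is a minimal NumPy sketch of a single GRU cell update (not Trax's implementation): bias terms are omitted and the weight matrices are random, purely for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W_r, U_r, W_z, U_z, W_h, U_h):
    """One GRU step following the four formulas above (biases omitted)."""
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)              # reset gate
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)              # update gate
    h_cand = np.tanh(W_h @ x_t + r_t * (U_h @ h_prev))   # new memory content h'_t
    h_t = z_t * h_cand + (1.0 - z_t) * h_prev            # final memory content
    return h_t

# Toy dimensions: 3-dimensional input, 4-dimensional hidden state
rng = np.random.default_rng(0)
d_in, d_h = 3, 4
x_t = rng.normal(size=d_in)
h_prev = np.zeros(d_h)
W_r, W_z, W_h = (rng.normal(size=(d_h, d_in)) for _ in range(3))
U_r, U_z, U_h = (rng.normal(size=(d_h, d_h)) for _ in range(3))
print(gru_cell(x_t, h_prev, W_r, U_r, W_z, U_z, W_h, U_h))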

Defining the Intuition behind Gated Recurrent Units

  • The final memory content represents the updated information output by the unit
  • The update gate represents how much the unit will update its information with the new memory content

    • Essentially, this is a weighting of how much effect the new memory content will have on the larger system
  • The new memory content represents some balance between the current information and previous information

    • This balance is determined by the reset gate
  • The reset gate represents how much of the previous information should be forgotten

Describing the Formulas in Gated Recurrent Units

  • Note that the formulas for the update and reset gates are structured nearly identically

    • This implies that the difference in behavior between these gates comes from their learned weights
  • The \circ operator represents element-wise multiplication between two vectors
  • If r_{t} \to 0, then the previous information is essentially forgotten (or reset)
  • If z_{t} \to 0, then the final update won't rely on the current information as much as the previous information (see the small numeric example below)
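
As a quick numeric illustration of these limits (the values below are made up), the final memory formula behaves as follows:

import numpy as np

h_prev = np.array([0.5, -0.2, 0.9])    # previous information h_{t-1}
h_cand = np.array([0.1,  0.8, -0.4])   # new memory content h'_t

# z_t -> 0: the final memory keeps the previous information
z_t = np.zeros(3)
print(z_t * h_cand + (1 - z_t) * h_prev)   # [ 0.5 -0.2  0.9] == h_prev

# z_t -> 1: the final memory is replaced by the new memory content
z_t = np.ones(3)
print(z_t * h_cand + (1 - z_t) * h_prev)   # [ 0.1  0.8 -0.4] == h'_t

# Similarly, r_t -> 0 zeroes out the U_h h_{t-1} term inside the tanh,
# so the new memory content ignores (resets) the previous information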

Defining a GRU in Trax

  • ShiftRight: Shifts the tensor to the right by padding on axis 1

    • The mode refers to the context in which the model is being used
    • Possible values are:

      • train (default)
      • eval
      • predict
  • Embedding: Maps discrete tokens to vectors

    • It will have shape: \text{vocabulary length} \times \text{dimension of output vectors}
    • The dimension of output vectors is the number of elements in the word embedding
  • GRU: The GRU layer

    • It leverages another Trax layer called GRUCell
    • The number of GRU units should match the number of elements in the word embedding
    • If you want to stack two consecutive GRU layers, this can be done with a Python list comprehension, as in the code below
  • Dense: Vanilla Dense layer
  • LogSoftMax: Log Softmax function

from trax import layers as tl

mode = 'train'
vocab_size = 256
model_dimension = 512
n_layers = 2

GRU = tl.Serial(
    tl.ShiftRight(mode=mode),  # Remember to pass the mode parameter when using the model for inference/test, since the default is 'train'
    tl.Embedding(vocab_size=vocab_size, d_feature=model_dimension),
    [tl.GRU(n_units=model_dimension) for _ in range(n_layers)],  # Adjust n_layers to stack more GRU layers together
    tl.Dense(n_units=vocab_size),
    tl.LogSoftmax()
)
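
As a rough usage sketch (assuming Trax is installed and the GRU model above has been built), the model can be initialized from an input signature and then applied to a batch of token ids; the token values here are made up for illustration.

import numpy as np
from trax import shapes

# A made-up batch of token ids: batch size 1, sequence length 8
tokens = np.array([[3, 7, 21, 5, 11, 2, 9, 4]], dtype=np.int32)

# Initialize the model's weights from the input signature, then run a forward pass
GRU.init(shapes.signature(tokens))
log_probs = GRU(tokens)
print(log_probs.shape)  # expected: (1, 8, vocab_size) == (1, 8, 256)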
