Layers in Trax

Describing a Serial Layer

  • Combinators are used to compose sub-layers into a single layer
  • In other words, a combinator can represent our entire neural network
  • They can include the following layers as arguments:

    • A ReLU activation layer
    • A LogSoftmax activation layer
    • An Embedding layer
  • Specifically, a Serial combinator uses stack semantics to manage data for its sublayers
  • Each sublayer sees only the inputs it needs and returns only the outputs it has generated
  • The sublayers interact via the data stack, as sketched in the example below
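
As a sketch of how a Serial combinator composes such sublayers (the layer sizes here are illustrative assumptions, not values from these notes):

  from trax import layers as tl

  # A small model expressed as a Serial combinator. Each sublayer pops the
  # inputs it needs from the data stack and pushes its outputs back on.
  model = tl.Serial(
      tl.Embedding(vocab_size=8, d_feature=2),  # map word indices to 2-d vectors
      tl.Mean(axis=1),                          # average embeddings over the sequence
      tl.Dense(n_units=2),                      # fully-connected layer
      tl.LogSoftmax(),                          # log-probabilities over 2 classes
  )

  print(model)  # prints the nested layer structure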

Describing a Dense Layer

  • A dense layer refers to a fully-connected layer in a neural network
  • Dense layers are the prototypical example of a trainable layer
  • That is, a dense layer has trainable weights (and biases)
  • Each node in a dense layer computes a weighted sum of all node values from the preceding layer and adds a bias term:
z^{[i]} = W^{[i]} a^{[i-1]} + b^{[i]}

[Figure: Dense layer example]
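
A minimal sketch of creating and applying a Dense layer in Trax (the input values and n_units are illustrative assumptions):

  import numpy as np
  from trax import layers as tl
  from trax import shapes

  # A dense (fully-connected) layer with 3 output units.
  dense = tl.Dense(n_units=3)

  # Dense has trainable weights W and b, so it must be initialized
  # from an input signature before it can be called.
  x = np.array([1.0, 2.0, 3.0])
  dense.init(shapes.signature(x))

  y = dense(x)    # computes W x + b
  print(y.shape)  # (3,)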

Describing Activation Layers

  • An activation layer computes an element-wise, nonlinear function on the preceding layer’s output
  • Trax follows the current practice of separating the activation function into its own layer
  • Trax supports the following activation functions (and more):

    • ReLU
    • Elu
    • Sigmoid
    • Tanh
    • LogSoftmax
  • In the following example, we use a LogSoftmax layer:

[Figure: LogSoftmax layer example]
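
For instance, a minimal sketch of applying activation layers directly to an array (the input values are illustrative):

  import numpy as np
  from trax import layers as tl

  x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

  # Activation layers have no trainable weights, so they can be applied directly.
  relu = tl.Relu()
  print(relu(x))  # [0. 0. 0. 1. 2.]

  log_softmax = tl.LogSoftmax()
  print(log_softmax(x))  # log-probabilities; exp of these values sums to 1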

Describing an Embedding Layer

  • An embedding layer is a trainable layer
  • Generally, it is used to map discrete data into vectors
  • Typically, this discrete data refers to words in NLP
  • Specifically, it takes the index assigned to each word in your vocabulary and maps it to a vector representation of that word with a fixed dimension:
Vocabulary   Index   embedding_1   embedding_2
i              1       0.020         0.006
am             2      -0.003         0.010
happy          3       0.009         0.010
because        4      -0.011        -0.018
learning       5      -0.040        -0.047
nlp            6      -0.009         0.050
sad            7      -0.044         0.001
not            8       0.011        -0.022
  • Every value in our embeddings is trainable
  • The number of weights in our embedding layer equals the number of words in our vocabulary multiplied by the embedding dimension (e.g., 8 × 2 = 16 for the table above)
  • The embedding layer usually feeds into a Mean layer, which averages the word embeddings across the input to produce a single fixed-size vector (see the sketch below)
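
A minimal sketch of an Embedding layer followed by a Mean layer, using the vocabulary size (8) and embedding dimension (2) from the table above; the token indices are illustrative and 0-based:

  import numpy as np
  from trax import layers as tl
  from trax import shapes

  vocab_size = 8  # 8 words in the vocabulary above
  d_feature = 2   # 2 embedding values per word

  # The embedding layer holds vocab_size * d_feature = 16 trainable weights.
  embed = tl.Embedding(vocab_size=vocab_size, d_feature=d_feature)

  # One tokenized sentence, e.g. "i am happy because learning nlp"
  # (0-based indices, for illustration).
  tokens = np.array([[0, 1, 2, 3, 4, 5]])

  embed.init(shapes.signature(tokens))
  vectors = embed(tokens)     # shape (1, 6, 2): one 2-d vector per token

  # Averaging over the sequence axis yields one fixed-size vector per sentence.
  mean = tl.Mean(axis=1)
  print(mean(vectors).shape)  # (1, 2)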
