Gram Matrices in Neural Style Transfer

Written on 2021-05-10

It has been shown that matching the Gram matrices of feature maps is equivalent to minimizing the Maximum Mean Discrepancy (MMD) with a second-order polynomial kernel. Under this view, the essence of neural style transfer is to generate a new image from white noise by matching its neural activations with those of the content image and its Gram matrices with those of the style image.
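As a quick sanity check of that equivalence, the following NumPy sketch (shapes and variable names are my own) verifies that the unnormalized squared difference between two Gram matrices equals the unnormalized squared MMD over spatial feature vectors with the kernel $k(x, y) = (x^{T} y)^{2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 6                        # N feature maps, M spatial positions each
F = rng.standard_normal((N, M))    # flattened features of one image
S = rng.standard_normal((N, M))    # flattened features of another image

# Gram matrices: inner products between the (vectorized) feature maps.
G = F @ F.T
A = S @ S.T
gram_loss = np.sum((G - A) ** 2)   # ||F F^T - S S^T||_F^2

# Unnormalized squared MMD over the M spatial feature vectors (the columns),
# using the second-order polynomial kernel k(x, y) = (x^T y)^2.
k = lambda X, Y: (X.T @ Y) ** 2    # M x M matrix of pairwise kernel values
mmd2 = k(F, F).sum() + k(S, S).sum() - 2 * k(F, S).sum()

assert np.isclose(gram_loss, mmd2)
```

The identity follows from expanding the Frobenius norm: $\|FF^{T} - SS^{T}\|_{F}^{2}$ produces exactly the three kernel sums above.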

The original algorithm for neural style transfer minimized a cost function that is the sum of a content loss and a style loss. The content loss measures the difference in content between the content image and the generated image, while the style loss measures the difference in style between the style image and the generated image.

The style loss function uses the Gram matrix. Specifically, the style loss is the normalized, squared difference between the Gram matrix of the style image and the Gram matrix of the generated image. The Gram matrix captures which features tend to activate together (i.e., feature correlations), but it discards the specific presence or location of features within an image.

In the original paper on neural style transfer, a new, generated image $x^{*}$ is iteratively created by optimizing a content loss and a style loss, combined by the following formula. Here, the $L$ terms are the individual losses, and $\alpha$ and $\beta$ are the weights for the content and style losses:

$$L_{gen} = \alpha L_{content} + \beta L_{style}$$

Mathematically, we can see that the loss of the generated image is just a weighted combination of the content and style losses. Here, $L_{content}$ is defined by the squared error between the feature maps of a specific layer $l$ for the generated image $x^{*}$ and the content image $x^{c}$:

$$L_{content} = \frac{1}{2} \sum_{i=1}^{N_{l}} \sum_{j=1}^{M_{l}} (F_{ij}^{l} - P_{ij}^{l})^{2}$$
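This content loss is a direct sum of squared differences, so it transcribes almost verbatim into NumPy. A minimal sketch (the function name and array shapes are my own; each layer's feature maps are assumed flattened to an $N_{l} \times M_{l}$ array):

```python
import numpy as np

def content_loss(F, P):
    """Half the sum of squared differences between feature maps.

    F, P: arrays of shape (N_l, M_l) -- the N_l feature maps of layer l
    for the generated and content images, each flattened to M_l values.
    """
    return 0.5 * np.sum((F - P) ** 2)
```

When the generated image's features exactly match the content image's, the loss is zero, which is the fixed point the optimization pushes toward.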

Here, the feature maps of $x^{*}$, $x^{c}$, and $x^{s}$ in the $l^{th}$ layer of a CNN are denoted by $F^{l}$, $P^{l}$, and $S^{l}$, respectively. Thus, the content loss $L_{content}$ measures the squared difference between the feature maps of the generated image and the content image. The style loss $L_{style}$ is defined as a weighted sum of per-layer style losses $L_{style}^{l}$ from different layers:

$$L_{style} = \sum_{l} w_{l} L_{style}^{l}$$
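In code, this outer sum is just a weighted combination over the chosen style layers. A minimal sketch, assuming the per-layer losses have already been computed (the dict-based interface here is my own):

```python
def total_style_loss(layer_losses, layer_weights):
    """Combine per-layer style losses L_style^l with weights w_l.

    layer_losses:  dict mapping layer name -> L_style^l (float)
    layer_weights: dict mapping layer name -> w_l (float)
    """
    return sum(layer_weights[l] * layer_losses[l] for l in layer_losses)
```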

Here, $w_{l}$ is the weight of the loss in layer $l$, and the per-layer style loss $L_{style}^{l}$ is defined by the squared error between the feature correlations expressed by the Gram matrices of the generated image $x^{*}$ and the style image $x^{s}$, where the Gram matrix $G^{l}$ is just the inner product between the vectorized feature maps of the generated image $x^{*}$ in the $l^{th}$ layer:

$$L_{style}^{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i=1}^{N_{l}} \sum_{j=1}^{N_{l}} (G_{ij}^{l} - A_{ij}^{l})^{2}$$

$$G_{ij}^{l} = \sum_{k=1}^{M_{l}} F_{ik}^{l} F_{jk}^{l}$$

$$A_{ij}^{l} = \sum_{k=1}^{M_{l}} S_{ik}^{l} S_{jk}^{l}$$
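Putting the three definitions together, the per-layer style loss can be sketched in NumPy as follows (shapes follow the notation above; the function names are my own):

```python
import numpy as np

def gram(X):
    """Gram matrix of flattened feature maps X with shape (N_l, M_l)."""
    return X @ X.T

def layer_style_loss(F, S):
    """Normalized squared difference between two Gram matrices.

    F: (N_l, M_l) feature maps of the generated image at layer l
    S: (N_l, M_l) feature maps of the style image at layer l
    """
    N_l, M_l = F.shape
    G, A = gram(F), gram(S)
    return np.sum((G - A) ** 2) / (4 * N_l**2 * M_l**2)
```

Note that the spatial index $k$ is summed away inside the Gram matrices, which is precisely why this loss ignores where features occur and matches only their correlations.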

Designed and developed by Darius Kharazi © 2020
