Image Style Transfer with Denoising Diffusion Probabilistic Models (DDPM)

Andriyan Saputra
Jan 23, 2024

Neural style transfer lets us reproduce a given image in a new artistic style.

For this post I studied fast.ai's new Practical Deep Learning for Coders part 2: From Deep Learning Foundations to Stable Diffusion. In week 20, we learn how to capture the style of one image and combine that style with other images.

Figure 1. Hand-drawn illustration of AI generating artworks (edited from Freepik.com)

Style Transfer

In a deep convolutional neural network (CNN) trained for image classification, the initial layers primarily capture gradients and textures, while deeper layers capture increasingly intricate features. We want to exploit this hierarchy for artistic purposes: being able to choose which kind of feature to use when comparing images has a range of practical applications.

To begin, let's experiment with optimizing an image by comparing its features, taken from a couple of later layers, with those of a target image.

Note: the choice of layers determines which kinds of features matter.
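As a rough illustration (my own sketch, not the course's exact helper code), one way to grab feature maps from chosen layers of a frozen, pretrained VGG16 is with forward hooks; the layer indices here are just examples:

from torchvision.models import vgg16, VGG16_Weights

# Frozen, pretrained VGG16 feature extractor (assumed setup, not the course's exact code)
vgg = vgg16(weights=VGG16_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def get_features(x, layers=(6, 18, 25)):
    feats, hooks = [], []
    for i in layers:
        hooks.append(vgg[i].register_forward_hook(
            lambda module, inp, out: feats.append(out)))
    vgg(x)              # forward pass; the hooks fill `feats`
    for h in hooks:
        h.remove()
    return feats        # one feature map per requested layer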

Our objective is to devise a way of extracting the style of an input image, using the information in the early layers and the textural features they learn. However, a straightforward comparison of the feature maps from these early layers is not feasible, because these maps encode information spatially, and spatial layout is exactly what we do not want to capture when describing style.

Style Loss with Gram matrix

So, we need a way to measure what kinds of style features are present, and ideally which kinds occur together, without worrying about where these features occur in the image.

Enter the Gram matrix. The idea is to measure the correlation between features. Given a feature map with f features over an h x w grid, we flatten out the spatial dimensions and then, for every pair of features, take the dot product of their flattened rows, giving an f x f matrix as the result. Each entry in this matrix quantifies how correlated a pair of features is and how frequently they occur together, which is exactly what we want. In the diagram below, each feature is represented as a colored dot.

Figure 2. Gram matrix calculation
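As a minimal sketch (my own illustration, not the course's exact code), the Gram matrix of a feature map can be computed like this:

def gram_matrix(fmap):
    # fmap: (batch, f, h, w) feature map from some layer of the network
    b, f, h, w = fmap.shape
    flat = fmap.view(b, f, h * w)          # flatten the spatial grid
    gram = flat @ flat.transpose(1, 2)     # (b, f, f) pairwise dot products between features
    return gram / (f * h * w)              # normalise so the scale doesn't depend on image size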

We want to keep the basic structure of the content image while adopting the style of another image. The idea is to produce an image that is simultaneously close to the content image's features (content loss) and to the style image's Gram matrices (style loss). A frozen pre-trained network extracts the features, and we minimize this combined distance through backpropagation, updating the pixels of the generated image.
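To make the two losses used below concrete, here is a hedged sketch of what they might look like, assuming the get_features and gram_matrix helpers sketched above; the real StyleLossToTarget and ContentLossToTarget classes come from the course notebooks and differ in the details (the style layer indices here are assumptions):

import torch.nn.functional as F

class ContentLossToTarget:
    def __init__(self, target_im, target_layers=(6, 18, 25)):
        self.layers = target_layers
        self.targets = [t.detach() for t in get_features(target_im, target_layers)]
    def __call__(self, x):
        feats = get_features(x, self.layers)
        return sum(F.mse_loss(f, t) for f, t in zip(feats, self.targets))

class StyleLossToTarget:
    def __init__(self, target_im, target_layers=(1, 6, 11, 18, 25)):
        self.layers = target_layers
        self.targets = [gram_matrix(f).detach() for f in get_features(target_im, target_layers)]
    def __call__(self, x):
        grams = [gram_matrix(f) for f in get_features(x, self.layers)]
        return sum(F.mse_loss(g, t) for g, t in zip(grams, self.targets))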

Framework

In this section, I will explain the workflow of the process. In short, the idea is quite simple: we combine one base image with 50 different style images, then collect all of the stylised outputs and save them as a GIF.

Figure 3. Workflow of the project, consisting of the base image, the style images, and the resulting style-transferred images
# Helpers such as TensorModel, StyleLossToTarget, ContentLossToTarget,
# Learner, get_dummy_dls, and ImageLogCB come from the course's notebooks.
model = TensorModel(torch.rand_like(content_im))   # optimise the image itself, starting from noise
style_loss = StyleLossToTarget(style_im)
content_loss = ContentLossToTarget(content_im, target_layers=(6, 18, 25))

def combined_loss(x):
    return style_loss(x) * 0.2 + content_loss(x)

learn = Learner(model,
                get_dummy_dls(300),                # 300 optimisation steps
                combined_loss,
                lr=3e-2,
                cbs=cbs,
                opt_func=torch.optim.Adam)

learn.fit(1, cbs=[ImageLogCB(60)])                 # log the image every 60 steps

In this experiment, there are several parameters we could try modifying:
- The layers we focus on (target_layers)
- The content loss: focus on the immediate early layers as well, and start with a random image instead of content_im
- The relative weighting of style_loss and content_loss inside combined_loss(x)
- How long the model trains (get_dummy_dls) and the learning rate (lr)
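To complete the workflow described above, the optimisation is repeated once per style image and the outputs are collected into a GIF. A hypothetical outer loop might look like this (style_paths, load_image, and run_style_transfer are assumed helpers, not the original notebook code):

from PIL import Image

frames = []
for style_path in style_paths:                         # assumed list of batik image paths
    style_im = load_image(style_path)                  # assumed loader returning an image tensor
    result = run_style_transfer(content_im, style_im)  # assumed wrapper around the Learner setup above
    frames.append(Image.fromarray(result))             # result as an (H, W, 3) uint8 array

frames[0].save("style_transfer.gif", save_all=True,
               append_images=frames[1:], duration=500, loop=0)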

Results

Not the best, but certainly not the worst. The result is presented as a GIF so that all the outputs can be shown at once.

Figure 4. The result of style transfer from the base content image and 40 different Indonesian batik styles (picture size: 256 x 256 pixels)

Learning Insights:

  1. Although the loop repeats the same process with the same model parameters for every style image, the resulting loss values differ widely between style images. Each image really needs its own model parameters, because the styles differ so much from one another.
  2. Defects in the fused images are often found in the corners and at the bottom of the base image. The pattern is quite distinctive, because the defects appear in the same locations across different styles.

Further work:

  1. Attempt image style transfer with pixel constraints at specific locations in the image. In other words, ask the model to apply the style only at masked pixel locations in the base image, for example only on the temple building or only at the edges of the image. One possible approach is sketched below.
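As a rough sketch of that idea (my own assumption, not something from the course), the combined loss could be extended with a pixel-space term that pins the image to the content outside a binary mask, so the style is only free to appear inside the mask:

import torch.nn.functional as F

def masked_combined_loss(x, mask):
    # mask: (1, 1, H, W) tensor with 1 where the style may be applied, 0 elsewhere
    keep = F.mse_loss(x * (1 - mask), content_im * (1 - mask))   # hold the unmasked region close to the content
    return style_loss(x) * 0.2 + content_loss(x) + keep * 10.0   # weights are arbitrary starting points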


Closing Remark

This publication is produced for educational and informational purposes only. If there are any mistakes in the data, judgement, or methodology used to produce it, please consider contacting the writer using the contact information in my Profile. I would be glad to discuss and share more about the topic. Thank you.

Best Regards,

Andriyan Saputra
