Many software developers know that image generation is usually performed with generative adversarial networks (GANs) or diffusion models. Although both approaches are publicly known, they can hardly be called “affordable”, because training such models demands substantial computing resources. Even more time and resources are required if you decide to create images in an entirely new style, since in this case the model has to be trained from scratch.
For example, StyleGAN took approximately one week to train on an NVIDIA DGX-1 with 8 Tesla V100 GPUs.
However, is it possible to play around with image generation if you do not have powerful hardware or that much time? Sure it is! In this article, we describe an image generation approach that can switch between different image styles without additional model training or expensive hardware.
Our main goal is to create a generalized embedding extractor that can be used to compare parts of a face image with parts of an emoji. With such an extractor, the image generator works like a constructor that assembles images in various styles from pre-made parts. In this article, we consider two approaches to creating an embedding extractor, both of which you can see in the “Research” section. But first, we have to prepare a dataset for the generator and for training the embedding extractor.
First and foremost, we need a dataset of avatar parts that can be combined with one another. Such a dataset can be built in many ways. For example, we could create every single part manually, but this approach is too slow and inflexible. A much better approach is to create several templates for each part and combine them with one another.
For example, if we create five types of eyes, five types of mouths, and five face shapes, we get 5 × 5 × 5 = 125 different possible emojis.
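The arithmetic behind this count can be checked with a few lines of Python. This is only a minimal sketch: the template names below are hypothetical placeholders, not the actual SVG templates used in the project.

```python
from itertools import product

# Hypothetical template names: five variants for each avatar part.
eyes = [f"eyes_{i}" for i in range(1, 6)]
mouths = [f"mouth_{i}" for i in range(1, 6)]
face_shapes = [f"face_{i}" for i in range(1, 6)]

# Every emoji is one choice of eyes, one mouth, and one face shape.
emojis = list(product(eyes, mouths, face_shapes))

print(len(emojis))  # 5 * 5 * 5 = 125
print(emojis[0])    # ('eyes_1', 'mouth_1', 'face_1')
```

Adding one more template to any part multiplies the number of possible emojis rather than adding to it, which is why templates scale much better than drawing each avatar by hand.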
We have implemented this approach by using SVG files as the base for creating templates and a Python script to generate avatar parts.
After generating the parts, we extract an embedding from each of them and save the results as “avatar_part – embedding” pairs. Because a new art style only requires a new set of such pairs, this approach allows us to quickly switch between different art styles without additional model training.
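One possible way to store such pairs is sketched below. The JSON layout, the function names, the style names, and the toy three-dimensional embeddings are all assumptions made for illustration; the article does not describe the actual storage format.

```python
import json
import tempfile
from pathlib import Path


def save_embeddings(path, pairs):
    """Write {avatar_part_name: embedding} pairs to a JSON file."""
    Path(path).write_text(json.dumps(pairs))


def load_embeddings(path):
    """Read the pairs back as a dict of part name -> list of floats."""
    return json.loads(Path(path).read_text())


# Toy 3-dimensional embeddings for two hypothetical art styles.
styles = {
    "flat": {"mouth_1": [0.9, 0.1, 0.0], "mouth_2": [0.1, 0.8, 0.3]},
    "sketch": {"mouth_1": [0.7, 0.2, 0.1], "mouth_2": [0.0, 0.9, 0.4]},
}

with tempfile.TemporaryDirectory() as tmp:
    # One file per style, computed once after the parts are generated.
    for style, pairs in styles.items():
        save_embeddings(Path(tmp) / f"{style}.json", pairs)

    # Switching style means loading a different file; no model is retrained.
    active = load_embeddings(Path(tmp) / "sketch.json")

print(sorted(active))  # ['mouth_1', 'mouth_2']
```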
In our case, embeddings can be created in two different ways: with a pre-trained ResNet50 or with custom autoencoders. ResNet50 doesn’t require any additional training, but the autoencoders do, so we had to find appropriate training data. We decided to use the CelebA dataset, since it satisfies our needs and allows us to work without collecting data from scratch. We wrote a script that divides each face into segments, using BiSeNet for face segmentation, and saves the segments to separate folders. As a result, we have a mouth dataset, an eyes dataset, and so on, and these datasets are used to train the autoencoders.
Our architecture can be described as one input layer, one output layer, and three hidden layers. The input layer takes an image, which must be normalized (meaning it must be 306×306 pixels). The three hidden layers work as follows:
- In the first hidden layer, we divide the normalized face into segments via BiSeNet. The only preparation needed is to bring the image into the form expected by the input layer of this neural network
- In the second hidden layer, we have an embedding extractor model, which gives us the embeddings to compare. In our case, this is either ResNet50 or custom autoencoders
- In the third hidden layer, we compare each output of the second hidden layer with each possible emoji part of the same type. The comparison is implemented via cosine similarity
The output of the third hidden layer is the set of emoji parts with the maximum cosine similarity to the corresponding face parts. Finally, the output layer is a constructor that simply combines all of these emoji parts.
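The matching step performed by the third hidden layer can be sketched in plain Python. The part names and the toy three-dimensional embeddings are hypothetical; real embeddings would come from the embedding extractor and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def best_match(face_embedding, avatar_parts):
    """Name of the avatar part whose embedding is closest to the face segment."""
    return max(
        avatar_parts,
        key=lambda name: cosine_similarity(face_embedding, avatar_parts[name]),
    )


def build_avatar(face_embeddings, avatar_embeddings):
    """Pick the best avatar part for every face segment type."""
    return {
        part_type: best_match(embedding, avatar_embeddings[part_type])
        for part_type, embedding in face_embeddings.items()
    }


# Hypothetical precomputed avatar embeddings, grouped by part type.
avatar_embeddings = {
    "eyes": {"eyes_round": [1.0, 0.0, 0.2], "eyes_narrow": [0.0, 1.0, 0.1]},
    "mouth": {"mouth_smile": [0.9, 0.3, 0.0], "mouth_neutral": [0.1, 0.2, 1.0]},
}
# Hypothetical embeddings of the segments cut from one face.
face_embeddings = {"eyes": [0.1, 0.9, 0.0], "mouth": [0.8, 0.4, 0.1]}

avatar = build_avatar(face_embeddings, avatar_embeddings)
print(avatar)  # {'eyes': 'eyes_narrow', 'mouth': 'mouth_smile'}
```

Each face segment is compared only with avatar parts of the same type, so eyes are never matched against mouths.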
The same architecture can be summarized as three components:
- A segmentation model that divides a selfie into face segments. In our case, the segmentation model works together with a face cropping model
- An embedding model that maps each face segment to the embedding space. As we said earlier, we used ResNet50 and autoencoders
- Cosine similarity, which compares each face embedding with all avatar embeddings of the same type
Approaches to embedding extraction
Although our model is composed of several neural networks, the architecture is quite simple. As we wrote above, it finds the avatar parts that are most similar to the face segments and then combines them. The embeddings are compared with one another via cosine similarity. The main question is:
How exactly can we get those embeddings so that the comparison makes sense?
Two approaches can be used here. The first is to take features from the penultimate layer of ResNet50, and the second is to use custom autoencoders with a linear layer. For the first approach, we took a pre-trained ResNet50 and simply removed its last layer to obtain an embedding space. The second approach is an unsupervised solution, which we implemented via autoencoders. In this case, the embedding space is the compressed linear layer of the autoencoder, and its output is what we use for image comparison.
For research purposes, we wrote an embedding visualization script that takes a video file as input and returns a video file with embedding graphs as output. For every frame, the script extracts the embeddings and plots them for each avatar part and face segment. Pink denotes the embedding of the face, and purple denotes the embedding of the avatar.
ResNet50 as an embedding extractor
First of all, we implemented the comparison via a pre-trained ResNet50. Have a look at the results below:
As you can see, the test subject stands calmly and does nothing, yet the embeddings change every frame. The avatar can’t be generated clearly, because the embeddings are not much different from one another and have too many dimensions. Even when the emotion of the test subject changes (for example, a smile appears on their face), the embeddings do not change in any noticeable way (see below).
Autoencoders as an embedding extractor
What about autoencoders? We used the same architecture for all of them. Take a look at it below:
Here is the code of the autoencoder:
The main idea behind using autoencoders is that similar inputs should produce similar embeddings. As you can see, face segments from the CelebA dataset and face parts from the avatar dataset can be represented as embedding graphs, and these graphs have some similarities.
The graphs of mouth embeddings are presented below:
Now, let us look at the results of these models:
As you can see, with the autoencoder approach the embeddings are more reliable, and small disturbances in the input only slightly affect the final result, unlike in the previous section, where we used ResNet50 as an embedding extractor.
We have compared two approaches to embedding extraction for image generation, using ResNet50 and using autoencoders, and found that the second approach shows better results. A fast and unified model that assembles an emoji from pre-made templates without additional model training is an excellent alternative for controllable generation of avatars from faces. At this moment, we have a model that can switch between different styles without additional training, but the embedding extractor (the second hidden layer) is still being improved by our team.
This article was written by Dmitriy Drogovoz, a Software Development Engineer at Akvelon’s Georgia office.