Less than two years ago, we announced the release of Akvelon’s AI ArtDive, a web/mobile application that allows you to transform your photos by applying AI-powered filters and effects. The app makes your photos appear to have come straight from a Vincent van Gogh painting, a retro video game, a watercolor masterpiece, or one of many other artistic styles. Our AI ArtDive app is available for free: you can download it from the Apple App Store and the Google Play Store, or you can access the web version here.
Now, we have developed even more tools for manipulating images of people’s faces on a deeper level: beyond changing their gender and age, we can now change the subject’s facial expressions, manipulate their facial features, and more. This will come in handy when one of your fussy family members or friends keeps ruining the shot by refusing to smile for the camera!
Here, we are going to describe our approach to using generative adversarial networks (GANs) and the pixel2style2pixel (PSP) framework.
Here’s what we will cover in this post:
- Introduction to our approach with vector manipulations, with the psp_model and StyleGAN directions repos as our sources
- StyleGAN2 and our examples of StyleGAN2-distillation approach for generating images for latent vector extraction
- Introducing Pixel2style2pixel (PSP)
- Our step-by-step guide to manipulating the pictures
- Comparison between the original images and our results with the default weights and with the super-resolution weights
Our Approach at a Glance
By encoding images to a vector space representation, we can manipulate these objects by adding or subtracting prepared vector directions with an alpha coefficient.
Combining multiple vector directions and different pretrained StyleGAN2 weights can lead to outstanding results.
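The core operation can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code of our pipeline: the `(18, 512)` latent shape is an assumption (it is the usual layout for a 1024×1024 StyleGAN2 generator), the function name `shift_latent` is hypothetical, and the random arrays stand in for a real encoded face and a real precomputed direction.

```python
import numpy as np

def shift_latent(latent, direction, alpha):
    """Move a latent code along a direction; a negative alpha moves the opposite way."""
    return latent + alpha * direction

rng = np.random.default_rng(0)
latent = rng.normal(size=(18, 512))     # placeholder for an encoded face
direction = rng.normal(size=(18, 512))  # placeholder for, e.g., a "smile" direction

more = shift_latent(latent, direction, alpha=1.5)   # add the direction
less = shift_latent(latent, direction, alpha=-1.5)  # subtract the direction
```

The alpha coefficient controls how far the code moves, so small values give subtle edits and large values give exaggerated ones.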
StyleGAN2 is a state-of-the-art generative adversarial network for image synthesis developed by NVIDIA. It can produce high quality images of people’s faces from random latent code and encode target images into vector representation.
The base StyleGAN2 network produces images by sampling random noise and then mapping it to a latent vector representation.
Next, the StyleGAN2 decoder transforms this latent representation into the image. The main disadvantage of this network is that the latent representations come from random noise, so we cannot directly control them. In order to generate the images we need, we had to find a way to manipulate the encoded images.
StyleGAN2 generated images from the official repo.
StyleGAN2-distillation was one of our first attempts to manipulate the latent representation.
StyleGAN2-distillation
This approach still doesn’t allow us to get desired latent representations for specific faces, but it allows us to manipulate random vectors to get desired transformations, like a smile, age, gender, etc. in randomly generated faces. To achieve this, you first need to identify the appropriate latent space directions.
This repo contains everything you need to find the appropriate latent space directions in paired images for yourself. It is based on StyleGAN2 and has Python notebooks and scripts for manual training.
It also contains prepared .npy files with the direction matrices that are injected in later steps, after the pictures have been encoded.
This is an example of a StyleGAN2-distillation approach that is displayed in the repo.
This StyleGAN2-distillation approach to the generation of the images for latent vector extraction can be described as follows:
1. Generate random latent vectors
2. Map them to intermediate latent codes, then generate corresponding image samples from them
3. Get attribute predictions from a pre-trained neural network
4. Filter out images where faces were detected with low confidence
5. Select images with high classification certainty
6. Find the center of every class by summing the latent vectors of all images predicted to belong to that class and dividing the sum by the number of images in the sample. Also, compute the transition vector from one class to another by subtracting the center of the source class from the center of the target class
7. Generate random samples and pass them through a mapping network
8. For the gender swap task, create a set of five images from each sample’s latent code: one from the unmodified code and four from the code shifted as follows:
- plus half of the transition vector
- minus half of the transition vector
- plus transition vector
- minus transition vector
9. For aging/rejuvenation, first predict the face’s attributes, then use corresponding vectors to generate faces that should be two bins older/younger
10. Get predictions for every image in the raw dataset. Filter out by confidence
11. From every set of images, select a pair based on classification results. Each image must belong to the corresponding class with high certainty
12. Train a paired image-to-image translation network
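Steps 6 and 8 above can be sketched with NumPy as follows. This is a minimal sketch on toy data, not the code from the repo: the function names are hypothetical, the 512-dimensional random vectors stand in for real intermediate latent codes, and we assume that the fifth image in each set is the one generated from the unshifted code.

```python
import numpy as np

def class_center(latents):
    """Center of a class: the mean of its latent vectors."""
    return np.mean(latents, axis=0)

def transition_vector(center_from, center_to):
    """Direction that moves a latent code from one class toward another."""
    return center_to - center_from

def five_variants(latent, transition):
    """The unshifted code plus shifts by +/- half and +/- the full transition."""
    steps = (-1.0, -0.5, 0.0, 0.5, 1.0)
    return np.stack([latent + step * transition for step in steps])

rng = np.random.default_rng(0)
class_a = rng.normal(loc=-1.0, size=(100, 512))  # toy latents predicted as class A
class_b = rng.normal(loc=1.0, size=(100, 512))   # toy latents predicted as class B

a_to_b = transition_vector(class_center(class_a), class_center(class_b))
variants = five_variants(class_a[0], a_to_b)     # shape (5, 512)
```

Each row of `variants` would then be passed through the generator to produce one image of the set.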
The most useful and interesting directions we have tested so far were:
More about StyleGAN2 image encoder
One of the StyleGAN2 components is an image encoder. It gives us the ability to project images into latent space. With the latent representation of an image available, we can add and subtract different prepared vectors to manipulate the output image after passing the result to the generator. The encoder works as follows:
- Original pre-trained StyleGAN generator is used for generating images
- Pre-trained VGG16 network is used for transforming a reference image and generated image into high-level features space
- Loss is calculated as a difference between them in the features space
- Optimization is performed only for latent representation which we want to obtain
- Upon completion of optimization, you are able to transform your latent vector as you wish. For example, you can find a “smiling direction” in your latent space, move your latent vector in this direction, and transform it back to an image using the generator
- The original images should face the camera. Results are much worse when the face is tilted
- The source image should be sharp and high quality. Poorly cropped or low-resolution images produce worse results
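The optimization loop described above can be illustrated with a toy example. In this sketch, which is an assumption-laden stand-in and not the real encoder, two fixed random matrices play the roles of the frozen StyleGAN generator and the VGG16 feature extractor, and plain gradient descent updates only the latent vector.

```python
import numpy as np

rng = np.random.default_rng(0)
generator = rng.normal(size=(32, 4))  # toy linear "generator": latent (4,) -> image (32,)
features = rng.normal(size=(64, 32))  # toy linear "feature extractor": image -> features (64,)

true_latent = rng.normal(size=4)
reference_image = generator @ true_latent  # the image we want to project

def feature_loss(latent):
    """Squared distance between generated and reference images in feature space."""
    diff = features @ (generator @ latent) - features @ reference_image
    return float(diff @ diff)

# Only the latent is optimized; the generator and feature extractor stay frozen.
combined = features @ generator
step = 1.0 / (2.0 * np.linalg.norm(combined, 2) ** 2)  # safe step size for this quadratic loss
latent = np.zeros(4)
initial_loss = feature_loss(latent)
for _ in range(2000):
    residual = combined @ latent - features @ reference_image
    latent -= step * 2.0 * combined.T @ residual  # gradient of the squared distance

final_loss = feature_loss(latent)
```

In the toy setting the latent is recovered almost exactly. With a real nonlinear generator the optimization is much harder, which is one reason why tilted or low-quality source images give worse results.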
The PSP framework has all of the tools and endpoints that are needed for running tests. It includes a StyleGAN2 encoder and decoder with different weights for experiments, and a pretrained model for face alignment.
It also supports face frontalization, conditional image synthesis, and super resolution transformation of the images.
The example of StyleGAN2 image encoding displayed in the PSP repo.
This repository also includes an image alignment tool. Images must be processed before they are sent into the neural model.
The PSP framework has several scripts and notebooks for experiments.
Step-by-step Guide to Implementing a Pipeline
In order to implement face morphing algorithms based on vector direction manipulations, we used the available StyleGAN2 encoder and decoder in the Pixel2style2pixel repository. The built-in capabilities and robustness of the PSP framework helped us design and develop a pipeline for face detection and alignment.
Our main pipeline consists of:
- Preparing precalculated latent space directions for each needed morphing style
- Preparing a StyleGAN2 pretrained model from the PSP framework for source image encoding
- Using a pretrained image cropping script from PSP to update the source image for better output quality
- Injecting the target style codes after the encoding process
- Decoding an image
Step 1: Prepare the imports
First, we need to prepare precalculated latent space directions for each morphing style that we are going to apply. You may find the available directions here and use them to generate your own.
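Loading the prepared directions could look like the sketch below. The helper name `load_directions`, the file names, and the `(18, 512)` shape are assumptions for illustration; the demo writes placeholder arrays to a temporary folder so the snippet runs without the real direction files.

```python
import tempfile
from pathlib import Path

import numpy as np

def load_directions(folder):
    """Load every .npy file in `folder` into a dict keyed by file name without extension."""
    return {path.stem: np.load(path) for path in sorted(Path(folder).glob("*.npy"))}

# Demo with placeholder files; real direction files would come from the directions repo.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("smile", "age"):
        np.save(Path(tmp) / f"{name}.npy", np.zeros((18, 512)))
    directions = load_directions(tmp)
```

Keying the dict by file name lets later steps refer to a direction simply as `directions["smile"]`.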
Step 2: Select Experiment Type
In this step, we choose which type of encoder we will use. For this example we chose “ffhq_encode”, but you could also use “celebs_super_resolution”.
Step 3: Download Pretrained Models
Next, we need to get the pretrained StyleGAN2 models from the PSP framework repository. You can find them here.
This part of the code prepares the target models and saves them in the ‘pretrained_models’ folder. In case Google Drive doesn’t work, you can always download the target pretrained models and place them into the folder by yourself.
Next, we will define the available models and their respective URLs. There are two saved StyleGAN2 models that can be used to replicate the result. For testing purposes, we used ffhq_encode.
Then, you could just use wget to download them:
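A minimal sketch of such a download helper is shown here. The registry, the `wget_command` function, and in particular the `example.com` URLs are placeholders and not the real model locations; the real files are hosted on Google Drive and may need a different download command.

```python
import shlex
from pathlib import Path

# Hypothetical registry: the file names and URLs are placeholders, not the real locations.
MODEL_URLS = {
    "ffhq_encode": "https://example.com/psp_ffhq_encode.pt",
    "celebs_super_resolution": "https://example.com/psp_celebs_super_resolution.pt",
}

def wget_command(experiment_type, save_dir="pretrained_models"):
    """Build a wget command that saves the chosen model into save_dir."""
    url = MODEL_URLS[experiment_type]
    target = Path(save_dir) / url.rsplit("/", 1)[-1]
    return f"wget -O {shlex.quote(target.as_posix())} {shlex.quote(url)}"

command = wget_command("ffhq_encode")
```

In a notebook, the resulting string can be run with a shell cell once the placeholder URL has been replaced with a working one.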
Step 3: Define Inference Parameters
Below, we have a dictionary that defines parameters such as the path to the pretrained model to use and the path to the image to perform an inference on.
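Such a dictionary might look like the sketch below. The variable names and all paths are assumptions for illustration, so adjust them to your own folder layout; the hypothetical `get_experiment_args` helper only adds a clearer error message for an unknown experiment type.

```python
# Illustrative values only: these names and paths are assumptions, not fixed by the framework.
EXPERIMENT_DATA_ARGS = {
    "ffhq_encode": {
        "model_path": "pretrained_models/psp_ffhq_encode.pt",
        "image_path": "test_imgs/input_img.jpg",
    },
    "celebs_super_resolution": {
        "model_path": "pretrained_models/psp_celebs_super_resolution.pt",
        "image_path": "test_imgs/input_img.jpg",
    },
}

def get_experiment_args(experiment_type):
    """Return the inference parameters for an experiment type, with a clear error if unknown."""
    if experiment_type not in EXPERIMENT_DATA_ARGS:
        known = ", ".join(sorted(EXPERIMENT_DATA_ARGS))
        raise ValueError(f"Unknown experiment type {experiment_type!r}; expected one of: {known}")
    return EXPERIMENT_DATA_ARGS[experiment_type]

experiment_args = get_experiment_args("ffhq_encode")
```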
Next, we will define the folder that holds the vector directions we want to apply to the target images, along with the main function that iterates over the batch and applies a vector shift to each image.
Step 4: Load Pretrained Model
Here, we select the path to the pretrained and downloaded model, load the target weights, and update the training options.
Then, we switch the model to evaluation mode, and we are good to go!
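The option-updating part of this step can be sketched without loading a real checkpoint. In this illustration, the checkpoint layout (an `opts` dict with a `checkpoint_path` entry), the example option names, and the `prepare_options` helper are all assumptions; in practice the checkpoint would come from `torch.load`, and the model built from these options would be put into evaluation mode with PyTorch's `eval()` method.

```python
from argparse import Namespace

def prepare_options(checkpoint, model_path, defaults=None):
    """Copy the options stored in a checkpoint, point them at model_path, and fill in defaults."""
    opts = dict(checkpoint["opts"])       # copy so the checkpoint itself is not modified
    opts["checkpoint_path"] = model_path  # use the downloaded weights, not the training-time path
    for key, value in (defaults or {}).items():
        opts.setdefault(key, value)       # only fill options that the checkpoint lacks
    return Namespace(**opts)

# Stand-in for a loaded checkpoint; the keys and values here are hypothetical.
fake_checkpoint = {"opts": {"encoder_type": "GradualStyleEncoder", "checkpoint_path": "old/path.pt"}}
opts = prepare_options(
    fake_checkpoint,
    "pretrained_models/psp_ffhq_encode.pt",
    defaults={"learn_in_w": False},
)
```

Wrapping the options in a `Namespace` gives attribute-style access such as `opts.checkpoint_path`.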
Step 5: Visualize Input
For this step, we prepared some images with the faces of people in high resolution and placed them in the ‘test_imgs’ folder. Now we can load them, run the alignment script, crop them, and send them to the loaded network.
Step 6: Align images
In your pipeline, you should include the image alignment for processing the raw image. Run these lines of code in the notebook to get the landmarks needed for alignment:
Next, the function prepares an aligned image:
Then, we’ll read the images in the target folder, append them into the list, and then process them using the function.
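Reading and processing the folder could be sketched as follows. The helper names are hypothetical, and the `align` argument stands in for the real alignment function; the demo uses empty placeholder files and a trivial stand-in so that the snippet runs on its own.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def list_images(folder):
    """Return the image paths in `folder`, sorted by name."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)

def process_folder(folder, align):
    """Apply `align` to each image path and skip images for which it returns None."""
    results = []
    for path in list_images(folder):
        aligned = align(path)
        if aligned is not None:  # e.g. alignment found no face in the image
            results.append(aligned)
    return results

# Demo with empty placeholder files and a stand-in for the real alignment function.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.png", "a.jpg", "notes.txt"):
        (Path(tmp) / name).touch()
    processed = process_folder(tmp, align=lambda path: path.name)
```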
Confirm that the images have been transformed:
As you can see, this script cropped the image to only the subjects’ faces. It also centered the frame around the face and mirrored the borders for better cropping.
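The border mirroring can be illustrated with NumPy's reflect padding. This sketch only shows the cropping idea and is not the alignment script itself: the real script locates the face from detected landmarks, whereas here the center is passed in directly, and `crop_centered` is a hypothetical helper.

```python
import numpy as np

def crop_centered(image, center_row, center_col, half_size):
    """Crop a square around (center_row, center_col), mirroring the image past its borders."""
    pad = ((half_size, half_size), (half_size, half_size), (0, 0))
    padded = np.pad(image, pad, mode="reflect")
    # After padding, original pixel (r, c) sits at (r + half_size, c + half_size),
    # so the crop that is centered on (center_row, center_col) starts at that same index.
    return padded[center_row:center_row + 2 * half_size,
                  center_col:center_col + 2 * half_size]

image = np.arange(8 * 8 * 3).reshape(8, 8, 3)  # toy 8x8 RGB image
crop = crop_centered(image, center_row=1, center_col=1, half_size=3)
```

Because the requested center is close to the top-left corner, part of the crop falls outside the image and is filled with mirrored pixels instead of blank borders.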
Step 7: Perform Inference
The final step defines the output of the network. We pass the transformed images to the network and define the ‘vector_shift’ dict. Its values set how strongly each direction is applied, and their sign sets which way we move the encoded images along that direction.
Feel free to change them and investigate the results.
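A shift dictionary of this kind could be applied to a batch of encoded images as in the sketch below. The function name, the direction names, the coefficient values, and the `(18, 512)` latent shape are assumptions for illustration, and random arrays stand in for real latents and directions.

```python
import numpy as np

def apply_vector_shift(latents, directions, vector_shift):
    """Add each named direction, scaled by its coefficient, to every latent in the batch."""
    shifted = np.array(latents, dtype=float)  # copy so the input batch stays untouched
    for name, coefficient in vector_shift.items():
        shifted += coefficient * directions[name]  # broadcasts over the batch axis
    return shifted

rng = np.random.default_rng(0)
latents = rng.normal(size=(2, 18, 512))  # placeholder batch of two encoded faces
directions = {
    "age": rng.normal(size=(18, 512)),
    "smile": rng.normal(size=(18, 512)),
}
vector_shift = {"age": -2.0, "smile": 1.0}  # negative values move against a direction

result = apply_vector_shift(latents, directions, vector_shift)
```

Setting a coefficient to zero leaves that attribute unchanged, which makes it easy to test one direction at a time.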
Let’s see the result:
We could change the vector shift to “smile” and run again:
Alternatively, you could run inference_playground.ipynb yourself:
The combination of the PSP and StyleGAN2-distillation approaches achieves more accurate and photorealistic manipulation of face attributes than the classic approaches based on CycleGAN and Pix2PixHD that were described in our previous article. It achieves this by working with your images as a set of facial features, instead of just a set of pixels. This way, image defects such as low resolution or poor lighting barely impact the final result, which significantly improves the appearance of the final photo.