
APPENDIX 1
On Machine Learning and Deepfakes
This additional section provides a theoretical explanation of the areas of Machine Learning involved in this project. The second part describes the procedure I used to create deepfakes from scratch with the software 'DeepfaceLab'. Given my lack of practical knowledge of Machine Learning, creating deepfakes from zero was a good excuse to learn more about this fascinating world.
This chapter also describes the technological barriers of Synthetic Media that currently prevent the prototypes from finding real-life execution.
Machine Learning
Machine Learning and semi-supervised learning
As we already learned in Theme 01, Synthetic Media leverages Machine Learning to manipulate or generate visual and audio content. Géron defines Machine Learning as "the science (and art) of programming computers so they can learn from data" (Géron, 2019, p. 2).
Since there are many different types of Machine Learning systems, it is useful to classify them into broad categories; here we classify them according to the amount and type of human supervision they receive during training. According to this classification, "there are four major categories: supervised learning, unsupervised learning, semisupervised learning, and reinforcement learning" (Géron, 2019, p. 7). To create deepfakes the way I did, we refer to semisupervised learning. "Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms" (Géron, 2019, p. 13). "In supervised learning, the training set you feed to the algorithm includes the desired solutions, called labels (...) in unsupervised learning, the training data is unlabeled, the system tries to learn without a teacher" (Géron, 2019, pp. 7-10). Semisupervised algorithms fall between the two, as they "deal with data that's partially labeled" (Géron, 2019, p. 13).
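As a toy illustration of the semisupervised idea (my own hedged sketch, not an example from Géron), the following self-training snippet fits a trivial 1-nearest-neighbour "model" on a few labelled points and then pseudo-labels the unlabelled ones, combining supervised and unsupervised data:

```python
# Toy self-training sketch: a 1-nearest-neighbour rule is "fitted" on a
# tiny labelled set, then pseudo-labels the unlabelled points, mixing
# supervised and unsupervised data as semisupervised methods do.
def nearest_label(x, labelled):
    # predict the label of the closest labelled point
    return min(labelled, key=lambda pair: abs(pair[0] - x))[1]

labelled = [(1.0, "cat"), (9.0, "dog")]      # few labelled examples
unlabelled = [1.5, 2.0, 8.0, 8.5]            # plenty of unlabelled data

# pseudo-labelling: trust the model's own predictions on unlabelled data
labelled += [(x, nearest_label(x, labelled)) for x in unlabelled]

print(nearest_label(5.4, labelled))  # → dog (8.0 is now labelled too)
```

The labels, values and the 1-NN rule are all made up for illustration; real semisupervised pipelines use far richer models, but the labelled/unlabelled split is the same.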
​
Deep Learning
Deep learning is a branch of Machine Learning, and lies behind the creation of deepfakes. As explained in the book Deep Learning, "deep learning allows computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined through its relation to simpler concepts. By gathering knowledge from experience, this approach avoids the need for human operators to formally specify all the knowledge that the computer needs. The hierarchy of concepts enables the computer to learn complicated concepts by building them out of simpler ones. If we draw a graph showing how these concepts are built on top of each other, the graph is deep, with many layers. For this reason, we call this approach to AI deep learning" (Goodfellow, et al., 2016, p. 21).
​
The following figure shows how a deep learning system works:
Source: https://ebookcentral.proquest.com/lib/ual/reader.action?docID=6287197#
Autoencoders
Deep learning involves neural network architectures such as autoencoders, which are used in this procedure and determine the model's ability to learn.
An autoencoder is a neural network and the combination of "an encoder function, which converts the input data into a different representation, and a decoder function, which converts the new representation back into the original format. Autoencoders are trained to preserve as much information as possible when an input is run through the encoder and then the decoder, but they are also trained to make the new representation have various nice properties. Different kinds of autoencoders aim to achieve different kinds of properties" (Goodfellow, et al., 2016, p. 24).
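To make the encoder/decoder pairing concrete, here is a minimal one-parameter linear autoencoder in plain Python. This is my own illustrative sketch, not DeepfaceLab code: the "encoder" and "decoder" are single weights, and training drives their product towards 1 so that inputs pass through the bottleneck and come back out reconstructed.

```python
import random

random.seed(0)

def train_autoencoder(data, lr=0.01, epochs=200):
    w_enc, w_dec = 0.5, 0.5              # encoder and decoder weights
    for _ in range(epochs):
        for x in data:
            h = w_enc * x                # encoder: input -> latent code
            x_hat = w_dec * h            # decoder: code -> reconstruction
            err = x_hat - x              # reconstruction error
            # gradient descent on 0.5 * err**2
            w_dec -= lr * err * h
            w_enc -= lr * err * w_dec * x
    return w_enc, w_dec

data = [random.uniform(-1, 1) for _ in range(100)]
w_enc, w_dec = train_autoencoder(data)
print(round(w_dec * (w_enc * 0.8), 2))   # → 0.8 (the input is reconstructed)
```

Real autoencoders use many layers and a latent code smaller than the input, which forces them to learn a compressed representation; this sketch only shows the encode-then-decode training objective.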
GANs
In the case of DeepfaceLab, Generative Adversarial Networks (or GANs) are implemented during training in order to get more detailed faces.
"A GAN is composed of two neural networks: a generator that tries to generate data that looks similar to the training data, and a discriminator that tries to tell real data from fake data. (...) During training, the generator and the discriminator have opposite goals: the discriminator tries to tell fake images from real images, while the generator tries to produce images that look real enough to trick the discriminator" (Géron, 2019, p. 592-593).

Deepfakes
The fun bit
This section gives a brief explanation of the process of creating deepfakes with DeepfaceLab. To begin, at least two videos are needed: the 'destination video', meaning the video onto which we want to swap the face, and the 'source video', from which we extract the face we want to show in the final outcome.
​
Extracting frames and faces
During these first steps, our task is to extract the frames from the source and destination videos, followed by the extraction of the faces from both.
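Under the hood, this step is essentially video-to-frames conversion. As a hedged illustration (DeepfaceLab wraps its own extraction scripts, and the file names here are made up), the equivalent ffmpeg command can be built like this:

```python
# Sketch of the frame-extraction step: split a video into numbered PNG
# stills with ffmpeg. File and folder names are hypothetical.
def ffmpeg_extract_cmd(video, out_dir, fps=None):
    cmd = ["ffmpeg", "-i", video]
    if fps is not None:
        cmd += ["-vf", f"fps={fps}"]     # optionally subsample frames
    cmd.append(f"{out_dir}/%05d.png")    # numbered output stills
    return cmd

print(ffmpeg_extract_cmd("data_src.mp4", "data_src"))
# → ['ffmpeg', '-i', 'data_src.mp4', 'data_src/%05d.png']
```

Face extraction then runs a detector over each still and crops aligned face images, which become the training dataset.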
​
Masking
In step 2, I used the XSeg editor, which lets you mark how the faces should be masked: which parts of the face will be trained on and which will not, as explained in MrDeepfakes.com's guide (MrDeepfakes, 2021, n.p.). "Masks define which area on the face sample is the face itself and what is a background or obstruction. (...) We need to create masks when:
- facial expression changes
- direction/angle of the face changes
- lighting conditions/direction changes" (MrDeepfakes, 2021, n.p.).
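Conceptually, a mask is just a per-pixel map of "train on this" versus "ignore this". The toy sketch below (my own, much cruder than XSeg's polygon-based masks) marks a rectangular face region in a binary grid:

```python
# A mask as a binary grid: 1 = face pixel the model should train on,
# 0 = background or obstruction to ignore.
def make_mask(height, width, top, left, bottom, right):
    return [[1 if top <= r < bottom and left <= c < right else 0
             for c in range(width)]
            for r in range(height)]

mask = make_mask(4, 4, 1, 1, 3, 3)
for row in mask:
    print(row)
# → [0, 0, 0, 0]
#   [0, 1, 1, 0]
#   [0, 1, 1, 0]
#   [0, 0, 0, 0]
```

XSeg generalises this idea: instead of a fixed rectangle, it learns to predict the mask for every frame from the polygons you draw on a handful of examples.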
Training XSeg Model
Time to run the first training. This lasted approximately 8 hours, until I obtained sharp mask edges on most of the faces.
Once the model has been trained enough, we apply the learned masks to the dataset.
​
SAEHD Training
This second training aims at learning what the faces from both the destination and source videos look like from many angles and under different lighting; this is needed in order to swap the faces in the most appropriate and accurate way. Various models can be used; I personally chose the SAEHD model. As with the XSeg training, the SAEHD training takes several hours (or even days, depending on the computer you are using).
Merging - Face Swapping
Time to merge the learned face onto the original frames to create the final outcome.
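Per pixel, merging is essentially alpha blending: the mask decides how much of the learned face replaces the original frame. A minimal sketch of mine (illustrative only, not DeepfaceLab's actual merger):

```python
# Alpha-blend one pixel value: alpha = 1 keeps the swapped face,
# alpha = 0 keeps the original frame, and values in between give the
# soft edges that hide the seam of the swap.
def merge_pixel(original, swapped, alpha):
    return round(alpha * swapped + (1 - alpha) * original)

print(merge_pixel(100, 200, 1.0))   # → 200 (inside the face mask)
print(merge_pixel(100, 200, 0.0))   # → 100 (outside the mask)
print(merge_pixel(100, 200, 0.5))   # → 150 (soft mask edge)
```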
Final edits
After the model is trained enough and the face swapping has been executed, each frame is edited to improve the end result.
...and that's it!
Slow process and imperfect outcomes
Referring to Prototype no. 1: even though deepfake creation is half automated, it still takes time to produce a realistic composite when the intent is to create a video ad. The issue is that we want immediacy in this case: as we walk down the street, we should automatically see our own face on the advert.
Moreover, Synthetic Media relies on data, a lot of data. To create high-quality results, not only must the model and the viewer of the ad look similar, but both starting videos must also be as analogous as possible in lighting, shadows, resolution, and so on.
Without good-quality inputs, the outcome will be distorted, as happened on a few occasions during the Prototyping phase.
Another issue is that the outcome is only viewable at the end of the process: if something goes wrong along the way, you only discover it at the end, and there is no way to go back; you have to start again from step 1.
Naturally, there are also tools that provide faster results, such as mobile apps, which I used in certain instances. However, these work exclusively on images, and the final results often look too approximate.
​
Support of other technologies
Synthetic Media by itself is not enough: to find real-life applications of these concepts, we would need the support of other technologies that implement personalisation at a one-to-one level.
​
Prototypes no. 2 and 3: visual re-elaboration of data
Prototypes no. 2 and 3 base their execution on the re-elaboration of personal information. At the moment, Synthetic Media still offers no features that allow the visual re-elaboration of our personal data.