Synthetic media, at its most basic, describes media that is either algorithmically created or modified. Text, images, and even videos can be created by software with such precision and authenticity that viewers often cannot tell, even upon close inspection, that the content is synthetic. Advancements in artificial intelligence and cloud computing have made audio, video, and image manipulation techniques more sophisticated. Many techniques and tools for developing synthetic media are widely available.
Part of these advancements is the introduction of transfer learning in machine learning and artificial intelligence models. Traditionally, the pre-training required to produce models capable of synthetic media generation is expensive, takes weeks or months, and requires access to costly GPU clusters. With transfer learning, the time and effort involved are reduced. In transfer learning, a user starts from a large general model pre-trained for an initial task and trains it further on a different, smaller dataset so that it can excel at a subsequent, related task.
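The idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how production models are fine-tuned: the frozen weights, the toy dataset, and the function names (`extract_features`, `predict`, `fine_tune`) are all hypothetical, and a single fixed layer stands in for a large pre-trained model. Only the small task-specific "head" is trained.

```python
import math

# Hypothetical "pre-trained" layer: these weights stand in for what a large
# model learned on its original task. They stay frozen during fine-tuning.
FROZEN_WEIGHTS = [[1.0, -1.0], [-1.0, 1.0]]

def extract_features(x):
    """Frozen feature extractor: a linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]

def predict(head, x):
    """New task head: logistic regression on top of the frozen features."""
    weights, bias = head
    z = sum(w * f for w, f in zip(weights, extract_features(x))) + bias
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune(data, epochs=200, lr=0.5):
    """Train only the head on the small dataset; the extractor is never updated."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            error = predict((weights, bias), x) - y  # gradient of log loss w.r.t. z
            feats = extract_features(x)
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias

# Small dataset for the new task: the label is 1 when the first input is larger.
small_dataset = [([2.0, 0.0], 1), ([0.0, 2.0], 0), ([3.0, 1.0], 1), ([1.0, 3.0], 0)]
head = fine_tune(small_dataset)
```

Calling `predict(head, [2.0, 0.0])` afterwards returns a probability close to 1, while `FROZEN_WEIGHTS` is left exactly as it started. That is the point of the technique: the expensive part is reused rather than retrained.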
The advances in synthetic media and its related technologies have triggered concerns about their potential for harm. With the wider availability of tools and techniques for synthetic media, users are capable of creating a copy of a public person's voice or superimposing one person's face on another's. The concerns include the potential spread of disinformation and the possibility of fraud and financial extortion. These tools can take a well-known image or scene and manipulate its elements so that individuals appear to say things they never said; the results are often colloquially called deepfakes. On the positive side, synthetic media allows content to be translated and delivered in multiple languages around the globe, which can translate into more readers, more viewers, and more engagement. A host or avatar can also be customized to look or act in a manner familiar to an audience and so present as more believable.
The deepfake, a portmanteau of "deep learning" and "fake," is probably the most common form of synthetic media. This media often takes an existing image or video and replaces a person with someone else's likeness, often combining and superimposing existing media. The process relies on common synthetic media tools, including artificial neural networks and machine learning techniques known as autoencoders and generative adversarial networks (GANs). The popularity of deepfakes has also led to an increase in mobile applications, such as Impressions, launched for iOS in March 2020, in which users can deepfake celebrity faces into videos.
The term deepfake originated in 2017 from a Reddit user named "deepfakes," known in the Reddit community r/deepfakes, where deepfakes were created and shared. Much of the shared media involved celebrity faces swapped onto the bodies of actresses in the adult industry, or the faces of famous actors swapped into movies they never performed in. In February 2018, Reddit banned r/deepfakes for sharing involuntary adult content. In response to the ban, r/SFWdeepfakes was created specifically to share videos made for entertainment, parody, and satire.
Synthetic video can include the development of filters on videos, such as those seen on the social media platforms Snapchat and TikTok, but also extends to machine video generation. Synthetic video systems can now produce photorealistic video, including video generated from a plain text description and realistic full-body imaging and video content. The tools involved in synthetic video, much like the tools used for synthetic media as a whole, have become more available, which has lowered the barrier for content creators to make a professional video and even given them the ability to upscale video quality and change faces in the video. These technologies have also been used to translate the language spoken in a video and match the lip movements to the new spoken language to present a more seamless translation. Some developers are working to animate inanimate faces, photos, clothes, and motions.
Synthetic video technology has also expanded to systems that, shown a single photo, can create video of the events that would have followed it. Google has developed a machine learning system that can "hallucinate" clips, creating the video content that would come in the middle of a sequence of frames, often given only a start and an end frame.
Synthetic image generation is the creation of images using artificial intelligence, machine learning, and generative adversarial networks (GANs) to manipulate existing images or create entirely synthetic images. OpenAI's GPT-3 has a 12-billion parameter version called DALL-E, which has been trained to generate images from text descriptions using a dataset of text-image pairs. This system is capable of creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.
Synthetic images have also been developed for use in healthcare, where input images are difficult to get. For example, compressive sensing can be leveraged to reconstruct images from low-dose CT scans. Traditionally, these images are built with filtered backprojection, whereas synthetic imaging can create 3D images and training images. Synthetic imaging can be used in medical imaging systems to better delineate images and shapes and to find regular and irregular features of interest.
Synthetic audio is generated through machine learning and artificial intelligence systems that are capable of producing any conceivable sound. This can be achieved through audio waveform manipulation and could be used to generate stock audio of sound effects or to simulate audio of currently imaginary things. This waveform manipulation can also extend to synthetic music, which would allow individuals to create music using artificial intelligence and mix the synthetic music with recorded media, which could be mastered using less technology than traditionally required. In 2016, Google's DeepMind unveiled WaveNet, a deep generative model of raw audio waveforms that could learn which waveforms best resemble human speech as well as musical instruments. This proliferation of audio technologies has also led to voice cloning and text-to-speech, which allow users to clone their voices for creating digital avatars.
Synthetic text generation uses artificial intelligence and deep learning techniques to write stories, prose, and poems and to create abstracts. It can be achieved using recurrent neural networks (RNNs) and GANs, and it is a process of producing meaningful phrases and sentences in the form of natural language. In essence, synthetic text systems automatically generate narratives that describe, summarize, or explain structured input data in a human-like manner at the speed of thousands of pages per second. The uses for synthetic text include automated journalism, automated coding, and automated storylines. OpenAI's GPT-3 can generate many kinds of text, including guitar tabs or computer code.
Natural language generation systems cannot read. Rather, as part of a larger system for synthetic text, a natural language processing system reads human language and turns it into structured data understandable to a natural language understanding engine, which in turn passes its understanding on to the natural language generation system. These systems rely on a number of algorithms.
Natural language generation algorithms
Long short-term memory (LSTM)
A variant of the RNN, the LSTM was introduced to address the problem of long-range dependencies. An LSTM consists of four parts: the cell, the input gate, the output gate, and the forget gate. These allow the network to remember or forget words over arbitrary time intervals by adjusting the flow of information through the cell.
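A single time step of such a unit can be sketched as follows, using the standard names for its parts (the forget, input, and output gates, plus a candidate value written into the cell). This is a simplified scalar version for illustration only; the `lstm_step` function, the `params` layout, and the example weights are assumptions, and a real LSTM uses learned weight matrices over vectors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for scalar input, hidden state, and cell state.

    params maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre_activation("input"))         # how much new information to write
    o = sigmoid(pre_activation("output"))        # how much of the cell state to expose
    g = math.tanh(pre_activation("candidate"))   # candidate value to write

    c = f * c_prev + i * g    # updated cell state
    h = o * math.tanh(c)      # updated hidden state
    return h, c

# Hypothetical hand-picked weights: the forget gate is held open and the
# input gate is held shut, so the cell simply remembers its previous state.
remember = {
    "forget": (0.0, 0.0, 100.0),
    "input": (0.0, 0.0, -100.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (0.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.8, params=remember)
```

With these settings the new cell state `c` stays at the previous value of 0.8 regardless of the input. Flipping the sign of the forget bias would instead wipe the cell, which is how the gates let the network keep or discard information over long spans.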
Recurrent neural network (RNN)
These are neural network models that try to mimic the operation of the human brain. The RNNs pass each item of the sequence through a feedforward network and use the output of the model as input to the next item in the sequence. In each iteration, the model stores the previous words encountered in its memory and calculates the probability of the next word.
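The recurrence described above can be sketched in plain Python. This is an untrained toy model under stated assumptions: the weight matrices `W_xh`, `W_hh`, and `W_hy`, the three-word vocabulary, and the function names are hypothetical, and words are represented as one-hot vectors. It shows only how the hidden state carries earlier words forward and how a next-word distribution is read out.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_step(x, h_prev, W_xh, W_hh, W_hy):
    """One recurrent step: mix the current input with the previous hidden state."""
    pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h_prev))]
    h = [math.tanh(p) for p in pre]
    probs = softmax(matvec(W_hy, h))  # probability distribution over the next word
    return h, probs

def run_rnn(word_ids, W_xh, W_hh, W_hy):
    """Feed a sequence of word ids through the network, one item at a time."""
    vocab_size, hidden_size = len(W_xh[0]), len(W_hh)
    h = [0.0] * hidden_size
    probs = None
    for word_id in word_ids:
        x = [1.0 if k == word_id else 0.0 for k in range(vocab_size)]
        h, probs = rnn_step(x, h, W_xh, W_hh, W_hy)
    return h, probs

# Hypothetical, untrained weights: 3-word vocabulary, 2 hidden units.
W_xh = [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]]
W_hh = [[0.1, 0.7], [-0.5, 0.2]]
W_hy = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.6]]

h_a, probs_a = run_rnn([0, 1], W_xh, W_hh, W_hy)
h_b, probs_b = run_rnn([1, 0], W_xh, W_hh, W_hy)
```

The two runs see the same words in a different order and end with different hidden states, which illustrates that the state depends on the whole sequence so far and not only on the current word.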
The Markov chain
One of the first algorithms used for language generation, the Markov chain predicts the next word in a sentence by using the current word and considering the relationship between each unique word to calculate the probability of the next word.
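A first-order (word-to-word) version of this approach can be sketched as below. The tiny corpus and the function names `build_chain`, `next_word_probabilities`, and `generate` are illustrative assumptions; a practical system would be built from a far larger text collection.

```python
import random
from collections import Counter, defaultdict

def build_chain(text):
    """Count how often each word is followed by each other word."""
    words = text.split()
    chain = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        chain[current][following] += 1
    return chain

def next_word_probabilities(chain, word):
    """Turn the follow-up counts for one word into probabilities."""
    counts = chain[word]
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def generate(chain, start, length, seed=0):
    """Sample a word sequence, choosing each word from the current word's followers."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        counts = chain.get(words[-1])
        if not counts:  # no known follower: stop early
            break
        candidates = list(counts)
        weights = [counts[w] for w in candidates]
        words.append(rng.choices(candidates, weights=weights)[0])
    return " ".join(words)

corpus = "the cat sat on the mat and the cat ran"
chain = build_chain(corpus)
probs = next_word_probabilities(chain, "the")
sentence = generate(chain, "the", 5)
```

In this corpus "the" is followed by "cat" twice and "mat" once, so `probs` assigns "cat" a probability of 2/3 and "mat" 1/3. Because only the current word is considered, the model has no memory of anything earlier in the sentence.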
Transformer
Introduced in 2017 in a Google paper, the transformer is built around a method called the "self-attention mechanism". It consists of a stack of encoders for processing inputs of any length and a stack of decoders that output the generated sentences. The transformer uses the representation of all words in context without having to compress all the information into a single fixed-length representation, which allows the system to handle longer sentences without computational requirements skyrocketing.
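The core of self-attention, scaled dot-product attention, can be sketched in plain Python. This is a simplified illustration under stated assumptions: the `attention` function name and the toy token vectors are hypothetical, and the learned query, key, and value projections of a real transformer are replaced here by using the token vectors directly.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: every position looks at every position."""
    d_k = len(keys[0])
    width = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to each key, scaled by sqrt of the key size.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is a weighted average of all value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)])
    return outputs

# Three toy token vectors; in self-attention the same sequence supplies
# the queries, the keys, and the values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = attention(tokens, tokens, tokens)
```

Each output vector is a weighted mix of every token in the sequence, so each position has direct access to the full context instead of relying on one compressed summary vector.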
Synthetic avatar refers to the creation of avatars using artificial intelligence, machine learning, and computer vision. The technology enables the creation of lifelike visual personas for digital use, either as representations or as embodied AI. These avatars can be customized, can mimic human expressions and mannerisms, and can be a digital clone of a person or viewed as a different being. An avatar can be built from a photograph used to model a face, including several facial features. The user can then change the shape and features of the face. In addition, the avatar can be capable of expressing diverse facial features. Many types of this modelling can also be done in an application running on a smartphone.
The development of synthetic avatars has led to virtual influencer avatars, such as Lil Miquela, a 3D avatar and virtual Instagram influencer created by a team of virtual effects artists from Brud, which has been used in modeling for brands like Calvin Klein and L'Oreal. The avatar has also appeared in videos with Bella Hadid and J Balvin, while also starring in its own videos.
Interactive media, related to avatar synthesis, is artificial intelligence-generated media that can be used to develop hybrid graphics systems for video games, movies, and virtual reality. Nvidia has published research showing AI-generated visuals combined with a traditional video game engine. The results were not photorealistic and displayed the visual smearing found in a lot of AI-generated imagery. But the company's engineers built upon existing methods, including the open-source GAN system pix2pix, to which they added innovations of their own. These developments led to AI-generated video game demos, such as a model that can generate an interactive game based on non-interactive videos. Through procedural generation, synthetic media techniques may eventually be used to help designers and developers create art assets, design levels, and even build entire games from the ground up.
In addition, text-based games, such as AI Dungeon 2, can use either GPT-2 or GPT-3 to allow for near-infinite possibilities that would otherwise be impossible to create through traditional game development methods.
With the rise of synthetic media have come companies working to protect consumers from related threats, such as identity theft or biometric data collection, and especially from deepfakes, which can result in misinformation or other harm. Consumer protection companies are developing tools that automate the detection and deletion of content considered to be harmful, and are working to authenticate photos. The concerns about deepfakes and synthetic media extend to the possibility of weaponization by governments, activist groups, and individuals, and to the inclusion of this misinformation in traditional information channels that could present synthetic media as genuine.
An increase in the use of deepfakes as part of social engineering attacks is expected, as is the inclusion of harm and misuse mitigation by AI providers in their software. Meanwhile, D-ID has developed face anonymization solutions to help protect identities on video. These have been used by documentary film producers who need to protect the identity of whistle-blowers, victims of sexual assault, and children. There is concern that synthetic media could be used for financial fraud, especially by creating or impersonating identities in order to receive credit facilities and make purchases. The expectation is that machine learning systems combined with multilayer authentication techniques and redundant technologies can help validate a user's identity and reduce, if not eliminate, fraudulent activity.
Understood in its simplest form, synthetic media is not a new development. Media has been manipulated for as long as there has been media. In the 1930s, the chief of the Soviet secret police was photographed walking alongside Joseph Stalin. He was retouched out of the official press photo after he was arrested and executed during the Great Purge. This type of simple manipulation has since become more prominent with the advent of programs like Adobe's Photoshop, which make retouching photos an easier and faster process.
The idea of synthetic media is also closely linked with automation, especially since much synthetic media is automatically generated by artificial intelligence and machine learning systems. The idea of automatically generated art and entertainment can be traced back to the automata of ancient Greek civilization, but it also includes ideas and devices found throughout history in Europe, China, and India. These include devices such as Johann Philipp Kirnberger's "Musikalisches Würfelspiel", or musical dice game. None of these were truly capable of generating original content, and all were dependent on their designs. It is not until the rise of artificial intelligence that the understanding of synthetic media as partially, if not fully, self-developing emerges.
Synthetic media is also tied to synthetic data. The use of smartphones, the dependence on CCTV, and the increased prevalence of Ring-style cameras in homes and of employee monitoring devices have increased the number of photos available. This is in addition to the development of large-scale databases, including the Face Recognition Technology (FERET) database by DARPA in the mid-1990s and Labeled Faces in the Wild (LFW), released in 2007. The latter database included images downloaded directly from Google, Flickr, Yahoo, and Facebook's database in 2014, which were used to train deep learning models. These sources collected information from millions of individuals, without consent, and operated below the radar of impending legislation. This has led to systems intent on more pervasive recognition, tracking, and prediction, which have proved to be harmful to individuals and groups.
In 2014, Ian Goodfellow and his colleagues developed a new class of machine learning systems: generative adversarial networks (GANs). In a GAN, two neural networks contest with each other in a game. Through this contest, the GAN gradually improves an image's quality in an ever-escalating race in which the two networks try to outwit each other. The process begins when a "generator" network creates a synthetic image that looks like it belongs to a particular set of images. That initial attempt might be crude. The generator then passes its effort to a "discriminator" network that tries to see through the deception. The generator takes that feedback, tries to learn from its mistakes, and adjusts its connections to do better on the next cycle. But so does the discriminator; on they go, until the generator's output has improved to the point where the discriminator is baffled.
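The two competing objectives can be sketched with plain Python. The snippet below computes only the losses, taking the discriminator's output probabilities as given; it does not train any networks. The function names and the example scores are illustrative assumptions, and the generator loss shown is the commonly used "non-saturating" form.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Low when real samples score near 1 and generated samples score near 0."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Low when the discriminator scores generated samples as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator scores (probability that a sample is real).
early_real, early_fake = [0.9, 0.8], [0.1, 0.2]   # crude fakes are easy to spot
late_real, late_fake = [0.5, 0.5], [0.5, 0.5]     # discriminator can only guess

early_g = generator_loss(early_fake)
late_g = generator_loss(late_fake)
late_d = discriminator_loss(late_real, late_fake)
```

As the generator improves, its loss falls (`late_g` is smaller than `early_g`). When the discriminator is reduced to guessing 0.5 for everything, its loss settles at 2 ln 2, which corresponds to the "baffled" end state described above.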
In 2017, Google researchers unveiled the transformer, a new type of neural network architecture specialized for language modeling that enabled rapid advancements in natural language processing.