DeepFloyd

Is a: Product

Product attributes

Industry: Generative AI
Product Parent Company: Stability AI
Competitors: DALL-E 2, Midjourney

Overview

DeepFloyd is a multimodal AI research lab developing a text-to-image model called IF. The DeepFloyd team works within Stability AI. IF is designed to improve on other AI models at rendering text and captions inside images according to the prompt provided. Stability AI released a non-commercial research preview of DeepFloyd IF on April 28, 2023, giving research labs the opportunity to examine and experiment with the text-to-image model. Stability AI plans to release IF as a fully open-source model in the future.

Examples of images generated using DeepFloyd IF.

IF is a modular, cascaded, pixel diffusion model, which means:

Modular: the model consists of several neural networks that solve independent tasks, such as generating images from prompts or upscaling.
Cascaded: IF models high-resolution data in a cascading manner, using a series of individually trained models at different resolutions. The process begins with a base model that produces unique low-resolution samples, which are then upscaled by successive models known as amplifiers.
Diffusion: the base and super-resolution models are diffusion models, in which a Markov chain of steps gradually injects random noise into data and the model learns to reverse that process in order to generate new samples from noise.
Pixel: the diffusion is implemented at the pixel level, unlike latent diffusion models (such as Stable Diffusion) that operate on latent representations.
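
The forward (noising) half of the diffusion process described above can be sketched in plain Python. This is a minimal illustration only: it assumes a linear noise schedule with 1,000 steps, a common choice in the diffusion literature, because the source does not state which schedule IF uses. The names linear_betas, alpha_bars, and add_noise are hypothetical.

```python
import math
import random


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances (assumed linear schedule, not necessarily IF's)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative product of (1 - beta_t): share of signal variance left after t steps."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def add_noise(pixels, t, abar, rng=random):
    """Sample x_t directly from x_0: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    signal = math.sqrt(abar[t])
    noise = math.sqrt(1.0 - abar[t])
    return [signal * p + noise * rng.gauss(0.0, 1.0) for p in pixels]


abar = alpha_bars(linear_betas())
x0 = [0.5, -0.25, 1.0]               # a few pixel values scaled to [-1, 1]
x_early = add_noise(x0, 10, abar)    # still close to x0
x_late = add_noise(x0, 999, abar)    # almost pure Gaussian noise
```

Because this operates on raw pixel values rather than on an encoded latent, it corresponds to the "pixel" property in the list above. Generation runs the chain in the opposite direction, with a trained network predicting the noise to remove at each step.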

The text prompt is first passed through the frozen T5-XXL language model, which converts it to a qualitative text representation. Images are then generated in a three-stage process:

The base diffusion model transforms the natural language text into a 64x64 image. DeepFloyd has trained three versions of the base model, each with a different number of parameters: IF-I 400M, IF-I 900M, and IF-I 4.3B.
To "amplify" the image, two text-conditional super-resolution models (Efficient U-Net) are applied in sequence to the output of the base model. The first upscales the 64x64 image to 256x256. Two versions of this model are available: IF-II 400M and IF-II 1.2B.
The second super-resolution diffusion model then produces a vivid 1024x1024 image. This third-stage model, IF-III, has 700M parameters.
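
The sizes quoted above can be tabulated in a short sketch. The resolutions and parameter counts come from the description above; the Stage class and the helper functions are hypothetical names used only to illustrate the arithmetic, not part of any DeepFloyd code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    out_size: int          # output is out_size x out_size pixels
    param_options: tuple   # available model sizes, in millions of parameters


# Figures taken from the text; the data structure itself is illustrative.
CASCADE = (
    Stage("IF-I", 64, (400, 900, 4300)),
    Stage("IF-II", 256, (400, 1200)),
    Stage("IF-III", 1024, (700,)),
)


def upscale_factors(stages):
    """Linear upscale factor applied by each super-resolution stage."""
    return [b.out_size // a.out_size for a, b in zip(stages, stages[1:])]


def largest_config_params(stages):
    """Total diffusion parameters (millions) using the largest option per stage."""
    return sum(max(s.param_options) for s in stages)


print(upscale_factors(CASCADE))        # [4, 4]
print(largest_config_params(CASCADE))  # 6200
```

Each super-resolution stage therefore multiplies the side length by 4, or the pixel count by 16. The 6,200M total covers only the three diffusion models and excludes the frozen T5-XXL text encoder.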

Diagram showing the image generation process of DeepFloyd IF and the various models it uses.

Features

DeepFloyd IF features include:

Deep text prompt understanding

IF's generation pipeline uses the large language model T5-XXL-1.1 as a text encoder. A large number of text-image cross-attention layers also provide better alignment between prompt and image.

Text descriptions in images

By incorporating the T5 model, IF generates coherent and clear text alongside objects with different properties appearing in various spatial relations.

Photorealism

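IF achieves a zero-shot FID score of 6.66 on the COCO dataset. FID (Fréchet Inception Distance) is a metric used to evaluate the performance of text-to-image models; lower scores indicate generated images that are statistically closer to real ones.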
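
For reference, the FID score is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images. The symbols below are not from the source: \mu_r, \Sigma_r denote the mean and covariance of the real-image features, and \mu_g, \Sigma_g those of the generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The score is zero when the two feature distributions have identical means and covariances, and it grows as they diverge.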

Aspect ratio shifts

IF can generate images with a non-standard aspect ratio, vertical or horizontal, as well as the standard square aspect.

Zero-shot image-to-image translations

Image modification is possible by resizing the original image to 64 pixels, adding noise through forward diffusion, and using backward diffusion with a new prompt to denoise the image. The style can be changed further through super-resolution modules via a prompt text description.
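
The first two steps of this procedure can be sketched as follows. This is a minimal sketch under stated assumptions: the source does not say how the image is resized or how much noise is added, so nearest-neighbour resizing, a 1,000-step schedule, and a "strength" parameter are all hypothetical choices, as are the function names.

```python
def resize_nearest(image, size=64):
    """Nearest-neighbour resize of a 2-D list of pixel values to size x size."""
    src_h, src_w = len(image), len(image[0])
    return [
        [image[y * src_h // size][x * src_w // size] for x in range(size)]
        for y in range(size)
    ]


def start_step(strength, num_steps=1000):
    """Map a hypothetical 'strength' in [0, 1] to the forward-diffusion step to noise up to.

    0.0 leaves the image almost unchanged; 1.0 replaces it with near-pure noise.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return round(strength * (num_steps - 1))


original = [[(x + y) % 256 for x in range(512)] for y in range(512)]  # stand-in image
small = resize_nearest(original)   # 64 x 64 input for the base model
t0 = start_step(0.7)               # noise level at which backward diffusion starts
# Not shown: noise `small` up to step t0, run the base model's backward
# diffusion from t0 with the new prompt, then upscale with the later stages.
```

A higher strength discards more of the original image and gives the new prompt more influence; a lower strength preserves more of the original composition.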

Training

DeepFloyd IF was trained on a custom high-quality LAION-A dataset, containing 1B image-text pairs. LAION-A is an aesthetic subset of the English part of the LAION-5B dataset. It was obtained after deduplication based on similarity hashing, extra cleaning, and other modifications to the original dataset. The DeepFloyd team’s custom filters were used to remove watermarked, NSFW, and other inappropriate content.

Limitations and bias

DeepFloyd IF does not achieve perfect photorealism. It was trained primarily with English captions, which limits its ability to return accurate images for prompts in other languages; texts and images from other languages are likely to be insufficiently accounted for. Although filters were applied, the LAION dataset used to train the model does contain adult, violent, and sexual content. IF may also reinforce or exacerbate social biases.

License

DeepFloyd IF was initially released under a research license, with plans to move to a permissive license. Any attempt to deploy the model in production requires not only that the license be followed but also that the person deploying the model assume full liability. Stability AI believes research on DeepFloyd IF can lead to the development of novel applications in various domains including art, design, storytelling, virtual reality, accessibility, and more. Possible areas and tasks include:

Generation of artistic imagery and use in design
Safe deployment of models which have the potential to generate harmful content
Probing and understanding the limitations and biases of generative models
Applications in educational or creative tools
Research on generative models

Excluded uses of IF include:

Out-of-scope use: the model was not trained to produce factual or true representations of people or events, so using it to generate such content is outside its abilities.
Misuse and malicious use: using the model to generate content that is cruel to individuals is a misuse of the model.

Further Resources

Title: Building The Next Large Model: DeepFloyd LLM + Text-to-Image = IF (Stability AI)
Link: https://www.youtube.com/watch?v=vlxnDNVkWFo
Type: Web
Date: April 7, 2023
