DeepFloyd is a multimodal AI research lab developing a text-to-image generator model called IF. The DeepFloyd team works within Stability AI. IF is designed to improve on other AI models with respect to generating text and captions in images based on the prompt provided. Stability AI released a non-commercial research preview of DeepFloyd IF on April 28, 2023, providing research labs the opportunity to examine and experiment with the text-to-image model. Stability AI plans to release IF as a fully open-source model in the future.
IF is a modular, cascaded, pixel diffusion model, which means:
- Modular: the model consists of several neural networks that solve independent tasks such as generating images from prompts or upscaling.
- Cascaded: IF models high-resolution data in a cascading manner using a series of individually trained models at different resolutions. The process begins with a base model that produces unique low-resolution samples, which are then upscaled by successive super-resolution models known as amplifiers.
- Diffusion: the base and super-resolution models are diffusion models, in which a Markov chain of steps gradually injects random noise into data and the process is then reversed to generate new samples from noise.
- Pixel: this diffusion is implemented at the pixel level, unlike latent diffusion models (such as Stable Diffusion) that operate on latent representations.
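To make the "diffusion" and "pixel" properties above concrete, the following minimal sketch applies forward noising directly to pixel values. It assumes a DDPM-style linear noise schedule; the schedule, the step count, and the function names (`linear_betas`, `alpha_bars`, `noise_pixels`) are illustrative assumptions and are not taken from the IF release.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule (a common DDPM default, not confirmed for IF)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative product of (1 - beta_t): the share of signal variance left at each step."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def noise_pixels(pixels, t, abar, rng):
    """Jump straight to step t of the Markov chain:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, 1)."""
    signal = math.sqrt(abar[t])
    sigma = math.sqrt(1.0 - abar[t])
    return [signal * p + sigma * rng.gauss(0.0, 1.0) for p in pixels]

rng = random.Random(0)
abar = alpha_bars(linear_betas(1000))
image = [0.5, -0.25, 1.0, -1.0]  # toy "image": pixel values scaled to [-1, 1]
slightly_noisy = noise_pixels(image, 10, abar, rng)
mostly_noise = noise_pixels(image, 999, abar, rng)
```

A diffusion model is trained to undo this noising step by step. A latent diffusion model would apply the same procedure to an encoded representation rather than to raw pixels.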
Images are generated using a three-stage process. Before the first stage, the text prompt is passed through the frozen T5-XXL language model, which converts it to a qualitative text representation. The stages are:
- The base diffusion model transforms the encoded text into a 64x64 image. DeepFloyd has trained three versions of the base model, each with a different number of parameters: IF-I 400M, IF-I 900M, and IF-I 4.3B.
- To ‘amplify’ the image, two text-conditional super-resolution models (Efficient U-Net) are applied to the output of the base model. The first of these upscales the 64x64 image to a 256x256 image. Two versions of this model are available: IF-II 400M and IF-II 1.2B.
- The second super-resolution diffusion model is then applied to produce a vivid 1024x1024 image. This third-stage model, IF-III, has 700M parameters.
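The cascade described above can be summarized as a small data structure. This is only a sketch: the resolutions and parameter-count labels come from the description above, while the `Stage` class, the `CASCADE` table, and the helper functions are hypothetical names introduced for illustration and are not part of any DeepFloyd API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str        # stage label used in the text above
    out_size: int    # side length of the square output, in pixels
    variants: tuple  # parameter-count labels listed in the text above

# Figures are taken from the stage descriptions; the table layout is an assumption.
CASCADE = (
    Stage("IF-I", 64, ("400M", "900M", "4.3B")),
    Stage("IF-II", 256, ("400M", "1.2B")),
    Stage("IF-III", 1024, ("700M",)),
)

def upscale_factors(cascade):
    """Return the per-side upscale factor between consecutive stages."""
    return [nxt.out_size // cur.out_size for cur, nxt in zip(cascade, cascade[1:])]

def pixel_counts(cascade):
    """Return how many pixels each stage has to produce."""
    return [s.out_size * s.out_size for s in cascade]

factors = upscale_factors(CASCADE)  # [4, 4]
pixels = pixel_counts(CASCADE)      # [4096, 65536, 1048576]
```

Each super-resolution stage enlarges the image by a factor of 4 per side, so the base model only has to produce 4,096 pixels while the final stage outputs just over one million.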
DeepFloyd IF features include:
- IF's generation pipeline uses the large language model T5-XXL-1.1 as a text encoder. A large number of text-image cross-attention layers also improves the alignment between prompt and image.
- By incorporating the T5 model, IF generates coherent and clear text alongside objects with different properties appearing in various spatial relations.
- IF achieves a zero-shot FID score of 6.66 on the COCO dataset. FID (Fréchet Inception Distance) is a metric used to evaluate the performance of text-to-image models; lower scores are better.
- IF can generate images with non-standard aspect ratios, vertical or horizontal, as well as the standard square aspect.
- Image modification is possible by resizing the original image to 64 pixels, adding noise through forward diffusion, and using backward diffusion with a new prompt to denoise the image. The style can be changed further through the super-resolution modules via a prompt text description.
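The image modification procedure described above (resize, add noise, denoise with a new prompt) can be sketched as follows. This is a simplified illustration, not the IF implementation: `denoise_step` is a hypothetical stand-in for the base model, and the `strength` parameter with its direct mapping to a noise level is an assumption made to keep the example short.

```python
import math
import random

def downscale_nearest(image, size):
    """Resize a square grid of pixel values to size x size by nearest-neighbour sampling."""
    src = len(image)
    return [[image[r * src // size][c * src // size] for c in range(size)]
            for r in range(size)]

def add_noise(image, signal_kept, rng):
    """Forward diffusion in one jump: keep `signal_kept` of the variance, fill the rest with noise."""
    a, s = math.sqrt(signal_kept), math.sqrt(1.0 - signal_kept)
    return [[a * p + s * rng.gauss(0.0, 1.0) for p in row] for row in image]

def edit_image(image, prompt, denoise_step, strength=0.7, num_steps=50, seed=0):
    """strength in (0, 1]: higher means more noise and a freer reinterpretation.
    `denoise_step(x, step, prompt)` is a hypothetical stand-in for the base model."""
    rng = random.Random(seed)
    x = downscale_nearest(image, 64)         # 1. resize to the base resolution
    x = add_noise(x, 1.0 - strength, rng)    # 2. forward diffusion (simplified mapping)
    first = round(num_steps * strength)      # 3. run only the remaining reverse steps
    for step in reversed(range(first)):
        x = denoise_step(x, step, prompt)    #    backward diffusion guided by the new prompt
    return x

# Toy usage: a 128x128 checkerboard and an identity "denoiser" in place of a real model.
source = [[(r + c) % 2 for c in range(128)] for r in range(128)]
result = edit_image(source, "a watercolor painting", lambda x, step, prompt: x)
```

A lower `strength` keeps more of the original image, while a value of 1.0 discards it entirely and amounts to generating from pure noise.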
DeepFloyd IF was trained on a custom high-quality LAION-A dataset, containing 1B image-text pairs. LAION-A is an aesthetic subset of the English part of the LAION-5B dataset. It was obtained after deduplication based on similarity hashing, extra cleaning, and other modifications to the original dataset. The DeepFloyd team’s custom filters were used to remove watermarked, NSFW, and other inappropriate content.
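The text above does not say which similarity hash was used to deduplicate the dataset. As an illustration of the general idea only, the sketch below uses a simple average hash and a Hamming-distance threshold; the function names, the threshold, and the toy 2x2 "images" are all assumptions.

```python
def average_hash(gray):
    """Perceptual hash of a small grayscale grid: one bit per pixel, set when the pixel exceeds the mean."""
    flat = [p for row in gray for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(a, b):
    """Number of differing bits between two equal-length hashes."""
    return sum(x != y for x, y in zip(a, b))

def deduplicate(images, max_distance=1):
    """Keep the first image of every group whose hashes differ by at most max_distance bits.
    `images` maps a name to a grayscale grid; all grids must share the same shape."""
    kept = {}
    for name, gray in images.items():
        h = average_hash(gray)
        if all(hamming(h, other) > max_distance for other in kept.values()):
            kept[name] = h
    return list(kept)

images = {
    "cat.png": [[10, 200], [220, 30]],
    "cat_copy.png": [[12, 198], [225, 28]],  # near-duplicate with slightly different pixel values
    "dog.png": [[200, 10], [30, 220]],
}
unique = deduplicate(images)  # ["cat.png", "dog.png"]
```

The pairwise comparison here is quadratic in the number of images, so a dataset with a billion pairs would need an indexed or bucketed lookup instead.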
DeepFloyd IF does not achieve perfect photorealism. It was trained primarily with English captions, which limits its ability to return accurate images for prompts in other languages; texts and images from other languages are likely to be insufficiently accounted for. Although filters were applied, the LAION dataset used to train the model still contains adult, violent, and sexual content. IF may also reinforce or exacerbate social biases.
On release, DeepFloyd IF was made available under a research license, with plans to move to a permissive license. Anyone deploying the model in production must not only follow the license but also accept full liability for the deployment. Stability AI believes research on DeepFloyd IF can lead to the development of novel applications in various domains including art, design, storytelling, virtual reality, accessibility, and more. Possible areas and tasks include:
- Generation of artistic imagery and use in design
- Safe deployment of models which have the potential to generate harmful content
- Probing and understanding the limitations and biases of generative models
- Applications in educational or creative tools
- Research on generative models
Excluded uses of IF include:
- Out-of-scope use: the model was not trained to produce factual or true representations of people or events, so using it to generate such content is beyond its abilities.
- Misuse and malicious use: using the model to generate content that is cruel to individuals is a misuse of this model.