Explainer Series #7 - AI and Creativity - How AI Generates Art, Music, and Content
BIGPURPLECLOUDS PUBLICATIONS
Artificial intelligence has reached an inflection point in its ability to generate creative works such as music, art, and literature. In this article, we take a deeper dive into the technical architectures behind algorithmic creativity, using DALL-E 2 (images), MuseNet (music), and GPT-3 (text) as examples in each area.
Image Generation with Diffusion Models
DALL-E 2 produces remarkably vivid and varied images from text captions. Under the bonnet, it uses a deep neural network architecture called a diffusion model. Diffusion models are trained on massive image datasets, learning to convert random noise into realistic images through a repeated process of noising and text-conditioned denoising.
More specifically, DALL-E 2 learns a mapping between sampled noise vectors and image data, mediated through text captions. At each training step:
1. The model starts with a real training image and its corresponding caption.
2. It adds noise to degrade the image - essentially diffusing it. This distorted image is the input.
3. The model attempts to reverse the diffusion and restore the original image. It conditions this denoising on the associated text caption, learning which words map to which visual features.
Iterating this process across the training data teaches the model to translate text into images through successive denoising of noise into sharper outputs.
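The training loop above can be sketched in a few lines of numpy. This is a simplified, DDPM-style illustration, not DALL-E 2's actual implementation: the noise schedule values, the `diffuse` and `training_step` helpers, and the stand-in `model` callable are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise schedule: alpha_bar[t] is the fraction of original signal
# remaining after t diffusion steps (cumulative product of 1 - beta).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def diffuse(image, t):
    """Forward process: mix the image with Gaussian noise at step t."""
    noise = rng.standard_normal(image.shape)
    noisy = np.sqrt(alpha_bar[t]) * image + np.sqrt(1.0 - alpha_bar[t]) * noise
    return noisy, noise

def training_step(image, caption_embedding, model):
    """One denoising-training step: the model sees a noisy image plus the
    caption embedding and is scored on how well it predicts the noise."""
    t = rng.integers(0, T)                      # pick a random noise level
    noisy, noise = diffuse(image, t)            # step 2: degrade the image
    predicted_noise = model(noisy, t, caption_embedding)  # step 3: denoise
    loss = np.mean((predicted_noise - noise) ** 2)        # simple MSE objective
    return loss
```

In practice the `model` is a large neural network and the loss is minimised by gradient descent over millions of image-caption pairs; the sketch only shows the shape of a single training step.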
After extensive training, DALL-E 2 can generate images from any caption by reversing its learned text-to-image diffusion mapping. Given a caption, the model begins with random noise and sequentially adds finer details guided by the text prompt. After hundreds of denoising steps, this procedure produces a novel image matching the description.
Crucially, because the model learned associations between language and image patterns, it can render appropriate scenery, lighting, poses, styles, compositions, and contexts for each caption. Variations in the initial noise also enable unique outputs.
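The generation process described above - start from noise, denoise step by step under the guidance of the caption - can be sketched as a reverse loop. Again this is an illustrative DDPM-style sampler under assumed parameters, not DALL-E 2's real code; `model` stands in for the trained denoising network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same assumed noise schedule as used during training.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def generate(model, caption_embedding, shape=(8, 8)):
    """Reverse process: start from pure noise and denoise step by step,
    conditioning each prediction on the caption embedding."""
    x = rng.standard_normal(shape)  # initial random noise
    for t in range(T - 1, -1, -1):
        predicted_noise = model(x, t, caption_embedding)
        # Remove the predicted noise component (DDPM update rule).
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * predicted_noise) \
            / np.sqrt(alphas[t])
        if t > 0:  # inject a little fresh noise on all but the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```

Note that the starting noise is random: seeding the generator differently yields different images for the same caption, which is exactly the source of the output variation mentioned above.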