Drawn by AI
Image generative AI paints images from noise
Image generative AI (artificial intelligence), which creates elaborate images from any text input, has garnered attention. Sophisticated models, such as "Stable Diffusion," developed by a British startup, have been released to the public. How does AI produce images? Here, Nikkei uses a quiz and visuals to compare AI-generated images with images and photos made by humans.
Comparing AI and human images
Is it possible to distinguish between a human drawing and an AI drawing? In this quiz, you can try to identify which of two pictures using the same motif is an AI image. The explanation for each question will also discuss the characteristics of AI images.
The AI images in this quiz were generated using the "Stable Diffusion" AI. One of AI's strengths is that a single model can generate a variety of patterns, including ukiyo-e, illustrations and photographs.
While some images are elaborate and indistinguishable from human drawings, other AI outputs are decidedly out of the ordinary. This is because AI learns about images from the vast number of pictures and photos produced by humans, but it does not learn common social sense.
How AI paints an image
How does AI generate images? By knowing how image generation and learning work, we may find some tips on how AI and humans can work together. Under the supervision of Makoto Shing, an applied scientist at Tokyo-based AI developer rinna Co., Ltd., the technology for creating images from words is laid out below, based on the paper describing the image-generating AI Stable Diffusion.
AI does not create images by combining existing pictures
It is often misunderstood, but image generative AI does not generate images by cutting and pasting existing pictures together. It learns how to draw from a vast number of images and generates new ones using words and image-processing techniques.
Three steps of image generation
There are three main steps in the AI image-generation process. The first is "text conversion," which turns the text input into a representation that is easy for the AI to interpret. The second is "image generation": starting from a featureless image of pure noise, the AI repeatedly removes noise, gradually bringing the image closer to one that matches the input text. The third is "decoding": the image data, which had been compressed so the computer can calculate quickly, is converted back into a form that is easy for humans to see, and the final image is produced.
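As a rough sketch, the three steps can be expressed in code. Every function body here is a stand-in for illustration, not a real Stable Diffusion component:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(prompt: str) -> np.ndarray:
    """Step 1 (stand-in): map the prompt to a fixed-size vector."""
    vec = np.zeros(8)
    for i, ch in enumerate(prompt.encode()):
        vec[i % 8] += ch
    return vec / np.linalg.norm(vec)

def denoise(latent: np.ndarray, text_vec: np.ndarray, steps: int = 50) -> np.ndarray:
    """Step 2 (stand-in): repeatedly remove a little 'noise', guided by the text."""
    for _ in range(steps):
        predicted_noise = latent - text_vec.mean()  # placeholder noise estimate
        latent = latent - 0.1 * predicted_noise
    return latent

def decode(latent: np.ndarray) -> np.ndarray:
    """Step 3 (stand-in): expand the compressed latent back to 'pixel' space."""
    return np.repeat(latent, 4)

latent = rng.standard_normal(8)           # start from pure noise
text_vec = encode_text("a cat in Paris")  # step 1: text conversion
latent = denoise(latent, text_vec)        # step 2: iterative denoising
image = decode(latent)                    # step 3: decoding
print(image.shape)  # (32,)
```

In the real system, each step is a trained neural network; the control flow, however, follows this same three-stage shape.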
1 ) Convert input text
One of the most attractive features of image generative AI is that it can create images from any text input, called the "prompt." In the first step of image generation, the human-written prompt is converted into "vectors": numerical representations that are easy for the AI to interpret. For example, the "cat" vector enables the AI to quickly find cat-like features in images.
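The key property of these vectors is that related words land near each other. A toy illustration with hand-made three-dimensional vectors (real text encoders, such as the CLIP model used by Stable Diffusion, learn much larger vectors from data):

```python
import numpy as np

# Hand-made toy embeddings; a real model learns these from data.
embeddings = {
    "cat":    np.array([0.9, 0.1, 0.0]),
    "kitten": np.array([0.8, 0.2, 0.1]),
    "bridge": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related words end up close together in vector space...
print(cosine(embeddings["cat"], embeddings["kitten"]))  # ~0.98
# ...while unrelated words end up far apart.
print(cosine(embeddings["cat"], embeddings["bridge"]))  # ~0.01
```

This geometry is what lets the AI treat "cat" and "kitten" as nearby destinations while keeping "bridge" far away.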
2 ) Denoising
The second step is to generate images. Whereas humans paint on a blank canvas, AI paints from images of pure noise filled with random data. The process of gradually removing noise from the image is repeated many times to generate the image based on the text inputs converted in the first step.
Denoise and get closer to text inputs
The process of removing noise brings the image closer to the text input. On the vast map of texts inside the AI, the text representation created in the first step is set as the destination. As the AI estimates how to denoise to generate an image that more closely resembles the input text representation, the image is gradually brought closer to the destination.
3 ) Decoding for the human eye
In the third step, the image generated in the first and second steps is decoded and output in a form that the human eye can parse. In this visual explanation, we used images that are easy for the human eye to see, but the AI actually processes images that are smaller and more compressed, in a space that machines can recognize called the "latent space." Working with small images is a kind of "shortcut" that makes generation fast: a high-performance computer can generate an image just a few seconds after the text is entered. Finally, the compressed data is converted back into an image consisting of color, width and height, resulting in the finished product.
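The size of the shortcut can be checked with simple arithmetic, using Stable Diffusion's commonly cited dimensions (a 512×512 RGB image versus a 64×64 latent grid with 4 channels):

```python
# Pixel space: a 512x512 image with 3 color channels (RGB).
pixel_values = 512 * 512 * 3
# Latent space: a 64x64 grid with 4 channels.
latent_values = 64 * 64 * 4

print(pixel_values)                   # 786432
print(latent_values)                  # 16384
print(pixel_values // latent_values)  # 48
```

The denoising loop therefore works on roughly 48 times fewer numbers than a full-size image, which is what makes second-scale generation feasible.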
AI learns in "reverse"
AI can determine which noise should be removed during image generation because it has been trained on the difference between an image with noise and an image without it. In the training process, various amounts of noise are artificially added to clean images, running in the "reverse" direction of the generation process.
Next, the AI is trained to recover, from a noise-added image, the image as it was before the noise was added. Through repeated training on a vast number of images, it becomes able to generate photo-realistic images even from pure noise.
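A minimal numerical sketch of this "reverse" setup, using a simplified blending schedule (an assumption for illustration; the real schedule differs). It also shows why predicting the noise is enough: if the noise is known exactly, the clean image can be recovered by inverting the blend:

```python
import numpy as np

rng = np.random.default_rng(2)

def add_noise(clean, noise, t, T=1000):
    """Blend a clean image with noise; a larger t means noisier.
    (A simplified schedule, not Stable Diffusion's actual one.)"""
    alpha = 1.0 - t / T  # fraction of the clean image that is kept
    return np.sqrt(alpha) * clean + np.sqrt(1.0 - alpha) * noise

clean = rng.uniform(size=16)      # stand-in for a training image
noise = rng.standard_normal(16)   # the artificially added noise
noisy = add_noise(clean, noise, t=800)

# Training target: given `noisy` (and t), the network must predict `noise`.
# With a perfect prediction, inverting the blend recovers the clean image.
alpha = 1.0 - 800 / 1000
recovered = (noisy - np.sqrt(1.0 - alpha) * noise) / np.sqrt(alpha)
print(np.allclose(recovered, clean))  # True
```

Training minimizes the gap between the predicted and the actually added noise; the better that prediction, the closer the recovered image is to the original.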
This AI mechanism is known as the "diffusion model." Stable Diffusion has been released as a pre-trained model, but derivative models that have received additional training by others are also available.
The keys to training are "quantity" and "quality"
Both quality and quantity are essential in the training data if an AI is to generate elaborate images. For example, Stable Diffusion was trained on approximately 2.3 billion images from a huge dataset of images and descriptive text collected from the Internet. The collection of the Metropolitan Museum of Art in the U.S., one of the largest museums in the world, contains more than 2 million items; the AI learns image features from more than 1,000 times that amount of data.
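The "more than 1,000 times" comparison can be checked directly from the two figures in the text:

```python
training_images = 2_300_000_000  # Stable Diffusion's reported training set
met_collection = 2_000_000       # items in the Metropolitan Museum of Art

print(training_images // met_collection)  # 1150
```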
Just as humans develop an aesthetic sense through exposure to high-quality paintings, AI needs to be trained with high-quality data. Accurate predictions are impossible if the images used for training are inconsistent with the explanatory text, if the image quality is poor or if the content is biased.
Making an image of your choice
Simulating the AI experience
Choose from 10 keywords such as "Paris," "Kyoto," "evening" and "ukiyo-e" to create your image. You can select any number of keywords.
With an actual AI you can freely input any words, but here we have prepared 1,023 different images covering every combination of the keywords prepared in advance.
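The figure of 1,023 images follows directly from the 10 keywords: each keyword is either selected or not, minus the one case where nothing is selected:

```python
# 10 keywords, each on or off, excluding the empty selection.
combinations = 2 ** 10 - 1
print(combinations)  # 1023
```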
Select keywords below the image
With the August 2022 public release of the image generative AI "Stable Diffusion," innovation in image-generation AI is advancing at an astounding rate. Image-generation AIs have their own quirks, rooted in their training data and mechanisms. While some images captivate, as if distilling the entire history of human painting, other somewhat unusual images unintentionally leave viewers laughing.
Some may feel unease that AI has stepped into the fundamentally human creative activity of "painting." However, there are many examples of technological innovation expanding the horizons of human creativity. The photography that emerged in the 19th century was a major catalyst for the creation of Impressionist paintings. In the worlds of chess and shogi, top players such as shogi champion Sota Fujii are employing new styles of play developed through AI research.
What is now required of us is a proper understanding of how AI works and a blueprint for coexistence between AI and humans.
The pictures featured as "AI-generated images" in the quiz, commentary and image-generation experience were generated by the image generative AI "Stable Diffusion" with the help of Nikkei Innovation Lab, a research and development organization owned by Nikkei. The models used were the pre-trained models of versions 1.4 and 2.0, with no additional training. The prompts were entered in English.
The 10 keywords used in the image-generation experience were selected by referencing frequently used words in a text analysis of a database of approximately 14 million prompts and images compiled by a research team at the Georgia Institute of Technology.
The main program used to generate the images can be found on GitHub. Users can enter any prompt they like to generate AI images like those in this article or create animations of the generation process.
The visual explanation is based on a paper on how "Stable Diffusion" works, written by researchers at the University of Munich in Germany, and was prepared under the supervision of rinna Co., Ltd.