The artificial intelligence research outfit OpenAI Inc. today released an updated version of its text-to-image generation model DALL-E, with higher resolution and lower latency than the original system.
DALL-E 2, like the original DALL-E, is able to produce pictures, such as the image above, from written descriptions, and it comes with a few new capabilities, such as the ability to edit an existing picture. As is usual with OpenAI’s work, DALL-E 2 isn’t being open-sourced, but researchers are encouraged to sign up to test the system, which may one day be made available for use by third-party applications.
“DALL-E” is a portmanteau of the names of the surrealist artist Salvador Dalí and WALL-E, the robot from the computer-animated science fiction film of the same name. The model was first unveiled in January 2021 and provided a fascinating example of AI-based creativity, producing depictions of anything from mundane mannequins in flannel shirts to a “giraffe made of turtle”. OpenAI said at the time that it would continue building on the system while being careful to examine dangers such as bias.
The result of that ongoing work is DALL-E 2, which includes a new inpainting feature that applies the original model’s text-to-image prowess at a more granular level, OpenAI explained. With the updated model, it’s possible to start with an existing picture, select part of it and tell the model to edit it. Someone could ask DALL-E 2, for example, to pick out a picture hanging on a wall and replace it with a new one, or to add a glass of water to a table. The model’s precision means it can also remove an object from an image while accounting for how the change affects details such as shadows. In the example below, the pink flamingo was added to an existing image.
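For developers curious what such an edit might look like programmatically, here is a minimal sketch modeled on the conventions of OpenAI’s Python client. Since no public DALL-E 2 endpoint exists yet, the `images.edit` call, file names and parameters below are illustrative assumptions rather than a documented interface:

```python
# Hypothetical sketch of an inpainting request, styled after OpenAI's
# Python client conventions. The endpoint, file names and parameters
# are assumptions, not a documented DALL-E 2 interface.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.edit(
    model="dall-e-2",
    image=open("living_room.png", "rb"),        # the original picture
    mask=open("wall_painting_mask.png", "rb"),  # transparent where the edit goes
    prompt="a framed watercolor of a pink flamingo hanging on the wall",
    n=1,
    size="1024x1024",
)
print(result.data[0].url)  # URL of the edited image
```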
A second new feature in DALL-E 2 is variations, which allows users to upload an image and then have the model create variations of it. It’s also possible to blend two existing images, generating a third picture that contains elements of both.
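A variations request could plausibly take a similar shape. The sketch below again assumes a hypothetical endpoint styled after OpenAI’s Python client; the `create_variation` call and its parameters are illustrative:

```python
# Hypothetical sketch of the variations feature as an API call; the
# endpoint and parameters are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

result = client.images.create_variation(
    image=open("original.png", "rb"),  # the uploaded source image
    n=3,                               # ask for three variations
    size="1024x1024",
)
for item in result.data:
    print(item.url)
```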
DALL-E 2 is built on CLIP, a computer vision system created by OpenAI. The original DALL-E was a version of Generative Pre-trained Transformer 3, the autoregressive language model that uses deep learning to produce human-like text, trained to generate images instead. CLIP, for its part, was originally designed to look at images and summarize their content in human language, much as a person would. For DALL-E 2, OpenAI inverted that process with a model called unCLIP, which begins with the summary and works backward to generate a matching image.
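Conceptually, the unCLIP pipeline runs in two stages: a “prior” maps the text summary (a CLIP text embedding) to a CLIP image embedding, and a decoder turns that embedding into pixels. The following Python sketch shows only the structure of that flow; the stand-in functions, the 512-dimensional embedding size and the random placeholders are assumptions for illustration, not OpenAI’s actual models:

```python
# Structural sketch of the unCLIP flow described above. The function
# bodies are placeholders; only the stage-to-stage data flow is real.
import numpy as np

EMBED_DIM = 512  # assumed CLIP embedding size, for illustration only

def clip_text_encoder(prompt: str) -> np.ndarray:
    """Stand-in for CLIP's text encoder: prompt -> text embedding."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(EMBED_DIM)

def prior(text_embedding: np.ndarray) -> np.ndarray:
    """Stand-in for the prior: text embedding -> image embedding."""
    noise = np.random.default_rng(0).standard_normal(EMBED_DIM)
    return text_embedding + 0.1 * noise

def decoder(image_embedding: np.ndarray) -> np.ndarray:
    """Stand-in for the decoder: image embedding -> pixels."""
    return np.zeros((1024, 1024, 3))  # placeholder "image"

# Summary in, image out: the inversion of CLIP's original direction.
image = decoder(prior(clip_text_encoder("a giraffe made of turtle")))
```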
DALL-E 2 has some built-in safeguards. OpenAI explained that the model was trained on data from which potentially objectionable material had been weeded out, reducing the chance of it producing an image some might find offensive. Its images also carry a watermark indicating that DALL-E 2 created them. OpenAI has also ensured the model can’t generate recognizable human faces based on a name, so it apparently wouldn’t be able to draw a portrait of, say, Donald Trump if asked. Even so, the possibilities are virtually limitless, as this example of “a bowl of soup that looks like a monster, knitted out of wool” demonstrates.
OpenAI said that unlike the original DALL-E, DALL-E 2 will only be made available for testing by vetted partners, with some restrictions in place. Users will not be allowed to upload or generate images that aren’t “G-rated” or that could cause harm or upset, meaning no nudity, hate symbols or otherwise obscene pictures.
OpenAI said it hopes to add DALL-E 2 to its application programming interface toolset once it has undergone extensive testing, meaning it could one day be made available to third-party apps.
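If and when that happens, a text-to-image request could look something like the following sketch, which assumes DALL-E 2 would be exposed in the style of OpenAI’s existing Python client; the `images.generate` call and its parameters are illustrative assumptions:

```python
# Hypothetical sketch of a text-to-image request, assuming DALL-E 2 is
# one day exposed the way OpenAI's other models are.
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-2",
    prompt="a bowl of soup that looks like a monster, knitted out of wool",
    n=1,
    size="1024x1024",
)
print(result.data[0].url)
```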