Google’s image AI Imagen outperforms DALL-E 2 – but Google has concerns

Written by insideindyhomes

Image: Imagen/Google


With its generative image AI Imagen, Google follows OpenAI in showing that artificial intelligence can generate credible and useful images.

Imagen is Google’s answer to OpenAI’s recently introduced image AI DALL-E 2. There is one difference: OpenAI unveiled DALL-E 2 as a product from the start, including a beta test that is expected to open to more people from summer.

According to Google’s researchers, Imagen beats DALL-E 2 in precision and quality, but the generative AI is currently available only as a research publication. For ethical reasons, this will probably not change in the near future; more on that later.

Imagen generates images matching text input. | Image: Google AI

Text becomes image

Imagen relies on a large, pre-trained Transformer language model (T5) that converts the input text into a numerical representation (a text embedding). A diffusion model then creates an image from this embedding. During training, diffusion models see images that are gradually made noisier. After training, the models can reverse this process, i.e. generate an image from noise.

The Imagen generation process. Image generation originates from the text understanding of a large Transformer language model. Theoretically, a different language model could be used for the input, which in turn should affect the quality of the images. | Image: Google AI
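The noising half of the diffusion idea can be sketched in a few lines of Python. This is a minimal illustration that assumes a DDPM-style linear noise schedule; the schedule values, the function names and the random stand-in image are assumptions made for this example and are not taken from Google's paper. The reverse direction, turning noise back into an image, requires a trained neural network and is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances that rise linearly (illustrative values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def signal_fraction(betas):
    """Cumulative product of (1 - beta): share of the original signal left at each step."""
    fractions, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        fractions.append(running)
    return fractions


def add_noise(pixels, alpha_bar_t, rng):
    """Forward (noising) step: blend the clean pixels with Gaussian noise."""
    keep = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [keep * p + noise_scale * rng.gauss(0.0, 1.0) for p in pixels]


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = signal_fraction(betas)

# Random values stand in for a real 64 x 64 image scaled to [-1, 1].
image = [rng.uniform(-1.0, 1.0) for _ in range(64 * 64)]

slightly_noisy = add_noise(image, alpha_bars[10], rng)     # early step: mostly image
almost_pure_noise = add_noise(image, alpha_bars[-1], rng)  # last step: mostly noise
```

At early steps almost all of the original image survives, while at the last step practically only noise is left. A trained model learns to undo these steps one at a time.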

The low-resolution original image (64 x 64) is then upscaled by AI to 1024 x 1024 pixels, the same resolution as DALL-E 2. Similar to Nvidia DLSS, the AI upscaling adds new details that fit the content, so the image remains sharp at the target resolution. This upscaling process saves Imagen a lot of computing power that would be needed if the model output high resolutions directly.
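A quick calculation shows the size of the gap between the two resolutions. Pixel count is only a rough proxy for computing effort, not a measurement of Imagen's actual cost, and the helper name is made up for this example.

```python
def pixel_count(side):
    """Number of pixels in a square image with the given side length."""
    return side * side


base_pixels = pixel_count(64)      # image produced by the base diffusion model
target_pixels = pixel_count(1024)  # final output resolution
ratio = target_pixels // base_pixels

print(base_pixels, target_pixels, ratio)  # 4096 1048576 256
```

The base model therefore only has to produce 1/256 of the pixels in the final image; the upscaling stages fill in the rest.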

Imagen performs better than DALL-E 2 on human evaluation

A key finding from the Google AI team is that a large pre-trained language model is “surprisingly effective” for encoding text for subsequent image synthesis. For more realistic image generation, enlarging the language model also has a greater effect than more extensive training of the diffusion model that creates the actual image.

The team developed the DrawBench benchmark, in which people rate the quality of a generated image and how well the image matches the input text. They compare the outputs of several systems side by side.

In the DrawBench benchmark, human raters evaluated images generated by Imagen and DALL-E 2 in terms of how accurately they match the input and the quality of the motif. According to Google, the human testers “clearly” preferred Imagen. | Image: Google AI
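How side-by-side ratings of this kind turn into a preference rate can be sketched as follows. The votes and the system names are invented purely for illustration; they are not DrawBench data, and the function is not Google's evaluation code.

```python
from collections import Counter


def preference_rates(votes):
    """Share of votes per option in a side-by-side comparison."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    total = sum(counts.values())
    return {option: count / total for option, count in counts.items()}


# Hypothetical votes from eight raters comparing two unnamed systems.
votes = [
    "system_a", "system_a", "system_b", "system_a",
    "tie", "system_b", "system_a", "system_a",
]
rates = preference_rates(votes)
print(rates)
```

In this invented example, system_a receives five of the eight votes, i.e. a preference rate of 62.5 percent.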

In this test, Imagen performed significantly better than DALL-E 2, which the researchers attribute to the better language understanding of the text model, among other things. In most cases, Imagen can translate the instruction “A panda making latte art” into the right motif: a panda pouring milk perfectly into a cup of coffee. DALL-E 2 creates a panda face in the milk froth instead.

On the left are the images generated by Imagen, which show a motif that matches the input in three out of four cases. On the right is the wrong interpretation by DALL-E 2 in four out of four cases. | Image: Google

Imagen also achieved a new best score (7.27) in a benchmark using the COCO (Common Objects in Context) data set, where lower values are better, and outperformed DALL-E (17.89) and DALL-E 2 (10.39). None of the three image models had previously been trained on the COCO data. Only Meta’s “Make-A-Scene” (7.55) is on par with Imagen here, but Meta’s image AI was trained on COCO data.

Move slowly and let things heal

A release of the model is currently not planned for ethical reasons, since the underlying text model contains “social biases and limitations”, which is why Imagen could create “harmful stereotypes”.


In addition, Imagen currently has “significant limitations” in generating images with people in them, including “a general tendency to generate images of people with lighter skin tones and a tendency for images representing different occupations to be consistent with Western gender stereotypes.”

For this reason, Google does not want to release Imagen or similar technology “without further protective measures”. DALL-E 2 also has these problems. OpenAI is therefore rolling out its image AI only very slowly, to around 1000 testers per month. A recent interim review after three million generated images showed that currently only a fraction of the DALL-E images violate OpenAI’s content guidelines.

Jeff Dean, senior AI researcher at Google AI, sees potential for AI to foster creativity in human-computer collaboration. Imagen is “a direction” that Google is pursuing. Dean shares numerous example images on Twitter. More information and an interactive demo are available on the Imagen project page.

Sources: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

