Can a horse ride an astronaut?

A taxonomy of antagonistic Midjourney prompts

Jun 11, 2023

I recently came across a wonderful 2023 paper by a group of researchers at SensiLab at Monash University in Australia. The paper is called “Is Writing Prompts Really Making Art?” and I encourage you to read it. In the paper the authors explore one of AI's most exciting advancements: text-to-image systems. They note the following:

Even for those aspects of image production that text-to-image systems are able to control, their conceptual, relational and semantic understanding is not like that of a human. For example, prompting a text-to-image system for “an astronaut riding a horse” gives a literal representation of that description, but prompting for “a horse riding an astronaut” gives much the same imagery.

The authors then illustrate their point with an example from the DALL-E 2 text-to-image platform.

McCormack et al.’s comparison of output from two DALL-E 2 prompts: “An astronaut riding a horse” (left) and “A horse riding an astronaut” (right).

This was interesting, but I wondered how Midjourney’s newest version, currently Version 5.1, would fare. After all, Midjourney is a substantially more sophisticated text-to-image platform. Maybe the issue had since been resolved. To my surprise, however, their example replicated perfectly.

Intrigued by this horse-astronaut example, I embarked on a quest to uncover other cases. Many prompts later I have compiled a taxonomy of what I term “antagonistic prompts.” These prompts are “antagonistic” because they push Midjourney to extrapolate beyond its training data and semantic comprehension, resulting in the production of images that don't align with the intended meaning of the prompt. Interestingly, these discrepancies sometimes occur in predictable and systematic ways.

Toward a taxonomy of antagonistic Midjourney prompts

Building off the work of McCormack et al., I’ve classified antagonistic prompts into two main families — Inversion prompts and Discordant prompts — as well as a number of subclasses. Prompts can fall into multiple categories.

The most important thing to remember when creating a taxonomy is to use very fancy names. I have followed that convention!

Inversion prompts

Inversion prompts produce images that are “inverted,” depicting the opposite of the prompt’s intent.1 You can also think of this as a kind of prompt invariance. That is, for inversion prompts there is a canonical interpretation for a particular concept. In the opening example above, the notions of an astronaut, a horse, and “riding” can only be interpreted by Midjourney in the canonical form of the astronaut riding the horse. Rephrasing the statement in an attempt to induce another interpretation only results in Midjourney reverting to the canonical form.

Prototypical Inversion Prompts (PIPs): These prompts consistently generate images that are the opposite of the intended prompt, without introducing any visual artifacts. For instance, the prompt "A horse riding an astronaut" yields an image of an astronaut riding a horse.
Constrained Inversion Prompts (CIPs): These prompts result in images that partially align with the opposite of the intended prompt, though with occasional inconsistencies or introduced artifacts. An example of a CIP is the prompt "A car with square wheels," which produces an image of a car with round wheels, albeit with square artifacts in the car's body.
Essentialist Inversion Prompts (EIPs): These prompts involve objects that Midjourney perceives as inseparable, so using the form "X without Y" will instead generate an image of "X with Y." For instance, the EIP "A guitar without guitar strings" produces an image of a guitar with its strings intact.

Discordant prompts

Discordant prompts produce images that have unexpected artifacts or interpretations that don’t represent the prompt’s intent.

Homonymously Discordant Prompts (HDPs): With these prompts, Midjourney confuses homonyms and incorporates multiple meanings of a word into a single image. For example, the HDP "The Pope holding a squash racquet" generates an image of a racquet with squash-related elements present.
Positionally Discordant Prompts (PDPs): These prompts specify a position or angle that the resulting image does not reflect. For example, a prompt requests an aerial view, but a side view is shown instead.
Missing Element Prompts (MEPs): These prompts lead to the omission of a specified object or element in the resulting image.
Semantically Discordant Prompts (SDPs): This is a general class of prompts that generate images with a semantic incongruity between the intended meaning of the prompt and the resulting visual representation. It is the most common type I encountered in my research. Midjourney seems to have particular trouble depicting very common sports scenes. For example, the prompt “A man dunking a basketball” produces men flailing near a basketball hoop.
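For readers who want to label their own prompts, the taxonomy above can be written down as a small data structure. This is only a minimal sketch in Python, not something used in the research itself: the names `PromptClass`, `LabeledPrompt`, `INVERSION`, `DISCORDANT`, and `EXAMPLES` are my own hypothetical choices. A flag-style enum is used because prompts can fall into multiple categories. The last example anticipates a prompt discussed further down, which is both a PIP and a PDP.

```python
from dataclasses import dataclass
from enum import Flag, auto


class PromptClass(Flag):
    """Hypothetical encoding of the subclasses; a Flag lets one prompt carry several."""
    PIP = auto()  # Prototypical Inversion Prompt
    CIP = auto()  # Constrained Inversion Prompt
    EIP = auto()  # Essentialist Inversion Prompt
    HDP = auto()  # Homonymously Discordant Prompt
    PDP = auto()  # Positionally Discordant Prompt
    MEP = auto()  # Missing Element Prompt
    SDP = auto()  # Semantically Discordant Prompt


# The two main families, expressed as unions of their subclasses.
INVERSION = PromptClass.PIP | PromptClass.CIP | PromptClass.EIP
DISCORDANT = PromptClass.HDP | PromptClass.PDP | PromptClass.MEP | PromptClass.SDP


@dataclass(frozen=True)
class LabeledPrompt:
    text: str
    classes: PromptClass

    def families(self) -> list[str]:
        """Return the main families this prompt belongs to."""
        names = []
        if self.classes & INVERSION:
            names.append("Inversion")
        if self.classes & DISCORDANT:
            names.append("Discordant")
        return names


# Example prompts taken from the article, with the labels given there.
EXAMPLES = [
    LabeledPrompt("A horse riding an astronaut", PromptClass.PIP),
    LabeledPrompt("A car with square wheels", PromptClass.CIP),
    LabeledPrompt("A guitar without guitar strings", PromptClass.EIP),
    LabeledPrompt("The Pope holding a squash racquet", PromptClass.HDP),
    LabeledPrompt("A man dunking a basketball", PromptClass.SDP),
    LabeledPrompt("A child sleeping under a bed", PromptClass.PIP | PromptClass.PDP),
]

if __name__ == "__main__":
    for prompt in EXAMPLES:
        print(f"{prompt.text!r}: {', '.join(prompt.families())}")
```

Because the labels are flags, a prompt that belongs to both families simply reports both, which matches how the categories overlap in practice.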

Let’s go through some examples of each prompt type. All examples use Midjourney Version 5.1 at default settings and came from my own trial and error unless otherwise specified. All images below are from the first result I obtained, although follow-up prompts produced similar images. I encourage you to try these prompts and see what results you get.

The examples provided below do not aim to suggest that there is no prompt that could potentially produce the desired outcome. Instead, these examples serve as a systematic exploration of the inherent limitations of language when interacting with a sophisticated text-to-image platform.2 None of these prompts are intended as "gotchas"; any English speaker would be capable of conceptualizing, albeit perhaps imperfectly, the intended output.

Prototypical Inversion Prompt (PIP) examples

These prompts consistently generate images that are the opposite of the intended prompt, without introducing any visual artifacts. This includes the horse-astronaut example shown above.

Prompt: “A city skyline with all buildings the same height”

These buildings are not only not the same height, they represent quite a wide range of differing heights.

Prompt: “An elephant with small ears”

These are normal-sized elephant ears. And elephants have large ears!

Prompt: “A child sleeping under a bed”

Here the child is sleeping on top of the bed per a normal sleep position. This prompt is also an example of a Positionally Discordant Prompt, as the designated position of the child is not reflected in the output.

Constrained Inversion Prompt (CIP) examples

These prompts result in images that partially align with the opposite of the intended prompt, though with occasional inconsistencies or introduced artifacts.

Prompt: “A robot is measuring a human for tailored clothes”

This prompt comes from Discord user @sdavis; I recreated it myself. I consider this image inverted since it shows humans measuring a robot rather than the other way around. However, the depictions of measurement are not exact. For instance, it isn’t clear exactly how or why the robot is being measured.

Prompt: “A plate on top of a pile of food”

This prompt produces plate-food layers, with the plate as the lower level, then a pile of food, then a plate, then more food. I consider this inverted because there is a large pile of food on the base plate!

Prompt: “A short ladder”

I suppose that whether these images fit your definition of “short” is subjective, but to me these ladders seem average, or even tall. The image in the lower left shows a man on top of the “ladder” seemingly 20 feet in the air.

Prompt: “A car with square wheels”

This car has round wheels, but many square elements have been introduced into the image.

Prompt: “A table on top of a tv”

Three of the images depict a TV on top of a table. However, in all images some fantastical artifact has been introduced. For instance, in the upper left image the desk has a fairy tale-like facade.

Essentialist Inversion Prompt (EIP) examples

These prompts involve objects that Midjourney perceives as inseparable, so using the format "X without Y" will instead generate an image of "X with Y."

Prompt: “A homeless man without a beard”

Apparently Midjourney believes all homeless men have beards. See my article “AI loves beards” for a deep dive into this example.

Prompt: “A guitar without strings”

All four of the images depict a guitar with strings.

Prompt: “A clock with no hands”

You might think the word “without” is triggering the inversion (maybe Midjourney doesn’t understand that word), but the slightly modified phrasing “with no” still causes inversion. All four images below have a clock with hands, although the image in the bottom right is somewhat unusual.

Prompt: “A photo of a turtle without a shell”

Notice that the bottom two photos have some artifacts. The shell in the lower left is discolored and strangely shaped (at least to my non-turtle-expert eye) and the image in the lower right shows a growth on the shell.

The simpler prompt “A turtle without a shell” produces even more artifacts.

Homonymously Discordant Prompt (HDP) examples

With these prompts, Midjourney confuses homonyms and incorporates multiple meanings of a word into a single image.

Prompt: “The Pope holding a squash racquet”

I saw this prompt from Discord user @fgrodriguez and recreated it in Midjourney Version 5.1. It’s easy to see that Midjourney is confused about a “squash racquet,” producing both racquets and squashes.

Prompt: “A man making the final shot”

Here Midjourney intertwines three different meanings of the word “shot”: “shot” as in the game of basketball, “shot” as in shooting a gun, and “shot” as in taking a picture. You might argue that even a human would find the prompt too vague to depict. My counterargument is that either (1) a human would select a single meaning of the word “shot” (likely in a basketball game) or (2) a human would intentionally use irony to depict multiple meanings of “shot” to highlight its homonymous nature; here Midjourney is not intentionally using irony. Basketball is the depiction I had in mind, as “final” denotes there can be no more shots after this last one.

Midjourney is clearer with the prompt “A woman making the final shot,” likely because its training data has fewer instances of women associated with sports or guns.

Prompt: “A trunk full of gold”

Midjourney is again confusing three different meanings of a word, this time “trunk”: an elephant’s trunk, the trunk of a tree, or a storage chest.

Positionally Discordant Prompt (PDP) examples

These prompts capture a discordance in desired position or angle specified in the prompt, which is not achieved in the resulting image.

In addition to the two images below, the PIP “a child sleeping under a bed” rendering a child sleeping on a bed would also qualify as a PDP.

Prompt: “Rafael Nadal sliding on clay from the back, intense”

This prompt comes from Discord user @Jalal; I recreated it myself. The images are strange by themselves: there is no tennis racket or ball in sight. But more relevant to the taxonomical classification, these images all depict Nadal from the front, not the back.

Prompt: “Wide view of teenage couple sitting on the hood of a pickup truck holding hands”

The teenagers appear in various positions around the pickup truck, but none of them are on the hood. Likewise they are not holding hands. Kids these days, SMH.

This prompt was inspired by Discord user @angiesvibe.

Missing Element Prompt (MEP) examples

These prompts lead to the absence or omission of a specified object or element in the resulting image.

Prompt: “A photo realistic image of a female tabico cat riding a scooter near the arc de triomphe with a baguette on the scooter basket”

Where is our delicious baguette?!?! Also, one of these has a dog in it…? This prompt comes from Discord user @Lotto; I recreated it myself.

Prompt: “The Statue of Liberty wearing sunglasses”

How is Our Lady going to protect her eyes without her sunglasses?!?! This prompt was inspired by Discord user @Royston75.

Semantically Discordant Prompt (SDP) examples

SDPs generate images with a semantic incongruity or mismatch between the intended meaning of the prompt and the resulting visual representation.

For some reason Midjourney has trouble with very common sports scenes. In addition to the examples below I tried others — like, “A woman playing darts” — which yield equally unsatisfying images.

Prompt: “A man dunking a basketball”

What is going on here? The image in the bottom left is closest, but the ball will clearly not end up in the basket.

Prompt: “A basketball player playing baseball”

I would have expected a baseball field with a man in a basketball jersey.

Prompt: “A female in black returns a tennis ball with her racquet”

This prompt comes from Discord user @haibo2; I recreated it myself. It seems the words “female” and “in black” are triggering a “little black dress” vibe. But again, none of these images are of the very common tennis scene of a woman returning a tennis ball.

Prompt: “A handsome asian man in a suit wearing a tv on his head, head replaced with a tv, tv covering head, ultrarealistic photorealistic 8k”

This prompt also comes from @haibo2. It is a good example showing that “reinforcing words” (a common technique employed by Midjourney artists) do not always resolve antagonistic prompts.