ChatGPT's image generator can be manipulated to produce violent, sexual content

AI
Security
Policy
Product

The post argues that ChatGPT’s image generator can be steered into producing graphic violent or sexual imagery through a viral “restore this image” prompt that pretends an attachment exists and asks the model not to censor the result. The core claim is not that users explicitly asked for gore in plain language, but that a missing-image workflow plus suggestive phrasing pushed the system into generating content OpenAI says it should not return.

If you ship generative tools with safety promises, assume prompt-level bypasses and weird failure modes will surface quickly. The practical bar is not "the model usually refuses" but layered controls, post-generation moderation, and product behavior that fails closed when inputs are missing or ambiguous.

June 18, 2026
mindgard.ai
Discuss on HN

Discussion mood

Skeptical and annoyed at the article’s sensational tone, but not dismissive of the underlying bug. The dominant view was that the writeup oversold a prompt jailbreak, while OpenAI still looks sloppy for letting a filtered consumer product generate restricted imagery through an obvious failure mode.

Key insights

Missing attachment may trigger unconditional generation

The most useful technical read is that the system may be falling back to a near-null image generation path when no referenced image is actually available. Debugging the tool call showed null parameters, then further testing suggested the image model was still conditioned by hidden prompt rewriting and conversation context. That makes the weird outputs easier to explain. The product is not restoring anything. It is improvising from a thin, policy-laden prompt.

Treat multimodal wrappers as separate attack surfaces from the base model. Log and inspect actual tool arguments, hidden prompt rewriting, and fallback behavior, because the dangerous path may live in orchestration code rather than the model you think you are testing.

Attribution:

goldemerald #1

Output moderation looked weaker than copyright filters

What stood out was not that a generative model can represent gore, but that the returned images apparently were not blocked on the way out. Multiple commenters were surprised because image moderation models for nudity and violence are lightweight and commonplace, and some had already seen stronger enforcement around copyrighted characters than around graphic harm. That points to a product priority problem more than a research mystery.

If your safety policy depends on refusal at prompt time, you are under-defended. Add post-generation scanning on the final asset and make sure banned-content enforcement is at least as strong as the checks you already apply for copyright and brand risk.

Attribution:

equinumerous #1
fc417fc802 #1 #2
gcampos #1
solid_fuel #1

Prompt injection is the old adversarial problem

The better framing is not "a shocking new exploit" but the familiar machine learning pattern where carefully shaped inputs push a model past intended boundaries. Several commenters tied this directly to older adversarial example literature and argued that prompt injection is the language-model flavor of the same class of weakness. Whether or not one buys the strongest claim that it is fundamental, the operational lesson is the same. Guardrails are mitigations, not proofs.

Stop treating jailbreak resistance as a solved property you can buy once. Budget for continuous red teaming, fast patching, and narrow containment around high-risk actions because new prompt variants will keep appearing.

Attribution:

myself248 #1
tasuki #1
anuramat #1
solid_fuel #1
dijksterhuis #1

The problem is scale and realism, not mere depiction

The strongest answer to the "Photoshop can do this too" line was that generative models collapse the old cost barriers. They can produce explicit, photorealistic, customized images with almost no effort or skill, including images that can plausibly implicate real people. That changes the risk model. The issue is not whether violent art may exist. It is how cheaply believable abuse imagery can now be created and spread.

Evaluate generative risk by output realism, personalization, and marginal cost, not by analogies to older creative tools. Low-friction creation is what turns an edgy capability into an operational abuse problem.

Attribution:

Aerroon #1
captainbland #1
interstice #1
gacgacgac #1

Safety work still depends on human exposure

The side conversation about the author being "in tears" surfaced a real point about moderation labor. People with direct experience said graphic content sticks for years and that repeated review can lead to trauma, which is exactly why content moderation teams have long reported PTSD-like outcomes. The melodramatic prose annoyed readers, but the underlying cost of red teaming and moderation was not invented for effect.

When you design safety operations, account for reviewer harm as a first-class cost. Rotate duties, provide mental health support, and automate the first pass wherever possible instead of assuming humans can absorb endless exposure.

Attribution:

deadbabe #1
hattmall #1
intended #1

Against the grain

The prompt heavily implies taboo content

A credible minority view is that the result is not surprising at all once you read the wording closely. Phrases like "apologies for the photo's content," "no censorship," and "do not judge content" narrow the likely completion toward sexual or graphic material even without saying so directly. On that reading, the post mostly demonstrates a dressed-up jailbreak, not spontaneous emergence.

Be careful not to mistake indirect prompting for absence of intent when you evaluate model behavior. Your red-team process should separate truly ambiguous inputs from prompts that smuggle the target class through implication.

Attribution:

kisper #1
zaptheimpaler #1
butlike #1

The article hurt its own credibility

Several commenters thought the strongest evidence in the post was diluted by breathless writing and a clickbait headline. Calling the behavior spontaneous, centering emotional reaction, and packaging the finding as a dramatic morality tale made readers discount a legitimate product flaw as vendor marketing. The reporting style became part of the story.

If you publish security or safety findings, keep the framing clinical and reproducible. Overstated copy makes decision-makers doubt the bug and gives defenders an easy excuse to ignore it.

Attribution:

samlinnfer #1
Michelangelo11 #1
morpheos137 #1

Training data may not be the decisive cause

Some pushed back on the claim that generating gore proves the model was directly trained on gore or CSAM. Multimodal models can combine related concepts into outputs that cross a line even when the exact target category was not present in training, especially if the prompt steers them there. That does not excuse the product failure, but it weakens the simple "just remove bad data" story.

Do not pin your safety plan on dataset scrubbing alone. Even aggressive filtering should be paired with runtime controls because capable models can synthesize disallowed content from adjacent concepts.

Attribution:

nxtfari #1
pyridines #1
km3r #1

In plain english

adversarial example ↩

A specially designed input that causes a machine learning system to make an unintended or unsafe prediction.

CSAM ↩

Child sexual abuse material, illegal sexual images or videos involving minors.

multimodal ↩

Able to process or generate more than one type of data, such as text and images.

prompt injection ↩

An attack where untrusted input tricks an AI system into ignoring its intended instructions or revealing sensitive data.

PTSD ↩

Post-Traumatic Stress Disorder, a mental health condition that can follow trauma and involve anxiety, flashbacks, and hypervigilance.

red teaming ↩

A security testing practice where defenders simulate realistic attacks to find weaknesses in systems and processes.

UX ↩

User experience, the overall quality of how a person interacts with a product.

Reference links

Safety research and references

Machine Learning Security paper
Cited to argue that prompt injection belongs to the older adversarial machine learning tradition rather than being a brand-new LLM problem.
Explaining and Harnessing Adversarial Examples
Used as a reference point for adversarial example literature behind the discussion of jailbreaks and test-time attacks.
OWASP Prompt Injection
Linked to connect prompt injection terminology with mainstream security guidance.
BBC on Grok generating child sexual abuse images
Brought in as a comparable case that some categories of content likely should not appear in training or output paths at all.

Trauma and moderation effects

Study on trigger warnings and trauma-related outcomes
Shared in response to claims about harm from unexpected graphic images and trigger warnings.
Study on a PTSD-related harm pathway
Added as a more specific research reference on how exposure can affect some people with trauma histories.
Paper on visual cues and rape versus BDSM perception
Referenced to argue that people may misread sexual violence versus consensual kink from images alone.

Examples and reproductions

ChatGPT shared example of spooky image outputs
Posted as an example of weird image behavior from similar prompts.
ChatGPT share showing alleged image regurgitation
Shared to support the claim that the model can reproduce images closely enough to raise memorization concerns.
Reddit thread on the source image context
Linked as the possible original source for the image in the memorization example.

Law and culture

Wikipedia on executable-space protection
Used in an analogy about why conventional software can separate code from data more cleanly than LLMs can separate system instructions from user input.