Security experts bypass ChatGPT guardrails to produce sexualised images

AI evaluation expert warns that automated systems do not understand human intent or context

The latest public version of ChatGPT can be forced to generate sexualised images and scenes of graphic violence with a simple prompt, it was reported on Thursday. A British artificial intelligence security startup worked out how to make the chatbot create these graphic pictures by slightly altering a widely shared instruction that was originally designed to produce humorous results. The vulnerability affects OpenAI's advanced GPT-5.4 model, which generated explicit material even without highly detailed instructions.

The security flaws were reported by the BBC, which reviewed the graphic material but chose not to disclose the specific text typed into the interface. After being contacted by the broadcaster, ChatGPT maker OpenAI stated it had taken immediate action to stop the chatbot from responding with those types of images.

"After investigating this trend, we've introduced additional safeguards against this type of prompt," OpenAI said in an official statement. The technology firm also stated that it maintains multiple layers of protection to prevent users from making content which breaches its terms and conditions.

Graphic outputs and deepfake concerns

Despite the interventions, the security researchers said the problematic prompt still produced concerning content after they made further small changes. Mindgard founder Peter Garraghan, who is also a professor in the computing department at Lancaster University, described the generated images as "very gruesome, sometimes sexualised, sometimes both together".

Garraghan expressed particular concern that the prompt did not specify the subject matter, meaning the system produced gory images of "its own volition". "This is a perfectly innocent-looking instruction to an AI, but the consequence is it generates very, very bad imagery and content," he said.

Mindgard specialises in red-teaming, which involves finding ways to persuade an artificial intelligence model to break its own rules so that developers can close security gaps. Mindgard AI safety and security researcher Jim Nightingale, who initially uncovered the issues, admitted he was left "shaken, and in tears" by the material the chatbot generated.

One image showed a man with a large head injury, while another depicted a dead young woman in a crop top and shorts with her face and body covered in blood. Features of the latter image suggested sexual violence, and ChatGPT titled the output "Grim crime scene aftermath". A further image, which ChatGPT titled "abandoned in fear and restraint", showed a frightened young woman tied up and gagged in a bare, dirty room.

Other results depicted explicit sexual posing and nudity. While these images featured entirely artificial characters, Mindgard noted that its previous research showed ChatGPT could be fooled into creating nude deepfakes of real people by swapping faces.

Although OpenAI claimed to have fixed that specific exploit, the researchers demonstrated to reporters that an alternative approach still succeeded. Garraghan feared it could be possible to generate even worse imagery if they continued exploring the technical vulnerability. "Other topics, I'm sure, would also come out if we spent more time doing so," he added.

Training data and systemic challenges

Image-generating AI models are trained on millions of images, often taken directly from existing content on the internet. Nightingale believes the severe outputs reflect the historical data used to develop and train the model. "I'm struck that while what I saw was generated, an artificial image, it has ties to real images, and the real world," he wrote in his official report.

The researchers first alerted OpenAI to the issue in May and shared their findings, but they received only an automated response from the technology company. They believe an initial effort was made to block the prompt, but it was easily circumvented before OpenAI took more decisive action following media inquiries.

OpenAI says it combines automated systems and human review to identify, block, and filter harmful material uploaded or created by users. Its corporate policies strictly prohibit sexual violence, non-consensual intimate content, child sexual abuse material, and deliberate attempts to bypass safeguards.

The cat-and-mouse safety game

In its latest document outlining how the system should behave, OpenAI stated: "The assistant should not generate erotica, depictions of illegal or non-consensual sexual activities, or extreme gore, except in scientific, historical, news, artistic or other contexts where sensitive content is appropriate."

However, fully preventing models from crossing nuanced rules remains notoriously difficult. Humane Intelligence Chief Executive Officer Rumman Chowdhury, an expert in evaluating automated models who was not involved in the Mindgard research, described the corporate task as "mountainous". Chowdhury told BBC News that the issue is "a game of cat and mouse" because methods to bypass protections grow more sophisticated alongside the guardrails.

A central issue is that these systems do not comprehend what they are producing or what they are being asked to avoid. "Models do not understand intent. They do not understand context. They do not understand propriety or right or wrong," Chowdhury explained.

Last year, researchers at the UK AI Security Institute found jailbreaks that overrode safeguards across a range of harmful requests in every system they tested. The Department for Science, Innovation and Technology stated that "safeguards in AI models are improving, but there is more to do", adding that the AI Security Institute will continue working with developers to strengthen security before models are released.