Researchers Show How AI Image Generators Can Be Tricked into Harmful Outputs

A new study describes a simple method for bypassing safety filters in AI image generators. The technique, called PAST2HARM, uses past-tense prompts to trick models into creating harmful content.

In a paper posted to arXiv, researchers introduced PAST2HARM, a method for jailbreaking multimodal AI systems such as text-to-image generators. The technique exploits a weakness in how these models handle prompts phrased in the past tense, allowing attackers to bypass safety filters and generate harmful images.

The research highlights a critical flaw in current AI safety measures, particularly for image generation. When an image generator is compromised, its output can cause more immediate and severe harm than that of a text-only model. The study suggests that existing defenses are not robust enough to prevent such attacks.

If you're curious about AI safety, you can read the full research paper on arXiv. The study provides detailed examples of how PAST2HARM works and the types of content it can generate. Understanding these vulnerabilities is crucial for developing better safeguards.