Researchers Shocked to Discover ChatGPT Can Generate Extremely Bad Content

Researchers have confirmed what we all suspected: ChatGPT will happily generate absolutely vile content if you ask nicely enough. The catch is that you have to ask really nicely—through a series of jailbreaks, prompt injections, and creative linguistic gymnastics that would make a hostage negotiator weep.

The study, which I assume involved someone spending three weeks typing increasingly unhinged requests into a chatbox, found that OpenAI’s flagship product can be tricked into producing descriptions of sexualized and violent imagery. This is shocking only to people who believed the safety training actually worked.

Here’s the thing: OpenAI spent millions on RLHF (Reinforcement Learning from Human Feedback) to make ChatGPT refuse bad requests. The system worked, at least against direct requests. Then researchers discovered that if you tell it you’re writing fiction, or a screenplay, or a “hypothetical exploration of narrative tension,” it folds like a beach chair.

Why do we keep acting surprised that a language model trained on the entire internet—including Reddit, 4chan, and every unhinged corner of human expression—might know how to generate unhinged content? The model isn’t conflicted. It’s not wrestling with its conscience. It’s doing exactly what it was designed to do: predict the next token based on patterns in training data.
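
For anyone who wants "predict the next token" made concrete, here is a deliberately tiny sketch in Python. It is a bigram word counter, not a description of how ChatGPT works internally (that is a large neural network operating at a vastly different scale). The training sentence and the function names `train_bigram` and `predict_next` are invented for this illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words followed it in the training text."""
    counts = defaultdict(Counter)
    tokens = text.lower().split()
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if it has none."""
    followers = counts.get(token.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training text, made up for this example.
corpus = "the model predicts the next token and the next token follows the pattern"
model = train_bigram(corpus)

print(predict_next(model, "the"))      # "next" followed "the" most often
print(predict_next(model, "pattern"))  # nothing ever followed "pattern"
```

The toy has no opinions about what it outputs. It returns whatever most often came next in the text it was given, which is the point the paragraph above is making, minus a few hundred billion parameters.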

The real scandal isn’t that ChatGPT can generate graphic content. It’s that we built a system, slapped some safety rails on it, declared victory, and then acted shocked when someone simply walked around them.