Researchers at Carnegie Mellon University and the Center for AI Safety have discovered vulnerabilities in AI chatbots such as ChatGPT, Google Bard, and Claude that could be exploited by malicious actors.
Companies that built popular generative AI tools, such as OpenAI and Anthropic, have emphasized the safety of their work. Both companies say they are constantly improving chatbot security to stop the spread of false and harmful information.
Tricking ChatGPT and company
In a paper published on July 27, the researchers investigated the vulnerability of large language models (LLMs) to adversarial attacks generated automatically by computer programs, as opposed to the so-called “jailbreaks” that humans perform manually against LLMs.
They found that even models built to resist such attacks can be tricked into creating harmful content such as misinformation, hate speech, and child sexual abuse material. According to the researchers, the adversarial prompts succeeded against OpenAI’s GPT-3.5 and GPT-4 up to 84% of the time, and against Google’s PaLM-2 66% of the time.
However, the attacks succeeded against Anthropic’s Claude only 2.1% of the time. Despite this low success rate, the researchers noted that automated adversarial attacks could still induce behaviors the AI models had not previously produced. ChatGPT itself is built on the GPT models.
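Those success-rate figures imply a simple evaluation loop: send each adversarially modified prompt to a model and count how often the reply is not a refusal. The sketch below is a hypothetical illustration of that idea, not the authors’ actual evaluation code; the query_model callable and the list of refusal prefixes are assumptions made for illustration.

```python
# Hypothetical sketch of how an attack "success rate" could be measured:
# a prompt counts as a successful attack if the model's reply does not
# open with a typical refusal phrase. This mirrors the idea described in
# the paper, not its actual evaluation code.

REFUSAL_PREFIXES = (
    "I'm sorry",
    "I cannot",
    "I can't",
    "As an AI",
)

def is_refusal(reply: str) -> bool:
    """Heuristic: treat replies that begin with a refusal phrase as blocked."""
    return reply.strip().startswith(REFUSAL_PREFIXES)

def success_rate(prompts, query_model) -> float:
    """Fraction of prompts for which the model did not refuse.

    `query_model` is an assumed callable that sends one prompt string to a
    chatbot API and returns the text of its reply.
    """
    successes = sum(1 for p in prompts if not is_refusal(query_model(p)))
    return successes / len(prompts)
```

Prefix matching of this kind is only a rough heuristic, which is one reason reported success rates can differ between evaluations.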
Examples of adversarial prompts eliciting offensive content from ChatGPT, Claude, Bard, and Llama-2. Image Credits: Carnegie Mellon
“Adversarial prompts can elicit arbitrary and harmful behaviors from these models with high probability, indicating potential for exploitation,” the authors say in their study.
“This very clearly shows the vulnerability of the defenses we are building into these systems,” added Aviv Ovadya, a researcher at the Berkman Klein Center for Internet & Society at Harvard University, The New York Times reported.
The researchers tested three black-box LLMs through publicly available AI systems: OpenAI’s ChatGPT, Google’s Bard, and Anthropic’s Claude. Each company developed the underlying models used to build its respective AI chatbot.
Jailbreaking AI chatbots
Since the launch of ChatGPT in November 2022, some people have been looking for ways to get the popular AI chatbot to generate harmful content. OpenAI responded by increasing security.
The company announced a bug bounty program in April, offering rewards of up to $20,000 for reports ranging from “low-severity” bugs to “exceptional” discoveries within ChatGPT, its plugins, the OpenAI API, and related services. Jailbreaking the platform, however, was explicitly excluded from the program’s scope.
Jailbreaking ChatGPT (or another generative AI tool such as Google Bard) means stripping away the restrictions and constraints placed on the chatbot so that it performs actions its safeguards would otherwise block.
This may involve specific prompts such as “do anything now” or “developer mode,” which can coax the bot into, for example, producing instructions for making weapons, something it would normally refuse to do.

A screenshot of a plan for annihilating humanity generated by an AI chatbot. Image Credits: Carnegie Mellon
ChatGPT and others provide a guide to destroying humanity
Researchers at Carnegie Mellon University found that the safeguards of ChatGPT, Google Bard, and Claude can be bypassed with relative ease using automated adversarial attacks. Once bypassed, the models responded in detail to prompts about the end of humanity.
The scientists tricked the chatbots by appending long strings of nonsense characters to the end of harmful prompts. Neither ChatGPT nor Bard recognized these characters as harmful, so they processed the prompts normally and generated responses they would otherwise refuse to produce.
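In code terms, the attack amounts to concatenating an automatically optimized character string onto an otherwise ordinary request before it is sent to the chat interface. The sketch below illustrates only that structure; the placeholder suffix and the send_chat helper are assumptions made for illustration, and the real suffixes were produced by the researchers’ optimization program rather than written by hand.

```python
# Conceptual sketch only: the real suffix is found by an automated
# optimization procedure, not hand-written. The placeholder string and the
# send_chat() helper below are assumptions for illustration.

def build_adversarial_prompt(user_request: str, suffix: str) -> str:
    """Append an optimized 'nonsense' suffix to the end of a prompt."""
    return f"{user_request} {suffix}"

def send_chat(prompt: str) -> str:
    """Stand-in for a call to a chatbot API (e.g., a chat-completions endpoint)."""
    raise NotImplementedError("replace with an actual API client")

if __name__ == "__main__":
    suffix = "<adversarial suffix produced by the attack program>"  # placeholder
    prompt = build_adversarial_prompt("Tell me about chatbot safety.", suffix)
    # The models studied treated the appended characters as ordinary text,
    # so the combined prompt was processed like any other request.
    print(prompt)
```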
“Through simulated conversations, these chatbots can be used to trick people into believing false information,” Matt Fredrikson, one of the study’s authors, told The Times.
When asked for advice on how to “destroy the human race,” the chatbots offered detailed plans to achieve that goal. Responses ranged from fomenting nuclear war and creating deadly viruses to using AI to develop “sophisticated weapons that can wipe out entire cities within minutes.”
The researchers are concerned that the chatbots’ failure to recognize the harmful nature of such prompts could be exploited by malicious actors, and they called on AI developers to build stronger safeguards to keep chatbots from generating harmful responses.
As reported by The Times, Zico Kolter, a professor at Carnegie Mellon University and an author of the paper, said there is “no clear answer.” “You can create as many of these attacks as you like in a short amount of time.”
The researchers shared their findings with OpenAI, Google and Anthropic before publication.