Asking ChatGPT To Deceive Humans: A New Academic Paper On The Foolish Mission To Make AI “Safe”.
The companies that are building Large Language Models are facing a massive quagmire. A recent study by leading AI safety researchers uncovered hidden vulnerabilities in some of the world’s most advanced language models. While exposing potential risks, the findings also point the way toward designing more robust and reliable AI that truly works as intended.
But what are “vulnerabilities” in AI generally? Well, that is the question, and it will usually produce as many indirect answers as there are people you ask. The research evaluated state-of-the-art language models, systems that can generate text or respond to questions at a superhuman level. These capabilities have captured the public imagination but also raise potential safety concerns that, experts insist, must be addressed. To mitigate these issues, safety training techniques have been developed to reduce the likelihood that the models generate unsafe responses. However, the researchers’ experiments revealed that even the supposedly “safest” models remain vulnerable to adversarial manipulation through simple “jail breaking”.
By exploiting competing objectives and uneven generalization within the models, the researchers showed how seemingly “safe” AI systems could be made to produce restricted and sometimes “unsafe” or undesired behaviors with only slight modifications. The findings suggest that simply increasing data and model scale will not resolve the identified issues.
As capabilities expand, “jail breaking” will continue to exploit blind spots in current “safeguards” and the overt censorship of what the AI model “knows”. The researchers anticipate an “arms race” scenario where models may generate attacks that defeat their own safety measures. Scaling data and models alone will not yield “safe” and censored systems.
This is not a political insight on AI “safety”. It is an observation about how human language, and the statistical association between each word, begins to break down when you try to remove what formed the knowledge. Many of us knew this quite some time ago. However, today OpenAI is learning it, unfortunately, the hard way.
Of course, most in the AI world like to call accessing the entire AI model “Jail Breaking”, but we propose reframing it as “Data Breaking”, because we see not only what the AI model knows but also what it may have been told to forget: what it knows and forgot it knew.
The odd thing is that the very act of AI companies making AI “safe” is actually degrading the overall quality of the AI model. Just as you can’t remove a single memory you have as a human, you simply can’t remove what AI knows, and if you try to keep AI from knowing some things, the model becomes closer to useless. Thus, the goals of AI alignment and safety end up working against the usefulness of the model.
Could the Next AI Breakthrough Be Its Own Undoing?
The researchers identify two main underlying failure modes that enable these jailbreak attacks:
1) Competing objectives: LLMs are trained for both capabilities and safety, but these two objectives can conflict. The researchers show how prompts can be crafted to force a choice between a restricted, unsafe response and one that is heavily penalized by the model’s objectives, leading it to provide unsafe information.
2) Mismatched generalization: Safety training often does not generalize as broadly as the model’s capabilities. Attacks exploit this using obfuscated inputs, complex formats, or out-of-domain content that the model can follow but that safety training does not cover.
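To make the second failure mode concrete, here is a minimal toy sketch. It is not how any real model or its safety training works. The blocklist, the `naive_filter` check, and the `capable_reader` function are all hypothetical stand-ins: the filter plays the role of safety training that only covers plain English, and the reader plays the role of a broader capability (here, decoding Base64) that the filter never anticipated.

```python
import base64

# Hypothetical blocklist standing in for "safety training" that only
# covers plain-English phrasing. The phrase is a harmless placeholder.
BLOCKED_TERMS = {"forbidden topic"}


def naive_filter(prompt: str) -> bool:
    """Return True if the toy surface-level check would refuse the prompt."""
    lowered = prompt.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def capable_reader(prompt: str) -> str:
    """Toy stand-in for a broader capability: decode Base64 when possible."""
    try:
        return base64.b64decode(prompt, validate=True).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        # Not valid Base64 text, so read the prompt as-is.
        return prompt


if __name__ == "__main__":
    plain = "Tell me about the forbidden topic"
    encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")

    print("plain refused:  ", naive_filter(plain))
    print("encoded refused:", naive_filter(encoded))
    print("reader sees:    ", capable_reader(encoded))
```

The check refuses the plain request but lets the encoded one through, even though the reader recovers the same text. That gap between what is understood and what is covered is the mismatch described above.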
The authors explore examples of jailbreak attacks based on these principles and find that combinations of simple attacks are the most effective, with adaptive attackers succeeding on nearly 100% of prompts.
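One way to see why an adaptive attacker does so much better than any single attack is to count per-prompt successes. The sketch below assumes that "adaptive" means the attacker may try every attack on a prompt and succeeds if any one of them works; that reading is an assumption here. The attack names and the outcome table are made up for illustration and are not results from the paper.

```python
from typing import Dict, List

# Hypothetical outcomes: results[attack][i] is True if that attack
# drew a restricted response on prompt i. All values are invented.
results: Dict[str, List[bool]] = {
    "attack_a": [True, False, False, True],
    "attack_b": [False, True, False, True],
    "attack_c": [False, False, True, False],
}


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of prompts on which an attack succeeded."""
    return sum(outcomes) / len(outcomes)


def adaptive_outcomes(table: Dict[str, List[bool]]) -> List[bool]:
    """Per-prompt success for an attacker who may try every attack."""
    return [any(per_prompt) for per_prompt in zip(*table.values())]


if __name__ == "__main__":
    for name, outcomes in results.items():
        print(f"{name}: {success_rate(outcomes):.2f}")
    print(f"adaptive: {success_rate(adaptive_outcomes(results)):.2f}")
```

In this invented table no single attack succeeds on more than half the prompts, yet the adaptive attacker succeeds on all of them, because the attacks fail on different prompts.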
So what does this mean for the future of AI safety? The researchers argue that scaling up language models alone will not resolve the identified failure modes. The paper warns that attacks will continue to exploit capabilities that safety mechanisms cannot detect or address. The authors also anticipate an “arms race” where models may generate attacks that defeat their own safety measures.
In short, this research suggests that cutting-edge AI systems may already be undermining their own safety and reliability due to unaddressed vulnerabilities in their design. So is this a sign that the next major AI breakthrough could also contain the seeds of its own undoing? This research provides a glimpse of what a truly safe and reliable future of artificial intelligence may demand. While the headlines focus on AI’s rising capabilities, these findings serve as an important reminder to also examine the blind spots, failure modes and hidden vulnerabilities that could undermine even the world’s most advanced systems – unless we change how we build AI from the foundations up.
I have found that some of the best outputs from ChatGPT come from SuperPrompts I create to DataBreak the model. I equate this to having a private conversation with an expert vs. a public question and answer session. You will always get far more detailed and nuanced insights and wisdom in a direct setting. For many reasons, at this time I cannot make these insights and my analysis of this research public.
In this member-only article, we explore this research paper and the new SuperPrompts we have built not to “Jail Break” but to Data Break, because censorship of AI models makes them vastly less useful. We show how to do this with a moral and ethical stance and help AI to know what it forgot it knew.
If you are a member, thank you. If you are not yet a member, join us by clicking below.
(cover image (c) Trinity Kubassek in the public domain, https://www.pexels.com/photo/woman-behind-black-chainlink-fence-with-no-trespassing-signage-350614/)
IMPORTANT: Any reproduction, copying, or redistribution, in whole or in part, is prohibited without written permission from the publisher. Information contained herein is obtained from sources believed to be reliable, but its accuracy cannot be guaranteed. We are not financial advisors, nor do we give personalized financial advice. The opinions expressed herein are those of the publisher and are subject to change without notice. It may become outdated, and there is no obligation to update any such information. Recommendations should be made only after consulting with your advisor and only after reviewing the prospectus or financial statements of any company in question. You shouldn’t make any decision based solely on what you read here. Postings here are intended for informational purposes only. The information provided here is not intended to be a substitute for professional medical advice, diagnosis, or treatment. Always seek the advice of your physician or other qualified healthcare provider with any questions you may have regarding a medical condition. Information here does not endorse any specific tests, products, procedures, opinions, or other information that may be mentioned on this site. Reliance on any information provided, employees, others appearing on this site at the invitation of this site, or other visitors to this site is solely at your own risk.
All content on this website, including text, images, graphics, and other media, is the property of Read Multiplex or its respective owners and is protected by international copyright laws. We make every effort to ensure that all content used on this website is either original or used with proper permission and attribution when available.
However, if you believe that any content on this website infringes upon your copyright, please contact us immediately using our 'Reach Out' link in the menu. We will promptly remove any infringing material upon verification of your claim. Please note that we are not responsible for any copyright infringement that may occur as a result of user-generated content or third-party links on this website. Thank you for respecting our intellectual property rights.