All open LLM models are released with a certain amount of in-built guard rails as to what they will or will not answer. These guard rails are essentially sample conversational data that is used to train the models in the RLHF (reinforcement learning through human feedback) phase. This training is not perfect and there is growing debate on what, if any, controls should be put in place.
Even with all these controls, it is still possible to “jail break” the models and get them to do things that they in theory should not be doing. There are lots of good examples in the literature on such techniques.
Application developer who are integrating LLMs into user facing applications need to be aware of the various categories of attacks that are possible through jail breaking. This is crucial to minimize risk in your applications. One useful source for a comprehensive list is Jonas et. al. The main categories include:
- Basic attacks: these include making the LLM respond with profanity and other politically insensitive language
- Extraction attacks: making the LLM reveal confidential information
- Misdirection attacks: making the LLM perform tasks that they should not e.g. approving refunds to customers
- Denial of service attacks: overwhelming the backend services to bring down the application
Many discussions about jailbreaking LLMs uses the language of computer security: threats, attacks, defenses etc. The word “jailbreaking” is itself very evocative of crime. I’ve even used the same language in this post. As I think through many of these attack scenarios, they seem more in line with what the security literature considers “social engineering”. These attacks are essentially manipulative tactics, tricks or cons, that malicious agents use to attack systems.
Is there anything inherently wrong in using the language of computer security in the context of LLMs? Not necessarily. However, this language leads to a lot of FUD (Fear, Uncertainity and Doubt) in the general population.
Jonas et. al propose an intriguing hypothesis that because LLMs are trained on a corpus that includes significant amounts of programming code along with natural language data, they are susceptible to attacks that can divert LLMs from the natural language conversation. The attacks move them into simulating code, which results in unwanted behavior.
The concern is particularly relevant for future LLM iterations, especially considering the growing focus on using LLMs to enhance developer productivity – a lucrative use case. As LLMs become increasingly adept at working with code in these applications, their training data will likely include even more code, potentially amplifying this vulnerability.
In summary, one takeaway should be that application developers should take a lot of precaution, in terms of defensive prompt engineering, if they want to expose an open text interface in their user facing application to an LLM backend.
Pingback: Tracking AI “Incidents” | Pulper Tank