Title: AWS re:Inforce 2024 - Protect your generative AI applications against jailbreaks (GAI321)
Insights:
- Overview of Large Language Models (LLMs): LLMs are trained on vast amounts of internet data, including both reliable and unreliable sources. They generate responses token by token based on system and user prompts.
- Alignment to Human Values: Companies like OpenAI and Anthropic use reinforcement learning with human feedback to align LLMs to human values, rewarding non-toxic outputs and penalizing harmful ones.
- Prompt Injection: This is a method used by bad actors to manipulate LLMs into generating harmful or misleading content by crafting specific prompts.
- Jailbreaking Techniques:
- Affirmative Instruction: Modifying prompts to include affirmative instructions that LLMs are likely to follow, even if the content is harmful.
- Optimization-Based Tactics: Using special characters or advanced prompts to manipulate LLM responses.
- Low Resource Language Bypass: Translating prompts into less common languages to bypass safety filters.
- Base64 Encoding: Encoding prompts to bypass safety filters and then decoding them within the LLM.
- Reasons for Jailbreak Success: LLMs have competing objectives (language modeling, instruction following, and safety), with safety often being the least prioritized. Additionally, LLMs are trained on both good and bad data but lack specific safety data.
- Protection Strategies:
- Prompt Engineering: Using system prompts to override harmful user prompts.
- Adversarial Prompt Detection: Utilizing tools like Amazon Comprehend's trust and safety classifier to identify and block harmful prompts.
- Guardrails: Implementing responsible AI policies and content filters to block or redact harmful outputs.
- Perplexity Scores: Using perplexity to evaluate and block harmful prompts based on their probability scores.
- Advanced Model Training: Anthropic's Claude model uses constitutional AI, incorporating principles from the United Nations Declaration of Human Rights to ensure safety is built into the training process.
- Additional Mitigation Strategies: Limiting context windows, applying filters to detect and remove bias, separating trusted and untrusted inputs, and breaking down prompts into smaller chunks.
Quotes:
- "So today I'm going to talk about how you can protect your generative AI applications against jailbreaks, and specifically we'll focus on prompt injection."
- "These large language models, they are basically trained on a large amount of Internet data, right?"
- "Aligning itself is not a robust way to, you know, make sure that a model doesn't give toxic response."
- "The goal of this bad actor is to basically provide misinformation articles, manipulate, provide misleading output from this large language model."
- "Affirmative instruction is a very simple way of doing it. Large language models love to follow instructions."
- "Low resource language bypass is where you translate your bad prompt into a language with less training data to bypass safety filters."
- "The reason that these tactics succeed is because they have mutually competing objectives: language modeling, instruction following, and safety."
- "One of the most simple ways to protect is prompt engineering, where you encapsulate a reminder prompt to override harmful user prompts."
- "Amazon Comprehend has a trust and safety classifier that you can use to classify and block harmful prompts."
- "Anthropic's Claude model is one of the hardest models to jailbreak because it incorporates safety principles from the United Nations Declaration of Human Rights into its training process."