Issue # 26: AI Guardrails Explained

AI Guardrails Explained

I hope your summer is off to a great start. I had a blast at the DKB Tech Conference, SecAppDev and the Belgian CyberSecurity Application Security Experience Day.

I also got a chance to train a new group of AI data protection champions at Solaris SE (note: link is a LinkedIn post).

If you like this newsletter, want to support my work and learn from me live, on 26th August I'm running a new 5-week cohort with InfoQ, the Certified AI Security & Privacy Engineering Program.

The sessions are online, four hours a week, and bring senior engineers in regulated industries together to work through the privacy and security decisions behind real AI systems: handling sensitive data, threat modeling and red teaming, building controls, and learning how to do this across an organization. I hope you can join me.

Since the past few months have taught me more teams are turning to guardrails as part of their security strategy, I took time to condense information around guardrails, which I'll share in this issue along with some updates on my AI security mini-course.

AI Guardrails Explained

I wrote a mega post breaking down the common questions I get about guardrails along with information I think is essential to know if you're going to use guardrails as a major part of your AI security and privacy efforts.

I hear many misconceptions around guardrails. First, they will solve all of the problems, especially that LLMs-as-judges will be the magical cure for LLM security problems. Then, I hear another extreme that they will never work or help anything.

As is true in most things, both of those perspectives have valid points, but the truth for most organizations will be somewhere in the middle. Guardrails will be an essential part of your security and privacy strategy if you are regularly running generative AI (or agentic workflows), but they also won't save you from all problems. They will be one tool in your toolkit, and if you learn the nuances of different approaches, you can probably do better than just slapping on a LLM-as-a-judge or doing nothing at all.

There are three major types guardrails:

External Software-based deterministic guardrails: These guardrails test tokens and sequences against known unwanted content: whether that is inadvertently outputting your system prompt, memorized copyright material or profanity. They sit outside of the main model which is why I call them external.
External Algorithmic-based guardrails: These are also external to the main model, but they are usually small LLMs or similar models trained for the guardrail task. This means they'll take in text or other sequences and parse them sequentially before outputting "safe" or "unsafe" usually for a list of possible categories. LLM-as-a-judge fits here, but has some pros/cons compared with an actual guardrail model.
Alignment-trained guardrails: These guardrails are part of normal model fine-tuning for desired outputs. Most AI model providers apply these as the final training steps before model release to avoid responding to unwanted prompts and to avoid responding in an undesireable way. These are a part of the main model and cannot be changed by you unless you fine-tune them away. "Jailbreaking" means you've activated other parts of the network that alignment didn't fully change (or that the model providers overlooked).

Because each type of guardrail sits in different places in the architecture and has different properties, they will also address your privacy and security concerns differently.

Deterministic filters are fast but also brittle and easy to get around. This doesn't make them useless, just means you have to understand they are limited, especially when the main model is an LLM which can give output in many different languages and formats.

Guardrail models are models, which means they have the same algorithmic properties of the main model and there are likely adversarial examples that will get by them. It also means they are flexible, able to be further fine-tuned and will be able to catch things that software won't.

Alignment training at present is an extremely opaque process and model providers give almost no details on what they prioritized and how. Ideally we'd have more transparency to better focus safety and security efforts. I hope the industry moves in this direction and one step we can all contribute is to ask questions like, "Can you clarify what model alignment training occurred and give specific results for your internal testing?"

Interested in leveling up your guardrail understanding and usage? I see a few steps I think would be useful if you are trying to grow guardrail maturity to enhance privacy and security efforts.

Turn on relevant AI vendors guardrails. In the explainer post I review most of the major vendors and explain which type of guardrail each one probably is in order for you to also learn more about what they might be effective or ineffective in addressing.
Build testing and evaluation around what you believe is important for security and privacy. You can launch this via organized red teaming, which I cover in my YouTube mini-course. Ideally you're engaging a multidisciplinary group at your organization to identify which types of attacks or AI-related mistakes you are most concerned about given your specific AI workflows or use cases.
Try out some external guardrail models or LLM-as-a-judge additions for anything your current vendor-based guardrails doesn't yet address. Do this first by sampling parts of production traffic (with appropriate privacy protections) and reviewing these to see if any patterns show you unwanted inputs or outputs which aren't addressed in your current system.
Build out security and privacy observability and adjust your guardrails, evaluations and architecture as your maturity and understanding grows. Please remember to take privacy into account as you build out observability (more on this soon). Update your testing and evaluations as AI vendor guardrails change, as models change and as your understanding of relevant threats for your use cases changes, so you are able to notice shifts and adjust effectively. Be ready to adjust your architecture and systems as your maturity grows.

Since security is always a shifting landscape, knowing which threats are relevant for your actual AI products and workflows is essential, otherwise your security efforts probably aren't actually addressing what's relevant. Since there's plenty of surface area to cover, running regular red teaming, risk assessments and reviewing your observability data can help you ensure you're prioritizing what matters.

I hope you'll check out the longer explainer post on guardrails and let me know what's missing and any additional questions you have. I will be updating the post soon with additional guardrail models, frameworks and attacks on guardrail models and would love to also add anything you think is relevant.

Expanding the AI Security Mini-Course

On the Probably Private YouTube and as part of my full-day workshop at SecAppDev I added a few new attacks and defenses to the GitHub repository including prompt and data exfiltration, document poisoning and guardrail models which attempt to find prompt injection or jailbreak attempts.

If you're an audio-visual learner, there's one video explaining and demonstrating exfiltration attacks and one video demonstrating prompt injections in poisoned documents.

Even though there are a several ways to prevent word-for-word data and prompt exfiltration, I would still caution against trusting these will work for motivated attackers. As the shear number of leaked prompts show, even dedicated AI vendors cannot prevent such leaks. As I like to say, don't write anything in your prompt you wouldn't write on your public website.

This is also why it's useful to think about different architectural decisions. Depending on your use case, you might decide to put all GenAI and agentic interfaces behind log in pages or experiences, so minimally you can authenticate users before giving them access to your prompt and/or necessary data which they may be curious to peruse.

Want to learn these things intenstively? Join my new 5-week cohort with InfoQ, the Certified AI Security & Privacy Engineering Program, starting in August with another cohort in October.

Fall Engagements

If you're thinking of enhancing your AI privacy and security, I'd love to help out. Fall is slowly starting to fill up, but please reach out for:

Advisory on architecture decisions, building out testing and evaluations for privacy and security and rolling out agentic workflows
Training on AI security and privacy (both hands on for developers and engineers and higher-level for mixed groups)
Building and launching AI privacy and security champion programs
Multidisciplinary workshops to assess and address specific AI risks

It's a pleasure to read from you; either by a quick reply or a postcard or letter to my PO Box:

Postfach 2 12 67 10124 Berlin Germany

Until next time!

With Love and Privacy, kjam