Hello privateers,
My Practical AI Privacy course is underway, and it's teaching me so much about which topics are pressing in privacy and AI work today. My students are a delightful group, and it's really such a pleasure to work with them for six weeks.
One of my biggest takeaways thus far is that the class doubles as an incubator for new ideas. My students all have small products they are building, and the class creates an open space for ideation on how privacy can work in AI. The capstone projects students are working on range from agentic frameworks with policy layers to local AI for doing your taxes, planning travel, and researching health care.
There is no lack of cool ideas on building things with AI and with privacy. It makes me reflect that much of the public discourse is too focused on an imaginary "left behind" date to see true radical thinking and innovation that could be offered if human rights were a non-negotiable.
Speaking of which, Hugo Bowne-Anderson and I wrote an article, 15 Privacy Questions from AI Builders, with a ton of visuals, references and of course my opinions on many common questions that people building with AI ask. I'd be curious to hear your feedback and thoughts, and also what we might have missed.
In this issue, I'll share some shifts in how I'm thinking about synthetic data. I'll also dive into more thoughts on AI Agent workflows.
If you've read Practical Data Privacy, you'll know that I'm not a huge fan of synthetic data. As a data scientist and machine learning person, I want to use high-quality data and only introduce slight deviations via provable privacy protections (like differential privacy).
In my book, I argue that the small error differential privacy introduces is trivial compared to the noise of synthetic data. If the work is business-critical and internal-only, you are usually using the real data directly anyway.
However, it's clear to me now that there are new use cases where synthetic data can help promote privacy in today's AI workflows. One use case that was not as common when I wrote the book is sending large amounts of data to third-party AI vendors.
This can look like:
These are all cases where having semi-realistic synthetic data can help. But wait! Not all methods are 100% private, which is why I said in the book to be careful...
Let's review three major ways to produce synthetic data:
Fully synthetic "dummy" data: Libraries like Faker create fully synthetic initial data, which you can combine with statistical libraries like numpy and scipy to develop more realistic data factories.
Learned correlations to produce data: Libraries like Synthetic Data Vault help you learn correlations in your data and decide how faithfully you reproduce them. The library comes with evaluation features, so even if you choose dummy data or a mixture, you can use the evaluation library to see how close your synthetic data is on a pairwise or single-column comparison. Just know that tuning these algorithms can end up matching your data too well (reducing privacy guarantees). Make sure you only learn as much as you need to in order to protect privacy.
Deep learning methods: Training your own GAN or other deep learning architecture on your data, or fine-tuning an existing model on it, can also produce synthetic data. In case it's not obvious from my series on memorization, there is a lot of risk here: you might overfit on your data and actually output real data. The other problem is you might underfit and output useless data. This one is hard to get right for both privacy and quality, and results will vary based on your deep learning experience and expertise.
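One crude but useful smoke test for the overfitting risk above is checking whether any generated records are verbatim copies of training records. Passing this check does not prove privacy (near-copies and attribute leakage still matter), but failing it is a clear red flag. A minimal sketch with hypothetical columns:

```python
import pandas as pd

def verbatim_leak_rate(real_df: pd.DataFrame, synthetic_df: pd.DataFrame) -> float:
    """Fraction of synthetic rows that are exact copies of a real row.
    A crude overfitting smoke test -- passing it does NOT prove privacy."""
    real_rows = set(map(tuple, real_df.itertuples(index=False, name=None)))
    hits = sum(
        1 for row in synthetic_df.itertuples(index=False, name=None)
        if row in real_rows
    )
    return hits / max(len(synthetic_df), 1)

# Toy example: the first synthetic row copies a real row exactly.
real = pd.DataFrame({"age": [34, 45, 29], "zip": ["10115", "10117", "10119"]})
synth = pd.DataFrame({"age": [34, 50, 41], "zip": ["10115", "10120", "10122"]})
rate = verbatim_leak_rate(real, synth)  # 1 of 3 rows leaked
```

In practice you'd extend this with near-duplicate checks (e.g. distance to nearest real record) rather than exact matches only.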
I have a Jupyter Notebook and YouTube explainer walking through examples and open-source libraries for each of these methods.
Using a vendor to generate deep learning synthetic data? Ask them what privacy guarantees and measurements they use to evaluate information leakage! And check out Damien Desfontaines' PEPR talk on synthetic data "measures" of privacy to get more inspiration.
What about using an LLM to produce synthetic data? Well, it depends! If you are putting real data in the prompt as an example and you don't know how it was trained, then you're gonna have a hard time analyzing the output anyways. Why not use a statistical method and at least be able to compare it?
But if you just want text dummy data and you're careful to just say simple things in the prompt and not actually put business-critical information there, go ahead! I have an example notebook I used in my Practical AI Privacy class if you wanna see how I prompt for synthetic data.
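To make that concrete, here's a sketch of how I'd construct such a prompt: describe the *shape* of the data in plain language instead of pasting real rows. The schema and field names below are hypothetical, and the prompt wording is just one way to do it:

```python
# Describe the schema in plain language -- no real rows, no
# business-critical details ever go into the prompt.
schema = {
    "ticket_id": "short alphanumeric ID",
    "subject": "one-line customer support subject",
    "priority": "one of: low, medium, high",
}

def synthetic_data_prompt(schema: dict, n_rows: int) -> str:
    """Build an LLM prompt for dummy text data from a field description."""
    field_lines = "\n".join(f"- {name}: {desc}" for name, desc in schema.items())
    return (
        f"Generate {n_rows} rows of fictional customer support tickets as JSON.\n"
        "Do not reproduce any real company, person, or account.\n"
        f"Each row has these fields:\n{field_lines}"
    )

prompt = synthetic_data_prompt(schema, 20)
```

The resulting prompt can go to whatever model you use; the point is that nothing sensitive is in it to begin with.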
I recommend you start small and iterate. Figure out exactly what use cases are relevant and choose one. Once you get an idea of how synthetic data helps your privacy goals, then expand to thinking through how to automate.
When you automate, make sure you start running privacy red teaming (here I mean: reconstruction and reidentification attacks) and privacy testing (i.e. measuring your ability to differentiate between real and synthetic data) to stay on top of basic privacy guarantees you want to maintain.
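For the "differentiate between real and synthetic" test mentioned above, a common baseline is to train a discriminator and look at its cross-validated accuracy: near 0.5 means the classifier can't tell the two apart; near 1.0 means your synthetic data is easy to spot. A sketch with scikit-learn on stand-in numeric data (the distributions and model choice are arbitrary assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-ins: 500 "real" rows and 500 "synthetic" rows of 4 numeric features.
real = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
synthetic = rng.normal(loc=0.1, scale=1.1, size=(500, 4))  # slightly off

X = np.vstack([real, synthetic])
y = np.concatenate([np.zeros(500), np.ones(500)])  # 0 = real, 1 = synthetic

# Cross-validated accuracy of the real-vs-synthetic discriminator.
scores = cross_val_score(
    GradientBoostingClassifier(), X, y, cv=5, scoring="accuracy"
)
distinguishability = scores.mean()
```

Note this measures utility/realism, not privacy directly — you still need the reconstruction and reidentification attacks for that side of the ledger.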
If all goes well, you can then expand to multiple use cases and make the tooling more available for other teams who touch sensitive and personal data. I'd be excited to hear any additional questions you have on getting started and happy to hear how you're using synthetic data for AI use cases.
I dive deeper into these options and also review some other interesting research in a longer blog post on using synthetic data for AI.
I've posted two recent YouTube videos on Claude Code and agentic privacy and security concerns.
My main issue continues to be that we don't yet have a reasonable approach to policy controls or even true sandbox controls for agents. Because many people using agents are doing so at work or to enhance or accelerate their work, this is a huge privacy and security oversight that will not be fixed by better prompts or smarter models.
If you work in privacy or security, one way you can start intervening now is developing sandboxes and sandboxing best practices at work and setting up reasonable defaults for them. That means not throwing every single piece of potentially useful data or code onto a single container and calling it a sandbox, but instead truly starting to piece out the different "jobs" in your agentic workflows and starting from the principle of least privilege.
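As a rough illustration of what least-privilege defaults can look like at the container level, here's a hypothetical Docker Compose fragment for one agent "job" (the service and image names are made up; treat this as a starting point, not a complete hardening guide):

```yaml
# Sketch: a least-privilege sandbox for one agentic job.
services:
  tax-helper-agent:
    image: agent-runtime:latest        # hypothetical image name
    read_only: true                    # no writes outside declared mounts
    network_mode: "none"               # no network until the job proves it needs one
    cap_drop: [ALL]                    # drop all Linux capabilities
    security_opt:
      - no-new-privileges:true
    volumes:
      - ./jobs/tax-helper/input:/data:ro   # only this job's data, read-only
    tmpfs:
      - /tmp                           # scratch space that disappears with the container
```

Each job gets its own service definition with only the mounts and network it actually needs, rather than one big container with everything.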
This also means thinking about how jobs can leak information to one another and reducing the blast radius of such leakage. It's a distributed systems security problem, and not new to computing or engineering!
It's also clear that more nuanced understandings of data and critical system access are necessary; but how? Up to this point we've kind of assumed if you have shell access on a computer, you get to do lots of things on that computer...
Well, for one, you could use synthetic data for agent jobs until it's proven that an agent needs "real data". You could also start to provide microservices around agents that act as data access policy engines and sidecars to the sandboxes/containers you are using.
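The core of such a policy engine can be very small. Here's a hypothetical default-deny sketch — the job names, resources, and the flat allow-list structure are all assumptions for illustration, not a real framework:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AccessRequest:
    job: str        # which agentic "job" is asking
    resource: str   # which data source or system it wants
    mode: str       # "read" or "write"

# Explicit allow-list per job; anything not listed is denied.
POLICY = {
    "travel-planner": {("calendar", "read"), ("flight-search", "read")},
    "tax-helper": {("receipts", "read"), ("tax-forms", "write")},
}

def authorize(req: AccessRequest) -> bool:
    """Default deny: allow only (resource, mode) pairs granted to this job."""
    return (req.resource, req.mode) in POLICY.get(req.job, set())

allowed = authorize(AccessRequest("travel-planner", "calendar", "read"))
denied = authorize(AccessRequest("travel-planner", "receipts", "read"))
```

Run as a sidecar, this is the piece that makes "this job gets this data and nothing else" enforceable rather than aspirational — and it gives you a single place to log every access decision for the observability discussed next.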
Finally, it's clear that this won't work unless we start to address security and privacy observability at scale. This means learning from SREs and Ops/Infra folks so as to avoid common mistakes made with "monitoring". I'm diving deep into observability best practices and research, and really enjoyed the Observability Engineering book from the Honeycomb team. Expect more musings on this topic soon...
How are you using agents and controlling data, system and network access as well as the actual agentic work being processed? I'd be curious to hear your approaches, thoughts and questions.
I'll be in Leuven in June presenting at SecAppDev and at an evening event with the Belgian Cyber Security Coalition, so hope to see you there if you live nearby or are also attending.
In July, I'm honored to be presenting the afternoon keynote at International Workshop on Privacy Engineering, co-hosted with the IEEE European Symposium on Security and Privacy.
Want me to speak at your fall event? Or internally at your company? You can hire me for speaking, training and advisory via my company Kjamistan.
It's a pleasure to hear from you, whether by a quick reply or a postcard or letter to my PO Box:
Postfach 2 12 67, 10124 Berlin, Germany
Until next time!
With Love and Privacy, kjam