PROBABLY PRIVATE

Issue # 13: Video Course launch, studying memorization via security and privacy and personalizing your recommendations

Hello Privateers!

Since I last wrote you, it feels like we are entering a new era of content, politics and privacy worldwide. Sitting from the comfort of Berlin, I want to extend compassionate and caring support if life, work, friends/family and the news are stressful for you right now. 💜

In this issue, you'll learn about:

  • An O'Reilly video course for my book that launched this week
  • How privacy and security research in deep learning exposed memorization properties
  • Building personalized content recommenders for yourself by yourself

O'Reilly Video Course Launch

I'm very excited to announce that my video course on privacy fundamentals and privacy technologies has launched on the O'Reilly learning platform. You can get a 10-day free trial by signing up for a new account. The course is designed to introduce data privacy from a technology, product and personal lens and you can finish it in less than 5 hours.

The overall modules are arranged as follows:

  • Introduction to Privacy: Introduction to privacy concepts, who works in privacy and definitions (i.e. what data should be considered private?).
  • Risks and Controls: Learning about designing and implementing controls via privacy engineering, anonymization and pseudonymization.
  • Local-First Data and Distributed Data Workflows: Thinking through new architectures, where data remains decentralized to allow for better user control and transparency.
  • Trust Modelling and Encryption: What cryptographic models can teach us about trust in the real world and how that translates to technology.
  • Risk Frameworks and Governance: How organizations leverage risk understanding and evaluation to engage and implement privacy across an organization.
  • Privacy Champions and Beyond: Inspiring cultural change and advocacy around privacy, and how to learn more about topics as you grow in your journey.

Some of the modules mirror explicit chapters in my book. If you want a deeper dive, my book is on sale on many platforms and now in 3 languages.

Privacy and Security Research Unveil Memorization Problem

I have two fresh articles for you on how the memorization problem could have been identified earlier -- by looking at privacy and security research!

Differential privacy provides a rigorous standard for data processing: it lets you define, measure and audit how much information a computation can reveal about any individual. Because it quantifies privacy loss, differential privacy research exposed the memorization problem before it was deeply explored in mainstream machine learning research.
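To make that concrete, here is a minimal sketch (in Python) of the core step behind DP-SGD, the most common way to train deep learning models with differential privacy: clip each example's gradient so no single person can dominate an update, then add calibrated noise. The clipping bound and noise multiplier below are illustrative values I picked, not recommendations.

    import numpy as np

    def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
        """per_example_grads: array of shape (batch_size, num_params)."""
        rng = rng or np.random.default_rng(0)
        # Clip every example's gradient to bound any one person's influence.
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
        # Add Gaussian noise scaled to the clipping bound, so outliers cannot
        # be learned (read: memorized) exactly.
        noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
        return (clipped.sum(axis=0) + noise) / len(clipped)

    # Eight fake per-example gradients over three parameters.
    grads = np.random.default_rng(42).normal(size=(8, 3))
    print(dp_sgd_step(grads))

The clipping bound and noise multiplier together determine the privacy loss (epsilon) you can account for across training, which is exactly what makes the privacy measurable and auditable.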

In the differential privacy blog post, I look at one of the early deep learning architectures that supported differential privacy (PATE, 2017), which showed interesting results in its "near-misses". Below are some examples that PATE got wrong but that non-private models got right.

An example of near-misses from PATE, showing the input image above and its label below. Many of them are hard to interpret or even mislabeled.

Reading the images above, would you guess the label underneath correctly? Would you fault someone if they got it wrong? Would you expect an AI model to get all of them correct?

Because differential privacy offers rigorous protection for outliers, these examples were not learned by the model (read: memorized), making its performance "worse". This again highlights the question of how to evaluate models and how to define the utility versus privacy tradeoffs. Now that we know memorization happens, this should be a more active conversation, so that model evaluations can be properly discussed in light of privacy and copyright issues.

The second post explores the field of adversarial machine learning and how it relates to memorization.

Adversarial machine learning is an exciting (sub)field that studies how to break, hack or trick machine learning models and AI systems. By understanding how and when AI systems break, you can also better understand how they work.

Adversarial attacks reveal that the way deep learning systems are architected and trained (with weights, biases, loss methods and algorithms like gradient descent) can be used to directly exploit the system. For example, I can use black-box (or white-box) model outputs to gather enough information to design attacks where inputs are misclassified (for example, where a turtle becomes a rifle).
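As a concrete (and hedged) illustration, here is a sketch of the classic fast gradient sign method (FGSM), one simple attack of this kind: nudge the input in the direction that increases the model's loss. The toy model, input shapes and epsilon below are placeholders, not anything from the posts.

    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, x, true_label, epsilon=0.03):
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), true_label)
        loss.backward()
        # Step in the sign of the input gradient to increase the loss as much
        # as possible within an epsilon-sized budget per pixel.
        x_adv = x + epsilon * x.grad.sign()
        return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in a valid range

    # A toy classifier and a fake "image" just to show the call.
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
    x = torch.rand(1, 1, 28, 28)
    y = torch.tensor([3])
    x_adv = fgsm_attack(model, x, y)
    print((x_adv - x).abs().max())  # perturbation stays within epsilon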

Adversarial defenses reveal other interesting features with regard to memorization and privacy. For example, some early approaches to adversarial defenses identified layer-level manifolds (think of this like a surface in some multi-dimensional space) where you could move the adversarial example towards a manifold that represented a "safe" or "already known" decision space. This approach attempted to "repair" the adversarial example and could also be used to modify outliers or mislabeled examples -- pulling them "back into the fold".

But a simpler and effective solution evolved. By training larger models for more iterations and including adversarial examples as part of training (called "adversarial training"), the models ended up learning the adversarial decision space alongside the normal examples. Aha! Why does it work? Well, probably a mixture of generalization of the adversarial examples and some memorization of them! As you've learned in this series, increasing model parameter size and the number of training epochs both directly relate to memorization.
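Reusing the hypothetical fgsm_attack helper from the sketch above, an adversarial training step could look roughly like this: each batch is trained on both its clean and its adversarial version, so the model also learns the adversarial decision space.

    import torch
    import torch.nn.functional as F

    def adversarial_training_step(model, optimizer, x, y, epsilon=0.03):
        x_adv = fgsm_attack(model, x, y, epsilon)        # from the sketch above
        optimizer.zero_grad()                            # clear grads from the attack
        loss = 0.5 * (F.cross_entropy(model(x), y) +     # clean examples
                      F.cross_entropy(model(x_adv), y))  # adversarial examples
        loss.backward()
        optimizer.step()
        return loss.item()

    # Usage with the toy model from the FGSM sketch:
    #   optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    #   adversarial_training_step(model, optimizer, x, y)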

In the next few months, I'll be writing up how you can address the problem of memorization in AI/ML systems, exploring topics like unlearning, differential privacy in deep learning, auditing models for privacy problems and other cool ideas like creating communally-owned or enthusiastic-consent models.

In related news, the French Data Protection Authority released new guidance on the problem of memorization in AI/ML models with a call to create explicit consent when training large models with person-related data.

Priveedly: An open-source personal content reader and recommender

I open-sourced a small personal project called Priveedly, which I built for my own use. It reads in RSS/Atom feeds, subreddits, Hacker News and Lobste.rs, and then lets you build your own recommender model based on the things you find interesting.
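For a flavor of how a personal recommender like this can work (a rough sketch of the idea, not Priveedly's actual implementation), you can score incoming feed entries with a tiny classifier trained on what you previously marked as interesting. The feed URL and labeled history below are made up.

    import feedparser
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Past reading history: (text, 1 = interesting, 0 = not interesting).
    history = [
        ("differential privacy for deep learning", 1),
        ("celebrity gossip roundup", 0),
        ("memorization in large language models", 1),
        ("10 gadgets you must buy this week", 0),
    ]
    texts, labels = zip(*history)

    vectorizer = TfidfVectorizer()
    classifier = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

    # Score fresh entries from a feed and read the most promising ones first.
    feed = feedparser.parse("https://example.com/feed.xml")  # hypothetical feed
    titles = [entry.title for entry in feed.entries]
    if titles:
        scores = classifier.predict_proba(vectorizer.transform(titles))[:, 1]
        for score, title in sorted(zip(scores, titles), reverse=True):
            print(f"{score:.2f}  {title}")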

Why bother? Well, the entire content landscape seems to be changing more rapidly than expected in the past few months. With shifts in content moderation, likely shifts in privacy policies and the general difficulty of finding useful content without scrolling through endless ads, inserted click-bait or GenAI garbage, I found that curating what I want to read, and how I read it, helped me destress my reading and let me explore news and topics that I probably would have missed otherwise.

But doesn't that create filter bubbles and echo chambers? To be completely honest, I am not sure we've addressed those problems on the big platforms either. Although there is certainly recommender systems research that addresses this issue, I have yet to see a large-scale, production-level recommender in widespread use that fixes the bubble problem. If we were to do that, it might be directly at odds with autonomy ("show me less of this") and privacy ("we noticed that you like X, so now we are showing you Y").

What do you think? Is your experience different? How do you find compromise between curating what you want to read and hear and diversifying your sources?

If you've built or use a solution that's open and available to all and would like some help building personalized recommenders on top of it, send it on over. If you've built your own open, personalized recommenders, send those to me as well!

By the way, you can now also get Probably Private as an RSS feed.

That's all for this issue -- tell me what you liked/didn't like/want more of by hitting reply. If you think someone would enjoy this newsletter (or the new RSS feed), send them a link.

Until the end of March, I'm hosting open drop-in office hours on Mondays if you want to discuss any burning privacy questions in AI/ML/data systems. Drop in and say hi!

With Love and Privacy, kjam