PROBABLY PRIVATE

Issue #10: Memorization in Machine Learning and Multidisciplinary Practices

Memorization in Machine Learning and Multidisciplinary Practices

Hi Privateers,

I'm resurfacing after a summer sabbatical and here to bring you a US-election-free newsletter, should you need something else to read and think about!

In this issue you'll read:

  • my investigations into deep learning memorization, including the first article in a series on the topic
  • my ask for any interesting roles you might know of that fit my work
  • a few thoughts, after several recent discussions, on how to build better multidisciplinary conversations and collaborations in privacy work

Wandering new meadows

I have personal & professional news to share, which you might not have heard if we aren't connected on LinkedIn -- I've left Thoughtworks, finished a mini-sabbatical and am now looking for what's next.

Here are a few things I'm interested in, in case you know of anything that fits:

  • Privacy in machine learning/AI systems
  • German and/or EU public sector
  • Working in German (the language)

Even if you know of a role that hits only one out of three, I'd be curious to talk -- and of course, two or three out of three would be an even better match.

Sabbatical update: I spent this summer swimming in lakes and seas, reading books & research, dealing with some family changes, speaking at conferences about topics like memorization, personal AI and encrypted computation, and running a LAN party focused on building Feminist AI. More on those soon!

Deep Learning Memorization

I gave a talk at PyData Berlin on how models memorize and why. On my blog, I'll be breaking down a few papers on the topic, including the seminal paper by Vitaly Feldman, which mathematically proves that we cannot build large deep learning models (i.e. overparametrized models) without memorizing parts of our training dataset. The first part of the blog series is online, covering why we should inquire about memorization in the first place. Expect new posts regularly in your inbox!

I was on a panel at an online ML conference speaking about this topic and the moderator asked me if it really mattered whether models memorize or not, because humans can memorize and humans even accidentally forget where they learned something.

Indeed, that is the case. But most humans cannot magically recite a poem they have seen three times, most humans cannot draw an image from memory with striking clarity, most humans cannot mimic someone's voice or face well enough to trick others into believing they are that person, and humans certainly cannot do this at scale, across the globe, in multiple locations at once.

According to the latest research, models are estimated to memorize 30% or more of repeated examples, and likely also memorize "outliers". That may not seem like much, but an ML model only knows something is an outlier because it gets a high error on an example it hasn't seen before or sees very rarely. For the kinds of learning problems we are solving now (i.e. text, image, audio, speech), these outliers make up a large percentage of the data we are trying to train with: most natural distributions of examples across many classes follow a "long-tail" distribution, where we end up with as many examples in the tail (or more) as in the most common classes. If we memorize both the things we see a lot and the things we see a little, then we memorize quite a bit.
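To make the long-tail point concrete, here's a minimal sketch (not from any specific paper) that samples a hypothetical dataset of 1,000 classes whose frequencies follow a Zipf-like power law -- a common model for natural data -- and compares how much of the data sits in the "head" versus the "tail":

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dataset: 1000 classes whose frequencies follow a
# Zipf-like power law, a common model for "long-tail" natural data.
n_classes = 1000
ranks = np.arange(1, n_classes + 1)
freqs = 1.0 / ranks          # Zipf with exponent 1
probs = freqs / freqs.sum()

# Draw 100k training examples from this distribution.
samples = rng.choice(n_classes, size=100_000, p=probs)
counts = np.bincount(samples, minlength=n_classes)

# "Head" = the 10 most frequent classes; "tail" = everything else.
head_share = counts[:10].sum() / counts.sum()
tail_share = 1 - head_share
print(f"head (top 10 classes): {head_share:.1%}")
print(f"tail (other 990):      {tail_share:.1%}")
```

Under these (assumed) Zipf parameters, the 990 rare classes together hold more examples than the 10 common ones -- so a model that memorizes rare examples is memorizing a substantial slice of the training data, not a rounding error.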

This wouldn't be a huge problem if you had the rights to all the data, if people had consented to having their work and art reproduced, or if you were only training on your own data or on data from a smaller, trusted group of people -- but it is a huge privacy and creative violation to indiscriminately scrape other people's data and then repeat it without their permission.

The question I have for us as an ML community is... how important is it to us to have the highest accuracy possible, when we know that is only attainable through memorization? Would we be willing to accept models that are less accurate on outliers in exchange for better privacy and a reduced chance of outright memorization? How much less? Why do we care about 100% accuracy in the first place, when we understand that our sample doesn't always represent reality?

And I've been pondering some related questions recently, such as: if we cannot guarantee that we didn't memorize the training data, should we be building models totally differently? As in, should models be collaboratively owned, because they likely contain memorized information from individuals, who should then minimally have attribution, access and direct benefit from their contribution? If so, what would that look like, and what do humans actually want as a benefit? How can we provide that if we want to train machine learning models with "high accuracy"?

I'm curious about your thoughts and answers to any or all of the above questions, and I welcome any and all feedback on the series as well!

Privacy as a Multidisciplinary Practice

I recently had a conversation with two respected privacy professionals about the challenges of building out "privacy engineering" in an organization.

The conversation was quite enlightening for me, and it spawned some thinking around how you need multidisciplinary groups to converse in order to actually "see" the full expanse of privacy issues.

It also reminded me of some great collaboration I had a chance to be involved in at Thoughtworks before I left, like working on the Singularity card game for multidisciplinary AI governance conversations with colleagues from Security (Jim Gumbley) and Privacy (Erin Nicholson).

Singularity Card Game for AI Governance

What are some examples of the roles each group can play and what each brings to the table? Here are some I've seen in my career thus far:

Privacy Professionals & Lawyers

  • privacy by design
  • legal advice, interpretation and compliance
  • analysis of personal and legal harms
  • risk analysis

Product and User Experience

  • understanding industry, business and context
  • user research
  • UX/UI design

Data Engineering & Data Science

  • data platform engineering & architecture
  • data governance controls and observability
  • integration of data flows in/out of data platform
  • data preparation and use
  • PETs as an option

Data Governance Professionals

  • Data lineage tracking and observability
  • Data cataloging and access request flows
  • Data classification and information flow analysis
  • Data processing documentation and oversight

Software Engineers & Architects

  • interface development
  • application code
  • collaboration with data engineering on appropriate lineage & user information controls
  • developer platform development and architecture

Security & Systems Architecture

  • appropriate data flow and storage protections
  • threat modeling exercises
  • data security controls
  • appropriate SRE/infra support (observability, monitoring, automatic deployments & updates)

Privacy Engineering

  • can take on many of the above, depending on their skills!

Perhaps part of the problem is that if you only look at one type of problem (e.g. building a new application), you end up missing other parts of the problem (e.g. building a new machine learning model based on data you already collected, bought or downloaded from the internet). Some of these disciplines approach development very differently (e.g. software vs. data science), because data exploration and discovery are not always part of how software or systems are built.

Another process problem is that in some organizations the group rarely meets for an actual workshop, assessment or group activity, and instead plays an extended game of telephone: the privacy folks hand off to product, who hand off to software, who hand off to data, and the game continues. This creates communication dysfunctions around privacy at organizations where the "privacy part" is just written into a series of design or architecture requirements, but the actual goals and vision for privacy are lost in translation.

Do you need all of these voices in every meeting to make a decision? Probably not. But I would argue that you need as many of them as possible to come to useful discussions, compromises and architectural decisions in order to imbue your services and products with actual privacy. One thing I think we can all agree on: your privacy professionals are not there to rubber-stamp your business goals! They need to be a core part of how products are designed and built in order for your organization to actually have functioning privacy.

Honest questions: Do you run multidisciplinary design and architecture decision workshops where privacy is a focus? Do you run regular risk assessments that track privacy harms and violations and address them as part of your normal development processes? What have you seen work or not work? Please feel free to respond, I'm very curious! :)

Post note: Why did I put PETs under data science? Because many of the concepts build on skills that data scientists already have. For example, understanding probability theory and statistics is a core part of differential privacy as a concept. Understanding and leveraging things like federated learning libraries or encrypted computation libraries often requires conceptualizing the problem the way a data scientist would (i.e. by creating an algorithm or model). This is exactly why I wrote Practical Data Privacy for data folks, as I think they will have the easiest time learning these concepts. This is, however, not meant to exclude others who want to learn about PETs -- and I've been amazed at what a positive response I've gotten from readers of the book who work in law, security, software and infrastructure! 😍
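As a tiny illustration of how differential privacy leans on probability that data scientists already know, here's a sketch of the classic Laplace mechanism for a counting query (the function name and numbers are my own invented example, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one
    person changes it by at most 1), so adding Laplace noise with
    scale 1/epsilon gives an epsilon-DP release -- the textbook
    Laplace mechanism.
    """
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: how many users in a (hypothetical) dataset share a rare trait?
true_count = 42
noisy = laplace_count(true_count, epsilon=0.5)
print(f"true: {true_count}, noisy release: {noisy:.1f}")
```

The statistical intuition -- noise scaled to the query's sensitivity, a privacy/accuracy trade-off controlled by epsilon -- is exactly the kind of reasoning data scientists do every day.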

As always, I'd love to hear your thoughts on any or all of the topics in this newsletter, so feel free to reply. Stay tuned for the next of the memorization series in about a week. Until then 👋🏻

With Love and Privacy, kjam