Hello Privateers,
I hope you enjoyed your start to 2024 and are warm, safe and reading lots of interesting things. It's very cold here in Berlin, and I'm snuggled up in a new writing area I've set up in my apartment, looking out on a snowy landscape.
On that note, I'm going to be releasing this newsletter more regularly, and starting a new book project this year. The new book will be a concise and visual introduction to privacy technology for folks in the legal and privacy professions who don't have a technical background. If that sounds like you or someone you know and you would want to give feedback on early copies, please reply and let me know!
In the meantime, lots of things have changed around privacy in "AI" and AI governance over the past few months! In this newsletter, I'll walk through the most important ones.
The EU has been quite busy the past few years driving the new EU data strategy, which aims to inspire more data innovation and responsible data usage within the member states. As part of that vision, several landmark pieces of legislation have been in the works, and I'll cover a few of them through the lens of privacy, so you know what's already in place and what's coming next.
You've probably heard or read grumblings about "restrictive AI legislation" in the EU. Let me tell you what I see in the likely final draft of the EU's AI Act:
There are some interesting and potentially difficult requirements for GenAI systems, such as labeling generated content as AI-generated and conducting safety testing, with possible carve-outs for open and research models. As with the rollout of the GDPR, I'd wait and see how those obligations shake out before jumping to the conclusion that this will "restrict and forbid all generative models". Personally, I think the internet needs some ability to distinguish or filter out trashy content, whether it is human-generated SEO juice or AI-generated blah-blah. Unfortunately, that's a difficult task, and I'm not sure who would pay for it...
I think a lot of the critical reactions to the AI Act I've read are unnecessary scare-mongering and typical "rugged individualism over communal wellbeing" takes, where precaution, human rights and safety are viewed as contrary to technical innovation and progress. In my opinion, that's a technosolutionist fallacy.
Remember that GDPR provision on "data portability", which promised to let you port your data from one service to another without problems? It's finally here, in the form of a new piece of legislation called the Data Act!
The Data Act aims to democratize data from proprietary systems and allow more interoperability for those wishing to switch services, move to alternatives, or debug and investigate how their hardware or cloud services actually work. The act creates open, standardized API access to data from consumer goods (i.e. IoT and connected devices) and proprietary systems, such as airplane computers, train driving systems and cloud services (of course, only for those who have bought them!). It also has provisions to make it easier to migrate between cloud services.
I opined in Practical Data Privacy that data portability never really became a thing, despite it being law (by the way, this is also why I disagree with my peers on the "impossibility" of parts of the AI Act: it won't necessarily be evenly applied!!). I'm extremely excited to see how the Data Act gets implemented, and what this means for users controlling their own data destinies. Particularly cool for me is how new technologies like federated data analysis and federated learning could be used in a communal context if users could combine data from their various hardware APIs. Who wants to build privacy-first community projects around IoT data? ✋🏻
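To make that communal idea a bit more concrete, here is a minimal sketch of federated analysis over hypothetical IoT data: each participant computes a small, optionally noisy aggregate on their own device, and only those aggregates ever leave home. Everything here is illustrative; the device API and function names are made up, and the noise is a toy stand-in rather than properly calibrated differential privacy.

```python
# Minimal sketch of communal federated analysis over hypothetical IoT data.
# The device API (read_local_readings) and all names are illustrative only.
import random
from statistics import mean

def read_local_readings(device_id: str) -> list[float]:
    """Stand-in for a device's local data API (e.g. hourly energy readings)."""
    rng = random.Random(device_id)
    return [rng.uniform(0.1, 2.5) for _ in range(24)]

def local_aggregate(device_id: str, noise_scale: float = 0.1) -> float:
    """Each participant computes a noisy daily average locally.

    Only this single noisy number leaves the device; the raw readings
    stay with their owner.
    """
    readings = read_local_readings(device_id)
    noise = random.gauss(0, noise_scale)  # toy noise, not calibrated DP
    return mean(readings) + noise

def community_average(device_ids: list[str]) -> float:
    """The community coordinator only ever sees per-device aggregates."""
    return mean(local_aggregate(d) for d in device_ids)

if __name__ == "__main__":
    neighborhood = [f"device-{i}" for i in range(20)]
    print(f"Community average hourly usage: {community_average(neighborhood):.2f} kWh")
```

The same pattern extends to federated learning, where model updates rather than simple averages are what travel back to the coordinator.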
It will take a minute (~12-18 months) to figure out how this will actually work, but I'm happy to report the conversation has already started in Germany. Let's see if we truly get some new open data APIs as a result, and whether taking your data with you becomes a real-world possibility for anyone and everyone. 🥰
Much less hyped, but extremely exciting for privacy engineers, is the Data Governance Act, which went into effect in September last year. This act instructs EU member states to employ privacy technologies at scale within government to enable safer, easier and more responsible data sharing across the EU. It also establishes a new European-level committee for data and innovation, both to foster continued research and development of privacy engineering systems and to formalize data sharing providers and policies across the EU.
If you are a privacy engineer, a privacy company, or you want to get into the field of privacy, this means there will be increasing opportunities across the European public sector to practice this work in critical systems. The act establishes two types of organizations: data intermediaries and data altruism organizations.
This law will likely foster increased need for privacy technologists and engineers within the public sector itself, as responsible data sharing and usage become a focus for parts of government. I know from observing the German federal government's initiatives that creating safer ways for states to share data with one another could help improve lives, as long as it is done with respect for privacy and human dignity.
I'll definitely be keeping an eye on how this rolls out, and I'd be very curious to hear if your organization is looking to register as either a data intermediary or a data altruism organization. I'd love to see this law foster better understanding of privacy, safe data sharing and really cool community projects to combat climate change and create new mobility futures.
In late October 2023, Biden released an executive order focused on AI use and development. Although executive orders are not law, they can function similarly to laws if Congress doesn't change or create laws pertaining to those topics. For this reason, the order should be considered "in effect", at least until the next election (this November).
Some things I like about the order:
Some things that made me wonder....
I'd be curious what you think a "healthy balance" is between openness, transparency and required reporting, so users can understand and evaluate models for their use cases. I'm a big fan of Model Cards (and my proposed Privacy Card in my book!), but is that enough?
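To show what I mean by structured reporting, here is a rough sketch of what a machine-readable privacy card could look like. The fields are purely my own illustration for this newsletter, not the official Model Card schema and not the final form of the Privacy Card from the book.

```python
# Rough, illustrative sketch of a machine-readable privacy card.
# These fields are illustrative only, not an official or final schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PrivacyCard:
    model_name: str
    training_data_sources: list[str]       # where the training data came from
    personal_data_in_training: bool        # was personal data used at all?
    memorization_tested: bool              # has anyone checked for regurgitation?
    opt_out_mechanism: Optional[str] = None  # how individuals can object
    known_risks: list[str] = field(default_factory=list)

card = PrivacyCard(
    model_name="community-energy-forecaster",
    training_data_sources=["opted-in smart meter data", "public weather data"],
    personal_data_in_training=True,
    memorization_tested=True,
    opt_out_mechanism="email the project maintainers",
    known_risks=["household routines may be inferable from usage patterns"],
)
print(card)
```

Something this small is obviously not "enough" on its own, but making the answers machine-readable at least lets users and auditors compare models on the same questions.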
I'd also be curious to hear what your experiences are with "guardrails". I deeply dislike the fact that they can be fine-tuned away for a very small amount of money, that they need continuous supervision as new hacks and attacks are created, and that they essentially make awful internet content lightly palatable without disclosing that the model is indeed full of awful internet content.
In my PyData Amsterdam keynote, I talked a bit about my distaste for closed models that pretend to be palatable, and instead proposed that we use better content from the start and directly report what the LLM or multi-modal model is good at. This seems to be the direction Apple is going, and it could produce fun things like: "this LLM was only trained on children's books, enjoy!" 😍
In case any of you are working in the DACH region or are fluent German speakers, my book will be released in German in a few months! Pre-orders are already available via Amazon and DPunkt.
If you are learning German but hope to work in Germany on these topics, I can recommend giving it a read. It's been quite interesting for me to work with my translator and several native German speakers to find the right words to convey both the social and the technical meanings of privacy and privacy technologies. I have definitely learned a few new words!
I've also written some supplemental materials on ChatGPT, generative "AI" and LLMs -- which I will also release online in English once they go through the O'Reilly editing process.
The new content includes:
I've been especially excited to read Vitaly Feldman's theoretical work on long-tail data memorization in large models and to review several interesting papers on how and why over-parametrized models memorize their training data. It definitely happens, and the extent is likely wider than any of us would like to admit (somewhere between 1 and 35%, depending on several factors).
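For intuition on how researchers probe for this, one common approach is to prompt a model with a prefix that appears in its training corpus and check whether it reproduces the continuation verbatim. Here is a minimal sketch using the Hugging Face transformers library; the model name and the example text are placeholders, and real memorization studies are far more careful (deduplication, sampling strategies, fuzzy matching and so on).

```python
# Minimal sketch of a verbatim-memorization probe. The model name and the
# example text below are placeholders, not an actual experimental setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the model you want to probe
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def appears_memorized(prefix: str, continuation: str, max_new_tokens: int = 30) -> bool:
    """Prompt with a known training prefix and check for a verbatim continuation."""
    inputs = tokenizer(prefix, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding: the model's most likely completion
    )
    generated = output_ids[0][inputs["input_ids"].shape[1]:]
    completion = tokenizer.decode(generated, skip_special_tokens=True)
    return continuation.strip() in completion

# Placeholder example text, not a claim about any particular model's training data:
print(appears_memorized("Call me Ishmael. Some years ago",
                        "never mind how long precisely"))
```

If a model completes many such training prefixes exactly, that's strong evidence of memorization rather than generalization.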
I'm sad to report that there is no easy solution to this phenomenon, but I'm quite interested in seeing how this fact makes its way through the courts with the New York Times lawsuit against OpenAI and Microsoft. Should the court uphold the NYT's copyright claims, it could fundamentally alter how LLMs and multi-modal models are built and likely create stronger privacy and content rights for everyone.
I will share the additional book content here and would be happy to write some longer reviews of these papers and research if there is interest. If you have 2 minutes, please hit reply and tell me what you want to learn!
As always, I'm very open to your thoughts, questions, challenges and ideas. Feel free to hit reply or drop me a letter in my PO Box. If you've already written me a letter, expect a belated New Year's card soon!
If you enjoy this newsletter, consider forwarding it to someone who you think would also enjoy a bit of technical privacy. 😍
With Love and Privacy, kjam