Hello Privateers,
I hope you enjoyed your start to 2024 and are warm, safe and reading lots of interesting things. It's very cold here in Berlin, and I'm snuggled up in a new writing area I've set up in my apartment, looking out on a snowy landscape.
On that note, I'm going to be releasing this newsletter more regularly, and starting a new book project this year. The new book will be a concise and visual introduction to privacy technology for folks in the legal and privacy professions who don't have a technical background. If that sounds like you or someone you know and you would want to give feedback on early copies, please reply and let me know!
In the meantime, lots of things have changed around privacy in "AI" and AI governance over the past few months! In this newsletter, I'll walk through the most important ones.
The EU has been quite busy the past few years driving the new EU data strategy, which aims to inspire more data innovation and responsible data usage within the member states. As part of that vision, several landmark pieces of legislation have been in the works, and I'll cover a few of them through the lens of privacy, so you know what's already in place and what's coming next.
You've probably heard or read grumblings about "restrictive AI legislation" in the EU. Let me tell you what I see in the likely final draft of the EU's AI Act:
There are some interesting and potentially difficult requirements for GenAI systems, such as labeling generated content as AI-generated and conducting safety testing, with possible carve-outs for open and research models. As with the rollout of the GDPR, I'd wait and see how those obligations shake out before jumping to the conclusion that this will "restrict and forbid all generative models". Personally, I think the internet needs some ability to distinguish or filter out trashy content, whether it is human-generated SEO juice or AI-generated blah-blah. Unfortunately, that's a difficult task, and I'm not sure who would pay for it...
I think a lot of the critical reactions to the AI Act I've read are unnecessary scare-mongering and typical "rugged individualism over communal wellbeing" takes, where precaution, human rights and safety are viewed as contrary to technical innovation and progress. In my opinion, that's a technosolutionist fallacy.
Remember that GDPR provision on "data portability", which promised to let you port your data from one service to another without problems? It's finally here, in the form of a new piece of legislation called the Data Act!
The Data Act aims to democratize data from proprietary systems and allow more interoperability for those wishing to switch services, move to alternatives, or debug and investigate how their hardware or cloud services actually work. The act creates open, standardized API access to data from consumer goods (i.e. IoT and connected devices) and proprietary systems, such as airplane computers, train driving systems and cloud services (of course, only for those who have bought them!). It also has provisions to make it easier to migrate between cloud services.
I opined in Practical Data Privacy that data portability never really became a thing, despite it being law (by the way, this is also why I disagree with my peers on the "impossibility" of parts of the AI Act: it won't necessarily be evenly applied!!). I'm extremely excited to see how the Data Act gets implemented, and what this means for users controlling their own data destinies. Particularly cool for me is how new technologies like federated data analysis and federated learning could be used in a communal context if users could combine data from their various hardware APIs. Who wants to build privacy-first community projects around IoT data? ✋🏻
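To make that communal idea a bit more concrete, here is a minimal sketch of federated analysis over hypothetical IoT data: each participant computes a small, optionally noisy aggregate on their own device, and only those aggregates ever leave home. Everything here is illustrative; the device API and function names are made up, and the noise is a toy stand-in rather than properly calibrated differential privacy.

```python
# Minimal sketch of communal federated analysis over hypothetical IoT data.
# The device API (read_local_readings) and all names are illustrative only.
import random
from statistics import mean

def read_local_readings(device_id: str) -> list[float]:
    """Stand-in for a device's local data API (e.g. hourly energy readings)."""
    rng = random.Random(device_id)
    return [rng.uniform(0.1, 2.5) for _ in range(24)]

def local_aggregate(device_id: str, noise_scale: float = 0.1) -> float:
    """Each participant computes a noisy daily average locally.

    Only this single noisy number leaves the device; the raw readings
    stay with their owner.
    """
    readings = read_local_readings(device_id)
    noise = random.gauss(0, noise_scale)  # toy noise, not calibrated DP
    return mean(readings) + noise

def community_average(device_ids: list[str]) -> float:
    """The community coordinator only ever sees per-device aggregates."""
    return mean(local_aggregate(d) for d in device_ids)

if __name__ == "__main__":
    neighborhood = [f"device-{i}" for i in range(20)]
    print(f"Community average hourly usage: {community_average(neighborhood):.2f} kWh")
```

The same pattern extends to federated learning, where model updates rather than simple averages are what travel back to the coordinator.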
It will take a minute (~12-18 months) to figure out how this will actually work, but I'm happy to report the conversation has already started in Germany. Let's see if we truly get some new open data APIs as a result, and whether taking your data with you becomes a real-world possibility for anyone and everyone. 🥰
Much less hyped, but extremely exciting for privacy engineers, is the Data Governance Act, which went into effect in September last year. This act instructs EU member states to employ privacy technologies at scale within government to enable safer, easier and more responsible data sharing across the EU. It also establishes a new European-level committee for data and innovation, both to foster continued research and development of privacy engineering systems and to formalize data sharing providers and policies across the EU.
If you are a privacy engineer, a privacy company, or you want to get into the field of privacy, this means there will be increasing opportunities across the European public sector to practice this work in critical systems. The act establishes two types of organizations: data intermediaries and data altruism organizations.
This law will likely foster increased need for privacy technologists and engineers within the public sector itself, as responsible data sharing and usage become a focus for parts of government. I know from observing the German federal government's initiatives that creating safer ways for states to share data with one another could help improve lives, as long as it is done with respect for privacy and human dignity.
I'll definitely be keeping an eye on how this rolls out, and I'd be very curious to hear if your organization is looking to register as either a data intermediary or a data altruism organization. I'd love to see this law foster better understanding of privacy, safe data sharing and really cool community projects to combat climate change and create new mobility futures.
In late October 2023, Biden released an executive order focused on AI use and development. Although executive orders are not law, they can function similarly to laws if Congress doesn't change or create laws pertaining to those topics. For this reason, the order should be considered "in effect", at least until the next election (this November).
Some things I like about the order:
Some things that made me wonder....
I'd be curious what you think a "healthy balance" is between openness, transparency and required reporting, so users can understand and evaluate models for their use cases. I'm a big fan of Model Cards (and my proposed Privacy Card in my book!), but is that enough?
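To show what I mean by structured reporting, here is a rough sketch of what a machine-readable privacy card could look like. The fields are purely my own illustration for this newsletter, not the official Model Card schema and not the final form of the Privacy Card from the book.

```python
# Rough, illustrative sketch of a machine-readable privacy card.
# These fields are illustrative only, not an official or final schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PrivacyCard:
    model_name: str
    training_data_sources: list[str]       # where the training data came from
    personal_data_in_training: bool        # was personal data used at all?
    memorization_tested: bool              # has anyone checked for regurgitation?
    opt_out_mechanism: Optional[str] = None  # how individuals can object
    known_risks: list[str] = field(default_factory=list)

card = PrivacyCard(
    model_name="community-energy-forecaster",
    training_data_sources=["opted-in smart meter data", "public weather data"],
    personal_data_in_training=True,
    memorization_tested=True,
    opt_out_mechanism="email the project maintainers",
    known_risks=["household routines may be inferable from usage patterns"],
)
print(card)
```

Something this small is obviously not "enough" on its own, but making the answers machine-readable at least lets users and auditors compare models on the same questions.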
I'd also be curious to hear what your experiences are with "guardrails". I deeply dislike the fact that they can be fine-tuned away for a very small amount of money, that they need continuous supervision as new hacks and attacks are created, and that they essentially make awful internet content lightly palatable without disclosing that the model is indeed full of awful internet content.
In my PyData Amsterdam keynote, I talked a bit about my distaste for closed models that pretend to be palatable, and instead proposed that we use better content from the start and directly report what the LLM or multi-modal model is good at. This seems to be the direction Apple is going, and it could produce fun things like: "this LLM was only trained on children's books, enjoy!" 😍
In case any of you are working in the DACH region or are fluent German speakers, my book will be released in German in a few months! Pre-orders are already available via Amazon and DPunkt.
If you are learning German but hope to work in Germany on these topics, I can recommend giving it a read. It's been quite interesting for me to work with my translator and several native German speakers to find the right words to convey both the social and the technical meanings of privacy and privacy technologies. I have definitely learned a few new words!
I've also written some supplemental materials on ChatGPT, generative "AI" and LLMs -- which I will also release online in English once they go through the O'Reilly editing process.
The new content includes:
I've been especially excited to read Vitaly Feldman's theoretical work on long-tail data memorization in large models and to review several interesting papers on how and why over-parametrized models memorize their training data. It definitely happens, and the extent is likely wider than any of us would like to admit (somewhere between 1 and 35%, depending on several factors).
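For intuition on how researchers probe for this, one common approach is to prompt a model with a prefix that appears in its training corpus and check whether it reproduces the continuation verbatim. Here is a minimal sketch using the Hugging Face transformers library; the model name and the example text are placeholders, and real memorization studies are far more careful (deduplication, sampling strategies, fuzzy matching and so on).

```python
# Minimal sketch of a verbatim-memorization probe. The model name and the
# example text below are placeholders, not an actual experimental setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the model you want to probe
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def appears_memorized(prefix: str, continuation: str, max_new_tokens: int = 30) -> bool:
    """Prompt with a known training prefix and check for a verbatim continuation."""
    inputs = tokenizer(prefix, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding: the model's most likely completion
    )
    generated = output_ids[0][inputs["input_ids"].shape[1]:]
    completion = tokenizer.decode(generated, skip_special_tokens=True)
    return continuation.strip() in completion

# Placeholder example text, not a claim about any particular model's training data:
print(appears_memorized("Call me Ishmael. Some years ago",
                        "never mind how long precisely"))
```

If a model completes many such training prefixes exactly, that's strong evidence of memorization rather than generalization.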
I'm sad to report that there is no easy solution to this phenomenon, but I'm quite interested in seeing how this fact makes its way through the courts with the New York Times lawsuit against OpenAI and Microsoft. Should the court uphold the NYT's copyright claims, it could fundamentally alter how LLMs and multi-modal models are built and likely create stronger privacy and content rights for everyone.
I will share the additional book content here and would be happy to write some longer reviews of these papers and research if there is interest. If you have 2 minutes, please hit reply and tell me what you want to learn!
As always, I'm very open to your thoughts, questions, challenges and ideas. Feel free to hit reply or drop me a letter in my PO Box. If you've already written me a letter, expect a belated New Year's card soon!
If you enjoy this newsletter, consider forwarding it to someone who you think would also enjoy a bit of technical privacy. 😍
With Love and Privacy, kjam