Hello again privateers!
Thanks for your emails in response to the last newsletter, but somehow there's still no mail in my PO Box... 📭 If you want to send me some privacy-affirming post, the PO Box details are now on the contact page — let's write! 💌
Another cool note: my book Practical Data Privacy went to print! It will be available on eBook.com this week and in print in another few weeks (eep!). I even made a website for it. If you know anyone who should read it, tell them to pre-order, as pre-orders help a lot with promoting the book further and maybe getting a few translations!
This week you'll hear:
The same day I sent out the last newsletter on ChatGPT's privacy leak and related issues, the Italian data protection authority announced a temporary ban on ChatGPT in Italy, asking the company to demonstrate GDPR compliance by 30 April. Since that announcement, Ireland, Spain, Germany and France have all expressed intent to examine ChatGPT under GDPR as well.
I was curious what Italy specifically found and whether any users had brought forward official complaints. Only after digging a bit through the news coverage of the announcement did I find this gem, linked to the privacy leak discussed in the last newsletter, in which OpenAI showed users other users' chat histories:
The information exposed over the course of nine hours included first and last names, billing addresses, credit card types, credit card expiration dates and the last four digits of credit card numbers, according to an email OpenAI sent to one customer.
Note: This greatly reinforces the warnings from the last newsletter -- that indeed people are regularly inputting sensitive data and that OpenAI is storing it for future training.
OpenAI responded that they were working on making how they manage privacy more transparent, including updated sections of their website describing how they attempt to cleanse sensitive inputs and respond promptly to deletion requests. They do now directly mention that they train on personal data found on the web as well as licensed content. This bit: "We don't use data for selling our services" is truly chef's-kiss rich, written with the unabashed hope that no one in regulation understands that ML models are useless without this licensed and private data.
On the 20th of February, I personally filed a deletion request for all scraped data related to my personhood to be removed from OpenAI's servers, including removal of my work from GPT models, and I have yet to hear a response. GDPR usually requires a response within 30 days, which I also noted in my email. I will be following up later this month with the local authorities and will keep you all updated on how it goes.
The problem is: I know how these systems usually look from the backend. I am certain that the "cleansing" is minimal at best, because entity linkage and recognition are important to how transformers function. Although they might have better cleansing for user input, it's unlikely they are using any other advanced techniques. To retrofit this into their system, they would need to rethink how the entire system is designed, and likely slow their reinforcement learning cycles to ensure that sensitive data is not reintroduced by someone who does not have a right to enter it (e.g. if someone feeds my emails into ChatGPT without my consent). This is a huge undertaking, and it will not happen overnight, regardless of the regulatory pressure.
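To see why I'm skeptical, here's what minimal, pattern-based "cleansing" typically looks like — a hypothetical sketch, not OpenAI's actual pipeline. It catches obvious formats like emails and card numbers but misses anything contextual: names, addresses and other identifiers that a transformer can still link back to a person.

```python
import re

# Hypothetical patterns for the obvious, well-formatted identifiers only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "PHONE": re.compile(r"\+?\d[\d -]{7,}\d"),
}

def cleanse(text: str) -> str:
    """Replace pattern matches with a label. Free-text identifiers
    (names, street addresses, health details) pass straight through."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(cleanse("Reach me at jane.doe@example.com or +1 555 123 4567."))
# → "Reach me at [EMAIL] or [PHONE]."
# But "My name is Jane Doe and I live on Elm Street" survives untouched.
```

Anything beyond this — real entity recognition and linkage — runs directly against how the model learns, which is exactly the retrofit problem described above.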
I've also started to hear more requests from friends, colleagues and folks in the industry: what are our options? I've seen several snake-oil-like products pop up promising "safe and private usage" of ChatGPT. Although I'm very excited folks are asking these questions and recognizing the risk, it's unlikely that a privacy condom around your chat data will prevent your private or sensitive data from entering ChatGPT via other vectors (other people's chats, scraped data, extra training sets provided by Microsoft, etc.). True privacy engineering needs to be built into the product itself — which means rebuilding how we train these models from the start. Ultimately, this relies on users and regulators deciding that the current practices are unacceptable.
If you reside in Europe or California, I recommend you take the time to file a deletion request or even a data access request (i.e. asking what data about you has been collected). If you live in Europe, you can also file a request to "port" your data to another service or for your own usage. You can find a guide on what to include in a deletion request from noyb. If you're interested in seeing what I wrote, I will share it in the next issue along with the follow-up requests to the local authorities. :)
AINow released their 2023 Landscape Report, Confronting Tech Power, earlier this month. The report covers many pressing topics, including regulation of General Purpose AI (GPAI), unfair advantages in the AI landscape such as monopolization of AI infrastructure and data, and other fun topics like algorithmic audits, workplace surveillance and data colonialism via trade agreements. It's well worth a read!
One particular section stood out for me: the evolution of data monopolies and how privacy regulation and technology can end up reinforcing these monopolies. Let's take a concrete example from the report to see how this works.
In 2008, Google completed its acquisition of DoubleClick, creating what we today know as the Google Ads ecosystem. The deal probably shouldn't have made it past antitrust review, but one of the arguments that got it through was that data would not be exchanged between the two. That promise, of course, only lasted until 2016, when the terms were updated, and data flows between the platform and the marketplace likely increased heavily.
There have been numerous complaints to regulators from competitors since, especially now with the crackdown on third-party cookies and advertisements, which gives Google a heavy advantage over many other platforms and services. In fact, the California privacy laws (CCPA and CPRA) almost explicitly favor platform owners over other advertising networks in the way they define consent and data usage. This might be a win for individual privacy in a small scope, but it's a loss for privacy in a longer and larger context.
This is easy to see in some of the conversations around Google's proposed Privacy Sandbox, a mish-mash of internet and application proposals to essentially rewrite the way marketing and marketing-related data and ML work, in the name of "privacy". These proposals directly include a wealth of cool privacy technologies, including differential privacy, federated learning and enclaves.
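For readers new to these technologies: differential privacy, one of the techniques the Sandbox proposals lean on, works by adding calibrated noise to aggregate statistics so that no single individual's presence can be inferred from the output. A toy sketch of the core idea — illustrative only, not Google's implementation:

```python
import math
import random

def dp_count(values, epsilon=1.0):
    """Differentially private count: the true count plus Laplace noise.
    One person changes a count by at most 1 (sensitivity = 1), so noise
    with scale 1/epsilon hides any single individual's contribution."""
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    # Inverse-transform sample from Laplace(0, 1/epsilon).
    noise = -(1.0 / epsilon) * sign * math.log(1.0 - 2.0 * abs(u))
    return len(values) + noise

# Each query returns a slightly different, plausibly deniable answer.
noisy_total = dp_count(["alice", "bob", "carol"], epsilon=1.0)
```

Lower epsilon means more noise and stronger privacy, and every query spends some of a finite privacy budget — which is exactly why the details of how such systems are deployed (and who audits the budget) matter so much.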
On one hand, it makes me excited for the industry to adopt new technologies aimed at providing better privacy than users have ever been offered as long as they've been connected to the internet. On the other hand, this is clearly a privacy fig leaf -- aimed at making users feel good, avoiding potential investigations and soothing regulatory demands for more accountability and better privacy offerings. In fact, some of the proposals have already been debunked by privacy researchers and activists, then re-written and re-proposed, often without incorporating or addressing that feedback or making real changes. The latest critiques were brought directly to the W3C, so far without any change.
These battles, outlined in AINow's report, show that we need a comprehensive view of how data ownership, data power and data privacy work together. To understand whether a privacy technology helps, we have to know the context, the use case and the alternatives — never skipping the very real alternative of collecting and using no data whatsoever. I'm going to start calling this the null data hypothesis. Please spread the word!
My personal hope is that these reports spark conversations at the regulatory and social levels — calling for less privacy technology as performance art to evade regulation, and more genuine intent to integrate user-driven, consensual data usage. If real privacy is to be offered to all of us, there needs to be real competition, with different data offerings from many companies — not just a few large ones with cool privacy tech.
Last week, a disturbing video of a UK police raid and arrests circulated the Twitterverse (is it still called that, or have we renamed it to something Elon-ey?), calling attention to state-sponsored surveillance and privacy inequality. The original tweet and its framing were deleted, but Fiona Robertson's thread remained, documenting the trauma that she and many other disabled people have experienced when dealing with "fraud" detection and surveillance in public benefit systems.
In many talks, I've referenced the amount of surveillance that people who need social assistance go through, and how these systems, and their further automation, erode privacy rights for those who need them most. If you're new to these concepts, I highly recommend reading Virginia Eubanks's Automating Inequality, which dives into the state's automation and surveillance of the poor, the disadvantaged and the "other" throughout history. Privacy and privilege go hand in hand: those with more power have more privacy, while those with less often have to give up their privacy in order to survive.
When it comes to fraud specifically, the dominant dialogue centers on the idea that alleged fraudsters don't, and shouldn't, have a right to privacy. Notice that this framing implies guilt even before the algorithm has been reviewed, and that anyone can be considered a potential fraudster. It helps many technology companies avoid disclosing much about their fraud systems and allows them to use ever more private data with little oversight. I attended a panel for the UK government where a representative of a large cloud provider and I had an open argument about this exact point — with them saying that any transparency and accountability reporting around fraud-related algorithms would "only help people commit more fraud."
What is often forgotten in these conversations is that we just aren't that good at catching fraud, with or without the troves of data. Years ago, when presenting on ethics at a large AI conference, I was approached by a person after the talk. They had built a large-scale, fairly comprehensive fraud-monitoring system, in production at a large financial institution, but it only really caught one particular group (based on a private attribute!), and only when they made an obvious mistake. It was clear to this person that this was a tiny percentage of the total fraud. They wanted to know if I thought it was fair...
I answered honestly, "Until you can model the crime you actually think is happening appropriately, why would you harm only one group to the benefit of the others?" Why are we automating things that we know do not work?
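That conversation hints at a concrete check any team can run: measure, per group, what fraction of known fraud the model actually catches. A minimal sketch — the data, groups and numbers here are invented purely for illustration:

```python
from collections import defaultdict

def recall_by_group(records):
    """Per-group recall of a fraud model: of the fraud each group
    actually committed, what fraction did the model flag? A lopsided
    result means one group bears the harm while others go unexamined."""
    caught = defaultdict(int)
    total = defaultdict(int)
    for group, committed_fraud, flagged in records:
        if committed_fraud:
            total[group] += 1
            if flagged:
                caught[group] += 1
    return {g: caught[g] / total[g] for g in total}

# (group, committed_fraud, model_flagged) — hypothetical audit data.
records = [
    ("A", True, True), ("A", True, True), ("A", True, False),
    ("B", True, False), ("B", True, False), ("B", True, False),
]
print(recall_by_group(records))  # group A: caught 2 of 3; group B: 0 of 3
```

A gap like this — all detections concentrated in one group — is exactly the situation in the anecdote above, and it's measurable before a system ever ships.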
If this resonates with you, start conversations at your organization around privacy and privilege, and investigate how much transparency you have into your own fraud detection. If you work in the public sector on data and automation systems, figure out how to build people- and privacy-affirming automation. If we don't fight for privacy for the most vulnerable in these systems, we won't actually change them.
I'd be curious to hear from you: about any deletion requests you've sent that have gone unanswered, whether you think antitrust regulation is a way to enforce better privacy, and how you've seen private information and related bias creep into unsuspecting models (fraud or otherwise). Write me by hitting reply, or send a letter via postal mail to my PO Box!
Finally, if you know someone who would like this newsletter, please forward it to them or send them directly to my subscribe page. 🎊