PROBABLY PRIVATE

Issue #1: Welcome: Privacy Shield, Contact Tracing & Adversarial ML

Hi everyone! 👋🏻 I likely know most of you, so welcome, friends, to the first edition of a newsletter I’ve been meaning to write for some time. I spend several hours each week reading (for fun and for work) about data privacy, data security, surveillance and how those intersect with data science and machine learning. I wanted to put together a syndicated view of things on my mind to share with folks who have similar interests, but less time, so here goes!

The name Probably Private was chosen as a fun play on words, allowing me to combine ideas of privacy and security with my second favorite topics: statistics, maths and data science. Probably Private is how I would describe most things we work on, even in the field of data privacy, so it seemed like a match. 😉

You’ll see a call for feedback and to share this newsletter at the end. I am doing this because I hope it’s useful and informative. But I need your help to tell me if it’s working, how to make it better and to share it with like-minded folks so I can get more feedback! Thank you in advance if you forward this to someone. 🙂

I chose Revue (NOTE: since dropped), since they had the best privacy practices I could find for a mailing list I didn’t need to maintain or deploy myself. There is a small amount of tracking in this email, shown to me only in aggregate (unfortunately there is no way to turn off what they include by default, but I’m not adding any more!). If anyone has another suggestion, I am all ears!

Thank you for being here with me and putting time into issues that I believe will help us create a better society, a more democratic and open world and ensure that we can be probably safe and probably private in most of our lives (as we choose!). Now, onto the first episode!

Privacy Shield: What Does It Mean for Data Science?

Why is this news? In mid-July, the Court of Justice of the European Union (CJEU) announced the “Schrems II” decision, which invalidated the Privacy Shield. The Privacy Shield was a framework between the EU and US that allowed data transfers of sensitive information by letting data processors self-certify that they were following proper privacy and security practices. Although self-certification sounds a bit fishy, it was the main way that private and sensitive information from EU residents was “shielded” in the United States.

Once GDPR came into effect (and even before that), there were concerns and questions about whether Privacy Shield was enough. Max Schrems (whose earlier lawsuit also struck down Safe Harbor, the precursor to Privacy Shield) won this ruling because Privacy Shield does not protect against US government surveillance either abroad or within the US (for example, the undersea cable spying that is supported by Executive Order 12333).

There are already lawsuits against companies using Google Analytics and other tracking, challenging the continued practice of shipping EU data to the US, as well as articulate articles on what might be at stake. (More FAQs on the ruling in case you need or want to dive in: EU Commission, IAPP.)

But what does it mean for machine learning and data science?

  • It might mean more companies decide to move to multi-region, where European data is held and worked on within Europe. For example, TikTok announced it will invest $500M to build a data center in Ireland to hold European data. If you live in the US, this might mean new challenges in how you operate on and process data (i.e. not downloading it to your computer, but instead working on it via virtual machines or workflow tools that live in those data centers).

  • It could also mean we move to a more privacy-by-design default where US companies and data processors have to consider privacy when architecting services.

Quote from DarkReading: For those that see it and get ahead of it, the answer has to be privacy by design. In most cases, that means that heavy lifting and architectural changes have to be considered and undertaken. Companies that have large datasets have to build out features and consider the notion of data autonomy in which users receive a degree of respect and autonomy in determining what happens to privacy data related to them. This isn’t easy by any stretch, but whatever follows Privacy Shield as an umbrella is a temporary reprieve.

“the CJEU [central court] sets out a heavy burden on data exporters which wish to use Standard Contractual Clauses (SCCs); the data exporter must consider the law and practice of the country to which data will be transferred, especially if public authorities may have access to the data. Additional safeguards, beyond the SCCs, may be required.”

For us as data guardians, it should also mean we begin to think about large data processing pipelines where data is inadvertently thrown together regardless of where it comes from and under what privacy or security practices it should be transferred and stored. We need to understand data lineage and data governance, now more than ever, since it appears that this is not going to be a small passing trend… I’m working on a lot of these topics at work, so if you want to chat - drop me a line! If you are using a governance or lineage tool (especially open-source!), I’d love to hear about it.

Contact Tracing and Surveillance

Like many privacy-conscious folks, I’ve been closely following contact tracing news for privacy violations and other mentions of surveillance (corporate or state-led). One interesting question I have for you: what is your organization doing for employee contact tracing (if you are back in the office or traveling for work)? The IAPP (an international privacy organization) released a thoughtful report on how workplaces should handle the privacy risks of a COVID-19 workplace. If you are in charge of health data, it is well worth a read.

A key step that organizations can take to reduce identification risks to individuals is to create a policy for COVID-19 data sharing and make that policy transparent to all individuals who may be affected by it. A key part of this policy should concern with whom it is necessary (outside of HR) to share diagnostic and other employee health data, keeping in mind the aim of data minimization.

Thought-provoking quote here:

We should not have to choose between “two evils” – either state surveillance or big tech surveillance (or to use another term Zuboff’s “surveillance capitalism” now when we are talking about the more general societal power dynamics). We should make choices that are based on our democratic values and nothing else. Those in power tend to simplify our choices by forceful simplification of false trade-offs: “choose this tech solution or submit to total surveillance”. But the fact is that no technology is a magic wand, no technology is our last resort, and therefore we do not have to accept trade-offs without questioning them or considering alternatives (e.g. how about physical tokens that are not connected with our most private devices the state or the big data technology infrastructure as EIT is looking into?)

Adversarial ML - Legal, Illegal? Depends Where You Sit…

For those unfamiliar with the term, adversarial machine learning is a field of research and study within machine learning that looks at how one could trick, infect, poison, evade or steal machine learning models or systems. You can think of it as penetration testing for machine learning (or other similar security “Red Team”-style attacks, but on machine learning systems). It’s a field that has seen growing interest over the past few years, even though it is more than a decade old. I gave a talk on it at 34c3, and there are MANY great researchers to follow in the space (a few to get started: Battista Biggio, Reza Shokri, Ian Goodfellow and Konrad Rieck).
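To make “evasion” (the most famous flavor of attack) concrete, here is a minimal sketch of the fast gradient sign method against a toy logistic classifier. Everything here — the weights, the input, the step size — is an invented illustration, not a real model or a prescribed attack:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy linear classifier; weights and input are made up for illustration
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, -0.4, 0.2])   # currently classified positive
y = 1.0                          # true label

def predict(x):
    return sigmoid(w @ x)        # probability of the positive class

# Fast gradient sign method: for log-loss, the gradient of the loss
# with respect to the input is (p - y) * w
grad = (predict(x) - y) * w
eps = 0.8                        # attacker's perturbation budget
x_adv = x + eps * np.sign(grad)  # step in the loss-increasing direction
```

With these made-up numbers, `predict(x)` is above 0.5 while `predict(x_adv)` falls below it: a small, targeted nudge to the input flips the classification, which is exactly the behavior the “Evasion Attack” row in the researchers’ table describes.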

There were recently two very different takes on the legality of Adversarial ML depending on where you sit (both in terms of Europe versus the US, as well as within the United States).

United States: In the US, adversarial ML might be considered hacking. An interesting article published by several researchers including Bruce Schneier (of Schneier on Security) explores whether adversarial machine learning falls under the US Computer Fraud and Abuse Act (CFAA). This law is the primary regulation for prosecuting computer hacking attempts, and different circuits (translated: jurisdictions / groups of states) interpret it differently (broader in the South, narrower in the West and Northeast). The paper argues there should be better protection for security and ML researchers and more consistent application of the law across circuits.

Below is an excerpt from the paper, showing how different types of attacks may apply depending on clause and the court that sees the case.

| Attack | Description | §1030(a)(2) violation (Narrow) | §1030(a)(2) violation (Broad) | §1030(a)(5)(A) violation |
| --- | --- | --- | --- | --- |
| Evasion Attack | Attacker modifies the query to get the desired response | No | No | No |
| Model Inversion | Attacker recovers the secret features used in the model through careful queries | No | Possible | No |
| Membership Inference | Attacker can infer whether a given data record was part of the model's training dataset | No | Possible | No |
| Model Stealing | Attacker is able to recover the model by constructing careful queries | No | Possible | No |
| Reprogramming the ML System | Repurposing the ML system to perform an activity it was not programmed for | No | Yes | Yes |
| Poisoning Attack | Attacker contaminates the training phase of ML systems to get the intended result | No | Possible | Yes |
| Attacking the ML Supply Chain | Attacker compromises the ML model as it is being downloaded for use | Yes | Yes | Possible |
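To make one of these rows concrete, membership inference is often demonstrated with a simple loss-threshold attack: an overfit model tends to have much lower loss on its training points than on unseen ones, so an attacker who can query per-example confidence can guess membership. A minimal sketch (the toy data and the 0.1 threshold are illustrative assumptions, not a prescribed attack):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny 1-D logistic regression, deliberately overfit on four points
X = np.array([-2.0, -1.5, 1.5, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = 0.0, 0.0
for _ in range(2000):                  # plain gradient descent on log-loss
    p = sigmoid(w * X + b)
    w -= 0.5 * np.mean((p - y) * X)
    b -= 0.5 * np.mean(p - y)

def loss(x, label):                    # per-example log-loss
    p = sigmoid(w * x + b)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

# Loss-threshold membership inference: members of the training set
# tend to have much lower loss than unseen points near the boundary
member_loss = loss(1.5, 1.0)           # point that WAS in the training data
outsider_loss = loss(0.2, 0.0)         # hypothetical unseen point
guess_member = member_loss < 0.1       # threshold chosen for illustration
```

On this toy setup the training point’s loss is orders of magnitude below the outsider’s, which is the signal the attacker exploits — and why regulators and researchers treat query access to confidence scores as a privacy surface.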

Meanwhile, in Europe, the EU Commission’s High-Level Expert Group on AI released guidelines outlining how organizations should properly self-assess their machine learning systems for trustworthiness. Beyond ethical, fairness and privacy concerns, there was a section on robustness of machine learning systems, where they directly recommended testing your models in an adversarial fashion to determine whether the system can withstand such attacks.

An excerpt of the robustness questions is below:

  • How exposed is the AI system to cyber-attacks?
  • Did you assess potential forms of attacks to which the AI system could be vulnerable?
  • Did you consider different types of vulnerabilities and potential entry points for attacks such as:
    • Data poisoning (i.e. manipulation of training data);
    • Model evasion (i.e. classifying the data according to the attacker's will);
    • Model inversion (i.e. inferring the model parameters);
  • Did you put measures in place to ensure the integrity, robustness and overall security of the AI system against potential attacks over its lifecycle?
  • Did you red-team/pentest the system?
  • Did you inform end-users of the duration of security coverage and updates?

What gives? To be fair, the CFAA is a law usually targeting people doing these types of activities from outside an organization, whereas the EU guidelines specify testing your own models; but this dissonance isn’t likely to sit well with researchers and hobbyists. Turning adversarial ML into an internal-only affair is likely to create the same problems that computer and software security have had for decades. The best way to assess your system is to support bug bounties and other programs where you encourage external researchers and experts to attack your systems and reward them for the problems they find, not to potentially make those activities illegal or prosecutable.

Feedback, Questions, Requests?

This is my first (of many!) newsletters and I’d love to hear how you liked it. What topics would you like to see more of? Any articles I should read? Does this format work? Were there topics that were confusing or unclear (technical, not technical enough)? Please feel free to reply here, on Twitter and in the next issue I might assemble a small privacy-aware survey, so please share this newsletter if you enjoyed it so I can reach more folks next time. 🤗

With Love and Privacy, kjam