Differential Privacy

These days, companies use more and more customer data to improve their products and services. On October 2, 2006, Netflix announced a $1 million prize for improving its movie recommendation algorithm. Netflix released an anonymized dataset containing roughly 100 million ratings of about 17,000 movies by some 500,000 subscribers, and asserted that all personally identifiable information (PII) had been removed. Soon afterwards, Arvind Narayanan and Vitaly Shmatikov published a paper showing that removing PII is not enough to protect data privacy: they used the Internet Movie Database (IMDb) as an external source to re-identify anonymous Netflix subscribers. This sobering result alerted academia and practitioners alike, and helped motivate a significant body of research known as Differential Privacy.

What is Differential Privacy?

Differential Privacy allows companies to collect information about their users without compromising any individual's privacy, by adding a small amount of noise to the original data. The attack described above is called a linkage attack: it happens when pieces of seemingly anonymous data are combined to reveal real identities. Research has also shown that 87% of Americans can be uniquely identified with only three pieces of information: date of birth, ZIP code, and gender. These two examples show that anonymization alone is not enough to protect individual privacy.

How Can Differential Privacy Help?

Differential Privacy helps companies gain insights from large datasets while still maintaining privacy. It works by using an algorithm to add a controlled amount of randomness, or noise, to the computation. For example, imagine a survey that asks people whether they like bananas. We want to know the collective opinion, not the individual responses. The problem is that once we receive the survey results, we can still see each participant's individual response. To protect the participants, we can flip responses at a predetermined frequency: because any single answer may have been flipped, no individual response can be trusted as genuine, yet we can still recover the collective result of the survey.
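The mechanism sketched above is a classic technique known as randomized response. Below is a minimal Python sketch of one common variant, in which each participant tells the truth with probability 0.75 and flips their answer otherwise; the 0.75 probability and the simulated survey numbers are illustrative assumptions, not values from a real survey:

```python
import random

def randomized_response(true_answer: bool, p_truth: float = 0.75) -> bool:
    """Report the true answer with probability p_truth, otherwise flip it."""
    if random.random() < p_truth:
        return true_answer
    return not true_answer

def estimate_true_rate(reports: list[bool], p_truth: float = 0.75) -> float:
    """Correct the observed 'yes' rate for the known flipping probability.

    Observed rate = p_truth * true_rate + (1 - p_truth) * (1 - true_rate),
    so we can solve for true_rate.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)

# Simulate 10,000 participants, 30% of whom truly like bananas.
random.seed(0)
truths = [random.random() < 0.30 for _ in range(10_000)]
reports = [randomized_response(t) for t in truths]
print(f"Observed 'yes' rate: {sum(reports) / len(reports):.3f}")
print(f"Estimated true rate: {estimate_true_rate(reports):.3f}")  # close to 0.30
```

Because the flipping probability is known in advance, the aggregate rate can be recovered with good accuracy even though any single reported answer is deniable.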

Trade-offs with Differential Privacy

Like all things in security, Differential Privacy comes with its own trade-offs. In this case, the more noise we add to the original responses, the more privacy is protected, but the less accurate the data becomes.
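In practice, this trade-off is tuned with a privacy parameter commonly called epsilon: smaller epsilon means stronger privacy but noisier answers. Here is a minimal sketch using the Laplace mechanism, a standard way to add such noise to a count; the epsilon values and the counts are illustrative assumptions:

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float) -> float:
    """Release a count via the Laplace mechanism.

    A counting query changes by at most 1 when one person's data changes
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Illustrative numbers: suppose 3,000 of 10,000 participants like bananas.
TRUE_COUNT = 3_000
np.random.seed(0)
for epsilon in (0.01, 0.1, 1.0):  # smaller epsilon = more privacy, more noise
    samples = [round(noisy_count(TRUE_COUNT, epsilon)) for _ in range(5)]
    print(f"epsilon={epsilon}: {samples}")
```

Running this shows the trade-off directly: at epsilon = 0.01 the released counts swing far from 3,000, while at epsilon = 1.0 they stay close to the true value but offer weaker privacy.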

Conclusion

Differential Privacy can make data less attractive to would-be attackers and help protect it against linkage attacks. Differential Privacy alone can't make data foolproof; it is only a single defense in a broader arsenal, and should be used alongside other measures such as encryption and access control. Choosing to share our data can have benefits, but that doesn't mean we have to sacrifice our privacy.