Evolving global data privacy and consumer protection laws make it important for companies to protect sensitive data from acquisition through its use in AI models.
The rapid proliferation of data and analytics tools in recent years has given data analysts and data scientists countless opportunities to analyze this new data in new and interesting ways.
These new tools and techniques have also created the need for explainable AI (xAI), which seeks to make machine learning and deep learning models and their results more understandable, both to data scientists themselves and to end users, and for responsible AI. Equally important, responsible AI focuses on using the data itself responsibly, including maintaining the privacy of individuals and institutions.
Responsible AI is growing increasingly important for businesses and other institutions as legislation protecting consumer data continues to be developed and adopted around the world, most famously the EU’s GDPR and California’s CCPA.
However, the measures that businesses and other institutions have in place for protecting privacy are not necessarily enough, said Sarah Bird, Head of Research and Emerging Technology Strategy for Azure AI at Microsoft Azure, during the recent virtual Spark + AI Summit.
She explained that the initial steps that most businesses and governments put into place to protect privacy while using data are:
- Putting access controls in place to ensure that no one has unnecessary access to the data set.
- Anonymizing values in a data set, so that even the data scientists themselves do not see specific information about individuals.
However, even when companies take the above steps, a machine learning or AI model could still end up revealing private information about individuals. She gave the example of an AI model that auto-completes sentences when drafting e-mails. Imagine that the training data contains some rare cases, such as an e-mail in which someone wrote: “My social security number is …” What if, in production, the model were to auto-complete this sentence with someone’s real social security number from the training data, even if that data had been anonymized?
Even without sophisticated AI models, basic statistics and aggregation of anonymized data could be used to reconstruct the original data set and reveal private information.
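To make this concrete, here is a minimal, hypothetical sketch of such a “differencing” attack: two legitimate-looking aggregate queries over an anonymized table are combined to reveal one individual’s exact value. The table, columns, and values are invented for illustration.

```python
# Hypothetical differencing attack: no query ever returns an individual
# record, yet combining two aggregates exposes one person's salary.
import pandas as pd

# Anonymized salary data: no names, only an opaque row id.
df = pd.DataFrame({
    "row_id":     [1, 2, 3, 4, 5],
    "department": ["HR", "HR", "HR", "HR", "HR"],
    "salary":     [52_000, 61_000, 58_000, 75_000, 66_000],
})

# Query 1: total salary for the whole department (a published statistic).
total_all = df["salary"].sum()

# Query 2: the same total, but over a subset that happens to exclude
# exactly one person (row_id 5).
total_without_one = df[df["row_id"] != 5]["salary"].sum()

# The difference is that one person's exact salary.
print(total_all - total_without_one)  # 66000
```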
That challenge is especially felt by those working on the US Census, said Simson Garfinkel, Senior Computer Scientist for Confidentiality and Data Access at the US Census Bureau, during another conference session. He spoke about the challenge of publishing detailed statistics about populations at different levels (state, county, and even individual blocks) while keeping the specific information about the more than 300 million people in the Bureau’s database completely private, and about how the Bureau is solving it.
They ran tests internally at the Census Bureau to see what would happen if they tried to use their anonymized, aggregate data about the US population to reveal information about individuals, he said. They first performed a database reconstruction from the aggregated census statistics, which produced a database of individual-level records that were still anonymized. Next, they matched this database against a commercial data set that does include people’s identities, and they were able to reconstruct confidential data for 17% of the population.
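As a rough illustration of that matching step (with entirely invented data and column names), linking a reconstructed but anonymized table to a commercial data set can be as simple as a join on quasi-identifiers such as census block, age, and sex:

```python
# Hypothetical re-identification by record linkage: all data is invented.
import pandas as pd

# Reconstructed from published aggregates: detailed attributes, no names.
reconstructed = pd.DataFrame({
    "block": ["A1", "A1", "B2"],
    "age":   [34, 71, 34],
    "sex":   ["F", "M", "M"],
    "income_bracket": ["high", "low", "medium"],
})

# Commercial data set: names tied to the same quasi-identifiers.
commercial = pd.DataFrame({
    "name":  ["Alice Smith", "Bob Jones"],
    "block": ["A1", "B2"],
    "age":   [34, 34],
    "sex":   ["F", "M"],
})

# An exact match on the quasi-identifiers links a name to confidential data.
reidentified = commercial.merge(reconstructed, on=["block", "age", "sex"])
print(reidentified[["name", "income_bracket"]])
```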
Responsible AI and Differential Privacy
So, how can we analyze data and publish results while being truly responsible with the data and preventing these database reconstructions?
One method, discussed by both Bird and Garfinkel, is differential privacy. The essence of differential privacy is that some amount of noise is added to the data, such that if someone tries to reconstruct the data set, the reconstruction will not accurately match the original data entries.
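As a minimal sketch of the idea, the Laplace mechanism is one standard way to achieve differential privacy for a numeric query such as a count. This is only an illustration of the general technique Bird and Garfinkel describe, not the specific implementation used at Microsoft or the Census Bureau:

```python
# Minimal sketch of the Laplace mechanism for a count query.
import numpy as np

def private_count(data, epsilon):
    """Return a differentially private count of the records in `data`.

    A count query has sensitivity 1 (adding or removing one person changes
    the result by at most 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy.
    """
    true_count = len(data)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# The published count is close to, but not exactly, the true value.
records = list(range(10_000))   # 10,000 hypothetical individuals
print(private_count(records, epsilon=0.5))
```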
In practice, the mathematical algorithms behind differential privacy are complex, and as Bird points out, this is still an active area of research. At Microsoft Azure, they are partnering with researchers at Harvard on an open-source differential privacy platform.
One of the clear downsides to this approach is that it can add noticeable noise to the results, especially for smaller datasets, thereby reducing the accuracy of the results. However, one of the benefits of differential privacy is that you can adjust the amount of noise that is added in order to balance this trade-off between accuracy and privacy.
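A rough way to see this trade-off (with illustrative numbers, not results from either talk): the noise scale depends only on the privacy parameter epsilon, not on the size of the data set, so the same noise is far more damaging, in relative terms, to a small data set than to a large one, and raising epsilon shrinks the noise at the cost of weaker privacy.

```python
# Illustrative accuracy/privacy trade-off for a noisy count query.
import numpy as np

rng = np.random.default_rng(0)

for true_count in (100, 100_000):            # small vs. large data set
    for epsilon in (0.1, 1.0, 10.0):         # stronger -> weaker privacy
        noise = rng.laplace(scale=1.0 / epsilon, size=1_000)
        rel_error = np.mean(np.abs(noise)) / true_count
        print(f"n={true_count:>7}  epsilon={epsilon:>4}  "
              f"avg relative error={rel_error:.4%}")
```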
Another benefit of differential privacy is that it potentially allows data professionals to use data that previously was considered too sensitive to include in AI models.
Garfinkel, from the Census Bureau, points out that differential privacy is also future-proof, meaning that the original, private data should be protected no matter what new algorithms are developed in the future for reverse-engineering the published statistics.
Bird also addresses the question of why Microsoft Azure decided to make its differential privacy software open-source. Essentially, the issue of responsible AI is so important that they want everyone to be able to integrate differential privacy into their workflows without having to derive and code the underlying mathematical algorithms themselves. Furthermore, making the code open-source allows more people outside the organization to test it and advance the state of the art. Finally, outside experts can inspect and verify the algorithms and confirm that they truly do protect privacy.
The continued adoption of data privacy and consumer protection laws around the world is making it even more important for companies to take measures to protect sensitive data from the time it is acquired, through the analysis and model-training stage, and all the way to the final published results or the in-production AI model. Differential privacy can help move companies toward achieving these goals and making AI more responsible.