Core Concepts
Differential Privacy
Privacy is crucial in data science and machine learning, especially when sensitive data is involved. The challenge is how to analyse and share statistical results without compromising privacy. In this setting, a privacy breach occurs when an unauthorised party can infer an individual's original information from a private database. Differential privacy is a robust framework for keeping individual information safe during data analysis.
What Is Differential Privacy?
Differential Privacy (DP) is a framework designed to safeguard the privacy of individual data during statistical analyses. It achieves this by inserting controlled random noise into data queries, concealing any specific record's impact. This approach shields against re-identification and other privacy breaches, even when dealing with sensitive data. DP's significance grows as conventional privacy methods fail to withstand sophisticated data analysis techniques.
The core principle of DP is to extract valuable insights from vast datasets while ensuring that the inclusion or exclusion of any single data point does not substantially alter the results, thus preserving each individual's privacy.
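As a minimal sketch of this principle (in Python with NumPy; the dataset, the age-based counting query, and the choice of $\epsilon$ are illustrative assumptions, not part of any particular library), the Laplace mechanism below adds noise calibrated so that the presence or absence of one record is hidden in the output:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def private_count(data, epsilon=1.0):
    """Count records with age >= 40, releasing the result with Laplace noise.

    A counting query has sensitivity 1: adding or removing one person
    changes the true count by at most 1. Calibrating the Laplace noise
    scale to sensitivity / epsilon masks that single-record change.
    """
    true_count = sum(1 for age in data if age >= 40)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

dataset = [34, 45, 29, 62, 51, 38]   # hypothetical ages, your record (45) included
neighbor = [34, 29, 62, 51, 38]      # the same data without your record

print(private_count(dataset))   # noisy answer near the true count of 3
print(private_count(neighbor))  # noisy answer near the true count of 2
```

Because the noise scale matches the query's sensitivity, the two noisy answers are statistically hard to tell apart, even though the underlying counts differ by one.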
These are some of the fields in which differential privacy is most commonly applied:
- Healthcare organisations can leverage DP to conduct research, analyse patient outcomes, and improve treatment protocols without compromising patient confidentiality.
- Banks and financial services firms can use DP to detect fraudulent activities, assess credit risk, and optimise investment strategies securely.
- Social scientists can use DP to study population trends, conduct surveys, and analyse public opinion without exposing sensitive information.
Differential Privacy Definition
Picture two data sets on which we want to perform analysis using a mechanism ($\mathcal{M}$).
Both of them are identical, except for the fact that on one of them, your data is present in a single row, and on the other one, it is not. These datasets are known as neighboring or adjacent datasets ($D$ and $D'$), since they differ on a single individual.
We can define $\Pr[\mathcal{M}(D) \in S]$ as the probability that the mechanism, evaluated on a data set ($D$), returns an output belonging to any subset $S$ of possible outputs. In simple words, that is the probability of the mechanism returning a solution inside $S$, and we can define the same for the neighboring dataset as $\Pr[\mathcal{M}(D') \in S]$.
We can then use differential privacy to set parameters ($\epsilon$ and $\delta$) that bound our privacy loss from using that mechanism on any given data set, in what is known as Approximate Differential Privacy.
Formally, $(\epsilon, \delta)$-Approximate Differential Privacy is satisfied for a mechanism ($\mathcal{M}$) on two neighboring data sets ($D$ and $D'$) if, for every subset $S$ of possible outputs,

$$\Pr[\mathcal{M}(D) \in S] \leq e^{\epsilon} \Pr[\mathcal{M}(D') \in S] + \delta$$

where the privacy parameters epsilon ($\epsilon$) and delta ($\delta$) can be described as follows:
- $\epsilon$ bounds the privacy loss: it defines the maximum distance between the output distributions of the mechanism on two neighboring databases, which differ by only one record. A smaller epsilon indicates a stronger privacy guarantee, meaning that the outputs of the mechanism will be more similar for neighboring databases. This makes it harder for an adversary to distinguish between the two databases and learn anything about a specific individual's data.
- $\delta$ is the probability that the privacy guarantee will be violated. A smaller delta means a lower chance of a privacy leak. For example, if $\delta = 0.01$, differential privacy holds 99% of the time, while if $\delta = 0.001$, it holds 99.9% of the time.
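To make the definition concrete, the following sketch (an illustrative Python experiment, not a formal proof; the counts, the output set $S$, and the sample size are assumptions) empirically estimates both sides of the inequality for the Laplace mechanism, which satisfies pure $\epsilon$-DP, i.e. $\delta = 0$:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

epsilon, delta = 1.0, 0.0   # the Laplace mechanism satisfies delta = 0
sensitivity = 1.0           # counting query: neighbors differ by at most 1

def mechanism(true_value, n_samples):
    """Laplace mechanism: true answer plus Laplace(sensitivity/epsilon) noise."""
    return true_value + rng.laplace(scale=sensitivity / epsilon, size=n_samples)

# Neighboring datasets whose true counts differ by exactly one record.
count_D, count_D_prime = 3, 2
n = 1_000_000
S = (2.5, 3.5)  # an arbitrary subset of possible outputs

def prob_in_S(samples):
    return np.mean((samples > S[0]) & (samples < S[1]))

p = prob_in_S(mechanism(count_D, n))        # estimates Pr[M(D) in S]
q = prob_in_S(mechanism(count_D_prime, n))  # estimates Pr[M(D') in S]

# The (epsilon, delta)-DP guarantee bounds p by exp(epsilon) * q + delta.
print(f"Pr[M(D) in S]  ~ {p:.4f}")
print(f"DP upper bound ~ {np.exp(epsilon) * q + delta:.4f}")
assert p <= np.exp(epsilon) * q + delta
```

The estimated probability on $D$ stays below the bound computed from $D'$, which is exactly what the inequality guarantees for every choice of $S$.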
Understanding Epsilon and Delta
Epsilon ($\epsilon$) and delta ($\delta$) are the base parameters that quantify and manage privacy protection in $(\epsilon, \delta)$-DP. These parameters help strike a delicate balance between privacy guarantees and the utility of data analysis. They are defined as follows:
- Epsilon ($\epsilon$): Epsilon quantifies the privacy loss, bounding the amount of information about an individual that can be inferred from the computed results. A smaller epsilon value indicates stronger privacy protection.
- Delta ($\delta$): Delta represents the probability of a privacy violation. It measures the likelihood that an adversary can distinguish between two neighboring datasets based on the computed results. A smaller delta value implies a lower probability of a privacy breach.
Understanding and appropriately managing epsilon and delta are essential to implementing a robust differential privacy mechanism. By carefully balancing these parameters, organisations can achieve adequate privacy protection while preserving the utility of data analysis.
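The trade-off can be seen directly in the Laplace mechanism, where the noise scale is the query sensitivity divided by epsilon. The sketch below (an illustrative Python example; the query answer and epsilon values are assumptions chosen for demonstration) shows how shrinking epsilon inflates the error of the released answer:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

true_count = 1000   # hypothetical answer to a counting query
sensitivity = 1.0   # one record changes a count by at most 1

# Smaller epsilon -> larger Laplace scale -> noisier (but more private) answers.
for epsilon in (0.01, 0.1, 1.0, 10.0):
    scale = sensitivity / epsilon
    answers = true_count + rng.laplace(scale=scale, size=10_000)
    rmse = np.sqrt(np.mean((answers - true_count) ** 2))
    print(f"epsilon={epsilon:>5}: noise scale={scale:>6.1f}, RMSE ~ {rmse:8.1f}")
```

Halving epsilon doubles the noise scale, so choosing epsilon is an explicit decision about how much accuracy to spend in exchange for privacy.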
See the Privacy Budget page for best practices on choosing epsilon and delta values.