Differential Privacy

In data science and machine learning, privacy is of utmost importance when working with sensitive data. As such, we consider differential privacy (DP), the state-of-the-art privacy framework, which provides mathematical guarantees for individuals whose data is being analysed.

In simple words, DP adds "noise" to every output to guarantee that the results of computations performed on sensitive data do not reveal specific information about any individual participant, by making those results statistically indistinguishable from each other.

Formally, $(\epsilon, \delta)$-DP is satisfied if, for a mechanism $M$, any two neighboring datasets $x$ and $x'$, and all sets of outcomes $S$, the following holds:

$$\Pr[M(x) \in S] \leq e^{\epsilon}\Pr[M(x') \in S] + \delta$$

where the privacy parameters epsilon ($\epsilon$) and delta ($\delta$) can be described as follows:

  • $\epsilon$: a smaller epsilon indicates a stronger privacy guarantee, meaning that the outputs of the mechanism on neighboring databases are statistically harder to differentiate due to the greater amount of noise being added. This makes it harder for an adversary to learn information about a specific individual's data, since any given dataset is a neighbor of another dataset that differs in a single row.
  • $\delta$: the (ideally negligibly small) probability that the pure $\epsilon$-DP guarantee fails to hold. For example, if δ = 0.01, the ε-DP guarantee holds with probability at least 99%, while if δ = 0.001, it holds with probability at least 99.9%.
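
To make the definition concrete, below is a minimal sketch of the Laplace mechanism, one classic way to satisfy pure $\epsilon$-DP (that is, $\delta = 0$). The function name `laplace_mechanism` and the use of NumPy are illustrative assumptions rather than references to any particular DP library.

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Return a differentially private estimate of `true_value`.

    Adding Laplace noise with scale sensitivity / epsilon satisfies pure
    epsilon-DP (delta = 0) for a query whose output changes by at most
    `sensitivity` between any two neighboring datasets.
    """
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# A counting query ("how many rows satisfy a predicate?") has sensitivity 1,
# since adding or removing one individual changes the count by at most 1.
data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
true_count = float(data.sum())
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(f"true count = {true_count:.0f}, noisy count = {noisy_count:.2f}")
```

Note how the query's sensitivity bounds how much one individual's data can change the output; that bound is what ties the noise scale to the privacy guarantee.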

Applying DP to machine learning and data science tasks involves carefully designing algorithms and mechanisms that inject controlled noise into computations, and mapping that noise to a privacy guarantee determined by the epsilon ($\epsilon$) and delta ($\delta$) parameters. As we will discover later, the main challenge lies in balancing utility against strong privacy guarantees: additional privacy protection comes at the cost of accuracy in the results, whilst aiming for more accurate results weakens the underlying privacy guarantees.
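
To make that trade-off tangible, the short sketch below (again assuming NumPy and the Laplace mechanism above; nothing here is library-specific) empirically estimates the error of a noisy count at several values of $\epsilon$: the smaller the epsilon, the stronger the privacy and the less accurate the output.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
sensitivity = 1.0  # counting query, as above

# Smaller epsilon -> larger noise scale -> stronger privacy, lower utility.
for epsilon in (0.1, 0.5, 1.0, 5.0):
    scale = sensitivity / epsilon
    # Empirical mean absolute error of the Laplace mechanism at this epsilon.
    errors = np.abs(rng.laplace(loc=0.0, scale=scale, size=100_000))
    print(f"epsilon = {epsilon:>3}: mean absolute error ~ {errors.mean():.2f}")
```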