Diffprivlib

The Diffprivlib library implements differential privacy techniques for various data analysis tasks. It can be viewed as a differentially private version of scikit-learn, implementing DP equivalents of many sklearn models. The library provides mechanisms for adding noise to data, privacy-preserving machine learning algorithms, and statistical analysis tools. Diffprivlib is designed to help developers incorporate differential privacy into their applications and research projects.

Below, we provide a guide to op_diffprivlib. The first section shows how to get started with the library, while the following sections show how to use it for regression and classification models.

Getting started with op_diffprivlib

Here we provide some code snippets to help you get started with op_diffprivlib. To use op_diffprivlib, import the library as shown in the following code block:

%%ag
import op_diffprivlib

Preprocessing

In order to work with the models available in op_diffprivlib we first need to preprocess the data. This includes standardizing the features by removing the mean and scaling to unit variance, calculated with differential privacy guarantees.

To carry out the standardisation, we need the bounds of the data, which we usually get from the metadata of our private dataframe. In op_diffprivlib, metadata and bounds are compulsory for the methods to function. This is a stronger requirement than in diffprivlib, where bounds are optional and can be generated by the library during data fitting.
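
To illustrate why bounds matter, the following is a minimal sketch of a differentially private mean using the Laplace mechanism. It is not the op_diffprivlib implementation; the function name `dp_mean` and its parameters are hypothetical. The point is that the bounds determine how much a single record can change the result, and therefore how much noise is needed.

```python
import random


def dp_mean(values, lower, upper, epsilon, rng=None):
    """Sketch of a differentially private mean for values clipped to [lower, upper]."""
    rng = rng or random.Random()
    n = len(values)
    # Clipping enforces the bounds, so replacing one record can move the
    # mean by at most (upper - lower) / n. This is the sensitivity.
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / n
    scale = sensitivity / epsilon
    # The difference of two independent Exp(1) draws is Laplace(0, 1).
    noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
    return sum(clipped) / n + noise


# Hypothetical ages with bounds chosen from domain knowledge
ages = [23, 35, 41, 58, 62]
noisy_mean = dp_mean(ages, lower=18, upper=90, epsilon=1.0)
```

Wider bounds or a smaller epsilon both increase the noise scale, which is why tight, well-chosen bounds tend to improve accuracy.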

Creating metadata for the imported data

Once you import new data into the AG environment, the first step is to create metadata. This step is only needed if the data does not already come with metadata.

%%ag
import op_pandas as opd  # assumed alias; the code below refers to op_pandas as opd

# Generate metadata: a (lower bound, upper bound) pair for each column
metadata_dict = {}
for col in df.columns:
    low_bound, high_bound = df[col].min(), df[col].max()
    metadata_dict[col] = (low_bound, high_bound)

# Create a private dataframe with the imported data and the generated metadata
pdf = opd.PrivateDataFrame(df, metadata=metadata_dict)

Creating bounds for the imported data

Once we have the metadata, we use it to generate the bounds of all our numerical columns. Bounds are needed for the inputs to the model, denoted by X_columns here. We also collect the bounds of the target column, y_columns, which linear regression uses later.

%%ag
X_columns = pdf.columns[:-1]
y_columns = pdf.columns[-1:]  # assumption: the last column is the target

# Bounds of the model inputs, in column order
low_bound, high_bound = [], []
for col in X_columns:
    low, high = pdf.metadata[col]
    low_bound.append(low)
    high_bound.append(high)
X_bounds = (low_bound, high_bound)

# Bounds of the target
low_bound, high_bound = [], []
for col in y_columns:
    low, high = pdf.metadata[col]
    low_bound.append(low)
    high_bound.append(high)
y_bounds = (low_bound, high_bound)
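
The two loops above follow the same pattern, so they can be factored into a small helper. The sketch below uses a plain dictionary in place of `pdf.metadata`; the helper name `bounds_from_metadata` and the example column names are hypothetical.

```python
def bounds_from_metadata(metadata, columns):
    """Return (lows, highs) lists for the given columns, in column order."""
    lows, highs = [], []
    for name in columns:
        low, high = metadata[name]
        lows.append(low)
        highs.append(high)
    return lows, highs


# Hypothetical metadata standing in for pdf.metadata
metadata = {"age": (18, 90), "income": (0, 250000), "target": (0, 1)}

X_bounds = bounds_from_metadata(metadata, ["age", "income"])
y_bounds = bounds_from_metadata(metadata, ["target"])
```

Keeping the column order explicit matters because the bounds are passed to the models as positional lists, one entry per feature.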

Generating the training and testing split

Once the bounds of the data are in place, we generate the training and testing split of the data.

%%ag
from op_pandas import train_test_split

pdf_train, pdf_test = train_test_split(pdf)

X_train, X_test, y_train, y_test = (
    pdf_train.drop(['target'], inplace=False),
    pdf_test.drop(['target'], inplace=False),
    pdf_train['target'],
    pdf_test['target']
)

Standardising the training and testing inputs

In this step we standardise the input dataset, which is a common requirement for many machine learning estimators. We use the StandardScaler class in op_diffprivlib, passing the bounds we calculated earlier as one of its inputs. Once standardisation is done, we need to recalculate the bounds of the input data. Note that the scaler is fitted on the training inputs only and then applied to the test inputs, as is the practice with sklearn, to prevent information leakage.

%%ag
import pandas as pd  # needed for pd.DataFrame below
from op_diffprivlib.models import StandardScaler

X_scaler = StandardScaler(epsilon=100, bounds=X_bounds)

# Fit on the training inputs only, then apply the same transform to the test inputs
X_train_std = X_scaler.fit_transform(X_train)
X_test_std = X_scaler.transform(X_test)

X_train_std = pd.DataFrame(X_train_std, columns=X_train.columns)
X_test_std = pd.DataFrame(X_test_std, columns=X_test.columns)

# Recalculate the metadata bounds for the standardised X data
metadata_dict_std = {}
for col in X_train_std.columns:
    low_bound, high_bound = X_train_std[col].min(), X_train_std[col].max()
    metadata_dict_std[col] = (low_bound, high_bound)

X_train_std = opd.PrivateDataFrame(X_train_std, metadata=metadata_dict_std)
X_test_std = opd.PrivateDataFrame(X_test_std, metadata=metadata_dict_std)

# Recalculate X bounds for the standardised data
X_columns = X_train_std.columns
low_bound, high_bound = [], []
for col in X_columns:
    low, high = X_train_std.metadata[col]
    low_bound.append(low)
    high_bound.append(high)
X_bounds_std = (low_bound, high_bound)
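
The block above recalculates the bounds from the minimum and maximum of the standardised training data. An alternative is to push the original bounds through the same transform, z = (x - mean) / scale, so that the new bounds depend only on the original bounds and the fitted scaler. The sketch below is a plain-Python illustration of that arithmetic. The helper name is hypothetical, and it assumes the per-feature mean and scale are available from the fitted scaler (in scikit-learn these are the `mean_` and `scale_` attributes; whether op_diffprivlib exposes them is an assumption).

```python
def standardise_bounds(lows, highs, means, scales):
    """Map per-feature bounds through z = (x - mean) / scale.

    Assumes every scale is positive, so lower bounds stay below upper bounds.
    """
    lows_std = [(lo - m) / s for lo, m, s in zip(lows, means, scales)]
    highs_std = [(hi - m) / s for hi, m, s in zip(highs, means, scales)]
    return lows_std, highs_std


# Hypothetical values for two features
lows, highs = [0, 10], [10, 30]
means, scales = [5, 20], [5, 5]

X_bounds_std = standardise_bounds(lows, highs, means, scales)
```

Bounds obtained this way can be wider than the observed range of the standardised data, which trades some accuracy for bounds that do not depend on individual records.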

Regression models

Linear Regression in AG with op_diffprivlib

For linear regression, op_diffprivlib needs epsilon and the bounds of the data. After fitting, we can use the score function to assess model performance. Scoring on the training inputs tells us how well the model fits the data it learned from (values approaching 1 are good), while scoring on the test inputs measures the performance of the model on the prediction task.

We can export these scores and plot them in our local environment.

%%ag
from op_diffprivlib.models import LinearRegression
seed = 1 # to have a repeatable result for debugging
epsilon = 0.1

reg_odpl = LinearRegression(epsilon=epsilon, bounds_X=X_bounds_std, bounds_y=y_bounds, random_state=seed)
reg_odpl.fit(X_train_std, y_train)

y_pred = reg_odpl.predict(X_test_std)

learning_score = reg_odpl.score(X_train_std, y_train)
prediction_score = reg_odpl.score(X_test_std, y_test)

export(learning_score, 'learning_score')
export(prediction_score, 'prediction_score')
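
To help interpret the exported values: assuming score follows the scikit-learn convention for regressors, it returns the coefficient of determination, R². The plain-Python sketch below shows how that quantity is computed; the function name is illustrative and the data is made up.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


perfect = r2_score([1, 2, 3], [1, 2, 3])   # predictions match exactly
baseline = r2_score([1, 2, 3], [2, 2, 2])  # always predicting the mean
```

A score of 1 means a perfect fit and 0 means the model does no better than predicting the mean. Negative values are possible, and with a small epsilon the added noise can push the score below 0.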

Classification models

There are three main options for classification models in op_diffprivlib: Gaussian Naive Bayes, Logistic Regression, and Tree-Based Models. Here we show the first two.

Gaussian Naive Bayes in AG with op_diffprivlib

This builds on the diffprivlib implementation, which inherits from the sklearn.naive_bayes.GaussianNB class in scikit-learn and adds noise to the learned means and variances to satisfy differential privacy.

After fitting, we can use the score function to assess model performance. Scoring on the training inputs tells us how well the model fits the data it learned from (values approaching 1 are good), while scoring on the test inputs measures the performance of the model on the prediction task.

We can export these scores and plot them in our local environment.

%%ag
from op_diffprivlib.models import GaussianNB

seed = 1 # to have a repeatable result for debugging
epsilon = 0.1

clf = GaussianNB(epsilon=epsilon, bounds=X_bounds_std)
clf.fit(X_train_std, y_train)
y_pred = clf.predict(X_test_std)

learning_score = clf.score(X_train_std, y_train)
prediction_score = clf.score(X_test_std, y_test)

export(learning_score, 'learning_score')
export(prediction_score, 'prediction_score')
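
For classifiers, assuming score follows the scikit-learn convention, the value returned is the mean accuracy: the fraction of predictions that match the true labels. A minimal plain-Python sketch with made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: three of the four predictions are correct
score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
```

A large gap between the training score and the test score suggests the model fits the training data better than it generalises.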

Logistic Regression in AG with op_diffprivlib

This builds on the diffprivlib implementation, which inherits from sklearn.linear_model.LogisticRegression, with amendments to allow for differential privacy.

This model needs an input, data_norm: the maximum L2 norm of any row of the data. This defines the spread of data that will be protected by differential privacy. To preserve differential privacy fully, data_norm should be selected independently of the data, i.e. with domain knowledge.
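
One way to choose data_norm without inspecting individual rows is to derive it from per-feature bounds. The sketch below is a plain-Python illustration with a hypothetical helper name: each feature contributes at most max(|low|, |high|) to a row, so the square root of the sum of those squared maxima is an upper bound on the L2 norm of any row that respects the bounds.

```python
import math


def max_l2_norm_from_bounds(lows, highs):
    """Upper bound on the L2 norm of any row whose features lie within the bounds."""
    return math.sqrt(
        sum(max(abs(lo), abs(hi)) ** 2 for lo, hi in zip(lows, highs))
    )


# Hypothetical bounds for two features: [-3, 3] and [0, 4]
data_norm = max_l2_norm_from_bounds([-3, 0], [3, 4])
```

The result is only as data-independent as the bounds themselves, and it can be looser than the true maximum norm, which means more noise than strictly necessary.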

After fitting, we can use the score function to assess model performance. Scoring on the training inputs tells us how well the model fits the data it learned from (values approaching 1 are good), while scoring on the test inputs measures the performance of the model on the prediction task.

We can export these scores and plot them in our local environment.

%%ag
from op_diffprivlib.models import LogisticRegression

epsilon = 0.1

clf = LogisticRegression(epsilon=epsilon, data_norm=7.89)
clf.fit(X_train_std, y_train)
y_pred = clf.predict(X_test_std)

learning_score = clf.score(X_train_std, y_train)
prediction_score = clf.score(X_test_std, y_test)

export(learning_score, 'learning_score')
export(prediction_score, 'prediction_score')

API Reference

Currently, AG supports all the methods and classes within the following directories:

AG uses the same function names and signatures as diffprivlib.

Resources

The following is a list of helpful resources for using the DiffPrivLib library: