Diffprivlib
The Diffprivlib library implements differential privacy techniques for various data analysis tasks. It can be viewed as a differentially private version of scikit-learn, implementing the DP equivalents of many scikit-learn models. The library provides multiple functionalities, including mechanisms for adding noise to data, privacy-preserving machine learning algorithms, and statistical analysis tools. Diffprivlib is designed to assist developers in effectively incorporating differential privacy into their applications and research projects.
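As a quick illustration of the noise-addition mechanisms, the following minimal local sketch uses plain diffprivlib's Laplace mechanism (the epsilon and sensitivity values, and the query result of 42.0, are purely illustrative):
from diffprivlib.mechanisms import Laplace

# A Laplace mechanism calibrated for a query with sensitivity 1
mech = Laplace(epsilon=1.0, sensitivity=1.0)

# Add calibrated noise to a single query result
noisy_value = mech.randomise(42.0)
print(noisy_value)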
Below, we provide a guide to op_diffprivlib. The first section shows how to get started with the library, while the following sections present how you can use it for regression and classification models.
Getting started with op_diffprivlib
Here we provide some code snippets to help you get started with op_diffprivlib. To use the library, you need to import it as shown in the following code block:
%%ag
import op_diffprivlib
Preprocessing
In order to work with the models available in op_diffprivlib, we first need to preprocess the data. This includes standardizing the features by removing the mean and scaling to unit variance, calculated with differential privacy guarantees.
In order to carry out the standardization we need the bounds of the data, which we usually get from the metadata of our private dataframe.
In op_diffprivlib, metadata and bounds are compulsory for the functioning of the methods. This is a stronger requirement compared to diffprivlib, where bounds are optional and can be generated by the library during data fitting.
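For contrast, here is what plain diffprivlib allows when bounds are omitted: it infers them from the data at fit time and emits a PrivacyLeakWarning, since data-derived bounds can themselves leak information. A minimal local sketch (the toy X and y arrays are illustrative):
import numpy as np
from diffprivlib.models import GaussianNB

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([0, 1, 0])

# No bounds supplied: diffprivlib computes them from the data and
# warns about the potential privacy leak. op_diffprivlib instead
# requires bounds up front.
clf = GaussianNB(epsilon=1.0)
clf.fit(X, y)  # emits a PrivacyLeakWarning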
Creating metadata for the imported data
Once you import new data into the AG environment, the first step is to create metadata. This step only needs to be carried out if the data does not already come with metadata.
%%ag
import op_pandas as opd

# Generate metadata: record the lower and upper bound of each column
metadata_dict = {}
for i in df.columns:
    low_bound, high_bound = df[i].min(), df[i].max()
    metadata_dict[i] = (low_bound, high_bound)

# Create a private dataframe with the imported data and the generated metadata
pdf = opd.PrivateDataFrame(df, metadata=metadata_dict)
Creating bounds for the imported data
Once we have the metadata we use it to generate the bounds of all our numerical columns. Bounds are needed both for the inputs to the model, which we denote by X_columns here, and for the target, which we denote by y_columns.
%%ag
# The last column is the target; all other columns are model inputs
X_columns = pdf.columns[:-1]
y_columns = pdf.columns[-1:]

# Bounds for the input columns
low_bound, high_bound = [], []
for i in X_columns:
    low, high = pdf.metadata[i]
    low_bound.append(low)
    high_bound.append(high)
X_bounds = (low_bound, high_bound)

# Bounds for the target column
low_bound, high_bound = [], []
for i in y_columns:
    low, high = pdf.metadata[i]
    low_bound.append(low)
    high_bound.append(high)
y_bounds = (low_bound, high_bound)
Generating the training and testing split
Once we have the bounds of the data in place, we generate the training and testing split of the data.
%%ag
from op_pandas import train_test_split

# Split the private dataframe, then separate the inputs from the target
pdf_train, pdf_test = train_test_split(pdf)
X_train, X_test, y_train, y_test = (
    pdf_train.drop(['target'], inplace=False),
    pdf_test.drop(['target'], inplace=False),
    pdf_train['target'],
    pdf_test['target']
)
Standardizing the training and testing inputs
In this step we carry out the standardization of our input dataset, which is a common requirement for many machine learning estimators. We use the StandardScaler class in op_diffprivlib to carry out the standardization, passing the bounds we calculated earlier as one of its inputs. Once standardization is done we will need to recalculate the bounds of the input data. Please note that, as is the practice within sklearn, we fit the scaler on the training inputs only and then apply the fitted scaler to the test inputs, in order to prevent information leakage.
%%ag
import pandas as pd
from op_diffprivlib.models import StandardScaler

# Fit the scaler on the training inputs only, then apply it to the test inputs
X_scaler = StandardScaler(epsilon=100, bounds=X_bounds)
X_train_std = X_scaler.fit_transform(X_train)
X_test_std = X_scaler.transform(X_test)
X_train_std = pd.DataFrame(X_train_std, columns=X_train.columns)
X_test_std = pd.DataFrame(X_test_std, columns=X_test.columns)

# Recalculate the metadata bounds for the standardized X data
metadata_dict_std = {}
for i in X_train_std.columns:
    low_bound, high_bound = X_train_std[i].min(), X_train_std[i].max()
    metadata_dict_std[i] = (low_bound, high_bound)
X_train_std = opd.PrivateDataFrame(X_train_std, metadata=metadata_dict_std)
X_test_std = opd.PrivateDataFrame(X_test_std, metadata=metadata_dict_std)

# Recalculate X bounds for the standardized data
X_columns = X_train_std.columns
low_bound, high_bound = [], []
for i in X_columns:
    low, high = X_train_std.metadata[i]
    low_bound.append(low)
    high_bound.append(high)
X_bounds_std = (low_bound, high_bound)
Regression models
Linear Regression in AG with op_diffprivlib
For linear regression, op_diffprivlib needs epsilon and the bounds of the data. Once we carry out the fitting step, we can use the score function, which for regression returns the R² coefficient of determination, to understand model performance. Scoring on the training inputs tells us how much the model learned from the inputs (values approaching 1 are good), while scoring on the test inputs gives us the performance of the model on the prediction task.
We can export these scores and plot them in our local environment.
%%ag
from op_diffprivlib.models import LinearRegression
seed = 1 # to have a repeatable result for debugging
epsilon = 0.1
reg_odpl = LinearRegression(epsilon=epsilon, bounds_X=X_bounds_std, bounds_y=y_bounds, random_state=seed)
reg_odpl.fit(X_train_std, y_train)
y_pred = reg_odpl.predict(X_test_std)
learning_score = reg_odpl.score(X_train_std, y_train)
prediction_score = reg_odpl.score(X_test_std, y_test)
export(learning_score, 'learning_score')
export(prediction_score, 'prediction_score')
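Back in the local environment, we can inspect and plot the exported values. A minimal sketch, assuming the exported names learning_score and prediction_score are available in the local session as plain floats:
import matplotlib.pyplot as plt

# Compare how well the model fitted the training data with how well
# it predicts on held-out data
scores = {'learning': learning_score, 'prediction': prediction_score}
plt.bar(list(scores.keys()), list(scores.values()))
plt.ylabel('R² score')
plt.title('Differentially private linear regression')
plt.show()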
Classification models
There are three main options for classification models in op_diffprivlib: Gaussian Naive Bayes, Logistic Regression, and tree-based models. Here we show the first two in detail; a brief sketch of a tree-based model follows at the end of this section.
Gaussian Naive Bayes in AG with op_diffprivlib
This builds on the diffprivlib implementation, which inherits from the sklearn.naive_bayes.GaussianNB class in scikit-learn and adds noise to the learned means and variances in order to satisfy differential privacy.
As with the regression model, once we carry out the fitting step we can use the score function, which for classifiers returns the mean accuracy, to understand model performance: scoring on the training inputs tells us how much the model learned from the inputs, while scoring on the test inputs gives us its performance on the prediction task. We can export these scores and plot them in our local environment.
%%ag
from op_diffprivlib.models import GaussianNB
epsilon = 0.1
clf = GaussianNB(epsilon=epsilon, bounds=X_bounds_std)
clf.fit(X_train_std, y_train)
y_pred = clf.predict(X_test_std)
learning_score = clf.score(X_train_std, y_train)
prediction_score = clf.score(X_test_std, y_test)
export(learning_score, 'learning_score')
export(prediction_score, 'prediction_score')
Logistic Regression in AG with op_diffprivlib
This builds on the diffprivlib implementation, which inherits from sklearn.linear_model.LogisticRegression, with amendments to allow for the implementation of differential privacy.
This model needs an input data_norm: the maximum l2 norm of any row of the data. This defines the spread of data that will be protected by differential privacy. To preserve differential privacy fully, data_norm should be selected independently of the data, i.e. with domain knowledge.
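One hedged way to choose data_norm from domain knowledge alone: if you can assert a per-feature range before looking at the data, the largest possible l2 norm of a row follows directly. A sketch where the feature count and the ±4 range are illustrative assumptions, not values taken from the dataset:
import numpy as np

# Assumed domain knowledge: each of the n_features standardized
# columns should fall within [-4, 4]; both numbers are illustrative.
n_features = 4
per_feature_max = 4.0

# A row's l2 norm is largest when every feature sits at an extreme,
# giving sqrt(n_features) * per_feature_max
data_norm = np.sqrt(n_features) * per_feature_max
print(data_norm)  # 8.0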
As before, we use the score function on the training and test inputs to assess learning and prediction performance, and export the scores to plot them in our local environment.
%%ag
from op_diffprivlib.models import LogisticRegression
epsilon = 0.1
clf = LogisticRegression(epsilon=epsilon, data_norm=7.89)
clf.fit(X_train_std, y_train)
y_pred = clf.predict(X_test_std)
learning_score = clf.score(X_train_std, y_train)
prediction_score = clf.score(X_test_std, y_test)
export(learning_score, 'learning_score')
export(prediction_score, 'prediction_score')
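Tree-based models in AG with op_diffprivlib
Finally, the promised sketch of a tree-based model. This is a hedged example, assuming op_diffprivlib exposes diffprivlib's RandomForestClassifier under the same name and with the same epsilon and bounds parameters; check the API reference below before relying on it:
%%ag
from op_diffprivlib.models import RandomForestClassifier  # assumed import path

epsilon = 0.1
# epsilon and bounds follow the plain diffprivlib signature; depending
# on the library version you may also need to pass the class labels
# via a classes argument.
clf = RandomForestClassifier(epsilon=epsilon, bounds=X_bounds_std)
clf.fit(X_train_std, y_train)
prediction_score = clf.score(X_test_std, y_test)
export(prediction_score, 'prediction_score')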
API Reference
Currently, AG supports all the methods and classes within the following directories:
AG uses the same function names and signatures.
Resources
The following is a list of helpful resources for using the Diffprivlib library:
- Official Diffprivlib docs: Python library for differential privacy developed by IBM Research.
- GitHub repository: Explore the library's implementation of the differentially private methods.