SmartNoise Synth
SmartNoise-Synth is a privacy-preserving synthetic data generation tool developed by OpenDP. It uses advanced algorithms to generate realistic synthetic data while preserving privacy and ensuring statistical accuracy, making it valuable for research and analysis in sensitive domains like healthcare and finance.
Antigranular wraps the snsynth library to support PrivateDataframe with minimal changes to the existing function signatures and methods. As a result, the transition is as smooth as possible for those who have worked with SmartNoise-Synth before.
To use SmartNoise-Synth, import it as shown in the following code block:
%%ag
import op_snsynth
Step 1. Loading data
In this step, we download the PUMS dataset and upload it to the session. Then, we create a Synthesizer and fit it, so that we can later draw a sample from it.
import pandas as pd
# Load the dataset locally
pums = pd.read_csv("https://content.antigranular.com/image/notebook_content/PUMS.csv")
# Preview the non-private data
pums
     age  sex  educ  race   income  married
0     59    1     9     1      0.0        1
1     31    0     1     3  17000.0        0
2     36    1    11     1      0.0        1
3     54    1    11     1   9100.0        1
4     39    0     5     3  37000.0        0
...  ...  ...   ...   ...      ...      ...
995   73    0     3     3  24200.0        0
996   38    1     2     3      0.0        0
997   50    0    13     1  22000.0        1
998   44    1    14     4    500.0        1
999   29    1    11     1  66400.0        0
# Import it to AG
session.private_import(data=pums, name="pums")
dataframe cached to server, loading to kernel...
DataFrame loaded successfully to the kernel
Step 2. Using a synthesizer
%%ag
from op_snsynth import Synthesizer
# Build a Synthesizer
synth = Synthesizer.create("mwem", epsilon=1.0, iterations=6, verbose=True)
# Fit it
fit = synth.fit(pums, preprocessor_eps=0.1)
Processing 1 histograms
Histogram #0 split: [0 1 2 3 4 5]
Columns: 6
Dimensionality: 308,352
Cuboids possible: 63
1-2-way cuboids possible: 21
Fitting for 6 iterations
Number of queries: 12
Number of slices in queries: 2639
Per-Measure Epsilon: 0.083
Measurement Error: 27.63
[0] - Average error: 1.384. Selected 6 slices
[1] - Average error: 0.672. Selected 16 slices
[2] - Average error: 0.496. Selected 96 slices
[3] - Average error: 0.439. Selected 12 slices
[4] - Average error: 0.366. Selected 803 slices
[5] - Average error: 0.379. Selected 2 slices
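Several numbers in the verbose log can be reproduced by hand, which helps when reading the output. The sketch below uses only what the log itself shows: 6 columns, 12 queries, and epsilon=1.0. The even split of the budget across the 12 reported queries is an inference from the logged "Per-Measure Epsilon: 0.083", not a statement about the library's internals.

```python
from math import comb

n_columns = 6      # "Columns: 6" in the log
n_queries = 12     # "Number of queries: 12" in the log
epsilon = 1.0      # value passed to Synthesizer.create

# Every non-empty subset of columns defines one cuboid (marginal).
cuboids_possible = 2 ** n_columns - 1

# Cuboids over exactly one column or exactly two columns.
one_two_way_cuboids = comb(n_columns, 1) + comb(n_columns, 2)

# Assumption: the budget is divided evenly over the reported queries.
per_measure_epsilon = epsilon / n_queries

print(cuboids_possible)               # 63
print(one_two_way_cuboids)            # 21
print(round(per_measure_epsilon, 3))  # 0.083
```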
SmartNoise-Synth offers several synthesizers besides MWEM. Here is a brief overview of each:
Marginal Synthesizers:
AIM: A synthesizer that produces synthetic data while maintaining the marginal distributions of the original dataset.
MST: A marginal synthesizer that builds a maximum spanning tree over pairs of attributes and measures the corresponding marginals.
PAC-Synth: An aggregate-seeded synthesizer that builds synthetic data from differentially private marginal counts.
Neural Network Synthesizers:
DP-CTGAN: A differentially private version of the Conditional Tabular Generative Adversarial Network (CTGAN).
PATE-CTGAN: An extension of CTGAN, which incorporates Private Aggregation of Teacher Ensembles (PATE) for enhanced privacy.
PATE-GAN: Utilizes the PATE framework in a GAN setting to generate synthetic data with privacy guarantees.
DP-GAN: A differentially private Generative Adversarial Network tailored for generating synthetic datasets.
Hybrid Synthesizers:
QUAIL: A hybrid synthesizer that combines various techniques to balance data utility and privacy.
Each of these synthesizers offers unique approaches to generating synthetic data while addressing different aspects of privacy and data utility.
Step 3. Exporting a sample
%%ag
# Get a sample of the same size as the dataset
sample = synth.sample(len(pums))
%%ag
# Export the sample to the local environment
export(sample, "sample")
Setting up exported variable in local environment: sample
# Preview of the data
sample
     age  sex  educ  race   income  married
0     50    0    13     1  38502.4        1
1     60    1    12     1  23756.8        1
2     76    0    14     1  48332.8        1
3     40    0     9     1  43417.6        0
4     48    0     9     1  58163.2        0
...  ...  ...   ...   ...      ...      ...
995   37    0    13     1  67993.6        0
996   79    1    13     1  18841.6        1
997   52    1    11     2  53248.0        0
998   67    0     7     1  53248.0        0
999   36    1    11     1  28672.0        1
Step 4. Comparing to real data (Ridge Classifier)
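In this step, we define a function that compares the real data with the synthetic data we have just created. It trains one model on each dataset, evaluates each model on a held-out split of its own dataset, and prints a random-guessing baseline. We test it with both a Ridge Classifier and a Logistic Regression classifier.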
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split
def test_real_vs_synthetic_data(real, synthetic, model):
    # Create a DataFrame for synthetic data with the same columns as the real data
    synth_df = pd.DataFrame(synthetic, columns=real.columns)

    # Split real and synthetic datasets into features and labels
    X = real.iloc[:, :-1]
    y = real.iloc[:, -1]
    X_synth = synth_df.iloc[:, :-1]
    y_synth = synth_df.iloc[:, -1]

    # Split both real and synthetic data into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    x_train_synth, x_test_synth, y_train_synth, y_test_synth = train_test_split(
        X_synth, y_synth, test_size=0.2, random_state=42
    )

    # Train model on real data
    model_real = model()
    model_real.fit(x_train, y_train)

    # Train model on synthetic data
    model_fake = model()
    model_fake.fit(x_train_synth, y_train_synth)

    # Evaluate and print performance of model trained on real data
    predictions = model_real.predict(x_test)
    print("\nTrained on Real Data:")
    print(classification_report(y_test, predictions))
    print("Accuracy real:", accuracy_score(y_test, predictions))

    # Evaluate and print performance of model trained on synthetic data
    predictions = model_fake.predict(x_test_synth)
    print("\nTrained on Synthetic Data:")
    print(classification_report(y_test_synth, predictions))
    print("Accuracy synthetic:", accuracy_score(y_test_synth, predictions))

    # Evaluate and print performance of random guessing
    print("\nRandom Guessing:")
    guesses = np.random.randint(0, (max(y_test_synth) - min(y_test_synth) + 1), len(y_test_synth))
    np.random.shuffle(guesses)
    print(classification_report(y_test_synth, guesses))
    print("Accuracy guessing:", accuracy_score(y_test_synth, guesses))

    return model_real, model_fake
# Let us start by testing a Ridge Classifier
from sklearn.linear_model import RidgeClassifier
import numpy as np
# We will compare the real dataset `pums` and the synthetic dataset `sample`
# using a Ridge Classifier
test_real_vs_synthetic_data(pums, sample, RidgeClassifier)
Trained on Real Data:
              precision    recall  f1-score   support
           0       0.60      0.51      0.55        91
           1       0.63      0.72      0.67       109
    accuracy                           0.62       200
   macro avg       0.62      0.61      0.61       200
weighted avg       0.62      0.62      0.62       200
Accuracy real: 0.62

Trained on Synthetic Data:
              precision    recall  f1-score   support
           0       0.52      0.11      0.18       104
           1       0.48      0.90      0.63        96
    accuracy                           0.48       200
   macro avg       0.50      0.50      0.40       200
weighted avg       0.50      0.48      0.39       200
Accuracy synthetic: 0.485

Random Guessing:
              precision    recall  f1-score   support
           0       0.48      0.47      0.47       104
           1       0.43      0.44      0.44        96
    accuracy                           0.46       200
   macro avg       0.45      0.45      0.45       200
weighted avg       0.46      0.46      0.46       200
Accuracy guessing: 0.455
Step 4. Comparing to real data (Logistic Regression)
# We can also test it with a different classifier
from sklearn.linear_model import LogisticRegression
test_real_vs_synthetic_data(pums, sample, LogisticRegression)
Trained on Real Data:
              precision    recall  f1-score   support
           0       0.63      0.35      0.45        91
           1       0.60      0.83      0.70       109
    accuracy                           0.61       200
   macro avg       0.62      0.59      0.57       200
weighted avg       0.61      0.61      0.59       200
Accuracy real: 0.61

Trained on Synthetic Data:
              precision    recall  f1-score   support
           0       0.00      0.00      0.00       104
           1       0.48      1.00      0.65        96
    accuracy                           0.48       200
   macro avg       0.24      0.50      0.32       200
weighted avg       0.23      0.48      0.31       200
Accuracy synthetic: 0.48

Random Guessing:
              precision    recall  f1-score   support
           0       0.49      0.44      0.47       104
           1       0.46      0.51      0.48        96
    accuracy                           0.48       200
   macro avg       0.48      0.48      0.47       200
weighted avg       0.48      0.47      0.47       200
Accuracy guessing: 0.475
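The Logistic Regression report for the model trained on synthetic data can be read off almost entirely from the support column: a recall of 1.00 for class 1 and 0.00 for class 0 means the classifier predicted class 1 for every test row. The short check below recomputes the reported scores from the support counts (104 and 96) under that reading. The variable names are illustrative.

```python
# Support of each class in the synthetic test split, taken from the report.
support_0, support_1 = 104, 96
total = support_0 + support_1

# If every row is predicted as class 1, all 96 true 1s are hits
# and all 104 true 0s are misses.
accuracy = support_1 / total
precision_1 = support_1 / total      # 96 correct out of 200 predicted 1s
recall_1 = support_1 / support_1     # every true 1 is recovered
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 is never predicted, so its precision, recall and F1 are reported as 0.
f1_0 = 0.0
macro_f1 = (f1_0 + f1_1) / 2
weighted_f1 = (support_0 * f1_0 + support_1 * f1_1) / total

print(round(accuracy, 2), round(f1_1, 2), round(macro_f1, 2), round(weighted_f1, 2))
# 0.48 0.65 0.32 0.31
```

A model that collapses to a single class scores exactly the share of that class, which is why its accuracy sits next to the random-guessing baseline.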
Step 5. Comparing data in a plot
We use a t-distributed stochastic neighbor embedding (t-SNE) plot to display the first 500 rows of the real dataset and of our synthetic dataset in two dimensions.
import matplotlib.patches as mpatches
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
def tsne_plot(real, synthetic):
    # Prepare DataFrame for synthetic data
    synth_df = pd.DataFrame(synthetic, columns=real.columns)

    # Select features from real and synthetic data for t-SNE visualization
    x_train = real.iloc[:, :-1]
    x_train_synth = synth_df.iloc[:, :-1]

    # Combine and apply t-SNE to the data
    comb = np.vstack((x_train[:500], x_train_synth[:500]))
    embedding_1 = TSNE(n_components=2, perplexity=5.0, early_exaggeration=1.0).fit_transform(comb)
    x, y = embedding_1.T

    # The first half of the embedded points is real data, the second half synthetic
    half = int(len(x) / 2)

    # Set up the plot
    plt.rcParams["figure.figsize"] = (8, 8)
    plt.scatter(x, y, c=['red' for _ in range(half)] + ['blue' for _ in range(half)])
    red_patch = mpatches.Patch(color='red', label='Real Data')
    blue_patch = mpatches.Patch(color='blue', label='Synthetic Data')
    plt.legend(handles=[red_patch, blue_patch])
    plt.title('TSNE Plot, Real Data vs. Synthetic')
    plt.show()
# Plotting non-private and synthetic data
tsne_plot(pums, sample)
With this graph you can review how the synthetic data is distributed in comparison to the real data.
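A plot gives a qualitative impression; a per-column distance adds a number to it. The sketch below computes the total variation distance between the value frequencies of one categorical column in two samples. The helper name total_variation_distance and the toy values are illustrative, not part of the guide's datasets or of SmartNoise-Synth.

```python
from collections import Counter

def total_variation_distance(real_values, synthetic_values):
    """Half the L1 distance between two empirical distributions (0 = identical, 1 = disjoint)."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real = len(real_values)
    n_synth = len(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )

# Toy stand-ins for a binary column such as `married`
real_married = [1, 0, 1, 1, 0, 0, 1, 0]
synth_married = [1, 1, 1, 0, 0, 1, 1, 0]
print(total_variation_distance(real_married, synth_married))  # 0.125
```

With the local data from this guide, the same helper could be called on list(pums["married"]) and list(sample["married"]).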
Extra: Using the MST Method
The acronym MST stands for "Maximum Spanning Tree": the method produces differentially private synthetic data by relying on a maximum spanning tree of mutual information.
MST finds the maximum spanning tree on a graph where nodes are data attributes and edge weights correspond to the approximate mutual information between any two attributes.
We say approximate because the maximum spanning tree is built using the exponential mechanism, which selects edges with high mutual information in a differentially private manner. The marginals, on the other hand, are measured using the Gaussian mechanism.
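To make the tree-building step concrete, here is a minimal, non-private sketch of a maximum spanning tree over attributes using Kruskal's algorithm. The edge weights are made-up stand-ins for mutual information, and the exact greedy selection replaces the exponential mechanism described above, so this illustrates only the graph structure, not the privacy mechanism.

```python
def maximum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm, taking the heaviest edges first."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the representative of the node's component.
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    tree = []
    for weight, a, b in sorted(weighted_edges, reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # adding the edge creates no cycle
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree

attributes = ["age", "sex", "educ", "race", "married"]
# Hypothetical mutual-information scores between attribute pairs
edges = [
    (0.30, "age", "married"),
    (0.25, "age", "educ"),
    (0.20, "educ", "married"),
    (0.10, "educ", "race"),
    (0.05, "sex", "married"),
    (0.02, "sex", "race"),
]
tree = maximum_spanning_tree(attributes, edges)
print(tree)
```

In this toy graph the edge between educ and married is skipped because age already connects the two, so the tree keeps four edges for five attributes.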
%%ag
df = pums.drop(["income"], axis=1)
df = df.sample(frac=1, random_state=42)
%%ag
mst_synth = Synthesizer.create("mst", epsilon=3.0, verbose=True)
fit = mst_synth.fit(df, preprocessor_eps=0.1)
%%ag
sample = mst_synth.sample(len(df))
Extra: What If We Use Data Transformers?
Even with preprocessing hints, the preprocessor inferred by the synthesizer may not be exactly what you want. For example, the mwem synthesizer will automatically bin continuous columns into 10 bins.
Spending epsilon to infer bounds is wasteful and reduces accuracy when you already have public bounds for continuous columns. In most cases, you will get the best performance by manually specifying the preprocessor you want to use.
Preprocessing is done by a TableTransformer object, which implements a differentially private reversible data transform.
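As an illustration of what a reversible transform with public bounds looks like, the sketch below bins a continuous value into a fixed number of equal-width bins and maps a bin index back to the midpoint of that bin. The function names and the bounds 0 to 100 are assumptions for the example; this is not the TableTransformer implementation. Because the bounds are supplied rather than inferred from the data, no privacy budget is spent on them.

```python
def to_bin(value, lower, upper, n_bins=10):
    """Clamp value to [lower, upper] and return its equal-width bin index."""
    clamped = min(max(value, lower), upper)
    width = (upper - lower) / n_bins
    # The upper bound itself belongs to the last bin.
    return min(int((clamped - lower) / width), n_bins - 1)

def from_bin(index, lower, upper, n_bins=10):
    """Map a bin index back to the midpoint of that bin."""
    width = (upper - lower) / n_bins
    return lower + (index + 0.5) * width

ages = [59, 31, 36, 120]  # 120 lies outside the assumed public bounds
encoded = [to_bin(a, lower=0, upper=100) for a in ages]
decoded = [from_bin(i, lower=0, upper=100) for i in encoded]
print(encoded)  # [5, 3, 3, 9]
print(decoded)  # [55.0, 35.0, 35.0, 95.0]
```

The inverse recovers a representative value rather than the original one, so the transform is reversible only up to the bin width.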
%%ag
from op_snsynth import TableTransformer
# No bounds are given for the continuous columns, so their min and max values are inferred from the data set.
# Inferring the min and max requires some privacy budget, specified by the epsilon argument of fit_transform.
tt = TableTransformer.create(
    pums,
    style='cube',
    # Each column is listed once, either as categorical or as continuous
    categorical_columns=['sex', 'educ', 'race', 'married'],
    continuous_columns=['age', 'income']
)
%%ag
pums_encoded = tt.fit_transform(pums, epsilon=3.0)
pums_decoded = tt.inverse_transform(pums_encoded)
%%ag
synth = Synthesizer.create("mwem", epsilon=1.0, iterations=6, verbose=True)
fit = synth.fit(pums_decoded, preprocessor_eps=0.1)
%%ag
sample = synth.sample(len(pums))
Resources
See the official SmartNoise-Synth page to learn more about the function signatures and methods.