Guides
Functions, Joins and Statistical Methods
This section continue presenting the op_pandas
library guide, addressing functions, joins and statistical methods. The following operations are addressed:
General Functions
op_pandas
comes packaged with some useful functions such as:
to_datetime
concat
merge
train_test_split
Concat
Users can concatenate PrivateDataFrames, by creating a copy of the dataframe and then concatenating it with the original. See the example below:
First, create a DataFrame
df
with random data:arr_name = []
n_num = 10000
N = 10
for i in range(n_num):
res = ''.join(random.choices(string.ascii_lowercase, k=N))
arr_name.append(res)
df = pd.DataFrame({
'name': arr_name,
'age': np.random.randint(0, 80, n_num),
'salary': np.random.randint(100, 100000, n_num)
})
session.private_import(data=df, name='imported_df_copy')>>> dataframe cached to server, loading to kernel...
Output: Dataframe loaded successfully to the kernelNext, create a PrivateDataFrame
priv_df_copy
using the imported DataFrame:%%ag
metadata = {
'age': (0, 80),
'salary': (1, 200000)
}
priv_df_copy = PrivateDataFrame(imported_df_copy, metadata=metadata)Then, print the counts of records in the original copied PrivateDataFrames, and concatenate them:
%%ag
ag_print("First DF count:\n", priv_df.count(eps=1))
ag_print("Second DF count:\n", priv_df_copy.count(eps=1))
import op_pandas
concat_df = op_pandas.concat([priv_df, priv_df_copy])
ag_print("Concat DF count: \n", concat_df.count(eps=1))
The resulting output should be as follows:
>>>
First DF count:
name 10000
age 9999
salary 9988
dtype: int64
Second DF count:
name 9998
age 9999
salary 10001
dtype: int64
Concat DF count:
name 19997
age 20000
salary 20004
dtype: int64
concat
is only used to concatenate PrivateDataFrames over indices.
The datatypes of the same columns must be the same across all data frames.
Train_test_split
Users can utilize op_pandas.train_test_split
to divide a PrivateDataFrame into train and test splits. As in the example below:
%%ag
priv_df_train, priv_df_test = op_pandas.train_test_split(priv_df)
ag_print("Count of train split:\n", priv_df_train.count(eps=1))
ag_print("Count of test split:\n", priv_df_test.count(eps=1))
Output when executed:
>>
Count of train split:
name 7501
age 7503
salary 7500
dtype: int64
Count of test split:
name 2492
age 2499
salary 2491
dtype: int64
In this example, op_pandas.train_test_split
splits the PrivateDataFrame priv_df
into train and test sets. The counts of records in each split are printed, showing the distribution of data between the training and testing subsets.
Join and Where
This section explores different operations related to joining PrivateDataFrame and PrivateSeries objects and using the WHERE
and JOIN
method for filtering data.
JOIN
on Private objects
Joining two PrivateDataframes:
%%ag
"""
pdf: {randomly sampled}
columns = ['A','B','C']
metadata = (0,100) for all columns
size = 1000
pdf2: {randomly sampled}
columns = ['E','F']
metadata = (-100,100) for all columns
size = 1000
ps: {randomly sampled series}
name = 'series'
metadata = (0,100)
size = 1000
"""
pdf = PrivateDataFrame(
pd.DataFrame({
"A": np.random.randint(0, 100, 1000),
"B": np.random.randint(0, 100, 1000),
"C": np.random.randint(0, 100, 1000)
}),
metadata={"A": (0, 100), "B": (0, 100), "C": (0, 100)}
)
pdf2 = PrivateDataFrame(
pd.DataFrame({
"E": np.random.randint(-100, 100, 1000),
"F": np.random.randint(-100, 100, 1000)
}),
metadata={"E": (-100, 100), "F": (-100, 100)}
)
ps = PrivateSeries(pd.Series(np.random.randint(0,100,1000)), metadata=(0,100))
%%ag #join with dataframe
result = pdf.join(pdf2,how="outer")
ag_print("Outer Join of two PDFs Describe:\n",result.describe(eps=1))
>>> Outer Join of two PDFs Describe:
A B C E F
count 978.000000 978.000000 978.000000 978.000000 978.000000
mean 45.078161 48.373313 47.395728 2.918509 0.379322
std 41.659230 39.782047 3.978732 55.174294 76.570248
min 0.000000 0.000000 0.000000 -100.000000 -100.000000
25% 25.835038 16.814188 13.689315 -55.716763 -79.960320
50% 45.534085 55.195066 57.023430 3.266460 -5.226773
75% 62.600983 63.868197 68.137527 40.907116 58.650339
max 93.493320 94.355744 79.755250 58.467962 89.288805
Joining PrivateDataframe with PrivateSeries:
%%ag # join with series
result = pdf.join(ps,how="inner")
ag_print("Inner Join of PDF and Series Describe:\n",result.describe(eps=1))
>>> Inner Join of PDF and Series Describe:
A B C series
count 995.000000 995.000000 995.000000 995.000000
mean 50.461822 50.737590 44.956548 52.693788
std 28.974945 35.573564 28.876807 39.639266
min 0.000000 0.000000 0.000000 0.000000
25% 11.820072 43.155183 48.520076 8.472153
50% 65.126290 35.547530 28.150633 60.640127
75% 89.179010 85.479518 50.462041 68.068637
max 97.001126 79.362787 92.063356 75.326474
WHERE
Query
%%ag
result = pdf.where(pdf > 0)
ag_print("Where Query Result:\n", result.describe(eps=1))
>>> Where Query Result:
A B C
count 1001.000000 1001.000000 1001.000000
mean 48.506532 50.754384 53.904313
std 42.434438 24.036048 27.686005
min 0.000000 0.000000 0.000000
25% 23.367422 21.512664 4.402222
50% 33.757446 52.456891 22.643254
75% 65.482272 81.311059 74.171133
max 92.626234 84.918578 98.803625
Statistical Methods
PrivateDataframe and PrivateSeries support various statistical methods with APIs similar to pandas.DataFrame
and pandas. Series, such as:
sum
,mean
,count
std
,var
quantile
,percentile
,median
- bounded min and max can be achieved using
percentile(0)
andpercentile(100)
respectively. hist
andhist2d
.cov
(covariance),skew
(skewness),corr
(correlation).
Basic Statistics
Users can generate statistical insights about the data while preserving privacy using differentially private methods such as variance
, count
, percentile
on a PrivateSeries. See the following example:
%%ag
var = ps.var(eps=1)
count = ps.count(eps=1)
ag_print(f"variance = {var} , count = {count}")
>>>
variance = 806.7200376847192 , count = 100
In it, the differentially private variance (var
) and count (count
) of the PrivateSeries ps
are calculated with an epsilon value of 1. These values provide statistical insights about the data while preserving differential privacy.
It is possible to find the bounded max and min values in the following way:
%%ag
min = ps.percentile(eps=0.1, p=0)
max = ps.percentile(eps=0.1, p=100)
ag_print(f"min = {min} , max = {max}") ## Actual min, max : (0,100)
>>>
min = 3.7421055945821973 , max = 99.0959021484033
In this example, the differentially private bounded minimum (min
) and maximum (max
) of the PrivateSeries ps
are calculated using the percentile method with an epsilon value of 0.1 and percentiles 0 and 100, respectively. These values provide bounds on the minimum and maximum values in the dataset while preserving differential privacy.
Advanced Statistics
Users can perform complex differentially private statistical calculations such as:
- Covariance
- Skewness
- Correlation
Below, we demonstrate finding the correlation matrix of a PrivateDataFrame and compare the results with the expected initial result:
The correlation matrix of the PrivateDataFrame
priv_df
is calculated with an epsilon value of 3 using thecorr
method.The resulting correlation matrix is exported to the local environment and printed.
%%ag
# Calculate the correlation matrix of the PrivateDataFrame priv_df with an epsilon value of 3
result = priv_df.corr(eps=3)
# Export the result to the local environment
export(result, 'private_result')>>>
Setting up exported variable in local environment: private_result# Print the calculated correlation matrix
print(private_result)>>> age salary
age 1.0 0.032151
salary 0.032151 1.0The correlation matrix of the original DataFrame
df[['age', 'salary']]
is calculated using pandas.# Print the correlation matrix of the original DataFrame
print(df[['age', 'salary']].corr())>>> age salary
age 1.000000 0.006333
salary 0.006333 1.000000
The calculated correlation matrices from both the private and original datasets are compared.
Histograms
Users can generate differentially private histograms using PrivateDataFrame.hist()
.
%%ag
hist_data = priv_df.hist(column='salary',eps=0.1)
export(hist_data , 'hist_data')
>>> Setting up exported variable in local environment: hist_data
Users can use Matplotlib or any other plotting library to visualize the histogram locally.
import matplotlib.pyplot as plt
dp_hist, dp_bins = hist_data
# Create a bar plot using Matplotlib
plt.bar(dp_bins[:-1], dp_hist, width=np.diff(dp_bins)*0.8, align='edge')
# Display the plot
plt.show()