Guides

Functions, Joins and Statistical Methods

This section continue presenting the op_pandas library guide, addressing functions, joins and statistical methods. The following operations are addressed:

General Functions
Join and Where
Statistical Methods

General Functions

op_pandas comes packaged with some useful functions such as:

to_datetime
concat
merge
train_test_split

Concat

Users can concatenate PrivateDataFrames, by creating a copy of the dataframe and then concatenating it with the original. See the example below:

First, create a DataFrame df with random data:

arr_name = []
n_num = 10000
N = 10
for i in range(n_num):
  res = ''.join(random.choices(string.ascii_lowercase, k=N))
  arr_name.append(res)

df = pd.DataFrame({
    'name': arr_name,
    'age': np.random.randint(0, 80, n_num),
    'salary': np.random.randint(100, 100000, n_num)
})

session.private_import(data=df, name='imported_df_copy')

>>> dataframe cached to server, loading to kernel...
Output: Dataframe loaded successfully to the kernel

Next, create a PrivateDataFrame priv_df_copy using the imported DataFrame:

%%ag
metadata = {
    'age': (0, 80),
    'salary': (1, 200000)
}

priv_df_copy = PrivateDataFrame(imported_df_copy, metadata=metadata)

Then, print the counts of records in the original copied PrivateDataFrames, and concatenate them:

%%ag
ag_print("First DF count:\n", priv_df.count(eps=1))
ag_print("Second DF count:\n", priv_df_copy.count(eps=1))

import op_pandas
concat_df = op_pandas.concat([priv_df, priv_df_copy])
ag_print("Concat DF count: \n", concat_df.count(eps=1))

The resulting output should be as follows:

>>> 
First DF count:
 name      10000
age        9999
salary     9988
dtype: int64

Second DF count:
 name       9998
age        9999
salary    10001
dtype: int64

Concat DF count:
 name      19997
age       20000
salary    20004
dtype: int64

concat is only used to concatenate PrivateDataFrames over indices.

Datatypes

The datatypes of the same columns must be the same across all data frames.

Train_test_split

Users can utilize op_pandas.train_test_split to divide a PrivateDataFrame into train and test splits. As in the example below:

%%ag

priv_df_train, priv_df_test = op_pandas.train_test_split(priv_df)

ag_print("Count of train split:\n", priv_df_train.count(eps=1))
ag_print("Count of test split:\n", priv_df_test.count(eps=1))

Output when executed:

>> 
Count of train split:
 name      7501
age       7503
salary    7500
dtype: int64

Count of test split:
 name      2492
age       2499
salary    2491
dtype: int64

In this example, op_pandas.train_test_split splits the PrivateDataFrame priv_df into train and test sets. The counts of records in each split are printed, showing the distribution of data between the training and testing subsets.

Join and Where

This section explores different operations related to joining PrivateDataFrame and PrivateSeries objects and using the WHERE and JOIN method for filtering data.

`JOIN` on Private objects

Joining two PrivateDataframes:

%%ag
"""
pdf: {randomly sampled}
    columns = ['A','B','C']
    metadata = (0,100) for all columns
    size = 1000

pdf2: {randomly sampled}
    columns = ['E','F']
    metadata = (-100,100) for all columns
    size = 1000

ps: {randomly sampled series}
    name = 'series'
    metadata = (0,100)
    size = 1000
"""

pdf = PrivateDataFrame(
    pd.DataFrame({
        "A": np.random.randint(0, 100, 1000),
        "B": np.random.randint(0, 100, 1000),
        "C": np.random.randint(0, 100, 1000)
    }), 
    metadata={"A": (0, 100), "B": (0, 100), "C": (0, 100)}
)

pdf2 = PrivateDataFrame(
    pd.DataFrame({
        "E": np.random.randint(-100, 100, 1000),
        "F": np.random.randint(-100, 100, 1000)
    }), 
    metadata={"E": (-100, 100), "F": (-100, 100)}
)

ps = PrivateSeries(pd.Series(np.random.randint(0,100,1000)), metadata=(0,100))

%%ag #join with dataframe
result = pdf.join(pdf2,how="outer")
ag_print("Outer Join of two PDFs Describe:\n",result.describe(eps=1))

>>> Outer Join of two PDFs Describe:
                 A           B           C           E           F
count  978.000000  978.000000  978.000000  978.000000  978.000000
mean    45.078161   48.373313   47.395728    2.918509    0.379322
std     41.659230   39.782047    3.978732   55.174294   76.570248
min      0.000000    0.000000    0.000000 -100.000000 -100.000000
25%     25.835038   16.814188   13.689315  -55.716763  -79.960320
50%     45.534085   55.195066   57.023430    3.266460   -5.226773
75%     62.600983   63.868197   68.137527   40.907116   58.650339
max     93.493320   94.355744   79.755250   58.467962   89.288805

Joining PrivateDataframe with PrivateSeries:

%%ag # join with series
result = pdf.join(ps,how="inner")
ag_print("Inner Join of PDF and Series Describe:\n",result.describe(eps=1))

>>> Inner Join of PDF and Series Describe:
                 A           B           C      series
count  995.000000  995.000000  995.000000  995.000000
mean    50.461822   50.737590   44.956548   52.693788
std     28.974945   35.573564   28.876807   39.639266
min      0.000000    0.000000    0.000000    0.000000
25%     11.820072   43.155183   48.520076    8.472153
50%     65.126290   35.547530   28.150633   60.640127
75%     89.179010   85.479518   50.462041   68.068637
max     97.001126   79.362787   92.063356   75.326474

`WHERE` Query

%%ag

result = pdf.where(pdf > 0)
ag_print("Where Query Result:\n", result.describe(eps=1))

>>> Where Query Result:
                  A            B            C
count  1001.000000  1001.000000  1001.000000
mean     48.506532    50.754384    53.904313
std      42.434438    24.036048    27.686005
min       0.000000     0.000000     0.000000
25%      23.367422    21.512664     4.402222
50%      33.757446    52.456891    22.643254
75%      65.482272    81.311059    74.171133
max      92.626234    84.918578    98.803625

Statistical Methods

PrivateDataframe and PrivateSeries support various statistical methods with APIs similar to pandas.DataFrame and pandas. Series, such as:

sum, mean, count
std, var
quantile, percentile , median
bounded min and max can be achieved using percentile(0) and percentile(100) respectively.
hist and hist2d.
cov (covariance), skew (skewness), corr (correlation).

Basic Statistics

Users can generate statistical insights about the data while preserving privacy using differentially private methods such as variance, count, percentile on a PrivateSeries. See the following example:

%%ag
var = ps.var(eps=1)
count = ps.count(eps=1)

ag_print(f"variance = {var} , count = {count}")

>>> 
variance = 806.7200376847192 , count = 100

In it, the differentially private variance (var) and count (count) of the PrivateSeries ps are calculated with an epsilon value of 1. These values provide statistical insights about the data while preserving differential privacy.

It is possible to find the bounded max and min values in the following way:

%%ag
min = ps.percentile(eps=0.1, p=0)
max = ps.percentile(eps=0.1, p=100)

ag_print(f"min = {min} , max = {max}") ## Actual min, max : (0,100)

>>> 
min = 3.7421055945821973 , max = 99.0959021484033

In this example, the differentially private bounded minimum (min) and maximum (max) of the PrivateSeries ps are calculated using the percentile method with an epsilon value of 0.1 and percentiles 0 and 100, respectively. These values provide bounds on the minimum and maximum values in the dataset while preserving differential privacy.

Advanced Statistics

Users can perform complex differentially private statistical calculations such as:

Covariance
Skewness
Correlation

Below, we demonstrate finding the correlation matrix of a PrivateDataFrame and compare the results with the expected initial result:

The correlation matrix of the PrivateDataFrame priv_df is calculated with an epsilon value of 3 using the corr method.

The resulting correlation matrix is exported to the local environment and printed.

%%ag

# Calculate the correlation matrix of the PrivateDataFrame priv_df with an epsilon value of 3
result = priv_df.corr(eps=3)

# Export the result to the local environment
export(result, 'private_result')

>>> 
Setting up exported variable in local environment: private_result

# Print the calculated correlation matrix
print(private_result)

>>>            age    salary
age          1.0  0.032151
salary  0.032151       1.0

The correlation matrix of the original DataFrame df[['age', 'salary']] is calculated using pandas.

# Print the correlation matrix of the original DataFrame
print(df[['age', 'salary']].corr())

>>>             age    salary
age     1.000000  0.006333
salary  0.006333  1.000000

The calculated correlation matrices from both the private and original datasets are compared.

Histograms

Users can generate differentially private histograms using PrivateDataFrame.hist().

%%ag

hist_data = priv_df.hist(column='salary',eps=0.1)
export(hist_data , 'hist_data')

>>> Setting up exported variable in local environment: hist_data

Users can use Matplotlib or any other plotting library to visualize the histogram locally.

import matplotlib.pyplot as plt
dp_hist, dp_bins = hist_data
# Create a bar plot using Matplotlib
plt.bar(dp_bins[:-1], dp_hist, width=np.diff(dp_bins)*0.8, align='edge')

# Display the plot
plt.show()

Guides

Functions, Joins and Statistical Methods​

General Functions​

Concat​

Train_test_split​

Join and Where​

JOIN on Private objects​

WHERE Query​

Statistical Methods​

Basic Statistics​

Advanced Statistics​

Histograms​