Skip to main content

Guides

Functions, Joins and Statistical Methods

This section continue presenting the op_pandas library guide, addressing functions, joins and statistical methods. The following operations are addressed:

General Functions

op_pandas comes packaged with some useful functions such as:

  • to_datetime
  • concat
  • merge
  • train_test_split

Concat

Users can concatenate PrivateDataFrames, by creating a copy of the dataframe and then concatenating it with the original. See the example below:

  1. First, create a DataFrame df with random data:

    arr_name = []
    n_num = 10000
    N = 10
    for i in range(n_num):
    res = ''.join(random.choices(string.ascii_lowercase, k=N))
    arr_name.append(res)

    df = pd.DataFrame({
    'name': arr_name,
    'age': np.random.randint(0, 80, n_num),
    'salary': np.random.randint(100, 100000, n_num)
    })

    session.private_import(data=df, name='imported_df_copy')
    >>> dataframe cached to server, loading to kernel...
    Output: Dataframe loaded successfully to the kernel
  2. Next, create a PrivateDataFrame priv_df_copy using the imported DataFrame:

    %%ag
    metadata = {
    'age': (0, 80),
    'salary': (1, 200000)
    }

    priv_df_copy = PrivateDataFrame(imported_df_copy, metadata=metadata)
  3. Then, print the counts of records in the original copied PrivateDataFrames, and concatenate them:

    %%ag
    ag_print("First DF count:\n", priv_df.count(eps=1))
    ag_print("Second DF count:\n", priv_df_copy.count(eps=1))

    import op_pandas
    concat_df = op_pandas.concat([priv_df, priv_df_copy])
    ag_print("Concat DF count: \n", concat_df.count(eps=1))

The resulting output should be as follows:

>>> 
First DF count:
name 10000
age 9999
salary 9988
dtype: int64

Second DF count:
name 9998
age 9999
salary 10001
dtype: int64

Concat DF count:
name 19997
age 20000
salary 20004
dtype: int64

concat is only used to concatenate PrivateDataFrames over indices.

Datatypes

The datatypes of the same columns must be the same across all data frames.

Train_test_split

Users can utilize op_pandas.train_test_split to divide a PrivateDataFrame into train and test splits. As in the example below:

%%ag

priv_df_train, priv_df_test = op_pandas.train_test_split(priv_df)

ag_print("Count of train split:\n", priv_df_train.count(eps=1))
ag_print("Count of test split:\n", priv_df_test.count(eps=1))

Output when executed:

>> 
Count of train split:
name 7501
age 7503
salary 7500
dtype: int64

Count of test split:
name 2492
age 2499
salary 2491
dtype: int64

In this example, op_pandas.train_test_split splits the PrivateDataFrame priv_df into train and test sets. The counts of records in each split are printed, showing the distribution of data between the training and testing subsets.

Join and Where

This section explores different operations related to joining PrivateDataFrame and PrivateSeries objects and using the WHERE and JOIN method for filtering data.

JOIN on Private objects

Joining two PrivateDataframes:

%%ag
"""
pdf: {randomly sampled}
columns = ['A','B','C']
metadata = (0,100) for all columns
size = 1000

pdf2: {randomly sampled}
columns = ['E','F']
metadata = (-100,100) for all columns
size = 1000

ps: {randomly sampled series}
name = 'series'
metadata = (0,100)
size = 1000
"""

pdf = PrivateDataFrame(
pd.DataFrame({
"A": np.random.randint(0, 100, 1000),
"B": np.random.randint(0, 100, 1000),
"C": np.random.randint(0, 100, 1000)
}),
metadata={"A": (0, 100), "B": (0, 100), "C": (0, 100)}
)

pdf2 = PrivateDataFrame(
pd.DataFrame({
"E": np.random.randint(-100, 100, 1000),
"F": np.random.randint(-100, 100, 1000)
}),
metadata={"E": (-100, 100), "F": (-100, 100)}
)

ps = PrivateSeries(pd.Series(np.random.randint(0,100,1000)), metadata=(0,100))

%%ag #join with dataframe
result = pdf.join(pdf2,how="outer")
ag_print("Outer Join of two PDFs Describe:\n",result.describe(eps=1))

>>> Outer Join of two PDFs Describe:
A B C E F
count 978.000000 978.000000 978.000000 978.000000 978.000000
mean 45.078161 48.373313 47.395728 2.918509 0.379322
std 41.659230 39.782047 3.978732 55.174294 76.570248
min 0.000000 0.000000 0.000000 -100.000000 -100.000000
25% 25.835038 16.814188 13.689315 -55.716763 -79.960320
50% 45.534085 55.195066 57.023430 3.266460 -5.226773
75% 62.600983 63.868197 68.137527 40.907116 58.650339
max 93.493320 94.355744 79.755250 58.467962 89.288805

Joining PrivateDataframe with PrivateSeries:

%%ag # join with series
result = pdf.join(ps,how="inner")
ag_print("Inner Join of PDF and Series Describe:\n",result.describe(eps=1))

>>> Inner Join of PDF and Series Describe:
A B C series
count 995.000000 995.000000 995.000000 995.000000
mean 50.461822 50.737590 44.956548 52.693788
std 28.974945 35.573564 28.876807 39.639266
min 0.000000 0.000000 0.000000 0.000000
25% 11.820072 43.155183 48.520076 8.472153
50% 65.126290 35.547530 28.150633 60.640127
75% 89.179010 85.479518 50.462041 68.068637
max 97.001126 79.362787 92.063356 75.326474

WHERE Query

%%ag

result = pdf.where(pdf > 0)
ag_print("Where Query Result:\n", result.describe(eps=1))

>>> Where Query Result:
A B C
count 1001.000000 1001.000000 1001.000000
mean 48.506532 50.754384 53.904313
std 42.434438 24.036048 27.686005
min 0.000000 0.000000 0.000000
25% 23.367422 21.512664 4.402222
50% 33.757446 52.456891 22.643254
75% 65.482272 81.311059 74.171133
max 92.626234 84.918578 98.803625

Statistical Methods

PrivateDataframe and PrivateSeries support various statistical methods with APIs similar to pandas.DataFrame and pandas. Series, such as:

  • summeancount
  • stdvar
  • quantilepercentile , median
  • bounded min and max can be achieved using percentile(0) and percentile(100) respectively.
  • hist and hist2d.
  • cov (covariance), skew (skewness), corr (correlation).

Basic Statistics

Users can generate statistical insights about the data while preserving privacy using differentially private methods such as variance, count, percentile on a PrivateSeries. See the following example:

%%ag
var = ps.var(eps=1)
count = ps.count(eps=1)

ag_print(f"variance = {var} , count = {count}")
>>> 
variance = 806.7200376847192 , count = 100

In it, the differentially private variance (var) and count (count) of the PrivateSeries ps are calculated with an epsilon value of 1. These values provide statistical insights about the data while preserving differential privacy.

It is possible to find the bounded max and min values in the following way:

%%ag
min = ps.percentile(eps=0.1, p=0)
max = ps.percentile(eps=0.1, p=100)

ag_print(f"min = {min} , max = {max}") ## Actual min, max : (0,100)
>>> 
min = 3.7421055945821973 , max = 99.0959021484033

In this example, the differentially private bounded minimum (min) and maximum (max) of the PrivateSeries ps are calculated using the percentile method with an epsilon value of 0.1 and percentiles 0 and 100, respectively. These values provide bounds on the minimum and maximum values in the dataset while preserving differential privacy.

Advanced Statistics

Users can perform complex differentially private statistical calculations such as:

  • Covariance
  • Skewness
  • Correlation

Below, we demonstrate finding the correlation matrix of a PrivateDataFrame and compare the results with the expected initial result:

  1. The correlation matrix of the PrivateDataFrame priv_df is calculated with an epsilon value of 3 using the corr method.

  2. The resulting correlation matrix is exported to the local environment and printed.

    %%ag

    # Calculate the correlation matrix of the PrivateDataFrame priv_df with an epsilon value of 3
    result = priv_df.corr(eps=3)

    # Export the result to the local environment
    export(result, 'private_result')
    >>> 
    Setting up exported variable in local environment: private_result
    # Print the calculated correlation matrix
    print(private_result)
    >>>            age    salary
    age 1.0 0.032151
    salary 0.032151 1.0
  3. The correlation matrix of the original DataFrame df[['age', 'salary']] is calculated using pandas.

    # Print the correlation matrix of the original DataFrame
    print(df[['age', 'salary']].corr())
    >>>             age    salary
    age 1.000000 0.006333
    salary 0.006333 1.000000

The calculated correlation matrices from both the private and original datasets are compared.

Histograms

Users can generate differentially private histograms using PrivateDataFrame.hist().

%%ag

hist_data = priv_df.hist(column='salary',eps=0.1)
export(hist_data , 'hist_data')
>>> Setting up exported variable in local environment: hist_data

Users can use Matplotlib or any other plotting library to visualize the histogram locally.

import matplotlib.pyplot as plt
dp_hist, dp_bins = hist_data
# Create a bar plot using Matplotlib
plt.bar(dp_bins[:-1], dp_hist, width=np.diff(dp_bins)*0.8, align='edge')

# Display the plot
plt.show()