Introducing op_pandas

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

We have created a differentially private version of pandas library named op_pandas. This lets you handle private dataframes and private series which can be used to perform various statistical analysis with differential privacy guarantees. The api methods have been created such that it gives you minimal difficulty from getting adjusted if you have used pandas before.

Getting Started: Install, Import & Connect to Antigranular

!pip install antigranular

import antigranular as ag
session = ag.login("<ag client secrets>", "<ag client secrets>", competition = "Sandbox Competition")

>>> Loading dataset "Medical Treatment" to the kernel...
Dataset "Medical Treatment" loaded to the kernel as medical_treatment
Connected to Antigranular server session id: 66b9293d-ecb6-4059-ac24-c8fe2afb3b4f, the session will time out if idle for 60 minutes
Cell magic '%%ag' registered successfully, use `%%ag` in a notebook cell to execute your python code on Antigranular private python server
🚀 Everything's set up and ready to roll!

This is a short introduction to op_pandas (Oblivious private pandas), geared mainly for new users. Let's get started! You can import the library as follows:

%%ag
from op_pandas import PrivateDataFrame , PrivateSeries

Object Creation

From AG-Server

To load a dataset from AG-Server and obtain the required data structures, you can use the load_dataset() function from ag_utils. The response of the load_dataset() function provides the necessary data in a dictionary format.

%%ag
"""
medical treatment dictionary :-
{
  "train_x": pdf_train_x,
  "train_y": pdf_train_y,
  "test_x": df_test_x,
}

These are Private datastructures and cannot be exported to local environment.
Unless a differentially private measure is applied to obtain a non-private
DataFrame or Series.
"""
train_x = medical_treatment["train_x"]
train_y = medical_treatment["train_y"]
test_x = medical_treatment["test_x"]

From pandas.Series

When creating a PrivateSeries, it's recommended to set metadata bounds. This allows you to define the range of valid values for the series. If you don't provide explicit bounds, op_pandas will automatically assign the metadata based on the minimum and maximum values in the series.

%%ag
import pandas as pd
s = pd.Series([1,5,8,2,9] , name='Test_series')
priv_s = PrivateSeries(series=s,metadata=(0,10))

From pandas.DataFrame

Similar to PrivateSeries, it's recommended to set metadata bounds when creating a PrivateDataFrame. By specifying metadata, you can define the valid range of values for each column in the dataframe. If metadata bounds are not provided, op_pandas will assign automatic bounds based on the minimum and maximum values in each numerical column.

%%ag
import pandas as pd
data = {
    'Age':[20,30,40,25,30,25,26,27,28,29],
    'Salary':[35000,60000,100000,55000, 35000,60000,100000,55000,35000,60000],
    'Sex':['M','F','M','F', 'M','F','M','F', 'M', 'F']
}
metadata = {
    'Age':(18,65),
    'Salary':(20000,200000)
}
df = pd.DataFrame(data)
priv_df = PrivateDataFrame(df=df , metadata=metadata)

Importing data

Within the AG environment, we can also import external data from local jupyter session.

import pandas as pd
import numpy as np
import string, random

arr_name = []
n_num = 10000
N = 10
for i in range(n_num):
  res = ''.join(random.choices(string.ascii_lowercase, k=N))
  arr_name.append(res)

df = pd.DataFrame({'name': arr_name, 'age': np.random.randint(0, 80, n_num), 'salary': np.random.randint(100, 100000, n_num)})

session.private_import(data = df, name= 'imported_df')

>>> dataframe cached to server, loading to kernel...

# Creating another dataset with NaNs in it

# Randomly distributing nans in two columns with prob = 0.5

choice = [1,2,np.nan]
a = np.random.choice(choice,10000,p=[0.25,0.25,0.5])
b = np.random.choice(choice,10000,p=[0.25,0.25,0.5])

df_2 = pd.DataFrame({'a': a, 'b': b})

session.private_import(data = df_2, name = "imported_df_2")

>>> dataframe cached to server, loading to kernel...
Output: Dataframe loaded successfully to the kernel

After using private_import, you can access the dataFrame with the name imported_df within the AG environment. We will create a PrivateDF out of it.

%%ag
metadata = {
    'age': (0, 80),
    'salary': (1, 200000)
}

priv_df = PrivateDataFrame(imported_df ,metadata = metadata)

>>> Dataframe loaded successfully to the kernel

Viewing Data

To protect privacy, records in PrivateDataFrame and PrivateSeries cannot be viewed directly. However, you can still analyse and obtain statistical information about the data using methods that offer differential privacy guarantees.

Viewing details about the data

Printing the details about the data, like columns and metadata can be done in the following way:

%%ag

ag_print("Columns: \n", priv_df.columns)
ag_print("Metadata: \n", priv_df.metadata)
ag_print("Dtypes: \n", priv_df.dtypes)

>>> Columns:
 Index(['name', 'age', 'salary'], dtype='object')
Metadata:
 {'age': (0, 80), 'salary': (1, 200000)}
Dtypes:
 name      object
age        int64
salary     int64
dtype: object

Quick Statistics

One way to obtain the quick-statistic is by using the describe() method where you can spend some epsilon to obtain a rough meta-data about the dataset.

%%ag

priv_describe = priv_df.describe(eps=1)
# Export information from remote ag kernel to local jupyter server.
ag_print(priv_describe)

>>>
                age         salary
count  10011.000000   10011.000000
mean      39.430439   49728.643224
std       23.065769   28405.891863
min        0.000000     583.443063
25%       19.904847   24461.587152
50%       35.801599   50159.255936
75%       60.452492   74499.712660
max       77.701005  147580.823075

You can view the statistics by exporting the non-private result to the local Jupyter server:

%%ag

export(priv_describe, name='priv_describe')

Setting up exported variable in local environment: priv_describe

print(priv_describe)

>>>
                age         salary
count  10011.000000   10011.000000
mean      39.430439   49728.643224
std       23.065769   28405.891863
min        0.000000     583.443063
25%       19.904847   24461.587152
50%       35.801599   50159.255936
75%       60.452492   74499.712660
max       77.701005  147580.823075

Cleaning Data

You can use dropna method to remove any records that contain nan values in any of its features.

Cleaning using dropna method.

%%ag
# probability of a record not having nan = (0.5 * 0.5) = 0.25
# Hence expected count after dropna should be around 2500.
priv_df_2 = PrivateDataFrame(imported_df_2, metadata = {'a': (1, 2), 'b': (1, 2)})

export(priv_df_2.dropna(axis=0).describe(eps=1), 'result')

>>> Setting up exported variable in local environment: result


/code/dependencies/op_pandas/op_pandas/core/private_dataframe.py:115: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._df[col].clip(lower=self._metadata[col][0], upper=self._metadata[col][1], inplace=True)

print(result)

>>>
                 a            b
count  2490.000000  2490.000000
mean      1.495172     1.502561
std       0.496467     0.483317
min       1.000000     1.000000
25%       1.000000     1.000000
50%       1.264635     1.668942
75%       1.985666     1.975653
max       1.997052     1.993214

print(df_2.dropna(axis=0).describe())

>>>
                 a            b
count  2490.000000  2490.000000
mean      1.492369     1.503213
std       0.500042     0.500090
min       1.000000     1.000000
25%       1.000000     1.000000
50%       1.000000     2.000000
75%       2.000000     2.000000
max       2.000000     2.000000

We can also drop some columns from the PrivateDataFrame using drop functionality.

%%ag
ag_print("Columns before: ", priv_df.columns)

temp_df = priv_df.drop(['name'], inplace = False)

ag_print("Columns After Dropping name: ", temp_df.columns)

>>> Columns before:  Index(['name', 'age', 'salary'], dtype='object')
Columns After Dropping name:  Index(['age', 'salary'], dtype='object')

Selecting Data

Setting values that affect a certain set of records is not allowed in PrivateDataFrame or PrivateSeries. However, you can apply transformation functions using PrivateDataFrame.ApplyMap or PrivateSeries.Map.

Getting Data Objects

To select a single column and obtain a PrivateSeries, equivalent to df["A"]:

%%ag

priv_s = priv_df['age']
ag_print("Describe:\n", priv_s.describe(eps=1))
ag_print("Metadata:\n", priv_s.metadata)

>>> Describe:
 count    9988.000000
mean       39.472473
std        22.735391
min         0.000000
25%        21.384668
50%        37.840602
75%        55.273188
max        78.730881
Name: series, dtype: float64
Metadata:
 (0, 80)

To select a collection of columns which returns a PrivateDataFrame

%%ag

priv_df_filtered = priv_df[['age','salary']]

ag_print("Columns:\n", priv_df_filtered.columns)
ag_print("Metadata:\n", priv_df_filtered.metadata)

>>> Columns:
 Index(['age', 'salary'], dtype='object')
Metadata:
 {'age': (0, 80), 'salary': (1, 200000)}



/code/dependencies/op_pandas/op_pandas/core/private_dataframe.py:115: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._df[col].clip(lower=self._metadata[col][0], upper=self._metadata[col][1], inplace=True)

Setting Data Objects

Setting any values that affect a record is not allowed in PrivateDataFrame or PrivateSeries. However, you can apply transformation functions using PrivateDataFrame.ApplyMap or PrivateSeries.Map

Using Applymap on a PrivateDataFrame

%%ag


# maps string to its length if not numerical else divides it by 2.
def func(x: str | int | float) -> float:
    if isinstance(x, str):
        return len(x)
    elif isinstance(x, (int, float)):
        return x / 2
    return 0.0


result = priv_df.applymap(func, eps=1)

%%ag
ag_print("Metadata:\n", result.metadata)
ag_print("Describe:\n", result.describe(eps = 1))

>>> Metadata:
 {'name': (8.0, 16.0), 'age': (0.0, 64.0), 'salary': (512.0, 65536.0)}

Describe:
                name           age        salary
count  10001.000000  10001.000000  10001.000000
mean       9.979915     19.763231  25068.695612
std        0.274668     10.477926  16333.537130
min        8.000000      0.000000    686.474307
25%       10.000000      8.740750  12489.945476
50%       10.000000     18.864365  24417.320448
75%       10.000000     28.283437  37265.078587
max       10.000000     35.126989  60705.413286

## Calculating the original results

def func(x: str | int | float) -> float:
    if isinstance(x, str):
        return len(x)
    elif isinstance(x, (int, float)):
        return x / 2
    return 0.0

result_local = df.applymap(func)

print("Describe: ", result_local.describe())

>>>
Describe:            name           age        salary
count  10000.0  10000.000000  10000.000000
mean      10.0     19.720800  24840.119300
std        0.0     11.544976  14512.035948
min       10.0      0.000000     50.500000
25%       10.0      9.500000  12210.250000
50%       10.0     19.500000  24811.750000
75%       10.0     30.000000  37367.750000
max       10.0     39.500000  49993.000000

Using Map on a PrivateSeries. The mapping can be done either via using a dictionary for 1:1 mapping or via a callable method.

%%ag

def series_map(x:int)->float:
    return x/2

priv_df['age'] = priv_df['age'].map(series_map,eps=1) # important
ag_print("Metadata:\n", priv_df.metadata)

>>> Metadata:
 {'age': (0.0, 64.0), 'salary': (1, 200000)}

Operations

This section covers various operations that can be performed on PrivateDataFrame and PrivateSeries objects.

Unary Ops

You can perform unary operations such as ~, -, +, and abs on PrivateDataFrames and PrivateSeries. These operations apply element-wise to the data.

The ~ operator performs the bitwise negation operation.

The - operator performs the arithmetic negation operation.

The + operator performs the arithmetic addition operation.

The abs() function calculates the absolute value of each element.

%%ag

export(priv_df_2.describe(eps=2) , 'original')
export((-priv_df_2).describe(eps=2) , 'negative')

>>> Setting up exported variable in local environment: original
Setting up exported variable in local environment: negative

negative.columns = ["a_neg","b_neg"]
print(original.join(negative , how="left"))

>>>
                  a             b         a_neg         b_neg
count  10000.000000  10000.000000  10000.000000  10000.000000
mean       1.498597      1.504388     -1.500690     -1.503697
std        0.494197      0.498858      0.498402      0.499788
min        1.000000      1.000000     -1.000000     -1.000000
25%        1.000000      1.000000     -1.001932     -1.001540
50%        1.635791      1.891447     -1.678784     -1.161680
75%        1.991538      1.997409     -1.996417     -1.996500
max        1.992424      1.997140     -1.999776     -1.999894

Binary Ops

Similar to pandas.DataFrame , you can apply binary operations using scalars and PrivateDataFrames against PrivateDataFrames.

Binary operations on a mix of scalars and PrivateDataFrames against PrivateDataFrames.

%%ag

pdf = priv_df[['age','salary']]
result1 = pdf + (10*pdf)  # expected min-max => Age:(0,704) ,  Salary:(11, 2200000)
result2 = result1/1000 # expected min-max => Age:(0,0.704) ,  Salary:(0.011, 2200)
ag_print("Result1 metadata: \n", result1.metadata)
ag_print("Result2 metadata: \n", result2.metadata)

>>> Result1 metadata:
 {'age': (0.0, 704.0), 'salary': (11, 2200000)}
Result2 metadata:
 {'age': (0.0, 0.704), 'salary': (0.011, 2200.0)}



/code/dependencies/op_pandas/op_pandas/core/private_dataframe.py:115: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._df[col].clip(lower=self._metadata[col][0], upper=self._metadata[col][1], inplace=True)

Bitwise Ops

Similar to pandas.DataFrame, you can apply bitwise operations using scalars and PrivateDataFrames against PrivateDataFrames. These operations apply element-wise to the data.

%%ag
import numpy as np
import pandas as pd

priv_ser_1 = PrivateSeries(pd.Series(np.random.randint(0,1,10000)), metadata=(0,1))
priv_ser_2 = PrivateSeries(pd.Series(np.random.randint(0,1,10000)), metadata=(0,1))

"""
{
    priv_ser_1 : randomly sampled integer data containing (0,1)
    priv_ser_1 : randomly sampled integer data containing (0,1)
}
"""

ag_print("Describe of private Series 1: \n", priv_ser_1.describe(eps = 1))
ag_print("Describe of private Series 2: \n", priv_ser_2.describe(eps = 1))

result = priv_ser_1 & priv_ser_2

ag_print("Describe of the result: \n", result.describe(eps = 1))

>>> Describe of private Series 1:
 count    9.998000e+03
mean     1.571300e-03
std      1.998231e-02
min      0.000000e+00
25%      4.656613e-10
50%      4.656613e-10
75%      4.656613e-10
max      4.656613e-10
Name: series, dtype: float64

Describe of private Series 2:
 count    1.000500e+04
mean     5.608570e-04
std      4.612496e-02
min      0.000000e+00
25%      4.656613e-10
50%      4.656613e-10
75%      4.656613e-10
max      4.656613e-10
Name: series, dtype: float64

Describe of the result:
 count    1.000700e+04
mean     3.952059e-04
std      2.277582e-02
min      0.000000e+00
25%      4.656613e-10
50%      4.656613e-10
75%      4.656613e-10
max      4.656613e-10
Name: series, dtype: float64

General Functions

op_pandas comes packaged with some useful functions like to_datetime, concat, merge and train_test_split.

Concat

Let us create a copy of priv_df and concatenate the copy with the original.

arr_name = []
n_num = 10000
N = 10
for i in range(n_num):
  res = ''.join(random.choices(string.ascii_lowercase, k=N))
  arr_name.append(res)

df = pd.DataFrame({'name': arr_name, 'age': np.random.randint(0, 80, n_num), 'salary': np.random.randint(100, 100000, n_num)})

session.private_import(data = df, name= 'imported_df_copy')

>>> dataframe cached to server, loading to kernel...
Output: Dataframe loaded successfully to the kernel

%%ag
metadata = {
    'age': (0, 80),
    'salary': (1, 200000)
}

priv_df_copy = PrivateDataFrame(imported_df_copy ,metadata = metadata)

%%ag

ag_print("First DF count:\n",priv_df.count(eps = 1))
ag_print("Second DF count:\n",priv_df_copy.count(eps = 1))

import op_pandas
concat_df = op_pandas.concat([priv_df, priv_df_copy])
ag_print("Concat DF count: \n", concat_df.count(eps=1))

>>> First DF count:
 name      10000
age        9999
salary     9988
dtype: int64

Second DF count:
 name       9998
age        9999
salary    10001
dtype: int64

Concat DF count:
 name      19997
age       20000
salary    20004
dtype: int64

We can only use concat to concatenate over indices.

Keep in mind that the datatypes of same columns for all dataframes must be same.

Train_test_split

We can use op_pandas.train_test_split to create train and test splittings of a PrivateDataFrame/Series or an array of PrivateDataFrames/Series.

%%ag

priv_df_train, priv_df_test = op_pandas.train_test_split(priv_df)

ag_print("Count of train split:\n",priv_df_train.count(eps = 1))
ag_print("Count of test split:\n",priv_df_test.count(eps = 1))

>>> Count of train split:
 name      7501
age       7503
salary    7500
dtype: int64

Count of test split:
 name      2492
age       2499
salary    2491
dtype: int64

JOINS and WHERE

This section explores different operations related to joining PrivateDataFrame and PrivateSeries objects and using the WHERE and JOIN method for filtering data.

`JOIN` on Private objects

Joining two PrivateDataframes:

%%ag
"""
pdf: {randomly sampled}
    columns = ['A','B','C']
    metadata = (0,100) for all columns
    size = 1000

pdf2: {randomly sampled}
    columns = ['E','F']
    metadata = (-100,100) for all columns
    size = 1000

ps: {randomly sampled series}
    name = 'series'
    metadata = (0,100)
    size = 1000
"""

pdf = PrivateDataFrame(pd.DataFrame({"A": np.random.randint(0,100,1000), "B": np.random.randint(0,100,1000), "C": np.random.randint(0,100,1000)}), metadata = {"A": (0, 100), "B": (0, 100), "C": (0, 100)})
pdf2 = PrivateDataFrame(pd.DataFrame({"E": np.random.randint(-100,100,1000), "F": np.random.randint(-100,100,1000)}), metadata = {"E": (-100, 100), "F": (-100, 100)})
ps = PrivateSeries(pd.Series(np.random.randint(0,100,1000)), metadata=(0,100))

%%ag #join with dataframe
result = pdf.join(pdf2,how="outer")
ag_print("Outer Join of two PDFs Describe:\n",result.describe(eps=1))

>>> Outer Join of two PDFs Describe:
                 A           B           C           E           F
count  978.000000  978.000000  978.000000  978.000000  978.000000
mean    45.078161   48.373313   47.395728    2.918509    0.379322
std     41.659230   39.782047    3.978732   55.174294   76.570248
min      0.000000    0.000000    0.000000 -100.000000 -100.000000
25%     25.835038   16.814188   13.689315  -55.716763  -79.960320
50%     45.534085   55.195066   57.023430    3.266460   -5.226773
75%     62.600983   63.868197   68.137527   40.907116   58.650339
max     93.493320   94.355744   79.755250   58.467962   89.288805

Joining PrivateDataframe with PrivateSeries:

%%ag # join with series
result = pdf.join(ps,how="inner")
ag_print("Inner Join of PDF and Series Describe:\n",result.describe(eps=1))

>>> Inner Join of PDF and Series Describe:
                 A           B           C      series
count  995.000000  995.000000  995.000000  995.000000
mean    50.461822   50.737590   44.956548   52.693788
std     28.974945   35.573564   28.876807   39.639266
min      0.000000    0.000000    0.000000    0.000000
25%     11.820072   43.155183   48.520076    8.472153
50%     65.126290   35.547530   28.150633   60.640127
75%     89.179010   85.479518   50.462041   68.068637
max     97.001126   79.362787   92.063356   75.326474

`WHERE` Query

%%ag

result = pdf.where(pdf > 0)
ag_print("Where Query Result:\n", result.describe(eps=1))

>>> Where Query Result:
                  A            B            C
count  1001.000000  1001.000000  1001.000000
mean     48.506532    50.754384    53.904313
std      42.434438    24.036048    27.686005
min       0.000000     0.000000     0.000000
25%      23.367422    21.512664     4.402222
50%      33.757446    52.456891    22.643254
75%      65.482272    81.311059    74.171133
max      92.626234    84.918578    98.803625

Statistical Methods

PrivateDataframe and PrivateSeries also support wide range of statistical methods with similar api of that to pandas.DataFrame and pandas.Series.

sum, mean, count
std, var
quantile, percentile , median
bounded min and max can be achieved using percentile(0) and percentile(100) respectively
hist and hist2d
cov (covariance), skew (skewness), corr (correlation)

Basic Statistics

This example demonstrates the usage of differentially private variance and count methods on a PrivateSeries.

The resulting values are differentially private while providing statistical insights about the data.

%%ag
var = ps.var(eps = 1)
count = ps.count(eps = 1)

ag_print(f"variance = {var} , count = {count}")

>>> variance = 806.7200376847192 , count = 1000

You can find differentially private bounded max and min using the percentile method.

%%ag

min = ps.percentile(eps=0.1 ,p = 0)
max = ps.percentile(eps=0.1 , p =100)
ag_print(f"min = {min} , max = {max}") ## Actual min, max : (0,100)

>>> min = 3.7421055945821973 , max = 99.09590214840338

Advanced Statistics

You can perform more complex differentially private statistical calculations like covariance, skewness, and correlation similar to their counterparts in pandas. Finding correlation matrix of a PrivateDataFrame and comparing the results with the original expected result.

%%ag

result = priv_df.corr(eps=3)
export(result,'private_result')

>>> Setting up exported variable in local environment: private_result

print(private_result)

>>>             age    salary
age          1.0  0.032151
salary  0.032151       1.0

print(df[['age', 'salary']].corr())

>>>             age    salary
age     1.000000  0.006333
salary  0.006333  1.000000

Histograms

You can generate differentially private histograms using PrivateDataFrame.hist().

%%ag

hist_data = priv_df.hist(column='salary',eps=0.1)
export(hist_data , 'hist_data')

>>> Setting up exported variable in local environment: hist_data

To visualise the histogram locally, you can use matplotlib or any other plotting library of your choice.

import matplotlib.pyplot as plt
dp_hist, dp_bins = hist_data
# Create a bar plot using Matplotlib
plt.bar(dp_bins[:-1], dp_hist, width=np.diff(dp_bins)*0.8, align='edge')

# Display the plot
plt.show()

Now that we are all done, we can terminate the session. Happy coding!

session.terminate_session()

>>> {'status': 'ok'}

Introducing op_pandas

Getting Started: Install, Import & Connect to Antigranular​

Object Creation​

From AG-Server​

From pandas.Series​

From pandas.DataFrame​

Importing data​

Viewing Data​

Viewing details about the data​

Quick Statistics​

Cleaning Data​

Selecting Data​

Getting Data Objects​

Setting Data Objects​

Operations​

Unary Ops​

Binary Ops​

Bitwise Ops​

General Functions​

Concat​

Train_test_split​

JOINS and WHERE​

JOIN on Private objects​

WHERE Query​

Statistical Methods​

Basic Statistics​

Advanced Statistics​

Histograms​