Guides
Managing Data
This section presents the op_pandas
library guide and addresses data management tasks. The following operations are addressed:
Viewing Data
Records in PrivateDataFrame and PrivateSeries cannot be viewed directly to protect privacy. However, users can still analyze and obtain statistical information about the data using methods that offer differential privacy guarantees.
Printing details about the data
To print details about the data, such as columns, metadata, and data types, within the AG environment, users can use the ag_print
function. It can quickly inspect the details of their data within the AG environment, facilitating data analysis and exploration.
See the following example:
%%ag
ag_print("Columns: \n", priv_df.columns)
ag_print("Metadata: \n", priv_df.metadata)
ag_print("Dtypes: \n", priv_df.dtypes)
When executed:
>>> Columns:
Index(['name', 'age', 'salary'], dtype='object')
Metadata:
{'age': (0, 80), 'salary': (1, 200000)}
Dtypes:
name object
age int64
salary int64
dtype: object
Generating quick statistics
Users can obtain quick statistics about your dataset using the describe()
method in pandas. By spending some epsilon, you can get an idea about the statistical details of the dataset.
See the following example:
%%ag
priv_describe = priv_df.describe(eps=1)
# Export information from remote ag kernel to local jupyter server.
ag_print(priv_describe)
When executed:
>>>
age salary
count 10011.000000 10011.000000
mean 39.430439 49728.643224
std 23.065769 28405.891863
min 0.000000 583.443063
25% 19.904847 24461.587152
50% 35.801599 50159.255936
75% 60.452492 74499.712660
max 77.701005 147580.823075
Statistics can be viewed by exporting the non-private result to the local Jupyter server:
%%ag
export(priv_describe, name='priv_describe')
Setting up exported variable in local environment: priv_describe
print(priv_describe)
>>>
age salary
count 10011.000000 10011.000000
mean 39.430439 49728.643224
std 23.065769 28405.891863
min 0.000000 583.443063
25% 19.904847 24461.587152
50% 35.801599 50159.255936
75% 60.452492 74499.712660
max 77.701005 147580.823075
Cleaning Data
Users can use the dropna
method to remove any records that contain NaN values in any of its features.
%%ag
# probability of a record not having nan = (0.5 * 0.5) = 0.25
# Hence expected count after dropna should be around 2500.
priv_df_2 = PrivateDataFrame(imported_df_2, metadata = {'a': (1, 2), 'b': (1, 2)})
export(priv_df_2.dropna(axis=0).describe(eps=1), 'result')
>>> Setting up exported variable in local environment: result
/code/dependencies/op_pandas/op_pandas/core/private_dataframe.py:115: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/
indexing.html#returning-a-view-versus-a-copy
self._df[col].clip(lower=self._metadata[col][0],
upper=self._metadata[col][1], inplace=True)
print(result)
>>>
a b
count 2490.000000 2490.000000
mean 1.495172 1.502561
std 0.496467 0.483317
min 1.000000 1.000000
25% 1.000000 1.000000
50% 1.264635 1.668942
75% 1.985666 1.975653
max 1.997052 1.993214
print(df_2.dropna(axis=0).describe())
>>>
a b
count 2490.000000 2490.000000
mean 1.492369 1.503213
std 0.500042 0.500090
min 1.000000 1.000000
25% 1.000000 1.000000
50% 1.000000 2.000000
75% 2.000000 2.000000
max 2.000000 2.000000
Columns from the PrivateDataFrame can also be dropped using the drop
functionality, just like this:
%%ag
ag_print("Columns before: ", priv_df.columns)
temp_df = priv_df.drop(['name'], inplace = False)
ag_print("Columns After Dropping name: ", temp_df.columns)
Response:
>>> Columns before: Index(['name', 'age', 'salary'], dtype='object')
Columns After Dropping name: Index(['age', 'salary'], dtype='object')
Selecting Data
Setting values that affect a particular set of records is not allowed in PrivateDataFrame or PrivateSeries. However, transformation functions can be applied using PrivateDataFrame.ApplyMap
or PrivateSeries.Map
.
Getting Data Objects from a single column
To select a single column from a DataFrame and obtain a PrivateSeries, equivalent to df["A"]
, use the following approach:
%%ag
# Select the 'age' column from the PrivateDataFrame 'priv_df' and obtain a PrivateSeries
priv_s = priv_df['age']
# Print quick statistics and metadata of the PrivateSeries 'priv_s'
ag_print("Describe:\n", priv_s.describe(eps=1))
ag_print("Metadata:\n", priv_s.metadata)
When executed:
>>> Describe:
count 9988.000000
mean 39.472473
std 22.735391
min 0.000000
25% 21.384668
50% 37.840602
75% 55.273188
max 78.730881
Name: series, dtype: float64
Metadata:
(0, 80)
In this example:
- The
priv_df['age']
syntax is used to select the 'age' column from the PrivateDataFramepriv_df
, resulting in a PrivateSeriespriv_s
. - The
describe()
method is applied to the PrivateSeriespriv_s
with an epsilon value of 1 to obtain quick statistics about the data. - The output displays statistics such as count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values for the 'age' column.
- The metadata bounds for the 'age' column are also printed, indicating that valid values range from 0 to 80.
Getting data objects from various columns
To select a collection of columns from a DataFrame and obtain a PrivateDataFrame, you can use the following approach:
%%ag
# Select the 'age' and 'salary' columns from the PrivateDataFrame
# 'priv_df' and obtain a PrivateDataFrame 'priv_df_filtered'
priv_df_filtered = priv_df[['age', 'salary']]
# Print the columns and metadata of the PrivateDataFrame 'priv_df_filtered'
ag_print("Columns:\n", priv_df_filtered.columns)
ag_print("Metadata:\n", priv_df_filtered.metadata)
When executed:
>>> Columns:
Index(['age', 'salary'], dtype='object')
Metadata:
{'age': (0, 80), 'salary': (1, 200000)}
In this example:
- The
priv_df[['age', 'salary']]
syntax is used to select the 'age' and 'salary' columns from the PrivateDataFramepriv_df
, resulting in a PrivateDataFramepriv_df_filtered
. - The
columns
attribute of the PrivateDataFramepriv_df_filtered
is printed to display the selected columns. - The
metadata
attribute of the PrivateDataFramepriv_df_filtered
is printed to show the metadata bounds for each selected column.
Applying transformation functions
To use Applymap on a PrivateDataFrame, use the following approach:
%%ag
# Define a transformation function 'func' to map strings
# to their lengths and numerical values to their halves
def func(x: str | int | float) -> float:
if isinstance(x, str):
return len(x)
elif isinstance(x, (int, float)):
return x / 2
return 0.0
# Apply the transformation function 'func' to each element
# of the PrivateDataFrame 'priv_df' using applymap
result = priv_df.applymap(func, eps=1)
# Print the metadata and quick statistics of the resulting PrivateDataFrame 'result'
ag_print("Metadata:\n", result.metadata)
ag_print("Describe:\n", result.describe(eps=1))
When executed:
>>> Metadata:
{'name': (8.0, 16.0), 'age': (0.0, 64.0), 'salary': (512.0, 65536.0)}
Describe:
name age salary
count 10001.000000 10001.000000 10001.000000
mean 9.979915 19.763231 25068.695612
std 0.274668 10.477926 16333.537130
min 8.000000 0.000000 686.474307
25% 10.000000 8.740750 12489.945476
50% 10.000000 18.864365 24417.320448
75% 10.000000 28.283437 37265.078587
max 10.000000 35.126989 60705.413286
In this example:
- The
func
function is defined to map strings to their lengths and numerical values to their halves. - The
applymap()
function is used to apply the transformation functionfunc
to each element of the PrivateDataFramepriv_df
. - The resulting PrivateDataFrame
result
contains transformed values. - The metadata of the resulting PrivateDataFrame
result
is printed, showing the updated metadata bounds. - Quick statistics (count, mean, std, min, 25%, 50%, 75%, max) of the PrivateDataFrame
result
are printed, providing insights into the transformed data.
To use applymap
on a PrivateSeries. The mapping can be done using a dictionary for 1:1 mapping or via a callable method.
%%ag
# Define a mapping function 'series_map' to halve numerical values
def series_map(x: int) -> float:
return x / 2
# Apply the mapping function 'series_map' to the 'age' column of the PrivateDataFrame 'priv_df'
priv_df['age'] = priv_df['age'].map(series_map, eps=1)
# Print the metadata of the updated 'age' column in 'priv_df'
ag_print("Metadata:\n", priv_df.metadata)
When executed:
>>> Metadata:
{'age': (0.0, 64.0), 'salary': (1, 200000)}
In this example:
- The
series_map
function is defined to halve numerical values. - The
map()
function is used to apply the mapping functionseries_map
to the 'age' column of the PrivateDataFramepriv_df
. - The metadata of the 'age' column in the updated PrivateDataFrame
priv_df
is printed, showing the updated metadata bounds.
op_pandas
guide.See the Operations page to continue following the op_pandas
guide.