Skip to main content

Guides

Managing Data

This section presents the op_pandas library guide and addresses data management tasks. The following operations are addressed:

Viewing Data

Records in PrivateDataFrame and PrivateSeries cannot be viewed directly to protect privacy. However, users can still analyze and obtain statistical information about the data using methods that offer differential privacy guarantees.

Printing details about the data

To print details about the data, such as columns, metadata, and data types, within the AG environment, users can use the ag_print function. It can quickly inspect the details of their data within the AG environment, facilitating data analysis and exploration.

See the following example:

%%ag

ag_print("Columns: \n", priv_df.columns)
ag_print("Metadata: \n", priv_df.metadata)
ag_print("Dtypes: \n", priv_df.dtypes)

When executed:

>>> Columns:
Index(['name', 'age', 'salary'], dtype='object')
Metadata:
{'age': (0, 80), 'salary': (1, 200000)}
Dtypes:
name object
age int64
salary int64
dtype: object

Generating quick statistics

Users can obtain quick statistics about your dataset using the describe() method in pandas. By spending some epsilon, you can get an idea about the statistical details of the dataset.

See the following example:

%%ag

priv_describe = priv_df.describe(eps=1)
# Export information from remote ag kernel to local jupyter server.
ag_print(priv_describe)

When executed:

>>>
age salary
count 10011.000000 10011.000000
mean 39.430439 49728.643224
std 23.065769 28405.891863
min 0.000000 583.443063
25% 19.904847 24461.587152
50% 35.801599 50159.255936
75% 60.452492 74499.712660
max 77.701005 147580.823075

Statistics can be viewed by exporting the non-private result to the local Jupyter server:

%%ag

export(priv_describe, name='priv_describe')
Setting up exported variable in local environment: priv_describe
print(priv_describe)
>>>
age salary
count 10011.000000 10011.000000
mean 39.430439 49728.643224
std 23.065769 28405.891863
min 0.000000 583.443063
25% 19.904847 24461.587152
50% 35.801599 50159.255936
75% 60.452492 74499.712660
max 77.701005 147580.823075

Cleaning Data

Users can use the dropna method to remove any records that contain NaN values in any of its features.

%%ag
# probability of a record not having nan = (0.5 * 0.5) = 0.25
# Hence expected count after dropna should be around 2500.
priv_df_2 = PrivateDataFrame(imported_df_2, metadata = {'a': (1, 2), 'b': (1, 2)})

export(priv_df_2.dropna(axis=0).describe(eps=1), 'result')
>>> Setting up exported variable in local environment: result

/code/dependencies/op_pandas/op_pandas/core/private_dataframe.py:115: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/
indexing.html#returning-a-view-versus-a-copy
self._df[col].clip(lower=self._metadata[col][0],
upper=self._metadata[col][1], inplace=True)
print(result)
>>>
a b
count 2490.000000 2490.000000
mean 1.495172 1.502561
std 0.496467 0.483317
min 1.000000 1.000000
25% 1.000000 1.000000
50% 1.264635 1.668942
75% 1.985666 1.975653
max 1.997052 1.993214
print(df_2.dropna(axis=0).describe())
>>>
a b
count 2490.000000 2490.000000
mean 1.492369 1.503213
std 0.500042 0.500090
min 1.000000 1.000000
25% 1.000000 1.000000
50% 1.000000 2.000000
75% 2.000000 2.000000
max 2.000000 2.000000

Columns from the PrivateDataFrame can also be dropped using the drop functionality, just like this:

%%ag
ag_print("Columns before: ", priv_df.columns)

temp_df = priv_df.drop(['name'], inplace = False)

ag_print("Columns After Dropping name: ", temp_df.columns)

Response:

>>> Columns before:  Index(['name', 'age', 'salary'], dtype='object')
Columns After Dropping name: Index(['age', 'salary'], dtype='object')

Selecting Data

Setting values that affect a particular set of records is not allowed in PrivateDataFrame or PrivateSeries. However, transformation functions can be applied using PrivateDataFrame.ApplyMap or PrivateSeries.Map.

Getting Data Objects from a single column

To select a single column from a DataFrame and obtain a PrivateSeries, equivalent to df["A"], use the following approach:

%%ag

# Select the 'age' column from the PrivateDataFrame 'priv_df' and obtain a PrivateSeries
priv_s = priv_df['age']

# Print quick statistics and metadata of the PrivateSeries 'priv_s'
ag_print("Describe:\n", priv_s.describe(eps=1))
ag_print("Metadata:\n", priv_s.metadata)

When executed:

>>> Describe:
count 9988.000000
mean 39.472473
std 22.735391
min 0.000000
25% 21.384668
50% 37.840602
75% 55.273188
max 78.730881
Name: series, dtype: float64
Metadata:
(0, 80)

In this example:

  • The priv_df['age'] syntax is used to select the 'age' column from the PrivateDataFrame priv_df, resulting in a PrivateSeries priv_s.
  • The describe() method is applied to the PrivateSeries priv_s with an epsilon value of 1 to obtain quick statistics about the data.
  • The output displays statistics such as count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values for the 'age' column.
  • The metadata bounds for the 'age' column are also printed, indicating that valid values range from 0 to 80.

Getting data objects from various columns

To select a collection of columns from a DataFrame and obtain a PrivateDataFrame, you can use the following approach:

%%ag

# Select the 'age' and 'salary' columns from the PrivateDataFrame
# 'priv_df' and obtain a PrivateDataFrame 'priv_df_filtered'
priv_df_filtered = priv_df[['age', 'salary']]

# Print the columns and metadata of the PrivateDataFrame 'priv_df_filtered'
ag_print("Columns:\n", priv_df_filtered.columns)
ag_print("Metadata:\n", priv_df_filtered.metadata)

When executed:

>>> Columns:
Index(['age', 'salary'], dtype='object')
Metadata:
{'age': (0, 80), 'salary': (1, 200000)}

In this example:

  • The priv_df[['age', 'salary']] syntax is used to select the 'age' and 'salary' columns from the PrivateDataFrame priv_df, resulting in a PrivateDataFrame priv_df_filtered.
  • The columns attribute of the PrivateDataFrame priv_df_filtered is printed to display the selected columns.
  • The metadata attribute of the PrivateDataFrame priv_df_filtered is printed to show the metadata bounds for each selected column.

Applying transformation functions

To use Applymap on a PrivateDataFrame, use the following approach:

%%ag

# Define a transformation function 'func' to map strings
# to their lengths and numerical values to their halves
def func(x: str | int | float) -> float:
if isinstance(x, str):
return len(x)
elif isinstance(x, (int, float)):
return x / 2
return 0.0

# Apply the transformation function 'func' to each element
# of the PrivateDataFrame 'priv_df' using applymap
result = priv_df.applymap(func, eps=1)

# Print the metadata and quick statistics of the resulting PrivateDataFrame 'result'
ag_print("Metadata:\n", result.metadata)
ag_print("Describe:\n", result.describe(eps=1))

When executed:

>>> Metadata:
{'name': (8.0, 16.0), 'age': (0.0, 64.0), 'salary': (512.0, 65536.0)}

Describe:
name age salary
count 10001.000000 10001.000000 10001.000000
mean 9.979915 19.763231 25068.695612
std 0.274668 10.477926 16333.537130
min 8.000000 0.000000 686.474307
25% 10.000000 8.740750 12489.945476
50% 10.000000 18.864365 24417.320448
75% 10.000000 28.283437 37265.078587
max 10.000000 35.126989 60705.413286

In this example:

  • The func function is defined to map strings to their lengths and numerical values to their halves.
  • The applymap() function is used to apply the transformation function func to each element of the PrivateDataFrame priv_df.
  • The resulting PrivateDataFrame result contains transformed values.
  • The metadata of the resulting PrivateDataFrame result is printed, showing the updated metadata bounds.
  • Quick statistics (count, mean, std, min, 25%, 50%, 75%, max) of the PrivateDataFrame result are printed, providing insights into the transformed data.

To use applymap on a PrivateSeries. The mapping can be done using a dictionary for 1:1 mapping or via a callable method.

%%ag

# Define a mapping function 'series_map' to halve numerical values
def series_map(x: int) -> float:
return x / 2

# Apply the mapping function 'series_map' to the 'age' column of the PrivateDataFrame 'priv_df'
priv_df['age'] = priv_df['age'].map(series_map, eps=1)

# Print the metadata of the updated 'age' column in 'priv_df'
ag_print("Metadata:\n", priv_df.metadata)

When executed:

>>> Metadata:
{'age': (0.0, 64.0), 'salary': (1, 200000)}

In this example:

  • The series_map function is defined to halve numerical values.
  • The map() function is used to apply the mapping function series_map to the 'age' column of the PrivateDataFrame priv_df.
  • The metadata of the 'age' column in the updated PrivateDataFrame priv_df is printed, showing the updated metadata bounds.
Continue the op_pandas guide.

See the Operations page to continue following the op_pandas guide.