Guides

Managing Data

This section presents the op_pandas library guide and addresses data management tasks. The following operations are addressed:

Viewing Data
Cleaning Data
Selecting Data

Viewing Data

Records in PrivateDataFrame and PrivateSeries cannot be viewed directly to protect privacy. However, users can still analyze and obtain statistical information about the data using methods that offer differential privacy guarantees.

Printing details about the data

To print details about the data, such as columns, metadata, and data types, within the AG environment, users can use the ag_print function. It can quickly inspect the details of their data within the AG environment, facilitating data analysis and exploration.

See the following example:

%%ag

ag_print("Columns: \n", priv_df.columns)
ag_print("Metadata: \n", priv_df.metadata)
ag_print("Dtypes: \n", priv_df.dtypes)

When executed:

>>> Columns:
 Index(['name', 'age', 'salary'], dtype='object')
Metadata:
 {'age': (0, 80), 'salary': (1, 200000)}
Dtypes:
 name      object
age        int64
salary     int64
dtype: object

Generating quick statistics

Users can obtain quick statistics about your dataset using the describe() method in pandas. By spending some epsilon, you can get an idea about the statistical details of the dataset.

See the following example:

%%ag

priv_describe = priv_df.describe(eps=1)
# Export information from remote ag kernel to local jupyter server.
ag_print(priv_describe)

When executed:

>>>
                age         salary
count  10011.000000   10011.000000
mean      39.430439   49728.643224
std       23.065769   28405.891863
min        0.000000     583.443063
25%       19.904847   24461.587152
50%       35.801599   50159.255936
75%       60.452492   74499.712660
max       77.701005  147580.823075

Statistics can be viewed by exporting the non-private result to the local Jupyter server:

%%ag

export(priv_describe, name='priv_describe')

Setting up exported variable in local environment: priv_describe

print(priv_describe)

>>>
                age         salary
count  10011.000000   10011.000000
mean      39.430439   49728.643224
std       23.065769   28405.891863
min        0.000000     583.443063
25%       19.904847   24461.587152
50%       35.801599   50159.255936
75%       60.452492   74499.712660
max       77.701005  147580.823075

Cleaning Data

Users can use the dropna method to remove any records that contain NaN values in any of its features.

%%ag
# probability of a record not having nan = (0.5 * 0.5) = 0.25
# Hence expected count after dropna should be around 2500.
priv_df_2 = PrivateDataFrame(imported_df_2, metadata = {'a': (1, 2), 'b': (1, 2)})

export(priv_df_2.dropna(axis=0).describe(eps=1), 'result')

>>> Setting up exported variable in local environment: result

/code/dependencies/op_pandas/op_pandas/core/private_dataframe.py:115: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/
indexing.html#returning-a-view-versus-a-copy
  self._df[col].clip(lower=self._metadata[col][0],
  upper=self._metadata[col][1], inplace=True)

print(result)

>>>
                 a            b
count  2490.000000  2490.000000
mean      1.495172     1.502561
std       0.496467     0.483317
min       1.000000     1.000000
25%       1.000000     1.000000
50%       1.264635     1.668942
75%       1.985666     1.975653
max       1.997052     1.993214

print(df_2.dropna(axis=0).describe())

>>>
                 a            b
count  2490.000000  2490.000000
mean      1.492369     1.503213
std       0.500042     0.500090
min       1.000000     1.000000
25%       1.000000     1.000000
50%       1.000000     2.000000
75%       2.000000     2.000000
max       2.000000     2.000000

Columns from the PrivateDataFrame can also be dropped using the drop functionality, just like this:

%%ag
ag_print("Columns before: ", priv_df.columns)

temp_df = priv_df.drop(['name'], inplace = False)

ag_print("Columns After Dropping name: ", temp_df.columns)

Response:

>>> Columns before:  Index(['name', 'age', 'salary'], dtype='object')
Columns After Dropping name:  Index(['age', 'salary'], dtype='object')

Selecting Data

Setting values that affect a particular set of records is not allowed in PrivateDataFrame or PrivateSeries. However, transformation functions can be applied using PrivateDataFrame.ApplyMap or PrivateSeries.Map.

Getting Data Objects from a single column

To select a single column from a DataFrame and obtain a PrivateSeries, equivalent to df["A"], use the following approach:

%%ag

# Select the 'age' column from the PrivateDataFrame 'priv_df' and obtain a PrivateSeries
priv_s = priv_df['age']

# Print quick statistics and metadata of the PrivateSeries 'priv_s'
ag_print("Describe:\n", priv_s.describe(eps=1))
ag_print("Metadata:\n", priv_s.metadata)

When executed:

>>> Describe:
 count    9988.000000
mean       39.472473
std        22.735391
min         0.000000
25%        21.384668
50%        37.840602
75%        55.273188
max        78.730881
Name: series, dtype: float64
Metadata:
 (0, 80)

In this example:

The priv_df['age'] syntax is used to select the 'age' column from the PrivateDataFrame priv_df, resulting in a PrivateSeries priv_s.
The describe() method is applied to the PrivateSeries priv_s with an epsilon value of 1 to obtain quick statistics about the data.
The output displays statistics such as count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values for the 'age' column.
The metadata bounds for the 'age' column are also printed, indicating that valid values range from 0 to 80.

Getting data objects from various columns

To select a collection of columns from a DataFrame and obtain a PrivateDataFrame, you can use the following approach:

%%ag

# Select the 'age' and 'salary' columns from the PrivateDataFrame
# 'priv_df' and obtain a PrivateDataFrame 'priv_df_filtered'
priv_df_filtered = priv_df[['age', 'salary']]

# Print the columns and metadata of the PrivateDataFrame 'priv_df_filtered'
ag_print("Columns:\n", priv_df_filtered.columns)
ag_print("Metadata:\n", priv_df_filtered.metadata)

When executed:

>>> Columns:
 Index(['age', 'salary'], dtype='object')
Metadata:
 {'age': (0, 80), 'salary': (1, 200000)}

In this example:

The priv_df[['age', 'salary']] syntax is used to select the 'age' and 'salary' columns from the PrivateDataFrame priv_df, resulting in a PrivateDataFrame priv_df_filtered.
The columns attribute of the PrivateDataFrame priv_df_filtered is printed to display the selected columns.
The metadata attribute of the PrivateDataFrame priv_df_filtered is printed to show the metadata bounds for each selected column.

Applying transformation functions

To use Applymap on a PrivateDataFrame, use the following approach:

%%ag

# Define a transformation function 'func' to map strings
# to their lengths and numerical values to their halves
def func(x: str | int | float) -> float:
    if isinstance(x, str):
        return len(x)
    elif isinstance(x, (int, float)):
        return x / 2
    return 0.0

# Apply the transformation function 'func' to each element
# of the PrivateDataFrame 'priv_df' using applymap
result = priv_df.applymap(func, eps=1)

# Print the metadata and quick statistics of the resulting PrivateDataFrame 'result'
ag_print("Metadata:\n", result.metadata)
ag_print("Describe:\n", result.describe(eps=1))

When executed:

>>> Metadata:
 {'name': (8.0, 16.0), 'age': (0.0, 64.0), 'salary': (512.0, 65536.0)}

Describe:
                name           age        salary
count  10001.000000  10001.000000  10001.000000
mean       9.979915     19.763231  25068.695612
std        0.274668     10.477926  16333.537130
min        8.000000      0.000000    686.474307
25%       10.000000      8.740750  12489.945476
50%       10.000000     18.864365  24417.320448
75%       10.000000     28.283437  37265.078587
max       10.000000     35.126989  60705.413286

In this example:

The func function is defined to map strings to their lengths and numerical values to their halves.
The applymap() function is used to apply the transformation function func to each element of the PrivateDataFrame priv_df.
The resulting PrivateDataFrame result contains transformed values.
The metadata of the resulting PrivateDataFrame result is printed, showing the updated metadata bounds.
Quick statistics (count, mean, std, min, 25%, 50%, 75%, max) of the PrivateDataFrame result are printed, providing insights into the transformed data.

To use applymap on a PrivateSeries. The mapping can be done using a dictionary for 1:1 mapping or via a callable method.

%%ag

# Define a mapping function 'series_map' to halve numerical values
def series_map(x: int) -> float:
    return x / 2

# Apply the mapping function 'series_map' to the 'age' column of the PrivateDataFrame 'priv_df'
priv_df['age'] = priv_df['age'].map(series_map, eps=1)

# Print the metadata of the updated 'age' column in 'priv_df'
ag_print("Metadata:\n", priv_df.metadata)

When executed:

>>> Metadata:
 {'age': (0.0, 64.0), 'salary': (1, 200000)}

In this example:

The series_map function is defined to halve numerical values.
The map() function is used to apply the mapping function series_map to the 'age' column of the PrivateDataFrame priv_df.
The metadata of the 'age' column in the updated PrivateDataFrame priv_df is printed, showing the updated metadata bounds.

Continue the op_pandas guide.

See the Operations page to continue following the op_pandas guide.

Guides

Managing Data​

Viewing Data​

Printing details about the data​

Generating quick statistics​

Cleaning Data​

Selecting Data​

Getting Data Objects from a single column​

Getting data objects from various columns​

Applying transformation functions​

Managing Data

Viewing Data

Printing details about the data

Generating quick statistics

Cleaning Data

Selecting Data

Getting Data Objects from a single column

Getting data objects from various columns

Applying transformation functions