API Reference

PrivateDataFrame

The PrivateDataFrame API mirrors pandas.DataFrame, except that all of its methods are differentially private.

Constructor

The constructor for a PrivateDataFrame is as follows:

class op_pandas.PrivateDataFrame(
df : pandas.DataFrame,
metadata = None,
categorical_metadata = None
)

Parameters:

  • df: pandas.DataFrame : A pandas DataFrame, with data consisting of only strings, integers, floats, booleans, and datetime objects.
  • metadata: Dict[str, Tuple[float, float]]: Bounds for the numerical columns of the given DataFrame. The metadata should be a dictionary that maps column names to their (lower, upper) bounds. It should contain keys only for columns that hold purely numerical data.
    { 'Age': (18,65), 'Salary': (10000, 200000), 'Gender': (0,1) }
  • categorical_metadata: Dict[str, List]: Metadata containing information about the categorical data of the given DataFrame. The categorical_metadata should be a dictionary with column names as keys mapped to a list containing all the categories in the column. The data types for all the elements in the list must be identical.
    { 'Income' : [">50k", "<=50k"], 'Sex': ["M", "F"] }
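
To make the expected shape of the two dictionaries concrete, the sketch below checks them against the constraints listed above before they are passed to the constructor. The helper names check_metadata and check_categorical_metadata are hypothetical and not part of op_pandas; the constructor call is shown only as a comment because it requires the library and a DataFrame.

```python
from typing import Dict, List, Tuple


def check_metadata(metadata: Dict[str, Tuple[float, float]]) -> None:
    """Raise ValueError unless every entry is a numeric (lower, upper) pair."""
    for column, bounds in metadata.items():
        if not isinstance(bounds, tuple) or len(bounds) != 2:
            raise ValueError(f"{column}: expected a (lower, upper) tuple")
        if not all(isinstance(b, (int, float)) for b in bounds):
            raise ValueError(f"{column}: bounds must be numeric")
        lower, upper = bounds
        if lower > upper:
            raise ValueError(f"{column}: lower bound exceeds upper bound")


def check_categorical_metadata(categorical_metadata: Dict[str, List]) -> None:
    """Raise ValueError if a category list mixes element types."""
    for column, categories in categorical_metadata.items():
        if len({type(c) for c in categories}) > 1:
            raise ValueError(f"{column}: categories must share one data type")


metadata = {"Age": (18, 65), "Salary": (10000, 200000), "Gender": (0, 1)}
categorical_metadata = {"Income": [">50k", "<=50k"], "Sex": ["M", "F"]}

check_metadata(metadata)
check_categorical_metadata(categorical_metadata)

# With op_pandas available and df being a pandas DataFrame:
# priv_df = op_pandas.PrivateDataFrame(
#     df, metadata=metadata, categorical_metadata=categorical_metadata
# )
```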

General Functions

applymap

The applymap() method allows you to apply one or more functions to the DataFrame object, enabling the modification of each element independently.

Function Signature:

PrivateDataFrame.applymap(
func,
eps = 0,
output_bounds = None
) -> PrivateDataFrame:

Parameters:

  • func: callable: A Python function that takes a single value and returns a single value. It should meet the following constraints:
    • func can take only one argument, the individual element to which the function is applied.
    • The function should carry appropriate type annotations. To annotate with datetime or regex types, import datetime and re first.
  • eps: float: The epsilon provided to the differentially private calculation. The eps value must be >=0. It’s used to calculate bounds.
  • output_bounds: Dict[str, Tuple[float, float]]: The output bounds, if already known. Providing them avoids spending epsilon on estimating the bounds of the applied function’s output.
  • categorical_output_bounds: Dict[str, List]: The categorical output bounds, if already known. If they are not given for a specific column, they are calculated automatically using the function provided.

Returns:

  • PrivateDataFrame: A new DataFrame with the function applied to each element.
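
As an illustration of the constraints on func, the sketch below defines two single-argument, type-annotated functions and a small check for those properties. The helper looks_applymap_ready is hypothetical and only reflects one reading of the constraints above; it is not how op_pandas validates functions.

```python
import inspect
import re
from datetime import datetime
from typing import get_type_hints


def extract_year(value: datetime) -> int:
    """Element-wise function: map a datetime to its year."""
    return value.year


def has_digits(text: str) -> bool:
    """Element-wise function: report whether a string contains a digit."""
    return re.search(r"\d", text) is not None


def looks_applymap_ready(func) -> bool:
    """Check for exactly one parameter, with parameter and return annotations."""
    params = list(inspect.signature(func).parameters.values())
    hints = get_type_hints(func)
    return len(params) == 1 and params[0].name in hints and "return" in hints


print(looks_applymap_ready(extract_year))
print(looks_applymap_ready(lambda x: x))

# With op_pandas available, a call might look like:
# new_df = priv_df.applymap(extract_year, eps=0.1)
```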

all

The all() method returns whether all elements are True, potentially over an axis.

Function Signature:

PrivateDataFrame.all(
axis: int = 0,
bool_only: bool = False,
skipna: bool = False
) -> PrivateSeries:

Parameters:

  • axis: int, default 0: The axis to use. 0 is for rows, and 1 is for columns.
  • bool_only: bool, default False: Include only boolean columns. If False, all columns are included.
  • skipna: bool, default False: Exclude NA/null values when computing the result.

Returns:

  • PrivateSeries: A Series indicating whether all elements along a specified axis are True.

categorical_metadata

This property returns the metadata of the categorical columns in the PrivateDataFrame.

Function Signature:

PrivateDataFrame.categorical_metadata -> dict

Returns:

  • dict: A dictionary containing metadata about the categorical columns in the DataFrame.

columns

This property returns the column names of the PrivateDataFrame.

Function Signature:

PrivateDataFrame.columns -> list

Example:

>>> priv_df.columns

['age', 'workclass', 'fnlwgt', 'education', 'educational-num',
'marital-status', 'occupation', 'relationship', 'race', 'gender',
'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']

Returns:

  • list: A list containing the names of the columns in the DataFrame.

copy

This method returns a copy of the PrivateDataFrame.

Function Signature:

PrivateDataFrame.copy() -> PrivateDataFrame

Returns:

  • PrivateDataFrame: A new instance of PrivateDataFrame that is a copy of the original.

describe

The describe() method returns a statistical description of the data in the DataFrame, using differentially private calculations.

Function Signature:

PrivateDataFrame.describe(
eps,
percentiles = None,
include = None,
exclude = None
)-> pandas.DataFrame

Parameters:

  • eps: float: The epsilon provided to the differentially private calculation. eps must be >=0.
  • percentiles: list-like of numbers, optional: The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
  • include: ‘all’, list-like of dtypes or None (default), optional:
    • all: All input columns will be included in the output.
    • A list-like of dtypes: Limits the results to the provided data types.
      • To limit the result to numeric types, submit numpy.number.
      • To limit the result to object columns, submit the numpy.object data type.
      • Strings can also be used in the select_dtypes style.
      • To select pandas categorical columns, use category.
    • None (default): The result will include all numeric columns.
  • exclude: list-like of dtypes or None (default), optional:
    • A list-like of dtypes : Excludes the provided data types from the result.
      • To exclude numeric types, submit numpy.number.
      • To exclude object columns, submit the data type numpy.object.
      • Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])).
      • To exclude pandas’ categorical columns, use category.
    • None (default): No result will be excluded.

Returns:

  • pandas.DataFrame: A DataFrame object with the statistical description of the DataFrame’s columns, adjusted for privacy concerns.

drop

The drop() method removes the specified row or column from the PrivateDataFrame.

Function Signature:

PrivateDataFrame.drop(
columns=None,
inplace=True,
errors='raise'
)

Parameters:

  • columns: single label or list-like: Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
  • inplace: boolean: Whether to operate in place on the data.
  • errors: {‘ignore’, ‘raise’}, default ‘raise’: If ‘ignore’, errors are suppressed and only existing labels are dropped.

dropna

The dropna() method removes the rows that contain NULL values from the PrivateDataFrame.

Function Signature:

PrivateDataFrame.dropna(
axis=0,
how=_NoDefault.no_default,
thresh=_NoDefault.no_default
)

Parameters:

  • axis: int {index (0), columns (1)}, default = 0: Axis for the function to be applied on.
  • how: str {‘any’, ‘all’}, default ‘any’: Determine if a row or column is removed from DataFrame when we have at least one NA or all NA.
    • any: If any NA values are present, drop that row or column.
    • all: If all values are NA, drop that row or column.
  • thresh: int, optional: The minimum number of non-NA values a row or column must contain to be kept. It cannot be combined with how.

dtypes

The dtypes property returns the data type of each column in the PrivateDataFrame.

Function Signature:

PrivateDataFrame.dtypes

fillna

The fillna() method is used to replace missing values (NaNs) in a PrivateDataFrame. This method provides various options for filling missing data, either by specifying a static value or by using a method like 'forward fill' or 'backward fill'.

Function Signature:

PrivateDataFrame.fillna(
value=None,
limit: int = None,
method=None,
inplace: bool = False
):

Parameters:

  • value : scalar, dict, Series, DataFrame, or None, default None: The value used to fill missing entries. It can be a scalar, dictionary, Series, or DataFrame, providing great flexibility in how replacements are handled. If value is None and method is specified, it will perform the specified method of filling.

  • limit : int, optional: The maximum number of consecutive NaN values to forward/backward fill. The limit applies to the number of filled values.

  • method : {'backfill', 'bfill', 'pad', 'ffill', or None}, optional: The method to use when filling holes in reindexed Series:

    • 'pad' or 'ffill': Propagate the last valid observation forward to the next valid one.
    • 'backfill' or 'bfill': Use the next valid observation to fill the gap.
  • inplace : bool, default False: If True, fill in-place. Note that this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).

Returns:

PrivateDataFrame or None: Depending on the value of inplace, it either returns a new DataFrame with missing values filled or modifies the original DataFrame and returns None.
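
The interaction between method and limit can be sketched in plain Python as follows. This is only an illustration of forward fill as described above, using None as the missing-value marker; forward_fill is a hypothetical helper, not the op_pandas implementation.

```python
from typing import List, Optional


def forward_fill(
    values: List[Optional[float]], limit: Optional[int] = None
) -> List[Optional[float]]:
    """Propagate the last valid value forward, filling at most `limit` gaps in a row."""
    filled: List[Optional[float]] = []
    last_valid: Optional[float] = None
    run = 0  # number of consecutive values filled since the last valid one
    for value in values:
        if value is not None:
            last_valid = value
            run = 0
            filled.append(value)
        elif last_valid is not None and (limit is None or run < limit):
            run += 1
            filled.append(last_valid)
        else:
            # Either nothing to propagate yet or the limit has been reached.
            filled.append(None)
    return filled


print(forward_fill([1, None, None, None, 2], limit=1))
print(forward_fill([None, 3, None]))
```

A backward fill follows the same logic applied to the reversed list.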

groupby

The groupby() method groups the rows of a PrivateDataFrame by the given keys so that operations such as sum, mean, and count can be computed for each group.

Function Signature:

PrivateDataFrame.groupby(
by=None,
sort: bool=True,
dropna: bool=True
) -> PrivateDataFrameGroupby

Parameters:

  • by : str | List | pd.Series | op_pandas.PrivateSeries: Determines the groups for the groupby operation. Options include:

    • Column: Group by one or more categorical columns. The columns should be specified in the categorical_metadata.
    • Boolean Series / PrivateSeries: A series of boolean values. Non-boolean series will be converted to boolean before grouping.
    • List: A combination of column names and series.
  • sort : bool, default True: Controls whether the group keys are sorted. If set to False, the groups will appear in the order they are found in the original DataFrame.

  • dropna : bool, default True: If True, rows with NA values in the group keys are dropped. If False, NA values are included as a group key.

Allowed Operations:

After grouping, the following operations can be applied to compute statistics for each group:

  • sum: Calculate the sum of group values.
  • mean: Compute the average of group values.
  • std: Standard deviation of the group values.
  • var: Variance of the group values.
  • count: Count of non-NA cells for each group.
  • quantile: Compute quantiles for each group.
  • median: Median of the group values.
  • percentile: Specific percentiles of group values.

Returns:

PrivateDataFrameGroupby: A specialized view of the DataFrame that supports further operations specific to groups.

Usage:

import op_pandas as opd

# Create a PrivateDataFrame with metadata
pdf = opd.PrivateDataFrame(
df,
metadata={"age": (0,100)},
categorical_metadata={"groups": ['a', 'b', 'c']}
)

# Group the data by 'groups' column
grouped = pdf.groupby("groups")

# Print the sum of the 'age' column for each group, ensuring differential privacy
print(grouped.sum(eps=1))

Output Example:

               age
a    837986.678085
b    817237.487139
c    827334.458893

This example demonstrates how to group data by categories and apply a privacy-preserving sum operation, providing insights into the dataset while maintaining the confidentiality of the data.
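
For intuition about what a differentially private grouped sum involves, here is a minimal pure-Python sketch using the textbook Laplace mechanism. It is not the op_pandas implementation: the helpers laplace_noise and dp_group_sum are hypothetical, and the sketch assumes each individual contributes a single row. Values are clipped to the metadata bounds, and the group keys are taken from the fixed category list rather than from the data.

```python
import random
from typing import Dict, List, Sequence, Tuple


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_group_sum(
    rows: List[dict],
    key: str,
    column: str,
    bounds: Tuple[float, float],
    categories: Sequence[str],
    eps: float,
    rng: random.Random,
) -> Dict[str, float]:
    """Noisy per-group sums over a fixed set of categories (requires eps > 0)."""
    lower, upper = bounds
    # Every declared category gets an entry, even if no row belongs to it.
    sums = {category: 0.0 for category in categories}
    for row in rows:
        if row[key] in sums:
            sums[row[key]] += min(max(row[column], lower), upper)  # clip to bounds
    # One row changes exactly one group's sum by at most this amount.
    sensitivity = max(abs(lower), abs(upper))
    scale = sensitivity / eps
    return {category: total + laplace_noise(scale, rng) for category, total in sums.items()}


rows = [
    {"groups": "a", "age": 30},
    {"groups": "a", "age": 45},
    {"groups": "b", "age": 250},  # clipped to the upper bound of 100
]
print(dp_group_sum(rows, "groups", "age", (0, 100), ["a", "b", "c"], 1.0, random.Random(0)))
```

Smaller eps values give a larger noise scale, so the reported sums drift further from the true ones.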

idxmax

The idxmax() method returns the index of the first occurrence of the maximum value in the PrivateDataFrame.

Function Signature:

PrivateDataFrame.idxmax(axis=1, skipna=True, numeric_only=False) -> PrivateSeries:

Parameters:

  • axis: int, default 1: The axis to use. Only column idxmax (axis=1) is allowed.
  • skipna: bool, default True: Exclude NA/null values when computing the result.
  • numeric_only: bool, default False: Include only float, int, and boolean columns.

idxmin

The idxmin() method returns the index of the first occurrence of the minimum value in the PrivateDataFrame.

Function Signature:

PrivateDataFrame.idxmin(
axis=1,
skipna=True,
numeric_only=False
) -> PrivateSeries:

Parameters:

  • axis: int, default 1: The axis to use. Only column idxmin (axis=1) is allowed.
  • skipna: bool, default True: Exclude NA/null values when computing the result.
  • numeric_only: bool, default False: Include only float, int, and boolean columns.

info

The info() method provides a concise summary of a PrivateDataFrame, detailing attributes like column names, their data types, and additional metadata concerning bounds and categorical distinctions.

Usage:

private_df.info()

isnull

The isnull() method detects missing values for an array-like object.

Function Signature:

PrivateDataFrame.isnull() -> PrivateDataFrame:

isna

The isna() method detects missing values for an array-like object.

Function Signature:

PrivateDataFrame.isna() -> PrivateDataFrame:

isin

The isin() method checks if each element in a DataFrame is contained in the specified values.

Function Signature:

PrivateDataFrame.isin(values):

Parameters:

  • values: PrivateDataFrame: The result will only be True at a location if all the labels match.

join

The join() method inserts columns from another DataFrame or Series.

Function Signature:

PrivateDataFrame.join(
other,
on = None,
how = "left",
lsuffix = "",
rsuffix = "",
sort = False,
validate = None
) -> PrivateDataFrame

Parameters:

  • other: PrivateDataFrame, PrivateSeries: Index should be similar to one of the columns in this one. If a PrivateSeries is provided, its name attribute will be used as the column name in the resulting joined DataFrame.
  • on: str, list of str, or array-like, optional: Specifies in what level to do the joining.
  • how: {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’:
    • left: use the calling frame’s index (or column if on is specified).
    • right: use the other’s index.
    • outer: form a union of the calling frame’s index (or column if on is specified) with the other’s index and sort it lexicographically.
    • inner: form the intersection of the calling frame’s index (or column if on is specified) with the other’s index, preserving the order of the calling frame’s index.
  • lsuffix: str, default ‘’: Suffix to use from left frame’s overlapping columns.
  • rsuffix: str, default ‘’: Suffix to use from right frame’s overlapping columns.
  • sort: bool, default False: Order result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword).
  • validate: str, optional: If specified, check if the join is of the specified type. The options are:
    • one_to_one or 1:1: Check if join keys are unique in both left and right datasets.
    • one_to_many or 1:m: Check if join keys are unique in the left dataset.
    • many_to_one or m:1: Check if join keys are unique in the right dataset.
    • many_to_many or m:m: This option is allowed but doesn’t result in checks.
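
The validate options boil down to uniqueness checks on the join keys. The following sketch spells those checks out for plain lists of keys; check_join is a hypothetical helper written for illustration and says nothing about how op_pandas performs the validation internally.

```python
from typing import Hashable, Sequence


def check_join(
    left_keys: Sequence[Hashable], right_keys: Sequence[Hashable], validate: str
) -> bool:
    """Return True if the key columns satisfy the requested join type."""
    left_unique = len(set(left_keys)) == len(left_keys)
    right_unique = len(set(right_keys)) == len(right_keys)
    if validate in ("one_to_one", "1:1"):
        return left_unique and right_unique
    if validate in ("one_to_many", "1:m"):
        return left_unique
    if validate in ("many_to_one", "m:1"):
        return right_unique
    if validate in ("many_to_many", "m:m"):
        return True  # accepted, but nothing is checked
    raise ValueError(f"unknown validate option: {validate}")


print(check_join([1, 2, 3], [1, 1, 2], "one_to_many"))
print(check_join([1, 2, 3], [1, 1, 2], "one_to_one"))
```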

make_column_categorical

The make_column_categorical() method converts a non-categorical column to a categorical one.

Function Signature:

PrivateDataFrame.make_column_categorical(
column,
categories,
inplace=False
):

Parameters:

  • column: str: Column to be converted to categorical.
  • categories: List: List of categories to be used for the column.
  • inplace: bool: If True, the operation is done in place.

make_column_non_categorical

The make_column_non_categorical() method converts a categorical column to a non-categorical one.

Function Signature:

PrivateDataFrame.make_column_non_categorical(
columns: str | List[str],
output_bounds: dict = None,
eps: float = 0.0
)

Parameters:

  • columns: str | List[str]: Column or a list of columns.
  • output_bounds: dict: If a column contains numerical values, but is categorical, you need to provide output bounds for it. If output bounds for a numerical column are absent, epsilon will be spent to estimate the bounds.
  • eps: float: Epsilon to estimate the output bounds of a numerical column.

metadata

The metadata property returns the bounds of the numerical columns present in the PrivateDataFrame.

Function Signature:

PrivateDataFrame.metadata -> dict

notnull

The notnull() method detects non-missing values for an array-like object.

Function Signature:

PrivateDataFrame.notnull() -> PrivateDataFrame:

notna

The notna() method detects existing (non-missing) values.

Function Signature:

PrivateDataFrame.notna() -> PrivateDataFrame:

one_hot_encoding

The one_hot_encoding() method encodes the categorical columns of the PrivateDataFrame into one-hot vectors.

Function Signature:

PrivateDataFrame.one_hot_encoding(
cols,
prefix=None,
prefix_sep="_"
) -> PrivateDataFrame:

Parameters:

  • cols: str | List[str]: Column or list of columns to be encoded.
  • prefix: str: Prefix to be used for the column names in the resulting PrivateDataFrame.
  • prefix_sep: str: Separator to be used between the prefix and the column name.
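
To show how prefix and prefix_sep shape the resulting column names, here is a small pure-Python sketch over a list of row dictionaries. It assumes the same naming convention as pandas.get_dummies (the prefix defaults to the column name); whether op_pandas uses exactly this default is an assumption, and one_hot_encode is a hypothetical helper.

```python
from typing import List, Optional, Sequence


def one_hot_encode(
    rows: List[dict],
    column: str,
    categories: Sequence[str],
    prefix: Optional[str] = None,
    prefix_sep: str = "_",
) -> List[dict]:
    """Replace `column` with one 0/1 indicator column per category."""
    prefix = column if prefix is None else prefix
    encoded = []
    for row in rows:
        new_row = {name: value for name, value in row.items() if name != column}
        for category in categories:
            new_row[f"{prefix}{prefix_sep}{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


rows = [{"id": 1, "Sex": "M"}, {"id": 2, "Sex": "F"}]
print(one_hot_encode(rows, "Sex", ["M", "F"]))
```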

rename

This method renames a specific set of columns in the PrivateDataFrame. It takes a dictionary that maps each existing column name to its new name, one to one.

Function Signature:

PrivateDataFrame.rename(dict) -> PrivateDataFrame

sample_with_sensitivity

The sample_with_sensitivity() method returns a random sample of items from the PrivateDataFrame, so that the sensitivity (how many times a user can be present in the dataset) is capped.

Function Signature:

PrivateDataFrame.sample_with_sensitivity(max_sensitivity) -> PrivateDataFrame:

Parameters:

  • max_sensitivity: int: The maximum number of times a user can be present in the dataset.
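
The idea of capping sensitivity can be sketched as follows: group the rows by user, then keep a random subset of at most max_sensitivity rows per user. The explicit user_key argument and the helper cap_user_contributions are assumptions made for this illustration; this section does not describe how op_pandas identifies users.

```python
import random
from collections import Counter, defaultdict
from typing import List


def cap_user_contributions(
    rows: List[dict], user_key: str, max_sensitivity: int, rng: random.Random
) -> List[dict]:
    """Randomly keep at most `max_sensitivity` rows for each user."""
    rows_by_user = defaultdict(list)
    for row in rows:
        rows_by_user[row[user_key]].append(row)
    sampled = []
    for user_rows in rows_by_user.values():
        keep = min(max_sensitivity, len(user_rows))
        sampled.extend(rng.sample(user_rows, keep))
    return sampled


rows = [{"user": "u1", "value": i} for i in range(5)] + [{"user": "u2", "value": 9}]
capped = cap_user_contributions(rows, "user", 2, random.Random(0))
print(Counter(row["user"] for row in capped))
```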

size

The size method returns the differentially private number of elements in the PrivateDataFrame.

Function Signature:

PrivateDataFrame.size(eps: float = 0) -> int:

Parameters:

  • eps: float: The epsilon provided to the differentially private calculation. The eps value must be >=0.

unique

The unique() method returns the unique values in a column of the PrivateDataFrame.

Function Signature:

PrivateDataFrame.unique(column: str) -> PrivateSeries:

Parameters:

  • column: str: The column for which you want to find the unique values.

where

The where() method replaces the values of the rows where the condition evaluates to False.

Function Signature:

PrivateDataFrame.where(
cond,
other = None,
inplace = False,
axis = None,
level = None
)

Parameters:

  • cond: bool PrivateSeries/PrivateDataFrame, Series/DataFrame, or array-like:
    • If True, keep the original value.
    • If False, replace it with the corresponding value from other.
  • other: None: Changing other is not currently supported.
  • inplace: bool, default False: Whether to operate in place on the data.
  • axis: int, default None: Alignment axis if needed. For Series, this parameter is unused and defaults to 0.
  • level: int, default None: Alignment level if needed.

Returns:

  • PrivateDataFrame: A DataFrame with the result, or None if the inplace parameter is set to True.
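
The element-wise rule for cond can be illustrated on plain lists. This sketch is a simplified stand-in, not the op_pandas implementation; where_values is a hypothetical helper and None plays the role of the replacement value.

```python
from typing import Any, List, Sequence


def where_values(values: Sequence[Any], cond: Sequence[bool], other: Any = None) -> List[Any]:
    """Keep each value whose condition is True; otherwise substitute `other`."""
    return [value if keep else other for value, keep in zip(values, cond)]


ages = [25, 70, 41]
print(where_values(ages, [age < 65 for age in ages]))  # → [25, None, 41]
```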

Basic statistical methods

count

The count() method counts the non-empty values in each column, or in each row if you specify axis='columns'.

Function Signature:

PrivateDataFrame.count(
eps = 0,
axis=0,
numeric_only=False
)

Parameters:

  • eps: float, default = 0: The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.
  • axis: int {index (0), columns (1)}, default = 0: Axis for the function to be applied on.
  • numeric_only: bool, default None: Include only float, int, and boolean columns. If axis = 0, numeric_only is always assumed to be True, otherwise, you must specify a value.

Returns:

  • Series: A Series object with the count result for each row/column.

mean

The mean() method returns the mean value of each column.

Function Signature:

PrivateDataFrame.mean(
eps = 0,
axis=0,
skipna=True,
numeric_only=None,
**kwargs
)

Parameters:

  • eps: float, default = 0: The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.
  • axis: int {index (0), columns (1)}, default = 0: Axis for the function to be applied on.
  • skipna: bool, default True: Exclude NA/null values when computing the result.
  • numeric_only: bool, default None: Include only float, int, and boolean columns. If axis = 0, numeric_only is always assumed to be True; otherwise, you must specify a value.
  • **kwargs: Additional keyword arguments to be passed to the function.

Returns:

  • Series: A Series with the mean values.
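
One common way to obtain a differentially private mean is to split the budget between a noisy sum and a noisy count and divide the two. The sketch below shows that approach under stated assumptions: it is a textbook construction, not a description of what op_pandas does internally, the helper names are hypothetical, each individual is assumed to contribute one value, and eps must be strictly positive.

```python
import random
from typing import Sequence, Tuple


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(
    values: Sequence[float], bounds: Tuple[float, float], eps: float, rng: random.Random
) -> float:
    """Noisy mean: half of eps goes to the sum, half to the count."""
    lower, upper = bounds
    clipped = [min(max(v, lower), upper) for v in values]
    half_eps = eps / 2.0
    noisy_sum = sum(clipped) + laplace_noise(max(abs(lower), abs(upper)) / half_eps, rng)
    noisy_count = len(clipped) + laplace_noise(1.0 / half_eps, rng)
    # Guard against a tiny or negative noisy count, then keep the result in bounds.
    noisy_mean = noisy_sum / max(noisy_count, 1.0)
    return min(max(noisy_mean, lower), upper)


print(dp_mean([20, 30, 40], (0, 100), eps=1.0, rng=random.Random(0)))
```

Because both the sum and the count are perturbed, results on small datasets can differ noticeably from the true mean.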

median

The median() method returns a Series with the median value of each column.

Function Signature:

PrivateDataFrame.median(eps)

Parameters:

  • eps: float: The epsilon provided to the differentially private calculation. The eps value must be >=0.

percentile

The percentile() method is a differentially private implementation of the percentile calculation.

Function Signature:

PrivateDataFrame.percentile(eps, p)

Parameters:

  • eps: float: The epsilon provided to the differentially private calculation. eps must be >=0.
  • p: float or array-like: The percentile(s) to compute. Each value must satisfy 0 <= p <= 100.

quantile

The quantile() method calculates the quantile of the values in a given axis. The default axis is row.

Function Signature:

PrivateDataFrame.quantile(eps, q = 0.5)

Parameters:

  • eps: float: The epsilon provided to the differentially private calculation. The eps value must be >=0.
  • q: float or array-like, default 0.5 (50% quantile): The quantile(s) to compute. Each value must satisfy 0 <= q <= 1.
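
Percentiles and quantiles describe the same positions on different scales: a percentile p in [0, 100] corresponds to the quantile q = p / 100 in [0, 1]. The short sketch below performs that conversion with range checking; percentile_to_quantile is a hypothetical helper, and the sketch only states the mathematical relationship, not how the two op_pandas methods are implemented.

```python
from typing import List, Sequence, Union

Number = Union[int, float]


def percentile_to_quantile(p: Union[Number, Sequence[Number]]) -> Union[float, List[float]]:
    """Convert percentile(s) in [0, 100] to quantile(s) in [0, 1]."""
    is_scalar = isinstance(p, (int, float))
    values = [p] if is_scalar else list(p)
    for value in values:
        if not 0 <= value <= 100:
            raise ValueError(f"percentile out of range: {value}")
    quantiles = [value / 100.0 for value in values]
    return quantiles[0] if is_scalar else quantiles


print(percentile_to_quantile(25))            # → 0.25
print(percentile_to_quantile([25, 50, 75]))  # → [0.25, 0.5, 0.75]
```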

Standard deviation

The standard deviation method, std(), returns the sample’s standard deviation over a requested axis.

Function Signature:

PrivateDataFrame.std(
eps = 0,
axis=0,
skipna=True,
ddof=1,
numeric_only=None,
**kwargs
)

Parameters:

  • eps: float, default = 0: The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.
  • axis: int {index (0), columns (1)}, default = 0: Axis for the function to be applied on.
  • skipna: bool, default True: Exclude NA/null values when computing the result.
  • ddof: int, default 1: Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. If axis = 0, ddof must be equal to 1.
  • numeric_only: bool, default None: Include only float, int, and boolean columns. If axis = 0, numeric_only is always assumed to be True. Otherwise, you must specify a value.
  • **kwargs: Additional keyword arguments to be passed to the function.
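
The effect of ddof on the divisor N - ddof can be seen in a short, non-private calculation. The helper functions below are written for this illustration only; with ddof=1 the result matches the sample statistics from Python's statistics module.

```python
import math
import statistics
from typing import Sequence


def variance(values: Sequence[float], ddof: int = 1) -> float:
    """Sum of squared deviations from the mean, divided by N - ddof."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)


def std(values: Sequence[float], ddof: int = 1) -> float:
    """Standard deviation as the square root of the variance."""
    return math.sqrt(variance(values, ddof))


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(variance(data, ddof=0))  # → 4.0 (population variance, divisor N)
print(std(data, ddof=0))       # → 2.0
print(math.isclose(variance(data, ddof=1), statistics.variance(data)))  # → True
```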

sum

The sum() method adds all values in each column and returns the sum for each one.

Function Signature:

PrivateDataFrame.sum(
eps = 0,
axis=0,
skipna=True,
numeric_only=None,
min_count=0,
**kwargs
)

Parameters:

  • eps: float, default = 0: The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.
  • axis: int {index (0), columns (1)}, default = 0: Axis for the function to be applied on.
  • skipna: bool, default True: Exclude NA/null values when computing the result.
  • numeric_only: bool, default None: Include only float, int, and boolean columns. If axis = 0, numeric_only is always assumed to be True. Otherwise, you must specify a value.
  • min_count: int, default 0: The required number of valid values to perform the operation.
    • If fewer than min_count non-NA values are present, the result will be NA.
    • If axis = 0, min_count is always assumed to be 0. Otherwise, you must specify a value.
  • **kwargs: Additional keyword arguments to be passed to the function.

variance

The var() method calculates the variance for each column.

Function Signature:

PrivateDataFrame.var(eps = 0, axis=0, skipna=True, ddof=1, numeric_only=None, **kwargs)

Parameters:

  • eps: float, default = 0: The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.
  • axis: int {index (0), columns (1)}, default = 0: Axis for the function to be applied on.
  • skipna: bool, default True: Exclude NA/null values when computing the result.
  • ddof: int, default 1: Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. If axis = 0, ddof must be equal to 1.
  • numeric_only: bool, default None: Include only float, int, and boolean columns. If axis = 0, numeric_only is always assumed to be True. Otherwise, you must specify a value.
  • **kwargs: Additional keyword arguments to be passed to the function.

Advanced statistical methods

correlation

The corr() method computes the pairwise correlation between the columns of a PrivateDataFrame.

Function Signature:

PrivateDataFrame.corr(eps: float, method: str = "pearson", min_periods: int = 1, numeric_only = True)

Parameters:

  • eps: float: The epsilon provided to the differentially private calculation. The eps value must be >=0.
  • method: str, {‘pearson’ or ‘spearman’}, default 'pearson': The method used to calculate the correlation. The available options are:
    • pearson: standard correlation coefficient.
    • spearman: Spearman rank correlation.
  • min_periods: int, optional: Assumed to be 1. Changing min_periods is not currently supported.
  • numeric_only: bool, default True: Include only float, int, or boolean data. Changing numeric_only is not currently supported.

Returns:

  • Pandas DataFrame: A DataFrame object with the correlation results.
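
The difference between the two method options can be seen in a non-private, pure-Python sketch: Spearman correlation is Pearson correlation computed on ranks, so it reaches 1 for any strictly increasing relationship, while Pearson only does so for a linear one. The functions below are illustrative helpers and do not include the noise that a differentially private version would add.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Standard (Pearson) correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # strictly increasing but not linear
print(pearson(x, y))
print(spearman(x, y))
```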

covariance

The cov() method computes the pairwise covariance between the columns of a PrivateDataFrame.

Function Signature:

PrivateDataFrame.cov(
eps: float,
min_periods,
ddof = 1,
numeric_only = True
)

Parameters:

  • eps: float: The epsilon provided to the differentially private calculation. The eps value must be >=0.
  • min_periods: int, optional: Assumed to be 1. Changing min_periods is not currently supported.
  • ddof: int, default 1: Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. Changing ddof is not currently supported.
  • numeric_only: bool, default True: Include only float, int, or boolean data. Changing numeric_only is not currently supported.

Returns:

  • Pandas DataFrame: A DataFrame object with the covariance results.

skew

The skew() method calculates the skew for each column.

Function Signature:

PrivateDataFrame.skew(
eps,
axis = 0,
skipna = True,
numeric_only = True
)

Parameters:

  • eps: float, default = 0: The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.
  • axis: int {index (0), columns (1)}, default = 0: Axis for the function to be applied on.
  • skipna: bool, default True: Exclude NA/Null values when computing the result.
  • numeric_only: bool, default True: Include only float, int, and boolean columns. If axis = 0, numeric_only is always assumed to be True. Otherwise, you must specify a value.

Returns:

  • Pandas DataFrame: A DataFrame object with the skew results.

Histograms

hist

This method draws a histogram of the PrivateDataFrame’s columns.

Function Signature:

PrivateDataFrame.hist(
column,
eps,
bins = 10
)

Parameters:

  • column: str: Column in the PrivateDataFrame to group by.
  • eps: float: The epsilon provided to the differentially private calculation. The eps value must be >=0.
  • bins: int, default 10: Number of histogram bins to be used.
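
To illustrate how bins relates to the column bounds, the sketch below builds equal-width bins across a known (lower, upper) range and counts values per bin. It assumes the bounds come from the column's metadata, and it omits the noise that a differentially private histogram would add to each count; histogram_counts is a hypothetical helper, not the op_pandas implementation.

```python
from typing import List, Sequence, Tuple


def histogram_counts(
    values: Sequence[float], bounds: Tuple[float, float], bins: int = 10
) -> Tuple[List[float], List[int]]:
    """Return (bin edges, per-bin counts) for equal-width bins over the bounds."""
    lower, upper = bounds
    width = (upper - lower) / bins
    edges = [lower + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for value in values:
        clipped = min(max(value, lower), upper)  # out-of-range values go to the edge bins
        index = min(int((clipped - lower) / width), bins - 1)  # upper bound joins the last bin
        counts[index] += 1
    return edges, counts


edges, counts = histogram_counts([0, 5, 10, 99, 100, 150], (0, 100), bins=10)
print(counts)  # → [2, 1, 0, 0, 0, 0, 0, 0, 0, 3]
```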

hist2d

This method creates a 2D histogram of two columns of the PrivateDataFrame.

Function Signature:

PrivateDataFrame.hist2d(eps, x, y, bins = 10):

Parameters:

  • eps: float: The epsilon provided to the differentially private calculation. The eps value must be >=0.
  • x: str: The first column of the PrivateDataFrame to group by.
  • y: str: The second column of the PrivateDataFrame to group by.
  • bins: int, default 10: Number of histogram bins to be used.