
PrivateDataFrame

General functions

Constructor

A PrivateDataFrame is built on top of pandas.DataFrame, and all of its methods are differentially private.

class op_pandas.PrivateDataFrame(
df : pandas.DataFrame ,
metadata = None,
categorical_metadata = None
)
  • df: pandas.DataFrame
    A pandas DataFrame whose data consists only of strings, integers, floats, booleans and datetime objects.

  • metadata: Dict[str, Tuple[float, float]]
    Metadata containing the bounds of the given DataFrame. The metadata should be a dictionary with column names as keys mapped to their (lower, upper) bounds. Include keys only for columns that hold purely numerical data.

    { 'Age': (18,65) , 'Salary': (10000 , 200000) , 'Gender': (0,1) }
  • categorical_metadata: Dict[str, List]
    Metadata containing information about the categorical data of the given DataFrame. The categorical_metadata should be a dictionary with column names as keys mapped to a list containing all the categories in the column. All elements in a list must have the same data type.

    { 'Income' : [">50k", "<=50k"], 'Sex': ["M", "F"] }
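The two dictionaries can be checked before they are passed to the constructor. The following is a minimal sketch in plain Python; `check_metadata` is a hypothetical helper written for illustration and is not part of op_pandas. It only tests the constraints stated above: each bound is a (lower, upper) pair, and all categories of a column share one type.

```python
from typing import Any, Dict, List, Tuple

Bounds = Dict[str, Tuple[float, float]]
Categories = Dict[str, List[Any]]


def check_metadata(metadata: Bounds, categorical_metadata: Categories) -> List[str]:
    """Return a list of problems; an empty list means both dicts look well-formed."""
    problems: List[str] = []
    for column, bounds in metadata.items():
        if len(bounds) != 2:
            problems.append(f"{column}: bounds must be a (lower, upper) pair")
        elif bounds[0] > bounds[1]:
            problems.append(f"{column}: lower bound exceeds upper bound")
    for column, categories in categorical_metadata.items():
        # All categories of one column must share a single Python type.
        if len({type(c) for c in categories}) > 1:
            problems.append(f"{column}: categories have mixed types")
    return problems


metadata = {"Age": (18, 65), "Salary": (10000, 200000), "Gender": (0, 1)}
categorical_metadata = {"Income": [">50k", "<=50k"], "Sex": ["M", "F"]}
print(check_metadata(metadata, categorical_metadata))  # prints []
```

Once the check passes, the dictionaries would be handed to the constructor as metadata and categorical_metadata.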

applymap

A differentially private implementation of the applymap method.

PrivateDataFrame.applymap(func, eps = 0 , output_bounds = None)
  • func: callable
    Python function that returns a single value from a single value. The output should be a string, int, float, boolean or datetime. Keep in mind that this function runs in an isolated environment with mypy strict mode enabled. The function should meet the following constraints:

    • Func can only take one argument: the individual element the function is applied to.

    • Appropriate type annotations should be present in the function. To annotate datetime and regex types, add import datetime and import re respectively. For examples, check out the pandas quickstart.

  • eps: float
    The epsilon provided to the differentially private calculation. eps must be >=0. (used to calculate bounds)

  • output_bounds: Dict[str, Tuple[float, float]]
    Output bounds, if already known. Providing them avoids spending epsilon on estimating the bounds of the applied function's output.

  • categorical_output_bounds: Dict[str, List]
    Categorical output bounds, if already known. If categorical output bounds for a specific column are not given, they will be calculated automatically using the function provided.

  • Returns
    PrivateDataFrame
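As an illustration of the constraints on func, here is a small sketch of a fully annotated single-argument function, together with a way to derive output_bounds by hand when the function is increasing. The names `to_months` and `mapped_bounds` and the single 'Age' column are assumptions made for this example only.

```python
from typing import Dict, Tuple


def to_months(age_in_years: int) -> int:
    """Element-wise function: one annotated argument, one annotated return value."""
    return age_in_years * 12


def mapped_bounds(bounds: Tuple[float, float]) -> Tuple[float, float]:
    """Bounds of to_months over [lower, upper]; valid because the function is increasing."""
    lower, upper = bounds
    return (to_months(int(lower)), to_months(int(upper)))


input_metadata: Dict[str, Tuple[float, float]] = {"Age": (18, 65)}
output_bounds = {col: mapped_bounds(b) for col, b in input_metadata.items()}
print(output_bounds)  # {'Age': (216, 780)}
```

With bounds known in advance, a call on a hypothetical PrivateDataFrame pdf holding only the Age column would presumably look like pdf.applymap(to_months, eps=0, output_bounds=output_bounds), so no epsilon is spent on estimating bounds.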

categorical_metadata

Returns the categorical_metadata of the PrivateDataFrame

PrivateDataFrame.categorical_metadata -> dict

columns

Returns the column names of the PrivateDataFrame.

PrivateDataFrame.columns -> Index(object)
>> test_x.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'educational-num',
'marital-status', 'occupation', 'relationship', 'race', 'gender',
'capital-gain', 'capital-loss', 'hours-per-week', 'native-country'],
dtype='object')

copy

Returns a copy of the PrivateDataFrame.

PrivateDataFrame.copy() -> PrivateDataFrame

describe

A differentially private implementation of the describe method.

PrivateDataFrame.describe(eps, percentiles = None, include = None, exclude = None)
  • eps: float
    The epsilon provided to the differentially private calculation. eps must be >=0.

  • percentiles: list-like of numbers, optional
    The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

  • include: ‘all’, list-like of dtypes or None (default), optional
    A white list of data types to include in the result. Ignored for Series. Here are the options:

    • ‘all’ : All columns of the input will be included in the output.

    • A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'

    • None (default) : The result will include all numeric columns.

  • exclude: list-like of dtypes or None (default), optional
    A black list of data types to omit from the result. Ignored for Series. Here are the options:

    • A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.

    • None (default) : The result will exclude nothing.

  • Returns : Pandas DataFrame

drop

Drop specified labels from columns.

PrivateDataFrame.drop(columns=None, inplace=True, errors='raise')
  • columns: single label or list-like
    Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).

  • inplace: boolean
    Whether to perform the operation in place on the data.

  • errors: {‘ignore’, ‘raise’}, default ‘raise’
    If ‘ignore’, suppress error and only existing labels are dropped.

dropna

Removes missing values within a PrivateDataFrame.

PrivateDataFrame.dropna(axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default)
  • axis: boolean {index (0), columns (1)}, default = 0
    Axis for the function to be applied on.

  • how: str {‘any’, ‘all’}, default ‘any’
    Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

    • ‘any’ : If any NA values are present, drop that row or column.

    • ‘all’ : If all values are NA, drop that row or column.

  • thresh: int, optional
    Require that many non-NA values. Cannot be combined with how.
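The interaction of how and thresh can be sketched in plain Python for the row case (axis=0). This mirrors the semantics described above, with None standing in for NA; `keep_row` is a hypothetical helper, not a library function.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def keep_row(row: Row, how: str = "any", thresh: Optional[int] = None) -> bool:
    """Decide whether a row survives dropna(axis=0); None stands in for NA."""
    non_na = sum(value is not None for value in row)
    if thresh is not None:
        return non_na >= thresh    # keep rows with at least `thresh` non-NA values
    if how == "any":
        return non_na == len(row)  # drop the row if any value is NA
    if how == "all":
        return non_na > 0          # drop the row only if every value is NA
    raise ValueError(f"unknown how: {how!r}")


rows: List[Row] = [(1.0, 2.0), (None, 3.0), (None, None)]
print([keep_row(r, how="any") for r in rows])  # [True, False, False]
print([keep_row(r, how="all") for r in rows])  # [True, True, False]
print([keep_row(r, thresh=1) for r in rows])   # [True, True, False]
```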

dtypes

Returns the data type information of columns present in the PrivateDataFrame.

PrivateDataFrame.dtypes
>> test_x.dtypes

age int64
workclass object
fnlwgt int64
education object
educational-num int64
marital-status object
occupation object
relationship object
race object
gender object
capital-gain int64
capital-loss int64
hours-per-week int64
native-country object

groupby

Group DataFrame using a mapper or by a Series of columns.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

PrivateDataFrame.groupby(by=None, sort: bool=True, dropna: bool=True) -> PrivateDataFrameGroupby
  • by: str | List | pd.Series | op_pandas.PrivateSeries
    Used to determine the groups for the groupby. Can be one of the following:

    • Column: Used to group on the basis of a column. The column must be categorical (present in the categorical_metadata).
    • Boolean Series / PrivateSeries: Series containing only boolean values. If the series is not boolean, it will be converted to boolean before groupby.
    • List containing any combination of the above two
  • sort: bool = True
    Sort group keys. If False, the groups will appear in the same order as they did in the original DataFrame.

  • dropna: bool = True
    If True, and if group keys contain NA values, those NA values are dropped together with their rows. If False, NA values are also treated as a key in groups.

Following operations are allowed on PrivateDataFrameGroupby:

The output of an operation is a pandas DataFrame containing the result of the function, indexed by group.

%%ag
import op_pandas as opd

pdf = opd.PrivateDataFrame(df, metadata = {"age": (0,100)}, categorical_metadata = {"groups": ['a', 'b', 'c']})
grouped = pdf.groupby("groups")
ag_print(grouped.sum(eps=1))
>>>              age
a 837986.678085
b 817237.487139
c 827334.458893

isin

Whether each element in the DataFrame is contained in values.

PrivateDataFrame.isin(values)
  • values: PrivateDataFrame or PrivateSeries
    The result will only be true at a location if all the labels match.

join

A differentially private implementation of the join method.

PrivateDataFrame.join(other, on = None, how = "left", lsuffix = "", rsuffix = "", sort = False, validate = None)
  • other: PrivateDataFrame, PrivateSeries
    Index should be similar to one of the columns in this one. If a PrivateSeries is passed, its name attribute will be used as the column name in the resulting joined DataFrame.

  • on: str, list of str, or array-like, optional
    Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.

  • how: {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’
    How to handle the operation of the two objects.

    • left: use calling frame’s index (or column if on is specified)

    • right: use other’s index.

    • outer: form union of calling frame’s index (or column if on is specified) with other’s index, and sort it lexicographically.

    • inner: form intersection of calling frame’s index (or column if on is specified) with other’s index, preserving the order of the calling frame’s index.

  • lsuffix: str, default ‘’
    Suffix to use from left frame’s overlapping columns.

  • rsuffix: str, default ‘’
    Suffix to use from right frame’s overlapping columns.

  • sort: bool, default False
    Order result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword).

  • validate: str, optional
    If specified, checks if join is of specified type.

    • “one_to_one” or “1:1”: check if join keys are unique in both left and right datasets.
    • “one_to_many” or “1:m”: check if join keys are unique in left dataset.
    • “many_to_one” or “m:1”: check if join keys are unique in right dataset.
    • “many_to_many” or “m:m”: allowed, but does not result in checks.
  • Returns
    PrivateDataFrame
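The validate options reduce to uniqueness checks on the join keys. The sketch below restates them in plain Python on two lists of keys; `check_join` is a hypothetical helper for illustration and says nothing about how the library performs the check internally.

```python
from typing import Hashable, Sequence


def keys_unique(keys: Sequence[Hashable]) -> bool:
    """True when no key occurs more than once."""
    return len(set(keys)) == len(keys)


def check_join(left: Sequence[Hashable], right: Sequence[Hashable], validate: str) -> bool:
    """Return True when the key columns satisfy the requested relationship."""
    if validate in ("one_to_one", "1:1"):
        return keys_unique(left) and keys_unique(right)
    if validate in ("one_to_many", "1:m"):
        return keys_unique(left)
    if validate in ("many_to_one", "m:1"):
        return keys_unique(right)
    if validate in ("many_to_many", "m:m"):
        return True  # allowed, nothing to check
    raise ValueError(f"unknown validate option: {validate!r}")


left_keys = [1, 2, 3]   # unique
right_keys = [1, 1, 2]  # key 1 is repeated
for mode in ("1:1", "1:m", "m:1", "m:m"):
    print(mode, check_join(left_keys, right_keys, mode))
```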

make_column_non_categorical

Makes a categorical column non-categorical.

PrivateDataFrame.make_column_non_categorical(columns: str | List[str], output_bounds: dict = None, eps: float = 0.0)
  • columns: str | List[str]
    Column or a list of columns

  • output_bounds: dict
    If a column contains numerical values (but is categorical), provide output bounds for it. If output bounds for a numerical column are not provided, epsilon will be spent to estimate the bounds.

  • eps: float
    Epsilon to estimate output bounds of a numerical column.

metadata

Returns the metadata / bounds of numerical columns present in the PrivateDataFrame.

PrivateDataFrame.metadata -> dict
>> test_x.metadata

{
'age': (17, 90),
'fnlwgt': (18827, 1490400),
'educational-num': (1, 16),
'capital-gain': (0, 99999),
'capital-loss': (0, 4356),
'hours-per-week': (1, 99)
}

rename

Rename a specific set of columns in the PrivateDataFrame. The method takes a dictionary containing key-value pairs that map each old column name to its new name, one to one.

PrivateDataFrame.rename(dict) -> PrivateDataFrame
>> new_test_x = test_x.rename({'age':'AGE' , 'hours-per-week':'HPW'})
>> new_test_x.metadata

{'fnlwgt': (18827, 1490400),
'educational-num': (1, 16),
'capital-gain': (0, 99999),
'capital-loss': (0, 4356),
'AGE': (17, 90),
'HPW': (1, 99)}
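The example shows that bounds stay attached to their columns under the new names. The same remapping can be sketched on a plain dictionary; `rename_metadata` is a hypothetical helper for illustration, and the key order of its result may differ from the library's output above.

```python
from typing import Dict, Tuple

Bounds = Dict[str, Tuple[float, float]]


def rename_metadata(metadata: Bounds, mapping: Dict[str, str]) -> Bounds:
    """Carry each column's bounds over to its new name; unmapped columns keep theirs."""
    return {mapping.get(column, column): bounds for column, bounds in metadata.items()}


old_metadata: Bounds = {
    "age": (17, 90),
    "fnlwgt": (18827, 1490400),
    "educational-num": (1, 16),
    "capital-gain": (0, 99999),
    "capital-loss": (0, 4356),
    "hours-per-week": (1, 99),
}
new_metadata = rename_metadata(old_metadata, {"age": "AGE", "hours-per-week": "HPW"})
print(new_metadata["AGE"], new_metadata["HPW"])  # (17, 90) (1, 99)
```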

where

A differentially private implementation of the where method.

PrivateDataFrame.where(cond, other = None, inplace = False, axis = None, level = None)
  • cond: boolean PrivateSeries/PrivateDataFrame, Series/DataFrame or array-like
    Where cond is True, keep the original value. Where False, replace with corresponding value from other.

  • other: None
    Setting other is not currently supported.

  • inplace: bool, default False
    Whether to perform the operation in place on the data.

  • axis: int, default None
    Alignment axis if needed. For Series this parameter is unused and defaults to 0.

  • level: int, default None
    Alignment level if needed.

  • Returns
    PrivateDataFrame

Basic statistical methods

count

A differentially private implementation of the count method.

PrivateDataFrame.count(eps = 0, axis=0, numeric_only=False)
  • eps : float, default = 0
    The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.

  • axis: boolean {index (0), columns (1)}, default = 0
    Axis for the function to be applied on.

  • numeric_only: bool, default None
    Include only float, int, boolean columns. If axis = 0, numeric_only is always assumed to be True, else, you must specify a value.

mean

A differentially private implementation of the mean method.

PrivateDataFrame.mean(eps = 0, axis=0, skipna=True, numeric_only=None, **kwargs)
  • eps : float, default = 0
    The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.

  • axis: boolean {index (0), columns (1)}, default = 0
    Axis for the function to be applied on.

  • skipna: bool, default True
    Exclude NA/null values when computing the result.

  • numeric_only: bool, default None
    Include only float, int, boolean columns. If axis = 0, numeric_only is always assumed to be True, else, you must specify a value.

  • **kwargs
    Additional keyword arguments to be passed to the function.

median

A differentially private implementation of the median method.

PrivateDataFrame.median(eps)
  • eps: float
    The epsilon provided to the differentially private calculation. eps must be >=0.

percentile

A differentially private implementation of the percentile method.

PrivateDataFrame.percentile(eps, p)
  • eps: float
    The epsilon provided to the differentially private calculation. eps must be >=0.

  • p: float or array-like
    Value between 0 <= p <= 100, the percentile(s) to compute.

quantile

A differentially private implementation of the quantile method.

PrivateDataFrame.quantile(eps, q = 0.5)
  • eps: float
    The epsilon provided to the differentially private calculation. eps must be >=0.

  • q: float or array-like, default 0.5 (50% quantile)
    Value between 0 <= q <= 1, the quantile(s) to compute.
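Percentiles and quantiles differ only in scale: p runs from 0 to 100 and q from 0 to 1. The sketch below converts one to the other. It assumes, without confirmation from this page, that percentile(eps, p) and quantile(eps, q = p / 100) target the same statistic; `percentiles_to_quantiles` is a hypothetical helper.

```python
from typing import List, Sequence


def percentiles_to_quantiles(p: Sequence[float]) -> List[float]:
    """Convert percentiles in [0, 100] to quantiles in [0, 1]."""
    for value in p:
        if not 0 <= value <= 100:
            raise ValueError(f"percentile out of range: {value}")
    return [value / 100 for value in p]


print(percentiles_to_quantiles([25, 50, 75]))  # [0.25, 0.5, 0.75]
```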

standard deviation

A differentially private implementation of the std method.

PrivateDataFrame.std(eps = 0, axis=0, skipna=True, ddof=1, numeric_only=None, **kwargs)
  • eps : float, default = 0
    The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.

  • axis: boolean {index (0), columns (1)}, default = 0
    Axis for the function to be applied on.

  • skipna: bool, default True
    Exclude NA/null values when computing the result.

  • ddof: int, default 1
    Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. If axis = 0, ddof must be equal to 1.

  • numeric_only: bool, default None
    Include only float, int, boolean columns. If axis = 0, numeric_only is always assumed to be True, else, you must specify a value.

  • **kwargs
    Additional keyword arguments to be passed to the function.

sum

A differentially private implementation of the sum method.

PrivateDataFrame.sum(eps = 0, axis=0, skipna=True, numeric_only=None, min_count=0, **kwargs)
  • eps : float, default = 0
    The epsilon provided to the differentially private calculation.

  • axis: boolean {index (0), columns (1)}, default = 0
    Axis for the function to be applied on.

  • skipna: bool, default True
    Exclude NA/null values when computing the result.

  • numeric_only: bool, default None
    Include only float, int, boolean columns. If axis = 0, numeric_only is always assumed to be True, else, you must specify a value.

  • min_count: int, default 0
    The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA. If axis = 0, min_count is always assumed to be 0, else, you must specify a value.

  • **kwargs
    Additional keyword arguments to be passed to the function.
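To give a feel for how eps and the column bounds interact, here is a minimal sketch of a differentially private sum using the Laplace mechanism. This page does not say which mechanism the library uses, so the approach, the helper names, and the add-or-remove-one-row sensitivity are all assumptions. The sketch requires eps > 0.

```python
import random
from typing import Sequence, Tuple


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def clamped_sum(values: Sequence[float], bounds: Tuple[float, float]) -> float:
    """Sum after clipping every value into [lower, upper]."""
    lower, upper = bounds
    return sum(min(max(v, lower), upper) for v in values)


def private_sum(values: Sequence[float], bounds: Tuple[float, float],
                eps: float, rng: random.Random) -> float:
    """Clamped sum plus Laplace noise with scale sensitivity / eps."""
    if eps <= 0:
        raise ValueError("this sketch needs eps > 0")
    lower, upper = bounds
    # Assumption: neighbouring datasets differ by adding or removing one row,
    # so one row can shift the clamped sum by at most max(|lower|, |upper|).
    sensitivity = max(abs(lower), abs(upper))
    return clamped_sum(values, bounds) + laplace_noise(sensitivity / eps, rng)


rng = random.Random(0)
ages = [10, 50, 200]  # 200 lies outside the bounds and is clipped to 100
print(clamped_sum(ages, (0, 100)))  # 160
print(private_sum(ages, (0, 100), eps=1.0, rng=rng))  # 160 plus random noise
```

A smaller eps or wider bounds increases the noise scale, which is why tight metadata bounds matter for accuracy.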

variance

A differentially private implementation of the var method.

PrivateDataFrame.var(eps = 0, axis=0, skipna=True, ddof=1, numeric_only=None, **kwargs)
  • eps : float, default = 0
    The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.

  • axis: boolean {index (0), columns (1)}, default = 0
    Axis for the function to be applied on.

  • skipna: bool, default True
    Exclude NA/null values when computing the result.

  • ddof: int, default 1
    Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. If axis = 0, ddof must be equal to 1.

  • numeric_only: bool, default None
    Include only float, int, boolean columns. If axis = 0, numeric_only is always assumed to be True, else, you must specify a value.

  • **kwargs
    Additional keyword arguments to be passed to the function.
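The role of ddof is easiest to see without any noise. The following sketch computes a plain, non-private variance with the N - ddof divisor described above; it illustrates the formula only and is not the library's differentially private computation.

```python
import math
import statistics
from typing import Sequence


def variance(values: Sequence[float], ddof: int = 1) -> float:
    """Sum of squared deviations from the mean, divided by N - ddof."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(variance(data, ddof=0))             # 4.0 (population variance)
print(variance(data, ddof=1))             # sample variance, divisor N - 1
print(math.sqrt(variance(data, ddof=0)))  # 2.0 (population standard deviation)
```

With ddof=1 the result agrees with statistics.variance from the Python standard library, which also uses the N - 1 divisor.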

Advanced statistical methods

correlation

A differentially private implementation of the correlation method.

PrivateDataFrame.corr(eps: float, method: str = "pearson", min_periods: int = 1, numeric_only = True)
  • eps: float
    The epsilon provided to the differentially private calculation. eps must be >=0.

  • method: str, {‘pearson’ or ‘spearman’}, default 'pearson'
    Method of correlation:

    • pearson : standard correlation coefficient

    • spearman : Spearman rank correlation

  • min_periods: int, optional
    Assumed to be 1. Currently, min_periods tweaking is not supported.

  • numeric_only: bool, default True
    Include only float, int or boolean data. Currently, numeric_only tweaking is not allowed.

  • Returns
    Pandas DataFrame
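The difference between the two methods can be shown with a plain, non-private computation: Spearman correlation is Pearson correlation applied to ranks, so it reaches 1 for any increasing relationship, linear or not. The helper names below are illustrative and unrelated to the library's internals.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Standard correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]  # increasing but not linear
print(pearson(x, y))   # below 1
print(spearman(x, y))  # 1 up to floating-point rounding
```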

covariance

A differentially private implementation of the covariance method.

PrivateDataFrame.cov(eps: float, min_periods, ddof = 1, numeric_only = True)
  • eps: float
    The epsilon provided to the differentially private calculation. eps must be >=0.

  • min_periods: int, optional
    Assumed to be 1. Currently, min_periods tweaking is not supported.

  • ddof: int, default 1
    Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. Currently, ddof tweaking is not supported.

  • numeric_only: bool, default True
    Include only float, int or boolean data. Currently, numeric_only tweaking is not allowed.

  • Returns
    Pandas DataFrame

skew

A differentially private implementation of the skew method.

PrivateDataFrame.skew(eps, axis = 0, skipna = True, numeric_only = True)
  • eps : float, default = 0
    The epsilon provided to the differentially private calculation. If axis = 0, eps must be >=0.

  • axis: boolean {index (0), columns (1)}, default = 0
    Axis for the function to be applied on.

  • skipna: bool, default True
    Exclude NA/null values when computing the result.

  • numeric_only: bool, default None
    Include only float, int, boolean columns. If axis = 0, numeric_only is always assumed to be True, else, you must specify a value.

  • Returns
    Pandas DataFrame

Histograms

Hist

Draws a histogram of a column of the PrivateDataFrame.

PrivateDataFrame.hist(column, eps, bins = 10)
  • column: str
    Column in the PrivateDataFrame to group by.

  • eps: float
    The epsilon provided to the differentially private calculation. eps must be >=0.

  • bins: int, default 10
    Number of histogram bins to be used.
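One plausible way to bin privately is to take the bin edges from the column's metadata bounds rather than from the data itself. The sketch below shows equal-width edges and bin assignment under that assumption. Noise addition is deliberately left out, and nothing here is confirmed to match the library's own binning.

```python
from typing import List, Sequence, Tuple


def bin_edges(bounds: Tuple[float, float], bins: int = 10) -> List[float]:
    """Equal-width bin edges spanning [lower, upper]."""
    lower, upper = bounds
    width = (upper - lower) / bins
    return [lower + i * width for i in range(bins + 1)]


def bin_counts(values: Sequence[float], bounds: Tuple[float, float],
               bins: int = 10) -> List[int]:
    """Exact (non-private) counts per bin after clipping values into the bounds."""
    lower, upper = bounds
    counts = [0] * bins
    for v in values:
        clamped = min(max(v, lower), upper)
        # The upper bound itself falls into the last bin.
        index = min(int((clamped - lower) / (upper - lower) * bins), bins - 1)
        counts[index] += 1
    return counts


print(bin_edges((0, 100), bins=4))                              # [0.0, 25.0, 50.0, 75.0, 100.0]
print(bin_counts([5, 30, 30, 99, 100, 250], (0, 100), bins=4))  # [1, 2, 0, 3]
```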

Hist2d

Creates a 2D histogram over two columns of the PrivateDataFrame.

PrivateDataFrame.hist2d(eps, x, y, bins = 10)
  • eps: float
    The epsilon provided to the differentially private calculation. eps must be >=0.

  • x: str
    Column 1 in the PrivateDataFrame to group by.

  • y: str
    Column 2 in the PrivateDataFrame to group by.

  • bins: int, default 10
    Number of histogram bins to be used.