API Reference

General Methods

This page covers some of the most commonly used pandas methods available in op_pandas, along with their parameters.

concat

The concat() function concatenates pandas objects, such as PrivateSeries and PrivateDataFrames, along a specified axis. It can also create a hierarchical index on the concatenation axis if needed, and it handles the set logic of the indexes on the non-concatenation axes through an optional union or intersection.

def concat(
objs,
*,
axis=0,
join="outer",
ignore_index=False,
keys=None,
levels=None,
names=None,
verify_integrity=False,
sort=False,
copy=None,
)->PrivateData:

Parameters:

  • objs : array of PrivateSeries | PrivateDataFrame: An array that includes PrivateDataFrames or PrivateSeries for concatenation. If any element within the array is None, it will be silently dropped unless all elements are None, in which case a ValueError will be raised.

  • axis : {0}, default 0: Specifies the axis along which to concatenate the objects. Currently, only concatenation along axis=0 is allowed.

  • join : {'inner', 'outer'}, default 'outer': Dictates how to handle the indexes on the axes other than the concatenation axis.

    • 'outer': Uses the union of indexes.
    • 'inner': Uses the intersection of indexes.
  • ignore_index : bool, default False: If set to True, the index values along the concatenation axis will be ignored. The resulting axis will be labeled from 0 to n - 1. This is particularly useful when the original index does not carry meaningful information for the concatenated result.

  • keys : sequence, default None: Used to create a hierarchical index on the concatenation axis, with the elements of the sequence forming the outermost level.

  • levels : list of sequences, default None: Specifies the levels to use for constructing a MultiIndex, if not inferred from the keys.

  • names : list, default None: Provides names for the levels in the resulting hierarchical index.

  • verify_integrity : bool, default False: Verification of integrity during concatenation is not supported in this function.

  • sort : bool, default False: Determines whether to sort the non-concatenation axis if it is not already aligned.

  • copy : default None: The copy parameter is not supported in this version of the function.

Usage:

combined_df = op_pandas.concat([df1, df2], ignore_index=True, join='inner')

Note:
  • The data types within a single column must be the same across all objects; otherwise the concatenation is not performed.
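To make the join options concrete, here is a minimal plain-Python sketch of how concatenation along axis=0 treats the non-concatenation axis (the columns). It is not op_pandas code: the helper concat_rows and the sample rows are hypothetical, and it assumes op_pandas.concat follows the same union/intersection rule described above.

```python
def concat_rows(frames, join="outer"):
    """Stack lists of row dicts, mimicking axis=0 concatenation.

    Hypothetical helper for illustration only; not part of op_pandas.
    """
    # None entries are dropped; if nothing is left, raise ValueError.
    frames = [f for f in frames if f is not None]
    if not frames:
        raise ValueError("All objects passed were None")

    # Column labels present in each input.
    column_sets = [set().union(*(row.keys() for row in f)) for f in frames]

    if join == "outer":
        columns = set.union(*column_sets)         # union of columns
    elif join == "inner":
        columns = set.intersection(*column_sets)  # intersection of columns
    else:
        raise ValueError("join must be 'inner' or 'outer'")

    # Columns are sorted only to make the output deterministic.
    # Values missing under an outer join are None here (pandas uses NaN).
    return [
        {col: row.get(col) for col in sorted(columns)}
        for f in frames
        for row in f
    ]


df1 = [{"id": 1, "age": 30}, {"id": 2, "age": 41}]
df2 = [{"id": 3, "city": "Oslo"}]

outer = concat_rows([df1, None, df2], join="outer")
inner = concat_rows([df1, None, df2], join="inner")

print(outer)  # every column appears; gaps are filled with None
print(inner)  # only the shared "id" column survives
```

With join='outer' every column from either input is kept, while join='inner' keeps only the columns common to all inputs.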

merge

The merge() function facilitates the merging of PrivateDataFrame or named PrivateSeries objects, mimicking database-style joins. This function allows for various types of joins, handling indexes and columns differently based on the type of merge specified.

def merge(
left,
right,
how="inner",
on=None,
*args, **kwargs
)-> PrivateData:

Parameters:

  • left : PrivateDataFrame or named PrivateSeries: The left object in the merge. A named PrivateSeries is treated as a PrivateDataFrame with a single column.

  • right : PrivateDataFrame or named PrivateSeries: The right object in the merge. Similarly, a named PrivateSeries is treated as a PrivateDataFrame with a single column.

  • how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner': Specifies the type of merge to perform:

    • 'left': Perform a left outer join, using only keys from the left frame. The order of keys is preserved.
    • 'right': Perform a right outer join, using only keys from the right frame. The order of keys is preserved.
    • 'outer': Perform a full outer join, using the union of keys from both frames. Keys are sorted lexicographically.
    • 'inner': Perform an inner join, using the intersection of keys from both frames. The order of the left keys is preserved.
    • 'cross': Create a Cartesian product of both frames, preserving the order of the left keys. Note: No columns to merge on can be specified in a cross join.

Usage:

When columns are specified for a join, index information of the PrivateDataFrames is ignored. However, when joining on indexes, whether with each other or with columns, index information is preserved, which is crucial for alignments where index continuity is necessary.

```python
result = op_pandas.merge(left_df, right_df, how='inner', on='key_column')
```
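The effect of the how parameter on which keys survive a merge can be sketched in plain Python. This is an illustration only, not the op_pandas implementation: merge_rows and the sample rows are hypothetical, the sketch does not model row ordering or overlapping column names, and 'cross' is left out.

```python
def merge_rows(left, right, how="inner", on="key"):
    """Join two lists of row dicts on one column (hypothetical helper)."""
    if how not in {"inner", "left", "right", "outer"}:
        raise ValueError(f"unsupported how: {how!r}")

    # Index the right rows by join key for quick lookup.
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[on], []).append(row)

    left_cols = {c for row in left for c in row} - {on}
    right_cols = {c for row in right for c in row} - {on}

    result, matched = [], set()
    for lrow in left:
        matches = right_by_key.get(lrow[on], [])
        if matches:
            matched.add(lrow[on])
            result.extend({**lrow, **rrow} for rrow in matches)
        elif how in ("left", "outer"):
            # Unmatched left keys are kept; right columns stay empty.
            result.append({**lrow, **dict.fromkeys(right_cols)})

    if how in ("right", "outer"):
        # Unmatched right keys are kept; left columns stay empty.
        for rrow in right:
            if rrow[on] not in matched:
                result.append({**dict.fromkeys(left_cols), **rrow})
    return result


left_rows = [{"key": "a", "x": 1}, {"key": "b", "x": 2}]
right_rows = [{"key": "b", "y": 20}, {"key": "c", "y": 30}]

keys_by_how = {
    how: [row["key"] for row in merge_rows(left_rows, right_rows, how=how)]
    for how in ("inner", "left", "right", "outer")
}
print(keys_by_how)
```

Only the shared key "b" survives an inner join, while the left, right, and outer joins additionally keep the unmatched keys from the left frame, the right frame, or both.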

to_datetime

The to_datetime() function converts an input scalar, array-like, PrivateSeries, or PrivateDataFrame into a pandas datetime object. It handles a wide range of datetime formats and provides options for customization and error handling.

def to_datetime(
arg,
errors="ignore",
dayfirst=False,
yearfirst=False,
utc=False,
format=None,
exact=_NoDefault.no_default,
unit=None,
infer_datetime_format=_NoDefault.no_default,
origin="unix",
cache=True,
)-> PrivateData:

Parameters:

  • arg : PrivateSeries: The data to convert to datetime format. For a PrivateDataFrame, the input should contain the columns "year", "month", and "day", with years in a four-digit format.

  • errors : str, default 'ignore'

    • 'ignore': If parsing fails, return the original input.
    • 'raise': Raise an error if parsing fails.
    • 'coerce': Set unparsable entries to NaT (Not a Time).
  • dayfirst : bool, default False: Influences parsing order if arg is string-like. If True, interprets the first number in a date string as the day (e.g., 10/11/12 becomes 2012-11-10).

  • yearfirst : bool, default False: Influences parsing order if arg is string-like. If True, interprets the first number in a date string as the year (e.g., 10/11/12 becomes 2010-11-12).

    • Note: If both dayfirst and yearfirst are True, yearfirst takes precedence, similar to the behavior in dateutil.
  • utc : bool, default False

    • If True, returns a UTC-localized Timestamp, Series, or DatetimeIndex.
    • If False, returns data without timezone conversion, maintaining original time offsets where present.
  • format : str, default None: The format string to use for parsing dates, like %d/%m/%Y. Special options include:

    • 'ISO8601': Parse any ISO8601 formatted string.
    • 'mixed': Infer the format for each element individually. Antigranular recommends using this option with caution.
  • exact : bool, default True

    • If True, the format string must be precisely matched.
    • If False, allows the format to match anywhere in the target string.
    • Note: Incompatible with format='ISO8601' or format='mixed'.
  • unit : str, default 'ns': Defines the unit for numeric input based on the origin. Common units include 'D' (days), 's' (seconds), 'ms' (milliseconds), etc.

  • infer_datetime_format : bool, default False: When True and no format is specified, attempts to infer the datetime format, potentially speeding up parsing significantly.

  • origin : scalar, default 'unix'

    • Defines the reference date for numeric inputs. Possible values:
      • 'unix': Start from 1970-01-01.
      • 'julian': Start from Julian Calendar day zero.
      • Timestamp convertible values or numeric offsets relative to 1970-01-01.
  • cache : bool, default True: Utilizes a cache for converted dates to enhance parsing speed for repeated date strings, especially those with timezone offsets. Not effective for out-of-bounds values.

Example Usage:

datetime_data = op_pandas.to_datetime(series_data, errors='coerce', dayfirst=True, format='%d/%m/%Y')
Note:
  • If both dayfirst and yearfirst are True, yearfirst takes precedence (same as dateutil).
  • exact cannot be used alongside format='ISO8601' or format='mixed'.
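The parsing options above can be illustrated with the standard library alone. The sketch below is not op_pandas code: parse_date is a hypothetical helper that uses explicit format strings to stand in for dayfirst and yearfirst, returns None in place of NaT, and assumes the errors modes behave as described in the parameter list.

```python
from datetime import datetime, timedelta


def parse_date(text, fmt, errors="raise"):
    """Parse one string, mimicking the three errors modes (illustration only)."""
    if errors not in {"raise", "coerce", "ignore"}:
        raise ValueError(f"unknown errors mode: {errors!r}")
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        if errors == "raise":
            raise
        # 'coerce' yields None (standing in for NaT); 'ignore' returns the input.
        return None if errors == "coerce" else text


ambiguous = "10/11/12"
day_first = parse_date(ambiguous, "%d/%m/%y")   # read as day/month/year
year_first = parse_date(ambiguous, "%y/%m/%d")  # read as year/month/day

coerced = parse_date("not a date", "%d/%m/%Y", errors="coerce")
ignored = parse_date("not a date", "%d/%m/%Y", errors="ignore")

# Numeric input with unit='s' and origin='unix': seconds since 1970-01-01.
from_unix = datetime(1970, 1, 1) + timedelta(seconds=86_400)

print(day_first.date())   # 2012-11-10
print(year_first.date())  # 2010-11-12
print(coerced, ignored)   # None not a date
print(from_unix.date())   # 1970-01-02
```

The same string yields two different dates depending on the assumed field order, which is why an explicit format is usually the safer choice for ambiguous inputs.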

train_test_split

The train_test_split() function splits a PrivateDataFrame or PrivateSeries into a training set and a testing set, so that a model can be evaluated on data it was not trained on.

def train_test_split(
df,
test_size=0.25,
random_state=None,
stratify=None
)-> Tuple[PrivateData, PrivateData]:

Parameters:

  • df : list | PrivateDataFrame | PrivateSeries: Accepts either a single PrivateDataFrame, a PrivateSeries, or a list of these. The list does not need to contain elements of the same size; however, if they are of the same size, they will be split in the same way in terms of indices.

  • test_size : float, default 0.25: This specifies the proportion of the dataset to include in the test split. It must be between 0 and 1.

  • random_state : int | None, default None: Provides a seed value to ensure reproducibility of the split.

  • stratify : None: Currently, stratification is not supported, meaning the data will be split without considering the distribution of outcomes across the training and testing sets.

Example Usage:

train_data, test_data = op_pandas.train_test_split(df, test_size=0.3, random_state=42)
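As a rough illustration of how test_size and random_state interact, the following plain-Python sketch shuffles row indices with a seeded generator and cuts them into two parts. It is not the op_pandas implementation: split_indices is a hypothetical helper, and rounding the test share up is an assumption borrowed from scikit-learn.

```python
import math
import random


def split_indices(n_rows, test_size=0.25, random_state=None):
    """Return (train_indices, test_indices) for n_rows rows (illustration only)."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    rng = random.Random(random_state)  # seeded generator for reproducibility
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # assumed rounding rule
    return indices[n_test:], indices[:n_test]


features = [[i, i * 2] for i in range(20)]
labels = [i % 2 for i in range(20)]

train_idx, test_idx = split_indices(len(features), test_size=0.25, random_state=42)

# Objects of the same size are split with the same indices, so rows stay aligned.
X_train = [features[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
X_test = [features[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]

print(len(X_train), len(X_test))  # 15 5
```

Calling split_indices again with the same random_state reproduces the same partition, which is the purpose of the seed.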

standard_scaler

The standard_scaler() function standardizes features by removing the mean and scaling to unit variance, applying differential privacy so that the privacy of the underlying data is maintained.

def standard_scaler(
data,
eps
)-> PrivateData:

Parameters:

  • data : PrivateDataFrame | PrivateSeries: This is the input data, which should be either a PrivateDataFrame or a PrivateSeries. It contains the features that need to be standardized.

  • eps : float: Represents the epsilon budget for differential privacy. A smaller epsilon value means stronger privacy guarantees but potentially less accuracy in the scaled data.

Returns:

  • PrivateData: The signature annotates the return type as PrivateData; the result is presumably a PrivateDataFrame or PrivateSeries holding the standardized features.

Usage:

scaled_data = op_pandas.standard_scaler(data, eps=0.1)
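For reference, the non-private transformation that standardization refers to is z = (x - mean) / std. The sketch below computes it with the standard library on made-up values. It deliberately omits the differential privacy step: how op_pandas spends the eps budget is not described here, so no noise mechanism is modelled, and the use of the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev

# Hypothetical feature values, for illustration only.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = fmean(values)  # 5.0
std = pstdev(values)  # population standard deviation, 2.0

# Remove the mean and scale to unit variance.
scaled = [(v - mean) / std for v in values]

print(scaled)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

A differentially private scaler would work from noisy estimates of the mean and standard deviation, so its output would generally differ slightly from these exact values, more so for smaller eps.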

label_encoder

The label_encoder() function performs label encoding on one or more categorical columns of a PrivateDataFrame or on a PrivateSeries. It returns a tuple containing the transformed data and a dictionary mapping the original categories to their encoded labels.

def label_encoder(
df,
cols = None
) -> Tuple[ PrivateData , dict]:

Parameters:

  • df : PrivateDataFrame | PrivateSeries: This is the input data which should be of type PrivateDataFrame or PrivateSeries, containing categorical data that needs to be encoded.

  • cols : List | str | None: Specifies the columns to be label encoded. You can provide a single column name as a string, a list of column names, or None. If None is provided and the input is a PrivateDataFrame, no columns are considered for encoding. This parameter is ignored if the input is a PrivateSeries.

Returns:

  • Tuple[PrivateData, dict]: A tuple where the first element is the label-encoded data (as a PrivateDataFrame or PrivateSeries) and the second element is a dictionary that maps the original categorical values to their respective integer labels.

Usage:

encoded_data, mapping = op_pandas.label_encoder(df, cols=['category_column'])
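The relationship between the two returned values can be sketched in plain Python. This is an illustration, not op_pandas code: label_encode is a hypothetical helper, and assigning integer codes in sorted category order is an assumption (it is the convention used by scikit-learn's LabelEncoder).

```python
def label_encode(values):
    """Return (encoded_values, mapping) for a list of categories (illustration only)."""
    # Assign integer labels in sorted category order (assumed convention).
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


colors = ["red", "green", "blue", "green", "red"]
encoded, mapping = label_encode(colors)

# Inverting the mapping recovers the original categories from the codes.
inverse = {code: category for category, code in mapping.items()}
decoded = [inverse[code] for code in encoded]

print(encoded)  # [2, 1, 0, 1, 2]
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}
```

Keeping the mapping around makes it possible to translate model outputs back into the original category names.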