Importing Data

The op_pandas library allows users to import datasets efficiently. This page will showcase some of the available ways to do so.

Requirements

Before following the steps described on this page, make sure you have completed First Steps to Use AG.

Object Creation

Users can import and load datasets from different sources, such as:

  • The AG-Server
  • A pandas.Series
  • A pandas.DataFrame
  • The local Jupyter session

The following sections describe each of them in more detail.

Importing from the AG-Server

The load_dataset() function lets users obtain a dataset and the required data structures from the Antigranular server.

Info: Private data structures cannot be exported to the local environment unless a differentially private measure is applied to obtain a non-private data frame.

You can use the load_dataset() function to load any dataset, as shown in the following code block:

%%ag
from op_pandas import PrivateDataFrame, PrivateSeries

# Obtaining the dictionary containing private objects
response = load_dataset("<dataset_name>", "<team_name>")

# The response will be a PrivateDataFrame (PDF) and will use the budget allocated to the user from the "<team_name>" team.

Importing from a pandas.Series

When creating a PrivateSeries, it's recommended to set metadata bounds to define the range of valid values for the series. If users don't provide explicit bounds, op_pandas will automatically assign the metadata based on the minimum and maximum values in the series.

See an example in the following code block:

%%ag
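import pandas as pd
from op_pandas import PrivateSeries

# Build a regular pandas Series, then wrap it with explicit bounds (0, 10)
s = pd.Series([1, 5, 8, 2, 9], name='Test_series')
priv_s = PrivateSeries(series=s, metadata=(0, 10))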

Where:

  • A pandas Series named 'Test_series' is created with the values [1, 5, 8, 2, 9].
  • Using the PrivateSeries constructor, we create a private series (priv_s) from the regular pandas Series s.
  • Metadata bounds (0, 10) are set, ensuring that all values in the series fall within the range from 0 to 10.

By setting metadata bounds, you control the valid range of values within the series, enhancing privacy and security while working with sensitive data.
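To make the two cases above concrete (explicit bounds versus bounds taken from the data), here is a minimal plain-Python sketch. It does not use op_pandas and makes no claim about how op_pandas works internally; the helper names resolve_bounds and within_bounds are hypothetical and exist only for illustration.

```python
from typing import Optional, Sequence, Tuple

Bounds = Tuple[float, float]


def resolve_bounds(values: Sequence[float], bounds: Optional[Bounds] = None) -> Bounds:
    """Return the explicit bounds if given, else fall back to (min, max) of the data.

    Hypothetical helper: it only mirrors the behaviour described in the text.
    """
    if bounds is not None:
        lower, upper = bounds
        if lower > upper:
            raise ValueError("lower bound must not exceed upper bound")
        return (lower, upper)
    return (min(values), max(values))


def within_bounds(values: Sequence[float], bounds: Bounds) -> bool:
    """Check that every value lies inside the closed interval [lower, upper]."""
    lower, upper = bounds
    return all(lower <= v <= upper for v in values)


values = [1, 5, 8, 2, 9]

explicit = resolve_bounds(values, (0, 10))  # bounds chosen by the user
inferred = resolve_bounds(values)           # bounds derived from the data

print(explicit, within_bounds(values, explicit))
print(inferred, within_bounds(values, inferred))
```

In this sketch the inferred bounds equal the true minimum and maximum of the data, so they depend on the data itself. That is one reason to prefer explicit bounds chosen from domain knowledge.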

Importing from a pandas.DataFrame

Just as with PrivateSeries, setting metadata bounds when creating a PrivateDataFrame is recommended. If users don't provide explicit bounds, op_pandas will automatically assign the metadata based on the minimum and maximum values in each column.

See an example in the following code block:

%%ag
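import pandas as pd
from op_pandas import PrivateDataFrame

# Raw data: two numeric columns and one categorical column
data = {
    'Age': [20, 30, 40, 25, 30, 25, 26, 27, 28, 29],
    'Salary': [35000, 60000, 100000, 55000, 35000, 60000, 100000, 55000, 35000, 60000],
    'Sex': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
}

# (lower, upper) bounds for each numeric column
metadata = {
    'Age': (18, 65),
    'Salary': (20000, 200000),
}

# Allowed categories for each categorical column
categorical_metadata = {
    'Sex': ['M', 'F'],
}

df = pd.DataFrame(data)
priv_df = PrivateDataFrame(df=df, metadata=metadata, categorical_metadata=categorical_metadata)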

In the example:

  • Data for the DataFrame is defined, including columns for 'Age', 'Salary', and 'Sex'.
  • Metadata bounds are specified for each column:
    • For the 'Age' column, the valid range is set from 18 to 65.
    • For the 'Salary' column, the valid range is set from 20000 to 200000.
  • Categorical metadata is specified for the 'Sex' column.
  • A pandas DataFrame (df) is created using the defined data.
  • Using the PrivateDataFrame constructor, a PrivateDataFrame (priv_df) is created from the pandas.DataFrame df, with the specified metadata bounds and categorical metadata.

By setting metadata bounds, users can ensure that each column in the dataframe contains values within predefined limits, enhancing data integrity and security.
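Before wrapping a table, it can help to check locally that the declared metadata matches the data. The following plain-Python sketch does this for the example values used in this section. It does not use op_pandas; check_metadata is a hypothetical helper, not a library function.

```python
from typing import Dict, List, Sequence, Tuple


def check_metadata(
    data: Dict[str, Sequence],
    metadata: Dict[str, Tuple[float, float]],
    categorical_metadata: Dict[str, List[str]],
) -> Dict[str, bool]:
    """Report, per column, whether all values respect the declared metadata.

    Hypothetical helper for illustration only.
    """
    report = {}
    # Numeric columns: every value must lie in [lower, upper]
    for column, (lower, upper) in metadata.items():
        report[column] = all(lower <= v <= upper for v in data[column])
    # Categorical columns: every value must be one of the declared categories
    for column, categories in categorical_metadata.items():
        allowed = set(categories)
        report[column] = all(v in allowed for v in data[column])
    return report


data = {
    'Age': [20, 30, 40, 25, 30, 25, 26, 27, 28, 29],
    'Salary': [35000, 60000, 100000, 55000, 35000, 60000, 100000, 55000, 35000, 60000],
    'Sex': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
}
metadata = {'Age': (18, 65), 'Salary': (20000, 200000)}
categorical_metadata = {'Sex': ['M', 'F']}

report = check_metadata(data, metadata, categorical_metadata)
print(report)

# A value outside the declared range is flagged for that column only
bad_data = dict(data, Age=[17] + data['Age'][1:])
bad_report = check_metadata(bad_data, metadata, categorical_metadata)
print(bad_report)
```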

Importing from the local Jupyter session

Users can import external data from their local Jupyter session within the AG environment. This allows seamless data integration into the AG environment while maintaining privacy and security.

See an example below:

  1. Random data is generated to create two pandas DataFrames, df and df_2, representing different datasets.

    import pandas as pd
    import numpy as np
    import string
    import random

    # Generate random names, ages, and salaries for the DataFrame
    arr_name = []
    n_num = 10000
    N = 10
    for i in range(n_num):
        res = ''.join(random.choices(string.ascii_lowercase, k=N))
        arr_name.append(res)

    # Create a DataFrame with random data
    df = pd.DataFrame({
        'name': arr_name,
        'age': np.random.randint(0, 80, n_num),
        'salary': np.random.randint(100, 100000, n_num),
    })

    # Import the DataFrame 'df' into the AG environment with the name 'imported_df'
    session.private_import(data=df, name='imported_df')

    Now the second dataset with NaNs is created:

    # Randomly distributing NaNs in two columns with a probability of 0.5

    choice = [1, 2, np.nan]
    a = np.random.choice(choice, 10000, p=[0.25, 0.25, 0.5])
    b = np.random.choice(choice, 10000, p=[0.25, 0.25, 0.5])

    # Create a DataFrame 'df_2' with random data and NaNs
    df_2 = pd.DataFrame({'a': a, 'b': b})

    # Import the DataFrame 'df_2' into the AG environment with the name 'imported_df_2'
    session.private_import(data=df_2, name="imported_df_2")

    The private_import function imports these DataFrames into the AG environment with specified names (imported_df and imported_df_2).

  2. Next, metadata bounds are defined for the 'age' and 'salary' columns of the imported DataFrame imported_df, and a PrivateDataFrame priv_df is created from it, so that the data remains private within the AG environment.

    # Create a PrivateDataFrame 'priv_df' from the imported DataFrame 'imported_df'
    metadata = {
        'age': (0, 80),        # Define metadata bounds for the 'age' column
        'salary': (1, 200000)  # Define metadata bounds for the 'salary' column
    }

    priv_df = PrivateDataFrame(imported_df, metadata=metadata)

By leveraging the private_import function and creating PrivateDataFrames, users can seamlessly work with external data while maintaining privacy.
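Before importing a dataset such as df_2, it can be useful to check locally how many values are missing. The sketch below is a standard-library analogue of the np.random.choice sampling used in step 1, not the code from the guide itself. The seed and the names rng, column, and nan_fraction are assumptions made for illustration.

```python
import math
import random

# Fixed seed (an assumption, for reproducibility of this sketch only)
rng = random.Random(0)

# Same idea as the NumPy example: 1 and 2 with probability 0.25 each, NaN with 0.5
choices = [1, 2, math.nan]
column = rng.choices(choices, weights=[0.25, 0.25, 0.5], k=10000)

# Fraction of missing values in the generated column
nan_count = sum(1 for v in column if math.isnan(v))
nan_fraction = nan_count / len(column)

print(f"missing values: {nan_count} of {len(column)} ({nan_fraction:.1%})")
```

With 10,000 samples the observed fraction should be close to the configured 0.5, which gives a quick sanity check on the generated data before it is passed to private_import.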

Continue the op_pandas guide.

Go to the Managing Data page to continue following the op_pandas guide.