Importing Data
The op_pandas library allows users to import datasets efficiently. This page showcases some of the available ways to do so.
Before following the steps described on this page, make sure you have completed First Steps to Use AG.
Object Creation
Users can import and load datasets from different sources, such as:
- The AG-Server
- A pandas.Series
- A pandas.DataFrame
- The local Jupyter session
The following sections describe them in more detail.
Importing from the AG-Server
The load_dataset() function lets users obtain a dataset and the required data structures from the Antigranular server.
Private data structures cannot be exported to the local environment unless a differentially private measure is applied to obtain a non-private data frame.
You can use the load_dataset() function to load any dataset, as shown in the following code block:
%%ag
from op_pandas import PrivateDataFrame, PrivateSeries
# Obtaining the dictionary containing private objects
response = load_dataset("<dataset_name>", "<team_name>")
# The response is a PrivateDataFrame (PDF) and uses the budget allocated to the user from the "<team_name>" team.
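The note above says private data can only leave the server after a differentially private measure is applied. As a rough illustration of what such a measure can look like, the sketch below computes a noisy mean with the Laplace mechanism in plain Python. It is not the Antigranular or op_pandas implementation: the function name `dp_mean`, its parameters, and the sample values are hypothetical, and the number of records is assumed to be public.

```python
import random


def dp_mean(values, lower, upper, epsilon, rng=None):
    """Hypothetical helper: Laplace-mechanism mean of values clipped to [lower, upper]."""
    rng = rng or random.Random()
    # Clip every value into the declared bounds so one record has limited influence.
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    true_mean = sum(clipped) / n
    # Replacing one record moves the clipped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    scale = sensitivity / epsilon
    # A Laplace(0, scale) draw equals scale times the difference of two Exp(1) draws.
    noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
    return true_mean + noise


ages = [20, 30, 40, 25, 30, 25, 26, 27, 28, 29]
noisy_age_mean = dp_mean(ages, lower=18, upper=65, epsilon=1.0, rng=random.Random(0))
print(noisy_age_mean)
```

The noise scale grows with the width of the bounds, which is one reason the following sections recommend declaring tight, realistic metadata bounds.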
Importing from a pandas.Series
When creating a PrivateSeries, it's recommended to set metadata bounds to define the range of valid values for the series. If users don't provide explicit bounds, op_pandas will automatically assign the metadata based on the minimum and maximum values in the series.
See an example in the following code block:
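%%ag
import pandas as pd
from op_pandas import PrivateSeries

# Build a regular pandas Series, then wrap it with explicit bounds (0, 10)
s = pd.Series([1, 5, 8, 2, 9], name='Test_series')
priv_s = PrivateSeries(series=s, metadata=(0, 10))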
Where:
- A pandas Series named 'Test_series' is created with the values [1, 5, 8, 2, 9].
- Using the PrivateSeries constructor, we create a private series (priv_s) from the regular pandas Series s.
- Metadata bounds (0, 10) are set, ensuring that all values in the series fall within the range from 0 to 10.
By setting metadata bounds, you control the valid range of values within the series, enhancing privacy and security while working with sensitive data.
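To make the difference between explicit and automatically derived bounds concrete, the sketch below uses plain pandas (not op_pandas) on the same values. It assumes, as described above, that the fallback bounds are simply the series minimum and maximum; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([1, 5, 8, 2, 9], name="Test_series")

# Bounds a min/max fallback would derive from the data, per the description above.
inferred_bounds = (int(s.min()), int(s.max()))

# Bounds chosen from domain knowledge, independent of the observed values.
explicit_bounds = (0, 10)

# Check that every value lies inside the explicit bounds (inclusive on both ends).
within_bounds = bool(s.between(*explicit_bounds).all())

print(inferred_bounds)                 # (1, 9)
print(explicit_bounds, within_bounds)  # (0, 10) True
```

Inferred bounds follow whatever values happen to be in the data, so explicit bounds are the more predictable choice when the plausible range is known in advance.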
Importing from a pandas.DataFrame
Just as with PrivateSeries, setting metadata bounds when creating a PrivateDataFrame is recommended. If users don't provide explicit bounds, op_pandas will automatically assign the metadata based on the minimum and maximum values in each column.
See an example in the following code block:
%%ag
import pandas as pd
from op_pandas import PrivateDataFrame

data = {
    'Age': [20, 30, 40, 25, 30, 25, 26, 27, 28, 29],
    'Salary': [35000, 60000, 100000, 55000, 35000, 60000, 100000, 55000, 35000, 60000],
    'Sex': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F']
}
# Numeric bounds per column, given as (lower, upper)
metadata = {
    'Age': (18, 65),
    'Salary': (20000, 200000)
}
# Allowed categories for non-numeric columns
categorical_metadata = {
    'Sex': ['M', 'F']
}
df = pd.DataFrame(data)
priv_df = PrivateDataFrame(df=df, metadata=metadata, categorical_metadata=categorical_metadata)
In the example:
- Data for the DataFrame is defined, including columns for 'Age', 'Salary', and 'Sex'.
- Metadata bounds are specified for each numeric column:
  - For the 'Age' column, the valid range is set from 18 to 65.
  - For the 'Salary' column, the valid range is set from 20000 to 200000.
- Categorical metadata is specified for the 'Sex' column.
- A pandas DataFrame (df) is created using the defined data.
- Using the PrivateDataFrame constructor, a PrivateDataFrame (priv_df) is created from the pandas.DataFrame df, with the specified metadata bounds.
By setting metadata bounds, users can ensure that each column in the dataframe contains values within predefined limits, enhancing data integrity and security.
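Before constructing a PrivateDataFrame, it can help to confirm locally that the data agrees with the metadata you intend to declare. The helper below is a hypothetical plain-pandas check and is not part of op_pandas; the name `check_metadata`, its return format, and the sample rows are assumptions made for illustration.

```python
import pandas as pd


def check_metadata(df, metadata, categorical_metadata):
    """Hypothetical helper: count values per column that violate the declared metadata."""
    violations = {}
    # Numeric columns: count values outside the inclusive (lower, upper) range.
    for column, (lower, upper) in metadata.items():
        outside = ~df[column].between(lower, upper)
        violations[column] = int(outside.sum())
    # Categorical columns: count values not present in the allowed categories.
    for column, categories in categorical_metadata.items():
        unknown = ~df[column].isin(categories)
        violations[column] = int(unknown.sum())
    return violations


df = pd.DataFrame({
    "Age": [20, 30, 40, 25, 70],
    "Salary": [35000, 60000, 100000, 55000, 35000],
    "Sex": ["M", "F", "M", "F", "X"],
})
metadata = {"Age": (18, 65), "Salary": (20000, 200000)}
categorical_metadata = {"Sex": ["M", "F"]}

report = check_metadata(df, metadata, categorical_metadata)
print(report)  # {'Age': 1, 'Salary': 0, 'Sex': 1}
```

In this check, missing values count as violations because `between` and `isin` both return False for NaN.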
Importing from the local Jupyter session
Users can import external data from their local Jupyter session into the AG environment. The imported data can then be wrapped in private data structures, so privacy and security are maintained.
See an example below:
Random data is generated to create two pandas DataFrames, df and df_2, representing different datasets.
import pandas as pd
import numpy as np
import string
import random

# Generate random names, ages, and salaries for the DataFrame
arr_name = []
n_num = 10000
N = 10
for i in range(n_num):
    res = ''.join(random.choices(string.ascii_lowercase, k=N))
    arr_name.append(res)

# Create a DataFrame with random data
df = pd.DataFrame(
    {'name': arr_name,
     'age': np.random.randint(0, 80, n_num),
     'salary': np.random.randint(100, 100000, n_num)})

# Import the DataFrame 'df' into the AG environment with the name 'imported_df'
# (`session` is assumed to be the AG session created in First Steps to Use AG)
session.private_import(data=df, name='imported_df')
Now the second dataset with NaNs is created:
# Randomly distributing NaNs in two columns with a probability of 0.5
choice = [1, 2, np.nan]
a = np.random.choice(choice, 10000, p=[0.25, 0.25, 0.5])
b = np.random.choice(choice, 10000, p=[0.25, 0.25, 0.5])
# Create a DataFrame 'df_2' with random data and NaNs
df_2 = pd.DataFrame({'a': a, 'b': b})
# Import the DataFrame 'df_2' into the AG environment with the name 'imported_df_2'
session.private_import(data=df_2, name="imported_df_2")
The private_import function imports these DataFrames into the AG environment with the specified names (imported_df and imported_df_2).
Now, metadata bounds are defined for the columns 'age' and 'salary' of the DataFrame imported_df, and a PrivateDataFrame priv_df is created from the DataFrame imported_df, ensuring that the data remains private and secure within the AG environment.
%%ag
# Create a PrivateDataFrame 'priv_df' from the imported DataFrame 'imported_df'
metadata = {
    'age': (0, 80),        # Define metadata bounds for the 'age' column
    'salary': (1, 200000)  # Define metadata bounds for the 'salary' column
}
priv_df = PrivateDataFrame(imported_df, metadata=metadata)
By leveraging the private_import function and creating PrivateDataFrames, users can seamlessly work with external data while maintaining privacy.
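A quick local inspection before calling private_import can confirm that the generated data looks as intended. The sketch below uses only pandas and NumPy and does not touch the AG session; the fixed seed and the names `rng` and `nan_share` are assumptions added for reproducibility and illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed, an assumption for reproducibility
n_num = 10000

# Same shape as df_2 above: values 1, 2 or NaN, with NaN drawn about half of the time.
choice = [1, 2, np.nan]
df_2 = pd.DataFrame({
    "a": rng.choice(choice, n_num, p=[0.25, 0.25, 0.5]),
    "b": rng.choice(choice, n_num, p=[0.25, 0.25, 0.5]),
})

# Share of missing values per column; each should be close to 0.5.
nan_share = df_2.isna().mean()
print(nan_share)

# Observed range of the non-missing values, useful when choosing metadata bounds.
print(df_2.min())
print(df_2.max())
```

Knowing the share of missing values and the observed range up front makes it easier to pick sensible metadata bounds once the data is inside the AG environment.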
Access Managing Data to continue following the op_pandas guide.