Skip to main content

SmartNoise SQL

Define Privacy Parameters

class op_snsql.Privacy(epsilon:float=1.0, delta:float=10E-16)
  • epsilon
    The epsilon value for each statistic returned by the private SQL connection.
  • delta
    It can be avoided if you are a beginner. Most counts and sums will use delta of 0, but dimension censoring and Gaussian mechanism require delta. Set delta to something like 1/(nsqrt(n))1/(n*sqrt(n)), where n is the approximate number of rows in the data source.

Loading Data

op_snsql.from_df(df,privacy,metadata=None) : _PrivateReader
  • df: pandas.DataFrame | op_pandas.PrivateDataFrame
    The dataframe to be loaded.

  • privacy: op_snsql.Privacy
    A Privacy object with epsilon, delta, and other privacy properties.

  • metadata: Dict
    This is ignored when a PrivateDataframe is sent. For a normal DataFrame , you need to send the metadata information as per the format used by snsql.

  • Returns
    A private reader class op_snsql._PrivateReader which is used for executing queries against the loaded data.

Executing Queries

Executes a query and returns a recordset that is differentially private.

# reader is an object of op_snsql._PrivateReader obtained when you load a dataset.
reader.execute(query_string, postprocess:bool=True)
  • query_string
    A query string in SQL syntax
  • postprocess
    If False, the intermediate result, immediately after adding noise and censoring dimensions, will be returned. All post-processing that does not impact privacy, such as clamping negative counts, LIMIT, HAVING, and ORDER BY, will be skipped.
  • return
    A recordset structured as an array of tuples, where each tuple represents a row, and each item in the tuple is typed. The first row will contain column names.

Estimating Privacy Budget

Estimates the epsilon and delta cost for running the given query. Privacy cost is returned without running the query or incrementing the odometer.

# reader is an object of op_snsql._PrivateReader obtained when you load a dataset.
reader.get_privacy_cost(query_string)
  • query_string
    The SQL query string or a list of strings to analyse the expected cost.
  • returns
    A tuple of (epsilon, delta) estimating total privacy cost for running this query or queries.

Sample Implementation

Loading the Iris dataset after loging into antigranular-server.

%%ag

x = load_dataset("Iris Dataset")
train_x = x["train_x"]
train_y = x["train_y"]
test_x = x["test_x"]
export(train_x.describe(eps=1) , 'result')
print(result)

sepal_length sepal_width petal_length petal_width
count 120.000000 120.000000 120.000000 120.000000
mean 6.004110 3.634028 2.721815 0.949909
std 1.167699 0.563917 2.655861 0.638500
min 5.573799 2.118888 1.831912 0.794652
25% 5.449591 2.785598 4.016730 1.873766
50% 6.097875 3.302320 6.233562 0.689428
75% 4.962494 4.396164 4.737204 1.976907
max 6.344443 4.109294 3.857761 1.787463

Finding count of individual Iris species

%%ag
import op_snsql
#setup a privacy l
privacy = op_snsql.Privacy(epsilon=1.0 , delta=0.01)
# train_y contains the species (target attribute)
reader = op_snsql.from_df(df=train_y, privacy=privacy)

Obtain the privacy cost needed for executing the queries

total_cost = reader.get_privacy_cost(
[
"SELECT COUNT(species) from df.table WHERE species = 'Iris-setosa'",
"SELECT COUNT(species) from df.table WHERE species = 'Iris-virginica'",
"SELECT COUNT(species) from df.table WHERE species = 'Iris-versicolor'",
"SELECT COUNT(species) from df.table WHERE species = 'Iris-random'"
]
)
# Make sure the return types are casted to string ( export prevents snsql modules )
export(str(total_cost) , "total_cost")

This denotes that a total of eps=8 and delta=0.044 needs to spent to execute the list of given queries.

print(total_cost) # (epsilon , delta)
>>> (8.0, 0.04420697005000007)

Executing the queries without column-names

c0 = reader.execute("SELECT COUNT(species) from df.table WHERE species = 'Iris-setosa'")
c1 = reader.execute("SELECT COUNT(species) from df.table WHERE species = 'Iris-virginica'")
c2 = reader.execute("SELECT COUNT(species) from df.table WHERE species = 'Iris-versicolor'")
c3 = reader.execute("SELECT COUNT(species) from df.table WHERE species = 'Iris-Unknown'")
export(str([c0,c1,c2,c3]) , 'counts')
print(counts) # '???' is shown if column-name is not specified in SQL query
>>> [[['???'], [42]], [['???'], [40]], [['???'], [41]], [['???']]]

Executing the queries with column-names

c0 = reader.execute("SELECT COUNT(species) from df.table WHERE species = 'Iris-setosa'")
c1 = reader.execute("SELECT COUNT(species) from df.table WHERE species = 'Iris-virginica'")
c2 = reader.execute("SELECT COUNT(species) from df.table WHERE species = 'Iris-versicolor'")
c3 = reader.execute("SELECT COUNT(species) from df.table WHERE species = 'Iris-Unknown'")
export(str([c0,c1,c2,c3]) , 'counts')
print(counts) # others does not have count since no such records exist.
>>> [[['setosa'], [40]], [['virginica'], [40]], [['versicolor'], [41]], [['others']]]

Finding AVG(sepal_length) of Iris-setosa whose sepal_width < AVG(sepal_width)

Obtain the privacy cost needed for executing the queries

total_cost = reader.get_privacy_cost(
[
"SELECT AVG(sepal_width) as avg_sepal_width_setosa from df.table",
"SELECT AVG(sepal_length) as avg_sepal_length_setosa from df.table WHERE species = 'Iris-setosa' AND sepal_width < avg_val"
]
)
export(total_cost , "total_cost")
print(total_cost)
>>> (6.0, 0.024800500000000003)

Executing the query

result1 = reader.execute("SELECT AVG(sepal_width) as avg_sepal_width_setosa from df.table")
result = reader.execute(f"SELECT AVG(sepal_length) as avg_sepal_length_setosa from df.table WHERE species = 'Iris-setosa' AND sepal_width < {result1[1][0]}")
export(str(result1) , 'result1')
export(str(result) , 'result')
print(result1)
>>> [['avg_sepal_width_setosa'], [3.0050040866647545]]
print(result)
>>> [['avg_sepal_length_setosa'], [8.350680871843242]]