API Reference
Record Linkage
The op_recordlinkage API is based on the Record Linkage library, but all of its methods are differentially private. It provides the tools needed for record linkage: preprocessing, indexing, comparing, and linking datasets.
Preprocessing
Preprocessing the data, for example by cleaning and standardising it, can improve record linkage accuracy. The preprocessing and standardising functions are available in the op_recordlinkage.preprocessing submodule.
You can import the functions as presented in the code block below:
from op_recordlinkage.preprocessing import clean, phonetic
For more detailed information on these functions, see the official Record Linkage documentation.
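As a rough illustration of what a cleaning step does before linkage, the sketch below normalises a name using only the Python standard library. The helper name simple_clean is hypothetical and is not part of op_recordlinkage; the behaviour of the library's own clean function may differ.

```python
import re
import unicodedata


def simple_clean(value: str) -> str:
    """Hypothetical helper: normalise a name before comparison."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, then replace anything that is not a letter, digit or space.
    no_punct = re.sub(r"[^a-z0-9 ]+", " ", stripped.lower())
    # Collapse repeated whitespace and trim both ends.
    return re.sub(r"\s+", " ", no_punct).strip()


print(simple_clean("  Dina-Anin  THUM. "))
```

Standardising both datasets this way means that differences in case, punctuation or spacing do not lower the similarity scores computed later.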
Indexing
The indexing module is used to make pairs of records. These pairs are called candidate links or candidate matches. There are several indexing algorithms available such as blocking and sorted neighborhood indexing.
indexer = op_recordlinkage.Index()
The following are the supported indexing rules:
After applying the indexing rules, you can obtain the candidate links as presented in the code block below:
import op_recordlinkage as rl
indexer = rl.Index()
# Indexing rules
indexer.block('passenger_date_of_birth', 'patient_date_of_birth')
# Obtaining candidate links based on the indexing rules set.
candidate_links = indexer.index(dataset1, dataset2)
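To make the idea of blocking concrete, here is a minimal pure-Python sketch of how candidate pairs can be formed from records that share a key value. It is not the library's implementation and is not differentially private; the function name block_pairs and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict


def block_pairs(left, right, left_key, right_key):
    """Return (left_index, right_index) pairs whose blocking keys are equal."""
    # Group the row positions of the right dataset by blocking-key value.
    right_groups = defaultdict(list)
    for j, row in enumerate(right):
        right_groups[row[right_key]].append(j)

    pairs = []
    for i, row in enumerate(left):
        # Only records sharing the same key value become candidate links.
        for j in right_groups.get(row[left_key], []):
            pairs.append((i, j))
    return pairs


# Illustrative rows only; real datasets would carry more columns.
passengers = [
    {"passenger_date_of_birth": "1978-03-30"},
    {"passenger_date_of_birth": "1996-12-13"},
    {"passenger_date_of_birth": "1978-03-30"},
]
patients = [
    {"patient_date_of_birth": "1978-03-30"},
    {"patient_date_of_birth": "1996-12-13"},
]

candidate_pairs = block_pairs(
    passengers, patients, "passenger_date_of_birth", "patient_date_of_birth"
)
print(candidate_pairs)
```

Blocking keeps three pairs here instead of the six in the full cross product, which is why it is commonly used to reduce the number of comparisons.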
Comparing
The Compare class compares the attributes of candidate record pairs. It has methods such as string, exact, and numeric to initialise the comparison of the records. The compute method starts the actual comparison.
compare = op_recordlinkage.Compare()
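The sketch below shows, in plain Python, the kind of per-pair scores that exact and string comparisons produce. It is only an illustration: difflib's SequenceMatcher ratio is used as a stand-in similarity measure, which is an assumption and may not match the string metric the library applies, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher


def exact(a, b) -> float:
    """Score 1.0 when the two values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; difflib's ratio is a stand-in measure."""
    return SequenceMatcher(None, a, b).ratio()


def compare_pair(left: dict, right: dict) -> dict:
    """Build one row of a features matrix for a candidate pair."""
    return {
        "firstname": string_similarity(left["firstname"], right["firstname"]),
        "lastname": exact(left["lastname"], right["lastname"]),
    }


row = compare_pair(
    {"firstname": "Dina", "lastname": "Thum"},
    {"firstname": "Dina Anin", "lastname": "Thum"},
)
print(row)
```

Each comparison rule contributes one column, so a pair with a partially matching first name and an identical last name gets a fractional score in one column and 1.0 in the other.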
These are the supported comparing rules:
User-defined algorithms
You can create a custom comparator function as a comparison rule to add weights to a particular comparing strategy.
def compare.custom(cmp: func(x, y) -> int, left_col, right_col, label)
For example, consider a custom comparison rule to link the records where the COVID test was taken no more than 14 days after the flight departure date. The following code block showcases this example:
# Using a custom compare rule.
from datetime import datetime
def cmp(date_str1, date_str2):
    # Convert date strings to datetime objects
    date1 = datetime.strptime(date_str1, "%Y-%m-%d")
    date2 = datetime.strptime(date_str2, "%Y-%m-%d")
    # Signed difference in days: positive when date2 falls after date1
    days_apart = (date2 - date1).days
    # Weight 2 if date2 is at most two weeks (14 days) after date1, else 0
    if 0 <= days_apart <= 14:
        return 2
    else:
        return 0
compare.custom(cmp, "flight_date", "covidtest_date", label="date_cmp")
Compute
Use the compute method to start comparing records.
def op_recordlinkage.Compare.compute(
    pairs: op_recordlinkage.Index.index(),
    x: PrivateDataFrame,
    x_link: PrivateDataFrame
) -> PrivateDataFrame
The following code block shows how the features matrix looks internally.
features_matrix = compare.compute(pairs, x, x_link)
>>> features_matrix (PrivateDataFrame)
             firstname  lastname  date_cmp
9     48703   0.000000       1.0         2
28227 48703   0.412037       0.0         2
32066 48703   0.888889       1.0         0
32067 48703   0.888889       1.0         2
32068 48703   0.888889       1.0         2
# Getting the average weight to find a good matching threshold.
ag_print(f"Average weight : {features_matrix.sum(axis=1).mean(eps=0.1)}")
Output >>> Average weight : 2.3949462007983215
Because the DataFrame above holds private information, it cannot be accessed directly. You need to apply a differentially private method, such as the mean with eps=0.1 used above, to obtain useful statistics.
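To give an intuition for what a differentially private mean involves, here is a minimal sketch of the Laplace mechanism applied to a clipped mean. This is a textbook construction, not necessarily what op_recordlinkage does internally; the function name dp_mean, the clipping bounds, and the assumption that the number of rows is public are all illustrative. The sample weights are the row sums of the five rows displayed above.

```python
import random


def dp_mean(values, lower, upper, eps, rng=None):
    """Laplace-mechanism estimate of the mean of values clipped to [lower, upper]."""
    rng = rng or random.Random()
    n = len(values)
    # Clipping bounds the influence any single record has on the sum.
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / n
    # Changing one record moves the mean by at most (upper - lower) / n,
    # assuming the number of records n is public.
    scale = (upper - lower) / (n * eps)
    # The difference of two exponential draws with mean `scale`
    # is a Laplace(0, scale) sample.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_mean + noise


# Row sums of the five displayed rows (firstname + lastname + date_cmp).
row_weights = [3.0, 2.412037, 1.888889, 3.888889, 3.888889]
print(dp_mean(row_weights, lower=0.0, upper=4.0, eps=0.1))
```

With only five rows and eps=0.1 the noise scale is large, so the estimate is unreliable; the noise shrinks as the number of rows or the value of eps grows.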
Link Datasets
Once the indexing and comparison rules have been set up and used to obtain a features matrix, you can apply a threshold value that decides which candidate pairs are linked.
def op_recordlinkage.Compare.get_match(feature_matrix: PrivateDataFrame = None, threshold: float = 0.0, max_sensitivity: int = 1) -> PrivateDataFrame
In the example presented in the Compute section, the average weight was 2.39 out of a maximum of 4 (up to 1 each for firstname and lastname, plus up to 2 for date_cmp). A threshold of 3.0 therefore selects fairly strong matches. The max_sensitivity argument, which is 1 by default, denotes the maximum number of matches for a single record. If this number is unknown, a large value can be set to get the complete results.
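The following sketch illustrates one possible reading of how a threshold and a per-record cap interact, using the five rows shown in the Compute section. It is an assumption-laden illustration in plain Python: the function name select_matches and the parameter max_matches_per_record (standing in for max_sensitivity) are hypothetical, and the documentation does not state how the library chooses which pairs to keep when the cap is reached. Here the strongest pairs are kept first.

```python
from collections import Counter


def select_matches(features, threshold, max_matches_per_record=1):
    """Keep pairs whose summed weight reaches the threshold, capped per record."""
    # The total weight of a pair is the sum of its comparison scores.
    scored = [(sum(scores.values()), pair) for pair, scores in features.items()]
    # Visit the strongest pairs first; list.sort is stable, so ties keep input order.
    scored.sort(key=lambda item: -item[0])

    left_count, right_count = Counter(), Counter()
    matches = []
    for total, (left_id, right_id) in scored:
        if total < threshold:
            break  # all remaining pairs are weaker still
        if (left_count[left_id] < max_matches_per_record
                and right_count[right_id] < max_matches_per_record):
            matches.append((left_id, right_id))
            left_count[left_id] += 1
            right_count[right_id] += 1
    return matches


features = {
    (9, 48703): {"firstname": 0.0, "lastname": 1.0, "date_cmp": 2},
    (28227, 48703): {"firstname": 0.412037, "lastname": 0.0, "date_cmp": 2},
    (32066, 48703): {"firstname": 0.888889, "lastname": 1.0, "date_cmp": 0},
    (32067, 48703): {"firstname": 0.888889, "lastname": 1.0, "date_cmp": 2},
    (32068, 48703): {"firstname": 0.888889, "lastname": 1.0, "date_cmp": 2},
}

print(select_matches(features, threshold=3, max_matches_per_record=1000))
print(select_matches(features, threshold=3, max_matches_per_record=1))
```

With a large cap, all three pairs that reach the threshold are returned. With a cap of 1, only one pair survives because every pair shares the same right-hand record, which is why a low cap can hide valid matches when one record legitimately links to many.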
The linked dataset has columns prefixed with l_ and r_ for the first and second datasets used for linking, respectively. The code block below shows an example of the result of linking datasets:
linked_df = compare.get_match(threshold=3, max_sensitivity=1000)
>>> linked_df (PrivateDataFrame)
   l_flight_number  l_flight_date  l_flight_from  l_flight_to \
0     CHI-ROM-0019     2020-01-19        Chicago         Rome
1     CHI-TOK-0018     2020-01-18        Chicago        Tokyo
2     TOK-LON-0024     2020-01-24          Tokyo       London
3     PRE-ROM-0013     2020-01-13       Pretoria         Rome
4     ROM-PRE-0020     2020-01-20           Rome     Pretoria
   l_passenger_firstname  l_passenger_lastname  l_passenger_date_of_birth \
0                   Nymo                  Thum                 1978-03-30
1                   Dina                  Thum                 1978-03-30
2                   Dina                  Thum                 1978-03-30
3               WolmUlna                  Fano                 1996-12-13
4               WolmUlna                  Fano                 1996-12-13
   r_patient_firstname  r_patient_lastname  r_patient_date_of_birth \
0            Dina Anin                Thum               1978-03-30
1            Dina Anin                Thum               1978-03-30
2            Dina Anin                Thum               1978-03-30
3                 Wolm                Fano               1996-12-13
4                 Wolm                Fano               1996-12-13
   r_covidtest_date  r_covidtest_result                   r_patient_address
0        2020-01-27            positive      House 401, Eagle Alley, London
1        2020-01-27            positive      House 401, Eagle Alley, London
2        2020-01-27            positive      House 401, Eagle Alley, London
3        2020-01-22            positive  House 255, Newton Corner, Pretoria
4        2020-01-22            positive  House 255, Newton Corner, Pretoria