Record Linkage

Preprocessing

Preprocessing data, for example by cleaning and standardising it, may improve record linkage accuracy. The Python Record Linkage Toolkit contains several tools for data preprocessing. The preprocessing and standardising functions are available in the submodule op_recordlinkage.preprocessing. Import the algorithms in the following way:

from op_recordlinkage.preprocessing import clean, phonetic
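
To make the idea of cleaning and standardising concrete, here is a minimal sketch in plain Python of the kind of normalisation such a step typically performs. The function clean_value and the sample names are hypothetical and only illustrate the concept; this is not the implementation of clean in op_recordlinkage.preprocessing, whose exact behaviour may differ.

```python
import re
import unicodedata


def clean_value(value: str) -> str:
    """Hypothetical stand-in for a cleaning step (not the toolkit's code)."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, then turn anything that is not a letter, digit or space into a space.
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


# Toy inputs for illustration only.
names = ["  Dina-Anin  THUM ", "Wolm_Ulna  Fano.", "Zoë"]
cleaned = [clean_value(name) for name in names]
print(cleaned)  # ['dina anin thum', 'wolm ulna fano', 'zoe']
```

Normalising values this way before comparing them means that differences in case, punctuation or spacing do not lower the similarity between two records that refer to the same person.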

Indexing

The indexing module is used to make pairs of records. These pairs are called candidate links or candidate matches. There are several indexing algorithms available such as blocking and sorted neighborhood indexing.

indexer = op_recordlinkage.Index()

These are the supported indexing rules.

After applying the indexing rules, you can obtain the candidate links as follows:

import op_recordlinkage as rl
indexer = rl.Index()

# Indexing rules
indexer.block('passenger_date_of_birth', 'patient_date_of_birth')

# Obtaining candidate links based on the indexing rules set.
candidate_links = indexer.index(dataset1, dataset2)
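
Conceptually, blocking only pairs records that agree on the blocking key, which avoids comparing every record with every other record. The sketch below shows that idea in plain Python on toy records; block_pairs and the sample data are hypothetical and are not part of op_recordlinkage.

```python
from collections import defaultdict


def block_pairs(left, right, left_key, right_key):
    """Return (left_index, right_index) pairs whose blocking keys are equal."""
    # Group the right-hand records by their blocking key value.
    by_key = defaultdict(list)
    for j, record in enumerate(right):
        by_key[record[right_key]].append(j)
    # Pair each left record only with right records in the same block.
    return [
        (i, j)
        for i, record in enumerate(left)
        for j in by_key.get(record[left_key], [])
    ]


# Toy records for illustration only.
flights = [
    {"passenger_date_of_birth": "1978-03-30"},
    {"passenger_date_of_birth": "1996-12-13"},
    {"passenger_date_of_birth": "1985-07-01"},
]
covid_tests = [
    {"patient_date_of_birth": "1996-12-13"},
    {"patient_date_of_birth": "1978-03-30"},
]

pairs = block_pairs(
    flights, covid_tests, "passenger_date_of_birth", "patient_date_of_birth"
)
print(pairs)  # [(0, 1), (1, 0)]
```

Only two of the six possible pairs survive, because the third flight record shares its date of birth with no test record.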

Comparing

The Compare class compares the attributes of candidate record pairs. It has methods such as string, exact and numeric to set up the comparisons, and the compute method runs the actual comparison.

compare = op_recordlinkage.Compare()
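
As a rough illustration of what a comparison produces, the sketch below builds one row of features for a single candidate pair in plain Python. The helpers exact and string_similarity are hypothetical, and difflib.SequenceMatcher is used as a stand-in similarity measure; the string methods in op_recordlinkage may use a different algorithm and return different scores.

```python
from difflib import SequenceMatcher


def exact(a, b) -> float:
    """Return 1.0 when both values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def string_similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; not the toolkit's algorithm."""
    return SequenceMatcher(None, a, b).ratio()


# One candidate pair, using names that appear in the sample output of this page.
left = {"firstname": "Dina", "lastname": "Thum"}
right = {"firstname": "Dina Anin", "lastname": "Thum"}

features = {
    "firstname": string_similarity(left["firstname"], right["firstname"]),
    "lastname": exact(left["lastname"], right["lastname"]),
}
print(features)
```

Each comparison contributes one column, so a full run yields one row per candidate pair and one column per rule.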

These are the supported comparing rules.

User-defined algorithms

You can create a custom comparator function and register it as a compare rule, which can be used to assign weights to a particular comparison strategy.

def compare.custom(cmp: func(x, y) -> int, left_col, right_col, label)

The following example defines a custom compare rule that links records where the COVID test was taken no more than 14 days after the flight departure date.

# Using a custom compare rule.
from datetime import datetime

def cmp(date_str1, date_str2):
    # Convert date strings to datetime objects
    date1 = datetime.strptime(date_str1, "%Y-%m-%d")
    date2 = datetime.strptime(date_str2, "%Y-%m-%d")

    # Number of days from the flight (date1) to the test (date2);
    # negative when the test was taken before the flight
    days_apart = (date2 - date1).days
    # Weight 2 if the test was taken within 14 days after the flight, else 0
    if 0 <= days_apart <= 14:
        return 2
    else:
        return 0

compare.custom(cmp, "flight_date", "covidtest_date", label="date_cmp")
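
To see which weights such a rule would assign, the sketch below applies the same 14-day window to a few date pairs in plain Python. The helper date_weight is hypothetical and standalone; the first three date pairs come from the sample output on this page, and the last two are made-up edge cases (a test taken too late and a test taken before the flight).

```python
from datetime import date


def date_weight(flight_date: str, covidtest_date: str,
                window_days: int = 14, weight: int = 2) -> int:
    """Return `weight` if the test falls 0..window_days days after the flight."""
    days_after = (
        date.fromisoformat(covidtest_date) - date.fromisoformat(flight_date)
    ).days
    return weight if 0 <= days_after <= window_days else 0


date_pairs = [
    ("2020-01-19", "2020-01-27"),  # 8 days after the flight
    ("2020-01-13", "2020-01-22"),  # 9 days after the flight
    ("2020-01-24", "2020-01-27"),  # 3 days after the flight
    ("2020-01-10", "2020-01-27"),  # 17 days after: outside the window
    ("2020-01-28", "2020-01-27"),  # test taken before the flight
]
weights = [date_weight(flight, test) for flight, test in date_pairs]
print(weights)  # [2, 2, 2, 0, 0]
```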

Compute

Calling this method starts the comparing of records.

def op_recordlinkage.Compare.compute(
    pairs: op_recordlinkage.Index.index(),
    x: PrivateDataFrame,
    x_link: PrivateDataFrame
) -> PrivateDataFrame

The following shows how the features matrix looks internally. It is private information, so the dataframe below cannot be accessed directly; you need to apply differentially private methods to obtain useful statistics from it.

features_matrix = compare.compute(pairs, x, x_link)

>>> features_matrix (PrivateDataFrame)
             firstname  lastname  date_cmp
9     48703   0.000000       1.0         2
28227 48703   0.412037       0.0         2
32066 48703   0.888889       1.0         0
32067 48703   0.888889       1.0         2
32068 48703   0.888889       1.0         2
32068 48703 0.888889 1.0 2

# Get the average weight to help find a good matching threshold.
ag_print(f"Average weight : {features_matrix.sum(axis=1).mean(eps=0.1)}")
Output >>> Average weight : 2.3949462007983215
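
For intuition about what a differentially private mean with a privacy budget eps involves, here is a generic sketch of the Laplace mechanism in plain Python. It is an illustration under stated assumptions, not how PrivateDataFrame computes mean(eps=0.1): the function dp_mean, the clipping bounds 0 and 4, and treating the number of rows as public are all assumptions made for this example. The input values are the row sums of the five feature rows shown above.

```python
import random


def dp_mean(values, lower, upper, eps, rng):
    """Laplace-mechanism mean of values clipped to [lower, upper]."""
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / n
    # Changing one record moves the clipped mean by at most (upper - lower) / n.
    scale = (upper - lower) / (n * eps)
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_mean + noise


# Row sums (firstname + lastname + date_cmp) of the five rows shown above.
row_weights = [3.0, 2.412037, 1.888889, 3.888889, 3.888889]

rng = random.Random(0)  # seeded only so the sketch is repeatable
estimate = dp_mean(row_weights, lower=0.0, upper=4.0, eps=0.1, rng=rng)
print(f"Noisy average weight: {estimate}")
```

With only five rows and eps=0.1 the noise dominates the result; the noise scale shrinks as the number of rows or eps grows, which is why such estimates are more useful on the full features matrix.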

def op_recordlinkage.Compare.get_match(threshold: float) -> PrivateDataFrame

After the indexing and compare rules are set up to obtain a features matrix, we can apply a threshold value that decides which candidate pairs are linked. Since the average weight was 2.39 out of a maximum of 4, a threshold of 3.0 gives a fairly strong match.

The linked dataset has columns prefixed with l_ and r_ for the first and second dataset used for linking, respectively.

linked_df = compare.get_match(3)

>>> linked_df (PrivateDataFrame)
l_flight_number l_flight_date l_flight_from l_flight_to \
0 CHI-ROM-0019 2020-01-19 Chicago Rome
1 CHI-TOK-0018 2020-01-18 Chicago Tokyo
2 TOK-LON-0024 2020-01-24 Tokyo London
3 PRE-ROM-0013 2020-01-13 Pretoria Rome
4 ROM-PRE-0020 2020-01-20 Rome Pretoria

l_passenger_firstname l_passenger_lastname l_passenger_date_of_birth \
0 Nymo Thum 1978-03-30
1 Dina Thum 1978-03-30
2 Dina Thum 1978-03-30
3 WolmUlna Fano 1996-12-13
4 WolmUlna Fano 1996-12-13

r_patient_firstname r_patient_lastname r_patient_date_of_birth \
0 Dina Anin Thum 1978-03-30
1 Dina Anin Thum 1978-03-30
2 Dina Anin Thum 1978-03-30
3 Wolm Fano 1996-12-13
4 Wolm Fano 1996-12-13

r_covidtest_date r_covidtest_result r_patient_address
0 2020-01-27 positive House 401, Eagle Alley, London
1 2020-01-27 positive House 401, Eagle Alley, London
2 2020-01-27 positive House 401, Eagle Alley, London
3 2020-01-22 positive House 255, Newton Corner, Pretoria
4 2020-01-22 positive House 255, Newton Corner, Pretoria
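
The thresholding step described for get_match can be sketched in plain Python using the five feature rows shown earlier. This is only an illustration: whether get_match treats a total equal to the threshold as a match is an assumption here (the sketch uses >=), and the real method operates on private data rather than on a visible dictionary.

```python
# Feature rows copied from the sample features matrix:
# (left index, right index) -> (firstname, lastname, date_cmp)
rows = {
    (9, 48703): (0.000000, 1.0, 2),
    (28227, 48703): (0.412037, 0.0, 2),
    (32066, 48703): (0.888889, 1.0, 0),
    (32067, 48703): (0.888889, 1.0, 2),
    (32068, 48703): (0.888889, 1.0, 2),
}

threshold = 3.0

# Total weight per candidate pair is the sum of its feature values.
totals = {pair: sum(weights) for pair, weights in rows.items()}

# Assumption: a pair is linked when its total weight reaches the threshold.
matches = [pair for pair, total in totals.items() if total >= threshold]
print(matches)  # [(9, 48703), (32067, 48703), (32068, 48703)]
```

A higher threshold keeps fewer but more reliable links, while a lower one keeps more pairs at the cost of more false matches.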