API Reference

Record Linkage

The op_recordlinkage API is based on the Record Linkage library, but all of its methods are differentially private. It provides a comprehensive toolkit for record linkage.

Preprocessing

Preprocessing data, for example by cleaning and standardising it, can improve record linkage accuracy. The preprocessing and standardising functions are available in the submodule op_recordlinkage.preprocessing.

You can import the functions as presented in the code block below:

from op_recordlinkage.preprocessing import clean, phonetic
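To give a rough idea of what cleaning a text field involves, here is a minimal plain-Python sketch. The helper simple_clean is hypothetical and is not the library's clean function; it only illustrates typical standardisation steps (removing accents, lowercasing, dropping punctuation, collapsing whitespace).

```python
import re
import unicodedata


def simple_clean(value):
    """Hypothetical helper, NOT op_recordlinkage's clean(): standardise one string."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, replace anything that is not a letter, digit or space, then collapse spaces.
    text = text.lower()
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(simple_clean("  Dina-Anin  THUM. "))  # dina anin thum
```

Standardising both datasets in the same way before indexing and comparing means that differences in case or punctuation do not hide otherwise identical values.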

Read more detailed information on these functions on the following official Record Linkage pages:

Indexing

The indexing module is used to make pairs of records. These pairs are called candidate links or candidate matches. There are several indexing algorithms available such as blocking and sorted neighborhood indexing.

indexer = op_recordlinkage.Index()
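As an illustration of the sorted neighborhood idea mentioned above, the following plain-Python sketch pairs records whose key values are close to each other in sorted order. It is a simplified version written for this page, not the library's implementation; the function name, the window parameter, and the toy surnames are assumptions.

```python
def sorted_neighbourhood_pairs(left, right, key, window=3):
    """Hypothetical sketch: pair records whose keys are near each other when sorted."""
    # Rank every distinct key value found in either dataset.
    values = sorted({rec[key] for rec in left} | {rec[key] for rec in right})
    rank = {value: position for position, value in enumerate(values)}
    reach = window // 2  # how many ranks away a neighbour may be
    pairs = []
    for i, left_rec in enumerate(left):
        for j, right_rec in enumerate(right):
            if abs(rank[left_rec[key]] - rank[right_rec[key]]) <= reach:
                pairs.append((i, j))
    return pairs


# Toy data: the right-hand surnames contain small spelling variations.
left = [{"surname": "fano"}, {"surname": "thum"}]
right = [{"surname": "fanno"}, {"surname": "thun"}, {"surname": "zed"}]
print(sorted_neighbourhood_pairs(left, right, "surname"))  # [(0, 0), (1, 1)]
```

Unlike exact blocking, this approach can still pair records whose key values differ slightly, at the cost of producing more candidate links as the window grows.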

The following are the supported indexing rules:

After applying the indexing rules, you can obtain the candidate links as presented in the code block below:

import op_recordlinkage as rl
indexer = rl.Index()

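# Indexing rules: block on date of birth, so only records sharing it are paired.
indexer.block('passenger_date_of_birth', 'patient_date_of_birth')

# Obtain the candidate links based on the indexing rules set.
# dataset1 and dataset2 are the two datasets to link.
candidate_links = indexer.index(dataset1, dataset2)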
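Conceptually, blocking keeps only the pairs of records that share the same value of the blocking key. The sketch below shows that idea in plain Python on toy records; block_pairs is a hypothetical helper and does not reflect how op_recordlinkage builds its candidate links internally.

```python
from collections import defaultdict


def block_pairs(left, right, left_key, right_key):
    """Hypothetical sketch: pair records that agree exactly on the blocking key."""
    # Group the positions of the right-hand records by their key value.
    buckets = defaultdict(list)
    for j, rec in enumerate(right):
        buckets[rec[right_key]].append(j)
    # Pair each left-hand record only with right-hand records in the same bucket.
    return [
        (i, j)
        for i, rec in enumerate(left)
        for j in buckets.get(rec[left_key], [])
    ]


# Toy data for illustration only.
passengers = [
    {"passenger_date_of_birth": "1978-03-30"},
    {"passenger_date_of_birth": "1996-12-13"},
]
patients = [
    {"patient_date_of_birth": "1996-12-13"},
    {"patient_date_of_birth": "1978-03-30"},
    {"patient_date_of_birth": "1985-07-01"},
]
candidate_pairs = block_pairs(
    passengers, patients, "passenger_date_of_birth", "patient_date_of_birth"
)
print(candidate_pairs)  # [(0, 1), (1, 0)]
```

Here blocking reduces the six possible pairs to two, which is the main reason to index before comparing.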

Comparing

The Compare class compares the attributes of candidate record pairs. Its methods, such as string, exact, and numeric, set up the comparisons to run on the records, and the compute method performs the actual comparison.

compare = op_recordlinkage.Compare()

These are the supported comparing rules:

User-defined algorithms

You can create a custom comparator function as a comparison rule to add weights to a particular comparing strategy.

def compare.custom(cmp: func(x, y) -> int, left_col, right_col, label)

For example, consider a custom comparison rule to link the records where the COVID test was taken no more than 14 days after the flight departure date. The following code block showcases this example:

# Using a custom compare rule.
from datetime import datetime

def cmp(date_str1, date_str2):
    # Convert date strings to datetime objects
    date1 = datetime.strptime(date_str1, "%Y-%m-%d")
    date2 = datetime.strptime(date_str2, "%Y-%m-%d")

    # Signed difference in days: positive when date2 (the test date)
    # is after date1 (the flight date)
    days_apart = (date2 - date1).days
    # Weight 2 if the test was taken on the flight date or up to 14 days
    # after it, otherwise 0
    if 0 <= days_apart <= 14:
        return 2
    else:
        return 0

compare.custom(cmp, "flight_date", "covidtest_date", label="date_cmp")
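Before registering a comparator, it can help to try it on a few date pairs by hand. The standalone sketch below assumes the rule is read as described in the prose (the test falls on the flight date or up to 14 days after it); the names days_after and date_weight are hypothetical, and apart from the first pair the dates are made up for illustration.

```python
from datetime import datetime


def days_after(flight_date, test_date):
    """Signed number of days from the flight to the test (negative if the test came first)."""
    fmt = "%Y-%m-%d"
    return (datetime.strptime(test_date, fmt) - datetime.strptime(flight_date, fmt)).days


def date_weight(flight_date, test_date):
    """Hypothetical comparator: weight 2 if the test is 0 to 14 days after the flight, else 0."""
    return 2 if 0 <= days_after(flight_date, test_date) <= 14 else 0


print(date_weight("2020-01-19", "2020-01-27"))  # 8 days after the flight -> 2
print(date_weight("2020-01-13", "2020-01-10"))  # test before the flight -> 0
print(date_weight("2020-01-01", "2020-01-27"))  # 26 days after the flight -> 0
```

Because the comparator returns 2 rather than 1, a date match counts for more in the total weight of a pair than an exact match on a single name column.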

Compute

Use the compute method to start comparing records.

def op_recordlinkage.Compare.compute(
    pairs,  # candidate links returned by op_recordlinkage.Index.index()
    x: PrivateDataFrame,
    x_link: PrivateDataFrame
) -> PrivateDataFrame

The following code block shows what the features matrix looks like internally.

features_matrix = compare.compute(pairs, x, x_link)

>>> features_matrix (PrivateDataFrame)
             firstname  lastname date_cmp
9     48703   0.000000       1.0        2
28227 48703   0.412037       0.0        2
32066 48703   0.888889       1.0        0
32067 48703   0.888889       1.0        2
32068 48703   0.888889       1.0        2

# Get the average weight to find a good matching threshold.
ag_print(f"Average weight : {features_matrix.sum(axis=1).mean(eps=0.1)}")
Output >>> Average weight : 2.3949462007983215

Since the above DataFrame is private, it cannot be accessed directly; you need to apply a differentially private method to obtain useful statistics.
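To illustrate what a differentially private mean involves, here is a minimal sketch of the Laplace mechanism in plain Python. It is not how PrivateDataFrame.mean is implemented; the helpers laplace_noise and private_mean, the clipping bounds, and the assumption that the number of rows is public are all choices made for this example. The row totals are those of the five pairs displayed above.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws is Laplace-distributed with this scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, eps, rng):
    """Hypothetical sketch of a Laplace-mechanism mean; assumes len(values) is public."""
    # Clip each value so that a single record can move the sum by at most (upper - lower).
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / eps, rng)


# Total weight per pair (firstname + lastname + date_cmp) for the five rows shown above.
row_weights = [3.0, 2.412037, 1.888889, 3.888889, 3.888889]
rng = random.Random(0)
# Weights lie between 0 and 4 (1 + 1 + 2), so those are used as clipping bounds.
print(private_mean(row_weights, lower=0.0, upper=4.0, eps=0.1, rng=rng))
```

With only five rows and eps=0.1 the noise dominates the result; the noise scale shrinks in proportion to the number of rows, so the estimate becomes useful on a full features matrix.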

Link Datasets

Once the indexing and comparison rules have produced a features matrix, you can apply a threshold value to decide which candidate pairs are linked.

def op_recordlinkage.Compare.get_match(feature_matrix: PrivateDataFrame = None, threshold: float = 0.0, max_sensitivity: int = 1) -> PrivateDataFrame

In the example presented in the Compute section, the average weight was 2.39 out of a maximum of 4. A threshold of 3.0 therefore selects fairly strong matches. The max_sensitivity argument, which is 1 by default, denotes the maximum number of matches for a single record. If this number is unknown, you can set a large value to get the complete results.
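The interplay between the threshold and max_sensitivity can be sketched in plain Python as follows. This is an assumption-laden illustration, not the behaviour of get_match itself: select_matches is a hypothetical helper, it treats the threshold as inclusive, and it keeps the highest-weight pairs first when a record exceeds its cap. The pair weights are the row totals of the five pairs from the Compute section.

```python
from collections import Counter


def select_matches(pair_scores, threshold, max_sensitivity):
    """Hypothetical sketch: keep pairs at or above the threshold, capped per record."""
    # Keep qualifying pairs, strongest first (ties keep their original order).
    kept = sorted(
        ((pair, score) for pair, score in pair_scores.items() if score >= threshold),
        key=lambda item: item[1],
        reverse=True,
    )
    uses = Counter()  # how many matches each left or right record already has
    matches = []
    for (left_id, right_id), _score in kept:
        if uses["l", left_id] < max_sensitivity and uses["r", right_id] < max_sensitivity:
            uses["l", left_id] += 1
            uses["r", right_id] += 1
            matches.append((left_id, right_id))
    return matches


pair_scores = {
    (9, 48703): 3.0,
    (28227, 48703): 2.412037,
    (32066, 48703): 1.888889,
    (32067, 48703): 3.888889,
    (32068, 48703): 3.888889,
}
print(select_matches(pair_scores, threshold=3, max_sensitivity=1))
# [(32067, 48703)]
print(select_matches(pair_scores, threshold=3, max_sensitivity=1000))
# [(32067, 48703), (32068, 48703), (9, 48703)]
```

With a cap of 1, record 48703 can appear in only one link, so two of the three qualifying pairs are dropped; a large cap returns all of them.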

The linked dataset has columns prefixed with l_ and r_ for the first and second datasets used for linking, respectively. The code block below presents an example of the result of linking datasets:

linked_df = compare.get_match(threshold=3, max_sensitivity=1000)

>>> linked_df (PrivateDataFrame)
l_flight_number l_flight_date l_flight_from l_flight_to  \
  CHI-ROM-0019    2020-01-19       Chicago        Rome
  CHI-TOK-0018    2020-01-18       Chicago       Tokyo
  TOK-LON-0024    2020-01-24         Tokyo      London
  PRE-ROM-0013    2020-01-13      Pretoria        Rome
  ROM-PRE-0020    2020-01-20          Rome    Pretoria

  l_passenger_firstname l_passenger_lastname l_passenger_date_of_birth  \
                Nymo                 Thum                1978-03-30
                Dina                 Thum                1978-03-30
                Dina                 Thum                1978-03-30
            WolmUlna                 Fano                1996-12-13
            WolmUlna                 Fano                1996-12-13

r_patient_firstname r_patient_lastname r_patient_date_of_birth  \
         Dina Anin               Thum              1978-03-30
         Dina Anin               Thum              1978-03-30
         Dina Anin               Thum              1978-03-30
              Wolm               Fano              1996-12-13
              Wolm               Fano              1996-12-13

  r_covidtest_date r_covidtest_result                   r_patient_address
     2020-01-27           positive      House 401, Eagle Alley, London
     2020-01-27           positive      House 401, Eagle Alley, London
     2020-01-27           positive      House 401, Eagle Alley, London
     2020-01-22           positive  House 255, Newton Corner, Pretoria
     2020-01-22           positive  House 255, Newton Corner, Pretoria
