Splink

Preprocessing

Preprocessing in Splink involves cleaning and standardizing data from various sources, making attributes uniform, and handling missing or erroneous values. This step improves the accuracy of probabilistic record linkage by reducing noise and improving data quality, which makes it easier for Splink to identify matching records across datasets.
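As a rough illustration of the kind of cleaning described above, here is a minimal sketch in plain Python. It is not part of op_splink; the helper names clean_value and clean_record, the column names, and the list of placeholder strings treated as missing are all assumptions made for the example.

```python
import unicodedata

# Placeholder strings treated as missing (an assumption for this sketch).
MISSING_MARKERS = {"", "n/a", "na", "null", "unknown"}

def clean_value(value):
    """Standardize one string attribute; return None for missing values."""
    if value is None:
        return None
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", str(value))
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse runs of whitespace, trim, and lowercase.
    text = " ".join(text.split()).lower()
    return None if text in MISSING_MARKERS else text

def clean_record(record, columns):
    """Apply clean_value to the selected columns of one record."""
    return {col: clean_value(record.get(col)) for col in columns}

raw = {"firstname": "  José ", "lastname": "O'BRIEN", "zip": "N/A"}
cleaned = clean_record(raw, ["firstname", "lastname", "zip"])
print(cleaned)
```

Mapping placeholders such as "N/A" to None matters because a linkage model can then treat the value as missing rather than as a literal string that happens to agree across records.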

Blocking

Blocking rules in Splink improve the efficiency of the record linkage process. They group records into "blocks" based on shared attributes so that only records within the same block are compared, which reduces the number of comparisons needed. For example, records with the same zip code or last name may be grouped together. Splink's API documentation provides detailed guidance on defining and customizing blocking rules for your specific dataset and use case.

Currently, only the DuckDB backend is supported.

from op_splink.duckdb.blocking_rule_library import block_on # just import from op_splink instead of splink

block_by_firstname = block_on("firstname")
block_by_firstname_and_lastname = block_on(["substr(firstname, 1,1)", "lastname"])

ag_print(block_by_firstname_and_lastname.sql) # check the SQL query generated by the blocking rule

We support all the functions within blocking_rule_library and blocking_rule_composition.
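To make the effect of blocking concrete, the following plain-Python sketch mimics a rule that blocks on the first initial of firstname plus lastname, and counts how many candidate pairs survive. It is a conceptual illustration only, not how op_splink or DuckDB evaluates blocking rules; the sample records and the helper names blocking_key and candidate_pairs are made up for the example, and it compares records within a single list for simplicity.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records, already cleaned.
records = [
    {"id": 1, "firstname": "john", "lastname": "smith"},
    {"id": 2, "firstname": "jon", "lastname": "smith"},
    {"id": 3, "firstname": "jane", "lastname": "doe"},
    {"id": 4, "firstname": "mary", "lastname": "smith"},
    {"id": 5, "firstname": "janet", "lastname": "doe"},
]

def blocking_key(record):
    """Key on the first initial of firstname plus the full lastname."""
    return (record["firstname"][:1], record["lastname"])

def candidate_pairs(rows):
    """Return the id pairs that share a blocking key."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[blocking_key(row)].append(row["id"])
    pairs = []
    for ids in blocks.values():
        # Only records inside the same block are paired up.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)

all_pairs = list(combinations([r["id"] for r in records], 2))
blocked_pairs = candidate_pairs(records)
print(len(all_pairs), "pairs without blocking")
print(len(blocked_pairs), "pairs with blocking:", blocked_pairs)
```

The trade-off is that a pair which disagrees on the blocking key is never compared at all, so a rule that is too strict can hide true matches.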

Comparisons

In Splink, comparisons assess the similarity between records within each block created by the blocking rules. Comparison methods such as string similarity metrics or numerical comparisons are applied to pairs of attribute values, and the results are used to calculate match probabilities. Because comparisons can be defined per attribute, they accommodate diverse data types.
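A comparison can be thought of as a function that maps a pair of values to one of a few ordered similarity levels. The sketch below shows that idea in plain Python, using difflib.SequenceMatcher from the standard library as a stand-in similarity metric. The function name, the level numbering, and the thresholds 0.85 and 0.7 are assumptions for illustration, not the levels or metrics that name_comparison uses.

```python
from difflib import SequenceMatcher

def name_comparison_level(left, right):
    """Map a pair of names to a level.

    -1: one side is missing, 3: exact match,
     2: close match, 1: loose match, 0: anything else.
    """
    if left is None or right is None:
        return -1
    if left == right:
        return 3
    # Similarity ratio in [0, 1]; a stand-in for a dedicated name metric.
    similarity = SequenceMatcher(None, left, right).ratio()
    if similarity >= 0.85:
        return 2
    if similarity >= 0.7:
        return 1
    return 0

examples = [
    ("john", "john"),
    ("john", "jon"),
    ("smith", "smyth"),
    ("john", "mary"),
    ("john", None),
]
levels = [name_comparison_level(a, b) for a, b in examples]
for (a, b), level in zip(examples, levels):
    print(a, b, level)
```

Keeping missing values in their own level, rather than scoring them as a mismatch, avoids penalizing pairs simply because one record is incomplete.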

import op_splink.duckdb.comparison_template_library as ctl

first_name_comparison = ctl.name_comparison("firstname")
ag_print(first_name_comparison.human_readable_description)

We support all the different types of comparisons within the DuckDB library, along with custom comparisons:

Linker

In Splink, the "settings dictionary" is a Python dictionary in which users define the parameters for their record linkage task. It specifies the link type, the comparisons, and the blocking rules used to generate predictions. The "linker" is the core component of Splink that executes the linkage process based on those settings. Together, the settings dictionary and the linker give users control over how the linkage process is tailored to specific data and objectives.

from op_splink.duckdb.linker import DuckDBLinker

settings = {
    "link_type": "link_only",
    "comparisons": [
        first_name_comparison
    ],
    "blocking_rules_to_generate_predictions": [
        block_by_firstname_and_lastname
    ]
}
linker = DuckDBLinker([df_1, df_2], settings)

Predict and create linked df API

DuckDBLinker.predict_and_create_linked_df(
    threshold_match_probability: float = None,
    threshold_match_weight: float = None,
    materialise_after_computing_term_frequencies=True,
) -> op_pandas.PrivateDataFrame

Once the linker model is fully trained by estimating the m and u values, we can use this API to create a linked private dataframe:

linked_df = linker.predict_and_create_linked_df(threshold_match_probability=0.9)
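For intuition about what a probability threshold does, the sketch below applies the standard Fellegi-Sunter arithmetic in plain Python: each comparison contributes a match weight of log2(m / u), the weights are added to the prior weight, and the total is converted to a probability. It is a conceptual illustration, not op_splink's implementation; the pair identifiers, the m and u values, the prior of 0.5 (chosen only to keep the arithmetic simple), and the assumption that pairs at or above the threshold are kept are all made up for the example.

```python
import math

def match_probability(prior, m_u_pairs):
    """Combine a prior with the (m, u) values of the observed comparison levels."""
    # Prior odds expressed as a match weight (log base 2).
    weight = math.log2(prior / (1 - prior))
    for m, u in m_u_pairs:
        # m: P(level | records match), u: P(level | records do not match).
        weight += math.log2(m / u)
    bayes_factor = 2 ** weight
    return bayes_factor / (1 + bayes_factor)

# Hypothetical candidate pairs with the (m, u) of each observed comparison level.
scored_pairs = {
    ("a1", "b1"): [(0.8, 0.1), (0.5, 0.125)],  # two strong agreements
    ("a2", "b2"): [(0.8, 0.1), (0.5, 0.5)],    # one strong, one uninformative
    ("a3", "b3"): [(0.2, 0.9), (0.5, 0.5)],    # evidence against a match
}

threshold = 0.9
probabilities = {
    pair: match_probability(0.5, m_u) for pair, m_u in scored_pairs.items()
}
linked = [pair for pair, p in probabilities.items() if p >= threshold]
for pair, p in probabilities.items():
    print(pair, round(p, 3))
print("linked:", linked)
```

Raising the threshold trades recall for precision: fewer pairs are linked, but those that remain carry stronger evidence.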

These are the supported methods within linker:

In Splink, the "SplinkDataFrame" type is a specialized data structure for handling and manipulating datasets during the record linkage process.

To extract the DataFrame from a SplinkDataFrame, use the following API:

DuckDBDataFrame.as_private_dataframe(limit: int = None) -> PrivateDataFrame