API Reference

SPlink is a versatile tool for accurate record linkage, leveraging probabilistic matching and customisable rules to identify and link records efficiently. AG's op_splink is a differentially private version of SPlink. This page showcases methods and commands available for this library.

Preprocessing

Preprocessing in SPlink involves cleaning and standardising data from various sources, ensuring uniformity in attributes, and handling missing or erroneous values. This step is crucial to enhance the accuracy of probabilistic record linkage by reducing noise and improving data quality. As a result, it becomes easier for SPlink to identify matching records accurately across datasets.

SPlink requires that you clean your data and assign unique IDs to rows before linking. Access the official SPlink Data Prerequisites page for more information.
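To illustrate the kind of preparation SPlink expects, here is a minimal, library-free Python sketch. The column names and cleaning rules are illustrative assumptions, not part of the op_splink API:

```python
# Illustrative only: a minimal stdlib sketch of the cleaning SPlink expects.
# Column names ("firstname", "lastname") and rules are examples.

def standardise(value):
    """Trim whitespace, lowercase, and map empty strings to None."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return cleaned if cleaned else None

def preprocess(rows):
    """Standardise every field and assign a sequential unique ID to each row."""
    out = []
    for i, row in enumerate(rows):
        record = {k: standardise(v) for k, v in row.items()}
        record["unique_id"] = i  # SPlink requires a unique ID per row
        out.append(record)
    return out

rows = [{"firstname": "  Alice ", "lastname": "SMITH"},
        {"firstname": "", "lastname": "Jones"}]
clean = preprocess(rows)
```

In practice you would apply the same idea with your dataframe library of choice before handing the data to the linker.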

Blocking

Blocking rules in SPlink are used to improve the efficiency of the record linkage process. They group records into "blocks" based on shared attributes, which reduces the number of comparisons required. For example, records with the same zip code or last name may be grouped together. The code block below shows an example blocking implementation:

from op_splink.duckdb.blocking_rule_library import block_on  # just import from op_splink instead of splink

block_by_firstname = block_on("firstname")
block_by_firstname_and_lastname = block_on(["substr(firstname, 1, 1)", "lastname"])

ag_print(block_by_firstname_and_lastname.sql)  # check the SQL query generated by the blocking rule
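To build intuition for why blocking cuts down the work, here is a stdlib-only sketch showing how grouping on a blocking key shrinks the comparison space (the records and key are illustrative, not op_splink code):

```python
# Illustrative only: how blocking shrinks the comparison space.
# Grouping on a blocking key means pairs are only formed within each block.
from itertools import combinations
from collections import defaultdict

records = [
    {"unique_id": 0, "lastname": "smith"},
    {"unique_id": 1, "lastname": "smith"},
    {"unique_id": 2, "lastname": "jones"},
    {"unique_id": 3, "lastname": "jones"},
]

# Without blocking: every pair of records is compared.
all_pairs = list(combinations([r["unique_id"] for r in records], 2))

# With blocking on lastname: pairs are only formed within each block.
blocks = defaultdict(list)
for r in records:
    blocks[r["lastname"]].append(r["unique_id"])
blocked_pairs = [p for ids in blocks.values() for p in combinations(ids, 2)]

# 6 candidate pairs without blocking, only 2 with it.
```

The saving grows quadratically with dataset size, which is why well-chosen blocking rules are essential for linking large tables.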

note

Currently, Antigranular only supports DuckDB backend.

Antigranular supports all the functions within blocking_rule_library and blocking_rule_composition.

SPlink's API documentation provides detailed guidance on defining and customising blocking rules for your specific dataset and use case.

Comparisons

In SPlink, comparisons involve assessing the similarity between records within each block created by the blocking rules. Various comparison methods, such as string similarity metrics or numerical comparisons, are applied to attribute pairs, enabling the system to calculate match probabilities. SPlink's sophisticated comparison techniques allow it to identify matching records accurately, contributing to the successful record linkage process while accommodating diverse data types and attributes.
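As a rough illustration of what a string similarity metric does in this setting, the following stdlib sketch scores two attribute pairs. It uses Python's difflib rather than SPlink's comparison levels, so treat it as an analogy, not the op_splink API:

```python
# Illustrative only: scoring attribute similarity with a stdlib string metric.
# SPlink's comparison levels play a similar role; this is not the op_splink API.
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a, b).ratio()

score_close = similarity("katherine", "catherine")  # likely the same name
score_far = similarity("katherine", "bob")          # clearly different
```

A high score between "katherine" and "catherine" suggests the records may refer to the same person despite the spelling difference, which is exactly the signal comparisons feed into the match probability calculation.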

The following code block shows an example:

import op_splink.duckdb.comparison_template_library as ctl

first_name_comparison = ctl.name_comparison("firstname")
ag_print(first_name_comparison.human_readable_description)

Antigranular supports all the different types of comparisons within the DuckDB library, along with custom comparisons.
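A custom comparison is expressed as a settings-style dictionary. The structure below follows Splink's published comparison specification, but treat the exact keys as an assumption to verify against the op_splink documentation for your version:

```python
# A custom comparison as a settings-style dictionary. The key names follow
# Splink's comparison spec; verify them against your op_splink version.
city_comparison = {
    "output_column_name": "city",
    "comparison_levels": [
        {
            "sql_condition": "city_l IS NULL OR city_r IS NULL",
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": "city_l = city_r",
            "label_for_charts": "Exact match",
        },
        {
            "sql_condition": "ELSE",
            "label_for_charts": "All other comparisons",
        },
    ],
}
```

Each level is evaluated in order, so the null check comes first and the ELSE level catches everything that did not match an earlier condition.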

Linker

The settings dictionary in SPlink is a configuration file in which users define the parameters and settings for their record linkage task. It includes specifications for blocking rules, comparison methods, and matching thresholds. The linker refers to the core component of SPlink that executes the linkage process based on the settings defined in the dictionary.

Together, the settings dictionary and linker form the heart of SPlink's ability to efficiently and accurately link records across datasets, offering flexibility and control in tailoring the linkage process to specific data and objectives.

Below is an example of how to use the linker from op_splink:

from op_splink.duckdb.linker import DuckDBLinker

settings = {
    "link_type": "link_only",
    "comparisons": [
        first_name_comparison
    ],
    "blocking_rules_to_generate_predictions": [
        block_by_firstname_and_lastname
    ]
}
linker = DuckDBLinker([df_1, df_2], settings)
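Under the hood, the linker estimates the m and u probabilities of the Fellegi-Sunter model and combines them into match weights. The sketch below shows that arithmetic with made-up values; it is a conceptual aid, not op_splink code:

```python
# Illustrative only: how m and u probabilities combine into match weights
# in the Fellegi-Sunter model underpinning SPlink. The values are made up.
import math

m = 0.9   # P(field agrees | records truly match)
u = 0.01  # P(field agrees | records do not match)

# Agreement on the field is strong evidence in favour of a match...
match_weight = math.log2(m / u)

# ...while disagreement is evidence against one.
non_match_weight = math.log2((1 - m) / (1 - u))
```

Summing these log-weights across all compared fields gives the overall match weight, which maps directly to the match probability used for thresholding.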

Predict and create linked DataFrame API

The predict_and_create_linked_df function transforms the output of predict (when two input dataframes are linked), removing internal information such as the gamma values of a column or the parent table. It cleans the prediction output and attaches the metadata for each column from the two input dataframes.

DuckDBLinker.predict_and_create_linked_df(
    threshold_match_probability: float = None,
    threshold_match_weight: float = None,
    materialise_after_computing_term_frequencies=True,
    left_sensitivity: int = 1,
    right_sensitivity: int = 1
) -> op_pandas.PrivateDataFrame

Here, left_sensitivity limits the number of times a single unique ID from the left dataframe can appear in the final linked dataframe.

Similarly, right_sensitivity limits the number of times a single unique ID from the right dataframe can appear in the final linked dataframe.
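Capping appearances this way bounds each row's influence on the output, which is what makes the differential privacy accounting tractable. A minimal stdlib sketch of such a cap (an illustration of the idea, not op_splink's internal implementation):

```python
# Illustrative only: capping how often each unique ID may appear in the
# linked output, in the spirit of left_sensitivity / right_sensitivity.
from collections import Counter

def cap_links(pairs, left_sensitivity=1, right_sensitivity=1):
    """Keep a pair only while neither ID has exhausted its appearance budget."""
    left_seen, right_seen = Counter(), Counter()
    kept = []
    for left_id, right_id in pairs:
        if (left_seen[left_id] < left_sensitivity
                and right_seen[right_id] < right_sensitivity):
            kept.append((left_id, right_id))
            left_seen[left_id] += 1
            right_seen[right_id] += 1
    return kept

links = cap_links([("a", 1), ("a", 2), ("b", 1)],
                  left_sensitivity=1, right_sensitivity=1)
```

With both sensitivities set to 1, only the first pair survives: "a" has used its left-side budget and 1 its right-side budget, so the later pairs are dropped.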

Once the linker model is fully trained by estimating the m and u values, you can use the op_splink API to create a linked PrivateDataFrame:

linked_df = linker.predict_and_create_linked_df(threshold_match_probability = 0.9)

Refer to SPlink's API documentation for the full list of supported linker methods.

In SPlink, the SplinkDataFrame type is a specialised data structure designed for efficiently handling and manipulating datasets during the record linkage process.

To extract the DataFrame from SplinkDataFrame, use the following API:

DuckDBDataFrame.as_private_dataframe(limit: int = None) -> PrivateDataFrame