API Reference
SPlink
SPlink is a versatile tool for accurate record linkage, leveraging probabilistic matching and customisable rules to identify and link records efficiently. AG's op_splink is a differentially private version of SPlink. This page showcases the methods and commands available in this library.
Preprocessing
Preprocessing in SPlink involves cleaning and standardising data from various sources, ensuring uniformity in attributes, and handling missing or erroneous values. This step is crucial to enhance the accuracy of probabilistic record linkage by reducing noise and improving data quality. As a result, it becomes easier for SPlink to identify matching records accurately across datasets.
SPlink requires that you clean your data and assign unique IDs to rows before linking. Access the official SPlink Data Prerequisites page for more information.
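As a sketch of this kind of preparation (plain Python with hypothetical field names, not an op_splink API), the snippet below standardises string fields, maps missing values to None, and assigns a unique_id to each row:

```python
# Illustrative preprocessing sketch. The record fields ("firstname",
# "lastname", "zipcode") are hypothetical examples, not a required schema.

def preprocess(records):
    """Clean a list of raw record dicts and assign a unique_id to each row."""
    cleaned = []
    for i, rec in enumerate(records):
        cleaned.append({
            "unique_id": i,  # SPlink requires a unique ID per row before linking
            "firstname": (rec.get("firstname") or "").strip().lower() or None,
            "lastname": (rec.get("lastname") or "").strip().lower() or None,
            "zipcode": (rec.get("zipcode") or "").strip() or None,  # missing -> None
        })
    return cleaned

raw = [
    {"firstname": " Alice ", "lastname": "SMITH", "zipcode": "90210"},
    {"firstname": "BOB", "lastname": "", "zipcode": None},
]
clean = preprocess(raw)
```

The same idea applies whatever tooling you use for cleaning: the essential outputs are consistent attribute formats and a stable unique ID column.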
Blocking
Blocking rules in SPlink are used to improve the efficiency of the record linkage process. They group records into "blocks" based on shared attributes, reducing the number of comparisons required. For example, records with the same zip code or last name may be grouped together. The code block below shows an example blocking implementation:
from op_splink.duckdb.blocking_rule_library import block_on  # import from op_splink instead of splink

block_by_firstname = block_on("firstname")
block_by_firstname_and_lastname = block_on(["substr(firstname, 1, 1)", "lastname"])

ag_print(block_by_firstname_and_lastname.sql)  # check the SQL query generated by the blocking rule
Currently, Antigranular only supports the DuckDB backend. Antigranular supports all the functions within blocking_rules_library and blocking_rule_composition.
SPlink's API documentation provides detailed guidance on defining and customising blocking rules for your specific dataset and use case.
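To illustrate why blocking helps (a plain-Python sketch, not op_splink code): grouping records by a shared key before comparing shrinks the candidate set from all n*(n-1)/2 pairs to only the pairs within each block.

```python
from itertools import combinations
from collections import defaultdict

records = [
    {"unique_id": 0, "lastname": "smith", "zipcode": "90210"},
    {"unique_id": 1, "lastname": "smith", "zipcode": "90210"},
    {"unique_id": 2, "lastname": "jones", "zipcode": "10001"},
    {"unique_id": 3, "lastname": "smith", "zipcode": "10001"},
]

# Without blocking: every record is compared with every other record.
all_pairs = list(combinations(records, 2))  # n*(n-1)/2 = 6 pairs

# With blocking on zipcode: only records sharing a zip code are compared.
blocks = defaultdict(list)
for rec in records:
    blocks[rec["zipcode"]].append(rec)
blocked_pairs = [p for block in blocks.values() for p in combinations(block, 2)]
# 6 candidate pairs without blocking shrink to 2 with blocking.
```

A blocking rule in SPlink expresses the same grouping as a SQL condition rather than an explicit loop, but the effect on the comparison count is the same.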
Comparisons
In SPlink, comparisons involve assessing the similarity between records within each block created by the blocking rules. Various comparison methods, such as string similarity metrics or numerical comparisons, are applied to attribute pairs, enabling the system to calculate match probabilities. SPlink's sophisticated comparison techniques allow it to identify matching records accurately, contributing to the successful record linkage process while accommodating diverse data types and attributes.
See the example in the following code block:
import op_splink.duckdb.comparison_template_library as ctl
first_name_comparison = ctl.name_comparison("firstname")
ag_print(first_name_comparison.human_readable_description)
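As a rough illustration of what a string comparison does (plain Python using the standard library's difflib, not the similarity metric SPlink uses internally), the sketch below scores two names and buckets the score into coarse comparison levels:

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Return a 0..1 similarity score between two name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def comparison_level(score):
    """Bucket a similarity score into coarse levels, loosely mirroring
    the exact / close / no-match levels of a name comparison."""
    if score == 1.0:
        return "exact match"
    if score >= 0.8:
        return "close match"
    return "no match"

print(comparison_level(name_similarity("Catherine", "Catherine")))  # exact match
print(comparison_level(name_similarity("Catherine", "Katherine")))  # close match
```

In SPlink, each comparison level contributes to the match probability computed for a record pair, rather than producing a hard label as in this sketch.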
Antigranular supports all the different types of comparisons within the DuckDB library, along with custom comparisons.
Linker
The settings dictionary in SPlink is a configuration file in which users define the parameters and settings for their record linkage task. It includes specifications for blocking rules, comparison methods, and matching thresholds. The linker refers to the core component of SPlink that executes the linkage process based on the settings defined in the dictionary.
Together, the settings dictionary and linker form the heart of SPlink's ability to efficiently and accurately link records across datasets, offering flexibility and control in tailoring the linkage process to specific data and objectives.
Below, you find an example of how to use the linker from op_splink:
from op_splink.duckdb.linker import DuckDBLinker

settings = {
    "link_type": "link_only",
    "comparisons": [
        first_name_comparison
    ],
    "blocking_rules_to_generate_predictions": [
        block_by_firstname_and_lastname
    ]
}

linker = DuckDBLinker([df_1, df_2], settings)
Predict and create linked DataFrame API
The predict_and_create_linked_df function transforms the output of predict (in the case of two input dataframes) and removes linkage metadata such as the gamma values of each column and references to the parent table. It cleans the prediction output and adds the metadata for each column from the two input dataframes.
DuckDBLinker.predict_and_create_linked_df(
threshold_match_probability: float = None,
threshold_match_weight: float = None,
materialise_after_computing_term_frequencies=True,
left_sensitivity: int = 1,
right_sensitivity: int = 1
) -> op_pandas.PrivateDataFrame
Here, left_sensitivity limits the number of times one unique ID from the left dataframe can appear in the final linked dataframe. Similarly, right_sensitivity limits the number of times one unique ID from the right dataframe can appear in the final linked dataframe.
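The effect of these caps can be sketched in plain Python (an illustration of the parameter semantics, not the op_splink implementation): each candidate link is kept only while neither side's ID has exhausted its allowance.

```python
from collections import Counter

def cap_links(links, left_sensitivity=1, right_sensitivity=1):
    """Keep each (left_id, right_id) link only while both IDs are under their caps."""
    left_used, right_used = Counter(), Counter()
    kept = []
    for left_id, right_id in links:
        if (left_used[left_id] < left_sensitivity
                and right_used[right_id] < right_sensitivity):
            kept.append((left_id, right_id))
            left_used[left_id] += 1
            right_used[right_id] += 1
    return kept

# ID "a" appears twice on the left; with left_sensitivity=1 only its first link survives.
links = [("a", "x"), ("a", "y"), ("b", "y"), ("c", "z")]
capped = cap_links(links)
```

Bounding how often any one record can appear in the output is what makes the sensitivity of the linked dataframe controllable, which the differential privacy accounting relies on.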
Once the linker model is fully trained by estimating the m and u values, you can use the op_splink API to create a linked PrivateDataFrame:
linked_df = linker.predict_and_create_linked_df(threshold_match_probability = 0.9)
Below is a list of all supported methods within the linker:
estimate_parameters_using_expectation_maximisation
estimate_u_using_random_sampling
load_settings
load_model
predict_and_create_linked_df
Splink Dataframe
In SPlink, the SplinkDataFrame type is a specialised data structure designed for efficiently handling and manipulating datasets during the record linkage process.
To extract the DataFrame from SplinkDataFrame, use the following API:
DuckDBDataFrame.as_private_dataframe(limit: int = None) -> PrivateDataFrame