Ministry of Business, Innovation and Employment | Data Scientist
Nilani is a Data Scientist with the Data Science team at the Ministry of Business, Innovation and Employment (MBIE). Coming from an academic background, she has used Python to solve a variety of statistical, analytics, and data science problems. Nilani is committed to making sophisticated data science concepts accessible to non-technical stakeholders, enabling organizations to leverage data for measurable business outcomes.
Abstract
Organizations across both the public and private sectors frequently manage datasets that lack unique identifiers, such as client numbers or national IDs. This absence makes it difficult to accurately link records that belong to the same individual or entity. Without reliable linkage, organizations face duplicated records, fragmented profiles, inconsistent reporting, and reduced data quality, all of which can negatively impact decision-making and service delivery.
Probabilistic record linkage provides a powerful approach to address these challenges by estimating the likelihood that two records refer to the same entity based on a combination of non-unique attributes, such as name, gender, and date of birth. Unlike deterministic methods, which require exact matches, probabilistic techniques allow for flexibility in handling typographical errors, missing values, and variations in data entry.
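To make the idea concrete, here is a minimal sketch of the match-weight scoring that underlies probabilistic linkage in the Fellegi-Sunter tradition. It is not Splink's API. The field names, the m and u probabilities in `FIELD_PARAMS`, the prior, and the sample records are all hypothetical values chosen for illustration; in practice these parameters would be estimated from the data. For simplicity, each field is scored as an exact agreement or disagreement, so a typo counts as a disagreement on that field but can still be outweighed by agreement on the others.

```python
import math

# Hypothetical parameters, for illustration only (not estimated from real data).
# m = P(field agrees | the two records refer to the same entity)
# u = P(field agrees | the two records refer to different entities)
FIELD_PARAMS = {
    "name": {"m": 0.90, "u": 0.01},
    "gender": {"m": 0.98, "u": 0.50},
    "date_of_birth": {"m": 0.95, "u": 0.001},
}


def match_weight(record_a, record_b, params=FIELD_PARAMS):
    """Sum the log2 likelihood ratios over all fields present in both records."""
    weight = 0.0
    for field, p in params.items():
        a, b = record_a.get(field), record_b.get(field)
        if a is None or b is None:
            # Missing value: this field contributes no evidence either way.
            continue
        if a == b:
            # Agreement is evidence for a match (ratio m / u).
            weight += math.log2(p["m"] / p["u"])
        else:
            # Disagreement is evidence against a match (ratio (1 - m) / (1 - u)).
            weight += math.log2((1 - p["m"]) / (1 - p["u"]))
    return weight


def match_probability(weight, prior=0.01):
    """Convert a match weight to a probability, given an assumed prior match rate."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


# Hypothetical records: b differs from a only by a typo in the name.
a = {"name": "Jane Smith", "gender": "F", "date_of_birth": "1990-01-01"}
b = {"name": "Jane Smyth", "gender": "F", "date_of_birth": "1990-01-01"}
c = {"name": "John Brown", "gender": "M", "date_of_birth": "1975-06-30"}

for label, other in [("a vs b", b), ("a vs c", c)]:
    w = match_weight(a, other)
    print(f"{label}: weight={w:.2f}, probability={match_probability(w):.3f}")
```

With these assumed parameters, the pair that differs only by a typo still receives a positive match weight, while the unrelated pair receives a strongly negative one. Production tools typically refine this by adding partial-agreement levels (for example, based on string similarity) rather than a binary agree/disagree per field.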
In this talk, we will showcase how Python, and specifically the Splink library, can be leveraged to design and implement scalable probabilistic record linkage systems. We will walk through the core principles behind Splink, demonstrate its capabilities for handling large datasets, and share practical lessons learned from real-world implementation.