Ministry of Business, Innovation and Employment | Senior Data Scientist
Jane is a Senior Data Scientist with the Risk Analytics and Data Science team at the Ministry of Business, Innovation and Employment (MBIE). She works with statistical analysis and data modelling to interpret complex data, identify trends and patterns, and draw meaningful conclusions to help organizations make data-driven decisions. Jane relishes the challenge of solving complex analytical problems, and has a particular interest in providing insights through visual presentation.
Abstract
Organizations across both the public and private sectors frequently manage datasets that lack unique identifiers, such as client numbers or national IDs. This absence makes it difficult to accurately link records that belong to the same individual or entity. Without reliable linkage, organizations face duplicated records, fragmented profiles, inconsistent reporting, and reduced data quality, all of which can undermine decision-making and service delivery.
Probabilistic record linkage provides a powerful approach to address these challenges by estimating the likelihood that two records refer to the same entity based on a combination of non-unique attributes, such as name, gender, and date of birth. Unlike deterministic methods, which require exact matches, probabilistic techniques allow for flexibility in handling typographical errors, missing values, and variations in data entry.
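To make the idea concrete, here is a minimal sketch of the classic Fellegi-Sunter style of scoring, written in plain Python rather than with Splink. Each attribute contributes a match weight based on two probabilities: m, the chance the attribute agrees when two records truly match, and u, the chance it agrees by coincidence when they do not. The m and u values, the prior match probability, the field names, and the sample records are all illustrative assumptions, not figures from the talk or from any real dataset. The sketch also compares values by exact agreement only, whereas a production system would typically grade partial similarity as well.

```python
import math

# Hypothetical (m, u) probabilities per attribute, chosen for illustration only.
# m = P(values agree | records are a true match)
# u = P(values agree | records are not a match)
PARAMS = {
    "first_name": (0.90, 0.01),
    "surname": (0.95, 0.005),
    "dob": (0.97, 0.001),
    "gender": (0.98, 0.5),
}

# Assumed prior probability that a randomly chosen pair of records is a match.
PRIOR_MATCH_PROBABILITY = 0.001


def match_weight(field, a, b):
    """Return the log2 Bayes factor contributed by one attribute."""
    if a is None or b is None:
        return 0.0  # a missing value is treated as carrying no evidence
    m, u = PARAMS[field]
    if a.strip().lower() == b.strip().lower():
        return math.log2(m / u)  # agreement: evidence for a match
    return math.log2((1 - m) / (1 - u))  # disagreement: evidence against


def match_probability(rec_a, rec_b):
    """Combine the prior and all attribute weights into a match probability."""
    prior_odds = PRIOR_MATCH_PROBABILITY / (1 - PRIOR_MATCH_PROBABILITY)
    total = math.log2(prior_odds) + sum(
        match_weight(field, rec_a.get(field), rec_b.get(field))
        for field in PARAMS
    )
    odds = 2 ** total
    return odds / (1 + odds)


# Made-up records: b has a typo in the first name, c is a different person.
a = {"first_name": "Jane", "surname": "Smith", "dob": "1985-03-12", "gender": "F"}
b = {"first_name": "Jan", "surname": "Smith", "dob": "1985-03-12", "gender": "F"}
c = {"first_name": "Jane", "surname": "Brown", "dob": "1990-07-01", "gender": "F"}

print(f"a vs b: {match_probability(a, b):.4f}")
print(f"a vs c: {match_probability(a, c):.4f}")
```

Under these assumed parameters, the pair with a single typo still scores as a likely match because agreement on surname and date of birth outweighs the one disagreement, while the pair that shares only a first name and gender scores close to zero. A deterministic rule requiring exact agreement on every field would have rejected both pairs.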
In this talk, we will show how Python, and specifically the Splink library, can be used to design and implement scalable probabilistic record linkage systems. We will walk through the core principles behind Splink, demonstrate its capabilities for handling large datasets, and share practical lessons learned from real-world implementation.