Jane Li

Ministry of Business, Innovation and Employment | Senior Data Scientist

Jane is a Senior Data Scientist with the Risk Analytics and Data Science team at the Ministry of Business, Innovation and Employment (MBIE). She uses statistical analysis and data modelling to interpret complex data, identify trends and patterns, and draw conclusions that help organizations make data-driven decisions. Jane relishes the challenge of solving complex analytical problems and has a particular interest in communicating insights through visual presentation.

Abstract

Connecting the Dots: From Gazette Automation to Probabilistic Entity Resolution

Organizations across both the public and private sectors frequently manage datasets that lack unique identifiers, such as client numbers or IDs. This absence makes it difficult to accurately link records that belong to the same individual or entity. Without reliable linkage, organizations face duplicated records, fragmented profiles, inconsistent reporting, and reduced data quality, all of which hinder operational efficiency and decision-making.

Probabilistic record linkage provides a powerful method to address these challenges by estimating the likelihood that two records refer to the same entity based on non-unique attributes such as name, gender, and date of birth. Unlike deterministic methods, which require exact matches, probabilistic techniques allow for flexibility in handling typographical errors, missing values, and variations in data entry.
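To make the idea concrete, the following is a minimal sketch of the classic Fellegi-Sunter scoring approach that underlies probabilistic record linkage. It is not the implementation used in the talk: the field names, the m and u probabilities, and the prior are hypothetical values chosen for illustration, and each field is compared by exact agreement only. Tools such as Splink estimate these parameters from the data and support fuzzy comparison levels to absorb typographical errors.

```python
import math

# Hypothetical parameters, for illustration only.
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "name": {"m": 0.95, "u": 0.01},
    "gender": {"m": 0.98, "u": 0.50},
    "date_of_birth": {"m": 0.97, "u": 0.001},
}


def match_weight(record_a, record_b, params=FIELD_PARAMS):
    """Sum the log2 Bayes factors over all compared fields.

    Fields that are missing (None) in either record are skipped,
    so they neither raise nor lower the score.
    """
    total = 0.0
    for field, p in params.items():
        a, b = record_a.get(field), record_b.get(field)
        if a is None or b is None:
            continue  # missing value: no evidence either way
        if a == b:
            total += math.log2(p["m"] / p["u"])
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))
    return total


def match_probability(weight, prior=0.001):
    """Convert a match weight into a probability, given a prior match rate."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


# Hypothetical example records.
rec_a = {"name": "jane li", "gender": "F", "date_of_birth": "1990-01-01"}
rec_b = {"name": "jane li", "gender": None, "date_of_birth": "1990-01-01"}
rec_c = {"name": "jane li", "gender": "F", "date_of_birth": "1985-06-30"}

for other in (rec_a, rec_b, rec_c):
    w = match_weight(rec_a, other)
    print(f"weight={w:.2f}  probability={match_probability(w):.4f}")
```

Agreement on a rarely coinciding field such as date of birth adds far more evidence than agreement on gender, which is why the third pair scores low despite matching on two of three fields.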

A practical example comes from Immigration New Zealand's work monitoring accredited employers who receive liquidation notices published in the New Zealand Gazette. Historically, this was an error-prone process that relied on manual text extraction and spreadsheet-based matching. We automated the workflow using APIs, Power BI, R (regular expressions), SharePoint, and Power Automate, turning hours of human processing into a near real-time data pipeline. Automation, however, exposed a deeper issue: directors listed in New Zealand Companies Office data do not have unique identifiers, making it difficult to track individuals across multiple companies.

In this talk, we will showcase the Gazette workflow and show how Python, and specifically the Splink library, can be used to design and implement scalable probabilistic record linkage systems. We will highlight how this linkage layer builds on the earlier automation work and share practical lessons from real-world implementation of both automation and probabilistic matching.