Introduction in Record Linkage


Welcome to my record linkage blog!

Record Linkage is a sum of techniques to associate records of two or multiple databases without sharing common keys.

ACME Corporation stores customer data provided by the sales and support departments. Since there is/was no global concept of “Customer ID”, both departments use their own internal IDs, while collecting data like Customer Name, Customer Address, Customer Company and the list of registered products. Since the data of sales is disconnected of support data, the decision support system cannot identify the classes of customers that require the less support, even if these costumers are preferred to high maintenance customers.

During the following posts I will present a few basic techniques of record linkage, as well as the features of a software I develop.

How long would it take to a human being to associate the 11 records of the two fictional tables below? Can you think of associating the records in less than a minute?

If instead 11 records there will be 11 thousand, the time to link the records will increase even more than 1000 times! Without a proper strategy, the matching time will increase by one million. This means 1 million minutes, almost two years.

Sales data

Name Address Company Products
Joel Smith  New York, NY New York-onics MegaProduct,SuperProd
Alicia Thomson 4789 Woodward Avenue Detroit Detroitics MegaProd,TeraProd
James Jones 4789 Woodward Ave. Detroit, MI Detroitics GigaProd,MegaProd
Ram Kumar New Orchard Road, Armonk, NY International Business Machines  PetaProduct
Jack Jones  One Microsoft Way Redmond Microsoft USA PetaProd
Mary Kerry 12345 Michigan Ave, Chicago IL Chicagonics MegaProd
Mary Barry GM Renaissance Center, Detroit, MI 48243 General Motors TeraProd,SuperProd
James Stephenson One Ford Way, Dearborn, MI Ford Petaprod
Al Shepard  New York, NY Newyorkonics SuperProd,MegaProd
Mike Taylor Unknown Lockheed-M Tera Product
James Jones 1234 Woodward Ave. Detroit, MI Moonlighting Mega-Prod

Support data

Name Address Company Products
Jackob Jones  One Microsoft Way Redmond, WA 98052-7329 Microsoft Peta Product
James (Jim) Stephenson One Ford Way, Dearborn, MI 48126 Ford Petaprod
Alicia Shepard 1234 56 St. New York, NY Newyorkonics ???
Jim Jones 1234 Woodward Ave. Detroit, MI Moonlighting, Inc MegaProd
Joe Smith 1234 56 St. New York, NY Newyorkonics SuperProd,MegaProd
Mary Barry Renaissance Center, Detroit GM Tera product
Ramkrishna (Ram) Kumar New Orchard Road, Armonk, New York 10504 IBM GigaProduct, PetaProduct
James Jones 4789 Woodward Ave. Detroit, MI Detroitics GigaProd
Michael Taylor Classified, Bethesda, MD Lockheed Martin TeraProd
Alice Thompson 4789 Woodward Ave. Detroit, MI Detroitics MegaProduct,TeraProd

Modern record matching techniques are able to link even hundreds of millions of records in reasonable time.

During this post I have only described the problem of record linkage, without attempting to provide any solution. The next posts will attempt different algorithms and reporting techniques to associate the provided sales and support data.