Welcome to my record linkage blog!
Record Linkage is a sum of techniques to associate records of two or multiple databases without sharing common keys.
ACME Corporation stores customer data provided by the sales and support departments. Since there is/was no global concept of “Customer ID”, both departments use their own internal IDs, while collecting data like Customer Name, Customer Address, Customer Company and the list of registered products. Since the data of sales is disconnected of support data, the decision support system cannot identify the classes of customers that require the less support, even if these costumers are preferred to high maintenance customers.
During the following posts I will present a few basic techniques of record linkage, as well as the features of a software I develop.
How long would it take to a human being to associate the 11 records of the two fictional tables below? Can you think of associating the records in less than a minute?
If instead 11 records there will be 11 thousand, the time to link the records will increase even more than 1000 times! Without a proper strategy, the matching time will increase by one million. This means 1 million minutes, almost two years.
Sales data
Name | Address | Company | Products |
Joel Smith | New York, NY | New York-onics | MegaProduct,SuperProd |
Alicia Thomson | 4789 Woodward Avenue Detroit | Detroitics | MegaProd,TeraProd |
James Jones | 4789 Woodward Ave. Detroit, MI | Detroitics | GigaProd,MegaProd |
Ram Kumar | New Orchard Road, Armonk, NY | International Business Machines | PetaProduct |
Jack Jones | One Microsoft Way Redmond | Microsoft USA | PetaProd |
Mary Kerry | 12345 Michigan Ave, Chicago IL | Chicagonics | MegaProd |
Mary Barry | GM Renaissance Center, Detroit, MI 48243 | General Motors | TeraProd,SuperProd |
James Stephenson | One Ford Way, Dearborn, MI | Ford | Petaprod |
Al Shepard | New York, NY | Newyorkonics | SuperProd,MegaProd |
Mike Taylor | Unknown | Lockheed-M | Tera Product |
James Jones | 1234 Woodward Ave. Detroit, MI | Moonlighting | Mega-Prod |
Support data
Name | Address | Company | Products |
Jackob Jones | One Microsoft Way Redmond, WA 98052-7329 | Microsoft | Peta Product |
James (Jim) Stephenson | One Ford Way, Dearborn, MI 48126 | Ford | Petaprod |
Alicia Shepard | 1234 56 St. New York, NY | Newyorkonics | ??? |
Jim Jones | 1234 Woodward Ave. Detroit, MI | Moonlighting, Inc | MegaProd |
Joe Smith | 1234 56 St. New York, NY | Newyorkonics | SuperProd,MegaProd |
Mary Barry | Renaissance Center, Detroit | GM | Tera product |
Ramkrishna (Ram) Kumar | New Orchard Road, Armonk, New York 10504 | IBM | GigaProduct, PetaProduct |
James Jones | 4789 Woodward Ave. Detroit, MI | Detroitics | GigaProd |
Michael Taylor | Classified, Bethesda, MD | Lockheed Martin | TeraProd |
Alice Thompson | 4789 Woodward Ave. Detroit, MI | Detroitics | MegaProduct,TeraProd |
Modern record matching techniques are able to link even hundreds of millions of records in reasonable time.
During this post I have only described the problem of record linkage, without attempting to provide any solution. The next posts will attempt different algorithms and reporting techniques to associate the provided sales and support data.