2015-11-04
Evaluate record linkages

Today’s topic is how to evaluate the correctness of a linkage. Even the few options described in the previous posts (such as field equality and Jaro-Winkler dissimilarity) already produced a number of models, each with its own accuracy. Without a specialized interface, choosing the parameters alpha and beta, setting the true/false threshold, and deciding between a boolean and a confidence-based adjustment of the test are tough calls.
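One common way to compare such models is to score them against a small set of hand-labelled pairs and see how precision and recall move with the true/false threshold. The sketch below assumes that setup; the function name `evaluate`, the sample scores and the labels are all made up for illustration, and the parameters alpha and beta from the earlier posts are not modelled here.

```python
from typing import Dict, Iterable, Tuple


def evaluate(scored_pairs: Iterable[Tuple[float, bool]], threshold: float) -> Dict[str, float]:
    """Compare thresholded scores against known labels.

    Each item is (similarity score, True if the pair really is the same entity).
    A pair is predicted to be a link when its score reaches the threshold.
    """
    tp = fp = fn = tn = 0
    for score, is_match in scored_pairs:
        predicted = score >= threshold
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            fp += 1
        elif not predicted and is_match:
            fn += 1
        else:
            tn += 1

    # Guard against division by zero when nothing is predicted or nothing matches.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labelled sample: (score, is_true_match)
sample = [(0.95, True), (0.90, False), (0.80, True), (0.40, False)]

for t in (0.5, 0.85, 0.93):
    print(t, evaluate(sample, t))
```

Running the same function over several thresholds gives a rough picture of the trade-off: a lower threshold usually finds more true links at the cost of more false ones.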

2015-10-31
Using Jaro-Winkler to enhance record linkage

The previous post relied exclusively on exact name matches. Of the ten names, only two matched exactly, and by coincidence one of those two was associated with the wrong record. Perfect equality therefore does not seem to yield the necessary accuracy; on the other hand, one can speculate that measuring the “degree of similarity” between the entries will yield better results. One such measure is the “Jaro-Winkler” distance from [1]. Despite the name, it behaves like a similarity score: it returns a value between zero and one, with one for equal strings and zero for totally different strings. The measure takes into account which characters the two strings share, the order in which they appear, and the position of the matching characters inside the string.
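To make the measure concrete, here is a minimal sketch of the textbook Jaro-Winkler computation in plain Python. It is not necessarily the implementation used in the post; the prefix scale of 0.1 and the maximum prefix length of 4 are the commonly quoted defaults and are assumptions here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for equal strings, 0.0 when no characters match."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order count as half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The prefix boost is what separates Jaro-Winkler from plain Jaro: two names that start the same way score higher than two names that differ by the same amount at the beginning.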

2015-10-25
Introduction to Record Linkage

Welcome to my record linkage blog!

Record linkage is a set of techniques for associating records from two or more databases that do not share common keys.

ACME Corporation stores customer data provided by the sales and support departments. Since there is no global concept of “Customer ID”, each department uses its own internal IDs while collecting data such as Customer Name, Customer Address, Customer Company and the list of registered products. Because the sales data is disconnected from the support data, the decision support system cannot identify the classes of customers that require the least support, even though these customers are preferred to high-maintenance customers.
