Evaluate record linkages


Today’s topic is how to evaluate the correctness of a linkage. Even with the few options described in the previous posts (such as field equality and Jaro-Winkler dissimilarity), a number of models were created, each with its own accuracy. Without a specialized interface, choosing the parameters alpha and beta, setting the true/false threshold, and deciding between boolean and confidence-based adjustment of the test are tough calls. Record linkage rarely operates with a gold standard: the “true” linkage is not available, so a linkage cannot be scored against a known solution.

The first step is to inspect the confidence level yielded by the battery of matching rules. This number is a normalization of the probability that the two records match.

Record linkage report with confidence level

The errors become more numerous as the confidence level decreases. This is a good sign: it means the scoring system used to perform the linkage is statistically sound.
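The same trend can be checked numerically once a sample of matches has been reviewed by hand. The sketch below is not part of the tool described in this post. It assumes a hypothetical list of (confidence, is_correct) pairs with confidences scaled to [0, 1]; the function name, the bin width and the sample data are illustrative only.

```python
from collections import defaultdict


def error_rate_by_confidence(reviewed, n_bins=10):
    """Group manually reviewed matches into confidence bins.

    reviewed: iterable of (confidence, is_correct) pairs, confidence in [0, 1].
    Returns a list of (bin_lower_bound, match_count, error_rate) tuples,
    ordered from the highest confidence bin to the lowest.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for confidence, is_correct in reviewed:
        # A confidence of exactly 1.0 is placed in the top bin.
        bin_index = min(int(confidence * n_bins), n_bins - 1)
        totals[bin_index] += 1
        if not is_correct:
            errors[bin_index] += 1
    return [
        (bin_index / n_bins, totals[bin_index], errors[bin_index] / totals[bin_index])
        for bin_index in sorted(totals, reverse=True)
    ]


# Made-up review sample, for illustration only.
sample = [
    (0.95, True), (0.92, True),
    (0.75, True), (0.72, False),
    (0.35, False), (0.31, False),
]

for lower_bound, count, rate in error_rate_by_confidence(sample):
    print(f"confidence >= {lower_bound:.1f}: {count} matches, error rate {rate:.0%}")
```

If the error rate does not grow as the bins move toward lower confidence, the scoring rules probably deserve a second look.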

Another GUI useful for the evaluation scenario is the linkage comparator. The user can save the state of linkage for different settings and compare any two of them.

To compare two linkages, the user must point to the entity to compare, choose two of the previously saved linkages, and decide which of the two matched entities (left or right) is the reference. In the example below, the saved linkages are those described in the previous post, plus the field equality example.

Record linkage comparison parameters

The outcome of the comparison is a four-list report:

  • The unchanged matches, with respect to the two entities. The confidence level may have changed.
  • Matches only belonging to the first linkage.
  • Matches encountered in the second linkage only.
  • Changed matches.

All lists but the first take a reference entity (either the left or the right) into consideration. For example, if the reference is set to the left entity, the second list consists of the left-entity records that are matched only in the first chosen linkage.
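The four lists can be sketched in a few lines of code. This is only an illustration of the idea, not the comparator's actual implementation: it assumes each saved linkage is a hypothetical dictionary keyed by the reference record id, with a (matched record id, confidence) tuple as value. All names and sample values are made up.

```python
from typing import Dict, Hashable, Tuple

# Hypothetical representation of a saved linkage:
# reference record id -> (matched record id, confidence)
Linkage = Dict[Hashable, Tuple[Hashable, float]]


def compare_linkages(first: Linkage, second: Linkage) -> Dict[str, dict]:
    """Split two linkages into the four lists of the comparison report."""
    unchanged, only_first, only_second, changed = {}, {}, {}, {}
    for ref_id, (match_1, conf_1) in first.items():
        if ref_id not in second:
            # The reference record is matched in the first linkage only.
            only_first[ref_id] = (match_1, conf_1)
            continue
        match_2, conf_2 = second[ref_id]
        if match_1 == match_2:
            # Same pair in both linkages; the confidence may differ.
            unchanged[ref_id] = (match_1, conf_1, conf_2)
        else:
            # The reference record is matched to a different record.
            changed[ref_id] = (match_1, conf_1, match_2, conf_2)
    for ref_id, pair in second.items():
        if ref_id not in first:
            # The reference record is matched in the second linkage only.
            only_second[ref_id] = pair
    return {
        "unchanged": unchanged,
        "only_first": only_first,
        "only_second": only_second,
        "changed": changed,
    }


# Made-up linkages, with the left entity as the reference.
linkage_a = {"L1": ("R1", 0.90), "L2": ("R2", 0.80), "L3": ("R3", 0.60)}
linkage_b = {"L1": ("R1", 0.95), "L3": ("R4", 0.70), "L4": ("R5", 0.50)}

for list_name, rows in compare_linkages(linkage_a, linkage_b).items():
    print(list_name, rows)
```

Switching the reference to the right entity amounts to keying both dictionaries by the right record id instead.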

The previous post showed a 100% correct linkage obtained with two methods: the first used the name column only, while the second employed all the available data. Although the two methods yielded the same result, the confidence levels were different:

Comparison of name only and all-data linkages

A more meaningful report results from comparing the two applications of Jaro-Winkler dissimilarity, with and without confidence adjustment based on the observed dissimilarity score:

Comparison of two linkage methods (part 1)
Comparison of two linkage methods (part 2)

This post described a basic reporting method for analyzing record linkage results. The reports help decide whether a change in the linkage model leads to better results.