Vaadin 10 minimal project

A few days ago, I started exploring Vaadin 10, the shiny new version of Vaadin.

Vaadin 10 comes with a new library of controls, called “flow”. The old code remains functional, as long as the packages of the control classes are renamed.

Surprisingly, the old approach of starting a new project from an Apache Maven archetype is no longer available. Probably for marketing reasons (that’s my personal speculation), getting a new Vaadin project now requires signing up on their web site and using the online wizard. The starter project is a fully fledged application composed of two tabs, and it comes with CSS, Polymer templates and HTML. The good news is that it compiles and runs on the first attempt, which is to be expected, since it has no database or other external dependencies. The bad news is the increased complexity of the “bare bones” application. How complex will such a project be once it reaches production capabilities?

Set Match

Most record linkage techniques operate on two tables and define a set of association rules. Is this the limit of what can be done? Well, my software goes beyond independent tables, exploring the relational dimension of data. This post is about matching sets of records associated with one data record.

Evaluate record linkages

Today’s topic is how to evaluate the correctness of a linkage. Even with the few options described in the previous posts (such as field equality and Jaro-Winkler dissimilarity), a number of models were created, each with its own specific accuracy. Without a specialized interface, the choice of the parameters alpha and beta, the true/false threshold, and the decision between boolean and confidence-based adjustment of the test are tough calls.
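To make the threshold question concrete, here is a minimal sketch of scoring one linkage model against a hand-checked set of true links. The function name, the record IDs and the scores are all made up for illustration, and plain accuracy is only one of several possible quality measures; the parameters alpha and beta mentioned above are not modelled here.

```python
def accuracy_at_threshold(scored_pairs, true_links, threshold):
    """Fraction of candidate pairs classified correctly at a given threshold.

    scored_pairs: dict mapping (left_id, right_id) -> similarity score in [0, 1]
    true_links:   set of (left_id, right_id) pairs known to be real matches
    """
    if not scored_pairs:
        raise ValueError("no candidate pairs to evaluate")
    correct = 0
    for pair, score in scored_pairs.items():
        predicted_link = score >= threshold
        actual_link = pair in true_links
        if predicted_link == actual_link:
            correct += 1
    return correct / len(scored_pairs)


# Hypothetical candidate pairs (sales ID, support ID) with made-up scores.
scored_pairs = {
    ("s1", "p1"): 0.96,
    ("s2", "p2"): 0.81,
    ("s3", "p7"): 0.85,
    ("s4", "p4"): 0.40,
}
# Hypothetical ground truth, e.g. verified by hand.
true_links = {("s1", "p1"), ("s2", "p2"), ("s4", "p4")}

for threshold in (0.5, 0.8, 0.9):
    print(threshold, accuracy_at_threshold(scored_pairs, true_links, threshold))
```

Running the same function over several thresholds (or several models) gives a simple table to compare, which is roughly what a specialized interface would automate.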

Using Jaro-Winkler to enhance record linkage

The previous post relied exclusively on exact name matches. Out of the ten names, only two were perfectly equal, and by coincidence one of those two was associated with the wrong record. It seems that perfect equality does not yield the necessary accuracy; on the other hand, one can speculate that measuring the “degree of similarity” between the entries will yield better results. One such measure is the “Jaro-Winkler” distance from [1]. This measure returns a value between zero and one, with one for equal strings and zero for totally different strings. It factors in the similarity of the letters of the strings, their order, and the position of similar characters inside the string.
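As an illustration, the measure can be sketched in a few lines of Python. This is a textbook-style version, not necessarily the exact variant from [1]: the prefix weight of 0.1 and the maximum prefix length of 4 are the commonly quoted defaults and are assumptions here, and the sketch applies the prefix bonus unconditionally, whereas some implementations only apply it above a minimum Jaro score.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for equal strings, 0.0 when nothing matches."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters count as a match only if their positions
    # are no farther apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_weight: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity with a bonus for a shared prefix (Winkler's adjustment)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1.0 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The two names in the usage line differ by one swapped pair of letters, so they score close to one even though an exact comparison would reject them.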

Introduction to Record Linkage

Welcome to my record linkage blog!

Record linkage is a set of techniques for associating records from two or more databases that do not share common keys.

ACME Corporation stores customer data provided by the sales and support departments. Since there is (or was) no global concept of “Customer ID”, both departments use their own internal IDs while collecting data such as Customer Name, Customer Address, Customer Company and the list of registered products. Since the sales data is disconnected from the support data, the decision support system cannot identify the classes of customers that require the least support, even though these customers are preferable to high-maintenance customers.
