remadder software tutorial - how to use remadder software for successful records matching

56
Homepage: http://remaddersoft.wix.com/remadder ReMaDDer Software Tutorial How to use Remadder software for successful records matching, data cleansing and data deduplication projects 5/10/2016 Revision 1.1.

Upload: matalab

Post on 29-Jul-2016

226 views

Category:

Documents


0 download

DESCRIPTION

Remadder is an affordable and powerful record linkage and data cleansing software, with great fuzzy record matching and data deduplication capabilities. It's user-friendly graphical interface provides intuitive means for projects creation, raw data import and solutions definition, while server-side PostgreSQL database, placed in cloud, provides mighty data processing and fuzzy match record linkage engine that can process and solve even the most complex fuzzy match analysis in reasonable time. Remadder uses an inventive and pragmatic approach for fuzzy match analysis, by utilizing combination of two different string comparison similarity metrics: normalized Levenshtein distance and trigram similarity. These functions can be combined in flexible way in order to provide best results.

TRANSCRIPT

Page 1: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

Homepage: http://remaddersoft.wix.com/remadder

ReMaDDer Software Tutorial How to use Remadder software for successful records matching, data

cleansing and data deduplication projects

5/10/2016

Revision 1.1.

Page 2: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 1 / 55

Table of Contents

Introduction ........................................................................................................................ 3

What Is ReMaDDer Software .......................................................................................... 3

Fuzzy Match ..................................................................................................................... 3

Records Linkage .............................................................................................................. 3

Data Deduplication .......................................................................................................... 3

Remadder Software Advantages ..................................................................................... 4

Prerequisites .................................................................................................................... 4

Revision History .............................................................................................................. 4

Projects ................................................................................................................................ 5

Record Matching Project vs. Data Deduplication Projects ............................................. 6

Copy A Project ................................................................................................................. 6

Raw Data Import ................................................................................................................. 7

“Left” and “Right” datasets .............................................................................................. 8

Import Raw Data ............................................................................................................. 8

Browse And Choose CSV files ...................................................................................................................... 8

Register CSV Files ........................................................................................................................................ 9

Edit Raw Datasource Schema Information .............................................................................................. 10

Preprocess Raw Datasource ....................................................................................................................... 11

Import Data From Raw Datasources ........................................................................................................ 12

Solution Definition ............................................................................................................ 13

String Similarity Metrics ............................................................................................... 15

Normalized Levenshtein Distance or Similarity Function ...................................................................... 15

Trigram Similarity Function ..................................................................................................................... 16

Page 3: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 2 / 55

Trigram Similarity Operator ..................................................................................................................... 17

Remadder Similarity Function .................................................................................................................. 17

Solution Definition Header ............................................................................................ 17

Characteristics of Levenshtein and Trigram Similarity Functions ......................................................... 21

Composite (Concatenated) Field ............................................................................................................... 21

Similarity Threshold Value ........................................................................................................................ 23

Inclusive “OR” vs. Exclusive “AND”.......................................................................................................... 23

Solution Definition Details ............................................................................................ 26

Fields Picker ............................................................................................................................................... 27

Solution Constraints .................................................................................................................................. 29

Solution Execution ............................................................................................................ 33

SQL Query Preparation ................................................................................................. 34

Prepare SQL Query With Composite Fields (Re)Creation ...................................................................... 37

Prepare Only SQL Query ........................................................................................................................... 37

Data Retrieving And Storing ............................................................................................. 37

Execute SQL Query ........................................................................................................ 39

Solution Query Preparation And Resultset Retrieval Status Info ................................ 40

Save And Load Resultset ............................................................................................... 41

Review And Edit Resultset ............................................................................................ 42

Resultset Browsing .................................................................................................................................... 42

Resultset Edit And Review ........................................................................................................................ 47

Exporting Resultset................................................................................................................................... 48

Customize Data Grids ........................................................................................................ 52

Customize Splitters ........................................................................................................... 53

Remadder Software Trial .................................................................................................. 53

Commercial Release Code Purchase And Activation ........................................................ 54

Page 4: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 3 / 55

ReMaDDer Software Tutorial

How to use Remadder software for successful records matching, data cleansing and

data deduplication projects

Introduction

What Is ReMaDDer Software Remadder is an affordable and powerful record linkage and data cleansing software, with great fuzzy record

matching and data deduplication capabilities.

It's user-friendly graphical interface provides intuitive means for projects creation, raw data import and

solutions definition, while server-side PostgreSQL database, placed in cloud, provides mighty data

processing and fuzzy match record linkage engine that can process and solve even the most complex fuzzy

match analysis in reasonable time.

Remadder uses an inventive and pragmatic approach for fuzzy match analysis, by utilizing combination of

two different string comparison similarity metrics: normalized Levenshtein distance and trigram similarity.

These functions can be combined in flexible way in order to provide best results.

The name “ReMaDeDer” is an acronym for “Records Matching and Data Deduplication Software”.

Homepage: http://remaddersoft.wix.com/remadder

Fuzzy Match Term “fuzzy match” refers to methods of identifying related records by measuring how similar they are. It

is used in cases where no unique identifier or exact match relation exists between two sets of data.

Fuzzy matching uses weights to calculate the probability that two given records refer to the same entity.

Record pairs with probabilities above a certain threshold are considered to be matches, while pairs with

probabilities below threshold are considered to be non-matches.

Fuzzy matching attempts to find a match which, although not a 100 percent match, is above the threshold

matching percentage set by the application.

Records Linkage Record linkage refers to the task of finding records in a data set that refer to the same entity across different

data sources, i.e. to identify related records in two separate data sets.

Record linkage is necessary when joining data sets is based on entities that may or may not share a common

identifier, as may be the case due to differences in record shape, storage location, and/or curator style or

preference.

Data Deduplication

Page 5: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 4 / 55

Data deduplication refers to identifying duplicate records in a dataset and cleansing datasets from

redundant information.

Remadder Software Advantages Due to its inherent complexity, fuzzy match analysis is a popular subject of scientific research and academic

papers. Some of the researchers even tend to build their own software, but those programs suffer from their

complexity and it is really necessary to understand all the advanced mathematics and algorithms in order

to be able to use it.

This is not something that can be expected from an average user facing data linkage problem and wanting

to be able to solve it in matter of hours.

On the other hand, there are huge corporate entity resolution framework solutions produced by big software

companies oriented toward huge corporate customers. These solutions are often very complex and

affordable only to big companies and corporate users.

Remadder places itself in the middle and provides powerful fuzzy match records linkage solution for mere

mortals and regular office users.

By allowing users to define exact matching constraints, fuzzy matching constraints and all other constraints

in a visual and intuitive way, all the complexity of the fuzzy match analysis is hidden from the user and

he/she can focus on the business case, rather than technical issues. That is where Remadder software really

shines and clearly distinguishes itself from competition.

Remadder uses an inventive and pragmatic approach for fuzzy match analysis, by utilizing combination of

two different string comparison similarity metrics: normalized Levenshtein distance and Trigram

similarity. These can be combined in flexible way in order to provide best results.

By its intuitive graphical user interface and low pricing, Remadder provides superb solution for fuzzy match

records linkage for any business case.

Prerequisites Major prerequisite to use Remadder is active internet connection, since the raw data is imported to remote

PostgreSQL server and actual query execution also takes place on the server. After trial period expires, you

are required to purchase commercial release code in order to be able to continue using remote server.

However, project and solution creation and editing can be performed even without established connection

and purchased release code, since these data are stored locally on your computer.

Revision History Revision Date Change Description

1.0. 3/20/2016 Initial release. Tutorial covers Remadder version 1.0. 1.1. 5/10/2016 Document is updated to reflect changes and improvements brought by

Remadder version 1.1. New version brings many improvements and simplifies solution definition. Instead of separarately choosing and defining thresholds for trigram similarity and levenshtein distance functions, a new, combined,

Page 6: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 5 / 55

common similarity function (remadder_similarity) is now introduced that combines both trigram and levenshtein similarity properties. This reduces complexity and uncertainty in solution definition creation, retaining Remadder strength and advantages. Previous Remadder version has been outputting all columns from left and right dataset into resultset. Now, you can choose which fields are to be included in resultset. Raw data import process is also much improved, especially regarding importing data from Excel files (in CSV format) where column names contain non-ascii characters and blanks. There are many small performance improvements and several bugfixes that will improve user experience when using the ReMaDDer software for data match analysis.

Projects

You can easily create new project, which are stored locally on your computer.

Page 7: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 6 / 55

Each project contains definition of two datasets to be compared and linked (so called "left dataset" and

"right dataset"), as well as variable number of corresponding solutions.

Solutions are named and stored definitions of how to perform fuzzy match analysis.

On creation, each project is assigned unique project tag. During raw data importing to server,

corresponding input tables get that tag appended in their name. This way, imported tables are always tagged

by the project name.

There are two views of projects. On the top section, there is a datagrid listing all projects, while on bottom

section there is a form view of currently selected project.

Record Matching Project vs. Data Deduplication Projects There is no fundamental difference between data deduplication and records matching projects. In both

cases we compare two datasets trying to establish which records from “left” dataset correspond to which

records in “right” dataset.

The only difference between the two is that in case of records matching project we have two different

datasets to be compared, while in case of data deduplication project we have to compare a dataset with itself

in order to identify duplicates.

Copy A Project Instead of manually entering all the parameters for new projects, Remadder allows you to copy an existing

project into another project. This action copies raw data import specifications as well as solution definitions.

Page 8: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 7 / 55

Raw Data Import

Datasets to be analyzed are called "left" and "right" datasets and can be easily imported from CSV files.

The CSV file format ("Comma Separated Values") is chosen due to its ubiquity and since all databases

and spreadsheet editors, as well as all other data sources can be easily exported to a csv file.

Remadder provides simple and intuitive tool for importing csv files. It will automatically detect

field’s delimiter, encoding and columns schema information. You can then edit the retrieved schema and

finally import the files on server for further processing.

Page 9: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 8 / 55

“Left” and “Right” datasets In each data deduplication or record matching project we always compare two datasets for matching of

records. In case of record matching projects, these two datasets correspond to two different input CSV files,

while in case of data deduplication projects, these two datasets are imported from the same input CSV file.

Nevertheless, we always have so-called “left dataset” and “right dataset” to be compared. Think of this like

if you wish to match fingers from left and right hand. You can easily identify thumb on the left hand to be

related to the thumb on the right hand, since they share similar shape. It is obvious due to their physical

similarity.

It is similar with fuzzy match analysis where we compare fields from left and right dataset in order to

identify string similarities.

Import Raw Data Process of importing raw data into server database consists of two logical phases. In first phase we need to

specify “left” and “right” dataset schema, while in the second phase we actually perform schema and data

import from input files, according to the previously defined schema.

On “Data Import” page, there are two subpages: “Left Dataset Specification” and “Right Dataset

Specification” in which we separately define input dataset specifications for “left” and “right” dataset.

Import can be executed separately or both datasets can be imported in batch.

Browse And Choose CSV files

Page 10: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 9 / 55

First step in importing input CSV files is to choose CSV files to be imported.

On upper part of “Left Dataset Specification” or “Right Dataset Specification” subpage, there is a CSV file

browser dialog box.

You can browse CSV files on your computer by clicking on the browse button . This opens a file

browser in which you can choose a CSV file. The absolute file path is then copied to the edit box.

Register CSV Files

Next step in importing CSV files is to define CSV file schema specification. We call this process “registering

CSV file”.

By clicking “Register CSV file” button near the file browser, the browsed CSV

file is examined for its columns and it’s schema information is then inserted in the corresponding list of

fields (columns).

Page 11: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 10 / 55

As you can see, Remadder determines field delimiter in CSV file (normally it is either “;” or “,”) and retrieves

information about columns.

If a column name has upper case characters, it is converted to lower case.

Currently, Remadder treats all columns as text fields of various length. This is due fact that the comparison

is performed by using string comparison functions (Levenshtein distance and trigram similarity) and other

type of data fields (e.g. datetime, integer, real etc.) would not really make sense for string comparisons.

Edit Raw Datasource Schema Information

Once you retrieved schema information from a CSV file, you might conclude that you don’t want to import

all columns, but only a subset.

You can edit the schema by using corresponding data grid navigator buttons.

If you wish to delete currently selected field from schema, just click delete button.

Page 12: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 11 / 55

If you wish to regain original columns schema, just click “Get Fields Schema”

button and the columns list will be repopulated from the CSV file.

Preprocess Raw Datasource

While defining import schema, you might realize that input data need some preprocessing before importing

to server for further analysis.

Of course, you can edit input CSV files by using any spreadsheet editor (such as LibreOffice or OpenOffice

Calc, Gumeric or Miscrosoft Excel) or textual editor (such as Notepad, Notepad ++, ConText, Gedit, Geany

or Leafpad), but you can also use an embedded Remadder’s spreadsheet editor.

You can launch external default spreadsheet editor by clicking the button “Open CSV File in Ext.

Editor” .

You can launch the embedded spreadsheet editor by clicking button “Open CSV File In Int. Editor”

. This will open the embedded spreadsheet editor “Spready”.

Page 13: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 12 / 55

Import Data From Raw Datasources

Final step is actual import execution.

We can execute import separately for left and right datasets, by clicking corresponding buttons “Import

left dataset CSV file” or “Import right dataset CSV file” or we can import them both at once by

clicking the button “Import both CSV files to server”.

When you click the import button, Remadder will automatically open the “Import Log” page, where you

can watch import process progress.

Page 14: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 13 / 55

Import speed depends on the file size and most importantly, internet connection quality.

Solution Definition

A solution represents definition of parameters for performing record linkage or data deduplication analysis

for a project. Each project can have many solutions, with different parameters, thus you can test which

combination of parameters lead to best results.

Each solution definition consists of several sections: solution header section, exact match relations section,

fuzzy match relations section and other constraints section.

Solution header definition in Remadder v 1.0.:

Page 15: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 14 / 55

Solution header definition in Remadder v 1.1.:

Page 16: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 15 / 55

String Similarity Metrics There are many algorithms used for fuzzy match analysis. The Remadder software uses combination of two

such functions: normalized Levenshtein distance function and trigram similarity function. Besides these

two function, there is also a trigram similarity operator “%” used, which complements these two functions.

UPDATE: In Remadder verion 1.1. normalized Levenshtein distance and trigram similarity functions are

replaced by common similarity function (remadder_similarity).

remadder_similarity(left_string, right_string)=(levenshtein_similarity (left_string, right_string) +

trigram_similarity(left_string, right_string))/2

Normalized Levenshtein Distance or Similarity Function

In information theory and computer science, the Levenshtein distance is a string metric for measuring the

difference between two sequences. Informally, the Levenshtein distance between two words is the

minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change

one word into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965.

Levenshtein distance function returns an integer value, which is the minimum number of single-character

edits (insertion, deletion, substitution) required to transform one string to another. The bigger the integer,

the weaker is similarity between the two strings.

However, the retrieved Levenshtein integer has particular meaning only for certain pair of strings, because

interpretation of its’ value depends of the string lengths. Thus it is difficult to say whether certain integer

value means “similar” or “dissimilar” until you know the string lengths.

That’s the reason why Remadder does not use original Levenshtein distance function, but rather it’s

normalized variant which is defined as Levenshtein distance value divided by maximum string length

(whether left or right dataset string is longer).

Normalized Levenshtein distance function returns a real (decimal) value between zero (0) and one (1),

where 0 means absolute similarity (meaning “identical”) while 1 means absolute dissimilarity (meaning

“completely dissimilar”).

This way, we have a relative metrics that has inherent meaning and different datasets can be compared with

same Levenshtein distance metrics.

Instead of distance function, we could use derived similarity function. The normalized Levenshtein

similarity is (1 - Normalized Levenshtein distance).

More generally, normalized_similarity=(1 – normalized_distance).

NOTE: For sake of user interface simplification and more straightforward solution definition process,

Levenshtein function utilization is removed from GUI in Remadder version 1.1., in favour of the new

remadder_similarity function, which combines both normalized Levenshtein similarity and trigram

similarity function in one common similarity function.

However, you can still use Levenshtein functions directly, by manually replacing remadder_similarity

function inside the generated SQL.

Page 17: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 16 / 55

You can use the following Levenshtein functions:

a) Levenshtein distance functions

See: http://www.postgresql.org/docs/9.5/static/fuzzystrmatch.html

levenshtein(text left_string, text right_string, int insert_cost, int deletion_cost, int substitution_cost) returns int

levenshtein(text left_string, text right_string) returns int

levenshtein_less_equal(text left_string, text right_string, int insertion_cost, int deletion_cost, int substitution_cost, int

max_d) returns int

levenshtein_less_equal(text left_string, text right_string, int max_d) returns int

Both source and target can be any non-null string, with a maximum of 255 characters. The cost parameters

specify how much to charge for a character insertion, deletion, or substitution, respectively. You can omit

the cost parameters, as in the second version of the function; in that case they all default to 1.

levenshtein_less_equal is an accelerated version of the Levenshtein function for use when only small

distances are of interest. If the actual distance is less than or equal to max_d,

then levenshtein_less_equal returns the correct distance; otherwise it returns some value greater

than max_d. If max_d is negative then the behavior is the same as levenshtein.

b) Normalized Levenshtein distance function

levenshtein_norm(text left_string, text right_string) returns real

Normalized Levenshtein distance is Levenshtein distance divided by maximu length of the strings being

compared.

a) Normalized Levenshtein similarity function

levenshtein_similarity(text left_string, text right_string) returns real

Levenshtein similarity is actually (1 – normalized levenshtein distance).

Trigram Similarity Function

A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of

two strings by counting the number of trigrams they share. Basically, trigrams break a word up into 3-letter

sequences, which can then be used to do similarity matches. For instance, in the word "hello", we have the

following trigrams: "h", "he", "hel", "ell", "llo", "lo", and "o". Now look at the word "hallo". It has the

following trigrams: "h", "ha", "hal", "all", "llo", "lo", and "o". You can see that there are several trigrams

that match, and several that don't match. The more trigrams you have in common the closer the match.

The trigram similarity function returns a real number that indicates how similar the two arguments are.

The range of the result is from zero (0), indicating that the two strings are completely dissimilar, to one (1),

indicating that the two strings are identical.

Page 18: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 17 / 55

NOTE: For sake of user interface simplification and more straightforward solution definition process,

trigram similarity function utilization is removed from GUI in Remadder version 1.1., in favour of the new

remadder_similarity function, which combines both normalized Levenshtein similarity and trigram

similarity function in one common similarity function.

However, you can still use trigram similarity functions directly, by manually replacing remadder_similarity

function inside the generated SQL.

You can use the following trigram functions:

(see: http://www.postgresql.org/docs/9.5/static/pgtrgm.html)

similarity(text left_string, text right_string) returns real

trigram_similarity(text left_string, text right_string) returns real

Functions similarity and trigram_similarity have completely the same behavior and are synonyms.

Trigram Similarity Operator

Besides trigram similarity function, for which we define an arbitrary threshold, Remadder also uses trigram

similarity operator “%”, which performs preliminary narrowing down solution space to a subset of possible

combinations. Similarity operator is used to narrow down number of combination for resource extensive

fuzzy match analysis, in order to increase processing speed.

Similarity operator can be understood as applying trigram similarity function with preset threshold of 0.3.

If such similarity function result would be above 0.3 the records combination is included in further analysis

with trigram similarity function and Levenshtein distance function. Otherwise, if applied similarity function

result would be below the 0.3, the records pair is not considered in further analysis.

Remadder Similarity Function

Starting from version 1.1. Remadder software uses common similarity function called

remadder_similarity, which incorporates both normalized Levenshtein similarity function and trigram

similarity function. The remadder similarity function is simply arithmetic mean of the two previously used

functions.

remadder_similarity(left_string, right_string)=(levenshtein_similarity (left_string, right_string) +

trigram_similarity(left_string, right_string))/2

Solution Definition Header Solution header contains general info about a solution definition and status.

Solution definition header in Remadder version 1.0.:

Page 19: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 18 / 55

Page 20: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 19 / 55

Solution definition header in Remadder version 1.1.:

Page 21: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 20 / 55

Solution definition can be entered either through data grid or through form view which shows currently

selected solution.

Page 22: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 21 / 55

We have to set corresponding threshold values for similarity function used for individual field pairs

(weighted fields) and for composite field pair (if it is used), in order to define acceptance criteria. Acceptance

criterion for record match is that the similarity function returns value that is equal or greater than the

threshold value.

Similarity(LeftString, RightString) >= Threshold ACCEPTED

Similarity(LeftString, RightString) < Threshold REJECTED

We can decide whether to use composite (concatenated) fields or not.

Also we can choose whether to use inclusive OR or exclusive AND operators while constructing SQL query.

Characteristics of Levenshtein and Trigram Similarity Functions

In previous version (1.0.) of Remadder software, user had to choose whether to use Levenshtein distance

function or trigram similarity function or both. In Remadder veriosn >=1.1 there is no such option, since

common similarity function (remadder_similarity) is used, which incorporates both normalized

Levenshtein similarity and trigram similarity.

However, for sake of better understanding, here are the main properties of the two incorporated functions:

Levenshtein distance is suitable for finding strings that contain minor differences (e.g. minor

spelling errors). In order to provide good results, strings must be very similar and of similar length.

Levenshtein distance calculation is also very resource extensive and slow to perform.

Trigram similarity is suitable for analyizing larger deviations (spelling mistakes) and also performs

much, much faster than Levenshtein. Trigram similarity is very straightforward solution for fuzzy

match, but it has one major limitation: it fails with misspellings / alternative spellings. For example

the words "vodka" and "votka" share no trigrams yet are obviously very similar.

Composite (Concatenated) Field

Remadder allows you to perform fuzzy match analysis on individual field pairs, but also gives you additional

feature to create an additional composite field created by concatenating multiple fields and then perform

fuzzy match analysis on such concatenated field.

On the Solution Constraints tab, Fuzzy Match Relation sub-tab, you can define which individual fields will

be included in composite fields.

Page 23: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 22 / 55

For field pairs for which option “Include the field in composite field” is checked, Remadder will create

corresponding concatenated field during query preparation and perform fuzzy match comparison between

left dataset composite field and right dataset composite field.

This composite field (re)creation before query execution is resource extensive and lasts for a while, but can

greatly improve your overall result. Why? That’s because it can mitigate and balance wrongly set individual

field weights, in case taht you use “inclusive OR” option.

Important notice: Each solution has its own left and right composite field, directly related to the fuzzy match

relations section field pairs list. If you change fields list, you must also recreate corresponding composite

fields. Therefore, upon any change of fuzzy match relations specification, you shoudl recreate composite

fields and recreate SQL query.

Use “Generate SQL Query And Composite Field” button to recreate the composite field. Otherwise,

your fuzzy match analysis with composite fields will be wrong!

Page 24: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 23 / 55

My advice is to utilize composite fields, with “inclusive OR” option, because it can alleviate and balance

individual field pairs fuzzy match comparison and wrongly set relative weights.

Similarity Threshold Value

You have to set threshold values for the similarity function, for both individual field pairs and composite

field pairs.

Threshold values must be real numbers between zero (0) and one (1) and are used as acceptance criteria for

record matching.

If result of the similarity function for a field pair is a value greater than or equal to the threshold

(>=threshold), then the field pair will be accepted as match.

If the result is lesser than the threshold, the field pair will be rejected as non-match.

Similarity(LeftString, RightString) >= Threshold ACCEPTED

Similarity(LeftString, RightString) < Threshold REJECTED

Inclusive “OR” vs. Exclusive “AND”

There is a solution header option whether to use inclusive "OR" or exclusive "AND" for fuzzy match analysis.

If the option “Use inclusive “OR” is checked then the inclusive OR option is used, otherwise exclusive

AND is used.

Page 25: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 24 / 55

If we use inclusive “OR” option, then Remadder will use OR operator in SQL query WHERE clause to

combine the constraints, and result will be considered match if any of the conditions satisfy criteria.

If we use exclusive “AND” then Remadder will use AND operator in SQL query WHERE clause,

meaning that the both criteria must be satisfied in order to accept match.

Generally speaking, if we use inclusive "OR", then analysis will return more false positive results, while

with exclusive "AND" there will be more false negative results.

Notice that the option is relevant only for combining various constraints inside the fuzzy match relations,

not between the exact match, fuzzy match and other constraints. These three group of conditions are always

combined with exclusive AND operator, meaning that both three groups of conditions must be satisfied for

acceptance.

If you are familiar with SQL language, you can see how these conditions are translated to SQL query

language. Take a look on the “Solution SQL” page and see how Remadder translates these constraints to

SQL WHERE clause.

SQL (WHERE clause) in Remadder version 1.0.:

Page 26: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 25 / 55

SQL (WHERE clause) in Remadder version >1.1.:

Page 27: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 26 / 55

Solution Definition Details

Page 28: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 27 / 55

While solution definition header defines general parameters for performing fuzzy match analysis, solution

definition details are being set in Field Picker subpage and Solution Constraints subpage with three

sections defining solution constraints: Exact Match Relations section, Fuzzy Match Relations

section and Other Constraints section.

Fields Picker

Remadder provides simple, yet very powerful visual tool to add field pairs to exact match relations section,

fuzzy match section or other constraints section.

Page 29: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 28 / 55

By having input datasets ("left" and "right" datasets) fields listed side by side, you can easily browse two

lists, visually establish field pairs and send them to appropriate constraints definition sections by click on

appropriate button.

You can add selected fields pair to exact match section by clicking the button “Add Fields Pair To Exact

Match Relations Section”.

You can add selected fields pair to fuzzy match section by clicking the button “Add Fields Pair To Fuzzy

Match Relations Section”.

You can add left or right dataset field to other constraints section by clicking the respective button.

By eliminating need for tedious manual input and letting you to visually build solution constraints instead,

Remadder simplifies solution definition creation and boosts your performance.

Starting from Remadder version 1.1., checkbox column “Output Field to Resultset?” is added to the

Field Picker datagrid. It is used to include or exclude fields from being outputted to a resultset. By default,

all fields are included in resultset.

Page 30: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 29 / 55

Solution Constraints

There are three type of constraints that you can define for a solution: exact match relations, fuzzy match

relations and other constraints.

Exact Match Relations

In exact matching relations section, we can add field pairs from "left" and "right" imported dataset and

define their equalness (=) or not-equalness (<>).

Page 31: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 30 / 55

If we can define exact matching relation on one or more filed pairs, we can tremendously increase speed of

analysis.

Therefore, it is recommended to use exact match relations whenever possible.

Fuzzy Match Relations

In solution header section we can set various general parameters that determine how fuzzy match analysis

will be performed: we can choose whether a composite (concatenated) field be used, whether analysis will

be more inclusive (inclusive "OR") or more exclusive (exclusive "AND"), whether normalized Levenshtein

distance or trigram similarity or both function will be used.

In fuzzy match relations section we provide details for fuzzy match comparison analysis. We can list field

pairs to be compared and furtherly define how fuzzy match analysis will be performed.

Page 32: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 31 / 55

Relative Field Weight

For each field pair, which will be compared by similarity, we have to define its relative weight. The bigger

the weight, the greater is importance of the particular field pair similarity in final decision whether two

records do match or not.

The weight for particular field pair is entered as an arbitrary integer value in the field “Field Weight

(integer)” and Remadder then calculates its relative weight. The sum of relative weights is always 1.

On new field pair addition to the fuzzy match relations section, the field pair gets default relative weight

(integer) value, which is one (1). You can change this value to any bigger integer and Remadder will

recalculate relative weights, taking care of their sum, which must always be 1.

Notice that there is an additional graphical indicator of relative weights. It shows graphically relative weight

for currently selected fields pair.

There are two buttons provided.

Page 33: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 32 / 55

The button “Reset Weights” reset all relative weights to 1, which is the same as if relative weights are not

used at all. In that case, all field pairs are treated equally important.

The button “Recalculate Weights” performs the recalculation of relative fields according to the integer

values entered in the field “Field Weight (integer)”. You don’t need to trigger this action manually, since

this procedure is triggered automatically on each change of integer value or a field pair addition.

Inclusive “OR” vs. “Exclusive “AND”

Similarly to option whether to use inclusive "OR" or exclusive "AND" on solution header level, we have also

option to use inclusive "OR" or exclusive "AND" when combining individual field pairs results to get final

acceptance or rejection decision.

Generally speaking, if we use inclusive "OR", then analysis will return more false positive results, while with

exclusive "AND" there will be more false negative results.

My advice is to use “inclusive OR” for combining the field pairs with similarity operator in most cases,

because otherwise result will fail if any of the fields do not match, no matter whether other fields match or

not, which is in most cases wrong conclusion.

Include Or Exclude a Fields Pair From Composite Fields

For each field pair we can choose whether fields from the field pair will be included in composite fields in

left and right dataset or not. This is relevant only if option “use composite (concatenated) field” is checked

in the solution’s definition header.

Page 34: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 33 / 55

Other Constraints

Similar to exact matching relations, it is desirable to limit analysis on particular subset of data. Such

constraints can greatly increase speed of record linkage or data deduplication analysis.

We can define any custom constraint to be applied on a particular field from "left" or "right" dataset.

Normally, condition is: sometable.somefield= ‘some string’, but other operators such as LIKE can be used

as well.

Solution Execution

Page 35: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 34 / 55

Once a solution definition is prepared by setting global parameters and exact match, fuzzy match and other

constraints, you can then execute the solution on remote server and retrieve resultset.

A solution can be executed in one step, by clicking the button “Execute Solution”

or can be executed in two distinct steps of which first step is query preparation

and second step is actual query execution.

If “Execute Solution” is triggered and Remadder finds that the solution status field “Query Is Prepared” is

False, then it will first prepare query and then execute the query.

It is recommended to use two-step approach, because it is less likely you will forget to (re)create composite

(concatenated) fields on change of solution definition.

SQL Query Preparation Remadder software hides complexity of fuzzy match analysis constraints definition from user, by providing

simple and intuituive graphical user interface. However, the actual definition of the solution constraints is

quite complex and represented in SQL language.

Screenshot from Remadder version 1.0.:

Page 36: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 35 / 55

Screenshot from Remadder version 1.1.:

Page 37: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 36 / 55

Remadder software translates solution definition prepared by Remadder's user interface into an SQL query,

which is ready to be sent to remote PostgreSQL server for processing. The SQL query can also be manually

edited according to special needs.

Page 38: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 37 / 55

The PostgreSQL server, placed in cloud, is the brute force behind the user interface, which provides robust

and sophisticated engine for fuzzy match analysis.

In order to be executed on server, the solution’s definition must first be translated to SQL language. It is

done in two different ways: with or without composite (concatenated) fields creation.

Prepare SQL Query With Composite Fields (Re)Creation

If you click the button “Generate SQL Query And Composite Field” then then first a concatenated

field is (re)created and indexed, according to the specification in fuzzy match relations section, and then

SQL query text is generated.

This concatenated composite fields creation takes some time.

Notice that this is relevant only if option “use composite (concatenated) field” is used in the solution’s

definition header in first place.

IMPORTANT NOTICE: If you change list of fields to be included in composite fields, in the fuzzy match

relations definition, for an already prepared solution (the field “Query Is Prepared” is True), then you MUST

trigger this concatenated fields recreation before query execution. Otherwise, you risk to get completely

wrong results because concatenated fields do not correspond to the fuzzy match specification!

Prepare Only SQL Query

If you click the button “Generate SQL query”, then the SQL query text is (re)created for the solution

definition, translating the definition into SQL language.

You can safely use this action only if you haven’t changed list of fields to be included into composite fields.

Otherwise, use “Generate SQL Query And Composite Field” button instead.

Data Retrieving And Storing

You can launch previously prepared solution SQL queries to the PostgreSQL server in order to process it

and return resultsets.

Screenshot from Remadder version 1.0.:

Page 39: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 38 / 55

Screenshot from Remadder version 1.1.:

Page 40: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 39 / 55

Once retrieved, resultsets are stored locally on your computer and you can load them afterwards anytime

you wish.

You can easily browse, edit and analyze results in many different ways, including datasheet forms with

sophisticated data searching, filtering and navigation capabilities.

Execute SQL Query

The query is executed by clicking the button “Execute Solution” .

Previously prepared SQL query text is sent to remote PostgreSQL server for execution. The progress of

query execution can be monitored in “Solution Log” page.

The speed of query execution depends on various factors, such as:

Whether there are any exact match relations specified? – Exact match constraints can

dramatically increase speed of the query execution.

Whether there are other constraints specified? – Any custom constraint can increase speed.

Whether Levenshtein distance function is used? – Levenshtein distance calculation is resource

extensive for processing and decreases query execution speed.

Whether Trigram similarity function is used? – Contrary to Levenshtein distance, trigram

similarity function and operator can utilize indexes, thus improve fuzzy match analysis speed.

Whether composite (concatenated) fields are used? – If concatenated fields are used, this

introduces additional calculations and thus decrease speed.

Whether “inclusive OR” or e”xclusive AND” is used?

The retrieved resultset is automatically opened in a separate form.

Page 41: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 40 / 55

Solution Query Preparation And Resultset Retrieval Status Info Remadder automatically updates solution status upon query preparation and execution actions. The

solution status information is shown both in the solution header data grid and form view.

Page 42: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 41 / 55

You get information whether SQL query is generated (prepared) or not, whether SQL query was already

executed or not, whether resultset is retrieved or not and if retrieved whether it was empty or not. It is also

shown whether the resultset is stored locally and in which file.

There are also information about execution times and number of executions performed.

Save And Load Resultset

Page 43: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 42 / 55

Once a solution is executed and results retrieved, the resultset is automatically saved as a locally stored file

in the Remadder installation folder, subfolder “\data\results\”.

Resultset can be loaded into the subpage “Solution Result” of the main form, by clicking the button “Load

Solution Resultset” or in a separate form, by clicking the button “Load Solution Resultset In

Separate Window”.

Review And Edit Resultset There are various ways you can post-process and review the retrieved resultset.

Resultset Browsing

You can easily browse, edit and analyze loaded resultset in data grid form. Datasheet contains sophisticated

data searching, filtering and navigation capabilities.

Page 44: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 43 / 55

You can scroll by using mouse, vertical and horizontal sliders and arrows.

You can also browse records by using navigation buttons.

Resultset Searching

You can easily search for any particular value in any column. On the upper left corner of the datagrid

there is a small button represented by orange double arrow. This button opens a pop-up dialog

with various search, filter and customization options, of which one is “Find data”.

Page 45: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 44 / 55

When you click on the “Find data” button, a search dialog box appears. You can search any value on

any column.

Resultset Filtration

Page 46: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 45 / 55

You can easily filter by any column. On the upper left corner of the datagrid there is a small button

represented by orange double arrow.

This button opens a pop-up dialog with various search, filter and customization options, including “Filter

data” and “Filter in table”, which are two different ways to perform filtration in a datagrid.

Filter Data

When you click the button “Filter data”, a dialog box appears on which you can build your filtering

conditions. This way you can define complex multicolumn filters.

Page 47: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 46 / 55

The filtering is then applied by clicking “Apply” button.

Filter In Table

Another option for filtration is to use the button “Filter in table”, which activates a filtration

combobox, which is placed just below each column’s title. When you click on the filtration combobox cell,

a combobox lista appears, listing all possible values for respective column. When you select a value, the

respective column is automatically filtered by the chosen value.

Page 48: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 47 / 55

Resultset Sorting

You can sort ascending or descending on any column by clicking column title.

Resultset Edit And Review

You can edit the resultset in datagrid easily. You can delete a row by using delete button ,

or edit a record by clicking the edit button .

Page 49: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 48 / 55

Exporting Resultset

Besides using datagrid controls, another option for resultset post-processing is to export the resultset into

a spreadsheet and then perform reviewing and editing in a spreadsheet editor.

Remadder has many different possibilities of exporting resultset to spreadsheets.

Exporting Resultset To Spreadsheet

Resultset can be exported to a CSV file by clicking the button “Export To CSV File”.

Resultset can be exported to a XLSX file by clicking the button “Export To XLSX File”.

Resultset can be exported to XLS file by clicking the button “Export To XLS File”.

Resultset can be loaded directly into your default spreadsheet editor, e.g. LibreOffice Calc or Microsoft

Excel, by clicking the button “Load In Ext. Spreadsheet Editor”.

Page 50: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 49 / 55

Remadder also has its own embedded spreadsheet editor which can be used for resultset post-processing.

Resultset can be loaded into the embedded spreadsheet editor by clicking the button “Load As

Spreadsheet”.

Page 51: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 50 / 55

Exporting Datagrid To Spreadsheet

Another possibility for exporting resultset into a spreadsheet file is to use datagrid’s exporting feature.

Page 52: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 51 / 55

You have to browse the destination folder for export and enter exported file name and extension, as well

as to enter page name (sheet name). If you forget to specify “page name”, you will get an error.

Page 53: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 52 / 55

Customize Data Grids

Remadder enables you to customize your user interface in certain extent. You can shrink or stretch columns,

rearrange their order and hide/unhide columns.

Resize columns by dragging vertical splitters between columns.

Rearrange columns by pushing the left mouse button on a column’s title and dragging the column while

mouse button is still pushed down. After the column is moved to another position, release the mouse button.

You can define which columns are shown and which are hidden, by clicking on the button “Select visible

columns”.

Page 54: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 53 / 55

When you close the application, your customization is saved and when you open the application again,

your customizations will be loaded as well.

Customize Splitters

You will notice that various sections are divided by splitters which you can easily drag and thus resize the

corresponding splitted sections.

The customization you make is saved on application close and reloaded on application start.

Remadder Software Trial

Remadder client application is distributed as a shareware with 15-days trial period.

On first application start on your computer the trial period is initialized.

Page 55: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 54 / 55

Commercial Release Code Purchase And Activation

After expiration period expires, you are required to purchase commercial release code in order to be able to

continue using server features, such as raw data import and query execution.

You can, however, continue creating and editing projects and solution definitions, as well as loading and

editing previously acquired resultsets.

When purchasing release code, you are required to enter MachineID in purchase form. The MachineID is

a tag generated by Remadder software and is unique for your hardware. The purchased commercial release

code is thus machine-specific and valid only for your hardware.

Once you purchased release code, activate it by clicking the button “Activate Commercial Release

Code”.

You are asked to enter the release code.

Page 56: ReMaDDer Software Tutorial - How to use Remadder software for successful records matching

ReMaDDer Software Tutorial

Page 55 / 55

The entered release code will be then validated and if correct, the server-side features will be unlocked for

you.