remadder software tutorial - how to use remadder software for successful records matching
DESCRIPTION
Remadder is an affordable and powerful record linkage and data cleansing software, with great fuzzy record matching and data deduplication capabilities. It's user-friendly graphical interface provides intuitive means for projects creation, raw data import and solutions definition, while server-side PostgreSQL database, placed in cloud, provides mighty data processing and fuzzy match record linkage engine that can process and solve even the most complex fuzzy match analysis in reasonable time. Remadder uses an inventive and pragmatic approach for fuzzy match analysis, by utilizing combination of two different string comparison similarity metrics: normalized Levenshtein distance and trigram similarity. These functions can be combined in flexible way in order to provide best results.TRANSCRIPT
Homepage: http://remaddersoft.wix.com/remadder
ReMaDDer Software Tutorial How to use Remadder software for successful records matching, data
cleansing and data deduplication projects
5/10/2016
Revision 1.1.
ReMaDDer Software Tutorial
Page 1 / 55
Table of Contents
Introduction ........................................................................................................................ 3
What Is ReMaDDer Software .......................................................................................... 3
Fuzzy Match ..................................................................................................................... 3
Records Linkage .............................................................................................................. 3
Data Deduplication .......................................................................................................... 3
Remadder Software Advantages ..................................................................................... 4
Prerequisites .................................................................................................................... 4
Revision History .............................................................................................................. 4
Projects ................................................................................................................................ 5
Record Matching Project vs. Data Deduplication Projects ............................................. 6
Copy A Project ................................................................................................................. 6
Raw Data Import ................................................................................................................. 7
“Left” and “Right” datasets .............................................................................................. 8
Import Raw Data ............................................................................................................. 8
Browse And Choose CSV files ...................................................................................................................... 8
Register CSV Files ........................................................................................................................................ 9
Edit Raw Datasource Schema Information .............................................................................................. 10
Preprocess Raw Datasource ....................................................................................................................... 11
Import Data From Raw Datasources ........................................................................................................ 12
Solution Definition ............................................................................................................ 13
String Similarity Metrics ............................................................................................... 15
Normalized Levenshtein Distance or Similarity Function ...................................................................... 15
Trigram Similarity Function ..................................................................................................................... 16
ReMaDDer Software Tutorial
Page 2 / 55
Trigram Similarity Operator ..................................................................................................................... 17
Remadder Similarity Function .................................................................................................................. 17
Solution Definition Header ............................................................................................ 17
Characteristics of Levenshtein and Trigram Similarity Functions ......................................................... 21
Composite (Concatenated) Field ............................................................................................................... 21
Similarity Threshold Value ........................................................................................................................ 23
Inclusive “OR” vs. Exclusive “AND”.......................................................................................................... 23
Solution Definition Details ............................................................................................ 26
Fields Picker ............................................................................................................................................... 27
Solution Constraints .................................................................................................................................. 29
Solution Execution ............................................................................................................ 33
SQL Query Preparation ................................................................................................. 34
Prepare SQL Query With Composite Fields (Re)Creation ...................................................................... 37
Prepare Only SQL Query ........................................................................................................................... 37
Data Retrieving And Storing ............................................................................................. 37
Execute SQL Query ........................................................................................................ 39
Solution Query Preparation And Resultset Retrieval Status Info ................................ 40
Save And Load Resultset ............................................................................................... 41
Review And Edit Resultset ............................................................................................ 42
Resultset Browsing .................................................................................................................................... 42
Resultset Edit And Review ........................................................................................................................ 47
Exporting Resultset................................................................................................................................... 48
Customize Data Grids ........................................................................................................ 52
Customize Splitters ........................................................................................................... 53
Remadder Software Trial .................................................................................................. 53
Commercial Release Code Purchase And Activation ........................................................ 54
ReMaDDer Software Tutorial
Page 3 / 55
ReMaDDer Software Tutorial
How to use Remadder software for successful records matching, data cleansing and
data deduplication projects
Introduction
What Is ReMaDDer Software Remadder is an affordable and powerful record linkage and data cleansing software, with great fuzzy record
matching and data deduplication capabilities.
It's user-friendly graphical interface provides intuitive means for projects creation, raw data import and
solutions definition, while server-side PostgreSQL database, placed in cloud, provides mighty data
processing and fuzzy match record linkage engine that can process and solve even the most complex fuzzy
match analysis in reasonable time.
Remadder uses an inventive and pragmatic approach for fuzzy match analysis, by utilizing combination of
two different string comparison similarity metrics: normalized Levenshtein distance and trigram similarity.
These functions can be combined in flexible way in order to provide best results.
The name “ReMaDeDer” is an acronym for “Records Matching and Data Deduplication Software”.
Homepage: http://remaddersoft.wix.com/remadder
Fuzzy Match Term “fuzzy match” refers to methods of identifying related records by measuring how similar they are. It
is used in cases where no unique identifier or exact match relation exists between two sets of data.
Fuzzy matching uses weights to calculate the probability that two given records refer to the same entity.
Record pairs with probabilities above a certain threshold are considered to be matches, while pairs with
probabilities below threshold are considered to be non-matches.
Fuzzy matching attempts to find a match which, although not a 100 percent match, is above the threshold
matching percentage set by the application.
Records Linkage Record linkage refers to the task of finding records in a data set that refer to the same entity across different
data sources, i.e. to identify related records in two separate data sets.
Record linkage is necessary when joining data sets is based on entities that may or may not share a common
identifier, as may be the case due to differences in record shape, storage location, and/or curator style or
preference.
Data Deduplication
ReMaDDer Software Tutorial
Page 4 / 55
Data deduplication refers to identifying duplicate records in a dataset and cleansing datasets from
redundant information.
Remadder Software Advantages Due to its inherent complexity, fuzzy match analysis is a popular subject of scientific research and academic
papers. Some of the researchers even tend to build their own software, but those programs suffer from their
complexity and it is really necessary to understand all the advanced mathematics and algorithms in order
to be able to use it.
This is not something that can be expected from an average user facing data linkage problem and wanting
to be able to solve it in matter of hours.
On the other hand, there are huge corporate entity resolution framework solutions produced by big software
companies oriented toward huge corporate customers. These solutions are often very complex and
affordable only to big companies and corporate users.
Remadder places itself in the middle and provides powerful fuzzy match records linkage solution for mere
mortals and regular office users.
By allowing users to define exact matching constraints, fuzzy matching constraints and all other constraints
in a visual and intuitive way, all the complexity of the fuzzy match analysis is hidden from the user and
he/she can focus on the business case, rather than technical issues. That is where Remadder software really
shines and clearly distinguishes itself from competition.
Remadder uses an inventive and pragmatic approach for fuzzy match analysis, by utilizing combination of
two different string comparison similarity metrics: normalized Levenshtein distance and Trigram
similarity. These can be combined in flexible way in order to provide best results.
By its intuitive graphical user interface and low pricing, Remadder provides superb solution for fuzzy match
records linkage for any business case.
Prerequisites Major prerequisite to use Remadder is active internet connection, since the raw data is imported to remote
PostgreSQL server and actual query execution also takes place on the server. After trial period expires, you
are required to purchase commercial release code in order to be able to continue using remote server.
However, project and solution creation and editing can be performed even without established connection
and purchased release code, since these data are stored locally on your computer.
Revision History Revision Date Change Description
1.0. 3/20/2016 Initial release. Tutorial covers Remadder version 1.0. 1.1. 5/10/2016 Document is updated to reflect changes and improvements brought by
Remadder version 1.1. New version brings many improvements and simplifies solution definition. Instead of separarately choosing and defining thresholds for trigram similarity and levenshtein distance functions, a new, combined,
ReMaDDer Software Tutorial
Page 5 / 55
common similarity function (remadder_similarity) is now introduced that combines both trigram and levenshtein similarity properties. This reduces complexity and uncertainty in solution definition creation, retaining Remadder strength and advantages. Previous Remadder version has been outputting all columns from left and right dataset into resultset. Now, you can choose which fields are to be included in resultset. Raw data import process is also much improved, especially regarding importing data from Excel files (in CSV format) where column names contain non-ascii characters and blanks. There are many small performance improvements and several bugfixes that will improve user experience when using the ReMaDDer software for data match analysis.
Projects
You can easily create new project, which are stored locally on your computer.
ReMaDDer Software Tutorial
Page 6 / 55
Each project contains definition of two datasets to be compared and linked (so called "left dataset" and
"right dataset"), as well as variable number of corresponding solutions.
Solutions are named and stored definitions of how to perform fuzzy match analysis.
On creation, each project is assigned unique project tag. During raw data importing to server,
corresponding input tables get that tag appended in their name. This way, imported tables are always tagged
by the project name.
There are two views of projects. On the top section, there is a datagrid listing all projects, while on bottom
section there is a form view of currently selected project.
Record Matching Project vs. Data Deduplication Projects There is no fundamental difference between data deduplication and records matching projects. In both
cases we compare two datasets trying to establish which records from “left” dataset correspond to which
records in “right” dataset.
The only difference between the two is that in case of records matching project we have two different
datasets to be compared, while in case of data deduplication project we have to compare a dataset with itself
in order to identify duplicates.
Copy A Project Instead of manually entering all the parameters for new projects, Remadder allows you to copy an existing
project into another project. This action copies raw data import specifications as well as solution definitions.
ReMaDDer Software Tutorial
Page 7 / 55
Raw Data Import
Datasets to be analyzed are called "left" and "right" datasets and can be easily imported from CSV files.
The CSV file format ("Comma Separated Values") is chosen due to its ubiquity and since all databases
and spreadsheet editors, as well as all other data sources can be easily exported to a csv file.
Remadder provides simple and intuitive tool for importing csv files. It will automatically detect
field’s delimiter, encoding and columns schema information. You can then edit the retrieved schema and
finally import the files on server for further processing.
ReMaDDer Software Tutorial
Page 8 / 55
“Left” and “Right” datasets In each data deduplication or record matching project we always compare two datasets for matching of
records. In case of record matching projects, these two datasets correspond to two different input CSV files,
while in case of data deduplication projects, these two datasets are imported from the same input CSV file.
Nevertheless, we always have so-called “left dataset” and “right dataset” to be compared. Think of this like
if you wish to match fingers from left and right hand. You can easily identify thumb on the left hand to be
related to the thumb on the right hand, since they share similar shape. It is obvious due to their physical
similarity.
It is similar with fuzzy match analysis where we compare fields from left and right dataset in order to
identify string similarities.
Import Raw Data Process of importing raw data into server database consists of two logical phases. In first phase we need to
specify “left” and “right” dataset schema, while in the second phase we actually perform schema and data
import from input files, according to the previously defined schema.
On “Data Import” page, there are two subpages: “Left Dataset Specification” and “Right Dataset
Specification” in which we separately define input dataset specifications for “left” and “right” dataset.
Import can be executed separately or both datasets can be imported in batch.
Browse And Choose CSV files
ReMaDDer Software Tutorial
Page 9 / 55
First step in importing input CSV files is to choose CSV files to be imported.
On upper part of “Left Dataset Specification” or “Right Dataset Specification” subpage, there is a CSV file
browser dialog box.
You can browse CSV files on your computer by clicking on the browse button . This opens a file
browser in which you can choose a CSV file. The absolute file path is then copied to the edit box.
Register CSV Files
Next step in importing CSV files is to define CSV file schema specification. We call this process “registering
CSV file”.
By clicking “Register CSV file” button near the file browser, the browsed CSV
file is examined for its columns and it’s schema information is then inserted in the corresponding list of
fields (columns).
ReMaDDer Software Tutorial
Page 10 / 55
As you can see, Remadder determines field delimiter in CSV file (normally it is either “;” or “,”) and retrieves
information about columns.
If a column name has upper case characters, it is converted to lower case.
Currently, Remadder treats all columns as text fields of various length. This is due fact that the comparison
is performed by using string comparison functions (Levenshtein distance and trigram similarity) and other
type of data fields (e.g. datetime, integer, real etc.) would not really make sense for string comparisons.
Edit Raw Datasource Schema Information
Once you retrieved schema information from a CSV file, you might conclude that you don’t want to import
all columns, but only a subset.
You can edit the schema by using corresponding data grid navigator buttons.
If you wish to delete currently selected field from schema, just click delete button.
ReMaDDer Software Tutorial
Page 11 / 55
If you wish to regain original columns schema, just click “Get Fields Schema”
button and the columns list will be repopulated from the CSV file.
Preprocess Raw Datasource
While defining import schema, you might realize that input data need some preprocessing before importing
to server for further analysis.
Of course, you can edit input CSV files by using any spreadsheet editor (such as LibreOffice or OpenOffice
Calc, Gumeric or Miscrosoft Excel) or textual editor (such as Notepad, Notepad ++, ConText, Gedit, Geany
or Leafpad), but you can also use an embedded Remadder’s spreadsheet editor.
You can launch external default spreadsheet editor by clicking the button “Open CSV File in Ext.
Editor” .
You can launch the embedded spreadsheet editor by clicking button “Open CSV File In Int. Editor”
. This will open the embedded spreadsheet editor “Spready”.
ReMaDDer Software Tutorial
Page 12 / 55
Import Data From Raw Datasources
Final step is actual import execution.
We can execute import separately for left and right datasets, by clicking corresponding buttons “Import
left dataset CSV file” or “Import right dataset CSV file” or we can import them both at once by
clicking the button “Import both CSV files to server”.
When you click the import button, Remadder will automatically open the “Import Log” page, where you
can watch import process progress.
ReMaDDer Software Tutorial
Page 13 / 55
Import speed depends on the file size and most importantly, internet connection quality.
Solution Definition
A solution represents definition of parameters for performing record linkage or data deduplication analysis
for a project. Each project can have many solutions, with different parameters, thus you can test which
combination of parameters lead to best results.
Each solution definition consists of several sections: solution header section, exact match relations section,
fuzzy match relations section and other constraints section.
Solution header definition in Remadder v 1.0.:
ReMaDDer Software Tutorial
Page 14 / 55
Solution header definition in Remadder v 1.1.:
ReMaDDer Software Tutorial
Page 15 / 55
String Similarity Metrics There are many algorithms used for fuzzy match analysis. The Remadder software uses combination of two
such functions: normalized Levenshtein distance function and trigram similarity function. Besides these
two function, there is also a trigram similarity operator “%” used, which complements these two functions.
UPDATE: In Remadder verion 1.1. normalized Levenshtein distance and trigram similarity functions are
replaced by common similarity function (remadder_similarity).
remadder_similarity(left_string, right_string)=(levenshtein_similarity (left_string, right_string) +
trigram_similarity(left_string, right_string))/2
Normalized Levenshtein Distance or Similarity Function
In information theory and computer science, the Levenshtein distance is a string metric for measuring the
difference between two sequences. Informally, the Levenshtein distance between two words is the
minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change
one word into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965.
Levenshtein distance function returns an integer value, which is the minimum number of single-character
edits (insertion, deletion, substitution) required to transform one string to another. The bigger the integer,
the weaker is similarity between the two strings.
However, the retrieved Levenshtein integer has particular meaning only for certain pair of strings, because
interpretation of its’ value depends of the string lengths. Thus it is difficult to say whether certain integer
value means “similar” or “dissimilar” until you know the string lengths.
That’s the reason why Remadder does not use original Levenshtein distance function, but rather it’s
normalized variant which is defined as Levenshtein distance value divided by maximum string length
(whether left or right dataset string is longer).
Normalized Levenshtein distance function returns a real (decimal) value between zero (0) and one (1),
where 0 means absolute similarity (meaning “identical”) while 1 means absolute dissimilarity (meaning
“completely dissimilar”).
This way, we have a relative metrics that has inherent meaning and different datasets can be compared with
same Levenshtein distance metrics.
Instead of distance function, we could use derived similarity function. The normalized Levenshtein
similarity is (1 - Normalized Levenshtein distance).
More generally, normalized_similarity=(1 – normalized_distance).
NOTE: For sake of user interface simplification and more straightforward solution definition process,
Levenshtein function utilization is removed from GUI in Remadder version 1.1., in favour of the new
remadder_similarity function, which combines both normalized Levenshtein similarity and trigram
similarity function in one common similarity function.
However, you can still use Levenshtein functions directly, by manually replacing remadder_similarity
function inside the generated SQL.
ReMaDDer Software Tutorial
Page 16 / 55
You can use the following Levenshtein functions:
a) Levenshtein distance functions
See: http://www.postgresql.org/docs/9.5/static/fuzzystrmatch.html
levenshtein(text left_string, text right_string, int insert_cost, int deletion_cost, int substitution_cost) returns int
levenshtein(text left_string, text right_string) returns int
levenshtein_less_equal(text left_string, text right_string, int insertion_cost, int deletion_cost, int substitution_cost, int
max_d) returns int
levenshtein_less_equal(text left_string, text right_string, int max_d) returns int
Both source and target can be any non-null string, with a maximum of 255 characters. The cost parameters
specify how much to charge for a character insertion, deletion, or substitution, respectively. You can omit
the cost parameters, as in the second version of the function; in that case they all default to 1.
levenshtein_less_equal is an accelerated version of the Levenshtein function for use when only small
distances are of interest. If the actual distance is less than or equal to max_d,
then levenshtein_less_equal returns the correct distance; otherwise it returns some value greater
than max_d. If max_d is negative then the behavior is the same as levenshtein.
b) Normalized Levenshtein distance function
levenshtein_norm(text left_string, text right_string) returns real
Normalized Levenshtein distance is Levenshtein distance divided by maximu length of the strings being
compared.
a) Normalized Levenshtein similarity function
levenshtein_similarity(text left_string, text right_string) returns real
Levenshtein similarity is actually (1 – normalized levenshtein distance).
Trigram Similarity Function
A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of
two strings by counting the number of trigrams they share. Basically, trigrams break a word up into 3-letter
sequences, which can then be used to do similarity matches. For instance, in the word "hello", we have the
following trigrams: "h", "he", "hel", "ell", "llo", "lo", and "o". Now look at the word "hallo". It has the
following trigrams: "h", "ha", "hal", "all", "llo", "lo", and "o". You can see that there are several trigrams
that match, and several that don't match. The more trigrams you have in common the closer the match.
The trigram similarity function returns a real number that indicates how similar the two arguments are.
The range of the result is from zero (0), indicating that the two strings are completely dissimilar, to one (1),
indicating that the two strings are identical.
ReMaDDer Software Tutorial
Page 17 / 55
NOTE: For sake of user interface simplification and more straightforward solution definition process,
trigram similarity function utilization is removed from GUI in Remadder version 1.1., in favour of the new
remadder_similarity function, which combines both normalized Levenshtein similarity and trigram
similarity function in one common similarity function.
However, you can still use trigram similarity functions directly, by manually replacing remadder_similarity
function inside the generated SQL.
You can use the following trigram functions:
(see: http://www.postgresql.org/docs/9.5/static/pgtrgm.html)
similarity(text left_string, text right_string) returns real
trigram_similarity(text left_string, text right_string) returns real
Functions similarity and trigram_similarity have completely the same behavior and are synonyms.
Trigram Similarity Operator
Besides trigram similarity function, for which we define an arbitrary threshold, Remadder also uses trigram
similarity operator “%”, which performs preliminary narrowing down solution space to a subset of possible
combinations. Similarity operator is used to narrow down number of combination for resource extensive
fuzzy match analysis, in order to increase processing speed.
Similarity operator can be understood as applying trigram similarity function with preset threshold of 0.3.
If such similarity function result would be above 0.3 the records combination is included in further analysis
with trigram similarity function and Levenshtein distance function. Otherwise, if applied similarity function
result would be below the 0.3, the records pair is not considered in further analysis.
Remadder Similarity Function
Starting from version 1.1. Remadder software uses common similarity function called
remadder_similarity, which incorporates both normalized Levenshtein similarity function and trigram
similarity function. The remadder similarity function is simply arithmetic mean of the two previously used
functions.
remadder_similarity(left_string, right_string)=(levenshtein_similarity (left_string, right_string) +
trigram_similarity(left_string, right_string))/2
Solution Definition Header Solution header contains general info about a solution definition and status.
Solution definition header in Remadder version 1.0.:
ReMaDDer Software Tutorial
Page 18 / 55
ReMaDDer Software Tutorial
Page 19 / 55
Solution definition header in Remadder version 1.1.:
ReMaDDer Software Tutorial
Page 20 / 55
Solution definition can be entered either through data grid or through form view which shows currently
selected solution.
ReMaDDer Software Tutorial
Page 21 / 55
We have to set corresponding threshold values for similarity function used for individual field pairs
(weighted fields) and for composite field pair (if it is used), in order to define acceptance criteria. Acceptance
criterion for record match is that the similarity function returns value that is equal or greater than the
threshold value.
Similarity(LeftString, RightString) >= Threshold ACCEPTED
Similarity(LeftString, RightString) < Threshold REJECTED
We can decide whether to use composite (concatenated) fields or not.
Also we can choose whether to use inclusive OR or exclusive AND operators while constructing SQL query.
Characteristics of Levenshtein and Trigram Similarity Functions
In previous version (1.0.) of Remadder software, user had to choose whether to use Levenshtein distance
function or trigram similarity function or both. In Remadder veriosn >=1.1 there is no such option, since
common similarity function (remadder_similarity) is used, which incorporates both normalized
Levenshtein similarity and trigram similarity.
However, for sake of better understanding, here are the main properties of the two incorporated functions:
Levenshtein distance is suitable for finding strings that contain minor differences (e.g. minor
spelling errors). In order to provide good results, strings must be very similar and of similar length.
Levenshtein distance calculation is also very resource extensive and slow to perform.
Trigram similarity is suitable for analyizing larger deviations (spelling mistakes) and also performs
much, much faster than Levenshtein. Trigram similarity is very straightforward solution for fuzzy
match, but it has one major limitation: it fails with misspellings / alternative spellings. For example
the words "vodka" and "votka" share no trigrams yet are obviously very similar.
Composite (Concatenated) Field
Remadder allows you to perform fuzzy match analysis on individual field pairs, but also gives you additional
feature to create an additional composite field created by concatenating multiple fields and then perform
fuzzy match analysis on such concatenated field.
On the Solution Constraints tab, Fuzzy Match Relation sub-tab, you can define which individual fields will
be included in composite fields.
ReMaDDer Software Tutorial
Page 22 / 55
For field pairs for which option “Include the field in composite field” is checked, Remadder will create
corresponding concatenated field during query preparation and perform fuzzy match comparison between
left dataset composite field and right dataset composite field.
This composite field (re)creation before query execution is resource extensive and lasts for a while, but can
greatly improve your overall result. Why? That’s because it can mitigate and balance wrongly set individual
field weights, in case taht you use “inclusive OR” option.
Important notice: Each solution has its own left and right composite field, directly related to the fuzzy match
relations section field pairs list. If you change fields list, you must also recreate corresponding composite
fields. Therefore, upon any change of fuzzy match relations specification, you shoudl recreate composite
fields and recreate SQL query.
Use “Generate SQL Query And Composite Field” button to recreate the composite field. Otherwise,
your fuzzy match analysis with composite fields will be wrong!
ReMaDDer Software Tutorial
Page 23 / 55
My advice is to utilize composite fields, with “inclusive OR” option, because it can alleviate and balance
individual field pairs fuzzy match comparison and wrongly set relative weights.
Similarity Threshold Value
You have to set threshold values for the similarity function, for both individual field pairs and composite
field pairs.
Threshold values must be real numbers between zero (0) and one (1) and are used as acceptance criteria for
record matching.
If result of the similarity function for a field pair is a value greater than or equal to the threshold
(>=threshold), then the field pair will be accepted as match.
If the result is lesser than the threshold, the field pair will be rejected as non-match.
Similarity(LeftString, RightString) >= Threshold ACCEPTED
Similarity(LeftString, RightString) < Threshold REJECTED
Inclusive “OR” vs. Exclusive “AND”
There is a solution header option whether to use inclusive "OR" or exclusive "AND" for fuzzy match analysis.
If the option “Use inclusive “OR” is checked then the inclusive OR option is used, otherwise exclusive
AND is used.
ReMaDDer Software Tutorial
Page 24 / 55
If we use inclusive “OR” option, then Remadder will use OR operator in SQL query WHERE clause to
combine the constraints, and result will be considered match if any of the conditions satisfy criteria.
If we use exclusive “AND” then Remadder will use AND operator in SQL query WHERE clause,
meaning that the both criteria must be satisfied in order to accept match.
Generally speaking, if we use inclusive "OR", then analysis will return more false positive results, while
with exclusive "AND" there will be more false negative results.
Notice that the option is relevant only for combining various constraints inside the fuzzy match relations,
not between the exact match, fuzzy match and other constraints. These three group of conditions are always
combined with exclusive AND operator, meaning that both three groups of conditions must be satisfied for
acceptance.
If you are familiar with SQL language, you can see how these conditions are translated to SQL query
language. Take a look on the “Solution SQL” page and see how Remadder translates these constraints to
SQL WHERE clause.
SQL (WHERE clause) in Remadder version 1.0.:
ReMaDDer Software Tutorial
Page 25 / 55
SQL (WHERE clause) in Remadder version >1.1.:
ReMaDDer Software Tutorial
Page 26 / 55
Solution Definition Details
ReMaDDer Software Tutorial
Page 27 / 55
While solution definition header defines general parameters for performing fuzzy match analysis, solution
definition details are being set in Field Picker subpage and Solution Constraints subpage with three
sections defining solution constraints: Exact Match Relations section, Fuzzy Match Relations
section and Other Constraints section.
Fields Picker
Remadder provides simple, yet very powerful visual tool to add field pairs to exact match relations section,
fuzzy match section or other constraints section.
ReMaDDer Software Tutorial
Page 28 / 55
By having input datasets ("left" and "right" datasets) fields listed side by side, you can easily browse two
lists, visually establish field pairs and send them to appropriate constraints definition sections by click on
appropriate button.
You can add selected fields pair to exact match section by clicking the button “Add Fields Pair To Exact
Match Relations Section”.
You can add selected fields pair to fuzzy match section by clicking the button “Add Fields Pair To Fuzzy
Match Relations Section”.
You can add left or right dataset field to other constraints section by clicking the respective button.
By eliminating need for tedious manual input and letting you to visually build solution constraints instead,
Remadder simplifies solution definition creation and boosts your performance.
Starting from Remadder version 1.1., checkbox column “Output Field to Resultset?” is added to the
Field Picker datagrid. It is used to include or exclude fields from being outputted to a resultset. By default,
all fields are included in resultset.
ReMaDDer Software Tutorial
Page 29 / 55
Solution Constraints
There are three type of constraints that you can define for a solution: exact match relations, fuzzy match
relations and other constraints.
Exact Match Relations
In exact matching relations section, we can add field pairs from "left" and "right" imported dataset and
define their equalness (=) or not-equalness (<>).
ReMaDDer Software Tutorial
Page 30 / 55
If we can define exact matching relation on one or more filed pairs, we can tremendously increase speed of
analysis.
Therefore, it is recommended to use exact match relations whenever possible.
Fuzzy Match Relations
In solution header section we can set various general parameters that determine how fuzzy match analysis
will be performed: we can choose whether a composite (concatenated) field be used, whether analysis will
be more inclusive (inclusive "OR") or more exclusive (exclusive "AND"), whether normalized Levenshtein
distance or trigram similarity or both function will be used.
In fuzzy match relations section we provide details for fuzzy match comparison analysis. We can list field
pairs to be compared and furtherly define how fuzzy match analysis will be performed.
ReMaDDer Software Tutorial
Page 31 / 55
Relative Field Weight
For each field pair, which will be compared by similarity, we have to define its relative weight. The bigger
the weight, the greater is importance of the particular field pair similarity in final decision whether two
records do match or not.
The weight for particular field pair is entered as an arbitrary integer value in the field “Field Weight
(integer)” and Remadder then calculates its relative weight. The sum of relative weights is always 1.
On new field pair addition to the fuzzy match relations section, the field pair gets default relative weight
(integer) value, which is one (1). You can change this value to any bigger integer and Remadder will
recalculate relative weights, taking care of their sum, which must always be 1.
Notice that there is an additional graphical indicator of relative weights. It shows graphically relative weight
for currently selected fields pair.
There are two buttons provided.
ReMaDDer Software Tutorial
Page 32 / 55
The button “Reset Weights” reset all relative weights to 1, which is the same as if relative weights are not
used at all. In that case, all field pairs are treated equally important.
The button “Recalculate Weights” performs the recalculation of relative fields according to the integer
values entered in the field “Field Weight (integer)”. You don’t need to trigger this action manually, since
this procedure is triggered automatically on each change of integer value or a field pair addition.
Inclusive “OR” vs. “Exclusive “AND”
Similarly to option whether to use inclusive "OR" or exclusive "AND" on solution header level, we have also
option to use inclusive "OR" or exclusive "AND" when combining individual field pairs results to get final
acceptance or rejection decision.
Generally speaking, if we use inclusive "OR", then analysis will return more false positive results, while with
exclusive "AND" there will be more false negative results.
My advice is to use “inclusive OR” for combining the field pairs with similarity operator in most cases,
because otherwise result will fail if any of the fields do not match, no matter whether other fields match or
not, which is in most cases wrong conclusion.
Include Or Exclude a Fields Pair From Composite Fields
For each field pair we can choose whether fields from the field pair will be included in composite fields in
left and right dataset or not. This is relevant only if option “use composite (concatenated) field” is checked
in the solution’s definition header.
ReMaDDer Software Tutorial
Page 33 / 55
Other Constraints
Similar to exact matching relations, it is desirable to limit analysis on particular subset of data. Such
constraints can greatly increase speed of record linkage or data deduplication analysis.
We can define any custom constraint to be applied on a particular field from "left" or "right" dataset.
Normally, condition is: sometable.somefield= ‘some string’, but other operators such as LIKE can be used
as well.
Solution Execution
ReMaDDer Software Tutorial
Page 34 / 55
Once a solution definition is prepared by setting global parameters and exact match, fuzzy match and other
constraints, you can then execute the solution on remote server and retrieve resultset.
A solution can be executed in one step, by clicking the button “Execute Solution”
or can be executed in two distinct steps of which first step is query preparation
and second step is actual query execution.
If “Execute Solution” is triggered and Remadder finds that the solution status field “Query Is Prepared” is
False, then it will first prepare query and then execute the query.
It is recommended to use two-step approach, because it is less likely you will forget to (re)create composite
(concatenated) fields on change of solution definition.
SQL Query Preparation Remadder software hides complexity of fuzzy match analysis constraints definition from user, by providing
simple and intuituive graphical user interface. However, the actual definition of the solution constraints is
quite complex and represented in SQL language.
Screenshot from Remadder version 1.0.:
ReMaDDer Software Tutorial
Page 35 / 55
Screenshot from Remadder version 1.1.:
ReMaDDer Software Tutorial
Page 36 / 55
Remadder software translates solution definition prepared by Remadder's user interface into an SQL query,
which is ready to be sent to remote PostgreSQL server for processing. The SQL query can also be manually
edited according to special needs.
ReMaDDer Software Tutorial
Page 37 / 55
The PostgreSQL server, placed in cloud, is the brute force behind the user interface, which provides robust
and sophisticated engine for fuzzy match analysis.
In order to be executed on server, the solution’s definition must first be translated to SQL language. It is
done in two different ways: with or without composite (concatenated) fields creation.
Prepare SQL Query With Composite Fields (Re)Creation
If you click the button “Generate SQL Query And Composite Field” then then first a concatenated
field is (re)created and indexed, according to the specification in fuzzy match relations section, and then
SQL query text is generated.
This concatenated composite fields creation takes some time.
Notice that this is relevant only if option “use composite (concatenated) field” is used in the solution’s
definition header in first place.
IMPORTANT NOTICE: If you change list of fields to be included in composite fields, in the fuzzy match
relations definition, for an already prepared solution (the field “Query Is Prepared” is True), then you MUST
trigger this concatenated fields recreation before query execution. Otherwise, you risk to get completely
wrong results because concatenated fields do not correspond to the fuzzy match specification!
Prepare Only SQL Query
If you click the button “Generate SQL query”, then the SQL query text is (re)created for the solution
definition, translating the definition into SQL language.
You can safely use this action only if you haven’t changed list of fields to be included into composite fields.
Otherwise, use “Generate SQL Query And Composite Field” button instead.
Data Retrieving And Storing
You can launch previously prepared solution SQL queries to the PostgreSQL server in order to process it
and return resultsets.
Screenshot from Remadder version 1.0.:
ReMaDDer Software Tutorial
Page 38 / 55
Screenshot from Remadder version 1.1.:
ReMaDDer Software Tutorial
Page 39 / 55
Once retrieved, resultsets are stored locally on your computer and you can load them afterwards anytime
you wish.
You can easily browse, edit and analyze results in many different ways, including datasheet forms with
sophisticated data searching, filtering and navigation capabilities.
Execute SQL Query
The query is executed by clicking the button “Execute Solution” .
Previously prepared SQL query text is sent to remote PostgreSQL server for execution. The progress of
query execution can be monitored in “Solution Log” page.
The speed of query execution depends on various factors, such as:
Whether there are any exact match relations specified? – Exact match constraints can
dramatically increase speed of the query execution.
Whether there are other constraints specified? – Any custom constraint can increase speed.
Whether Levenshtein distance function is used? – Levenshtein distance calculation is resource
extensive for processing and decreases query execution speed.
Whether Trigram similarity function is used? – Contrary to Levenshtein distance, trigram
similarity function and operator can utilize indexes, thus improve fuzzy match analysis speed.
Whether composite (concatenated) fields are used? – If concatenated fields are used, this
introduces additional calculations and thus decrease speed.
Whether “inclusive OR” or e”xclusive AND” is used?
The retrieved resultset is automatically opened in a separate form.
ReMaDDer Software Tutorial
Page 40 / 55
Solution Query Preparation And Resultset Retrieval Status Info Remadder automatically updates solution status upon query preparation and execution actions. The
solution status information is shown both in the solution header data grid and form view.
ReMaDDer Software Tutorial
Page 41 / 55
You get information whether SQL query is generated (prepared) or not, whether SQL query was already
executed or not, whether resultset is retrieved or not and if retrieved whether it was empty or not. It is also
shown whether the resultset is stored locally and in which file.
There are also information about execution times and number of executions performed.
Save And Load Resultset
ReMaDDer Software Tutorial
Page 42 / 55
Once a solution is executed and results retrieved, the resultset is automatically saved as a locally stored file
in the Remadder installation folder, subfolder “\data\results\”.
Resultset can be loaded into the subpage “Solution Result” of the main form, by clicking the button “Load
Solution Resultset” or in a separate form, by clicking the button “Load Solution Resultset In
Separate Window”.
Review And Edit Resultset There are various ways you can post-process and review the retrieved resultset.
Resultset Browsing
You can easily browse, edit and analyze loaded resultset in data grid form. Datasheet contains sophisticated
data searching, filtering and navigation capabilities.
ReMaDDer Software Tutorial
Page 43 / 55
You can scroll by using mouse, vertical and horizontal sliders and arrows.
You can also browse records by using navigation buttons.
Resultset Searching
You can easily search for any particular value in any column. On the upper left corner of the datagrid
there is a small button represented by orange double arrow. This button opens a pop-up dialog
with various search, filter and customization options, of which one is “Find data”.
ReMaDDer Software Tutorial
Page 44 / 55
When you click on the “Find data” button, a search dialog box appears. You can search any value on
any column.
Resultset Filtration
ReMaDDer Software Tutorial
Page 45 / 55
You can easily filter by any column. On the upper left corner of the datagrid there is a small button
represented by orange double arrow.
This button opens a pop-up dialog with various search, filter and customization options, including “Filter
data” and “Filter in table”, which are two different ways to perform filtration in a datagrid.
Filter Data
When you click the button “Filter data”, a dialog box appears on which you can build your filtering
conditions. This way you can define complex multicolumn filters.
ReMaDDer Software Tutorial
Page 46 / 55
The filtering is then applied by clicking “Apply” button.
Filter In Table
Another option for filtration is to use the button “Filter in table”, which activates a filtration
combobox, which is placed just below each column’s title. When you click on the filtration combobox cell,
a combobox lista appears, listing all possible values for respective column. When you select a value, the
respective column is automatically filtered by the chosen value.
ReMaDDer Software Tutorial
Page 47 / 55
Resultset Sorting
You can sort ascending or descending on any column by clicking column title.
Resultset Edit And Review
You can edit the resultset in datagrid easily. You can delete a row by using delete button ,
or edit a record by clicking the edit button .
ReMaDDer Software Tutorial
Page 48 / 55
Exporting Resultset
Besides using datagrid controls, another option for resultset post-processing is to export the resultset into
a spreadsheet and then perform reviewing and editing in a spreadsheet editor.
Remadder has many different possibilities of exporting resultset to spreadsheets.
Exporting Resultset To Spreadsheet
Resultset can be exported to a CSV file by clicking the button “Export To CSV File”.
Resultset can be exported to a XLSX file by clicking the button “Export To XLSX File”.
Resultset can be exported to XLS file by clicking the button “Export To XLS File”.
Resultset can be loaded directly into your default spreadsheet editor, e.g. LibreOffice Calc or Microsoft
Excel, by clicking the button “Load In Ext. Spreadsheet Editor”.
ReMaDDer Software Tutorial
Page 49 / 55
Remadder also has its own embedded spreadsheet editor which can be used for resultset post-processing.
Resultset can be loaded into the embedded spreadsheet editor by clicking the button “Load As
Spreadsheet”.
ReMaDDer Software Tutorial
Page 50 / 55
Exporting Datagrid To Spreadsheet
Another possibility for exporting resultset into a spreadsheet file is to use datagrid’s exporting feature.
ReMaDDer Software Tutorial
Page 51 / 55
You have to browse the destination folder for export and enter exported file name and extension, as well
as to enter page name (sheet name). If you forget to specify “page name”, you will get an error.
ReMaDDer Software Tutorial
Page 52 / 55
Customize Data Grids
Remadder enables you to customize your user interface in certain extent. You can shrink or stretch columns,
rearrange their order and hide/unhide columns.
Resize columns by dragging vertical splitters between columns.
Rearrange columns by pushing the left mouse button on a column’s title and dragging the column while
mouse button is still pushed down. After the column is moved to another position, release the mouse button.
You can define which columns are shown and which are hidden, by clicking on the button “Select visible
columns”.
ReMaDDer Software Tutorial
Page 53 / 55
When you close the application, your customization is saved and when you open the application again,
your customizations will be loaded as well.
Customize Splitters
You will notice that various sections are divided by splitters which you can easily drag and thus resize the
corresponding splitted sections.
The customization you make is saved on application close and reloaded on application start.
Remadder Software Trial
Remadder client application is distributed as a shareware with 15-days trial period.
On first application start on your computer the trial period is initialized.
ReMaDDer Software Tutorial
Page 54 / 55
Commercial Release Code Purchase And Activation
After expiration period expires, you are required to purchase commercial release code in order to be able to
continue using server features, such as raw data import and query execution.
You can, however, continue creating and editing projects and solution definitions, as well as loading and
editing previously acquired resultsets.
When purchasing release code, you are required to enter MachineID in purchase form. The MachineID is
a tag generated by Remadder software and is unique for your hardware. The purchased commercial release
code is thus machine-specific and valid only for your hardware.
Once you purchased release code, activate it by clicking the button “Activate Commercial Release
Code”.
You are asked to enter the release code.
ReMaDDer Software Tutorial
Page 55 / 55
The entered release code will be then validated and if correct, the server-side features will be unlocked for
you.