
Approximate String Joins with Abbreviations

    Wenbo Tao

    MIT

    [email protected]

    Dong Deng

    MIT

    [email protected]

    Michael Stonebraker

    MIT

    [email protected]

ABSTRACT

String joins have wide applications in data integration and cleaning. The inconsistency of data caused by data errors, term variations and missing values has led to the need for approximate string joins (ASJ). In this paper, we study ASJ with abbreviations, which are a frequent type of term variation. Although prior works have studied ASJ given a user-inputted dictionary of synonym rules, they have three common limitations. First, they suffer from low precision in the presence of abbreviations having multiple full forms. Second, their join algorithms are not scalable due to their exponential time complexity. Third, the dictionary may not exist, since abbreviations are highly domain-dependent.

We propose an end-to-end workflow to address these limitations. There are three main components in the workflow: (1) a new similarity measure taking abbreviations into account that can handle abbreviations having multiple full forms, (2) an efficient join algorithm following the filter-verification framework and (3) an unsupervised approach to learn a dictionary of abbreviation rules from the input strings. We evaluate our workflow on four real-world datasets and show that our workflow outputs accurate join results, scales well as the input size grows and greatly outperforms state-of-the-art approaches in both accuracy and efficiency.

    PVLDB Reference Format:

Wenbo Tao, Dong Deng, Michael Stonebraker. Approximate String Joins with Abbreviations. PVLDB, 11(1): 53-65, 2017.
DOI: 10.14778/3136610.3136615

1. INTRODUCTION

Join is one of the fundamental operations of relational DBMSs. Joining string attributes is especially important in areas such as data integration and data deduplication. However, in many applications the data between two tables is not consistent because of data errors, abbreviations, and missing values. This leads to a requirement for approximate string joins (ASJ), which is the topic of this paper.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 44th International Conference on Very Large Data Bases, August 2018, Rio de Janeiro, Brazil.
Proceedings of the VLDB Endowment, Vol. 11, No. 1
Copyright 2017 VLDB Endowment 2150-8097/17/09... $10.00.
DOI: 10.14778/3136610.3136615

1. mit harvd division of health sciences technology
   harvard mit division of hst
2. camp activ compl
   campus activities complex
3. dept of athletics phys ed recreation
   daper
4. us and p
   urban studies and planning

Figure 1: Some similar department names in the MIT Data Warehouse found by our approach.

In ASJ, the similarity between two strings is typically captured by traditional measures such as Jaccard and edit distance [13]. These measures calculate a value between 0 and 1, with higher values indicating more similarity. The join results are string pairs with similarity values not less than a user-inputted threshold¹.
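To make the baseline concrete, here is a minimal sketch (our own illustration, not the paper's code) of non-weighted token-set Jaccard similarity; the pair us and p / urban studies and planning from Figure 1 shares only the token and:

```python
def jaccard(s1, s2):
    """Non-weighted Jaccard: |shared tokens| / |all distinct tokens|,
    tokenizing on whitespace."""
    t1, t2 = set(s1.split()), set(s2.split())
    return len(t1 & t2) / len(t1 | t2)

# Row 4 of Figure 1: only "and" is shared, out of 6 distinct tokens.
print(jaccard("us and p", "urban studies and planning"))  # 1/6, far below 0.5
```

Any threshold above 0.5 would miss this genuinely matching pair, which is the motivation for the rest of the paper.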

However, abbreviations, a frequent type of term variation in real-world data, pose a significant challenge to the effectiveness of traditional similarity functions. Figure 1 shows several pairs of similar department names with abbreviations that we found in the MIT Data Warehouse². We can see that none of these pairs has a large edit or Jaccard similarity.

1.1 Previous Research and their Limitations

To address the ineffectiveness of traditional measures, two prior works, JaccT [4] and SExpand [15], have studied how to find similar strings given a user-inputted dictionary of synonym rules. Both adopt the idea of generating a set of derived strings for any given string s, either by applying synonym rewritings to s [4] or by appending applicable synonyms to s [15], and consider two strings similar if the similarity of the most similar pair of derived strings is not less than the threshold.

    However, they have three common limitations.

Limitation 1: Low precision is likely when the same abbreviation has multiple full forms, as illustrated by the following example.

Example 1. Figure 2 shows two sets of strings and a user-inputted synonym dictionary. The abbreviation cs has five different synonyms (full forms).

JaccT generates derived strings of a string by rewriting it using applicable synonym rules. For example, string b1

¹ This threshold is usually greater than 0.5 to avoid outputting too many dissimilar strings.
² http://ist.mit.edu/warehouse


String set 1:
a1 department of cs
a2 campus security
a3 ctr for control systems

The synonym dictionary:
⟨dept, department⟩
⟨cs, computer science⟩
⟨cs, campus security⟩
⟨cs, career services⟩
⟨cs, control systems⟩
⟨cs, chemical sciences⟩

String set 2:
b1 dept of computer science
b2 career services
b3 ctr for cs

Figure 2: Example strings and synonym rules to illustrate prior work [4, 15].

= dept of computer science has two applicable synonym rules: ⟨dept, department⟩ and ⟨cs, computer science⟩. So b1 has 2² = 4 derived strings, namely dept of cs, dept of computer science, department of cs and department of computer science. Given two strings s1 and s2, JaccT searches for a derived string s'1 of s1 and a derived string s'2 of s2 such that f(s'1, s'2) is maximized, where f is some traditional similarity function³ such as Jaccard⁴ or edit similarity. Consider a2 = campus security and b2 = career services. They represent two different things and thus should not be considered similar. Yet because they have a common derived string cs, their JaccT similarity is 1. More generally, any two different full forms of cs will have a JaccT similarity of 1 because they share the common derived string cs. Therefore, low precision is a possible outcome.
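A brute-force sketch of this derivation scheme (our illustration; helper names are ours, and the real JaccT join uses signatures rather than full enumeration) shows how two unrelated full forms of cs collapse to the same derived string:

```python
def jaccard(s1, s2):
    """Non-weighted token-set Jaccard similarity."""
    t1, t2 = set(s1.split()), set(s2.split())
    return len(t1 & t2) / len(t1 | t2)

def derived(tokens, rules):
    """All strings obtainable by non-overlapping rule applications, in either
    direction (abbr -> full form, or full form -> abbr)."""
    out = set()
    def rec(i, acc):
        if i == len(tokens):
            out.add(" ".join(acc))
            return
        rec(i + 1, acc + [tokens[i]])              # leave this token unchanged
        for abbr, full in rules:
            f = full.split()
            if tokens[i] == abbr:                  # rewrite abbr -> full form
                rec(i + 1, acc + f)
            if tokens[i:i + len(f)] == f:          # rewrite full form -> abbr
                rec(i + len(f), acc + [abbr])
    rec(0, [])
    return out

def jacct(s1, s2, rules):
    """Max similarity over every pair of derived strings (exponential)."""
    return max(jaccard(d1, d2) for d1 in derived(s1.split(), rules)
                               for d2 in derived(s2.split(), rules))

rules = [("cs", "campus security"), ("cs", "career services")]
print(jacct("campus security", "career services", rules))  # 1.0: both derive "cs"
```

Both input strings derive the string cs, so JaccT considers them identical even though they name different departments.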

SExpand generates derived strings in a different way, by appending applicable synonyms. For instance, b1 has the following four derived strings:

dept of computer science
dept of computer science cs
dept of computer science department
dept of computer science cs department

Then, in the same manner as JaccT, SExpand calculates the similarity between two strings by searching for the most similar pair of derived strings. Consider a1 = department of cs and b3 = ctr for cs. The tokens cs in the two strings respectively stand for computer science and control systems. However, because SExpand searches for the most similar pair of derived strings, all five full forms of cs will be appended to both strings (e.g. a'1 = department of cs computer science campus security career services control systems chemical sciences), leading to a high similarity of 11/15 (|a'1 ∩ b'3| = 11 and |a'1 ∪ b'3| = 15). More generally, given the dictionary in Figure 2, any two strings both containing cs will get a relatively high SExpand similarity.

    Limitation 2: The join algorithms are not scalable.

Based on their similarity measures, both JaccT and SExpand propose a filter-verification algorithm to compute the join results. In the filter step, they first generate a signature set for each string, which guarantees that two strings are similar only if their signature sets intersect. So they select a set of candidate string pairs whose signature sets intersect. In the verification step, for each candidate pair, they compute its real JaccT or SExpand similarity. However, to generate the signature set of a string, both JaccT and SExpand need to enumerate all its derived strings, of which there are O(2^n), where n is the number of applicable synonym rules in the string. This makes their join algorithms unscalable when n is large.

³ For simplicity, in the following we use Jaccard as the default underlying similarity function, but our techniques can be modified to support other functions.
⁴ The Jaccard similarity of two strings is the number of shared tokens divided by the number of all distinct tokens in the two strings.

Limitation 3: The dictionary may not exist, as term abbreviations are highly domain-dependent.

Some previous works [5, 16] learn synonyms from user-provided examples of matching strings. However, it is impractical for a human to manually compile a set of examples that is large enough to make this approach effective (in [5], 100,000 examples are used). To this end, the authors used traditional Jaccard-based ASJ to automatically generate examples (e.g. all string pairs with Jaccard similarity no less than 0.8). But we can see from Figure 1 that good examples of matching strings often have very small Jaccard similarity, making these approaches suffer from low recall. There is also some NLP work [3, 8, 17, 19, 21] on discovering abbreviations in text. However, these works all rely on explicit indicators of abbreviations in text documents, such as parentheses (e.g. ...At MIT (Massachusetts Institute of Technology)...). In the string join setting, those indicators often do not exist. For example, no string in Figure 1 has parentheses.

1.2 Our Solution

To address these limitations, we propose an end-to-end workflow to effectively and efficiently solve ASJ with abbreviations.

We first propose a new similarity measure which is robust to abbreviations having multiple full forms. Instead of searching for the most similar pair of derived strings, we search for a derived string of one string that is the most similar to the other string, i.e., find a derived string s'1 of s1 that maximizes f(s'1, s2), and vice versa. For example, in Figure 2, no derived string of campus security will have a positive Jaccard similarity with career services, and vice versa. We present the formal description and analysis of this new measure in Section 2. Another big advantage brought by this new measure is that, to compute the similarity, it has a much smaller search space than JaccT and SExpand, leading to much more efficient join computation (Section 3).

We then design an efficient join algorithm following the filter-verification framework. In contrast to JaccT and SExpand, we propose a polynomial-time algorithm to calculate signatures without iterating through all derived strings. Calculating the new similarity measure is NP-hard, so we propose an efficient heuristic algorithm to verify all candidate string pairs after the filter step (Section 3).

We also present an unsupervised approach to learn an abbreviation dictionary⁵ from the input strings. We first extract a small set of candidate abbreviation rules by assuming that the length of the Longest Common Subsequence (LCS)⁶ between an abbreviation and its full form should be equal or very close to the length of the abbreviation. For example, the LCS between harvd and its full form harvard

⁵ We focus on discovering abbreviations in this paper, but our similarity measure and join algorithm can use other types of user-inputted synonym rules.
⁶ A subsequence of a string s can be derived by deleting zero or more characters of s without changing the order of the remaining characters. The longest common subsequence of two strings is a subsequence of both strings which has the largest number of characters.


is harvd, which is exactly the abbreviation itself. This LCS-based assumption is expressive. It can capture a variety of abbreviation patterns such as prefix abbreviations (e.g. ⟨camp, campus⟩), single-term abbreviations (e.g. ⟨harvd, harvard⟩) and acronyms (e.g. ⟨cs, computer science⟩). We search for all pairs of a token in any string and a token sequence in any string that satisfy the LCS-based assumption. A naive approach which iterates through every pair runs quadratically in the number of strings and thus is not scalable. So we propose an efficient algorithm to retrieve LCS-based candidate rules without enumerating all pairs. We then use existing NLP techniques [3, 19] to refine the LCS-based candidate rule set and remove unlikely abbreviations. Details are in Section 4.
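When the allowed slack is zero, the LCS-based assumption reduces to a subsequence test: |LCS(abbr, full)| = |abbr| exactly when abbr is a character subsequence of full. A minimal sketch (function name is ours):

```python
def satisfies_lcs_assumption(abbr, full):
    """Zero-slack case of the LCS-based assumption: abbr must be a
    character subsequence of full."""
    it = iter(full)
    return all(ch in it for ch in abbr)  # membership search consumes the iterator

# The three patterns from the text, plus a non-abbreviation:
print(satisfies_lcs_assumption("camp", "campus"))           # True (prefix)
print(satisfies_lcs_assumption("harvd", "harvard"))         # True (single term)
print(satisfies_lcs_assumption("cs", "computer science"))   # True (acronym)
print(satisfies_lcs_assumption("xyz", "computer science"))  # False
```

The subsequence check runs in time linear in |full|, but applying it to every token/token-sequence pair is still quadratic in the number of strings, which motivates the trie-based algorithm of Section 4.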

To summarize, we make the following contributions:

• The first (to our best knowledge) end-to-end workflow solving ASJ with abbreviations.

• A new string similarity measure taking abbreviations into account which is robust to abbreviations having multiple full forms and enables much more efficient join computation than previous measures.

• An efficient join algorithm which generates signatures for strings in PTIME (as opposed to the exponential approach adopted by previous works) and calculates similarities efficiently using a heuristic algorithm.

• An unsupervised approach to efficiently learn a comprehensive abbreviation dictionary from input strings.

• Experimental results on four real-world datasets demonstrating that (1) our join results have high accuracy, (2) the entire workflow scales well as the input size grows and (3) individual parts of the workflow outperform state-of-the-art alternatives in both accuracy and efficiency.

2. STRING SIMILARITY MEASURE AND JOINS WITH ABBREVIATIONS

We start by describing how we formally model strings and abbreviation rules. Next, we introduce a new similarity measure which quantifies how similar two strings are given a dictionary of abbreviation rules⁷. The formal definition of the ASJ problem is presented at the end of this section.

2.1 Strings and Abbreviation Rules

We model a string s as a token sequence⁸ s(1), s(2), ..., s(|s|), where |s| denotes the number of tokens in s. For example, if the tokenization delimiter is whitespace, then for s = dept of computer science, s(1) = dept, s(4) = science and |s| = 4. The token sequence s(i, j) (1 ≤ i ≤ j ≤ |s|) is the token sequence s(i), s(i+1), ..., s(j). For example, for s = dept of computer science, s(3, 4) = computer science.

An abbreviation rule is a pair of the form ⟨abbr, full⟩, where abbr is a token representing an abbreviation and full is a token sequence representing its full form. Example abbreviation rules in Figure 1 include ⟨camp, campus⟩, ⟨harvd, harvard⟩ and ⟨hst, health sciences technology⟩. An abbreviation rule ⟨abbr, full⟩ is applicable in a string s if either abbr is a token of s, or full is a token subsequence

⁷ Section 4 will introduce how to learn this dictionary.
⁸ A string can be tokenized by splitting it based on common delimiters such as whitespace.

of s. For example, ⟨camp, campus⟩ is applicable in camp activ compl and ⟨cs, computer science⟩ is applicable in department of computer science. We use applicable side to denote the side of an applicable rule that matches the string, and rewrite side to denote the other side. For example, the applicable side of ⟨camp, campus⟩ w.r.t. camp activ compl is camp, whereas the rewrite side is campus.

2.2 Our Measure

We propose a new similarity measure, pkduck⁹, to address the issue with previous measures, i.e., low precision in the presence of abbreviations having multiple full forms. In contrast to JaccT and SExpand, which search for the most similar pair of derived strings, we search for a derived string of one string that is the most similar to the other string.

Formally, we use pkduck(R, s1, s2) to denote the similarity between strings s1 and s2 given a dictionary R of abbreviation rules. pkduck(R, s1, s2) is calculated as follows. For a given string s, we use A(s) to denote the set of applicable rules in s. Applying an applicable rule means rewriting the applicable side with the rewrite side. Following JaccT, we get a derived string s' of s by applying the rules in a subset of A(s) such that the rule applications do not overlap (i.e. each original token is rewritten by at most one rule)¹⁰. We use D(s) to denote the set of all derived strings of s. Then,

pkduck(R, s1, s2) = max( max_{s'1 ∈ D(s1)} f(s'1, s2), max_{s'2 ∈ D(s2)} f(s1, s'2) )

Our algorithm is based on the trie structure [11], which is widely used in Information Retrieval to efficiently index a set of strings and search for a query string in time linear in the string length. Figure 5 shows the trie constructed for four strings. The string formed by concatenating the characters on the path from the root to a node is called the prefix of that node. For example, the prefix of node 10 is dap. We say a trie node matches a string if the string equals the prefix of this node. In Figure 5, every leaf node (in orange) uniquely matches one original string¹⁶.

¹⁶ Indexed strings often have a special end-of-string character appended to ensure that no string is a prefix of another.

[Trie figure omitted: it indexes the four strings hst, harvd, dept and daper, with numbered nodes and one highlighted leaf node per indexed string.]

Figure 5: A trie constructed from four strings.

We first create a trie T to index all tokens in S. Then, the core problem we need to solve is: for each token sequence ts in S, find all leaf nodes in T that match some subsequence of ts.

Our algorithm recursively calculates this set of leaf nodes by considering the characters of ts one by one. Formally, we use F(i) to denote the set of nodes in T that match some subsequence of the first i characters of ts (denoted as ts[1..i]). The recursion base is F(0) = {root of T}, meaning the root matches the empty subsequence. What we need are all leaf nodes in F(|c|), where c is the concatenation of the tokens in ts.

For i > 0, F(i) is calculated based on F(i−1) as follows. According to the definition of F(i), F(i−1) should be a subset of F(i), because subsequences of ts[1..i−1] are also subsequences of ts[1..i]. So we initially set F(i) to F(i−1). Let c_i denote the i-th character of ts. Recall that each node x ∈ F(i−1) matches a subsequence of ts[1..i−1]. If x has an outgoing edge labeled c_i to a child x', we can append c_i to the subsequence that x matches and form a new subsequence of ts[1..i]. This new subsequence of ts[1..i] is matched by the child x', which we add to F(i).

Every leaf node x ∈ F(|c|) matches a token in S which is also a subsequence of ts. Therefore, the prefix of x and ts form a valid LCS-based candidate rule.

    We use the following example to illustrate the recursivecalculation of the F sets.

Example 4. Let T be the trie in Figure 5 and ts be health st. The F sets are shown in the following table.

i   c_i   F(i)
0         {1}
1   h     {1, 2}
2   e     {1, 2}
3   a     {1, 2, 4}
4   l     {1, 2, 4}
5   t     {1, 2, 4}
6   h     {1, 2, 4}
7   s     {1, 2, 4, 5}
8   t     {1, 2, 4, 5, 9}

Let us consider the calculation of F(8). F(7) contains nodes 1, 2, 4 and 5, which respectively match the empty subsequence and the subsequences h, ha and hs. The 8-th character is t. In F(7), only node 5 has an outgoing edge labeled t. Thus, node 9 is added to F(8).


Algorithm 1: LCS-based Candidate Rule Generation

Input: A set S of tokenized strings.
Output: A set R of LCS-based candidate rules.

1   T ← an empty trie
2   for each token t in S do
3       insert t into T
4   R ← ∅
5   root ← the root node of T
6   for each token sequence ts ∈ S do
7       c ← concatenation of the tokens in ts
8       F(0) ← {root}
9       for i ← 1 to |c| do
10          F(i) ← F(i−1)
11          for each x ∈ F(i−1) do
12              if node x has an outgoing edge labeled c_i then
13                  x' ← the corresponding child
14                  insert x' into F(i)
15      for each x ∈ F(|c|) do
16          if node x is a leaf node then
17              abbr ← the prefix of x
18              full ← ts
19              insert ⟨abbr, full⟩ into R
20  return R

After calculating all the F sets, node 9 is the only leaf node in F(8), which yields one valid LCS-based candidate rule, ⟨hst, health st⟩.

Algorithm 1 presents the pseudo-code of our algorithm. Lines 9-14 show the recursive calculation of the F sets. Lines 15-19 generate an LCS-based candidate rule for each leaf node in F(|c|).
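For the zero-slack case, Algorithm 1 can be transcribed directly (a sketch under our own data-structure choices; in particular, leaves are detected by marking end-of-token nodes instead of appending a terminator character):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child TrieNode
        self.token = None    # set to the indexed token at its final node

def build_trie(tokens):
    """Lines 1-3: index every token of S in a trie."""
    root = TrieNode()
    for t in tokens:
        node = root
        for ch in t:
            node = node.children.setdefault(ch, TrieNode())
        node.token = t
    return root

def lcs_candidate_rules(strings):
    """For every token sequence ts, find every indexed token that is a
    subsequence of ts and emit the rule <token, ts>."""
    root = build_trie({t for s in strings for t in s.split()})
    rules = set()
    for ts in strings:
        c = ts.replace(" ", "")   # concatenation of the tokens in ts (Line 7)
        F = {root}                # F(0): the root matches the empty subsequence
        for ch in c:              # F(i) from F(i-1), Lines 9-14; the set
            F |= {x.children[ch] for x in F if ch in x.children}
        for x in F:               # Lines 15-19 (identity rules skipped)
            if x.token is not None and x.token != ts:
                rules.add((x.token, ts))
    return rules

# Example 4: with health st processed alongside the four tokens of Figure 5,
# the rule <hst, health st> is recovered.
rules = lcs_candidate_rules(["hst", "harvd", "dept", "daper", "health st"])
print(("hst", "health st") in rules)  # True
```

Note that the set comprehension on each step is evaluated over F(i−1) before the union is applied, which is exactly the F(i) ← F(i−1) ∪ {children reached via c_i} recurrence.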

Extension to δ > 0. To support positive values of δ, we can modify the above algorithm as follows. First, for any token t and token sequence ts, if |t| − |LCS(t, ts)| ≤ δ, then LCS(t, ts) must be one of the subsequences of t having at least |t| − δ characters. Because δ is small, we can simply insert all these subsequences into the trie index. Second, in Line 17 of Algorithm 1, if the prefix of x is not a token in S, we should assign to abbr the original token in S that x corresponds to.

Performance Analysis. Constructing a trie for a set of strings takes time linear in the sum of the string lengths. So the cost of building the trie is very small.

For a given token sequence ts, compared to the naive approach, our algorithm does not iterate over all leaf nodes and only visits trie nodes that match a subsequence of ts. In the worst case, this number could be as large as the size of T. But we will show empirically in Section 5 that this number is very small on real-world datasets and that this algorithm is very efficient and scales well.

4.2 Refinement

Some of the LCS-based candidate rules are not likely to contain real abbreviations. Consider the rule ⟨te, database⟩. Although it satisfies the LCS-based assumption, in practice it is unlikely that people will abbreviate database as te. Other examples include ⟨ia, information⟩ and ⟨ay, dictionary⟩. Including these kinds of rules in the dictionary not only incurs unnecessary overhead in the ASJ process but also hurts the accuracy of the join results, since they falsely equate a token with a token sequence.

Thus it is important to design a rule refiner¹⁷ that filters rules unlikely to contain real abbreviations out of the LCS-based candidate rule set.

We implement the rule refiner by combining heuristic patterns from existing NLP works [2, 3, 19, 21]. By analyzing large amounts of text, these works create handcrafted heuristic patterns to filter unlikely abbreviations out of a set of candidates extracted from text. An example pattern is: if a vowel letter in full that is preceded by a consonant letter matches a letter in abbr, the preceding consonant letter must also match a letter in abbr. For example, in ⟨te, database⟩, the e in database matches the e in te, but the preceding s does not match any letter in te. Thus this rule will be filtered out by this pattern. Similarly, ⟨ia, information⟩ and ⟨ay, dictionary⟩ will also be filtered out by this pattern. Interested readers should refer to those papers for more patterns and details.
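The vowel pattern above can be sketched as follows (a simplification in two ways: the names are ours, and we check only a single leftmost alignment of abbr inside full, whereas the cited works consider every possible alignment):

```python
VOWELS = set("aeiou")

def leftmost_alignment(abbr, full):
    """Indices in full matched by abbr, taking each character as early as
    possible; assumes abbr is a subsequence of full (an LCS-based candidate)."""
    pos, idx = 0, []
    for ch in abbr:
        pos = full.index(ch, pos)
        idx.append(pos)
        pos += 1
    return idx

def passes_vowel_pattern(abbr, full):
    """Reject a rule if a matched vowel in full is preceded by a consonant
    that abbr does not also match."""
    matched = set(leftmost_alignment(abbr, full))
    for i in matched:
        if (full[i] in VOWELS and i > 0
                and full[i - 1] not in VOWELS and (i - 1) not in matched):
            return False
    return True

print(passes_vowel_pattern("te", "database"))    # False: e matched, s before it is not
print(passes_vowel_pattern("harvd", "harvard"))  # True: matched a is preceded by matched h
```

The same check rejects ⟨ia, information⟩ (the matched a is preceded by an unmatched m) and ⟨ay, dictionary⟩ (the matched a is preceded by an unmatched n).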

There are also some supervised learning approaches [8, 18, 28] that could be adopted to implement the rule refiner. However, we focus on building an automatic end-to-end workflow and leave the comparison of different implementations for future work.

5. EXPERIMENTAL RESULTS

In this section, we describe our experiments and results. The goals of our experiments are (1) evaluating the end-to-end accuracy and efficiency of our workflow and (2) comparing individual parts of our workflow with state-of-the-art alternatives. Section 5.1 describes the experimental settings. Sections 5.2-5.4 respectively evaluate our similarity measure, join algorithm and abbreviation dictionary.

5.1 Experimental Settings

Datasets. We use four real-world datasets:

• Disease contains 634 distinct disease names. The average number of tokens per string (denoted as a) is 2.1 and the maximum number of tokens a string can have (denoted as k) is 10.

• Dept contains 1,151 distinct department names we collected from the MIT Data Warehouse by merging all department name columns in all tables. The a and k values are respectively 3.4 and 10.

• Course contains 20,049 distinct course names we collected from the MIT Data Warehouse by merging all course name columns. The a and k values are respectively 4.1 and 17.

• Location contains 112,394 distinct location names (street names, city names, etc.) we collected from 7,515 tables crawled from Data.gov. The a and k values are respectively 3.6 and 13.

Sample similar strings. Figure 1 shows sample similar strings in the Dept dataset found by our approach. Figures 6, 7 and 8 show some sample similar strings in Disease,

¹⁷ The characteristics of abbreviations are language-dependent [18, 22], but we focus on English abbreviations in this paper.


Table 1: Comparing pkduck with SExpand and traditional non-weighted Jaccard (using the LCS-based dictionary with δ = 0). JaccT is not included because it did not terminate in one week. P, R and F respectively stand for Precision, Recall and F-measure.

                   Disease           Dept             Course           Location
         thresh   P    R    F     P    R    F     P    R    F     P    R    F
Jaccard    0.7  0.00 0.00 0.00  0.64 0.18 0.28  0.45 0.08 0.14  0.53 0.21 0.30
           0.8  0.00 0.00 0.00  0.60 0.08 0.14  0.50 0.03 0.06  0.25 0.01 0.02
           0.9  0.00 0.00 0.00  0.00 0.00 0.00  1.00 0.01 0.01  0.00 0.00 0.00
SExpand    0.7  0.71 0.75 0.73  0.14 0.93 0.25  0.01 1.00 0.03  0.01 0.98 0.02
           0.8  0.83 0.73 0.78  0.30 0.90 0.45  0.04 0.99 0.07  0.01 0.97 0.03
           0.9  0.99 0.69 0.82  0.42 0.62 0.50  0.15 0.89 0.25  0.02 0.74 0.05
pkduck     0.7  0.88 0.74 0.81  0.75 0.79 0.77  0.66 0.77 0.71  0.46 0.55 0.50
           0.8  0.97 0.72 0.83  0.83 0.60 0.70  0.78 0.58 0.66  0.74 0.28 0.40
           0.9  0.99 0.72 0.83  0.96 0.37 0.54  0.95 0.39 0.56  0.80 0.24 0.37

1. cmt disease
   charcot marie tooth disease
2. ic/pbs
   interstitial cystitis/painful bladder syndrome

Figure 6: Sample similar strings in Disease.

1. prac it mgmt
   practical information technology management
2. seminar in marine geology and geophysics at mit
   sem in mg g at mit
3. quant analysis emp methods
   quant analysis empirical meth

Figure 7: Sample similar strings in Course.

Course and Location. These demonstrate that abbreviations are a common type of term variation in real-world data and that solving ASJ with abbreviations has pragmatic benefits.

Quality Metrics of Join Results. We have all pairs of strings that refer to the same disease (i.e. the ground truth) for the Disease dataset. For Dept, we manually construct the ground truth. For Course and Location, we randomly sample 5% of the strings and manually construct the ground truth. Then we measure Precision (P), Recall (R) and F-measure (F), where F = 2PR / (P + R). All reported P, R and F numbers are rounded to the nearest hundredth.

Implementation & Environment. Non-weighted Jaccard is used as the underlying similarity function f. All algorithms are implemented in C++¹⁸. All experiments were run on a Linux server with a 1.0GHz CPU and 256GB of RAM.

5.2 Evaluating the Similarity Measure

In this experiment, we compare our pkduck similarity measure with traditional non-weighted Jaccard, JaccT [4] and SExpand [15] by evaluating the end-to-end accuracy of the join results.

The LCS-based dictionary with δ = 0 is used. Table 2 shows some characteristics of this dictionary. Section 5.4 will include a comparison of different dictionaries.

We implement JaccT and SExpand using a modification of our signature generation algorithm to generate JaccT and

¹⁸ Due to space limitations, interested readers can refer to our extended technical report [1] for a discussion of how to implement ASJ in an RDBMS and an additional scalability experiment.

1. martin luther king ave se
   mlk avenue se
2. historic columbia river hwy e
   hist col rvr hwy
3. broadway ste
   brdway suite

Figure 8: Sample similar strings in Location.

Table 2: Characteristics of the LCS-based dictionary (δ = 0). n is the number of applicable rules in a string. f is the sum of the frequencies of abbreviations having more than 5 different full forms.

            Maximum n   Average n     f
Disease            51        2.60  0.02
Dept              188       11.90  0.04
Course          1,351       40.87  0.20
Location        1,716       38.87  0.26

SExpand signatures in polynomial time¹⁹. However, even with this optimization, JaccT still did not terminate within one week on any dataset, because of the O(2^(2n)) algorithm used to calculate the similarities of candidate string pairs, where n can be in the thousands, as shown in Table 2. Therefore, we conclude that it is not practical to use JaccT and do not include it in the comparison.

    Results are shown in Table 1. We observe the following:

• Our pkduck measure has the highest F-measure on all datasets, and greatly outperforms SExpand. For example, on Course, pkduck's highest F-measure is 0.71 (threshold 0.7), while SExpand's highest F-measure is only 0.25 (threshold 0.9).

• SExpand suffers from low precision. For example, on Location, SExpand's highest precision is only 0.02. This is because SExpand is susceptible to abbreviations having multiple full forms, which are frequent in the input strings, as shown in Table 2. In general, as analyzed in Example 1, any two strings both containing an abbreviation that has multiple full forms will get a high SExpand similarity. Note that on Disease with

¹⁹ See our extended technical report [1] for details on the modification.


threshold 0.9, SExpand has high precision (0.99). This is because abbreviations having multiple full forms are not very frequent there and the threshold is large.

• Jaccard has very low recall. Its highest recall on any dataset is only 0.21. This is not surprising, because matching strings often have very low Jaccard similarity, as shown in Figures 1, 6, 7 and 8.

Limitations. Despite the clear benefits of pkduck, we note two limitations. First, pkduck is often subject to the ineffectiveness of the underlying function f:

• pkduck has relatively low recall when the threshold is large. For example, on Course with threshold 0.9, the recall of pkduck is only 0.39. This actually results from the use of Jaccard as the underlying similarity function. For two strings that both contain very few tokens, a single dangling token will result in a very low Jaccard similarity. Consider two strings s1 = dept of cs and s2 = department cs. Even with the rule ⟨dept, department⟩, their pkduck similarity is only 2/3 because of the dangling token of. On the contrary, SExpand is less sensitive to the underlying function because it can append different full forms of the same abbreviation to both strings to increase their similarity. For example, supposing cs has five different full forms as in Figure 2, the SExpand similarity between s1 and s2 is 12/13 (|s'1 ∩ s'2| = 12 and |s'1 ∪ s'2| = 13). Unfortunately, as mentioned before, this comes at the cost of precision.

• The fact that some strings are considered similar by f even though they actually mean different things may cause pkduck to incorrectly join some strings. One example of such false positives on Course is s1 = special sub in ee cs and s2 = special sub in computer science, with a learned rule ⟨cs, computer science⟩. Their pkduck similarity is 5/6 because the Jaccard similarity between s'1 = special sub in ee computer science and s2 is 5/6. However, they are two different courses.

Second, pkduck may join an abbreviation with an incorrect full form. For example, one false positive on Disease is s1 = ml and s2 = metachromatic leukodystrophy (s1 refers to mucolipidoses, not s2). In this case, considering other attributes of a record will help (e.g. the symptoms of a disease). However, this is beyond the scope of this paper, so we leave it for future work.

5.3 Evaluating the Join Algorithm

In this experiment, we evaluate the efficiency and scalability of our join algorithm and compare it with JaccT and SExpand. JaccT, SExpand and our join algorithm all follow the filter-verification framework, which first generates prefix-filter signatures, then selects a set of candidate string pairs and finally verifies each candidate string pair. So we first compare our join algorithm with JaccT and SExpand in terms of signature generation efficiency, number of candidate string pairs and verification efficiency. Then, we evaluate how our join algorithm scales as the input size varies. The LCS-based dictionary with δ = 0 is used.

    Signature generation. Both JaccT and SExpand calculate the signature of a string by iterating over its derived strings, which is O(2^n). Table 2 shows that in LCS-based

    [Figure 9: Performance of our PTIME signature generation algorithm: execution time (s) vs. δ ∈ {0.7, 0.8, 0.9}, with and without rule compression, on (a) Course and (b) Location. JaccT and SExpand did not finish generating signatures in one week.]

    [Figure 10: Number of candidate string pairs (pkduck vs. JaccT vs. SExpand) for δ ∈ {0.7, 0.8, 0.9} on (a) Course (×10^6) and (b) Location (×10^7). The signatures of JaccT and SExpand are generated using a modification of our PTIME signature generation algorithm.]

    dictionaries, n can be in the thousands, making this approach far from scalable. We implemented this brute-force way of generating signatures and it did not terminate in one week on any dataset.

    In contrast, our PTIME signature generation algorithm is very efficient. The execution time is shown in Figure 9. Even without the rule compression optimization (Section 3.3), our signature generation algorithm took less than 1,500 seconds to finish on the largest dataset, Location.

    Figure 9 also shows that the rule compression technique significantly improves efficiency. For example, when δ = 0.7, rule compression makes the algorithm respectively 23× and 21× faster on Course and Location. The reason is that the rule compression technique groups rules having the same number of tokens on the applicable side and the rewrite side, which lowers the time complexity of our signature generation algorithm.

    Number of candidate string pairs. Figure 10 shows the number of candidate string pairs selected by JaccT, SExpand and pkduck. It can be seen that pkduck selects far fewer candidate string pairs than JaccT and SExpand. For instance, on Location with δ = 0.9, pkduck selects 30× fewer candidate pairs than JaccT, and 33× fewer candidate pairs than SExpand. As analyzed in Section 3.2, this results from the much smaller search space of pkduck compared to those of JaccT and SExpand.

    Verification efficiency. We report in Table 3 the average time of SExpand and pkduck to verify a candidate string pair (δ = 0.7). JaccT is not included in the comparison because it did not terminate in one week on any dataset.


    Table 3: Average verification time (δ = 0.7) in comparison with SExpand.

              Course         Location
    pkduck    8.12×10^-5 s   1.02×10^-4 s
    SExpand   1.75×10^-2 s   1.69×10^-1 s

    [Figure 11: Scalability of our join algorithm: execution time (s) vs. input size (25%, 50%, 75% and 100% of the original data) for δ ∈ {0.7, 0.8, 0.9}, on (a) Course and (b) Location.]

    We can see that pkduck's verification algorithm is much faster than that of SExpand on Course and Location, where n is large. For example, on Location, pkduck's average verification time is approximately three orders of magnitude shorter than that of SExpand. This is expected since the time complexity of pkduck's verification algorithm is O(k·n) whereas that of SExpand is O(k·n²).

    Scalability. We test how our entire join algorithm (including signature generation, candidate selection and candidate verification) scales as the input size varies on Course and Location. For each dataset, we generate three smaller datasets by randomly choosing a subset with respectively 25%, 50% and 75% of the original strings. The execution time on all small and original datasets is shown in Figure 11. As shown, our join algorithm scales well on both datasets.

    5.4 Evaluating the Abbreviation Dictionary

    Number of rules. Table 4 shows the number of LCS-based rules (θ = 0 or 1) and the number of rules after refinement, in comparison to the search space. We can see that the number of rules satisfying the LCS-based assumption is very small compared to the huge search space. For example, with θ = 0 on the Course dataset, the search space is 1.9×10^10 while the number of LCS-based candidate rules is only 3.3×10^5. Also, many LCS-based candidate rules are filtered out by the rule refiner. For example, among all 44,748 LCS-based candidate rules on Dept when θ = 1, only 4,786 of them are preserved by the rule refiner.
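As a rough illustration of why so few rules satisfy the LCS-based assumption, the θ = 0 case reduces to a subsequence test: the abbreviation token must be a subsequence of its full form. The sketch below is a simplification (concatenating the full-form tokens, ignoring the trie, the F array and the θ = 1 relaxation of the paper's Algorithm 1), and `lcs_candidate` is an illustrative name.

```python
def is_subsequence(abbr, full):
    """True iff every character of `abbr` appears in `full`, in order."""
    it = iter(full)
    return all(c in it for c in abbr)  # `c in it` consumes the iterator

def lcs_candidate(abbr, full_tokens):
    # theta = 0 variant of the LCS-based assumption, simplified here to:
    # the abbreviation is a subsequence of the concatenated full form.
    return is_subsequence(abbr, "".join(full_tokens))

print(lcs_candidate("dept", ["department"]))                    # True
print(lcs_candidate("cs", ["computer", "science"]))             # True
print(lcs_candidate("tat", ["training", "alignment", "team"]))  # True
print(lcs_candidate("ml", ["career", "services"]))              # False
```

This single test already rejects the overwhelming majority of the token/token-sequence pairs in the search space, which is why the candidate counts in Table 4 are so small relative to it.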

    Comparison of Dictionaries. We compare our LCS-based dictionaries with three types of dictionaries:

    ExampleS [5]: the state-of-the-art approach to learn synonym rules from examples of matching string pairs. We implement ExampleS using as examples all string pairs with Jaccard similarity not less than a threshold τ, as described in the paper.

    Handcrafted dictionaries: we manually construct abbreviation dictionaries for two small datasets, Disease and Dept.

    A general-purpose dictionary: we collect from abbreviations.com 20,014 pairs of general-purpose abbreviations and their full forms.

    Table 4: Number of LCS-based candidate rules (L) and number of rules after refinement (R), as compared to the search space (all pairs of a token and a token sequence).

                Search space   θ    L            R
    Disease     3.5×10^6       0    4,302        1,362
                               1    34,972       4,306
    Dept        3.8×10^7       0    9,274        2,764
                               1    44,748       4,786
    Course      1.9×10^10      0    333,054      41,354
                               1    2,068,916    67,292
    Location    4.1×10^11      0    1,631,330    106,192
                               1    17,374,792   286,424

    To compare different dictionaries, we feed them to our join algorithm (using the pkduck similarity measure), and compare the accuracy of the join results with the user-inputted threshold δ = 0.7. θ values 0 and 1 are used for LCS-based dictionaries. For those learned by ExampleS, we use τ values 0.4, 0.6 and 0.8.

    Table 5 shows the results. It is clear that LCS-based dictionaries greatly outperform dictionaries learned by ExampleS. For example, on Course, the highest F-measure of LCS-based dictionaries is 0.71 whereas that of ExampleS is only 0.14. We can also see that ExampleS has very low recall. On Disease, the recall of ExampleS is 0. The maximum recall on all four datasets is only 0.27. The reasons are twofold. First, as can be seen in Figures 1, 6, 7 and 8, matching strings often have very low Jaccard similarity^20, so using traditional Jaccard-based ASJ misses many good examples. Second, ExampleS uses the frequency in the examples to infer how likely a rule is a real synonym rule (higher frequency indicates higher likelihood). Although high frequency is a good indicator (e.g., ⟨dept → department⟩ occurs many times in Dept), many real abbreviations occur very few times (e.g., ⟨tat → training alignment team⟩ has only one occurrence in Dept) and are thus missed by ExampleS.

    Table 5 also shows that our LCS-based dictionaries have comparable accuracy to handcrafted dictionaries. Manual construction of the dictionary is an extremely cumbersome process which is prone to omissions of correct abbreviations, as indicated by the lower recall of handcrafted dictionaries. The low recall of the general-purpose dictionary indicates that abbreviations are highly domain-dependent and that our automatic approach is needed.

    The comparison between the two LCS-based dictionaries with θ = 0 and θ = 1 reveals that 0 is already a very good choice of θ in practice. The F-measure and precision of θ = 0 are always no lower than those of θ = 1. The recall of θ = 0 is only slightly lower than that of θ = 1 (by at most 2%). This shows that most real-world abbreviations can be captured by θ = 0. Also, the LCS-based rule generation algorithm is easier to implement and runs faster with θ = 0.

    Scalability. We test how the LCS-based rule generation algorithm (Algorithm 1) scales as the input size varies on Course and Location.

    ^20 In fact, matching strings in Disease all have a Jaccard similarity less than 0.4, leading to ExampleS's 0 recall.


    Table 5: Accuracy of join results (δ = 0.7) when using LCS-based dictionaries, those learned by ExampleS [5], handcrafted dictionaries and a general-purpose abbreviation dictionary.

                           Disease            Dept               Course             Location
                           P    R    F        P    R    F        P    R    F        P    R    F
    ExampleS, τ = 0.4      0.00 0.00 0.00     0.03 0.27 0.06     0.06 0.17 0.09     0.21 0.21 0.21
    ExampleS, τ = 0.6      0.00 0.00 0.00     0.14 0.23 0.17     0.09 0.17 0.12     0.27 0.21 0.23
    ExampleS, τ = 0.8      0.00 0.00 0.00     0.64 0.18 0.28     0.37 0.09 0.14     0.53 0.21 0.30
    General-purpose        0.96 0.08 0.15     0.64 0.18 0.28     0.47 0.09 0.15     0.52 0.21 0.30
    Handcrafted            0.92 0.69 0.79     0.80 0.74 0.77     -    -    -        -    -    -
    LCS-based, θ = 0       0.88 0.74 0.81     0.75 0.79 0.77     0.66 0.77 0.71     0.46 0.55 0.50
    LCS-based, θ = 1       0.88 0.74 0.81     0.70 0.80 0.75     0.61 0.78 0.68     0.25 0.57 0.35

    [Figure 12: Scalability of the LCS-based rule generation algorithm: execution time (s) vs. input size (25%, 50%, 75% and 100% of the original data) for θ = 0 and θ = 1, on (a) Course and (b) Location.]

    Results are shown in Figure 12. We can see that Algorithm 1 is highly efficient and scales almost linearly as the input size grows. This high efficiency results from the use of the trie structure and the recursive calculation of the F array (Section 4.1.2).

    6. RELATED WORK

    There is a long literature on efficient computation of ASJs using traditional metrics, which can be divided into two categories: set-based (e.g., Jaccard [7,24] and Cosine [20,27]) and character-based (e.g., Edit [14,23,26] and Hamming Distance [6]). Many works [12,20,25,26] transform strings into sets of substrings of length q (i.e., q-grams), and then convert character-based metrics to set-based metrics. This idea can be adopted in our workflow when the underlying function f is one of those character-based metrics. Note, however, that traditional metrics are ineffective at dealing with abbreviations, so these works cannot be adopted directly.

    The filter-verification framework is widely adopted by the aforementioned works for efficient computation. Under this framework, a filtering algorithm first filters out a large portion of dissimilar string pairs. A verification algorithm then verifies the remaining pairs by calculating their real similarities. Various signature-based filtering algorithms have been proposed, including prefix-filter [9,27], positional-filter [27], partition-based signature schemes [6,14] and approximate signature schemes [10]. Verifying set-based metrics is generally simple due to the straightforward calculation. Yet calculating character-based metrics such as edit distance is usually expensive. Algorithms have been designed to speed up the calculation of these metrics [14,26].

    Our join algorithm for pkduck also follows this filter-verification framework. The filtering scheme extends prefix-filter [9] by calculating the signature union Sig_u(s). Other types of signatures can also fit in this scheme, but it is generally hard to efficiently calculate Sig_u(s) for two reasons. First, many signature algorithms [12,20,24,27] require building indexes on input strings to keep track of some positional information of tokens. However, to extend them to support pkduck, one has to build indexes for all derived strings, which is prohibitively expensive. Second, the signature space of many fast algorithms [6,14] is not polynomial, so it is not viable to modify our signature generation algorithm (which enumerates all possible signatures in the first step) to extend these algorithms.

    Besides filter-verification algorithms, trie-join [23] is a fast algorithm that uses a trie to directly compute join results. Both trie-join and our rule generation algorithm (Algorithm 1) employ the idea of identifying valid trie nodes for a set of query strings. However, because we focus on a different problem, Algorithm 1 differs inherently from trie-join in many key aspects. For example, the trie in trie-join is simply constructed from the input strings. In contrast, the trie in Algorithm 1 is constructed from every subsequence of any token t that has at least |t| − θ characters. The definitions of valid nodes and query strings are also different, due to the different problems being tackled.

    Other highly related works [4,5,15] are discussed in detailin Section 1, so we do not repeat them in this section.

    7. CONCLUSION

    In this paper, we studied approximate string joins with abbreviations. We highlighted the limitations of existing approaches and proposed an end-to-end workflow to address them. We first designed a new similarity measure to quantify the similarity between two strings taking abbreviations into account. In contrast to existing measures, this measure is robust to abbreviations having multiple full forms. Note that this measure can also use other types of synonym rules. We then devised a PTIME join algorithm following the filter-verification framework, as opposed to existing algorithms whose time complexity is exponential. We also presented an unsupervised approach to learn a dictionary of abbreviation rules from input strings based on the LCS assumption. Experimental results on four real-world datasets showed that our approach is highly effective and efficient, and has clear advantages over state-of-the-art approaches.

    8. ACKNOWLEDGMENT

    This work was supported by the Qatar Computing Research Institute under Grant 6928356.


    9. REFERENCES

    [1] Extended technical report. web.mit.edu/wenbo/www/abbr.pdf.
    [2] E. Adar. SaRAD: A simple and robust abbreviation dictionary. Bioinformatics, 20(4):527-533, 2004.
    [3] H. Ao and T. Takagi. ALICE: An algorithm to extract abbreviations from MEDLINE. JAMIA, 12(5):576-586, 2005.
    [4] A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In ICDE, pages 40-49, 2008.
    [5] A. Arasu, S. Chaudhuri, and R. Kaushik. Learning string transformations from examples. PVLDB, 2(1):514-525, 2009.
    [6] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918-929, 2006.
    [7] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131-140, 2007.
    [8] J. T. Chang, H. Schutze, and R. B. Altman. Creating an online dictionary of abbreviations from MEDLINE. JAMIA, 9(6):612-620, 2002.
    [9] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006.
    [10] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SoCG, pages 253-262, 2004.
    [11] R. De La Briandais. File searching using variable length keys. In WJCC, pages 295-298, 1959.
    [12] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava, et al. Approximate string joins in a database (almost) for free. In VLDB, volume 1, pages 491-500, 2001.
    [13] Y. Jiang, G. Li, J. Feng, and W.-S. Li. String similarity joins: An experimental evaluation. PVLDB, 7(8):625-636, 2014.
    [14] G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. PVLDB, 5(3):253-264, 2011.
    [15] J. Lu, C. Lin, W. Wang, C. Li, and H. Wang. String similarity measures and joins with synonyms. In SIGMOD, pages 373-384, 2013.
    [16] M. Michelson and C. A. Knoblock. Mining the heterogeneous transformations between data sources to aid record linkage. In ICAI, pages 422-428, 2009.
    [17] N. Okazaki and S. Ananiadou. A term recognition approach to acronym recognition. In ACL, 2006.
    [18] N. Okazaki, M. Ishizuka, and J. Tsujii. A discriminative approach to Japanese abbreviation extraction. In IJCNLP, pages 889-894, 2008.
    [19] Y. Park and R. J. Byrd. Hybrid text mining for finding abbreviations and their definitions. In EMNLP, pages 126-133, 2001.
    [20] J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD, pages 1033-1044, 2011.
    [21] A. S. Schwartz and M. A. Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. In PSB, pages 451-462, 2003.
    [22] X. Sun, N. Okazaki, J. Tsujii, and H. Wang. Learning abbreviations from Chinese and English terms by modeling non-local information. ACM TALLIP, 12(2):5:1-5:17, 2013.
    [23] J. Wang, J. Feng, and G. Li. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. PVLDB, 3(1-2):1219-1230, 2010.
    [24] J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: An adaptive framework for similarity join and search. In SIGMOD, pages 85-96, 2012.
    [25] W. Wang, J. Qin, C. Xiao, X. Lin, and H. T. Shen. VChunkJoin: An efficient algorithm for edit similarity joins. TKDE, 25(8):1916-1929, 2013.
    [26] C. Xiao, W. Wang, and X. Lin. Ed-join: An efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933-944, 2008.
    [27] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for near-duplicate detection. TODS, 36(3):15, 2011.
