fuzzy hash map

Efficient Fuzzy Search Enabled Hash Map
4th International Workshop on Soft Computing Applications, SOFA2010 – Arad, ROMANIA
Vasile Topac, PhD Student
Department of Information Technology and Computer Science, “Politehnica” University of Timisoara
Email: [email protected]

Upload: vasile-topac

Post on 11-Nov-2014


DESCRIPTION

This is a presentation of Fuzzy Hash Map (FHM). FHM is an extension of the regular Java HashMap data structure that allows efficient fuzzy string key search. Customizable algorithms and settings bring flexibility to this new data structure, making it adaptable to each specific use case. Fuzzy string search performance comparisons between Fuzzy Hash Map and the regular HashMap are presented for both accuracy and time consumption. Results show very good performance for Fuzzy Hash Map compared to the regular HashMap.

TRANSCRIPT

Page 1: Fuzzy Hash Map

Efficient Fuzzy Search Enabled Hash Map

4th International Workshop On Soft Computing Applications SOFA2010 – Arad, ROMANIA

Vasile Topac, PhD Student

Department of Information Technology and Computer Science, “Politehnica” University of Timisoara

Email: [email protected]

Page 2: Fuzzy Hash Map

How it all started

SOFA2010 – Arad, ROMANIA - Efficient Fuzzy Search Enabled Hash Map - Vasile Topac

Page 3: Fuzzy Hash Map

Java HashMap

- widely used Java data structure

- stores (key, value) pairs

- search by key

- very fast

- a hash function generates a hash code used for indexing

- uses the equals method to compare keys

- only values for existing keys can be retrieved

Page 4: Fuzzy Hash Map

Java HashMap

phone book example

Page 5: Fuzzy Hash Map

Java HashMap

Collision

Page 6: Fuzzy Hash Map

Java HashMap

Search for “Lisa Smith”

hashMap.get("Lisa Smith");
Result: "521-8976"

Page 7: Fuzzy Hash Map

Problem

- only values for existing keys can be retrieved

Search for “Lissa Smith”

hashMap.get("Lissa Smith");
Result: null
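The exact-match behaviour shown on these two slides can be reproduced with a plain java.util.HashMap. This is a minimal sketch: only the "Lisa Smith" entry comes from the slides, and the class name and the second entry are assumptions added for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class PhoneBookExactLookup {
    // Builds a small phone book. Only "Lisa Smith" is taken from the slides;
    // the "John Smith" entry is a hypothetical addition.
    static Map<String, String> phoneBook() {
        Map<String, String> book = new HashMap<>();
        book.put("Lisa Smith", "521-8976");
        book.put("John Smith", "521-1234"); // hypothetical entry
        return book;
    }

    public static void main(String[] args) {
        Map<String, String> book = phoneBook();
        // Exact key: hashCode() and equals() both match, so the value is found.
        System.out.println(book.get("Lisa Smith"));   // prints 521-8976
        // Misspelled key: no stored key equals it, so the lookup returns null.
        System.out.println(book.get("Lissa Smith"));  // prints null
    }
}
```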

Page 8: Fuzzy Hash Map

Problem

Brute force solution:
- iterate through the set of entries and search for approximate matches
- works, but is expensive in time

Fuzzy data structures: currently available for databases

- search for “Lissa Smith”
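The brute-force approach could look roughly like the sketch below. It assumes Levenshtein distance as the approximate-match criterion and a caller-supplied distance limit; the class and method names are illustrative and do not come from the slides.

```java
import java.util.HashMap;
import java.util.Map;

class BruteForceFuzzyLookup {
    // Levenshtein distance via dynamic programming, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Visits every entry, so the cost grows linearly with the number of keys.
    // Returns the value of the closest key within maxDistance, or null.
    static String getFuzzy(Map<String, String> map, String query, int maxDistance) {
        String bestValue = null;
        int bestDistance = maxDistance + 1;
        for (Map.Entry<String, String> entry : map.entrySet()) {
            int d = levenshtein(query, entry.getKey());
            if (d < bestDistance) {
                bestDistance = d;
                bestValue = entry.getValue();
            }
        }
        return bestValue;
    }

    public static void main(String[] args) {
        Map<String, String> book = new HashMap<>();
        book.put("Lisa Smith", "521-8976");
        book.put("John Smith", "521-1234"); // hypothetical entry
        System.out.println(getFuzzy(book, "Lissa Smith", 2)); // prints 521-8976
    }
}
```

Every lookup computes one distance per stored key, which is the time cost the slide refers to.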

Page 9: Fuzzy Hash Map

Fuzzy Hash Map

“ Soft computing (SC) is a collection of methodologies that are trying to cope with the main disadvantage of the conventional (hard) computing: the poor performances when working in uncertain conditions. ”

Page 10: Fuzzy Hash Map

Fuzzy Hash Map

UML Class Diagram

Page 11: Fuzzy Hash Map

Fuzzy Hash Map

FuzzyKey overridden methods

- hashCode(): pre-hashing, creates collisions to cluster data
  - substring: substring("Fuzzy Search", 0, 4) = "Fuzz"
  - soundex: soundex("Fuzzy Search") = F226

- equals(Object o): string metrics
  - Levenshtein Distance: LD(computing, computation) = 4
  - Hamming Distance: HD(computing, computers) = 3

How it works
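The idea of overriding hashCode() and equals() on the key can be sketched as follows. This is not the library's FuzzyKey: the Key class, its fields, and the choice of Hamming distance inside equals() are assumptions made to keep the example short. The pre-hash is the 4-character prefix, so similar keys collide on purpose and land in the same bucket.

```java
import java.util.HashMap;
import java.util.Map;

class PrefixFuzzyKeySketch {
    // Pre-hashing: the first n characters decide the bucket.
    static String prefix(String s, int n) {
        return s.substring(0, Math.min(n, s.length()));
    }

    // Hamming distance: number of differing positions, defined for equal lengths only.
    static int hamming(String a, String b) {
        if (a.length() != b.length()) throw new IllegalArgumentException("lengths differ");
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) d++;
        }
        return d;
    }

    // Hypothetical key type: equal prefixes give equal hash codes,
    // and equals() accepts keys within maxDistance of each other.
    static final class Key {
        final String text;
        final int maxDistance;

        Key(String text, int maxDistance) {
            this.text = text;
            this.maxDistance = maxDistance;
        }

        @Override
        public int hashCode() {
            return prefix(text, 4).hashCode();
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            String other = ((Key) o).text;
            return text.length() == other.length() && hamming(text, other) <= maxDistance;
        }
    }

    public static void main(String[] args) {
        Map<Key, String> map = new HashMap<>();
        map.put(new Key("computing", 0), "stored value");
        // Same prefix "comp" and Hamming distance 3, so a threshold of 3 matches.
        System.out.println(map.get(new Key("computers", 3))); // prints stored value
        System.out.println(map.get(new Key("computers", 2))); // prints null
    }
}
```

A fuzzy equals() of this kind is not transitive, so it departs from the general contract of Object.equals; the sketch relies on Map.get comparing the lookup key against stored keys with key.equals(k).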

Page 12: Fuzzy Hash Map

Fuzzy Hash Map

Example (law terminology dictionary)

- hashCode(): pre-hashing with substring 4

- equals(Object o): Levenshtein Distance

Diagram: keys -> pre-hashing function SUBSTRING (0, 4) -> hash function -> buckets

key            pre-hash
action         acti
adjudication   adju
evidence       evid
violence       viol
violation      viol
...

Bucket indices shown: 12, 13, 14, 215, ...

Values shown:
A civil judicial proceeding ...
A decision or sentence imposed by a judge...
The expression of physical or verbal ...
An offense for which the only sentence ...
Testimony, documents or objects ...
...

Page 13: Fuzzy Hash Map

Fuzzy Hash Map

“the judge has the option of either adjudicating you as guilty or..”

fuzzyHashMap.get("adjudicating") = null
fuzzyHashMap.getFuzzy("adjudicating", 2) = "a decision or sentence imposed by a judge…"

- hashCode(): substring 4 = "adju"

- equals(Object o): LD(adjudicating, adjudication) = 2

Page 14: Fuzzy Hash Map

Fuzzy Hash Map

fuzzyHashMap.getFuzzy("violent") = "violence"

Diagram: same SUBSTRING (0, 4) pre-hashing layout as on page 12; "violence" and "violation" share the "viol" bucket.

LD(violent, violence) = 2
LD(violent, violation) = 5

“violence” is returned
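The "adjudicating" and "violent" lookups above can be sketched with an explicit prefix-to-bucket map. This is an illustration under stated assumptions rather than the FuzzyHashMap implementation: the class name, the closestKey helper, and the pairing of the "violence" and "violation" terms with definitions are assumed, and the sketch always takes a distance threshold, whereas the slide also shows a getFuzzy call with no threshold.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PrefixBucketFuzzyMap {
    private final int prefixLength;
    // Pre-hash (prefix) -> keys that share it; values are kept in a regular map.
    private final Map<String, List<String>> buckets = new HashMap<>();
    private final Map<String, String> values = new HashMap<>();

    PrefixBucketFuzzyMap(int prefixLength) {
        this.prefixLength = prefixLength;
    }

    private String preHash(String key) {
        return key.substring(0, Math.min(prefixLength, key.length()));
    }

    void put(String key, String value) {
        if (!values.containsKey(key)) {
            buckets.computeIfAbsent(preHash(key), k -> new ArrayList<>()).add(key);
        }
        values.put(key, value);
    }

    // Exact lookup, same behaviour as HashMap.get.
    String get(String key) {
        return values.get(key);
    }

    // Compares the query only against keys in its own bucket and returns
    // the closest one within maxDistance, or null if none qualifies.
    String closestKey(String query, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        List<String> bucket = buckets.getOrDefault(preHash(query), Collections.emptyList());
        for (String candidate : bucket) {
            int d = levenshtein(query, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    String getFuzzy(String query, int maxDistance) {
        String key = closestKey(query, maxDistance);
        return key == null ? null : values.get(key);
    }

    // Levenshtein distance via dynamic programming, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        PrefixBucketFuzzyMap map = new PrefixBucketFuzzyMap(4);
        map.put("adjudication", "A decision or sentence imposed by a judge...");
        map.put("violence", "The expression of physical or verbal ...");   // pairing assumed
        map.put("violation", "An offense for which the only sentence ..."); // pairing assumed
        System.out.println(map.get("adjudicating"));         // prints null
        System.out.println(map.getFuzzy("adjudicating", 2)); // prints the adjudication entry
        System.out.println(map.closestKey("violent", 2));    // prints violence
    }
}
```

Only the keys in the "viol" bucket are compared with "violent", which is where the saving over a full scan comes from; the trade-off is that a typo inside the first four characters sends the query to a different bucket.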

Page 15: Fuzzy Hash Map

Fuzzy Hash Map

Diagram: keys -> pre-hashing function SOUNDEX -> hash function -> buckets

key     pre-hash
Mary    M600
Paul    P400
Scott   S300
Jhon    J500
John    J500
...

Bucket indices shown: 12, 13, 14, 215, ...

Values shown: 312050505, 732124789, 025465892, 361475236, 712696969, ...

Example (phone book)

- hashCode(): pre-hashing with soundex

- equals(Object o): Levenshtein Distance
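For reference, a compact version of the standard American Soundex rules is sketched below. It is written from the general algorithm, not taken from the FuzzyHashMap sources, and the class name is an assumption. It reproduces the codes shown in the diagram, including the shared code for "Jhon" and "John".

```java
class SoundexSketch {
    // Soundex digit for each letter A..Z; '0' marks letters that are not coded
    // (vowels plus H, W and Y).
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String input) {
        StringBuilder out = new StringBuilder();
        char lastDigit = '0';
        for (int i = 0; i < input.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(input.charAt(i));
            if (c < 'A' || c > 'Z') continue;      // skip spaces and punctuation
            char digit = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                     // the first letter is kept as a letter
                lastDigit = digit;
            } else if (c != 'H' && c != 'W') {     // H and W do not separate equal codes
                if (digit != '0' && digit != lastDigit) out.append(digit);
                lastDigit = digit;                 // a vowel resets this to '0'
            }
        }
        while (out.length() < 4) out.append('0');  // pad to letter plus three digits
        return out.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Mary", "Paul", "Scott", "Jhon", "John"}) {
            System.out.println(name + " -> " + soundex(name));
        }
    }
}
```

Because "Jhon" and "John" both map to J500, a misspelled name reaches the same bucket as the correct one, and the string metric in equals() then only has to compare it against the few keys stored there.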

Page 16: Fuzzy Hash Map

Results

Accuracy Test

Test conditions
- Substring(0,4) hashing function
- Levenshtein Distance fuzzy matching algorithm
- distance threshold value 2
- medical terminology dictionary populated with 1030 English medical terms

Test results

- Parse text from American Family Physicians Journal
  - text of 568 words
  - 43 words identified as medical terms
  - 9 were incorrect matches
  - 80% accuracy

- Parse text from eMedicine web site
  - text of 2730 words
  - 260 were recognized
  - 7 were incorrect matches
  - 97% accuracy
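The reported percentages are consistent with accuracy being computed as correct matches divided by identified terms. The slide does not state the formula, so that reading is an assumption; the short check below applies it to the figures above.

```java
class AccuracyCheck {
    // Share of identified terms that were matched correctly, in percent.
    static double accuracyPercent(int identified, int incorrect) {
        return 100.0 * (identified - incorrect) / identified;
    }

    public static void main(String[] args) {
        // First text: 34 of 43 correct, about 79%, reported on the slide as 80%.
        System.out.printf("%.1f%n", accuracyPercent(43, 9));
        // Second text: 253 of 260 correct, about 97%.
        System.out.printf("%.1f%n", accuracyPercent(260, 7));
    }
}
```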

Page 17: Fuzzy Hash Map

Results

Speed Test

- Exact matches only

[Chart: HashMap vs. FuzzyHashMap. Axis ticks: 0 to 1000 in steps of 100, and 0 to 6000 in steps of 1000. Data labels: 4, 5, 5, 6 and 5419, 4013, 2300, 1.]

Page 18: Fuzzy Hash Map

Results

Speed Test

- Fuzzy matches only

[Chart: HashMap vs. FuzzyHashMap. Axis ticks: 0 to 1000 in steps of 100, and 0 to 7000 in steps of 1000. Data labels: 4, 5, 6, 7 and 5419, 5739, 5711, 5401.]

Page 19: Fuzzy Hash Map

Results

Speed Test

- Exact & fuzzy matches

[Chart: HashMap vs. FuzzyHashMap. Axis ticks: 0 to 1000 in steps of 100, and 0 to 6000 in steps of 1000. Data labels: 4, 5, 5, 6 and 5419, 4744, 4135, 3143.]

Page 20: Fuzzy Hash Map

Conclusion

- the FuzzyHashMap data structure performed very well when working with uncertain data

- Flexible (can choose different pre-hashing functions and string metrics)

- available as open source http://fuzzyhashmap.sourceforge.net/

- community can extend the functionality

- Future work:
  - add more string metrics
  - improve performance
  - implement Fuzzy TreeMap

Page 21: Fuzzy Hash Map

Thank you!

Sources at: http://fuzzyhashmap.sourceforge.net

[email protected]