fuzzy hash map
DESCRIPTION
This is a presentation of Fuzzy Hash Map (FHM). FHM is an extension of the regular Java HashMap data structure that allows efficient fuzzy string key search. Customizable algorithms and settings make the data structure adaptable to each specific use case. A comparison of fuzzy string search performance between Fuzzy Hash Map and the regular HashMap is presented for both accuracy and time consumption. Results show very good performance for Fuzzy Hash Map compared to the regular HashMap.
TRANSCRIPT
Efficient Fuzzy Search Enabled Hash Map
4th International Workshop On Soft Computing Applications SOFA2010 – Arad, ROMANIA
Vasile Topac, PhD Student
Department of Information Technology and Computer Science, “Politehnica” University of Timisoara
Email: [email protected]
How it all started
SOFA2010 – Arad, ROMANIA - Efficient Fuzzy Search Enabled Hash Map - Vasile Topac
&
Java HashMap
- widely used Java data structure
- stores (key, value) pairs
- search by key
- very fast
- a hash function generates a hash code used for indexing
- uses the equals method to compare keys
- only values for existing keys can be retrieved
Java HashMap
phone book example
Java HashMap
Collision
Java HashMap
Search for “Lisa Smith”
hashMap.get("Lisa Smith");
Result: "521-8976"
Problem
- only values for existing keys can be retrieved
Search for “Lissa Smith”
hashMap.get("Lissa Smith");
Result: null
Problem
Brute force solution:
- iterate through the set of entries and search for approximate matches
- works, but is time-expensive
Fuzzy data structures: currently available for databases
- search for “Lissa Smith”
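The brute force approach above can be sketched roughly as follows. This is only an illustration, not code from the presented library: the class name `BruteForceFuzzyLookup`, the method names, and the choice of Levenshtein distance as the matching rule are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: fuzzy lookup by scanning every entry of a regular HashMap.
final class BruteForceFuzzyLookup {

    /** Levenshtein distance computed with two rolling rows of the DP table. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Visits every entry; returns the value of the closest key within maxDistance, or null. */
    static <V> V getFuzzy(Map<String, V> map, String query, int maxDistance) {
        V best = null;
        int bestDistance = maxDistance + 1;
        for (Map.Entry<String, V> entry : map.entrySet()) {
            int d = levenshtein(query, entry.getKey());
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> phoneBook = new HashMap<>();
        phoneBook.put("Lisa Smith", "521-8976");
        System.out.println(phoneBook.get("Lissa Smith"));            // null: no exact key
        System.out.println(getFuzzy(phoneBook, "Lissa Smith", 2));   // 521-8976
    }
}
```

Every lookup computes one distance per stored key, which is why this approach becomes expensive as the map grows.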
Fuzzy Hash Map
“ Soft computing (SC) is a collection of methodologies that are trying to cope with the main disadvantage of the conventional (hard) computing: the poor performances when working in uncertain conditions. ”
Fuzzy Hash Map
UML Class Diagram
Fuzzy Hash Map
FuzzyKey overridden methods
- hashCode() - prehashing: create collisions to cluster data
  - substring: substring("Fuzzy Search", 0, 4) = "Fuzz"
  - soundex: soundex("Fuzzy Search") = F226
- equals(Object o) - string metrics
  - Levenshtein Distance: LD(computing, computation) = 4
  - Hamming Distance: HD(computing, computers) = 3
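The substring pre-hash and the two string metrics named above can be sketched as plain static methods. The class and method names here are illustrative assumptions, not the library's API; the printed values reproduce the examples on the slide.

```java
// Hypothetical helper class illustrating the pre-hash and the two string metrics.
final class StringMetricsSketch {

    /** Prefix pre-hash: keys sharing their first n characters collide on purpose. */
    static String prefix(String s, int n) {
        return s.substring(0, Math.min(n, s.length()));
    }

    /** Levenshtein distance: fewest single-character insertions, deletions and substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Hamming distance: number of differing positions; only defined for equal lengths. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal-length strings");
        }
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) d++;
        }
        return d;
    }

    public static void main(String[] args) {
        System.out.println(prefix("Fuzzy Search", 4));               // Fuzz
        System.out.println(levenshtein("computing", "computation")); // 4
        System.out.println(hamming("computing", "computers"));       // 3
    }
}
```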
How it works
Fuzzy Hash Map
Example (law terminology dictionary)
- hashCode() - prehashing
  - substring 4
- equals(Object o)
  - Levenshtein Distance
[Figure: SUBSTRING (0, 4) pre-hashing. The keys action, adjudication, evidence, violence and violation pass through the pre-hashing function, which yields acti, adju, evid and viol; the hash function then maps these to the buckets that hold the keys and their definitions.]
Fuzzy Hash Map
“the judge has the option of either adjudicating you as guilty or..”
fuzzyHashMap.get("adjudicating") = null
fuzzyHashMap.getFuzzy("adjudicating", 2) = "a decision or sentence imposed by a judge…"
- hashCode(): substring 4 = "adju"
- equals(Object o): LD(adjudicating, adjudication) = 2
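One way to picture the two overridden methods is a key wrapper used with an ordinary `java.util.HashMap`. The sketch below is an assumption about how such a key could look, not the library's actual `FuzzyKey`: the class name `FuzzyKeySketch` and the fixed constants are hypothetical. Note that distance-based equality is not transitive, so it bends the usual `equals` contract; the sketch only illustrates the idea.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key type: hashCode() pre-hashes on a prefix, equals() accepts near matches.
final class FuzzyKeySketch {
    static final int PREFIX_LENGTH = 4; // substring(0, 4) pre-hashing, as in the example
    static final int MAX_DISTANCE = 2;  // one shared threshold keeps equals() symmetric

    private final String text;

    FuzzyKeySketch(String text) { this.text = text; }

    @Override
    public int hashCode() {
        // Keys with the same first four characters get the same hash code.
        return text.substring(0, Math.min(PREFIX_LENGTH, text.length())).hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FuzzyKeySketch)) return false;
        return levenshtein(text, ((FuzzyKeySketch) o).text) <= MAX_DISTANCE;
    }

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Small sample dictionary with one entry taken from the slide. */
    static Map<FuzzyKeySketch, String> sampleDictionary() {
        Map<FuzzyKeySketch, String> dictionary = new HashMap<>();
        dictionary.put(new FuzzyKeySketch("adjudication"),
                "A decision or sentence imposed by a judge...");
        return dictionary;
    }

    public static void main(String[] args) {
        Map<FuzzyKeySketch, String> dictionary = sampleDictionary();
        // Same prefix "adju" and LD = 2, so the stored entry is found.
        System.out.println(dictionary.get(new FuzzyKeySketch("adjudicating")));
        // Different prefix "adjo": a different hash code, so nothing is found.
        System.out.println(dictionary.get(new FuzzyKeySketch("adjourn")));
    }
}
```

With this design the map returns the first stored key that passes `equals`, which is not necessarily the closest one when several keys share a bucket.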
Fuzzy Hash Map
fuzzyHashMap.getFuzzy("violent") = "violence"
[Figure: SUBSTRING (0, 4) pre-hashing diagram. The keys violence and violation share the prefix viol and fall into the same bucket.]
LD(violent, violence) = 2
LD(violent, violation) = 5
“violence” is returned
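Choosing the closest key among several candidates in one bucket can be sketched with an explicit prefix index. This is an illustrative assumption rather than the library's internals; `PrefixBucketSketch`, `add` and `closestKey` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keys are grouped by prefix, and only one group is scanned per query.
final class PrefixBucketSketch {
    private final int prefixLength;
    private final Map<String, List<String>> buckets = new HashMap<>();

    PrefixBucketSketch(int prefixLength) { this.prefixLength = prefixLength; }

    private String prefix(String s) {
        return s.substring(0, Math.min(prefixLength, s.length()));
    }

    void add(String key) {
        buckets.computeIfAbsent(prefix(key), p -> new ArrayList<>()).add(key);
    }

    /** Closest key in the query's bucket within maxDistance, or null if there is none. */
    String closestKey(String query, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        List<String> candidates =
                buckets.getOrDefault(prefix(query), Collections.<String>emptyList());
        for (String candidate : candidates) {
            int d = levenshtein(query, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Index filled with the five keys shown in the diagram. */
    static PrefixBucketSketch lawTerms() {
        PrefixBucketSketch index = new PrefixBucketSketch(4);
        for (String k : new String[] {"action", "adjudication", "evidence", "violence", "violation"}) {
            index.add(k);
        }
        return index;
    }

    public static void main(String[] args) {
        PrefixBucketSketch index = lawTerms();
        System.out.println(index.closestKey("violent", 2)); // violence
        System.out.println(index.closestKey("vilence", 2)); // null
    }
}
```

The second lookup shows a trade-off of prefix pre-hashing: a typo inside the first four characters sends the query to a different bucket, so the near match is never compared.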
Fuzzy Hash Map
[Figure: SOUNDEX pre-hashing. The keys Mary, Paul, Scott, Jhon and John pass through the pre-hashing function, which yields M600, P400, S300 and J500; the hash function then maps these to the buckets that hold the names and their phone numbers.]
Example (phone book)
- hashCode() - prehashing
  - soundex
- equals(Object o)
  - Levenshtein Distance
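For readers unfamiliar with Soundex, the following is a minimal sketch of the classic American Soundex rules, written for illustration; it is not taken from the presented library, and the class name `SoundexSketch` is hypothetical. It reproduces the codes shown for the phone book names.

```java
import java.util.Locale;

// Hypothetical sketch of American Soundex: first letter plus up to three digits.
final class SoundexSketch {
    // One entry per letter A..Z: a digit for consonant groups,
    // '0' for vowels and Y (they separate repeated codes), '-' for H and W (ignored).
    private static final String CODES = "0123012-02245501262301-202";

    static String soundex(String name) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') continue;   // skip spaces, digits, punctuation
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw);                    // the first letter is kept as is
                last = code;
            } else if (code == '-') {
                continue;                           // H and W do not break a run of equal codes
            } else {
                if (code != '0' && code != last) out.append(code);
                last = code;
            }
            if (out.length() == 4) break;
        }
        if (out.length() == 0) return "";
        while (out.length() < 4) out.append('0');   // pad short codes with zeros
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Mary"));   // M600
        System.out.println(soundex("Paul"));   // P400
        System.out.println(soundex("Scott"));  // S300
        System.out.println(soundex("Jhon"));   // J500
        System.out.println(soundex("John"));   // J500
    }
}
```

Because Jhon and John share the code J500, a Soundex pre-hash places them in the same bucket, where the string metric can then compare them.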
Results
Accuracy Test
Test conditions
- Substring(0,4) hashing function
- Levenshtein Distance fuzzy matching algorithm
- distance threshold value 2
- medical terminology dictionary populated with 1030 English medical terms
Test results
- Parse text from American Family Physicians Journal
  - text of 568 words
  - 43 words identified as medical terms
  - 9 were incorrect matches
  - 80% accuracy
- Parse text from eMedicine web site
  - text of 2730 words
  - 260 were recognized
  - 7 were incorrect matches
  - 97% accuracy
Results
Speed Test
- Exact matches only
[Chart: HashMap vs FuzzyHashMap. Horizontal axis 0 to 1000, vertical axis 0 to 6000. Data labels: 4, 5, 5, 6 and 5419, 4013, 2300, 1.]
Results
Speed Test
- Fuzzy matches only
[Chart: HashMap vs FuzzyHashMap. Horizontal axis 0 to 1000, vertical axis 0 to 7000. Data labels: 4, 5, 6, 7 and 5419, 5739, 5711, 5401.]
Results
Speed Test
- Exact & fuzzy matches
[Chart: HashMap vs FuzzyHashMap. Horizontal axis 0 to 1000, vertical axis 0 to 6000. Data labels: 4, 5, 5, 6 and 5419, 4744, 4135, 3143.]
Conclusion
- the FuzzyHashMap data structure showed very good performance when working with uncertain data
- Flexible (can choose different pre-hashing functions and string metrics)
- available as open source http://fuzzyhashmap.sourceforge.net/
- community can extend the functionality
- Future work:
  - adding more string metrics
  - improving performance
  - implementing Fuzzy TreeMap
Thank you!
sources at: http://fuzzyhashmap.sourceforge.net