fuzzy search on plone & search for east asian language

©2013 CMScom [email protected] Fuzzy Search on Plone and Search for East Asian Language CMS communications Inc, Manabu TERADA [email protected] http://www.cmscom.jp 4 / Oct / 2013 Plone Conference 2013 in Brasilia

Upload: manabu-terada

Post on 15-Jan-2015

Page 1: Fuzzy search on plone & search for east asian language

©2013 CMScom [email protected]

Fuzzy Search on Plone and Search for East Asian Language

CMS communications Inc, Manabu TERADA [email protected]

http://www.cmscom.jp 4 / Oct / 2013

Plone Conference 2013 in Brasilia

Page 2: Fuzzy search on plone & search for east asian language

Who am I? (お前だれよ?, Japanese for "Who are you?")

•Manabu TERADA (寺田 学) @terapyon
•Advisory Board Member of Plone Foundation
•Chair of PyCon APAC 2013 in Japan
•Owner of CMS communications Inc.
•Member of Plone Users Group Japan

•Authors

Page 3: Fuzzy search on plone & search for east asian language

Contents

•About Japanese Language and other Languages

•Fuzzy Search on Plone
•About the product
•Basic technology
•Dependencies
•Demo
•Structure of the product
•Future plans

Page 4: Fuzzy search on plone & search for east asian language

Language Questions

ありがとう Thank you Obrigado

Gracias 谢谢 감사 합니다

ขอบคุณ Спасибо شكرا

Page 5: Fuzzy search on plone & search for east asian language

Language Questions

ありがとう (Japanese)
Thank you (English)
Obrigado (Portuguese)
Gracias (Spanish)
谢谢 (Chinese)
감사 합니다 (Korean)
ขอบคุณ (Thai)
Спасибо (Russian)
شكرا (Arabic)

Page 6: Fuzzy search on plone & search for east asian language

Language Questions

•Double bytes

ありがとう (Japanese)
Thank you (English)
Gracias (Spanish)
谢谢 (Chinese)
감사 합니다 (Korean)
ขอบคุณ (Thai)
Спасибо (Russian)
شكرا (Arabic)
Obrigado (Portuguese)

Page 8: Fuzzy search on plone & search for east asian language

Language Questions

•Left to Right (LTR) or Right to Left (RTL)

ありがとう (Japanese)
Thank you (English)
Gracias (Spanish)
谢谢 (Chinese)
감사 합니다 (Korean)
ขอบคุณ (Thai)
Спасибо (Russian)
شكرا (Arabic)
Obrigado (Portuguese)

Page 10: Fuzzy search on plone & search for east asian language

Language Questions

•No white space?

ありがとう (Japanese)
Thank you (English)
Gracias (Spanish)
谢谢 (Chinese)
감사 합니다 (Korean)
ขอบคุณ (Thai)
Спасибо (Russian)
شكرا (Arabic)
Obrigado (Portuguese)

Page 11: Fuzzy search on plone & search for east asian language

Language Questions

•No white space

ありがとう (Japanese)
Thank you (English)
Gracias (Spanish)
谢谢 (Chinese)
감사 합니다 (Korean)
ขอบคุณ (Thai)
Спасибо (Russian)
شكرا (Arabic)
Obrigado (Portuguese)

Page 12: Fuzzy search on plone & search for east asian language

Japanese

•Can you read this Japanese?

•私は寺田学です。日本の東京から来ました。ブラジルに来たのは初めてです。
•I am Manabu TERADA. I came from Tokyo, Japan. I have come to Brazil for the first time.

•The same text with spaces added between words: 私 は 寺田 学 です。日本 の 東京 から 来ました。ブラジル に 来た のは 初めて です。

Page 13: Fuzzy search on plone & search for east asian language

Japanese

•Japanese doesn’t use white space to separate words.
•Japanese has three kinds of characters:
•Hiragana, Katakana, Kanji
•Hiragana and Katakana each have about 50 characters
•Kanji has over 2000 characters
•Japanese has homonyms: the same reading can be written with different characters, and the same character can have different readings.
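
The first point can be seen with plain whitespace tokenization. A minimal sketch in Python, using part of the text from the previous slide; it only illustrates the problem and is not how the product splits words:

```python
# Whitespace tokenization works for English but not for Japanese.
english = "I came from Tokyo, Japan."
japanese = "日本の東京から来ました。"

english_tokens = english.split()
japanese_tokens = japanese.split()

print(english_tokens)   # five separate words
print(japanese_tokens)  # the whole sentence comes back as a single token
```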

Page 14: Fuzzy search on plone & search for east asian language

Japanese

•These all mean the same thing.
•Kyoto ← Romaji
•京都 ← Kanji
•きょうと ← Hiragana
•キョウト ← Katakana
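
Part of this can be handled without a dictionary: Hiragana and Katakana sit in parallel Unicode blocks, so Katakana can be folded to Hiragana with a fixed offset. The sketch below shows that idea only. The function name is made up for illustration, and turning Kanji such as 京都 into a reading still needs a dictionary or a morphological analyzer.

```python
def katakana_to_hiragana(text):
    """Fold Katakana letters to Hiragana; leave every other character unchanged."""
    out = []
    for ch in text:
        code = ord(ch)
        # Katakana ァ (U+30A1) .. ヶ (U+30F6) mirror Hiragana ぁ (U+3041) .. ゖ (U+3096),
        # so subtracting 0x60 moves a letter from one block to the other.
        if 0x30A1 <= code <= 0x30F6:
            out.append(chr(code - 0x60))
        else:
            out.append(ch)
    return "".join(out)


print(katakana_to_hiragana("キョウト"))  # Katakana spelling of Kyoto, folded to Hiragana
print(katakana_to_hiragana("京都"))      # Kanji is not covered by the offset trick
```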

Page 15: Fuzzy search on plone & search for east asian language

Japanese

•Can you read these?
•橋 (bridge) → ハシ → Hashi
•端 (edge) → ハシ → Hashi
•箸 (chopsticks) → ハシ → Hashi

•They have different meanings.
•We can tell them apart by context.

Page 16: Fuzzy search on plone & search for east asian language

Japanese and other Languages

•We have a lot of languages.
•We have a lot of rules.
•We have a lot of issues.

•I want Plone to have solutions for them.

Page 17: Fuzzy search on plone & search for east asian language

Fuzzy Search on Plone

Fuzzy Search

Page 18: Fuzzy search on plone & search for east asian language

Fuzzy Search on Plone

•Name: c2.search.fuzzy
•1.0a5 (alpha release)

https://pypi.python.org/pypi/c2.search.fuzzy
https://bitbucket.org/cmscom/c2.search.fuzzy

Page 20: Fuzzy search on plone & search for east asian language

Fuzzy Search on Plone

•We want search suggestions like Google's.

•On an intranet, we can NOT use Google.

Page 21: Fuzzy search on plone & search for east asian language

Fuzzy Search on Plone

•It does NOT use Solr. I know Solr works well,
•but it's difficult to install, configure, and implement.

•And I wanted to build my own system.

Page 22: Fuzzy search on plone & search for east asian language

Basic technology

•This system is not difficult.

•Keywords
•Levenshtein Distance
•Sorted list
•Automata system

Page 23: Fuzzy search on plone & search for east asian language

Basic technology

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other. The phrase edit distance is often used to refer specifically to Levenshtein distance. It is named after Vladimir Levenshtein, who considered this distance in 1965. It is closely related to pairwise string alignments.

From Wikipedia: http://en.wikipedia.org/wiki/Levenshtein_distance

Page 24: Fuzzy search on plone & search for east asian language

Basic technology

Levenshtein Distance
•Base word: “plone”

•Distance zero (ignoring case)
•PLONE, Plone, pLone
•Distance one
•Phone, plene, plne, lone, ploneg, .....
•Distance two
•one, plo, polne, ......
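
The distances listed above can be checked with the textbook dynamic-programming formulation. This is a minimal sketch, not code taken from c2.search.fuzzy; lower-casing the input before comparing is an assumption made here so that "PLONE" and "Plone" count as distance zero.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] is the distance between the part of a processed so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


base = "plone"
for word in ["Plone", "Phone", "plne", "ploneg", "one", "polne"]:
    # Assumption: comparison is case-insensitive, so lower-case first.
    print(word, levenshtein(base, word.lower()))
```

Note that "polne" is at distance two because plain Levenshtein distance has no transposition operation; swapping two neighbouring letters costs two substitutions.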

Page 25: Fuzzy search on plone & search for east asian language

Basic technology

Sorted list
•Ordered container (List) or \
•Can get the order of words

Sorted in Unicode order (alphabetical)

['Argentina', 'Australia', 'Brazil', 'Canada', 'China', 'European Union', 'France', 'Germany', 'India', 'Indonesia', 'Italy', 'Japan', 'Mexico', 'Russia', 'Saudi Arabia', 'South Africa', 'South Korea', 'Turkey', 'United Kingdom', 'United States']

For example (the G20 members)
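
A sorted list like this can be maintained with Python's standard `bisect` module. The sketch below is an illustration under that assumption (the slides do not say how the product keeps its list sorted), and `first_at_or_after` is a hypothetical helper name. Looking up "the first word at or after a given string" is the operation a sorted index makes cheap.

```python
import bisect

words = []
for name in ["Japan", "Brazil", "Italy", "Argentina", "India"]:
    bisect.insort(words, name)  # insert while keeping the list sorted

print(words)


def first_at_or_after(sorted_words, text):
    """Return the first word that sorts at or after text, or None if there is none."""
    i = bisect.bisect_left(sorted_words, text)
    return sorted_words[i] if i < len(sorted_words) else None


print(first_at_or_after(words, "It"))
print(first_at_or_after(words, "Z"))
```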

Page 26: Fuzzy search on plone & search for east asian language

Basic technology

From @hiratara’s slide: http://www.slideshare.net/hiratara/levenshtein-automata

Page 27: Fuzzy search on plone & search for east asian language

Basic technology

Levenshtein Automata

•I found a good blog entry:
•“Damn Cool Algorithms: Levenshtein Automata”
•http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
•https://gist.github.com/Arachnid/491973

•It uses only Python!!

Page 28: Fuzzy search on plone & search for east asian language

Basic technology

Index
•The product creates its own index, like a sorted list, whenever Plone content is created or modified.

Search
•When we type into the search box, this index is searched.
•Correctly spelled words within a small distance of the input are found in the index.
•This works because the index holds words that appear in Plone content.

Page 29: Fuzzy search on plone & search for east asian language

Basic technology

•For example,
•We want to show suggestions within distance one (the default).
•From the G20 countries list:
•Brezil → Brazil
•Japon → Japan

•And it uses an automaton to increase speed.

Page 30: Fuzzy search on plone & search for east asian language

Dependencies

We need only Python.

Page 31: Fuzzy search on plone & search for east asian language

Dependencies

•We use MeCab for Japanese support.
•Japanese doesn't use white space to split words.
•(the same as Chinese and Korean)

Page 32: Fuzzy search on plone & search for east asian language

Dependencies

•Supported languages
•English and other European languages
•MAYBE: Arabic

•Chinese and Korean
•They need a word-splitting system to work
•I don't know of one.

Page 33: Fuzzy search on plone & search for east asian language

Demo

•View the video on YouTube
http://youtu.be/e5DHsF7Gi70

Page 34: Fuzzy search on plone & search for east asian language

Structure of the product

•Index data is stored in ZODB as a List object.

•When content is created or modified, the List is updated and kept sorted.
•The List contains Dicts: each Dict key is the phonetic form (or the lower-case form in English), and the value is the list of original words.

[{'argentina': ['Argentina', 'argentina', 'ARGENTINA']},
 {'australia': ['Australia']},
 {'brazil': ['Brazil']},
 {'きょうと': ['京都', 'キョウト']}]

Example index data
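
Building such an index can be sketched as follows. This is a simplified illustration, not the product's storage code: it keeps a sorted list of keys plus one dict instead of a list of single-key dicts, it uses plain Python objects rather than ZODB, and `normalize` is a stand-in that only lower-cases (the Japanese phonetic key would come from MeCab).

```python
import bisect


def normalize(word):
    # Stand-in key function (assumption): lower-casing only.
    return word.lower()


def add_word(keys, originals, word):
    """keys: sorted list of normalized keys; originals: key -> original spellings."""
    key = normalize(word)
    if key not in originals:
        bisect.insort(keys, key)  # keep the key list sorted
        originals[key] = []
    if word not in originals[key]:
        originals[key].append(word)


keys, originals = [], {}
for word in ["Brazil", "Argentina", "argentina", "ARGENTINA", "Australia"]:
    add_word(keys, originals, word)

print(keys)
print(originals["argentina"])
```

Matching runs against the normalized keys, while the stored original spellings are what would be shown to the user.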

Page 35: Fuzzy search on plone & search for east asian language

Structure of the product

•Search
•The automaton checks the List for words within a small distance of the input word.

•The original words from the matching Dict values are shown under the search box by JavaScript.

Page 36: Fuzzy search on plone & search for east asian language

Structure of the product

For Japanese

•I'm using MeCab for splitting words and getting their phonetic forms.

•Both the phonetic form and the original word are stored,
•because in Japanese the same reading can be written with different characters.

Page 37: Fuzzy search on plone & search for east asian language

Future plans

•Now, I'm using ZODB to store the index.
•I want to offer an option to store it in an RDBMS. I'm trying to develop it.

•I want to support more languages.
•Please help me support more languages.

Page 38: Fuzzy search on plone & search for east asian language

Thanks

•Japanese & East Asian languages
•We still have some problems in Plone.
•I think Plone works well with multiple languages.
•I hope Plone will continue to work well.
•All developers: please never forget other languages.

•Fuzzy search
•I want to get bug reports.
•Please try the product.

Page 40: Fuzzy search on plone & search for east asian language

Contact me

• Twitter: @terapyon

• Facebook: https://www.facebook.com/terapyon