October 13-16, 2016 • Austin, TX
An Introduction to NLP4L: Natural Language Processing Tool for Apache Lucene
Tomoko Uchida, Consultant, Rondhuit Co. Ltd.
Who am I
• Tomoko Uchida (@moco_beta)
• Luke (Lucene Toolbox) collaborator (2015 ~)
• https://github.com/DmitryKey/luke
• The best-known tool for debugging and learning Lucene/Solr, Elasticsearch index :-)
Agenda
• What’s NLP4L?
• How NLP improves search experience
• Calculate probabilities using ShingleFilter
• Transliteration (Application for HMM)
• NLP4L Framework (coming soon)
What’s NLP4L?
• GOAL
  • Improve Lucene users’ search experience
• FEATURES
  • Use of Lucene index as a Corpus Database
  • Lucene API Front-end written in Scala
• NLP4L provides
  • Preprocessors for existing ML tools
  • ML algorithms and applications (e.g. Transliteration)
How NLP improves search experience
Evaluation Measures
[Figure: Venn diagram of the target set and the result set. tp = in both, fp = in result only, fn = in target only, tn = in neither.]
precision = tp / (tp + fp)
recall = tp / (tp + fn)
[Figures: Recall, Precision]
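The two measures can be sketched directly on sets of document IDs. This is a minimal illustration, not part of NLP4L; the object and method names are made up for the example, and an empty denominator is reported as 0.0 by assumption.

```scala
object EvaluationMeasures {
  /** Returns (precision, recall) for a target set and a result set of document IDs. */
  def precisionRecall(target: Set[Int], result: Set[Int]): (Double, Double) = {
    val tp = (target intersect result).size.toDouble // relevant and returned
    val fp = (result diff target).size               // returned but not relevant
    val fn = (target diff result).size               // relevant but not returned
    // Assumption of this sketch: an empty denominator yields 0.0
    val precision = if (tp + fp == 0) 0.0 else tp / (tp + fp)
    val recall = if (tp + fn == 0) 0.0 else tp / (tp + fn)
    (precision, recall)
  }
}
```

For example, with target Set(1, 2, 3, 4) and result Set(3, 4, 5, 6, 7, 8) there are tp = 2, fp = 4 and fn = 2, so precision is 2/6 and recall is 2/4.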
Solution
• To improve recall: n-gram, synonym dictionary, etc. (e.g. Transliteration)
• To improve precision: facet (filter query) (e.g. Named Entity Extraction), Ranking Tuning
q=watch
[Figure: target set and result set]
• filter by “Gender=Men’s”
• filter by “Price=100-150”
gradual precision improvement
Structured Documents
ID | product | price | gender
1 | CURREN New Men’s Date Stainless Steel Military Sport Quartz Wrist Watch | 8.92 | Men’s
2 | Suiksilver The Gamer Watch | 87.99 | Men’s
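The gradual precision improvement from filtering structured fields can be sketched in memory. Everything here is illustrative: rows 3 and 4 are made-up products, the "target" (what the user actually wants) is assumed to be document 4 only, and a substring match stands in for a real Lucene query.

```scala
object FacetFilter {
  case class Product(id: Int, name: String, price: Double, gender: String)

  // IDs 1 and 2 are the rows from the slide; IDs 3 and 4 are made-up rows for this sketch.
  val docs: List[Product] = List(
    Product(1, "CURREN New Men's Date Stainless Steel Military Sport Quartz Wrist Watch", 8.92, "Men's"),
    Product(2, "Suiksilver The Gamer Watch", 87.99, "Men's"),
    Product(3, "Example Ladies Dress Watch", 120.00, "Women's"),
    Product(4, "Example Men's Diver Watch", 129.00, "Men's")
  )

  // Stand-in for q=watch: case-insensitive substring match on the product name.
  def search(q: String): List[Product] =
    docs.filter(_.name.toLowerCase.contains(q.toLowerCase))

  def precision(result: List[Product], target: Set[Int]): Double =
    if (result.isEmpty) 0.0 else result.count(p => target(p.id)).toDouble / result.size

  // Precision after q=watch, after the Gender filter, and after the Price filter.
  def demo: List[Double] = {
    val target = Set(4) // assumed: the user wants a men's watch priced 100-150
    val all = search("watch")
    val mens = all.filter(_.gender == "Men's")
    val priced = mens.filter(p => p.price >= 100 && p.price <= 150)
    List(all, mens, priced).map(precision(_, target))
  }
}
```

With this toy data the result set shrinks from four documents to three to one, and precision rises from 1/4 to 1/3 to 1.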
Unstructured Documents
ID | article
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels.
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants.
Make them Structured
ID | article | person | org | loc
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. | David Cameron | EU | Brussels
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. | | EU | UK, Britain
NEE[1] extracts interesting words.
[1] Named Entity Extraction
Calculate probabilities using ShingleFilter
Language Model
• LM represents the fluency of language
• N-gram model is the LM which is most widely used
• Calculation example for 2-gram
P(apple | an) = totalTermFreq("word2g", "an apple") / totalTermFreq("word", "an")
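The 2-gram calculation divides the count of a word pair by the count of its first word. The sketch below reproduces that arithmetic with in-memory maps instead of a Lucene index: the maps only play the role of totalTermFreq on the "word" and "word2g" fields, and the names are made up for this example. The sentences are the three training sentences of this deck's corpus, lower-cased, with POS tags and periods dropped.

```scala
object BigramModel {
  // Training sentences (lower-cased, tags and periods dropped for this sketch).
  val sentences: List[List[String]] = List(
    List("alice", "ate", "an", "apple"),
    List("mike", "likes", "an", "orange"),
    List("an", "apple", "is", "red")
  )

  // Plays the role of totalTermFreq("word", w): occurrences of each word.
  val unigramFreq: Map[String, Int] =
    sentences.flatten.groupBy(identity).map { case (w, ws) => w -> ws.size }

  // Plays the role of totalTermFreq("word2g", "w1 w2"): occurrences of each adjacent pair.
  val bigramFreq: Map[String, Int] =
    sentences.flatMap(s => s.sliding(2).collect { case List(a, b) => s"$a $b" })
      .groupBy(identity).map { case (g, gs) => g -> gs.size }

  /** Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen. */
  def prob(prev: String, word: String): Double = {
    val denom = unigramFreq.getOrElse(prev, 0)
    if (denom == 0) 0.0 else bigramFreq.getOrElse(s"$prev $word", 0).toDouble / denom
  }
}
```

Here "an" occurs three times and "an apple" twice, so BigramModel.prob("an", "apple") is 2/3 and BigramModel.prob("an", "orange") is 1/3.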
Part-of-Speech Tagging
Our Corpus for training
Alice/NNP ate/VB an/AT apple/NNP ./.
Mike/NNP likes/VB an/AT orange/NNP ./.
An/AT apple/NNP is/VB red/JJ ./.
NNP: Proper noun, singular
VB: Verb
AT: Article
JJ: Adjective
.: period
Hidden Markov Model
• Series of Words
• Series of Part-of-Speech
HMM state diagram
Initial probabilities: NNP 0.667, VB 0.0, AT 0.333, JJ 0.0, . 0.0
Transition probabilities: NNP → VB 0.6, NNP → . 0.4, VB → AT 0.667, VB → JJ 0.333, AT → NNP 1.0, JJ → . 1.0
Emission probabilities:
NNP: alice 0.2, apple 0.4, mike 0.2, orange 0.2
VB: ate 0.333, is 0.333, likes 0.333
AT: an 1.0
JJ: red 1.0
.: . 1.0
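The numbers in the state diagram are relative frequencies counted from the three tagged training sentences. A minimal sketch of that counting follows; it uses plain Scala collections rather than NLP4L's HMM classes, and all names are made up for the example.

```scala
object HmmEstimation {
  // Tagged training corpus: (word, tag) pairs per sentence, words lower-cased.
  val corpus: List[List[(String, String)]] = List(
    List("alice" -> "NNP", "ate" -> "VB", "an" -> "AT", "apple" -> "NNP", "." -> "."),
    List("mike" -> "NNP", "likes" -> "VB", "an" -> "AT", "orange" -> "NNP", "." -> "."),
    List("an" -> "AT", "apple" -> "NNP", "is" -> "VB", "red" -> "JJ", "." -> ".")
  )

  // Turn a list of observations into relative frequencies.
  private def relFreq[A](xs: List[A]): Map[A, Double] =
    xs.groupBy(identity).map { case (x, occ) => x -> occ.size.toDouble / xs.size }

  // P(tag of the first word in a sentence); unseen tags are simply absent (probability 0).
  val initial: Map[String, Double] = relFreq(corpus.map(_.head._2))

  // P(next tag | tag), counted from adjacent tag pairs inside each sentence.
  val transition: Map[String, Map[String, Double]] =
    corpus.flatMap(s => s.map(_._2).sliding(2).collect { case List(a, b) => (a, b) })
      .groupBy(_._1).map { case (from, pairs) => from -> relFreq(pairs.map(_._2)) }

  // P(word | tag).
  val emission: Map[String, Map[String, Double]] =
    corpus.flatten.groupBy(_._2).map { case (tag, ws) => tag -> relFreq(ws.map(_._1)) }
}
```

For instance, NNP is followed by VB three times and by a period twice, which gives the 0.6 and 0.4 on the diagram, and "apple" accounts for two of the five NNP tokens, which gives 0.4.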
Transliteration
Transliteration is a process of transcribing letters or words from one alphabet to another one to facilitate comprehension and pronunciation for non-native speakers.
Examples of transliteration from English to Japanese:
computer → コンピューター
server → サーバー
internet → インターネット
mouse → マウス
information → インフォメーション
It helps improve recall
• You search for “mouse” in English
• and you get “マウス” (= mouse) highlighted in Japanese
Training data in NLP4L
train_data/alpha_katakana.txt
academy,アカデミー
accent,アクセント
access,アクセス
accident,アクシデント
acrobat,アクロバット
action,アクション
adapter,アダプター
africa,アフリカ
airbus,エアバス
alaska,アラスカ
alcohol,アルコール
allergy,アレルギー
train_data/alpha_katakana_aligned.txt
アaカcaデdeミーmy
アaクcセceンnトt
アaクcセceスss
アaクcシciデdeンnトt
アaクcロroバッbaトt
アaクcショtioンn
アaダdaプpターter
アaフfリriカca
エaアirバbuスs
アaラlaスsカka
アaルlコーcohoルl
アaレlleルrギーgy
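Each line of the aligned file alternates a run of katakana with the run of alphabet letters it corresponds to. As a rough sketch of how such a line can be split into pairs, the following uses a regular expression over the katakana Unicode block (U+30A0 to U+30FF); the object and method names are made up for this example.

```scala
object AlignedLine {
  // One katakana run followed by one alphabet run.
  private val chunk = """([\u30A0-\u30FF]+)([a-zA-Z]+)""".r

  /** Splits an aligned line into (katakana, alphabet) pairs, in order of appearance. */
  def split(line: String): List[(String, String)] =
    chunk.findAllMatchIn(line).map(m => (m.group(1), m.group(2))).toList
}
```

Applied to "アaカcaデdeミーmy" this yields the pairs (ア, a), (カ, ca), (デ, de), (ミー, my), which is the kind of sequence an HMM can be trained on, with one side as the observed symbols and the other as the hidden states.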
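Demo: Transliteration
nlp4l> :load examples/trans_katakana_alpha.scala

    import scala.io.Source
    import scala.util.matching.Regex

    // `index` is assumed to be defined earlier in the script (not shown on the slide)
    val indexer = new HmmModelIndexer(index)
    val file = Source.fromFile("train_data/alpha_katakana_aligned.txt", "UTF-8")
    // one katakana run, one alphabet run, then the rest of the line
    val pattern: Regex = """([\u30A0-\u30FF]+)([a-zA-Z]+)(.*)""".r

    // recursively split an aligned line into (katakana, alphabet) pairs
    def align(result: List[(String, String)], str: String): List[(String, String)] = {
      str match {
        case pattern(a, b, c) => align(result :+ (a, b), c)
        case _ => result
      }
    }

    // create hmm model index
    file.getLines.foreach { line: String =>
      val doc = align(List.empty[(String, String)], line)
      indexer.addDocument(doc)
    }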
Demo: Transliteration
Input | Prediction | Right Answer
アルゴリズム | algorism | algorithm
プログラム | program | (OK)
ケミカル | chaemmical | chemical
ダイニング | dining | (OK)
コミッター | committer | (OK)
エントリー | entree | entry
Gathering loan words
① crawl
② gathering Katakana-Alphabet string pairs (e.g. アルゴリズム, algorithm)
③ Transliteration (“アルゴリズム” → “algorism”)
④ calculate edit distance
⑤ store pair of strings if edit distance is small enough
⑥ synonyms.txt
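The edit-distance check can be sketched with the standard Levenshtein distance: the transliteration predicted from the katakana string is compared with the English word found next to it, and the pair is kept only if the two are close. The threshold of 2 and all names below are assumptions of this sketch, not values taken from NLP4L.

```scala
object LoanWordFilter {
  /** Levenshtein edit distance (insertion, deletion and substitution each cost 1). */
  def editDistance(a: String, b: String): Int = {
    // prev(j) holds the distance between the prefix of `a` processed so far and b.take(j)
    var prev = Array.tabulate(b.length + 1)(identity)
    for (i <- 1 to a.length) {
      val cur = new Array[Int](b.length + 1)
      cur(0) = i
      for (j <- 1 to b.length) {
        val subst = prev(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1)
        cur(j) = math.min(subst, math.min(prev(j) + 1, cur(j - 1) + 1))
      }
      prev = cur
    }
    prev(b.length)
  }

  // Keep a crawled pair only if the predicted transliteration is close to the crawled word.
  // maxDistance = 2 is an arbitrary threshold chosen for this sketch.
  def accept(predicted: String, crawled: String, maxDistance: Int = 2): Boolean =
    editDistance(predicted, crawled) <= maxDistance
}
```

With the pair from the slide, editDistance("algorism", "algorithm") is 2 (substitute s with t, insert h), so the pair (アルゴリズム, algorithm) would be stored under this threshold.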
NLP4L Framework
• A pluggable framework that improves the search experience (mainly for Lucene-based search systems).
• Reference implementations of plug-ins and corpora are provided.
• Uses NLP/ML technologies to output models, dictionaries and indexes.
• Since NLP/ML are not perfect, an interface that lets users examine the output dictionaries themselves is provided as well.
[Figure: NLP4L Framework architecture]
• Data Source: Corpus (Text data, Lucene index), Query Log, Access Log
• Tasks: TermExtractor, Transliteration, NEE, Classification, Document Vectors, Language Detection, Learning to Rank, Personalized Search
• Dictionaries (maintenance): Suggestion (auto complete), Did you mean?, synonyms.txt, userdic.txt, keyword attachment
• Model files, Tagged Corpus, Document Vectors
• Solr, ES
• Mahout, Spark
example: Keyword Attachment
UI prototype for NLP4L Framework (Lucia): https://github.com/NLP4L/lucia
• Information about the associated Solr collection (core), and the NLP/ML task (processor) chain described in HOCON (Human-Optimized Config Object Notation)
• Keywords extracted from all documents, e.g. Named Entities by OpenNLP
• Information about the associated Solr document (unique key, etc.), the keywords extracted from this document, and the Solr field name for each keyword
• Check the keywords and remove wrong / inappropriate entries
• Sync (attach) all keywords to the Solr documents (by partial update command)
• Solr document (before keywords are attached)
• Solr document (after keywords are attached)
• If you delete keywords that have already been attached to Solr documents, they are also removed from the Solr index when the next “sync” action is executed.
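A Solr partial (atomic) update sends only the unique key plus a modifier such as "set" for the field to change. The sketch below builds such a JSON body as a string; it illustrates the request shape only and is not taken from the NLP4L code. The field name person_ss and the helper names are hypothetical.

```scala
object KeywordSync {
  // Minimal JSON string quoting (backslash and double quote only), enough for this sketch.
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  /** JSON body of an atomic update that overwrites one multi-valued field of one document. */
  def atomicSet(idField: String, id: String, field: String, keywords: Seq[String]): String = {
    val values = keywords.map(quote).mkString("[", ",", "]")
    s"""[{${quote(idField)}:${quote(id)},${quote(field)}:{"set":$values}}]"""
  }
}
```

KeywordSync.atomicSet("id", "1", "person_ss", Seq("David Cameron")) produces [{"id":"1","person_ss":{"set":["David Cameron"]}}], which would be posted to the collection's update handler. Because "set" replaces the whole field value, re-sending the current keyword list on each sync is one way a deleted keyword could disappear from the index.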
Keyword Attachment Application
• “Keyword attachment” is a general format that enables the following functions.
  • Learning to Rank
  • Personalized Search
  • Named Entity Extraction
  • Document Classification
[Figure: a Lucene doc and a Lucene doc with an attached keyword; ↑ Increase boost]
Learning to Rank
[Figures: Before Learning to Rank / After Learning to Rank, showing the target set and the result set with rank positions 1 2 3 … and 50 100 500 …]
• Program learns, from access log and other sources, that the score of document d for a query q should be larger than the normal score(q,d)
[Figure: Lucene doc d with attached queries q, q, …]
https://en.wikipedia.org/wiki/Learning_to_rank
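One very simple way to picture "doc d with queries q attached" is to count clicks per (query, document) pair in an access log and attach a query as a keyword once it has enough clicks. This is a naive stand-in for a real learning-to-rank model, shown only to make the data flow concrete; the log format, the threshold and all names are assumptions.

```scala
object ClickKeywords {
  // One access-log entry: a query and the document clicked for it (hypothetical format).
  case class Click(query: String, docId: String)

  /** For each document, the queries that led to at least minClicks clicks on it. */
  def keywordsPerDoc(log: Seq[Click], minClicks: Int): Map[String, Set[String]] =
    log.groupBy(c => (c.docId, c.query)).toList
      .collect { case ((doc, q), hits) if hits.size >= minClicks => doc -> q }
      .groupBy(_._1)
      .map { case (doc, pairs) => doc -> pairs.map(_._2).toSet }
}
```

The resulting keywords could then be attached to each document in a dedicated field, and that field boosted at query time so that score(q,d) rises for the learned pairs.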
Personalized Search
[Figures: for q=apple, target and result sets with rank positions 1 2 3 … and 50 100 500 …; one shows “computer …”, the other “fruit …”]
Lucene doc d1: q1u1, q2u2
Lucene doc d2: q2u1, q1u2
• Program learns, from access log and other sources, that the score of document d for a query q by user u should be larger than the normal score(q,d)
• Since Lucene does not let you specify score(q,d,u), you have to specify score(qu,d) instead.
• Because the number of q-u combinations can be enormous, limit the data to high-order queries or divide fields by user.
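The score(qu,d) idea amounts to folding the user into the keyword itself, so that each document carries combined query-user tokens. A minimal sketch under assumed names follows; the underscore separator and the log format are choices made for this example only.

```scala
object PersonalizedKeywords {
  // One access-log entry: who searched for what and clicked which document (hypothetical format).
  case class Click(user: String, query: String, docId: String)

  // Combine query and user into a single token; the "_" separator is an assumption.
  def quToken(query: String, user: String): String = s"${query}_$user"

  /** For each document, the combined query-user tokens to attach as keywords. */
  def tokensPerDoc(log: Seq[Click]): Map[String, Set[String]] =
    log.groupBy(_.docId).map { case (doc, clicks) =>
      doc -> clicks.map(c => quToken(c.query, c.user)).toSet
    }
}
```

With clicks (u1, q1, d1), (u2, q2, d1), (u1, q2, d2) and (u2, q1, d2), document d1 gets the tokens q1_u1 and q2_u2 and d2 gets q2_u1 and q1_u2, mirroring the d1/d2 example above. At query time the same token would be built from the incoming query and the current user.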
example: Generating Synonyms (loanwords)
• Execute a job that generates pairs of Katakana and corresponding English words from the corpus
• Adjust the auto-generated pairs (candidate synonyms) via the web UI
• Exported pairs can be used in SynonymFilter
synonyms_loadwords_ja.txt
acacia,アカシア
academy,アカデミー
acatenango,アカテナンゴ
access,アクセス
accident,アクシデント
action,アクション
active,アクティブ
activision,アクティビジョン
acton,アクトン
actor,アクター
……
Join and Code with Us!
https://github.com/NLP4L
Contact us at koji at apache dot org for the details.
Thank you!