Page 1: Language Support, Linguistics and Text Analytics - Boston-12-2010

Language Support, Linguistics, and Text Analytics in Solr

Boston Apache Lucene and Solr Meetup

Carl W. Hoffman

Founder & CEO

www.basistech.com

Steve Kearns

Product Manager

Basis Technology

Page 2: Language Support, Linguistics and Text Analytics - Boston-12-2010

Agenda

• About Basis Technology

• Language Identification

• Linguistics for Search

• Entity Extraction in Solr

• Demonstration Application

Page 3: Language Support, Linguistics and Text Analytics - Boston-12-2010

About Basis Technology

• This is Headquarters, offices in: Tokyo, San Francisco, Washington DC

• Specialists in natural language processing for

• Web/enterprise search

• E-discovery

• Document/OSINT/media exploitation

• Developer of a mature and widely used platform for multilingual text analytics and information retrieval

• Solutions for commercial enterprises and government agencies

• E-discovery

• Digital forensics

Page 4: Language Support, Linguistics and Text Analytics - Boston-12-2010

How well will it work for me?

• Define!

• Definition of success varies widely

• Log File Search: Return only things that match exactly – exception

• Product Search: Return similar results organized by category.

• Measure!

• Create examples, track performance.

Page 5: Language Support, Linguistics and Text Analytics - Boston-12-2010

Language Identification

• Detect dominant language

• Find language regions

Page 6: Language Support, Linguistics and Text Analytics - Boston-12-2010

Language Identification – Why?

• Faceting

• Language-specific indexing

• Entity extraction

Page 7: Language Support, Linguistics and Text Analytics - Boston-12-2010

Language Identification in Solr

• Preprocessor to Solr

• Custom

• OpenPipeline

• Solr UpdateRequestProcessor

• A chain of URPs may be defined in solrconfig.xml

• Has full access to the document

• Add, Edit, Remove fields

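[Diagram: a Solr document passes through the UpdateRequestProcessor chain configured in SolrConfig.xml (Language Identification URP: add Language field, rename Text to text_<LANG>; then Other Custom URP). Field analysis, configured in Schema.xml, is then applied to the Solr fields Language, text_SWEDISH, and text_ENGLISH.]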
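The language identification step described on this slide can be sketched in plain Python. This is only an illustration of the idea (add a Language field, rename the text field to text_<LANG>), not a Solr UpdateRequestProcessor. The stopword-counting detector, the STOPWORDS table, and the function names are assumptions made up for the example; a real deployment would call a proper language identifier.

```python
# Toy stand-in for a language identifier: counts stopword hits per language.
# The word lists are hypothetical and far too small for real use.
STOPWORDS = {
    "ENGLISH": {"the", "and", "is", "of", "to", "have", "at"},
    "SWEDISH": {"och", "är", "en", "att", "det", "i"},
}


def detect_language(text):
    """Return the language whose stopwords occur most often in the text."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    return max(scores, key=scores.get)


def route_by_language(doc, field="text"):
    """Copy the document, add a Language field, and rename text to text_<LANG>."""
    out = dict(doc)
    lang = detect_language(out[field])
    out["Language"] = lang
    out[f"{field}_{lang}"] = out.pop(field)
    return out


doc = {"id": "1", "text": "En person skadades lindrigt i en trafikolycka i Pernå"}
print(route_by_language(doc))
```

Each language-specific field (text_SWEDISH, text_ENGLISH) can then be given its own analysis settings in the schema.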

Page 8: Language Support, Linguistics and Text Analytics - Boston-12-2010

Language Identification Challenges

• Identifying query language is hard

• How do you query multiple fields at the same time?

• Use the Dismax parser:

• /solr/select?qt=dismax&qf=text_ENGLISH%20text_SWEDISH%20de-text&q=hello%20world

• The qt parameter specifies the query type as dismax

• The qf parameter specifies the fields to search
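A request like the one above can be assembled programmatically. The sketch below uses only the Python standard library; the helper name dismax_url is hypothetical, and it assumes just the two language fields text_ENGLISH and text_SWEDISH.

```python
from urllib.parse import quote, urlencode


def dismax_url(base, query, fields):
    """Build a Solr select URL that searches several fields via dismax."""
    params = {
        "qt": "dismax",          # query type
        "qf": " ".join(fields),  # space-separated list of fields to search
        "q": query,              # the user's query text
    }
    # quote_via=quote encodes spaces as %20 rather than '+'
    return base + "?" + urlencode(params, quote_via=quote)


print(dismax_url("/solr/select", "hello world", ["text_ENGLISH", "text_SWEDISH"]))
# prints /solr/select?qt=dismax&qf=text_ENGLISH%20text_SWEDISH&q=hello%20world
```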

Page 9: Language Support, Linguistics and Text Analytics - Boston-12-2010

Linguistics for Search

• Why?

• Improve recall!

• Every language has a unique set of challenges:

• Tokenization

• Chinese, Japanese, Korean, Thai

• Morphological Analysis vs. N-Gram

• Stemming vs. Lemmatization

• All European and Middle Eastern languages

• Compound words

• Swedish, Danish, Norwegian, Dutch, German, Korean, Japanese

Page 10: Language Support, Linguistics and Text Analytics - Boston-12-2010

Morphological Analysis vs. N-Gram

• Search Term: 東京 ルパン上映時間

• N-Gram:

• Morphological:
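The N-gram side of this comparison can be illustrated with a few lines of Python. This is a minimal sketch of character bigram tokenization, assuming whitespace-separated chunks; the function name char_ngrams is made up for the example. Morphological analysis needs a dictionary-backed analyzer and is not shown.

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams for each whitespace-separated chunk."""
    grams = []
    for chunk in text.split():
        if len(chunk) <= n:
            # Chunks no longer than n are kept whole.
            grams.append(chunk)
        else:
            grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return grams


print(char_ngrams("東京 ルパン上映時間"))
# prints ['東京', 'ルパ', 'パン', 'ン上', '上映', '映時', '時間']
```

Several of the bigrams (for example ン上 and 映時) straddle word boundaries, which is the kind of noise a morphological tokenizer is meant to avoid.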

Page 11: Language Support, Linguistics and Text Analytics - Boston-12-2010

Stemming vs. Lemmatization

• Stemming:

• Set of language-specific rules for removing leading and trailing characters from words

• Intended to increase recall at the expense of precision

• Example EN rule: Remove trailing “ing”

• Lemmatization:

• Complex set of language-specific approaches for producing the dictionary form of a given word

• Intended to increase recall without hurting precision.

• Uses context to disambiguate when multiple dictionary forms exist
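The difference can be sketched with a toy example. The stemmer below applies only the "remove trailing ing" rule mentioned above; the lemmatizer is a lookup in a tiny hand-written dictionary (LEMMAS is hypothetical and exists only for this illustration). A real lemmatizer also uses context to disambiguate, which this sketch does not attempt.

```python
def naive_stem(word):
    """Toy stemming rule: remove a trailing 'ing'."""
    if word.endswith("ing") and len(word) > 3:
        return word[:-3]
    return word


# Hypothetical, hand-written lemma dictionary for illustration only.
LEMMAS = {"spoken": "speak", "speaking": "speak", "conferences": "conference"}


def naive_lemma(word):
    """Look up the dictionary form; fall back to the lowercased word."""
    return LEMMAS.get(word.lower(), word.lower())


for word in ["speaking", "spring", "spoken"]:
    print(word, "->", naive_stem(word), "/", naive_lemma(word))
# prints:
# speaking -> speak / speak
# spring -> spr / spring
# spoken -> spoken / speak
```

The rule turns "spring" into "spr" (a precision problem) and misses "spoken" entirely (a recall problem), while the dictionary lookup handles both.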

Page 12: Language Support, Linguistics and Text Analytics - Boston-12-2010

Stemming vs. Lemmatization

• English: “I have spoken at several conferences”

• Stemming:

• Lemmatization:

Page 13: Language Support, Linguistics and Text Analytics - Boston-12-2010

Stemming vs. Lemmatization

• French: “Je n’étais pas là”

• Stemming:

• Lemmatization:

Page 14: Language Support, Linguistics and Text Analytics - Boston-12-2010

Stemming vs. Lemmatization + Decompounding

• German: “Am Samstagmorgen fliege ich zurueck nach Boston.”

• Stemming:

• Lemmatization (and decompounding!)
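Decompounding can be sketched as a dictionary-driven split. The mini-lexicon and the function name decompound below are assumptions for illustration; a production decompounder would rely on a full lexicon and handle linking elements between the parts.

```python
# Hypothetical mini-lexicon; a real decompounder would use a full dictionary.
LEXICON = {"samstag", "morgen"}


def decompound(word, lexicon=LEXICON, min_len=3):
    """Split a compound into known parts, preferring the longest first part.

    Returns an empty list when the word cannot be covered by the lexicon.
    """
    w = word.lower()
    if w in lexicon:
        return [w]
    # Try split points from the longest head down to the shortest allowed one.
    for i in range(len(w) - min_len, min_len - 1, -1):
        head, tail = w[:i], w[i:]
        if head in lexicon:
            rest = decompound(tail, lexicon, min_len)
            if rest:
                return [head] + rest
    return []


print(decompound("Samstagmorgen"))  # prints ['samstag', 'morgen']
```

Indexing the parts alongside the full compound lets a query for "Morgen" match a document that only contains "Samstagmorgen".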

Page 15: Language Support, Linguistics and Text Analytics - Boston-12-2010

Stemming vs. Lemmatization

• Swedish: “En person skadades lindrigt i en trafikolycka i Pernå”

• Stemming:

• Lemmatization:

Page 16: Language Support, Linguistics and Text Analytics - Boston-12-2010

Linguistics in Solr

• Easy to customize as Analyzer/Tokenizer/TokenFilter

[Diagram: a Solr document passes through the UpdateRequestProcessor chain configured in SolrConfig.xml (Language Identification URP: add Language field, rename Text to text_<LANG>; then Other Custom URP). Field analysis, configured in Schema.xml, is then applied to the Solr fields Language, text_SWEDISH, and text_ENGLISH.]

Page 17: Language Support, Linguistics and Text Analytics - Boston-12-2010

Related Challenges

• Can I index text from many languages into the same field?

• Yes, but it’s not always a good idea, because query language ID is not accurate.

• You need a custom Query Analyzer that does stemming/lemmatization in many languages for the same query.

• How do I query text in multiple fields?

• Dismax parser!

Page 18: Language Support, Linguistics and Text Analytics - Boston-12-2010

Text Analytics in Solr: Entity Extraction

• Process of identifying people, places, organizations, dates, times, etc. in unstructured text.

• Methods:

• Lists

• Rules

• Statistical

• Define your goals upfront!

• Some extraction methods work better for certain entity types

• Rules work well for dates, email addresses, and URLs, but not people

• Lists work well for titles, but not locations

• Statistical extractors work well for ambiguous entities: people, locations, organizations
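The rule-based and list-based methods can be sketched with the Python standard library. The patterns below are deliberately simple and the TITLES list is hypothetical; they illustrate the approach and are not production-grade extractors. Statistical extraction needs a trained model and is not shown.

```python
import re

# Simple illustrative rules; real-world patterns need to be more robust.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://[^\s<>\"]+")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # ISO dates only

# Hypothetical list of job titles for the list-based method.
TITLES = ["Product Manager", "Founder & CEO"]


def extract_entities(text):
    """Return rule-based and list-based matches grouped by entity type."""
    return {
        "EMAIL": EMAIL_RE.findall(text),
        "URL": URL_RE.findall(text),
        "DATE": DATE_RE.findall(text),
        "TITLE": [title for title in TITLES if title in text],
    }


sample = ("Steve Kearns, Product Manager, steve@example.com, "
          "see http://www.example.com/solr (updated 2024-01-31)")
print(extract_entities(sample))
```

Note that nothing here finds the person name "Steve Kearns"; that is the kind of ambiguous entity the slide assigns to statistical extractors.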

Page 19: Language Support, Linguistics and Text Analytics - Boston-12-2010

Entity Extraction

Page 20: Language Support, Linguistics and Text Analytics - Boston-12-2010

Entity Extraction in Solr

• Pre-processor to Solr

• Custom

• OpenPipeline

• UpdateRequestProcessor

• Store entities in new fields per entity type

<field name="PERSON" type="string" indexed="true" multiValued="true" stored="false" />

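[Diagram: a Solr document passes through the UpdateRequestProcessor chain configured in SolrConfig.xml (Language Identification URP: add Language field, rename Text to text_<LANG>; then Entity Extraction URP). Field analysis, configured in Schema.xml, is then applied to the Solr fields Language, text_SWEDISH, and text_ENGLISH.]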
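Storing entities in one multi-valued field per entity type can be sketched as follows. This is plain Python operating on a dictionary, not a Solr UpdateRequestProcessor; the function name add_entity_fields and the (type, value) input format are assumptions for the example.

```python
from collections import defaultdict


def add_entity_fields(doc, entities):
    """Copy a document and add one multi-valued field per entity type.

    `entities` is an iterable of (entity_type, value) pairs, for example
    the output of an entity extractor. Duplicate values are dropped.
    """
    out = dict(doc)
    by_type = defaultdict(list)
    for entity_type, value in entities:
        if value not in by_type[entity_type]:
            by_type[entity_type].append(value)
    out.update(by_type)
    return out


entities = [
    ("PERSON", "Steve Kearns"),
    ("LOCATION", "Stockholm"),
    ("PERSON", "Steve Kearns"),
    ("PERSON", "Carl W. Hoffman"),
]
print(add_entity_fields({"id": "1"}, entities))
# prints {'id': '1', 'PERSON': ['Steve Kearns', 'Carl W. Hoffman'], 'LOCATION': ['Stockholm']}
```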

Page 21: Language Support, Linguistics and Text Analytics - Boston-12-2010

Entity Extraction Challenges

• How do you use extracted entities as facets?

• For retrieving counts:

• &facet=true&facet.field=PERSON&facet.field=LOCATION

• For filtering results:

• &fq=PERSON:"Steve Kearns"&fq=LOCATION:Stockholm

• fq restricts the result set; facet.query only returns a count for its query

• How else can Entities be used?

• Improve relevance by searching the entity fields with a boost

• Entity-specific search – phonetic matching and other name-specific search approaches

• Measure accuracy!

• F-Score is a measurement that combines precision and recall

• Vendors should provide this, but evaluate on your own data!
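Measuring accuracy on your own data can be as simple as comparing extractor output against a hand-annotated gold set. The sketch below computes precision, recall, and F-score (F1 when beta is 1); the function name and the sample entities are made up for illustration.

```python
def f_score(gold, predicted, beta=1.0):
    """Return (precision, recall, F) for predicted entities against a gold set."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical gold annotations and extractor output for one document.
gold = {
    ("PERSON", "Steve Kearns"),
    ("PERSON", "Carl W. Hoffman"),
    ("LOCATION", "Stockholm"),
    ("LOCATION", "Boston"),
}
predicted = {("PERSON", "Steve Kearns"), ("LOCATION", "Stockholm")}

precision, recall, f1 = f_score(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints precision=1.00 recall=0.50 f1=0.67
```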

Page 22: Language Support, Linguistics and Text Analytics - Boston-12-2010

Demo: Odyssey Information Navigator

• Example search application built on Solr

• I personally built this in < 2 months using Solr and products from Basis Technology

• I spent more time on the UI than integration of text analytics components

• I would be happy to show you the Solr config and let you try it out