language support, linguistics and text analytics - boston-12-2010

Language Support, Linguistics, and Text Analytics in Solr

Boston Apache Lucene and Solr Meetup

Carl W. Hoffman

Founder & CEO

www.basistech.com

Steve Kearns

Product Manager

Basis Technology

Agenda

• About Basis Technology

• Language Identification

• Linguistics for Search

• Entity Extraction in Solr

• Demonstration Application

About Basis Technology

• This is headquarters; offices in Tokyo, San Francisco, and Washington DC

• Specialists in natural language processing for

• Web/enterprise search

• Document/OSINT/media exploitation

• Developer of a mature and widely used platform for multilingual text analytics and information retrieval

• Solutions for commercial enterprises and government agencies

• E-discovery

• Digital forensics

How well will it work for me?

• Define!

• Definition of success varies widely

• Log File Search: Return only things that match exactly – exception

• Product Search: Return similar results organized by category.

• Measure!

• Create examples, track performance.

Language Identification

• Detect dominant language

• Find language regions

Language Identification – Why?

• Faceting

• Language-specific indexing

• Entity extraction

Language Identification in Solr

• Preprocessor to Solr

• Custom

• OpenPipeline

• Solr UpdateRequestProcessor

• A chain of URPs may be defined in solrconfig.xml

• Has full access to the document

• Add, Edit, Remove fields

UpdateRequestProcessor Chain

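[Diagram: a Solr document passes through the UpdateRequestProcessor chain configured in solrconfig.xml. The Language Identification URP adds a Language field and renames the text field to text_<LANG> (for example text_SWEDISH or text_ENGLISH); other custom URPs can follow. Field analysis for each Solr field is defined in schema.xml.]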
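The two steps of the language identification stage (add a language field, rename the text field to text_<LANG>) can be sketched in plain Python. This is only an illustration of the data flow, not Solr's UpdateRequestProcessor API, which is Java. The stopword-based `detect_language` is a toy stand-in for a real language identifier, and the function names and the `text` / `language` field names are assumptions made for the example.

```python
# Toy stopword lists; a real language identifier would use a trained model.
STOPWORDS = {
    "ENGLISH": {"the", "and", "is", "of", "to", "in"},
    "SWEDISH": {"och", "en", "är", "det", "som", "i"},
}


def detect_language(text):
    """Return the language whose stopwords occur most often (toy heuristic)."""
    words = text.lower().split()
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    return max(scores, key=scores.get)


def language_id_processor(doc):
    """Add a language field and rename the generic text field to text_<LANG>."""
    out = dict(doc)  # leave the caller's document untouched
    lang = detect_language(out["text"])
    out["language"] = lang
    out["text_" + lang] = out.pop("text")
    return out


def run_chain(doc, processors):
    """Pass the document through each processor in order, like a URP chain."""
    for processor in processors:
        doc = processor(doc)
    return doc


doc = {"id": "1", "text": "En person skadades lindrigt i en trafikolycka i Pernå"}
indexed = run_chain(doc, [language_id_processor])
```

After the chain runs, `indexed` carries a `language` field and a `text_SWEDISH` field instead of `text`, so the schema can attach a Swedish-specific analyzer to that field name.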

Language Identification Challenges

• Identifying query language is hard

• How do you query multiple fields at the same time?

• Use the Dismax parser:

• /solr/select?qt=dismax&qf=text_ENGLISH%20text_SWEDISH%20de-text&q=hello%20world

• qt specifies the query type as dismax

• qf specifies the fields to search
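A request like the one above can be assembled programmatically. The sketch below only builds the query string; the helper name `dismax_query_string` is hypothetical, and it assumes a request handler named dismax is configured, as the slide's qt parameter implies.

```python
from urllib.parse import quote, urlencode


def dismax_query_string(query, fields, handler="dismax"):
    """Build a /solr/select query string that searches several fields at once."""
    params = [
        ("qt", handler),            # query type, as on the slide
        ("qf", " ".join(fields)),   # space-separated list of fields to search
        ("q", query),
    ]
    # quote_via=quote encodes spaces as %20 rather than "+"
    return "/solr/select?" + urlencode(params, quote_via=quote)


url = dismax_query_string("hello world", ["text_ENGLISH", "text_SWEDISH"])
```

Each additional language field only needs to be appended to the `fields` list, which is what makes dismax convenient when the query language is unknown.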

Linguistics for Search

• Why?

• Improve recall!

• Every language has a unique set of challenges:

• Tokenization

• Chinese, Japanese, Korean, Thai

• Morphological Analysis vs. N-Gram

• Stemming vs. Lemmatization

• All European and Middle Eastern languages

• Compound words

• Swedish, Danish, Norwegian, Dutch, German, Korean, Japanese

Morphological Analysis vs. N-Gram

• Search Term: 東京ルパン上映時間

• N-Gram:

• Morphological:
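The difference between the two approaches can be sketched on the search term above. The bigram function is a plain character n-gram tokenizer. The segmenter is a toy longest-match lookup against a four-word lexicon chosen for this example, standing in for a real morphological analyzer; neither is the implementation used on the slide.

```python
def char_ngrams(text, n=2):
    """Overlapping character n-grams (bigrams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def longest_match_segment(text, lexicon):
    """Greedy longest-match segmentation against a word list.

    Characters that start no known word are emitted as single-character tokens.
    """
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens


term = "東京ルパン上映時間"
LEXICON = {"東京", "ルパン", "上映", "時間"}  # toy lexicon for this example

bigrams = char_ngrams(term)
words = longest_match_segment(term, LEXICON)
```

Here `bigrams` contains fragments that cross word boundaries, including パン, which is also the word for bread, so an n-gram index can match unrelated documents. `words` contains only the four dictionary words.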

Stemming vs. Lemmatization

• Stemming:

• Set of language-specific rules for removing leading and trailing characters from words

• Intended to increase recall at the expense of precision

• Example EN rule: Remove trailing “ing”

• Lemmatization:

• Complex set of language-specific approaches for producing the dictionary form of a given word

• Intended to increase recall without hurting precision.

• Uses context to disambiguate when multiple dictionary forms exist
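A minimal sketch of the contrast, using toy data: the stemmer applies a few suffix-stripping rules (including the "ing" rule mentioned above), and the lemmatizer is a plain dictionary lookup. The rule list and the lemma table are assumptions made for illustration. A real lemmatizer also uses context to choose between candidate dictionary forms, which this lookup does not attempt.

```python
SUFFIX_RULES = ("ing", "es", "s")  # toy rules, tried in order

# Toy lemma dictionary; real systems use full morphological lexicons.
LEMMAS = {"spoken": "speak", "conferences": "conference"}


def toy_stem(word):
    """Strip the first matching suffix, with no check that the result is a word."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)


sentence = "I have spoken at several conferences"
stems = [toy_stem(w) for w in sentence.split()]
lemmas = [toy_lemmatize(w) for w in sentence.split()]
```

The stemmer leaves "spoken" untouched and cuts "conferences" to a non-word, while the same rules reduce "king" to "k", which is the precision cost the slide refers to. The lemmatizer maps "spoken" to "speak" because the irregular form is in its dictionary.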


• English: “I have spoken at several conferences”

• Stemming:

• Lemmatization:


• French: “Je n’étais pas là”

• Stemming:

• Lemmatization:

Stemming vs. Lemmatization + Decompounding

• German: “Am Samstagmorgen fliege ich zurueck nach Boston.”

• Stemming:

• Lemmatization (and decompounding!)


• Swedish: “En person skadades lindrigt i en trafikolycka i Pernå”

• Stemming:

• Lemmatization:
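Decompounding can be sketched as a recursive dictionary split. The lexicon below is a toy list covering the two compounds in the example sentences, and `decompound` is a hypothetical helper. It ignores linking elements between parts, which real German and Swedish decompounders have to handle.

```python
def decompound(word, lexicon, min_part=3):
    """Split a compound into known lexicon words, or return None if impossible."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest possible head first, keeping both parts >= min_part chars.
    for i in range(len(word) - min_part, min_part - 1, -1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = decompound(tail, lexicon, min_part)
            if rest is not None:
                return [head] + rest
    return None


def index_terms(word, lexicon):
    """Terms to index for a word: its parts if it decompounds, else itself."""
    return decompound(word, lexicon) or [word.lower()]


LEXICON = {"samstag", "morgen", "trafik", "olycka"}  # toy lexicon

german = index_terms("Samstagmorgen", LEXICON)
swedish = index_terms("trafikolycka", LEXICON)
```

Indexing the parts lets a query for "Samstag" or "olycka" find documents that only contain the compound, which is the recall gain the slide attributes to decompounding.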

Linguistics in Solr

• Easy to customize as Analyzer/Tokenizer/TokenFilter



[Diagram: the UpdateRequestProcessor chain shown earlier (Solr document, Add Language field, other custom URP), followed by field analysis of the language-specific fields text_SWEDISH and text_ENGLISH.]
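The Analyzer/Tokenizer/TokenFilter split can be pictured as one tokenizer followed by a chain of filters, with a separate chain bound to each language-specific field. The Python below is a conceptual sketch only: Solr's real classes are Java and are configured per field type in the schema, and every name and stopword list here is made up for illustration.

```python
import re


def word_tokenizer(text):
    """Tokenizer: split text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def make_stop_filter(stopwords):
    """Build a token filter that drops the given stopwords."""
    return lambda tokens: [token for token in tokens if token not in stopwords]


def make_analyzer(tokenizer, *filters):
    """Analyzer: one tokenizer followed by zero or more token filters."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


# One analyzer per language-specific field (toy stopword lists).
ANALYZERS = {
    "text_ENGLISH": make_analyzer(
        word_tokenizer, lowercase_filter, make_stop_filter({"i", "at", "the"})),
    "text_SWEDISH": make_analyzer(
        word_tokenizer, lowercase_filter, make_stop_filter({"en", "i"})),
}

tokens = ANALYZERS["text_SWEDISH"](
    "En person skadades lindrigt i en trafikolycka i Pernå")
```

A stemming, lemmatizing, or decompounding step would slot in as one more filter in the chain for the relevant language.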


Related Challenges

• Can I index text from many languages into the same field?

• Yes, but it’s not always a good idea, because query language ID is not accurate.

• You need a custom Query Analyzer that does stemming/lemmatization in many languages for the same query.

• How do I query text in multiple fields?

• Dismax parser!

Text Analytics in Solr: Entity Extraction

• Process of identifying people, places, organizations, dates, times, etc. in unstructured text.

• Methods:

• Lists

• Rules

• Statistical

• Define your goals upfront!

• Some extraction methods work better for certain entity types

• Rules work well for dates, email addresses, and URLs, but not people

• Lists work well for titles, but not locations

• Statistical extractors work well for ambiguous entities: people, locations, organizations
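Rule-based and list-based extraction are simple enough to sketch directly. The regular expressions below are deliberately loose illustrations (the date rule only covers the YYYY-MM-DD form), the title list is a two-entry toy, and the sample text is invented. Statistical extraction for people, locations, and organizations needs a trained model and is not shown.

```python
import re

# Rule-based extractors: one regular expression per entity type.
RULES = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "URL": re.compile(r"https?://[^\s<>\"]+"),
}

# List-based extractor: a fixed list of known titles (toy list).
TITLE_LIST = ("Product Manager", "CEO")


def extract_entities(text):
    """Return a dict mapping entity type to the list of strings found."""
    found = {etype: pattern.findall(text) for etype, pattern in RULES.items()}
    found["TITLE"] = [title for title in TITLE_LIST if title in text]
    return found


sample = ("Ask the Product Manager (pm@example.com) before 2011-01-15; "
          "details at http://www.example.com/faq")
entities = extract_entities(sample)
```

The patterns work here because dates, email addresses, and URLs have a regular surface form. A person's name has no such pattern, which is why the slide recommends statistical extractors for ambiguous entity types.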

Entity Extraction

Entity Extraction in Solr

• Pre-processor to Solr

• Custom

• OpenPipeline

• UpdateRequestProcessor

• Store entities in new fields per entity type

<field name="PERSON" type="string" indexed="true" multiValued="true" stored="false" />



[Diagram: the UpdateRequestProcessor chain shown earlier, with an Entity Extraction URP in the chain alongside the Add Language field step, followed by field analysis of text_SWEDISH and text_ENGLISH.]
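Storing entities in one multi-valued field per entity type, as in the PERSON field definition above, amounts to grouping extracted values by type. The sketch below assumes an extractor has already produced (type, text) pairs; `entity_extraction_processor` is a hypothetical name and this is not Solr's Java API.

```python
def entity_extraction_processor(doc, entities):
    """Return a copy of doc with one multi-valued field per entity type.

    `entities` is an iterable of (entity_type, text) pairs from any extractor.
    """
    # Copy list values so the caller's document is not modified.
    out = {key: list(value) if isinstance(value, list) else value
           for key, value in doc.items()}
    for entity_type, text in entities:
        values = out.setdefault(entity_type, [])
        if text not in values:  # keep each value once per document
            values.append(text)
    return out


doc = {"id": "1", "text_ENGLISH": "Steve Kearns spoke in Boston."}
extracted = [("PERSON", "Steve Kearns"), ("LOCATION", "Boston"),
             ("PERSON", "Steve Kearns")]
enriched = entity_extraction_processor(doc, extracted)
```

The resulting PERSON and LOCATION fields line up with string-typed, multi-valued schema fields, which is what makes them usable as facets.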


Entity Extraction Challenges

• How do you use extracted entities as facets?

• For retrieving counts:

• &facet=true&facet.field=PERSON&facet.field=LOCATION

• For filtering results:

• &fq=PERSON:"Steve Kearns"&fq=LOCATION:Stockholm

• How else can Entities be used?

• Improve relevance by searching the entity fields with a boost

• Entity-specific search – phonetic matching and other name-specific search approaches

• Measure accuracy!

• F-Score is a measurement that combines precision and recall

• Vendors should provide this, but evaluate on your own data!
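For evaluating on your own data, the balanced F-score (F1) is the harmonic mean of precision and recall. A minimal sketch, assuming entities are compared as exact (type, text) pairs against a hand-labeled gold set; the sample entities are invented.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 over sets of (type, text) entities."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {("PERSON", "Steve Kearns"), ("LOCATION", "Stockholm"),
        ("LOCATION", "Boston"), ("ORGANIZATION", "Basis Technology")}
predicted = {("PERSON", "Steve Kearns"), ("LOCATION", "Stockholm"),
             ("ORGANIZATION", "Stockholm")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
```

With two of three predictions correct and two of four gold entities found, precision is 2/3, recall is 1/2, and F1 is 4/7. Tracking these numbers per entity type shows where a given extraction method is weak.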

Demo: Odyssey Information Navigator

• Example search application built on Solr

• I personally built this in < 2 months using Solr and products from Basis Technology

• I spent more time on the UI than integration of text analytics components

• I would be happy to show you the Solr config and let you try it out
