language support, linguistics and text analytics - boston-12-2010
Language Support, Linguistics, and Text Analytics in Solr
Boston Apache Lucene and Solr Meetup
Carl W. Hoffman
Founder & CEO
www.basistech.com
Steve Kearns
Product Manager
Basis Technology
Agenda
• About Basis Technology
• Language Identification
• Linguistics for Search
• Entity Extraction in Solr
• Demonstration Application
About Basis Technology
• This is Headquarters, offices in: Tokyo, San Francisco, Washington DC
• Specialists in natural language processing for
  • Web/enterprise search
  • E-discovery
  • Document/OSINT/media exploitation
• Developer of a mature and widely used platform for
multilingual text analytics and information retrieval
• Solutions for commercial enterprises and government
agencies
  • E-discovery
  • Digital forensics
How well will it work for me?
• Define!
• Definition of success varies widely
• Log File Search: Return only things that match exactly – exception
• Product Search: Return similar results organized by category.
• Measure!
• Create examples, track performance.
Language Identification
• Detect dominant language
• Find language regions
Language Identification – Why?
• Faceting
• Language-specific indexing
• Entity extraction
Language Identification in Solr
• Preprocessor to Solr
• Custom
• OpenPipeline
• Solr UpdateRequestProcessor
• A chain of URPs may be defined in solrconfig.xml
• Has full access to the document
• Add, Edit, Remove fields
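UpdateRequestProcessor Chain
[Diagram: a Solr Doc passes through the UpdateRequestProcessor chain defined in SolrConfig.xml: the Language Identification URP (Add Language field; Rename Text: text_<LANG>), then Other Custom URP. The resulting Solr Fields (Language, text_SWEDISH, text_ENGLISH) go to Field Analysis, defined in Schema.xml.]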
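The chain can be approximated outside Solr as plain functions applied to a document in order. The following is a minimal Python sketch of that idea, not Solr's Java UpdateRequestProcessor API. `detect_language`, `language_id_processor`, `run_chain`, and the tiny stopword lists are all hypothetical names and toy data introduced for illustration.

```python
# Toy stand-in for a language identifier: counts stopword hits per language.
# A real deployment would call a proper language-identification component.
STOPWORDS = {
    "ENGLISH": {"the", "and", "is", "of", "to"},
    "SWEDISH": {"och", "att", "det", "en", "i"},
}


def detect_language(text):
    """Return the language whose stopwords occur most often, or UNKNOWN."""
    words = text.lower().split()
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "UNKNOWN"


def language_id_processor(doc):
    """Add a Language field and rename the text field to text_<LANG>."""
    out = dict(doc)  # work on a copy so the input document is not mutated
    lang = detect_language(out.get("text", ""))
    out["Language"] = lang
    if "text" in out:
        out["text_" + lang] = out.pop("text")
    return out


def run_chain(doc, processors):
    """Pass the document through each processor in order."""
    for processor in processors:
        doc = processor(doc)
    return doc


doc = run_chain(
    {"id": "1", "text": "En person skadades lindrigt i en trafikolycka"},
    [language_id_processor],
)
print(doc)
```

Each language-specific field (for example text_SWEDISH) can then be mapped to its own analysis chain in the schema.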
Language Identification Challenges
• Identifying query language is hard
• How do you query multiple fields at the same time?
• Use the Dismax parser:
  • /solr/select?qt=dismax&qf=text_ENGLISH%20text_SWEDISH%20de-text&q=hello%20world
  • The qt parameter specifies the query type as dismax
  • The qf parameter specifies the fields to search
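Building such a request by hand is error-prone because the field list and the query both need URL encoding. As a hedged sketch, the helper below assembles the query string with Python's standard library. `dismax_url` is a hypothetical helper name, and the `/solr/select` path is taken from the slide rather than from any particular deployment.

```python
from urllib.parse import quote, urlencode


def dismax_url(query, fields, base="/solr/select"):
    """Build a dismax request that searches several language fields at once."""
    params = [
        ("qt", "dismax"),          # query type, as on the slide
        ("qf", " ".join(fields)),  # space-separated list of fields to search
        ("q", query),
    ]
    # quote_via=quote encodes spaces as %20 instead of '+'
    return base + "?" + urlencode(params, quote_via=quote)


url = dismax_url("hello world", ["text_ENGLISH", "text_SWEDISH"])
print(url)
```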
Linguistics for Search
• Why?
• Improve recall!
• Every language has a unique set of challenges:
• Tokenization
• Chinese, Japanese, Korean, Thai
  • Morphological Analysis vs. N-Gram
• Stemming vs. Lemmatization
• All European and Middle Eastern languages
• Compound words
• Swedish, Danish, Norwegian, Dutch, German, Korean, Japanese
Morphological Analysis vs. N-Gram
• Search Term: 東京 ルパン上映時間
• N-Gram:
• Morphological:
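To make the N-gram side concrete, here is a small Python sketch that splits the search term into overlapping character bigrams. The choice of n=2 and the function name `char_ngrams` are assumptions for illustration. Morphological analysis is not shown because it needs a dictionary-backed analyzer.

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams for each whitespace-separated chunk."""
    grams = []
    for chunk in text.split():
        if len(chunk) < n:
            grams.append(chunk)  # chunk too short to split further
        else:
            grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return grams


tokens = char_ngrams("東京 ルパン上映時間")
print(tokens)
# ['東京', 'ルパ', 'パン', 'ン上', '上映', '映時', '時間']
```

Some bigrams, such as ン上, straddle what a morphological analyzer would likely treat as a word boundary. That is one reason N-gram indexing tends to trade precision for simplicity.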
Stemming vs. Lemmatization
• Stemming:
• Set of language-specific rules for removing leading and trailing characters from words
• Intended to increase recall at the expense of precision
• Example EN rule: Remove trailing “ing”
• Lemmatization:
• Complex set of language-specific approaches for producing the dictionary form of a given word
• Intended to increase recall without hurting precision.
• Uses context to disambiguate when multiple dictionary forms exist
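The difference can be sketched in a few lines of Python. The stemmer applies only the single example rule named above, and the lemmatizer is a toy dictionary lookup. `LEMMAS` is a hypothetical hand-made table; a real lemmatizer relies on a full lexicon and uses context to disambiguate, which this sketch does not attempt.

```python
def stem(word):
    """Apply one example rule: remove a trailing 'ing'."""
    if word.endswith("ing") and len(word) > 3:
        return word[:-3]
    return word


# Hypothetical lookup table standing in for a real lexicon.
LEMMAS = {
    "spoken": "speak",
    "conferences": "conference",
    "running": "run",
    "sing": "sing",
}


def lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)


for word in ["running", "sing", "spoken"]:
    print(word, "->", stem(word), "/", lemmatize(word))
```

The rule turns "running" into "runn" and "sing" into "s", which shows how a blunt rule can raise recall while hurting precision. The lookup returns "run" and "sing".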
Stemming vs. Lemmatization
• English: “I have spoken at several conferences”
• Stemming:
• Lemmatization:
Stemming vs. Lemmatization
• French: “Je n’étais pas là”
• Stemming:
• Lemmatization:
Stemming vs. Lemmatization + Decompounding
• German: “Am Samstagmorgen fliege ich zurueck nach Boston.”
• Stemming:
• Lemmatization (and decompounding!)
Stemming vs. Lemmatization
• Swedish: “En person skadades lindrigt i en trafikolycka i Pernå”
• Stemming:
• Lemmatization:
Linguistics in Solr
• Easy to customize as Analyzer/Tokenizer/TokenFilter
UpdateRequestProcessor Chain
[Diagram: a Solr Doc passes through the UpdateRequestProcessor chain defined in SolrConfig.xml: the Language Identification URP (Add Language field; Rename Text: text_<LANG>), then Other Custom URP. The resulting Solr Fields (Language, text_SWEDISH, text_ENGLISH) go to Field Analysis, defined in Schema.xml.]
Related Challenges
• Can I index text from many languages into the same field?
• Yes, but it’s not always a good idea, because query language ID is not accurate.
• You need a custom Query Analyzer that does stemming/lemmatization in many languages for the same query.
• How do I query text in multiple fields?
• Dismax parser!
Text Analytics in Solr: Entity Extraction
• Process of identifying people, places, organizations, dates,
times, etc. in unstructured text.
• Methods:
• Lists
• Rules
• Statistical
• Define your goals upfront!
• Some extraction methods work better for certain entity types
• Rules work well for dates, email addresses, and URLs, but not people
• Lists work well for titles, but not locations
• Statistical extractors work well for ambiguous entities: people, locations, organizations
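As an illustration of the rule-based method, the Python sketch below uses regular expressions to pull out email addresses, URLs, and ISO-style dates. The patterns are deliberately simple assumptions, not production-grade rules, and `RULES`, `extract_entities`, and the sample sentence are made up for this example.

```python
import re

# Simplified patterns, assumed for illustration only.
RULES = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "URL": re.compile(r"https?://[^\s<>\"]+"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def extract_entities(text):
    """Return a dict mapping entity type to the list of matches found."""
    found = {}
    for entity_type, pattern in RULES.items():
        matches = pattern.findall(text)
        if matches:
            found[entity_type] = matches
    return found


sample = (
    "Contact info@example.com before 2010-12-01 "
    "or see http://www.basistech.com for details."
)
entities = extract_entities(sample)
print(entities)
```

Rules like these work because the entity types have a regular surface form. Person names have no such form, which is why the slide points to statistical extractors for them.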
Entity Extraction
Entity Extraction in Solr
• Pre-processor to Solr
• Custom
• OpenPipeline
• UpdateRequestProcessor
• Store entities in new fields per entity type
<field name="PERSON" type="string" indexed="true" multiValued="true" stored="false" />
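UpdateRequestProcessor Chain
[Diagram: a Solr Doc passes through the UpdateRequestProcessor chain defined in SolrConfig.xml: the Language Identification URP (Add Language field; Rename Text: text_<LANG>), then the Entity Extraction URP. The resulting Solr Fields (Language, text_SWEDISH, text_ENGLISH) go to Field Analysis, defined in Schema.xml.]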
Entity Extraction Challenges
• How do you use extracted entities as facets?
• For retrieving counts:
• &facet=true&facet.field=PERSON&facet.field=LOCATION
• For filtering results:
• &fq=PERSON:"Steve Kearns"&fq=LOCATION:Stockholm (fq restricts the results; facet.query would only return counts)
• How else can Entities be used?
• Improve relevance by searching the entity fields with a boost
• Entity-specific search – phonetic matching and other name-specific search approaches
• Measure accuracy!
• F-Score is a measurement that combines precision and recall
• Vendors should provide this, but evaluate on your own data!
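For evaluating on your own data, the balanced F-score (F1) is the harmonic mean of precision and recall. The Python sketch below computes all three from a set of predicted entities and a hand-labeled gold set. The function name `precision_recall_f1` and both example sets are invented for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 over sets of (type, text) entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up evaluation data: what a human labeled vs. what an extractor returned.
gold = {
    ("PERSON", "Steve Kearns"),
    ("LOCATION", "Stockholm"),
    ("LOCATION", "Boston"),
    ("ORGANIZATION", "Basis Technology"),
}
predicted = {
    ("PERSON", "Steve Kearns"),
    ("LOCATION", "Stockholm"),
    ("LOCATION", "Solr"),
}

p, r, f1 = precision_recall_f1(predicted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Here two of the three predictions are correct (precision 2/3) and two of the four gold entities are found (recall 1/2), giving an F1 of 4/7.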
Demo: Odyssey Information Navigator
• Example search application built on Solr
• I personally built this in < 2 months using Solr and products from
Basis Technology
• I spent more time on the UI than integration of text analytics
components
• I would be happy to show you the Solr config and let you try it out