Text Analysis with SAP HANA

Upload: msg-systems-ag-custom-development

Post on 18-Jan-2017

TRANSCRIPT

Page 1: Text Analysis with SAP HANA

.consulting .solutions .partnership

Text Analysis with SAP HANA

Page 2: Text Analysis with SAP HANA

Text Analysis with SAP HANA

© msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg

1. Motivation - Big Data (page 3)

2. Text Analysis with SAP HANA (page 7)

3. Enhancement Options (page 21)

Page 4: Text Analysis with SAP HANA

Big Data - taking a closer look

• Big Data is a hot topic today, but what is hidden in the “Big Data”?

• According to Merrill Lynch, 80-90% of all potentially usable business information may originate in unstructured form (Structure, Models and Meaning: Is "unstructured" data merely unmodeled?, Intelligent Enterprise, March 1, 2005.)

• According to Computer World, unstructured information might account for more than 70%–80% of all data in organizations (Holzinger, Andreas; et al. (2013). "Combining HCI, Natural Language Processing, and Knowledge Discovery - Potential of IBM Content Analytics as an Assistive Technology in the Biomedical Field" in Human-Computer Interaction and Knowledge Discovery in Complex, Unstructured, Big Data. Lecture Notes in Computer Science. Springer. pp. 13–24)

• This data will grow up to 40 zettabytes by 2020

• The data might originate from:
− Social Networks
− Call Centers
− “Letters” from Customers
− ...

Page 5: Text Analysis with SAP HANA

What is the Problem with Unstructured Data?

• It is unstructured!
− Not organized
− No pre-defined data model
− No metadata, or a mix of data and metadata
→ Limited or no access to the data via classical programs

• But the data contains valuable information

→ We have a lot of information that is relevant for the business, but we cannot access it

Page 6: Text Analysis with SAP HANA

How can we solve that issue?

• Text Analysis: Extracting high-quality information from texts

• Typical process of a text analysis:
− Parsing the text
− Adding features such as linguistic information
− Inserting the results into a database in a structured manner

• Examples of typical text analysis tasks:
− Entity recognition: Is it an organization, a person, or a place, including domain facts like requests?
− Sentiment analysis: What attitudinal information is “hidden” in the text?
− Relationship, fact, and event extraction
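
The three process steps above can be illustrated with a small, self-contained Python sketch. It is not how SAP HANA implements text analysis: the regex tokenizer, the `is_capitalized` feature and the `TokenRow` layout are assumptions chosen only to show how free text becomes structured rows.

```python
import re
from typing import NamedTuple


class TokenRow(NamedTuple):
    """One structured row per token (hypothetical layout, not the HANA schema)."""
    doc_id: int
    counter: int          # running token number within the document
    token: str            # token exactly as it appears in the text
    normalized: str       # lower-cased form, a stand-in for linguistic features
    is_capitalized: bool


def analyze(doc_id: int, text: str) -> list[TokenRow]:
    # Step 1: parse the text into word tokens.
    tokens = re.findall(r"\w+", text)
    # Step 2: add simple features to every token.
    # Step 3: return rows that could be inserted into a database table.
    return [
        TokenRow(doc_id, i, tok, tok.lower(), tok[0].isupper())
        for i, tok in enumerate(tokens, start=1)
    ]


rows = analyze(1, "Anna wants to join a CrossFit box")
for row in rows:
    print(row)
```

Each row carries the document key plus per-token attributes, which is the general shape that makes unstructured text accessible to ordinary queries.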

Page 8: Text Analysis with SAP HANA

What has this to do with SAP HANA?

© SAP SE

Page 9: Text Analysis with SAP HANA

Text Analysis with HANA - Basics

• Starting point: database table containing the text

• Supported data types are:
− TEXT
− BINTEXT
− NVARCHAR
− VARCHAR
− NCLOB
− CLOB
− BLOB

Page 10: Text Analysis with SAP HANA

Text Analysis with HANA - Basics

Fulltext index incl. options (see system view SYS.FULLTEXT_INDEXES)
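
The slide's screenshot is not part of this transcript. As a minimal sketch, the statement that creates a full-text index with text analysis switched on can be assembled as below. The table, column and index names are hypothetical, and the available options should be checked against the Search Developer Guide for your HANA revision.

```python
def create_fulltext_index_sql(index: str, table: str, column: str,
                              configuration: str) -> str:
    """Build a CREATE FULLTEXT INDEX statement with text analysis enabled.

    All identifiers are supplied by the caller; the names used below are
    hypothetical examples, not objects from the presentation.
    """
    return (
        f'CREATE FULLTEXT INDEX "{index}" ON "{table}" ("{column}") '
        f"CONFIGURATION '{configuration}' "
        "TEXT ANALYSIS ON"
    )


sql = create_fulltext_index_sql("TWEETS_IDX", "TWEETS", "CONTENT", "EXTRACTION_CORE")
print(sql)
```

The configuration name selects what kind of analysis runs; the linguistic and extraction configurations are covered on the following slides.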

Page 12: Text Analysis with SAP HANA

Text Analysis with HANA - Basics

Index properties on the table

Page 13: Text Analysis with SAP HANA

Text Analysis with HANA - Basics

Fulltext index table $TA_*
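
The result table takes its name from the full-text index, prefixed with `$TA_`. The following sketch builds a query against it. The index name is hypothetical, and the selected columns (`TA_TOKEN`, `TA_TYPE`, `TA_NORMALIZED`, `TA_COUNTER`) are assumed from the documented `$TA` layout; verify them in your own system.

```python
def ta_table_name(index: str) -> str:
    """Name of the text analysis result table that belongs to a full-text index."""
    return f"$TA_{index}"


def select_tokens_sql(index: str) -> str:
    # Column names are assumed from the documented $TA table layout;
    # check them against the generated table before use.
    return (
        "SELECT TA_TOKEN, TA_TYPE, TA_NORMALIZED "
        f'FROM "{ta_table_name(index)}" '
        "ORDER BY TA_COUNTER"
    )


query = select_tokens_sql("TWEETS_IDX")
print(query)
```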

Page 14: Text Analysis with SAP HANA

Text Analysis with HANA – Linguistic Analysis

LINGANALYSIS_BASIC = Tokenization

Page 15: Text Analysis with SAP HANA

Text Analysis with HANA – Linguistic Analysis

LINGANALYSIS_STEMS = Tokenization + Stems

Page 16: Text Analysis with SAP HANA

Text Analysis with HANA – Linguistic Analysis

LINGANALYSIS_FULL = Tokenization + Stems + Tagging

Page 17: Text Analysis with SAP HANA

Text Analysis with HANA – Entity Extraction

• In order to get more information out of the data SAP delivers several configurations

• These configurations focus on entity and fact extraction under specific aspects

• Types of Extraction:

− EXTRACTION_CORE

− EXTRACTION_CORE_ENTERPRISE

− EXTRACTION_CORE_PUBLIC_SECTOR

− EXTRACTION_CORE_VOICEOFCUSTOMER

Page 19: Text Analysis with SAP HANA

Text Analysis with HANA – Entity Extraction

EXTRACTION_CORE = Basic Entity Extraction (People, Organizations, Places)

Page 20: Text Analysis with SAP HANA

Text Analysis with HANA – Entity Extraction

EXTRACTION_CORE_VOICEOFCUSTOMER = Basic Entity Extraction + Sentiments

Page 22: Text Analysis with SAP HANA

Text Analysis with HANA – Custom Dictionary

• In several use cases you might need to enhance the dictionary due to your business domain

• Structure of a dictionary

© SAP SE
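
The figure showing the dictionary structure did not survive in this transcript. As a rough sketch of that structure, the Python code below generates a small dictionary file. The element and attribute names (`dictionary`, `entity_category`, `entity_name`, `standard_form`, `variant`) and the namespace are assumed from SAP's documented dictionary format; the category `CROSSFIT_TERM` and its entries are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace assumed from SAP's dictionary format; verify against the
# Extraction Customization Guide.
TA_NAMESPACE = "http://www.sap.com/ta/4.0"


def build_dictionary(category: str, entries: dict[str, list[str]]) -> str:
    """Serialize one entity category with standard forms and their variants."""
    root = ET.Element("dictionary", {"xmlns": TA_NAMESPACE})
    cat = ET.SubElement(root, "entity_category", {"name": category})
    for standard_form, variants in entries.items():
        entity = ET.SubElement(cat, "entity_name", {"standard_form": standard_form})
        for variant in variants:
            ET.SubElement(entity, "variant", {"name": variant})
    return ET.tostring(root, encoding="unicode")


xml_text = build_dictionary("CROSSFIT_TERM", {"Workout of the Day": ["WOD"]})
print(xml_text)
```

An entity category groups entries of one type, each entry has a standard form, and variants list the alternative spellings that should map to it.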

Page 23: Text Analysis with SAP HANA

Text Analysis with HANA – Workflow of Enhancement

1. Find an extraction configuration that is most fitting for you

2. Copy the configuration into the target folder

3. Create a new custom dictionary

4. Reference the dictionary in your configuration copy

5. Recreate the fulltext index using your custom configuration

Page 25: Text Analysis with SAP HANA

Text Analysis with HANA – Workflow of Enhancement

1. Find an extraction configuration that is most fitting for you

Page 26: Text Analysis with SAP HANA

Text Analysis with HANA – Workflow of Enhancement

2. Copy the configuration into the target folder
→ Important: File suffix *.hdbtextconfig

Page 27: Text Analysis with SAP HANA

Text Analysis with HANA – Workflow of Enhancement

3. Create a new custom dictionary
→ Important: File suffix *.hdbtextdict

Page 28: Text Analysis with SAP HANA

Text Analysis with HANA – Workflow of Enhancement

4. Reference the dictionary in your configuration copy
→ Important: You have to specify the full path
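
To make "full path" concrete, here is a hedged sketch of how such a reference could be assembled. It assumes the repository convention of package name, `::`, and file name including suffix, and it assumes the configuration lists dictionaries in a `Dictionaries` property with `string-list-value` entries, as in SAP's delivered configuration files. The package `acme.ta` and the dictionary `crossfit` are hypothetical; compare the output with your own configuration copy.

```python
def dictionary_path(package: str, name: str) -> str:
    """Full repository path: package, '::', file name with suffix."""
    return f"{package}::{name}.hdbtextdict"


def dictionaries_property(paths: list[str]) -> str:
    # Property and element names are assumed from SAP's delivered
    # configuration files; compare with your own copy before relying on them.
    values = "\n".join(
        f"  <string-list-value>{path}</string-list-value>" for path in paths
    )
    return (
        '<property name="Dictionaries" type="string-list">\n'
        f"{values}\n"
        "</property>"
    )


snippet = dictionaries_property([dictionary_path("acme.ta", "crossfit")])
print(snippet)
```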

Page 29: Text Analysis with SAP HANA

Text Analysis with HANA – Workflow of Enhancement

5. Recreate the fulltext index using your custom configuration
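
A minimal sketch of this last step: drop the existing index and create it again, this time pointing at the custom configuration. All names are hypothetical, and the `package::name` form of the configuration reference is an assumption based on how repository objects are usually addressed.

```python
def recreate_index_sql(index: str, table: str, column: str,
                       package: str, config_name: str) -> list[str]:
    """Statements that replace a full-text index with one using a custom configuration."""
    return [
        f'DROP FULLTEXT INDEX "{index}"',
        f'CREATE FULLTEXT INDEX "{index}" ON "{table}" ("{column}") '
        f"CONFIGURATION '{package}::{config_name}' "
        "TEXT ANALYSIS ON",
    ]


statements = recreate_index_sql(
    "TWEETS_IDX", "TWEETS", "CONTENT", "acme.ta", "CROSSFIT_CONFIG"
)
for statement in statements:
    print(statement)
```

Recreating the index triggers a fresh analysis of the text column, so the `$TA` table is rebuilt with the entities defined in the custom dictionary.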

Page 30: Text Analysis with SAP HANA

Text Analysis with HANA – Enhancement of Sentiment Analysis

• Special Case: Enhancement of sentiments

• You can directly enhance/tailor the files delivered by SAP

Page 31: Text Analysis with SAP HANA

Text Analysis with HANA – What’s next?

• Assume that we are in an “industry”-specific context or mining for “slang”-like facts and entities

• Sports are a good example of this!

• We use the example of CrossFit® … as there are some funny facts to extract

• Question: How can we extract complex entities from a text?

• Examples:
− Did somebody attend a CrossFit training?
− Does somebody want to join a CrossFit box?

Page 32: Text Analysis with SAP HANA

Text Analysis with HANA – Text Analysis Extraction Rules

Setup and Status Quo

Page 33: Text Analysis with SAP HANA

Text Analysis with HANA – Text Analysis Extraction Rules

• Extraction rules (CGUL rules): pattern-based language for pattern matching using character or token-based regular expressions combined with linguistic attributes to define custom entity types.

• Goal of the rule sets:

− Extract complex facts based on relations between entities and predicates.

− Entity-to-Entity relations to associate entities such as times, dates, and locations, with other entities

− Identify entities in domain-specific language.

− Capture facts expressed in new, popular “slang”

Page 34: Text Analysis with SAP HANA

Text Analysis with HANA – Text Analysis Extraction Rules

Extraction Rule = Regular Expressions + Tokens + Dictionaries + Luck ☺

Page 35: Text Analysis with SAP HANA

Text Analysis with HANA – Tokens, Operators, Expression Markers and Directives

• Tokens define the syntactic units of the text analysis

<string, STEM: <stem>, POS: <postag>>

• Example: <activat.*, STEM: activat.*, POS: V>

• Several operators are possible to enable the matching:

− Standard operators, e.g. character wildcard “.”, alternation “|”

− Iteration operators, e.g. zero or one occurrence of the preceding item “?”; zero or more occurrences of the preceding item “*”

− Grouping and containment operators, e.g. item group “( )”, range group “[ ]”
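
The operators listed above behave much like their counterparts in ordinary regular expressions. The sketch below uses Python's `re` module on plain strings to illustrate them. This is an analogy only: CGUL applies these operators to tokens as well as characters, and the sample patterns other than `activat.*` are made up for illustration.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


results = {
    # "." is any single character, "*" repeats the preceding item zero or more times
    "wildcard and iteration": matches(r"activat.*", "activated"),
    # "|" separates alternatives, "( )" groups them
    "alternation in a group": matches(r"(HCP|AWS|Azure)", "AWS"),
    # "?" allows zero or one occurrence of the preceding item
    "optional item": matches(r"colou?r", "color"),
    # "[ ]" defines a range of characters
    "character range": matches(r"[A-Za-z]*", "HANA"),
    # a string outside the alternatives does not match
    "no match": matches(r"(HCP|AWS|Azure)", "HANA"),
}
for name, outcome in results.items():
    print(name, outcome)
```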

Page 36: Text Analysis with SAP HANA

Text Analysis with HANA – Tokens, Operators, Expression Markers and Directives

• Expression Markers allow the definition of delimiters of the searched terms

• Several markers are available:

− Paragraph Marker: Specifies beginning and end of paragraph – [P]

− Entity Marker: Limits an expression to one or several entity types – [TE] <expr> [/TE]

− Sentence Marker: Specifies the beginning and end of a sentence – [SN] [/SN]

− Clause Container: Matches entire clause if expression is matched somewhere in the clause [CC] <expr> [/CC]

Page 37: Text Analysis with SAP HANA

Text Analysis with HANA – Tokens, Operators, Expression Markers and Directives

• Directives allow the definition of character classes, groups of tokens and relation types

• #define (character class): denotes character expressions
Example: #define ALPHA: [A-Za-z]

• #subgroup (group of tokens): defines a group of one or more tokens
Example: #subgroup Cloud: <HCP>|<AWS>|<Azure>

• #group (relation type): definition of custom facts and entity types consisting of one or more tokens
Example:
#group HANA: <HANA>
#group HANANATIVE: %(HANA) <native>
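
To show what the two example directives express, the following Python sketch emulates them on a list of tokens. It is not a CGUL interpreter: the whitespace tokenization, the function names and the sample sentence are assumptions made for illustration.

```python
# Mirrors "#subgroup Cloud: <HCP>|<AWS>|<Azure>": a set of alternative tokens.
CLOUD = {"HCP", "AWS", "Azure"}


def find_cloud(tokens: list[str]) -> list[str]:
    """Tokens that belong to the Cloud subgroup."""
    return [token for token in tokens if token in CLOUD]


def find_hana_native(tokens: list[str]) -> list[int]:
    """Positions where <HANA> is directly followed by <native>.

    Mirrors "#group HANANATIVE: %(HANA) <native>".
    """
    return [
        i for i in range(len(tokens) - 1)
        if tokens[i] == "HANA" and tokens[i + 1] == "native"
    ]


tokens = "SAP HANA native development on HCP".split()
print(find_cloud(tokens))
print(find_hana_native(tokens))
```

The subgroup is a reusable building block, while the group defines the entity type that is actually extracted; here the second group reuses the first via `%(HANA)`.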

Page 39: Text Analysis with SAP HANA

Text Analysis with HANA – Text Analysis Extraction Rules

Step 1 – Create a dictionary (It is all about entities)

Page 40: Text Analysis with SAP HANA

Text Analysis with HANA – Text Analysis Extraction Rules

Step 2 – Create a custom configuration

Page 41: Text Analysis with SAP HANA

Text Analysis with HANA – Text Analysis Extraction Rules

Recreate the fulltext index with the custom configuration

Page 42: Text Analysis with SAP HANA

Text Analysis with HANA – Text Analysis Extraction Rules

Next step: Create a simple plain text rule (*.hdbtextrule) and adapt the configuration

Page 43: Text Analysis with SAP HANA

Text Analysis with HANA – Text Analysis Extraction Rules

Result of the plain rule

Page 44: Text Analysis with SAP HANA

Text Analysis with HANA – Text Analysis Extraction Rules

Refactor and enhance the rule

Page 45: Text Analysis with SAP HANA

Text Analysis with HANA – Text Analysis Extraction Rules

Reduce the extracted entities using the PreProcessor Configuration

Page 46: Text Analysis with SAP HANA

Text Analysis with HANA – Summary

• SAP HANA contains a lot of functionality

• One very powerful feature is text analysis

• Besides the delivered content, you have a lot of options to adapt the text analysis to extract the entities and facts that you need

• Since SP09 rules get compiled upon activation (no separate compilation necessary)

• Creating custom dictionaries and text rules is cumbersome → No support in the IDE

• The results of the text analysis form the basis of predictive analytics (also part of SAP HANA ☺)

Page 47: Text Analysis with SAP HANA

Q&A

Page 48: Text Analysis with SAP HANA

.consulting .solutions .partnership

Dr. Christian Lechner
Principal IT Consultant

+49 (0) 171 [email protected]

msg systems ag (Headquarters)
Robert-Buerkle-Str. 1, 85737 Ismaning
Germany

www.msg-systems.com

Page 49: Text Analysis with SAP HANA

Text Analysis with HANA – Resources

• SAP HANA Search Developer Guide (Fulltext Index Options): help.sap.com -> Search Developer Guide

• SAP HANA Text Analysis Developer Guide: help.sap.com -> TA Developer Guide

• SAP HANA Text Analysis Language Reference Guide: help.sap.com -> TA Language Reference Guide

• SAP HANA Text Analysis Extraction Customization Guide: help.sap.com -> TA Extraction Customization Guide

• YouTube Playlist of SAP HANA Academy: Text Analysis and Search