voc real world enterprise needs

Post on 17-Dec-2014

Category:

Technology

DESCRIPTION

VOC sentiment analysis korean language processing morphological analysis CRF

TRANSCRIPT

Communicating Knowledge
Sentiment Analysis Symposium

Lessons Learned from a VOC Analysis System for a Big Korean Telecommunication Company

Ivan Berlocher
SALTLUX
Sentiment Analysis Symposium, Nov. 9th 2011

Introduction

• Saltlux Inc. is located in Seoul, Korea, established in 1979 and renovated in 2003.

• Domain expertise: Information Retrieval and Text/Data/Web/Graph Mining solutions and services based on Semantic Web technology.

• Main languages supported: Korean, Japanese, and English. Other languages are handled with external solutions.

• 70 employees in Seoul, one development center in Vietnam (12 employees), and one sales office in Japan (3 employees).

• Partnerships with other companies/institutes:
– Ontoprise in Germany
– Franz in California
– DERI in Ireland

• Many R&D partnerships (ETRI, KAIST, universities…)

Table of Contents
• Project & Environment Description
– Needs of Customer
– System (Main) Requirements
• VOC Data
– Sample Data
– Data Analysis
• System Overview
• Korean Linguistic
• Sentiment Analysis
• Lessons Learned
• Future Work

Project & Environment Description
• Needs of Customer
– Customer: a Korean telecommunication corporation
– Department of Voice of Customer Analysis
– Mission: analyze human-typed memos from all call centers to identify major problems, and produce reports for decision makers in order to improve quality of service and increase customer satisfaction.
– Data: human-typed notes covering any kind of question from customers
  • Information about subscriptions
  • Inquiries or complaints about devices (phones), services, or dealerships
  • Complaints about quality of communication
  • etc.

Number of notes: ~200 thousand a day (~5 million a month). Notes must remain searchable for 1 year (~60 million).

Project & Environment Description

• System (Main) Requirements
– Distinguish simple inquiries from complaints
– Classify into categories/departments of services
– Monitor trends of topics in real time, daily, weekly, and monthly
– Compare trends between time slices
– Find related topics
– Manage personal vocabulary
– Anonymize personal data (people's names, telephone numbers, social IDs, addresses, etc.)

The project started in October 2010 with a 3-month POC (~10 MM). After acceptance (success), integration with the real system took another 3 months (~10 MM). The 2 phases together: ~$200,000.

VOC Data Sample

VOC Data Sample

• Data often contains some structured information (metadata), but without any standard.

• Most of the time, however, there is no particular mark or metadata.

This makes Named Entity Recognition more complex.

The same information is entered in many different ways (e.g. 연락처: phone number).

VOC Data Analysis

• Data contains lots of named entities (products, services, people, social IDs, phone numbers), often related to privacy
• Data contains lots of technical (domain) terms
• The real content to analyze is mostly very short (tweet-like) but sometimes very long
• Lots of misspellings/mistypings
• Korean (Asian) segmentation problem, amplified by the speed constraint
• Lots of (non-standard) abbreviations

System Overview

• Overall Architecture

[Architecture diagram. Components shown: Text Segmentation, Morphological Analyzer, Chunk/Phrase Identification, Named Entities Recognition, Synonyms & Normalization, Indexing, Distributed Indexes, Classifier (Hybrid SVM & Rules), Complaint Detector, Analysis Phase, Searching/Clustering (TopicRank), Timelines Dumper, DFS, Timelines (20110713_0700_1.df, 20110713_0700_2.df, 20110713_0700_3.df, 20110713_0710_1.df, 20110713_0710_2.df, 20110713_0710_3.df), Scheduler, Merger & Ranker, Trend (TopN), DB, Web Server (Web UI).]

In the real system, indexing was parallelized across 18 Linux machines for speed.
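The diagram suggests that per-time-slice timeline files are merged and ranked to produce the Trend (TopN) view. The following is a minimal sketch of that merge-and-rank step only, assuming each slice has already been reduced to keyword counts; the slide does not describe the `.df` file format, so the dictionaries, keywords, and the function names `merge_slices` and `top_n` are hypothetical.

```python
from collections import Counter

def merge_slices(slices):
    """Sum keyword counts over several time slices (hypothetical in-memory form)."""
    total = Counter()
    for counts in slices:
        total.update(counts)
    return total

def top_n(slices, n):
    """Return the n most frequent keywords; ties are broken alphabetically."""
    merged = merge_slices(slices)
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Three hypothetical 10-minute slices of keyword counts.
slices = [
    {"요금": 12, "해지": 3},
    {"요금": 7, "통화품질": 9},
    {"해지": 5, "통화품질": 2},
]
print(top_n(slices, 2))
```

Because the merge is a plain sum, daily, weekly, or monthly trends could be computed from the same slices by choosing which ones to pass in.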

System Overview
• Home page
• Top N Keywords Extraction
• Related Keywords (Word Clustering)
• Trend (Timeline) view
• Tweets view

Korean Linguistic

• Brief introduction

Korean writing is alphabet-based: consonants and vowels are composed into blocks of consonant/vowel or consonant/vowel/consonant.

“나는 학생입니다.”
=> 나 = ㄴ (N) + ㅏ (A) = NA
=> 학 = ㅎ (H) + ㅏ (A) + ㄱ (K) = HAK

One unit of consonant/vowel or consonant/vowel/consonant is a syllable (“eumjeol”), and words are composed of several syllables.

Basic grammar: a word is composed of one root (noun, adjective/verb) followed by an inflection. For nouns, the inflection marks the grammatical role (subject/object/location, etc.) and is called “josa”; for verbs/adjectives, it marks aspect/mood (tense, honorific form, etc.) and is called “eomi”.

Korean Linguistic
• Examples:

“나는 학생입니다.”
=> “나는” = “나” (Na: I/me) + “는” (Neun: topic marker)
=> “학생입니다” = “학생” (Hak-seng: student) + “입니다” (Im-ni-da: am)
=> I’m (a) student.

Lots of (composite) inflectional forms:
– 학생 + 입니다 = noun + be
– 학생 + 인 / 이예요 / 이다 / 입니까? / 인데 / 인데요 etc. (was, will be …) (eomi)
– 학생 + syntactic role (이: subject / 에게: to / 한테: from / 을: object) etc. (josa)

Korean is highly agglutinative (it concatenates prefixes/nouns/josa/eomi):
=> Search engine: 검색엔진
=> High-performance search engine: 고성능검색엔진

But the use of spaces is free/arbitrary: 검색엔진 and 검색 엔진 are equivalent. Especially with SNS, space-limited devices, and speed constraints (like real-time transcription of conversations), spaces are more and more unused or misused. => Need for Automatic Segmentation Correction.

Project & Environment Description

• Automatic Segmentation Correction Illustration

Korean Linguistic

• Automatic Segmentation Correction Implementation

Binary classification approach: tag each syllable as preceded by a space or not. Any kind of classifier can be used; here we use a CRF model (an SVM would also work), with the set of features listed below.

Example input: 프랑스의 세계적인 디자이너 …

• Features
– 1-gram, 2-gram, 3-gram, 4-gram of characters (syllables)
– Korean or not, contains number

• Evaluation
– Accuracy (character)
– Word precision = # correctly spaced words / # words produced by the system

Results (CRF):
– Accuracy at character level: 96.25%
– Precision at word level: 95.58%

• Very simple to train (easy to get huge amounts of data)
• No need for a lexicon or any lexical information
• Performs surprisingly well
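The tagging formulation and the two evaluation measures can be sketched as follows. This is only an illustration under assumptions: the feature names, the n-gram windows around each syllable, and all function names are hypothetical, and the CRF training itself is left out (the feature dictionaries would be handed to whatever CRF toolkit is used).

```python
def is_hangul(ch):
    """True for precomposed Hangul syllables (U+AC00..U+D7A3)."""
    return "\uac00" <= ch <= "\ud7a3"

def to_labels(spaced):
    """Turn correctly spaced text into syllables plus labels (1 = space before)."""
    chars, labels = [], []
    pending_space = False
    for ch in spaced:
        if ch == " ":
            pending_space = True
            continue
        labels.append(1 if pending_space and chars else 0)
        chars.append(ch)
        pending_space = False
    return chars, labels

def features(chars, i):
    """Character 1- to 4-grams covering position i, plus two simple flags."""
    feats = {"hangul": is_hangul(chars[i]), "digit": chars[i].isdigit()}
    for n in range(1, 5):
        for start in range(i - n + 1, i + 1):
            if start >= 0 and start + n <= len(chars):
                feats[f"{n}g[{start - i}]"] = "".join(chars[start:start + n])
    return feats

def word_spans(labels):
    """Word boundaries as (start, end) index pairs implied by the labels."""
    spans, start = set(), 0
    for i in range(1, len(labels)):
        if labels[i] == 1:
            spans.add((start, i))
            start = i
    spans.add((start, len(labels)))
    return spans

def char_accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def word_precision(gold, pred):
    """# correctly spaced words / # words produced by the system."""
    produced = word_spans(pred)
    return len(produced & word_spans(gold)) / len(produced)

chars, gold = to_labels("프랑스의 세계적인 디자이너")
pred = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]  # hypothetical system output
print(char_accuracy(gold, pred), word_precision(gold, pred))
```

Training data comes for free in this setup: any correctly spaced text yields labels through `to_labels`, which matches the remark that huge amounts of data are easy to obtain.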

Korean Linguistic

• Transliteration
– Korean uses more and more English-derived words transliterated phonetically into the Korean alphabet (the reverse of “Romanization”), especially for foreign names (companies, products, people, technical/domain terms).
– Transcription is non-unique and non-standard.

Examples:
tablet: 태블릿, 타블렛, 테블릿
Hitachi: 히타치, 히타찌, 히다찌
iPhone 4s: 아이폰 4s, 아이폰포에스, 아이폰 포에스

Korean Linguistic

• Automatic transliteration recognition
– Rule-based phonetic transliteration, acting similarly to Soundex but adapted to Korean pronunciation.

tablet, 태블릿
T => ㅌ / ㄸ / ㄷ
A => ㅏ / ㅓ / ㅔ / ㅐ
etc.

This method has high recall but low precision and needs post-processing filtering (removing known Korean words found in lexicons, removing nouns that are too short, etc.).

Results have to be corrected by humans, so an efficient workbench is needed for productivity.

Gathered a dictionary of 130 thousand entries, mainly IT-oriented.

More academic research is still needed to solve this problem.
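One way to picture the Soundex-like idea is to reduce each Korean spelling to a coarse phonetic key so that spelling variants collide. The sketch below is not the rule set used in the project: it assumes that vowels are ignored (as in Soundex) and that plain/tense/aspirated consonants share one class; the `CLASS` table and the function names are hypothetical. Only the Hangul decomposition arithmetic comes from the Unicode standard.

```python
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
FINALS = [""] + "ㄱ ㄲ ㄳ ㄴ ㄵ ㄶ ㄷ ㄹ ㄺ ㄻ ㄼ ㄽ ㄾ ㄿ ㅀ ㅁ ㅂ ㅄ ㅅ ㅆ ㅇ ㅈ ㅊ ㅋ ㅌ ㅍ ㅎ".split()

# Assumed confusion classes: consonants that transliterators tend to interchange.
CLASS = {
    "ㄱ": "K", "ㄲ": "K", "ㅋ": "K",
    "ㄷ": "T", "ㄸ": "T", "ㅌ": "T",
    "ㅂ": "P", "ㅃ": "P", "ㅍ": "P",
    "ㅈ": "J", "ㅉ": "J", "ㅊ": "J",
    "ㅅ": "S", "ㅆ": "S",
    "ㄴ": "N", "ㅁ": "M", "ㄹ": "L", "ㅎ": "H",
}

def phonetic_key(word):
    """Consonant-class key of a Hangul word; vowels and non-Hangul are skipped."""
    key = []
    for ch in word:
        idx = ord(ch) - 0xAC00
        if not 0 <= idx < 11172:
            continue  # not a precomposed Hangul syllable
        initial = INITIALS[idx // 588]
        final = FINALS[idx % 28]
        if initial != "ㅇ":  # initial ㅇ is silent
            key.append(CLASS.get(initial, initial))
        if final:
            key.append("NG" if final == "ㅇ" else CLASS.get(final, final))
    # Collapse adjacent repeats, as Soundex does.
    collapsed = [k for i, k in enumerate(key) if i == 0 or k != key[i - 1]]
    return "-".join(collapsed)

def group_variants(words):
    """Group spellings that share the same phonetic key."""
    groups = {}
    for w in words:
        groups.setdefault(phonetic_key(w), []).append(w)
    return groups

print(group_variants(["태블릿", "타블렛", "테블릿", "히타치", "히타찌", "히다찌", "아이폰"]))
```

A key this coarse will also merge unrelated native words, which is consistent with the high-recall, low-precision behaviour noted above and is why lexicon filtering and manual correction are still needed.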

Sentiment Analysis

• Complaint Detection

A problem similar to standard Subjectivity Detection (detecting whether or not a sentence is sentiment-bearing).

Simple approach: binary classification using an SVM with manually tagged training/test corpora (more than 20 thousand).

Feature space: n-grams of characters (syllables) + n-grams of words; 2- to 4-grams gave the best results.

Feature selection is important to reduce the feature space; Chi-square/Information Gain gave the best results.
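To make the feature-space reduction concrete, here is a minimal sketch of chi-square scoring over character 2- to 4-grams for a binary complaint/inquiry label. The four example memos and their labels are invented for illustration, spaces are stripped before extracting n-grams (an assumption, since spacing is unreliable in this data), and the SVM itself is not shown.

```python
def char_ngrams(text, n_min=2, n_max=4):
    """Set of character n-grams of a text, ignoring spaces."""
    s = text.replace(" ", "")
    return {s[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)}

def chi_square(docs, labels, feature):
    """Chi-square statistic of the 2x2 table (feature present?, complaint?)."""
    a = b = c = d = 0
    for feats, y in zip(docs, labels):
        present = feature in feats
        if present and y:
            a += 1      # feature present, complaint
        elif present:
            b += 1      # feature present, inquiry
        elif y:
            c += 1      # feature absent, complaint
        else:
            d += 1      # feature absent, inquiry
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_features(texts, labels, k):
    """Keep the k n-grams with the highest chi-square score."""
    docs = [char_ngrams(t) for t in texts]
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda f: (-chi_square(docs, labels, f), f))
    return ranked[:k]

# Invented toy memos: 1 = complaint, 0 = simple inquiry.
texts = ["요금이 너무 비싸요 불만", "통화 품질 불만 접수",
         "요금제 변경 문의", "가입 방법 문의"]
labels = [1, 1, 0, 0]
print(select_features(texts, labels, 2))
```

On this toy data the two selected n-grams are the ones that occur in every memo of one class and in none of the other; n-grams spread evenly across both classes score zero and are dropped.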

Sentiment Analysis

Problem: no freely available resources such as SentiWordNet, so we need to build one.

– Built our general-domain dictionary as a baseline: 20,000 verbs/adjectives classified as positive/negative/neutral.
– The result is a lexicon of ~5,000 entries (positive/negative only).
– Enriched with manually extracted features from n-grams.

Precision-oriented (92%) but recall is still quite low (75%). Overall accuracy: 85%.

=> Still working on ways to improve recall without sacrificing precision. Basic ideas:
– Bagging/Boosting (combining several classifiers)
– Hybrid models combining linguistic (semantic/syntactic) rules and machine learning (statistics)
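How a positive/negative lexicon of this kind might be applied can be sketched as a simple lookup over lemmas. The four lexicon entries, the assumption that a morphological analyzer has already reduced the text to dictionary forms, and the function name `polarity` are all hypothetical; the actual system combines such knowledge with a classifier.

```python
# Hypothetical excerpt of a polarity lexicon: lemma -> +1 (positive) / -1 (negative).
LEXICON = {
    "좋다": 1,       # good
    "친절하다": 1,   # kind
    "나쁘다": -1,    # bad
    "불편하다": -1,  # inconvenient
}

def polarity(lemmas, lexicon=LEXICON):
    """Sum lexicon scores over lemmas; words outside the lexicon count as 0."""
    score = sum(lexicon.get(lemma, 0) for lemma in lemmas)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Lemmas as a morphological analyzer might return them (assumed input form).
print(polarity(["서비스", "불편하다"]))
print(polarity(["상담원", "친절하다"]))
```

A lookup like this only fires when a lexicon entry is present, which is one intuition for why such a component tends to be precise but misses cases, and why combining it with statistical classifiers is listed among the ideas.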

Lessons Learned

• Lessons Learned
- There is still quite a big gap between customer expectations and reality. Need to explain, and to keep the customer involved in the process of assessment and knowledge/domain vocabulary acquisition.
- Need to acquire a lot of lexicons: named entities, synonyms, stopwords, sentiment words.
- The quality and quantity of these lexicons are a real asset of the company. Acquiring lexicons requires workbenches for efficient semi-supervised methods (manually filtering the output of automatic methods) to reduce costs.
- Tuning classifier parameters, feature extraction, linguistic knowledge, etc. is time- and expertise-consuming.
- Simple academic methods work quite well (even if they need a lot of tuning).
- Beyond a simple search engine, the quality of NLP components has become more and more important, especially for Sentiment Analysis.

Lessons Learned

• Lessons Learned
- Customers are more and more interested in “Big Data”, “Listening Platform”, “Cloud”, “Social Network/Intelligence”…
- More and more customers want to get data/opinions from outside their in-house systems (blogs, communities (BBS), tweets, etc.). Typical questions:
=> How many crawlers are needed to crawl all Korean tweets/blogs?
=> How about crawling Facebook?
- How to identify “anti-communities” (like “Anti-Samsung”)? Who are the power bloggers?
=> The solutions required go far beyond Sentiment Analysis.
=> But often customers can’t afford, or don’t want, crawling infrastructure and maintenance fees.
=> New opportunities to deliver software in forms other than traditional package sales: SaaS/PaaS (Software/Platform/Infrastructure as a Service).
=> Even in the enterprise, a distributed framework is required (not only for web-scale services).
- Customers (at least in Korea) love knowing the technology and are more and more high-level users. They buy not only solutions but also consulting/expertise.
- Projects are more and more expensive, and many require either benchmarks or a POC.

Future Work & Plan

• Future Work (On-going)
- Acquire more entries in the sentiment dictionary
- Make a framework for handling linguistic rules and statistical models (SVM/Rocchio)
- Coupling with antonyms and/or hints
- Better handling of negation
- Better workbench for faster acquisition / (re-)training
- Co-reference resolution
- (Full/Semi) Parsing?
- More complex models than binary classification?
- Building/maintaining a platform for PaaS/SaaS

A long, long way to go…

Questions?

Thank you.
