VOC real world enterprise needs
![Page 1: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/1.jpg)
Communicating Knowledge
Lessons Learned from a VOC Analysis System for a Big Korean Telecommunication Company
Ivan Berlocher
SALTLUX
Sentiment Analysis Symposium
Nov. 9th 2011
![Page 2: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/2.jpg)
Introduction
• Saltlux Inc. is located in Seoul, Korea; established in 1979 and renovated in 2003.
• Domain of expertise: Information Retrieval and Text/Data/Web/Graph Mining solutions and services based on Semantic Web technology.
• Main languages supported: Korean, Japanese, English. External solutions are used for other languages.
• 70 employees in Seoul, one development center in Vietnam (12 employees), and one sales office in Japan (3 employees)
• Several partnerships with other companies/institutes:
  – Ontoprise in Germany
  – Franz in California
  – DERI in Ireland
• Many R&D partnerships (ETRI, KAIST, universities…)
![Page 3: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/3.jpg)
Table of Contents
• Project & Environment Description
  – Needs of Customer
  – System (Main) Requirements
• VOC Data
  – Sample Data
  – Data Analysis
• System Overview
• Korean Linguistic
• Sentiment Analysis
• Lessons Learned
• Future work
![Page 4: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/4.jpg)
Project & Environment Description
• Needs of Customer
  – Customer: a Korean telecommunication corporation
  – Department of Voice of Customer (VOC) Analysis
  – Mission: analyze the (human-typed) memos from all call centers to identify major problems and produce reports for decision makers, in order to improve quality of service and increase customer satisfaction.
  – Data: human-typed notes covering any kind of question from customers
    • Information about subscriptions
    • Inquiries or complaints about devices (phones), services, or dealerships
    • Complaints about quality of communication
    • etc.
Number of notes: ~200 thousand a day (~5 million a month). Notes are required to remain searchable for 1 year (~60 million).
![Page 5: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/5.jpg)
Project & Environment Description
• System (Main) Requirements
  – Distinguish simple inquiries from complaints
  – Classify notes into categories/departments of services
  – Monitor trends of topics in real time, daily, weekly, monthly
  – Compare trends/tendencies between time slices
  – Find related topics
  – Manage personal vocabulary
  – Anonymize personal data (people's names, telephone numbers, social IDs, addresses, etc.)
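The anonymization requirement can be illustrated with a minimal sketch. It assumes Korean mobile numbers written like 010-1234-5678 and 13-digit social IDs written like 900101-1234567; the patterns, the placeholder tokens and the name `anonymize` are illustrative assumptions, not the deployed system. People's names and addresses are not covered, since they need named entity recognition rather than patterns.

```python
import re

# Digit lookarounds are used instead of \b because Hangul counts as a word
# character, so "연락처010-..." typed without a space would defeat \b.
SOCIAL_ID = re.compile(r"(?<!\d)\d{6}[- ]?\d{7}(?!\d)")               # assumed format
PHONE = re.compile(r"(?<!\d)01[016789][- ]?\d{3,4}[- ]?\d{4}(?!\d)")  # assumed format


def anonymize(note: str) -> str:
    """Replace social IDs and mobile numbers with placeholder tokens."""
    note = SOCIAL_ID.sub("<SOCIAL_ID>", note)  # longer pattern first
    note = PHONE.sub("<PHONE>", note)
    return note


if __name__ == "__main__":
    print(anonymize("연락처:010-1234-5678 문의"))
```

Masking the longer social-ID pattern first keeps a phone-like prefix inside an ID from being replaced separately.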
Project started in October 2010 with a 3-month POC (~10 MM). After acceptance (success), integration with the real system for another 3 months (~10 MM). The 2 phases together: ~$200,000.
![Page 6: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/6.jpg)
VOC Data Sample
![Page 7: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/7.jpg)
VOC Data Sample
• Data often contain some structured information (metadata), but without any standard.
• Most of the time, however, there is no particular mark or metadata.
This makes Named Entity Recognition more complex.
The same information is entered in many different ways (연락처: phone number).
![Page 8: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/8.jpg)
VOC Data Analysis
• Data contain lots of named entities: products/services/people/social IDs/phone numbers, often related to privacy
• Data contain lots of technical (domain) terms
• The real content to analyze is mostly very short (tweet-like) but sometimes very long
• Lots of misspelling/mistyping
• Korean (Asian) problem of segmentation, amplified by speed constraints
• Lots of (non-standard) abbreviations
![Page 9: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/9.jpg)
System Overview
• Overall Architecture
[Figure: overall architecture diagram. Components shown: Text Segmentation, Morphological Analyzer, Chunk/Phrase Identification, Named Entities Recognition, Synonyms & Normalization, Indexing, Distributed Indexes, Classifier (Hybrid SVM & Rules), Complaint Detector, Analysis Phase, Searching/Clustering (TopicRank), Timelines Dumper, DFS, Scheduler, Merger & Ranker, Trend (TopN), DB, Web Server (Web UI). Timeline files on the DFS: 20110713_0700_1.df, 20110713_0700_2.df, 20110713_0700_3.df, 20110713_0710_1.df, 20110713_0710_2.df, 20110713_0710_3.df]
In the real system, indexing has been parallelized across 18 Linux machines for speed.
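The "Merger & Ranker" step of the diagram can be sketched roughly as follows. This is an illustration under assumptions: the timeline file names are read as `<date>_<time>_<part>.df`, and each dump is modeled as an in-memory mapping from term to count, since the slides do not describe the actual `.df` format. The names `slot_of` and `merge_and_rank` are hypothetical.

```python
from collections import Counter, defaultdict


def slot_of(filename: str) -> str:
    """Return the 'YYYYMMDD_HHMM' slot of a name like '20110713_0700_1.df'."""
    stem = filename.rsplit(".", 1)[0]
    date, time, _part = stem.split("_")  # assumed naming scheme
    return f"{date}_{time}"


def merge_and_rank(partials: dict, top_n: int = 3) -> dict:
    """Sum the partial term counts of each time slot and keep the top N terms."""
    merged = defaultdict(Counter)
    for filename, counts in partials.items():
        merged[slot_of(filename)].update(counts)  # Counter.update adds counts
    return {slot: c.most_common(top_n) for slot, c in merged.items()}


if __name__ == "__main__":
    # Hypothetical partial dumps, one per indexing machine and time slot.
    partials = {
        "20110713_0700_1.df": {"roaming": 3, "billing": 1},
        "20110713_0700_2.df": {"roaming": 2, "signal": 4},
        "20110713_0710_1.df": {"billing": 5},
    }
    print(merge_and_rank(partials, top_n=2))
```

Summing per-slot partial counts before ranking means each indexing machine only has to dump its local counts, which fits the distributed layout shown in the figure.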
![Page 10: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/10.jpg)
System Overview
• Home page
![Page 11: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/11.jpg)
System Overview
• Top N Keywords Extraction
![Page 12: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/12.jpg)
System Overview
• Related Keywords (Word Clustering)
![Page 13: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/13.jpg)
System Overview
• Trend (Timeline) view
![Page 14: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/14.jpg)
System Overview
• Tweets view
![Page 15: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/15.jpg)
Korean Linguistic
• Brief introduction
Korean is written with an alphabet of consonants and vowels, composed into blocks of consonant/vowel or consonant/vowel/consonant.
“나는 학생입니다.”
=> 나 = ㄴ (N) + ㅏ (A) = NA
=> 학 = ㅎ (H) + ㅏ (A) + ㄱ (K) = HAK
One unit of consonant/vowel or consonant/vowel/consonant is a syllable (“eumjeol”), and words (space-delimited units, “eojeol”) are composed of several syllables.
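The composition of a syllable block shown above (나 = ㄴ + ㅏ, 학 = ㅎ + ㅏ + ㄱ) can be reproduced from the Unicode layout of precomposed Hangul syllables, which start at U+AC00 and are ordered by initial consonant, vowel and optional final consonant. The sketch below is an illustration added here, and the function name `decompose` is hypothetical.

```python
# Jamo tables in Unicode order: 19 initials, 21 vowels, 27 finals (plus "none").
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def decompose(syllable: str) -> tuple:
    """Split one Hangul syllable block into its consonant/vowel letters."""
    code = ord(syllable) - 0xAC00
    if not 0 <= code < 19 * 21 * 28:
        return (syllable,)  # not a precomposed Hangul syllable
    lead, rest = divmod(code, 21 * 28)
    vowel, tail = divmod(rest, 28)
    parts = (CHOSEONG[lead], JUNGSEONG[vowel], JONGSEONG[tail])
    return tuple(p for p in parts if p)  # drop the empty final consonant


if __name__ == "__main__":
    for ch in "나학":
        print(ch, decompose(ch))
```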
Basic grammar: a word is composed of one root (noun, adjective/verb) followed by an inflection marking either the grammatical role (subject/object/location, etc.) for nouns (called “Josa”), or aspect/mood (tense, honorific form, etc.) for verbs/adjectives (called “Eomi”).
![Page 16: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/16.jpg)
Korean Linguistic
• Examples:
“나는 학생입니다.”
=> “나는” = “나” (na: I/me) + “는” (neun: topic marker)
=> “학생입니다” = “학생” (hak-saeng: student) + “입니다” (im-ni-da: am)
=> I’m (a) student.
Lots of (composite) inflectional forms:
학생 + 입니다 = Noun + Be
학생 + 인 / 이예요 / 이다 / 입니까? / 인데 / 인데요 etc. (was, will be …) (eomi)
학생 + syntactic role (이: subject / 에게: to / 한테: from / 을: object) etc. (josa)
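The root + josa/eomi structure of the examples above can be illustrated with a deliberately naive longest-suffix split. The suffix list contains only the forms quoted on this slide and the function name `split_suffix` is hypothetical; a real morphological analyzer needs a lexicon and context, because a short suffix such as 인 or 이 also occurs at the end of ordinary nouns.

```python
# Only the josa/eomi forms quoted on the slide; longest candidates are tried first.
SUFFIXES = sorted(
    ["입니다", "입니까", "이예요", "인데요", "인데", "이다",
     "에게", "한테", "인", "는", "이", "을"],
    key=len,
    reverse=True,
)


def split_suffix(word: str) -> tuple:
    """Return (root, suffix); suffix is '' when no listed form matches."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return word, ""


if __name__ == "__main__":
    for w in ["나는", "학생입니다", "학생인데요", "검색엔진"]:
        print(w, split_suffix(w))
```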
Korean is highly agglutinative (it concatenates prefix/nouns/josa/eomi):
=> Search engine: 검색엔진
=> High-performance search engine: 고성능검색엔진
But the use of spaces is free/arbitrary. One can equivalently write 검색엔진 or 검색 엔진.
Especially with SNS, space-limited devices and speed constraints (like real-time transcription of conversations), spaces are more and more un-/misused. => Need Automatic Segmentation Correction.
![Page 17: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/17.jpg)
Project & Environment Description
• Automatic Segmentation Correction Illustration
![Page 18: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/18.jpg)
Korean Linguistic
• Automatic Segmentation Correction Implementation
Binary classification approach: tag each syllable as having a space before it or not.
Any kind of classifier can be used. Here we use a CRF model (could be SVM) with the set of features listed below.
Example input: 프랑스의 세계적인 디자이너 …
Results (CRF):
  – Accuracy at character level: 96.25%
  – Precision at word level: 95.58%
• Features
  – 1-gram, 2-gram, 3-gram, 4-gram of characters (syllables)
  – Korean or not, contains number
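The tagging scheme and the feature set listed above can be sketched as follows. This is only an illustration of how training labels and per-syllable features might be derived from correctly spaced text; the feature names, the n-gram windowing and the function names are assumptions, and the resulting feature dictionaries would then be fed to a sequence classifier such as a CRF (not shown).

```python
def to_tagged(text: str) -> list:
    """Turn spaced text into (syllable, label) pairs; label 1 = a space precedes it."""
    pairs, space_before = [], False
    for ch in text:
        if ch == " ":
            space_before = True
        else:
            pairs.append((ch, int(space_before)))
            space_before = False
    return pairs


def is_hangul(ch: str) -> bool:
    """True for precomposed Hangul syllables (U+AC00..U+D7A3)."""
    return "\uac00" <= ch <= "\ud7a3"


def features(chars: list, i: int, max_n: int = 4) -> dict:
    """Character 1- to 4-grams covering position i, plus two character-type flags."""
    feats = {}
    for n in range(1, max_n + 1):
        for start in range(i - n + 1, i + 1):
            if start >= 0 and start + n <= len(chars):
                gram = "".join(chars[start:start + n])
                feats[f"{n}g[{start - i}]={gram}"] = 1
    feats["hangul"] = int(is_hangul(chars[i]))
    feats["digit"] = int(chars[i].isdigit())
    return feats


if __name__ == "__main__":
    print(to_tagged("검색 엔진"))
    print(features(list("검색엔진"), 2))
```

Because the labels come straight from the spaces in well-edited text, any large clean corpus can serve as training data without manual annotation.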
• Evaluation
  – Accuracy (character)
  – Word-precision = # correctly spaced words / # words produced by the system
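The two evaluation measures above can be computed as in the following sketch, which compares a gold-standard spacing with a system output of the same text. Counting a produced word as correct only when both of its boundaries match the gold spacing is an assumption about how "correctly spaced" was judged; the function names are hypothetical.

```python
def space_labels(text: str) -> list:
    """One label per non-space character: 1 if a space precedes it, else 0."""
    labels, space_before = [], False
    for ch in text:
        if ch == " ":
            space_before = True
        else:
            labels.append(int(space_before))
            space_before = False
    return labels


def char_accuracy(gold: str, pred: str) -> float:
    """Share of characters whose space/no-space label matches the gold label."""
    g, p = space_labels(gold), space_labels(pred)
    if len(g) != len(p):
        raise ValueError("gold and pred must contain the same characters")
    return sum(a == b for a, b in zip(g, p)) / len(g)


def word_spans(text: str) -> set:
    """Character offsets (start, end) of each word, ignoring the spaces."""
    spans, pos = set(), 0
    for word in text.split():
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def word_precision(gold: str, pred: str) -> float:
    """# correctly spaced words / # words produced by the system."""
    g, p = word_spans(gold), word_spans(pred)
    return len(g & p) / len(p)


if __name__ == "__main__":
    gold, pred = "고성능 검색 엔진", "고성능 검색엔진"
    print(char_accuracy(gold, pred), word_precision(gold, pred))
```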
• Very simple to train (easy to get huge amounts of data)
• No need for a lexicon or any lexical information
• Performs surprisingly well
![Page 19: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/19.jpg)
Korean Linguistic
• Transliteration
  – Korean uses more and more English-derived words transliterated phonetically into the Korean alphabet (the reverse of “Romanization”), especially for foreign names (companies, products, people, technical/domain terms)
  – Transcription is non-unique and non-standard
Examples:
tablet: 태블릿, 태블릿, 타블렛, 테블릿
Hitachi: 히타치, 히타찌, 히다찌, 히타찌
iPhone 4s: 아이폰 4s, 아이폰포에스, 아이폰 포에스
![Page 20: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/20.jpg)
Korean Linguistic
• Automatic transliteration recognition
  – Build a rule-based transliteration from phonetic correspondences, acting similarly to Soundex, adapted for Korean pronunciation.
tablet, 태블릿
T => ㅌ / ㄸ / ㄷ
A => ㅏ / ㅓ / ㅔ / ㅐ
etc.
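One way to picture the Soundex-like idea is a phonetic key that maps each Korean letter to a coarse sound class, so that spelling variants of the same foreign word collapse to the same key. The sketch below is an assumption-laden illustration, not the rule set used in the project: the T and A classes come from the slide, the ㅈ/ㅉ/ㅊ class is added here for the Hitachi example, and the names `jamo` and `phonetic_key` are hypothetical.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Coarse sound classes. "T" and "A" follow the slide; "C" is an assumption.
SOUND_CLASS = {
    "ㅌ": "T", "ㄸ": "T", "ㄷ": "T",
    "ㅏ": "A", "ㅓ": "A", "ㅔ": "A", "ㅐ": "A",
    "ㅈ": "C", "ㅉ": "C", "ㅊ": "C",
}


def jamo(ch: str) -> list:
    """Letters of one Hangul syllable block; other characters pass through."""
    code = ord(ch) - 0xAC00
    if not 0 <= code < 11172:
        return [ch]
    lead, rest = divmod(code, 588)
    vowel, tail = divmod(rest, 28)
    parts = [CHOSEONG[lead], JUNGSEONG[vowel], JONGSEONG[tail]]
    return [p for p in parts if p]


def phonetic_key(word: str) -> str:
    """Replace each letter by its sound class; unlisted letters stay as-is."""
    return "".join(SOUND_CLASS.get(j, j) for ch in word for j in jamo(ch))


if __name__ == "__main__":
    for w in ["히타치", "히타찌", "히다찌", "태블릿", "테블릿", "타블렛"]:
        print(w, phonetic_key(w))
```

With these classes 히타치, 히타찌 and 히다찌 share one key, as do 태블릿 and 테블릿, while 타블렛 still gets a different key because its last vowel falls outside the listed classes. Widening the classes raises recall but merges more unrelated words.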
This method has high recall but low precision and needs post-processing filtering (remove known Korean words found in lexicons, remove nouns that are too short, etc.).
The result has to be corrected by humans, so an efficient workbench is needed for productivity.
Gathered a dictionary of 130 thousand entries, mainly IT oriented.
More academic research is still needed to solve this problem.
![Page 21: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/21.jpg)
Sentiment Analysis
• Complaint Detection
A problem similar to standard subjectivity detection (detecting whether a sentence is sentiment-bearing or not).
Simple approach: binary classification using an SVM, with manually tagged training/test corpora (more than 20 thousand).
Feature space: N-grams of characters (syllables) + N-grams of words; using 2-4 grams gave the best results.
Feature selection is important to reduce the feature space; Chi-square/Information Gain gave the best results.
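The chi-square selection step can be sketched in plain Python as below. It scores each character 2-4 gram by the chi-square statistic of its 2x2 contingency table (n-gram present or absent versus complaint or not); the highest-scoring n-grams would then be kept as SVM features. The toy English notes, the labels and the function names are hypothetical, and the SVM training itself is not shown.

```python
from collections import Counter


def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> set:
    """Set of character n-grams of a note, ignoring spaces."""
    chars = text.replace(" ", "")
    return {
        chars[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(chars) - n + 1)
    }


def chi_square_scores(docs: list, labels: list) -> dict:
    """Chi-square score of every n-gram; labels are 1 (complaint) or 0."""
    n, n_pos = len(docs), sum(labels)
    df_all, df_pos = Counter(), Counter()
    for doc, y in zip(docs, labels):
        for gram in char_ngrams(doc):
            df_all[gram] += 1
            if y:
                df_pos[gram] += 1
    scores = {}
    for gram, total in df_all.items():
        a = df_pos[gram]        # complaints containing the n-gram
        b = total - a           # non-complaints containing it
        c = n_pos - a           # complaints without it
        d = (n - n_pos) - b     # non-complaints without it
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[gram] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    docs = ["bad signal", "bad bill", "plan info", "store info"]
    labels = [1, 1, 0, 0]
    scores = chi_square_scores(docs, labels)
    print(sorted(scores, key=scores.get, reverse=True)[:5])
```

The statistic is symmetric, so n-grams typical of plain inquiries score as high as n-grams typical of complaints; both help a binary classifier.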
![Page 22: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/22.jpg)
Sentiment Analysis
Problems: no freely available resources such as SentiWordNet. Need to build one!
Build our general-domain dictionary as a baseline:
  – 20,000 verbs/adjectives classified as positive/negative/neutral
  – The result is a lexicon of ~5,000 entries (only positive/negative)
  – Enriched with features manually extracted from N-grams
Precision oriented (92%) but still quite low recall (75%). Overall accuracy: 85%
=> Still working on ways to improve recall without sacrificing precision.
Basic ideas:
  – Bagging / Boosting (combining several classifiers)
  – Hybrid models between (linguistic: semantic/syntactic) rules and machine learning (statistics)
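For reference, the three reported measures relate to the confusion counts of the complaint detector as in the sketch below. The F1 value is not given in the slides; it is a derived summary added here to show how the 92% precision and 75% recall combine, and the function names are hypothetical.

```python
def precision_recall_accuracy(tp: int, fp: int, fn: int, tn: int) -> tuple:
    """Standard definitions from true/false positive/negative counts."""
    precision = tp / (tp + fp)  # detected complaints that are real complaints
    recall = tp / (tp + fn)     # real complaints that were detected
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Figures reported on the slide: precision 92%, recall 75%.
    print(round(f1(0.92, 0.75), 3))
```

The harmonic mean is pulled toward the lower of the two values, which is why improving recall is the stated priority.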
![Page 23: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/23.jpg)
Lessons Learned
• Lessons Learned
  – There is still quite a big gap between customer expectations and reality. Need to explain, and keep the customer involved in the process of assessment and knowledge/domain vocabulary acquisition.
  – Need to acquire a lot of lexicons => named entities/synonyms/stopwords/senti-words
  – The quality and quantity of these lexicons are a real asset for the company. Acquiring lexicons requires workbenches for efficient semi-supervised methods (manually filtering the output of automatic methods) to reduce costs.
  – Tuning classifier parameters, feature extraction, linguistic knowledge, etc. is time- and expertise-consuming.
  – Simple academic methods work quite well (even if they need a lot of tuning).
  – Beyond a simple search engine, the quality of NLP components is becoming more and more important, especially for Sentiment Analysis.
![Page 24: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/24.jpg)
Lessons Learned
• Lessons Learned
  – Customers are more and more interested in “Big Data”, “Listening Platform”, “Cloud”, “Social Network/Intelligence”…
  – More and more customers want to get data/opinions from outside their in-house systems (blogs, communities (BBS), tweets, etc.). Typical questions:
    => How many crawlers are needed to crawl all Korean tweets/blogs?
    => How about crawling Facebook?
  – How to identify “anti communities” (like “Anti-Samsung”)? Who are the power bloggers?
    => The solutions required go far beyond Sentiment Analysis.
    => But often customers can’t afford/don’t want crawling infrastructure and maintenance fees.
    => New opportunities to deliver software in forms other than traditional package sales: SaaS/PaaS (Software/Platform/Infrastructure as a Service).
    => Even in the enterprise, a distributed framework is required (not only for web-scale services).
  – Customers (at least in Korea) love knowing the technology and are increasingly high-level users. They buy not only solutions but also consulting/expertise.
  – Projects are more and more expensive, and many require either benchmarks or a POC.
![Page 25: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/25.jpg)
Future Work & Plan
• Future Work (On-going)
  – Acquire more entries in the sentiment dictionary
  – Make a framework for handling linguistic rules and statistical models (SVM/Rocchio)
  – Coupling with antonyms and/or hints
  – Better handling of negation
  – Better workbench for faster acquisition / (re-)training
  – Co-reference resolution
  – (Full/Semi) parsing?
  – More complex models than binary classification?
  – Building/maintaining a platform for PaaS/SaaS
A long, long way to go…
![Page 26: VOC real world enterprise needs](https://reader033.vdocuments.mx/reader033/viewer/2022061211/5491bd6eac795963288b45e4/html5/thumbnails/26.jpg)
Questions?
Thank you.