Post on 19-Dec-2015
Introduction Product Facet Identification
Subtopic Summarization Discussion and Conclusion
Product Review Summarization from a Deeper Perspective
Ly Duy Khang
Supervisor: A/P KAN Min Yen
Ly Duy Khang CS4101 B.COMP. DISSERTATION
Outline
1. Introduction: Motivation, Related work, Problem statement & Our approach
2. Product Facet Identification: Preliminaries, Methodology, Evaluation, Improvement
3. Subtopic Summarization: Preliminary, Methodology, Evaluation
4. Discussion and Conclusion
Product review
A medium commonly provided by online merchants for customers to review and express opinions on the products that they have purchased.
Product reviews are an important source of information:
1. More and more people are shopping online as a result of the expansion of e-commerce.
2. Reviews enable customers to find opinions about products easily and to share them with their peers.
3. Reviews give producers a certain degree of feedback.
Problems
1. The number of reviews is often too large, and is still growing rapidly.
2. It is difficult to locate and capture opinions effectively.
Product review summarization system
1. Automatically process a large collection of reviews.
2. Identify topics and opinions in the reviews.
3. Aggregate all information and present a concise summary to the user.
Summarization
The task of extracting and presenting the most important information from the inputs.
• News headline
• Program agenda
• Scientific paper abstract
• …
Review Summarization
Focus on opinions (techniques from Sentiment Analysis):
• Thumbs-up/Thumbs-down indication: [Turney02]
• Facet-based summary: [Hu04a],[Hu04b],[Popescu05]
• Comparative summary: [Hu05]
Product Facet examples:
1. Camera: “battery life”, “lens”, “flash”, “resolution”, etc.
2. Music player: “sound”, “weight”, “size”, “storage”, etc.
[Figure: screenshots of Google Product and Bing Shopping]
Problem statement
Produce a facet-based summary of product reviews that captures
• Opinions of users.
• Evidence that supports those opinions.
Approach and Contribution
Two main components:
1. Product Facet Identification
• Re-implement the baseline from [Hu04a]
• Contribute a new, effective heuristic to improve the accuracy
2. Subtopic Summarization
• Initiate a sentence clustering solution
• Make the necessary modifications to the sentence semantic similarity measurement (adopted from [Li06] and [Kong07])
Why do we want to automate this task?
1. It is hard or even impossible to obtain a complete list of facets.
• e.g., iPhone’s alarm function
2. Users and manufacturers/sellers use different sets of words to describe the same facet.
• e.g., Price vs. Value; Body vs. Case
3. The manufacturer may not want to include the weak facets of its product.
• e.g., iPhone is unable to play Flash on the Web
Explicit/Implicit product facet
Product facets can be expressed explicitly or implicitly.
1. The pictures of this camera are very clear. (explicit: “pictures”)
2. The camera fits nicely into my palm. (implicit: size)
We consider only explicit facets, which appear as nouns or noun phrases in the sentence.
Architecture Overview
a/ Preprocessing
1. Process each input sentence with a Part-of-Speech (POS) tagger to obtain the POS label for each word.
2. Remove stop words from the result.
3. Stem each word to obtain its root form.
4. Only nouns/noun phrases are fed to the next module.
b/ Frequent Mining
Identify all frequent nouns/noun phrases that satisfy the minimum support, defined as the minimum number of sentences that must contain the noun/noun phrase.
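The minimum-support filter can be sketched as a simple per-sentence count. This is only an illustration: the function name `frequent_candidates`, the toy sentences, and the threshold of 2 are assumptions, and the counting below is not necessarily the mining algorithm used in the dissertation.

```python
from collections import Counter

def frequent_candidates(sentences, min_support):
    """Return candidates that occur in at least `min_support` sentences."""
    counts = Counter()
    for candidates in sentences:
        # Count each candidate once per sentence, however often it repeats.
        counts.update(set(candidates))
    return {c for c, n in counts.items() if n >= min_support}

# Hypothetical preprocessed sentences (stemmed nouns / noun phrases only).
sentences = [
    ["battery life", "camera"],
    ["battery life", "lens"],
    ["lens", "camera", "camera"],
    ["strap"],
]
frequent = frequent_candidates(sentences, min_support=2)
print(sorted(frequent))
```

Here "strap" appears in only one sentence and is dropped, while the other three candidates meet the threshold.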
c/ Post Processing (1/2)
1. Usefulness pruning: Remove single-word facets that are likely to be meaningless.
• e.g., “life” → “battery life”
2. Compactness pruning: Remove facet phrases that are not compact.
• e.g., “sample photo” → “photo”
c/ Post Processing (2/2)
3. Infrequent facet discovery: helps discover genuine facets that are not mentioned often.
• Gather opinion words that modify frequent facets.
• For each sentence that contains no frequent facet but one or more opinion words, include the nearest noun/noun phrase as a facet.
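The infrequent facet step can be sketched as follows, assuming sentences arrive as lists of (token, is_noun) pairs and that "nearest" means the smallest token distance to the opinion word. The function name `infrequent_facets`, the input format, and the sample data are hypothetical and chosen only for illustration.

```python
def infrequent_facets(tagged_sentences, frequent, opinion_words):
    """Collect the noun nearest to each opinion word in sentences
    that mention no frequent facet."""
    found = set()
    for sent in tagged_sentences:
        tokens = [tok for tok, _ in sent]
        if any(tok in frequent for tok in tokens):
            continue  # already covered by a frequent facet
        noun_positions = [i for i, (_, is_noun) in enumerate(sent) if is_noun]
        if not noun_positions:
            continue  # nothing to attach the opinion to
        for i, tok in enumerate(tokens):
            if tok in opinion_words:
                nearest = min(noun_positions, key=lambda j: abs(j - i))
                found.add(tokens[nearest])
    return found

# Hypothetical data: "picture" is a frequent facet, "amazing" an opinion word.
frequent = {"picture"}
opinion_words = {"amazing"}
tagged = [
    [("picture", True), ("is", False), ("amazing", False)],
    [("the", False), ("software", True), ("is", False), ("amazing", False)],
    [("i", False), ("bought", False), ("it", False)],
]
extra = infrequent_facets(tagged, frequent, opinion_words)
print(extra)
```

Only the second sentence qualifies: it has no frequent facet, contains an opinion word, and "software" is the closest noun.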
d/ Sentence Extraction
• Sentences that contain any of the discovered product facets are labeled with the corresponding facet.
• Only opinionated sentences are sent down to the next component.
a/ Experimental Data
From the same dataset as in [Hu04a]:
• 1 Digital Camera (45 reviews)
• 1 DVD Player (99 reviews)
• 1 Cell phone (41 reviews)
b/ Evaluation Measure
c/ Experimental Result (Baseline)
          No. of facets   Recall   Precision   F
Camera    79              0.822    0.747       0.783
Phone     67              0.761    0.718       0.739
DVD       49              0.797    0.793       0.795
Avg.      65              0.793    0.753       0.772
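The formula on the evaluation-measure slide did not survive extraction. Assuming F is the usual harmonic mean of precision and recall (an assumption, though one that agrees with the reported values to three decimals), the baseline F column can be recomputed as a quick check. The function name `f_measure` is illustrative.

```python
def f_measure(recall, precision):
    """Harmonic mean of precision and recall (assumed definition of F)."""
    return 2 * precision * recall / (precision + recall)

# (recall, precision) pairs copied from the baseline table.
baseline = {
    "Camera": (0.822, 0.747),
    "Phone": (0.761, 0.718),
    "DVD": (0.797, 0.793),
}
f_scores = {name: round(f_measure(r, p), 3) for name, (r, p) in baseline.items()}
print(f_scores)
```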
Improvement - Syntactic Role (1/2)
Many noisy results such as “light”, “hand”, “time”, “month”, “hour”, etc.
• Filtered by considering each word’s syntactic role in the sentence.
Improvement - Syntactic Role (2/2)
During the preprocessing step, nouns/noun phrases that do not appear as a subject or object in the sentence are not passed down to the next module.
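A minimal sketch of this filter, assuming a parser has already attached a syntactic role to each candidate. The role labels, the function name `filter_by_role`, and the decision to treat a prepositional object as neither subject nor object are assumptions made for illustration; the slides do not say which parser or label set was used.

```python
# Roles that let a candidate through (hypothetical label names).
ALLOWED_ROLES = {"subject", "object"}

def filter_by_role(candidates):
    """candidates: list of (noun_phrase, syntactic_role) pairs."""
    return [phrase for phrase, role in candidates if role in ALLOWED_ROLES]

# Hypothetical parser output for "The lens feels solid in my hand."
candidates = [("lens", "subject"), ("hand", "prepositional_object")]
kept = filter_by_role(candidates)
print(kept)
```

Under these assumptions a noisy candidate such as "hand" is discarded before frequent mining, while "lens" is kept.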
Experimental Result (Baseline with Syntactic Role)
          Recall               Precision            F-measure
          Baseline  Improve    Baseline  Improve    Baseline  Improve
Camera    0.822     0.822      0.747     0.802      0.783     0.812
Phone     0.761     0.761      0.718     0.785      0.739     0.773
DVD       0.797     0.797      0.793     0.867      0.795     0.831
Avg.      0.793     0.793      0.753     0.818      0.772     0.805
Change              +0%                  +8.6%                +4.3%
How often do subtopics exist?

Camera     Subtopics
Memory     3
LCD        6
Lens       7
…          …
Average    5.125

Phone      Subtopics
Radio      3
Headset    4
Signal     3
…          …
Average    3.5

DVD        Subtopics
Price      1
Remote     4
Format     1
…          …
Average    2
Architecture Overview
a/ Preprocessing
1. General Entity pruning
• Product class name: “camera”, “DVD”, “phone”, etc.
• Brand name: “Nikon”, “Canon”, “iPod”, “Kingston”, etc.
2. Similarity pruning ([Kong07])
• “picture” vs. “image”, “photo”
• “display” vs. “monitor”
• “Megapixel” vs. “Resolution”
b/ Sentence representation & Semantic similarity measurement (1/2)
Adopted from the work of [Li06], a scalable vector formulation is used to represent each sentence; the semantic similarity of two sentences is then measured as the cosine similarity between their vectors.
b/ Sentence representation & Semantic similarity measurement (2/2)
S1 = The battery of my camera is very impressive.
S2 = This camera always has a long battery life.
Joint Concept Vector:
C  = {battery, camera, impressive, long, battery life}
V1 = { 1.0 , 1.0 , 1.0 , 0.25, 0.5 }
V2 = { 0.5 , 1.0 , 0.25, 1.0 , 1.0 }
$\mathrm{sim}(S_1, S_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert} = \frac{2.5}{3.3125} \approx 0.75$
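The similarity value in this example can be reproduced with a plain cosine computation over the two vectors. The sketch below assumes the vector entries are given as on the slide; how each entry is derived (the word-level similarity of [Li06]) is not reproduced here, and the function name `cosine_similarity` is illustrative.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Vectors over C = {battery, camera, impressive, long, battery life}.
v1 = [1.0, 1.0, 1.0, 0.25, 0.5]
v2 = [0.5, 1.0, 0.25, 1.0, 1.0]
sim = cosine_similarity(v1, v2)
print(round(sim, 2))  # 0.75
```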
c/ Sentence clustering (1/2)
1. Hierarchical clustering:
2. Non-hierarchical clustering:
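The details of both clustering variants were lost with the slide graphics. As an illustration of the hierarchical variant only, the sketch below performs bottom-up merging on a pairwise sentence similarity matrix. The average-link criterion, the similarity threshold used as a stopping rule, and the names `average_link` and `agglomerative` are all assumptions, not the dissertation's stated configuration.

```python
def average_link(c1, c2, sim):
    """Mean pairwise similarity between two clusters of sentence indices."""
    return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

def agglomerative(sim, threshold):
    """Merge the most similar pair of clusters until none reaches `threshold`."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = average_link(clusters[a], clusters[b], sim)
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break  # remaining clusters are too dissimilar to merge
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical similarity matrix for four sentences.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
clusters = agglomerative(sim, threshold=0.5)
print(clusters)
```

With this toy matrix, sentences 0 and 1 merge first, then 2 and 3, and the two resulting groups stay apart because their average similarity (0.2) is below the threshold.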
c/ Sentence clustering (2/2)
To estimate the number of clusters, we adopt the graph-based algorithm proposed in [Hat01].
d/ Compact presentation
1. Sentences are now grouped into subtopics.
2. Determine the orientation of every sentence in the cluster.
3. For each positive/negative partition P, select the sentence with the maximum representative power to display.
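The slides do not define "representative power". One plausible reading, used here purely as an assumption, is the total similarity of a sentence to the other sentences in its partition. The function name `most_representative` and the toy matrix are hypothetical.

```python
def most_representative(partition, sim):
    """Return the index in `partition` with the highest summed similarity
    to the other members (an assumed notion of representative power)."""
    def power(i):
        return sum(sim[i][j] for j in partition if j != i)
    return max(partition, key=power)

# Hypothetical similarities among three positive sentences of one subtopic.
sim = [
    [1.0, 0.6, 0.7],
    [0.6, 1.0, 0.2],
    [0.7, 0.2, 1.0],
]
best = most_representative([0, 1, 2], sim)
print(best)
```

Sentence 0 is selected because it is the most similar to the rest of the partition (0.6 + 0.7).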
a/ Experimental Data
From the same dataset used in the previous component, we extract a subset of the high-frequency facets of each product.
• Camera: 8 facets
• Phone: 8 facets
• DVD: 6 facets
Experiment Results – Number of subtopics (average)
          Manual subtopics   SenSim ([Li06])   SenSim (+ADJ)
Camera    5.125              1.875             3.0
Phone     3.5                1.5               2.5
DVD       2                  1.167             1.5
b/ Evaluation Measure
1. Purity: rewards the clustering solution that introduces less noise in each cluster.
2. Inverse Purity: rewards the clustering solution that gathers more elements (of the same cluster in the gold standard) into a corresponding cluster.
3. F-measure: the harmonic mean of purity and inverse purity (α = 0.5).
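The formulas for these measures were images and are missing from the transcript. The sketch below uses the commonly cited definitions, which is an assumption about what the slides showed: purity sums, over system clusters, the size of the best overlap with any gold cluster and divides by the number of items; inverse purity swaps the two roles; F is the weighted harmonic mean with weight α. Function names and the toy clustering are illustrative.

```python
def purity(clusters, gold):
    """clusters, gold: lists of sets partitioning the same items."""
    n = sum(len(c) for c in clusters)
    return sum(max(len(c & g) for g in gold) for c in clusters) / n

def inverse_purity(clusters, gold):
    """Purity with the roles of system and gold clusters swapped."""
    return purity(gold, clusters)

def f_measure(p, ip, alpha=0.5):
    """Weighted harmonic mean of purity and inverse purity."""
    return 1.0 / (alpha / p + (1 - alpha) / ip)

# Hypothetical system output vs. gold-standard subtopics for five sentences.
clusters = [{1, 2, 3, 4}, {5}]
gold = [{1, 2}, {3, 4}, {5}]
p = purity(clusters, gold)
ip = inverse_purity(clusters, gold)
f = f_measure(p, ip)
print(p, ip, f)
```

The over-merged first cluster lowers purity to 0.6 while inverse purity stays at 1.0, which is the trade-off the two measures are meant to expose.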
c/ Experiment Results – Performance using SenSim (+ADJ)
The number in parentheses after each product is its average number of manual subtopics; percentages give the change relative to Random.
                  Random (200)                Hierarchical                  Non-hierarchical (200)
                  Purity  I-Purity  F(0.5)    Purity   I-Purity  F(0.5)     Purity   I-Purity  F(0.5)
Camera (5.125)    0.524   0.617     0.542     0.676    0.819     0.725      0.714    0.828     0.753
                                              +29.02%  +32.63%   +33.80%    +36.21%  +34.13%   +38.89%
Phone (3.5)       0.647   0.593     0.604     0.682    0.783     0.717      0.702    0.739     0.707
                                              +5.54%   +32.00%   +18.74%    +8.63%   +24.64%   +17.16%
DVD (2)           0.825   0.622     0.682     0.904    0.795     0.837      0.894    0.743     0.791
                                              +9.60%   +27.72%   +22.73%    +8.33%   +19.34%   +15.94%
Limitation and Future work
1. We did not conduct a human evaluation of the effectiveness of the newly proposed summary compared to current ones.
2. Automatic sentiment analysis module integration.
3. Better sentence semantic similarity measurement with deep analysis.
4. Implicit facet handling.
5. Sentence reformulation for summary output.
6. Extend subtopics to other review summarization settings.
Conclusion
1. We designed a complete summarization system targeting the domain of product reviews.
2. We introduced an effective heuristic rule using syntactic role to improve the process of identifying product facets.
3. We showed the existence of subtopics within the discussion of product facets and addressed this limitation of current summarization systems with our proposed clustering component.
4. We extended the sentence semantic similarity measurement with sentiment information.
References
[Barzilay02] Barzilay, R., Elhadad, N., & McKeown, K. (2002). Inferring strategies for sentence ordering in multidocument news summarization. Journal of Artificial Intelligence Research, 17, 35–55.
[Car98b] Carbonell, J., & Goldstein, J. (1998). The use of MMR, Diversity-based Re-ranking for Reordering Documents and Producing Summaries. Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, 335–336.
[Ding08] Ding, X., Liu, B., & Yu, P. S. (2008). A Holistic Lexicon-based Approach to Opinion Mining. Proceedings of the international conference on Web search and web data mining (WSDM).
[Hat01] Hatzivassiloglou, V., Klavans, J. L., Holcombe, M. L., Barzilay, R., Kan, M.-Y., & McKeown, K. R. (2001). SimFinder: A flexible clustering tool for summarization. In Proceedings of the NAACL Workshop on Automatic Summarization, 41–49.
[Hat97] Hatzivassiloglou, V., & McKeown, K. R. (1997). Predicting the Semantic Orientation of Adjectives. Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, 174–181.
[Hovy01] Hovy, E. H. (2001). Automated text summarization. Handbook of computational linguistics. Oxford University Press, Oxford.
[Knight00] Knight, K., & Marcu, D. (2000). Statistics-based summarization-step one: Sentence compression. Proceedings of the National Conference on Artificial Intelligence, 703–710.
[Barzilay99] Barzilay, R., McKeown, K. R., & Elhadad, M. (1999). Information fusion in the context of multi-document summarization. Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, 550–557.
[Hu04b] Hu, M., & Liu, B. (2004b). Mining Opinion Features in Customer Reviews. Proceedings of the National Conference on Artificial Intelligence, 755–760.
[Hu05] Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: Analyzing and comparing opinions on the web. Proceedings of the 14th international conference on World Wide Web.
[Kim06] Kim, S. M., & Hovy, E. (2006). Extracting Opinions, Opinion Holders, and Topics Expressed in Online News Media Text. Computational Linguistics.
[Li06] Li, Y., McLean, D., Bandar, Z. A., O'Shea, J. D., & Crockett, K. (2006). Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE Trans. on Knowledge and Data Engineering, 18(8), 1138–1150.
[Liu09] Liu, B. (2009). Sentiment Analysis and Subjectivity. Handbook of Natural Language Processing, 1–38.
[Popescu05] Popescu, A. M., & Etzioni, O. (2005). Extracting Product Features and Opinions from Reviews. Computational Linguistics, 339–346.
[Radev04] Radev, D., Jing, H., Styś, M., & Tam, D. (2004). Centroid-based summarization of multiple documents. Information Processing and Management, 40(6), 919–938.
[Turney02] Turney, P., & Littman, M. (2002). Unsupervised Learning of Semantic Orientation From a Hundred-Billion-Word Corpus.
[Wiebe99] Wiebe, J. M., Bruce, R. F., & O'Hara, T. P. (1999). Development and Use of a Gold-standard Data Set for Subjectivity Classifications. Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, 246–253.
[Ye05] Ye, S., Qiu, L., Chua, T., & Kan, M. Y. (2005). NUS at DUC 2005: Understanding Documents via Concept Links. Document Understanding Conference (DUC).
[Yu03] Yu, H., & Hatzivassiloglou, V. (2003). Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying the Polarity of Opinion Sentences. Proceedings of the conference on Empirical methods in natural language processing, 129–136.
Q & A
Thank you for your attention