
Page 1: WIMS 2014, Thessaloniki, June 2014

WIMS 2014, Thessaloniki, June 2014

A soft frequent pattern mining approach for textual topic detection

Georgios Petkos, Symeon Papadopoulos, Yiannis Kompatsiaris (CERTH-ITI)

Luca Aiello (Yahoo Labs)

Ryan Skraba (Alcatel-Lucent Bell Labs)

Page 2: WIMS 2014, Thessaloniki, June 2014


Overview

• Motivation: classes of topic detection methods, degree of co-occurrence patterns considered.

• Beyond pairwise co-occurrence analysis:
  – Frequent Pattern Mining.
  – Soft Frequent Pattern Mining.

• Experimental evaluation.
• Conclusions and future work.

Page 3: WIMS 2014, Thessaloniki, June 2014


Classes of textual topic detection methods

• Probabilistic topic models:
  – Learn the joint probability distribution of topics and terms and perform inference on it (e.g. LDA).

• Document-pivot methods:
  – Cluster documents together; each group of documents is a topic (e.g. incremental clustering of tf-idf vectors by cosine similarity; see the sketch after this list).

• Feature-pivot methods:
  – Cluster terms together based on their co-occurrence patterns; each group of terms is a topic (e.g. the graph-based feature-pivot method).
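To make the document-pivot class concrete, here is a minimal sketch of incremental clustering over tf-idf vectors. The toy corpus and the 0.5 similarity threshold are illustrative assumptions, not settings from the paper.

```python
# A minimal sketch of document-pivot topic detection: incremental
# clustering of tf-idf vectors by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "romney wins virginia",
    "romney thanks virginia on tv",
    "obama wins vermont",
]

tfidf = TfidfVectorizer().fit_transform(docs)

clusters = []  # each cluster is a list of document indices (one topic)
for i in range(tfidf.shape[0]):
    best, best_sim = None, 0.5  # similarity threshold (assumed)
    for c, members in enumerate(clusters):
        # Compare against the first document of each cluster for simplicity.
        sim = cosine_similarity(tfidf[i], tfidf[members[0]])[0, 0]
        if sim > best_sim:
            best, best_sim = c, sim
    if best is None:
        clusters.append([i])      # start a new topic
    else:
        clusters[best].append(i)  # attach to the closest existing topic

print(clusters)
```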

Page 4: WIMS 2014, Thessaloniki, June 2014


Feature-pivot methods: degree of co-occurrence (1/2)

• We focus on feature-pivot methods and examine the effect of the “degree” of examined co-occurrence patterns on the term clustering procedure.

• Let us consider the following topics:
  – Romney wins Virginia
  – Romney appears on TV and thanks Virginia
  – Obama wins Vermont
  – Romney congratulates Obama

• Key terms (e.g. Romney, Obama, Virginia, Vermont, wins) co-occur with more than one other key term.

Page 5: WIMS 2014, Thessaloniki, June 2014


Feature-pivot methods: degree of co-occurrence (2/2)

• Previous approaches typically examined only “pairwise” co-occurrence patterns.

• In the case of closely related topics, such as the above, examining only pairwise co-occurrence patterns may lead to mixed topics.

• We propose to examine the simultaneous co-occurrence patterns of a larger number of terms.

Page 6: WIMS 2014, Thessaloniki, June 2014


Beyond pairwise co-occurrence analysis: FPM

• Frequent Pattern Mining.

• A variety of algorithms (e.g. Apriori, FP-Growth) can be used to find groups of items that frequently co-occur.

• Not a new approach for textual topic detection.
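As a concrete illustration of the idea, here is a minimal sketch of level-wise frequent itemset mining in the spirit of Apriori, applied to tokenized documents. The toy corpus and min_support value are illustrative assumptions; at scale one would use FP-Growth instead.

```python
# Level-wise frequent itemset mining over tokenized documents.
docs = [
    {"romney", "wins", "virginia"},
    {"romney", "thanks", "virginia"},
    {"obama", "wins", "vermont"},
    {"romney", "congratulates", "obama"},
]
min_support = 2  # minimum number of documents an itemset must appear in

def support(itemset):
    return sum(1 for d in docs if itemset <= d)

# Level 1: frequent single terms.
items = {t for d in docs for t in d}
levels = [{frozenset([t]) for t in items if support(frozenset([t])) >= min_support}]

# Level k: join frequent (k-1)-itemsets and keep candidates that stay frequent.
while levels[-1]:
    k = len(next(iter(levels[-1]))) + 1
    candidates = {a | b for a in levels[-1] for b in levels[-1] if len(a | b) == k}
    levels.append({c for c in candidates if support(c) >= min_support})

for level in levels:
    for itemset in level:
        print(sorted(itemset), support(itemset))
```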

Page 7: WIMS 2014, Thessaloniki, June 2014


Beyond pairwise co-occurrence analysis: SFPM

• Frequent Pattern Mining is strict in that it looks for sets of terms, all of which must frequently co-occur at the same time. It may therefore only surface topics with a very small number of terms, i.e. very coarse topics.

• Can we formulate an algorithm that lies between the two ends of the spectrum, i.e. looks at co-occurrence patterns with degree higher than two but is not as strict as a typical FPM algorithm?

Page 8: WIMS 2014, Thessaloniki, June 2014


SFPM

1. Term selection.

2. Co-occurrence vector formation.

3. Post-processing.

Page 9: WIMS 2014, Thessaloniki, June 2014


SFPM–Step 1: Term selection

• Select the top K terms that will enter the clustering procedure.

• There are different options for doing this. For instance:
  – Select the most frequent terms.
  – Select the terms that exhibit the most “bursty” behaviour (if we are considering temporal processing).

• In our experiments we select the terms that are most “unusually frequent”; a hedged sketch of such a scoring follows.
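The ranking formula on the slide did not survive the transcript. As one plausible reading, “unusually frequent” terms can be scored by comparing each term's frequency in the current batch against a reference corpus; the ratio and add-one smoothing below are assumptions for illustration, not necessarily the paper's exact formula.

```python
# A hedged sketch of "unusually frequent" term selection: score each term
# by the ratio of its frequency in the current batch to its (smoothed)
# frequency in a reference corpus. top_k is an assumed parameter.
from collections import Counter

def top_unusual_terms(current_docs, reference_docs, top_k=10):
    cur = Counter(t for d in current_docs for t in d.split())
    ref = Counter(t for d in reference_docs for t in d.split())
    n_cur = sum(cur.values())
    n_ref = sum(ref.values())
    def score(term):
        p_cur = cur[term] / n_cur
        p_ref = (ref[term] + 1) / (n_ref + len(ref))  # add-one smoothing
        return p_cur / p_ref
    return sorted(cur, key=score, reverse=True)[:top_k]

print(top_unusual_terms(
    ["romney wins virginia", "obama wins vermont"],
    ["the weather is nice", "stocks rose today"],
))
```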

Page 10: WIMS 2014, Thessaloniki, June 2014


SFPM–Step 2: Co-occurrence vector formation (1/4)

• The heart of the SFPM approach.
• Notation:
  – n: the number of documents in the collection.
  – S: a set of terms, representing a topic.
  – D_S: a vector of length n; its i-th element indicates how many of the terms in S co-occur in the i-th document.
  – D_t: a binary vector of length n; its i-th element indicates whether the term t occurs in the i-th document.

• The vector D_t of a term t that frequently co-occurs with the terms in S will have high cosine similarity with D_S.

• Idea of the algorithm: greedily expand S with the term t whose D_t best matches D_S.
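As a concrete illustration of this greedy expansion, here is a minimal numpy sketch. The toy corpus and the fixed 0.5 threshold are assumptions; the actual method adapts the threshold to |S|, as described on the following slides.

```python
# A minimal sketch of the greedy set-expansion at the core of SFPM.
import numpy as np

docs = [
    {"romney", "wins", "virginia"},
    {"romney", "thanks", "virginia"},
    {"obama", "wins", "vermont"},
    {"romney", "congratulates", "obama"},
]
terms = sorted({t for d in docs for t in d})
# D_t: binary occurrence vector of length n for each term t.
D = {t: np.array([1.0 if t in d else 0.0 for d in docs]) for t in terms}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def expand(seed, threshold=0.5):
    S = {seed}
    D_S = D[seed].copy()  # co-occurrence vector of the current set
    while True:
        # Pick the term outside S whose D_t best matches D_S.
        best = max((t for t in terms if t not in S),
                   key=lambda t: cosine(D[t], D_S), default=None)
        if best is None or cosine(D[best], D_S) < threshold:
            return S
        S.add(best)
        D_S += D[best]

print(expand("romney"))
```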

Page 11: WIMS 2014, Thessaloniki, June 2014


SFPM–Step 2: Co-occurrence vector formation (2/4)

Page 12: WIMS 2014, Thessaloniki, June 2014


SFPM–Step 2: Co-occurrence vector formation (3/4)

• We need a stopping criterion for the expansion procedure.
• If it is not properly set, the set of terms may end up being too small (i.e. the topic may be too coarse) or may end up being a mixture of topics.

• We use a threshold on the cosine similarity of D_S and D_t, and we adapt the threshold dynamically to the size of S. In particular, we use a sigmoid function of |S|:
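The formula itself did not survive the transcript (it appeared as an image on the slide). A plausible form consistent with the description, i.e. a sigmoid threshold that rises as |S| grows so that larger sets are harder to expand, is the following, where b and c are assumed tuning parameters rather than values from the paper:

```latex
\theta(|S|) = \frac{1}{1 + e^{-(|S| - b)/c}}
```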

Page 13: WIMS 2014, Thessaloniki, June 2014


SFPM–Step 2: Co-occurrence vector formation (4/4)

• We run the expansion procedure many times, each time starting from a different term.

• Additionally, to avoid having less important terms dominate D_S and the cosine similarity, at the end of each expansion step we zero out the entries of D_S that have a value smaller than |S|/2.
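In the numpy notation of the earlier expansion sketch, this pruning step is a one-liner; the toy values of S and D_S below are assumptions for illustration.

```python
import numpy as np

# Prune entries of D_S supported by fewer than |S|/2 of the set's terms,
# so documents containing only a couple of the terms stop contributing.
S = {"romney", "wins", "virginia"}
D_S = np.array([3.0, 2.0, 1.0, 1.0])  # toy co-occurrence counts per document
D_S[D_S < len(S) / 2] = 0
print(D_S)  # [3. 2. 0. 0.]
```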

Page 14: WIMS 2014, Thessaloniki, June 2014


SFPM–Step 3: Post-processing

• Because we run the expansion procedure many times, each time starting from a different term, we may end up with a large number of duplicate topics.

• In the final step of the algorithm, we filter out duplicate topics by considering the Jaccard similarity between the sets of keywords of the produced topics.
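A minimal sketch of this duplicate filtering follows; the 0.5 Jaccard threshold is an illustrative assumption.

```python
# Keep a topic only if it is not too similar to an already-kept one.
def jaccard(a, b):
    return len(a & b) / len(a | b)

def deduplicate(topics, threshold=0.5):
    kept = []
    for topic in topics:
        if all(jaccard(topic, k) < threshold for k in kept):
            kept.append(topic)
    return kept

print(deduplicate([{"romney", "wins", "virginia"},
                   {"wins", "virginia", "romney"},
                   {"obama", "wins", "vermont"}]))
```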

Page 15: WIMS 2014, Thessaloniki, June 2014


SFPM – Overview

Page 16: WIMS 2014, Thessaloniki, June 2014


Evaluation – Datasets and evaluation metrics

• Three datasets collected from Twitter:
  – Super Tuesday: 474,109 documents.
  – F.A. Cup final: 148,652 documents.
  – U.S. presidential elections: 1,247,483 documents.

• For each of them a number of ground-truth topics were determined by examining the relevant stories that appeared in the mainstream media.

• Each topic is represented by a set of mandatory terms, a set of forbidden terms (so that we make sure that closely related topics are not merged) and a set of optional terms.

• We evaluate:
  – Topic recall.
  – Keyword recall.
  – Keyword precision.
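A hedged sketch of the matching rule described above: a detected topic matches a ground-truth topic if it contains all mandatory terms and none of the forbidden ones. The keyword metrics are then computed over the topics' term sets; the exact details follow the paper.

```python
# Ground-truth matching and topic recall, under the rule stated above.
def matches(detected, mandatory, forbidden):
    return mandatory <= detected and not (forbidden & detected)

def topic_recall(detected_topics, targets):
    hits = sum(1 for mand, forb in targets
               if any(matches(d, mand, forb) for d in detected_topics))
    return hits / len(targets)

targets = [({"romney", "virginia"}, {"obama"})]  # (mandatory, forbidden)
print(topic_recall([{"romney", "wins", "virginia"}], targets))  # 1.0
```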

Page 17: WIMS 2014, Thessaloniki, June 2014


Evaluation – Competing methods

• A classic probabilistic method: LDA.

• A graph-based feature-pivot approach that examines only pairwise co-occurrence patterns.

• FPM using the FP-Growth algorithm.

Page 18: WIMS 2014, Thessaloniki, June 2014


Evaluation – Results: topic recall

• Evaluated for different numbers of returned topics.

• SFPM achieves the highest topic recall in all three datasets.

Page 19: WIMS 2014, Thessaloniki, June 2014


Evaluation – Results: keyword recall

• SFPM achieves the highest keyword recall in all three datasets.

• SFPM not only retrieves more target topics than the other methods, but also provides a quite complete representation of those topics.

Page 20: WIMS 2014, Thessaloniki, June 2014


Evaluation – Results: keyword precision

• SFPM nevertheless achieves a somewhat lower keyword precision, indicating that some spurious keywords are also included in the topics.

• FPM, being the strictest method, achieves the highest keyword precision.

Page 21: WIMS 2014, Thessaloniki, June 2014


Evaluation – Example topics produced

Page 22: WIMS 2014, Thessaloniki, June 2014


Conclusions

• Started from the observation that in order to detect closely related topics, a feature-pivot topic detection method should examine co-occurrence patterns of degree larger than 2.

• We have presented an approach, SFPM, that does this, albeit in a soft and controllable manner. It is based on a greedy set expansion procedure.

• We have experimentally shown that the proposed approach may indeed improve performance when dealing with corpora containing closely inter-related topics.

Page 23: WIMS 2014, Thessaloniki, June 2014


Future work

• Experiment with different types of documents.

• Consider the problem of synonyms.

• Examine alternative, more efficient search strategies; e.g. index the D_t vectors (using, for instance, LSH) in order to rapidly retrieve the best-matching term for a set S.
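As a hedged sketch of what such an index could look like, random-hyperplane LSH approximates cosine similarity: terms whose D_t vectors point in similar directions tend to share hash codes, so candidate matches for D_S can be fetched from one bucket instead of scanning all terms. The number of hyperplanes is an assumed parameter.

```python
# Random-hyperplane LSH for cosine similarity over occurrence vectors.
import numpy as np

rng = np.random.default_rng(0)

def signature(v, planes):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple((planes @ v) > 0)

def build_index(D, n_docs, n_planes=8):
    planes = rng.standard_normal((n_planes, n_docs))
    buckets = {}
    for term, vec in D.items():
        buckets.setdefault(signature(vec, planes), []).append(term)
    return planes, buckets

# Usage with D and docs from the earlier expansion sketch:
# planes, buckets = build_index(D, n_docs=len(docs))
# candidates = buckets.get(signature(D_S, planes), [])
```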

Page 24: WIMS 2014, Thessaloniki, June 2014


Thank you!

• Open source implementation (including a set of other topic detection methods) available at:

https://github.com/socialsensor/topic-detection

• Dataset and evaluation resources available at:
http://www.socialsensor.eu/results/datasets/72-twitter-tdt-dataset

• Relevant topic detection dataset on which SFPM will be tested:

http://figshare.com/articles/SNOW_2014_Data_Challenge/1003755

Questions, comments, suggestions?