
Page 1: WIMS 2014, Thessaloniki, June 2014

WIMS 2014, Thessaloniki, June 2014

A soft frequent pattern mining approach for textual topic detection

Georgios Petkos, Symeon Papadopoulos, Yiannis Kompatsiaris (CERTH-ITI)

Luca Aiello (Yahoo Labs)

Ryan Skraba (Alcatel-Lucent Bell Labs)

Page 2: WIMS 2014, Thessaloniki, June 2014


Overview

• Motivation: classes of topic detection methods, degree of co-occurrence patterns considered.

• Beyond pairwise co-occurrence analysis:
  – Frequent Pattern Mining.
  – Soft Frequent Pattern Mining.

• Experimental evaluation.
• Conclusions and future work.

Page 3: WIMS 2014, Thessaloniki, June 2014


Classes of textual topic detection methods

• Probabilistic topic models:
  – Learn the joint probability distribution of topics and terms and perform inference on it (e.g. LDA).

• Document-pivot methods:
  – Cluster documents together; each group of documents is a topic (e.g. incremental clustering of tf-idf vectors by cosine similarity; see the sketch after this list).

• Feature-pivot methods:
  – Cluster terms together based on their co-occurrence patterns; each group of terms is a topic (e.g. the graph-based feature-pivot method).
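To make the document-pivot class concrete, here is a minimal sketch of incremental clustering over tf-idf vectors. The toy corpus and the 0.5 similarity threshold are illustrative assumptions, not settings from the paper.

```python
# A minimal sketch of document-pivot topic detection: incremental
# clustering of tf-idf vectors by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "romney wins virginia",
    "romney thanks virginia on tv",
    "obama wins vermont",
]

tfidf = TfidfVectorizer().fit_transform(docs)

clusters = []  # each cluster is a list of document indices (one topic)
for i in range(tfidf.shape[0]):
    best, best_sim = None, 0.5  # similarity threshold (assumed)
    for c, members in enumerate(clusters):
        # Compare against the first document of each cluster for simplicity.
        sim = cosine_similarity(tfidf[i], tfidf[members[0]])[0, 0]
        if sim > best_sim:
            best, best_sim = c, sim
    if best is None:
        clusters.append([i])      # start a new topic
    else:
        clusters[best].append(i)  # attach to the closest existing topic

print(clusters)
```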

Page 4: WIMS 2014, Thessaloniki, June 2014


Feature-pivot methods: degree of co-occurrence (1/2)

• We focus on feature-pivot methods and examine the effect of the “degree” of examined co-occurrence patterns on the term clustering procedure.

• Let us consider the following topics:
  – Romney wins Virginia
  – Romney appears on TV and thanks Virginia
  – Obama wins Vermont
  – Romney congratulates Obama

• Key terms (e.g. Romney, Obama, Virginia, Vermont, wins) co-occur with more than one other key term.

Page 5: WIMS 2014, Thessaloniki, June 2014


Feature-pivot methods: degree of co-occurrence (2/2)

• Previous approaches typically examined only “pairwise” co-occurrence patterns.

• In the case of closely related topics, such as the above, examining only pairwise co-occurrence patterns may lead to mixed topics.

• We propose to examine the simultaneous co-occurrence patterns of a larger number of terms.

Page 6: WIMS 2014, Thessaloniki, June 2014


Beyond pairwise co-occurrence analysis: FPM

• Frequent Pattern Mining.

• A variety of algorithms (e.g. Apriori, FP-Growth) can be used to find groups of items that frequently co-occur.

• Not a new approach for textual topic detection.
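As a concrete illustration of the idea, here is a minimal sketch of level-wise frequent itemset mining in the spirit of Apriori, applied to tokenized documents. The toy corpus and min_support value are illustrative assumptions; at scale one would use FP-Growth instead.

```python
# Level-wise frequent itemset mining over tokenized documents.
docs = [
    {"romney", "wins", "virginia"},
    {"romney", "thanks", "virginia"},
    {"obama", "wins", "vermont"},
    {"romney", "congratulates", "obama"},
]
min_support = 2  # minimum number of documents an itemset must appear in

def support(itemset):
    return sum(1 for d in docs if itemset <= d)

# Level 1: frequent single terms.
items = {t for d in docs for t in d}
levels = [{frozenset([t]) for t in items if support(frozenset([t])) >= min_support}]

# Level k: join frequent (k-1)-itemsets and keep candidates that stay frequent.
while levels[-1]:
    k = len(next(iter(levels[-1]))) + 1
    candidates = {a | b for a in levels[-1] for b in levels[-1] if len(a | b) == k}
    levels.append({c for c in candidates if support(c) >= min_support})

for level in levels:
    for itemset in level:
        print(sorted(itemset), support(itemset))
```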

Page 7: WIMS 2014, Thessaloniki, June 2014


Beyond pairwise co-occurrence analysis: SFPM

• Frequent Pattern Mining is strict in that it looks for sets of terms, all of which must frequently co-occur at the same time. It may therefore only surface topics with a very small number of terms, i.e. very coarse topics.

• Can we formulate an algorithm that lies between the two ends of the spectrum, i.e. looks at co-occurrence patterns with degree higher than two but is not as strict as a typical FPM algorithm?

Page 8: WIMS 2014, Thessaloniki, June 2014


SFPM

1. Term selection.

2. Co-occurrence vector formation.

3. Post-processing.

Page 9: WIMS 2014, Thessaloniki, June 2014


SFPM–Step 1: Term selection

• Select the top K terms that will enter the clustering procedure.

• There are different options for doing this. For instance:
  – Select the most frequent terms.
  – Select the terms that exhibit the most “bursty” behaviour (if we are considering temporal processing).

• In our experiments we select the terms that are most “unusually frequent”; a hedged sketch of such a scoring follows.
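The ranking formula on the slide did not survive the transcript. As one plausible reading, “unusually frequent” terms can be scored by comparing each term's frequency in the current batch against a reference corpus; the ratio and add-one smoothing below are assumptions for illustration, not necessarily the paper's exact formula.

```python
# A hedged sketch of "unusually frequent" term selection: score each term
# by the ratio of its frequency in the current batch to its (smoothed)
# frequency in a reference corpus. top_k is an assumed parameter.
from collections import Counter

def top_unusual_terms(current_docs, reference_docs, top_k=10):
    cur = Counter(t for d in current_docs for t in d.split())
    ref = Counter(t for d in reference_docs for t in d.split())
    n_cur = sum(cur.values())
    n_ref = sum(ref.values())
    def score(term):
        p_cur = cur[term] / n_cur
        p_ref = (ref[term] + 1) / (n_ref + len(ref))  # add-one smoothing
        return p_cur / p_ref
    return sorted(cur, key=score, reverse=True)[:top_k]

print(top_unusual_terms(
    ["romney wins virginia", "obama wins vermont"],
    ["the weather is nice", "stocks rose today"],
))
```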

Page 10: WIMS 2014, Thessaloniki, June 2014


SFPM–Step 2: Co-occurrence vector formation (1/4)

• The heart of the SFPM approach.
• Notation:
  – n: the number of documents in the collection.
  – S: a set of terms, representing a topic.
  – D_S: a vector of length n; its i-th element indicates how many of the terms in S co-occur in the i-th document.
  – D_t: a binary vector of length n; its i-th element indicates whether the term t occurs in the i-th document.

• The vector D_t of a term t that frequently co-occurs with the terms in S will have high cosine similarity with D_S.

• Idea of the algorithm: greedily expand S with the term t whose D_t best matches D_S.
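As a concrete illustration of this greedy expansion, here is a minimal numpy sketch. The toy corpus and the fixed 0.5 threshold are assumptions; the actual method adapts the threshold to |S|, as described on the following slides.

```python
# A minimal sketch of the greedy set-expansion at the core of SFPM.
import numpy as np

docs = [
    {"romney", "wins", "virginia"},
    {"romney", "thanks", "virginia"},
    {"obama", "wins", "vermont"},
    {"romney", "congratulates", "obama"},
]
terms = sorted({t for d in docs for t in d})
# D_t: binary occurrence vector of length n for each term t.
D = {t: np.array([1.0 if t in d else 0.0 for d in docs]) for t in terms}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def expand(seed, threshold=0.5):
    S = {seed}
    D_S = D[seed].copy()  # co-occurrence vector of the current set
    while True:
        # Pick the term outside S whose D_t best matches D_S.
        best = max((t for t in terms if t not in S),
                   key=lambda t: cosine(D[t], D_S), default=None)
        if best is None or cosine(D[best], D_S) < threshold:
            return S
        S.add(best)
        D_S += D[best]

print(expand("romney"))
```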

Page 11: WIMS 2014, Thessaloniki, June 2014


SFPM–Step 2: Co-occurrence vector formation (2/4)

Page 12: WIMS 2014, Thessaloniki, June 2014


SFPM–Step 2: Co-occurrence vector formation (3/4)

• We need a stopping criterion for the expansion procedure.
• If it is not properly set, the set of terms may end up being too small (i.e. the topic may be too coarse) or may end up being a mixture of topics.

• We use a threshold on the cosine similarity of D_S and D_t, and we adapt the threshold dynamically to the size of S. In particular, we use a sigmoid function of |S|:
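The formula itself did not survive the transcript (it appeared as an image on the slide). A plausible form consistent with the description, i.e. a sigmoid threshold that rises as |S| grows so that larger sets are harder to expand, is the following, where b and c are assumed tuning parameters rather than values from the paper:

```latex
\theta(|S|) = \frac{1}{1 + e^{-(|S| - b)/c}}
```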

Page 13: WIMS 2014, Thessaloniki, June 2014


SFPM–Step 2: Co-occurrence vector formation (4/4)

• We run the expansion procedure many times, each time starting from a different term.

• Additionally, to avoid having less important terms dominate D_S and the cosine similarity, at the end of each expansion step we zero out the entries of D_S that have a value smaller than |S|/2.
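In the numpy notation of the earlier expansion sketch, this pruning step is a one-liner; the toy values of S and D_S below are assumptions for illustration.

```python
import numpy as np

# Prune entries of D_S supported by fewer than |S|/2 of the set's terms,
# so documents containing only a couple of the terms stop contributing.
S = {"romney", "wins", "virginia"}
D_S = np.array([3.0, 2.0, 1.0, 1.0])  # toy co-occurrence counts per document
D_S[D_S < len(S) / 2] = 0
print(D_S)  # [3. 2. 0. 0.]
```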

Page 14: WIMS 2014, Thessaloniki, June 2014


SFPM–Step 3: Post-processing

• Because we run the expansion procedure many times, each time starting from a different term, we may end up with a large number of duplicate topics.

• In the final step of the algorithm, we filter out duplicate topics by considering the Jaccard similarity between the sets of keywords of the produced topics.
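A minimal sketch of this duplicate filtering follows; the 0.5 Jaccard threshold is an illustrative assumption.

```python
# Keep a topic only if it is not too similar to an already-kept one.
def jaccard(a, b):
    return len(a & b) / len(a | b)

def deduplicate(topics, threshold=0.5):
    kept = []
    for topic in topics:
        if all(jaccard(topic, k) < threshold for k in kept):
            kept.append(topic)
    return kept

print(deduplicate([{"romney", "wins", "virginia"},
                   {"wins", "virginia", "romney"},
                   {"obama", "wins", "vermont"}]))
```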

Page 15: WIMS 2014, Thessaloniki, June 2014


SFPM – Overview

Page 16: WIMS 2014, Thessaloniki, June 2014


Evaluation – Datasets and evaluation metrics

• Three datasets collected from Twitter:
  – Super Tuesday: 474,109 documents.
  – F.A. Cup final: 148,652 documents.
  – U.S. presidential elections: 1,247,483 documents.

• For each of them a number of ground-truth topics were determined by examining the relevant stories that appeared in the mainstream media.

• Each topic is represented by a set of mandatory terms, a set of forbidden terms (so that we make sure that closely related topics are not merged) and a set of optional terms.

• We evaluate:
  – Topic recall.
  – Keyword recall.
  – Keyword precision.
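A hedged sketch of the matching rule described above: a detected topic matches a ground-truth topic if it contains all mandatory terms and none of the forbidden ones. The keyword metrics are then computed over the topics' term sets; the exact details follow the paper.

```python
# Ground-truth matching and topic recall, under the rule stated above.
def matches(detected, mandatory, forbidden):
    return mandatory <= detected and not (forbidden & detected)

def topic_recall(detected_topics, targets):
    hits = sum(1 for mand, forb in targets
               if any(matches(d, mand, forb) for d in detected_topics))
    return hits / len(targets)

targets = [({"romney", "virginia"}, {"obama"})]  # (mandatory, forbidden)
print(topic_recall([{"romney", "wins", "virginia"}], targets))  # 1.0
```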

Page 17: WIMS 2014, Thessaloniki, June 2014


Evaluation – Competing methods

• A classic probabilistic method: LDA.

• A graph-based feature-pivot approach that examines only pairwise co-occurrence patterns.

• FPM using the FP-Growth algorithm.

Page 18: WIMS 2014, Thessaloniki, June 2014


Evaluation – Results: topic recall

• Evaluated for different numbers of returned topics.

• SFPM achieves the highest topic recall in all three datasets.

Page 19: WIMS 2014, Thessaloniki, June 2014


Evaluation – Results: keyword recall

• SFPM achieves the highest keyword recall in all three datasets.

• SFPM not only retrieves more target topics than the other methods, but also provides a quite complete representation of those topics.

Page 20: WIMS 2014, Thessaloniki, June 2014


Evaluation – Results: keyword precision

• SFPM nevertheless achieves a somewhat lower keyword precision, indicating that some spurious keywords are also included in the topics.

• FPM, being the strictest method, achieves the highest keyword precision.

Page 21: WIMS 2014, Thessaloniki, June 2014


Evaluation – Example topics produced

Page 22: WIMS 2014, Thessaloniki, June 2014


Conclusions

• Started from the observation that in order to detect closely related topics, a feature-pivot topic detection method should examine co-occurrence patterns of degree larger than 2.

• We have presented an approach, SFPM, that does this, albeit in a soft and controllable manner. It is based on a greedy set expansion procedure.

• We have experimentally shown that the proposed approach may indeed improve performance when dealing with corpora containing closely inter-related topics.

Page 23: WIMS 2014, Thessaloniki, June 2014


Future work

• Experiment with different types of documents.

• Consider the problem of synonyms.

• Examine alternative, more efficient search strategies; e.g. index the D_t vectors (using, for instance, LSH) in order to rapidly retrieve the best-matching term for a set S.
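As a hedged sketch of what such an index could look like, random-hyperplane LSH approximates cosine similarity: terms whose D_t vectors point in similar directions tend to share hash codes, so candidate matches for D_S can be fetched from one bucket instead of scanning all terms. The number of hyperplanes is an assumed parameter.

```python
# Random-hyperplane LSH for cosine similarity over occurrence vectors.
import numpy as np

rng = np.random.default_rng(0)

def signature(v, planes):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple((planes @ v) > 0)

def build_index(D, n_docs, n_planes=8):
    planes = rng.standard_normal((n_planes, n_docs))
    buckets = {}
    for term, vec in D.items():
        buckets.setdefault(signature(vec, planes), []).append(term)
    return planes, buckets

# Usage with D and docs from the earlier expansion sketch:
# planes, buckets = build_index(D, n_docs=len(docs))
# candidates = buckets.get(signature(D_S, planes), [])
```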

Page 24: WIMS 2014, Thessaloniki, June 2014


Thank you!

• Open source implementation (including a set of other topic detection methods) available at:

https://github.com/socialsensor/topic-detection

• Dataset and evaluation resources available at:
http://www.socialsensor.eu/results/datasets/72-twitter-tdt-dataset

• Relevant topic detection dataset on which SFPM will be tested:

http://figshare.com/articles/SNOW_2014_Data_Challenge/1003755

Questions, comments, suggestions?