2009 IEEE Symposium on Computational Intelligence in Cyber Security: LDA-based Dark Web Analysis


Page 1: LDA-based Dark Web Analysis

Page 2: Outline

What is Dark Web?

Why do we need to analyze it?

How to analyze Dark Web: our strategy
Web Crawling
Topic Discovery based on Latent Dirichlet Allocation (LDA)
Optimization Process

Conclusion

Page 3: What is Dark Web?

The Web is a global information platform accessible from different locations. It is a fast way to spread information anonymously or with few regulations, and its cost is relatively low compared with other media.

The Dark Web is where terrorist/extremist organizations and their sympathizers exchange ideology, spread propaganda, recruit members, and plan attacks.

An example of a dark web site: www.natall.com

Page 4: Why do we need to analyze it?

To find the hidden topics in the Dark Web community, whose content is:

embedded in other large-scale on-line web sites

overloaded with information

multi-lingual

Page 5: How to analyze Dark Web: architecture of our strategy

GS: Gibbs Sampling, a random walk over the sample space used to find the maximum-likelihood estimate

LDA: Latent Dirichlet Allocation

Page 6: How to analyze Dark Web: architecture of our strategy

Use a web crawler to download text-based documents.

Prune by removing all HTML tags and irrelevant content such as images and navigation instructions.

Format into a plain text file:
FF := header {doc}
header := a line containing the number of documents
doc := {term_1}

Feed the text file to the GibbsLDA analyzer to discover the latent topics.

Optimize topic discovery.
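The pruning and formatting steps above can be sketched in Python. The function names and the stdlib-only HTML stripping are ours (the paper uses Web-harvest for crawling), but the output follows the format described: a header line with the number of documents, then one document's terms per line.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def prune(html):
    """Strip HTML tags and collapse whitespace, keeping only the text."""
    p = TextExtractor()
    p.feed(html)
    return re.sub(r"\s+", " ", " ".join(p.parts)).strip()

def to_corpus_file(pages, path):
    """Write the analyzer input: a header line with the number of
    documents, then each document's terms on one line."""
    docs = [prune(h) for h in pages]
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{len(docs)}\n")
        for d in docs:
            f.write(d + "\n")
    return docs
```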

Page 7: Criteria to select web crawlers

Able to parse ill-coded web pages and parameterized URLs

Flexible enough to handle different web site structures: the downloaded pages will be read by machine rather than by humans, so some normalization must be applied to ensure the text corpus is well formatted and readable

Easy to maintain, with minimal hardware requirements

Does not need to be especially fast

Must not introduce any intellectual-property problems

Page 8: Web-harvest vs. others

Page 9: Web-harvest pipeline

Page 10: Topic discovery based on LDA

LDA is an Information Retrieval (IR) technique. IR:

reduces information overload

preserves the essential statistical relationships

Basic and traditional IR methods:

tf-idf scheme: term-count pairs => a term-by-document matrix

LSI (latent semantic indexing)

pLSI (probabilistic LSI)

Clustering: divide the data set into subsets
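As a reference point for the tf-idf scheme named above, a minimal stdlib-only sketch (names are ours) that turns term counts into the term-by-document matrix:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Term-count pairs => term-by-document matrix.
    tf = raw term count in a document; idf = log(M / df),
    where M is the document count and df the term's document frequency."""
    counts = [Counter(d) for d in docs]
    vocab = sorted({t for c in counts for t in c})
    M = len(docs)
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    matrix = [[c[t] * math.log(M / df[t]) for c in counts] for t in vocab]
    return vocab, matrix
```

A term that appears in every document gets idf = log 1 = 0, which illustrates how the scheme reduces information overload by down-weighting uninformative terms.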

Page 11: Dirichlet Distribution

a generalization of the beta distribution

Page 12: Beta Distribution

a continuous probability distribution with the probability density function (pdf) defined on the interval [0, 1]
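The relationship between the two distributions can be shown with a small sampler: a Dirichlet draw can be generated by normalizing independent Gamma draws, and with two parameters the first coordinate is Beta-distributed. This is a standard construction, not code from the paper; names are ours.

```python
import random

def dirichlet_sample(alphas, rng=random):
    """Draw from Dirichlet(alphas) by normalizing independent
    Gamma(alpha_i, 1) draws; the result lies on the probability simplex.
    With two parameters this reduces to a Beta(a, b) draw."""
    gs = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gs)
    return [g / total for g in gs]
```

`dirichlet_sample([a, b])[0]` follows Beta(a, b), matching the pdf on [0, 1] described above, which is why the Dirichlet is called a generalization of the beta distribution.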

Page 13: LDA graph

corpus level:
α: Dirichlet prior hyper-parameter on the mixing proportions
β: Dirichlet prior hyper-parameter on the mixture-component distributions
M: number of documents

document level:
θ: the document's mixture proportions
φ: the mixture components of documents
N: number of words in a document

word level:
z: hidden topic variable
w: observed word variable

[H. Zhang et al., 2007]
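The variables above can be tied together with a collapsed Gibbs sampler, the inference method named on slide 5. This is a textbook sketch, not the GibbsLDA implementation the paper uses; all function and variable names are ours.

```python
import random

def gibbs_lda(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA: repeatedly resample each word's
    topic assignment z from its conditional given all other assignments."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}
    ndk = [[0] * K for _ in docs]          # document-topic counts
    nkw = [[0] * V for _ in range(K)]      # topic-word counts
    nk = [0] * K                           # topic totals
    z = []
    for d, doc in enumerate(docs):         # random initialization
        zs = []
        for w in doc:
            t = rng.randrange(K)
            zs.append(t)
            ndk[d][t] += 1; nkw[t][wid[w]] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                # remove the current assignment
                ndk[d][t] -= 1; nkw[t][wid[w]] -= 1; nk[t] -= 1
                # p(z = k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                ps = [(ndk[d][k] + alpha) * (nkw[k][wid[w]] + beta)
                      / (nk[k] + V * beta) for k in range(K)]
                r = rng.random() * sum(ps)
                k, acc = 0, ps[0]
                while acc < r:
                    k += 1; acc += ps[k]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wid[w]] += 1; nk[k] += 1
    # theta: per-document topic proportions (posterior mean)
    theta = [[(ndk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
             for d in range(len(docs))]
    return theta, z
```

Because θ is a full distribution over topics rather than a single label, each document can carry weight on several topics, which is the contrast with clustering drawn on the next slide.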

Page 14: LDA vs. Clustering

Clustering simply partitions the corpus; each document belongs to one category.

LDA-based analysis allows one document to be classified into different categories because of its hierarchical structure.

Page 15: Optimizing the results (1)

LDA does not know how many topics there should be; this value is set by the user. However, we can evaluate multiple "wild guesses" and choose the best one.

f(x) is the number of documents that contain the word x
f(y) is the number of documents that contain the word y
f(x, y) is the number of documents that contain both word x and word y
M is the total number of documents

distance(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log M - min{log f(x), log f(y)})
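Under the definitions above, the distance measure and the per-topic average used in the optimization can be sketched as follows. Function names are ours, and the conventions for words that never co-occur, or that occur in every document, are our assumptions rather than the paper's.

```python
import math

def distance(x, y, doc_sets, M):
    """distance(x, y) = (max{log f(x), log f(y)} - log f(x, y))
                        / (log M - min{log f(x), log f(y)}),
    where f(.) counts the documents containing the word(s)."""
    fx = sum(1 for d in doc_sets if x in d)
    fy = sum(1 for d in doc_sets if y in d)
    fxy = sum(1 for d in doc_sets if x in d and y in d)
    if fxy == 0:
        return float("inf")   # never co-occur: maximally distant (our convention)
    lx, ly = math.log(fx), math.log(fy)
    den = math.log(M) - min(lx, ly)
    if den == 0:
        return 0.0            # a word occurs in every document (our convention)
    return (max(lx, ly) - math.log(fxy)) / den

def avg_topic_distance(topic_words, doc_sets):
    """Mean pairwise distance between a topic's top words; the guess for the
    number of topics that minimizes this average is chosen."""
    M = len(doc_sets)
    pairs = [(x, y) for i, x in enumerate(topic_words) for y in topic_words[i + 1:]]
    return sum(distance(x, y, doc_sets, M) for x, y in pairs) / len(pairs)
```

Words that always appear together get distance 0; words that share no documents get infinite distance, so a coherent topic has a small average distance.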

Page 16: Optimizing the results (2)

For each candidate topic discovery, compute the average word distance within each topic, and choose the discovery with the minimum.

Page 17: Optimizing the results (3)

Results: four topics yields the minimum average distance between words in each topic.

Page 18: Discovering New Topics after Optimization

A topic list of discovered topics from www.natall.com

Page 19: Conclusion

Web-harvest integrated with LDA is able to:

discover the hidden latent topics from dark web sites

provide a more flexible and automated tool to counter terrorism

support a measurable way to optimize the results of LDA

provide a generic tool to analyze a variety of websites, such as financial and medical sites

Page 20: References

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

Zhang, H., Qiu, B., Giles, C. L., Foley, H. C., and Yen, J. An LDA-based Community Structure Discovery Approach for Large-Scale Social Networks. In Proceedings of IEEE Intelligence and Security Informatics, 2007.

Yang, C. C., Shi, X., and Wei, C.-P. Tracing the Event Evolution of Terror Attacks from On-Line News. In Proceedings of IEEE Intelligence and Security Informatics, 2006.

Xu, J., Chen, H., Zhou, Y., and Qin, J. On the Topology of the Dark Web of Terrorist Groups. In Proceedings of IEEE Intelligence and Security Informatics, 2006.