2009 IEEE Symposium on Computational Intelligence in Cyber Security: LDA-based Dark Web Analysis


Page 1: LDA-based Dark Web Analysis

Page 2: Outline

What is Dark Web?

Why do we need to analyze it?

How to analyze Dark Web: our strategy
Web Crawling
Topic Discovery based on Latent Dirichlet Allocation (LDA)
Optimization Process

Conclusion

Page 3: What is Dark Web?

The Web is a global information platform accessible from different locations. It is a fast way to spread information anonymously or with few regulations, and its cost is relatively low compared with other media.

The Dark Web is where terrorist/extremist organizations and their sympathizers exchange ideology, spread propaganda, recruit members, and plan attacks.

An example of a dark web site: www.natall.com

Page 4: Why do we need to analyze it?

To find the hidden topics in the Dark Web community, whose content is:

embedded in other large-scale on-line web sites

overloaded with information

multi-lingual

Page 5: How to analyze Dark Web: architecture of our strategy

GS: Gibbs Sampling, a random walk over the sample space used to find the maximum-likelihood estimate

LDA: Latent Dirichlet Allocation

Page 6: How to analyze Dark Web: architecture of our strategy

Use a web crawler to download text-based documents.

Prune by removing all HTML tags and irrelevant content such as images and navigation instructions.

Format into a plain text file:
FF := header {doc}
header := a line containing the number of documents
doc := {term_1}

Feed the text file to the GibbsLDA analyzer to discover the latent topics.

Optimize topic discovery.
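The pruning and formatting steps above can be sketched in Python. The function names and the stdlib-only HTML stripping are ours (the paper uses Web-harvest for crawling), but the output follows the format described: a header line with the number of documents, then one document's terms per line.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def prune(html):
    """Strip HTML tags and collapse whitespace, keeping only the text."""
    p = TextExtractor()
    p.feed(html)
    return re.sub(r"\s+", " ", " ".join(p.parts)).strip()

def to_corpus_file(pages, path):
    """Write the analyzer input: a header line with the number of
    documents, then each document's terms on one line."""
    docs = [prune(h) for h in pages]
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{len(docs)}\n")
        for d in docs:
            f.write(d + "\n")
    return docs
```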

Page 7: Criteria to select web crawlers

Able to parse ill-coded web pages and parameterized URLs

Flexible enough to handle different web site structures: the downloaded pages will be read by machine rather than by humans, so some normalization must be applied to ensure the text corpus is well formatted and readable

Easy to maintain, with minimal hardware requirements

Does not need to be especially fast

Must not introduce any intellectual-property problems

Page 8: Web-harvest vs. others

Page 9: Web-harvest pipeline

Page 10: Topic discovery based on LDA

LDA is an Information Retrieval (IR) technique. IR:

reduces information overload

preserves the essential statistical relationships

Basic and traditional IR methods:

tf-idf scheme: term-count pairs => a term-by-document matrix

LSI (latent semantic indexing)

pLSI (probabilistic LSI)

Clustering: divide the data set into subsets
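As a reference point for the tf-idf scheme named above, a minimal stdlib-only sketch (names are ours) that turns term counts into the term-by-document matrix:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Term-count pairs => term-by-document matrix.
    tf = raw term count in a document; idf = log(M / df),
    where M is the document count and df the term's document frequency."""
    counts = [Counter(d) for d in docs]
    vocab = sorted({t for c in counts for t in c})
    M = len(docs)
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    matrix = [[c[t] * math.log(M / df[t]) for c in counts] for t in vocab]
    return vocab, matrix
```

A term that appears in every document gets idf = log 1 = 0, which illustrates how the scheme reduces information overload by down-weighting uninformative terms.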

Page 11: Dirichlet Distribution

a generalization of the beta distribution

Page 12: Beta Distribution

a continuous probability distribution with the probability density function (pdf) defined on the interval [0, 1]
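The relationship between the two distributions can be shown with a small sampler: a Dirichlet draw can be generated by normalizing independent Gamma draws, and with two parameters the first coordinate is Beta-distributed. This is a standard construction, not code from the paper; names are ours.

```python
import random

def dirichlet_sample(alphas, rng=random):
    """Draw from Dirichlet(alphas) by normalizing independent
    Gamma(alpha_i, 1) draws; the result lies on the probability simplex.
    With two parameters this reduces to a Beta(a, b) draw."""
    gs = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gs)
    return [g / total for g in gs]
```

`dirichlet_sample([a, b])[0]` follows Beta(a, b), matching the pdf on [0, 1] described above, which is why the Dirichlet is called a generalization of the beta distribution.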

Page 13: LDA graph

corpus level:
α: Dirichlet prior hyper-parameter on the mixing proportions
β: Dirichlet prior hyper-parameter on the mixture-component distributions
M: number of documents

document level:
θ: the document's mixture proportions
φ: the mixture components of documents
N: number of words in a document

word level:
z: hidden topic variable
w: observed word variable

[H. Zhang et al., 2007]
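The variables above can be tied together with a collapsed Gibbs sampler, the inference method named on slide 5. This is a textbook sketch, not the GibbsLDA implementation the paper uses; all function and variable names are ours.

```python
import random

def gibbs_lda(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA: repeatedly resample each word's
    topic assignment z from its conditional given all other assignments."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}
    ndk = [[0] * K for _ in docs]          # document-topic counts
    nkw = [[0] * V for _ in range(K)]      # topic-word counts
    nk = [0] * K                           # topic totals
    z = []
    for d, doc in enumerate(docs):         # random initialization
        zs = []
        for w in doc:
            t = rng.randrange(K)
            zs.append(t)
            ndk[d][t] += 1; nkw[t][wid[w]] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                # remove the current assignment
                ndk[d][t] -= 1; nkw[t][wid[w]] -= 1; nk[t] -= 1
                # p(z = k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                ps = [(ndk[d][k] + alpha) * (nkw[k][wid[w]] + beta)
                      / (nk[k] + V * beta) for k in range(K)]
                r = rng.random() * sum(ps)
                k, acc = 0, ps[0]
                while acc < r:
                    k += 1; acc += ps[k]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wid[w]] += 1; nk[k] += 1
    # theta: per-document topic proportions (posterior mean)
    theta = [[(ndk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
             for d in range(len(docs))]
    return theta, z
```

Because θ is a full distribution over topics rather than a single label, each document can carry weight on several topics, which is the contrast with clustering drawn on the next slide.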

Page 14: LDA vs. Clustering

Clustering simply partitions the corpus; each document belongs to one category.

LDA-based analysis allows one document to be classified into different categories because of its hierarchical structure.

Page 15: Optimizing the results (1)

LDA does not know how many topics there should be; this value is set by the user. However, we can evaluate multiple "wild guesses" and choose the best one.

f(x) is the number of documents that contain the word x
f(y) is the number of documents that contain the word y
f(x, y) is the number of documents that contain both word x and word y
M is the total number of documents

distance(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log M - min{log f(x), log f(y)})
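Under the definitions above, the distance measure and the per-topic average used in the optimization can be sketched as follows. Function names are ours, and the conventions for words that never co-occur, or that occur in every document, are our assumptions rather than the paper's.

```python
import math

def distance(x, y, doc_sets, M):
    """distance(x, y) = (max{log f(x), log f(y)} - log f(x, y))
                        / (log M - min{log f(x), log f(y)}),
    where f(.) counts the documents containing the word(s)."""
    fx = sum(1 for d in doc_sets if x in d)
    fy = sum(1 for d in doc_sets if y in d)
    fxy = sum(1 for d in doc_sets if x in d and y in d)
    if fxy == 0:
        return float("inf")   # never co-occur: maximally distant (our convention)
    lx, ly = math.log(fx), math.log(fy)
    den = math.log(M) - min(lx, ly)
    if den == 0:
        return 0.0            # a word occurs in every document (our convention)
    return (max(lx, ly) - math.log(fxy)) / den

def avg_topic_distance(topic_words, doc_sets):
    """Mean pairwise distance between a topic's top words; the guess for the
    number of topics that minimizes this average is chosen."""
    M = len(doc_sets)
    pairs = [(x, y) for i, x in enumerate(topic_words) for y in topic_words[i + 1:]]
    return sum(distance(x, y, doc_sets, M) for x, y in pairs) / len(pairs)
```

Words that always appear together get distance 0; words that share no documents get infinite distance, so a coherent topic has a small average distance.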

Page 16: Optimizing the results (2)

For each candidate topic discovery, compute the average word distance within each topic, and choose the discovery with the minimum.

Page 17: Optimizing the results (3)

Results: four topics yields the minimum average distance between words in each topic.

Page 18: Discovering New Topics after Optimization

A topic list of discovered topics from www.natall.com

Page 19: Conclusion

Web-harvest integrated with LDA is able to:

discover the hidden latent topics from dark web sites

provide a more flexible and automated tool to counter terrorism

support a measurable way to optimize the results of LDA

provide a generic tool to analyze a variety of websites, such as financial and medical sites

Page 20: References

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

Zhang, H., Qiu, B., Giles, C. L., Foley, H. C., and Yen, J. An LDA-based Community Structure Discovery Approach for Large-Scale Social Networks. In Proceedings of IEEE Intelligence and Security Informatics, 2007.

Yang, C. C., Shi, X., and Wei, C.-P. Tracing the Event Evolution of Terror Attacks from On-Line News. In Proceedings of IEEE Intelligence and Security Informatics, 2006.

Xu, J., Chen, H., Zhou, Y., and Qin, J. On the Topology of the Dark Web of Terrorist Groups. In Proceedings of IEEE Intelligence and Security Informatics, 2006.