Technical Report of Web Mining Group Presented by: Mohsen Kamyar Ferdowsi University of Mashhad, WTLab


Page 1:

Technical Report of Web Mining Group

Presented by: Mohsen Kamyar

Ferdowsi University of Mashhad, WTLab

Page 2:

- Main Approach in Concept Extraction
- Problems
- Clustering Methods and LSI
- Ideas and Our Works
- Experimental Results

Page 3:

The main approach in Concept Extraction (we will call it CE) is to use LSI.

LSI is the combination of a matrix decomposition algorithm and some probabilistic analyses of its result, applied to the term-document matrix.

At first we create the term-document matrix (using measures like TF-IDF to indicate the importance of a term in a particular document), then pass it to the SVD (Singular Value Decomposition) algorithm, and finally choose the first k columns as the concepts.
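
As a rough illustration of this pipeline, a minimal sketch is given below. It is not the group's implementation; the toy corpus, the weighting details, and the choice of k are placeholders.

    import numpy as np

    # Toy corpus; in practice this would be the crawled document collection.
    docs = ["web mining extracts concepts from documents",
            "singular value decomposition factorizes the term document matrix",
            "k means clusters documents into groups"]

    # Term-document count matrix (rows = terms, columns = documents).
    vocab = sorted({w for d in docs for w in d.split()})
    counts = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

    # TF-IDF weighting: term frequency times inverse document frequency.
    df = (counts > 0).sum(axis=1)              # document frequency of each term
    idf = np.log(len(docs) / df)
    M = counts * idf[:, None]                  # weighted term-document matrix

    # SVD and truncation to the first k concepts.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    k = 2
    concepts = U[:, :k]                        # each column is one latent concept over the terms
    print(concepts.shape)                      # (number_of_terms, k)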

Page 4:

Singular Value Decomposition is an algorithm that decomposes a matrix (we assume the matrix M is m×n) into three matrices U, S and V with M = U S V^T, such that S is a diagonal matrix of singular values, the columns of U are eigenvectors of the matrix M M^T (the term correlation matrix), and the columns of V are eigenvectors of the matrix M^T M (the document correlation matrix). The singular values in S are sorted in descending order. Therefore the first k of them, or the first k columns of U, or the first k rows of V^T, carry the most important information.
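
These relations can be checked numerically; the small sketch below uses an arbitrary random matrix purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.standard_normal((5, 3))            # a small m x n example matrix

    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    # M is recovered from its factors: M = U S V^T.
    assert np.allclose(M, U @ np.diag(s) @ Vt)

    # Columns of U are eigenvectors of M M^T, columns of V are eigenvectors of M^T M,
    # and the squared singular values are the corresponding eigenvalues.
    assert np.allclose((M @ M.T) @ U, U * s**2)
    assert np.allclose((M.T @ M) @ Vt.T, Vt.T * s**2)

    # The singular values come out sorted in descending order.
    assert np.all(np.diff(s) <= 0)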

Page 5:

Steps of SVD can be explained as below (each step applies a Householder reflection):

1- Select the first column of the matrix M1; we name it u1.
2- Calculate the length of u1 and add it to the first element.
3- Then set B1 = |u1|^2 / 2.
4- Then set U1 = I - (1/B1) u1 u1^T.
5- Then set M2 = U1 M1.
6- Do the same for the first row, and then repeat for the remaining rows and columns.

In general, for the i-th column or row, in step 2 we should first set all elements before the i-th element equal to zero, then calculate the length and add the result to the i-th element.
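
A minimal NumPy sketch of one such reflection step is shown below; it is illustrative only (the 3×3 matrix is a placeholder) and not the group's code.

    import numpy as np

    def householder_step(M1):
        """Apply one reflection U1 = I - (1/B1) u1 u1^T that zeroes out
        the first column of M1 below its first element."""
        u1 = M1[:, 0].astype(float).copy()             # step 1: first column of M1
        u1[0] += np.linalg.norm(u1)                    # step 2: add the column's length to the first element
        B1 = np.dot(u1, u1) / 2.0                      # step 3: B1 = |u1|^2 / 2
        U1 = np.eye(len(u1)) - np.outer(u1, u1) / B1   # step 4: U1 = I - (1/B1) u1 u1^T
        return U1 @ M1                                 # step 5: M2 = U1 M1

    M1 = np.array([[4.0, 1.0, 2.0],
                   [2.0, 3.0, 0.0],
                   [1.0, 0.0, 5.0]])
    M2 = householder_step(M1)
    print(np.round(M2, 6))                             # the first column is now (-|M1[:, 0]|, 0, 0)^T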

Page 6:

- Main Approach in Concept Extraction
- Problems
- Clustering Methods and LSI
- Ideas and Our Works
- Experimental Results

Page 7:

We can list the main problems of LSI as below:

- This method is based on the sum of squared distances (Σ (s_i - t_i)^2), so it is useful for data that has a Gaussian (normal) distribution, but the term-document matrix has a Poisson distribution.
- This method is very slow (its computational complexity is O(n^3 m), with n << m).

The Poisson distribution is a memoryless distribution; in other words, the next occurrence of the random variable X does not depend on its previous occurrences.

Page 8:

- Main Approach in Concept Extraction
- Problems
- Clustering Methods and LSI
- Ideas and Our Works
- Experimental Results

Page 9:

There is a wide variety of clustering methods, but we can group them as below:

- Discrete methods
  - Linear approaches: PCA, K-Means, K-Medians, K-Centers, LSH
  - Non-linear approaches: KPCA, Embeddings
- Artificial Intelligence based approaches

Page 10:

PCA is an abbreviation for Principal Component Analysis; it is a collection of methods that use eigenvector and eigenvalue properties for clustering.

So, SVD is one of the main approaches in PCA collection.

It has recently been proved that K-Means and the other members of its family can also be listed in the PCA family.

The PCA family consists of linear approaches and cannot cluster data whose independence structure is nonlinear.

The PCA family is suitable for data with a Gaussian distribution.
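
To make the relation between PCA and SVD concrete, the small sketch below (on synthetic placeholder data) computes the principal directions both from the eigenvectors of the covariance matrix and from the SVD of the centered data; the two agree up to sign.

    import numpy as np

    rng = np.random.default_rng(1)
    A = np.array([[2.0, 0.0, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.3, 0.2]])
    X = rng.standard_normal((100, 3)) @ A          # correlated synthetic data
    Xc = X - X.mean(axis=0)                        # center the data

    # PCA via eigen-decomposition of the covariance matrix.
    cov = Xc.T @ Xc / (len(Xc) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]              # components by decreasing variance
    components_eig = eigvecs[:, order]

    # PCA via SVD of the centered data matrix.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components_svd = Vt.T

    # The principal directions agree up to sign.
    assert np.allclose(np.abs(components_eig), np.abs(components_svd), atol=1e-6)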

Page 11:

A sample of nonlinear independence (figure).

Page 12:

But K-Means has a computational complexity of O(nm), which is better than that of SVD (O(n^3 m)).

LSH is a member of the linear methods and has a good computational complexity.
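
For reference, a minimal Lloyd-style K-Means sketch is given below; the data, k, and the iteration cap are placeholders, and each iteration makes one pass over the n points in m dimensions.

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        """Plain Lloyd iterations: assign each point to its nearest center,
        then move each center to the mean of its assigned points."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # squared distance of every point to every center
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers

    rng = np.random.default_rng(42)
    X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),      # one blob around (0, 0)
                   rng.normal(3.0, 0.3, (50, 2))])     # another blob around (3, 3)
    labels, centers = kmeans(X, k=2)
    print(np.round(centers, 2))                        # roughly (0, 0) and (3, 3)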

Page 13:

KPCA (Kernel PCA) is a collection of nonlinear clustering methods.

There are two groups in KPCA. Kernel functions: in this family we should invent a function that can convert the nonlinear independence into a linear one. For an example using a Gaussian function, see below.
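
As an illustration, a sketch of the Gaussian (RBF) kernel is given below; the bandwidth sigma and the two-circles data are placeholders, chosen only to show a case that is not linearly separable in the original space.

    import numpy as np

    def gaussian_kernel(X, sigma=1.0):
        """Gaussian (RBF) kernel matrix: K[i, j] = exp(-|x_i - x_j|^2 / (2 sigma^2))."""
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    # Points on two concentric circles: not linearly separable in the original space.
    t = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
    inner = np.c_[np.cos(t), np.sin(t)]
    outer = 3.0 * np.c_[np.cos(t), np.sin(t)]
    X = np.vstack([inner, outer])

    K = gaussian_kernel(X, sigma=1.0)
    print(K.shape)                 # (100, 100) similarity matrix used instead of raw coordinates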

Page 14:

Kernel tricks: in this family we should convert the original space into a higher-dimensional space with specific properties (some methods map the data into a Hilbert space, which is a special case of a Banach space), such that the nonlinear independence is converted into a linear one; then we can use the PCA methods. In this approach we should use Embedding methods.
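
A minimal Kernel PCA sketch along these lines is given below; it works on a precomputed kernel matrix, and the tiny linear-kernel example at the end is only a placeholder (in the setting above, K would be the Gaussian kernel from the previous sketch).

    import numpy as np

    def kernel_pca(K, k):
        """Kernel PCA on a precomputed kernel matrix K: center K in feature
        space, then project onto the top-k eigenvectors."""
        n = K.shape[0]
        one = np.full((n, n), 1.0 / n)
        Kc = K - one @ K - K @ one + one @ K @ one     # double centering in feature space
        eigvals, eigvecs = np.linalg.eigh(Kc)
        order = np.argsort(eigvals)[::-1][:k]          # largest eigenvalues first
        lambdas = eigvals[order]
        alphas = eigvecs[:, order]
        # projections of the training points onto the k principal components
        return alphas * np.sqrt(np.maximum(lambdas, 0.0))

    # Tiny placeholder example with a linear kernel, just to show the call.
    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
    K = X @ X.T
    Z = kernel_pca(K, k=2)
    print(Z.shape)                                     # (4, 2)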

Artificial Intelligence based clustering methods are too slow for our purpose.

Page 15:

- Main Approach in Concept Extraction
- Problems
- Clustering Methods and LSI
- Ideas and Our Works
- Experimental Results

Page 16:

Our work will be on both finding an appropriate kernel function and finding an appropriate embedding.

But we focus on kernel functions in this phase. Our idea is a little different from the main approach: instead of transforming the points, we change the distance function in order to reach linearity.

There is a technique called “Copula” in statistics and probability theory. A copula is a framework for finding a bivariate distribution function for two random variables.

Page 17:

The main idea is as below: two variables are independent if their joint probability is equal to the product of their individual probabilities. So we first find an appropriate copula function and then calculate the volume enclosed between the copula surface and the surface of the product of the variables' probabilities. This volume can be used as a measure of independence, and it gives us a good kernel function.
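
A rough sketch of such a measure is given below; it uses the empirical copula built from ranks and a simple grid average, with placeholder sample data and grid size, and it is not the group's implementation.

    import numpy as np

    def copula_dependence(x, y, grid=20):
        """Approximate the volume between the empirical copula C(u, v) and the
        independence surface u * v, as a measure of dependence."""
        n = len(x)
        # pseudo-observations: ranks scaled into (0, 1]
        u = (np.argsort(np.argsort(x)) + 1.0) / n
        v = (np.argsort(np.argsort(y)) + 1.0) / n
        grid_points = np.linspace(0.0, 1.0, grid + 1)
        total = 0.0
        for a in grid_points:
            for b in grid_points:
                C_ab = np.mean((u <= a) & (v <= b))    # empirical copula at (a, b)
                total += abs(C_ab - a * b)             # gap to the independence surface
        return total / (grid + 1) ** 2                 # average gap over the grid

    rng = np.random.default_rng(0)
    x = rng.standard_normal(500)
    y_dep = x ** 2 + 0.1 * rng.standard_normal(500)    # nonlinearly dependent on x
    y_ind = rng.standard_normal(500)                   # independent of x

    print(copula_dependence(x, y_dep))                 # noticeably larger
    print(copula_dependence(x, y_ind))                 # close to zero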

There is a wide variety of copula functions for general purposes; they have been used in different research works and have reached good results.

Page 18:

This is a sample copula function obtained for sample data, using the Bernstein polynomial copula (figure).

Page 19:

The main advantages of our idea are as follows:

- All of the preprocessing has a computational complexity of about O(nm^2). So if we use K-Means (O(nm)), we obtain an algorithm with an overall computational complexity of O(nm^2) that detects clusters with nonlinear independency (whereas SVD has O(n^3 m) and handles only linear independency, with n >> m).
- Copula functions do not care about the data distribution. Surprisingly, we can even use them for two variables with different distributions. SVD, on the other hand, is only suitable for Gaussian data distributions.

Page 20:

- Main Approach in Concept Extraction
- Problems
- Clustering Methods and LSI
- Ideas and Our Works
- Experimental Results

Page 21:

For testing our ideas we did the following:

- First we obtained popular datasets. They are all from University College Dublin (UCD), School of Computer Science and Informatics, Machine Learning Group.
- Next we studied the structure of the SVD and K-Means algorithms (obtaining K-Means using core-sets).
- We used MATLAB to implement the algorithms.
- We tested SVD and K-Means on the datasets. For example, one concept group that we obtained for the BBC dataset consists of the following terms: juventu, cocain, romanian, alessandro, luciano, adrian, chelsea, ruin, bayern, drug, fifa, club, ...; another concept group is about printers, and so on. A sketch of how such a term group can be read off the SVD factors follows below.
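
The sketch below shows how the top terms of each concept can be read off the U factor; it is illustrative only and reuses the U matrix and vocab list from the earlier pipeline sketch rather than the actual BBC dataset.

    import numpy as np

    def top_terms_per_concept(U, vocab, k=2, n_terms=10):
        """For each of the first k concepts (columns of U), list the terms
        with the largest absolute loadings."""
        groups = []
        for j in range(min(k, U.shape[1])):
            order = np.argsort(np.abs(U[:, j]))[::-1][:n_terms]
            groups.append([vocab[i] for i in order])
        return groups

    # Usage with the factors from the earlier pipeline sketch:
    # for j, terms in enumerate(top_terms_per_concept(U, vocab, k=2)):
    #     print("concept", j, ":", terms)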

Page 22:

Now we should implement the copula approach in MATLAB and compare the results with the common SVD and K-Means.