An unsupervised framework for effective indexing of Big Data
Ramakrishna Sakhamuri, Dr. Pradeep Chowriappa


Page 1: An unsupervised framework for effective indexing of BigData

An unsupervised framework for effective indexing of Big Data

Ramakrishna Sakhamuri, Dr. Pradeep Chowriappa

Page 2: An unsupervised framework for effective indexing of BigData

Outline:
• Indexing
• Regular File Systems
• Relational Databases
• Decision Making
• HDFS
• BIRCH Algorithm
• Project

Page 3: An unsupervised framework for effective indexing of BigData

Sample output of a Data Analysis:

http://www.medsci.org/v13/p0099/ijmsv13p0099g003.jpg

Introduction:

Page 4: An unsupervised framework for effective indexing of BigData

Indexing:
• An index is a systematic arrangement of entries designed to enable users to locate required information.
• The process of creating an index is called indexing.

Page 5: An unsupervised framework for effective indexing of BigData

File System:
• Every file system maintains an index tree/table that helps the user store, parse, and retrieve files.
• We can parse a file system in two ways (a small sketch of both follows):
• Name driven.
• Content driven.
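As an illustration only (this code is ours, not from the slides), a minimal Python sketch of the two traversal styles using the standard library:

```python
import os

def name_driven(root, filename):
    """Name-driven parse: walk the directory index, matching on file names."""
    for dirpath, _, files in os.walk(root):
        if filename in files:
            yield os.path.join(dirpath, filename)

def content_driven(root, needle):
    """Content-driven parse: open each file and scan its contents."""
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="ignore") as fh:
                    if needle in fh.read():
                        yield path
            except OSError:
                pass  # unreadable files are skipped
```

Name-driven search only touches the index, while content-driven search must read every file, which is why indexes matter at scale.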

Page 6: An unsupervised framework for effective indexing of BigData


Page 7: An unsupervised framework for effective indexing of BigData

Databases:
• Primary Index
• Secondary Index

Page 8: An unsupervised framework for effective indexing of BigData

Traditional Decision Making System:

http://www.jmir.org/article/viewFile/3555/1/39724

Page 9: An unsupervised framework for effective indexing of BigData

What are we looking at?

Page 10: An unsupervised framework for effective indexing of BigData

• Integration of Indexing and Clustering.

• Building a secondary index on top of existing file system index.

Page 11: An unsupervised framework for effective indexing of BigData


Hadoop System Architecture:

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. 

Page 12: An unsupervised framework for effective indexing of BigData

HDFS Architecture:
HDFS stands for Hadoop Distributed File System, a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

Page 13: An unsupervised framework for effective indexing of BigData

Design of HDFS:

• Very large files: "Very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size.

• Streaming data access: Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

• Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to run on clusters of commodity hardware.

Page 14: An unsupervised framework for effective indexing of BigData

HDFS Concepts:

• Blocks: A disk has a block size, which is the minimum amount of data that it can read or write. Disk blocks are typically 512 bytes, while filesystem blocks are typically a few kilobytes. HDFS, too, has the concept of a block, but it is a much larger unit: 64 MB by default. As in a filesystem for a single disk, files in HDFS are broken into block-sized chunks.

• Namenodes and Datanodes: An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers).
• The namenode manages the filesystem namespace. It also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently.
• Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks that they are storing.

Page 15: An unsupervised framework for effective indexing of BigData

Data Flow (Data Read):
• The client opens the file it wishes to read by calling open() on the FileSystem object.
• DistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks in the file. For each block, the namenode returns the addresses of the datanodes that have a copy of that block.
• DistributedFileSystem returns an FSDataInputStream to the client for it to read data from.
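The flow above describes Hadoop's Java client. As a hedged illustration only, a client-side read from Python could use the third-party `hdfs` WebHDFS package (an assumption on our part; the slides do not name a client library), which hides the namenode lookup and datanode streaming behind one call:

```python
from hdfs import InsecureClient  # third-party WebHDFS client (assumed, not from the slides)

client = InsecureClient('http://localhost:50070', user='hadoop')  # hypothetical namenode address
# The namenode RPC for block locations and the datanode reads happen behind read().
with client.read('/user/hadoop/input/data.csv') as reader:
    data = reader.read()
```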

Page 16: An unsupervised framework for effective indexing of BigData

Data Flow (Data Write):
• DFSOutputStream splits the data the client writes into packets, which it writes to an internal queue called the data queue.
• The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas.
• The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline; the second node then passes it on to the third.
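The packet splitting and datanode pipeline happen inside HDFS; a client only hands over the bytes. Again a hedged sketch with the same assumed `hdfs` package:

```python
from hdfs import InsecureClient  # third-party WebHDFS client (assumed)

client = InsecureClient('http://localhost:50070', user='hadoop')  # hypothetical namenode address
# Splitting into packets, the data queue, and the replica pipeline are all
# handled by HDFS itself; the client just uploads the local file.
client.upload('/user/hadoop/input', 'data.csv')
```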

Page 17: An unsupervised framework for effective indexing of BigData

BIRCH:

BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies. It is especially appropriate for very large data sets. The BIRCH clustering algorithm consists of two main phases.

Phase 1: Build the CF Tree. Load the data into memory by building a cluster-feature tree (CF tree, defined below). Optionally, condense this initial CF tree into a smaller one.

Phase 2: Global Clustering. Apply an existing clustering algorithm on the leaves of the CF tree. Optionally, refine these clusters.
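The project uses its own implementation (see the later slides), but scikit-learn's Birch shows the same two phases in a few lines; its threshold and branching factor map directly onto the CF-tree parameters defined below:

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.rand(10000, 2)  # stand-in for a large numeric dataset

# Phase 1: fit() builds the CF tree. Phase 2: because n_clusters=3,
# a global (agglomerative) clustering is run over the CF-tree leaves.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(len(model.subcluster_centers_), "leaf subclusters,", len(set(labels)), "global clusters")
```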

Page 18: An unsupervised framework for effective indexing of BigData

Cluster Feature:

BIRCH clustering achieves its high efficiency by clever use of a small set of summary statistics to represent a larger set of data points. For clustering purposes, these summary statistics constitute a CF and are a sufficient substitute for the actual data.

A CF is a set of three summary statistics that represent a set of data points in a single cluster:
• Count: how many data values are in the cluster.
• Linear Sum: the sum of the individual coordinates. This is a measure of the location of the cluster.
• Squared Sum: the sum of the squared coordinates. This is a measure of the spread of the cluster.
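A minimal sketch of a CF as just described (our own illustration, not the project's classes); the additivity in merge() is what lets a parent entry stay equal to the sum of its children:

```python
import numpy as np

class CF:
    """Cluster Feature: (N, LS, SS) summary of the points in one subcluster."""

    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n = 1                  # Count: number of data values
        self.ls = p.copy()          # Linear Sum: location of the cluster
        self.ss = float(p @ p)      # Squared Sum: spread of the cluster

    def add(self, point):
        """Absorb one point; no raw data needs to be kept."""
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += p
        self.ss += float(p @ p)

    def merge(self, other):
        """CF additivity: the CF of a union of clusters is the sum of their CFs."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        """R = sqrt(SS/N - ||LS/N||^2), compared against the threshold T."""
        spread = self.ss / self.n - float(self.ls @ self.ls) / self.n ** 2
        return float(np.sqrt(max(spread, 0.0)))
```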

Page 19: An unsupervised framework for effective indexing of BigData

Cluster Feature (cont'd):

Page 20: An unsupervised framework for effective indexing of BigData

CF Tree:

A CF tree is a tree structure composed of CFs; it represents the data in a compact, characteristic form. There are three important parameters for any CF tree:
• Branching factor B: the maximum number of children allowed for a non-leaf node.
• Threshold T: the upper limit for the radius of a cluster in a leaf node.
• L: the maximum number of entries in a leaf node.
For a CF entry in a root node or a non-leaf node, that CF entry equals the sum of the CF entries in the child nodes of that entry.

Page 21: An unsupervised framework for effective indexing of BigData

Building a CF Tree (a sketch of this descent follows the steps):

1. Compare the incoming CF with each CF in the root node, using the linear sum or mean of the CF. Keep a dictionary that holds each CF and the respective distance.
2. Enter the branch whose CF is closest to the incoming CF. If the current node is a non-leaf node, repeat this step until a leaf node is reached.
3. At the leaf node, compare the incoming record's CF to the leaf node's CFs and select the closest one. Then perform one of (a) or (b):
a. If the radius of the chosen leaf entry, including the new record, does not exceed the threshold T, then the incoming record is assigned to that leaf entry and all of its parent CFs are updated.
b. If the radius of the chosen leaf entry, including the new record, does exceed the threshold T, then a new leaf entry is formed, consisting of the incoming record only, and the parent is updated.
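A compact sketch of that descent, reusing the CF class from the earlier slide (the node shape with .entries and .children is our assumption, not the project's code):

```python
import copy
import numpy as np

def closest(entries, x):
    """Index of the CF entry whose centroid (mean of the CF) is nearest to point x."""
    return min(range(len(entries)),
               key=lambda i: np.linalg.norm(entries[i].centroid() - np.asarray(x)))

def insert(node, x, T):
    """Descend to a leaf and apply case (a) or (b).
    A node is assumed to have .entries (list of CF) and .children (empty at a leaf).
    Splitting a leaf that exceeds L entries is omitted for brevity."""
    if node.children:                        # non-leaf: follow the closest entry
        i = closest(node.entries, x)
        insert(node.children[i], x, T)
        node.entries[i].add(x)               # keep parent CF = sum of child CFs
    else:                                    # leaf node
        i = closest(node.entries, x)
        trial = copy.deepcopy(node.entries[i])
        trial.add(x)
        if trial.radius() <= T:              # case (a): fits within threshold T
            node.entries[i] = trial
        else:                                # case (b): new CF for this record only
            node.entries.append(CF(x))
```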

Page 22: An unsupervised framework for effective indexing of BigData

Radius of a cluster:

The radius of any candidate cluster is calculated and compared with the given threshold value T. With CF = (N, LS, SS), the formula used in the implementation is

$R = \sqrt{\dfrac{SS}{N} - \left\lVert \dfrac{LS}{N} \right\rVert^{2}}$

Page 23: An unsupervised framework for effective indexing of BigData

CF Tree structure:

(Figure: general structure of a CF tree, with branching factor B and up to L entries in each leaf node.)

Page 24: An unsupervised framework for effective indexing of BigData

Sample steps:

Page 25: An unsupervised framework for effective indexing of BigData

Project Data Flow:

1. Writing files into HDFS.
2. Pulling the block address information from HDFS.
3. Passing the addresses and the data to the BIRCH algorithm.
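The slides do not show the mechanism behind step 2; one plausible way (an assumption on our part, not the project's confirmed method) is to parse the report of Hadoop's fsck tool, which lists every block of a file together with the datanodes holding it:

```python
import subprocess

def block_locations(hdfs_path):
    """Return the fsck report lines that mention block IDs and datanode addresses.
    Assumes a Hadoop 1.x-style CLI; the project's actual method is not shown."""
    report = subprocess.check_output(
        ["hadoop", "fsck", hdfs_path, "-files", "-blocks", "-locations"],
        universal_newlines=True)
    return [line for line in report.splitlines() if "blk_" in line]

# Example: block_locations('/user/hadoop/input/data.csv')
```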

Page 26: An unsupervised framework for effective indexing of BigData

BIRCH Algorithm in Python:

In this project we implemented the BIRCH algorithm in Python. We imported a package which contains a few ready-made classes, such as cftree and cfnode, plus two classes, non-leaf node and leaf node, which inherit from cfnode.

We designed a BIRCH program which creates an instance of this cftree class and passes in the input data, along with a few other values such as the branching factor, initial diameter, and maximum node entries.

The data given as input should contain only numbers, since the algorithm deals entirely in linear and squared sums. Hence we preprocess the data before passing it to the algorithm.
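A minimal preprocessing sketch under that constraint (the file names are hypothetical), keeping only numeric columns so the linear and squared sums are well defined:

```python
import pandas as pd

df = pd.read_csv("records.csv")               # hypothetical raw input
numeric = df.select_dtypes(include="number")  # BIRCH works on numbers only
numeric = numeric.dropna()                    # the sums cannot absorb missing values
numeric.to_csv("birch_input.csv", index=False)
```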

Page 27: An unsupervised framework for effective indexing of BigData

BIRCH Algorithm in Python (cont'd):

Once cftree gets the data, it uses the other classes (cfnode, non-leaf node, leaf node) and builds the CF tree, which looks similar to the tree structure shown earlier.

Once the tree is built, it returns all the leaves and the details of each leaf: how many CFs the leaf contains, the list of all the individual CFs in the leaf, how many successors its parent non-leaf node has, and the address of the non-leaf node it belongs to.

Page 28: An unsupervised framework for effective indexing of BigData

Hadoop Implementation:

• Install a virtual machine on Windows or OS X.
• Install Ubuntu on the virtual machine.
• Download and install the Hadoop package on Ubuntu.
• Tell Hadoop where Java is installed.
• For pseudo-distributed mode, change the configuration files (a sketch follows):
a. core-site.xml -> to set the default schema and authority.
b. hdfs-site.xml -> to set dfs.replication to 1 rather than the default of three; otherwise all blocks would always be flagged as under-replicated.
c. mapred-site.xml -> to specify the host and port pair where the JobTracker runs.
• Format the namenode.
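As a sketch of step (b) only, in the file's own XML format (Hadoop 1.x property name; the value is the usual single-node choice, not taken from the slides):

```xml
<!-- hdfs-site.xml: single node, so blocks are not flagged as under-replicated -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```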

Page 29: An unsupervised framework for effective indexing of BigData

Project Implementation:

Both of these scripts, client.sh and client.py, are designed to run in the background at all times.

► Execute the shell script client.sh as a background process:
► bash $PATH/client.sh > $LOGFILE 2>&1 &
► This client script handles the following steps:
► It looks for files in the INPUT directory and, once it gets any files, moves them into the Hadoop processing directory.
► Then it loads the data into Hadoop and handshakes with the Python process to proceed with the further steps.

► Execute the client.py Python script alongside the above script:
► python $PATH/client.py > $LOGFILE 2>&1 &
► This script handles the following steps (a sketch of its loop follows):
► It looks for the Hadoop-processed files and, once it gets those files, pulls their block addresses from HDFS.
► It gives the data files one by one to BIRCH, along with their respective addresses.
► After every run it writes the entire tree to a new file in a predefined location.
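A hedged sketch of what such a polling loop might look like. The directory names and the helpers birch_run and write_tree are hypothetical placeholders (the slides do not show them), and block_locations is the fsck sketch from the data-flow slide:

```python
import glob
import os
import time

WATCH_DIR = "/data/hadoop_processed"   # hypothetical hand-off directory
DONE_DIR = "/data/done"                # hypothetical archive directory
TREE_DIR = "/data/trees"               # predefined output location for trees

while True:                            # runs in the background, like client.sh
    for path in glob.glob(os.path.join(WATCH_DIR, "*.csv")):
        hdfs_path = "/user/hadoop/input/" + os.path.basename(path)  # assumed layout
        addresses = block_locations(hdfs_path)   # fsck sketch from earlier
        tree = birch_run(path, addresses)        # hypothetical BIRCH driver
        write_tree(tree, TREE_DIR)               # one new tree file per run
        os.rename(path, os.path.join(DONE_DIR, os.path.basename(path)))
    time.sleep(30)                     # poll interval is an arbitrary choice
```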

Page 30: An unsupervised framework for effective indexing of BigData

Sample Output

Page 31: An unsupervised framework for effective indexing of BigData

References:

1. Larose, D. T. (2015). Data Mining and Predictive Analytics. Wiley.
2. Bouguettaya, A., Yu, Q., et al. (2014). Efficient agglomerative hierarchical clustering. Expert Systems with Applications.
3. Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. In Proc. ACM SIGMOD.
4. https://codemphasis.files.wordpress.com/2012/09/hdfs-arch.jpg