An unsupervised framework for effective indexing of Big Data
Ramakrishna Sakhamuri, Dr. Pradeep Chowriappa


Page 1: An unsupervised framework for effective indexing of BigData

An unsupervised framework for effective indexing of Big Data

Ramakrishna Sakhamuri, Dr. Pradeep Chowriappa

Page 2: An unsupervised framework for effective indexing of BigData

Outline:
• Indexing
• Regular File Systems
• Relational Databases
• Decision Making
• HDFS
• BIRCH Algorithm
• Project

Page 3: An unsupervised framework for effective indexing of BigData

Sample output of a Data Analysis:

http://www.medsci.org/v13/p0099/ijmsv13p0099g003.jpg

Introduction:

Page 4: An unsupervised framework for effective indexing of BigData

Indexing:
• An index is a systematic arrangement of entries designed to enable users to locate required information.
• The process of creating an index is called indexing.

Page 5: An unsupervised framework for effective indexing of BigData

File System:
• Every file system maintains an index tree/table that helps the user store, parse, and retrieve files.
• We can parse a file system in two ways (a small sketch of both follows):
• Name driven.
• Content driven.
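As an illustration only (this code is ours, not from the slides), a minimal Python sketch of the two traversal styles using the standard library:

```python
import os

def name_driven(root, filename):
    """Name-driven parse: walk the directory index, matching on file names."""
    for dirpath, _, files in os.walk(root):
        if filename in files:
            yield os.path.join(dirpath, filename)

def content_driven(root, needle):
    """Content-driven parse: open each file and scan its contents."""
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="ignore") as fh:
                    if needle in fh.read():
                        yield path
            except OSError:
                pass  # unreadable files are skipped
```

Name-driven search only touches the index, while content-driven search must read every file, which is why indexes matter at scale.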

Page 6: An unsupervised framework for effective indexing of BigData


Page 7: An unsupervised framework for effective indexing of BigData

Databases:
• Primary Index
• Secondary Index

Page 8: An unsupervised framework for effective indexing of BigData

Traditional Decision Making System:

http://www.jmir.org/article/viewFile/3555/1/39724

Page 9: An unsupervised framework for effective indexing of BigData

What are we looking at?

Page 10: An unsupervised framework for effective indexing of BigData

• Integration of Indexing and Clustering.

• Building a secondary index on top of existing file system index.

Page 11: An unsupervised framework for effective indexing of BigData


Hadoop System Architecture:

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. 

Page 12: An unsupervised framework for effective indexing of BigData

HDFS Architecture:
HDFS stands for Hadoop Distributed File System, a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

Page 13: An unsupervised framework for effective indexing of BigData

Design of HDFS:

• Very large files: "Very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size.

• Streaming data access: Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

• Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to run on clusters of commodity hardware.

Page 14: An unsupervised framework for effective indexing of BigData

HDFS Concepts:

• Blocks: A disk has a block size, which is the minimum amount of data that it can read or write. Disk blocks are typically 512 bytes, while filesystem blocks are typically a few kilobytes. HDFS, too, has the concept of a block, but it is a much larger unit: 64 MB by default. As in a filesystem for a single disk, files in HDFS are broken into block-sized chunks.

• Namenodes and Datanodes: An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers).
• The namenode manages the filesystem namespace. It also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently.
• Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks that they are storing.

Page 15: An unsupervised framework for effective indexing of BigData

Data Flow (Data Read):
• The client opens the file it wishes to read by calling open() on the FileSystem object.
• DistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks in the file. For each block, the namenode returns the addresses of the datanodes that have a copy of that block.
• DistributedFileSystem returns an FSDataInputStream to the client for it to read data from.
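The flow above describes Hadoop's Java client. As a hedged illustration only, a client-side read from Python could use the third-party `hdfs` WebHDFS package (an assumption on our part; the slides do not name a client library), which hides the namenode lookup and datanode streaming behind one call:

```python
from hdfs import InsecureClient  # third-party WebHDFS client (assumed, not from the slides)

client = InsecureClient('http://localhost:50070', user='hadoop')  # hypothetical namenode address
# The namenode RPC for block locations and the datanode reads happen behind read().
with client.read('/user/hadoop/input/data.csv') as reader:
    data = reader.read()
```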

Page 16: An unsupervised framework for effective indexing of BigData

Data Flow (Data Write):
• DFSOutputStream splits the data the client writes into packets, which it writes to an internal queue called the data queue.
• The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas.
• The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline; the second node then passes it on to the third.
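The packet splitting and datanode pipeline happen inside HDFS; a client only hands over the bytes. Again a hedged sketch with the same assumed `hdfs` package:

```python
from hdfs import InsecureClient  # third-party WebHDFS client (assumed)

client = InsecureClient('http://localhost:50070', user='hadoop')  # hypothetical namenode address
# Splitting into packets, the data queue, and the replica pipeline are all
# handled by HDFS itself; the client just uploads the local file.
client.upload('/user/hadoop/input', 'data.csv')
```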

Page 17: An unsupervised framework for effective indexing of BigData

BIRCH:

BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies. It is especially appropriate for very large data sets. The BIRCH clustering algorithm consists of two main phases.

Phase 1: Build the CF Tree. Load the data into memory by building a cluster-feature tree (CF tree, defined below). Optionally, condense this initial CF tree into a smaller one.

Phase 2: Global Clustering. Apply an existing clustering algorithm on the leaves of the CF tree. Optionally, refine these clusters.
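The project uses its own implementation (see the later slides), but scikit-learn's Birch shows the same two phases in a few lines; its threshold and branching factor map directly onto the CF-tree parameters defined below:

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.rand(10000, 2)  # stand-in for a large numeric dataset

# Phase 1: fit() builds the CF tree. Phase 2: because n_clusters=3,
# a global (agglomerative) clustering is run over the CF-tree leaves.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(len(model.subcluster_centers_), "leaf subclusters,", len(set(labels)), "global clusters")
```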

Page 18: An unsupervised framework for effective indexing of BigData

Cluster Feature:

BIRCH clustering achieves its high efficiency by clever use of a small set of summary statistics to represent a larger set of data points. For clustering purposes, these summary statistics constitute a CF and are a sufficient substitute for the actual data.

A CF is a set of three summary statistics that represent a set of data points in a single cluster:
• Count: how many data values are in the cluster.
• Linear Sum: the sum of the individual coordinates. This is a measure of the location of the cluster.
• Squared Sum: the sum of the squared coordinates. This is a measure of the spread of the cluster.
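A minimal sketch of a CF as just described (our own illustration, not the project's classes); the additivity in merge() is what lets a parent entry stay equal to the sum of its children:

```python
import numpy as np

class CF:
    """Cluster Feature: (N, LS, SS) summary of the points in one subcluster."""

    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n = 1                  # Count: number of data values
        self.ls = p.copy()          # Linear Sum: location of the cluster
        self.ss = float(p @ p)      # Squared Sum: spread of the cluster

    def add(self, point):
        """Absorb one point; no raw data needs to be kept."""
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += p
        self.ss += float(p @ p)

    def merge(self, other):
        """CF additivity: the CF of a union of clusters is the sum of their CFs."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        """R = sqrt(SS/N - ||LS/N||^2), compared against the threshold T."""
        spread = self.ss / self.n - float(self.ls @ self.ls) / self.n ** 2
        return float(np.sqrt(max(spread, 0.0)))
```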

Page 19: An unsupervised framework for effective indexing of BigData

Cluster Feature (cont'd):

Page 20: An unsupervised framework for effective indexing of BigData

CF Tree:

A CF tree is a tree structure composed of CFs; it represents the data in a compact, characteristic form. There are three important parameters for any CF tree:
• Branching factor B: the maximum number of children allowed for a non-leaf node.
• Threshold T: the upper limit for the radius of a cluster in a leaf node.
• L: the maximum number of entries in a leaf node.
For a CF entry in a root node or a non-leaf node, that CF entry equals the sum of the CF entries in the child nodes of that entry.

Page 21: An unsupervised framework for effective indexing of BigData

Building a CF Tree (a sketch of this descent follows the steps):

1. Compare the incoming CF with each CF in the root node, using the linear sum or mean of the CF. Keep a dictionary that holds each CF and the respective distance.
2. Enter the branch whose CF is closest to the incoming CF. If the current node is a non-leaf node, repeat this step until a leaf node is reached.
3. At the leaf node, compare the incoming record's CF to the leaf node's CFs and select the closest one. Then perform one of (a) or (b):
a. If the radius of the chosen leaf entry, including the new record, does not exceed the threshold T, then the incoming record is assigned to that leaf entry and all of its parent CFs are updated.
b. If the radius of the chosen leaf entry, including the new record, does exceed the threshold T, then a new leaf entry is formed, consisting of the incoming record only, and the parent is updated.
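A compact sketch of that descent, reusing the CF class from the earlier slide (the node shape with .entries and .children is our assumption, not the project's code):

```python
import copy
import numpy as np

def closest(entries, x):
    """Index of the CF entry whose centroid (mean of the CF) is nearest to point x."""
    return min(range(len(entries)),
               key=lambda i: np.linalg.norm(entries[i].centroid() - np.asarray(x)))

def insert(node, x, T):
    """Descend to a leaf and apply case (a) or (b).
    A node is assumed to have .entries (list of CF) and .children (empty at a leaf).
    Splitting a leaf that exceeds L entries is omitted for brevity."""
    if node.children:                        # non-leaf: follow the closest entry
        i = closest(node.entries, x)
        insert(node.children[i], x, T)
        node.entries[i].add(x)               # keep parent CF = sum of child CFs
    else:                                    # leaf node
        i = closest(node.entries, x)
        trial = copy.deepcopy(node.entries[i])
        trial.add(x)
        if trial.radius() <= T:              # case (a): fits within threshold T
            node.entries[i] = trial
        else:                                # case (b): new CF for this record only
            node.entries.append(CF(x))
```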

Page 22: An unsupervised framework for effective indexing of BigData

Radius of a cluster:

The radius of any candidate cluster is calculated and compared with the given threshold value T. With CF = (N, LS, SS), the formula used in the implementation is

$R = \sqrt{\dfrac{SS}{N} - \left\lVert \dfrac{LS}{N} \right\rVert^{2}}$

Page 23: An unsupervised framework for effective indexing of BigData

CF Tree structure:

(Figure: general structure of a CF tree, with branching factor B and up to L entries in each leaf node.)

Page 24: An unsupervised framework for effective indexing of BigData

Sample steps:

Page 25: An unsupervised framework for effective indexing of BigData

Project Data Flow:

1. Writing files into HDFS.
2. Pulling the block address information from HDFS.
3. Passing the addresses and the data to the BIRCH algorithm.
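The slides do not show the mechanism behind step 2; one plausible way (an assumption on our part, not the project's confirmed method) is to parse the report of Hadoop's fsck tool, which lists every block of a file together with the datanodes holding it:

```python
import subprocess

def block_locations(hdfs_path):
    """Return the fsck report lines that mention block IDs and datanode addresses.
    Assumes a Hadoop 1.x-style CLI; the project's actual method is not shown."""
    report = subprocess.check_output(
        ["hadoop", "fsck", hdfs_path, "-files", "-blocks", "-locations"],
        universal_newlines=True)
    return [line for line in report.splitlines() if "blk_" in line]

# Example: block_locations('/user/hadoop/input/data.csv')
```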

Page 26: An unsupervised framework for effective indexing of BigData

BIRCH Algorithm in Python:

In this project we implemented the BIRCH algorithm in Python. We imported a package which contains a few ready-made classes, such as cftree and cfnode, plus two classes, non-leaf node and leaf node, which inherit from cfnode.

We designed a BIRCH program which creates an instance of this cftree class and passes in the input data, along with a few other values such as the branching factor, initial diameter, and maximum node entries.

The data given as input should contain only numbers, since the algorithm deals entirely in linear and squared sums. Hence we preprocess the data before passing it to the algorithm.
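A minimal preprocessing sketch under that constraint (the file names are hypothetical), keeping only numeric columns so the linear and squared sums are well defined:

```python
import pandas as pd

df = pd.read_csv("records.csv")               # hypothetical raw input
numeric = df.select_dtypes(include="number")  # BIRCH works on numbers only
numeric = numeric.dropna()                    # the sums cannot absorb missing values
numeric.to_csv("birch_input.csv", index=False)
```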

Page 27: An unsupervised framework for effective indexing of BigData

BIRCH Algorithm in Python (cont'd):

Once cftree gets the data, it uses the other classes (cfnode, non-leaf node, leaf node) and builds the CF tree, which looks similar to the tree structure shown earlier.

Once the tree is built, it returns all the leaves and the details of each leaf: how many CFs the leaf contains, the list of all the individual CFs in the leaf, how many successors its parent non-leaf node has, and the address of the non-leaf node it belongs to.

Page 28: An unsupervised framework for effective indexing of BigData

Hadoop Implementation:

• Install a virtual machine on Windows or OS X.
• Install Ubuntu on the virtual machine.
• Download and install the Hadoop package on Ubuntu.
• Tell Hadoop where Java is installed.
• For pseudo-distributed mode, change the configuration files (a sketch follows):
a. core-site.xml -> to set the default schema and authority.
b. hdfs-site.xml -> to set dfs.replication to 1 rather than the default of three; otherwise all blocks would always be flagged as under-replicated.
c. mapred-site.xml -> to specify the host and port pair where the JobTracker runs.
• Format the namenode.
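As a sketch of step (b) only, in the file's own XML format (Hadoop 1.x property name; the value is the usual single-node choice, not taken from the slides):

```xml
<!-- hdfs-site.xml: single node, so blocks are not flagged as under-replicated -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```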

Page 29: An unsupervised framework for effective indexing of BigData

Project Implementation:

Both of these scripts, client.sh and client.py, are designed to run in the background at all times.

► Execute the shell script client.sh as a background process:
► bash $PATH/client.sh > $LOGFILE 2>&1 &
► This client script handles the following steps:
► It looks for files in the INPUT directory and, once it gets any files, moves them into the Hadoop processing directory.
► Then it loads the data into Hadoop and handshakes with the Python process to proceed with the further steps.

► Execute the client.py Python script alongside the above script:
► python $PATH/client.py > $LOGFILE 2>&1 &
► This script handles the following steps (a sketch of its loop follows):
► It looks for the Hadoop-processed files and, once it gets those files, pulls their block addresses from HDFS.
► It gives the data files one by one to BIRCH, along with their respective addresses.
► After every run it writes the entire tree to a new file in a predefined location.
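A hedged sketch of what such a polling loop might look like. The directory names and the helpers birch_run and write_tree are hypothetical placeholders (the slides do not show them), and block_locations is the fsck sketch from the data-flow slide:

```python
import glob
import os
import time

WATCH_DIR = "/data/hadoop_processed"   # hypothetical hand-off directory
DONE_DIR = "/data/done"                # hypothetical archive directory
TREE_DIR = "/data/trees"               # predefined output location for trees

while True:                            # runs in the background, like client.sh
    for path in glob.glob(os.path.join(WATCH_DIR, "*.csv")):
        hdfs_path = "/user/hadoop/input/" + os.path.basename(path)  # assumed layout
        addresses = block_locations(hdfs_path)   # fsck sketch from earlier
        tree = birch_run(path, addresses)        # hypothetical BIRCH driver
        write_tree(tree, TREE_DIR)               # one new tree file per run
        os.rename(path, os.path.join(DONE_DIR, os.path.basename(path)))
    time.sleep(30)                     # poll interval is an arbitrary choice
```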

Page 30: An unsupervised framework for effective indexing of BigData

Sample Output

Page 31: An unsupervised framework for effective indexing of BigData

References:

1. Larose, D. T. (2015). Data Mining and Predictive Analytics. Wiley.
2. Bouguettaya, A., Yu, Q., et al. (2014). Efficient agglomerative hierarchical clustering. Expert Systems with Applications.
3. Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. In Proc. ACM SIGMOD.
4. https://codemphasis.files.wordpress.com/2012/09/hdfs-arch.jpg