ets train ppt_big_data_basics_v2.0

Post on 15-Apr-2017

Category:

Data & Analytics


Big Data Basics

AUTHOR: MITHUN BANERJEE

DATE: 05-OCTOBER-2016

COPYRIGHT PROTECTED BY ECLIPSE TECHNOCONSULTING GLOBAL (P) LTD.

What is Big Data?

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. -- Wikipedia

Is the above definition fully comprehensive?

Let's go deeper in the next slides.

Data units to measure exponential growth of data over the years

VOLUME of DATA
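As a rough illustration of the data units mentioned above, the sketch below formats a byte count with decimal (SI) prefixes, where each unit is 1000 times the previous one. The function name `human_readable` is hypothetical, and the choice of decimal rather than binary (1024-based) units is an assumption made for this example.

```python
# Decimal (SI) data units: each step up is a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_readable(num_bytes: float) -> str:
    """Format a byte count using the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000
    return f"{value:.1f} {UNITS[-1]}"  # not reached; keeps the return type explicit


print(human_readable(1_500))       # 1.5 KB
print(human_readable(2 * 10**15))  # 2.0 PB
```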

Type of data

• Relational Data (Tables/Transaction/Legacy Data)

• Text Data (Web)

• Semi-structured Data (XML)

• Graph Data: Social Network, Semantic Web (RDF), …

• Streaming Data: you can only scan the data once

• A single application can be generating/collecting many types of data

• Big Public Data (online, weather, finance, etc)

Variety (complexities) of data
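The streaming-data constraint above (the data can only be scanned once) can be illustrated with reservoir sampling, one common single-pass technique. This is a minimal sketch rather than anything from the slides: the function name `reservoir_sample` and the simulated stream are assumptions for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Keep a uniform random sample of k items while reading the stream once."""
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Simulated stream: the iterator is consumed once and never rewound.
sample = reservoir_sample(iter(range(10_000)), k=5, rng=random.Random(42))
print(sample)
```

Memory use stays at k items no matter how long the stream is, which is the point of single-pass processing.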

Velocity of data

Late decisions mean missed opportunities

Example: healthcare monitoring. Sensors monitor your activities and body; any abnormal measurement requires an immediate reaction.
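The healthcare example can be sketched as a check applied to each reading as it arrives, so that an abnormal value is flagged immediately rather than in a later batch. The heart-rate limits, the `(timestamp, bpm)` record shape, and the function name `abnormal_readings` are all hypothetical and are not clinical guidance.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical limits for illustration only, not clinical thresholds.
LOW_BPM, HIGH_BPM = 40, 140


def abnormal_readings(
    readings: Iterable[Tuple[int, float]]
) -> Iterator[Tuple[int, float]]:
    """Yield each (timestamp, bpm) reading that falls outside the allowed range."""
    for timestamp, bpm in readings:
        if bpm < LOW_BPM or bpm > HIGH_BPM:
            # Yielding right away lets the caller react before the next reading.
            yield timestamp, bpm


stream = [(0, 72.0), (1, 75.0), (2, 168.0), (3, 70.0), (4, 35.0)]
alerts = list(abnormal_readings(stream))
print(alerts)  # [(2, 168.0), (4, 35.0)]
```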

Velocity of data

• Social media and networks (all of us are generating data)

• Scientific instruments (collecting all sorts of data)

• Sensor technology and networks (measuring all kinds of data)

REAL TIME / FAST DATA

3Vs: Volume, Variety, Velocity

4Vs

Generation and Consumption of Data

In past

In present

OLTP: ONLINE TRANSACTION PROCESSING (DBMS)

OLAP: ONLINE ANALYTICAL PROCESSING (DATA WAREHOUSING)

RTAP: REAL-TIME ANALYTICS PROCESSING (BIG DATA ARCHITECTURE & TECHNOLOGY)

Driver of Data

- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time

- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

The Evolution of Business Intelligence

1990's: BI Reporting, OLAP & Data Warehouse
Business Objects, SAS, Informatica, Cognos, other SQL reporting tools

2000's: Interactive Business Intelligence & In-memory RDBMS
QlikView, Tableau, HANA

2010's: Big Data: Batch Processing & Distributed Data Store
Hadoop/Spark; HBase/Cassandra

2010's: Big Data: Real Time & Single View
Graph Databases

(Diagram axes: Speed, Scale)

Topic 1: Data Analytics & Data Mining

• EXPLORATORY DATA ANALYSIS
• LINEAR CLASSIFICATION (PERCEPTRON & LOGISTIC REGRESSION)
• LINEAR REGRESSION
• C4.5 DECISION TREE
• APRIORI
• K-MEANS CLUSTERING
• EM ALGORITHM
• PAGERANK & HITS
• COLLABORATIVE FILTERING
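To make one item from the Topic 1 list concrete, here is a minimal sketch of K-means clustering (Lloyd's algorithm) on 2-D points in plain Python. The function names, the sample points, and the fixed starting centroids are assumptions chosen for this example; real implementations usually pick the initial centroids at random.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def k_means(points: Sequence[Point], centroids: Sequence[Point],
            max_iter: int = 100) -> List[Point]:
    """Alternate assignment and update steps until the centroids stop moving."""
    current = list(centroids)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters: List[List[Point]] = [[] for _ in current]
        for p in points:
            nearest = min(range(len(current)),
                          key=lambda i: squared_distance(p, current[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated: List[Point] = []
        for old, members in zip(current, clusters):
            if members:
                updated.append((sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == current:
            break
        current = updated
    return current


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centers)  # one centroid near (1.33, 1.33), the other near (8.33, 8.33)
```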

Topic 2: Hadoop/MapReduce Programming & Data Processing

• ARCHITECTURE OF HADOOP, HDFS, AND YARN
• PROGRAMMING ON HADOOP
• BASIC DATA PROCESSING: SORT AND JOIN
• INFORMATION RETRIEVAL USING HADOOP
• DATA MINING USING HADOOP (KMEANS+HISTOGRAMS)
• MACHINE LEARNING ON HADOOP (EM)
• HIVE/PIG
• HBASE AND CASSANDRA
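The MapReduce programming model behind Topic 2 can be sketched in plain Python as three phases: map, shuffle (group and sort by key), and reduce. This is a single-process simulation for illustration and does not use the Hadoop API; the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key and sort the keys, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


lines = ["big data is big", "data is everywhere"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```

On a real cluster the map and reduce functions run in parallel on many machines, and the shuffle moves data between them over the network.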

Topic 3: Graph Database and Graph Analytics

GRAPH DATABASE (http://en.wikipedia.org/wiki/Graph_database)

• Native Graph Database (Neo4j)

• Pregel/Giraph (Distributed Graph Processing Engine)

NEO4J/TITAN/GRAPHLAB/GRAPHSQL
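As a small example of graph analytics, the sketch below computes PageRank (listed under Topic 1) by power iteration over an adjacency list. It is a single-machine illustration, not how Neo4j or Giraph expose the algorithm. The four-node graph, the damping factor of 0.85, and the iteration count are assumptions chosen for this example, and every link target is assumed to appear as a key in the graph.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank on a graph given as node -> outgoing links."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                # Split this node's rank evenly across its outgoing links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```

In this sample graph, node C receives links from three of the four nodes and ends up with the highest score.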

References for in-depth homework reading

• Hadoop: The Definitive Guide, Tom White, O’Reilly

• Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han et al.

• https://www.mongodb.com/collateral/big-data-examples-and-guidelines-enterprise-decision-maker

• http://www.aptude.com/blog/entry/hadoop-vs-mongodb-which-platform-is-better-for-handling-big-data

• http://www.slideshare.net/wlaforest/an-introduction-to-big-data-nosql-and-mongodb

• http://www.infoworld.com/article/2608460/application-development/the-10-worst-big-data-practices.html

THANK YOU ETS
