
Page 1: Making Big Data, small

MAKING BIG DATA, SMALL

Using distributed systems for processing, analysing and managing huge data sets

Marcin Jedyk

Software Professional’s Network, Cheshire Datasystems Ltd

Page 2: Making Big Data, small

WARM-UP QUESTIONS

How many of you have heard about Big Data before?

How many about NoSQL?

Hadoop?

Page 3: Making Big Data, small

AGENDA

Intro – motivation, goal and ‘not about…’

What is Big Data?

NoSQL and systems classification

Hadoop & HDFS

MapReduce & live demo

HBase

Page 4: Making Big Data, small

AGENDA

Pig

Building Hadoop cluster

Conclusions

Q&A

Page 5: Making Big Data, small

MOTIVATION

Data is everywhere – why not analyse it?

With Hadoop and NoSQL systems, building distributed systems is easier than before

Relying on software & cheap hardware rather than expensive hardware works better!

Page 6: Making Big Data, small

MOTIVATION

Page 7: Making Big Data, small

GOAL

To explain basic ideas behind Big Data

To present different approaches towards BD

To show that Big Data systems are easy to build

To show you where to start with such systems

Page 8: Making Big Data, small

WHAT IS IT NOT ABOUT?

Not a detailed lecture on a single system

Not about advanced techniques in Big Data

Not only about technology – but also about its application

Page 9: Making Big Data, small

WHAT IS BIG DATA?

Data characterised by 3 Vs:

Volume

Variety

Velocity

The interesting ones: variety & velocity

Page 10: Making Big Data, small

WHAT IS BIG DATA?

Data of high velocity: cannot store it? Process it on the fly!

Data of high variety: doesn't fit into a relational schema? Don't use a schema, use NoSQL!

Data which is impractical to process on a single server

Page 11: Making Big Data, small

NO-SQL

Hand in hand with Big Data

NoSQL – an umbrella term for non-relational databases and data stores

It's not always possible to replace an RDBMS with NoSQL! (the opposite is also true)

Page 12: Making Big Data, small

NO-SQL

NoSQL DBs are built around different principles

Key-value stores: Redis, Riak

Document stores: e.g. MongoDB – each record is a document; each entry has its own metadata (JSON-like, BSON)

Table stores: e.g. HBase – data persisted in multiple columns (even millions), billions of rows and multiple versions of records

Page 13: Making Big Data, small

HADOOP

Existed before the 'Big Data' buzzword emerged

A simple idea – MapReduce

A primary purpose – to crunch tera- and petabytes of data

HDFS as the underlying distributed file system
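
As a taste of what 'HDFS as a file system' means in practice, here is a minimal Java sketch that reads a file straight out of HDFS; the NameNode address and the log path are assumptions for illustration, not anything from this talk.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address -- adjust to your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical log file stored in HDFS.
        Path logFile = new Path("/logs/access.log");
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(logFile)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}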

Page 14: Making Big Data, small

HADOOP – ARCHITECTURE BY EXAMPLE

Imagine you need to process 1TB of logs

What would you need?

A server!

Page 15: Making Big Data, small

HADOOP – ARCHITECTURE BY EXAMPLE

But 1TB is quite a lot of data… we want it quicker!

OK, what about a distributed environment?

Page 16: Making Big Data, small

HADOOP – ARCHITECTURE BY EXAMPLE

So what about that Hadoop stuff?

Each node can store data & process it (DataNode & TaskTracker)

Page 17: Making Big Data, small

HADOOP – ARCHITECTURE BY EXAMPLE

How about allocating jobs to slaves? We need a JobTracker!

Page 18: Making Big Data, small

HADOOP – ARCHITECTURE BY EXAMPLE

How about HDFS – how are data blocks assembled into files?

The NameNode does it.

Page 19: Making Big Data, small

HADOOP – ARCHITECTURE BY EXAMPLE

NameNode – manages HDFS metadata, doesn’t deal with files directly

JobTracker – schedules, allocates and monitors job execution on slaves – TaskTrackers

TaskTracker – runs MapReduce operations

DataNode – stores blocks of HDFS – default replication level for each block: 3

Page 20: Making Big Data, small

HADOOP - LIMITATIONS

DataNodes & TaskTrackers are fault tolerant

NameNode & JobTracker are NOT! (workarounds exist for this problem)

HDFS deals nicely with large files, but doesn't do well with billions of small files

Page 21: Making Big Data, small

MAP_REDUCE

MapReduce – a parallelisation approach

Two main stages:

Map – do an actual bit of work, e.g. extract info

Reduce – summarise, aggregate or filter outputs from the Map operation

For each job there are multiple Map and Reduce operations – each may run on a different node = parallelism

Page 22: Making Big Data, small

MAP_REDUCE – AN EXAMPLE

Let's process 1TB of raw logs and extract traffic by host.

After submitting a job, the JobTracker allocates tasks to slaves – the input is possibly divided into 64MB blocks = 16384 Map operations!

Map – analyse logs and return them as a set of <key,value> pairs

Reduce – merge the outputs of the Map operations
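
To make the flow concrete, here is a hedged sketch of how such a job could be wired up with Hadoop's Java MapReduce API; the class names (TrafficByHostJob, TrafficMapper, TrafficReducer) and the HDFS paths are assumptions for illustration, and the Mapper and Reducer themselves are sketched after the next slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TrafficByHostJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "traffic-by-host");
        job.setJarByClass(TrafficByHostJob.class);

        job.setMapperClass(TrafficMapper.class);
        // The reducer can double as a combiner here, since summing is associative.
        job.setCombinerClass(TrafficReducer.class);
        job.setReducerClass(TrafficReducer.class);

        job.setOutputKeyClass(Text.class);           // key: IP address
        job.setOutputValueClass(LongWritable.class); // value: bytes transferred

        // Hypothetical HDFS paths for the raw logs and the result.
        FileInputFormat.addInputPath(job, new Path("/logs/raw"));
        FileOutputFormat.setOutputPath(job, new Path("/logs/traffic-by-host"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}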

Page 23: Making Big Data, small

MAP_REDUCE – AN EXAMPLE

Take a look at a mocked log extract:

[IP – bandwidth]

10.0.0.1 – 1234

10.0.0.1 – 900

10.0.0.2 – 1230

10.0.0.3 – 999

Page 24: Making Big Data, small

MAP_REDUCE – AN EXAMPLE

It's important to define the key – in this case the IP (note the two 10.0.0.1 entries, 1234 + 900, collapse into a single pair):

<10.0.0.1;2134>

<10.0.0.2;1230>

<10.0.0.3;999>

Now, assume another Map operation returned:

<10.0.0.1;1500>

<10.0.0.3;1000>

<10.0.0.4;500>
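
The Map side of this example might look roughly like the Java sketch below; the parsing of the 'IP – bandwidth' line format is an assumption based on the mocked extract, not code from the talk. (Strictly, a plain Mapper emits one pair per line; the per-split totals shown above are what you get once a combiner or the Reducer sums them.)

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <IP, bytes> for every log line of the form "10.0.0.1 - 1234".
public class TrafficMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final Text ip = new Text();
    private final LongWritable bytes = new LongWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed format: IP first, bandwidth last, separated by whitespace.
        String[] parts = line.toString().trim().split("\\s+");
        if (parts.length < 2) {
            return; // skip malformed lines
        }
        long value;
        try {
            value = Long.parseLong(parts[parts.length - 1]);
        } catch (NumberFormatException e) {
            return; // skip lines without a numeric bandwidth field
        }
        ip.set(parts[0]);
        bytes.set(value);
        context.write(ip, bytes);
    }
}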

Page 25: Making Big Data, small

MAP_REDUCE – AN EXAMPLE

Now, Reduce will merge those results:

<10.0.0.1;3634>

<10.0.0.2;1230>

<10.0.0.3;1999>

<10.0.0.4;500>
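
And the Reduce side, again only as an illustrative sketch: it sums every value that arrives for a given IP, which is exactly the merge shown above.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all <IP, bytes> pairs emitted by the mappers into one total per IP.
public class TrafficReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private final LongWritable total = new LongWritable();

    @Override
    protected void reduce(Text ip, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(ip, total); // e.g. <10.0.0.1; 3634>
    }
}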

Page 26: Making Big Data, small

MAP_REDUCE

Selecting a key is important

It's possible to define a composite key, e.g. IP+date

For more complex tasks, it's possible to chain MapReduce jobs

Page 27: Making Big Data, small

HBASE

Another layer on top of Hadoop/HDFS

A distributed data store

Not a replacement for an RDBMS!

Can be used with MapReduce

Good for unstructured data – no need to worry about the exact schema in advance
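
A minimal sketch of what reading and writing HBase from Java can look like; the 'traffic' table, the 'stats' column family and the row key are assumptions for illustration, and the code uses the post-1.0 HBase client API, which is slightly newer than what was current at the time of this talk.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("traffic"))) {

            // Write: row key = IP, one cell in the 'stats' column family.
            Put put = new Put(Bytes.toBytes("10.0.0.1"));
            put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("bytes"),
                          Bytes.toBytes(3634L));
            table.put(put);

            // Read the cell back.
            Result result = table.get(new Get(Bytes.toBytes("10.0.0.1")));
            long bytes = Bytes.toLong(
                result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("bytes")));
            System.out.println("10.0.0.1 -> " + bytes);
        }
    }
}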

Page 28: Making Big Data, small

PIG – HBASE ENHANCEMENT

HBase lacks a proper query language

Pig – makes life easier for HBase users

Translates queries into MapReduce jobs

When working with Pig or HBase, forget what you know about SQL – it makes your life easier
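
As a rough illustration of 'translates queries into MapReduce jobs', the sketch below embeds a small Pig Latin pipeline in Java via PigServer and recomputes the traffic-by-host example over plain HDFS files; the paths and schema are assumptions, and querying HBase instead would use a loader such as HBaseStorage.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class TrafficByHostPig {
    public static void main(String[] args) throws Exception {
        // Runs the Pig Latin pipeline below as MapReduce jobs on the cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Hypothetical paths and schema, mirroring the earlier log example.
        pig.registerQuery(
            "logs = LOAD '/logs/raw' USING PigStorage(' ') "
            + "AS (ip:chararray, sep:chararray, bytes:long);");
        pig.registerQuery("grouped = GROUP logs BY ip;");
        pig.registerQuery(
            "traffic = FOREACH grouped GENERATE group AS ip, SUM(logs.bytes) AS total;");

        // Pig compiles this pipeline into MapReduce jobs and writes the result to HDFS.
        pig.store("traffic", "/logs/traffic-by-host-pig");
    }
}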

Page 29: Making Big Data, small

BUILDING HADOOP CLUSTER

Post-production servers are OK

Don’t take ‘cheap hardware’ too literally

Good connection between nodes is a must!

>=1Gbps between nodes

>=10Gbps between racks

1 disk per CPU core

More RAM, more caching!

Page 30: Making Big Data, small

FINAL CONCLUSIONS

Hadoop and NoSQL-like databases/data stores scale very well

Hadoop is ideal for crunching huge data sets

Does very well in production environments

The cluster of slaves is fault tolerant; the NameNode and JobTracker are not!

Page 31: Making Big Data, small

EXTERNAL RESOURCES

Trending Topic – built on Wikipedia access logs: http://goo.gl/BWWO1

Building web crawler with Hadoop: http://goo.gl/xPTlJ

Analysing adverse drug events: http://goo.gl/HFXAx

Moving average for large data sets: http://goo.gl/O4oml

Page 32: Making Big Data, small

EXTERNAL RESOURCES – USEFUL LINKS

http://www.slideshare.net/fullscreen/jpatanooga/la-hug-dec-2011-recommendation-talk/1

https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation+Guide

http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

http://hstack.org/hbase-performance-testing/

http://www.theregister.co.uk/2012/06/12/hortonworks_data_platform_one/

http://wiki.apache.org/hadoop/MachineScaling

http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf

http://www.cloudera.com/resource-types/video/

http://hstack.org/why-were-using-hbase-part-2/

Page 33: Making Big Data, small

QUESTIONS?