
Page 1: Making Big Data, small

MAKING BIG DATA, SMALL

Using distributed systems for processing, analysing and managing huge data sets

Marcin Jedyk

Software Professional’s Network, Cheshire Datasystems Ltd

Page 2: Making Big Data, small

WARM-UP QUESTIONS

How many of you have heard about Big Data before?

How many about NoSQL?

Hadoop?

Page 3: Making Big Data, small

AGENDA

Intro – motivation, goal and ‘not about…’

What is Big Data?

NoSQL and systems classification

Hadoop & HDFS

MapReduce & live demo

HBase

Page 4: Making Big Data, small

AGENDA

Pig

Building Hadoop cluster

Conclusions

Q&A

Page 5: Making Big Data, small

MOTIVATION

Data is everywhere – why not analyse it?

With Hadoop and NoSQL systems, building distributed systems is easier than before

Relying on software & cheap hardware rather than expensive hardware works better!

Page 6: Making Big Data, small

MOTIVATION

Page 7: Making Big Data, small

GOAL

To explain basic ideas behind Big Data

To present different approaches towards BD

To show that Big Data systems are easy to build

To show you where to start with such systems

Page 8: Making Big Data, small

WHAT IS IT NOT ABOUT?

Not a detailed lecture on a single system

Not about advanced techniques in Big Data

Not only about technology – but also about its application

Page 9: Making Big Data, small

WHAT IS BIG DATA?

Data characterised by 3 Vs:

Volume

Variety

Velocity

The interesting ones: variety & velocity

Page 10: Making Big Data, small

WHAT IS BIG DATA?

Data of high velocity: cannot store it? Process it on the fly!

Data of high variety: doesn't fit into a relational schema? Don't use a schema, use NoSQL!

Data which is impractical to process on a single server

Page 11: Making Big Data, small

NO-SQL

Hand in hand with Big Data

NoSQL – an umbrella term for non-relational databases and data stores

It's not always possible to replace an RDBMS with NoSQL! (the opposite is also true)

Page 12: Making Big Data, small

NO-SQL

NoSQL DBs are built around different principles

Key-value stores: Redis, Riak

Document stores: e.g. MongoDB – each record is a document; each entry has its own metadata (JSON-like, BSON)

Table stores: e.g. HBase – data persisted in multiple columns (even millions), billions of rows and multiple versions of records

Page 13: Making Big Data, small

HADOOP

Existed before the 'Big Data' buzzword emerged

A simple idea – MapReduce

A primary purpose – to crunch tera- and petabytes of data

HDFS as the underlying distributed file system
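
As a taste of what 'HDFS as a file system' means in practice, here is a minimal Java sketch that reads a file straight out of HDFS; the NameNode address and the log path are assumptions for illustration, not anything from this talk.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address -- adjust to your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical log file stored in HDFS.
        Path logFile = new Path("/logs/access.log");
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(logFile)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}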

Page 14: Making Big Data, small

HADOOP – ARCHITECTURE BY EXAMPLE

Imagine you need to process 1TB of logs

What would you need?

A server!

Page 15: Making Big Data, small

HADOOP – ARCHITECTURE BY EXAMPLE

But 1TB is quite a lot of data… we want it quicker!

OK, what about a distributed environment?

Page 16: Making Big Data, small

HADOOP – ARCHITECTURE BY EXAMPLE

So what about that Hadoop stuff?

Each node can store data & process it (DataNode & TaskTracker)

Page 17: Making Big Data, small

HADOOP – ARCHITECTURE BY EXAMPLE

How about allocating jobs to slaves? We need a JobTracker!

Page 18: Making Big Data, small

HADOOP – ARCHITECTURE BY EXAMPLE

How about HDFS – how are data blocks assembled into files?

The NameNode does it.

Page 19: Making Big Data, small

HADOOP – ARCHITECTURE BY EXAMPLE

NameNode – manages HDFS metadata, doesn’t deal with files directly

JobTracker – schedules, allocates and monitors job execution on slaves – TaskTrackers

TaskTracker – runs MapReduce operations

DataNode – stores blocks of HDFS – default replication level for each block: 3

Page 20: Making Big Data, small

HADOOP - LIMITATIONS

DataNodes & TaskTrackers are fault tolerant

NameNode & JobTracker are NOT! (workarounds exist for this problem)

HDFS deals nicely with large files, but doesn't do well with billions of small files

Page 21: Making Big Data, small

MAP_REDUCE

MapReduce – a parallelisation approach

Two main stages:

Map – do an actual bit of work, e.g. extract info

Reduce – summarise, aggregate or filter outputs from the Map operation

For each job there are multiple Map and Reduce operations – each may run on a different node = parallelism

Page 22: Making Big Data, small

MAP_REDUCE – AN EXAMPLE

Let's process 1TB of raw logs and extract traffic by host.

After submitting a job, the JobTracker allocates tasks to slaves – the input is possibly divided into 64MB blocks = 16384 Map operations!

Map – analyse logs and return them as a set of <key,value> pairs

Reduce – merge the outputs of the Map operations
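
To make the flow concrete, here is a hedged sketch of how such a job could be wired up with Hadoop's Java MapReduce API; the class names (TrafficByHostJob, TrafficMapper, TrafficReducer) and the HDFS paths are assumptions for illustration, and the Mapper and Reducer themselves are sketched after the next slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TrafficByHostJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "traffic-by-host");
        job.setJarByClass(TrafficByHostJob.class);

        job.setMapperClass(TrafficMapper.class);
        // The reducer can double as a combiner here, since summing is associative.
        job.setCombinerClass(TrafficReducer.class);
        job.setReducerClass(TrafficReducer.class);

        job.setOutputKeyClass(Text.class);           // key: IP address
        job.setOutputValueClass(LongWritable.class); // value: bytes transferred

        // Hypothetical HDFS paths for the raw logs and the result.
        FileInputFormat.addInputPath(job, new Path("/logs/raw"));
        FileOutputFormat.setOutputPath(job, new Path("/logs/traffic-by-host"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}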

Page 23: Making Big Data, small

MAP_REDUCE – AN EXAMPLE

Take a look at a mocked log extract:

[IP – bandwidth]

10.0.0.1 – 1234

10.0.0.1 – 900

10.0.0.2 – 1230

10.0.0.3 – 999

Page 24: Making Big Data, small

MAP_REDUCE – AN EXAMPLE

It's important to define the key – in this case the IP (note the two 10.0.0.1 entries, 1234 + 900, collapse into a single pair):

<10.0.0.1;2134>

<10.0.0.2;1230>

<10.0.0.3;999>

Now, assume another Map operation returned:

<10.0.0.1;1500>

<10.0.0.3;1000>

<10.0.0.4;500>
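
The Map side of this example might look roughly like the Java sketch below; the parsing of the 'IP – bandwidth' line format is an assumption based on the mocked extract, not code from the talk. (Strictly, a plain Mapper emits one pair per line; the per-split totals shown above are what you get once a combiner or the Reducer sums them.)

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <IP, bytes> for every log line of the form "10.0.0.1 - 1234".
public class TrafficMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final Text ip = new Text();
    private final LongWritable bytes = new LongWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed format: IP first, bandwidth last, separated by whitespace.
        String[] parts = line.toString().trim().split("\\s+");
        if (parts.length < 2) {
            return; // skip malformed lines
        }
        long value;
        try {
            value = Long.parseLong(parts[parts.length - 1]);
        } catch (NumberFormatException e) {
            return; // skip lines without a numeric bandwidth field
        }
        ip.set(parts[0]);
        bytes.set(value);
        context.write(ip, bytes);
    }
}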

Page 25: Making Big Data, small

MAP_REDUCE – AN EXAMPLE

Now, Reduce will merge those results:

<10.0.0.1;3634>

<10.0.0.2;1230>

<10.0.0.3;1999>

<10.0.0.4;500>
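
And the Reduce side, again only as an illustrative sketch: it sums every value that arrives for a given IP, which is exactly the merge shown above.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all <IP, bytes> pairs emitted by the mappers into one total per IP.
public class TrafficReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private final LongWritable total = new LongWritable();

    @Override
    protected void reduce(Text ip, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(ip, total); // e.g. <10.0.0.1; 3634>
    }
}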

Page 26: Making Big Data, small

MAP_REDUCE

Selecting a key is important

It's possible to define a composite key, e.g. IP+date

For more complex tasks, it's possible to chain MapReduce jobs

Page 27: Making Big Data, small

HBASE

Another layer on top of Hadoop/HDFS

A distributed data store

Not a replacement for an RDBMS!

Can be used with MapReduce

Good for unstructured data – no need to worry about the exact schema in advance
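
A minimal sketch of what reading and writing HBase from Java can look like; the 'traffic' table, the 'stats' column family and the row key are assumptions for illustration, and the code uses the post-1.0 HBase client API, which is slightly newer than what was current at the time of this talk.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("traffic"))) {

            // Write: row key = IP, one cell in the 'stats' column family.
            Put put = new Put(Bytes.toBytes("10.0.0.1"));
            put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("bytes"),
                          Bytes.toBytes(3634L));
            table.put(put);

            // Read the cell back.
            Result result = table.get(new Get(Bytes.toBytes("10.0.0.1")));
            long bytes = Bytes.toLong(
                result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("bytes")));
            System.out.println("10.0.0.1 -> " + bytes);
        }
    }
}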

Page 28: Making Big Data, small

PIG – HBASE ENHANCEMENT

HBase lacks a proper query language

Pig – makes life easier for HBase users

Translates queries into MapReduce jobs

When working with Pig or HBase, forget what you know about SQL – it makes your life easier
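
As a rough illustration of 'translates queries into MapReduce jobs', the sketch below embeds a small Pig Latin pipeline in Java via PigServer and recomputes the traffic-by-host example over plain HDFS files; the paths and schema are assumptions, and querying HBase instead would use a loader such as HBaseStorage.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class TrafficByHostPig {
    public static void main(String[] args) throws Exception {
        // Runs the Pig Latin pipeline below as MapReduce jobs on the cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Hypothetical paths and schema, mirroring the earlier log example.
        pig.registerQuery(
            "logs = LOAD '/logs/raw' USING PigStorage(' ') "
            + "AS (ip:chararray, sep:chararray, bytes:long);");
        pig.registerQuery("grouped = GROUP logs BY ip;");
        pig.registerQuery(
            "traffic = FOREACH grouped GENERATE group AS ip, SUM(logs.bytes) AS total;");

        // Pig compiles this pipeline into MapReduce jobs and writes the result to HDFS.
        pig.store("traffic", "/logs/traffic-by-host-pig");
    }
}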

Page 29: Making Big Data, small

BUILDING HADOOP CLUSTER

Post-production servers are OK

Don’t take ‘cheap hardware’ too literally

Good connection between nodes is a must!

>=1Gbps between nodes

>=10Gbps between racks

1 disk per CPU core

More RAM, more caching!

Page 30: Making Big Data, small

FINAL CONCLUSIONS

Hadoop and NoSQL-like databases/data stores scale very well

Hadoop is ideal for crunching huge data sets

Does very well in production environments

The cluster of slaves is fault tolerant; the NameNode and JobTracker are not!

Page 31: Making Big Data, small

EXTERNAL RESOURCES

Trending Topic – built on Wikipedia access logs: http://goo.gl/BWWO1

Building web crawler with Hadoop: http://goo.gl/xPTlJ

Analysing adverse drug events: http://goo.gl/HFXAx

Moving average for large data sets: http://goo.gl/O4oml

Page 32: Making Big Data, small

EXTERNAL RESOURCES – USEFUL LINKS

http://www.slideshare.net/fullscreen/jpatanooga/la-hug-dec-2011-recommendation-talk/1

https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation+Guide

http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

http://hstack.org/hbase-performance-testing/

http://www.theregister.co.uk/2012/06/12/hortonworks_data_platform_one/

http://wiki.apache.org/hadoop/MachineScaling

http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf

http://www.cloudera.com/resource-types/video/

http://hstack.org/why-were-using-hbase-part-2/

Page 33: Making Big Data, small

QUESTIONS?