big data technology

31
1 / 31 BIG DATA TECHNOLOGY Juanjo Mostazo c-base Berlin May 2014

Upload: juan-j-mostazo

Post on 29-Jan-2018

819 views

Category:

Engineering


2 download

TRANSCRIPT

1 / 31

BIG DATA TECHNOLOGY

● Juanjo Mostazo● c-base Berlin● May 2014

2 / 31

Roadmap

● Map Reduce● Hadoop

● Concepts● HDFS● Architecture

● Hadoop Ecosystem● Lambda Architecture● New trends

3 / 31

M/R: Motivation

● Process big amount of data to produce other data● Scale up vs Scale out

4 / 31

M/R: What is it?

● Different programming paradigm● Based on a google paper (2004)● Automatic parallelization and distribution● I/O Scheduling● Fault tolerance● Status and monitoring

5 / 31

M/R: The paradigm

● Input & Output: set of key/value pairs● Big amount of data group & sort● Job = Two phases = Mapper & Reducer

● Map (in_key, in_value) →list(interm_key, interm_value)

● Reduce (interm_key, list(interm_value)) →list (out_key, out_value)

6 / 31

M/R: Example (word counter)

7 / 31

M/R: Workflow

8 / 31

Roadmap

● Map Reduce● Hadoop

● Concepts● HDFS● Architecture

● Hadoop Ecosystem● Lambda Architecture● New trends

9 / 31

Hadoop: What is it?

● Framework based on GMR / GFS● Apache project● Developed in Java● Multiple applications● Used by many companies● De-facto standard in community

10 / 31

Hadoop: HDFS Architecture

11 / 31

Hadoop: HDFS concepts

● Distributed file system. Layer on top ext3, xfs...● Works better on huge files● Redundancy (default 3)● Bad seeking, no append!● Good rack scale. Not good data center scale● File divided in 128Mb – 256Mb blocks● Computation is sent to data!

12 / 31

Hadoop: Architecture v1

13 / 31

Hadoop: Architecture v2

14 / 31

Hadoop: Architecture v3

15 / 31

M/R: Example (word counter)

16 / 31

Hadoop: Clustering

17 / 31

Hadoop: Advanced

● Distributed caches● Partitioner● Sort comparator● Group comparator● Combiner● Input format & Record reader● MultiInput● MultiOutput● Compression

18 / 31

Hadoop: Conclusions

● Simplify large-scale computation● Hide parallel programming issues● Easy to get into & develop (huge doc)● Deeply used & maintained by community● Possibility to throw away RDBMs! (Bottleneck)

19 / 31

Roadmap

● Map Reduce● Hadoop

● Concepts● HDFS● Architecture

● Hadoop Ecosystem● Lambda Architecture● New trends

20 / 31

Hadoop: Ecosystem

21 / 31

Roadmap

● Map Reduce● Hadoop

● Concepts● HDFS● Architecture

● Hadoop Ecosystem● Lambda Architecture● New trends

22 / 31

Lambda Architecture: Motivation

● Real time use cases● Business analytics

● Batch processing vs Real Time● Problem!

● Low latency read & update● Scalable & fault tolerant● Something else needed!

23 / 31

Lambda Architecture: Schema

24 / 31

Lambda Architecture: Example 1

25 / 31

Lambda Architecture: Example 2

26 / 31

Lambda Architecture: Lambdoop

● Unified technology stack● High level programming environment● Management tools

27 / 31

Roadmap

● Map Reduce● Hadoop

● Concepts● HDFS● Architecture

● Hadoop Ecosystem● Lambda Architecture● New trends

28 / 31

New trends: Architecture

● Hadoop vs Hadoop2● Columnar storage

29 / 31

New trends: Storm

● Stream processing● Tuples● Streams● Spouts● Bolts● Topologies

● Twitter

30 / 31

New trends: Spark

● Next generation MapReduce● Integrated but not dependent on Hadoop● Fast memory optimized execution engine● Avoids many Hadoop problems

● Overhead● High latency● Many disk writes

● In-memory cache● Flexible executions graph● Much faster than MapReduce (up to 100x)● Shark (SQL)● Support streaming (beta)

31 / 31

BIG DATA TECHNOLOGY

● Juanjo Mostazo● [email protected]● http://www.slideshare.net/juanjmostazo/mr-hadoop-cbase