Processing Paradigms in Big Data: Current State, Trends and Opportunities

Dr. Rubén Casado

[email protected]

ruben_casado


Universidad Complutense de Madrid, 19 November 2014

Agenda

1. Big Data processing
2. Batch processing
3. Streaming processing
4. Hybrid computation model
5. Open Issues & Conclusions

Academics
• PhD in Software Engineering
• MSc in Computer Science
• BSc in Computer Science

Work Experience


What is Big Data?

A massive volume of both structured and unstructured data that is too large to process with traditional database and software techniques.

How is Big Data defined?

"Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."

- Gartner IT Glossary -

3 problems
• Volume
• Variety
• Velocity

3 solutions
• Batch processing (Volume)
• NoSQL (Variety)
• Streaming processing (Velocity)

Science or Engineering?


Volume, Variety, Value, Velocity

Software Engineering

Data Science


Relational databases
• Schema based
• ACID (Atomicity, Consistency, Isolation, Durability)
• Performance penalty
• Scalability issues

NoSQL
• Not Only SQL
• Families of solutions
• Google BigTable, Amazon Dynamo
• BASE = Basically Available, Soft state, Eventually consistent
• CAP = Consistency, Availability, Partition tolerance


NoSQL

Key-value
• Key: ID
• Value: associated data
• Dictionary
• LinkedIn Voldemort, Riak, Redis, Memcached, Membase
• Example: CR7: 'Cristiano Ronaldo'

Document
• More complex than key-value
• Documents are indexed by ID
• Multiple indexes
• MongoDB, CouchDB
• Example: CR7: {Name: 'Cristiano', Surname: 'Ronaldo', Age: 29}

Column
• Tables with predefined families of fields
• Fields within families are flexible
• Vertical and horizontal partitioning
• HBase, Cassandra
• Example: CR7: [Personal: {Name: 'Cristiano', Surname: 'Ronaldo', Age: 29}, Job: {Team: 'R. Madrid', Salary: 20.000.000}]

Graph
• Nodes
• Relationships
• Neo4j, FlockDB, OrientDB
• Example: [CR] -is_named-> [Cristiano]; [CR] -plays_for-> [R.Madrid]
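As a rough illustration, the same entity can be mimicked in the four NoSQL families with plain Python structures. This is only a sketch: no real datastore API is used, the variable and function names are made up here, and the relationship labels are English renderings of the ones on the slide.

```python
# Hypothetical in-memory stand-ins for the four NoSQL families (no real datastore is used).

# Key-value: an opaque value looked up by its key.
key_value = {"CR7": "Cristiano Ronaldo"}

# Document: the value is a structured document indexed by ID.
document = {"CR7": {"Name": "Cristiano", "Surname": "Ronaldo", "Age": 29}}

# Column family: predefined families, flexible fields inside each family.
column_family = {
    "CR7": {
        "Personal": {"Name": "Cristiano", "Surname": "Ronaldo", "Age": 29},
        "Job": {"Team": "R. Madrid", "Salary": 20_000_000},
    }
}

# Graph: nodes plus labelled relationships stored as (source, label, target).
edges = [("CR", "is_named", "Cristiano"), ("CR", "plays_for", "R.Madrid")]


def neighbours(node, label):
    """Return the targets reached from `node` through relationships named `label`."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]


print(key_value["CR7"])
print(document["CR7"]["Age"])
print(column_family["CR7"]["Job"]["Team"])
print(neighbours("CR", "plays_for"))
```

The structure gets richer from top to bottom: a key-value store can only fetch the whole value, while the document and column models can address individual fields and the graph model can follow relationships.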

Batch processing (Volume)
• Scalable
• Large amount of static data
• Distributed
• Parallel
• Fault tolerant
• High latency

Streaming processing (Velocity)
• Low latency
• Continuous unbounded streams of data
• Distributed
• Parallel
• Fault-tolerant

Hybrid computation model (Volume + Velocity)
• Low latency: real-time
• Massive data-at-rest + data-in-motion
• Scalable
• Combine batch and streaming results

Hybrid computation model
• All data → Batch processing → Batch results
• New data → Streaming processing → Stream results
• Batch results + Stream results → Combination → Final results
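The combination step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: both layers are assumed to expose a per-station (sum, count) pair for SO2, and the names `batch_view`, `stream_view` and `combine` as well as the numbers are made up for the example.

```python
# Hypothetical views: station id -> (sum of SO2 readings, number of readings).
batch_view = {1: (20.0, 10), 2: (27.0, 10)}   # computed over all historical data
stream_view = {2: (9.0, 2), 3: (7.0, 2)}      # computed over data newer than the last batch run


def combine(batch, stream):
    """Merge both views and return station id -> average SO2."""
    merged = dict(batch)
    for station, (total, count) in stream.items():
        old_total, old_count = merged.get(station, (0.0, 0))
        merged[station] = (old_total + total, old_count + count)
    return {station: total / count for station, (total, count) in merged.items()}


final_results = combine(batch_view, stream_view)
print(final_results)
```

Keeping sums and counts rather than ready-made averages is a design choice of this sketch: averages cannot be merged directly, whereas sums and counts can simply be added.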

Processing Paradigms
• Inception: 2003
• 1st generation (2006): Batch processing. Large amount of static data. Scalable solution. Volume.
• 2nd generation (2010): Streaming processing. Computing streaming data. Low latency. Velocity.
• 3rd generation (2014): Hybrid computation. Lambda Architecture. Volume + Velocity.

+10 years of Big Data processing technologies

Paradigms on the timeline: Batch, Streaming, Hybrid
Years on the timeline: 2003, 2004, 2005, 2006, 2008, 2009, 2010, 2011, 2012, 2013, 2014

• The Google File System
• MapReduce: Simplified Data Processing on Large Clusters
• Doug Cutting starts developing Hadoop
• Yahoo! starts working on Hadoop
• Apache Hadoop is in production
• Nathan Marz creates Storm
• Yahoo! creates S4
• Facebook creates Hive
• Yahoo! creates Pig
• MillWheel: Fault-Tolerant Stream Processing at Internet Scale
• LinkedIn presents Samza
• LinkedIn presents Kafka
• Cloudera presents Flume
• Nathan Marz defines the Lambda Architecture
• Spark stack is open sourced
• Lambdoop & Summingbird first steps
• Stratosphere becomes Apache Flink

Processing Pipeline

DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS

Air Quality case study
• Data acquisition: static stations and mobile sensors in Asturias sending streaming data
• Data storage: historical data of more than 10 years
• Data analysis: monitoring, trend identification, predictions


Batch processing technologies

DATA ACQUISITION
o HDFS commands
o Sqoop
o Flume
o Scribe

DATA STORAGE
o HDFS
o HBase

DATA ANALYTICS
o MapReduce
o Hive
o Pig
o Cascading
o Spark
o Spark SQL (Shark)

HDFS commands (DATA ACQUISITION, BATCH)

• Import to HDFS

hadoop dfs -copyFromLocal <path-to-local> <path-to-remote>

hadoop dfs -copyFromLocal /home/hduser/AirQuality/ /hdfs/AirQuality/

Sqoop (DATA ACQUISITION, BATCH)

• Tool designed for transferring data between HDFS/HBase and structured datastores
• Based on MapReduce
• Includes connectors for multiple databases
o MySQL
o PostgreSQL
o Oracle
o SQL Server
o DB2
o Generic JDBC connector
• Java API

Sqoop (DATA ACQUISITION, BATCH)

1) Import data from database to HDFS

import-all-tables --connect jdbc:mysql://localhost/testDatabase --warehouse-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1

2) Analyze data (Hadoop)

3) Export results to database

export --connect jdbc:mysql://localhost/testDatabase --export-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1

Flume (DATA ACQUISITION, BATCH)

• Service for collecting, aggregating, and moving large amounts of log data
• Simple and flexible architecture based on streaming data flows
• Reliability, scalability, extensibility, manageability
• Supports several log stream types
o Avro
o Syslog
o Netcat

Flume components
• Sources: Avro, Thrift, Exec, JMS, NetCat, Syslog TCP/UDP, HTTP, Custom
• Channels: Memory, JDBC, File
• Sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, Null, HBase, Custom

Flume (DATA ACQUISITION, BATCH)

• Architecture
o Source: waits for events.
o Channel: stores the information until it is consumed by the sink.
o Sink: sends the information towards another agent or system.
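The source, channel and sink roles can be mimicked with a small in-memory sketch. It does not use Flume itself: `MemoryChannel`, `run_source` and `run_sink` are hypothetical names, and the two sample records are made up.

```python
from collections import deque


class MemoryChannel:
    """Toy channel: buffers events until a sink takes them."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        """Return the oldest buffered event, or None when the channel is empty."""
        return self._events.popleft() if self._events else None


def run_source(lines, channel):
    """Toy source: turns each incoming line into an event and puts it on the channel."""
    for line in lines:
        channel.put(line.strip())


def run_sink(channel, destination):
    """Toy sink: drains the channel and forwards every event to the destination."""
    event = channel.take()
    while event is not None:
        destination.append(event)
        event = channel.take()


channel = MemoryChannel()
stored = []  # stands in for HDFS or for the next agent
run_source(["1;43.5298;-5.6734;7\n", "2;43.5361;-5.6372;9\n"], channel)
run_sink(channel, stored)
print(stored)
```

The channel decouples both ends: the source can keep accepting events while the sink is slow, because events wait in the buffer until they are consumed.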

Flume (DATA ACQUISITION, BATCH)

Air quality syslogs: stations send the information to the servers. Flume collects this information and moves it into HDFS for further analysis.

Station; Title; latitude; longitude; Date ; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";



Scribe (DATA ACQUISITION, BATCH)

• Server for aggregating log data streamed in real time from a large number of servers
• There is a Scribe server running on every node in the system, configured to aggregate messages and send them to a central Scribe server (or servers) in larger groups
• The central Scribe server(s) can write the messages to the files that are their final destination

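Scribe (DATA ACQUISITION, BATCH)

• Sending a sensor message to a Scribe server

# Thrift/Scribe Python bindings; host and port below are assumed values
from scribe import scribe
from thrift.transport import TTransport, TSocket
from thrift.protocol import TBinaryProtocol

category = 'mobile'
# e.g. '1; 43.5298; -5.6734; 2000-01-01; 23; 89; 1.97; …'
message = sensor_log.readline()  # sensor_log: an already opened sensor log file
log_entry = scribe.LogEntry(category=category, message=message)

# Create a Scribe client over a framed Thrift transport
socket = TSocket.TSocket(host='localhost', port=1463)
transport = TTransport.TFramedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(trans=transport, strictRead=False, strictWrite=False)
client = scribe.Client(iprot=protocol, oprot=protocol)

transport.open()
result = client.Log(messages=[log_entry])
transport.close()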

HDFS (DATA STORAGE, BATCH)

• Distributed file system for Hadoop
• Master-slave architecture (NameNode and DataNodes)
o NameNode: manages the directory tree and regulates access to files by clients
o DataNodes: store the data
• Files are split into blocks of the same size, and these blocks are stored and replicated across a set of DataNodes
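How a file ends up as replicated blocks can be sketched as follows. This is a simplified illustration, not the HDFS implementation: the 128 MB block size and replication factor 3 are assumed settings, the round-robin placement ignores the rack awareness of a real cluster, and the function names are hypothetical. Note that the last block of a file may be smaller than the others.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of `file_size` bytes is split into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct DataNodes, round-robin (a simplification)."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds the number of DataNodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
    return placement


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)  # assumed block size
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
print([size // MB for size in blocks])
print(placement)
```

In this picture the NameNode would hold the `placement` mapping (metadata only), while the DataNodes hold the block contents.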

HBase (DATA STORAGE, BATCH)

• Open-source, non-relational, distributed, column-oriented database modeled after Google's BigTable
• Random, real-time read/write access to the data
• Not a relational database
o Very light «schema»
• Rows are stored in sorted order

MapReduce (DATA ANALYTICS, BATCH)

• Framework for processing large amounts of data in parallel across a distributed cluster
• Loosely inspired by the classic Divide and Conquer (D&C) strategy
• The developer has to implement the Map and Reduce functions:
o Map: takes the input, which is split into smaller sub-problems distributed to worker nodes, and emits records in the format <K, V>
o Reduce: collects the <K, List(V)> pairs and generates the results

MapReduce (DATA ANALYTICS, BATCH)

• Design patterns
o Joins: reduce-side join, replicated join, semi join
o Sorting: secondary sort, total order sort
o Filtering
o Statistics: AVG, VAR, Count, …
o Top-K
o Binning
o …

MapReduce (DATA ANALYTICS, BATCH)

• Obtain the SO2 average of each station

• Mappers get records and produce the SO2 value as <Station_ID, SO2_value>

[Diagram: the input data is split among three Mappers. Each Mapper emits <Station_ID, SO2_VALUE> pairs, for example <1, 2> <3, 1> <1, 9>; <3, 9> <2, 6> <2, 6> <1, 6>; <2, 0> <2, 8> <1, 2> <3, 9>, which are then passed to the shuffling phase.]

• Each Reducer receives <Station_ID, List<SO2_value>> and computes the average for the station

[Diagram: after shuffling, each Reducer receives <Station_ID, [SO2_1, SO2_2, …, SO2_n]>, sums the values and divides by their count.]

Station_ID, AVG_SO2
1, 2.013
2, 2.695
3, 3.562
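The map, shuffle and reduce phases of this job can be simulated in a single process. This is only a sketch of the data flow, not Hadoop code: the function names are hypothetical and the three records are made-up readings that follow the layout of the air-quality file (SO2 is the sixth field).

```python
from collections import defaultdict


def map_record(line):
    """Map: parse one ';'-separated record and emit (station_id, so2)."""
    fields = [field.strip().strip('"') for field in line.split(";")]
    return fields[0], float(fields[5])


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_station(station_id, values):
    """Reduce: sum the values of one station and divide by their count."""
    return station_id, sum(values) / len(values)


records = [  # made-up readings
    '"1";"A";"43.5";"-5.6";"2001-01-01";"7"',
    '"1";"A";"43.5";"-5.6";"2001-01-02";"5"',
    '"2";"B";"43.6";"-5.7";"2001-01-01";"4"',
]
grouped = shuffle(map(map_record, records))
averages = dict(reduce_station(key, values) for key, values in grouped.items())
print(averages)
```

In a real cluster the map calls run on different nodes and the shuffle moves data over the network, but the contract between the phases is the same.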

Hive (DATA ANALYTICS, BATCH)

• Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets
• Abstraction layer on top of MapReduce
• SQL-like language called HiveQL
• Metastore: central repository of Hive metadata

Hive (DATA ANALYTICS, BATCH)

CREATE TABLE air_quality(Estacion int, Titulo string, latitud double, longitud double, Fecha string, SO2 int, NO int, CO float, …)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA INPATH '/CalidadAire_Gijon' OVERWRITE INTO TABLE air_quality;

• Obtain the SO2 average of each station

SELECT Estacion, Titulo, avg(SO2)
FROM air_quality
GROUP BY Estacion, Titulo;

Pig (DATA ANALYTICS, BATCH)

• Platform for analyzing large data sets
• High-level language for expressing data analysis programs: Pig Latin, a data flow programming language
• Abstraction layer on top of MapReduce
• Procedural language

Pig (DATA ANALYTICS, BATCH)

-- so2 is loaded as double so that AVG can be applied to it
air_quality = LOAD '/CalidadAire_Gijon' USING PigStorage(';')
AS (estacion:chararray, titulo:chararray, latitud:chararray,
longitud:chararray, fecha:chararray, so2:double,
no:chararray, co:chararray, pm10:chararray, o3:chararray,
dd:chararray, vv:chararray, tmp:chararray, hr:chararray,
prb:chararray, rs:chararray, ll:chararray, ben:chararray,
tol:chararray, mxil:chararray, pm25:chararray);

grouped = GROUP air_quality BY estacion;

avg_so2 = FOREACH grouped GENERATE group, AVG(air_quality.so2);

DUMP avg_so2;

Cascading (DATA ANALYTICS, BATCH)

• Cascading is a data processing API and query planner used for defining, sharing, and executing data-processing workflows
• Makes development of complex Hadoop MapReduce workflows easy
• Similar in purpose to Pig

Cascading (DATA ANALYTICS, BATCH)

// define source and sink Taps (sourceScheme, inputPath and outputPath are defined elsewhere)
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextLine( new Fields( "Estacion", "SO2" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

Pipe assembly = new Pipe( "avgSO2" );
assembly = new GroupBy( assembly, new Fields( "Estacion" ) );

// For every Tuple group, average the SO2 argument field
Aggregator avg = new Average( new Fields( "SO2" ) );
assembly = new Every( assembly, new Fields( "SO2" ), avg );

// flowConnector is created elsewhere; it tells Hadoop which jar file to use
Flow flow = flowConnector.connect( "avg-SO2", source, sink, assembly );

// execute the flow, block until complete
flow.complete();

Spark (DATA ANALYTICS, BATCH)

• Cluster computing system for faster data analytics
• Not a modified version of Hadoop
• Compatible with HDFS
• In-memory data storage for very fast iterative processing
• MapReduce-like engine
• API in Scala, Java and Python

• Hadoop MapReduce is slow due to replication, serialization and disk I/O tasks
• Spark is reported to be 10x-100x faster

Spark SQL (DATA ANALYTICS, BATCH)

• Large-scale data warehouse system for Spark
• SQL on top of Spark (aka Shark)
• In practice, HiveQL over Spark
• Up to 100x faster than Hive

Spark (DATA ANALYTICS, BATCH)

Pros
• Faster than the Hadoop ecosystem
• Easier to develop new applications
o (Scala, Java and Python API)

Cons
• Not tested in extremely large clusters yet
• Problems when the Reducer's data does not fit in memory


Real-time processing technologies

DATA ACQUISITION
o Flume

DATA STORAGE
o Kafka
o Kestrel

DATA ANALYTICS
o Flume
o Storm
o Trident
o S4
o Spark Streaming

Kafka (DATA STORAGE, STREAM)

• Kafka is a distributed, partitioned, replicated commit log service
o Producer/Consumer model
o Kafka maintains feeds of messages in categories called topics
o Kafka is run as a cluster
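The idea of a partitioned commit log can be illustrated with a toy in-memory model. This sketch does not use Kafka: the `Topic` class and its methods are hypothetical, the partitioner is a simple stand-in, replication across brokers is left out, and the messages are made up.

```python
class Topic:
    """Toy topic: a fixed number of partitions, each an append-only list of messages."""

    def __init__(self, partitions):
        self.partitions = [[] for _ in range(partitions)]

    def send(self, key, message):
        """Append to the partition chosen from the key; return (partition, offset)."""
        partition = sum(key.encode()) % len(self.partitions)  # simple deterministic partitioner
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition, offset):
        """Return every message of a partition from `offset` onwards."""
        return self.partitions[partition][offset:]


topic = Topic(partitions=2)
topic.send("station-1", "1;2001-01-01;7")
topic.send("station-2", "2;2001-01-01;4")
partition, offset = topic.send("station-1", "1;2001-01-02;5")

# A consumer tracks its own offset, so it can re-read a partition from any position.
print(topic.read(partition, 0))
```

Messages are never removed when they are read; each consumer only advances its own offset, which is what distinguishes a commit log from a classic queue.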

Kafka (DATA STORAGE, STREAM)

Insert the AirQuality sensor log file into the Kafka cluster and consume the information.

// New producer
Producer<String, String> producer = new Producer<String, String>(config);

// Open sensor log file
BufferedReader br…
String line;

while (true) {
    line = br.readLine();
    if (line == null) {
        … // wait
    } else {
        producer.send(new KeyedMessage<String, String>(topic, line));
    }
}

Kafka (DATA STORAGE, STREAM)

AirQuality consumer

ConsumerConnector consumer = Consumer.createJavaConsumerConnector(config);

// Ask for one stream of the topic
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, 1);

Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
KafkaStream<byte[], byte[]> stream = consumerMap.get(topic).get(0);
ConsumerIterator<byte[], byte[]> it = stream.iterator();

while (it.hasNext()) {
    // consume it.next()
}

Kestrel (DATA STORAGE, STREAM)

• Simple distributed message queue
• A single Kestrel server has a set of queues (strictly-ordered FIFO)
• On a cluster of Kestrel servers, the servers don't know about each other and don't do any cross communication
• Kestrel vs Kafka
o Kafka consumers are cheaper (basically just the bandwidth usage)
o Kestrel does not depend on ZooKeeper, which means it is operationally less complex if you don't already have a ZooKeeper installation
o Kafka has significantly better throughput
o Kestrel does not support ordered consumption across a cluster

Flume (DATA ANALYTICS, STREAM)

Interceptor
• Interface org.apache.flume.interceptor.Interceptor
• Can modify or even drop events based on any criteria
• Flume supports chaining of interceptors
• Types:
o Timestamp interceptor
o Host interceptor
o Static interceptor
o UUID interceptor
o Morphline interceptor
o Regex Filtering interceptor
o Regex Extractor interceptor

Flume (DATA ANALYTICS, STREAM)

• The sensors' information must be filtered by "Station 2"
o An interceptor will filter information between Source and Channel

Flume (DATA ANALYTICS, STREAM)

# Write format can be text or writable
…
# Defining channel: Memory type
…
# Defining source: Syslog
…
# Defining sink: HDFS
…
# Defining interceptor (a custom interceptor is referenced through its Builder class)
agent.sources.source.interceptors = i1
agent.sources.source.interceptors.i1.type = org.apache.flume.interceptor.StationFilter$Builder

class StationFilter implements Interceptor
…
// station: value of the Station field parsed from the event body
if (!"2".equals(station))
    discard data;   // intercept(Event) returns null
else
    save data;      // intercept(Event) returns the event
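The filtering logic of such an interceptor can be sketched independently of Flume. This is an illustration only: the function names are hypothetical, the station identifier is assumed to be the first ';'-separated field of the event, and the sample events are made up.

```python
def station_filter(event, station_id="2"):
    """Return the event if its first ';'-separated field equals `station_id`, else None."""
    station = event.split(";", 1)[0].strip().strip('"')
    return event if station == station_id else None


def intercept_all(events):
    """Apply the filter to a batch of events and drop those for which it returned None."""
    return [event for event in map(station_filter, events) if event is not None]


events = ['"1";"Station A";"7"', '"2";"Station B";"4"', "2;5"]
kept = intercept_all(events)
print(kept)
```

Returning `None` for a rejected event mirrors the interceptor contract of dropping it before it reaches the channel, so downstream components only see readings from station 2.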

Storm (DATA ANALYTICS, STREAM)

Hadoop vs Storm terminology
• JobTracker (Hadoop) = Nimbus (Storm)
• TaskTracker (Hadoop) = Supervisor (Storm)
• Job (Hadoop) = Topology (Storm)

• Distributed and scalable real-time computation system
• Doing for real-time processing what Hadoop did for batch processing
• Topology: processing graph. Each node contains processing logic (spouts and bolts). Links between nodes are streams of data
o Spout: source of streams. Reads a data source and emits the data into the topology as a stream
o Bolt: processing unit. Reads data from several streams, does some processing and possibly emits new streams
o Stream: unbounded sequence of tuples. Tuples can contain any serializable object

Storm (DATA ANALYTICS, STREAM)

• AirQuality average values
o Step 1: build the topology

CAReader (Spout) → LineProcessor (Bolt) → AvgValues (Bolt)
Storm (DATA ANALYTICS, STREAM)

o Step 1: build the topology (code)

TopologyBuilder AirAVG = new TopologyBuilder();

AirAVG.setSpout("ca-reader", new CAReader(), 1);

// shuffleGrouping -> even distribution
AirAVG.setBolt("ca-line-processor", new LineProcessor(), 3)
      .shuffleGrouping("ca-reader");

// fieldsGrouping -> tuples with the same field value go to the same task
AirAVG.setBolt("ca-avg-values", new AvgValues(), 2)
      .fieldsGrouping("ca-line-processor", new Fields("id"));
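The difference between the two groupings can be illustrated with a small routing sketch. It is not Storm code: the function names are hypothetical, round-robin is used as a stand-in for an even distribution, and a CRC32 hash is an assumed way of mapping a field value to a task.

```python
import itertools
import zlib


def shuffle_grouping(num_tasks):
    """Return a router that spreads tuples evenly over the tasks (round-robin here)."""
    counter = itertools.count()
    return lambda tup: next(counter) % num_tasks


def fields_grouping(field, num_tasks):
    """Return a router that sends equal values of `field` to the same task."""
    return lambda tup: zlib.crc32(str(tup[field]).encode()) % num_tasks


tuples = [
    {"id": "1", "so2": 7},
    {"id": "2", "so2": 4},
    {"id": "1", "so2": 5},
    {"id": "1", "so2": 6},
]

to_line_processor = shuffle_grouping(3)     # 3 tasks, as for "ca-line-processor"
to_avg_values = fields_grouping("id", 2)    # 2 tasks, as for "ca-avg-values"

shuffle_tasks = [to_line_processor(t) for t in tuples]
field_tasks = [to_avg_values(t) for t in tuples]
print(shuffle_tasks, field_tasks)
```

Parsing a line does not depend on other lines, so any task can do it. Averaging needs every reading of a station to reach the same task, which is why the second bolt is grouped by the `id` field.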

Storm (DATA ANALYTICS, STREAM)

• AirQuality average values
o Step 2: CAReader implementation (IRichSpout interface)

private BufferedReader br;
private SpoutOutputCollector collector;

public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    this.collector = collector;
    // Initialize file
    br = new ……
}

public void nextTuple() {
    try {
        String line = br.readLine();
        if (line != null) {
            collector.emit(new Values(line));
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

Storm (DATA ANALYTICS, STREAM)

o Step 3: LineProcessor implementation (IBasicBolt interface)

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("id", "stationName", "lat", … ));
}

public void execute(Tuple input, BasicOutputCollector collector) {
    // Split the line into its fields and emit them as one tuple
    collector.emit(new Values((Object[]) input.getString(0).split(";")));
}

Storm (DATA ANALYTICS, STREAM)

o Step 4: AvgValues implementation (IBasicBolt interface)

public void execute(Tuple input, BasicOutputCollector collector) {
    // totals and counts are hash maps with the accumulated values of each station
    String id = input.getStringByField("id");
    if (totals.containsKey(id)) {
        item = totals.get(id);
        count = counts.get(id);
    } else {
        // Create new item
    }
    // update values
    item.setSo2(item.getSo2() + Integer.parseInt(input.getStringByField("so2")));
    item.setNo(item.getNo() + Integer.parseInt(input.getStringByField("no")));
    …
}


Trident (DATA ANALYTICS, STREAM)

• High-level abstraction on top of Storm
o Provides high-level operations (joins, filters, projections, aggregations, functions…)

Pros
o Easy, powerful and flexible
o Incremental topology development
o Exactly-once semantics

Cons
o Very few built-in functions
o Lower performance and higher latency than Storm

S4 (DATA ANALYTICS, STREAM)

• Simple Scalable Streaming System
• Distributed, scalable, fault-tolerant platform for processing continuous unbounded streams of data
• Inspired by MapReduce and Actor models of computation
o Data processing is based on Processing Elements (PE)
o Messages are transmitted between PEs in the form of events (Key, Attributes)
o Processing Nodes are the logical hosts to PEs

S4 (DATA ANALYTICS, STREAM)

…
<bean id="split" class="SplitPE">
  <property name="dispatcher" ref="dispatcher"/>
  <property name="keys">
    <list>
      <value>LogLines *</value>
    </list>
  </property>
</bean>
<bean id="average" class="AveragePE">
  <property name="keys">
    <list>
      <value>CAItem stationId</value>
    </list>
  </property>
</bean>
…

Spark Streaming (DATA ANALYTICS, STREAM)

• Spark for real-time processing
• Streaming computation as a series of very short batch jobs (windows)
• Keep state in memory
• API similar to Spark
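The micro-batch idea can be sketched without Spark. This is a minimal illustration under stated assumptions: events are (timestamp, station, SO2) triples with made-up values, the batch interval is 2 seconds, and the function names are hypothetical.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, station, so2) events into consecutive intervals of `batch_interval` seconds."""
    batches = {}
    for timestamp, station, so2 in events:
        batches.setdefault(timestamp // batch_interval, []).append((station, so2))
    return [batches[index] for index in sorted(batches)]


def update_state(state, batch):
    """Run one small batch job: fold the batch into the running (sum, count) per station."""
    for station, so2 in batch:
        total, count = state.get(station, (0.0, 0))
        state[station] = (total + so2, count + 1)
    return state


events = [(0, "1", 7.0), (1, "2", 4.0), (2, "1", 5.0), (5, "1", 6.0)]  # made-up readings
state = {}  # kept in memory across batches
for batch in micro_batches(events, batch_interval=2):
    state = update_state(state, batch)
    print({station: total / count for station, (total, count) in state.items()})
```

Each iteration of the loop plays the role of one short batch job, and the `state` dictionary is the in-memory state carried from one batch to the next. The latency of such a model is bounded below by the batch interval.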



Hybrid Computation Model

• We are at the beginning of this generation
• Short-term Big Data processing goal
• Abstraction layer over the Lambda Architecture
• Promising technologies
o SummingBird
o Lambdoop

SummingBird (HYBRID COMPUTATION MODEL)

• Library to write MapReduce-like processes that can be executed on Hadoop, on Storm or in a hybrid model
• Scala syntax
• The same logic can be executed in batch, real-time and hybrid batch/real-time mode

SummingBird (HYBRID COMPUTATION MODEL)

Pros
• Hybrid computation model
• Same programming model for all processing paradigms
• Extensible

Cons
• MapReduce-like programming
• Scala
• Not as abstract as some users would like

Lambdoop (HYBRID COMPUTATION MODEL)

• Software abstraction layer over open-source technologies
o Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident
• Common patterns and operations (aggregation, filtering, statistics…) already implemented. No MapReduce-like process
• Same single API for the three processing paradigms
o Batch processing similar to Pig / Cascading
o Real-time processing using built-in functions, easier than Trident
o Hybrid computation model transparent for the developer

Lambdoop (HYBRID COMPUTATION MODEL)

[Diagram: a Workflow chains Data → Operation → Data; the input data can be streaming data or static data.]

Lambdoop (HYBRID COMPUTATION MODEL)

// Batch workflow over the historical (static) data
DataInput db_historical = new StaticCSVInput(URI_db);
Data historical = new Data(db_historical);
Workflow batch = new Workflow(historical);

Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");

batch.add(filter);
batch.add(select);
batch.add(group);
batch.add(average);

batch.run();

Data results = batch.getResults();
…

Lambdoop (HYBRID COMPUTATION MODEL)

// Streaming workflow over the sensor data, with a time window
DataInput stream_sensor = new StreamXMLInput(URI_sensor);
Data sensor = new Data(stream_sensor);
Workflow streaming = new Workflow(sensor, new WindowsTime(100));

Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");

streaming.add(filter);
streaming.add(select);
streaming.add(group);
streaming.add(average);

streaming.run();

while (true) {
    Data live_results = streaming.getResults();
    …
}

Lambdoop (HYBRID COMPUTATION MODEL)

// Hybrid workflow: historical (static) data plus streaming sensor data
DataInput historical = new StaticCSVInput(URI_folder);
DataInput stream_sensor = new StreamXMLInput(URI_sensor);
Data all_info = new Data(historical, stream_sensor);
Workflow hybrid = new Workflow(all_info, new WindowsTime(1000));

Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");

hybrid.add(filter);
hybrid.add(select);
hybrid.add(group);
hybrid.add(average);

hybrid.run();

Data updated_results = hybrid.getResults();

Lambdoop (HYBRID COMPUTATION MODEL)

Pros
• High abstraction layer for all processing models
• All steps in the data processing pipeline
• Same Java API for all programming paradigms
• Extensible

Cons
• Ongoing project
• Not open-source yet
• Not tested in larger clusters yet



Open Issues

• Interoperability between well-known techniques / technologies (SQL, R) and Big Data platforms (Hadoop, Spark)

• European technologies (Stratosphere / Apache Flink)

• Massive Streaming Machine Learning

• Real-time Interactive Visual Analytics

• Vertical (domain-driven) solutions

Conclusions

Casado R., Younas M. Emerging trends and technologies in big data processing. Concurrency Computat.: Pract. Exper. 2014

Conclusions

• Big Data is not only Hadoop

• Identify the processing requirements of your project

• Analyze the alternatives for all steps in the data pipeline

• The battle for real-time processing is open

• Stay tuned for the hybrid computation model

Thanks for your attention!

Questions?

[email protected]

ruben_casado
