paradigmas de procesamiento en big data: estado actual, tendencias y oportunidades

Dr. Rubén Casado (ruben_casado). Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportunidades. Universidad Complutense de Madrid, 19 November 2014

Upload: universidad-complutense-de-madrid

Post on 11-Jul-2015


Page 1: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Dr. Rubén Casado

ruben_casado

Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportunidades

Universidad Complutense de Madrid, 19 November 2014

Page 2: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

1. Big Data processing

2. Batch processing

3. Streaming processing

4. Hybrid computation model

5. Open Issues & Conclusions

Agenda

Page 3: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• PhD in Software Engineering
• MSc in Computer Science
• BSc in Computer Science

Academics

Work Experience

Page 4: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

1. Big Data processing

2. Batch processing

3. Streaming processing

4. Hybrid computation model

5. Open Issues & Conclusions

Agenda

Page 5: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

A massive volume of both structured and unstructured data that is too large to process with traditional database and software techniques

What is Big Data?

Page 6: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization

How is Big Data?

- Gartner IT Glossary -

Page 7: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

3 problems

Volume

Variety Velocity

Page 8: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

3 solutions

Batch processing

NoSQL

Streaming processing

Page 9: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

3 solutions

Batch processing

NoSQL

Streaming processing

Page 10: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Volume

Variety Velocity

Science or Engineering?

Page 11: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Science or Engineering?

Volume

Variety

Value

Velocity

Page 12: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Science or Engineering?

Volume

Variety

Value

Velocity

Software Engineering

Data Science

Page 13: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Relational Databases
• Schema based
• ACID (Atomicity, Consistency, Isolation, Durability)
• Performance penalty
• Scalability issues

NoSQL
• Not Only SQL
• Families of solutions
• Google BigTable, Amazon Dynamo
• BASE = Basically Available, Soft state, Eventually consistent
• CAP = Consistency, Availability, Partition tolerance

NoSQL

Page 14: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Key-value
• Key: ID
• Value: associated data
• Dictionary
• LinkedIn Voldemort, Riak, Redis, Memcache, Membase
• Example: CR7: 'Cristiano Ronaldo'

Document
• More complex than key-value
• Documents are indexed by ID
• Multiple indexes
• MongoDB, CouchDB
• Example: CR7: {Name: 'Cristiano', Surname: 'Ronaldo', Age: 29}

Column
• Tables with predefined families of fields
• Fields within families are flexible
• Vertical and horizontal partitioning
• HBase, Cassandra
• Example: CR7: [Personal: {Name: 'Cristiano', Surname: 'Ronaldo', Age: 29}, Job: {Team: 'R. Madrid', Salary: 20.000.000}]

Graph
• Nodes
• Relationships
• Neo4j, FlockDB, OrientDB
• Example: [CR] --is_named--> [Cristiano]; [CR] --plays_for--> [R.Madrid]

NoSQL
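The four NoSQL families on this slide can be mimicked with plain Python structures. This is only an illustrative sketch, not how any of the listed databases store data: the relationship labels are English renderings of the slide's labels, the graph node key "CR7" is an assumption (the slide shows "[CR]"), and the helper `neighbors` is hypothetical.

```python
# Illustrative only: plain Python structures standing in for the four NoSQL families.

# Key-value: an opaque value looked up by its key.
key_value = {"CR7": "Cristiano Ronaldo"}

# Document: the value is a structured document whose fields can be indexed.
document = {"CR7": {"Name": "Cristiano", "Surname": "Ronaldo", "Age": 29}}

# Column family: fields are grouped into predefined families;
# the fields inside each family are flexible.
column_family = {
    "CR7": {
        "Personal": {"Name": "Cristiano", "Surname": "Ronaldo", "Age": 29},
        "Job": {"Team": "R. Madrid", "Salary": 20_000_000},
    }
}

# Graph: nodes plus labeled relationships (edges).
graph_edges = [
    ("CR7", "is_named", "Cristiano"),
    ("CR7", "plays_for", "R. Madrid"),
]


def neighbors(edges, node, relation):
    """Return the nodes reached from `node` through `relation` (hypothetical helper)."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]


print(key_value["CR7"])
print(document["CR7"]["Age"])
print(column_family["CR7"]["Job"]["Team"])
print(neighbors(graph_edges, "CR7", "plays_for"))
```

The same entity becomes progressively more structured from key-value to column family, while the graph model moves the structure into the relationships.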

Page 15: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Scalable

• Large amount of static data

• Distributed

• Parallel

• Fault tolerant

• High latency

Batch processing

Volume

Page 16: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Low latency

• Continuous unbounded streams of data

• Distributed

• Parallel

• Fault-tolerant

Streaming processing

Velocity

Page 17: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Low latency: real-time

• Massive data-at-rest + data-in-motion

• Scalable

• Combine batch and streaming results

Hybrid computation model

Volume Velocity

Page 18: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Figure (data flow):
• All data → Batch processing → Batch results
• New data → Streaming processing → Stream results
• Batch results + Stream results → Combination → Final results

Hybrid computation model
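The combination step of the hybrid model can be sketched for a per-station average. This is a minimal sketch under stated assumptions: the function names are hypothetical, records are `(station, so2)` pairs, and both layers keep `(sum, count)` partial aggregates so that the two views can be merged exactly.

```python
from collections import defaultdict


def batch_view(all_data):
    """Batch layer: (sum, count) per station over all historical records."""
    view = defaultdict(lambda: [0.0, 0])
    for station, so2 in all_data:
        view[station][0] += so2
        view[station][1] += 1
    return view


def update_stream_view(view, record):
    """Streaming layer: fold one new record into the incremental view."""
    station, so2 = record
    view[station][0] += so2
    view[station][1] += 1


def combine(batch, stream):
    """Combination: merge both views and return the average per station."""
    result = {}
    for station in set(batch) | set(stream):
        b_sum, b_count = batch.get(station, [0.0, 0])
        s_sum, s_count = stream.get(station, [0.0, 0])
        result[station] = (b_sum + s_sum) / (b_count + s_count)
    return result


historical = [(1, 7), (1, 7), (2, 6)]        # "all data"
batch = batch_view(historical)

stream = defaultdict(lambda: [0.0, 0])       # "new data" arriving later
for rec in [(1, 10), (3, 4)]:
    update_stream_view(stream, rec)

final = combine(batch, stream)
print(final)
```

Keeping sums and counts rather than finished averages is the design choice that makes the merge exact; averaging two averages would weight them incorrectly.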

Page 19: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Processing paradigms (inception: 2003)

• 1st generation (2006): Batch processing. Large amounts of static data, scalable solution. Volume
• 2nd generation (2010): Streaming processing. Computing streaming data, low latency. Velocity
• 3rd generation (2014): Hybrid computation. Lambda Architecture. Volume + Velocity

Page 20: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

+10 years of Big Data processing technologies

Timeline figure with three tracks (Batch, Streaming, Hybrid) over the years 2003, 2004, 2005, 2006, 2008, 2009, 2010, 2011, 2012, 2013 and 2014. Events shown:

• The Google File System
• MapReduce: Simplified Data Processing on Large Clusters
• Doug Cutting starts developing Hadoop
• Yahoo! starts working on Hadoop
• Apache Hadoop is in production
• Nathan Marz creates Storm
• Yahoo! creates S4
• Facebook creates Hive
• Yahoo! creates Pig
• MillWheel: Fault-Tolerant Stream Processing at Internet Scale
• LinkedIn presents Samza
• LinkedIn presents Kafka
• Cloudera presents Flume
• Nathan Marz defines the Lambda Architecture
• Spark stack is open sourced
• Lambdoop & Summingbird first steps
• Stratosphere becomes Apache Flink

Page 21: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Processing Pipeline

DATA ACQUISITION

DATA STORAGE

DATA ANALYSIS RESULTS

Page 22: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Static stations and mobile sensors in Asturias sending streaming data

Historical data of > 10 years

Monitoring, trends identification, predictions

Air Quality case study

Page 23: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

1. Big Data processing overview

2. Batch processing

3. Real-time processing

4. Hybrid computation model

5. Open Issues & Conclusions

Agenda

Page 24: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Batch processing technologies

DATA ACQUISITION
o HDFS commands
o Sqoop
o Flume
o Scribe

DATA STORAGE
o HDFS
o HBase

DATA ANALYSIS
o MapReduce
o Hive
o Pig
o Cascading
o Spark
o Spark SQL (Shark)

RESULTS

Page 25: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Import to HDFS

hadoop dfs -copyFromLocal <path-to-local> <path-to-remote>

hadoop dfs -copyFromLocal /home/hduser/AirQuality/ /hdfs/AirQuality/

HDFS commands DATA ACQUISITION

BATCH

Page 26: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Tool designed for transferring data between HDFS/HBase and structured datastores

• Based on MapReduce

• Includes connectors for multiple databases
o MySQL
o PostgreSQL
o Oracle
o SQL Server
o DB2
o Generic JDBC connector

• Java API

Sqoop DATA ACQUISITION

BATCH

Page 27: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

1) Import data from database to HDFS

sqoop import-all-tables --connect jdbc:mysql://localhost/testDatabase --warehouse-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1

2) Analyze data (Hadoop)

3) Export results to database

sqoop export --connect jdbc:mysql://localhost/testDatabase --table <table-name> --export-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1

Sqoop DATA ACQUISITION

BATCH

Page 28: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Service for collecting, aggregating, and moving large amounts of log data

• Simple and flexible architecture based on streaming data flows

• Reliability, scalability, extensibility, manageability

• Supported log stream types
o Avro
o Syslog
o Netcat

Flume DATA ACQUISITION

BATCH

Page 29: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Sources: Avro, Thrift, Exec, JMS, NetCat, Syslog TCP/UDP, HTTP, Custom
Channels: Memory, JDBC, File
Sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, Null, HBase, Custom

• Architecture
o Source: waits for events.
o Sink: sends the information towards another agent or system.
o Channel: stores the information until it is consumed by the sink.
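The source, channel and sink roles described above can be sketched with an in-memory queue. This is not Flume's API: the `Agent` class and its method names are hypothetical, and a bounded `queue.Queue` merely stands in for a memory channel.

```python
import queue


class Agent:
    """Toy agent: a source puts events on a channel, a sink drains them."""

    def __init__(self, sink, capacity=100):
        self.channel = queue.Queue(maxsize=capacity)  # buffers events until consumed
        self.sink = sink                              # callable that forwards one event

    def source_receive(self, event):
        """Source side: accept an incoming event and store it in the channel."""
        self.channel.put(event)

    def sink_drain(self):
        """Sink side: forward every buffered event and report how many were sent."""
        delivered = 0
        while not self.channel.empty():
            self.sink(self.channel.get())
            delivered += 1
        return delivered


stored = []                         # stands in for HDFS or another agent
agent = Agent(sink=stored.append)
agent.source_receive('"1";"2001-01-01";"7"')
agent.source_receive('"2";"2001-01-01";"6"')
buffered = agent.channel.qsize()    # events wait in the channel until the sink runs
delivered = agent.sink_drain()
print(buffered, delivered, stored)
```

The channel decouples the arrival rate at the source from the delivery rate at the sink, which is why the events stay buffered until `sink_drain` runs.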

Flume DATA ACQUISITION

BATCH

Page 30: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Stations send the information to the servers. Flume collects this information and moves it into HDFS for further analysis.

Air quality syslogs

Flume DATA ACQUISITION

BATCH

Station; Tittle; latitude; longitude; Date ; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";

Page 31: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Server for aggregating log data streamed in real time from a large number of servers

• There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups.

• The central scribe server(s) can write the messages to the files that are their final destination

Scribe DATA ACQUISITION

BATCH

Page 32: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

category = 'mobile'

# Example line: '1; 43.5298; -5.6734; 2000-01-01; 23; 89; 1.97; …'
message = sensor_log.readline()

log_entry = scribe.LogEntry(category, message)

# Create a Scribe client (the Thrift transport and protocol are set up beforehand)
client = scribe.Client(iprot=protocol, oprot=protocol)

transport.open()
result = client.Log(messages=[log_entry])
transport.close()

• Sending a sensor message to a Scribe Server

Scribe DATA ACQUISITION

BATCH

Page 33: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Distributed file system for Hadoop

• Master-slave architecture (NameNode – DataNodes)

o NameNode: manages the directory tree and regulates access to files by clients

o DataNodes: store the data

• Files are split into blocks of the same size, and these blocks are stored and replicated across a set of DataNodes
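Block splitting and replication can be illustrated with a few lines of arithmetic. This is a simplified sketch: the function names are hypothetical, the round-robin placement is an assumption made for readability (real HDFS placement is rack-aware), and the last block of a file may be smaller than the others.

```python
import math


def split_into_blocks(file_size, block_size):
    """Return the size of each block; only the last one may be shorter."""
    num_blocks = math.ceil(file_size / block_size)
    return [min(block_size, file_size - i * block_size) for i in range(num_blocks)]


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin, illustrative)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(file_size=300, block_size=128)   # sizes in MB
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(blocks)
print(placement)
```

In this picture the NameNode would hold the `placement` mapping as metadata, while the DataNodes hold the block contents.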

HDFS DATA STORAGE

BATCH

Page 34: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Open-source non-relational distributed column-oriented database modeled after Google’s BigTable.

• Random, realtime read/write access to the data.

• Not a relational database.

o Very light «schema»

• Rows are stored in sorted order.

DATA STORAGE

BATCH

HBase

Page 35: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Framework for processing large amounts of data in parallel across a distributed cluster

• Loosely inspired by the classic Divide and Conquer (D&C) strategy

• The developer has to implement the Map and Reduce functions:

o Map: takes the input, partitions it into smaller sub-problems parsed into <K, V> pairs, and distributes them to worker nodes

o Reduce: collects the <K, List(V)> pairs and generates the results

MapReduce DATA ANALYTICS

BATCH

Page 36: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Design Patterns
o Joins: reduce-side join, replicated join, semi join
o Sorting: secondary sort, total order sort
o Filtering
o Statistics: AVG, VAR, Count, …
o Top-K
o Binning
o …

MapReduce

DATA ANALYTICS

BATCH

Page 37: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Obtain the SO2 average of each station

MapReduce

Station; Tittle; latitude; longitude; Date ; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";

DATA ANALYTICS

BATCH

Page 38: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

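Figure: the input data is split among three mappers. Each mapper emits <Station_ID, SO2_VALUE> pairs, which are passed to the shuffling phase. Example pairs shown: <1, 6> …; <1, 2> <3, 1> <1, 9>; <3, 9> <2, 6> <2, 6> <1, 6>; <2, 0> <2, 8> <1, 2> <3, 9>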

MapReduce DATA ANALYTICS

BATCH

• Maps get records and produce the SO2 value in <Station_Id, SO2_value>

Page 39: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Figure: after shuffling, each reducer receives <Station_ID, [SO2_1, SO2_2, …, SO2_n]>, sums the values and divides by their count.

Station_ID, AVG_SO2
1, 2,013
2, 2,695
3, 3,562

• Reducer receives <Station_Id, List<SO2_value> > and computes the average for the station
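The map, shuffle and reduce steps for the SO2 average can be simulated in a single process. This is a sketch of the data flow only, not Hadoop code: the function names are hypothetical, the sample lines are shortened versions of the slide's records, and the station "2" row is invented here to show grouping.

```python
from collections import defaultdict

SO2_COLUMN = 5  # Station; Title; latitude; longitude; Date; SO2; ...


def map_record(line):
    """Map: parse one CSV line and emit <Station_ID, SO2_value>."""
    fields = [f.strip().strip('"') for f in line.split(";")]
    return fields[0], float(fields[SO2_COLUMN])


def shuffle(pairs):
    """Shuffle: group all values by key, giving <Station_ID, List<SO2_value>>."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_average(station_id, values):
    """Reduce: sum the values and divide by their count."""
    return station_id, sum(values) / len(values)


lines = [
    '"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"',
    '"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"',
    '"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"',
    '"2";"Hypothetical station";"43.5";"-5.6";"2001-01-01"; "6"; "9"',
]

mapped = [map_record(line) for line in lines]
grouped = shuffle(mapped)
averages = dict(reduce_average(k, v) for k, v in grouped.items())
print(averages)
```

In a real cluster the mappers and reducers run on different nodes and the shuffle moves data over the network; the logic per record and per key is the same.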

MapReduce DATA ANALYTICS

BATCH

Page 40: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Hive

• Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets

• Abstraction layer on top of MapReduce

• SQL-like language called HiveQL

• Metastore: central repository of Hive metadata

DATA ANALYTICS

BATCH

Page 41: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

CREATE TABLE air_quality(Estacion int, Titulo string, latitud double, longitud double, Fecha string, SO2 int, NO int, CO float, …)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA INPATH '/CalidadAire_Gijon' OVERWRITE INTO TABLE air_quality;

Hive

• Obtain the SO2 average of each station

SELECT Estacion, Titulo, avg(SO2)
FROM air_quality
GROUP BY Estacion, Titulo;

DATA ANALYTICS

BATCH

Page 42: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Platform for analyzing large data sets

• High-level language for expressing data analysis programs: Pig Latin, a data flow programming language

• Abstraction layer on top of MapReduce

• Procedural language

Pig DATA ANALYTICS

BATCH

Page 43: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Pig DATA ANALYTICS

BATCH

• Obtain the SO2 average of each station

air_quality = LOAD '/CalidadAire_Gijon' USING PigStorage(';')
    AS (estacion:chararray, titulo:chararray, latitud:chararray,
        longitud:chararray, fecha:chararray, so2:int,
        no:chararray, co:chararray, pm10:chararray, o3:chararray,
        dd:chararray, vv:chararray, tmp:chararray, hr:chararray,
        prb:chararray, rs:chararray, ll:chararray, ben:chararray,
        tol:chararray, mxil:chararray, pm25:chararray);

-- group the records by station and average the SO2 column of each group
grouped = GROUP air_quality BY estacion;
avg_so2 = FOREACH grouped GENERATE group, AVG(air_quality.so2);
DUMP avg_so2;

Page 44: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Cascading is a data processing API and query planner used for defining, sharing, and executing data-processing workflows

• Makes development of complex Hadoop MapReduce workflows easy

• Similar in purpose to Pig

DATA ANALYTICS

BATCH

Cascading

Page 45: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

// define source and sink Taps.
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextLine( new Fields( "Estacion", "SO2" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

Pipe assembly = new Pipe( "avgSO2" );
assembly = new GroupBy( assembly, new Fields( "Estacion" ) );

// For every Tuple group
Aggregator avg = new Average( new Fields( "SO2" ) );
assembly = new Every( assembly, avg );

// Tell Hadoop which jar file to use
Flow flow = flowConnector.connect( "avg-SO2", source, sink, assembly );

// execute the flow, block until complete
flow.complete();

DATA ANALYTICS

BATCH

• Obtain the SO2 average of each station

Cascading

Page 46: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Spark

• Cluster computing system for faster data analytics

• Not a modified version of Hadoop

• Compatible with HDFS

• In-memory data storage for very fast iterative processing

• MapReduce-like engine

• API in Scala, Java and Python

DATA ANALYTICS

BATCH

Page 47: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Spark DATA ANALYTICS

BATCH

• Hadoop is slow due to replication, serialization and IO tasks

Page 48: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Spark DATA ANALYTICS

BATCH

• 10x-100x faster

Page 49: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Spark SQL

• Large-scale data warehouse system for Spark

• SQL on top of Spark (aka SHARK)

• Actually Hive QL over Spark

• Up to 100x faster than Hive

DATA ANALYTICS

BATCH

Page 50: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Pros

• Faster than Hadoop ecosystem

• Easier to develop new applications

o (Scala, Java and Python API)

Cons

• Not tested in extremely large clusters yet

• Problems when Reducer’s data does not fit in memory

DATA ANALYTICS

BATCH

Spark

Page 51: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

1. Big Data processing

2. Batch processing

3. Streaming processing

4. Hybrid computation model

5. Open Issues & Conclusions

Agenda

Page 52: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Real-time processing technologies

DATA ACQUISITION
o Flume

DATA STORAGE
o Kafka
o Kestrel

DATA ANALYSIS
o Flume
o Storm
o Trident
o S4
o Spark Streaming

RESULTS

Page 53: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Flume DATA ACQUISITION

STREAM

Page 54: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Kafka is a distributed, partitioned, replicated commit log service

o Producer/Consumer model

o Kafka maintains feeds of messages in categories called topics

o Kafka is run as a cluster
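The "partitioned commit log" idea can be sketched with append-only lists and consumer-side offsets. This is a single-process illustration, not Kafka's client API: the `Topic` and `Consumer` classes are hypothetical, replication is left out, and a CRC32 hash of the key is assumed as the partitioner.

```python
import zlib


class Topic:
    """A topic is a set of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, message):
        """Producer side: append to the partition chosen by the key; return (partition, offset)."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1


class Consumer:
    """Each consumer tracks its own offset per partition; the log itself is never modified."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        """Return all messages appended since the last poll and advance the offsets."""
        messages = []
        for p, log in enumerate(self.topic.partitions):
            messages.extend(log[self.offsets[p]:])
            self.offsets[p] = len(log)
        return messages


topic = Topic(num_partitions=2)
acks = [topic.send("station-1", f"so2={v}") for v in (7, 7, 6)]

first = Consumer(topic)
batch_1 = first.poll()          # everything produced so far
batch_2 = first.poll()          # nothing new
second = Consumer(topic)
replay = second.poll()          # an independent consumer re-reads the same log
print(acks, batch_1, batch_2, replay)
```

Because consumption only moves an offset, adding another consumer costs little more than the bandwidth needed to read the log again.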

Kafka DATA STORAGE

STREAM

Page 55: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Insert AirQuality sensor log file into Kafka cluster and consume the info.

// Create a new producer
Producer<String, String> producer = new Producer<String, String>(config);

// Open the sensor log file
BufferedReader br…

String line;
while (true) {
    line = br.readLine();
    if (line == null) {
        … // wait
    } else {
        producer.send(new KeyedMessage<String, String>(topic, line));
    }
}

Kafka DATA STORAGE

STREAM

Page 56: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

AirQuality Consumer

ConsumerConnector consumer = Consumer.createJavaConsumerConnector(config);

Map<String, Integer> topicCountMap = new HashMap<String, Integer>();

topicCountMap.put(topic, new Integer(1));

Map<String, List<KafkaMessageStream>> consumerMap = consumer.createMessageStreams(topicCountMap);

KafkaMessageStream stream = consumerMap.get(topic).get(0);

ConsumerIterator it = stream.iterator();

while (it.hasNext()) {
    // consume it.next()
}

Kafka DATA STORAGE

STREAM

Page 57: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Simple distributed message queue

• A single Kestrel server has a set of queues (strictly-ordered FIFO)

• On a cluster of Kestrel servers, they don’t know about each other and don’t do any cross communication

• Kestrel vs Kafka

o Kafka consumers are cheaper (basically just the bandwidth usage)

o Kestrel does not depend on ZooKeeper, which makes it operationally less complex if you don't already have a ZooKeeper installation

o Kafka has significantly better throughput

o Kestrel does not support ordered consumption

Kestrel DATA STORAGE

STREAM

Page 58: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Interceptor
• Interface org.apache.flume.interceptor.Interceptor
• Can modify or even drop events based on any criteria
• Flume supports chaining of interceptors
• Types:
o Timestamp interceptor
o Host interceptor
o Static interceptor
o UUID interceptor
o Morphline interceptor
o Regex Filtering interceptor
o Regex Extractor interceptor

DATA ANALYTICS

STREAM

Flume

Page 59: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Only the sensor records from station "2" must be kept
o An interceptor filters the events between Source and Channel

Station; Tittle; latitude; longitude; Date ; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";

"2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";

"3";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";

"2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";

Flume DATA ANALYTICS

STREAM

Page 60: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

# Write format can be text or writable
# Defining channel – Memory type …
# Defining source – Syslog …
# Defining sink – HDFS …
# Defining interceptor
agent.sources.source.interceptors = i1
agent.sources.source.interceptors.i1.type = org.apache.flume.interceptor.StationFilter

class StationFilter implements Interceptor
    // pseudocode: station is the first field of the event body
    if (!"2".equals(station))
        discard event;
    else
        keep event;
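The filtering logic of the interceptor, and the chaining mentioned earlier, can be sketched outside Flume. This is not the Flume `Interceptor` interface: the functions are hypothetical, events are plain strings, and an interceptor signals "drop" by returning `None`.

```python
def strip_whitespace(event):
    """Interceptor that modifies an event: trims surrounding whitespace."""
    return event.strip()


def station_filter(wanted):
    """Build an interceptor that keeps only events whose first field equals `wanted`."""
    def intercept(event):
        station = event.split(";")[0].strip().strip('"')
        return event if station == wanted else None
    return intercept


def run_chain(interceptors, events):
    """Apply the interceptors in order; an event is dropped as soon as one returns None."""
    kept = []
    for event in events:
        for intercept in interceptors:
            event = intercept(event)
            if event is None:
                break
        else:
            kept.append(event)
    return kept


events = [
    ' "1";"2001-01-01"; "7" ',
    ' "2";"2001-01-01"; "7" ',
    ' "3";"2001-01-01"; "7" ',
    ' "2";"2001-01-01"; "6" ',
]
kept = run_chain([strip_whitespace, station_filter("2")], events)
print(kept)
```

Only the two station "2" records reach the channel; the others are discarded before they are ever buffered.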

Flume DATA ANALYTICS

STREAM

Page 61: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Hadoop         Storm
JobTracker     Nimbus
TaskTracker    Supervisor
Job            Topology

• Distributed and scalable real-time computation system

• Doing for real-time processing what Hadoop did for batch processing

• Topology: processing graph. Each node contains processing logic (spouts and bolts). Links between nodes are streams of data
o Spout: source of streams. Reads a data source and emits the data into the topology as a stream
o Bolt: processing unit. Reads data from several streams, does some processing and possibly emits new streams
o Stream: unbounded sequence of tuples. Tuples can contain any serializable object

Storm DATA ANALYTICS

STREAM

Page 62: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• AirQuality average values

o Step 1: build the topology

Storm

CAReader (Spout) → LineProcessor (Bolt) → AvgValues (Bolt)

DATA ANALYTICS

STREAM

Page 63: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• AirQuality average values

o Step 1: build the topology

TopologyBuilder AirAVG = new TopologyBuilder();
AirAVG.setSpout("ca-reader", new CAReader(), 1);

// shuffleGrouping -> even distribution
AirAVG.setBolt("ca-line-processor", new LineProcessor(), 3)
      .shuffleGrouping("ca-reader");

// fieldsGrouping -> tuples with the same field value go to the same task
AirAVG.setBolt("ca-avg-values", new AvgValues(), 2)
      .fieldsGrouping("ca-line-processor", new Fields("id"));
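The difference between the two groupings used in the topology can be sketched as two routing functions. This is not Storm code: the function names are hypothetical, round-robin is used as a stand-in for Storm's random but even shuffle, and a CRC32 hash is assumed for the fields grouping.

```python
import itertools
import zlib


def shuffle_grouping(num_tasks):
    """Spread tuples evenly over the tasks (round-robin stand-in for a random shuffle)."""
    counter = itertools.count()

    def choose(_tuple):
        return next(counter) % num_tasks
    return choose


def fields_grouping(num_tasks, field):
    """Send tuples with the same value of `field` to the same task."""
    def choose(tup):
        key = str(tup[field]).encode("utf-8")
        return zlib.crc32(key) % num_tasks
    return choose


tuples = [
    {"id": "1", "so2": 7}, {"id": "2", "so2": 6}, {"id": "1", "so2": 6},
    {"id": "3", "so2": 9}, {"id": "2", "so2": 8}, {"id": "1", "so2": 7},
]

to_line_processor = shuffle_grouping(3)      # 3 LineProcessor tasks
to_avg_values = fields_grouping(2, "id")     # 2 AvgValues tasks

shuffle_tasks = [to_line_processor(t) for t in tuples]
fields_tasks = {}
for t in tuples:
    fields_tasks.setdefault(t["id"], set()).add(to_avg_values(t))
print(shuffle_tasks, fields_tasks)
```

The fields grouping matters for the averaging bolt: each task keeps per-station totals in memory, so every tuple of a given station has to reach the same task.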

Storm DATA ANALYTICS

STREAM

Page 64: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

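private BufferedReader br;
private SpoutOutputCollector collector;

public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    this.collector = collector;
    // Initialize file
    br = new ……
}

public void nextTuple() {
    try {
        String line = br.readLine();
        if (line != null) {
            // emit each line of the file as a one-field tuple
            collector.emit(new Values(line));
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

Storm

• AirQuality average values

o Step 2: CAReader implementation (IRichSpout interface)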

DATA ANALYTICS

STREAM

Page 65: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("id", "stationName", "lat", …
}

public void execute(Tuple input, BasicOutputCollector collector) {
    // split the CSV line and emit one value per field
    collector.emit(new Values((Object[]) input.getString(0).split(";")));
}

Storm

• AirQuality average values

o Step 3: LineProcessor implementation (IBasicBolt interface)

DATA ANALYTICS

STREAM

Page 66: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

public void execute(Tuple input, BasicOutputCollector collector) {
    // totals and counts are hash maps with each station's accumulated values
    String id = input.getStringByField("id");
    if (totals.containsKey(id)) {
        item = totals.get(id);
        count = counts.get(id);
    } else {
        // Create new item
    }
    // update values
    item.setSo2(item.getSo2() + Integer.parseInt(input.getStringByField("so2")));
    item.setNo(item.getNo() + Integer.parseInt(input.getStringByField("no")));
    …
}

Storm

• AirQuality average values

o Step 4: AvgValues implementation (IBasicBolt interface)

DATA ANALYTICS

STREAM


Page 67: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• High-level abstraction on top of Storm
o Provides high-level operations (joins, filters, projections, aggregations, functions…)

Pros
o Easy, powerful and flexible
o Incremental topology development
o Exactly-once semantics

Cons
o Very few built-in functions
o Lower performance and higher latency than Storm

Trident DATA ANALYTICS

STREAM

Page 68: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Simple Scalable Streaming System

• Distributed, scalable, fault-tolerant platform for processing continuous unbounded streams of data

• Inspired by MapReduce and Actor models of computation
o Data processing is based on Processing Elements (PEs)
o Messages are transmitted between PEs in the form of events (Key, Attributes)
o Processing Nodes are the logical hosts to PEs

S4 DATA ANALYTICS

STREAM

Page 69: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

<bean id="split" class="SplitPE">
  <property name="dispatcher" ref="dispatcher"/>
  <property name="keys">
    <!-- Listen for both words and sentences -->
    <list>
      <value>LogLines *</value>
    </list>
  </property>
</bean>
<bean id="average" class="AveragePE">
  <property name="keys">
    <list>
      <value>CAItem stationId</value>
    </list>
  </property>
</bean>

• AirQuality average values

S4 DATA ANALYTICS

STREAM

Page 70: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Spark Streaming

• Spark for real-time processing

• Streaming computation as a series of very short batch jobs (windows)

• Keep state in memory

• API similar to Spark
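The "series of very short batch jobs" idea can be sketched without Spark. This is an illustration of micro-batching only, not the Spark Streaming API: the function names are hypothetical, events are `(timestamp, value)` pairs, and intervals with no events are simply skipped.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive intervals of `batch_interval` seconds."""
    batches = {}
    for timestamp, value in events:
        batches.setdefault(int(timestamp // batch_interval), []).append(value)
    return [batches[index] for index in sorted(batches)]


def running_average(batches):
    """Run one small batch job per interval, keeping (total, count) as in-memory state."""
    total, count = 0.0, 0
    averages = []
    for batch in batches:
        total += sum(batch)
        count += len(batch)
        averages.append(total / count)
    return averages


events = [(0.1, 7), (0.4, 7), (1.2, 6), (1.9, 6), (2.5, 10)]
batches = micro_batches(events, batch_interval=1.0)
averages = running_average(batches)
print(batches)
print(averages)
```

The batch interval sets the trade-off: shorter intervals lower the latency of each result, while longer ones amortize the per-job overhead over more events.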

DATA ANALYTICS

STREAM

Page 71: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

1. Big Data processing

2. Batch processing

3. Streaming processing

4. Hybrid computation model

5. Open Issues & Conclusions

Agenda

Page 72: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• We are in the beginning of this generation

• Short-term Big Data processing goal

• Abstraction layer over the Lambda Architecture

• Promising technologies

o SummingBird

o Lambdoop

Hybrid Computation Model

Page 73: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

SummingBird

• Library for writing MapReduce-like processes that can be executed on Hadoop, Storm or a hybrid model

• Scala syntax

• The same logic can be executed in batch, real-time and hybrid batch/real-time modes

HYBRID COMPUTATION MODEL

Page 74: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

SummingBird HYBRIDCOMPUTATION

MODEL

Page 75: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Pros
• Hybrid computation model
• Same programming model for all processing paradigms
• Extensible

Cons
• MapReduce-like programming
• Scala
• Not as abstract as some users would like

SummingBird HYBRID COMPUTATION MODEL

Page 76: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

• Software abstraction layer over open source technologies
o Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident

• Common patterns and operations (aggregation, filtering, statistics…) already implemented. No MapReduce-like process

• Same single API for the three processing paradigms
o Batch processing similar to Pig / Cascading
o Real-time processing using built-in functions, easier than Trident
o Hybrid computation model transparent for the developer

Lambdoop HYBRID COMPUTATION MODEL

Page 77: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Lambdoop

Data Operation Data

Workflow

Streaming data

Static data

HYBRIDCOMPUTATION

MODEL

Page 78: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

DataInput db_historical = new StaticCSVInput(URI_db);
Data historical = new Data(db_historical);
Workflow batch = new Workflow(historical);

Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");

batch.add(filter);
batch.add(select);
batch.add(group);
batch.add(average);

batch.run();
Data results = batch.getResults();

Lambdoop HYBRIDCOMPUTATION

MODEL

Page 79: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

DataInput stream_sensor = new StreamXMLInput(URI_sensor);
Data sensor = new Data(stream_sensor);
Workflow streaming = new Workflow(sensor, new WindowsTime(100));

Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");

streaming.add(filter);
streaming.add(select);
streaming.add(group);
streaming.add(average);

streaming.run();
while (true) {
    Data live_results = streaming.getResults();
    …
}

Lambdoop HYBRIDCOMPUTATION

MODEL

Page 80: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

DataInput historical= new StaticCSVInput(URI_folder);

DataInput stream_sensor= new StreamXMLInput(URI_sensor);

Data all_info = new Data (historical, stream_sensor);

Workflow hybrid = new Workflow (all_info, new WindowsTime(1000) );

Operation filter = new Filter ("Station", "=", 2);

Operation select = new Select ("Titulo", "SO2");

Operation group = new Group("Titulo");

Operation average = new Average ("SO2");

hybrid.add(filter);

hybrid.add(select);

hybrid.add(group);

hybrid.add(average);

hybrid.run();

Data updated_results = hybrid.getResults();

Lambdoop HYBRIDCOMPUTATION

MODEL

Page 81: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Pros
• High abstraction layer for all processing models
• All steps in the data processing pipeline
• Same Java API for all programming paradigms
• Extensible

Cons
• Ongoing project
• Not open source yet
• Not tested in larger clusters yet

Lambdoop HYBRID COMPUTATION MODEL

Page 82: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

1. Big Data processing

2. Batch processing

3. Streaming processing

4. Hybrid computation model

5. Open Issues & Conclusions

Agenda

Page 83: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Open Issues

• Interoperability between well-known techniques / technologies (SQL, R) and Big Data platforms (Hadoop, Spark)

• European technologies (Stratosphere / Apache Flink)

• Massive Streaming Machine Learning

• Real-time Interactive Visual Analytics

• Vertical (domain-driven) solutions

Page 84: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Conclusions

Casado R., Younas M. Emerging trends and technologies in big data processing. Concurrency Computat.: Pract. Exper. 2014

Page 85: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Conclusions

• Big Data is not only Hadoop

• Identify the processing requirements of your project

• Analyze the alternatives for all steps in the data pipeline

• The battle for real-time processing is open

• Stay tuned for the hybrid computation model

Page 86: Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportunidades

Thanks for your attention!

Questions?


ruben_casado