Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015

Upload: nosqlmatters

Post on 16-Jul-2015


Page 1: Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Page 2

Paris 2015

Page 3

• CTO for Zenika

• In charge of BigData/NoSQL consulting/training

• Trainer

• Pleasant guy

Page 4

1 Hadoop reminder

2 HAWQ – SQL on Hadoop

3 PXF – Accessing sources

4 Demo – Tweets Analytics

Page 5

Hadoop Reminder

Page 6
Page 7

Storage

• Semi-Structured

• Unstructured

• Large files

• Large amount of data

• Write once, Read many

Process

• Processing large amounts of data in parallel

• Commodity hardware

• Derived from functional programming

Page 8

• Provides high-throughput access to data blocks

• Provides a limited interface for managing the file system, which allows it to scale

• Creates multiple replicas of each data block

• Distributes them on computers throughout the cluster to enable reliable and rapid data access

Page 9

[Diagram: one namenode host running the NameNode daemon in a JVM, and six datanode hosts each running a DataNode daemon in a JVM]

NameNode (single)

• Manages file-system content tree

• Manages file & directory meta-data

• Manages datanodes and blocks they hold

DataNode (multiple)

• Store and retrieve data blocks (64 MB/128 MB)

• Report block usage to the NameNode

Page 10

[Diagram: the same cluster of one namenode and six datanodes; blocks A, B, C and D each appear three times across the datanodes]

File: input/logfiles/2014-12-12.log (200 MB)

Requires 4 blocks (A, B, C, D) spread across data nodes

Stored on blocks: 11, 22, 44, 66

Replicated on blocks: 33, 99, 55, 111

Replicated again on blocks: 77, 88, 10, 20
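The block count on this slide can be checked with a short sketch. It assumes a 64 MB block size and a replication factor of 3 (the slide shows three copies of each block). The round-robin placement is a simplification for illustration only, not the placement policy HDFS actually uses, and all function and node names are hypothetical.

```python
BLOCK_SIZE_MB = 64   # assumed; the slides mention 64 MB and 128 MB blocks
REPLICATION = 3      # assumed from the three copies shown per block


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block needed to store the file."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(block_names, datanodes, replication=REPLICATION):
    """Toy round-robin placement: each block lands on `replication` distinct nodes."""
    placement = {}
    for i, name in enumerate(block_names):
        placement[name] = [
            datanodes[(i + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


def blocks_lost(placement, failed_node):
    """Blocks left without any replica once one datanode has failed."""
    return [
        block for block, nodes in placement.items()
        if all(node == failed_node for node in nodes)
    ]


sizes = split_into_blocks(200)
print(sizes)  # [64, 64, 64, 8]

nodes = ["dn%d" % i for i in range(6)]   # hypothetical node names
placement = place_replicas(["A", "B", "C", "D"], nodes)
print(blocks_lost(placement, "dn2"))     # []
```

A 200 MB file therefore needs three full 64 MB blocks plus one 8 MB block. With three replicas on distinct nodes, losing a single datanode leaves every block readable.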

Page 11

[Diagram: the same cluster and block layout as on page 10, with one datanode marked FAILURE]

Page 12

• Performs distributed data processing using the MapReduce programming paradigm

• Lets users define a map phase, which is a parallel, share-nothing processing of the input

• Creates multiple replicas of each data block

• Distributes them on computers throughout the cluster to enable reliable and rapid data access
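As a rough illustration of the paradigm, here is a single-process word count written in the map/shuffle/reduce style. It is a minimal sketch, not Hadoop's API: the function names are hypothetical, and a real job would run the map and reduce functions in parallel across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """User-defined map: emit (word, 1) for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """User-defined reduce: combine all the values seen for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map over every line, shuffle, then reduce each group."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


print(run_job(["hadoop hdfs", "Hadoop mapreduce"]))
# {'hadoop': 2, 'hdfs': 1, 'mapreduce': 1}
```

Each call to `map_phase` depends only on its own input line, which is what makes the map phase share-nothing and easy to parallelise.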

Page 13
Page 14

[Diagram: one JobTracker host running the JobTracker daemon in a JVM, and six TaskTracker hosts each running a TaskTracker daemon in a JVM]

JobTracker (single)

• Launches and manages jobs

TaskTracker (multiple)

• Run individual tasks (Mappers/Reducers)

• Reside on the DataNodes

Page 15

HAWQ

SQL on Hadoop

Page 16

• HAdoop With Queries?

• Implementation based on PostgreSQL

• Uses HDFS

• Alternative to Hive for querying

• Supports ANSI SQL-92 and analytic extensions from SQL-2003

• Cost-based parallel query optimiser

Page 17

• ODP – Standardize Hadoop Ecosystem

• ODP Core for building a versioned, packaged, tested set of Hadoop components

• Developing a platform

• Pivotal and Hortonworks alliance to simplify adoption

• Joint engineering efforts

• Support services

• HAWQ Open Sourced

Page 18

Network

Master

Worker Worker Worker

Page 19

[Diagram: HAWQ architecture. A HAWQ Master (Parser, Query Optimizer, Dispatch, Local TM, Query Executor, PXF) and three HAWQ Segment Hosts (each with a Query Executor and PXF) are linked by the Network Interconnect; the HDFS NameNode is also shown]

Page 20

HAWQ Master

[Diagram: HAWQ Master components: Parser, Query Optimizer, Dispatch, Local TM, Query Executor, PXF]

• Based on PostgreSQL

• Handles SQL commands

• Maintains the global system catalog

• Contains no data

Page 21

HAWQ Segment Host

[Diagram: HAWQ Segment Host components: Query Executor, PXF]

• Processes its partition of the query

• Based on PostgreSQL

• Stateless

• Manages communication with the NameNode

• User/table data stored in HDFS files

Page 22

[Diagram: HAWQ architecture, as on page 19, with Clients connecting through JDBC/ODBC and SQL]

Page 23

[Diagram: HAWQ architecture, as on page 19, with the following query plan]

Gather Motion
  Sort
    HashAggregate
      HashJoin
        Redistribute Motion
          HashJoin
            Seq Scan on lineitem
            Hash
              Seq Scan on orders
        Hash
          HashJoin
            Seq Scan on customer
            Hash
              Broadcast Motion
                Seq Scan on nation

Page 24

[Diagram: HAWQ architecture, as on page 19, with the same plan fragment repeated three times. Its operators:]

Scan Bars b
HashJoin b.name = s.bar
Scan Sells s
Filter b.city = 'San Francisco'
Project s.beer, s.price
Motion Gather
Motion Redist(b.name)
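The operators on this slide appear to describe a query of the form SELECT s.beer, s.price FROM Bars b, Sells s WHERE b.name = s.bar AND b.city = 'San Francisco'. That query text is inferred from the operator labels and is not given in the slides. The sketch below simulates the same steps on three in-memory "segments". It assumes bar names are unique, and it redistributes both tables so that it does not depend on how Sells was laid out initially, whereas the slide only shows a redistribution on b.name. All function names are hypothetical.

```python
import zlib

NUM_SEGMENTS = 3  # the slide shows three segment hosts


def segment_for(key):
    """Deterministic hash distribution of a key onto a segment."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SEGMENTS


def redistribute(rows_per_segment, key_of):
    """Redistribute motion: send every row to the segment that owns its key."""
    out = [[] for _ in range(NUM_SEGMENTS)]
    for rows in rows_per_segment:
        for row in rows:
            out[segment_for(key_of(row))].append(row)
    return out


def local_plan(bars, sells):
    """Per-segment work: hash join on b.name = s.bar, then project."""
    names = {b["name"] for b in bars}  # build side of the hash join
    return [(s["beer"], s["price"]) for s in sells if s["bar"] in names]


def run_query(bars_per_segment, sells_per_segment):
    # Scan Bars + Filter b.city = 'San Francisco', local to each segment
    filtered = [
        [b for b in rows if b["city"] == "San Francisco"]
        for rows in bars_per_segment
    ]
    # Motion Redist(b.name): co-locate Bars rows with the Sells rows they join
    bars_by_name = redistribute(filtered, lambda b: b["name"])
    sells_by_bar = redistribute(sells_per_segment, lambda s: s["bar"])
    # Motion Gather: the master concatenates the per-segment results
    result = []
    for bars, sells in zip(bars_by_name, sells_by_bar):
        result.extend(local_plan(bars, sells))
    return result
```

Because both sides are hashed with the same function, every matching pair of rows ends up on the same segment, so each segment can join its share without talking to the others.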

Page 25

[Diagram: HAWQ architecture, as on page 19]

Page 26

[Diagram: HAWQ architecture, as on page 19]

Page 27

[Diagram: HAWQ architecture, as on page 19]

Page 28

[Bar chart: horizontal axis from 0 to 120; bars labelled 111 / 111, 20 / 111 and 31 / 111. The only series label that survives is "Pivotal HD".]

http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf

Page 29

PXF

Accessing sources

Page 30

• Allows access to Hadoop data (HDFS files, HBase, Hive) as external tables

• Allows joins between HAWQ (internal) and external tables

• Integrates with third-party systems (Cassandra, Accumulo)

• Provides an extensible framework API to enable custom development for other data sources

Page 31

HDFS HBase Hive

Xtension Framework

Page 32

Fragmenter

• Gets the locations of fragments for a table

Accessor

• Understands and reads a fragment, returns records to the Resolver

Resolver

• Converts the records into a SQL engine format

Analyzer

• Provides source statistics to the query optimizer
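The division of labour between these roles can be sketched as a small pipeline. The real PXF plugins are Java classes; the Python classes below are hypothetical stand-ins that only mirror the roles named on the slide, with a dictionary of strings standing in for files on HDFS. The Analyzer role is left out.

```python
import json


class LineFragmenter:
    """Fragmenter role: list the fragments (here, one per in-memory 'file')."""

    def __init__(self, files):
        self.files = files  # dict: path -> text content (stand-in for HDFS)

    def get_fragments(self):
        return sorted(self.files)


class LineAccessor:
    """Accessor role: read one fragment and yield its raw records."""

    def __init__(self, files):
        self.files = files

    def read(self, fragment):
        for line in self.files[fragment].splitlines():
            if line.strip():
                yield line


class JsonLineResolver:
    """Resolver role: turn a raw record into a row for the SQL engine."""

    def __init__(self, columns):
        self.columns = columns

    def resolve(self, record):
        doc = json.loads(record)
        return tuple(doc.get(col) for col in self.columns)


def scan(files, columns):
    """Chain the three roles: fragments -> raw records -> rows."""
    fragmenter = LineFragmenter(files)
    accessor = LineAccessor(files)
    resolver = JsonLineResolver(columns)
    for fragment in fragmenter.get_fragments():
        for record in accessor.read(fragment):
            yield resolver.resolve(record)
```

For example, `list(scan({"part-0.json": '{"id_str": "1", "lang": "fr"}'}, ["id_str", "lang"]))` yields a single row holding the two requested fields. Keeping the roles separate is what lets a new data source reuse an existing fragmenter while supplying its own accessor and resolver.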

Page 33

Demo

Tweets Analytics

Page 34

PHD (or any ODP Core-based Hadoop Distribution)

HDFS

HAWQ

(SQL on Hadoop)

SpringXD

(Stream Processing/scoring)

Direct Store

https://github.com/spring-projects/spring-xd-samples/tree/master/analytics-dashboard

Page 35

HDFS

Xtension Framework (Json-ext)

http://pivotal-field-engineering.github.io/pxf-field/json.html

Page 36

stream create tweets --definition "twitterstream | hdfs --idleTimeout=3000 --fileExtension=json"

stream create tweetlang --definition "tap:stream:tweets > field-value-counter --fieldName=lang" --deploy

stream create tweetcount --definition "tap:stream:tweets > aggregate-counter" --deploy

stream create tagcount --definition "tap:stream:tweets > field-value-counter --fieldName=entities.hashtags.text --name=hashtags" --deploy

stream deploy tweets
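To make the tweetlang tap concrete, here is an in-memory analogue of what a field-value counter computes: one running count per distinct value of a named field. This is a sketch of the idea only, not Spring XD code. The function name is hypothetical, and nested field names such as entities.hashtags.text are not handled.

```python
import json
from collections import Counter


def field_value_counter(messages, field_name):
    """Count how often each value of `field_name` occurs in JSON messages."""
    counts = Counter()
    for message in messages:
        value = json.loads(message).get(field_name)
        if value is not None:  # messages without the field are skipped
            counts[value] += 1
    return counts


tweets = ['{"lang": "fr"}', '{"lang": "en"}', '{"lang": "fr"}', '{}']
print(field_value_counter(tweets, "lang"))
# Counter({'fr': 2, 'en': 1})
```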

Page 37

<profile>
    <name>JSON</name>
    <description>A profile for JSON data, one JSON record per line</description>
    <plugins>
        <fragmenter>com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
        <accessor>com.pivotal.pxf.plugins.json.JsonAccessor</accessor>
        <resolver>com.pivotal.pxf.plugins.json.JsonResolver</resolver>
        <analyzer>com.pivotal.pxf.plugins.hdfs.HdfsAnalyzer</analyzer>
    </plugins>
</profile>

Page 38

CREATE EXTERNAL TABLE ext_tweets_json (
    created_at TEXT,
    id_str TEXT,
    text TEXT,
    source TEXT,
    "user.id" INTEGER,
    "user.location" TEXT,
    "coordinates.coordinates[0]" DOUBLE PRECISION,
    "coordinates.coordinates[1]" DOUBLE PRECISION
)
LOCATION ('pxf://pivhdsne:50070/xd/tweets/*.json?PROFILE=JSON')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
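Column names such as "user.id" and "coordinates.coordinates[0]" read as paths into each tweet's JSON document. The sketch below shows one way such a path could be resolved. It is an illustration under that assumption, not the JsonResolver's actual implementation; the `lookup` helper and the sample tweet are hypothetical.

```python
import json
import re

# One path step is either a field name or a bracketed list index.
_STEP = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def lookup(doc, path):
    """Follow a path such as 'coordinates.coordinates[0]' through parsed JSON.

    Returns None when any step is missing, much like a SQL NULL.
    """
    current = doc
    for name, index in _STEP.findall(path):
        try:
            current = current[int(index)] if index else current[name]
        except (KeyError, IndexError, TypeError):
            return None
    return current


tweet = json.loads(
    '{"id_str": "42", "user": {"id": 7, "location": "Paris"},'
    ' "coordinates": {"coordinates": [2.35, 48.85]}}'
)
print(lookup(tweet, "user.id"))                      # 7
print(lookup(tweet, "coordinates.coordinates[1]"))   # 48.85
print(lookup(tweet, "place.name"))                   # None
```

Returning None for a missing step matters for tweets, since many of them carry no coordinates at all.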

Page 39
Page 40