realtime analytics + hadoop 2.0


Upload: rommel-garcia

Post on 20-Aug-2015


TRANSCRIPT

Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Realtime Analytics in Hadoop
Rommel Garcia – Solution Engineer
October 10, 2014

Page 2

Hadoop

Page 3

Hadoop provides

• Terabytes to petabytes of storage on commodity hardware (HDFS)
• Massively parallel computation on enormous amounts of data (YARN)

Hadoop is essentially a supercomputer for the masses!

Page 4

HDFS: Scalable, Reliable, Secure Storage Platform

[Diagram: YARN: Data Operating System on top of HDFS (Hadoop Distributed File System), with data blocks A, B and C replicated across nodes]

Reliable
• Highly available & fault tolerant
• Protects against data loss & corruption

Cost Effective
• Horizontally scales on commodity hardware

Secure
• Strong access controls, integrated with authentication mechanisms
• Granular data access controls to datasets across users and groups

Standards Based Data Interfaces
• NFS, REST and RPC, each connecting to a source/destination

The Storage Platform for the Modern Data Architecture
• Ingest and store any data in any format
• Flexible read access enables a variety of workloads
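As an illustration of the REST interface listed above, here is a minimal sketch that builds WebHDFS request URLs. The host name in the usage note is hypothetical, and the default port of 50070 assumes a Hadoop 2 NameNode with its standard HTTP port; the helper name `webhdfs_url` is invented for this example.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS REST URL for an absolute HDFS path.

    host and port identify the NameNode HTTP endpoint (assumed values),
    op is a WebHDFS operation such as LISTSTATUS or OPEN, and any extra
    keyword arguments are appended as query parameters.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **params})
    # quote() keeps "/" intact and percent-encodes characters such as spaces.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"
```

For example, `webhdfs_url("namenode.example.com", "/data/trucks", "LISTSTATUS")` yields a URL that an HTTP GET could use to list a directory, provided the cluster has WebHDFS enabled.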

Page 5

Hadoop 1

Single Use Data Platform: Batch

[Diagram: HADOOP 1 stack with Hive, Pig and Java on top of MapReduce, on top of Redundant, Reliable Storage (HDFS)]

Page 6

[Timeline diagram marked 2006, 2009 and October 23, 2013]

Hadoop w/ MapReduce
• MapReduce (largely batch processing) on HDFS (Hadoop Distributed File System)
• Silo'd clusters
• Largely batch system
• Difficult to integrate

MR-279: YARN

Hadoop 2 & YARN
• Hadoop 2 & YARN based architecture: YARN: Data Operating System on HDFS (Hadoop Distributed File System)
• Batch, Interactive, Real-Time
• Enabled the Modern Data Architecture

Page 7

Hadoop

Multi Use Data Platform: Batch, Interactive, Realtime, Online, Streaming, …

[Diagram: HADOOP 2 stack]
• Redundant, Reliable Storage (HDFS)
• Efficient Cluster Resource Management & Shared Services (YARN)
• On top of YARN: Standard Query Processing (Hive), Batch (MapReduce), Online Data Processing, Interactive (Tez), Real Time Stream Processing, Others

Page 8

Why Are Enterprises Using Hadoop?

Page 9

Traditional systems under pressure

• Silos of Data
• Costly to Scale
• Constrained Schemas

…and difficult to manage new data

[Diagram layers]
• APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
• DATA SYSTEM: RDBMS, EDW, MPP
• SOURCES: Existing Sources (CRM, ERP, …)
• New Data Types: Clickstream; Geolocation; Sentiment, Web Data; Sensor, Machine Data (IoT); Unstructured docs, emails; Server logs

Page 10

Hadoop 2 and YARN enable the Modern Data Architecture

Common data set, multiple applications

• Optionally land all data in a single cluster

• Batch, interactive & real-time use cases

• Support multi-tenant access, processing & segmentation of data

YARN: Architectural center of Hadoop

• Consistent security, governance & operations

• Ecosystem applications run natively in Hadoop

[Diagram layers]
• APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
• DATA SYSTEM: RDBMS, EDW and MPP alongside YARN: Data Operating System (Batch, Interactive, Real-Time) on HDFS (Hadoop Distributed File System)
• SOURCES: Existing Systems; Clickstream; Web & Social; Geolocation; Sensor & Machine; Server Logs; Unstructured

Page 11

Real-Time Use Cases

Page 12

Realtime Analytics in…

Financial Services
• Fraud Detection/Prevention

Retail
• Brand Sentiment Analysis
• Localized, Personalized Promotions

Telecom
• Cell tower diagnostics
• Bandwidth Allocation

Manufacturing
• Proactive Maintenance

Healthcare
• Monitor patient vitals
• Patient care and safety
• Reduce re-admittance rates

Utilities, Oil & Gas
• Smart meter stream analysis
• Proactive equipment repair
• Power and consumption matching

Public Sector
• Network intrusion detection and prevention
• Disease outbreak detection

Transportation
• Unsafe driving detection and monitoring

Page 13

Truck Demo: Real-Time Analytics

Problem:
• The only way to measure “safe driving” is through accident occurrences.
• There’s no realtime accident prevention mechanism in place

Solution:
• Use Hadoop to analyze driving violations in real-time
• Provide a UI to view real-time violation alerts
• Provide a dashboard to review violation reports

Page 14

Demo Time !

Page 15

Truck Demo Real-Time Hadoop Architecture

[Architecture diagram]
• Components: Truck Events; Kafka; Storm; HBase; HDFS/Hive; Message Queue (ActiveMQ); Real-Time Monitoring App; Solr (Reporting Dashboard)
• Labels: High Speed Ingestion; Distributed Processing; Truck Event Data; Violations; Alerts; Show; Show Driving Report
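The architecture above can be approximated in a few lines of plain Python. This is only a sketch under stated assumptions: an in-memory queue stands in for Kafka, a loop stands in for the Storm processing step, and two lists stand in for the event stores (HBase, HDFS/Hive) and the ActiveMQ alert queue. The event fields and the set of violation types are hypothetical, since the slides do not specify them.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical violation types; the slides do not say which ones the demo tracks.
VIOLATION_TYPES = {"speeding", "lane_departure", "unsafe_following_distance"}


@dataclass(frozen=True)
class TruckEvent:
    truck_id: int
    driver_id: int
    event_type: str


def process_stream(ingest_queue):
    """Drain the ingest queue one event at a time, as a stream processor would."""
    event_store = []  # stands in for HBase and HDFS/Hive
    alert_queue = []  # stands in for the ActiveMQ message queue
    while ingest_queue:
        event = ingest_queue.popleft()
        event_store.append(event)  # every event is kept for later reporting
        if event.event_type in VIOLATION_TYPES:
            alert_queue.append(
                f"ALERT truck={event.truck_id} driver={event.driver_id} "
                f"{event.event_type}"
            )
    return event_store, alert_queue


def violations_per_driver(event_store):
    """Aggregate stored events into a per-driver violation count for a report."""
    counts = {}
    for event in event_store:
        if event.event_type in VIOLATION_TYPES:
            counts[event.driver_id] = counts.get(event.driver_id, 0) + 1
    return counts
```

Feeding `process_stream` a `deque` of `TruckEvent` objects returns the stored events plus the alerts a monitoring UI would display, while `violations_per_driver` corresponds to the kind of summary a reporting dashboard might show.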

Page 16

Q&A

Page 17

Hadoop 2.0
Rommel Garcia – Solution Engineer
October 10, 2014

Page 18

Hadoop 2 Becoming A Critical Platform

Page 19

Hadoop 2 delivers a comprehensive data management platform

YARN is the architectural center of Hadoop 2
• Enables batch, interactive and real-time workloads
• Single SQL engine for both batch and interactive
• Enable existing ISV apps to plug directly into Hadoop via YARN

Provides comprehensive enterprise capabilities
• Governance
• Security
• Operations

The widest range of deployment options
• Linux & Windows
• On premise & cloud

[Diagram: Hadoop 2 Platform]
• GOVERNANCE & INTEGRATION: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, NFS, WebHDFS)
• DATA ACCESS (BATCH, INTERACTIVE & REAL-TIME): Script (Pig); SQL (Hive, HCatalog); NoSQL (HBase, Accumulo); Stream (Storm); Search (Solr); In-Memory (Spark); Others (ISV Engines); Tez
• DATA MANAGEMENT: YARN: Data Operating System; HDFS (Hadoop Distributed File System)
• SECURITY: Authentication, Authorization, Accounting, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
• OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
• Deployment Choice: Linux, Windows, On-Premise, Cloud

Page 20

YARN – Roadmap

Page 21

YARN Development Framework

[Diagram: layers labeled System, Engine and API on top of YARN: Data Operating System and HDFS (Hadoop Distributed File System)]
• Batch: MapReduce
• Real-Time: Slider
• Direct: Java, .NET
• Scripting: Pig
• SQL: Hive
• Cascading: Java, Scala
• NoSQL: HBase, Accumulo
• Stream: Storm
• Others: Spark, Other ISV
• Applications: Other ISV
• Tez appears under several of these components, and several are marked “New”

Page 22

YARN General Store – The Future

• A Data Lake that has a General Store to continually serve you…
– App Store – YARN Ready Applications
– Data Store – Where do I get the interesting data… Weather, Geo, etc.
– View Store – How do I get UIs to the cluster
– Processing Store – Falcon, Pig, etc. for “standard” data sets or common “processing patterns”

Page 23

Argus – Security

Page 24

Argus: Security needs are changing

Security needs are changing
• YARN unlocks the data lake
• Multi-tenant: Multiple applications for data access
• Changing and complex compliance environment
• ETL of non-sensitive data can yield sensitive data

• Fall 2013: Largely silo’d deployments with single workload clusters
• Summer 2014: 65% of clusters host multiple workloads

5 areas of security focus
• Administration: Central management & consistent security
• Authentication: Authenticate users and systems
• Authorization: Provision access to data
• Audit: Maintain a record of data access
• Data Protection: Protect data at rest and in motion

Page 25

Security in Hadoop with HDP + Argus (XA Secure)

Centralized Security Administration

Authentication: Who am I/prove it?
• Hadoop 2: Kerberos in native Apache Hadoop; HTTP/REST API secured with Apache Knox Gateway
• Argus: As-is, works with current authentication methods

Authorization: Restrict access to explicit data
• Hadoop 2: HDFS Permissions, HDFS ACL, Hive ATZ-NG
• Argus: HDFS, Hive and HBase; fine grain access control; RBAC

Audit: Understand who did what
• Hadoop 2: Audit logs in HDFS & MR
• Argus: Centralized audit reporting; policy and access history

Data Protection: Encrypt data at rest & in motion
• Hadoop 2: Wire encryption in Hadoop; open source initiatives; partner solutions
• Argus: Future integration

Page 26

Hive – SQL In Hadoop & Roadmap

Page 27

Hive: The De-Facto SQL Interface for Hadoop

Page 28

Data Abstractions in Hive

Partitions, buckets and skews facilitate faster, more direct data access. Cube, windowing and aggregation functions are supported as well.

[Diagram: hierarchy of Database, then Table, then Partition, then Bucket. Partitions, buckets and the split into Skewed Keys and Unskewed Keys are optional per table]
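To make the partition and bucket idea concrete, here is a minimal sketch of how a row could be routed to a partition directory and a bucket number. It assumes the usual `column=value` partition directory layout and a hash-modulo bucket assignment; the CRC32 hash is a stand-in chosen for determinism and is not Hive's actual hash function. Table path, column names and function names are hypothetical.

```python
import zlib


def partition_path(table_dir, partition_cols, row):
    """Return the directory a row lands in, one path level per partition column."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([table_dir] + parts)


def bucket_for(key, num_buckets):
    """Assign a key to a bucket by hashing it and taking the remainder.

    CRC32 is only a deterministic stand-in for the engine's own hash.
    """
    digest = zlib.crc32(str(key).encode("utf-8"))
    return digest % num_buckets


def locate(table_dir, partition_cols, bucket_col, num_buckets, row):
    """Combine both steps: which directory and which bucket hold this row."""
    return (
        partition_path(table_dir, partition_cols, row),
        bucket_for(row[bucket_col], num_buckets),
    )
```

A query that filters on the partition column only needs to read the matching directory, and a lookup on the bucketing column only needs one bucket within it, which is why these abstractions allow faster, more direct data access.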

Page 29

Stinger.Next - Roadmap

Page 30

Stinger.Next – Release Cycle

Page 31

Hive Demo Using DBVisualizer or Excel?

Page 32

Falcon – Data Governance

Page 33

Data Pipeline Tracing

Data pipeline dependencies
• View dependencies between clusters, datasets and processes

Data pipeline tagging
• Add arbitrary tags to feeds & processes

Data pipeline audits
• Know who modified a dataset when and into what

Data pipeline lineage
• Analyze how a dataset reached a particular state

[Diagram labels: Purchase feed, Customer feed, Product feed, Store feed; Credit feed tagged “Sensitive” and “encrypted”; File-1, File-2, File-3]

Page 34

Example: Multi-Cluster Replication

[Diagram: a Primary Hadoop Cluster and a Failover Hadoop Cluster linked by Replication, serving BI and Analytic Applications. Dataset labels: Raw Data, Cleansed Data, Conformed Data, Presented Data; Staged Data, Presented Data]

• Falcon manages workflow and replication
• Enables business continuity without requiring full data reprocessing
• Failover clusters can be smaller than primary clusters

…and many more

Page 35

Example: Retention

Retention Policy
• Staged Data: Retain 5 Years
• Presented Data: Retain Last Copy Only
• Cleansed Data: Retain 3 Years
• Conformed Data: Retain 3 Years

• Sophisticated retention policies expressed in one place
• Simplify data retention for audit, compliance, or for data re-processing
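A retention policy of this kind can be sketched as a small lookup table plus one function that decides which dataset instances are past their retention. This is not Falcon's API or configuration format; the dataset names, the pairing of datasets to policies, and the 365-day approximation of a year are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumed pairing of datasets to policies, expressed in one place.
RETENTION = {
    "staged": ("years", 5),
    "cleansed": ("years", 3),
    "conformed": ("years", 3),
    "presented": ("last_copy", None),
}


def instances_to_delete(dataset, instance_dates, today):
    """Return the dataset instances (by date) that the policy no longer retains."""
    kind, value = RETENTION[dataset]
    if kind == "last_copy":
        # Keep only the newest instance; everything older is eligible.
        newest = max(instance_dates, default=None)
        return sorted(d for d in instance_dates if d != newest)
    # A year is approximated as 365 days in this sketch.
    cutoff = today - timedelta(days=365 * value)
    return sorted(d for d in instance_dates if d < cutoff)
```

Keeping every policy in the single `RETENTION` table mirrors the point that retention is expressed in one place instead of being scattered across per-dataset cleanup scripts.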

Page 36

Ambari – Hadoop Cluster Monitoring

Page 37

Ambari Dashboard

Page 38

Ambari 2H 2014

1.7.0 (September)
Features
• Config versioning + history
• Config <final> Properties
• Flume Support
• Ubuntu Support
• ResourceManager HA
• HDFS Rebalance
• Ambari Views Framework
• Slider Support
Tech Preview
• Windows Support
• Ambari Shell

1.8.0 (October)
Features
• ServiceX on YARN via Slider
• Log Access + Search
• Rack Awareness
• Simplified Kerberos Setup
• NameNode SafeMode
• Ambari Shell GA

2.0.0 (December)
Features
• Automated Rolling Upgrades
• Oozie HA
• Ambari Alerts
• Ambari Metrics
• Windows Support GA

Page 39

Hadoop 2 Deployment Options

Page 40

Efficient Data Lakes can Span to the Cloud

On-Premises
• HDP on Windows, HDP on Linux: Full control of HW and software configs
• Analytics Platform System: Turnkey Hadoop and relational warehouse appliance

Cloud
• HDP on Windows, HDP on Linux: Your deployment of Hadoop hosted as a VM in Azure
• HDInsight: Managed Hadoop Service, built on Azure storage

Enjoy cross-platform interoperability based on 100% open source HDP

Page 41

Q&A

Page 42

Thank You!
Rommel Garcia – Solution Engineer
Twitter: @rommelgarcia
LinkedIn: /rommelgarcia