Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Enterprise Hadoop

Jeff Markham

Technical Director, APAC

[email protected]

Upcoming Announcements

April 2: Hortonworks Data Platform 2.1
A continued focus on innovation within the core of Enterprise Hadoop to enable an ecosystem to flourish and cement Hadoop’s role in the data architectures of tomorrow
• Interactive SQL Query: Final phase of Stinger delivered
• Comprehensive Features: Governance, Security, Operations
• Processing Versatility: Storm, Search

April 2: LucidWorks partnership
A resell agreement has been inked with LucidWorks to provide tier 2 and tier 3 support for HDP Search

April 3: Hadoop Summit Europe 2014
SOLD OUT, double exhibitors, double content, year over year.

April 21: Concurrent Partnership
Cascading is the proven application development platform for building data applications on Hadoop
Integrate and deliver the Cascading SDK into HDP 2.1
• Collection of tools, documentation, libraries, tutorials and example projects
• Simplifies SQL integration and enables Scala development for Hadoop
Hortonworks provides level 1 & 2 support for Cascading SDK

Hadoop within an emerging Modern Data Architecture

[Diagram: layered data architecture]
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DEV & DATA TOOLS: Build & Test
OPERATIONS TOOLS: Provision, Manage & Monitor
DATA SYSTEM: Repositories (RDBMS, EDW, MPP); Data Access, Data Management, Governance & Integration, Security, Operations
SOURCES: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data

Core Capabilities of Enterprise Hadoop

GOVERNANCE & INTEGRATION: Load data and manage according to policy
OPERATIONS: Deploy and effectively manage the platform
DATA MANAGEMENT: Store and process all of your Corporate Data Assets
DATA ACCESS: Access your data simultaneously in multiple ways (batch, interactive, real-time)
SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization
ENTERPRISE MGMT & SECURITY: Empower existing operations and security tools to manage Hadoop
DEPLOYMENT OPTIONS: Provide deployment choice across physical, virtual, cloud

= delivered in Open Source

GOVERNANCE & INTEGRATION
Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS

DATA ACCESS
Batch: MapReduce
Script: Pig
SQL: Hive/Tez, HCatalog
NoSQL: HBase, Accumulo
Stream: Storm
Search: Solr
Others: In-Memory Analytics, ISV engines

DATA MANAGEMENT
YARN: Data Operating System
HDFS (Hadoop Distributed File System), nodes 1 … N

SECURITY
Authentication, Authorization, Accounting, Data Protection
Storage: HDFS
Resources: YARN
Access: Hive, …
Pipeline: Falcon
Cluster: Knox

OPERATIONS
Provision, Manage & Monitor: Ambari, Zookeeper
Scheduling: Oozie

HDP 2.1: Enterprise Hadoop

HDP 2.1: Hortonworks Data Platform

[Diagram: the Enterprise Hadoop component stack (Governance & Integration, Data Access, Data Management, Security, Operations)]

Deployment Choice: Linux, Windows, On-Premise, Cloud

HDP 2.1 Investment Themes

HDP 2.1 Represents a MAJOR step forward for Hadoop

Delivery of Interactive Query via Stinger Initiative, Addition of Data Governance, more Security, Stream Processing and Search, Highlight Release

Three Key Highlights of Release

1. Stinger Initiative DELIVERED: Interactive Query in Apache Hive

2. NEW Capabilities for Hadoop

• Governance: delivered with Apache Falcon

• Security: Apache Knox extends perimeter security for Hadoop

3. NEW Engines included in HDP

• Stream processing: Apache Storm to analyze/process streams of data

• Search: via Apache Solr


Hortonworks Data Platform

HDP 2.1: Reliable, Consistent & Current

HDP certifies most recent & stable community innovation

[Figure: timeline of the component versions certified in each release, grouped by Data Management, Data Access, Governance & Integration, Security and Operations]
Releases: HDP 1.3 (May 2013), HDP 2.0 (October 2013), HDP 2.1 (April 2014)
Components: Hadoop & YARN, Pig, Tez, Hive & HCatalog, HBase, Phoenix, Accumulo, Storm, Solr, Mahout, Sqoop, Flume, Falcon, Knox, Ambari, Oozie, Zookeeper
Version labels, in extraction order (the chart layout that tied each label to a component was lost): 2.2.0, 1.1.2, 0.11.0, 0.11.0, 0.12.0, 0.12.0, 2.4.0, 0.12.1, 0.13.0, 0.94.6, 0.96.1, 0.98.0, 0.9.1, 0.7.0, 0.8.0, 0.9.0, 4.7.8, 1.4.3, 1.4.4, 1.3.1, 1.4.0, 1.2.5, 1.4.4, 1.5.1, 3.3.2, 4.0.0, 3.4.5, 0.4.0, 0.4.0, 4.0.0, 1.5.1, 0.5.0

Interactive SQL-IN-Hadoop Delivered

Stinger Initiative – DELIVERED

Next generation SQL based

interactive query in Hadoop

Speed: Hive query performance has increased by 100X to allow for interactive query times (seconds)

Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB

SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop

Apache Hive Contribution… an Open Community at its finest
• 1,672 Jira Tickets Closed
• 145 Developers
• 44 Companies
• ~390,000 Lines Of Code Added… (2x)
• 13 Months

[Diagram: Business Analytics and Custom Apps send SQL to Apache Hive, which runs on Apache Tez and Apache MapReduce over Apache YARN and HDFS (Hadoop Distributed File System), nodes 1 … N]

Stinger Project

Stinger Phase 1:
• Base Optimizations
• SQL Types
• SQL Analytic Functions
• ORCFile Modern File Format

Stinger Phase 2:
• SQL Types
• SQL Analytic Functions
• Advanced Optimizations
• Performance Boosts via YARN

Stinger Phase 3:
• Hive on Apache Tez
• Query Service (always on)
• Buffer Cache
• Cost Based Optimizer (Optiq)

New: Data Governance & Integration

Apache Falcon
Simplified Data Governance for Enterprise Hadoop
• First time included in HDP
• Provides key governance framework for:
• Acquisition & processing of datasets
• Replication & retention of datasets
• Redirect datasets to non-Hadoop extensions
• Provides audit trail & lineage

Investment Phases

Phase-3
• Advanced dashboard for pipeline definition & management
• Audit
• Lineage
• Data tagging
• File import SSH & SCP

Phase-2
• Basic dashboard for pipeline viewing
• Kerberos security support
• Ambari integration for management
• Hive/HCatalog integration

Phase-1
• Incubate Apache Falcon
• Dataset replication & retention
• Falcon tech preview

Another great example of Open Community Innovation
Originally built and contributed to Apache by InMobi
• Fastest path to innovation is the open community
• 14 months in the making
• Tested in production
• Vibrant community of developers building


New: Apache Knox for Perimeter Security

Important Note: Security for Hadoop must be addressed within every layer of the stack and integrated into existing frameworks.
For a full description of what is available in Enterprise Hadoop today across Authentication, Authorization, Accountability and Encryption, please visit our security labs page.

Apache Knox
Perimeter security for Hadoop
• A common place to perform authentication across Hadoop and all related projects
• Integrated with LDAP and AD
• Currently supports: WebHDFS, WebHCat, Oozie, Hive & HBase
• Broad community effort, incubated with Microsoft, broad set of developers involved
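
As a rough illustration of perimeter security from the client side, the sketch below builds a WebHDFS URL that goes through a Knox gateway rather than directly to a NameNode, plus the HTTP Basic header that carries the LDAP/AD credentials. It assumes Knox's default URL layout (port 8443, gateway path "gateway"). The host name, the topology name "sandbox", the credentials and both helper function names are hypothetical and do not come from the slides.

```python
import base64
from urllib.parse import quote, urlencode


def knox_webhdfs_url(gateway_host, topology, hdfs_path, op,
                     gateway_port=8443, gateway_path="gateway", **params):
    """Build a WebHDFS URL routed through a Knox gateway (hypothetical helper).

    Assumes the default Knox layout:
    https://{host}:{port}/{gateway_path}/{topology}/webhdfs/v1/{path}?op=...
    """
    query = urlencode({"op": op, **params})
    path = quote(hdfs_path.lstrip("/"))
    return (f"https://{gateway_host}:{gateway_port}/{gateway_path}/"
            f"{topology}/webhdfs/v1/{path}?{query}")


def basic_auth_header(user, password):
    """Return an HTTP Basic Authorization value; the gateway checks it, not the cluster."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"


# Hypothetical host and topology names, for illustration only.
url = knox_webhdfs_url("knox.example.com", "sandbox", "/tmp", "LISTSTATUS")
print(url)
# prints "https://knox.example.com:8443/gateway/sandbox/webhdfs/v1/tmp?op=LISTSTATUS"
```

The client only ever sees the gateway address, so cluster host names and ports stay behind the perimeter.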

Security Investments

Security Phase 3:

• Audit event correlation and Audit viewer

• Support Token-Based AuthN beyond Kerberos

• Data Encryption in HDFS, Hive & HBase

• Knox for HDFS HA, Ambari & Falcon

Security Phase 2:

• ACLs for HDFS

• Knox: Hadoop REST API Security

• SQL-style Hive AuthZ (GRANT, REVOKE)

• SSL support for Hive Server 2

• SSL for DN/NN UI & WebHDFS

• PAM support for Hive

Security Phase 1:

• Strong AuthN with Kerberos

• HBase, Hive, HDFS basic AuthZ

• Encryption with SSL for NN, JT, etc.

• Wire encryption with Shuffle, HDFS, JDBC

New: Stream Processing with Apache Storm

Apache Storm
Real-time event processing for sensor and business activity monitoring

• Unlocks new business cases for Hadoop

• Key component of a data lake architecture

• Scale: Ingest millions of events per second.

Fast query on petabytes of data

• Integrated with Ambari to manage

Investment Phases

Phase-3
• High Availability management w/Ambari
• AD/LDAP plugin for authentication
• Declarative “wiring”
• Hive update support
• Advanced scheduler

Phase-2
• Storm-on-YARN
• Ingest & Notification for JMS
• Data persistence: EDWs, RDBMS, Cassandra

Phase-1
• Install, Start, & Stop via Ambari
• Kafka, HBase, & HDFS Connectors
• Ganglia & Nagios based monitoring


New: Search for Hadoop

Apache Solr
Open source enterprise search for Hadoop and HDP
• Open architecture: In the community, for the community
• Simple, powerful UI for advanced search applications
• High performance indexing & sub-second search times over billions of documents
• Deep Integration Roadmap with HDP

Partnership with LucidWorks
• LucidWorks provides tier 3 & 4 support
• Alignment w/ strategy of working within the community and with the core committers
• 9 committers total (7 PMC)


Cascading SDK & HDP 2.1

Cascading SDK

Enables the rapid development of batch and interactive data-driven applications

Integration Roadmap

• Step 1: Integrate Cascading SDK for customers to use with HDP 2.1

• Step 2: Integration with Tez

Tech Preview: Apache Spark

In-memory processing is “HOT!”
…however, most of the world is using it for science and machine learning

In-memory sandbox for iterative data analytics used by a handful of data scientists

Hortonworks provides guidance for initial applicability and scale
• Exploring key use cases with customers focused on iterative access & machine learning
• Experience thus far supports target deployments of no more than: 1 TB of data, 40 nodes, and 1-3 users
• Skill set required: Scala (Java-based API Framework)


Operating Enterprise Hadoop

Apache Ambari is the only 100% open source framework for provisioning, managing and monitoring Apache Hadoop clusters

[Diagram: AMBARI WEB, Viewpoint and others use REST APIs to reach the AMBARI SERVER (PROVISION | MANAGE | MONITOR), which manages the compute & storage nodes]

Integration With Existing Operations Tools

New in HDP 2.1
• Support new Data Access Engines
• Stack extensibility, Cluster Blueprints
• Rolling restarts
• Maintenance mode
• more...
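
The slide names REST APIs as the route by which operations tools reach the Ambari server. As a minimal sketch, assuming Ambari's usual `/api/v1` base path, default port 8080 and HTTP Basic authentication, the snippet below prepares (but does not send) a request that would list clusters. The host, the credentials and the `ambari_request` helper are hypothetical and not taken from the slides.

```python
import base64
import urllib.request


def ambari_request(host, resource, user, password, port=8080, method="GET"):
    """Prepare an Ambari REST request without sending it (hypothetical helper).

    Assumes the API is served under /api/v1 and uses HTTP Basic authentication.
    """
    url = f"http://{host}:{port}/api/v1/{resource.lstrip('/')}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Authorization": f"Basic {token}",
        # Ambari expects this header on requests that modify state.
        "X-Requested-By": "ambari",
    }
    return urllib.request.Request(url, headers=headers, method=method)


# Hypothetical host and credentials, for illustration only.
req = ambari_request("ambari.example.com", "clusters", "admin", "admin")
print(req.get_method(), req.full_url)
# prints "GET http://ambari.example.com:8080/api/v1/clusters"
```

Passing the prepared request to `urllib.request.urlopen` would perform the call against a live server; an existing monitoring or ticketing tool could integrate the same way.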


HDP 2.1 Investment Themes

HDP 2.1 Represents a MAJOR step forward for Hadoop

Delivery of Interactive Query via Stinger Initiative, Addition of Data Governance, more Security, Stream Processing and Search, Highlight Release

Three Key Highlights of Release

1. Stinger Initiative DELIVERED: Interactive Query in Apache Hive

2. NEW Capabilities for Hadoop

• Governance: delivered with Apache Falcon

• Security: Apache Knox extends perimeter security for Hadoop

3. NEW Engines included in HDP

• Stream processing: Apache Storm to analyze/process streams of data

• Search: via Apache Solr

AND the HDP Spark Tech Preview, Simultaneous Linux & Windows Release, COUNTLESS additional features

Thank You

Jeff Markham

Technical Director, APAC

[email protected]