Evolution of Big Data at Intel - Crawl, Walk and Run Approach

Evolution of Big Data at Intel - crawl, walk and run approach Gomathy Bala | Director Chandhu Yalla | Manager & Architect Key Contributors: Sonja Sandeen, Seshu Edala, Nghia Ngo and Darin Watson IT BI Big Data Team

Upload: hadoop-summit

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Evolution of Big Data at Intel - Crawl, Walk and Run Approach

Evolution of Big Data at Intel - Crawl, Walk and Run Approach

Gomathy Bala | Director
Chandhu Yalla | Manager & Architect

Key Contributors: Sonja Sandeen, Seshu Edala, Nghia Ngo and Darin Watson

IT BI Big Data Team

Page 2: Evolution of Big Data at Intel - Crawl, Walk and Run Approach

Copyright © 2014, Intel Corporation. All rights reserved.

Legal Notices

This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.

The content in this presentation is being shared under NDA.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. * Other names and brands may be claimed as the property of others.

Page 3: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


Agenda

• Intel IT Big Data Journey
• Enterprise DW architecture
• BI Big Data 3 yr Roadmap
• Big Data Ecosystem Architecture
• Platform Strategies & BKMs
• Summary

Page 4: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


Intel IT Big Data Journey

[Timeline graphic spanning 2011, 2012, 2013, 2014 and 2015. Milestones, in the order they appear on the slide:]

• Big Data & Analytics Strategy
• Production Online
• Telmap: 1st Use Case
• Preproduction Online
• Hadoop Evaluation
• IDH to CDH
• Hadoop 2.0
• $176M BV
• Production: Security BI, Attribute Reduction System, ATM Ellipses Engine, IAH-Retail Analytics
• 6 Environments
• CDH 5.3
• 4 Use Cases in Preproduction
• 12 POC Use Cases
• 6 Use Cases in Production
• $290K investment
• $948/TB
• 3 Use Cases in Production: Smart-What, Marketing-IAH, Incident Predictability
• $6M BV
• CDH 5.1
• IAH – Cloud CRM In Production
• Enterprise Standards, Guidance, Processes for Platform & Capabilities

15 Active Use Cases | $290K + 10.5 HC Investment | Delivered $182M BV
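The headline figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check under stated assumptions: it assumes the "$6M BV" and "$176M BV" milestones are the two components of the "$182M BV" total, and that "$948/TB" was derived from the "$290K" investment. Neither pairing is stated on the slide, and the variable names are illustrative.

```python
# Figures copied from the timeline slide. How they relate to each other
# is an assumption made for this check, not something the slide states.
bv_milestones_musd = [6, 176]      # "$6M BV" and "$176M BV"
reported_total_bv_musd = 182       # "Delivered $182M BV"

investment_usd = 290_000           # "$290K investment"
cost_per_tb_usd = 948              # "$948/TB"

# Sum of the two business-value milestones.
total_bv_musd = sum(bv_milestones_musd)

# Capacity implied if the whole investment were priced at $948/TB.
implied_capacity_tb = investment_usd / cost_per_tb_usd

print(f"Sum of BV milestones: ${total_bv_musd}M (reported: ${reported_total_bv_musd}M)")
print(f"Implied capacity at ${cost_per_tb_usd}/TB: {implied_capacity_tb:.0f} TB")
```

Under these assumptions the two milestones add up to the reported total, and the investment corresponds to roughly 300 TB of capacity.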

Page 6: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


2014-2017 Vision: Real-Time Enterprise

[Architecture diagram. Recoverable labels: Any Data Source; ERP; CRM; SCM; SRM; ECC; BWECCW; MDG; NW; In Memory Real-Time Data Platform; Real-Time & Self Service Analytics Platform; Teradata; Cloudera Hadoop Data Lake; Enterprise Data Warehouse; Data Tiering (Hot-Cold data); Reporting Tools; Predictive Analytics; BPC; BCS; Cloud BI; SaaS; NRT; Custom; Intel; Other Apps; New Apps; Downstream Applications]

Page 7: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


Enterprise Data Architecture with Hadoop and Other MPP DWH: Current & Future Strategy

[Architecture diagram. Recoverable labels: Present; Future; FE Tools; CLS/Proxy; High speed data loader; Big Data; EDW; Mfg Data; IMT; SQL on Hadoop; A %ge of Traditional BI use cases]

Use Cases
• Machine Learning
• Log Processing
• Unstructured data
• High volume counter Analytics
• Text Parsing/Mining

Use Cases
• Strategic/Operational reporting
• Interactive Reporting
• High Concurrent user analytics - Supply/Order
• Mission critical analytics – Finance/HR

Page 8: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


BI Big Data | 3-Year Roadmap

Years: 2015, 2016, 2017
Phases: Big Data + AA; Big Data + SSAA + Traditional BI; Big Data + SSAA + Traditional BI

• Scalable and well designed Hadoop Platform
• Security (RBAC, ITS/IRS)
• Data Governance
• Data Discovery
• Self Service AA Framework
• IMT + Hadoop
• AVP + Hadoop
• In-memory + Near real time capabilities
• SQL on Hadoop

• Evolve IMT + Hadoop
• Data Lineage & Data Catalog
• Streaming Capabilities
• Advanced SQL on Hadoop
• ACID semantics

• Evolve Big Data + SSAA per ecosystem roadmaps
• BC/DR
• End to end enterprise features
• Enterprise ready: OLAP and Traditional DW

Hadoop is an open source framework designed for big data analytics.

Hadoop is evolving rapidly, but it will still take a couple of years for it to mature and support "traditional BI" use cases.

Legend
Orange Text: Traditional BI Capabilities
Green Text: Big Data/AA Capabilities

Page 9: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


Big Data Platform – Ecosystem Architecture & Maturity

[Layered architecture diagram. Recoverable labels, grouped by the layer name they appear next to:]

• User layer: End User, Data Steward, Business Analyst, Data Scientist, Developer, Auditor
• Presentation Layer: Data Virtualization, Data Discovery, Adv. Analytics, Adv. Visualization, Data Management
• Analytical layer: Machine Learning, Statistical, Numerical, Time series, Textual/Log, Spatial, Graph
• Processing Layer: Batch Processing, NRT/Stream Processing, In-Memory Processing
• Storage Model: Textual/Log DB, Hierarchy DB, Relational DB, Graph DB, Columnar DB
• Middleware: Ent. Scheduler Srvs, Metadata Mgmt, Workload Mgmt
• Infrastructure: Platform Virtualization, Platform Management, Network Management, Systems Management
• Processes: Prescriptive Guidance, Change Release, Governance, Engagement, Service Management, Training, Support
• Other labels: Data Integration, Data Ingestion, Data Egression, Data Consumption, Source/Target APIs, 3rd Party Drivers, Continuous Integration, Dev Framework, Security

Legend
Majority CDH offered capabilities
Other Vendors offered capabilities

Page 10: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


BI Big Data Platform

Hadoop Project Sandbox – CDH 5.3
Multiple instances deployed on Intel Cloud & MyCloud environments. TTM to business: 2-3 days

Hadoop Pre-Production – CDH 5.3
10 data nodes | 399TB | 320 vcores
Use cases in Dev/POC: 14

Hadoop Production – CDH 5.3
22 data nodes | 658TB | 704 vcores
Use cases live in prod: 7

Hadoop 2.0 architecture provides reliability, scalability & performance
• High availability and scalability design
• Well positioned to meet 2015 business use case requirements
• Repeatable architecture for faster builds
• Capacity additions: add data node. White boxes, Waterfall equipment or HP servers
• TTM: varies depending on HW (3 wks-2 months)

Gateway nodes
Login (ssh): AD authentication & authorization, access cluster, run HDFS commands, submit jobs, etc.

[Cluster diagram. Recoverable labels: Job/Workflow Management; Name Node / Resource Mgr (two instances, NN hi-av); Data Node (five shown); YARN; heartbeat, balancing, replication; Scale to meet business needs; Gateway Nodes; Management Node; Source Data (DB data, Unstructured, Semi-structured); Data Movement/ETL; EDW or Datamart; Visualization Tools]
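The cluster sizes quoted on this slide can be reduced to per-node figures, which is a convenient way to reason about the "add data node" capacity model. The following is a minimal sketch, not Intel's sizing method: the `Cluster` class and its method names are hypothetical, and the usable-capacity estimate assumes the quoted terabytes are raw capacity stored with the HDFS default replication factor of 3, which the slide does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Cluster figures as quoted on the slide (hypothetical helper type)."""
    name: str
    data_nodes: int
    storage_tb: float
    vcores: int

    def tb_per_node(self) -> float:
        return self.storage_tb / self.data_nodes

    def vcores_per_node(self) -> float:
        return self.vcores / self.data_nodes

    def usable_tb(self, replication: int = 3) -> float:
        # Assumption: storage_tb is raw capacity and every block is stored
        # `replication` times (3 is the HDFS default).
        return self.storage_tb / replication


PREPROD = Cluster("pre-production", data_nodes=10, storage_tb=399, vcores=320)
PROD = Cluster("production", data_nodes=22, storage_tb=658, vcores=704)

for cluster in (PREPROD, PROD):
    print(
        f"{cluster.name}: {cluster.tb_per_node():.1f} TB/node, "
        f"{cluster.vcores_per_node():.0f} vcores/node, "
        f"~{cluster.usable_tb():.0f} TB usable at 3x replication"
    )
```

Both environments work out to 32 vcores per data node, which is consistent with the slide's point about a repeatable architecture; storage per node differs between the two.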

Page 11: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


BKMs | Summary

• Skills and resources with time to ramp up
• Starting small is ok. Focus on design and scalability for the platform.
• Technical product evaluation: stick with a distribution that is the core Hadoop open source stack rather than proprietary software
• Security is a big deal to Intel; implementing Big Data security capabilities is a key focus
• Methodology to understand the data is to use an iterative discovery method with technical, business and modeling teams.
• Intel IT Big Data Journey benefited heavily from Cloudera partnership
• Open source will play a big role in advancing Big Data capabilities and analytics

Page 12: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


BI Big Data IT@Intel Resource Info

BI Big Data IT@Intel Resource Links:
1. Hadoop Migration Success Story: How Intel IT Moved to Cloudera
2. Mining Big Data in the Enterprise for Better Business Intelligence
3. Enabling Big Data Platforms and Solutions with Centralized Data Management
4. Integrating Apache Hadoop* into Intel’s Big Data Environment
5. Using a Multiple Data Warehouse Strategy to Improve BI Analytics

To learn more: www.intel.com/bigdata

Page 13: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


Q & A

Page 14: Evolution of Big Data at Intel - Crawl, Walk and Run Approach

Intel Confidential — Do Not Forward

Page 15: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


Backup

Page 16: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


Big Data Capability Catalog

[Capability catalog diagram. Recoverable labels, grouped where the slide's own group names make the grouping clear:]

• Cloudera* Distribution of Hadoop (CDH)
• Components, in slide order: Hive, HDFS, MapReduce, Zookeeper, Pig, Mahout, HDFS Compress, WHIRR, Hbase, Storm, Hcatalog, ACCUMULO, YARN, SPARK, HUE, SOLR, IMPALA, PARQUET, DataFu, Oozie, Kafka, Sqoop, Flume, Sentry
• 3rd Party SW/Connectors Integration: SQOOP JDBC, Other DW, Autosys, SecureGIT, Impala JDBC, Hive ODBC, Impala ODBC, TDCH, Google Analytics, SFDC
• Data Integration: DI Gateway, SFTP, SMBClient, Camel
• Cloudera Manager* (System Management): Deployment, Monitoring, Reporting, Diagnostics, Alerting, Service Management, Rolling Upgrades, Config Rollbacks
• Cloudera Navigator* (Data Management): Audit, Access Control, Discovery, Explore, Lineage, Lifecycle
• Infrastructure: Network, Servers, Storage, Security, OS, Hi-Av, EAM / AD Integration
• Process: Governance, Change Release, Engagement, Service mgmt., Prescriptive Guidance, Training

Legend
Enabled: Avail. Now
WIP: 1-3 Months
Planned: 3-6+ Months

List includes only the capabilities planned for next 6 months.

Page 17: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


Migration to Cloudera – 6 BKMs

i. Find Differences with a Comparative Evaluation in a Sandbox Environment
ii. Define Your Strategy for the Cloudera Implementation
iii. Split the Hardware Environment
iv. Upgrade the Hadoop Version
v. Create a Preproduction-to-Production Pipeline
vi. Rebalance the Data
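Step vi, rebalancing, matters because data nodes added or repurposed during a migration start out nearly empty. The slides do not say how the rebalance was carried out. As an illustration only, the sketch below applies the rule the stock HDFS balancer uses (normally run as `hdfs balancer -threshold <percent>`): a node counts as balanced when its utilization is within the threshold of the cluster-wide utilization. The function name, node names and capacities are hypothetical.

```python
from typing import Dict, Tuple


def classify_nodes(
    nodes: Dict[str, Tuple[float, float]],
    threshold_pct: float = 10.0,
) -> Dict[str, str]:
    """Label each data node relative to cluster-wide utilization.

    `nodes` maps a node name to (used_tb, capacity_tb). A node is
    "balanced" when its utilization is within `threshold_pct` percentage
    points of the cluster average, mirroring the HDFS balancer's
    threshold rule.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    cluster_pct = 100.0 * total_used / total_capacity

    labels = {}
    for name, (used, capacity) in nodes.items():
        node_pct = 100.0 * used / capacity
        if node_pct > cluster_pct + threshold_pct:
            labels[name] = "over-utilized"
        elif node_pct < cluster_pct - threshold_pct:
            labels[name] = "under-utilized"
        else:
            labels[name] = "balanced"
    return labels


# Hypothetical cluster: two full nodes, one average node, one newly added node.
example = {
    "dn1": (27.0, 30.0),
    "dn2": (24.0, 30.0),
    "dn3": (18.0, 30.0),
    "dn4": (3.0, 30.0),
}
labels = classify_nodes(example)
print(labels)
```

In this example the cluster average is 60%, so with the default 10-point threshold the two fullest nodes are sources, the new node is a target, and the node at 60% is left alone.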

Page 18: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


Building Block Strategy to Enterprise Security of Hadoop

Q1’15 (now): Perimeter access with LDAP plus finer-grained controls with Sentry. This is the second building block towards an enterprise-grade security design.

Q2’15: Add Kerberos to enable more Hadoop components and further secure the platform.

2H’15: Exploration starting; awaiting product, with a target to adopt in production in 2H’15.
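The "finer grain controls with Sentry" building block amounts to role-based grants: a role is created, mapped to a directory group, and given privileges on a database or table. The sketch below generates the corresponding SQL statements in the form used by Hive and Impala when Sentry is enabled. It is an illustration, not Intel's configuration: the helper name and the role, group and database names are hypothetical.

```python
import re
from typing import List

# Conservative identifier check so the generated SQL cannot be abused.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _checked(name: str) -> str:
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def read_only_grants(role: str, group: str, database: str) -> List[str]:
    """Build Sentry-style statements granting a group read-only access.

    The group is expected to be one the cluster resolves through LDAP/AD;
    all names passed in here are illustrative.
    """
    role, group, database = _checked(role), _checked(group), _checked(database)
    return [
        f"CREATE ROLE {role};",
        f"GRANT ROLE {role} TO GROUP {group};",
        f"GRANT SELECT ON DATABASE {database} TO ROLE {role};",
    ]


statements = read_only_grants("bi_analyst", "bi_analysts", "retail_analytics")
for statement in statements:
    print(statement)
```

Generating the statements from a small helper keeps the role-to-group mapping reviewable in one place; the statements would then be run by an administrator through a Hive or Impala session.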

Page 19: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


Hadoop Maturity & Evolution

Hadoop 1.0 (2005 - 2012)
Stack: MapReduce (batch data processing, cluster resource management); HDFS 1.0 (redundant, reliable data storage)
+ Scalable data storage and processing platform
+ Positioned for Batch processing workloads for Map and Reduce only
+ Apache Hive offers SQL-like query language
- Lacks reliability and stability
- No support for low latency queries

Hadoop 2.0 (2013 - 2014)
Stack: HDFS (redundant, reliable data storage); YARN (cluster resource management); Batch (MapReduce); Others (data processing)
• Apache YARN allows you to run multiple applications in Hadoop and provides reliability, scalability and performance
• Advanced Resource Management
• Apache Hive offers a 50x improvement in performance for queries
• Cloudera Impala to support low latency query requirements with SQL-92 and SQL-2000 support

2015 - 2017
Stack: HDFS 2.0 (redundant, reliable data storage); YARN (cluster resource management); Batch (MapReduce); Interactive (Impala); In-Memory (Spark); Online (Hbase); Graph; Others (Search, Storm etc.)
Applications Run Natively In Hadoop
• Data at Rest Encryption and Row Level/Cell Level Security planned
• Data Streaming and Search Capability
• GraphDB
• Expanded Data Governance
• IMT + Hadoop Integration
• Improved Front End tool integration/support
• Deeper Diagnostics for multiple components
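The batch model that this slide attributes to Hadoop 1.0, map and reduce only, can be shown in miniature. The following is a single-process sketch of the map, shuffle and reduce phases using word count as the job. It illustrates the programming model only and does not use Hadoop's APIs; all function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "Big Data at Intel"])
print(counts)
```

Every job has to be expressed as these two functions with a full shuffle in between, which is why the model suits batch workloads and why the later stacks on the slide add separate engines for interactive and in-memory processing.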

Page 20: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


2014 Intel IT Vital Statistics

• >6,300 IT employees
• 59 global IT sites
• >98,000 Intel employees (1)
• 168 Intel sites in 65 countries
• 64 data centers (91 data centers in 2010)
• 80% of servers virtualized (42% virtualized in 2010, goal of 75%)
• >147,000 devices
• 100% of laptops encrypted
• 100% of laptops with SSDs
• >43,200 handheld devices
• 57 mobile applications developed

Source: Information provided by Intel IT as of Jan 2014
(1) Total employee count does not include wholly owned subsidiaries that Intel IT does not directly support

Page 21: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


Big Data in the Industry

• Recommendation Engine
• Fraud Detection
• Sentiment Analytics
• Behavioral Targeting
• Customer Experience Analytics
• Marketing campaign Analytics

Page 22: Evolution of Big Data at Intel - Crawl, Walk and Run Approach


Learn more about Intel IT’s Initiatives at www.intel.com/IT

Sharing Intel IT Best Practices With the World