Spark Summit Keynote by Suren Nathan

Data Profiling and Pipeline Processing with Spark – A Journey Suren Nathan Synchronoss

Upload: spark-summit

Post on 16-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Spark Summit Keynote by Suren Nathan

Data Profiling and Pipeline Processing with Spark – A Journey

Suren Nathan

Synchronoss

Page 2: Spark Summit Keynote by Suren Nathan

(Q3’2014 revenue)

Who am I
• Sr. Director, Big Data Platform and Analytics Framework at Synchronoss
• CTO at Razorsight (acquired by Synchronoss)
• Worked in analytics and decision support systems for more than 15 years
• Passionate about solving business problems by leveraging the latest technology

Page 3: Spark Summit Keynote by Suren Nathan


Synchronoss provides Personal Cloud and Activation Platforms to Tier One Operators, MSOs and Enterprises around the globe

Page 4: Spark Summit Keynote by Suren Nathan

Mobile Content Transfer

Personal Cloud

Device Activation

Cloud Account

Provisioning

On-Boarded

Welcome

Synchronoss Integrated Cloud Products

Page 5: Spark Summit Keynote by Suren Nathan

Online and Device ACTIVATION

Back-up, Sync and Share

ACTIVATION CLOUD Internet of Things

Integrated Life


Synchronoss Connects Operators to their Customers

Page 6: Spark Summit Keynote by Suren Nathan

Big Data @ Synchronoss

Sample numbers @ one tier 1 customer:
• 30M registered users
• 14M monthly active users
• 8M daily active users
• Up to 215TB of ingest per day
• 62PB of content stored
• 50 billion user content files
• Ingest of 1PB per week
• 4+ star rating apps

Page 7: Spark Summit Keynote by Suren Nathan

What do we do?
• Big Data Analytics Platform Group
• Implement a scalable big data technology platform to help deliver consistent analytics
• Platform deployed in private cloud and AWS

Page 8: Spark Summit Keynote by Suren Nathan

Data Pipeline Process

Ingest Data

Profile Data

Parse Data

Transform Data

Enrich Data

Aggregate Data

- Perform Analysis
- Load Index Store
- Feed EDW
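
The stage order on this slide can be sketched as a chain of small functions. This is a plain-Python illustration only, not Spark code and not the framework described in the talk; every function name, the pipe-delimited sample records, and the segment lookup are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw input: pipe-delimited lines as they might land in a data lake.
RAW_LINES = [
    "u1|2017-04-01|120",
    "u2|2017-04-01|abc",   # non-numeric usage value, surfaced by profiling
    "u1|2017-04-02|80",
]

def ingest(lines):
    """Ingest: strip whitespace and skip empty lines."""
    return [ln.strip() for ln in lines if ln.strip()]

def profile(lines, delimiter="|"):
    """Profile: count rows and rows whose usage field is not numeric."""
    bad = [ln for ln in lines if not ln.split(delimiter)[2].isdigit()]
    return {"rows": len(lines), "bad_rows": len(bad)}

def parse(lines, delimiter="|"):
    """Parse: split lines into records, dropping rows that fail the check."""
    records = []
    for ln in lines:
        user, day, usage = ln.split(delimiter)
        if usage.isdigit():
            records.append({"user": user, "day": day, "usage_mb": int(usage)})
    return records

def transform(records):
    """Transform: derive a kilobyte field from the megabyte field."""
    return [{**r, "usage_kb": r["usage_mb"] * 1024} for r in records]

def enrich(records, segments):
    """Enrich: attach a customer segment from a lookup table."""
    return [{**r, "segment": segments.get(r["user"], "unknown")} for r in records]

def aggregate(records):
    """Aggregate: total usage per segment."""
    totals = defaultdict(int)
    for r in records:
        totals[r["segment"]] += r["usage_kb"]
    return dict(totals)

def run_pipeline(lines, segments):
    """Run the stages in the order shown on the slide."""
    staged = ingest(lines)
    stats = profile(staged)
    result = aggregate(enrich(transform(parse(staged)), segments))
    return stats, result
```

Calling `run_pipeline(RAW_LINES, {"u1": "gold"})` returns the profiling summary alongside the aggregated totals, so data quality problems are visible before the aggregate is used downstream.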

Page 9: Spark Summit Keynote by Suren Nathan

Our Data Pipeline Journey

Page 10: Spark Summit Keynote by Suren Nathan

Data Pipeline – V1

Staging ETL

EDW ETL

Process-Centric ETL

Source Data

EDW

Multiple custom ETLs separated from the data layer

SMP architecture, not distributed

Long-running batch workloads

Contention and bottlenecks with increased data volume

No support for unstructured data

Cannot retain historical data

$$$ | >1 Year | Inflexible

Page 11: Spark Summit Keynote by Suren Nathan

Data Pipeline – V2

Staging ETL

EDW ETL

Process-Centric ETL

Source Data

EDW

ETLs closer to data

High performance, but expensive

Batch workloads, with reduced latencies

Unable to handle unstructured data

Storage costs prohibitive

$$$$ | 6 Months+ | Still Inflexible

MPP Appliance

Page 12: Spark Summit Keynote by Suren Nathan

Data Pipeline – V3 Option Skipped

Source Data

Did not foresee a huge improvement

Batch workloads only

Slow performance with MapReduce

Lack of resources and skills gap

Lack of consistency

Too many tools

$$ | 1 Year+ | Risks

Page 13: Spark Summit Keynote by Suren Nathan

Data Pipeline – V4

Source Data

ETLs closer to data

Batch and stream workloads

Superior performance

Abstracted features via framework

Components and standards

Multiple language support

Simplified design

$ | <1 Month | Highly Flexible

Page 14: Spark Summit Keynote by Suren Nathan

Data Profiling

Put all the data in the lake, man

What’s in these data sets?

More data is better. Work with the population and not a sample

-- Data Scientists

Page 15: Spark Summit Keynote by Suren Nathan

Why Data Profiling?
• Find out what is in the data
• Get metrics on data quality
• Assess the risk involved in creating business rules
• Discover metadata, including value patterns and distributions
• Understand data challenges early to avoid delays and cost overruns
• Improve the ability to search the data
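
One common way to discover value patterns is to reduce every value to a shape string and count how often each shape occurs. The following is a minimal plain-Python sketch of that idea, not the profiler from the talk; the `9`/`A` encoding and the function names are assumptions made for illustration.

```python
from collections import Counter

def value_pattern(value):
    """Map digits to '9' and letters to 'A'; keep other characters as-is."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def pattern_distribution(values):
    """Count how many values share each pattern."""
    return Counter(value_pattern(v) for v in values)
```

For a column that should hold phone numbers, a dominant `999-999-9999` pattern with a few stray patterns such as `A/A` points directly at the rows that need cleansing rules.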

Page 16: Spark Summit Keynote by Suren Nathan

Analysts spend 80-90% of time in data munging

Current approaches require multiple manual touch points and processes

Lost opportunity due to lengthy project time frames

Business Challenge

Page 17: Spark Summit Keynote by Suren Nathan

Typical Scenario

Data size too large to view using Excel and Notepad

Data has to be loaded into a database for profiling

Cannot load into a database unless the data fields are known

File formats are not right and specifications are incorrect

Distribution, space, multiple touch points, moving files here and there

CRAZY

Too many dependencies, wasted time

Page 18: Spark Summit Keynote by Suren Nathan

What do we need?

Speed, Agility & Automation

Page 19: Spark Summit Keynote by Suren Nathan

Data Profiler Requirements

Profile data from data lake

Validate and Preview data

Review statistics

Create Meta Data

Create Downstream Schema

Page 20: Spark Summit Keynote by Suren Nathan

Spark to the rescue

Check the Types

Check the Values

Calculate metrics

Generate MetaData

RDD: C1 C2 C3 C4 … Cn

Data Files

RDDs

Dynamically built execution graph

Map -> Map

Built-in transformations (unique, get first, etc.)

In-memory execution provides speed
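
The "check the types" step for columns C1 to Cn can be sketched as a per-value classifier followed by a per-column tally. This is a plain-Python illustration under stated assumptions, not the Spark job from the talk: the type names, the widening rule (int, then float, then string), and the function names are all hypothetical. In Spark the per-row work would normally run inside a map over an RDD rather than in a local loop.

```python
def infer_type(value):
    """Classify one raw string as 'missing', 'int', 'float' or 'string'."""
    text = value.strip()
    if text == "":
        return "missing"
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(text)
            return name
        except ValueError:
            pass
    return "string"

def profile_columns(rows):
    """Return, per column index, a count of each inferred type."""
    counts = {}
    for row in rows:
        for idx, value in enumerate(row):
            col = counts.setdefault(idx, {})
            kind = infer_type(value)
            col[kind] = col.get(kind, 0) + 1
    return counts

def column_type(type_counts):
    """Pick the narrowest type that fits every non-missing value."""
    kinds = set(type_counts) - {"missing"}
    if not kinds:
        return "string"
    if kinds == {"int"}:
        return "int"
    if kinds <= {"int", "float"}:
        return "float"
    return "string"
```

The per-column counts double as a data quality metric: a column that is 99% `int` and 1% `string` is usually an integer column with dirty rows, not a string column.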

Page 21: Spark Summit Keynote by Suren Nathan

Execution Flow and Software Stack

[Diagram: execution flow with numbered steps 1 to 8, including 6.1, 6.2 and 6.3]

Components: Data Profiler UI, Spark Data Profiler, Repository, Meta Data Repository, Data Lake location for data, WEB Server, Spark, Spark Monitoring UI, MapR UI, MapR FS (M7), NFS, MapR Cluster

Legend: Hardware Infrastructure Level, OS/File System Level, System Application Level, Razorsight Application Level

Page 22: Spark Summit Keynote by Suren Nathan

Univariate Statistics

Outputs for Numeric Values

Outputs for Non-Numeric Values

Histograms

Count of Missing Values

Count of Non-Missing Values

Mean

Variance

Standard Deviation

Minimum

Maximum

Range

Mode

Median

Q1 Value

Q3 Value

Interquartile Range

Skewness

Kurtosis
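
The numeric outputs listed on this slide can be computed for a single column with Python's standard `statistics` module (3.8 or later). This is a local sketch, not the Spark implementation from the talk. The talk does not say which estimators were used, so the choices here are assumptions: sample variance and standard deviation, the default quartile method of `statistics.quantiles`, and moment-based skewness and excess kurtosis. `None` is used as a hypothetical marker for a missing value, and at least two non-missing values are required.

```python
import statistics

def univariate_stats(values):
    """Summarise a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    # Population central moments, used for skewness and kurtosis.
    m2 = sum((v - mean) ** 2 for v in present) / n
    m3 = sum((v - mean) ** 3 for v in present) / n
    m4 = sum((v - mean) ** 4 for v in present) / n
    # Three cut points: Q1, Q2 and Q3 (Q2 is reported via median below).
    q1, _, q3 = statistics.quantiles(present, n=4)
    return {
        "missing": len(values) - n,
        "non_missing": n,
        "mean": mean,
        "variance": statistics.variance(present),  # sample variance
        "std_dev": statistics.stdev(present),      # sample standard deviation
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
        "mode": statistics.mode(present),
        "median": statistics.median(present),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5 if m2 else 0.0,
        "kurtosis": m4 / m2 ** 2 - 3 if m2 else 0.0,  # excess kurtosis
    }
```

On a full data lake file these values would be computed in a distributed way rather than on a local list, but the definitions stay the same.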

Page 23: Spark Summit Keynote by Suren Nathan

Data Profiler Web Application

Page 24: Spark Summit Keynote by Suren Nathan

Meta Data and DDL
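
Once a profiler has settled on a type for each column, generating DDL for a downstream schema is mostly string assembly. The sketch below assumes a hypothetical metadata shape of `(name, profiled_type, nullable)` tuples and a hypothetical mapping to generic SQL types; the actual metadata format and target database used in the talk are not given.

```python
# Hypothetical mapping from profiled types to generic SQL types.
SQL_TYPES = {"int": "BIGINT", "float": "DOUBLE", "string": "VARCHAR(255)"}

def build_ddl(table, columns):
    """Render CREATE TABLE from (name, profiled_type, nullable) tuples."""
    lines = []
    for name, kind, nullable in columns:
        null_clause = "" if nullable else " NOT NULL"
        lines.append(f"  {name} {SQL_TYPES[kind]}{null_clause}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n);"
```

A column with zero missing values in the profile is a natural candidate for `NOT NULL`, which is how the missing-value counts can feed the schema.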

Page 25: Spark Summit Keynote by Suren Nathan

Advantages
• Source data in data lake
• All profiling done in the data lake
• No manual movement of data
• Profile sample or full data set
• Integrate creation of meta data for transformation, enrichment
• Send clean data to downstream processes

Page 26: Spark Summit Keynote by Suren Nathan

Results
• Improved data analysis time from weeks to hours
• Average improvement of data pipeline process: 80%
• Identified data quality issues well ahead of time
• Empowered business analysts to perform the work

Page 27: Spark Summit Keynote by Suren Nathan

Secure Repository

Data Health | Cleansing | Pruning | Transformation | Univariate Analysis

Descriptive | Predictive | Bivariate | Multivariate

RESTful | SOA

Dashboards | Adhoc Queries | KPIs | Alerts

Data Ingestion

Data Lake

Data Preparation

Data Analytics

Data Services

Data Visualization

Layer 1: Infrastructure

Layer 2: Data Management

Layer 3: Modeling

Layer 4: Integration

Layer 5: Business Insight and Actions

Structured | Unstructured | Batch | Streaming

SFTP | NDM | Nwk Path | Social Media | Stream | Email

Framework Layers

Page 28: Spark Summit Keynote by Suren Nathan

Framework Components

Ingestion: Multiple source channels, Batch/Real Time, Data Validation, Compression/Encryption

Profiling: Data Health Check, Summary Statistics, Scrubbing/Cleansing, Meta Data Creation

Parsing: Fixed Width, Delimited, Mapping

Transformation: Enrichment, Truncation, Imputation, Aggregation

Integration: Batch, RESTful, Database

Web Portal: Meta Data Configuration, Tracking, Alerts, Dashboard
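
The fixed-width and delimited parsing components, plus a simple form of imputation, can be sketched in a few lines. This is an illustration only: the `(name, start, length)` layout format, the function names, and the default-value imputation rule are assumptions, not the framework's actual configuration.

```python
def parse_fixed_width(line, layout):
    """Slice a fixed-width line using (name, start, length) entries."""
    return {name: line[start:start + length].strip()
            for name, start, length in layout}

def parse_delimited(line, names, delimiter=","):
    """Split a delimited line and pair the fields with column names."""
    fields = (field.strip() for field in line.split(delimiter))
    return dict(zip(names, fields))

def impute_missing(records, field, default):
    """Replace empty values of one field with a default value."""
    return [{**r, field: r[field] or default} for r in records]
```

Driving both parsers from metadata (a layout or a list of column names) is what lets one generic component handle many source files, with the metadata itself coming from the profiling step.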

Page 29: Spark Summit Keynote by Suren Nathan

Framework Architecture

Processing Components

Data Storage Layer

Data Aggregator

Data Parser & Transformer

Elastic Search Loader

DB Loader

Data Reconciliation

Orchestration Layer

Elastic Search

XDF Web UI

Data Profiler

MySQL

Meta-data Repository

Control Flow

Data Flow

Data Partitioner

Synchronoss Data Lake

Data Ingestion

DataBeacon

External Data Sources

Bivariate Engine

Data Prep Engine

SQL Engine

Page 30: Spark Summit Keynote by Suren Nathan

Framework Technology Stack

MapR FS (M7)

Sqoop

Apache Spark

Hadoop

MapR Cluster

Hardware Infrastructure Level

OS/File System Level

System Application Level

NFS

UI/Control Cluster

Oozie

Apache Drill

Tomcat

ActiveMQ

Spring Integration

HUE

ElasticSearch Cluster

NFS

ElasticSearch Engine

Angular REST

Unix/Linux Unix/Linux Unix/Linux

Page 31: Spark Summit Keynote by Suren Nathan

What’s Next?
• Bivariate Analysis
• Multicollinearity

Outputs for Numeric Values (by target value for each variable)

Correlation Outputs

Record Count

Row Count Percent

Average

Variance

Standard Deviation

Skewness

Kurtosis

Minimum

Maximum

Pearson’s Correlation Coefficient

Spearman’s Correlation Coefficient

Covariance

Variable Clustering

Regression Coefficients

Dendrogram

Hierarchical Cluster Analysis (HCA)

Correlation Matrix

Variance Inflation Factor (VIF)
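
For reference, the correlation outputs named above follow standard definitions, sketched here in plain Python for two small in-memory columns. This is not the planned bivariate engine: the function names are hypothetical, the covariance is the sample (n - 1) form, and the VIF helper covers only the special case of exactly two predictors, where the R-squared of regressing one on the other equals the squared Pearson coefficient.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def pearson(xs, ys):
    """Pearson's correlation coefficient."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

def vif_two_predictors(xs, ys):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(xs, ys)
    return 1.0 / (1.0 - r * r)
```

A monotonic but non-linear relationship, such as a variable and its square over positive values, gives a Spearman coefficient of 1 while Pearson stays below 1, which is why both are worth reporting.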

Page 32: Spark Summit Keynote by Suren Nathan

Lessons
• Let business value drive technology adoption
• Plan incremental updates
• Pay attention to hidden costs
• Simplify
• Implement framework-based development
• Leverage existing skillset to scale

Page 33: Spark Summit Keynote by Suren Nathan

Simplify

Page 34: Spark Summit Keynote by Suren Nathan

THANK YOU

[email protected]