Spark Summit keynote by Suren Nathan

Data Profiling and Pipeline Processing with Spark – A Journey

Suren Nathan, Synchronoss

(Q3’2014 revenue)

Who am I
• Sr. Director, Big Data Platform and Analytics Framework at Synchronoss
• CTO at Razorsight (acquired by Synchronoss)
• Worked in analytics and decision support systems for more than 15 years
• Passionate about solving business problems leveraging the latest technology


Synchronoss provides Personal Cloud and Activation Platforms to Tier One Operators, MSOs and Enterprises around the globe

Mobile Content Transfer

Personal Cloud

Device Activation

Cloud Account

Provisioning

On-Boarded

Welcome

Synchronoss Integrated Cloud Products

Online and Device ACTIVATION

Back-up, Sync and Share

ACTIVATION CLOUD Internet of Things

Integrated Life


Synchronoss Connects Operators to their Customers

Big Data @ Synchronoss
Sample numbers @ one Tier 1 customer:
• 30M registered users
• 14M monthly active users
• 8M daily active users
• Up to 215TB of ingest per day
• 62PB of content stored
• 50 billion user content files
• Ingest of 1PB per week
• 4+ star rating apps

What do we do?
• Big Data Analytics Platform Group
• Implement scalable big data technology platform to help deliver consistent analytics
• Platform deployed in private cloud and AWS

Data Pipeline Process

Ingest Data

Profile Data

Parse Data

Transform Data

Enrich Data

Aggregate Data

- Perform Analysis
- Load Index Store
- Feed EDW

Our Data Pipeline Journey

Data Pipeline – V1

Staging ETL

EDW ETL

Process-Centric ETL

Source Data EDW

• Multiple custom ETLs separated from data layer
• SMP architecture, not distributed
• Long-running batch workloads
• Contention, bottlenecks with increased data volume
• No support for unstructured data
• Cannot retain historical data

$$$ | >1 Year | Inflexible


Data Pipeline – V2

Staging ETL

EDW ETL

Process-Centric ETL

Source Data EDW

• ETLs closer to data
• High performance, but expensive
• Batch workloads, with reduced latencies
• Unable to handle unstructured data
• Storage costs prohibitive

$$$$ | 6 Months+ | Still Inflexible

MPP Appliance

Data Pipeline – V3 Option Skipped

Source Data

• Did not foresee a huge improvement
• Batch workloads only
• Slow performance with MapReduce
• Lack of resources and skills gap
• Lack of consistency
• Too many tools

$$ | 1 Year+ | Risks


Source Data

• ETLs closer to data
• Batch and stream workloads
• Superior performance
• Abstracted features via framework
• Components and standards
• Multiple language support
• Simplified design

$ | <1 Month | Highly Flexible

Data Profiling

Put all the data in the lake, man

What’s in these data sets?

More data is better. Work with the population and not a sample

-- Data Scientists

Why Data Profiling?
• Find out what is in the data
• Get metrics on data quality
• Assess the risk involved in creating business rules
• Discover metadata including value patterns and distributions
• Understand data challenges early to avoid delays and cost overruns
• Improve the ability to search the data
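
One way to picture the "value patterns and distributions" bullet above is a small Python sketch that reduces each value to a shape (digits become 9, letters become A) and counts how often each shape occurs. The function names and sample values are hypothetical and are not taken from the talk.

```python
from collections import Counter

def value_pattern(value):
    """Reduce a raw string to its shape: digits -> 9, letters -> A."""
    shape = []
    for ch in value:
        if ch.isdigit():
            shape.append("9")
        elif ch.isalpha():
            shape.append("A")
        else:
            shape.append(ch)  # keep punctuation and spaces as-is
    return "".join(shape)

def pattern_distribution(values):
    """Count how many values share each shape."""
    return Counter(value_pattern(v) for v in values)

# Hypothetical sample column
phones = ["555-1234", "555-9876", "N/A"]
print(pattern_distribution(phones))
```

A rare shape in an otherwise uniform column is often the first visible sign of a data quality issue.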

Business Challenge
• Analysts spend 80-90% of their time on data munging
• Current approaches require multiple manual touch points and processes
• Lost opportunity due to lengthy project time frames

Typical Scenario
• Data size too large to view using Excel and Notepad
• Data has to be loaded into a database for profiling
• Cannot load into a database unless the data fields are known
• File formats are not right and specifications are incorrect
• Distribution, space, multiple touch points, moving files here and there

CRAZY

Too many dependencies, wasted time

What do we need?

Speed, Agility & Automation

Data Profiler Requirements

Profile data from data lake

Validate and Preview data

Review statistics

Create Meta Data

Create Downstream Schema

Spark to the rescue

Check the Types

Check the Values

Calculate metrics

Generate MetaData

[Diagram: Data Files, RDDs, columns C1 C2 C3 C4 … Cn]

Dynamically built execution graph

Map -> Map

Built-in transformations (unique, get first, etc.)

In memory execution provides speed
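
The four steps listed above (check the types, check the values, calculate metrics, generate metadata) can be sketched in plain Python as map-style passes over each column. This is an illustration only: the talk ran these steps on Spark RDDs, and the function names, type labels, and sample rows below are hypothetical.

```python
from collections import Counter

def infer_type(value):
    """Classify one raw string field (hypothetical type labels)."""
    text = value.strip()
    if text == "":
        return "missing"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "decimal"
    except ValueError:
        return "string"

def profile_column(name, values):
    """Check types and values, compute metrics, return column metadata."""
    types = Counter(map(infer_type, values))  # map: field -> type label
    present = [v for v in values if v.strip() != ""]
    observed = set(types) - {"missing"}
    # Pick the narrowest type that fits every non-missing value
    if not observed:
        col_type = "unknown"
    elif observed <= {"integer"}:
        col_type = "integer"
    elif observed <= {"integer", "decimal"}:
        col_type = "decimal"
    else:
        col_type = "string"
    return {
        "name": name,
        "type": col_type,
        "missing": types["missing"],
        "non_missing": len(present),
        "distinct": len(set(present)),
    }

def profile_rows(header, rows):
    """Pivot rows into columns, then profile each column independently."""
    columns = zip(*rows)
    return [profile_column(n, list(c)) for n, c in zip(header, columns)]

# Hypothetical sample data
header = ["id", "amount", "city"]
rows = [["1", "10.5", "Austin"], ["2", "", "Boston"], ["3", "7", "Austin"]]
for meta in profile_rows(header, rows):
    print(meta)
```

Because each column is profiled independently, the same per-column function could in principle be applied in parallel across a cluster.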

Execution Flow and Software Stack

[Diagram: execution flow in numbered steps (1, 2, 3, 4, 5, 6.1, 6.2, 6.3, 7, 8) between the Data Profiler UI, the Spark Data Profiler, the data lake location for data, and the Repository.]

[Diagram: software stack. Components: MapR Cluster, MapR FS (M7), NFS, Spark, Spark Monitoring UI, Spark Data Profiler, MapR UI, Meta Data Repository, Web Server, Data Profiler UI. Legend: Hardware Infrastructure Level, OS/File System Level, System Application Level, Razorsight Application Level.]

Univariate Statistics

Outputs for Numeric Values

Outputs for Non-Numeric Values

Histograms

Count of Missing Values

Count of Non-Missing Values

Mean

Variance

Standard Deviation

Minimum

Maximum

Range

Mode

Median

Q1 Value

Q3 Value

Interquartile Range

Skewness

Kurtosis
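
Most of the numeric outputs listed above can be computed with Python's standard `statistics` module (Python 3.8 or later). The sketch below is an assumption-laden illustration, not the profiler's actual implementation. It uses the sample (n - 1) variance, the module's default "exclusive" quartile method, and population-moment formulas for skewness and excess kurtosis, while other tools may use different conventions. Histograms are left out.

```python
import statistics

def univariate_stats(values):
    """Summarize a numeric column; None marks a missing value."""
    data = [v for v in values if v is not None]
    n = len(data)
    mean = statistics.fmean(data)
    # Quartiles use the module's default "exclusive" method
    q1, _, q3 = statistics.quantiles(data, n=4)
    # Population central moments for skewness and excess kurtosis
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return {
        "missing": len(values) - n,
        "non_missing": n,
        "mean": mean,
        "variance": statistics.variance(data),  # sample variance (n - 1)
        "std_dev": statistics.stdev(data),
        "min": min(data),
        "max": max(data),
        "range": max(data) - min(data),
        "mode": statistics.mode(data),
        "median": statistics.median(data),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5 if m2 else 0.0,
        "kurtosis": m4 / m2 ** 2 - 3 if m2 else 0.0,
    }

# Hypothetical sample column with one missing value
summary = univariate_stats([1, 2, 3, 4, 5, None])
for key, value in summary.items():
    print(f"{key}: {value}")
```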

Data Profiler Web Application

Meta Data and DDL
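
As a rough illustration of turning profiled metadata into DDL, the sketch below maps a list of column descriptions to a `CREATE TABLE` statement. The metadata layout, the type mapping, and the table name are all hypothetical; the talk does not show the profiler's real metadata format or target SQL dialect.

```python
# Hypothetical mapping from profiled types to SQL column types
SQL_TYPES = {"integer": "BIGINT", "decimal": "DOUBLE", "string": "VARCHAR(255)"}

def build_ddl(table, columns):
    """Build a CREATE TABLE statement from profiled column metadata."""
    lines = []
    for col in columns:
        sql_type = SQL_TYPES.get(col["type"], "VARCHAR(255)")
        # Columns with no missing values in the profile are marked NOT NULL
        null = "" if col.get("missing", 0) else " NOT NULL"
        lines.append(f"  {col['name']} {sql_type}{null}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n);"

# Hypothetical metadata as a profiler might emit it
metadata = [
    {"name": "id", "type": "integer", "missing": 0},
    {"name": "amount", "type": "decimal", "missing": 1},
]
print(build_ddl("customers", metadata))
```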

Advantages
• Source data in data lake
• All profiling done in the data lake
• No manual movement of data
• Profile sample or full data set
• Integrate creation of metadata for transformation, enrichment
• Send clean data to downstream processes

Results
• Improved data analysis time from weeks to hours
• Average improvement of data pipeline process: 80%
• Identified data quality issues well ahead of time
• Empowered business analysts to perform the work

Framework Components

Ingestion
• Multiple source channels
• Batch/Real Time
• Data Validation
• Compression/Encryption

Profiling
• Data Health Check
• Summary Statistics
• Scrubbing/Cleansing
• Meta Data Creation

Parsing
• Fixed Width
• Delimited
• Mapping

Transformation
• Enrichment
• Truncation
• Imputation
• Aggregation

Integration
• Batch
• RESTful
• Database

Web Portal
• Meta Data Configuration
• Tracking
• Alerts
• Dashboard
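
The Parsing component above names two input shapes, fixed width and delimited. A minimal Python sketch of both follows, assuming a layout given as (field name, start, end) tuples for fixed-width records and a header row for delimited ones. These conventions and function names are hypothetical, not the framework's API.

```python
import csv
import io

def parse_delimited(text, delimiter=","):
    """Parse delimited text whose first line is a header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]

def parse_fixed_width(text, layout):
    """Parse fixed-width text; layout is a list of (name, start, end)."""
    records = []
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            records.append(
                {name: line[start:end].strip() for name, start, end in layout}
            )
    return records

# Hypothetical inputs
print(parse_delimited("id|city\n1|Austin\n2|Boston\n", delimiter="|"))
print(parse_fixed_width("001Austin  \n002Boston  \n", [("id", 0, 3), ("city", 3, 11)]))
```

Driving the parser from a layout description rather than hard-coded offsets keeps the parsing step configurable from metadata.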

Framework Architecture

[Diagram: framework architecture, showing control flow and data flow. Labels: External Data Sources, Data Ingestion, DataBeacon, Synchronoss Data Lake, Data Storage Layer, Orchestration Layer, Processing Components, Data Profiler, Data Parser & Transformer, Data Partitioner, Data Aggregator, Data Reconciliation, Data Prep Engine, Bivariate Engine, SQL Engine, Elastic Search Loader, DB Loader, Elastic Search, MySQL, Meta-data Repository, XDF Web UI.]

Framework Technology Stack

[Diagram: framework technology stack. Clusters (each on Unix/Linux): MapR Cluster, UI/Control Cluster, ElasticSearch Cluster. Components: MapR FS (M7), NFS, Hadoop, Apache Spark, Sqoop, Oozie, Apache Drill, HUE, Tomcat, ActiveMQ, Spring Integration, Angular, REST, ElasticSearch Engine. Legend: Hardware Infrastructure Level, OS/File System Level, System Application Level.]

What’s Next?
• Bivariate Analysis
• Multicollinearity

Outputs for Numeric Values (by target value for each variable)

Correlation Outputs

Record Count

Row Count Percent

Average

Variance

Standard Deviation

Skewness

Kurtosis

Minimum

Maximum

Pearson’s Correlation Coefficient

Spearman’s Correlation Coefficient

Covariance

Variable Clustering

Regression Coefficients

Dendrogram

Hierarchical Cluster (HCA)

Correlation Matrix

Variance Inflation Factor (VIF)
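
Three of the correlation outputs listed above (covariance, Pearson's coefficient, Spearman's coefficient) are simple enough to sketch in plain Python. The version below assumes sample (n - 1) covariance and average ranks for ties. It is an illustration of the formulas, not the planned bivariate engine.

```python
import math

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def pearson(xs, ys):
    """Pearson's r: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical sample: y grows monotonically but not linearly with x
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
print(covariance(xs, ys), pearson(xs, ys), spearman(xs, ys))
```

On a monotonic but nonlinear relationship like this sample, Spearman's coefficient reaches 1 while Pearson's stays below it, which is why the two are usually reported together.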

Lessons
• Let business value drive technology adoption
• Plan incremental updates
• Pay attention to hidden costs
• Simplify
• Implement framework-based development
• Leverage existing skillset to scale

Simplify

THANK YOU
