
MDM for the Modern Data Architecture

September 2014

RedPoint Global Inc. 2014 Confidential

Purpose of MDM

Create correct and consistent data across the enterprise that fosters trust in information and acceleration of growth.


“Without data you’re just another person with an opinion.”

W. Edwards Deming

Why it matters


Vicious Cycle of Unmanaged Data

Unmanaged Data

1. Master data issues remain unaddressed or unresolved
2. Garbage in/garbage out creates process confusion
3. Lack of process trust slows business momentum
4. Data conflicts reinforce siloed operations


© Hortonworks Inc. 2014

A Data Architecture Under Pressure

Applications: Business Analytics, Custom Applications, Packaged Applications

Data System Repositories: RDBMS, EDW, MPP

Sources: Existing Sources (CRM, ERP, Clickstream, Logs)

• Unstructured documents, emails
• Transactional data
• Server logs
• Sentiment, web data
• Geolocation
• Sensor, machine data
• Clickstream
• Hierarchical data
• OLTP, ERP, CRM
• Master data

• 2.8 ZB in 2013
• 85% from new data types
• 15x Machine Data by 2020
• 40 ZB by 2020

Source: IDC
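The two volume figures quoted above (2.8 ZB in 2013, 40 ZB by 2020) imply a compound annual growth rate of roughly 46%. The short sketch below only reworks the slide's own numbers; the variable names are illustrative, and the constant-growth assumption is ours, not IDC's.

```python
# Figures quoted on the slide (attributed there to IDC).
start_zb, start_year = 2.8, 2013
end_zb, end_year = 40.0, 2020

years = end_year - start_year              # 7 years between the two estimates
growth_factor = end_zb / start_zb          # overall multiple over the period
# Assumption: growth is spread evenly, so the yearly rate is the 7th root.
cagr = growth_factor ** (1 / years) - 1

print(f"Overall growth: {growth_factor:.1f}x over {years} years")
print(f"Implied compound annual growth: {cagr:.1%}")
```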


Broad Spectrum of Benefits Across Industries

Financial Services
• New account risk screens
• Fraud prevention
• Trading risk
• Maximize deposit spread
• Insurance underwriting
• Accelerate loan processing

Retail
• 360° view of the customer
• Analyze brand sentiment
• Localized, personalized promotions
• Website optimization
• Optimal store layout

Telecom
• Call detail records (CDRs)
• Infrastructure investment
• Next product to buy (NPTB)
• Real-time bandwidth allocation
• New product development

Manufacturing
• Supplier consolidation
• Supply chain and logistics
• Assembly line quality assurance
• Proactive maintenance
• Crowdsourced quality assurance

Healthcare
• Genomic data for medical trials
• Monitor patient vitals
• Reduce re-admittance rates
• Store medical research data
• Recruit cohorts for pharmaceutical trials

Utilities, Oil & Gas
• Smart meter stream analysis
• Slow oil well decline curves
• Optimize lease bidding
• Compliance reporting
• Proactive equipment repair
• Seismic image processing

Public Sector
• Analyze public sentiment
• Protect critical networks
• Prevent fraud and waste
• Crowdsource reporting for repairs to infrastructure
• Fulfill open records requests


Gartner’s Nexus of Forces Making Things Worse


Business Benefits of MDM

Today IT data mgmt. pros focus on:
• Eliminating duplicate/orphaned data
• Standardizing and centralizing data/metadata
• Meeting operational SLAs
• Data enrichment
• Data integration and synchronization

Business leaders really care about:
• Increasing revenue
• Decreasing costs
• Increasing operational efficiencies
• Reducing risk
• Improving customer experiences

• Increase in customer self-service for order management, technical support and customer service
• Reduction in customer privacy compliance risk exposure
• Reduction in direct marketing postage costs
• Increase in campaign response rates
• Delivering a consistent cross-channel customer experience
• Reduction in average handle time in call center

Use business-value-driven KPIs to evangelize MDM benefits


How About MDM on a Data Lake?

Challenges to Data Lake Approach

• Severe shortage of MapReduce-skilled resources
• Inconsistent skills lead to inconsistent results from code-based solutions
• Nascent technologies require multiple point solutions
• Technologies are not enterprise grade
• Some functionality may not be possible within these frameworks

Benefits of a Hadoop Data Lake

• Data is ingested in its raw state regardless of format, structure or lack of structure
• Raw data can be used and reused for differing purposes across the enterprise
• Beyond inexpensive storage, Hadoop is an extremely powerful, scalable and segmentable computational platform
• Master data can be fed across the enterprise, and deep analytics on clean data is immediately enabled


Key Functions for Master Data Management

ETL & ELT
• Profiling, reads/writes, transformations
• Single project for all jobs

Data Quality
• Cleanse data
• Parsing, correction
• Geo-spatial analysis

Integration & Matching
• Grouping
• Fuzzy match

Master Key Management
• Create keys
• Track changes
• Maintain matches over time

Web Services Integration
• Consume and publish
• HTTP/HTTPS protocols
• XML/JSON/SOAP formats

Process Automation & Operations
• Job scheduling, monitoring, notifications
• Central point of control
• Metadata management
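The grouping, fuzzy-match and key-creation functions listed above can be illustrated with a minimal sketch. It assumes Python's standard-library `difflib.SequenceMatcher` as the similarity measure and a greedy first-match grouping rule. The record layout, the 0.85 threshold, the `MK0001`-style key format and all function names are hypothetical and are not taken from RedPoint's product.

```python
from difflib import SequenceMatcher
from itertools import count


def normalize(name: str) -> str:
    """Cleanse step: lowercase, replace punctuation with spaces, collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy match: similarity ratio of the normalized names against a threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def assign_master_keys(records, threshold=0.85):
    """Greedy grouping: a record joins the first group whose representative it matches,
    otherwise it starts a new group with a freshly created master key."""
    key_seq = count(1)
    groups = []   # (master_key, representative name) per group
    keyed = []
    for rec in records:
        for master_key, representative in groups:
            if similar(rec["name"], representative, threshold):
                keyed.append({**rec, "master_key": master_key})
                break
        else:
            master_key = f"MK{next(key_seq):04d}"   # hypothetical key format
            groups.append((master_key, rec["name"]))
            keyed.append({**rec, "master_key": master_key})
    return keyed


# Hypothetical records from three source systems.
records = [
    {"source": "CRM", "name": "John Smith"},
    {"source": "Billing", "name": "Jon  Smith."},
    {"source": "ERP", "name": "Mary Jones"},
]
keyed = assign_master_keys(records)
for rec in keyed:
    print(rec["master_key"], rec["source"], rec["name"])
```

In this sketch the CRM and Billing records end up under one master key and the ERP record under another. Maintaining matches over time would additionally require persisting the groups between runs, which is left out here.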


Data Lake is the Center of Your MDM Strategy

Ingestion of all data available from any source, format, cadence, structure or non-structure

ELT and data transformation, refinement, cleansing, completion, validation and standardization

Geospatial processing and geocoding

Data profiling, lineage and metadata management

Identity resolution, persistent keying and entity profile management

Attribute source and consumer mapping
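As a rough illustration of the profiling and standardization steps named above, the sketch below profiles one column before and after a simple standardization pass. The sample rows, the `STATE_ALIASES` lookup and the function names are hypothetical; a production tool would cover far more rules and data types.

```python
from collections import Counter

# Hypothetical lookup used to standardize free-text state values.
STATE_ALIASES = {"ca": "CA", "calif.": "CA", "california": "CA"}


def profile(rows, column):
    """Basic column profile: row count, missing values, distinct values, most common value."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "top": Counter(present).most_common(1),
    }


def standardize_state(value):
    """Trim the value and map known aliases to one code; unknown values pass through trimmed."""
    if value is None:
        return None
    trimmed = value.strip()
    return STATE_ALIASES.get(trimmed.lower(), trimmed)


rows = [
    {"id": 1, "state": "CA"},
    {"id": 2, "state": " calif. "},
    {"id": 3, "state": "California"},
    {"id": 4, "state": ""},
]

before = profile(rows, "state")
cleaned = [{**row, "state": standardize_state(row["state"])} for row in rows]
after = profile(cleaned, "state")

print("before:", before)
print("after: ", after)
```

Comparing the two profiles shows the effect of the cleansing pass: the number of distinct values drops while the count of missing values is unchanged.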


Data Lake Architecture for MDM

Data Sources: CRM, ERP, Billing, Subscriber, Product, Network, Weather, Compete, Manuf., Clickstream, Online Chat, Sensor Data, Social Media, Call Detail Records, Fabrication Logs, Sales Feedback, Field Feedback


How Can That Possibly Work?

More Map Reduce! YARN!


Overview: What is Hadoop/Hadoop 2.0

Hadoop 1.0

• All operations based on MapReduce
• Intrinsic inconsistency of code-based solutions
• Highly skilled and expensive resources needed
• Third-party applications constrained by the need to generate code

Hadoop 2.0

• Introduction of YARN: “a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.”
• Mature applications can now operate directly on Hadoop
• Reduced skill requirements and increased consistency


RedPoint Data Management on Hadoop

Diagram components: Partitioning AM / Tasks; Execution AM / Tasks; Data I/O; Key / Split; Analysis; Parallel Section; Partition Data; server; YARN; MapReduce


Reference Hadoop Architecture

Monitoring and Management Tools: AMBARI

MAPREDUCE

REST, HTTP, STREAM

DATA REFINEMENT: HIVE, PIG

STRUCTURE: HCATALOG (metadata services)

INTERACTIVE: HIVE Server2

Query/Visualization/Reporting/Analytical Tools and Apps

SOURCE DATA: Sensor Logs, Clickstream, Flat Files, Unstructured, Sentiment, Customer, Inventory

Data Sources: DBs, JMS Queues, Files, RDBMS, EDW

LOAD: SQOOP, WebHDFS, Flume, NFS

LOAD: SQOOP/Hive, WebHDFS

YARN

HDFS

RedPoint Functional Footprint

Benchmarks – Project Gutenberg

MapReduce
• >150 lines of MR code
• 6 hours of development
• 6 minutes runtime
• Extensive optimization needed

Pig
• ~50 lines of script code
• 3 hours of development
• 15 minutes runtime
• User Defined Functions required prior to running script

RedPoint
• 0 lines of code
• 15 min. of development
• 3 minutes runtime
• No tuning or optimization required

Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):

    public static class MapClass extends Mapper<WordOffset, Text, Text, IntWritable> {
        private final static String delimiters = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ \\|«»¡¢£¤¥¦©¬®¯±¶·¿";
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(WordOffset key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line, delimiters);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

Sample Pig script without the UDF:

    SET pig.maxCombinedSplitSize 67108864
    SET pig.splitCombination true
    A = LOAD '/testdata/pg/*/*/*';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
    C = FOREACH B GENERATE UPPER(word) AS word;
    D = GROUP C BY word;
    E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
    F = ORDER E BY occurrences DESC;
    STORE F INTO '/user/cleonardi/pg/pig-count';
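For orientation, the word count that the Pig sample expresses (tokenize, upper-case, group, count, order by occurrences descending) can be sketched in plain Python. This is an illustration of the logic only, not one of the benchmarked implementations, and it says nothing about cluster runtime. Splitting on whitespace is a simplification of both the Pig `TOKENIZE` call and the delimiter set in the MapReduce sample; the function names and the sample lines are hypothetical.

```python
from collections import Counter


def map_phase(lines):
    """Emit (WORD, 1) pairs, loosely mirroring TOKENIZE followed by UPPER."""
    for line in lines:
        for token in line.split():   # simplification: whitespace delimiters only
            yield token.upper(), 1


def reduce_phase(pairs):
    """Sum the counts per word, mirroring GROUP ... BY word and COUNT."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


def word_count(lines):
    """Return (word, occurrences) sorted by occurrences descending.
    Ties are broken alphabetically so the output order is deterministic."""
    counts = reduce_phase(map_phase(lines))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical input lines standing in for the Project Gutenberg texts.
sample = ["the quick brown fox", "The lazy dog", "the end"]
result = word_count(sample)
for word, occurrences in result:
    print(occurrences, word)
```

The split into a map step and a reduce step is kept deliberately so that the sketch lines up with the `MapClass` excerpt, where each token is written out with a count of one.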


