MDM for the Modern Data Architecture
September 2014
2 RedPoint Global Inc. 2014 Confidential
Purpose of MDM
Create correct and consistent data across the enterprise that fosters trust in information and acceleration of growth.
“Without data you’re just another person with an opinion.”
W. Edwards Deming
Why it matters
Vicious Cycle of Unmanaged Data
Unmanaged Data
1. Master data issues remain unaddressed or unresolved
2. Garbage in/garbage out creates process confusion
3. Lack of process trust slows business momentum
4. Data conflicts reinforce siloed operations
© Hortonworks Inc. 2014
A Data Architecture Under Pressure
Applications: Business Analytics, Custom Applications, Packaged Applications
Data System Repositories: RDBMS, EDW, MPP
Sources: Existing Sources (CRM, ERP, Clickstream, Logs)
Data types:
• Unstructured documents, emails
• Transactional data
• Server logs
• Sentiment, web data
• Geolocation
• Sensor, machine data
• Clickstream
• Hierarchical data
• OLTP, ERP, CRM
• Master data
Data growth (Source: IDC):
• 2.8 ZB in 2013
• 85% from new data types
• 15x machine data by 2020
• 40 ZB by 2020
Broad Spectrum of Benefits Across Industries
Financial Services
• New account risk screens
• Fraud prevention
• Trading risk
• Maximize deposit spread
• Insurance underwriting
• Accelerate loan processing
Retail
• 360° view of the customer
• Analyze brand sentiment
• Localized, personalized promotions
• Website optimization
• Optimal store layout
Telecom
• Call detail records (CDRs)
• Infrastructure investment
• Next product to buy (NPTB)
• Real-time bandwidth allocation
• New product development
Manufacturing
• Supplier consolidation
• Supply chain and logistics
• Assembly line quality assurance
• Proactive maintenance
• Crowdsourced quality assurance
Healthcare
• Genomic data for medical trials
• Monitor patient vitals
• Reduce re-admittance rates
• Store medical research data
• Recruit cohorts for pharmaceutical trials
Utilities, Oil & Gas
• Smart meter stream analysis
• Slow oil well decline curves
• Optimize lease bidding
• Compliance reporting
• Proactive equipment repair
• Seismic image processing
Public Sector
• Analyze public sentiment
• Protect critical networks
• Prevent fraud and waste
• Crowdsource reporting for repairs to infrastructure
• Fulfill open records requests
Gartner’s Nexus of Forces Making Things Worse
Business Benefits of MDM
Today IT data mgmt. pros focus on:
• Eliminating duplicate/orphaned data
• Standardizing and centralizing data/metadata
• Meeting operational SLAs
• Data enrichment
• Data integration and synchronization
Business leaders really care about:
• Increasing revenue
• Decreasing costs
• Increasing operational efficiencies
• Reducing risk
• Improving customer experiences
Use business-value-driven KPIs to evangelize MDM benefits:
• Increase in customer self-service for order management, technical support and customer service
• Reduction in customer privacy compliance risk exposure
• Reduction in direct marketing postage costs
• Increase in campaign response rates
• Delivering a consistent cross-channel customer experience
• Reduction in average handle time in call center
How About MDM on a Data Lake?
Challenges to Data Lake Approach
• Severe shortage of MapReduce-skilled resources
• Inconsistent skills lead to inconsistent results from code-based solutions
• Nascent technologies require multiple point solutions
• Technologies are not enterprise grade
• Some functionality may not be possible within these frameworks
Benefits of a Hadoop Data Lake
• Data is ingested in its raw state regardless of format, structure or lack of structure
• Raw data can be used and reused for differing purposes across the enterprise
• Beyond inexpensive storage, Hadoop is an extremely powerful, scalable and segmentable computational platform
• Master data can be fed across the enterprise, and deep analytics on clean data are immediately enabled
Key Functions for Master Data Management
ETL & ELT
• Profiling, reads/writes, transformations
• Single project for all jobs
Data Quality
• Cleanse data
• Parsing, correction
• Geo-spatial analysis
Integration & Matching
• Grouping
• Fuzzy match
Master Key Management
• Create keys
• Track changes
• Maintain matches over time
Web Services Integration
• Consume and publish
• HTTP/HTTPS protocols
• XML/JSON/SOAP formats
Process Automation & Operations
• Job scheduling, monitoring, notifications
• Central point of control
• Metadata management
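The matching and keying functions above (grouping, fuzzy match, create keys) can be illustrated with a minimal Python sketch. It is not RedPoint's matching logic: the function names, the greedy grouping strategy and the 0.85 threshold are assumptions chosen for illustration, and the similarity measure is the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def assign_master_keys(records, threshold=0.85):
    """Greedy grouping (an illustrative assumption, not a production matcher):
    each record joins the first group whose representative is similar enough,
    otherwise it starts a new group. Returns (master_key, record) pairs."""
    representatives = []  # one representative name per master key
    keyed = []
    for rec in records:
        for key, rep in enumerate(representatives):
            if similarity(rec, rep) >= threshold:
                keyed.append((key, rec))
                break
        else:
            # No existing group matched: create a new master key.
            representatives.append(rec)
            keyed.append((len(representatives) - 1, rec))
    return keyed


if __name__ == "__main__":
    names = ["Acme Corp.", "ACME Corp", "Globex Inc", "Acme Corp", "Globex, Inc."]
    for key, name in assign_master_keys(names):
        print(key, name)
```

Keeping the representative list between runs is what would let keys persist and matches be maintained over time; a real system would also need blocking to avoid comparing every record against every group.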
Data Lake is the Center of Your MDM Strategy
Ingestion of all available data, whatever its source, format, cadence or structure (or lack of structure)
ELT and data transformation, refinement, cleansing, completion, validation and standardization
Geospatial processing and geocoding
Data profiling, lineage and metadata management
Identity resolution, persistent keying and entity profile management
Attribute source and consumer mapping
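Two of the steps listed above, data profiling and standardization, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `profile_column`, `standardize_state` and the `STATE_CODES` lookup are hypothetical names and data, not part of any product described in this deck.

```python
from collections import Counter


def profile_column(values):
    """Basic profile of one column: row count, nulls, distinct values, top value."""
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }


# Hypothetical lookup table mapping raw spellings to a standard code.
STATE_CODES = {
    "colorado": "CO",
    "co": "CO",
    "massachusetts": "MA",
    "mass.": "MA",
    "ma": "MA",
}


def standardize_state(raw):
    """Return the standard code, or None when the value is missing or unknown."""
    if raw is None:
        return None
    return STATE_CODES.get(raw.strip().lower())


if __name__ == "__main__":
    raw = ["Colorado", "CO", " co ", None, "Mass.", ""]
    print(profile_column(raw))
    print(profile_column([standardize_state(v) for v in raw]))
```

Profiling the column before and after standardization shows the effect of cleansing: the number of distinct values drops once the variant spellings collapse to one code.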
Data Lake Architecture for MDM
Data Sources: CRM, ERP, Billing, Subscriber, Product, Network, Weather, Compete, Manuf., Clickstream, Online Chat, Sensor Data, Social Media, Call Detail Records, Fabrication Logs, Sales Feedback, Field Feedback
How Can That Possibly Work?
More Map Reduce! YARN!
Overview: What is Hadoop/Hadoop 2.0?
Hadoop 1.0
• All operations based on MapReduce
• Intrinsic inconsistency of code-based solutions
• Highly skilled and expensive resources needed
• 3rd-party applications constrained by the need to generate code
Hadoop 2.0
• Introduction of YARN: “a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.”
• Mature applications can now operate directly on Hadoop
• Reduced skill requirements and increased consistency
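For readers unfamiliar with the programming model that Hadoop 1.0 required for every operation, the following single-process Python sketch walks through the three MapReduce phases on a word count. It only illustrates the model and uses no Hadoop API; the function names are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    for token in line.split():
        yield token.lower(), 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse the values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(run_job(["to be or not", "to be"]))
```

Even this small task has to be expressed as a mapper and a reducer, which is the source of the skill and consistency concerns listed for Hadoop 1.0.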
RedPoint Data Management on Hadoop
[Diagram labels: Partitioning AM / Tasks; Execution AM / Tasks; Data I/O; Key / Split; Analysis; Parallel Section; Partition Data; server; YARN; MapReduce]
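The diagram labels mention key-based splitting, data partitioning and a parallel section. The general technique those labels suggest can be sketched as follows. This is not RedPoint's implementation and does not use YARN: the function names, the CRC32 hash and the thread pool are all assumptions made to keep the example self-contained.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition_by_key(records, key_fn, num_partitions):
    """Route each record to a partition chosen by a stable hash of its key,
    so records sharing a key always land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        index = zlib.crc32(key_fn(rec).encode("utf-8")) % num_partitions
        partitions[index].append(rec)
    return partitions


def process_partition(partition):
    """Stand-in for per-partition work: count records per key."""
    counts = {}
    for key, _ in partition:
        counts[key] = counts.get(key, 0) + 1
    return counts


def run_parallel(records, num_partitions=4):
    """Partition by key, process the partitions concurrently, merge results."""
    partitions = partition_by_key(records, lambda r: r[0], num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = list(pool.map(process_partition, partitions))
    merged = {}
    for partial in results:
        # Safe to merge directly: each key lives in exactly one partition.
        merged.update(partial)
    return merged


if __name__ == "__main__":
    data = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
    print(run_parallel(data))
```

Because every record with a given key goes to the same partition, each partition can be processed independently and the partial results merged without reconciliation.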
Reference Hadoop Architecture
[Diagram labels]
Monitoring and Management Tools: AMBARI
Source Data: Sensor Logs, Clickstream, Flat Files, Unstructured, Sentiment, Customer, Inventory
Data Sources: DBs, JMS Queues, Files, RDBMS, EDW
Load: SQOOP, WebHDFS, Flume, NFS, SQOOP/Hive
Data Refinement: HIVE, PIG
Structure: HCATALOG (metadata services)
Interactive: HIVE Server2
Other labels: MAPREDUCE, REST, HTTP, STREAM
Query/Visualization/Reporting/Analytical Tools and Apps
YARN
HDFS (1 to n)
RedPoint Functional Footprint
Benchmarks – Project Gutenberg
MapReduce
• >150 lines of MR code
• 6 hours of development
• 6 minutes runtime
• Extensive optimization needed
Pig
• ~50 lines of script code
• 3 hours of development
• 15 minutes runtime
• User Defined Functions required prior to running script
RedPoint
• 0 lines of code
• 15 min. of development
• 3 minutes runtime
• No tuning or optimization required
Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):
    public static class MapClass extends Mapper<WordOffset, Text, Text, IntWritable> {
        private final static String delimiters = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ \\|«»¡¢£¤¥¦©¬®¯±¶·¿";
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(WordOffset key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line, delimiters);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
Sample Pig script without the UDF:
    SET pig.maxCombinedSplitSize 67108864
    SET pig.splitCombination true
    A = LOAD '/testdata/pg/*/*/*';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
    C = FOREACH B GENERATE UPPER(word) AS word;
    D = GROUP C BY word;
    E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
    F = ORDER E BY occurrences DESC;
    STORE F INTO '/user/cleonardi/pg/pig-count';
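To make the benchmark task concrete, here is a small in-memory Python sketch of what the Pig script computes: tokenize each line, uppercase the words, group and count them, and order by count descending. It is not part of the benchmark and says nothing about performance. Splitting on whitespace is a simplifying assumption (Pig's TOKENIZE uses its own delimiter set), and the function name `word_counts` is hypothetical.

```python
from collections import Counter


def word_counts(lines):
    """Mirror the Pig relations B through F on an in-memory list of lines."""
    # B, C: tokenize each line and uppercase every token.
    words = (token.upper() for line in lines for token in line.split())
    # D, E: group identical words and count occurrences.
    counts = Counter(words)
    # F: order by occurrences, highest first; ties are broken alphabetically
    # here so the result is deterministic (Pig does not guarantee tie order).
    return sorted(((n, w) for w, n in counts.items()), key=lambda p: (-p[0], p[1]))


if __name__ == "__main__":
    for occurrences, word in word_counts(["the cat", "The dog the"]):
        print(occurrences, word)
```

The output tuples keep the same field order as relation E in the script, with the count first and the word second.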
Data Lake Architecture for MDM
Data Sources: CRM, ERP, Billing, Subscriber, Product, Network, Weather, Compete, Manuf., Clickstream, Online Chat, Sensor Data, Social Media, Call Detail Records, Fabrication Logs, Sales Feedback, Field Feedback