Hadoop and Your Enterprise Data Warehouse


Post on 15-Jul-2015



Data & Analytics




<ul><li><p>Welcome to Today's DBTA Roundtable Web Event</p></li>
<li><p>Stephen Faig</p><p>Business Development Manager</p><p>Unisphere Media</p><p>Publishers of DBTA</p></li>
<li><p>Hadoop and Your Enterprise Data Warehouse</p></li>
<li><p>Nitin Bandugula</p><p>Product Marketing Manager</p><p>MapR Technologies</p><p>Kevin Petrie</p><p>Senior Director</p><p>Attunity</p><p>George Corugedo</p><p>Chief Technology Officer &amp; Co-Founder</p><p>RedPoint Global Inc.</p></li>
<li><p>&copy; 2015 MapR Technologies</p></li>
<li><p>Empowering "as it happens" businesses by speeding up the data-to-action cycle</p></li>
<li><p>Top-Ranked NoSQL</p><p>Top-Ranked Hadoop Distribution</p><p>Top-Ranked SQL-on-Hadoop Solution</p></li>
<li><p>Topics</p><p>&bull; The Need for EDW Optimization</p><p>&bull; Different Stages of the Optimization</p><p>&bull; MapR Customer Examples</p><p>&bull; The MapR Advantage</p></li>
<li><p>The Need for EDW Optimization</p></li>
<li><p>Technical Best-Practices Driving Change in Data Architecture</p><p>1. Scale of analytics</p><p>2. Speed of operations</p><p>Source: TDWI, April 2014</p></li>
<li><p>Meanwhile: The Industry Norm in the DW</p><p>Diagram labels: EDW; ELT; Unused Data, Related Loads; Unused Tables (72%)</p><p>&bull; 70% of data is unused</p><p>&bull; Almost 60% of CPU capacity is ETL/ELT</p><p>&bull; 15% of CPU consumed by ETL to load unused data</p><p>&bull; 30% of CPU consumed by 5% of resource-consuming ETL workloads</p></li>
<li><p>Force of Adoption: Costs</p><p>Hadoop TAM comes from disrupting enterprise data warehouse and storage spending</p><p>Chart labels: Data; IT Budgets</p><p>Gartner, "Forecast Analysis: Enterprise IT Spending by Vertical Industry Market, Worldwide, 2010-2016, 3Q12 Update." 
Wall Street Journal, "Financial Services Companies Firms See Results from Big Data Push," Jan. 27, 2014</p><p>$9,000</p><p>$40,000</p></li>
<li><p>SCALE: New Data Sources Unlock New Insights &amp; Apps</p><p>Existing structured data</p><p>&bull; Well-defined and well-understood schema</p><p>&bull; OLTP data</p><p>&bull; Data warehouse data</p><p>&bull; End user data stores (e.g., Excel, Access)</p><p>New multi-structured data</p><p>&bull; Typically un-modeled, different in format</p><p>&bull; Log data</p><p>&bull; Clickstream data</p><p>&bull; Sensor data</p><p>&bull; Rich media (e.g., audio, video)</p><p>&bull; Documents</p><p>Both types needed today for deeper insights</p></li>
<li><p>Stages of the Optimization</p></li>
<li><p>Stage 1: Offload Cold Data (Free up DW space)</p><p>Diagram labels: Incoming Data; ETL; Structured Data; Data Warehouse; Hadoop Platform; Cold Data Offload; Restored Disk</p><p>&bull; Unused data moved out</p><p>&bull; ETL done the traditional way</p><p>&bull; Critical data available for query</p><p>Data Access:</p><p>&bull; BI through ODBC</p><p>&bull; Hive Connectors</p></li>
<li><p>Stage 2: ETL In Hadoop</p><p>Diagram labels: Incoming Data; ETL; Low Latency Data; Bulk Data; Data Warehouse; Hadoop Platform; Restored CPU and Disk</p><p>&bull; ETL now done on Hadoop</p><p>&bull; Analytics through EDW as well as Hadoop</p><p>&bull; Restores even more CPU and Disk</p><p>&bull; Improves old DW Response and Speed</p></li>
<li><p>Stage 3: Hadoop Optimized Data Architecture</p><p>Sources: RELATIONAL, SAAS, MAINFRAME; DOCUMENTS, EMAILS; LOG FILES, CLICKSTREAMS; SENSORS; BLOGS, TWEETS, LINK DATA</p><p>DATA WAREHOUSE</p><p>Data Movement</p><p>Data Access</p><p>Analytics</p><p>Search</p><p>Schema-less data exploration</p><p>BI, 
reporting</p><p>Ad-hoc integrated analytics</p><p>Data Transformation, Enrichment and Integration</p><p>MAPR DISTRIBUTION FOR HADOOP</p><p>Streaming (Spark Streaming, Storm)</p><p>NoSQL ODBMS (HBase, Accumulo, ...)</p><p>Batch / Search (MR, Spark, Pig, ...)</p><p>SQL Analytics (Hive, Drill ...)</p><p>Machine Learning</p><p>MapR Data Platform: MapR-DB, MapR-FS</p><p>Operational Apps: Recommendations, Fraud Detection, Logistics</p><p>Optimized Data Architecture</p></li>
<li><p>MapR Customer Examples</p></li>
<li><p>MapR Customer Success for Enterprise Data Hub</p><p>&bull; EDH most common use case</p><p>&bull; Across industries including</p><p>- Financial services</p><p>- Telecommunications</p><p>- Government</p><p>- Healthcare</p><p>- Technology</p></li>
<li><p>Cisco - 360 Customer View</p><p>Deepening customer relationships and increasing sales opportunities</p><p>&bull; Improve customer satisfaction and sales opportunities by integrating all customer data into one dashboard, accessible across company divisions</p><p>&bull; Provide a consistent and proactively knowledgeable customer experience</p><p>&bull; Integrate all customer data across silos into a central data repository</p><p>&bull; Continually feed real-time customer data into the repository</p><p>&bull; Provide a real-time view of each customer across company divisions: marketing, support, finance, point of sale, etc. 
</p><p>Slide sections: OBJECTIVES, CHALLENGES, SOLUTION</p><p>Cisco's 360 customer view solution enabled them to analyze service sales opportunities in 1/10 the time, at 1/10 the cost, and generated $40 million in incremental service bookings in the first year.</p><p>Business Impact</p><p>&bull; Central data repository results in lower cost and reduced complexity</p><p>&bull; Accelerates analysis cycle time and rapid actions</p><p>&bull; Provides high availability and disaster recovery</p></li>
<li><p>F100 Telco - Data Warehouse Optimization</p><p>Improve data services to customers while reducing enterprise architecture costs</p><p>&bull; Provide cloud, security, managed services, data center, &amp; comms</p><p>&bull; Report on customer usage, profiles, billing, and sales metrics</p><p>&bull; Improve service: Measure service quality and repair metrics</p><p>&bull; Reduce customer churn: identify and address IP network hotspots</p><p>&bull; Cost of ETL &amp; DW storage for growing IP and clickstream data; &gt;3 months</p><p>&bull; Reliability &amp; cost of Hadoop alternatives limited ETL &amp; storage offload</p><p>&bull; MapR Data Platform for data staging, ETL, and storage at 1/10th the cost</p><p>&bull; MapR provided smallest datacenter footprint with best DR solution</p><p>&bull; Enterprise-grade: NFS file management, consistent snapshots &amp; mirroring</p><p>Slide sections: OBJECTIVES, CHALLENGES, SOLUTION</p><p>Business Impact</p><p>&bull; Increased scale to handle network IP and clickstream data</p><p>&bull; Reduced workload on DW to maintain reporting SLAs to business</p><p>&bull; Unlocked new insights into network usage and customer preferences</p><p>FORTUNE 100 TELCO</p></li>
<li><p>MapR Enterprise Data Hub Solution</p></li>
<li><p>MapR Enterprise Data Hub</p><p>&bull; Scale - Reliability Across the Enterprise</p><p>- Advanced multi-tenancy</p><p>- Business continuity: HA, DR</p><p>&bull; Speed</p><p>- 2-7x faster than other Hadoop distros</p><p>- Ultra-fast data ingest, NFS, &amp; R/W file system</p><p>&bull; Self-Service 
Data Exploration</p><p>- On-the-fly SQL without up-front schema</p><p>- ANSI SQL: use existing BI/DW investments</p><p>The Hadoop platform of choice for big &amp; fast data-driven apps</p><p>Diagram labels: Security; Streaming; NoSQL &amp; Search; Provisioning &amp; coordination; ML, Graph; Workflow &amp; Data Governance; Batch; SQL</p><p>Diagram labels: INTEGRATED; COMMERCIAL ENGINES; TOOLS; COMPUTE ENGINES</p><p>Diagram labels: Batch; Interactive; Real-time; Online; Others</p><p>Diagram labels: Management; Operations; Governance; Audits; Security</p><p>MapR Data Platform: MapR-FS, MapR-DB</p></li>
<li><p>Drill: Agility by Reducing Distance to Data</p><p>Short analytic life cycles with no upfront schema creation and management</p><p>FROM: Traditional Approach</p><p>Hadoop Data; Schema Design; Transformation; Data Movement; Users</p><p>Data Preparation</p><p>New Business Questions</p><p>Total Time to Value: Weeks to Months</p><p>TO: New Approach</p><p>Hadoop Data; Users</p><p>New Business Questions</p><p>Total Time to Value: Minutes</p><p>Drill enables the As-It-Happens business with instant SQL analytics on complex data</p></li>
<li><p>Thank You</p><p>@mapr maprtech</p><p>nitin@mapr.com</p><p>MapRTechnologies</p><p>maprtech</p><p>mapr-technologies</p><p>Free on-demand Hadoop training leading to certification. Start becoming an expert now.</p><p>mapr.com/training</p></li>
<li><p>Data Quality in the Data Hub</p><p>February 2015</p></li>
<li><p>RedPoint Global Inc. 
2015 Confidential</p><p>Overview of RedPoint Global</p><p>Launched 2006</p><p>Founded and staffed by industry veterans</p><p>Headquarters: Wellesley, Massachusetts</p><p>Offices in US, UK, Australia, Philippines</p><p>Global customer base</p><p>Serves most major industries</p><p>MAGIC QUADRANT: Data Quality</p><p>MAGIC QUADRANT: Multichannel Campaign Management</p><p>MAGIC QUADRANT: Integrated Marketing Management</p></li>
<li><p>Extensive experience with a diverse customer base</p></li>
<li><p>Cloudera Stack</p></li>
<li><p>Andrew Brust, GigaOm Research</p></li>
<li><p>There is lots of Hype Out There</p></li>
<li><p>Don't believe the Marketing Hype</p></li>
<li><p>Data Hub for MDM</p><p>Data Hub</p><p>1 ... n</p><p>YARN</p><p>Production RDBMS Databases</p><p>Data Ingestion</p><p>Specialized Analytic Databases &amp; Caches</p><p>Analytics: Any analytics; Any reporting; Predictive Analytics; Clustering; Profiling</p><p>Interaction Systems: Marketing Automation; Real Time Personalization; Omni-Channel Optimization; Digital and Traditional Channels</p><p>Data Quality Processing</p><p>Persistent Entity Resolution, Linkage and Keying</p></li>
<li><p>RedPoint Global Inc. 
2015 Confidential</p><p>How About MDM on a Data Lake?</p><p>Challenges to Data Lake Approach</p><p>&bull; Severe shortage of MapReduce-skilled resources</p><p>&bull; Inconsistent skills lead to inconsistent results of code-based solutions</p><p>&bull; Nascent technologies require multiple point solutions</p><p>&bull; Technologies are not enterprise grade</p><p>&bull; Some functionality may not be possible within these frameworks</p><p>Benefits of a Hadoop Data Lake</p><p>&bull; Data is ingested in its raw state regardless of format, structure or lack of structure</p><p>&bull; Raw data can be used and reused for differing purposes across the enterprise</p><p>&bull; Beyond inexpensive storage, Hadoop is an extremely powerful, scalable and segmentable computational platform</p><p>&bull; Master Data can be fed across the enterprise and deep analytics on clean data is immediately enabled</p></li>
<li><p>Key Functions for Master Data Management</p><p>ETL &amp; ELT: Profiling, reads/writes, transformations; Single project for all jobs</p><p>Data Quality: Cleanse data; Parsing, correction; Geo-spatial analysis</p><p>Integration &amp; Matching: Grouping; Fuzzy match</p><p>Master Key Management: Create keys; Track changes; Maintain matches over time</p><p>Web Services Integration: Consume and publish; HTTP/HTTPS protocols; XML/JSON/SOAP formats</p><p>Process Automation &amp; Operations: Job scheduling, monitoring, notifications; Central point of control; Meta Data Management</p></li>
<li><p>RedPoint Global Inc. 
2015 Confidential</p><p>Overview - What is Hadoop/Hadoop 2.0</p><p>Hadoop 1.0</p><p>&bull; All operations based on MapReduce</p><p>&bull; Intrinsic inconsistency of code-based solutions</p><p>&bull; Highly skilled and expensive resources needed</p><p>&bull; 3rd party applications constrained by the need to generate code</p><p>Hadoop 2.0</p><p>&bull; Introduction of YARN: a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.</p><p>&bull; Mature applications can now operate directly on Hadoop</p><p>&bull; Reduced skill requirements and increased consistency</p></li>
<li><p>RedPoint Data Management on Hadoop</p><p>Diagram labels: Partitioning AM / Tasks; Execution AM / Tasks; Data I/O; Key / Split Analysis; Parallel Section; YARN; MapReduce</p></li>
<li><p>RedPoint DM for Hadoop: Processing Flow</p><p>Diagram labels: Data Management Designer; DM Execution Server; Parallel Section; Resource Manager; Node Manager (x3); DM App Master; DM Task (x5)</p><p>Flow labels (steps numbered 1 to 3 on the slide): Launches DM App Master; Launches Tasks; Running DM Task</p></li>
<li><p>RedPoint Global Inc. 
2015 Confidential</p><p>Reference Hadoop Architecture</p><p>RedPoint Functional Footprint</p><p>SOURCE DATA: Sensor Logs; Clickstream; Flat Files; Unstructured; Sentiment; Customer; Inventory</p><p>Data Sources: DBs; JMS Queues; Files</p><p>LOAD: SQOOP; WebHDFS; Flume; NFS</p><p>LOAD: SQOOP/Hive; WebHDFS</p><p>Other diagram labels: Monitoring and Management Tools; Management; DATA REFINEMENT; MAPREDUCE; HIVE; PIG; REST; HTTP; STREAM; STRUCTURE; HCATALOG (metadata services); INTERACTIVE; HIVE Server2; RDBMS; EDW; Query/Visualization/Reporting/Analytical Tools and Apps; YARN; HDFS; 1 ... n</p></li>
<li><p>Benchmarks: Project Gutenberg</p><table><tr><th></th><th>Map Reduce</th><th>Pig</th><th>RedPoint</th></tr><tr><td>Code size</td><td>&gt;150 lines of MR code</td><td>~50 lines of script code</td><td>0 lines of code</td></tr><tr><td>Development time</td><td>6 hours</td><td>3 hours</td><td>15 min.</td></tr><tr><td>Runtime</td><td>6 minutes</td><td>15 minutes</td><td>3 minutes</td></tr><tr><td>Notes</td><td>Extensive optimization needed</td><td>User Defined Functions required prior to running script</td><td>No tuning or optimization required</td></tr></table><p>Sample MapReduce (small subset of the entire code which totals nearly 150 lines): public static class MapClass extends Mapper&lt;WordOffset, Text, Text, IntWritable&gt; { private final static String delimiters = "',./?;:\"[]{}-=_+()&amp;*%^#$!@`~ \\|"; private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(WordOffset key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line, delimiters); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } </p><p>Sample Pig script without the UDF: SET pig.maxCombinedSplitSize 67108864 SET pig.splitCombination true A = LOAD 
'/testdata/pg/*/*/*'; B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word; C = FOREACH B GENERATE UPPER(word) AS word; D = GROUP C BY word; E = FOREACH D GENERATE COUNT(C) AS occurrences, group; F = ORDER E BY occurrences DESC; STORE F INTO '/user/cleonardi/pg/pig-count'; </p></li>
<li><p>Data Lake Architecture for MDM</p></li>
<li><p>Recommendations for Data Quality</p><p>&bull; There is a gap between current use and the mainstream</p><p>&bull; Don't believe the hype; there's plenty of it</p><p>&bull; Data Quality creates trust in information, which enables confident and nimble decision making</p><p>&bull; Look for broad enterprise apps that have solved the parallel scalability problem</p><p>&bull; Consider a Data Hub approach for Data Quality for maximum flexibility and scalable performance</p></li>
<li><p>George Corugedo</p><p>Chief Technology Officer</p><p>George.corugedo@redpoint.net</p><p>781.725.0252</p><p>Download our white paper</p><p>From Yawn to Yarn: Why You Should be Excited about Hadoop</p><p>Redpoint.net/dbtawebinar</p></li>
<li><p>Question and Answer Session</p><p>(please submit questions)</p></li>
<li><p>Nitin Bandugula</p><p>Product Marketing Manager</p><p>MapR Technologies</p><p>Kevin Petrie</p><p>Senior Director</p><p>Attunity</p><p>George Corugedo</p><p>Chief Technology Officer &amp; Co-Founder</p><p>RedPoint Global Inc.</p></li>
<li><p>Please use the same URL you used to view today's live event for the archive event, plus we will be sending you a follow-up email with that URL once the archive is posted!</p></li>
<li><p>Thank you for participating in today's roundtable web event</p><p>Just by attending this event the winner of the $100 AmEx Gift Card is...</p></li></ul>
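<p>The Pig script on the benchmark slide tokenizes each input line, upper-cases the tokens, groups them, counts each group, and orders the result by descending count. As a rough illustration only, the same steps might be sketched in plain Python as follows. The function name <code>word_counts</code> and the sample input are hypothetical, whitespace splitting stands in for Pig's TOKENIZE, and the sketch runs on a single machine, so it says nothing about the benchmark figures quoted on the slide.</p>

```python
from collections import Counter
from typing import Iterable, List, Tuple


def word_counts(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (occurrences, word) pairs, most frequent first.

    Hypothetical helper for illustration; not part of the original deck.
    """
    counts: Counter = Counter()
    for line in lines:
        # Split on whitespace and upper-case each token, loosely mirroring
        # the TOKENIZE and UPPER steps of the Pig script (assumption:
        # whitespace is a good enough delimiter for this sketch).
        counts.update(token.upper() for token in line.split())
    # Order by descending count, like ORDER ... BY occurrences DESC.
    # Ties are broken alphabetically so the output is deterministic;
    # the Pig script does not specify a tie order.
    pairs = ((n, word) for word, n in counts.items())
    return sorted(pairs, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    sample = ["the quick brown fox", "The lazy dog", "the end"]
    for occurrences, word in word_counts(sample):
        print(occurrences, word)
```

<p>Each returned pair has the same shape as a row stored by the Pig script (occurrences first, then the word). A distributed engine would split the counting across nodes and merge the partial counts; that part is deliberately left out of this sketch.</p>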

