How to Use Hadoop

CONCLUSIONS PAPER

Featuring:
Brian Garrett, Product and Systems Architect, SAS
Scott Chastain, Product and Systems Manager, SAS
Bob Messier, Senior Director, Product Management, SAS

Insights from a webinar in the Applying Business Analytics Webinar Series

How to Use Hadoop as a Piece of the Big Data Puzzle

Table of Contents

What Hadoop Can Do for Big Data
Why Hadoop Is Not a Big Data Strategy
Closing Thoughts
For More Information

What Hadoop Can Do for Big Data

Imagine you have a jar of multicolored candies, and you need to learn something from them, perhaps the count of blue candies relative to red and yellow ones. You could empty the jar onto a plate, sift through them and tally up your answer. If the jar held only a few hundred candies, this process would take only a few minutes.

Now imagine you have four plates and four helpers. You pour out about one-fourth of the candies onto each plate. Everybody sifts through their set and arrives at an answer that they share with the others to arrive at a total. Much faster, no?

That is what Hadoop does for data. Hadoop is an open-source software framework for running applications on large clusters of commodity hardware.
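The four-plates analogy is exactly the map-and-reduce pattern that Hadoop runs at cluster scale. As a rough sketch in Python (threads stand in for worker nodes, and the jar of candies is invented for illustration):

```python
# A toy version of the "four plates and four helpers" analogy: partition
# the data, count each partition in parallel (map), then merge the
# partial tallies (reduce). Threads stand in for Hadoop's cluster nodes.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_colors(chunk):
    # Map step: tally the candies on one plate.
    return Counter(chunk)

def parallel_count(candies, workers=4):
    size = max(1, -(-len(candies) // workers))  # ceiling division
    chunks = [candies[i:i + size] for i in range(0, len(candies), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_colors, chunks))
    # Reduce step: merge the per-plate tallies into one total.
    return sum(partials, Counter())

if __name__ == "__main__":
    jar = ["blue"] * 120 + ["red"] * 90 + ["yellow"] * 40
    print(parallel_count(jar))  # -> Counter({'blue': 120, 'red': 90, 'yellow': 40})
```

In real Hadoop the partitions live on different machines and the framework handles distribution and failure; the shape of the computation, though, is the same.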
Hadoop delivers enormous processing power, with the ability to handle virtually limitless concurrent tasks and jobs, making it a remarkably low-cost complement to a traditional enterprise data infrastructure.

Organizations are embracing Hadoop for several notable merits:

- Hadoop is distributed. Bringing a high-tech twist to the adage "Many hands make light work," data is stored on the local disks of a distributed cluster of servers.

- Hadoop runs on commodity hardware. Based on the average cost per terabyte of compute capacity of a prepackaged system, Hadoop is easily 10 times cheaper for comparable computing capacity than higher-cost specialized hardware.

- Hadoop is fault-tolerant. Hardware failure is expected and is mitigated by data replication and speculative processing. If capacity is available, Hadoop runs multiple copies of the same task, accepting the results from the task that finishes first.

- Hadoop does not require a predefined data schema. A key benefit of Hadoop is the ability to just upload any unstructured files without having to schematize them first. You can dump any type of data into Hadoop and allow the consuming programs to determine and apply structure when necessary.

- Hadoop scales to handle big data. Hadoop clusters can scale to between 6,000 and 10,000 nodes and handle more than 100,000 concurrent tasks and 10,000 concurrent jobs. Yahoo! runs thousands of clusters and more than 42,000 Hadoop nodes storing more than 200 petabytes of data.

- Hadoop is fast. In a performance test, a 1,400-node cluster sorted a terabyte of data in 62 seconds; a 3,400-node cluster sorted 100 terabytes in 173 minutes. To put it in context, one terabyte contains 2,000 hours of CD-quality music; 10 terabytes could store the entire US Library of Congress print collection.

You get the idea: Hadoop handles big data.
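The schema-on-read point deserves a concrete illustration. In this sketch (plain Python; the record format and field names are invented), the raw text carries no schema at all, and the consuming program imposes structure, including types, only at read time:

```python
import csv, io

# Raw, schema-less input as it might sit in Hadoop: just lines of text.
# The format and field names below are hypothetical.
raw = """2018-06-03,click,us,0.00
2018-06-03,purchase,de,19.99
2018-06-04,click,us,0.00
"""

# Schema-on-read: the consumer decides the structure when it reads.
FIELDS = ["date", "event", "country", "revenue"]

def read_events(blob):
    for row in csv.reader(io.StringIO(blob)):
        rec = dict(zip(FIELDS, row))
        rec["revenue"] = float(rec["revenue"])  # types applied at read time
        yield rec

total = sum(r["revenue"] for r in read_events(raw) if r["event"] == "purchase")
print(total)  # -> 19.99
```

A different consumer could read the same bytes with a different schema; nothing had to be declared when the data was stored.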
It does it fast. It redefines the possible when it comes to analyzing large volumes of data, particularly semi-structured and unstructured data (text).

"Hadoop automatically replicates the data onto separate nodes to mitigate the effects of a hardware failure. The framework is not only server-aware, it's rack-aware, so if a hardware component becomes unavailable, the data still exists somewhere else."
- Brian Garrett, Product and Systems Architect, SAS

Why Hadoop Is Not a Big Data Strategy

For all its agility in handling big data, Hadoop by itself is not a big data strategy, says Scott Chastain, Product and Systems Manager at SAS. "The data storage capabilities, the ability to divide and conquer, to replicate the data for redundancy: these capabilities don't necessarily solve any business questions. For that you need the ability to efficiently do query and reporting on big data sets." Hadoop by itself has limited capabilities for generating insight from data, and if you have a lot of users asking questions of the data, Hadoop adds some unwelcome overhead.

Suppose you need to answer a new question about your collection of candies. Perhaps you need to rank the colors by prevalence. The candies aren't kept on the plate (in memory); they were poured back into the jar after the first analysis. So you would pour the candies back out onto the plates again, have your helpers sift through them all, and come up with the answer to the new question.

Wouldn't it be nice if you could have preserved the orderliness of the candies after you answered the first question?
Your analysis would be so much more efficient if you didn't have to pour the candies back in the jar and dump them back out again later as a jumble of colors.

But that's what Hadoop does; it keeps the data in the jar, not on the plates. "The typical paradigm of Hadoop processing is a computational approach called MapReduce, which breaks out the individual pieces for distributed processing, and then gets that answer across all of the individual nodes," said Chastain. "Every time you fire up a new MapReduce job, you have to go to that data [in storage], pull it in [to memory], ingest it and process it."

So Hadoop adds two elements of overhead: in creating multiple sets of the distributed data for redundancy, and in moving data between storage and memory every time somebody comes to ask a question. "Hadoop is extremely fast and efficient at answering questions on big data sets, but there's still overhead associated with getting that answer," said Chastain. "This paradigm has been very successful for data scientists and others doing ad hoc problem solving, but what if you have many business intelligence users who want to do query and reporting based on big data sets?

"That's why we talk about Hadoop as a piece of this big data puzzle, but not a strategy in and of itself. Additional capabilities are needed to serve more people working with big data."
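The overhead Chastain describes can be sketched in miniature. In this toy Python comparison (the data set and queries are invented), the batch-style function pays the storage-to-memory cost on every question, while the cached version ingests once and answers every later question from memory:

```python
from collections import Counter
import os, tempfile

# A tiny stand-in for a file sitting in Hadoop storage.
path = os.path.join(tempfile.mkdtemp(), "candies.txt")
with open(path, "w") as f:
    f.write("\n".join(["blue", "red", "blue", "yellow", "blue", "red"]))

def batch_query(question):
    # MapReduce-style: every job re-reads and re-ingests the stored data.
    with open(path) as f:
        colors = Counter(line.strip() for line in f)
    return question(colors)

# In-memory style: ingest and organize once, then serve many questions.
cached = None
def cached_query(question):
    global cached
    if cached is None:  # the data is read from storage only once
        with open(path) as f:
            cached = Counter(line.strip() for line in f)
    return question(cached)

top_color = lambda c: c.most_common(1)[0][0]
red_count = lambda c: c["red"]

print(batch_query(top_color), batch_query(red_count))    # re-reads the file twice
print(cached_query(top_color), cached_query(red_count))  # reads the file once
```

Both styles return the same answers; the difference is where and how often the ingestion cost is paid, which matters once many users start asking questions.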
"What if we can keep the data set in memory across this distributed computing environment, and serve multiple pieces and multiple users with different questions around that set of big data?"

Hadoop is both a data storage mechanism, using the Hadoop Distributed File System (HDFS), and a parallel and distributed programming model based on MapReduce.

Hadoop has emerged as a popular way to handle massive amounts of structured and unstructured data, thanks to its ability to process data quickly and cost-effectively across clusters of commodity hardware.

Let's start all over with the candies. This time, you pour the candies out of the jar and organize them in an efficient manner, lining them up neatly in rows by color. Now if someone asks a question such as, "Which colors are most and least represented?" the candies are already organized for a speedy response to that query.

"Leaving the data in memory in a structured fashion, it's very efficient for us to enable multiple users to answer various questions from the same set of data," said Chastain.

However, today's business problems require more than simple answers. Business users need to understand forecasting, fraud and risk, propensity to respond, root cause analysis, optimization and so on: questions that entail a lot of variables and analytical complexity. The Hadoop framework doesn't provide the high-performance analytics to answer those business problems. Even if the needed tools do exist, Hadoop often has to wait for the slowest node to finish before it can deliver the answer. "So we use SAS for complex problems."

Suppose we are presented with a candy optimization problem.
We can substitute two orange candies for every red, but we must take a two-green penalty for every red candy removed until greens are exhausted; or we can substitute two greens for every orange, but must remove three blues for every orange remaining. Which strategy of substitutions will yield the highest candy inventory?

The answer wouldn't be quickly arrived at by intuition or counting, so you'd pass this question over to analytics. Pour the candies into a SAS mug and imagine that SAS Analytics works its magic and delivers an optimized answer.

"It can be very effective to pull that data and put it into a SAS process to answer complex problems, such as using optimization or data mining to predict customer behavior, detect potentially fraudulent activities or other types of activities."

But what if you've got a great big vat of candies?

"Now you've really got a big data problem," said Chastain. "Up until now in our example, we've used Hadoop as a distributed data platform and brought the data to the complex mathematics that SAS provides. That paradigm has to change, because we can't have data floating around in the enterprise with little ability to provide security control and governance. With big data, taking the data to SAS is not our best way forward. We have a solution for that."

Pour the vat of candies into as many plates as needed, and put a magic SAS mug with each plate. Now you can analyze the candies right in place, without having to move them around and potentially spill them.
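For readers who want to see the candy substitution puzzle worked out, here is a brute-force sketch in Python. The starting inventory is invented (the paper gives none), and the rules follow one plausible reading of the wording above, so treat this as an illustration of handing such a question to an optimizer rather than the definitive answer:

```python
# Brute-force check of the candy substitution puzzle. Inventory and rule
# interpretation are hypothetical, not taken from the paper.

def strategy_a(red, orange, green, blue):
    # Swap a red for two oranges, paying two greens per swap,
    # until the greens run out. Try every number of swaps.
    best = 0
    for s in range(min(red, green // 2) + 1):
        total = (red - s) + (orange + 2 * s) + (green - 2 * s) + blue
        best = max(best, total)
    return best

def strategy_b(red, orange, green, blue):
    # Swap an orange for two greens; lose three blues for every
    # orange left unswapped. Try every number of swaps.
    best = 0
    for s in range(orange + 1):
        remaining = orange - s
        total = red + remaining + (green + 2 * s) + max(blue - 3 * remaining, 0)
        best = max(best, total)
    return best

jar = dict(red=10, orange=8, green=12, blue=20)
print("A:", strategy_a(**jar), "B:", strategy_b(**jar))  # -> A: 50 B: 58
```

For this particular jar, strategy A never pays off (each swap nets one extra candy but costs two greens), while strategy B does best when every orange is swapped. A different starting inventory could flip the answer, which is exactly why such questions go to analytics rather than intuition.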
"We have the data in memory, and SAS can go inside this Hadoop infrastructure and answer those complex business problems right inside the commodity hardware."

Closing Thoughts

In the past, organizations were constrained in how much data could be stored and what type of analytics could be applied against that data. Analysts were often limited to analyzing just a sample subset of the data in an attempt to simulate a larger data population, even when using all the data would have yielded a more accurate result.

Hadoop can overcome the bandwidth and coordination issues associated with processing billions of records that previously might not have been saved. The SAS approach brings world-class analytics to the Hadoop framework. There are a lot of technical details involving the various Apache subprojects and Hadoop-based capabilities, but SAS support for Hadoop can be boiled down to three simple statements:

- SAS can use Hadoop data as just another data source. SAS Data Integration supports Hadoop alongside other data storage and processing technologies. Graphical tools enable users to access, process and manage Hadoop data and processes from within the familiar SAS environment. This is critical, given the skills shortage and the complexity involved with Hadoop.

- The power of SAS Analytics has been extended to Hadoop. SAS augments Hadoop with world-class analytics, along with metadata, security and lineage capabilities, which helps ensure that Hadoop will be ready for enterprise expectations.

- SAS brings much-needed governance to Hadoop. SAS provides a robust information management life cycle approach to Hadoop that includes support for data management and analytics management.
This is a huge advantage over other products that focus primarily on moving data in and out of Hadoop.

"That's why we said that Hadoop is a piece of the big data puzzle, but it's not everything," said Chastain. You're going to need other assets to drive a complete strategy, such as the abilities to:

- Keep data in memory on the distributed architecture.
- Support multiple users for querying against this data.
- Put complex mathematics inside the Hadoop environment to solve difficult business problems.

To find out more, download the SAS white paper "Bringing the Power of SAS to Hadoop."

SAS support for Hadoop is part of a broader big data strategy that includes information management for big data and high-performance analytics, including grid, in-database and in-memory computing.

"We can drive cost savings in our existing storage capacity by using Hadoop on commodity hardware, and we can then embed SAS into the Hadoop infrastructure to help solve the more complicated and sophisticated business problems."
- Bob Messier, Senior Director for Product Management, SAS

For More Information

To view the on-demand recording of this webinar:
Other events in the Applying Business Analytics Webinar Series:
For a go-to resource for premium content and collaboration with experts and peers:
Download the SAS white paper "Bringing the Power of SAS to Hadoop": wp/corp/46633
Download the TDWI Best Practices Report "High-Performance Data Warehousing":
Follow us on Twitter: @sasanalytics
Like us on Facebook: SAS Analytics