making sense of big data - pwc tech forecast, issue 3 2010

Technology Forecast: Making sense of Big Data. A quarterly journal, 2010, Issue 3

In this issue

04 Tapping into the power of Big Data

22 Building a bridge to the rest of your data

36 Revising the CIO's data playbook

Contents

Features

04 Tapping into the power of Big Data: Treating it differently from your core enterprise data is essential.

22 Building a bridge to the rest of your data: How companies are using open-source cluster-computing techniques to analyze their data.

36 Revising the CIO's data playbook: Start by adopting a fresh mind-set, grooming the right talent, and piloting new tools to ride the next wave of innovation.

Interviews

14 The data scalability challenge: John Parkinson of TransUnion describes the data handling issues more companies will face in three to five years.

18 Creating a cost-effective Big Data strategy: Disney's Bud Albers, Scott Thompson, and Matt Estes outline an agile approach that leverages open-source and cloud technologies.

34 Hadoop's foray into the enterprise: Cloudera's Amr Awadallah discusses how and why diverse companies are trying this novel approach.

46 New approaches to customer data analysis: Razorfish's Mark Taylor and Ray Velez discuss how new techniques enable them to better analyze petabytes of Web data.

Departments

02 Message from the editor

50 Acknowledgments

54 Subtext

Message from the editor

Bill James has loved baseball statistics ever since he was a kid in Mayetta, Kansas, cutting baseball cards out of the backs of cereal boxes in the early 1960s. James, who compiled The Bill James Baseball Abstract for years, is a renowned "sabermetrician" (a term he coined himself). He now is a senior advisor on baseball operations for the Boston Red Sox, and he previously worked in a similar capacity for other Major League Baseball teams.

James has done more to change the world of baseball statistics than anyone in recent memory. As broadcaster Bob Costas says, James "doesn't just understand information. He has shown people a different way of interpreting that information." Before Bill James, Major League Baseball teams all relied on long-held assumptions about how games are won. They assumed batting average, for example, had more importance than it actually does.

James challenged these assumptions. He asked critical questions that didn't have good answers at the time, and he did the research and analysis necessary to find better answers. For instance, how many days' rest does a reliever need? James's answer is that some relievers can pitch well for two or more consecutive days, while others do better with a day or two of rest in between. It depends on the individual. Why can't a closer work more than just the ninth inning? A closer is frequently the best reliever on the team. James observes that managers often don't use the best relievers to their maximum potential.

The lesson learned from the Bill James example is that the best statistics come from striving to ask the best questions and trying to get answers to those questions. But what are the best questions? James takes an iterative approach, analyzing the data he has, or can gather, asking some questions based on that analysis, and then looking for the answers. He doesn't stop with just one set of statistics. The first set suggests some questions, to which a second set suggests some answers, which then give rise to yet another set of questions. It's a continual process of investigation, one that's focused on surfacing the best questions rather than assuming those questions have already been asked.

Enterprises can take advantage of a similarly iterative, investigative approach to data. Enterprises are being overwhelmed with data; many enterprises each generate petabytes of information they aren't making the best use of. And not all of the data is the same. Some of it has value, and some, not so much. The problem with this data has been twofold: (1) it's difficult to analyze, and (2) processing it using conventional systems takes too long and is too expensive.


PricewaterhouseCoopers Technology Forecast

Addressing these problems effectively doesn't require radically new technology. Better architectural design choices and software that allows a different approach to the problems are enough. Search engine companies such as Google and Yahoo provide a pragmatic way forward in this respect. They've demonstrated that efficient, cost-effective, system-level design can lead to an architecture that allows any company to handle different data differently.

Enterprises shouldn't treat voluminous, mostly unstructured information (for example, Web server log files) the same way they treat the data in core transactional systems. Instead, they can use commodity computer clusters, open-source software, and Tier 3 storage, and they can process in an exploratory way the less-structured kinds of data they're generating. With this approach, they can do what Bill James does and find better questions to ask.

In this issue of the Technology Forecast, we review the techniques behind low-cost distributed computing that have led companies to explore more of their data in new ways. In the article "Tapping into the power of Big Data," on page 04, we begin with a consideration of exploratory analytics: methods that are separate from traditional business intelligence (BI). These techniques make it feasible to look for more haystacks, rather than just the needle in one haystack.

The article "Building a bridge to the rest of your data," on page 22, highlights the growing interest in and adoption of Hadoop clusters. Hadoop provides high-volume, low-cost computing with the help of open-source software and hundreds or thousands of commodity servers. It also offers a simplified approach to processing more complex data in parallel. The methods, cost advantages, and scalability of Hadoop-style cluster computing clear a path for enterprises to analyze lots of data they didn't have the means to analyze before.

The buzz around Big Data and "cloud storage" (a term some vendors use to describe less-expensive cluster-computing techniques) is considerable, but the article "Revising the CIO's data playbook," on page 36, emphasizes that CIOs have time to pick and choose the most suitable approach. The most promising opportunity is in the area of "gray data," or data that comes from a variety of sources. This data is often raw and unvalidated, arrives in huge quantities, and doesn't yet have established value. Gray data analysis requires a different skill set: people who are more exploratory by nature.

As always, in this issue we've included interviews with knowledgeable executives who have insights on the overall topic of interest:

John Parkinson of TransUnion describes the data challenges that more and more companies will face during the next three to five years.

Bud Albers, Scott Thompson, and Matt Estes of Disney outline an agile, open-source cloud data vision.

Amr Awadallah of Cloudera explores the reasons behind Apache Hadoop's adoption at search engine, social media, and financial services companies.

Mark Taylor and Ray Velez of Razorfish contrast newer, more scalable techniques of studying customer data with the old methods.

Please visit pwc.com/techforecast to find these articles and other issues of the Technology Forecast online. If you would like to receive future issues of the Technology Forecast as a PDF attachment, you can sign up at pwc.com/techforecast/subscribe. We welcome your feedback and your ideas for future research and analysis topics to cover.

Tom DeGarmo, Principal Technology Leader


Tapping into the power of Big Data

Treating it differently from your core enterprise data is essential.

By Galen Gruman



Like most corporations, the Walt Disney Co. is swimming in a rising sea of Big Data: information collected from business operations, customers, transactions, and the like; unstructured information created by social media and other Web repositories, including the Disney home page itself and sites for its theme parks, movies, books, and music; plus the sites of its many big business units, including ESPN and ABC.

"In any given year, we probably generate more data than the Walt Disney Co. did in its first 80 years of existence," observes Bud Albers, executive vice president and CTO of the Disney Technology Shared Services Group. "The challenge becomes what do you do with it all?"

Albers and his team are in the early stages of answering their own question with an economical cluster-computing architecture based on a set of cost-effective and scalable technologies anchored by Apache Hadoop, an open-source, Java-based framework for distributed storage and processing that is modeled on Google's published File System and MapReduce designs and developed under the Apache Software Foundation. These still-emerging technologies allow Disney analysts to explore multiple terabytes of information without the lengthy time requirements or high cost of traditional business intelligence (BI) systems.

This issue of the Technology Forecast examines how Apache Hadoop and these related technologies can derive business value from Big Data by supporting a new kind of exploratory analytics unlike traditional BI. These software technologies and their hardware cluster platform make it feasible not only to look for the needle in the haystack, but also to look for new haystacks. This kind of analysis demands an attitude of exploration, and the ability to generate value from data that hasn't been scrubbed or fully modeled into relational tables.

Using Disney and other examples, this first article introduces the idea of exploratory BI for Big Data. The second article examines Hadoop clusters and technologies that support them (page 22), and the third article looks at steps CIOs can take now to exploit the future benefits (page 36). We begin with a closer look at Disney's still-nascent but illustrative effort.


Bringing Big Data under control

Big Data is not a precise term; rather, it is a characterization of the never-ending accumulation of all kinds of data, most of it unstructured. It describes data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis using relational database techniques. Whether terabytes or petabytes, the precise amount is less the issue than where the data ends up and how it is used.

Like everyone else's, Disney's Big Data is huge, more unstructured than structured, and growing much faster than transactional data. The Disney Technology Shared Services Group, which is responsible for Disney's core Web and analysis technologies, recently began its Big Data efforts but already sees high potential. The group is testing the technology and working with analysts in Disney business units.

Disney's data comes from varied sources, but much of it is collected for departmental business purposes and not yet widely shared. Disney's Big Data approach will allow it to look at diverse data sets for unplanned purposes and to uncover patterns across customer activities. For example, insights from Disney Store activities could be useful in call centers for theme park booking or to better understand the audience segments of one of its cable networks. The Technology Shared Services Group is even using Big Data approaches to explore its own IT questions to understand what data is being stored, how it is used, and thus what type of storage hardware and management the group needs.

Albers assumes that Big Data analysis is destined to become essential. "The speed of business these days and the amount of data that we are now swimming in mean that we need to have new ways and new techniques of getting at the data, finding out what's in there, and figuring out how we deal with it," he says.

The team stumbled upon an inexpensive way to improve the business while pursuing more IT cost-effectiveness through the use of private-cloud technologies. (See the Technology Forecast, Summer 2009, for more on the topic of cloud computing.) When Albers launched the effort to change the division's cost curve so IT expenses would rise more slowly than the business usage of IT (the opposite had been true), he turned to an approach that many companies use to make data centers more efficient: virtualization.

Virtualization offers several benefits, including higher utilization of existing servers and the ability to move workloads to prevent resource bottlenecks. An organization can also move workloads to external cloud providers, using them as a backup resource when needed, an approach called cloud bursting. By using such approaches, the Disney Technology Shared Services Group lowered its IT expense growth rate from 27 percent to 3 percent, while increasing its annual processing growth from 17 percent to 45 percent.

While achieving this efficiency, the team realized that the ability to move resources and tap external ones could apply to more than just data center efficiency. At first, they explored using external clouds to analyze big sets of data, such as Web traffic to Disney's many sites, and to handle big processing jobs more cost-effectively and more quickly than with internal systems.

During that exploration, the team discovered Hadoop, MapReduce, and other open-source technologies that distribute data-analysis workloads across many computers, breaking the analysis into many parallel workloads that produce results faster. Faster results mean that more questions can be asked, and the low cost of the technologies means the team can afford to ask those questions.

Disney assembled a Hadoop cluster and set up a central logging service to mine data that the organization hadn't been able to before. It will begin to provide internal group access to the cluster in October 2010. Figure 1 shows how the Hadoop cluster will benefit internal groups, business partners, and customers.
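To make the idea of breaking an analysis into many parallel workloads more concrete, here is a minimal, single-machine sketch of the MapReduce pattern in Python. It is not Disney's code and it does not use the Hadoop API; the log lines, their format, and every function name are assumptions invented for illustration. On a real cluster, the map and reduce steps would run as separate tasks on many servers.

```python
from collections import defaultdict

# Hypothetical usage-log lines, invented for illustration: "<visitor> <path>"
LOG_LINES = [
    "v1 /parks",
    "v2 /movies",
    "v1 /parks",
    "v3 /music",
    "v2 /parks",
]


def map_phase(line):
    """Map step: turn one raw log line into (key, value) pairs."""
    _visitor, path = line.split()
    yield path, 1


def shuffle(pairs):
    """Shuffle step: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on one machine.

    Each call to map_phase and reduce_phase is independent of the others,
    which is what allows a cluster to spread them across many servers.
    """
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(LOG_LINES))
```

Because no map call depends on another, adding servers lets the same job cover more log lines in roughly the same time, which is the property the Disney team found attractive.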


Figure 1: Disney's Hadoop cluster and central logging service. Disney's new D-Cloud data cluster can scale to handle (1) less-structured usage data through the establishment of (2) a central logging service, (3) a cost-effective Hadoop data analysis engine, and a commodity computer cluster. The result is (4) a more responsive and personalized user experience. (Diagram labels: site visitors, internal business partners, and affiliated businesses; usage data; central logging service; interface to cluster via MapReduce/Hive/Pig; D-Cloud data cluster on Hadoop; core IT and business unit systems; metadata repository; improved experience.) Source: Disney, 2010



Simply put, the low cost of a Hadoop cluster means freedom to experiment. Disney uses a couple of dozen servers that were scheduled to be retired, and the organization operates its cluster with a handful of existing staff. Matt Estes, principal data architect for the Disney Technology Shared Services Group, estimates the cost of the project at $300,000 to $500,000.

"Before, I would have needed to figure on spending $3 million to $5 million for such an initiative," Albers says. "Now I can do this without charging to the bottom line."

Unlike the reusable canned queries in typical BI systems, Big Data analysis does require more effort to write the queries and the data-parsing code for what are often unique inquiries of data sources. But Albers notes that the risk is lower because all the other costs are lower. Failure is inexpensive, so analysts are more willing to explore questions they would otherwise avoid.

Even in this early stage, Albers is confident that the ability to ask more questions will lead to more insights that translate to both the bottom line and the top line. For example, Disney already is seeking to boost customer engagement and spending by making recommendations to customers based on pattern analysis of their online behavior.

Opportunities for Big Data insights

Here are other examples of the kinds of insights that may be gleaned from analysis of Big Data information flows:

Customer churn, based on analysis of call center, help desk, and Web site traffic patterns

Changes in corporate reputation and the potential for regulatory action, based on the monitoring of social networks as well as Web news sites

Real-time demand forecasting, based on disparate inputs such as weather forecasts, travel reservations, automotive traffic, and retail point-of-sale data

Supply chain optimization, based on analysis of weather patterns, potential disaster scenarios, and political turmoil

Disney and others explore their data without a lot of preconceptions. They know the results won't be as specific as a profit-margin calculation or a drug-efficacy determination. But they still expect demonstrable value, and they expect to get it without a lot of extra expense.

Typical BI uses data from transactional and other relational database management systems (RDBMSs) that an enterprise collects (such as sales and purchasing records, product development costs, and new employee hire records), diligently scrubs the data for accuracy and consistency, and then puts it into a form the BI system is programmed to run queries against. Such systems are vital for accurate analyses of transactional information, especially information subject to compliance requirements, but they don't work well for messy questions, they've been too expensive for questions you're not sure there's any value in asking, and they haven't been able to scale to analyze large data sets efficiently. (See Figure 2.)

How Big Data analysis is different

What should other enterprises anticipate from Hadoop-style analytics? It is a type of exploratory BI they haven't done much before. This is business intelligence that provides indications, not absolute conclusions. It requires a different mind-set, one that begins with exploration, the results of which create hypotheses that are tested before moving on to validation and consolidation.

These methods could be used to answer questions such as, "What indicators might there be that predate a surge in Web traffic?" or "What fabrics and colors are gaining popularity among influencers, and what sources might be able to provide the materials to us?" or "What's the value of an influencer on Web traffic through his or her social network?" See the sidebar "Opportunities for Big Data insights" for more examples of the kinds of questions that can be asked of Big Data.




Other companies have also tapped into the excitement brewing over Big Data technologies. Several Web-oriented companies that have always dealt with huge amounts of data, such as Yahoo, Twitter, and Google, were early adopters. Now, more traditional companies, such as TransUnion, a credit rating service, are exploring Big Data concepts, having seen the cost and scalability benefits the Web companies have realized.

Specifically, enterprises are also motivated by the inability to scale their existing approach for working on traditional analytics tasks, such as querying across terabytes of relational data. They are learning that the tools associated with Hadoop are uniquely positioned to explore data that has been sitting on the side, unanalyzed. Figure 3 illustrates how the data architecture landscape appears in 2010. Enterprises with high processing power requirements and centralized architectures are facing scaling issues.

Figure 2: Where Big Data fits in. (Chart axes: small to large data sets; relational to non-relational data. Chart labels: Big Data via Hadoop/MapReduce, traditional BI, less scalability, little analytical value.) Source: PricewaterhouseCoopers, 2010

In contrast, Big Data techniques allow you to sift through data to look for patterns at a much lower cost and in much less time than traditional BI systems. Should the data end up being so valuable that it requires the ongoing, compliance-oriented analysis of regular BI systems, only then do you make that investment. Big Data approaches let you ask more questions of more information, opening a wide range of potential insights you couldn't afford to consider in the past.

"Part of the analytics role is to challenge assumptions," Estes says. BI systems aren't designed to do that; instead, they're designed to dig deeper into known questions and look for variations that may indicate deviations from expected outcomes.

Furthermore, Big Data analysis is usually iterative: you ask one question or examine one data set, then think of more questions or decide to look at more data. That's different from the "single source of truth" approach to standard BI and data warehousing. The Disney team started with making sure they could expose and access the data, then moved to iterative refinement in working with the data. "We aggressively got in to find the direction and the base. Then we began to iterate rather than try to do a Big Bang," Albers says.

Figure 3: The data architecture landscape in 2010. (Chart axes: low to high processing power; centralized to distributed compute architecture. High processing power, centralized: enterprises facing scaling and capacity/cost problems. High processing power, distributed: Google, Amazon, Facebook, Twitter, etc., all of which use nonrelational data stores for reasons of scale. Low processing power, centralized: most enterprises. Low processing power, distributed: cloud users with low compute requirements.) Source: PricewaterhouseCoopers, 2010

Wolfram Research and IBM have begun to extend their analytics applications to run on such large-scale data pools, and startups are presenting approaches they promise will allow data exploration in ways that technologies couldn't have enabled in the past, including support for tools that let knowledge workers examine traditional databases using Big Data-style exploratory tools.



The ways different enterprises approach Big Data

It should come as no surprise that organizations dealing with lots of data are already investigating Big Data technologies, or that they have mixed opinions about these tools.

"At TransUnion, we spend a lot of our time trawling through tens or hundreds of billions of rows of data, looking for things that match a pattern approximately," says John Parkinson, TransUnion's acting CTO. "We want to do accurate but approximate matching and categorization in very large low-structure data sets."

Parkinson has explored Big Data technologies such as MapReduce that appear to have a more efficient filtering model than some of the pattern-matching algorithms TransUnion has tried in the past. MapReduce also, at least in its theoretical formulation, is very amenable to highly parallelized execution, which lets the users tap into farms of commodity hardware for fast, inexpensive processing, he notes.

However, Parkinson thinks Hadoop and MapReduce are too immature. "MapReduce really hasn't evolved yet to the point where your average enterprise technologist can easily make productive use of it. As for Hadoop, they have done a good job, but it's like a lot of open-source software: 80 percent done. There were limits in the code that broke the stack well before what we thought was a good theoretical limit."

Parkinson echoes many IT executives who are skeptical of open-source software in general. "If I have a bunch of engineers, I don't want them spending their day being the technology support environment for what should be a product in our architecture," he says.

That's a legitimate point of view, especially considering the data volumes TransUnion manages (8 petabytes from 83,000 sources in 4,000 formats and growing) and its focus on mission-critical capabilities for this data. Credit scoring must run successfully and deliver top-notch credit scores several times a day. It's an operational system that many depend on for critical business decisions that happen millions of times a day. (For more on TransUnion, see the interview with Parkinson on page 14.)
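The "accurate but approximate matching" Parkinson describes can be pictured with a small Python sketch. This is not TransUnion's algorithm; it is a toy illustration that uses the standard library's difflib.SequenceMatcher as a stand-in similarity measure, and the names, records, and threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0.0 to 1.0 similarity score between two strings,
    ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def approximate_matches(query, records, threshold=0.9):
    """Return the records whose similarity to the query meets the
    threshold, best match first. The threshold is an assumed cutoff
    chosen for illustration, not a recommended value."""
    scored = [(similarity(query, record), record) for record in records]
    scored.sort(reverse=True)
    return [record for score, record in scored if score >= threshold]


if __name__ == "__main__":
    # Hypothetical records, invented for illustration.
    records = ["John Smith", "Jane Doe", "J. Smithe"]
    print(approximate_matches("Jon Smith", records))
```

Each comparison is independent of every other one, so the scoring step could in principle be spread across a cluster, which is the property of MapReduce that Parkinson notes.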

Disney's system is purely intended for exploratory efforts or at most for reporting that eventually may feed up to product strategy or Web site design decisions. If it breaks or needs a little retooling, there's no crisis.

But Albers disagrees about the readiness of the tools, noting that the Disney Technology Shared Services Group also handles quite a bit of data. He figures Hadoop and MapReduce aren't any worse than a lot of proprietary software. "I fully expect we will run on things that break," he says, adding facetiously, "Not that any commercial product I've ever had has ever broken."

Data architect Estes also sees a laudable responsiveness in open-source development. "In our testing, we uncovered stuff, and you get somebody on the other end. This is their baby, right? I mean, they want it fixed."

Albers emphasizes the total cost-effectiveness of Hadoop and MapReduce. "My software cost is zero. You still have the implementation, but that's a constant at some level, no matter what. Now you probably need to have a little higher skill level at this stage of the game, so you're probably paying a little more, but certainly, you're not going out and approving a Teradata cluster. You're talking about Tier 3 storage. You're talking about a very low level of cost for the storage."

Albers's points are also valid. PricewaterhouseCoopers predicts these open-source tools will be solid sooner rather than later, and are already worthy of use in non-mission-critical environments and applications. Hence, in the CIO article on page 36, we argue in favor of taking cautious but exploratory steps.

Asking new business questions

Saving money is certainly a big reward, but PricewaterhouseCoopers contends the biggest payoff from Hadoop-style analysis of Big Data is the potential to improve organizations' top line. "There's a lot of potential value in the unstructured data in organizations, and people are starting to look at it more seriously," says Tom Urquhart, chief architect at PricewaterhouseCoopers. "Think of it as a Google in a box, which allows you to do intelligent search regardless of whether the underlying content is structured or unstructured," he says.



The Google-style techniques in Hadoop, MapReduce, and related technologies work in a fundamentally different way from traditional BI systems, which use strictly formatted data cubes pulling information from data warehouses. Big Data tools let you work with data that hasn't been formally modeled by data architects, so you can analyze and compare data of different types and of different levels of rigor. Because these tools typically don't discard or change the source data before the analysis begins, the original context remains available for drill-down by analysts.

These tools provide technology assistance to a very human form of analysis: looking at the world as it is and finding patterns of similarity and difference, then going deeper into the areas of interest judged valuable. In contrast, BI systems know what questions should be asked and what answers to expect; their goal is to look for deviations from the norm or changes in standard patterns deemed important to track (such as changes in baseline quality or in sales rates in specific geographies). Such an approach, absent an exploratory phase, results in a lot of information loss during data consolidation. (See Figure 4.)
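One way to picture analysis that leaves the source data untouched is to apply structure only at read time and carry the raw record along with every parsed result. The Python sketch below is a hedged illustration under that assumption; the log format, the sample lines, and the function names are invented here and do not come from any of the tools discussed.

```python
import re

# Hypothetical raw log lines, invented for illustration. They are never
# modified or discarded, not even the malformed one.
RAW_EVENTS = [
    "2010-09-01 GET /parks/tickets 200",
    "2010-09-01 GET /store/cart 500",
    "malformed entry with no fields",
    "2010-09-02 GET /store/cart 200",
]

# The "schema" is just a pattern applied when the data is read.
PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<method>[A-Z]+) "
    r"(?P<path>\S+) (?P<status>\d{3})$"
)


def read_with_schema(raw_lines):
    """Yield a parsed record for each line that fits the pattern, keeping
    the original line under the 'raw' key for later drill-down."""
    for line in raw_lines:
        match = PATTERN.match(line)
        if match:
            yield {**match.groupdict(), "raw": line}


def errors_by_path(raw_lines):
    """Group the raw lines of server errors (5xx status) by request path."""
    hits = {}
    for record in read_with_schema(raw_lines):
        if record["status"].startswith("5"):
            hits.setdefault(record["path"], []).append(record["raw"])
    return hits


if __name__ == "__main__":
    print(errors_by_path(RAW_EVENTS))
```

Because the raw lines are kept, a different question tomorrow only needs a different pattern or grouping function rather than a new extract from the source systems.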

Pattern analysis mashup services

There's another use of Big Data that combines efficiency and exploratory benefits: on-the-fly pattern analysis from disparate sources to return real-time results. Amazon.com pioneered Big Data-based product recommendations by analyzing customer data, including purchase histories, product ratings, and comments. Albers is looking for similar value that would come from making live recommendations to customers when they go to a Disney site, store, or reservations phone line, based on their previous online and offline behavior with Disney.

O'Reilly Media, a publisher best known for technical books and Web sites, is working with the White House to develop mashup applications that look at data from various sources to identify patterns that might help lobbyists and policymakers. For example, by mashing together US Census data and labor statistics, they can see which counties have the most international and domestic immigration, then correlate those attributes with government spending changes, says Roger Magoulas, O'Reilly's research director.

Figure 4: Information loss in the data consolidation process. (Diagram labels: all collected data is consolidated into summary departmental data and then into summary enterprise data, with information loss at each step, yielding insight. With an exploration phase that also draws on pre-consolidated data that was never collected, there is less information loss and greater insight.) Source: PricewaterhouseCoopers, 2010


Mashups like this can also result in customer-facing services. FlightCaster for iPhone and BlackBerry uses Big Data approaches to analyze flight-delay records and current conditions to issue flight-delay predictions to travelers.

Exploiting the power of human analysis

Big Data approaches can lower processing and storage costs, but we believe their main value is to perform the analysis that BI systems weren't designed for, acting as an enabler and an amplifier of human analysis.

Ad hoc exploration at a bargain

Big Data lets you inexpensively explore questions and peruse data for patterns that may indicate opportunities or issues. In this arena, failure is cheap, so analysts are more willing to explore questions they would otherwise avoid. And that should lead to insights that help the business operate better.

Medical data is an example of the potential for ad hoc analysis. "A number of such discoveries are made on the weekends when the people looking at the data are doing it from the point of view of just playing around," says Doug Lenat, founder and CEO of Cycorp and a former professor at Stanford and Carnegie Mellon universities. Right now the technical knowledge required to use these tools is nontrivial. Imagine the value of extending the exploratory capability more broadly. Cycorp is one of many startups trying to make Big Data analytic capabilities usable by more knowledge workers so they can perform such exploration.

Analyzing data that wasn't designed for BI

Big Data also lets you work with "gray data," or data from multiple sources that isn't formatted or vetted for your specific needs, and that varies significantly in its level of detail and accuracy, and thus cannot be examined by BI systems.

One analogy is Wikipedia. Everyone knows its information is not rigorously managed or necessarily accurate; nonetheless, Wikipedia is a good first place to look for indicators of what may be true and useful. From there, you do further research using a mix of information resources whose accuracy and completeness may be more established.

People use their knowledge and experience to appropriately weigh and correlate what they find across gray data to come up with improved strategies to aid the business. Figure 5 compares gray data and more normalized black data.

Figure 5: Gray versus black data

Gray data (e.g., Wikipedia): raw; data and context commingled; noisy; hypothetical; unchecked; indicative; less trustworthy; managed by business unit.

Black data (e.g., financial system data): classified; provenanced; cleaned; actual; reviewed; confirming; more trustworthy; managed by IT.

Source: PricewaterhouseCoopers, 2010

Web analytics and financial risk analysis are two examples of how Big Data approaches augment human analysts. These techniques comb huge data sets of information collected for specific purposes (such as monitoring individual financial records), looking for patterns that might identify good prospects for loans and flag problem borrowers. Increasingly, they comb external data not collected by a credit reporting agency (for example, trends in a neighborhood's housing values or in local merchants' sales patterns) to provide insights into where sales opportunities could be found or where higher concentrations of problem customers are located.

The same approaches can help identify shifts in consumer tastes, such as for apparel and furniture. And, by analyzing gray data related to costs of resources and changes in transportation schedules, these approaches can help anticipate stresses on suppliers and help identify where additional suppliers might be found.

All of these activities require human intelligence, experience, and insight to make sense of the data, figure out the questions to ask, decide what information should be correlated, and generally conduct the analysis.



Why the time is ripe for Big Data
The human analysis previously described is old hat for many business analysts, whether they work in manufacturing, fashion, finance, or real estate. What's changing is scale. As noted, many types of information are now available that never existed or were not accessible. What could once only be suggested through surveys, focus groups, and the like can now be examined directly, because more of the granular thinking and behaviors are captured. Businesses have the potential to discover more through larger samples and more granular details, without relying on people to recall behaviors and motivations accurately. This potential can be realized only if you pull together and analyze all that data. Right now, there's simply too much information for individual analysts to manage, increasing the chances of missing potential opportunities or risks. Businesses that augment their human experts with Big Data technologies could have significant competitive advantages by heading off problems sooner, identifying opportunities earlier, and performing mass customization at a larger scale. Fortunately, the emerging Big Data tools should let businesspeople apply individual judgments to vaster pools of information, enabling low-cost, ad hoc analysis never before feasible. Plus, as patterns are discovered, the detection of some can be automated, letting the human analysts concentrate on the art of analysis and interpretation that algorithms can't accomplish. Even better, emerging Big Data technologies promise to extend the reach of analysis beyond the cadre of researchers and business analysts. Several startups offer new tools that use familiar data-analysis tools (similar to those for SQL databases and Excel spreadsheets) to explore Big Data sources, thus broadening the ability to explore to a wider set of knowledge workers.
Finally, Big Data approaches can be used to power analytics-based services that improve the business itself, such as in-context recommendations to customers, more accurate predictions of service delivery, and more accurate failure predictions (such as for the manufacturing, energy, medical, and chemical industries).

Conclusion
PricewaterhouseCoopers believes that Big Data approaches will become a key value creator for businesses, letting them tap into the wild, woolly world of information heretofore out of reach. These new data management and storage technologies can also provide economies of scale in more traditional data analysis. Don't limit yourself to the efficiencies of Big Data and miss out on the potential for gaining insights through its advantages in handling the gray data prevalent today. Big Data analysis does not replace other systems. Rather, it supplements the BI systems, data warehouses, and database systems essential to financial reporting, sales management, production management, and compliance systems. The difference is that these information systems deal with the knowns that must meet high standards for rigor, accuracy, and compliance, while the emerging Big Data analytics tools help you deal with the unknowns that could affect business strategy or its execution. As the amount and interconnectedness of data vastly increases, the value of the Big Data approach will only grow. If the amount and variety of today's information is daunting, think what the world will be like in 5 or 10 years. People will become mobile sensors: collecting, creating, and transmitting all sorts of information, from locations to body status to environmental information. We already see this happening as smartphones equipped with cameras, microphones, geolocation, and compasses proliferate. Wearable medical sensors, small temperature tags for use on packages, and other radio-equipped sensors are a reality. They'll be the Twitter and Facebook feeds of tomorrow, adding vast quantities of new information that could provide context on behavior and environment never before possible, and a lot of noise certain to mask what's important.
Insight-oriented analytics in this sea of information (where interactions cause untold ripples and eddies in the flow and delivery of business value) will become a critical competitive requirement. Big Data technology is the likeliest path to gaining such insights.



The data scalability challenge

John Parkinson of TransUnion describes the data handling issues more companies will face in three to five years.
Interview conducted by Vinod Baya and Alan Morrison

John Parkinson is the acting CTO of TransUnion, the chairman and owner of Parkwood Advisors, and a former CTO at Capgemini. In this interview, Parkinson outlines TransUnion's considerable requirements for less-structured data analysis, shedding light on the many data-related technology challenges TransUnion faces today, challenges he says that more companies will face in the near future.

PwC: In your role at TransUnion, you've evaluated many large-scale data processing technologies. What do you think of Hadoop and MapReduce?
JP: MapReduce is a very computationally attractive answer for a certain class of problem. If you have that class of problem, then MapReduce is something you should look at. The challenge today, however, is that the number of people who really get the formalism behind MapReduce is a lot smaller than the group of people trying to understand what to do with it. It really hasn't evolved yet to the point where your average enterprise technologist can easily make productive use of it.

PwC: What class of problem would that be?
JP: MapReduce works best in situations where you want to do high-volume, accurate but approximate matching and categorization in very large, low-structured data sets. At TransUnion, we spend a lot of our time trawling through tens or hundreds of billions of rows of data looking for things that match a pattern approximately. MapReduce is a more efficient filter for some of the pattern-matching algorithms that we have tried to use. At least in its theoretical formulation, it's very amenable to highly parallelized execution, which many of the other filtering algorithms we've used aren't. The open-source stack is attractive for experimenting, but the problem we find is that Hadoop isn't what Google runs in production; it's an attempt by a bunch of pretty smart guys to reproduce what Google runs in production. They've done a good job, but it's like a lot of open-source software: 80 percent done. The 20 percent that isn't done, those are the hard parts. From an experimentation point of view, we have had a lot of success in proving that the computing formalism behind MapReduce works, but the software that we can acquire today is very fragile. It's difficult to manage. It has some bugs in it, and it doesn't behave very well in an enterprise environment. It also has some interesting limitations when you try to push the scale and the performance.



We found a number of representational problems when we used the HDFS/Hadoop/HBase stack to do something that, according to the documentation available, should have worked. However, in practice, limits in the code broke the stack well before what we thought was a good theoretical limit. Now, the good news of course is that you get source code. But that's also the bad news. You need to get the source code, and that's not something that we want to do as part of routine production. I have a bunch of smart engineers, but I don't want them spending their day being the technology support environment for what should be a product in our architecture. Yes, there's a pony there, but it's going to be awhile before it stabilizes to the point that I want to bet revenue on it.

PwC: Data warehousing appliance prices have dropped pretty dramatically over the past couple of years. When it comes to data that's not necessarily on the critical path, how does an enterprise make sure that it is not spending more than it has to?
JP: We are probably not a good representational example of that because our business is analyzing the data. There is almost no price we won't pay to get a better answer faster, because we can price that into the products we produce. The challenge we face is that the tools don't always work properly at the edge of the envelope. This is a problem for hardware as well as software. A lot of the vendors stop testing their applications at about 80 percent or 85 percent of their theoretical capability. We routinely run them at 110 percent of their theoretical capability, and they break. I don't mind making tactical justifications for technologies that I expect to replace quickly. I do that all the time. But having done that, I want the damn thing to work. Too often, we've discovered that it doesn't work.

PwC: Are you forced to use technologies that have matured because of a wariness of things on the absolute edge?
JP: My dilemma is that things that are known to work usually don't scale to what we need, for speed or full capacity. I must spend some time, energy, and dollars betting on things that aren't mature yet, but that can be sufficiently generalized architecturally. If the one I pick doesn't work, or goes away, I can fit something else into its place relatively easily. That's why we like appliances. As long as they are well behaved at the network layer and have a relatively generalized or standards-based business semantic interface, it doesn't matter if I have to unplug one in 18 months or two years because something better came along. I can't do that for everything, but I can usually afford to do it in the areas where I have no established commercial alternative.


PwC: What are you using in place of something like Hadoop?
JP: Essentially, we use brute force. We use Ab Initio, which is a very smart brute-force parallelization scheme. I depend on certain capabilities in Ab Initio to parallelize the ETL [extract, transform, and load] in such a way that I can throw more cores at the problem.

PwC: Much of the data you see is transactional. Is it all structured data, or are you also mining text?
JP: We get essentially three kinds of data. We get accounts receivable data from credit loan issuers. That's the record of what people actually spend. We get public record data, such as bankruptcy records, court records, and liens, which are semi-structured text. And we get other data, which is whatever shows up, and it's generally hooked together around a well-understood set of identifiers. But the cost of this data is essentially free; we don't pay for it. It's also very noisy. So we have to spend computational time figuring out whether the data we have is right, because we must find a place to put it in the working data sets that we build. At TransUnion, we suck in 100 million updates a day for the credit files. We update a big data warehouse that contains all the credit and related data. And then every day we generate somewhere between 1 and 20 operational data stores, which is what we actually run the business on. Our products are joined between what we call indicative data, the information that identifies you as an individual; structured data, which is derived from transactional records; and unstructured data that is attached to the indicative. We build those products on the fly because the data may change every day, sometimes several times a day. One challenge is how to accurately find the right place to put the record. For example, we get a Joe Smith at 13 Main Street and a Joe Smith at 31 Main Street. Are those two different Joe Smiths, or is that a typing error? We have to figure that out 100 million times a day using a bunch of custom pattern-matching and probabilistic algorithms.

PwC: Of the three kinds of data, which is the most challenging?
JP: We have two kinds of challenges. The first is driven purely by the scale at which we operate. We add roughly half a terabyte of data per month to the credit file. Everything we do has challenges related to scale, updates, speed, or database performance. The vendors both love us and hate us. But we are where the industry is going, where everybody is going to be in two to five years. We are a good leading indicator, but we break their stuff all the time. A second challenge is the unstructured part of the data, which is increasing.
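The 13 Main Street versus 31 Main Street question raised in this interview can be illustrated with a small scoring sketch. This is not TransUnion's algorithm, which the interview does not describe; the field names, the weights, and the transposition rule below are all assumptions chosen only to show how several pieces of weak evidence can be blended into one match score.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two lightly normalized strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_transposition(a: str, b: str) -> bool:
    """True when a and b differ only by swapping two adjacent characters."""
    if len(a) != len(b) or a == b:
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Blend name, street, and house-number evidence into one score.

    The weights (0.4, 0.3, 0.3) are illustrative assumptions, not tuned values.
    """
    name = similarity(rec_a["name"], rec_b["name"])
    street = similarity(rec_a["street"], rec_b["street"])
    if rec_a["number"] == rec_b["number"]:
        number = 1.0
    elif is_transposition(rec_a["number"], rec_b["number"]):
        number = 0.5  # plausible typing error, but not proof of one
    else:
        number = 0.0
    return 0.4 * name + 0.3 * street + 0.3 * number


# Hypothetical records modeled on the example in the interview.
a = {"name": "Joe Smith", "number": "13", "street": "Main Street"}
b = {"name": "Joe Smith", "number": "31", "street": "Main Street"}
c = {"name": "Joe Smith", "number": "47", "street": "Main Street"}

score_ab = match_score(a, b)  # transposed house number
score_ac = match_score(a, c)  # unrelated house number
print(f"13 vs 31: {score_ab:.2f}, 13 vs 47: {score_ac:.2f}")
```

In this sketch the transposed pair scores higher than the unrelated pair, but a score alone does not settle the question. A production system would still need a threshold, more fields, and a review path for borderline cases.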

PwC: It's more of a challenge to deal with the unstructured stuff because it comes in various formats and from various sources, correct?
JP: Yes. We have 83,000 data sources. Not everyone provides us with data every day. It comes in about 4,000 formats, despite our data interchange standards. And, to be able to process it fast enough, we must convert all data into a single interchange format that is the representation of what we use internally. Complex computer science problems are associated with all of that.

PwC: Are these the kinds of data problems that businesses in other industries will face in three to five years?
JP: Yes, I believe so.

PwC: What are some of the other problems you think will become more widespread?
JP: Here are some simple practical examples. We have 8.5 petabytes of data in the total managed environment. Once you go seriously above 100 terabytes, you must replace the storage fabric every four or five years. Moving 100 terabytes of data becomes a huge material issue and takes a long time. You do get some help from improved interconnect speed, but the arrays go as fast



as they go for reads and writes and you can't go faster than that. And businesses down the food chain are not accustomed to thinking about refresh cycles that take months to complete. Now, a refresh cycle of PCs might take months to complete, but any one piece of it takes only a couple of hours. When I move data from one array to another, I'm not done until I'm done. Additionally, I have some bugs and new vulnerabilities to deal with. Today, we don't have a backup problem at TransUnion because we do incremental forever backup. However, we do have a restore problem. To restore a material amount of data, which we very occasionally need to do, takes days in some instances because the physics of the technology we use won't go faster than that. The average IT department doesn't worry about these problems. But take the amount of data an average IT department has under management, multiply it by a single decimal order of magnitude, and it starts to become a material issue. We would like to see computationally more-efficient compression algorithms, because my two big cost pools are Store It and Move It. For now, I don't have a computational problem, but if I can't shift the trend line on Store It and Move It, I will have a computational problem within a few years. To perform the computations in useful time, I must parallelize how I compute. Above a certain point, the parallelization breaks because I can't move the data further.
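To put the remark about moving 100 terabytes in perspective, a quick back-of-the-envelope calculation helps. The sustained rates below are assumptions picked for illustration; the interview does not state TransUnion's actual array or interconnect speeds, and real transfers rarely sustain the nominal rate.

```python
def transfer_hours(terabytes: float, gigabits_per_second: float) -> float:
    """Hours needed to move a volume at a sustained rate (decimal units)."""
    bits = terabytes * 1e12 * 8  # 1 TB = 10**12 bytes, 8 bits per byte
    seconds = bits / (gigabits_per_second * 1e9)
    return seconds / 3600


# Assumed sustained end-to-end rates, in Gbit/s.
for rate in (1, 10, 40):
    hours = transfer_hours(100, rate)
    print(f"100 TB at {rate} Gbit/s sustained: {hours:.1f} hours ({hours / 24:.1f} days)")
```

At an assumed sustained 1 Gbit/s, 100 terabytes takes about 222 hours, or roughly nine days. Even at 10 Gbit/s it is close to a full day, before any verification or array write limits are considered.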

PwC: Cloudera [a vendor offering a Hadoop distribution] would say "bring the computation to the data."
JP: That works only for certain kinds of data. We already do all of that large-scale computation on a file system basis, not on a database basis. And we spend compute cycles to compress the data so there are fewer bits to move, then decompress the data for computation, and recompress it so we have fewer bits to store. What we have discovered (because I run the fourth largest commercial GPFS [general parallel file system, a distributed computing file system developed by IBM] cluster in the world) is that once you go beyond a certain size, the parallelization management tools break. That's why I keep telling people that Hadoop is not what Google runs in production. Maybe the Google guys have solved this, but if they have, they aren't telling me how.
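The cycle described in this answer (compress to move, decompress to compute, recompress to store) can be sketched with Python's standard zlib module. The record layout and the stand-in computation are made up for illustration, and zlib is simply a convenient choice here; the interview does not say which compression scheme TransUnion uses.

```python
import zlib


def compress_for_transfer(raw: bytes, level: int = 6) -> bytes:
    """Spend CPU cycles up front so fewer bytes cross the network."""
    return zlib.compress(raw, level)


def process(compressed: bytes) -> bytes:
    """Decompress, compute, then recompress the result for storage."""
    raw = zlib.decompress(compressed)
    result = raw.upper()  # stand-in for the real computation
    return zlib.compress(result, 6)


# Hypothetical, highly repetitive sample data.
record = b"id=1,status=open,source=public-record\n"
payload = record * 10_000

packed = compress_for_transfer(payload)
stored = process(packed)
ratio = len(packed) / len(payload)
print(f"moved {len(packed)} bytes instead of {len(payload)} (ratio {ratio:.4f})")
```

Repetitive sample data like this compresses far better than real, noisy data would, so the ratio printed here says nothing about production savings. The point is only the trade: compute cycles are spent twice to reduce both the Move It and the Store It costs.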


Creating a cost-effective Big Data strategy

Disney's Bud Albers, Scott Thompson, and Matt Estes outline an agile approach that leverages open-source and cloud technologies.
Interview conducted by Galen Gruman and Alan Morrison

Bud Albers joined what is now the Disney Technology Shared Services Group two years ago as executive vice president and CTO. His management team includes Scott Thompson, vice president of architecture, and Matt Estes, principal data architect. The Technology Shared Services Group, located in Seattle, has a heritage dating back to the late 1990s, when Disney acquired Starwave and Infoseek. The group supports all the Disney businesses ($38 billion in annual revenue), managing the company's portfolio of Web properties. These include properties for the studio, store, and park; ESPN; ABC; and a number of local television stations in major cities. In this interview, Albers, Thompson, and Estes discuss how they're expanding Disney's Web data analysis footprint without incurring additional cost by implementing a Hadoop cluster. Albers and team freed up budget for this cluster by virtualizing servers and eliminating other redundancies.

PwC: Disney is such a diverse company, and yet there clearly is lots of potential for synergies and cross-fertilization. How do you approach these opportunities from a data perspective?
BA: We try and understand the best way to work with and to provide services to the consumer in the long term. We have some businesses that are very data intensive, and then we have some that are less so because of their consumer audience. One of the challenges always is how to serve both kinds of businesses and do so in ways that make sense. The sell-to relationships extend from the studio out to the distribution groups and the theater chains. If you're selling to millions, you're trying to understand the different audiences and how they connect.

One of the things I've been telling my folks from a data perspective is that you don't send terabytes one way to be mated with a spreadsheet on the other side, right? We're thinking through those kinds of pieces and trying to figure out how we move down a path. The net is that working with all these businesses gives us a diverse set of requirements, as you might imagine. We're trying to stay ahead of where all the businesses are. In that respect, the questions I'm asking are, how do we get more agile, and how do we do it in a way that handles all the data we have? We must consider all of the new form factors being developed, all of which will generate lots of data. A big question is, how do we handle this data in a way that makes cost sense for the business and provides us an increased level of agility?



We hope to do in other areas what we've done with content distribution networks [CDNs]. We've had a tremendous amount of success with the CDN marketplace by standardizing, by staying in the middle of the road and not going to Akamai proprietary extensions, and by creating a dynamic marketplace. If we get a new episode of Lost, we can start streaming it, and I can be streaming 80 percent on Akamai and 20 percent on Level 3. Then we can decide we're going to turn it back, and I'm going to give 80 percent to Limelight and 20 percent to Level 3. We can do that dynamically.

PwC: What are the other main strengths of the Technology Shared Services Group at Disney?
BA: When I came here a couple of years ago, we had some very good core central services. If you look at the true definition of a cloud, we had the very early makings of one: shared central services around registration, for example. On Disney, on ABC, or on ESPN, if you have an ID, it works on all the Disney properties. If you have an ESPN ID, you can sign in to KGO in San Francisco, and it will work. It's all a shared registration system. The advertising system we built is shared. The marketing systems we built are shared: all the analytics collection, all those things are centralized. Those things that are common are shared among all the sites. Those things that are brand specific are built by the brands, and the user interface is controlled by the brands, so each of the various divisions has a head of engineering on the Web site who reports to me. Our CIO worries about it from the firewall back; I worry about it from the firewall to the living room and the mobile device. That's the way we split up the world, if that makes sense.

PwC: How do you link the data requirements of the central core with those that are unique to the various parts of the business?
BA: It's more art than science. The business units must generate revenue, and we must provide the core services. How do you strike that balance? Ownership is a lot more negotiated on some things today. We typically pull down most of the analytics and add things in, and it's a constant struggle to answer the question, "Do we have everything?" We're headed toward this notion of one data element at a time, aggregate, and queue up the aggregate. It can get a little bit crazy because you wind up needing to pull the data in and run it through that whole food chain, and it may or may not have lasting value. It may have only a temporal level of importance, and so we're trying to figure out how to better handle that. An awful lot of what we do in the data collection is pull it in, lay it out so it can be reported on, and/or push it back into the businesses, because the Web is evolving rapidly from a standalone thing to an integral part of how you do business.


PwC: Hadoop seems to suggest a feasible way to analyze data that has only temporal importance. How did you get to the point where you could try something like a Hadoop cluster?
BA: Guys like me never get called when it's all pretty and shiny. The Disney unit I joined obviously has many strengths, but when I was brought on, there was a cost growth situation. The volume of the aggregate activity growth was 17 percent. Our server growth at the time was 30 percent. So we were filling up data centers, but we were filling them with CPUs that weren't being used. My question was, how can you go to the CFO and ask for a lot of money to fill a data center with capital assets that you're going to use only 5 percent of? CPU utilization isn't the only measure, but it's the most prominent one. To study and understand what was happening, we put monitors and measures on our servers and reported peak CPU utilization on five-minute intervals across our server farm. We found that on roughly 80 percent of our servers, we never got above 10 percent utilization in a monthly period. Our first step to address that problem was virtualization. At this point, about 49 percent of our data center is virtual. Our virtualization effort had a sizable impact on cost. Dollars fell out because we quit building data centers and doing all kinds of redundant shuffling. We didn't have to lay off people. We changed some of our processes, and we were able to shift our growth curve from plus 27 to minus 3 on the shared service. We call this our D-Cloud effort. Another step in this effort was moving to a REST [REpresentational State Transfer] and JSON [JavaScript Object Notation] data exchange standard, because we knew we had to hit all these different devices and create some common APIs [application programming interfaces] in the framework. One of the very first things we put in place was a central logging service for all the events. These event logs can be streamed into one very large data set.
We can then use the Hadoop and MapReduce paradigm to go after that data.
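The utilization measurement Albers describes (peak CPU per five-minute interval, then the share of servers that never exceed 10 percent in a month) reduces to a short calculation. The sketch below uses invented server names and sample values; it shows the shape of the analysis, not Disney's monitoring tooling, which the interview does not detail.

```python
def underused_share(samples_by_server: dict[str, list[float]], threshold: float = 10.0) -> float:
    """Fraction of servers whose peak CPU utilization never exceeds threshold percent."""
    if not samples_by_server:
        return 0.0
    underused = [
        name
        for name, samples in samples_by_server.items()
        if max(samples, default=0.0) <= threshold
    ]
    return len(underused) / len(samples_by_server)


# Hypothetical peak-CPU readings (percent), one value per five-minute interval.
monthly_peaks = {
    "web-01": [3.0, 4.5, 9.8],
    "web-02": [2.0, 2.5, 6.1],
    "web-03": [1.0, 8.0, 7.7],
    "batch-01": [5.0, 9.9, 4.0],
    "db-01": [35.0, 72.0, 64.0],
}

share = underused_share(monthly_peaks)
print(f"{share:.0%} of servers never exceeded 10% CPU")
```

With these invented readings, four of the five servers stay at or below the threshold. Using the monthly peak rather than the average is deliberate: a server whose peak is under 10 percent has no busy period that would argue against consolidating it.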

PwC: How does the central logging service fit into your overall strategy?
ST: As we looked at it, we said, it's not just about virtualization. To be able to burst and do these other things, you need to build a bunch of core services. The initiative we're working on now is to build some of those core services around managing configuration. This project takes the foundation we laid with virtualization and a REST and JSON data exchange standard, and adds those core services that enable us to respond to the marketplace as it develops. Piping that data back to a central repository helps you to analyze it, understand what's going on, and make better decisions on the basis of what you learned.

PwC: How do you evolve so that the data strategy is really served well, so that it's more of a data-driven approach in some ways?
ME: On one side, you have a very transactional OLTP [online transactional processing] kind of world, RDBMSs [relational database management systems], and major vendors that we're using there. On the other side of it, you have traditional analytical warehousing. And where we've slotted this [Hadoop-style data] is in the middle with the other operational data. Some of it is derived from transactional data, and some has been crafted out of analytical data. There's a freedom that's derived from blending these two kinds of data. Our centralized logging service is an example. As we look at continuing to drive down costs to drive up efficiency, we can begin to log a large amount of this data at a price point that we have not been able to achieve by scaling up RDBMSs or using warehousing appliances. Then the key will be putting an expert system in place. That will give us the ability to really understand what's going on in the actual operational environment. We're starting to move again toward lower utilization trajectories. We need to scale the infrastructure back and get that utilization level up to the threshold.



PwC: This kind of information doesn't go in a cube. Not that data cubes are going away, but cubes are fairly well known now. The value you can create is exactly what you said, understanding the thinking behind it and the exploratory steps.
ST: We think storing the unstructured data in its raw format is what's coming. In a Hadoop environment, instead of bringing the data back to your warehouse, you figure out what question you want to answer. Then you MapReduce the input, and you may send that off to a data cube and a place that someone can dig around in, but you keep the data in its raw format and pull out only what you need.
BA: The wonderful thing about where we're headed right now is that data analysis used to be this giant, massive bet that you had to place up front, right? No longer. Now, I pull Hadoop off of the Internet, first making sure that we're compliant from a legal perspective with licensing and so forth. After that's taken care of, you begin to prototype. You begin to work with it against common hardware. You begin to work with it against stuff you otherwise might throw out. Rather than, "I'm going to go spend how much for Teradata?"

We're using the basic premise of the cloud, and we're using those techniques of standardizing the interface to virtualize and drive cost out. I'm taking that cost savings and returning some of it to the business, but then reinvesting some in new capabilities while the cost curve is stabilizing.
ME: Refining some of this reinvestment in new capabilities doesn't have to be put in the category of traditional $5 million projects companies used to think about. You can make significant improvements with reinvestments of $200,000 or even $50,000.
BA: It's then a matter of how you're redeploying an investment in resources that you've already made as a corporation. It's a matter of now prioritizing your work and not changing the bottom-line trajectory in a negative fashion with a bet that may not pay off. I can try it, and I don't have to get great big governance-based permission to do it, because it's not a bet of half the staff and all of this stuff. It's, "OK, let's get something on the ground, let's work with the business unit, let's pilot it, let's go somewhere where we know we have a need, let's validate it against this need, and let's make sure that it's working." It's not something that must go through an RFP [request for proposal] and standard procurement. I can move very fast.


Building a bridge to the rest of your data

How companies are using open-source cluster-computing techniques to analyze their data.
By Alan Morrison



As recently as two years ago, the International Supercomputing Conference (ISC) agenda included nothing about distributed computing for Big Data, as if projects such as Google Cluster Architecture, a low-cost, distributed computing design that enables efficient processing of large volumes of less-structured data, didn't exist. In a May 2008 blog, Brough Turner noted the omission, pointing out that Google had harnessed as much as 100 petaflops[1] of computing power, compared to a mere 1 petaflop in the new IBM Roadrunner, a supercomputer profiled in EE Times that month. "Have the supercomputer folks been bypassed and don't even know it?" Turner wondered.[2] Turner, co-founder and CTO of Ashtonbrooke.com, a startup in stealth mode, had been reading Google's research papers and remarking on them in his blog for years. Although the broader business community had taken little notice, some companies were following in Google's wake. Many of them were Web companies that had data processing scalability challenges similar to Google's.

Yahoo, for example, abandoned its own data architecture and began to adopt one along the lines pioneered by Google. It moved to Apache Hadoop, an open-source, Java-based distributed file system based on Google File System and developed by the Apache Software Foundation; it also adopted MapReduce, Google's parallel programming framework. Yahoo used these and other open-source tools it helped develop to crawl and index the Web. After implementing the architecture, it found other uses for the technology and has now scaled its Hadoop cluster to 4,000 nodes. By early 2010, Hadoop, MapReduce, and related open-source techniques had become the driving forces behind what O'Reilly Media, The Economist, and others in the press call "Big Data" and what vendors call "cloud storage." Big Data refers to data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis by traditional means. Many who are familiar with these new methods are convinced that Hadoop clusters will enable cost-effective analysis of Big Data, and these methods are now spreading beyond companies that mine the public Web as part of their business.


What are these methods and how do they work? This article looks at the architecture and tools surrounding Hadoop clusters with an eye toward what about them will be useful to mainstream enterprises during the next three to five years. We focus on their utility for less-structured data.

Hadoop clusters

Although cluster computing has been around for decades, commodity clusters are more recent, starting with UNIX- and Linux-based Beowulf clusters in the mid-1990s. These banks of inexpensive servers networked together were pitted against expensive supercomputers from companies such as Cray, the kind of computers that government agencies, such as the National Aeronautics and Space Administration (NASA), bought. It was no accident that NASA pioneered the development of Beowulf.3

Hadoop extends the value of commodity clusters, making it possible to assemble a high-end computing cluster at a low-end price. A central assumption underlying this architecture is that some nodes are bound to fail when computing jobs are distributed across hundreds or thousands of nodes. Therefore, one key to success is to design the architecture to anticipate and recover from individual node failures.4 Other goals of the Google Cluster Architecture and its expression in open-source Hadoop include:

Price/performance over peak performance: The emphasis is on optimizing aggregate throughput; for example, sorting functions to rank the occurrence of keywords in Web pages. Overall sorting throughput is high. In each of the past three years, Yahoo's Hadoop clusters have won Gray's terabyte sort benchmarking test.5

Software tolerance for hardware failures: When a failure occurs, the system responds by transferring the processing to another node, a critical capability for large distributed systems. As Roger Magoulas, research director for O'Reilly Media, says, "If you are going to have 40 or 100 machines, you don't expect your machines to break. If you are running something with 1,000 nodes, stuff is going to break all the time."

High compute power per query: The ability to scale up to thousands of nodes implies the ability to throw more compute power at each query. That ability, in turn, makes it possible to bring more data to bear on each problem.

Modularity and extensibility: Hadoop clusters scale horizontally with the help of a uniform, highly modular architecture.

Hadoop isn't intended for all kinds of workloads, especially not those with many writes. It works best for read-intensive workloads. These clusters complement, rather than replace, high-performance computing (HPC) and relational data systems. They don't work well with transactional data or records that require frequent updating. "Hadoop will process the data set and output a new data set, as opposed to changing the data set in place," says Amr Awadallah, vice president of engineering and CTO of Cloudera, which develops a version of Hadoop.

A data architecture and a software design that are frugal with network and disk resources are responsible for the price/performance ratio of Hadoop clusters. In Awadallah's words, "You move your processing to where your data lives." Each node has its own processing and storage, and the data is divided and processed locally in blocks sized for the purpose. This concept of localization makes it possible to use inexpensive serial advanced technology attachment (SATA) hard disks (the kind used in most PCs and servers) and Gigabit Ethernet for most network interconnections. (See Figure 1.)
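Two ideas from this section, sending the work to a node that already stores the data and moving the work to another node when one fails, can be illustrated with a minimal Python sketch. It is not Hadoop's scheduler; the function name, node names, and block number are hypothetical and chosen only for illustration.

```python
from typing import Callable


def run_where_data_lives(
    block_id: int,
    replica_nodes: dict[int, list[str]],
    healthy_nodes: set[str],
    task: Callable[[int], str],
) -> tuple[str, str]:
    """Run the task on the first healthy node that already stores the block."""
    for node in replica_nodes[block_id]:
        if node in healthy_nodes:
            # The work goes to the data; the block itself is not copied.
            return node, task(block_id)
    raise RuntimeError(f"no healthy node holds block {block_id}")


def process_block(block_id: int) -> str:
    """Stand-in for real processing of one block."""
    return f"processed block {block_id}"


# Hypothetical cluster state: block 7 is stored on three nodes.
replica_nodes = {7: ["node-a", "node-b", "node-c"]}

# All nodes healthy: the first node holding the block does the work.
first_run = run_where_data_lives(
    7, replica_nodes, {"node-a", "node-b", "node-c"}, process_block
)

# node-a has failed: the same work moves to the next node holding a copy.
second_run = run_where_data_lives(
    7, replica_nodes, {"node-b", "node-c"}, process_block
)

print(first_run)
print(second_run)
```

The sketch makes one point: because several nodes hold the same block, a failure changes where the work runs, not whether it can run.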



[Figure: a client connected through a 1000Mbps switch to two racks, each behind a 100Mbps switch. The racks hold ten task tracker/DataNode nodes plus one JobTracker and one NameNode.]

Typical node setup: 2 quad-core Intel Nehalem processors; 24GB of RAM; 12 1TB SATA disks (non-RAID); 1 Gigabit Ethernet card. Cost per node: $5,000. Effective file space per node: 20TB.

Claimed benefits: Linear scaling at $250 per user TB (versus $5,000 to $100,000 for alternatives); compute placed near the data and fewer writes limit networking and storage costs; modularity and extensibility.

Figure 1: Hadoop cluster layout and characteristics
Source: IBM, 2008, and Cloudera, 2010
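The $250 per TB figure in Figure 1 appears to follow directly from the per-node numbers quoted there. The short Python sketch below works through that arithmetic; the function name is ours, and the only inputs are the two figures from Figure 1.

```python
def cost_per_usable_tb(cost_per_node_usd: float, effective_tb_per_node: float) -> float:
    """Hardware cost divided by the effective file space one node provides."""
    return cost_per_node_usd / effective_tb_per_node


# Figures quoted in Figure 1: $5,000 per node, 20TB effective file space per node.
COST_PER_NODE_USD = 5_000
EFFECTIVE_TB_PER_NODE = 20

unit_cost = cost_per_usable_tb(COST_PER_NODE_USD, EFFECTIVE_TB_PER_NODE)
print(f"${unit_cost:,.0f} per TB")

# With uniform nodes, capacity and cost grow together, so the unit cost
# stays flat as the cluster grows (the "linear scaling" claim).
for nodes in (10, 100, 1_000):
    total_cost = nodes * COST_PER_NODE_USD
    total_tb = nodes * EFFECTIVE_TB_PER_NODE
    print(nodes, total_cost, total_tb, total_cost / total_tb)
```

This covers hardware only, as the figure does; it says nothing about power, staffing, or networking costs.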



The result is less-expensive large-scale distributed computing and parallel processing, which make possible an analysis that is different from what most enterprises have previously attempted. As author Tom White points out, "The ability to run an ad hoc query against your whole data set and get the results in a reasonable time is transformative."6

The cost of this capability is low enough that companies can fund a Hadoop cluster from existing IT budgets. When it decided to try Hadoop, Disney's Technology Shared Services Group took advantage of the increased server utilization it had already achieved from virtualization. As of March 2010, with nearly 50 percent of its servers virtualized, Disney had 30 percent server image growth annually but 30 percent less growth in physical servers. It was then able to set up a multiterabyte cluster with Hadoop and other free open-source tools, using servers it had planned to retire. The group estimates it spent less than $500,000 on the entire project. (See the article, "Tapping into the power of Big Data," on page 04.)

These clusters are also transformative because cloud providers can offer them on demand. Instead of using their own infrastructures, companies can subscribe to a service such as Amazon's, or to Cloudera's distribution on the Amazon Elastic Compute Cloud (EC2) platform.

The EC2 platform was crucial in a well-known use of cloud computing on a Big Data project that also depended on Hadoop and other open-source tools. In 2007, The New York Times needed to quickly assemble the PDFs of 11 million articles from 4 terabytes of scanned images. Amazon's EC2 service completed the job in 24 hours after setup, a feat that received widespread attention in blogs and the trade press. Mostly overlooked in all that attention was the use of the Hadoop Distributed File System (HDFS) and the MapReduce framework. Using these open-source tools, after studying how-to blog posts from others, Times senior software architect Derek Gottfrid developed and ran code in parallel across multiple Amazon machines.7

"Amazon supports Hadoop directly through its Elastic MapReduce application programming interfaces [APIs]," says Chris Wensel, founder of Concurrent, which developed Cascading. (See the discussion of Cascading later in this article.) "I regularly work with customers to boot up 200-node clusters and process 3 terabytes of data in five or six hours, and then shut the whole thing down. That's extraordinarily powerful."

The Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) and the MapReduce parallel programming framework are at the core of Apache Hadoop. Comparing HDFS and MapReduce to Linux, Awadallah says that together they're a "data operating system." This description may be overstated, but there are similarities to any operating system. Operating systems schedule tasks, allocate resources, and manage files and data flows to fulfill the tasks. HDFS does a distributed computing version of this. "It takes care of linking all the nodes together to look like one big file and job scheduling system for the applications running on top of it," Awadallah says.

HDFS, like all Hadoop tools, is Java based. An HDFS contains two kinds of nodes:

A single NameNode that logs and maintains the necessary metadata in memory for distributed jobs

Multiple DataNodes that create, manage, and process the 64MB blocks that contain pieces of Hadoop jobs, according to the instructions from the NameNode



HDFS uses multi-gigabyte file sizes to reduce the management complexity of lots of files in large data volumes. It typically writes each copy of the data once, adding to files sequentially. This approach simplifies the task of synchronizing data and reduces disk and bandwidth usage.

Equally important is achieving fault tolerance within the same disk and bandwidth usage limits. To accomplish fault tolerance, HDFS creates three copies of each data block, typically storing two copies in the same rack. The system goes to another rack only if it needs the third copy. Figure 2 shows a simplified depiction of HDFS and its data block copying method.
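The block and replica scheme described above can be sketched in a few lines of Python. This is a simplified illustration, not HDFS code: the rack and node names, the file name, and the round-robin placement rule are all assumptions. Only the 64MB block size, the three copies, and the "two copies in one rack, one in another" layout come from the text.

```python
import math

BLOCK_SIZE_MB = 64  # block size quoted in the article
REPLICAS = 3        # copies per block quoted in the article


def split_into_blocks(file_size_mb: int) -> int:
    """Number of fixed-size blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)


def place_replicas(block_id: int, racks: dict[str, list[str]]) -> list[str]:
    """Pick two nodes in one rack and a third node in a different rack."""
    if len(racks) < 2:
        raise ValueError("need at least two racks")
    rack_names = sorted(racks)
    home = rack_names[block_id % len(rack_names)]
    other = rack_names[(block_id + 1) % len(rack_names)]
    home_nodes = racks[home]
    if len(home_nodes) < 2:
        raise ValueError("need at least two nodes in the home rack")
    first = home_nodes[block_id % len(home_nodes)]
    second = home_nodes[(block_id + 1) % len(home_nodes)]
    third = racks[other][block_id % len(racks[other])]
    return [first, second, third]


def name_node_metadata(
    file_name: str, file_size_mb: int, racks: dict[str, list[str]]
) -> dict[tuple[str, int], list[str]]:
    """The kind of file-to-block-to-node map a NameNode would keep in memory."""
    return {
        (file_name, block_id): place_replicas(block_id, racks)
        for block_id in range(split_into_blocks(file_size_mb))
    }


# Hypothetical two-rack cluster.
racks = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
metadata = name_node_metadata("logs.dat", 200, racks)
for key, nodes in metadata.items():
    print(key, nodes)
```

Losing any single node, or even a whole rack, still leaves at least one copy of every block, which is the property the three-copy layout is meant to provide.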

HDFS does not perform tasks such as changing specific numbers in a list or other changes on parts of a database. This limitation leads some to assume that HDFS is not suitable for structured data. "HDFS was never designed for structured data and therefore it's not optimal to perform queries on structured data," says Daniel Abadi, assistant professor of computer science at Yale University. Abadi and others at Yale have done performance testing on the subject, and they have created a relational database alternative to HDFS called HadoopDB to address the performance issues they identified.8

Some developers are structuring data in ways that are suitable for HDFS; they're just doing it differently from the way relational data would be structured. Nathan Marz, a lead engineer at BackType, a company that offers a search engine for social media buzz, uses schemas to ensure consistency and avoid data corruption. "A lot of people think that Hadoop is meant for unstructured data, like log files," Marz says. "While Hadoop is great for log files, it's also fantastic for strongly typed, structured data." For this purpose, Marz uses Thrift, which was developed by Facebook for data translation and serialization purposes.9 (See the discussion of Thrift later in this article.) Figure 3 illustrates a typical Hadoop data processing flow that includes Thrift and MapReduce.

[Figure: a client, a NameNode holding metadata that maps files to blocks (one file to blocks 1, 2, 4 and another to blocks 3, 5), and four DataNodes, each storing copies of several of the numbered blocks.]

Figure 2: The Hadoop Distributed File System, or HDFS
Source: Apache Software Foundation, IBM, and PricewaterhouseCoopers, 2008

[Figure: a four-stage flow. Input data: less-structured information such as log files, messages, and images. Input applications: Cascading, Thrift, Zookeeper, and Pig. Core Hadoop data processing: jobs divided into 64MB blocks, processed by map (M) and reduce (R) steps to produce results. Output applications: mashups, RDBMS apps, and BI systems.]

Figure 3: Hadoop ecosystem overview
Source: PricewaterhouseCoopers, derived from Apache Software Foundation and Dion Hinchcliffe, 2010



MapReduce

MapReduce is the base programming framework for Hadoop. It often acts as a bridge between HDFS and tools that are more accessible to most programmers. According to those at Google who developed the tool, it hides the details of parallelization and the other nuts and bolts of HDFS.10

MapReduce is a layer of abstraction, a way of managing a sea of details by creating a layer that captures and summarizes their essence. That doesn't mean it is easy to use. Many developers choose to work with another tool, yet another layer of abstraction on top of it. "I avoid using MapReduce directly at all cost," Marz says. "I actually do almost all my MapReduce work with a library called Cascading."

The terms "map" and "reduce" refer to steps the tool takes to distribute, or map, the input for parallel processing, and then reduce, or aggregate, the processed data into output files. (See Figure 4.) MapReduce works with key-value pairs. Frequently with Web data, the keys consist of URLs and the values consist of Web page content, such as Hypertext Markup Language (HTML).

MapReduce's main value is as a platform with a set of APIs. Before MapReduce, fewer programmers could take advantage of distributed computing. Now that user-accessible tools have been designed, simpler programming is possible on massively parallel systems and less adaptation of the programs is required. The following sections examine some of these tools.

[Figure: in each of data stores 1 through n, input key-value pairs pass through a map step that emits values grouped by key (key 1, key 2, key 3). A barrier aggregates the intermediate values by output key, and a reduce step for each key produces the final values for key 1, key 2, and key 3.]

Figure 4: MapReduce phases
Source: Google, 2004, and Cloudera, 2009
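The map, barrier, and reduce phases can be imitated on a single machine in a few lines of Python. The sketch below is not the Hadoop API: the function names, the example URLs, and the page text are invented for illustration, and a real cluster would run the map and reduce steps in parallel on many nodes. It follows the article's example of URLs as keys and page content as values, and counts keyword occurrences.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_phase(
    records: Iterable[tuple[str, str]],
    mapper: Callable[[str, str], Iterable[tuple[str, int]]],
) -> Iterator[tuple[str, int]]:
    """Apply the mapper to every input key-value pair."""
    for key, value in records:
        yield from mapper(key, value)


def barrier(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group intermediate values by output key before any reducer runs."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(
    groups: dict[str, list[int]],
    reducer: Callable[[str, list[int]], int],
) -> dict[str, int]:
    """Collapse each key's intermediate values into one final value."""
    return {key: reducer(key, values) for key, values in groups.items()}


def keyword_mapper(url: str, content: str) -> Iterator[tuple[str, int]]:
    """Emit (keyword, 1) for every word on the page; the URL is not needed here."""
    for word in content.lower().split():
        yield word, 1


def sum_reducer(keyword: str, counts: list[int]) -> int:
    """Total occurrences of one keyword across all pages."""
    return sum(counts)


# Hypothetical input: URL keys and page-content values.
pages = [
    ("http://example.com/a", "big data big clusters"),
    ("http://example.com/b", "big clusters"),
]

keyword_counts = reduce_phase(barrier(map_phase(pages, keyword_mapper)), sum_reducer)
print(keyword_counts)
```

Because each mapper call sees one record and each reducer call sees one key, both phases can be spread across nodes without the programmer writing any coordination code, which is the abstraction the section describes.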



Cascading

Wensel, who created Cascading, calls it "an alternative API to MapReduce," a single library of operations that developers can tap. It's another layer of abstraction that helps bring what programmers ordinarily do in non-distributed environments to distributed computing. With it, he says, "you can code in whatever JVM-based [Java Virtual Machine] language you want, and then shove that into the cluster."

Wensel wanted to obviate the need for "thinking in MapReduce." When using Cascading, developers don't think in key-value pair terms; they think in terms of fields and lists of values called "tuples." A Cascading tuple is simpler than a database record but acts like one. Each tuple flows through pipe assemblies, which are comparable to Java classes. The data flow begins at the source, an input file, and ends with a sink, an output directory. (See Figure 5.)

Rather than approach map and reduce phases large-file by large-file, developers assemble flows of operations using functions, filters, aggregators, and buffers. Those flows make up the pipe assemblies, which, in Marz's terms, "compile to MapReduce." In this way, Cascading smoothes the bumpy MapReduce terrain so more developers, including those who work mainly in scripting languages, can build flows. (See Figure 6.)

[Figure: tuples with field names [f1, f2, ...] flow from a source (So) through pipes (P) across alternating map and reduce steps to a sink (Si).]

Figure 5: A Cascading assembly
Source: Concurrent, 2010

[Figure: on the client, pipe assemblies (A) are combined into a flow; the flow is translated to Hadoop MapReduce (MR); on the cluster, the result runs as MapReduce jobs.]

Figure 6: Cascading assembly and flow
Source: Concurrent, 2010
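To make the idea of tuples flowing through a pipe assembly concrete, here is a minimal Python sketch. It is not Cascading's API (Cascading is a Java library, and it plans the flow into MapReduce jobs rather than running it in memory). The `Pipe` class, its method names, and the sample tuples are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Iterable

# A tuple here is a dict of field name to value.
Tuple = dict


class Pipe:
    """Hypothetical stand-in for a pipe assembly; not Cascading's API."""

    def __init__(self) -> None:
        self.steps: list[Callable[[list[Tuple]], list[Tuple]]] = []

    def each(self, function: Callable[[Tuple], Tuple]) -> "Pipe":
        """Function: transform every tuple."""
        self.steps.append(lambda tuples: [function(t) for t in tuples])
        return self

    def keep(self, predicate: Callable[[Tuple], bool]) -> "Pipe":
        """Filter: drop tuples that fail the predicate."""
        self.steps.append(lambda tuples: [t for t in tuples if predicate(t)])
        return self

    def group_by(
        self, field: str, aggregator: Callable[[object, list[Tuple]], Tuple]
    ) -> "Pipe":
        """Aggregator: group tuples on one field and emit one tuple per group."""

        def step(tuples: list[Tuple]) -> list[Tuple]:
            groups: dict[object, list[Tuple]] = defaultdict(list)
            for t in tuples:
                groups[t[field]].append(t)
            return [aggregator(key, members) for key, members in groups.items()]

        self.steps.append(step)
        return self

    def run(self, source: Iterable[Tuple]) -> list[Tuple]:
        """Push every tuple from the source through the steps; return the sink contents."""
        tuples = list(source)
        for step in self.steps:
            tuples = step(tuples)
        return tuples


# Hypothetical source: Web log tuples with two fields.
source = [
    {"url": "example.com/a", "status": 200},
    {"url": "example.com/b", "status": 404},
    {"url": "example.org/c", "status": 200},
    {"url": "example.com/d", "status": 200},
]

assembly = (
    Pipe()
    .each(lambda t: {**t, "host": t["url"].split("/")[0]})
    .keep(lambda t: t["status"] == 200)
    .group_by("host", lambda host, members: {"host": host, "hits": len(members)})
)

sink = assembly.run(source)
print(sink)
```

The developer names fields and chains operations; nothing in the assembly mentions keys, values, map, or reduce. In Cascading itself, that translation is the library's job.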



Some useful tools for MapReduce-style analytics programming

Open-source tools that work via MapReduce on Hadoop clusters are proliferating. Users and developers don't seem concerned that Google received a patent for MapReduce in January 2010. In fact, Google, IBM, and others have encouraged the development and use of open-source versions of these tools at various research universities.11 A few of the more prominent tools relevant to analytics, and used by developers we've interviewed, are listed in the sections that follow.

Clojure

Clojure creator Rich Hickey wanted to combine aspects of C or C#, LISP (for list processing, a language associated with artificial intelligence that's rich in mathematical functions), and Java. The letters C, L, and J led him to name the language, which is pronounced "closure." Clojure combines a LISP library with Java libraries. Clojure's mathematical and natural language processing (NLP) capabilities and the fact that it is JVM based make it useful for statistical analysis on Hadoop clusters. FlightCaster, a commercial-airline-delay-prediction service, uses Clojure on top of Cascading, on top of MapReduce and Hadoop, for "getting the right view into unstructured data from heterogeneous sources," says Bradford Cross, FlightCaster co-founder.

LISP has attributes that lend themselves to NLP, making Clojure especially useful in NLP applications. Mark Watson, an artificial intelligence consultant and author, says most LISP programming he's done is for NLP. He considers LISP to be four times as productive for programming as C++ and twice as productive as Java. His NLP code "uses a huge amount of memory-resident data," such as lists of proper nouns, text categories, common last names, and nationalities. With LISP, Watson says, he can load the data once and test multiple times. In C++, he would need to use a relational database and reload each time for a program test. Using LISP makes it possible to create and test small bits of code in an iterative fashion, a major reason for the productivity gains.

This iterative, LISP-like program-programmer interaction with Clojure leads to what Hickey calls "dynamic development." Any code entered in the console interface, he points out, is automatically compiled on the fly.

Thrift

Thrift, initially created at Facebook in 2007 and then released to open source, helps developers create services that communicate across languages, including C++, C#, Java, Perl, Python, PHP, Erlang, and Ruby. With Thrift, according to Facebook, users can "define all the necessary data structures and interfaces for a complex service in a single short file."

A more important aspect of Thrift, according to BackType's Marz, is its ability to create strongly typed data and flexible schemas. Countering the emphasis of the so-called NoSQL community on schema-less data, Marz asserts there are effective ways to lightly structure the data in Hadoop-style analysis.

Marz uses Thrift's serialization features, which turn objects into a sequence of bits that can be stored as files, to create schemas between types (for instance, differentiating between text strings and long, 64-bit integers) and schemas between relationships (for instance, linking Twitter accounts that share a common interest). Structuring the data in this way helps BackType avoid inconsistencies in the data or the need to manually filter for some attributes.

BackType can use required and optional fields to structure the Twitter messages it crawls and analyzes. The required fields can help enforce data type. The optional fields, meanwhile, allow changes to the schema as well as the use of old objects that were created using the old schema.
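The role of required and optional fields can be illustrated without Thrift itself. The Python sketch below is not Thrift's interface definition language or its generated code, and the `Tweet` record, its field names, and the `load_tweet` helper are hypothetical. It shows only the idea from the text: required fields enforce presence and type, while an optional field added later lets old records keep loading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tweet:
    # Required fields: must be present and of the declared type.
    tweet_id: int
    text: str
    # Optional field added in a later schema version; old objects omit it.
    language: Optional[str] = None


REQUIRED = {"tweet_id": int, "text": str}
OPTIONAL = {"language": str}


def load_tweet(record: dict) -> Tweet:
    """Validate a raw record against the schema and build a typed object."""
    for name, expected in REQUIRED.items():
        if name not in record:
            raise ValueError(f"missing required field: {name}")
        if not isinstance(record[name], expected):
            raise TypeError(f"{name} must be {expected.__name__}")
    for name, expected in OPTIONAL.items():
        if name in record and not isinstance(record[name], expected):
            raise TypeError(f"{name} must be {expected.__name__}")
    known = {k: v for k, v in record.items() if k in REQUIRED or k in OPTIONAL}
    return Tweet(**known)


# An object written under the old schema still loads under the new one.
old_object = load_tweet({"tweet_id": 1, "text": "hello"})

# An object written under the new schema carries the extra field.
new_object = load_tweet({"tweet_id": 2, "text": "bonjour", "language": "fr"})

print(old_object)
print(new_object)
```

A record that stores the identifier as a string, or that lacks its text, is rejected at load time instead of surfacing later as an inconsistency in an analysis job.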

"Getting the right view into unstructured data from heterogeneous sources can be quite tricky." Bradford Cross of FlightCaster



Marz's use of Thrift to model social graphs like the one in Figure 7 demonstrates the flexibility of the schema for Hadoop-style computing. Thrift essentially enables modularity in the social graph described in the schema. For example, to select a single age for each person, BackType can take into account all the raw age data. It can do this by a computation on the entire data set or a selective computation on only the people in the data set who have new data.

[Figure: a social graph linking three people (Bob, Alice, and Charlie), with gender and age attributes (male, 39; female, 25; male, 22) and nodes labeled "Apache Thrift" and "Language: C++".]

Figure 7: An example of a social graph modeled using Thrift schema
Source: Nathan Marz, 2010

BackType doesn't just work with raw data. It runs a series of jobs that constantly normalize and analyze new data coming in, and then other jobs that write the analyzed data to a scalable random-access database such as HBase or Cassandra.12

Open-source, non-relational data stores

Non-relational data stores have become much more numerous since the Apache Hadoop project began in 2007. Many are open source. Developers of these data stores have optimized each for a different kind of data. When contrasted with relational databases, these data stores lack many design features that can be essential for enterprise transactional data. However, they are often well tailored to specific, intended purposes, and they offer the added benefit of simplicity. Primary non-relational data store types include the following:

Multidimensional map store: Each record maps a row name, a column name, and a time stamp to a value. Map stores have their heritage in Google's Bigtable.

Key-value store: Each record consists of a key, or unique identifier, mapped to one or more values.

Graph store: Each record consists of elements that together form a graph. Graphs depict relationships. For example, social graphs describe relationships between people. Other graphs describe relationships between objects, between links, or both.

Document store: Each record consists of a document. Extensible Markup Language (XML) databases, for example, store XML documents.

Because of their simplicity, map and key-value stores can have scalability advantages over most types of relational databases. (HadoopDB, a hybrid approach developed at Yale University, is designed to overcome the scalability problems associated with relational databases.) Table 1 provides a few examples of the open-source, non-relational data stores that are available.

Map          Key-value              Document   Graph
HBase        Tokyo Cabinet/Tyrant   MongoDB    Resource Description Framework (RDF)
Hypertable   Project Voldemort      CouchDB    Neo4j
Cassandra    Redis                  Xindice    InfoGrid

Table 1: Example open-source, non-relational data stores
Source: PricewaterhouseCoopers, Daniel Abadi of Yale University, and organization Web sites, 2010
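Of the store types described in this section, the multidimensional map store has the least familiar record shape, so a small in-memory Python sketch may help. It is not the API of HBase, Hypertable, Cassandra, or Bigtable; the `MapStore` class, its methods, and the sample row and column names are hypothetical. It models only the mapping the text describes: row name, column name, and time stamp to a value.

```python
from typing import Optional


class MapStore:
    """Toy multidimensional map: (row, column, time stamp) -> value."""

    def __init__(self) -> None:
        # (row, column) -> {time stamp: value}
        self._cells: dict[tuple[str, str], dict[int, str]] = {}

    def put(self, row: str, column: str, timestamp: int, value: str) -> None:
        """Store one version of a cell; older versions are kept."""
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(
        self, row: str, column: str, timestamp: Optional[int] = None
    ) -> Optional[str]:
        """Return the newest version, or the newest at or before a time stamp."""
        versions = self._cells.get((row, column), {})
        if timestamp is None:
            eligible = list(versions)
        else:
            eligible = [ts for ts in versions if ts <= timestamp]
        if not eligible:
            return None
        return versions[max(eligible)]


store = MapStore()
store.put("com.example/index", "contents", 1, "<html>first crawl</html>")
store.put("com.example/index", "contents", 3, "<html>second crawl</html>")

latest = store.get("com.example/index", "contents")
as_of_2 = store.get("com.example/index", "contents", timestamp=2)
print(latest)
print(as_of_2)
```

A key-value store, by contrast, would reduce to a single dictionary from key to value, with no column or time stamp dimension.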



Other related technologies and vendors

A comprehensive review of the various tools created for the Hadoop ecosystem is beyond the scope of this article, but a few of the tools merit brief description here because they've been mentioned elsewhere in this issue:

Pig: A scripting language called Pig Latin, which is a primary feature of Apache Pig, allows more concise querying of data sets directly from the console than is possible using MapReduce, according to author Tom White.

Hive: Hive is designed as mainly an ETL [extract, transform, and load] system for use at Facebook, according to Chris Wensel.

Zookeeper: Zookeeper provides an interface for creating distributed applications, according to Apache.

Big Data covers many vendor niches, and some vendors' products take advantage of the Hadoop stack or add to its capabilities. (See the sidebar "Selected Big Data tool vendors.")

Cost-effective scalability: Horizontal scaling from a low-cost base implies a feasible long-term cost structure for more kinds of data. Scott Thompson, vice president of infrastructure at the Disney Technology Shared Services Group, says, "We established that Hadoop does horizontally scale. This is what's really exciting, because I'm an RDBMS guy, right? I've done that for years, and you don't get that kind of constant scalability no matter what you do."

Fault tolerance: Associated with scalability is the assumption that some nodes will fail. Hadoop and MapReduce are fault tolerant, another reason commodity hardware can be used.

Suitability for less-structured data: Perhaps most importantly, the methods that Google pioneered, and that Yahoo and others expanded, focus on what Cloudera's Awadallah calls "complex data." Although developers such as Marz understand the value of structuring data, most Hadoop/MapReduce developers don't have an RDBMS mentality. They have an NLP mentality, and they're focused on techniques optimized for large amounts of less-structured information, such as the vast amount of information on the Web.

The methods, cost advantages, and scalability of Hadoop-style cluster computing clear a path for enterprises to analyze the Big Data they didn't have the means to analyze before. This set of methods is separate from, yet complements, data warehousing. Understanding what Hadoop clusters do and how they do it is fundamental to deciding when and where enterprises should consider making use of them.

Conclusion

Interest in and adoption of Hadoop clusters are growing rapidly. Reasons for Hadoop's popularity include:

Open, dynamic development: The Hadoop/MapReduce environment offers cost-effective distributed computing to a community of open-source programmers who've grown up on Linux and Java, and scripting languages such as Perl and Python. Some are taking advantage of functional programming language dialects such as C