
Hadoop Spurs Big Data Revolution

Open source data processing platform has won over Web giants for its low cost, scalability, and flexibility. Now Hadoop will make its way into more enterprises.

By Doug Henschen InformationWeek November 09, 2011 03:50 PM

There's a revolution happening in the use of big data, and Apache Hadoop is at the center of it.

Excitement around Hadoop has been building since its release as an open source distributed data processing platform five years ago. But within the last 18 months, Hadoop has taken off, gaining customers, commercial support options, and dozens of integrations from database and data-integration software vendors. The top three commercial database suppliers--Oracle, IBM, and Microsoft--have adopted Hadoop.

IBM introduced its Hadoop-based InfoSphere BigInsights software in May, and last month Oracle and Microsoft separately revealed plans to release Hadoop-based distributions next year. Both companies plan to provide deployment assistance and enterprise-grade support, and Oracle has promised a prebuilt Oracle Big Data Appliance with Hadoop software already installed.

Will Hadoop turn out to be as significant as SQL, introduced more than 30 years ago? Hadoop is often tagged as a technology exclusively for unstructured data. By combining scalability, flexibility, and low cost, it has become the default choice for Web giants like AOL and ComScore that are dealing with large-scale clickstream analysis and ad targeting scenarios.

But Hadoop is headed for wider use. It's applicable for all types of data and destined to go beyond clickstream and sentiment analysis. For example, SunGard, a hosting and application service provider for small and midsize companies, plans to introduce a cloud-based managed service aimed at helping financial services companies experiment with Hadoop-based MapReduce processing. And software-as-a-service startup Tidemark recently introduced a cloud-based performance management application that will use MapReduce to bring mixed data sources into product and financial planning scenarios.

Hadoop Basics

Inspired in large part by a 2004 white paper in which Google described its use of MapReduce techniques, Hadoop is a Java-based software framework for distributed processing of data-intensive transformations and analyses. MapReduce breaks a big data problem into subproblems; distributes them onto tens, hundreds, and even thousands of processing nodes; and then combines the results into a smaller, easy-to-analyze data set.
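
To make that map-and-reduce flow concrete, here is a minimal sketch of a Hadoop MapReduce job in Java: the classic word count. It's not from the article; the class name and paths are illustrative, and it assumes the org.apache.hadoop.mapreduce API that was current around this time.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: each node reads its slice of the input and emits (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: the framework groups pairs by word; each reducer sums the counts,
    // producing the smaller, easy-to-analyze result set the article describes.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // Job.getInstance(conf, ...) in newer releases
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same pattern scales from a handful of nodes to thousands; the framework, not the programmer, handles splitting the input, shipping the code to the data, and regrouping the intermediate pairs.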

Hadoop includes several important subprojects and related Apache projects. The Hadoop Distributed File System (HDFS) gives the platform massive yet low-cost storage capacity. The Pig data-flow language is used to write parallel processing jobs. The HBase distributed, column-oriented database gives Hadoop a structured-data storage option for large tables. And the Hive distributed data warehouse supports data summarization and ad hoc querying.
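
As a rough, hypothetical illustration of how data typically lands in HDFS before Pig, Hive, or MapReduce jobs touch it, the sketch below copies a local log file into the cluster using the standard Java FileSystem API; the file paths are invented for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploadSketch {
    public static void main(String[] args) throws Exception {
        // Cluster settings (e.g. the default filesystem URI) are picked up from
        // the Hadoop configuration files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: a local log file staged into HDFS so that
        // Pig, Hive, or MapReduce jobs can process it.
        Path local = new Path("/var/log/clickstream/events-2011-11-09.log");
        Path remote = new Path("/data/raw/clickstream/events-2011-11-09.log");

        if (!fs.exists(remote.getParent())) {
            fs.mkdirs(remote.getParent());
        }
        fs.copyFromLocalFile(local, remote);
        System.out.println("Copied " + local + " to " + remote);
    }
}
```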

Hadoop gets its well-known scalability from its ability to distribute large-scale data processing jobs across thousands of compute nodes built on low-cost x86 servers. Its capacity is constantly increasing, thanks to Moore's Law and ever-rising memory and disk drive capacity. The latest supporting hardware deployments combine 16 compute cores, 128 GB of RAM, and as much as 12 TB or even 24 TB of hard disk capacity per node. The cost of each node is about $4,000, according to Cloudera, the leading provider of commercial support and enterprise management software for Hadoop deployments. At 12 TB to 24 TB of disk per node, that works out to a few hundred dollars per raw terabyte--a fraction of the $10,000 to $12,000 per terabyte for the most competitively priced relational database deployments.

This high-capacity and low-cost combination is compelling enough, but Hadoop's other appeal is its ability to handle mixed data types. It can manage structured data as well as highly variable data sources, such as sensor and server log files and Web clickstreams. It can also manage unstructured, text-centric data sources, such as feeds from Facebook and Twitter. ("Loosely structured" or "free form" are actually more accurate descriptions of this type of data, but "unstructured" is the description that has stuck.)

This ability to handle various types of data is so important it has spawned the broader NoSQL (not only SQL) movement. Platforms and products, such as Cassandra, CouchDB, MongoDB, and Oracle's new NoSQL database, address the need for data flexibility in transactional processing. Hadoop has garnered most of the attention for supporting data analysis.

Relational databases, such as IBM DB2, Oracle, Microsoft SQL Server, and MySQL, can't readily handle mixed and unstructured data because that data doesn't fit into the columns and rows of a predefined data model (see "Hadoop's Flexibility Wins Over Online Data Provider," below; http://www.informationweek.com/news/development/database/231902692).

R&D Roots At AOL

AOL has been using Hadoop for more than three years, first in its R&D unit, to make sense of the navigation patterns of the more than 180 million unique site visitors per month across AOL.com, MapQuest, the Huffington Post, and dozens of other sites it owns.

AOL starts by gathering as much information as possible about visitors' activities. That's where Hadoop's low-cost and scalability come in. "When you do the math, the cost per node of commodity systems versus commercial systems makes the choice very obvious," says Bao Nguyen, AOL's technical director of R&D for large-scale analytics. "The cost per node is orders of magnitude higher for the commercial systems."

AOL's R&D unit has a 300-node Hadoop deployment of mixed vintage and capacity in Mountain View, Calif. That system can store more than 500 TB of clickstream data on billions of events per day. An event can be someone clicking on an email promotion or banner ad, doing a search, reading an article, visiting a site, or clicking on a particular product on an e-commerce page. Events can also include time stamps added to the history and profile of a particular visitor (known by a particular cookie ID number but not by personally identifiable information).

This clickstream data is highly structured, but it's so massive and varied that it would be next to impossible to handle all the extract, transform, and load work that would be required to move it into a conventional relational database. AOL uses Hadoop's MapReduce processes to filter and correlate data, distributing text extraction, correlation, and calculation steps across hundreds of compute nodes.

With MapReduce job after MapReduce job, AOL refines massive amounts of raw data into thousands of categories, such as automobiles, news, finance, and sports. Next, it identifies features and attributes of the visitors to each category, determining whether they're car buyers, mortgage prospects, male heads of household, or teenagers, for example.
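
AOL hasn't published its pipeline code, but as a hypothetical sketch of the kind of mapper such a refinement job might start with, the fragment below tags raw clickstream events with a content category keyed by cookie ID; the record layout, category lookup, and field positions are all invented for illustration.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical map step: raw events arrive as tab-separated lines such as
//   <timestamp>\t<cookieId>\t<url>\t<eventType>
// and are emitted as (cookieId, category) pairs. A reducer (not shown) would
// aggregate categories per visitor to build the feature sets the article
// describes feeding into downstream ad-targeting applications.
public class CategoryTagMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> categoryBySite = new HashMap<String, String>();
    private final Text cookieId = new Text();
    private final Text category = new Text();

    @Override
    protected void setup(Context context) {
        // Invented lookup table; a real job might load this from a file
        // shipped with the job via the distributed cache.
        categoryBySite.put("autos.aol.com", "automobiles");
        categoryBySite.put("money.aol.com", "finance");
        categoryBySite.put("sports.aol.com", "sports");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 4) {
            return; // skip malformed events rather than failing the whole job
        }
        String host = hostOf(fields[2]);
        String cat = categoryBySite.get(host);
        if (cat != null) {
            cookieId.set(fields[1]);
            category.set(cat);
            context.write(cookieId, category);
        }
    }

    private static String hostOf(String url) {
        // Crude host extraction, enough for the sketch.
        String s = url.replaceFirst("^https?://", "");
        int slash = s.indexOf('/');
        return slash >= 0 ? s.substring(0, slash) : s;
    }
}
```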

It feeds the final refined feature sets into more proprietary analytic applications (many built out on conventional relational platforms) that get down to the business priorities of delivering the right ad banners and email campaigns to the right people at the right time.

When online behavior shows that a visitor is interested in cars, Hadoop helps AOL figure that out and deliver a relevant ad. Hadoop is a batch-oriented platform, so it might take a day or two for such indicators to emerge. But profiles have a way of building over time and providing rich, multi-attribute targeting possibilities.

The success of the R&D Hadoop deployment led AOL to deploy an even larger, 700-node production system in April at its Dulles, Va., headquarters. The R&D unit now does more exploratory and ad hoc analyses, while the petabyte-scale production deployment does proven analyses, such as routine customer segmentation and online behavioral analysis. For example, an ad-targeting model running on the production deployment correlates data on the online and offline buying behavior of customers of large retailers that have both physical and online stores. AOL uses this anonymized data to build customer profiles and predictive models that let it aim online advertising at its 180 million unique online visitors per month.

Analyzing The Internet

Another company rolling out a large-scale Hadoop deployment is digital media measurement company ComScore. It's planning to use Hadoop as its main platform for raw data analysis, replacing a homegrown, grid-based system built on commodity hardware that it has used since 2004. The grid preprocesses raw data, boiling down hundreds of terabytes of Web clickstream data into orderly data sets that can be loaded onto ComScore's 150-TB Sybase IQ data warehouse, a column-oriented relational database well suited to analytics.

Sybase IQ lets ComScore measure the traffic of the world's leading websites and do marketing segmentation based on the surfing habits of its panel of more than 2 million Web users. (ComScore's panel is a Web version of the Nielsen households used to track TV viewing.)

ComScore's Hadoop platform is expected to scale better than its grid system, while providing higher utilization rates and reducing operations costs, says CTO Michael Brown. It will also free the company's developers to work on business problems rather than having to maintain and scale a proprietary stack, Brown says.

ComScore first put Hadoop to work for Social Essentials, a service it introduced in June that processes the 5 TB of panelist data the company collects each day to determine the extent to which top social networks, social network brand pages, and influential people on social networks boost visits to and purchases from specific websites.

ComScore's panelists visit more than 140 million social network pages a day. "The Facebook API gives you basic statistics, but marketers have a huge need to know the impact of influencers, the Facebook news feed, the Facebook wall, and branded pages," Brown says.

Using algorithms running on top of Hadoop, ComScore determines which friends, influencers, and pages panelists visited on a given social network. ComScore also has profile information on its panelists and their Web activities, and it uses that information to develop broader insights about social network usage.

Social Essentials is geared to help marketers understand the effectiveness of their social networking activities. If you're Southwest Airlines, for example, the service can tell you that 3% of Web users are likely to visit your site, whereas 12% of those who are fans of the airline's Facebook page are likely to visit and 8% of friends of Facebook fans are likely to visit, Brown says.

What's Ahead?

Companies already using Hadoop invariably have bigger plans. AOL is moving critical applications to its 700-node production environment, which is described as a highly reliable and controlled deployment, providing data down to granular levels of detail. The 300-node R&D environment is where many of the company's most advanced Ph.D. analytics experts work on cutting-edge projects. Cloudera provides the enterprise support for both deployments, helping AOL with bug fixes, software upgrades, and service problems.

At ComScore, it will be several months before Hadoop can scale up and replace its data processing grid, Brown says. That move was delayed in part because ComScore switched from Cloudera's Hadoop distribution to MapR's, which ComScore licensed through EMC Greenplum. MapR's version of Hadoop will let ComScore switch from HDFS to the more mature and widely used Network File System. NFS will enable the company to easily move data back and forth among Hadoop, Sybase IQ, and other data sources and systems, something it couldn't do with HDFS, Brown says.

EMC and partner MapR introduced new Hadoop software and support options this spring, as did IBM with its BigInsights offering. IBM partner Karmasphere, which provides Hadoop development and analytics tools, recently introduced a virtual appliance for BigInsights, designed to speed development of MapReduce jobs and related analytics projects. Microsoft has promised a Windows Server-friendly distribution of Hadoop supported by Yahoo spin-off Hortonworks, another enterprise-focused Hadoop tools and support provider. It's a safe bet that Oracle, too, will find ways to differentiate its Hadoop offering beyond the promised delivery of the Oracle Big Data Appliance.

Only the largest vendors have had the chutzpah to announce their own Hadoop software distributions and support plans. But dozens of others have added integrations and support tools, so they can move data into and out of Hadoop and analyze data sets after they're boiled down by MapReduce processing. That list includes data warehouse vendors Hewlett-Packard, ParAccel, and Teradata; data integration vendors Informatica, Pervasive, Talend, and Syncsort; and business intelligence and analytics vendors Jaspersoft, Pentaho, and SAS.

The latest wave of Hadoop announcements is coming from application developers and service providers. Amazon has offered a Hadoop-based service on its Elastic Compute Cloud since 2009. IBM launched a BigInsights service on its SmartCloud Enterprise platform in October. And Microsoft is promising a beta Hadoop-based service on the SQL Azure cloud platform by year's end.

SunGard plans to launch a Hadoop-based managed service that will let customers run MapReduce jobs. No word on when, but CTO Indu Kodukula says the company will run MapR software on EMC Greenplum's modular appliance. It will aim the service at customers that expect to operate 100 TB or more of data but aren't ready to commit to building out their own infrastructure to support Hadoop.

"Most of the requests that we've received to support Hadoop come from large financial customers that have an enormous amount of data and interest in blending in external sources, but they don't entirely know whether the results are going to be meaningful," Kodukula says. Rather than spending first and risking failure, they'd rather experiment with a managed service, he says.

On the apps front, Tidemark introduced an innovative cloud-based performance management application in October built on an "elastic computation grid based on in-memory technology coupled with Hadoop MapReduce processing." That's a mouthful, but it's simpler than it sounds. The in-memory technology is used for the fast analyses you expect in a performance management app (think Cognos TM1, QlikTech, SAP Hana, and Tibco Spotfire-style financial analyses delivered via the cloud). The Hadoop MapReduce part speeds answers to big data problems and blends mixed data types that might not conform to a fixed schema.

Tidemark customer U.S. Sugar, for example, is mixing weather data with the information it gets from growers related to seeds, chemical treatments, and acres planted to better understand and predict crop production. And Acosta, a marketing services firm that works with consumer products companies, is analyzing consumer sentiments expressed in social media to do a better job of stocking products in support of marketing campaigns.

All this support for Hadoop will naturally encourage broader experimentation and is likely to boost adoption. According to a recent InformationWeek survey of 431 business technology professionals involved with information management tools, only about 3% have made extensive use of Hadoop or other NoSQL platforms, while 11% have made limited use of them. With all the hype around Hadoop, those figures should begin to rise.

It may be that we're at the apex of Gartner's hype cycle, so beware the trough of disillusionment in the months ahead. For one thing, expect a cacophony of confusing commercial messages. Customer success stories and emerging applications will be the best way to gauge Hadoop's progress.

Once Hadoop is proven and mission critical, as it is at AOL, its use will be as routine and accepted as SQL and relational databases are today. It's the right tool for the job when scalability, flexibility, and affordability really matter. That's what all the Hadoopla is about.

Published in the print edition, November 14, 2011 with the title “Why all the Hadoopla?”

From www.informationweek.com/development/database/hadoop-spurs-big-data-revolution/231902466?queryText=hadoopla, 14 August 2012

Hadoop's Flexibility Wins Over Online Data Provider

Rapleaf replaced a relational database workflow with Hadoop and now can make quick database changes.

By Doug Henschen InformationWeek November 09, 2011 03:50 PM

Scalability, flexibility, and low cost are the virtues we hear praised repeatedly by Hadoop adopters, and Rapleaf is no exception. The company, which provides businesses with data about their online customers, chose Hadoop nearly four years ago to replace a MySQL-based relational database processing workflow, and it's finding advantages in being able to quickly add new data types to meet changing business needs.

Rapleaf provides data that companies can add to their own customer data in order to do better online personalization and targeting. Like many data providers, Rapleaf trades in demographic information, such as age, income, gender, education, and marital status, as well as psychographic information, such as hobbies, activities, and interests. It partners with email service providers, such as Constant Contact and Exact Target, to help companies doing digital marketing campaigns.

Rapleaf gets its data from many sources, but the total amount it processes is modest--tens of terabytes compared with the hundreds of terabytes and petabytes that some Hadoop users process. In addition, where many Hadoop users churn through ever-changing data, such as clickstreams that constantly reveal people's latest online activities, Rapleaf's deployment reprocesses a fairly stable core of information, as the universe of households and Internet users doesn't change dramatically.

What does change are the attributes Rapleaf must provide from its stockpile of data, as marketers seek new ways to target consumers. That's where Hadoop's ability to tap new data sources and mix data types comes in. If Rapleaf used a processing system based on a traditional relational database, it would have to use a predefined schema or data model. A database about people, for instance, would require a table with specific columns for attributes such as age, gender, and income level. If Rapleaf wanted to add new data containing attributes that weren't originally included, such as a Twitter handle or Facebook name, IT would face the time-consuming task of adding new columns to the table. The larger the table, the bigger the problem.

"Just adding one column to a large table within a relational database can easily take hours, days, or worse, and that's totally unacceptable," says Jeremy Lizt, Rapleaf's VP of engineering.

Using Hadoop, Rapleaf doesn't need to create a new column; it simply tweaks what it calls its "people profile," and new attributes can be extracted in the next round of data processing, adding new sources as necessary to derive the additional information required. Thanks to the platform's scalability and use of highly distributed MapReduce processing, that processing can happen within minutes.
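
Rapleaf's own "people profile" format isn't public, but as a hedged illustration of why adding an attribute is cheap when records are stored as loosely structured key-value pairs rather than fixed columns, the sketch below parses such a record in Java; the field names and record layout are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical flat-file record: semicolon-separated key=value pairs, e.g.
//   id=abc123;age=34;gender=f;income=75k;twitter_handle=@example
// Adding a new attribute such as twitter_handle only means some records start
// carrying the extra pair; no table schema has to be altered, and the next
// processing run simply picks it up.
public class PeopleProfileRecord {

    private final Map<String, String> attributes = new HashMap<String, String>();

    public PeopleProfileRecord(String line) {
        for (String pair : line.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                attributes.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
    }

    // Returns the attribute value, or a default if this record predates the attribute.
    public String get(String name, String defaultValue) {
        String value = attributes.get(name);
        return value != null ? value : defaultValue;
    }

    public static void main(String[] args) {
        PeopleProfileRecord older = new PeopleProfileRecord("id=abc123;age=34;gender=f");
        PeopleProfileRecord newer =
                new PeopleProfileRecord("id=xyz789;age=29;gender=m;twitter_handle=@example");

        // Older records simply fall back to the default; nothing had to be migrated.
        System.out.println(older.get("twitter_handle", "unknown")); // prints: unknown
        System.out.println(newer.get("twitter_handle", "unknown")); // prints: @example
    }
}
```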

In the early days of its deployment, Hadoop was still a young platform, and Rapleaf had few tools or industry best practices to draw from. Hadoop has since matured, and Lizt says enterprise support provider Cloudera helped Rapleaf deal with bug fixes.

"Hadoop is a lot more stable today than it was when we started, and it's obviously going to continue to evolve because it just makes so much sense for anybody who needs to do large-scale data processing," Lizt says.

From www.informationweek.com/development/database/hadoop-spurs-big-data-revolution/231902466?queryText=hadoopla, 14 August 2012