
Dell | Hadoop White Paper Series

Hadoop Enterprise Readiness

By Aurelian Dumitru


Table of Contents

Introduction
Audience
The value of big data analytics in the enterprise
Case study: Using big data analytics to optimize/automate IT operations
Big data analytics challenges in the enterprise
The adoption of Hadoop technology
Hadoop technical strengths and weaknesses
Dell | Hadoop solutions
Dell | Hadoop for the enterprise
About the author
Special thanks
About Dell Next Generation Computing Solutions
Hadoop ecosystem component “decoder ring”
Bibliography
To learn more

This white paper is for informational purposes only, and may contain typographical errors and technical inaccuracies. The content is provided as is, without express

or implied warranties of any kind.

© 2011 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden.

For more information, contact Dell.


Introduction

This white paper describes the benefits and challenges of leveraging big data analytics in an enterprise environment.

The white paper begins with a holistic view of business process phases and highlights ways in which analytics may stimulate

better business operational efficiency, drive higher returns from existing or new investments, and also help business leaders

make rapid adjustments to the business strategy in response to varying market trends and/or customer demands.

The white paper continues with a case study of how big data analytics helps information technology (IT) departments run

information systems more efficiently and with little or no downtime.

Lastly, the paper introduces the Dell | Hadoop solutions and presents several best practices for deploying and using Hadoop in

the enterprise.

Audience

Dell intends this white paper for anyone in the business or IT community who wants to learn about the advantages and

challenges of implementing and using big data analytics solutions (like Hadoop) in a production environment. Readers should

be familiar with general concepts of business process design and implementation and also with the correlation between

business processes and IT practices.

The value of big data analytics in the enterprise

Business processes define the way business activities are performed, the expected set of inputs, and the desired outcomes.

Business processes often integrate business units, workgroups, infrastructures, business partners, etc. to achieve key

performance goals (e.g., strategic, operational, and functional goals). Business process adjustments and improvements are expected as

the company attempts to improve its operations or to create a competitive advantage. Business process maturity and

execution excellence are the core competencies of any modern company. Switching from last decade’s product-centric

business model to today’s customer-driven model requires reengineering of the business processes (e.g., just-in-time business

intelligence) along with deeper collaboration among departments.[2]

Enterprise business processes relate to cross-functional management of work activities across the boundaries of the various

departments (or business units) within a large enterprise. Controlling the sequence of work activities (and the corresponding

information flow) while meeting customer needs is fundamental to the successful implementation and execution of the business process. Because of this intrinsic complexity, enterprises are taking a process-centric approach to designing, planning, monitoring, and automating business operations. One such approach stands out: Business Process Management (BPM).

“BPM is a holistic management approach focused on aligning all aspects of an organization with the wants and needs of

clients. It promotes business effectiveness and efficiency while striving for innovation, flexibility, and integration with

technology. BPM attempts to improve processes continuously.”[3]


The main BPM phases (Figure 1) and their respective owners are:

1. Vision—Functional leads in an organization create the strategic goals for the organization. The vision can be based

on market research, internal strategy objectives, etc.

2. Design & Simulate—Design leads in the organization work to identify existing processes or to design “to-be”

processes. The result is a theoretical model that is tested against combinations of variables. Special consideration is

given to “what if” scenarios. The aim of this step is to ensure that the design delivers the key performance goals

established in the Vision phase.

3. Implement—The theoretical design is adopted within the organization. A high degree of automation and integration are two key ingredients for successful implementation. Other key elements may be personnel training, user documentation, streamlined support, etc.

4. Execute—The process is fully adopted within the organization. Special measures and procedures are put in place to enable the organization to investigate and monitor the execution of the process and test it against established performance objectives. An example of such measures and procedures is what Gartner defines as Business Activity Monitoring (BAM) [4].

5. Monitor & Optimize—The process is monitored against performance objectives. Actual performance statistics are gathered and analyzed. An example of such a statistic is the measure of how quickly an online order is processed and sent for shipping. In addition, these statistics can be used to work with other organizations to improve their connected processes. The highest possible degree of automation can help tremendously: automation can cut costs, save time, add value, and eventually lead to competitive advantage. Process Mining [5] is a collection of tools and methods related to process monitoring.

How can analytics help the business?

In 2005 Gartner released a thought-provoking study about combining business intelligence with a business process platform.

The combination results in what Gartner calls an environment in which processes are self-configuring and driven by clients or transactions.

The real challenge with such an endeavor is mapping business rules to intelligent processes that, by definition, need to be

self-configurable and transaction-driven.

Figure 1: Business Process Management (BPM) Phases


Recent advancements in high-volume data management technologies and data analysis algorithms make the mapping from

business rules to intelligent processes plausible. First, analytics enable flow automation and monitoring. Second, removal of

manual steps helps improve process reliability and efficiency. Third, analytics can become one of the driving factors for

continuous optimizations of business processes in the enterprise.

In conclusion, analytics can be the foundation for the environment that Gartner had envisioned (Figure 2).

Embedding analytics into the process lifecycle has tremendous benefits.

For example, during the Vision phase, functional leads need to understand market trends, customer behavior, internal business challenges, etc. Being able to comb through treasure troves of data quickly and pick the right signals impacts the long-term profitability of the business.

Reliance on analytics during the Design & Simulate phase helps the designers rule out suboptimal designs.

During the Execute and Monitor & Optimize phases, analytics can provide automation, ongoing performance evaluation, and decision-making.

Why can analytics be the foundation of business processes?

Although analytics use cases vary across the BPM phases, they all seem to answer the same basic questions: What happened? Why did it happen? Will it happen again? This convergence should be expected. In biology, convergent evolution is a powerful explanatory paradigm. [1] “Convergent evolution describes the acquisition of the same biological trait in unrelated lineages. The wing is a classic example. Although their last common ancestor did not have wings, birds and bats do.” [7] A similar phenomenon is occurring in the business analytics world: although different questions demand different answers, the algorithms that generate the answers are fairly similar.

The different use cases are converging into three categories of analytics [6] (Figure 3):

1. Reporting Analytics processes historical data for the purpose of reporting statistical information and interpreting the insights identified by analyzing the data.

2. Forecast Analytics begins with a summary constructed from historical data and defines scenarios for better outcomes (“Model & Simulate”).

3. Predictive Analytics encompasses the previous two categories and adds strategic decision-making.

Reporting Analytics helps analysts characterize the performance of a process instance by aggregating historical data and

presenting it in a human-readable form (e.g., spreadsheets, dashboards, etc.). Business analysts use Reporting Analytics

to compare measured performance against objectives. They use Reporting Analytics only to understand the process. The

intelligence gathered from Reporting Analytics cannot be used to influence process optimizations or to adjust the overall

strategy. Process tuning and strategy adjustments are the domain of the next two categories of analytics.

Figure 2: BPM + Analytics Environment


Figure 3: Business Analytics Categories

Forecast Analytics uses data mining algorithms to process historical data (“Report“ in Figure 3) and derive insights of statistical

significance. These insights are then used to define sets of rules (or mathematical models) called “forecast models.” These models are iterated (“Model & Simulate” in Figure 3) until the model with the best outcome wins. Forecast Analytics helps analysts optimize the process within prescribed boundaries. Practitioners can tune the process, for example by adopting automation, which is fundamentally the first step toward intelligent processes. Forecast Analytics’ primary role is to influence the optimizations needed to tune a process; however, it doesn’t provide the analyst with the insights needed to make strategy adjustments.

Predictive Analytics offers the greatest opportunity to influence the strategy from which business objectives will be born.

Predictive Analytics begins with historical facts, incorporates data mining, and fast-tracks the definition and validation of forecast models. Predictive Analytics looks at the strategy and its derived processes holistically (“Predict” in Figure 3).

Let’s look at an example. We’ll consider the case of a home improvement company. Historical data indicates that ant killer

sells very well across the southern U.S. during the summer months. Historical data also indicates that shelf inventory sells very slowly

and at deep discounts after Labor Day. This year the company wants to make sure there is no shelf inventory come Labor Day.

Also the ant killer manufacturer has announced a new product that combines the ant killer with a lawn fertilizer. How can

analytics help?

First, the company needs to start with Reporting Analytics to understand factors like volume of sales per month, geo-distribution across the region, sales volume for each sales representative, discounts after Labor Day, etc. Second, the company

needs to consider Forecast Analytics to simulate various sales scenarios and choose the one that meets the strategic

criterion—no inventory left come Labor Day. The results may include: accelerate sales in July and August using coupons, hire

more sales representatives to “push” the inventory quicker, etc. Third, the company needs to use Predictive Analytics to

identify the best strategy for selling the new product. Contributing factors to the new strategy may include not only the ant killer

sales figures but also information like excessive drought zones (in these areas homeowners need both bug killers and

fertilizers to keep their lawns bug free and beautiful during summer months), single-home ownership rates, demographic

characteristics, social networks, etc.
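To ground the Forecast Analytics step of this example, the Java sketch below iterates over a few candidate sales scenarios and keeps the one that best satisfies the strategic criterion of no shelf inventory left by Labor Day. It is illustrative only: the scenario names, inventory figures, and sales lifts are invented and are not drawn from any real data.

    import java.util.Arrays;
    import java.util.List;

    /** Illustrative only: pick the sales scenario that best meets the
     *  "no shelf inventory come Labor Day" criterion. All numbers are invented. */
    public class ScenarioSimulator {

        static class Scenario {
            final String name;
            final double weeklySalesLift;   // e.g., 0.35 = +35% weekly sales
            final double cost;              // cost of running the scenario
            Scenario(String name, double weeklySalesLift, double cost) {
                this.name = name;
                this.weeklySalesLift = weeklySalesLift;
                this.cost = cost;
            }
        }

        static final int STARTING_INVENTORY    = 10_000; // hypothetical units on the shelf
        static final int BASELINE_WEEKLY_SALES = 700;    // hypothetical baseline
        static final int WEEKS_UNTIL_LABOR_DAY = 10;

        /** Units left on the shelf come Labor Day under a given scenario. */
        static int leftoverInventory(Scenario s) {
            double weeklySales = BASELINE_WEEKLY_SALES * (1.0 + s.weeklySalesLift);
            int sold = (int) Math.round(weeklySales * WEEKS_UNTIL_LABOR_DAY);
            return Math.max(0, STARTING_INVENTORY - sold);
        }

        public static void main(String[] args) {
            List<Scenario> candidates = Arrays.asList(
                    new Scenario("status quo",         0.00,      0.0),
                    new Scenario("coupons in Jul/Aug", 0.35, 15_000.0),
                    new Scenario("extra sales reps",   0.50, 40_000.0));

            // "Model & Simulate": iterate the candidate models and keep the one
            // with the best outcome (least leftover inventory, then lowest cost).
            Scenario best = candidates.get(0);
            for (Scenario s : candidates) {
                int left = leftoverInventory(s);
                int bestLeft = leftoverInventory(best);
                if (left < bestLeft || (left == bestLeft && s.cost < best.cost)) {
                    best = s;
                }
            }
            System.out.printf("Chosen scenario: %s (leftover units: %d)%n",
                    best.name, leftoverInventory(best));
        }
    }

A real forecast model would, of course, be fitted to the historical sales data described above rather than hard-coded.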

To summarize, the three categories of analytics build on each other. They all attack the same problem, though they do it at

different levels and take a different view. It all starts with historical data, which is what Reporting Analytics is concerned with.

Next comes Forecast Analytics, which has the power to influence the outcome of the interaction with the customer. Forecast

Analytics shows us a glimpse of the future, though a very narrow one because it is based on limited insight. Predictive Analytics truly opens the window into the future and lets us decide whether we like what we see.


Great, I understand it now! What about these exponentially growing volumes of data? Will analytics scale?

An emerging trend that is beginning to disrupt traditional analytics is the ever-increasing amount of mostly unstructured data that organizations need to store and process. Tagged as big data, the term refers to the means by which an organization can create, manipulate, store, and manage extremely large data sets (think tens of petabytes of data). Difficulties include capture, storage, search, sharing, analysis, and visualization. This trend continues because of the benefits of working with larger and larger datasets, which allow analysts to gain insights never possible before. [8] [10]

Big data analytics require technologies like MPP (massively parallel processing) to process large quantities of data. Examples of

organizations with large quantities of data are the oil and gas companies that gather huge amounts of geophysical data.

Two chief concerns of big data analytics are linear scalability and efficient data processing.[9] Nobody wants to start down the

big data analytics path and realize that in order to keep up with data growth the solution needs armies of administrators.

In short, leveraging big data analytics in the enterprise presents both benefits and challenges. From the holistic perspective,

big data analytics enable businesses to build processes that encompass a variety of value streams (customers, business

partners, internal operations, etc.). The technology offers a much broader set of data management and analysis capabilities

and data consumption models. For example, the ability to consolidate data at scale and increase visibility into it has been a

desire of the business community for years. Technologies like Hadoop finally make it possible. Businesses no longer need to

skimp on reporting and insights simply because the technology is not capable or is too expensive.

Case study: Using big data analytics to optimize/automate IT operations

Steve: “What was wrong with the server that crashed last week?”

Bill: “I don’t know. I rebooted it and it’s just fine. Perhaps the software crashed.”

Anyone who has been in IT operations has probably had the above dialog, often more than once. Today’s data centers generate

immense quantities of data, and the answer to the above question lies in IT’s ability to mine the data and uncover the chain of

events.

IT operations are a crucial aspect of most organizations. Companies rely on their information systems to run their business. IT must therefore maintain high standards for assuring business continuity in spite of hardware or software glitches, network connectivity disruptions, unreliable power systems, etc.

Effective IT operations require a balanced investment in both system data

gathering and data analysis. Most IT operations nowadays gather up-to-the-minute (or up-to-the-second, in some cases) logs from the servers, storage devices, network components, applications running on this infrastructure (e.g., the Linux system log), and even the power and cooling components.

The data lifecycle (Figure 4) begins with the data being generated and collected. The vast majority of the collected data

consists of plain text files that have very little in common in the way the content is structured. Data can be stored in its original

format or it can be pre-processed and then stored. Pre-processing increases the value of the data by removing less significant

content. The data is then stored and made available for processing.

Processing of the data is mainly focused on two objectives:

Extract the value (also called insights) from the data through the use of statistical analysis

Make the results available for presentation in a format that readily communicates the value

Figure 4: Data Lifecycle
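The pre-processing step mentioned above is often little more than a filter over raw log lines. The following minimal Java sketch, offered only as an assumption-laden illustration, drops low-value entries before the data is stored for later analysis; the file names and severity keywords are hypothetical.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    /** Minimal, hypothetical pre-processing pass: keep only the log lines that
     *  are likely to matter for later analysis and drop low-value noise. */
    public class LogPreprocessor {

        /** Returns true for lines worth keeping (severity keywords are assumed). */
        static boolean isSignificant(String line) {
            return line.contains("ERROR") || line.contains("WARN")
                    || line.contains("FATAL");
        }

        public static void main(String[] args) throws IOException {
            String in  = args.length > 0 ? args[0] : "raw-syslog.txt";      // collected data
            String out = args.length > 1 ? args[1] : "filtered-syslog.txt"; // data to be stored

            try (BufferedReader reader = Files.newBufferedReader(Paths.get(in), StandardCharsets.UTF_8);
                 BufferedWriter writer = Files.newBufferedWriter(Paths.get(out), StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (isSignificant(line)) {      // remove less significant content
                        writer.write(line);
                        writer.newLine();
                    }
                }
            }
        }
    }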


The last phase of the data lifecycle is the presentation of the insights uncovered along the way. At this phase, the data reaches its maximum potential and has the biggest impact on the decisions derived from analysis. Across the broad spectrum of options, presentation may range from a graphic display of the results (e.g., a pie chart) to simply bundling the results and shipping them off to an application for further examination.

Big data analytics can help optimize/automate IT operations in several ways:

Improve the quality of the control processes by embedding big data analytics in the control path

Keep the system operating within set boundaries by being able to predict the future operational state of the system

Minimize system downtime by avoiding predictable failures

Figure 5 illustrates an example of embedding analytics in the control loop of the data center management system.

As explained above, system components (hardware or software) generate metering data that is readily available on a system-

wide data bus. The analytics engine grabs the metering data from the data bus, processes it, and examines the results against

historical data (i.e. data that was gathered in a previous iteration). Next, the analytics engine computes the deviation and

compares it with the standard deviation defined in profiles. The analytics engine forwards the comparison results to the

intelligent controller, which, after evaluating the particular condition against pre-defined policies, issues control commands

back to the system.

Figure 5: Embedding Analytics in Automated System Control
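As a rough illustration of the control loop just described, the hypothetical Java sketch below compares a current metering sample against the expected value recorded in a profile, computes the deviation, and emits a control decision. The metric name, threshold, and the issueControlCommand hook are assumptions made for illustration; they do not describe the actual Dell design.

    /** Hypothetical sketch of the analytics step in an automated control loop:
     *  compare a metering sample against its profile and decide whether to act. */
    public class ControlLoopAnalytics {

        /** Profile: expected mean and allowed deviation for one metric (assumed values). */
        static class Profile {
            final String metric;
            final double expectedMean;
            final double allowedStdDev;
            Profile(String metric, double expectedMean, double allowedStdDev) {
                this.metric = metric;
                this.expectedMean = expectedMean;
                this.allowedStdDev = allowedStdDev;
            }
        }

        /** Deviation of the current sample from the historical mean, expressed in
         *  units of the allowed standard deviation defined in the profile. */
        static double deviation(double sample, Profile p) {
            return Math.abs(sample - p.expectedMean) / p.allowedStdDev;
        }

        /** Placeholder for the intelligent controller's actuation path. */
        static void issueControlCommand(String command) {
            System.out.println("control command -> " + command);
        }

        public static void main(String[] args) {
            Profile inletTemp = new Profile("rack.inlet.temp.celsius", 24.0, 2.0);
            double currentSample = 29.5; // value read from the system-wide data bus (made up)

            double dev = deviation(currentSample, inletTemp);
            // Policy evaluation (simplified): act only when the deviation exceeds the profile.
            if (dev > 1.0) {
                issueControlCommand("raise fan speed on rack A; alert operations");
            } else {
                issueControlCommand("no action");
            }
        }
    }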

The control system described above allows IT managers to rethink the operational efficiency of the data center. By harnessing

the power of sophisticated analytics, the system’s response can be correlated in a timely manner with the control stimuli and

external factors over a broad spectrum of conditions and application workloads. IT managers can optimize the system for the

supply side (e.g., utilities), for the demand side (e.g., software applications, business processes, etc.), or for both. The long-term

payoffs should outweigh the cost of analytics.


Big data analytics challenges in the enterprise

The adoption of big data analytics in the enterprise can deliver huge benefits but also presents equally important challenges.

Some examples are in order:

An inability to share/correlate knowledge (data and algorithms) across organizational boundaries impacts the bottom line. As mentioned above, analytics are converging. Two or more business units may be working on a similar set of challenges. With no leveraged knowledge among them, each business unit will duplicate efforts only to discover similar solutions. Sharing the value of big data underpins substantial productivity gains and accelerates innovation.

Data is locked in many disparate data marts. This is not necessarily a new challenge; it has been seen since the early days of enterprise databases, when two or more departments could not agree on a common set of requirements and decided to go their own ways and build separate data stores. The advent of big data exacerbates the age-old dispute—the sheer volume of data requires even more data marts to store it. Big data mitigates this challenge by leveraging technologies that are built from the ground up to be scalable and schema-agnostic.

Traditional enterprise IT processes (e.g., user authentication and authorization) don’t scale with big data. Not being able to enforce and audit access controls against huge quantities of data leaves the enterprise open to unauthorized access and theft of intellectual property.

The adoption of Hadoop technology

Hadoop has become the most widely known big data technology implementation. The rise of Hadoop has proved unstoppable. There is a very vibrant community around Hadoop, and venture capitalists are pouring money into startups much like we saw back in 2000 with Internet companies. Most of these startups begin as academic research projects. Customer demand eventually brings them into the mainstream marketplace, where they start competing with more established providers.

On the receiving end of the market, businesses are picking up the pace at which Hadoop is deployed. They are beginning to realize that data management, processing, and consumption are emerging as key challenges.

The wide adoption of Hadoop is hindered by both socio-business and technical factors.

Examples of socio-business factors are:

Hiring

As with any high-end niche technology, the emergence of Hadoop requires bleeding-edge data analytics design, processing, and visualization skills. For example, the Hadoop MapReduce API is more complex than SQL, and managing Hadoop deployments is equally complex. These skill sets are in short supply, thus slowing down the adoption of the technology. Hiring will get easier as the tools and the underlying technology improve.

Confusion among vendors as well as buyers

The rapidly changing market landscape makes it difficult for technology innovators to forecast resource allocation and

maximize their returns on investments. Buyers are equally confused because they need more information about the

actual business value of the technology and about the costs and the characteristics of successful deployments.

Companies like Dell are taking a customer-centric approach. They work directly with customers and vendors to ease

the adoption of the technology by providing end-to-end Hadoop solutions and business value metrics, all wrapped in

strong services and consulting offerings.

The “checkbox” mentality and the genesis of a new form of vendor lock-in

Traditional enterprises demand that their IT organizations secure support contracts for all their software applications. The “checkbox” mentality is one in which support is provided so IT can mark off the appropriate checkbox. Yet businesses realize that true opportunities to improve the bottom line come from a deeper understanding of their internal processes; thus demand for big data is rapidly increasing. That leaves IT with only one option: choose one of many competing vendors. Because of fierce competition among vendors, the chosen vendor will try to lock in as much functionality as possible. The answer is a leveraged approach: use open source as much as possible and pay only for the support that is deemed absolutely necessary. Look for vendors that offer both open-source and commercial versions of the technology needed. A different, yet longer-term, answer is standardization (e.g., of the APIs, the data models, the algorithms, etc.).


Hadoop technical strengths and weaknesses

Hadoop has been designed from the ground up for seamless scalability and massively parallel compute and storage. Hadoop

has been optimized for high aggregate data throughput (as opposed to query latency). The real power of Hadoop lies in the number of compute nodes in the cluster rather than the compute and storage capacity of each individual node.

Hadoop’s strengths are:

It is highly scalable—Yahoo runs Hadoop on thousands of nodes

It integrates storage and compute—the data is processed right where it is stored

It supports a broad range of data formats (CSV, XML, XSL, GIF, JPEG, SAM, BAM, TXT, JSON, etc.).

Data doesn’t have to be “normalized” before it is stored in Hadoop.

Examples of Hadoop’s weaknesses are:

Security—Hadoop has a fairly incoherent security design. Data access controls are implemented at the lowest level of

the stack (the file system on each compute node). Also there is no binding between data access and job access models.

Advanced IT operations and developer skills are required.

Lack of enterprise hardening—the NameNode is a single point of failure.

Dell | Hadoop solutions

The Dell | Hadoop solutions lower the barrier to adoption for businesses looking to use Hadoop in production. Dell’s

customer-centered approach is to create rapidly deployable and highly optimized end-to-end Hadoop solutions running on

commodity hardware. Dell provides all the hardware and software components and resources to meet the customer’s requirements, so no other supplier needs to be involved.

The hardware platforms for the Dell | Hadoop solutions (Figure 6) are the Dell™ PowerEdge™ C Series and Dell™

PowerEdge™ R Series. Dell PowerEdge C Series servers are focused on hyperscale and cloud capabilities. Rather than

emphasizing gigahertz and gigabytes, these servers deliver maximum density, memory, and serviceability while minimizing

total cost of ownership. It’s all about getting the processing customers need in the least amount of space and in an energy-

efficient package that slashes operational costs. Dell PowerEdge R Series servers are widely popular with a variety of

customers for their ease of management, virtually tool-less serviceability, power and thermal efficiency, and customer-inspired

designs. Dell PowerEdge R Series servers are multi-purpose platforms designed to support multiple usage models/workloads

for customers who want to minimize differing hardware product types in their environments.

The operating system of choice for the Dell | Hadoop solutions is Linux (e.g., Red Hat Enterprise Linux, CentOS, etc.). The

recommended Java Virtual Machine (JVM) is the Oracle Sun JVM.

The hardware platforms, the operating system, and the Java Virtual Machine make up the foundation on which the Hadoop

software stack runs.

Figure 6: Dell | Hadoop Solution Taxonomy


The bottom layer of the Hadoop stack (Figure 6) comprises two frameworks:

1. The Data Storage Framework is the Hadoop Distributed File System (HDFS), a distributed, scalable, and portable filesystem that Hadoop uses to store data on the cluster nodes.

2. The Data Processing Framework (MapReduce) is a massively parallel compute framework inspired by Google’s MapReduce papers; a minimal example is sketched below.
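To make the Data Processing Framework concrete, here is a minimal MapReduce job written against the standard Apache Hadoop Java API. It counts log lines per severity keyword; the input and output paths and the assumed log format are placeholders, not part of the Dell solution.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    /** Minimal MapReduce sketch: count log lines per severity keyword.
     *  Paths and the log format are hypothetical. */
    public class SeverityCount {

        public static class SeverityMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text severity = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Assume lines look like: "2011-06-01 12:00:00 WARN disk latency high"
                String[] fields = value.toString().split("\\s+");
                if (fields.length > 2) {
                    severity.set(fields[2]);
                    context.write(severity, ONE);
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "severity count");
            job.setJarByClass(SeverityCount.class);
            job.setMapperClass(SeverityMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g., /logs/raw
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g., /logs/counts
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because the map tasks run on the nodes that hold the HDFS blocks, the data is processed right where it is stored, which is the data-locality property highlighted earlier.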

The next layer of the stack in the Dell | Hadoop solution design is the network layer. Dell recommends implementing the Hadoop

cluster on a dedicated network for two reasons:

1. Dell provides network design blueprints that have been tested and qualified.

2. Network performance predictability—sharing the network with other applications may have a detrimental impact on

the performance of the Hadoop jobs.

The next two frameworks—the Data Access Framework and the Data Orchestration Framework—comprise utilities that are

part of the Hadoop ecosystem.

Dell listened to its customers and designed a Hadoop solution that is fairly unique in the marketplace. Dell’s end-to-end

solution approach means that the customer can be in production with Hadoop in the shortest time possible. The Dell | Hadoop

solutions embody all the software functions and services needed to run Hadoop in a production environment. The customer

is not left wondering, “What else is missing?” One of Dell’s chief contributions to Hadoop is a method to rapidly deploy and

integrate Hadoop in production. Other major contributions include integrated backup, management, and security functions.

These complementary functions are designed and implemented side-by-side with the core Hadoop technology.

Installing and configuring Hadoop is non-trivial. There are different roles and configurations that need to be deployed on various

nodes. Designing, deploying, and optimizing the network layer to match Hadoop’s scalability requires a lot of thinking and

also consideration for the type of workloads that will be running on the Hadoop cluster. The deployment mechanism that Dell

designed for Hadoop automates the deployment of the cluster from “bare-metal” (no operating system installed) all the way

to installing and configuring the Hadoop software components to specific customer requirements. Intermediary steps include

system BIOS update and configuration, RAID/SAS configuration, operating system deployment, Hadoop software deployment,

Hadoop software configuration, and integration with the customer’s data center applications (e.g., monitoring and alerting).

Data backup and recovery is another topic that was brought up during customer roundtables. As Hadoop becomes the de

facto platform for business-critical applications, the data that is stored in Hadoop is crucial for ensuring business continuity.

Dell’s approach is to offer several enterprise-grade backup solutions and let the customer choose.

Customers also commented on the current security model of Hadoop. It is a real concern because as a larger number of

business users share access to exponentially increasing volumes of data, the security designs and practices need to evolve to

accommodate the scale and the risks involved. Regulatory frameworks such as HIPAA, Sarbanes-Oxley, SAS 70, and the PCI Security Standards may also apply to data stored in Hadoop. Particularly in industries like healthcare and financial services, access to the data has

to be enforced and monitored across the entire stack. Unfortunately, there is no clear answer on how the security

architecture of Hadoop is going to evolve. Dell’s approach is to educate the customer and also work directly with leading

vendors to deliver a model that suits the enterprise.

Lastly, Dell’s open, integrated approach to enterprise-wide systems management enables customers to build comprehensive

system management solutions based on open standards and integrated with industry-leading partners. Instead of building a

patchwork of solutions leading to systems management sprawl, Dell integrates the management of the Dell hardware running

the Hadoop cluster with the “traditional” Hadoop management consoles (Ganglia, Nagios).

To summarize, Dell is adding Hadoop to its data analytics solutions portfolio. Dell’s end-to-end solution approach means that

Dell will provide readily available software interfaces for integration between the solutions in the portfolio. Dell will provide the

ETL connector (Figure 6) that integrates Hadoop with the Dell | Aster Data solution.


Dell | Hadoop for the enterprise

In this section we introduce several best practices for deploying and running Hadoop in an enterprise environment:

Hardware selection

Integrating Hadoop with Enterprise Data Warehouse (data models, data governance, design optimization)

Data security

Backup and recovery

The focus of this paper is only an introduction to and high-level overview of these best practices. Our goal is to raise awareness among enterprise practitioners and help them create successful Hadoop-based designs. We leave the implementation details to an upcoming white paper, titled Hadoop Enterprise How-To, published in the same series.

The inherent challenge with recommendations for Hadoop in the enterprise is that there is not a lot of published research to

draw on. However, Dell has a very strong practice in defining and implementing best practices for its enterprise customers.

Thus, we had to take a different approach. Namely, we began with a gap analysis of Hadoop and drew on our enterprise

practice to derive recommendations that are likely to have the most profound impact on building Hadoop solutions for the

enterprise.

As mentioned above, we intentionally left the details for additional white papers because we did not want to run the risk of

making this high-level outline overly complex and, thus, fail to meet the original goal, which was to raise awareness.

Let’s now look at what it takes to run Hadoop in the enterprise.

First off, we’ve been using clustering technologies like HPCC in the enterprise for years. How is Hadoop different from

HPCC?

The main difference between high-performance computing (HPC) and Hadoop is in the way the compute nodes in the

cluster access the data that they need to process. Traditional HPC architectures employ a shared-disk setup—all compute

nodes process data loaded in a shared network storage pool. Network latency and disk bandwidth become the critical factors

for HPC job performance. Therefore, low-latency network technologies (like InfiniBand) are commonly deployed in HPC.

Hadoop uses a shared-nothing architecture—data is distributed and copied locally on each compute node. Hadoop does not

need a low-latency network; therefore, using cheaper Ethernet networks for Hadoop clusters is the common practice for the

vast majority of Hadoop deployments. [11]

Got it! Let’s now look at the hardware. Is there anything I should be concerned with?

The quick answer is YES. First and foremost, standardization is key. Using the same server platform for all Hadoop nodes can save considerable money and allow for faster deployments. Other best practices for hardware selection include:

Use commodity hardware—Commodity hardware can be re-assigned between applications as needed. Specialized

hardware cannot be moved that easily.

Purchase full racks—Hadoop scales very well with the number of racks, so why not let Dell do the rack-n-stack and wheel

in the full rack?

Abstract the network and naming—Any IP addressing scheme, no matter how complex or laborious, can scale to only a

few hundred nodes. Using DNS and CNAMEs scales much better.

Okay, I got the racks in production. How do I exchange data between Hadoop and my data marts?

The answer varies depending on who is asking the question.

To an IT architect, this is a typical system integration challenge. That is, there are two systems (Hadoop and the data mart) that

need to be integrated with each other. For example, the IT architect would have to design the network connectivity between

the two systems. Figure 7 illustrates a possible network connectivity design.


Figure 7: Example of Network Connectivity between a Hadoop Cluster and a Data Mart

To a data analyst, this is a data pipeline design challenge (Figure 8). The analyst’s chief concerns are data formatting, availability of data

for processing and analysis, query performance, etc. The data analyst doesn’t need to know the topology of the network

connectivity between the Hadoop cluster and the particular data mart.

The difference between the two perspectives could hardly be greater.

The solution is a mix of IT best practices and database administration best practices. The details are covered in an upcoming

white paper, titled Integrating Hadoop and Data Warehouse, published in this same series of papers.

Figure 8: Example of Data Pipeline between Hadoop and Data Warehouse
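From the data analyst’s perspective, a common pattern is to compute aggregates in Hadoop and load only the much smaller results into the data mart. The sketch below uses plain JDBC to load a tab-separated MapReduce output file into a staging table; the connection URL, credentials, table name, and file layout are placeholders, and tools like Sqoop (listed in the decoder ring) can automate this step.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    /** Hypothetical sketch: load Hadoop-produced aggregates into a data mart
     *  staging table over JDBC. URL, credentials, and table are placeholders. */
    public class AggregateLoader {

        public static void main(String[] args) throws Exception {
            String resultsFile = args.length > 0 ? args[0] : "severity-counts.tsv"; // MapReduce output copied locally
            String jdbcUrl = "jdbc:postgresql://datamart.example.com:5432/analytics"; // placeholder

            try (Connection conn = DriverManager.getConnection(jdbcUrl, "etl_user", "secret"); // placeholder credentials
                 BufferedReader reader = new BufferedReader(new FileReader(resultsFile))) {

                conn.setAutoCommit(false);
                try (PreparedStatement insert = conn.prepareStatement(
                        "INSERT INTO staging_severity_counts (severity, event_count) VALUES (?, ?)")) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // MapReduce text output is tab-separated: key <TAB> value
                        String[] parts = line.split("\t");
                        if (parts.length != 2) {
                            continue; // skip malformed lines
                        }
                        insert.setString(1, parts[0]);
                        insert.setLong(2, Long.parseLong(parts[1]));
                        insert.addBatch();
                    }
                    insert.executeBatch();
                }
                conn.commit();
            }
        }
    }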


Great, I now have data in Hadoop! How should I secure access to it?

Out of all the technical challenges that Hadoop exhibits, the security model is likely to be the biggest obstacle for the

adoption of Hadoop in the enterprise. Hadoop relies on Linux user permissions for data access. These user permissions are

enforced only at the lowest level of the stack (the HDFS layer on each compute node) instead of being checked and enforced

at the metadata layer (the NameNode) or higher. Jobs use the same user ID to get access to data stored in Hadoop. A person skilled in the art can mount a man-in-the-middle or denial-of-service attack.

It should be noted that both Yahoo and Cloudera are making intense efforts to bring Hadoop’s security in line with enterprise requirements.

Meanwhile, the security best practices include:

Ensure strong perimeter security—for example, use strong authentication and encryption for all network access to the

Hadoop cluster.

Use Kerberos inside Hadoop for user authentication and authorization.

If purchasing support from Cloudera is an option, use Cloudera Enterprise to streamline the management of the

security functions across all the machines in the cluster.
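As a minimal sketch of the Kerberos item above, the snippet below shows how a client or scheduled job can authenticate against a Kerberos-secured cluster through Hadoop’s UserGroupInformation API before touching HDFS. The principal and keytab path are placeholders, and the sketch assumes the cluster itself has already been configured for Kerberos.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    /** Minimal sketch: authenticate against a Kerberos-secured Hadoop cluster
     *  before accessing HDFS. Principal and keytab path are placeholders. */
    public class KerberizedClient {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumes core-site.xml/hdfs-site.xml on the classpath already enable security:
            //   hadoop.security.authentication = kerberos
            conf.set("hadoop.security.authentication", "kerberos");

            UserGroupInformation.setConfiguration(conf);
            // Log in from a keytab so unattended jobs do not need an interactive kinit.
            UserGroupInformation.loginUserFromKeytab(
                    "etl-svc@EXAMPLE.COM",               // placeholder principal
                    "/etc/security/keytabs/etl.keytab"); // placeholder keytab path

            // Any subsequent HDFS access runs as the authenticated principal.
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Authenticated as: "
                    + UserGroupInformation.getCurrentUser().getUserName());
            System.out.println("/user exists: " + fs.exists(new Path("/user")));
        }
    }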

Great, I’ll pay close attention to security! Last question: how do I back up the data in Hadoop?

Again, it depends on who is asking.

The IT administrator would be concerned about backup policies, media management, etc.

The data analyst wants to make sure that the data has been saved entirely, which means that the backup solution needs to be

data-aware. Sometimes a dataset may be composed of more than one file. Any file in Hadoop is broken down into a number of blocks that are handed off to Hadoop nodes for storage. A file-aware (or, even worse, block-aware) backup solution will not maintain the dataset metadata (the associations between files), which will render the restored dataset completely useless.

The intersection between the two views is the vision for Hadoop data backup. The best practices include:

Decide where the data is backed up: NAS, SAN, cloud, or another Hadoop cluster. While using the cloud for backing up the data makes perfect sense, most enterprises tend to keep the data private within the corporate firewall. Saving the data to another Hadoop cluster also makes sense; however, the destination Hadoop cluster will need a backup solution of its own. Realistically, there are only two options for backup: NAS and SAN. If the backup requires only capacity and average performance is acceptable, then the answer is NAS. For best-in-class performance and uninterrupted access requirements, the answer is SAN.

Dedupe your data.

Prioritize your data—back up only the data that is deemed valuable.

Add dataset metadata awareness to the backup.

Establish backup policies for both metadata and actual data.
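To illustrate the dataset metadata awareness item above, the hypothetical sketch below copies every file of a dataset directory out of HDFS to a NAS mount and writes a small manifest alongside it so the association between the files survives the backup. The paths and manifest format are assumptions; production deployments would more likely rely on an enterprise backup product or on Hadoop’s own distcp utility.

    import java.io.BufferedWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Hypothetical sketch: copy a whole dataset directory from HDFS to a NAS
     *  mount and record a manifest so the files stay associated as one dataset. */
    public class DatasetBackup {

        public static void main(String[] args) throws Exception {
            String datasetDir = args.length > 0 ? args[0] : "/data/sales/2011-q2";    // placeholder
            String nasMount   = args.length > 1 ? args[1] : "/mnt/nas/hadoop-backup"; // placeholder

            Configuration conf = new Configuration();
            FileSystem hdfs = FileSystem.get(conf);

            java.nio.file.Path manifest = Paths.get(nasMount,
                    "manifest-" + new Path(datasetDir).getName() + ".txt");
            Files.createDirectories(manifest.getParent());

            try (BufferedWriter writer = Files.newBufferedWriter(manifest, StandardCharsets.UTF_8)) {
                for (FileStatus status : hdfs.listStatus(new Path(datasetDir))) {
                    if (!status.isFile()) {
                        continue; // keep the sketch simple: no recursion into subdirectories
                    }
                    Path src = status.getPath();
                    Path dst = new Path(nasMount, src.getName());
                    // copyToLocalFile works here because the NAS is mounted on this node.
                    hdfs.copyToLocalFile(false, src, dst);
                    // Manifest line: file name, size in bytes, HDFS modification time.
                    writer.write(src.getName() + "\t" + status.getLen()
                            + "\t" + status.getModificationTime());
                    writer.newLine();
                }
            }
            System.out.println("Backed up dataset " + datasetDir + " with manifest " + manifest);
        }
    }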

Great, thanks, that makes sense! What do I do if I have questions?

First, please don’t hesitate to contact the author—contact information is provided below. Second, Dell offers a broad variety of

consulting, support, and training services for Hadoop. Your Dell sales representative can put you in touch with the Dell

Services team.


About the author

Aurelian “A.D.” Dumitru is the Dell | Hadoop chief architect. In that role he is responsible for all architecture decisions and

long-term strategy for Hadoop. A.D. has over 20 years of experience. He has been with Dell for more than 11 years in various

engineering, architecture, and management positions. His background is in hyperscale massively parallel compute systems.

His interests are in automated process control, intelligent processes, and machine learning. Over the years he has authored or

made significant contributions to more than 20 patent applications, from RFID and automated process controls to software

security and mathematical algorithms. For similar topics, please check his personal blog, www.RationalIntelligence.com.

Special thanks

The author wishes to thank Nicholas Wakou, Howard Golden, Thomas Masson, Lee Zaretsky, Joey Jablonski, Scott Jensen,

John Igoe, and Matthew McCarthy for their helpful comments.

About Dell Next Generation Computing Solutions

When cloud computing is the core of your business and its efficiency and vitality underpin your success, the Dell Next

Generation Computing Solutions are Dell’s response to your unique needs. We understand your challenges—from compute

and power density to global scaling and environmental impact. Dell has the knowledge and expertise to tune your company’s

“factory” for maximum performance and efficiency.

Dell’s Next Generation Computing Solutions provide operational models backed by unique product solutions to meet the

needs of companies at all stages of their lifecycles. Solutions are designed to meet the needs of small startups while allowing

scalability as your company grows.

Deployment and support are tailored to your unique operational requirements. Dell’s Cloud Computing Solutions can help

you minimize the tangible operating costs that have a hyperscale impact on your business results.

Hadoop ecosystem component “decoder ring”

1. Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application

data

2. MapReduce: a software framework for distributed processing of large data sets on compute clusters

3. Avro: a data serialization system

4. Chukwa: a data collection system for managing large distributed systems

5. HBase: a scalable, distributed database that supports structured data storage for large tables

6. Hive: a data warehouse infrastructure that provides data summarization and ad-hoc querying

7. ZooKeeper: a high-performance coordination service for distributed applications

8. Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

9. Sqoop (from Cloudera): a tool designed to import data from relational databases into Hadoop. Sqoop uses JDBC to

connect to a database.

10. Flume (from Cloudera): a distributed service for collecting, aggregating and moving large amounts of log data. Its

architecture is based on streaming data flows.

(Source: http://hadoop.apache.org/)


Bibliography

[1] Donald F. Ferguson et al. Enterprise Business Process Management—Architecture, Technology and Standards. Lecture Notes in Computer Science 4102, 1-15, 2006.

[2] Andrew Spanyi, Business Process Management (BPM) is a Team Sport: Play it to Win! Meghan-Kiffer Press, June 2003, ISBN

978-0929652023

[3] http://en.wikipedia.org/wiki/Business_process_management

[4] David W. McCoy, Business Activity Monitoring: Calm Before the Storm, Gartner 2002,

http://www.gartner.com/resources/105500/105562/105562.pdf

[5] http://en.wikipedia.org/wiki/Process_mining

[6] http://www.bpminstitute.org/articles/article/article/bringing-analytics-into-processes-using-business-rules.html

[7] http://en.wikipedia.org/wiki/Convergent_evolution

[8] http://en.wikipedia.org/wiki/Big_data

[9] http://www.asterdata.com/blog/2008/05/19/discovering-the-dimensions-of-scalability/

[10] McKinsey Global Institute, Big data: The next frontier for innovation, competition, and productivity, May 2011

[11] S. Krishnan et al., myHadoop—Hadoop-on-demand on Traditional HPC Resources, University of California at San Diego,

2010

To learn more

To learn more about Dell cloud solutions, contact your Dell representative or visit:

www.dell.com/hadoop

©2011 Dell Inc. All rights reserved. Trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Specifications are correct at date of publication but are subject to availability or change without notice at any time. Dell and its affiliates cannot be responsible for errors or omissions in typography or photography. Dell’s Terms and Conditions of Sales and Service apply and are available on request. Dell service offerings do not affect consumer’s statutory rights. Dell, the DELL logo, and the DELL badge, PowerConnect, and PowerVault are trademarks of Dell Inc.