About Streaming Data Solutions for Hadoop

Fast Big Data and Streaming Analytics: Discerning Differences and Selecting the Best Approach
Lynn Langit
April 2015



TABLE OF CONTENTS

Executive summary
Introduction
    High-Volume Real-time Data Analytics
Streaming Analytics Project Design
    Architectural Considerations
    Component Selection
        Overall Architecture
        Enterprise-grade Streaming Engine
        Ease of Use and Development
Creating the Proof of Concept
Management/DevOps
Key Findings
Summary Comparisons
About Lynn Langit


Executive summary An increasing number of big data projects include one or more streaming components. The architectures

of these streaming solutions are complex, often containing multiple data repositories (data at rest),

streaming pipelines (data in motion), and other processing components (such as real- or near-real time

analytics). In addition to performing analytics in real time (or near-real-time), streaming platforms can

interact with external systems (databases, files, message buses, etc.) for data enrichment or for storing

data after processing. Architects must consider many types of big data streaming solutions that can

include open source projects as well as commercial products. Now they also have a next-generation reference architecture for big data, also known as fast big data.

This report will help IT decision makers at all levels understand the common technologies,

implementation patterns, and architectural approaches for creating streaming big data solutions. It will

also describe each solution’s tradeoffs and determine which approach best fits specific needs.

Key takeaways from this report include:

• Use commercial streaming solutions to implement complex Event Stream Processing (ESP) projects. Organizations should match the solution’s complexity to the component’s maturity.

• Teams that have already implemented pure open source Hadoop solutions are most capable of

adding pure open source streaming solutions. An organization should match its team’s skill level to solution complexity and component maturity.

• Organizations should test solutions at production levels of load during the proof-of-concept phase and determine whether they will host the solution on premises, in the cloud, or as a hybrid project.

• Organizations should select tools or plan for coding appropriate types of visualization solutions.

 


Introduction Data continues to grow in variety, volume, and velocity. Enterprises are handling the first two

components, with Hadoop emerging as the de facto big data technology of choice. However, they now

realize that processing and gaining insights from the increased velocity of data creation yields greater

value as it enables better operational efficiency (cost reduction) and personalized customer offerings (revenue growth). Streaming analytic solutions are the technology that supports processing fast big data.

Note that the term “fast big data” still includes “big data,” so streaming solutions must also adhere to the

tenets of big data: (near) linear scalability, fault tolerance, distributed computing, no single point of

failure, security, operability, ease of use, etc. These abilities become more critical for fast big data. Fast big

data also requires seamless integration with external systems, in-memory processing, advanced

partitioning schemes, etc. The fast big data architecture stack must integrate seamlessly in the

enterprise’s existing data processing methodology.

Contrasting streaming to traditional analytic data architecture determines whether a business can benefit from processing fast big data.

• The traditional workflow first entails ingest (usually via batch processing), then store-and-process, and finally query. This workflow provides results in hours or days.

• The action-based architecture of streaming, which is based on a pipeline of steps for continuously

ingesting, transforming, processing, alerting, and then storing data, provides insights and takes action in seconds, minutes, or hours.
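The action-based workflow above can be sketched as a chain of generator stages. This is an illustrative toy, not any particular product's API; the stage names and threshold are hypothetical:

```python
# Toy streaming pipeline: ingest -> transform -> process -> alert -> store.

def ingest(raw_events):
    """Yield events one at a time, as a stream source would."""
    for event in raw_events:
        yield event

def transform(events):
    """Normalize each event (here: uppercase the sensor id)."""
    for event in events:
        yield {**event, "sensor": event["sensor"].upper()}

def process(events, threshold):
    """Flag events whose reading exceeds the threshold."""
    for event in events:
        yield {**event, "alert": event["reading"] > threshold}

def run_pipeline(raw_events, threshold=100):
    store, alerts = [], []
    for event in process(transform(ingest(raw_events)), threshold):
        if event["alert"]:
            alerts.append(event)     # "alert" stage
        store.append(event)          # "store" stage
    return store, alerts

store, alerts = run_pipeline([
    {"sensor": "hvac-1", "reading": 72},
    {"sensor": "hvac-2", "reading": 140},
])
```

Because each stage is a generator, events flow through one at a time rather than being accumulated between stages, which mirrors the continuous nature of the workflow.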

The following definitions are helpful in understanding solution architectures designed to support one or more streaming data components.

• Event-stream processing (ESP). Streaming data is continuously ingested into one or more

data systems via a flow or stream. By contrast, non-streaming data is ingested via individual

record processing (insert, update or delete) or batched record processing. Considerations for streaming include:

o The size and duration of each streaming window or chunk of data.

o The stream’s volume and velocity; is it predictable and regular or variable and prone to spikes?


o The design must account for the sources, sizes, and types of data in the streams.

o In general, ESP solutions contain many input streams that can include different types and

volumes of data. Common types of stream data include sensor data, transactional data,

web clickstream, or log data. Internet of Things (IoT) data is increasing the demand for

streaming architectures as well. All streaming applications also require access to data at

rest for enrichment or to provide context—such as customer data, purchase history, and

support history.

o Along with multiple data input streams, ESP solutions are often implemented to answer

many mission-critical business questions and are placed into the operational data stream requiring fault tolerance.

o This type of solution can include many data-pipeline-processing steps that vary from simple

aggregation to complex machine learning processes. An example is using predictive

analysis of live and stored stream data to process all airline engine sensor data for all

flights for an airline, with the business goals of improved flight safety and reduced engine maintenance.
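To make the windowing consideration above concrete, here is a minimal sketch of tumbling (fixed-size, non-overlapping) time windows over timestamped sensor readings. The window size and sum aggregation are illustrative choices only:

```python
# Group (timestamp, reading) events into tumbling windows and sum per window.
from collections import defaultdict

def tumbling_window_sums(events, window_seconds):
    """Assign each event to a fixed, non-overlapping time window by its
    timestamp, and sum the readings that fall into each window."""
    windows = defaultdict(float)
    for ts, reading in events:
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start] += reading
    return dict(windows)

events = [(0, 1.0), (3, 2.0), (5, 4.0), (9, 8.0), (10, 16.0)]
sums = tumbling_window_sums(events, window_seconds=5)
# windows starting at t=0, t=5, and t=10 each hold their own aggregate
```

A real ESP engine would additionally handle late or out-of-order events; this sketch only shows how window size determines the "chunk" of data each computation sees.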

High-Volume Real-time Data Analytics

Since Hadoop is the focal point of the big data ecosystem, emerging fast big data platforms must be evaluated

based on their interaction with Hadoop. Including Apache Hadoop components in streaming solutions is common, so the following definitions of major components are helpful.

• Apache Hadoop core. The core services of Hadoop are the Hadoop Distributed File System

(HDFS) and YARN (Yet Another Resource Negotiator). This separation of HDFS and YARN has

enabled the emergence of streaming platforms native to Hadoop. MapReduce (which previously

provided cluster management services in place of YARN) is now a user-side library and completely separated from YARN. It is not relevant to streaming platforms.

• Apache Flume. A distributed service for efficiently collecting, aggregating, and moving streaming event data.

• Apache Kafka. A distributed, highly scalable publish/subscribe messaging system. It maintains

feeds of messages in topics. Kafka is one of the most popular message buses in the big data ecosystem. Though not part of core Hadoop, it is very widely used by the open source community.
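Kafka's publish/subscribe model can be illustrated with a toy in-memory broker. This sketch mimics only the shape of the model (append-only topic logs, with each consumer tracking its own read offset); it is not the Kafka API:

```python
# Toy pub/sub broker: producers append to named topic logs; consumers read
# independently by remembering their own offset into each log.

class ToyBroker:
    def __init__(self):
        self.topics = {}            # topic name -> list of messages (the "log")

    def publish(self, topic, message):
        """Append a message to the end of a topic's log."""
        self.topics.setdefault(topic, []).append(message)

    def consume(self, topic, offset):
        """Return (messages, next_offset) for everything after `offset`."""
        log = self.topics.get(topic, [])
        return log[offset:], len(log)

broker = ToyBroker()
broker.publish("clicks", {"page": "/home"})
broker.publish("clicks", {"page": "/cart"})

msgs, next_offset = broker.consume("clicks", offset=0)   # read from the start
```

Because the broker never deletes messages on read, any number of consumers can replay the same topic at their own pace, which is the property that makes this model popular for streaming pipelines.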


• Apache Zookeeper. This centralized coordinator maintains configuration information and naming

and provides distributed synchronization and group services; it is commonly used in Hadoop ecosystems.

• Apache Storm and Spark Streaming. These streaming data processing libraries are defined

and differentiated in the body of this report. Storm and Spark-Streaming predate YARN. Storm

uses its own scheduler rather than YARN’s, and while Spark-Streaming has YARN integration, it is not designed to work exclusively with HDFS.

• Commercial native Hadoop streaming platforms. Some commercial platforms run natively in YARN and leverage all the Hadoop semantics, operability, and other features.

The following figure shows a subset of Apache Hadoop's components. Selecting the appropriate Hadoop components for a solution is a key consideration in architecting the streaming solution.

Architectural view of typical Hadoop components

 

 


Streaming Analytics Project Design The common phases of streaming data projects are: architecture, component selection, proof of concept,

and management/DevOps (or moving the solution to production). Although design phases follow familiar,

standard architectural patterns, a closer examination of the tasks performed in each phase is useful in understanding solution design at a deeper level.

Because the landscape of streaming data solutions is changing rapidly as more open source libraries and

commercial products become available and because many technical teams lack experience creating these types of solutions, following best practices is essential.

Architectural Considerations The architecture phase includes sub-phases of design. The figure below shows a simple diagram of the fast big data pipeline.

A Fast Big Data Pipeline

 

Phase 1: Scalable Ingestion. Identify all of the streaming data sources for ingestion. The critical

requirements of ingestion are fault tolerance and scalability. These include handling fault tolerance with

no-data loss; no loss of application state; and in-order processing, with no manual intervention or

dependence on components external to the platform. The platform should have connectors to various sources, as well as the capability to add new future sources.

Phase 2: Real-time ETL. “What is the quality of the incoming data?” For example, will possible

duplicate records appear in a stream? If so, which will need to be de-duplicated? How will that

transformation be accomplished? Will the team write code to de-duplicate the data, or will a commercial

product with data manipulation capabilities handle it?

Another set of considerations involves compliance. Are timestamps required on data for compliance

requirements? Does the streaming platform have connectors to various external sources for data


enrichment? If error checking depends on the order or context of the data, then fault tolerance and in-order processing with no data loss become critical.
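As a sketch of the de-duplication question above, the following assumes each record carries a unique id and keeps recently seen ids in a bounded window so memory does not grow without limit. The id field and window size are illustrative assumptions:

```python
# In-stream de-duplication keyed on a record id, with a bounded "seen" window.
from collections import OrderedDict

def deduplicate(records, window_size=1000):
    seen = OrderedDict()                 # insertion-ordered set of recent ids
    for record in records:
        rid = record["id"]
        if rid in seen:
            continue                     # duplicate within the window: drop it
        seen[rid] = True
        if len(seen) > window_size:
            seen.popitem(last=False)     # evict the oldest id
        yield record

stream = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "a"}]
unique = list(deduplicate(stream))
```

The bounded window is the key trade-off: duplicates that arrive farther apart than the window can slip through, which is exactly the kind of design detail a commercial data-quality tool would otherwise handle.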

Phase 3: Real-time Analytics. “What are the most complex analytics that need to be performed and

can the business SLA be met to have the job finish in an hour, one minute, or one second, as needed?”

Examples of analytics computations involve dimensional cube computations, aggregates, reconciliation, etc.

Fast big data analytics are computation-intensive and affect latency. Performance, scalability, and fault-

tolerance are of utmost importance in these use cases. Spikes in data have a great impact on analytics, so

scaling automatically (rather than hand-coding such scalability) is important.
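One mechanism behind such scaling is key-based stream partitioning: events with the same key always route to the same parallel operator instance, so work can be spread across instances while per-key order is preserved. A minimal sketch, with an illustrative hash choice:

```python
# Key-based partitioning: same key -> same partition, on every run.
import zlib

def partition_for(key, num_partitions):
    """Stable hash partitioning (crc32 keeps results deterministic)."""
    return zlib.crc32(key.encode()) % num_partitions

def route(events, num_partitions):
    """Distribute events across partitions by their sensor key."""
    partitions = [[] for _ in range(num_partitions)]
    for event in events:
        p = partition_for(event["sensor"], num_partitions)
        partitions[p].append(event)
    return partitions

events = [{"sensor": "a", "v": 1}, {"sensor": "b", "v": 2}, {"sensor": "a", "v": 3}]
parts = route(events, num_partitions=4)
```

Because both "a" events land in the same partition in arrival order, each parallel instance can maintain per-key state without coordination; adding partitions is how a platform scales out under spiky load.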

Analytics require that intermediate results be retained, so fault tolerance is critical. A fault-tolerant

platform must ensure that there is no loss of events, application state, or application data. As with many

other considerations around streaming architectures, a choice exists between custom coding fault tolerance on a per-application basis or letting the streaming platform provide that capability.

Another consideration is whether to run some or all of the solution on a public cloud where offerings for

streaming data and persisting it differ considerably. Commercial solutions can run only on-premises, only on a particular cloud, or both in the cloud and on-premises.

Phase 4: Alerts and Actions. Address the business need around notification for when events or

activities occur. Alerts can be provided as part of a visual dashboard, via SMTP (email), SMS (text), or any

other message bus. Taking action is usually an automated business process that occurs without human

intervention, based on rules or policy that must be formulated. For example, for “smart building” data

from sensors, at what point should the local building maintenance team be alerted, and for which types of

event thresholds?
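The rule-or-policy formulation described above might be sketched as predicate/action pairs evaluated against each event. The thresholds and action names here are hypothetical:

```python
# Rule-based alerting: each rule pairs a predicate on the event with an
# automated action to take when the predicate is true.

def evaluate(event, rules):
    """Return the list of actions triggered by this event."""
    return [action for predicate, action in rules if predicate(event)]

rules = [
    (lambda e: e["temp_f"] > 85, "notify_maintenance_team"),
    (lambda e: e["temp_f"] > 95, "schedule_technician"),
]

actions = evaluate({"room": "3B", "temp_f": 97}, rules)
# a reading of 97 trips both thresholds; a normal reading trips neither
```

Keeping rules as data rather than hard-coded branches makes it straightforward to add or tune thresholds without redeploying the pipeline.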

Phase 5: Visualization and Distribution. Design for both storage and integration of processes;

result data must be in formats consumable by end-user groups. For example, will manufacturing line

metrics be incorporated into a dashboard, a phone application, or a wearable device? Will the developers

create these visualizations or will the organization integrate commercial visualization solutions? A

streaming platform that has connectors to various external systems and provides ability to integrate with

a visualization technology, or provides its own, helps in this phase.


Component Selection The next stage in creating a streaming solution is selecting the particular components for the streaming data architecture. Gigaom suggests that component selection be evaluated against these considerations:

• The overall architecture of the platform, both in terms of simplicity and reliability

• The enterprise-grade capabilities of the core streaming engine

• Ease of application development and the ability to process data in motion and data at rest

• Management and operation of the platform once it is in production

Following are key questions to consider that will assist with component selection along with a summary

table of the most critical features compared across representative technologies.

Overall Architecture

• Is production operability natively built into the platform? Streaming analytic solutions

run in operational capacity and have high operational requirements, including fault tolerance,

scalability, security, native integration with external sources, web services, CLI, visual dashboard,

etc. A platform that supports these features natively offers time-to-solution advantages over a platform that does not.

• Does the platform depend on external components? The number of external components

a streaming platform depends on impacts operability. Each added component introduces another

possible point of failure, and may require additional expertise. Stitching together disparate components makes architecture more complex and possibly more brittle.

• What is the general attitude toward using open source software? When selecting

streaming components, evaluate the level of developer talent that is available. For some teams,

particularly those that already have deep expertise in developing and deploying Hadoop-based

solutions into production, adding streaming functionality via open source libraries may be a good

fit. This is because those teams are already familiar with patterns for working with Apache Hadoop libraries.

However, for other, more traditional enterprise teams, using pure open source technologies can

result in hidden project costs such as resource hours to set up, test, and implement the solution.


In some cases, underestimating the knowledge required to set up, configure, integrate, tune, and test can derail an entire project, as the complexity becomes overwhelming.

• What is the true cost of creating, deploying, and managing a big data streaming

solution? Are there hidden costs? When comparing commercial solutions to open source,

organizations must compare the licensing costs, support costs, and development resources required for

both. Typically, commercial products have an upfront licensing fee that includes support and

requires less engineering expertise, while open source has no licensing fees, but requires support

fees, services fees, and internal development resources. Organizations running in a commercial cloud must analyze their monthly bills carefully for potential cost savings.

Enterprise-grade Streaming Engine

The core of any selection process is ensuring that the platform will meet the business needs for scalability,

high-availability, performance, and security. The many dimensions to consider vary based on use case. Here are the most common areas of consideration.

• What is the forecasted data volume and variety of sources for ingestion? The available

solution components vary widely in their ability to ingest at scale from multiple data sources. The

ingestion-volume requirements alone could drive a decision to a particular commercial product or

set of libraries, or some combination of both, because there are known upper limits to the

solutions. The number and scope of data sources require enterprise-quality adapters to handle

the variety of data types. The platform must be based on a very common programming language,

such as Java, to enable re-use of current code within the streaming platform. A Java-based

commercial product designed and tested for that kind of scale, with fault tolerance and a large number of connectors would be a far better fit.

• What are the use case requirements for event processing regarding data processing

guarantees and event order? Fast big data is composed of a series of events that occur over

time. Architectural decisions on how a streaming platform processes those events will have a

direct impact on performance, latency, and scalability. Many fast big data use cases require that

event order be maintained. For example, some predictive analytics use cases must compare event

order to determine what might happen next. Streaming platform architectural decisions can

impact the ability to guarantee the event order, and whether an event will be processed exactly


one time, at most one time, or at least one time. Below, we look at three architectural methods of processing event streams and the implications of each method.

1. Event-at-a-time. Apache Storm uses this method, which processes each event and uses an acknowledgement signal for each. It can impact performance and the cost of hardware.

Source: DataTorrent

2. Micro-batching. Apache Spark Streaming uses this method, which processes tiny groups of events. This method cannot provide a guarantee of in-order processing.

 Source: DataTorrent

3. Streaming event window processing. This method processes windows in the stream and

differs from micro-batching by relying on a more lightweight process (markers in the stream)

rather than batching. This approach is a non-blocking high-performance implementation that

can guarantee event order and provide only-once, at-least-once, and at-most-once event processing with no data loss.

 Source: DataTorrent
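The difference between the first two methods can be made concrete with a toy comparison: event-at-a-time processing acknowledges every event individually, while micro-batching handles small groups of events in one step. The batch size is an illustrative choice:

```python
# Contrast of per-event acknowledgement vs. micro-batching.

def event_at_a_time(events, handler):
    """Process and acknowledge each event individually, as Storm-style
    engines do; the ack count shows the per-event overhead."""
    acks = 0
    for event in events:
        handler(event)
        acks += 1            # one acknowledgement per event
    return acks

def micro_batches(events, batch_size):
    """Split the stream into small fixed-size batches, as micro-batching
    engines do; each batch is then handled in a single step."""
    return [events[i:i + batch_size] for i in range(0, len(events), batch_size)]

events = list(range(10))
acks = event_at_a_time(events, handler=lambda e: None)
batches = micro_batches(events, batch_size=4)
```

Ten events cost ten acknowledgements in the first method but only three batch steps in the second, which is the performance trade the text describes; the third (window-marker) method avoids both costs by marking boundaries inside a continuous stream.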

What are the fault tolerance requirements of the use case? Fast big data use cases are typically

implemented in an operational environment. A streaming application, unlike a batch application, does

not have an end. It runs continuously 24/7. An organization must know if the platform it has selected to

process incoming data streams meets its requirements for fault tolerance. Does the ESP system manage

the application fault tolerance or does the engineer need to hand-code it? In the event of a failure, does the ESP guarantee the events will be processed in the order in which they are generated?
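A common answer to these fault tolerance questions is checkpointing: periodically snapshotting application state together with the last-processed position, so a restarted operator resumes in order with no loss or re-processing. This is a toy sketch of the idea, not any platform's actual mechanism; the checkpoint interval is an illustrative choice:

```python
# Checkpointed processing: state (a running total) is snapshotted with the
# last-processed position so a restart can resume from the snapshot.

def process_with_checkpoints(events, checkpoint, every=3):
    """Resume from `checkpoint`, snapshotting every `every` events."""
    position, total = checkpoint
    for i in range(position, len(events)):
        total += events[i]
        if (i + 1) % every == 0:
            checkpoint = (i + 1, total)   # durable snapshot in a real system
    return checkpoint, total

events = [1, 2, 3, 4, 5]
checkpoint, total = process_with_checkpoints(events, checkpoint=(0, 0))

# Simulate a crash after the last snapshot, then recover from it: the resumed
# run reaches the same total without re-processing acknowledged events.
checkpoint2, total2 = process_with_checkpoints(events, checkpoint=checkpoint)
```

The recovered run produces the same result as the uninterrupted one, which is the "no loss of events, application state, or application data" property the platform is expected to guarantee.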


Will a use case and business logic change over time? As an organization learns more about the

customer or operational aspects of the streaming application, it will typically want to change or

supplement the current business logic of its application. For example, in a financial services fraud

detection application, it may want to add another algorithm for detecting fraud, or change an alert

process. Since transactions occur continuously, the streaming application needs to handle being updated

without impacting the analysis of the data flow. This can be thought of as an extension of fault tolerance.

Ease of Use and Development

Development process. In addition to the complexity of setting up the development environment,

when an organization begins creating a POC, it should also understand the tradeoff involved in the details

of solution implementation. These details include items such as selection of programming language. For

example, will coding be done in a proprietary vendor-created language or in a general-purpose language,

such as Java? What features of the solution (for example event guarantees, parallel partitioning, and fault tolerant capabilities) will be manually coded?

Commercial vendors, such as DataTorrent, Informatica, and others, include visual data pipeline creation

tools for rapid prototyping. This can be a significant factor in successful pipeline POC creation. For

example, when a time-sensitive POC must be built (due to competitive pressure, regulatory

changes, or some other business concern), the ability to use visual tools can significantly reduce

the time to create a prototype.

Data visualization. Another decision-point is how output data, insights, and action will be presented to

the project stakeholders for validation.

• What type of dash-boarding solution will be used? Will alerts be generated to demonstrate the

viability of the project?

• Will the developer team be expected to create visualization for results or will one or more

visualization tools be used instead?

• If using a commercial visualization solution, such as Tableau, what is the complexity of connecting

the output data to it and of creating visualizations that make sense to the subject matter experts?

• How is integration with processes, such as indexing or dimensional attribute processing, achieved?

This is another factor that separates commercial products from pure open source libraries.


• What steps are involved in optimizing real-time analytics? Can dimensional pre-calculations or some type of indexing be added or adjusted? And, again, is optimization manual or automated via tools?

Creating the Proof of Concept The first consideration in the proof-of-concept phase of a streaming project is revalidating the first few business

questions the solution should address and what action the organization would like to take as a result of

the insights it creates. For example, in an IoT smart-building sensor data-streaming project, typical

questions are: Which building sections or rooms seem to be outside of normal for HVAC and is there any

related data (changes in weather, number of people in the area, etc.) that normally correlate to such a

change? Or does this seem to be an anomaly that requires investigation by a technician? If a technician is required, automatically schedule the appointment.

Next, begin the proof of concept (POC) that will be driven by the data sources and methods of making

sense of these streams. Developers will be eager to build a POC at this point. Understanding developer

knowledge of the systems being proposed is key. Another approach is to ask candidate vendors “What is the developer story when creating applications on different types of streaming data systems?”

 


Management/DevOps A new set of priorities and decisions arises after the POC has been validated and planning begins for

production deployment. Top among these concerns is considering what enterprise-level services are

necessary to make the solution operate and meet SLA requirements. These often include processes and

tools for monitoring and maintaining security, availability, scalability, and recoverability. Next is deciding

whether to build or buy. Which open source projects and which products best meet these enterprise

service requirements? Is the best solution an all-in-one or an amalgamation of various open source libraries and vendors tools?

How are production solutions monitored, scaled, and managed? Does the streaming platform

have rich and deep web services and other integration methods to enable ease of integration into

monitoring systems? Commercial solutions include strong support for integration with monitoring

systems. Open source projects may not have strong monitoring integration points. Is the set of streams

spiky? Is dynamic or even automated scaling needed? Commercial products may include such auto-

scaling capabilities, including the ability to add operations into, or remove operations from, a running pipeline.

What steps are involved in optimizing the ingest pipeline? Is it a manual process? Does code have to be re-written, tested, and deployed or does the solution include tooling that can automate this?

 


Key Findings The addition of a streaming platform to a big data stack adds significant complexity to big data solutions.

Given this, a solid understanding of technology choices around streaming data solutions is essential for

designing and delivering solutions that provide business value to the organization.

• Consider use of commercial streaming solutions for complex ESP projects. Match the

solution complexity to component maturity. Consider the volume, velocity, and variety of

both data and data streams. Consider the true costs of building on top of pure open source Hadoop

projects versus buying vendor tools and solutions that include Hadoop plus enterprise services.

• Teams that have already implemented pure open source Hadoop solutions are most

capable of adding pure open source streaming solutions. Match a team’s skill level

to solution complexity and component maturity. Know whether the team has production

experience working with Hadoop solutions, and has deployed Hadoop solutions with low time to

market. What programming languages do they use, what are their expectations about integrating

this new project into their current development environment, and what are their requirements around

monitoring and scaling? Do they expect to work with tools or are they comfortable working with

libraries from the command line?

• Select tools or plan for coding libraries to perform the types of analytics required.

Match the types of analysis expected on the event stream data. Organizations must

understand whether they will be performing simple aggregations or anticipate using machine

learning algorithms or some other type of predictive component. Know whether a team should

write this logic or whether the organization will purchase a solution that integrates or contains

these components. Figure out how much of the analytics must be returned in real-time.

• Test solutions at production levels of load during the proof-of-concept phase.

Determine whether to host the solution on premises, in the cloud, or as a hybrid

project. Examine different vendors’ cloud solutions. Is the solution tested at scale on any/all

clouds? Know the potential service costs.

• Select tools or plan for coding appropriate types of visualization solution. If the

organization plans to build its own visualization solution, the technical team must have the talent

to create what is envisioned. If not, what is the cost of hiring, re-training, or getting additional

help to implement this part of the solution?


Summary Comparisons

Capability | Storm | Spark Streaming | DataTorrent RTS | IBM InfoStreams | TIBCO StreamBase

Core Streaming Engine Capabilities
In-memory compute engine | Yes | Yes | Yes | Yes | Yes
Native Hadoop Architecture | No | Yes | Yes | No | No
Sub-second event processing | Yes | No | Yes | Yes | Yes
Extremely Linear Scalability (billion(s) of events/second) | No | No | Yes | No | No
Stream partitioning for parallel processing | No | No | Yes | No | No
Auto-scaling input and event processing | No | No | Yes | Yes | Yes
Event Processing Guarantees | Only once, at least once, at most once | Only once, at least once, at most once | Only once, at least once, at most once | At most once | Only once, at most once
Event order Guaranteed | No | No | Yes | No | No
End-to-End Stateful Fault Tolerance | No | No | Yes | No | No
Incremental recovery | No | No | Yes | No | No
Dynamic application updates | No | No | Yes | No | No
Data Loss Potential | Yes | Yes | No | Yes | Yes
Complete separation of business logic, event acknowledgement and fault tolerance | No | No | Yes | No | No


Capability | Storm | Spark Streaming | DataTorrent RTS | IBM InfoStreams | TIBCO StreamBase

Streaming Application Development Tools
Native Application Programming Language | Java | Scala, Java API | Java | Proprietary (Streams Query Language) | Proprietary (StreamSQL, EventFlow)
Open Source Pre-built Connectors | <10 | <10 | >75 | No | No
Graphical application builder | No | No | Yes (beta) | Yes | Yes
Visual Real-time Dashboard | No | No | Yes (beta) | No | Yes

Operations and Management Tools
Fully functional GUI-based Management | No | No | Yes | Yes | Yes
Ease of integration with monitoring systems | No | No | Yes | Yes | Yes
Ease of integration with external systems via REST APIs | No | No | Yes | Yes | Yes
Simple install and upgrade | No | No | Yes | No | No

 


About Lynn Langit Lynn Langit is a Big Data and Cloud Architect who has been working with database solutions for more

than 15 years. Over the past 4 years, she’s been working as an independent architect using these

technologies, mostly in the biotech, education, manufacturing and facilities verticals. Lynn has done

POCs and has helped teams build solutions on the AWS, Azure, Google and Rackspace Clouds. She has

done work with SQL Server, MySQL, AWS Redshift, AWS MapReduce, Cloudera Hadoop, MongoDB,

Neo4j and many other database systems. In addition to building solutions, Lynn also partners with all

major vendor cloud vendors, providing early technical feedback into their Big Data and Cloud offerings.

She is a Google Developer Expert (Cloud), Microsoft MVP (SQL Server) and a MongoDB Master. Lynn is

also a Cloudera certified instructor (for MapReduce Programming).

Prior to re-entering the consulting world 4 years ago, Lynn spent over 10 years as a Microsoft

Certified instructor and Microsoft vendor, and then 4 years as a Microsoft employee. She’s published 3 books

on SQL Server Business Intelligence and has most recently worked with the SQL Azure team at Microsoft.

She continues to write and screencast and hosts a BigData channel on YouTube

(http://www.youtube.com/SoCalDevGal) with over 150 different technical videos on Cloud and BigData topics. Lynn is also a committer on several open source projects (http://github.com/lynnlangit).