TRANSCRIPT
Real Time for Big Data:
The Next Age of Data Management
Talksum, Inc.
Talksum, Inc. – 582 Market Street, Suite 1902, San Francisco, CA 94104
Fall
Real Time for Big Data – The Next Age of Data Management
Introduction
Much has been written about Big Data. Many have noted the expansion of the volume, variety
and velocity of data. The rise of unstructured data, mobile device data, social streams, and complex
scientific modeling, to name just a few, is unquestionably altering the landscape of computing.
For some, this has meant a rush to augment current data management practices and incrementally
improve data stores to handle larger volume. There is certainly merit in this approach, as there’s a
need to do whatever is possible to handle the challenges at hand. But the nature of the changes that
face us requires us not only to improve existing models, but also to vigorously explore entirely new
approaches in parallel.
It helps to consider the pace of data expansion and where we are in the curve. The current estimate is
that the volume of total data in the digital universe is 2.7 ZB, equivalent to more than 18 million
Libraries of Congress. This is expected to grow by an average of 50% a year to an astounding 8 ZB of
data by 2015. What’s more, an estimated 80% of the data in the digital universe today was created in
the past year. Not only is the volume of data, both structured and unstructured, exploding, but we are
still at the very early stages of this growth, at the very beginning of the exponential curve.
Data Management and the Next Step
The primary drivers of data growth are the expansion of Internet technologies and consumer adoption
across the world, the proliferation of mobile devices and the spread of digital signal components in just
about every product space imaginable. As noted previously, volume is increasing, but so are the variety
and velocity of the data. Along with the associated data quality challenges these
increases introduce, this trend is rapidly breaking current models of data operations and
computing. The evolution of data management is taking its next major turn.
For several decades, enterprise practices for data storage and analysis have revolved around the
structured database. The paradigm of writing data to a structured store and then querying that store
for both analysis and application uses has dominated computing practices. The SQL database led to
data warehousing practices, where massive amounts of data are extracted from their original form,
transformed and loaded into structured formats tailored to perform specific analyses.
As data has become more voluminous and complex, this model has been expanded and improved
through the addition of more software to the enterprise data stack. And as the Big Data era
dawned, innovation focused on changing the nature of the data store, moving toward unstructured
NoSQL technologies and batch processing platforms like Hadoop. These are useful tools for dealing
with large data stores, but they keep to the core approach of handling data by writing it to a store and
then processing it, scaling software tools horizontally while incrementally enhancing processes with
yet more layers of software.
While this approach begins to solve some of today’s short-term big data analytic challenges, it
isn’t the fundamental change to our overall data infrastructure that is needed. Data is only getting
bigger and more varied, quality is a rising concern, clusters continue to grow…and it’s just the
beginning.
We have to take the next step, and consider three core concepts:
- Data-oriented architectures require innovation at the front end of the data workflow, the “top
of the data funnel,” so to speak. Agility in the face of change is critical, and today’s intensive ETL
processes, reloading of data sets for new analyses and custom data integration practices don’t
accommodate flexibility, leading to rigid and unscalable platforms. Choosing a warehouse,
analytics platform and/or BI tools and then adopting custom data management practices and
tools to manipulate the data is a backward approach. You have to start with the management
of the top of your data funnel – how you take in data, parse and transform it, route it and store
it for the rest of your applications, BI tools and monitoring. The front end of the data process
must be flexible, high performance and massively efficient. It has to handle all kinds of data and
support just about any kind of architecture, and it has to be agile.
- Real time is the future. It is easy to think of real time data management as a luxury. However, if
you consider the velocity, volume and variety of Big Data for a moment, it becomes clear that
real time is the only rational approach for many applications. You can’t store all the data in the
world, or at least, the cost becomes prohibitive as volume increases. Furthermore, as velocity
increases you can’t rely on being able to store and then process fast enough to keep up with
the data itself. You have to be able to do something meaningful at the time the data is
generated, including handling new types of data and triaging data quality, or else you
necessarily fall behind. While data warehousing and “write, process, query” approaches will
always have uses, it is critical that real time management, analysis and monitoring be part of
any “future-proof” Big Data strategy.
- Hardware is important. While better utilization of “commodity hardware” is a laudable goal,
leveraging the leaps that are occurring in computing technology is critical, especially the
innovations in the storage space that promise to increase I/O significantly. Tailoring a platform
to use every last bit of the resources available to it and continually integrating the latest
hardware and network technologies has to be a core part of any reasonable solution to
tomorrow’s Big Data challenges.
Of course, the question these concepts lead to is clear – how do you reasonably ingest, transform,
monitor, analyze and route data in real time at speeds that are fast enough to accommodate Big Data
sized challenges? And how can you apply enough filtering and processing logic to real time data
streams without compromising efficiency to the point that the solution is infeasible? How can you
begin to create a holistic data management and analytics infrastructure that isn’t made up of dozens of
individual services with integration challenges on every level?
Talksum Data Stream Appliances address exactly those challenges.
Talksum Data Stream Appliances
Talksum Data Stream Appliances are a data management and analytics platform built from the ground
up to handle Big Data. The platform allows companies to act on data in real time through automated
response, business process or human action. It allows users to move data where it’s needed and in the
format it’s needed quickly and efficiently. It helps both streamline service delivery and boost overall
data performance – all with less infrastructure.
Simply put, Talksum Data Stream is a powerful multi-tool for managing and monitoring your data in
real time. Talksum Data Stream manages data acquisition, ingest and transformation, converting your
data into flexibly managed event streams. It filters and routes data based on your specific business
needs and data processing requirements, while also adding real time monitoring and analytics.
Talksum Data Stream outputs to any external data store, application, file system or process in
whatever format you need. All of this is done at very high speeds, accommodating Big Data initiatives
and optimizing your entire data infrastructure. The result is a simplified data management process that
gives you the ability to monitor and analyze your data in real time, while at the same time reducing the
pain of acquisition, ETL and integration.
TalkOS – OS Level Data Processing
The heart of the Talksum Data Stream appliance is TalkOS, a Linux-based operating system
developed by Talksum. TalkOS is tailored to make the best use of the appliance hardware,
optimizing worker thread utilization, efficiently managing disk I/O and maximizing in-memory
processing.
TalkOS is extended by a set of modules for each step in the data workflow (see Figure 1). The modular
architecture is service oriented and highly extensible – individual modules are instantiated for each
specific function for a particular data stream. Module creation and settings are managed by the TalkOS
configuration management layer, giving a great deal of flexibility in how the modules can be pieced
together for various data workflows, analytic scenarios and business use cases.
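As a rough illustration of this configuration-driven modularity, the Python sketch below wires a per-stream pipeline from an ordered list of module declarations. The module names and configuration shape are invented for illustration; they are not the actual TalkOS configuration format.

```python
# Sketch of configuration-driven module pipelines. Module names and the
# configuration shape are invented, not the actual TalkOS format.
PIPELINE = [
    ("protocol_transform", {"from": "udp"}),
    ("event_transform", {"template": "syslog"}),
    ("filter", {"rule": "src_10_net"}),
    ("storage_output", {"target": "postgres"}),
]

class Module:
    """One instantiated service module for a particular data stream."""
    def __init__(self, name, settings):
        self.name, self.settings = name, settings

    def process(self, event):
        # A real module would transform, filter, or route the event;
        # here we just record that the event passed through.
        event.setdefault("path", []).append(self.name)
        return event

def build_pipeline(spec):
    """The configuration layer instantiates one module per declared step."""
    return [Module(name, settings) for name, settings in spec]

def run(pipeline, event):
    for module in pipeline:
        event = module.process(event)
    return event

result = run(build_pipeline(PIPELINE), {"msg": "hello"})
print(result["path"])  # each module the event traversed, in order
```

The point of the sketch is that the workflow is data, not code: changing the declared list re-pieces the modules for a different use case without touching the modules themselves.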
Figure 1. TalkOS provides a modular architecture to allow flexible and highly efficient data management and processing.
Let’s examine the primary TalkOS module families and follow the path of data through the system.
TalkOS Input Modules
TalkOS Input Modules manage the acquisition, ingest and transformation of data.
• Dynamic Acquisition Modules – proactively manage the acquisition of data that needs to be
gathered from external systems by TalkOS. These configurable service modules automate data
acquisition.
• Protocol Transformation Modules – transform incoming network protocols to an appropriate
protocol for downstream processes, ensuring that data from any protocol (e.g. UDP, TCP, ZMQ,
Unix Socket, Kernel Logging) can be appropriately handled.
• Event Transformation Modules – parse and transform incoming data to a CEE-compliant JSON
object format with the desired field structure based on configurable event transformation
templates. This replaces your current work-intensive ETL processes with a frictionless “NoETL”
approach, ensuring your data is in a format you can use where and when you need it.
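A minimal sketch of the template-driven idea, assuming a hypothetical positional template; the real TalkOS templates and CEE field names may differ:

```python
import json

# Hypothetical positional transformation template: maps the fields of a
# delimited log record to named JSON keys. Illustrative only; the real
# TalkOS templates and CEE field names may differ.
TEMPLATE = ["time", "host", "severity", "msg"]

def transform_event(raw_line, template=TEMPLATE, sep="|"):
    """Parse one delimited record into a structured JSON event object."""
    fields = raw_line.strip().split(sep)
    return json.dumps(dict(zip(template, fields)))

event = transform_event("2012-10-01T12:00:00|web01|info|request served")
print(event)  # a JSON object with time, host, severity and msg fields
```

Swapping the template changes the output field structure without any per-source ETL code, which is the essence of the “NoETL” claim.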
TalkOS Core Router
The TalkOS Core Router serves as the center of communications between all other service modules,
forming the core of the Talksum Data Stream appliance.
• Object Normalization – ensures normalization of all incoming event objects before they are
placed onto the core router data bus.
• Core Router – the core router is a high performance data bus that moves data to and from all
internal and external services. It allows for real time filtering on data as it moves through the
bus, and routes to the desired internal service (such as an analytic or output module) based on
the filter parameters and routing rules associated with it.
• Filters – filters are simply expressed JSON rules that monitor data on the Core Router and
perform actions when the filter parameter is met. The output of a filter can be sent to any
other service, such as a notification module, analytics modules, a data reduction service, etc.
An example of filter code is shown below.
Figure 2. This creates a filter that only allows data with a srcaddr equal to 10.10.3.8 and a dstaddr that starts with 10.0.
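The filter code from Figure 2 is not reproduced in this transcript. The Python sketch below approximates the described behavior with an invented JSON rule shape ("eq" for an exact match, "prefix" for a starts-with match); the actual TalkOS filter syntax may differ.

```python
import json

# Invented rule shape approximating the Figure 2 filter; this is not
# the actual TalkOS filter syntax.
FILTER_RULE = json.loads("""
{
  "match": {
    "srcaddr": {"eq": "10.10.3.8"},
    "dstaddr": {"prefix": "10.0"}
  }
}
""")

def matches(event, rule=FILTER_RULE):
    """True when every condition in the rule holds for the event."""
    for field, cond in rule["match"].items():
        value = event.get(field, "")
        if "eq" in cond and value != cond["eq"]:
            return False
        if "prefix" in cond and not value.startswith(cond["prefix"]):
            return False
    return True

print(matches({"srcaddr": "10.10.3.8", "dstaddr": "10.0.4.1"}))  # True
print(matches({"srcaddr": "10.10.3.9", "dstaddr": "10.0.4.1"}))  # False
```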
• Notifications – the notification service allows for triggers to alert other service modules or
external systems and stored procedures when a threshold or filter parameter has been met. For
example, if the memory utilization on a server reaches a certain point, an email notification can
be sent to a distribution list or an external stored procedure can be triggered to provision
another server.
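A toy version of the memory-utilization example, with invented field names; in a real deployment the notify step would send email or invoke an external stored procedure rather than append to a list:

```python
# Toy threshold trigger for the memory-utilization example. Field names
# are invented; notify() stands in for email or a stored procedure.
alerts = []

def notify(message):
    alerts.append(message)

def check_threshold(event, metric="mem_pct", threshold=90.0):
    """Fire a notification when the metric meets or exceeds its threshold."""
    value = event.get(metric)
    if value is not None and value >= threshold:
        notify("%s=%s on %s exceeds %s"
               % (metric, value, event.get("host", "?"), threshold))

check_threshold({"host": "db01", "mem_pct": 95.5})  # fires
check_threshold({"host": "db02", "mem_pct": 40.0})  # does not fire
print(alerts)
```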
TalkOS Analytic Modules
TalkOS Analytic Modules provide real time analytic capabilities. Analytic Modules, like all other TalkOS
service modules, can be configured to work together for a particular analytic need, with the output
being stored in the internal appliance Stream Storage for consumption by the TalkOS GraphUI or
through the TalkOS API.
• Mathematics – this module maintains a count of all actively filtered data characteristics,
accessible in real time and usable by other modules. Additionally, this module enables users to
enter simple mathematical equations that are expressed through filters on the core router, the
results of which are accessible in real time.
• Aggregation – all counts, filter outputs and mathematical results are aggregated, correlated and
distributed by this module. This allows data to be organized and stored in more meaningful
combinations for analytics and monitoring.
• Data Reduce – allows for the real-time reduction of data to either the Stream Storage or an
external data store as desired. This can include, but is not limited to, real time reduction to a
key-value store.
• Stream Storage – data from the analytic modules is stored in an internal appliance store,
making it usable by other internal service modules, by the TalkOS GraphUI and through the
TalkOS API.
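The count-and-aggregate idea behind the Mathematics and Aggregation modules can be sketched minimally as follows; the class, keys and field names are invented for illustration and are not TalkOS internals:

```python
from collections import Counter, defaultdict

# Minimal count-and-aggregate sketch: running counts of filtered data
# characteristics, correlated per key. Names are invented.
class StreamAggregator:
    def __init__(self):
        self.counts = Counter()
        self.totals = defaultdict(int)

    def observe(self, event):
        key = event.get("dstaddr", "unknown")
        self.counts[key] += 1
        self.totals[key] += event.get("bytes", 0)

    def summary(self):
        # Correlate event count and byte total per destination, the kind
        # of combined view the Aggregation module would store.
        return {k: {"events": self.counts[k], "bytes": self.totals[k]}
                for k in self.counts}

agg = StreamAggregator()
for ev in [{"dstaddr": "10.0.4.1", "bytes": 120},
           {"dstaddr": "10.0.4.1", "bytes": 80},
           {"dstaddr": "10.0.9.9", "bytes": 10}]:
    agg.observe(ev)
print(agg.summary())
```

Because the running state is just counters, results stay queryable in real time as events flow through, rather than requiring a batch pass over stored data.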
TalkOS GraphUI & API
TalkOS has its own native GUI layer and a web socket for custom UX extensions. The native GUI
includes a dashboard application for visualizing and monitoring data in real time and GUI based tools to
configure key TalkOS service modules (complementing the various command line tools).
TalkOS GraphUI also supports an extensible web-based UX framework that interacts with TalkOS
services through a web socket connection. Importantly, the TalkOS GraphUI is based on an
event-driven UX framework. This means that all actions and functions within the GUI are powered by the
same common event description protocol as the core service modules. This approach both delivers an
extremely efficient GUI layer from a performance standpoint and allows for very simple
extensions of core GUI functionality based on the underlying event data for a particular vertical
solution or business use case.
TalkOS also supports an API for direct programmatic access.
TalkOS Output Modules
TalkOS Output Modules allow data to be transformed and sent to any external store or process as
desired. All Output Modules are configurable to send data in the desired protocol, format and
structure.
• Storage Outputs – output data to storage databases, including both SQL databases (e.g.
MySQL, Postgres, Oracle, SQLite) and NoSQL databases (e.g. MongoDB, Redis, Berkeley DB,
Couchbase).
• File System Outputs – output data to file systems, writing data out as files in the appropriate
format (e.g. text, JSON, XML, CSV).
• Network Modules – route data to other networks and network services, such as replicating to
another data center or routing to a specific network service or endpoint (e.g. HDFS, RFC3164,
REST, AWS, third-party APIs).
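As an illustration of the output idea, the sketch below renders the same normalized events in two of the formats mentioned above (JSON and CSV). A file-system or network output module would wrap encoders like these; the helper functions are invented for illustration.

```python
import csv
import io
import json

# The same normalized events rendered in two output formats. A real
# output module would deliver these over a file system or network.
events = [
    {"time": "2012-10-01T12:00:00", "host": "web01", "msg": "ok"},
    {"time": "2012-10-01T12:00:05", "host": "web02", "msg": "ok"},
]

def to_json_lines(evts):
    """One JSON object per line, a common file and network format."""
    return "\n".join(json.dumps(e) for e in evts)

def to_csv(evts):
    """CSV with a header row derived from the event fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["time", "host", "msg"])
    writer.writeheader()
    writer.writerows(evts)
    return buf.getvalue()

print(to_json_lines(events))
print(to_csv(events))
```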
Configuration Modules
TalkOS uses a robust configuration management framework to manage both internal service module
settings on the appliances as well as federated management of multi-appliance architectures.
Importantly, these configuration modules can optionally be used to help manage devices and services
external to the TalkOS platform, combining data service delivery tools and automation with real time
monitoring and routing.
• Host Configuration – allows you to manage all the servers in your datacenter as though they
were an integrated heterogeneous network.
• Network Device Configuration – manages the configuration and settings of network devices
such as routers and switches.
• Cloud Service Configuration – manages cloud service deployment and configuration to
complement existing data infrastructure assets.
• Virtualization Management – manage virtualized resources to complement existing data
infrastructure assets.
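One simple way to picture federated configuration is comparing each managed host's settings against a single desired-state document and reporting drift; the structure and field names below are assumptions for illustration, not the TalkOS configuration format:

```python
# Illustrative federated-configuration check: compare each host's
# settings against one desired-state document and report drift.
desired = {"ntp_server": "10.0.0.5", "log_level": "info"}

fleet = {
    "web01": {"ntp_server": "10.0.0.5", "log_level": "debug"},
    "web02": {"ntp_server": "10.0.0.5", "log_level": "info"},
}

def drift(fleet_state, desired_state):
    """Per host, the settings that differ from the desired state."""
    report = {}
    for host, state in fleet_state.items():
        diffs = {k: state.get(k) for k, v in desired_state.items()
                 if state.get(k) != v}
        if diffs:
            report[host] = diffs
    return report

print(drift(fleet, desired))  # only web01 drifts, on log_level
```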
Hardware Specifications
Talksum Data Stream Appliances offer flexible architectures based on two core sizes of appliances.
Recommended deployment models range from smaller instances for internal analytics and test/dev
purposes to larger, fully redundant architectures for enterprise production environments. For larger
deployments, a larger appliance size option is available.
All appliances feature redundant power supplies, “lights-out” management functionality, an integrated
hardware RAID controller with flash-backed cache and onsite hardware support services – as well as a
full installation of TalkOS and all core modules.
The following are basic hardware specifications for our two core appliance sizes.
Stream Transformer (“Small”)
- Description: 1 RU server best used for ingest & transformation in a multi-server environment. Can also be used as a standalone system for test & dev.
- Chassis Size: 1 RU
- CPU: 2 Intel i7 CPUs, E5600 series
- RAM: 48 GB
- OS Disk: 2 SAS hard disks for OS
- Stream Storage: 4 SAS or SATA hard disks, 300 GB – 500 GB

Stream Router (“Medium”)
- Description: 2 RU server best used as the core router & analytics box in a multi-server environment. Can also be used as a standalone multi-purpose system for internal analytics, test & dev.
- Chassis Size: 2 RU
- CPU: 2 Intel i7 CPUs, X5600 series
- RAM: 96 GB
- OS Disk: 2 SAS hard disks for OS
- Stream Storage: 4 SAS or SATA hard disks, 1.5 TB – 2.0 TB, plus 2 FusionIO NAND storage devices, 1.2 TB

Note: where ranges are shown, configuration options and enhancements are available – final specifications will be developed based on customer needs.