the big data imperative: smarter choices for an analytics infrastructure

16

Upload: fedscoop

Post on 26-May-2015

1.084 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure
Page 2: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure

The Big Data Imperative:Smarter Choices for an Analytics

InfrastructureJim Williams

Senior Software EngineerIBM Software Group

Page 3: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure

Text Documents Blogs Web Logs Mfg. Equipment

Email Weather Data Social Media Stock Trades

Text Documents BlogsText Documents Web LogsBlogs

Mfg. Equipment Utility Meters Medical Equip. Call Data Records

Point of Sale Data Video Cameras Audio Devices Oil Rigs

Where is the Big Data Coming From?

Data at restData is stored on disk

Huge volumes of unstructured data

No pre-defined schemas

Too large for traditional tools to process in a timely manner

Data in motion

Data is typically not stored

Tremendous velocity

Multiple data sources

Huge volumes of unstructured data

Ultra low latency required

Page 4: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure

Operational Data Store

Traditional data sources(ERP, CRM, databases, etc.)

Source data (Web, sensors, logs, media, etc. )

Applications

Big Data Platform: Gain Value From Unstructured Data Sources And Structured Enterprise Data

Unstructured data

Unstructured data

Structured data

Structured data

Page 5: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure

New Programming Models and Low Cost Hardware For Handling Unstructured Data

Apache Hadoop and InfoSphere Streams

Proven frameworks to process large amounts of data

Hadoop for data at rest, Streams for data in motion

Enable applications to transparently work with large clusters of nodes in parallel

InfoSphere Streams

Clusters of low cost PowerLinux

servers that are ideal for Hadoop

and Streams

Hadoop Cluster

Streams Cluster

Page 6: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure

Apache HadoopProcessing

StorageInput

Hadoop Cluster

MapReduceJava

Program

Result

Enables applications to transparently work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner

Hadoop node is a processor and disksNodes are combined into clustersInput is parceled out to nodes at load timeMapReduce jobs are sent to nodes at job run time

Page 7: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure

RR

R RR

Data BlockReplica

Hadoop Distributed File System

Node 1 Node 2 Node 3 Node n

A distributed file system that spans all the nodes in a Hadoop clusterFiles are split automatically at load time into blocks and spread among Data NodesElastically scalableAssumes nodes will fail - achieves reliability by replicating data across multiple nodes

Hadoop Distributed File System (HDFS)

Page 8: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure

Master node processesJobTracker for MapReduceNameNode for HDFS

Slave node processesTaskTrackers for MapReduceDataNode for HDFS

Hadoop Framework Processes

Slave Node

Master Node

TaskTracker Slave Node TaskTracker

NameNode

JobTracker

Slave Node TaskTracker

DataNode DataNodeDataNode

Data Data Data

Slave nodeSlave nodeSlave nodeSlave nodeSlave nodeSlave node

Page 9: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure

Map and Reduce are steps in the framework that a programmer implements

Hadoop framework orchestrates Map and Reduce steps

MapReduce jobs are sent out to each node to run

View Inside One MapReduce Job

Map Step

MapReduce jobs run in parallel across nodes

The steps process key/value pairs in some way

How the steps manipulate the pairs defines the solution

Key

Value

Value

Value

Framework Processing

Key

Value

MapReduce Programming Model

K3 V3 K2 V2

Key

Value

K4 V4 Reduce

StepInput

HDFSHDFS

Page 10: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure

BigInsights – Makes It Easy

Web based management console Security enhancements

LDAP authentication Administrator enhancements

Installation and configuration Data import/export tools Monitoring tools

Developer enhancements Eclipse tools Job management tools

Integration enhancements Database/warehouse integration

Business user enhancements Spreadsheet style tool for users

without Java skills

InfoSphere BigInsights Console

Page 11: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure

Demo: Using BigInsights To Determine Sentiment About A Cabinet Appointment

Approve Disapprove

The Smith nomination as

Secretary of the Exterior is bad

The Smith nomination as

Secretary of the Exterior is bad

Smith is a lousy choice for Exterior

Smith is a lousy choice for Exterior

Smith is a brilliant

choice for Exterior

Secretary

Smith is a brilliant

choice for Exterior

Secretary

Smith would be a great SOE

Smith would be a great SOE

Brilliant choice for Secretary of the Exterior

Love the selection of Smith for Exterior The Smith appointment makes sense Smith would be an excellent Secretary

of the Exterior

Horrible appointment for Exterior Secretary

Lousy choice for Exterior Secretary Don’t like Smith for Secretary of Exterior Smith would be the worst choice for

Secretary of the Exterior

Page 12: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure

What You Just Saw In The Demo

Large volumes of raw, unstructured data

Valuable insights into citizen sentiment

InfoSphere BigInsights

Page 13: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure

Key Benefits

Up to 17% lower power/cooling costs than x86 rack servers

Industry standard (Redhat & SUSE) Linux only servers, optimized for POWER architecture

Competitively priced compared to x86 Linux

BigInsights on PowerLinux runs 71% faster than Cloudera on x86

New PowerLinux Servers Ideal For Big Data

More Info: http://www.ibm.com/systems/power/software/linux/powerlinux/bigdata.htmlMore Info: http://www.ibm.com/systems/power/software/linux/powerlinux/bigdata.html

PowerLinux 7R2

•Linux only POWER7

•Two socket, 2U rack

•8 cores/socket•Up to 20 7R2’s per rack

Page 14: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure

For Additional Information

Visit the Agile Summit Solution Center for demonstrations of these capabilities.

Ask an IBM Ambassador for additional information (case study, white paper, solution brief, etc.) related to the content shared during this session.

For a follow up discussion, complete the IBM Response Card on the table in front of you.

Page 15: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure

Thank You !

Page 16: The Big Data Imperative: Smarter Choices for an Analytics Infrastructure

®

© 2012 IBM Corporation

Trademarks and notes

© Copyright IBM Corporation 2012

IBM Corporation

Software Group

3565 Harbor Boulevard

Costa Mesa, CA 92626-1420

U.S.A.

Produced in the United States of America

May 2012

All Rights Reserved

IBM, the IBM logo, ibm.com, and smarter planet are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and

service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml

References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates. Client success stories are available at

ibm.com/software/success/cssdb.nsf

The information contained in this documentation is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this

documentation, it is provided “as is” without warranty of any kind, express or implied. In addition, this information is-based on IBM’s current product plans and strategy, which are subject to change by

IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this documentation or any other documentation. Nothing contained in this

documentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM (or its suppliers or licensors), or altering the terms and conditions of the applicable license

agreement governing the use of IBM software.

IBM customers are responsible for ensuring their own compliance with legal requirements. It is the customer ’s sole responsibility to obtain advice of competent legal counsel as to the identification and

interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws.