
Claus Samuelsen, IBM Analytics, [email protected]

Big Data projects and use cases

IBM Software

© 2014 IBM Corporation

Overview of BigInsights

IBM BigInsights Analyst
● Industry standard SQL (Big SQL)
● Spreadsheet-style tool (BigSheets)

IBM BigInsights Data Scientist
● Text Analytics
● Machine Learning on Big R
● Big R (R support)

IBM BigInsights Enterprise Management
● POSIX Distributed Filesystem
● Multi-workload, multi-tenant scheduling

IBM Open Platform with Apache Hadoop* (HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, ZooKeeper, OpenJDK, Knox, Slider)

*IBM Open Platform with Apache Hadoop is a 100% open source Apache Hadoop distribution. IBM will include the Open Data Platform common kernel once available.

Free Quick Start (non-production):
● IBM Open Platform
● BigInsights Analyst, Data Scientist features
● Community support


IBM Big SQL – Runs 100% of the queries

Key points
● With Impala and Hive, many queries needed to be rewritten, some significantly
● Owing to various restrictions, some queries could not be rewritten or failed at run time
● Rewriting queries in a benchmark scenario where results are known is one thing; doing this against real databases in production is another
● Other environments require significant effort at scale
● Results for 10TB scale shown here


Hadoop-DS benchmark – Single user performance @ 10TB

Big SQL is 3.6x faster than Impala and 5.4x faster than Hive 0.13 for a single query stream using 46 common queries.

Based on IBM internal tests comparing BigInsights Big SQL, Cloudera Impala and Hortonworks Hive (current versions available as of 9/01/2014) running on identical hardware. The test workload was based on the latest revision of the TPC-DS benchmark specification at 10TB data size. Successful executions measure the ability to execute queries a) directly from the specification without modification, b) after simple modifications, c) after extensive query rewrites. All minor modifications are either permitted by the TPC-DS benchmark specification or are of a similar nature. All queries were reviewed and attested by a TPC certified auditor. Development effort measured the time required by a skilled SQL developer familiar with each system to modify queries so they will execute correctly. Performance test measured scaled query throughput per hour of 4 concurrent users executing a common subset of 46 queries across all 3 systems at 10TB data size. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment.

Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera. Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.

© 2009 IBM Corporation

Big Data Projects

● Stock Trade Analysis

● Log File Root Cause Analysis

● 360 Degree Customer View

● Gamers Behaviour

● Weather Data Analysis

● Sensitive Data Access

● Tax Fraud Investigation

● Warehouse Augmentation

● Positive side effects of drugs

● CRM analysis

● Ontologies

● Document classification

● Roaming Log Analysis

● Connected Cars

● Historical Archive Research

● DNA sequencing


Warehouse Augmentation

Banking Industry: Fraud Analysis

The customer wanted to implement two different kinds of fraud analysis: transaction fraud and social engineering fraud.

Problem:
● The existing data warehouse does not allow for long-running jobs
● Extending the data warehouse has a huge cost


Solution:
● Moving data to IBM BigInsights reduces the cost significantly
● No limitations on long-running jobs
● Obtaining the data from the various sources is the most time-consuming process
● Using Big SQL, we can run the same queries in Hadoop as in the traditional warehouse
● With Big SQL, the customer can connect using their standard JDBC/ODBC-based SQL tools
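The portability point can be illustrated with a minimal sketch: one SQL text, written once, is sent unchanged through a generic database connection to two backends. This is not Big SQL code. Both backends here are in-memory SQLite databases standing in for the traditional warehouse and for Hadoop, Python's DB-API plays the role that JDBC/ODBC plays for the customer's tools, and the `transactions` table, its columns and the threshold are hypothetical.

```python
import sqlite3

# One fraud-style aggregate, written once in standard SQL.
# Table name, column names and the 1000 threshold are hypothetical.
QUERY = """
SELECT account_id, COUNT(*) AS n_tx, SUM(amount) AS total
FROM transactions
GROUP BY account_id
HAVING SUM(amount) > 1000
ORDER BY account_id
"""


def make_backend(rows):
    """Create an in-memory SQLite database standing in for one backend."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (account_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    return conn


def run_query(conn, sql=QUERY):
    """Run the unchanged SQL text through a generic DB-API connection."""
    return conn.execute(sql).fetchall()


rows = [("A", 900.0), ("A", 200.0), ("B", 50.0), ("C", 1500.0)]
warehouse = make_backend(rows)  # stand-in for the traditional warehouse
hadoop = make_backend(rows)     # stand-in for Big SQL on Hadoop

warehouse_result = run_query(warehouse)
hadoop_result = run_query(hadoop)
```

Because the query text never changes, the only thing that differs between the two runs is the connection object, which is the property the slide attributes to Big SQL.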


Document Classification

Insurance Industry: Automatic classification

Problem: Insurance documents are not standardized. They are typically free-form documents written as e-mails, MS Word files, etc. Incoming documents are not classified and are therefore often sent to the wrong department or the wrong person, resulting in unacceptably long processing times.


Solution:

Using BigInsights Text Analytics, new documents can be classified automatically.

The customer had described the characteristics of the different classes that the documents had to be put into.

Using these descriptions, we could implement the rules in BigInsights in three weeks, to a degree that satisfied the customer.
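To make the idea of rule-based classification concrete, here is a minimal sketch in plain Python. It is not BigInsights Text Analytics and does not use its rule language; it only shows the general shape of turning class descriptions into matching rules. The class names and trigger words are hypothetical, and the scoring (count of pattern matches per class) is an assumption chosen for simplicity.

```python
import re

# Hypothetical classes and trigger patterns; real rules would be derived
# from the customer's descriptions of each class.
RULES = {
    "claims": [r"\bclaim\b", r"\bdamage\b", r"\baccident\b"],
    "billing": [r"\binvoice\b", r"\bpayment\b", r"\bpremium\b"],
    "cancellation": [r"\bcancel(?:lation)?\b", r"\bterminate\b"],
}


def classify(text, rules=RULES, default="unclassified"):
    """Return the class whose patterns match most often, or `default`."""
    scores = {
        label: sum(
            len(re.findall(pattern, text, flags=re.IGNORECASE))
            for pattern in patterns
        )
        for label, patterns in rules.items()
    }
    best = max(scores, key=scores.get)
    # A document that triggers no rule is left for manual routing.
    return best if scores[best] > 0 else default
```

For example, `classify("Please cancel my policy")` returns `"cancellation"`, while a text that triggers no rule falls back to `"unclassified"` so it can be routed by hand.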

An IBM Proof of Technology




Distinguishing characteristics

Application Portability & Integration
● Data shared with Hadoop ecosystem
● Comprehensive file format support
● Superior enablement of IBM and third-party software

Performance
● Modern MPP runtime
● Powerful SQL query rewriter
● Cost-based optimizer
● Optimized for concurrent user throughput
● Results not constrained by memory

Federation
● Distributed requests to multiple data sources within a single SQL statement
● Main data sources supported: DB2 LUW, Teradata, Oracle, Netezza, Informix, SQL Server

Enterprise Features
● Advanced security/auditing
● Resource and workload management
● Self-tuning memory management
● Comprehensive monitoring

Rich SQL
● Comprehensive SQL support
● IBM SQL PL compatibility
● Extensive analytic functions
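The federation point above, a single SQL statement that reaches more than one data source, can be sketched with a small stand-in. This is not Big SQL federation syntax: the example uses SQLite's `ATTACH DATABASE` to join tables that live in two separate databases, and the `orders` and `customers` tables are hypothetical.

```python
import sqlite3

# Two independent in-memory databases play the role of two data sources.
conn = sqlite3.connect(":memory:")                 # first source ("main")
conn.execute("ATTACH DATABASE ':memory:' AS crm")  # second, separate source

conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob")],
)

# One statement joins data held in both sources.
federated = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM main.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```

The caller writes one query and receives one result set; where each table physically lives is resolved by the engine, which is the behaviour the Federation bullet describes.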


Big SQL – Behind the scenes

Big SQL is derived from an existing IBM shared-nothing RDBMS
– A very mature MPP architecture
– Already understands distributed joins and optimization

Behavior is sufficiently different
– Certain SQL constructs are disabled
– Traditional data warehouse partitioning is unavailable
– New SQL constructs introduced

On the surface, porting a shared-nothing RDBMS to a shared-nothing cluster (Hadoop) seems easy, but …

[Figure: Traditional Distributed RDBMS Architecture, showing four database partitions]
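As a rough illustration of what "distributed joins" means in a shared-nothing design, the sketch below hash-partitions two tables on the join key so that each partition can join its own rows without seeing any other partition's data. It is a toy model in plain Python, not Big SQL's implementation; the function names, the dictionary row format and the default of four partitions are assumptions made for the example.

```python
from collections import defaultdict


def partition(rows, key, n_partitions):
    """Route each row to a partition by hashing its join key."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        parts[hash(row[key]) % n_partitions].append(row)
    return parts


def local_join(left, right, key):
    """Hash join performed entirely inside one partition."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def distributed_join(left, right, key, n_partitions=4):
    """Partition both inputs the same way, join locally, gather the results."""
    left_parts = partition(left, key, n_partitions)
    right_parts = partition(right, key, n_partitions)
    result = []
    for lp, rp in zip(left_parts, right_parts):
        # Rows with equal keys always hash to the same partition, so no
        # partition needs another partition's data to finish its join.
        result.extend(local_join(lp, rp, key))
    return result
```

Because both inputs use the same hash function, matching rows always land in the same partition. That co-location is what lets each node work independently, and it is also what becomes harder on Hadoop, where the slide notes that traditional warehouse partitioning is unavailable.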


Architecture Overview

[Figure: Big SQL architecture. Components shown: Big SQL Master; Big SQL Scheduler; DDL Engine; Database Service; Hive Metastore; and three worker nodes, each with a Big SQL Worker (Native I/O Engine, Java I/O Engine), Temp Data, HBase, an HDFS Data Node with HDFS data, an MR Task Tracker, and Other Service.]
