Big Data & Mature BI: OANYC Summit
TRANSCRIPT
Hadoop meets Mature BI: Where the rubber meets the road for Data Scientists
Michael Hiskey
Futurist + Product Evangelist
VP, Marketing & Business Development, Kognitio
The Data Scientist: Sexiest job of the 21st Century?
Key Concept: Graduation
Projects will need to graduate from the Data Science Lab and become part of Business as Usual.
Demand for the Data Scientist
Organizational appetite for tens, not hundreds
Don't be a Railroad Stoker!
Highly skilled engineering required… but the world innovated around them.
Business Intelligence
• Numbers, tables, charts, indicators
• Time: history, lag
• Access: to view (portal), to data, to depth; control/secure
• Consumption: digestion
…with ease and simplicity
Straddle IT and Business:
• Faster, lower latency
• More granularity
• Richer data model
• Self-service
What has changed?
More connected-users? More-connected users?
According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005. In 2010 this was 1,200 exabytes.
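A back-of-envelope check on those figures: 1,200 / 150 = 8x growth in five years, i.e. 8^(1/5) ≈ 1.52, or roughly 50% compound annual growth.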
Data flow
Data Variety
Respondents were asked to choose up to two descriptions of how their organizations view big data from the choices above. Choices have been abbreviated, and selections have been normalized to sum to 100%. n = 1,144.
Source: IBM Institute for Business Value / Saïd Business School Survey
What? New value comes from your existing data: Dark Data
Hadoop ticks many, but not all, of the boxes
Talk to the BI team about plugging into Hadoop: should be simple?
• No need to pre-process
• No need to align to a schema
• No need to triage
New economics = new attitude: just grab and retain all data; the data science team will dig into it later (see the sketch below).
NoSQL is a cool idea for storage… not so much for our BI tools.
Null storage concerns.
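A hedged illustration of that "grab everything, decide later" approach: raw files can be exposed to SQL with no ETL by declaring a Hive external table over them, with the schema applied on read rather than on load. The table name, columns, and path here are hypothetical.

-- Hypothetical example: expose raw, unprocessed log files to SQL
-- without pre-processing; schema is applied on read, not on load.
CREATE EXTERNAL TABLE raw_clicks (
  event_time STRING,
  user_id    STRING,
  url        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/clicks/';   -- files land here as-is; no triage needed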
[Chart: The drive for deeper understanding. Analytical complexity plotted against technology/automation, ranging from Reporting & BPM and Campaign Management, through Fraud Detection and Dynamic Interaction, to Clustering, Statistical Analysis, Behaviour Modelling, Dynamic Simulation, and Machine Learning Algorithms.]
Hadoop is just too slow for interactive BI!
…loss of train of thought.
"While Hadoop shines as a processing platform, it is painfully slow as a query tool."
Analytics needs low latency, no I/O wait
High speed in-memory processing
Analytical Platform: Reference Architecture
• Application & Client Layer: all BI tools, all OLAP clients, Excel, reporting, …
• Analytical Platform Layer: with near-line storage (optional)
• Persistence Layer: Hadoop clusters, enterprise data warehouses, legacy systems, cloud storage
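To make the layering concrete, a minimal illustrative sketch: the analytical platform layer presents data persisted in Hadoop and in the EDW as ordinary tables, so any BI tool can join them in a single SQL statement. The schema and table names (hadoop_ext.weblogs, edw.customers) are hypothetical.

-- Illustrative only: weblog detail persisted in Hadoop, customer master
-- in the enterprise data warehouse, joined in the analytical layer where
-- both are held in memory for low-latency BI queries.
select c.segment,
       count(*)                  as page_views,
       count(distinct w.user_id) as visitors
from hadoop_ext.weblogs w
join edw.customers c on c.user_id = w.user_id
group by c.segment;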
The Future
• Big Data
• Advanced Analytics
• In-memory
• Logical Data Warehouse
• Predictive Analytics
• Data Scientists
Connect
www.kognitio.com
twitter.com/kognitio
linkedin.com/companies/kognitio
tinyurl.com/kognitio
youtube.com/kognitio
NA: +1 855 KOGNITIO
EMEA: +44 1344 300 770
Hadoop meets Mature BI: Where the rubber meets the road for Data Scientists
• The key challenge for Data Scientists is not the proliferation of their roles, but the ability to "graduate" key Big Data projects from the "Data Science Lab" and productionize them into their broader organizations.
• Over the next 18 months, "Big Data" will become just "Data"; this means everyone (even business users) will need a way to use it, without reinventing the way they interact with their current reporting and analysis.
• Doing this requires interactive analysis with existing tools and massively parallel code execution, tightly integrated with Hadoop. Your Data Warehouse is dying; Hadoop will elicit a material shift away from price per TB in persistent data storage.
The new bounty hunters: Drill, Impala, Pivotal, Stinger
The NoSQL Posse
Wanted, Dead or Alive: SQL
It's all about getting work done
Bottlenecks
Tasks are evolving:
• Used to be a simple fetch of a value
• Then a calculated dynamic aggregate
• Now complex algorithms! (see the queries below)
create external script LM_PRODUCT_FORECAST environment rsint
receives ( SALEDATE DATE, DOW INTEGER, ROW_ID INTEGER,
           PRODNO INTEGER, DAILYSALES INTEGER )
partition by PRODNO order by PRODNO, ROW_ID
sends ( R_OUTPUT varchar )
isolate partitions
script S'endofr(
# Simple R script to run a linear fit on daily sales
prod1<-read.csv(file=file("stdin"), header=FALSE, row.names=1)
colnames(prod1)<-c("DOW","ID","PRODNO","DAILYSALES")
dim1<-dim(prod1)
daily1<-aggregate(prod1$DAILYSALES, list(DOW = prod1$DOW), median)
daily1[,2]<-daily1[,2]/sum(daily1[,2])
basesales<-array(0,c(dim1[1],2))
basesales[,1]<-prod1$ID
basesales[,2]<-(prod1$DAILYSALES/daily1[prod1$DOW+1,2])
colnames(basesales)<-c("ID","BASESALES")
fit1=lm(BASESALES ~ ID, as.data.frame(basesales))
forecast<-array(0,c(dim1[1]+28,4))
colnames(forecast)<-c("ID","ACTUAL","PREDICTED","RESIDUALS")
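A hedged sketch of how such a script might be invoked, assuming Kognitio's external-script FROM-clause pattern; the SALES_HISTORY source table is hypothetical. Each PRODNO partition is streamed to its own R instance, and each returns forecast rows as R_OUTPUT text.

-- Hedged sketch: feed per-product daily sales through the R script above.
select R_OUTPUT
from (external script LM_PRODUCT_FORECAST
      from (select SALEDATE, DOW, ROW_ID, PRODNO, DAILYSALES
            from SALES_HISTORY)) fc;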
select Trans_Year, Num_Trans,
       count(distinct Account_ID) Num_Accts,
       sum(count(distinct Account_ID)) over (partition by Trans_Year order by Num_Trans) Total_Accts,
       cast(sum(total_spend)/1000 as int) Total_Spend,
       cast(sum(total_spend)/1000 as int) / count(distinct Account_ID) Avg_Yearly_Spend,
       rank() over (partition by Trans_Year order by count(distinct Account_ID) desc) Rank_by_Num_Accts,
       rank() over (partition by Trans_Year order by sum(total_spend) desc) Rank_by_Total_Spend
from ( select Account_ID,
              Extract(Year from Effective_Date) Trans_Year,
              count(Transaction_ID) Num_Trans,
              sum(Transaction_Amount) Total_Spend,
              avg(Transaction_Amount) Avg_Spend
       from Transaction_fact
       where extract(year from Effective_Date) < 2009
         and Trans_Type = 'D'
         and Account_ID <> 9025011
         and actionid in (select actionid from DEMO_FS.V_FIN_actions
                          where actionoriginid = 1)
       group by Account_ID, Extract(Year from Effective_Date) ) Acc_Summary
group by Trans_Year, Num_Trans
order by Trans_Year desc, Num_Trans;
select dept, sum(sales)
from sales_fact
where period between date '2006-05-01' and date '2006-05-31'
group by dept
having sum(sales) > 50000;
select sum(sales) from sales_history where year = 2006 and month = 5 and region=1;
select total_sales from summary where year = 2006 and month = 5 and region=1;
Behind the numbers
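The contrast behind those numbers: the first two queries aggregate detail rows on the fly, while the last simply reads a pre-built summary. A hedged sketch of how such a summary table might be maintained, with table and column names assumed from the queries above:

-- Assumed maintenance step for the summary table read by the last query:
-- aggregate the detail once, so interactive queries avoid the full scan.
insert into summary (year, month, region, total_sales)
select year, month, region, sum(sales)
from sales_history
where year = 2006 and month = 5
group by year, month, region;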
For once, technology is on our side
For the first time we have the full triumvirate:
• Excellent computing power
• Unlimited storage
• Fast networks
…now that RAM is cheap!
Hadoop is…
• inherently disk-oriented
• typically a low ratio of CPU to disk: lots of disks, not so many CPUs