Upload: stewart-obrien

Post on 19-Dec-2015


Page 1: … I again turned over the pages. I came to typhoid fever — read the symptoms — discovered that I had typhoid fever, must have had it for months
Jumpstarting Big Data Projects: Stories from the Field
DBI-B336

Alexei Khalyako, Olivia Klose

Ask The Experts

Following this session at 18:30 in Hall 5
Meet with Microsoft Product Experts
Snacks and Beverages Served

Key and floorplan
CDP: Cloud and Datacenter Platform
DBI: Data Platform and Business Intelligence
DEV: Developer Platform and Tools
EM: Enterprise Mobility
OFC: Office 365
WIN: Windows
AZR: Microsoft Azure
TWC: Trustworthy Computing

Session Objectives & Key Takeaways

Session Objectives
Focus on Azure & HDInsight
Go through typical Big Data questions
Customer use cases
It is NOT a Hadoop tutorial

Key Takeaways
Understand the variety of options in Big Data projects


I have Big Data!


Jerome K. Jerome, Three Men in a Boat

… I again turned over the pages. I came to typhoid fever — read the symptoms — discovered that I had typhoid fever, must have had it for months without knowing it — wondered what else I had got; turned up St. Vitus’s Dance — found, as I expected, that I had that too, — began to get interested in my case, and determined to sift it to the bottom, and so started alphabetically — read up ague, and learnt that I was sickening for it, and that the acute stage would commence in about another fortnight. Bright’s disease, I was relieved to find, I had only in a modified form, and, so far as that was concerned, I might live for years. Cholera I had, with severe complications; and diphtheria I seemed to have been born with. I plodded conscientiously through the twenty-six letters, and the only malady I could conclude I had not got was housemaid’s knee.

Overview

Demand
Architecture
Data Loading
Data Preparation
Analytics
Validation

Overview

Do I have Big Data?
Which platform? (The Agony of Choice)
How do I get my data?
How do I pre-process my data?
How do I analyze my data?
How do I validate my architecture?


Do I really have Big Data?

Up to 75 control units in 1 vehicle
About 1,000 possible individual equipment options
1 GB car software, 15 GB data on board (incl. navigation)
2,000 user functions implemented
12,000 types of error stored on board for diagnosis
Up to 60,000 car diagnoses daily worldwide


“We have structured data”</meldungText><antwort>False</antwort><wert>na</wert></meldung><steuergeraet sgbdVariante="SMG_60"><steuergeraeteFunktion zeitstempel="2013-04-30T09:00:37.9926171-04:00" endDate="2013-04-30T09:00:38.1158609-04:00" jobName="STATUS_FAHRZEUGTESTER"><datensatz satzNr="1"><result name="JOB_STATUS">OKAY</result><result name="_TEL_ANTWORT">80 F1 18 70 70 02 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 82 6B 00 6D 6B 39 CD 14 00 14 00 00 0E 00 15 00 0A 00 19 00 0C 00 12 00 15 85 57 71 88 81 C0 7D 73 C2 08 01 05 02 F7 00 FF FF 01 73 00 00 02 A8 00 C2 00 01 E0 00 00 00 00 00 00 3D 01 00 00 00 01 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 FD 01 E1 02 05 01 F8 03 4F FF AD 04</result><result name="_TEL_AUFTRAG">83 18 F1 30 02 01</result><result name="STAT_KL15_ROH">0</result><result name="STAT_KLR_EIN_ROH">0</result><result name="STAT_WAKE_UP_ROH">1</result><result name="STAT_ISTGANG_TEXT">Neutral</result><sgFunktion zeitstempel=“2013-04-30T10:33:37.0834084+02:00" endDate="2013-04-30T10:33:37.9310504+02:00" jobName="_FLM_LESEN_BOSCH"><datensatz satzNr="1"><result name="FLM_DATEN_1">00 00 00 03 02 08 C6 56 46 4C 4D 39 00 16 4B B2 00 00 00 32 00 00 06 99 00 00 00 65 00 00 18 6E 00 00 00 73 00 00 00 20 00 00 00 73 00 00 00 00 00 00 10 69 00 00 0F 53 00 00 00 2C 00 00 00 0A 00 00 79 6D 00 00 B7 34 00 00 D3 9E 4A 4C 41 52 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 2C 00 00 00 00 00 00 1A 5C 00 15 4B CA 00 00 44 08 00 00 2D 39 00 00 1E 45 00 00 26 89 00 00 1E EB 00 00 0C 65 00 00 04 47 00 00 00 00 00 00 00 00 00 00 00 04 00 00 00 27 00 00 01 1E 00 00 02 AB 00 00 07 71 00 00 13 D7 00 00 36 48 00 15 91 AD 00 00 3F 97 00 00 19 C1 00 00 07 F9 00 00 02 D4 00 00 00 BD 00 00 00 20 00 16 1C 42 00 00 18 B1 00 00 09 40 00 00 08 9F 00 00 04 3A 00 00 01 3E 00 01 8C D7 00 00 61 A3 00 00 37 9D 00 00 1E 78 00 00 14 96 00 00 0A 71 00 00 05 49 00 00 02 B1 00 00 00 A7 00 00 00 1D 00 00 00 09 00 00 00 05 00 00 00 00 00 00 00 00 00 00 23 BB 00 00 2F 
84 00 00 14 EF 00 00 09 40 00 00 04 71 00 00 03 34 00 00 02 12 00 00 01 AC 00 00 01 59 00 00 0B C4 00 00 00 06 00 00 00 38 00 00 00 19 00 00 00 01 00 00 00 00 00 00 00 04 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 52 4F 54 48 00 00 00 00 00 00 00 07 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00 00 56 30 00 00 00 03 00 11 00 01 01 06 00 00 00 00 00 00 00 00 00 01 00 00 00 0E 00 05 00 1A 00 12 00 00 00 26 00 00 00 00 00 0B 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 44 44 00 43 00 16 00 08 00 0D 00 04 00 02 00 00 00 02 00 11 00 20 00 1A 00 0A 00 15 00 0F 00 1B 00 13 00 08 00 08 00 00 00 00 00 07 00 0E 00 08 00 04 00 02 00 01 00 00 00 6D 00 03 00 02 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0A 00 21 00 15 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0B 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 18 05 1F 00 00 00 00 00 00 00 00 00 1F 00 03 00 02 00 00 00 00 00 00 00 20 00 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 62 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 2E 00 00 1B 00 19 00 18 00 0D 00 00 00 00 00 00 00 01 00 01 00 02 00 00 06 00 01 E6 00 00 12 00 03 00 02 00 07 00 00 00 00 00 00 00 00 00 00 00 00 00 04 00 02 01 BA 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 24 00</result><result name="FLM_DATEN_2">08 00 00 00 00 00 00 00 00 00 00 0C 00 80 1B 00 45 10 00 A6 0D 00 51 16 00 59 44 00 00 EB 00 00 CA 00 00 49 00 00 17 00 10 00 0C 00 05 00 04 00 06 00 02 00 01 00 00 00 00 12 00 00 3A…

Here!
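The point of the slide seems to be that "structured" XML can still carry opaque payloads: most of the record above is space-separated hex bytes inside `result` elements. A minimal Python sketch of separating the two kinds of value follows. The `SAMPLE` excerpt is a hypothetical, well-formed cut-down of the record (element and attribute names are taken from it), and `is_hex_dump` and `parse_results` are illustrative helper names, not part of any toolchain mentioned in the session.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed excerpt modelled on the diagnostic record above.
SAMPLE = """
<datensatz satzNr="1">
  <result name="JOB_STATUS">OKAY</result>
  <result name="_TEL_AUFTRAG">83 18 F1 30 02 01</result>
  <result name="STAT_ISTGANG_TEXT">Neutral</result>
</datensatz>
"""

HEX_DIGITS = set("0123456789abcdefABCDEF")


def is_hex_dump(text):
    """True if the text is only space-separated two-digit hex bytes."""
    parts = text.split()
    return bool(parts) and all(
        len(p) == 2 and set(p) <= HEX_DIGITS for p in parts
    )


def parse_results(xml_text):
    """Return {name: value}; hex dumps become bytes, other values stay text."""
    record = {}
    for result in ET.fromstring(xml_text).iter("result"):
        text = (result.text or "").strip()
        name = result.get("name")
        record[name] = bytes.fromhex(text) if is_hex_dump(text) else text
    return record


record = parse_results(SAMPLE)
for name, value in record.items():
    print(name, "->", value)
```

The decoded bytes still need a device-specific layout to mean anything, which is why data like this tends to need a real pre-processing step before analysis.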


Recommendation Engine

IIS Logs

Table Storage

Blob

Online Recommender

Which Platform? The Agony of Choice

Big Data State of the Art


Do I really need Hadoop?

Chart: platform options by Velocity (Batch to Realtime) and Variety (Highly Structured to Poly-Structured)
Generalized NoSQL
Hadoop
Standard SQL or MPP Appliances
Specialized NoSQL
Streaming
In-Memory Analytics

Agony of Choice: Architecture

On-Premise ↔ Cloud
Azure ↔ AWS
HDInsight (PaaS) ↔ Hadoop on Azure (IaaS)
Windows ↔ Linux
C# / .NET ↔ Java
Microsoft ↔ Open Source

Hadoop Deployment Options in Azure

HDInsight
Hadoop on Azure

Automated Deployment in Azure

HDInsight
Need to KNOW configuration BEFORE deploying cluster
PowerShell: http://aka.ms/HDIpowershell
Azure Data Factory
Azure Automation

Hadoop on Azure
Hortonworks or Cloudera
GitHub / CodePlex: https://github.com/lararubbelke/Azure-DDP/

PowerShell Deployment


HDInsight Configuration

Supported configuration files (hadoop dist):
core-site.xml
hdfs-site.xml
mapred-site.xml
capacity-scheduler.xml

HDInsight Configuration – Hive

Supported configuration files (hive dist):
hive-site.xml

PowerShell Deployment – Configuration

# Core (core-site.xml) settings to override
$coreConfig = @{
    "io.compression.codec" = "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec";
    "io.sort.mb" = "1024";
}

# MapReduce (mapred-site.xml) settings to override
$mapredConfig = New-Object 'Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightMapReduceConfiguration'
$mapredConfig.Configuration = @{
    "mapred.tasktracker.map.tasks.maximum" = "2";
}

# Base cluster configuration with its default storage account
$clusterConfig = New-AzureHDInsightClusterConfig -ClusterSizeInNodes $numberNodes `
    | Set-AzureHDInsightDefaultStorage -StorageAccountName $fqStorageAccountName -StorageAccountKey $storageAccountKey `
        -StorageContainerName ($storageContainer.Name)

$continueCheck = Read-Host "Attach additional storage accounts? (yes to continue)"

if ($continueCheck -eq "yes") {
    # Create five extra storage accounts and attach each one to the cluster configuration
    foreach ($asa in 1..5) {
        $newStorageAccountName = ($clusterPrefix + [DateTime]::Now.ToString("yyyyMMddHHmmss") + "a" + $asa)
        New-AzureStorageAccount -StorageAccountName $newStorageAccountName -Location "North Europe"
        $clusterConfig = $clusterConfig | Add-AzureHDInsightStorage `
            -StorageAccountName ($newStorageAccountName + ".blob.core.windows.net") `
            -StorageAccountKey (Get-AzureStorageKey $newStorageAccountName).Primary
    }
}

# Apply the custom core and MapReduce settings
$clusterConfig = $clusterConfig | Add-AzureHDInsightConfigValues -Core $coreConfig -MapReduce $mapredConfig
# At this point we are able to create an HDInsight cluster with a customised configuration

Changing cluster configuration settings when deploying:
http://aka.ms/HDIconfiguration

HDInsight Configuration – Oozie

Supported configuration files (oozie dist):
oozie-site.xml

Configuration best practices

HDInsight is on-demand compute power
Store important scripts in Blob storage for re-use
Do not rely on HDFS, as it is NOT the default file system

Example: Oozie job configuration

nameNode=wasb://container_name@storage_name.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.wf.application.path=wasb:///user/admin/examples/apps/ooziejobs
outputDir=ooziejobs-out
oozie.use.system.libpath=true
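Since clusters are thrown away and HDFS is not the default file system, it can help to check a job configuration for paths that do not point at Blob storage before submitting it. The sketch below is one possible check, not an Oozie feature: `parse_properties` and `non_wasb_paths` are hypothetical helper names, and the rule "any URI whose scheme is not wasb or wasbs is suspicious" is an assumption made for illustration.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def non_wasb_paths(props):
    """Return the keys whose value is a URI with a scheme other than wasb(s)."""
    return sorted(
        key
        for key, value in props.items()
        if "://" in value and not value.startswith(("wasb://", "wasbs://"))
    )


# The job configuration from the slide.
JOB_PROPERTIES = """
nameNode=wasb://container_name@storage_name.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.wf.application.path=wasb:///user/admin/examples/apps/ooziejobs
outputDir=ooziejobs-out
oozie.use.system.libpath=true
"""

props = parse_properties(JOB_PROPERTIES)
print("paths not on Blob storage:", non_wasb_paths(props))
```

For the configuration on the slide the list is empty; a value such as `hdfs://...` would be reported by its key.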


Automation: Self-Made

Provision Cluster
Run Hive/Pig Script
Shut down Cluster

Challenges
Troubleshooting cluster provisioning failures
Serialized workflow execution
Troubleshooting job failures: Oozie or Hive/Pig

Diagram labels: Company Infrastructure and HDInsight Cluster (first build); Azure Automation and HDInsight Cluster (second build)
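The provision, run, shut down cycle can be sketched as a small driver. This is a generic illustration in Python, not the speakers' automation: `run_on_demand` and the three callables passed to it are hypothetical stand-ins for whatever actually provisions the cluster, submits the Hive/Pig scripts and deletes the cluster. It shows the serialized execution named under Challenges, and a `finally` block so the cluster is shut down even when a job fails.

```python
def run_on_demand(provision, run_script, shut_down, scripts):
    """Provision a cluster, run scripts one after another, always shut down."""
    log = []
    cluster = provision()  # if this raises, there is nothing to shut down yet
    log.append("provisioned")
    try:
        for script in scripts:
            run_script(cluster, script)  # serialized: one script at a time
            log.append("ran " + script)
    finally:
        shut_down(cluster)  # runs on success and on job failure
        log.append("shut down")
    return log


# Stub callables so the sketch runs without any cloud access.
def fake_provision():
    return "demo-cluster"


def fake_run(cluster, script):
    print("running", script, "on", cluster)


def fake_shut_down(cluster):
    print("deleting", cluster)


print(run_on_demand(fake_provision, fake_run, fake_shut_down,
                    ["prepare.pig", "segment.hql"]))
```

A failing script still propagates its exception to the caller, so the failure stays visible for troubleshooting while the cluster no longer accrues cost.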


Automation via Azure Data Factory

Incoming Data: IIS Logs

Automation via Azure Data Factory

Diagram: three ADF pipelines over Azure Blob data (legend: ADF Table, ADF Activity; note: Concurrent Execution)

Product Data Augmenter ADF Pipeline (Daily, Workflow Step B)
Input Data: Raw JSON Event Files, ProductFeed (Azure Blob)
Hive activities: ProductDataAugmented1, ProductDataAugmented2, ProductDataAugmented3, ProductPrepSQL
Tables: ProductsParent, Attributes, Variants, ProductsToSql
ProductAugmenter: Resolves JSON event files, augments the latest product data, stores variants/attributes and prepares data to be loaded into SQL.

Product Polarizer ADF Pipeline (Daily, Workflow Step K)
Pig activity: ProductPolarizer
ProductPolarizer: Retrieves polarizing product information and stores it in a suitable format for SQL.

Product Segmenter ADF Pipeline (Daily, Workflow Step M)
Hive activity: ProductSegmenter

How do I get my Data?

Where the Data was parked

Databases
SQL Azure
SQL IaaS

Storage Account
Table Storage: http://aka.ms/HDItablestorage
Blob Storage


What Data do I have?

.txt
.csv
.xml

Note: Hadoop does not do well with lots of small files: http://aka.ms/HDI_smallfiles
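One common way around the small-files problem is to merge files into larger ones before they reach the cluster. The Python sketch below shows the idea under stated assumptions: `plan_batches` and `merge_files` are hypothetical helpers, the files are line-oriented text such as .csv or .txt, and the target size is something you would tune yourself. The session does not say how the speakers handled this.

```python
from pathlib import Path


def plan_batches(sizes, target_bytes):
    """Greedily group (name, size) pairs into batches of roughly target_bytes."""
    batches, current, current_size = [], [], 0
    for name, size in sizes:
        # Start a new batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


def merge_files(paths, out_path):
    """Concatenate line-oriented files, keeping each file's last line intact."""
    with open(out_path, "wb") as out:
        for path in paths:
            data = Path(path).read_bytes()
            out.write(data)
            if data and not data.endswith(b"\n"):
                out.write(b"\n")


# Illustrative sizes in bytes; real targets would be far larger.
files = [("a.csv", 40), ("b.csv", 40), ("c.csv", 40), ("d.csv", 10)]
print(plan_batches(files, target_bytes=100))
```

Plain concatenation does not work for .xml documents, which need a container format or a conversion step instead.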

How do I pre-process Data?

Data Querying Optimizations

Analytical type of workload
Large, incrementally growing Fact tables
Data Warehouse type of workload

Applied to Recommendation Engine

Table Storage
Store the customer-related data
Use an appropriate partition strategy
A poor partition strategy can hurt performance significantly
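Why the partition strategy matters can be shown with a toy comparison. The slides do not say which key was used, so everything here is an assumption for illustration: `partition_key` is a hypothetical helper that hashes a customer id into a fixed number of buckets, and the alternative is a key that is identical for all rows written on the same day.

```python
import hashlib
from collections import Counter


def partition_key(customer_id, buckets=16):
    """Map a customer id to one of `buckets` partitions with a stable hash."""
    digest = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    return "p%02d" % (int(digest, 16) % buckets)


customers = ["customer-%04d" % i for i in range(1000)]

# Strategy 1: key by date, so every write of the day lands in one partition.
by_date = Counter("2014-01-01" for _ in customers)

# Strategy 2: key by hashed customer id, so writes spread over the buckets.
by_hash = Counter(partition_key(c) for c in customers)

print("by date:", len(by_date), "partition(s), largest holds", max(by_date.values()))
print("by hash:", len(by_hash), "partition(s), largest holds", max(by_hash.values()))
```

A single hot partition serializes the load, while spreading keys trades that for queries that have to touch several partitions. Which side wins depends on the access pattern.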


Running the Workflow

Data stored in Blob: accessible from multiple services inside and outside HDI
Pig is used for preparing the data sets
Mahout for running clustering algorithms
Schedule the jobs using Oozie (now moving to ADF)

How do I analyze my Data?

Analytics: What was wanted?

Decision Trees
Recommendation Engine

Analytics Options

Mahout
Open source
Write your own code
By default on HDInsight

Azure ML
Visual Composition: UI, Drag & Drop
Modules
Extensible / Support for R
Support for Collaboration
Support for Data Science Process

Mahout Demo

Run Random Forests!

What are Random Forests?

Hang on...

One Decision Tree

One Random Tree

A Random Forest
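The step from one decision tree to a random forest can be sketched in a few lines of plain Python. This is a deliberately simplified illustration and not Mahout's implementation: each "tree" is a depth-1 stump, randomness comes only from bootstrap sampling (real random forests also pick random feature subsets at each node), and the toy rows and all function names are made up for the example.

```python
import random
from collections import Counter


def train_stump(rows):
    """One tiny decision tree: the single split with the fewest training errors."""
    best = None
    for feature in range(len(rows[0][0])):
        for x, _ in rows:
            threshold = x[feature]
            left = [y for xi, y in rows if xi[feature] <= threshold]
            right = [y for xi, y in rows if xi[feature] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    _, feature, threshold, left_label, right_label = best
    return lambda x: left_label if x[feature] <= threshold else right_label


def train_forest(rows, n_trees=25, seed=0):
    """Random trees: each stump is trained on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]


def predict(forest, x):
    """A random forest classifies by majority vote over its trees."""
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]


# Toy rows: (features, label); the first feature separates the two classes.
rows = [((1.0, 5), "normal"), ((2.0, 9), "normal"), ((3.0, 1), "normal"),
        ((4.0, 6), "normal"), ((10.0, 4), "anomaly"), ((11.0, 8), "anomaly"),
        ((12.0, 2), "anomaly"), ((13.0, 7), "anomaly")]

forest = train_forest(rows)
print(predict(forest, (0.5, 7)), predict(forest, (20.0, 7)))
```

Individual trees differ because each sees a different sample; the vote smooths out their individual mistakes.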

Mahout Demo

Mahout: Run Random Forests

1. Where does the data need to be?
2. Generate descriptor file
3. Build forest
4. Classify test data

1. Get Data

hdfs dfs -cp wasb://<container>@<storage_account>.blob.core.windows.net/user/<remote_user>/testdata/KDDTrain+.arff wasb://<container>@<storage_account>.blob.core.windows.net/user/hdp/testdata/KDDTrain+.arff

hdfs dfs -cp wasb://<container>@<storage_account>.blob.core.windows.net/user/<remote_user>/testdata/KDDTest+.arff wasb://<container>@<storage_account>.blob.core.windows.net/user/hdp/testdata/KDDTest+.arff


2. Generate Descriptor File

hadoop jar C:\apps\dist\mahout-0.9\mahout-core-0.9-job.jar org.apache.mahout.classifier.df.tools.Describe -p wasb:///user/hdp/testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
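The `-d` argument is a compact description of the columns. As I understand the Mahout descriptor convention, N is a numerical attribute, C a categorical one, L the label, and a number repeats the token that follows it. A minimal Python sketch that expands the string from the command above under that reading (`expand_descriptor` is a hypothetical helper, not a Mahout function):

```python
from collections import Counter


def expand_descriptor(descriptor):
    """Expand e.g. 'N 3 C L' into ['N', 'C', 'C', 'C', 'L']."""
    expanded = []
    repeat = 1
    for token in descriptor.split():
        if token.isdigit():
            repeat = int(token)  # applies to the next attribute token
        else:
            expanded.extend([token] * repeat)
            repeat = 1
    return expanded


columns = expand_descriptor("N 3 C 2 N C 4 N C 8 N 2 C 19 N L")
print(len(columns), "columns:", dict(Counter(columns)))
```

Under this reading the descriptor covers 42 columns: 34 numerical, 7 categorical and 1 label. A mismatch with the actual column count of the .arff file would be worth catching before building the forest.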


3. Build Forest

hadoop jar C:\apps\dist\mahout-0.9\mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d wasb:///user/hdp/testdata/KDDTrain+.arff -ds wasb:///user/hdp/testdata/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest

Parameter labels: -d Data, -ds Dataset, -sl Selection, -p Partial, -t #Trees, -o Output
Additional label on the slide: Leaf size
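The `-Dmapred.max.split.size` value ties into the partial (`-p`) mode: my understanding is that each input split goes to one mapper, which builds its share of the trees from that split only. A back-of-the-envelope Python sketch of that arithmetic follows. The training file size used here is a hypothetical round number, not a measured value, and `plan_partial_build` is an illustrative helper.

```python
import math


def plan_partial_build(file_size_bytes, max_split_bytes, n_trees):
    """Estimate the number of input splits and the trees built per split."""
    splits = math.ceil(file_size_bytes / max_split_bytes)
    return splits, n_trees / splits


# Hypothetical training file size; split size and tree count are from the command.
splits, trees_per_split = plan_partial_build(
    file_size_bytes=18_700_000,
    max_split_bytes=1_874_231,
    n_trees=100,
)
print(splits, "splits, about", trees_per_split, "trees per split")
```

Smaller splits mean more parallelism but less data behind each tree, so the split size is a knob for both runtime and model quality.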

3. Build Forest – Copy Data

hdfs dfs -cp wasb://<container>@<storageaccount>.blob.core.windows.net/user/<remoteuser>/nsl-forest wasb://<container>@<storageaccount>.blob.core.windows.net/user/hdp/nsl-forest

4. Classify Test Data

hadoop jar C:\apps\dist\mahout-0.9\mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i wasb:///user/hdp/testdata/KDDTest+.arff -ds wasb:///user/hdp/testdata/KDDTrain+.info -m wasb:///user/hdp/nsl-forest -a -mr -o predictions

4. Classify Test Data

Confusion matrix (rows: actual, columns: predicted)

                  Predicted normal   Predicted anomaly
Actual normal     9,458              253
Actual anomaly    8,325              4,508

accuracy = (# correctly classified instances) / (# classified instances)
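Applying the accuracy formula to the confusion matrix counts from the slide (9,458 and 253 for actual normal; 8,325 and 4,508 for actual anomaly) is a short calculation. The Python sketch below also computes the share of actual anomalies that were caught, which is my addition and not a figure from the slides.

```python
# Keys are (actual, predicted); counts are taken from the slide.
confusion = {
    ("normal", "normal"): 9458,
    ("normal", "anomaly"): 253,
    ("anomaly", "normal"): 8325,
    ("anomaly", "anomaly"): 4508,
}

correct = sum(n for (actual, predicted), n in confusion.items() if actual == predicted)
total = sum(confusion.values())
accuracy = correct / total

# Share of actual anomalies that were predicted as anomalies.
actual_anomalies = confusion[("anomaly", "normal")] + confusion[("anomaly", "anomaly")]
anomaly_recall = confusion[("anomaly", "anomaly")] / actual_anomalies

# Roughly 0.62 accuracy and 0.35 anomaly recall.
print(correct, "of", total, "correct:", round(accuracy, 4))
print("anomalies caught:", round(anomaly_recall, 4))
```

The overall accuracy hides that most anomalies are classified as normal, which is why the per-class view is worth a look.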


4. Classify Test Data – Output?

http://aka.ms/mahout


Validation & Troubleshooting

Managing Solution

Performance
How many cores does each workload use?
How much data is ingested?

Scalability
How do I get more compute/storage?
Does the workload utilize the capacities?

Manageability
PaaS almost takes care of itself.
Still needs managing, e.g. storage account

Monitoring and Troubleshooting

Compute
Mahout / Pig calculations

I/O
HDFS: IaaS VM max 16 TB of space
Blob: different latency and throughput characteristics

Troubleshooting Pig in Recommender

Workflow
Load data from the WASB files
Get product and session data
Join customer and product data
Clean up (duplicates, filters, etc.)
Store

Job fails after running for 4 hours

Troubleshooting Pig in Recommender

Intelligent parallelism
Logical Plan
Physical Plan
Reduce Plan may limit execution to single node

Job fails after running for 4 hours

Specific to HDInsight

HDInsight is PaaS
No Admin rights
No access to the data nodes

Logs know all about the system
Use RDP session!
Get all you may need: Oozie, Pig

Storage Account: Advanced AnalyticsExplains application to storage interactionsVery useful counters

AuthorizationError,Availability,AverageE2ELatency,AverageServerLatency,ClientTimeoutError,NetworkError,PercentAuthorizationError,PercentNetworkError,PercentSuccess,ServerTimeoutError,Success,ThrottlingErrorTimestamp,TotalBillableRequests,TotalEgress,TotalIngress,TotalRequests

Application

Storage throttling

When?

[Chart: Data exchange. Sum of TotalIngress and Sum of TotalEgress for Jobs and Storage; vertical axis from 0 to 14,000,000,000]

Page 71: … I again turned over the pages. I came to typhoid fever — read the symptoms — discovered that I had typhoid fever, must have had it for months

Mapping Application and Storage Logs

[Chart: labels "Jobs", "Storage" and "Total"; vertical axis from 0 to 1200]

2014-03-26 22:28:37,321 INFO CallbackServlet:539 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000000-140326181153083-oozie-hdp-W] ACTION[0000000-140326181153083-oozie-hdp-W@pig-node-01] callback for action [0000000-140326181153083-oozie-hdp-W@pig-node-01]
2014-03-26 22:28:37,472 INFO PigActionExecutor:539 - USER[Admin] GROUP[-] TOKEN[] APP[receipts-products-mahout] JOB[0000000-140326181153083-oozie-hdp-W] ACTION[0000000-140326181153083-oozie-hdp-W@pig-node-01] action completed, external ID [job_201403261811_0001]
2014-03-26 22:28:37,562 WARN PigActionExecutor:542 - USER[Admin] GROUP[-] TOKEN[] APP[receipts-products-mahout] JOB[0000000-140326181153083-oozie-hdp-W] ACTION[0000000-140326181153083-oozie-hdp-W@pig-node-01] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.PigMain], exit code [2]
2014-03-26 22:28:38,101 INFO ActionEndXCommand:539 - USER[Admin] GROUP[-] TOKEN[] APP[receipts-products-mahout] JOB[0000000-140326181153083-oozie-hdp-W] ACTION[0000000-140326181153083-oozie-hdp-W@pig-node-01] ERROR is considered as FAILED for SLA
2014-03-26 22:28:38,228 WARN JPAService:542 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor [WorkflowActionGetJPAExecutor] ended with an active transaction, rolling back
2014-03-26 22:28:38,343 INFO ActionStartXCommand:539 - USER[Admin] GROUP[-] TOKEN[] APP[receipts-products-mahout] JOB[0000000-140326181153083-oozie-hdp-W] ACTION[0

High Latency Timeout
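To map application log entries onto storage metrics, each log line first needs a timestamp that can be aligned with the metrics rows. The sketch below parses the prefix of an Oozie log line like the ones above and truncates the timestamp to the hour. The regular expression is inferred from the excerpt only, and the hourly granularity of the storage metrics is an assumption.

```python
import re
from datetime import datetime

# Line prefix as seen in the excerpt: "date time,millis LEVEL Source:line - message".
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})\s+(\w+)\s+(\w+):\d+ - (.*)$"
)


def parse_line(line):
    """Return (timestamp, level, source, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    stamp = stamp.replace(microsecond=int(m.group(2)) * 1000)
    return stamp, m.group(3), m.group(4), m.group(5)


def metrics_hour(stamp):
    """Truncate to the hour, the granularity assumed here for the storage metrics."""
    return stamp.replace(minute=0, second=0, microsecond=0)


# One line copied from the log excerpt in the slides.
sample = (
    "2014-03-26 22:28:37,562 WARN PigActionExecutor:542 - USER[Admin] GROUP[-] TOKEN[] "
    "APP[receipts-products-mahout] JOB[0000000-140326181153083-oozie-hdp-W] "
    "ACTION[0000000-140326181153083-oozie-hdp-W@pig-node-01] "
    "Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.PigMain], exit code [2]"
)

parsed = parse_line(sample)
```

With WARN and ERROR entries bucketed this way, the failing action can be checked against the latency, timeout and throttling counters for the same hour.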

Page 72: … I again turned over the pages. I came to typhoid fever — read the symptoms — discovered that I had typhoid fever, must have had it for months

Wrap Up

Page 73: … I again turned over the pages. I came to typhoid fever — read the symptoms — discovered that I had typhoid fever, must have had it for months

Wrapping up

Have Big Data? Recognizing the big data need
Which platform? HDP (IaaS) vs. HDInsight (PaaS)
Get my data? BLOB preferred (multiple storage accounts)
Pre-process my data? Pig, Hive and others: perform and scale?
Analyze my data? Mahout, Azure ML and others
Validate? Performance: interactions between components?

Page 74: … I again turned over the pages. I came to typhoid fever — read the symptoms — discovered that I had typhoid fever, must have had it for months

Related Content

Breakout Sessions
DBI-B219 Introduction to Hadoop through Azure HDInsight
DBI-B221 TWC | Using Big Data and Machine Learning to Protect Your Online Service
DBI-B335 Hadoop for Windows Deep Dive
DBI-B411 Extending Your Hadoop Distributions in the Cloud

Labs
DBI-H335 Working with Hive in HDInsight
DBI-IL202 Getting Started Using HBase in Microsoft Azure HDInsight
DBI-IL203 Processing WebLogs with HDInsight

Find us later at MSE – Data Platform and Business Intelligence

Page 75: … I again turned over the pages. I came to typhoid fever — read the symptoms — discovered that I had typhoid fever, must have had it for months

Track Resources

Olivia Klose http://blogs.technet.com/b/oliviaklose/

Alexei Khalyako http://alexeikh.wordpress.com/

Big Data Support http://blogs.msdn.com/b/bigdatasupport/

Page 76: … I again turned over the pages. I came to typhoid fever — read the symptoms — discovered that I had typhoid fever, must have had it for months

DBI Track resources

Free SQL Server 2014 Technical Overview e-book: microsoft.com/sqlserver and Amazon Kindle Store

Free online training at Microsoft Virtual Academy: microsoftvirtualacademy.com

Try new Azure data services previews! Azure Machine Learning, DocumentDB, and Stream Analytics

27 Hands on Labs + 8 Instructor Led Labs in Hall 7

Page 77: … I again turned over the pages. I came to typhoid fever — read the symptoms — discovered that I had typhoid fever, must have had it for months

Resources

Learning

Microsoft Certification & Training Resources

www.microsoft.com/learning

TechNet

Resources for IT Professionals

http://microsoft.com/technet

Sessions on Demand

http://channel9.msdn.com/Events/TechEd

Developer Network

http://developer.microsoft.com

Page 78: … I again turned over the pages. I came to typhoid fever — read the symptoms — discovered that I had typhoid fever, must have had it for months

TechEd Mobile app for session evaluations is currently offline

SUBMIT YOUR TECHED EVALUATIONS

Fill out an evaluation via
CommNet Station/PC: Schedule Builder
Login: europe.msteched.com/catalog

We value your feedback!

Page 79: … I again turned over the pages. I came to typhoid fever — read the symptoms — discovered that I had typhoid fever, must have had it for months


Page 80: … I again turned over the pages. I came to typhoid fever — read the symptoms — discovered that I had typhoid fever, must have had it for months

© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.