Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Upload: andrew-brust

Post on 12-Nov-2014

DESCRIPTION

Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack - 24 Hours of PASS

TRANSCRIPT

Page 1: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Global Sponsor:

Big Data on the Microsoft Platform

Andrew J. Brust, CEO, Blue Badge Insights

With Hadoop, MS BI and the SQL Server stack

Page 2: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Meet Andrew

• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 17 years as a speaker
• Founder, Microsoft BI User Group of NYC http://www.msbigdatanyc.com
• Co-moderator, NYC .NET Developers Group http://www.nycdotnetdev.com
• “Redmond Review” columnist for Visual Studio Magazine
• brustblog.com, Twitter: @andrewbrust

Page 3: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Read all about it!

Page 4: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

My New Blog (bit.ly/bigondata)

Page 5: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Agenda

• Big Data, Hadoop and HDInsight
• MapReduce
• Hive ODBC, BI Stack
• Hekaton, NoSQL
• SQL Server Parallel Data Warehouse, MPP, PolyBase

Page 6: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

What is Big Data?

• 100s of TB into PB and higher
• Involving data from: financial data, sensors, web logs, social media, etc.
• Parallel processing often involved
  • Hadoop is emblematic, but other technologies are Big Data too
• Processing of data sets too large for transactional databases
  • Analyzing interactions, rather than transactions
• The three V’s: Volume, Velocity, Variety
• Big Data tech sometimes imposed on small data problems

Page 7: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

What is Hadoop?

• Open source implementation of Google’s MapReduce and GFS (Google File System)
• Allows for scale-out processing of petabyte scale data
  • 1 PB = 1,024 TB
• Also distributed storage
• Commodity hardware
• Can work against flat files, or certain database formats
• Native processing involves imperative Java code
• Other languages supported through “Streaming”

Page 8: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

What is HDInsight?

• Microsoft’s Hadoop distribution, on Windows
  • Most other distros on Linux
• Based on Hortonworks Data Platform (HDP)
• Runs on Azure, eventually on Windows Server, and as sandbox on dev PC
• For .NET devs: .NET SDK for Hadoop, LINQ provider

Page 9: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Demo: HDInsight

Page 10: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

The Hadoop Stack

MapReduce, HDFS

Database

RDBMS Import/Export

Query: HiveQL and Pig Latin

Machine Learning/Data Mining

Log file integration

Page 11: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

MapReduce, in a Diagram

[Diagram: blocks of input feed six mapper nodes; the mappers' output is grouped by key (K1, K2, K3), and each key's records go to one of three reducer nodes, which write the final output.]
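The mapper-to-reducer flow in this diagram can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are made up for this example, and the round-robin key assignment is a stand-in for whatever partitioner a real cluster would use.

```python
from collections import defaultdict

def map_phase(split):
    """Mapper: emit a (key, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]

def shuffle(mapped_outputs, num_reducers):
    """Group mapper output by key and route each key to exactly one reducer."""
    keys = sorted({key for output in mapped_outputs for key, _ in output})
    # Assumption: round-robin over sorted keys, chosen so the sketch is
    # deterministic; a real cluster would use its own partitioner.
    owner = {key: i % num_reducers for i, key in enumerate(keys)}
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapped_outputs:
        for key, value in output:
            partitions[owner[key]][key].append(value)
    return partitions

def reduce_phase(partition):
    """Reducer: aggregate all values that arrived for each key."""
    return {key: sum(values) for key, values in partition.items()}

def run_job(splits, num_reducers=3):
    """Run one mapper per split, shuffle by key, then one reducer per partition."""
    mapped = [map_phase(split) for split in splits]
    partitions = shuffle(mapped, num_reducers)
    return [reduce_phase(partition) for partition in partitions]

if __name__ == "__main__":
    splits = [["big data big"], ["data hadoop"], ["hadoop big"]]
    for i, output in enumerate(run_job(splits)):
        print(f"reducer {i}: {output}")
```

The point to notice is that every occurrence of a given key ends up at the same reducer, no matter which mapper emitted it. That is what lets each reducer produce a final total for its keys without talking to the others.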

Page 12: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

A MapReduce Example

• Count by suite, on each floor

• Send per-suite, per-platform totals to lobby

• Sort totals by platform

• Send two platform packets to 10th, 20th, 30th floor

• Tally up each platform

• Collect the tallies

• Merge tallies into one spreadsheet
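The office-building analogy maps onto code fairly directly. The sketch below is plain Python under made-up assumptions: the survey records, the choice of mobile platforms as the thing being counted, and every function name are hypothetical. The three "tally floors" are modeled by bundling the sorted totals into packets of two platforms each.

```python
from collections import Counter
from itertools import groupby

# Hypothetical survey: (floor, suite, platform used by one person).
SURVEY = [
    (1, "A", "Android"), (1, "A", "iOS"), (1, "B", "Android"),
    (2, "A", "Windows Phone"), (2, "B", "iOS"), (2, "B", "iOS"),
    (3, "A", "BlackBerry"), (3, "A", "Android"), (3, "B", "Symbian"),
    (3, "B", "webOS"),
]

def count_by_suite(records):
    """Map step: each suite on each floor counts its own platforms."""
    return Counter(records)  # (floor, suite, platform) -> count

def sort_by_platform(suite_totals):
    """Lobby step: sort the slips so each platform's totals are adjacent."""
    slips = [(platform, n) for (_, _, platform), n in suite_totals.items()]
    return sorted(slips)

def make_packets(sorted_slips, platforms_per_packet=2):
    """Bundle the sorted slips into packets covering two platforms each."""
    groups = [list(g) for _, g in groupby(sorted_slips, key=lambda s: s[0])]
    return [sum(groups[i:i + platforms_per_packet], [])
            for i in range(0, len(groups), platforms_per_packet)]

def tally(packet):
    """Reduce step: one floor tallies every platform in its packet."""
    totals = Counter()
    for platform, n in packet:
        totals[platform] += n
    return totals

def merge(tallies):
    """Final step: collect the tallies and merge them into one sheet."""
    sheet = Counter()
    for t in tallies:
        sheet.update(t)
    return dict(sheet)

if __name__ == "__main__":
    packets = make_packets(sort_by_platform(count_by_suite(SURVEY)))
    for floor, packet in zip((10, 20, 30), packets):
        print(f"floor {floor} tallies {dict(tally(packet))}")
    print(merge(tally(p) for p in packets))
```

Counting per suite before anything leaves the floor plays the role of local pre-aggregation: only small totals travel to the lobby, not one slip per person.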

Page 13: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

MapReduce Options

Pig, Hive, Sqoop, Mahout also generate MapReduce code

Java

JavaScript (“Rhino”)

Other languages, especially Python, via Streaming

C# via Streaming

C# via .NET SDK
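As a rough illustration of the Streaming option, here is a word-count mapper and reducer in Python. Hadoop Streaming feeds each script lines on standard input and reads tab-separated key/value lines from standard output, with reducer input arriving sorted by key. Everything else here is an assumption for the sketch: the function names, the `local_run` helper, and the convention of passing `map` or `reduce` as the first argument.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a Streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts per key; relies on input being sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"

def local_run(lines):
    """Simulate map -> sort -> reduce in one process, for local testing."""
    return list(reducer(sorted(mapper(lines))))

# Hypothetical convention for this sketch: the role is given as argv[1],
# e.g. "python wordcount.py map" or "python wordcount.py reduce".
if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Because the contract is just text lines in and text lines out, the same pattern applies to the C# via Streaming option: any executable that reads stdin and writes stdout can play either role.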

Page 14: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Amenities for Visual Studio/.NET

• Hortonworks Data Platform for Windows
• MRLib (NuGet Package)
• LINQ to Hive
• OdbcClient + Hive ODBC Driver
• Deployment
• WebHDFS client
• MR code in C#, HadoopJob, MapperBase, ReducerBase

Page 15: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Demo: MapReduce

Page 16: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Hive

• Began as Hadoop sub-project
  • Now top-level Apache project
• Provides a SQL-like (“HiveQL”) abstraction over MapReduce
• Has its own HDFS table file format (and it’s fully schema-bound)
• Can also work over HBase
• Acts as a bridge to many BI products which expect tabular data
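To make the "SQL-like abstraction over MapReduce" point concrete, the sketch below answers one aggregate question two ways: as a declarative query, and as the hand-written map/shuffle/reduce logic that such a query spares you from writing. It is an illustration only. Python's built-in `sqlite3` stands in for Hive here (the simple `SELECT ... GROUP BY` used should also be acceptable to HiveQL, but that is an assumption, not something tested against Hive), and the `weblogs` table and its columns are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical web-log rows: (page, bytes_sent).
ROWS = [("home", 120), ("search", 300), ("home", 80), ("cart", 40), ("search", 60)]

QUERY = (
    "SELECT page, SUM(bytes_sent) AS total_bytes "
    "FROM weblogs GROUP BY page ORDER BY page"
)

def run_declarative(rows):
    """Declarative route: state what is wanted and let the engine plan it."""
    conn = sqlite3.connect(":memory:")  # stand-in engine, NOT Hive
    conn.execute("CREATE TABLE weblogs (page TEXT, bytes_sent INTEGER)")
    conn.executemany("INSERT INTO weblogs VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result

def run_imperative(rows):
    """Imperative route: the map, shuffle and reduce steps written by hand."""
    grouped = defaultdict(list)
    for page, size in rows:         # map: emit (page, bytes_sent)
        grouped[page].append(size)  # shuffle: group values by key
    # reduce: sum each group, then sort to match ORDER BY page
    return sorted((page, sum(sizes)) for page, sizes in grouped.items())

if __name__ == "__main__":
    print(run_declarative(ROWS))
    print(run_imperative(ROWS))
```

Both routes return the same rows. The difference is who writes the grouping and aggregation logic, which is also why tools that only understand tables and queries can sit on top of Hive.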

Page 17: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Hive ODBC Consumers

Excel 2010 or 2013 (including via add-in)

PowerPivot

SQL Server Analysis Services, Tabular Mode

SQL Server Reporting Services

ADO.NET OdbcClient provider

LINQ provider

Page 18: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

xVelocity Technologies

• Formerly known as VertiPaq
• PowerPivot, SSAS Tabular, SQL Server columnar indexes
• Implements BI Semantic Model (BISM)
• Uses column store technology
  • Compression
  • In-memory
  • Speed
• Not a Big Data technology per se, but very useful for analysis of job output
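Why a column store compresses well and scans quickly can be shown with a small, generic sketch. This is not how xVelocity is implemented internally (the slides do not say); it is a textbook-style illustration of column-oriented layout with dictionary encoding and run-length encoding, using hypothetical sample data and function names.

```python
def to_columns(rows, names):
    """Pivot row tuples into one list per column (column-oriented layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

def dictionary_encode(column):
    """Replace each value with a small integer id into a value dictionary."""
    dictionary = sorted(set(column))
    ids = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [ids[value] for value in column]

def run_length_encode(codes):
    """Collapse consecutive repeats into [code, run_length] pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return runs

def decode(dictionary, runs):
    """Rebuild the original column from its dictionary and runs."""
    return [dictionary[code] for code, length in runs for _ in range(length)]

if __name__ == "__main__":
    rows = [("US", 10), ("US", 12), ("US", 7), ("DE", 3), ("DE", 9), ("US", 4)]
    columns = to_columns(rows, ["country", "sales"])
    dictionary, codes = dictionary_encode(columns["country"])
    runs = run_length_encode(codes)
    print(dictionary, runs)
    # An aggregate over sales reads only the sales column.
    print(sum(columns["sales"]))
```

Values within one column tend to repeat, so the encoded form is much smaller than the row form, and a query that needs one column never has to read the others.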

Page 19: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Power View

• Reports on BISM models (PowerPivot, SSAS Tabular)
• Hosted in SharePoint 2010, 2013 Enterprise
• Also Excel 2013 (but not on ARM/Windows RT)
• Interactive data exploration

Page 20: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Demo: Hive ODBC + BI Stack

Page 21: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Project “Hekaton”

• In-memory engine for SQL Server transactional workloads
• Tables must be declared as in-memory explicitly
• In-memory and standard tables can coexist in same db
• Stored procs on in-mem tables are compiled to native code
• Hekaton and xVelocity are separate
  • Hekaton ≠ PowerPivot/SSAS Tabular
  • Hekaton ≠ Columnstore indexes
• Compare to SAP HANA
  • In-memory, transactional, analytical, column store

Page 22: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

NoSQL

• NoSQL databases are non-relational and non- or loosely-schematized
• HBase is a NoSQL database, of the wide column variety
  • Hive implements a SQL layer over it
  • HBase not yet in HDInsight
  • HBase table = HDFS file
• Three other NoSQL categories
  • Key-value store, document store, graph database
  • Azure Table Storage is a key-value store NoSQL database
• Some of them aren’t really Big Data tools, but market themselves that way anyway

Page 23: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

SQL Parallel Data Warehouse (PDW)

• SQL PDW is a Massively Parallel Processing (MPP) database
  • Teradata, IBM Netezza, HP Vertica also in this category
• It’s an array/cluster of SQL servers made to look like one SQL Server
• Available as appliance only
  • Purchase from HP, Dell
  • Server, storage and network all pre-built and configured
• Many other MPP products based on PostgreSQL
• PDW loosely based on acquired DATAllegro product
  • Implemented MPP with Ingres, written in Java, running on Linux

Page 24: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

MapReduce versus MPP

MapReduce:
• Splits preprocessing amongst mapper nodes and aggregation amongst reducers
• Scales infinitely on commodity hardware
• Uses locally attached commodity disks on nodes
• Uses imperative code
• Processes flat files, wide column tables (HBase), relational tables (Hive)
• Divide-and-conquer approach, parallel, distributed

MPP:
• Splits query amongst nodes then unifies result sets
• Scales to high-end assets in the appliance cabinet
• Uses shared storage (can be more network traffic)
• Uses SQL
• Works with relational tables only
• Divide-and-conquer approach, parallel, distributed
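The MPP pattern of splitting a query amongst nodes and then unifying the result sets can be sketched as a scatter-gather aggregate. This is a toy model in plain Python, not PDW's actual mechanism: the round-robin distribution, the thread pool standing in for compute nodes, and all names and data are assumptions for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def distribute(rows, num_nodes):
    """Spread rows across compute nodes (round-robin in this sketch)."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes

def node_query(local_rows):
    """Each node runs the same aggregate over only its own slice of the data."""
    partial = defaultdict(lambda: [0, 0])  # region -> [sum, count]
    for region, amount in local_rows:
        partial[region][0] += amount
        partial[region][1] += 1
    return dict(partial)

def unify(partials):
    """Control node: merge the partial result sets into the final answer."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for region, (total, count) in partial.items():
            merged[region][0] += total
            merged[region][1] += count
    # The average is derived from merged sums and counts, because averaging
    # the per-node averages would give the wrong answer.
    return {region: total / count
            for region, (total, count) in sorted(merged.items())}

def average_by_region(rows, num_nodes=4):
    """Scatter the query to every node in parallel, then gather and unify."""
    nodes = distribute(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(node_query, nodes))
    return unify(partials)

if __name__ == "__main__":
    sales = [("east", 10), ("west", 20), ("east", 30), ("west", 60), ("east", 50)]
    print(average_by_region(sales))
```

The caller sees one query and one result set, which is the "made to look like one SQL Server" idea. The shape is close to MapReduce, which is why both approaches end up being described as divide-and-conquer.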

Page 25: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

PolyBase

• To be included in next version of PDW
• Mashup of SQL Server and Hadoop
• Enables PDW to address Hadoop data nodes (HDFS) directly
• Parallelism managed by PDW
• Tables are imported into SQL Server db
  • They are EXTERNAL tables
  • They can participate in joins with standard tables

Page 26: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Resources

• MS Big Data/HDInsight: http://www.microsoft.com/bigdata
• Apache Hadoop: http://hadoop.apache.org/
• Apache HBase: http://hbase.apache.org/
• SQL PDW: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/pdw.aspx
• PolyBase: http://gsl.azurewebsites.net/Projects/Polybase.aspx
• xVelocity: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/in-memory.aspx
• Column store: http://en.wikipedia.org/wiki/Column-oriented_DBMS
• Power View: http://office.microsoft.com/en-us/excel-help/power-view-explore-visualize-and-present-your-data-HA102835634.aspx
• Hekaton: http://blogs.technet.com/b/dataplatforminsider/archive/2012/11/08/breakthrough-performance-with-in-memory-technologies.aspx

Page 27: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Questions?

Page 28: Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack

Thank You for Attending