Post on 25-Apr-2020

Page 1: Big Data on the Microsoft Platform - PASS · SQL Parallel Data Warehouse (PDW) SQL PDW is a Massively Parallel Processing (MPP) database Teradata, IBM Netezza, HP Vertica also in

Global Sponsor:

Big Data on the Microsoft Platform

Andrew J. Brust, CEO, Blue Badge Insights

With Hadoop, MS BI and the SQL Server stack

Meet Andrew

• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 17 years as a speaker
• Founder, Microsoft BI User Group of NYC (http://www.msbigdatanyc.com)
• Co-moderator, NYC .NET Developers Group (http://www.nycdotnetdev.com)
• “Redmond Review” columnist for Visual Studio Magazine
• brustblog.com, Twitter: @andrewbrust

Read all about it!

My New Blog (bit.ly/bigondata)

Agenda

Big Data, Hadoop and HDInsight MapReduce Hive ODBC, BI Stack Hekaton, NoSQL SQL Server Parallel Data Warehouse, MPP, PolyBase

What is Big Data?

• 100s of TB into PB and higher
• Involving data from: financial data, sensors, web logs, social media, etc.
• Parallel processing often involved
• Hadoop is emblematic, but other technologies are Big Data too
• Processing of data sets too large for transactional databases
• Analyzing interactions, rather than transactions
• The three V’s: Volume, Velocity, Variety
• Big Data tech sometimes imposed on small data problems

What is Hadoop?

• Open source implementation of Google’s MapReduce and GFS (Google File System)
• Allows for scale-out processing of petabyte scale data (1 PB = 1,024 TB)
• Also distributed storage
• Commodity hardware
• Can work against flat files, or certain database formats
• Native processing involves imperative Java code
• Other languages supported through “Streaming”

What is HDInsight?

• Microsoft’s Hadoop distribution, on Windows
• Most other distros on Linux
• Based on Hortonworks Data Platform (HDP)
• Runs on Azure, eventually on Windows Server, and as sandbox on dev PC
• For .NET devs: .NET SDK for Hadoop, LINQ provider

Demo HDInsight

The Hadoop Stack

• MapReduce, HDFS
• Database
• RDBMS Import/Export
• Query: HiveQL and Pig Latin
• Machine Learning/Data Mining
• Log file integration

MapReduce, in a Diagram

[Diagram: each Input is read by one of several mappers; mapper Output is grouped by key (K1, K2, K3) and routed as Input to the reducers, each of which produces its own Output.]

A MapReduce Example

• Count by suite, on each floor

• Send per-suite, per platform totals to lobby

• Sort totals by platform

• Send two platform packets to 10th, 20th, 30th floor

• Tally up each platform

• Merge tallies into one spreadsheet

• Collect the tallies
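
The building-survey steps above can be sketched as a tiny in-memory MapReduce. This is a minimal illustration in plain Python, not Hadoop code: the sample observations, the platform names, and the function names (map_phase, shuffle, reduce_phase) are all assumptions made up for the sketch.

```python
from collections import defaultdict

# Hypothetical survey data: one (floor, suite, platform) tuple per device seen.
observations = [
    (1, "101", "iOS"), (1, "101", "Android"), (1, "102", "iOS"),
    (2, "201", "Android"), (2, "202", "Android"), (2, "202", "Windows Phone"),
]

def map_phase(records):
    """Mapper: emit a (platform, 1) pair for every device counted on a floor."""
    for _floor, _suite, platform in records:
        yield platform, 1

def shuffle(pairs):
    """Shuffle/sort: group values by key so one reducer sees one whole platform."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    """Reducer: tally up each platform."""
    return {platform: sum(counts) for platform, counts in groups.items()}

totals = reduce_phase(shuffle(map_phase(observations)))
print(totals)
```

In a real cluster each phase would run on many nodes at once; here the three functions run in sequence only to show how the data changes shape between them.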

MapReduce Options

• Java
• JavaScript (“Rhino”)
• Other languages, especially Python, via Streaming
• C# via Streaming
• C# via .NET SDK

Pig, Hive, Sqoop, Mahout also generate MapReduce code
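
For the Streaming option, a mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. The following is a rough word-count sketch of that contract, assuming the usual Streaming behavior of handing the reducer its input sorted by key. The function names and sample lines are hypothetical, and the demo simulates the sort step locally instead of running on a cluster.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum the counts for each key; input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _key, count in group)}"

# Local stand-in for a job: map, sort (the framework's role), then reduce.
# In a real Streaming job, each function would be fed sys.stdin by its own script.
sample = ["big data on windows", "big data on linux"]
result = list(reducer(sorted(mapper(sample))))
print("\n".join(result))
```

Because the two functions only consume and produce lines of text, the same logic could be written in any language that can read stdin and write stdout, which is the point of Streaming.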

Amenities for Visual Studio/.NET

• Hortonworks Data Platform for Windows
• MRLib (NuGet Package)
• MR code in C#, HadoopJob, MapperBase, ReducerBase
• LINQ to Hive
• OdbcClient + Hive ODBC Driver
• Deployment
• WebHDFS client

Demo MapReduce

Hive

• Began as Hadoop sub-project
• Now top-level Apache project
• Provides a SQL-like (“HiveQL”) abstraction over MapReduce
• Has its own HDFS table file format (and it’s fully schema-bound)
• Can also work over HBase
• Acts as a bridge to many BI products which expect tabular data

Hive ODBC Consumers

• Excel 2010 or 2013 (including via add-in)
• PowerPivot
• SQL Server Analysis Services, Tabular Mode
• SQL Server Reporting Services
• ADO.NET OdbcClient provider
• LINQ provider

xVelocity Technologies

• Formerly known as VertiPaq
• PowerPivot, SSAS Tabular, SQL Server columnar indexes
• Implements BI Semantic Model (BISM)
• Uses column store technology
• Compression
• In-memory
• Speed
• Not a Big Data technology per se, but very useful for analysis of job output
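
To make the column store and compression points above concrete, here is a generic sketch of why storing data by column tends to compress well and aggregate quickly. It is not a description of xVelocity internals: the sample rows, the helper names, and the choice of run-length encoding are assumptions picked only for illustration.

```python
from itertools import groupby

# Hypothetical job output, stored row by row.
rows = [
    {"region": "East", "product": "A", "units": 10},
    {"region": "East", "product": "B", "units": 5},
    {"region": "East", "product": "A", "units": 7},
    {"region": "West", "product": "A", "units": 3},
    {"region": "West", "product": "B", "units": 8},
]

def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    return {name: [record[name] for record in records] for name in records[0]}

def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

columns = to_columns(rows)
encoded_region = run_length_encode(columns["region"])  # repeated values shrink
total_units = sum(columns["units"])  # an aggregate touches only one column

print(encoded_region, total_units)
```

The region column collapses to two pairs because its repeated values sit next to each other, and the sum reads a single list without touching the other columns.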

Power View

• Reports on BISM models (PowerPivot, SSAS Tabular)
• Hosted in SharePoint 2010, 2013 Enterprise
• Also Excel 2013 (but not on ARM/Windows RT)
• Interactive data exploration

Demo Hive ODBC + BI Stack

Project “Hekaton”

• In-memory engine for SQL Server transactional workloads
• Tables must be declared as in-memory explicitly
• In-memory and standard tables can coexist in same db
• Stored procs on in-mem tables are compiled to native code
• Hekaton and xVelocity are separate
• Hekaton ≠ PowerPivot/SSAS Tabular
• Hekaton ≠ Columnstore indexes
• Compare to SAP HANA (in-memory, transactional, analytical, column store)

NoSQL

• NoSQL databases are non-relational and non- or loosely-schematized
• HBase is a NoSQL database, of the wide column variety
• Hive implements a SQL layer over it
• HBase not yet in HDInsight
• HBase table = HDFS file
• Three other NoSQL categories: key-value store, document store, graph database
• Azure Table Storage is a key-value store NoSQL database
• Some of them aren’t really Big Data tools, but market themselves that way anyway

SQL Parallel Data Warehouse (PDW)

• SQL PDW is a Massively Parallel Processing (MPP) database
• Teradata, IBM Netezza, HP Vertica also in this category
• It’s an array/cluster of SQL servers made to look like one SQL Server
• Available as appliance only
• Purchase from HP, Dell
• Server, storage and network all pre-built and configured
• Many other MPP products based on PostgreSQL
• PDW loosely based on acquired DATAllegro product, which implemented MPP with Ingres, written in Java, running on Linux

MapReduce versus MPP

MapReduce:

• Splits preprocessing amongst mapper nodes and aggregation amongst reducers
• Scales infinitely on commodity hardware
• Uses locally attached commodity disks on nodes
• Uses imperative code
• Processes flat files, wide column tables (HBase), relational tables (Hive)
• Divide-and-conquer approach, parallel, distributed

MPP:

• Splits query amongst nodes then unifies result sets
• Scales to high-end assets in the appliance cabinet
• Uses shared storage (can be more network traffic)
• Uses SQL
• Works with relational tables only
• Divide-and-conquer approach, parallel, distributed
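
The MPP pattern described above, splitting a query amongst nodes and then unifying the result sets, can be sketched in a few lines. This is only an illustration of the general scatter/gather idea and makes no claim about how PDW actually distributes data or plans queries: the node count, the sample sales rows, the CRC32-based distribution, and all function names are assumptions.

```python
import zlib
from collections import Counter

NODE_COUNT = 3  # hypothetical number of compute nodes

# Hypothetical fact rows: (region, amount).
sales = [("East", 10), ("West", 3), ("East", 5), ("North", 8), ("West", 7)]

def distribute(rows, node_count):
    """Hash-distribute rows across nodes on the region column."""
    nodes = [[] for _ in range(node_count)]
    for region, amount in rows:
        nodes[zlib.crc32(region.encode()) % node_count].append((region, amount))
    return nodes

def local_aggregate(rows):
    """Each node's share of the query: SUM(amount) GROUP BY region on its slice."""
    partial = Counter()
    for region, amount in rows:
        partial[region] += amount
    return partial

def unify(partials):
    """Control node: merge the per-node result sets into one answer."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return dict(merged)

result = unify(local_aggregate(node) for node in distribute(sales, NODE_COUNT))
print(result)
```

The caller sees a single result set, which mirrors the idea of a cluster made to look like one server. The structural similarity to the map and reduce phases is what the shared "divide-and-conquer" entry in both lists points at.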

PolyBase

• To be included in next version of PDW
• Mashup of SQL Server and Hadoop
• Enables PDW to address Hadoop data nodes (HDFS) directly
• Parallelism managed by PDW
• Tables are imported into SQL Server db
• They are EXTERNAL tables
• They can participate in joins with standard tables

Resources

• MS Big Data/HDInsight: http://www.microsoft.com/bigdata
• Apache Hadoop: http://hadoop.apache.org/
• Apache HBase: http://hbase.apache.org/
• SQL PDW: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/pdw.aspx
• PolyBase: http://gsl.azurewebsites.net/Projects/Polybase.aspx
• xVelocity: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/in-memory.aspx
• Column store: http://en.wikipedia.org/wiki/Column-oriented_DBMS
• Power View: http://office.microsoft.com/en-us/excel-help/power-view-explore-visualize-and-present-your-data-HA102835634.aspx
• Hekaton: http://blogs.technet.com/b/dataplatforminsider/archive/2012/11/08/breakthrough-performance-with-in-memory-technologies.aspx

Questions?

Thank You for Attending