Big Data in the Real World
DESCRIPTION
Here I talk about examples and use cases for Big Data & Big Data Analytics and how we accomplished massive-scale sentiment, campaign and marketing analytics for Razorfish using a collection of database, Big Data and analytics technologies.
TRANSCRIPT
![Page 1: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/1.jpg)
Big Data in the Real World
Orlando PASS
October 2013
http://www.pssug.org
Mark Kromer
http://www.kromerbigdata.com
@kromerbigdata
@mssqldude
![Page 2: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/2.jpg)
‣What is Big Data?
‣The Big Data and Apache Hadoop environment
‣Big Data Analytics
‣SQL Server in the Big Data world
‣Microsoft + Hortonworks (Yahoo!) = HDInsight
What we’ll (try to) cover today
![Page 3: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/3.jpg)
Big Data 101
‣ 3 V’s
‣ Volume – Terabyte records, transactions, tables, files
‣ Velocity – Batch, near-time, real-time (analytics), streams.
‣ Variety – Structured, unstructured, semi-structured, and all of the above in a mix
‣ Text Processing
‣ Techniques for processing and analyzing unstructured (and structured) LARGE files
‣ Analytics & Insights
‣ Distributed File System & Programming
![Page 4: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/4.jpg)
‣ Big Data ≠ NoSQL
‣ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.) but is not the same thing
‣ Facebook, for example, uses HBase from the Hadoop stack
‣ NoSQL does not have to be Big Data
‣ Big Data ≠ Real Time
‣ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was otherwise too complex to provide value
‣ Use in-memory analytics for real time insights
‣ Big Data ≠ Data Warehouse
‣ I still refer to large multi-TB DWs as “VLDB”
‣ Big Data is about crunching stats in text files for discovery of new patterns and insights
‣ Use the DW to aggregate and store the summaries of those calculations for reporting
Mark’s Big Data Myths
![Page 5: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/5.jpg)
‣ Batch Processing
‣ Commodity Hardware
‣ Data Locality, no shared storage
‣ Scales linearly
‣ Great for large text file processing, not so great on small files
‣ Distributed programming paradigm
![Page 6: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/6.jpg)
Popular Hadoop Distributions
Hosted PaaS Hadoop platforms: Amazon EMR, Pivotal, Microsoft Hadoop on Azure
![Page 7: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/7.jpg)
Popular NoSQL Distributions
Transactional-based, not analytics schemas
![Page 8: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/8.jpg)
Popular MPP Distributions
Big Data as distributed, scale-out, sharded data stores
![Page 9: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/9.jpg)
Big Data Analytics Web Platform - Example
[Diagram: platform layers are Data Sources, Data Mastering, Data Warehouse & Analytics, and Presentation. Components shown: External & Extended Data, Data Mapping, Business Rules, Big Data Sandboxes, MapReduce Jobs, Media Level Data Warehouse, Audience Level Data Warehouse, Attribution, Segmentation, Stacking Effect, …, Tableau & Pentaho.]
![Page 10: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/10.jpg)
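using System.Text.RegularExpressions;
using Microsoft.Hadoop.MapReduce;

// Mapper: emits one (page, hit) pair per well-formed log line.
// Note: expected, pagePos and hit are not defined on the slide; they stand for the
// expected field count, the index of the page field, and the per-line hit value.
public class TotalHitsForPageMap : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        context.Log(inputLine);
        // Split the log line on runs of whitespace.
        var parts = Regex.Split(inputLine, "\\s+");
        if (parts.Length != expected) // only take records with all values
        {
            return;
        }
        context.EmitKeyValue(parts[pagePos], hit);
    }
}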
MapReduce Framework (Map)
![Page 11: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/11.jpg)
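using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

// Reducer (also usable as a combiner): sums the hit values emitted for each page.
public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
    }
}

// Job definition: wires the mapper and reducer together and sets the I/O paths.
public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var retVal = new HadoopJobConfiguration();
        // Input and output locations come from environment variables.
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
        retVal.DeleteOutputFolder = true;
        return retVal;
    }
}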
MapReduce Framework (Reduce & Job)
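To make the map, shuffle and reduce flow on the last two slides easier to follow, here is a minimal single-process sketch in Python. It is not the .NET SDK and not how Hadoop distributes the work; it only mirrors the logic. The field count, the position of the page field and the sample log lines are all assumptions invented for illustration, since the slides do not show the real values.

```python
import re
from collections import defaultdict

# Assumed layout for illustration only: six whitespace-separated fields,
# with the page URL in the fifth one. The slides do not show the real values.
EXPECTED_FIELDS = 6
PAGE_POS = 4


def map_line(line):
    """Map step: emit (page, "1") for each well-formed log line."""
    parts = re.split(r"\s+", line.strip())
    if len(parts) != EXPECTED_FIELDS:  # only take records with all values
        return []
    return [(parts[PAGE_POS], "1")]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_page(key, values):
    """Reduce step: sum the hit values for one page."""
    return key, str(sum(int(v) for v in values))


def total_hits(lines):
    pairs = [pair for line in lines for pair in map_line(line)]
    return dict(reduce_page(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample log lines, made up for this sketch.
sample_log = [
    "2013-05-07 12:11:00 10.0.0.1 GET /index.html 200",
    "2013-05-07 12:11:01 10.0.0.2 GET /about.html 200",
    "2013-05-07 12:11:02 10.0.0.1 GET /index.html 200",
    "malformed line",
]
hits = total_hits(sample_log)
```

Values travel as strings between the steps, as in the C# version, which is why the reduce step parses them before summing.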
![Page 12: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/12.jpg)
‣ Linux-style shell commands to access data in HDFS
‣ Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv
‣ List files in HDFS: c:\Hadoop>hadoop fs -ls /import
Found 1 items
-rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv
‣ View file in HDFS: c:\Hadoop>hadoop fs -cat /import/sales.csv
Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
‣ Now, we can work on the data with MapReduce, Hive, Pig, etc.
Get Data into Hadoop
![Page 13: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/13.jpg)
create external table ext_sales
(
    lastname string,
    productid int,
    quantity int,
    sales_amount float
)
row format delimited fields terminated by ','
stored as textfile
location '/user/makromer/hiveext/input';

LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;
Use Hive for Data Schema and Analysis
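Once a schema sits over the file, the analysis is ordinary SQL. The sketch below illustrates that idea locally, with Python's built-in SQLite standing in for Hive (an assumption made so the example runs anywhere; HiveQL types and execution differ). The rows are the ones from the sales.csv listing shown earlier, and the aggregate query is only an example of what one might ask, not a query from the talk.

```python
import sqlite3

# Rows copied from the sales.csv listing shown earlier in the deck.
ROWS = [
    ("Kromer", 123, 5, 55.0),
    ("Smith", 567, 1, 25.0),
    ("Jones", 123, 9, 99.0),
    ("James", 11, 12, 1.0),
    ("Johnson", 456, 2, 2.5),
    ("Singh", 456, 1, 3.25),
    ("Yu", 123, 1, 11.0),
]

conn = sqlite3.connect(":memory:")  # SQLite is a local stand-in for Hive here
conn.execute(
    "CREATE TABLE ext_sales ("
    "lastname TEXT, productid INTEGER, quantity INTEGER, sales_amount REAL)"
)
conn.executemany("INSERT INTO ext_sales VALUES (?, ?, ?, ?)", ROWS)

# Example aggregate: units and revenue per product.
totals = conn.execute(
    "SELECT productid, SUM(quantity), SUM(sales_amount) "
    "FROM ext_sales GROUP BY productid ORDER BY productid"
).fetchall()
conn.close()

for productid, units, amount in totals:
    print(productid, units, amount)
```

In Hive the same GROUP BY would be compiled into MapReduce jobs over the files in the table location rather than run against an in-memory database.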
![Page 14: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/14.jpg)
‣ sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1
‣ > hadoop fs -cat /user/mark/customers/part-m-00000
‣ > 5,Bob Smith
‣ sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
Sqoop: Data transfer to & from Hadoop & SQL Server
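When Sqoop runs are scripted rather than typed, it can help to assemble the argument list in one place so the double-dash options stay consistent. The helper below is a hypothetical sketch (the function name and its parameters are not part of Sqoop); it only builds the command line and does not run it. The options it emits are the ones used on this slide.

```python
import shlex


def sqoop_command(tool, connect, username, password, table, mappers=1, export_dir=None):
    """Build a sqoop argument list; `tool` is "import" or "export"."""
    if tool not in ("import", "export"):
        raise ValueError("tool must be 'import' or 'export'")
    args = [
        "sqoop", tool,
        "--connect", connect,
        "--username", username,
        "--password", password,
        "--table", table,
        "-m", str(mappers),
    ]
    if tool == "export":
        if export_dir is None:
            raise ValueError("export needs an export_dir")
        args += ["--export-dir", export_dir]
    return args


import_args = sqoop_command(
    "import", "jdbc:sqlserver://localhost", "sqoop", "password", "customers"
)
export_args = sqoop_command(
    "export", "jdbc:sqlserver://localhost", "sqoop", "password", "customers",
    export_dir="/user/mark/data/employees3",
)
print(shlex.join(import_args))
print(shlex.join(export_args))
```

The resulting list could be handed to `subprocess.run` on a machine where Sqoop is installed; a password on the command line is fine for a demo but is visible to other users of the machine.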
![Page 15: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/15.jpg)
SQL Server Big Data – Data Loading
Amazon HDFS & EMR
Data Loading
Amazon S3 Bucket
![Page 16: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/16.jpg)
Role of NoSQL in a Big Data Analytics Solution
‣ Use NoSQL to store data quickly without the overhead of RDBMS
‣ HBase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few
‣ Why NoSQL?
‣ In the world of “Big Data”
‣ “Schema later”
‣ Ignore ACID properties
‣ Drop data into key-value store quick & dirty
‣ Worry about query & read later
‣ Why NOT NoSQL?
‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, MR interface
‣ SQL Server and NoSQL
‣ Not a natural fit
‣ Use HDFS or your favorite NoSQL database
‣ Consider turning off SQL Server locking mechanisms
‣ Focus on writes, not reads (read uncommitted)
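The "schema later" idea can be shown in a few lines: accept anything on write, and impose structure only when someone reads. This is a toy sketch in which a plain Python dict stands in for a key-value store; the keys, field names and payloads are invented for illustration and say nothing about how HBase, Cassandra or the others behave internally.

```python
import json

store = {}  # a plain dict stands in for a key-value store


def put(key, raw):
    """Write path: store the raw payload as-is, with no schema check."""
    store[key] = raw


def read_field(key, field, default=None):
    """Read path: impose structure only when the data is queried."""
    try:
        record = json.loads(store[key])
    except (KeyError, ValueError):
        return default
    if not isinstance(record, dict):
        return default
    return record.get(field, default)


# Hypothetical payloads: complete, incomplete, and not even JSON.
put("evt:1", '{"campaign": "spring", "clicks": 3}')
put("evt:2", '{"campaign": "fall"}')
put("evt:3", "not json at all")

print(read_field("evt:1", "clicks"))
print(read_field("evt:2", "clicks", 0))
print(read_field("evt:3", "campaign"))
```

Writes never fail on shape, which is the "quick & dirty" part; the cost shows up on the read side, where every query has to cope with missing or malformed records.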
![Page 17: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/17.jpg)
‣ SQL Server Database
‣ SQL 2012 Enterprise Edition
‣ Page Compression
‣ 2012 Columnar Compression on Fact Tables
‣ Clustered Index on all tables
‣ Auto-update stats asynchronously
‣ Partition Fact Tables by month and archive data with sliding window technique
‣ Drop all indexes before nightly ETL load jobs
‣ Rebuild all indexes when ETL completes
‣ SQL Server Analysis Services
‣ SSAS 2012 Enterprise Edition
‣ 2008 R2 OLAP cubes partition-aligned with DW
‣ 2012 in-memory tabular cubes
‣ All access through MSMDPUMP or SharePoint
SQL Server Big Data Environment
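The sliding window technique mentioned for the fact tables comes down to simple bookkeeping: each period a new monthly partition is added at the front and the oldest one is moved out to the archive. The sketch below models only that bookkeeping in Python. The integer month keys and the retention count are assumptions for illustration; in SQL Server the actual work is done with partition SPLIT, SWITCH and MERGE operations, which are not shown here.

```python
def month_key(year, month):
    """Integer partition key such as 201310 for October 2013."""
    return year * 100 + month


def next_month(key):
    """Return the key of the month following `key`."""
    year, month = divmod(key, 100)
    return month_key(year + 1, 1) if month == 12 else month_key(year, month + 1)


def slide_window(partitions, keep):
    """Add the next month and return (active, archived) partition keys.

    `partitions` is a sorted list of month keys; `keep` is how many months
    stay in the fact table after the window slides.
    """
    extended = partitions + [next_month(partitions[-1])]
    cut = max(len(extended) - keep, 0)
    return extended[cut:], extended[:cut]


active, archived = slide_window([201310, 201311, 201312], keep=3)
print(active, archived)
```

Because whole partitions move, archiving a month does not require deleting rows one by one from the fact table.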
![Page 18: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/18.jpg)
‣Columnstore
‣Sqoop adapter
‣PolyBase
‣Hive
‣In-memory analytics
‣Scale-out MPP
SQL Server Big Data Analytics Features
![Page 19: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/19.jpg)
[Diagram: sources are sensors, devices, bots and crawlers plus ERP, CRM and LOB apps (unstructured and structured data). Platforms: Hadoop on Windows Azure, Hadoop on Windows Server and Parallel Data Warehouse, linked by connectors. BI platform: SSRS and SSAS. Familiar end-user tools: Excel with PowerPivot, embedded BI, predictive analytics. Also shown: Data Market Place / Data Market. Scale: petabytes of data (unstructured), hundreds of TB of data (structured).]
Microsoft’s Data Solution – Big Data & PDW
![Page 20: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/20.jpg)
MICROSOFT BIG DATA
[Diagram: Discover, Combine, Refine across relational, non-relational, analytical and streaming data. Themes: immersive data experiences; connecting with the world's data; any data, any size, anywhere. Audiences: self-service, collaboration, corporate apps, devices. Products: Parallel Data Warehouse, Microsoft HDInsight Server, HDInsight Service, StreamInsight, PowerPivot, Power View.]
![Page 21: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/21.jpg)
Windows Azure HDInsight Service
Microsoft HDInsight Server
Expanded Partnership
![Page 22: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/22.jpg)
Microsoft .NET Hadoop APIs
‣ WebHDFS
‣ Linq to Hive
‣ MapReduce
‣ C#
‣ Java
‣ Hive
‣ Pig
‣ http://hadoopsdk.codeplex.com/
‣ SQL on Hadoop
‣ Cloudera Impala
‣ Teradata SQL-H
‣ Microsoft PolyBase
‣ Hadapt
![Page 23: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/23.jpg)
Data Movement to the Cloud
‣ Use Windows Azure Blob Storage
• Already stored in 3 copies
• Hadoop can read from Azure blob storage
• Allows you to upload while using no Hadoop network or CPU resources
‣ Compress files
• Hadoop can read Gzip
• Uses less network resources than uncompressed
• Lowers direct storage costs
• Compress directories where source files are created as well.
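A quick way to see why compressing before upload pays off is to gzip some delimited text and compare sizes. The sketch below uses Python's standard gzip module on made-up, highly repetitive sample data, so the ratio it prints is optimistic and not a measurement from the project; real log files will compress less evenly.

```python
import gzip

# Hypothetical sample: repetitive delimited text, the kind that compresses well.
raw = ("Kromer,123,5,55\n" * 1000).encode("utf-8")

compressed = gzip.compress(raw)
restored = gzip.decompress(compressed)

ratio = len(compressed) / len(raw)
print(f"raw={len(raw)} bytes, gzip={len(compressed)} bytes, ratio={ratio:.3f}")
```

One trade-off to keep in mind: a gzip file is not splittable, so Hadoop processes each .gz file with a single mapper. Many moderately sized compressed files tend to parallelize better than one very large one.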
![Page 24: Big Data in the Real World](https://reader033.vdocuments.mx/reader033/viewer/2022061218/54b6ff354a7959aa2a8b4682/html5/thumbnails/24.jpg)
‣ What is a Big Data approach to Analytics?
‣ Massive scale
‣ Data discovery & research
‣ Self-service
‣ Reporting & BI
‣ Why do we take this Big Data Analytics approach?
‣ TBs of change data in each subject area
‣ The data in the sources are variable and unstructured
‣ SSIS ETL alone couldn’t keep up or handle complexity
‣ SQL Server 2012 columnstore and tabular SSAS 2012 are key to using SQL Server for Big Data
‣ With the configs mentioned previously, SQL Server works great
‣ Analytics on Big Data also requires Big Data Analytics tools
‣ Aster, Tableau, PowerPivot, SAS, Parallel Data Warehouse
Wrap-up