Do You Have Big Data? (Most Likely!)
Peter Myers – Bitwise Solutions
Saptak Sen – Microsoft
DBI-B325
Presenter Introduction
Peter Myers
BI Expert – Bitwise Solutions
BBus, SQL Server MCSE, MCT, SQL Server MVP
Experienced in designing, developing and maintaining Microsoft database and application solutions, since 1997
Focuses on education and mentoring
Based in Melbourne
[email protected]
www.linkedin.com/in/peterjsmyers
Presenter Introduction
Saptak Sen
Senior Product Manager, Big Data, Microsoft Corporation
Focused on Big Data and NoSQL offerings for Microsoft customers. For the last 12 years at Microsoft, he has worked on various distributed computing platforms.
Twitter: @saptak
Session Objectives
To introduce:
  Big data
  Hadoop
  HDInsight
To describe big data processes
To demonstrate various big data scenarios
To describe and inspire you with big data capabilities and potential
To provide relevant resources for further investigation
Introducing Big Data
“Big data is a collection of data sets so large and complex that it becomes awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization.” – Wikipedia
Introducing Big Data (Continued)
Big data solutions deal with complexities of:
VOLUME (Size)
VARIETY (Structure)
VELOCITY (Speed)
Introducing Big Data (Continued)
[Diagram: data volume (Megabytes, Gigabytes, Terabytes, Petabytes) against Data Complexity: Variety and Velocity]
ERP/CRM: Payables, Payroll, Inventory, Contacts, Deal Tracking, Sales Pipeline
Web 2.0: Web Logs, Digital Marketing, Search Marketing, Recommendations, Advertising, Mobile, Collaboration, eCommerce
Big Data: Log files, Spatial & GPS coordinates, Data market feeds, eGov feeds, Weather, Text/image, Click stream, Wikis/blogs, Sensors/RFID/devices, Social sentiment, Audio/video
Introducing Big Data: Responding to New Questions
Live Data Feed: How do I optimize my services based on patterns of weather, traffic, etc.?
Social Analytics: What’s the social sentiment of my product?
Advanced Analytics: How do I better predict future outcomes?
Introducing Hadoop
Apache Hadoop is for big data
It is a set of open source projects that transform commodity hardware into a service that can:
  Store petabytes of data reliably
  Allow huge distributed computations
Key attributes:
  Open source
  Highly scalable
  Runs on commodity hardware
  Redundant and reliable (no data loss)
  Batch processing centric – using “Map-Reduce” processing paradigm
Introducing the Hadoop Ecosystem
Distributed Storage (HDFS)
Query (Hive)
Distributed Processing (Map Reduce)
Scripting (Pig)
NoSQL Database (HBase)
Metadata (HCatalog)
Data Integration (ODBC / Sqoop / REST)
Business Intelligence (Excel, PowerView…)
Machine Learning (Mahout)
Graph (Pegasus)
Stats processing (RHadoop)
Pipeline / workflow (Oozie)
Log file aggregation (Flume)
PDW
World’s Data (Azure Data Marketplace)
AD, System Center
Windows Azure Storage
Introducing HDInsight
HDInsight is Microsoft’s 100% Apache compatible Hadoop distribution
Available as a Windows Azure service – presently available as developer preview
Empowers organizations with new insights on previously untouched unstructured data, while connecting to the most widely used BI tools on the planet
How it Works
FIRST, STORE THE DATA
[Diagram: Files stored across four Servers]
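The "store the data" step can be illustrated with a small sketch. HDFS splits each file into fixed-size blocks and keeps several replicas of every block on different servers, which is where the reliability comes from. The function name placeBlocks, the server names, the 64 MB block size and the replication factor of 3 are illustrative assumptions, and the round-robin placement is a simplification: real HDFS placement is decided by the cluster and is rack-aware.

```javascript
// Simplified, hypothetical sketch of block placement (not the HDFS algorithm).
// Splits a file into fixed-size blocks and assigns each block to
// `replication` distinct servers in round-robin order.
function placeBlocks(fileSizeBytes, blockSizeBytes, servers, replication) {
    if (replication > servers.length) {
        throw new Error("Not enough servers for the requested replication");
    }
    var blockCount = Math.ceil(fileSizeBytes / blockSizeBytes);
    var placement = [];
    for (var b = 0; b < blockCount; b++) {
        var replicas = [];
        for (var r = 0; r < replication; r++) {
            // Consecutive servers, wrapping around, so replicas never repeat.
            replicas.push(servers[(b + r) % servers.length]);
        }
        placement.push({ block: b, servers: replicas });
    }
    return placement;
}

var MB = 1024 * 1024;
// Assumed values: a 200 MB file, 64 MB blocks, four servers, three replicas.
var plan = placeBlocks(200 * MB, 64 * MB,
    ["server1", "server2", "server3", "server4"], 3);

plan.forEach(function (entry) {
    console.log("block " + entry.block + " -> " + entry.servers.join(", "));
});
```

With these assumed values the file occupies four blocks, and losing any single server still leaves at least two replicas of every block.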
How it Works
SECOND, TAKE THE PROCESSING TO THE DATA
// Map Reduce function in JavaScript
// map: emit (word, 1) for every word found in one line of input text
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
// reduce: add up all the counts emitted for one word
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
[Diagram: the RUNTIME sends the Code to each of the four Servers]
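To see what the runtime does with the map and reduce functions above, the following sketch runs them in a single process. The context.write, values.hasNext and values.next calls come from the slide; the makeIterator and runLocally helpers and the sample input are hypothetical stand-ins for the cluster runtime, which would run map on many servers in parallel and group the emitted values by key before calling reduce.

```javascript
// map and reduce as on the slide (stray closing brace removed).
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical stand-in for the iterator the runtime passes to reduce.
function makeIterator(items) {
    var pos = 0;
    return {
        hasNext: function () { return pos < items.length; },
        next: function () { return items[pos++]; }
    };
}

// Hypothetical single-process driver: map every line, group by key, reduce.
function runLocally(lines) {
    var groups = Object.create(null); // "shuffle": key -> emitted values
    var mapContext = {
        write: function (k, v) {
            if (!(k in groups)) { groups[k] = []; }
            groups[k].push(v);
        }
    };
    lines.forEach(function (line, offset) { map(offset, line, mapContext); });

    var results = {};
    var reduceContext = { write: function (k, v) { results[k] = v; } };
    Object.keys(groups).sort().forEach(function (k) {
        reduce(k, makeIterator(groups[k]), reduceContext);
    });
    return results;
}

var counts = runLocally(["the quick brown fox", "The lazy dog and the fox"]);
console.log(counts);
```

The grouping step in the middle is the part the slide leaves implicit: every value written by map for the same key is collected before reduce is called once per key.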
Demonstration
Peter Myers – Bitwise Solutions
1 – Word Count (The “Hello World” for Hadoop)
Traditional E-Commerce Data Flow
OPERATIONAL DATA
NEW USER REGISTRY
NEW PURCHASE
NEW PRODUCT
Excess Data
Logs
ETL Some Data
Data Warehouse
New E-Commerce Big Data Flow
OPERATIONAL DATA
NEW USER REGISTRY
NEW PURCHASE
NEW PRODUCT
Data Warehouse
Logs
Raw Data
“Store it All” Cluster
Demonstration
Peter Myers – Bitwise Solutions
2 – Integration Services ETL with HIVE
The Hadoop Data Flow
Hadoop
Data Analytics
Demonstration
Saptak Sen – Microsoft
3 – Self-Service BI with HIVE
Hadoop Capabilities
Machine Learning
Graph Processing
Distributed Compute
Extract Load Transform
Predictive Analysis
Common Big Data Algorithms
Mining Social-Network Graphs
Finding Similar Items
Mining Data Streams
Frequent Item Sets
Advertising on the Web
Link Analysis
Recommendation Systems
Clustering
Common Big Data Algorithms
Frequent Item Sets – Market Basket Analysis
Market Basket Analysis
Plagiarism
Biomarkers
Related Concepts
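As a rough illustration of the frequent item sets idea behind market basket analysis, the sketch below counts how often each pair of items appears in the same basket and keeps the pairs that reach a minimum count. The function name frequentPairs, the sample baskets and the threshold are all hypothetical, and this is plain brute-force pair counting on one machine rather than any particular algorithm from the session; on a cluster the same counting would be spread across servers.

```javascript
// Hypothetical sketch: brute-force counting of item pairs per basket.
// Returns the pairs that occur together in at least `minSupport` baskets.
function frequentPairs(baskets, minSupport) {
    var pairCounts = Object.create(null);
    baskets.forEach(function (basket) {
        // De-duplicate and sort so each pair has one canonical key.
        var items = Array.from(new Set(basket)).sort();
        for (var i = 0; i < items.length; i++) {
            for (var j = i + 1; j < items.length; j++) {
                var key = items[i] + " + " + items[j];
                pairCounts[key] = (pairCounts[key] || 0) + 1;
            }
        }
    });
    return Object.keys(pairCounts)
        .filter(function (key) { return pairCounts[key] >= minSupport; })
        .sort()
        .map(function (key) { return { pair: key, count: pairCounts[key] }; });
}

// Made-up baskets for illustration only.
var baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"]
];

var result = frequentPairs(baskets, 3);
result.forEach(function (entry) {
    console.log(entry.pair + ": " + entry.count);
});
```

The same counting applies to the other uses listed on the slide if "basket" and "item" are read more loosely, for example documents and shared sentences when looking for plagiarism.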
Demonstration
Peter Myers – Bitwise Solutions
4 – Analysis Services Data Mining with HIVE
Common Big Data Algorithms
Finding Similar or Complementary Items
Collaborative Filtering
Similar music tastes
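One simple way to picture collaborative filtering on music tastes is to find the listener whose tastes overlap most with yours and suggest what they like that you have not listed. The sketch below does this with Jaccard similarity (shared items divided by all distinct items). The function names, listener names and genres are hypothetical, and Jaccard similarity is just one possible measure chosen for illustration; it is not presented here as the method used in the Mahout demonstration.

```javascript
// Jaccard similarity between two lists: |intersection| / |union|.
function jaccard(a, b) {
    var setA = new Set(a);
    var setB = new Set(b);
    var shared = 0;
    setA.forEach(function (item) {
        if (setB.has(item)) { shared++; }
    });
    var unionSize = setA.size + setB.size - shared;
    return unionSize === 0 ? 0 : shared / unionSize;
}

// Hypothetical helper: find the most similar other listener and
// suggest the items they like that `user` does not already have.
function recommendFor(user, tastes) {
    var best = null;
    var bestScore = -1;
    Object.keys(tastes).forEach(function (other) {
        if (other === user) { return; }
        var score = jaccard(tastes[user], tastes[other]);
        if (score > bestScore) {
            best = other;
            bestScore = score;
        }
    });
    if (best === null) {
        return { neighbour: null, similarity: 0, suggestions: [] };
    }
    var mine = new Set(tastes[user]);
    return {
        neighbour: best,
        similarity: bestScore,
        suggestions: tastes[best].filter(function (item) {
            return !mine.has(item);
        })
    };
}

// Made-up listeners and genres for illustration only.
var tastes = {
    alice: ["jazz", "blues", "soul"],
    bob: ["jazz", "blues", "funk"],
    carol: ["metal", "punk"]
};

var rec = recommendFor("alice", tastes);
console.log(rec);
```

Comparing every listener with every other listener grows quadratically, which is why this kind of work is a natural fit for distributed processing once the number of listeners is large.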
Demonstration
Saptak Sen – Microsoft
5 – Data Mining with Apache Mahout
Do You Have Big Data?
It is likely that you have big data – you’re definitely capturing outcome data, and probably capturing ambient data
All data – outcome or ambient – has value
Azure and SQL Server Data Platform can unleash insight from big data, small data, all data
Take action and operationalize
Form theories, analyze, and refine
Find, combine, and manage
Complete. Powerful. Easy.
DATA INSIGHT
Resources
Microsoft Big Data
http://www.microsoft.com/bigdata
Windows Azure HDInsight
https://www.hadooponazure.com
HDInsight Services for Windows
Includes an excellent set of BI specific resources in the section named “Using HDInsight with Other BI Technologies”
http://social.technet.microsoft.com/wiki/contents/articles/6204.hadoop-based-services-for-windows-en-us.aspx
Blog: Big Data for Everyone: Using Microsoft’s Familiar BI Tools with Hadoop
http://blogs.msdn.com/b/microsoft_business_intelligence1/archive/2012/02/24/big-data-for-everyone-using-microsoft-s-familiar-bi-tools-with-hadoop.aspx
Related content
Breakout Sessions
DBI-B366: Big Data Analytics with Microsoft Excel 2013 [Wed 8:30AM]
DBI-B340: Taking Your Application Design to the Next Level by Using SQL Server 2012 Data Mining [Thu 10:15AM]
DBI-B401: Enriching Big Data for Analysis [Fri 10:15AM]
DBI-B221: Data Management in Microsoft HDInsight: How to Move and Store Your Data [Fri 4:30PM]
Resources
MSDN: Resources for Developers
http://microsoft.com/msdn
Learning: Microsoft Certification & Training Resources
www.microsoft.com/learning
TechNet: Resources for IT Professionals
http://microsoft.com/technet
Sessions on Demand
http://channel9.msdn.com/Events/TechEd
© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.