BIG Data Science: A Path Forward


Post on 09-Jul-2015






This talk uses a case study to demonstrate the core data science capabilities in Big Data, the infrastructure requirements, and the talent profiles that translate to early success. Built around the challenge of classifying events on a consumer-oriented website, the discussion is aimed at a wide audience:
- Practitioners will learn two key techniques for early success
- Technologists will learn how teams rely on key infrastructure and where engineers play a valuable role in data science
- Hiring managers will expand their knowledge of the skills required to bring business value with data



BIG DATA SCIENCE: A PATH FORWARD
June 2013
CONFIDENTIAL

Science Lead @ Think Big
Product/Brand Obsessive
Teacher
Occasional Engineer

TODAY

High-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.

Understand our organizational needs for data science

INFRASTRUCTURE, TALENT, & CAPABILITIES

Infrastructure: Technological tools and platforms.
Talent: Staff hired and trained.
Capabilities: Data science techniques utilized.

Infrastructure: Hadoop, NoSQL, Analytics, SQL/MPP, Real Time
Talent: Scripting, MapReduce, Data Exploration, Basic Modeling, PhD Math
Capabilities: Visualization, Clustering, Categorization, Continuous Models, Text Analysis

ANALYTICS TOOLS

Boxed Solutions: Mahout & Platform
Toolkits: RHadoop, Scikit, etc.

You will need toolkits to solve unique problems but smart techniques make that easier. Boxed solutions are limited but can be a good source of early velocity.

DATA

Presenter:
- Gigabytes from Stack Overflow
- Questions from users, with metadata
- Users have reputations
- Questions open or closed

Audience:
- Follow along
- Thinking about your data
- To learn in a familiar context and plan

STEP 1: EXPLORE

    select count(1) as total
         , sum(has_code)
         , avg(body_count)
         , stddev_samp(body_count)
         , corr(reputation, owner_questions)
         , histogram_numeric(body_count, 10)
    from questions;
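To make the exploration query concrete, here is a minimal Python sketch of the same summary statistics computed over a handful of rows in memory. The column names come from the query; the sample values are invented for illustration, and the equal-width histogram is a simplification (Hive's histogram_numeric returns approximate, non-uniform bins).

```python
import math

# Hypothetical sample rows. Column names follow the exploration query;
# the values are made up for illustration only.
ROWS = [
    {"has_code": 1, "body_count": 100, "reputation": 1, "owner_questions": 1},
    {"has_code": 0, "body_count": 200, "reputation": 10, "owner_questions": 2},
    {"has_code": 1, "body_count": 300, "reputation": 50, "owner_questions": 3},
    {"has_code": 1, "body_count": 400, "reputation": 200, "owner_questions": 5},
    {"has_code": 0, "body_count": 500, "reputation": 1000, "owner_questions": 8},
]


def pearson(xs, ys):
    """Pearson correlation coefficient, the quantity corr() reports."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def equal_width_histogram(values, bins):
    """Counts per equal-width bin (a simplification of histogram_numeric)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values match
    counts = [0] * bins
    for value in values:
        index = min(int((value - lo) / width), bins - 1)
        counts[index] += 1
    return counts


def summarize(rows, bins=10):
    """One pass of summary statistics, mirroring the select list."""
    body = [r["body_count"] for r in rows]
    n = len(body)
    mean = sum(body) / n
    sample_var = sum((b - mean) ** 2 for b in body) / (n - 1)
    return {
        "total": n,
        "with_code": sum(r["has_code"] for r in rows),
        "avg_body": mean,
        "stddev_body": math.sqrt(sample_var),
        "corr_rep_questions": pearson(
            [r["reputation"] for r in rows],
            [r["owner_questions"] for r in rows],
        ),
        "body_histogram": equal_width_histogram(body, bins),
    }
```

Calling summarize(ROWS, bins=5) returns a dictionary with one entry per column of the query. On real data the query itself is the better tool; the sketch is only meant to show what each aggregate measures.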

Patterns through Hive
Patterns through Tableau

STEP 2: FEATURE BUILDING

Summaries of unstructured data
Time-since metrics
select transform() using python
Clustering: Browsing cohorts
/bin/mahout canopy
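The select transform() using python pattern hands each row to a script as a tab-separated line on standard input and reads tab-separated lines back from standard output. The sketch below shows what such a script might look like for two of the features named above: a summary of unstructured text and a time-since metric. The four-column input layout, the feature names, and the use of a `<code>` tag to detect code are assumptions made for illustration, not details from the talk.

```python
import sys

SECONDS_PER_DAY = 86400


def build_features(line):
    """Turn one tab-separated input row into one tab-separated feature row.

    Assumed (hypothetical) input columns:
    question_id, body, asked_at, owner_joined_at (both times in epoch seconds).
    """
    question_id, body, asked_at, owner_joined_at = line.rstrip("\n").split("\t")

    # Summaries of unstructured data: length of the body and a code flag.
    word_count = len(body.split())
    has_code = int("<code>" in body)  # assumption: bodies are stored as HTML

    # Time-since metric: whole days between the owner joining and asking.
    days_since_join = (int(asked_at) - int(owner_joined_at)) // SECONDS_PER_DAY

    return "\t".join(
        [question_id, str(word_count), str(has_code), str(days_since_join)]
    )


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Stream rows through build_features, skipping blank lines."""
    for line in stdin:
        if line.strip():
            stdout.write(build_features(line) + "\n")


# When saved as a standalone script for Hive, call main() here.
```

In Hive the script would typically be registered with add file and invoked along the lines of select transform(id, body, asked_at, owner_joined_at) using 'python features.py' as (id, word_count, has_code, days_since_join) from questions; where the script name and column names are again placeholders.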

SQL Windowing
Cross-Record Features

PARALLEL MODELS IN HADOOP

Sample (don't parallelize)
Naturally parallel
SVD
Random Forests
Estimators and Ensembles
Bootstrapping
Localizing
Advanced Parallelization
Linear models with SGD
Neural networks

Single R model run many times over samples and aggregated
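The idea of a single model run many times over samples and then aggregated can be sketched in a few lines. The example below is a Python stand-in rather than the R model from the talk: it fits a least-squares line to each bootstrap sample and averages the predictions. The local loop plays the role that independent Hadoop tasks would play on a cluster, and all function names are hypothetical.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        # Degenerate sample (all x equal): fall back to predicting the mean.
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    return slope, mean_y - slope * mean_x


def bootstrap_sample(points, rng):
    """Draw len(points) rows with replacement."""
    return [rng.choice(points) for _ in points]


def bagged_fit(points, n_models=25, seed=0):
    """Fit the same simple model on many bootstrap samples.

    Each fit is independent of the others, which is what makes this
    approach easy to spread across separate tasks.
    """
    rng = random.Random(seed)
    return [fit_line(bootstrap_sample(points, rng)) for _ in range(n_models)]


def bagged_predict(models, x):
    """Aggregate by averaging the individual model predictions."""
    return sum(slope * x + intercept for slope, intercept in models) / len(models)
```

For example, models = bagged_fit([(x, 2 * x + 1) for x in range(10)]) followed by bagged_predict(models, 4) gives the ensemble's estimate at x = 4. Averaging is one aggregation choice among several; voting or taking a median are common alternatives for other model types.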