BIG Data Science: A Path Forward
Post on 09-Jul-2015
DESCRIPTION
This talk uses a case study to demonstrate core data science capabilities in Big Data, infrastructure requirements, and talent profiles that translate to early success. Built around the challenge of classifying events on a consumer-oriented website, the discussion is aimed at a wide audience:
- Practitioners will learn two key techniques for early success
- Technologists will learn how teams rely on key infrastructure and where engineers play a valuable role in data science
- Hiring managers will expand their knowledge of the skills required to bring business value with data
June 2013
BIG DATA SCIENCE: A PATH FORWARD
CONFIDENTIAL
firstname.lastname@example.org
Data Science Lead @ Think Big
Product/Brand Obsessive
Teacher
Occasional Engineer
TODAY
High-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.
Understand our organizational needs for data science
INFRASTRUCTURE, TALENT, & CAPABILITIES
Infrastructure: Technological tools and platforms.
Talent: Staff hired and trained.
Capabilities: Data science techniques utilized.
Hadoop, NoSQL, Analytics, SQL/MPP, Real Time, Scripting, MapReduce, Data Exploration, Basic Modeling, PhD Math, Visualization, Clustering, Categorization, Continuous Models, Text Analysis
ANALYTICS TOOLS
Boxed Solutions: Mahout & Platform
Toolkits: RHadoop, Scikit, etc.
You will need toolkits to solve unique problems, but smart techniques make that easier. Boxed solutions are limited but can be a good source of early velocity.
DATA
Presenter:
- Gigabytes from Stack Overflow
- Questions from users, with metadata
- Users have reputations
- Questions open or closed
Audience:
- Follow along, thinking about your data, to learn in a familiar context and plan
STEP 1: EXPLORE
    select count(1) as total                    -- row count
         , sum(has_code)                        -- total of the has_code column
         , avg(body_count)                      -- mean of body_count
         , stddev_samp(body_count)              -- sample standard deviation
         , corr(reputation, owner_questions)    -- Pearson correlation
         , histogram_numeric(body_count, 10)    -- approximate 10-bin histogram
    from questions;
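To make the exploration query concrete, here is a minimal Python sketch of the same summaries computed on a small local sample. The column names come from the query; the in-memory list of dicts and the helper names (`pearson`, `equal_width_histogram`, `explore`) are assumptions for illustration. Note one deliberate difference: Hive's histogram_numeric builds an approximate histogram with non-uniform bins, while this sketch uses plain equal-width bins.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def equal_width_histogram(values, bins=10):
    """Return (bin_start, count) pairs over equal-width bins.

    This is a simplification, not Hive's histogram_numeric algorithm.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values match
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # the max value goes in the last bin
        counts[idx] += 1
    return [(lo + i * width, c) for i, c in enumerate(counts)]


def explore(rows):
    """Summaries mirroring the query, for rows given as dicts (hypothetical layout)."""
    body = [r["body_count"] for r in rows]
    return {
        "total": len(rows),
        "sum_has_code": sum(r["has_code"] for r in rows),
        "avg_body_count": mean(body),
        "stddev_body_count": stdev(body),  # sample standard deviation
        "corr_reputation_owner_questions": pearson(
            [r["reputation"] for r in rows],
            [r["owner_questions"] for r in rows],
        ),
        "body_count_histogram": equal_width_histogram(body, 10),
    }
```

A local version like this is only practical on a sample; the Hive query is what scales to the full table.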
Patterns through Hive
Patterns through Tableau
STEP 2: FEATURE BUILDING
Summaries of unstructured data
Time-since metrics
select transform() using python
Clustering: Browsing cohorts
/bin/mahout canopy
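As an illustration of "select transform() using python", here is a minimal sketch of a streaming script that builds a summary of unstructured text and a time-since metric. Hive's TRANSFORM passes rows to the script as tab-separated lines on stdin and reads tab-separated lines back from stdout. The input column layout, the `<code>` tag check, and the function names are assumptions, not the presenter's actual script; only the feature names body_count and has_code come from the deck.

```python
import io
import sys

NULL = r"\N"  # how Hive's TRANSFORM writes NULL by default


def featurize(line):
    """Turn one tab-separated input row into one tab-separated feature row.

    Assumed (hypothetical) input columns:
      question_id, owner_created_epoch, posted_epoch, body
    Assumes the body contains no raw newlines.
    """
    # maxsplit=3 keeps any tabs inside the body in the last field
    question_id, owner_created, posted, body = line.rstrip("\n").split("\t", 3)
    if body == NULL:
        body = ""
    body_count = len(body.split())            # whitespace-separated tokens
    has_code = 1 if "<code>" in body else 0   # assumes HTML bodies
    if NULL in (owner_created, posted):
        since_signup = NULL
    else:
        # time-since metric: seconds from account creation to the post
        since_signup = str(int(posted) - int(owner_created))
    return "\t".join([question_id, str(body_count), str(has_code), since_signup])


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Stream rows through featurize, one output line per input line."""
    for line in stdin:
        if line.strip():
            stdout.write(featurize(line) + "\n")


# Demo on one in-memory row. As a real TRANSFORM script, replace this with main().
demo_out = io.StringIO()
main(io.StringIO("42\t100\t160\tHow do I <code>x</code> this?\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Saved under a hypothetical name such as featurize.py and registered with `add file`, a script like this could be called as `select transform(question_id, owner_created_epoch, posted_epoch, body) using 'python featurize.py' as (question_id, body_count, has_code, seconds_since_signup) from questions;`. Because each row is processed independently, the work parallelizes across mappers without extra effort.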
SQL Windowing
Cross-Record Features
PARALLEL MODELS IN HADOOP
- Sample (don't parallelize)
- Naturally parallel: SVD, Random Forests
- Estimators and Ensembles: Bootstrapping, Localizing
- Advanced Parallelization: Linear models with SGD, Neural networks
Single R model run many times over samples and aggregated
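The idea of one model run many times over samples and then aggregated can be sketched in a few lines. The slide refers to an R model; this sketch uses Python and swaps in a one-variable least-squares line as a stand-in model, so every name here (`fit_line`, `bootstrap_sample`, `bagged_fit`, `predict`) is illustrative rather than the presenter's code. Each fit depends only on its own sample, which is what would let a cluster run the fits as independent tasks.

```python
import random
from statistics import mean


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept on (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, y_bar  # degenerate sample (one distinct x): predict the mean
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def bootstrap_sample(points, rng):
    """Draw len(points) rows with replacement."""
    return [rng.choice(points) for _ in points]


def bagged_fit(points, n_models=50, seed=0):
    """Fit the same model on many bootstrap samples and average the results."""
    rng = random.Random(seed)
    samples = [bootstrap_sample(points, rng) for _ in range(n_models)]
    fits = list(map(fit_line, samples))     # independent per sample ("map")
    slopes, intercepts = zip(*fits)
    return mean(slopes), mean(intercepts)   # aggregate the fits ("reduce")


def predict(model, x):
    """Evaluate an aggregated (slope, intercept) model at x."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    data = [(float(x), 2.0 * x + 1.0) for x in range(20)]
    model = bagged_fit(data)
    print(predict(model, 10.0))
```

Averaging coefficients works for this simple linear stand-in; for models without comparable coefficients, the usual alternative is to average or vote on the predictions instead.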