IS5126 Lecture 02 - NUS Computing - Lecture 2 – Data, Databases, SQL, Behavioral Analytics


IS5126 - HowBA

Lecture 2 – Data, Databases, SQL, Behavioral Analytics; Jan 18, 2017

Dr. Tuan Q Phan, NUS IS5126

Admin

•  Pick up syllabus and schedule, also available on my website: http://www.tuanqphan.us
•  Purchase HBS Case from http://hbsp.harvard.edu
    –  Data.gov, #9-610-075
•  Sign up team of 4 on IVLE by Jan. 30
    –  Use IVLE forums to find teammates

Dr. Tuan Q PHAN, NUS IS5126, (c) 2017

Learning Objectives

•  Data.gov Case Discussion and Presentations
•  Data Manipulation, ETL
•  SQL
    –  Database Design
    –  Best Practices
    –  Normalization Guidelines
•  Marketing and Behavioral Analytics
•  Mini-case

Learning Objectives

•  Products
    –  Product Life Cycle
    –  Supply/Demand
    –  Market Basket
    –  Marketing Strategy
•  People
    –  CRM
    –  Utility Modeling
•  Organizations/Companies
    –  Competition
    –  Strategy
•  Correlation and Causalities
•  Resource:
    –  The Ten Day MBA, Steven Silbiger
    –  50 Social/Psychology books: http://www.sparringmind.com/psychology-books/

Databases and Manipulation

[Diagram: DATA FLOW. Real World → (Collection) → Raw data → (Import) → Data warehouse, with Transform applied in the warehouse → Analyze → Report]

Data Manipulation

•  Raw data is large, unstructured, noisy
•  Extract, Transform, Load (ETL): process to "clean up" the data for processing and storage
•  Extract: parsing, collection from multiple sources/formats, web scraping
•  Transform: convert to appropriate format, apply set of rules, noise reduction, error handling, translate codes, validation
    –  Python, SQL, awk, sed, ...
•  Load: loads into the data warehouse (database)
•  Staging environment
•  Resource: The Data Warehouse ETL Toolkit, Ralph Kimball & Joe Caserta
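The three ETL steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw text, the validation rule (drop rows whose fields cannot be converted), and the function names `extract`, `transform`, and `load` are hypothetical and not part of the lecture; the target table reuses the `books` schema shown on the SQL slides.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: stray spaces, an unparseable price, a blank line.
RAW = """id,title,published_year,price
1, Practical SQL ,1998,14.00
2,Data Mining,2011,n/a

3,Scoring Points,2008,22.00
"""

def extract(text):
    """Extract: parse the raw CSV text into one dictionary per record."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim strings, convert types, drop rows that fail validation."""
    clean = []
    for row in rows:
        try:
            clean.append((int(row["id"]), row["title"].strip(),
                          int(row["published_year"]), float(row["price"])))
        except (ValueError, TypeError, AttributeError):
            continue  # error handling: skip records that cannot be converted
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute("CREATE TABLE books(id int not null primary key, title text, "
                 "published_year int, price double)")
    conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database stands in for the warehouse
load(transform(extract(RAW)), conn)
loaded = conn.execute("SELECT id, title, price FROM books ORDER BY id").fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the staging idea: each stage can be inspected or rerun on its own before anything reaches the warehouse.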


Data Storage Technology

•  Data is large, need to store, organize, and manipulate
•  Approaches:
    –  File system: tape drive, hard disks, RAID, solid-state drives, NAS

SQL – Introduction

•  SQL: "Structured Query Language" (aka "sequel")
    –  Language for data manipulation
    –  Independent of storage medium
•  Many variants, standardized by ANSI
•  Relational model for database management
•  Heavily used in BA
•  Relational model developed by Edgar Codd, IBM Research Laboratory, in the 1970s; SQL was developed at IBM on top of it
•  Highly popular in the 1980s, 1990s, 2000s, ?
•  Solutions:
    –  Commercial products: Oracle, Microsoft Access, IBM DB2
    –  Open-source: MySQL (Oracle), PostgreSQL, SQLite
    –  Big Data: Hive/Hadoop, Netezza (IBM)

SQL - Introduction

•  Data in tables, rows, and columns (aka relation, tuple, attributes)
•  Value at particular (row, column)
•  Row as "unit of analysis"
•  Primary key: column with unique identifier for row
•  Few commands:
    –  Table manipulation: CREATE, ALTER, DROP, (GRANT)
    –  Data modification: INSERT, DELETE, UPDATE
    –  Query data: SELECT
•  Resource: http://www.sqlite.org

SQL - CREATE

•  Create a named table with named columns and types, "schema"

    CREATE TABLE books(
        id int not null primary key,
        title text,
        published_year int,
        price double
    );

SQL – CREATE Datatypes

•  Columns must be of a type
    –  Fixed-width: fast access, efficient
    –  Variable-width: flexible
•  Numbers: fixed-width
    –  int: tinyint, smallint, mediumint, bigint, unsigned
    –  double
•  Text: variable-width
•  Date: no type in sqlite3; int, datetime, timestamp
    –  String, e.g. "Aug. 28, 2012"
    –  "Unix time": number of seconds since Jan 1, 1970 UTC
    –  Time zones
•  Binary data (e.g. image): blob

SQL – ALTER & DROP

•  Modifies an existing table schema
    alter table books add column author text;
•  Removes a table schema (and its data)
    drop table books;


SQL - INSERT

•  Adds data to table
    insert into books values (1, 'Practical SQL', 1998, 14.00, 'Bowman');
    insert into books values (2, 'Data Mining', 2011, 26.85, 'Linoff');

id          title          published_year  price       author
----------  -------------  --------------  ----------  ----------
1           Practical SQL  1998            14.0        Bowman
2           Data Mining    2011            26.85       Linoff
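The CREATE, ALTER, and INSERT statements from the slides can also be run from Python through the standard `sqlite3` module. The sketch below is illustrative rather than the lecture's own code: the in-memory database and the duplicate row with id 1 are assumptions added to show what the primary key enforces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Schema from the CREATE slide, then the ALTER that adds the author column.
conn.execute("""
    CREATE TABLE books(
        id int not null primary key,
        title text,
        published_year int,
        price double
    )""")
conn.execute("ALTER TABLE books ADD COLUMN author text")

# Parameter placeholders (?) let the driver handle quoting of the values.
rows = [
    (1, "Practical SQL", 1998, 14.00, "Bowman"),
    (2, "Data Mining", 2011, 26.85, "Linoff"),
]
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?, ?)", rows)

# The primary key rejects a second row with an existing id (hypothetical row).
try:
    conn.execute("INSERT INTO books VALUES (1, 'Duplicate', 2000, 1.00, 'Nobody')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

result = conn.execute("SELECT * FROM books ORDER BY id").fetchall()
print(result, duplicate_rejected)
```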

SQL – Loading data

•  Load data from a csv file: software-specific

books.csv:
    3,"Scoring Points",2008,22.00,"Humby"
    4,"Business Intelligence",2009,57.85,"Vercellis"

    .separator ","
    .import books.csv books

id          title          published_year  price       author
----------  -------------  --------------  ----------  ----------
1           Practical SQL  1998            14.0        Bowman
2           Data Mining    2011            26.85       Linoff
3           Scoring Point  2008            22.0        Humby
4           Business Inte  2009            57.85       Vercellis

SQL – DELETE & UPDATE

•  Deletes a row
    delete from books where id=4;
•  Modifies value(s); without a WHERE clause, every row is updated
    update books set price=5.00;

books after the update (shown with row 4 still present):

id          title          published_year  price       author
----------  -------------  --------------  ----------  ----------
1           Practical SQL  1998            5.0         Bowman
2           Data Mining    2011            5.0         Linoff
3           Scoring Point  2008            5.0         Humby
4           Business Inte  2009            5.0         Vercellis

SQL – SELECT

•  Query database
    select * from books;

id          title          published_year  price       author
----------  -------------  --------------  ----------  ----------
1           Practical SQL  1998            5.0         Bowman
2           Data Mining    2011            5.0         Linoff
3           Scoring Point  2008            5.0         Humby
4           Business Inte  2009            5.0         Vercellis

•  Sort results
    select * from books order by published_year desc;

id          title        published_year  price       author
----------  -----------  --------------  ----------  ----------
2           Data Mining  2011            5.0         Linoff
4           Business In  2009            5.0         Vercellis
3           Scoring Poi  2008            5.0         Humby
1           Practical S  1998            5.0         Bowman

SQL – SELECT … WHERE

•  Where clause subsets results
    select title, author, published_year from books where published_year > 2000;

title        author      published_year
-----------  ----------  --------------
Data Mining  Linoff      2011
Scoring Poi  Humby       2008
Business In  Vercellis   2009

•  Combining conditions
    select * from books where published_year > 2000 and author='Linoff';

id          title        published_year  price       author
----------  -----------  --------------  ----------  ----------
2           Data Mining  2011            5.0         Linoff

SQL – SELECT … FUZZY

•  Allows for wildcard string matching
    select * from books where title like '%ness%';

id          title                  published_year  price       author
----------  ---------------------  --------------  ----------  ----------
4           Business Intelligence  2009            5.0         Vercellis


SQL – Group by

•  Aggregate by a column:
    insert into books values (5, '2008 book', 2008, 25.00, 'Phan');
    select published_year, count(*), avg(price), sum(price) from books group by published_year;

id          title          published_year  price       author
----------  -------------  --------------  ----------  ----------
1           Practical SQL  1998            5.0         Bowman
2           Data Mining    2011            5.0         Linoff
3           Scoring Point  2008            5.0         Humby
4           Business Inte  2009            5.0         Vercellis
5           2008 book      2008            25.0        Phan

published_year  count(*)    avg(price)  sum(price)
--------------  ----------  ----------  ----------
1998            1           5.0         5.0
2008            2           15.0        30.0
2009            1           5.0         5.0
2011            1           5.0         5.0

SQL – Embedded queries

    select avg(sub.num_books)
    from (select published_year, count(*) as num_books
          from books group by published_year) sub;

Inner query (sub):

published_year  num_books
--------------  ----------
1998            1
2008            2
2009            1
2011            1

Result:

avg(sub.num_books)
------------------
1.25
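The GROUP BY and embedded-query results can be reproduced with a short script. This is a sketch that rebuilds the five-row `books` table as it appears on the Group by slide (all prices 5.0 except the added 2008 book); the in-memory database and the explicit `ORDER BY` are assumptions added so that the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books(id int not null primary key, title text, "
             "published_year int, price double, author text)")
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?, ?)", [
    (1, "Practical SQL", 1998, 5.0, "Bowman"),
    (2, "Data Mining", 2011, 5.0, "Linoff"),
    (3, "Scoring Points", 2008, 5.0, "Humby"),
    (4, "Business Intelligence", 2009, 5.0, "Vercellis"),
    (5, "2008 book", 2008, 25.0, "Phan"),
])

# One output row per published_year, with three aggregates per group.
per_year = conn.execute(
    "SELECT published_year, count(*), avg(price), sum(price) "
    "FROM books GROUP BY published_year ORDER BY published_year").fetchall()

# The embedded (sub)query is aggregated again by the outer query.
avg_books = conn.execute(
    "SELECT avg(sub.num_books) FROM "
    "(SELECT published_year, count(*) AS num_books "
    " FROM books GROUP BY published_year) sub").fetchone()[0]

print(per_year)
print(avg_books)
```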

SQL - JOIN

•  Ability to combine rows from two or more tables by columns, "JOIN"
    select * from books b, publish_year p where b.published_year=p.year;
•  Where is 1998?

publish_year:

year        num_books
----------  ----------
2008        100
2009        120
2010        90
2011        104

books:

id          title          published_year  price       author
----------  -------------  --------------  ----------  ----------
1           Practical SQL  1998            5.0         Bowman
2           Data Mining    2011            5.0         Linoff
3           Scoring Point  2008            5.0         Humby
4           Business Inte  2009            5.0         Vercellis
5           2008 book      2008            25.0        Phan

Result:

id          title        published_year  price       author      year        num_books
----------  -----------  --------------  ----------  ----------  ----------  ----------
2           Data Mining  2011            5.0         Linoff      2011        104
3           Scoring Poi  2008            5.0         Humby       2008        100
4           Business In  2009            5.0         Vercellis   2009        120
5           2008 book    2008            25.0        Phan        2008        100

SQL - Sets

[Venn diagram: Table A and Table B overlap; regions labeled Y (A only), X (overlap), Z (B only)]

1.  Inner join: X = A ∩ B
    select * from books b inner join publish_year p on b.published_year=p.year;
2.  Left join: A = Y ∪ X
    select * from books b left join publish_year p on b.published_year=p.year;

Left join result:

id          title          published_year  price       author      year        num_books
----------  -------------  --------------  ----------  ----------  ----------  ----------
1           Practical SQL  1998            5.0         Bowman
2           Data Mining    2011            5.0         Linoff      2011        104
3           Scoring Point  2008            5.0         Humby       2008        100
4           Business Inte  2009            5.0         Vercellis   2009        120
5           2008 book      2008            25.0        Phan        2008        100
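The difference between the two joins can be checked directly. The sketch below rebuilds `books` (trimmed to three columns for brevity, which is an assumption of this example) and `publish_year` from the slides, then runs the same query as an inner and as a left join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books(id int primary key, title text, published_year int)")
conn.execute("CREATE TABLE publish_year(year int primary key, num_books int)")
conn.executemany("INSERT INTO books VALUES (?, ?, ?)", [
    (1, "Practical SQL", 1998), (2, "Data Mining", 2011),
    (3, "Scoring Points", 2008), (4, "Business Intelligence", 2009),
    (5, "2008 book", 2008)])
conn.executemany("INSERT INTO publish_year VALUES (?, ?)",
                 [(2008, 100), (2009, 120), (2010, 90), (2011, 104)])

QUERY = """SELECT b.id, p.num_books FROM books b
           {kind} JOIN publish_year p ON b.published_year = p.year
           ORDER BY b.id"""

# Inner join keeps only books whose year exists in publish_year.
inner = conn.execute(QUERY.format(kind="INNER")).fetchall()
# Left join keeps every book; missing matches come back as NULL (None).
left = conn.execute(QUERY.format(kind="LEFT")).fetchall()

unmatched = [book_id for book_id, num_books in left if num_books is None]
print(inner)
print(left)
print(unmatched)
```

The 1998 book disappears from the inner join because `publish_year` has no 1998 row; the left join keeps it with empty `publish_year` columns.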

Long vs. wide

•  Long tables vs. wide tables
•  Pivot table, cross tabulation, report

Long:

trans_id    book_id  year  num_books
----------  -------  ----  ----------
1           1        2008  5
2           1        2008  1
3           1        2009  1
4           2        2011  3
5           3        2009  4
6           3        2009  1
7           4        2010  1
8           4        2010  5
9           4        2011  2
10          5        2010  1

Wide:

book_id     y2008  y2009  y2010  y2011
----------  -----  -----  -----  ----------
1           6      1      0      0
2           0      0      0      3
3           0      5      0      0
4           0      0      6      2
5           0      0      1      0

Database Design

•  How to design table schema?
•  Which columns go where?
•  Good design characteristics:
    –  Makes interactions with database easy to understand
    –  Consistency of values and database
    –  High performance
•  Bad design characteristics:
    –  Misunderstanding of query
    –  Increased risk of inconsistencies
    –  Redundant data entry
    –  Difficult to change structure of the tables


Database Design

•  Normalization: reduce duplicates, protect data integrity
•  Non-loss decomposition: splitting tables with redundant values into two or more tables
    –  Join to "put back together"
•  Clear, easy to read table and column names:
    –  E.g. books_prices, author_firstname, books, authors
•  Entity-relationship (ER) modeling
•  Define relationship types: 1-1, 1-N, N-N
•  No magic bullet, iterate and experience

General Guidelines

1.  What kind of questions are we trying to answer?
2.  What are the sources of data?
3.  Which are the focal entities or subjects?
    •  Row as one thing in the entity, columns as attributes
    •  Independent existence
4.  Group common columns, use E-R diagrams to help
5.  Determine unique identifier – primary key
6.  What are the relationships between entities: 1-1, 1-N, N-N
7.  Normalize and verify
8.  Test and reiterate

Normalization Guidelines

•  First normal form:
    –  Each row-column intersection must be one and only one value
    –  Must be atomic
    –  No repeating groups
    –  "Rectangular" tables

Bad:

Order_id  Book_id1  Transact_date1  Book_id2  Transact_date2
--------  --------  --------------  --------  --------------
1         1         19/10/2010
2         1         01/10/2010      2         01/10/2010

Better:

Record_id  Order_id  Book_id  Transact_date
---------  --------  -------  -------------
1          1         1        19/10/2010
2          2         1        01/10/2010
3          2         2        01/10/2010
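The step from the "Bad" table to the "Better" one is mechanical: each repeating group (Book_idN, Transact_dateN) becomes its own row. The sketch below shows one way to do that in plain Python; the function name `to_first_normal_form` and the dictionary layout of the input are assumptions made for this illustration.

```python
# The "Bad" table: one row per order, with repeating (book, date) groups.
bad_rows = [
    {"order_id": 1, "book_id1": 1, "transact_date1": "19/10/2010",
     "book_id2": None, "transact_date2": None},
    {"order_id": 2, "book_id1": 1, "transact_date1": "01/10/2010",
     "book_id2": 2, "transact_date2": "01/10/2010"},
]

def to_first_normal_form(rows, groups=2):
    """Emit one (record_id, order_id, book_id, transact_date) row per filled group."""
    records = []
    for row in rows:
        for i in range(1, groups + 1):
            book_id = row.get(f"book_id{i}")
            if book_id is None:
                continue  # empty group: nothing to emit
            records.append((len(records) + 1, row["order_id"],
                            book_id, row[f"transact_date{i}"]))
    return records

better_rows = to_first_normal_form(bad_rows)
for record in better_rows:
    print(record)
```

An order with a third book would need a new column pair in the "Bad" layout, but only another row in the normalized one.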

Normalization Guidelines

•  Second normal form
    –  "Every non-key column must depend on the entire primary key"
    –  Composite primary key
•  Third normal form
    –  No non-key column depends on another non-key column
•  Fourth normal form
    –  No independent 1-N relationships between primary key columns and non-key columns: too many blanks
•  Fifth normal form
    –  Break tables into smallest possible pieces in order to eliminate all redundancy within a table

Combine SQL & Python

•  Python loops to create SQL code
•  Used for aggregation or "pivot tables"
•  Simple scripting
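As an illustration of a Python loop that creates SQL, the sketch below builds one conditional `sum(...)` column per year and turns the long table from the "Long vs. wide" slide into its wide form. The table name `sales` is hypothetical (the slide does not name the table), and the years are interpolated into the SQL text only because they come from a fixed list inside the script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "sales" is a hypothetical name for the long table on the slide.
conn.execute("CREATE TABLE sales(trans_id int primary key, book_id int, "
             "year int, num_books int)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, 1, 2008, 5), (2, 1, 2008, 1), (3, 1, 2009, 1), (4, 2, 2011, 3),
    (5, 3, 2009, 4), (6, 3, 2009, 1), (7, 4, 2010, 1), (8, 4, 2010, 5),
    (9, 4, 2011, 2), (10, 5, 2010, 1)])

years = [2008, 2009, 2010, 2011]

# The loop generates one aggregate column per year: y2008, y2009, ...
columns = ", ".join(
    f"sum(CASE WHEN year = {y} THEN num_books ELSE 0 END) AS y{y}"
    for y in years)
pivot_sql = f"SELECT book_id, {columns} FROM sales GROUP BY book_id ORDER BY book_id"

wide = conn.execute(pivot_sql).fetchall()
print(pivot_sql)
for row in wide:
    print(row)
```

Adding a year to the list extends the pivot without touching the SQL by hand, which is the main reason to generate the statement in a loop.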

When to use Python, SQL, R?

•  Similar tools for all languages
•  Excel: filters, sort, pivot table, ...
    –  Pro: easy GUI, "intuitive," easy for prototyping
    –  Cons: slow, cannot handle large datasets, requires highly structured data, limited tools, $$$
•  Python: dictionaries, loops, NumPy, etc.
    –  Pro: flexible, fast, good for big datasets, rich/multimedia data
    –  Cons: slow file systems, limited tools, complicated for simple tasks
•  SQL: select, group by, ...
    –  Pro: many commercial and open source solutions, fast (when structured properly)
    –  Cons: requires structured data, limited binary data support, $$$
•  R: indices, aggregate, ddply, data.table, ...
    –  Pro: single language/framework, many packages for fast ETL
    –  Cons: memory inefficient, slow, single processor (except Revolution R), inconsistent notation across packages


Best Practice Guidelines

•  Time-space (best practices)
•  Big raw data best in file systems (hard drive)
    –  Python for raw data collection, binary data
    –  Input: raw data
    –  Output: semi-structured, non-normalized (e.g. csv)
•  ETL and manipulation in data warehouse (e.g. SQL)
    –  SQLite: easy to use, standard ANSI
    –  MySQL: free, open source, fast reads
    –  Oracle: transaction data (writes)
    –  Hadoop: big and slow, Hive provides SQL-like notation
    –  Input: semi-structured
    –  Output: highly structured, transformed data ready for analysis, unit of analysis on each row
•  Analysis in statistical tools (R, Stata, SPSS, Matlab, etc.):
    –  Commercial and open source available
    –  Commercial faster, higher performance, better memory management
    –  Input: highly structured
    –  Output: reports, analysis, insights, visualizations

Misc.

•  Other database design paradigms
•  Dimensional Modeling
•  Resource: The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling; Ralph Kimball & Margy Ross

Break

Marketing and Behavioral Analytics

•  What is the unit of analysis?
    –  Country
    –  Firms
    –  Products
    –  Consumers/individuals
•  Aggregation vs. sparsity
•  "Big Data" makes sparsity less of a problem

Product Life Cycle (PLC)

•  Stages of product adoption and sales
•  Introduction, Growth, Maturity, Decline

PLC – Bass Diffusion Model

•  A New Product Growth for Model Consumer Durables, Bass, F. M., Management Science, 1969
•  Adoption model of consumer durables
•  Pr(t): probability of purchase at time t
•  m: total market size (number of people)
•  Y(t): number of previous buyers
•  p: innovation (probability)
•  q: imitation (probability)

$$\Pr(t) = p + \frac{q}{m}\,Y(t)$$
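To see how Pr(t) = p + (q/m) Y(t) produces an S-shaped adoption curve, the formula can be iterated period by period. The sketch below is a simple discrete-time reading of the model, which is an assumption of this example: in each period, the people who have not yet bought, m − Y(t), purchase with probability Pr(t). The values p = 0.03, q = 0.38, and m = 100 are illustrative and not taken from the lecture.

```python
def bass_adoption(p, q, m, periods):
    """Return the number of new adopters in each period.

    Assumes new adopters = Pr(t) * (m - Y(t)), with Pr(t) = p + (q / m) * Y(t).
    """
    cumulative = 0.0          # Y(t): number of previous buyers
    new_by_period = []
    for _ in range(periods):
        pr = p + (q / m) * cumulative      # probability of purchase at time t
        new = pr * (m - cumulative)        # applied to those who have not bought yet
        new_by_period.append(new)
        cumulative += new
    return new_by_period

# Illustrative parameters (assumptions, not lecture values).
new_adopters = bass_adoption(p=0.03, q=0.38, m=100, periods=20)
peak_period = new_adopters.index(max(new_adopters)) + 1
total_adopters = sum(new_adopters)

print([round(x, 2) for x in new_adopters])
print(peak_period, round(total_adopters, 2))
```

Early sales come almost entirely from the innovation term p; as Y(t) grows, the imitation term (q/m) Y(t) takes over, and sales fall again once few non-adopters remain.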


PLC – Innovators & Imitators

[Figure: two charts. Cumulative No. of Adopters (in millions) vs. Year; Non-cumulative Adopters (in millions) vs. Year, split into Innovators and Imitators.]

PLC – Crossing the Chasm

Resource: Crossing the Chasm, Geoffrey A. Moore, 1991

Products – Supply/Demand

•  Laws of supply & demand
•  High demand, high prices
    –  Demand is not static
    –  Promotion can change demand
•  Surplus supply, low prices
    –  Efficient stock allocation
    –  Stockout problems

Products – Supply/Demand

•  Profit (margins) = Price – Cost
•  Cost = fixed cost + marginal cost
•  Perfect market competition => efficiency
•  Advertising and promotions can increase demand
•  R.O.I.: Return on Investment = Profit / Investment

Marketing Strategy

Product – Market Basket Analysis

•  Look at what products are purchased together
•  Association rules: correlation between A & B
    –  Prob(A|B), Prob(B|A)
    –  Beer & Diapers
•  Feature analysis: e.g. size, color, specifications
•  Cross-sell
    –  Upsell: sell more expensive/higher margin product
    –  Substitutes
    –  Recommendation engines
•  Bundling: package two similar products
    –  Low cost of bundling
    –  (WordPerfect & Lotus) vs. Microsoft Office
    –  Converged devices: (PDA & phone) vs. smartphone
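Prob(A|B) and Prob(B|A) can be estimated by counting baskets. The sketch below uses a handful of made-up baskets (the data and the function name `conditional_prob` are assumptions for illustration) to show that the two conditional probabilities generally differ.

```python
# Hypothetical transactions: each basket is the set of products bought together.
baskets = [
    {"beer", "diapers"},
    {"beer", "chips"},
    {"diapers", "wipes"},
    {"beer", "diapers", "chips"},
    {"beer"},
    {"milk"},
]

def conditional_prob(baskets, a, b):
    """Estimate Prob(a | b): share of baskets containing b that also contain a."""
    with_b = [basket for basket in baskets if b in basket]
    if not with_b:
        return None  # undefined when b never occurs
    return sum(1 for basket in with_b if a in basket) / len(with_b)

p_beer_given_diapers = conditional_prob(baskets, "beer", "diapers")
p_diapers_given_beer = conditional_prob(baskets, "diapers", "beer")

print(round(p_beer_given_diapers, 3))
print(round(p_diapers_given_beer, 3))
```

Here diaper buyers usually also buy beer, while beer buyers add diapers only half the time, so which product is the condition matters for cross-selling.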


Products

•  Isn't product-level analysis perfect?
•  What is missing?
•  Why should we care about individual/consumer analysis?

People (Consumers)

•  5-step buying process, "marketing funnel":
    –  Awareness: "I might need soap"
        •  Triggers include advertising
    –  Information search: "Dove soap sounds good, let me find out more about it"
        •  Targeting and segmentation to get best information to customers
    –  Evaluate alternatives: Which is best for me? Within and outside category
        •  Influencers can play key role
    –  Purchase: distribution channel
    –  Evaluate (post purchase): "Did I make a mistake?"
        •  Repeat purchase?
        •  Produce positive word-of-mouth (WOM)

People - CRM

People – CRM Acquisition

•  Acquisition:
    –  Acquisition rate (%) = (Number of prospects acquired / Number of prospects targeted) x 100
    –  Acquisition is defined as the first purchase or purchasing in the first predefined period
    –  Denotes average probability of acquiring a customer
    –  Always calculated for a group of customers
    –  Usually computed on a campaign-by-campaign basis
•  Acquisition cost per prospect
    –  Acquisition cost ($) = Acquisition spending ($) / Number of prospects acquired
    –  Measured in monetary terms
    –  Precise values for companies targeting prospects through direct mail
    –  Less precise for broadcasted communications
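A short worked example of the two acquisition formulas, using a hypothetical direct-mail campaign (10,000 prospects targeted, 250 acquired, $5,000 spent; these numbers and the function names are assumptions, not lecture data):

```python
def acquisition_rate(acquired, targeted):
    """Acquisition rate (%) = prospects acquired / prospects targeted x 100."""
    return acquired / targeted * 100

def acquisition_cost(spending, acquired):
    """Acquisition cost ($) = acquisition spending / prospects acquired."""
    return spending / acquired

# Hypothetical campaign figures.
targeted, acquired, spending = 10_000, 250, 5_000.0

rate = acquisition_rate(acquired, targeted)
cost = acquisition_cost(spending, acquired)
print(f"Acquisition rate: {rate:.1f}%")
print(f"Acquisition cost: ${cost:.2f} per acquired prospect")
```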

People – CRM Activity Measurements

•  Track customers' loyalty program
•  Observe transactions per customer over time
•  RFM:
    –  Recency: when was the last purchase
    –  Frequency: how often purchases occur in a period
    –  Monetary: total value of sales
•  Easy to calculate
•  Helpful for segmentation
•  Cons:
    –  Not good for forecasting
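RFM is easy to compute from a transaction log. The sketch below assumes a list of (user, date, amount) records; the second user `alee`, the amounts, and the reference date are made up for illustration, and recency is expressed as days since the last purchase.

```python
from datetime import date

# Hypothetical transaction log: (user_id, purchase date, amount spent).
transactions = [
    ("tphan", date(2016, 12, 1), 30.00),
    ("tphan", date(2016, 12, 15), 15.00),
    ("alee", date(2016, 10, 3), 50.00),
]

def rfm(transactions, as_of):
    """Return {user: (recency in days, frequency, monetary)} as of a given date."""
    summary = {}
    for user, day, amount in transactions:
        s = summary.setdefault(user, {"last": day, "frequency": 0, "monetary": 0.0})
        s["last"] = max(s["last"], day)   # most recent purchase so far
        s["frequency"] += 1               # number of purchases in the period
        s["monetary"] += amount           # total value of sales
    return {user: ((as_of - s["last"]).days, s["frequency"], s["monetary"])
            for user, s in summary.items()}

scores = rfm(transactions, as_of=date(2016, 12, 31))
for user, (recency, frequency, monetary) in sorted(scores.items()):
    print(user, recency, frequency, monetary)
```

Sorting or binning customers on these three numbers gives a simple segmentation, which is the use the slide highlights.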

People – CRM Activity Measurements

•  Average inter-purchase time (AIT) = 1 / Number of purchase incidences from first purchase till current time period
    –  Measured in time periods
    –  Evaluation of metric
    –  Easy to calculate
    –  Useful for industries with frequent customer purchases
    –  Marketing intervention might be warranted anytime customers fall considerably below their AIT


People – CRM Retention/Defection rates

•  Retention rate
    –  Average likelihood that a customer purchases in period t, given that he/she has purchased in the last period t-1
    –  Retention rate (%) = [(Number of customers in cohort buying in period t | buying in period t-1) / Number of customers in cohort buying in period t-1] x 100
    –  Retention rate (%) = 1 - (1 / Average lifetime duration)
•  Defection rate
    –  Average likelihood that a customer defects in period t, given that he/she has purchased in the last period t-1
    –  Defection rate (%) = 1 - Retention rate
    –  Average lifetime duration = 1 / (1 - Average retention rate)

People – CRM Retention/Defection rates

•  Number of retained customers in period (t+n) = (Number of acquired customers in period t) x (Retention rate)^n
    –  Assuming a constant retention rate among acquired customers
•  Example
    –  Assume a constant retention rate of 0.75, or defection rate of 0.25
    –  Average lifetime duration = 4 (1 / [1 - 0.75])
    –  Customers starting at beginning of year 1 = 100
    –  Customers remaining at end of year 1 = 75.00 (100 x 0.75^1)
    –  Customers remaining at end of year 2 = 56.25 (100 x 0.75^2)
    –  Customers remaining at end of year 3 = 42.19 (100 x 0.75^3)
    –  Customers remaining at end of year 4 = 31.64 (100 x 0.75^4)
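The example's arithmetic can be reproduced in a few lines. The function names below are assumptions of this sketch; the numbers (cohort of 100, constant retention rate of 0.75) are the ones on the slide.

```python
def average_lifetime(retention_rate):
    """Average lifetime duration = 1 / (1 - average retention rate)."""
    return 1 / (1 - retention_rate)

def retained(initial, retention_rate, n):
    """Customers remaining n periods after acquisition, at a constant retention rate."""
    return initial * retention_rate ** n

lifetime = average_lifetime(0.75)
remaining = [round(retained(100, 0.75, n), 2) for n in range(1, 5)]

print(lifetime)
print(remaining)
```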

People - CRM Defection Rate vs. Customer Tenure

•  Variation (or heterogeneity) around average lifetime duration of 4 years

[Figure: # of Customers Defecting (0 to 30) vs. Customer Tenure (Periods 1 to 19)]

People - CRM Lifetime Duration

•  Less precise metric
    –  Average lifetime duration = 1 / (1 - Average retention rate)
•  More precise metric
    –  Average lifetime duration =

$$\frac{\sum_{t=1}^{T} t \times \text{Number of customers retained}_t}{N}$$

    –  where N = cohort size, t = time period
•  Complete or incomplete information on customer
    –  Complete: customer's time of first and last purchases are known
    –  Incomplete: either only time of first purchase, or only time of last purchase, or both time of first and last purchases are unknown

People – CRM Probability(Active)

•  Probability of a customer being active in time t in a non-contractual setting
    –  Probability(Active) = T^n
    –  where n = number of purchases in a given period, T = time of the last purchase (given as a fraction of the observation period)
    –  Simple approximation of probability(active)
    –  More advanced computation methods exist

People - CRM Probability(Active)

•  Customer 1: T = (8/12) = 0.667 and n = 4
    –  Probability(Active) = (0.667)^4 = 0.198
•  Customer 2: T = (8/12) = 0.667 and n = 2
    –  Probability(Active) = (0.667)^2 = 0.444

[Figure: purchase timelines for Customer 1 and Customer 2 across an Observation Period (Month 1 to Month 12, with Month 8 marked) and a Holdout Period (to Month 18). X indicates that a purchase was made by a customer in that month.]
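The two customers can be checked with a direct implementation of Probability(Active) = T^n. The function and parameter names are assumptions of this sketch; the inputs (last purchase in month 8 of a 12-month observation period, with 4 and 2 purchases) come from the slide.

```python
def probability_active(last_purchase_month, observation_months, n_purchases):
    """Probability(Active) = T ** n, with T the last purchase time as a fraction
    of the observation period and n the number of purchases in that period."""
    t_fraction = last_purchase_month / observation_months
    return t_fraction ** n_purchases

customer_1 = probability_active(8, 12, n_purchases=4)
customer_2 = probability_active(8, 12, n_purchases=2)

print(round(customer_1, 3))
print(round(customer_2, 3))
```

With the same last-purchase time, the customer who used to buy more often is judged less likely to still be active, because a four-month silence is more unusual for a frequent buyer.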


Behavioral Analytics

•  Build understanding of consumer lifecycle
•  Segment different behavior/motivations
•  Separate types of loyalty:
    –  Behavioral: observed many transactions
    –  Attitudinal: emotional loyalty
•  Provides guidance to different marketing effort
•  How to measure and capture data on different customer types?

Mini-case: Taobao (e-commerce)

Mini-case: Buy, Search, Browse (Taobao)

•  E-commerce increasingly popular
•  How can analytics build insight and take action?
•  How is online different from offline shopping?
•  What additional data is available?
•  What kind of behavior can we observe?

Moe, Wendy W. "Buying, Searching, or Browsing: Differentiating between Online Shoppers Using In-Store Navigational Clickstream." Journal of Consumer Psychology 13, no. 1 (2003): 29–39.

Data

•  Product information/pricing

ProductID  Description  Size  Attributes  Price  Date
---------  -----------  ----  ----------  -----  -----------
12345      Cat T-shirt  L     Red         15.00  Winter 2016

•  Transactions

Timestamp     TransactionID  ProductID  UserID  Quantity  Price  Shipping
------------  -------------  ---------  ------  --------  -----  --------
Dec. 1, 2016  1              12345      tphan   2         30.00  SingPost
Dec. 1, 2016  1              34567      tphan   1         15.00  SingPost

Data

•  Clickstream
    –  Web server (Apache) logs

Timestamp               URL                          Client   IP           SessionID  UserID
----------------------  ---------------------------  -------  -----------  ---------  ------
Dec. 1, 2016, 00:00:01  http://qoo10.sg/             Firefox  192.168.1.1  12345ABCD  tphan
Dec. 1, 2016, 00:00:10  http://qoo10.sg/Mens_Shirts/ Firefox  192.168.1.1  12345ABCD  tphan
…                       …                            …        …            …          …

Approach

•  Categorize pages:
    •  Home Page
    •  Category Pages
    •  Brand Pages
    •  Product Pages
    •  Search Pages


Metrics

•  Avg. time spent per page
•  % search pages
•  # category pages
•  # product pages
•  Diff # Cat
•  # Brand
•  # Prod
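Once each page view in a session has been categorized, several of these metrics reduce to simple counts. The sketch below assumes a session is a list of (page type, seconds on page) pairs; the sample session, the page-type labels, and the function name `session_metrics` are hypothetical and only meant to show the mechanics.

```python
# Hypothetical session: (page type, seconds spent on the page).
session = [
    ("home", 5),
    ("category", 20),
    ("product", 40),
    ("search", 10),
    ("product", 25),
]

def session_metrics(pages):
    """Compute a few per-session metrics from categorized page views."""
    n_pages = len(pages)
    def count(kind):
        return sum(1 for page_type, _ in pages if page_type == kind)
    return {
        "avg_time_per_page": sum(seconds for _, seconds in pages) / n_pages,
        "pct_search_pages": 100 * count("search") / n_pages,
        "n_category_pages": count("category"),
        "n_product_pages": count("product"),
    }

metrics = session_metrics(session)
for name, value in metrics.items():
    print(name, value)
```

Computed per session, vectors like this are the kind of input that can then be grouped into the behavior types discussed in the mini-case.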

Behaviors

•  Knowledge Building
•  Hedonic Browsing
•  Directed Buying
•  Search/Deliberation
•  Shallow sessions
