
Lecture 4.

1. An Example of an interaction: Preeclampsia

2. A paradigm for data analysis

3. Case study: A MARKET RESEARCH STUDY ON HOW TO IMPROVE SALES

Javier Cabrera: [email protected]


An introduction to Statistical Consulting


Case Study: Are preeclamptic women at high risk of heart disease?

Preeclampsia: Preeclamptic women are admitted to hospital with high blood pressure, seizures, and sometimes even a stroke or a heart attack. But are they at risk of a new stroke or heart attack?

1. Initiating the interaction. Try to understand the basic principles of the science. What are the mechanisms that produce high blood pressure (HBP)? Why does HBP affect pregnant women?

2. Understanding and defining the problem. These symptoms are not well understood.

3. Evaluating the technical knowledge of the collaborator/client. In this case the client's understanding of statistics is low.

2. Understanding and defining the problem.
- Science: The exact cause of preeclampsia is unknown. (These symptoms are not well understood.)
- How do we phrase this problem in terms of what we can do?
- Who is the comparison population? Do we compare these women with healthy women who are not pregnant?
- Find out what kind of data is available.

3. Evaluating the technical knowledge of the collaborator/client.
- In this case the client's understanding of statistics is low. This makes communication harder because the client tries to steer the conversation to medical issues using medical terms.
- Educating the client. Sometimes it may be necessary to explain some basic statistical methodology in very simple terms.

4. Assessing the overall issues and objectives of the project.
- Case-control study: Compare preeclamptic women to controls.
- Cases:
  • In NJ, a total of 37 women in 1991 who were admitted with preeclampsia and with no prior history of heart disease: 21 with HBP + 16 with HA.
  • We have the medical history showing the date of the next hospitalization for a heart attack.
- How to find controls?
  • It is not possible to find a database of healthy controls.
  • There is a database of all hospital admissions in NJ for a cardiovascular condition.
  • We are going to compare preeclamptic women with women who were admitted to the hospital with a heart condition.
  • We would like to find out whether these controls are sicker than the cases, in which case they should have more events.

5. Identifying the statistical issues of the project.
- Matching cases to controls.
- Exact matching: Year, Age, Race, Reason for Admission.
- Propensity scores, or a combination of PE and exact matching using other clinical variables.
- Use survival analysis (Cox proportional hazards model) to estimate the hazard ratio of a heart attack event.
- Test the null hypothesis that the hazard ratio of the next heart attack is 1.
- Also perform a simple analysis of odds ratios after 1 year, 5 years, or 10 years.

ID  Year  Age  Race  Adm  NextHAcase  NextHActr
 1  2000   41  CAC   HA   NA          NA
 2  2000   30  CAC   HA   NA          2007
 3  2000   27  CAC   HA   NA          NA
 4  2000   31  CAC   HA   NA          NA
 5  2000   45  AA    HA   NA          NA
 6  2000   30  CAC   HA   NA          2004
 7  2000   21  CAC   NHA  NA          2011
 8  2000   31  CAC   NHA  2004        NA
 9  2000   25  AA    NHA  NA          2007
10  2000   20  CAC   NHA  NA          NA
11  2000   18  CAC   NHA  2002        NA
12  2000   39  CAC   NHA  NA          NA
13  2001   21  CAC   HA   NA          NA
14  2001   33  AA    HA   NA          2010
15  2001   20  CAC   HA   NA          2005
16  2001   21  CAC   HA   NA          2002
17  2001   25  CAC   HA   2001        NA
18  2001   26  AA    HA   NA          NA
19  2001   21  CAC   HA   NA          2004
20  2001   26  CAC   HA   2001        NA
21  2001   27  CAC   HA   NA          2002
22  2001   25  CAC   HA   2001        2001
23  2001   30  CAC   NHA  NA          2001
24  2001   33  CAC   NHA  NA          2004
25  2001   21  CAC   NHA  NA          2011
26  2001   24  AA    NHA  NA          NA
27  2001   25  CAC   NHA  NA          2001
28  2001   25  AA    NHA  NA          2002
29  2001   24  AA    NHA  2001        NA
30  2001   29  CAC   NHA  NA          2002
31  2001   29  CAC   NHA  NA          2005
32  2001   35  AA    NHA  NA          2001
33  2001   19  CAC   NHA  NA          2006
34  2001   22  CAC   NHA  NA          2002
35  2001   19  CAC   NHA  NA          2001
36  2001   30  CAC   NHA  NA          NA
37  2001   22  CAC   NHA  NA          2008

Variables
ID: Assigned to each case-control pair
Year: Year of admission
Age: Age at admission
Race: CAC or AA
Adm: Condition of admission, HA or not HA (NHA)
NextHAcase: Year of next HA for the case
NextHActr: Year of next HA for the control
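The simple odds-ratio analysis after 1, 5, or 10 years can be sketched directly from the 37 pairs in the table above. This is a minimal Python sketch under several assumptions that are not stated in the slides: "within k years" is taken to mean that the next-HA year minus the admission year is at most k (the table records calendar years only), NA is read as "no recorded event", follow-up is treated as complete, and the odds ratio is the plain unmatched 2x2 version. A matched analysis or the Cox model would be the more careful choice; the function names are illustrative.

```python
# Rows copied from the data table: ID Year Age Race Adm NextHAcase NextHActr
TABLE = """\
1 2000 41 CAC HA NA NA
2 2000 30 CAC HA NA 2007
3 2000 27 CAC HA NA NA
4 2000 31 CAC HA NA NA
5 2000 45 AA HA NA NA
6 2000 30 CAC HA NA 2004
7 2000 21 CAC NHA NA 2011
8 2000 31 CAC NHA 2004 NA
9 2000 25 AA NHA NA 2007
10 2000 20 CAC NHA NA NA
11 2000 18 CAC NHA 2002 NA
12 2000 39 CAC NHA NA NA
13 2001 21 CAC HA NA NA
14 2001 33 AA HA NA 2010
15 2001 20 CAC HA NA 2005
16 2001 21 CAC HA NA 2002
17 2001 25 CAC HA 2001 NA
18 2001 26 AA HA NA NA
19 2001 21 CAC HA NA 2004
20 2001 26 CAC HA 2001 NA
21 2001 27 CAC HA NA 2002
22 2001 25 CAC HA 2001 2001
23 2001 30 CAC NHA NA 2001
24 2001 33 CAC NHA NA 2004
25 2001 21 CAC NHA NA 2011
26 2001 24 AA NHA NA NA
27 2001 25 CAC NHA NA 2001
28 2001 25 AA NHA NA 2002
29 2001 24 AA NHA 2001 NA
30 2001 29 CAC NHA NA 2002
31 2001 29 CAC NHA NA 2005
32 2001 35 AA NHA NA 2001
33 2001 19 CAC NHA NA 2006
34 2001 22 CAC NHA NA 2002
35 2001 19 CAC NHA NA 2001
36 2001 30 CAC NHA NA NA
37 2001 22 CAC NHA NA 2008"""

def parse(table):
    """Return one (admission_year, next_ha_case, next_ha_control) tuple per pair."""
    def year_or_none(field):
        return None if field == "NA" else int(field)
    pairs = []
    for line in table.splitlines():
        _id, year, _age, _race, _adm, nxt_case, nxt_ctr = line.split()
        pairs.append((int(year), year_or_none(nxt_case), year_or_none(nxt_ctr)))
    return pairs

def events_within(pairs, k):
    """Count cases and controls whose next HA is at most k calendar years after admission."""
    cases = sum(1 for y, c, _ in pairs if c is not None and c - y <= k)
    ctrls = sum(1 for y, _, t in pairs if t is not None and t - y <= k)
    return cases, ctrls

def odds_ratio(case_events, ctrl_events, n):
    """Unmatched odds ratio of an event, cases versus controls, with n subjects per group."""
    return (case_events * (n - ctrl_events)) / ((n - case_events) * ctrl_events)

PAIRS = parse(TABLE)
for k in (1, 5, 10):
    a, c = events_within(PAIRS, k)
    print(f"within {k:2d} years: cases {a}/{len(PAIRS)}, controls {c}/{len(PAIRS)}, "
          f"OR = {odds_ratio(a, c, len(PAIRS)):.2f}")
```

An odds ratio below 1 here would mean fewer next-HA events among the cases than among the controls, which is the direction the project expects if the controls are the sicker group.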

General Paradigm

- Research Question
- Find Data: Internal Databases, Data Warehouses, Internet, Online Databases
- Data Collection
- Data Processing
- Extract Information
- Data Analysis
- Answer Research Question

A MARKET RESEARCH STUDY ON HOW TO IMPROVE SALES

You work as a market research analyst for a company that sells orthopedic products and specializes in sales to hospitals. Your boss tells you that the company needs to make an effort to improve sales before year's end. You need to quickly produce a report that identifies potential customers where an effort could improve sales by a large dollar amount.

IDEA: Find those hospitals that, given their demographic characteristics, have high potential for consumption of such equipment but where our sales are low. Example: hospitals that do lots of orthopedic surgery. Then find a selected group of hospitals where you think your efforts will be rewarded.

Data: Go to a data warehouse, find a dataset with all hospitals in the US, and extract some relevant demographic variables.

The following description of the dataset includes the variable names and some summaries of the variables.

VARIABLES:

ZIP     : US POSTAL CODE
HID     : HOSPITAL ID
CITY    : CITY NAME
STATE   : STATE NAME
BEDS    : NUMBER OF HOSPITAL BEDS
RBEDS   : NUMBER OF REHAB BEDS
OUTV    : NUMBER OF OUTPATIENT VISITS
ADM     : ADMINISTRATIVE COST (IN $1000'S PER YEAR)
SIR     : REVENUE FROM INPATIENT
SALESY  : SALES OF REHABILITATION EQUIPMENT SINCE JAN 1
SALES12 : SALES OF REHAB. EQUIP. FOR THE LAST 12 MO
HIP2Y   : NUMBER OF HIP OPERATIONS FOR TWO YEARS AGO
KNEE2Y  : NUMBER OF KNEE OPERATIONS FOR TWO YEARS AGO
TH      : TEACHING HOSPITAL? 0, 1
TRAUMA  : DO THEY HAVE A TRAUMA UNIT? 0, 1
REHAB   : DO THEY HAVE A REHAB UNIT? 0, 1
HIP1Y   : NUMBER OF HIP OPERATIONS FOR LAST YEAR
KNEE1Y  : NUMBER OF KNEE OPERATIONS FOR LAST YEAR
FEMUR1Y : NUMBER OF FEMUR OPERATIONS FOR LAST YEAR

Variables

      ZIP                  CITY            STATE             BEDS
 Min.   :  612   Chicago     :  45    CA     : 458    Min.   :   0.0
 1st Qu.:28550   Houston     :  41    TX     : 342    1st Qu.:  69.0
 Median :49000   Philadelphia:  38    NY     : 241    Median : 136.0
 Mean   :50600   Los Angeles :  28    PA     : 238    Mean   : 191.2
 3rd Qu.:75240   New York    :  24    FL     : 228    3rd Qu.: 262.0
 Max.   :99900   Dallas      :  24    IL     : 208    Max.   :1476.0
                 (Other)     :4503    (Other):2988

     RBEDS                OUTV                ADM                SIR
 Min.   :  0.000   Min.   :      0    Min.   :    0    Min.   :    0
 1st Qu.:  0.000   1st Qu.:   7510    1st Qu.: 1932    1st Qu.: 1312
 Median :  0.000   Median :  20880    Median : 4508    Median : 3384
 Mean   :  7.244   Mean   :  47350    Mean   : 6689    Mean   : 4849
 3rd Qu.:  0.000   3rd Qu.:  47700    3rd Qu.: 9402    3rd Qu.: 6832
 Max.   :850.000   Max.   :1987000    Max.   :66440    Max.   :70300

     SALESY            SALES12             HIP2Y             KNEE2Y            HIP1Y
 Min.   :   0.00   Min.   :   0.00   Min.   :   0.00   Min.   :  0.00   Min.   :   0.0
 1st Qu.:   0.00   1st Qu.:   0.00   1st Qu.:   7.00   1st Qu.:  1.00   1st Qu.:   8.0
 Median :   1.00   Median :   2.00   Median :  28.00   Median : 18.00   Median :  29.0
 Mean   :  25.91   Mean   :  41.05   Mean   :  51.27   Mean   : 41.73   Mean   :  52.6
 3rd Qu.:  23.00   3rd Qu.:  33.00   3rd Qu.:  70.00   3rd Qu.: 52.50   3rd Qu.:  71.0
 Max.   :1209.00   Max.   :2770.00   Max.   :1421.00   Max.   :868.00   Max.   :1373.0

       TH              TRAUMA            REHAB             KNEE1Y            FEMUR1Y
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :   0.00   Min.   :  0.00
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:   0.00   1st Qu.: 11.00
 Median :0.0000   Median :0.0000   Median :0.0000   Median :  18.00   Median : 34.00
 Mean   :0.2737   Mean   :0.1225   Mean   :0.1839   Mean   :  41.91   Mean   : 49.39
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:  56.00   3rd Qu.: 74.00
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1081.00   Max.   :489.00

A MARKET RESEARCH STUDY: Transformations & dimension reduction

1. Transformations:
Look at each individual variable and decide whether a transformation is appropriate and, if so, which one.

2. Dimension reduction.
i) Separate the variables into the following groups:
Response: SALES = SALES12 + SALESY; if SALES = 0, set SALES = NA
Demographics: BEDS, RBEDS, OUTV, ADM, SIR, TH, TRAUMA, REHAB
Operation numbers: HIP2Y, KNEE2Y, HIP1Y, KNEE1Y, FEMUR1Y
ii) Use the factor method to summarize the demographic variables and the operation variables, and come up with a final reduced list of factor variables (perhaps 3 or 4). Use the rotated factors to find a good interpretation of the factors and try to make a good story.
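The response definition and a candidate transformation can be sketched as follows. This is a minimal Python illustration, assuming that NA is represented as `None` and that log(1 + x) is the transformation tried for the strongly right-skewed variables (for example, BEDS has median 136 and maximum 1476). The slides do not name a specific transformation, so the log is an assumption, and the hospital records below are hypothetical.

```python
import math

def build_response(sales12, salesy):
    """SALES = SALES12 + SALESY, with a total of 0 recoded as missing (None)."""
    total = sales12 + salesy
    return None if total == 0 else total

def log_transform(x):
    """log(1 + x): one common choice for right-skewed, non-negative variables."""
    return None if x is None else math.log1p(x)

# Hypothetical hospital records, for illustration only.
hospitals = [
    {"HID": "A", "SALES12": 33.0, "SALESY": 23.0, "BEDS": 262},
    {"HID": "B", "SALES12": 0.0, "SALESY": 0.0, "BEDS": 69},
    {"HID": "C", "SALES12": 2.0, "SALESY": 1.0, "BEDS": 1476},
]
for h in hospitals:
    h["SALES"] = build_response(h["SALES12"], h["SALESY"])
    h["LOG_BEDS"] = log_transform(h["BEDS"])
    print(h["HID"], h["SALES"], round(h["LOG_BEDS"], 3))
```

Recoding SALES = 0 as NA keeps hospitals with no sales out of the later averages and regressions, while leaving them available as candidates for new sales.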

3. Market segmentation

i) Independent variables are used to divide the list of hospitals (all possible clients = the market) into subsets, which we call market segments. Use cluster analysis to find the market segments or clusters. Since we are summarizing the variables with factors, use the factors as the clustering variables.

ii) Once the clusters are chosen, we must study the summary statistics for each cluster and try to describe their content. Interpretation is very important at this stage.

iii) Finally, we select the cluster or clusters that agree with our objectives. In this study you are looking for segments with overall high sales but which contain hospitals where the company's sales are low. Some segments will have mostly low sales numbers. This means that those hospitals have few patients who would need our products, so we are not interested in them.
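Steps i) to iii) can be sketched in a few lines. This is a minimal Python illustration, not the clustering procedure used in the course: plain k-means is swapped in here only because it is short, the initial centers are supplied by hand, and the factor scores and sales figures are hypothetical. `None` stands for SALES = NA.

```python
from statistics import mean

def kmeans(points, centers, iterations=20):
    """Plain k-means (Lloyd's algorithm) on tuples of factor scores."""
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(len(centers)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(pt, centers[j])))
            for pt in points
        ]
        # Move each center to the mean of its members; empty clusters stay put.
        new_centers = []
        for j, old in enumerate(centers):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            new_centers.append(tuple(mean(col) for col in zip(*members)) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def segment_summary(labels, sales):
    """Per cluster: size, mean of known sales, and number of hospitals with SALES = NA."""
    summary = {}
    for j in sorted(set(labels)):
        in_cluster = [s for s, lab in zip(sales, labels) if lab == j]
        known = [s for s in in_cluster if s is not None]
        summary[j] = {
            "n": len(in_cluster),
            "mean_sales": mean(known) if known else None,
            "no_sales": len(in_cluster) - len(known),
        }
    return summary

# Hypothetical factor scores (two factors per hospital) and SALES values.
scores = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (2.0, 2.1), (2.2, 1.9), (1.9, 2.0)]
sales = [1.0, None, 3.0, 120.0, None, 80.0]
labels, centers = kmeans(scores, centers=[scores[0], scores[3]])
for j, row in segment_summary(labels, sales).items():
    print(j, row)
```

A segment of interest is one with a high `mean_sales` and a non-zero `no_sales` count: similar hospitals buy a lot, yet some hospitals in the segment buy nothing from us.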

Estimating potential gain in sales.

Potential gain in sales is the difference between current sales and the average of sales to similar hospitals. If you are analyzing a very small cluster (N < 20), we might assume that sales are homogeneous, so the "average sales to similar hospitals" is just the average sales within that cluster. If the cluster is larger, we need a regression estimate. This is the procedure:

i) Fit a regression for each of the t selected segments. Since the segments are very homogeneous, the predictors vary little within a segment and the R-square may not be very high, SO DO NOT BE CONCERNED WITH LOW R-SQUARES.

ii) The hospitals with large negative residuals are the ones whose sales are low even though their characteristics suggest higher sales, that is, they are below their potential sales (use the predicted values as potential sales). Make a list of the hospitals in your segment where sales can be improved.

iii) Give your estimate of the potential gains.
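The regression step can be sketched with a single predictor. This is a minimal Python illustration, assuming ordinary least squares with one explanatory variable (HIP1Y is chosen arbitrarily) and hospitals whose SALES are known. A real analysis would use several predictors or the factor scores, and the segment below is hypothetical.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    mx, my = mean(x), mean(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def potential_gains(hospitals, predictor="HIP1Y", response="SALES"):
    """Return (HID, gain) for hospitals below their fitted sales, largest gain first."""
    x = [h[predictor] for h in hospitals]
    y = [h[response] for h in hospitals]
    a, b = fit_line(x, y)
    gains = []
    for h in hospitals:
        predicted = a + b * h[predictor]    # used as potential sales
        residual = h[response] - predicted  # negative: selling below potential
        if residual < 0:
            gains.append((h["HID"], -residual))
    return sorted(gains, key=lambda g: g[1], reverse=True)

# Hypothetical segment, for illustration only.
segment = [
    {"HID": "A", "HIP1Y": 10, "SALES": 20.0},
    {"HID": "B", "HIP1Y": 20, "SALES": 40.0},
    {"HID": "C", "HIP1Y": 30, "SALES": 30.0},
    {"HID": "D", "HIP1Y": 40, "SALES": 80.0},
]
for hid, gain in potential_gains(segment):
    print(hid, round(gain, 3))
```

For a very small cluster the same idea reduces to the cluster mean minus the hospital's current sales, with no regression needed.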

Doing the Computations: All these parts can be easily performed using SAS. In addition, you could carry out a similar, robust analysis using R. The R analysis would apply the methods for robust clustering (pam) and for classification and regression trees (rpart).

PAM: Compare the clusters given by PAM with those from hierarchical clustering (HC). Are they similar?

RPART: The idea here is to turn the sales variable into a categorical variable with three levels, 1: 0 to median, 2: median to 80%, 3: 80% to 100%. Run the tree method, select one good node that has very high sales, find the hospitals in that node that have SALES = NA, and estimate a potential sales gain.
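The three-level sales variable used for the tree can be built as sketched below. This is a minimal Python illustration under stated assumptions: cut points are computed from the known sales only, a nearest-rank percentile is used, and values equal to a cut point fall in the lower category. The slides do not specify how boundaries are handled, so these are choices made for the example.

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list (p between 0 and 100)."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def sales_category(sales):
    """Map each SALES value to 1, 2, or 3; missing values (None) stay None."""
    known = sorted(s for s in sales if s is not None)
    median, p80 = percentile(known, 50), percentile(known, 80)

    def level(s):
        if s is None:
            return None
        return 1 if s <= median else 2 if s <= p80 else 3

    return [level(s) for s in sales]

# Hypothetical sales figures; None stands for SALES = NA.
sales = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None]
print(sales_category(sales))
```

Hospitals with SALES = NA keep a missing category, so after the tree is grown on the known cases they can be dropped down the tree to see which of them land in a high-sales node.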

This is what the outcome of your analysis should look like:

Obs  CITY      STATE  HID     GAIN
135  Richmond  VA      92134  134.621
136  Flushing  NY      63521  144.105
137  Buffalo   NY     111021  164.319
138  Tallahas  FL     103039  181.047
 19  Toms Riv  NJ     139522  142.149
 16  Oakland   CA     224093  139.782
 17  Voorhees  NJ     160022  135.16

CASE STUDY: PLASTIC EXPLOSIVES DETECTION

The data come from a study on the detection of plastic explosives in suitcases using X-ray signals. The 23 variables are the 23 principal components of the discrete X-ray absorption spectrum over an array of rays across the surface of the suitcase. The response is the last variable in the dataset. It takes two values, 0: there is an explosive, 1: there is not. The objective is to detect the suitcases with explosives.

DATA

Variables 1-23: principal components of the discrete X-ray absorption spectrum. The 24th variable is the response variable in the dataset.

Data sets:
Training set (Pex23): http://www.rci.rutgers.edu/~cabrera/sc/cs7/pex23.txt
Testing set (Pex23 testing): http://www.rci.rutgers.edu/~cabrera/sc/cs7/pex23.test

OBJECTIVES

The objective is to detect the suitcases with explosives.

1. To compare several classification techniques using the training and testing datasets provided. Some reasonable methods are LDA, QDA, CART, Random Forest, Boosting, SVM, ANN, Naïve Bayes, and Bayesian models.

2. To find the best classifier for this data.
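The comparison itself follows the same pattern whatever the classifiers are: fit on the training set, predict on the testing set, and tabulate the results. The sketch below is a minimal Python illustration in which two deliberately simple stand-ins (a majority-class baseline and a nearest-centroid rule) take the place of LDA, QDA, CART, and the other methods listed, which would normally come from a statistics package. The data are hypothetical and use 2 features instead of 23. Since the response is coded 0 for "explosive", the detection rate for class 0 is reported next to overall accuracy.

```python
from statistics import mean

def majority_baseline(train_x, train_y):
    """Always predict the most frequent training label."""
    label = max(set(train_y), key=train_y.count)
    return lambda x: label

def nearest_centroid(train_x, train_y):
    """Predict the class whose training mean is closest in Euclidean distance."""
    centroids = {
        c: [mean(col) for col in zip(*(x for x, y in zip(train_x, train_y) if y == c))]
        for c in set(train_y)
    }
    def predict(x):
        return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
    return predict

def evaluate(predict, test_x, test_y, positive=0):
    """Accuracy and detection rate; class 0 (explosive) is the class to detect."""
    preds = [predict(x) for x in test_x]
    correct = sum(1 for p, y in zip(preds, test_y) if p == y)
    n_pos = sum(1 for y in test_y if y == positive)
    hits = sum(1 for p, y in zip(preds, test_y) if y == positive and p == positive)
    return {"accuracy": correct / len(test_y),
            "detection_rate": hits / n_pos if n_pos else None}

# Hypothetical data, for illustration only: 0 = explosive, 1 = no explosive.
train_x = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.0, 1.0), (0.9, 1.2), (1.1, 0.8), (1.2, 1.1)]
train_y = [0, 0, 0, 1, 1, 1, 1]
test_x = [(0.1, 0.1), (0.3, 0.2), (1.0, 0.9), (1.1, 1.1)]
test_y = [0, 0, 1, 1]

methods = {"majority": majority_baseline, "nearest_centroid": nearest_centroid}
results = {name: evaluate(fit(train_x, train_y), test_x, test_y)
           for name, fit in methods.items()}
for name, row in results.items():
    print(name, row)
```

Reporting the detection rate separately matters because a classifier can reach a respectable accuracy while missing every explosive, which is the failure the study most needs to avoid.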
