Lecture 4.
1. An Example of an interaction: Preeclampsia
2. A paradigm for data analysis
3. Case study: A MARKET RESEARCH STUDY
ON HOW TO IMPROVE SALES
Javier Cabrera
An introduction to Statistical Consulting
Case Study: Are preeclamptic women at high risk of heart disease?
Preeclampsia: Preeclamptic women are admitted to hospital with high blood pressure (HBP), seizures, and sometimes even a stroke or a heart attack. But are they at risk of a new stroke or heart attack?
1. Initiating the interaction. Try to understand the basic principles of the science. What are the mechanisms that produce high blood pressure? Why does HBP affect pregnant women?
2. Understanding and defining the problem. These symptoms are not well understood.
3. Evaluating the technical knowledge of the collaborator/client. In this case the client's understanding of statistics is low.
2. Understanding and defining the problem.
- Science: The exact cause of preeclampsia is unknown. (These symptoms are not well understood.)
- How do we phrase this problem in terms of what we can do?
- Who is the comparison population? Do we compare these women with healthy women who are not pregnant?
- Find out what kind of data is available.
3. Evaluating the technical knowledge of the collaborator/client.
- In this case the client's understanding of statistics is low. This makes communication harder because the client tries to direct the conversation to medical issues using medical terms.
- Educating the client. Sometimes it may be necessary to explain some basic statistical methodology in very simple terms.
4. Assessing the overall issues and objectives of the project.
- Case-control study: Compare preeclamptic women to controls.
- Cases:
• In NJ, a total of 37 women in 1991 who were admitted with preeclampsia and with no prior history of heart disease: 21 with HBP + 16 with HA.
• We have medical histories showing the date of the next hospitalization for a heart attack.
- How to find controls?
• It is not possible to find a database of healthy controls.
• There is a database of all hospital admissions in NJ for a cardiovascular condition.
• We are going to compare preeclamptic women with women who were admitted to the hospital with a heart condition.
• We would like to find out whether these controls are sicker than the cases, in which case they should have more events.
5. Identifying the statistical issues of the project.
- Matching cases to controls.
- Exact matching: Year, Age, Race, Reason of Admission.
- Propensity scores, or a combination of PE and exact matching using other clinical variables.
- Use survival analysis (Cox proportional hazards model) to estimate the hazard ratio of a heart attack event.
- Test the null hypothesis that the hazard ratio of the next heart attack is 1.
- Also perform a simple analysis of odds ratios after 1 year, 5 years, or 10 years.
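The exact-matching step can be sketched as follows. This is only an illustration under stated assumptions: the records, the field names (`year`, `age`, `race`, `adm`), and the rule "take the first unused control" are hypothetical and are not taken from the study.

```python
from collections import defaultdict

def exact_match(cases, controls, keys=("year", "age", "race", "adm")):
    """Pair each case with one unused control that agrees on every key."""
    # Index the controls by their combination of matching variables.
    pool = defaultdict(list)
    for ctrl in controls:
        pool[tuple(ctrl[k] for k in keys)].append(ctrl)

    pairs, unmatched = [], []
    for case in cases:
        candidates = pool[tuple(case[k] for k in keys)]
        if candidates:
            pairs.append((case, candidates.pop(0)))  # each control is used once
        else:
            unmatched.append(case)
    return pairs, unmatched

# Hypothetical records, for illustration only.
cases = [
    {"id": "c1", "year": 2000, "age": 30, "race": "CAC", "adm": "HA"},
    {"id": "c2", "year": 2001, "age": 25, "race": "AA", "adm": "NHA"},
]
controls = [
    {"id": "k1", "year": 2000, "age": 30, "race": "CAC", "adm": "HA"},
    {"id": "k2", "year": 2000, "age": 30, "race": "CAC", "adm": "HA"},
    {"id": "k3", "year": 2001, "age": 26, "race": "AA", "adm": "NHA"},
]
pairs, unmatched = exact_match(cases, controls)
```

Cases left in `unmatched` are the ones for which a looser rule, such as propensity scores, would be needed.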
ID Year Age Race Adm NextHAcase NextHActr
1 2000 41 CAC HA NA NA
2 2000 30 CAC HA NA 2007
3 2000 27 CAC HA NA NA
4 2000 31 CAC HA NA NA
5 2000 45 AA HA NA NA
6 2000 30 CAC HA NA 2004
7 2000 21 CAC NHA NA 2011
8 2000 31 CAC NHA 2004 NA
9 2000 25 AA NHA NA 2007
10 2000 20 CAC NHA NA NA
11 2000 18 CAC NHA 2002 NA
12 2000 39 CAC NHA NA NA
13 2001 21 CAC HA NA NA
14 2001 33 AA HA NA 2010
15 2001 20 CAC HA NA 2005
16 2001 21 CAC HA NA 2002
17 2001 25 CAC HA 2001 NA
18 2001 26 AA HA NA NA
19 2001 21 CAC HA NA 2004
20 2001 26 CAC HA 2001 NA
21 2001 27 CAC HA NA 2002
22 2001 25 CAC HA 2001 2001
23 2001 30 CAC NHA NA 2001
24 2001 33 CAC NHA NA 2004
25 2001 21 CAC NHA NA 2011
26 2001 24 AA NHA NA NA
27 2001 25 CAC NHA NA 2001
28 2001 25 AA NHA NA 2002
29 2001 24 AA NHA 2001 NA
30 2001 29 CAC NHA NA 2002
31 2001 29 CAC NHA NA 2005
32 2001 35 AA NHA NA 2001
33 2001 19 CAC NHA NA 2006
34 2001 22 CAC NHA NA 2002
35 2001 19 CAC NHA NA 2001
36 2001 30 CAC NHA NA NA
37 2001 22 CAC NHA NA 2008
Variables:
ID: Assigned to each case-control pair
Year: Year of admission
Age: Age at admission
Race: CAC or AA
Adm: Condition of admission, HA or not HA (NHA)
NextHAcase: Year of next HA for the case
NextHActr: Year of next HA for the control
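A simple odds-ratio analysis after k years could be sketched as below, using the Year, NextHAcase, and NextHActr columns transcribed from the table. Two simplifying assumptions are made here and are not part of the lecture: NA is counted as "no event within the window" (censoring is ignored), and the pairing of cases with controls is ignored. The Cox model is the tool that would handle censoring properly.

```python
NA = None  # missing year of next heart attack

# (Year, NextHAcase, NextHActr) for IDs 1..37, transcribed from the table.
ROWS = [
    (2000, NA, NA), (2000, NA, 2007), (2000, NA, NA), (2000, NA, NA),
    (2000, NA, NA), (2000, NA, 2004), (2000, NA, 2011), (2000, 2004, NA),
    (2000, NA, 2007), (2000, NA, NA), (2000, 2002, NA), (2000, NA, NA),
    (2001, NA, NA), (2001, NA, 2010), (2001, NA, 2005), (2001, NA, 2002),
    (2001, 2001, NA), (2001, NA, NA), (2001, NA, 2004), (2001, 2001, NA),
    (2001, NA, 2002), (2001, 2001, 2001), (2001, NA, 2001), (2001, NA, 2004),
    (2001, NA, 2011), (2001, NA, NA), (2001, NA, 2001), (2001, NA, 2002),
    (2001, 2001, NA), (2001, NA, 2002), (2001, NA, 2005), (2001, NA, 2001),
    (2001, NA, 2006), (2001, NA, 2002), (2001, NA, 2001), (2001, NA, NA),
    (2001, NA, 2008),
]

def events_within(rows, col, k):
    """Count subjects whose next HA is at most k years after admission."""
    return sum(1 for r in rows if r[col] is not None and r[col] - r[0] <= k)

def odds_ratio(rows, k):
    """2x2 counts (case event, case no event, control event, control no event)
    and the odds ratio of an event for cases relative to controls."""
    n = len(rows)
    a = events_within(rows, 1, k)  # cases with an event
    c = events_within(rows, 2, k)  # controls with an event
    b, d = n - a, n - c
    return (a, b, c, d), (a * d) / (b * c)  # fails if b or c is zero

table5, or5 = odds_ratio(ROWS, 5)
print(table5, round(or5, 3))
```

An odds ratio below 1 here would mean fewer events among the cases than among the controls in that window. A matched-pairs analysis would instead use only the discordant pairs.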
General Paradigm
1. Research question
2. Find data: internal databases, data warehouses, Internet, online databases, data collection
3. Data processing
4. Extract information: data analysis
5. Answer the research question
A MARKET RESEARCH STUDY
ON HOW TO IMPROVE SALES
You work as a market research analyst for a company that sells orthopedic products and specializes in sales to hospitals. Your boss tells you that the company needs to make an effort to improve sales before year's end. You need to quickly produce a report that identifies potential customers where an effort could improve sales by a large dollar amount.
IDEA: Find hospitals that, given their demographic characteristics, have high potential for consumption of such equipment but where our sales are low. Example: hospitals that perform many orthopedic surgeries. Then select a group of hospitals where you think your efforts will be rewarded.
Data: Go to a data warehouse, find a dataset with all hospitals in the US, and extract some relevant demographic variables.
The following description of the dataset includes the variable names and some summaries of the variables.
VARIABLES:
ZIP : US POSTAL CODE
HID : HOSPITAL ID
CITY : CITY NAME
STATE : STATE NAME
BEDS : NUMBER OF HOSPITAL BEDS
RBEDS : NUMBER OF REHAB BEDS
OUTV : NUMBER OF OUTPATIENT VISITS
ADM : ADMINISTRATIVE COST(In $1000's per year)
SIR : REVENUE FROM INPATIENT
SALESY : SALES OF REHABILITATION EQUIPMENT SINCE JAN 1
SALES12 : SALES OF REHAB. EQUIP. FOR THE LAST 12 MO
HIP2Y : NUMBER OF HIP OPERATIONS FOR TWO YEARS AGO
KNEE2Y : NUMBER OF KNEE OPERATIONS FOR TWO YEARS AGO
TH : TEACHING HOSPITAL? 0, 1
TRAUMA : DO THEY HAVE A TRAUMA UNIT? 0, 1
REHAB : DO THEY HAVE A REHAB UNIT? 0, 1
HIP1Y : NUMBER HIP OPERATIONS FOR LAST YEAR
KNEE1Y : NUMBER KNEE OPERATIONS FOR LAST YEAR
FEMUR1Y : NUMBER FEMUR OPERATIONS FOR LAST YEAR
Variables ZIP CITY STATE BEDS
Min. : 612 Chicago : 45 CA : 458 Min. : 0.0
1st Qu.:28550 Houston : 41 TX : 342 1st Qu.: 69.0
Median :49000 Philadelphia : 38 NY : 241 Median : 136.0
Mean :50600 Los Angeles : 28 PA : 238 Mean : 191.2
3rd Qu.:75240 New York : 24 FL : 228 3rd Qu.: 262.0
Max. :99900 Dallas : 24 IL : 208 Max. :1476.0
(Other) :4503 (Other):2988
RBEDS OUTV ADM SIR
Min. : 0.000 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0.000 1st Qu.: 7510 1st Qu.: 1932 1st Qu.: 1312
Median : 0.000 Median : 20880 Median : 4508 Median : 3384
Mean : 7.244 Mean : 47350 Mean : 6689 Mean : 4849
3rd Qu.: 0.000 3rd Qu.: 47700 3rd Qu.: 9402 3rd Qu.: 6832
Max. :850.000 Max. :1987000 Max. :66440 Max. :70300
SALESY SALES12 HIP2Y KNEE2Y HIP1Y
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0
1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 7.00 1st Qu.: 1.00 1st Qu.: 8.0
Median : 1.00 Median : 2.00 Median : 28.00 Median : 18.00 Median : 29.0
Mean : 25.91 Mean : 41.05 Mean : 51.27 Mean : 41.73 Mean : 52.6
3rd Qu.: 23.00 3rd Qu.: 33.00 3rd Qu.: 70.00 3rd Qu.: 52.50 3rd Qu.: 71.0
Max. :1209.00 Max. :2770.00 Max. :1421.00 Max. :868.00 Max. :1373.0
TH TRAUMA REHAB KNEE1Y FEMUR1Y
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.00
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.00 1st Qu.: 11.00
Median :0.0000 Median :0.0000 Median :0.0000 Median : 18.00 Median : 34.00
Mean :0.2737 Mean :0.1225 Mean :0.1839 Mean : 41.91 Mean : 49.39
3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.: 56.00 3rd Qu.: 74.00
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1081.00 Max. :489.00
A MARKET RESEARCH STUDY:
Transformations & dimension reduction
1. Transformations:
Look at each individual variable and decide whether a transformation is appropriate and, if so, which one.
2. Dimension reduction.
i) Separate the variables into the following groups:
Response: SALES = SALES12 + SALESY, with SALES = 0 recoded as SALES = NA
Demographics: BEDS, RBEDS, OUTV, ADM, SIR, TH, TRAUMA, REHAB
Operation numbers: HIP2Y, KNEE2Y, HIP1Y, KNEE1Y, FEMUR1Y
ii) Use factor analysis to summarize the demographic variables and the operation variables and come up with a final reduced list of factor variables (perhaps 3 or 4). Use the rotated factors to find a good interpretation of the factors and try to tell a good story.
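A minimal sketch of the response construction and of one candidate transformation follows. The choice of log(1 + x) is an assumption made for illustration: the summaries show means well above medians and very large maxima (for example BEDS has median 136 and maximum 1476), which suggests right skew, but the lecture leaves the choice of transformation open. The three hospitals are hypothetical.

```python
import math

def build_sales(sales12, salesy):
    """SALES = SALES12 + SALESY, with a total of 0 recoded as missing (None)."""
    total = sales12 + salesy
    return None if total == 0 else total

def log1p_transform(x):
    """log(1 + x): defined at 0, compresses the long right tail of counts."""
    return math.log1p(x)

# Hypothetical hospitals: (SALES12, SALESY, BEDS)
hospitals = [(0, 0, 69), (33, 23, 262), (2, 1, 136)]

sales = [build_sales(s12, sy) for s12, sy, _ in hospitals]
log_beds = [log1p_transform(beds) for _, _, beds in hospitals]
print(sales)
```

Recoding zero sales as missing keeps hospitals with no sales history out of the regressions, so that they can be scored as potential customers later.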
3. Market segmentation
i) The independent variables are used to divide the list of hospitals (all possible clients = the market) into subsets, which we call market segments. Use cluster analysis to find the market segments or clusters. Since we are summarizing the variables with factors, cluster on the factors.
ii) Once the clusters are chosen, study the summary statistics for each cluster and try to describe its content. Interpretation is very important at this stage.
iii) Finally, select the cluster or clusters that agree with our objectives. In this study you are looking for segments with high overall sales that contain hospitals where the company's sales are low. Some segments will have mostly low sales numbers. This means that those hospitals have few patients who would need our products, so we are not interested in them.
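Steps ii) and iii) can be sketched as below, assuming the cluster labels have already been produced by some clustering routine. The hospital records, the `min_mean` threshold, and the rule "low means below 25% of the cluster mean" are hypothetical choices for illustration, not values from the lecture.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (hospital id, cluster label, SALES); None = no recorded sales.
hospitals = [
    ("h1", "A", 120), ("h2", "A", 150), ("h3", "A", 5), ("h4", "A", None),
    ("h5", "B", 2), ("h6", "B", 1), ("h7", "B", None),
]

def summarize(hospitals):
    """Group hospitals by cluster and compute summary statistics of SALES."""
    by_cluster = defaultdict(list)
    for hid, cluster, sales in hospitals:
        by_cluster[cluster].append((hid, sales))
    summary = {}
    for cluster, members in by_cluster.items():
        # Assumes every cluster has at least one hospital with observed sales.
        observed = [s for _, s in members if s is not None]
        summary[cluster] = {
            "n": len(members),
            "mean": mean(observed),
            "median": median(observed),
        }
    return by_cluster, summary

def targets(by_cluster, summary, min_mean, low_fraction=0.25):
    """Hospitals with missing or low sales inside clusters with high mean sales."""
    out = []
    for cluster, stats in summary.items():
        if stats["mean"] < min_mean:
            continue  # segment with mostly low sales: not of interest
        cutoff = low_fraction * stats["mean"]
        out += [hid for hid, s in by_cluster[cluster] if s is None or s < cutoff]
    return out

by_cluster, summary = summarize(hospitals)
selected = targets(by_cluster, summary, min_mean=50)
print(selected)
```

Cluster B is skipped because its sales are low throughout, which matches the reasoning in step iii).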
Estimating potential gain in sales.
Potential gain in sales is the difference between current sales and the average of sales to similar hospitals. If you are analyzing a very small cluster (N < 20), we might assume that sales are homogeneous, so the “average sales to similar hospitals” is just the average sale in that cluster. If the cluster is larger, we need a regression estimate. This is the procedure:
i) Fit a regression for each of the t selected segments. Since the segments are very homogeneous, the R-square may not be very high, SO DO NOT BE CONCERNED WITH LOW R-SQUARES.
ii) The hospitals with large negative residuals are the ones whose sales are below what their characteristics suggest (use the predicted values as potential sales). Make a list of the hospitals in your segment where sales can be improved.
iii) Give your estimate of the potential gains.
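The residual-based procedure can be illustrated with a one-predictor least-squares fit. This is a sketch under assumptions: a real analysis would regress SALES on several factors within each segment, and the four hospitals and their values are hypothetical.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def potential_gains(ids, x, y):
    """List (hospital, gain) for hospitals with negative residuals,
    where gain = predicted sales - current sales, largest gain first."""
    a, b = fit_line(x, y)
    gains = []
    for hid, xi, yi in zip(ids, x, y):
        predicted = a + b * xi  # predicted value = potential sales
        if yi < predicted:      # negative residual: sales below potential
            gains.append((hid, predicted - yi))
    return sorted(gains, key=lambda g: g[1], reverse=True)

# Hypothetical segment: x = a factor score, y = SALES.
ids = ["h1", "h2", "h3", "h4"]
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 12.0, 40.0]

gains = potential_gains(ids, x, y)
print(gains)
```

For a very small cluster, the same function applies with the cluster mean in place of the fitted line.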
Doing the Computations: All of these steps can easily be performed using SAS. In addition, you could carry out a similar robust analysis using R. The R analysis would apply methods for robust clustering (pam) and for classification and regression trees (rpart).
PAM: Compare the clusters given by PAM with those from HC (hierarchical clustering). Are they similar?
RPART: The idea here is to turn the sales variable into a categorical variable: 1: 0 to median, 2: median to 80%, 3: 80% to 100%. Run the tree method, select one good node that has very high sales, find the hospitals in that group that have SALES=NA, and estimate a potential sales gain.
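The three-level sales category used for the tree method can be sketched as follows. The cut points are computed from the observed (non-missing) sales. Which side of a boundary a value falls on, and the use of the "inclusive" quantile method, are assumptions made for this illustration.

```python
from statistics import quantiles

def sales_category(sales, median_cut, p80_cut):
    """1: 0 to median, 2: median to 80%, 3: 80% to 100%, None if missing."""
    if sales is None:
        return None
    if sales <= median_cut:
        return 1
    if sales <= p80_cut:
        return 2
    return 3

def categorize(values):
    """Assign the three-level category to every value in the list."""
    observed = [v for v in values if v is not None]
    # Nine decile cut points: index 4 is the median, index 7 the 80th percentile.
    cuts = quantiles(observed, n=10, method="inclusive")
    median_cut, p80_cut = cuts[4], cuts[7]
    return [sales_category(v, median_cut, p80_cut) for v in values]

# Hypothetical sales values; None stands for SALES=NA.
values = list(range(1, 12)) + [None]
categories = categorize(values)
print(categories)
```

Hospitals with a `None` category are the ones to score afterwards: they are dropped from tree fitting and then assigned to a node by their predictors.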
Estimating potential gain in sales.
Obs CITY STATE HID GAIN
135 Richmond VA 92134 134.621
136 Flushing NY 63521 144.105
137 Buffalo NY 111021 164.319
138 Tallahas FL 103039 181.047
19 Toms Riv NJ 139522 142.149
16 Oakland CA 224093 139.782
17 Voorhees NJ 160022 135.16
This is what the outcome of your analysis should look like.
PLASTIC EXPLOSIVES DETECTION
CASE STUDY: Plastic explosives detection. The data come from a study on detecting plastic explosives in suitcases using X-ray signals. The 23 variables are the 23 principal components of the discrete X-ray absorption spectrum over an array of rays across the surface of the suitcase. The response is the last variable in the dataset. It takes two values: 0: there is an explosive; 1: there is not. The objective is to detect the suitcases with explosives.
DATA: Plastic explosives detection
Variables 1-23: principal components of the discrete X-ray absorption spectrum. The 24th variable is the response variable in the dataset.
Data sets:
Training set: Pex23 http://www.rci.rutgers.edu/~cabrera/sc/cs7/pex23.txt
Testing set: Pex23 testing http://www.rci.rutgers.edu/~cabrera/sc/cs7/pex23.test
OBJECTIVES: Plastic explosives detection
The objective is to detect the suitcases with explosives.
1. Compare several classification techniques using the training and testing datasets provided. Some reasonable methods are LDA, QDA, CART, Random Forest, Boosting, SVM, ANN, Naïve Bayes, and Bayesian models.
2. Find the best classifier for this data.
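The comparison on a training and a testing set can be sketched as below. The classifier shown is a nearest-centroid rule, a deliberately simple stand-in that is not one of the methods listed above, and the two-dimensional data are hypothetical (the real data have 23 predictors). The reusable part is `evaluate`, which reports both overall accuracy and the detection rate for class 0, since 0 codes "there is an explosive".

```python
from statistics import mean

def fit_centroids(X, y):
    """Nearest-centroid baseline: one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest in squared distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))

def evaluate(predictions, truth, positive=0):
    """Accuracy and detection rate for the positive class (0 = explosive)."""
    correct = sum(p == t for p, t in zip(predictions, truth))
    on_positive = [p for p, t in zip(predictions, truth) if t == positive]
    detected = sum(p == positive for p in on_positive)
    return correct / len(truth), detected / len(on_positive)

# Hypothetical training and testing sets.
X_train = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[0.5, 0.5], [1, 1], [4, 4], [5, 5], [6, 6]]
y_test = [0, 0, 0, 1, 1]

centroids = fit_centroids(X_train, y_train)
predictions = [predict(centroids, x) for x in X_test]
accuracy, detection_rate = evaluate(predictions, y_test)
print(accuracy, detection_rate)
```

Two classifiers with the same accuracy can differ in detection rate, and a missed explosive is the costlier error, so it is worth reporting both when choosing the best classifier.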