sas training session 2

Upload: rajesh-kumar

Post on 14-Apr-2018


  • 7/29/2019 SAS Training Session 2

    1/35

    SAS Training Session 2

    Basic Statistical Analysis Using SAS

    Sun Li

    Centre for Academic Computing

    [email protected]

    Outline

    Produce reports
      Using the Output Delivery System (ODS) to produce output
      Producing summary reports using PROC MEANS & PROC FREQ

    Perform elementary statistical procedures
      Simple inference using PROC FREQ & PROC UNIVARIATE
      Correlation using PROC CORR
      Group comparison using PROC TTEST

    Introduction to regression procedures
      ANOVA procedures using PROC ANOVA and PROC GLM
      General linear regression using PROC REG
      Logistic models using PROC LOGISTIC

    Produce reports

    Using the Output Delivery System (ODS) to produce output
      Producing HTML output
      Producing an RTF file

    Producing descriptive statistics using PROC MEANS
      Computing descriptive statistics
      Creating a summarized data set

    Producing tabular reports using PROC FREQ
      Creating frequency tables
      Creating cross-tabulations

    Produce reports - ODS

    Using Output Delivery System (ODS)

    Produce reports - ODS

    RTF output

    icon

    Produce reports - ODS

    ** producing HTML output and RTF file;
    PROC PRINT data=sas2.marchflights (obs=10);
    RUN;

    * send the same listing to an HTML file;
    ODS html body='E:\lsun\marchflights.html';
    PROC PRINT data=sas2.marchflights (obs=10);
    RUN;
    ODS html close;

    * switch off the listing output and write an RTF file instead;
    ODS listing close;
    ODS rtf file='E:\lsun\insure.rtf';
    PROC PRINT data=sas2.insure;
    RUN;
    PROC TABULATE data=sas2.insure;
      var total balancedue;
      table min max mean, total balancedue;
    RUN;
    ODS rtf close;
    ODS listing;
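
    Several ODS destinations can be open at once, so one procedure run can feed more than one file. The sketch below illustrates this under stated assumptions: the output paths are placeholders that do not come from the slides, and sas2.insure is the data set used above.

    ```sas
    * Sketch: route one report to HTML and RTF at the same time;
    * The two output paths below are hypothetical placeholders;
    ODS listing close;
    ODS html body='C:\temp\insure.html';
    ODS rtf file='C:\temp\insure.rtf';

    PROC PRINT data=sas2.insure (obs=10);
    RUN;

    ODS _all_ close;   /* close every destination that is still open */
    ODS listing;       /* reopen the listing destination */
    ```

    Closing a destination is what finalizes its file, so the RTF or HTML output may be incomplete until the corresponding CLOSE statement has run.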

    Produce reports - PROC MEANS

    PROC MEANS options statistic-keywords;
    RUN;

    Statistic keywords:

    Keyword   Description                     Keyword   Description
    CLM       Two-sided confidence limits     RANGE     Range
    CSS       Corrected sum of squares        SKEWNESS  Skewness
    CV        Coefficient of variation        STDDEV    Standard deviation
    KURTOSIS  Kurtosis                        STDERR    Standard error of the mean
    LCLM      Lower confidence limit          SUM       Sum
    MAX       Maximum value                   SUMWGT    Sum of the weights
    MEAN      Mean                            UCLM      Upper confidence limit
    MIN       Minimum value                   USS       Uncorrected sum of squares
    N         Number of non-missing values    VAR       Variance
    NMISS     Number of missing values        PROBT     p-value for Student's t
    MEDIAN    Median                          T         Student's t
    Q1        25% quantile                    Q3        75% quantile
    P1        1% quantile                     P5        5% quantile
    P10       10% quantile                    P90       90% quantile
    P95       95% quantile                    P99       99% quantile

    Produce reports - PROC MEANS

    ** computing statistics using proc means;
    DATA prdsale;
      set sas2.prdsale;
    RUN;
    PROC PRINT data=prdsale (obs=10);
    RUN;

    * CLASS statement: no sorting needed;
    PROC MEANS data=prdsale maxdec=2 alpha=0.1 clm mean std;
      var actual predict;
      class product;
    RUN;

    * BY statement: the data must be sorted by product first;
    PROC SORT data=prdsale;
      by product;
    RUN;
    PROC MEANS data=prdsale maxdec=2 alpha=0.1 clm mean std;
      var actual predict;
      by product;
    RUN;

    BY statement vs. CLASS statement:

    1. Unlike CLASS processing, BY processing requires that your data already be sorted or indexed in the order of the BY variables.

    2. BY-group results have a layout that is different from the layout of CLASS-group results.

    Produce reports - PROC MEANS

    ** creating a summarized data set using proc means;
    PROC MEANS data=prdsale mean clm;
      var actual predict;
      class product year;
      output out=prdstats
             mean=ave_act ave_pre
             uclm=uclm_act uclm_pre
             lclm=lclm_act lclm_pre;
    RUN;

    OUTPUT OUT=SAS-data-set statistic=variable(s);

    OUT= specifies the name of the output data set.

    statistic= specifies the summary statistic to write out.

    variable(s) specifies the names of the variables to create. These variables hold the statistics for the analysis variables listed in the VAR statement, in the same order.
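
    With two CLASS variables, the output data set holds more than the product-by-year cells: it also contains rows for the overall summary and for each variable on its own, distinguished by the automatic _TYPE_ variable. A minimal sketch, assuming the prdsale data set created earlier; the name prdstats_nway is made up for illustration:

    ```sas
    * Sketch: NWAY keeps only the rows for the full product*year classification;
    * NOPRINT suppresses the printed report because only the data set is wanted;
    PROC MEANS data=prdsale noprint nway;
      var actual predict;
      class product year;
      output out=prdstats_nway      /* hypothetical data set name */
             mean=ave_act ave_pre
             uclm=uclm_act uclm_pre
             lclm=lclm_act lclm_pre;
    RUN;

    PROC PRINT data=prdstats_nway;
    RUN;
    ```

    The printed data set also shows _TYPE_ and _FREQ_, which PROC MEANS adds automatically; _FREQ_ is the number of observations in each cell.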

    Produce reports - PROC FREQ

    PROC FREQ options;
      TABLES variables variable-1*variable-2 / options;
    RUN;

    ** creating tables in proc freq;
    PROC FREQ data=Color;
      weight Count;
      tables Eyes Hair Eyes*Hair;
    RUN;

    PROC SORT data=Color;
      by region;
    RUN;
    PROC FREQ data=color nlevels;
      weight count;
      tables eyes*hair / crosslist;
      by region;
    RUN;

    NLEVELS: displays the number of levels of each variable listed in the TABLES statement.

    CROSSLIST: displays the cross-tabulation results in list form.

    Produce reports

    QUIZ 1

    See the file QUIZ-SAS2.pdf.

    Elementary statistical procedures

    Perform elementary statistical procedures
      Simple statistical inference
        PROC FREQ
        PROC UNIVARIATE
      Correlation using PROC CORR
      Group comparison using PROC TTEST
        One-sample t test
        Two-independent-samples t test

    Elementary statistical procedures - PROC FREQ

    Simple statistical inference
      Chi-square test using PROC FREQ
      More hypothesis tests in PROC UNIVARIATE

    ** chi-square test using proc freq;
    PROC FREQ data=color;
      weight Count;
      table eyes*hair / chisq cl;
    RUN;

    CHISQ: displays chi-square tests of independence for the table.

    CL: displays confidence limits (95% by default) for the measures of association.
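
    The chi-square test compares observed cell counts with the counts expected under independence, so it can help to print both side by side. A minimal sketch, assuming the same color data set as above; the extra TABLES options are standard PROC FREQ options but were not used on the slide:

    ```sas
    * Sketch: show expected counts and each cell's chi-square contribution;
    PROC FREQ data=color;
      weight Count;
      tables eyes*hair / chisq expected cellchi2 norow nocol nopercent;
    RUN;
    ```

    EXPECTED prints the expected frequency in each cell and CELLCHI2 prints that cell's contribution to the chi-square statistic; NOROW, NOCOL and NOPERCENT only trim the percentages to keep the table readable.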

    Elementary statistical procedures - PROC UNIVARIATE

    PROC UNIVARIATE PLOT NORMAL;
      CLASS variable(s);
      HISTOGRAM variable(s) / normal;
      PROBPLOT variable(s) / normal;
      QQPLOT variable(s);
      VAR variable(s);
    RUN;

    PLOT: generates a stem-and-leaf plot, a box plot, and a normal probability plot.

    NORMAL: requests tests for normality.

    HISTOGRAM / normal: generates a histogram with a fitted normal density curve.

    PROBPLOT / normal: generates a probability plot against the normal distribution.

    QQPLOT: generates a Q-Q plot.

    ** simple hypothesis tests in proc univariate;
    PROC UNIVARIATE data=prdsale modes plot normal cibasic(alpha=.1);
      var actual;
      histogram / normal (color=red);
      qqplot;
    RUN;
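
    PROC UNIVARIATE also reports tests for location, and the null value can be set with the MU0= option. A minimal sketch, assuming the prdsale data set from above; the value 500 is chosen only for illustration:

    ```sas
    * Sketch: test whether the centre of ACTUAL equals 500;
    * MU0= sets the null value used by the tests for location;
    PROC UNIVARIATE data=prdsale mu0=500 normal;
      var actual;
    RUN;
    ```

    The "Tests for Location" table then shows Student's t, the sign test, and the signed rank test against that null value, alongside the normality tests requested by NORMAL.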

    Elementary statistical procedures - PROC UNIVARIATE

    PROC UNIVARIATE vs. PROC MEANS

    In general, if you're interested in an overall view of the distribution of a variable, or if you want to run some simple hypothesis tests (normality, etc.), then PROC UNIVARIATE is appropriate. Otherwise, if you only need specific statistics, PROC MEANS with just those statistics requested may be more efficient.

    If you're looking for a probability plot or a histogram, PROC UNIVARIATE may be what you need.

    Elementary statistical procedures - PROC CORR

    Correlation

    PROC CORR options;
      VAR variables;
    RUN;

    ** correlation using proc corr;
    ODS html;
    ODS graphics on;
    PROC CORR data=sas2.citiday nomiss pearson spearman cov plots=matrix;
      var snydjcm snysecm dsiuswil;
    RUN;
    ODS graphics off;
    ODS html close;

    COV requests output of the covariance matrix (for Pearson).

    PEARSON requests Pearson's product-moment correlation coefficient (default).

    KENDALL requests Kendall's tau-b correlation coefficient.

    SPEARMAN requests Spearman's rank-order correlation coefficient.

    PARTIAL (a statement) produces partial correlations among the variables in the VAR list, controlling for the variables listed in the PARTIAL statement.
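
    The PARTIAL statement is described above but not used in the example. A minimal sketch of how it might be applied to the same data set; the choice of dsiuswil as the control variable is an assumption made only for illustration:

    ```sas
    * Sketch: correlation between two series after controlling for a third;
    PROC CORR data=sas2.citiday nomiss pearson;
      var snydjcm snysecm;
      partial dsiuswil;   /* control variable chosen only for illustration */
    RUN;
    ```

    Comparing this partial correlation with the plain Pearson correlation of the same pair shows how much of their association is shared with the control variable.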

    Elementary statistical procedures - PROC TTEST

    Group comparison

    PROC TTEST options;
      VAR variables;
      CLASS stratifier;
    RUN;

    ** group comparison using proc ttest;
    PROC TABULATE data=prdsale;
      var actual;
      class year quarter;
      tables year*quarter, actual*(mean min max);
    RUN;

    * one-sample t test of H0: mean = 500, using 1994 only;
    PROC TTEST data=prdsale h0=500 alpha=0.1;
      var actual;
      where year=1994;
    RUN;

    * two-independent-samples t test comparing the years;
    PROC TTEST data=prdsale;
      class year;
      var actual;
    RUN;

    Elementary statistical procedures

    Variable name   Variable information
    permno          CRSP Permanent Number
    date            Numeric date
    ret             Holding period return
    retx            Return without dividends
    mktrf           Excess return on market
    smb             Small-minus-big return
    hml             High-minus-low return
    rf              Risk-free return rate
    umd             Momentum factor

    QUIZ 2

    See the file QUIZ-SAS2.pdf.

    Introduction to regression procedures

    Introduction to ANOVA procedures
      Comparing groups using PROC ANOVA
      Unbalanced design using PROC GLM

    General linear models
      The REG procedure

    Logistic regression
      Statistical background
      The LOGISTIC procedure

    Regression procedures - PROC ANOVA

    Introduction to ANOVA procedures

    PROC ANOVA options;
      CLASS stratifier;
      MODEL dependents = effects;
      MEANS var / options;
    RUN;
    QUIT;

    ** comparing groups using proc anova;
    PROC TABULATE data=sas2.cargo99;
      var cargorev cargowgt;
      class routeid;
      tables routeid, cargorev*mean cargowgt*mean;
    RUN;

    ODS html;
    ODS graphics on;
    PROC ANOVA data=sas2.cargo99;
      class routeid;
      model cargorev = routeid;
    RUN;
      means routeid / Bon;
    RUN;
    QUIT;
    ODS graphics off;
    ODS html close;

    CLASS specifies the stratifier (grouping) variables.

    MODEL defines the model to be fit.

    MEANS computes and compares group means.

    Regression procedures - PROC GLM

    PROC GLM options;
      CLASS variables;
      MODEL dependents = effects;
      MEANS effects;
    RUN;
    QUIT;

    ** unbalanced ANOVA for two-way design with interaction;
    ODS html;
    ODS graphics on;
    PROC GLM data=sas2.cargo99;
      class routeid;
      model cargorev = routeid cargowgt routeid*cargowgt / ss1 ss2 ss3 ss4;
    RUN;
      lsmeans routeid / pdiff=all adjust=bon;
    RUN;
    QUIT;
    ODS graphics off;
    ODS html close;

    Linear regression procedures

    General linear models

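    $Y = X\beta + \varepsilon$

    $Y$ is the $n \times 1$ vector of responses.

    $X$ is the $n \times (p+1)$ matrix of explanatory variables.

    $\hat{\beta} = (X'X)^{-1} X'Y$ gives the least squares estimates of the regression coefficients.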
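
    The least squares estimate $(X'X)^{-1}X'Y$ can be checked numerically with matrix code. The sketch below assumes that SAS/IML is licensed and that the insurance data set from the PROC REG example later in this session has already been created; the matrix names are made up for illustration.

    ```sas
    * Sketch: least squares estimates computed directly in SAS/IML;
    PROC IML;
      use insurance;
      read all var {size type} into X0;   /* explanatory variables */
      read all var {time} into y;         /* response vector */
      close insurance;

      X = j(nrow(X0), 1, 1) || X0;        /* prepend a column of ones for the intercept */
      beta_hat = solve(X`*X, X`*y);       /* solves (X'X) b = X'y */
      print beta_hat[rowname={"Intercept" "size" "type"}];
    QUIT;
    ```

    SOLVE is used instead of an explicit INV call because solving the normal equations directly is usually the more numerically stable route; the estimates should agree with those PROC REG reports for the model time = size type.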

    Regression procedures - PROC GLM

    PROC GLM:

    It uses the method of least squares to fit general linear models relating one or several continuous dependent variables to one or several independent variables.

    Strengths:
      direct specification of polynomial effects
      ease of specifying categorical effects (PROC GLM automatically generates dummy variables for class variables)

    Weaknesses:
      no collinearity diagnostics
      no influence diagnostics
      no scatter plots
      only one model at a time

    Regression procedures - PROC REG

    PROC REG: provides the most general analysis capabilities
      handles multiple regression models
      provides nine model-selection methods
      allows interactive changes both in the model and in the data used to fit the model
      allows linear equality restrictions on parameters
      tests linear hypotheses and multivariate hypotheses
      produces collinearity diagnostics, influence diagnostics, and partial regression leverage plots
      saves estimates, predicted values, residuals, confidence limits, and other diagnostic statistics in output SAS data sets
      generates plots of data and of various statistics

    Regression procedures - PROC REG

    PROC REG options;
      MODEL dependent-variable = predictors / selection=method R CLI CLM;
      PLOT r.*p.;
    RUN;
    QUIT;

    * Regression using proc reg;
    DATA insurance;
      input time size type @@;
      sizetype=size*type;
      datalines;
    17 151 0   26  92 0   21 175 0   30  31 0   22 104 0
     0 277 0   12 210 0   19 120 0    4 290 0   16 238 0
    28 164 1   15 272 1   11 295 1   38  68 1   31  85 1
    21 224 1   20 166 1   13 305 1   30 124 1   14 246 1
    ;

    PROC REG data=insurance;
      model time = size type sizetype / selection=none;
    RUN;
      delete sizetype;
      print;
    RUN;
      plot r.*p. time*p.;
    RUN;
    QUIT;

    SELECTION: specifies the model selection method (forward, backward, etc.).

    DELETE: deletes variables from the model.

    PRINT: prints the analysis results.

    PLOT: produces diagnostic plots.
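
    The collinearity and influence diagnostics listed among PROC REG's capabilities are requested through MODEL statement options. A minimal sketch, assuming the insurance data set defined above; the output data set name insdiag and its variable names are made up for illustration:

    ```sas
    * Sketch: collinearity and influence diagnostics for the insurance model;
    PROC REG data=insurance;
      model time = size type sizetype / vif collin influence r;
      output out=insdiag          /* hypothetical data set name */
             p=pred r=resid cookd=cooksd;
    RUN;
    QUIT;
    ```

    VIF and COLLIN report collinearity among the predictors, INFLUENCE and R print per-observation influence and residual statistics, and the OUTPUT statement saves predicted values, residuals, and Cook's D for later inspection.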

    Regression procedures - PROC REG

    * Polynomial regression using proc reg;
    PROC REG data=USPopulation;
      var YearSq;
      model Population=Year / selection=none;
      plot r.*p.;
    RUN;
      * add the quadratic term interactively and refit;
      add YearSq;
      print;
      plot / cframe=ligr;
    RUN;
      plot (Population predicted. u95. l95.)*Year / overlay cframe=ligr;
    RUN;
    QUIT;

    * fit the linear and quadratic models side by side with ODS graphics;
    ODS html;
    ODS graphics on;
    PROC REG data=USPopulation;
      Linear:    model Population=Year;
      Quadratic: model Population=Year YearSq;
    RUN;
    QUIT;
    ODS graphics off;
    ODS html close;

    Logistic regression procedures

    Logistic models

    Binary logistic model: dichotomous response outcomes
      e.g.: presence or absence of an event
      PROC LOGISTIC provides the model-fitting capability.

    Ordinal logistic model: ordinal response variable with more than two ordered categories
      e.g.: a 5-point Likert scale
      PROC LOGISTIC fits the proportional odds model with the CLOGIT link.

    Multinomial logistic model: nominal response variable with more than two categories
      e.g.: different types of programs in school
      PROC LOGISTIC fits the generalized logit model if you specify the GLOGIT link.

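    Regression procedures - PROC LOGISTIC

    Binary logistic model

    $\mu_i = E(y_i \mid x_i)$

    $g(\mu) = \operatorname{logit}(\mu) = \log(\mu / (1 - \mu)) = \alpha + \beta' X$

    Ordinal logistic model

    $g(\Pr(Y \le i \mid X)) = \alpha_i + \beta' X, \quad i = 1, \ldots, k$

    Multinomial logistic model

    $\log \dfrac{\Pr(Y = i \mid X)}{\Pr(Y = k+1 \mid X)} = \alpha_i + \beta_i' X, \quad i = 1, \ldots, k$

    PROC LOGISTIC options;
      CLASS variables;
      MODEL dependent-variable = predictors / options;
    RUN;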

    Regression procedures - PROC LOGISTIC

    Binary logistic model

    Variable name   Variable information
    age             Age in years
    ed              Level of education
                    1 = didn't complete high school   2 = high school degree
                    3 = college degree   4 = undergraduate   5 = postgraduate
    employ          Years with current employer
    address         Years at current address
    income          Household income in thousands
    debtinc         Debt-to-income ratio (*100)
    creddebt        Credit card debt in thousands
    othdebt         Other debts in thousands
    default         Previously defaulted (1 = Yes; 0 = No)

    Goal: identify customers with a high chance of defaulting on a bank loan. We have 700 records from the bank database (bankloan).

    Regression procedures - PROC LOGISTIC

    * Binary logistic model;
    PROC MEANS data=sas2.bankloan;
      var age employ address income debtinc creddebt othdebt;
      class default;
    RUN;

    PROC LOGISTIC data=sas2.bankloan;
      class ed(ref='1') / param=ref;
      model default(event='1') = ed age employ address income debtinc creddebt othdebt
            / selection=stepwise slentry=0.3 slstay=0.35 details rsquare lackfit;
      output out=bankloanpred p=prob lower=lcl upper=ucl xbeta=logit;
      ods output parameterestimates=bankloanest;
    RUN;

    SELECTION: specifies the model selection method.

    SLENTRY=0.3: a variable must be significant at the 0.3 level to enter the model.

    SLSTAY=0.35: a variable must be significant at the 0.35 level to stay in the model.

    DETAILS: produces a detailed account of the variable selection process.

    RSQUARE: produces a generalized R-square measure.

    LACKFIT: produces the Hosmer and Lemeshow goodness-of-fit test for the final selected model.

    PARAM=REF: specifies reference cell coding for the CLASS variables.

    REF=: specifies the reference level of a categorical predictor.

    EVENT=: specifies the response category whose probability is modeled (the event), here default = 1.
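
    The OUTPUT statement above stores each customer's predicted probability of default in the variable prob. One simple way to use it is to turn the probabilities into predicted classes and cross-tabulate them against the observed outcome. A minimal sketch, assuming the bankloanpred data set created above; the 0.5 cut-off and the names bankloanclass and pred_default are illustrative assumptions:

    ```sas
    * Sketch: classify customers with a hypothetical 0.5 probability cut-off;
    DATA bankloanclass;
      set bankloanpred;
      * leave the prediction missing when no probability was computed;
      if missing(prob) then pred_default = .;
      else pred_default = (prob ge 0.5);
    RUN;

    * observed outcome by predicted class;
    PROC FREQ data=bankloanclass;
      tables default*pred_default / norow nocol nopercent;
    RUN;
    ```

    Lowering the cut-off flags more customers as likely defaulters at the cost of more false alarms, so the threshold should reflect how costly each kind of error is.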

    Regression procedures - PROC LOGISTIC

    Proportional Odds Model for Ordinal Logistic Model

    To identify factors that influence a person's income category.

    *Ordinal logistic model;DATA income;

    set sas2.bankloan;if income
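
    A proportional odds fit for an income category could be sketched as follows. Everything specific here is an assumption made for illustration and is not taken from the slides: the cut-offs of 25 and 50 (thousands), the names income_sketch and inccat, and the choice of predictors.

    ```sas
    * Sketch: ordinal (proportional odds) model with hypothetical income bands;
    DATA income_sketch;
      set sas2.bankloan;
      /* hypothetical cut-offs, in thousands */
      if missing(income) then inccat = .;
      else if income lt 25 then inccat = 1;
      else if income lt 50 then inccat = 2;
      else inccat = 3;
    RUN;

    PROC LOGISTIC data=income_sketch;
      class ed(ref='1') / param=ref;
      model inccat = age ed employ address / link=clogit;
    RUN;
    ```

    With more than two ordered response levels, PROC LOGISTIC also prints a score test for the proportional odds assumption, which is worth checking before interpreting the common slope estimates.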

    Regression procedures - PROC LOGISTIC

    Generalized Logits Model for Multinomial Logistic Model

    To identify differences in study style among schools and programs.

    * Multinomial logistic model;
    DATA school;
      input school program style $ count;
      datalines;
    1 1 self 10
    1 1 team 17
    1 1 class 26
    1 2 self 5
    1 2 team 12
    1 2 class 50
    2 1 self 21
    2 1 team 17
    2 1 class 26
    2 2 self 16
    2 2 team 12
    2 2 class 36
    3 1 self 15
    3 1 team 15
    3 1 class 16
    3 2 self 12
    3 2 team 12
    3 2 class 20
    ;

    Regression procedures - PROC LOGISTIC

    PROC LOGISTIC data=school;
      freq count;
      class school program / order=data;
      model style = school program school*program / link=glogit;
      output out=progstat p=prob;
      ods output parameterestimates=progest;
    RUN;

    PROC FREQ data=progstat;
      format prob 5.4;
      tables school*program*_level_*prob / list nopercent nocum;
    RUN;

    DATA progodd;
      set progest;
      odds=exp(estimate);
    RUN;
    PROC PRINT data=progodd;
      var response estimate odds;
    RUN;

    LINK: specifies the link function.

    GLOGIT: the generalized logit function.

    Resources and books

    Regression methods
      Applied Regression Analysis, Linear Models, and Related Methods, by John Fox
      Regression Analysis by Example, by Chatterjee, Hadi and Price
      An Introduction to Generalized Linear Models, Second Edition, by Annette J. Dobson

    Logistic regression and categorical data analysis
      Applied Logistic Regression, Second Edition, by David Hosmer and Stanley Lemeshow
      An Introduction to Categorical Data Analysis, by Alan Agresti

    CAC statistical consultation support
      CAC statistical wiki page: http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/SAS.aspx
      Statistical consultation service: [email protected]

    End!