Introduction to Statistical Packages

Post on 10-Aug-2020

TRANSCRIPT

Page 1: Introduction to Statistical Packageslibvolume7.xyz/nursing/bsc/1styear/introductiontocomputers/statisticalpackages/...and cutting edge statistical analysis PRO Easy to learn and use

Introduction to Statistical Packages

Eugene Tseytlin

Department of BioMedical Informatics

University of Pittsburgh

Expectations

• NOT to become an expert in any statistical software package

• NOT to become an expert statistician

• Present an overview of what solutions are available, with emphasis on free, open source software

About Me

Who

• Senior Software Developer

Where

• Department of BioMedical Informatics, University of Pittsburgh

Areas of Expertise

• Intelligent Tutoring Systems (ITS)

• Natural Language Processing (NLP)

• Digital Imaging: digital microscopy and fMRI

• Machine Learning

Technologies

• Java, Matlab, R, RapidMiner, SAS, C/C++, OWL, PHP, Perl

Introduction

• Overview of what is available for statistical analysis

• Overview of what is popular today and what the trends are for tomorrow

• Overview of some individual software packages

• Overview of the dataset that we will be using in the next lecture

Available Statistical Packages

Proprietary

• Excel

• SPSS

• MINITAB

• SAS

Free Software

• LibreOffice Calc

• PSPP

• EpiInfo

• R

What is Used? (Academia)

Figure 7a. Use of data analysis software in academic publications as measured by hits on Google Scholar.

What is Used? (Survey)

What is Used? (Job Market)

Microsoft Excel

COST

• Individual license for Microsoft Office Professional: $350

• Microsoft Office University student license: $99

• Volume discounts available for large organizations and universities

• Free Starter version available on some new PCs

PRO

• Nearly ubiquitous and often pre-installed on new computers

• User friendly

• Very good for basic descriptive statistics, charts and plots

CON

• Costs money

• Not sufficient for anything beyond the most basic statistical analysis
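To make "basic descriptive statistics" concrete, here is a minimal Python sketch that computes the kind of summary a spreadsheet handles well. It uses only the standard library. The sample values and the `describe` helper are made up for illustration and do not come from the lecture.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Sample standard deviation (divides by n - 1)
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical sample, not taken from the lecture
scores = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(scores)

for name, value in summary.items():
    print(f"{name}: {value}")
```

In Excel or LibreOffice Calc, the same summaries are available through worksheet functions such as AVERAGE, MEDIAN, STDEV, MIN and MAX.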

Minitab

COST

• $1,395.00 per single user license

PRO

• Easy to learn and use

• Often taught in schools in introductory statistics courses

• Widely used in engineering for process improvement

CON

• Costs money

• Not suitable for very complicated statistical computation and analysis

• Not often used in academic research

SPSS

COST

• From $1,000 to $12,000 per license, depending on license type

PRO

• Easy to learn and use

• More powerful than Minitab

• One of the most widely used statistical packages in academia and industry

• Has a command line interface in addition to the menu driven user interface

• One of the most powerful statistical packages that is also easy to use

CON

• Very expensive

• Not adequate for modeling and cutting edge statistical analysis

SAS

COST

• Complicated pricing model

• $8,500 first year license fee

PRO

• Widely accepted as the leader in statistical analysis and modeling

• Widely used in industry and academia

• Very flexible and very powerful

CON

• Very, very expensive

• Not user friendly

• Steep learning curve

• Relatively poor graphics capabilities

LibreOffice Calc

LibreOffice is a free and open source office suite, developed by The Document Foundation. It is descended from OpenOffice.org, from which it was forked in 2010.

• OpenOffice vs LibreOffice

• Star → Sun → Oracle → Apache, Document Foundation

• OpenOffice: http://www.openoffice.org/download

• LibreOffice: http://www.libreoffice.org/download/

LibreOffice Calc

COST

• Free

PRO

• Very similar to Microsoft Excel (earlier versions) in functionality and look and feel

• User friendly

• Very good for basic descriptive statistics, charts and plots

• Interoperable with Microsoft Office

CON

• Not sufficient for anything beyond the most basic statistical analysis

EpiInfo

Epi Info is public domain statistical software for epidemiology developed by the Centers for Disease Control and Prevention (CDC).

Epi Info has been in existence for over 20 years and is currently available for Microsoft Windows. The program allows for electronic survey creation, data entry, and analysis. Within the analysis module, analytic routines include t-tests, ANOVA, nonparametric statistics, cross tabulations and stratification with estimates of odds ratios, risk ratios, and risk differences, logistic regression (conditional and unconditional), survival analysis (Kaplan Meier and Cox proportional hazard), and analysis of complex survey data. The software is in the public domain, free, and can be downloaded from http://www.cdc.gov/epiinfo. Limited support is available.
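The odds ratio, risk ratio, and risk difference mentioned above are simple functions of a 2x2 cross tabulation. The Python sketch below shows the standard textbook formulas. It is not Epi Info's own code, the `two_by_two` helper is a name chosen here, and the counts are hypothetical.

```python
def two_by_two(a, b, c, d):
    """Measures of association for a 2x2 table.

                   outcome +   outcome -
    exposed            a           b
    unexposed          c           d
    """
    risk_exposed = a / (a + b)      # proportion of exposed with the outcome
    risk_unexposed = c / (c + d)    # proportion of unexposed with the outcome
    return {
        "risk_ratio": risk_exposed / risk_unexposed,
        "risk_difference": risk_exposed - risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
    }


# Hypothetical counts, for illustration only
result = two_by_two(a=20, b=80, c=10, d=90)

for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The sketch assumes that no cell needed as a denominator is zero. A real package also reports confidence intervals, which are omitted here.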

EpiInfo

COST

• Free

PRO

• Consists of multiple modules to accomplish various tasks beyond just statistical analysis:

• ability to rapidly develop a questionnaire

• customize the data entry process

• quickly enter data into that questionnaire

• analyze the data

CON

• Not a dedicated statistical package

• Not as powerful as commercial alternatives for performing advanced analysis and modeling

PSPP

COST

• Free

PRO

• Aims to be a free SPSS alternative, with an interface that closely resembles SPSS

• User friendly

• Good enough for basic statistical analysis

CON

• Lacks many advanced statistical tests and features that are present in SPSS

• Last version released in 2010

• Not very well known or widely used

R

R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others. R is easily extensible through functions and extensions, and the R community is noted for its active contributions in terms of packages. There are some important differences between R and S, but much code written for S runs unaltered. Many of R's standard functions are written in R itself, which makes it easy for users to follow the algorithmic choices made.

R is highly extensible through the use of user-submitted packages for specific functions or specific areas of study. Due to its S heritage, R has stronger object-oriented programming facilities than most statistical computing languages. Extending R is also eased by its permissive lexical scoping rules.[10]

According to Rexer's Annual Data Miner Survey in 2010, R has become the data mining tool used by more data miners (43%) than any other.[11]

Another strength of R is static graphics, which can produce publication-quality graphs, including mathematical symbols. Dynamic and interactive graphics are available through additional packages.[12]

R

COST

• Free / Open Source

PRO

• Widely used and accepted in industry and academia

• Very powerful and flexible

• Very large user base

• Lots of books and manuals

• Several user interface shells available

CON

• Not user friendly

• Steep learning curve

Dataset

The Dataset and Story Library

http://lib.stat.cmu.edu/DASL/

DASL (pronounced "dazzle") is an online library of datafiles and stories that illustrate the use of basic statistics methods. We hope to provide data from a wide variety of topics so that statistics teachers can find real-world examples that will be interesting to their students. Use DASL's powerful search engine to locate the story or datafile of interest.

Brain Size and Intelligence

Are the size and weight of your brain indicators of your mental capacity? In this study by Willerman et al. (1991) the researchers use Magnetic Resonance Imaging (MRI) to determine the brain size of the subjects. The researchers take into account gender and body size to draw conclusions about the connection between brain size and intelligence.

http://lib.stat.cmu.edu/DASL/Stories/BrainSizeandIntelligence.html

Methods

• Correlation

• Regression

• Scatterplot
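As a rough preview of the first two methods, the Python sketch below computes a Pearson correlation coefficient and a least-squares regression line from scratch. The `pearson_r` and `least_squares` helpers are names chosen here, and the x/y values are invented toy numbers, not the Willerman et al. measurements. The scatterplot is left out because it needs a plotting library.

```python
import math


def _sums(xs, ys):
    """Return the mean of x, the mean of y, and the sums of squares/products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, sxy, sxx, syy


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    _, _, sxy, sxx, syy = _sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares(xs, ys):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    mean_x, mean_y, sxy, sxx, _ = _sums(xs, ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data, not from the Brain Size and Intelligence dataset
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
slope, intercept = least_squares(xs, ys)
print(f"r = {r:.4f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

In the real dataset, x could be MRI_Count and y one of the IQ scores. Each package covered in this lecture provides these computations built in.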

Brain Size and Intelligence

Description: Willerman et al. (1991) collected a sample of 40 right-handed Anglo introductory psychology students at a large southwestern university. Subjects took four subtests (Vocabulary, Similarities, Block Design, and Picture Completion) of the Wechsler (1981) Adult Intelligence Scale-Revised. The researchers used Magnetic Resonance Imaging (MRI) to determine the brain size of the subjects. Information about gender and body size (height and weight) is also included. The researchers withheld the weights of two subjects and the height of one subject for reasons of confidentiality.

Number of cases: 40

Variable Names:

Gender: Male or Female

FSIQ: Full Scale IQ scores based on the four Wechsler (1981) subtests

VIQ: Verbal IQ scores based on the four Wechsler (1981) subtests

PIQ: Performance IQ scores based on the four Wechsler (1981) subtests

Weight: body weight in pounds

Height: height in inches

MRI_Count: total pixel Count from the 18 MRI scans
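Because two weights and one height were withheld, any analysis of this dataset has to cope with missing values. The Python sketch below shows one way to load such a table and skip missing entries when averaging. The three rows are invented, and the "NA" marker, the tab-delimited layout, and the `load_rows` and `mean_of` helpers are assumptions made for illustration. Check the actual datafile for its real format.

```python
import csv
import io

MISSING = "NA"  # assumed placeholder for withheld values
NUMERIC = ("FSIQ", "VIQ", "PIQ", "Weight", "Height", "MRI_Count")

# Invented rows that only mimic the variable layout described above
SAMPLE = """Gender\tFSIQ\tVIQ\tPIQ\tWeight\tHeight\tMRI_Count
Female\t100\t101\t99\t120\t65.0\t800000
Male\t110\t108\t112\tNA\t70.5\t900000
Male\t120\t115\t118\t180\tNA\t950000
"""


def load_rows(text):
    """Parse tab-delimited text into dicts, mapping missing values to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for raw in reader:
        row = {"Gender": raw["Gender"]}
        for name in NUMERIC:
            value = raw[name]
            row[name] = None if value == MISSING else float(value)
        rows.append(row)
    return rows


def mean_of(rows, name):
    """Mean of one variable, ignoring rows where it is missing."""
    values = [row[name] for row in rows if row[name] is not None]
    return sum(values) / len(values)


rows = load_rows(SAMPLE)
print("Mean weight:", mean_of(rows, "Weight"))
print("Mean height:", mean_of(rows, "Height"))
```

Skipping missing entries per variable is the simplest choice. Statistical packages typically let the user choose between this and dropping incomplete cases entirely.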

Conclusion

• Statistical analysis is an integral part of any study and publication

• While commercial statistical software may cost an arm and a leg, free alternatives do exist

• While some free alternatives don't measure up, others are growing and expanding rapidly and may overtake commercial software in features and popularity

References

https://sites.google.com/site/r4statistics/popularity

http://en.freestatistics.info/

http://lib.stat.cmu.edu/

http://www.comfsm.fm/~dleeling/statistics/notes000.html