introduction to stata - ali rashed
CAPMAS International Statistics Day
20-10-2015 Egyptian Economic Census
Workshop 2012/2013
Introduction to STATA
By Ali Rashed
Population Council
17th – 19th October 2015
STATA is a complete, integrated statistical package that provides everything you need for data analysis, data management, and graphics.
STATA is not sold in pieces, which means you get everything you need in one package without annual license fees.
Fast, Accurate, and Easy to use:
WHY STATA
You can access all of STATA’s data management, statistical, and analysis features from the menus and associated dialogs.
Command syntax: simple and consistent
Online help & a topical index built into the online help system
All analyses can be reproduced and documented for publication and review.
WHY STATA, Cont.
Run Stata, Open a data set, describe its contents and Exit:
Run Stata program from the “Start” button
“use” command: opens a Stata data set (also available from the “File” pull-down menu). Example:
cd "D:\My Documents\Training Courses\UNDP Jordan June 2011\Jordan LMPS2010"
use "JLMPS indiv public v1_ 0.dta", clear
“describe” Command
dir and cd commands work just like in DOS
STATA commands are case-sensitive.
Type in small letters
Opening (Using) a data set
Note the FOUR main windows:
1. Command: to issue commands to Stata
2. Results: to see the results
3. Variables: shows the list of variables of the data set active in memory. Click on a variable name to put it into the Command window
4. Review: keeps track of the commands issued, so each command you type is displayed here.
Click on a command to put it into the command window for editing
Double-click on a command to execute it directly
The STATA Display
You can resize these 4 windows independently, and you can
resize the outer window as well. To save your window size
changes, click on Edit, Preferences, Save Preferences Set
Main FOUR Windows, Cont.
File types:
xxxx.do files → text files with your commands, for future reference and editing
xxxx.log files → text files with your output, for future reference and printing
xxxx.dta files → data files in Stata format
xxxx.gph files → graph files in Stata format
xxxx.ado files → programs in Stata
STATA Files Types
Log File: For good documentation of operations and output
Variable storage type:
byte : variable stored in one byte
int: variable stored in 2 bytes
long: 4 bytes (for integers of up to 9 digits)
float: 4 bytes (7 digits of accuracy)
double: 8 bytes (16 digits of accuracy)
“compress” command (reduces each variable to the minimum storage type necessary)
set memory 500m, perm (needed only in Stata 11 and earlier; Stata 12 and later manage memory automatically)
Describing Data
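A minimal sketch of inspecting and reducing storage types. It uses auto.dta, an example data set shipped with Stata, as a stand-in for the workshop data (an assumption, since the census file is not available here):

```stata
* Load an example data set shipped with Stata
sysuse auto, clear

* Show each variable's name, storage type, display format, and label
describe

* Shrink every variable to the smallest storage type that loses no information
compress

* Check which storage types changed
describe
```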
Commands
summarize (or summarize x y z)
provides summary statistics for all or a subset of variables
remember STATA commands are case-sensitive
you can always use abbreviations if they are not ambiguous e.g. sum x
summarize by subgroup
sort groupvar
bysort groupvar: sum varname
Summarizing
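The summarizing commands above can be tried as follows. This is a sketch on the shipped auto.dta example data; price, mpg, weight, and foreign are variables of that data set, not of the census file:

```stata
sysuse auto, clear

* Summary statistics for a subset of variables
summarize price mpg weight

* The abbreviation sum works too; detail adds percentiles, skewness, kurtosis
sum price, detail

* Summary statistics by subgroup; bysort sorts and groups in one step
bysort foreign: summarize price mpg
```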
in qualifier
Defines range of observation that command applies to
Examples:
list in 5/10
list gov pubpriv frame sector4d sector2d in 4/l (the letter l refers to last)
Edit command
Specifying Subsets of the Data
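A short sketch of the in qualifier, again assuming the shipped auto.dta example data (74 observations) in place of the workshop file:

```stata
sysuse auto, clear

* Observations 5 through 10
list make price in 5/10

* Observation 70 through the last one (the letter l means last)
list make price mpg in 70/l

* Negative numbers count from the end: the last three observations
list make price in -3/l
```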
if qualifier
Defines observations that satisfy a certain condition
Example:
sum empl weight totprod totsales totwage outputfc totva netva netindtax profit1 population if pubpriv ==1
sum empl weight totprod totsales totwage outputfc totva netva netindtax profit1 population hhcount if profit1 >=0 & profit1<200
count if profit1<0 & pubpriv ==1 //How Many?//
tab gov if profit1<0 & pubpriv ==1
tab sector4d if profit1<0 & pubpriv ==1
tab sector2d if profit1<0 & pubpriv ==1
== is equal to
!= is not equal to ( ~= also works)
> is greater than
< is less than
>= is greater than or equal to
<= is less than or equal to
Relational operators (combine conditions with the logical operators & for and, | for or, ! for not)
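The if qualifier and the operators listed above can be combined as in this sketch. It assumes the shipped auto.dta example data; the thresholds are arbitrary illustrations:

```stata
sysuse auto, clear

* Summary statistics for foreign cars only
summarize price mpg if foreign == 1

* Count cars in a price band: two conditions joined with & (and)
count if price >= 4000 & price < 6000

* One-way table restricted to a subset
tabulate rep78 if foreign == 0 & mpg > 20

* Missing values compare as larger than any number,
* so exclude them explicitly when using > or >=
count if rep78 > 3 & !missing(rep78)
```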
“generate” command is used to create new variables
“replace” command is used to modify an existing variable
Examples:
sum profit1 // Net Profit
gen LnProfit=ln(profit1)
generate durEstablished= 2015-firmage
sum durEstablished
replace durEstablished =. if durEstablished <0
recode durEstablished (min/0 =. )
Transforming Variables
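A sketch of the generate, replace, and recode pattern on the shipped auto.dta example data. The new variable names (lnprice, price_per_lb, rep3) and the cut-off of 2 are illustrative assumptions:

```stata
sysuse auto, clear

* Create new variables from existing ones
generate lnprice = ln(price)
generate price_per_lb = price / weight

* Modify an existing variable: set values below a cut-off to missing
replace price_per_lb = . if price_per_lb < 2

* Recode into a new variable, leaving the original untouched
recode rep78 (1/2 = 1) (3 = 2) (4/5 = 3), generate(rep3)

summarize lnprice price_per_lb rep3
```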
Basic descriptive commands
• describe or d Gives a summary of the current data file:
•Number of observations/variables
•Data file size
•List of variables (name, type, label value)
• codebook – Variables summary:
•Type, range, values, frequency
• list or l – Displays the values of the variables for each observation
Basic data set management
• sort – Sort the data set
– Examples: sort gov or sort gov sector2d
• Keeping variables – Examples:
• keep id gov pubpriv sector2d profit1 : will only keep these variables
• Dropping variables – Examples:
• drop gov pubpriv frame : will drop these variables
• drop gov-prjs : will drop all variables from gov to prjs
• drop w* : will drop all variables beginning with w
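The sort, keep, and drop variants above can be sketched on the shipped auto.dta example data (an assumption standing in for the workshop file):

```stata
sysuse auto, clear

* Sort by one or more variables
sort foreign price

* Keep only the listed variables (order in memory is unchanged)
keep make price mpg rep78 weight foreign

* Drop a range of adjacent variables: mpg and rep78
drop mpg-rep78

* Drop every variable whose name begins with w (here: weight)
drop w*

describe
```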
Creation of variables
• Command: generate (or gen)
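• Create string variables
• gen str10 cityname = "Cairo"
• Create numeric variables
• gen Net_Profit=profit1- netindtax (type float by default)
• gen double Sales_Per_worker= totsales / saleworktot (a ratio needs float or double; byte stores only integers from -127 to 100)
• Change a variable type:
• gen str7 cluster=substr(id,7,12) (substr(s, n1, n2) returns n2 characters of s starting at position n1)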
• edit id cluster
• gen str4 year="2015"
• destring year, replace
• Rename variables – ren oldname newname: ren id firm_id
• Recode variable values
• for var profit1 netindtax : recode X (min/0=.) (equivalently: recode profit1 netindtax (min/0=.), since recode accepts a variable list)
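A sketch of the variable-creation commands above on the shipped auto.dta example data. The new names (origin, price_per_mpg, year, price_usd) are hypothetical, chosen only for illustration:

```stata
sysuse auto, clear

* String variable with an explicit storage type
generate str10 origin = "Domestic"
replace origin = "Foreign" if foreign == 1

* Numeric variable; double keeps full precision for a ratio
generate double price_per_mpg = price / mpg

* Convert a string variable holding digits into a numeric variable
generate str4 year = "1978"
destring year, replace

* Rename a variable
rename price price_usd

describe origin price_per_mpg year price_usd
```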
Variables Labels and Values
• Labelling variable names
• label var gov "Governorate"
• label var profit1 "Net Profit in ,000"
• Labelling variables values (2 steps)
• label def yesno 0 "No" 2 "Yes"
• label val public yesno
• Changing label values
• label def yesno 1 "Yes" 2 "No", modify
• label val public yesno
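The two-step value labelling can be sketched as follows, assuming the shipped auto.dta example data; the indicator expensive and the 6000 cut-off are made up for the illustration:

```stata
sysuse auto, clear

* Label a variable
label variable mpg "Mileage (miles per gallon)"

* Build a 0/1 indicator to attach value labels to
generate byte expensive = price > 6000

* Step 1: define the label; step 2: attach it to the variable
label define yesno 0 "No" 1 "Yes"
label values expensive yesno
tabulate expensive

* Change the text of existing label values with the modify option
label define yesno 0 "Up to 6,000" 1 "Above 6,000", modify
tabulate expensive
```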
Identify and Delete duplicated observations
• duplicates list id
• duplicates report id
• duplicates browse
• duplicates tag id,gen(tag)
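• duplicates drop : drops observations that are duplicated on all variables
• duplicates drop id, force : drops observations duplicated on id (force is required when a variable list is given)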
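A small sketch of the duplicates commands on a hypothetical five-row data set (the values are invented for illustration): id 2 appears twice as an exact copy, and id 3 appears twice with different incomes.

```stata
clear
input id income
1 100
2 200
2 200
3 300
3 350
end

* How many ids occur more than once?
duplicates report id

* Flag them: dup counts the other observations sharing the same id
duplicates tag id, generate(dup)
list, sepby(id)

* Drop exact copies (identical on every variable): removes one row of id 2
duplicates drop

* Keep the first observation per id; force is required with a variable list
duplicates drop id, force
list
```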
tabulate command produces frequency cross-tabs of one or two variables
tabu gov
tabu gov sector2d
tabu gov sector2d,col
tabu gov sector2d, col row missing nolabel nofreq
tab1 varlist - performs one-way tables for varlist (tab1 gov sector4d sector2d )
tab2 varlist - performs all possible 2-way tables for varlist (tab2 age sector2d sector4d)
Table Command
Tabulation
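The tabulation commands above, sketched on the shipped auto.dta example data in place of the census variables:

```stata
sysuse auto, clear

* One-way frequency table
tabulate foreign

* Two-way table with column percentages
tabulate rep78 foreign, column

* Row and column percentages, missing values shown, frequencies hidden
tabulate rep78 foreign, row column missing nofreq

* A one-way table for each variable in the list
tab1 foreign rep78
```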
Several types of weights
- fweight or frequency weights: are weights that indicate the number of duplicated observations
- aweight or analytical weights: are weights that are inversely proportional to the variance of an observation.
- iweight or importance weights: are weights that indicate the "importance" of the observation in some vague sense.
- pweight or probability weight: or sampling weights, are weights that denote the inverse of the probability that the observation is included due to the sampling design.
Using Weights
EXAMPLES
Frequency weights
tabu gov sector2d [fweight=int(weight)], ro co
Analytic weights
tabu gov sector2d [aweight=weight]
Using Weights
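A sketch contrasting analytic and frequency weights. The first part assumes the shipped census.dta example data (state-level median age and population); the second uses a hypothetical three-row table of counts:

```stata
* Analytic weights: each state's median age weighted by its population
sysuse census, clear
summarize medage
summarize medage [aweight = pop]

* Frequency weights: each row stands for n identical observations
clear
input score n
1 10
2 30
3 60
end
tabulate score [fweight = n]
```

Frequency weights must be integers, which is why the slide example wraps the weight in int().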
To add observations from two files with the same variables
append command
To add variables from two files with similar observations
merge
To add variables from two files with different observations (e.g. individuals and household)
merge idvar
Combining 2 or More STATA Files
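Appending can be sketched by splitting the shipped auto.dta example data in two and stacking the pieces again. The temporary file name is an illustrative assumption, and the block is meant to be run as one do-file, because local macros such as the tempfile name vanish between separate runs:

```stata
* First file: domestic cars only, saved to a temporary file
sysuse auto, clear
keep if foreign == 0
tempfile domestic
save `domestic'

* Second file: foreign cars only, kept in memory
sysuse auto, clear
keep if foreign == 1

* Add the observations of the first file below those in memory
append using `domestic'
tabulate foreign
```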
Merging by unique id allows you to combine variables from two different STATA data sets
Examples
Merging an individual’s employment variables to his/her demographic characteristics
Merging the parent’s info to the individual’s demographic file.
Merging information on a parent who is present in the household to an individual’s demographic file
Merging community information to the individual or household level files
Merging Files
The objective is to match observations that share a unique id from two files
The master file: the file to merge into
The using file: the file to merge from
Examples with two files containing indiv. information
open the file containing the variables you need
use filename, clear
keep the unique id and the variables you need
keep indid hhid gov pn varnames
Match Merge
sort by unique id: sort id
save under new name: save temp1
use master data set: use "ORIGINAL FILE.dta", clear
sort by unique id: sort id
merge by unique id: merge id using temp1
Match Merge (2)
checking how successful your merge was: tabu _merge
_merge==3 observations in both master and using
_merge==2 observations in using but not in master
_merge==1 observations in master but not in using
drop _merge
update option: substitutes missing values in master with nonmissing values in using for the same variables
replace option: replaces any value in master with the non-missing value in using
Match Merge (3)
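The match-merge steps can be sketched end to end as below. Two assumptions: the shipped auto.dta example data stands in for the workshop files, and the merge uses the merge 1:1 syntax of Stata 11 and later, which does not need the data sorted first (the slides show the older merge id using form). Run it as one do-file so the tempfile macro survives.

```stata
sysuse auto, clear
generate id = _n

* Build the "using" file: id plus the variables to add,
* with the first four observations removed so that not everything matches
preserve
keep id mpg weight
drop in 1/4
tempfile extra
save `extra'
restore

* Master file: id plus its own variables
keep id make price

* One-to-one merge on the unique id
merge 1:1 id using `extra'

* 1 = master only, 2 = using only, 3 = matched
tabulate _merge
drop _merge
```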
1- Merging individual-level data into individual level files
2- Merging household level data into individual-level file
3- Merging individual-level data into household-level file
Types of Match Merge
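The second type, household-level data merged into an individual-level file, is a many-to-one merge. A sketch on hypothetical data (all ids and values are invented), using the merge m:1 syntax of Stata 11 and later:

```stata
* Household-level file: one row per household
clear
input hhid hhsize
1 3
2 2
end
tempfile hh
save `hh'

* Individual-level file: several rows per household
clear
input indid hhid age
1 1 40
2 1 38
3 1 10
4 2 65
5 2 60
end

* Many individuals (m) match one household (1) on hhid
merge m:1 hhid using `hh'
list, sepby(hhid)
```

Reversing the roles (household file in memory, individual file as using) would call for merge 1:m instead.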
On-line help is one of the most useful aspects of STATA
Now connected to STATA Corp web site through the net
Help menu
search
stata commands
Stata Technical Bulletin
Using STATA’s on-line help
What’s new in STATA
STATA is web-aware
use data sets over the web
example: use http://www.stata.com/manual/oddeven.dta,clear
updates
update query
check out help menu
For Advanced Users
Stata can accept data in several forms.
Stata Editor:
Enter a small data set consisting of 6 observations, and three variables, where var1 is the name of individual, var2 is his income, and var3 is his/her consumption.
Then, “list”, “describe”, and “save”.
Stata can read ASCII (text) files:
Delimited ASCII: data separated by spaces, commas, or tabs
Fixed length ASCII files
Utilities (STAT/Transfer) transform data sets from one form (say SPSS, Excel, etc.) into all other forms.
Inputting and Reading Data
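Instead of typing into the Data Editor, the small exercise above can be sketched with the input command. The names and amounts are hypothetical, and small_demo is an assumed file name:

```stata
clear
input str10 name income consumption
"Ahmed"  1200  900
"Mona"   1500 1100
"Hassan"  800  750
"Sara"   2000 1400
"Omar"    950  900
"Laila"  1700 1300
end

list
describe
save small_demo, replace
```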
ASCII delimited files are text files where data are separated by delimiters
If missing observations are spaces, then delimiter should not be a space, use comma instead
For space delimited data, the command to use is: infile x y z using data.txt
x y z should be names equal in number to the variables in each record
if x y z is omitted, STATA assigns v1 v2 v3
describe
compress
infile assumes numeric format unless otherwise specified
Assume x is a string (alphanumeric) variable: infile str10 x y z using data.txt
Reading Delimited ASCII files
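A self-contained sketch of reading space-delimited data with infile. It first writes a tiny text file so there is something to read; the file name data_demo.txt and its contents are assumptions made for the example:

```stata
* Write a two-line, space-delimited text file
tempname fh
file open `fh' using data_demo.txt, write text replace
file write `fh' "Cairo 10 2.5" _n
file write `fh' "Giza 20 3.5" _n
file close `fh'

* x is declared as a string; y and z default to numeric
infile str10 x y z using data_demo.txt, clear

list
describe
compress
```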
Another common format is comma or tab delimited data
Variables names are assumed to be in first row, also comma or tab delimited
No need to identify string variables in comma or tab delimited files
The appropriate STATA commands are:
insheet using filename.csv, comma
insheet using filename.txt, tab
A utility program such as STAT/TRANSFER can be used to read most data formats, including SPSS, Excel, SAS, Dbase, Access, etc.
Reading Delimited ASCII files
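A round-trip sketch for comma-delimited files: export a few variables of the shipped auto.dta example data, then read them back. The file name auto_demo.csv is an assumption. From Stata 13 on, import delimited and export delimited are the documented replacements for insheet and outsheet, which continue to work.

```stata
sysuse auto, clear
keep make price mpg

* Write a comma-separated file with variable names in the first row
outsheet using auto_demo.csv, comma replace

* Read it back; string variables are detected automatically
insheet using auto_demo.csv, comma clear
describe
```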
Fixed format ASCII files have no separators between variables, but each variable always appears in the same positions
This is how data typically come from data entry packages
Two ways of doing it:
Without data dictionary: infix rectyp 1-2 gov 3-4 qism 5-6 psu 7-9 urbrur 10 hhgov 11-14 hhpsu 15-16 using rec02.dat
With data dictionary: prepare dictionary file using text editor as explained in handout
Reading Fixed Format ASCII files
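A sketch of infix without a dictionary. Because rec02.dat is not available here, the example writes its own two-record file; fixed_demo.dat and the three-field layout are assumptions:

```stata
* Write two fixed-width records: columns 1-2, 3-4, and 5-7
tempname fh
file open `fh' using fixed_demo.dat, write text replace
file write `fh' "0112001" _n
file write `fh' "0221002" _n
file close `fh'

* Read each variable from its column positions
infix rectyp 1-2 gov 3-4 psu 5-7 using fixed_demo.dat, clear
list
```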
Using STATA Graphs
graph twoway scatterplots, line plots, etc.
graph matrix scatterplot matrices
graph bar bar charts
graph dot dot charts
graph box box-and-whisker plots
graph pie pie charts
histogram
graph save
graph use
graph display
graph combine
graph export
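A few of the graph commands above in sequence, sketched on the shipped auto.dta example data. The graph names h and s and the output file name are assumptions, and the available export formats depend on the platform:

```stata
sysuse auto, clear

* Histogram, kept in memory under the name h
histogram price, name(h, replace)

* Scatterplot, kept in memory under the name s
graph twoway scatter mpg weight, name(s, replace)

* Box plot by group
graph box price, over(foreign)

* Put the two named graphs side by side and export the result
graph combine h s
graph export combined_demo.png, replace
```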
Macros
• A macro is a shorthand—one thing standing for another. For instance:
• local list "age weight sex"
• regress outcome `list' is the same as
• regress outcome age weight sex
• local or global? What is the difference? Which one should I use?
Globals can get you into a mess: they persist for the whole session and are visible everywhere
Better to stick with local macros rather than get in over your head
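A sketch of local versus global macros, assuming the shipped auto.dta example data; the macro names controls and outcome are illustrative. Run it as one do-file, since a local macro exists only within the do-file or program that defines it:

```stata
sysuse auto, clear

* A local macro: referenced with a left quote and an apostrophe
local controls "mpg weight foreign"
regress price `controls'

* The same regression written out in full
regress price mpg weight foreign

* A global macro: referenced with a dollar sign, lives until dropped
global outcome "price"
summarize $outcome

* Clean up the global so it cannot leak into later work
macro drop outcome
```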
Thank you