introduction to stata - ali rashed

CAPMAS International Statistics Day

20-10-2015 Egyptian Economic Census

Workshop 2012/2013

Introduction to STATA

By Ali Rashed

Population Council

17th – 19th October 2015

STATA is a complete, integrated statistical package that provides everything you need for data analysis, data management, and graphics.

STATA is not sold in pieces, which means you get everything you need in one package without annual license fees.

Fast, Accurate, and Easy to use:

WHY STATA

You can access all of STATA’s data management, statistical, and analysis features from the menus and associated dialogs.

Command syntax: simple and consistent

Online help & a topical index built into the online help system

All analyses can be reproduced and documented for publication and review.

WHY STATA, Cont.

Run Stata, Open a data set, describe its contents and Exit:

Run Stata program from the “Start” button

“use” command: open a Stata data set (also available from the “File” pull-down menu). Example:

cd "D:\My Documents\Training Courses\UNDP Jordan June 2011\Jordan LMPS2010"

use "JLMPS indiv public v1_ 0.dta", clear

“describe” Command

dir and cd commands work just like in DOS

STATA commands are case-sensitive.

Type in small letters

Opening (Using) a data set

Note the FOUR main windows:

1. Command: to issue commands to Stata

2. Result: to see the results

3. Variables: shows the list of variables of the data set active in memory. Click on a variable name to put it into the Command window

4. Review: keeps track of the commands issued, so each command you type is displayed here. Click on a command to put it into the Command window for editing; double-click on a command to execute it directly

The STATA Display

You can resize these 4 windows independently, and you can resize the outer window as well. To save your window size changes, click on Edit, Preferences, Save Preferences Set

Main FOUR Windows, Cont.

File types:

xxxx.do files → text files with your commands, for future reference and editing

xxxx.log files → text files with your output, for future reference and printing

xxxx.dta files → data files in Stata format

xxxx.gph files → graph files in Stata format

xxxx.ado files → programs in Stata

STATA Files Types

Log File: For good documentation of operations and output
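A minimal sketch of logging a session. It assumes Stata's bundled auto example data set (loaded with sysuse); the log file name session1.log is hypothetical.

```stata
* Start a plain-text log; replace overwrites an earlier log with the same name
log using "session1.log", text replace

sysuse auto, clear    // example data shipped with Stata
describe

* Close the log so the file is complete on disk
log close
```

Everything shown in the Results window between log using and log close ends up in the log file.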

Variable storage type:

byte : variable stored in one byte

int: variable stored in 2 bytes

long: 4 bytes (for integers too large for int, up to 9 digits)

float: 4 bytes (7 digits of accuracy)

double: 8 bytes (16 digits of accuracy )

“compress” command (Reduce the storage type to minimum storage necessary)

set memory 500m, perm (needed only in Stata 11 and earlier; from Stata 12 on, memory is managed automatically)
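To see storage types and compress at work, here is a small sketch on the bundled auto data set. The variable price_copy is made up for the illustration.

```stata
sysuse auto, clear

* Store a copy of price in a deliberately oversized type
generate double price_copy = price
describe price price_copy

* compress demotes each variable to the smallest type that loses no information
compress
describe price price_copy
```

Comparing the two describe outputs shows the storage type of price_copy shrinking while its values stay the same.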

Describing Data

Commands

summarize (or summarize x y z)

provides summary statistics for all or a subset of variables

remember STATA commands are case-sensitive

you can always use abbreviations if they are not ambiguous e.g. sum x

summarize by subgroup

sort groupvar

by groupvar: sum varname

or, in one step: bysort groupvar: sum varname
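A short sketch of these summarize variants, using the bundled auto data set in place of the workshop data:

```stata
sysuse auto, clear

* Summary statistics for selected variables; detail adds percentiles
summarize price mpg
summarize price, detail

* Same statistics within each group; bysort sorts and groups in one step
bysort foreign: summarize price
```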

Summarizing

in qualifier

Defines range of observation that command applies to

Examples:

list in 5/10

list gov pubpriv frame sector4d sector2d in 4/l (the letter l refers to last)

Edit command

Specifying Subsets of the Data

if qualifier

Defines observations that satisfy a certain condition

Example:

sum empl weight totprod totsales totwage outputfc totva netva netindtax profit1 population if pubpriv ==1

sum empl weight totprod totsales totwage outputfc totva netva netindtax profit1 population hhcount if profit1 >=0 & profit1<200

count if profit1<0 & pubpriv ==1 // how many?

tab gov if profit1<0 & pubpriv ==1

tab sector4d if profit1<0 & pubpriv ==1

tab sector2d if profit1<0 & pubpriv ==1

== is equal to

!= is not equal to ( ~= also works)

> is greater than

< is less than

>= is greater than or equal to

<= is less than or equal to

& is and, | is or, ! is not (~ also works)

Relational and Logical Operators
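The in and if qualifiers and the operators can be tried together. This is a sketch on the bundled auto data set, not the census file. It also shows one point that is easy to miss: Stata treats a missing value (.) as larger than any number.

```stata
sysuse auto, clear

* in: restrict by observation number
list make price in 1/5
list make price in -3/l        // last three observations

* if: restrict by condition, combining relational and logical operators
count if price > 10000 & foreign == 1
summarize mpg if rep78 <= 2 | rep78 == 5

* Missing values (.) count as larger than any number,
* so exclude them explicitly when using > or >=
count if rep78 > 4
count if rep78 > 4 & !missing(rep78)
```

The last two counts differ by the number of observations where rep78 is missing.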

“generate” command is used to create new variables

“replace” command is used to modify an existing variable

Examples:

sum profit1 // Net Profit

gen LnProfit=ln(profit1)

generate durEstablished=2015-firmage

sum durEstablished

replace durEstablished=. if durEstablished<0

recode durEstablished (min/0=.)

Transforming Variables
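The same generate and replace pattern, sketched on the bundled auto data set. The variable names ln_price and price_k are made up for the illustration.

```stata
sysuse auto, clear

* New variables from existing ones
generate ln_price = ln(price)
generate price_k  = price / 1000

* replace changes an existing variable, here only where the condition holds
replace price_k = . if price_k > 15

summarize ln_price price_k
```

Note that ln() returns missing for zero or negative arguments, which matters for a variable such as profit that can be negative.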

Basic descriptive commands

• describe or d Gives a summary of the current data file:

•Number of observations/variables

•Data file size

•List of variables (name, type, label value)

• codebook – Variables summary:

•Type, range, values, frequency

• list or l – Display the values of the variables for each observation
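A quick sketch of the three descriptive commands, again assuming the bundled auto data set:

```stata
sysuse auto, clear

describe                      // whole data set
describe price mpg            // selected variables only
codebook rep78 foreign        // type, range, values, missing counts
list make price mpg in 1/5    // first five observations
```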

Basic data set management

• sort – Sort the data set

– Examples: sort gov or sort gov sector2d

• Keeping variables – Examples:

• keep id gov pubpriv sector2d profit1 : will only keep these variables

• Dropping variables – Examples:

• drop gov pubpriv frame : will drop these variables

• drop gov-prjs : will drop all variables from gov to prjs

• drop w* : will drop all variables beginning with w
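The sort, keep and drop forms above, sketched on the bundled auto data set so that each step can be checked with describe:

```stata
sysuse auto, clear

sort foreign price                       // sort by group, then by price

* keep leaves the listed variables in their original data set order:
* make price mpg weight length foreign
keep make price mpg foreign weight length

drop weight-length                       // a range: weight through length
drop m*                                  // a wildcard: make and mpg

describe                                 // price and foreign remain
```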

Creation of variables

• Command: generate (or gen)

• Create string variables

• gen str10 cityname = "Cairo"

• Create numeric variables

• gen Net_Profit = profit1 - netindtax (type float by default)

• gen float Sales_Per_worker = totsales / saleworktot (a ratio needs float or double; byte holds only small integers)

• Change a variable type:

• gen str7 cluster = substr(id,7,12)

• edit id cluster

• gen str4 year = "2015"

• destring year, replace

• Rename variables – ren oldname newname: ren id firm_id

• Recode variable values

• recode profit1 netindtax (min/0=.) (recode accepts a variable list, so the older "for var" prefix is not needed)
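A sketch that runs the same kinds of steps on the bundled auto data set. The new variable names (year, make3, rep3) and the recode groupings are made up for the illustration.

```stata
sysuse auto, clear

* String variable, then conversion to numeric
gen str4 year = "2015"
destring year, replace

* substr(s, start, length): the first three characters of make
gen str3 make3 = substr(make, 1, 3)

* Rename a variable
rename mpg miles_per_gallon

* Recode into a new variable so the original stays untouched
recode rep78 (1/2 = 1) (3 = 2) (4/5 = 3), gen(rep3)
tabulate rep78 rep3, missing
```

Using gen() with recode keeps the original values available for checking, as the final cross-tab does.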

Variables Labels and Values

• Labelling variable names

• label var gov "Governorate"

• label var profit1 "Net Profit in ,000"

• Labelling variables values (2 steps)

• label def yesno 0 "No" 2 "Yes"

• label val public yesno

• Changing label values

• label def yesno 1 "Yes" 2 "No", modify

• label val public yesno
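The two-step value labelling can be sketched as follows on the bundled auto data set. The indicator expensive and its 6,000 cut-off are made up for the illustration.

```stata
sysuse auto, clear

* Variable label
label variable mpg "Mileage (miles per gallon)"

* Step 1: define the label; step 2: attach it to a variable
gen byte expensive = price > 6000
label define yesno 0 "No" 1 "Yes"
label values expensive yesno
tabulate expensive

* modify changes or adds entries in an existing label
label define yesno 1 "Yes (above 6,000)", modify
tabulate expensive
```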

Identify and Delete duplicated observations

• duplicates list id

• duplicates report id

• duplicates browse

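• duplicates tag id, gen(tag)

• duplicates drop : drops observations that are duplicated on all variables

• duplicates drop id, force : force is required when a variable list is given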
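A small sketch of the duplicates workflow. It creates duplicates on purpose in the bundled auto data set, where make identifies each observation, and then removes them.

```stata
sysuse auto, clear

* Create three duplicated observations on purpose
expand 2 in 1/3

duplicates report make            // how many copies exist
duplicates tag make, gen(dup)     // dup = number of extra copies
list make dup if dup > 0

* Keep one observation per value of make
duplicates drop make, force
count
```

The final count should be back to the original number of observations.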

tabulate command produces frequency cross-tabs of one or two variables

tabu gov

tabu gov sector2d

tabu gov sector2d,col

tabu gov sector2d, col row missing nolabel nofreq

tab1 varlist - performs one-way tables for varlist (tab1 gov sector4d sector2d )

tab2 varlist - performs all possible 2-way tables for varlist (tab2 age sector2d sector4d)

Table Command

Tabulation
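The tabulate variants above, sketched on the bundled auto data set:

```stata
sysuse auto, clear

tabulate rep78                    // one-way frequencies
tabulate rep78, missing           // include missing values as a category
tabulate rep78 foreign, row col   // two-way table with row and column percentages
tab1 rep78 foreign                // a one-way table for each variable
```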

Several types of weights

- fweight or frequency weights: are weights that indicate the number of duplicated observations

- aweight or analytical weights: are weights that are inversely proportional to the variance of an observation.

- iweight or importance weights: are weights that indicate the "importance" of the observation in some vague sense.

- pweight or probability weight: or sampling weights, are weights that denote the inverse of the probability that the observation is included due to the sampling design.

Using Weights

EXAMPLES

Frequency weights

tabu gov sector2d [fweight=int(weight)], row col

Analytical weights

tabu gov sector2d [aweight=weight]

Using Weights
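A sketch of the weight syntax on the bundled auto data set. The weight variable wt is made up for the illustration and has no substantive meaning.

```stata
sysuse auto, clear

* Hypothetical integer weight taking the values 1, 2, 3
gen wt = 1 + mod(_n, 3)

tabulate foreign [fweight = wt]      // frequency weights must be integers
summarize price [aweight = wt]       // analytic weights
mean price [pweight = wt]            // sampling weights (summarize does not accept pweight)
```

Which weight types are allowed differs by command; the help file of each command lists them.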

To add observations from two files with the same variables

append command

To add variables from two files with similar observations

merge

To add variables from two files with different observations (e.g. individuals and household)

merge idvar

Combining 2 or More STATA Files
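A minimal sketch of append. It splits the bundled auto data set into two parts and stacks them again; a temporary file stands in for a second data file. Run it as one do-file, since it uses a local macro.

```stata
* First part: domestic cars, saved to a temporary file
sysuse auto, clear
keep if foreign == 0
tempfile domestic
save `domestic'

* Second part: foreign cars, with the first part added underneath
sysuse auto, clear
keep if foreign == 1
append using `domestic'

count
tabulate foreign
```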

Merging by unique id allows you to combine variables from two different STATA data sets

Examples

Merging an individual’s employment variables to his/her demographic characteristics

Merging the parent’s info to the individual’s demographic file.

Merging information on a parent who is present in the household to an individual’s demographic file

Merging community information to the individual or household level files

Merging Files

The objective is to match observations that share a unique id from two files

The master file: the file to merge into

The using file: the file to merge from

Examples with two files containing indiv. information

open the file containing the variables you need

use filename, clear

keep the unique id and the variables you need

keep indid hhid gov pn varnames

Match Merge

sort by unique id: sort id

save under new name: save temp1

use master data set: use "ORIGINAL FILE.dta", clear

sort by unique id: sort id

merge by unique id: merge id using temp1

Match Merge (2)

checking how successful your merge was: tabu _merge

_merge==3 observations in both master and using

_merge==2 observations in using but not in master

_merge==1 observation in master but not in using

drop _merge

update option: substitutes missing values in master with nonmissing values in using for the same variables

replace option (used together with update): replaces any value in master with the nonmissing value in using

Match Merge (3)

1- Merging individual-level data into individual level files

2- Merging household level data into individual-level file

3- Merging individual-level data into household-level file

Types of Match Merge
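The second case, household-level data merged into an individual-level file, can be sketched with the merge syntax available since Stata 11 (merge 1:1, m:1, 1:m), which needs no prior sort. The slides' "merge id using temp1" is the earlier form. All data values below are invented for the illustration; run the block as one do-file.

```stata
* Hypothetical household-level file: one row per household
clear
input hhid str10 gov
1 "Cairo"
2 "Giza"
end
tempfile households
save `households'

* Hypothetical individual-level file: several rows per household
clear
input indid hhid age
1 1 34
2 1 29
3 2 51
4 3 40
end

* Many individuals match one household record
merge m:1 hhid using `households'
tabulate _merge
list
```

Individual 4 belongs to a household that is absent from the household file, so it appears with _merge==1 and a blank gov.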

On-line help is one of the most useful aspects of STATA

Now connected to the StataCorp web site over the internet

Help menu

search

stata commands

Stata Technical Bulletin

Using STATA’s on-line help

What’s new in STATA

STATA is web-aware

use data sets over the web

example: use http://www.stata.com/manual/oddeven.dta,clear

updates

update query

check out help menu

For Advanced Users

Stata can accept data in several forms.

Stata Editor:

Enter a small data set consisting of 6 observations, and three variables, where var1 is the name of individual, var2 is his income, and var3 is his/her consumption.

Then, “list”, “describe”, and “save”.
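Besides the Data Editor, the same exercise can be done with the input command. A sketch follows; the names, figures and the file name mydata.dta are all invented for the illustration.

```stata
clear
input str10 name income consumption
"Ahmed"  1200  900
"Mona"   1500 1100
"Hany"    800  750
"Sara"   2000 1400
"Omar"    950  900
"Laila"  1700 1300
end

list
describe
save "mydata.dta", replace
```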

Stata can read ASCII (text) files:

Delimited ASCII: data separated by spaces, commas, or tabs

Fixed-length ASCII files

Utilities transform data sets from one form (say SPSS, Excel, etc.) into all other forms (STAT/Transfer).

Inputting and Reading Data

ASCII delimited files are text files where data are separated by delimiters

If missing observations are spaces, then delimiter should not be a space, use comma instead

For space delimited data, the command to use is: infile x y z using data.txt

x y z should be names equal in number to the variables in each record

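x y z cannot be omitted here: without a variable list, infile using expects a dictionary file (insheet, by contrast, assigns v1 v2 v3 when the file has no variable names)

after reading the data: describe, then compress

infile assumes numeric format unless otherwise specified. If x is a string (alphanumeric) variable: infile str10 x y z using data.txt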

Reading Delimited ASCII files

Another common format is comma or tab delimited data

Variables names are assumed to be in first row, also comma or tab delimited

No need to identify string variables in comma or tab delimited files

The appropriate STATA commands are:

insheet using filename.csv, comma

insheet using filename.txt, tab

A utility program such as STAT/TRANSFER can be used to read most data formats, including SPSS, Excel, SAS, Dbase, Access, etc.

Reading Delimited ASCII files
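A sketch of a round trip through a comma-delimited file. It assumes Stata 13 or later, where import delimited and export delimited are the current counterparts of insheet and outsheet. The file name auto_small.csv is hypothetical.

```stata
* Write a small comma-delimited file with variable names in the first row
sysuse auto, clear
keep make price mpg
export delimited using "auto_small.csv", replace

* Read it back; string columns are detected automatically
import delimited using "auto_small.csv", clear
describe
list in 1/5
```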

Fixed format ASCII files have no separators between variables, but each variable always appears in the same positions

This is how data typically come from data entry packages

Two ways of doing it:

Without data dictionary: infix rectyp 1-2 gov 3-4 qism 5-6 psu 7-9 urbrur 10 hhgov 11-14 hhpsu 15-16 using rec02.dat

With data dictionary: prepare the dictionary file using a text editor, as explained in the handout

Reading Fixed Format ASCII files
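To try infix without the workshop file, the sketch below first writes a tiny fixed-format file and then reads it by column position. The file name rec_demo.dat, its two records and the shortened layout are invented for the illustration.

```stata
* Write two fixed-width records: columns 1-2 rectyp, 3-4 gov, 5-7 psu
tempname fh
file open `fh' using "rec_demo.dat", write text replace
file write `fh' "0201123" _n
file write `fh' "0202457" _n
file close `fh'

* Read each variable from its column range
infix rectyp 1-2 gov 3-4 psu 5-7 using "rec_demo.dat", clear
list
```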

Using STATA Graphs

graph twoway scatterplots, line plots, etc.

graph matrix scatterplot matrices

graph bar bar charts

graph dot dot charts

graph box box-and-whisker plots

graph pie pie charts

histogram

graph save

graph use

graph display

graph combine

graph export
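A few of these graph commands, sketched on the bundled auto data set. The file name mpg_box.gph is hypothetical.

```stata
sysuse auto, clear

histogram price, frequency
graph twoway scatter price mpg
graph box mpg, over(foreign)

* Save the graph currently displayed in Stata's own .gph format
graph save "mpg_box.gph", replace

* Reopen a saved graph later
graph use "mpg_box.gph"
```

graph export writes the current graph to other formats, chosen by the file extension; which formats are available depends on the platform.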

Macros

• A macro is a shorthand—one thing standing for another. For instance:

• local list "age weight sex"

• regress outcome `list' is the same as

• regress outcome age weight sex

• local or global? What is the difference? Which one should I use?

Globals can get you into a mess

Better to stick with local macros rather than get in over your head
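A sketch of the difference, using the bundled auto data set. The macro names controls and depvar are made up for the illustration; run the block as one do-file.

```stata
sysuse auto, clear

* Local macro: visible only inside the do-file or program that defines it
local controls "mpg weight foreign"
regress price `controls'

* Global macro: visible everywhere for the rest of the session
global depvar "price"
regress $depvar `controls'

* Remove the global once it is no longer needed
macro drop depvar
```

A local is referenced between a left quote and an apostrophe, a global with a dollar sign. Because a global stays defined after the do-file ends, a leftover one can silently change what a later do-file does, which is the mess the slide warns about.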

Thank you
