Using Stata More Effectively

Benjamin Phillips

Brandeis University
Maurice and Marilyn Cohen Center for Modern Jewish Studies

August 2010

    gen dobday=string(day,"%2.0f")
    replace dobday="0"+dobday if length(dobday)==1
    gen dobmonth=string(month,"%2.0f")
    replace dobmonth="0"+dobmonth if length(dobmonth)==1
    recode year (1962/1977=1978)(1998=1992)
    gen dobyear=string(year,"%4.0f")
    gen age=floor((date("1jan2010","DMY")- ///
        date((dobday+"-"+dobmonth+"-"+dobyear), "DMY"))/365.25)

© 2010 Brandeis University, Maurice and Marilyn Cohen Center for Modern Jewish Studies

Updated August 17, 2010

Table of Contents

Introduction  1
Stata 11  1
Setting up Stata  1
Working with directories  2
Versions  3
Running .do files within .do files or the command dialog  3
Comments  3
Breaking long lines  4
Avoiding errors  4
Renaming variables  5
Changing variable order  5
Computing variables with egen  5
Macros  5
Looping (foreach, forvalues, and while)  7
Creating sets of dummy variables: the xi command  11
The if and else commands  13
Case order variables, sorting, and cross-case functions  14
The duplicates command  17
The list command  17
The by command  17
Data verification  18
The in command  18
Predictions from estimation commands  18
Working with dates and times  19
Numeric variable types  22
String functions  22
Importing .csv and other text files  25
Exporting .csv, fixed format, and other text files  26
Merging, appending, and reshaping  27
Matrices and scalars  30
Running Stata from the command line  34
Programs  36
The post and postfile commands  38
The bootstrap  38
Weird error messages  39
Index  41


Introduction

This file contains most of the collective wisdom of the Cohen Center regarding the effective use of Stata. It assumes a good working knowledge of basic Stata procedures and provides a guide to nonobvious shortcuts and other tricks of the trade. While I am the author of this document, I've incorporated others' discoveries as well, giving credit in the text to the discoverers of new functionality.

Stata 11

Stata 11 introduces three very useful features: a variables manager, an improved .do file editor, and the full set of manuals in PDF format. The variables manager is very similar to the SPSS (or PASW, or IBM SPSS) variable view. Besides providing more screen real estate to view variable labels, it shows which value label (if any) is attached to each variable. The .do file editor now allows collapsing of loops and colors commands, strings, locals, and comments, helping differentiate text. It also numbers rows and gives column numbers. The on-line manual is available from Help > PDF Documentation. For this to work properly, though, you need to use Adobe Acrobat or Acrobat Reader as your default PDF viewer, because the manual is a set of linked PDF files and third-party readers do not seem able to move from one file to another. If you have a third-party PDF viewer as the default, find a PDF file in My Computer or Windows Explorer, right-click on it, select Open With > Choose Program, check the box next to "Always choose the selected program to open this type of file," choose Acrobat or Acrobat Reader, and then select OK.

Along with the good points, some syntax has changed. The syntax for merging datasets has arguably improved and is certainly very different from the previous version (see Merging, appending, and reshaping, p. 27ff). If you don't want to rewrite old syntax, be sure to use the version command (see Versions, p. 3).

Setting up Stata

Stata has default settings that some of us do not like. Here is a list of ways to permanently correct them.

Memory

Stata opens datasets in RAM (random access memory).
If you don't have enough RAM, you can't open the dataset. But even if you do have enough RAM, you may not be able to open the dataset. Stata grabs a chunk of RAM when it is launched for opening and working with datasets. By default, this is a measly 10MB. To expand this to a more useful 200MB permanently:

    set mem 200m, perm

This can be expanded on a temporary basis to, say, 1GB as follows:

    set mem 1g

Note that you're limited by the RAM in your computer, the amount of memory used by other applications, and whether you are using a 32- or 64-bit operating system. Basically, a 32-bit operating system can only keep track of 2^32 memory addresses (4,294,967,296), roughly corresponding to 4GB. In Windows, 2GB (some of this may be virtual memory stored on the swap file) is allocated to the operating system and each application receives another 2GB. In practice, the maximum amount of RAM 32-bit Windows will allocate to Stata in a system with 2GB of RAM (the normal maximum for 32-bit OSs) is somewhere in the 200MB to 250MB range. In 64-bit OSs, the maximum number of memory addresses that can be tracked is 2^64. In theory, this would include 18,446,744,073,709,551,616 addresses, roughly corresponding to 18EB (exabytes). In practice, the 64-bit architecture used in most AMD and Intel chips limits addressable memory to 256TB (terabytes).

More

To turn off Stata's annoying characteristic of making you click to get the next page of results, use:

    set more off, perm

Scroll Buffer Size

Stata will only display a certain number of past results. In general, it's better to display more than less. The command to use is set scrollbufsize #, where # is a number of bytes between 10,000 and 2,000,000. The setting is permanent and does not take the , perm option.
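As a rough sketch, the three settings just described could be collected in a short .do file and run once. It assumes Stata 11 with no dataset open (set mem will refuse to run otherwise); the buffer size of 500000 is only an illustrative value within the permitted range, not a recommendation from the text.

```stata
* One-time setup (a sketch; run before opening any dataset)
* 200MB of memory for data, remembered across sessions
set mem 200m, perm
* stop pausing between screens of output
set more off, perm
* bytes of past results to keep; illustrative value, always permanent
set scrollbufsize 500000
```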
Stata must be closed and started again for the new scroll buffer size to take effect.

Working with directories

Stata works in a similar fashion to DOS or Unix with directories.

    cd "C:\Cohen Center\BRI"
    mkdir BRI20
    cd BRI20

If you are in the correct directory, you do not need to specify the full file path. Hence, instead of:

    use "C:\Cohen Center\BRI\BRI20\mydata.dta", clear

You can simply specify:

    use mydata, clear

The .dta extension is assumed and need not be specified. (A path that contains spaces must be enclosed in quotation marks.)

Files in the working directory can be listed:

    dir

Stata can also erase files. Unlike use, erase needs the file extension:

    erase mydata.dta

This can be useful in situations where it is necessary to create temporary files (there is another way of doing this, tempfile, but it is most useful when creating commands).

Versions

Stata syntax changes from version to version. Generally, this isn't a problem, being limited to relatively obscure areas. Occasionally, though, this affects analyses, causing strange error messages to appear. This is easily solved. Stata is smart enough to interpret commands written for an earlier version of Stata. All this requires is a statement near the beginning of the .do file that gives the version of Stata the file was written for:

    version 11.1

Be aware, though, that Stata usually changes syntax to facilitate greater functionality. Stata's survey commands prior to Stata 9 didn't allow as many options for defining the characteristics of a complex survey sample. Consequently, while Stata 8 commands will still run on later versions (provided the version command is used), they may estimate variance less accurately than if rewritten for version 10.0 or later. The merge commands also changed between 10.1 and 11.

Running .do files within .do files or the command dialog

.do files can be run inside another .do file or from the command dialog provided one is in the correct folder (see Working with directories, p. 2):

    do mydofile

This was necessary in Stata 10 and before when there was a maximum number of lines for a .do file in the .do file editor.
This is no longer the case in Stata 11, but running one .do file from another may still be of use if there are modular segments of identical code that need to be run at multiple points in a file.

While I'm well aware that many PCs run Stata too slowly to rerun the entire .do file as needed, this problem will eventually be addressed by Moore's Law or (when I win the lottery) the Jodi and Benjamin Phillips Fund for Ridiculous Computing Initiatives. When it is, running the entire file is good practice because it avoids the common problem of having the .do file blow up at a certain point because we have been tinkering with the file and running it piecemeal.

Comments

A well-written .do file will have considerable commentary outlining what is being done, how it is being achieved, and why this is necessary. There are two types of comments: those that constitute a line in themselves and those that can be written in the middle of a command. To write a comment on a line, simply preface it with an asterisk. You can add more asterisks and finish with an asterisk or not, depending on your preferences. It doesn't matter, as anything on the line after the initial asterisk is disregarded. As soon as you type a carriage return, though, the next line will be considered part of the program unless it, too, is preceded by an asterisk. (Note that you can put spaces and tabs before the first asterisk, allowing one to create bullet-point lists of comments.) In some cases, it might be useful to make comments within a command. Stata will stop paying attention as soon as it reaches /*. It will not pay attention again until it reaches */. Anything in between will be ignored, even if it stretches across multiple lines with many carriage returns.
Conversely, a /* */ comment can appear in the middle of a command without disrupting the command itself.

    * Here is a comment that must go on one line
    /* Here is a comment
    that covers several lines
    now it is over */
    tab vara varb /* Comment at the end of the line */
    tab /* comments */ vara /* in the middle */ varb /* are confusing but syntactically acceptable */, col

Breaking long lines

Stata will accept very long lines of code. Unfortunately, this means that the entire line won't be visible at once in the text editor and will break up in an ugly fashion in the display window and log files. The simplest way to break a line is ///, which tells Stata to ignore the carriage return (which normally tells Stata that the command, whatever it is, is finished and should be executed). You can also use the comment indicator, opening the comment at the end of one line and closing it at the start of the next:

    reg vary varx1 varx2 varx3 varx4 varx5 varx6 varx7 /*
    */ varx8 varx9

An alternative (which I'm not fond of) is to use the #delim command, assigning a semicolon as the end-of-command marker (note that periods can't be used), e.g.,

    #delim ;
    reg vary varx1 varx2 varx3 varx4 varx5 varx6 varx7
        varx8 varx9 ;
    #delim cr

The last statement returns the delimiter to the default carriage return. The only options are the semicolon or the carriage return.

Avoiding errors

While the fact that Stata stops as soon as it hits an error may be useful, there are times when what Stata regards as an error and what we would regard as an error diverge. Let's say we've been working with a file that defines some value labels and we switch to another dataset which creates value labels of the same name. This will bring Stata to a crashing halt. We could specify label drop mylabel, but that is (a) a pain in the neck and (b) will cause the .do file to crash if no label of that name has been defined yet. This can be avoided by using the capture prefix.
Hence:

    capture label define mylabel 0 "No" 1 "Yes"

Capture refers to Stata capturing the error message. Note that if mylabel already exists, the old definition is kept; when the new definition must win, use capture label drop mylabel followed by an ordinary label define.

Renaming variables

At times it is necessary to rename variables; this is simply done with rename. If you wish to rename variables with prefixes (for instance, changing w09* to w1*), you can use the renpfix command.

Changing variable order

Stata can change the order in which the variables appear in a file. The order command sends the variables one specifies, in the order one specifies, to the front of the dataset. Any variables not included in the varlist of an order command appear in their original order immediately after the last specified variable in the varlist. Thanks to Michelle for finding this command.

Computing variables with egen

Stata's generate (usually shortened to gen) only handles simple mathematical operations like addition, subtraction, multiplication, division, exponentiation, and logarithms. While you can do a lot with these, there's an additional command called egen that offers functions that work across multiple cases or multiple variables. These include calculating means, medians, sums (called total, not sum, for reasons I don't understand), minimums, maximums, and so on. Before leaping in, though, be aware that the default mode for egen is operations across cases within a single variable. Thus egen xbar=mean(x) will create a new variable (xbar; i.e., x̄) that will be identical for every case, containing the mean of the variable x. Functions that work across variables within a case have names beginning with row. Thus, the within-case sum of a group of variables x1 x2 x3 will be egen sumx=rowtotal(x1 x2 x3), which could be simplified to egen sumx=rowtotal(x1-x3) if the variables were located next to one another in the dataset.

Macros

Stata has a macro facility that can record arbitrary strings of characters. This can be useful for situations where one wants to have blocks of text that can be easily substituted in instead of having to be retyped or copied.
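A minimal sketch of this kind of substitution, using a local macro (the form discussed in this section). It assumes the auto dataset that ships with Stata rather than any of the survey files mentioned here, and the macro name controls is hypothetical.

```stata
sysuse auto, clear
* store a block of text once ...
local controls weight length foreign
* ... and substitute it wherever it is needed
summarize `controls'
regress price `controls'
regress mpg `controls'
```

Changing the list of variables then means editing one line instead of three.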
The most useful form of Stata macro for our purposes is a local macro, which must be defined within your .do file. We typically have sets of interrelated dummy variables. Defining these as a macro makes specifying models easier.

    local denom rereform conserve orthodox other
    svy: ologit potrprelgpilg prtrpexprelgpilg landed15 ///
        kdmitzvot prmitzvot `denom', or

Macros can also be useful for complicated statements and so on. Note that local macros include the indexes for foreach or forvalues (see next section). Stata will overwrite previously defined local macros of the same name with these, so use different names.

Macros can also be defined as:

    local macroname = macrocontents

It is recommended, however, that you stick to the form displayed above:

    local macroname macrocontents

This executes faster.

However, if the macro is to hold the result of a mathematical expression, the equal sign is necessary. Hence a program that counts to two and displays it on the screen:

    local y 1
    display `y'
    local y = `y' + 1
    display `y'

After being defined, local macros are referred to as `x' (assuming x is the name of the macro). Note very carefully that the left-hand quote is the backtick from the top-left key of your keyboard, under the tilde (~), immediately to the left of the key for 1. The right-hand quote is the ordinary apostrophe, on the key with the regular quotation mark, immediately left of the Enter key and right of the key for the colon and semicolon.

Advanced macro use

When running .do files from the command line (p. 3) or programs (p. 36), arguments after do myfile get entered as macros `1', `2', etc. These can then be referred to in the .do file itself. For this trivial .do file, saved as trivial.do:

    tab `1' `2'

Thus:

    do trivial vara varb

is equivalent to:

    tab vara varb

Obviously, this isn't the sort of thing we would want to use on an everyday basis, but it could be helpful in certain complicated programming situations.

Globals

Locals are only one kind of macro. There are also global macros, which are ever-present.
While one can add new global macros, this is not recommended. One neat global macro is $S_DATE, which contains the current date. Thus, to save a file with today's date (the quotation marks are needed because the file name contains spaces):

    save "myfile $S_DATE.dta", replace

Take care with this, though. The format is very specific: dd Mmm yyyy. The date is separated from myfile by the space typed in the command and, in addition, if dd < 10, there will be another space in place of the first d. Month, of course, is the first three letters with the first being capitalized. Thus:

    June 8, 2008    becomes     8 Jun 2008
    May 22, 1975    becomes    22 May 1975

I haven't tried years < 1000 or > 9999 but as the date is drawn from your system clock, it is unlikely that you will have this problem. (If you're reading this in 10000 CE, you're probably up to speed on this, given the Y10K bug.)

Looping (foreach, forvalues, and while)

Stata supports looping and makes it very easy. There are three primary kinds of loops: foreach loops through strings of text, forvalues loops through numbers, and while loops for as long as a logical condition holds. There are several simple rules to remember. First, after writing the specifications of the loop, you have to put a left-hand brace { at the end of the line (i.e., immediately before the carriage return). It is good practice to then indent the lines of code that run within the loop (though the loop will run fine if you don't indent). Second, the loop is closed when it reaches a right-hand brace } on a line by itself. I like to keep this at the same level of indentation as the rest of the loop, but others may put the right-hand brace unindented. Third, you need an index for the loop. In the examples below I use x for text strings and n for numbers, but these can be any letters (and more than one letter) you find convenient. They can even be the same as the names of variables, but it is probably best to avoid the confusion this may cause. The index is declared at the beginning of the loop. The index is a local macro, so be sure not to give your index the name of a macro you are using (or will use at a later point).
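The three rules above can be sketched on a dataset anyone can load. This assumes the auto dataset shipped with Stata, not the survey data used in the rest of this section, and simply prints the index on each pass so its behavior is visible.

```stata
sysuse auto, clear
* foreach: the index x takes each word of the list in turn
foreach x in price mpg weight {
    display as text "now summarizing: " as result "`x'"
    summarize `x'
    }
* forvalues: the index n takes each number in the range in turn
forvalues n=1/3 {
    display as text "n is " as result `n'
    }
```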
Here is a loop over text strings:

    foreach x in shabcan shabmeal mitzvot {
        svy: ologit po`x' pr`x' landed, or
        }

And here is a loop over values:

    forvalues n=1/4 {
        svy, subpop(if region==`n'): mean age
        }

Note that Stata differentiates between mathematical equalities (=) and logical equalities (==). Here the equality in the forvalues statement is mathematical while the equality in the if qualifier is logical. Stata will throw error messages if you confuse one with the other.

If you want to loop over nonconsecutive, unevenly spaced numbers like 1, 3, 5, 6, and 9, you would enter these into foreach, as in foreach n in 1 3 5 6 9. To loop over evenly spaced numbers, forvalues should be specified as forvalues n=2(2)10, which would yield the sequence 2 4 6 8 10.

One can run loops within loops:

    foreach x in shabcan shabmeal mitzvot {
        forvalues n=1/5 {
            svy, subpop(if denom==`n'): ologit po`x' ///
                pr`x' landed, or
            }
        }

One small issue with running large loops or sets of loops, particularly for analysis commands, is that it can be difficult to keep track of what each piece of output represents. This can be solved by getting Stata to state which variable is being run under which conditions using the display command. The as text option (as txt for short), discovered by Michelle, ensures it displays nicely. You can also precede variable output with as result to conform to Stata's usual color scheme and use _newline to force new lines. Here is the previous example:

    foreach x in shabcan shabmeal mitzvot {
        forvalues n=1/5 {
            display _newline as result ///
                "`x'" as text " if denom==" as result `n'
            svy, subpop(if denom==`n'): ologit po`x' ///
                pr`x' landed, or
            }
        }

For shabmeal and denom=3 this would display a blank line followed by:

    shabmeal if denom==3

Of course, loops can also be very helpful in data manipulation, not just analysis.
Here we Z-score a group of variables (Stata has a user-written command called zscore that will do this, but we'll ignore it for the present):

    foreach x in busguide busgroup busmifgash buslearn {
        quietly summarize `x'
        gen z`x'=(`x'-r(mean))/r(sd)
        }

An excursus on silence and system variables

What on earth are quietly summarize, r(mean), and r(sd)? First, quietly tells Stata to suppress output. Generally, you don't want to do this, but it minimizes clutter in instances where you want to run a command but don't need the output. A block of commands can be set to quietly, much as one would write a loop:

    quietly {
        command
        command
        }

Within this block, one could always specify quietly's counterpart, noisily (who says computer programmers don't have a sense of humor?), for a given command to see its output.

Second, summarize is an analysis command that reports the number of valid observations, mean, standard deviation, minimum, and maximum. Almost all Stata analysis commands store some of their results in memory. An OLS regression will store R-squared, the coefficients, and so on (type return list and ereturn list to see details). These are replaced when the next analysis command is run. (See help return for details.) As it happens, summarize stores the mean as r(mean) and the standard deviation as r(sd). From there, we simply plug these pieces into the formula for a z-score:

    z = (x - xbar) / s

where xbar is r(mean) and s is r(sd).
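The stored results that summarize leaves behind can be inspected directly. A minimal sketch, assuming the auto dataset shipped with Stata rather than the survey files used above; zmpg is a hypothetical variable name.

```stata
sysuse auto, clear
quietly summarize mpg
* show everything summarize stored, including r(mean) and r(sd)
return list
* plug the stored mean and standard deviation into the z-score formula
gen zmpg=(mpg-r(mean))/r(sd)
* the z-scored variable should have mean 0 and standard deviation 1,
* up to rounding
summarize zmpg
```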

Looping using while

An alternative means of looping through values is while. In this instance, the index serves as a counter and the loop continues for as long as the logical condition holds. Note that this can lead to loops of infinite length if the logical condition is not set properly. Here is a loop to assign a value for the last cohort a given case is associated with using forvalues:

    gen lastround=.
    forvalues n=1/18 {
        replace lastround=`n' if round`n'==1&qualified`n'==1
        }

Here it is using while (after gen lastround=. as before), counting down from 18 and filling in only cases that do not yet have a value:

    local i 18
    while (`i' > 0) {
        replace lastround=`i' if round`i'==1&qualified`i'==1&lastround==.
        quietly count if lastround==.
        if (r(N)==0) continue, break
        local i = `i' - 1
        }

That is to say, each case keeps the first value it receives on the way down (its last round), and once no case has a missing value the loop stops. Because the if command looks only at the first case (see The if and else commands), it cannot end the loop case by case; count is used to check the whole dataset instead, and continue, break leaves the loop. In my case, going forwards through all 18 possibilities took .64 seconds while going backwards and stopping at the first hit took .58 seconds, so there was a small benefit. (I got the timing by set rmsg on.) Benefits will be greater for very large loops, very large datasets, or very slow computers.

Alternately, I could add a conditional break to a decrementing forvalues loop to achieve the same effect as the while loop:

    forvalues n=18(-1)1 {
        replace lastround=`n' if round`n'==1&qualified`n'==1&lastround==.
        quietly count if lastround==.
        if (r(N)==0) continue, break
        }

Creating sets of dummy variables: the xi command

Creating a set of dummy variables is a common operation in data analysis. Unfortunately, it is an annoying chore and one that goes wrong occasionally. Michelle has found a better alternative in the xi command.
Using this, instead of laboriously coding:

    recode denom (1=1)(2/7=0), gen(orthodox)
    recode denom (2=1)(1 3/7=0), gen(conserv)
    recode denom (3 4=1)(1 2 5/7=0), gen(rereform)
    recode denom (5 6=1)(1/4 7=0), gen(justjew)
    recode denom (7=1)(1/6=0), gen(otherjew)

One could simply code:

    recode denom (1=1)(2=2)(3 4=3)(5 6=4)(7=5), gen(newdenom)
    xi i.newdenom, noomit

The noomit option just means that one variable will be created for each category, compared to the default state where the category with the lowest value (here, Orthodox) is omitted. Of course, some labor is still required if you're going to have a clue what these variables mean:

    rename _Inewdenom_1 orthodox
    rename _Inewdenom_2 conserv
    rename _Inewdenom_3 rereform
    rename _Inewdenom_4 justjew
    rename _Inewdenom_5 otherjew

This could be sped up, too, using loops:

    local i=0
    foreach x in orthodox conserv rereform justjew otherjew {
        local i=`i'+1
        rename _Inewdenom_`i' `x'
        }

xi can be used to create more complicated variables, too. See documentation in the help file.

Using xi in estimation

xi can be used in estimation commands. For instance, the following command:

    reg y x conserv rereform justjew otherjew

could be recast as:

    xi: reg y x i.newdenom

Doing this creates the dummy variables on the fly for use in the analysis; they stay in the dataset under their _I names until the next time xi is run, when they are dropped. The names of these variables follow the logic of variable creation shown above. You could specify noomit after xi, but that will cause problems because a set of dummy variables needs to have one category excluded.

This sounds great, but it's usually more trouble than it's worth. For one thing, you don't get to choose the omitted category. You could work around this, perhaps by recoding denomination so Conservative=1 and Orthodox=2, but that removes some of the labor-saving aspect. Perhaps more problematically, you (yes, you!) will have to remember precisely what _Isomevariable_1 actually represents and type out _Isomevariable_1 (and 2 and 3 and so on) into postestimation commands.
In most cases, you're better off creating new variables and giving them meaningful names.

The if and else commands

These commands look superficially similar to the SPSS do if and else if commands. Unfortunately, where SPSS applies these case by case, so they can be used to branch to account for, say, skip patterns, Stata treats all cases alike. Here is a sample of SPSS syntax:

    do if pocomplete=1.
    + compute dadjew=podadjew.
    else if prcomplete=1.
    + compute dadjew=prdadjew.
    end if.

What we would like to be able to do in Stata is as follows:

    gen dadjew=.
    if pocomplete==1 {
        replace dadjew=podadjew
        }
    else {
        replace dadjew=prdadjew
        }

Note that else doesn't take conditions. What would happen, though, is that Stata evaluates the condition for the first case only: if the first case had completed the post-trip survey, then everyone would have dadjew=podadjew; if the first case had not completed the post-trip survey, every case would have dadjew=prdadjew. We could tell Stata to do this for every case:

    gen dadjew=.
    local n = _N
    forvalues i = 1/`n' {
        if pocomplete[`i']==1 {
            replace dadjew=podadjew in `i'
            }
        else {
            replace dadjew=prdadjew in `i'
            }
        }

However, it would be a lot easier to simply do:

    gen dadjew=podadjew if pocomplete==1
    replace dadjew=prdadjew if prcomplete==1

Or better yet:

    gen dadjew=.
    foreach x in po pr {
        replace dadjew=`x'dadjew if `x'complete==1
        }

Either of the latter two options would also run faster, because Stata executes this on the entire dataset at once, not case by case.

Enthusiastic as I am about Stata, the if command is not very useful in most instances and is aimed at people writing new commands. It would be great if there were a parallel to the SPSS commands, but as far as I know there isn't.

Case order variables, sorting, and cross-case functions

SPSS has $casenum, a system variable that contains a unique positive integer for each case from 1 to n. This can be used to save the original order of cases prior to sorting. Stata has a similar system variable: _n.
Hence, the original order of cases can be saved to a variable as follows:

    gen sortorder=_n

An excursus on sorting

One might think that when a sort command is issued, Stata will keep the relative order of cases within each sort category. Thus, if we sorted on sex, we would expect case 1 to remain ahead of case 3 among men and case 2 to remain ahead of case 4 among women. Not so! When sorting, Stata randomizes the order of cases within a given sort category. In general, this should cause no difficulty. If, however, there is a tacit assumption that the order within each sorting category is retained, there will be problems (I've spent days sorting out the messes this has created in sampling). This can be solved by saving the original order as above and then using sort sex sortorder (or by specifying sort's stable option). If you are setting up a stratified random sample and require reproducibility, this can be solved by setting the seed of the random number generator ahead of the sort (e.g., set seed 1000). When one needs to sort in descending order, the sort command will not work; instead it is necessary to use gsort. The syntax is gsort sex -age, where a minus sign before a variable sorts it in descending order and a plus sign (or no sign) sorts it in ascending order.

Lags and leads

Unlike SPSS's $casenum, _n can also be used for lags and leads (cross-case comparisons within a single variable). Here, _n is appended to a variable inside square brackets to indicate a particular case. Hence, sex[3] refers to the sex of the third case, while sex[_n] refers to the sex of the current case. SPSS has a function called lag that can be used to the same ends. For instance, a variable identifying duplicate cases (though see the duplicates command below) could be constructed as:

    sort briusaid_1
    gen dupe=0
    replace dupe=1 if briusaid_1[_n]==briusaid_1[_n-1]

The lag or lead can be, respectively, backward or forward by an arbitrary number of places by substituting, say, +1 or -2 for the -1 in the above example. Note that the [_n] on the left-hand side of the logical equality is unnecessary.
I include it for the sake of clarity.

If we want to refer consistently to one particular case of the dataset, we put that case's row number in the brackets:

    gen newvar = oldvar[1]

([_N] always refers to the last row in the dataset; _N is also the number of cases in the dataset.)

These subscripts can be combined. For instance, we could reverse the values of oldvar as follows:

    gen newvar = oldvar[_N-_n+1]

(And, no, I didn't think that up myself.)

One can substitute in a variable name and Stata will refer to the row number designated by the value of that variable. Let's say we have a dataset with parents and children as individual cases and ID variables for each child holding the row number of each parent (here _momid and _dadid). I will assume there is a variable called sortorder that makes sure the cases are in the correct order for these operations. To add, say, each parent's denomination as variables to the child's data, we could do as follows:

    sort sortorder
    foreach x in mom dad {
        gen `x'denom=denom[_`x'id]
        }

An extended example from a Stata lecture follows.

By combining _n and _N with explicit indexing, we can produce truly amazing results. (Note the version command at the top of the file. This is needed for Stata 11 and later because this file uses the merge syntax of Stata 10 and before.) For instance, let's assume we have a dataset that contains

    personid    six-digit id number of person
    age         current age
    sex         sex (1=male, 2=female)
    weight      weight (lbs.)
    fatherid    six-digit id number of father (if in data)
    motherid    six-digit id number of mother (if in data)

    version 10
    capture log close
    log using crrel2, replace
    use relation, clear
    sort personid
    by personid: assert _N==1 /* see Exercise 9 */
    gen obsno = _n
    keep personid obsno
    rename personid id
    save mapping, replace
    use relation, clear
    gen id = fatherid
    sort id
    merge id using mapping
    keep if _merge==1 | _merge==3
    rename obsno f_n
    label var f_n "Father's obs. # when sorted"
    drop _merge id
    gen id = motherid
    sort id
    merge id using mapping
    keep if _merge==1 | _merge==3
    rename obsno m_n
    label var m_n "Mother's obs. # when sorted"
    drop _merge id
    sort personid
    save rel2, replace
    erase mapping.dta
    log close
    exit

Then, when I wanted, say, the father's age:

    sort personid /* if not already */
    gen fage = age[f_n]

and, if I wanted the mother's weight:

    gen mweight = weight[m_n]

The duplicates command

Charles correctly points out that my first example in the case order variable section reinvents the wheel. Stata has a built-in command called duplicates that handles just about anything one would like to do regarding duplicate cases. It can report all duplicates (cases with identical values for the variables specified in varlist), report only one example for each group of duplicates, create a new variable identifying duplicate observations, delete duplicates (though caution is advised whenever using powerful commands that don't leave a record of what they dropped), and has powerful controls for how the duplicate report tables are displayed. See help duplicates for details.

The list command

It's often helpful to look at some actual data to aid debugging. One way of going about this is to use the data browser. However, the variables one wants to compare are often far apart. A neat alternative is to use the list command, which will list values onscreen (record them in a log file if you expect a lot of values).
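As a small sketch of the two commands just described, the following flags duplicate combinations and then lists a handful of variables side by side. It assumes the auto dataset shipped with Stata; the choice of rep78 and foreign as the variables defining a duplicate, and the variable name dup, are purely illustrative.

```stata
sysuse auto, clear
* how many observations share the same values of rep78 and foreign?
duplicates report rep78 foreign
* dup counts the other cases sharing the same values (0 if unique)
duplicates tag rep78 foreign, generate(dup)
* compare a few variables that are far apart in the data browser
list make rep78 foreign dup if price>10000
```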
Here is a potential sequence of commands for finding and checking dupes in a BRI file (but see the duplicates command, above).

    log using dupecheck, replace text
    sort briusaid_1
    list briusaid_1 idmain idpanelmain if ///
        briusaid_1[_n]==briusaid_1[_n-1]| ///
        briusaid_1[_n]==briusaid_1[_n+1], clean noobs
    log close

Note the use of the if qualifier to limit the number of cases displayed and the use of both a lag and a lead to ensure that both dupes are shown. clean and noobs respectively get rid of the frames around the items displayed and suppress observation numbers.

list can also be used to quickly list answers for all items for a given respondent:

    list if token=="abc1234"

The by command

The by command in Stata is extremely helpful. It produces the same result as forming separate datasets for each unique set of values of varlist and running stata_cmd on each dataset separately. However, the data must first be sorted by the varlist used. This can be used for analysis:

    sort denomination
    by denomination: tab poshabcan prshabcan, col

If _n and _N are used with a by command, they refer to position and number of cases within each by group. Here by is used for data manipulation, creating bus averages for the bus guide scale (egen will not take an expression after the function, so the division is done in a second step):

    sort groupname
    by groupname: egen sumbusguide=total(busguide)
    by groupname: gen mnbusguide=sumbusguide/_N

The only thing to watch out for is that this will divide the sum of the values of bus guide within a bus by the total number of people on that bus, which will be problematic if we don't have a response from each person. Of course, it is easier to simply do:

    sort groupname
    by groupname: egen mnbusguide=mean(busguide)

Data verification

Stata has a command called assert. This is followed by a logical expression. If the logical expression is contradicted by any case, Stata will stop with an error message. Hence, looking for out-of-range values for an opinion question:

    assert prtripfree>=1&prtripfree=1&prtripfree