    Felicity Clemens 18 May 2005

    Data cleaning: hints and tips
    Felicity Clemens
    Stata Users Group meeting, London, 17 & 18th May 2005

    Introduction

    Data cleaning: one of the most time-consuming jobs of all!
    Many ways of attacking the same problem when using Stata
    The talk will describe some common problems and propose possible solutions
    These are mostly reminders!

    Contents

    1) Introduction to the first datasets
    2) Identifying and removing duplicates by hand
    3) Merging data and uses of the merge command
    4) Generating a moving target variable

    The study

    A case-control study carried out across 3 central European countries
    Exposure of interest: exposure to chemicals in the environment
    Outcome of interest: cancer

    Identifying duplicates in a dataset

    This can be done automatically (using the duplicates set of commands)
    We will demonstrate a manual method of identifying duplicates
    Two different possibilities:
        The same data have been entered on more than one occasion;
        Different data have been entered using the same identifier (id numbers)
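The two possibilities can be told apart with by-group counts. What follows is a minimal sketch rather than the method shown in the talk: the example data and the variable names (age, sex, n_id, n_same) are invented for illustration, and it should be run as a do-file.

```stata
* Hypothetical example data: id is the subject identifier
clear
input id age str1 sex
1 34 "M"
2 51 "F"
2 51 "F"
3 47 "M"
3 62 "F"
4 29 "F"
end

* Automatic route: the duplicates set of commands
duplicates report id
duplicates list id

* Manual route: count the records that share each id
bysort id: generate n_id = _N

* Count the records that are identical on every variable
bysort id age sex: generate n_same = _N

* Possibility 1: the same data entered more than once
list id age sex if n_id > 1 & n_same == n_id

* Possibility 2: different data entered under the same id
list id age sex if n_id > 1 & n_same < n_id

* Keep one copy of each exact repeat; conflicting records still need checking by hand
bysort id age sex: drop if _n > 1
```

In this invented dataset, id 2 falls under the first possibility (two identical records) and id 3 under the second (two records that disagree on age and sex).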

    The merge command

    A necessary command in data management of most big studies
    There are many different uses of the merge command. We look at two of them:
        Simple merge on id
        Multiple merge on id
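A rough sketch of the two uses follows, written in the newer merge syntax (Stata 11 and later); a talk given in 2005 would have used the older form, merge id using filename, on data sorted by id. The datasets and variable names (age, dose, visit, conc) are invented, and reading "multiple merge" as a many-to-one merge is an assumption. Run it as a do-file so the temporary files persist between steps.

```stata
* Hypothetical subject-level file: one record per id
clear
input id age
1 34
2 51
3 47
end
tempfile subjects
save `subjects'

* Hypothetical exposure file: one record per id
clear
input id dose
1 0.5
2 1.2
4 0.9
end
tempfile exposure
save `exposure'

* Simple merge on id: each id appears at most once in both files
use `subjects', clear
merge 1:1 id using `exposure'
* _merge: 1 = master only, 2 = using only, 3 = matched
tabulate _merge

* Many-to-one merge on id: several visit records per subject
clear
input id visit conc
1 1 65
1 2 32
2 1 41
end
merge m:1 id using `subjects'
tabulate _merge
```

Tabulating _merge after every merge is a cheap check that the ids matched as expected before any unmatched records are dropped.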

    Identifying a moving target

    Scenario: we have data for each town giving the chemical concentration for each year between 1982 and 2002
    Problem: we need to identify the year, counting backwards from 2002, in which the chemical changed from its 2002 level
    Why? We need to overwrite the 2002 value with a new value, and overwrite backwards until the value changed

    Identifying a moving target (2)

    rescode   y1990  y1991  y1992
    1010113      65     32     32
    1010114      41     41     41
    1010115      78     23     23
    1010116      44     44     44
    1010117      82     82     29
    1010118      25     25     25
    1010119      12     12      6
    1010120      40     12      7

    Identifying a moving target (3)

    We will use the forval loop to examine the relationship between each year's observed value and the observed value for the previous year
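One way such a loop might look, using the three-year excerpt from the previous slide with 1992 standing in for 2002. This is a sketch under stated assumptions, not the code from the talk: the names changeyr and newval, and the replacement value 99, are hypothetical.

```stata
* Excerpt of the town-level data (values taken from the slide)
clear
input long rescode y1990 y1991 y1992
1010113 65 32 32
1010114 41 41 41
1010115 78 23 23
1010116 44 44 44
1010117 82 82 29
1010118 25 25 25
1010119 12 12  6
1010120 40 12  7
end

* changeyr: the most recent year whose value differs from the final-year level
generate changeyr = .

* Step backwards, comparing each year with the year before it
forvalues yr = 1992(-1)1991 {
    local prev = `yr' - 1
    replace changeyr = `prev' if missing(changeyr) & y`yr' != y`prev'
}

* Overwrite the final-year value, and every earlier year still at that level,
* with a hypothetical new value
local newval = 99
forvalues yr = 1990/1992 {
    replace y`yr' = `newval' if `yr' > changeyr | missing(changeyr)
}

list rescode y1990 y1991 y1992 changeyr
```

Towns whose value never changes within the window keep a missing changeyr, and the overwrite step then treats every year in the window as still at the final-year level. For the full dataset the loop bounds would run from 2002 back to 1983.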

    Summary

    Identifying duplicates: can be done by hand or automatically using the duplicates set of commands
    Use of the merge command: to merge on a specific variable, to multiply merge datasets
    Generating a moving target variable: the use of the forval loop