Post on 19-Dec-2015
Shortcuts to DDI
Markup automation tools and methods that will save you time and effort, and are fun to use!
One rule of thumb:
Select and combine strategies for conversion appropriate for your available sources / study documentation.
Different sources will give you different parts of the DDI.
[Diagram: each source type (spss, sas, stata; text codebook; XML; html; database; Excel; delimited text; osiris, marc, …) supplies different parts of the DDI: study info, categories, question text, locations, frequencies, variable groups.]
Process the different sources and assemble/merge the result
Most common study documentation “combo”:
Statistical package file(s)
Machine-readable codebook and/or questionnaire: ASCII or PDF
Example: ICPSR study no. 3356
Step one: Convert statistical package file(s)
Programs:
1) XCONVERT: free to download at http://sda.berkeley.edu:7502/ddi/tools/, created by the SDA Project, CSM Program, UC Berkeley.
2) Nesstar’s Publisher: commercial software, see http://www.nesstar.com/products/publisher
3) Currently, SPSS and SAS are working on tools to directly export to DDI
XCONVERT converts to DDI:
SPSS dds (syntax)
SAS dds (syntax)
Stata dds (.do + dictionary files)
Resulting DDI markup has no frequencies.
Frequencies may be obtained only when converting to DDI with the SDATOXML program, available to SDA subscribers.
XCONVERT does NOT convert dds for hierarchical data files.
Exercise 1: Convert Stata dds to DDI using XCONVERT
Download XCONVERT to the same folder where you have your Stata dds files.
In a text editor, combine the two Stata dds files (.do and .dct) into one single file that you can save as .txt
Conversion command (run in DOS): xconvert -x stata -i inputfile -o outputfile.xml
Nesstar Publisher converts to DDI:
SPSS dds (syntax) (Merge in raw data file to obtain frequencies)
SPSS portable/export
SPSS system
Stata system (ex.: ICPSR study no. 3740)
DDI obtained from system/portable files will have no column locations.
Nesstar Publisher does NOT import dds for hierarchical data files.
Exercise 2: Convert SPSS dds to DDI using Nesstar Publisher
Edit your SPSS dds: delete comment box, and any other additional lines down to data list.
Make your first line read: DATA LIST/
Remove “comment out” star from missing values section.
Save as .sps
Import into Nesstar Publisher using “File-import” command.
Import ASCII data file using “Data-Insert data matrix from fixed format set” command.
Export DDI, or save in .NSDstat format for further additions.
Step two: Convert PDF documentation to text format
Use xpdf (available from http://www.foolabs.com/)
Command type:
pdftotext -layout infilename outfilename
(Preservation of formatting is NOT guaranteed)
Exercise 3: Convert PDF codebook to text format
Download xpdf program to same folder as your PDF codebook.
Conversion command (run in DOS):
pdftotext -layout infilename outfilename
(-layout option increases chances for preserving regular text format)
Step three: Extract from codebook, and tag in DDI, question text and other relevant variable-level information
For codebooks with regular format, apply text-processing techniques, like macros or regular expressions syntax, in a powerful text editor, like TextPad or emacs.
Make sure your final product is well-formed XML and DDI compliant!!!
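The well-formedness half of that check is easy to script. The sketch below is an illustration only, not part of the original workshop materials: it uses Python's standard XML parser, and the function name is_well_formed is hypothetical. It does not validate against the DDI DTD, which needs a validating parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML.

    This checks syntax only (matching tags, quoting, one root element);
    it does NOT check that the document follows the DDI DTD.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = '<codeBook><dataDscr><var name="V1"/></dataDscr></codeBook>'
bad = '<codeBook><dataDscr><var name="V1"></dataDscr></codeBook>'
```

Here `good` parses, while `bad` fails because its var element is never closed.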
Textpad
Textpad is a powerful plain text editor available from http://www.textpad.com
Cost: $16 - $29, depending on volume
Includes regular expressions search and replace and other nice features
Regular Expressions
Regular expressions are a special syntax that describes patterns in a text. They appear as strings of ordinary characters which take on special meanings.
Regular expressions: examples
.  any single character
[^a]  any single character, except “a”
[0-9]  any single digit
[0-9]{2,4}  any sequence of min. 2 and max. 4 digits
^  beginning of line
$  end of line
+  one or more of the preceding character or expression (* matches zero or more)
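To try these patterns outside a text editor, here is a small sketch using Python's re module. This is an assumption made for illustration: the workshop itself uses TextPad, whose regular expression syntax differs in some details, and the sample codebook line is invented.

```python
import re

# Invented sample line: variable name, label, value range, column locations.
line = "V12 AGE 18-99 cols 101-104"

# [0-9]{2,4}: every run of two to four digits in the line.
digit_runs = re.findall(r"[0-9]{2,4}", line)

# ^ anchors at the beginning of the line; + means "one or more".
var_name = re.search(r"^V[0-9]+", line).group()

# [^0-9] matches any single character that is not a digit.
digits_only = re.sub(r"[^0-9]", "", "a1b2")

# $ anchors at the end of the line.
ends_in_digit = bool(re.search(r"[0-9]$", line))
```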
Exercise 4: Create DDI file containing variable names and question text
Open your .txt codebook in TextPad
Use regular expressions-based commands, and other TextPad special features to:
-Delete unnecessary text
-Attach DDI tags to the appropriate sections of text (Instructions provided)
Insert codebook beginning- and end-tags to create valid DDI.
Save as .xml
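The same delete-and-tag steps can be sketched in a script. The example below rests on assumptions: it supposes a codebook where each variable entry is a line of the form "V1   question text" (a hypothetical layout, since real codebooks vary), and the function name codebook_to_ddi is invented. The output is well-formed, but a fully valid DDI codebook may need further sections such as the study description.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical layout: variable name, whitespace, then the question text.
VAR_LINE = re.compile(r"^(V[0-9]+)\s+(.+)$")


def codebook_to_ddi(text: str) -> str:
    """Wrap each variable line in DDI var/qstn/qstnLit tags."""
    vars_xml = []
    for line in text.splitlines():
        match = VAR_LINE.match(line.strip())
        if not match:
            continue  # "delete unnecessary text": skip non-variable lines
        name, question = match.groups()
        # escape() and quoteattr() protect characters such as < and &.
        vars_xml.append(
            f"<var name={quoteattr(name)}>"
            f"<qstn><qstnLit>{escape(question)}</qstnLit></qstn></var>"
        )
    # Beginning- and end-tags around the variable descriptions.
    body = "\n".join(vars_xml)
    return f"<codeBook><dataDscr>\n{body}\n</dataDscr></codeBook>"


sample = "CODEBOOK FOR STUDY\nV1  What is your age?\nV2  Income < 10?"
ddi = codebook_to_ddi(sample)
```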
Step three (continued): Create variable groups
Use Nesstar Publisher’s “Variable Groups” feature.
OR,
Use SDA’s VARGROUP script to produce DDI markup.
(A word of warning! If using SDA’s VARGROUP, replace commas with spaces in the DDI output file, as commas are not allowed in attributes!)
Exercise 5: Create DDI markup for variable groups using SDA’s VARGROUP
Open your .txt codebook in TextPad.
Use regular expressions-based commands, and other special TextPad features, to produce input file for VARGROUP script (instructions provided).
Download VARGROUP program to same folder as your input file.
Conversion command (run in DOS): vargroup -i inputfile
In TextPad, replace commas with spaces in the DDI output file.
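The last step can also be scripted so that commas in ordinary text, such as group labels, are left alone. This is a sketch under assumptions: the function name is hypothetical, attribute values are assumed to be double-quoted, and the sample varGrp line is invented rather than taken from real VARGROUP output.

```python
import re

TAG = re.compile(r"<[^<>]+>")    # one start or end tag
QUOTED = re.compile(r'"[^"]*"')  # one double-quoted attribute value


def commas_to_spaces_in_attrs(xml_text: str) -> str:
    """Replace commas with single spaces inside attribute values only."""

    def fix_tag(tag_match):
        # Within a tag, rewrite each quoted value; element text is untouched.
        return QUOTED.sub(
            lambda q: re.sub(r"\s*,\s*", " ", q.group()), tag_match.group()
        )

    return TAG.sub(fix_tag, xml_text)


before = '<varGrp var="V1,V2, V3"><labl>Age, sex</labl></varGrp>'
after = commas_to_spaces_in_attrs(before)
```

A plain search-and-replace over the whole file, as in the exercise, would also change the comma in "Age, sex"; limiting the replacement to tags avoids that.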
Step four: Merge or combine DDI files to generate information-rich codebook
To combine (attach new sections): Use XML- or text-editing software to insert new sections in the appropriate sequence (but beware of producing invalid documents!).
To merge: Use Nesstar Publisher or XSLT.
Nesstar Publisher’s merge feature
Will merge in:
Entire sections of the DDI.
Individual fields within each section.
Using this feature will enable you to write in newly added tags or overlay tags that already have content.
Key for merges is <var name="">
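To make the idea of a merge keyed on the var name attribute concrete, here is a rough sketch using Python's standard library. It is not how Nesstar Publisher works internally, which the source does not describe. The function name merge_vars and the sample documents are invented, and the result is not checked against the element order the DDI DTD expects.

```python
import xml.etree.ElementTree as ET


def merge_vars(base_xml: str, extra_xml: str) -> str:
    """Copy child elements of matching <var> elements from extra into base."""
    base = ET.fromstring(base_xml)
    extra = ET.fromstring(extra_xml)
    by_name = {v.get("name"): v for v in base.iter("var")}
    for src in extra.iter("var"):
        dst = by_name.get(src.get("name"))
        if dst is None:
            continue  # no variable with this name in the base file
        incoming = {child.tag for child in src}
        # Overlay: drop existing children that the extra file also supplies.
        for old in [c for c in dst if c.tag in incoming]:
            dst.remove(old)
        dst.extend(list(src))  # write in the new tags
    return ET.tostring(base, encoding="unicode")


base_doc = (
    '<codeBook><dataDscr><var name="V1"><labl>Age</labl></var>'
    '<var name="V2"><labl>Sex</labl></var></dataDscr></codeBook>'
)
extra_doc = (
    '<codeBook><dataDscr><var name="V1"><qstn><qstnLit>How old are you?'
    "</qstnLit></qstn></var></dataDscr></codeBook>"
)
merged = merge_vars(base_doc, extra_doc)
```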
Exercise 6: Use Nesstar Publisher to merge DDI files documenting different parts of the same study
In Nesstar Publisher, open the saved .NSDstat file (reimporting the DDI will result in loss of frequencies).
Use the “Documentation – Import from DDI” command, to merge in the Question Text file.
Use the same command to merge in an ICPSR catalog record covering Sections 2 (Study Description) and 3 (File Description) of the DDI.
Review
Regular expressions are very powerful and worth your time to learn
XCONVERT can extract DDI variables and categories (but not frequencies)
Nesstar can work directly with statistical data files to extract frequencies
Nesstar can merge DDI information from different sources.
Automation
Approaches to Automation
– PROGRAMMING: Use a programming language such as java, C#, VB, perl, PHP, ColdFusion
– COCOON: Use an XML publishing framework such as Apache Cocoon (PLUG)
– UNIX: Adapt/reuse existing scripts using UNIX (Linux, Mac OS X)-based tools
Automation Recommendations
Use UNIX to glue existing scripts together
Use XSLT
Use Cocoon or scripts to process XML
Code new functionality as necessary, with command-line wrappers
[Diagram labels: DDI; Scripts; UNIX; XSLT; Cocoon; XSLT; IN; OUT]
Survey of DDI and XML Tools
Tool                              | Platforms     | Sources                               | Results              | License*
SDA’s XCONVERT, VARGROUP          | UNIX, Windows | Stat package files (SPSS, SAS, Stata) | DDI (no frequencies) | free
Oracle XML Developer’s Kit (XDK)  | UNIX, Windows | XML, XSLT                             | any                  | free
DDI_DTD.cif                       | Blaise        | Blaise                                | “xml”                | free
MSXML 4.0                         | Windows       | XML, XSLT                             | any                  | free
GESIS spssoms2ddi                 | XSLT          | SPSS OMS XML                          | DDI                  | GNU
HTML Tidy                         | UNIX, Windows | Badly formed html                     | xhtml                | open
* Check licensing terms
How do I use XSLT stylesheets?
Browser (IE and Mozilla)
Programming language (many libraries and APIs)
Server (Xalan, Xerces, xt, Saxon)
Apache Cocoon
Command line (Oracle XDK or MSXML 4.0)
Windows shortcut
Automation Exercise 1
Apply an xslt stylesheet in various ways
Open the folder “xslt” and follow the instructions in “oraxsl lesson.txt”
XSLT advantages
When the source is XML, XSLT can output to XML, text, pdf, even jpeg
This might be done directly, or possibly via an intermediate format and a conversion tool/library such as html2pdf, fop
Cocoon has a large number of such libraries built in
XSLT stylesheets can be reused in java, C#, perl, PHP, ColdFusion.
XSLT stylesheets are easier to modify if the xml changes or needs to be parsed differently
XSLT drawbacks
Not in typical skillset: functional programming is different from OO and procedural
Memory hog: the entire document is loaded into memory and expanded
– Doc size/content ratio = 20+
– Solutions:
    Preprocess using streaming parser
    Allot more memory
    – java -Xms<min_size> -Xmx<max_size>
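As an illustration of the "preprocess using streaming parser" option, the sketch below reads a DDI file one element at a time instead of loading the whole tree. It uses Python's standard iterparse as an assumed stand-in, since the source does not name a specific streaming parser, and the function name and sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_var_names(source):
    """Yield the name of each <var> element as soon as it has been read.

    source may be a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "var":
            yield elem.get("name")
            # Discard the finished element's children and text to save memory.
            elem.clear()


sample = io.BytesIO(
    b'<codeBook><dataDscr><var name="V1"/>'
    b'<var name="V2"><labl>Sex</labl></var></dataDscr></codeBook>'
)
names = list(iter_var_names(sample))
```

A preprocessing step like this can split or trim a large file so that the part handed to the XSLT processor fits in memory.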
A Survey of UNIX Tools
UNIX Text Processing Tools
– sed, awk, tr, cut, head, …
Pipes
– Allows the results of one command to be sent to another
UNIX batch commands
– ls, grep, xargs
UNIX scheduling
– cron
Introduction to sed
Sed performs line-by-line substitutions using regular expressions
sed -f commandsfile sourcefile > destinationfile
Automation Exercise 2
We’ll use sed to duplicate the functionality of a textpad macro we created previously
Open the folder “sed” and follow the instructions in “sed lesson.txt”
WARNING 1: sed’s regular expressions are slightly different from textpad’s
WARNING 2: sed by default processes line-by-line
Sed is available on all unix systems. See “README_download_instructions” for windows machines
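For readers without sed at hand, the line-by-line substitution model can be imitated in a few lines of Python. This is a sketch of the idea only, not a sed replacement. The two commands are invented examples, and unlike a plain sed s/// command, each substitution here replaces every match on the line (like sed's g flag).

```python
import re


def sed_like(lines, commands):
    """Apply each (pattern, replacement) pair, in order, to every line."""
    compiled = [(re.compile(pattern), repl) for pattern, repl in commands]
    for line in lines:
        for pattern, repl in compiled:
            line = pattern.sub(repl, line)
        yield line


# Invented commands, playing the role of the sed commands file.
commands = [
    (r"^\s+", ""),               # strip leading blanks
    (r"^(V[0-9]+)\s+", r"\1|"),  # put a delimiter after the variable name
]
result = list(sed_like(["  V1   Age of respondent", "no variable here"], commands))
```

Because each line is handled on its own, a pattern can never match across a line break, which is the point of WARNING 2.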
Sources Review
[Diagram: the sources (spss, sas, stata; text codebook; XML; html; database/Excel; delimited text; CAI, Blaise; osiris, marc, …) are processed with textpad to produce DDI.]
The functionality of textpad on windows can be replaced by sed or awk on UNIX
Automation
Translating manual steps to automated steps
Sources Review
[Diagram: the sources (spss, sas, stata; text codebook; XML; html; database/Excel; delimited text; CAI, Blaise; osiris, marc, …) reach DDI through pdf2text, Textpad/sed and xconvert.]
Automation Exercise 3
Hooking things together with pipes (or files)
Open the folder “automate” and follow the instructions in “automate lesson.txt”
Batch processing with ls, sed, grep, and xargs
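The batch pattern of listing files, filtering them and running a command on each can also be sketched in Python, for cases where the UNIX tools are not available. The function name batch_convert and the default file pattern are hypothetical. The convert argument stands for whatever per-file step you have already written.

```python
from pathlib import Path


def batch_convert(folder, convert, pattern="*.txt", suffix=".xml"):
    """Run convert(text) on every matching file and save the result beside it.

    Returns the names of the files written, in sorted order.
    """
    written = []
    for src in sorted(Path(folder).glob(pattern)):   # like ls | grep
        dst = src.with_suffix(suffix)                # study1.txt -> study1.xml
        text = src.read_text(encoding="utf-8")
        dst.write_text(convert(text), encoding="utf-8")  # like xargs cmd
        written.append(dst.name)
    return written
```

A call such as batch_convert("codebooks", my_step) assumes a folder named codebooks and a function my_step, both placeholders.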
Advice for Batch Processing
Use a consistent naming convention
Identify the driving files
Schedule using cron
Sources for Automation
Not every process is suited for automation
A process may be partially automated
Sources which are formatted in a regular manner are ideal for automation
– Database output
– Excel spreadsheets
– Delimited text
– Machine-generated output
Make use of intermediate formats
Choosing an intermediate regular format that already has scripts/tools written for it can simplify your work.
Candidates:
– Delimited text
– Xml
– Html
– Proprietary format (SDA’s DDL, SPSS’s __)
Using the Intermediate Format Strategy: Example 1
Gesis spssoms2ddi is an example of using the intermediate format strategy
[Diagram: SPSS file → (study_oms.spss) → SPSS OMS XML → (spssoms2ddi stylesheet) → DDI]
This is an example of doing it the right way: SPSS outputs proper XML according to a schema
Using the Intermediate Format Strategy: Example 2
XCONVERT does not output frequencies
SAS ODS command wrapper displays output as (badly formed) html tables
[Diagram: SAS → ODS → HTML → tidy → xhtml → xslt → DDI frequencies. A second path is labelled xslt, delimited text, sqlldr, Oracle.]
SAS ODS
SAS ODS is able to output its results as html instead of a .lst or .rtf file
Just wrap your run statement
ODS html file="result.htm";
/* your sas code … */
proc print data=new; run;
ODS html close;
SAS ODS HTML output
bad html: verbose, mismatched nesting
Show example
Xslt cannot be applied directly to this output
Use HTML tidy (open source) to clean this bad html before applying xslt style sheets
tidy options sourcefile > resultfile
HTML tidy is built into Apache Cocoon
Automation Exercise 4
HTML Tidy allows you to deal with the badly formed xml/html that naturally occurs in the real world
Open the folder “tidy” and follow the instructions in “tidy lesson.txt”
Sources
[Diagram: the sources (spss, sas, stata; text codebook; XML; html; database/Excel; delimited text; CAI, Blaise; osiris, marc, …) reach DDI through pdf2text, sed, xconvert, ODS, HTML tidy, and oraxsl + stylesheet.]
Database sources
Use intermediate formats such as xml or html
Some databases can output directly to “xml” or “html”, but delimited text is fine
Usually, the “xml” output needs to be cleaned by HTML tidy
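When a database export arrives as delimited text, turning it into DDI variable markup takes only a short script. The sketch below assumes a tab-delimited export with exactly two columns, variable name and label. That layout and the function name delimited_to_vars are hypothetical, and rows with a different number of columns are skipped.

```python
import csv
import io
from xml.sax.saxutils import escape, quoteattr


def delimited_to_vars(text: str, delimiter: str = "\t") -> str:
    """Turn rows of (name, label) into DDI <var> elements, one per line."""
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    return "\n".join(
        f"<var name={quoteattr(row[0])}><labl>{escape(row[1])}</labl></var>"
        for row in rows
        if len(row) == 2  # ignore blank or malformed rows
    )


export = "V1\tAge\nV2\tIncome & wealth\n"
var_markup = delimited_to_vars(export)
```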
Excel as an editing/automation tool
Excel can read/write delimited text
Excel can read html
Excel has macros
Excel rowset demo/exercise
Sources & Destinations
[Diagram: sources (spss, sas, stata; text codebook; XML; html; database; Excel; delimited text; osiris, marc, …) are converted to DDI, and XSLT turns DDI into destinations (spss, sas, stata; text codebook; XML; html; database/Excel; delimited text; CAI, Blaise; osiris, marc, …).]
DDI to MARC
Sometimes, XSLT will only get you 99% of the way
MARC output requires control characters which are illegal in XML/XSLT
Strategy 1: output substitute characters, then use tr or sed to replace them with the control characters
For a single study:
oraxsl 06084.xml 00.xsl temp1.xml
oraxsl temp1.xml 00.xsl temp2.xml
oraxsl temp2.xml 00.xsl temp3.xml
oraxsl temp3.xml 00.xsl temp4.txt
sed -f restoreIllChars.sed temp4.txt > 06084.marc
As a script taking the study number as $1:
oraxsl $1.xml 00.xsl temp1.xml
oraxsl temp1.xml 00.xsl temp2.xml
oraxsl temp2.xml 00.xsl temp3.xml
oraxsl temp3.xml 00.xsl temp4.txt
sed -f restoreIllChars.sed temp4.txt > $1.marc
rm -f temp?.xml temp4.txt
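The restore step can be pictured as a simple token replacement. The sketch below is illustrative only. The contents of restoreIllChars.sed are not shown in the source, so the placeholder tokens {US}, {RS} and {GS} are invented. The three control characters themselves are the standard MARC 21 subfield delimiter, field terminator and record terminator.

```python
# MARC 21 structural control characters; XML 1.0 does not allow them.
SUBFIELD_DELIMITER = "\x1f"
FIELD_TERMINATOR = "\x1e"
RECORD_TERMINATOR = "\x1d"

# Invented placeholder tokens that a stylesheet could emit instead.
PLACEHOLDERS = {
    "{US}": SUBFIELD_DELIMITER,
    "{RS}": FIELD_TERMINATOR,
    "{GS}": RECORD_TERMINATOR,
}


def restore_control_chars(text: str) -> str:
    """Swap each placeholder token for the control character it stands for."""
    for token, char in PLACEHOLDERS.items():
        text = text.replace(token, char)
    return text


restored = restore_control_chars("{US}aSample title{RS}{GS}")
```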
DDI to Marc
Revised strategy: after working with MARC for a while, we decided that we could make use of existing utilities
– 1. convert DDI to marcxml (with xslt stylesheet written at icpsr) using oraxsl
– 2. convert marcxml to marc21 using marc4j
Marc4j and other marc utilities are available at http://www.loc.gov/marc/marctools.html
Contact info
Sanda Ionescu: [email protected] (icpsr.umich.edu)
I-Lin Kuo (until Aug 18): [email protected] (icpsr.umich.edu)