Enhancing Access to Government Information


This article was downloaded by: [Fondren Library, Rice University]
On: 16 November 2014, At: 04:46
Publisher: Routledge
Informa Ltd Registered in England and Wales Registered Number: 1072954
Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Collection Management
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/wcol20

Enhancing Access to Government Information
Patrick Yott (University of Virginia Library, USA)
Published online: 22 Sep 2008.

To cite this article: Patrick Yott (1998) Enhancing Access to Government Information, Collection Management, 23:3, 61-76, DOI: 10.1300/J105v23n03_06

To link to this article: http://dx.doi.org/10.1300/J105v23n03_06


Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content.

This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions


Enhancing Access to Government Information:

Redistribution of Data via the World Wide Web

Patrick Yott

Throughout the 1990s, as executive and congressional agencies have embraced electronic modes of data dissemination, libraries have been receiving an ever-increasing amount of government information in digital form, often acquiring both print and digital versions of the same item. Unfortunately, the enthusiasm for digital production has, at times, led to the disappearance of a printed version altogether, placing increased reliance on the development of adequate technology and technological skills to make the information available. The burden has fallen on libraries to develop mechanisms and procedures to ensure that the information distributed remains useful and accessible. To this end, this article describes the creation and use of Web-based data distribution systems at the University of Virginia Library, the regional depository for Virginia.

The Social Sciences Data Center (SSDC) was inaugurated in 1992 to better integrate data products received through the Federal Depository Library Program (FDLP) and the Inter-University Consortium for Political and Social Research (ICPSR) into the library’s mission and functions. The depository data formed the bulk of the data collections at that time, and still represent a significant portion of the data used by SSDC patrons. These public data also form the core of the Web-based interactive services developed by the center. In 1994 the SSDC announced the availability of the Regional Economic Information System (REIS) data site (http://www.lib.virginia.edu/socsci/reis/index.html), one of the earliest public data extraction sites on the Web, and probably the first such effort undertaken by an academic library. Since then, the SSDC has introduced over a dozen such data utilities to the public, and has constantly worked to improve and expand existing data sites. This distribution approach has been well received by users of every type, including students and professors, local business owners and town planners, as well as the government agencies producing the original data.

Patrick Yott is Social Sciences Data Services Coordinator, University of Virginia Library (e-mail: [email protected]).

[Haworth co-indexing entry note]: “Enhancing Access to Government Information: Redistribution of Data via the World Wide Web.” Yott, Patrick. Co-published simultaneously in Collection Management (The Haworth Press, Inc.) Vol. 23, No. 3, 1998, pp. 61-76; and: Government Information Collections in the Networked Environment: New Issues and Models (ed: Joan F. Cheverie) The Haworth Press, Inc., 1998, pp. 61-76. Single or multiple copies of this article are available for a fee from The Haworth Document Delivery Service [1-800-342-9678, 9:00 a.m. - 5:00 p.m. (EST). E-mail address: getinfo@haworthpressinc.com].

© 1998 by The Haworth Press, Inc. All rights reserved.

What consideration goes into the design and development of a data site? What advantages does this approach offer the library and the data user? What are the nuts and bolts that make this work, and does the end product justify the expense?

ADVANTAGES OF WEB-BASED DISTRIBUTION

The Web is rapidly becoming as ubiquitous as television. As with television, however, the medium’s potential ability to deliver useful and needed information was immediately drowned out by the whir of white noise and commercial static. While the clutter still dwarfs the content (and will probably continue to do so for some time), the Web has quickly become a valuable educational tool. The following excerpt from the 1996 Commencement address by Neil Rudenstine, President of Harvard University, captures the Internet’s potential alliance with research and research libraries:

More fundamentally, there is in fact a very close fit, a critical interlock, between the structures and processes of the Internet, and the main structures and processes of university teaching and learning. That same fit simply did not (and does not) exist with radio, film, or television.

When I say there is a critical interlock or fit here, I mean that students can carry forward their work on the Internet in ways that are similar to, and tightly intertwined with, the traditional ways that they study and learn in libraries, classrooms, discussion groups, laboratories, and other settings.

Let me suggest a few examples of what I mean.

The Internet, as we know it, can provide access to vast sources of information not conveniently obtainable through other means. Let’s assume for the moment that most of the technological problems of the Internet will in time be solved: that there will be, as there are now in the research library system, effective ways of helping users to find what they want; of ensuring quality control; and of creating linkages among different bodies of knowledge in different media.


At that point, the Internet and its successor technologies will have the essential features of a massive library system, where people can roam through the electronic equivalent of book stacks, with assistance from the electronic equivalent of reference librarians. In other words, one major reason why the characteristics of the Internet are so compatible with those of universities, is that some of the Internet's most significant capabilities resemble, and dovetail with, the capabilities of research libraries.¹

So what benefits, in practical terms, does the Web provide as a platform for redistributing government data? Experiences in designing these services for the Social Sciences Data Center suggest five easily distinguishable and important areas that support this enthusiasm.

The Web Bridges and Incorporates Multiple Network Protocols

One of the most powerful features of the Web is its relatively seamless integration of various networking protocols, such as POP (Post Office Protocol, for e-mail), FTP (File Transfer Protocol), CGI (Common Gateway Interface), and, of course, HTTP (HyperText Transfer Protocol). Therefore, it is possible to prepare a single Web site that provides access to documentation (HTTP), selective access to data (CGI and FTP), and, through mail-to tags (POP), direct communication with the site maintainer or data provider. While this may seem obvious to the jaded Web surfer, this is a quantum improvement over the early Internet days of Gopher and SMTP (Simple Mail Transfer Protocol), and enables the site maintainer to create an effective yet simple-to-use data delivery system.

The Web Can Ameliorate the Problem of Multiple Operating Platforms

One of the most common complaints about the distribution of government data (and Census data in particular) is the almost universal disregard for users of non-DOS/Windows operating systems. For example, the 1990 Census was distributed with GO and EXTRACT software that function well in DOS, somewhat well in Windows 3.x and somewhat peculiarly in Windows 95, but will not work in most Macintosh or UNIX environments. A simple solution (employed by the 1990 PUMS files) is to provide ASCII versions of the data, and let the end-user (or the intermediary data specialist) deal with finding suitable software. Redistribution of data via the Web, however, renders this a non-issue. As long as the user has a suitable Web browser (compatible with Netscape 2.0 or higher), the user can work within whichever computing environment he or she is most comfortable.


The Web Provides More Equitable Access to Data

One of the fundamental tenets of the Federal Depository Library Program is that every individual is entitled to equal and unabated access to government information. While this remains an admirable goal, it is unlikely that it has ever been fully realized, and the rapid migration to electronic dissemination makes it an even more difficult goal to attain.

At the University of Virginia government information products receive the same treatment and policies as other digital library materials. Unfortunately this has worked to the detriment of access. For example, the Library does not circulate its CD-ROMs, and the Government Information Resources Section (GIR) does not have enough workstations to accommodate all the digital products received on deposit or through purchase. Given the complexity (and peculiarity) of many of these products, they are not accessible outside a limited range of hours when the librarian designated as the ‘electronic expert’ is available. By repackaging these data and building user-friendly Web interfaces for them, the SSDC can make them available at all times and to users with the broadest range of computer skills.

The Web Provides More Flexible Access to Data

Government agencies often work in isolation with regard to designing and releasing data products. There is little or no evidence of any data distribution standard employed across all agencies. Similarly, there seems to be minimal involvement with potential users in the design process. So when the Census Bureau was ready to begin shipping the 1990 Census data, the decision on how to format the data and what software to include (if any) was already made. Their decision to produce two software packages (GO and EXTRACT) met what they saw as their obligation for making the data usable. GO was designed for the most casual “browser” of data, while EXTRACT lent itself to those who needed to port various data items into a spreadsheet or database package. Neither approach, however, was ideal for the traditional library setting. GO is remarkably slow and limiting, and EXTRACT lacks the intuitive interface required for many first-time users. Similarly, the choice to use dBASE as the file format presented an obstacle in that the format is limited to 128 fields (256 in dBASE 4), and forced the Census to chop files into multiple parts (34 separate files for STF3A).

By migrating Census data to a statistical software package (SPSS for example), it is possible to present all data in a single file, and to move away from the “drill down” methodology employed by many Census products. Furthermore, it is also possible to empower the user by giving him or her more “editorial” control over how the data is to be presented. For example, users can sort the data, create new variables, and explore relationships that the Census Bureau software would not easily accommodate. Again, all this can be done without the user needing any specialized computer or statistical training.

The Web Can Expand and Enrich Curricular Materials

This is a very important point. Since the Web can be used as the intermediary tool between the data user and the data, it can provide powerful access to statistical data (i.e., construction of tables and matrices) to students with limited quantitative and/or programming skills. It is possible (although probably not warranted) to reproduce the full functionality of a statistical package through a carefully designed Web site. And all the user would have to do is fill out some forms and check some boxes. For example, students in a traditional criminology course could use the on-line data extractor built using the Uniform Crime Reports (http://www.lib.virginia.edu/socsci/crime/index.html) to compare crime rates over time and between counties. The statistical programming required to achieve the same result using traditional data distribution means would make this type of information unavailable to most non-quantitative courses, or would place an undue burden on the staff in charge of the source data.

DEMANDS IMPOSED BY DATABASES IN THE CONTEXT OF THE WEB

Successful use of the Web to redistribute government data requires that the program designer manage four basic demands. These demands stem from the fact that the library as a physical space and collection is being somewhat dissolved, and that many traditional “fall-back” techniques used in dealing with governmental data may no longer apply.

The Utility of a Data Set Is Often Dependent upon Textual Documentation

Documentation is a vital component of any data set, and the lack of adequate documentation has been a common complaint of library users. Try to imagine using the 1990 STF3A files without access (either in hard copy or through context-sensitive help in the software) to variable definitions, descriptions of how the samples were constructed, and why and how various universes change between tables. Now try to imagine blindly using the same data through a Web interface without first knowing, for example, that there are 300 data points presented in a “Race by Sex by Age” matrix (table P14). It is not a comforting thought.


While preparing documentation is not the most enjoyable or glamorous aspect of preparing a service, it will, in the long run, determine the service’s ultimate usefulness. For example, the REIS site hosted by the Social Sciences Data Center contains footnotes for every variable group, an overview of each group, as well as a section containing some twenty methodological and procedural documents. Without access to any one of these items the data become less clearly understood, and potentially misleading for the end-user. Furthermore, if there is a downloading option where the user can receive an ASCII file that can be uploaded into a spreadsheet package, there must be documentation provided that is customized to describe all variables and cases included in the ASCII file. Documentation for state and county FIPS (Federal Information Processing Standards) codes is a good example. While the CD-ROM product (and whatever type of database is used for the Web application) contains internal labels for FIPS codes, the ASCII file will most likely contain a two-digit state code and a three-digit county code. Therefore, documentation must be provided to allow the user to convert these codes into meaningful area names.
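The code-to-name conversion described above can be illustrated with a small lookup. This is only a sketch: the function name and the two-entry tables are hypothetical, and a real site would load the complete FIPS code list distributed with the data.

```python
# Hypothetical lookup tables for illustration; a real service would load
# the full state and county FIPS code list that accompanies the data.
STATE_NAMES = {"51": "Virginia"}
COUNTY_NAMES = {
    ("51", "003"): "Albemarle County",
    ("51", "540"): "Charlottesville city",
}


def area_name(state_code: str, county_code: str) -> str:
    """Return a readable area name for a state/county FIPS code pair."""
    # Codes are zero-padded strings, so "3" and "003" are treated alike.
    state = state_code.zfill(2)
    county = county_code.zfill(3)
    state_name = STATE_NAMES.get(state, f"unknown state {state}")
    county_name = COUNTY_NAMES.get((state, county), f"unknown county {county}")
    return f"{county_name}, {state_name}"


print(area_name("51", "003"))
```

Treating the codes as strings instead of integers preserves the leading zeros that a spreadsheet import would otherwise strip.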

Redistribution of Data Forces the Data Specialist to Assume New Roles

The traditional role of the librarian allowed some level of safety in that the librarian was not assumed to have any responsibility for the information in their collections. In fact, it is probably safe to say that there was little assumption of expertise with the information beyond the ability to retrieve it as needed. The existing model was of passive “keepers of information.” This model is significantly altered in the new Web-based redistribution model.

Moving data from a CD-ROM, federal Web site, or other electronic product into a Web-based application represents a fundamental change in the data. By altering the information the institution assumes responsibility for data integrity and for any errors that may have crept into the data during the transformation process. The magnitude of any possible error and the corresponding assumption of responsibility vary with the type of transformations involved. For example, the SSDC has taken the 1990 STF3A files for the local planning district (Charlottesville and five neighboring counties) and has converted the formats from dBASE (as distributed) to Excel. This was a simple and extremely reliable transformation with marginal risk of error, but it allowed students in various urban planning classes to retrieve the data directly through a simple Web page and immediately launch Excel through their browser windows. Other projects, such as merging the approximately 20 files that make up the USA Counties CD-ROM into a single master file for use in a Web resource, have the potential to introduce significant error.
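One way to limit the risk in a multi-file merge is to make the merge step refuse anything ambiguous. The sketch below is not the SSDC's procedure; the function names, the `fips` key, and the sample values are placeholders chosen for illustration. It merges several files on a shared key, stops on duplicate or conflicting values, and reports areas that are missing from some source file.

```python
def merge_on_key(files, key="fips"):
    """Merge lists of row dicts into one record per key value.

    Raises ValueError instead of silently overwriting data.
    """
    master = {}
    for rows in files:
        seen = set()
        for row in rows:
            k = row[key]
            if k in seen:
                raise ValueError(f"duplicate key {k!r} within one file")
            seen.add(k)
            record = master.setdefault(k, {key: k})
            for name, value in row.items():
                if name == key:
                    continue
                if name in record and record[name] != value:
                    raise ValueError(f"conflicting values for {name!r} at {k!r}")
                record[name] = value
    return master


def incomplete_keys(master, n_fields):
    """Keys whose merged record has fewer than n_fields fields."""
    return sorted(k for k, rec in master.items() if len(rec) < n_fields)


# Placeholder values, not real statistics.
population = [{"fips": "51003", "pop": 100}, {"fips": "51540", "pop": 200}]
income = [{"fips": "51003", "income": 10}]
master = merge_on_key([population, income])
print(incomplete_keys(master, n_fields=3))
```

Checks of this kind do not prove the master file is correct, but they turn the most common merge errors into visible failures.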

Aside from the need to aggressively safeguard data integrity, the data specialist assumes a significant responsibility for data currency. Many government data products are constantly updated and the institution, once it has placed the data on the Web, is expected to keep its data site updated. This burden can manifest itself in two ways. First, the general user often fails to understand what qualifies as “current” as it relates to government data. A good example is the County Business Patterns (CBP) data. The most recent year released by the Census Bureau is 1995, but the SSDC regularly receives inquiries as to why it has not updated the data. More stressful, however, is the fact that users will often be aware of a new data release weeks before a copy arrives in the library. This was the case with the REIS database, and forced the SSDC to become a member of the Bureau of Economic Analysis (BEA) User Group to ensure the most immediate delivery of BEA data products.

Finally, the use of the Web as a disintermediation medium places the librarian or site maintainer in the position of having to develop additional expertise with the data. The passive model required limited knowledge of how the data are collected and used. What was required was the ability to classify and retrieve the data product. Users in this new paradigm will generally not distinguish between the role of data creator (the government agency) and data distributor (the library). To successfully anticipate and answer the types of questions inherent in data use, the site maintainer will need to become well versed in all facets of the data set. In fact, successful design of any Web application hinges more strongly upon knowledge of the data than knowledge of statistics and programming.

Data Is Distributed in a Wide Variety of File Types and Formats

Perhaps the most significant characteristic of many government data files is their enormous size. Files from the Current Population Survey (CPS) can exceed 200 megabytes in size. Obviously, significant computing resources are necessary to work with these data. While a data set like the 1990 Census can be conveniently segmented (34 parts for the Summary Tape File 3A data), a file like the CPS does not easily lend itself to this approach.

The Current Population Survey also provides an excellent example of the variability of file types encountered in dealing with large sets of data. The CPS file is a hierarchical file, a file structure that contains records for more than one unit of analysis. Data in the CPS are available at the household, family, and individual level. An easy solution to the problem of file size is to present the data as separate files, one for each unit of analysis. Of course, the more useful approach (especially for advanced researchers) would be to compound the problem by “rectangularizing” the data so that each record would be presented at the individual level, but would also contain data on that individual’s family and household.
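Rectangularizing can be sketched in a few lines. The example assumes a simplified layout that is not the actual CPS record format: each record is a dict with a hypothetical `type` field ("H", "F", or "P"), and records arrive in file order, so a household precedes its families and a family precedes its members.

```python
def rectangularize(records):
    """Flatten household/family/person records into one row per person.

    Assumes records arrive in file order: each household record precedes
    its families, and each family record precedes its members.
    """
    household, family, rows = {}, {}, []
    for rec in records:
        kind = rec["type"]
        fields = {k: v for k, v in rec.items() if k != "type"}
        if kind == "H":
            # A new household resets the current family context.
            household, family = fields, {}
        elif kind == "F":
            family = fields
        elif kind == "P":
            # Each person row carries its family and household context.
            rows.append({**household, **family, **fields})
        else:
            raise ValueError(f"unknown record type {kind!r}")
    return rows


# Invented sample records; field names are hypothetical.
sample = [
    {"type": "H", "hh_id": 1, "region": "South"},
    {"type": "F", "fam_id": 1, "fam_size": 2},
    {"type": "P", "person_id": 1, "age": 40},
    {"type": "P", "person_id": 2, "age": 38},
    {"type": "H", "hh_id": 2, "region": "West"},
    {"type": "F", "fam_id": 1, "fam_size": 1},
    {"type": "P", "person_id": 1, "age": 71},
]
for row in rectangularize(sample):
    print(row)
```

The output has one row per person, which is why the rectangular file is larger than the hierarchical original: household and family fields are repeated for every member.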

In the traditional “here’s the disk and there’s a computer” approach to dealing with government data products the burden of deciphering and dealing with variable formats was left to the user. In the redistribution model, the library assumes this responsibility, but in so doing reduces the burden on the ultimate user by creating a “standard” data format for all products.

USERS, USER NEEDS, AND USER SERVICES

Data sites hosted by the Social Sciences Data Center are designed to meet the needs of three distinct user communities: general public, general academic, and academic research. It is essential to fully understand the needs and skills of each user before any programming begins.

General Public

Non-academic data users have generally fallen into two general categories: business researchers and general ‘fact seekers.’ Business researchers are interested in finding data that they can download and incorporate into their own information systems. For them, it is essential to provide data that is well documented and up-to-date, and to package that data into a format that is easily imported into major spreadsheet or database packages. For example, the Regional Economic Information System provides a comma-delimited file (an ASCII file where each column of data is separated by a comma) with a .CSV extension. Users can, with a modicum of effort and sophistication, configure their browser to launch an Excel session and automatically parse the data into an active spreadsheet. Other software packages can also be configured into the browser.
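A comma-delimited file of this kind is simple to produce, with one caveat: values that themselves contain commas must be quoted. The sketch below uses Python's standard `csv` module; the function name, column names, and values are hypothetical and are not taken from the REIS site.

```python
import csv
import io


def to_csv(rows, fieldnames):
    """Render row dicts as comma-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder values for illustration only.
rows = [
    {"fips": "51003", "area": "Albemarle County, Virginia", "value": 1},
    {"fips": "51540", "area": "Charlottesville city, Virginia", "value": 2},
]
print(to_csv(rows, ["fips", "area", "value"]))
```

The area names are wrapped in double quotes on output, so a spreadsheet reads each one as a single column despite the embedded comma.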

The other component of public data use involves more general fact-seeking behaviors. The types of data these individuals seek are often community level aggregates such as data from the 1990 Census, and various compendia like the County and City Data Book (CCDB) and USA Counties. Although it is likely that many of these users may be loading data into spreadsheets, many more, especially students, are looking for data presented in an attractive and functional format that can be printed out, or incorporated into school reports and other similar projects.

General Academic

General academic users include students in statistics and quantitative methods classes as well as students in classes that may not focus on data but whose research can benefit from access to quantitative data, such as criminology and urban planning. The focus is on the instructional use of data. While this group shares many needs in common with the public user, their needs and the types of data required often are more complex.


The general academic focuses more on statistics than facts. This may manifest itself in needs that range from obtaining data sufficiently complex and interesting to use in a research paper to obtaining statistical output for subsequent analysis and interpretation. For instance, the public user may only want to know the number of crimes in a particular county whereas the academic may need to obtain these crime statistics in conjunction with other data in an effort to explain the causes of, or variations in, criminal activity in certain areas. Similarly, the academic user may need to obtain data that represents a span of time.

Research Level Users

Designing a Web service for advanced data users permits the greatest flexibility and creativity on the part of the site maintainer. Research users are upper division students, faculty, and consultants who possess a broad range of statistical sophistication. The most fundamental need for research level users is convenient access to the data. This entails evaluation of data relevancy through documentation and survey instruments, access to descriptive tools (frequencies, distributions, standard deviations, etc.) through the Web site, and on-line subsetting of the data by both variable and case selection. It may also be desirable to provide access to advanced statistical techniques such as regression analysis, and to graphical and geospatial data representation through the Web interface. One important issue to be considered in designing advanced data sites is the isolation in which the user works. Providing the broadest possible range of functionality and complexity often comes at the expense of fundamental guidance in issues such as variable selection and record limitations. This may not be an issue for the intended audience, but users from the general public and general academic categories may be led to inaccurate or incomplete results without adequate site documentation and on-line assistance.
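Subsetting by variable and by case, followed by a descriptive summary, might look like the following. This is a minimal sketch using only the Python standard library; the helper names and the three sample rows are invented, and a production site would more plausibly delegate this work to a statistical package.

```python
import statistics


def subset(rows, variables, keep=lambda row: True):
    """Select cases with `keep`, then keep only the named variables."""
    return [{v: row[v] for v in variables} for row in rows if keep(row)]


def describe(rows, variable):
    """Count, mean, and sample standard deviation for one numeric variable."""
    values = [row[variable] for row in rows]
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        # The sample standard deviation needs at least two cases.
        "stdev": statistics.stdev(values) if len(values) > 1 else None,
    }


# Invented rows, not real crime statistics.
data = [
    {"county": "A", "year": 1990, "rate": 2.0},
    {"county": "A", "year": 1991, "rate": 4.0},
    {"county": "B", "year": 1990, "rate": 6.0},
]
county_a = subset(data, ["year", "rate"], keep=lambda r: r["county"] == "A")
print(describe(county_a, "rate"))
```

Reporting the number of cases alongside each statistic gives the unassisted user at least one check that the selection did what was intended.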

TECHNICAL DETAILS: HARDWARE AND SOFTWARE

Developing a data distribution presence on the Web is considerably different from developing a traditional ‘static’ Web presence. As with other operations involving collections that are made ‘freely available’ to the public, a significant investment of library capital (technical and human) will be required to successfully initiate Web-based data services.

The CPU and storage demands imposed by data are significant. Traditional Web interactions, which account for the majority of library-based Web services, impose only minor storage and memory demands. A data service such as the REIS data site involves hundreds of megabytes of data, and requires significant computing power to function. It is not just the size of a given file that must be considered, but also the frequency of use and the effects of concurrent use on server capacity.

The following graph details the growth and use of 10 major interactive data resources developed by the Social Sciences Data Center. (See Figure 1.)

This graph is a measure of actual data resource use and not just visits or ‘page hits’: each use represents the execution of an SPSS or SAS job on the library’s Web server. The peak of just under 18,000 uses represents an average CPU commitment of 600 jobs per day, or 25 jobs per hour. This is a considerable burden on a server.
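The per-day and per-hour figures follow from simple division, assuming the peak is a monthly total (the horizontal axis of the graph is months) and a 30-day month; both are assumptions of this sketch and are not stated in the text.

```python
monthly_uses = 18_000  # peak month, rounded up from "just under 18,000"
days_per_month = 30    # assumed month length

jobs_per_day = monthly_uses / days_per_month
jobs_per_hour = jobs_per_day / 24  # averaged around the clock

print(jobs_per_day, jobs_per_hour)
```

These are round-the-clock averages; if use is concentrated in daytime hours, the peak hourly load would be higher.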

At the time of writing, the center is migrating to a new server. Due to demand that has outstripped server capacity, the SSDC is moving from an IBM model 390 RS6000 to an IBM model F50 RS6000 server with two processors, 512 MB of RAM, and 27 gigabytes of onboard storage. As configured, this server comes at a price of almost $70,000 (the bulk of that cost was provided by IBM as part of its Shared University Research [SUR] program with the University of Virginia). Even with the assistance of IBM, configurations such as this are beyond the reach of many (if not most) libraries. The issue then becomes one of scale. A library can, with a more reasonable investment, develop a service plan that is designed to meet the needs of its immediate user community as opposed to the national approach taken by the University of Virginia. Scaled down to local needs, a library could build an effective site using PC-based architecture using an operating system such as Windows NT or Linux.

FIGURE 1. Web-Mediated Resource Use
[Graph not reproduced. Vertical axis scaled from 4,000 to 18,000; horizontal axis: month.]

Software and programming mark the second departure from normal Web development. Most libraries have developed a Web presence, and most information specialists are familiar with creation of HTML documents. Once written, these documents need only be deposited in a public directory on the server and the process is complete. When it comes to creating interactive sites, the HTML document is only the first step in the process; a complete system needs to be developed. Information must be gathered from the user, interpreted and acted upon, and some result must be returned to the user. This is what is referred to as the Common Gateway Interface or CGI.
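The gather, interpret, and return cycle can be sketched as a single function. The example is in Python for illustration, not in the language used by the SSDC; the function name and the form fields `county` and `year` are hypothetical. Under the CGI convention a Web server passes the submitted form data to the script in the QUERY_STRING environment variable and relays whatever the script prints back to the browser.

```python
from html import escape
from urllib.parse import parse_qs


def handle_request(query_string: str) -> str:
    """Decode a form submission and return an HTTP response as text."""
    # 1. Gather: parse "year=1990&county=51003" into {"year": ["1990"], ...}.
    form = parse_qs(query_string)
    # 2. Interpret: take the first value of each field, with defaults.
    county = form.get("county", ["(none)"])[0]
    year = form.get("year", ["(none)"])[0]
    # 3. Return: a header block, a blank line, then the body.
    #    User input is escaped before being echoed into HTML.
    body = f"<p>County {escape(county)}, year {escape(year)}</p>"
    return "Content-Type: text/html\r\n\r\n" + body


print(handle_request("year=1990&county=51003"))
```

Keeping the logic in a function that takes the query string as an argument makes the script testable without a Web server.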

Designing the CGI requires programming that is more complicated than normal HTML authoring. A background program, often referred to as a script, must be created that can decode whatever user choices have been made (the query string) and perform some task with that information. The SSDC uses a programming language called PERL (Practical Extraction and Report Language) to handle these ‘back-end’ chores. While there are a variety of programming languages from which to choose, PERL is the most widely used. A survey of popular scripting languages would include:

UNIX

C/C++
    Steep learning curve; not recommended for novice programmers
    Lack strong pattern matching capabilities
    Programs are compiled: faster and use less system resources

CShell
    Lacks pattern matching capabilities
    Requires use of sed and awk to work with strings

PERL
    Most widely used CGI programming language
    Highly portable
    Powerful string manipulation features
    Simple constructs
    Easy system calls
    Lots of existing code freely available

MACINTOSH

PERL

AppleScript
    Lacks strong string manipulation capabilities
    Powerful interface to Macintosh software

WINDOWS/NT

PERL

Visual Basic
    Direct communication with other Windows applications
    Weak string manipulation
    Easy to learn

Beyond a thorough knowledge of a scripting language, the ability to program in at least one major statistical package is desirable. While it is possible to perform all aspects of the CGI cycle in PERL alone, use of a statistical software package opens up the range of services that can be provided. The SSDC uses PERL to decode the user input and to create an SPSS or SAS program based on that data. PERL executes this program and returns its output to the user’s browser. The librarian need not be expert in any given package, but must understand a few basic syntax structures (such as reports and graphing features) and be able to create a CGI script that properly builds those structures from the user-supplied information.
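As an illustration of the step in which a script turns user choices into a statistical program, the sketch below builds a short piece of SPSS command syntax from a query string. It is written in Python rather than PERL, and the file name `ccdb94.sav`, the form field `var`, and the variable names are all hypothetical.

```python
from urllib.parse import parse_qs

# Hypothetical whitelist: only these variable names may reach the job.
ALLOWED_VARIABLES = {"pop90", "income", "area"}


def build_spss_job(query_string: str, data_file: str = "ccdb94.sav") -> str:
    """Turn the user's variable choices into SPSS command syntax."""
    fields = parse_qs(query_string)
    requested = fields.get("var", [])
    # Never paste raw user input into a program; keep known names only.
    chosen = [v for v in requested if v in ALLOWED_VARIABLES]
    if not chosen:
        raise ValueError("no valid variables requested")
    return "\n".join([
        f"GET FILE='{data_file}'.",
        f"LIST VARIABLES={' '.join(chosen)}.",
    ])


print(build_spss_job("var=pop90&var=income"))
```

Running the generated job and capturing its output is left out, since the command line for a statistical package varies by installation. Checking requested names against a fixed list keeps arbitrary user text out of the generated program.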

CASE STUDIES

An examination of three Web services developed by the Social Sciences Data Center (the County and City Data Book, National Income and Product Accounts, and 1990 Census Public Use Microdata Samples) will tie the above discussions together. Each service meets the needs of a unique mix of users and capabilities and involves unique data preparation and maintenance issues.

County and City Data Book (http://www.lib.virginia.edu/socsciindex.html). The County and City Data Book (CCDB) site was developed to serve as a direct corollary to the printed edition as well as the CD-ROM. Heavily utilized by undergraduates and the general public, the CCDB probably ranks second only to the Statistical Abstract in frequency of use. The CD-ROM product was networked on the University Library’s CD-ROM network, but the GO software’s drill-down approach to getting data did not prove satisfactory. What was needed was a quick way to look at disparate pieces of data for a range of counties or states, and to offer more flexibility in data presentation.

The CCDB data are presented in dBASE format on the CD-ROM, in a split file presentation. SPSS was used to read the data (SPSS recognizes and translates .dbf files), and the data were then merged into a single file for each level of observation (county, city, state, and place) by linking on the FIPS code field. The total space requirement for all the 1994 files was just over 14 megabytes. Because these files were so small, data from the 1988 CD-ROM were added to the project.
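Linking split files on a shared key can be sketched as follows. The SSDC performed this step in SPSS; the Python below only illustrates the idea, and the field names and sample values are invented.

```python
def merge_on_fips(*parts):
    """Merge split files into one record per FIPS code.

    Each part is a list of dicts that all carry a 'fips' key.
    """
    merged = {}
    for part in parts:
        for row in part:
            # Create the record on first sight, then add this part's fields.
            merged.setdefault(row["fips"], {}).update(row)
    # Return records in FIPS order, one per area.
    return [merged[code] for code in sorted(merged)]


# Invented sample values for two areas split across two files.
part_a = [{"fips": "51003", "pop": 68040}, {"fips": "51540", "pop": 40341}]
part_b = [{"fips": "51540", "area": 10.3}, {"fips": "51003", "area": 722.6}]

for record in merge_on_fips(part_a, part_b):
    print(record)
```

Because the records are matched on the key rather than on position, the split files do not need to list areas in the same order.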

The original goal was to offer the capability to generate a simple report or a delimited file for downloading. The positive response to this simple approach was quite unexpected and provided the impetus for enhancing the services offered. A sort feature and the ability to create HTML tables were added, both of which significantly increased the intensity of CPU use and slowed the response time, but also improved the usefulness and attractiveness of the result. Future plans include migrating to SAS (faster response time) and developing a mapping component.
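The two output styles described here, a delimited file for downloading and a sorted HTML table, might look like the following in outline. This is a sketch in Python with invented column names, not the code behind the CCDB site.

```python
import csv
import io
from html import escape


def to_delimited(rows, columns, delimiter=","):
    """Write the rows as a delimited text file and return it as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([row[c] for c in columns])
    return buffer.getvalue()


def to_html_table(rows, columns, sort_by=None):
    """Render the rows as an HTML table, optionally sorted by one column."""
    if sort_by is not None:
        rows = sorted(rows, key=lambda r: r[sort_by])
    header = "".join(f"<th>{escape(c)}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(r[c]))}</td>" for c in columns) + "</tr>"
        for r in rows
    )
    return f"<table><tr>{header}</tr>{body}</table>"


# Invented sample rows.
sample = [{"name": "B", "pop": 2}, {"name": "A", "pop": 1}]
print(to_delimited(sample, ["name", "pop"]))
print(to_html_table(sample, ["name", "pop"], sort_by="pop"))
```

Sorting and table markup are both extra work done on every request, which is consistent with the heavier CPU use noted above.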

National Income and Product Accounts (http://www.lib.virginia.edu/socsci/nipa). The National Income and Product Accounts (NIPA) data system was targeted for students in forecasting classes in the University's undergraduate business school (McIntire School of Commerce). Students in these classes make extensive and regular use of the NIPA tables from the Survey of Current Business, often seeking ten years of quarterly data from the various NIPA series. Although the STAT-USA Web site had already been launched, making the NIPA data available through the Web, they were distributed in a single file containing all years and all lines. Redistribution seemed a logical way to improve access to these data.

Processing of the data involved downloading the full file (available in both Lotus and ASCII formatted versions) and converting the data into a single SPSS file. This conversion was done using DBMS/COPY. Once the data were migrated, a new line variable was created by merging the table and line numbers. The data were then transposed so that the years and quarters formed the cases (the data are distributed with the table/line combinations forming the cases). Finally, a new variable was created to differentiate quarterly from annual data.
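The three processing steps (merging table and line numbers into one variable, transposing so that periods form the cases, and flagging quarterly versus annual data) can be sketched as below. The period labels such as `1996Q1`, the naming scheme for the merged variable, and the sample values are assumptions made for illustration.

```python
def reshape(series):
    """Transpose series-per-row data into period-per-row data.

    `series` is a list of (table, line, {period: value}) tuples.
    """
    by_period = {}
    for table, line, values in series:
        # Merge the table and line numbers into one variable name.
        name = f"t{table}_l{line}"
        for period, value in values.items():
            case = by_period.setdefault(period, {"period": period})
            case[name] = value
    cases = []
    for period in sorted(by_period):
        case = by_period[period]
        # Flag quarterly periods (e.g. '1996Q1') versus annual ('1996').
        case["frequency"] = "Q" if "Q" in period else "A"
        cases.append(case)
    return cases


# Invented sample: two lines from one table.
sample = [
    (1, 1, {"1996": 100.0, "1996Q1": 24.0}),
    (1, 2, {"1996": 60.0}),
]
for case in reshape(sample):
    print(case)
```

After the transpose, a request for ten years of quarterly data becomes a simple selection of cases rather than a search across columns.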

Once the data were migrated to the Web, the system was successfully opened. User comments were positive, and suggested the need for a graphical presentation of the data. The decision was made to redesign the system so that it would provide not only the data, but also graphs. To do this, the data were ported to Stata, which allowed more flexibility in creating graphs that could be converted into GIF and JPG formats and then viewed through a Web browser. Because of the way Stata manages memory, the NIPA data were divided into eight files according to the main NIPA table groups. The NIPA system now produces a graph for each table/line chosen, a table for that variable, as well as a comma-delimited data file containing all requested years and variables.

The NIPA system was the first non-static data system designed by the SSDC. The system is updated quarterly as well as when any comprehensive data revisions are announced by the BEA. As with the original product, updates are downloaded from the STAT-USA Web site and converted into Stata format. Records in the original data file that are repeated in the update file (usually the past two yearly and last six quarterly data points) are removed from the original and replaced with the update.
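The update rule, in which records repeated in the update file replace those in the original, can be sketched as follows. The keying of records by series and period is an assumption for illustration, and the values are invented.

```python
def apply_update(original, update):
    """Replace records repeated in the update; keep all others.

    Both arguments map a (series, period) key to a value.
    """
    # Drop original records that the update repeats...
    kept = {key: value for key, value in original.items() if key not in update}
    # ...then add every record from the update file.
    kept.update(update)
    return kept


# Invented sample: one revised quarter and one new quarter.
original = {("gdp", "1996Q4"): 1.0, ("gdp", "1997Q1"): 2.0}
update = {("gdp", "1997Q1"): 2.5, ("gdp", "1997Q2"): 3.0}
print(apply_update(original, update))
```

Revised figures therefore overwrite the earlier release, while periods not covered by the update are left untouched.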

1990 Census PUMS Files (http://www.lib.virginia.edu/pums.html). The 1990 Public Use Microdata Samples (PUMS) files represent an extraordinary research tool for economists, sociologists, demographers, urban planners, and other researchers. The PUMS files contain data on individuals from the 1990 Census and contain over 75 variables for each household and over 75 variables for each person in the sample. Since the unit of analysis is the individual housing unit or person, the PUMS data allow researchers to investigate all varieties of relationships that the Census Bureau does not examine in either its printed reports or CD-ROM products.

The data are distributed on CD-ROM as raw ASCII files. To further complicate the issue, the data are released in a hierarchical form, with housing and person data in the same file. For many researchers, the burden of accessing the data is a significant hindrance to their use. While the CD-ROM does contain a tabulation software package, the data are best used in a full statistical package like SPSS or SAS.

The first consideration in designing this service was how to present the data: either as separate housing and person files, or as a rectangularized file wherein each record represents a person complete with the corresponding housing unit data. The Census Bureau has a Web-based PUMS extraction system where the user must choose between the person record and the housing data record. For many users, this would require running extractions within both record types and then combining the data once they had been downloaded, which represents an unnecessary inconvenience. The approach taken by the SSDC was to present the data in rectangularized files, although the trade-off for such convenience would be much larger files and more intensive use of server CPU (see Note 2).
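Rectangularizing a hierarchical file can be sketched as below. The code assumes that each record begins with `H` (housing unit) or `P` (person) and that person records follow the housing record they belong to; the record contents are filler, sized to the 231 characters cited in Note 2.

```python
def rectangularize(lines):
    """Attach the preceding housing record to each person record.

    Assumes each line starts with 'H' (housing unit) or 'P' (person) and
    that person records follow the housing record they belong to.
    """
    current_housing = None
    for line in lines:
        record = line.rstrip("\n")
        if record.startswith("H"):
            current_housing = record
        elif record.startswith("P"):
            if current_housing is None:
                raise ValueError("person record before any housing record")
            # One output record per person: housing data plus person data.
            yield current_housing + record


# Filler records of 231 characters each: one housing unit, five persons.
housing = "H" + "h" * 230
persons = ["P" + "p" * 230 for _ in range(5)]
hierarchical = [housing] + persons
rectangular = list(rectangularize(hierarchical))

print(sum(len(r) for r in hierarchical))  # 6 records x 231 characters
print(sum(len(r) for r in rectangular))   # 5 records x 462 characters
```

The housing data are repeated once per person, which is the source of the larger files. A housing unit with no person records produces no output in this sketch.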

Because the PUMS data are fundamentally different from the majority of the other data available through the Social Sciences Data Center’s Web pages, the options we could make available to the user were much different. Since the underlying assumption with the PUMS site was that the data would not be of interest to the casual user, and would instead find an audience with more advanced researchers, the SSDC opted to provide common exploratory tools (frequencies and cross-tabulations) instead of the usual reporting features. The site is designed to allow the user to explore the usefulness of the data and then to perform subsetting and downloading.
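For readers unfamiliar with the two exploratory tools named here, the sketch below computes a frequency count and a cross-tabulation in Python. The variable names and records are invented, and the counts are unweighted, which a real analysis of sample data would need to address.

```python
from collections import Counter


def frequencies(records, variable):
    """Count how often each value of one variable occurs."""
    return Counter(r[variable] for r in records)


def crosstab(records, row_var, col_var):
    """Count records for each (row value, column value) pair."""
    return Counter((r[row_var], r[col_var]) for r in records)


# Invented person records.
people = [
    {"sex": "F", "employed": "yes"},
    {"sex": "F", "employed": "no"},
    {"sex": "M", "employed": "yes"},
    {"sex": "F", "employed": "yes"},
]
print(frequencies(people, "sex"))
print(crosstab(people, "sex", "employed"))
```

A frequency count shows whether a variable has enough cases to be useful, and a cross-tabulation shows how two variables relate, before the user commits to downloading a subset.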

Figure 2 provides a clear example of the use patterns that can be expected from various user types. While both the CCDB and the NIPA sites show semester-based peaks in their use, the difference in the mean use between the two sites (2600 for CCDB, 1050 for NIPA) is attributable to the CCDB's greater usefulness for the general public. Similarly, the more rigorous nature of the data in the PUMS site has kept its overall use to a more modest level with less evidence of being affected by academic calendars. While the PUMS usage of between 450 and 500 monthly uses is dwarfed by the monthly average of the CCDB, it is a rather significant amount of use given the limited audience for the data.

CONCLUSION

The experiment begun in 1994 with the Regional Economic Information System has proven to be quite successful.

Our experience has clearly demonstrated that the Web is a viable mechanism for fulfilling and expanding the core library mission, and can enhance and enrich the search for government data. And while it is not reasonable to expect all libraries to pursue this course at the scale of the University of Virginia Library (indeed it may not be possible for most to do so), it is an approach that libraries whose clientele require access to government information should seriously consider.

FIGURE 2. System Use for 3 Major Resources

[Line graph: monthly use of the CCDB, NIPA, and PUMS sites; horizontal axis MONTH.]

NOTES

1. Harvard University Commencement Day Address by Neil L. Rudenstine. Delivered June 6, 1996. (URL: http://www.harvard.edu/presidents_office/presidents-address.html)

2. Suppose a housing unit had 5 persons in it. Each record type in the PUMS file contains 231 characters, so all the records for that housing unit (6 records) would occupy 1386 (6 x 231) characters. Rectangularized, each person’s record would extend over 462 characters, so the same information would occupy 2310 (5 x 462) characters spread over 5 records. This represents an increase in size of roughly 67%. Because of this, we decided to offer the 5% sample file only for Virginia and to use the 1% sample for all other states.
