Simple Database Construction Using

Local Sources Of Data

Dr. D. Timothy Gerber

Associate Professor

Biology Department, Cowley Hall

University of Wisconsin - La Crosse

1725 State Street, La Crosse, WI 54601

Email: [email protected]

Phone: 608.785.6977 (office), 608.785.6959 (fax), 608.781.5824 (home)

Dr. David M. Reineke

Assistant Professor

Mathematics Department, Cowley Hall

University of Wisconsin - La Crosse

1725 State Street, La Crosse, WI 54601

Email: [email protected]

Phone: 608.785.6607 (office), 608.785.6602 (fax), 608.779.5603 (home)

Word count: 3,151

"Information is all around us – often in such great quantities that we are

unable to make sense of it. A set of data can be represented by a few

summary characteristics that may reveal or conceal important aspects

of it. Statistics is a form of mathematics that develops useful ways for

organizing and analyzing large amounts of data." AAAS (1990, p. 137)

Abstract: With the increased accessibility of information in our society, databases have

become a common way to organize and distribute data. To best understand how

information is organized in a database, students need to see firsthand how databases are

constructed. Construction of three simple databases using a spreadsheet is described here

and basic summary statistics are provided for each. Recommendations for building

simple databases using a computer spreadsheet and for the statistical analysis of their data are

given.

Introduction: As in the above quote, information is truly all around us. Many of the

informed decisions scientists, government officials, industrial analysts and others in our

society make revolve around summarizing information amassed in large sets of data or

databases. In biology, this information can range from human health (Kaiser, 2002) and

genetic data (FIDD, 1999) to species descriptions (e.g., Guiry & Nic Dhonncha, 2002) to

quantitative limnological data (e.g., NODC, 2001). Readily amenable to manipulation by

computer and highly organized, databases and statistics are excellent tools for analyzing

and summarizing quantitative measurements and are important for scientific

interpretation/explanations of natural phenomena (NRC, 1996, p. 118).

Since measurement is such an integral part of data collection in many biological fields

and computers/software have become increasingly available, organizing quantitative

information in databases for statistical analysis seems a reasonable way to integrate math

and science (AAAS, 1990, p. 212) in the classroom. Equally, information in one

database can be used collaboratively between math and science classrooms (NRC, 2000,

p. 141). Unfortunately, while databases can offer incredible amounts of raw, meta-,

and/or summarized data, using a database as an introduction to information processing

can be daunting, confusing, and even a ‘turn off.’ A less daunting way to introduce

students to databases is to have them build their own.

Student-built databases using local information sources can serve several functions in a

biology, math or combined classroom(s). (1) The actual process of collection and

manipulation of data allows students to internalize or give meaning to the numbers in a

database. In essence, they better acquire a ‘feel for the data.’ (2) Using local

information can be an interesting lesson. Databases and the statistics generated from

them seem sterile and objective when viewed in a textbook or downloaded from a

website. They are anything but that when one understands how they are constructed,

where the data come from, and the assumptions behind their construction and

calculations. (3) Databases can be included as part of a spiraling curriculum.

A dynamic database constructed at a lower grade level can be extended and its information

manipulated at higher grade levels with increasing sophistication. (4)

Student-built databases connect science and math with integrated levels of understanding.

A database could literally be used in both science and math classes at one grade level or

potentially in these classes from elementary to high school.

Following the idea that “[s]ound teaching usually begins with questions and phenomena

that are interesting and familiar to students, not with abstractions or phenomena outside

their range of perception, understanding, or knowledge” (AAAS, 1990, p. 201), the

purpose of this paper is to describe how a simple database can be easily constructed from

quantitative data using a computer spreadsheet. Vital statistics (e.g., natality, mortality)

data collected from a local newspaper can easily be statistically summarized and

graphically displayed using commonly available software. While there are papers on the

use of existing databases (e.g., Putterbaugh & Burleigh, 2001; LaBare et al., 2000;

Capelle & Smith, 1998), few

address the basic construction of a database using student-collected information. To best

understand the value of a database, students need to understand how databases are constructed.

Settings:

The authors have used these databases in college biology (Bio 103, Introductory Biology,

non-majors) and mathematics (Mth 205, Elementary Statistics) courses to teach

biological and statistical concepts. As courses that fulfill general science and math

requirements on our campus, Bio 103 and Mth 205 are taken by students with wide-

ranging educational backgrounds. In addition, many K-12 pre-service teachers take these

courses. Computer facilities are available for student use.

Information for databases used in Bio 103 was collected (see details in Methods below)

by students early in the semester and emailed to the instructor for inclusion in one large

master database or was instructor-generated. After completion, the master database was

emailed back to each student as an attachment. Basic statistical data manipulation using

the master database is part of the lecture component in Bio 103. The constructed

databases are shared with Mth 205 students.

Methods:

Three separate databases for (1) Baby (all infants), (2) Twin babies and (3) Obituary data

were constructed from information in the La Crosse Tribune (local newspaper). One of

our local hospitals now has baby information online (see Gundersen-Lutheran in

reference section). Length, weight (continuous variables), sex (categorical variable), date

and time of birth data were published every other week for all babies (twins are identified) born

at Gundersen Lutheran Hospital, La Crosse, WI. Sex, birth and death data, published

daily, were collected from the obituary column. No names were recorded in building the

Baby (all infants) or Obituary databases; however, surnames were used to keep track of

twins. Care was taken not to double-count the same person in the obituary column, since

a person is usually listed on two successive days.

Newspaper data were entered into a Microsoft® Excel 2000 (hereafter Excel) spreadsheet.

Excel was chosen because of its ubiquity (included with the Office 2000 suite of

programs) and basic statistical analysis capability. Those unfamiliar with Excel can use

the “Help” pull-down menu within the program or consult a general reference (e.g., Shelly

Cashman Series, 2000). Basic database structure and terminology can be found in

Spooner & Barracato (1999). Graphs were generated using SPSS, an easy-to-use menu-

driven software package for statistical processing, which accepts Excel spreadsheets

(SPSS, 2002). The Sampling tool in the Excel Analysis ToolPak™ was used to draw

random samples from database “populations” and to compute descriptive statistics and

confidence intervals.
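
For readers without Excel, the sampling step can be sketched in Python. This is a minimal illustration and not the authors' procedure: the weights are made-up values, the function name `sample_summary` is hypothetical, sampling is done without replacement, and the interval uses the normal multiplier 1.96 rather than a t value.

```python
import math
import random
import statistics

# Hypothetical birth weights in ounces; illustrative values only, not the authors' data.
population = [112, 120, 98, 131, 125, 117, 104, 140, 122, 109,
              128, 115, 119, 135, 101, 126, 113, 124, 118, 130]

def sample_summary(values, n, seed=None):
    """Draw a simple random sample of size n (without replacement) and return
    the sample, its mean, its standard deviation, and an approximate 95% CI."""
    rng = random.Random(seed)
    sample = rng.sample(values, n)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)            # sample standard deviation (n - 1)
    half_width = 1.96 * sd / math.sqrt(n)    # normal approximation (assumption)
    return sample, mean, sd, (mean - half_width, mean + half_width)

sample, mean, sd, ci = sample_summary(population, n=10, seed=1)
print(f"sample mean = {mean:.1f} oz, sd = {sd:.1f} oz")
print(f"approximate 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"population mean = {statistics.mean(population):.1f} oz")
```

Because the whole "population" is available, the last line lets students check whether the interval from their sample actually contains the population mean.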

Results:

Each of the three databases (Baby (all infants), Twins, Obituary) was easy to generate

even with only a rudimentary knowledge of Excel (see Fig. 1 for basic database setup).

However, 10-15 Bio 103 students usually needed additional help in using Excel and

attaching email files. This problem was quickly solved with one instructor-led, ‘hands-

on’, computer lab session (approx. 1 hour) on entering data into Excel and a discussion of

attaching files to email messages. The Baby (all infants) and Obituary databases grew

at a rate of approximately 70-100 births or deaths per month, so within a month the Baby

(all infants) database provides a good sample of human birth weight. In our local newspaper, births are

listed once every other Saturday. The Twins database took much longer to develop since

only 0-5 sets of twins were listed monthly. Building this database therefore takes

semesters rather than the month needed for the Baby (all infants) or Obituary databases.

There were slight, but not statistically significant, differences in birth weight between

males and females and a few outliers were discovered, as shown by the boxplots in

Figure 2. Overall birth weight (Fig. 2) was similar to what is found in much larger data

sets for the United States (e.g., Wilcox et al., 1995). Birth weight, birth length, and

length of human life span (vital statistics) were excellent measures to use for database

building and basic statistical analysis for several reasons. (1) Biologically, vital statistics

convey important information concerning the human condition. For example, human

birth weight is associated with individual infant survival and a population’s infant

mortality (Wilcox, 2001). (2) Statistically, these measures often show a normal or bell-

shaped distribution (Wilcox, 2001), important for assumptions of parametric statistical

tests. (3) From an educational view, even young students should be familiar with or can

easily understand what these measures are and how they are determined. There is also a

strong positive correlation between baby birth weight and length (Fig. 3), which can be

used to introduce correlation and regression using a student-generated database.
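
As a sketch of how the correlation and regression exercise might look outside a spreadsheet, the Python below computes Pearson's r and the least-squares line from first principles. The length and weight pairs are invented for illustration, and the function name `pearson_r_and_line` is hypothetical.

```python
import math

# Hypothetical (length in inches, weight in ounces) pairs; illustrative only.
lengths = [18.0, 19.0, 19.5, 20.0, 20.5, 21.0, 21.5, 22.0]
weights = [96.0, 104.0, 110.0, 118.0, 121.0, 128.0, 133.0, 141.0]

def pearson_r_and_line(x, y):
    """Return Pearson's r plus the slope and intercept of the
    least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)    # spread of x
    syy = sum((yi - mean_y) ** 2 for yi in y)    # spread of y
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept

r, slope, intercept = pearson_r_and_line(lengths, weights)
print(f"r = {r:.3f}; weight = {intercept:.1f} + {slope:.1f} * length")
```

Writing out the sums of squares, rather than calling a library routine, keeps each quantity visible for students meeting correlation for the first time.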

Significant differences in weight between single births and twin births for males and

females are displayed clearly by the graphic comparison shown in Figure 4. Differences

in human life expectancy by sex also provide a nice graphic comparison (Fig. 5) using

our obituary database and can be compared with what students know about average life

expectancy. This database can be used to discuss statistical calculations based on a

population and samples of various sizes (Fig. 6). Data were graphically represented

using boxplots (Fig. 2), scatterplots (Fig. 3), histograms (Fig. 5) and confidence intervals

(Figs. 2 & 4).
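
The obituary calculation (age = YOD - YOB, as in the Figure 6 caption) and a by-sex summary can be sketched as follows. The records are invented for illustration, and the helper names `ages_by_sex` and `summarize` are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical obituary records (sex, year of birth, year of death); illustrative only.
records = [
    ("F", 1915, 2002), ("M", 1928, 2002), ("F", 1931, 2002),
    ("M", 1940, 2002), ("F", 1909, 2002), ("M", 1922, 2002),
]

def ages_by_sex(rows):
    """Group calculated ages (YOD - YOB) by the sex recorded in each row."""
    groups = defaultdict(list)
    for sex, yob, yod in rows:
        groups[sex].append(yod - yob)
    return dict(groups)

def summarize(ages):
    """Basic descriptive statistics for one group of ages."""
    return {
        "n": len(ages),
        "mean": statistics.mean(ages),
        "median": statistics.median(ages),
    }

for sex, ages in sorted(ages_by_sex(records).items()):
    print(sex, summarize(ages))
```

An age computed from years alone can differ by one year from the age at the last birthday, which is itself a useful point to raise with students.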

Discussion:

“Using data from actual investigations from science in mathematics courses, students

encounter all the anomalies of authentic problems – inconsistencies, outliers, and errors –

which they might not encounter with contrived textbook data.” (NRC, 1996, p. 214)

Creating their own database is an excellent way for students to learn the trials and

tribulations of data collection and data management. It provides an opportunity to

discuss ethical issues in data collection as well as data integrity. Furthermore, students

will see that data in the “real world” do not always present themselves as neatly as they appear

in textbooks or web-based databases, but need to be organized, carefully labeled,

and proofread. Sometimes part of a data record may be missing or recorded incorrectly,

giving rise to unusually large or small values. In these situations, students should be

taught the difference between an outlier and a data entry error: genuine data entry

errors are to be corrected (where correction is possible) or removed from the

database, whereas outliers are to remain and be dealt with appropriately.
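
One way to find values worth checking is the common 1.5 × IQR boxplot rule, sketched below in Python. The weights are invented, with 1250 standing in for a mistyped 125, and the function name `flag_outliers` is hypothetical. The rule only flags unusual values; deciding whether a flagged value is an entry error or a true outlier still requires going back to the original source.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below the first quartile
    or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical birth weights in ounces; 1250 imitates a typing slip for 125.
weights = [112, 120, 98, 131, 125, 117, 104, 140, 122, 109, 1250]
print(flag_outliers(weights))   # values to check against the newspaper listing
```

Quartile conventions differ between software packages, so the fences computed here may not match those of a given spreadsheet or statistics program exactly.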

A database can also be used to illustrate the concepts of population and sample. For

example, the entire database can be defined to be a hypothetical population of interest

and a random sample of a given size can be drawn from it, as shown in Figure 6. The

descriptive statistics from the sample can then be compared to the corresponding

population parameters. Repeated sampling can be used to demonstrate the variability of

sample statistics, which may be followed up by a discussion of sampling distribution

theory. This can easily be done in Excel using the Sampling tool in the Analysis

ToolPak™. Naturally, such a discussion would lead to statistical inference for students in

grades 9–12 or in a university-level elementary statistics course. Constructing

confidence intervals and conducting hypothesis testing using random samples from a

database affords students the rare opportunity of having complete knowledge of the

population from which the sample came.
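
Repeated sampling from a fully known population can be sketched in Python as below. Everything here is an assumption made for illustration: the "population" of ages is generated synthetically, the function name `coverage` is hypothetical, samples are drawn without replacement, and intervals use the normal multiplier 1.96.

```python
import math
import random
import statistics

def coverage(population, n, trials, seed=0):
    """Draw `trials` random samples of size n. Return the list of sample means
    and the fraction of approximate 95% CIs that contain the population mean."""
    rng = random.Random(seed)
    mu = statistics.mean(population)
    means, hits = [], 0
    for _ in range(trials):
        sample = rng.sample(population, n)
        m = statistics.mean(sample)
        half = 1.96 * statistics.stdev(sample) / math.sqrt(n)
        means.append(m)
        if m - half <= mu <= m + half:
            hits += 1
    return means, hits / trials

# Synthetic "population" of ages at death, generated for illustration only.
gen = random.Random(42)
population = [round(gen.gauss(75, 12)) for _ in range(500)]

means, rate = coverage(population, n=30, trials=1000)
print(f"population mean = {statistics.mean(population):.1f}")
print(f"spread of sample means = {statistics.stdev(means):.2f}")
print(f"fraction of intervals containing the population mean = {rate:.3f}")
```

The list of sample means can be plotted as a histogram to introduce the sampling distribution, and the coverage fraction gives students a concrete reading of what "95% confidence" means.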

Biologically, most populations that researchers are interested in studying are so large that

it is not possible to have complete knowledge of them. This makes clear why statistics is

needed as a discipline and why the random variation that occurs with random sampling

must be accounted for and understood. Using vital statistics to build a database

provides students an opportunity to investigate, discover, and collect “real” data using

biologically important measures they can understand. Building a simple database with

student-collected data offers an excellent opportunity to connect the biological with the

mathematical and fosters collaboration among students as well as faculty.

Databases in the Classroom:

“To take hold and mature, concepts must not just be presented to students from time to

time but must be offered to them periodically in different contexts and at increasing

levels of sophistication.” (AAAS, 1990, p. 207)

At the K-12 level, simple databases can easily be constructed using spreadsheet software

(e.g., Excel), a calculator with a spreadsheet function (Morgan, 1997), or as a pencil and

paper exercise. Classroom-generated databases can easily be compared with trends for

the nation, too (see NCHS, 2003). In addition, many of the education standards for data

analysis and probability for grades 3-12 can be addressed through the assembly and use

of databases (Table 1). While the sophistication of statistical analysis will vary

drastically from lower grades to the college level, database construction and data

summarization offer the opportunity to use these exercises throughout much of the formal

educational training a student receives.

Regardless of grade level, several words of caution should be mentioned. (1) When

using a local information source, students may know people in the databases they are

constructing. This may be an advantage if, for example, a student has a new baby sister

listed in the birth announcement section of the newspaper and her information is included

in a database. However, it may be devastating for a student, whose uncle was just killed

in a car accident and is now listed in the obituary section, to include him in a database.

(2) Database construction in a classroom will not necessarily be easy. Missing data or

measurement problems of some sort are likely to be encountered. Such situations can be

exploited to teach students that data collection is often “messy” and that it is essential to

be as careful and accurate as possible. Another pitfall is the tendency to view the

database as a “random sample” when that is not likely to be the case. Instructors will

want to be careful to define exactly what the database represents, which is more likely to

be a well-defined population than a random sample. This point has more relevance for

secondary and university-level students covering statistical inference procedures because

they require that samples be randomly selected. (3) Database construction will be time-

consuming for both students and instructor, especially when first beginning. We

recommend starting with a small, easily controlled but meaningful data set. Complexity

can be built into databases over time.

You may request the three databases we have developed by emailing the first author.

When emailing, please include your name, institution/school, city, state/province, and

country so that we may keep a record of requests. Databases used for this publication

will be emailed to you as attached Excel files. Included in our databases are the compiled

Baby (all infants), Twin babies, and Obituary raw data collected from the La Crosse

Tribune. These databases may be freely used for educational purposes; however, we

suggest that they be used as examples. It is preferable to build your databases using

student-collected data. The data were not double-checked for accuracy.

Acknowledgements: The authors thank L. Gerber and two anonymous reviewers for

comments on the original manuscript.

References:

American Association for the Advancement of Science (AAAS). (1990). Science for All Americans. New York: Oxford University Press.

Capelle, J. & M. Smith. (1998). Using cemetery data to teach population biology & local history. The American Biology Teacher 60: 690-693.

Frequency of Inherited Disorders Database (FIDD). (1999). http://archive.uwcm.ac.uk/uwcm/mg/fidd/index.html

Guiry, M. D. & Nic Dhonncha, E. (2002). AlgaeBase. http://www.algaebase.org/default.html

Gundersen-Lutheran Hospital’s On-Line Nursery. http://www.gundluth.org/babies

Kaiser, J. (2002). Population databases boom, from Iceland to the U.S. Science 298(5596): 1158-1161.

LaBare, K., R. Klotz, & E. Witherow. (2000). Using online databases to teach ecological concepts. The American Biology Teacher 62(2): 124-127.

Morgan, L. (1997). Explorations: Statistics Handbook for the TI-83. Texas Instruments Inc.

National Center for Health Statistics (NCHS) website. (2003). http://www.cdc.gov/nchs/

National Council of Teachers of Mathematics (NCTM). (2000). Principles and Standards for School Mathematics. Reston, VA: The National Council of Teachers of Mathematics, Inc.

National Research Council (NRC). (1996). National Science Education Standards. Washington D. C.: National Academy Press.

-----. (2000). Inquiry and the National Science Education Standards: A Guide for Teaching and Learning. Washington D. C.: National Academy Press.

National Oceanographic Data Center (NODC). (2001). http://www.nodc.noaa.gov/

Putterbaugh, M. & J. Burleigh. (2001). Investigating evolutionary questions using online molecular databases. The American Biology Teacher 6: 422-431.

Shelly Cashman Series. (2000). Microsoft Office 2000: Introductory concepts and techniques. Cambridge, MA: Course Technology.

Spooner, B. & J. Barracato. (1999). Database Basics Skills Book. Arlington, VA: National Science Teachers Association.

SPSS for Windows, Rel. 11.5.1. (2002). Chicago: SPSS Inc.

Wilcox, A.J. (2001). On the importance – and the unimportance – of birthweight. International Journal of Epidemiology 30: 1233-1241. Online at: http://eb.niehs.nih.gov/bwt/V0M3QDQU.pdf

-----, R. Skjaerven, P. Buekens, & J. Kiely. (1995). Birth weight and perinatal mortality: A comparison of the United States and Norway. Journal of the American Medical Association 273: 709-711.

Table 1. Selected science* (NRC, 1996) and math+ (NCTM, 2000) standards relevant to this activity for K-12 grade levels.

Grade   Standard

3-5     Collect data using observations, surveys and experiments (p. 176) +

5-8     …tools and techniques to gather, analyze, and interpret data (p. 145) *

5-8     Nature of science (p. 170) *

6-8     Find, use, and interpret measures of center and spread, including mean and interquartile range (p. 248) +

6-8     Discuss and understand the correspondence between data sets and their graphical representations, especially histograms, stem-and-leaf plots, box plots, and scatterplots (p. 248) +

6-8     Use observations about differences between two or more samples to make conjectures about the populations from which the samples were taken (p. 248) +

9-12    Use technology and mathematics to improve investigations and communications (p. 175) *

9-12    Understandings about scientific inquiry (p. 176) *

9-12    Understand the meaning of measurement data and categorical data, of univariate and bivariate data, and of the term variable (p. 324) +

9-12    Understand how sample statistics reflect the values of population parameters and use sampling distributions as the basis for informal inference (p. 324) +

14

Captions for Figures

Figure 1. Example of the spreadsheet for the database of newborns (Baby (all infants)

database) in La Crosse, WI.

Figure 2. Boxplots of birth weight for newborns in La Crosse, WI. Circles represent

outliers in the data.

Figure 3. Scatterplot of birth weight vs. length for newborns in La Crosse, WI.

Figure 4. Mean birth weight in ounces of male and female newborn twins and singles

with 95% confidence intervals for newborns in La Crosse, WI.

Figure 5. Histograms for ages of males and females using the Obituary database.

Figure 6. Example of an Excel spreadsheet with both raw (sex, year of death (YOD), year

of birth (YOB)) and calculated (age = YOD - YOB) obituary data and an embedded table

of statistics.

[Figure 2. Panel title: Newborns in La Crosse, WI. x-axis: SEX (Male, Female); y-axis: Weight (oz.), 20 to 200; group sizes shown: N = 69, 56.]

[Figure 3. Panel title: Newborns in La Crosse, WI. x-axis: Length (in.), 14 to 24; y-axis: Weight (oz.), 40 to 200; legend: SEX (Male, Female).]

[Figure 4. x-axis: Sex (Male, Female); y-axis: Mean Weight (oz.), 60 to 140; legend: TYPE (Single, Twin); group sizes shown: N = 22, 28, 69, 56.]

[Figure 5. Two panels, SEX = Female and SEX = Male. x-axis: AGE (Years), 0 to 130; y-axis: Frequency, 0 to 400.]

Verification

This is to verify that our manuscript is neither being nor has been accepted for publication elsewhere.

D. Timothy Gerber ________________________________________

David M. Reineke ________________________________________