selected problems in epidemiology nina h. fefferman, ph.d. co-director tufts univ. informid

Selected Problems in Epidemiology

Nina H. Fefferman, Ph.D., Co-Director, Tufts Univ. InForMID

Data mining in public health is not new, but it is more complicated

A small historical example: Cholera, John Snow, 1854

During the height of the miasma theory of disease

1) There was a Cholera outbreak in London

2) John Snow became ‘irrationally’ convinced that Cholera came from contaminated drinking water

So Snow went to the London Registrar-General

He looked at where those who died from Cholera got their water and when

"The experiment … was on the grandest scale. No fewer than 300,000 people of both sexes, of every age and occupation, and of every rank and station, from gentlefolks down to the very poor, were divided into two groups without their choice, and, in most cases, without their knowledge; one group being supplied with water containing the sewage of London, and, amongst it, whatever might have come from the cholera patients, the other group having water quite free from such impurity."

John Snow, On the Mode of Communication of Cholera, Second Edition, 1855

Snow’s findings:

Water supplier                    Number of Houses   Deaths from Cholera   Deaths per 10,000 Houses
Southwark and Vauxhall Company              40,046                 1,263                        315
Lambeth Company                             26,107                    98                         37
Rest of London                             256,423                 1,422                         59

Before 1852, your chances of getting cholera were not correlated with getting your water from either water company

In the epidemic of 1853-54, your chances of getting cholera if your water was from Southwark and Vauxhall were more than eight times greater than if you got your water from Lambeth
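The "more than eight times" comparison can be checked directly from the house and death counts quoted in Snow's table. The sketch below is only an illustration of that arithmetic; the variable and function names are made up for this example. Recomputing from the counts as quoted gives roughly 55 per 10,000 houses for the rest of London rather than the 59 shown in the table, so the sketch prints the recomputed values instead of asserting the printed ones.

```python
# Figures as quoted in Snow's table: (number of houses, deaths from cholera)
snow_table = {
    "Southwark and Vauxhall Company": (40_046, 1_263),
    "Lambeth Company": (26_107, 98),
    "Rest of London": (256_423, 1_422),
}


def deaths_per_10k_houses(houses, deaths):
    """Cholera deaths per 10,000 houses."""
    return deaths / houses * 10_000


rates = {
    name: deaths_per_10k_houses(houses, deaths)
    for name, (houses, deaths) in snow_table.items()
}

# How many times higher the death rate was for Southwark and Vauxhall
# customers than for Lambeth customers
rate_ratio = rates["Southwark and Vauxhall Company"] / rates["Lambeth Company"]

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} deaths per 10,000 houses")
print(f"Southwark and Vauxhall vs. Lambeth: {rate_ratio:.1f} times higher")
```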

Then Cholera reoccurred in the Soho district of London

About 600 people died from cholera in a 10-day period

Once again Snow took the operational death-certificate data from the Registrar-General

This time he plotted the data on a clustering diagram, using a stacked histogram technique plotted on a map of Soho to do the data mining

And then it got really impressive:

Based upon this map, Snow was able to convince the London Board of Guardians to remove the pump handle from the public pump located on Broad Street

The outbreak of cholera subsided with this operational change

It was later revealed that the Broad Street well was contaminated by an underground cesspool located at 40 Broad Street which was just three feet from the well

The Broad Street pump without a handle remains today as a tribute to Snow

Lives saved due to real-time data mining

Modern problems: Happening on every scale imaginable:

Genetic –

We know what we’re looking at and what we’re looking for, just not how to find it

Single Defined Population –

We know who we’re looking at and what we’re looking for, but not how to find it

Undefined Population –

We don’t know who to look at, but we know what to look for

Undefined Everything –

We want to save lives, but don’t know what to do at all

Genetic Epidemiology:

You have good reason to believe that a disease has a genetic component

You have the sequenced genomes of some afflicted people

The human genome is huge

Normally one-tenth of a single percent of DNA (about 3 million bases) differs from one person to the next

Luckily junk DNA makes up at least 50% of the human genome

But we still know of about 1.4 million locations where single nucleotide polymorphisms (SNPs) occur in humans

Chromosome   Sequence Length (in base pairs)
1            245,203,898
2            243,315,028
3            199,411,731
4            191,610,523
5            180,967,295
6            170,740,541
7            158,431,299
8            145,908,738
9            134,505,819
10           135,480,874
11           134,978,784
12           133,464,434
13           114,151,656
14           105,311,216
15           100,114,055
16            89,995,999
17            81,691,216
18            77,753,510
19            63,790,860
20            63,644,868
21            46,976,537
22            49,476,972
X            152,634,166
Y             50,961,097
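As a rough check on the "about 3 million bases" figure, the chromosome lengths listed above can be summed and multiplied by one-tenth of one percent. This is only a back-of-the-envelope sketch: it counts one copy of each chromosome, and the variable names are invented for illustration.

```python
# Sequence lengths in base pairs, copied from the table above
chromosome_lengths = {
    "1": 245_203_898, "2": 243_315_028, "3": 199_411_731, "4": 191_610_523,
    "5": 180_967_295, "6": 170_740_541, "7": 158_431_299, "8": 145_908_738,
    "9": 134_505_819, "10": 135_480_874, "11": 134_978_784, "12": 133_464_434,
    "13": 114_151_656, "14": 105_311_216, "15": 100_114_055, "16": 89_995_999,
    "17": 81_691_216, "18": 77_753_510, "19": 63_790_860, "20": 63_644_868,
    "21": 46_976_537, "22": 49_476_972, "X": 152_634_166, "Y": 50_961_097,
}

# One copy of each chromosome
total_bases = sum(chromosome_lengths.values())

# One-tenth of one percent of the total
typical_difference = total_bases * 0.001

print(f"Total length: {total_bases:,} base pairs")
print(f"0.1% of total: {typical_difference:,.0f} base pairs")
```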

This type of examination is called a “large-scale genotype–phenotype association study”

Classical statistical methods (e.g. multivariable regression, contingency table analysis) are ill-suited for high-dimensional problems because they are “single inference procedures”

We need “joint inference procedures”

Methods for combining results across multiple “single inference procedures” are inefficient
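One way to see the inefficiency is to count what happens when a single-inference test is simply repeated at every known SNP location. The sketch below assumes one independent test per SNP at a conventional 0.05 significance level and uses a Bonferroni correction as the combining method; both choices are assumptions made for illustration and are not named in the talk.

```python
# Number of known SNP locations quoted in the talk
n_snps = 1_400_000

# Conventional per-test significance level (an assumption for this sketch)
alpha = 0.05

# If no SNP is truly associated, this many tests are still expected
# to come out "significant" by chance alone
expected_false_positives = n_snps * alpha

# Bonferroni correction: divide the significance level by the number of
# tests so the chance of any false positive stays near alpha
bonferroni_threshold = alpha / n_snps

print(f"Expected chance findings at alpha={alpha}: {expected_false_positives:,.0f}")
print(f"Per-test threshold after Bonferroni: {bonferroni_threshold:.2e}")
```

A per-test threshold this strict means that only very strong single-SNP effects survive, which is one reason joint methods are attractive.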

In this type of case, Data-mining methods are hypothesis-generating and classical statistical methods are hypothesis-testing

So we need Data Mining

A paper on something like this: Rodin et al. 2005. Mining Genetic Epidemiology Data with Bayesian Networks: Application to APOE Gene Variation and Plasma Lipid Levels. J Comput Biol. 12(1):1–11.

A single defined population: We know who we’re looking at and what we’re looking for, but not how to find it

In an adverse reaction study for a new vaccine or drug

We know who to watch (those who receive the treatment)

We know what we’re looking for (“bad things that happen to them”)

How do we find “it”?

We also have to monitor people who don’t get the treatment and see what happens to them

We wind up with a huge set of “all bad things that happen to lots of people”

This leads to a lot of problems:

Health care providers report adverse reactions by patients to any drug

Unfortunately, many patients need to take several drugs at once, so all will be reported with the same event

And there’s reporting bias: results don’t reflect the overall population (only the people who needed the drug in the first place, but that’s probably the portion we’re worried about anyway)

Explicit example: Sudden Infant Death Syndrome (SIDS) and the Polio vaccine

You can easily find a statistical association between the two – Does this mean the polio vaccine is dangerous?

Not necessarily – the polio vaccine is mainly given to infants, who are the only possible victims of SIDS

Receiving the polio vaccine increases your likelihood of being an infant, which significantly increases your chance of SIDS

We would need to check whether there is an association within infants
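The difference between the overall association and the association within infants can be sketched with a small stratified calculation. All counts below are HYPOTHETICAL, chosen only to make the confounding by age visible; they are not real vaccine or SIDS data.

```python
# HYPOTHETICAL counts: (number of people, number of SIDS cases)
counts = {
    ("infant", "vaccinated"): (9_000, 9),
    ("infant", "unvaccinated"): (1_000, 1),
    ("non-infant", "vaccinated"): (1_000, 0),
    ("non-infant", "unvaccinated"): (89_000, 0),
}


def risk(groups):
    """SIDS risk (cases / people) pooled over the given groups."""
    people = sum(counts[g][0] for g in groups)
    cases = sum(counts[g][1] for g in groups)
    return cases / people


# Crude comparison: everyone vaccinated vs. everyone unvaccinated
crude_rr = (
    risk([("infant", "vaccinated"), ("non-infant", "vaccinated")])
    / risk([("infant", "unvaccinated"), ("non-infant", "unvaccinated")])
)

# Stratified comparison: vaccinated vs. unvaccinated among infants only
infant_rr = risk([("infant", "vaccinated")]) / risk([("infant", "unvaccinated")])

print(f"Crude risk ratio (whole population): {crude_rr:.1f}")
print(f"Risk ratio within infants:           {infant_rr:.1f}")
```

In these made-up numbers the crude ratio is large only because most vaccinated people are infants; within infants the two groups have the same risk.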

Example problems in data mining for adverse events:

A reference and a paper on something like this: http://www.fda.gov/cder/aers/default.htm or Niu et al. 2001 Vaccine. 19(32):4627-34.
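One commonly used screening measure for spontaneous adverse event reports is the proportional reporting ratio (PRR), which asks whether an event makes up a larger share of the reports for one drug than of the reports for all other drugs. The talk does not say which measure is used, so the PRR here is an illustrative choice, and the report counts are HYPOTHETICAL.

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR from a 2x2 table of report counts.

    a: reports of the event of interest for the drug of interest
    b: reports of all other events for the drug of interest
    c: reports of the event of interest for all other drugs
    d: reports of all other events for all other drugs
    """
    share_for_drug = a / (a + b)
    share_for_others = c / (c + d)
    return share_for_drug / share_for_others


# HYPOTHETICAL report counts, for illustration only
prr = proportional_reporting_ratio(a=20, b=980, c=100, d=98_900)
print(f"PRR = {prr:.1f}")
```

A high PRR only flags a drug-event pair for follow-up. It does not resolve the co-medication problem described above, since a report listing several drugs counts toward each of them.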

Undefined Population – We don’t know who to look at, but we know what to look for

Example: Figuring out the source of a food-borne outbreak (Good news: we know some diseases are caused by food-borne pathogens)

We can hypothesize that a certain activity is somehow related to the source, for example that the food at a party was contaminated

Unfortunately, there can be a lot of food at one large party

You might not know if the food at the party is actually the culprit

You need to ask if people at the party got sick

If they did, you need to know which particular food at the party is contaminated

The normal process here is to call everyone at the party and conduct a survey (see handout)

These surveys can generate a huge amount of data and there’s no guarantee that the party was the source of the outbreak
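Once survey responses are in, a standard first pass is to compare, for each food, the attack rate among guests who ate it with the attack rate among guests who did not. The sketch below uses HYPOTHETICAL counts for a 100-guest party and invented food names; it assumes at least one person who skipped each food became ill, so the ratio is defined.

```python
# HYPOTHETICAL survey tallies per food:
# (ate and ill, ate and well, did not eat and ill, did not eat and well)
food_counts = {
    "chicken salad": (30, 10, 5, 55),
    "rolls": (20, 40, 15, 25),
    "fruit punch": (18, 32, 17, 33),
}


def relative_risk(ate_ill, ate_well, not_ill, not_well):
    """Attack rate among eaters divided by attack rate among non-eaters."""
    attack_rate_ate = ate_ill / (ate_ill + ate_well)
    attack_rate_not = not_ill / (not_ill + not_well)
    return attack_rate_ate / attack_rate_not


# Rank foods from most to least suspicious
ranked = sorted(
    ((food, relative_risk(*tally)) for food, tally in food_counts.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for food, rr in ranked:
    print(f"{food}: relative risk {rr:.2f}")
```

With many foods on the menu, some will look suspicious by chance, which is part of why these surveys become a data mining problem.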

Horror scenario from a data perspective:

Food poisoning at the Republican National Convention

We wouldn’t know:
• Which day
• Which location
• Which caterer
• How many people were made ill

How do you figure out what, how, and who in real time?

Part of the problem is to get the answer before more people become sick, so you want to narrow the focus of your investigation as you go: ask fewer people and ask fewer questions, because all these surveys take time

Undefined Everything – We want to save lives, but don’t know what to do at all

Cancer:

You’ll hear more about this later in the program from Dmitriy Fradkin

Huge numbers of people diagnosed

Huge numbers of possible contributing risks – environmental exposure to carcinogens, genetic predisposition, cancer-causing viruses

Huge numbers of confounding factors – differences in diagnosis, treatment, and outcome; co-morbidity

Let’s say we’re worried about the beginning of an outbreak of H5N1 avian flu

It will probably start out looking like normal flu

How quickly we can figure out where it is will determine how quickly we can try active intervention strategies

We don’t know where it will start:

International travel?

Near airports?

International bird migration patterns?

Along the coasts? Depending on time of year?

Once it’s here, we don’t really know how it will spread –

Maybe we want an early warning system for cities – is the disease present or absent?

These are the types of epidemiological problems we face. What kinds of practical constraints do we have to expect?

There are many data collectors: Insurance companies, HMOs, public health agencies

Issues of data control – Who controls the data?

Is each entity found only at a single site?

Do different sites contain different types of data?

How can we make sure the data isn’t redundant and therefore skewing our information?

How can we make sure we get all the pertinent data at the same time?

Or at least how fast is fast enough to figure out what we need as quickly as possible?
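The redundancy question can be illustrated with a very small sketch: the same case reported by two collectors should be counted once. The record layout, field names, and values below are HYPOTHETICAL, and the sketch assumes a shared patient identifier is available, which is often not true once data has been de-identified.

```python
# HYPOTHETICAL case reports arriving from different data collectors
reports = [
    {"source": "insurer", "patient_id": "A17",
     "disease": "salmonellosis", "onset": "2003-06-02"},
    {"source": "HMO", "patient_id": "A17",
     "disease": "salmonellosis", "onset": "2003-06-02"},
    {"source": "public health agency", "patient_id": "B42",
     "disease": "salmonellosis", "onset": "2003-06-03"},
]

# Group reports that describe the same case, remembering who sent them
unique_cases = {}
for report in reports:
    key = (report["patient_id"], report["disease"], report["onset"])
    unique_cases.setdefault(key, []).append(report["source"])

print(f"{len(reports)} reports received, {len(unique_cases)} distinct cases")
for key, sources in unique_cases.items():
    print(key, "reported by", ", ".join(sources))
```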

Individual privacy concerns limit the willingness of the data custodians to share it, even with government agencies such as the U.S. Centers for Disease Control

In many cases, data is shared only after it has been “de-identified” according to HIPAA regulations –

This removes a lot of useful information and doesn’t really do a whole lot to protect privacy, but that’s another issue (see Fefferman et al. 2005 J. Public Health Policy 26(4):430-449)

We need a whole different slew of data mining techniques to mine data “blind”

(when we don’t know what we’re seeing, what the numbers represent, how much they’ve been aggregated to represent averages, or what we’re looking for)

And Privacy and Ethics:

For more information, see http://www.hipaa.org/

And other problems:

Sometimes we don’t know where the best source of data is –

We can monitor some cities more closely

We can monitor certain diseases (notifiable diseases)

Although this is constrained by having to verify by lab test

Sometimes our expectations of “normal” levels of disease set the wrong benchmark for when we should start being concerned

Different diseases have different normal incidence, which means that an increase of 10 cases per year of one disease is an outbreak, but it would take an increase of 1000 in another to be ‘unusual’
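A simple way to make "unusual" depend on the baseline is to ask how large a yearly count would have to be before it becomes improbable under normal conditions. The sketch below assumes yearly counts follow a Poisson distribution around the usual mean, which is a modeling assumption made for illustration; the two baseline values are HYPOTHETICAL.

```python
import math


def poisson_alert_threshold(baseline_mean, alpha=0.01):
    """Smallest count c with P(X >= c) <= alpha for X ~ Poisson(baseline_mean)."""
    cdf_below = 0.0  # running value of P(X <= c - 1)
    c = 0
    while 1.0 - cdf_below > alpha:
        # Poisson probability of exactly c cases, computed in log space
        # so that large baseline means do not cause numerical problems
        log_pmf = c * math.log(baseline_mean) - baseline_mean - math.lgamma(c + 1)
        cdf_below += math.exp(log_pmf)
        c += 1
    return c


# HYPOTHETICAL usual yearly case counts for two diseases
baselines = {"rare disease": 5, "common disease": 5000}

thresholds = {name: poisson_alert_threshold(mean) for name, mean in baselines.items()}

for name, mean in baselines.items():
    excess = thresholds[name] - mean
    print(f"{name}: usual {mean}, alert at {thresholds[name]} (excess of {excess})")
```

Under these assumptions a handful of extra cases is enough to flag the rare disease, while the common disease needs a far larger excess before the count stops looking normal.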

[Figure: Botulism, foodborne. Number of reported cases, by year, United States, 1983-2003]

[Figure: Escherichia coli, enterohemorrhagic O157:H7. Number of reported cases, United States and U.S. territories, 2003]

Sometimes we expect something intermediate

And sometimes we expect the numbers to be reasonably large

[Figure: Salmonellosis. Incidence*, by year, United States, 1973-2003]

*Per 100,000 population

[Figure: Acquired immunodeficiency syndrome (AIDS). Number of reported cases, by year, United States* and U.S. territories, 1983-2003]

*Total number of AIDS cases includes all cases reported to CDC as of December 31, 2003. Total includes cases among residents in U.S. territories and 220 cases among persons with unknown state of residence.

And sometimes our methods of surveillance themselves create issues

Sometimes our problems are prospective

Sometimes our problems are retrospective

In outbreak detection and biosurveillance, we want to find “unusual disease incidence” early

In adverse reaction trials, we want to know overall effects; we don’t particularly care about the time scales on which they act

In “classic epidemiological investigations” we are looking for the source of exposure to prevent further infection

Advances in technology have caused a shift in our data mining needs

It used to be that the bottleneck to appropriate analysis was figuring out where to look for the data and collecting it

A pre-processing problem

Due to advances in reporting technology, we’re very close to getting real-time reporting for mortality data and we’re getting there for incidence data (for at least some diseases)

Now we have to figure out how to find meaningful results in the chaos and clutter

Data mining techniques can be tailored to handle all of these problems

We haven’t covered all of the problems, but as you can see, we need better techniques and we need more people working on the use of these techniques

Thanks for attending this workshop – we need you!
