Selected Problems in Epidemiology. Nina H. Fefferman, Ph.D., Co-Director, Tufts Univ. InForMID
Post on 19-Dec-2015
Data mining in public health is not new, but it is more complicated
A small historical example: Cholera, John Snow, 1854
During the height of the Miasmic theory of Disease
1) There was a Cholera outbreak in London
2) John Snow became ‘irrationally’ convinced that Cholera came from contaminated drinking water
So Snow went to the London Registrar-General
He looked at where those who died from Cholera got their water and when
"The experiment … was on the grandest scale. No fewer than 300,000 people of both sexes, of every age and occupation, and of every rank and station, from gentlefolks down to the very poor, were divided into two groups without their choice, and, in most cases, without their knowledge; one group being supplied with water containing the sewage of London, and, amongst it, whatever might have come from the cholera patients, the other group having water quite free from such impurity."
John Snow, On the Mode of Communication of Cholera, Second Edition, 1855
Snow’s findings:
Water supplier                   Number of Houses   Deaths from Cholera   Deaths per 10,000 Houses
Southwark and Vauxhall Company             40,046                 1,263                        315
Lambeth Company                            26,107                    98                         37
Rest of London                            256,423                 1,422                         59
Before 1852, your chances of getting cholera were not correlated with getting your water from either water company
In the epidemic of 1853-54, your chances of getting cholera if your water was from Southwark and Vauxhall were more than eight times greater than if you got your water from Lambeth
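The "more than eight times" comparison can be checked directly from the house and death counts in Snow's table. The sketch below is only an illustration; the function and variable names are made up for this example, and recomputed rates can differ slightly from the printed column.

```python
# House and death counts copied from Snow's findings (1853-54 epidemic).
suppliers = {
    "Southwark and Vauxhall Company": (40_046, 1_263),
    "Lambeth Company": (26_107, 98),
    "Rest of London": (256_423, 1_422),
}


def deaths_per_10k(houses, deaths):
    """Cholera deaths per 10,000 houses."""
    return deaths / houses * 10_000


rates = {name: deaths_per_10k(h, d) for name, (h, d) in suppliers.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} deaths per 10,000 houses")

# Ratio behind the "more than eight times greater" statement.
rate_ratio = rates["Southwark and Vauxhall Company"] / rates["Lambeth Company"]
print(f"Southwark and Vauxhall vs. Lambeth: {rate_ratio:.1f}x")
```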
Then Cholera reoccurred in the Soho district of London
About 600 people died from cholera in a 10-day period
Once again Snow took the operational death-certificate data from the Registrar-General
This time he plotted the data on a clustering diagram, using a stacked histogram technique plotted on a map of Soho to do the data mining
And then it got really impressive:
Based upon this map, Snow was able to convince the London Board of Guardians to remove the pump handle from the public pump located on Broad Street
The outbreak of cholera subsided with this operational change
It was later revealed that the Broad Street well was contaminated by an underground cesspool located at 40 Broad Street which was just three feet from the well
The Broad Street pump without a handle remains today as a tribute to Snow
Lives saved due to real-time data mining
Modern problems: Happening on every scale imaginable:
Genetic –
We know what we’re looking at and what we’re looking for, just not how to find it
Single Defined Population –
We know who we’re looking at and what we’re looking for, but not how to find it
Undefined Population –
We don’t know who to look at, but we know what to look for
Undefined Everything –
We want to save lives, but don’t know what to do at all
Genetic Epidemiology:
You have good reason to believe that a disease has a genetic component
You have the sequenced genomes of some afflicted people
The human genome is huge
Normally one-tenth of a single percent of DNA (about 3 million bases) differs from one person to the next
Luckily junk DNA makes up at least 50% of the human genome
But we still know of about 1.4 million locations where single nucleotide polymorphisms (SNPs) occur in humans
Chromosome Sequence Length (in base pairs)
1 245,203,898
2 243,315,028
3 199,411,731
4 191,610,523
5 180,967,295
6 170,740,541
7 158,431,299
8 145,908,738
9 134,505,819
10 135,480,874
11 134,978,784
12 133,464,434
13 114,151,656
14 105,311,216
15 100,114,055
16 89,995,999
17 81,691,216
18 77,753,510
19 63,790,860
20 63,644,868
21 46,976,537
22 49,476,972
X 152,634,166
Y 50,961,097
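As a rough check on the scale involved, the chromosome lengths in the table can be summed and multiplied by the "one-tenth of a single percent" figure quoted earlier. This is a back-of-the-envelope sketch; the variable names are illustrative only.

```python
# Sequence lengths in base pairs, copied from the chromosome table.
chromosome_lengths = {
    "1": 245_203_898, "2": 243_315_028, "3": 199_411_731,
    "4": 191_610_523, "5": 180_967_295, "6": 170_740_541,
    "7": 158_431_299, "8": 145_908_738, "9": 134_505_819,
    "10": 135_480_874, "11": 134_978_784, "12": 133_464_434,
    "13": 114_151_656, "14": 105_311_216, "15": 100_114_055,
    "16": 89_995_999, "17": 81_691_216, "18": 77_753_510,
    "19": 63_790_860, "20": 63_644_868, "21": 46_976_537,
    "22": 49_476_972, "X": 152_634_166, "Y": 50_961_097,
}

total_bases = sum(chromosome_lengths.values())

# 0.1% of the sequence is the share said to differ between two people.
differing_bases = total_bases * 0.001

print(f"Total length: {total_bases:,} base pairs")
print(f"0.1% of that: about {differing_bases:,.0f} bases")
```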
This type of examination is called a “large-scale genotype–phenotype association study”
Classical statistical methods (e.g. multivariable regression, contingency table analysis) are ill-suited for high-dimensional problems because they are “single inference procedures”
We need “joint inference procedures”
Methods for combining results across multiple “single inference procedures” are inefficient
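One way to see the inefficiency is to test each of the roughly 1.4 million known SNP locations separately and then combine the results. The sketch below uses the Bonferroni correction as the combining rule. That choice is an assumption made for illustration, as is the 0.05 significance level; the slides do not name a specific method.

```python
n_snps = 1_400_000   # known SNP locations, from the figure quoted earlier
alpha = 0.05         # assumed per-study significance level (illustrative)

# Testing every SNP separately at alpha, with no true associations at all,
# still produces this many false positives on average.
expected_false_positives = n_snps * alpha

# Bonferroni: divide alpha by the number of tests to control the chance of
# any false positive. The resulting per-test threshold is very strict.
bonferroni_threshold = alpha / n_snps

print(f"Expected false positives at alpha={alpha}: {expected_false_positives:,.0f}")
print(f"Bonferroni per-test threshold: {bonferroni_threshold:.2e}")
```

With a per-test threshold this small, only very strong single-SNP effects survive, which is one reason joint procedures are attractive.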
In this type of case, Data-mining methods are hypothesis-generating and classical statistical methods are hypothesis-testing
So we need Data Mining
A paper on something like this: Rodin et al. 2005. Mining Genetic Epidemiology Data with Bayesian Networks: Application to APOE Gene Variation and Plasma Lipid Levels. J Comput Biol. 12(1): 1–11.
A single defined population: We know who we’re looking at and what we’re looking for, but not how to find it
In an adverse reaction study for a new vaccine or drug
We know who to watch (those who receive the treatment)
We know what we’re looking for (“bad things that happen to them”)
How do we find “it”?
We also have to monitor people who don’t get the treatment and see what happens to them
We wind up with a huge set of “all bad things that happen to lots of people”
This leads to a lot of problems:
Health care providers report adverse reactions by patients to any drug
Unfortunately, many patients need to take several drugs at once, so all will be reported with the same event
And there’s reporting bias - results don’t reflect the overall population (only the people who needed the drug in the first place, but that’s probably the portion we’re worried about anyway)
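One common screening statistic for spontaneous report databases is the proportional reporting ratio (PRR). It compares how often an event appears in reports for one drug with how often it appears in reports for all other drugs. The sketch below is illustrative only: the counts are invented, and the slides do not say which statistic is used in practice.

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR from a 2x2 table of report counts.

    a: reports with the drug and the event
    b: reports with the drug, other events
    c: reports with other drugs and the event
    d: reports with other drugs, other events
    """
    return (a / (a + b)) / (c / (c + d))


# Hypothetical counts, not real surveillance data.
prr = proportional_reporting_ratio(a=20, b=980, c=100, d=98_900)
print(f"PRR = {prr:.1f}")
```

A high PRR is a signal to investigate, not evidence of causation. Co-reported drugs and reporting bias both distort it.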
Explicit example: Sudden Infant Death Syndrome (SIDS) and the Polio vaccine
You can easily find a statistical association between the two – Does this mean the polio vaccine is dangerous?
Not necessarily – the polio vaccine is mainly given to infants, who are the only possible victims of SIDS
Receiving the polio vaccine increases your likelihood of being an infant, which significantly increases your chance of SIDS
We would need to check whether there is an association within infants
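The effect of age as a confounder can be sketched with invented numbers. In the hypothetical population below, SIDS risk among infants is the same whether or not they were vaccinated. The crude comparison still shows a large association, because almost all vaccinated people are infants. All counts are assumptions chosen for illustration.

```python
# Hypothetical counts: (group, vaccinated?) -> (people, SIDS cases)
population = {
    ("infant", True): (9_000, 9),
    ("infant", False): (1_000, 1),
    ("non-infant", True): (1_000, 0),
    ("non-infant", False): (89_000, 0),
}


def risk(cells):
    """SIDS cases divided by people, pooled over the given cells."""
    people = sum(n for n, _ in cells)
    cases = sum(k for _, k in cells)
    return cases / people


# Crude comparison: ignore age entirely.
vaccinated = [v for (_, vac), v in population.items() if vac]
unvaccinated = [v for (_, vac), v in population.items() if not vac]
crude_relative_risk = risk(vaccinated) / risk(unvaccinated)

# Stratified comparison: look within infants only.
infant_relative_risk = (
    risk([population[("infant", True)]]) / risk([population[("infant", False)]])
)

print(f"Crude relative risk:          {crude_relative_risk:.0f}")
print(f"Relative risk within infants: {infant_relative_risk:.1f}")
```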
Example problems in data mining for adverse events:
A reference and paper on something like this: http://www.fda.gov/cder/aers/default.htm or Niu et al. 2001 Vaccine. 19(32):4627-34.
Undefined Population – We don’t know who to look at, but we know what to look for
Example: Figuring out the source of a food-borne outbreak
(Good news: we know some diseases are caused by food-borne pathogens)
We can hypothesize that a certain activity is somehow related to the source, such as the food at a party being contaminated
Unfortunately, there can be a lot of food at one large party
You might not know if the food at the party is actually the culprit
You need to ask if people at the party got sick
If they did, you need to know which particular food at the party is contaminated
The normal process here is to call everyone at the party and conduct a survey (see handout)
These surveys can generate a huge amount of data and there’s no guarantee that the party was the source of the outbreak
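A standard first pass over such survey data is to compare, for each food, the attack rate (the share who became ill) among those who ate it with the rate among those who did not. The sketch below uses a tiny invented set of responses; the food names and the data layout are assumptions for illustration.

```python
# Hypothetical survey responses: (became ill?, set of foods eaten)
responses = [
    (True, {"potato salad", "chicken"}),
    (True, {"potato salad"}),
    (True, {"potato salad", "cake"}),
    (False, {"chicken", "cake"}),
    (False, {"chicken"}),
    (False, {"cake"}),
    (True, {"chicken", "potato salad"}),
    (False, {"potato salad", "cake"}),
]


def attack_rate(rows):
    """Share of respondents in rows who became ill."""
    return sum(ill for ill, _ in rows) / len(rows) if rows else float("nan")


foods = sorted(set().union(*(ate for _, ate in responses)))
results = []
for food in foods:
    exposed = [r for r in responses if food in r[1]]
    unexposed = [r for r in responses if food not in r[1]]
    difference = attack_rate(exposed) - attack_rate(unexposed)
    results.append((food, attack_rate(exposed), attack_rate(unexposed), difference))

# Rank foods by how much more often eaters became ill than non-eaters.
results.sort(key=lambda row: row[3], reverse=True)
for food, ar_exposed, ar_unexposed, difference in results:
    print(f"{food:13s} ate: {ar_exposed:.2f}  did not eat: {ar_unexposed:.2f}  "
          f"difference: {difference:+.2f}")
```

A large positive difference points at a suspect food. It does not establish that the party was the source of the outbreak.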
Horror scenario from a data perspective:
Food poisoning at the Republican National Convention
We wouldn’t know
• Which day
• Which location
• Which caterer
• How many people were made ill
How do you figure out what, how, and who in real time?
Part of the problem is to get the answer before more people become sick, so you want to narrow the focus of your investigation as you go: ask fewer people and fewer questions, because all these surveys take time
Undefined Everything – We want to save lives, but don’t know what to do at all
Cancer:
You’ll hear more about this later in the program from Dmitriy Fradkin
Huge numbers of people diagnosed
Huge numbers of possible contributing risks – environmental exposure to carcinogens, genetic predisposition, cancer-causing viruses
Huge numbers of confounding factors – differences in diagnosis, treatment, outcome; co-morbidity
Let’s say we’re worried about the beginning of an outbreak of H5N1 avian flu
It will probably start out looking like normal flu
How quickly we can figure out where it is will determine how quickly we can try active intervention strategies
We don’t know where it will start:
International travel?
Near airports?
International bird migration patterns?
Along the coasts? Depending on time of year?
Once it’s here, we don’t really know how it will spread –
Maybe we want an early warning system for cities – is the disease present or absent?
These are the types of epidemiological problems we face. What kinds of practical constraints do we have to expect?
There are many data collectors: Insurance companies, HMOs, public health agencies
Issues of data control – Who controls the data?
Is each entity found only at a single site?
Do different sites contain different types of data?
How can we make sure the data isn’t redundant and therefore skewing our information?
How can we make sure we get all the pertinent data at the same time?
Or at least how fast is fast enough to figure out what we need as quickly as possible?
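The redundancy question can be made concrete with a small sketch. It assumes the different collectors' records share a common patient key, which real data, and especially de-identified data, may not provide. The record layout and all values are hypothetical.

```python
from typing import NamedTuple


class CaseReport(NamedTuple):
    source: str        # which data collector sent the record
    patient_key: str   # assumed shared identifier (hypothetical)
    disease: str
    onset_date: str


# Hypothetical reports: the same case can arrive from several collectors.
reports = [
    CaseReport("insurer", "p-001", "salmonellosis", "2003-06-01"),
    CaseReport("hmo", "p-001", "salmonellosis", "2003-06-01"),
    CaseReport("health_dept", "p-002", "salmonellosis", "2003-06-02"),
    CaseReport("hmo", "p-003", "e_coli_o157", "2003-06-02"),
    CaseReport("health_dept", "p-003", "e_coli_o157", "2003-06-02"),
]

# Group reports by case so each case is counted once, while keeping track
# of which sources reported it.
unique_cases = {}
for report in reports:
    key = (report.patient_key, report.disease, report.onset_date)
    unique_cases.setdefault(key, []).append(report.source)

print(f"{len(reports)} reports, {len(unique_cases)} distinct cases")
for key, sources in unique_cases.items():
    print(key, "reported by", sources)
```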
Individual privacy concerns limit the willingness of the data custodians to share it, even with government agencies such as the U.S. Centers for Disease Control
In many cases, data is shared only after it has been “de-identified” according to HIPAA regulations –
This removes a lot of useful information and doesn’t really do a whole lot to protect privacy, but that’s another issue (see Fefferman et al. 2005 J. Public Health Policy 26(4):430-449)
We need a whole different slew of data mining techniques to mine data “blind” (when we don’t know what we’re seeing, what the numbers represent, how much they’ve been aggregated to represent averages or what we’re looking for)
And Privacy and Ethics:
For more information, see http://www.hipaa.org/
And other problems:
Sometimes we don’t know where the best source of data is –
We can monitor some cities more closely
We can monitor certain diseases (notifiable diseases)
Although this is constrained by having to verify by lab test
Sometimes our expectations of “normal” levels of disease set the wrong benchmark for when we should start being concerned
Different diseases have different normal incidence, which means that an increase of 10 cases per year of one disease is an outbreak, but it would take an increase of 1000 in another to be ‘unusual’
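A minimal way to express a disease-specific benchmark is to flag a count only when it exceeds that disease's own historical mean by some multiple of its standard deviation. The sketch below assumes a threshold of two standard deviations and uses invented yearly counts. Real surveillance systems use more careful baselines.

```python
import statistics

# Hypothetical yearly case counts for a rare and a common disease.
history = {
    "rare disease": [4, 6, 5, 5, 7, 3],
    "common disease": [40_000, 41_000, 39_500, 40_500, 39_000, 40_000],
}


def is_unusual(past_counts, observed, n_sd=2.0):
    """Flag observed if it exceeds mean + n_sd * sample standard deviation."""
    baseline = statistics.mean(past_counts)
    spread = statistics.stdev(past_counts)
    return observed > baseline + n_sd * spread


# The same absolute increase of 10 cases means very different things.
rare_flag = is_unusual(history["rare disease"], observed=5 + 10)
common_flag = is_unusual(history["common disease"], observed=40_000 + 10)
large_common_flag = is_unusual(history["common disease"], observed=40_000 + 2_000)

print("Rare disease, +10 cases:     ", rare_flag)
print("Common disease, +10 cases:   ", common_flag)
print("Common disease, +2,000 cases:", large_common_flag)
```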
ESCHERICHIA COLI, ENTEROHEMORRHAGIC O157:H7. Number of reported cases, United States and U.S. territories, 2003
Sometimes we expect something intermediate
And sometimes we expect the numbers to be reasonably large
SALMONELLOSIS. Incidence,* by year, United States, 1973-2003
*Per 100,000 population
ACQUIRED IMMUNODEFICIENCY SYNDROME (AIDS) Number of reported cases, by year
United States* and U.S. territories, 1983-2003
*Total number of AIDS cases includes all cases reported to CDC as of December 31, 2003. Total includes cases among residents in U.S. territories and 220 cases among persons with unknown state of residence.
And sometimes our methods of surveillance themselves create issues
Sometimes our problems are prospective
Sometimes our problems are retrospective
In outbreak detection and biosurveillance, we want to find “unusual disease incidence” early
In adverse reaction trials, we want to know overall effects; we don’t particularly care about the time scales on which they act
In “classic epidemiological investigations” we are looking for the source of exposure to prevent further infection
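For the prospective case, where the goal is to notice unusual incidence early, one widely used idea is a cumulative sum (CUSUM) chart. It accumulates small excesses over the expected count until they cross an alarm level. The sketch below is one simple variant. The baseline, allowance, alarm level, and daily counts are all invented for illustration.

```python
def first_cusum_alarm(counts, expected, allowance, alarm_level):
    """Return the index of the first alarm, or None if no alarm is raised.

    The running sum grows by (count - expected - allowance) each step
    and is never allowed to fall below zero.
    """
    running_sum = 0.0
    for index, count in enumerate(counts):
        running_sum = max(0.0, running_sum + (count - expected - allowance))
        if running_sum > alarm_level:
            return index
    return None


# Hypothetical daily case counts that drift upward at the end.
daily_counts = [9, 11, 10, 12, 15, 16, 18]
first_alarm = first_cusum_alarm(daily_counts, expected=10, allowance=2, alarm_level=8)
print("First alarm at day index:", first_alarm)
```

A smaller allowance or alarm level raises the alarm sooner, at the cost of more false alarms.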
Advances in technology have caused a shift in our data mining needs
It used to be that the bottleneck to appropriate analysis was figuring out where to look for the data and collecting it
A pre-processing problem
Due to advances in reporting technology, we’re very close to getting real-time reporting for mortality data and we’re getting there for incidence data (for at least some diseases)
Now we have to figure out how to find meaningful results in the chaos and clutter