PREDICTING PROPAGATION OF A DISEASE
Members:
Raghuram Krishnamachari
Manish Maheshwari
Maryam El Kherba
Guided by:
Prof. Alan Mislove
Flu Prediction / Activity
CDC Flu Activity
•Reports influenza-like illness (ILI) for each region
Google Flu Trends
•Aggregates search data to estimate flu activity
Our experiment (Twitter)
•Analyzes Twitter data (tweets) to estimate flu activity
Google Flu Trends
CDC’s ILI data vs. Google Flu Trends
Google Flu Trends vs. Twitter
[Figure: weekly series by HHS region and for the United States, 1/6/2008 to 8/17/2009; y-axis 0 to 12000. Legend:]
HHS Region 1 (CT, ME, MA, NH, RI, VT)
HHS Region 2 (NJ, NY)
HHS Region 3 (DE, DC, MD, PA, VA, WV)
HHS Region 4 (AL, FL, GA, KY, MS, NC, SC, TN)
HHS Region 5 (IL, IN, MI, MN, OH, WI)
HHS Region 6 (AR, LA, NM, OK, TX)
HHS Region 7 (IA, KS, MO, NE)
HHS Region 8 (CO, MT, ND, SD, UT, WY)
HHS Region 9 (AZ, CA, HI, NV)
HHS Region 10 (AK, ID, OR, WA)
United States
[Figure: weekly series for Region 1 through Region 10, 1/6/2008 to 8/28/2009; y-axis 0 to 0.009]
Google Flu Trends vs. Twitter
[Figure: G-R3 vs. T-R3 (Google Flu Trends vs. Twitter, Region 3), 1/6/2008 to 8/23/2009; y-axis 0 to 7000]
[Figure: G-R9 vs. T-R9 (Google Flu Trends vs. Twitter, Region 9), 1/6/2008 to 8/23/2009; y-axis 0 to 8000]
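The side-by-side charts compare the two series visually. One way to put a number on that comparison is a Pearson correlation coefficient, which ignores the difference in scale between the two sources. The slides do not say that this statistic was used; the sketch below is an illustration only, and the helper name `pearson` and the two-column input format are assumptions.

```shell
#!/bin/sh
# Hypothetical helper (not from the slides): Pearson correlation of two
# whitespace-separated columns, x in column 1 and y in column 2,
# read from stdin or from the files given as arguments.
pearson() {
    awk '
        NF >= 2 {
            n++
            sx += $1; sy += $2
            sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2
        }
        END {
            if (n < 2) { print "nan"; exit }
            cov = n * sxy - sx * sy
            vx  = n * sxx - sx * sx
            vy  = n * syy - sy * sy
            # A constant column has no variance, so r is undefined.
            if (vx <= 0 || vy <= 0) { print "nan"; exit }
            printf "%.3f\n", cov / sqrt(vx * vy)
        }
    ' "$@"
}
```

Each input line would hold one week, for example the Google Flu Trends value and the Twitter value for the same region.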
Tweets, Phrases
"having a cold" 4
"have a cold" 7
"feel feverish" "flu" 5
"headache" "flu" 8
"sick" "flu" 9
"flu" "fever" 5
"came down with the flu" 7
"chills" "flu" 7
"catching the flu" 6
"cough" "flu" 6
"fatigue" "flu" 8
"weakness" "flu" 6
"flu like symptoms" 4
"runny nose" "flu" 5
"sore throat" "flu" 7
"stomach ache" "flu" 6
"stuffy nose" "flu" 6
"tiredness" "flu" 4
"vomiting" "flu" 4
"watery eyes" "flu" 6
"body hurts" "flu" 7
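One plausible reading of the phrase list is that a tweet counts as flu-related when it contains every quoted term on a row. The slides do not state the matching rule, so the sketch below is a minimal illustration under that assumption; the function name `matches_all` and the case-insensitive matching are hypothetical choices.

```shell
#!/bin/sh
# Hypothetical helper (not from the slides): succeed only if the tweet text
# contains every given term. Assumes AND semantics across the quoted terms
# on one row of the phrase list.
matches_all() {
    text=$1
    shift
    for term in "$@"; do
        # -F: fixed string, -i: ignore case, -q: report via exit status only.
        # Substring matching means "flu" also matches inside "influenza".
        printf '%s\n' "$text" | grep -qiF -- "$term" || return 1
    done
    return 0
}

# Made-up tweet text for illustration.
matches_all "Sore throat and chills, I think it is the flu" "sore throat" "flu" \
    && echo "flu-related"
```

A row with a single phrase, such as "having a cold", is simply the one-term case of the same call.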
Process
Filter
•Filter flu tweets from Twitter data
•Store data for each state (FIPS)
Count
•Count flu tweets (weekly)
•Count total tweets (weekly)
Plot
•Ratio of flu-related to total tweets
•Compare against Google/CDC
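The Count and Plot steps amount to dividing the weekly flu tweet count by the weekly total. The sketch below shows that division with awk. The input format (one "week count" pair per line), the file names, and the helper name `weekly_ratio` are assumptions for illustration, and the sample numbers are made up.

```shell
#!/bin/sh
# Hypothetical helper (not from the slides): print "week ratio" for each
# week that appears in both files. Assumed format: one "week count" per line.
weekly_ratio() {
    flu_file=$1
    total_file=$2
    awk '
        NR == FNR { flu[$1] = $2; next }   # first file: flu tweet counts
        ($1 in flu) && $2 > 0 {            # second file: total tweet counts
            printf "%s %.6f\n", $1, flu[$1] / $2
        }
    ' "$flu_file" "$total_file"
}

# Usage with made-up sample counts.
tmp=$(mktemp -d)
printf '2009-W01 12\n2009-W02 30\n' > "$tmp/flu.txt"
printf '2009-W01 4000\n2009-W02 6000\n' > "$tmp/total.txt"
weekly_ratio "$tmp/flu.txt" "$tmp/total.txt"
# prints:
# 2009-W01 0.003000
# 2009-W02 0.005000
rm -r "$tmp"
```

Weeks with a zero total are skipped rather than reported, which avoids a division by zero.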
Implementation
Linux bash shell script
Filtering
    find fips -name "*.gz" -exec zcat {} \; | grep "$1"
Counting
    find … -exec zcat {} \; | awk '{ print $3 }' | awk '{ print $3 " " $2 " " $6 }' | sort -k 3n -k 2M -k 1n | uniq -c
Plotting
    pr -mft -s, dates.txt NJ.tot NY.tot > RE2.tot
    Microsoft Excel
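The `pr -m -s,` step places the per-state files side by side as comma-separated columns. If each row then holds a date followed by one count per state, a regional total is the sum of the state columns. The slides do not show this step or the layout of the `.tot` files, so the sketch below is an assumption; the helper name `region_total` and the sample rows are made up.

```shell
#!/bin/sh
# Hypothetical helper (not from the slides): read comma-separated rows of
# the form "date,count1,count2,..." and print "date,sum".
# The column layout is an assumption about the merged file.
region_total() {
    awk -F, '{
        sum = 0
        for (i = 2; i <= NF; i++) sum += $i   # add every per-state column
        print $1 "," sum
    }' "$@"
}

# Usage with made-up rows standing in for a merged NJ/NY file.
printf '1/6/2008,10,25\n1/13/2008,7,30\n' | region_total
# prints:
# 1/6/2008,35
# 1/13/2008,37
```

The function reads standard input when called without arguments, so it can sit at the end of a pipeline or take a file name such as the merged output.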
Challenges
Filtering
•Phrases that express flu symptoms
•Processing time
•Segregation based on location
Counting
•Processing time
•Storage format
Plotting
•Lack of consistent CDC data
•Handling of large numeric data
Future
•Better prediction algorithm
•Live Tweet monitoring
•Flu propagation
•Facebook application