Data Science for Social Good and Ushahidi
DESCRIPTION
The Eric and Wendy Schmidt Data Science for Social Good Summer Fellowship 2013: preliminary update, July 2013. About the DSSG rock stars: http://dssg.io/ https://twitter.com/datascifellows/ Their project: http://dssg.io/2013/07/15/ushahidi-machine-learning-for-human-rights.html More at ushahidi.com / wiki.ushahidi.com / blog.ushahidi.com
TRANSCRIPT
![Page 1: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/1.jpg)
Project Update - July 11, 2013
The Eric & Wendy Schmidt
Data Science for Social Good
Summer Fellowship 2013
www.dssg.io | [email protected]
![Page 2: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/2.jpg)
Ushahidi Workflow
![Page 3: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/3.jpg)
Ushahidi Workflow + DSSG
![Page 4: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/4.jpg)
Data Sets
23,000 reports from 20 datasets
• 22% English
• 35% non-English
• 43% mixed languages
Each report includes text, category, location, sometimes more data
![Page 5: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/5.jpg)
Data Sets
Additional datasets were unusable for various reasons (e.g. overly formulaic language)
What is the quality of the existing "gold standard" annotation?
Working on translations of non-English texts
![Page 6: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/6.jpg)
Data Set Differences
Afghanistan election (peaceful)
Kenyan election (less peaceful)
![Page 7: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/7.jpg)
Current Task Status [July 11]
1) Suggest categories
2) Extract named entities (especially locations)
3) Detect language
4) Detect (near-)duplicate reports
The end of the presentation has more extensive technical details
![Page 8: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/8.jpg)
Toy Demo
http://ec2-54-218-196-140.us-west-2.compute.amazonaws.com/home
Note this is ONLY a basic "toy" user interface to demonstrate the current prototype functionality.
Our plan is to deliver an open-source code library, which Ushahidi will incorporate into the existing user interface.
If link doesn't work -- just look at the screenshots in the next slides. :)
![Page 9: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/9.jpg)
Demo: Example #1
![Page 10: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/10.jpg)
Demo: Example #2
![Page 11: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/11.jpg)
Secondary Project Ideas
1. Detect private info to strip
2. Urgency assessment
3. Filtering irrelevant reports (not strictly spam)
4. Automatically proposing new [sub-]categories
5. Cluster similar (non-identical) reports
6. Hierarchical topic modelling / visualization
![Page 12: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/12.jpg)
Evaluation Plans
• Tap into Ushahidi and crisis mapping communities for feedback
• Simulate a past event with our system
• Success metrics:
o Increased annotator speed
o Increased annotator categorization accuracy
o Decreased annotator frustration/tedium
o Increased citizen web report speed
![Page 13: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/13.jpg)
Feedback welcome!
Contact us at dssg-
We would love your input!
See next 4 slides for technical details on our 4 tasks...
or skip if you're happy to stay unaware... :)
![Page 14: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/14.jpg)
1) Suggest categories
Currently:
• Simple bag-of-words unigram features
• 1-vs.-all classification (scikit-learn)
• Little categories merged into fewer big categories
• Performance uninspiring :(
Future:
Bigrams... word frequency filter...
balancing positive/negative examples...
topic modeling... hierarchical categories...
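To make the "Currently" bullets concrete, here is a minimal sketch of bag-of-words unigram features with 1-vs.-all classification. The slides say the prototype uses scikit-learn; this sketch is a dependency-free stand-in that trains one simple perceptron per category, which is an assumption for illustration and not the team's actual model. The function names and the toy reports are hypothetical.

```python
from collections import Counter, defaultdict


def unigrams(text):
    """Bag-of-words unigram features: lowercase token -> count."""
    return Counter(text.lower().split())


def train_one_vs_all(reports, epochs=10):
    """Train one binary perceptron per category (one-vs-all).

    reports: list of (text, set_of_categories) pairs.
    Returns {category: {token: weight}}; the key "" holds the bias term.
    """
    categories = sorted({c for _, cats in reports for c in cats})
    models = {c: defaultdict(float) for c in categories}
    featurized = [(unigrams(text), cats) for text, cats in reports]
    for _ in range(epochs):
        for feats, cats in featurized:
            for c, w in models.items():
                # Each category sees the report as positive or negative.
                target = 1 if c in cats else -1
                score = w[""] + sum(w[t] * n for t, n in feats.items())
                if target * score <= 0:  # misclassified: perceptron update
                    w[""] += target
                    for t, n in feats.items():
                        w[t] += target * n
    return models


def suggest_categories(models, text):
    """Return categories whose classifier scores the text positive, best first."""
    feats = unigrams(text)
    scored = []
    for c, w in models.items():
        score = w.get("", 0.0) + sum(w.get(t, 0.0) * n for t, n in feats.items())
        if score > 0:
            scored.append((score, c))
    return [c for _, c in sorted(scored, reverse=True)]


# Toy data, invented for illustration only.
reports = [
    ("ballot boxes stolen at polling station", {"fraud"}),
    ("armed men attacked voters", {"violence"}),
    ("voters attacked and ballot boxes stolen", {"fraud", "violence"}),
    ("polling station opened on time", set()),
]
models = train_one_vs_all(reports)
print(suggest_categories(models, "armed men attacked voters"))
```

Because each category gets its own binary classifier, a report can receive several suggestions at once, which fits reports that belong to more than one category.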
![Page 15: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/15.jpg)
2) Extract named entities
Currently:
• NLTK's Named Entity Recognizer
• Eval: pretty good
Future:
• Train location-recognizer on datasets
• Merge types for non-location NEs
• Remember previously-confirmed NEs
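The "remember previously-confirmed NEs" idea could look roughly like the sketch below: a small lookup of entities that an annotator has already confirmed, matched against new report text before (or alongside) the statistical recognizer. The class name, method names, and the "LOCATION" label are hypothetical, and this does not use or replace NLTK's recognizer.

```python
import re


class ConfirmedEntityMemory:
    """Hypothetical store of entities an annotator has already confirmed."""

    def __init__(self):
        # lowercase surface form -> (original surface form, entity type)
        self._entities = {}

    def confirm(self, surface, entity_type):
        """Record an entity after a human has confirmed it."""
        self._entities[surface.lower()] = (surface, entity_type)

    def find(self, text):
        """Return (surface, type, start offset) for confirmed entities in text."""
        hits = []
        for key, (surface, entity_type) in self._entities.items():
            # Whole-word, case-insensitive match of the confirmed form.
            pattern = r"\b" + re.escape(key) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((surface, entity_type, m.start()))
        return sorted(hits, key=lambda h: h[2])


memory = ConfirmedEntityMemory()
memory.confirm("Kibera", "LOCATION")
memory.confirm("Nairobi", "LOCATION")
print(memory.find("Clashes reported in kibera, near Nairobi."))
```

Exact matching like this only catches spellings seen before; it is meant as a complement to a trained location recognizer, not a substitute.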
![Page 16: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/16.jpg)
3) Detect Language
Currently:
• Existing packages (Bing, Python, ...)
Future:
• Evaluate quality
• Allow event-specific language bias
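One possible reading of "event-specific language bias" is to treat the languages expected for a given event as a prior and reweight whatever scores a language detector returns. The sketch below assumes the detector yields a score per language code; the function name, the `floor` parameter, and the example numbers are all hypothetical and not taken from any specific package.

```python
def apply_language_bias(detector_scores, event_prior, floor=0.01):
    """Reweight detector scores by an event-specific prior and renormalize.

    detector_scores: {language_code: score} from any language detector.
    event_prior: {language_code: expected share of reports for this event}.
    floor: small prior given to languages not listed for the event.
    """
    weighted = {
        lang: score * event_prior.get(lang, floor)
        for lang, score in detector_scores.items()
    }
    total = sum(weighted.values())
    if total == 0:
        # Nothing to reweight; fall back to the detector's own scores.
        return dict(detector_scores)
    return {lang: w / total for lang, w in weighted.items()}


# Invented numbers: the detector slightly prefers English, but the event
# is expected to be mostly Swahili, so the bias tips the decision.
scores = {"en": 0.5, "sw": 0.4, "fr": 0.1}
prior = {"sw": 0.6, "en": 0.4}
biased = apply_language_bias(scores, prior)
print(max(biased, key=biased.get))
```

Short, mixed-language reports are where a detector is least certain, so that is where a prior of this kind would have the most effect.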
![Page 17: Data Science for Social Good and Ushahidi](https://reader037.vdocuments.mx/reader037/viewer/2022110306/554e1c35b4c9056b798b4a6c/html5/thumbnails/17.jpg)
4) Near-Duplicate Detection
Currently:
• SimHash compares distances of message text hashes efficiently
Future:
• Evaluate quality more rigorously
• Explore other methods