have data? what now?!
DESCRIPTION
A brief overview of common data analysis problems and algorithms.
TRANSCRIPT
Have Data? What now?!
Hilary Mason
@hmason
(Focused) Data == Intelligence
Common Problems
Gathering data
Parsing, Entity Extraction and Disambiguation
Clustering
Document classification
NLP
Text is MESSY
Do you need to parse it?
Parsing unstructured data is hard. (we’ll get to this)
CHEAT.
Open Calais (www.opencalais.com) currently supports:
Anniversary, City, Company, Continent, Country, Currency, EmailAddress, EntertainmentAwardEvent, Facility, FaxNumber, Holiday, IndustryTerm, MarketIndex, MedicalCondition, MedicalTreatment, Movie, MusicAlbum, MusicGroup, NaturalFeature, OperatingSystem, Organization, Person, PhoneNumber, Position, Product, ProgrammingLanguage, ProvinceOrState, PublishedMedium, RadioProgram, RadioStation, Region, SportsEvent, SportsGame, SportsLeague, Technology, TVShow, TVStation, URL
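To make the idea of typed entities concrete, here is a minimal sketch that pulls two of the listed types (EmailAddress and URL) out of raw text with regular expressions. It does not call Open Calais and is not how that service works; the patterns, the extract_entities name, and the sample sentence are all assumptions made for illustration.

```python
import re

# Deliberately simple patterns; real email and URL grammars are far messier.
PATTERNS = {
    "EmailAddress": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "URL": re.compile(r"(?:https?://|www\.)[^\s,;]+"),
}


def extract_entities(text):
    """Return a sorted list of (entity_type, matched_text) pairs found in text."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Strip a sentence-ending period that the pattern may have swallowed.
            found.append((entity_type, match.group().rstrip(".")))
    return sorted(found)


sample = "Write to jobs@example.com or see www.example.com for details."
entities = extract_entities(sample)
```

Most of the other types in the list (Person, Company, MedicalCondition) cannot be captured by patterns like these, which is the argument for cheating with an existing service.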
Entity Disambiguation
This is important.
ME
UGLY HAG
Company disambiguation is a very common problem – Are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?
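A first pass at the question above is often plain string normalization. The sketch below lowercases the name, drops trailing legal suffixes, and consults a small alias table. LEGAL_SUFFIXES, ALIASES, and canonical_company are hypothetical names, and the alias data is hand-written for this example rather than taken from the talk.

```python
import re

# Hypothetical lookup data for illustration; a real system would curate or learn it.
LEGAL_SUFFIXES = {"corporation", "corp", "inc", "incorporated", "llc", "ltd", "co"}
ALIASES = {"ms": "microsoft"}


def canonical_company(name):
    """Map a raw company string to a rough canonical key."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    # Drop trailing legal suffixes such as "Corporation" or "Inc".
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    key = " ".join(tokens)
    return ALIASES.get(key, key)


names = ["Microsoft", "Microsoft Corporation", "MS"]
keys = {canonical_company(n) for n in names}
```

Here all three spellings collapse to the single key "microsoft". Normalization alone cannot tell apart two different companies that share a short name, which is where extra context comes in.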
A Practical Approach – Path101
Human classification
Data APIs
Automatic classification model
Example: Company Name
External data from Open Calais, Freebase
Based on industry, location, and type of job, we can differentiate between MS Volt (Microsoft) and Volt (Volt Information Sciences, Inc.)
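One way to picture this kind of context-based disambiguation is to score each candidate company by how many attributes of a job posting match its profile. This is a rough sketch and not the Path101 model; the CANDIDATES profiles, their attribute values, and the disambiguate function are hypothetical and exist only to show the mechanism.

```python
# Hypothetical candidate profiles; the attribute values are illustrative only.
CANDIDATES = {
    "Microsoft": {
        "industry": "software",
        "location": "redmond",
        "job_type": "engineering",
    },
    "Volt Information Sciences, Inc.": {
        "industry": "staffing",
        "location": "new york",
        "job_type": "contract",
    },
}


def disambiguate(context, candidates=CANDIDATES):
    """Pick the candidate whose profile shares the most attribute values with context."""
    def score(profile):
        return sum(1 for field, value in context.items() if profile.get(field) == value)

    return max(candidates, key=lambda name: score(candidates[name]))


# A posting that matches Microsoft on two attributes and Volt on one.
posting = {"industry": "software", "location": "redmond", "job_type": "contract"}
best = disambiguate(posting)
```

A trained classifier would weight these signals instead of counting them equally, but the inputs are the same: industry, location, and type of job.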
SPAM sucks
Supervised Classification
[Diagram] Training: Training Data → Feature Extractor → Trained Classifier
[Diagram] Prediction: Text → Feature Extractor → Trained Classifier → label (Cats, Dogs, or Fire)
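The pipeline in the diagram can be sketched end to end in a few lines of standard-library Python. This is a simplified naive Bayes over word-presence features with add-one smoothing, written for illustration; the training sentences are invented, and the function names (extract_features, train, classify) are assumptions rather than anything from the talk.

```python
import math
from collections import Counter, defaultdict


def extract_features(text):
    """Feature extractor: the set of lowercase words in the text."""
    return set(text.lower().split())


def train(labeled_texts):
    """Count label frequencies and per-label feature frequencies."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for text, label in labeled_texts:
        label_counts[label] += 1
        for feature in extract_features(text):
            feature_counts[label][feature] += 1
    return label_counts, feature_counts


def classify(text, model):
    """Return the label with the highest naive Bayes log score."""
    label_counts, feature_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)
        for feature in extract_features(text):
            # Add-one smoothing so unseen words do not zero out a label.
            score += math.log((feature_counts[label][feature] + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration.
training_data = [
    ("purr meow whiskers", "cats"),
    ("meow nap purr", "cats"),
    ("bark fetch woof", "dogs"),
    ("woof bark leash", "dogs"),
    ("smoke flames burn", "fire"),
    ("flames heat smoke", "fire"),
]
model = train(training_data)
prediction = classify("loud woof and bark", model)
```

The same feature extractor runs at training time and at prediction time, which is the point the diagram makes by drawing it twice.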
Classification Example: Movie Reviews!
[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.'], ...]
…tagged ‘positive’ and ‘negative’.
#!/usr/bin/env python
# encoding: utf-8
"""
classification_example.py
"""
import random

import nltk
from nltk.corpus import movie_reviews


def document_features(document, word_features):
    # One boolean feature per vocabulary word: does the document contain it?
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features


def main():
    # Use the 2000 most frequent words in the corpus as the feature vocabulary.
    all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    word_features = [w for (w, _) in all_words.most_common(2000)]

    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]
    # Shuffle so the held-out test set mixes positive and negative reviews.
    random.shuffle(documents)

    featuresets = [(document_features(d, word_features), c) for (d, c) in documents]
    train_set, test_set = featuresets[100:], featuresets[:100]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    print(nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(20)


if __name__ == '__main__':
    main()
Clustering
immunity
ultrasound
medical imaging
medical devices
thermoelectric devices
fault-tolerant circuits
low power devices
Hierarchical Clustering
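A bottom-up (agglomerative) version of hierarchical clustering can be sketched on the topic labels from the previous slide. The sketch assumes word-overlap (Jaccard) similarity and single linkage, and stops merging below a threshold of 0.2; those choices, and the function names, are illustrative assumptions, not the method used in the talk.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    return len(a & b) / len(a | b)


def cluster_similarity(c1, c2):
    """Single linkage: similarity of the closest pair of members."""
    return max(jaccard(set(x.split()), set(y.split())) for x in c1 for y in c2)


def agglomerate(items, threshold=0.2):
    """Repeatedly merge the two most similar clusters until none reach threshold."""
    clusters = [[item] for item in items]
    merges = []
    while len(clusters) > 1:
        sim, i, j = max(
            ((cluster_similarity(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[0],
        )
        if sim < threshold:
            break
        # Record the merge; the sequence of merges is the hierarchy.
        merges.append((clusters[i], clusters[j], sim))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


topics = [
    "immunity", "ultrasound", "medical imaging", "medical devices",
    "thermoelectric devices", "fault-tolerant circuits", "low power devices",
]
clusters, merges = agglomerate(topics)
```

With these labels the "medical" and "devices" topics chain into one cluster while immunity, ultrasound, and fault-tolerant circuits stay on their own, since they share no words with anything else. Word overlap is a crude similarity; it is used here only because it needs no extra data.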
<3 Data
Thank you!