Have Data? What now?! Hilary Mason @hmason

Upload: hilary-mason

Post on 27-Jan-2015

Category:

Technology


DESCRIPTION

A brief overview of common data analysis problems and algorithms.

TRANSCRIPT

Page 1: Have data? What now?!

Have Data? What now?!

Hilary Mason @hmason

Page 2: Have data? What now?!

(Focused) Data == Intelligence

Page 3: Have data? What now?!

Common Problems

Gathering data
Parsing, Entity Extraction and Disambiguation
Clustering
Document classification
NLP

Page 4: Have data? What now?!

Text is MESSY

Page 5: Have data? What now?!

Do you need to parse it?
Parsing unstructured data is hard. (We’ll get to this.)

CHEAT.

Open Calais (www.opencalais.com) currently supports:

Anniversary, City, Company, Continent, Country, Currency, EmailAddress, EntertainmentAwardEvent, Facility, FaxNumber, Holiday, IndustryTerm, MarketIndex, MedicalCondition, MedicalTreatment, Movie, MusicAlbum, MusicGroup, NaturalFeature, OperatingSystem, Organization, Person, PhoneNumber, Position, Product, ProgrammingLanguage, ProvinceOrState, PublishedMedium, RadioProgram, RadioStation, Region, SportsEvent, SportsGame, SportsLeague, Technology, TVShow, TVStation, URL

Page 6: Have data? What now?!

Entity Disambiguation

This is important.

Page 7: Have data? What now?!

[Images labeled: ME, UGLY HAG]

Page 8: Have data? What now?!

Entity Disambiguation

This is important.

Company disambiguation is a very common problem – Are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?
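One simple starting point for the question on this slide is an alias table consulted after light normalization. The sketch below is only an illustration: the `ALIASES` table, the `SUFFIXES` list, and both function names are hypothetical and not taken from the talk.

```python
import re

# Hypothetical alias table: normalized surface form -> canonical company name.
ALIASES = {
    "microsoft": "Microsoft Corporation",
    "microsoft corporation": "Microsoft Corporation",
    "ms": "Microsoft Corporation",
}

# Legal suffixes dropped before lookup (illustrative and incomplete).
SUFFIXES = {"inc", "corp", "llc", "ltd", "co"}


def normalize(name):
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def canonical_company(name):
    """Return the canonical name, or None if the alias table has no match."""
    return ALIASES.get(normalize(name))
```

With this table, `canonical_company("Microsoft")`, `canonical_company("Microsoft Corp.")`, and `canonical_company("MS")` all resolve to the same entry, while an unknown name returns `None`. A table like this cannot tell two different companies with the same short name apart; that needs context.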

Page 9: Have data? What now?!

A Practical Approach – Path101

Human classification

Data APIs

Automatic classification

model

Example: Company Name

External data from Open Calais, Freebase

Based on industry, location, and type of job, we can differentiate between MS Volt (Microsoft) and Volt (Volt Information Sciences, Inc.)
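As a rough illustration of using context to choose between candidates, the sketch below scores each candidate by how many context attributes match. It is not Path101's actual model: the profile values in `CANDIDATES` are invented placeholders, and `disambiguate` is a hypothetical helper.

```python
# Hypothetical candidate profiles. The attribute values are invented for
# illustration; real ones might come from the external data sources above.
CANDIDATES = {
    "Microsoft": {
        "industry": "software",
        "location": "redmond",
        "job_type": "contract",
    },
    "Volt Information Sciences, Inc.": {
        "industry": "staffing",
        "location": "new york",
        "job_type": "full-time",
    },
}


def disambiguate(context, candidates=CANDIDATES):
    """Pick the candidate sharing the most attribute values with the context.

    Returns None when no candidate matches on any attribute.
    """
    best_name, best_score = None, 0
    for name, profile in candidates.items():
        score = sum(1 for key, value in profile.items()
                    if context.get(key) == value)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

For example, `disambiguate({"industry": "software", "location": "redmond"})` picks the Microsoft profile. A trained classifier over the same attributes would replace the hand-counted score in a real system.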

Page 10: Have data? What now?!

SPAM sucks

Page 11: Have data? What now?!

Supervised Classification

[Diagram: Training Data → Feature Extractor → Trained Classifier; Text → Feature Extractor → Trained Classifier → Cats / Dogs / Fire]
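The pipeline on this slide (training data and new text both pass through a feature extractor, and a trained classifier assigns labels such as Cats, Dogs, or Fire) can be sketched in plain Python. This is a simplified Naive Bayes that scores only the words present in the text; the training sentences and all function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def extract_features(text):
    """Feature extractor: the set of lowercase words in the text."""
    return set(text.lower().split())


def train(labeled_texts):
    """Count, per label, how many training documents contain each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in labeled_texts:
        label_counts[label] += 1
        for word in extract_features(text):
            word_counts[label][word] += 1
    return label_counts, word_counts


def classify(model, text):
    """Return the label with the highest add-one-smoothed log score."""
    label_counts, word_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for word in extract_features(text):
            # Smoothed estimate of P(word present | label).
            score += math.log((word_counts[label][word] + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration.
TRAINING_DATA = [
    ("the kitten purred and meowed", "Cats"),
    ("my cat chased a mouse", "Cats"),
    ("the puppy barked at the mailman", "Dogs"),
    ("my dog fetched the stick", "Dogs"),
    ("smoke and flames filled the room", "Fire"),
    ("the flames burned all night", "Fire"),
]

model = train(TRAINING_DATA)
```

Calling `classify(model, "the cat meowed")` runs the new text through the same feature extractor used in training, which is the point of the diagram: both paths must share one feature representation.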

Page 12: Have data? What now?!

Classification Example: Movie Reviews!

[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.'], ...]

…tagged ‘positive’ and ‘negative’.

Page 13: Have data? What now?!

#!/usr/bin/env python
# encoding: utf-8
"""
classification_example.py
"""

# Requires the NLTK movie_reviews corpus: nltk.download('movie_reviews')
import random

import nltk
from nltk.corpus import movie_reviews


def document_features(document, word_features):
    # One boolean feature per vocabulary word: does the document contain it?
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features


def main():
    # Use the 2000 most frequent words in the corpus as the feature vocabulary.
    all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    word_features = [word for word, _ in all_words.most_common(2000)]

    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]
    # Shuffle so the 100 held-out documents mix both categories.
    random.shuffle(documents)

    featuresets = [(document_features(d, word_features), c) for (d, c) in documents]
    train_set, test_set = featuresets[100:], featuresets[:100]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    print(nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(20)


if __name__ == '__main__':
    main()

Page 14: Have data? What now?!

Clustering

[Diagram: clusters of topic labels]

immunity
ultrasound
medical imaging
medical devices
thermoelectric devices
fault-tolerant circuits
low power devices

Page 15: Have data? What now?!

Hierarchical Clustering
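A minimal sketch of agglomerative (bottom-up) hierarchical clustering, applied to the topic labels from the previous slide. The distance measure (Jaccard distance over words), the single-linkage rule, and the stopping threshold are assumptions chosen for brevity, not the method used in the talk.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the word sets of two labels."""
    words_a, words_b = set(a.split()), set(b.split())
    return 1 - len(words_a & words_b) / len(words_a | words_b)


def single_linkage(cluster_a, cluster_b):
    """Cluster distance = distance between the two closest members."""
    return min(jaccard_distance(a, b) for a in cluster_a for b in cluster_b)


def agglomerate(items, threshold):
    """Merge the two closest clusters until none are closer than threshold."""
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        pairs = [(single_linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        distance, i, j = min(pairs)
        if distance >= threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


TOPICS = ["immunity", "ultrasound", "medical imaging", "medical devices",
          "thermoelectric devices", "fault-tolerant circuits",
          "low power devices"]

# A threshold of 1.0 merges only labels that share at least one word.
clusters = agglomerate(TOPICS, threshold=1.0)
```

Stopping at a threshold gives one flat cut of the hierarchy; recording each merge instead would give the full tree. Note that word overlap chains "thermoelectric devices" and "low power devices" into the same group as the medical labels through the shared word "devices", which shows how much the result depends on the distance measure.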

Page 16: Have data? What now?!
Page 17: Have data? What now?!

<3 Data

Thank you!