We love NLTK
NLTK + Data Matching? Yep!
[('We', 'PRP'), ('<3', 'VBP'), ('NLTK', 'NNP')]
Dhiana Deva | Gabriel Fonseca
Data Matching @ UFRJ
“NLTK” == “Natural Language ToolKit”
+ Python library for NLP
+ Created in 2001 at University of Pennsylvania
+ Very extensive
+ Many examples
+ Built-in support for 84 datasets (today!)
+ Great documentation
+ Open source ;)
+ Active community
Lots of modules!
corpus: standardized interfaces to corpora and lexicons
tokenize: tokenizers!
stem: stemmers!
collocations: t-test, chi-squared, point-wise mutual information
classify: decision tree, maximum entropy, naive Bayes
cluster: EM, k-means
chunk: regular expression, n-gram, named-entity
metrics: distances, precision, recall, agreement coefficients
probability: frequency distributions, smoothed probability distributions
parse: chart, feature-based, unification, probabilistic, dependency
tag: part-of-speech tagging, n-gram, backoff, Brill, HMM, TnT
...
Can I haz Data Matching?
☑ Accuracy score
☑ Precision score
☑ Recall score
☑ F-measure score
☐ Reduction ratio
☑ Stop-words (11 languages)
★ Punkt sentence tokenizer
★ Punkt word tokenizer
☑ N-gram (words and chars)
☑ Tf-idf
☑ Levenshtein distance
☑ Damerau-Levenshtein distance
☑ Binary distance... Durr!
★ Krippendorff's distance
★ Masi distance
☑ Jaccard distance
☐ Jaro distance
☐ Jaro-Winkler distance
☐ Monge-Elkan distance
☐ Soundex
☐ Phonex
☐ NYSIIS
☐ ONCA
☐ Double-Metaphone
☐ Fuzzy Soundex
☑ Decision tree
☑ SVM
☑ Naive Bayes
★ MaxEnt
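To make a few of the checked items concrete, here is a minimal pure-Python sketch of Levenshtein distance, character n-grams and Jaccard distance. It is not NLTK's code: NLTK ships its own versions (`nltk.metrics.distance.edit_distance`, `nltk.metrics.distance.jaccard_distance`, `nltk.util.ngrams`). The function names below and the rule that two empty sets have distance 0.0 are assumptions made for this illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def char_ngrams(text, n=2):
    """Set of character n-grams of text (bigrams by default)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_distance(set_a, set_b):
    """1 - |A & B| / |A | B|. Assumption: two empty sets count as identical."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


print(levenshtein("queen", "qween"))
print(jaccard_distance(char_ngrams("queen"), char_ngrams("qween")))
```

Levenshtein works on the raw strings, while Jaccard compares sets, so the strings are first cut into character bigrams.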
Fun fun fun!
Sentiment analysis
Spelling correction
Spam detection
Topic modeling
Recommender systems
Data deduplication
Why not song matching?!
Grooveshark: online music streaming service
Songs uploaded by record labels, independent artists and users
Lots of duplicates!
Tinysong: Grooveshark’s open RESTful API
Our goal: No repeated songs!
(remixes and live versions are okay!)
Bohemian Rhapsody by Qween-?!
{
"Url": "http:\/\/tinysong.com\/PBCJ",
"SongID": 33834073,
"SongName": "Bohemian Rhapsody",
"ArtistID": 2324,
"ArtistName": "Queen",
"AlbumID": 1071492,
"AlbumName": "Greatest Hits"
},
...
{
"Url": "http:\/\/tinysong.com\/CYxG",
"SongID": 28835215,
"SongName": "Bohemian Rhapsody",
"ArtistID": 1731732,
"ArtistName": "Qween -",
"AlbumID": 2364353,
"AlbumName": "A Night at the Opera"
}
...
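One way the two results above could be flagged as the same song is sketched below. It is an illustration, not the matcher used in the talk: the `normalize`, `similarity` and `is_duplicate` helpers and the 0.75 threshold are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for the distance metrics listed earlier.

```python
import re
from difflib import SequenceMatcher

# The two Tinysong results shown above, reduced to the fields compared here.
records = [
    {"SongID": 33834073, "SongName": "Bohemian Rhapsody", "ArtistName": "Queen"},
    {"SongID": 28835215, "SongName": "Bohemian Rhapsody", "ArtistName": "Qween -"},
]


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def similarity(a, b):
    """Similarity in [0, 1] between the normalized strings (difflib ratio)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_duplicate(r1, r2, threshold=0.75):
    """Hypothetical rule: title and artist must both be similar enough."""
    return (similarity(r1["SongName"], r2["SongName"]) >= threshold
            and similarity(r1["ArtistName"], r2["ArtistName"]) >= threshold)


print(similarity("Queen", "Qween -"))
print(is_duplicate(records[0], records[1]))
```

This sketch does nothing special for remixes or live versions. A title such as "Bohemian Rhapsody (Live)" would still score high against the studio title, so keeping those apart needs an extra rule.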
Next steps
Other textual data
Machine learning
Acoustic features: loudness, BPM, liveness
Acoustic fingerprinting for supervised learning
Yes, songs have fingerprints too!
Our “sentiment”
+ Quick and easy!
+ Exteeeeeeeeeeeeeeeeensive!
+ Docs & community!
+ Internationalization
- Time performance
- Memory usage
- No online or active learning
Want more?!
+ jellyfish: Jaro-Winkler, Hamming, Soundex, Metaphone, NYSIIS, …
+ nltk-trainer: Command-line NLTK classifiers!
+ scikit-learn: More machine learning! Memory efficient!
+ pattern: Web mining. Out-of-the-box!
+ gensim: Topic modeling. Out-of-the-box!
References
http://www.nltk.org/
http://www.nltk.org/book/
http://streamhacker.com/
http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/
http://developers.grooveshark.com/tuts/tinysong
https://github.com/sunlightlabs/jellyfish
https://github.com/japerk/nltk-trainer
http://scikit-learn.org/stable/
http://www.clips.ua.ac.be/pattern
http://radimrehurek.com/gensim/
Thanks! ;)