icpsr - complex systems models in the social sciences - lecture 7 - professor daniel martin katz...

Post on 14-Dec-2014

ICPSR, July 30, 2013

NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING

Let’s start with some text.

“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.” (Bloomberg article on Sandy)

NATURAL LANGUAGE PROCESSING

© Bommarito Consulting

Real Data

•  When we work with real data, we often need to pre-process and clean the data before we can segment and tokenize.
•  Consider, for example:
    •  Hand-written documents: OCR
    •  Digital formats: PDF, Word, WordPerfect, HTML
    •  Typesetting remnants, e.g., page breaks, line-break hyphens
•  Pre-processing is very important! All subsequent work depends on its quality.
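As a rough illustration of cleaning typesetting remnants, the sketch below rejoins words hyphenated across line breaks and turns page breaks into line breaks. The function name `clean_text` and the sample string are assumptions made for this example, and the hyphen rule is deliberately naive: it would also merge a genuine compound such as "life-threatening" if it happened to break across lines.

```python
import re

def clean_text(raw: str) -> str:
    """Undo a few common typesetting remnants before segmentation (illustrative only)."""
    text = raw.replace("\f", "\n")                 # form feeds (page breaks) become line breaks
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # rejoin words hyphenated across a line break
    text = re.sub(r"[ \t]+", " ", text)            # collapse runs of spaces and tabs
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)   # single line breaks become spaces; blank lines survive
    return text.strip()

# Hypothetical raw input with a line-break hyphen, a doubled space, and a paragraph break.
raw = ("Hurricane Sandy grounded 3,200 flights sched-\nuled for today\n"
       "and  tomorrow.\n\nThe system may knock out power.")
cleaned = clean_text(raw)
print(cleaned)
```

Blank lines are kept on purpose so that paragraph segmentation can still find them later.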

What kind of questions can we ask?

•  Basic
    •  What is the structure of the text?
        •  Paragraphs
        •  Sentences
        •  Tokens/words
    •  What are the “words” that appear in this text?
        •  Nouns
            •  Subjects
            •  Direct objects
            •  …
        •  Verbs
•  Advanced
    •  What are the concepts that appear in this text?
    •  How does this text compare to other text?

Segmentation and Tokenization

“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”

•  Segment types
    •  Paragraphs
    •  Sentences
    •  Tokens

Segmentation and Tokenization

But how does it work?

•  Paragraphs
    •  Two consecutive line breaks
    •  A hard line break followed by an indent
•  Sentences
    •  Period, except abbreviation, ellipsis within quotation, etc.
•  Tokens and Words
    •  Whitespace
    •  Punctuation

Remember what real-world text looks like – think text messages and email.
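The rules above can be sketched with a few regular expressions. This is a minimal illustration, not how a production tokenizer works: the function names and the sample text are assumptions, and the sentence rule will split wrongly after abbreviations such as "U.S. Navy", which is exactly the kind of exception the slide warns about.

```python
import re

def paragraphs(text):
    # Two or more consecutive line breaks separate paragraphs.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def sentences(paragraph):
    # Split after ., ! or ? when followed by whitespace and an uppercase
    # letter or an opening quote. Abbreviations are NOT handled.
    return re.split(r'(?<=[.!?])\s+(?=[A-Z“"])', paragraph)

def tokens(sentence):
    # In order: numbers with internal commas/periods (optional leading $),
    # single digits, words with internal hyphens/apostrophes, single punctuation marks.
    return re.findall(r"\$?\d[\d,.]*\d|\$?\d|\w+(?:[-'’]\w+)*|[^\w\s]", sentence)

# Hypothetical two-paragraph sample loosely based on the article text.
text = ("Hurricane Sandy grounded 3,200 flights.\n\n"
        "The system may inflict $18 billion in damage. Power may be out for a week.")

paras = paragraphs(text)
sents = [s for p in paras for s in sentences(p)]
first_tokens = tokens(sents[0])
print(len(paras), len(sents), first_tokens)
```

Note that this tokenizer keeps punctuation marks as separate tokens, while "3,200" and "$18" stay whole; other tools make different choices.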

Segmentation and Tokenization

“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”

•  Paragraphs: 2
•  Sentences: 2
•  Words: 561
    •  ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]

What kind of questions can we ask?

We now have an ordered list of tokens.

['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]

•  Does the phrase “quote stuffing” occur in the text?
•  How many times does “Sandy” occur?
•  How often does “outage” occur after “power”?
•  What percentage of tokens are numbers?
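Each of these questions reduces to a short operation on the ordered token list. The sketch below uses only the ten tokens shown on the slide; the helper names (`contains_phrase`, `count_after`, `is_number`) are made up for this example, and "after" is interpreted as "immediately after".

```python
import re

# The first ten tokens shown above; the real list continues.
tokens = ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights',
          'scheduled', 'for', 'today', 'and', 'tomorrow']

def contains_phrase(tokens, phrase):
    """True if the words of `phrase` appear consecutively in `tokens`."""
    words = phrase.split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

def count_after(tokens, first, second):
    """Number of times `second` immediately follows `first`."""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second)

def is_number(token):
    """Digits with optional thousands separators, decimals, and a leading $."""
    return re.fullmatch(r"\$?\d[\d,]*(?:\.\d+)?", token) is not None

has_phrase = contains_phrase(tokens, "quote stuffing")
sandy_count = tokens.count("Sandy")
outage_after_power = count_after(tokens, "power", "outage")
percent_numbers = 100 * sum(is_number(t) for t in tokens) / len(tokens)

print(has_phrase, sandy_count, outage_after_power, percent_numbers)
```

The phrase and adjacency questions depend on token order, which matters for the storage discussion that follows.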

An Aside on Storage

Data: The word ‘the’ ten times and the word ‘a’ ten times.

•  Representation 1 - Ordered List:
    •  [‘the’, ‘a’, ‘the’, ‘a’, ‘the’, ‘a’, …]
•  Representation 2 - Term Frequency (Frequency Map):
    •  [(‘the’, 10), (‘a’, 10)]
•  Tradeoffs
    •  Total space
    •  Ease of answering certain questions
    •  Information about context
•  Not all software makes the same choice!
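The tradeoff can be seen directly in a small sketch. It assumes the twenty tokens alternate as on the slide and uses Python's `collections.Counter` as one possible frequency map; the variable names are illustrative.

```python
from collections import Counter

# Representation 1: ordered list (assumed to alternate 'the', 'a', ...).
ordered = ['the', 'a'] * 10

# Representation 2: frequency map built from the list.
freq = Counter(ordered)

# Total space: 20 stored items versus 2 entries.
sizes = (len(ordered), len(freq))

# Frequency questions can be answered by both; the map does it with one lookup.
the_count_list = ordered.count('the')
the_count_map = freq['the']

# Context questions need the order, so only the list can answer them:
# how often does 'a' come immediately after 'the'?
a_after_the = sum(1 for x, y in zip(ordered, ordered[1:]) if (x, y) == ('the', 'a'))

print(sizes, the_count_list, the_count_map, a_after_the)
```

Going from the list to the map is a one-way trip: the map cannot be turned back into the original sequence.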

Stopwording, Stemming, Parsing, and Tagging

•  Stopwording
    •  Removing “filler” words like prepositions, auxiliary or infinitive verbs, and conjunctions.
•  Stemming
    •  Matching declined nouns like dog/dogs or child/children.
    •  Matching conjugated verbs like run/ran.
•  Parsing
    •  Determining the “structure” of a sentence, typically as represented by a grade school sentence diagram (requires grammar definition; we’ll skip).
•  Tagging
    •  Identifying the part of speech of each token in a sentence.

Stopwording

Original:

Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.

After stopwording:

Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New York suspend subway bus service forced evacuation New Jersey shore headed toward land life-threatening wind rain. System, killed many 65 people Caribbean path north, may capable inflicting much $18 billion damage barrels New Jersey tomorrow knock power millions week, according forecasters risk experts.
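Stopword removal is essentially a filter against a word list. The sketch below uses a tiny hand-picked set, `STOPWORDS`, which is an assumption for this example and not the list used to produce the slide; real toolkits ship much longer lists.

```python
# Hypothetical, deliberately small stopword list for illustration.
STOPWORDS = {"for", "and", "to", "the", "of", "as", "it", "with",
             "which", "in", "on", "its", "be", "when", "into", "a", "or"}

def remove_stopwords(tokens):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow".split()
filtered = remove_stopwords(tokens)
print(filtered)
```

The comparison is case-insensitive so that "The" at the start of a sentence is removed along with "the".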

Stopwording + Stemming

Original:

Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.

After stopwording and stemming:

Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten wind rain. System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion damag barrel New Jersey tomorrow knock power million week, accord forecast risk expert.
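To show the idea behind stemming, here is a crude suffix stripper. It is not the stemmer that produced the output above (that output looks like a full rule-based stemmer, which has many more rules), and the suffix list and `min_stem` threshold are assumptions chosen for this sketch.

```python
# Suffixes are tried in this order; the first match wins.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word, min_stem=3):
    """Strip one common suffix if at least `min_stem` characters remain."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["grounded", "flights", "inflicting", "dogs", "ran", "children"]
stems = [crude_stem(w) for w in words]
print(stems)
```

Regular forms such as dog/dogs collapse to the same stem, but irregular pairs such as run/ran or child/children are left untouched; suffix stripping alone cannot match them, and they generally need a dictionary-based approach.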

Tagging

Original:

Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.

After tagging:

[('Hurricane', 'NNP'), ('Sandy', 'NNP'), ('grounded', 'VBD'), ('3,200', 'CD'), ('flights', 'NNS'), ('scheduled', 'VBN'), ('for', 'IN'), ('today', 'NN'), ('and', 'CC'), ('tomorrow', 'NN'), …]
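Once tokens carry part-of-speech tags, they can be filtered or counted by tag. The sketch below does not run a tagger; it simply works with the ten (token, tag) pairs shown above, whose labels follow Penn Treebank conventions (for example, NNP for proper noun and CD for cardinal number). Variable names are illustrative.

```python
from collections import Counter

# The first ten tagged tokens shown above.
tagged = [('Hurricane', 'NNP'), ('Sandy', 'NNP'), ('grounded', 'VBD'),
          ('3,200', 'CD'), ('flights', 'NNS'), ('scheduled', 'VBN'),
          ('for', 'IN'), ('today', 'NN'), ('and', 'CC'), ('tomorrow', 'NN')]

# How often does each part of speech occur?
tag_counts = Counter(tag for _, tag in tagged)

# Which tokens are proper nouns, and which are numbers?
proper_nouns = [word for word, tag in tagged if tag == 'NNP']
numbers = [word for word, tag in tagged if tag == 'CD']

print(tag_counts.most_common(2), proper_nouns, numbers)
```

With tags available, "what percentage of tokens are numbers?" no longer needs a hand-written number pattern.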

MACHINE LEARNING

•  Definition: Automated classification and prediction on data.
•  Examples:
    •  Product recommenders, à la Amazon
    •  Computer vision – is it a cat?
    •  Sentiment analysis
    •  Topic classification
    •  Document clustering
•  A classification problem has at least two stages:
    •  Training
    •  Classification

Learning

•  Machine learning requires “learning” or “training.”
•  There are two types of training:
    •  Supervised
    •  Unsupervised
•  The goal of training is to determine a mapping from input features to a set of target classes.

Learning

Imagine a student given a small list of organisms and descriptions. The student is tasked with assigning the organisms to groups based on these descriptions. Where do the groups come from?

•  Supervised: The teacher provides the answers while learning.
•  Unsupervised: The teacher provides nothing while learning.

In our example, the teacher will typically provide the “canonical” domains and kingdoms of biology. However, most real-world problem domains are not so well-studied.

Learning

What if the teacher gave the student some of the answers? This is semi-supervised learning.

•  Supervised: The teacher provides the answers while learning.
•  Semi-supervised: The teacher provides some answers while learning.
•  Unsupervised: The teacher provides nothing while learning.

Classification

The student has now learned to map from an organism’s description to a group. Now, the student is sent out into the field to use their knowledge to classify newly discovered organisms. They observe the organisms and document the features they learned to use. Then, they apply the learned rules to determine the class of organism.
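The two stages, training and then classification, can be sketched with a very simple supervised learner. The example below uses a nearest-centroid rule, chosen here only because it is short; the features (number of legs, whether the organism has wings), the labels, and the data are all hypothetical.

```python
from collections import defaultdict
from math import dist

def train(examples):
    """Training stage: average the feature vectors of each labeled group."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {label: tuple(sum(col) / len(rows) for col in zip(*rows))
            for label, rows in grouped.items()}

def classify(model, features):
    """Classification stage: pick the label whose centroid is closest."""
    return min(model, key=lambda label: dist(model[label], features))

# Hypothetical labeled examples: (legs, has_wings) -> group given by the "teacher".
training = [((6, 1), "insect"), ((6, 0), "insect"),
            ((4, 0), "mammal"), ((2, 0), "mammal")]

model = train(training)
new_a = classify(model, (5, 1))   # a newly observed organism
new_b = classify(model, (4, 0))
print(model, new_a, new_b)
```

The model here is nothing more than one average point per group; the labels supplied during training are what make this supervised.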

Replace the student with an algorithm and we have machine learning.

•  Sentiment Analysis Example
    •  Organisms: Restaurant reviews
    •  Descriptions:
        •  Number of positive phrases
        •  Number of negative phrases
        •  Number of times visited
        •  Number of restaurants reviewed
        •  Recency of review
    •  Target: 1-5 stars for restaurant sentiment
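Turning a review into "descriptions" means computing features from its text and metadata. The sketch below covers only the first two features, and it counts single words rather than phrases for brevity; the word lists `POSITIVE` and `NEGATIVE` and the sample review are invented for this example.

```python
# Hypothetical sentiment word lists; real lexicons are far larger.
POSITIVE = {"great", "delicious", "friendly"}
NEGATIVE = {"slow", "cold", "rude"}

def review_features(text):
    """Map a review's text to counts of positive and negative words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return {
        "positive": sum(w in POSITIVE for w in words),
        "negative": sum(w in NEGATIVE for w in words),
    }

review = "Great food and friendly staff, but the service was slow."
features = review_features(review)
print(features)
```

A classifier would then be trained on many such feature dictionaries paired with their 1-5 star targets.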

Retailers are doing this every day.

•  Purchasing Example
    •  Organism: Consumer
    •  Descriptions:
        •  How many products purchased of category A, B, …
        •  How many dollars spent on brand A, B, …
        •  How recently was an item purchased from category A, B, …
        •  How many visits to web pages in category A, B, …
    •  Target: Will they purchase in the next 30 days?
    •  Training: Look out-of-sample at purchasing database

Some Machine Learning Algorithms

•  Supervised
    •  Statistical models
        •  Bayesian, e.g., Naïve Bayes Classification
        •  Frequentist, e.g., Ordinary Least Squares
    •  Neural Networks (NN)
    •  Support Vector Machines (SVM)
    •  Random Forests (RF)
    •  Genetic Algorithms (GA)
•  Semi/unsupervised
    •  Neural Networks (NN)
    •  Clustering
        •  K-means
        •  Hierarchical
        •  Radial Basis (RBF)
        •  Graph
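As one concrete unsupervised example from the list, here is a minimal one-dimensional K-means sketch. No labels are supplied; the groups emerge from the data. The data points, the starting centroids, and the fixed iteration count are assumptions made to keep the example short and deterministic.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else old
                     for c, old in zip(clusters, centroids)]
    return centroids, clusters

# Hypothetical data with two obvious groups and arbitrary starting centroids.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

In practice the result depends on the starting centroids and on the choice of K, which is why library implementations usually try several random starts.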

Notes on Algorithm Diversity

•  Not all algorithms return scores; some are binary.
    •  True, True, False
    •  0.9, 0.7, 0.1
•  Not all algorithms support more than two classes.
    •  Cat, Dog, Mouse
    •  Cat, Not Cat
•  Not all algorithms scale similarly.
    •  1M documents = 1 day
    •  10M documents = {10 days, 100 days, 1000 days}
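Two of these notes can be made concrete with a little arithmetic. The sketch below assumes a threshold of 0.5 for turning scores into binary answers, and it reads the three runtimes as linear, quadratic, and cubic growth in the number of documents; both are interpretations for illustration, and the function names are made up.

```python
def to_binary(scores, threshold=0.5):
    """Collapse scores into True/False answers at an assumed threshold."""
    return [s >= threshold for s in scores]

def projected_days(base_docs, base_days, docs, exponent):
    """Runtime if cost grows as (number of documents) ** exponent."""
    return base_days * (docs / base_docs) ** exponent

labels = to_binary([0.9, 0.7, 0.1])

# 1M documents take 1 day; project the time for 10M documents.
projections = {name: projected_days(1_000_000, 1, 10_000_000, k)
               for name, k in [("linear", 1), ("quadratic", 2), ("cubic", 3)]}

print(labels, projections)
```

Scores carry more information than binary answers: 0.9 and 0.7 both become True, so the ranking between them is lost once the threshold is applied.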

THANKS!

•  Michael J Bommarito II
    •  CEO, Bommarito Consulting, LLC
    •  Email: michael@bommaritollc.com
    •  Web: http://bommaritollc.com/

You can get these slides on my blog – http://bommaritollc.com/blog/.

REFERENCES

•  Books and Wiki Pages
    •  A Brief Survey of Text Mining. Hotho, Nurnberger, Paaß.
        http://www.kde.cs.uni-kassel.de/hotho/pub/2005/hotho05TextMining.pdf
    •  Text Mining: Predictive Methods for Analyzing Unstructured Information. Weiss, Indurkhya, Zhang, Damerau.
        http://www.amazon.com/Text-Mining-Predictive-Unstructured-Information/dp/0387954333
    •  The Elements of Statistical Learning.
        http://www-stat.stanford.edu/~tibs/ElemStatLearn/
    •  Wiki – Machine Learning.
        http://en.wikipedia.org/wiki/Machine_learning
    •  Wiki – Machine Learning Algorithms.
        http://en.wikipedia.org/wiki/List_of_machine_learning_algorithms
•  Software
    •  Natural Language Toolkit (NLTK).
        http://nltk.org/
    •  Stanford NLP Group.
        http://nlp.stanford.edu/software/
    •  Weka.
        http://www.cs.waikato.ac.nz/ml/weka/
    •  R.
        http://www.r-project.org/
    •  SAS Predictive Analytics and Data Mining.
        http://www.sas.com/technologies/analytics/datamining/index.html
