Text Data Mining: Predictive and Exploratory Analysis of Text Sandeep Avula [email protected] August 22, 2018

Page 1: Text Data Mining - Bitbucket · 6 • Machine Learning: developing computer programs that improve their performance with “experience” • Data Mining: developing methods that

Text Data Mining: Predictive and Exploratory Analysis of Text

Sandeep Avula [email protected]

August 22, 2018

Outline

Introductions

What is Text Data Mining?

Predictive Analysis of Text: The Big Picture

Exploratory Analysis of Text: The Big Picture

Applications

• Hello, my name is ______.

• I’m in the ______ program.

• Prior experience with data?

• Any curious data problems you have thought of?

• I’m taking this course because I’d like to learn how to ______.

Introductions

Other relevant details

• Course website: https://asandeepc.bitbucket.io/courses/inls613_fall2018/

• Channel of communication: MS Teams. Would you all be ok with that?

• Office hours: By appointment.

• Pronouns: He/Him.

• Assignment submissions: Sakai.

• The science and practice of building and evaluating computer programs that automatically detect or discover interesting and useful things in collections of natural language text

What is Text Data Mining?

• Machine Learning: developing computer programs that improve their performance with “experience”

• Data Mining: developing methods that discover patterns within large structured datasets

• Statistics: developing methods for the interpretation of data and experimental outcomes in reaching conclusions with a certain degree of confidence

• Data Science: Wait, what is this? (I think it is …) An interdisciplinary field which encompasses: ML, Stats, and some form of humanities.

Related Fields

• Predictive Analysis of Text:

‣ developing computer programs that automatically recognize or detect a particular concept within a span of text

• Exploratory Analysis of Text:

‣ developing computer programs that automatically discover interesting and useful patterns or trends in text collections

Text Data Mining in this Course

Outline

Introductions

What is Text Data Mining?

Predictive Analysis of Text: The Big Picture

Exploratory Analysis of Text: The Big Picture

Applications

Predictive Analysis (example: recognizing triangles)

• We could imagine writing a “triangle detector” by hand:

‣ if shape has three sides, then shape = triangle.

‣ otherwise, shape = other

• Alternatively, we could use supervised machine learning!

Predictive Analysis (example: recognizing triangles)
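The hand-written rule above can be sketched in a few lines of Python. The `Shape` record and its field names are hypothetical, introduced only for illustration; the slides do not prescribe any data structure.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    """Hypothetical record for one shape; only `sides` matters to the rule."""
    color: str
    sides: float  # float so that a circle can be written as float("inf")


def detect_triangle(shape: Shape) -> str:
    """Hand-written rule: three sides means triangle, anything else is other."""
    return "triangle" if shape.sides == 3 else "other"


if __name__ == "__main__":
    for s in [Shape("red", 3), Shape("blue", 4), Shape("blue", float("inf"))]:
        print(s, "->", detect_triangle(s))
```

A rule like this needs no training data, but it only works because someone already knew which feature matters.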

Predictive Analysis (example: recognizing triangles)

[Diagram] training: labeled examples → machine learning algorithm → model

[Diagram] testing: new, unlabeled examples → model → predictions

Predictive Analysis (example: recognizing triangles)

[Diagram] training: labeled examples → machine learning algorithm → model

[Diagram] testing: new, unlabeled examples → model → predictions

What is the part that is missing?

HINT: It’s what most of this class will be about!

Predictive Analysis (representation: features)

color   size    #sides   equal sides   ...   label
red     big     3        no            ...   yes
green   big     3        yes           ...   yes
blue    small   inf      yes           ...   no
blue    small   4        yes           ...   no
...     ...     ...      ...           ...   ...
red     big     3        yes           ...   yes
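One way to hold the feature table above in code is a list of (features, label) pairs. The layout below is an assumption, not something the slides specify. The one-hot helper is included only to show how categorical values such as color could be turned into numbers.

```python
# Rows copied from the feature table; the "..." columns are omitted.
FEATURES = ["color", "size", "sides", "equal_sides"]

examples = [
    ({"color": "red", "size": "big", "sides": "3", "equal_sides": "no"}, "yes"),
    ({"color": "green", "size": "big", "sides": "3", "equal_sides": "yes"}, "yes"),
    ({"color": "blue", "size": "small", "sides": "inf", "equal_sides": "yes"}, "no"),
    ({"color": "blue", "size": "small", "sides": "4", "equal_sides": "yes"}, "no"),
    ({"color": "red", "size": "big", "sides": "3", "equal_sides": "yes"}, "yes"),
]


def one_hot_columns(rows):
    """Collect every (feature, value) pair seen in the data, in sorted order."""
    return sorted({(name, row[name]) for row, _ in rows for name in FEATURES})


def encode(row, columns):
    """Turn one feature dict into a 0/1 vector, one slot per (feature, value)."""
    return [1 if row[name] == value else 0 for name, value in columns]


if __name__ == "__main__":
    columns = one_hot_columns(examples)
    for row, label in examples:
        print(encode(row, columns), label)
```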

Predictive Analysis (example: recognizing triangles)

[Diagram] training: labeled examples → machine learning algorithm → model

[Diagram] testing: new, unlabeled examples → model → predictions

labeled examples:

color   size    sides   equal sides   ...   label
red     big     3       no            ...   yes
green   big     3       yes           ...   yes
blue    small   inf     yes           ...   no
blue    small   4       yes           ...   no
...     ...     ...     ...           ...   ...
red     big     3       yes           ...   yes

new, unlabeled examples:

color   size    sides   equal sides   ...   label
red     big     3       no            ...   ???
green   big     3       yes           ...   ???
blue    small   inf     yes           ...   ???
blue    small   4       yes           ...   ???
...     ...     ...     ...           ...   ???
red     big     3       yes           ...   ???

predictions:

color   size    sides   equal sides   ...   label
red     big     3       no            ...   yes
green   big     3       yes           ...   yes
blue    small   inf     yes           ...   no
blue    small   4       yes           ...   no
...     ...     ...     ...           ...   ...
red     big     3       yes           ...   yes
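The train-then-predict flow in the diagram can be sketched with a deliberately tiny learner. The slides do not name an algorithm at this point, so the sketch swaps in a one-rule learner (often called OneR) purely as an assumption. The names `train`, `predict`, `train_rows`, and `train_labels` are hypothetical.

```python
from collections import Counter, defaultdict

# Labeled examples copied from the table above ("..." columns omitted).
train_rows = [
    {"color": "red", "size": "big", "sides": "3", "equal_sides": "no"},
    {"color": "green", "size": "big", "sides": "3", "equal_sides": "yes"},
    {"color": "blue", "size": "small", "sides": "inf", "equal_sides": "yes"},
    {"color": "blue", "size": "small", "sides": "4", "equal_sides": "yes"},
    {"color": "red", "size": "big", "sides": "3", "equal_sides": "yes"},
]
train_labels = ["yes", "yes", "no", "no", "yes"]


def train(rows, labels):
    """Training: pick the single feature whose value -> majority-label rule
    makes the fewest mistakes on the labeled examples."""
    best = None
    for feature in rows[0]:
        by_value = defaultdict(Counter)
        for row, label in zip(rows, labels):
            by_value[row[feature]][label] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(rule[r[feature]] != y for r, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, feature, rule)
    fallback = Counter(labels).most_common(1)[0][0]  # used for unseen values
    return {"feature": best[1], "rule": best[2], "fallback": fallback}


def predict(model, row):
    """Testing: apply the learned rule to a new, unlabeled example."""
    return model["rule"].get(row[model["feature"]], model["fallback"])


if __name__ == "__main__":
    model = train(train_rows, train_labels)
    new_row = {"color": "green", "size": "big", "sides": "3", "equal_sides": "no"}
    print(model["feature"], predict(model, new_row))
```

Because several columns tie on these five rows, the rule that comes out depends on column order. The point of the sketch is the train-then-predict flow, not the particular rule.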

Predictive Analysis (basic ingredients)

1. Training data: a set of examples of the concept we want to automatically recognize

2. Representation: a set of features that we believe are useful in recognizing the desired concept

3. Learning algorithm: a computer program that uses the training data to learn a predictive model of the concept

Predictive Analysis (basic ingredients)

1. Training data: a set of examples of the concept we want to automatically recognize

2. Representation: a set of features that we believe are useful in recognizing the desired concept

3. Learning algorithm: a computer program that uses the training data to learn a predictive model of the concept

Highly influential!

Predictive Analysis (basic ingredients)

4. Model: a (mathematical) function that describes a predictive relationship between the feature values and the presence/absence of the concept

5. Test data: a set of previously unseen examples used to estimate the model’s effectiveness

6. Performance metrics: a set of statistics used to measure the predictive effectiveness of the model

Predictive Analysis (basic ingredients: the focus in this course)

1. Training data: a set of examples of the concept we want to automatically recognize

2. Representation: a set of features that we believe are useful in recognizing the desired concept

3. Learning algorithm: uses the training data to learn a predictive model of the “concept”

4. Model: describes a predictive relationship between feature values and the presence/absence of the concept

5. Test data: a set of previously unseen examples used to estimate the model’s effectiveness

6. Performance metrics: a set of statistics used to measure the predictive effectiveness of the model

Predictive Analysis (basic ingredients: the focus in this course)

Predictive Analysis (applications)

• Topic categorization

• Opinion mining

• Sentiment analysis

• Bias or viewpoint detection

• Discourse analysis (e.g., student retention)

• Forecasting and nowcasting

• Any other ideas?

Predictive Analysis (example: recognizing triangles)

[Diagram] training: labeled examples → machine learning algorithm → model

[Diagram] testing: new, unlabeled examples → model → predictions

Predictive Analysis (example: recognizing triangles)

[Diagram] training: labeled examples → machine learning algorithm → model

[Diagram] testing: new, unlabeled examples → model → predictions

labeled examples:

color   size    sides   equal sides   ...   label
red     big     3       no            ...   yes
green   big     3       yes           ...   yes
blue    small   inf     yes           ...   no
blue    small   4       yes           ...   no
...     ...     ...     ...           ...   ...
red     big     3       yes           ...   yes

new, unlabeled examples:

color   size    sides   equal sides   ...   label
red     big     3       no            ...   ???
green   big     3       yes           ...   ???
blue    small   inf     yes           ...   ???
blue    small   4       yes           ...   ???
...     ...     ...     ...           ...   ???
red     big     3       yes           ...   ???

predictions:

color   size    sides   equal sides   ...   label
red     big     3       no            ...   yes
green   big     3       yes           ...   yes
blue    small   inf     yes           ...   no
blue    small   4       yes           ...   no
...     ...     ...     ...           ...   ...
red     big     3       yes           ...   yes

What Could Possibly Go Wrong?

1. Bad feature representation

2. Bad data + misleading correlations

3. Noisy labels for training and testing

4. Bad learning algorithm

5. Misleading evaluation metric

Training data + Representation (what could possibly go wrong?)

Training data + Representation (what could possibly go wrong?)

color   size    90° angle   equal sides   ...   label
red     big     yes         no            ...   yes
green   big     no          yes           ...   yes
blue    small   no          yes           ...   no
blue    small   yes         yes           ...   no
...     ...     ...         ...           ...   ...
red     big     no          yes           ...   yes

Training data + Representation (what could possibly go wrong?)

color   size    90° angle   equal sides   ...   label
red     big     yes         no            ...   yes
green   big     no          yes           ...   yes
blue    small   no          yes           ...   no
blue    small   yes         yes           ...   no
...     ...     ...         ...           ...   ...
red     big     no          yes           ...   yes

1. Bad feature representation!

Training data + Representation (what could possibly go wrong?)

Training data + Representation (what could possibly go wrong?)

color   size    #sides   equal sides   ...   label
blue    big     3        no            ...   yes
blue    big     3        yes           ...   yes
red     small   inf      yes           ...   no
green   small   4        yes           ...   no
...     ...     ...      ...           ...   ...
blue    big     3        yes           ...   yes

Training data + Representation (what could possibly go wrong?)

2. Bad data + misleading correlations

color   size    #sides   equal sides   ...   label
blue    big     3        no            ...   yes
blue    big     3        yes           ...   yes
red     small   inf      yes           ...   no
green   small   4        yes           ...   no
...     ...     ...      ...           ...   ...
blue    big     3        yes           ...   yes
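A quick way to see the misleading correlation in this table is to ask which features, taken alone, line up perfectly with the label. The check below is an illustrative sketch. The helper name `perfectly_predictive_features` is hypothetical, and only the visible rows of the table are used.

```python
from collections import defaultdict

# Rows copied from the table above; the "..." columns are omitted.
training = [
    ({"color": "blue", "size": "big", "sides": "3", "equal_sides": "no"}, "yes"),
    ({"color": "blue", "size": "big", "sides": "3", "equal_sides": "yes"}, "yes"),
    ({"color": "red", "size": "small", "sides": "inf", "equal_sides": "yes"}, "no"),
    ({"color": "green", "size": "small", "sides": "4", "equal_sides": "yes"}, "no"),
    ({"color": "blue", "size": "big", "sides": "3", "equal_sides": "yes"}, "yes"),
]


def perfectly_predictive_features(rows):
    """Return the features whose every value co-occurs with exactly one label."""
    labels_seen = defaultdict(lambda: defaultdict(set))
    for features, label in rows:
        for name, value in features.items():
            labels_seen[name][value].add(label)
    return [
        name
        for name, by_value in labels_seen.items()
        if all(len(labels) == 1 for labels in by_value.values())
    ]


if __name__ == "__main__":
    print(perfectly_predictive_features(training))
```

On these rows color and size look exactly as informative as the number of sides, so a learner has no way to tell from the data alone that "blue" is a coincidence.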

Training data + Representation (what could possibly go wrong?)

Training data + Representation (what could possibly go wrong?)

3. Noisy training data!

color   size    #sides   equal sides   ...   label
white   big     3        no            ...   yes
white   big     3        no            ...   no
white   small   inf      yes           ...   yes
white   small   4        yes           ...   no
...     ...     ...      ...           ...   ...
white   big     3        yes           ...   yes

Learning Algorithm + Model (what could possibly go wrong?)

• Linear classifier

[Equation annotations] parameters learned by the model; predicted value (e.g., 1 = positive, 0 = negative)

Learning Algorithm + Model (what could possibly go wrong?)

• Linear classifier
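The equation on this slide did not survive extraction. A form consistent with the worked example later in the deck (a bias term w_0 plus one weight per feature f_j) might look like the following. The symbols \hat{y} and n, and the decision threshold at zero, are assumptions rather than something the slides state.

```latex
\text{output} = w_0 + \sum_{j=1}^{n} w_j f_j,
\qquad
\hat{y} =
\begin{cases}
1 \ (\text{positive}) & \text{if } \text{output} > 0 \\
0 \ (\text{negative}) & \text{otherwise}
\end{cases}
```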

Learning Algorithm + Model (what could possibly go wrong?)

test instance:

f_1   f_2   f_3
0.5   1     0.2

model parameters:

w_0   w_1   w_2   w_3
2     -5    2     1

output = 2.0 + (0.5 × -5.0) + (1.0 × 2.0) + (0.2 × 1.0)

output = 1.7

output prediction = positive
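The arithmetic in this worked example can be checked with a short script. The function names are hypothetical. The rule "positive when the output is above 0" is an assumption, since the slide only shows that an output of 1.7 is labeled positive.

```python
def linear_output(weights, features):
    """Return w_0 + sum_j w_j * f_j, where weights[0] is the bias w_0."""
    bias, feature_weights = weights[0], weights[1:]
    if len(feature_weights) != len(features):
        raise ValueError("need exactly one weight per feature")
    return bias + sum(w * f for w, f in zip(feature_weights, features))


def predict(weights, features, threshold=0.0):
    """Assumed decision rule: positive when the output exceeds the threshold."""
    return "positive" if linear_output(weights, features) > threshold else "negative"


if __name__ == "__main__":
    weights = [2.0, -5.0, 2.0, 1.0]  # w_0, w_1, w_2, w_3 from the slide
    features = [0.5, 1.0, 0.2]       # f_1, f_2, f_3 from the slide
    print(round(linear_output(weights, features), 2))
    print(predict(weights, features))
```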

(source: http://en.wikipedia.org/wiki/File:Svm_separating_hyperplanes.png)

Learning Algorithm + Model (what could possibly go wrong?)

Learning Algorithm + Model (what could possibly go wrong?)

• Would a linear classifier do well on positive (black) and negative (white) data that looks like this?

[Figure: scatter plot of the data; axes x1 and x2, each with ticks at 0.5 and 1.0]

Learning Algorithm + Model (what could possibly go wrong?)

[Figure: plot with axes x1 and x2, each with ticks at 0.5 and 1.0]

4. Bad learning algorithm!

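Learning Algorithm + Model (what could possibly go wrong?)

[Decision tree: the root tests X1 > 0.5; its yes and no branches each test X2 > 0.5; the four leaves are labeled pos or neg, and the two subtrees assign them in opposite ways]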
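A two-level tree of this shape can separate data that no single straight line can. The sketch below assumes, purely for illustration, that the positive points are those where exactly one of x1 and x2 exceeds 0.5. The original figure is not available, so both the four data points and the leaf labels are hypothetical.

```python
from itertools import product

# Hypothetical XOR-like data: positive when exactly one coordinate exceeds 0.5.
points = [
    ((0.25, 0.25), "neg"),
    ((0.75, 0.75), "neg"),
    ((0.25, 0.75), "pos"),
    ((0.75, 0.25), "pos"),
]


def tree_predict(x1, x2):
    """Two-level decision tree: test x1 first, then x2 inside each branch."""
    if x1 > 0.5:
        return "neg" if x2 > 0.5 else "pos"
    return "pos" if x2 > 0.5 else "neg"


def accuracy(predict):
    """Fraction of the points that a predict(x1, x2) function labels correctly."""
    return sum(predict(x1, x2) == label for (x1, x2), label in points) / len(points)


def best_linear_accuracy(grid):
    """Try every (w0, w1, w2) on the grid and return the best accuracy found."""
    best = 0.0
    for w0, w1, w2 in product(grid, repeat=3):
        def linear(x1, x2, w0=w0, w1=w1, w2=w2):
            return "pos" if w0 + w1 * x1 + w2 * x2 > 0 else "neg"
        best = max(best, accuracy(linear))
    return best


if __name__ == "__main__":
    grid = [i / 2 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
    print("tree:", accuracy(tree_predict))
    print("best linear on grid:", best_linear_accuracy(grid))
```

The grid search is only a demonstration, not a proof. It is consistent with the known fact that XOR-style data is not linearly separable.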

• Most evaluation metrics can be understood using a contingency table

                      true
                      triangle   other
predicted triangle    A          B
predicted other       C          D

• What number(s) do we want to maximize?

• What number(s) do we want to minimize?

Evaluation Metric (what could possibly go wrong?)

Evaluation Metric (what could possibly go wrong?)

• True positives (A): number of triangles correctly predicted as triangles

• False positives (B): number of “other” incorrectly predicted as triangles

• False negatives (C): number of triangles incorrectly predicted as “other”

• True negatives (D): number of “other” correctly predicted as “other”

                      true
                      triangle   other
predicted triangle    A          B
predicted other       C          D
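The four cells can be counted directly from paired lists of true and predicted labels. The function name and the example labels below are hypothetical. The A/B/C/D naming follows the table.

```python
def contingency_counts(true_labels, predicted_labels, positive="triangle"):
    """Return (A, B, C, D) = (TP, FP, FN, TN) for the given positive class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("need one prediction per true label")
    a = b = c = d = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive and truth == positive:
            a += 1  # true positive
        elif guess == positive:
            b += 1  # false positive: "other" predicted as triangle
        elif truth == positive:
            c += 1  # false negative: triangle predicted as "other"
        else:
            d += 1  # true negative
    return a, b, c, d


if __name__ == "__main__":
    truth = ["triangle", "triangle", "triangle", "other", "other", "other", "other"]
    guess = ["triangle", "triangle", "other", "triangle", "other", "other", "other"]
    print(contingency_counts(truth, guess))
```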

• Accuracy: percentage of predictions that are correct (i.e., true positives and true negatives)

                      true
                      triangle   other
predicted triangle    A          B
predicted other       C          D

accuracy = (? + ?) / (? + ? + ? + ?)

Evaluation Metric (what could possibly go wrong?)

Evaluation Metric (what could possibly go wrong?)

• Accuracy: percentage of predictions that are correct (i.e., true positives and true negatives)

                      true
                      triangle   other
predicted triangle    A          B
predicted other       C          D

accuracy = (A + D) / (A + B + C + D)
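As a small sketch, the accuracy formula translates directly into code. The function name and the counts used in the example are hypothetical.

```python
def accuracy(a, b, c, d):
    """Accuracy = (A + D) / (A + B + C + D), using the table's cell names."""
    total = a + b + c + d
    if total == 0:
        raise ValueError("the contingency table is empty")
    return (a + d) / total


if __name__ == "__main__":
    # Hypothetical counts: 2 true positives, 1 false positive,
    # 1 false negative, 3 true negatives.
    print(accuracy(2, 1, 1, 3))
```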

• Accuracy: percentage of predictions that are correct (i.e., true positives and true negatives)

• What is the accuracy of this model?

Evaluation Metric (what could possibly go wrong?)

• Interpreting the value of a metric on a particular data set requires some thinking ...

• On this dataset, what would be the expected accuracy of a model that does NO learning (degenerate baseline)?

Evaluation Metric (what could possibly go wrong?)

• Interpreting the value of a metric on a particular data set requires some thinking ...

Evaluation Metric (what could possibly go wrong?)

5. Misleading interpretation of a metric value!
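One way to put an accuracy value in context is to compare it with a baseline that does no learning at all, such as always predicting the most frequent label. The sketch below uses a hypothetical, heavily imbalanced label distribution. The original dataset figure is not available.

```python
from collections import Counter


def majority_baseline_accuracy(true_labels):
    """Accuracy of always predicting the most frequent label (no learning)."""
    if not true_labels:
        raise ValueError("need at least one label")
    most_common_count = Counter(true_labels).most_common(1)[0][1]
    return most_common_count / len(true_labels)


if __name__ == "__main__":
    # Hypothetical test set: 90 "other" shapes and 10 triangles.
    labels = ["other"] * 90 + ["triangle"] * 10
    print(majority_baseline_accuracy(labels))
```

On data like this, a model that reports 90% accuracy has not necessarily learned anything about triangles.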

What Could Possibly Go Wrong?

1. Bad feature representation

2. Bad data + misleading correlations

3. Noisy labels for training and testing

4. Bad learning algorithm

5. Misleading evaluation metric

Road Map (first half of the semester)

• Predictive Analysis of Text

‣ Supervised machine learning principles

‣ Text representation

‣ Feature selection

‣ Basic machine learning algorithms

‣ Tools for predictive analysis of text

‣ Experimentation and evaluation

• Exploratory Analysis of Text

‣ Clustering

‣ Outlier detection (tentative)

‣ Co-occurrence statistics

Road Map (second half of the semester)

• Applications

‣ Text classification

‣ Opinion mining

‣ Sentiment analysis

‣ Bias detection

‣ Information extraction

‣ Relation learning

‣ Text-based forecasting

‣ Temporal summarization

• Is there anything that you would like to learn more about?

Grading

• 30% homework

‣ 10% each

• 20% midterm

• 40% term project

‣ 5% proposal

‣ 10% presentation

‣ 25% paper

• 10% participation

Grading for Graduate Students

• H: 95-100%

• P: 80-94%

• L: 60-79%

• F: 0-59%

Grading for Undergraduate Students

• A+: 97-100%

• A: 94-96%

• A-: 90-93%

• B+: 87-89%

• B: 84-86%

• B-: 80-83%

• C+: 77-79%

• C: 74-76%

• C-: 70-73%

• D+: 67-69%

• D: 64-66%

• D-: 60-63%

• F: <= 59%

General Outline of Homework

• Given a dataset (i.e., a training and test set), run experiments where you try to predict the target class using different feature representations

• Do error analysis

• Report on what worked, what didn’t, and why!

• Answer essay questions about the assignment

‣ These will be associated with the course material

Homework vs. Midterm

• The homework will be more challenging than the midterm. It should be, since you have more time.

• Work hard

• Do the assigned readings

• Do other readings

• Be patient and have reasonable expectations

‣ you’re not supposed to understand everything we cover in class during class

• Seek help sooner rather than later

‣ office hours: by appointment

‣ questions via email or MS Teams

• Remember the golden rule: no pain, no gain

Course Tips

Questions?