Machine Learning in Practice Lecture 14. Carolyn Penstein Rosé, Language Technologies Institute / Human-Computer Interaction Institute


TRANSCRIPT

Page 1: Machine Learning in Practice Lecture 14

Machine Learning in Practice Lecture 14

Carolyn Penstein Rosé

Language Technologies Institute / Human-Computer Interaction Institute

Page 2: Machine Learning in Practice Lecture 14

Plan for the Day

- Announcements
  - Questions?
  - Assignment 6
- More about Text
  - Using TagHelper Tools
  - Discussion about Assignment 6
  - Museli Paper

Page 3: Machine Learning in Practice Lecture 14

Using TagHelper Tools

Page 4: Machine Learning in Practice Lecture 14

Setting Up Your Data

Page 5: Machine Learning in Practice Lecture 14

Extra Features

You can also add additional features to the right of the text column.

Page 6: Machine Learning in Practice Lecture 14

TagHelper Tools Process

TagHelper takes labeled texts and unlabeled texts as input and produces labeled texts plus a model that can label more texts.
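For readers who prefer code to diagrams, here is a minimal sketch of the same train-then-label workflow in Python with scikit-learn. TagHelper itself is a GUI tool built on Weka, so this is only an analogy; the file and column names ("SimpleExample.csv", "text", "Code") are assumptions, not TagHelper's actual format.

    # Hypothetical sketch of the workflow: train on coded rows, then label the uncoded rows.
    # Assumes a table with a "text" column and a "Code" column where uncoded rows hold "?".
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    data = pd.read_csv("SimpleExample.csv")
    coded = data[data["Code"] != "?"]      # rows a human has already labeled
    uncoded = data[data["Code"] == "?"]    # rows the model should label

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(coded["text"], coded["Code"])                            # train on the coded examples
    data.loc[uncoded.index, "Code"] = model.predict(uncoded["text"])   # label the rest
    data.to_csv("SimpleExample_OUTPUT.csv", index=False)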

Page 7: Machine Learning in Practice Lecture 14

Running TagHelper Tools

Click on the portal.bat executable

Page 8: Machine Learning in Practice Lecture 14

Training and Testing

Start TagHelper Tools by double-clicking on the portal.bat icon in your TagHelperTools2 folder

You will then see the following tool palette

The idea is that you will train a prediction model on your coded data and then apply that model to uncoded data

Click on Train New Models

Page 9: Machine Learning in Practice Lecture 14

Loading a File

First click on Add a File, then select a file.

Page 10: Machine Learning in Practice Lecture 14

Simplest Usage

Click “GO!” TagHelper will use its default settings to train a model on your coded examples.

It will use that model to assign codes to the uncoded examples.

Page 11: Machine Learning in Practice Lecture 14

More Advanced Usage

Another option is to modify the default settings

You get to the options you can set by clicking on >> Options

After you finish that, click “GO!”

Page 12: Machine Learning in Practice Lecture 14

Output

You can find the output in the OUTPUT folder:
- User Defined Features: UserDefinedFeatures_[name of input file].txt, e.g., UserDefinedFeatures_SimpleExample.xls.txt
- Performance Report: Eval_[name of coding dimension]_[name of input file].txt, e.g., Eval_Code_SimpleExample.xls.txt
- Output File: [name of input file]_OUTPUT.xls, e.g., SimpleExample_OUTPUT.xls
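As a quick illustration of that naming pattern, the file names are just string compositions of the input file name and the coding dimension (the variable names below are mine, not TagHelper's):

    # Sketch of how the output file names are composed from the input file and coding dimension.
    input_file = "SimpleExample.xls"
    dimension = "Code"

    user_defined_features = "UserDefinedFeatures_" + input_file + ".txt"   # UserDefinedFeatures_SimpleExample.xls.txt
    performance_report = "Eval_" + dimension + "_" + input_file + ".txt"   # Eval_Code_SimpleExample.xls.txt
    output_file = input_file.replace(".xls", "_OUTPUT.xls")                # SimpleExample_OUTPUT.xls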

Page 13: Machine Learning in Practice Lecture 14

User Defined Feature File

You can reuse these files. If you load them as the default user defined features, you don't have to create them again by hand, but you do have to insert them manually.

Page 14: Machine Learning in Practice Lecture 14

Loading Your User Defined Features

Put your user defined feature file here

Page 15: Machine Learning in Practice Lecture 14

Loading Your User Defined Features

Page 16: Machine Learning in Practice Lecture 14

Loading Your User Defined Features

Double click

Page 17: Machine Learning in Practice Lecture 14

Loading Your User Defined Features

Then click here

Page 18: Machine Learning in Practice Lecture 14

Loading Your User Defined Features

Page 19: Machine Learning in Practice Lecture 14

Loading Your User Defined Features

Or export to CSV

Page 20: Machine Learning in Practice Lecture 14

Loading Your User Defined Features

Now you can just copy columns for new features into your input file. They will be treated like the extra features to the right of the text column. You need to reload them the long way when you create the final model.
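A rough sketch of what "extra columns to the right of the text column" amounts to: text-derived features and hand-made feature columns are combined into one feature space. This uses scikit-learn rather than TagHelper, and the column names ("text", "feat1", "feat2", "Code") are hypothetical.

    # Sketch: combine a text column with user-defined feature columns in one model.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    data = pd.read_csv("SimpleExample_with_features.csv")

    features = ColumnTransformer([
        ("words", CountVectorizer(), "text"),          # text column -> unigram features
        ("extra", "passthrough", ["feat1", "feat2"]),  # user-defined columns used as-is
    ])

    model = make_pipeline(features, LogisticRegression(max_iter=1000))
    model.fit(data[["text", "feat1", "feat2"]], data["Code"])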

Page 21: Machine Learning in Practice Lecture 14

Using the Output file Prefix

If you use the Output file prefix, the text you enter will be prepended to the names of the output files

Prefix1_Eval_Code_SimpleExample.xls.txt

Prefix1_SimpleExample.xls

Page 22: Machine Learning in Practice Lecture 14

Performance report

The performance report tells you:
- What dataset was used
- What the customization settings were

At the bottom of the file are reliability statistics and a confusion matrix that tells you which types of errors are being made.
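The reliability statistics and confusion matrix in the report are standard quantities; here is a minimal sketch (scikit-learn, toy labels) of computing accuracy, kappa, and a confusion matrix yourself, in case you want to double-check or post-process a report.

    # Toy example: accuracy, kappa, and a confusion matrix for gold vs. predicted codes.
    from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

    y_true = ["pos", "neg", "pos", "neg", "pos"]   # gold codes
    y_pred = ["pos", "neg", "neg", "neg", "pos"]   # model predictions

    print("accuracy:", accuracy_score(y_true, y_pred))
    print("kappa:   ", cohen_kappa_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred, labels=["pos", "neg"]))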


Page 25: Machine Learning in Practice Lecture 14

Output File

The output file contains the codes for each segment. Note that the segments that were already coded will retain their original code; the other segments will have their automatic predictions. The prediction column indicates the confidence of the prediction.
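Continuing the earlier scikit-learn sketch (so model, data, and uncoded are the hypothetical objects defined there), a confidence value of this kind is typically just the probability the classifier assigns to its chosen code:

    # Sketch: confidence = the probability of the predicted code for each uncoded segment.
    probs = model.predict_proba(uncoded["text"])                # one row of class probabilities per segment
    predicted = model.classes_[probs.argmax(axis=1)]            # predicted code = most probable class
    data.loc[uncoded.index, "prediction"] = probs.max(axis=1)   # e.g. 0.87 = 87% confidence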

Page 26: Machine Learning in Practice Lecture 14

Applying a Trained Model

Select a model file

Then select a testing file

Page 27: Machine Learning in Practice Lecture 14

Applying a Trained Model

Testing data should be set up with ? on uncoded examples

Click Go! to process the file
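The same apply-a-saved-model step, sketched with a pickled scikit-learn pipeline (TagHelper's own model files are not pickles, and the file and column names here are illustrative):

    # Sketch: load a previously trained model and code only the rows marked with "?".
    import pickle
    import pandas as pd

    with open("trained_model.pkl", "rb") as f:   # a model saved earlier with pickle.dump
        model = pickle.load(f)

    test = pd.read_csv("NewData.csv")
    mask = test["Code"] == "?"                   # uncoded examples are marked with ?
    test.loc[mask, "Code"] = model.predict(test.loc[mask, "text"])
    test.to_csv("NewData_OUTPUT.csv", index=False)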

Page 28: Machine Learning in Practice Lecture 14

Results

Page 29: Machine Learning in Practice Lecture 14

Assignment 6

Page 30: Machine Learning in Practice Lecture 14

Example Negative Review

in this re-make of the 1954 japanese monster film , godzilla is transformed into a " jurassic park " copy who swims from the south pacific to new york for no real reason and trashes the town . although some of the destruction is entertaining for a while , it gets old fast . the film often makes no sense ( a several-hundred foot tall beast hides in subway tunnels ) , sports second-rate effects ( the baby godzillas seem to be one computer effect multiplied on the screen ) , lame jokes ( mayor ebert and his assistant gene are never funny ) , horrendous acting ( even matthew broderick is dull ) and an unbelievable love story ( why would anyone want to get back together with maria pitillo's character ? ) . there are other elements of the film that fall flat , but going on would just be a waste of good words . only for die-hard creature feature fans , this might be fun if you could check your brain at the door . i couldn't . ( michael redman has written this column for 23 years and has seldom had a more disorienting cinematic experience than seeing both " fear and loathing " and " godzilla " in the same evening . )

Page 31: Machine Learning in Practice Lecture 14

Example Positive Review

sometimes a movie comes along that falls somewhat askew of the rest . some people call it " original " or " artsy " or " abstract " . some people simply call it " trash " . a life less ordinary is sure to bring about mixed feelings . definitely a generation-x aimed movie , a life less ordinary has everything from claymation to profane angels to a karaoke-based musical dream sequence . whew ! anyone in their 30's or above is probably not going to grasp what can be enjoyed about this film . it's somewhat silly , it's somewhat outrageous , and it's definitely not your typical romance story , but for the right audience , it works . a lot of hype has been surrounding this film due to the fact that it comes to us from the same team that brought us trainspotting . well sorry folks , but i haven't seen trainspotting so i can't really compare . whether that works in this film's favor or not is beyond me . but i do know this : ewan mcgregor , whom i had never had the pleasure of watching , definitely charmed me . he was great ! cameron diaz's character was uneven and a bit hard to grasp . the audience may find it difficult to care about her , thus discouraging the hopes of seeing her unite with mcgregor

Page 32: Machine Learning in Practice Lecture 14

Positive Review Continued

after we are immediately sucked into caring about and identifying with him . misguided? you bet . loveable ? you bet . a life less ordinary was a delight and even had a bonus for me when i realized it was filmed in my hometown of salt lake city , utah . this was just one more thing i didn't know about this movie when i sat down with a five dollar order of nachos and a three dollar coke . maybe not knowing the premise behind this film made for a pleasant surprise , but i think even if i had known , i would have been just as happy . a life less ordinary is quirky , eccentric , and downright charming ! not for everyone , but a definite change of pace for your typical night at the movies .

Page 33: Machine Learning in Practice Lecture 14

Note that the texts are LONG!!!

Page 34: Machine Learning in Practice Lecture 14

Takes about 15 minutes on my machine!

Page 35: Machine Learning in Practice Lecture 14

Using the Display Option

Page 36: Machine Learning in Practice Lecture 14
Page 37: Machine Learning in Practice Lecture 14

Helpful Hints

- Use Feature Selection! (see the sketch after this list)
- Limit the number of times you use the Advanced Feature Editing interface
- Export the features you create to CSV so you can reuse the already created versions
- You can use Weka once you dump out a .arff file from TagHelper Tools
- Do your experimentation strategically
- Note that POS tagging is slow
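To make the first hint concrete, here is a minimal feature-selection sketch in scikit-learn (not TagHelper's own mechanism): keep only the text features most associated with the codes before training. The cutoff of 1000 features is an arbitrary illustration.

    # Sketch: chi-squared feature selection over unigram/bigram features.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),   # unigram + bigram features
        SelectKBest(chi2, k=1000),             # keep the 1000 features most associated with the codes
        MultinomialNB(),
    )
    # model.fit(texts, codes)   # texts: list of strings, codes: list of labels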

Page 38: Machine Learning in Practice Lecture 14

Museli

Page 39: Machine Learning in Practice Lecture 14

Definition of “Topic” in Dialogue

Discourse Segment Purpose (Passonneau and Litman, 1994), based on (Grosz and Sidner, 1984)

TOPIC SHIFT = SHIFT IN PURPOSE that is acknowledged and acted upon by both dialogue participants

Example:

T: Let me know once you are done reading.

T: I’ll be back in a min.

T: Are you done reading?

S: not yet.

T: ok

T: Do you know where to enter all the values?

S: I think so.

S: I’ll ask if I get stuck though.

. . .

Tutor wants to know when the student is ready to start the session.

Tutor checks whether the student knows how to set up the analysis.


Page 41: Machine Learning in Practice Lecture 14

Overview of Single Evidence Source Approaches

Models based on lexical cohesion:
- TextTiling (Hearst, 1997)
- Foltz (Foltz, 1998)
- Olney & Cai (Olney & Cai, 2005)

Models relying on regularities in topic sequencing:
- Barzilay & Lee (Barzilay & Lee, 2004)

Page 42: Machine Learning in Practice Lecture 14

MUSELI

Integrates multiple sources of evidence of topic shift

Features:
- Lexical cohesion (via cosine correlation; see the sketch after this list)
- Time lag between contributions
- Unigrams (previous and current contribution)
- Bigrams (previous and current contribution)
- POS bigrams (previous and current contribution)
- Contribution length
- Previous/current speaker
- Contribution of content words
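As an illustration of the first feature, a minimal sketch of lexical cohesion as cosine similarity between the term vectors of adjacent contributions (a low value hints at a possible topic shift). This is not the MUSELI implementation itself; the toy texts come from the example dialogue above.

    # Sketch: cosine similarity between the previous and current contribution.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    contributions = [
        "Let me know once you are done reading.",
        "Are you done reading?",
        "Do you know where to enter all the values?",
    ]

    vectors = CountVectorizer().fit_transform(contributions)
    for i in range(1, len(contributions)):
        sim = cosine_similarity(vectors[i - 1], vectors[i])[0, 0]
        print("cohesion(%d, %d) = %.2f" % (i - 1, i, sim))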

Page 43: Machine Learning in Practice Lecture 14

Experimental Corpora

                          Olney and Cai Corpus    Thermo Corpus
# Dialogues               42                      22
Conts./Dialogue           195.40                  217.90
Words/Cont.               28.63*                  5.12*
Conts./Topic              24*                     13.31*
Topics/Dialogue           8.14*                   16.36*
Tutor Conts./Dialogue     97.48                   152.86*
Student Conts./Dialogue   97.93                   65.05*

* p < .005

Our thermo corpus:
- is more terse!
- has fewer contributions per topic!
- has more topics per dialogue!
- strict turn-taking was not enforced!

Olney and Cai corpus: (Olney and Cai, 2005)
Thermo corpus: student/tutor optimization problem, unrestricted interaction, virtually co-present

Page 44: Machine Learning in Practice Lecture 14

Baseline Degenerate Approaches
- ALL: every contribution = NEW_TOPIC
- EVEN: every nth contribution = NEW_TOPIC
- NONE: no NEW_TOPIC
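These baselines are simple enough to write down directly; a minimal sketch follows (the SAME_TOPIC label for non-boundaries is my own placeholder, not from the paper).

    # Sketch of the three degenerate baselines for a dialogue of n contributions.
    def all_baseline(n):
        return ["NEW_TOPIC"] * n                  # ALL: every contribution starts a topic

    def none_baseline(n):
        return ["SAME_TOPIC"] * n                 # NONE: no topic boundaries at all

    def even_baseline(n, step):
        # EVEN: every step-th contribution starts a topic
        return ["NEW_TOPIC" if i % step == 0 else "SAME_TOPIC" for i in range(n)]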

Page 45: Machine Learning in Practice Lecture 14

Two Evaluation Metrics

A metric commonly used to evaluate topic segmentation algorithms (Olney & Cai, 2005):

F-measure: F = 2PR / (P + R)
- Precision (P) = # correct predictions / # predictions
- Recall (R) = # correct predictions / # boundaries

An additional metric designed specifically for segmentation problems (Beeferman et al., 1999):

Pk = Pr(error | k), the probability that two contributions, separated by k contributions, are misclassified. Effective if k = ½ the average topic length.
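A minimal sketch of Pk over boundary sequences (1 = contribution starts a new topic, 0 = continuation); this follows the definition above, not any particular released implementation.

    # Pk: fraction of probe pairs (i, i+k) on which reference and hypothesis disagree
    # about whether the two contributions fall in the same topic segment.
    def pk(reference, hypothesis, k=None):
        n = len(reference)
        if k is None:
            k = max(1, n // (2 * max(1, sum(reference))))   # half the average topic length
        errors = 0
        for i in range(n - k):
            same_ref = sum(reference[i + 1:i + k + 1]) == 0   # no boundary between i and i+k
            same_hyp = sum(hypothesis[i + 1:i + k + 1]) == 0
            errors += (same_ref != same_hyp)
        return errors / (n - k)

    ref = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]   # toy gold boundaries
    hyp = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # toy predicted boundaries
    print(round(pk(ref, hyp), 3))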

Page 46: Machine Learning in Practice Lecture 14

Experimental Results

          Olney and Cai Corpus     Thermodynamics Corpus
          Pk       F               Pk       F
NONE      .4897    --              .4900    --
ALL       .5180    --              .5100    --
EVEN      .5117    --              .5131    --
TT        .5069    .1475           .5353    .1614
B&L       .5092    .1747           .5086    .1512
Foltz     .3270    .3492           .5058    .1180
Ortho     .2754    .6012           .4898    .2111
Museli    .1051    .8013           .4043    .3693

Compared to the degenerate baselines, the approaches range from beating none of them, to beating one, to beating all three (p < .05).


Museli > all approaches in BOTH corpora (p < .05)

Page 48: Machine Learning in Practice Lecture 14

Take Home Message

We explored some of TagHelper Tools' functionality. TagHelper provides simple linguistic features like bigrams and POS bigrams that can be useful for classification. Assignment 6 will give you realistic experience working with text on a non-trivial classification task. The most important thing for Assignment 6 is to be strategic!