folktale classification using learning to rank

37
Folktale Classification using Learning to Rank Dong Nguyen, Dolf Trieschnigg, and Mariët Theune University of Twente

Upload: others

Post on 20-Feb-2022

18 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Folktale Classification using Learning to Rank

Folktale Classification using Learning to Rank

Dong Nguyen, Dolf Trieschnigg, and Mariët Theune University of Twente

Page 2: Folktale Classification using Learning to Rank
Page 3: Folktale Classification using Learning to Rank
Page 4: Folktale Classification using Learning to Rank
Page 5: Folktale Classification using Learning to Rank
Page 6: Folktale Classification using Learning to Rank

Folktales

•  Fairy tales •  Riddles •  Legends •  Urban legends •  Jokes •  Etc..

Page 7: Folktale Classification using Learning to Rank

Folktale researchers

•  Folktales are a resource to research – Variation in tales – Shifting moral values, beliefs, identities etc. – Intertextuality

Page 8: Folktale Classification using Learning to Rank

Classification systems

•  Folktale researchers have developed classification systems

•  To compare and analyze stories •  To organize stories

Page 9: Folktale Classification using Learning to Rank

Classification systems

•  Folktale researchers have developed classification systems

•  To compare and analyze stories •  To organize stories

Story types  

Page 10: Folktale Classification using Learning to Rank

The Vanishing Hitchhiker (BRUN 01000)!!

A ghostly or heavenly hitchhiker that vanishes !from a vehicle, sometimes after giving warning or prophecy.!

Page 11: Folktale Classification using Learning to Rank

The Vanishing Hitchhiker (BRUN 01000)!!

A ghostly or heavenly hitchhiker that vanishes !from a vehicle, sometimes after giving warning or prophecy.!

A car driver picks up a hitchhiker. They talk about spiritual topics in life. !Suddenly the hitchhiker

vanishes. The police tell !him they have heard the !story earlier that day.!

Page 12: Folktale Classification using Learning to Rank

The Vanishing Hitchhiker (BRUN 01000)!!

A ghostly or heavenly hitchhiker that vanishes !from a vehicle, sometimes after giving warning or prophecy.!

A guy bikes through the park !at night. He encounters a !girl covered in blood. !On their way to the !police, she suddenly !

disappears. She resembles a murdered girl…!

A car driver picks up a hitchhiker. They talk about spiritual topics in life. !Suddenly the hitchhiker

vanishes. The police tell !him they have heard the !story earlier that day.!

Page 13: Folktale Classification using Learning to Rank

The Vanishing Hitchhiker (BRUN 01000)!!

A ghostly or heavenly hitchhiker that vanishes !from a vehicle, sometimes after giving warning or prophecy.!

A guy bikes through the park !at night. He encounters a !girl covered in blood. !On their way to the !police, she suddenly !

disappears. She resembles a murdered girl…!

A car driver picks up a hitchhiker and borrows her !his sweater. When he stops! by to pick up the sweater, !

he discovers she passed !away due to a car accident !

a while ago. He finds! his sweater on her grave.!

A car driver picks up a hitchhiker. They talk about spiritual topics in life. !Suddenly the hitchhiker

vanishes. The police tell !him they have heard the !story earlier that day.!

Page 14: Folktale Classification using Learning to Rank

The Vanishing Hitchhiker (BRUN 01000)!!

A ghostly or heavenly hitchhiker that vanishes !from a vehicle, sometimes after giving warning or prophecy.!

A guy bikes through the park !at night. He encounters a !girl covered in blood. !On their way to the !police, she suddenly !

disappears. She resembles a murdered girl…!

A car driver picks up a hitchhiker and borrows her !his sweater. When he stops! by to pick up the sweater, !

he discovers she passed !away due to a car accident !

a while ago. He finds! his sweater on her grave.!

A car driver picks up a girl wearing a white dress. He

accidently spills red wine on her dress. The next day !he finds out she died a !

year ago. When the police open !her grave, they find the white! dress with the red wine spot.!

A car driver picks up a hitchhiker. They talk about spiritual topics in life. !Suddenly the hitchhiker

vanishes. The police tell !him they have heard the !story earlier that day.!

Page 15: Folktale Classification using Learning to Rank

The Vanishing Hitchhiker (BRUN 01000)!!

A ghostly or heavenly hitchhiker that vanishes !from a vehicle, sometimes after giving warning or prophecy.!

A guy bikes through the park !at night. He encounters a !girl covered in blood. !On their way to the !police, she suddenly !

disappears. She resembles a murdered girl…!

A car driver picks up a hitchhiker and borrows her !his sweater. When he stops! by to pick up the sweater, !

he discovers she passed !away due to a car accident !

a while ago. He finds! his sweater on her grave.!

A car driver picks up a girl wearing a white dress. He

accidently spills red wine on her dress. The next day !he finds out she died a !

year ago. When the police open !her grave, they find the white! dress with the red wine spot.!

A car driver picks up a hitchhiker. They talk about spiritual topics in life. !Suddenly the hitchhiker

vanishes. The police tell !him they have heard the !story earlier that day.!

Page 16: Folktale Classification using Learning to Rank

The Vanishing Hitchhiker (BRUN 01000)!!

A ghostly or heavenly hitchhiker that vanishes !from a vehicle, sometimes after giving warning or prophecy.!

A guy bikes through the park !at night. He encounters a !girl covered in blood. !On their way to the !police, she suddenly !

disappears. She resembles a murdered girl…!

A car driver picks up a hitchhiker and borrows her !his sweater. When he stops! by to pick up the sweater, !

he discovers she passed !away due to a car accident !

a while ago. He finds! his sweater on her grave.!

A car driver picks up a girl wearing a white dress. He

accidently spills red wine on her dress. The next day !he finds out she died a !

year ago. When the police open !her grave, they find the white! dress with the red wine spot.!

A car driver picks up a hitchhiker. They talk about spiritual topics in life. !Suddenly the hitchhiker

vanishes. The police tell !him they have heard the !story earlier that day.!

Page 17: Folktale Classification using Learning to Rank

The Vanishing Hitchhiker (BRUN 01000)!!

A ghostly or heavenly hitchhiker that vanishes !from a vehicle, sometimes after giving warning or prophecy.!

A guy bikes through the park !at night. He encounters a !girl covered in blood. !On their way to the !police, she suddenly !

disappears. She resembles a murdered girl…!

A car driver picks up a hitchhiker and borrows her !his sweater. When he stops! by to pick up the sweater, !

he discovers she passed !away due to a car accident !

a while ago. He finds! his sweater on her grave.!

A car driver picks up a girl wearing a white dress. He

accidently spills red wine on her dress. The next day !he finds out she died a !

year ago. When the police open !her grave, they find the white! dress with the red wine spot.!

A car driver picks up a hitchhiker. They talk about spiritual topics in life. !Suddenly the hitchhiker

vanishes. The police tell !him they have heard the !story earlier that day.!

Page 18: Folktale Classification using Learning to Rank

Story Type Indexes: ATU

Aarne-­‐Thompson-­‐Uther    classifica4on  system  (ATU)    Red  Riding  Hood    (ATU  0333),    The  Race  between  Hare  and  Tortoise  (ATU  0275A),  etc.

Page 19: Folktale Classification using Learning to Rank

Story Type Indexes: Brunvand

Urban  legends    The  Microwaved  Pet  (BRUN  02000),  The  Kidney  Heist  (BRUN  06305),    The  Killer  in  the  Backseat          (BRUN  01305),    The  Vanishing  Hitchhiker            (BRUN  01000),  etc.

Page 20: Folktale Classification using Learning to Rank

Automatic Identification of Story Types

•  Increasing digitalization

•  Discover relationships between stories

Page 21: Folktale Classification using Learning to Rank

Outline

•  Problem  descrip4on,  corpora  •  Experimental  setup  – Baselines,  Learning  to  Rank    

•  Results  – Baselines,  Feature  analysis,  Error  analysis  

•  Discussion/Conclusion  

Page 22: Folktale Classification using Learning to Rank

Goal and Evaluation

Given an input story, return a ranking of story types

•  Semi automatic setting  Reciprocal  Rank  (MRR)  

 •  Classifica4on  –  Simula4ng  a  classifica4on  seRng.  The  highest  ranked  label  is  then  taken  as  the  predicted  class.    

 Accuracy

1ranki

Page 23: Folktale Classification using Learning to Rank

Folktale database

•  Dutch Folktale Database (http://www.verhalenbank.nl) •  Over 42.000 stories •  In our experiments: only stories written in

standard Dutch

Page 24: Folktale Classification using Learning to Rank

Dataset I

3 Story Type Indexes

The Dutch Folktale Database is a large collection of Dutch folktales containinga variety of subgenres, including fairy tales, urban legends and jokes. We onlyconsider stories that are written in standard Dutch (the collection also containsmany narratives in historical Dutch, Frisian and Dutch dialects). In this paperwe restrict our focus to the two type indexes mentioned in the introduction,the ATU index [25] and the Type-Index of Urban Legends [6]. We created twodatasets based on these type indexes. For each type index, we only keep thestory types that occur at least two times in our dataset. The frequencies of thestory types are plotted in Figure 1. Many story types only occur a couple oftimes in the database, whereas a few story types have many instances.

3.1 Aarne-Thompson-Uther (ATU)

Our first type-index is the Aarne-Thompson-Uther classification (ATU) [25].Examples of specific story types are Red Riding Hood (ATU 0333) and TheRace between Hare and Tortoise (ATU 0275A). The index contains story typeshierarchically organized into categories (e.g. Fairy Tales and Religious Tales).We discard stories belonging to the Anecdotes and Jokes category (types 1200-1999), since the story types in this category are very di↵erent in nature from therest of the stories2. The average number of words per story is 489 words.

3.2 Brunvand

Our second type index is proposed by Brunvand [6] and is a classification of urbanlegends. Examples of story types are The Microwaved Pet (BRUN 02000), TheKidney Heist (BRUN 06305) and The Vanishing Hitchhiker (BRUN 01000). Thestories have on average 158 words.

Number of stories

Number

ofstorytypes

0 20 40 60

020

4060

(a) ATU

Number of stories

Number

ofstorytypes

0 20 40 60

020

4060

(b) Brunvand

Fig. 1. Story type frequencies

2 As was suggested by a folktale researcher. Story types in the Anecdotes and Jokescategory are mostly based on thematic similarity, while others are based on plot.

Page 25: Folktale Classification using Learning to Rank

Dataset II

Index   Train   Dev   Test  

Nr  documents   400     75     25   50  

Nr  story  types   98   59     24   43  

Index   Train   Dev   Test  

Nr  documents   687     175   50   75  

Nr  story  types   125     92   40   50  

ATU  

Brunvand  

Page 26: Folktale Classification using Learning to Rank

Baselines: big doc

Vanishing  Hitchiker  

Microwaved  Pet  

Killer  in  the  Backseat  

 Killer  in  the  Backseat  

 

Input  document  (query)  

Ranking  

Page 27: Folktale Classification using Learning to Rank

Baselines: small doc

Vanishing  Hitchiker  

Microwaved  Pet  

Killer  in  the  Backseat  

Input  document  (query)  

Ranking  

Vanishing  Hitchiker  Microwaved  Pet  Killer  in  the  

Backseat  

When  taking  the  top  ranked  label  as  the  class,    this  is  the  same  as  a  Nearest  Neighbour  classifier  (k=1).    

Page 28: Folktale Classification using Learning to Rank

Results - Baseline

MRR   Accuracy  

Smalldoc   0.7779     0.72    

Bigdoc   0.4423     0.36    

MRR   Accuracy  

Smalldoc   0.6430     0.56    

Bigdoc   0.6411     0.56    

ATU  

Brunvand  

Page 29: Folktale Classification using Learning to Rank

Learning to Rank

1.  Retrieve  an  ini4al  set  of  candidate  stories  (small  document  baseline).    

2.  Apply  learning  to  rank  to  rerank  the  top  50  candidates.    

3.  Create  a  final  ranked  list  of  story  types,  by  taking  the  corresponding  labels  of  the  ranked  stories  and  removing  duplicates.

Page 30: Folktale Classification using Learning to Rank

Features I •  Small  Document  Scores  (IR)  

–  Indicates  the  score  of  the  query  on  the  candidate  stories.  –  Fulltext  (BM25  -­‐  Full  text),  only  nouns  (BM25  -­‐  nouns)  and  only  verbs  (BM25  –  verbs)  

•  Big  Document  Scores  (Bigdoc)  –  Similarity  to  all  stories  of  the  candidate's  story  type  (bigdoc)  Fulltext  (Bigdoc  -­‐  BM25  -­‐  Full  text),  only  nouns  (Bigdoc  -­‐  BM25  -­‐  nouns)  and  only  verbs  (Bigdoc  -­‐  BM25  -­‐  verbs).  

•  Lexical  Similarity  (LS)  –  Jaccard  and  TFIDF  similarity,  calculated  on  the  following  token  types:  unigram,  bigrams,  character  ngrams  (2-­‐5),  chunks,  named  en44es,  4me  and  loca4ons.  

Page 31: Folktale Classification using Learning to Rank

Features II

•  Verb(Subject, Object) triplets

•  Extracted based on dependencies obtained using Frog parser

Lives (princess, castle)!!

Disappear(driver,)    

Page 32: Folktale Classification using Learning to Rank

Features III

•  Matches – Exact, Subject-Object, Verb-Object, Subject-Verb

•  Abstraction – VerbNet

Example: ‘consider-29.9’ class in VerbNet: achten (esteem), bevinden (find), inzien (realise), menen (think/believe), veronderstellen (presume), kennen (know), wanen (falsely believe), denken (think)

Page 33: Folktale Classification using Learning to Rank

Feature Analysis I

MRR   Accuracy  

Baseline                      (smalldoc)    

0.7779     0.72    

+  Bigdoc     0.8367     0.78    

+  IR     0.8049     0.76    

+  LS   0.7921     0.72    

+  Triplets     0.8016     0.72    

All   0.8569     0.82    

MRR   Accuracy  

Baseline                  (smalldoc)    

0.6430   0.56  

+  Bigdoc     0.7933   0.72  

+  IR     0.7247   0.61  

+  LS   0.6810   0.60  

+  Triplets     0.6600   0.59  

All   0.8132   0.76  

ATU   Brunvand  

Page 34: Folktale Classification using Learning to Rank

Feature Analysis II

Feature   Weight  

Bigdoc:  BM25  -­‐  nouns     0.179    

Bigdoc:  BM25  -­‐  full  text     0.158    

LS:  unigrams  -­‐  TFIDF     0.109    

Bigdoc:  BM25  -­‐  verbs     0.069    

Triplets:  SO  match,  Jaccard,  no  abstrac4on    

0.063    

ATU   Brunvand  

Feature   Weight  

Bigdoc:  BM25  -­‐  full  text     0.209    

Bigdoc:  BM25  -­‐  nouns     0.204    

LS:  unigrams  -­‐  TFIDF     0.065    

IR:  BM25  -­‐  nouns     0.062    

Bigdoc:  BM25  -­‐  verbs     0.051    

Page 35: Folktale Classification using Learning to Rank

Error analysis

•  Matches  based  on  style  instead  of  actual  plot.  Dis4nguishing  narrator,  or  very  short  stories.  

•  ATU:  Also  matched  on  content  words  that  were  not  core  to  the  plot.  Happened  in  par4cular  with  long  stories.  

Page 36: Folktale Classification using Learning to Rank

Discussion

•  High  MRR    – Suitable  for  a  semi-­‐automa4c  seRng,  where  annotators  are  presented  with  a  ranked  list  

•  What’s  next?    Other  type  indexes,  dialects,  historical  variants.  

Page 37: Folktale Classification using Learning to Rank

Summary

•  Classifica4on  of  story  types.  •  Two  story  type  indexes:    ATU  and  Brunvand.  •  Nearest  Neighbor  using  Learning  to  Rank  approach.  

•  Combining  a  small  document  and  big  document  model  was  very  effec4ve.