![Page 1: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/1.jpg)
24.08.2010 | Computer Science Department | UKP Lab - Prof. Dr. Iryna Gurevych | Zhemin Zhu |
Zhemin Zhu, UKP, TU Darmstadt, Germany
Delphine Bernhard, LIMSI-CNRS, France
Iryna Gurevych, UKP, TU Darmstadt, Germany
A Monolingual Tree-based Translation Model for Sentence Simplification
Presenter: Zhemin Zhu
COLING2010 – Beijing, China
![Page 2: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/2.jpg)
An Example of Sentence Simplification
This month was first called Sextilis in Latin, because it was the sixth month in the old Roman calendar. The Roman calendar began in March about 735 BC with Romulus.
-- Simple Wikipedia
This month was originally named Sextilis in Latin, because it was the sixth month in the original [ten-month] Roman calendar under Romulus in 753 BC, when March was the first month of the year.
-- Wikipedia
![Page 3: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/3.jpg)
Sentence Simplification Targeted at Humans
Reading and Speech Assistance
People with Comprehension Disabilities [Carroll et al., 1999; Inui et al., 2003]
Low-literacy people [Watanabe et al., 2009]
Non-native Speakers [Siddharthan, 2002]
Children
![Page 4: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/4.jpg)
Sentence Simplification Targeted at NLP Applications
Parsing and Translation [Chandrasekar et al., 1996]
Summarization [Knight and Marcu, 2000]
Sentence Fusion [Filippova and Strube, 2008b]
Semantic Role Labeling [Vickrey and Koller, 2008]
Question Generation [Heilman and Smith, 2009]
Relation Extraction [Miwa et al., COLING 2010]
Information Extraction [Jonnalagadda and Gonzalez, 2009]
Robot Command [Young KY and Liu SH, 2002]
![Page 5: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/5.jpg)
What Makes a Sentence Difficult?
1. Difficult Vocabulary → Vocabulary (Word/Phrase) Substitution
2. Complex Syntax
   Length → Splitting, Dropping
   Order → Reordering, such as passive and active
Simplification operations: Splitting, Dropping, Reordering and Substitution
This month was originally named Sextilis in Latin, because it was the sixth month in the original ten-month Roman calendar under Romulus in 753 BC, when March was the first month of the year.
-- Wikipedia
![Page 6: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/6.jpg)
Simplification Operation: Sentence Splitting
August is the eighth month of the year in the Gregorian Calendar and one of seven Gregorian months with a length of 31 days.
-- Wikipedia
August is the eighth month of the year. It has 31 days.
-- Simple Wikipedia
![Page 7: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/7.jpg)
Simplification Operation: Dropping
April is the fourth month of the year [in the Gregorian Calendar, and one of four months] with [a length of] 30 days.
-- Wikipedia
April is the fourth month of the year with 30 days.
-- Simple Wikipedia
![Page 8: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/8.jpg)
Simplification Operation: Reordering
Mr. Anthony, who runs an employment agency, decries program trading, but he isn't sure it should be strictly regulated.
-- [Siddharthan, 2006]
Mr. Anthony decries program trading. Mr. Anthony runs an employment agency. But he isn't sure it should be strictly regulated.
-- [Siddharthan, 2006]
![Page 9: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/9.jpg)
Simplification Operation: Substitution
The traditional etymology is from the Latin aperire, "to open," in allusion to its being the season when trees and flowers begin to "open," which is supported by comparison with the modern Greek use of ἁνοιξις (opening) for spring.
-- Wikipedia
The name April comes from that Latin word aperire which means "to open".
-- Simple Wikipedia
![Page 10: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/10.jpg)
Motivation
Most of the existing methods only cover one simplification operation:
  [Siddharthan, 2006] and [Petersen and Ostendorf, 2007]: Splitting
  Sentence Compression: Dropping
  [Carroll et al., 1999]: Word Substitution
In most cases, different simplification operations happen simultaneously.
It is necessary to model different simplification operations integrally.
![Page 11: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/11.jpg)
Our Contributions
The first statistical model: TSM (Tree-based Simplification Model)
  Integrally covering splitting, dropping, reordering and word/phrase substitution
  Based on the great successes of parsing and translation techniques
An Efficient Training Method for TSM
  Speeding up by monolingual word mapping
PWKP: Parallel Complex-Simple Dataset
  Obtained from Wikipedia and Simple Wikipedia
![Page 12: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/12.jpg)
Tree-based Simplification Model: TSM
Parse Trees of Complex Sentences → Splitting → Dropping → Reordering → Substitution → Simple Sentences
Probabilistic Model: EM Training
![Page 13: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/13.jpg)
Parallel Complex-Simple Dataset: PWKP
Paired articles from the Wikipedia and Simple Wikipedia
1. Article Pairing: following the “language links”
2. Plain Text Extraction: JWPL [Zesch et al., 2008]
3. Pre-processing: sentence boundary detection and tokenization with the Stanford Parser package [Klein and Manning, 2003], lemmatization with the TreeTagger [Schmid, 1994]
4. Monolingual Sentence Alignment: sentence-level TF*IDF [Nelken and Shieber, 2006]
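Step 4 can be illustrated with a minimal sketch of sentence-level TF*IDF alignment. It treats every sentence as its own document, weights terms with a smoothed IDF, and pairs a complex sentence with a simple one when their cosine similarity passes a threshold. The whitespace tokenization, the IDF smoothing, the 0.5 threshold and all function names are assumptions made for illustration; they are not the configuration used to build PWKP.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Build one sparse TF*IDF vector per sentence (each sentence is a 'document')."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each token occur?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF (an assumption) so that very common tokens keep a small weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def align(complex_sents, simple_sents, threshold=0.5):
    """Return (complex_index, simple_index, similarity) for pairs above the threshold."""
    vecs = tfidf_vectors(complex_sents + simple_sents)
    c_vecs = vecs[: len(complex_sents)]
    s_vecs = vecs[len(complex_sents):]
    pairs = []
    for i, cv in enumerate(c_vecs):
        for j, sv in enumerate(s_vecs):
            sim = cosine(cv, sv)
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs
```

Raising the threshold trades recall for precision, which is the trade-off a precision/recall comparison of similarity measures is meant to expose.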
![Page 14: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/14.jpg)
Parallel Complex-Simple Dataset: PWKP
| Similarity | Precision | Recall |
|---|---|---|
| TF*IDF | 91.3% | 55.4% |
| Word Overlap | 50.5% | 55.1% |
| MED | 13.9% | 54.7% |

Table 1: Monolingual Sentence Alignment

| | Sentence Length | Token Length | #Pairs |
|---|---|---|---|
| Simple | 20.87 | 4.89 | 108016 |
| Complex | 25.01 | 5.06 | 108016 |

Table 2: Statistics for the PWKP dataset
![Page 15: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/15.jpg)
TSM: Splitting
Example Complex Sentence:
August was the sixth month in the ancient Roman calendar which started in 735BC.
![Page 16: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/16.jpg)
TSM: Splitting
Question 1: Where to split the sentence? → Step 1: Segmentation
Question 2: How to make the split sentences complete and grammatical? → Step 2: Completion
![Page 17: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/17.jpg)
TSM: Splitting
Step 1: Segmentation
| Word | Constituent | Length | Probability |
|---|---|---|---|
| which | SBAR | 1 | 0.0016 |
| which | SBAR | 2 | 0.0835 |

Table 3: Segmentation Feature Table (SFT)
![Page 18: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/18.jpg)
TSM: Splitting
Step 1: Segmentation
![Page 19: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/19.jpg)
TSM: Splitting
Step 2: Completion
Should the “which” be dropped?
| Word | Constituent | isDropped | Probability |
|---|---|---|---|
| which | WHNP | true | 1.0 |
| which | WHNP | false | Prob.min |

Table 4: Border Drop Feature Table (BDFT)
![Page 20: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/20.jpg)
TSM: Splitting
Step 2: Completion
Which parts should be copied? Where to put these parts in the new sentences?
| Dependency | Constituent | isCopied | Position | Probability |
|---|---|---|---|---|
| gov_nsubj | VBD | true | left | 0.9000 |
| gov_nsubj | VBD | true | right | 0.0994 |
| gov_nsubj | VBD | false | left + right | 0.0006 |

Table 5: Copy Feature Table (CFT)
![Page 21: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/21.jpg)
TSM: Splitting
![Page 22: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/22.jpg)
TSM: Dropping & Reordering
| Constituent | Children | Drop | Probability |
|---|---|---|---|
| NP | DT JJ NNP NN | 1101 | 7.66E-4 |
| NP | DT JJ NNP NN | 0001 | 1.26E-7 |

Table 6: Dropping Feature Table (DFT)

| Constituent | Children | Reorder | Probability |
|---|---|---|---|
| NP | DT JJ NN | 012 | 0.8303 |
| NP | DT JJ NN | 210 | 0.0039 |

Table 7: Reordering Feature Table (RFT)
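To make the Drop and Reorder columns of Tables 6 and 7 concrete, here is a small sketch of how such patterns could be applied to the children of a constituent. The slides do not spell out the encoding, so the code assumes that in a drop pattern `1` keeps the child at that position and `0` drops it, and that the i-th digit of a reorder pattern names the original index of the child placed at position i. The function names are hypothetical.

```python
def apply_drop(children, pattern):
    """Keep the children whose pattern bit is '1' (assumed meaning: 1 = keep, 0 = drop)."""
    if len(children) != len(pattern):
        raise ValueError("pattern must have one bit per child")
    return [child for child, bit in zip(children, pattern) if bit == "1"]


def apply_reorder(children, pattern):
    """Permute children; digit i of the pattern is the original index placed at position i."""
    indices = [int(d) for d in pattern]
    if sorted(indices) != list(range(len(children))):
        raise ValueError("pattern must be a permutation of the child positions")
    return [children[i] for i in indices]


# NP with children DT JJ NNP NN, drop pattern 1101: the NNP child is removed.
kept = apply_drop(["the", "ancient", "Roman", "calendar"], "1101")
# Reorder pattern 012 leaves the order unchanged; 210 would reverse it.
same_order = apply_reorder(kept, "012")
```

Under these assumptions, pattern `012` is the identity ordering, which matches its high probability in Table 7.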
![Page 23: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/23.jpg)
TSM: Dropping & Reordering
![Page 24: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/24.jpg)
TSM: Word/Phrase Substitution
| Original (word/phrase) | Substitution (word/phrase) | Probability |
|---|---|---|
| ancient | ancient | 0.963 |
| ancient | old | 0.0183 |
| old | ancient | 0.005 |
| ancient | than transportation | 1.83E-102 |

Table 8: Substitution Feature Table (SubFT)
Word substitution: terminal nodes
Phrase Substitution: non-terminal nodes
![Page 25: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/25.jpg)
TSM: Word/Phrase Substitution
![Page 26: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/26.jpg)
Speeding up
We filter out the unpromising candidates at the early stages. This is done using monolingual word mapping.
![Page 27: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/27.jpg)
Experiments
Testing dataset:
  100 complex sentences
  131 parallel simple sentences from PWKP
Baseline systems:
1. Moses: state-of-the-art phrase-based SMT
2. Compression (Filippova and Strube, 2008a)
3. Compression + Substitution
   Substitution: WordNet + frequency in Simple Wikipedia articles
4. Compression + Substitution + Splitting
   Splitting: split at conjunctions and relative pronouns
![Page 28: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/28.jpg)
Experiments: Basic Statistics
| System | Tok. Len. | Sent. Len. | #Sent. |
|---|---|---|---|
| Complex Sentences | 4.95 | 27.81 | 100 |
| Simple Sentences | 4.76 | 17.86 | 131 |
| 1. Moses | 4.81 | 26.08 | 100 |
| 2. Compression | 4.98 | 18.02 | 103 |
| 3. Compression+Substitution | 4.90 | 18.11 | 103 |
| 4. Compression+Substitution+Splitting | 4.98 | 10.20 | 182 |
| 5. TSM | 4.76 | 13.57 | 180 |
![Page 29: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/29.jpg)
Experiments: Translation Assessment
| System | BLEU | NIST | #Same |
|---|---|---|---|
| Complex Sentences | 0.50 | 6.89 | 100 |
| Simple Sentences | 1.00 | 10.98 | 3 |
| 1. Moses | 0.55 | 7.47 | 25 |
| 2. Compression | 0.28 | 5.37 | 1 |
| 3. Compression+Substitution | 0.19 | 4.51 | 0 |
| 4. Compression+Substitution+Splitting | 0.18 | 4.42 | 0 |
| 5. TSM | 0.38 | 6.21 | 2 |
![Page 30: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/30.jpg)
Experiments: Readability Assessment
| System | Flesch | Lix (Grade) | OOV% | PPL |
|---|---|---|---|---|
| Complex Sentences | 49.1 | 53.0 (10) | 52.9 | 384 |
| Simple Sentences | 60.4 (PE) | 44.1 (8) | 50.7 | 179 |
| 1. Moses | 54.8 | 48.1 (9) | 52.0 | 363 |
| 2. Compression | 56.2 | 45.9 (8) | 51.7 | 481 |
| 3. Compression+Substitution | 59.1 | 45.1 (8) | 49.5 | 616 |
| 4. Compression+Substitution+Splitting | 65.5 (PE) | 38.3 (6) | 53.4 | 581 |
| 5. TSM | 67.4 (PE) | 36.7 (5) | 50.8 | 353 |

PE: Plain English
Grade: School Year
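For readers unfamiliar with the two readability scores, the sketch below computes them from their standard definitions: Flesch Reading Ease is 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words), where higher means easier, and Lix is (words / sentences) + 100 * (long words / words), where long words have more than six letters and lower means easier. The regex tokenizer, the punctuation-based sentence counter and the vowel-group syllable heuristic are assumptions, so the output will not reproduce the figures in the table.

```python
import re


def words(text):
    """Alphanumeric tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def sentence_count(text):
    """Count runs of sentence-final punctuation; abbreviations are not handled."""
    return max(1, len(re.findall(r"[.!?]+", text)))


def syllables(word):
    """Crude heuristic: one syllable per group of consecutive vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    w = words(text)
    total_syllables = sum(syllables(x) for x in w)
    return 206.835 - 1.015 * (len(w) / sentence_count(text)) - 84.6 * (total_syllables / len(w))


def lix(text):
    w = words(text)
    long_words = sum(1 for x in w if len(x) > 6)
    return len(w) / sentence_count(text) + 100.0 * long_words / len(w)


complex_text = (
    "August is the eighth month of the year in the Gregorian Calendar "
    "and one of seven Gregorian months with a length of 31 days."
)
simple_text = "August is the eighth month of the year. It has 31 days."
print(flesch_reading_ease(complex_text), flesch_reading_ease(simple_text))
print(lix(complex_text), lix(simple_text))
```

Both scores depend only on surface counts such as sentence length and word length, so they reward splitting and dropping without checking whether the output is still grammatical or faithful to the original.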
![Page 31: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/31.jpg)
Conclusions
1. Moses is not good at simplification tasks.
2. BLEU and NIST are not good evaluation metrics for sentence simplification systems.
3. TSM can achieve the best overall readability scores.
4. We contributed the PWKP dataset:
http://www.ukp.tu-darmstadt.de/software-data/data/quality-assessment/
![Page 32: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/32.jpg)
Future Work
More sophisticated features and rules to improve TSM
Extend TSM’s expressiveness to model more complex transformations: synchronous syntax is a promising direction
Evaluation methods for simplification systems: Readability Assessment
![Page 33: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/33.jpg)
Acknowledgements
![Page 34: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/34.jpg)
Thanks for your interest!
Comments & Questions!
![Page 35: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/35.jpg)
Backup: Training
EM algorithm:
Training(dataset) {
    Initialize all probability tables using the uniform distribution;
    for (several iterations) {
        reset all cnt = 0;
        for (each sentence pair <c, s> in dataset) {
            tt = buildTrainingTree(<c, s>);
            calcInsideProb(tt);
            calcOutsideProb(tt);
            update cnt for each conditioning feature in each node of tt:
                cnt = cnt + node.insideProb * node.outsideProb / root.insideProb;
        }
        updateProbability();
    }
}
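The count update and the probability update at the heart of this loop can be sketched in a few lines of Python. This is only an illustration of the update rule from the pseudocode: building the training tree and computing inside and outside probabilities are not shown, and the `Node` fields, the function names and the example numbers are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Node:
    context: tuple       # conditioning feature, e.g. ("NP", "DT JJ NN")
    outcome: str         # value chosen at this node, e.g. reorder pattern "012"
    inside_prob: float   # assumed to be precomputed by the inside pass
    outside_prob: float  # assumed to be precomputed by the outside pass


def accumulate_counts(nodes, root_inside_prob, counts):
    """E-step: cnt += node.insideProb * node.outsideProb / root.insideProb."""
    for node in nodes:
        counts[node.context][node.outcome] += (
            node.inside_prob * node.outside_prob / root_inside_prob
        )


def update_probability(counts):
    """M-step: normalise the expected counts within each conditioning feature."""
    table = {}
    for context, outcomes in counts.items():
        total = sum(outcomes.values())
        table[context] = {outcome: c / total for outcome, c in outcomes.items()}
    return table


# Toy example with made-up inside/outside values for one training tree.
counts = defaultdict(lambda: defaultdict(float))
np_context = ("NP", "DT JJ NN")
nodes = [
    Node(np_context, "012", inside_prob=0.3, outside_prob=0.5),
    Node(np_context, "210", inside_prob=0.1, outside_prob=0.5),
]
accumulate_counts(nodes, root_inside_prob=0.2, counts=counts)
reorder_table = update_probability(counts)
```

In a full implementation the counts would be reset at the start of each iteration and accumulated over every sentence pair before normalising, as the pseudocode describes.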
![Page 36: A Monolingual Tree-based Translation Model for Sentence Simplification](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56812d9f550346895d92c06b/html5/thumbnails/36.jpg)
Backup: Training