Hebrew Sentence Compression Graduation project by: Parush Anat & Grisha Klots http://www.cs.bgu.ac.il/~klotsg Supervised by: Yoav Goldberg & Dr. Michael Elhadad Nov. 2010, CS BGU.


Hebrew Sentence Compression

Graduation project by: Parush Anat & Grisha Klots
http://www.cs.bgu.ac.il/~klotsg

Supervised by: Yoav Goldberg & Dr. Michael Elhadad

Nov. 2010, CS BGU.


A short example…

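A “long” sentence may look like this:
אתמול בשעה שש איה אפתה עוגת תפוחים טעימה ואני אכלתי חתיכה קטנה ממנה
(“Yesterday at six o'clock Aya baked a tasty apple cake and I ate a small piece of it”)

And it can be compressed to:
אתמול איה אפתה עוגת תפוחים ואני אכלתי ממנה
(“Yesterday Aya baked an apple cake and I ate some of it”)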


Sentence Compression – Why & How?
Motivations, implementation and some theoretical background

• Automatic text summarization (academic, technical, textbooks and so on…)
• A sentence-by-sentence approach: each sentence is compressed individually
• Our method is based on word deletion to generate a shorter “version” of the sentence


Work process

Our work consisted of two main phases:
1. Corpus Generation – developed in Python
2. Algorithm Implementation – developed in Java

• We implemented the algorithm developed by Ryan McDonald and described in his paper “Discriminative Sentence Compression with Soft Syntactic Evidence”

• First time implemented for Hebrew!


Phase #1: Sentence Generation – General

• Sentences were extracted from the “Haaretz” website
• A scoring method was employed to find pairs of “full” and “short” sentences
• All pairs were grouped into a single database (a large XML-like file)
• The XML file was scanned again and filtered for irregularities and for words that are not in the Hebrew lexicon
• The final output is formatted to a predefined structure that serves as the input for the 2nd phase, Algorithm Implementation


[Figure: article layout with its Main Headline, Sub-headline, and Body labeled]


Phase #1: Sentence Generation – Extraction (Extractor)

• A scoring method ensures that only the best matches are returned (the required matching percentage varies)

• At this stage, we allowed clauses to change their relative position between the two versions of the sentence (the permitted number of changes also varies)
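The slides do not say how the match score is computed. Purely as an illustration, the sketch below scores a candidate by the fraction of the short sentence's words that also occur in the long one, ignoring order (which fits the note that clauses may move at this stage). The names `match_score` and `best_match` and the `threshold` value are assumptions, not the project's code.

```python
from typing import List, Optional


def match_score(short_words: List[str], long_words: List[str]) -> float:
    """Fraction of short-sentence words that also occur in the long sentence."""
    if not short_words:
        return 0.0
    long_set = set(long_words)
    return sum(w in long_set for w in short_words) / len(short_words)


def best_match(short_words: List[str],
               candidates: List[List[str]],
               threshold: float = 0.8) -> Optional[List[str]]:
    """Return the best-scoring candidate, or None if none reaches the threshold.

    The threshold stands in for the varying "matching percentage";
    its default value here is hypothetical.
    """
    best = max(candidates, key=lambda c: match_score(short_words, c), default=None)
    if best is None or match_score(short_words, best) < threshold:
        return None
    return best
```

Because the score ignores word order, a pair accepted here can still be rejected by a stricter order-preserving check later.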


Phase #1: Sentence Generation – Filtration

• Due to strict rules imposed by the Algorithm Implementation, heavy filtration has to be performed:
• Each word that appears in the short version must appear in the long one as well.
• No clauses may change their position within a pair of sentences.

More than 90% of pairs fail this test! (The initial size of the DB was ~4500 sentences. Filtration left us with only ~400 sentences to work with.)
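Taken together, the two rules require the short sentence to be an order-preserving subsequence of the long one. A minimal sketch of such a check follows, assuming whitespace-tokenized sentences. The function name `align_short_to_long` is hypothetical, and greedy leftmost matching for repeated words is an assumption.

```python
from typing import List, Optional


def align_short_to_long(short: List[str], long: List[str]) -> Optional[List[int]]:
    """Return 0-based positions in `long` that spell out `short` in order,
    or None if the pair violates either filtration rule."""
    indices = []
    pos = 0  # next position in the long sentence that may still be used
    for word in short:
        # Scan forward only, so kept words cannot change their relative order.
        while pos < len(long) and long[pos] != word:
            pos += 1
        if pos == len(long):
            return None  # word is missing, or occurs only out of order
        indices.append(pos)
        pos += 1
    return indices


long_sentence = "a b c d e".split()
print(align_short_to_long("b d".split(), long_sentence))  # [1, 3]
print(align_short_to_long("d b".split(), long_sentence))  # None
```

The positions returned for a valid pair are exactly the kind of index list the second phase needs as training input.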


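Phase #1: Sentence Generation – Algorithm Input Formatting

Example:

השוטר כופר בכל ההאשמות
NN VB DTT NN
עו"ד מיכאל בוסקילה המייצג את השוטר אמר כי החשוד כופר בכל ההאשמות נגדו
TTL NNP NNP BN AT NN VB CC NN VB DTT NN IN
5 9 10 11
------------------END------------------

(Short sentence: “The policeman denies all the accusations.” Long sentence: “Attorney Michael Buskila, who represents the policeman, said that the suspect denies all the accusations against him.” Each sentence is followed by its part-of-speech tags. The numbers are the 0-based positions in the long sentence of the words kept in the short one.)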
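A record in this format can be read back with a few lines of Python. The sketch below assumes the five-line layout inferred from the single example above (short words, short tags, long words, long tags, kept indices) followed by an END separator line. The `Pair` class and `parse_pairs` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pair:
    short_words: List[str]
    short_tags: List[str]
    long_words: List[str]
    long_tags: List[str]
    kept: List[int]  # positions in long_words that form the short sentence


def parse_pairs(text: str) -> List[Pair]:
    """Parse records separated by lines such as '-----END-----'."""
    pairs, block = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between fields
        if line.startswith("-") and "END" in line:
            if len(block) != 5:
                raise ValueError(f"expected 5 lines per record, got {len(block)}")
            sw, st, lw, lt, idx = (part.split() for part in block)
            pairs.append(Pair(sw, st, lw, lt, [int(i) for i in idx]))
            block = []
        else:
            block.append(line)
    return pairs
```

A useful sanity check after parsing is that `[p.long_words[i] for i in p.kept] == p.short_words` for every pair `p`.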


Phase #2: Algorithm Implementation


Phase #2: Algorithm Implementation

The “heart” of the algorithm is dynamic programming:

Compress: (long sentence x, requested length) → short sentence

$C[i,j] = \max_{k<i} \{ C[k,j-1] + S(x,k,i) \}$

[Figure: the table C. One axis is i, the i-th word from the long sentence; the other is j, the length of the short sentence. Cell C[i,j] is computed from the cells C[k,j-1] with k<i, and the result is the maximum score for a short sentence having the desired length.]
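The recurrence C[i,j] = max over k<i of C[k,j-1] + S(x,k,i) can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's Java code: the pair score is passed in as a callable, and virtual START and END positions (index -1 and index n) are added so that the first and last kept words are also scored. The slides do not say how sentence boundaries were handled.

```python
from typing import Callable, List

NEG_INF = float("-inf")


def compress(n: int, m: int, score: Callable[[int, int], float]) -> List[int]:
    """Choose m of n words (0-based indices) maximizing the total score of
    adjacent kept pairs. score(k, i) may be called with k == -1 (virtual
    START) and i == n (virtual END); both sentinels are assumptions."""
    if not 0 <= m <= n:
        raise ValueError("requested length must be between 0 and n")
    end = n + 1  # table position p stands for word index p - 1
    # C[i][j]: best score of a partial compression ending at position i
    # that contains j positions after START.
    C = [[NEG_INF] * (m + 2) for _ in range(end + 1)]
    back = [[0] * (m + 2) for _ in range(end + 1)]
    C[0][0] = 0.0
    for i in range(1, end + 1):
        for j in range(1, m + 2):
            for k in range(i):
                if C[k][j - 1] == NEG_INF:
                    continue  # no compression ends at k with j - 1 positions
                cand = C[k][j - 1] + score(k - 1, i - 1)
                if cand > C[i][j]:
                    C[i][j], back[i][j] = cand, k
    # Follow back-pointers from END, which is the (m + 1)-th position.
    kept, p = [], end
    for j in range(m + 1, 0, -1):
        p = back[p][j]
        if p > 0:
            kept.append(p - 1)
    return kept[::-1]
```

For a sentence of n words and a requested length m, the three nested loops give a running time on the order of n² · m.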


Phase #2: Algorithm Implementation – basic terminology

• Feature – a string that characterizes a pair of words according to their syntactic analysis, their position in the sentence and the words that are between them.
• For example:
• pi:pj = getPOS(i):getPOS(j), e.g. “pi:pj = NN:VB”
• for i<k<j:
IsNeg = isNeg(getWord(k)), e.g. “IsNeg = True” or “IsNeg = False”
pi:pk:pj = getPOS(i):getPOS(k):getPOS(j)
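The three templates above can be instantiated for a pair of positions as in the sketch below. The function name `pair_features` is hypothetical, and the small negation word list standing in for `isNeg` is an illustrative assumption. The project's actual templates and lexicon are not given in the slides.

```python
from typing import List

# Illustrative subset of Hebrew negation words; the real isNeg test is not
# described in the slides, so this set is an assumption.
NEGATION_WORDS = {"לא", "אין", "אל"}


def pair_features(words: List[str], tags: List[str], i: int, j: int) -> List[str]:
    """Feature strings for keeping word i immediately before word j (i < j)."""
    feats = [f"pi:pj = {tags[i]}:{tags[j]}"]
    for k in range(i + 1, j):  # every dropped word between i and j
        feats.append(f"IsNeg = {words[k] in NEGATION_WORDS}")
        feats.append(f"pi:pk:pj = {tags[i]}:{tags[k]}:{tags[j]}")
    return feats
```

Because each feature is a plain string, it can serve directly as a key into a weights table.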


Phase #2: Algorithm Implementation – basic terminology (cont.)

• Weights Vector – contains ordered pairs of <Feature, Weight> for all feature instances induced by the different pairs of words

For example: <<“pi:pj = NN:VB”, 100>, <“pi:pj = NN:DTT”, 5>>

• All the feature templates are hard-coded and predefined.
• The Weights Vector is updated constantly during the learning phase.


Phase #2: Algorithm Implementation – the Score function

S(x,k,i) returns the sum of the weights of the features for the k-th and i-th words in sentence x

$C[i,j] = \max_{k<i} \{ C[k,j-1] + S(x,k,i) \}$
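As a small illustration of S(x,k,i), the sketch below stores the Weights Vector as a dictionary from feature string to weight and sums the weights of the features fired by a pair. Only the `pi:pj` template is used, to keep the example short. The helper names are hypothetical, and treating unseen features as weight 0 is an assumption.

```python
from typing import Dict, List


def pos_pair_features(tags: List[str], k: int, i: int) -> List[str]:
    """Only the pi:pj template, as a stand-in for the full feature set."""
    return [f"pi:pj = {tags[k]}:{tags[i]}"]


def score(weights: Dict[str, float], tags: List[str], k: int, i: int) -> float:
    """Sum the weights of all features for the pair (k, i); unseen features count as 0."""
    return sum(weights.get(f, 0.0) for f in pos_pair_features(tags, k, i))


# Weights taken from the example in the slides.
weights = {"pi:pj = NN:VB": 100, "pi:pj = NN:DTT": 5}
tags = ["NN", "VB", "DTT"]
print(score(weights, tags, 0, 1))  # 100
print(score(weights, tags, 0, 2))  # 5
```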


Phase #2: Learning – Dynamic Programming (Part 1)

• For each cluster (one record of the input file), we iterate over the list of indices, and for every two adjacent indices we extract their feature list. For example, the list

5 9 10 11

gives the adjacent pairs (5,9), (9,10) and (10,11).

• For each feature in each list, we increase its weight by 1 in the Weights Vector
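This first half of the update can be sketched as follows, using the tag sequence of the long sentence from the input-format example and only the `pi:pj` template. The names `adjacent_pairs` and `reward_gold` are hypothetical, and a `Counter` stands in for the Weights Vector.

```python
from collections import Counter
from typing import List, Tuple


def adjacent_pairs(indices: List[int]) -> List[Tuple[int, int]]:
    """[5, 9, 10, 11] -> [(5, 9), (9, 10), (10, 11)]"""
    return list(zip(indices, indices[1:]))


def reward_gold(weights: Counter, tags: List[str], kept: List[int]) -> None:
    """Add 1 to the weight of every feature seen in the reference compression."""
    for i, j in adjacent_pairs(kept):
        weights[f"pi:pj = {tags[i]}:{tags[j]}"] += 1


long_tags = "TTL NNP NNP BN AT NN VB CC NN VB DTT NN IN".split()
weights = Counter()
reward_gold(weights, long_tags, [5, 9, 10, 11])
print(dict(weights))
# {'pi:pj = NN:VB': 1, 'pi:pj = VB:DTT': 1, 'pi:pj = DTT:NN': 1}
```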


Phase #2: Learning – Dynamic Programming (Part 2)

• Now, we compress the long sentence to a new sentence having the length of the short one

• From the compressed sentence, we generate a new list of indices (as shown before) and extract the lists of features

• For each feature in each list, we decrease its weight by 1 in the Weights Vector
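Combining both parts, one training step adds 1 for every feature of the reference compression and subtracts 1 for every feature of the compression the current weights produce, so features shared by both cancel out. The sketch below takes the predicted indices as an argument rather than running the dynamic program itself. The function names are hypothetical and only the `pi:pj` template is used.

```python
from collections import Counter
from typing import List


def pair_features(tags: List[str], kept: List[int]) -> List[str]:
    """pi:pj features for every two adjacent kept indices."""
    return [f"pi:pj = {tags[i]}:{tags[j]}" for i, j in zip(kept, kept[1:])]


def update(weights: Counter, tags: List[str],
           gold: List[int], predicted: List[int]) -> None:
    """One learning step: reward the reference, penalize the prediction."""
    for f in pair_features(tags, gold):
        weights[f] += 1
    for f in pair_features(tags, predicted):
        weights[f] -= 1


tags = "TTL NNP NNP BN AT NN VB CC NN VB DTT NN IN".split()
weights = Counter()
update(weights, tags, gold=[5, 9, 10, 11], predicted=[0, 9, 10, 11])
print(weights["pi:pj = NN:VB"], weights["pi:pj = TTL:VB"])  # 1 -1
```

When the prediction equals the reference, every increment is undone and the weights stay unchanged, so learning only moves the weights on sentences the model still gets wrong.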


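Results & Discussion: A short example

• Long sentence: כמו בפרשת ההכרה בישראל כמדינת העם היהודי העיתוי של החלטת הממשלה בדבר הענקת מעמד מיוחד לבירה מצטייר כפרובוקציה מכוונת
(“As in the affair of recognizing Israel as the state of the Jewish people, the timing of the government's decision to grant a special status to the capital looks like a deliberate provocation”)

• Shortened sentence (original): העיתוי של החלטת הממשלה בדבר הענקת מעמד מיוחד לבירה מצטייר כפרובוקציה מכוונת
(“The timing of the government's decision to grant a special status to the capital looks like a deliberate provocation”)

• Shortened sentence (algorithm): העיתוי של החלטת הממשלה בדבר הענקת מעמד מיוחד לבירה מצטייר כפרובוקציה מכוונת
(identical to the original shortened sentence)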


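Results & Discussion: And another one…

• Long sentence: למשרד המשפטים 60 יום לערער על ההחלטה שמציבה דילמה בפני ממשל אובמה
(“The Justice Department has 60 days to appeal the decision, which poses a dilemma for the Obama administration”)

• Shortened sentence (original): למשרד המשפטים 60 יום לערער על ההחלטה
(“The Justice Department has 60 days to appeal the decision”)

• Shortened sentence (algorithm): 60 יום לערער על ההחלטה ממשל אובמה
(“60 days to appeal the decision Obama administration”)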


Results & Discussion: A (very) basic results analysis

We analyzed the compression of 50 “unseen” sentences:

• 8% matched the shortened version exactly
• 25% differ from the shortened version by one or two words
• 37% are valid Hebrew sentences
• 43% retained the general notion of the original sentence


Results & Discussion: Future improvements

• Increase DB size!!!
• Increase variety: use other sources of information
• Add more feature templates


Thank you!