Segmentation in Sanskrit texts
![Page 1: Segmentation in Sanskrit texts](https://reader031.vdocuments.mx/reader031/viewer/2022020108/58ed50131a28abea2d8b46a1/html5/thumbnails/1.jpg)
![Page 2: Segmentation in Sanskrit texts](https://reader031.vdocuments.mx/reader031/viewer/2022020108/58ed50131a28abea2d8b46a1/html5/thumbnails/2.jpg)
देहिनोऽस्मिन्यथा देहे कौमारं यौवनं जरा । तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति
देहिनः अस्मिन् यथा देहे कौमारं यौवनं जरा तथा देहान्तर प्राप्तिः धीरः तत्र न मुह्यति
![Page 3: Segmentation in Sanskrit texts](https://reader031.vdocuments.mx/reader031/viewer/2022020108/58ed50131a28abea2d8b46a1/html5/thumbnails/3.jpg)
![Page 4: Segmentation in Sanskrit texts](https://reader031.vdocuments.mx/reader031/viewer/2022020108/58ed50131a28abea2d8b46a1/html5/thumbnails/4.jpg)
तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति
तथा देहान्तर प्राप्तिः धीरः तत्र न मुह्यति
![Page 5: Segmentation in Sanskrit texts](https://reader031.vdocuments.mx/reader031/viewer/2022020108/58ed50131a28abea2d8b46a1/html5/thumbnails/5.jpg)
तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति
![Page 6: Segmentation in Sanskrit texts](https://reader031.vdocuments.mx/reader031/viewer/2022020108/58ed50131a28abea2d8b46a1/html5/thumbnails/6.jpg)
![Page 7: Segmentation in Sanskrit texts](https://reader031.vdocuments.mx/reader031/viewer/2022020108/58ed50131a28abea2d8b46a1/html5/thumbnails/7.jpg)
तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति
रािरािेभ्यः रािमय
witi
PMI Matrix of the un-segmentable token lemmas
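As a rough illustration of how such a PMI matrix could be filled in, here is a minimal sketch. It assumes sentence-level co-occurrence counts and uses toy transliterated lemmas; both are assumptions, since the slides do not say how the matrix was computed. The function name `pmi_matrix` is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_matrix(sentences):
    """Return {(a, b): PMI} for lemma pairs that co-occur in a sentence.

    Assumption: probabilities are estimated from sentence-level counts.
    """
    n = len(sentences)
    word_count = Counter()  # number of sentences containing each lemma
    pair_count = Counter()  # number of sentences containing both lemmas
    for sent in sentences:
        lemmas = sorted(set(sent))
        word_count.update(lemmas)
        pair_count.update(combinations(lemmas, 2))
    pmi = {}
    for (a, b), c_ab in pair_count.items():
        p_ab = c_ab / n
        p_a = word_count[a] / n
        p_b = word_count[b] / n
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi

# Toy lemma sequences (hypothetical, not taken from the slides' corpus).
toy = [["rama", "vana", "gam"], ["rama", "gam"], ["sita", "vana"], ["sita", "rama"]]
scores = pmi_matrix(toy)
```

Pairs are keyed in sorted order, so the entry for `rama` and `gam` is stored under `("gam", "rama")`. Pairs that never co-occur are simply absent from the result.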
$P(w_1, w_2, w_3, w_4) = P(w_1 \mid \text{<s>})\, P(w_2 \mid w_1)\, P(w_3 \mid w_2)\, P(w_4 \mid w_3)\, P(\text{</s>} \mid w_4)$
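The bigram factorisation can be sketched in a few lines. This is a minimal example assuming unsmoothed maximum-likelihood estimates and a toy corpus; the slides do not state the estimator or any smoothing, and the function names are hypothetical.

```python
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their left contexts, padding with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        contexts.update(padded[:-1])            # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))
    return contexts, bigrams

def sentence_prob(sent, contexts, bigrams):
    """P(w1..wn) = P(w1|<s>) * P(w2|w1) * ... * P(</s>|wn), without smoothing."""
    padded = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        if contexts[prev] == 0:
            return 0.0                          # unseen context (assumption: no back-off)
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob

# Toy corpus of token sequences (hypothetical).
contexts, bigrams = train_bigram([["a", "b"], ["a", "c"]])
p_seen = sentence_prob(["a", "b"], contexts, bigrams)
p_unseen = sentence_prob(["b", "a"], contexts, bigrams)
```

Without smoothing, any sentence containing an unseen bigram gets probability zero, so a practical model would need some form of smoothing or back-off.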
![Page 8: Segmentation in Sanskrit texts](https://reader031.vdocuments.mx/reader031/viewer/2022020108/58ed50131a28abea2d8b46a1/html5/thumbnails/8.jpg)
| Set (size in sentences) | Micro accuracy | Macro accuracy |
|---|---|---|
| Training set (1700) | 87.76 % | 92.56 % |
| Testing set (150) | 87.82 % | 93.56 % |
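The slides do not define the two accuracy measures. The sketch below assumes the common reading: micro accuracy pools all words across sentences, and macro accuracy averages the per-sentence accuracies. The function name and the toy counts are hypothetical.

```python
def micro_macro_accuracy(per_sentence):
    """per_sentence: list of (correct_words, total_words) pairs, one per sentence.

    Assumption: micro pools all words; macro averages per-sentence accuracy.
    """
    total_correct = sum(c for c, _ in per_sentence)
    total_words = sum(t for _, t in per_sentence)
    micro = total_correct / total_words
    macro = sum(c / t for c, t in per_sentence) / len(per_sentence)
    return micro, macro

# Toy counts (hypothetical): a long sentence and a short one.
micro, macro = micro_macro_accuracy([(9, 10), (1, 2)])
```

Under this reading the two numbers diverge whenever sentence lengths and per-sentence accuracies vary, which would be consistent with the gap seen in the table.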
![Page 9: Segmentation in Sanskrit texts](https://reader031.vdocuments.mx/reader031/viewer/2022020108/58ed50131a28abea2d8b46a1/html5/thumbnails/9.jpg)
• Treat the problem as a query expansion problem.
• Start with unsegmented tokens.
• At each step a new candidate word is selected and added to the query.
• The query expansion iterates till a complete sentence is output.
Chunk 1: c1 c2 c3 c4
w1, w2, ..., wk, ..., Wl6
S = c1 + c2 + c3 + c4
C2 = Set of wi which are candidates for a semantically correct segmentation.
Similarly for c2 and c3
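The iteration described above can be sketched as a greedy loop. This is a simplified sketch under several assumptions: each chunk resolves to a single candidate, chunks with only one candidate seed the query, and the scoring function is a toy affinity lookup. The names `expand_query`, `toy_score` and `affinity`, and the transliterated candidates, are hypothetical.

```python
def expand_query(candidates, score):
    """candidates: {chunk_id: [candidate, ...]}; score(query_words, cand) -> float."""
    # Seed the query with chunks that have only one possible reading.
    query = {c: opts[0] for c, opts in candidates.items() if len(opts) == 1}
    pending = {c: opts for c, opts in candidates.items() if len(opts) > 1}
    while pending:
        # Select the (chunk, candidate) pair that best fits the current query.
        chunk, best = max(
            ((c, w) for c, opts in pending.items() for w in opts),
            key=lambda cw: score(list(query.values()), cw[1]),
        )
        query[chunk] = best   # add the chosen word to the query
        del pending[chunk]    # that chunk is now resolved
    return [query[c] for c in sorted(candidates)]

# Toy affinities between words (hypothetical values).
affinity = {("tatha", "dehantara"): 2.0, ("dehantara", "dhira"): 3.0}

def toy_score(query_words, cand):
    return sum(affinity.get((q, cand), 0.0) + affinity.get((cand, q), 0.0)
               for q in query_words)

candidates = {1: ["tatha"], 2: ["dehantara", "deha"], 3: ["dhira", "dhi"], 4: ["na"]}
sentence = expand_query(candidates, toy_score)
```

The loop ends once every chunk has been resolved, which corresponds to the point where a complete sentence is output.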
![Page 10: Segmentation in Sanskrit texts](https://reader031.vdocuments.mx/reader031/viewer/2022020108/58ed50131a28abea2d8b46a1/html5/thumbnails/10.jpg)
![Page 11: Segmentation in Sanskrit texts](https://reader031.vdocuments.mx/reader031/viewer/2022020108/58ed50131a28abea2d8b46a1/html5/thumbnails/11.jpg)
![Page 12: Segmentation in Sanskrit texts](https://reader031.vdocuments.mx/reader031/viewer/2022020108/58ed50131a28abea2d8b46a1/html5/thumbnails/12.jpg)
![Page 13: Segmentation in Sanskrit texts](https://reader031.vdocuments.mx/reader031/viewer/2022020108/58ed50131a28abea2d8b46a1/html5/thumbnails/13.jpg)
• From the query nodes, reach the most promising candidate word nodes.
• Perform multiple personalised random walks.
• Edge weights: accommodate heterogeneous information.
• Learn a weight for each random walk approach (path) by supervised methods.
• The weighted sum of all the random walk methods gives the most suitable candidate.
• PS: We use 4 lakh (400,000) tagged sentences from the Digital Corpus of Sanskrit.
Language Model (LM) with word lemmas
LM with morphological types
Verb specific Expectancy
Compound word formation patterns
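The random-walk scoring can be illustrated with a small pure-Python sketch. It assumes a random walk with restart as the personalised walk, one walk per edge type, and fixed combination weights. In the slides the weights are learned by supervised methods, so the values below are placeholders, and all names and toy graphs are hypothetical.

```python
def personalised_walk(edges, seeds, restart=0.15, steps=50):
    """Random walk with restart over a weighted directed graph.

    edges: {node: {neighbour: positive_weight}}; seeds: query nodes to restart from.
    Nodes without outgoing edges simply drop their mass (a simplification).
    """
    nodes = set(edges) | {v for nbrs in edges.values() for v in nbrs}
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(start)
    for _ in range(steps):
        nxt = {n: restart * start[n] for n in nodes}
        for u, nbrs in edges.items():
            total = sum(nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += (1 - restart) * rank[u] * w / total
        rank = nxt
    return rank

def combine(walks, weights):
    """Weighted sum of the scores from several walks."""
    nodes = set().union(*walks)
    return {n: sum(wt * walk.get(n, 0.0) for walk, wt in zip(walks, weights))
            for n in nodes}

# Toy graphs, one per edge type (hypothetical weights).
lemma_edges = {"q": {"a": 3.0, "b": 1.0}}   # e.g. LM with word lemmas
type_edges = {"q": {"a": 1.0, "b": 1.0}}    # e.g. LM with morphological types
walks = [personalised_walk(g, {"q"}) for g in (lemma_edges, type_edges)]
combined = combine(walks, [0.7, 0.3])       # placeholder weights, not learned
best = max(["a", "b"], key=combined.get)
```

Keeping one walk per edge type is what lets the heterogeneous sources be weighted separately before they are summed.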
![Page 14: Segmentation in Sanskrit texts](https://reader031.vdocuments.mx/reader031/viewer/2022020108/58ed50131a28abea2d8b46a1/html5/thumbnails/14.jpg)
Language Model with words - LMw
LM with morphological types - LMt
Verb specific Expectancy – ViE
Compound word formation patterns
PCRW: Unifying Framework
• Handle free word order.
• Incorporate heterogeneous types of information.
• Bonus: form different relational paths (up to length l) by combining individual edge weights.
• For l = 3, some sample paths that can be formed by combination:
• LMw -> LMt -> LMw
• LMt -> V1E -> LMt
• LMt -> VkE -> LMt
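To make the path combinations concrete, here is a minimal sketch that enumerates every sequence of edge types up to length l. It assumes all combinations are allowed, whereas the actual system may restrict or prune them. The list of edge types uses the abbreviations from the slide, with V1E and VkE standing in for the verb-specific types.

```python
from itertools import product

# Abbreviations as they appear on the slide; V1E and VkE stand in for verb-specific types.
EDGE_TYPES = ["LMw", "LMt", "V1E", "VkE"]

def relational_paths(edge_types, max_len):
    """All sequences of edge types with length 1..max_len (assumption: no pruning)."""
    paths = []
    for length in range(1, max_len + 1):
        paths.extend(product(edge_types, repeat=length))
    return paths

paths = relational_paths(EDGE_TYPES, 3)
printable = [" -> ".join(p) for p in paths]
```

With four edge types and l = 3 this yields 4 + 16 + 64 = 84 paths, including the three samples listed above.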
![Page 15: Segmentation in Sanskrit texts](https://reader031.vdocuments.mx/reader031/viewer/2022020108/58ed50131a28abea2d8b46a1/html5/thumbnails/15.jpg)