![Page 1: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/1.jpg)
LOHAI: Providing a baseline for KOS based automatic indexing
Kai Eckert
Mannheim University Library
First workshop on Semantic Digital Archives (SDA 2011),
Sep 29th 2011, Berlin
![Page 2: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/2.jpg)
Motivation
● General KOS based indexer for various (even serious) purposes:
– Document exploration
– Thesaurus examination
– Automatic Indexing
– ...
● No free and open source implementation was available.
● LOHAI: Low Hanging Fruits Automatic Indexer
![Page 3: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/3.jpg)
Design Principles
● Simplicity over quality
– Easy to use
– Easy to understand
– Easy to improve
● Knowledge-poor and without any training
– Must not rely on any additional sources
– No training step, to be usable in a setting where no preindexed documents are available.
![Page 4: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/4.jpg)
Indexing Pipeline
![Page 5: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/5.jpg)
Indexing Pipeline
Covered in this talk.
![Page 6: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/6.jpg)
Required Disambiguation
● Compound terms:
– “insur” >> “insurance”, “insurance market”
● Overstemming:
– “nation” >> “nationalism”, “nationality”, “nation”
● Homonyms:
– “bank”: the financial institution
– “bank”: a raised portion of seabed or sloping ground along the edge of a stream, river, or lake
![Page 7: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/7.jpg)
Compound Term Detection
Money Insur Market Cross-Concordance
![Page 8: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/8.jpg)
Compound Term Detection
Money Insur Market Cross-Concordance
Money, Money Transfer, ...
Ambiguous
![Page 9: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/9.jpg)
Compound Term Detection
Money Insur Market Cross-Concordance
Money, Money Transfer, ...
Insurance, Insurance Market, ...
Ambiguous
![Page 10: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/10.jpg)
Compound Term Detection
Money Insur Market Cross-Concordance
Money, Money Transfer, ...
Insurance, Insurance Market, ...
Market, Insurance Market, ...
Ambiguous
![Page 11: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/11.jpg)
Compound Term Detection
Money Insur Market Cross-Concordance
Money, Money Transfer, ...
Insurance, Insurance Market, ...
Market, Insurance Market, ...
Cross-Concordance
Unique: STOP
![Page 12: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/12.jpg)
Compound Term Detection
Money Insur Market Cross-Concordance
Money, Money Transfer, ...
Insurance, Insurance Market, ...
Market, Insurance Market, ...
Cross-Concordance
Money Insur Market
No match
![Page 14: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/14.jpg)
Compound Term Detection
Money Insur Market Cross-Concordance
Money, Money Transfer, ...
Insurance, Insurance Market, ...
Market, Insurance Market, ...
Cross-Concordance
Money Insur Market
MATCH
>> Money
![Page 15: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/15.jpg)
Compound Term Detection
Money Insur Market Cross-Concordance
Money, Money Transfer, ...
Insurance, Insurance Market, ...
Market, Insurance Market, ...
Cross-Concordance
Money Insur Market
Money Insur Market
MATCH
>> Money
>> Insurance Market
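The walk-through above can be approximated with a greedy longest-match over the stemmed tokens. The following is a minimal sketch under stated assumptions, not the LOHAI implementation: the function name `detect_compounds`, the `label_index` dictionary (stemmed label tuples mapped to concept labels) and the stem `cross-concord` are hypothetical, and the window is bounded by the longest label rather than by the next unique token as on the slides.

```python
def detect_compounds(stems, label_index):
    """Greedy longest-match of stem sequences against stemmed KOS labels."""
    matches = []
    # Assumption: no label is longer than the longest key in the index.
    max_len = max((len(key) for key in label_index), default=1)
    i = 0
    while i < len(stems):
        # Try the longest window first, then shrink it one stem at a time.
        for n in range(min(max_len, len(stems) - i), 0, -1):
            window = tuple(stems[i:i + n])
            if window in label_index:
                matches.append(label_index[window])
                i += n  # consume the matched stems
                break
        else:
            i += 1  # no label starts at this stem; skip it
    return matches


# Hypothetical toy index mirroring the slide example.
label_index = {
    ("money",): "Money",
    ("money", "transfer"): "Money Transfer",
    ("insur",): "Insurance",
    ("insur", "market"): "Insurance Market",
    ("market",): "Market",
    ("cross-concord",): "Cross-Concordance",
}
stems = ["money", "insur", "market", "cross-concord"]
print(detect_compounds(stems, label_index))
```

Preferring the longest window is what makes "Insur Market" resolve to Insurance Market instead of the two separate concepts Insurance and Market.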
![Page 16: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/16.jpg)
Unstemming
● Overstemming:
– “nation” >> “nationalism”, “nationality”, “nation”
● Compare unstemmed terms.
● If they match:
– Assign them.
● If not:
– Continue with Word Sense Disambiguation.
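The unstemming check can be sketched as a plain comparison of the original (unstemmed) word with the candidate labels that share its stem. This is an illustrative sketch only: the function name `unstem` and the case-insensitive comparison are assumptions, not taken from the LOHAI source.

```python
def unstem(surface, candidates):
    """Return the candidate labels whose unstemmed form equals the surface word."""
    wanted = surface.casefold()  # assumption: matching ignores case
    return [label for label in candidates if label.casefold() == wanted]


# All three labels share the stem "nation" in the slide example.
candidates = ["Nationalism", "Nationality", "Nation"]

print(unstem("nation", candidates))   # exact match: assign it
print(unstem("nations", candidates))  # empty: hand over to word sense disambiguation
```

An empty result means the surface form does not settle the ambiguity, so the term moves on to the next step.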
![Page 17: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/17.jpg)
Word Sense Disambiguation
Yarowsky's assumptions:
● One sense per collocation
– Collocated terms are unique for each possible sense of a given term.
● One sense per discourse
– Only one sense for a given word is used throughout a whole document.
![Page 18: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/18.jpg)
Jaccard Measure
● Term Environments:
– Document: 100 words before and after the term occurrence form set W.
– KOS: All labels of the concept, its direct children, siblings and parents form set C.
● Jaccard Measure:
$$\mathrm{Jaccard}(W, C) = \frac{|W \cap C|}{|W \cup C|}$$
● Assign one sense per document, based on the Jaccard value of all occurrences.
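A minimal sketch of this disambiguation step, under stated assumptions: the helper names (`jaccard`, `window`, `disambiguate`) and the toy sense environments are hypothetical, the KOS environment is given as a ready-made set of lowercase words, and the per-occurrence Jaccard values are combined by summing them, which is only one possible reading of "based on the Jaccard value of all occurrences".

```python
def jaccard(w, c):
    """Jaccard measure |W ∩ C| / |W ∪ C| of two sets (0.0 if both are empty)."""
    union = w | c
    return len(w & c) / len(union) if union else 0.0


def window(tokens, pos, size=100):
    """Set W: up to `size` words before and after the occurrence at `pos`."""
    return set(tokens[max(0, pos - size):pos] + tokens[pos + 1:pos + 1 + size])


def disambiguate(tokens, term, senses, size=100):
    """Pick one sense per document; scores are summed over all occurrences (assumption)."""
    scores = {sense: 0.0 for sense in senses}
    for pos, token in enumerate(tokens):
        if token == term:
            w = window(tokens, pos, size)
            for sense, environment in senses.items():
                scores[sense] += jaccard(w, environment)
    return max(scores, key=scores.get)


# Hypothetical environments (set C) for the two senses of "bank".
senses = {
    "Bank (finance)": {"credit", "loan", "interest", "money"},
    "Bank (geography)": {"river", "shore", "lake", "stream"},
}
tokens = "the bank raised the interest rate on every loan".split()
print(disambiguate(tokens, "bank", senses))
```

Because every occurrence contributes to the same score table, the whole document receives a single sense, in line with the one-sense-per-discourse assumption.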
![Page 19: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/19.jpg)
Example Result
Hardin, Russell: Contractarianism: Wistful Thinking
The contract metaphor in political and moral theory is misguided. It is a poor metaphor both descriptively and normatively, but here I address its normative problems. Normatively, contractarianism is supposed to give justifications for political institutions and for moral rules, just as contracting in the law is supposed to give justification for claims of obligation based on consent or agreement. This metaphorical association fails for several reasons. First, actual contracts generally govern prisoner’s dilemma, or exchange, relations; the so-called social contract governs these and more diverse interactions as well. Second, agreement, which is the moral basis of contractarianism, is not right-making per se. Third, a contract in law gives information on what are the interests of the parties; a hypothetical social contract requires such knowledge, it does not reveal it. Hence, much of contemporary contractarian theory is perversely rationalist at its base because it requires prior, rational derivation of interests or other values. Finally, contractarian moral theory has the further disadvantage that, unlike contract in the law, its agreements cannot be connected to relevant motivations to abide by them.
Constitutional Political Economy, 1 (2) 1990: 35-52
![Page 20: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/20.jpg)
Example Result
Manual assignment
● Constitutional economics
● Influence of government
● Ethics
● Theory
LOHAI
● Contract Law (1.21)
● Contract (0.76)
● Social contract (0.64)
● Law (0.51)
● Politics (0.37)
● Prisoner’s dilemma (0.34)
● Theory (0.32)
● Rationalism (0.24)
● Association (0.23)
● Exchange (0.20)
● Knowledge (0.19)
● Government (0.16)
● Information (0.12)
![Page 21: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/21.jpg)
Example Result
The contract metaphor in political and moral theory is misguided. It is a poor metaphor both descriptively and normatively, but here I address its normative problems. Normatively, contractarianism is supposed to give justifications for political institutions and for moral rules, just as contracting in the law is supposed to give justification for claims of obligation based on consent or agreement. This metaphorical association fails for several reasons. First, actual contracts generally govern prisoner’s dilemma, or exchange, relations; the so-called social contract governs these and more diverse interactions as well. Second, agreement, which is the moral basis of contractarianism, is not right-making per se. Third, a contract in law gives information on what are the interests of the parties; a hypothetical social contract requires such knowledge, it does not reveal it. Hence, much of contemporary contractarian theory is perversely rationalist at its base because it requires prior, rational derivation of interests or other values. Finally, contractarian moral theory has the further disadvantage that, unlike contract in the law, its agreements cannot be connected to relevant motivations to abide by them.
![Page 22: LOHAI: Providing a baseline for KOS based automatic indexing](https://reader033.vdocuments.mx/reader033/viewer/2022060120/5592a5b11a28ab6e798b469f/html5/thumbnails/22.jpg)
Conclusion
● Reasonable results without “Black Box” effect.
● Directly usable for new KOS and document sets.
● No training step needed.
● Low hanging fruits, but a good baseline.
● ~ 500 LoC in (quite verbose) Java.
● Easy to adapt and to improve.
● Free and open source:
– https://github.com/kaiec/LOHAI