ISKO 2010 Marianne Lykke Marianne Lykke Royal School of Library and Information Science Susan L. Price and Lois M. L. Delcambre Portland State University ISKO 2010 Conference Sapienza University of Rome, Faculty of Philosophy February 23 - 26, 2010 Using semantic components to represent and search domain-specific documents: An evaluation of indexing accuracy and consistency


Page 1: Marianne Lykke Royal School of Library and Information Science Susan L. Price and Lois M. L. Delcambre Portland State University

ISKO 2010 Marianne Lykke

Marianne Lykke, Royal School of Library and Information Science

Susan L. Price and Lois M. L. Delcambre, Portland State University

ISKO 2010 Conference, Sapienza University of Rome, Faculty of Philosophy

February 23 - 26, 2010

Using semantic components to represent and search domain-specific documents: An evaluation of indexing accuracy and consistency

Page 2

Agenda

• Problem and motivation
• Semantic component model
• Research questions
• Test design
• Results
• Conclusions

Page 3

Problem and motivation

Challenges for information retrieval in domain-specific digital libraries:
• Domain-specific libraries often contain large sets of similar documents about few topics
  o Important to be able to distinguish between topically similar documents
• Domain experts often have specific information needs targeting a single “right answer”, specified by domain-specific facets
  o Important to be able to limit search to domain-specific dimensions
(e.g. Leckie et al., 1996; Fagin et al., 2003; Freund et al., 2005; Hearst et al., 2006)

Page 4

Problem and motivation

• Little time for information retrieval
  o Important that the relevant documents are highly ranked and retrieved by the first query
• Distributed indexing, carried out by indexers with varying degrees of indexing competence
  o Important to address classical indexing problems: quality, exhaustivity, specificity, consistency
(e.g. Leckie et al., 1996; Fagin et al., 2003; Freund et al., 2005; Hearst et al., 2006)

Page 5

Semantic component model

• Semantic component model developed to facilitate formulation of specific, structured queries covering the search topic exhaustively by domain-specific dimensions
• Two-level model dividing a given collection into a set of document classes, each class with an associated set of semantic components
• Based on assumptions that
  o Domain experts know document genres within a certain domain: content and structure (Dillon, 1991; Orlikowski & Yates, 1994; Bishop, 1999; Vaughan & Dillon, 2005)
  o Domain-specific document content and structure correspond to domain-specific information needs (Ely et al., 1999, 2000; Price, Delcambre, Nielsen, 2006)

Page 6

HIO 2009 Marianne Lykke

[Screenshot: document of class "Clinical method" with marked semantic components (SC): General information, Practical information]

Page 7

[Screenshot: document of class "Clinical method" with marked semantic components (SC): General information, Risk factors, After treatment]

Page 8

Semantic component model

Document class: semantic components
• Clinical problem: General information, Diagnosis, Referral, Treatment
• Clinical unit: Function and specialty, Practical information, Referral, Staff and organization
• Clinical method: General information, Practical information, Referral, Aftercare, Risks, Expected results
• Drugs: General information, Practical information, Target group, Effect, Side effects
• Services: General information, Practical information, Referral
• Notice: General information, Practical information, Qualification
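The two-level model above (document classes, each with its own set of semantic components) maps naturally onto a dictionary. The following is a minimal sketch, not the sundhed.dk implementation; the class and component names are copied from the slide, while the constant name and the helper `is_valid_annotation` are hypothetical and introduced only for illustration.

```python
from typing import Dict, List

# Document classes and their semantic components, as listed on the slide.
SEMANTIC_COMPONENT_MODEL: Dict[str, List[str]] = {
    "Clinical problem": ["General information", "Diagnosis", "Referral", "Treatment"],
    "Clinical unit": ["Function and specialty", "Practical information",
                      "Referral", "Staff and organization"],
    "Clinical method": ["General information", "Practical information", "Referral",
                        "Aftercare", "Risks", "Expected results"],
    "Drugs": ["General information", "Practical information", "Target group",
              "Effect", "Side effects"],
    "Services": ["General information", "Practical information", "Referral"],
    "Notice": ["General information", "Practical information", "Qualification"],
}


def is_valid_annotation(document_class: str, component: str) -> bool:
    """Hypothetical helper: True if the component is defined for the document class."""
    return component in SEMANTIC_COMPONENT_MODEL.get(document_class, [])
```

An indexing tool could use such a check to offer an indexer only the components that belong to the document class already chosen.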

Page 12

Case study

• sundhed.dk: Danish national health portal
• Active since 2001, 25,000 documents
• Two main target groups: citizens and medical professionals
• Combination of full-text indexing and controlled, assigned indexing:
  o ICPC, International Classification of Primary Care
  o ICD-10, International Classification of Diseases
  o Home-grown Citizens Thesaurus
• Large and varied group of indexers
  o 5 regions
  o Up to 250 indexers per region
• Specific target group: family doctors

Page 13

Test design

• Comparative, experimental indexing study
  o Baseline: keyword indexing (controlled and free terms)
  o Experimental: semantic component indexing
• Test persons: 16 sundhed.dk indexers (convenience sample)
• Indexing task: 12 sundhed.dk documents
  o 6 documents were indexed with semantic components (SC)
  o 6 documents were indexed with keywords
• Random assignment of documents and indexing methods
• Training session
• Evaluation measures:
  o Accuracy
  o Consistency
  o Indexing time
  o Easiness
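The random assignment in the test design can be illustrated with a short sketch. The slides do not describe the actual randomization procedure, so this is only one plausible reading: every indexer receives all 12 documents, and 6 of them, drawn at random per indexer, are indexed with semantic components while the remaining 6 are indexed with keywords. The function name `assign_methods` and the labels "SC" and "keywords" are assumptions.

```python
import random
from typing import Dict, List


def assign_methods(indexers: List[str], documents: List[int],
                   seed: int = 0) -> Dict[str, Dict[int, str]]:
    """Assign an indexing method to each (indexer, document) pair.

    Assumption: half of the documents are drawn at random for SC indexing,
    independently for each indexer; the rest are indexed with keywords.
    """
    rng = random.Random(seed)  # fixed seed so the plan is reproducible
    half = len(documents) // 2
    plan: Dict[str, Dict[int, str]] = {}
    for indexer in indexers:
        sc_docs = set(rng.sample(documents, half))
        plan[indexer] = {
            doc: ("SC" if doc in sc_docs else "keywords") for doc in documents
        }
    return plan


# 16 indexers and 12 documents, as in the test design.
plan = assign_methods([f"indexer_{i}" for i in range(1, 17)], list(range(1, 13)))
```

Under this reading each method is used for at most 16 × 6 = 96 indexing instances.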

Page 14

Research questions

• Is semantic component indexing more accurate than keyword indexing compared to a reference standard?

• Is semantic component indexing more consistent than keyword indexing?

• Is semantic component indexing faster than keyword indexing?

• Is semantic component indexing easier than keyword indexing?

Page 15

Accuracy

Document   Semantic component                  Keywords
           Recall          Precision           Recall          Precision
           (macroaverage)  (macroaverage)      (macroaverage)  (macroaverage)
1          0.74 ± 0.37     0.89 ± 0.26         0.14 ± 0.33     0.74 ± 0.43
2          0.56 ± 0.33     0.61 ± 0.39         0.35 ± 0.47     0.74 ± 0.42
3          0.59 ± 0.45     0.72 ± 0.38         0.10 ± 0.23     0.72 ± 0.42
4          0.33 ± 0.29     0.72 ± 0.41         0.16 ± 0.35     0.70 ± 0.45
5          0.74 ± 0.39     0.68 ± 0.47         0.38 ± 0.47     0.85 ± 0.30
6          0.59 ± 0.13     0.81 ± 0.35         0.01 ± 0.04     0.88 ± 0.31
7          0.63 ± 0.39     0.79 ± 0.31         0.28 ± 0.36     0.62 ± 0.41
8          0.70 ± 0.31     0.93 ± 0.17         0.01 ± 0.02     0.61 ± 0.49
9          0.66 ± 0.33     0.76 ± 0.43         0.21 ± 0.39     0.79 ± 0.39
10         0.61 ± 0.35     0.75 ± 0.26         0.25 ± 0.42     0.79 ± 0.39
11         0.65 ± 0.43     0.86 ± 0.31         0.12 ± 0.27     0.80 ± 0.36
12         0.63 ± 0.48     0.83 ± 0.30         0.03 ± 0.08     0.85 ± 0.34
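Macroaveraged recall and precision against a reference standard can be sketched as follows. The slide does not say which units are averaged, so this example assumes that each indexer's set of assigned terms is scored against the reference set and that the per-indexer scores are then averaged; the function names and the toy data in the usage lines are hypothetical.

```python
from statistics import mean
from typing import List, Set, Tuple


def recall_precision(assigned: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Recall and precision of one indexer's terms against the reference standard."""
    hits = len(assigned & reference)
    recall = hits / len(reference) if reference else 0.0
    precision = hits / len(assigned) if assigned else 0.0
    return recall, precision


def macroaverage(indexings: List[Set[str]], reference: Set[str]) -> Tuple[float, float]:
    """Average the per-indexer recall and precision (each indexer weighs equally)."""
    scores = [recall_precision(assigned, reference) for assigned in indexings]
    return mean(r for r, _ in scores), mean(p for _, p in scores)


# Toy data, not taken from the study.
reference = {"a", "b", "c", "d"}
indexings = [{"a", "b"}, {"a", "x"}]
macro_recall, macro_precision = macroaverage(indexings, reference)
```

With macroaveraging, an indexer who assigns one term counts as much as one who assigns ten, which is one reason the standard deviations in such a table can be large.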

Page 16

Consistency

Columns:
  (A) Semantic component: mean K ± SD (of all semantic components in the document)
  (B) Keywords: binary K (all vocabularies)
  (C) Keywords: traditional consistency ± SD, where consistency = c / (a + b - c)

Document   (A)            (B)      (C)
1          0.46 ± 0.35    -0.08    0.05 ± 0.13
2          0.21 ± 0.16    0.001    0.18 ± 0.19
3          0.25 ± 0.30    -0.08    0.05 ± 0.11
4          0.35 ± 0.23    0.02     0.19 ± 0.30
5          0.50 ± 0.30    0.32     0.33 ± 0.23
6          0.05 ± 0.11    -0.07    0.23 ± 0.41
7          0.40 ± 0.48    0.26     0.27 ± 0.18
8          0.66 ± 0.11    -0.08    0.05 ± 0.11
9          0.04 ± 0.24    -0.02    0.09 ± 0.14
10         0.44 ± 0.16    0.27     0.29 ± 0.13
11         0.48 ± 0.41    -0.06    0.04 ± 0.09
12         0.01 ± 0.07    -0.12    0.08 ± 0.24
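The traditional consistency formula c / (a + b - c) can be worked through in a few lines. The slide does not define the symbols; this sketch assumes the usual reading, in which a and b are the numbers of terms assigned by two indexers and c is the number of terms they have in common, and it averages the measure over all indexer pairs. The function names are hypothetical, and the K statistics in the other columns are not reproduced here.

```python
from itertools import combinations
from statistics import mean
from typing import List, Set


def pair_consistency(terms_a: Set[str], terms_b: Set[str]) -> float:
    """Consistency = c / (a + b - c) for one pair of indexers."""
    a, b = len(terms_a), len(terms_b)
    c = len(terms_a & terms_b)
    denominator = a + b - c
    # Convention chosen here (an assumption): two empty term sets score 0.0.
    return c / denominator if denominator else 0.0


def mean_consistency(indexings: List[Set[str]]) -> float:
    """Mean pairwise consistency over all pairs of indexers of one document."""
    return mean(pair_consistency(x, y) for x, y in combinations(indexings, 2))


# Toy example: 2 shared terms out of 4 distinct terms in total.
example = pair_consistency({"a", "b", "c"}, {"b", "c", "d"})
```

Here `example` evaluates to 2 / (3 + 3 - 2) = 0.5.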

Page 17

Time to index

[Bar chart: number of indexing instances (y-axis, 0 to 40) by time to index (< 2 min, 2 - 5 min, 5 - 10 min, 10 - 15 min, > 15 min), for semantic component indexing and keyword indexing]

Page 18

Easiness

[Bar chart: number of indexers (y-axis, 0 to 10) rating each task from "very difficult" to "very easy". Tasks: choose concept, choose keyword, what each SC is, designate SC, mark boundaries, choose doc. class]

Page 19

Conclusions

• Varied accuracy for both indexing methods, but the data suggest that semantic component indexing might be more accurate
• Indications that feasibility and easiness of the indexing methods are similar
• Semantic component indexing may be a preferable alternative if no appropriate controlled vocabulary is available, due to short development time and easy customization to a specific document collection
• Limitations:
  o Small sample and a single domain
  o Evaluation measures not directly comparable
• Retrieval test shows improvement of document ranking of 25.6% by nDCG (normalized Discounted Cumulative Gain)
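For readers unfamiliar with nDCG, the following sketch shows one common formulation, with a log2(rank + 1) discount and normalization by the ideal ordering of the same gains. The slides do not state which nDCG variant or cut-off the retrieval test used, so this is an assumption, and the numbers in the usage lines are arbitrary illustrations, not study data.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain of a ranked list of relevance gains."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains: Sequence[float]) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def relative_improvement(baseline: float, experimental: float) -> float:
    """Percentage improvement of the experimental score over the baseline."""
    return (experimental - baseline) / baseline * 100


# Arbitrary illustration: a ranking with the most relevant document in second place.
score = ndcg([1, 3, 0])
gain_pct = relative_improvement(0.40, 0.50)
```

A percentage improvement "by nDCG" is the relative difference between the two systems' nDCG scores, as computed by `relative_improvement`.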

Page 20

Future research

• Development of model:
  o Simpler version
  o Mark-up by users (social tagging)
  o Automatic mark-up
  o Mark-up by XML
• Larger scale evaluation
• Evaluation in other domains

Page 21

Literature

Dillon, M (1991). Reader’s model of text structures: the case of academic articles. International Journal of Man-Machine Studies, 35. 913-925.

Ely, J, Osheroff, J, Ebell, M, Bergus, G, Levy, B, Chambliss, M & Evans, E (1999). Analysis of questions asked by family doctors regarding patient care. BMJ, 319 (7206). 358-361.

Ely, J, Osheroff, J, Gorman, P, Ebell, M, Bergus, G, Levy, B, Chambliss, M, Pifer, E & Stavri, P (2000). A taxonomy of generic clinical questions: classification study. BMJ, 321 (7278). 429-432.

Fagin, R., Kumar, R., McCurley, K S., Novak, J., Sivakumar, D., Tomlin, J.A. & Williamson, D.P. (2003). Searching the workplace web. In: Proceedings of the 12th International World Wide Web Conference (WWW ’03), Budapest, Hungary, May 20-24, 2003. 366-375.

Freund, L., Toms, E. & Waterhouse, J. (2005). Modeling the information behaviour of software engineers using a work-task framework. In: Grove, A (ed.) ASIS&T ’05 Proceedings of the 68th Annual meeting, Charlotte, NC, October 28 - November 2, 2005.

Hearst, M & Plaunt, C (1993). Subtopic structuring for full length document access. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval. 59-69.

Leckie, G.J., Pettigrew, K.E. & Sylvain, C. (1996). Modeling the information seeking of professionals. Library Quarterly, 66 (2). 161-193.

Orlikowski, W J & Yates, J (1994). Genre repertoire: the structuring of communicative practices in organizations. Administrative Science Quarterly, 39. 541-574.

Price, S, Delcambre, L & Nielsen, M L (2006). Using semantic components to express questions against document collections. Proceedings International Workshop on Health Information and Knowledge Management (HIKM 2006), Arlington (VA).

Price, S, Nielsen, M L, Delcambre, L & Vedsted, P (2007). Semantic components enhance retrieval of domain-specific documents. Proceedings of the ACM Sixteenth Conference on Information and Knowledge Management (CIKM), Lisboa, November 6 - 8, 2007.

Page 22

[Screenshot of the search interface. Annotations: "Search term"; "Search term should appear in specified semantic component"]

Page 23

[Screenshot of the search interface. Annotation: "Semantic component should appear in document"]
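The two search conditions named on these slides (a search term must appear in a specified semantic component; a semantic component must be present in the document) can be sketched as simple predicates. This is an illustrative toy, not the interface shown in the screenshots: the document representation (a mapping from component name to segment text), the function names, and the sample texts are all assumptions.

```python
from typing import Dict, List

# Assumed representation: semantic component name -> text of that segment.
Document = Dict[str, str]


def term_in_component(doc: Document, term: str, component: str) -> bool:
    """True if the search term occurs in the text of the specified component."""
    return term.lower() in doc.get(component, "").lower()


def has_component(doc: Document, component: str) -> bool:
    """True if the document contains a non-empty segment for the component."""
    return bool(doc.get(component, "").strip())


def search(docs: Dict[str, Document], term: str, component: str) -> List[str]:
    """Return ids of documents whose given component mentions the term."""
    return [doc_id for doc_id, doc in docs.items()
            if term_in_component(doc, term, component)]


# Made-up sample documents for illustration only.
docs = {
    "d1": {"General information": "About asthma.", "Treatment": "Inhaled steroids."},
    "d2": {"General information": "Steroids overview.", "Referral": "Refer to clinic."},
}
hits = search(docs, "steroids", "Treatment")
```

In this toy collection both documents mention "steroids", but only `d1` does so inside the "Treatment" component, which is the kind of distinction between topically similar documents that the model is meant to support.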

Page 25

[Bar chart: number of documents (y-axis, 0 to 250) by time to index (< 2 min, 2 - 5 min, 5 - 10 min, 10 - 15 min, > 15 min)]

Page 26

Time to index

Time required for indexing documents

Indexing type         Total documents      Mean num. docs indexed   Mean time   Min time    Max time
                      indexed (max = 96)   per indexer (max = 6)    (min:sec)   (min:sec)   (min:sec)
Semantic components   83                   5.2                      07:03       00:24       27:05
Keywords              88                   5.5                      05:56       01:06       31:26

Page 27

[Bar chart: number of indexers (y-axis, 0 to 10) by task type (for indexing documents, for searching). Series: prefer keyword indexing, about the same, prefer semantic component indexing]

Page 28

Research team

General practice
• Peter Vedsted, MD, Ph.D., Research Unit for General Practice, Århus University
• Jens Rubak, MD, Praksis.dk, Region Midt

Information and computer science
• Lois Delcambre, Ph.D., Professor, Computer Science Department, Portland State University, USA
• Susan Price, MD, Ph.D. student, Computer Science Department, Portland State University, USA
• Marianne Lykke, Ph.D., Associate professor, Information Interaction and Information Architecture, Danmarks Biblioteksskole

sundhed.dk
• Vibeke Luk, Information specialist, sundhed.dk
• Frans la Cour, IT consultant, Autonomy

Supported by grants from the National Science Foundation, grant numbers 0514238, 0511050 and 0534762, the National Library of Medicine Training Grant 5-T15-LM07088 and Kvalitetsudviklingsudvalget for Almen Praksis, Aarhus Amt