
U.S. Department of Energy
Office of Civilian Radioactive Waste Management

Evaluation of the Current DOE Document Conversion System: A Study of Retrievability

Presented to: NRC/DOE Technical Exchange on Electronic Submissions

Presented by: Jake Wooley, Deputy Director, Office of [illegible]
U.S. Department of Energy, Office of Civilian Radioactive Waste Management


Study of Retrievability

Agenda
- Introduction
- Background
- Document conversion system recommendations
- Tests to measure document conversion performance
  * Text accuracy
  * Retrievability tests
- Conclusion

YUCCA MOUNTAIN PROJECT

BSC Presentations_NRC/DOE Technical Exchange_YMWooley_06/25/02.ppt


Study of Retrievability (Continued)

* Background
  - Who is Information Science Research Institute (ISRI)?
  - Licensing Support System (LSS) (1990 - Current)
  - Optical Character Reader (OCR) Conferences (1991 - 1995)
  - Contracted by M&O (1996 - 1999)
  - Contracted by DOE (1990 - 1995, 2000 - Current)
    * Current tasks for FY02
      > Provide recommendations on DOE document conversion system
      > Evaluate performance of DOE document conversion system


Study of Retrievability (Continued)

* Document conversion system recommendations
  - Retrievability is a better performance metric than character accuracy
    * Not all characters are used by a retrieval system
  - Automatic zoning, followed by MANICURE, will produce retrievability equivalent to manual zoning
    * Manually zoned text
    * Automatic zoned text
    * MANICURE


Tests to Measure Document Conversion Performance

* Text accuracy
  - NRC Licensing Support Network Administrator (LSNA) target accuracy for OCR-created text (Licensing Support Network Guidelines provided 1/02)
    * Goal is to have 99.5% accurate text
  - Text accuracy test
    * 17 documents (1253 pages, 164,483 non-stopwords, and 1,361,124 characters)
    * Non-stopword accuracy tests
      > DOE word accuracy between 96.15% and 97.23%
    * Tests of character accuracy of non-stopwords
      > DOE character accuracy between 98.83% and 99.30%
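
The slide does not say how word and character accuracy were computed. The sketch below is one plausible reading, not the procedure used in the study. It assumes that non-stopword accuracy is the fraction of ground-truth non-stopwords that also occur in the OCR output, and that character accuracy is (n - errors) / n, with errors counted as edit distance. The stopword list and all function names are illustrative assumptions.

```python
from collections import Counter

# Illustrative stopword list (an assumption; the study's list is not given).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}


def non_stopwords(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def word_accuracy(truth, ocr):
    """Fraction of ground-truth non-stopwords that are matched in the OCR text."""
    truth_counts = Counter(non_stopwords(truth))
    ocr_counts = Counter(non_stopwords(ocr))
    total = sum(truth_counts.values())
    matched = sum((truth_counts & ocr_counts).values())  # multiset intersection
    return matched / total if total else 1.0


def edit_distance(a, b):
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_accuracy(truth, ocr):
    """(n - errors) / n over the characters of the non-stopwords."""
    t = " ".join(non_stopwords(truth))
    o = " ".join(non_stopwords(ocr))
    if not t:
        return 1.0
    return max(0.0, (len(t) - edit_distance(t, o)) / len(t))


truth = "the fracture density of the borehole"
ocr = "the fracfure density of the borehole"
print(word_accuracy(truth, ocr), character_accuracy(truth, ocr))
```

A single wrong character costs one whole word under the word metric but only one character under the character metric, which is why word accuracy is the lower of the two figures on the slide.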


Tests to Measure Document Conversion Performance (Continued)

* Retrievability tests
  - Test data
    * 1055 documents containing 75,236 pages
    * 40 queries
    * Average number of relevancy judgements per query - 100
    * Autonomy Server™ v2.2.0
  - Retrievability metrics
    * Precision and recall


Tests to Measure Document Conversion Performance (Continued)

* Automatic zoned text versus manually zoned text retrieval tests
  - Average precision
    * Manually zoned - 37.9%
    * Automatic zoned - 39.2%


Tests to Measure Document Conversion Performance (Continued)

* Ranking of retrieved documents compared between the manually zoned text and the automatic zoned text
  - Importance of ranking in retrieval systems
  - Results of ranking tests
    * Correlation factor - 0.97
  - Ranking problems in information retrieval systems
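
The slide reports a correlation factor without naming the statistic. As an illustration only, the sketch below assumes a Spearman rank correlation between the rank each document receives under manual zoning and under automatic zoning. The function name and the sample ranks are hypothetical, and the formula used here is valid only when neither ranking contains ties.

```python
def spearman_rank_correlation(ranks_a, ranks_b):
    """Spearman's rho for two tie-free rankings of the same documents."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two rankings of the same length (at least 2)")
    # Sum of squared rank differences, document by document.
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical ranks for five documents: the top two swap places.
manual_ranks = [1, 2, 3, 4, 5]
automatic_ranks = [2, 1, 3, 4, 5]
print(spearman_rank_correlation(manual_ranks, automatic_ranks))
```

A value near 1 means the two conversions order the retrieved documents almost identically, which is the sense in which the slide treats the rankings as equivalent.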


Tests to Measure Document Conversion Performance (Continued)

* Automatic zoned text versus 99.8% correct text retrieval tests
  - 1058 documents containing 46,731 pages
  - 62 queries
  - Average number of relevant documents per query = 17
  - Average precision
    * 99.8% accurate text - 24.5%
    * Automatic zoned text - 24.2%


Study of Retrievability

Conclusion
- Character accuracy produced by DOE document conversion system close to NRC LSNA goal
- MANICURE improves word and character accuracy
- Average character accuracy of non-stopwords on DOE documents is between 98.83% and 99.30%
- Retrievability is equivalent for automatic zoned text and manually zoned text
- Ranking of query results is equivalent for automatic zoned text and manually zoned text
- Retrievability is equivalent for automatic zoned text and 99.8% accurate text


Backup


[Figure: scanned sample page of the article "Tertiary extension north of the Las Vegas Valley shear zone, Sheep and Desert Ranges, Clark County, Nevada", showing its abstract, introduction, and a Figure 1 location map of southern Nevada]


Example Queries

* Find all documents that discuss fracture frequency data (fracture density and radial fracture density) in boreholes at Yucca Mountain

* Find all documents that discuss strategies for environmental restoration and remediation (Hanford Site)


Precision and Recall

* Precision
  - # of relevant documents retrieved / # of retrieved documents
* Recall
  - # of relevant documents retrieved / total # of relevant documents
* Suppose there are 10 relevant documents for an example query
* And suppose the system returns 15 documents for this example query
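
The two definitions above can be sketched in a few lines. This is a minimal illustration, not code from the study: the function name and the document identifiers are hypothetical, and the example assumes that all 10 relevant documents are among the 15 returned.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical identifiers: 15 documents returned, 10 of them relevant.
returned = list(range(1, 16))
relevant = [1, 4, 5, 7, 9, 10, 11, 13, 14, 15]
precision, recall = precision_recall(returned, relevant)
print(f"precision={precision:.0%} recall={recall:.0%}")
```

Under that assumption, precision over the full result list is 10/15 and recall is 10/10.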


Precision and Recall Example

Docid  Relevant  Recall  Precision
1      R         10%     100%
2                10%     50%
3                10%     33%
4      R         20%     50%
5      R         30%     60%
6                30%     50%
7      R         40%     57%
8                40%     50%
9      R         50%     55%
10     R         60%     60%
11     R         70%     64%
12               70%     58%
13     R         80%     62%
14     R         90%     64%
15     R         100%    67%

NOTE: Docid - Document identifier


Calculating Average Precision

* Add precision values at 10%, 20%, ..., 100% recall levels
  - 100+50+60+57+55+60+64+62+64+67 = 639
* Average: total precision / # of recall levels
  - 639 / 10 = 63.9%
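
The same calculation can be sketched in code using the relevance column of the example table. The function names are illustrative assumptions. The code works with unrounded precision values, so individual terms can differ by a point from the whole percentages added on the slide (for example, 5/9 is 55.6%), while the average still comes to about 63.9%.

```python
def precision_at_relevant_ranks(relevance_flags, total_relevant):
    """Return (recall, precision) at each rank where a relevant document appears."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points


def average_precision(relevance_flags, total_relevant):
    """Average the precision values over all recall levels.

    Relevant documents that are never retrieved add nothing to the sum,
    so they count as a precision of 0 at their recall level.
    """
    if total_relevant == 0:
        return 0.0
    points = precision_at_relevant_ranks(relevance_flags, total_relevant)
    return sum(precision for _, precision in points) / total_relevant


# Ranks marked "R" in the example table.
relevant_ranks = {1, 4, 5, 7, 9, 10, 11, 13, 14, 15}
flags = [rank in relevant_ranks for rank in range(1, 16)]
print(f"{average_precision(flags, total_relevant=10):.1%}")
```

With 10 relevant documents, each one retrieved raises recall by 10%, so the ranks marked "R" are exactly the 10%, 20%, ..., 100% recall levels used on the slide.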


Scatter Plot of Automatic Zoned and Manual Zoned Ranks

[Figure: scatter plot of document ranks. x-axis: Automatically-Zoned Document Ranks (0 to 1000). The y-axis label is not recoverable.]