TRANSCRIPT
U.S. Department of Energy
Office of Civilian Radioactive Waste Management
Evaluation of the Current DOE Document Conversion System: A Study of Retrievability
Presented to: NRC/DOE Technical Exchange on Electronic Submissions
Presented by: Jake Wooley, Deputy Director, Office of [illegible]
U.S. Department of Energy, Office of Civilian Radioactive Waste Management
Study of Retrievability
Agenda
- Introduction
- Background
- Document conversion system recommendations
- Tests to measure document conversion performance
* Text accuracy
* Retrievability tests
- Conclusion
YUCCA MOUNTAIN PROJECT
BSC Presentations_NRC/DOE Technical Exchange_YMWooley_06/25/02.ppt
Study of Retrievability (Continued)
* Background
- Who is Information Science Research Institute (ISRI)?
- Licensing Support System (LSS) (1990 - Current)
- Optical Character Reader (OCR) Conferences (1991 - 1995)
- Contracted by M&O (1996 - 1999)
- Contracted by DOE (1990 - 1995, 2000 - Current)
* Current tasks for FY02
> Provide recommendations on DOE document conversion system
> Evaluate performance of DOE document conversion system
Study of Retrievability (Continued)
* Document conversion system recommendations
- Retrievability is a better performance metric than character accuracy
* Not all characters are used by a retrieval system
- Automatic zoning, followed by MANICURE, will produce retrievability equivalent to manual zoning
* Manually zoned text
* Automatic zoned text
* MANICURE
Tests to Measure Document Conversion Performance
* Text accuracy
- NRC Licensing Support Network Administrator (LSNA) target accuracy for OCR created text (Licensing Support Network Guidelines provided 1/02)
* Goal is to have 99.5% accurate text
- Text accuracy test
* 17 documents (1253 pages, 164,483 non-stopwords, and 1,361,124 characters)
* Non-stopword accuracy tests
> DOE word accuracy between 96.15% and 97.23%
* Tests of character accuracy of non-stopwords
> DOE character accuracy between 98.83% and 99.30%
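The slides do not state the formula behind these accuracy figures. One common definition in OCR evaluation, assumed here, is (n - errors) / n, where n is the number of characters in the ground-truth text and errors is the edit distance between the ground truth and the OCR output. A minimal Python sketch under that assumption (the function names are illustrative, not taken from the DOE system):

```python
def edit_distance(truth: str, ocr: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning truth into ocr."""
    prev = list(range(len(ocr) + 1))
    for i, t in enumerate(truth, start=1):
        curr = [i]
        for j, o in enumerate(ocr, start=1):
            cost = 0 if t == o else 1
            curr.append(min(prev[j] + 1,          # character missing from OCR output
                            curr[j - 1] + 1,      # extra character in OCR output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_accuracy(truth: str, ocr: str) -> float:
    """Assumed definition: (n - errors) / n, with n = len(truth)."""
    n = len(truth)
    if n == 0:
        raise ValueError("ground-truth text must not be empty")
    return (n - edit_distance(truth, ocr)) / n


if __name__ == "__main__":
    # One substituted character ('l' read as '1') in a 9-character word.
    print(f"{character_accuracy('retrieval', 'retrieva1'):.2%}")
```

Under this definition, the 99.5% goal corresponds to at most 5 character errors per 1,000 ground-truth characters.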
Tests to Measure Document Conversion Performance (Continued)
* Retrievability tests
- Test data
* 1055 documents containing 75,236 pages
* 40 queries
* Average number of relevancy judgements per query - 100
* Autonomy Server™ v2.2.0
- Retrievability metrics
* Precision and recall
Tests to Measure Document Conversion Performance (Continued)
* Automatic zoned text versus manually zoned text retrieval tests
- Average precision
* Manually zoned - 37.9%
* Automatic zoned - 39.2%
Tests to Measure Document Conversion Performance (Continued)
* Ranking of retrieved documents compared between the manually zoned text and the automatic zoned text
- Importance of ranking in retrieval systems
- Results of ranking tests
* Correlation factor - 0.97
- Ranking problems in information retrieval systems
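The slides do not say which statistic the correlation factor is. As an illustration only, the sketch below assumes a Spearman rank correlation between the rank each document receives in the manually zoned collection and in the automatically zoned collection, with no tied ranks. The function name and the sample document IDs are hypothetical.

```python
def spearman_rho(ranks_a: dict, ranks_b: dict) -> float:
    """Spearman rank correlation for two tie-free rankings of the same documents.

    Each argument maps a document ID to its rank (1 = retrieved first).
    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is valid only
    when both rankings are permutations of 1..n.
    """
    if ranks_a.keys() != ranks_b.keys():
        raise ValueError("both rankings must cover the same documents")
    n = len(ranks_a)
    if n < 2:
        raise ValueError("need at least two documents")
    d_squared = sum((ranks_a[doc] - ranks_b[doc]) ** 2 for doc in ranks_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Hypothetical ranks for five documents under the two zoning methods.
    manual = {"d1": 1, "d2": 2, "d3": 3, "d4": 4, "d5": 5}
    automatic = {"d1": 2, "d2": 1, "d3": 3, "d4": 4, "d5": 5}
    print(spearman_rho(manual, automatic))
```

A value near 1 means the two collections order the retrieved documents almost identically, which is the sense in which the slides call the rankings equivalent.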
Tests to Measure Document Conversion Performance (Continued)
* Automatic zoned text versus 99.8% correct text retrieval tests
- 1058 documents containing 46,731 pages
- 62 queries
- Average number of relevant documents per query = 17
- Average precision
* 99.8% accurate text - 24.5%
* Automatic zoned text - 24.2%
Study of Retrievability
Conclusion
- Character accuracy produced by DOE document conversion system close to NRC LSNA goal
- MANICURE improves word and character accuracy
- Average character accuracy of non-stopwords on DOE documents is between 98.83% and 99.30%
- Retrievability is equivalent for automatic zoned text and manually zoned text
- Ranking of query results is equivalent for automatic zoned text and manually zoned text
- Retrievability is equivalent for automatic zoned text and 99.8% accurate text
Backup
[Image of a scanned journal page: "Tertiary extension north of the Las Vegas Valley shear zone, Sheep and Desert Ranges, Clark County, Nevada" (abstract, introduction, and Figure 1 location map). Page text not legible in this transcript.]
Example Queries
* Find all documents that discuss fracture frequency data (fracture density and radial fracture density) in boreholes at Yucca Mountain
* Find all documents that discuss strategies for environmental restoration and remediation (Hanford Site)
Precision and Recall
* Precision
- # of relevant documents retrieved / # of retrieved documents
* Recall
- # of relevant documents retrieved / total # of relevant documents
* Suppose there are 10 relevant documents for an example query
* And suppose the system returns 15 documents for this example query
Precision and Recall Example
Docid  Relevant  Recall  Precision
1      R         10%     100%
2                10%     50%
3                10%     33%
4      R         20%     50%
5      R         30%     60%
6                30%     50%
7      R         40%     57%
8                40%     50%
9      R         50%     55%
10     R         60%     60%
11     R         70%     64%
12               70%     58%
13     R         80%     62%
14     R         90%     64%
15     R         100%    67%
NOTE: Docid - Document identifier
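The recall and precision columns in the example table can be reproduced by walking down the ranked list and counting relevant documents seen so far. The Python sketch below does this for the 15 returned documents; the function name and the boolean encoding of the "R" column are illustrative choices, not part of the presentation.

```python
def recall_precision_rows(relevant_flags, total_relevant):
    """Return (docid, is_relevant, recall, precision) after each ranked document.

    recall    = relevant documents retrieved so far / total relevant documents
    precision = relevant documents retrieved so far / documents retrieved so far
    """
    rows = []
    hits = 0
    for docid, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
        rows.append((docid, is_relevant, hits / total_relevant, hits / docid))
    return rows


# "R" column of the example table, in rank order (True = relevant).
FLAGS = [True, False, False, True, True, False, True, False,
         True, True, True, False, True, True, True]

if __name__ == "__main__":
    for docid, is_relevant, recall, precision in recall_precision_rows(FLAGS, 10):
        marker = "R" if is_relevant else " "
        print(f"{docid:>2}  {marker}  {recall:>5.0%}  {precision:>5.0%}")
```

The sketch rounds to the nearest percent, so Docid 9 prints as 56% (5/9 = 55.6%) where the slide shows 55%.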
Calculating Average Precision
* Add precision values at 10%, 20%, ..., 100% recall levels
- 100+50+60+57+55+60+64+62+64+67 = 639
* Average = total precision / # of recall levels
- 639 / 10 = 63.9%
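The same calculation can be sketched in Python. This version assumes, as in the example, that each relevant document marks one recall level (10 relevant documents, 10 levels), so the average is the mean of the precision values observed at each relevant document. It uses unrounded precision values, whereas the slide sums whole percentages; the function name is illustrative.

```python
def average_precision(relevant_flags, total_relevant):
    """Mean of the precision values measured at each relevant document.

    Relevant documents that are never retrieved contribute zero,
    because the sum is divided by total_relevant.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant


# Relevance of the 15 returned documents in the example, in rank order.
FLAGS = [True, False, False, True, True, False, True, False,
         True, True, True, False, True, True, True]

# Whole-percent precision values read off the slide at each recall level.
SLIDE_VALUES = [100, 50, 60, 57, 55, 60, 64, 62, 64, 67]

if __name__ == "__main__":
    print(f"unrounded: {average_precision(FLAGS, 10):.1%}")
    print(f"slide:     {sum(SLIDE_VALUES) / len(SLIDE_VALUES)}%")
```

Both routes give 63.9% for this example.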
Scatter Plot of Automatic Zoned and Manual Zoned Ranks
[Figure: scatter plot. X-axis: Automatically-Zoned Document Ranks (0 to 1000); y-axis label not legible.]