submitted on: may 31, 2013library.ifla.org/131/1/152-lim-en.pdf · search d a typica through. next...
TRANSCRIPT
Connestructu Chee KTechnoE-mail BalakuTechno E-mail
Copyrigunder thhttp://cre
Abstra With so search d A typicathrough.next one Instead informat The advand grow By leverLibrary allowed Keywor
ecting librured and
Kiam Lim ology and Inaddress: ch
umar Chinnology and In address: b
ght © 2013 byhe terms of theativecommo
ct:
much data does a great j
al search at a. This tediou
e until the use
of having ution package
vances in Bigwing amount
raging data Board (NLBus to provid
rds: data min
rary contunstruct
nnovation, Nhee_kiam_li
nasamy nnovation, Nbalakumar_c
y Chee Kiamhe Creative Cons.org/licen
available, hjob but is it s
a popular Inus process coer’s informa
users repeates to them. A
g Data technt of informat
mining and B) of Singapode comprehen
ning, text ana
ent usingured data
National [email protected]
National Libchinnasamy
m Lim and BCommons Atnses/by/3.0/
ow can busysufficient?
nternet searcontinues aftetion needs ar
t the tediound to do this
nologies pretion resource
d text analytiore has connnsive, releva
alytics, Maho
g data mina
brary Boardv.sg
brary [email protected]
Balakumar ttribution 3.0
y users find
ch engine reter finding a re satisfied (
us search ans, we must co
esent significes in our repo
ics techniquenected our snt and truste
out, Hadoop
ning and t
d, Singapore
d, Singaporesg
Chinnasam0 Unported L
the right bit
turns thousarelevant art
(or they give
nd sieve proonnect our co
cant opportunositories.
es, and Big tructured an
ed informatio
, National Li
Submitted
text analy
e.
e.
my. This workLicense:
ts of informa
ands of resulticle to find up).
ocess, we shontent.
nities for us
Data technond unstructuron to our use
ibrary Board
on: May 31
ytics on
k is made ava
ation? The v
lts for a userthe next one
hould push
to connect
ologies, the red content. ers.
d (NLB) of S
, 2013
1
ailable
venerable
r to sieve e and the
relevant
the huge
National This has
Singapore.
1 CON The extcollectiannuall By usinthe borecomm Our titlThese in
•
•
NNECTING
tensive netwon of overy.
ng data minoks, NLB
mendation se
le recommenclude:
“Read On” o “Re
eng
Figure 1.1: T “Library in
o LiY
Figure 1.2: T
G STRUCT
work of librr a million
ning techniqhas succe
ervice since
endation se
(http://readead On” is ines to easi
Title recomm
n Your PockYP is NLB’s
Title recomm
TURED DA
raries of then physical t
ques on pasessfully coe 2009.
ervice can b
don.sg) NLB’s se
ly crawl and
mendations o
ket” (LiYP)s mobile lib
mendations o
2
ATA
e National titles. Thes
st loan tranonnected o
be found o
earch-engined index NL
n NLB’s “R
(http://m.nlrary portal
n NLB’s LiY
Library Bose titles gen
sactions anur titles a
n many of
e friendly B’s collecti
ead On” mic
lb.gov.sg)
YP mobile si
ard (NLB) nerate over
d the biblioand launch
f NLB’s we
micro-site ion of book
cro-site
ite
of Singapor 30 millio
ographic reched our ow
ebsites and
that allowss
ore has a on loans
cords of wn title
portals.
s search
•
•
The titlthe NLB
“SearchPluo “Se
reso
Figure 1.3: T
Online “C(http://www
o NLBpatrouts
Figure 1.4: T
le recommeB Open Dat
us" discoverarchPlus” i
ources, inclu
Title recomm
heck Yourw.pl.sg) B’s Public rons to checstanding fin
Title recomm
endation serta/Web initi
ry portal (httis NLB’s duding e-boo
mendations o
r Account”
Library pock their acc
nes
mendations o
rvice is alsoiative (http:
3
tp://searchpdiscovery poks and e-D
n NLB’s “Se
patron se
ortal’s “Chcount inform
n NLB’s Pub
o available //nlblabs.sg
plus.nlb.govortal for boatabases
earchPlus” d
ervice on N
heck Your mation such
blic Library
for third pag).
v.sg) oth physica
discovery por
NLB’s Pub
Account” h as loan ite
Portal
arty develop
al titles and
rtal
blic Library
service alloems, due da
pers to use
d digital
y portal
ows our ates and
through
Data mapproac NLB’s combin Given apatrons
The “Ofilteringhundredtend to years athat are In casescontent It is theactual lin patro
ining is the ch and proce
title recomnes both coll
a title, the talso borrow
Figure 1.5: “
Our patrons g for recomds of millionobtain recore consider outdated.
s where col-based filter
erefore a rooans by pat
on reading p
process to ess taken by
mmendation laborative a
title recommwed the foll
“Our patrons
also borrommendations
ns of loan rommendatiored when g
llaborative fring using t
obust and fitrons, and apatterns.
discover pay NLB.
service imand content-
mendation slowing” and
s also borrow
wed the fos. Collaborarecords to mons from peenerating th
filtering failthe bibliogra
ine-grained s a result, w
4
atterns in la
mplementatio-based filter
service gened “Quick Pi
wed the follow
ollowing” cative filteri
make recommople with thhe recomm
ls due to loaphic record
recommendwill also adju
arge data set
on is a hybring.
erates 2 typcks”.
wing” and “
ategory pring mines thmendationshe same int
mendations t
w or no loads.
dation systeust its recom
ts. The follo
brid recomm
pes of recom
“Quick Pick”
marily reliehe reading . It relies on
terests. Onlyto minimize
ans, the algo
em that takmmendation
owing descr
mender syst
mmendation
” recommend
es on collapatterns wi
n the notiony loans in the recommen
orithm falls
kes into accons if there a
ribes the
tem that
ns: “Our
dations
aborative ithin the
n that we he last 3 ndations
s back to
ount the are shifts
“Quick recommmade fr The parecommwhich a
Figure 1.6: C
Picks” recmendations rom a list of
atron level mendations. are then mer
Figure 1.7: P
Collaborative
commendatusing the
f books sele
recommenThe patro
rged into a
Patron level r
e and conten
tions rely sbibliograph
ected by libr
dations aren’s latest lsingle list.
recommenda
5
nt-based filter
solely on chic recordsrarians inste
e currently loaned title
ations
ring
content-base. Howeveread of the w
implements are used
ed filteringr, these recwhole collec
ted on topto provide
g to come commendatiction.
p of the tite recommen
up with ions are
tle level ndations
6
Table 1.1 shows some statistics on recommendation coverage. Fiction titles are highly loaned and therefore collaborative filtering generates recommendations for 59% of the total fiction titles compared to only a 28% success percentage for the less loaned non-fiction titles. Collaborative filtering Combined with simple content-based filtering Fiction 59% 89% Non-Fiction 28% 53% Overall 34% 59%
Table 1.1 NLB’s recommendation coverage
The simple content-based filtering applied here uses only author and language fields. As can be seen, even combining collaborative filtering with a very straightforward content-based filtering algorithm significantly improves the recommendation coverage, raising the fiction success percentage from an acceptable 59% to an excellent 89% and non-fiction from a low 28% to an acceptable 53%. The recommendation coverage for patron level recommendations is even higher since the recommendations are the combination of the title level recommendations based on the patron’s recent loans.
2 CONNECTING UNSTRUCTURED DATA Unstructured data makes up a huge and growing portion of the content that NLB holds. Connecting these data will allow us to make these contents a lot more discoverable. One of our earliest attempts to connect such content involves using text analytics to perform “key phrase extraction”. The extracted key phrases are then used to perform searches which throw up other related content. These connections ensure that further discovery around the key topics are just 2 clicks away. The NLB “Infopedia” search-engine friendly micro site (http://infopedia.nl.sg) is a popular electronic encyclopedia on Singapore’s history, culture, people and events. With less than 2,000 articles, it attracts over 200,000 page views a month. It makes use of this approach alongside manual recommendations from the librarians. These manual recommendations by librarians are added as Librarian’s Recommendations and provides a 1 click access to related content.
Figu Howevearticles With malternatcontent The Apthat im(http://mclusteri Using trecomm
ure 2.1: Infop
er, as manuhave Libra
most other cotive approac.
pache Mahomplements mahout.apacng, classific
text analytimendations f
pedia connec
ual recommarian Recom
ollections mch to autom
ut softwarea wide r
che.org). It cation, and
cs in Apacfor the full I
ctions throug
mendation is mmendations
many times matically ge
is a scalablrange of supports fofrequent ite
che MahoutInfopedia c
7
h extracted k
a labour ins.
larger than enerate goo
le open soudata mini
our main usem set minin
t to find simollection in
key phrases a
ntensive and
the Infopedod recomm
rce softwarng and me cases, namng.
milar articln less than 5
and manual r
d time cons
dia collectioendations f
re from the Amachine lemely recom
es, we succ minutes.
recommenda
suming task
on, NLB nefor its unstr
Apache Fouearning alg
mmendation
cessfully ge
ations
k, not all
eeded an ructured
undation gorithms mining,
enerated
Using Medicinrecomm The Ma(such asterm fretherefor These adiscovesuch co The folof the alibrariancontent
Figure 2.2: I
the same ne”), Figure
mendations a
ahout text s Euclideanequency/invre statistical
automaticalery immediaonnectivity i
llowing figuautomaticallns. In caseand immed
Infopedia tex
Infopedia ae 2.2 showsautomatical
analytics arn, Cosine, Tverse documlly relevant
lly generateately. Thereis available.
ures from Inly generateds where m
diately allow
xt analytics r
article in Fs that 3 of tlly generate
re based onanimoto, Mment freque.
ed recommee is no longe.
nfopedia in d recommenanual recom
ws the disco
8
result sample
Figure 2.1the manual
ed by Mahou
n establisheManhattan an
ency, etc.).
endations aer a need to
NLB’s stagndations to mmendationovery of oth
e
(titled “Krecommen
ut.
ed and rigond many othThe recom
allow us towait for ma
ging envirosupplemen
ns are not her related c
King Edwarndations wer
orous mathehers for dist
mmendations
o connect reanual recom
onment illusnt manual re
yet availabcontent.
rd VII Colre among th
ematical algtance measus from Mah
elevant artimmendation
strate the caecommendable, they en
llege of he top 4
gorithms urement, hout are
icles for ns before
apability ations by nrich the
9
Figure 2.3: Supplementing an Infopedia article with manual recommendations
Figure 2.4: Enriching an Infopedia article with no manual recommendations
The Sinthe nextthe proc MovingcompletNewspa Newspanewspanewspa Scalingin aroun The resarticles story un
With thyear’s wa muchrecommarticles
ngapore Met data set thcess of push
g on from Inted anotheraperSG coll
aperSG (httapers publishapers at the a
g up from 2,nd 18 hours
ults provedin a chrono
nfolds.
Figure 2.5: P
he successfworth of 58h larger datamendations
from anoth
emory portahat we havehing it out to
nfopedia anr proof-of-clection.
tp://newspaphed betweearticle level
,000 to 58,0s.
d to be very ological ord
PoC results s
ful applicati,000 newspa set. We dto see if w
her newspap
al’s (http://s successfullo enhance t
nd Singaporconcept (Po
pers.nl.sg) ien 1831 andl, and there
000 was rela
promising. der, we can
sample for N
ion of textpaper articledecided to pwe can discper appear in
10
singaporemly completehe current c
re Memory, oC) for abo
is NLB’s od 2009. NLB
are now ov
atively easy
Interesting,discover th
NewspaperSG
analytics es, we were pull all the cover additin the recom
emory.sg) ped this similcontent-base
we looked out 58,000
nline resouB has investver 18 millio
y. We comp
, if we werehe progressi
G (58,000 art
on Infopedencouragedarticles froional persp
mmendations
personal melarity proceed filtering
to scale it unewspaper
urce for Singted significon articles in
pleted the si
e to organiseion of a new
icles)
dia, Singapod to take the
om 2 newspectives of s.
emory collessing and wrecommend
up and succr articles fr
gapore and antly to dign Newspape
imilarity pro
e the recomws item and
ore Memore plunge an
papers and ga news item
ection is we are in dations.
cessfully rom our
Malaya gitise the erSG.
ocessing
mmended d see the
y and a nd tackle generate m when
Workinfrom 18scalabil The prnewspasoftwarcommoOCR ecomplexof interm A reliab Apacheclusters(http://hprocess Many othe map After mrunningon 3 vir
ng on the 6 845 to 2009lity issue. T
rocessing oaper articlesre during ton due to therrors introxity of the mediate dis
ble, scalable
e Hadoop iss of chadoop.apacsing, boastin
of Apache Mp/reduce par
much experg Apache Mrtual machin
Figure 2.6: A
million Ne9 and The S
The processi
of Newspaps were derthe digitisahe historic noduced ‘noi
computatiosk storage.
e and distrib
s a framewocommodity che.org). Itng users inc
Mahout’s coradigm to al
rimentation Mahout in ou
ne hosts.
Apache Hado
ewspaperSGSingapore Fing ran for m
perSG articrived from ation of thenature of thise’ into th
on, leading t
buted compu
ork that supcomput
t is the de-luding Face
ore algorithmllow scaling
and tuningur secondar
oop setup in
11
G articles frFree Press fmore than a
cles also suthe use of
e historic nhe newspapehe data seto lengthy p
uting platfo
pports distribters usin-facto openebook, Twit
ms are implg to reasona
g, we finallry data cent
NLB
rom the 2 nfrom 1925 tweek befor
urfaced anf Optical Cnewspaperers, particu
et, and alsoprocessing
orm is requi
buted proceng simpln source sotter, Yahoo,
lemented onably large d
ly set up are. We set u
newspapers to 1962) gare we ran ou
other challCharacter R
microfilmslar for the o significanand the nee
red.
essing of larle prograoftware plat, LinkedIn a
n top of Apata sets.
a full Apacup a total o
(The Straitave us our fut of disk st
lenging issuRecognitions. OCR errolder issuesntly increaed for huge
rge data setamming tform for band EBay.
pache Hadoo
che Hadoopof 13 virtual
ts Times first real torage.
ue. The n (OCR) rors are s. These ased the
amount
ts across models
big data
op using
p cluster l servers
12
For more information on our setup, do feel free to contact us. To tackle the issue of OCR errors within the newspaper articles, we tuned the parameters for the text analytics algorithm to ignore infrequent word tokens. This has resulted in the removal of a significant portion of the OCR errors before the actual Mahout processing commenced. With the Apache Hadoop cluster set up and applying more aggressive algorithm parameters, we were able to generate similarity recommendations for more than 150,000 articles. The aggressive change in the parameter values are instrumental in slicing away a huge chunk of the time required for the processing. From being able to process 58,000 articles in around 18 hours, we are now able to process 150,000 articles in less than 30 minutes. However, we are still in the midst of resolving issues as we move towards our target of processing the 6 million articles. 3 NEXT STEPS We are currently working on additional proof-of-concepts with Apache Mahout particularly in the areas of on content enrichment, clustering and classification. In the area of content enrichment, there are lots of interesting ideas and possibilities for us to explore and conquer. We hope to enrich the content with semantic information so that content becomes connected semantically instead of just textually. We also hope to enrich the content with language translations to explore the possibilities of connecting content in different languages, given that we have four official languages in Singapore. On the clustering front, working with user-submitted personal memories from the Singapore Memory portal, Apache Mahout clustered the memories into 43 groups (Figure 8). The key terms amongst the memories within each cluster can then be analysed to identify the key theme for the clusters.
Figure 3.1: Singapore Memory Cluster PoC cluster sizes
We aregenerat
For clasform thsubmissamountthe Sing The vaacross phenomto uneacollecti
Figure 3.2: S
e exploring ed clusters.
Figure 3.3: S
ssification, he training dsions into ot of effort regapore Mem
lue of a libthe conten
menon. The rth the hiddons that go
Sample of to
visualizatio
Singapore M
we are currdata that theone of the iequired to m
mory portal.
brary’s collnt of the cability to c
den values ibeyond tho
p terms in ge
on technolo
Memory porta
rently collate classificatiidentified cmaintain an.
lections inccollections, connect conin their painousands of i
13
enerated clus
ogies that w
al’s showcase
ting data foion algorithclusters. Thind update th
creases witha result
ntent automanstakingly citems. Data
sters
will allow u
e clusters
or user-idenhms use to ais will allowhe current c
h the numbof the weatically is t
curated collemining and
users to na
ntified clusteautomaticallw us to draclusters that
ber of connll understo
therefore esections, espd text analyt
avigate thro
ers. These dly classify nastically redt are showc
nections witood networssential for pecially for tics technol
ough the
data will new user duce the cased on
thin and rk effect libraries sizeable
logy and
14
software are now more matured and readily available through open source licenses. The time is ripe for libraries to leverage on them to bring out the enormous values in their collections. There are many more possibilities that we will explore in the future to make more and better connections – connections that allow users to discover comprehensive, relevant and trusted information. We do the work so that our users do not.