submitted on: may 31, 2013library.ifla.org/131/1/152-lim-en.pdf · search d a typica through. next...

Connestructu Chee KTechnoE-mail BalakuTechno E-mail

Copyrigunder thhttp://cre

Abstra With so search d A typicathrough.next one Instead informat The advand grow By leverLibrary allowed Keywor

ecting librured and

Kiam Lim ology and Inaddress: ch

umar Chinnology and In address: b

ght © 2013 byhe terms of theativecommo

ct:

much data does a great j

al search at a. This tediou

e until the use

of having ution package

vances in Bigwing amount

raging data Board (NLBus to provid

rds: data min

rary contunstruct

nnovation, Nhee_kiam_li

nasamy nnovation, Nbalakumar_c

y Chee Kiamhe Creative Cons.org/licen

available, hjob but is it s

a popular Inus process coer’s informa

users repeates to them. A

g Data technt of informat

mining and B) of Singapode comprehen

ning, text ana

ent usingured data

National [email protected]

National Libchinnasamy

m Lim and BCommons Atnses/by/3.0/

ow can busysufficient?

nternet searcontinues aftetion needs ar

t the tediound to do this

nologies pretion resource

d text analytiore has connnsive, releva

alytics, Maho

g data mina

brary Boardv.sg

brary [email protected]

Balakumar ttribution 3.0

y users find

ch engine reter finding a re satisfied (

us search ans, we must co

esent significes in our repo

ics techniquenected our snt and truste

out, Hadoop

ning and t

d, Singapore

d, Singaporesg

Chinnasam0 Unported L

the right bit

turns thousarelevant art

(or they give

nd sieve proonnect our co

cant opportunositories.

es, and Big tructured an

ed informatio

, National Li

Submitted

text analy

e.

e.

my. This workLicense:

ts of informa

ands of resulticle to find up).

ocess, we shontent.

nities for us

Data technond unstructuron to our use

ibrary Board

on: May 31

ytics on

k is made ava

ation? The v

lts for a userthe next one

hould push

to connect

ologies, the red content. ers.

d (NLB) of S

, 2013

1

ailable

venerable

r to sieve e and the

relevant

the huge

National This has

Singapore.

1 CON The extcollectiannuall By usinthe borecomm Our titlThese in

•

•

NNECTING

tensive netwon of overy.

ng data minoks, NLB

mendation se

le recommenclude:

“Read On” o “Re

eng

Figure 1.1: T “Library in

o LiY

Figure 1.2: T

G STRUCT

work of librr a million

ning techniqhas succe

ervice since

endation se

(http://readead On” is ines to easi

Title recomm

n Your PockYP is NLB’s

Title recomm

TURED DA

raries of then physical t

ques on pasessfully coe 2009.

ervice can b

don.sg) NLB’s se

ly crawl and

mendations o

ket” (LiYP)s mobile lib

mendations o

2

ATA

e National titles. Thes

st loan tranonnected o

be found o

earch-engined index NL

n NLB’s “R

(http://m.nlrary portal

n NLB’s LiY

Library Bose titles gen

sactions anur titles a

n many of

e friendly B’s collecti

ead On” mic

lb.gov.sg)

YP mobile si

ard (NLB) nerate over

d the biblioand launch

f NLB’s we

micro-site ion of book

cro-site

ite

of Singapor 30 millio

ographic reched our ow

ebsites and

that allowss

ore has a on loans

cords of wn title

portals.

s search

•

•

The titlthe NLB

“SearchPluo “Se

reso

Figure 1.3: T

Online “C(http://www

o NLBpatrouts

Figure 1.4: T

le recommeB Open Dat

us" discoverarchPlus” i

ources, inclu

Title recomm

heck Yourw.pl.sg) B’s Public rons to checstanding fin

Title recomm

endation serta/Web initi

ry portal (httis NLB’s duding e-boo

mendations o

r Account”

Library pock their acc

nes

mendations o

rvice is alsoiative (http:

3

tp://searchpdiscovery poks and e-D

n NLB’s “Se

patron se

ortal’s “Chcount inform

n NLB’s Pub

o available //nlblabs.sg

plus.nlb.govortal for boatabases

earchPlus” d

ervice on N

heck Your mation such

blic Library

for third pag).

v.sg) oth physica

discovery por

NLB’s Pub

Account” h as loan ite

Portal

arty develop

al titles and

rtal

blic Library

service alloems, due da

pers to use

d digital

y portal

ows our ates and

through

Data mapproac NLB’s combin Given apatrons

The “Ofilteringhundredtend to years athat are In casescontent It is theactual lin patro

ining is the ch and proce

title recomnes both coll

a title, the talso borrow

Figure 1.5: “

Our patrons g for recomds of millionobtain recore consider outdated.

s where col-based filter

erefore a rooans by pat

on reading p

process to ess taken by

mmendation laborative a

title recommwed the foll

“Our patrons

also borrommendations

ns of loan rommendatiored when g

llaborative fring using t

obust and fitrons, and apatterns.

discover pay NLB.

service imand content-

mendation slowing” and

s also borrow

wed the fos. Collaborarecords to mons from peenerating th

filtering failthe bibliogra

ine-grained s a result, w

4

atterns in la

mplementatio-based filter

service gened “Quick Pi

wed the follow

ollowing” cative filteri

make recommople with thhe recomm

ls due to loaphic record

recommendwill also adju

arge data set

on is a hybring.

erates 2 typcks”.

wing” and “

ategory pring mines thmendationshe same int

mendations t

w or no loads.

dation systeust its recom

ts. The follo

brid recomm

pes of recom

“Quick Pick”

marily reliehe reading . It relies on

terests. Onlyto minimize

ans, the algo

em that takmmendation

owing descr

mender syst

mmendation

” recommend

es on collapatterns wi

n the notiony loans in the recommen

orithm falls

kes into accons if there a

ribes the

tem that

ns: “Our

dations

aborative ithin the

n that we he last 3 ndations

s back to

ount the are shifts

“Quick recommmade fr The parecommwhich a

Figure 1.6: C

Picks” recmendations rom a list of

atron level mendations. are then mer

Figure 1.7: P

Collaborative

commendatusing the

f books sele

recommenThe patro

rged into a

Patron level r

e and conten

tions rely sbibliograph

ected by libr

dations aren’s latest lsingle list.

recommenda

5

nt-based filter

solely on chic recordsrarians inste

e currently loaned title

ations

ring

content-base. Howeveread of the w

implements are used

ed filteringr, these recwhole collec

ted on topto provide

g to come commendatiction.

p of the tite recommen

up with ions are

tle level ndations

6

Table 1.1 shows some statistics on recommendation coverage. Fiction titles are highly loaned and therefore collaborative filtering generates recommendations for 59% of the total fiction titles compared to only a 28% success percentage for the less loaned non-fiction titles. Collaborative filtering Combined with simple content-based filtering Fiction 59% 89% Non-Fiction 28% 53% Overall 34% 59%

Table 1.1 NLB’s recommendation coverage

The simple content-based filtering applied here uses only author and language fields. As can be seen, even combining collaborative filtering with a very straightforward content-based filtering algorithm significantly improves the recommendation coverage, raising the fiction success percentage from an acceptable 59% to an excellent 89% and non-fiction from a low 28% to an acceptable 53%. The recommendation coverage for patron level recommendations is even higher since the recommendations are the combination of the title level recommendations based on the patron’s recent loans.

2 CONNECTING UNSTRUCTURED DATA Unstructured data makes up a huge and growing portion of the content that NLB holds. Connecting these data will allow us to make these contents a lot more discoverable. One of our earliest attempts to connect such content involves using text analytics to perform “key phrase extraction”. The extracted key phrases are then used to perform searches which throw up other related content. These connections ensure that further discovery around the key topics are just 2 clicks away. The NLB “Infopedia” search-engine friendly micro site (http://infopedia.nl.sg) is a popular electronic encyclopedia on Singapore’s history, culture, people and events. With less than 2,000 articles, it attracts over 200,000 page views a month. It makes use of this approach alongside manual recommendations from the librarians. These manual recommendations by librarians are added as Librarian’s Recommendations and provides a 1 click access to related content.

Figu Howevearticles With malternatcontent The Apthat im(http://mclusteri Using trecomm

ure 2.1: Infop

er, as manuhave Libra

most other cotive approac.

pache Mahomplements mahout.apacng, classific

text analytimendations f

pedia connec

ual recommarian Recom

ollections mch to autom

ut softwarea wide r

che.org). It cation, and

cs in Apacfor the full I

ctions throug

mendation is mmendations

many times matically ge

is a scalablrange of supports fofrequent ite

che MahoutInfopedia c

7

h extracted k

a labour ins.

larger than enerate goo

le open soudata mini

our main usem set minin

t to find simollection in

key phrases a

ntensive and

the Infopedod recomm

rce softwarng and me cases, namng.

milar articln less than 5

and manual r

d time cons

dia collectioendations f

re from the Amachine lemely recom

es, we succ minutes.

recommenda

suming task

on, NLB nefor its unstr

Apache Fouearning alg

mmendation

cessfully ge

ations

k, not all

eeded an ructured

undation gorithms mining,

enerated

Using Medicinrecomm The Ma(such asterm fretherefor These adiscovesuch co The folof the alibrariancontent

Figure 2.2: I

the same ne”), Figure

mendations a

ahout text s Euclideanequency/invre statistical

automaticalery immediaonnectivity i

llowing figuautomaticallns. In caseand immed

Infopedia tex

Infopedia ae 2.2 showsautomatical

analytics arn, Cosine, Tverse documlly relevant

lly generateately. Thereis available.

ures from Inly generateds where m

diately allow

xt analytics r

article in Fs that 3 of tlly generate

re based onanimoto, Mment freque.

ed recommee is no longe.

nfopedia in d recommenanual recom

ws the disco

8

result sample

Figure 2.1the manual

ed by Mahou

n establisheManhattan an

ency, etc.).

endations aer a need to

NLB’s stagndations to mmendationovery of oth

e

(titled “Krecommen

ut.

ed and rigond many othThe recom

allow us towait for ma

ging envirosupplemen

ns are not her related c

King Edwarndations wer

orous mathehers for dist

mmendations

o connect reanual recom

onment illusnt manual re

yet availabcontent.

rd VII Colre among th

ematical algtance measus from Mah

elevant artimmendation

strate the caecommendable, they en

llege of he top 4

gorithms urement, hout are

icles for ns before

apability ations by nrich the

9

Figure 2.3: Supplementing an Infopedia article with manual recommendations

Figure 2.4: Enriching an Infopedia article with no manual recommendations

The Sinthe nextthe proc MovingcompletNewspa Newspanewspanewspa Scalingin aroun The resarticles story un

With thyear’s wa muchrecommarticles

ngapore Met data set thcess of push

g on from Inted anotheraperSG coll

aperSG (httapers publishapers at the a

g up from 2,nd 18 hours

ults provedin a chrono

nfolds.

Figure 2.5: P

he successfworth of 58h larger datamendations

from anoth

emory portahat we havehing it out to

nfopedia anr proof-of-clection.

tp://newspaphed betweearticle level

,000 to 58,0s.

d to be very ological ord

PoC results s

ful applicati,000 newspa set. We dto see if w

her newspap

al’s (http://s successfullo enhance t

nd Singaporconcept (Po

pers.nl.sg) ien 1831 andl, and there

000 was rela

promising. der, we can

sample for N

ion of textpaper articledecided to pwe can discper appear in

10

singaporemly completehe current c

re Memory, oC) for abo

is NLB’s od 2009. NLB

are now ov

atively easy

Interesting,discover th

NewspaperSG

analytics es, we were pull all the cover additin the recom

emory.sg) ped this similcontent-base

we looked out 58,000

nline resouB has investver 18 millio

y. We comp

, if we werehe progressi

G (58,000 art

on Infopedencouragedarticles froional persp

mmendations

personal melarity proceed filtering

to scale it unewspaper

urce for Singted significon articles in

pleted the si

e to organiseion of a new

icles)

dia, Singapod to take the

om 2 newspectives of s.

emory collessing and wrecommend

up and succr articles fr

gapore and antly to dign Newspape

imilarity pro

e the recomws item and

ore Memore plunge an

papers and ga news item

ection is we are in dations.

cessfully rom our

Malaya gitise the erSG.

ocessing

mmended d see the

y and a nd tackle generate m when

Workinfrom 18scalabil The prnewspasoftwarcommoOCR ecomplexof interm A reliab Apacheclusters(http://hprocess Many othe map After mrunningon 3 vir

ng on the 6 845 to 2009lity issue. T

rocessing oaper articlesre during ton due to therrors introxity of the mediate dis

ble, scalable

e Hadoop iss of chadoop.apacsing, boastin

of Apache Mp/reduce par

much experg Apache Mrtual machin

Figure 2.6: A

million Ne9 and The S

The processi

of Newspaps were derthe digitisahe historic noduced ‘noi

computatiosk storage.

e and distrib

s a framewocommodity che.org). Itng users inc

Mahout’s coradigm to al

rimentation Mahout in ou

ne hosts.

Apache Hado

ewspaperSGSingapore Fing ran for m

perSG articrived from ation of thenature of thise’ into th

on, leading t

buted compu

ork that supcomput

t is the de-luding Face

ore algorithmllow scaling

and tuningur secondar

oop setup in

11

G articles frFree Press fmore than a

cles also suthe use of

e historic nhe newspapehe data seto lengthy p

uting platfo

pports distribters usin-facto openebook, Twit

ms are implg to reasona

g, we finallry data cent

NLB

rom the 2 nfrom 1925 tweek befor

urfaced anf Optical Cnewspaperers, particu

et, and alsoprocessing

orm is requi

buted proceng simpln source sotter, Yahoo,

lemented onably large d

ly set up are. We set u

newspapers to 1962) gare we ran ou

other challCharacter R

microfilmslar for the o significanand the nee

red.

essing of larle prograoftware plat, LinkedIn a

n top of Apata sets.

a full Apacup a total o

(The Straitave us our fut of disk st

lenging issuRecognitions. OCR errolder issuesntly increaed for huge

rge data setamming tform for band EBay.

pache Hadoo

che Hadoopof 13 virtual

ts Times first real torage.

ue. The n (OCR) rors are s. These ased the

amount

ts across models

big data

op using

p cluster l servers

12

For more information on our setup, do feel free to contact us. To tackle the issue of OCR errors within the newspaper articles, we tuned the parameters for the text analytics algorithm to ignore infrequent word tokens. This has resulted in the removal of a significant portion of the OCR errors before the actual Mahout processing commenced. With the Apache Hadoop cluster set up and applying more aggressive algorithm parameters, we were able to generate similarity recommendations for more than 150,000 articles. The aggressive change in the parameter values are instrumental in slicing away a huge chunk of the time required for the processing. From being able to process 58,000 articles in around 18 hours, we are now able to process 150,000 articles in less than 30 minutes. However, we are still in the midst of resolving issues as we move towards our target of processing the 6 million articles. 3 NEXT STEPS We are currently working on additional proof-of-concepts with Apache Mahout particularly in the areas of on content enrichment, clustering and classification. In the area of content enrichment, there are lots of interesting ideas and possibilities for us to explore and conquer. We hope to enrich the content with semantic information so that content becomes connected semantically instead of just textually. We also hope to enrich the content with language translations to explore the possibilities of connecting content in different languages, given that we have four official languages in Singapore. On the clustering front, working with user-submitted personal memories from the Singapore Memory portal, Apache Mahout clustered the memories into 43 groups (Figure 8). The key terms amongst the memories within each cluster can then be analysed to identify the key theme for the clusters.

Figure 3.1: Singapore Memory Cluster PoC cluster sizes

We aregenerat

For clasform thsubmissamountthe Sing The vaacross phenomto uneacollecti

Figure 3.2: S

e exploring ed clusters.

Figure 3.3: S

ssification, he training dsions into ot of effort regapore Mem

lue of a libthe conten

menon. The rth the hiddons that go

Sample of to

visualizatio

Singapore M

we are currdata that theone of the iequired to m

mory portal.

brary’s collnt of the cability to c

den values ibeyond tho

p terms in ge

on technolo

Memory porta

rently collate classificatiidentified cmaintain an.

lections inccollections, connect conin their painousands of i

13

enerated clus

ogies that w

al’s showcase

ting data foion algorithclusters. Thind update th

creases witha result

ntent automanstakingly citems. Data

sters

will allow u

e clusters

or user-idenhms use to ais will allowhe current c

h the numbof the weatically is t

curated collemining and

users to na

ntified clusteautomaticallw us to draclusters that

ber of connll understo

therefore esections, espd text analyt

avigate thro

ers. These dly classify nastically redt are showc

nections witood networssential for pecially for tics technol

ough the

data will new user duce the cased on

thin and rk effect libraries sizeable

logy and

14

software are now more matured and readily available through open source licenses. The time is ripe for libraries to leverage on them to bring out the enormous values in their collections. There are many more possibilities that we will explore in the future to make more and better connections – connections that allow users to discover comprehensive, relevant and trusted information. We do the work so that our users do not.

submitted on: may 31, 2013library.ifla.org/131/1/152-lim-en.pdf · search d a typica through. next...

Documents