CS 5604 Information Storage and Retrieval Solr Team Final Presentation Presenters: Liuqing Li, Ye Wang, Anusha Pillai, Ke Tian {liuqing, yewang16, anusha89, ketian} @vt.edu Instructor: Dr. Edward A. Fox Virginia Polytechnic Institute and State University Blacksburg, VA, 24061 December 6, 2016


CS 5604 Information Storage and Retrieval
Solr Team Final Presentation

Presenters: Liuqing Li, Ye Wang, Anusha Pillai, Ke Tian

{liuqing, yewang16, anusha89, ketian} @vt.edu

Instructor: Dr. Edward A. Fox

Virginia Polytechnic Institute and State University
Blacksburg, VA, 24061

December 6, 2016


Outline

• Background
• Implementation
• Problems Faced
• Lessons Learned
• Future Work
• Acknowledgement


Background — Overview


Background — Updates

| Component            | Spring 2016                          | Fall 2016                         |
|----------------------|--------------------------------------|-----------------------------------|
| schema.xml           | Coarse-grained                       | Fine-grained                      |
|                      | No copy fields                       | Copy fields for all-fields search |
|                      | Create stopwords.txt & profanity.txt | Update the two files              |
| morphlines.conf      | Two field types: string and text     | Multiple field types              |
|                      | Field "time" => string               | Field "time" => datetime          |
|                      | No multiple-valued fields            | Multiple-valued field parser      |
| Basic Indexing       | Small collection                     | 1.2 billion tweets dataset        |
| Incremental Indexing | Virtual Cloudera (VC)                | VC & Hadoop Cluster (HC)          |
| Recommendation       | Brief description                    | Implemented in VC & HC            |
| Custom Ranking       | Brief description                    | Implemented in VC & HC            |
| Solr Admin UI        | Brief description                    | Detailed description              |
|                      | Limited faceted search               | Detailed faceted search           |


Implementation — Basic Indexing

• Live Mode
  • Continuous stream of HBase cell updates into live search indexes
  • Simple and efficient
  • Cannot handle big data
• Batch Mode
  • Batch index tables in HBase by using MapReduce jobs
  • Write index files into HDFS (/user/cs5604f16_solr/…)
  • Can handle big data


Implementation — Basic Indexing

• schema.xml: fields configuration
  • field (e.g., ideal-cs5604f16-fake)
    • # of fields: 30
    • Types: string (22), text_general (2), int (2), float (2), long (1), date (1)
    • Stored: True (17), False (13)
  • dynamicField: matching multiple fields, using wildcard
  • copyField
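The wildcard matching behind dynamicField can be sketched as follows. This is a minimal illustration, not the team's schema: the three patterns and their types are assumed for the example, and it takes the first matching pattern, which may differ from Solr's own resolution rules.

```python
from fnmatch import fnmatchcase

# Hypothetical dynamicField patterns (name pattern -> field type);
# these are assumptions for illustration, not the team's schema.xml.
DYNAMIC_FIELDS = [("*_s", "string"), ("*_i", "int"), ("*_t", "text_general")]


def resolve_dynamic_type(field_name, patterns=DYNAMIC_FIELDS):
    """Return the type of the first wildcard pattern matching field_name, or None."""
    for pattern, field_type in patterns:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


print(resolve_dynamic_type("topic_label_s"))
print(resolve_dynamic_type("t_month_i"))
```

A field name that matches no pattern returns None here; in Solr such a field would have to be declared explicitly.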


Implementation — Basic Indexing

• stopword.txt and profanity.txt
  • stopword.txt: tf-idf values are not calculated for the listed words
  • profanity.txt: quick response for search queries that contain the listed words
  • Solr loads the two files while reading schema.xml

Source: https://pypi.python.org/pypi/many-stop-words
        http://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/
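The effect of the two word lists can be sketched in a few lines. This is a simplified stand-in for what a Solr analyzer does, assuming whitespace tokenization and lowercase matching; the sample words are placeholders, not entries from the team's files.

```python
def load_word_list(lines):
    """Build a lookup set from word-list lines, skipping blanks and '#' comments."""
    return {
        line.strip().lower()
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    }


def analyze(text, stopwords):
    """Lowercase, split on whitespace, and drop stop words before indexing."""
    return [tok for tok in text.lower().split() if tok not in stopwords]


def is_profane(query, profanity):
    """Return True if any query token is on the profanity list."""
    return any(tok in profanity for tok in query.lower().split())


# Placeholder entries standing in for stopword.txt and profanity.txt.
stopwords = load_word_list(["# sample stop words", "the", "of", "and", ""])
profanity = load_word_list(["badword"])

print(analyze("The storage and retrieval of tweets", stopwords))
print(is_profane("some badword here", profanity))
```

Dropping stop words at analysis time means they never reach the index, so no term statistics are kept for them. The profanity check can short-circuit a query before it is executed.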


Implementation — Basic Indexing

• morphlines.conf: mapping and parsing
  • Mapping data from HBase to Solr
  • Split multiple values into list, e.g., "topic_label_s": "twitter;social;media;text"
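The multi-value split can be illustrated outside morphlines with a short Python sketch. It only mirrors the idea of turning a semicolon-separated string into a list; the function name and the separator default are assumptions, and the real parsing is configured in morphlines.conf.

```python
def split_multivalued(record, fields, sep=";"):
    """Return a copy of record in which the named string fields become lists."""
    out = dict(record)
    for name in fields:
        value = out.get(name)
        if isinstance(value, str):
            out[name] = [part.strip() for part in value.split(sep) if part.strip()]
    return out


doc = split_multivalued(
    {"topic_label_s": "twitter;social;media;text"}, ["topic_label_s"]
)
print(doc["topic_label_s"])
```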


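Implementation — Basic Indexing

• Index the big dataset

|                          | ideal-cs5604f16              | ideal-cs5604f16-1204                          |
|--------------------------|------------------------------|-----------------------------------------------|
| Dataset                  | All collections (raw tweets) | All collections (raw tweets + processed data) |
| Indexing: # of DataNodes | 18                           | 17                                            |
| Indexing: space cost     | 392.33 GB                    | 399.21 GB                                     |
| Time cost: mapping       | 1h21m                        | 1h45m                                         |
| Time cost: reducing      | 5h11m                        | 5h13m                                         |
| Time cost: merging       | 3h18m                        | 3h10m                                         |
| Time cost: total         | 9h50m                        | 10h8m                                         |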
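As a quick arithmetic check, the total time for each collection should equal the sum of its mapping, reducing, and merging stages. The sketch below recomputes both totals from the stage times reported on the slide; the helper names are illustrative.

```python
import re


def to_minutes(duration):
    """Convert a string such as '1h21m' to a number of minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", duration)
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


def fmt(minutes):
    """Format minutes back into the slide's 'XhYm' notation."""
    return f"{minutes // 60}h{minutes % 60}m"


# Stage times (mapping, reducing, merging) as reported for each collection.
stages = {
    "ideal-cs5604f16": ["1h21m", "5h11m", "3h18m"],
    "ideal-cs5604f16-1204": ["1h45m", "5h13m", "3h10m"],
}
totals = {name: fmt(sum(map(to_minutes, parts))) for name, parts in stages.items()}
print(totals)
```

Both sums agree with the reported totals of 9h50m and 10h8m.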


Implementation — Incremental Indexing

• Purpose
  • Process a continuous stream of HBase cell updates into live search indexes (Near Real-Time, NRT Indexing)
  • Solve the problem of frequent inserts, deletes and updates
• How does it work?
  • Enabling HBase replication (column family)
  • Pointing an NRT Indexer Service at an HBase table
  • Starting an NRT Indexer Service
• Our work

Source: http://www.cloudera.com/documentation/enterprise/5-6-x/topics/search_config_hbase_indexer_for_search.html


Implementation — Incremental Indexing

Create and check the NRT indexer


Implementation — Incremental Indexing

Restart the HBase Solr Indexer service
• Restart the service in VC
• Restart the service in HC


Implementation — Incremental Indexing

• Create and check the NRT indexer
• Check the results in HBase and Solr Admin UI


Implementation — Recommendation

• Types
  • Textual similarity based
  • Collaborative filtering
• MoreLikeThis Component
  • Identifies similar documents to search result documents
  • Can be configured as a request handler or search component
  • Uses term vectors to compute similarity
    • Term vectors can be calculated during query runtime or precomputed during indexing
  • Extracts highest matching terms based on tf-idf similarity
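The "highest matching terms" step can be sketched with a toy tf-idf computation. This is an illustration only: it uses a smoothed idf chosen for the example rather than Lucene's exact scoring, and the three-document corpus is made up.

```python
import math
from collections import Counter


def top_tfidf_terms(doc_tokens, corpus, k=3):
    """Return the k terms of doc_tokens with the highest tf-idf weight."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in corpus for term in set(doc))
    tf = Counter(doc_tokens)
    # Smoothed idf (an assumption for this sketch, not Lucene's formula).
    scores = {
        term: tf[term] * (math.log((n_docs + 1) / (df[term] + 1)) + 1.0)
        for term in tf
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Made-up corpus of tokenized documents.
corpus = [
    ["solr", "index", "tweets"],
    ["hbase", "tweets", "store"],
    ["solr", "solr", "ranking", "tweets"],
]
print(top_tfidf_terms(corpus[2], corpus, k=2))
```

A term that appears in every document, such as "tweets" here, gets the lowest weight, so it is the first to be left out when similar documents are sought.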


Implementation — Recommendation

• schema.xml
  • Set stored = true
  • Set termVectors = true (for calculating tf-idf)
  • After making changes, reindexing is mandatory
• solrconfig.xml
  • Enable mlt
  • Define other configuration parameters
    • e.g., mlt.fl, mlt.mintf, mlt.mindf, mlt.maxdf, mlt.qf
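The mlt parameters can also be passed at query time. The sketch below only builds a request URL and does not contact a server; the host, the document query `id:12345`, and the field name `text` are placeholders assumed for illustration.

```python
from urllib.parse import urlencode


def build_mlt_query(base_url, doc_query, fields, mintf=1, mindf=1, count=5):
    """Build a /select URL that enables the MoreLikeThis search component."""
    params = {
        "q": doc_query,              # which document(s) to find neighbours for
        "mlt": "true",               # turn the component on
        "mlt.fl": ",".join(fields),  # fields used for similarity
        "mlt.mintf": mintf,          # minimum term frequency in the source doc
        "mlt.mindf": mindf,          # minimum document frequency in the index
        "mlt.count": count,          # similar documents returned per result
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Placeholder host and query values; only the collection name is from the slides.
url = build_mlt_query(
    "http://localhost:8983/solr/ideal-cs5604f16-fake", "id:12345", ["text"]
)
print(url)
```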


Implementation — Recommendation

• RequestHandler

Link: https://drive.google.com/open?id=0B2iasHDgHqGyYUk0R3RkVktkM2M


Implementation — Recommendation

• SearchComponent

Link: https://drive.google.com/open?id=0B2iasHDgHqGyU0doVEpidlh3c2c


Implementation — Custom Ranking

• Purpose
  • Customize and optimize the ranked results
• How does it work?
  • SearchComponent
    • prepare(): pre-processing, invoked before the query is executed
    • process(): post-processing, invoked after all the results are fetched
  • Custom Scoring
  • Re-ranking

$$Score = Doc_{score,Solr} + Doc_{importance} + W_{topic} \times Doc_{score,topic} + W_{cluster} \times Doc_{score,cluster}$$
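The re-ranking step can be sketched in Python, reading the slide's score as the Solr score plus a document importance plus weighted topic and cluster scores. The dictionary keys, the sample values, and the weights are assumptions for illustration; the team's actual component is a Java SearchComponent deployed as a jar.

```python
def custom_score(doc, w_topic=1.0, w_cluster=1.0):
    """Combine the Solr score with importance, topic, and cluster scores."""
    return (
        doc["solr_score"]
        + doc["importance"]
        + w_topic * doc["topic_score"]
        + w_cluster * doc["cluster_score"]
    )


def rerank(docs, w_topic=1.0, w_cluster=1.0):
    """Return docs sorted by descending custom score (the post-processing step)."""
    return sorted(
        docs, key=lambda d: custom_score(d, w_topic, w_cluster), reverse=True
    )


# Made-up results: "a" has the higher Solr score, "b" the stronger topic score.
docs = [
    {"id": "a", "solr_score": 2.0, "importance": 0.1, "topic_score": 0.2, "cluster_score": 0.1},
    {"id": "b", "solr_score": 1.5, "importance": 0.5, "topic_score": 0.9, "cluster_score": 0.3},
]
ranked = rerank(docs, w_topic=2.0, w_cluster=1.0)
print([d["id"] for d in ranked])
```

With these weights, document "b" overtakes "a" even though its raw Solr score is lower, which is the intended effect of the custom ranking.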


Implementation — Custom Ranking

• Build and copy jar file into Hadoop Cluster
• Modify the solrconfig.xml


Implementation — Custom Ranking

• Update the instanceDir
• Reload the collection
• Check the results in Solr Admin UI


Implementation — Solr Admin UI

1. Dashboard: provides basic functions for users to choose (e.g., Logging, to check Solr logs for debugging)
2. Core Selector: select the core (dataset) for queries; choose ideal-cs5604f16-fake for querying
3. Solr instance information: current versions, JVM information


Implementation — Solr Admin UI

1. The request handler: /select
2. The query event: q
3. Parameters for query: fq (filter queries), sort (descending or ascending)
4. Execute query
5. Results output: JSON format

Screenshot labels: field name, result statistics
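The same query can be issued without the Admin UI by building the /select URL directly. The sketch below only constructs the URL; the host and the filter value are placeholders assumed for illustration.

```python
from urllib.parse import urlencode


def build_select_url(base_url, q, fq=None, sort=None, rows=10):
    """Build a /select URL with optional filter queries and sort order."""
    params = [("q", q), ("rows", rows), ("wt", "json")]
    # Each filter query is sent as its own fq parameter.
    params += [("fq", f) for f in (fq or [])]
    if sort:
        params.append(("sort", sort))  # e.g. "<field> desc" or "<field> asc"
    return f"{base_url}/select?{urlencode(params)}"


# Placeholder host and filter value; the collection and field names are from the slides.
select_url = build_select_url(
    "http://localhost:8983/solr/ideal-cs5604f16-fake",
    q="*:*",
    fq=["t_month_i:11"],
    sort="t_month_i desc",
)
print(select_url)
```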


Implementation — Solr Admin UI

1. The faceted search query: range
2. Faceted search field: t_month_i
3. Parameters, true when enabled
4. Search results: counts
5. Search results: details
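A range facet on t_month_i can be requested with Solr's facet.range parameters. The sketch below builds the query string and pairs up a flat list of alternating bucket labels and counts; the month range 1 to 13 with a gap of 1 and the sample counts are assumptions, not results from the team's index.

```python
from urllib.parse import urlencode


def range_facet_params(field, start, end, gap):
    """Parameters for a range facet over a numeric field, returning counts only."""
    return {
        "q": "*:*",
        "rows": 0,  # facet counts only, no documents
        "facet": "true",
        "facet.range": field,
        "facet.range.start": start,
        "facet.range.end": end,
        "facet.range.gap": gap,
        "wt": "json",
    }


def pair_counts(flat):
    """Turn a flat list like ['1', 10, '2', 5] into {'1': 10, '2': 5}."""
    return dict(zip(flat[0::2], flat[1::2]))


query_string = urlencode(range_facet_params("t_month_i", 1, 13, 1))
print(query_string)

# Made-up counts in the alternating label/count layout.
print(pair_counts(["1", 10, "2", 5]))
```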


Problems Faced

• Cloudera and OS
  • Virtual Cloudera is slow and often crashes due to memory limitations
  • Not familiar with the whole architecture at the beginning
  • Versions of Cloudera and Solr
• Data
  • Consistency check
  • Not enough real data available to perform tests
  • Not much information available regarding logs to perform collaborative filtering
• Collaboration
  • Communication and modification


Lessons Learned

• Solr
• HBase
• HDFS
• Patience
• Carefulness
• Team Collaboration


Future Work

• Search
  • Customize more request handlers
  • Deal with the profanity issue
• Custom Ranking
  • Customize more search components
• Recommendation
  • Create a custom recommendation component (Probabilities – CTA team)
  • Implement the collaborative filtering (Log files – FE team)
• Solr
  • Figure out SolrCloud, multiple Solr nodes in Cloudera Search


Acknowledgement

• Projects
  • NSF IIS-1319578 III: Small: Integrated Digital Event Archiving and Library (IDEAL)
  • NSF IIS-1619028 III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR)
• Teams
  • CMT, CMW, CLA, CTA, FE teams
• Persons
  • Instructor: Dr. Edward A. Fox
  • GRA: Sunshin Lee


Thank you!

Questions?