search and discovery technology stack v3 - green chameleon€¦ · "i am oz, the great and...

15
1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe The Lion thought it might be as well to frighten the Wizard, so he gave a large, loud roar, which was so fierce and dreadful that Toto jumped away from him in alarm and tipped over the screen that stood in a corner. As it fell with a crash they looked that way, and the next moment all of them were filled with wonder. For they saw, standing in just the spot the screen had hidden, a little old man, with a bald head and a wrinkled face, who seemed to be as much surprised as they were. The Tin Woodman, raising his axe, rushed toward the little man and cried out, "Who are you?" "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and I'll do anything you want me to.” L. Frank Baum. The Wonderful Wizard of Oz. Project Gutenberg (2008) The Wizard of Oz captures the life trajectory of many technologies, and search is no exception: the technology starts in the public domain, it is appropriated and made mysterious (hidden behind the curtain) by commercial interests, then touted and overblown as magical, eventually the curtain falls, and we can see what has been there all along, and work with it more intelligently as it is, knowing what it’s good for and knowing what it’s not good for. This is true of search, as it is true of auto- classification. Modern search technology has its roots in Vannevar Busch’s seminal 1945 paper “As we may think” and its foundations were developed by research groups at MIT, Cornell and Harvard into the 1970s. One of the consequences of hiding a technology behind the commercial curtain is that different technologies compete with each other for the same market, and therefore downplay their perceived competitors, or attempt to duplicate the functionalities of their perceived competitors. Search technology has competed with taxonomy work for years – “you don’t need taxonomies,” went the mantra of one infamous search vendor, “our software will understand your content for you.” The curtain fell with the unveiling of Google Knowledge Graph, when it became obvious just how much Google was investing in knowledge organisation structures (controlled vocabularies, taxonomies, ontologies, knowledge graphs) to make its search smarter. The superior performance of open source search engines (and their striking adoption rates) over commercial search engines in solving particular types of problem is, like Toto’s leap, another manifestation of the dropped curtain. Machine classification of content has been another example. It is frequently touted by its commercial vendors as a single technology performing a single “we’ll tag your

Upload: others

Post on 22-Sep-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

1

Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

The Lion thought itmight be as well to frighten theWizard, so he gave alarge, loud roar, whichwas so fierce and dreadful that Toto jumped awayfromhiminalarmandtippedoverthescreenthatstoodinacorner.Asitfellwith a crash they looked thatway, and thenextmomentall of themwerefilledwithwonder. For they saw, standing in just the spot the screen hadhidden,alittleoldman,withabaldheadandawrinkledface,whoseemedtobeasmuchsurprisedastheywere.TheTinWoodman,raisinghisaxe,rushedtowardthelittlemanandcriedout,"Whoareyou?""I amOz, theGreat andTerrible," said the littleman, in a trembling voice."Butdon'tstrikeme--pleasedon't--andI'lldoanythingyouwantmeto.”

L.FrankBaum.TheWonderfulWizardofOz.ProjectGutenberg(2008)TheWizardofOzcapturesthelifetrajectoryofmanytechnologies,andsearchisnoexception:thetechnologystartsinthepublicdomain,itisappropriatedandmademysterious(hiddenbehindthecurtain)bycommercialinterests,thentoutedandoverblownasmagical,eventuallythecurtainfalls,andwecanseewhathasbeenthereallalong,andworkwithitmoreintelligentlyasitis,knowingwhatit’sgoodforandknowingwhatit’snotgoodfor.Thisistrueofsearch,asitistrueofauto-classification.ModernsearchtechnologyhasitsrootsinVannevarBusch’sseminal1945paper“Aswemaythink”anditsfoundationsweredevelopedbyresearchgroupsatMIT,CornellandHarvardintothe1970s.Oneoftheconsequencesofhidingatechnologybehindthecommercialcurtainisthatdifferenttechnologiescompetewitheachotherforthesamemarket,andthereforedownplaytheirperceivedcompetitors,orattempttoduplicatethefunctionalitiesoftheirperceivedcompetitors.Searchtechnologyhascompetedwithtaxonomyworkforyears–“youdon’tneedtaxonomies,”wentthemantraofoneinfamoussearchvendor,“oursoftwarewillunderstandyourcontentforyou.”ThecurtainfellwiththeunveilingofGoogleKnowledgeGraph,whenitbecameobviousjusthowmuchGooglewasinvestinginknowledgeorganisationstructures(controlledvocabularies,taxonomies,ontologies,knowledgegraphs)tomakeitssearchsmarter.Thesuperiorperformanceofopensourcesearchengines(andtheirstrikingadoptionrates)overcommercialsearchenginesinsolvingparticulartypesofproblemis,likeToto’sleap,anothermanifestationofthedroppedcurtain.Machineclassificationofcontenthasbeenanotherexample.Itisfrequentlytoutedbyitscommercialvendorsasasingletechnologyperformingasingle“we’lltagyour

Page 2: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

2

contentforyou”functionwhereitisactuallyabundleofquitedistincttechnologies,androotedininformationretrievalresearchandsearchtechnologiesdevelopedinthe1970sand1980s.Machineclassificationvendorshavefoundthemselvescompetingwithsearch(“wecanmakeyoursearchsmarter”)andtaxonomies(“wecanderivetaxonomiesautomaticallyforyou”).ThecurtainisnowstartingtofallwithlargeorganisationslearninghowtousetheopensourcetoolkitssuchastheUniversityofSheffield’sGATEtoolset,tosolvetheirsearchanddiscoveryproblems.The“goitalone”propagandaofthetechnologyvendorsisdamagingbecauseitlocksbuyersintoasingletoolsetdesignedforcommondenominators,withprohibitivecustomisationandmaintenancecosts.The“magicblackbox”propagandablindsorganisationstotheneedtobuildcapabilitiesinarangeoftools.Andthe“wedoeverything”propagandahidesthesynergiesandcapabilitiesthatcanbeattainedbyintelligentlyusingsearch,machineclassificationandknowledgeorganisationtechnologiesasatoolkit,withdifferenttoolsetstobeusedindifferentcombinationstosolvedifferentkindsofsearchanddiscoveryproblem.Thepurposeofthisarticleistoexplaintheusesandinteractionsofthreetechnologiesthattogethersupportsearchanddiscovery,sothatorganisationscanlearnwhichtoolstoadoptinwhichcombinationtosolvedifferentproblems,andtheycanseewhatkindofinternalcapabilitiestheyneedtobuildandmaintaintocontinuesolvingthoseproblems:

• Enterprisesearch• Taxonomyandontologymanagement• Textanalyticstoperformmachineclassification.

Enterprisesearchisatechnologyfordeliveringusefulandrelevantresultstousersexploringalargecontentbase.Asatechnologyitisrelativelysimpleandnotespeciallysmart,andtodeliversuperiorresultsincomplexenvironmentsitneedstobesupportedbyothertechnologies.“Complexity”canrefertothediversityofusercommunities,andtothediversityofthecontentitself.Searchenginesbythemselvesdon’tunderstandanything.Theycrawlandindex,processqueriesandserveupuseranalytics.Tobecomesmart,theyneedhumandesignedcurationtools(taxonomyandontologytools)andtoolsforscalingthehumantaggingofcontent(textanalyticstools).Taxonomyandontologymanagementtechnologiessupportdesigning,deliveringandmaintainingknowledgestructuresandcontrolledvocabulariestoenhancethesearchanddiscoveryexperience.Thesestructuresarebasedonananalysisofbusinessanduserneedstogetherwithananalysisofthetargetcontentforsearchanddiscovery,andtheyhelptodisambiguateconcepts,capturevariationsinlanguageandmapthemassynonyms,andidentifyrelationshipsbetweenconceptssothatsearchescanbeexpandedornarrowed.Theconstraintsoftaxonomyandontologymanagementmostoftenrelate(a)tostayinguptodatewithnewandemergingneedsofdiverseusersets(whichiswheresearchanalyticscanhelp),and(b)toscalingthewaythatconceptscanbeidentifiedincontentandmetadataenriched(whichiswheretextanalyticsandmachineclassificationcanhelp).

Page 3: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

3

Textanalyticsreferstoaclusteroftechnologiesthatextractmeaningfromtextandturnitintometadata.Someoftheroottechnologies(crawlingandindexing)aresharedwithsearch.Thesetechnologiesarearesponsetotheproblemsofhumaninconsistency(peoplearenotconsistentinhowtheyapplytagstocontent)andofthescaleofcontenttobetagged(whenitwouldnotbefeasibletoapplyhuman-curatedtaggingtolargeamountsofcontentinashortperiodoftime).Differenttechnologiesareusedtosupplydifferenttypesoftaggingoperation,rangingfromidentificationofknownentitiessuchaspeople,organisationsandplaces,quantitiessuchasmoneyamounts,totheidentificationofabstractconceptsthatcharacterizewhatadocumentis“about”inwaysthatmakesensetotargetusers.Theconstraintsoftextanalyticsare(a)thatithasahardtimefiguringoutwithoutintensivehumancurationorverywellstructuredtrainingsetswhichconceptsaremostsalienttothemostpeople(whichiswheretaxonomyandontologymanagementhelps)and(b)likesearch,itrequiresconstanttuningbasedonemerginguserneeds(whichiswheresearchanalyticsonuserbehaviourscanhelp).Thesetechnologiescanbeextremelypowerfulwhenusedincombinationinsupportofthesearchanddiscoveryexperience.However,itisimportanttounderstandwhattheyarecapableof,howtheywork,andthelimitationsofwhattheycando.FigureA1showsadetailedviewofthethreepillarsofthesearchanddiscoverytechnologystack,howtheywork,andtheirkeyinteractions.Webeginatthetopofthediagram.Sincethesetechnologiesaresupposedtobedeployedintheserviceofuserneeds,decisionsabouttheirconfigurationanddesignneedtobemadeinthecontextofknownuserneeds,aswellasathoroughunderstandingofthecontenttobecovered.Notethatthisisasimplifiedconceptualframework,anditdoesnotfullyrepresentthecomplexitiesoftherelationshipsbetweenthepillars,noralltheoverlapsbetweenthem.

Page 4: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

4

Figu

reA1De

tailview

ofthese

archand

disc

overytechno

logystack

Page 5: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

5

“Content”canbestructured,semi-structuredorunstructured.Itcanbeintheformofdata,documents,webcontent,audiovisualfiles,discussions,orquestions.“People”canalsobeconsideredaformofcontentsinceknowledgeablepeoplemayalsobeausefulresultinssearchonaparticulartopic.Inacontentmanagementsystemapersonprofilecanbeusedasaproxyforthatperson,andcanbetaggedwithrelevantsubjecttagssoastosurfaceinsearchalongsideotherformsofcontent.Allthreetechnologiesinthestackdependuponathoroughsurveyofthetargetcontenttobecovered,andpreparationandstructuringofthecontentstoressothatthesearch,taxonomymanagementandtextanalyticstechnologiescanaccessthemandoperateuponthem.

Enterprise Search Let’sbeginwiththecentralpillar,theenterprisesearchstack.Therearesixmaincomponentsinthesearchstack.

CrawlingThefirstcomponentisthecrawlingmodule.Thisisdirectedtocontentsources,anditcrawlsthecontentinpreparationforindexing.Youmaywanttobeabletodirectthecrawlertospecificpartsofyourcontent.

ContentProcessingContentisthenprocessed,usingtoolsincommonwithinitialprocessingofcontentbytextanalyticstools.Atextfileiscreatedforparsingandindexing,andthecontentistokenized,meaningeachelementisgivenauniqueidentifiersothattheycanbemanipulatedandlocatedlateron.Thisisalsothepointatwhichlemmatization(resolvingvariantgrammaticalformssuchastensesandpluralstothesamegrammaticalroot)andstemming(resolvingdifferentwordendingstothesameword-stem)occurs.Youmayalsowanttoremove“stopwords”thatwillcreatemeaninglessnoise(commonoperatorssuchasif,but,and,the,a)fromtheindexing.Yourstopwordsmayneedtobetuned.Sometimesstopwordsthatarenoisyinsearchindexesprovidesemanticcluesinsomerules-drivenapplicationsoftextanalytics.

IndexingTheindexingmodulecreatestermindexesfromtheparsedtext.Eachtermislinkedtoitssourcecontent,andtokenizationprovidesthecontextinwhichthesearchtermappears.Thepre-processedindexesprovidespeedinsearch.Crawlingandpre-processingoflarge-scalecontentcollectionscanbeveryslowandoccupiessignificantprocessingcapacity.Hencemostindexingconsistsofincrementalupdatestotheindexesbynoticing,crawlingandindexingnewcontentwheneveritisadded.Itisveryimportanttoplantheconfigurationandsetupofyoursearchenginecarefullyinadvance.Anymajorchangestothesearchengineconfiguration,ortothe

Page 6: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

6

wayitexploitstaxonomyortextanalyticsmayrequireanexpensiveandslowcompletere-crawlandre-index.

Auto-categorizationSomesearchenginesalsooffersimpleauto-categorizationfunctions.Intheabsenceofsophisticatedtextanalyticstools,theseareusuallyentityrecognitionbasedonlookupsofreferencetermlists.Theselistscanbemaintainedwithinthesearchengineitself,orcanbesuppliedfromataxonomymanagementsystem.Thisisreferredtoas“knownentityextractionvialookup”.Theentitiesyouareinterestedincallingout(forexamplecompanynamesorcountrynames)arecompiledintolists,andareregisteredassignificantentitiesifthemainsearchindexescomeacrossthesetermsortheirsynonyms.Thiscanprovideaverysimplefilteringcapabilitybasedontheentitytypesyouareinterestedin.However,thisisafairlybasictechnology.Justbecauseadocumentmentions“Google”doesnotmeanthatthedocumentissubstantially“about”Google.

QueryProcessingMuchoftheutilityofasearchengineliesinthesophisticationwithwhichithandlesqueriesandqueryinterfaces.Queryinterfacescanincludetheubiquitoussearchquerybox,browsingtaxonomystructureswhereconceptsarepre-linkedtocontentthroughstoredsearches(requiresintegrationwithataxonomymanagementmodule),filteringofsearchresultsusingmetadataelementsortaxonomyfacets(againrequiresintegrationwithtaxonomy).Manysearchenginesofferauto-completeorauto-suggestofsearchqueriessuggestingcommonsearchqueriesastheusertypestheirqueryintothesearchbox.Auto-completecanexploitcommonsearchqueries,existingtaxonomyorthesaurusterms(linkedtoresultsthroughstoredsearches),and/orkeywordsincontextfrommetadatasuchastitlesofhighvaluecontent.Therelevanceofsearchresultsforcommonqueriescanbetunedattheback-endbyasearchmanagerbyalteringtherulesthatcalculatethelikelyrelevanceofagivenpieceofcontentforagivenquery.Wherethereisaknown“best”pieceofcontentforagivenquery,thiscanbepromotedtothetopoftheresultspage.Relevancetuningcanalsobeautomated,bylookingatclickbehaviorsbetweenqueryandclickingonresultsandpromotingcontentthatisfrequentlyclickedon.Relevance-tuningcanexploitwhatisknownabouttheusersandtheircontexts.Knowingthatauserissearchingfromwithinagivenorganizationalfunctionenablesrulestobewrittenforhowqueriesfromthoseusersshouldbehandled,fromzoningthesearchtoknowntargetcontentforthatfunctionalgroup,tohandlingrelevancecalculationbasedonmetadataassociatedwiththosefunctionsandusers,andmatchingitwiththeindexedcontent.Contentrankingmechanismscanalsobepulledintothemixed.

Page 7: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

7

SearchAnalyticsQueryhandlingandsearchtuningrequiresophisticationinthesearchengineitselfaswellasarelativelysophisticatedsearchmanagementcapability–i.e.thewaythatsearchisstaffed,administered,configuredandconstantlytunedinrelationtoknownuserneeds.Searchanalyticsisthemodulethatpowersthissophisticationandtuning.Searchengineshaveverypowerfulreportingcapabilities,trackinghowqueriesareconductedanduserinteractionswithresultspages,butasinallcomplexreportingcapabilities,thesmartslieinhowthereportsaredefinedandconfigured.Whatwillyoutrack,how,andhowfrequently?Howwillyoubackuphunchesgatheredfromthereports,andvalidatethemwithusers?Theinitialsearchdesignstrategybasedonusecasesandcontentanalysiswillgiveyouastartingpoint.Thiswillberefinedbytheinsightsyougainbyobservinguserbehaviorsthroughthesearchroll-outandsubsequentmaintenance.

Taxonomy and Ontology Management Likesearch,taxonomyandontologymanagementrequiresveryrobustanalysisofhowusersneedtointeractwithcontent,andanalysisofthecontentitselftounderstandhowitisdescribed,organizedandused.Taxonomydevelopmentandmanagementisstillbestconductedasaworkofskilledhumandesign,butitcanbesubstantiallyenhancedwiththeappropriatetechnologies.Therearetwobroadfunctionswithinenterprisetaxonomyandontologymanagementsystems.

TaxonomyBuildingandMaintainingEnterprisetaxonomymanagementsystemssupporttheworkofbuildingandmaintainingtaxonomiesbyprovidingworkflowsandmetadataaroundtaxonomyconcepts.Forexample,theycanhousesourcevocabulariesofvaryingauthorityandstructurethatcanbeusedtogeneratecandidatetermsforthetaxonomy.Thereareapprovalworkflowstotracktheapprovalprocess.Theremayberoles-basedpermissionsforcomplexenvironmentswheremultipletaxonomyprofessionalsareinvolvedinmaintainingthevocabularies.Sourcesandusagesoftaxonomytermscanbetrackedasattributesofthetaxonomyconcepts.Scopenotescanbeaddedtoexplicatetaxonomyterms,andtherevisionhistoryoftermsandtaxonomyfacetscanbetracked.

ReferenceStoreThesecondmajorfunctionofanenterprisetaxonomymanagementsystemistoactasareferencestoreforothersystems,supplyingtaxonomyterms,relationshipsbetweenterms,synonymsandcontrolledvocabulariestoothersystems.Itcansupply:

• controlledvocabularytermstocontentmanagementsystemstosupportmanualtaggingofcontent

• termsandrelationshipstosupportauto-categorizationviaknownentitylookup,whetherappliedwithinasearchengineoratextanalyticsengine;

Page 8: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

8

• synonymsandrelationshipstosearchsothatitcanimproverelevanceandprecisionofresults,andsupportexpansionofsearch

• taxonomyfacetstosearchtosupportfilteringofsearchresults.Becausetheyarebuilttomanipulatedefinedrelationshipsbetweenconceptsandtheirattributes,taxonomyandontologymanagementsystemscanalsoactasbrokerstootherLinkedDataorLinkedOpenDatasources,tocallauthoritytermsandrelationshipsfromelsewhere,tosupportsearchand/ortosupportenrichmentofcontentviatextanalytics.Finally,enterprisetaxonomyandontologymanagementsystemsarebuilttoactastaxonomy/ontologyhubscateringtomultiplesystemsandaudiences.Theyarecapableofinteractingdifferentlywithdifferentplatformsandaudiences.Forexample,differentversionsofthesamecoretaxonomycanbepresenteddifferentlytodifferentaudiences,dependingontheirneeds.Differentvocabulariesandtaxonomyservicescanbesuppliedtodifferentsystemsbasedontheircontentanduses.

Text Analytics Textanalyticsreferstoaquitediversefamilyofcomputer-assistedtechniquesforassigningmeaningtocontent.Differentmodulescanbeusedtoservedifferentpurposes.

ContentProcessingTherearesomebasiccontentprocessingfunctionsthattextanalyticshasincommonwithenterprisesearch:preparationoftextforparsing,tokenization,lemmatization,stemming,stopwords.

SyntacticandSemanticPre-ProcessingSeveralfunctionalapplicationsoftextanalyticsdependontwofoundationalprocessingtechniques.Thefirst,syntacticanalysis,analyzesthelanguagestructureoftextgrammatically,usingverylargeopensourcetrainingsets.Sentencesplittersbreakupsentencesintopartsofspeech,distinguishingnounsandnounphrasesfromverbsandoperators,andsoon,andsyntacticanalyzerscreatealogicalrepresentationofthesentences.Semanticanalysisattemptstoinfermeaningfromanalyzedtexts.Likesyntacticanalysisitdependsonreferencetrainingsets.Itenablescomputerstoinfermeaningfulassertionsandfactsfrombodiesoftext,andunderpinsthebasetechnologiesbehindmachinetranslation.Syntacticanalysisisbasedonknownrulesofgrammar.Semanticanalysishasamuchmorechallengingtask,whichistoinferhuman-understandablemeanings.Notallgrammaticallycorrectsentencesaremeaningfultohumans–"Colorlessgreenideassleepfuriously"isafamousexample

Page 9: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

9

presentedbyNoamChomskyofasyntacticallycorrectsentencethatissemanticallyincoherent.Syntacticandsemantictechniquesunderpintheidentificationof“entities”withincontent.Inthecontextoftextanalytics,an“entity”canbeaperson,athing,aplace,atime,atopic.Itisanydistinctconceptthatcanbeidentifiedfromtext.Identificationandextractionofentitiescanbedonemostsimplybysimplylookingupreferencelists,asinauto-categorizationofknownentitiesprovidedbysomesearchengines.However,withsyntacticandsemanticanalysistextanalyticscanalsoidentifyunknown(unpredicted)entitiesforwhichyoudon’tyethaveexamplesinyourlists.ForexampleinEnglishpropernamesarecapitalized,andpersonalnameshaveknownforms(thatalsovarybyculture).Hencecandidatesforpersonsmentionedintextcouldbepickedupbyarulethatsays“lookforoneormorecapitalizedsubjectsorobjectsofsentences,andthenlookforlemmatizedverbsfromthefollowinglist‘say’…‘reply’…etc”.Suchrulesarebasedoncommonusageasrepresentedintheirtrainingsetsandaretypicallyprovidedasstandard.Theseruleswilltypicallyneedtobetestedandrefinedbasedonlocalusage.Forexample,wefoundthatonestandardtooldidnotpickupcountryoforiginincontentfromanimmigrationandcustomsauthority,becauseitwasusingsimpleterm-phraselookup.However,customsofficersatcheckpointshabituallyusedtheconvention“[Country]-registeredvehicle”.Thehyphenationandextensionofthetermdisruptedthesimplelookupfunction.Noticethatthesolutiontothisproblemisacombinationoftechniques,likelysupplementingknownentitylookupwithsyntacticallydrivenentityextractionrules.Thepracticeofusingdifferenttechniquesincombinationofcharacteristicoftextanalyticsapproaches.Withthesefoundationalcapabilities,therearemanyspecializedapplicationsoftextanalytics,notallofwhichwillbesuitabletoeverypurpose.

Auto-categorizationTheabilitytoidentifybothknownandunknownentities/conceptsisasignificantimprovementinauto-categorizationcapabilities.Lookuplistsforknownentities/conceptscanbesupportedbyataxonomymanagementsystem,andstrengthenedthroughthesupplyofsynonyms.Conversely,throughthejudicioususeofrules,textanalyticscandiscoverentities/conceptsmentionedintextthatarenotyetinthetaxonomyorcontrolledvocabularies,andcanbeconsideredascandidateterms.

TextMiningTextminingreferstoanumberoftechniquesforanalyzingtextbasedonstatisticalalgorithms.Inverybroadterms,largebodiesoftextaresubmittedtoiterative

Page 10: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

10

operationsusingalgorithmsto“sort”thetextintoclustersor“buckets”thatareinternallystatisticallysimilar,andstatisticallydissimilartotheotherclustersorbuckets.Statisticaltechniquesarealsousedtodeterminethewordsorword-phrasesthataremostlikelytouniquelyrepresenttheirbucketcomparedtootherbuckets.Thesewordstringsareoftencalled“topicmodels”.Thelikelihoodofspecifictermsoccurringinproximitytoeachothercanalsobecalculated.Textminingissometimestoutedasafully-automatedalternativetodescribingandtaggingcontent.Inpracticehowever,theclusteringtechniquesusefuzzyalgorithms,andtheend“described”stateishighlyinfluencedbywhichclusteringpathwaysaretakenatthebeginningoftheoperation.Textminingtechniquessufferfromnon-reproducibility(thesameseriesofoperationsonthesamecontentcanproducedifferentresultsondifferentoccasions)andfromnon-comprehensibility–thetopicmodelsgeneratedoftendonotmatchhowusersthemselvesdescribethecontent.Soinpractice,applicationsoftextminingareheavilytunedbytheirhumanoperators,anditisnotalwaystransparenthowagivensetofresultsareachieved.Textminingalsosuffersfromendogeneity–meaningthatthetagsthatareextractedtodescribesimilarbucketsofcontentarewhollyderivedfromthatcontentset(orwhentrainingsetsareusedforcomparison,fromthetargetcontent+thetrainingset).Endogeneityisaproblemwhenyouwanttosurveydiversecollectionsofcontentandmapmeaningsconsistentlyacrossthem.Inthatcase,ataxonomyorontologythatcrossesthosecontentsetsandaudiencesprovidesanexogenousreferencepointthatcanbrokermeaningsacrossdifferentcontentsetsanduseraudiences.However,textminingcanbeapowerfultoolusedincombinationwithsearchandtaxonomymanagement.Itsstatisticaltechniquestoclusterbasedonsimilaritycanproposenewcategorizationsandpotentiallyrelevantrelationshipsbetweentopicstothetaxonomymanager.Itsabilitytoextracttopicmodelscanbeusefulinautomatedsummarizationtechniques.

SemanticAnalysisWehavealreadydiscussedsomeofthechallengesofcomplexityfacingsemanticanalysis,notleasttheheavyinfluenceofcontextandculture(factorsexternaltothecontent)onmeaning.Inpracticaltextanalyticsapplications,semanticanalysisreliesonanumberoftechniquesandnotjustcomputationalsemanticanalysis.Forexample,oneofthemainapplicationsofsemanticanalysisistoextractassertionsorfactsfromcomplexdocumentation.Ifsemanticanalysishasaccesstoontologiesviaataxonomymanagementsystemitcanthenenrichitsinferencesbasedonknownandvalidatedrelationshipsbetweenentities.

Page 11: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

11

SentimentAnalysisSentimentanalysisisaveryspecializedapplicationoftextanalytics,oftenusedinmarketing,socialmediaanalysis,andqualitativefeedback.Itlooksatthesemanticsofpositive,negativeandneutralexpressions,inassociationwithknown(extracted)entities.Againitishighlycontextual,andsubjecttoover-simplification–forexample,sarcasticlanguagecanbesuperficiallypositive,butthelinguisticmarkersofsarcasmareoftennotveryobvious.Aswithmosttextanalyticstechniques,therulesandalgorithmsneedtobetested,validatedandconstantlysupervisedforaccuracy.

SummarizationAutomaticsummarizationoftenexploitsknowndocumentordatastructures(itknowswheretolookforkeycontent),anditmayexploitsyntacticanalysis(e.g.removingextraneous“padding”languagesuchasadjectivesandadverbs).Itmayusesemanticanalysistoextractkeyfactsfromdocuments.

How the Pieces of the Stack Work Together Let’sbringthisbacktogethertogetabetterunderstandingofhowthepillarsofthestackworktogether.FigureA2illustratesthehighlevelviewofhowthedifferentpillarsofthesearchanddiscoverytechnologystackinteractandworktogethertoproducesuperiorresultsfororganisations.Thefocusforhowthestackisdeployed,andwhichparticulartechnologycomponentsareused,alwaysspringsfromuseranalysis,andanunderstandingofthequeriesthatusersmake,againstanunderstandingofthetargetcontent,howitisstructured,howandwhyitisproduced,howitiscurrentlybeingused,andhowitcouldpotentiallybeused,ifthesearchanddiscoveryfunctionalitycouldbeimproved.Indigitalenvironments,“targetcontent”canbeanycombinationofdocuments,webcontent,data,dashboards,andpeopleprofiles.TheconfigurationoftheelementsoftheSearchandDiscoverystackspringsfromanunderstandingofuserneedsagainstanunderstandingoftargetcontent.Determininghowtousethepillarsofthestacksitswithinalargerdiagnosticanddecision-makingframework.

1. Whatisthebusinessproblem(orsetofbusinessproblems)youaretryingtosolve?Theanswertothisquestioncanoftencomethroughaknowledgemanagementdiagnosticsactivitysuchasaknowledgeaudit,combinedwithanunderstandingoftheorganisation’sbusinessdirectionandstrategy.

2. Whoarethekeyusercommunitiesinrelationtothebusinessproblem,andhowwillyouunderstandtheirworkingneedsandopportunitiesforimprovement?(Thiscanalsocomefromaknowledgeaudit).

3. Whatisthetargetcontent(theknowledgeresourcesthatyourtargetusercommunitiesneedstoworkwithmoreeffectively),whereisit,howisitstructuredandusedcurrently?(Contentanalysisandmodelingcanhelp).

Page 12: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

12

4. Whattoolsfromwithinthesearchanddiscoverystackcanhelptoaddresstheproposedsolution?(Thefocusofthispaper)

5. Whatneedstochangeonthebehavioural,processandgovernancelevelsforthenewsolutiontowork?(Forsustainableimplementationplanning,checkoutTheKnowledgeManager’sHandbook(MiltonandLambe,2016)fromaknowledgemanagementperspectiveorMaishNichani’sseriesofarticlesathttp://olasearch.com/articlesfromasearchdesignperspective).

FigureA2Interactionsbetweentaxonomymanagement,searchandtextanalytics

SearchandTaxonomySearchistheuser-facingcomponentofthestack.Itmediatescontenttousers.Taxonomymanagementsuppliesknownsalientconceptsandrelationshipstosearch,andittellssearchwhichquerytermsaresynonymsofthesametaxonomyconcepts.By“salient”wemeanthattheconceptsarefrontofmind,action-relatedconceptsintheheadsofusersastheyperforminformationandknowledgeseekingactivitiestocompleteworktasks.Taxonomiesmakesearchsmarterbytellingitwhichconceptsareimportanttousers,howtheyarerelatedtootherconcepts,whetherthroughhierarchicalorotherrelationships.Hencesearchcanexploittaxonomiestohelpuserstorefineorexpandtheirsearch,tofollowmeaningfulpathwaysandexplorecontent.Taxonomiesandontologiesenhancetherelevanceandprecisionofsearchforusers.Becausesearchisuserfacing,itgathersalotofdataaboutuserneedsandbehaviors.Frequencyanalysisofsearchqueriestellsthetaxonomistalotabouthowusersthinkabouttheircontentandtheirsearchrequirements.Clickthroughactivity

Page 13: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

13

onsearchresultspagesalsoprovidesevidenceaboutuserperceptionsofwhatmayormaynotbeuseful.Hencesearchanalyticsprovidesafeedbacklooptohelpthetaxonomymanagerconstantlyrefineandimprovethetaxonomytouserneeds.Thisisapositivefeedbackloop,wheretaxonomyprovidesrelevanceandprecisiontosearch,andsearchprovidesaconstantflowofrealtimelarge-scaleevidenceonwhatisactuallyrelevantandusefultousersatthepresenttime.Searchanalyticscanalsopickupemergingtermsandconceptsthatareimportanttousers,andproposethemascandidatetermsfortaxonomiesandtheirsupportingthesauri.Withouttaxonomymanagement,searchisrelativelydumb,whetherintermsofwhatitcanindex,howitcanhelpusersmaketheirqueries,andhowitcanhelpusersactupontheirresults.Withoutsearch,taxonomymanagementlacksaconvincingmediumtodemonstrateitsvalue,anditlacksacontinuousflowofevidencetomaintainitsrelevanceandusefulnesstousers.

SearchandTextAnalyticsTextminingcanproposenewconceptsandnewassociationsbetweencontentitemsforthesearchmanagertobuildintoqueryprocessingandrelevancytuning.Semanticanalysisandsummarizationtechniquescanbeusedtoprovide“smart”summariesofcontentitemstobepreviewedinsearchsnippetsandinsearchresultspages,makingiteasierforuserstoassesswhetherornotthecontentisrelevanttotheirneed,beforetheyopenthecontentitem.Textanalyticscanextendthewayinwhichcontentcanbepulledtogetherinspecializedsearchbasedapplications.Allthreepillarsofthestack(search,taxonomyandtextanalytics)dependonconstanttuningoftheinfrastructuretodelivergoodsearchanddiscoveryresults.Textanalyticsinparticulardependsonrulestoguidehowthecontentisprocessedandtagged.Theserulesneedmaintenanceandtuning.Thistuningisnecessarybecauseoftheconstantongoingchangesincontent,context,priorities,usersandneeds.Searchanalyticsprovidealargescale,real-timewindowintouserneedsandbehaviors,andso,aswithtaxonomymanagement,itprovidestheevidencebaseneededtokeepthetextanalyticsstacktunedtodeliveraccurateandusefulresultsforusers.Whentherearelargeandcomplexknowledgebases,withouttextanalytics,searchfindsitdifficulttopickoutsalientconceptsrelatedtocontent.Aswithtaxonomymanagement,textanalyticsdependsondeepuserandcontentknowledgetoknowhowtoconfigureandtuneitsoperations.Withoutsearch,textanalyticsfindsithardtodothisunlesstheteamispreparedtoinvestinconstant(andexpensive)userresearchandtesting.

TextAnalyticsandTaxonomyManagementTaxonomiesaredesigned,evidence-basedartifacts.Assuch,theyconsistentlylagthepaceofchange.Theyareoptimizedtoexploitknownanddocumentedconcepts.Textanalyticstechniquessuchasauto-categorizationandtextminingcansupplycandidatetermsandrelationshipstothetaxonomymanagerwithouttheneedfor

Page 14: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

14

comprehensivere-surveyingofcontentanduserneeds.Theycanpickupemergingconceptsinthecontent,justassearchanalyticspicksupandproposesemergingconceptsevidencedinsearchqueries.Sotextanalyticscanactasapowerfultaxonomyenrichmenttool.Taxonomiesandontologiescanguideandstrengthentheapplicationoftextanalyticstechniquessuchasauto-categorization(knownentityextractionthroughlookup),factextractionthroughsemanticanalysis,andsentimentanalysis(byhelpingthesentimentengineresolvedifferentsynonymstothesameentitybeingreferencedinthecontent).Whentherearelargeandcomplexknowledgebases,withouttextanalytics,taxonomytagscannoteasilyandconsistentlybeappliedtolarge,diversebodiesofcontent.Withouttaxonomymanagement,textanalyticsstrugglestodescribecontentintermsthatmakesensetokeyusersandinthecontextofuseractivities.Machinetaggingtechniquesalonedonotdescribecontentintermsthathumansintuitivelyunderstand,withoutsomeformofhuman-curatedfilter.Taxonomymanagementprovidesthis.

Implications Fortoolongorganisationshavebeenlisteningtotheboomingvoicegivingmagicalreassurancesfrombehindthecurtain,andhaveneglectedtheirowncapabilities,andthetoolsetsthatexistintheopensourcearena(thatcommercialproductsalsoexploit).Astheworldbecomesmorecomplex,solutionstosearchanddiscoveryproblemswillbecomemorediverse.Weneedtobecomebetterjourneymen,moreknowledgeableaboutthecapabilitiesdifferenttoolsetsthatareavailable,sothatwecanchoosetherighttoolsforanygivenproblem.Someofthemwillbecommercialtools,acquiredfortheirexcellenceinaspecificsetoffunctions.Someofthemwillbeopensourcetools,acquiredandtunedtosolvespecifickindsofproblem.Theageofsinglesource“doitall”commercialplatformsisover.Thispaperhasproposedthatknowledgeofthesearchanddiscoverytechnologystackisanimportantcapabilitytoacquire.Itneedstobeatleastadequatetothejobofevaluatingneedsandpossibilities,andofaskingvendorssearchingquestionsabouthowtheirtechnologyworks,whatitisgoodat,andwhatitslimitationsare.Insomecases,organisationswillwanttointernalizeadeepertechnicalcapabilityinworkingwithanumberofsetsoftools.Thereareotherimportantsearchanddiscoveryrelatedcapabilitiesthatenterprisesneedtoacquirebeyondaknowledgeofthesearchanddiscoverytechnologystack–forexample,howuserneedsandcontentcanbeanalysedtodetermineprioritiesandopportunities,howcontentcanbemanagedandenrichedtoprovideoptimaluserexperiences,andhowbusinessproblemscanbetransformedthroughdesignprocessesintoeffectivesolutionsforbusinessneeds,problemsandopportunities.

Page 15: Search and Discovery Technology Stack v3 - Green Chameleon€¦ · "I am Oz, the Great and Terrible," said the little man, in a trembling voice. "But don't strike me--please don't--and

15

I would like to acknowledge the following colleagues who have provided valuable insights and advice on this paper: Dave Clarke, Agnes Molnar, Maish Nichani and Tom Reamy. Patrick Lambe October 2016-January 2017