What's the story with open source?
Searching and monitoring news media with open source technology

Charlie Hull, Flax
BCS IRSG Search Solutions 2010

Photo source: http://www.flickr.com/photos/shironekoeuro/

www.flax.co.uk 2

What is Flax?

- Search engine specialists
- Formed in 2001 from the ashes of Muscat Ltd and Webtop as Lemur Consulting Ltd
- Based in Cambridge, UK
- Contributors to and users of Xapian
- Recently selected as UK Authorized Partner by Lucid Imagination
- Customers include Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen

Apache Lucene and Solr are trademarks of The Apache Software Foundation

The challenges

- Content is created for publication, not for search
- Content isn't published consistently or available to all
- Ranking is never simple
- “We just want something like Google”
- Every system will have to scale beyond its originally planned size

- Every project is different

So how do we build news search?

Indexing
- Historical, daily & updates (i.e. later editions)
- Must cope with high volume, quickly
- Essential metadata – byline, title, source
- File format translation not always necessary
- BUT pre-processing sometimes required
- Content restriction & embargo data

Solution
- Lightweight, customisable index scripts using powerful open source libraries


So how do we build news search?

    from datetime import datetime

    import xapian
    import flax.core

    # Create a new Xapian database in the directory 'db'
    db = xapian.WritableDatabase('db', xapian.DB_CREATE)

    # Describe the fields once and save the mapping with the database
    fm = flax.core.Fieldmap()
    fm.language = 'en'            # stem for English
    fm.setfield('mytext', False)  # freetext field
    fm.setfield('mydate', True)   # filter field
    fm.save(db)

    # Build one document and add it to the database
    doc = fm.document()
    doc.index('mytext', "I don't like spam.")
    doc.index('mydate', datetime(2010, 2, 3, 12, 0))
    fm.add_document(db, doc)
    db.flush()
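
The indexing points about essential metadata and pre-processing can be illustrated with a short sketch of the step that might run before any document reaches the indexer. This is not Flax code: the XML layout and the element names (`headline`, `byline`, `source`, `body`) are assumptions made up for illustration, and real publisher feeds will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical article layout; real publisher feeds will differ.
SAMPLE = """<article>
  <headline>Council  approves new bridge</headline>
  <byline>By Jane Smith</byline>
  <source>Example Gazette</source>
  <body><p>First paragraph.</p><p>Second paragraph.</p></body>
</article>"""


def extract_fields(xml_text):
    """Pull out the metadata and body text an indexer would need."""
    root = ET.fromstring(xml_text)

    def text_of(tag):
        node = root.find(tag)
        if node is None:
            return ""
        # itertext() flattens nested markup; split/join normalises whitespace.
        return " ".join(" ".join(node.itertext()).split())

    byline = text_of("byline")
    if byline.lower().startswith("by "):
        byline = byline[3:]  # drop the leading "By " written for print
    return {
        "title": text_of("headline"),
        "byline": byline,
        "source": text_of("source"),
        "text": text_of("body"),
    }


fields = extract_fields(SAMPLE)
```

The resulting dictionary could then be passed field by field to whichever indexing library is in use.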

So how do we build news search?

Searching
- Free text with Boolean operators
- Filters for metadata & date ranges
- Combine date and relevance ranking
- Faceted search where appropriate
- Saved searches & alerting
- 'More like this'
- Content restriction & embargo filters

Solution
- Template-based user interface scripts, again using open source libraries
- Beware JavaScript & older browsers!
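
Two of the search requirements, combining date with relevance and filtering embargoed content, can be sketched in a few lines. The slides do not say how the ranking is combined, so the exponential recency decay here is only one possible choice, and the field names (`relevance`, `published`, `embargo_until`) and the seven-day half-life are assumptions for illustration.

```python
import math
from datetime import datetime, timedelta


def visible(article, now):
    """Embargo filter: hide anything whose embargo has not yet passed."""
    embargo = article.get("embargo_until")
    return embargo is None or embargo <= now


def blended_score(relevance, published, now, half_life_days=7.0):
    """Scale a relevance score by an exponential recency decay.

    An article exactly one half-life old keeps half of its relevance score.
    """
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    return relevance * math.pow(0.5, age_days / half_life_days)


def rank(hits, now):
    """Drop embargoed hits, then order the rest by blended score."""
    allowed = [h for h in hits if visible(h, now)]
    return sorted(
        allowed,
        key=lambda h: blended_score(h["relevance"], h["published"], now),
        reverse=True,
    )


now = datetime(2010, 10, 21, 12, 0)
hits = [
    {"id": "old-strong", "relevance": 0.9, "published": now - timedelta(days=28)},
    {"id": "new-weak", "relevance": 0.3, "published": now - timedelta(hours=6)},
    {"id": "embargoed", "relevance": 1.0, "published": now,
     "embargo_until": now + timedelta(days=1)},
]
ranked_ids = [h["id"] for h in rank(hits, now)]
```

With these sample values the recent but weaker match outranks the four-week-old strong match, and the embargoed article is not returned at all. Tuning the half-life is the kind of per-project decision the "ranking is never simple" point refers to.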

So how do we build news search?

Administration
- Indexing failures common
- Logging is essential
- Log to text as a first pass, reports later

Scalability
- Content is always growing
- Both indexing & searching must scale
- Open source search libraries provide distributed indexing, replication, remote indexes
- Not simple to get this right!
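
The "log to text as a first pass" advice might look like the following sketch, which uses Python's standard `logging` module. The logger name, the log file location and the `fake_index_one` stand-in are all hypothetical; a real script would call its actual indexing routine instead.

```python
import logging
import os
import tempfile


def make_index_logger(log_path):
    """Create a logger that appends one plain-text line per event."""
    logger = logging.getLogger("indexer.sketch")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output in the file only
    logger.handlers.clear()   # avoid duplicate lines if called twice
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def index_batch(paths, index_one, logger):
    """Index each file, logging failures instead of stopping the batch."""
    failed = []
    for path in paths:
        try:
            index_one(path)
            logger.info("indexed %s", path)
        except Exception as exc:  # one bad file must not stop the run
            logger.error("failed %s: %s", path, exc)
            failed.append(path)
    return failed


def fake_index_one(path):
    """Stand-in for a real indexing call; rejects one file on purpose."""
    if path.endswith(".bad"):
        raise ValueError("unparseable content")


log_path = os.path.join(tempfile.mkdtemp(), "indexing.log")
logger = make_index_logger(log_path)
failed = index_batch(["a.xml", "b.bad", "c.xml"], fake_index_one, logger)
```

A plain text file like this can be searched with ordinary tools straight away, and summarised into reports later once it is clear which failures matter.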

So how do we build news search?

Available open source technologies
- Languages – C/C++, Java, Python, JavaScript
- Search libraries – Xapian, Lucene
- Search bindings/servers – Xappy, Flax.core, Solr
- External libraries – pyparsing, CherryPy, xmllib, mxODBC, ...
- Presentation & UI – HTMLTemplate, MochiKit, jQuery, Yahoo! User Interface (YUI), ...

We can use whatever works!

Some examples

Newspaper Licensing Agency – NLA Clipshare
- 20 million newspaper stories
- 6500 users
- Content from every major newspaper (and most regionals)
- Used by journalists, clippings agencies, media monitors
- Replacing internal systems at major newspapers
- One of very few ways to search content from all the papers within hours of publication

http://www.nla-clipshare.com

Some examples

Financial Times – press cuttings
- Web Service for easy integration
- XML source data
- Faceted search
- Area filters (whole article, body, headline, byline or any combination)
- Synonyms, spelling suggestions
- Built from scratch in a fortnight
- Designed as a prototype, scaled to production use without significant change

http://presscuttings.ft.com
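
To show what area filters and faceted search mean in practice, here is a deliberately naive, in-memory sketch. It is not how the Financial Times system is implemented: the sample articles, the field names and the `section` facet are invented, and plain substring matching stands in for a real search engine.

```python
from collections import Counter

# Hypothetical articles; the field names are illustrative only.
ARTICLES = [
    {"headline": "Bank raises rates", "byline": "A. Writer",
     "body": "The central bank acted today.", "section": "Markets"},
    {"headline": "Tech results due", "byline": "B. Writer",
     "body": "Analysts expect the bank to comment.", "section": "Companies"},
    {"headline": "Rates and retail", "byline": "A. Writer",
     "body": "Shoppers feel the pinch.", "section": "Markets"},
]


def search(term, areas=("headline", "byline", "body")):
    """Return articles where term occurs in any of the chosen areas."""
    term = term.lower()
    return [a for a in ARTICLES
            if any(term in a[area].lower() for area in areas)]


def facet_counts(results, field):
    """Count results per value of a metadata field, for display as facets."""
    return Counter(a[field] for a in results)


anywhere = search("bank")                            # whole article
headline_only = search("bank", areas=("headline",))  # one area only
facets = facet_counts(anywhere, "section")
```

Restricting `areas` narrows where a match may occur, while the facet counts summarise the result set so a user can drill down by section.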

A different task – news monitoring

- Non-traditional use of search
- Many automated searches on incoming content
- Searches reflect complex client needs
- False positives require human checking
- False negatives should never occur!
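
Monitoring turns search around: the queries are stored and each incoming article is tested against all of them. A minimal sketch of that idea follows, assuming a very simple profile format (`all` terms required, `none` terms forbidden) that is invented here and far simpler than real client profiles.

```python
import re

# Hypothetical client profiles: every term in "all" must appear,
# and no term in "none" may appear.
PROFILES = {
    "client-a": {"all": {"bridge", "council"}, "none": set()},
    "client-b": {"all": {"football"}, "none": {"american"}},
}


def tokens(text):
    """Lower-case word set; a real engine would also stem and handle phrases."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matching_profiles(article_text, profiles=PROFILES):
    """Return the ids of all stored profiles that the article satisfies."""
    words = tokens(article_text)
    return sorted(
        pid for pid, p in profiles.items()
        if p["all"] <= words and not (p["none"] & words)
    )


matches = matching_profiles("Council approves new bridge over the river")
```

Looping over every profile for every article is fine for a sketch, but with thousands of profiles and hundreds of thousands of articles a day the profiles themselves would need to be organised for fast lookup.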

A different task – news monitoring
An example

Durrants Ltd.
- Thousands of client search profiles
- Hundreds of thousands of articles per day
- Complex publication hierarchy
- Established pipeline

Solution
- Flexible query language allows for OCR errors, punctuation, fuzzy matching, weighting
- Supports features of previous engine
- Scalable master-slave architecture

Accuracy improved in some cases from 95% rejected to 95% accepted
Hardware budget 15% of previous system
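
Tolerating OCR errors can be illustrated with the standard library's `difflib`. This is a stand-in technique chosen for the sketch, not the matching method of the system described above, and the sample text and the 0.8 cutoff are assumptions.

```python
import difflib
import re


def fuzzy_contains(term, text, cutoff=0.8):
    """True if some word in text is close enough to term.

    difflib's similarity ratio is used purely for illustration.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    close = difflib.get_close_matches(term.lower(), words, n=1, cutoff=cutoff)
    return bool(close)


# OCR has misread the letter "o" as the digit "0" in two words.
ocr_text = "Shares in Vodaf0ne rose sharply, the c0mpany said."
hit = fuzzy_contains("vodafone", ocr_text)
miss = fuzzy_contains("barclays", ocr_text)
```

Lowering the cutoff catches more damaged words at the price of more false positives, which is the trade-off behind the human checking step mentioned earlier.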

Why open source?

- Flexible, extendable
- Powerful & scalable
- Lower cost
- Commercial support available as necessary

- Freedom to innovate

Looking to the future

- More and more content including social media
- Multiple delivery platforms
- Search-powered websites & applications
- 'No-SQL'
- Cloud

Search no longer a bolt-on, but a platform for innovation
Open source no longer an outsider, but the obvious choice


Thank you!

Questions?

[email protected]/blog
Twitter: @FlaxSearch

Photo source: http://www.flickr.com/photos/katerha/4259440136/