
  • 8/3/2019 Implementing Solr in Online Media Bo Raun

    1/33

    Suddenly. Solr

    Implementing Solr in Online Media as an Alternative to Commercial Search Products

    Bo Raun, Nordjyske Medier, DK

    2/33

    Introduction

    About Nordjyske Medier

    Our Search Challenges

    Discovering Search with Solr

    Making the Transition

    Lessons Learned

    Looking ahead

    5/25/2010 Apache Lucene EuroCon

    3/33

    Nordjyske Medier

    First publication in 1767

    Danish media company for web, radio, TV and print media. News and adverts for Northern Denmark

    About 600 employees: media, call centers, in-store TV, application development, etc.

    Media reaches 75% of the local population daily, 90% weekly

    Net & mobile: developing our online business

    4/33

    My background

    20 years' experience

    Analysis, design and programming

    Pascal, Delphi, C, VB, C#

    RDBMS and SQL

    5/33

    The environment and legacy

    A long history working with RDBMS systems

    IT strategy built on Microsoft technologies and other closed-source solutions, like Citrix, VMware, etc.

    No tradition or experience using open source search

    IT organization & development skills: .NET, Visual Studio, Windows, websites and web services (XML), app/integration development

    MS SQL is the de facto storage for data

    Main media sites: 183,000 users, March 2010

    Yellow Pages users: 2,700 in August 2009, 10,900 in March 2010

    6/33

    Why search is important to Nordjyske Medier

    Yellow (and White) Pages

    Major source of revenue in advertising; tying online display adverts to Yellow Pages directory listings

    Add-on to advertising campaign packages

    Editorial Content

    Articles (in-depth, review, short syndicated news feeds, picture captions)

    7/33

    Yellow pages and White pages (www.folkogfag.dk)

    Names, addresses, phone numbers, directions, vCards

    Yellow Pages

    Daily updates, a few hundred bytes to 5 kB

    Advertisers get boosting, links, keywords, profiles

    500k documents

    White Pages

    4 million documents, a few hundred bytes

    8/33

    Yellow pages design: keeping it simple

    9/33

    Editorial archive (www.nordjyske.dk)

    Updates almost every minute during primetime

    Changes pulled and indexed every 10 minutes

    Offers different media versions for same story (default web)

    Quite simple interface (for now)

    10/33

    Editorial search design

    11/33

    What were we using, and where did it fall short?

    Original strategy, c. 2008

    Search is about crawling. Let's get a search appliance and have it crawl content that we want to repurpose/present, and we can work in the Yellow/White Pages data and every other website we want

    Ten million terms in the index should be more than enough

    Relevance out of the box: let the search appliance do its magic

    Lean on the branding value of a well-known search technology

    12/33

    Appliance? Sounds good..

    Up and running in no time

    Excellent response times

    Commercial support

    Strong brand name

    13/33

    There were problems

    Articles and yellow profile pages being missed

    Slow updates: let's turn off crawling and post data by scripting

    How do we control boosting of customer profiles?

    Where's the test environment? Or the development environment?

    The index is running full: put white pages in SQL and join them client side

    Yellow pages polluted with text from news feeds and HTML layout

    Strong brand name... and so what?

    Struggling with core functionality past deadline

    14/33

    MS SQL: Strengths and Weaknesses

    Microsoft FAST?

    Server already well known and supported inside the organization

    Integrates well with .NET and Visual Studio

    Virtually unlimited document space

    But Yellow/White pages didn't perform well

    Full Text goes some of the way

    Not that many options compared to alternatives

    15/33

    Demand > technology

    Appliance + SQL Server wasn't doing it for search.

    Alternatives? Maybe Open Source?

    16/33

    Prototyping: faster than teaching a 9-year-old Ju-Jitsu

    In the course of 60 minutes on my laptop

    Found out Solr 1.3 was probably worth a try

    Downloaded and installed

    Posted example documents into the Solr index and started searching, figuring out operations

    Feeding the family

    Late night coding, feeding Solr

    Repurposed utility that posted XML into the appliance, to stage XML for Solr

    Built harness to test Solr with random terms: did quite well
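
    The slides do not show the test harness itself. As a rough illustration of the idea (fire random single-term searches and time the responses), here is a minimal Python sketch. The select URL assumes a default local Solr example install, and the function names and the injectable `fetch` callback are hypothetical, not taken from the talk.

```python
import random
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a default local example install; adjust to your own host and core.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_query_url(term, base=SOLR_SELECT, rows=10):
    """Return a Solr select URL for a single search term."""
    return base + "?" + urlencode({"q": term, "rows": rows})


def http_fetch(url):
    """Default fetcher: perform the HTTP GET and return the raw body."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def run_harness(terms, n_queries, fetch=http_fetch, seed=None):
    """Run n_queries random single-term searches; return (term, seconds) pairs."""
    rng = random.Random(seed)
    timings = []
    for _ in range(n_queries):
        term = rng.choice(terms)
        start = time.perf_counter()
        fetch(build_query_url(term))
        timings.append((term, time.perf_counter() - start))
    return timings
```

    Passing a stub as `fetch` lets the harness be exercised without a running Solr instance; with the default fetcher it issues real HTTP requests.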

    17/33

    The old SQL dog learns new tricks, or how I learned to stop worrying and love non-relational data structures

    How do I make schema relations?

    No direct row editing for debug, no direct data manipulation statements

    XML-driven query and retrieval: very appealing at first, reuse of existing scripts

    Boosting documents instead of sorting: relevance takes care of the rest

    Faceting instead of extra requests for counting results
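
    To illustrate the faceting point: a single Solr request can return both the hits and per-field counts, where a SQL-backed page would typically issue extra COUNT queries. The sketch below only builds such a request URL. The `facet`, `facet.field` and `facet.mincount` parameters are standard Solr query parameters; the base URL and the field names `city` and `category` are assumptions for illustration.

```python
from urllib.parse import urlencode


def facet_query_url(base, query, facet_fields, rows=10):
    """Build a select URL that returns hits plus facet counts in one request."""
    params = [
        ("q", query),
        ("rows", rows),
        ("facet", "true"),
        ("facet.mincount", 1),  # hide facet values with zero hits
    ]
    # facet.field may be repeated, once per field to count.
    params += [("facet.field", name) for name in facet_fields]
    return base + "?" + urlencode(params)


url = facet_query_url(
    "http://localhost:8983/solr/select",  # assumed local install
    "pizza",
    ["city", "category"],  # hypothetical field names
)
```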

    18/33

    Idea for sale, but who's buying?

    Are we Java based now?

    .NET library (SolrSharp) for the developers

    Are we based on open source now? What about support?

    Small adventures, e.g. MySQL, had been ill-introduced

    Lucid Imagination and Findwise: support contract (ExpertLink) provided a support scenario similar to commercial products

    Support: for disaster scenarios and for stuck developers

    Time for a sanity check

    Assessment report, ensuring stability

    Annual search health check

    19/33

    How Solr was implemented

    Solution: leverage the example schema!

    Platform: 32-bit Windows machine w/2GB RAM, as Solr (unfortunately) used very little capacity in proof of concept

    Data handling done by old scripts

    VMware machine snapshot backup

    20/33

    Results: It worked!

    Customers boosted as promised

    Excellent response times

    Instant indexing

    Full control over data (eventually disabled profile text indexing)

    But now the editorial archive is having frequent timeouts: what to do about that?

    21/33

    Upgrading the editorial search

    Configuration from scratch

    Content directly from SQL

    More challenges

    Ontology integration

    More features wanted

    22/33

    Handy Solr features out of the box

    Stemming (Danish supported, not dictionary perfect)

    Example: specialist vs specialsterne (specialists)

    23/33

    Handy Solr features out of the box

    Special characters

    Example: Ålborg vs Aalborg

    24/33

    Data import handlers (DIH) vs posting XML: goodbye import scripts, hello XML-SQL and curl

    Easy import SQL scripting

    25/33

    The usual data-juggling... The Solr way

    Getting the data for a core

    http://solr01:8983/solr/Nordjyske/dataimport?command=full-import

    Incremental delta imports

    http://solr01:8983/solr/Nordjyske/dataimport?command=delta-import

    Oops

    http://solr01:8983/solr/Nordjyske/dataimport?command=abort

    Reload without restart

    http://solr01:8983/solr/Nordjyske/dataimport?command=reload-config

    Curl + scheduled tasks = saves hours of plumbing & programming
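
    The talk drives these commands with curl from scheduled tasks. The same calls could be scripted; the Python sketch below is one hedged way to do it. The core URL and the five command names come from the slides; the helper names `dataimport_url` and `run_command` and the injectable `opener` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Commands that appear in the slides.
DIH_COMMANDS = {"full-import", "delta-import", "abort", "reload-config", "status"}


def dataimport_url(core_url, command):
    """Build the Data Import Handler URL for one command on one core."""
    if command not in DIH_COMMANDS:
        raise ValueError("unknown DIH command: " + command)
    return core_url.rstrip("/") + "/dataimport?" + urlencode({"command": command})


def run_command(core_url, command, opener=urlopen):
    """Send the command and return the response body as text."""
    with opener(dataimport_url(core_url, command)) as response:
        return response.read().decode("utf-8")


delta_url = dataimport_url("http://solr01:8983/solr/Nordjyske/", "delta-import")
```

    A scheduled task would then call, for example, `run_command("http://solr01:8983/solr/Nordjyske", "delta-import")` every ten minutes, matching the indexing interval mentioned for the editorial archive.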

    26/33

    .. Is this thing turned on..?

    http://solr01:8983/solr/Nordjyske/dataimport?command=status

    busy

    .0:2:24.925312621

    .. Indexing completed. Added/Updated: 21 documents. Deleted 0 documents.

    This response format is experimental. It is likely to change in the future.
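
    The XML markup of the status response was lost from the slide, so the following is only a sketch of how a script might read it. It assumes the response carries a `str` element named `status` and a `lst` named `statusMessages`; treat both names and the sample document as assumptions, especially since the response itself warns that the format is experimental.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; the real response contains more elements.
SAMPLE = """<response>
  <str name="status">busy</str>
  <lst name="statusMessages">
    <str name="Time Elapsed">0:2:24.925</str>
  </lst>
</response>"""


def parse_dih_status(xml_text):
    """Return (status, messages) from a dataimport status response."""
    root = ET.fromstring(xml_text)
    status = None
    messages = {}
    for child in root:
        if child.tag == "str" and child.get("name") == "status":
            status = child.text
        elif child.tag == "lst" and child.get("name") == "statusMessages":
            for item in child:
                # Entries without a name attribute are stored under "".
                messages[item.get("name", "")] = item.text
    return status, messages


status, messages = parse_dih_status(SAMPLE)
```

    A polling script could loop until `status` is no longer `busy` before triggering the next import.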

    27/33

    Boosting Yellow Page customer happiness

    The old XML posting

    599113
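
    The XML of the posted document did not survive on this slide; only the value 599113 remains. As a hedged reconstruction of the general technique, older Solr versions such as the 1.x line discussed here accepted an index-time boost as a `boost` attribute on the `doc` element of an update message. The sketch below builds such a message; the field name `id` and the helper name are assumptions.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields, boost=None):
    """Build a Solr XML <add> message for one document, optionally boosted."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))  # index-time document boost
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# "id" is an assumed field name; 599113 is the value left on the slide.
message = build_add_xml({"id": "599113"}, boost=2.5)
```

    The resulting string would be posted to the core's update handler, followed by a commit.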

    28/33

    Custom transformers

    Data Import Handler transformer call

    [...] 0) {
        row.put("$docBoost", 2.5f);
        return row;
    }}
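
    Only the tail of the transformer survives above, and the condition that triggers the boost is lost. Real DIH transformers are written in Java (or as script transformers); the Python sketch below merely mirrors the row logic that the fragment suggests: when some per-row value is greater than zero, set the special `$docBoost` key to 2.5. The field name `advertiser_level` is purely hypothetical.

```python
DOC_BOOST_KEY = "$docBoost"  # special key read by the Data Import Handler


def transform_row(row, paying_field="advertiser_level", boost=2.5):
    """Add a document boost to rows whose paying_field is greater than zero."""
    if (row.get(paying_field) or 0) > 0:
        row[DOC_BOOST_KEY] = boost
    return row


boosted = transform_row({"id": 1, "advertiser_level": 2})
plain = transform_row({"id": 2})
```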

    29/33

    Integration with Semaphore (ontology classification)

    Ontology engine server adds in keywords automatically

    Subjects (Emner)

    People (Mennesker)

    Places (Steder)

    Companies (Firmaer)

    analyze text

    Documents User search

    Enhance search

    Create topic pages seamlessly

    30/33

    Semaphore integration?

    Custom transformers save the day yet again

    Repackage documents, throw them at Classification server, fill in metatags, save to Solr

    Smartlogic

    August 2009: Solr?

    Now has built in integration for Solr (baseliner)

    31/33

    Semaphore and Solr

    The basic setup

    [Diagram: Content DB, Indexing Pipeline, Classification Server, Rulebases, Baseliner, Ontology Server, Search Enhancement Server, Solr, Web]

    32/33

    Nutch integration

    A little more tricky than just turning on Solr

    Cygwin

    Seems to work OK, requires some nursing (like GSA)

    Useful for external sites and closed turn-key websites from third parties

    Currently on hold: crawling just got less important

    33/33

    Conclusion: Lessons learned

    The right tools for the right job

    To crawl or not to crawl

    No such thing as magic relevance

    Prototyping is the key

    Get buy-in from the people who will run it

    Commercial support foundation

    Extensibility, integration and no limits to document base

    Solr pops up everywhere! In the new CMS, next editorial backend, ontology integration

    Get in touch: [email protected]
