sla summer 2008

Mining Solutions: A New Approach to Making the Most of Your Research Time. SLA, Strategic Technology Alliance, Seattle, 2008. Joe Buzzanga, Product Manager, Elsevier Science and Technology. June 17, 2008

Upload: joebuzz1

Post on 07-Dec-2014

Category:

Technology


DESCRIPTION

My presentation to SLA, summer 2008

TRANSCRIPT

Page 1: SLA Summer 2008

Mining Solutions
A New Approach to Making the Most of Your Research Time

SLA, Strategic Technology Alliance, Seattle, 2008
Joe Buzzanga, Product Manager, Elsevier Science and Technology
June 17, 2008

Page 2: SLA Summer 2008

Agenda

•Challenges and Framework for Information Retrieval (IR)

•Using Natural Language Processing (NLP) in IR (illumin8)

•Product Demo

Page 3: SLA Summer 2008

Digital Universe: 10x bigger in 5 years

“Searching for meaning in the content of unstructured data like images, video clips, documents, and the numbers and characters in databases is the rocket science of the digital universe.” IDC

Source: IDC Whitepaper, The Diverse and Exploding Digital Universe, March 2008

Page 4: SLA Summer 2008

Today’s Researcher?

Search for Meaning?

Page 6: SLA Summer 2008

Impact on Information Retrieval

•Separate the Signal from Noise

•Signal processing

Page 7: SLA Summer 2008

Our Goal

•Make you successful through superior information retrieval tools

Page 8: SLA Summer 2008

Framework for Information Retrieval

•Simple Model: single book

[Diagram labels: Human Index, Search, Simple Model, Content]

•Traditional: card catalog, periodical index…

[Diagram labels: Human Index, Search, Print Collections, Surrogate Record, Content, Meta Data]

Page 9: SLA Summer 2008

Framework for Information Retrieval

[Diagram labels: Human Index, Search, Digital Bibliographic A&I, Surrogate Record, Digital Index, Content, Hybrid Index, Meta Data]

•Digital bibliographic A&I
•Semi-structured records
•Content under editorial control
•Application of controlled terms
•Application of digital indexing
•Results need to be organized and ranked
•Additional access points (e.g., facets, tags…)

Results

Page 10: SLA Summer 2008

Framework for Information Retrieval

•No Human Intervention
•Content unstructured, uncontrolled and unmeasurable
•Crawling is inherently imperfect
•Typically Keyword indexing
•Ranking of results becomes critical

[Diagram labels: Web, Search, Crawl, Digital Index, Content, Results]

Page 11: SLA Summer 2008

Content: How Big is the Web?

Today

170 million websites across all domains

Source: Netcraft

2 years ago

80 million websites across all domains

Page 12: SLA Summer 2008

Content: Plumbing the Depths

Source: Mills Davis, Project 10X

Page 13: SLA Summer 2008

Content: How Big is the Web?

~10 Billion pages (2003 estimate)

http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/

Page 14: SLA Summer 2008

Crawling in the Dark

Page 15: SLA Summer 2008

The Key in Keyword?

• Keyword is a misnomer in the context of an index
• Keyword is in the mind of the searcher
• Every word is indexed, since the computer is not smart enough to know significant words (i.e., the “key” in “keyword”)
• Brute force approach, feasible with compute power
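
As a rough illustration of the brute-force point above, the following Python sketch builds a toy inverted index in which every word is indexed, with no notion of which words are significant. The function names (`build_index`, `search`) and the two sample documents are hypothetical and are not taken from any product described in this talk.

```python
from collections import defaultdict


def build_index(docs):
    """Map every lowercased word to the set of document ids that contain it.

    No word is treated as more "key" than another: everything is indexed.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            word = token.strip('.,;:!?"()')
            if word:
                index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word of the query (boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, for illustration only.
docs = {
    1: "Compostable film made from corn starch.",
    2: "The film festival opens in Seattle.",
}
index = build_index(docs)
hits = search(index, "compostable film")
```

Note that the word "the" is indexed just like "compostable"; deciding which word is the "key" is left entirely to the searcher's query.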

Page 16: SLA Summer 2008

Results: Mystery Equation

mystery clip

Page 17: SLA Summer 2008

Results: Facets

Page 18: SLA Summer 2008

Research and its Discontents

5.5 hours / week* Searching and gathering information

* Source: 2007 survey of 6,300 knowledge workers, Outsell, Inc.

4.7 hours / week* Organizing, analyzing and applying information

Page 19: SLA Summer 2008

Introducing illumin8

•Cut through the noise
•Rapid summary/overview
•Cross domain view
•Integrated content
•Web-based
•Sharing results

Applies Natural Language Processing at Internet Scale!

Page 20: SLA Summer 2008

Typical Search

Current general search

Get millions of documents to sift through

Page 1, Page 2, … Page 180,000

compostable film

There is just no way any researcher can read through all this information. It just takes too long!

Page 21: SLA Summer 2008

illumin8 Uses Natural Language Processing to “read” text

Enter search terms

Generate Organized Result Set

Products, Companies/Organizations, Technical Approaches

•Results grouped into meaningful classes

•System generates list of solutions, not records

•Quickly see interesting and useful areas for investigation
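
One way to picture "a list of solutions, not records" is to group extracted solutions by class instead of returning a flat list of documents. The Python sketch below is only an illustration under assumptions: the tuple layout, the `group_solutions` helper, and every sample value are hypothetical, and the talk does not describe how illumin8 actually stores or groups its results.

```python
from collections import defaultdict


def group_solutions(extractions):
    """Group (solution, category, doc_id) tuples by category, then by solution.

    Returns {category: {solution: sorted list of supporting document ids}}.
    """
    grouped = defaultdict(lambda: defaultdict(set))
    for solution, category, doc_id in extractions:
        grouped[category][solution].add(doc_id)
    return {
        category: {sol: sorted(ids) for sol, ids in solutions.items()}
        for category, solutions in grouped.items()
    }


# Hypothetical extractions for the query "compostable film" (invented data).
extractions = [
    ("starch-based film", "Products", 101),
    ("starch-based film", "Products", 205),
    ("Company A", "Companies/Organizations", 101),
    ("extrusion coating", "Technical Approaches", 317),
]
result_set = group_solutions(extractions)
```

Here the reader sees three classes with a handful of solutions each, and the document ids remain available as supporting evidence behind each solution.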

Page 22: SLA Summer 2008

Our Approach

• Premium Scientific
• Patent
• Web

[Diagram labels: Search-Crawl-Load, Semantic Index, Content, Results. "NLP Applied" appears three times in the diagram, alongside "Problems, Solutions, Benefits" and "Fuse, Classify, Summarize".]

NLP applied throughout the system: index, query, result set

Page 23: SLA Summer 2008

How does illumin8 work?

illumin8 searches on solutions. The solutions are extracted from full text sources, abstracts, web, and patents.

Internet: 5 Billion web pages, blogs and forums

Full Text: 3 Million full-text scientific and technical articles from 1,800 Elsevier journals

Abstracts: 33 Million scientific records from 15,000 peer-reviewed journals & more than 4,000 publishers

Patents: 21 Million patents from 5 worldwide patent offices

Extract and Summarize Solutions

illumin8 Solution Database: 1.1 billion

Search

Page 24: SLA Summer 2008

WEB JOURNAL PATENT

• Summarizing information about Companies, Products, etc., for technologies that researchers care about

• Organizing results from the world's most trusted scientific content and billions of web pages

A Uniform Lens (index) Across Content Sources

Page 25: SLA Summer 2008

Keyword Indexing

• Meaning is lost

Taking Search Beyond Keyword Indexing

Sentence processing

• Meaning is maintained

• Identify & classify problems, solutions and benefits

Neural Network [Solution] used in handwriting recognition [Problem]
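
The contrast between keyword indexing and sentence processing can be sketched with a toy example in Python. A bag of keywords is identical for a sentence and its (contrived) reversal, while a pattern over the whole sentence keeps track of which phrase plays which role. The single "X used in Y" pattern and the helper names below are assumptions made for illustration; they are not illumin8's actual rules.

```python
import re


def keyword_bag(sentence):
    """Keyword view: an unordered set of words, so word order and roles are lost."""
    return frozenset(re.findall(r"[a-z]+", sentence.lower()))


# Toy sentence pattern (an assumption for illustration): "<solution> used in <problem>".
USED_IN = re.compile(
    r"^(?P<solution>.+?)\s+used in\s+(?P<problem>.+?)\.?$", re.IGNORECASE
)


def extract_roles(sentence):
    """Sentence view: keep the phrase on each side of 'used in' and label its role."""
    match = USED_IN.match(sentence.strip())
    if match is None:
        return None
    return {"Solution": match.group("solution"), "Problem": match.group("problem")}


a = "Neural network used in handwriting recognition"
b = "Handwriting recognition used in neural network"  # contrived reversal

same_keywords = keyword_bag(a) == keyword_bag(b)
roles_a = extract_roles(a)
roles_b = extract_roles(b)
```

Both sentences produce the same keyword set, yet the extracted roles are swapped, which is the sense in which meaning is lost in one view and maintained in the other.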

Page 26: SLA Summer 2008

Natural Language Parsing

Help_patterns, Succeed2, Correct_problem, treat, Person_SAVS, positively_influence, have_positive_influence, protect_sb_against_sth, Product_would_do_good, provide_sb_with_sth, Product_is_shown_to, talented_at, use_sth_to_do_sth, approve_sth, rely_on_product_to, application_is, Product_allows_sb_to, VG2, ensure_protagonist, A_makes_B_good, benefit_of

...

Thousands of rules
Plus statistical models

illumin8 Rules: Grammatical Role, Role Test, Role Assignment

Example sentence: Capacitive deionization with carbon aerogel electrodes provides an economical and efficient method for removing salt and impurities from water.

Subject: Capacitive deionization [Solution]

Verb: provides

Object: an economical and efficient method for removing salt and impurities from water [Benefit]

Role tests (the rule continues only when the answer is "no"; "yes" means the rule does not match):

• Modal? Check that Verb polarity is positive; this rule would not match if the Verb were modal (i.e. only in certain cases), for example if it said “should provide … but”

• Negated? Check that Subject is not negated; this rule would not match if Subject were not positive, for example if it said “no process provides an economical and efficient …”

• Antagonistic? Check that Object is not antagonistic; this rule would not match if Object were, for example “provides a costly and complicated method”
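
The three role tests on this slide (Modal? Negated? Antagonistic?) can be sketched as a single rule function in Python. This is a minimal sketch under loud assumptions: the sentence is taken to be already parsed into subject, verb group and object; the word lists are tiny hypothetical stand-ins for the "thousands of rules plus statistical models" mentioned above; and `Clause` and `apply_provides_rule` are invented names, not illumin8 internals.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical word lists, for illustration only.
MODALS = {"should", "could", "might", "may", "would"}
NEGATORS = {"no", "not", "never", "none"}
ANTAGONISTIC = {"costly", "complicated", "expensive", "inefficient"}


@dataclass
class Clause:
    """A sentence assumed to be already parsed into grammatical roles."""
    subject: str
    verb: str  # the verb group, e.g. "provides" or "should provide"
    obj: str


def _words(text: str) -> set:
    return set(text.lower().split())


def apply_provides_rule(clause: Clause) -> Optional[Dict[str, str]]:
    """Assign Solution/Benefit roles only if all three role tests pass."""
    verb_words = clause.verb.lower().split()
    if not verb_words or verb_words[-1] not in {"provide", "provides"}:
        return None  # this rule only covers the verb "provide"
    if MODALS & set(verb_words):
        return None  # Modal? yes: rule does not match
    if NEGATORS & _words(clause.subject):
        return None  # Negated? yes: rule does not match
    if ANTAGONISTIC & _words(clause.obj):
        return None  # Antagonistic? yes: rule does not match
    return {"Solution": clause.subject, "Benefit": clause.obj}


match = apply_provides_rule(Clause(
    "Capacitive deionization with carbon aerogel electrodes",
    "provides",
    "an economical and efficient method for removing salt and impurities from water",
))
modal = apply_provides_rule(Clause("Capacitive deionization", "should provide", "an economical method"))
negated = apply_provides_rule(Clause("No process", "provides", "an economical and efficient method"))
antagonistic = apply_provides_rule(Clause("Capacitive deionization", "provides", "a costly and complicated method"))
```

Only the first clause yields a role assignment; each of the other three fails exactly one of the tests and returns `None`.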

Page 27: SLA Summer 2008

Analyzing A Sentence

Carrier’s Infinity™ Air Purifier uses ultraviolet light to eliminate germs such as viruses, molds, bacteria, mildew and mold spores from the indoor air of homes and offices, ensuring a higher indoor air quality.

Carrier [Organization]

Infinity Air Purifier [Product]

Ultraviolet light [Technology]

Germ [Problem]

Indoor air quality [Benefit]

Virus, Mold, Bacteria, Mildew, Mold spore

Relations: Makes, Uses, Solves, Provides, Kind of

Concepts, ideas and entities extracted from a single sentence.
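
The entities and relations extracted from the sentence above can be pictured as typed nodes plus subject-relation-object triples. The Python sketch below is one possible representation, not the system's actual data model. The entity types and relation names come from the slide, but which entity each relation connects is an assumption read off the example sentence, and `Triple`, `objects_of` and `subjects_of` are hypothetical names.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# Entity types as labelled on the slide.
ENTITY_TYPES = {
    "Carrier": "Organization",
    "Infinity Air Purifier": "Product",
    "Ultraviolet light": "Technology",
    "Germ": "Problem",
    "Indoor air quality": "Benefit",
}

# Assumed edges: the slide lists the relation names, not their endpoints.
TRIPLES = [
    Triple("Carrier", "Makes", "Infinity Air Purifier"),
    Triple("Infinity Air Purifier", "Uses", "Ultraviolet light"),
    Triple("Infinity Air Purifier", "Solves", "Germ"),
    Triple("Infinity Air Purifier", "Provides", "Indoor air quality"),
    Triple("Virus", "Kind of", "Germ"),
    Triple("Mold", "Kind of", "Germ"),
    Triple("Bacteria", "Kind of", "Germ"),
    Triple("Mildew", "Kind of", "Germ"),
    Triple("Mold spore", "Kind of", "Germ"),
]


def objects_of(subject: str, relation: str) -> List[str]:
    """Everything the subject is linked to by the given relation."""
    return [t.obj for t in TRIPLES if t.subject == subject and t.relation == relation]


def subjects_of(relation: str, obj: str) -> List[str]:
    """Everything linked to the object by the given relation."""
    return [t.subject for t in TRIPLES if t.relation == relation and t.obj == obj]


made_by_carrier = objects_of("Carrier", "Makes")
kinds_of_germ = subjects_of("Kind of", "Germ")
```

With this shape, a question such as "which products solve the Germ problem?" becomes a lookup over triples rather than a keyword match over text.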

Page 28: SLA Summer 2008

DEMO