II-SDV 2016: Aleksandar Kapisoda, Klaus Kater - Deep Web Search

Boehringer Ingelheim Pharma GmbH & Co. KG Research Networking - Aleksandar Kapisoda Deep Web Search Deep SEARCH 9 GmbH Klaus Kater

Upload: dr-haxel-congress-and-event-management-gmbh

Post on 16-Jan-2017

Boehringer Ingelheim Pharma GmbH & Co. KG

Research Networking - Aleksandar Kapisoda

Deep Web Search

Deep SEARCH 9 GmbH

Klaus Kater

Content

1. Intro

2. Search Approach

• Public Search Approach

• SEARCHCORPUS® Approach

3. Use Cases

• SEARCHCORPUS® for life science startups: We find startup information we could not find in public search engines.

• Life science news SEARCHCORPUS®: Hundreds of incoming mails and alerts are processed every day, and the websites and articles behind the news tags are crawled automatically.

4. Technical Features

5. Outlook

1. Intro

Deep (Web) Search

2015

(Deep Web) Search

We showed that we can crawl and find content that public search engines do not find.

What we did in 2015…

Timeline: 2014 … 2015 … 2016

During the year we established our internal processes to build targeted SEARCHCORPORA. We built solutions and rolled them out.

And we found more than we bargained for.

2016

Deep (Web Search)

This year we will talk about a misconception we were confronted with when comparing our SEARCHCORPUS®-based search results with search results from public search engines.

2. The Public Search Approach

Public Search Misconception

Clashing with Incomplete Search Results

Let's make up a "Weißwurst Misconception" to make it easier to understand the "Public Search Misconception".

Anybody understands that Weißwurst without Weißwurst mustard is like Fish 'n' Chips without chips.

Web search is like trying to find Weißwurst mustard in a convenience store 1)

You will find loads of local and not so local mustards.

But if Weißwurst mustard is located in the specialities section, you will only find it by chance or not at all…

1) Not a Bavarian convenience store.

No Weißwurst mustard!

So you may believe that the store does not carry Weißwurst mustard at all.

There are two common misperceptions that researchers using public search fall into:

• If a search has results, we believe that these results are complete.

• If a search doesn't have results, we believe there is nothing to be found.

Both perceptions are wrong. Together they represent the Public Search Misconception: we believe that there is nothing more to be found, even though the information may be available.

We just don't know where it is, and we need the right tools to find it.

"This store doesn't have Weißwurst mustard…"

Why Results Are Missed

An explanation of why results are missed:

Assume we want to monitor startup activities in the area of CRISPR being used in the fight against type 1 diabetes:

+CRISPR +diabetes type 1

To avoid being overloaded with biotechnology research papers, we try to tell the search engine that we are interested in startups:

+CRISPR +diabetes type 1 +startup

Only documents in which all terms match are returned. These documents are actually about startups.

But only if the startups were mentioned in some press release or report.
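The effect can be illustrated with a minimal sketch of all-terms (AND) matching. This is not how any particular public search engine is implemented; the documents, the company name and the helper functions `tokenize` and `matches_all` are invented for illustration only.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches_all(required_terms, text):
    """Return True only if every required term occurs in the text (AND semantics)."""
    tokens = tokenize(text)
    return all(term.lower() in tokens for term in required_terms)

# Hypothetical documents, invented for this example.
documents = {
    "press_release": "Startup Acme Bio uses CRISPR to tackle diabetes type 1.",
    "company_site": "We edit beta cells with CRISPR to treat diabetes type 1.",
    "research_paper": "CRISPR screens in diabetes type 1 mouse models.",
}

# Simplified stand-in for the query +CRISPR +diabetes type 1 +startup.
query = ["CRISPR", "diabetes", "startup"]
hits = [name for name, text in documents.items() if matches_all(query, text)]
print(hits)
```

In this toy data set the company's own website is relevant but is not returned, because it never uses the word "startup". Only the press release that happens to mention the term is found.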

2. SEARCHCORPUS® Approach

Documents are set into context already when the SEARCHCORPUS® is being built.
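A minimal sketch of this idea, assuming that "context" can be modelled as a label attached to each document when the corpus is built. The class `CorpusDocument`, the functions `build_corpus` and `search`, and the example URLs are hypothetical and do not describe the actual SEARCHCORPUS® implementation.

```python
from dataclasses import dataclass, field

@dataclass
class CorpusDocument:
    url: str
    text: str
    context: set = field(default_factory=set)  # labels assigned at build time

def build_corpus(crawled_pages, context_label):
    """Attach the context label to every page while the corpus is being built."""
    return [CorpusDocument(url, text, {context_label}) for url, text in crawled_pages]

def search(corpus, terms, context_label):
    """Match topic terms only; the context comes from the corpus, not from the text."""
    lowered = [term.lower() for term in terms]
    return [
        doc.url
        for doc in corpus
        if context_label in doc.context
        and all(term in doc.text.lower() for term in lowered)
    ]

# Hypothetical pages crawled from a curated list of startup websites.
startup_pages = [
    ("https://acme-bio.example/science", "We edit beta cells with CRISPR to treat diabetes type 1."),
    ("https://other-co.example/about", "We build wearable sensors for athletes."),
]
startup_corpus = build_corpus(startup_pages, "startup")
results = search(startup_corpus, ["CRISPR", "diabetes type 1"], "startup")
print(results)
```

Because every page in the corpus is already known to belong to a startup, the query needs only the scientific terms, and the page is found even though the word "startup" never appears in its text.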

3. Use Cases

Use Case 1: SEARCHCORPUS® for Life Science Startups

Situation:

Researchers manually search for startup activities and companies that are active in specific areas of interest. These interests change frequently.

Problem:

Searching for startups by scientific topics generates an enormous amount of noise that needs to be filtered manually.

Approach:

Implementation of a startup SEARCHCORPUS® spanning global startup companies.

Status:

Existing startup SEARCHCORPUS® for targeted search

Google SEARCH results

The startup that was found in the SEARCHCORPUS®

Proximity search
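The slide only names proximity search. As a rough illustration of the general technique, and not of the ds9 implementation, the sketch below checks whether two terms occur within a given number of tokens of each other. The function `within_proximity` and the sample sentence are assumptions made for this example.

```python
import re

def within_proximity(text, term_a, term_b, max_distance):
    """Return True if term_a and term_b occur at most max_distance tokens apart."""
    tokens = re.findall(r"\w+", text.lower())
    positions_a = [i for i, token in enumerate(tokens) if token == term_a.lower()]
    positions_b = [i for i, token in enumerate(tokens) if token == term_b.lower()]
    return any(abs(a - b) <= max_distance for a in positions_a for b in positions_b)

sample = "We use CRISPR gene editing to treat diabetes in young patients."
print(within_proximity(sample, "CRISPR", "diabetes", 5))
print(within_proximity(sample, "CRISPR", "diabetes", 3))
```

A small distance favours documents in which the two concepts are discussed together rather than merely mentioned somewhere on the same page.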

Use Case 2: Life Science News SEARCHCORPUS®

Situation:

Researchers are manually filtering hundreds of websites, emails and news feeds.

• News that is not screened immediately is lost

Approach:

A targeted news SEARCHCORPUS® using periodic targeted crawling and extraction of news from sources used by Boehringer Ingelheim scientists.

1. A tracker is made available to researchers on the corporate intranet

2. A news archive with faceted search using ontology-based query term expansion

3. Search-profile-based email alerting whenever matching news is crawled
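Points 2 and 3 can be sketched together: query terms are expanded with related terms from an ontology, and each researcher's search profile is checked against newly crawled news. The mini-ontology, the profiles and the function names below are hypothetical; a real system would read the expansions from an RDF store and use proper tokenisation instead of the crude substring matching shown here.

```python
# Hypothetical mini-ontology: preferred term -> synonyms or narrower terms.
ONTOLOGY = {
    "diabetes": ["diabetes mellitus", "t1d", "t2d"],
    "crispr": ["cas9", "gene editing"],
}

def expand_terms(terms, ontology):
    """Return each term followed by its ontology expansions, without duplicates."""
    expanded = []
    for term in terms:
        key = term.lower()
        for candidate in [key] + ontology.get(key, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

def matching_profiles(news_text, profiles, ontology):
    """Return the recipients whose profile terms all match, directly or via expansion."""
    text = news_text.lower()
    recipients = []
    for recipient, terms in profiles.items():
        if all(
            any(variant in text for variant in expand_terms([term], ontology))
            for term in terms
        ):
            recipients.append(recipient)
    return recipients

# Hypothetical search profiles and news item.
profiles = {"alice": ["CRISPR", "diabetes"], "bob": ["oncology"]}
news = "New Cas9 therapy shows promise in T1D patients."
to_alert = matching_profiles(news, profiles, ONTOLOGY)
print(to_alert)
```

The news item contains neither "CRISPR" nor "diabetes" literally, yet the first profile matches through the expanded terms, which is the point of ontology-based expansion. Sending the actual email is left out of the sketch.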

Status:

Existing news SEARCHCORPUS® for targeted search

• The viewer is updated by the minute; targets can be crawled as frequently as every 10 seconds.

• Crawling frequency and crawling schedule are defined per target.
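One way to picture per-target crawl frequencies is a priority queue keyed by the next due time. The sketch below simulates time instead of sleeping and only computes the order in which crawls would fire. The target names, the intervals and the function `crawl_order` are assumptions for illustration and do not reflect the ds9 scheduler.

```python
import heapq

def crawl_order(targets, horizon_seconds):
    """Return (due_time, target) pairs in firing order, up to the given horizon.

    targets maps a target name to its crawl interval in seconds (must be > 0).
    """
    queue = [(0, name) for name in sorted(targets)]  # every target is due at t=0
    heapq.heapify(queue)
    schedule = []
    while queue:
        due, name = heapq.heappop(queue)
        if due > horizon_seconds:
            continue  # past the horizon: drop it and do not reschedule
        schedule.append((due, name))
        heapq.heappush(queue, (due + targets[name], name))
    return schedule

# Hypothetical targets: one crawled every 10 seconds, one every 30 seconds.
targets = {"fast-news.example": 10, "slow-blog.example": 30}
schedule = crawl_order(targets, horizon_seconds=30)
for due, name in schedule:
    print(due, name)
```

Within the 30-second horizon the fast target fires four times (at 0, 10, 20 and 30 seconds) and the slow one twice (at 0 and 30 seconds). A real scheduler would wait until each due time and run the crawls concurrently.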

4. Technical Features

Software:

Deep SEARCH 9 platform for advanced web analytics:

• Concurrent targeted crawling

• Content extraction

• Document caching

• Content annotation (RDF based and via APIs, e.g. Luxid)

• Scheduler for periodic jobs

• Integration of ds9 search and visualization in BI Intranet through API

• News tracker GUI for real-time news monitoring

• Faceted search GUI with RDF based query term expansion

Hardware:

3-server cluster running ds9, a JDBC database, an RDF triple store and Elasticsearch. Currently 90 TB disk space.

5. Outlook

• SEARCHCORPORA®

• Setup of more comprehensive SEARCHCORPORA® (startup, news)

• Extending targeted SEARCHCORPORA® (Life Science domain)

• More viewers for data visualisation (results)

• Communication with other third party software via API / webservice

• Integration of Semantic Web Technologies

• Terminology

• RDF import/export

Contact Information

Aleksandar Kapisoda, Research Networking

[email protected]

Klaus Kater

[email protected]

Questions?

Thank You