Anna Divoli (Pingar Research): Extracting and Mapping SharePoint Content to Create a Custom Taxonomy

DESCRIPTION

Anna Divoli (Pingar Research): Extracting and Mapping SharePoint Content to Create a Custom Taxonomy. Pingar presentation at ShareFEST in Philadelphia (Apr 2013).

TRANSCRIPT

Extracting and Mapping SharePoint Content to Create a Custom Taxonomy

Anna Divoli, Pingar Research

@annadivoli

Why?

Why Automatic Generation?
• Dynamic
• Fast
• Cheap
• Consistent
• RDF / Flexible…

Why from a Document Collection?
• Focused/specific
• Optimal for those documents…

Why Taxonomies?
• Organize knowledge
• Domain representation
• Enable automatic tasks…

Why in SharePoint?
• All you need is there!
• Can be used straight away!

Talk Overview

• The Team
• The Process
• Evaluation
• Use Cases
  – Withdrawn drug
  – Cancer treatments
  – Re-purposed drug
• Summary

Taxonomy Generation Research Team

Olena Medelyan, Steve Manion, Jeen Broekstra, Anna Divoli, Anna Lan Huang and Ian Witten. Constructing a Focused Taxonomy from a Document Collection. ESWC 2013, Montpellier, France.

Taxonomy Generation Process

Input: Documents stored somewhere

Analysis: Using a variety of tools* and datasets, extract concepts, entities, relations

Grouping & Output: A taxonomy is created that groups the resulting taxonomy terms hierarchically

Custom Taxonomy

How Taxonomy Generation Works: In 5 Steps!

Input: Docs; preferred top-level terms

1. Import & convert to text
2. Extract concepts
3. Annotate with Linked Data
4. Disambiguate clashing concepts
5. Consolidate taxonomy

Stores: Document Database (Solr); Concepts & Relations Database (Sesame)

Output: Focused SKOS Taxonomy

Step 1. Document input & conversion

Diagram: Input Documents → 1. Convert to text → Document Database

Current input:
• Directory path read recursively

Other possible inputs:
• Docs in a database or a DMS
• Emails + attachments (Exchange)
• Website URL
• RSS feed

External tool to convert different file formats to text

Database to store document content
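
The current input mode, a directory path read recursively, can be sketched in Python roughly as follows. This is a minimal illustration and not the code used in the talk: the function name `collect_documents` and the `suffixes` parameter are hypothetical, and the sketch only reads files that are already plain text, leaving format conversion to the external tool mentioned on the slide.

```python
from pathlib import Path


def collect_documents(root, suffixes=(".txt",)):
    """Walk `root` recursively and return {relative_path: text}.

    Hypothetical helper: only files whose suffix is in `suffixes` are read;
    other formats are assumed to be converted to text by an external tool.
    """
    root = Path(root)
    docs = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            key = path.relative_to(root).as_posix()
            docs[key] = path.read_text(encoding="utf-8", errors="replace")
    return docs
```

The returned mapping could then be loaded into whatever document database is in use.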

Step 2. Extracting concepts

Diagram: Documents Database → 2. Extract concepts → Concepts Database

http://localhost/solr/select?q=path:mycollection\\document456.txt

Pingar API:
• Taxonomy Terms: Climate and Weather, Leaders, Agreements
• People: Yvo de Boer, Maite Nkoana-Mashabane
• Organizations: Associated Press, South African Council of Churches
• Locations: South Africa

Wikify:
• Wikipedia Terms: South Africa, Yvo de Boer, U.N., Climate agreements, Associated Press

Specific terminology: green policies; climate diplomacy
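
The Solr lookup shown on this slide selects one document by its `path` field. A hedged sketch of how such a URL might be assembled in Python is given below. The helper names `escape_term` and `solr_select_url` are assumptions for illustration, the `path` field name is taken from the slide, and `wt=json` is added only to request a JSON response. Characters that are special in the Solr query syntax are escaped with a backslash, which is presumably why the slide's path shows a doubled backslash.

```python
from urllib.parse import urlencode

# Characters with special meaning in the Solr/Lucene query syntax.
# '&' and '|' are only special when doubled; escaping single ones is harmless.
SOLR_SPECIAL = set('\\+-!(){}[]^"~*?:/&|')


def escape_term(value):
    """Backslash-escape query syntax characters in a field value."""
    return "".join("\\" + ch if ch in SOLR_SPECIAL else ch for ch in value)


def solr_select_url(base_url, field, value):
    """Build a /select URL that matches `field` against an escaped `value`."""
    query = f"{field}:{escape_term(value)}"
    return f"{base_url}/select?" + urlencode({"q": query, "wt": "json"})


# The document path from the slide, containing a single backslash.
url = solr_select_url("http://localhost/solr", "path",
                      "mycollection\\document456.txt")
```

Nothing is sent over the network here; the sketch only builds the request URL, percent-encoded for use in an HTTP client.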

Step 3. Annotation with meaning

Diagram: Concepts Database → 3. Annotate with Linked Data → Annotations Database

mycollection/document456.txt

Pingar API:
• People: Yvo de Boer, Maite Nkoana-Mashabane
• Organizations: Associated Press, South African Council of Churches
• Locations: South Africa

Later this additional info will help create e-Discovery & semantic search solutions

Step 4. Discarding irrelevant meanings

Diagram: Concepts Database → 4. Disambiguate clashing concepts → Final Concepts Database

"Over the past three years, Apple has acquired three mapping companies"
• Candidates: wikipedia.org/wiki/Apple_Corps, freebase.com/view/en/apple_inc
• Two concepts were extracted that are dissimilar: discard the incorrect one

"For millions of years, the oceans have been filled with sounds from natural sources."
• Candidates: wikipedia.org/wiki/Ocean, www.fao.org/aos/agrovoc#c_4607 (Agrovoc term: Marine areas)
• Two concepts were extracted that are similar: accept both as correct
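
The rule on this slide (keep both candidates when they are similar, otherwise discard the incorrect one) can be illustrated with a toy Python sketch. The slides do not say how similarity is measured, so the word-overlap (Jaccard) score, the `threshold` value, the function names, and the one-line concept descriptions below are all assumptions made up for this example, not Pingar's method or real Linked Data content.

```python
def tokens(text):
    """Lower-cased words with surrounding commas and periods removed."""
    return [w.strip(".,").lower() for w in text.split() if w.strip(".,")]


def jaccard(a, b):
    """Word-overlap similarity between two token lists (0.0 to 1.0)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def disambiguate(context, candidates, threshold=0.2):
    """candidates: {uri: description} with exactly two entries (toy setup)."""
    (uri_a, desc_a), (uri_b, desc_b) = candidates.items()
    # Similar candidates: accept both.
    if jaccard(tokens(desc_a), tokens(desc_b)) >= threshold:
        return [uri_a, uri_b]
    # Dissimilar candidates: keep the one closer to the sentence context.
    ctx = tokens(context)
    score_a = jaccard(ctx, tokens(desc_a))
    score_b = jaccard(ctx, tokens(desc_b))
    return [uri_a] if score_a >= score_b else [uri_b]


# Sentences and URIs are from the slide; descriptions are hypothetical.
apple_context = "Over the past three years, Apple has acquired three mapping companies"
apple_candidates = {
    "wikipedia.org/wiki/Apple_Corps": "multimedia company founded by the Beatles",
    "freebase.com/view/en/apple_inc": "technology company that has acquired mapping companies",
}
ocean_context = "For millions of years, the oceans have been filled with sounds from natural sources."
ocean_candidates = {
    "wikipedia.org/wiki/Ocean": "large body of salt water marine area",
    "www.fao.org/aos/agrovoc#c_4607": "marine area of salt water",
}

apple_result = disambiguate(apple_context, apple_candidates)
ocean_result = disambiguate(ocean_context, ocean_candidates)
```

With these toy descriptions, the Apple sentence keeps only the `apple_inc` candidate, while both ocean candidates are accepted.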

Step 5. Group taxonomy (a)

5a. Add relations

Concepts & Relations Database

Building the taxonomy bottom up

Figure: example hierarchy under "animals", with the terms felines, tiger, bird, horse family, zebra, donkey, pigeon, horse and lizard, and the categories Category:Carnivorous animals and Category:Animals.

Broader: Squamata/Reptiles/Tetrapods/Vertebrates/Chordates/Animals

Focused SKOS Taxonomy
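
The `Broader:` path on this slide suggests how each extracted term is linked upward until it reaches a top-level term. A minimal Python sketch of that bottom-up step follows. The function names are hypothetical, the path for `lizard` is the one from the slide (with the order name spelled Squamata), and the path for `pigeon` is invented purely to show two chains merging at shared ancestors.

```python
def broader_edges(concept, broader_path):
    """Turn 'A/B/C' into (narrower, broader) pairs starting at `concept`."""
    chain = [concept] + broader_path.split("/")
    return list(zip(chain, chain[1:]))


def build_taxonomy(paths):
    """Merge broader chains into {narrower: set of broader concepts}."""
    broader = {}
    for concept, path in paths.items():
        for child, parent in broader_edges(concept, path):
            broader.setdefault(child, set()).add(parent)
    return broader


paths = {
    # From the slide.
    "lizard": "Squamata/Reptiles/Tetrapods/Vertebrates/Chordates/Animals",
    # Hypothetical second chain, for illustration only.
    "pigeon": "Birds/Vertebrates/Chordates/Animals",
}
taxonomy = build_taxonomy(paths)
```

Each pair corresponds to what SKOS expresses with `skos:broader`. Shared ancestors such as Vertebrates appear once, so separate chains grow into a single hierarchy.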

Step 5. Consolidating taxonomy (b)

5b. Prune relations

Concepts & Relations Database

Figure: example branches, with terms as they appear on the slide:
• Films and film making, Film stars, Mila Kunis, Daniel Radcliffe, Sally Hawkins, Julianna Margulies
• Association football clubs, Former Football League clubs, Manchester United F.C., Manchester United F.C., Manchester City F.C.
• Finance, Economics and finance, Personal finance, Commercial finance, Tax
• Capital gains tax, Tax, Capital gains tax

Focused SKOS Taxonomy

Evaluation

Recall: 75% (compared with a manually generated taxonomy for the same domain)

Precision: 89% for concepts, 90% for relations (evaluation based on 15 human judges)
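
For readers unfamiliar with the two measures, the sketch below shows one common way to compute them in Python: recall as the share of a reference (manually built) taxonomy's concepts that the generated taxonomy also contains, and precision as the share of generated items that judges marked correct. The function names and the toy inputs are hypothetical, and the slides do not describe how the judges' votes were aggregated.

```python
def recall(generated, reference):
    """Fraction of reference concepts that also appear in `generated`."""
    generated, reference = set(generated), set(reference)
    return len(generated & reference) / len(reference)


def precision(judgements):
    """Fraction of generated items judged correct (True)."""
    judgements = list(judgements)
    return sum(judgements) / len(judgements)


# Toy data, not the evaluation data from the talk.
toy_recall = recall(["a", "b", "c", "x"], ["a", "b", "c", "d"])
toy_precision = precision([True] * 9 + [False])
```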

SharePoint Taxonomy Generation Process

Analysis: Using a variety of tools* and datasets, extract concepts, entities, relations

Custom Taxonomy

Triazolam

[A benzodiazepine drug used for short-term treatment of acute insomnia. Withdrawn in 1991 in the UK because of risk of psychiatric adverse drug reactions. It continues to be available in the U.S.]

Excerpt of the taxonomy generated from:
- 131 PubMed abstracts of clinical trials on triazolam before 1991
- 180 PubMed abstracts of clinical trials on triazolam since 1991

Colors of terms:
- proposed to group other terms
- found in both document collections
- in before withdrawal docs
- in since withdrawal docs

Taxonomy Statistics

Concept Count: 305
Edges Count: 437
Intermediate Count: 97
Leaves Count: 183
Labels Count: 353

Nesting Counts

0: 25, 1: 51, 2: 124, 3: 160, 4: 176, 5: 153, 6: 54, 7: 4

Average Depth: 3.6
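
The counts on this slide fit together: the 25 concepts at nesting level 0, the 97 intermediate concepts and the 183 leaves add up to the 305 concepts reported. A small Python sketch of how such counts could be derived from a list of (narrower, broader) edges is shown below. The function name and the toy edge list are hypothetical, and the slides do not state how the reported statistics were actually computed.

```python
def taxonomy_stats(edges):
    """Count concepts by role from (narrower, broader) pairs.

    Top-level concepts have no broader concept, leaves have no narrower
    concept, and intermediate concepts have both. Concepts that appear in
    no edge at all are not counted by this sketch.
    """
    edges = set(edges)
    narrower = {child for child, _ in edges}
    broader = {parent for _, parent in edges}
    return {
        "concepts": len(narrower | broader),
        "edges": len(edges),
        "top_level": len(broader - narrower),
        "intermediate": len(narrower & broader),
        "leaves": len(narrower - broader),
    }


# Toy edges for illustration only.
stats = taxonomy_stats([
    ("lizard", "Squamata"),
    ("Squamata", "Animals"),
    ("pigeon", "bird"),
    ("bird", "Animals"),
])
```

Because a concept may have more than one broader concept, the edge count can exceed the concept count, as it does in the reported figures.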

Cancer Treatments

Excerpt of the taxonomy generated from:
- 200 PubMed abstracts on breast cancer treatments
- 149 (all) PubMed abstracts on lung cancer treatments
- 47 (all) PubMed abstracts on gastric cancer treatments

Colors of terms:
- proposed to group other terms
- found in two or more document collections
- in the breast treatment docs
- in the stomach treatment docs
- in the lung treatment docs

Taxonomy Statistics

Concept Count: 308
Edges Count: 387
Intermediate Count: 90
Leaves Count: 195
Labels Count: 371

Nesting Counts

0: 23, 1: 52, 2: 99, 3: 138, 4: 137, 5: 159, 6: 60, 7: 36, 8: 6

Average Depth: 3.88

Tamoxifen

Tamoxifen is a drug commonly used to treat breast cancer but with a subsequent indication for treating bipolar disorder.

Excerpt of the taxonomy generated from:
- papers discussing tamoxifen and bipolar disorder: 8 PubMed abstracts AND 2 PDFs of full papers (17641532, 18316672)
- papers discussing tamoxifen and breast cancer: 50 PubMed abstracts AND 2 PDFs of full papers (21635709, 12618491)
- papers discussing tamoxifen with no mention of either breast cancer or bipolar disorder: 50 PubMed abstracts AND 2 PDFs of full papers (16275887, 19458291)

Colors of terms:
- proposed to group other concepts
- in two or more document collections
- in the bipolar document collection
- in the breast cancer document collection
- in the document collection mentioning neither cancer nor bipolar disorder

Taxonomy Statistics

Concept Count: 587
Edges Count: 751
Intermediate Count: 188
Leaves Count: 365
Labels Count: 718

Nesting Counts

0: 34, 1: 73, 2: 133, 3: 284, 4: 225, 5: 157, 6: 89, 7: 30, 8: 2

Average Depth: 3.66

Summary

Entity Extraction

Linked Data

Disambiguation

Consolidation

Case Studies

More?

bit.ly/f-step

pingar.com

@PingarHQ

anna.divoli@pingar.com

@annadivoli

Focused SKOS Taxonomy Extraction Process (F-STEP) wiki
