Extracting and Mapping SharePoint Content to Create a Custom Taxonomy
Anna Divoli, Pingar Research
@annadivoli
Why?
Why Automatic Generation?
- Dynamic
- Fast
- Cheap
- Consistent
- RDF / Flexible…
Why from a Document Collection?
- Focused/specific
- Optimal for those documents…
Why Taxonomies?
- Organize knowledge
- Domain representation
- Enable automatic tasks…
Why in SharePoint?
- All you need is there!
- Can be used straight away!
Talk Overview
The Team
The Process
Evaluation
Use Cases
– Withdrawn drug
– Cancer treatments
– Re-purposed drug
Summary
Taxonomy Generation Research Team
Olena Medelyan, Steve Manion, Jeen Broekstra, Anna Divoli, Anna Lan Huang and Ian Witten
Constructing a Focused Taxonomy from a Document Collection
ESWC 2013, Montpellier, France
Taxonomy Generation Process
Input: Documents stored somewhere
Analysis: Using a variety of tools* and datasets, extract concepts, entities, relations
Grouping & Output: A taxonomy is created that groups the resulting taxonomy terms hierarchically
Custom Taxonomy
How Taxonomy Generation works
In 5 Steps!
Input: Docs, preferred top-level terms
1. Import & convert to text
2. Extract concepts
3. Annotate with Linked Data
4. Disambiguate clashing concepts
5. Consolidate taxonomy
Document Database: Solr
Concepts & Relations Database: Sesame
Output: Focused SKOS Taxonomy
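To make the "Focused SKOS Taxonomy" output more concrete, here is a minimal sketch of what a SKOS serialization could look like. The `to_turtle` helper, the `ex:` prefix and the `http://example.org/taxonomy/` base are hypothetical names chosen for illustration; the slides do not show how the pipeline actually writes its output.

```python
# Minimal sketch: serialize a tiny concept hierarchy as SKOS in Turtle syntax.
# Assumption: concept ids are already safe to use as Turtle local names.
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"


def to_turtle(concepts, base="http://example.org/taxonomy/"):
    """Serialize {concept_id: {"label": str, "broader": [ids]}} as Turtle text."""
    lines = [f"@prefix skos: <{SKOS_NS}> .", f"@prefix ex: <{base}> .", ""]
    for cid, data in sorted(concepts.items()):
        # Escape backslashes and double quotes inside the literal.
        label = data["label"].replace("\\", "\\\\").replace('"', '\\"')
        parts = ["a skos:Concept", f'skos:prefLabel "{label}"@en']
        parts += [f"skos:broader ex:{b}" for b in data.get("broader", [])]
        lines.append(f"ex:{cid} " + " ;\n    ".join(parts) + " .")
    return "\n".join(lines)


example = {
    "animals": {"label": "animals"},
    "lizard": {"label": "lizard", "broader": ["animals"]},
}
turtle_text = to_turtle(example)
print(turtle_text)
```

Each concept becomes a `skos:Concept` with a preferred label, and every parent link becomes a `skos:broader` statement, which is the relation a hierarchical taxonomy is built from.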
Step 1. Document input & conversion
Input Documents
1. Convert to text
Document Database
Current input:
• Directory path read recursively
Other possible inputs:
• Docs in a database or a DMS
• Emails + attachments (Exchange)
• Website URL
• RSS feed
External tool to convert different file formats to text
Database to store document content
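The "directory path read recursively" input can be sketched as follows. This is an illustration only: `load_documents` is a hypothetical helper, it handles plain-text files and skips everything else (the external format converter is not modeled), and an in-memory dict stands in for the document database.

```python
# Minimal sketch: read a directory tree recursively and collect text content.
# Assumptions: only .txt files are handled; a dict replaces the real database.
import tempfile
from pathlib import Path


def load_documents(root, suffixes=(".txt",)):
    """Walk `root` recursively and return {relative_path: text}."""
    root = Path(root)
    docs = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            key = path.relative_to(root).as_posix()
            docs[key] = path.read_text(encoding="utf-8", errors="replace")
    return docs


# Small demo on a temporary directory with one text file and one binary file.
with tempfile.TemporaryDirectory() as tmp:
    collection = Path(tmp) / "mycollection"
    collection.mkdir()
    (collection / "document456.txt").write_text("climate talks", encoding="utf-8")
    (Path(tmp) / "notes.bin").write_bytes(b"\x00\x01")
    docs = load_documents(tmp)

print(docs)
```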
Step 2. Extracting concepts
Documents Database
Concepts Database
2. Extract concepts
http://localhost/solr/select?q=path:mycollection\\document456.txt
Pingar API:
- Taxonomy Terms: Climate and Weather; Leaders; Agreements
- People: Yvo de Boer; Maite Nkoana-Mashabane
- Organizations: Associated Press; South African Council of Churches
- Locations: South Africa
Wikify:
- Wikipedia Terms: South Africa; Yvo de Boer; U.N.; Climate agreements; Associated Press
Specific terminology: green policies; climate diplomacy
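The Solr lookup shown on this slide can be built programmatically. The sketch below assumes that the doubled backslash in the slide's URL is a query-syntax escape for a Windows-style path separator, and it reuses the `path` field and the `http://localhost/solr/select` endpoint from the slide. The helper names `escape_solr_term` and `select_url` are hypothetical.

```python
# Minimal sketch: build a Solr select URL that looks up one document by path.
# Assumption: the index has a field called "path", as suggested by the slide.
from urllib.parse import parse_qs, urlencode, urlsplit

# Characters with special meaning in the Lucene/Solr standard query syntax.
SOLR_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_solr_term(value):
    """Backslash-escape query-syntax characters in a literal term."""
    return "".join("\\" + ch if ch in SOLR_SPECIAL else ch for ch in value)


def select_url(path, base="http://localhost/solr/select"):
    """Return a URL that queries the `path` field for an exact document path."""
    query = "path:" + escape_solr_term(path)
    return base + "?" + urlencode({"q": query})


url = select_url("mycollection\\document456.txt")
decoded_query = parse_qs(urlsplit(url).query)["q"][0]
print(url)
print(decoded_query)
```

After URL decoding, the query reads `path:mycollection\\document456.txt`, matching the form shown on the slide.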
Step 3. Annotation with meaning
Annotations Database
3. Annotate with Linked Data
mycollection/document456.txt
Pingar API:
- People: Yvo de Boer; Maite Nkoana-Mashabane
- Organizations: Associated Press; South African Council of Churches
- Locations: South Africa
Later, this additional info will help create e-Discovery & semantic search solutions
Concepts Database
Step 4. Discarding irrelevant meanings
Final Concepts Database
4. Disambiguate clashing concepts
"Over the past three years, Apple has acquired three mapping companies"
• wikipedia.org/wiki/Apple_Corps
• freebase.com/view/en/apple_inc
Two concepts were extracted that are dissimilar: discard the incorrect one

"For millions of years, the oceans have been filled with sounds from natural sources."
• wikipedia.org/wiki/Ocean
• www.fao.org/aos/agrovoc#c_4607 (Agrovoc term: Marine areas)
Two concepts were extracted that are similar: accept both as correct
Concepts Database
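The accept-or-discard decision in Step 4 can be illustrated with a toy similarity measure. The slides do not say how similarity is computed, so the sketch below swaps in plain word-overlap (Jaccard) scoring. The concept descriptions are invented glosses, and `disambiguate`, `STOPWORDS` and the 0.3 threshold are hypothetical choices.

```python
# Minimal sketch: keep the candidate that best fits the context, plus any
# candidate similar enough to it; discard the rest.
# Assumption: Jaccard word overlap stands in for the real similarity measure.
import re

STOPWORDS = {"the", "and", "has", "have", "for", "with", "from", "that", "been", "such"}


def tokens(text):
    """Lower-cased words longer than two letters, minus a few stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 2} - STOPWORDS


def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def disambiguate(context, candidates, similar_threshold=0.3):
    """candidates maps URI -> description; returns the accepted URIs."""
    best = max(candidates, key=lambda uri: jaccard(context, candidates[uri]))
    return [
        uri for uri in candidates
        if uri == best or jaccard(candidates[uri], candidates[best]) >= similar_threshold
    ]


apple = disambiguate(
    "Over the past three years, Apple has acquired three mapping companies",
    {
        "wikipedia.org/wiki/Apple_Corps":
            "multimedia company founded by the Beatles to release music records",
        "freebase.com/view/en/apple_inc":
            "technology company that designs phones, computers and mapping software "
            "and has acquired many companies",
    },
)
ocean = disambiguate(
    "For millions of years, the oceans have been filled with sounds from natural sources.",
    {
        "wikipedia.org/wiki/Ocean":
            "large body of salt water covering the oceans and seas",
        "www.fao.org/aos/agrovoc#c_4607":
            "marine areas such as oceans and seas of salt water",
    },
)
print(apple)
print(ocean)
```

With these invented glosses, the two Apple candidates share almost no vocabulary, so only the better-fitting one survives. The two ocean candidates overlap heavily, so both are accepted.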
Step 5. Group taxonomy (a)
5a. Add relations
Concepts & Relations Database
felines tiger bird
horse family
zebra donkey pigeon horse lizard
Category:Carnivorous animals Category:Animals
animals
Building the taxonomy bottom up
Broader: Squamata/Reptiles/Tetrapods/Vertebrates/Chordates/Animals
Focused SKOS Taxonomy
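Building the taxonomy bottom up can be sketched by turning each concept's chain of broader terms into child-to-parent edges. The lizard chain below is the one on the slide. The tiger chain and the helper names `add_broader_chain` and `roots` are illustrative assumptions.

```python
# Minimal sketch: add (child, parent) edges by walking a chain of broader terms.
# Assumption: chains are listed from the nearest parent up to the most general.
def add_broader_chain(edges, concept, chain):
    """Link `concept` to each successive ancestor in a '/'-separated chain."""
    child = concept
    for parent in chain.split("/"):
        edges.add((child, parent))
        child = parent


def roots(edges):
    """Terms that appear as a parent but never as a child."""
    children = {c for c, _ in edges}
    return {p for _, p in edges} - children


edges = set()
add_broader_chain(
    edges, "lizard", "Squamata/Reptiles/Tetrapods/Vertebrates/Chordates/Animals"
)
add_broader_chain(edges, "tiger", "felines/Animals")  # illustrative chain
print(sorted(edges))
print(roots(edges))
```

Because both chains end in the same term, the two leaf concepts end up under a single shared root without any top-down design.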
Step 5. Consolidating taxonomy (b)
Films and film making Film stars Mila Kunis Daniel Radcliffe Sally Hawkins Julianna Margulies
Association football clubs Former Football League clubs Manchester United F.C. Manchester United F.C. Manchester City F.C.
Finance Economics and finance Personal finance Commercial finance Tax
Capital gains tax Tax Capital gains tax
5b. Prune relations
Concepts & Relations Database
Focused SKOS Taxonomy
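One plausible pruning heuristic is to drop a direct broader link when the same parent is already reachable through an intermediate term. The slides do not spell out the pruning rules, so this sketch is an assumption rather than the presented method. The labels are borrowed from the slide, the edge structure is invented, and the hierarchy is assumed to be acyclic.

```python
# Minimal sketch: remove redundant (child, parent) edges from an acyclic hierarchy.
# An edge is redundant if the parent can still be reached without using it.
def prune_redundant(edges):
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)

    def reachable(start, target, skip_edge):
        """Depth-first search from start to target, ignoring skip_edge."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in parents.get(node, ()):
                if (node, nxt) == skip_edge or nxt in seen:
                    continue
                if nxt == target:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return {(c, p) for c, p in edges if not reachable(c, p, (c, p))}


raw_edges = {
    ("Capital gains tax", "Tax"),
    ("Tax", "Finance"),
    ("Capital gains tax", "Finance"),  # redundant: already implied via "Tax"
}
pruned = prune_redundant(raw_edges)
print(sorted(pruned))
```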
Evaluation
Recall: 75% (compared with a manually generated taxonomy for the same domain)
Precision: 89% for concepts, 90% for relations (evaluation based on 15 human judges)
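For readers less familiar with the two measures, here is a minimal sketch of how they are computed. The concept sets and judgements below are toy data, not the data from the evaluation, and the function names are hypothetical.

```python
# Minimal sketch: recall against a reference taxonomy, precision from judgements.
# All data below is invented for illustration.
def recall(generated, reference):
    """Share of reference concepts that the generated taxonomy also contains."""
    return len(generated & reference) / len(reference)


def precision(judgements):
    """Share of generated items that human judges marked as correct."""
    return sum(judgements) / len(judgements)


reference = {"tax", "finance", "economics", "banking"}
generated = {"tax", "finance", "economics", "football"}
toy_recall = recall(generated, reference)

judgements = [True] * 9 + [False]  # nine of ten items judged correct
toy_precision = precision(judgements)

print(toy_recall, toy_precision)
```

Recall needs a gold-standard taxonomy to compare against, while precision here relies on people judging each generated concept or relation, which is why the slide reports a different basis for each figure.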
SharePoint Taxonomy Generation Process
Analysis: Using a variety of tools* and datasets, extract concepts, entities, relations
Custom Taxonomy
Triazolam
[A benzodiazepine drug used for short-term treatment of acute insomnia. Withdrawn in 1991 in the UK because of risk of psychiatric adverse drug reactions. It continues to be available in the U.S.]
Excerpt of the taxonomy generated from:
- 131 PubMed abstracts of clinical trials on triazolam before 1991
- 180 PubMed abstracts of clinical trials on triazolam since 1991
Colors of terms:
- proposed to group other terms
- found in both document collections
- in before withdrawal docs
- in since withdrawal docs
Taxonomy Statistics
Concept Count: 305
Edges Count: 437
Intermediate Count: 97
Leaves Count: 183
Labels Count: 353
Nesting Counts
0: 25, 1: 51, 2: 124, 3: 160, 4: 176, 5: 153, 6: 54, 7: 4
Average Depth: 3.6
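The nesting counts can be read as level: count pairs (0: 25, 1: 51, 2: 124, 3: 160, 4: 176, 5: 153, 6: 54, 7: 4). Under that reading, the reported average depth of 3.6 is reproduced when level 0 is left out of the denominator. That is an inference from the numbers, not something the slide states, and the `average_depth` helper below is a hypothetical name.

```python
# Minimal sketch: recompute the average depth from the nesting counts.
# Assumption: the average is taken over entries at level 1 and deeper.
nesting_counts = {0: 25, 1: 51, 2: 124, 3: 160, 4: 176, 5: 153, 6: 54, 7: 4}


def average_depth(counts, include_top_level=False):
    """Count-weighted mean of the nesting level."""
    weighted = sum(level * n for level, n in counts.items())
    total = sum(n for level, n in counts.items() if include_top_level or level > 0)
    return weighted / total


avg_without_top = average_depth(nesting_counts)
avg_with_top = average_depth(nesting_counts, include_top_level=True)
print(round(avg_without_top, 2), round(avg_with_top, 2))
```

The same reading is consistent with the other statistics on the slide: 25 top-level entries plus 97 intermediate and 183 leaf concepts add up to the concept count of 305.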
Cancer Treatments
Excerpt of the taxonomy generated from:
- 200 PubMed abstracts on breast cancer treatments
- 149 (all) PubMed abstracts on lung cancer treatments
- 47 (all) PubMed abstracts on gastric cancer treatments
Colors of terms:
- proposed to group other terms
- found in two or more document collections
- in the breast treatment docs
- in the stomach treatment docs
- in the lung treatment docs
Taxonomy Statistics
Concept Count: 308
Edges Count: 387
Intermediate Count: 90
Leaves Count: 195
Labels Count: 371
Nesting Counts
0: 23, 1: 52, 2: 99, 3: 138, 4: 137, 5: 159, 6: 60, 7: 36, 8: 6
Average Depth: 3.88
Tamoxifen
Tamoxifen is a drug commonly used to treat breast cancer but with a subsequent indication for treating bipolar disorder.
Excerpt of the taxonomy generated from:
- papers discussing tamoxifen and bipolar disorder: 8 PubMed abstracts and 2 PDFs of full papers (17641532, 18316672)
- papers discussing tamoxifen and breast cancer: 50 PubMed abstracts and 2 PDFs of full papers (21635709, 12618491)
- papers discussing tamoxifen with no mention of either breast cancer or bipolar disorder: 50 PubMed abstracts and 2 PDFs of full papers (16275887, 19458291)
Colors of terms:
- proposed to group other concepts
- in two or more document collections
- in the bipolar document collection
- in the breast cancer document collection
- in the neither cancer nor bipolar document collection
Taxonomy Statistics
Concept Count: 587
Edges Count: 751
Intermediate Count: 188
Leaves Count: 365
Labels Count: 718
Nesting Counts
0: 34, 1: 73, 2: 133, 3: 284, 4: 225, 5: 157, 6: 89, 7: 30, 8: 2
Average Depth: 3.66
Summary
Entity Extraction
Linked Data
Disambiguation
Consolidation
Case Studies
More?
bit.ly/f-step
pingar.com
@PingarHQ
@annadivoli
Focused SKOS Taxonomy Extraction Process (F-STEP) wiki