Knowledge extraction from the Encyclopedia of Life Using Python NLTK

Anne Thessen

Upload: anne-thessen

Post on 09-May-2015


DESCRIPTION

This presentation demonstrates the potential for NLTK to extract information about ecological species interactions from text in EOL. It was presented Nov 12, 2013 at the Startup Institute in Cambridge, MA for the Boston PyLadies monthly meeting.

TRANSCRIPT

Page 1: Knowledge extraction from the Encyclopedia of Life using Python NLTK

Knowledge extraction from the Encyclopedia of Life

Using Python NLTK

Anne Thessen

Page 2: Knowledge extraction from the Encyclopedia of Life using Python NLTK
Page 3: Knowledge extraction from the Encyclopedia of Life using Python NLTK
Page 4: Knowledge extraction from the Encyclopedia of Life using Python NLTK

Finding Taxonomic Names

Page 5: Knowledge extraction from the Encyclopedia of Life using Python NLTK

Challenges

Koko

Горилла

Guerilla

Eastern Lowland Gorilla

Gorilla graueri

Gorilla berengei

Gorilla beringei

Matschie

Gorilla beringei mikenensis

King kong

Gorilla gorilla

Virunga

Gorila

Gorille

Mountain gorilla

大猩猩

ゴリラ

Page 6: Knowledge extraction from the Encyclopedia of Life using Python NLTK

Challenges

Aotus: one genus name, two different organisms

Aotus Illiger 1811 (primate)
Species: Aotus trivirgatus, Aotus nancymaae
Contextual data: Primate, Monkey, Eyes, Food, Panama

Aotus Smith 1805 (legume)
Species: Aotus ericoides, Aotus mollis
Contextual data: Legume, Plant, Flower, Mirbelieae, Australia

Disambiguate by authority, species, contextual data
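
A minimal sketch of the first two cues on this slide, authority and species, assuming a small hand-built lookup table. The table `HOMONYMS`, the function `disambiguate`, and the group labels are hypothetical names for illustration and are not from the talk. Contextual data is not handled here.

```python
# Hypothetical lookup table built only from the names shown on the slide.
HOMONYMS = {
    "Aotus": [
        {"authority": "Illiger 1811", "group": "primate",
         "species": {"trivirgatus", "nancymaae"}},
        {"authority": "Smith 1805", "group": "legume",
         "species": {"ericoides", "mollis"}},
    ]
}


def disambiguate(name_string):
    """Return the group a name string belongs to, or None if it stays ambiguous."""
    parts = name_string.split()
    genus, rest = parts[0], " ".join(parts[1:])
    for candidate in HOMONYMS.get(genus, []):
        # Either the authority string or a known species epithet settles it.
        if rest == candidate["authority"] or rest in candidate["species"]:
            return candidate["group"]
    return None  # a bare genus name needs contextual data


monkey = disambiguate("Aotus Illiger 1811")
plant = disambiguate("Aotus mollis")
unknown = disambiguate("Aotus")
```

A bare "Aotus" returns None, which is the case where the surrounding words (Monkey, Panama, Legume, Australia) would have to decide.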

Page 7: Knowledge extraction from the Encyclopedia of Life using Python NLTK

GNRD

Beautiful Soup

Resolver

Page 8: Knowledge extraction from the Encyclopedia of Life using Python NLTK

• Common names
• Interaction type

Page 10: Knowledge extraction from the Encyclopedia of Life using Python NLTK

Python NLTK

• http://nltk.org/book/
• http://nltk.org/
• Install NLTK and NLTK Data

Page 12: Knowledge extraction from the Encyclopedia of Life using Python NLTK

• Natural Language Processing (NLP)

Page 13: Knowledge extraction from the Encyclopedia of Life using Python NLTK

• Natural Language Processing (NLP)
• Semantic Statistics

Robin is the name of several fictional characters appearing in comic books published by DC Comics, originally created by Bob Kane, Bill Finger and Jerry Robinson, as a junior counterpart to DC Comics superhero Batman. The team of Batman and Robin is commonly referred to as the Dynamic Duo or the Caped Crusaders.

The American Robin is active mostly during the day and assembles in large flocks at night. It is one of the earliest bird species to lay eggs, beginning to breed shortly after returning to its summer range from its winter range. Its nest consists of long coarse grass, twigs, paper, and feathers, and is smeared with mud and often cushioned with grass or other soft materials. It is among the first birds to sing at dawn.

Page 14: Knowledge extraction from the Encyclopedia of Life using Python NLTK

Robin is the name of several fictional characters appearing in comic books published by DC Comics, originally created by Bob Kane, Bill Finger and Jerry Robinson, as a junior counterpart to DC Comics superhero Batman. The team of Batman and Robin is commonly referred to as the Dynamic Duo or the Caped Crusaders.

The American Robin is active mostly during the day and assembles in large flocks at night. It is one of the earliest bird species to lay eggs, beginning to breed shortly after returning to its summer range from its winter range. Its nest consists of long coarse grass, twigs, paper, and feathers, and is smeared with mud and often cushioned with grass or other soft materials. It is among the first birds to sing at dawn.

• fictional
• comic books
• Bob Kane
• superhero
• Batman
• Dynamic Duo
• Caped Crusaders

• flocks
• bird
• eggs
• nest
• sing
• species
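
The two keyword lists suggest one simple way to tell the two Robins apart: count how many words from each list appear in a passage. The sketch below is one possible reading of the slide, not the method used in the talk. The names `SENSE_KEYWORDS` and `guess_sense` are hypothetical, and the multi-word keywords are reduced to single words to keep the matching simple.

```python
import re

# Keywords taken from the slide, reduced to single lowercase words.
SENSE_KEYWORDS = {
    "comic-book character": {"fictional", "comic", "kane", "superhero",
                             "batman", "dynamic", "caped"},
    "bird": {"flocks", "bird", "eggs", "nest", "sing", "species"},
}


def guess_sense(text):
    """Pick the sense whose keyword list overlaps most with the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(words & keywords)
              for sense, keywords in SENSE_KEYWORDS.items()}
    return max(scores, key=scores.get)


comic_text = ("The team of Batman and Robin is commonly referred to as "
              "the Dynamic Duo or the Caped Crusaders.")
bird_text = ("It is one of the earliest bird species to lay eggs, beginning "
             "to breed shortly after returning to its summer range.")

comic_sense = guess_sense(comic_text)
bird_sense = guess_sense(bird_text)
```

Exact word matching is crude ("birds" does not match "bird"), so a real pipeline would likely stem or lemmatize first.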

Page 15: Knowledge extraction from the Encyclopedia of Life using Python NLTK

GNRD

Beautiful Soup

Resolver

Page 17: Knowledge extraction from the Encyclopedia of Life using Python NLTK
Page 18: Knowledge extraction from the Encyclopedia of Life using Python NLTK

From GNRD:

name_list = ["Pandarus sinuatus", "Pandarus smithii"]

# Keep the first word of each name, which is the genus
genera = []
for name in name_list:
    row = name.split(' ')
    genera.append(row[0])

# Result
genera = ["Pandarus", "Pandarus"]

Page 19: Knowledge extraction from the Encyclopedia of Life using Python NLTK

# Find the position of each genus in tokens, searching onward from the previous match
i = -1
genus_index_list = []
for genus in genera:
    genus_index = tokens.index(genus, i + 1)
    genus_index_list.append(genus_index)
    i = genus_index

genera = ["Pandarus", "Pandarus"]

genus_index_list = [36, 39]

Page 20: Knowledge extraction from the Encyclopedia of Life using Python NLTK

# Work from the last match to the first so that merging two tokens does not shift the indexes still to be processed
for counter, index in reversed(list(enumerate(genus_index_list))):
    species = ' '.join(tokens[index:index+2])  # Join the genus to the word immediately following.
    if species == name_list[counter]:  # Does this match the name_list?
        tokens[index:index+2] = [species]  # If yes, combine the two into one element

genus_index_list = [36, 39]

Page 21: Knowledge extraction from the Encyclopedia of Life using Python NLTK

tokens = ['Great', 'white', 'sharks', 'are', 'apex', 'predators', ',', 'meaning', 'they', 'have', 'a', 'large', 'effect', 'on', 'the', 'populations', 'of', 'their', 'prey', 'including', 'elephant', 'seals', 'and', 'sea', 'lions.', 'Great', 'white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as', 'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']

Page 22: Knowledge extraction from the Encyclopedia of Life using Python NLTK

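for name_index in name_index_list:
    # Take up to 10 tokens on each side of the name; max() keeps the start from going negative
    term_list = tokens[max(0, name_index - 10):name_index + 10]
    # Process term_list here, because the next name overwrites it

name_index_list = [36, 38]

# Window around the first name (index 36)
term_list = ['white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as', 'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']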

Looking at the first relationship:

Carcharodon carcharias Pandarus sinuatus

Page 23: Knowledge extraction from the Encyclopedia of Life using Python NLTK

Looking at the first relationship:

Carcharodon carcharias Pandarus sinuatus

Parasite/host

term_list = ['white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as', 'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']

Page 24: Knowledge extraction from the Encyclopedia of Life using Python NLTK

Training Data

• Show the algorithm what “parasite/host” words look like
• Compare to an unknown
• We want “Document Classification”
• NLTK includes categorized corpora: Brown, Reuters and Movie Reviews
• We need to make our own corpus

Page 25: Knowledge extraction from the Encyclopedia of Life using Python NLTK

Creating a Categorized Text Corpus

• http://www.packtpub.com/article/python-text-processing-nltk-20-creating-custom-corpora

• Inside “corpus” folder create new folder for your corpus. Mine is “eco”.

• Build your corpus (start with EOL text)
• Make a category specification
• Let's start with parasitism and predation

Page 26: Knowledge extraction from the Encyclopedia of Life using Python NLTK

Creating a Categorized Text Corpus

• eco
– lion1.txt
– lion2.txt
– lion3.txt
– shark1.txt
– shark2.txt
– shark3.txt
– …
– cats.txt

• in cats.txt
lion1.txt predation
lion2.txt parasitism
…

Page 27: Knowledge extraction from the Encyclopedia of Life using Python NLTK

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

corpus_root = '/Users/athessen/nltk_data/corpora/eco'
# Group the alternation so that \d*\.txt applies to both lion and shark files
reader = CategorizedPlaintextCorpusReader(corpus_root, r'(lion|shark)\d*\.txt', cat_file='cats.txt')

Page 28: Knowledge extraction from the Encyclopedia of Life using Python NLTK

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

corpus_root = '/Users/athessen/nltk_data/corpora/eco'
# Group the alternation so that \d*\.txt applies to both lion and shark files
reader = CategorizedPlaintextCorpusReader(corpus_root, r'(lion|shark)\d*\.txt', cat_file='cats.txt')

Choose a Corpus Reader

Page 29: Knowledge extraction from the Encyclopedia of Life using Python NLTK

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

corpus_root = '/Users/athessen/nltk_data/corpora/eco'
# Group the alternation so that \d*\.txt applies to both lion and shark files
reader = CategorizedPlaintextCorpusReader(corpus_root, r'(lion|shark)\d*\.txt', cat_file='cats.txt')

Choose a Corpus Reader

You have to tell this Corpus Reader:

• Corpus root directory
• File names (aka fileids)
• Category specification
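
To try the three arguments without touching a real nltk_data folder, the sketch below builds a tiny throwaway corpus in a temporary directory and reads it back. The file contents and the two-file layout are made up for illustration. Only `CategorizedPlaintextCorpusReader` and its `categories()`, `fileids()` and `raw()` methods come from NLTK.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Made-up documents: fileid -> (category, text).
docs = {
    "lion1.txt": ("predation", "Lions are predators that hunt zebra."),
    "shark1.txt": ("parasitism", "Great white sharks are hosts to copepods."),
}

# 1. Corpus root directory (a temporary folder instead of nltk_data/corpora/eco).
corpus_root = tempfile.mkdtemp()
for fileid, (category, text) in docs.items():
    with open(os.path.join(corpus_root, fileid), "w", encoding="utf-8") as handle:
        handle.write(text)

# 3. Category specification: one "fileid category" pair per line.
with open(os.path.join(corpus_root, "cats.txt"), "w", encoding="utf-8") as handle:
    for fileid, (category, _) in docs.items():
        handle.write(f"{fileid} {category}\n")

# 2. File names (fileids) are selected with a regular expression.
reader = CategorizedPlaintextCorpusReader(
    corpus_root, r"(lion|shark)\d*\.txt", cat_file="cats.txt"
)

categories = reader.categories()
predation_files = reader.fileids(categories="predation")
shark_text = reader.raw("shark1.txt")
```

The file cats.txt is not picked up as a document because it does not match the fileid pattern.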

Page 30: Knowledge extraction from the Encyclopedia of Life using Python NLTK

Next Steps

• Build corpus
• Build Feature Extractor
• Train Classifier

Page 31: Knowledge extraction from the Encyclopedia of Life using Python NLTK

Build Feature Extractor
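
The slide gives no code for this step. One common approach, shown here as an assumption rather than the talk's method, is a bag-of-words extractor that records whether each vocabulary word occurs in a token window. The function `document_features` and the four-word `vocabulary` are hypothetical.

```python
def document_features(tokens, vocabulary):
    """Map a list of tokens to a dict of 'contains(word)' -> True/False."""
    present = {token.lower() for token in tokens}
    return {f"contains({word})": word in present for word in vocabulary}


# Hypothetical cue words; a real vocabulary would come from the corpus.
vocabulary = ["hosts", "parasites", "predators", "prey"]

# The window around Pandarus sinuatus from the earlier slide.
term_list = ['white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such',
             'as', 'copepods', '(', 'Pandarus sinuatus', 'and',
             'Pandarus smithii', ')', '.']

features = document_features(term_list, vocabulary)
```

A dict of this shape is what NLTK classifiers expect as a featureset.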

Page 32: Knowledge extraction from the Encyclopedia of Life using Python NLTK

Train Classifier
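
The slide does not say which classifier was used. As a sketch only, the example below trains NLTK's `NaiveBayesClassifier` on four made-up sentences standing in for the "eco" corpus, then classifies the window around Pandarus sinuatus. The training sentences, `document_features` and the label names are assumptions for illustration.

```python
import nltk

# Made-up labeled snippets standing in for the "eco" corpus.
training_texts = [
    ("sharks are hosts to parasites such as copepods", "parasitism"),
    ("ticks are parasites of deer hosts", "parasitism"),
    ("lions are predators that hunt prey", "predation"),
    ("sharks are apex predators of seals their prey", "predation"),
]

# Vocabulary: every word seen in the training snippets.
vocabulary = sorted({word for text, _ in training_texts for word in text.split()})


def document_features(tokens):
    """Bag-of-words features: does each vocabulary word occur in the tokens?"""
    present = {token.lower() for token in tokens}
    return {f"contains({word})": word in present for word in vocabulary}


train_set = [(document_features(text.split()), label)
             for text, label in training_texts]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# The unknown: the window around Pandarus sinuatus from the earlier slide.
term_list = ['white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such',
             'as', 'copepods', '(', 'Pandarus sinuatus', 'and',
             'Pandarus smithii', ')', '.']
predicted = classifier.classify(document_features(term_list))
```

Four sentences are far too few for a usable model; the point is only the shape of the training set, a list of (featureset, label) pairs.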

Page 33: Knowledge extraction from the Encyclopedia of Life using Python NLTK

Error Checking
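
Assuming "error checking" here means reviewing the classifier's mistakes, one simple check is to compare predicted labels with hand-assigned labels on held-out text. The two label lists below are invented placeholders, not results from the talk.

```python
from collections import Counter

# Placeholder labels for five held-out documents (invented for illustration).
gold = ["parasitism", "predation", "predation", "parasitism", "predation"]
predicted = ["parasitism", "predation", "parasitism", "parasitism", "predation"]

pairs = list(zip(gold, predicted))

# Fraction of documents labeled correctly.
accuracy = sum(g == p for g, p in pairs) / len(pairs)

# How often each (gold, predicted) combination occurs.
confusion = Counter(pairs)

# Which documents to reread: (position, gold label, predicted label).
errors = [(i, g, p) for i, (g, p) in enumerate(pairs) if g != p]
```

Reading the misclassified documents often shows which cue words are missing from the training corpus.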