Knowledge Extraction from the Encyclopedia of Life Using Python NLTK
DESCRIPTION
This presentation demonstrates the potential for NLTK to extract information about ecological species interactions from text in the Encyclopedia of Life (EOL). It was presented Nov 12, 2013 at the Startup Institute in Cambridge, MA for the Boston PyLadies monthly meeting.
TRANSCRIPT
Finding Taxonomic Names
Challenges
Koko
Горилла
Guerilla
Eastern Lowland Gorilla
Gorilla graueri
Gorilla berengei
Gorilla beringei
Matschie
Gorilla beringei mikenensis
King Kong
Gorilla gorilla
Virunga
Gorila
Gorille
Mountain gorilla
大猩猩
ゴリラ
Challenges
Aotus trivirgatus
Aotus Illiger 1811
Aotus
Aotus Smith 1805
Aotus ericoides
Contextual data
Aotus nancymaae: Primate, Monkey, Eyes, Food, Panama
Disambiguate by authority, species, contextual data
Contextual data
Aotus mollis: Legume, Plant, Flower, Mirbelieae, Australia
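The contextual-data idea on these two slides can be sketched in a few lines of Python. This is only an illustration, not the method used in the talk: the `CONTEXT` vocabularies are copied from the slide keywords, the sense labels are made up for the example, and the two test sentences are invented.

```python
# Hypothetical context vocabularies, taken from the slide keywords.
CONTEXT = {
    "Aotus (monkey genus)": {"primate", "monkey", "eyes", "food", "panama"},
    "Aotus (legume genus)": {"legume", "plant", "flower", "mirbelieae", "australia"},
}


def disambiguate(text, context=CONTEXT):
    """Return the sense whose context words overlap most with the text.

    Returns None when no context word is found or when two senses tie.
    """
    words = {w.strip(".,;:()").lower() for w in text.split()}
    scores = {sense: len(words & vocab) for sense, vocab in context.items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


# Invented example sentences, one for each sense of the homonym.
monkey_text = "Aotus is a nocturnal monkey with large eyes, found from Panama southward."
plant_text = "Aotus is a legume with a yellow flower, native to Australia."

print(disambiguate(monkey_text))
print(disambiguate(plant_text))
```

A bare mention such as "Aotus" with no surrounding text returns None, which is the case where authority or species information would be needed instead.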
GNRD
Beautiful Soup
Resolver
• Common names
• Interaction type
Python NLTK
• http://nltk.org/book/
• http://nltk.org/
• Install NLTK and NLTK Data
• Natural Language Processing (NLP)
• Semantic Statistics
Robin is the name of several fictional characters appearing in comic books published by DC Comics, originally created by Bob Kane, Bill Finger and Jerry Robinson, as a junior counterpart to DC Comics superhero Batman. The team of Batman and Robin is commonly referred to as the Dynamic Duo or the Caped Crusaders.
The American Robin is active mostly during the day and assembles in large flocks at night. It is one of the earliest bird species to lay eggs, beginning to breed shortly after returning to its summer range from its winter range. Its nest consists of long coarse grass, twigs, paper, and feathers, and is smeared with mud and often cushioned with grass or other soft materials. It is among the first birds to sing at dawn.
• fictional
• comic books
• Bob Kane
• superhero
• Batman
• Dynamic Duo
• Caped Crusaders

• flocks
• bird
• eggs
• nest
• sing
• species
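One rough way to arrive at keyword lists like these is to keep the words that occur in one passage but not the other. The sketch below uses only the standard library; the stopword list is a small hand-picked assumption (NLTK ships a fuller one), and the two passages are shortened versions of the Robin texts above.

```python
import re

# A tiny hand-picked stopword list (an assumption for this example).
STOPWORDS = {
    "the", "is", "of", "and", "to", "in", "as", "a", "it", "its",
    "at", "or", "by", "with", "from", "during", "one", "are",
}


def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def distinctive(text_a, text_b):
    """Return (words only in text_a, words only in text_b), each sorted."""
    a, b = content_words(text_a), content_words(text_b)
    return sorted(a - b), sorted(b - a)


comic = ("Robin is the name of several fictional characters appearing "
         "in comic books published by DC Comics.")
bird = ("The American Robin assembles in large flocks at night. "
        "Its nest is smeared with mud.")

only_comic, only_bird = distinctive(comic, bird)
print(only_comic)
print(only_bird)
```

The shared word "robin" drops out of both lists, leaving the terms that separate the comic-book sense from the bird sense.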
GNRD
Beautiful Soup
Resolver
From GNRD
name_list = ["Pandarus sinuatus", "Pandarus smithii"]
# Take the genus (the first word) of each name.
genera = []
for name in name_list:
    row = name.split(' ')
    genera.append(row[0])
genera = ["Pandarus", "Pandarus"]
# Find each genus in tokens, starting the search just after the previous hit.
i = -1
genus_index_list = []
for genus in genera:
    genus_index = tokens.index(genus, i + 1)
    genus_index_list.append(genus_index)
    i = genus_index
genus_index_list = [36, 39]
# Work from the last index back so earlier indices stay valid after a merge.
for counter, index in reversed(list(enumerate(genus_index_list))):
    species = ' '.join(tokens[index:index+2])  # Join the genus to the word immediately following.
    if species == name_list[counter]:  # Does this match the name_list?
        tokens[index:index+2] = [species]  # If yes, combine the two into one element
tokens = ['Great', 'white', 'sharks', 'are', 'apex', 'predators', ',', 'meaning', 'they', 'have',
          'a', 'large', 'effect', 'on', 'the', 'populations', 'of', 'their', 'prey', 'including',
          'elephant', 'seals', 'and', 'sea', 'lions.', 'Great', 'white', 'sharks', 'are', 'hosts',
          'to', 'parasites', 'such', 'as', 'copepods', '(', 'Pandarus sinuatus', 'and',
          'Pandarus smithii', ')', '.']
name_index_list = [36, 38]
# Collect up to 10 tokens on either side of each name.
term_lists = []
for name_index in name_index_list:
    term_lists.append(tokens[max(0, name_index-10):name_index+10])
term_list = term_lists[0]  # the window around the first name
term_list = ['white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as', 'copepods', '(',
             'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']
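The steps on these slides (merge each two-word name into one token, then take a window of tokens around it) can also be written as two small functions. This is a single-pass variant, not the index-based approach shown on the slides; the function names `merge_names` and `context_windows` are made up for this sketch, and the sentence is split on spaces purely to keep the example self-contained.

```python
def merge_names(tokens, names):
    """Merge each two-word name into a single token.

    Returns the new token list and the positions of the merged names in it.
    """
    wanted = set(names)
    merged, name_indices = [], []
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if pair in wanted:
            name_indices.append(len(merged))
            merged.append(pair)
            i += 2  # skip both words of the name
        else:
            merged.append(tokens[i])
            i += 1
    return merged, name_indices


def context_windows(tokens, name_indices, width=10):
    """Return up to `width` tokens before each name plus the next `width` tokens from it."""
    return [tokens[max(0, i - width):i + width] for i in name_indices]


sentence = ("Great white sharks are hosts to parasites such as copepods "
            "( Pandarus sinuatus and Pandarus smithii ) .")
tokens = sentence.split()
merged, idx = merge_names(tokens, ["Pandarus sinuatus", "Pandarus smithii"])
windows = context_windows(merged, idx, width=3)
print(idx)
print(windows[0])
```

Because the merged list is built as the scan proceeds, the returned positions already refer to the merged tokens and no index correction is needed afterwards.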
Looking at the first relationship:
Carcharodon carcharias, Pandarus sinuatus
Parasite/host
Training Data
• Show the algorithm what “parasite/host” words look like
• Compare to an unknown
• We want “Document Classification”
• NLTK includes categorized corpora: Brown, Reuters and Movie Reviews
• We need to make our own corpus
Creating a Categorized Text Corpus
• http://www.packtpub.com/article/python-text-processing-nltk-20-creating-custom-corpora
• Inside the “corpora” folder, create a new folder for your corpus. Mine is “eco”.
• Build your corpus (start with EOL text)
• Make a category specification
• Let's start with parasitism and predation
• eco
  – lion1
  – lion2
  – lion3
  – shark1
  – shark2
  – shark3
  – …
  – cats.txt
• in cats.txt
  lion1.txt predation
  lion2.txt parasitism
  …
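To make the category file format concrete, here is a minimal sketch that reads lines of the form "fileid category" into a dictionary. It is not NLTK's own parser; the function name `parse_category_file` is made up, and only the two entries shown on the slide are used.

```python
def parse_category_file(text):
    """Map each fileid to its list of categories.

    Each line is expected to look like 'fileid category1 category2 ...'.
    """
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or incomplete lines
        fileid, *categories = parts
        mapping[fileid] = categories
    return mapping


# The two entries shown on the slide; the rest of the file is not reproduced here.
cats_txt = """lion1.txt predation
lion2.txt parasitism
"""

mapping = parse_category_file(cats_txt)
print(mapping)
```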
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
corpus_root = '/Users/athessen/nltk_data/corpora/eco'
# Group the alternation so that \d*\.txt applies to both the lion and the shark files.
reader = CategorizedPlaintextCorpusReader(corpus_root, r'(lion|shark)\d*\.txt', cat_file='cats.txt')
Choose a Corpus Reader
You have to tell this Corpus Reader:
• Corpus root directory
• File names (aka fileids)
• Category specification
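The file name argument is a regular expression, so it is worth checking which names it selects before building the reader. The sketch below uses only the standard `re` module, with the alternation grouped so that the digit and extension parts apply to both prefixes. It assumes the pattern has to match the whole file name, and "notes.md" is a hypothetical stray file added for the example.

```python
import re

# File name pattern with the alternation grouped.
FILEID_PATTERN = r"(lion|shark)\d*\.txt"


def select_fileids(filenames, pattern=FILEID_PATTERN):
    """Return the file names that the pattern matches in full."""
    # Assumption: a file counts as a corpus file only on a whole-name match.
    return [name for name in filenames if re.fullmatch(pattern, name)]


files = ["lion1.txt", "lion2.txt", "shark1.txt", "cats.txt", "notes.md"]
selected = select_fileids(files)
print(selected)
```

The category file "cats.txt" is left out of the selection, which is what you want: it describes the corpus but is not one of its documents.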
Next Steps
• Build corpus
• Build Feature Extractor
• Train Classifier
Build Feature Extractor
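A feature extractor turns a list of tokens into a dictionary of features, which is the input format NLTK classifiers work with. The sketch below is one simple possibility, a "does the text contain this word" extractor; the function name and the four-word vocabulary are hypothetical choices for illustration, and the token list is the context window from the shark example.

```python
def document_features(tokens, vocabulary):
    """Return a feature dict with one boolean per vocabulary word."""
    present = {t.lower() for t in tokens}
    return {f"contains({word})": (word in present) for word in vocabulary}


# Hypothetical vocabulary; in practice it might be the most frequent corpus words.
vocabulary = ["hosts", "parasites", "prey", "predators"]

term_list = ['white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as',
             'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']

features = document_features(term_list, vocabulary)
print(features)
```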
Train Classifier
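To show what training involves, here is a small word-count naive Bayes classifier written in plain Python. It is a stand-in for illustration only, not NLTK's NaiveBayesClassifier and not the classifier from the talk; the four training sentences are invented, and a real corpus would need far more text for each category.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count documents per label and words per label from (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        for token in tokens:
            word = token.lower()
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # prior for this label
        total_words = sum(word_counts[label].values())
        for token in tokens:
            word = token.lower()
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training data for the two starting categories.
training = [
    ("sharks are hosts to parasites such as copepods".split(), "parasitism"),
    ("ticks are parasites that feed on their hosts".split(), "parasitism"),
    ("lions are predators that hunt prey such as zebra".split(), "predation"),
    ("sharks are apex predators and eat seals as prey".split(), "predation"),
]

model = train(training)
label = classify(model, "copepods are parasites of sharks".split())
print(label)
```

With a categorized corpus in place, the same (tokens, label) pairs would come from the corpus reader instead of a hand-written list.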
Error Checking