text-mining practical

Post on 10-May-2015

unix primer

the command line

some useful commands

cat

less

head -10

tail -10

grep 'needle'

cut -f 2

sort

sort -nr

uniq -c

redirecting output

write to file

command > filename

using pipes

command1 | command2

putting it all together

cut -f 4 infile | sort | uniq -c | sort -nr | head -100 > outfile
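
A small worked example may make this pipeline more concrete. The sketch below builds a made-up, four-column, tab-delimited file (the file contents and the GENE_* values are placeholders, not data from the practical) and counts how often each value occurs in column 4.

```shell
# Work in a temporary directory so nothing existing is overwritten.
workdir=$(mktemp -d)
infile="$workdir/infile"
outfile="$workdir/outfile"

# Made-up tab-delimited input; column 4 holds the value we want to count.
printf 'a\t1\tx\tGENE_A\n' >  "$infile"
printf 'b\t2\ty\tGENE_B\n' >> "$infile"
printf 'c\t3\tz\tGENE_A\n' >> "$infile"
printf 'd\t4\tw\tGENE_A\n' >> "$infile"
printf 'e\t5\tv\tGENE_B\n' >> "$infile"
printf 'f\t6\tu\tGENE_C\n' >> "$infile"

# Take column 4, sort so identical values are adjacent, count each value,
# sort by count (largest first), keep the top 100, and write to a file.
cut -f 4 "$infile" | sort | uniq -c | sort -nr | head -100 > "$outfile"

cat "$outfile"
```

The first `sort` is needed because `uniq -c` only merges adjacent identical lines; the second `sort -nr` orders the counts numerically, largest first.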

the task

disease gene finding

named entity recognition

human genes

gene prioritization

what I have done

information retrieval

two diseases

prostate cancer

schizophrenia

two sets of documents

62,755 abstracts

65,588 abstracts

one directory with each set

one file with each abstract

dictionary

tab-delimited file

human genes

22,523 entities

synonyms

from many databases

orthographic variation

prefixes and postfixes

automatically generated

2,726,495 names

tagdir program

flexible matching

upper- and lower-case

spaces and hyphens

tab-delimited output
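
The idea behind flexible matching can be illustrated with a few standard commands. This is only a sketch of the general principle, not how tagdir is implemented: the `normalize` helper and the example names are assumptions made for illustration.

```shell
# Hypothetical helper: lower-case a name and drop spaces and hyphens,
# so that spelling variants of one name collapse to the same key.
normalize() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -d ' -'
}

a=$(normalize "IL-2")
b=$(normalize "il 2")
c=$(normalize "IL2")

# All three variants end up as the same string.
echo "$a $b $c"
```

Comparing normalized keys rather than raw strings is one simple way to make a dictionary match names regardless of case, spaces and hyphens.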

what you will do

named entity recognition

find unfortunate names

create “black list”

information extraction

co-mentioning

within abstracts

rank genes for each disease

find shared gene
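
The ranking and comparison steps above could be sketched as follows. The file layout (abstract ID, tab, gene) is an assumption, since the actual tagdir output columns are not described here, and the abstract IDs and GENE_* names are placeholders.

```shell
workdir=$(mktemp -d)

# Made-up match files, one per disease set: abstract ID <TAB> gene.
printf 'p1\tGENE_A\np1\tGENE_A\np2\tGENE_A\np2\tGENE_B\np3\tGENE_C\n' \
    > "$workdir/prostate.tsv"
printf 's1\tGENE_D\ns2\tGENE_D\ns2\tGENE_A\ns3\tGENE_E\n' \
    > "$workdir/schizophrenia.tsv"

# Rank genes by the number of distinct abstracts that mention them:
# sort -u first, so a gene mentioned twice in one abstract counts once.
rank_genes() {
    sort -u "$1" | cut -f 2 | sort | uniq -c | sort -nr
}

rank_genes "$workdir/prostate.tsv"      > "$workdir/prostate.rank"
rank_genes "$workdir/schizophrenia.tsv" > "$workdir/schizophrenia.rank"

# Genes seen in both sets: comm -12 prints lines common to two sorted files.
cut -f 2 "$workdir/prostate.tsv"      | sort -u > "$workdir/prostate.genes"
cut -f 2 "$workdir/schizophrenia.tsv" | sort -u > "$workdir/schizophrenia.genes"
shared=$(comm -12 "$workdir/prostate.genes" "$workdir/schizophrenia.genes")

echo "$shared"
```

Because every abstract in a set was retrieved for one disease, counting the abstracts that mention a gene serves here as a simple co-mention score for that gene and disease.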

a helping hand

“black list”

100+ matches

10+ matches
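
One way to produce a list of frequently matched names for manual review is sketched below. The input layout (abstract ID, tab, matched name) and its contents are assumptions for illustration, and the threshold is lowered to 3 so the toy data produces a result.

```shell
workdir=$(mktemp -d)
matches="$workdir/matches.tsv"

# Made-up match file: abstract ID <TAB> matched name.
{
    printf 'a1\tand\n'
    printf 'a2\tand\n'
    printf 'a3\tand\n'
    printf 'a1\tGENE_A\n'
    printf 'a2\tGENE_B\n'
    printf 'a3\tGENE_B\n'
} > "$matches"

# Names matched at least $threshold times are candidates for review.
# The slides mention cut-offs of 100+ and 10+ matches; 3 is used here
# only because the toy file is tiny.
threshold=3
candidates=$(cut -f 2 "$matches" | sort | uniq -c |
    awk -v min="$threshold" '$1 >= min { print $2 }')

echo "$candidates"
```

Note that `print $2` keeps only the first word of a name, so names containing spaces would need different handling. The output is a list of candidates to inspect by hand, not a finished black list.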

wrap up

prostate cancer

FOLH1

schizophrenia

Glutamate carboxypeptidase II

same protein

synonyms matter

“black list” is crucial

text mining is useful

not black magic

EMBO Practical Course Computational Biology: Genomes to Systems, Puerto Varas, 3-9 April 2014

Thank you!
