Six Month Progress Report
Farzaneh Sarafraz
14 August 2008
In this report
● What I have learnt
● What are the gaps in my understanding
● Outputs so far
● Reflection on supervision mode
● Plan outline until December 2008
1. What I have learnt – general
● General
  – Settled down in a new environment
  – Learnt some of the regulations and how things work in
    ● The country
    ● The city
    ● The university
    ● The faculty
    ● The school
What I have learnt – less general
● Less general
  – Thesis and paper writing theory and practice
    ● Specifically through the CS7100 seminar
  – LaTeX
  – Coding infrastructure
    ● Warmed up!
  – Database handling
  – Administration / web applications
● Specific
  – Biological text mining theory
Biological text mining
● Biological text mining theory
  – Main problems
  – Main challenges
  – Main approaches
  – Communities
  – Events, papers, journals, competitions, etc.
    ● 40+ papers in my CiteULike account
● Biological text mining hands on
  – Tools, techniques, and resources
  – i2b2
  – HIV
Biological text mining theory
● Main problems
  – Information retrieval
  – Information extraction
    ● Relation extraction
  – Shallow parsing / chunking
  – POS tagging
  – Word sense disambiguation
  – Term variation
Biological text mining theory (cont.)
● Main problems (cont.)
  – Named entity recognition
    ● Dictionary based
    ● Rule based
    ● Machine learning (HMM: Zhou et al.)
    ● Hybrid
  – Evaluation
    ● Precision, recall, F-score
    ● Sensitivity and specificity
    ● Not always possible due to the lack of
      – Test corpora
      – Common domains, techniques, goals
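The evaluation measures listed above can all be derived from counts of true/false positives and negatives. The following is a minimal sketch under that assumption; the function name and the counts are made up for illustration and are not results from this project.

```python
def evaluation_measures(tp, fp, fn, tn):
    """Compute standard measures from confusion-matrix counts (illustrative helper)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # recall is the same as sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # The F-score here is the balanced harmonic mean of precision and recall (F1).
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_score": f_score,
    }


# Made-up counts, for illustration only.
measures = evaluation_measures(tp=8, fp=2, fn=2, tn=88)
```

Note that specificity needs a count of true negatives, which is often hard to define for extraction tasks; this is one reason precision, recall, and F-score are more commonly reported.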
Biological text mining theory (cont.)
● Main challenges
  – Deal with the sublanguage of biology
  – Build scalable and robust systems
  – Present the results in meaningful and informative ways to the biologist
  – Deal with interdisciplinary aspects
    ● Biology – chemistry – medicine
      – Different views / information needs
    ● Specific field (biomedicine) – linguistics – computation and data mining
Main Challenges (cont.)
● Specific field (biomedicine) – linguistics – computation and data mining
  – The text is not necessarily written to be comprehensible by automatic techniques
  – The language is dramatically different from that of e.g. newswire
  – Terminology, new and coined terms, usage ambiguity
  – Non-algorithmic, irrational patterns in NL
Resources
● I am aware of / I am using existing resources
  – Literature repositories/search engines
    ● PubMed, MEDLINE, BioMed
    ● Google
  – Parsers
    ● Stanford Parser
    ● GeniaTagger
  – Terminological resources
    ● Gene Ontology
    ● EMBL-EBI
    ● MeSH thesaurus
    ● UMLS
    ● Gene Synonym Finder, SBO, ...
Resources (cont.)
● Existing resources (cont.)
  – Lexical resources
  – Web services
    ● Entrez
    ● Taverna
    ● SBO
Resources (cont.)
● I am partially developing tools for
  – Named entity recognition
  – Relation extraction
● I am fully tackling
  – PPI mining
  – Word sense disambiguation
  – Nominalization
● I may have to tackle in future
  – Contradiction, negation, contrasts
  – Temporal text mining
2. What I still need to learn – Specific
● There may be gaps I am unaware of
● Less reinvention of the wheel
  – Use other software
    ● Lingpipe, NLTK, Weka, RASP, ABNER, PIE, BIOINFER, MALLET, Julielab, SPECIALIST, EMBL-EBI, GNN (Arizona Uni),
  – Use other methods/approaches
    ● Machine learning
    ● Dynamic programming
  – CL / Bio text mining theory algorithms
    ● Viterbi, HMM, NN, SVM, GA, CRF,
    ● ...
2. What I still need to learn – Specific
● Make a resources list on our web page?
  – Similar to the (outdated) Stanford repository
What I still need to learn – Less general
● News of the field
● Areas/opportunities for research
  – Michael Phelps analogy
● Developing skills for a CV
  – Ways to prove I have the skills I already have
● Presenting results
  – Reasons, occasions, methods
  – Writing
    ● Other workshops by the faculty
What I still need to learn – General
● Writing, writing, writing
  – Binge writing vs. snacking
  – Write as you go
    ● Closer to the final output
    ● Paper-based dissertation? Something to consider.
  – Review, get feedback, rewrite
  – A pedantic editor
What I still need to learn – General (Cont.)
● Stronger coding infrastructure
  – More reusable libraries
  – Config files
  – One-click approach
● Optimisation
  – Code
  – Database
    ● Query optimization
    ● Database optimization
  – Server
    ● Load balancing
      – Multi-threading
      – Multi-processor
3. Outputs so far
● Written
  – Background work survey
    ● Mid April 2008
    ● 5 pages (approx. 1000 words)
    ● Feedback from supervisor
    ● Was never written up
  – Writing sample for CS7100 seminar
    ● June 2008
    ● Same document as above, revised and rewritten
    ● 12 pages, 2215 words
    ● Feedback from Jim Miles and peer students
HIV
● Understanding of the problem and the goals
● Presenting the given/wanted as tables/code/query
● Building code infrastructure
  – Database tables
  – Utility libraries
  – Version control system
  – 1500+ lines of documented, reusable code
HIV summary
● Goal: to reproduce a human-produced table
● Each row has the following main columns
  – HIV GPN (protein name, acc, and gene ID)
  – Human GPN (protein name, acc, and gene ID)
  – A relation (interaction) between the two
  – A description of the interaction
  – The PMIDs that the interaction has been reported in
● The raw input: the full abstracts
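One way to picture a row of the goal table is as a small record type. This is only a sketch: the class and field names are assumptions chosen to mirror the columns listed above, and the values are placeholders rather than real data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class GPN:
    """One GPN entry: protein name, accession, and gene ID (field names assumed)."""
    protein_name: str
    accession: str
    gene_id: str


@dataclass
class InteractionRow:
    """One row of the goal table (hypothetical layout)."""
    hiv: GPN
    human: GPN
    relation: str          # the interaction between the two
    description: str       # free-text description of the interaction
    pmids: List[str] = field(default_factory=list)  # abstracts reporting it


# Placeholder values only; none of these identifiers are real.
row = InteractionRow(
    hiv=GPN("HIV_PROTEIN_X", "ACC_HIV_0", "GENE_HIV_0"),
    human=GPN("HUMAN_PROTEIN_Y", "ACC_HUMAN_0", "GENE_HUMAN_0"),
    relation="interacts with",
    description="placeholder description",
    pmids=["0000000"],
)
```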
HIV results
● HIV and human GPN names
  – Most were mapped to their entities
  – 1237 out of 50416 currently unmapped (2%)
● Interaction verbs
  – Interesting verbs and stems identified
  – The stems were found in the text
    ● Working on stems, so including nominals, etc.
● Terms extracted from the interaction descriptions in the original data
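Matching on stems rather than surface forms is what lets a nominal such as "inhibition" be found alongside "inhibits". The sketch below illustrates the idea with a deliberately crude suffix stripper; it is a toy stand-in, not the stemmer used in this work, and the suffix list and example sentence are assumptions.

```python
# Toy suffix list (an assumption); a real stemmer would be far more careful.
SUFFIXES = ("ions", "ion", "ing", "ed", "es", "s", "e")


def crude_stem(word):
    """Strip at most one common suffix, keeping at least four characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def find_stem_matches(sentence, verb_stems):
    """Return (word, stem) pairs for words whose crude stem is a known verb stem."""
    matches = []
    for raw in sentence.split():
        word = raw.strip(".,;:()")
        stem = crude_stem(word)
        if stem in verb_stems:
            matches.append((word, stem))
    return matches


# Stems of two example interaction verbs.
stems = {crude_stem("inhibit"), crude_stem("activate")}
sentence = "Protein A inhibits protein B, and this inhibition blocks activation."
matches = find_stem_matches(sentence, stems)
```

Here the verb "inhibits" and the nominals "inhibition" and "activation" are all picked up through their shared stems.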
Example
● SELECT DISTINCT mention FROM index_description_term i WHERE termID = 28;
● 18 variations

CD4+ T      T4 (CD)     CD4+T
CD4, T      T4(CD)      T (CD4)
T CD4       CD4 (T)     CD4+ (T)
CD4(+) T    CD4(+)T     CD4(T)
CD4 T       CD4+T       CD(4+) T
T4+ (CD)    CD4(+)T     CD4 T
Example
● SELECT DISTINCT mention FROM index_description_term i WHERE termID = 28 OR termID = 17;
● 28 variations

CD4+ T      T4(CD)      CD4+ (T)      CD4(+) T cell
CD4, T      CD4 (T)     CD4(T)        CD4 Tcell
T CD4       CD4(+)T     CD(4+) T      CD4(+) Tcell
CD4(+) T    CD4+T       CD4 T         CD4(+)T cell
CD4 T       CD4(+)T     CD4+ T cell   CD4+Tcell
T4+ (CD)    CD4+T       CD4, T cell   CD4(+)Tcell
T4 (CD)     T (CD4)     CD4+ Tcell    CD4 T cell
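The two queries above can be tried against a small in-memory table. This is a sketch only: the table and the termID and mention columns come from the slides, while the other column names and all the rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# sentenceID and start_offset are hypothetical columns.
conn.execute(
    "CREATE TABLE index_description_term ("
    "termID INTEGER, mention TEXT, sentenceID INTEGER, start_offset INTEGER)"
)
rows = [
    (28, "CD4+ T", 1, 10),
    (28, "CD4(+) T", 2, 4),
    (28, "CD4+ T", 3, 25),        # repeated mention, collapsed by DISTINCT
    (17, "CD4+ T cell", 4, 0),
    (17, "CD4 T cell", 5, 17),
    (5, "unrelated term", 6, 3),  # a different term, never selected below
]
conn.executemany(
    "INSERT INTO index_description_term VALUES (?, ?, ?, ?)", rows
)


def distinct_mentions(term_ids):
    """Return the distinct surface forms recorded for the given term IDs."""
    placeholders = ", ".join("?" for _ in term_ids)
    query = (
        "SELECT DISTINCT mention FROM index_description_term "
        f"WHERE termID IN ({placeholders}) ORDER BY mention"
    )
    return [mention for (mention,) in conn.execute(query, list(term_ids))]


single = distinct_mentions([28])      # variations of one term
merged = distinct_mentions([28, 17])  # variations after merging two terms
```

Using `IN` with bound parameters is equivalent to chaining `termID = … OR termID = …` and keeps the query the same shape as more term IDs are merged.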
HIV results
● POS tagging with GeniaTagger
● Parsing with Stanford parser
  – Haven't used this data yet
● Working with sentences as units
● Normalising terms
● Tables of synonyms
● Tables of verb stems and terms
● Indexes with mention/offset pairs
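An index of mention/offset pairs can be built by scanning each sentence for the surface forms in a synonym table. The following is a rough sketch of that idea; the synonym entries reuse term IDs from the examples but the mapping itself, the function name, and the sentence are assumptions.

```python
import re

# Hypothetical synonym table: surface form -> term ID.
SYNONYMS = {
    "CD4+ T cell": 17,
    "CD4+ T": 28,
    "CD4(+) T": 28,
}


def index_mentions(sentence, synonyms):
    """Return (term_id, mention, offset) triples found in the sentence."""
    # Try longer forms first so "CD4+ T cell" wins over its prefix "CD4+ T".
    forms = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(form) for form in forms))
    return [
        (synonyms[match.group()], match.group(), match.start())
        for match in pattern.finditer(sentence)
    ]


# Made-up sentence for illustration.
sentence = "Depletion of CD4+ T cells was observed, while CD4(+) T counts fell."
index = index_mentions(sentence, SYNONYMS)
```

Storing the character offset with each mention makes it possible to map a hit back to its exact position in the sentence later, for example when checking whether it falls inside a verb phrase.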
![Page 25: Six Month](https://reader031.vdocuments.mx/reader031/viewer/2022020207/556774f5d8b42a4f528b51ff/html5/thumbnails/25.jpg)
HIV results
● Looking for sentences that share all these properties with any of the goal table rows
  – A human-HIV pair of GPNs
  – A verb phrase containing a word with the same stem as the interaction verb
  – Any description term(s)
● Very high recall (few false negatives)
● Not-so-high precision (numerous false positives)
● Optimisation for more complicated queries
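The three conditions above amount to a conjunction of set-membership tests per sentence and goal row. Below is a minimal sketch, assuming each sentence has already been annotated with its GPN mentions, verb-phrase stems, and description terms; all dictionary keys and values are hypothetical placeholders.

```python
def sentence_matches(sentence, row):
    """True if the sentence shares all three properties with one goal row."""
    has_pair = (bool(sentence["hiv_gpns"] & row["hiv_names"])
                and bool(sentence["human_gpns"] & row["human_names"]))
    has_verb = row["verb_stem"] in sentence["verb_phrase_stems"]
    has_term = bool(sentence["description_terms"] & row["description_terms"])
    return has_pair and has_verb and has_term


def candidate_sentences(sentences, goal_rows):
    """Return (sentence id, row id) pairs where a sentence matches a row."""
    return [(s["id"], r["id"])
            for s in sentences
            for r in goal_rows
            if sentence_matches(s, r)]


# Placeholder annotations, for illustration only.
goal_rows = [{
    "id": "row1",
    "hiv_names": {"HIV_PROT_A"},
    "human_names": {"HUMAN_PROT_B"},
    "verb_stem": "inhibit",
    "description_terms": {"term_x"},
}]
sentences = [
    {"id": "s1", "hiv_gpns": {"HIV_PROT_A"}, "human_gpns": {"HUMAN_PROT_B"},
     "verb_phrase_stems": {"inhibit"}, "description_terms": {"term_x", "term_y"}},
    {"id": "s2", "hiv_gpns": {"HIV_PROT_A"}, "human_gpns": {"HUMAN_PROT_B"},
     "verb_phrase_stems": {"bind"}, "description_terms": {"term_x"}},
]
candidates = candidate_sentences(sentences, goal_rows)
```

Because the test only checks co-occurrence within a sentence and not which entity does what to which, a filter of this kind would be expected to let through many false positives, consistent with the precision noted above.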
HIV next steps
● Compare with other PPI mining and GPN recognition tools
● Find optimum parameters
● Presentable results
● Integrate with the interaction ontology
● Evaluate, compare, present, get feedback
● Apply to new papers
● Apply to new organisms
● Evaluate, compare, present, get feedback...
Supervision
● Good points
  – Moving away from theory to tackling real problems very quickly
  – Micromanagement while I am free to manage my own time and other preferences
  – Planning ahead, causing commitment
  – Providing common sense, insight, and savvy
Supervision – good points (cont.)
  – Providing good starting points while not ruling out my own ideas
  – Good meeting frequency
    ● Group meetings?
  – General support
  – Addressing my needs
    ● Financial
    ● Research interests and preferences
Supervision
● Could be improved
  – Minutes were not always thorough
  – Same for task lists
  – We could have an agenda for the meetings
    ● I write a list of the things that I want to discuss each session
    ● Like the one I had for this report; one could also have been there when I presented my 3-week plan
  – Same for TEAM meetings and HIV meetings
● I hope we keep tackling real problems in future
Plan
● End of August
  – Presenting HIV output to the group
  – Writing HIV results
● Sep
  – Moving to new accommodation (11-20 Sep.)
  – Moving on HIV
    ● Applying the ontology
    ● Mining new corpora
    ● Generalising?
Plan
● Oct
  – Writing up HIV
  – Possible publication
  – Ideas for PhD research
● Nov
  – Finalise MPhil vs. PhD
  – Finalise PhD research area
  – Work on end of year report
● Dec
  – Write up EOY report
  – EOY Viva
References
● Ananiadou, Sophia, and John McNaught. 2006. Text Mining for Biology and Biomedicine. Norwood: Artech House, Inc.
● Spasić, Irena. Some Web Services relevant for biomedical applications. (Presentation slides.)
● Zhou, GuoDong, Jie Zhang, Jian Su, Dan Shen, and Chew-Lim Tan. 2004. Recognizing names in biomedical texts: a machine learning approach. Bioinformatics. Vol. 20 no. 7. Pp. 1178-1190.