sig new grad 2010 vijay shanker. general info research interests: natural language processing, text...

SIG New Grad 2010

Vijay Shanker

General Info

• Research interests: Natural Language Processing, Text Mining, Machine Learning, NLP/Software Engineering

• Funding: NSF, NIH/NLM, USDA

• Teaching: Logic, Theory of Computation, Machine Learning/Data Mining, Topics in Natural Language Processing

Machine Learning with Minimal Supervision

• Supervised Machine Learning requires annotated data – what should the output be for each instance.

• More success leads to wider use, but– Lot of data to be annotated– Often requires expert annotator

• Minimal Supervision– Active learning– Domain Adaptation– Semi-supervised or unsupervised

Active LearningMichael Bloodgood (2009)

• The learned model decides which instance should be annotated next.

• Mike’s dissertation focused on active learning in a situation where there is data imbalance

• Partially learned model (SVM) would figure out the hard instances

• For Imbalance, adjusted model to penalize errors on minority class according to degree of imbalance

• Statistical methods to infer the degree of imbalance with high confidence

• Success in a variety of IR (e.g., text classification), NLP and other domains

Active Learning Contd.

• When do we stop? A fundamental but understudied research problem

• When performance (accuracy is good enough) – but we can’t use annotated data to decide

• Method based on stabilization of predictions • ICML workshop on active Learning (2008),

HLT/NAACL (2009), CoNLL (2009)

Domain Adaptation• ML/Statistical NLP methods not successful when

application domain is different from training domain

• Use same model but adapt to domain• Figure out what is independent of domain and

what varies• Use unannotated data from domain to make a

good guess and bootstrap with learned model• Example– POS tagging (J. Miller & M. Torii)– Syntax same but lexicon different (EMNLP 2008)

Learning from Positive Data

• Existence of incomplete databases (especially in Bioinformatics)

• Databases only provide positive data• But can’t learn without negative data• Again issues involving bootstrapping, choosing

examples, imbalance etc.

Text Mining and Information Extraction

• Mining – look for nuggets• Text Mining – look for information/patterns in large

amounts of text• Information Extraction – look for specific kind of

information from documents (text, websites, etc)• Relation Extraction – two or more entities in specific

relation – from unstructured data to structured data (Databases)

• Employer-Employee relation, 2 proteins that interact, books and authors

• Much of my current work on biomedical research articles

Text MiningeGIFT -- Oana Tudor

• Large scale experimental data involve thousands of genes

• Tens of thousands of genes. Biologists might know about a couple of hundred genes well

• Incomplete databases• Still most of information in the literature only,

besides the papers give the relevant context to interpret information – but significant IR issues in search for genes

eGIFT

• Addresses gene based search (IR)• Compares gene specific literature vs

background set to identify “key terms”• Categorizes terms• Links to sentences and literature• Picking one good sentence to suggest relation• Summary of gene and its properties

DNA Topoisomerase II alpha

• Identify “keywords” without any human intervention• Top2a literature -- 758 abstracts – nuclear (0.3), break (0.1), corepressor (0.003)

• Background -- 1.98 million abstracts– nuclear (0.1), break (0.001), corepressor (0.05)

• Statistical tests for significance• Extend or not?– Nuclear, double strand break, anticancer, chromosome

segregation, …

Identify Sentences

• Pick sentences with keywords based on where they appear, where gene name appears, # of keywords etc.– DNA Topoisomerase IIA expression levels are related to tumor

growth and to resistance to anticancer chemotherapy

– DNA Topoisomerase IIalpha is an essential enzyme for chromosome segregation during mitosis

– DNA topoisomerase II is a nuclear enzyme whose decatenating activity on newly replicated DNA is essential to successful cell division

• SMBM 2008, BMC Bioinformatics 2010

Current and Future Extensions

• Selecting and Ranking sentences – SVM rank– Features – vagueness and sentence complexity

• Summarizing a gene and its properties – wikigene

• Assist in interpretation of large scale experiments, hypothesis generation and knowledge discovery

Information Extraction

• Several relation extraction tasks, many from biomedical text

• Research issues are not specific to any domain or task– Beyond sentences– What kind of features (lexical, syntactic, semantic

features) help?– Learn starting from positive data– Could sentence simplification help?

Text Simplification for RE

• Sentences in WSJ, paper abstracts, etc. are quite structurally complicated

• To humans the information to be extracted may look straightforward

• But not straightforward for systems, e.g., because parsing (especially in new domains) remains a hard task. So often these systems rely on much “shallower” features.

Simplification Example

• MAPK phosphorylates BCL-2 at Serine 112.• MAPK isolated from rat brain tissue

phosphorylates BCL-2 at Serine 112.• MAPK phosphorylates BCL-2, a member of the

BAX family, at Serine 112.• MAPK phosphorylates BCL-2 at Serine 112 and

BAD at Tyrosine 25.

Sentence Simplification

• MAPK isolated from rat brain tissue phosphorylates BCL-2 at Serine 112.– MAPK phosphorylates BCL-2 at Serine 112.– MAPK was isolated from rat brain tissue.

• MAPK phosphorylates BCL-2 at Serine 112 and BAD at Tyrosine 25. – MAPK phosphorylates BCL-2 at Serine 112.– MAPK phosphorylates BAD at Tyrosine 25.

NLPA for SE

• Collaboration with Lori Pollock • Code as a different type of NL document• Abbreviation expansion -- wcStrBrk– Acronyms – International Business Machine (IBM)

vs AuctionEntry ae; • Importance of verb object relation – NL – verb classification e.g., drink, sip, gulp vs. eat,

bite, …– No subject; object in name or first parameter

Software Word Usage Model

• SWUM – virtual remodularization (OO vs procedural) – where in the code is some action completed– Software search -- query expansion (Shepherd etal

AOSD 2006), query reformulation (Hill etal ICSE 2009)

– Software navigation – Hill et al ASE 2009– Comment generation – Giri et al ASE 2010, ICSE

2011

Summary Comment Generation• Giriprasad ASE 2010• NL vs software code– Location in document – differing importance– Flow of information – lexical chains vs control and

data flow• High level actions (ICSE 2011)– Structural aspects, code conventions and design

patterns• Text Segmentation vs code segmentation – Readability (para), refactoring code, internal

comments

Naming ConventionsClasses and methods

• Call site – imperative statements vs. comment– saveAuction(ae)– savesAuction() or c.contains(str)– auctionSaved()– savedAuction

• Boolean returns – proposition and sentential window.isClosed()

• Verbs to nouns. Tyranny of nouns.– Traditional vs action-oriented classes– book, sequenceManager, xmlParser

Other Projects

• Colin Kern – with Prof. Liao– Transmembrane protein structure prediction– Incorporating local and long distance relations

• Dan Blanchard – with Prof. Jeffrey Heinz– Unsupervised word segmentation (e.g., speech)– What if we didn’t have a lexicon? Infant language

acquisition

sig new grad 2010 vijay shanker. general info research interests: natural language processing, text...

Documents

data imbalance

unstructured data

negative data

lot of data

unsupervised slide

domains slide

use unannotated data

text mining egift