Prof. Carolina Ruiz
Computer Science Department
Bioinformatics and Computational Biology Program
WPI
WELCOME TO
BCB4003/CS4803
BCB503/CS583
BIOLOGICAL AND BIOMEDICAL DATABASE MINING
WHY THIS COURSE?
Biological and Biomedical Research Problems
Central dogma: DNA → (transcription) → RNA → (translation) → Protein
• Genome (1980’s-1990’s): sequencing, sequence analysis, …
• Proteome (1990’s-2000’s): protein structure, protein-protein interactions, protein pathways
• Transcriptome (mid 1990’s-2000’s): gene expression, DNA/RNA microarrays
• Biological Function (2000’s)
• Applications (2000’s): organism-organism interactions, organism-environment interactions, genome-wide association studies, cancer therapies, drug development
THIS ALL HAS GENERATED …
• Data
  • Massive datasets and databases of sequence, gene, gene expression, protein, biological function, clinical information, …
• Text
  • Annotations in data sources, abstracts (e.g., Medline), research articles, medical literature (e.g., PubMed, NCBI Bookshelf, Google Scholar), patient records, …
• Ontologies
  • Descriptions of terms and their relationships (e.g., Gene Ontology)
CURRENT CHALLENGES
• To make sense of and put to use all this information.
• How? Computational tools and techniques are needed to help humans integrate, summarize, understand, and take advantage of accumulated information
  • Data mining
  • Text mining
  • Data and text mining together
WHAT IS DATA [TEXT] MINING? OR MORE GENERALLY, KNOWLEDGE DISCOVERY IN DATABASES (KDD)
“Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [text]” (Fayyad et al., 1996)
• Raw Data [Text] → Data [Text] Mining → Patterns
  • Analytical Patterns (rules, decision trees)
  • Statistical Patterns (data distribution)
  • Visual Patterns
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AI Magazine, pp. 37-54. Fall 1996.
DATA MINING METHODS IN BIOINFORMATICS
• Clustering
• Sequence Mining
• Bayesian Methods
• Expectation Maximization (EM)
• Gibbs Sampling
• Hidden Markov Models
• Kernel methods
• Support Vector Machines
TEXT MINING IN BIOINFORMATICS
• Document indexing
• Information retrieval
• Lexical analysis (Sentence tokenization, Word tokenization, Stemming, Stop word removal)
• Semantic analysis
• Query processing
• Text classification
• Text clustering
• Text summarization
• (Semi-) Automatic curation of literature repositories
• Knowledge discovery from text, hypothesis generation
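The lexical analysis steps listed above (sentence tokenization, word tokenization, stop word removal, stemming) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny stop-word list, the suffix-stripping rule (a crude stand-in for a real stemmer such as Porter's), and the example sentences.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real systems use larger curated lists.
STOP_WORDS = {"the", "of", "in", "a", "an", "and", "is", "are", "to", "by"}

def sentence_tokenize(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace (naive rule)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def word_tokenize(sentence):
    """Lowercase the sentence and extract alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def naive_stem(word):
    """Strip one common English suffix, keeping at least three characters of the word."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lexical_analysis(text):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    result = []
    for sentence in sentence_tokenize(text):
        tokens = [w for w in word_tokenize(sentence) if w not in STOP_WORDS]
        result.append([naive_stem(w) for w in tokens])
    return result

# Illustrative input only, not taken from any corpus.
abstract = "BRCA1 mutations are linked to cancer. The protein binds DNA."
print(lexical_analysis(abstract))
# → [['brca1', 'mutation', 'link', 'cancer'], ['protein', 'bind', 'dna']]
```

In practice, biomedical text usually needs more careful tokenization (gene names, hyphenated terms, abbreviations), so this sketch only shows the order of the steps.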
DATA/TEXT MINING PROCESS (KDD)
• information sources
• data management → data
  • databases
  • data warehouses
• data “pre”-processing → cleaned data
  • noisy/missing data
  • feature selection
• data analysis / data mining → models
  • analytical
  • statistical
  • visual
• model/pattern evaluation → “good” model
  • quantitative
  • qualitative
• model/patterns deployment (applied to new data)
  • prediction
  • decision support
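The stages of the process above can be traced end to end in a toy sketch. This is only an illustration under stated assumptions: the records are made up, the "model" is a single threshold rule chosen for simplicity, and the 0.9 accuracy cutoff for a "good" model is arbitrary.

```python
# Hypothetical toy records: (gene expression level, label); None marks a missing value.
raw = [(2.1, "healthy"), (7.9, "disease"), (None, "healthy"), (8.4, "disease"),
       (1.7, "healthy"), (6.8, "disease"), (3.0, "healthy"), (7.2, "disease")]

def preprocess(records):
    """Data pre-processing: drop records with a missing value."""
    return [(x, y) for x, y in records if x is not None]

def mine(train):
    """Data mining: learn a one-rule model, a threshold halfway between the class means."""
    healthy = [x for x, y in train if y == "healthy"]
    disease = [x for x, y in train if y == "disease"]
    return (sum(healthy) / len(healthy) + sum(disease) / len(disease)) / 2

def predict(threshold, x):
    """Deployment: classify a new measurement with the learned threshold."""
    return "disease" if x > threshold else "healthy"

def evaluate(threshold, test):
    """Quantitative evaluation: fraction of held-out records classified correctly."""
    return sum(predict(threshold, x) == y for x, y in test) / len(test)

clean = preprocess(raw)               # cleaned data
train, test = clean[:5], clean[5:]    # hold out the last records for evaluation
threshold = mine(train)               # model
accuracy = evaluate(threshold, test)  # model evaluation

if accuracy >= 0.9:                   # arbitrary cutoff for a "good" model
    print(predict(threshold, 7.5))    # prediction on new data
```

Real projects iterate over these stages many times; the sketch shows a single pass only.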
PUTTING ALL TOGETHER …
• Data / Text / Information Integration
  • Mining over data and text combined
• Visualization
• Other real-world issues
  • Developing tools and techniques that are efficient, scalable, and user friendly
INTERDISCIPLINARY TECHNIQUES COME FROM MULTIPLE FIELDS
• Biology and Biomedicine
  • Contributes domain knowledge
• Machine Learning (AI)
  • Contributes (semi-)automatic induction of empirical laws from observations & experimentation
• Statistics
  • Contributes language, framework, and techniques
• Pattern Recognition
  • Contributes pattern extraction and pattern matching techniques
• Natural Language Processing (AI) / Computational Linguistics
  • Contributes text analysis techniques
• Databases
  • Contributes efficient data storage, data cleansing, and data access techniques
• Data Visualization
  • Contributes visual data displays and data exploration
• High Performance Computing
  • Contributes techniques for efficiently handling complexity
• Signal Processing
• Image Processing …
QUESTIONS?
* Images in this presentation were downloaded from Google images