STRING: Large-scale data and text mining
Lars Juhl Jensen
association networks
guilt by association
biological systems
protein networks
STRING
1100+ genomes
computational predictions
gene fusion
Korbel et al., Nature Biotechnology, 2004
gene neighborhood
Korbel et al., Nature Biotechnology, 2004
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
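The idea behind phylogenetic profiles can be sketched in a few lines: genes that are present and absent in the same genomes are candidates for a functional link. This is a minimal illustration and not STRING's actual scoring. The gene names, the eight-genome profiles, the Jaccard similarity, and the 0.7 threshold are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical presence/absence profiles: one entry per genome (1 = gene present).
profiles = {
    "geneA": (1, 1, 0, 1, 0, 0, 1, 1),
    "geneB": (1, 1, 0, 1, 0, 0, 1, 0),
    "geneC": (0, 0, 1, 0, 1, 1, 0, 0),
}

def jaccard(p, q):
    """Genomes containing both genes divided by genomes containing either."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0

def profile_links(profiles, threshold=0.7):
    """Return (gene1, gene2, score) for pairs with similar profiles."""
    links = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        score = jaccard(p1, p2)
        if score >= threshold:
            links.append((g1, g2, score))
    return links

print(profile_links(profiles))
```

Here geneA and geneB co-occur in four of the five genomes that contain either of them, so they are the only pair reported.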
a real example
Cell
Cellulosomes
Cellulose
experimental data
gene coexpression
protein interactions
Jensen & Bork, Science, 2008
curated knowledge
complexes
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
many databases
different formats
different identifiers
variable quality
not comparable
not same species
hard work
(Ph.D. students)
common identifiers
quality scores
von Mering et al., Nucleic Acids Research, 2005
score calibration
von Mering et al., Nucleic Acids Research, 2005
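One way to picture score calibration is to bin the raw scores of an evidence channel and ask what fraction of the pairs in each bin are confirmed by a trusted reference set. The sketch below does that with equal-sized bins. It is only an illustration under assumed inputs: the raw scores, the gold-standard labels, and the binning scheme are hypothetical and are not taken from the cited paper.

```python
def calibrate(raw_scores, is_true_link, n_bins=3):
    """Map raw-score bins to the observed fraction of gold-standard links.

    Returns a list of (lowest_raw, highest_raw, fraction_true) tuples,
    one per equal-sized bin of pairs sorted by raw score.
    """
    pairs = sorted(zip(raw_scores, is_true_link))
    if len(pairs) < n_bins:
        raise ValueError("need at least one pair per bin")
    size = len(pairs) // n_bins
    table = []
    for i in range(n_bins):
        start = i * size
        # The last bin absorbs any leftover pairs.
        end = (i + 1) * size if i < n_bins - 1 else len(pairs)
        chunk = pairs[start:end]
        hits = sum(1 for _, truth in chunk if truth)
        table.append((chunk[0][0], chunk[-1][0], hits / len(chunk)))
    return table

# Hypothetical raw scores and whether each pair is in the reference set.
raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
truth = [False, False, False, False, True, False, True, True, True]

for low, high, fraction in calibrate(raw, truth):
    print(f"raw {low:.1f} to {high:.1f}: {fraction:.2f} confirmed")
```

Once every channel is expressed as such a fraction, scores from different sources become comparable, which is the point the slides make with "not comparable" and "quality scores".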
homology-based transfer
Franceschini et al., Nucleic Acids Research, 2013
missing most of the data
text mining
>10 km
too much to read
computer
comprehensive lexicon
CDC2
cyclin dependent kinase 1
expansion rules
hCdc2
CDC2
flexible matching
cyclin-dependent kinase 1
cyclin dependent kinase 1
“black list”
SDS
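The four ingredients above (lexicon, expansion rules, flexible matching, black list) can be combined roughly as follows. This is a toy sketch and not the tagger used for STRING. The identifier "CDK1", the single "h" prefix expansion rule, and the normalization (ignore case, treat hyphens and spaces alike) are assumptions chosen to reproduce the slide examples.

```python
import re

# Hypothetical mini-lexicon: identifier -> known names.
LEXICON = {
    "CDK1": ["CDC2", "cyclin dependent kinase 1"],
    "SDS": ["SDS"],  # gene symbol that collides with the detergent abbreviation
}
BLACKLIST = {"sds"}  # names considered too ambiguous to tag

def normalize(name):
    """Flexible matching: ignore case, treat hyphens and spaces alike."""
    return re.sub(r"[-\s]+", " ", name.lower()).strip()

def expand(name):
    """Assumed expansion rule: single-word names also match with an 'h' prefix."""
    variants = {name}
    if " " not in name:
        variants.add("h" + name)
    return variants

def build_index(lexicon, blacklist):
    """Map every normalized name variant to its identifier, skipping blacklisted names."""
    index = {}
    for identifier, names in lexicon.items():
        for name in names:
            if normalize(name) in blacklist:
                continue
            for variant in expand(name):
                index[normalize(variant)] = identifier
    return index

def tag(text, index):
    """Return sorted (normalized name, identifier) pairs found as whole words."""
    norm = normalize(text)
    hits = []
    for key, identifier in index.items():
        if re.search(r"(?<!\w)" + re.escape(key) + r"(?!\w)", norm):
            hits.append((key, identifier))
    return sorted(hits)

index = build_index(LEXICON, BLACKLIST)
print(tag("hCdc2 is also called cyclin-dependent kinase 1", index))
print(tag("Samples were boiled in SDS", index))
```

The first sentence yields two hits that both resolve to the same identifier, while the second yields none because the only candidate name is blacklisted.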
co-mentioning
counting
within documents
within paragraphs
within sentences
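Counting co-mentions at the three levels can be sketched as below. The weights, the naive sentence splitter, and the example text are assumptions for illustration only; the actual scoring scheme is not given on the slides.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical weights: closer co-mentions count more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 4.0}

def names_in(text, names):
    """Names from the given set that occur as whole words in the text."""
    return {n for n in names if re.search(r"\b" + re.escape(n) + r"\b", text)}

def comention_scores(document, names):
    """Sum weighted co-mentions over the document, its paragraphs, and its sentences."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    sentences = [s for p in paragraphs
                 for s in re.split(r"(?<=[.!?])\s+", p) if s.strip()]
    scores = Counter()
    for level, units in (("document", [document]),
                         ("paragraph", paragraphs),
                         ("sentence", sentences)):
        for unit in units:
            for pair in combinations(sorted(names_in(unit, names)), 2):
                scores[pair] += WEIGHTS[level]
    return scores

doc = ("CYC1 is controlled by HAP1. CYC7 is also a cytochrome gene.\n\n"
       "HAP1 responds to heme.")
print(comention_scores(doc, {"CYC1", "CYC7", "HAP1"}))
```

In this example CYC1 and HAP1 share a sentence and therefore score higher than the pairs that only share a paragraph.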
natural language processing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
text corpus
~2 million full-text articles
~22 million abstracts
Exercise 1
Go to http://string-db.org
Query for Mt H37Rv adhD (Rv3086)
Change between different views
Check evidence for adhD–lipR link
Extend network to 50 interactors
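The exercise is meant to be done in the web interface, but a similar query can probably be made programmatically. The sketch below only builds a request URL. It assumes that the STRING REST API offers an `interaction_partners` method taking `identifiers`, `species`, and `limit` parameters, and that NCBI taxonomy ID 83332 denotes M. tuberculosis H37Rv; check both against the current documentation on string-db.org before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL and output format of the STRING REST API.
API_BASE = "https://string-db.org/api"
OUTPUT_FORMAT = "tsv"

def partners_url(identifier, species, limit):
    """Build a URL asking for up to `limit` interaction partners of one protein."""
    params = urlencode({
        "identifiers": identifier,  # e.g. a locus tag such as Rv3086
        "species": species,         # NCBI taxonomy identifier (assumed parameter name)
        "limit": limit,             # maximum number of partners (assumed parameter name)
    })
    return f"{API_BASE}/{OUTPUT_FORMAT}/interaction_partners?{params}"

url = partners_url("Rv3086", 83332, 50)
print(url)
# To fetch the result, something like urllib.request.urlopen(url) could be used.
```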
Exercise 2
Go to the paper PMC2995261
Extract the protein names in table 1
Create STRING network of them
Change to “advanced” mode
Analyze for clusters and enrichment
multi-page tables
related resources
general approach
curated knowledge
experimental data
text mining
computational predictions
common identifiers
quality scores
score calibration
visualization
protein networks
string-db.org
chemical networks
stitch-db.org
subcellular localization
compartments.jensenlab.org
tissue expression
tissues.jensenlab.org
disease associations
diseases.jensenlab.org
Work on your own data
string-db.org
stitch-db.org
compartments.jensenlab.org
tissues.jensenlab.org
diseases.jensenlab.org