SQL, noSQL or no database at all? Are databases still a core skill?
Presented at Bioinformatics FOAM workshop, Melbourne, March 27 2014
Neil Saunders
COMPUTATIONAL INFORMATICS
www.csiro.au
Databases: Slide 2 of 24
alternative title: should David Lovell learn databases?
Databases: Slide 3 of 24
actual recent email request
Hi Neil,
I was wondering if you could help me with something. I am trying to put together a table but it is rather slow by hand. Do you know if you can help me with this task with a script? If it is too much of your time, don’t worry about it. Just thought I’d ask before I start.
The task is:
The targets listed in A tab need to be found in B tab then the entire row copied into C tab. Then the details in column C of C tab then need to be matched with the details in D tab so that the patients with the mutations are listed in row AG and AH of C tab.
Again, if this isn’t an easy task for you then don’t worry about it.
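The request boils down to a filter followed by a lookup join. A minimal sketch in plain Ruby, assuming each tab has been read into an array of hashes; the column names (`target`, `mutation`, `patient`), the choice of `mutation` as the matching key and all values are illustrative assumptions, not the real spreadsheet:

```ruby
require "set"

# Hypothetical stand-ins for the spreadsheet tabs (illustrative values only).
targets = ["BRAF", "KRAS"]                         # tab A: targets of interest
details = [                                        # tab B: one row per target
  { target: "BRAF", chrom: "chr7",  mutation: "V600E" },
  { target: "TP53", chrom: "chr17", mutation: "R175H" },
  { target: "KRAS", chrom: "chr12", mutation: "G12D" }
]
patients = [                                       # tab D: mutation and patient
  { mutation: "V600E", patient: "P01" },
  { mutation: "G12D",  patient: "P02" },
  { mutation: "G12D",  patient: "P03" }
]

# Step 1: keep only the tab B rows whose target appears in tab A.
wanted = targets.to_set
tab_c  = details.select { |row| wanted.include?(row[:target]) }

# Step 2: index tab D by mutation, then attach the matching patients to each row.
by_mutation = patients.group_by { |p| p[:mutation] }
tab_c = tab_c.map do |row|
  matched = by_mutation.fetch(row[:mutation], []).map { |p| p[:patient] }
  row.merge(patients: matched)
end

tab_c.each { |row| puts row.values_at(:target, :mutation, :patients).inspect }
```

Building the `by_mutation` index once avoids rescanning tab D for every row, which is roughly what a database index does for a join.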
Databases: Slide 4 of 24
sounds like a database to me (c. 2004)
Databases: Slide 5 of 24
database design is a profession in itself
-- KEGG_DB schema
CREATE TABLE ec2go (
  ec_no VARCHAR(16) NOT NULL,            -- EC number (with "EC:" prefix)
  go_id CHAR(10) NOT NULL                -- GO ID
);
CREATE TABLE pathway2gene (
  pathway_id CHAR(8) NOT NULL,           -- KEGG pathway long ID
  gene_id VARCHAR(20) NOT NULL           -- Entrez Gene or ORF ID
);
CREATE TABLE pathway2name (
  path_id CHAR(5) NOT NULL UNIQUE,       -- KEGG pathway short ID
  path_name VARCHAR(80) NOT NULL UNIQUE  -- KEGG pathway name
);
-- Indexes.
CREATE INDEX Ipathway2gene ON pathway2gene (gene_id);
Databases: Slide 6 of 24
know your ORM from your MVC (do you DSL?)
http://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller
Databases: Slide 7 of 24
my one tip for today: use ORM = object relational mapping
#!/usr/bin/ruby
require 'sequel'

# connect to UCSC Genomes MySQL server
DB = Sequel.connect(:adapter => "mysql", :host => "genome-mysql.cse.ucsc.edu",
                    :user => "genome", :database => "hg19")

# instead of "SELECT count(*) FROM knownGene"
DB.from(:knownGene).count
# => 82960

# instead of "SELECT name, chrom, txStart FROM knownGene LIMIT 1"
DB.from(:knownGene).select(:name, :chrom, :txStart).first
# => {:name=>"uc001aaa.3", :chrom=>"chr1", :txStart=>11873}

# instead of "SELECT name FROM knownGene WHERE chrom = 'chrM'"
DB.from(:knownGene).select(:name).where(:chrom => "chrM").all
# => [{:name=>"uc004coq.4"}, {:name=>"uc022bqo.2"}, {:name=>"uc004cor.1"}, {:name=>"uc004cos.5"},
#     {:name=>"uc022bqp.1"}, {:name=>"uc022bqq.1"}, {:name=>"uc022bqr.1"}, {:name=>"uc031tga.1"},
#     {:name=>"uc022bqs.1"}, {:name=>"uc011mfi.2"}, {:name=>"uc022bqt.1"}, {:name=>"uc022bqu.2"},
#     {:name=>"uc004cov.5"}, {:name=>"uc031tgb.1"}, {:name=>"uc004cow.2"}, {:name=>"uc004cox.4"},
#     {:name=>"uc022bqv.1"}, {:name=>"uc022bqw.1"}, {:name=>"uc022bqx.1"}, {:name=>"uc004coz.1"}]
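To make the idea of a query DSL less magical, here is a toy sketch of how chained method calls can be turned into SQL text. `TinyDataset` is a hypothetical class written for this illustration; it is not part of Sequel, it only builds strings, and its quoting is deliberately naive:

```ruby
# TinyDataset is a hypothetical toy class, not part of Sequel or any real ORM.
# It only builds SQL strings to show how chained calls map onto a query.
class TinyDataset
  def initialize(table, columns: ["*"], conditions: {}, limit: nil)
    @table = table
    @columns = columns
    @conditions = conditions
    @limit = limit
  end

  # Each method returns a new dataset, so calls can be chained in any order.
  def select(*columns)
    TinyDataset.new(@table, columns: columns.map(&:to_s),
                            conditions: @conditions, limit: @limit)
  end

  def where(conditions)
    TinyDataset.new(@table, columns: @columns,
                            conditions: @conditions.merge(conditions), limit: @limit)
  end

  def limit(n)
    TinyDataset.new(@table, columns: @columns, conditions: @conditions, limit: n)
  end

  # Assemble the SQL text from the pieces collected so far.
  def sql
    query = "SELECT #{@columns.join(', ')} FROM #{@table}"
    unless @conditions.empty?
      # Naive quoting for illustration only; a real ORM escapes values safely.
      clauses = @conditions.map { |col, val| "#{col} = '#{val}'" }
      query += " WHERE #{clauses.join(' AND ')}"
    end
    query += " LIMIT #{@limit}" if @limit
    query
  end
end

ds = TinyDataset.new(:knownGene)
puts ds.select(:name).where(chrom: "chrM").sql
puts ds.select(:name, :chrom, :txStart).limit(1).sql
```

A real library adds value escaping, database connections and result mapping on top of this; the sketch only shows why the Ruby calls and the SQL in the comments line up so closely.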
Databases: Slide 8 of 24
don’t want to CREATE? you still might want to SELECT
Question: How to map a SNP to a gene around +/- 60KB ?
I am looking at a bunch of SNPs. Some of them are part of genes, but other are not. I am interested to look up +60KB or -60KB of those SNPs to get details about some nearby genes. Please share your experience in dealing with such a situation or thoughts on any methods that can do this. Thanks in advance.
http://www.biostars.org/p/413/
Databases: Slide 9 of 24
example SELECT
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e '
select
  K.proteinID, K.name, S.name,
  S.avHet, S.chrom, S.chromStart,
  K.txStart, K.txEnd
from snp130 as S
left join knownGene as K on
  (S.chrom = K.chrom and not
    (K.txEnd + 60000 < S.chromStart or
     S.chromEnd + 60000 < K.txStart))
where
  S.name in ("rs25","rs100","rs75","rs9876","rs101")
'
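The join condition is the least obvious part of that query: a SNP and a gene count as neighbours unless one of them ends more than 60 kb before the other starts. A small sketch of the same test in plain Ruby, using invented coordinates and hypothetical field names (`chrom_start`, `tx_start` and so on) rather than real UCSC records:

```ruby
WINDOW = 60_000 # flanking distance in bases, as in the query above

# Same test as the SQL join condition: a SNP and a gene are "near" each other
# unless one ends more than WINDOW bases before the other starts.
def near?(snp, gene, window = WINDOW)
  return false unless snp[:chrom] == gene[:chrom]
  !(gene[:tx_end] + window < snp[:chrom_start] ||
    snp[:chrom_end] + window < gene[:tx_start])
end

# Hypothetical coordinates, invented for illustration (not real annotations).
genes = [
  { name: "geneA", chrom: "chr1", tx_start: 100_000, tx_end: 120_000 },
  { name: "geneB", chrom: "chr1", tx_start: 500_000, tx_end: 550_000 },
  { name: "geneC", chrom: "chr2", tx_start: 100_000, tx_end: 120_000 }
]
snps = [
  { name: "snp1", chrom: "chr1", chrom_start: 170_000, chrom_end: 170_001 },
  { name: "snp2", chrom: "chr1", chrom_start: 300_000, chrom_end: 300_001 }
]

# Like the LEFT JOIN, every SNP is reported even when no gene is in range.
results = snps.map do |snp|
  [snp[:name], genes.select { |g| near?(snp, g) }.map { |g| g[:name] }]
end

results.each do |snp_name, hits|
  puts "#{snp_name}: #{hits.empty? ? 'no gene within window' : hits.join(', ')}"
end
```

Scanning every gene for every SNP is fine for five SNPs; for whole-genome lists an indexed database or an interval tool is the better fit.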
Databases: Slide 10 of 24
example SELECT result
Databases: Slide 11 of 24
let’s talk about noSQL
http://www.infoivy.com/2013/07/nosql-database-comparison-chart-only.html
Databases: Slide 12 of 24
(potentially) a good fit for biological data
Databases: Slide 13 of 24
many data sources are “key-value ready” (or close enough)
http://togows.dbcls.jp/entry/pathway/hsa00030/genes.json
[{
  "2821": "GPI; glucose-6-phosphate isomerase [KO:K01810] [EC:5.3.1.9]",
  "2539": "G6PD; glucose-6-phosphate dehydrogenase [KO:K00036] [EC:1.1.1.49]",
  "25796": "PGLS; 6-phosphogluconolactonase [KO:K01057] [EC:3.1.1.31]",
  ...
  "5213": "PFKM; phosphofructokinase, muscle [KO:K00850] [EC:2.7.1.11]",
  "5214": "PFKP; phosphofructokinase, platelet [KO:K00850] [EC:2.7.1.11]",
  "5211": "PFKL; phosphofructokinase, liver [KO:K00850] [EC:2.7.1.11]"
}]
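"Close enough" usually means a little parsing is still needed on the values. A hedged sketch using only Ruby's standard `json` library and three of the entries above pasted inline (no network call); the field names and the regular expressions are assumptions based on the handful of entries shown here:

```ruby
require "json"

# A shortened copy of the TogoWS response shown above (three of the entries).
raw = <<~EOS
  [{
    "2821": "GPI; glucose-6-phosphate isomerase [KO:K01810] [EC:5.3.1.9]",
    "2539": "G6PD; glucose-6-phosphate dehydrogenase [KO:K00036] [EC:1.1.1.49]",
    "5213": "PFKM; phosphofructokinase, muscle [KO:K00850] [EC:2.7.1.11]"
  }]
EOS

# The response is an array holding one object: gene ID => description.
genes = JSON.parse(raw).first

# Split each description into symbol, name and EC number.
# The patterns are assumptions that fit the entries shown, nothing more.
records = genes.map do |id, desc|
  symbol, rest = desc.split("; ", 2)
  { id: id,
    symbol: symbol,
    name: rest.sub(/\s*\[.*\z/, ""),
    ec: desc[/\[EC:([^\]]+)\]/, 1] }
end

records.each { |r| puts r.values_at(:id, :symbol, :ec).join("\t") }
```

Each resulting hash already has the shape of a document that could be stored as-is in a key-value or document store.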
Databases: Slide 14 of 24
schema-free: save first, worry later (= agile)
#!/usr/bin/ruby
require "mongo"
require "json/pure"
require "open-uri"

db = Mongo::Connection.new.db('kegg')
col = db.collection('genes')
j = JSON.parse(open("http://togows.dbcls.jp/entry/pathway/hsa00030/genes.json").read)

j.each do |g|
  gene = Hash.new
  g.each_pair do |key, val|
    gene[:_id] = key
    gene[:desc] = val
    col.save(gene)
  end
end
Ruby code to save JSON from the TogoWS REST service
Databases: Slide 15 of 24
example application - PMRetract
ask later if interested
http://pmretract.heroku.com/
https://github.com/neilfws/PubMed/tree/master/retractions
Databases: Slide 16 of 24
when rows + columns != database
- sometimes a database is overkill
Databases: Slide 17 of 24
example 1 - R/IRanges
Databases: Slide 18 of 24
example 2 - bedtools
http://bedtools.readthedocs.org/en/latest/
Databases: Slide 19 of 24
example 3 - unix join (and the shell in general)
Databases: Slide 20 of 24
when are databases good?
- when data are updated frequently
- when multiple users do the updating
- when queries are complex or ever-changing
- as backends to web applications
Databases: Slide 21 of 24
when are databases not/less good?
- for basic “set operations”
- for sequence data [1] (?)
[1] no time to discuss BioSQL, GBrowse/Bio::DB::GFF, BioDAS etc.
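For the "basic set operations" case, most scripting languages already have what is needed. A small sketch in Ruby, where the two gene lists are hypothetical experiment results (the symbols are borrowed from the KEGG example earlier in the talk):

```ruby
# Hypothetical gene lists from two experiments (contents invented for illustration).
experiment_a = %w[GPI G6PD PGLS PFKM]
experiment_b = %w[PFKM PFKP PFKL GPI]

shared = experiment_a & experiment_b   # intersection: in both lists
either = experiment_a | experiment_b   # union: in at least one list, no duplicates
only_a = experiment_a - experiment_b   # difference: in A but not in B

puts "shared: #{shared.join(', ')}"
puts "either: #{either.join(', ')}"
puts "only A: #{only_a.join(', ')}"
```

No schema, server or import step is involved, which is the point of the slide: for one-off comparisons like this a database adds overhead without adding much.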
Databases: Slide 22 of 24
so how did I answer that email?
options(java.parameters = "-Xmx4g")
library(XLConnect)

wb <- loadWorkbook("~/Downloads/NGS Target list Tumour for Neil.xlsx")

s1 <- readWorksheet(wb, sheet = 1, startCol = 1, endCol = 1, header = FALSE)
s2 <- readWorksheet(wb, sheet = 2, startCol = 1, endCol = 32, header = TRUE)
s4 <- readWorksheet(wb, sheet = 4, startCol = 1, endCol = 3, header = TRUE)

# then use gsub, match, %in% etc. to clean and join the data
# ...
Read spreadsheet into R using the XLConnect package, then “munge”