biopostgres overview bosc 2006 - csstott/biopostgres_overview_bosc_2006.pdf · ucla computer...
BioPostgres
Stott Parker & Ruey-Lung Hsiao
UCLA Computer Science Dept.
www.biopostgres.org
UCLA Center for Computational Biology (CCB)
Evolution of science:
Increasing emphasis on:
• scale
• networking
• informatics

Data exploration science: automation
Computational science: simulation
Analytical science: predictive modeling
Observational science: direct gathering of data
The Future of Science
bScience ? eScience
• large-scale, data-centric, computationally mind-numbing science
  http://research.microsoft.com/workshops/escience2005/
• the future of science (for eScientists): enormous centralized data centers that provide many information services
bScience
• NCBI is a good model for an eScience data center
The importance of extensible database systems
• Jim Gray stresses the importance of extensible databases in eScience: future scientists will spend their days writing large-scale SQL queries (!)
• Even if Jim Gray is not right about this, at some point the scale of biological information requires using database systems for some kind of data management. Think about terabytes… and exabytes…
Aligning protein features & exon structure

    CREATE TABLE exon_feature AS
    SELECT DISTINCT
        e.gene,
        e.exon_no AS exon,
        f.protein_id AS protein,
        f.interval,
        f.type,
        f.description
    FROM
        gene g,
        exon e,
        mrna m,
        protein p,
        protein_feature f
    WHERE
        e.gene = g.gene
        AND e.gene = m.gene
        AND e.gene = p.gene
        AND p.protein_id = f.protein_id
        AND loc_range(e.coding_mrna) + '-2..2'::range
            - range_lower(loc_range(m.coding_mrna))
            @ -- contains
            (f.interval - 1) * 3;

exon_feature
  gene    | exon | protein | interval | type          | description
 ---------+------+---------+----------+---------------+--------------------
  H2-DMb2 |    2 | P35737  | 75..75   | glycosylation | N-linked (GlcNAc...)
  H2-DMb2 |    3 | P35737  | 1..18    | signal        | potential
  H2-DMb2 |    4 | P35737  | 19..112  | domain        | lumenal beta-1
  H2-DMb2 |    4 | P35737  | 75..75   | glycosylation | N-linked (GlcNAc...)
  H2-DMb2 |    5 | P35737  | 113..207 | domain        | lumenal beta-2
  H2-DMb2 |    5 | P35737  | 114..204 | domain        | Ig-like
  H2-DMb2 |    6 | P35737  | 208..218 | domain        | connecting peptide
  H2-DMb2 |    6 | P35737  | 219..239 | transmembrane | potential
  H2-DMb2 |    7 | P35737  | 248..251 | site          | YXXZ motif
  HLA-DMB |    1 | P28068  | 1..18    | signal        | potential
  HLA-DMB |    2 | P28068  | 19..112  | domain        | lumenal beta-1
  HLA-DMB |    2 | P28068  | 110..110 | glycosylation | N-linked (GlcNAc...)
  HLA-DMB |    3 | P28068  | 113..207 | domain        | lumenal beta-2
  HLA-DMB |    3 | P28068  | 114..208 | domain        | Ig-like
  HLA-DMB |    4 | P28068  | 208..218 | domain        | connecting peptide
  HLA-DMB |    4 | P28068  | 219..239 | transmembrane | potential
  HLA-DMB |    5 | P28068  | 248..251 | site          | YXXZ motif

Databases take work to set up, but permit exploration -- asking and quickly getting answers to questions.
The join is a large-scale information connection operator that is both very basic and very annoying to program.
But: Biologists don't like databases
The main programming model is Perl, not SQL
Databases have negative associations
• Inflexibility
• Quirkiness
• Possible slowness
• Possible expense!
• Operations often must be done outside the database; SQL is not usually enough
• Steep learning curve
Why do DBMS get an F in Biology?
1. There aren't that many choices for DBMS
   • … and for open-source DBMS there are few
2. DBMS were designed for business, not science
   • hard-to-change database schemas
   • limited set of data types
   • peculiarities of the SQL query language
   • quirks of query optimization
   • arcane programming models
3. Strong challenges are inherent in bioscience
   • very large scale
   • diverse types of information
   • extremely complex analytical queries
Amazingly Few Large-Scale DBMS Options
Commercial DBMS
• IBM DB2/DiscoveryLink
• Microsoft SQL Server
• Oracle 10g
Open-Source DBMS
• MySQL
• PostgreSQL
Even though these have been designed with scalability as a key design goal, they have scalability limits that bScience will push:

  Maximum:      | MySQL 4.1 | PostgreSQL 8.1 | Oracle 10g
 ---------------+-----------+----------------+----------------
  Database size | Unlimited | Unlimited      | 8 XB (= 8M TB)
  Table size    | 64 TB     | 32 TB          |
  Row size      | 8 KB      | 1.6 TB         |
  Field size    | 8 KB      | 1 GB           | 4 GB
  Columns/Table | 1000      | 250-1600       | 1000
PostgreSQL? Why not just MySQL?
MySQL and PostgreSQL are the primary scalable open-source DBMS
Objective, feature-by-feature comparisons:
• http://troels.arvin.dk/db/rdbms/
• http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
More subjective, focal-issue comparison:
[Table: MySQL 4.1 vs. PostgreSQL 8.1, rated with "+" marks on four criteria: Speed, Stds Compliance, Extensibility, Installed Base. The per-system ratings did not survive extraction.]
Aligning protein features & exon structure
What is really needed here is a new datatype for sequence locations. This query is painful to express without it, and not painful with it.
BioPostgres -- some modular database infrastructure for Computational Biology
BioPostgres = PostgreSQL + Extensions
• PostgreSQL: an open-source, industrial-strength, scalable DBMS
  http://www.postgresql.org
• Extension: a new SQL API, with query operators and tools
Each Extension is a separate package
• Biosequence extensions
• GeneOntology extensions
• SQL datatype extensions
• System management extensions
Working toward complementing other BOSC platforms: BioPerl, BioPython, BioJava, BioSQL, BioConductor, GMOD, …
Extensibility features of PostgreSQL
PostgreSQL was designed specifically to be extensible
Individual databases can be extended with:
• New datatypes
• New functions
• New query operators
• New indexing methods
• New query languages
These can be added or dropped anytime, on the fly
• dynamic linking of implementation libraries as needed
Flexible conventions for user-contributed modules
• implementations are typically in C (like PostgreSQL).
Extending PostgreSQL
Within a given database, one can add new datatypes
A datatype can be added to all databases also
[Diagram: a PostgreSQL server holding the database MyBioStuff (a PostgreSQL database), extended with a Graph datatype & query operators and a Seq Location datatype & query operators.]
PostgreSQL user-contributed modules
A new module (say "PostFoo") typically contains
• New functions (in PostFoo.c)
• SQL interface bindings for these functions (in PostFoo.sql)
The PostFoo module gets downloaded by others into their copy of the PostgreSQL source tree as a new directory
    postgresql-8.*.*/contrib/PostFoo/
In this directory, the source tree owner types
    gmake
    gmake install    # as root
This compiles only the module (NOT PostgreSQL)!
Afterwards anyone can dynamically add the module to any given PostgreSQL database (say "MyBioStuff") with a command like
    psql -d MyBioStuff < PostFoo.sql
See www.biopostgres.org/install.html
BioPostgres Modules
BioPostgres is a collection of modules that extend PostgreSQL for Computational Biology. These modules are basically independent, and can be used separately or in conjunction with others.

  Module    | Purpose
 -----------+-------------------------------------
  BLASTgres | Biosequence data analysis
  GObase    | GeneOntology (GO) analysis
  PostGraph | Graph database extensions
  PostModel | Model base/data mining extensions
  PostMake  | Derivation dependency extensions
Quick overview: BLASTgres
BLASTgres -- extensions for biosequence management
Sequence location datatype: (seq_id, [start,end])
Sequence location operators: loc intersection, etc.
Sequence location query: find overlapping locs, etc.
Access to BLAST services: remote and local servers
URL: http://www.biopostgres.org/BLASTgres/
Two sets of BLASTgres extensions
1. BLASTgres provides "BLAST query", "BLAST hit database"
       SELECT * FROM blast_sequence(…);
2. BLASTgres provides biosequence-related datatypes, with accompanying query operators and indexing methods:
   Sequence range (and: array of range)
       '17679235..17679427'
   Sequence location (and: array of location)
       'NT_011109.15[17679235..17679427]'
   Hit (= high-scoring sequence alignment information)
       ('in1[353..966]', 'HUMAPOE4[3779..4402]', 99.84, 624, 1, 0, 0, 1229)
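The location literal above has the shape seq_id[start..end]. As a rough illustration of what such a datatype has to parse and compare, here is a minimal Python sketch. It is not BLASTgres code: the names Loc, parse_loc and loc_overlaps are hypothetical, both ends are assumed to be inclusive, strand is ignored, and two locations are assumed to overlap only when they sit on the same sequence.

```python
import re
from typing import NamedTuple


class Loc(NamedTuple):
    """A sequence location: an id plus an inclusive [start, end] range (assumed)."""
    seq_id: str
    start: int
    end: int


# Matches literals such as 'NT_011109.15[17679235..17679427]'.
_LOC_RE = re.compile(r"^(?P<seq_id>[^\[\]]+)\[(?P<start>-?\d+)\.\.(?P<end>-?\d+)\]$")


def parse_loc(text: str) -> Loc:
    """Parse 'seq_id[start..end]' into a Loc, or raise ValueError."""
    m = _LOC_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a location literal: {text!r}")
    return Loc(m["seq_id"], int(m["start"]), int(m["end"]))


def loc_overlaps(a: Loc, b: Loc) -> bool:
    """True when both locations are on the same sequence and their ranges intersect."""
    return a.seq_id == b.seq_id and a.start <= b.end and b.start <= a.end


if __name__ == "__main__":
    print(parse_loc("NT_011109.15[17679235..17679427]"))
```

Keeping the sequence id inside the value is what lets a single operator answer "do these two features overlap?" without a separate equality test on an id column.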
BLAST access via BLASTgres queries
1. Simple BLAST queries
    SELECT * FROM local_blast_hit('atcgatcgatcg', 'lab-sequences', 'blastn');
    SELECT * FROM remote_blast_hsp('lab_protein.fasta', 'nr', 'blastp');
    SELECT * FROM remote_blast_hit('NM_010387.2[245..546]', 'nr', 'blastn');
    SELECT * FROM fasta_sequence('/lab/ests', 'P082345', 20, 40);

2. Annotations to BLAST query results
    -- automatic annotation to BLAST query results
    SELECT count(*), species
      FROM annotated_remote_blast_hit('AF101044', 'nr', 'blastn')
     GROUP BY species;

    SELECT *
      FROM annotated_remote_blast_hit('AF101044', 'nr')
     WHERE description LIKE '%SNRPN%' AND species <> 'Homo sapiens';

3. Advanced filtering for BLAST results in SQL
    SELECT subject_location, length
      FROM local_blast_hit('AF101044')
     WHERE evalue < 1E-5 AND bitscore > 800 AND mismatches < 10 AND identity > 40;

    SELECT *
      FROM local_blast_hit('actgactgactgactg', 'ESTs', 'blastn') A, features B
     WHERE A.subject_location && B.feature_location;   -- && means "overlaps"

4. Large-scale BLAST query
    -- BLAST multiple sequences at the same time.
    SELECT * FROM local_blast_hit_all('sequences', 'seq', 'ESTs', 'blastn');

    -- sensitivity test (comparison of BLAST query results using different parameters):
    CREATE TABLE parameter1 ( name TEXT, value TEXT );
    INSERT INTO parameter1 VALUES ( 'WORD_SIZE', '7' );   -- change the default settings
    INSERT INTO parameter1 VALUES ( 'GAPCOSTS', '5 2' );
    CREATE TABLE parameter2 ( name TEXT, value TEXT );
    INSERT INTO parameter2 VALUES ( 'WORD_SIZE', '13' );
    INSERT INTO parameter2 VALUES ( 'GAPCOSTS', '3 1' );
    -- retrieve matches that are not found in both results
    SELECT R1.*, R2.*
      FROM remote_blast_hit_v('AF101044', 'parameter1') R1,
           remote_blast_hit_v('AF101044', 'parameter2') R2
     WHERE (R1.subject_location && R2.subject_location)
       AND (R1.length <> R2.length);
BLASTgres functions

Range Operators
range + range, range + int8, int8 + range, range - range, range - int8,
range * range, range * int4, int4 * range, range | range,
range < range, range <= range, range > range, range >= range,
range << range, range &< range, range && range, range &> range, range >> range,
range = range, range <> range, range @ range, range ~ range, range @< range, range @> range

Range Aggregate
minmax(range)

Range Functions
range_over_left(range, range), range_over_right(range, range), range_left(range, range), range_right(range, range),
range_lt(range, range), range_le(range, range), range_gt(range, range), range_ge(range, range), range_ne(range, range),
range_inside(range, range), range_contains(range, range), range_contained(range, range), range_overlaps(range, range), range_eq(range, range),
range_meets(range, range), range_met_by(range, range), range_starts(range, range), range_started_by(range, range),
range_finishes(range, range), range_finished_by(range, range), range_same_lower(range, range), range_same_upper(range, range),
range_minus(range, range), range_plus(range, range), range_torange(int8, int8), range_maxmin(range, range), range_minmax(range, range),
range_extend(range, int4), range_cmp(range, range), range_union(range, range), range_inter(range, range),
range_size(range), range_upper(range), range_lower(range),
range_times(range, range), range_times(range, int4), range_times(int4, range),
range_plus(range, int8), range_plus(int8, range), range_minus(range, int8)

Location Operators
loc + loc, loc + range, loc + int8, loc - loc, loc - range, loc - int8,
loc * loc, loc * range, loc * int4, loc = loc, loc <> loc, loc <> range,
loc < loc, loc < range, loc <= loc, loc <= range, loc > loc, loc > range, loc >= loc, loc >= range,
loc << loc, loc << range, loc &< loc, loc &< range, loc && loc, loc && range,
loc &> loc, loc &> range, loc >> loc, loc >> range, loc @ loc, loc @ range, loc ~ loc, loc ~ range

Location Functions
loc_range(loc), loc_seqid(loc), loc_size(loc), loc_lower(loc), loc_upper(loc),
loc_positive_strand(loc), loc_negative_strand(loc), loc_same_strand(loc, loc), loc_negate(loc),
loc_eq(loc, loc), loc_eq(loc, range), loc_ne(loc, loc), loc_ne(loc, range),
loc_over_left(loc, loc), loc_over_left(loc, range), loc_over_right(loc, loc), loc_over_right(loc, range),
loc_left(loc, loc), loc_left(loc, range), loc_right(loc, loc), loc_right(loc, range),
loc_lt(loc, loc), loc_lt(loc, range), loc_le(loc, loc), loc_le(loc, range),
loc_gt(loc, loc), loc_gt(loc, range), loc_ge(loc, loc), loc_ge(loc, range),
loc_contains(loc, loc), loc_contains(loc, range), loc_contained(loc, loc), loc_contained(range, loc),
loc_overlaps(loc, loc), loc_overlaps(loc, range), loc_meets(loc, loc), loc_meets(loc, range),
loc_met_by(loc, loc), loc_met_by(loc, range), loc_starts(loc, loc), loc_starts(loc, range),
loc_started_by(loc, loc), loc_started_by(loc, range), loc_finishes(loc, loc), loc_finishes(loc, range),
loc_finished_by(loc, loc), loc_finished_by(loc, range),
loc_minus(loc, loc), loc_minus(loc, integer), loc_minus(loc, range),
loc_plus(loc, loc), loc_plus(loc, integer), loc_plus(integer, loc), loc_plus(loc, range),
loc_times(loc, loc), loc_times(loc, integer), loc_times(integer, loc), loc_times(loc, range),
toloc(cstring, int8, int8), toloc(text, int8, int8), toloc(cstring, range), toloc(text, range),
loc_maxmin(loc, loc), loc_minmax(loc, loc), loc_extend(loc, int4), loc_maxmin(loc, range), loc_minmax(loc, range)

Aggregate functions
coalescing(text, text, text, int4), coalescing(text, text, text)
partition(text, text, text, text, text)
revcom(text), transcribe(text), translate(text), translate(text, int4)
range_agg_state(range[], range), range_agg_final_array(range[]), range_array_aggregate, range_array_enum(range[])
loc_agg_state(loc[], loc), loc_agg_final_array(loc[]), loc_array_aggregate, loc_array_enum(loc[])
rangeset(range), rcount(_range)
sort(_range, text), sort(_range), sort_asc(_range), sort_desc(_range), uniq(_range)
idx(_range, range), subarray(_range, int4, int4), subarray(_range, int4)
_range_concat(_range, _range), _range_overlaps(_range, _range), _range_contains(_range, _range), _range_contained(_range, _range),
_range_eq(_range, _range), _range_ne(_range, _range), _range_union(_range, _range), _range_inters(_range, _range),
_range_push_elem(_range, range), _range_push_array(_range, _range), _range_del_elem(_range, range), _range_union_elem(_range, range),
_range_subtract(_range, _range), _range_contains(_range, range),
_range_contains_interval_any(_range, range), _range_contains_interval_all(_range, range),
_range_contained_interval_any(_range, range), _range_contained_interval_all(_range, range),
_range_overlaps_interval_any(_range, range), _range_overlaps_interval_all(_range, range),
_range_in(cstring), _range_out(_range_key)

BLAST-related functions
fasta_sequence(text, text, int, int), fasta_sequence(text, text), fasta_dbinfo(text)
remote_blast_hit(text, text, text), remote_blast_hit(text, text), remote_blast_hit(text),
remote_blast_hsp(text, text, text), remote_blast_hsp(text, text), remote_blast_hsp(text)
local_blast_hsp(text, text, text, text), local_blast_hsp(text, text, text),
local_blast_hit(text, text, text, text), local_blast_hit(text, text, text),
local_ublast_hit(text, text, text), local_ublast_hit(text, text), local_ublast_hit(text)
genbank_search(text)
Aligning protein features & exon structure
BLASTgres query operators:
    X && Y == range_overlaps(X,Y)
    X @ Y  == range_contains(X,Y)
    X * Y  == range_times(X,Y)
    X + Y  == range_plus(X,Y)
    etc…
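To make the range arithmetic in the exon/feature query concrete, here is a small Python sketch of the last WHERE condition. It is an illustration under assumptions, not the BLASTgres implementation: ranges are taken to be inclusive integer pairs, range + range is assumed to add bounds pairwise, range - int8 to shift both bounds, and range * int4 to scale both bounds. The exon and mRNA coordinates are invented for the example; only the protein interval 19..112 comes from the result table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """An inclusive integer range lower..upper (assumed semantics)."""
    lower: int
    upper: int


def range_plus(x: Range, y: Range) -> Range:
    """range + range: add bounds pairwise, so '+ -2..2' widens by 2 on each side."""
    return Range(x.lower + y.lower, x.upper + y.upper)


def range_shift(x: Range, k: int) -> Range:
    """range + int (or range - int with a negative k): move both bounds."""
    return Range(x.lower + k, x.upper + k)


def range_times(x: Range, k: int) -> Range:
    """range * int: scale both bounds, e.g. codon index to nucleotide offset."""
    return Range(x.lower * k, x.upper * k)


def range_contains(x: Range, y: Range) -> bool:
    """X @ Y on the slide: does x contain y?"""
    return x.lower <= y.lower and y.upper <= x.upper


def range_overlaps(x: Range, y: Range) -> bool:
    """X && Y on the slide: do x and y share at least one position?"""
    return x.lower <= y.upper and y.lower <= x.upper


def feature_in_exon(exon: Range, mrna_start: int, feature: Range) -> bool:
    """Mirror of: exon + '-2..2' - mrna_start  @  (feature - 1) * 3."""
    widened = range_plus(exon, Range(-2, 2))
    relative = range_shift(widened, -mrna_start)
    nucleotides = range_times(range_shift(feature, -1), 3)
    return range_contains(relative, nucleotides)


if __name__ == "__main__":
    # Hypothetical exon at 155..436 on an mRNA whose coding region starts at 101.
    print(feature_in_exon(Range(155, 436), 101, Range(19, 112)))
```

With these made-up coordinates the protein interval 19..112 becomes nucleotide offsets 54..333, the widened and shifted exon becomes 52..337, and the containment test succeeds.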
Indexing for the loc & range datatypes
[Figure: index tree with root node A, internal nodes B and C, and leaf nodes D, E, F, G.
 Internal-node keys: (1,2)(4,160), (3,4)(21,203); (1,1)(4,83), (1,2)(40,160), (3,4)(21,78), (4,4)(56,203).
 Leaf keys: 1(4,70), 1(32,83), 1(53,160), 2(40,93), 3(40,46), 4(160,203), 4(21,78), 4(56,120).
 Query key: 4(60,80); predicate: overlaps.]
Search keys are represented as two bounding intervals (for leaf nodes, the lower bound is equal to the upper bound for the first bounding interval). The bounding interval of an internal node contains those of its subnodes.
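The key layout in the figure can be sketched in a few lines of Python. This is a reading of the slide, not BLASTgres source: the first interval is assumed to bound sequence ids and the second to bound positions, and the grouping of the eight leaf keys into nodes D to G is inferred from the bounding keys printed on the slide. The names Key, leaf, bound and search are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Key(NamedTuple):
    """Two bounding intervals: sequence ids seq_lo..seq_hi and positions pos_lo..pos_hi."""
    seq_lo: int
    seq_hi: int
    pos_lo: int
    pos_hi: int


def leaf(seq: int, lo: int, hi: int) -> Key:
    """A leaf key such as 1(4,70): the sequence interval collapses to one id."""
    return Key(seq, seq, lo, hi)


def bound(keys: List[Key]) -> Key:
    """Smallest key whose two intervals contain every given key."""
    return Key(min(k.seq_lo for k in keys), max(k.seq_hi for k in keys),
               min(k.pos_lo for k in keys), max(k.pos_hi for k in keys))


def overlaps(a: Key, b: Key) -> bool:
    """Both the sequence-id intervals and the position intervals must intersect."""
    return (a.seq_lo <= b.seq_hi and b.seq_lo <= a.seq_hi
            and a.pos_lo <= b.pos_hi and b.pos_lo <= a.pos_hi)


# A node is a list of (key, child) entries; child is None for a leaf entry.
def leaf_node(keys: List[Key]) -> list:
    return [(k, None) for k in keys]


def inner_node(children: List[list]) -> list:
    return [(bound([k for k, _ in child]), child) for child in children]


def search(node: list, query: Key) -> List[Key]:
    """Descend only into entries whose bounding key overlaps the query."""
    hits: List[Key] = []
    for key, child in node:
        if not overlaps(key, query):
            continue
        if child is None:
            hits.append(key)
        else:
            hits.extend(search(child, query))
    return hits


# Leaf grouping inferred from the slide's bounding keys (an assumption).
D = leaf_node([leaf(1, 4, 70), leaf(1, 32, 83)])
E = leaf_node([leaf(1, 53, 160), leaf(2, 40, 93)])
F = leaf_node([leaf(3, 40, 46), leaf(4, 21, 78)])
G = leaf_node([leaf(4, 160, 203), leaf(4, 56, 120)])
B = inner_node([D, E])
C = inner_node([F, G])
A = inner_node([B, C])

if __name__ == "__main__":
    print(search(A, leaf(4, 60, 80)))
```

Under this grouping the computed bounding keys reproduce the ones shown on the slide, and the query 4(60,80) never visits the subtree under B because its sequence interval 1..2 cannot contain sequence 4.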
GiST Indexing in BLASTgres
GiST (Generalized Search Tree) is the PostgreSQL framework for user-defined indexing.
BLASTgres uses GiST indexing for locations, ranges, location arrays, and range arrays.
[Figure: index tree (nodes A; B, C; D, E, F, G) as on the slide "Indexing for the loc & range datatypes", with query key 4(60,80).]
GiST indexing requires user definition of four methods:
• Consistent: determine if a subtree traversal is necessary.
• Union: merge two nodes.
• Penalty: determine the cost of inserting an entry in a node.
• PickSplit: determine how to split a full node.
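As a rough guide to what the four methods compute, here is a toy Python version for plain one-dimensional inclusive intervals with an "overlaps" predicate. It is a simplified sketch, not the C interface PostgreSQL expects and not the BLASTgres implementation; the penalty (growth in interval length) and the split rule (sort by lower bound, cut in half) are illustrative choices only.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (lower, upper)


def consistent(key: Interval, query: Interval) -> bool:
    """Could anything under this key overlap the query? If not, skip the subtree."""
    return key[0] <= query[1] and query[0] <= key[1]


def union(keys: List[Interval]) -> Interval:
    """Bounding interval covering all the given keys."""
    return (min(k[0] for k in keys), max(k[1] for k in keys))


def penalty(key: Interval, new: Interval) -> int:
    """Cost of inserting `new` under `key`: how much the bounding interval grows."""
    merged = union([key, new])
    return (merged[1] - merged[0]) - (key[1] - key[0])


def pick_split(keys: List[Interval]) -> Tuple[List[Interval], List[Interval]]:
    """Split a full node: sort by lower bound and cut the entries in half."""
    ordered = sorted(keys)
    middle = len(ordered) // 2
    return ordered[:middle], ordered[middle:]


if __name__ == "__main__":
    node = [(53, 160), (4, 70), (40, 93), (32, 83)]
    left, right = pick_split(node)
    print(union(left), union(right))
    print(penalty(union(left), (53, 160)))
```

An insert would descend into the child with the lowest penalty, and a search would use consistent at every level to prune subtrees.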
BioPostgres Modules
Many modules are needed! We have only just begun to develop modules we feel are most clearly needed. Please contact us if you have opinions or ideas.
Quick overview: PostGraph
PostGraph -- extensions for graph management
Graph datatype: each graph is a single data object
Graph operators: insert edge, delete node, etc.
Graph query: find connected components, etc.
Access to graph tools: dot/graphviz, prefuse, etc.
URL: http://www.biopostgres.org/PostGraph/
GeneOntology relational database
GO is naturally a graph, distributed as relational tables
This representation is fine for storage but can be awkward for exploration

    SELECT id, name, term_type, acc FROM term LIMIT 7;

  id | name                          | term_type          | acc
 ----+-------------------------------+--------------------+------------
   1 | all                           | universal          | all
   2 | is_a                          | relationship       | is_a
   3 | part_of                       | gene_ontology      | part_of
   4 | mitochondrion inheritance     | biological_process | GO:0000001
   5 | mitochondrial genome maintena | biological_process | GO:0000002
   6 | reproduction                  | biological_process | GO:0000003
   7 | alt_id                        | synonym_type       | alt_id

    SELECT * FROM term2term LIMIT 7;

  id | relationship_type_id | term1_id | term2_id | complete
 ----+----------------------+----------+----------+----------
   1 |                    2 |       10 |        9 |        0
   2 |                    2 |       10 |       13 |        0
   3 |                    2 |       26 |       25 |        0
   4 |                    2 |       10 |       42 |        0
   5 |                    2 |       10 |       50 |        0
   6 |                    2 |       26 |       67 |        0
   7 |                    2 |       26 |       92 |        0
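The seven term2term rows above already form a small graph, which makes the "awkward for exploration" point easy to see: a question such as "which terms are connected?" needs a traversal rather than a single join. A minimal Python sketch over just these sample rows follows. It is not PostGraph code; treating each row as an undirected edge between term1_id and term2_id is an assumption made for the connected-components example, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (term1_id, term2_id) pairs copied from the seven sample rows of term2term.
TERM2TERM: List[Tuple[int, int]] = [
    (10, 9), (10, 13), (26, 25), (10, 42), (10, 50), (26, 67), (26, 92),
]


def connected_components(edges: List[Tuple[int, int]]) -> List[List[int]]:
    """Group node ids into connected components, ignoring edge direction."""
    adjacency: Dict[int, Set[int]] = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen: Set[int] = set()
    components: List[List[int]] = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        component: Set[int] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


if __name__ == "__main__":
    for component in connected_components(TERM2TERM):
        print(component)
```

On the sample rows this yields two groups, one around term 10 and one around term 26; a graph datatype lets the same question be asked inside the database instead of in client code.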
Augmenting the GeneOntology database with a graph database
PostGraph permits GO terms to be stored & queried as graphs
GOtermViewer uses this to provide a Graphical Interface for GO
These are extended in GObase: www.biopostgres.org/GObase/
Quick overview: PostMake
PostMake -- extensions for managing data dependencies
PostMake covers common dependencies in bioscience:
• PostCron: database operations at pre-specified times ("crontab within a database")
• Database dump/load, tracking when things change
• Materialized views, triggered by updates ("make within a database")
PostMakefiles are translated to SQL DDL
URL: http://www.biopostgres.org/PostMake/
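The "make within a database" idea can be illustrated with a short Python sketch: given which tables each derived table is built from, decide what must be rebuilt, and in what order, after some base table changes. This is not PostMake's algorithm or file format. The exon_feature dependencies are taken from the CREATE TABLE ... AS query shown earlier in the talk; feature_summary is a hypothetical second-level table added for the example. It uses graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set

# Derived table -> the tables it is computed from.
DEPS: Dict[str, Set[str]] = {
    "exon_feature": {"gene", "exon", "mrna", "protein", "protein_feature"},
    "feature_summary": {"exon_feature"},  # hypothetical derived table
}


def rebuild_order(deps: Dict[str, Set[str]], changed: Iterable[str]) -> List[str]:
    """Derived tables that are stale after `changed`, listed in a safe rebuild order."""
    stale = set(changed)
    order: List[str] = []
    # static_order() yields every table after all of the tables it depends on.
    for table in TopologicalSorter(deps).static_order():
        if table in deps and stale & deps[table]:
            stale.add(table)
            order.append(table)
    return order


if __name__ == "__main__":
    print(rebuild_order(DEPS, {"protein_feature"}))
```

A trigger-driven materialized view follows the same pattern, with the update on a base table playing the role of `changed`.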
BioPostgres and open-source
Like PostgreSQL, BioPostgres is committed to openness
BioPostgres modules are GPL'ed; most code so far is in C, SQL, and Java
BioPostgres modules are downloadable from SourceForge
We think BioPostgres shows that extensible DBMS have a lot to offer the open-source movement, particularly in bioscience.
Will DBMS ever get an A in Biology?
Actually we don't know.
We do not think DBMS will ever replace files or programming languages in bioscience.
We think however that scalability is a driving issue, and it will impact the effectiveness of files and programming languages in bioscience. And soon.
It does appear that extensible DBMS can suggest steps in the right direction.
THANK YOU!