GUS Overview
June 18, 2002
GUS-3.0
• Supports application and data integration
• Uses an extensible architecture
• Is object-oriented even though it uses an underlying relational database management system (Oracle)
• Warehouse instead of federation for local stable copy
• Uses standards for bulk data exchange (e.g., MAGE)
Genomics Unified Schema
GUS Usage
• Annotation
– of genomes - gene models, sequence features
– of genes - gene function, gene expression, gene regulation
• Data mining
– Develop algorithms and queryable resource
• Publish
– Map identifiers with other resources/databases
– URL for entry retrieval/ad hoc queries in web interface
GUS-3.0 Name Spaces
GUS has 5 name spaces compartmentalizing different types of information.
Namespace    Domain                     Features
Core         Data Provenance            Workflows
Sres         Shared resources           Ontologies
DoTS         Sequence and annotation    Central dogma
RAD          Gene expression            MIAME
TESS         Gene regulation            Grammars
Application Integration: PlasmoDB
[Architecture diagram. Legend: Existing implementation; Future implementation.]
• Data sources: TIGR, Sanger, Stanford, Plasmodium Investigators, Public Databases
• Data types: Genomic Sequence; microArray & SAGE Experiments; Mapping Data; GenBank, InterPro, GO, etc; GSSs & ESTs; Annotation; QTL, POP, SNP, Clinical
• Processing: Automated Analysis & Integration; Annotator’s Interface
• Storage: DoTS, RAD, Core, SRes, TESS (Oracle/SQL); Object Layer
• Access: WWW queries, browsing, & download (Java Servlets & Perl CGI); GenePlot Software; GenePlot CD
GUS Supports Multiple Projects
[Layer diagram.]
• Projects: AllGenes, PlasmoDB, EPConDB, other sites, other projects
• Java Servlets
• Object Layer for Data Loading
• Oracle RDBMS
• Namespaces: Core, SRES, TESS, RAD, DoTS
Main Aspects of GUS Development
• Choice of development tools
– Schema:
   • CREATE TABLE statements
   • Documentation plug-in: input is tab-delimited text
   • UML - Rational Rose, PowerDesigner
– Code: CVS
• Areas to emphasize
– Plug-ins
– Work flow
– TESS
– Proteomics
– Images
• Preferred type of user interface
– JSP
– PHP
Data Integration
[Diagram of the five namespaces and what each integrates.]
• SRes (Ontologies)
– GO
– Species
– Tissue
– Dev. Stage
• Core (Data Provenance)
– Ownership
– Protection
– Algorithms
– Similarity
– Versioning
– Workflow
• DoTS
– Genomic Sequence: genes, gene models; STSs, repeats, etc; cross-species analysis
– Transcribed Sequence: characterize transcripts; RH mapping; library analysis; cross-species analysis; DOTS
– Protein Sequence: domains; function; structure; cross-species analysis
• RAD (Transcript Expression)
– Arrays
– SAGE
– Conditions
• TESS (Gene Regulation)
– Binding Sites
– Patterns
– Grammars
Example query fragments attached to the namespaces: "acute myeloid leukemia" (SRes), "with sequence similarity to c-fos" (Core), "Transcription factors" (DoTS), "up-regulated in" (RAD), "and common promoter motifs" (TESS)
RAD
EST clustering and assembly
GUS
TESS
Genomic alignment and comparative sequence analysis
Identify shared TF binding sites
GUS Approach to Schema
• Think objects
– Parents and children
– Subclassing with views
• Views
– Start with generic Imp table (e.g., NAFeatureImp) that contains base attributes plus generic attributes of various datatypes
– Superclass view (e.g., NAFeature) just has base attributes
– Subclass views (e.g., RNAFeature) have additional attributes using generic attributes
• Strongly-typed
– Tend to avoid “name-value” pairs
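The "subclassing with views" idea above can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: GUS itself runs on Oracle, and the column names used here (subclass_view, string1, int1, rna_type, number_of_exons) are hypothetical stand-ins for the base and generic attributes, not the actual GUS column definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Generic "Imp" table: base attributes plus generic typed columns.
-- All column names below are assumptions made for this sketch.
CREATE TABLE NAFeatureImp (
    na_feature_id  INTEGER PRIMARY KEY,
    na_sequence_id INTEGER NOT NULL,  -- base attribute
    name           TEXT,              -- base attribute
    subclass_view  TEXT NOT NULL,     -- names the subclass a row belongs to
    string1        TEXT,              -- generic attribute
    int1           INTEGER            -- generic attribute
);

-- Superclass view: exposes only the base attributes, for every row.
CREATE VIEW NAFeature AS
SELECT na_feature_id, na_sequence_id, name, subclass_view
FROM NAFeatureImp;

-- Subclass view: base attributes plus generic columns given typed names.
CREATE VIEW RNAFeature AS
SELECT na_feature_id, na_sequence_id, name,
       string1 AS rna_type,
       int1    AS number_of_exons
FROM NAFeatureImp
WHERE subclass_view = 'RNAFeature';
""")

rows = [
    (1, "feat-a", "RNAFeature", "mRNA", 4),
    (1, "feat-b", "GeneFeature", None, None),
]
conn.executemany(
    "INSERT INTO NAFeatureImp "
    "(na_sequence_id, name, subclass_view, string1, int1) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)

# The superclass view sees both rows; the subclass view sees only its own.
all_features = conn.execute("SELECT name FROM NAFeature ORDER BY name").fetchall()
rna_features = conn.execute(
    "SELECT name, rna_type, number_of_exons FROM RNAFeature"
).fetchall()
```

Because each subclass view renames the generic columns, queries against RNAFeature read as strongly typed even though every row lives in one physical table, which is one way to avoid "name-value" pairs.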
DoTS Central Dogma
[Diagram of the central dogma tables.]
• Gene, RNA, Protein
• GeneInstance, RNAInstance, ProteinInstance
• GeneFeature, RNAFeature, ProteinFeature (NAFeature, AAFeature)
• GenomicSequence, RNASequence, ProteinSequence (NASequence, AASequence)
DoTS Schema Has Been Driven By Building Gene Indices
[Pipeline diagram. Labels:]
• Inputs: Genomic Sequence; mRNA/EST Sequence
• Clustering and Assembly; DoTS consensus Sequences
• Gene predictions: GenScan/HMMer, PHAT; Predicted Genes
• SIM4 or BLAT; Gene/RNA cluster assignment; Merge Genes; Gene Index
• RNAs; Proteins; translation: framefinder
• Functional predictions: GO Functions; Protein Motifs; BLAST Similarities; PFAM, Smart, ProDom; BLASTP; BLASTX
• Other computed annotation (EPCR, AssemblyAnatomyPercent, Index Key Words, SNP analysis)
• Annotate DoTS: Manual Annotation Tasks
DoTS Gene Indices Are Based on Clustering and Assembling ESTs
Identify new sequences in GenBank and dbEST
• Remove vector, polyA tails, ribosomal and poor quality sequences
• Mask repeats with RepeatMasker
“Quality” Assembly Sequences
• BLASTN vs self
• BLASTN vs DoTS
• Connected components analysis to form clusters
Clusters of sequences (40 bp length, 92% identity)
• Assemble clusters using CAP4
• Update database
GUS relational database
Iterate to complete build
- Extract consensus sequences
- Block with RepeatMasker
- BLASTN vs self
- Cluster (95% identity, 75 bp overlap)
- Assemble with CAP4
Annotation of DoTS consensus sequences
- protein translations with framefinder
- BLAST analyses vs nrdb, prodom and CDD
- assign description and index keywords
- GOFunction assignment
- EPCR to generate radiation hybrid mapping
- derive assembly -> anatomy mapping
- alignment to genomic DNA
- assignment to “Gene” clusters
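The "connected components analysis to form clusters" step can be illustrated with a small union-find sketch. It assumes the BLASTN results have already been reduced to (query, subject, percent identity, match length) tuples, and it reads the slide's "40 bp length, 92% identity" as minimum thresholds on each hit; both the tuple layout and the function name are assumptions for illustration, not the actual GUS plug-in code.

```python
from collections import defaultdict


def cluster_by_similarity(sequence_ids, hits, min_identity=92.0, min_length=40):
    """Group sequences into connected components of qualifying hits.

    hits: iterable of (query_id, subject_id, percent_identity, match_length).
    Every id in hits is assumed to appear in sequence_ids.
    """
    parent = {seq_id: seq_id for seq_id in sequence_ids}

    def find(x):
        # Walk to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, identity, length in hits:
        if identity >= min_identity and length >= min_length:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q  # merge the two components

    clusters = defaultdict(set)
    for seq_id in sequence_ids:
        clusters[find(seq_id)].add(seq_id)
    return sorted(sorted(members) for members in clusters.values())


ids = ["est1", "est2", "est3", "est4"]
hits = [
    ("est1", "est2", 95.0, 120),
    ("est2", "est3", 93.5, 60),
    ("est3", "est4", 99.0, 30),  # too short, so est4 stays on its own
]
clusters = cluster_by_similarity(ids, hits)
```

Each resulting cluster would then be handed to the assembler (CAP4 in the pipeline above). Note that connected components are transitive: est1 and est3 end up together through est2 even without a direct hit between them.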
[ER diagram. Tables: Analysis, AnalysisInput, AnalysisOutput, AnalysisImplementation, AnalysisParameter. Relationships are drawn with 1 to 0..* multiplicities.]
[ER diagram of RAD tables. Relationships are drawn with 1 to 0..* or 0..1 to 0..* multiplicities. Tables shown: ARRAY, ARRAYANNOTATION, ELEMENTIMP, ELEMENTANNOTATION, ELEMENTRESULTIMP, COMPOSITEELEMENTIMP, COMPOSITEELEMENTANNOTATION, COMPOSITEELEMENTRESULTIMP, CONTROL, CONTROLTYPE, ASSAY, RELATEDASSAY, ASSAYLABELEDEXTRACT, ASSAYGROUPFACTOR, ACQUISITION, RELATEDACQUISITION, ACQUISITIONPARAMETER, QUANTIFICATION, RELATEDQUANTIFICATION, QUANTIFICATIONPARAMETER, EXPERIMENTGROUP, GROUPFACTOR, BIOMATERIALIMP, BIOSAMPLE, BIOSOURCE, BIOSOURCECHARACTERISTIC, BIOMATERIALMEASUREMENT, LABEL, LABELEDEXTRACT, PROTOCOL, TREATMENT, PROCESS, PROCESSTYPE, PROCESSIMPLEMENTATION, PROCESSIMPPARAMETER, PROCESSPARAMETER, ProcessInput, ProcessOutput, ONTOLOGYENTRY.]
RAD 3.0 Schema Incorporates MAGE and Experience With Microarrays
LIMS for Data Analysis. Also holds SAGE.
Status of GUS Namespaces
• Core
– Tables exist, Workflow documented
• Sres
– Tables exist
• DoTS
– Tables exist, some documentation
• RAD
– Version 3.0 to include MAGE, experience
   • Pretty much complete
– Tables exist, mostly documented
• TESS
– Tables ready but not created
Schema Development
• Releases on Sourceforge:
– CREATE TABLE statements
– Table dumps from Core::TableInfo, Core::DatabaseDocumentation
– Gifs of ER diagrams
• Adding tables between releases
– In CVS tree?
– Use message forum for discussion
Documentation
• Schema Browser looks at TableInfo
• Plug-in
– Populates DatabaseDocumentation
– Input:
   Table\t\tDescription of table
   Table\tAttribute\tDescription of attribute
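The tab-delimited input format above can be read with a few lines of Python. This is a minimal sketch of a reader for that three-column layout, not the actual documentation plug-in; the function name, the returned dictionaries, and the sample descriptions are all hypothetical.

```python
def parse_documentation(lines):
    """Split documentation lines into table-level and attribute-level entries.

    Each non-blank line has three tab-separated fields:
    table name, attribute name (empty for a table description), description.
    """
    table_docs = {}
    attribute_docs = {}
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 tab-separated fields, "
                f"got {len(fields)}"
            )
        table, attribute, description = (field.strip() for field in fields)
        if attribute:
            attribute_docs[(table, attribute)] = description
        else:
            table_docs[table] = description
    return table_docs, attribute_docs


# Sample input with made-up descriptions.
sample = [
    "NAFeature\t\tSuperclass view over NAFeatureImp\n",
    "NAFeature\tname\tName of the feature\n",
]
table_docs, attribute_docs = parse_documentation(sample)
```

An empty middle field marks a table-level description, which is why the first input form has two consecutive tabs.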
GUS Schema Browser
• http://www.cbil.upenn.edu/cgi-bin/GUS30/schemaBrowser.pl?db=GUS30
• Points at GUS30 on CBIL development database server (erebus).
– Need to move? Maintain release view?
• DoTS Tables:
– Central dogma
– Evidence/Similarity
– ProjectLink
– SequenceGroupImp/SequenceGroupExperimentImp
– Plasmomap?
• Other tables of interest?