DBBM CESMG G. Paolella CEINGE. CSI INTERNET CEINGE University Campus


Upload: juan-houston

Post on 27-Mar-2015


TRANSCRIPT

Page 1: DBBM CESMG G. Paolella CEINGE. CSI INTERNET CEINGE University Campus

DBBM CESMG

G. Paolella

CEINGE

Page 2:

CSI

INTERNET

CEINGE

University Campus

Page 3: Research and Services in Bioinformatics

- CAPRI
- Image restoration and analysis
- Comparative Genomics

[Figure: DG-CST scheme. Labels: H. sapiens, M. musculus, CST, Annotation, DG CST, Alignment and identification, LOCUSLINK, EST, ENSEMBL, PROGR.]

Francesco Salvatore 0503

Page 4: Research subjects

- Comparative genomics
  - DG-CST
  - KinWeb
- Non-coding RNAs
  - Bacterial
  - Eukaryotic
- Cell motility

Page 5: Conserved Sequence Tags (CST)

[Figure: DG-CST scheme. Labels: H. sapiens, M. musculus, CST, Annotation, DG CST, Alignment and identification, LOCUSLINK, EST, ENSEMBL, PROGR.]

Page 6: DG-CST

Page 7: DG-CST DB

Page 8: Genome browser

Page 9: KinWeb

Page 10: KinWeb DB

[Figure: panels (a) to (e)]

Page 11: Three genes

[Figure: gene structures in panels a), b) and c). Labels: Ig-I, Ig-II, Ig-III, TM, Tyr Kinase, Ser-Thr Kinase, CST, CSTs, I, II, III.]

Page 12: Pipeline

Multistep process of comparative sequence analysis

1 - Select target genes: about 1000 genes involved in genetic disease
2 - Identify orthologs: based on a combination of ENSEMBL and NCBI information and/or sequence alignment
3 - Preprocess: mask repetitive sequences by passing them through RepeatMasker
4 - Compare: find similar stretches by using BLASTZ
5 - Postprocess and select CSTs: thresholds are identity >= 70% and length >= 100 bp
6 - Insert CSTs into DB: automatic insertion of identified CSTs and preliminary annotation
7 - Automatic CST annotation: based on available resources and according to different criteria
8 - Identify CST subpopulations: analysis of annotation results
9 - Test hypotheses on functional roles: according to literature and experimental data

FINDING CSTs

- Selection of homologous chromosome regions from the human and mouse genomes.
- Masking of repetitive elements with RepeatMasker, to reduce the noise inevitably introduced by repeated sequences.
- Comparison of the selected regions using BLASTZ, a program based on a local similarity algorithm.
- Selection of the definitive set of CSTs based on specified thresholds (identity >= 70%; length >= 100 bp) using StrongHits.
- Insertion of the selected CSTs into the DB and extensive annotation for:
  - type (i.e. intergenic, exonic etc.) according to Ensembl
  - coding capability according to Ensembl
  - distances from other genes and coding regions
  - calculation of Log Score according to the UCSC comparison of human and mouse genomes
- Further analysis of the dataset, looking for subpopulations sharing specific characteristics, using different programs, such as:
  - BLAST of CSTs vs EST, human and other species genomes
  - a program for calculation of the CPS (Coding Potential Score)
  - RNA structure prediction programs
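The threshold step of the pipeline (identity >= 70%, length >= 100 bp) can be illustrated with a minimal Python sketch. This is not the StrongHits program named on the slide; the `Alignment` record, its field names and the sample hits are hypothetical, and only the two thresholds come from the slide.

```python
from dataclasses import dataclass

MIN_IDENTITY = 70.0  # percent identity threshold quoted on the slide
MIN_LENGTH = 100     # length threshold in bp quoted on the slide


@dataclass
class Alignment:
    """One local alignment on the human sequence (hypothetical record layout)."""
    human_start: int
    human_end: int   # exclusive end coordinate
    identity: float  # percent identity over the aligned block

    @property
    def length(self) -> int:
        return self.human_end - self.human_start


def select_csts(alignments):
    """Keep only alignments that meet both thresholds (both are inclusive)."""
    return [a for a in alignments
            if a.identity >= MIN_IDENTITY and a.length >= MIN_LENGTH]


# Made-up example hits, for illustration only.
hits = [
    Alignment(1000, 1150, 82.5),  # passes both thresholds
    Alignment(2000, 2060, 95.0),  # rejected: only 60 bp long
    Alignment(3000, 3400, 64.0),  # rejected: identity below 70%
    Alignment(5000, 5100, 70.0),  # passes: exactly on both thresholds
]
csts = select_csts(hits)
print([(c.human_start, c.length, c.identity) for c in csts])
```

Keeping the thresholds as named constants makes it easy to rerun the selection with different cut-offs.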

Page 13: Pipeline units

Annotation is carried out through a pipeline that goes through the various phases without requiring human assistance. Tasks requiring intensive CPU usage, such as BLAST homology searches, are spread over several collaborating servers using a system specifically developed for load distribution and monitoring.

CST ANNOTATION

[Diagram: CSTs, PHP Scripts, DB, Remote Servers]

- ENSEMBL gene and gene structure data
  - chromosome position
  - type (i.e. intergenic, intronic, exonic, etc.)
  - coding %
  - closest gene and relative distances
  - ...
- UCSC Log Score data
  - Max L-Score
  - Avg L-Score
  - ...
- BLAST: matches with
  - EST
  - other genomes
  - proteins (BlastX)
- RepeatMasker
  - repeats type
  - repeats %
- CPS: Coding Potential Score
- Redundancy
- Overlapping
- ...

Page 14: Non-coding RNAs

[Diagram: DNA, transcription, reverse transcription, mRNA, translation, Proteins, ncRNA. ncRNA classes shown: tRNA, rRNA, antisense, miRNA, snoRNA, snRNA, self-splicing introns. Process labels: transcription/maturation, maturation.]

- Imprinting: H19, AIR
- X inactivation: XIST
- Chromatin structure dynamics: small RNAs
- DNA demethylation: KHPS1a

Page 15: Bacterial SLSs

[Figure: bar chart of SLS number (axis from 0 to 250000) per species, split into genic, antigenic, spanning and intergenic.]

Species
(1) Bacillus anthracis Ames
(2) Bacillus halodurans C-125
(3) Bacillus subtilis 168
(4) Clostridium perfringens
(5) Clostridium tetani E88
(6) Enterococcus faecalis V583
(7) Lactobacillus johnsonii NCC 533
(8) Listeria innocua
(9) Listeria monocytogenes EGD-e
(10) Staphylococcus aureus Mu50
(11) Streptococcus pneumoniae TIGR4
(12) Streptococcus pyogenes MGAS315
(13) Mycoplasma genitalium
(14) Mycoplasma pneumoniae M129
(15) Ureaplasma urealyticum
(16) Corynebacterium diphtheriae strain NCTC13129
(17) Mycobacterium leprae
(18) Mycobacterium tuberculosis H37Rv
(19) Treponema pallidum
(20) Chlamydia pneumoniae AR39
(21) Chlamydia trachomatis serovar D
(22) Campylobacter jejuni NCTC 11168
(23) Helicobacter pylori 26695
(24) Brucella melitensis
(25) Rickettsia conorii
(26) Rickettsia prowazekii Madrid E
(27) Bordetella bronchiseptica RB50
(28) Bordetella parapertussis 12822
(29) Bordetella pertussis
(30) Neisseria meningitidis MC58
(31) Buchnera sp. APS
(32) Escherichia coli K12-MG1655
(33) Escherichia coli O157:H7 (EDL933)
(34) Haemophilus influenzae KW20
(35) Pasteurella multocida
(36) Pseudomonas aeruginosa PAO1
(37) Pseudomonas putida KT2440
(38) Salmonella enterica serovar Typhi CT-18
(39) Salmonella typhimurium LT2 SGSC1412
(40) Vibrio cholerae El Tor N16961 chr1
(41) Yersinia pestis CO92
(42) Aquifex aeolicus VF5

Page 16: SLS Families

Genome: Brucella M 1 (SLS = 44,719)
ClusterDS  Rawclusters  Elements  Avg_identity  Type        %P<=0.001  Known
1          1            47        58            intergenic  43.8       Bru-1
1          4            19        98            intergenic  62.5       Bru-1
1          5            17        59            antigenic   11.8       Bru-1
1          9            8         68            intergenic  12.5       Bru-1
1          6            21        87            intergenic  69.2       Bru-2
1          7            14        98            intergenic  78.9       Bru-1
2          2            31        100           mixed       50.7       Bru-2
2          3            24        98            mixed       71.4       Bru-2
3          8            7         95            mixed       66.7       new Family

Genome: Ent. Faecalis (SLS = 40,991)
ClusterDS  Rawclusters  Elements  Avg_identity  Type        %P<=0.001  Known
1          446          23        91            intergenic  43.2       efa-1
1          447          36        85            intergenic  88.9       efa-1
1          448          39        88            intergenic  95.9       efa-1
1          451          7         81            intergenic  22.2       efa-1
1          449          21        93            intergenic  43.2       efa-1
2          452          9         100           antigenic   0          new Family
3          450          7         95            antigenic   24         new Family

Genome: Bac. Anthracis (SLS = 65,220)
ClusterDS  Rawclusters  Elements  Avg_identity  Type        %P<=0.001  Known
1          453          12        80            intergenic  48         bcr-1
1          454          10        81            intergenic  66.7       bcr-1
1          455          7         85            mixed       37.5       bcr-1
2          456          9         59            intergenic  0          new Family

Genome: Vibrio Cholerae 1 (SLS = 45,824)
ClusterDS  Rawclusters  Elements  Avg_identity  Type        %P<=0.001  Known
1          419          51        76            intergenic  50         TPP riboswitch
1          420          33        66            intergenic  12.8       TPP riboswitch
2          421          7         85            intergenic  0          new Family
2          424          7         99            intergenic  0          new Family
3          422          8         100           intergenic  50         new Family
4          423          8         100           antigenic   0          new Family
5          425          8         100           intergenic  18.2       new Family

Page 17: Position in the genome

[Figure: the SLS families table of Page 16 is repeated on this slide. Additional label: Position.]

Page 18: Alignment

[Figure: the SLS families table of Page 16 is repeated on this slide.]

Page 19: Secondary structures

[Figure: the SLS families table of Page 16 is repeated on this slide. Additional labels: RNAz P = 0.99, PFOLD.]

Page 20: Processing time

Time per element (sec)

Operation                 SLSs Proj  CSTs Proj
Number of elements        1          1
BLAST vs self             1          1
BLAST vs hum EST          -          15
BLAST vs mus EST          -          12
BLAST vs Hum Genome       -          13
BLAST vs Mus/Rat Genome   -          10
BLAST vs Small Genomes    -          6
RepeatMasker              3          5
Mfold                     2          -
RandFold                  30         30
RNA-z                     0.5        0.5

Time for the whole dataset (days)

Operation                 SLSs Proj  CSTs Proj
Number of elements        2469003    103340
BLAST vs self             28.6       1.2
BLAST vs hum EST          -          17.9
BLAST vs mus EST          -          14.4
BLAST vs Hum Genome       -          15.5
BLAST vs Mus/Rat Genome   -          12
BLAST vs Small Genomes    -          7.2
RepeatMasker              85.7       6
Mfold                     57.2       -
RandFold                  857.3      35.9
RNA-z                     14.3       0.6

Total time (months)

Operation                 SLSs Proj  CSTs Proj
Number of elements        2469003    103340
ALL                       33.6       3.6
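The day and month figures on this slide follow from the per-element times by simple scaling: seconds per element, times the number of elements, divided by 86,400 seconds per day. The short Python sketch below reproduces that arithmetic. The 31-day month is an assumption, chosen because it is the divisor that matches the totals of 33.6 and 3.6 months; the slide does not state which month length was used.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 31  # assumption: this divisor reproduces the slide's monthly totals

N_SLS = 2_469_003  # elements in the SLSs project (from the slide)
N_CST = 103_340    # elements in the CSTs project (from the slide)

# Seconds per element as (SLSs, CSTs); None marks an operation that is not run.
SEC_PER_ELEMENT = {
    "BLAST vs self": (1, 1),
    "BLAST vs hum EST": (None, 15),
    "BLAST vs mus EST": (None, 12),
    "BLAST vs Hum Genome": (None, 13),
    "BLAST vs Mus/Rat Genome": (None, 10),
    "BLAST vs Small Genomes": (None, 6),
    "RepeatMasker": (3, 5),
    "Mfold": (2, None),
    "RandFold": (30, 30),
    "RNA-z": (0.5, 0.5),
}


def days(sec_per_element, n_elements):
    """Serial running time in days for one operation over a whole dataset."""
    if sec_per_element is None:
        return 0.0
    return sec_per_element * n_elements / SECONDS_PER_DAY


sls_days = {op: days(s, N_SLS) for op, (s, _) in SEC_PER_ELEMENT.items()}
cst_days = {op: days(c, N_CST) for op, (_, c) in SEC_PER_ELEMENT.items()}

sls_months = sum(sls_days.values()) / DAYS_PER_MONTH
cst_months = sum(cst_days.values()) / DAYS_PER_MONTH

print(f"RandFold on SLSs: {sls_days['RandFold']:.1f} days")
print(f"Total: SLSs {sls_months:.1f} months, CSTs {cst_months:.1f} months")
```

These are serial times on a single processor, which is the motivation for distributing the work over a cluster.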

Page 21: Cluster

- 4x14x2 = 112 procs, 2.8 GHz
- 4x14x2 = 112 GB RAM
- 2 GB/s per card, 4 GB/s aggregate

Page 22: Bioinfo portal

Page 23: Bioinformatics services for research already available

• About 100 databases of biological interest accessible through SRS (nucleotide sequences, genomes, mutations, hereditary diseases, enzymes, etc.)

• Integrated system for the analysis of biological data, with over 150 programs for sequence analysis, evolutionary models, mutation studies, proteins, etc.

• Databases built within research projects (DG-CST, KinWEB, etc.)

• Systems for managing experimental data (biological samples, sequences, microscopy images, etc.)

Page 24: Research and services

Research and Services in Bioinformatics

- CAPRI
- Image restoration and analysis
- Comparative Genomics

[Figure: DG-CST scheme, as on Page 3. Labels: H. sapiens, M. musculus, CST, Annotation, DG CST, Alignment and identification, LOCUSLINK, EST, ENSEMBL, PROGR.]

Page 25: Services: who has access?

• CEINGE
• DBBM
• IIGB
• BIOGEM
• Faculty of Medicine
• Faculty of Biotechnology
• Other faculties
• Public (limited access)

Page 26: Services organization

[Diagram labels: WEB SERVER; CAPRI, SRS, PISE; Other, Emboss, Fasta, Blast; User Data, DB; Primary remote databases; ENSEMBL]

Page 27: Graphic interface to programs

Page 28: CAPRI

Page 29: CAPRI workflow

Various operations in a row: Complement -> Translation -> Isoelectric point of the resulting protein.

[Diagram: DNA -> Complement -> Translation -> Isoelectric point]
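The chain of operations on this slide can be sketched in plain Python. This is not how CAPRI runs it (CAPRI chains existing programs); the function names are hypothetical, the codon table is the standard genetic code, and the pKa values are one commonly quoted set used here only as an assumption, so the computed isoelectric point is illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Assumed pKa values (illustrative, not taken from the slides).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA, C_TERM_PKA = 8.6, 3.6


def complement(dna):
    """Base-wise complement (not reversed), as named on the slide."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))


def translate(dna):
    """Translate frame 1 up to the first stop codon; a trailing partial codon is ignored."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def net_charge(protein, ph):
    """Net charge at a given pH from the Henderson-Hasselbalch equation."""
    pos = [N_TERM_PKA] + [POSITIVE_PKA[a] for a in protein if a in POSITIVE_PKA]
    neg = [C_TERM_PKA] + [NEGATIVE_PKA[a] for a in protein if a in NEGATIVE_PKA]
    return (sum(1 / (1 + 10 ** (ph - pk)) for pk in pos)
            - sum(1 / (1 + 10 ** (pk - ph)) for pk in neg))


def isoelectric_point(protein, tol=1e-4):
    """Bisection on pH in [0, 14]; the net charge decreases as pH increases."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(protein, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def run_workflow(dna):
    """Complement -> Translation -> Isoelectric point, each step feeding the next."""
    comp = complement(dna)
    protein = translate(comp)
    return comp, protein, isoelectric_point(protein)


comp, protein, pi = run_workflow("TACTTTTTC")
print(comp, protein, round(pi, 2))
```

Each step takes the previous step's output as its only input, which is what lets a workflow tool connect the operations without user intervention.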

Page 30: CAPRI architecture

[Diagram: CAPRI client/server architecture.]

- Components: CGI, Base Obj., Tasks Obj., ProgramObject, Menu Table, Disk Buffering
- Plugin objects: Pise, CLI Simple, CURL, SOAP, JEMBOSS
- Programs: BLAST, FASTA, EMBOSS, HMMer, Genscan, ClustalW, Phylip
- Server disks
- Legend, relation between objects: use, inheritance
- Legend, other relations: program execution, data transfer, temporal relation

Page 31: Distributed execution

[Diagram labels: Access Server, Cluster, Cluster Nodes]

For each user request, a process is launched on a different node.

Page 32: Cluster activity

[Diagram labels: Web application server (http), Broker, Cluster Manager, DB server (Relational DB), Cluster]

1 - Run a command
2 - Request a node IP
3 - Request the status of the cluster
4 - Search for the best resource and return the corresponding node IP
5 - Launch the command on the node
6 - Return the result
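The six numbered steps can be read as a small request/response protocol. The Python sketch below is one possible reading, not the system's actual code: all class and function names are hypothetical, and "best resource" is assumed to mean the node with the lowest load, which the slide does not specify.

```python
from dataclasses import dataclass


@dataclass
class Node:
    ip: str
    load: float  # hypothetical load metric; lower means more free capacity


class ClusterManager:
    """Holds the current state of the cluster nodes."""

    def __init__(self, nodes):
        self.nodes = nodes

    def status(self):
        # Step 3: report the status of the cluster.
        return list(self.nodes)


class Broker:
    """Chooses the node on which a command should run."""

    def __init__(self, manager):
        self.manager = manager

    def request_node_ip(self):
        # Steps 2-4: ask for the cluster status, pick the best resource
        # (assumed here: lowest load) and return its IP.
        best = min(self.manager.status(), key=lambda n: n.load)
        return best.ip


def run_command(command, broker, launch):
    """Step 1 enters here; step 5 launches on the chosen node; step 6 returns the result."""
    ip = broker.request_node_ip()
    return launch(ip, command)


def fake_launch(ip, command):
    """Stand-in for remote execution; a real system would contact the node."""
    return f"{command} ran on {ip}"


manager = ClusterManager([Node("10.0.0.1", 0.9),
                          Node("10.0.0.2", 0.1),
                          Node("10.0.0.3", 0.5)])
broker = Broker(manager)
result = run_command("blastall", broker, fake_launch)
print(result)
```

Passing `launch` in as a parameter keeps the selection logic separate from the transport used to reach the node.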

Page 33:

Page 34:

[Diagram labels: Broker, virtual node, DB, Grid, node (repeated)]

Page 35: Image archival and management

RESEARCH PROJECT

* Project title
* Experiment name
* Author, group, group leader, etc.

* Cell line
* Culture conditions
* Fixation and inclusion methods, stainings, etc.

* Objective
* Focus position
* Stage position x/y

* Exposure time
* Resolution, etc.

[Diagram: WEB INTERFACE, DB]
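The metadata fields listed on this slide map naturally onto one record per image. The sketch below shows that idea with a Python dataclass stored in an in-memory SQLite table. The field names, units, table name and sample values are all hypothetical; the slides do not describe the real schema or database engine.

```python
import sqlite3
from dataclasses import astuple, dataclass, fields


@dataclass
class ImageRecord:
    """Metadata for one archived image (hypothetical field names and units)."""
    project_title: str
    experiment_name: str
    author: str
    cell_line: str
    culture_conditions: str
    objective: str
    focus_position: float
    stage_x: float
    stage_y: float
    exposure_time_ms: float
    resolution: str


def create_table(conn):
    """Create one column per dataclass field."""
    columns = ", ".join(f.name for f in fields(ImageRecord))
    conn.execute(f"CREATE TABLE images ({columns})")


def archive(conn, record):
    """Insert one record using positional placeholders."""
    marks = ", ".join("?" for _ in fields(ImageRecord))
    conn.execute(f"INSERT INTO images VALUES ({marks})", astuple(record))


conn = sqlite3.connect(":memory:")
create_table(conn)
# Made-up sample values, for illustration only.
archive(conn, ImageRecord("Cell motility", "wound healing", "A. Author",
                          "NIH3T3", "10% serum", "10x", 12.5,
                          100.0, 250.0, 50.0, "1024x768"))
rows = conn.execute(
    "SELECT experiment_name FROM images WHERE cell_line = ?", ("NIH3T3",)
).fetchall()
print(rows)
```

Storing the acquisition parameters next to the image reference is what allows later searches by cell line, objective or experiment.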

Page 36: Image-DB interface

Page 37: IPROC

[Screenshot labels: timelapse at 6 positions, timelapse, actin, wound healing, timelapse 2, adhesion, actin staining]

Page 38: IPROC architecture

[Diagram labels: HPC on Cluster nodes, Gateway, iPage, image area, data + images, page, iPane (x3), proc-steps]

Page 39: Distributed execution of parallel requests

[Diagram labels: Access Server, Cluster, Cluster Nodes]

A tool can require the execution of multiple, simultaneous processes.
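A single tool fanning out into several simultaneous processes can be sketched with Python's standard `concurrent.futures` module. Here worker threads stand in for cluster nodes, and `process_tile` is a hypothetical placeholder for one image-processing step; neither reflects the actual IPROC implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def process_tile(tile):
    """Placeholder processing step: mean intensity of one tile (hypothetical)."""
    return sum(tile) / len(tile)


def run_parallel(tiles, max_workers=4):
    """Submit one task per tile and collect the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_tile, tiles))


means = run_parallel([[1, 2, 3], [4, 4], [10]])
print(means)
```

`pool.map` returns results in the order of the inputs even when tasks finish out of order, so the caller can reassemble the output without extra bookkeeping.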

Page 40: What software may be linked

- PHP internal routines (basic drawing, processing)
- ImageMagick (more advanced processing)
- Image converters
- Special tools (PDL, deconvolution)
- Tools developed in-house (cell tracking)
- ...

Page 41: Advantages

- Convenient graphic interface
- Access to a vast library of image processing steps
- No specific interface requirements
- Remote processing on parallel hardware
- Support for a large number of concurrent users
- System independent (works on Mac, PC, Linux etc.)
- No need to install. A browser is enough.