SFSU Center for Computing for Life Sciences - CCLS. Dragutin Petkovic 1, Chris Smith 2,3, Mike Wong 1,3. 1 - SFSU Department of Computer Science, 2 - SFSU Department of Biology, 3 - SFSU Center for Computing for Life Sciences

Upload: shavonne-conley

Post on 11-Jan-2016


SFSU Center for Computing for Life Sciences - CCLS

Dragutin Petkovic (1), Chris Smith (2,3), Mike Wong (1,3)

1 - SFSU Department of Computer Science

2 - SFSU Department of Biology

3 - SFSU Center for Computing for Life Sciences


Outline

• About Center for Computing for Life Sciences (CCLS) at SFSU – cs.sfsu.edu/ccls/index.html

• What is computing for life sciences?

• CCLS Dell Cluster Computer and its usage

• Chris Smith - Turning Processor Cycles into Research-Based Teaching


cs.sfsu.edu/ccls/index.html


Mission

• CCLS addresses the emerging trend of integrating the life sciences with the computational and mathematical sciences. It involves faculty, researchers, and students from the SFSU departments of Biology, Biochemistry, Computer Science, Mathematics, and Physics, as well as other SFSU departments.

• The broad research program of the center emphasizes investigations in topics ranging from bioinformatics and computational drug discovery to complex data visualization and the development of new paradigms for data modeling, user interfaces, and web engineering in contexts involving the life sciences.

• The CCLS provides an environment for faculty to cooperate, for students to work on multidisciplinary projects, including those involving culmination degrees, and for collaboration with industrial and academic partners. The center also hosts a number of external advisors and collaborators.

ccls.lab.sfsu.edu


CCLS: An Interdisciplinary Collaboration Space


Areas addressed by CCLS projects – they are broad by design

Bioinformatics but also…

• Use of machine learning for analysis and classification of genotypes
• Data management for biology and drug development
• Data visualization
• Mathematical modeling of genetic structures
• Advanced WWW applications and user interfaces
• Serious games in health education
• Sensor networks for biological and environmental applications
• Data mining of biological data
• High performance clusters and SW tools for computing for life sciences

…whatever is next


Range of Activities

• Broad, but focused on fostering research, not direct teaching

– Projects and theses
– Research and publications
– External grants
– Collaboration with industry and academia
– Hosting seminar visitors
– Helping faculty with grants and travel
– IT and high performance computing support
– Also incubators for high tech


CCLS Accomplishments

• 20+ faculty
– COSE (CS, Math, Biology, Chemistry and Biochemistry)
– Health and Human Services, Industrial Design, Philosophy

• Over 19 MS theses in CCLS area since 2004

• Over 28 refereed publications in CCLS area since 2003
– One best paper award & one second best paper award
– Several top awards at COSE science fairs

• NSF Career Grant to Prof. R. Singh for proposed research in CCLS area
– Data management and search for chemical data

• Funding:
– External sources - NSF, CSUPERB, Microsoft, Sun/Agilent, NIH
– Support for CCLS investigators - three rounds of mini grants and travel grants funded by CCLS: 30+ faculty and students funded (about $150K in three years)

• External Collaborators: UCSF, UC Davis, Sun/Agilent, Microsoft, Washington University Genome Center, Lawrence Berkeley National Lab

• Core Computing Resources
– Dell high performance cluster computer
– Teaching cluster & shared application servers
– Climate and power control for independent research groups


CCLS Computing Resources

• A cluster is
– Multiple computers that together offer high compute power
– Working closely together such that they can be viewed as a single computer
– Network/WWW accessible
– Small footprint

• Applications include
– Predicting molecular structure (e.g. protein folding)
– Gene sequence searches (e.g. BLAST searches)
– Genetic similarity comparison between species (e.g. PAUP phylogenetic analysis)
– Predicting RNA secondary structures

CCLS DELL PowerEdge 1955 Quad-Processor Compute Nodes

Mike Wong, M.S., Researcher Programmer


CCLS Cluster Computing Program

• Purpose:
– To support computational biology education and computationally expensive biology research by developing and teaching skills and procuring equipment necessary for high-performance cluster computing (HPCC)

• CCLS HPCC DELL Technical Specifications:
– 40 CPUs Intel Xeon 2.0 GHz
– 40 GB RAM
– 4.0 Terabytes storage
– Gigabit Ethernet
– Dell PowerEdge and Apple XServe technology

• CCLS Instructional Cluster (not shown)
– Provides an educational environment where biology and computer science students can get hands-on experience with clusters
– Isolated from HPCC research cluster
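As a rough illustration of why batches of independent searches map well onto a cluster like this one, the Python sketch below deals a set of query files out across 40 workers in round-robin order. The worker count matches the CPU count listed above; the file names and the `assign_round_robin` helper are hypothetical, and on a real cluster the job scheduler would handle placement itself.

```python
from collections import defaultdict


def assign_round_robin(items, n_workers):
    """Deal items out to n_workers buckets in round-robin order."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    buckets = defaultdict(list)
    for index, item in enumerate(items):
        buckets[index % n_workers].append(item)
    # Return a plain list of lists, one entry per worker (possibly empty).
    return [buckets[w] for w in range(n_workers)]


# Hypothetical batch of 100 query files spread over 40 CPUs.
queries = [f"query_{i:03d}.fst" for i in range(100)]
plan = assign_round_robin(queries, 40)
busiest = max(len(bucket) for bucket in plan)
print(f"{len(plan)} workers, at most {busiest} queries each")
```

Round-robin is only the simplest possible policy; it balances counts, not run times, so uneven query sizes would still leave some workers idle.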


CCLS HPCC Early Contributions and Results

• CCLS HPCC serves 5 research labs at SFSU and is expanding

– Enables Smith Lab to perform thousands of BLAST searches per hour

– Enables Spicer Lab to find a consensus of hundreds of maximum-likelihood phylogeny trees within a day

– Enables Stillman lab to perform protein function prediction on EST datasets

• CCLS HPC cluster and instructional cluster provide a rich environment for biology research and education

CCLS Usage Report (via Ganglia): Smith Lab experiment designed to find gene orthologs responsible for observed behaviors in insects


Summary

• CS and math are becoming critical tools for future advances in biotechnology, and this is an exciting area for research and teaching
• CSU must address this area adequately
• CCLS at SFSU is one example of a working model of Biology/Chemistry/CS/Math/Life Sciences collaboration
• CCLS advocates addressing this area very broadly, NOT only as bioinformatics
• Critical need for infrastructure support: technical support, admin, people, space, networking, SW, HW (NOT ONLY HW)


Turning Processor Cycles into Research-Based Teaching


• Annotation Background

• Genomics Education Partnership

• CCLS Genome Annotation Pipeline

• Biol638/738: Student Genome Annotation


© SmithLab 2007


Given Some Raw Sequence?

GTATCTTATTCGCCATCGAAGCGGTCACACTGGGTGCCGCCGCCAACTTCACTCTTTCCGTTCTGTGAGCGAAAACCGAAAAGTCTGTGCTTTGGTAAGTGTTGCTAAAAGTTCGGAATAATGTTGCATCCCGAGCATTTTCGGGTACATAACTGTTCCACGGCGGTGGTCCAGCAAAGACTAATCGTTATCACGCCTTTCGCAGTTCTTAAATTCACCCGACGAGTCCCTAATACACAATTAAAATGGTTAGGGAGAACAAGGCAGCGTGGAAGGCTCAGTACTTCATCAAGGTTGTGGTAAGTATAGAACCTTATAGAATTCGCTCACTAGCTGGCGCCTGGCTTATGCTGTTAACTGATCCCTCCTCCAGGAACTGTTCGATGAGTTCCCAAAGTGCTTCATCGTGGGCGCCGACAACGTGGGCTCCAAGCAGATGCAGAACATCCGTACCAGCCTGCGTGGACTGGCCGTCGTGCTTATGGGCAAGAACACCATGATGCGCAAGGCCATCCGCGGTCATCTGGAGAACAACCCGCAGCTGGAGAAGCTGCTACCCCACATCAAGGGCAACGTGGGATTCGTGTTCACCAAGGGCGATCTCGCCGAGGTGCGCGACAAGCTGCTGGAGTCCAAGGTGCGCGCCCCCGCCCGTCCCGGCGCTATTGCCCCTCTGCACGTCATCATCCCGGCGCAGAACACCGGCTTGGGACCCGAGAAGACCAGTTTCTTCCAGGCCCTGTCCATCCCGACCAAAATTTCCAAGGGAACAATTGAAATCATCAACGATGTGCCCATCCTGAAGCCTGGCGACAAGGTCGGCGCCTCCGAGGCGACACTGCTCAACATGTTGAACATCTCGCCCTTCTCGTACGGTCTGATTGTCAACCAGGTCTACGACTCCGGCTCGATCTTTTCGCCGGAGATCCTGGACATCAAGCCCGAGGATCTGCGCGCCAAGTTCCAACAGGGAGTGGCCAACTTGGCCGCCGTTTGTTTGTCCGTGGGCTACCCCACCATCGCCTCGGCCCCGCACAGCATTGCCAACGGATTCAAGAATCTGCTGGCCATTGCTGCCACCACCGAGGTGGAGTTCAAGGAGGCGACCACCATCAAGGAGTACATCAAGGACCCCAGCAAGTTCGCCGCAGCTGCTTCGGCTTCGGCTGCCCCCGCGGCCGGCGGAGCTACCGAGAAGAAGGAGGAGGCCAAGAAGCCCGAGTCCGAATCAGAGGAGGAGGACGATGATATGGGTTTCGGTCTGTTCGACTAAGCTGGATCCCGATTGCAGAATGCCCTCTGCGGCGCCCGCGAACCATCGCTTCCGCTTTCGGCGTTTACCCACTAAGACCCTTTGTTATGTT


What Does The Sequence Encode?


5’ UTR START CODING EXON STOP 3’ UTR
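One simple way to begin asking what a sequence encodes is to scan it for open reading frames: a start codon (ATG) followed, in the same reading frame, by a stop codon (TAA, TAG, or TGA). The Python sketch below is a minimal illustration on a short made-up sequence. It looks at the forward strand only and ignores introns, so it is not the annotation method used in the course; `find_orfs` and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs for forward-strand ORFs.

    start is the 0-based index of the ATG and end is the index just past
    the stop codon. min_codons counts codons from ATG up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


toy = "CCATGGCTTAAGG"          # made-up sequence, not from the slide
orfs = find_orfs(toy)
for start, end in orfs:
    print(start, end, toy[start:end])
```

Real gene models also need splice sites, both strands, and supporting evidence such as ESTs, which is why the labelled features on the slide cannot be recovered by a scan like this alone.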


Genome Annotation

The Problem: Too many genomes, not many reliably annotated. Bad gene models make it harder to clone genes and use genomic data in the lab. Reliance on automated annotations means that many analyses are ‘quick & dirty’

www.genomesonline.org/
Liolios et al. NAR 2006 (DOI: 10.1093/nar/gkj145)


The World is Filled with Non-Model Organisms

• Only a few model organisms annotated, only 5 done ‘well’
• Most new genomes are automatically annotated, if at all
• Human curation is poorly funded or not funded
• Little infrastructure exists for normal people to do bioinformatics analyses in their own organisms


Typical Automated Genome Annotation Pipeline

1-2 Gene Prediction Programs

ESTs if you are lucky

Protein coding gene models that are largely incorrect

Rarely other features (miRNA, ncRNA, etc)

• Comparative genomics difficult without high-quality genes

• General frustration by user community to access data, understand it, or manipulate it in novel ways


Automated Annotation is Better Than Some Methods*…


theredsrocket.blogspot.com/2007/04/finals.html


* Methods have not actually been tested

• Web tools & common formats enable distributed annotation
• Easier technology puts annotation within the grasp of students


• Student Driven Community Genome Annotation
• Collaborators (34 US Universities)
– Smith Lab @ SFSU
– Jim Youngblom @ CSU Stanislaus
– Anya Goodman @ California Poly
– Catherine Coyle-Thompson @ CSU Northridge


• Use real research data as a teaching tool


A Student Pathway to Publication


Diagram labels: Raw Sequence; Project Coordination; Course Integration; Computational Analysis; Student Annotators (Biol638 / Biol738); Public Archiving


Students perform analysis and annotation in class

• Biol638/Biol738
– Paired Undergraduate/Graduate Genome Annotation Workshop
– 1 semester, 4 units
– Fall 2007: 20 enrolled, 14 finished
– Taught in SFSU SEGA Teaching Lab
  • 20 iMac G4’s
  • Students can also use their own computer

• Each student annotates 50kb of sequence
– Finds repeats, genes, protein functions, promoters
– Learns basic UNIX, command-line programs

• Pre- and Post-Course Assessment Surveys


All subjects & ages under one roof


CCLS Genome Annotation Pipeline

Genomic sequence

• Repeat Identification: Transposable Elements, Satellite Sequence, Tandem Repeats
– Programs: RepeatMasker, RepeatRunner, TRF4, PILER-DF

• Alignment of EST/cDNA: Complete cDNA, Partial EST, GenBank mRNA
– Programs: SIM4, BLASTN

• Alignment of Protein Data: SwissProt, Known Fly Peptides, GenBank Peptides
– Programs: BLASTX

• ncRNA Predictions: tRNA, miRNA, snoRNAs
– Programs: M-Fold, CARNAC, INFERNAL, tRNA-scan, BLASTN

• Gene Orthology Data: CGL Orthologs, InParanoid Orthologs 1, OrthoMCL
– Programs: SIM4, TBLASTN, BLASTX

RAW results; Customized CCLS Parsers
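The customized CCLS parsers themselves are not shown in the slides. As a hedged sketch of what parsing raw results can look like, the Python below reads the standard 12-column tabular BLAST report (the layout NCBI BLAST+ produces with `-outfmt 6`) and keeps the best-scoring hit per query. Whether the pipeline used this output format is an assumption, and `best_hits` and the sample rows are hypothetical.

```python
import csv
import io

# Column order of the standard 12-column tabular BLAST report.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-5):
    """Return {query: (subject, bitscore)} for the top hit under max_evalue."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        if float(hit["evalue"]) > max_evalue:
            continue  # too weak to report
        score = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or score > current[1]:
            best[hit["qseqid"]] = (hit["sseqid"], score)
    return best


# Made-up rows: two hits for gene1, one weak hit for gene2.
sample = io.StringIO(
    "gene1\tprotA\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200\n"
    "gene1\tprotB\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-60\t230\n"
    "gene2\tprotC\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.5\t30\n"
)
hits = best_hits(sample)
print(hits)
```

Reducing every tool's raw output to a common record like this is what lets results from different programs be compared or loaded into one viewer.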


Students Annotate Genes in Multiple Species

Release 5.1 Annotation

Smith et al. Science 316, 1586 (2007)


Accurate Genome Annotations are the Basis for Comparative Genomics

• Any feature region of interest that can be associated to a sequence
– Protein-coding gene: Splice Variant A, Splice Variant B (5’ UTR, start, stop, 3’ UTR)
– Non-protein-coding RNA: tRNA, microRNA, rRNA
– Pseudogene
– DNA Transposon, Retrotransposon
– Satellite Arrays: (AAGAGAG)n

• Annotation types can match interests of your own researchers
• Comparing annotations between species is highly informative
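Treating an annotation as any feature tied to a region of a sequence suggests a small, uniform record type. The Python sketch below is loosely modelled on the coordinate columns of the GFF format; the `Feature` class, its field names, and the example coordinates are illustrative assumptions, not the schema used by the Smith Lab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    seqid: str          # sequence (e.g. scaffold) the feature sits on
    kind: str           # e.g. "gene", "tRNA", "pseudogene", "transposon"
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    strand: str = "+"   # "+", "-", or "." when strand does not apply

    def __post_init__(self):
        if self.start < 1 or self.end < self.start:
            raise ValueError("need 1 <= start <= end")
        if self.strand not in {"+", "-", "."}:
            raise ValueError("strand must be '+', '-' or '.'")

    @property
    def length(self):
        return self.end - self.start + 1

    def overlaps(self, other):
        """True if both features share at least one base on the same sequence."""
        return (self.seqid == other.seqid
                and self.start <= other.end
                and other.start <= self.end)


gene = Feature("scaffold_1", "gene", 100, 500)
trna = Feature("scaffold_1", "tRNA", 450, 520, "-")
repeat = Feature("scaffold_2", "transposon", 100, 500)
print(gene.length, gene.overlaps(trna), gene.overlaps(repeat))
```

Because every feature type shares the same coordinates-plus-type shape, questions such as "which genes overlap a transposon" reduce to the same interval test regardless of what is being annotated.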


Multiple Fly Genomes Give Students Access to Cutting Edge Research Data

• Currently 12 Drosophilid genomes

• Several more insect genomes

• Possible to do in-depth comparative genomic analyses
– Conserved promoters
– Rates of gene evolution
– New/Lost genes
– Much much more…

229 Annotation authors including C.D. Smith Nature 2007 450 (8),25-40.


Comparative Genomic Analysis

From Biol738 Final Report of Jennifer Placek

Bad D. mojavensis annotation!

D. erecta D. melanogaster


RNA Structure Motifs Conserved Across Species Are Candidates for Further Study


From Biol638 Report by Lucas Hanscom, Spring 2007


CCLS Injects Computing in Biology Courses

• Standardized core facilities that are actively maintained

• Advanced software installation and support

• Custom software development for individual researchers

• Access to faculty and students from other disciplines

• Comfortable collaborative meeting space

• Engaged staff who meet the needs of researchers


Conclusions

1) I write hacky, error-ridden code
2) CCLS Mike Wong fixes my code
– Adapted to cluster
– Error handling
– Scalability
– New analyses & features

CCLS People Make the Difference

Pre-CCLS Code screenshot

#!/bin/csh
set query = /home/cdsmith/results
cd $query
foreach file (*fst)    # current directory
    blastx Pfam-A.fasta $file > $query.results
end

Post-CCLS Code screenshot

#!/share/apps/bin/perl -w

use Datastore::MD5;
use File::Path qw( mkpath );
use Statistics::Descriptive;
use Proc::Daemon;
Proc::Daemon::Init; # This script will continue running after you log out

# ===== INITIALIZE VARIABLES
# It's important to use absolute paths; Proc::Daemon::Init requires it
our $prefix = "/home/mikewong/research/stillmanlab"; # CHANGE THIS VARIABLE
our $path = {
    results => "$prefix/JGI_Project/results",
    queries => "$prefix/JGI_Project/queries",
};

my $job_name = 'anu_blast';
my $species = '/share/apps/data/blastdb/GenBank_v159_aa.fasta';
my $datastore = new Datastore::MD5( root => $path->{ results }, depth => 2 );
mkdir $path->{ results } unless -e $path->{ results };

# ===== READ THE QUERY DIRECTORY FOR FST FILES
opendir DIR, $path->{ queries };
my @files = sort grep { /\.fst$/ } readdir DIR;
closedir DIR;
my $job_processing_times = new Statistics::Descriptive::Full();
open LOG, ">>$prefix/JGI_Project/log";

# ===== GENERATE THE COMMAND FOR EACH FILE/SPECIES COMBINATION
foreach my $file (@files) {
    my $results_path = $datastore->id_to_dir( $file );
    mkpath $results_path unless -e $results_path;
    my $db = $species;
    my $results = "$results_path/$file.blastx";
    my $errors = "$results_path/$file.err";
    my $command = "bsub -J $job_name -e $errors blastx $db $path->{ queries }/$file -o $results "
                . "topcomboN=1 hspsepsmax=100000 wordmask=seg+xnu -B1 -V1 -E0.00001 W=5 T=25 kap";

    # ===== SUBMIT THE COMMAND UNLESS THE RESULTS FILE EXISTS
    unless( -e $results ) {
        `$command`;
        my $delay = int( $job_processing_times->median());
        sleep( $delay );
        $delay = cluster_throttle_control( $job_name, $delay );
        print LOG scalar( localtime() ) . " Analyzing protein '$file' with delay $delay s-- $command\n";
        $job_processing_times->add_data( $delay );
    }
}
close LOG;

#========================================================
# Wait while more than 100 matching jobs are queued, polling less often
# the longer the queue stays full. Returns the total delay in seconds.
sub cluster_throttle_control {
    my $job_name = shift;
    my $delay = shift;
    my $jobs = int( `bjobs | grep $job_name | wc -l` );
    my $wait = 1;
    while( $jobs > 100 ) {
        $delay += $wait;
        sleep( $wait );
        # Test the largest threshold first; in the original order
        # (20, 60, 120, 300) only the first branch could ever match.
        if( $delay > 300 ) { $wait = 60; }
        elsif( $delay > 120 ) { $wait = 30; }
        elsif( $delay > 60 ) { $wait = 15; }
        elsif( $delay > 20 ) { $wait = 5; }
        $jobs = int( `bjobs | grep $job_name | wc -l` );
    }
    return $delay;
}
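The throttling idea in the Perl script above (pause submission while too many jobs are queued, and poll less often the longer the queue stays full) can be sketched independently of any scheduler. The Python below is such a sketch, not a port of the script: `count_jobs` and `sleep` are injected callables so the logic can be exercised without a cluster, and the thresholds mirror those in the script, checked from largest to smallest so that each tier is reachable.

```python
import time


def next_wait(delay):
    """Pick the polling interval (seconds) from the total time already waited."""
    for threshold, wait in ((300, 60), (120, 30), (60, 15), (20, 5)):
        if delay > threshold:
            return wait
    return 1


def throttle(count_jobs, max_jobs=100, delay=0, sleep=time.sleep):
    """Block while count_jobs() exceeds max_jobs; return the updated delay."""
    while count_jobs() > max_jobs:
        wait = next_wait(delay)
        sleep(wait)
        delay += wait
    return delay


# Simulated queue: over the limit for three polls, then it drains.
readings = iter([150, 140, 120, 80])
waited = []
total = throttle(lambda: next(readings), sleep=waited.append)
print(total, waited)
```

Passing the job counter and the sleep function as arguments is a deliberate choice here: in production `count_jobs` would wrap a scheduler query, while in a test it can be a simple stub as shown.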


Acknowledgements

• Bioinformatics & Genome Annotation Class Fall 2007
– Tobias Sayre (Graduate Assistant)

• CCLS Pipeline - Mike Wong
• SFSU COSE Hardware Support - Alan Der
• SFSU COSE Network Support - Tina Easter

Ari A. Ramsey M. Amy S.

Joseph B. Vy N. Elinor V.

Eugenel E. Jennifer P. Tyler W.

Henry H. Bhamini P. Mike W.

Jay K. Marvin S. Lucas H. (S07)


fin


Using the Semantic Web to Link Genes and Behaviors


• Took 200 known behavior genes from flies
• Used CCLS cluster to identify orthologs in ants and bees
• Designed primers to find them in new ant species
• Created networks of genes linked to behaviors


1-Student, 1-Gene Independent Project

• Romeo-Smith HIV Project

– HIV is known to suppress host immune system genes

– HIV Tar RNA secondary structure may act to inhibit through RNAi

• Screen all human genes & genome for novel Tar targets

– Human genes may also adopt Tar-like shapes

• Use CCLS cluster + RNA folding tools to fold all 30,000 human genes
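Full RNA folding requires dedicated tools, but the basic notion of a hairpin-like shape can be illustrated simply: a stretch of sequence followed, after a short loop, by its reverse complement can pair with itself to form a stem. The Python sketch below searches for such candidate stems using Watson-Crick pairs only. It is a toy stand-in for the RNA folding tools mentioned above, and `find_hairpins`, its parameters, and the example sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA string (A-U and G-C pairs only)."""
    return rna.upper().translate(COMPLEMENT)[::-1]


def find_hairpins(rna, stem=4, min_loop=3, max_loop=8):
    """Return (left_start, right_start, loop_len) for candidate stem-loops."""
    rna = rna.upper()
    hits = []
    for i in range(len(rna) - 2 * stem - min_loop + 1):
        target = reverse_complement(rna[i:i + stem])
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if j + stem > len(rna):
                break  # right arm would run off the end of the sequence
            if rna[j:j + stem] == target:
                hits.append((i, j, loop))
    return hits


example = "GGGCAAAAGCCC"   # made-up sequence: GGGC pairs with GCCC
hairpins = find_hairpins(example)
print(hairpins)
```

A search like this is independent per gene, which is what makes screening tens of thousands of genes a natural fit for a cluster.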

www.mcld.co.uk