previous lecture: descriptive statistics. introduction to biostatistics and bioinformatics data...

Previous Lecture: Descriptive Statistics

Upload: joleen-hopkins

Post on 11-Jan-2016


Page 1: Previous Lecture: Descriptive Statistics. Introduction to Biostatistics and Bioinformatics Data types and representations in Molecular Biology This Lecture

Previous Lecture: Descriptive Statistics


Introduction to Biostatistics and Bioinformatics

Data types and representations in Molecular Biology

This Lecture


Learning Objectives

• text formats for some common genomics data types

• formatting text with tag:value pairs

• basic database concepts

• details of the FASTA format

• Data formats in public molecular biology databases
  – GenBank, dbSNP
  – Genome Browsers: BED format

• Database queries: field specific queries


Biologists Collect Lots of Data

• Hundreds of thousands of species
• Millions of articles in scientific journals
• Genetic information:
  – gene names
  – phenotype of mutants
  – location of genes/mutations on chromosomes
  – linkage (distances between genes)


• High-throughput lab technology
  – PCR
  – Gene expression microarrays
  – Rapid, inexpensive DNA sequencing
  – Many methods of collecting genotype data
    • Assays for specific polymorphisms
    • Genome-wide SNP chips

• Must have data quality assessment prior to analysis


Data files

• Various assay technologies/machines collect raw data in custom formats
  • Images
  • Trace files
  • Machine-specific binary formats
• Convert to text to share scientific data
  – Why text?
    • Does not require custom software to read the data
    • Stable for long periods of time across different computing systems (ASCII is universal)
    • Can be smoothly shared across many different computing systems
  – The WWW is built with text (HTML)


Text has many different formats

FASTA

>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACAACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTTGCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACCCACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTGTGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCAGGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCATCTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGATGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG

FASTQ

@SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152
NTCTTTTTCTTTCCTCTTTTGCCAACTTCAGCTAAATAGGAGCTACACTGATTAGGCAGAAACTTGATTAACAGGGCTTAAGGTAACCTTGTTGTAGGCCGTTTTGTAGCACTCAAAGCAATTGGTACCTCAACTGCAAAAGTCCTTGGCCC
+SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152
+50000222C@@@@@22::::8888898989::::::<<<:<<<<<<:<<<<::<<:::::<<<<<:<:<<<IIIIIGFEEGGGGGGGII@IGDGBGGGGGGDDIIGIIEGIGG>[...]
@SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152
NTGTGATAGGCTTTGTCCATTCTGGAAACTCAATATTACTTGCGAGTCCTCAAAGGTAATTTTTGCTATTGCCAATATTCCTCAGAGGAAAAAAGATACAATACTATGTTTTATCTAAATTAGCATTAGAAAAAAAATCTTTCATTAGGTGT
+SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152
#.,')2/@@@@@@@@@@<:<<:778789979888889:::::99999<<::<:::::<<<<<@@@@@::::::IHIGIGGGGGGDGGDGGDDDIHIHIIIII8GGGGGIIHHIIIGIIGIBIGIIIIEIHGGFIHHIIIIIIIGIIFIG

GFF3

##gff-version 3
#!gff-spec-version 1.20
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7425
NC_015867.2 RefSeq cDNA_match 66086 66146 . - . ID=aln0;Target=XM_008204328.1 1 61 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1
NC_015867.2 RefSeq cDNA_match 65959 66007 . - . ID=aln0;Target=XM_008204328.1 62 110 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1
NC_015867.2 RefSeq cDNA_match 65799 65825 . - . ID=aln0;Target=XM_008204328.1 111 137 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1
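Because each of these text formats opens with a recognizable marker, a program can often guess the format from the first non-blank line. A minimal sketch, assuming only the three formats on this slide; the function name `guess_format` is hypothetical, and real files may need stricter checks:

```python
def guess_format(text: str) -> str:
    """Guess FASTA / FASTQ / GFF3 from the first non-blank line of a text file."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip leading blank lines
        if line.startswith("##gff-version"):
            return "GFF3"                 # GFF3 files open with a version directive
        if line.startswith(">"):
            return "FASTA"                # FASTA headers start with '>'
        if line.startswith("@"):
            return "FASTQ"                # FASTQ records start with '@'
        return "unknown"
    return "unknown"                      # empty input


print(guess_format(">URO1 uro1.seq\nCGCAGAAAGAGG"))
print(guess_format("##gff-version 3\n#!gff-spec-version 1.20"))
```

Only the first line is inspected, so this is a quick sanity check before choosing a parser, not a validator.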


tag:value pairs

• A very common way to organize text is with tag:value pairs

• address: Publisher's address (usually just the city, but can be the full address for lesser-known publishers)
• annote: An annotation for annotated bibliography styles (not typical)
• author: The name(s) of the author(s) (in the case of more than one author, separated by and)
• booktitle: The title of the book, if only part of it is being cited
• chapter: The chapter number

• HTML is a tag system to display text in web browsers.

<b>This text is bold</b>
<a href="http://www.w3schools.com">This is a link</a>
<h1 style="font-family:verdana">This is a heading</h1>
<p style="color:green;margin-left:20px;">This is a paragraph.</p>
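Reading tag:value text into a program can be sketched in a few lines. The example assumes one pair per line with the tag ending at the first colon; the function name `parse_tag_value` is hypothetical, and the sample lines are taken from the bibliography fields above:

```python
def parse_tag_value(lines):
    """Turn 'tag: value' lines into a dict; lines without a colon are skipped."""
    record = {}
    for line in lines:
        if ":" not in line:
            continue
        tag, value = line.split(":", 1)   # split on the first colon only
        record[tag.strip()] = value.strip()
    return record


example = [
    "author: The name(s) of the author(s)",
    "booktitle: The title of the book, if only part of it is being cited",
    "chapter: The chapter number",
]
record = parse_tag_value(example)
print(record["chapter"])
```

Splitting on the first colon only lets a value contain further colons of its own.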


What is a Database?

• Structured data
• Information is stored in "records" and "fields"
• Fields are categories
  – Must contain data of the same type
• Records contain data that is related to one object across fields
• A record does not need to have data in every field
• A record is a series of tag-value pairs where fields are the tags

SNP ID | SNPSeqID | Gene | +primer | -primer | Hap A | Hap B | Hap C
D1Mit160_1 | 10.MMHAP67FLD1.seq | lymphocyte antigen 84 | AAGGTAAAAGGCAATCAGCACAGCC | TCAACCTGGAGTCAGAGGCT | C | - | A
M-05554_1 | 12.MMHAP31FLD3.seq | procollagen, type III, alpha | TGCGCAGAAGCTGAAGTCTA | TTTTGAGGTGTTAATGGTTCT | C | - | A
M-05554_2 | X60184 | complement component factor i | ACTTCCAGCCCTGGCTCT | ATATGCCACCAAGAAGCA | A | C | -
M-09947_3 | AF067835 | caspase 8 | TCACAGAGGGAAACATGAAG | CTCCACATTGAACCAAAGCA | G | C | T
M-11415_1 | U02023 | insulin-like growth factor binding protein | GGGAAAAGCCTGAAAGAAGC | AGCTGAAACCGGACATCAAT | T | G | -
D1Mit284_3 | J05234 | nucleolin | TGTTGGAACCGACTTCTTCA | AAGAGTCAAAGAATTTATGGAATGA | G | T | T

Unique Identifier


A Spreadsheet can be a Database

• Columns are Fields
• Rows are Records
• Can search for a term within just one field

• Or combine searches across several fields
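A field-specific search over records can be sketched with a list of dictionaries, one per row. The three records are taken from the SNP table (only some fields are kept, and `None` marks an empty field); the `search` helper is a hypothetical name, not part of any database software:

```python
records = [
    {"SNP ID": "D1Mit160_1", "Gene": "lymphocyte antigen 84",
     "Hap A": "C", "Hap B": None, "Hap C": "A"},
    {"SNP ID": "M-09947_3", "Gene": "caspase 8",
     "Hap A": "G", "Hap B": "C", "Hap C": "T"},
    {"SNP ID": "D1Mit284_3", "Gene": "nucleolin",
     "Hap A": "G", "Hap B": "T", "Hap C": "T"},
]


def search(records, criteria):
    """Return the records whose fields equal every value in criteria (an AND search)."""
    return [r for r in records
            if all(r.get(field) == wanted for field, wanted in criteria.items())]


one_field = search(records, {"Hap A": "G"})                 # search one field
two_fields = search(records, {"Hap A": "G", "Hap B": "C"})  # combine two fields
print([r["SNP ID"] for r in one_field])
print([r["SNP ID"] for r in two_fields])
```

Each added field narrows the result, which is the behaviour of an AND search in a spreadsheet filter.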



Spreadsheet data can be saved as tab or comma separated values

csv

gene,Ovary,embryo(0-3hrs),embryo(3-6hrs),embryo(6-9hrs),Pupal
LOC100118025,0.04541333,0.006205798,0.165735055,2.226200589,2.556445228
LOC100122637,0.233690353,0.007614514,0.217603805,2.044255893,2.496835435
LOC100116733,0.033557481,0.009225546,0.177377903,2.76701782,2.012821249
LOC100120954,0.003250874,0.010542103,1.974338817,2.971542769,0.040325437
LOC100122540,0.483847049,0.01129521,0.286362403,4.180982477,0.037512862
LOC100119626,0.089661159,0.01165491,0.085576525,0.809059218,4.004048189
Scr,0.016751983,0.013304455,0.445865943,0.813361695,3.710715923
LOC100119924,0.685022497,0.016888969,1.618261922,1.182058753,1.49776786
LOC100121348,0.18959044,0.018210136,1.178691029,1.916404302,1.697104093

Tab delimited

Ovary	embryo (0-3hrs)	embryo (3-6hrs)	embryo (6-9hrs)	Pupal
0.130666	0.0178557	0.476863	6.40536	7.35556
0.443061	0.0144366	0.412562	3.87577	4.73383
0.273747	0.0752579	1.44697	22.5721	16.4197
0.0643887	0.208803	39.1049	58.8561	0.798709
8.93599	0.208607	5.28872	77.217	0.692811
0.229979	0.0298946	0.219502	2.07522	10.2703
0.0277432	0.0220337	0.738405	1.34702	6.14537
1.46693	0.0361666	3.4654	2.5313	3.20737
0.408315	0.0392186	2.53851	4.1273	3.655
0.108006	0.0572734	2.08545	10.0762	3.29876
0.151759	3.82547	485.993	530.451	1.24837
0.0793942	0.129111	5.38445	27.6188	0.23297
0.139144	0.180263	1.06842	35.8966	3.07092
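Converting between the two delimited formats is a routine reformatting task. A minimal sketch using Python's standard `csv` module, with two data rows copied from the csv example; the function name `csv_to_tsv` is hypothetical:

```python
import csv
import io

csv_text = (
    "gene,Ovary,embryo(0-3hrs),embryo(3-6hrs),embryo(6-9hrs),Pupal\n"
    "LOC100118025,0.04541333,0.006205798,0.165735055,2.226200589,2.556445228\n"
    "Scr,0.016751983,0.013304455,0.445865943,0.813361695,3.710715923\n"
)


def csv_to_tsv(text):
    """Re-write comma-separated text as tab-separated text."""
    rows = csv.reader(io.StringIO(text))             # parse the comma-separated rows
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)                           # write the same rows with tabs
    return out.getvalue()


tsv_text = csv_to_tsv(csv_text)
print(tsv_text)
```

Using the `csv` module rather than a plain string replace keeps quoted values that contain commas intact.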


Data Formats

• How to organize various types of genetic data?
• Need standard formats
• DNA sequence = GATC, but what about gaps, unknown letters, etc.
  – How many letters per line
  – ?? Spaces, numbers, headers, etc.
  – Store as a string, code as binary numbers, etc.
• Use a completely different format for proteins?


FASTA Format

• In the process of writing a similarity searching program (in 1985), William Pearson designed a simple text format for DNA and protein sequences
• The FASTA format is now universal for all databases and software that handle DNA and protein sequences

>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACAACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTTGCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACCCACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTGTGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCAGGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCATCTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGATGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG

One header line: starts with > and ends with a [return].
All other characters are part of the sequence.
Most software ignores spaces and carriage returns. Some also ignores numbers.
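The rules above are enough to write a small reader. This is a sketch under those rules only, not the parser used by any particular tool: the name `read_fasta` is hypothetical, the demo text is made up, and dropping digits is one of the behaviours the slide says only some software has.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:                 # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []          # start a new record
        elif header is not None:
            # keep letters plus '*' and '-'; spaces and digits are dropped
            chunks.append("".join(ch for ch in line if ch.isalpha() or ch in "*-"))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


demo = ">seq1 demo\nCGCAG AAAGA\n10 GGAGG\n>seq2\nMRHV\n"
for name, seq in read_fasta(demo):
    print(name, len(seq))
```

The same function handles a file with one sequence or many, since every `>` line simply opens a new record.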


Multi-Sequence FASTA file

>FBpp0074027 type=protein; loc=X:complement(16159413..16159860,16160061..16160497); ID=FBpp0074027; name=CG12507-PA; parent=FBgn0030729,FBtr0074248; dbxref=FlyBase:FBpp0074027,FlyBase_Annotation_IDs:CG12507-PA,GB_protein:AAF48569.1,GB_protein:AAF48569; MD5=123b97d79d04a06c66e12fa665e6d801; release=r5.1; species=Dmel; length=294;

MRCLMPLLLANCIAANPSFEDPDRSLDMEAKDSSVVDTMGMGMGVLDPTQPKQMNYQKPPLGYKDYDYYLGSRRMADPYGADNDLSASSAIKIHGEGNLASLNRPVSGVAHKPLPWYGDYSGKLLASAPPMYPSRSYDPYIRRYDRYDEQYHRNYPQYFEDMYMHRQRFDPYDSYSPRIPQYPEPYVMYPDRYPDAPPLRDYPKLRRGYIGEPMAPIDSYSSSKYVSSKQSDLSFPVRNERIVYYAHLPEIVRTPYDSGSPEDRNSAPYKLNKKKIKNIQRPLANNSTTYKMTL

>FBpp0082232 type=protein; loc=3R:complement(9207109..9207225,9207285..9207431); ID=FBpp0082232; name=mRpS21-PA; parent=FBgn0044511,FBtr0082764; dbxref=FlyBase:FBpp0082232,FlyBase_Annotation_IDs:CG32854-PA,GB_protein:AAN13563.1,GB_protein:AAN13563; MD5=dcf91821f75ffab320491d124a0d816c; release=r5.1; species=Dmel; length=87;

MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV

RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS

>FBpp0091159 type=protein; loc=2R:complement(2511337..2511531,2511594..2511767,2511824..2511979,2512032..2512082); ID=FBpp0091159; name=CG33919-PA; parent=FBgn0053919,FBtr0091923; dbxref=FlyBase:FBpp0091159,FlyBase_Annotation_IDs:CG33919-PA,GB_protein:AAZ52801.1,GB_protein:AAZ52801; MD5=c91d880b654cd612d7292676f95038c5; release=r5.1; species=Dmel; length=191;

MKLVLVVLLGCCFIGQLTNTQLVYKLKKIECLVNRTRVSNVSCHVKAINWNLAVVNMDCFMIVPLHNPIIRMQVFTKDYSNQYKPFLVDVKIRICEVIER

RNFIPYGVIMWKLFKRYTNVNHSCPFSGHLIARDGFLDTSLLPPFPQGFYQVSLVVTDTNSTSTDYVGTMKFFLQAMEHIKSKKTHNLVHN

>FBpp0070770 type=protein; loc=X:join(5584802..5585021,5585925..5586137,5586198..5586342,5586410..5586605); ID=FBpp0070770; name=cv-PA; parent=FBgn0000394,FBtr0070804; dbxref=FlyBase:FBpp0070770,FlyBase_Annotation_IDs:CG12410-PA,GB_protein:AAF46063.1,GB_protein:AAF46063; MD5=0626ee34a518f248bbdda11a211f9b14; release=r5.1; species=Dmel; length=257;

MEIWRSLTVGTIVLLAIVCFYGTVESCNEVVCASIVSKCMLTQSCKCELKNCSCCKECLKCLGKNYEECCSCVELCPKPNDTRNSLSKKSHVEDFDGVPELFNAVATPDEGDSFGYNWNVFTFQVDFDKYLKGPKLEKDGHYFLRTNDKNLDEAIQERDNIVTVNCTVIYLDQCVSWNKCRTSCQTTGASSTRWFHDGCCECVGSTCINYGVNESRCRKCPESKGELGDELDDPMEEEMQDFGESMGPFD

How many fields?
What is the key, what is the value?
Does the header have tag:value encoded data?
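One way to explore these questions is to split a header into its key=value pairs. A sketch that assumes the layout seen in the headers above (an identifier, then `key=value;` pairs); the function name `parse_header` is hypothetical, and the header is abridged from the second record (dbxref and MD5 are left out):

```python
def parse_header(header):
    """Split '>ID key=value; key=value; ...' into (identifier, dict of fields)."""
    header = header.lstrip(">").strip()
    identifier, _, rest = header.partition(" ")    # text before the first space
    fields = {}
    for part in rest.split(";"):
        part = part.strip()
        if "=" in part:
            key, value = part.split("=", 1)        # key is everything before '='
            fields[key] = value
    return identifier, fields


header = (">FBpp0082232 type=protein; "
          "loc=3R:complement(9207109..9207225,9207285..9207431); "
          "ID=FBpp0082232; name=mRpS21-PA; parent=FBgn0044511,FBtr0082764; "
          "release=r5.1; species=Dmel; length=87;")
identifier, fields = parse_header(header)
print(identifier, len(fields), fields["name"])
```

Every value comes back as text, so a numeric field such as `length` still needs `int()` before it can be compared with the sequence length.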


Other Standards?

• Other types of important medical and genetic data may or may not have universal standards:
  – Genotype/haplotype
  – Clinical records
  – Gene expression
  – Genome annotation
  – Protein structure
  – Alignments
  – Phylogenetic trees


SNPStats

HapStat


Reformatting Data Files

• Much of the routine (yet annoying) work of bioinformatics involves messing around with data files to get them into formats that will work with various software

• Then messing around with the results produced by that software to create a useful summary…


Public Databases

• In addition to your own experimental data, access to public data is essential for epidemiology
  – Complete genome sequences (human and pathogens/vectors)
  – SNPs
  – Genotypes
  – Population Sets
  – Supplemental data for specific Journal articles


GenBank is a Database

• Contains all DNA and protein sequences described in the scientific literature or collected in publicly funded research

• Flatfile: Composed entirely of text– you could print the whole thing out

• Each submitted sequence is a record
• Has fields for Organism, Date, Author, etc.
• Unique identifier for each sequence

– Locus and Accession #


Fields


dbSNP record

Summary
Average Het. +/- std err:
Individual Count | Founders Count | Individual Overlap | Genotype Conflict
2 | 2 | 0 | 0

Validation status
Marker displays Mendelian segregation | PCR results confirmed in multiple reactions | Homozygotes detected in individual genotype data
UNKNOWN | UNKNOWN | UNKNOWN

Reference SNP (refSNP) Cluster Report: rs1042574
Organism: human (Homo sapiens)
Molecule Type: Genomic
Created/Updated in build: 86/141
Map to Genome Build: 106/Weight
Validation Status: byCluster
Variation Class: SNV: single nucleotide variation
RefSNP Alleles: C/T (FWD)
Allele Origin:
Ancestral Allele: C
Variation Viewer: unknown
Clinical Significance: NA
MAF/MinorAlleleCount: NA
MAF Source:
HGVS Names: NC_000014.9:g.24166518C>T, NM_006084.4:c.*322C>T, NT_026437.13:g.5942995C>T

>gnl|dbSNP|rs1042574|allelePos=140|totalLen=345|taxid=9606|snpclass=1|alleles='C/T'|mol=Genomic|build=138
CCTTTTTTTT TTTTWADTTT GAGATATACG CCCTCTTTCA TCTGTAAGGG ACTAGGAAAT TCCAAATGGT GTGAACCCAG GGGGCCTTTC CCTCTTCCCT GACCTCCCAA CTCTAAAGCC AAGCACTTTA TATTTTCCT Y TTAGATATTC MCTAAGGACT TAAMATAAAA TTTTATTGAA AGAGGAATCA GTATCTGATT TTCTGGGAGA AGAAGGTAGC AGTGGTCACA GATAGAGATG TAAACTTAAG AGTGGGGCAC TGGGGTTCTC TTCCTGCTGA CATCTCCAGC CTCTTTCCTC TCCTCTGCCC ACAGGTTCTG GCTAAGAKGC TGCCTGGGCC CTGTG


Accession Numbers!!

• Databases are designed to be searched by accession numbers (and locus IDs)
• These are guaranteed to be non-redundant, accurate, and not to change.
• Searching by gene names and keywords is doomed to frustration and probable failure
• Neither scientists nor computers can be trusted to accurately and consistently annotate database entries

• If only scientists would refer to genes by accession numbers in all published work!


http://www.ncbi.nlm.nih.gov/Genbank

• GenBank is managed by the National Center for Biotechnology Information (NCBI), part of the U.S. National Library of Medicine at the NIH

• Once upon a time, GenBank mailed out sequences on CD-ROM disks a few times per year.

• Now GenBank is over 150 billion bases

• Scientists access GenBank directly over the Web at www.ncbi.nlm.nih.gov


What is GenBank?

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research 2007 Jan;35(Database issue):D21-5). There are approximately 65,369,091,950 bases in 61,132,599 sequence records in the traditional GenBank divisions and 80,369,977,826 bases in 17,960,667 sequence records in the WGS division as of August 2006.


Relational Databases

• Databases can be more complex than a single spreadsheet
• GenBank has proteins and SNPs as well as DNA
• Some fields (e.g. phosphorylation sites) apply to protein, but not DNA
• Better to create a separate spreadsheet format for Protein records
• Each different spreadsheet is called a Table
• Different Tables are linked by key fields
  – (e.g. DNA and protein for same gene)
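Linking two tables through a key field can be sketched with Python's built-in `sqlite3` module. Everything here is an illustrative assumption: the table names, column names, accession strings, and gene names are made-up placeholders and do not reflect how GenBank is actually stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # throwaway in-memory database
conn.executescript("""
    CREATE TABLE dna     (accession TEXT PRIMARY KEY, gene TEXT, organism TEXT);
    CREATE TABLE protein (accession TEXT PRIMARY KEY, gene TEXT, phospho_sites INTEGER);
    INSERT INTO dna     VALUES ('DNA_0001',  'geneA', 'Homo sapiens');
    INSERT INTO dna     VALUES ('DNA_0002',  'geneB', 'Homo sapiens');
    INSERT INTO protein VALUES ('PROT_0001', 'geneA', 3);
""")

# 'gene' is the key field shared by both tables; the JOIN pairs matching rows
rows = conn.execute("""
    SELECT dna.accession, protein.accession, protein.phospho_sites
    FROM dna JOIN protein ON dna.gene = protein.gene
""").fetchall()
conn.close()
print(rows)
```

The protein-only field (`phospho_sites`) lives in its own table, so DNA records never carry an empty column for it.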


Many Tables at NCBI

• The NCBI hosts a huge interconnected database system that, in addition to DNA and protein, includes:
  – Journal Articles (PubMed)
  – Genetic Diseases (OMIM)
  – Polymorphisms (dbSNP)
  – Cytogenetics (CGH/SKY/FISH & CGAP)
  – Gene Expression (GEO)
  – Taxonomy
  – Chemistry (PubChem)


Database Design

A database can only be searched in ways that it was designed to be searched

You can search within a specific Field in a specific Table - and sometimes can combine searches from different Fields and/or Tables

(Boolean: "AND" and "OR" searches)

Bad to search for "human hemoglobin" in a 'Description' field

Much better to search for "Homo sapiens" in 'Organism' AND "HBB" in 'Gene name'
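A field-specific Boolean query like this can be written as a single search string. The sketch below assumes Entrez-style `term[Field]` syntax and builds, without sending, a request URL for NCBI's E-utilities `esearch` service; the helper name `field_query` and the choice of `db=nuccore` are assumptions for illustration.

```python
from urllib.parse import urlencode


def field_query(terms, operator="AND"):
    """Join {field: value} pairs into an Entrez-style '"value"[Field]' query."""
    parts = [f'"{value}"[{field}]' for field, value in terms.items()]
    return f" {operator} ".join(parts)


term = field_query({"Organism": "Homo sapiens", "Gene Name": "HBB"})

# Assumed endpoint: the E-utilities esearch service; no request is made here
base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
url = base + urlencode({"db": "nuccore", "term": term})

print(term)
print(url)
```

The same string can be pasted into the search box on the NCBI web site, which is a quick way to check a query before scripting it.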


Web Query

• Most scientific databases have a web-based query tool
• It may be simple…


… or complex


ENTREZ is the GenBank web query tool


Advanced query interface:


ENTREZ has pre-computed links between Tables

• Relationships between sequences are computed with BLAST

• Relationships between articles are computed with "MeSH" terms (shared keywords)

• Relationships between DNA and protein sequences rely on accession numbers

• Relationships between sequences and PubMed articles rely on both shared keywords and the mention of accession numbers in the articles.


NCBI Databases contain more than just DNA & protein sequences


Other Important Databases

• Genomes
• Proteins
• Biochemical & Regulatory Pathways
• Gene Expression
• Genetic Variation (mutants, SNPs)
• Protein-Protein Interactions
• Gene Ontology (Biological Function)


UCSC Genome Browser

Search by gene name:

or by sequence:


BED format

• Genome Browsers use a BED format that defines a genomic interval as positions on a reference genome.
• An interval can be anything with a location: gene, exon, binding site, region of low complexity, etc.
• BED files can also specify color, width, and some other formatting.

chromosome start end
chr1 213941196 213942363
chr1 213942363 213943530
chr1 213943530 213944697
chr2 158364697 158365864
chr2 158365864 158367031
chr3 127477031 127478198
chr3 127478198 127479365
chr3 127479365 127480532

track name="ItemRGBDemo" description="Item RGB demonstration" itemRgb="On"
chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
chr7 127473530 127474697 Pos3 0 + 127473530 127474697 255,0,0
chr7 127474697 127475864 Pos4 0 + 127474697 127475864 255,0,0
chr7 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255
chr7 127477031 127478198 Neg2 0 - 127477031 127478198 0,0,255
chr7 127478198 127479365 Neg3 0 - 127478198 127479365 0,0,255
chr7 127479365 127480532 Pos5 0 + 127479365 127480532 255,0,0
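Reading BED intervals needs little more than splitting each line on whitespace. A minimal sketch using two lines from the track above; the function name `parse_bed` is hypothetical. It relies on the BED convention that the start coordinate is 0-based and the end coordinate is not included, so the interval length is simply end minus start.

```python
def parse_bed(text):
    """Return (chrom, start, end, extra_fields) for each BED data line."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        # skip blank lines, comments, and browser/track display lines
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        intervals.append((chrom, start, end, fields[3:]))   # optional columns kept as text
    return intervals


bed_text = """track name="ItemRGBDemo" description="Item RGB demonstration" itemRgb="On"
chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
chr7 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255
"""
intervals = parse_bed(bed_text)
lengths = [end - start for _, start, end, _ in intervals]
print(lengths)
```

Only the first three columns are required, so the same function reads the plain chromosome/start/end listing as well as the colored track.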


Lots of additional data can be added as optional "tracks"

- anything that can be mapped to locations on the genome


Ensembl at EBI/EMBL


SNPs (Single Nucleotide Polymorphisms)

• Genetic variation
• Can be alleles of genes
• Also differences in non-coding regions collected from genome sequencing of different individuals
• dbSNP at the NCBI - all public SNP data
• SNP Consortium at CSHL - high quality set


KEGG: Kyoto Encyclopedia of Genes and Genomes

• Enzymatic and regulatory pathways
• Mapped out by EC number and cross-referenced to genes in all known organisms (wherever sequence information exists)

• Parallel maps of regulatory pathways


NCI BioCarta


Protein-Protein Interactions

• Metabolic and regulatory pathways
• Transcription factors
• Co-expression
• Biochemical data
  – crosslinking
  – yeast 2-hybrid
  – affinity tagging

• Useful feedback to genome annotation/protein function and gene expression


BIND - The Biomolecular Interaction Network Database


Gene Ontology

• Genetics is a messy science
• Scientists have been working in isolation on individual species for many years - naming genes, mutants, odd phenotypes
  – "sonic hedgehog"
• Now that we have complete genome sequences, how to reconcile the names across all species?
• Gene Ontology uses a single 3-part system
  – Molecular function (specific tasks)
  – Biological process (broad biological goals - e.g. cell division)
  – Cellular component (location)


Database Search Strategies

• General search principles - not limited to sequence (or to biology)

• Use accession numbers whenever possible
• Start with broad keywords and narrow the search using more specific terms
• Try variants of spelling, numbers, etc.
• Search all relevant databases

• Be persistent!!


Bioinformatics Paradigm

• Find the data
• Download the data
• Reformat the data

• Collect the samples
• Run molecular analysis
• Filter the data

• Run analysis software
• Collect and sort results
• Publish / Data sharing


Summary

• text formats for some common genomics data types
  – need for universal standard data format
  – data types that lack a universal standard
• formatting text with tag:value pairs
• basic database concepts
  – fields, records, unique identifiers
  – tab and csv formats
  – relational databases, tables
• details of the FASTA format
  – header + data
• Data formats in public molecular biology databases
  – GenBank, dbSNP
  – Genome Browsers: BED format
• Database queries: field specific queries
• Gene Ontology


Next Lecture: Probability