Challenges in Bioinformatics
DESCRIPTION
Talk given at the University of Granada
TRANSCRIPT
![Page 1: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/1.jpg)
Bioinformatics: biology by other means
Alberto Labarga
UGR, November 2008
![Page 2: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/2.jpg)
Computational Biology
Bioinformatics[Biological Information]
![Page 3: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/3.jpg)
1859 1866 1870 1900 1902
Toward a scientific theory of heredity
![Page 4: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/4.jpg)
In 1859 Charles Darwin publishes 'On the Origin of Species', which proposes that living beings are the result of natural selection and that all creatures have evolved over the generations through small changes.
![Page 5: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/5.jpg)
Mendel's laws,
published in 1866,
rediscovered in 1900
![Page 6: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/6.jpg)
In 1870, the Swiss scientist Friedrich Miescher isolates the material stored in the cell nucleus, which consists mainly of proteins and nucleic acids. At the time it was believed that the carrier of hereditary information had to be protein, built from 20 amino acids, whereas nucleic acids had only 4 components.
![Page 7: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/7.jpg)
At the beginning of the 20th century, Phoebus Levene discovered that DNA is a chain of nucleotides, each made up of a sugar (deoxyribose), a phosphate group and a nitrogenous base, which can be one of four types: adenine, thymine, guanine and cytosine.
![Page 8: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/8.jpg)
Walter Sutton, a graduate student in E. B. Wilson’s
lab at Columbia University, observed that in the
process of cell division, called meiosis, that produces
sperm and egg cells, each sperm or egg receives only
one chromosome of each type. (In other parts of the
body, cells have two chromosomes of each type, one
inherited from each parent.) The segregation pattern
of chromosomes during meiosis matched the
segregation patterns of Mendel’s genes.
![Page 9: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/9.jpg)
1928 1944 1949 1952 1953
The discovery of DNA
![Page 10: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/10.jpg)
1928 Frederick Griffith: the transforming principle
When he mixed R pneumococci with heat-killed S pneumococci, the mice died. Moreover, in the blood of these dead mice Griffith found encapsulated (S) pneumococci.
![Page 11: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/11.jpg)
In 1944 Oswald Avery and his colleagues, who were studying Pneumococcus, the bacterium that causes pneumonia, discovered that bacteria contain nucleic acids and that DNA is the molecule that stores the genes. Later studies with viruses confirmed this theory, even though DNA was still widely believed to be too simple.
![Page 12: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/12.jpg)
Life can be seen as a process of storing and transmitting biological information.
Chromosomes are the carriers of this information.
The information is stored in the form of a molecular code.
To understand life we must identify these molecules and decipher the code.
![Page 13: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/13.jpg)
1949: DNA is duplicated during cell division
Chargaff: A = T and G = C
![Page 14: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/14.jpg)
1952 - Hershey-Chase Experiment
![Page 15: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/15.jpg)
M.H.F. Wilkins, A.R. Stokes, H.R. Wilson:
Molecular Structure of Deoxypentose Nucleic
Acids. Nature 171, 738 (1953)
R.E. Franklin and R.G. Gosling
Molecular Configuration in Sodium
Thymonucleate, Nature 171, 740
(1953)
![Page 16: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/16.jpg)
MOLECULAR STRUCTURE
OF NUCLEIC ACIDS
“We wish to propose a
structure for the salt of
desoxyribose nucleic acid
(DNA). This structure has
novel features which are of
considerable biological
interest”
Nature, 25 April 1953
![Page 17: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/17.jpg)
“It has not escaped our
attention that the specific
pairing we have
postulated immediately
suggests a possible
copying mechanism for
the genetic material.”
![Page 18: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/18.jpg)
The base pairs
![Page 19: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/19.jpg)
![Page 20: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/20.jpg)
1955 1959 1962 1966
In 1955 Ochoa, together with the French-Russian biochemist Marianne Grunberg-Manago, published in the Journal of the American Chemical Society the isolation of an enzyme from the colon bacillus (Escherichia coli) that catalyzes the synthesis of RNA, the intermediary between DNA and proteins. The discoverers named the enzyme "polynucleotide phosphorylase"; it was at first taken to be the cell's RNA-synthesizing enzyme, a role later assigned to RNA polymerase. The discovery of polynucleotide phosphorylase made it possible to prepare synthetic polynucleotides of different base composition, with which Severo Ochoa's group, in parallel with Marshall Nirenberg's group, went on to decipher the genetic code.
![Page 21: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/21.jpg)
![Page 22: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/22.jpg)
![Page 23: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/23.jpg)
When Perutz arrived in Cambridge, the largest molecular structure that had been solved was that of the pigment phthalocyanine, with 58 atoms. A protein has thousands of atoms. Bernal, his supervisor, had taken some X-ray diffraction images of crystals of a protein, pepsin, but had not managed to interpret them. The subject Perutz chose for his thesis was another protein, hemoglobin, the oxygen carrier that gives our blood its red color. Hemoglobin has no fewer than 11,000 atoms. It took him 23 years.
![Page 24: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/24.jpg)
![Page 25: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/25.jpg)
Over the course of several years,
Marshall Nirenberg, Har Gobind Khorana and
Severo Ochoa and their colleagues
elucidated the genetic code – showing
how nucleic acids with their 4-letter
alphabet determine the order of the 20
kinds of amino acids in proteins.
Messenger RNA is interpreted three
letters at a time; a set of three
nucleotides forms a "codon" that
encodes an amino acid. A three-letter
word made of four possible letters can
have 64 (4 x 4 x 4) permutations, which
is more than enough to encode the 20
amino acids in living beings.
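The codon arithmetic above can be checked with a short Python sketch. It enumerates all 4 x 4 x 4 triplets and maps them to amino acids using the standard genetic code written as a compact 64-letter string (bases in T, C, A, G order, `*` for stop). The function name `translate` and the example sequences are illustrative choices, not part of the talk, and the sketch works on DNA letters (T instead of the U found in messenger RNA).

```python
from itertools import product

BASES = "TCAG"  # conventional ordering matching the compact code string below
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Pair each of the 64 triplets (TTT, TTC, TTA, ...) with its amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence three letters at a time (frame 0)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(len(CODON_TABLE))               # number of distinct codons
print(len(set(AMINO_ACIDS) - {"*"}))  # number of distinct amino acids
print(translate("ATGGCCTAA"))         # start codon, alanine, stop
```

Because 64 codons encode only 20 amino acids plus stop signals, several codons map to the same amino acid, which is why the table contains repeated letters.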
![Page 26: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/26.jpg)
![Page 27: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/27.jpg)
From DNA to protein
![Page 28: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/28.jpg)
1970 1971 1975 1977 1980
Understanding the mechanisms, creating the tools
![Page 29: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/29.jpg)
The Central Dogma
![Page 30: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/30.jpg)
Created in 1971
with seven
structures
![Page 31: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/31.jpg)
Recombinant DNA is a DNA molecule formed by joining two heterologous molecules, that is, molecules of different origin.
It is produced using restriction enzymes, which are able to "cut" DNA at specific points.
Put very simply, we "cut" a human gene and "paste" it into the DNA of a bacterium; if, for example, it is the gene that controls insulin production, putting it into a bacterium "forces" the bacterium to produce insulin.
![Page 32: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/32.jpg)
![Page 33: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/33.jpg)
A precursor-RNA may often be matured to
mRNAs with alternative structures. An example
where alternative splicing has a dramatic
consequence is somatic sex determination in the
fruit fly Drosophila melanogaster.
In this system, the female-specific sxl-protein
is a key regulator. It controls a cascade of
alternative RNA splicing decisions that finally
result in female flies.
![Page 34: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/34.jpg)
1981 1982 1983 1985 1987 1990
Understanding the mechanisms, creating the tools
![Page 35: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/35.jpg)
Read out the letters from a DNA sequence
GTGAGGCGCTGC
![Page 36: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/36.jpg)
1983: The polymerase chain reaction (PCR) is a molecular biology technique, described in 1986 by Kary Mullis, whose goal is to obtain a large number of copies of a particular DNA fragment starting from a minimal amount; in theory a single copy of the original fragment, or template, is enough.
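As a rough illustration of why a single template molecule can be enough, the sketch below models idealized PCR amplification, in which every cycle duplicates each copy. The function `pcr_copies` and its `efficiency` parameter are hypothetical names introduced here; real reactions fall short of perfect doubling and eventually plateau.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized copy number after a number of PCR cycles.

    efficiency = 1.0 means every molecule is duplicated in every cycle;
    lower values model an imperfect reaction (an assumption of this sketch).
    """
    return initial_copies * (1.0 + efficiency) ** cycles


for cycles in (10, 20, 30):
    print(cycles, pcr_copies(1, cycles))
```

Under perfect doubling, 30 cycles turn one template into 2**30 copies, roughly a billion molecules.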
![Page 37: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/37.jpg)
Total nucleotides
(Nov 07: 188,490,792,445)
Number of entries
(Nov 07: 106,144,026)
![Page 38: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/38.jpg)
![Page 39: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/39.jpg)
The Human Genome Project (HGP) aims to determine the relative positions of all the nucleotides (or base pairs) of the human genome and to identify the 100,000 genes thought to be present in it.
The project, with a budget of 3 billion dollars, was launched in 1990 by the United States Department of Energy and the National Institutes of Health, with a planned duration of 15 years.
![Page 40: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/40.jpg)
"Imagine several copies of a book, each cut into 10 million little pieces, in such a way that the pieces overlap. Suppose that 1 million pieces have been lost and that the other 9 million are stained with ink. Recover the original text."
![Page 41: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/41.jpg)
![Page 42: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/42.jpg)
HUGO: Idealized representation of the hierarchical shotgun sequencing strategy. A library is constructed by
fragmenting the target genome and cloning it into a large-fragment cloning vector; here, BAC vectors are shown. The
genomic DNA fragments represented in the library are then organized into a physical map and individual BAC clones
are selected and sequenced by the random shotgun strategy. Finally, the clone sequences are assembled to reconstruct
the sequence of the genome.
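The final assembly step can be illustrated with a toy greedy assembler: repeatedly merge the two fragments with the longest suffix-prefix overlap. This is only a sketch of the general idea, not the assembler used by the Human Genome Project; the function names and the text fragments (which echo the book analogy) are made up for the example, and real assemblers must also cope with sequencing errors, repeats and reverse-complement reads.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair of fragments until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


fragments = ["the_boo", "_book_of", "k_of_li", "of_life"]
print(greedy_assemble(fragments))
```

The hierarchical strategy described above reduces the difficulty of this step by assembling one BAC clone at a time, so that fewer fragments compete for each overlap.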
![Page 43: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/43.jpg)
1990 1995 1996 1997 1998 1999 2001
Deciphering the book of life
![Page 44: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/44.jpg)
S.F. Altschul, et al. (1990), "Basic Local
Alignment Search Tool," J. Molec.
Biol., 215(3): 403-10, 1990. 15,306
citations
Altschul, S.F. et al (1997), “Gapped
BLAST and PSI-BLAST: a new
generation of protein database search
programs”, Nucleic Acids Res., vol. 25,
no. 17, pp. 3389-402.
![Page 45: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/45.jpg)
![Page 46: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/46.jpg)
![Page 47: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/47.jpg)
• SSAHA (Ning et al., 2001)
• http://www.sanger.ac.uk/Software/analysis/SSAHA/
• SSAHA is an algorithm for very fast matching and alignment of DNA
sequences. It stands for Sequence Search and Alignment by Hashing
Algorithm. It achieves its fast search speed by converting sequence
information into a `hash table' data structure, which can then be
searched very rapidly for matches.
• BLAT (J. Kent, 2002)
• http://genome.ucsc.edu/cgi-bin/hgBlat
• BLAT on DNA is designed to quickly find sequences of 95% and greater
similarity of length 40 bases or more. It may miss more divergent or
shorter sequence alignments. It will find perfect sequence matches of 33
bases, and sometimes find them down to 20 bases. BLAT on proteins
finds sequences of 80% and greater similarity of length 20 amino acids
or more.
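The hashing idea behind these tools can be sketched in a few lines of Python: index every k-letter word of the database in a hash table, then look up the words of the query and count hits that fall on the same diagonal. This is a simplified illustration and not the actual SSAHA or BLAT implementation; the function names, the value of k and the toy sequences are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(sequences: dict[str, str], k: int) -> dict[str, list[tuple[str, int]]]:
    """Map every k-mer to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seed_hits(query: str, index, k: int):
    """Count k-mer hits per (sequence, diagonal); many hits on one diagonal suggest a match."""
    hits = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for name, pos in index.get(query[qpos:qpos + k], []):
            hits[(name, pos - qpos)] += 1  # diagonal = database offset minus query offset
    return sorted(hits.items(), key=lambda item: -item[1])


database = {"seqA": "GTGAGGCGCTGCATTACG", "seqB": "TTTTCCCCAAAAGGGG"}
index = build_kmer_index(database, k=4)
print(find_seed_hits("GCGCTGCA", index, k=4))
```

Building the index costs time and memory once, but each query then needs only hash lookups, which is where the speed of this family of methods comes from.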
![Page 48: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/48.jpg)
J. Thompson, T. Gibson, D.
Higgins (1994), CLUSTAL W:
improving the sensitivity of
progressive multiple sequence
alignment … Nuc. Acids. Res. 22,
4673 - 4680
![Page 49: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/49.jpg)
Flowchart of computation steps in
Clustal W (Thompson et al., 1994)
Pairwise alignment: calculation of distance matrix
Creation of unrooted neighbor-joining tree
Rooted nJ tree (guide tree) and calculation of sequence weights
Progressive alignment following the guide tree
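The first step of the flowchart can be sketched as follows: align every pair of sequences globally and turn the fraction of identical columns into a distance. The scoring values and the distance definition (1 minus identity over all alignment columns) are simplifying assumptions of this sketch and are not the exact parameters used by Clustal W; the later steps (tree building and progressive alignment) are not shown.

```python
def global_identity(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> float:
    """Needleman-Wunsch global alignment; returns the fraction of identical columns."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back one optimal alignment, counting columns and identical pairs.
    i, j, identical, columns = len(a), len(b), 0, 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return identical / columns if columns else 1.0


def distance_matrix(seqs: list[str]) -> list[list[float]]:
    """Pairwise distances (1 - identity) for a list of sequences."""
    n = len(seqs)
    return [
        [0.0 if i == j else 1.0 - global_identity(seqs[i], seqs[j]) for j in range(n)]
        for i in range(n)
    ]


for row in distance_matrix(["ACGT", "ACGT", "ACGA"]):
    print(row)
```

A matrix like this is the input to the neighbor-joining step that produces the guide tree.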
![Page 50: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/50.jpg)
Other methods
Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for
fast and accurate multiple sequence alignment. J. Mol. Biol, 302, 205–217.
Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high
accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.
Katoh, K., Kuma, K., Toh, H., Miyata, T. (2005) MAFFT version 5:
improvement in accuracy of multiple sequence alignment. Nucleic Acids
Res, 33, 511–518.
Lassmann, T., Sonnhammer, E. (2005) Kalign – an accurate and fast multiple
sequence alignment algorithm. BMC Bioinformatics , 6, 298.
Larkin M.A. et al. (2007) ClustalW and ClustalX version 2. Bioinformatics 2007
23(21): 2947-2948.
![Page 51: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/51.jpg)
Tree of Life
http://tolweb.org/tree/phylogeny.html http://itol.embl.de/
![Page 52: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/52.jpg)
1995: The first complete genome of a free-living organism, Haemophilus influenzae.
![Page 53: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/53.jpg)
1996: The yeast genome is completed: approximately 6,000 genes and 14,000,000 base pairs.
![Page 54: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/54.jpg)
![Page 55: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/55.jpg)
1997: The genome of the bacterium E. coli is sequenced: 4,600 genes and 4.5 million nucleotides.
![Page 56: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/56.jpg)
1998: The genome of the worm Caenorhabditis elegans has 18,000 genes and about 100 million nucleotides.
![Page 57: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/57.jpg)
1999: The complete sequence of chromosome 22 is obtained. The HGP is ahead of schedule. The small number of genes found (about 300) comes as a surprise.
![Page 58: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/58.jpg)
Fire A, Xu S, Montgomery M, Kostas
S, Driver S, Mello C (1998). "Potent
and specific genetic interference by
double-stranded RNA in
Caenorhabditis elegans". Nature 391
(6669): 806–11. doi:10.1038/35888.
PMID 9486653
![Page 59: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/59.jpg)
Hamilton A, Baulcombe D
(1999). "A species of small
antisense RNA in
posttranscriptional gene
silencing in plants". Science
286 (5441): 950–2.
PMID 10542148
![Page 60: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/60.jpg)
Dr Alan Wolffe (1999)
• Epigenetics is heritable changes in gene expression that occur without a change in DNA sequence
• Such changes cannot be attributed to changes in DNA sequence (mutations)
• They are as irreversible as mutations (or difficult to reverse)
![Page 61: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/61.jpg)
![Page 62: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/62.jpg)
Gene prediction
In humans:
~22,000 genes
~1.5% of human DNA
Where are the genes?
![Page 63: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/63.jpg)
the gencode pipeline
1. mapping of known transcript sequences (ESTs, cDNAs, proteins) onto the
human genome
2. manual curation to resolve conflicting evidence
3. additional computational predictions
4. experimental verification
5. FINAL ANNOTATION
![Page 64: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/64.jpg)
August 2008 Bioinformatics tools for Comparative
Genomics of Vectors
Genome annotation - building a pipeline
Genome sequence
Map repeats
Genefinding
Protein-coding genes
Map ESTs Map Peptides
nc-RNAs
Functional annotation
Release
![Page 65: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/65.jpg)
Genefinding - ab initio predictions
Use compositional features of the DNA sequence to define coding
segments (essentially exons)
ORFs
Coding bias
Splice site consensus sequences
Start and stop codons
Each feature is assigned a log likelihood score
Use dynamic programming to find the highest scoring path
Need to be trained using a known set of coding sequences
Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
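Two of the features listed above, open reading frames and start and stop codons, can be illustrated with a minimal forward-strand ORF scanner. This sketch deliberately leaves out coding bias, splice sites, log-likelihood scoring and the dynamic programming step, so it is far simpler than any of the named gene finders; `find_orfs` and `min_codons` are hypothetical names chosen for the example.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int, int]]:
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start is the index of the A in ATG, end is exclusive and includes the stop
    codon, and min_codons counts the codons before the stop (ATG included).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == START:
                start = pos  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ATG in this frame
    return orfs


sequence = "CCATGGCCTAAGG"
for start, end, frame in find_orfs(sequence):
    print(frame, start, end, sequence[start:end])
```

In eukaryotic genomes an ORF scan alone is not enough, because coding regions are split into exons; that is why the slide lists splice site signals and trained statistical models as additional ingredients.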
![Page 66: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/66.jpg)
ab initio prediction
[Diagram: the genome annotated with tracks for coding potential, ATG & stop codons, and splice sites.]
![Page 67: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/67.jpg)
![Page 68: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/68.jpg)
ab initio prediction
Find best prediction
[Diagram: the best-scoring gene model chosen from the coding potential, ATG & stop codon, and splice site tracks along the genome.]
![Page 69: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/69.jpg)
Genefinding - similarity
Use known coding sequence to define coding regions
EST sequences
Peptide sequences
Needs to handle fuzzy alignment regions around splice sites
Needs to attempt to find start and stop codons
Examples: EST2Genome, exonerate, genewise
Use 2 or more genomic sequences to predict genes based on
conservation of exon sequences
Examples: Twinscan and SLAM
![Page 70: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/70.jpg)
Similarity-based prediction
Align
Create prediction
Genome
cDNA/peptide
![Page 71: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/71.jpg)
Example of a simple HMM
EPFL – Bioinformatics I – 05 Dec 2005
Top: model architecture and parameters. Bottom: sequence generation process.
green: state transition probabilities, red: emission probabilities.
Prob(sequence, path|model) = 6.8e-8.
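The quantity Prob(sequence, path | model) is the product of one start probability, the transition probabilities along the path and the emission probability of each symbol from its state. The sketch below computes it for a small two-state model; the states, parameters, sequence and path are invented for illustration and are not those of the slide, so the result differs from the 6.8e-8 quoted above.

```python
import math


def joint_log_prob(seq: str, path: str, start: dict, trans: dict, emit: dict) -> float:
    """Log of P(sequence, path | model) for a hidden Markov model."""
    logp = math.log(start[path[0]]) + math.log(emit[path[0]][seq[0]])
    for t in range(1, len(seq)):
        logp += math.log(trans[path[t - 1]][path[t]])  # state transition
        logp += math.log(emit[path[t]][seq[t]])        # symbol emission
    return logp


# Hypothetical two-state model: E (exon-like) and I (intron-like).
start = {"E": 0.5, "I": 0.5}
trans = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.2, "I": 0.8}}
emit = {
    "E": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "I": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

logp = joint_log_prob("GCA", "EEI", start, trans, emit)
print(math.exp(logp))
```

Working in log space avoids numerical underflow, which matters because probabilities of realistic sequences are far smaller than the example on the slide.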
![Page 72: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/72.jpg)
Automatic Annotation vs Manual
Automatic Annotation
• Quick whole genome analysis ~ weeks
• Consistent annotation
• Use unfinished sequence/shotgun assembly
• No polyA sites/signals, pseudogene
• Predicts ~70% loci
Manual Annotation
• Extremely slow: ~3 months for Chr 6
• Need finished seq
• Flexible, can deal with inconsistencies in data
• Most rules have exception
• Consult publications as well as databases
![Page 73: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/73.jpg)
Analysis: EGASP predictions vs manual annotation
[Figure: four bar charts showing sensitivity (Sn) and specificity (Sp), in percent, at the nucleotide, exon, transcript and gene levels for predictions 9_101_1, 20_79_1, 36_46_1 and 41_77_1.]
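For reference, nucleotide-level Sn and Sp can be computed by comparing the set of predicted coding positions with the annotated ones. The sketch assumes the convention common in gene prediction benchmarks, Sn = TP / (TP + FN) and Sp = TP / (TP + FP); the slide itself does not define the measures, and the intervals below are made up.

```python
def positions(intervals: list[tuple[int, int]]) -> set[int]:
    """Expand half-open (start, end) intervals into a set of nucleotide positions."""
    return {p for start, end in intervals for p in range(start, end)}


def nucleotide_sn_sp(predicted: set[int], annotated: set[int]) -> tuple[float, float]:
    """Sensitivity and specificity of predicted coding nucleotides vs the annotation."""
    tp = len(predicted & annotated)  # predicted and annotated as coding
    fn = len(annotated - predicted)  # annotated coding, missed by the prediction
    fp = len(predicted - annotated)  # predicted coding, absent from the annotation
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


annotated_exons = [(100, 200), (300, 400)]  # hypothetical manual annotation
predicted_exons = [(150, 200), (300, 450)]  # hypothetical automatic prediction
print(nucleotide_sn_sp(positions(predicted_exons), positions(annotated_exons)))
```

Exon, transcript and gene level measures follow the same pattern but count whole features, which is why they are usually lower than the nucleotide level values.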
![Page 74: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/74.jpg)
2002 2004 2005 2007 2010
And this is only the beginning
![Page 75: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/75.jpg)
![Page 76: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/76.jpg)
Genome projects by date (http://www.genomesonline.org)
Date: published complete genomes / ongoing prokaryotic genomes / ongoing eukaryotic genomes
10/3/02: 104 / 316 / 218
8/28/03: 156 / 386 / 246
5/07: 500 / 1500 / 700
10/08: 874 / 2124 / 1004
4000
![Page 77: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/77.jpg)
Illumina / Solexa Genome Analyzer: 2000 Mb / run
Applied Biosystems ABI 3730XL: 1 Mb / day
Roche / 454 Genome Sequencer FLX: 100 Mb / run
Applied Biosystems SOLiD: 3000 Mb / run
[Figure: bases per run (millions) against date of introduction, 1994 to 2006, for ABI 370/377, ABI 3700, ABI 3730 and 454-GS20 (labelled 32,000,000).]
![Page 78: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/78.jpg)
Although humans share 99.9 percent of their genetic information, we have small variations called single nucleotide polymorphisms, or SNPs (pronounced "snip"). It is estimated that there are about 10 million SNPs in the human species, and these differences are thought to be related to greater resistance or susceptibility to diseases and drugs.
![Page 79: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/79.jpg)
VARIATION IN THE HUMAN DNA SEQUENCE
Mutation rate = 10^-8 per site per generation
Number of generations from the common ancestor to present-day humans: 10^4 to 10^5
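These two figures give a back-of-the-envelope estimate of how much two human sequences should differ. The sketch assumes a simple neutral model in which both lineages accumulate mutations independently since their common ancestor, hence the factor 2; that model and the function name are assumptions added here, not statements from the slide.

```python
MUTATION_RATE = 1e-8  # mutations per site per generation, as quoted on the slide


def expected_divergence(generations: float, mu: float = MUTATION_RATE) -> float:
    """Expected fraction of sites that differ between two lineages.

    Each lineage accumulates mu * generations changes per site, so the two
    together differ at roughly 2 * mu * generations of their sites.
    """
    return 2 * mu * generations


for generations in (1e4, 1e5):
    fraction = expected_divergence(generations)
    print(f"{generations:.0e} generations: {fraction:.2%} of sites differ")
```

The resulting range, about 0.02% to 0.2%, brackets the roughly 0.1% difference implied by the statement that humans share 99.9 percent of their genetic information.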
![Page 80: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/80.jpg)
2004: ENCyclopedia Of DNA Elements (ENCODE)
![Page 81: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/81.jpg)
![Page 82: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/82.jpg)
Functional genomics
![Page 83: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/83.jpg)
Comparative genomics
Sequence (DNA/RNA) & phylogeny
Regulation of gene expression; transcription factors & micro RNAs
Protein sequence analysis & evolution
Protein families, motifs and domains
Protein structure & function: computational crystallography
Protein interactions & complexes: modelling and prediction
Chemical biology
Pathway analysis
Systems modelling
Image analysis
Data integration & literature mining
![Page 84: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/84.jpg)
DNA copies of the genes of interest are prepared and printed on the chip
The RNA samples of interest (control and sample) are prepared
Reverse transcription
Fluorescent labels are added
The samples are hybridized on the microarray
The chip is excited with two different lasers (Laser 1, Laser 2): the control responds to one of them and the sample to the other
Comparing the two images shows which genes are expressed differently
Schena et al. Science 1995
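The comparison of the two images usually comes down to a per-gene ratio of the two channel intensities, expressed on a log2 scale. The sketch below shows that calculation with made-up intensities; the gene names, the threshold of 1.0 (a two-fold change) and the function names are assumptions of the example, and real analyses first apply background correction and normalization.

```python
import math


def log2_ratios(sample: dict[str, float], control: dict[str, float]) -> dict[str, float]:
    """Per-gene log2(sample / control) for genes measured in both channels."""
    return {gene: math.log2(sample[gene] / control[gene]) for gene in sample if gene in control}


def differentially_expressed(ratios: dict[str, float], threshold: float = 1.0) -> list[str]:
    """Genes whose absolute log2 ratio reaches the threshold (1.0 = two-fold change)."""
    return sorted(gene for gene, ratio in ratios.items() if abs(ratio) >= threshold)


control = {"geneA": 100.0, "geneB": 200.0, "geneC": 50.0}  # hypothetical intensities
sample = {"geneA": 400.0, "geneB": 210.0, "geneC": 10.0}

ratios = log2_ratios(sample, control)
print(ratios)
print(differentially_expressed(ratios))
```

The log scale treats up- and down-regulation symmetrically: a four-fold increase gives +2 and a four-fold decrease gives -2.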
![Page 85: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/85.jpg)
Microarray analysis
Clinical prediction of Leukemia type
• 2 types
– Acute lymphoid (ALL)
– Acute myeloid (AML)
• Different treatment & outcomes
• Predict type before treatment?
Golub et al. Science 286:531-537 (1999)
![Page 86: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/86.jpg)
Biomarkers discovery
Data management -> statistical analysis -> annotation -> network analysis -> selection
30,000 genes -> 1500 genes -> 150 genes -> 50 elements -> 10 targets
![Page 87: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/87.jpg)
RT-PCR Standard Processing Procedure
TaqMan Assays
Step 1: Calculate Ct with SDS and export text file
Step 2: Retrieve data and define experiment design
Step 3: Biological Replicates
Step 4: Selection of Optimal Endogenous Controls & Calculation of ΔCt
Step 5: Differential Expression Analysis ΔΔCt
! Overview Plates & Samples
! Quality Control: Raw Values
! Discard Samples
! Quality Control: ΔCt Overview
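The ΔCt and ΔΔCt calculations named in the procedure can be sketched as follows. The sketch assumes perfect amplification efficiency, so that one cycle of difference corresponds to a two-fold change; the function names and Ct values are invented for illustration.

```python
def delta_ct(ct_target: float, ct_reference: float) -> float:
    """Normalize the target gene's Ct against an endogenous control gene."""
    return ct_target - ct_reference


def fold_change(
    ct_target_treated: float,
    ct_ref_treated: float,
    ct_target_control: float,
    ct_ref_control: float,
) -> float:
    """Relative expression 2 ** (-ΔΔCt), assuming 100% amplification efficiency."""
    ddct = delta_ct(ct_target_treated, ct_ref_treated) - delta_ct(ct_target_control, ct_ref_control)
    return 2 ** (-ddct)


# Hypothetical Ct values: the target crosses the threshold 3 cycles earlier
# in the treated sample, while the endogenous control is unchanged.
print(fold_change(22.0, 18.0, 25.0, 18.0))
```

A lower Ct means more starting template, so a negative ΔΔCt corresponds to higher expression in the treated sample.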
![Page 88: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/88.jpg)
Example of Array CGH Technology*
Chari et al, Cancer Informatics, 2006, 2, 48-58
![Page 89: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/89.jpg)
![Page 90: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/90.jpg)
Source: http://www.chiponchip.org/
ChIP-on-chip
![Page 91: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/91.jpg)
ChIP (Chromatin ImmunoPrecipitation)
• Chromatin immunoprecipitation, or ChIP, is a procedure used to determine whether a given protein binds to a specific DNA sequence in vivo
• DNA-binding proteins are crosslinked to DNA with formaldehyde in vivo
• Isolate the chromatin. Shear the DNA, along with bound proteins, into small fragments.
• Bind antibodies specific to the DNA-binding protein to isolate the complex by precipitation. Reverse the cross-linking to release the DNA and digest the proteins.
• Use PCR (polymerase chain reaction) to amplify specific DNA sequences to see if they were precipitated with the antibody
![Page 92: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/92.jpg)
![Page 93: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/93.jpg)
Protein Microarray
G. MacBeath and S.L. Schreiber, 2000, Science 289:1760
arrayIT™
Spotting platform and protein microarray
![Page 94: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/94.jpg)
Different Kinds of Protein Arrays*
Antibody Array Antigen Array Ligand Array
Detection by: SELDI MS, fluorescence, SPR, electrochemical, radioactivity, microcantilever
![Page 95: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/95.jpg)
The Microarray Study Process
![Page 96: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/96.jpg)
Preprocessing
![Page 97: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/97.jpg)
Some Questions:
• Which genes have expression levels that are correlated
with some external variable?
• For a given pathway, which of the genes in our collection
are most likely to be involved?
• For a diffuse disease, which genes are associated with
different outcomes?
![Page 98: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/98.jpg)
Challenges for Data Analysis
• Normalization (removing systematic measurement effects)
• Variable Selection (Identification of relevant Variables)
• Large sample Effects:
Type I and Type II errors (False positives / False negatives)
• Dimensionality Reduction
• Identification of new disease classes
• Classification of data into known disease classes
![Page 99: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/99.jpg)
Data Analysis Methods
Dimension Reduction
• PCA (Principal Component Analysis)
• ICA (Independent Component Analysis)
• Multidimensional Scaling
Unsupervised Learning
• K-Means / K-Medoid
• Hierarchical Clustering Algorithms
Supervised Learning
• Linear Discriminant Analysis
• Maximum Likelihood Discrimination
• Nearest Neighbor Methods
• Decision Trees
• Random Forests
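As an illustration of the unsupervised methods listed above, here is a minimal sketch of K-means (Lloyd's algorithm) on toy two-condition expression profiles. The data are invented, and using the first k points as starting centroids is a simplification chosen to keep the example deterministic; real analyses usually use random restarts or a library implementation.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy profiles: (expression in condition 1, expression in condition 2).
profiles = [(1.0, 1.2), (8.0, 8.1), (0.9, 1.0), (1.1, 0.8), (7.9, 8.3), (8.2, 7.9)]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

The low-expression and high-expression genes end up in separate clusters.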
![Page 100: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/100.jpg)
Matrix factorization
![Page 101: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/101.jpg)
![Page 102: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/102.jpg)
Popular Classification Methods
• Decision Trees/Rules: find smallest gene sets, but not robust; poor performance
• Neural Nets: work well for a reduced number of genes
• K-nearest neighbor: good results for a small number of genes, but no model
• Naïve Bayes: simple, robust, but ignores gene interactions
• Support Vector Machines (SVM): good accuracy, does own gene selection, but hard to understand
• Specialized methods, D/S/A (Dudoit), …
![Page 103: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/103.jpg)
Support Vector Machine (SVM)
• Main idea: select the hyperplane that is most likely to generalize to future data
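One common way to make "most likely to generalize" concrete is the maximum-margin criterion: among hyperplanes that separate the classes, prefer the one farthest from the nearest training point. The sketch below does not train an SVM; it only computes the geometric margin of two hand-picked candidate hyperplanes on invented data to show what is being maximized.

```python
import math

def margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    The value is positive only if every point lies on the side
    indicated by its label (+1 or -1).
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in zip(points, labels))

# Toy data: two samples per class, two genes per sample.
points = [(1, 1), (2, 1), (4, 4), (5, 4)]
labels = [-1, -1, 1, 1]

# Two hypothetical separating hyperplanes, given as (w, b).
candidates = {"A": ((1, 0), -3), "B": ((1, 1), -5.5)}
best = max(candidates, key=lambda name: margin(*candidates[name], points, labels))
print(best)
```

Both candidates separate the data, but B leaves more room around the closest samples, so it is the one a maximum-margin classifier would prefer.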
![Page 104: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/104.jpg)
Best Practices
• Capture the complete process, from raw data to final results
• Gene (feature) selection inside cross-validation
• Randomization testing
• Robust classification algorithms
– Simple methods give good results
– Advanced methods can be better
• Wrapper approach for best gene subset selection
• Use bagging to improve accuracy
• Remove/relabel mislabeled or poorly differentiated samples
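The point about doing gene selection inside cross-validation can be sketched as below: the genes are re-selected in every fold using only the training samples, so the held-out sample never influences which genes are chosen. The ranking rule (difference of class means), the nearest-centroid classifier, and the data are all illustrative assumptions, not the methods used in the talk.

```python
def class_means(samples, labels, cls):
    """Per-gene mean expression over the samples of one class."""
    rows = [s for s, y in zip(samples, labels) if y == cls]
    return [sum(col) / len(rows) for col in zip(*rows)]

def select_genes(samples, labels, n_genes):
    """Rank genes by absolute difference between the two class means."""
    a, b = sorted(set(labels))
    mean_a = class_means(samples, labels, a)
    mean_b = class_means(samples, labels, b)
    ranked = sorted(range(len(mean_a)), key=lambda g: -abs(mean_a[g] - mean_b[g]))
    return ranked[:n_genes]

def nearest_centroid(train, train_labels, test_sample, genes):
    """Predict the class whose centroid is closest on the selected genes."""
    best_cls, best_dist = None, float("inf")
    for cls in sorted(set(train_labels)):
        centroid = class_means(train, train_labels, cls)
        dist = sum((test_sample[g] - centroid[g]) ** 2 for g in genes)
        if dist < best_dist:
            best_cls, best_dist = cls, dist
    return best_cls

def loocv_accuracy(samples, labels, n_genes):
    """Leave-one-out cross-validation with selection repeated in every fold."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        # Selection sees only the training fold, never the held-out sample.
        genes = select_genes(train, train_labels, n_genes)
        if nearest_centroid(train, train_labels, samples[i], genes) == labels[i]:
            correct += 1
    return correct / len(samples)

# Invented data: gene 0 separates the classes, genes 1 and 2 are noise.
samples = [[9.0, 5.0, 4.8], [8.5, 5.2, 5.1], [9.2, 4.9, 5.0],
           [2.0, 5.1, 4.9], [2.4, 5.0, 5.2], [1.8, 4.8, 5.0]]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
print(loocv_accuracy(samples, labels, n_genes=1))
```

Selecting genes once on the full data set and then cross-validating only the classifier would leak information from the test samples and tend to give optimistic accuracy estimates.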
![Page 105: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/105.jpg)
Alistair Chalk, 2008
Enrichment Analysis
• What are major enriched GO terms?
• What are the highly active pathways?
• What are the frequently interacting proteins?
• What are the known disease associations?
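A question such as "which GO terms are enriched?" is often answered with a hypergeometric (one-sided Fisher) test: how surprising is it to see this many annotated genes in the selected list? The sketch below is one standard formulation under that assumption; the slide does not say which test was used, and the numbers are invented.

```python
from math import comb

def enrichment_p_value(population, annotated, selected, hits):
    """P(X >= hits) under the hypergeometric distribution.

    population: genes on the array
    annotated:  genes in the population carrying the term
    selected:   genes in the list of interest
    hits:       selected genes carrying the term
    """
    total = comb(population, selected)
    upper = min(annotated, selected)
    favourable = sum(comb(annotated, k) * comb(population - annotated, selected - k)
                     for k in range(hits, upper + 1))
    return favourable / total

# Toy case: 10 genes, 4 annotated with the term, 3 selected, 2 of them annotated.
p = enrichment_p_value(population=10, annotated=4, selected=3, hits=2)
print(p)
```

When many terms are tested at once, the resulting p-values need a multiple-testing correction before they are interpreted.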
![Page 106: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/106.jpg)
Meta-analysis example: “Creation and
implications of a phenome-genome network”
Butte and Kohane. Nat Biotech. 2006
![Page 107: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/107.jpg)
Meta-analysis example: “Creation and
implications of a phenome-genome network”
Butte and Kohane. Nat Biotech. 2006
• Clustered experiments based on mapping concepts found in sample annotations to UMLS meta-thesaurus.
• Relationships found between phenotype (e.g., aging), disease (e.g., leukemia), environmental (e.g., injury) and experimental (e.g., muscle cells) factors and genes with differential expression.
• “the ease and accuracy of automating inferences across data are crucially dependent on the accuracy and consistency of the human annotation process, which will only happen when every investigator has a better prospective understanding of the long-term value of the time invested in improving annotations.”
![Page 108: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/108.jpg)
Systems biology
![Page 109: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/109.jpg)
![Page 110: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/110.jpg)
PPI ANNOTATION AND DATABASES
| Database | Reference | URL |
|---|---|---|
| HPID | Han et al., 2004 | http://www.hpid.org |
| IntAct | Hermjakob et al., 2004 | http://www.ebi.ac.uk/intact |
| HPRD | Peri et al., 2004 | http://www.hprd.org/ |
| DIP | Xenarios et al., 2002 | http://dip.doe-mbi.ucla.edu/ |
| MINT | Zanoni et al., 2002 | http://mint.bio.uniroma2.it/mint |

IMEx agreement to share curation efforts
Proteomics Standards Initiative (PSI) recommendation
Molecular Interaction (MI) Ontology
Large scale experiments
Literature curation
![Page 111: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/111.jpg)
![Page 112: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/112.jpg)
Complex networks
• Many systems can be represented as networks (graphs)
– Nodes: individual components (proteins)
– Edges: relationships (interactions)
• They share common properties
– Scale-free
– Hierarchical
– Clustering
• Some properties may be intrinsic and can be understood better when put into the context of evolution
![Page 113: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/113.jpg)
Detecting Hierarchical Organization
![Page 114: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/114.jpg)
Summary: Network Measures
• Degree ki
The number of edges involving node i
• Degree distribution P(k)
The probability (frequency) of nodes of degree k
• Mean path length
The avg. shortest path between all node pairs
• Network Diameter
– i.e. the longest shortest path
• Clustering Coefficient
– A high CC is found for modules
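The measures summarized above can be computed on a small undirected graph as in this sketch. The five-node network is invented, the graph is stored as a dictionary of neighbor sets, and the path statistics assume the graph is connected.

```python
from collections import Counter, deque
from itertools import combinations

def degree_distribution(graph):
    """P(k): fraction of nodes that have degree k."""
    counts = Counter(len(nbrs) for nbrs in graph.values())
    return {k: n / len(graph) for k, n in counts.items()}

def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbors of `node` that are themselves linked."""
    nbrs = graph[node]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return 2 * links / (len(nbrs) * (len(nbrs) - 1))

def shortest_paths_from(graph, source):
    """Breadth-first search distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def path_stats(graph):
    """Mean shortest path length and diameter of a connected graph."""
    lengths = [d for s in graph
               for t, d in shortest_paths_from(graph, s).items() if t != s]
    return sum(lengths) / len(lengths), max(lengths)

# Hypothetical interaction network: a triangle A-B-C plus a tail A-D-E.
ppi = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"},
       "D": {"A", "E"}, "E": {"D"}}

print(degree_distribution(ppi))
print(clustering_coefficient(ppi, "B"))
print(path_stats(ppi))
```

Node B sits inside the triangle and therefore has the maximum clustering coefficient, while the hub A has a lower one because its neighbor D is not linked to B or C.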
![Page 115: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/115.jpg)
Mapping the phenotypic data to the network
Begley TJ, Rosenbach AS, Ideker T, Samson LD. Damage recovery pathways in Saccharomyces cerevisiae revealed by genomic phenotyping and interactome mapping. Mol Cancer Res. 2002 Dec;1(2):103-12.
• Systematic phenotyping of 1615 gene knockout strains in yeast
• Evaluation of growth of each strain in the presence of MMS (and other DNA damaging agents)
• Screening against a network of 12,232 protein interactions
![Page 116: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/116.jpg)
![Page 117: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/117.jpg)
The Role of Proteomics
• The existence of an ORF does not imply the existence of a functional gene.
• Limitations of comparative genomics.
• mRNA levels may not correlate with protein levels.
• Protein modifications: post-transcriptional modifications, isoforms, post-translational modifications, mutants.
• Issues of proteolysis, sequestration, etc. relevant only at the protein level.
• Protein complex composition, protein-protein interactions, structures.
![Page 118: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/118.jpg)
Structural proteomics
• Folding
• Structure and function
• Protein structure prediction
• Secondary structure
• Tertiary structure
• Function
• Post-translational modification
• Prot.-Prot. Interaction -- Docking algorithm
• Molecular dynamics/Monte Carlo
![Page 119: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/119.jpg)
What kinds of methods are available?
5 main levels of protein Structure prediction:
1. Extensive Sequence Search
2. Threading and 1D-3D profiles
3. Ab initio prediction of protein structure
4. Comparative Modelling
5. Docking (domain interaction prediction)
![Page 120: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/120.jpg)
![Page 121: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/121.jpg)
Prediction of Protein Structures
• Examples – a few good examples
[Figure: pairs of actual and predicted structures shown side by side]
![Page 122: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/122.jpg)
![Page 123: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/123.jpg)
MODPIPE: Large-Scale Comparative Protein Structure Modeling
START
For each target sequence:
• Get profile for sequence (NR) [PSI-BLAST]
• Scan sequence profile against representative PDB chains
• Scan PDB chain profiles against sequence
• Select templates using permissive E-value cutoff
• Expand match to cover complete domains
For each template structure:
• Align matched parts of sequence and structure [MODELLER]
• Build model for target segment by satisfaction of spatial restraints
• Evaluate model
END
R. Sánchez & A. Šali, Proc. Natl. Acad. Sci. USA 95, 13597, 1998.
N. Eswar, M. Marti-Renom, M.S. Madhusudhan, B. John, A. Fiser, R. Sánchez, F. Melo, N. Mirkovic, A. Šali.
3/25/03
![Page 124: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/124.jpg)
Structural Proteomics:
The Motivation*
[Figure: number of known sequences (axis 0 to 2,000,000) compared with number of solved structures (axis 0 to 200,000), 1980 to 2005]
![Page 125: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/125.jpg)
The hierarchies of protein structure
![Page 126: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/126.jpg)
Docking Programs
• Dock (UCSF)
• Autodock (Scripps)
• Glide (Schrödinger)
• ICM (Molsoft)
• FRED (OpenEye)
• Gold, FlexX, etc.
![Page 127: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/127.jpg)
Cell cycle network from KEGG
![Page 128: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/128.jpg)
Graphical Notation: a necessity for the conceptual representation of biopathways
Qualitative: Thiery & Sleeman, Nat. Rev. Mol. Cell. Biol 7:131 (2006)
Mechanistic: Aladjem et al., Science STKE pe8 (2004)
Various degrees of detail, mixed level of presentation
![Page 129: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/129.jpg)
Strategies: simulate or analyse? (or rather what to do first)
• Convert diagram into a quantitative model → simulate model behavior numerically → obtain qualitative understanding through numerical results and model reduction
• Qualitatively analyze network topology, stability, etc. → identify “elementary modes” → build and simulate a reduced model
![Page 130: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/130.jpg)
Space of modeling methods
[Figure: modeling methods arranged along a continuous ↔ discrete axis; labels include stochsim and Boolean networks]
![Page 131: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/131.jpg)
Continuum of modeling approaches
Top-down Bottom-up
![Page 132: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/132.jpg)
Frazier et al. (2003) Science 11 April Vol 300:290-293
![Page 133: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/133.jpg)
Data integration
![Page 134: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/134.jpg)
Nucleic Acids Research article lists 1078 public databases
Nucleic Acids Research, 2008, Vol. 36, Database issue
http://nar.oxfordjournals.org/cgi/reprint/36/suppl_1/D2
![Page 135: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/135.jpg)
Growth in Available Bioinformatics Databases
![Page 136: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/136.jpg)
Too much unintegrated data
• Data sources incompatible
• No (or few) standard naming conventions
• No common interface (varying tools for browsing, querying and visualizing data)
![Page 137: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/137.jpg)
– Small, isolated, independent, groups/individuals
– Loosely coupled provider-consumer of resources.
– Commonly resource consumers
– Boutique suppliers.
– Poor access to sys admins
– Large experiments or large research groups/labs, possibly distributed
– Large service provider institutes.
– Tightly coupled provider-consumer of resources.
– Commonly resource providers.
– Some or lots of access to sys admin
![Page 138: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/138.jpg)
Challenges: Names and Identity
Q93038 = Tumor necrosis factor receptor superfamily member 25 precursor
Accession numbers: Q92983, O00275, O00276, O00277, O00278, O00279, O00280, O14865, O14866, P78507, P78515, Q93036, Q93037, Q99722, Q99830, Q99831, Q9BY86, Q9UME0, Q9UME1, Q9UME5
Names:
• WSL-1 protein
• Apoptosis-mediating receptor DR3
• Apoptosis-mediating receptor TRAMP
• Death domain receptor 3
• WSL protein
• Apoptosis-inducing receptor AIR
• Apo-3
• Lymphocyte-associated receptor of death
• LARD
• GENE: Name=TNFRSF25
Annotation history: http://www.expasy.org/uniprot/Q93038
GUIDs
Life Science Identifier?
Normalisation
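The normalisation problem on this slide can be sketched as a lookup that resolves any known accession to a single primary one. The example assumes, as the slide suggests, that the listed accessions all refer to the entry Q93038; only a few of them are included here, and the function names are hypothetical. A real pipeline would build the index from the database's own accession history rather than hard-code it.

```python
PRIMARY = "Q93038"
# A few of the other accessions shown on the slide (assumed to point to PRIMARY).
OTHER_ACCESSIONS = ["Q92983", "O00275", "O14865", "P78507", "Q99722", "Q9UME5"]

def build_index(primary, others):
    """Map every known accession, including the primary one, to the primary."""
    return {acc: primary for acc in [primary, *others]}

def normalise(accession, index):
    """Return the primary accession, or None if the identifier is unknown."""
    return index.get(accession.strip().upper())

index = build_index(PRIMARY, OTHER_ACCESSIONS)
print(normalise(" q9ume5 ", index))
print(normalise("P12345", index))
```

Returning None for unknown identifiers, instead of guessing, keeps unresolved records visible so they can be curated later.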
![Page 139: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/139.jpg)
![Page 140: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/140.jpg)
Why must we support standards?
• Unambiguous representation, description and communication
– Final results and metadata
• Interoperability
– Data management and analysis
• Integration of OMICS systems biology
![Page 141: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/141.jpg)
What to standardize?
• CONTENT: Minimal/Core Information to be reported
• MIBBI (http://www.mibbi.org)
• SEMANTIC: Terminology Used -> Ontologies
• OBI (http://obi-ontology.org)
• SYNTAX: Data Model, Data Exchange
• FuGE (http://fuge.sourceforge.net/)
![Page 142: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/142.jpg)
MIBBI: Standard Content
Promoting Coherent Minimum Reporting Requirements for Biological and Biomedical Investigations: The MIBBI Project, Taylor et al., Nature Biotech.
![Page 143: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/143.jpg)
Link Integration: Integration Lite
[Diagram labels: user interface, application, application interface, ontology authority, identity authority]
![Page 144: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/144.jpg)
Warehouse
[Diagram labels: application, user interface, wrappers, unified model, data access and query]
• Copy the data sets, clean and massage data into shape
• Combine them into a (different) pre-determined model before query
• ATLAS, MRS, e-Fungi, GIMS, Medicel Integrator, MIPS, BioMART
• Often called “Knowledge bases”
![Page 145: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/145.jpg)
View integration
• Data at Source; Virtual integrating database view
• Global as View / Local as View mappings between models
• Map from model to databases dynamically so always fresh
• TAMBIS, Information Integrator, K4, ComparaGrid, UTOPIA, caCORE
[Diagram labels: wrappers, application, user interface, unified model, data access and query]
![Page 146: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/146.jpg)
Specialist Integrating Application
E.g. Ensembl, UTOPIA
• Very popular. Known to be one application.
[Diagram labels: application, user interface, wrappers]
![Page 147: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/147.jpg)
Workflows
• Data flow protocol. Automated data chaining.
• General technique for describing and enacting a process
• Describes what you want to do, not how you want to do it
• Various degrees of data type compliance anticipated
[Diagram labels: application, user interface, wrapper, workflow engine]
![Page 148: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/148.jpg)
Mash-Up Data Marshalling
• Content syndication and feeds
• Emphasis on User creating specific integration by mapping.
• Just in time, just enough design
• On demand integration
[Diagram labels: mash-up application, user interface, protocol objects, protocols]
![Page 149: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/149.jpg)
Composite applications
![Page 150: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/150.jpg)
Semantic Web help?
• Slight problem: we have no first class metadata migration and management infrastructure, where metadata is outside the application and in the middleware, and we can handle progressive curation
[Diagram labels: wrappers, application, user interface, access and query]
Semantic Enrichment
Model flattening
Mapping Transparency
![Page 151: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/151.jpg)
![Page 152: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/152.jpg)
Service Oriented Architecture
[Diagram: dataflow and workflow over web services (ws); curation, submission, advanced search, retrieve data, submit data]
![Page 153: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/153.jpg)
Distributed Annotation System
![Page 154: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/154.jpg)
Distributed Annotation System
![Page 155: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/155.jpg)
An Integrative Analysis Example
• Relational data mining
• Text mining
• Spectrum data mining
• Chemical sequence data model
• Visualizing relational data clusters
• Visualizing multidimensional data
• Visualizing sequence data
• Visualizing pathway data
• Text mining visualization
• Visualizing cluster statistics
• Visualizing serial/spectrum data
• Decision tree model of metabonomic profile
• Chemical structure visualization
![Page 156: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/156.jpg)
From experiments to scientific publications
1- Experiments
Planning and carrying out experiments (lab work)
2- Results
Processing and interpretation of obtained results
3- Scientific peer-reviewed articles
'Relevant' results are published in scientific journals
![Page 157: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/157.jpg)
PubMed/Medline database at NCBI
- Developed at the National Center for Biotechnology Information (NCBI).
- The core 'Textome'.
- Repository of citation entries of scientific articles.
- PubMed titles and abstracts are the primary data source for Bio-NLP.
- ~ 450,000 new abstracts per year
- > 4,800 biomedical journals
- ENTREZ search engine
![Page 158: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/158.jpg)
Data in scientific articles
Scientific journals: journal-specific information
• Format
• Paper structure (sections)
• Article type
Free text: Title, Abstracts, Keywords, Text body, References
Tables, Figures
Biomedical literature characteristics
- Heavy use of domain specific terminology (12% biochemistry related technical terms).
- Polysemic words (word sense disambiguation).
- Most words with low frequency (data sparseness).
- New names and terms created.
- Typographical variants
- Different writing styles (native languages)
![Page 159: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/159.jpg)
![Page 160: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/160.jpg)
BioCreative
![Page 161: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/161.jpg)
BioCreative
![Page 162: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/162.jpg)
BioCreative results
1: Chiang et al.
2: Couto et al.
3: Ehrler et al.
4: Ray et al.
5: Rice et al.
6: Verspoor et al.
TP: prediction evaluated as protein and GO terms correct
Precision: TP / Total nr. of evaluated submissions
![Page 163: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/163.jpg)
Data Integration
• Standards, DBs
Knowledge Discovery
• Algorithms, Informatics, Machine Learning
Integrate knowledge
• Text mining, Ontologies
Modelling
• Pathways, Circuits, Abstraction
Infrastructure
Support
Research
![Page 164: Retos de la Bioinformatica](https://reader034.vdocuments.mx/reader034/viewer/2022052207/548274a1b47959e20c8b47ab/html5/thumbnails/164.jpg)
The challenges of biology over the next 50 years
• List all the molecular components that make up an organism:
– Genes, proteins, and other functional elements
• Understand the function of each component
• Understand how they interact
• Study how function has evolved
• Find the genetic defects that cause disease
• Design drugs and therapies rationally
• Sequence the genome of every individual and use it in personalized medicine
• Bioinformatics is an essential component in achieving all of these goals