University of Groningen
Genetical genomics with Affymetrix gene expression arrays
Alberts, Rudi
Document version: Publisher's PDF, also known as Version of record
Publication date: 2007
Citation for published version (APA): Alberts, R. (2007). Genetical genomics with Affymetrix gene expression arrays. Groningen: s.n.
Chapter 1
General introduction
Abstract
This chapter introduces the basic concepts needed to understand the rest of
the thesis: complex traits, genetics (mouse crosses), transcriptomics (measuring gene
expression using Affymetrix microarrays), and data pre-processing (normalization of
data and some pitfalls).
Complex trait analysis is a challenging field in biological research. A trait
(characteristic) of an individual develops under the influence of both its genes
and environmental factors. A complex trait is a trait that has multiple determinants,
which may be genetic, environmental, or both. Many diseases, such as obesity,
hypertension and cancer, are complex traits. In complex trait analysis, one aims to reveal
the determinants underlying the trait or, put more loosely, to find how many genes
influence the trait, where these genes are located, what their effects are, and how
they interact with each other and are affected by the environment.
DNA contains the genetic information in living cells. Genes are segments of DNA
sequence, often encoding proteins. DNA (deoxyribonucleic acid) is made from
nucleotides, each consisting of a sugar-phosphate molecule and a base. There are four types
of bases: adenine, guanine, cytosine and thymine, corresponding to four nucleotides
labeled A, G, C and T. A normal DNA molecule consists of two complementary
strands of nucleotides. T in one strand pairs with A in the other, and G in one
strand pairs with C in the other. The two strands twist around each other to form
a double helix. Figure 1.1 shows what is called the central dogma in biology: how
DNA is transcribed into RNA (ribonucleic acid), which subsequently is translated into
protein. The following description is valid for eukaryotes (animals, plants, fungi and
protists). In the first step, RNA transcription, the enzyme RNA polymerase makes a
copy of the gene from the DNA into mRNA (messenger RNA). The backbone of RNA
is formed of a slightly different sugar from that of DNA (ribose instead of deoxyribose)
and one of the four bases is slightly different: uracil (U) instead of thymine (T).
Next, the mRNA molecule must be further modified before it can be used. A so-called
cap is added at the beginning of the molecule and a poly-A tail is added at
the end. Regions that do not contain genetic code, called introns, are removed to
finally form the mature mRNA. The mature mRNA leaves the nucleus and is translated
into protein at the ribosomes. Proteins are built from amino acids, of which there are
20 types. The translation of the genetic information from a 4-letter alphabet into the
20-letter alphabet of proteins is a complex process. The information in the mRNA
sequence is used in groups of 3 nucleotides at a time, called codons. Almost every possible
codon represents an amino acid (three codons instead signal the end of translation), and
an amino acid can be represented by multiple codons, since there are 4*4*4=64 codons
and only 20 amino acids.
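The codon arithmetic above can be made concrete with a small sketch. It assumes the standard genetic code, written here as a compact 64-letter string in U, C, A, G order; the function names and the example sequence are illustrative and do not come from the thesis.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G.
# '*' marks the three stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna_coding_strand):
    """Turn a DNA coding-strand sequence into mRNA by replacing T with U."""
    return dna_coding_strand.replace("T", "U")


def translate(mrna):
    """Read the mRNA three nucleotides at a time until a stop codon is met."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


n_codons = len(CODON_TABLE)                    # 4 * 4 * 4 possible codons
n_amino_acids = len(set(AMINO_ACIDS) - {"*"})  # distinct amino acids encoded
peptide = translate(transcribe("ATGGCCTAA"))   # hypothetical 9-nucleotide example
```

Because 61 coding codons map onto 20 amino acids, several codons necessarily share an amino acid, which is the redundancy described in the text.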
Determined by the order of the nucleotides in the mRNA, amino acids are bound
together during protein translation to form proteins. In prokaryotes (bacteria and
archaea) the situation is simpler. The main differences with eukaryotes are that
prokaryotic cells do not have a nucleus and other complex cell structures, that they
typically contain a single loop of DNA, and that their DNA is much more compact, since
they generally lack introns and large intergenic regions.
Proteins are essential parts of all living organisms. Many proteins are enzymes
that catalyze biochemical reactions. Other proteins have mechanical or structural
functions. Proteins are also important in cell signaling and immune responses. They
play an important role in the development of complex traits. Biological systems can
be studied at any level in Figure 1.1. This thesis focuses on the amounts of mRNA
present.
Figure 1.1: In the nucleus, DNA is transcribed into primary mRNA and further processed
into mature mRNA. At the ribosomes, mature mRNA is translated into protein. Figure
copied from http://www.hbcprotocols.com.
A common way of studying mRNA levels of genes is by considering one gene at a
time, e.g. by creating so-called knockout individuals where the gene has been made
inactive. Differences between the normal individuals and the knockout individuals
might reveal information about the function of the gene. With microarray technol-
ogy, the amount of RNA can be measured for thousands of genes simultaneously.
To understand the processes behind the development of complex traits, it can be very
informative to look at the differences in gene activity between normal and knockout
individuals.
A more powerful approach is to multifactorially perturb the biological system.
This is done in genetical genomics (Jansen 2003).
1.1 Genetic dissection of complex traits
Using microarray technology, how can we reveal the genetic basis of complex traits?
A powerful method for doing this is genetical genomics, which was introduced by
Jansen and Nap in 2001 (Jansen and Nap 2001). Genetical genomics is essentially the
application of classical QTL mapping to gene expression data. In QTL (quantitative
trait locus) mapping, a quantitative trait, such as height, is measured in genetically
related individuals. Often, recombinant inbred lines (RILs) are used. RILs are
created by starting with the F1 of two homozygous parents and inbreeding the offspring for
six or more successive generations. This produces (almost) homozygous RILs whose
genomes are a mixture of the two parental genomes (Figure 1.2). To determine
which part of the genome in the RILs is inherited from which parental line, molecular
markers are used. These are specific places in the DNA where the sequence differs
between the two parental lines and that can be identified by a simple assay. By
comparing the quantitative trait values of the RILs with their marker genotypes
along the genome, genomic regions (QTLs) are found where markers and trait values
correlate. The QTL regions probably contain one or more genes that regulate the
quantitative trait. In genetical genomics, gene expression is taken as the quantitative
trait. Hence, for each gene, regulatory regions or even regulating genes can be found.
QTLs for gene expression traits are also called expression QTLs (eQTLs). Genetical genomics
makes use of the naturally occurring genetic variation between individuals and uses
the genetic mechanisms of segregation and recombination to produce offspring with
many combinations of gene variants. In this way the effect of differences in expression
of multiple genes can be studied at once, and combined effects may be revealed that
stay hidden when looking at one gene at a time. See Chapter 2 for
a more detailed explanation of genetical genomics.
Figure 1.2: The construction of recombinant inbred lines (RILs) from B6 and DBA
parental lines. The bars represent chromosome pairs. Grey (white) indicates that this
genomic interval is derived from B6 (DBA). The numbers 1-5 represent molecular markers.
Starting from the homozygous parents, a heterozygous F1 is created. Crossing-overs
cause a mixture of B6 and DBA genomes in the F2 generation. The two circles represent
parts of the genome that are already fixed for one parental type. Further inbreeding
provides homozygous RILs (BXD-101, BXD-102 and BXD-103). Figure copied from
http://www.informatics.jax.org.
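The idea of finding markers that correlate with a trait can be illustrated with a minimal single-marker scan. This is only a sketch under stated assumptions: the eight RILs, the three markers, the trait values and the use of a Welch t-statistic are all hypothetical choices made for illustration, not the mapping method used in this thesis.

```python
from math import sqrt
from statistics import mean, variance


def marker_t_statistic(genotypes, trait):
    """Welch t-statistic comparing trait values of lines carrying 'B' vs 'D' alleles."""
    b_values = [y for g, y in zip(genotypes, trait) if g == "B"]
    d_values = [y for g, y in zip(genotypes, trait) if g == "D"]
    standard_error = sqrt(
        variance(b_values) / len(b_values) + variance(d_values) / len(d_values)
    )
    return (mean(b_values) - mean(d_values)) / standard_error


# Hypothetical data: marker genotypes (B6 or DBA allele) for 8 RILs at 3 markers,
# and one gene expression trait (log2 scale) measured in the same 8 RILs.
markers = {
    "m1": list("BDBDBDBD"),
    "m2": list("BBBBDDDD"),
    "m3": list("BBDDBBDD"),
}
trait = [9.1, 9.0, 9.3, 8.9, 7.2, 7.0, 7.1, 6.9]

scores = {name: marker_t_statistic(g, trait) for name, g in markers.items()}
best_marker = max(scores, key=lambda name: abs(scores[name]))
```

In this toy example only marker m2 splits the lines into a high-expressing and a low-expressing group, so it receives by far the largest absolute statistic and would be reported as the QTL position.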
The genomic location of an eQTL can be compared with the location of the gene
for which the eQTL was found. When the eQTL is located close to the gene under
study (in most studies within 5 or 10 Mb), the eQTL is said to be cis-acting. When the
eQTL is located farther away from the gene, the eQTL is said to be trans-acting.
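The distance rule can be written down as a small helper. This is a sketch, assuming positions in Mb, a default window of 10 Mb (the upper end of the range mentioned above), and that an eQTL on a different chromosome is always trans; the function name and arguments are hypothetical.

```python
def classify_eqtl(gene_chr, gene_pos_mb, qtl_chr, qtl_pos_mb, window_mb=10.0):
    """Label an eQTL 'cis' if it lies on the gene's chromosome within window_mb, else 'trans'."""
    same_chromosome = gene_chr == qtl_chr
    if same_chromosome and abs(gene_pos_mb - qtl_pos_mb) <= window_mb:
        return "cis"
    return "trans"


nearby = classify_eqtl("2", 50.0, "2", 53.5)      # 3.5 Mb from the gene
distant = classify_eqtl("2", 50.0, "2", 90.0)     # same chromosome, 40 Mb away
elsewhere = classify_eqtl("2", 50.0, "7", 50.0)   # different chromosome
```

The choice of window matters: a narrower window (for example window_mb=5.0) reclassifies some borderline cis eQTLs as trans, which is one reason reported cis and trans counts differ between studies.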
Nowadays, microarrays have become more and more affordable. This makes it
possible to create gene expression profiles of tens or even hundreds of individuals and
to fully exploit the power of genetical genomics. Indeed, multiple genetical genomics
studies have recently appeared. See the review of Rockman and Kruglyak (Rockman
and Kruglyak 2006) and the references therein. Table 1.1 lists the majority of the
studies where Affymetrix arrays were used and indicates the numbers of cis and trans
genes reported.
Paper                  Organism   No. cis genes   No. trans genes
Hubner et al. 2005     rat        97              18
Bystrykh et al. 2005   mouse      162             136
Chesler et al. 2005    mouse      92              9
Morley et al. 2004     human      27              110
Cheung et al. 2005     human      12              1
Table 1.1: The numbers of cis and trans genes reported in the main genetical genomics
studies that used Affymetrix arrays.
Many research groups are working on the genetics of complex traits
(http://www.complextrait.org). http://www.genenetwork.org is an example of a valuable
repository of complex trait data and analysis tools. It can be used to help understand
interactions among networks of gene variants, transcripts and 'classical' traits such as
cancer susceptibility. It incorporates 'classical' trait data collected over more than 25
years, genetic maps, genome sequence and large microarray datasets. QTL analysis
can be performed on both 'classical' traits and gene expression traits.
In each of the experiments in Table 1.1, the individuals used for gene expression
profiling are expected to show genetic variation. One example of genetic variation
is the single nucleotide polymorphism (SNP). This is a place in the DNA where
individuals differ by one nucleotide, e.g. where one individual carries a G and another
carries an A. Another type of genetic variation is the insertion or deletion, where a
piece of DNA is either added or missing. A third type is gene copy number variation,
where different individuals have different numbers of copies of a gene in their DNA.
1.2 Affymetrix microarrays
This thesis focuses on Affymetrix GeneChip® short-oligonucleotide arrays
(Lockhart et al. 1996). These arrays are small wafers whose area is divided into
thousands of very small pieces (spots). Each spot contains so-called probes. Probes
are small synthesized pieces of single-stranded DNA (Figure 1.3A). Probes are chosen
to match a unique part of the target RNA whose quantity has to be measured.
They are complementary to this sequence, so that the RNA will bind (hybridize)
to the probe. The target RNA is labeled with a fluorescent label.
Figure 1.3B shows how the target RNA binds to probes on a microarray. Finally, the
RNA molecules that did not bind to any probe are washed away from the array and
the amount of fluorescence for each probe is measured in a scanner. This amount
represents the amount of RNA that was present for each gene.
Figure 1.3: (A) DNA probes on an Affymetrix microarray. (B) Fluorescently labeled RNA
hybridizes to probes on the microarray. Courtesy of Affymetrix.
Figure 1.4, panel A:
DNA         ..ACTTGTGCCGTTCTCTGAGAGAAAAAGACTGA..
PM probe 1       TGTGCCGTTCTCTGAGAGAAAAAGA
MM probe 1       TGTGCCGTTCTCAGAGAGAAAAAGA
PM probe 2        GTGCCGTTCTCTGAGAGAAAAAGAC
MM probe 2        GTGCCGTTCTCTCAGAGAAAAAGAC
Figure 1.4, panel B:
Probe   Sequence
1       CTCTGTGAATTCTCCTCCGAGAGAG
2       TGTGCCGTTCTCTGAGAGAAAAAGA
3       GTGCCGTTCTCTGAGAGAAAAAGAC
4       GGGCTCTGAACCAGTATCCGGCATT
5       TCTGAACCAGTATCCGGCATTATAT
6       CTACAGAGATTTCTTTCCAGAGTAT
7       TACAGAGATTTCTTTCCAGAGTATA
8       CAGAGATTTCTTTCCAGAGTATATG
9       TGACCCACAGAGCTACTGTCCCATG
10      GCAGCTGCAGGGACAGATTGCCCTC
11      TGCAGGGACAGATTGCCCTCCTGGT
12      TTCCATGTCGCTAATCCCGTGATTA
13      TCGCTAATCCCGTGATTATGTCTGC
14      CGCTAATCCCGTGATTATGTCTGCA
15      CCCGTGATTATGTCTGCAAACATGG
16      GCAGAGCCAGGCAGGCTGTAAATAA
Figure 1.4: (A) Part of the target DNA sequence and two example PM and MM probe
sequences. The PM probes perfectly match the target, and in the MM probe the middle
nucleotide is replaced by its complement. (B) PM probe sequences of probe set 102934_s_at
for gene Cdc25c.
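The relation between a PM probe and its MM partner (Figure 1.4A) is simple enough to express in a few lines. The sketch below derives the MM sequence from a PM sequence by complementing the middle nucleotide; the function name is hypothetical, and the two example sequences are the PM probes shown in Figure 1.4A.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm_probe):
    """Return the MM probe: the PM sequence with its middle nucleotide complemented."""
    middle = len(pm_probe) // 2  # index 12 for a 25-mer, i.e. the 13th nucleotide
    return pm_probe[:middle] + COMPLEMENT[pm_probe[middle]] + pm_probe[middle + 1:]


mm_probe_1 = mismatch_probe("TGTGCCGTTCTCTGAGAGAAAAAGA")
mm_probe_2 = mismatch_probe("GTGCCGTTCTCTGAGAGAAAAAGAC")
```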
The Affymetrix technology uses multiple probes per gene to measure RNA abundances.
For each gene one (or multiple) probe sets are designed, each containing
between 11 and 20 pairs of perfect match (PM) and mismatch (MM) probes (Figure
1.4A). Figure 1.4B shows the sequences of the 16 PM probes in example probe set
102934_s_at. The PM and MM probes of a pair are always located next to each other on the
microarray, but the PM/MM pairs themselves are randomly distributed over the area of the
microarray. Figure 1.5A shows the relative probe positions on the RNA (top) and
PM probe signals (bottom) for this probe set. It is striking that, although
the probes are designed to measure the same RNA target, their signals can be quite
different. Nevertheless, this is a representative example of how the signals in a probe set
behave: in general, the probes in a probe set can give very different signals, but
when comparing multiple samples (lines in the figure), the probe signals mostly
follow the same trend.
The differences in signal between probes can be caused by many factors (summarized
in Table 1.2). 1) The melting temperature of DNA molecules is defined as the
temperature at which 50% of the molecules form a stable double helix and the other
50% are separated into single strands. The melting temperature depends on
the sequence length and nucleotide composition; it increases with the number of G and
C nucleotides in the sequence. The higher the melting temperature of a
probe, the better it binds RNA molecules (Figure 1.6A). 2) RNA is amplified before
it is hybridized to the microarray. This amplification starts at the end of the RNA
and stops at random places, so that probes located closer to the end of the
RNA tend to show a higher signal. 3) When the sequence of a probe
is based on a wrong messenger sequence, it will probably not measure any target
and give a low signal. When the probe sequence also matches other RNAs, those other
RNAs will also bind to this probe (cross-hybridization) and the signal will become
higher (Figure 1.6B). 4) Both RNA and probes could form secondary structure, e.g.
by binding to themselves. This could hinder hybridization. The following two points
are purely hypothetical: 5) When the number of probes per spot differs between spots
on the array, differences in probe signals can arise. 6) When there is an error in a
mask during chip production, wrong probes will appear on the chip, leading to lower
signals.
No.   Factor
1     Probe melting temperature / GC content
2     Probe location on the RNA
3     Errors in probe design
4     Probe and RNA secondary structure
5     Probe density on array
6     Errors during array manufacturing
Table 1.2: Six factors leading to differences in signal for probes in one probe set.
Figure 1.5: (A) Relative probe positions on the RNA for probe set 102934_s_at (top) and
PM probe signals (bottom). Each line represents a mouse sample. (B) Similar plots for the
log-transformed PM signals. (Bottom panels: probe number 1-16 on the x-axis; PM signal in A
and log2 PM signal in B on the y-axis.)
Figure 1.6: (A) Probe melting temperature vs. probe intensity. Probes were divided
into groups according to their melting temperature, calculated following (SantaLucia 1998).
The probe intensities are plotted per group (black). The grey crosses are the mean intensity
values per group, and increase for higher melting temperatures. The melting temperature
explains only part of the variation in intensity. (B) Number of probe matches against
messenger sequence database vs. log2 probe intensities. Probe sequences were compared with
messenger sequences and probes were grouped according to the matching results (BLAST
groups). -2 and -1 represent probes with a best hit with 2 or 1 mismatch(es). 0 to 5
represent probes with 1 to 5 perfect matches. M (missing) probes were not found and A
(anti-sense) probes had the wrong orientation. The better the probes match with one or
multiple genes, the higher the signal.
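Factor 1 links probe signal to the G and C content of the probe. As a rough illustration, the fraction of G and C nucleotides in a probe can be computed directly from its sequence. This is only a crude proxy, assumed here for illustration: the melting temperatures in Figure 1.6A were calculated with the nearest-neighbour parameters of SantaLucia (1998), which this sketch does not implement. The function name is hypothetical; the sequence is probe 2 from Figure 1.4B.

```python
def gc_fraction(probe):
    """Fraction of nucleotides in the probe that are G or C."""
    return sum(1 for base in probe if base in "GC") / len(probe)


# Probe 2 of probe set 102934_s_at (Figure 1.4B): 7 G and 4 C out of 25 nucleotides.
probe_2_gc = gc_fraction("TGTGCCGTTCTCTGAGAGAAAAAGA")
```

Probes with a higher GC fraction are expected, on average, to have a higher melting temperature, but as noted for Figure 1.6A this explains only part of the variation in intensity.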
The log-transformed PM signals of Figure 1.5A are shown in Figure 1.5B. When
comparing both figures, one sees that on the original scale the variation per probe can
differ considerably, e.g. probe 6 shows little variation and probe 11 much. On log scale
the variation between individuals is much more consistent over probes. Differences
between individuals can be studied better because they are consistent over probes
and are not diminished by probes with low variation. Microarray data are usually log
transformed, since after the transformation the distribution of the data is much
closer to a normal distribution, which many statistical tests assume.
Moreover, on log scale one obtains symmetry in up- and downregulation,
e.g. on log2 scale an increase of 1 corresponds to 2-fold upregulation and a decrease
of 1 corresponds to 2-fold downregulation.
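The symmetry argument is easy to check numerically. In the sketch below the signal values are hypothetical; it shows that a 2-fold increase and a 2-fold decrease are asymmetric on the original scale but symmetric on the log2 scale.

```python
from math import log2


def log2_fold_change(sample, reference):
    """Difference between sample and reference on the log2 scale."""
    return log2(sample) - log2(reference)


reference = 1000.0          # hypothetical PM signal in a reference sample
up, down = 2000.0, 500.0    # 2-fold up- and downregulated signals

raw_up, raw_down = up - reference, down - reference      # +1000 and -500
log_up = log2_fold_change(up, reference)                 # +1 on log2 scale
log_down = log2_fold_change(down, reference)             # -1 on log2 scale
```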
1.3 Preprocessing Affymetrix array data
In the past years many research groups have developed algorithms for preprocessing
of Affymetrix array data. Usually, the following steps are performed: background
correction, normalization and summarization. The background signal is the signal that is
measured irrespective of any true signal. Several solutions have been proposed to correct
for this effect. Usually, multiple microarrays are used within one experiment. Because of
differences introduced, for example, during the sample preparation process or the RNA
amplification process, the overall signal levels can differ between arrays. Between-array
normalization equates the overall signal of multiple arrays so that fair comparisons
can be made. In the summarization step, the multiple probe signals per probe set
are summarized into one expression value per probe set.
Below, I will discuss the Microarray Suite 5.0 software that Affymetrix provides
for analyzing GeneChip® data (MAS 5.0; Affymetrix Microarray Suite User Guide,
Version 5, www.affymetrix.com/support/technical/manuals.affx) and the software
that I have used throughout this thesis: Robust Multi-array Average (RMA;
Irizarry, Hobbs, Collin, Beazer-Barclay, Antonellis, Scherf and Speed 2003). I
used RMA because it performs better than MAS 5.0 (Irizarry, Bolstad, Collin, Cope,
Hobbs and Speed 2003): RMA has higher precision, especially for lower expression
values, provides more consistent estimates of fold change, and provides higher
specificity and sensitivity when using fold change analysis to detect differential
expression. Chapter 6 shows how RMA is performed in the statistical programming
language R.
MAS 5.0 background correction. Each array is divided into equally spaced
zones and an average background is assigned to the center of each zone. The distances
from each spot to the center of each zone are calculated. The distances are used as
weights in calculating the background per spot based on the background values of
the zones. Finally the background is subtracted from the signal.
MAS 5.0 summarization. The signal of a probe set is defined as the anti-log
of a robust average (Tukey biweight) of the values $\log(PM_j - CT_j)$ for probe pairs
$j = 1, \ldots, J$, where $CT_j$ equals $MM_j$ when $MM_j < PM_j$, but is adjusted to be
less than $PM_j$ when $MM_j > PM_j$. The adjustment is based on the average difference
between PM and MM, or, if that measure is too small, on some fraction of PM.
MAS 5.0 normalization. All arrays are scaled so that they have the same
mean value.
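Scaling arrays to a common mean can be sketched as follows. This is a simplified illustration of the statement above, not the MAS 5.0 implementation: the function name, the choice of target (the average of the array means unless one is given) and the example signals are assumptions.

```python
def scale_to_common_mean(arrays, target=None):
    """Multiply each array by a constant so that all arrays share the same mean signal."""
    means = [sum(signals) / len(signals) for signals in arrays]
    if target is None:
        target = sum(means) / len(means)  # assumed default: average of the array means
    return [
        [signal * target / array_mean for signal in signals]
        for signals, array_mean in zip(arrays, means)
    ]


# Two hypothetical arrays measuring the same probes; the second is half as bright overall.
scaled = scale_to_common_mean([[100.0, 200.0, 300.0], [50.0, 100.0, 150.0]])
```

A single scaling factor per array corrects overall brightness only; it leaves the shape of each array's signal distribution unchanged, which is where quantile normalization differs.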
RMA background correction. The background correction is based on PM
values only. An additive model of background plus specific signal is proposed, where the
background follows a normal distribution and the specific signal follows an exponential
distribution.
RMA normalization. RMA uses a procedure called quantile normalization,
which gives the signals of each array the same distribution. The algorithm is as
follows: 1) Collect the signals of all $p$ probes of $n$ arrays in a matrix $X$ with $n$
columns and $p$ rows. 2) Sort each column in $X$ to give $X_{sort}$. 3) Take the mean
across rows of $X_{sort}$ and assign this mean to each element in the row to get $X'_{sort}$.
4) Get the normalized data by rearranging each column in $X'_{sort}$ to have the same
ordering as the original $X$.
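The four steps can be followed in a short pure-Python sketch. It represents the matrix as a list of p rows with n values each; the function name and the small example matrix are hypothetical, and ties within a column are resolved in an arbitrary (stable) order rather than averaged.

```python
def quantile_normalize(X):
    """Quantile-normalize a matrix X given as p rows (probes) by n columns (arrays)."""
    p, n = len(X), len(X[0])
    # Step 1: view the data column by column (one column per array).
    columns = [[X[i][j] for i in range(p)] for j in range(n)]
    # Step 2: sort each column.
    sorted_columns = [sorted(column) for column in columns]
    # Step 3: mean across each row of the sorted matrix.
    row_means = [sum(sorted_columns[j][i] for j in range(n)) / n for i in range(p)]
    # Step 4: put the row means back in the original ordering of each column.
    normalized = [[0.0] * n for _ in range(p)]
    for j, column in enumerate(columns):
        order = sorted(range(p), key=lambda i: column[i])  # probe indices, low to high
        for rank, i in enumerate(order):
            normalized[i][j] = row_means[rank]
    return normalized


# Hypothetical signals of 4 probes on 3 arrays.
X = [[5, 4, 3],
     [2, 1, 4],
     [3, 6, 6],
     [4, 2, 8]]
X_norm = quantile_normalize(X)
```

After normalization every array contains exactly the same set of values, so the arrays share one distribution, while the rank order of the probes within each array is preserved.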
RMA summarization. Expression measures per probe set are obtained via a
linear additive model:
$$\log(PM_{ij}) = \mu_i + \alpha_j + \epsilon_{ij}, \quad i = 1, \ldots, I \text{ and } j = 1, \ldots, J$$
with $\alpha_j$ a probe affinity effect for probe $j$, $\mu_i$ the log scale expression level
for array $i$ and $\epsilon_{ij}$ an error term. Note that these algorithms calculate an
average expression value per probe set. By doing this, valuable information may be lost.
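One way to fit an additive model of this form robustly is Tukey's median polish, which is the fitting procedure described for RMA by Irizarry et al. (2003). The sketch below is a minimal version under stated assumptions: it runs a fixed number of sweeps instead of testing for convergence, and the function name and example data are hypothetical.

```python
from statistics import median


def median_polish_expression(log_pm, n_sweeps=10):
    """Estimate one expression value per array from log_pm[j][i] (probe j, array i)."""
    n_probes, n_arrays = len(log_pm), len(log_pm[0])
    residuals = [row[:] for row in log_pm]
    overall = 0.0
    probe_effect = [0.0] * n_probes   # plays the role of alpha_j
    array_effect = [0.0] * n_arrays   # array part of mu_i
    for _ in range(n_sweeps):
        # Sweep out the median of each probe (row).
        for j in range(n_probes):
            m = median(residuals[j])
            probe_effect[j] += m
            residuals[j] = [r - m for r in residuals[j]]
        shift = median(array_effect)
        overall += shift
        array_effect = [a - shift for a in array_effect]
        # Sweep out the median of each array (column).
        for i in range(n_arrays):
            m = median(residuals[j][i] for j in range(n_probes))
            array_effect[i] += m
            for j in range(n_probes):
                residuals[j][i] -= m
        shift = median(probe_effect)
        overall += shift
        probe_effect = [p - shift for p in probe_effect]
    # mu_i = overall level + array effect
    return [overall + a for a in array_effect]


# Hypothetical, exactly additive data: 3 probes (alpha = -1, 0, 1) on 3 arrays (mu = 8, 9, 10.5).
alphas, mus = [-1.0, 0.0, 1.0], [8.0, 9.0, 10.5]
log_pm = [[mu + alpha for mu in mus] for alpha in alphas]
expression = median_polish_expression(log_pm)
```

Because medians rather than means are swept out, a single outlying probe has little influence on the estimated expression level of an array.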
1.4 Processing batches
Large-scale microarray experiments are often split up into multiple phases (batches).
This occurs, for example, because the sample preparation takes so much effort that it
cannot be done at once, or because not all samples are available at the same time. A dataset
of 30 mouse samples hybridized to MG-U74Av2 arrays (Bystrykh et al. 2005) was
created in three processing batches: batches 1, 2 and 3 containing 10, 12 and 8 arrays
respectively. Figure 1.7 shows boxplots of all PM signals for each of the 30 arrays,
colored according to the batch in which the arrays were processed. As can be seen
from the figure, the overall signal for all arrays in the third (rightmost) batch is lower
than the overall signal of the arrays in the other batches. Although between-array
normalization equates the overall signal of all 30 arrays, the batch effect observed is
not removed from the data by this procedure, as can be seen for an example probe set
in Figure 1.8. Figure 1.8A shows the PM signals after quantile normalization; each
line (individual) is colored according to the batch in which it was processed. The batch
effect is clearly visible. Figure 1.8B shows a similar figure for another probe set.
There we see that the batch effect can be probe-specific: for probe 6 the individuals
from the light grey batch have the lowest signal, while for probes 14-16 they have the
highest signal. In Chapter 3 we will see how these effects can be corrected for.
Figure 1.7: Boxplots of the PM signals for 30 mouse microarrays (x-axis: BXD numbers;
y-axis: log2 raw PM intensity). The three colors indicate the three batches in which the
arrays were processed.
Figure 1.8: PM signals of 2 example probe sets showing a reasonably consistent batch effect
(A) and probe-specific batch effects (B). Individuals are colored according to the batches to
which they belong. (Top panels: relative probe position; bottom panels: probe number 1-16
on the x-axis and log2 PM signal on the y-axis.)
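To make the idea of a probe-specific batch effect concrete, the sketch below removes batch means from the signals of a single probe. It is a deliberately simple illustration and not the ANOVA model of Chapter 3: the function name and the data are hypothetical, and plain batch centering would also remove real genetic differences wherever genotype happens to coincide with batch.

```python
from statistics import mean


def remove_batch_means(values, batches):
    """Center the log2 signals of one probe within each batch, keeping the grand mean."""
    grand_mean = mean(values)
    batch_mean = {
        b: mean(v for v, label in zip(values, batches) if label == b)
        for b in set(batches)
    }
    return [v - batch_mean[b] + grand_mean for v, b in zip(values, batches)]


# Hypothetical log2 PM signals of one probe on six arrays from two batches.
signals = [10.0, 10.2, 9.8, 8.0, 8.2, 7.8]
batch_labels = [1, 1, 1, 3, 3, 3]
corrected = remove_batch_means(signals, batch_labels)
```

Applying the correction per probe, rather than once per array, is what allows it to handle cases like Figure 1.8B, where the direction of the batch shift differs between probes of the same probe set.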
1.5 Outline of the thesis
Chapter 2 introduces the concepts of genetical genomics in more detail. It states
that the design and analysis of genetical genomics experiments is not just a matter
of simply applying the existing computational tools for microarrays to this new type
of experimental data. Optimal experimental design and analysis for two major
microarray technologies, cDNA two-colour arrays and Affymetrix short oligonucleotide
arrays, are described.
Chapter 3 focuses on a particular mouse genetical genomics dataset, where gene
expression was measured using Affymetrix arrays. Arrays were processed in three
batches and this introduced a batch effect in the data. At places where batches
and marker alleles were spuriously correlated, false QTLs were detected. By applying an
ANOVA model with a factor 'batch', the batch effect was successfully eliminated.
Since this ANOVA model is applied to the probe level data, the interaction between
probe effect and QTL effect could be examined. It appeared that among cis genes
this interaction term is often significant, indicating that for cis genes the QTL
effect is often not consistent over probes. We show a clear example where the probe-specific
QTL effect is caused by two SNPs covering only part of the probes, and where a
difference in signal is observed only in the probes that overlap the SNPs. In this case
the gene is not differentially expressed; there is only a difference in hybridization,
because the SNPs cause the RNA of one strain to bind perfectly
to the array while the RNA of the other strain has a mismatch with the probes on the
array.
Chapter 4 introduces a verification protocol for the probe sequences of Affymetrix
human, mouse and rat arrays. Since the time of probe design by
Affymetrix, knowledge about messenger and genomic sequences has increased. Probe sets
may have been designed on erroneous EST sequences and hence may not actually
measure any occurring RNA. Furthermore, the definitions of genes change, since UniGene
clusters are merged, redivided or deprecated over time. The verification protocol
checks all probes on Affymetrix arrays against multiple messenger databases and the
genome and updates the probe set definitions.
Chapter 5 shows that in multiple genetical genomics studies many false cis eQTLs
are detected because of polymorphisms in the probe regions. When inbred popula-
tions are used and the microarrays are designed based on the sequence of one of the
parental strains, polymorphisms between the parental strains cause an excess of cis
eQTLs where the hybridization signal of the inbred lines carrying the allele of the
strain used for array design is higher than that of inbred lines carrying the other al-
lele. This is shown to occur in multiple studies and a statistical solution implemented
in a downloadable application is provided.
Chapter 6 presents the R and Perl code of the most important computational
protocols developed in this thesis.
Finally, Chapter 7 combines the general discussion, Dutch summary, acknowledgements
(dankwoord), CV, publications and poster award.
1.6 Glossary
Term               Description
DNA                Abbreviation for deoxyribonucleic acid, the molecule that contains the genetic code for all life forms except for a few viruses. It consists of two long, twisted chains made up of nucleotides. Each nucleotide contains one base, one phosphate molecule, and the sugar molecule deoxyribose. The bases in DNA nucleotides are adenine, thymine, guanine, and cytosine.
RNA                Abbreviation for ribonucleic acid, the molecule that carries out DNA’s instructions for making proteins. It consists of one long chain made up of nucleotides. Each nucleotide contains one base, one phosphate molecule, and the sugar molecule ribose. The bases in RNA nucleotides are adenine, uracil, guanine, and cytosine. There are three main types of RNA: messenger RNA, transfer RNA, and ribosomal RNA.
protein            A molecule or complex of molecules consisting of subunits called amino acids. Proteins are the cell’s main building materials, and they do most of a cell’s work.
nucleotide         A building block of DNA or RNA. It includes one base, one phosphate molecule, and one sugar molecule (deoxyribose in DNA, ribose in RNA).
oligonucleotide    A short sequence of nucleotides.
microarray         Microarrays consist of large numbers of molecules (often, but not always, DNA) distributed in rows in a very small space. Microarrays permit scientists to study gene expression by providing a snapshot of all the genes that are active in a cell at a particular time.
probe              Synthesized stretch of DNA to be attached to a microarray.
probe set          Collection of multiple probes, intended to measure the expression of one gene.
hybridize          To anneal nucleic acid strands from different sources.
PM probe           Perfect match probe, perfectly matching the target gene.
MM probe           Mismatch probe, equal to the PM probe except for the middle nucleotide, which is replaced by its complement.
QTL                Quantitative trait locus. Location on the genome where gene(s) are located that likely regulate some quantitative trait.
genetical genomics Method to identify QTLs for gene expression traits.
RIL                Recombinant inbred line. RILs are genetic populations whose genomes are mixtures of two parental founder genomes.
cis-acting QTL     QTL that colocates with its gene.
trans-acting QTL   QTL that does not colocate with its gene.
SNP                Single nucleotide polymorphism. A difference in the DNA or RNA of only one nucleotide.
CDF                Chip Definition File.
1.7 Websites
URL                    Description
www.affymetrix.com     Affymetrix website
www.bioconductor.org   Open source software for bioinformatics
www.r-project.org      The statistical programming language R
www.genenetwork.org    Large collection of genotype, phenotype and gene expression data for complex trait analysis and gene network reconstruction
www.rgenetics.org      Open source R software for the analysis of genetic data
www.complextrait.org   Website of the Complex Trait Consortium