Analysis of Microarray Data
Data Generated From Both mRNA and aRNA Samples
Clustering analysis is commonly used
for interpreting microarray data. It provides both a visual representation
of complex data and a method for measuring similarity between experiments
(gene ratios). The most widely used methods for clustering microarray
data are hierarchical clustering, k-means clustering, and self-organizing maps. In this
article, the second in our series on Ambion's MessageAmp aRNA
Amplification Kit, we present data and statistical analyses from
experiments conducted by Drs. Philip Moos and Brian Dalley at the
University of Utah, Huntsman Cancer Institute (HCI). The microarrays
used in this study were manufactured at the HCI and contained 6912
cDNA clones deposited in duplicate using a Molecular Dynamics GEN
III Array Spotter. Moos and Dalley compared the data generated
from the HCI microarrays hybridized with mRNA and amplified antisense
RNA (aRNA) generated with the MessageAmp Kit.
Experimental Design
Total RNA was isolated from HCT116
and RKO cells and aRNA was amplified from total RNA (2 µg).
For non-amplified samples, mRNA was purified using Oligotex beads
(Qiagen). To assess the reproducibility of RNA amplification and
compare representation of messages in aRNA with those in mRNA,
three samples each of RKO and HCT116 RNA were amplified independently.
The quality and yields of RNA obtained from the various samples
are presented in Figure 2. mRNA or aRNA (2 µg) samples were
fluorescently labeled by incorporating Cy3-dCTP (RKO samples) or
Cy5-dCTP (HCT116 samples) during reverse transcription with random
9-mers. Each glass slide contained duplicate arrays, and each labeled
RNA sample was hybridized to two slides (4 replicates). Duplicate
hybridizations were performed with each RNA sample, and the quantified
data were represented as slide averages, resulting in 12 arrays
each for the aRNA and mRNA samples. Images were acquired using a Molecular
Dynamics Array Scanner. Microarray data analysis was performed
using Spotfire Software (Spotfire®, Somerville,
MA).
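As a rough illustration of how replicate spots might be reduced to a single expression ratio per element, the sketch below computes a log ratio for each spot and averages the replicates. The intensities are hypothetical, and the use of log2 ratios and of background-corrected values are assumptions made for the example; the article does not state how Spotfire computed the ratios.

```python
import math
from statistics import mean

def log2_ratio(cy3, cy5):
    """Return log2(Cy3 / Cy5), i.e. log2(RKO / HCT116) under the labeling used here."""
    return math.log2(cy3 / cy5)

def replicate_average(log_ratios):
    """Average the log ratios of the replicate spots for one array element."""
    return mean(log_ratios)

# Hypothetical background-corrected intensities for one element,
# measured on four replicate spots as (Cy3, Cy5) pairs.
spots = [(1200.0, 300.0), (1000.0, 250.0), (1600.0, 400.0), (800.0, 200.0)]

ratios = [log2_ratio(cy3, cy5) for cy3, cy5 in spots]
element_ratio = replicate_average(ratios)
print(element_ratio)  # every spot here has a 4-fold Cy3/Cy5 ratio
```

Working with log ratios makes a 2-fold increase and a 2-fold decrease symmetric around zero, which is why they are a common choice before clustering.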
Cluster Analysis
An example of the image data obtained
from 4 of the 12 grids from a standard 6912 element HCI array
is represented in Figure 1. Comparison of the signals obtained
using mRNA vs. aRNA indicates that RNA amplification provides
excellent signal-to-noise ratios, even for genes expressed at
low levels.
Figure 1. Array Analysis of 6912 Element Arrays. Image data from 4 of the 12
grids of a standard 6912 element Huntsman Cancer Institute
cDNA microarray. mRNA or aRNA (2 µg) was reverse transcribed
in the presence of random 9-mers and Cy dyes.
HCT116 samples were labeled with Cy5-dCTP while RKO samples
were labeled with Cy3-dCTP. Analysis of the microarray
data was carried out using Spotfire Software (Spotfire®,
Somerville, MA).
| mRNA Sample | A260/280 | aRNA Yield |
| RKO-1 | 2.12 | 37.0 µg |
| RKO-2 | 2.15 | 28.7 µg |
| RKO-3 | 2.12 | 26.8 µg |
| HCT116-1 | 2.12 | 33.5 µg |
| HCT116-2 | 2.12 | 36.2 µg |
| HCT116-3 | 2.12 | 29.5 µg |
Figure 2. Yields of Amplified RNA. Yields
obtained after amplification of total RNA (2 µg) from
HCT116 and RKO cells using the MessageAmp Kit.
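The reproducibility of the amplification can be summarized from the yields in Figure 2. The following is a minimal sketch that computes the mean, sample standard deviation, and coefficient of variation for each cell line; the helper name `summarize` is introduced here only for illustration.

```python
from statistics import mean, stdev

# aRNA yields (µg) from Figure 2, each obtained from 2 µg of total RNA.
yields_ug = {
    "RKO": [37.0, 28.7, 26.8],
    "HCT116": [33.5, 36.2, 29.5],
}

def summarize(values):
    """Return (mean, sample standard deviation, coefficient of variation in %)."""
    m = mean(values)
    s = stdev(values)
    return m, s, 100.0 * s / m

for cell_line, values in yields_ug.items():
    m, s, cv = summarize(values)
    print(f"{cell_line}: mean {m:.1f} µg, SD {s:.1f} µg, CV {cv:.0f}%")
```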
Hierarchical clustering analysis
of the data obtained from 6912 elements was carried out using UPGMA
(Unweighted Pair Group Method with Arithmetic Mean) analysis (see
sidebar "Clustering Methods Used for Analyzing Microarray Data"),
with an ordering function based on the input rank. This data is
represented as a dendrogram (tree graph) with the closest branches
of the tree representing arrays with similar gene expression patterns.
Figure 3 depicts the hierarchical clustering data from all 6912
elements. The results indicate that there are broad similarities
between arrays hybridized with aRNA or mRNA. Although the overall
signal patterns on the aRNA- and mRNA-hybridized arrays are
similar, a small subset of elements shows different expression
ratios (RKO/HCT116) between the aRNA and mRNA samples.
Figure 3. Hierarchical
Clustering Analysis of All Array Elements. Hierarchical
clustering data of all the elements in an HCI array. A dendrogram
(tree graph) depicts the grouping of the genes based on the
similarity between them. UPGMA analysis (unweighted average)
was carried out using the "Euclidean Distance" to determine
the similarity measure and the input rank as the ordering
function. A subset of all the columns constituting the complete
data is shown in this figure.
To obtain statistically significant
data for the sub-regions that were distinct between the aRNA and
mRNA samples (91 elements), a weighted average (WPGMA) analysis was carried
out. The hierarchical clustering of these 91 elements is depicted
in Figure 4. Very few genes clearly
segregate into either mRNA or aRNA groups. Importantly,
for those genes that do segregate, the gene expression differences
(ratios) do not change direction (e.g., from RKO>HCT to HCT>RKO)
but are larger in the aRNA samples than in the
mRNA samples (as indicated by the color shade).
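The observation that ratios keep their direction but grow in magnitude can be expressed as a simple per-element check. The sketch below is illustrative only: the log ratios are hypothetical values, and the function name is not taken from any software used in the study.

```python
def same_direction_larger_magnitude(mrna_log_ratio, arna_log_ratio):
    """True if both log ratios share a sign and the aRNA ratio is larger in magnitude."""
    same_sign = mrna_log_ratio * arna_log_ratio > 0
    return same_sign and abs(arna_log_ratio) > abs(mrna_log_ratio)

# Hypothetical log2(RKO/HCT116) values for three elements: (mRNA, aRNA).
elements = {
    "element_a": (0.8, 1.9),    # same direction, larger in aRNA
    "element_b": (-0.5, -1.4),  # same direction, larger in aRNA
    "element_c": (0.6, -0.7),   # direction flips
}

flags = {
    name: same_direction_larger_magnitude(mrna, arna)
    for name, (mrna, arna) in elements.items()
}
print(flags)
```

An element such as `element_c`, where the sign flips, would indicate that amplification had changed which cell line appears to express the gene more strongly; the article reports that this was not observed.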
Figure 4. Hierarchical
Clustering Analysis of Selected Array Elements. Hierarchical
clustering analysis of a few select genes (91 of 6912) that
are very different between the aRNA and mRNA samples. This
analysis was carried out using the weighted average (WPGMA)
method.
An alternative method for
clustering microarray data is k-means clustering.
This method does not suffer from some of the problems associated
with hierarchical clustering, such as the irrelevance of gene expression
data as clustering progresses or spurious results due to errors
in assigning clusters early in the analysis (2). K-means clustering
of all the elements of the HCI arrays was performed with 6 clusters
(Figure 5). After 45 iterations, a total score of 1.082e+004 was
computed. The most similar "similarity value" was 0 and the least
similar "similarity value" was 1.798e+308. This grouping identified
two clusters (clusters 5 and 6, 91 elements in total) containing the
genes with the largest differences between aRNA
and mRNA (Figure 6).
Figure 5. K-means cluster analysis of the 6912 elements using a user-defined
cluster number of 6. 45 iterations were carried out to group
the genes within a given cluster using a data centroid based search.
The total score was calculated to be 1.082e+004. The most closely
clustering genes had a similarity value of 0 and the least similar
gene had a similarity value of 1.798e+308.
Figure 6. An
analysis of variance calculation of K-means clusters 5 and
6. This plot indicates the confidence
limit of the data. A 'p-value'
of less than 0.0001 was used, indicating that the genes represented
in this plot are unique at the 99.9999% cut-off value.
Analysis of variance (ANOVA) of
genes in clusters 5 and 6 indicated that the clusters contained
genes that behave distinctly between the mRNA and aRNA samples
at a confidence limit of 99.99999% (p<0.00001). ANOVA measurements
processed the gene-by-gene fluctuations from the mean value and
accounted for variance across samples.
A scatter plot analysis of the raw
Cy3 and Cy5 values of all 6912 elements within the 6 k-means clusters
is shown in Figure 7 (4 plots). The top two plots represent all
the elements and the bottom two depict the genes that show the
largest differences in signal. Most of the genes that are distinct
between the samples are expressed at lower levels (low fluorescent
signal). These differences were more exaggerated in the aRNA than
the mRNA sample because the signal-to-noise ratio was typically
much greater in the aRNA sample. The distinct genes in the aRNA
panel might be elements that were not clearly discernible in the
mRNA sample due to ribosomal RNA contamination (27% in the mRNA used
for this analysis). The presence of ribosomal RNA can increase
background in mRNA samples, resulting in variations in mRNA concentration
between samples and decreasing the efficiency of cDNA probe synthesis.
Thus the presence of ribosomal RNA could have cumulatively skewed
the detection and quantification of genes that were expressed in
very low amounts when mRNA or total RNA was used.
Figure 7. Scatter
Plot Analysis of All Array Elements. A
scatter plot of the Cy5 vs. Cy3
values obtained for an aRNA and an mRNA array is shown. The
top two panels depict the 6 clusters (obtained after K-means
clustering analysis) containing all 6912 elements. A subset
of elements that are distinct between the two arrays and
which deviate the most in signal intensity are depicted in
the lower panels.
Amplification of RNA thus provides
a means of measuring expression from genes transcribed at very
low levels. In many cases the quantity of RNA in an experimental
sample is below the optimal amount required for synthesizing labeled
cDNA for microarray analysis. MessageAmp is a viable technology
for increasing the yield of useful probe and can greatly lower
the starting amount of RNA required to produce biologically relevant
signals.
More on Microarray Analysis
In future columns we will continue to
report results from MessageAmp microarray studies from Ambion
researchers, our collaborators, and our customers. If you have
results you would like to share, please contact Lakshmi Madabusi
at lmadabusi@ambion.com or
call 1-800-888-8804 x6308.
REFERENCES
1. Quackenbush J (2001). Computational
analysis of microarray data. Nature Reviews Genetics. 2(6):
418–427.
2. Tavazoie S, Hughes JD,
Campbell MJ, Cho RJ and Church GM (1999). Systematic determination
of genetic network architecture. Nature Genetics. 22: 281–285.
Cy3 is a trademark
of Amersham Biosciences.
Ordering Information
For prices and availability, please contact our Customer Service Department.
| Cat# | Product Name | Size |
| AM1705 | Amino Allyl cDNA Labeling Kit | 15 rxns |
| AM1750 | MessageAmp™ aRNA Kit | 20 rxns |
Clustering Methods Used for Analyzing Microarray Data
UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
This analysis calculates the average Euclidean distance between each
point in a cluster and all the points in a neighboring cluster. The two
clusters that are closest to each other (have the smallest average distance)
are connected to form the higher order cluster. This data is represented
as a dendrogram (tree graph) with the closest branches of the tree representing
genes with similar gene expression patterns.
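The merging procedure described above can be sketched in a few lines. This is a deliberately naive illustration that recomputes every average pairwise distance at each step, not the implementation used by Spotfire; the four two-dimensional profiles are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_distance(cluster_a, cluster_b, points):
    """Mean Euclidean distance over all pairs with one point in each cluster."""
    total = sum(euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def upgma(points):
    """Return the merge history as a list of (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(points))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], points),
        )
        merges.append((a, b, average_distance(a, b, points)))
        # Replace the two clusters with their union (the higher-order cluster).
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical profiles: two tight pairs that are far from each other.
profiles = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
for a, b, d in upgma(profiles):
    print(a, b, round(d, 3))
```

The merge history is what a dendrogram draws: each merge becomes a branch point whose height is the recorded distance.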
WPGMA (Weighted Pair Group Method with Arithmetic Mean)
WPGMA is a clustering technique used when the cluster sizes obtained
(using UPGMA) are suspected to be greatly uneven. In this analysis, the
distance from a newly merged cluster to any other cluster is the simple
average of the distances from its two constituent clusters, so each
constituent contributes equally regardless of the number of objects
it contains.
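The difference between the two methods is easiest to see in the rule used to update distances after two clusters A and B are merged. The formulas below are the standard textbook definitions, not taken from the Spotfire documentation; d(X, Y) denotes the distance between clusters X and Y, |X| the number of objects in X, and C any other cluster.

```latex
\[
d_{\mathrm{UPGMA}}(A \cup B,\, C) = \frac{|A|\, d(A, C) + |B|\, d(B, C)}{|A| + |B|}
\qquad
d_{\mathrm{WPGMA}}(A \cup B,\, C) = \frac{d(A, C) + d(B, C)}{2}
\]
```

In UPGMA every original object carries equal weight, so a large cluster dominates the average. In WPGMA the two merged clusters carry equal weight, so objects in the smaller cluster count for more, which is why it is preferred when cluster sizes are very uneven.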
K-means Clustering
This method does not suffer from some of the problems associated with
hierarchical clustering such as irrelevance of gene expression data
as clustering progresses, or spurious results due to mistakes assigning
clusters initially in the analysis (2). K-means analysis requires
prior knowledge of the number of clusters represented in the data,
which is used to partition the data into clusters. Each element in
the array is randomly assigned to a cluster and an average expression
vector is calculated for each cluster. These vectors are used to compute
the Euclidean distance between each element and each cluster. Elements are then
moved between clusters and allowed to remain in a cluster if the
distance after reassignment is lower than the distance when
assigned to the previous cluster. After reassignment the expression
vector is recalculated for each cluster, and this process is carried
out iteratively until no further movement of the elements reduces
the distance within clusters.
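A minimal sketch of this iterative procedure is shown below, assuming random initial assignment and Euclidean distance as described. The handling of empty clusters, the `seed` and `max_iter` parameters, and the example profiles are assumptions added for illustration and do not come from the original analysis.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(members):
    """Average expression vector of a non-empty list of elements."""
    return tuple(sum(dim) / len(members) for dim in zip(*members))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centers) after iterating until no element moves."""
    rng = random.Random(seed)
    # Randomly assign every element to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random element.
            centers.append(centroid(members) if members else rng.choice(points))
        # Move each element to the cluster whose average vector is nearest.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # no element moved, so the clustering is stable
            break
        labels = new_labels
    return labels, centers

# Hypothetical expression profiles forming two well-separated groups.
profiles = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels, centers = kmeans(profiles, k=2)
print(labels)
```

Because the starting assignment is random, different seeds can give different cluster numbering and, on less clearly separated data, different final clusters.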
Analysis of Variance (ANOVA)
ANOVA measurements process the gene-by-gene fluctuations from the mean
value and account for variance across samples. The p-value indicates
the overlap between samples: the smaller the p-value, the more distinct
the samples tested and the lower the likelihood of overlap between samples.
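For a single gene, the comparison of fluctuations between and within sample groups reduces to the one-way ANOVA F statistic. The sketch below computes it for hypothetical log ratios from mRNA and aRNA arrays; it stops at the F statistic because converting F to a p-value requires the F distribution, which is not in the Python standard library.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical log ratios for one element on mRNA and aRNA arrays.
mrna = [1.0, 2.0, 3.0]
arna = [4.0, 5.0, 6.0]
print(one_way_anova_f([mrna, arna]))
```

A large F value means the group means differ by much more than the scatter within each group, which corresponds to a small p-value. If SciPy is available, `scipy.stats.f_oneway` returns both the statistic and the p-value.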