This help file is meant to serve as a general guideline for using the MARAN
website. Based on a global overview, the different features and possibilities
of the site will be explained in detail. Moreover, a test data set (freely
available here) will be used to illustrate how to
interpret the obtained results and perform subsequent analysis steps
accordingly.
MARAN concept overview
MARAN is a web application for preprocessing micro-array data. It uses a
generic ANOVA model for normalizing the measurements with respect to several
sources of variation in the experiment (generic in the sense that it can
readily be applied to any type of experimental design). The major advantages of
using a linear (ANOVA) model for normalizing micro-array data are that this
approach assesses the different sources of variation across the entire
experiment and that the residuals, obtained from fitting the model, can be used
for statistical inference.
Apart from an ANOVA-based normalization, several other features are provided. A
loess fit procedure (Yang et al., 2002) is
implemented as a remedial measure for non-linearities. An option for filtering
the results by selecting genes with significantly changing expression profiles
is also available. Preprocessing results can be sent from the MARAN website to
the INCLUSive website for further analysis, such as gene clustering and motif
detection. Figure 1 illustrates the general concepts of the MARAN
website.
Figure 1: Schematic overview of the different conceptual modules of the
MARAN web application.
DATA UPLOAD
Data Format Upload File:
The MARAN application requires a specific format for the upload file. This
format is explained in detail on the Data Format page.
A note on UIDs (Unique IDentifiers)
When a dataset is uploaded, a unique identifier (an alphanumeric name of at
most eight characters) is requested. This unique identifier enables the
application to keep track of all files and images generated during an analysis.
All general files are stored using the unique identifier the user specified.
When performing a specific modeling step with these files, the identifier is
extended. The extension consists of eight digits that encode the parameters of
this specific modeling step. (When the same modeling step is requested again for
the same uploaded dataset, it is not recomputed: the application notices
that the unique identifier already exists with that specific extension and
displays the results of the earlier analysis.)
An example: a user uploads a dataset and gives it the unique identifier
'my_data', then performs a modeling step with all sources of variation checked.
The results of this analysis will be stored using the identifier 'my_data_11112100'.
A 1 or 2 means checked, a 0 means unchecked. The sequence is: batch, dye,
array, array-dye, pin, gene, log and loess. 'log' denotes that the data were
log-transformed before modeling; 'loess' means that the loess-fitted data were
used instead of the uploaded data.
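The naming scheme described above can be sketched in a few lines of Python. This is only an illustration of how such an identifier could be read back; the function name decode_result_id is hypothetical, and the underscore separator is assumed from the 'my_data_11112100' example.

```python
# Hypothetical helper: decode the eight-digit extension that is appended
# to a user-chosen UID. The field order follows the help text above.
FIELDS = ("batch", "dye", "array", "array-dye", "pin", "gene", "log", "loess")


def decode_result_id(result_id):
    """Split e.g. 'my_data_11112100' into the UID and its option flags."""
    # Assumption: UID and extension are joined by the last underscore.
    uid, _, extension = result_id.rpartition("_")
    if not uid or len(extension) != len(FIELDS) or not extension.isdigit():
        raise ValueError(f"no eight-digit extension found in {result_id!r}")
    # A 1 or 2 means checked, a 0 means unchecked.
    flags = {name: digit != "0" for name, digit in zip(FIELDS, extension)}
    return uid, flags


uid, flags = decode_result_id("my_data_11112100")
print(uid)
for name, checked in flags.items():
    print(f"{name:10s} {'checked' if checked else 'unchecked'}")
```

For the example identifier, all six effects come out as checked, while 'log' and 'loess' are unchecked.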
PERFORMING ANALYSIS
Normalization by use of an ANOVA model
As mentioned earlier, the normalization procedure implemented in MARAN consists
of a generic ANOVA model. Use of ANOVA for normalizing micro-array data is
increasingly gaining interest (Jin et al., 2001;
Kerr et al., 2000; Kerr and Churchill, 2001).
Basically, it comes down to modeling the measured expression level of each gene
as a linear combination of the major sources of variation (i.e. explanatory
variables or effects), such as the array or dye for which the measurement was
taken. The model implemented in MARAN is generic with respect to the
experimental design, i.e. it can be used to normalize any type of micro-array
design in a single run. The major effects included in the model are batch,
dye, array, pin and arrayxdye. A batch is a
set of arrays which contain a specific set of genes, representative of part of
the genome. This effect needs to be introduced when the entire set of genes is
too large to be spotted on a single array. All arrays on which the same set of
genes was spotted together form a 'batch'. The dye effect models the difference in
measured intensities between the red and green dye; the array effect
compensates for global intensity differences between arrays. Likewise, all
measurements that share the same pin effect were spotted by the same
spotting pin. The arrayxdye interaction effect is used, instead of a condition
effect, to account for condition-dependent variations in the measured
intensities. Both effects are confounded, and using a condition effect
would render the analytical solutions of the model fit dependent on the
experimental design.
Apart from these global effects, a gene effect and an expression effect
have been included. The gene effect normalizes each gene with respect to
its basal expression level; expression is the effect of interest,
i.e. the condition-affected change in intensity for each gene. Normalized expression values (dubbed E_hat) can be downloaded from the result page; a link is also provided to download all other parameters (dye, array, pin,...) that were estimated during the ANOVA procedure. The E_hat file contains as many rows as there are genes, and as many columns as there are conditions, in the analyzed experiment. Each entry indicates how much a single gene is up- or downregulated in a certain condition.
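The idea of describing each measurement as a sum of effects can be illustrated on toy data. The sketch below is not the MARAN implementation: it assumes NumPy, uses made-up effect sizes, and includes only array, dye and gene effects (no batch, pin, arrayxdye or expression effects). It fits the additive model by ordinary least squares and recovers the differences between arrays and between dyes.

```python
import numpy as np

# Toy, noise-free log-intensities: 3 genes x 2 arrays x 2 dyes.
# All effect sizes are invented for illustration only.
gene_eff = np.array([5.0, 7.0, 9.0])
array_eff = np.array([0.0, 0.8])
dye_eff = np.array([0.0, -0.3])

rows, y = [], []
for g in range(3):
    for a in range(2):
        for d in range(2):
            # Columns: intercept, 2 array dummies, 2 dye dummies, 3 gene dummies.
            x = np.zeros(8)
            x[0] = 1.0
            x[1 + a] = 1.0
            x[3 + d] = 1.0
            x[5 + g] = 1.0
            rows.append(x)
            y.append(gene_eff[g] + array_eff[a] + dye_eff[d])
X, y = np.array(rows), np.array(y)

# The dummy coding is over-parameterized, so lstsq returns the minimum-norm
# solution; fitted values and within-effect differences are still unique.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

array_contrast = coef[2] - coef[1]  # second array minus first array
dye_contrast = coef[4] - coef[3]    # second dye minus first dye
print("array difference:", round(float(array_contrast), 3))
print("dye difference:  ", round(float(dye_contrast), 3))
print("largest residual:", float(np.max(np.abs(residuals))))
```

Because the toy data contain no noise, the residuals are zero up to rounding. With real data, the residuals are what remains after all modeled effects have been removed.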
ANOVA model and constraints
Analytical solutions
Using the modeling page
Using the modeling page is fairly straightforward. A number of checkboxes,
representing the different sources of variation taken into account by our
model, are present. Depending on the specific design of the experiment, some of
these checkboxes may be greyed out, i.e. explanatory variables
that are not relevant for a specific experimental design are automatically
discarded. For instance, when the total number of genes fits on one array,
there will be only one batch, so the 'batch box' will be greyed out. All other
effects (except 'expression') can be included or excluded (depending on whether or
not the user would like to incorporate the respective sources of variation in
the model) by clicking the corresponding checkboxes. For instance, leaving
the gene effect unchecked means that the expression data are not normalized with
respect to their mean expression level (i.e. the gene effect); in some cases, it may be
useful to retain this information in the expression values. On this page,
there is also a checkbox for log transforming the data. If the uploaded data are
not yet log transformed, we recommend checking the log transformation box.
Indeed, our model assumes an additive error (the absolute error is independent
of the measured intensity), while in most cases there is a pronounced
multiplicative error (the absolute error on the measurement increases with the
measured intensity), so that the modeling assumptions are not satisfied. Log
transforming the data (multiplicative errors become additive) is therefore
often required (Marchal et al., 2002).
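The effect of the log transformation on a multiplicative error can be checked with a small calculation. The numbers below are made up: the same 20% error is applied to a weak and a strong spot, and the error is compared on the raw and on the log2 scale.

```python
import math

error_factor = 1.2                    # same 20% multiplicative error on every spot
true_intensities = [100.0, 10000.0]   # a weak and a strong spot (made-up values)

# On the raw scale the absolute error grows with the intensity.
raw_errors = [t * error_factor - t for t in true_intensities]

# On the log2 scale the error is the same constant for every spot.
log_errors = [math.log2(t * error_factor) - math.log2(t) for t in true_intensities]

for t, raw, log in zip(true_intensities, raw_errors, log_errors):
    print(f"intensity {t:8.0f}: raw error {raw:8.1f}, log2 error {log:.4f}")
```

The raw error differs by a factor of one hundred between the two spots, whereas the log2 error equals log2(1.2) for both, which is what an additive error model requires.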
Interpretation of results
Normalized expression values (and all parameters and residuals of the fitted
model) can be downloaded from the 'Results' page after completion of the
analysis. This page also presents an ANOVA table for interpreting the
contribution of the different effects to the total amount of variation
(represented by the 'SS' (Sum of Squares) column). The ANOVA
table of our test data is given below.
Source       SS            df       MS          F*           p(F > F*)
Bo           7258.3411     1        7258.3411   40452.4260   0.0000
Dl           9198.4442     1        9198.4442   51265.0726   0.0000
Ak(o)        9633.6236     6        1605.6039   8948.4048    0.0000
(AD)kl(o)    2148.8923     6        358.1487    1996.0462    0.0000
Gi(j(o))     129441.4614   8395     15.4189     85.9330      0.0000
Ein(j(o))    10215.3596    25185    0.4056      2.2606       0.0000
SSE          18078.7346    100757   0.1794      0.0000       0.0000
SSTO         185974.8567   134351   1.3842      0.0000       0.0000
Table 1: Example of the resulting ANOVA table after modeling.
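The columns of Table 1 are related in the usual way for an ANOVA table: MS = SS / df for each source, and F* is the ratio of that mean square to the error mean square (the MS of the SSE row). The short check below recomputes these columns from the SS and df values of Table 1; it is only a reading aid and does not compute the p-values.

```python
# (SS, df) pairs copied from Table 1.
effects = {
    "Bo": (7258.3411, 1),
    "Dl": (9198.4442, 1),
    "Ak(o)": (9633.6236, 6),
    "(AD)kl(o)": (2148.8923, 6),
    "Gi(j(o))": (129441.4614, 8395),
    "Ein(j(o))": (10215.3596, 25185),
}
sse, df_error = 18078.7346, 100757

mse = sse / df_error  # error mean square
mean_squares = {name: ss / df for name, (ss, df) in effects.items()}
f_star = {name: ms / mse for name, ms in mean_squares.items()}

for name in effects:
    print(f"{name:10s} MS = {mean_squares[name]:10.4f}  F* = {f_star[name]:12.4f}")

# The effect rows plus the error row add up to the SSTO row.
total_ss = sum(ss for ss, _ in effects.values()) + sse
total_df = sum(df for _, df in effects.values()) + df_error
print(f"SSTO = {total_ss:.4f} on {total_df} degrees of freedom")
```

The recomputed values agree with the table up to rounding of the printed figures.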
Several plots for analyzing the ANOVA modeling assumptions are also included.
These assumptions are twofold: firstly, the data should be adequately described
by a linear model. Secondly, the error terms are assumed to be normally
distributed with mean zero and constant variance. Information about the
non-normality and heteroscedasticity (non-constant error variance) of the
residual distribution can be obtained from the 'NQ plot' (Normal Quantile plot
of residual values, Figure 3) and the 'Global residual
plot' (Figure 2) respectively. To illustrate this, the
figures below show the residual plots of an ideal case (artificial data), as
opposed to what can be expected from a real experiment (test data set).
Although heteroscedasticity is not too prominent in the latter case, there are
some deviating features in the scatter plot that require further analysis.
Figure 2: Global residual plots of residual values: an ideal case with
artificial data (left plot), and a more realistic plot from a test data set
(right plot).
As for interpreting the NQ plot: ideally, i.e. for normally distributed
residuals, a straight line will be observed; any deviations from this line
indicate deviations from normality. It should be noted that, while any serious
heteroscedastic features should be avoided for further analysis, serious
deviations from normality, in the form of widened tails (like the ones we
observe in our test data set, Figure 3), can often be
acceptable due to the small number of data points compared to the number of
parameters to be estimated. As explained later on, bootstrap methods are
advisable when serious heteroscedasticity or non-normality occurs in the
residual distribution.
Figure 3: Normal Quantile plots of residual values: an ideal case with
artificial data (left plot), and a more realistic plot from a test data set
(right plot).
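For readers who want to reproduce a normal quantile plot from downloaded residuals, the coordinates can be computed with the Python standard library. This is a minimal sketch, not the code used by the website: the function name is hypothetical, the residual values are made up, and the plotting positions (i + 0.5) / n are one common convention among several.

```python
from statistics import NormalDist, mean, stdev


def normal_quantile_points(residuals):
    """Return (theoretical quantile, standardized residual) pairs."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    ordered = sorted((r - m) / s for r in residuals)
    std_normal = NormalDist()
    # Assumed plotting positions: (i + 0.5) / n for i = 0 .. n-1.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Made-up residuals, only to exercise the function.
points = normal_quantile_points([-1.2, -0.4, -0.1, 0.0, 0.2, 0.5, 1.0])
for q, r in points:
    print(f"{q:6.2f} {r:6.2f}")
```

Plotting the second column against the first gives the NQ plot; points that follow the diagonal indicate approximately normal residuals, while points that bend away at both ends indicate widened tails.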
More problematic, however, is an apparent heteroscedasticity caused by a
superposition of non-linear trends in the residuals for each combination of
major effects, indicating that the first assumption is not satisfied (i.e. a
linear model is not adequate for describing the data). All other plots on the
'Results' page are the residual plots for each specific array-dye combination,
and can be used to assess this problem. When obvious curvilinear trends are
observed on these plots, remedial measures should be taken. The following
figures (a few array-dye combinations from our test data) illustrate this
phenomenon.
Figure 4: A few detailed residual plots after analyzing the test data
set. C indicates condition number, A indicates array number, D
indicates dye number (Cy5=1 and Cy3=2 by default) and B indicates
batch number.
Note that the plots on the 'Results' page are all in JPEG format. It is
possible, however, to study the plots in more detail on a separate page with
plot applets. The applets allow for zooming in and out, rescaling the axes and
printing of the individual plots.
Experimental design limitations
By using the above model, it is technically possible to normalize any type of
experimental design. This does not imply however, that the quality of the
normalization results is independent of the experimental design. In general,
the more replicates are measured for each gene-condition combination, the
better the estimated ANOVA parameters will be (which is not necessarily
expressed through a low error sum of squares!). This follows naturally from the
fact that each expression effect (the effect of interest, the
condition-affected change in intensity for each gene) is calculated from all
measurements of a single gene-condition combination (after being corrected for
the other experimental variations included in the model). As a rule of thumb,
we suggest that each gene-condition combination be measured at least
twice, so as to ensure that a residual is obtained for each measurement (this is
especially important when the residuals are later to be used as a statistical
measure for e.g. selecting genes with significant change in expression). There
are no strict regulations as to how these replications should be incorporated
in the experimental design. However, in order to avoid partial confounding of
effects, it may be wiser to ensure that experimental sources of variation are
different from one replicate to the next. For instance, a dye swap experiment
(replicates measured on a different array and in a different dye) may be more
informative than simply repeating all measurements on a different array (same
dye) or multiple spotting on the same array (same array, same dye).
To illustrate how the lack of replicates can lead to bad error estimation (due
to the lack of residuals), we normalized the Spellman data set (Spellman
et al., 1998). This experiment was a so-called 'reference
design'. Eighteen arrays were used to test eighteen timepoints of the yeast
cell-cycle (measurements in red); each array used the same 'control' condition
(measurements in green). Since there was no multiple spotting, all
gene-condition combinations measured in red were only measured once, while the
combination of each gene with the nineteenth condition (i.e. the control
condition, always measured in green) was measured eighteen times (once on
each array). This results in partial residual plots for each array, such as the one
illustrated below (Figure 5). There are no residuals for any of the measurements in red.
Valuable biological interpretations may still be obtained from analyzing the expression
effects (after all, they are normalized with respect to various experimental
sources of variation), but using the obtained residuals for further statistical
inference may prove detrimental.
Figure 5: Detailed residual plots of the fifteenth array after analyzing
the Spellman data set. C indicates condition number, A indicates
array number, D indicates dye number (Cy5=1 and Cy3=2 by default) and B
indicates batch number.
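Why unreplicated measurements yield no usable residuals can be seen from a simplified calculation. The sketch below assumes that the measurements have already been corrected for all other effects, so that the expression estimate of a gene-condition combination is simply the mean of its measurements; the gene name, condition labels and values are made up.

```python
from collections import defaultdict


def cell_residuals(measurements):
    """measurements: list of (gene, condition, value) tuples, assumed to be
    corrected for all other effects. Returns one residual per measurement."""
    cells = defaultdict(list)
    for gene, condition, value in measurements:
        cells[(gene, condition)].append(value)
    # Simplification: the expression estimate is the mean of the cell.
    estimates = {key: sum(vals) / len(vals) for key, vals in cells.items()}
    return [value - estimates[(gene, condition)]
            for gene, condition, value in measurements]


# Hypothetical reference design for one gene: each time point is measured
# once, the control condition is measured on every array.
data = [
    ("geneA", "t1", 1.4), ("geneA", "t2", 0.3), ("geneA", "t3", -0.9),
    ("geneA", "control", 0.1), ("geneA", "control", -0.2), ("geneA", "control", 0.1),
]
residuals = cell_residuals(data)
for (gene, condition, value), res in zip(data, residuals):
    print(f"{gene} {condition:8s} value {value:5.2f} residual {res:5.2f}")
```

The three time points each have a residual of exactly zero, because the estimate coincides with the single measurement; only the replicated control condition carries information about the error.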
With respect to problematic experimental designs, we would like to remark that
using MARAN for normalizing dedicated chips will very likely give
unsatisfactory results. Dedicated chips contain only a subset of genes, which
are of specific interest for the conditions tested. Of course, it is generally
impossible to normalize data from dedicated chips with any global normalization
method (i.e. a method that assumes that roughly equal amounts of genes are up-
vs. downregulated from one condition to the next).
REMEDIAL MEASURES
Loess-fit for alleviating non-linear residual trends
A remedial measure for non-linear effects in the dataset has previously been
described by Yang et al., 2002. This Loess-based
method has been made available in the MARAN web application ('Loess' page). The
'Loess' page can be accessed directly or after inspecting the results of an
initial fit. This procedure will generate a new dataset from your data, on
which the ANOVA model can be fitted (as a consequence of this procedure, dye
effects will be zero after performing the loess fit and the ANOVA model). It is
important to keep in mind that a Loess-fit based correction for non-linearities
is performed in the dye direction, separately for each array. This implies
that, for complex experimental designs, non-linearities across arrays or
conditions cannot be completely alleviated. Figure 6 below
shows how performing a Loess fit can completely alleviate observed
non-linearities in some cases, but not in others.
Figure 6: A few detailed residual plots of the ANOVA-normalized test data
set, after performing a Loess fit. They clearly show how this procedure can
completely alleviate observed non-linearities in some cases, but not in
others. C indicates condition number, A indicates array number, D
indicates dye number (Cy5=1 and Cy3=2 by default) and B indicates
batch number.
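The principle of a loess correction in the dye direction can be sketched as follows. This is a simplified illustration, not the procedure used by the website: it assumes NumPy and the common formulation in which, for one array, the log-ratio M = log2(R) - log2(G) is smoothed against the mean log-intensity A = (log2(R) + log2(G)) / 2, and the smooth trend is subtracted from M. The function names, the span value and the toy data are assumptions.

```python
import numpy as np


def loess_fit(x, y, span=0.4):
    """Basic loess: local linear regression with tricube weights."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    k = max(3, int(np.ceil(span * n)))           # points per neighbourhood
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        idx = np.argsort(dist)[:k]               # k nearest neighbours of x[i]
        h = dist[idx].max() or 1.0               # guard against identical x values
        w = (1.0 - (dist[idx] / h) ** 3) ** 3    # tricube weights
        sw = np.sqrt(w)
        design = np.column_stack([np.ones(k), x[idx]])
        # Weighted least squares via row scaling with sqrt(weights).
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y[idx] * sw, rcond=None)
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted


def loess_correct(red, green, span=0.4):
    """Return loess-corrected log-ratios for the spots of one array."""
    red, green = np.asarray(red, float), np.asarray(green, float)
    m = np.log2(red) - np.log2(green)            # log-ratio M
    a = 0.5 * (np.log2(red) + np.log2(green))    # mean log-intensity A
    return m - loess_fit(a, m, span)


# Toy array with an intensity-dependent dye bias M = 0.3 * A - 2 (made up).
a_true = np.linspace(6.0, 14.0, 30)
m_true = 0.3 * a_true - 2.0
red = 2.0 ** (a_true + m_true / 2.0)
green = 2.0 ** (a_true - m_true / 2.0)

corrected = loess_correct(red, green)
print("largest remaining log-ratio:", float(np.max(np.abs(corrected))))
```

Since the fit is carried out per array, a trend shared by both dyes of an array, or a trend that differs between arrays or conditions, is not removed by this kind of correction.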
FILTERING RESULTS
As explained above, the obtained estimates of the error terms can be used for
statistical analysis of the ANOVA parameters. Two different methods for
selecting genes with significantly changing expression have been made available
on the website. Both methods differ in the calculation of confidence intervals
for the expression estimates (the parameters of interest, i.e. the
condition-affected change in intensity for each gene), as derived from fitting
the ANOVA model. The statistical test for selecting genes with significantly
changing expression, based on these confidence intervals, is identical for both
methods. For each gene i:
Where δ is an unspecified value. Basically, this test evaluates, for each
single gene, whether a single expression level (i.e. δ) exists that could
account for each calculated 'expression' level of that gene (indicating that
this gene is not differentially expressed). We chose a non-fixed value for
δ, as opposed to e.g. the average of the calculated 'expression' levels
or the calculated 'gene' level.
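One way to read this test is as a check on whether the confidence intervals of all 'expression' estimates of a gene share at least one common value. The sketch below implements that reading only; it is an assumption about the form of the test, the function names are hypothetical, and the intervals are made up.

```python
def shares_common_level(intervals):
    """True if a single value delta lies inside every (lower, upper) interval."""
    highest_lower = max(lower for lower, _ in intervals)
    lowest_upper = min(upper for _, upper in intervals)
    return highest_lower <= lowest_upper


def is_differentially_expressed(intervals):
    """A gene is selected when no common level fits all its conditions."""
    return not shares_common_level(intervals)


# Made-up confidence intervals, one per condition.
flat_gene = [(-0.3, 0.4), (-0.1, 0.6), (0.0, 0.5)]
changing_gene = [(-0.3, 0.4), (0.9, 1.6), (0.0, 0.5)]

print(is_differentially_expressed(flat_gene))
print(is_differentially_expressed(changing_gene))
```

For the first gene any value between 0.0 and 0.4 lies in all three intervals, so it is not selected; for the second gene the interval of the second condition does not overlap with the others, so it is selected.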
Use of normal confidence intervals
The first method is valid under the assumption of normally distributed error
terms, with mean zero and constant error variance. The null hypothesis for
selecting differentially expressed genes is that all calculated 'expression'
parameters for a single gene are sampling instances from a normal distribution
(based on standardized residuals) around a non-specified 'expression' value.
Correction for multiple testing is done by using the Bonferroni correction
procedure. A selection of genes can be obtained by entering a preferred
significance level or, when desired, p-values for all genes can be downloaded.
Although this method should not be applied when there is doubt that the above
assumptions are satisfied, this method is relatively fast and may therefore
serve as a preliminary indication of differentially expressed genes.
Use of bootstrap confidence intervals
The alternative method is based on a bootstrap procedure (DiCiccio
and Efron, 1996; Efron, 1979; Efron
and Tibshirani, 1986). It is a 'fixed predictor sampling
method', similar to the one described by Kerr et al., 2001, and is
appropriate when the residuals show serious deviations from normality, but no
apparent heteroscedasticity is present. A selection of genes can be obtained by
entering a preferred significance level. It is not possible, however, to obtain
p-values for each gene, as opposed to the method described above.
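The principle of fixed predictor sampling can be sketched for a single gene-condition combination. This is a strongly simplified illustration and not the procedure run by the website: a full implementation would refit the complete ANOVA model on every bootstrap dataset, whereas here the estimate is simply the mean of the replicates. The function name, the percentile interval, the number of bootstrap samples and the residual values are all assumptions.

```python
import random


def bootstrap_interval(estimate, n_replicates, residual_pool,
                       n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for one expression estimate."""
    rng = random.Random(seed)
    boot_estimates = []
    for _ in range(n_boot):
        # Fixed predictors: keep the fitted value, resample only the errors.
        resampled = [estimate + rng.choice(residual_pool)
                     for _ in range(n_replicates)]
        boot_estimates.append(sum(resampled) / n_replicates)
    boot_estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = boot_estimates[int(tail * (n_boot - 1))]
    upper = boot_estimates[int((1.0 - tail) * (n_boot - 1))]
    return lower, upper


# Made-up pool of residuals from a fitted model.
pool = [-0.2, -0.1, 0.0, 0.1, 0.2]
lower, upper = bootstrap_interval(estimate=1.0, n_replicates=2, residual_pool=pool)
print(f"95% interval: [{lower:.2f}, {upper:.2f}]")
```

Because the errors are drawn from the pooled residuals of all measurements, this approach relies on the residuals having a comparable spread everywhere, which is why it is not suited to heteroscedastic data.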
INCLUSive
There is a direct link from MARAN to the INCLUSive web application (Thijs
et al., 2002). INCLUSive is a suite of web-based tools
aimed at the automatic multistep analysis of microarray data (clustering and
motif finding). Currently, adaptive quality-based clustering, retrieval of
upstream sequences and the motif sampler are accessible from the website. All
tools are linked together, starting with the clustering, then retrieving the
upstream sequences of the genes belonging to a cluster and finally using the
motif sampler to find over-represented motifs in this set of upstream
sequences. All results obtained through modeling and filtering at MARAN can be
sent to the web-based clustering algorithm. To proceed with upstream sequence
retrieval and motif sampling, gene names and accession numbers for every gene
should be available. If the initial uploaded file did not contain gene names
and accession numbers, the link to INCLUSive will be limited to clustering.
MARAN SOAP SERVICE
The functionality of the MARAN web application is also available through a
web service. This service is provided to enable easy integration of
this functionality with your own applications. All information about this can
be found here.
2. Efron, B. (1979) Bootstrap methods: another
look at the jackknife. The Annals of Statistics
3. Efron, B. and Tibshirani, R. (1986) Bootstrap
methods for standard errors, confidence intervals, and other measures of
statistical accuracy. Statistical Science, 54-77
4. Jin, W. et al. (2001) The contributions
of sex, genotype and age to transcriptional variance in Drosophila
melanogaster. Nat Genet 29, 389-395
5. Kerr, M.K. and Churchill, G.A. (2001)
Experimental design for gene expression microarrays. Biostatistics 2, 183-201
6. Kerr, M.K. et al. (2000) Analysis of
variance for gene expression microarray data. J Comput Biol 7, 819-837
7. Marchal, K. et al. (2002) Comparison
of different methodologies to identify differentially expressed genes in
two-sample cDNA microarrays. J Biol Systems, in press
8. Spellman, P.T. et al. (1998)
Comprehensive identification of cell cycle-regulated genes of the yeast
Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9, 3273-97
9. Thijs, G. et al. (2002) INCLUSive:
INtegrated Clustering, Upstream sequence retrieval and motif Sampling.
Bioinformatics 18, 331-332
10. Yang, Y.H. et al. (2002) Normalization
for cDNA microarray data: a robust composite method addressing single and
multiple slide systematic variation. Nucleic Acids Res 30, e15