maran


Frequently Asked Questions

TABLE OF CONTENTS

Introduction
Data Upload
Performing Analysis
Remedial Measures
Filtering Results
INCLUSive
Maran SOAP Service
References

INTRODUCTION

About this help file

This help file is meant to serve as a general guideline for using the MARAN website. Starting from a global overview, the different features and possibilities of the site are explained in detail. In addition, a test data set (freely available here) is used to illustrate how to interpret the results obtained and how to carry out subsequent analysis steps accordingly.

MARAN concept overview

MARAN is a web application for preprocessing micro-array data. It uses a generic ANOVA model to normalize the measurements with respect to several sources of variation in the experiment (generic in the sense that it can readily be applied to any type of experimental design). The major advantages of using a linear (ANOVA) model for normalizing micro-array data are that this approach assesses the different sources of variation across the entire experiment and that the residuals obtained from fitting the model can be used for statistical inference.
Apart from an ANOVA based normalization, several other features are provided. A loess fit procedure (Yang et al., 2002) is implemented as a remedial measure for non-linearities. An option for filtering the results by selecting genes with significantly changing expression profiles is also available. Preprocessing results can be sent from the MARAN website to the INCLUSive website for further analysis, such as gene clustering and motif detection.
Figure 1 illustrates the general concepts of the MARAN website.
Figure 1: Schematic overview of the different conceptual modules of the MARAN web application.

DATA UPLOAD

Data Format Upload File:

The MARAN application requires a specific format for the upload file. This format is explained in detail on the Data Format page.

A note on UIDs (Unique IDentifiers)

When a dataset is uploaded, a unique identifier (an alphanumeric name of at most eight characters) is requested. This unique identifier enables the application to keep track of all files and images generated during an analysis. All general files are stored under the unique identifier the user specified. When a specific modeling step is performed with these files, the identifier is extended. The extension consists of eight digits that encode the parameters of this specific modeling step. (When the same modeling step is requested again for the same uploaded dataset, it is not recomputed: the application notices that the unique identifier already exists with that extension and displays the results of the earlier analysis.)
An example: a user uploads a dataset and gives it the unique identifier 'my_data'. The user then fits a model with all sources of variation checked. The results of this analysis are stored under the identifier 'my_data_11112100'. A 1 or 2 means checked, a 0 means unchecked. The order of the digits is: batch, dye, array, array-dye, pin, gene, log and loess. 'log' denotes that the data were log-transformed before modeling; 'loess' means that the loess-fitted data were used instead of the uploaded data.
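
The identifier scheme described above can be read back mechanically. The following is a minimal sketch in Python, assuming the extension is always separated from the user's identifier by the last underscore; the function name decode_result_id is hypothetical and is not part of MARAN.

```python
# Order of the eight digits, as listed in the text above.
FLAG_NAMES = ("batch", "dye", "array", "array-dye", "pin", "gene", "log", "loess")


def decode_result_id(result_id):
    """Split a result identifier into the user's UID and a dict of settings.

    Hypothetical helper for illustration only; not part of MARAN.
    """
    uid, sep, extension = result_id.rpartition("_")
    if not sep or len(extension) != len(FLAG_NAMES) or not extension.isdigit():
        raise ValueError("expected '<uid>_<eight digits>', got %r" % result_id)
    # A 1 or 2 means checked, a 0 means unchecked.
    settings = {name: digit != "0" for name, digit in zip(FLAG_NAMES, extension)}
    return uid, settings


uid, settings = decode_result_id("my_data_11112100")
print(uid)
print(settings)
```

For the example identifier, the six effect boxes come out as checked, while 'log' and 'loess' come out as unchecked.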

PERFORMING ANALYSIS

Normalization by use of an ANOVA model

As mentioned earlier, the normalization procedure implemented in MARAN consists of a generic ANOVA model. The use of ANOVA for normalizing micro-array data is gaining interest (Jin et al., 2001; Kerr et al., 2000; Kerr and Churchill, 2001). Basically, it comes down to modeling the measured expression level of each gene as a linear combination of the major sources of variation (i.e. explanatory variables or effects), such as the array or dye for which the measurement was taken. The model implemented in MARAN is generic with respect to the experimental design, i.e. it can be used to normalize any type of micro-array design in a single run. The major effects included in the model are batch, dye, array, pin and arrayxdye. A batch is a set of arrays that contain a specific set of genes, representative of part of the genome. This effect needs to be introduced when the entire set of genes was too large to be spotted on a single array. All arrays on which the same set of genes was spotted form a 'batch'. The dye effect models the difference in measured intensities between the red and green dye; the array effect compensates for global intensity differences between arrays. Likewise, the pin effect is shared by all measurements that were spotted by the same spotting pin. The arrayxdye interaction effect is used, instead of a condition effect, to alleviate any condition-dependent variations in the measured intensities. Both effects are confounded, and using a condition effect would render the analytical solutions of the model fit dependent on the experimental design.
Apart from these global effects, a gene effect and an expression effect have been included. The gene effect normalizes each gene with respect to its basal expression level; expression is the effect of interest, i.e. the condition-affected change in intensity for each gene. Normalized expression values (dubbed E_hat) can be downloaded from the result page; a link is also provided to download all other parameters (dye, array, pin,...) that were estimated during the ANOVA procedure. The E_hat file contains as many rows as there are genes, and as many columns as there are conditions, in the analyzed experiment. The corresponding entries indicate how much a single gene is up- or downregulated in a certain condition.

ANOVA model and constraints

[Equation images not preserved: the ANOVA model and its constraints.]

Analytical solutions

[Eight equation images not preserved: the analytical solutions of the model fit.]

Using the modeling page

Using the modeling page is fairly straightforward. A number of checkboxes are present, representing the different sources of variation taken into account by our model. Depending on the specific design of the experiment, some of these checkboxes may be greyed out, i.e. explanatory variables that are not relevant for a specific experimental design are automatically discarded. For instance, when the total number of genes fits on one array, there is only one batch, so the 'batch' box is greyed out. All other effects (except 'expression') can be included or excluded (depending on whether or not the user would like to incorporate the respective sources of variation in the model) by clicking the corresponding checkboxes. For instance, when the gene effect is not checked, the expression data are not normalized with respect to their mean expression level (i.e. the gene effect); in some cases, it may be useful to retain this information within the expression values. This page also contains a checkbox for log-transforming the data. If the uploaded data are not already log-transformed, we recommend checking the log transformation box. Indeed, our model assumes an additive error (the absolute error is independent of the measured intensity), while in most cases there is a pronounced multiplicative error (the absolute error on the measurement increases with the measured intensity), so that the modeling assumptions are not satisfied. Log-transforming the data (multiplicative errors become additive) is therefore often required (Marchal et al., 2002).
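
The effect of the log transformation on a multiplicative error can be illustrated with simulated data. The sketch below is not MARAN code: the intensities, the noise level (sigma = 0.2) and the function name simulate_spots are assumptions chosen for illustration. It compares the spread of replicate measurements at a low and a high true intensity, before and after taking logarithms.

```python
import math
import random
import statistics


def simulate_spots(true_intensity, n, sigma, rng):
    """Simulated replicate intensities with a purely multiplicative error."""
    return [true_intensity * math.exp(rng.gauss(0.0, sigma)) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the illustration is repeatable
low = simulate_spots(100.0, 2000, 0.2, rng)
high = simulate_spots(10000.0, 2000, 0.2, rng)

# Raw scale: the absolute error grows with the measured intensity.
raw_ratio = statistics.stdev(high) / statistics.stdev(low)

# Log scale: the error has roughly the same size at both intensities.
log_low = [math.log2(v) for v in low]
log_high = [math.log2(v) for v in high]
log_ratio = statistics.stdev(log_high) / statistics.stdev(log_low)

print("spread ratio (high/low), raw data:", raw_ratio)
print("spread ratio (high/low), log data:", log_ratio)
```

On the raw scale the spread at the high intensity is far larger than at the low intensity, whereas after the log transformation both spreads are of comparable size, which is what an additive error model requires.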

Interpretation of results

Normalized expression values (and all parameters and residuals of the fitted model) can be downloaded from the 'Results' page after completion of the analysis. This page also presents an ANOVA table for interpreting the contribution of the different effects to the total amount of variation (given in the 'SS' (Sum of Squares) column). The ANOVA table of our test data is given below.
Source       SS            df       MS          F*           p(F > F*)
Bo           7258.3411     1        7258.3411   40452.4260   0.0000
Dl           9198.4442     1        9198.4442   51265.0726   0.0000
Ak(o)        9633.6236     6        1605.6039   8948.4048    0.0000
(AD)kl(o)    2148.8923     6        358.1487    1996.0462    0.0000
Gi(j(o))     129441.4614   8395     15.4189     85.9330      0.0000
Ein(j(o))    10215.3596    25185    0.4056      2.2606       0.0000
SSE          18078.7346    100757   0.1794      0.0000       0.0000
SSTO         185974.8567   134351   1.3842      0.0000       0.0000

Table 1: Example of the resulting ANOVA table after modeling.
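
The columns of Table 1 are related in the usual ANOVA way: MS = SS / df, and F* is the MS of an effect divided by the MS of the error (SSE row). As a small sketch, the values below are copied from Table 1 and the derived columns are recomputed; the function name anova_summary and the extra 'share' column (fraction of the total sum of squares) are additions for illustration, not MARAN output.

```python
# (source, SS, df) copied from Table 1.
EFFECTS = [
    ("Bo", 7258.3411, 1),
    ("Dl", 9198.4442, 1),
    ("Ak(o)", 9633.6236, 6),
    ("(AD)kl(o)", 2148.8923, 6),
    ("Gi(j(o))", 129441.4614, 8395),
    ("Ein(j(o))", 10215.3596, 25185),
]
SSE, DF_ERROR = 18078.7346, 100757


def anova_summary(effects, sse, df_error):
    """Recompute MS, F* and the share of total variation for each effect."""
    mse = sse / df_error                           # mean square error
    ssto = sum(ss for _, ss, _ in effects) + sse   # total sum of squares
    summary = {}
    for name, ss, df in effects:
        ms = ss / df
        summary[name] = {"MS": ms, "F": ms / mse, "share": ss / ssto}
    return summary


summary = anova_summary(EFFECTS, SSE, DF_ERROR)
for name, row in summary.items():
    print("%-10s MS=%12.4f  F*=%12.4f  share=%6.1f%%"
          % (name, row["MS"], row["F"], 100.0 * row["share"]))
```

Read this way, the table shows that the gene effect accounts for by far the largest part of the total sum of squares in the test data.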
Several plots for analyzing the ANOVA modeling assumptions are also included. These assumptions are twofold: firstly, the data should be adequately described by a linear model. Secondly, the error terms are assumed to be normally distributed with mean zero and constant variance. Information about the non-normality and heteroscedasticity (non-constant error variance) of the residual distribution can be obtained from the 'NQ plot' (Normal Quantile plot of residual values, Figure 3) and the 'Global residual plot' (Figure 2) respectively. To illustrate this, the figures below show the residual plots of an ideal case (artificial data), as opposed to what can be expected from a real experiment (test data set). Although heteroscedasticity is not too prominent in the latter case, there are some deviating features in the scatter plot that require further analysis.
Figure 2: Global residual plots of residual values: an ideal case with artificial data (left plot), and a more realistic plot from a test data set (right plot).
As for interpreting the NQ plot: ideally (i.e. for normally distributed residuals) a straight line is observed; any departure from this line indicates a deviation from normality. It should be noted that, while any serious heteroscedastic features should be avoided for further analysis, serious deviations from normality in the form of widened tails (like the ones we observe in our test data set, Figure 3) can often be acceptable due to the small number of data points compared to the number of parameters to be estimated. As explained later on, bootstrap methods are advisable when serious non-normality (without apparent heteroscedasticity) occurs in the residual distribution.
Figure 3: Normal Quantile plots of residual values: an ideal case with artificial data (left plot), and a more realistic plot from a test data set (right plot).
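
For readers who want to reproduce a Normal Quantile plot from the downloaded residuals, the points can be computed as sketched below. This is an illustration only: the plotting positions (i + 0.5) / n are one common convention and are an assumption here, the function names are hypothetical, and the short residual list is made-up example data.

```python
from statistics import NormalDist


def normal_quantile_points(residuals):
    """Pair each sorted residual with the matching standard normal quantile."""
    n = len(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    ordered = sorted(residuals)
    return [(std_normal.inv_cdf((i + 0.5) / n), r) for i, r in enumerate(ordered)]


def straightness(points):
    """Pearson correlation of the plotted points (1.0 means a straight line)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / (sxx * syy) ** 0.5


# Made-up residuals, for illustration only.
example_residuals = [-0.9, -0.4, -0.2, -0.1, 0.0, 0.1, 0.15, 0.3, 0.5, 1.2]
points = normal_quantile_points(example_residuals)
print(points)
print("straightness:", straightness(points))
```

Plotting the second coordinate against the first gives the NQ plot; widened tails show up as points that bend away from the line at both ends.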
More problematic, however, is an apparent heteroscedasticity caused by a superposition of non-linear trends in the residuals for each combination of major effects, indicating that the first assumption is not satisfied (i.e. a linear model is not adequate for describing the data). All other plots on the 'Results' page are the residual plots for each specific array-dye combination, and can be used to assess this problem. When obvious curvilinear trends are observed on these plots, remedial measures should be taken. The following figures (a few array-dye combinations from our test data) illustrate this phenomenon.
Figure 4: A few detailed residual plots after analyzing the test data set. C indicates condition number, A indicates array number, D indicates dye number (Cy5=1 and Cy3=2 by default) and B indicates batch number.
Note that the plots on the result page are all in JPEG format. It is possible, however, to study the plots in more detail on a separate page with plot applets. The applets allow zooming in and out, rescaling the axes and printing the individual plots.

Experimental design limitations

With the above model, it is technically possible to normalize any type of experimental design. This does not imply, however, that the quality of the normalization results is independent of the experimental design. In general, the more replicates are measured for each gene-condition combination, the better the estimated ANOVA parameters will be (which is not necessarily reflected in a low error sum of squares!). This follows naturally from the fact that each expression effect (the effect of interest, the condition-affected change in intensity for each gene) is calculated from all measurements of a single gene-condition combination (after being corrected for the other experimental variations included in the model). As a rule of thumb, we suggest that each gene-condition combination be measured at least twice, so as to ensure that a residual is obtained for each measurement (this is especially important when the residuals are later to be used as a statistical measure, e.g. for selecting genes with a significant change in expression). There are no strict rules as to how these replicates should be incorporated in the experimental design. However, in order to avoid partial confounding of effects, it may be wiser to ensure that the experimental sources of variation differ from one replicate to the next. For instance, a dye-swap experiment (replicates measured on a different array and in a different dye) may be more informative than simply repeating all measurements on a different array (same dye) or multiple spotting on the same array (same array, same dye).
To illustrate how a lack of replicates can lead to poor error estimation (due to the lack of residuals), we normalized the Spellman data set (Spellman et al., 1998). This experiment was a so-called 'reference design'. Eighteen arrays were used to test eighteen timepoints of the yeast cell cycle (measurements in red); each array used the same 'control' condition (measurements in green). Since there was no multiple spotting, all gene-condition combinations measured in red were measured only once, while the combination of each gene with the nineteenth condition (i.e. the control condition, always measured in green) was measured eighteen times (once on each array). This results in partial residual plots for each array, like the one illustrated below (Figure 5). There are no residuals for any of the measurements in red, since each of them is the only measurement of its gene-condition combination. Valuable biological interpretations may still be obtained from analyzing the expression effects (after all, they are normalized with respect to various experimental sources of variation), but using the obtained residuals for further statistical inference may prove detrimental.
Figure 5: Detailed residual plots of the fifteenth array after analyzing the Spellman data set. C indicates condition number, A indicates array number, D indicates dye number (Cy5=1 and Cy3=2 by default) and B indicates batch number.
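
Why unreplicated gene-condition combinations leave no usable residuals can be seen in a deliberately simplified sketch. It assumes that all other effects (dye, array, ...) have already been removed, so that the expression estimate of a gene-condition combination is simply the mean of its measurements; the function name cell_residuals and the numbers are made up for illustration and this is not the full MARAN model.

```python
from collections import defaultdict


def cell_residuals(measurements):
    """Residual of each measurement around the mean of its gene-condition cell.

    measurements: list of (gene, condition, value) tuples, assumed to be
    already corrected for all other effects (a simplification).
    """
    cells = defaultdict(list)
    for gene, condition, value in measurements:
        cells[(gene, condition)].append(value)
    means = {key: sum(values) / len(values) for key, values in cells.items()}
    return [value - means[(gene, condition)]
            for gene, condition, value in measurements]


# One gene in a reference design: each timepoint measured once (red),
# the control condition measured on every array (green).
data = [
    ("gene1", "timepoint1", 2.0),
    ("gene1", "control", 0.1),
    ("gene1", "timepoint2", -1.5),
    ("gene1", "control", -0.1),
]
residuals = cell_residuals(data)
print(residuals)
```

The two timepoint measurements are fitted exactly by their own estimate, so their residuals are zero and carry no information about the measurement error; only the replicated control condition yields informative residuals.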
With respect to problematic experimental designs, we would like to remark that using MARAN for normalizing dedicated chips will very likely give unsatisfactory results. Dedicated chips contain only a subset of genes, which are of specific interest for the conditions tested. Of course, it is generally impossible to normalize data from dedicated chips with any global normalization method (i.e. a method that assumes that roughly equal amounts of genes are up- vs. downregulated from one condition to the next).

REMEDIAL MEASURES

Loess-fit for alleviating non-linear residual trends

A remedial measure for non-linear effects in the dataset has previously been described by Yang et al. (2002). This Loess-based method has been made available in the MARAN web application ('Loess' page). The 'Loess' page can be accessed directly or after inspecting the results of an initial fit. The procedure generates a new dataset from your data, on which the ANOVA model can then be fitted (as a consequence of this procedure, dye effects will be zero after performing the Loess fit and the ANOVA model). It is important to keep in mind that a Loess-fit based correction for non-linearities is performed in the dye direction, separately for each array. This implies that, for complex experimental designs, non-linearities across arrays or conditions cannot be completely alleviated. Figure 6 below shows how performing a Loess fit can completely alleviate the observed non-linearities in some cases, but not in others.
Figure 6: A few detailed residual plots of the ANOVA-normalized test data set, after performing a Loess fit. They clearly show how this procedure can completely alleviate the observed non-linearities in some cases, but not in others. C indicates condition number, A indicates array number, D indicates dye number (Cy5=1 and Cy3=2 by default) and B indicates batch number.
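
The idea of a per-array correction in the dye direction can be sketched as follows. This is a simplified stand-in and not MARAN's implementation: it assumes the common formulation in which, for each array, the log ratio M = log2(R) - log2(G) is smoothed against the mean log intensity A = (log2(R) + log2(G)) / 2 and the fitted trend is subtracted from M. The smoother is a plain local linear fit with tricube weights and no robustness iterations; the function names and the span value are assumptions.

```python
import math


def tricube(u):
    """Tricube kernel: weight 1 at distance 0, falling to 0 at distance 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear_fit(x, y, span=0.4):
    """Locally weighted straight-line fit evaluated at every x."""
    n = len(x)
    k = min(n, max(3, math.ceil(span * n)))  # neighbours per local fit
    fitted = []
    for x0 in x:
        distances = sorted(abs(xi - x0) for xi in x)
        h = distances[k - 1] or 1e-12        # local bandwidth
        w = [tricube((xi - x0) / h) for xi in x]
        sw = sum(w)                          # > 0: the point itself has weight 1
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def loess_correct_array(red, green, span=0.4):
    """Return corrected log2 intensities (red, green) for a single array."""
    log_r = [math.log2(v) for v in red]
    log_g = [math.log2(v) for v in green]
    m = [r - g for r, g in zip(log_r, log_g)]          # log ratio
    a = [(r + g) / 2.0 for r, g in zip(log_r, log_g)]  # mean log intensity
    trend = local_linear_fit(a, m, span)
    m_corrected = [mi - ti for mi, ti in zip(m, trend)]
    new_r = [ai + mc / 2.0 for ai, mc in zip(a, m_corrected)]
    new_g = [ai - mc / 2.0 for ai, mc in zip(a, m_corrected)]
    return new_r, new_g
```

Because the trend is estimated from the two dyes of one array at a time, a correction of this kind cannot remove non-linearities that only appear when arrays or conditions are compared with each other, which matches the limitation noted in the text.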

FILTERING RESULTS

As explained above, the obtained estimates of the error terms can be used for statistical analysis of the ANOVA parameters. Two different methods for selecting genes with significantly changing expression have been made available on the website. Both methods differ in the calculation of confidence intervals for the expression estimates (the parameters of interest, i.e. the condition-affected change in intensity for each gene), as derived from fitting the ANOVA model. The statistical test for selecting genes with significantly changing expression, based on these confidence intervals, is identical for both methods. For each gene i:

[Two equation images not preserved: the statistical test for gene i.]

Here, δ is an unspecified value. Basically, this test evaluates, for each single gene, whether a single expression level (i.e. δ) exists that could account for each calculated 'expression' level of that gene (indicating that this gene is not differentially expressed). We chose a non-fixed value for δ, as opposed to e.g. the average of the calculated 'expression' levels or the calculated 'gene' level.
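
One way to read this test in terms of confidence intervals is sketched below; this is an interpretation, not a description of MARAN's code. If every 'expression' estimate of a gene comes with a confidence interval, a single level δ compatible with all of them exists exactly when the intervals share at least one common point. The function name common_level_exists and the intervals are hypothetical.

```python
def common_level_exists(intervals):
    """True if one value lies inside every (lower, upper) interval."""
    highest_lower = max(lower for lower, _ in intervals)
    lowest_upper = min(upper for _, upper in intervals)
    return highest_lower <= lowest_upper


# Hypothetical confidence intervals for one gene in three conditions.
flat_gene = [(-0.3, 0.4), (-0.1, 0.6), (0.0, 0.5)]
changing_gene = [(-0.3, 0.4), (0.9, 1.6), (0.0, 0.5)]

print(common_level_exists(flat_gene))
print(common_level_exists(changing_gene))
```

Under this reading, the first gene would not be selected (any δ between 0.0 and 0.4 fits all three intervals), while the second would be reported as differentially expressed.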

Use of normal confidence intervals

The first method is valid under the assumption of normally distributed error terms, with mean zero and constant error variance. The null hypothesis for selecting differentially expressed genes is that all calculated 'expression' parameters for a single gene are sampling instances from a normal distribution (based on standardized residuals) around a non-specified 'expression' value. Correction for multiple testing is done by using the Bonferroni correction procedure. A selection of genes can be obtained by entering a preferred significance or, when desired, p-values for all genes can be downloaded. Although this method should not be applied when there is doubt that the above assumptions are satisfied, this method is relatively fast and may therefore serve as a preliminary indication of differentially expressed genes.
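
The Bonferroni correction mentioned above amounts to comparing each gene's p-value with the chosen significance level divided by the number of genes tested. A minimal sketch follows, applied to p-values such as those that can be downloaded; the function name bonferroni_select and the four p-values are made up for illustration.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Return the genes whose p-value passes the Bonferroni threshold.

    p_values: dict mapping gene name to its unadjusted p-value.
    """
    threshold = alpha / len(p_values)  # one test per gene
    return sorted(gene for gene, p in p_values.items() if p < threshold)


# Made-up p-values for four genes.
p_values = {"g1": 0.001, "g2": 0.02, "g3": 0.2, "g4": 0.012}
selected = bonferroni_select(p_values, alpha=0.05)
print(selected)
```

With four genes and a significance level of 0.05, the threshold is 0.05 / 4 = 0.0125, so only g1 and g4 are selected. For a real array with thousands of genes the threshold becomes correspondingly strict.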

Use of bootstrap confidence intervals

The alternative method is based on a bootstrap procedure (DiCiccio and Efron, 1996; Efron, 1979; Efron and Tibshirani, 1986). It is a 'fixed predictor sampling method', similar to the one described by Kerr et al., 2001, and is appropriate when the residuals show serious deviations from normality but no apparent heteroscedasticity is present. A selection of genes can be obtained by entering a preferred significance level. Unlike the method described above, however, it does not provide p-values for each gene.
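
The principle of fixed predictor sampling (also known as residual resampling) can be sketched as follows: the fitted values are kept fixed, the residuals are resampled with replacement and added back to the fitted values, and the estimate is recomputed on each pseudo-dataset. The sketch uses a simple percentile interval and the mean as estimator; both choices, the function name and the data are assumptions for illustration, since the exact interval type used by MARAN is not stated here.

```python
import random


def residual_bootstrap_ci(fitted, residuals, estimator,
                          n_boot=1000, level=0.95, seed=0):
    """Percentile confidence interval from resampled residuals."""
    rng = random.Random(seed)  # fixed seed for repeatable illustration
    estimates = []
    for _ in range(n_boot):
        # Keep the fitted values fixed; draw residuals with replacement.
        pseudo_data = [f + rng.choice(residuals) for f in fitted]
        estimates.append(estimator(pseudo_data))
    estimates.sort()
    tail = round((1.0 - level) / 2.0 * n_boot)
    return estimates[tail], estimates[n_boot - 1 - tail]


def mean(values):
    return sum(values) / len(values)


# Made-up replicate measurements of one gene-condition combination.
measurements = [1.0, 1.2, 0.8, 1.1, 0.9]
estimate = mean(measurements)
fitted = [estimate] * len(measurements)
residuals = [m - estimate for m in measurements]

lower, upper = residual_bootstrap_ci(fitted, residuals, mean)
print("estimate:", estimate)
print("95% bootstrap interval:", (lower, upper))
```

Because the residuals are pooled and reassigned freely, this scheme presupposes that all residuals have the same spread, which is why the method is not suited to heteroscedastic data.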

INCLUSive

There is a direct link from MARAN to the INCLUSive web application (Thijs et al., 2002). INCLUSive is a suite of web-based tools aimed at the automatic multistep analysis of microarray data (clustering and motif finding). Currently, adaptive quality-based clustering, retrieval of upstream sequences and the motif sampler are accessible from the website. All tools are linked together, starting with the clustering, then retrieving the upstream sequences of the genes belonging to a cluster and finally using the motif sampler to find over-represented motifs in this set of upstream sequences. All results obtained through modeling and filtering at MARAN can be sent to the web-based clustering algorithm. To proceed with upstream sequence retrieval and motif sampling, gene names and accession numbers should be available for every gene. If the initially uploaded file did not contain gene names and accession numbers, the link to INCLUSive will be limited to clustering.

MARAN SOAP SERVICE

The functionalities of the MARAN web application can also be accessed through a web service. This service is provided to enable easy integration of the functionalities into your own applications. All information about this can be found here.

REFERENCES

1. DiCiccio,T.J. and Efron,B. (1996) Bootstrap confidence intervals. Statistical Science 11, 189-228

2. Efron,B. (1979) Bootstrap methods: another look at the jackknife. The Annals of Statistics

3. Efron,B. and Tibshirani,R. (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 54-77

4. Jin,W. et al. (2001) The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster. Nat Genet 29, 389-395

5. Kerr,M.K. and Churchill,G.A. (2001) Experimental design for gene expression microarrays. Biostatistics 2, 183-201

6. Kerr,M.K. et al. (2000) Analysis of variance for gene expression microarray data. J Comput Biol 7, 819-837

7. Marchal,K. et al. (2002) Comparison of different methodologies to identify differentially expressed genes in two-sample cDNA microarrays. J Biol Systems, in press

8. Spellman,P.T. et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9, 3273-3297

9. Thijs,G. et al. (2002) INCLUSive: INtegrated Clustering, Upstream sequence retrieval and motif Sampling. Bioinformatics 18, 331-332

10. Yang,Y.H. et al. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 30, e15
 
 