DNA array data analysis



1. Introduction
2. Overview of the whole process
3. Diagnosis and Normalization
4. Gene Expression Pattern Preprocessing
5. Clustering of gene expression patterns
6. Identifying differentially expressed genes
7. Data mining
8. Play with more examples
8. References


1. Introduction

DNA microarray technology opens up the possibility of measuring the expression level of thousands of genes in a single experiment. Serial experiments measuring gene expression at different conditions, times, or distinct experiments with diverse tissues, patients, etc., allows obtaining gene expression profiles under the different experimental conditions studied. Initial experiments suggest that genes having similar expression profiles tend to be playing similar roles in the cell.
Microarray experiments can be conceptually subdivided into material- and data-processing steps. During material processing, important information needs to be recorded, such as array design, experimental conditions and sample treatment, to enable meaningful data analysis and biological interpretation. Microarray data processing, which can be further divided into data preprocessing, such as normalization and filtering, data-analysis steps and the biological interpretation of the results.
The first steps in microarray data preprocessing involve image scanning, and include spot finding and the selection of good quality spots. Next, data-normalization steps are necessary to correct unavoidable experimental variations, such as differences in sample preparation, dye incorporation and hybridization efficiencies. These variations are not owing to differences in gene expression in the original samples and, therefore, need to be corrected before data analysis can be carried out. Such analysis might include various methods to identify genes that are differentially expressed or conditions (cell-culture treatments, diseases and so on) that result in similar changes in gene expression. Some normalization or data-analysis methods require special arrangements, such as a particular array design. Therefore, both material- and data-processing steps need to be considered at the early stages of a microarray experiment.
The biological interpretation of the data is facilitated by various tools, which place the analysis results into context with existing biological knowledge, such as the scientific literature or sequence data. Efforts to unify and standardize the way in which information is recorded are making the interpretation of large-scale experiments easier.
Finally, the integration of biological information from various sources, such as large-scale data sets produced by various experimental techniques, provides a valuable platform for the exploration of regulatory networks.





2. Overview of the whole process






3. Diagnosis and Normalization

DNA microarrays are part of a new class of biotechnologies that allow the monitoring of expression levels in cells for thousands of genes simultaneously. In order to accurately and precisely measure gene expression changes, it is important to take into account the random (experimental) and systematic variations that occur in every microarray experiment.
Sources of variation (not related with variation due to biological causes) include:

The goal of normalization is to adjust for effects that are due to variations in the technology rather than the biology. In particular, we will try to adjust for differences in the red and green labeling caused, for example, by differences in the binding of the labels; a widespread phenomenon are differences in the labeling (i.e., dye biases) that are related to intensity (the typical curvilinear MA plots).

Let's see several plots used for the diagnosis and normalization of microarrays. Click here.

Now, think about the next experiment. Imagine that we extract mRNA form a mouse liver, we split the sample in two and we label each sample with cy3 and cy5 respectively. Later we hybridise those two samples into the same array. What should we expect from the MA plot?

Below there are two plots for an experiment in which two identical mRNA are labeled with different dyes and then hybridized to the same slide.

If there weren't any differences in the labeling, or in the photodetection, the values should be distributed around an M of 0 since we're hybridizing identical mRNA, but as we can see in the plots this is not the case, so the need for normalization becomes evident.

We provide a web interface to help in the normalization of cDNA microarray data, DNMAD (Vaquerizas et al., 2004). The method implemented is the print-tip loess as explained in Yang, Dudoit, Luu, Lin, Peng, Ngai, and Speed (2002). "Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation". Nucleic Acids Research, Vol. 30, No. 4, e15.

With print-tip loess, the majority of genes are used for the normalization. Thus, we are assuming that most genes are not differentially expressed, that there is approximately an equal number of up- and down-regulated genes, and these assumptions should also hold for each print-tip group (see also Yang et al., p. 4). If these assumptions are unreasonable (e.g, a small custom-made array where most genes are differentially expressed), this method should not be used.

The program expects GenePix GPR files. Please, use the format as it is in the gpr files. Details about the GPR format & example.

You can enter either a number of individual GPR files or a compressed file containing all the arrays in GRP file format.

As resutls, you are given the box-plots of the log (base 2) ratio before and after print-tip loess normalization. This allows you to assess the need for slide scale normalization (a normalization that will ensure the same scale for all arrays).

Box plots for each array show box-plots, before and after normalization, of the ratio for each array. This allows you to assess if the scale of different print-tip groups is comparable.

Image plots (which include the histograms of the raw intensities) to check the quality of your arrays.

MA-plots show the relationship between A (the "average signal" [0.5 * (log R + log G)], where R is the background subtracted red [mean of F635 - median of B635] and G the background subtracted green [mean of F532 - median of B532]) and M (the log [base 2] differential ratio: log(R/G)).

Also, the program returns a file with the normalized M and A values. These M values are always from the log (base 2) ratio of 635/532 (or Red/Green). If your experimental setup would require the 532/635 ratio, just flip the signs of the output. The A values are the "average signal" [0.5 * (log R + log G)].

Using DNMAD

Exercise 1

The file for this exercise is here. Save this file to your desktop.

This file is a compressed file containing 3 different arrays.

Go to the application URL (http://dnmad.bioinfo.cipf.es) and select the file you have just downloaded. Also, enter the layout of the microarray. For this exercise, the layout values are 12 x 4 and 18 x 15.

Check the option "Use backgorund subtraction" and leave the others as defaults.

Press the "RUN" button and wait for the results. The results are also here.

The boxplots for all the arrays show that before normalization, the values in the array are not centered about an M equals to 0, as it shoud be (see the assumptions), so there is a clear need for normalization.



Clicking on the button labeled "View all" in the column of the boxplots, we can examine how the normalization procedure has worked for each individual array within each print-tip-group.

Clicking on the button labeled "View all" in the column of the Ma-plots, we can see how the regresion curve has been adjusted to an M equals to 0 as it should be under our assumptions.

Clicking on the button labeled "View all" in the column of the diagnostic plots, we can check for corrupted arrays, or for bad scanner settings and so on.

Looking at the MA-plots it seems to be that the normalization procedure has worked alright for our arrays and that they are ready to start the analysis, but the print-tip-group boxplots and the diagnostic plots do not show the same.

Looking at the boxplots we can observe that group 45 is a bit out of scale, particularly in arrays 00G66.txt and 01G18.txt. This may induce us to think that there is a defect on the needle that printed these groups.





Also, we can also observe that in array 98G40.txt there is a problem in print-tip-group 4. Looking at the diagnostic plot we can observe that the red background is almost saturated for these print-tip but not fot the others, indicating problems with the hybridisation in this group.



We can also observe that there are some white dots in the diagnosis plot that shows the M of the array (log 2 of the ratios). These are going to be considered missing values ("NA") because you have more background than foreground so the log2 of the ratios can not be calculated. If you can go on in spite of these errors, we're ready to start our analysis process.

In order to do this, we have two different options:






4. Gene Expression Pattern Preprocessing

Next, we have to preprocess the patterns. This step includes the transformation of the scale, the handling of missing values, the removing of flat gene expression patterns and the standardization of the remaining patterns. Those steps are optional and depends on the actual data set and on the methods that you have used previously. For example, if you use DNMAD for normalization, the step of scale transformation becomes unnecessary as you've got your data in log2 scale already.

These are the transformations available in the Preprocessor (Herrero et al., 2003):

Also there is a "Pre-Analysis" module that performs several checks and plots different histograms to help and to guide the user through the options. It automatically selects or unselects different options and gives useful information to the user.

Here is the list of topics covered by the pre-analyse module:

It is important to emphasize that the Pre-Analyse module does not change the original file: the user can ignore the recommendations simply by unselecting the corresponding options.
To know more about preprocessing, see the tutorial on gene expression pattern preprocessing.






5. Clustering of gene expression patterns

5.1 The data

We use the data corresponding to the diauxic shift in yeast (DeRisi et al. ,1997, Exploring the Metabolic and Genetic Control of Gene Expression on a genomic Scale. Science, 278: 680-686.)
[See data]

5.2 Data processing

Data are arranged in a table where rows contain the different gene expression values for a given gene in the different experimental conditions. Each column corresponding to one of the experimental conditions. Values are tab-delimited.
A log2 transformation has been applied to the data..
[see data]
All the irrelevant data have been removed. In that case all the data showing no significant variation along the different experimental conditions.
[see data]


In some cases, standarization of the data may be helpful. Values are transformed by substracting the mean and dividing by the varianze. Then, all the profiles have a mean of zero and a varianze of one.
[see data]

5.3 Clustering

 
This is the hierarchical classification obtained for the previous data using SOTA and lineal correlation as distance.
 
Arbol completo
See full figure
 
Note on the right part of the figure the color-coded patterns corresponding to the genes (see details in the figure).

Same clustering with a symbolic representation using a confidence level in the variabiliy value of 95%.

 
Arbol completo
See full figure
 
In this figure the size of the circles represents the number of genes in the cluster, with the value of the average pattern on its right.
The variability is defined in each cluster as the largest among all the pattern-pattern distances in it. Thus the two more divergent patterns in a cluser define the radius of variability of the cluster. The confidence level is based on a resampling test of the pattern values. See SOTA tutorial for details.

5.4. Practical exercises

5.4.1 Clustering with SOTA

If you have your own data you can go directly to the server: Sotarray Server
1st Part:

Go to data sets page clicking here:

DNA-arrays: Public Data Sets

Select Diauxic shift:

Diauxic Shift Data Set

(Optional)  Note in the different files the steps in the treatment of the data

Choose a file (data_norm.txt, for example) and send it to the server by clicking here:

sota/cluster

Select SOTA server:

Send To Sotarray

Select Unrestricted Grow option

Click on Run.

Send result to TreView server to see the tree:
 

Send To TreeView

Accept default parameters and press Run

Go back by clicking here:

Change Parameters

Uncheck Gene Names change to 1 the value of Vertical Separation

Press Run again.


2nd Part:

Close the window and go back to the Sotarray result screen

Press Change Parameters button to modify`parameters:
 

Change Parameters

Select Variablity Threshold (%) option and set the value to 90 (default).

Press Run again.

Send the results both to TreeView and al SotaTree:
 

Send To SotaTree
Send To TreeView

Compare the results

5.4.2 Classical hierarchical clustering (UPGMA)

If you have your own data you can go directly to the server: Cluster Server
In data sets page choose one of the experiments.
DNA-arrays: Public Data Sets
Choose a file and send it to the server by clicking on the sota/cluster link.
Go to the Cluster server
Send To Cluster
Accept dafault values (UPGMA and correlation) and press Run
Send result to TreeView to display the tree:
Send To TreeView
Choose the change parameters option, uncheck Gene Names and set the Vertical Separation to 1.
Press Run

5.4.3. Self-Organizing Maps

If you have your own data you can go directly to the server: SOM Server
From data sets page, choose an experiment.
DNA-arrays: Public Data Sets
Send file to the server following the link: som
Send it to SOM:
Send To Cluster
Accept default values and press Run





6. Identifying differentially expressed genes

One of the most interesting problems consists of finding genes differentially expressed among two or more conditions (e.g. different cancer types). Conceptually related to this is finding genes related to a given continuous variable (e.g. the level of a metabolite) or the case of survival, a particular case of a continuous variable. Nevertheless, finding the proper group of genes among the thousands present in the arrays is not an easy task.

The Pomelo tool has been designed to control the problem of multiple testing when searching for differentially expressed genes. Using microarray data we are testing for differential expression of a high number genes and we need to account for multiple testing. The problem of using the p-value from each test directly is that we are examining many null hypotheses (one null hypothesis i.e. non differential expression for each gene). If we were to consider each of the tests with a p-value smaller than, say, 0.05, as significant, we would end up with an excessive number of differentially expressed genes. Despite this, not many authors are aware of this problem and few programmes consider multiple testing in their design. The need to account for multiple testing has been reviewed for the analysis of microarrays by Dudoit et al. In Pomelo we have implemented four methods to account for multiple testing; two of them control the Family Wise Error Rate (21) and two others control de False Discovery Rate.

These methods can be applied to five different statistical tests:

Program input

Data files and format

The files "Covariates" and "Class labels or dependent variable" are required. In addition, if your data are survival data, you need to provide a file with the "Censored indicators".

Covariates file
The file with the covariates; generally the gene expression data. In this file, rows represent variables (generally genes), and we want to find those variables that are most distinctly expressed among groups (e.g., a t-test or ANOVA) or that are most related to, say, survival (e.g., Cox model).

Class labels or dependent variable
These are generally the class labels (e.g., healthy or affected, or different types of cancer) that group the samples, or the survival times of patients, or another dependent continuous variable (if regression models). In our analyses we want to find which of the covariates is significantly different among (between) the classes given here, or which of the covariates is significantly related to the dependent variable.

Censored indicator
For survival data only. An observation is censored if the time of occurrence of the event (usually death) has not yet been observed. We will represent uncensored observations with a 1 (because we do have observed the time of death) and censored observations with a 0 (meaning that the survival time given is only a lower bound).

Data format for covariates

The file for the covariates should be formated as:






7. Data mining

Perhaps one of the most demanded kinds of tools are those that can transfer biologically relevant information to microarray experiments. This information can be extracted either from free text (e.g. Medline abstracts) or from more or less curated repositories. The use of text mining techniques in studying the coherence of gene groups obtained from different methodologies has only recently been addressed, although its practical application still poses many drawbacks. Furthermore,availability to end users is often scarce. Gene Ontology, which organizes information for molecular function, biological processes and cellular components for a number of different organisms, and KEGG, which includes a comprehensive description of different pathways, are among the most used curated repositories of information. Different tools, which generate tables correlating groups of genes to GO terms regarding biochemical and molecular functions, have been recently implemented and the web page of the GO consortium. Among them is FatiGO,can deal with thousands of genes and extract the GO terms of relevance for a given set of genes with respect to the rest of them. These terms are obtained with the application of a test that takes into account the multiple-testing nature of the statistical contrast. The module produces a graphical representation with a bar chart with the proportion of GO terms in the analysed cluster with respect to the cluster of reference. Adjusted p-values for the differentially represented GO terms are given too. Also, links to the GO terms as well as to the genes are provided. FatiGO has the advantage of being integrated within the GEPAS platform.

7.1.What is FatiGO?

FatiGO is a web interface which carries out simple datamining using Gene Ontology for DNA microarray data. The datamining consists on the assignation of the most characteristic Gene Ontology term to each cluster. GO terms are related to Human, Mouse, Fly, Worm and Saccharomyces genes and proteins. The assignation of the most relevant GO terms to each cluster is performed by means of a chi-square test. Since each gene can contribute with a different number of GO terms to the p-value for the total chi-square test is obtained by means a permutation test.

7.2. Program Input

Databases

Select suitable database for your genes query. Each Gene of these databases has one or more Gene Ontology terms annotated.

Gene Ontology June 2004
UniGene Build #170 Homo sapiens
UniGene Build #137 Mus musculus
Mouse Genome Database software Released date: 06/04/2004
Saccharomyces Genome Database Released date: 06/17/2004
FlyBase Released date: May 22, 2004
WormBase Released date: 05/11/2004
Rat Genome Database (RGD) Released date: 05/13/2004
Arabidopsis Information Resource (TAIR) Released date: 05/28/2004
GOA Human 20.0 Released date: June 04 2004
GOA UniProt 18.0 Released date: June 04 2004
Ensembl Human v. 19.34b
Affymetrix: AFFY_HG_U133A, AFFY_HG_U133A_2, AFFY_HG_U133B, AFFY_HG_U133_PLUS_2, AFFY_HG_U95A, AFFY_HG_U95Av2, AFFY_HG_U95B, AFFY_HG_U95C, AFFY_HG_U95D, AFFY_HG_U95E

 

 

Ontology and Level

A gene product can have one or more molecular functions, be used in one or more biological processes and may be associated with one or more cellular components. You want to know the distribution of your genes in a specific ontology (function, process or component).
You must select the ontology and also the GO term´s level [ 2 - 5 ]. The optimum levels for the three ontologies where are located most genes is >=3.

Inclusive analysis

FatiGO performs the analysis on nodes of the GO hierarchy (DAG). If the level corresponding to (e.g.) apoptosis was selected. Any gen annotated as either apoptosis or as any children term was considered in the same category for the test. This increases the power of the test. We have to test less terms with more genes.

GO Terms Genes
GO:0006915: apoptosis BCL2-like, plasminogen
GO:0042981: regulation of apoptosis Caspase6, Caspase7, Caspase9
GO:0043066: negative regulation of apoptosis Caspase2
GO:0006917: induction of apoptosis Caspase12, Interleukin 9

Level chosen for inclusive analysis

In the example, choosing apoptosis node for the analysis will assign 8 genes to the term corresponding to the node. If inclusive analysis is not used, then four terms apoptosis (with 2 genes), regulation of apoptosis (3), negative regulation of apoptosis (1) and induction of apoptosis (2) are taken into account with the obvious decrease in the power of the test.

 

Data and format

The data can be Gene Symbol, Cluster Ids, standard name or systematic name for each suitable database.
If you download a file:
For a excel file the Gene or protein Id must be in one column
For a text file put one gen or protein by line.

Human
Enter in the search box the UniGene Cluster Id for human with the format Hs.--- or the Gene symbol for each gen or SWISS-PROT/TrEMBL entry name or Ensembl peptide identifiers (ENSPxxx....) - and Ensembl gene (ENSGxxx..) for human. Example: ClusterId: Hs.4994 or Gene Symbol: TOB2 SWISS-PROT/TrEMBL: HAX1_HUMAN. Ensembl: ENSG00000104691 or ENSP00000264167
Mouse
Enter in the search box the UniGene Cluster Id for mouse with the format Mm.--- or the Gene symbol for each gen..
Example: ClusterId: Mm.4994 or Gene Symbol: Tnfsf4.
Saccharomyces Genome Database
Enter in the search box a standard or systematic sequence name .
Example: Standard name: RRP5 or Systematic name: YMR229C.
FlyBase
Enter in the search box a Gene symbol .
Example: Gene symbol: Ace
WormBase
Example: ID: C39B5.5
Rat Genome Database
Example: Gene: Atp2a
Arabidopsis (TAIR)
Example: Gene: ABI4
SwissPROT/TrEMBL (Uniprot)
Enter in the search box a SWISS-PROT/TrEMBL entry name.
Example: Entry name: IPUA_ASPNG .
GenBank (accesion number)
Enter a accesion number Genbank nucleotide database.
Example: Accesion: AK007681

7.3. Include Reference Cluster

If you want study of the differential distribution of GO terms for two sets of genes, you must include a reference group of genes. In this case FatiGO extract relevant GO terms by the application of a test that takes into account the multiple-testing nature of the statistical contrast performed.Also you can select if the multiple test include adjusted p-values (control FWER) or not. The result show a comparative graphical view with the distribution of GO terms and the p-values:
Unadjusted p-value
Step-down min p adjusted p-value
FDR (independient) adjusted p-value
FDR (arbitrary dependent) adjusted p-value

The p-value after adjusting for multiple testing. We are simultaneously testing multiple null hypothesis (one for each GO term) of no difference in the frequency of terms in each cluster. If we were to simply report the p-value from a Fisher's exact test or a chi-square test, we would be increasing the Type I error rate. The application of a procedure for multiple testing allows us to try to determine which of the GO terms differ between the two clusters, while at the same time controlling the Family Wise Error Rate.

 

 
Unadjusted p-value
The unadjusted p-value; in most cases, this p-value is obtained using the random permutations, where each row-wise p-value is based only on the results for each row or gene. With Fisher's exact test, this is the exact p-value using Fisher's exact test procedure.
Adjusted p-value
The procedure we have implemented here uses the step down max-T procedure (see details, for example, in Westafall & Young, 1993, 'Reampling-based multple testing', John Wiley & Sons), with a test statistic that is equivalent to Fisher's exact test for 2x2 contingency tables. The number of random permutations is set at 10000, unless the total number of possible arrangements of the data is smaller than 10000, in which case we use complete enumeration.
FDR (independient) adjusted p-value
Adjusted p-value using the FDR procedure of Benjamini & Hochberg;
FDR (arbitrary dependent) adjusted p-value
Adjusted p-value using the FDR procedure of Benjamini & Yekutieli that provides strong control of the FDR under arbitrary dependent structures.

7.4. Using FatiGO

Examples

For Saccaromyces

These files has interacting proteins in the yeast. One file include overinteracting proteins and the other one is a file with underinteracting proteins .
With FatiGO you can study the biological process of this proteins involved in interactions and finding the GO terms differentially represented in both sets.

Cluster_Query overinteracting proteins
Cluster_Reference underinteracting proteins
Include these files in FatiGO select Saccaromyces Genome database, ontology biological process and level 3. Also You can choose the multiple test with or without adjusted p-values (control of FWER).

For Human

Example Human genes involved in apoptosis processes

Using SOTA

This data set corresponds to an experiment carried out by a grop of the Stanford University about the diauxic shift in S. cerevisiae

1st Part:

Go to data sets page clicking here:

DNA-arrays: Public Data Sets

Select Diauxic shift:

Diauxic Shift Data Set

(Optional)  Note in the different files the steps in the treatment of the data

Choose a file (data_norm.txt, for example) and send it to the server by clicking here:

sota/cluster

Select SOTA server:

Send To Sotarray

Click on Run.

Send result to SotaTree server to see the tree:
 

Send To SotaTree

Accept default parameters and press Run



 
In this figure the size of the circles represents the number of genes in the cluster, with the value of the average pattern on its right.
The variability is defined in each cluster as the largest among all the pattern-pattern distances in it. Thus the two more divergent patterns in a cluser define the radius of variability of the cluster. The confidence level is based on a resampling test of the pattern values. See SOTA tutorial for details.



Arbol completo

Click on the circle with 81 genes.You will get two new windows:
one with a plot with the profile of the cluster pattern.
And the list of the genes in that cluster.

Click on Extract this cluster Perfil

Send result to FatiGO:
 

Send To FatiGO



Cluster Query is the cluster 5 and the reference cluster is the remaining genes.
Select suitable database (Saccaromyces Genome Database), the level ([2 - 5]) and the ontology (biological_process, molecular_function, cellular_component).

Click on Search and ............
For Ontology biological process and level 3 we have three terms with p-value < 0.05 : cellular physiological process, response to stimulus and metabolism


PaintGO






8. Play with more examples

Go to data sets page by clicking here:
DNA-arrays: Public Data Sets
As you have repeatedly been told, if you have your own data, feel free of using our server:
GEPAS: Gene Expression Pattern Analysis Suite
Also, you can take a look to our on-line courses at:
http://bioinfo.cipf.es/docus/courses/on-line.html





9. References


Tue Jul 13 16:54:39 CEST 2004