Last modified on 14 January 2000.

## (1983) The First Comparative Algorithm: Chi-Square Statistics

The first method to use a computer algorithm for detection of covariation was a chi-square approach developed by Gary Olsen (thesis, University of Colorado Health Sciences Center, 1983). The general chi-square method involves a comparison between observed and expected data. Large deviations from expected values produce large chi-square values, which indicate a correlation.

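For comparative analysis, the chi-square value for a pair of positions i and j is calculated as:

$$\chi^2(i,j) = \sum_{M,N} \frac{\left[\,n_o(M_i,N_j) - n_e(M_i,N_j)\,\right]^2}{n_e(M_i,N_j)}$$

where

- $n_o(M_i,N_j)$ is the number of observations of base pair M:N at position i-j, and
- $n_e(M_i,N_j)$ is the expected number of observations for M:N at position i-j, calculated from the observed numbers of bases M and N at positions i and j ($n_o(M_i)$ and $n_o(N_j)$).

The sum runs over the pairing types M:N. For a standard chi-square test of independence, the expected count is $n_e(M_i,N_j) = n_o(M_i)\,n_o(N_j)/T$, where $T$ is the number of sequences in the alignment.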
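As a rough illustration of the calculation described above, the following Python sketch scores one pair of alignment columns. The function name `chi_square`, the representation of each column as a string, and the use of the usual independence expectation for the expected count are assumptions made for this example; it is not a reconstruction of Olsen's original program.

```python
from collections import Counter


def chi_square(col_i, col_j):
    """Chi-square score for two alignment columns (equal-length strings).

    Hypothetical helper for illustration only.
    """
    if len(col_i) != len(col_j):
        raise ValueError("columns must contain the same number of sequences")
    total = len(col_i)

    n_i = Counter(col_i)               # observed counts of each base at position i
    n_j = Counter(col_j)               # observed counts of each base at position j
    n_ij = Counter(zip(col_i, col_j))  # observed counts of each pair M:N

    score = 0.0
    # Only bases that actually occur are visited; a base that never occurs
    # gives an expected count of zero and contributes nothing to the sum.
    for m in n_i:
        for n in n_j:
            expected = n_i[m] * n_j[n] / total  # assumed independence expectation
            observed = n_ij.get((m, n), 0)
            score += (observed - expected) ** 2 / expected
    return score


# Two columns that covary perfectly (G:C, C:G, A:U, U:A) score high.
print(chi_square("GGCCAAUU", "CCGGUUAA"))  # 24.0
# Two columns that vary independently score zero.
print(chi_square("GGCC", "AUAU"))          # 0.0
```

In this toy case, every observed pair count in the second call equals its expected count, so no deviation accumulates.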

When this technique was first developed, technical limitations hampered its effectiveness. Contemporary computers were slow by modern standards, and the sequence databases were smaller than the minimum size needed to use statistical methods with confidence. As computer technology improved and the sequence databases grew, the chi-square method was refined and proved to be a useful development.

The chi-square method is not biased toward any particular pairing types; each of the 16 pairing types is treated equally. The method does not search specifically for secondary structure elements; yet, chi-square analysis predicted all of the tRNA secondary structure base pairings from about 200 tRNA sequences, and Watson-Crick pairing emerged as the most common pairing motif, independently verifying this basic principle of RNA structure. Chi-square can detect weaker interactions; this method indicated that the pairing partners in the tRNA D-helix, which were not found using the reddot-greendot approach, covary best with each other. These weaker interactions also include statistically significant constraints between all of the nucleotides within the tRNA D-helix.

The initial successes of the chi-square method led to the development of several algorithms based on chi-square statistics. One of these methods is described in detail below.

## (1992-present) An Improved Statistical Method at Work: The Mutual Information (*mixy*) Algorithm

The limitations of the number pattern method ensured that other methods would be designed and tested. The *mixy* (Mutual Information between positions X and Y) algorithm is a chi-square-based statistical method which yields improved results in the detection of RNA interactions. Rather than searching for an entire structural element among many possibilities, the chi-square statistical methods test for interactions occurring between two positions, which reduces the number of search possibilities to *n*², where *n* is the number of nucleotides in the sequence under study. Like the number pattern method, *mixy* is not prejudiced toward Watson-Crick interactions; the algorithm will detect each of the sixteen pairing types which may appear in an analysis.

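The *mixy* score can be calculated using this formula:

$$\mathit{mixy}(i,j) = \sum_{M,N} f_o(M_i,N_j)\,\log\frac{f_o(M_i,N_j)}{f_o(M_i)\,f_o(N_j)}$$

where

- $f_o(M_i)$ is the base frequency for M at position i,
- $f_o(N_j)$ is the base frequency for N at position j, and
- $f_o(M_i,N_j)$ is the base pair frequency for M:N at position i-j.

Pairing types that are never observed contribute nothing to the sum, and the base of the logarithm only sets the units of the score.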

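If the bases at positions i and j vary independently, each base pair frequency is approximately the product of the two base frequencies, and the *mixy* score will be approximately zero.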
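The score described above can be sketched directly in Python. This is a minimal, unoptimized example under stated assumptions: the function name `mixy_score`, the string representation of alignment columns, and the choice of the natural logarithm are all choices made here for illustration, and gap characters are treated like any other symbol.

```python
import math
from collections import Counter


def mixy_score(col_i, col_j):
    """Mutual information between two alignment columns (equal-length strings).

    Hypothetical helper for illustration; uses the natural logarithm.
    """
    if len(col_i) != len(col_j):
        raise ValueError("columns must contain the same number of sequences")
    total = len(col_i)

    n_i = Counter(col_i)               # base counts at position i
    n_j = Counter(col_j)               # base counts at position j
    n_ij = Counter(zip(col_i, col_j))  # pair counts at position i-j

    score = 0.0
    # Only observed pairs are visited; unobserved pairs contribute zero.
    for (m, n), count in n_ij.items():
        f_pair = count / total
        f_m = n_i[m] / total
        f_n = n_j[n] / total
        score += f_pair * math.log(f_pair / (f_m * f_n))
    return score


# Independent variation: every pair frequency equals the product of the
# base frequencies, so the score is zero.
print(mixy_score("GGCC", "AUAU"))
# Complete covariation across four equally frequent pairing types:
# the score equals log(4) in natural-log units.
print(mixy_score("GGCCAAUU", "CCGGUUAA"))
```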

NOTE: mixy scores are calculated in practice using a more efficient method. For more details, please consult:

- Chiu D.K. and Kolodziejczak T. (1991) Inferring consensus structure from nucleic acid sequences. *Comput. Appl. Biosci.* **7(3)**:347-352.
- Gutell R.R., Power A., Hertz G., Putz E., and Stormo G. (1992) Identifying Constraints on the Higher-Order Structure of RNA: Continued Development and Application of Comparative Sequence Analysis Methods. *Nucleic Acids Research* **20(21)**:5785-5795.

Covariations are deviations from independent variation and have nonzero *mixy* scores. The scores are maximized when the two positions are highly variable and completely correlated. By filtering the *mixy* output to show only the highest-scoring positions, potential interactions of interest can be quickly recognized, reducing the amount of time spent examining potential interactions with little or no comparative support. The canonical pairings are the most frequently observed and tend to produce strong signals.
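The filtering step can be illustrated with a small sketch that scores every pair of columns in a toy alignment and keeps only the best-scoring pairs. The names `mutual_information` and `top_pairs`, the toy alignment, and the cutoff parameter `k` are hypothetical; the sketch assumes all sequences are aligned to the same length and uses a direct calculation rather than the more efficient method used in practice.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_i, col_j):
    """Mutual information (natural log) between two alignment columns."""
    total = len(col_i)
    n_i, n_j = Counter(col_i), Counter(col_j)
    score = 0.0
    for (m, n), count in Counter(zip(col_i, col_j)).items():
        f_pair = count / total
        score += f_pair * math.log(f_pair / ((n_i[m] / total) * (n_j[n] / total)))
    return score


def top_pairs(alignment, k=5):
    """Return the k highest-scoring (score, i, j) tuples, positions 1-based."""
    # Transpose the alignment so that each string is one column.
    columns = ["".join(col) for col in zip(*alignment)]
    scored = [
        (mutual_information(columns[i], columns[j]), i + 1, j + 1)
        for i, j in combinations(range(len(columns)), 2)
    ]
    scored.sort(reverse=True)
    return scored[:k]


# Toy alignment (invented data): positions 1 and 5 covary completely,
# while positions 2-4 are invariant.
toy_alignment = [
    "GAAAC",
    "CAAAG",
    "AAAAU",
    "UAAAA",
]

for score, i, j in top_pairs(toy_alignment, k=3):
    print(f"positions {i}-{j}: {score:.3f}")
```

In the toy alignment, the variable and fully correlated pair 1-5 receives the highest score, while every pair that involves an invariant column scores zero, since an invariant position carries no covariation signal.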

From a database of 200 tRNA sequences, the *mixy* algorithm was able to predict all of the secondary structure base pairings (in comparison to the crystal structure). Almost all of these pairings contain at least two compensatory base changes, with a minimal number of exceptions.

#### Mixy Image Section

The following series of images is intended to illustrate how the *mixy* algorithm performs in actual use. Here, results from the correlation analysis of the tRNA-895 alignment are presented.