About GENE FINDER (ver 1.0)

GENE FINDER was developed to make sense of genomic DNA sequence data from Aspergillus niger and Saccharomyces cerevisiae. The program will only recognize genes beginning with ATG and ending with TAA, TAG or TGA. If you wish to ensure detection of genes whose beginning or end is not included in the submitted sequence, simply add "ATGCATGCATG" at the beginning and "TAGCTAGCTAG" at the end of your sequence. The program identifies and evaluates sequences conforming to the following consensus motifs:
1) translation initiation site
2) translation stop site
3) left intron junction
4) right intron junction, and
5) intron branch-point.
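
The padding suggested above can be checked with a short sketch. This is an illustration in Python, not part of GENE FINDER itself; the function names pad_sequence and codon_frames are hypothetical. It shows that each padding string contains its codon in all three reading frames, which is presumably why these particular strings were chosen.

```python
START_PAD = "ATGCATGCATG"  # padding string quoted in the text above
STOP_PAD = "TAGCTAGCTAG"   # padding string quoted in the text above


def pad_sequence(seq):
    """Return seq with the start and stop padding attached (hypothetical helper)."""
    return START_PAD + seq.upper() + STOP_PAD


def codon_frames(text, codon):
    """Return the sorted reading frames (offset mod 3) in which codon occurs."""
    return sorted({i % 3 for i in range(len(text) - 2) if text[i:i + 3] == codon})


print(codon_frames(START_PAD, "ATG"))  # [0, 1, 2]
print(codon_frames(STOP_PAD, "TAG"))   # [0, 1, 2]
```

Because ATG occurs at offsets 0, 4 and 8 of the start padding, a start codon is available in whichever frame the truncated gene happens to use; the same holds for TAG in the stop padding.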

Potential introns are identified and assigned scores based on:
1) how well each of the sequences matches the consensus
2) the spacing between the branch-point and the right junction, and
3) the length of the resulting intron.
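
The three criteria above could be combined along the following lines. This is only a sketch under stated assumptions: the consensus strings are the commonly cited S. cerevisiae motifs (GTATGT, TACTAAC, and AG at the right junction), and the spacing and length ranges and the equal weighting are invented for illustration. The program's actual parameters were determined empirically and are not given here.

```python
def match_fraction(site, consensus):
    """Fraction of positions where site agrees with consensus (equal lengths assumed)."""
    hits = sum(1 for a, b in zip(site, consensus) if a == b)
    return hits / len(consensus)


def intron_score(left, branch, right, spacing, length,
                 spacing_range=(10, 50), length_range=(50, 500)):
    """Illustrative score from 0 to 5; ranges and weights are hypothetical."""
    # Criterion 1: how well each site matches its consensus (assumed motifs).
    score = (match_fraction(left, "GTATGT")
             + match_fraction(branch, "TACTAAC")
             + match_fraction(right, "AG"))
    # Criterion 2: spacing between the branch-point and the right junction.
    if spacing_range[0] <= spacing <= spacing_range[1]:
        score += 1
    # Criterion 3: length of the resulting intron.
    if length_range[0] <= length <= length_range[1]:
        score += 1
    return score


print(intron_score("GTATGT", "TACTAAC", "AG", spacing=30, length=100))  # 5.0
```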

The versions available here have been optimized for Aspergillus niger genomic DNA and Saccharomyces cerevisiae. Since most A. niger genes are interrupted by several (up to 8) introns, the A. niger program was designed to handle potential open reading frames (ORFs) containing up to 10 introns. The yeast program will identify ORFs having up to two introns. The following strategy is employed:
1) ORFs are constructed from each start site to the first in-frame stop codon.
2) These ORFs are then assigned a score based on their length.
3) Introns are sought which either excise the stop or alter the reading frame.
4) The next in-frame stop codon is identified.
5) The new ORF is given a score based on length and intron score.
6) Steps 3 to 5 are then repeated for the new ORF, and so on.
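
Steps 1 and 2 of this strategy can be sketched as follows. This is a minimal Python illustration, not the program's Perl source: the function name find_simple_orfs is hypothetical, the length is used directly as the score, and the intron search of steps 3 to 5 is deliberately left out.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_simple_orfs(seq):
    """Extend every ATG to its first in-frame stop codon.

    Returns a list of (start, end, length) tuples, where end is the index
    just past the stop codon. Starts with no in-frame stop are skipped.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this start site.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                orfs.append((start, end, end - start))
                break
    return orfs


print(find_simple_orfs("CCATGAAATAGATGCCCTGA"))  # [(2, 11, 9), (11, 20, 9)]
```

An intron-aware version would take each of these ORFs, look for an intron that removes the stop or shifts the frame, and then continue scanning for the next in-frame stop.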

Since billions of hypothetical ORFs can be constructed from a few kilobases of A. niger genomic sequence, a few strictures were applied. First, only the top 25 A. niger ORFs are stored in memory. For yeast, the top 100 ORFs are kept. Second, only potential introns which exceed a certain score are considered. (Reducing this stringency level results in a much slower program.)
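
Keeping only the best-scoring ORFs in memory can be done with a bounded heap. The sketch below is one way to do this in Python and is not taken from the program; the class name TopOrfs and its methods are hypothetical, and the default limit of 25 simply mirrors the A. niger figure quoted above.

```python
import heapq


class TopOrfs:
    """Keep at most `limit` of the highest-scoring ORFs seen so far."""

    def __init__(self, limit=25):
        self.limit = limit
        self._heap = []  # min-heap of (score, label); the worst kept ORF is at index 0

    def offer(self, score, label):
        """Store the ORF if there is room or if it beats the worst one kept."""
        item = (score, label)
        if len(self._heap) < self.limit:
            heapq.heappush(self._heap, item)
        elif item > self._heap[0]:
            heapq.heapreplace(self._heap, item)

    def best(self):
        """Return the kept ORFs, highest score first."""
        return sorted(self._heap, reverse=True)


top = TopOrfs(limit=3)
for s in [5, 1, 9, 7, 3]:
    top.offer(s, f"orf{s}")
print(top.best())  # [(9, 'orf9'), (7, 'orf7'), (5, 'orf5')]
```

With this structure, memory use stays fixed no matter how many candidate ORFs are generated, and each candidate costs at most a logarithmic number of comparisons.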

Note also that the context of the start and stop sites is not taken into consideration in this version of the program. This is mainly to streamline processing. Queries from the net are expected to be relatively small (kb, not Mb). Sequence information up to 1 kb upstream of the start site is required to evaluate the likelihood that a start site is authentic. However, in A. niger at least, most authentic start sites are easily identified.

The parameters used in the A. niger program are based on the author's study of all A. niger sequences available in 1996. The numerical values assigned to the various parameters were determined empirically. The parameters used in the yeast program are based on the author's analysis of 33 intron-containing S. cerevisiae genes. Continuing studies of gene sequences may result in further optimization of these values.

Other Species

Genomic sequences from animals and plants are currently being analyzed to determine the parameters which regulate their processing. E-mail is invited from anyone with information regarding either success or failure of this program. Suggestions for improvements are welcome.

The Program

This program was written in Perl to run in a Linux environment. A more flexible, user-directed version for personal use may be made available upon request.

last modified 29/11/99
Comments or suggestions: bawill@molecularworkshop.com