Analysis and Annotation Tool for Finding Genes in Genomic Sequences


The AAT web server at Michigan Tech identifies genes in a DNA sequence by comparing the query sequence against cDNA and protein sequence databases:
(1) Human_Gene_Index, a database of human cDNA sequences at TIGR,
(2) dbEST, a database of EST sequences at NCBI,
(3) SwissProt, a database of protein sequences at the University of Geneva,
(4) nr, a database of non-redundant protein sequences at NCBI.

The AAT web server also predicts genes and combines the database matches with the prediction results. Note that prediction might not work well for non-human sequences because the sequence statistics were collected from human DNA sequences.

Methods

The AAT package includes two sets of programs: one set (DPS/NAP) for comparing the query sequence with a protein database, and the other (DDS/GAP2) for comparing the query sequence with a cDNA database (Huang et al., 1997). Each set contains a fast database search program and a rigorous alignment program. The database search program quickly identifies regions of the query sequence that are similar to a database sequence. The alignment program then constructs an optimal alignment between each such region and the database sequence. The alignment program also reports the coordinates of exons in the query sequence.
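The two-stage idea described above (a fast search that locates candidate regions, followed by a rigorous alignment of each region) can be illustrated with a small sketch. This is not the algorithm used by DPS/NAP or DDS/GAP2: it substitutes exact k-mer seeding for the fast stage and a plain Smith-Waterman local alignment score for the rigorous stage, and it does not model introns or report exon coordinates. All function names, parameters, scoring values, and sequences are hypothetical.

```python
def kmer_hits(query, target, k=4):
    """Fast stage: return query positions whose k-mer also occurs in target."""
    target_kmers = {target[i:i + k] for i in range(len(target) - k + 1)}
    return [i for i in range(len(query) - k + 1) if query[i:i + k] in target_kmers]


def candidate_region(hits, k, pad, query_len):
    """Turn seed positions into one padded (start, end) window, or None if no seeds."""
    if not hits:
        return None
    return max(hits[0] - pad, 0), min(hits[-1] + k + pad, query_len)


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Rigorous stage: Smith-Waterman local alignment score by dynamic programming."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical data: a query with a short database sequence embedded in it.
query = "GGGGGG" + "ACGTACGGA" + "GGGGGG"
target = "ACGTACGGA"

hits = kmer_hits(query, target, k=4)
region = candidate_region(hits, k=4, pad=2, query_len=len(query))
score = local_alignment_score(query[region[0]:region[1]], target)
```

Only the window found by the cheap stage is passed to the quadratic-time alignment, which is the reason for pairing a search program with an alignment program.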

The AAT package also includes a program named GSA2 for combining database matches with gene prediction results (Huang and Zhang, in preparation). GSA2 produces gene structures, cDNAs, and their translations. GSA2 computes exon sequence statistics using the method of the MZEF program. GSA2 uses the cDNA and protein matches found by GAP2 and NAP to influence gene prediction. GSA2 can also be used alone for gene prediction.

To reduce the number of undesirable matches due to interspersed repeats, the DNA sequence is screened for interspersed repeats using the RepeatMasker program (Smit and Green, unpublished results). The masked DNA sequence is used for database searching, and the unmasked DNA sequence for sequence alignment, which allows the alignment program to identify the exact coordinates of exons even if parts of the exons are masked.
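A minimal sketch of why the masked and unmasked sequences can be used together: masking replaces bases without changing the sequence length, so a region located on the masked copy refers to the same positions in the unmasked sequence. The function name, the use of "N" as the mask character, and the repeat and hit coordinates below are assumptions made for illustration; they are not taken from RepeatMasker or from the AAT programs.

```python
def mask_intervals(seq, repeats, mask_char="N"):
    """Return a copy of seq with each 0-based half-open (start, end) interval masked."""
    chars = list(seq)
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


query = "ATGGCCAAAGGGTTTCCCATGA"
repeats = [(6, 12)]   # hypothetical repeat coordinates from a screening step
masked = mask_intervals(query, repeats)

# A hypothetical region reported by the search step on the masked copy.
region = (3, 15)
search_view = masked[region[0]:region[1]]   # what the database search sees
align_view = query[region[0]:region[1]]     # what the alignment step uses
```

Because the coordinates are shared, the alignment step can recover exon boundaries that fall inside a masked stretch.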

Getting results

We suggest that the user obtain results via email by choosing the email option and providing an email address. The email option allows the user to close the connection right away. If the user instead chooses to get results in the result window, the connection cannot be closed before the computation is completed. A database search takes about 5 to 20 minutes.

AAT email server

The programs in the AAT tool can also be used through an electronic mail server. To receive information on using the AAT email server, click here.

Loading a large sequence into the server

The server allows the user to load a sequence into the server by providing the name of the sequence file. The server requires that the sequence file, its parent directory, its grandparent directory, and so on up to and including the home directory all be readable by the world. One simple way to meet this requirement is to move the sequence file into the home directory and make both the file and the home directory readable by the world. To load the sequence file into the server, start Netscape from the home directory, click the "Browse" button, and provide your file name.
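The readability requirement can be checked before submitting a file. The following is a minimal sketch, assuming a POSIX file system; the function name is hypothetical, and it tests only the world-read permission bit named above (it does not check any other permission the server might need).

```python
import os
import stat
from pathlib import Path


def world_unreadable_parts(file_path, home_dir):
    """Return the paths from file_path up to home_dir that are not world-readable."""
    file_path = Path(file_path).resolve()
    home_dir = Path(home_dir).resolve()
    # Build the chain: file, parent, grandparent, ..., home directory.
    chain = [file_path]
    for parent in file_path.parents:
        chain.append(parent)
        if parent == home_dir:
            break
    else:
        raise ValueError(f"{file_path} is not located under {home_dir}")
    # Keep only the entries whose mode lacks the "others may read" bit.
    return [p for p in chain if not os.stat(p).st_mode & stat.S_IROTH]
```

For example, world_unreadable_parts("seq.fasta", Path.home()) returns an empty list when every level is readable by the world; any path it returns needs its permissions changed (for instance with chmod o+r).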

References

Huang, X., Adams, M.D., Zhou, H. and Kerlavage, A.R. (1997). A tool for analyzing and annotating genomic sequences, Genomics 46, 37-45.

Acknowledgments

Part of the work was done while X.H. was on sabbatical leave at TIGR. I thank Michael Zhang for providing the MZEF program.

Suggestions/Comments

Please contact Xiaoqiu Huang at huang@mtu.edu