Analysis and Annotation Tool for Finding Genes in Genomic Sequences
The AAT web server at Michigan Tech identifies genes in a DNA sequence
by comparing the query sequence against cDNA and protein sequence databases:
(1) Human_Gene_Index, a database of human cDNA sequences at TIGR,
(2) dbEST, a database of EST sequences at NCBI,
(3) SwissProt, a database of protein sequences at the University of Geneva,
(4) nr, a database of non-redundant protein sequences at NCBI.
The AAT web server also predicts genes and combines
database matches with prediction results.
Note that the prediction might not work well for non-human sequences
because the sequence statistics were collected from human DNA sequences.
Methods
The AAT package includes two sets of programs,
one set (DPS/NAP) for comparing the query sequence with a protein database,
and the other (DDS/GAP2) for comparing the query with a cDNA database
(Huang et al., 1997).
Each set contains a fast database search program and
a rigorous alignment program. The database search program
quickly identifies regions of the query sequence that
are similar to a database sequence. Then the alignment program
constructs an optimal alignment between each region and the database sequence.
The alignment program also reports the coordinates of exons in the query sequence.
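The two-stage idea described above (a fast search to locate a candidate region, then a rigorous alignment of that region) can be illustrated with a minimal Python sketch. This is not the DPS/NAP or DDS/GAP2 algorithm: the function names, the k-mer filter, the plain Smith-Waterman scoring, and all parameter values are assumptions chosen for illustration, and the sketch does not model introns.

```python
def find_candidate_region(query, target, k=4, flank=5):
    """Fast stage: window of `query` that shares k-mers with `target`, or None."""
    target_kmers = {target[i:i + k] for i in range(len(target) - k + 1)}
    hits = [i for i in range(len(query) - k + 1) if query[i:i + k] in target_kmers]
    if not hits:
        return None
    return max(0, hits[0] - flank), min(len(query), hits[-1] + k + flank)


def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Rigorous stage: Smith-Waterman local alignment (toy scoring values).

    Returns (score, start, end), where a[start:end] is the aligned part of `a`.
    """
    prev_score = [0] * (len(b) + 1)
    prev_start = [0] * (len(b) + 1)
    best = (0, 0, 0)
    for i in range(1, len(a) + 1):
        cur_score = [0] * (len(b) + 1)
        cur_start = [i] * (len(b) + 1)  # an empty alignment ending at i starts at i
        for j in range(1, len(b) + 1):
            diag = prev_score[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev_score[j] + gap
            left = cur_score[j - 1] + gap
            score = max(0, diag, up, left)
            if score == 0:
                start = i
            elif score == diag:
                start = prev_start[j - 1]
            elif score == up:
                start = prev_start[j]
            else:
                start = cur_start[j - 1]
            cur_score[j], cur_start[j] = score, start
            if score > best[0]:
                best = (score, start, i)
        prev_score, prev_start = cur_score, cur_start
    return best


def locate_match(query, target):
    """Run both stages and report (score, start, end) in full-query coordinates."""
    region = find_candidate_region(query, target)
    if region is None:
        return None
    lo, hi = region
    score, start, end = local_align(query[lo:hi], target)
    return score, lo + start, lo + end


query = "TTTTTTTT" + "GATTACAGGCC" + "TTTTTTTT"
print(locate_match(query, "GATTACAGGCC"))  # (22, 8, 19)
```

The reported coordinates are 0-based and half-open, so query[8:19] is the matched segment. Only the small window found by the fast stage is passed to the quadratic-time alignment, which is the reason for splitting the work into two programs.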
The AAT package also includes a program named GSA2 for
combining database matches with gene prediction results
(Huang and Zhang, in preparation).
GSA2 produces gene structures, cDNAs, and their translation.
GSA2 computes exon sequence statistics using the method of the
MZEF program.
GSA2 uses cDNA and protein matches by GAP2 and NAP to influence
gene prediction. GSA2 can be used alone for gene prediction.
To reduce the number of undesirable matches due to
interspersed repeats, the DNA sequence is screened for interspersed repeats
using the RepeatMasker program (Smit and Green, unpublished results).
The masked DNA sequence is used for database searching, and
the unmasked DNA sequence for sequence alignment, which
allows the alignment program to identify the exact coordinates of exons
even if parts of the exons are masked.
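The following Python sketch illustrates why searching the masked sequence but aligning the unmasked one recovers full exon coordinates. It is a toy model, not the behavior of RepeatMasker or of the AAT programs: the repeat interval is supplied by hand, and the helper names, the exact-match extension, and the example sequences are all assumptions made for illustration.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Replace each half-open (start, end) interval of `seq` with mask_char."""
    chars = list(seq)
    for start, end in intervals:
        end = min(end, len(chars))
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)


def first_seed(query, target, k=5):
    """Return (query_pos, target_pos) of the first shared k-mer, or None."""
    index = {}
    for j in range(len(target) - k + 1):
        index.setdefault(target[j:j + k], j)
    for i in range(len(query) - k + 1):
        j = index.get(query[i:i + k])
        if j is not None:
            return i, j
    return None


def extend_exact(query, target, qpos, tpos, k=5):
    """Extend a seed in both directions while bases match; return (start, end) in query."""
    qs, ts = qpos, tpos
    while qs > 0 and ts > 0 and query[qs - 1] == target[ts - 1]:
        qs -= 1
        ts -= 1
    qe, te = qpos + k, tpos + k
    while qe < len(query) and te < len(target) and query[qe] == target[te]:
        qe += 1
        te += 1
    return qs, qe


cdna = "ATGGCGTACGTTAGC"
unmasked = "CCCCC" + cdna + "GGGGG"            # the exon occupies positions 5..20
masked = mask_intervals(unmasked, [(5, 9)])    # hypothetical repeat covering the exon start

seed = first_seed(masked, cdna)                # search uses the masked sequence
print(extend_exact(masked, cdna, *seed))       # (9, 20): boundary cut short by masking
print(extend_exact(unmasked, cdna, *seed))     # (5, 20): full exon recovered
```

The seed is found outside the masked bases, so masking still suppresses hits that lie entirely within repeats, while the alignment step sees the original bases and can extend through the masked part of the exon.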
Getting results
We suggest that the user obtain results via email
by choosing the email option and providing an email address.
The email option allows the user to close the connection right away.
On the other hand, if the user chooses to get results in the result window,
the user cannot close the connection before the computation is completed.
A database search takes about 5 to 20 minutes.
AAT email server
The programs in the AAT tool can also be used through an electronic mail server.
To receive information on using the AAT email server,
click here.
Loading a large sequence into the server
The server allows the user to load a sequence into the server
by providing the name of the sequence file.
The server requires that the sequence file, its parent directory,
its grandparent directory, and so on up to and including the home directory
all be readable by the world.
One simple way to meet this requirement is to move the sequence file
into the home directory and make both the file and the home directory
readable by the world.
To load the sequence file into the server,
start Netscape from the home directory, click the "Browse" button,
and provide your file name.
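Before loading a file, the readability requirement can be checked with a short Python sketch such as the one below. It assumes a Unix-style file system; the function name and the example paths are hypothetical, and the check covers only the world-read bit stated in the requirement (on most Unix systems a directory also needs the search bit to be traversed, which is not checked here).

```python
import os
import stat


def world_readable_problems(file_path, home_dir):
    """Return the paths from file_path up to home_dir (inclusive)
    that are not readable by the world."""
    file_path = os.path.abspath(file_path)
    home_dir = os.path.abspath(home_dir)
    if os.path.commonpath([file_path, home_dir]) != home_dir:
        raise ValueError("the file is not inside the home directory")
    problems = []
    current = file_path
    while True:
        mode = os.stat(current).st_mode
        if not mode & stat.S_IROTH:  # world-read bit is not set
            problems.append(current)
        if current == home_dir:
            break
        current = os.path.dirname(current)
    return problems
```

For example, world_readable_problems("/home/user/seq.fa", "/home/user") returns an empty list when both the file and the home directory are world-readable; any path it does return can be fixed with a command such as chmod o+r.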
References
Huang, X., Adams, M.D., Zhou, H. and Kerlavage, A.R. (1997).
A tool for analyzing and annotating genomic sequences, Genomics 46, 37-45.
Acknowledgments
Part of the work was done while X.H. was on sabbatical leave at TIGR.
I thank Michael Zhang for providing the MZEF program.
Suggestions/Comments
Please contact Xiaoqiu Huang at huang@mtu.edu