Title page for etd-0725103-114311


[Back to Results | New Search]

URN etd-0725103-114311
Author Jen-Wei Hu
Author's Email Address No Public.
Statistics This thesis had been viewed 5342 times. Download 2232 times.
Department Computer Science and Engineering
Year 2002
Semester 2
Degree Master
Type of Document
Language English
Title An ACGT-Words Tree for Efficient Data Access in Genomic Databases
Date of Defense 2003-06-20
Page Count 99
Keyword
  • DNA sequence
  • genomic databases
  • indexing
  • suffix tree
  • suffix array
  • Abstract Genomic sequence databases, like GenBank, EMBL, are widely used by molecular biologists for homology searching. Because of the increase of the size of genomic sequence databases, the importance of indexing the sequences for fast queries grows. The DNA sequences are composed of 4 base pairs, and these genomic sequences can be regarded as the text strings. Similar to conventional databases, there are some approaches use indexes to provide efficient access to the data. The inverted-list indexing approach uses hashing to store the database sequences. However, the perfect hashing function is difficult to construct, and the collision in a hash table may occur frequently. Different from the inverted-list approach, there are other data structures, such as the suffix tree, the suffix array, and the suffix binary search tree, to index the genomic sequences. One characteristic of those suffix-tree-like data structures is that they store all suffixes of the sequences. They do not break the sequences into words. The advantage of the suffix tree is simple. However, the storage space of the suffix tree is too large. The suffix array and the suffix binary search tree reduce more storage space than the suffix tree. But since they use the binary searching technique to find the query sequence, they waste too much time to do the search. Another data structure, the word suffix tree, uses the concept of words and stores partial suffixes to index the DNA sequence. Although the word suffix tree reduces the storage space, it will lose information in the search process. In this thesis, we propose a new index structure, ACGT-Words tree, for efficiently support query processing in genomic databases. We define the concept of words which is different from the word definition given in the word suffix tree, and separate the DNA sequences stored in the database and in the query sequence into distinct words. Our approach does not store all of the suffixes in the database sequences. Therefore, we need less space than the suffix tree approach. We also propose an efficient search algorithm to do the sequence match based on the ACGT-Words tree index structure; therefore, we can take less time to finish the search than the suffix array approach. Our approach also avoids the missing cases in the word suffix tree. Then, based on the ACGT-Words tree, we propose one improved operation for data insertion and two improved operations for the searching process. In the improved operation for insertion, we sort the ACGT-Words generated and then preprocess them before constructing the tree structure. In the two improved operations, we can provide better performance when the query sequence satisfies some conditions. The simulation results show that the ACGT-Words tree outperforms the suffix tree and the suffix array in terms of storage and processing time, respectively. Moreover, we show that the improved operations in the ACGT-Words tree also require shorter time to construct or search than the original processes or the suffix array.
    Advisory Committee
  • Tei-Wei Kuo - chair
  • Chien-I Lee - co-chair
  • San-Yi Huang - co-chair
  • Ye-In Chang - advisor
  • Files
  • etd-0725103-114311.pdf
  • indicate accessible in a year
    Date of Submission 2003-07-25

    [Back to Results | New Search]


    Browse | Search All Available ETDs

    If you have more questions or technical problems, please contact eThesys