Title page for etd-0625109-171938


URN etd-0625109-171938
Author Shu-Yi Lin
Author's Email Address Not public.
Statistics This thesis has been viewed 5380 times and downloaded 1213 times.
Department Computer Science and Engineering
Year 2008
Semester 2
Degree Master
Type of Document
Language English
Title The GDense Algorithm for Clustering Data Streams with High Quality
Date of Defense 2009-06-12
Page Count 77
Keyword
  • density-based
  • grid-based
  • clustering
  • data streams
Abstract In recent years, mining data streams has been widely studied. A data stream is a
    sequence of dynamic, continuous, unbounded, real-time data items that arrive at a very
    high rate and can be read only once. In data mining, clustering is one of the useful
    techniques for discovering interesting data in the underlying data objects. The
    problem of clustering can be defined formally as follows: given n data points in a
    d-dimensional metric space, partition the data points into k clusters such that the data
    points within a cluster are more similar to each other than to data points in different
    clusters. In the data stream environment, the difficulties of clustering include
    storage overhead, low clustering quality, and low updating efficiency. Current
    clustering algorithms can be broadly classified into four categories: partitioning,
    hierarchical, density-based, and grid-based approaches. The advantage of the grid-based
    algorithm is that it can handle large databases. In the density-based
    approach, the insertion or deletion of a data point affects the current clustering only in
    the neighborhood of that point. The CDS-Tree algorithm was proposed to combine the
    advantages of the grid-based and density-based approaches. Although it can
    handle large databases, its clustering quality is limited by the grid partition and the
    threshold of a dense cell. Therefore, in this thesis, we present GDense, a new high-quality
    clustering algorithm for data streams. The GDense algorithm achieves high
    quality due to two kinds of partition, cells and quadcells, and two kinds of threshold,
    δ and (1/4)δ. Moreover, in the data insertion part of our GDense algorithm, the
    7 cases take 3 factors about the cell and the quadcell into consideration. In the
    deletion part, the 10 cases take 5 factors about the cell into consideration. From our
    simulation results, under every condition tested (including the number of data points,
    the number of cells, the size of the sliding window, and the threshold of a dense cell),
    the clustering purity of our GDense algorithm is always higher than that of the
    CDS-Tree algorithm. Moreover, we compare the purity of our
    GDense algorithm and the CDS-Tree algorithm in the presence of outliers. Whether the
    number of outliers is large or small, the clustering purity of our GDense algorithm is
    still higher than that of the CDS-Tree algorithm, and we improve the clustering
    purity by about 20% compared to the CDS-Tree algorithm.
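
    The abstract evaluates both algorithms by clustering purity but does not state how
    purity is computed. The sketch below assumes the common textbook definition (each
    cluster is credited with its majority class, and the credited points are divided by
    the total number of points); the thesis itself may define purity differently. The
    function name clustering_purity and its arguments are hypothetical and are not
    taken from the thesis.

```python
from collections import Counter


def clustering_purity(cluster_ids, class_labels):
    """Return the purity of a clustering under the assumed textbook definition.

    cluster_ids[i] is the cluster assigned to point i and class_labels[i] is
    its ground-truth class. This is an illustrative helper, not GDense code.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one data point is required")

    # Count, for every cluster, how many points of each true class it holds.
    per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1

    # Each cluster contributes the size of its majority class.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_ids)
```

    Under this definition a purity of 1.0 means every cluster contains points of a
    single class, so a purity gain of about 20% corresponds to that many more points
    falling in the majority class of their cluster.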
Advisory Committee
  • Gen-Huey Chen - chair
  • Chien-I Lee - co-chair
  • San-Yi Huang - co-chair
  • Ye-In Chang - advisor
Files
  • etd-0625109-171938.pdf
  • Indicated as accessible in a year
Date of Submission 2009-06-25
