Title page for etd-0925117-160503


[Back to Results | New Search]

URN etd-0925117-160503
Author Shun-ji Huang
Author's Email Address No Public.
Statistics This thesis had been viewed 5354 times. Download 0 times.
Department Computer Science and Engineering
Year 2017
Semester 1
Degree Master
Type of Document
Language zh-TW.Big5 Chinese
Title An Improved Misra-Gries algorithm for Finding Top-k Frequent Items and Its Parallelized Implementation
Date of Defense 2017-09-04
Page Count 91
Keyword
  • Frequent items
  • Top-K query
  • Misra-Gries algorithm
  • real-time streaming data
  • Apache Storm
  • parallel processing
  • Abstract The problem of finding top-k frequent items over data streams is very popular and heavily studied in data streams analysis, and it is widely used in many applications, such as network traffic monitoring and identifying high-clicker commodity. However, exactly finding the top-k items is difficult, because the quantities of data in data streams are so large that it is costly to store all data and then process them. To deal with this problem, many algorithms are proposed to gain the approximate results with tolerable errors.
     Misra-Gries algorithm is one of counter-based algorithms that are used to find frequent items and top-k items in data streams. However, when Misra-Gries algorithm process the test data whose distribution is low-skewed, the accuracy rate of top-k result is very low. In our experiment, Misra-Gries algorithm process the test data generated following the Zipfian distribution and the parameter skewness is set to 0.4, then the accuracy rate of top-k result is only 32%.
     In order to improve the accuracy rate, we not only partially modify the update rule of Misra-Gries algorithm, but also add the concept of “sliding window” to the algorithm. Although the top-k result is approximate, the error is guaranteed not exceed an error bound. The algorithm we proposed is named to “sliding window Misra-Gries algorithm” that is abbreviated to “WMG”. WMG is very simple and use small memory space. WMG which just need one more memory space, compared to Misra-Gries algorithm, process the test date we mentioned above get the accuracy rate 80%.
     In addition, we present a parallel version of WMG, so that it can process larger quantities of data in data streams. We explain how WMG algorithm is parallelized and implement the parallel version of WMG by Apache Storm. The experiment results show that our parallel design could achieve high throughput, and the accuracy rate of top-k result is not worse than the accuracy rate of the sequential version.
    Advisory Committee
  • Shi-huang Chen - chair
  • Yun-nan Chang - co-chair
  • Jiunn-ru Lai - co-chair
  • Wei-kuang Lai - co-chair
  • Chun-hung Lin - advisor
  • Files
  • etd-0925117-160503.pdf
  • Indicate in-campus at 5 year and off-campus access at 5 year.
    Date of Submission 2017-10-25

    [Back to Results | New Search]


    Browse | Search All Available ETDs

    If you have more questions or technical problems, please contact eThesys