||The problem of finding top-k frequent items over data streams is very popular and heavily studied in data streams analysis, and it is widely used in many applications, such as network traffic monitoring and identifying high-clicker commodity. However, exactly finding the top-k items is difficult, because the quantities of data in data streams are so large that it is costly to store all data and then process them. To deal with this problem, many algorithms are proposed to gain the approximate results with tolerable errors.|
Misra-Gries algorithm is one of counter-based algorithms that are used to find frequent items and top-k items in data streams. However, when Misra-Gries algorithm process the test data whose distribution is low-skewed, the accuracy rate of top-k result is very low. In our experiment, Misra-Gries algorithm process the test data generated following the Zipfian distribution and the parameter skewness is set to 0.4, then the accuracy rate of top-k result is only 32%.
In order to improve the accuracy rate, we not only partially modify the update rule of Misra-Gries algorithm, but also add the concept of “sliding window” to the algorithm. Although the top-k result is approximate, the error is guaranteed not exceed an error bound. The algorithm we proposed is named to “sliding window Misra-Gries algorithm” that is abbreviated to “WMG”. WMG is very simple and use small memory space. WMG which just need one more memory space, compared to Misra-Gries algorithm, process the test date we mentioned above get the accuracy rate 80%.
In addition, we present a parallel version of WMG, so that it can process larger quantities of data in data streams. We explain how WMG algorithm is parallelized and implement the parallel version of WMG by Apache Storm. The experiment results show that our parallel design could achieve high throughput, and the accuracy rate of top-k result is not worse than the accuracy rate of the sequential version.