With the rapid development of internet technologies, data are easily generated and collected. How useful those data are depends on how well enterprises or individuals can derive valuable information from them. Before performing more complex analysis, analysts need to understand the data, preferably visually, which motivates the approach of Exploratory Data Analysis (EDA). With EDA, analysts can uncover the patterns or characteristics of the data and then choose an appropriate model for further analysis. Common EDA techniques include graphing, tabulation, and equation fitting, which help analysts explore the data and identify its regularities. Unfortunately, when the volume of data is huge, traditional EDA methods may become inefficient.
Our work uses R to develop EDA software, taking advantage of its data exploration features and rich package libraries, and aims to visualize big data efficiently. By applying data reduction strategies, large volumes of data can be reduced to a meaningful data set of lower complexity and smaller size. Specifically, we use binning as the basis for our data reduction methods. Equal-width binning is the most common method for aggregating continuous variables. Although equal-width binning is highly efficient, it performs poorly on skewed data distributions. In this thesis, we compare three aggregation approaches (equal-width, equal-depth, and MHist) by assessing their time efficiency and accuracy.
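The difference between equal-width and equal-depth binning on skewed data can be illustrated with a minimal sketch. It is written in Python for brevity and is not the R implementation developed in this work. The function names (`equal_width_edges`, `equal_depth_edges`, `bin_counts`) and the sample data are hypothetical, and MHist is omitted.

```python
import bisect

def equal_width_edges(values, k):
    """Split [min, max] into k intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k)] + [hi]

def equal_depth_edges(values, k):
    """Place edges at quantiles so each bin holds about len(values)/k points."""
    s = sorted(values)
    n = len(s)
    return [s[0]] + [s[(i * n) // k] for i in range(1, k)] + [s[-1]]

def bin_counts(values, edges):
    """Count values per bin; bins are [e_i, e_{i+1}), the last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect.bisect_right(edges, v) - 1
        counts[min(i, len(counts) - 1)] += 1
    return counts

# Hypothetical right-skewed sample: many small values and one large outlier.
data = [1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 100]

ew = equal_width_edges(data, 4)
ed = equal_depth_edges(data, 4)
print(bin_counts(data, ew))  # [11, 0, 0, 1]
print(bin_counts(data, ed))  # [2, 3, 4, 3]
```

With equal-width bins, the single outlier stretches the range so that almost all points fall into the first bin and the shape of the bulk of the data is lost. Equal-depth bins adapt their edges to the data, at the cost of sorting the values first, which is consistent with the efficiency and accuracy trade-off examined in this thesis.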
Experimental results show that both equal-depth and MHist achieve much higher accuracy than equal-width, at some cost in efficiency. MHist performs well across various data distributions but has the lowest efficiency. Equal-depth strikes a balance, with reasonable performance in both efficiency and accuracy.