Abstract |
Knowledge discovery in databases focuses on methodologies for extracting useful information from collections of data. One approach to knowledge discovery is data mining. Data classification is a well-known and useful data mining technique that assigns categories to collected data so that accurate predictions can be made. One model for data classification is the decision tree, and a key property of a good decision tree is the choice of deciding factors in its internal nodes. In statistics, the chi-square test is a good way to analyze whether a categorical variable A is a significant factor for a categorical variable B. From our observation of research papers in medicine, we consider that a risk factor (i.e., a significant factor under the chi-square test) is strongly related to an important deciding factor in a decision tree. Therefore, in this dissertation, we first study chronic kidney disease as an important risk factor for bladder cancer, in cooperation with the Department of Urology, Chang Gung Memorial Hospital, Kaohsiung, Taiwan, and we propose a statistical approach to check the relation. This study requires several preprocessing steps of knowledge discovery, including data selection, cleaning of unclear data, and data enrichment. The resulting risk factor (i.e., the significant factor) can then be used as a deciding factor in a decision tree. Second, we use significant factors to improve the performance of the decision tree, and we propose an approach that aims to reduce the number of deciding factors and to determine their order in a decision tree. We take a public baseball database as an example to illustrate our method.
What we care about is how the same decision tree algorithm performs with and without the preprocessing step, i.e., the pruning of insignificant factors before the decision tree is constructed. We therefore compare the performance of these two cases. Overall, our proposed method can be applied to any other database that has an extra attribute with a class value. Third, the results of mining periodicity patterns can also be helpful in knowledge discovery, and we propose a time-position join method based on a matrix and a graph for periodicity mining. We compare this method with the suffix tree approach proposed by Rasheed et al. For each of these three research directions, we show that our contribution achieves, to some degree, high accuracy, short processing time, and low storage cost. Consequently, in this dissertation, we have proposed efficient algorithms for data classification toward knowledge discovery based on statistics and decision trees. |