Title page for etd-0630118-172505



URN etd-0630118-172505
Author Chia-En Li
Author's Email Address Not public.
Statistics This thesis has been viewed 5564 times and downloaded 0 times.
Department Computer Science and Engineering
Year 2017
Semester 2
Degree Ph.D.
Type of Document
Language English
Title Efficient Algorithms for Data Classification toward Knowledge Discovery based on Statistics and Decision Trees
Date of Defense 2018-07-30
Page Count 115
Keyword
  • Periodicity Mining
  • Knowledge Discovery
  • Decision Tree
  • Data Classification
  • Chi-Square Test
Abstract Knowledge discovery in databases focuses on methodologies for extracting useful information from collections of data. One approach to knowledge discovery is data mining. Data classification is a well-known and useful data mining technique that assigns categories to collected data to support accurate prediction. One model for data classification is the decision tree, and a key to a good decision tree is the choice of deciding factors in its internal nodes. In statistics, the chi-square test is a good way to analyze whether categorical variable A is a significant factor for categorical variable B. From our observation of medical research papers, we consider that a risk factor (i.e., a significant factor under the chi-square test) is strongly related to an important deciding factor in a decision tree. Therefore, in this dissertation, we first study chronic kidney disease as an important risk factor for bladder cancer, in cooperation with the Department of Urology, Chang Gung Memorial Hospital, Kaohsiung, Taiwan, and we propose a statistical approach to check the relation. This study requires several preprocessing steps of knowledge discovery, including data selection, cleaning of unclear data, and data enrichment. Moreover, the resulting risk factor (i.e., the significant factor) can be used as a deciding factor in a decision tree. Second, we use significant factors to improve the performance of the decision tree, and we propose an approach that aims to reduce the number of deciding factors and to determine their order in a decision tree. We take a public baseball database as an example to illustrate our method. Our concern is the performance of the same decision tree algorithm with and without the preprocessing step, i.e., the pruning of insignificant factors before the decision tree is constructed, so we compare these two cases. Overall, our proposed method can be applied to any other database that has an extra attribute with a class value. Third, the result of mining periodicity patterns can also be helpful in knowledge discovery, and we propose a time-position join method based on a matrix and a graph for periodicity mining. In this study, we compare our method with the suffix tree approach proposed by Rasheed et al. For each of these three research directions, we have shown our contribution in terms of high accuracy, short processing time, and less storage, to some degree. Consequently, in this dissertation, we have proposed efficient algorithms for data classification toward knowledge discovery based on statistics and decision trees.
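The pruning idea described in the abstract (keep only attributes that a chi-square test finds significant for the class, then order them) can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the function names `chi2_sf`, `chi_square_test`, and `significant_factors`, the 0.05 threshold, the ordering by ascending p-value, and the record layout (a list of dictionaries) are all assumptions made for illustration, and no continuity correction is applied.

```python
import math
from collections import Counter


def chi2_sf(x, dof):
    """P(X >= x) for a chi-square variable with `dof` >= 1 degrees of freedom.

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / gamma(a + 1)
    on the regularized upper incomplete gamma function, with z = x / 2.
    """
    z = x / 2.0
    if dof % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))  # closed form for dof = 1
    else:
        a, q = 1.0, math.exp(-z)             # closed form for dof = 2
    while 2 * a < dof:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1.0
    return q


def chi_square_test(xs, ys):
    """Pearson chi-square test of independence for two categorical columns.

    Returns (statistic, p_value). No continuity correction (an assumption).
    """
    n = len(xs)
    rows, cols, cells = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    dof = (len(rows) - 1) * (len(cols) - 1)
    if dof == 0:
        return 0.0, 1.0  # a constant column cannot be a significant factor
    stat = 0.0
    for r, r_total in rows.items():
        for c, c_total in cols.items():
            expected = r_total * c_total / n
            stat += (cells[(r, c)] - expected) ** 2 / expected
    return stat, chi2_sf(stat, dof)


def significant_factors(records, attributes, label, alpha=0.05):
    """Hypothetical preprocessing step: drop attributes whose p-value is
    not below `alpha`, and order the rest from most to least significant."""
    labels = [rec[label] for rec in records]
    scored = []
    for attr in attributes:
        _, p = chi_square_test([rec[attr] for rec in records], labels)
        if p < alpha:
            scored.append((p, attr))
    return [attr for _, attr in sorted(scored)]
```

The returned attribute list could then be passed to any decision tree learner in place of the full attribute set, which is the with/without comparison the abstract describes.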
Advisory Committee
  • Arbee L. P. Chen - chair
  • Chiou-Shann Fuh - co-chair
  • Yu-Chee Tseng - co-chair
  • Chun-I Fan - co-chair
  • Wei-Kuang Lai - co-chair
  • San-Yih Hwang - co-chair
  • Ye-In Chang - advisor
Files
  • etd-0630118-172505.pdf
  • Indicated in-campus access at 5 years and off-campus access at 5 years.
Date of Submission 2018-07-31
