Title page for etd-0425113-122452



URN etd-0425113-122452
Author Yung-Shen Lin
Author's Email Address Not public.
Statistics This thesis has been viewed 5577 times and downloaded 331 times.
Department Institute of Electrical Engineering
Year 2012
Semester 2
Degree Ph.D.
Type of Document
Language English
Title Measuring Document Similarity Based on Text Classification and Clustering
Date of Defense 2013-04-18
Page Count 139
Keyword
  • similarity function
  • feature selection
  • entropy
  • document clustering
  • document classification
  • near-duplicate document
  • accuracy
  • classifiers
  • clustering algorithms
Abstract This thesis proposes a novel similarity measure between documents. The proposed measure is also extended to gauge the similarity between two sets of documents. Furthermore, the similarity measure is applied in a new method for detecting near-duplicate documents.
      Measuring the similarity between two documents is an important operation in text processing. Appropriate features are selected to represent the documents, and the similarity is computed with respect to these features. A similarity measure between two documents may therefore take into account whether a feature appears in both documents, the degree of similarity between features, and the number of similar features. In this thesis, we propose a new similarity measure based on three cases of feature appearance.
      Differentiating documents by similarity is an important operation in text processing. Because the number of items in documents is huge, selecting appropriate features to represent documents is important for this task. Document analysis usually extracts information sufficient to cover the contents of a document as its representative features. These features may be a single letter, word, sentence, or even a whole paragraph, and the vector-space model is used to represent them. To compute the similarity between two documents with respect to a feature, the proposed measure takes the following three cases into account: a) the feature appears in both documents, b) the feature appears in only one document, and c) the feature appears in none of the documents. To improve the performance of similarity measure algorithms, our proposed measure is extended to gauge the similarity between two sets of documents. The effectiveness of our measure is evaluated on several real-world data sets for text classification and clustering problems, and the results are better than those achieved by other measures.
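      The three cases can be illustrated with a small Python sketch. This is not the formula from the thesis, which the abstract does not state; it is a toy measure, assumed here for illustration only, in which a feature present in both documents contributes by how close its two values are, a feature present in only one document incurs a penalty, and a feature absent from both is ignored. The names `three_case_similarity`, `set_similarity`, and `penalty` are hypothetical, and averaging over document pairs is just one assumed way to extend a measure to sets of documents.

```python
def three_case_similarity(d1, d2, penalty=1.0):
    """Toy similarity in [0, 1] for two equal-length, non-negative
    feature vectors (e.g. term weights in a vector-space model).
    Illustrative only; not the measure defined in the thesis."""
    if len(d1) != len(d2):
        raise ValueError("documents must share the same feature space")
    score = 0.0
    counted = 0
    for x, y in zip(d1, d2):
        if x > 0 and y > 0:
            # Case a: feature appears in both documents.
            # Closer values contribute more (ratio in (0, 1]).
            score += min(x, y) / max(x, y)
            counted += 1
        elif x > 0 or y > 0:
            # Case b: feature appears in only one document.
            score -= penalty
            counted += 1
        # Case c: feature appears in neither document; it is ignored.
    if counted == 0:
        return 0.0
    # Rescale the average from [-penalty, 1] to [0, 1].
    return (score / counted + penalty) / (1.0 + penalty)


def set_similarity(docs_a, docs_b, penalty=1.0):
    """Assumed extension to two sets of documents: the mean of all
    pairwise document similarities."""
    pairs = [(a, b) for a in docs_a for b in docs_b]
    if not pairs:
        return 0.0
    return sum(three_case_similarity(a, b, penalty) for a, b in pairs) / len(pairs)
```

      In this sketch, padding both vectors with features that neither document contains leaves the score unchanged, which is the practical effect of treating case c separately.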
      To further illustrate the similarity measure, an application to near-duplicate document detection is also demonstrated. Based on the similarity measure, we present a novel method for detecting near-duplicates in a large collection of documents.
      Distinguishing near-duplicate documents is extremely important in the Internet era. If a search engine can effectively identify near-duplicate documents, it can reduce the number of duplicate documents retrieved and thereby improve search performance. For this purpose, we also propose a novel method for detecting near-duplicates in a huge collection of documents. Our method involves three major parts: feature selection, similarity measure, and discriminant derivation. To find near-duplicates of an input document, each sentence of the input document is fetched and preprocessed, the weight of each term is calculated, and the heavily weighted terms are selected as the feature of the sentence. As a result, the input document is turned into a set of such features. A similarity measure is then applied, and the similarity degree between the input document and each document in the given collection is computed. A support vector machine (SVM) is adopted to learn a discriminant function from a training pattern set, which is then employed to determine whether a document is a near-duplicate of the input document based on the similarity degree between them. The sentence-level features we adopt can better reveal the characteristics of a document. Besides, learning the discriminant function by SVM avoids the trial-and-error effort required in conventional methods. Experimental results show that our method is effective in near-duplicate document detection.
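      The three parts of the pipeline can be sketched in Python as follows. Every concrete choice here is an assumption made for illustration, since the abstract does not give them: the term weight (count in the sentence times a log inverse sentence frequency), the `top_k` cutoff, and Jaccard overlap as the similarity degree are stand-ins, not the thesis's definitions. In place of an SVM, `learn_threshold` picks the midpoint between the closest positive and negative training similarities, which is the maximum-margin separator for separable one-dimensional data; a real implementation would train an SVM instead. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def sentence_features(text, top_k=3):
    """Feature selection: turn a document into a set of sentence-level
    features, each the set of the top_k most heavily weighted terms."""
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    tokenized = [re.findall(r"[a-z0-9]+", s) for s in sentences]
    n = len(tokenized)
    # Number of sentences in which each term occurs.
    sent_freq = Counter(t for toks in tokenized for t in set(toks))
    features = set()
    for toks in tokenized:
        counts = Counter(toks)
        # Assumed weighting: in-sentence count times log inverse sentence frequency.
        weights = {t: c * math.log(1.0 + n / sent_freq[t]) for t, c in counts.items()}
        top = sorted(weights, key=lambda t: (-weights[t], t))[:top_k]
        if top:
            features.add(frozenset(top))
    return features


def jaccard(a, b):
    """Similarity degree between two feature sets (a stand-in measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def learn_threshold(similarities, labels):
    """Discriminant derivation: midpoint between the lowest near-duplicate
    similarity and the highest non-duplicate similarity in the training set."""
    pos = [s for s, y in zip(similarities, labels) if y]
    neg = [s for s, y in zip(similarities, labels) if not y]
    if not pos or not neg or min(pos) <= max(neg):
        raise ValueError("training similarities must be separable")
    return (min(pos) + max(neg)) / 2.0


def is_near_duplicate(doc, candidate, threshold, top_k=3):
    """Apply the learned discriminant to the similarity of two documents."""
    sim = jaccard(sentence_features(doc, top_k), sentence_features(candidate, top_k))
    return sim >= threshold
```

      Learning the decision boundary from labeled pairs, rather than hand-tuning a cutoff, is what replaces the trial-and-error step mentioned above.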
Advisory Committee
  • Chih-Chin Lai - chair
  • Chun-Liang Hou - co-chair
  • Chih-Feng Liu - co-chair
  • Chen-Sen Ouyang - co-chair
  • Hsien-Liang Tsai - co-chair
  • Shie-Jue Lee - advisor
Files
  • etd-0425113-122452.pdf
  • Indicate in-campus access at 0 years and off-campus access at 3 years.
Date of Submission 2013-04-25
