Thesis/Dissertation etd-0810104-153712 Detailed Record


Name 丁康迪 (Kang-Di Ting) Email m9142635@student.nsysu.edu.tw
Department Graduate Institute of Information Management
Degree Master Graduation term Academic year 92 (2003-2004), 2nd semester
Title (Chinese) 結合文件內容和使用紀錄的文獻數位圖書館文件分群技術
Title (English) Clustering Articles in a Literature Digital Library Based on Content and Usage
  • etd-0810104-153712.pdf
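  • This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.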


    Language/Pages English/52
    Statistics This thesis has been viewed 5617 times and downloaded 4572 times
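    Abstract (Chinese) Literature digital libraries store literature in digital form, and researchers can conveniently search it over the network. When searching, however, users often face a large volume of results at once and cannot find the articles they actually want. To provide a more efficient search service, many systems offer a browsing interface in the hope of reducing the number of user clicks. In this study, we aim to build a browsing interface equipped with a subject directory, so that users can search for articles more conveniently and efficiently.
    In previous related work, document classification or clustering can be applied to the problem this study addresses, but document classification approaches mostly require help from experts and a predefined set of subject categories or a directory. This study therefore tries to use the usage log of the system's users to stand in for expert classification, reducing the cost of expert manual work while building a browsing interface that meets users' needs. We propose two methods that combine document content with usage logs (document categorization based and document clustering based), and evaluate them alongside the traditional content-based method by comparing the entropy of each against the experts' manual classification. The results show that, overall, the content-based method agrees more closely with the expert classification.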
    Abstract (English) A literature digital library is one of the most important resources for preserving the assets of civilization. To provide more effective and efficient information search, many systems are equipped with a browsing interface that aims to ease the article searching task. A browsing interface is associated with a subject directory, which guides users to the articles that meet their information need. A subject directory contains a set (or a hierarchy) of subject categories, each containing a number of similar articles. How to group articles in a literature digital library is the theme of this thesis.
    Previous work used either document classification or document clustering to dispatch articles into a set of clusters based on their content. We observed that the articles that meet a single user's information need do not necessarily fall in a single cluster. In this thesis, we propose to use both the Web log and article content in clustering articles. We propose two hybrid approaches, namely a document categorization based method and a document clustering based method. These alternatives were compared with content-based methods. We found that the document categorization based method effectively reduces the number of required click-throughs at the expense of a slight increase in entropy, which measures the content heterogeneity of each generated cluster.
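    The entropy measure mentioned in the abstract can be illustrated with a minimal sketch. This record does not give the thesis's exact formula, so the code assumes the definition commonly used in document clustering evaluation: the entropy of one cluster is computed over the manually assigned categories of its articles, and the overall score is the size-weighted average over clusters. The function names and the sample labels are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Entropy (in bits) of the manual category labels inside one cluster.

    Assumed definition: -sum(p_i * log2(p_i)), where p_i is the share of
    the cluster's articles that carry manual category i.
    """
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def total_entropy(clusters):
    """Size-weighted average entropy over all non-empty clusters."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters if c)


# Hypothetical example: each inner list holds the manual category
# of every article that an automatic method placed in that cluster.
clusters = [
    ["databases", "databases"],   # homogeneous cluster
    ["databases", "networking"],  # evenly mixed cluster
]
print(total_entropy(clusters))
```

    Under this assumed definition, a lower value means the automatic clusters are more homogeneous with respect to the manual categories, which matches the abstract's reading of entropy as content heterogeneity.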
  • Keywords (Chinese)
  • Document clustering
  • Document categorization
  • Literature digital library
  • Usage log clustering
  • Keywords (English)
  • Digital library
  • Document categorization
  • Usage clustering
  • Document clustering
  • Content-based clustering
  • Table of Contents Chapter 1 Introduction 1
    1.1 Research Background 1
    1.2 Research Motivations and Objectives 1
    1.3 Data Description 2
    1.4 Problem Description 4
    1.5 Thesis organization 5
    Chapter 2 Literature review 6
    2.1 Converting an article to a set of vectors 6
    2.2 Keyword Selection 8
    2.2.1 CHI Square Statistics 8
    2.2.2 Information Gains 9
    2.3 Web Usage Clustering 9
    2.3.1 Data preparation for Web usage log 9
    2.3.2 Usage Clustering 11
    Based on frequent itemsets 12
    Based on Hyperclique Patterns 13
    2.4 Content-based Clustering 15
    2.5 Text Categorization 20
    2.5.1 Probabilistic Classifiers 20
    2.5.2 Neural Network Classifiers 21
    2.5.3 Support Vector Machines 21
    Chapter 3 Content-based and hybrid approach 24
    3.1 Content-based clustering 24
    3.1.1 Article Clique Hypergraph Partitioning 25
    3.1.2 K-means 26
    3.2 Hybrid approach 26
    3.2.1 Document categorization based hybrid approach 27
    3.2.2 Document clustering based hybrid approach 28
    Chapter 4 Performance Evaluation 29
    4.1 Performance Metrics 32
    4.2 Experimental Results 34
    4.2.1 Comparing usage coherence of various clustering 34
    4.2.2 Comparing automatic clusters with manual clusters 36
    Chapter 5 Conclusions 41
    Reference 50
    References [AS94] Agrawal, R. and Srikant, R., “Fast algorithms for mining association rules”, In Proceedings of the 20th VLDB conference, pp. 487-499, Santiago, Chile, 1994.
    [BGGH99] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong Han, et al. “Partitioning-Based Clustering for Web Document Categorization”, Decision Support Systems, Volume 27, Issue 3 (Special issue on WITS '97), pp. 329-341, Dec. 1999.
    [Chuang03] S. M. Chuang. "Combining Content-based and Collaborative Article Recommendation in Literature Digital Libraries", master thesis, National Sun Yat-sen University Department of Information Management, Jul.2003.
    [CMS99] R. Cooley, B. Mobasher, and J. Srivastava, “Creating adaptive Web sites through usage-based clustering of URLs,” In Proc. of the 1999 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX), November 1999.
    [FM01] E. A. Fox and G. Marchionini. "Digital Libraries," Communications of the ACM, 44(5), pp. 30-32, May 2001.
    [Fox92] C. Fox, “Lexical Analysis and Stoplists,” Chapter 7, in Information Retrieval: Data Structures & Algorithms, edited by W. B. Frakes and R. Baeza-Yates, Prentice Hall, 1992.
    [HKKM97] Han, E-H, Karypis, G., Kumar, V., and Mobasher, B., "Clustering based on association rule hypergraphs," In Proceedings of SIGMOD’97 Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD’97), May 1997.
    [HKKM98] Han, E-H, Karypis, G., Kumar, V., and Mobasher, B., "Hypergraph based clustering in high dimensional data sets: a summary of results." IEEE Bulletin of the Technical Committee on Data Engineering, (21) 1, March 1998.
    [Hsiung02] W.C. Hsiung. “Article Recommendation in Literature Digital Libraries.”, master thesis, National Sun Yat-sen University, department of Information Management, Jul. 2002.
    [Joac98] T. Joachims, “Text Categorization with support vector machines: learning with many relevant features.” In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, DE, 1998), pp. 137-142.
    [Joac99] T. Joachims, “Making large-Scale SVM Learning Practical,” in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.), MIT-Press, 1999.
    [KS97] Daphne Koller and Mehran Sahami, "Hierarchically classifying documents using very few words," Proceedings of the 14th International Conference on Machine Learning (ML), Nashville, Tennessee, July 1997, pp. 170-178.
    [MDL00a] B. Mobasher, H. Dai, T. Luo, Miki Nakagawa, and Jim Witshire. "Discovery of aggregate usage profiles for Web personalization," In Proc. of the WebKDD Workshop, 2000.
    [MDL00b] B. Mobasher, H. Dai, T. Luo, Y. Sung, and J. Zhu, "Integrating Web Usage and Content Mining for More Effective Personalization," International Conference on E-Commerce and Web Technologies (ECWeb2000), Greenwich, UK. September 2000.
    [Se02] Fabrizio Sebastiani, “Machine Learning in Automated Text Categorization” Consiglio Nazionale delle Ricerche, Italy, 2002
    [SKK00] M. Steinbach, G. Karypis, and V. Kumar, "A Comparison of Document Clustering Techniques," In KDD Workshop on Text Mining, 2000.
    [SYZX01] Z. Su, Q. Yang, H. Zhang, X. Xu, and Y. Hu, "Correlation-based Document Clustering using Web Logs," 34th Annual Hawaii International Conference on System Sciences (HICSS-34), Volume 5, Jan. 3-6, 2001.
    [XTK04] Hui Xiong, Pang-Ning Tan, and Vipin Kumar, “Mining Hyperclique Patterns in Data Sets with Skewed Support Distributions,” Kluwer Academic Publishers, 2004.
    [YP97] Yang, Y. and Pedersen, J.O., “A comparative Study on Feature Selection in Text Categorization,” Proceedings of 14th International Conference on Machine Learning, 1997, pp. 412-420.
    [ZK02] Ying Zhao and George Karypis, “Evaluation of hierarchical clustering algorithms for document datasets,” Proceedings of the Eleventh International Conference on Information and Knowledge Management, 2002, pp. 515-524.
  • 魏志平 - Committee chair
  • 林福仁 - Committee member
  • 黃三益 - Advisor
  • Oral defense date 2004-07-26 Submission date 2004-08-10
