博碩士論文 etd-0810104-153712 詳細資訊
Title page for etd-0810104-153712
論文名稱
Title
結合文件內容和使用紀錄的文獻數位圖書館文件分群技術
Clustering Articles in a Literature Digital Library Based on Content and Usage
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
52
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2004-07-26
繳交日期
Date of Submission
2004-08-10
關鍵字
Keywords
文件分群、文件分類、文獻數位圖書館、使用者紀錄分群
Digital library, Document categorization, Usage clustering, Document clustering, Content-based clustering
統計
Statistics
本論文已被瀏覽 5911 次,被下載 4625
The thesis/dissertation has been browsed 5911 times and downloaded 4625 times.
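Chinese Abstract (translated)
Literature digital libraries store literature in digital form, and researchers can conveniently search it over the Internet. When searching, however, users often face a large volume of results at once and cannot find the articles they actually want. To provide a more efficient search service, many systems offer a browsing interface intended to reduce the number of clicks users must make. In this study, we aim to build a browsing interface with a subject directory that makes searching for articles more convenient and efficient.

In earlier related work, document classification or clustering can be applied to the problem addressed here, but document classification mostly requires expert help and a predefined set of subject categories or a directory. This study therefore tries to use the usage log of the system's users to stand in for expert classification, reducing the cost of manual expert work while building a browsing interface that meets user needs. The study proposes two methods that combine document content with usage logs (document categorization based and document clustering based), and evaluates them, together with traditional content-based methods, by comparing entropy against the results of manual expert classification. The results show that the content-based methods overall agree more closely with the expert classification.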
Abstract
A literature digital library is one of the most important resources for preserving the assets of civilization. To provide more effective and efficient information search, many systems are equipped with a browsing interface that aims to ease the task of searching for articles. A browsing interface is associated with a subject directory, which guides users to the articles that meet their information need. A subject directory contains a set (or a hierarchy) of subject categories, each containing a number of similar articles. How to group articles in a literature digital library is the theme of this thesis.

Previous work used either document classification or document clustering to dispatch articles into a set of article clusters based on their content. We observed that articles that meet a single user’s information need do not necessarily fall in a single cluster. In this thesis, we propose to make use of both the Web log and article content in clustering articles. We propose two hybrid approaches, namely a document categorization based method and a document clustering based method. These alternatives were compared to other content-based methods. We found that the document categorization based method effectively reduces the number of required click-throughs at the expense of a slight increase in entropy, which measures the content heterogeneity of each generated cluster.
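As a rough illustration of the entropy measure mentioned in the abstract, the sketch below computes a size-weighted entropy for a set of clusters, given a reference category label for each article. It assumes the standard definition of cluster entropy used in the document clustering literature; the thesis may define or weight it differently. The function names and the sample labels (`db`, `ir`) are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Entropy (in bits) of the reference-category mix inside one cluster."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def weighted_entropy(clusters):
    """Average of per-cluster entropies, weighted by cluster size."""
    total = sum(len(cluster) for cluster in clusters)
    return sum(
        len(cluster) / total * cluster_entropy(cluster)
        for cluster in clusters
        if cluster  # skip empty clusters
    )


# Hypothetical example: each inner list holds the reference category
# of every article assigned to that cluster.
clusters = [
    ["db", "db", "db", "db"],  # homogeneous cluster, entropy 0
    ["ir", "ir", "db", "db"],  # evenly mixed cluster, entropy 1 bit
]
overall = weighted_entropy(clusters)
```

Lower values mean that each cluster is dominated by a single reference category, so a slight increase in entropy corresponds to clusters that are somewhat more mixed in content.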
目次 Table of Contents
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Research Motivations and Objectives 1
1.3 Data Description 2
1.4 Problem Description 4
1.5 Thesis organization 5
Chapter 2 Literature review 6
2.1 Converting an article to a set of vectors 6
2.2 Keyword Selection 8
2.2.1 CHI Square Statistics 8
2.2.2 Information Gains 9
2.3 Web Usage Clustering 9
2.3.1 Data preparation for Web usage log 9
2.3.2 Usage Clustering 11
2.3.2.1 Based on frequent itemsets 12
2.3.2.2 Based on Hyperclique Patterns 13
2.4 Content-based Clustering 15
2.5 Text Categorization 20
2.5.1 Probabilistic Classifiers 20
2.5.2 Neural Network Classifiers 21
2.5.3 Support Vector Machines 21
Chapter 3 Content-based and hybrid approach 24
3.1 Content-based clustering 24
3.1.1 Article Clique Hypergraph Partitioning 25
3.1.2 K-means 26
3.2 Hybrid approach 26
3.2.1 Document categorization based hybrid approach 27
3.2.2 Document clustering based hybrid approach 28
Chapter 4 Performance Evaluation 29
4.1 Performance Metrics 32
4.2 Experimental Results 34
4.2.1 Comparing usage coherence of various clustering 34
4.2.2 Comparing automatic clusters with manual clusters 36
Chapter 5 Conclusions 41
Reference 50
參考文獻 References
[AS94] Agrawal, R. and Srikant, R., “Fast algorithms for mining association rules”, In Proceedings of the 20th VLDB Conference, pp. 487-499, Santiago, Chile, 1994.
[BGGH99] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong Han, et al., “Partitioning-Based Clustering for Web Document Categorization”, Decision Support Systems, Volume 27, Issue 3 (Special issue on WITS '97), pp. 329-341, Dec. 1999.
[Chuang03] S. M. Chuang. "Combining Content-based and Collaborative Article Recommendation in Literature Digital Libraries", master thesis, National Sun Yat-sen University Department of Information Management, Jul.2003.
[CMS99] R. Cooley, B. Mobasher, and J. Srivastava, “Creating adaptive Web sites through usage-based clustering of URLs,” In Proc. of the 1999 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX), November 1999.
[FM01] E. A. Fox and G. Marchionini. "Digital Libraries," Communications of the ACM, 44(5), pp. 30-32, May 2001.
[Fox92] C. Fox, “Lexical Analysis and Stoplists,” Chapter 7, in Information Retrieval: Data Structures & Algorithms, edited by W. B. Frakes and R. Baeza-Yates, Prentice Hall, 1992.
[HKKM97] Han, E-H, Karypis, G., Kumar, V., and Mobasher, B., "Clustering based on association rule hypergraphs," In Proceedings of SIGMOD’97 Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD’97), May 1997.
[HKKM98] Han, E-H, Karypis, G., Kumar, V., and Mobasher, B., "Hypergraph based clustering in high dimensional data sets: a summary of results." IEEE Bulletin of the Technical Committee on Data Engineering, (21) 1, March 1998.
[Hsiung02] W. C. Hsiung, “Article Recommendation in Literature Digital Libraries”, master thesis, National Sun Yat-sen University, Department of Information Management, Jul. 2002.
[Joac98] T. Joachims, “Text Categorization with support vector machines: learning with many relevant features.” In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, DE, 1998), pp. 137-142.
[Joac99] T. Joachims, “Making Large-Scale SVM Learning Practical,” in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola (eds.), MIT Press, 1999.
[KS97] Daphne Koller and Mehran Sahami, "Hierarchically classifying documents using very few words," Proceedings of the 14th International Conference on Machine Learning (ML), Nashville, Tennessee, July 1997, pp. 170-178.
[MDL00a] B. Mobasher, H. Dai, T. Luo, Miki Nakagawa, and Jim Wiltshire, "Discovery of aggregate usage profiles for Web personalization," In Proc. of the WebKDD Workshop, 2000.
[MDL00b] B. Mobasher, H. Dai, T. Luo, Y. Sung, and J. Zhu, "Integrating Web Usage and Content Mining for More Effective Personalization," International Conference on E-Commerce and Web Technologies (ECWeb2000), Greenwich, UK. September 2000.
[Se02] Fabrizio Sebastiani, “Machine Learning in Automated Text Categorization”, Consiglio Nazionale delle Ricerche, Italy, 2002.
[SKK00] M. Steinbach, G. Karypis, and V. Kumar, "A Comparison of Document Clustering Techniques," In KDD Workshop on Text Mining, 2000.
[SYZX01] Z. Su, Q. Yang, H. Zhang, X. Xu, and Y. Hu, "Correlation-based Document Clustering using Web Logs," 34th Annual Hawaii International Conference on System Sciences (HICSS-34), Volume 5, Jan. 3-6, 2001.
[XTK04] Hui Xiong, Pang-Ning Tan, and Vipin Kumar, “Mining Hyperclique Patterns in Data Sets with Skewed Support Distributions,” Kluwer Academic Publishers, 2004.
[YP97] Yang, Y. and Pedersen, J.O., “A Comparative Study on Feature Selection in Text Categorization,” Proceedings of 14th International Conference on Machine Learning, 1997, pp. 412-420.
[ZK02] Ying Zhao and George Karypis, “Evaluation of hierarchical clustering algorithms for document datasets,” Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM), 2002, pp. 515-524.
電子全文 Fulltext
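This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.
論文使用權限 Thesis access permission: available on campus immediately, available off campus after one year (off campus withheld)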
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
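Public-access information for printed theses is relatively complete from academic year 102 onward. To look up public-access information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available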
