creators_name: Kumar, Niraj
creators_name: Vemula, Venkata Vinay Babu
creators_name: Srinathan, Kannan
creators_name: Varma, Vasudeva
creators_id: niraj_kumar@research.iiit.ac.in
creators_id: vinaybabu.vv@gmail.com
creators_id: srinathan@iiit.ac.in
creators_id: vv@iiit.ac.in
type: confpaper
datestamp: 2010-11-22 14:10:12
lastmod: 2011-03-11 08:57:49
metadata_visibility: show
title: EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWLEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING
ispublished: pub
subjects: comp-sci-stat-model
full_text_status: public
keywords: Document clustering, Group-average agglomerative clustering, Community detection, Similarity measure, N-gram, Wikipedia based additional knowledge.
abstract: This paper addresses the question: “How can we use Wikipedia based concepts in document clustering with less human involvement, while still achieving effective improvements in results?” We propose a method that exploits the importance of N-grams in a document and uses Wikipedia based additional knowledge for GAAC based document clustering. The importance of an N-gram in a document depends on several features including, but not limited to, its frequency, the position of its occurrence in a sentence, and the position of that sentence in the document. First, we introduce a new similarity measure that takes weighted N-gram importance into account when computing document similarity during clustering. As a result, the chances of topical similarity in clustering are improved. Second, we use Wikipedia as an additional knowledge base, both to remove noisy entries from the extracted N-grams and to reduce the information gap between conceptually related N-grams that do not match owing to differences in writing scheme or strategy.
Our experimental results on a publicly available text dataset show that the devised system significantly improves performance over bag-of-words based state-of-the-art systems in this area.
date: 2010-10-25
date_type: published
refereed: TRUE
referencetext: 1. Banerjee, S., Ramanathan, K., Gupta, A., 2007. Clustering Short Texts using Wikipedia. SIGIR’07, July 23–27, Amsterdam, The Netherlands.
2. Clauset, A., Newman, M., Moore, C., 2004. Finding community structure in very large networks. Physical Review E, 70:066111.
3. Hammouda, K., Matute, D., Kamel, M., 2005. CorePhrase: Keyphrase Extraction for Document Clustering. In IAPR: 4th International Conference on Machine Learning and Data Mining.
4. Han, J., Kim, T., Choi, J., 2007. Web Document Clustering by Using Automatic Keyphrase Extraction. Proceedings of the 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops.
5. Hu, X., Zhang, X., Lu, C., Park, E., Zhou, X., 2009. Exploiting Wikipedia as External Knowledge for Document Clustering. KDD’09.
6. Huang, A., Milne, D., Frank, E., Witten, I., 2008. Clustering Documents with Active Learning Using Wikipedia. ICDM 2008.
7. Huang, A., Milne, D., Frank, E., Witten, I., 2009. Clustering documents using a Wikipedia-based concept representation. In Proc. 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining.
8. Kaufman, L., Rousseeuw, P., 1999. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons.
9. Kumar, N., Srinathan, K., 2008. Automatic Keyphrase Extraction from Scientific Documents Using N-gram Filtration Technique. In the Proceedings of ACM DocEng.
10. Newman, M., Girvan, M., 2004. Finding and evaluating community structure in networks. Physical Review E, 69:026113.
11. Steinbach, M., Karypis, G., Kumar, V., 2000. A Comparison of document clustering techniques. Technical Report.
Department of Computer Science and Engineering, University of Minnesota.
12. Tan, P., Steinbach, M., Kumar, V., 2006. Introduction to Data Mining. Addison-Wesley. ISBN-10: 0321321367.
13. Zhao, Y., Karypis, G., 2001. Criterion functions for document clustering: experiments and analysis. Technical Report. Department of Computer Science, University of Minnesota.
citation: Kumar, Mr. Niraj and Vemula, Mr. Venkata Vinay Babu and Srinathan, Dr. Kannan and Varma, Dr. Vasudeva (2010) EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWLEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING. [Conference Paper]
document_url: http://cogprints.org/7148/1/KDIR_Niraj.pdf