"7148","EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING","This paper provides a solution to the issue: “How can we use Wikipedia based concepts in document clustering with lesser human involvement, accompanied by effective improvements in result?” In the devised system, we propose a method to exploit the importance of N-grams in a document and use Wikipedia based additional knowledge for GAAC based document clustering. The importance of N-grams in a document depends on several features including, but not limited to: frequency, position of their occurrence in a sentence and the position of the sentence in which they occur, in the document. First, we introduce a new similarity measure, which takes the weighted N-gram importance into account, in the calculation of similarity measure while performing document clustering. As a result, the chances of topical similarity in clustering are improved. Second, we use Wikipedia as an additional knowledge base both, to remove noisy entries from the extracted N-grams and to reduce the information gap between N-grams that are conceptually-related, which do not have a match owing to differences in writing scheme or strategies. Our experimental results on the publicly available text dataset clearly show that our devised system has a significant improvement in performance over bag-of-words based state-of-the-art systems in this area.","http://cogprints.org/7148/","Kumar, Mr. Niraj and Vemula, Mr. Venkata Vinay Babu and Srinathan, Dr. Kannan and Varma, Dr. Vasudeva","UNSPECIFIED"," Kumar, Mr. Niraj and Vemula, Mr. Venkata Vinay Babu and Srinathan, Dr. Kannan and Varma, Dr. Vasudeva (2010) EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING. [Conference Paper] ","niraj_kumar@research.iiit.ac.in,vinaybabu.vv@gmail.com,srinathan@iiit.ac.in,vv@iiit.ac.in","2010-10-25"