TY  - GEN
ID  - cogprints7148
UR  - http://cogprints.org/7148/
A1  - Kumar, Mr. Niraj
A1  - Vemula, Mr. Venkata Vinay Babu
A1  - Srinathan, Dr. Kannan
A1  - Varma, Dr. Vasudeva
TI  - EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWLEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING
Y1  - 2010/10/25/
N2  - This paper addresses the question: "How can we use Wikipedia-based concepts in document clustering with less human involvement, accompanied by effective improvements in results?" We propose a method that exploits the importance of N-grams in a document and uses Wikipedia-based additional knowledge for GAAC-based document clustering. The importance of an N-gram in a document depends on several features including, but not limited to, its frequency, the position of its occurrence in a sentence, and the position in the document of the sentence in which it occurs. First, we introduce a new similarity measure that takes weighted N-gram importance into account when computing similarity during document clustering. This improves the likelihood that clustered documents are topically similar. Second, we use Wikipedia as an additional knowledge base, both to remove noisy entries from the extracted N-grams and to reduce the information gap between conceptually related N-grams that do not match owing to differences in writing scheme or strategy. Our experimental results on a publicly available text dataset show that the devised system achieves a significant improvement in performance over bag-of-words based state-of-the-art systems in this area.
AV  - public
KW  - Document clustering
KW  - Group-average agglomerative clustering
KW  - Community detection
KW  - Similarity measure
KW  - N-gram
KW  - Wikipedia based additional knowledge
ER  -
