creators_name: Turney, Peter
type: techreport
datestamp: 2001-09-17
lastmod: 2011-03-11 08:54:48
metadata_visibility: show
title: Extraction of Keyphrases from Text: Evaluation of Four Algorithms
ispublished: unpub
subjects: comp-sci-lang
subjects: comp-sci-mach-learn
subjects: comp-sci-stat-model
full_text_status: public
keywords: keywords, keyphrases, keyphrase extraction, summarization, AutoSummarize.
abstract: This report presents an empirical evaluation of four algorithms for automatically extracting keywords and keyphrases from documents. The four algorithms are compared using five different collections of documents. For each document, we have a target set of keyphrases, which were generated by hand. The target keyphrases were generated for human readers; they were not tailored for any of the four keyphrase extraction algorithms. Each of the algorithms was evaluated by the degree to which the algorithm’s keyphrases matched the manually generated keyphrases. The four algorithms were (1) the AutoSummarize feature in Microsoft’s Word 97, (2) an algorithm based on Eric Brill’s part-of-speech tagger, (3) the Summarize feature in Verity’s Search 97, and (4) NRC’s Extractor algorithm. For all five document collections, NRC’s Extractor yields the best match with the manually generated keyphrases.
date: 1997
date_type: published
institution: National Research Council of Canada
department: Institute for Information Technology
refereed: FALSE
referencetext: Brandow, R., Mitze, K., and Rau, L.R. (1995). The automatic condensation of electronic publications by sentence selection. Information Processing and Management, 31 (5), 675-685.
Brill, E. (1992). A simple rule-based part of speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, Association for Computational Linguistics (ACL), Trento, Italy.
Brill, E. (1993). A Corpus-Based Approach to Language Learning. Ph.D.
Dissertation, Department of Computer and Information Science, University of Pennsylvania.
Brill, E. (1994). Some advances in rule-based part of speech tagging. Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, Washington.
Croft, B. (1991). The use of phrases and structured queries in information retrieval. SIGIR-91: Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 32-45, New York: ACM.
Edmundson, H.P. (1969). New methods in automatic extracting. Journal of the Association for Computing Machinery, 16 (2), 264-285.
Fagan, J.L. (1987). Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods. Ph.D. Dissertation, Department of Computer Science, Cornell University, Report #87-868, Ithaca, New York.
Ginsberg, A. (1993). A unified approach to automatic indexing and information retrieval. IEEE Expert, 8, 46-56.
Jang, D.-H., and Myaeng, S.H. (1997). Development of a document summarization system for effective information services. RIAO 97 Conference Proceedings: Computer-Assisted Information Searching on Internet, pp. 101-111. Montreal, Canada.
Johnson, F.C., Paice, C.D., Black, W.J., and Neal, A.P. (1993). The application of linguistic processing to automatic abstract generation. Journal of Document and Text Management, 1, 215-241.
Krulwich, B., and Burkey, C. (1996). Learning user information interests through the extraction of semantically significant phrases. In M. Hearst and H. Hirsh, editors, AAAI 1996 Spring Symposium on Machine Learning in Information Access. California: AAAI Press.
Krupka, G. (1995). SRA: Description of the SRA system as used for MUC-6. Proceedings of the Sixth Message Understanding Conference. California: Morgan Kaufmann.
Kupiec, J., Pedersen, J., and Chen, F. (1995). A trainable document summarizer. In E.A. Fox, P. Ingwersen, and R.
Fidel, editors, SIGIR-95: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 68-73, New York: ACM.
Leung, C.-H., and Kan, W.-K. (1997). A statistical learning approach to automatic indexing of controlled index terms. Journal of the American Society for Information Science, 48 (1), 55-66.
Lewis, D.D. (1995). Evaluating and optimizing autonomous text classification systems. In E.A. Fox, P. Ingwersen, and R. Fidel, editors, SIGIR-95: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 246-254, New York: ACM.
Lovins, J.B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11, 22-31.
Luhn, H.P. (1958). The automatic creation of literature abstracts. I.B.M. Journal of Research and Development, 2 (2), 159-165.
Marsh, E., Hamburger, H., and Grishman, R. (1984). A production rule system for message summarization. In AAAI-84, Proceedings of the American Association for Artificial Intelligence, pp. 243-246. Cambridge, MA: AAAI Press/MIT Press.
MUC-3. (1991). Proceedings of the Third Message Understanding Conference. California: Morgan Kaufmann.
MUC-4. (1992). Proceedings of the Fourth Message Understanding Conference. California: Morgan Kaufmann.
MUC-5. (1993). Proceedings of the Fifth Message Understanding Conference. California: Morgan Kaufmann.
MUC-6. (1995). Proceedings of the Sixth Message Understanding Conference. California: Morgan Kaufmann.
Muñoz, A. (1996). Compound key word generation from document databases using a hierarchical clustering ART model. Intelligent Data Analysis, 1 (1), Amsterdam: Elsevier.
Nakagawa, H. (1997). Extraction of index words from manuals. RIAO 97 Conference Proceedings: Computer-Assisted Information Searching on Internet, pp. 598-611. Montreal, Canada.
Paice, C.D. (1990).
Constructing literature abstracts by computer: Techniques and prospects. Information Processing and Management, 26 (1), 171-186.
Paice, C.D., and Jones, P.A. (1993). The identification of important concepts in highly structured technical papers. SIGIR-93: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 69-78, New York: ACM.
Porter, M.F. (1980). An algorithm for suffix stripping. Program; Automated Library and Information Systems, 14 (3), 130-137.
Salton, G. (1988). Syntactic approaches to automatic book indexing. Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics, pp. 120-138. New York: ACM.
Salton, G., Allan, J., Buckley, C., and Singhal, A. (1994). Automatic analysis, theme generation, and summarization of machine-readable texts. Science, 264, 1421-1426.
Soderland, S., and Lehnert, W. (1994). Wrap-Up: A trainable discourse module for information extraction. Journal of Artificial Intelligence Research, 2, 131-158.
Steier, A.M., and Belew, R.K. (1993). Exporting phrases: A statistical analysis of topical language. In R. Casey and B. Croft, editors, Second Symposium on Document Analysis and Information Retrieval, pp. 179-190.
van Rijsbergen, C.J. (1979). Information Retrieval. Second edition. London: Butterworths.
citation: Turney, Peter (1997) Extraction of Keyphrases from Text: Evaluation of Four Algorithms. [Departmental Technical Report] (Unpublished)
document_url: http://cogprints.org/1803/1/ERB-1051.ps
document_url: http://cogprints.org/1803/5/ERB-1051.pdf