creators_name: Turney, Peter
type: techreport
datestamp: 2001-09-17
lastmod: 2011-03-11 08:54:48
metadata_visibility: show
title: Extraction of Keyphrases from Text: Evaluation of Four Algorithms
ispublished: unpub
subjects: comp-sci-lang
subjects: comp-sci-mach-learn
subjects: comp-sci-stat-model
full_text_status: public
keywords: keywords, keyphrases, keyphrase extraction, summarization, AutoSummarize.
abstract: This report presents an empirical evaluation of four algorithms for automatically extracting keywords and keyphrases from documents. The four algorithms are compared using five different collections of documents. For each document, we have a target set of keyphrases, which were generated by hand. The target keyphrases were generated for human readers; they were not tailored for any of the four keyphrase extraction algorithms. Each of the algorithms was evaluated by the degree to which the algorithm's keyphrases matched the manually generated keyphrases. The four algorithms were (1) the AutoSummarize feature in Microsoft's Word 97, (2) an algorithm based on Eric Brill's part-of-speech tagger, (3) the Summarize feature in Verity's Search 97, and (4) NRC's Extractor algorithm. For all five document collections, NRC's Extractor yields the best match with the manually generated keyphrases.
date: 1997
date_type: published
institution: National Research Council of Canada
department: Institute for Information Technology
refereed: FALSE
referencetext: Brandow, R., Mitze, K., and Rau, L.R. (1995). The automatic condensation of electronic publications
by sentence selection. Information Processing and Management, 31 (5), 675-685.
Brill, E. (1992). A simple rule-based part of speech tagger. In Proceedings of the Third Conference
on Applied Natural Language Processing, Association for Computational Linguistics
(ACL), Trento, Italy.
Brill, E. (1993). A Corpus-Based Approach to Language Learning. Ph.D. Dissertation, Department
of Computer and Information Science, University of Pennsylvania.
Brill, E. (1994). Some advances in rule-based part of speech tagging. Proceedings of the Twelfth
National Conference on Artificial Intelligence (AAAI-94), Seattle, Washington.
Croft, B. (1991). The use of phrases and structured queries in information retrieval. SIGIR-91:
Proceedings of the 14th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 32-45, New York: ACM.
Edmundson, H.P. (1969). New methods in automatic extracting. Journal of the Association for
Computing Machinery, 16 (2), 264-285.
Fagan, J.L. (1987). Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison
of Syntactic and Non-Syntactic Methods. Ph.D. Dissertation, Department of Computer
Science, Cornell University, Report #87-868, Ithaca, New York.
Ginsberg, A. (1993). A unified approach to automatic indexing and information retrieval. IEEE
Expert, 8, 46-56.
Jang, D.-H., and Myaeng, S.H. (1997). Development of a document summarization system for
effective information services. RIAO 97 Conference Proceedings: Computer-Assisted Information
Searching on Internet, pp. 101-111. Montreal, Canada.
Johnson, F.C., Paice, C.D., Black, W.J., and Neal, A.P. (1993). The application of linguistic processing
to automatic abstract generation. Journal of Document and Text Management, 1, 215-241.
Krulwich, B., and Burkey, C. (1996). Learning user information interests through the extraction
of semantically significant phrases. In M. Hearst and H. Hirsh, editors, AAAI 1996 Spring
Symposium on Machine Learning in Information Access. California: AAAI Press.
Krupka, G. (1995). SRA: Description of the SRA system as used for MUC-6. Proceedings of the
Sixth Message Understanding Conference. California: Morgan Kaufmann.
Kupiec, J., Pedersen, J., and Chen, F. (1995). A trainable document summarizer. In E.A. Fox, P.
Ingwersen, and R. Fidel, editors, SIGIR-95: Proceedings of the 18th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 68-73,
New York: ACM.
Leung, C.-H., and Kan, W.-K. (1997). A statistical learning approach to automatic indexing of
controlled index terms. Journal of the American Society for Information Science, 48 (1), 55-66.
Lewis, D.D. (1995). Evaluating and optimizing autonomous text classification systems. In E.A.
Fox, P. Ingwersen, and R. Fidel, editors, SIGIR-95: Proceedings of the 18th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, pp.
246-254, New York: ACM.
Lovins, J.B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational
Linguistics, 11, 22-31.
Luhn, H.P. (1958). The automatic creation of literature abstracts. I.B.M. Journal of Research and
Development, 2 (2), 159-165.
Marsh, E., Hamburger, H., and Grishman, R. (1984). A production rule system for message summarization.
In AAAI-84, Proceedings of the American Association for Artificial Intelligence,
pp. 243-246. Cambridge, MA: AAAI Press/MIT Press.
MUC-3. (1991). Proceedings of the Third Message Understanding Conference. California: Morgan
Kaufmann.
MUC-4. (1992). Proceedings of the Fourth Message Understanding Conference. California:
Morgan Kaufmann.
MUC-5. (1993). Proceedings of the Fifth Message Understanding Conference. California: Morgan
Kaufmann.
MUC-6. (1995). Proceedings of the Sixth Message Understanding Conference. California: Morgan
Kaufmann.
Muñoz, A. (1996). Compound key word generation from document databases using a hierarchical
clustering ART model. Intelligent Data Analysis, 1 (1), Amsterdam: Elsevier.
Nakagawa, H. (1997). Extraction of index words from manuals. RIAO 97 Conference Proceedings:
Computer-Assisted Information Searching on Internet, pp. 598-611. Montreal, Canada.
Paice, C.D. (1990). Constructing literature abstracts by computer: Techniques and prospects.
Information Processing and Management, 26 (1), 171-186.
Paice, C.D., and Jones, P.A. (1993). The identification of important concepts in highly structured
technical papers. SIGIR-93: Proceedings of the 16th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval, pp. 69-78, New York: ACM.
Porter, M.F. (1980). An algorithm for suffix stripping. Program; Automated Library and Information
Systems, 14 (3), 130-137.
Salton, G. (1988). Syntactic approaches to automatic book indexing. Proceedings of the 26th
Annual Meeting of the Association for Computational Linguistics, pp. 120-138. New York:
ACM.
Salton, G., Allan, J., Buckley, C., and Singhal, A. (1994). Automatic analysis, theme generation,
and summarization of machine-readable texts. Science, 264, 1421-1426.
Soderland, S., and Lehnert, W. (1994). Wrap-Up: A trainable discourse module for information
extraction. Journal of Artificial Intelligence Research, 2, 131-158.
Steier, A. M., and Belew, R. K. (1993). Exporting phrases: A statistical analysis of topical language.
In R. Casey and B. Croft, editors, Second Symposium on Document Analysis and
Information Retrieval, pp. 179-190.
van Rijsbergen, C.J. (1979). Information Retrieval. Second edition. London: Butterworths.

citation: Turney, Peter (1997) Extraction of Keyphrases from Text: Evaluation of Four Algorithms. [Departmental Technical Report] (Unpublished)
document_url: http://cogprints.org/1803/1/ERB-1051.ps
document_url: http://cogprints.org/1803/5/ERB-1051.pdf