Cogprints

Extraction of Keyphrases from Text: Evaluation of Four Algorithms

Turney, Peter (1997) Extraction of Keyphrases from Text: Evaluation of Four Algorithms. [Departmental Technical Report] (Unpublished)


Abstract

This report presents an empirical evaluation of four algorithms for automatically extracting keywords and keyphrases from documents. The four algorithms are compared using five different collections of documents. For each document, we have a target set of keyphrases, which were generated by hand. The target keyphrases were generated for human readers; they were not tailored for any of the four keyphrase extraction algorithms. Each of the algorithms was evaluated by the degree to which the algorithm’s keyphrases matched the manually generated keyphrases. The four algorithms were (1) the AutoSummarize feature in Microsoft’s Word 97, (2) an algorithm based on Eric Brill’s part-of-speech tagger, (3) the Summarize feature in Verity’s Search 97, and (4) NRC’s Extractor algorithm. For all five document collections, NRC’s Extractor yields the best match with the manually generated keyphrases.
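The abstract says each algorithm was scored by how well its keyphrases matched the manually generated ones, but it does not state the exact matching rule or the measure used. The following is a minimal sketch under stated assumptions, not the report's procedure: two phrases are assumed to match when they are identical after lowercasing and whitespace normalization, and the match is summarized with precision, recall, and F-measure. The function names `normalize` and `score_keyphrases` and the sample phrases are hypothetical.

```python
def normalize(phrase):
    """Lowercase a phrase and collapse runs of whitespace (assumed matching rule)."""
    return " ".join(phrase.lower().split())


def score_keyphrases(extracted, target):
    """Compare machine-extracted keyphrases with a hand-made target set.

    Returns (precision, recall, f_measure). Exact matching after
    normalization is an assumption made for this illustration.
    """
    extracted_set = {normalize(p) for p in extracted}
    target_set = {normalize(p) for p in target}
    matches = extracted_set & target_set

    # Guard against empty sets so the ratios are always defined.
    precision = len(matches) / len(extracted_set) if extracted_set else 0.0
    recall = len(matches) / len(target_set) if target_set else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Illustrative data only; these phrases are not taken from the report's corpora.
extracted = ["Keyphrase Extraction", "summarization", "neural networks", "documents"]
target = ["keyphrase extraction", "summarization", "keywords"]

precision, recall, f_measure = score_keyphrases(extracted, target)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```

With this sample data, two of the four extracted phrases appear in the target set, and two of the three target phrases are recovered. A looser rule, such as matching on word stems, would go inside `normalize` without changing the scoring logic.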

Item Type: Departmental Technical Report
Keywords: keywords, keyphrases, keyphrase extraction, summarization, AutoSummarize
Subjects: Computer Science > Language; Computer Science > Machine Learning; Computer Science > Statistical Models
ID Code: 1803
Deposited By: Turney, Peter
Deposited On: 17 Sep 2001
Last Modified: 11 Mar 2011 08:54

