creators_name: Turney, Peter
type: techreport
datestamp: 2001-09-17
lastmod: 2011-03-11 08:54:48
metadata_visibility: show
title: Learning to Extract Keyphrases from Text
ispublished: unpub
subjects: archives
subjects: comp-sci-lang
subjects: comp-sci-mach-learn
subjects: comp-sci-stat-model
full_text_status: public
keywords: machine learning, summarization, indexing, keywords, keyphrase extraction.
abstract: Many academic journals ask their authors to provide a list of about five to fifteen key words, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a surprisingly wide variety of tasks for which keyphrases are useful, as we discuss in this paper. Recent commercial software, such as Microsoft's Word 97 and Verity's Search 97, includes algorithms that automatically extract keyphrases from documents. In this paper, we approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for this task. The third set of experiments examines the performance of GenEx on the task of metadata generation, relative to the performance of Microsoft's Word 97. The fourth and final set of experiments investigates the performance of GenEx on the task of highlighting, relative to Verity's Search 97. The experimental results support the claim that a specialized learning algorithm (GenEx) can generate better keyphrases than a general-purpose learning algorithm (C4.5) and the non-learning algorithms that are used in commercial software (Word 97 and Search 97).
date: 1999
date_type: published
institution: National Research Council of Canada
department: Institute for Information Technology
refereed: FALSE
referencetext: Brandow, R., Mitze, K., and Rau, L.R. (1995). The automatic condensation of electronic publications by sentence selection. Information Processing and Management, 31 (5), 675-685. Breiman, L. (1996a). Arcing Classifiers. Technical Report 460, Statistics Department, University of California at Berkeley. Breiman, L. (1996b). Bagging predictors. Machine Learning, 24 (2), 123-140. Croft, B. (1991). The use of phrases and structured queries in information retrieval. SIGIR-91: Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 32-45, New York: ACM. Edmundson, H.P. (1969). New methods in automatic extracting. Journal of the Association for Computing Machinery, 16 (2), 264-285. Fagan, J.L. (1987). Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods. Ph.D. Dissertation, Department of Computer Science, Cornell University, Report #87-868, Ithaca, New York. Ginsberg, A. (1993). A unified approach to automatic indexing and information retrieval. IEEE Expert, 8, 46-56. Grefenstette, J.J. (1983). A user's guide to GENESIS. Technical Report CS-83-11, Computer Science Department, Vanderbilt University. Grefenstette, J.J. (1986). Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 16, 122-128. Jang, D.-H., and Myaeng, S.H. (1997). Development of a document summarization system for effective information services. RIAO 97 Conference Proceedings: Computer-Assisted Information Searching on Internet, pp. 101-111. Montreal, Canada. Johnson, F.C., Paice, C.D., Black, W.J., and Neal, A.P. (1993). The application of linguistic processing to automatic abstract generation. Journal of Document and Text Management, 1, 215-241. Krovetz, R.
(1993). Viewing morphology as an inference process. Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'93, 191-203. Krulwich, B., and Burkey, C. (1996). Learning user information interests through the extraction of semantically significant phrases. In M. Hearst and H. Hirsh, editors, AAAI 1996 Spring Symposium on Machine Learning in Information Access. California: AAAI Press. Krupka, G. (1995). SRA: Description of the SRA system as used for MUC-6. Proceedings of the Sixth Message Understanding Conference. California: Morgan Kaufmann. Kubat, M., Holte, R., and Matwin, S. (1998). Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30 (2/3), 195-215. Kupiec, J., Pedersen, J., and Chen, F. (1995). A trainable document summarizer. In E.A. Fox, P. Ingwersen, and R. Fidel, editors, SIGIR-95: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 68-73, New York: ACM. Leung, C.-H., and Kan, W.-K. (1997). A statistical learning approach to automatic indexing of controlled index terms. Journal of the American Society for Information Science, 48 (1), 55-66. Lewis, D.D. (1995). Evaluating and optimizing autonomous text classification systems. In E.A. Fox, P. Ingwersen, and R. Fidel, editors, SIGIR-95: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 246-254, New York: ACM. Lovins, J.B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11, 22-31. Luhn, H.P. (1958). The automatic creation of literature abstracts. I.B.M. Journal of Research and Development, 2 (2), 159-165. Marsh, E., Hamburger, H., and Grishman, R. (1984). A production rule system for message summarization. In AAAI-84, Proceedings of the American Association for Artificial Intelligence, pp. 243-246.
Cambridge, MA: AAAI Press/MIT Press. MUC-3. (1991). Proceedings of the Third Message Understanding Conference. California: Morgan Kaufmann. MUC-4. (1992). Proceedings of the Fourth Message Understanding Conference. California: Morgan Kaufmann. MUC-5. (1993). Proceedings of the Fifth Message Understanding Conference. California: Morgan Kaufmann. MUC-6. (1995). Proceedings of the Sixth Message Understanding Conference. California: Morgan Kaufmann. Muñoz, A. (1996). Compound key word generation from document databases using a hierarchical clustering ART model. Intelligent Data Analysis, 1 (1), Amsterdam: Elsevier. Nakagawa, H. (1997). Extraction of index words from manuals. RIAO 97 Conference Proceedings: Computer-Assisted Information Searching on Internet, pp. 598-611. Montreal, Canada. Paice, C.D. (1990). Constructing literature abstracts by computer: Techniques and prospects. Information Processing and Management, 26 (1), 171-186. Paice, C.D., and Jones, P.A. (1993). The identification of important concepts in highly structured technical papers. SIGIR-93: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 69-78, New York: ACM. Porter, M.F. (1980). An algorithm for suffix stripping. Program; Automated Library and Information Systems, 14 (3), 130-137. Quinlan, J.R. (1993). C4.5: Programs for machine learning. California: Morgan Kaufmann. Quinlan, J.R. (1996). Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI'96), pp. 725-730. AAAI Press. Salton, G. (1988). Syntactic approaches to automatic book indexing. Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics, pp. 120-138. New York: ACM. Salton, G., Allan, J., Buckley, C., and Singhal, A. (1994). Automatic analysis, theme generation, and summarization of machine-readable texts. Science, 264, 1421-1426. Soderland, S., and Lehnert, W. (1994).
Wrap-Up: A trainable discourse module for information extraction. Journal of Artificial Intelligence Research, 2, 131-158. Steier, A. M., and Belew, R. K. (1993). Exporting phrases: A statistical analysis of topical language. In R. Casey and B. Croft, editors, Second Symposium on Document Analysis and Information Retrieval, pp. 179-190. van Rijsbergen, C.J. (1979). Information Retrieval. Second edition. London: Butterworths. Whitley, D. (1989). The GENITOR algorithm and selective pressure. Proceedings of the Third International Conference on Genetic Algorithms (ICGA-89), pp. 116-121. California: Morgan Kaufmann.
citation: Turney, Peter (1999) Learning to Extract Keyphrases from Text. [Departmental Technical Report] (Unpublished)
document_url: http://cogprints.org/1802/1/ERB-1057.ps
document_url: http://cogprints.org/1802/5/ERB-1057.pdf