Turney, Peter (2000) Learning algorithms for keyphrase extraction. [Journal (Paginated)]
Abstract
Many academic journals ask their authors to provide a list of about five to fifteen keywords, to appear on the first page of each article. Since these keywords are often phrases of two or more words, we prefer to call them keyphrases. There is a wide variety of tasks for which keyphrases are useful, as we discuss in this paper. We approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. We evaluate the performance of nine different configurations of C4.5. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for automatically extracting keyphrases from text. The experimental results support the claim that a custom-designed algorithm (GenEx), incorporating specialized procedural domain knowledge, can generate better keyphrases than a general-purpose algorithm (C4.5). Subjective human evaluation of the keyphrases generated by GenEx suggests that about 80% of the keyphrases are acceptable to human readers. This level of performance should be satisfactory for a wide variety of applications.
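The supervised framing in the abstract (a document as a set of phrases, each a positive or negative example of a keyphrase) can be illustrated with a minimal sketch. This is not the paper's method: the function names `candidate_phrases` and `label_examples`, the three-word limit, and the punctuation-based splitting are all assumptions made for illustration, and the sketch omits the features, C4.5 configurations, and GenEx itself. It only shows how labeled training examples might be built from a text and its author-supplied keyphrases.

```python
import re


def candidate_phrases(text, max_words=3):
    """Return the set of all 1..max_words word sequences in the text, lowercased.

    Hypothetical helper: the phrase-length limit is an illustrative assumption.
    """
    phrases = set()
    # Split on punctuation (keeping hyphens) so no phrase spans a clause boundary.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        words = fragment.split()
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                phrases.add(" ".join(words[i:i + n]))
    return phrases


def label_examples(text, author_keyphrases, max_words=3):
    """Map each candidate phrase to True (positive example) or False (negative)."""
    positives = {k.lower() for k in author_keyphrases}
    return {p: (p in positives) for p in candidate_phrases(text, max_words)}
```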
| Item Type: | Journal (Paginated) |
|---|---|
| Keywords: | machine learning, summarization, indexing, keywords, keyphrase extraction. |
| Subjects: | Computer Science > Language; Computer Science > Machine Learning; Computer Science > Statistical Models |
| ID Code: | 1797 |
| Deposited By: | Turney, Peter |
| Deposited On: | 13 Sep 2001 |
| Last Modified: | 12 Sep 2007 17:40 |

