?url_ver=Z39.88-2004&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&rft.title=Mining+the+Web+for+Lexical+Knowledge+to+Improve+Keyphrase+Extraction%3A+Learning+from+Labeled+and+Unlabeled+Data.&rft.creator=Turney%2C+Peter&rft.subject=Statistical+Models&rft.subject=Machine+Learning&rft.subject=Artificial+Intelligence&rft.description=A+journal+article+is+often+accompanied+by+a+list+of+keyphrases%2C+composed+of+about+five+to+fifteen+important+words+and+phrases+that+capture+the+article%E2%80%99s+main+topics.+Keyphrases+are+useful+for+a+variety+of+purposes%2C+including+summarizing%2C+indexing%2C+labeling%2C+categorizing%2C+clustering%2C+highlighting%2C+browsing%2C+and+searching.+The+task+of+automatic+keyphrase+extraction+is+to+select+keyphrases+from+within+the+text+of+a+given+document.+Automatic+keyphrase+extraction+makes+it+feasible+to+generate+keyphrases+for+the+huge+number+of+documents+that+do+not+have+manually+assigned+keyphrases.+Good+performance+on+this+task+has+been+obtained+by+approaching+it+as+a+supervised+learning+problem.+An+input+document+is+treated+as+a+set+of+candidate+phrases+that+must+be+classified+as+either+keyphrases+or+non-keyphrases.+To+classify+a+candidate+phrase+as+a+keyphrase%2C+the+most+important+features+(attributes)+appear+to+be+the+frequency+and+location+of+the+candidate+phrase+in+the+document.+Recent+work+has+demonstrated+that+it+is+also+useful+to+know+the+frequency+of+the+candidate+phrase+as+a+manually+assigned+keyphrase+for+other+documents+in+the+same+domain+as+the+given+document+(e.g.%2C+the+domain+of+computer+science).+Unfortunately%2C+this+keyphrase-frequency+feature+is+domain-specific+(the+learning+process+must+be+repeated+for+each+new+domain)+and+training-intensive+(good+performance+requires+a+relatively+large+number+of+training+documents+in+the+given+domain%2C+with+manually+assigned+keyphrases).+The+aim+of+the+work+described+here+is+to+remove+these+limitations.+In+this+paper%2C+I+introduce+new+features+that+are+conceptually+related+to+keyphrase-frequency+and+I+present+experiments+that+show+that+the+new+features+result+in+improved+keyphrase+extraction%2C+although+they+are+neither+domain-specific+nor+training-intensive.+The+new+features+are+generated+by+issuing+queries+to+a+Web+search+engine%2C+based+on+the+candidate+phrases+in+the+input+document.+The+feature+values+are+calculated+from+the+number+of+hits+for+the+queries+(the+number+of+matching+Web+pages).+In+essence%2C+these+new+features+are+derived+by+mining+lexical+knowledge+from+a+very+large+collection+of+unlabeled+data%2C+consisting+of+approximately+350+million+Web+pages+without+manually+assigned+keyphrases.+&rft.date=2002&rft.type=Departmental+Technical+Report&rft.type=NonPeerReviewed&rft.format=application%2Fpdf&rft.identifier=http%3A%2F%2Fcogprints.org%2F2497%2F1%2FERB-1096.pdf&rft.identifier=++Turney%2C+Peter++(2002)+Mining+the+Web+for+Lexical+Knowledge+to+Improve+Keyphrase+Extraction%3A+Learning+from+Labeled+and+Unlabeled+Data.++%5BDepartmental+Technical+Report%5D++++(Unpublished)++&rft.relation=http%3A%2F%2Fcogprints.org%2F2497%2F