creators_name: Turney, Peter
editors_name: De Raedt, Luc
editors_name: Flach, Peter
type: confpaper
datestamp: 2001-09-12
lastmod: 2011-03-11 08:54:47
metadata_visibility: show
title: Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
ispublished: pub
subjects: comp-sci-lang
subjects: comp-sci-mach-learn
subjects: comp-sci-stat-model
full_text_status: public
keywords: PMI-IR, synonyms, LSA, LSI, Latent Semantic Analysis, text mining, web mining, TOEFL, mutual information
abstract: This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).
date: 2001
date_type: published
publisher: Springer-Verlag
pagerange: 491-502
refereed: TRUE
referencetext:
1. Church, K.W., Hanks, P.: Word Association Norms, Mutual Information and Lexicography. In: Proceedings of the 27th Annual Conference of the Association of Computational Linguistics, (1989) 76-83.
2. Church, K.W., Gale, W., Hanks, P., Hindle, D.: Using Statistics in Lexical Analysis. In: Uri Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. New Jersey: Lawrence Erlbaum (1991) 115-164.
3. AltaVista, AltaVista Company, Palo Alto, California, http://www.altavista.com/.
4. Test of English as a Foreign Language (TOEFL), Educational Testing Service, Princeton, New Jersey, http://www.ets.org/.
5. Tatsuki, D.: Basic 2000 Words - Synonym Match 1. In: Interactive JavaScript Quizzes for ESL Students, http://www.aitech.ac.jp/~iteslj/quizzes/js/dt/mc-2000-01syn.html (1998).
6. Landauer, T.K., Dumais, S.T.: A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104 (1997) 211-240.
7. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41 (1990) 391-407.
8. Berry, M.W., Dumais, S.T., Letsche, T.A.: Computational Methods for Intelligent Information Access. Proceedings of Supercomputing ’95, San Diego, California, (1995).
9. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: MIT Press (1999).
10. Firth, J.R.: A Synopsis of Linguistic Theory 1930-1955. In Studies in Linguistic Analysis, pp. 1-32. Oxford: Philological Society (1957). Reprinted in F.R. Palmer (ed.), Selected Papers of J.R. Firth 1952-1959, London: Longman (1968).
11. AltaVista: AltaVista Advanced Search Cheat Sheet, AltaVista Company, Palo Alto, California, http://doc.altavista.com/adv_search/syntax.html (2001).
12. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. Cambridge, Massachusetts: MIT Press (1998). For more information: http://www.cogsci.princeton.edu/~wn/.
13. Haase, K.: Interlingual BRICO. IBM Systems Journal, 39 (2000) 589-596. For more information: http://www.framerd.org/brico/.
14. Vossen, P. (ed.): EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Dordrecht, Netherlands: Kluwer (1998). See: http://www.hum.uva.nl/~ewn/.
15. Turney, P.D.: Learning Algorithms for Keyphrase Extraction. Information Retrieval, 2 (2000) 303-336.
16. Grefenstette, G.: Finding Semantic Similarity in Raw Text: The Deese Antonyms. In: R. Goldman, P. Norvig, E. Charniak and B. Gale (eds.), Working Notes of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language. AAAI Press (1992) 61-65.
17. Schütze, H.: Word Space. In: S.J. Hanson, J.D. Cowan, and C.L. Giles (eds.), Advances in Neural Information Processing Systems 5, San Mateo California: Morgan Kaufmann (1993) 895-902.
18. Lin, D.: Automatic Retrieval and Clustering of Similar Words. In: Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics, Montreal (1998) 768-773.
19. Richardson, R., Smeaton, A., Murphy, J.: Using WordNet as a Knowledge Base for Measuring Semantic Similarity between Words. In Proceedings of AICS Conference. Trinity College, Dublin (1994).
20. Lee, J.H., Kim, M.H., Lee, Y.J.: Information Retrieval Based on Conceptual Distance in IS-A Hierarchies. Journal of Documentation, 49 (1993) 188-207.
21. Resnik, P.: Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 11 (1998) 95-130.
22. Jiang, J., Conrath, D.: Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In: Proceedings of the 10th International Conference on Research on Computational Linguistics, Taiwan, (1997).
23. Brin, S., Motwani, R., Ullman, J., Tsur, S.: Dynamic Itemset Counting and Implication Rules for Market Basket Data. In: Proceedings of the 1997 ACM-SIGMOD International Conference on the Management of Data (1997) 255-264.
24. Sullivan, D.: Search Engine Sizes. SearchEngineWatch.com, internet.com Corporation, Darien, Connecticut, http://searchenginewatch.com/reports/sizes.html (2000).
25. Papadimitriou, C.H., Raghavan, P., Tamaki, H., Vempala, S.: Latent Semantic Indexing: A Probabilistic Analysis. In: Proceedings of the Seventeenth ACM-SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Seattle, Washington (1998) 159-168.
26. Sparck Jones, K.: Comparison Between TREC2 and TREC3. In: D. Harman (ed.), The Third Text REtrieval Conference (TREC3), National Institute of Standards and Technology Special Publication 500-226, Gaithersburg, Maryland (1994) C1-C4.
27. Buckley, C., Salton, G., Allan, J., Singhal, A.: Automatic Query Expansion Using SMART: TREC 3. In: The Third Text REtrieval Conference (TREC3), D. Harman (ed.), National Institute of Standards and Technology Special Publication 500-226, Gaithersburg, Maryland (1994) 69-80.
citation: Turney, Peter (2001) Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. [Conference Paper]
document_url: http://cogprints.org/1796/1/ECML2001.ps
document_url: http://cogprints.org/1796/5/ECML2001.pdf
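
As a rough illustration of the idea summarized in the abstract, the sketch below ranks candidate synonyms by a PMI-style ratio computed from search hit counts. It assumes the simplest co-occurrence form, hits(problem AND choice) / hits(choice): because log is monotonic and p(problem) is the same for every choice, ranking by this ratio matches ranking by pointwise mutual information. The function names (`choose_synonym`, `toy_hits`), the way queries are represented, and all hit counts are hypothetical and invented for illustration; they are not taken from the paper or from any real search engine.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical hit counts, made up for this example only.
# Keys are alphabetically sorted tuples of the query terms.
TOY_COUNTS: Dict[Tuple[str, ...], int] = {
    ("big",): 1000,
    ("large",): 800,
    ("small",): 900,
    ("red",): 700,
    ("fast",): 600,
    ("big", "large"): 200,
    ("big", "small"): 150,
    ("big", "red"): 30,
    ("big", "fast"): 40,
}


def toy_hits(*terms: str) -> int:
    """Stand-in for a search engine: number of documents containing all terms."""
    return TOY_COUNTS.get(tuple(sorted(terms)), 0)


def choose_synonym(
    problem: str,
    choices: Sequence[str],
    hits: Callable[..., int],
) -> Tuple[str, Dict[str, float]]:
    """Score each choice by hits(problem AND choice) / hits(choice).

    Returns the highest-scoring choice and the full score table.
    A choice with zero hits gets a score of 0.0 to avoid division by zero.
    """
    scores: Dict[str, float] = {}
    for choice in choices:
        alone = hits(choice)
        joint = hits(problem, choice)
        scores[choice] = joint / alone if alone > 0 else 0.0
    best = max(scores, key=lambda c: scores[c])
    return best, scores


best, scores = choose_synonym("big", ["large", "small", "red", "fast"], toy_hits)
print(best, scores)
```

With a real search engine, `toy_hits` would be replaced by a function that issues a query and returns its reported hit count; that part is left out here because it depends on the particular service.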