creators_name: Manin, Dmitrii
creators_id: manin@pobox.com
type: journalp
datestamp: 2007-11-13 00:51:03
lastmod: 2011-03-11 08:57:00
metadata_visibility: show
title: Experiments on predictability of word in context and information rate in natural language
ispublished: pub
subjects: comp-sci-lang
subjects: ling-comput
full_text_status: public
keywords: Natural language, information theory, information rate, entropy, experiment, word guessing
note: Text is somewhat extended compared to the published version.
abstract: Based on data from a large-scale experiment with human subjects, we conclude that the logarithm of the probability of guessing a word in context (its unpredictability) depends linearly on the word's length. This result holds for both poetry and prose, even though with prose the subjects do not know the length of the omitted word. We hypothesize that this effect reflects a tendency of natural language toward a uniform information rate.
date: 2006-12-26
date_type: completed
publication: Journal of Information Processes
volume: 6
number: 3
publisher: Keldysh Institute of Applied Mathematics (KIAM) RAS
pagerange: 229-236
refereed: TRUE
referencetext:
\bibitem{Shan51}{Shannon~C.E. Prediction and entropy of printed English. {\it Bell System Technical Journal}, 1951, vol.~30, pp.~50--64.}
\bibitem{Shan48}{Shannon~C.E. A mathematical theory of communication. {\it Bell System Technical Journal}, 1948, vol.~27, pp.~379--423.}
\bibitem{BurLick55}{Burton~N.G., Licklider~J.C.R. Long-range constraints in the statistical structure of printed English. {\it American Journal of Psychology}, 1955, vol.~68, no.~4, pp.~650--653.}
\bibitem{Fon}{F\'onagy~I. Informationsgehalt von Wort und Laut in der Dichtung [Information content of word and sound in poetry]. In: {\it Poetics. Poetyka. Поэтика}. Warszawa: Pa\'nstwowe Wydawnictwo Naukowe, 1961, pp.~591--605.}
\bibitem{Kolm65}{Kolmogorov~A. Three approaches to the quantitative definition of information. {\it Problems Inform. Transmission}, 1965, vol.~1, pp.~1--7.}
\bibitem{Yaglom2}{Yaglom~A.M., Yaglom~I.M. {\it Probability and Information.} Dordrecht: Reidel, 1983.}
\bibitem{CK78}{Cover~T.M., King~R.C. A convergent gambling estimate of the entropy of English. {\it IEEE Transactions on Information Theory}, 1978, vol.~24, no.~4, pp.~413--421.}
\bibitem{Moradi98}{Moradi~H., Roberts~J.A., Grzymala-Busse~J.W. Entropy of English text: Experiments with humans and a machine learning system based on rough sets. {\it Inf. Sci.}, 1998, vol.~104, no.~1--2, pp.~31--47.}
\bibitem{Paisley66}{Paisley~W.J. The effects of authorship, topic structure, and time of composition on letter redundancy in English text. {\it J. Verbal Behav.}, 1966, vol.~5, pp.~28--34.}
\bibitem{BrownEtAl92}{Brown~P.F., Della~Pietra~V.J., Mercer~R.L., Della~Pietra~S.A., Lai~J.C. An estimate of an upper bound for the entropy of English. {\it Comput. Linguist.}, 1992, vol.~18, no.~1, pp.~31--40.}
\bibitem{Teahan96}{Teahan~W.J., Cleary~J.G. The entropy of English using PPM-based models. In: {\it DCC '96: Proceedings of the Conference on Data Compression}, Washington: IEEE Computer Society, 1996, pp.~53--62.}
\bibitem{LM1}{Leibov~R.G., Manin~D.Yu. An attempt at experimental poetics [tentative title]. To be published in: {\it Proc. Tartu Univ.} [in Russian]. Tartu: Tartu University Press.}
\bibitem{ChurchMercer93}{Church~K.W., Mercer~R.L. Introduction to the special issue on computational linguistics using large corpora. {\it Comput. Linguist.}, 1993, vol.~19, no.~1, pp.~1--24.}
\bibitem{SG96}{Sch\"urmann~T., Grassberger~P. Entropy estimation of symbol sequences. {\it Chaos}, 1996, vol.~6, no.~3, pp.~414--427.}
\bibitem{FreqDict}{Sharoff~S. The frequency dictionary for Russian. {\it http://www.artint.ru/projects/frqlist/frqlist-en.asp}}
\bibitem{HockJoseph}{Hock~H.H., Joseph~B.D. {\it Language History, Language Change, and Language Relationship.} Berlin--New York: Mouton de Gruyter, 1996.}
\bibitem{GenzelCharniak}{Genzel~D., Charniak~E. Entropy rate constancy in text. In: {\it Proc. 40th Annual Meeting of the ACL}, 2002, pp.~199--206.}
\bibitem{Jaeger06}{Anonymous authors (paper under review). Speakers optimize information density through syntactic reduction. 2006. To be published.}
\bibitem{AylettTurk}{Aylett~M., Turk~A. The smooth signal redundancy hypothesis: A functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech. {\it Language and Speech}, 2004, vol.~47, no.~1, pp.~31--56.}
citation: Manin, Dmitrii (2006) Experiments on predictability of word in context and information rate in natural language. [Journal (Paginated)]
document_url: http://cogprints.org/5817/1/unpred_article_e.pdf
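The abstract's central claim is a linear relation between a word's length and its unpredictability, -log P(guess). A minimal sketch of what such a fit looks like is below; the constants, the synthetic "guessing" data, and the ordinary least-squares fit are all invented for illustration and are not the paper's data or analysis method.

```python
# Sketch only: illustrates fitting unpredictability = A + B * word_length.
# A, B, the noise level, and the data are hypothetical, not from the paper.
import random

random.seed(0)

A, B = 1.0, 0.7  # hypothetical intercept and slope (bits per letter)

# Synthetic "experiment": for each word length, an unpredictability value
# (-log2 of the guess probability) that is linear in length plus noise.
lengths = [random.randint(1, 12) for _ in range(200)]
unpred = [A + B * n + random.gauss(0, 0.2) for n in lengths]

# Ordinary least-squares fit of unpredictability against word length.
n = len(lengths)
mx = sum(lengths) / n
my = sum(unpred) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(lengths, unpred))
         / sum((x - mx) ** 2 for x in lengths))
intercept = my - slope * mx

print(f"slope ~ {slope:.2f}, intercept ~ {intercept:.2f}")
```

With enough samples the recovered slope and intercept land close to the generating constants, which is the shape of evidence the abstract describes: a straight-line dependence of log guess probability on word length.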