Experiments on predictability of word in context and information rate in natural language

Manin, Dmitrii (2006) Experiments on predictability of word in context and information rate in natural language. [Journal (Paginated)]

Available under License Creative Commons Attribution Non-commercial.



Based on data from a large-scale experiment with human subjects, we conclude that the logarithm of the probability of guessing a word in context (unpredictability) depends linearly on word length. This result holds for both poetry and prose, even though in prose the subjects do not know the length of the omitted word. We hypothesize that this effect reflects a tendency of natural language to have an even information rate.

Item Type: Journal (Paginated)
Additional Information: Text is somewhat extended compared to the published version.
Keywords: Natural language, information theory, information rate, entropy, experiment, word guessing
Subjects: Computer Science > Language; Linguistics > Computational Linguistics
ID Code: 5817
Deposited By: Manin, Dmitrii
Deposited On: 13 Nov 2007 00:51
Last Modified: 11 Mar 2011 08:57


