# Derivation of Zipf's Law for random texts

As before, we begin by defining a word to be any sequence of characters separated by spaces. Let us therefore consider an alphabet of $M$ characters, interspersed with a specially-designated space character $We will consider an especially simple model (similar to that used by many others [Li92] [ Miller57] [Hill70] [Hill74] ) in which our random monkey hits all keys -- space and letters -- with equal probability$p$: p&=&\Pr(\emptyset) = \Pr(\mathtt{A}) = \cdots = \Pr(\mathtt{Z}) \nonumber \\ &=&{1}\over{M+1} We can use a LEXICOGRAPHIC TREES to conveniently organize words by the order in which the$k$characters occur prior to the terminating space, as shown in Figure (figure) This shows a set of$M+1$trees, each rooted in the words' starting character. Leaf nodes at level$k$are all labeled with the probability of the sequence of$k-1$characters prior to the space occurring at level$k$. One immediate observation is that N_{k} \), the number of words $$w_{i}$$ of length$k$or less, is: N_k &\equiv& \hbox{\it Number}(w_i\; | \; i\le k) \nonumber \\ &=& \sum_{i=1}^k M^i \nonumber \\ &=& {M(1-M^k)}\over{1-M} In an infinitely long sequence of characters generated according to Equation (FOAref) , we will expect to find a word''$w_k$terminating at level$k$(i.e., a string of$k$unbroken non-space characters bracketed by two spaces) with probability defined in terms of the independent character probabilities$p$: p_k &\equiv& \Pr(w_k) \nonumber \\ &=& p^{k+2} \nonumber \\ &=& {c\over {(M+1)^{k+2}}} We can compute$c$, the constant of proportionality, by including all the$M^k$words of length$k$, and summing these probabilities over all possible words (including unrealistic, infinitely long ones!): makes this issue disappear: } Next consider the rank of these words. Since the probability of a word's occurrence is an exponentially decreasing function of its length, we know that the M highest ranked words are exactly the one character words; next come the$M^{2}$two-letter words; and so on. Using Equation (FOAref) we therefore know how the rank$r_{k}$of all words$w_{k}$terminating on level$k$must be bounded above and below: N_{k-1} &<& r_k \leq N_{k} \nonumber \\ {\tilde r}_k &=& {{N_{k-1}+1+N_k}\over{2}} \nonumber \\ &=& (M^k - 1) {{M+1}\over {2(M-1)} } where${\tilde r}_k$denotes a compromise, average'' rank for all the$M^k$equiprobable words. Note that (FOAref) and (FOAref) define the words' probability and rank, respectively, in terms of the common metric$k$. As Li [Li92] notes, Zipf's Law is fundamentally about this transformation, from an exponential distribution onto a rank variable. Solving both equations for$k\$: k&=& -{{\ln(M\cdot p_k)}\over{\ln(M+1)}} \nonumber \\ k&=& {{\ln\left( {{2(M-1) {\tilde r}_k}\over{M+1}}+1\right)}\over{\ln M}} we can now set them equal and derive an expression for a word's probability in terms of its rank: {{\ln (M\cdot p_k)}\over{\ln(M+1)}}&=& -{{\ln\left({{2(M-1){\tilde r}_k}\over{M+1}}+1\right)} \over{\ln M}} \nonumber \\ p_k &=& {1\over M}\left({{2(M-1){\tilde r}_k}\over{M+1}}+1\right)^{{-\ln(M+1)}\over{\ln M}} This has just the functional form required by Mandelbrot's generalized Zipf's Law (cf. Equation (FOAref) ): p_k&=& {C \over{({\tilde r}_k + B)^\alpha}}}\ \mathrm{,\ where} \nonumber \\ \alpha &=& {{\ln(M+1)}\over{\ln M}} \nonumber \\ B &=& {{M+1}\over{2(M-1)}} \nonumber \\ C &=& {1\over M}\left({{M+1}\over{2(M-1)}}\right)^\alpha \end{eqnarray}

## Subsections

FOA © R. K. Belew - 00-09-21