> Thank you for your analysis. I have just a few remarks.
> >You can then relate the word length
> >distribution with the file size as:
> >File Size = SUM_k [#(k) * (k+1)] = F (1)
> > ~ mean word length * N (1.1)
> >where #(.) is the number of times the argument has appeared
> >and N is the total number of distinct words.
> >If the given relation:
> >N = 6 sqrt(F) => N^2 / 36 = F
> This relation holds between different drafts of the
> same file (study of a text during its composition).
> Another particularity is that the text measured is
> a textbook, which likely has a structure very
> different from a novel. Does your formula take these
> two considerations into account?
Sorry for the misunderstanding, but what follows may still be
relevant.
> FYI, the text has 8084 distinct words for a file size
> of 1835191 characters.
For naturally occurring text, Heaps' law says the following:
N = A F^B
where N and F are as defined above, B lies between 0 and 1,
and A is another constant. I am not sure whether A must lie
between 0 and 1 or can fall outside that range. If A can be
larger than 1, then I guess what you have is basically Heaps'
law with A = 6 and B = 1/2.
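As a quick sanity check (a sketch, assuming B = 1/2 as in the proposed relation), we can plug in the figures quoted above and see how well N = 6 sqrt(F) fits this particular text:

```python
import math

# Figures quoted earlier in the thread.
F = 1835191   # file size in characters
N = 8084      # number of distinct words

# Predicted vocabulary size under the proposed relation N = 6 * sqrt(F).
predicted = 6 * math.sqrt(F)
print(f"predicted N = {predicted:.0f}, actual N = {N}")

# Conversely, assuming Heaps' law N = A * F**B with B = 1/2,
# the prefactor implied by this text's own numbers:
A = N / math.sqrt(F)
print(f"implied A = {A:.2f}")
```

For these numbers the prediction comes out within about half a percent of the measured vocabulary, and the implied A is very close to 6, so the relation is at least consistent with this one data point. A single (N, F) pair cannot pin down both A and B, of course; that would need measurements at several file sizes.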
> Peter Van Roy
> Département d'Ingénierie Informatique
> (Department of Computing Science and Engineering)
> Université catholique de Louvain
> B-1348 Louvain-la-Neuve, Belgium
> Email: email@example.com
> Tel: (+32) (10) 47.83.74
> Web: http://www.info.ucl.ac.be/people/cvvanroy.html
> Mozart: http://www.mozart-oz.org
This archive was generated by hypermail 2b29 : Mon Oct 29 2001 - 11:30:31 MET