This is an interesting discussion about the 'representativeness'
of corpus and subcorpus. I'll add my 2 cents here. Surely,
statisticians have been concerned about getting representative
samples for some time, and mechanisms are available, though not
perfect, to address this issue. The one that I can
think of is sequential (and stratified?) sampling.
Suppose we have infinite resources! And suppose we have
a (random or otherwise) sequence of subcorpora s1, s2, ...,
sn and their associated distributions d1, d2, ..., dn, each
observed for some specific purpose. A distribution could
be over words, the number of different meanings of a word, etc.
Then, we do a sequential sampling as follows:
Let the merged distribution Di be defined recursively:
D1 := d1
Di := Di-1 + di
where + means merging two distributions. The sequential
sampling could stop once
a Chi-Square test finds Di and Di-1 not statistically
significantly different at the X% level.
There is a possibility that the sequential sampling
could never stop.
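
A minimal sketch of the procedure above, in Python. Everything
named here is an assumption made for illustration: distributions
are represented as count tables (collections.Counter), "+" is
taken to be adding counts, the test is a Pearson Chi-Square on
the 2 x k table formed by Di and Di-1, and the p-value uses the
Wilson-Hilferty normal approximation so that only the standard
library is needed. The function names and the default alpha of
0.05 (standing in for "X%") are hypothetical. Note that Di
contains Di-1, so the two are not independent samples and the
test acts as a heuristic stopping rule rather than a strict one.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def chi_square_stat(a, b):
    """Pearson chi-square statistic and degrees of freedom for the
    2 x k table whose rows are the count distributions a and b."""
    keys = [k for k in set(a) | set(b) if a[k] + b[k] > 0]
    total_a, total_b = sum(a.values()), sum(b.values())
    grand = total_a + total_b
    stat = 0.0
    for key in keys:
        column = a[key] + b[key]
        for observed, row_total in ((a[key], total_a), (b[key], total_b)):
            expected = row_total * column / grand
            stat += (observed - expected) ** 2 / expected
    return stat, len(keys) - 1


def approx_p_value(stat, df):
    """Upper-tail p-value via the Wilson-Hilferty approximation
    (an approximation, not an exact chi-square tail)."""
    if df <= 0:
        return 1.0
    z = ((stat / df) ** (1 / 3) - (1 - 2 / (9 * df))) / sqrt(2 / (9 * df))
    return 1.0 - NormalDist().cdf(z)


def sequential_sample(distributions, alpha=0.05):
    """Merge d1, d2, ... until Di and Di-1 are not significantly
    different. Returns (merged, subcorpora_used, stopped)."""
    merged = Counter()
    used = 0
    for d in distributions:
        previous = merged
        merged = previous + Counter(d)  # Di := Di-1 + di
        used += 1
        if used > 1:
            stat, df = chi_square_stat(merged, previous)
            if approx_p_value(stat, df) > alpha:
                return merged, used, True
    # Ran out of subcorpora without meeting the stopping rule.
    return merged, used, False
```

With a finite list of subcorpora the loop always ends; the
third return value says whether the stopping rule was met or
the input simply ran out, which is the finite-resources version
of sampling that never stops.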
Obviously, more sophisticated techniques could be
applied and more complicated modeling may be needed
(e.g. taking the time of sampling into account, since
language change may take place).