Re: Corpora: frequency lists for clusters & MWU

Ted E. Dunning
Tue, 6 Oct 1998 17:24:39 -0700

Frequency lists for single words are highly suspect, especially below
roughly the thousandth most common word. The utility of a frequency
list for multi-word units is even more doubtful.

That being said, I would be happy to offer up several of the most
common bigrams from a small corpus (1M words) as an illustration of
how little you are likely to learn from frequency sorting bigrams:

#S The
of the
in the
said #S
to the
#S #D
on the
for the
and the
said the
in a
at the
#S He
#S In
by the
to be
#S But
with the
of a

Here #S indicates a sentence boundary and #D a document boundary. The
only items of interest are the bigrams which include the word "said".
Their prevalence is caused by the fact that this text was from the AP
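
For readers who want to reproduce a list like the one above, here is a
minimal sketch in Python. It assumes the input is already split into
documents and sentences, and that tokens are whitespace-separated. The
function names and the exact placement of the #S and #D markers are
illustrative assumptions, not a description of how the list above was
actually produced.

```python
from collections import Counter


def marked_stream(documents):
    """Yield tokens with #S before each sentence and #S #D closing each document.

    `documents` is a list of documents, each a list of sentence strings.
    The marker placement is an assumption made for this sketch.
    """
    for sentences in documents:
        for sentence in sentences:
            yield "#S"
            yield from sentence.split()
        yield "#S"
        yield "#D"


def top_bigrams(documents, n=20):
    """Return the n most frequent adjacent token pairs with their counts."""
    tokens = list(marked_stream(documents))
    counts = Counter(zip(tokens, tokens[1:]))
    return counts.most_common(n)


if __name__ == "__main__":
    docs = [["The dog said", "He ran"], ["The cat said"]]
    for (first, second), count in top_bigrams(docs, n=5):
        print(first, second, count)
```

On any realistic corpus the top of such a list is dominated by function
words and boundary markers, which is the point made above.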

There *are* other ways to look at word cooccurrence besides frequency
sorting. I tend to like to plug my Computational Linguistics paper
(CL volume 19, number 1, pages 61-74) where I introduced a useful
statistical measure for finding interesting collocations. There are
many other measures which people use for various purposes.
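
The measure proposed in the paper cited above is generally known as the
log-likelihood ratio statistic (G^2). The following is a minimal sketch
of how it can be computed for one bigram from a 2x2 contingency table.
The function name, argument names, and table layout are assumptions
made for illustration and are not taken from the paper.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 for a 2x2 contingency table of bigram counts.

    Assumed layout for a bigram (A, B):
      k11: A followed by B        k12: A followed by not-B
      k21: not-A followed by B    k22: not-A followed by not-B
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = (
        (k11, row1, col1),
        (k12, row1, col2),
        (k21, row2, col1),
        (k22, row2, col2),
    )
    total = 0.0
    for observed, row, col in cells:
        # A zero cell contributes nothing (0 * log 0 is taken as 0).
        if observed > 0:
            total += observed * math.log(observed * n / (row * col))
    return 2.0 * total
```

Sorting bigrams by this score rather than by raw frequency tends to
push pairs such as "of the" down the list, because their counts are
close to what the frequencies of the individual words would predict.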

>>>>> "pk" == Przemyslaw KASZUBSKI <> writes:

pk> Another question: Are there frequency lists of English
pk> (lemmatised/non-lemmatised) 2-3-4-5 word clusters available
pk> anywhere, preferably retrieved from large balanced corpora? Or
pk> frequency lists of multi-word-units?