That being said, I would be happy to offer up several of the most
common bigrams from a small corpus (1M words) as an illustration of
how little you are likely to learn from frequency sorting bigrams:
Here #S indicates a sentence boundary and #D a document boundary. The
only items of interest are the bigrams which include the word "said".
Their prevalence is caused by the fact that this text was from the AP
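
For anyone who wants to try frequency sorting on their own text, here
is a minimal sketch in Python. It is not the script used for the
counts discussed above. The function names, the toy input and the
choice to emit #D before each document and #S before each sentence
are all assumptions made for illustration.

```python
from collections import Counter

def boundary_tokens(documents):
    """Yield one token stream with boundary markers inserted.

    `documents` is a list of documents, each a list of sentence strings.
    Emitting "#D" before every document and "#S" before every sentence
    is an assumed convention, not necessarily the one used above.
    """
    for doc in documents:
        yield "#D"
        for sentence in doc:
            yield "#S"
            # Naive whitespace tokenisation, lowercased.
            yield from sentence.lower().split()

def bigram_counts(documents):
    """Count adjacent token pairs over the whole stream."""
    tokens = list(boundary_tokens(documents))
    return Counter(zip(tokens, tokens[1:]))

if __name__ == "__main__":
    # Toy corpus, invented for this example.
    docs = [["He said that .", "She said no ."], ["They said that ."]]
    for (first, second), n in bigram_counts(docs).most_common(5):
        print(n, first, second)
```

On real text the top of such a list is dominated by boundary markers,
punctuation and function words, which is the point being made above.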
There *are* other ways to look at word co-occurrence besides frequency
sorting. I tend to like to plug my Computational Linguistics paper
(CL volume 19, number 1, pages 61-74) where I introduced a useful
statistical measure for finding interesting collocations. There are
many other measures which people use for various purposes.
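
As one example of what such a measure can look like, here is a sketch
of the log-likelihood ratio (G-squared) statistic computed from a 2x2
contingency table of bigram counts. It is offered as one commonly used
association measure and is not claimed to be the measure from the
paper mentioned above. The function name and the argument layout are
assumptions made for illustration.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G-squared statistic for a 2x2 contingency table.

    Assumed layout for a bigram (A, B):
      k11 = count of A followed by B
      k12 = count of A followed by something other than B
      k21 = count of something other than A followed by B
      k22 = count of all remaining bigrams
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total
```

A table whose cells match what independence predicts scores zero, and
the score grows as the observed counts depart from that expectation,
so sorting bigrams by this value rather than by raw frequency tends to
push boundary and function-word pairs down the list.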
>>>>> "pk" == Przemyslaw KASZUBSKI <firstname.lastname@example.org> writes:
pk> Another question: Are there frequency lists of English
pk> (lemmatised/non-lemmatised) 2-3-4-5 word clusters available
pk> anywhere, preferably retrieved from large balanced corpora? Or
pk> frequency lists of multi-word-units?