Dear Dr Gomez
Cobuild used corpus lemma frequencies in their Dictionary (2nd edition,
1995). We devised a 5-band distinction, with 700 lemmas in the most
frequent band, 1200 in the 2nd band, 1500 in the 3rd band, 3200 in the
4th and 8100 in the 5th. I can't remember the exact frequency cut-offs
used, but I'm confident that most users of the dictionary have found it
a very useful addition.
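
As a rough illustration of the banding above, here is a minimal Python sketch. It assumes that bands are assigned purely by frequency rank using the band sizes quoted, which is a simplification, since the actual frequency cut-offs are not given here. The names BAND_SIZES, band_for_rank and assign_bands are hypothetical and not part of any Cobuild tool.

```python
from itertools import accumulate

# Band sizes as quoted in the message (lemmas per band, most frequent first).
BAND_SIZES = [700, 1200, 1500, 3200, 8100]
# Cumulative rank limits per band: [700, 1900, 3400, 6600, 14700].
BAND_LIMITS = list(accumulate(BAND_SIZES))


def band_for_rank(rank):
    """Return band 1-5 for a 1-based frequency rank, or None if unbanded."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    for band, limit in enumerate(BAND_LIMITS, start=1):
        if rank <= limit:
            return band
    return None  # rank falls beyond the fifth band


def assign_bands(lemma_freqs):
    """Map each lemma to a band, given a dict of lemma -> corpus frequency.

    Assumption: rank order alone decides the band. Ties keep the dict's
    insertion order, because Python's sort is stable.
    """
    ranked = sorted(lemma_freqs, key=lemma_freqs.get, reverse=True)
    return {lemma: band_for_rank(i) for i, lemma in enumerate(ranked, start=1)}
```

On these figures the five bands together cover 14,700 lemmas, and any lemma ranked lower than that is left unbanded in this sketch.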
The exact cut-off points might be affected by the size of the corpus, and
may also be language dependent (in a highly inflected language like Spanish,
the relationships between some types and lemmas might differ from those in
a relatively uninflected language like English). Also the purpose of
your classification may affect your decisions. For a dictionary, lemma is
presumably more important than type, although type distribution within a
lemma may influence whether a form is treated under the main lemma form
or is given separate headword status. For example, "situated" in an English
dictionary may be a separate headword, as well as being an inflected form
under the headword "situate"; similarly "painting" and "paint". Word-class
shifts would also have to be taken into account.
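
To make the point about type distribution within a lemma concrete, here is a small Python sketch. It computes each form's share of its lemma's total frequency and lists forms that exceed a threshold. The 0.5 threshold, the function names and the counts in the usage note are all invented for illustration; they are not Cobuild's criteria or real corpus figures, and word-class shifts are not handled.

```python
def form_shares(form_counts):
    """Return each form's share of its lemma's total frequency."""
    total = sum(form_counts.values())
    if total == 0:
        return {form: 0.0 for form in form_counts}
    return {form: count / total for form, count in form_counts.items()}


def headword_candidates(form_counts, base_form, min_share=0.5):
    """List non-base forms whose share reaches min_share.

    min_share is a hypothetical threshold, chosen only for illustration.
    """
    shares = form_shares(form_counts)
    return [
        form
        for form, share in shares.items()
        if form != base_form and share >= min_share
    ]
```

With made-up counts such as {"situate": 10, "situates": 5, "situated": 80, "situating": 5}, the form "situated" accounts for most of the lemma's occurrences and would be flagged as a possible separate headword.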
Hope this helps.
Honorary Research Fellow
Corpus Research Group
University of Birmingham