RE: Corpora: lemma vs lexeme

Leidner, Jochen
Thu, 4 Nov 1999 11:13:43 +0100


some 'working definitions' (mind the ambiguity :-)...

head-word       first word in a dictionary/lexicon entry

lexeme          symbol sequence (string literal) of the head-word
                in a dictionary/lexicon entry

lemma           set of all forms subsumed under a dictionary/lexicon entry

lemmatization   mapping of a word-form to the lexeme of the lemma
                the word-form belongs to

lexicon         in computational linguistics: computer-readable form
                of a ->dictionary (i.e. a linguistic resource)
                in real life: a list of concepts in the world with
                explanations ordered by their names (i.e. a resource
                of world knowledge)

dictionary      in computational linguistics: human-readable form of
                a set of lemmata with annotations ordered alphabetically
                or phonetically by lexemes

word-form       sequence of characters that belongs to the language under
                consideration (warning: in formal language theory, this
                is called 'word')
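
To make the definitions above concrete, here is a minimal Python sketch.
It assumes a toy lexicon that maps each lexeme (the head-word string) to
its lemma (the set of forms under that entry). The names LEXICON,
build_index and lemmatize, and the example entries, are purely
illustrative. Because a word-form can belong to more than one lemma,
lemmatize returns a set of lexemes rather than a single one.

```python
# Toy lexicon (hypothetical data): lexeme -> lemma, i.e. the head-word
# string mapped to the set of all forms subsumed under that entry.
LEXICON = {
    "go": {"go", "goes", "going", "went", "gone"},
    "see": {"see", "sees", "seeing", "saw", "seen"},
    "saw": {"saw", "saws", "sawing", "sawed", "sawn"},
}


def build_index(lexicon):
    """Invert the lexicon: word-form -> set of lexemes it may belong to."""
    index = {}
    for lexeme, lemma in lexicon.items():
        for form in lemma:
            index.setdefault(form, set()).add(lexeme)
    return index


def lemmatize(word_form, index):
    """Map a word-form to the lexeme(s) of the lemma(ta) it belongs to.

    Returns an empty set for forms that are not in the lexicon.
    """
    return index.get(word_form.lower(), set())


INDEX = build_index(LEXICON)
print(lemmatize("went", INDEX))  # one candidate lexeme
print(lemmatize("saw", INDEX))   # ambiguous: two candidate lexemes
```

A real lemmatizer would need context (e.g. part-of-speech tags) to choose
among several candidates; this sketch only does the lookup.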

Canonization (a generalization of lemmatization sometimes used in IR)
means mapping a string to a representative of the class the string
belongs to (e.g. according to phonetic similarity, as in Russell & Odell's
SOUNDEX algorithm).
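
As an illustration of canonization, here is a small Python sketch of the
commonly described American Soundex variant. It is not necessarily
identical to the original Russell & Odell scheme. It assumes ASCII names:
any letter without a code is treated like a vowel, and 'h' and 'w' do not
break a run of equal codes. The function name soundex is my own choice.

```python
# Letter groups and their Soundex digits; vowels, h, w and y carry no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex key of word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")  # the first letter's code also counts
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # an uncoded letter (a vowel) resets the run
    return (key + "000")[:4]  # pad with zeros, truncate to four characters


print(soundex("Robert"), soundex("Rupert"))  # both map to the same key
```

Each key serves as the representative of its class: every string with the
same key is treated as equivalent for retrieval purposes.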

Usage in linguistics itself is highly problematic: few people use internally
consistent terms, and there is little consensus across sub-communities,
which is why most people begin by defining their own usage in the initial
chapters of their works.

> On Wed, 3 Nov 1999, Przemyslaw Kaszubski wrote:
> > Can anyone enlighten me definitively (or refer me to a source) on the
> > distinction between lemma and lexeme?

Jochen Leidner, M.A.
Software Engineer
Knowledge Warehouse                       -- All views expressed are my own.
SAP AG, Walldorf, Germany.     phone +49 (6227) 7-63773 fax +49 6227 7-73773