Another approach to this problem would be to define orthographic and
phonological similarity measures and then compute pairwise orthographic
and phonological similarities for the words in your lexicon. Clustering
the resulting similarity matrix then yields groups of orthographically
and phonologically similar words. For an orthographic similarity
measure, consider the proportion of shared letter n-grams, or edit
distance. For phonological similarity, the same approach works, except
that you operate on phoneme strings rather than letter strings.
I have done something quite like this in my work on drug name confusability.
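
To make the idea concrete, here is a rough sketch in Python of the two
measures mentioned above plus a very simple clustering step. The details
are assumptions chosen for illustration, not a description of any
particular system: "proportion of common n-grams" is taken to be the
Dice coefficient over bigram sets, edit distance is plain Levenshtein
distance scaled by the longer string, and the clustering is single-link
grouping at a fixed threshold. All function names are hypothetical.

```python
from itertools import combinations

def ngrams(seq, n=2):
    """Set of n-grams of a sequence (a string, or a tuple of phonemes)."""
    seq = tuple(seq)
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Dice coefficient over n-gram sets: 2*|A & B| / (|A| + |B|)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        # Both sequences are shorter than n: fall back to exact match.
        return 1.0 if tuple(a) == tuple(b) else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    a, b = tuple(a), tuple(b)
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    """Edit distance rescaled to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest

def cluster(items, sim, threshold):
    """Single-link clustering: link any pair with sim >= threshold."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if sim(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())

if __name__ == "__main__":
    words = ["cat", "cap", "dog", "dot"]
    print(cluster(words, edit_similarity, 0.6))
```

For the phonological version, pass tuples of phoneme symbols instead of
strings, e.g. edit_distance(("k", "ae", "t"), ("k", "ae", "p")); the
functions only need a sequence of comparable, hashable symbols. For a
real lexicon you would probably want a proper hierarchical clustering
routine instead of the fixed-threshold grouping shown here.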
At 01:34 PM 1/25/02 +0000, Andrew Harley wrote:
> > COMLEX, which has subcategorization classes for various words, as well
> > as pronunciations, can be used to build a network based on syntactic
> > behavior or pronunciation. The ERG lexicon at the Lingo project at
> > CSLI also has detailed subcategorization information, although for
> > fewer words.
> > Beth Levin's work on verb class alternations is based on the idea of
> > using syntactic properties to build a semantic net. Timothy Baldwin
> > and I have done some work on clustering Japanese verbs according
> > to their subcategorization patterns; you can find the relevant papers
> > on my home page.
> > I hope this is of some help,
>There are a number of other such lexical resources, e.g. the Cambridge
>International Dictionary of English (with full subcategorisation and
>selectional preference information for all senses) and the English
>Pronouncing Dictionary (for phonetic transcriptions). See
>http://uk.cambridge.org/elt/reference/data.htm for more information.