I don't know of a program like this, but I would be interested to hear if
you find one (or if anyone knows of one). I am trying to do things like this
in my work too.
> This leads me to a further question. There is a wonderful program called
> agrep which performs like grep, but also does approximate string
> However, the number and type of errors allowed in the input string is
> limited. For example, one cannot find "equipment" if given as input
> "equipping", or "widely used", if giving the input "used widely".
> Is there any program which can do this ? More specifically, is there
> a way of doing this efficiently ? (I would be surprised, but I still
> ask ...).
I looked around fairly widely for string matching algorithms last year with
a mind to applying them to language. I also looked at agrep. One thing I
found which you might not have come across was the sizable history of
_biological_ sequence matching research. There is a big biological project
to map the human genome: DNA sequences. Many of the problems they face are
exactly the kind of thing we face searching for language sequences in
corpora. I found some great stuff to search for matching sub-sequences of
unknown length in two strings. You can feed in a search string and get back
any sub-strings which match at any point in a data-base, without specifying
which sub-string to search on, or the number of errors.
There was an implementation available free for research, called seqaln, I
think (sequence align?). Do an Internet search on "seqaln" and you should
If you are interested write to me and I could try to dig out the other
references I had last year.