I have some questions on string matching, arising from both a practical
and a theoretical interest. Do there exist programs or algorithms which
allow the retrieval of strings from files (corpora or word lists in my
case) without working through them sequentially? That is, is there a way
of encoding files in such a way that retrieval speed does not differ
drastically between small and large texts (say, between a 100 kilobyte
and a 10 megabyte corpus)? This difference exists, for instance, for the
well-known grep family, whose running time is linear in the size of the
text, O(n).
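To illustrate the kind of thing I have in mind, here is a rough Python
sketch of my own (toy code, not any existing tool): if the corpus is
preprocessed into a suffix array, each lookup needs only O(m log n)
comparisons for a pattern of length m, so query time hardly depends on
whether the file is 100 kilobytes or 10 megabytes.

```python
# Toy suffix-array index: build once, then answer substring queries
# by binary search instead of scanning the whole text like grep.
def build_suffix_array(text):
    # O(n^2 log n) toy construction; real implementations achieve
    # O(n log n) or O(n), but the query side is the point here.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    # Binary search for the contiguous block of suffixes that start
    # with `pattern`; return their starting positions in the text.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # lower bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                       # upper bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

The build step is paid once per file; after that, every query touches
only O(log n) suffixes.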
Of course, there is a whole range of programs that store texts or other
kinds of sequences as a database, by indexing on one or another element
(sentences, phrases, ...). This allows very fast retrieval. What
interests me, however, is whether there are programs which in some way
store information on the substrings of a text (e.g. all substrings up to
a length of 10), and which can possibly give the context of those strings.
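Again a toy sketch of my own, just to make the idea concrete: store
every substring up to some maximum length in a dictionary together with
its positions, so that retrieval (with context) becomes a single lookup
plus a slice. The space cost is of course considerable, O(n * K) for
maximum substring length K.

```python
from collections import defaultdict

def build_ngram_index(text, max_len=10):
    # Map every substring of length 1..max_len to its start positions.
    index = defaultdict(list)
    for i in range(len(text)):
        for k in range(1, min(max_len, len(text) - i) + 1):
            index[text[i:i + k]].append(i)
    return index

def lookup_with_context(text, index, substring, window=5):
    # Return each occurrence with `window` characters of context
    # on either side, recovered by slicing the original text.
    return [text[max(0, i - window):i + len(substring) + window]
            for i in index.get(substring, [])]
```

Whether any existing program trades space for time in this way (perhaps
with a more compact structure than a plain dictionary) is exactly my
question.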
This leads me to a further question. There is a wonderful program called
agrep which performs like grep, but also does approximate string matching.
However, the number and type of errors allowed in the input string is
limited. For example, one cannot find "equipment" given the input
"equipping", or "widely used" given the input "used widely". Is there
any program which can do this? More specifically, is there a way of
doing this efficiently? (I would be surprised, but I still ask.)
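For the word-order case, what I imagine (again my own sketch, not a
claim about how agrep works) is an edit distance computed over words
rather than characters, with transpositions counted as a single error:
then "used widely" sits at distance 1 from "widely used", whereas a
character-level measure counts many edits. The "equipment"/"equipping"
case would additionally need some morphological normalization, which
this sketch does not attempt.

```python
def word_edit_distance(a, b):
    # Damerau-Levenshtein distance over word tokens: insertions,
    # deletions, substitutions, and adjacent-word swaps cost 1 each.
    a, b = a.split(), b.split()
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete a word
                          d[i][j - 1] + 1,        # insert a word
                          d[i - 1][j - 1] + cost) # substitute a word
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap
    return d[len(a)][len(b)]
```

The open question, for me, is whether such a measure can be combined
with an index so that matching stays fast on large corpora.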
Thanks for any help,
--
LANT nv/sa, Research Park Haasrode, Interleuvenlaan 21, B-3001 Leuven
mailto:Tom.Vanallemeersch@lant.be          http://www.lant.be/
Phone: ++32 16 405140                      Fax: ++32 16 404961