This sounds like an application for a Patricia tree (I think this is the
name). A good book on algorithms should describe this data structure.
Another very simple method that I have found useful is the following.
Think of the corpus as being one long vector, so that each character position
described by an integer. Run through your corpus, constructing another
vector which records the starting positions of all of the substrings you
want to index on (e.g., all of the starting positions of words). Regard these
as pointers into your original corpus. Sort this vector of indices by the
suffix string beginning at the position in the corpus that each index points
This sorted index now gives an alphabetic listing of suffix strings in
the corpus. It is a trivial matter to search for any substring of any
length with this sorted index using a binary search.
Hope this helps,