Corpora: Using a relational database to store conc pointers

From: Mickel Grönroos (
Date: Thu Mar 30 2000 - 09:37:39 MET DST

    Dear colleagues,

    Does anybody have any experience of using a relational database to store
    index information for a concordance service?

    I'm building a test interface for the Bank of Finnish and plan to store
    pointers to specific locations in the corpus in a database column, e.g.
    something like 344:2555 would point to corpus file number 344, byte
    position 2555.
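
    To make the pointer format concrete, here is a minimal Python sketch of
    how such a pointer might be parsed and used. The names Pointer,
    parse_pointer and read_context are hypothetical, as is the idea of
    reading a fixed-width window of bytes around the hit; how file numbers
    map to actual corpus files is left open.

```python
from typing import NamedTuple


class Pointer(NamedTuple):
    """A corpus location: file number plus byte offset (hypothetical type)."""
    file_no: int
    byte_pos: int


def parse_pointer(text: str) -> Pointer:
    """Parse a string such as '344:2555' into (file_no, byte_pos)."""
    file_part, pos_part = text.split(":")
    return Pointer(int(file_part), int(pos_part))


def read_context(path: str, byte_pos: int, width: int = 20) -> bytes:
    """Return up to `width` bytes on either side of byte_pos in the file.

    Bytes are returned undecoded, because a byte offset may fall inside a
    multi-byte character depending on the corpus encoding.
    """
    start = max(0, byte_pos - width)
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(byte_pos - start + width)
```

    Storing the two numbers as separate integers rather than as one string
    would also let the database sort and filter on them directly.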

    The obvious problem is how one should handle common words, as every
    occurrence of a specific type needs a pointer of its own. So, if the
    frequency of some common word is, say, 50,000, this would generate 50,000
    pointers as well. Putting all of these in a single field of a column
    seems rather foolish. Does anybody know how to avoid this?
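
    One conventional relational layout, offered only as a sketch and not as
    a tested answer, is to store one row per occurrence instead of one field
    per word type, so that a word with 50,000 occurrences simply has 50,000
    short rows behind an index. The example below uses SQLite through
    Python's standard sqlite3 module; the table name occurrence, its column
    names, and the functions build_index and lookup are all assumptions
    made for illustration.

```python
import sqlite3


def build_index(conn, occurrences):
    """Create a one-row-per-occurrence table and index it by word.

    `occurrences` is an iterable of (word, file_no, byte_pos) tuples.
    """
    conn.execute(
        "CREATE TABLE occurrence ("
        " word TEXT NOT NULL,"
        " file_no INTEGER NOT NULL,"
        " byte_pos INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO occurrence VALUES (?, ?, ?)", occurrences)
    # Composite index: finds a word quickly and yields hits in corpus order.
    conn.execute(
        "CREATE INDEX occurrence_word ON occurrence (word, file_no, byte_pos)"
    )
    conn.commit()


def lookup(conn, word, limit=100, offset=0):
    """Return one page of (file_no, byte_pos) pointers for `word`."""
    cursor = conn.execute(
        "SELECT file_no, byte_pos FROM occurrence"
        " WHERE word = ?"
        " ORDER BY file_no, byte_pos"
        " LIMIT ? OFFSET ?",
        (word, limit, offset),
    )
    return cursor.fetchall()
```

    With LIMIT and OFFSET the interface can fetch a page of hits at a time,
    so a very frequent word never has to be read out in full. Whether this
    scales acceptably to a corpus of this size would need to be measured.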

    All comments are welcome.


    Mickel Grönroos
    Helsinki |
    Inst. för allmän språkvetenskap | Dep. of General Linguistics
    PB 4 (Fabiansgatan 28) | tfn/phone +358-9-191 22707
    FI-00014 Helsingfors universitet | fax +358-9-191 23598

