> I'm in the process of compiling a corpus of modern Russian comparable to the
> BNC in its size and coverage.
> Currently for searching the corpus I use custom tools written in Perl and
> based on regular expressions. As the corpus gets larger (currently 40
> million words), the indexing scheme gets totally inefficient and I'm
> reluctant to reinvent the wheel by improving it.
I have a 100 million word corpus of Spanish (www.corpusdelespanol.org) that is
annotated (POS, lemma, synonyms, etc.) and is fairly fast. Even a query like
[<le> or <les> "3p IndObj" + any form of any synonym of <querer> "to want" +
infinitive, e.g. <le prefiero decir, les querían saludar>] takes only about two
or three seconds.
I use relational databases in SQL Server 7.0 to achieve these results. The main
database is composed of tens of millions of distinct n-grams with their
associated frequencies in several sub-corpora. These are linked to other
databases containing POS, lemma, and synonym info. The output from the
n-gram/frequency tables is then used to search the actual, unannotated textual
corpus itself, which is indexed only with SQL Server Full-Text Indexing.
Anyway, because all of the tables have clustered indices, you get pretty good
performance. The one caveat is that my approach works best with
morphologically more complex languages like Spanish, and it would have to be
modified for a language like English.
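To make the layout above a bit more concrete, here is a rough sketch of the
idea. It is not my actual schema: it uses Python's built-in sqlite3 module
rather than SQL Server, keeps everything in one in-memory database, and every
table name, column name, POS tag, and frequency below is invented for
illustration. SQLite's WITHOUT ROWID tables stand in for clustered indices,
since they store rows in primary-key order.

```python
import sqlite3

# Hypothetical schema: one n-gram table plus lookup tables for annotation.
SCHEMA = """
CREATE TABLE lexicon (
    form  TEXT NOT NULL,
    lemma TEXT NOT NULL,
    pos   TEXT NOT NULL,
    PRIMARY KEY (form, lemma, pos)
) WITHOUT ROWID;

CREATE TABLE synonyms (
    lemma   TEXT NOT NULL,
    synonym TEXT NOT NULL,
    PRIMARY KEY (lemma, synonym)
) WITHOUT ROWID;

-- Rows are stored in (w1, w2, w3) order, so a lookup that fixes w1
-- only has to scan a narrow range of the table.
CREATE TABLE trigrams (
    w1   TEXT NOT NULL,
    w2   TEXT NOT NULL,
    w3   TEXT NOT NULL,
    freq INTEGER NOT NULL,
    PRIMARY KEY (w1, w2, w3)
) WITHOUT ROWID;
"""


def build_demo_db():
    """Create an in-memory database holding a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", [
        ("le", "le", "PRON"), ("les", "le", "PRON"),
        ("prefiero", "preferir", "V_FIN"),
        ("querían", "querer", "V_FIN"), ("quiero", "querer", "V_FIN"),
        ("decir", "decir", "V_INF"), ("saludar", "saludar", "V_INF"),
        ("mucho", "mucho", "ADV"),
    ])
    conn.executemany("INSERT INTO synonyms VALUES (?, ?)", [
        ("querer", "preferir"), ("querer", "desear"),
    ])
    # Frequencies are invented toy values, not real corpus counts.
    conn.executemany("INSERT INTO trigrams VALUES (?, ?, ?, ?)", [
        ("le", "prefiero", "decir", 12),
        ("les", "querían", "saludar", 7),
        ("le", "quiero", "mucho", 30),
    ])
    return conn


def find_clitic_verb_infinitive(conn, lemma, clitics=("le", "les")):
    """Trigrams of the shape: clitic + form of lemma or a synonym + infinitive."""
    marks = ",".join("?" * len(clitics))
    sql = f"""
        SELECT t.w1, t.w2, t.w3, t.freq
        FROM trigrams AS t
        JOIN lexicon AS v ON v.form = t.w2
        JOIN lexicon AS i ON i.form = t.w3
        WHERE t.w1 IN ({marks})
          AND i.pos = 'V_INF'
          AND (v.lemma = ?
               OR v.lemma IN (SELECT synonym FROM synonyms WHERE lemma = ?))
        ORDER BY t.freq DESC
    """
    return conn.execute(sql, [*clitics, lemma, lemma]).fetchall()
```

Calling find_clitic_verb_infinitive(build_demo_db(), "querer") on the toy data
returns the rows for "le prefiero decir" and "les querían saludar" and skips
"le quiero mucho", because its third word is not an infinitive. In the setup
described above, the matching word strings would then be handed to the
full-text index to pull the actual examples out of the text.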
> What is the technology used in the BNC and other annotated corpora of
> similar size?
This is a question that I've asked myself many times. I would love to see a
book that discussed the approach taken by the BNC, the BoE, CREA, corpora based
on the IMS Corpus Workbench (such as O Público), etc., to "look under the hood"
and see how each of these corpora and indexing schemes is organized. As you
mentioned, as more and more people start creating 100+ million word corpora, it
would be a shame if they all ended up having to re-invent the wheel.
Illinois State University
This archive was generated by hypermail 2b29 : Tue Nov 19 2002 - 13:36:11 MET