If you have a valid XML-encoded corpus (and, basically, if you want to
check that it is valid XML), regexes are not the best tool: you could
consider using a parser, and for efficiency a C parser. This lets you
keep Perl as your main language, since Perl wrappers exist for the best
C libraries: XML::LibXML (a wrapper for the libxml2 library, which now
seems to be a true SAX parser, i.e. it does not buffer the whole
string), and XML::SAX::Expat (which moves James Clark's Expat library
into the SAX2 idiom). Both are available on CPAN.
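As a rough sketch of the "parse, don't grep" idea, here is a minimal
well-formedness check written in Python's standard xml.sax module, which
drives the same C Expat parser that XML::SAX::Expat wraps (the function
name is my own, and the Perl version would follow the same pattern):

```python
import xml.sax


def is_well_formed(path):
    # Stream the file through the C Expat parser with a do-nothing
    # ContentHandler: every event is discarded, so memory use stays
    # flat no matter how large the corpus file is.
    try:
        xml.sax.parse(path, xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False
```

Because nothing is buffered, this scales to corpora far larger than RAM,
which a regex-based validity check cannot guarantee.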
If you use XSLT/XPath, which is the best way to get a powerful (and
standard) query language without reinventing the wheel, you could
consider using Splitter and Merger SAX handlers: split your document
into medium-sized units (like <text> in TEI), buffer each chunk, and
process it with an XSLT processor. This is easy with XML::LibXML and
XML::LibXSLT; see XML::Filter::XSLT on CPAN for an example of an XSLT
filter implemented as a SAX handler.
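The split-and-buffer step can be sketched the same way. Python's
standard library has no XSLT processor, so this only shows the Splitter
half: each completed <text> unit is serialized and yielded as an
independent chunk, which is the point where XML::LibXSLT would take over
in the Perl pipeline (the function name and the default tag are mine):

```python
import xml.etree.ElementTree as ET


def split_units(path, tag="text"):
    # iterparse streams the file through the C Expat parser; on each
    # "end" event for the chosen tag we serialize the completed unit,
    # then clear its subtree so only one unit is held in memory at a
    # time -- the whole document tree is never built.
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == tag:
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()
```

Each yielded chunk is a small, self-contained document, so an XSLT or
XPath query can be run against it with bounded memory.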
Another solution to consider is storing your TEI-XML document in a
native XML database. Sleepycat's Berkeley DB XML (a new alpha has just
been released) is no doubt helpful here, as it lets you run XPath
queries over very large corpora. But I wonder (without having tested
further) whether the size of the index needed for a deeply annotated
corpus wouldn't simply replace the memory-consumption problem of the
buffering (XPath, XSLT) approach.
Berkeley DB XML: http://www.sleepycat.com/xml/index.html
Please let me know your choice.
----- Original Message -----
From: "Serge Sharoff" <email@example.com>
Sent: Tuesday, November 19, 2002 12:03 PM
Subject: [Corpora-List] On tools for indexing and searching large
> Dear all,
> I'm in the process of compiling a corpus of modern Russian
comparable to the
> BNC in its size and coverage. The format of the corpus is based on
> <s id="nashi.535">
> <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
> <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
> <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
> in the first case, the POS tagger detects, and cannot resolve, an
> ambiguity between two possible readings (masc, animate, i.e. the head of, and
> fem, inanimate, i.e. the chapter of), so both analyses are left.
> Currently for searching the corpus I use custom tools written in
> based on regular expressions. As the corpus gets larger (currently
> million words), the indexing scheme gets totally inefficient and I'm
> reluctant to reinvent the wheel by improving it.
> What is the technology used in the BNC and other annotated corpora of
> similar size? Can it be applied in this case (given the need to cope with
> possible ambiguity)? The corpus uses Win-1251 encoding, but I
> plan to convert it to Unicode. Any suggestions?
This archive was generated by hypermail 2b29 : Wed Nov 20 2002 - 18:49:32 MET