> Dear Serge,
> If you have a valid XML-encoded corpus (and, basically, if you want to
> check if it is valid XML), regexes are not the best tool: you could
Regexes have always been expressive for linguistic queries.
The search speed depends on the index implementation.
You may have a word-based index and may increase the regexp search speed
you would like to find the construction: " word1 .+ word2 ",
so the query should be: echo "word1 & word2" | mgquery | grep -i " word1 .+
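
A rough sketch of the same two-step idea in Python, for illustration only: a word-based index first narrows the search to sentences containing both words, and the regexp is then run only on those candidates. The in-memory dictionary stands in for a real index such as the one queried with mgquery above; the names build_index and search and the sample sentences are made up for this example.

```python
import re
from collections import defaultdict


def build_index(sentences):
    """Map each lowercased word to the set of sentence numbers containing it."""
    index = defaultdict(set)
    for number, sentence in enumerate(sentences):
        for word in re.findall(r"\w+", sentence.lower()):
            index[word].add(number)
    return index


def search(sentences, index, word1, word2):
    """Find sentences matching 'word1 .+ word2'.

    Step 1: boolean AND on the word index (cheap set intersection).
    Step 2: regexp over the surviving candidates only.
    """
    candidates = index.get(word1.lower(), set()) & index.get(word2.lower(), set())
    pattern = re.compile(
        r"\b" + re.escape(word1) + r"\b.+\b" + re.escape(word2) + r"\b",
        re.IGNORECASE,
    )
    return [sentences[n] for n in sorted(candidates) if pattern.search(sentences[n])]


# Hypothetical sample data.
sentences = [
    "The head of the city spoke first.",
    "The city has a new head.",
    "A chapter of the book is missing.",
]
index = build_index(sentences)
hits = search(sentences, index, "head", "city")
```

Here both of the first two sentences pass the index step, but only the first one has the words in the requested order, so the regexp is what decides. The cost of the regexp step then depends on the number of candidates, not on the corpus size.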
> Berkeley DB XML: http://www.sleepycat.com/xml/index.html
> Please let me know your choice.
> Sylvain Loiseau
> ----- Original Message -----
> From: "Serge Sharoff" <firstname.lastname@example.org>
> To: <email@example.com>
> Sent: Tuesday, November 19, 2002 12:03 PM
> Subject: [Corpora-List] On tools for indexing and searching large
> > Dear all,
> > I'm in the process of compiling a corpus of modern Russian comparable to the
> > BNC in its size and coverage. The format of the corpus is based on TEI, for
> > instance,
> > <s id="nashi.535">
> > ...
> > <w>глава
> > <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
> > <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
> > </w>
> > <w>Владивостока
> > <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
> > </w>
> > ...
> > </s>
> > in the first case, the POS tagger detects and cannot resolve an ambiguity
> > between two possible readings (masc, animate, i.e. the head of, and
> > inanimate, i.e. the chapter of), so both analyses are left.
> > Currently for searching the corpus I use custom tools written in Perl and
> > based on regular expressions. As the corpus gets larger (currently
> > million words), the indexing scheme gets totally inefficient and I'm
> > reluctant to reinvent the wheel by improving it.
> > What is the technology used in the BNC and other annotated corpora of
> > similar size? Can it be applied in this case (given the need to cope with
> > possible ambiguity)? The corpus uses Win-1251 encoding, but eventually I
> > plan to convert it to Unicode. Any suggestions?
> > Best,
> > Serge
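
On the ambiguity question, one possible approach (a sketch under my own assumptions, not a description of how the BNC tools work) is to index a word under every reading the tagger left, so that an ambiguous token is found by a query for either analysis. The Python below parses the quoted sample with the standard xml.etree.ElementTree module; the parts elided with "..." in the sample are left out, and the function name index_sentence and the (lemma, pos, feature) key layout are invented for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The sample sentence from the quoted message, without the elided parts.
SAMPLE = """<s id="nashi.535">
<w>глава
<ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
<ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
</w>
<w>Владивостока
<ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
</w>
</s>"""


def index_sentence(xml_text):
    """Return (postings, ambiguous_forms) for one <s> element.

    postings maps (lemma, pos, feature) to a set of (sentence id, word position);
    a word with several <ana> readings is posted under each of them.
    """
    sentence = ET.fromstring(xml_text)
    sentence_id = sentence.get("id")
    postings = defaultdict(set)
    ambiguous_forms = []
    for position, word in enumerate(sentence.iter("w")):
        form = (word.text or "").strip()
        readings = word.findall("ana")
        for ana in readings:
            for feature in ana.get("feats", "").split(","):
                key = (ana.get("lemma"), ana.get("pos"), feature)
                postings[key].add((sentence_id, position))
        if len(readings) > 1:
            ambiguous_forms.append(form)
    return postings, ambiguous_forms


postings, ambiguous_forms = index_sentence(SAMPLE)
```

With this layout the first word is reachable through the features of both of its analyses, and the list of ambiguous forms can be kept if unresolved cases need to be reviewed later.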