> Meta languages are ideal for interchange purposes but I doubt
> that ANY software will handle SGML data describing 100 million
> annotated word forms efficiently. But that's another story.
On what grounds do you make this assertion? I suppose it all
depends what you mean by "handle efficiently", but it's simply not
true that NO software can handle SGML data on that scale.
Perhaps he should have written "raw SGML data", in which case I will
absolutely second that opinion. All XML encodings that I have seen so
far waste more space (in terms of characters) on markup than on the
actual data. An XML-encoded version of a 100 million word corpus (with
PoS and lemma annotations) will usually take up several gigabytes of
disk space.
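To make the overhead concrete, here is a small sketch comparing one XML-encoded token with the same token in a one-line "vertical" format. The element and attribute names are invented for illustration; real encodings (TEI, XCES, etc.) differ in detail but show the same pattern of markup outweighing data:

```python
# Hypothetical XML encoding of a single annotated token; the tag and
# attribute names here are made up for the example.
xml_token = '<w pos="NN" lemma="corpus">corpora</w>'
vertical_token = "corpora\tNN\tcorpus"  # tab-separated "vertical" format

# Characters of actual data: word form + PoS tag + lemma
data_chars = len("corpora") + len("NN") + len("corpus")

print(len(xml_token), len(vertical_token), data_chars)
# In the XML version, markup characters outnumber the data characters;
# multiplied by 100 million tokens, that difference alone is gigabytes.
```

Even this modest example more than doubles the size of the vertical representation, and a fuller encoding (sentence and document elements, IDs, namespaces) widens the gap further.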
Of course, the corpus size can be drastically reduced with standard
compression algorithms (gzip or bzip2), but the compressed corpus
cannot be accessed efficiently.
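The access problem can be demonstrated directly: gzip shrinks markup-heavy text dramatically, but its stream format has no random access, so reaching the middle of the corpus still means decompressing everything before it. A minimal sketch (the repeated line is just synthetic corpus data):

```python
import gzip
import io

# Synthetic markup-heavy "corpus": the same annotated token repeated.
line = '<w pos="NN" lemma="corpus">corpora</w>\n'
raw = (line * 10000).encode("utf-8")

packed = gzip.compress(raw)
print(len(raw), len(packed))  # redundant markup compresses very well

# Seeking into the compressed stream is possible, but GzipFile
# implements it by decoding all preceding output from the start:
with gzip.GzipFile(fileobj=io.BytesIO(packed)) as f:
    f.seek(len(raw) // 2)      # no O(1) jump; reads through from byte 0
    chunk = f.read(len(line))
```

This is why block-oriented or indexed storage schemes exist: they trade a little compression ratio for the ability to jump to an arbitrary corpus position without a linear scan.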
> What would you advocate as an alternative?
Hope you don't mind the plug: the IMS Corpus Workbench was designed
for corpora of that size and offers both (relatively) compact storage
and (relatively) efficient access (it isn't available for HP-UX either,
though).
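The general idea behind this kind of compact storage (not the actual CWB file format, which is considerably more sophisticated) is to replace repeated word forms with fixed-width integer codes plus a lexicon, so any corpus position is a constant-time lookup rather than a scan through markup:

```python
# Sketch of integer-coded corpus storage; names and layout are
# illustrative, not the IMS Corpus Workbench's real on-disk format.
from array import array

tokens = ["the", "cat", "sat", "on", "the", "mat"]

lexicon = {}                 # word form -> integer id
ids = array("i")             # token stream as fixed-width integers
for t in tokens:
    ids.append(lexicon.setdefault(t, len(lexicon)))

forms = list(lexicon)        # id -> word form (insertion order)

# Corpus position 4 is an O(1) array access plus a lexicon lookup:
print(forms[ids[4]])         # -> "the"
```

Each annotation layer (PoS, lemma) gets its own parallel integer array, so the per-token cost stays at a few bytes regardless of how verbose the original markup was.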
--
``I could probably subsist for a decade or more on the food energy
that I have thriftily wrapped around various parts of my body.''
                                             -- Jeffrey Steingarten
______________________________________________________________________
C.E.R.T. Marbach (CQP Emergency Response Team)
http://www.ims.uni-stuttgart.de/~evert
email@example.com
This archive was generated by hypermail 2b29 : Fri Jun 22 2001 - 19:42:01 MET DST