I've developed some simple web crawling software
that is designed to build corpora for minority
languages quickly and inexpensively. See:
Thus far it has been deployed in earnest only for Welsh
(now approaching 50 million words) and Irish
(15 million words). The Welsh corpus is being
used by the lexicographers at the University of Wales
Dictionary of the Welsh Language:
Of course the texts harvested in this way are
not statistically representative in any sense.
Nevertheless they are good for lexicography and
number-crunching for natural language processing.
And extracting useful subsets shouldn't be hard;
I've done some of this for the Irish corpus.
The software has proved to be quite portable
across languages; it (very roughly) bootstraps
the language model from some initial "seed" texts
(or even better an initial word list).
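
To make the bootstrapping idea concrete, here is a minimal sketch in
Python. It is not the crawler's actual code: the character-trigram
profile, the function names (trigrams, overlap, bootstrap) and the 0.5
threshold are all assumptions chosen for illustration. It builds a
profile from the seed texts, accepts a candidate page only if enough of
its trigrams are already known, and folds each accepted page back into
the profile.

```python
from collections import Counter


def trigrams(text):
    """Return the character trigrams of text, lowercased, whitespace collapsed."""
    cleaned = " ".join(text.lower().split())
    return [cleaned[i:i + 3] for i in range(len(cleaned) - 2)]


def overlap(profile, text):
    """Fraction of the text's trigrams that already occur in the profile."""
    grams = trigrams(text)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in profile) / len(grams)


def bootstrap(seed_texts, candidates, threshold=0.5):
    """Grow a trigram profile from seed texts by absorbing matching candidates.

    The threshold is an illustrative value, not a tuned one.
    """
    profile = Counter()
    for text in seed_texts:
        profile.update(trigrams(text))

    accepted = []
    for text in candidates:
        if overlap(profile, text) >= threshold:
            accepted.append(text)
            # Accepted text extends the model, so later pages are
            # judged against a slightly richer profile.
            profile.update(trigrams(text))
    return profile, accepted
```

A word list could stand in for the seed texts by passing each word as
its own short "text". A real crawler would presumably need a stricter
test than raw overlap, since closely related languages share many
trigrams.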
I've done some experimentation with several other
languages: Catalan, Swahili, Maori, Faroese,
Scottish Gaelic, Walloon, Breton, Cebuano, and Manx
Gaelic. You can see some results on the
Please send me an email if you'd be interested
in helping develop one of these corpora or in
trying a new language.