I would like to evaluate a sentence boundary
and abbreviation detection algorithm on as
many different languages as possible.
Therefore, I am searching for newspaper corpora
that are either freely available or not too expensive.
The languages in question should use the period
as an ambiguous token denoting either a sentence
boundary, an abbreviation or both.
I am already using parts of the Wall Street Journal Corpus,
the Neue Zürcher Zeitung and some corpora
included in the Multilingual Corpus I from the European Corpus Initiative.
I also know about TRACTOR.
I would be very thankful for any suggestions.