We have made morphologically analyzed and disambiguated Turkish news
text available for download. The disambiguation has been performed
with a statistical disambiguator, but no manual corrections have been made.
A morphological parse is represented as a sequence of features with
derivations being marked by the symbol ^DB. Morphological analysis has
been performed by the Turkish analyzer developed using XRCE Finite State
Tools. Unknown words have been analyzed with an unknown word processor and
the resulting candidate parses for those have also been disambiguated.
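
As a rough illustration of the representation described above, here is a minimal Python sketch that splits one parse at each ^DB derivation boundary and then into features. The "+" feature separator, the function name split_parse, and the sample parse string are assumptions made for illustration; they are not taken from the corpus or its documentation.

```python
DERIVATION_BOUNDARY = "^DB"   # derivation marker named in this message
FEATURE_SEPARATOR = "+"       # assumption: the separator is not stated here


def split_parse(parse):
    """Split a parse into groups at each ^DB; each group is a list of features."""
    groups = []
    for chunk in parse.split(DERIVATION_BOUNDARY):
        # Drop the empty string left by a leading separator after ^DB.
        features = [f for f in chunk.split(FEATURE_SEPARATOR) if f]
        groups.append(features)
    return groups


# Hypothetical parse string, written for illustration only.
example = "güzel+Adj^DB+Noun+Zero+A3sg+Pnon+Nom"
for group in split_parse(example):
    print(group)
```

Keeping the groups separate, rather than flattening the features into one list, preserves where each derivation starts.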
A typical sentence is tagged with the first token on each line being
the word and the remainder of the line being its disambiguated morphological
parse.
CAVEAT: On small test sets we have seen an accuracy of 94% (over 95% if one
ignores some semantic markers). We expect a similar accuracy on this
corpus, but we have not measured it. Before disambiguation, the text had
about 2 morphological parses per token. If you notice any errors, please
let us know and we will update the copies on the server.
Turkish characters are encoded in ISO Latin-5 (ISO-8859-9). The text, about
1M words, can be retrieved either as a single file or as a batch of shorter
files. For an explanation of the morphological symbols used, and for
downloading, see
Please let us know of any problems.
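
For readers loading the files, the following Python sketch shows one way to decode the ISO Latin-5 text and separate each word from its parse. It assumes one token per line with whitespace between the word and the parse; the function name read_tagged_lines is hypothetical, and "iso8859_9" is Python's codec name for ISO Latin-5.

```python
CORPUS_ENCODING = "iso8859_9"  # Python codec for ISO Latin-5 (ISO-8859-9)


def read_tagged_lines(path):
    """Yield (word, parse) pairs from a Latin-5 encoded file.

    Assumes one token per line: the word, whitespace, then its parse.
    Lines without both parts (for example, blank lines) are skipped.
    """
    with open(path, encoding=CORPUS_ENCODING) as handle:
        for line in handle:
            parts = line.split(maxsplit=1)
            if len(parts) == 2:
                yield parts[0], parts[1].strip()
```

Passing the encoding explicitly avoids depending on the platform default, which would garble letters such as ğ, ş and ı.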
--
Kemal Oflazer
e-mail: firstname.lastname@example.org
http://www.cs.bilkent.edu.tr/~ko/ko.html
Bilkent University, Dept. of Computer Engineering
Bilkent, ANKARA, 06533 TURKEY
tel: (90-312) 266-4133 (Sec)
              290-1258 (Office)
     (90-532) 447-8978 (Mobile)
fax: (90-312) 266-4126