Apologies if you receive multiple copies of this message, especially if
you have no interest whatsoever in its contents.
Following a number of requests, I have decided to mount the EMILLE
character encoding conversion software (unicodify) on the EMILLE
download site (http://www.ling.lancs.ac.uk/corplang/emille/default.htm).
The conversion software was developed at Lancaster University, and
allows users to convert 30 (or so) different 8-bit encodings of South
Asian scripts, commonly found both in publishing and on the web, into
16-bit little-endian Unicode format. The software is very useful indeed if
you plan to collect South Asian corpus data from the web. As with the
EMILLE corpus, the software may be used freely for non-commercial
research.
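
For anyone unfamiliar with what such a conversion involves, here is a
minimal Python sketch of the general idea: look up each byte of an 8-bit
encoded text in a mapping table and write the result out as 16-bit
little-endian Unicode. This is not the unicodify code. The table entries
and the function name below are invented purely for illustration, and
real legacy encodings generally need more than a byte-for-byte lookup
(for example, reordering of vowel signs).

```python
from typing import Mapping

# HYPOTHETICAL mapping from byte values to Unicode characters.
# These entries are made up for illustration and do not describe
# any real font encoding or any table used by unicodify.
HYPOTHETICAL_TABLE = {
    0x41: "\u0905",  # DEVANAGARI LETTER A
    0x42: "\u0906",  # DEVANAGARI LETTER AA
    0x6B: "\u0915",  # DEVANAGARI LETTER KA
}


def to_utf16le(data: bytes, table: Mapping[int, str]) -> bytes:
    """Convert 8-bit encoded bytes to UTF-16 little-endian bytes.

    Bytes found in the table are replaced by their mapped character.
    Unmapped bytes below 0x80 are passed through as ASCII; any other
    unmapped byte becomes U+FFFD (REPLACEMENT CHARACTER).
    """
    chars = []
    for byte in data:
        if byte in table:
            chars.append(table[byte])
        elif byte < 0x80:
            chars.append(chr(byte))
        else:
            chars.append("\ufffd")
    # "utf-16-le" writes two bytes per BMP character, low byte first,
    # and does not prepend a byte order mark.
    return "".join(chars).encode("utf-16-le")


if __name__ == "__main__":
    converted = to_utf16le(b"\x41 \x6b", HYPOTHETICAL_TABLE)
    print(converted.hex(" "))
```

Some applications expect a byte order mark at the start of a
little-endian file; in that case the bytes FF FE would have to be
written before the converted text.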
Also, an Urdu POS tagger is now mounted on the EMILLE download site.
Again, it is free for use in non-commercial research.
Both downloads include documentation etc.
Professor of English Language and Linguistics,
Dept. Linguistics and Modern English Language,