Re: Corpora: POS disambiguation

Oliver Mason
Thu, 23 Oct 1997 09:15:47 +0100

Date: Wed, 22 Oct 1997 13:50:02 -0400
Reply-To: Adwait Ratnaparkhi <>
From: Adwait Ratnaparkhi <>
Organization: University of Pennsylvania
To: "D.H. Van Uytsel" <>
Subject: Re: Corpora: POS disambiguation

D.H. Van Uytsel wrote:
> I would like to tag a running text containing a few M words. It is not the
> focus of my research, so I can't spend too much time on this. As a poor
> researcher, I have looked around for some good freeware. For my purposes, it
> should be
> [..]
Adwait Ratnaparkhi wrote:
> I have written a statistical tagger based on a maximum entropy model,
> which I refer to as MXPOST (for lack of a better name).
> It is written in Java, and the executable (i.e., "bytecode") is free for
> research purposes.
> It should, in theory, run on any platform with a Java interpreter.

I have also written a (probabilistic) tagger, which consists of a client
(written in Java) and a server (written in C). Training the tagger is
extremely fast: it just involves re-formatting the pre-tagged training
corpus. The tagger is also independent of language and tagset. Preliminary
evaluations for Swedish (by Daniel Ridings) and Romanian (by Dan Tufis)
have given error rates of about 3%.
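
To illustrate why training from a pre-tagged corpus can be little more
than re-formatting, here is a minimal sketch in Python. It is not the
tagger described above, whose internals are not given in this message.
It assumes a hypothetical "word/TAG" corpus format and a plain
most-frequent-tag baseline, and the function names (train, tag,
error_rate) are invented for illustration only.

```python
from collections import Counter, defaultdict


def train(tagged_lines):
    """Count how often each word occurs with each tag.

    Assumes (hypothetically) one sentence per line, tokens as 'word/TAG'.
    """
    lexicon = defaultdict(Counter)   # word -> Counter of tags
    tag_totals = Counter()           # tag  -> overall frequency
    for line in tagged_lines:
        for token in line.split():
            # Split on the last slash so words containing '/' still work.
            word, sep, tag_label = token.rpartition("/")
            if not sep:
                continue  # skip tokens that carry no tag
            lexicon[word.lower()][tag_label] += 1
            tag_totals[tag_label] += 1
    return lexicon, tag_totals


def tag(words, lexicon, tag_totals):
    """Assign each word its most frequent tag from training.

    Unknown words fall back to the most frequent tag overall.
    """
    fallback = tag_totals.most_common(1)[0][0]
    result = []
    for word in words:
        counts = lexicon.get(word.lower())
        result.append(counts.most_common(1)[0][0] if counts else fallback)
    return result


def error_rate(predicted, gold):
    """Fraction of tokens whose predicted tag differs from the gold tag."""
    wrong = sum(p != g for p, g in zip(predicted, gold))
    return wrong / len(gold)


if __name__ == "__main__":
    corpus = [
        "the/DT dog/NN barks/VBZ",
        "the/DT cat/NN sleeps/VBZ",
        "a/DT dog/NN sleeps/VBZ",
    ]
    lexicon, tag_totals = train(corpus)
    print(tag(["the", "dog", "sleeps"], lexicon, tag_totals))
```

Because the sketch uses no tag inventory and no language-specific
rules, the same code would work for any language or tagset for which a
pre-tagged corpus exists. An error rate of about 3% in this sense means
roughly 3 mis-tagged tokens per 100.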

The tagger is freely available for research purposes at

Oliver Mason

//\\ computer officer | corpus research | department of english | school of  -
//\\ humanities | university of birmingham | edgbaston | birmingham b15 2tt  -
\\// united kingdom | phone +44-(0)121-414-6206 | fax +44-(0)121-414-5668/\  -
\\// mobile 07050 104504 | |\/  -