Beyond statistical significance, there is the question of how large
any observed difference really is. This is a very different question
from whether the difference is significant (as several people have
noted), since with enough observations even a very small difference
will come out as significant.

Depending on your taggers, there are several ways to measure this. A
simple answer is mutual information. This can even give some useful
information between systems with different tagsets, although the
answer you get will be a pessimistic bound rather than a good
estimate.
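
As a rough illustration of the mutual information measure, here is a
minimal Python sketch. It assumes two token-aligned tag sequences (for
example a tagger's output and the reference tags, or the outputs of
two taggers with different tagsets). The function name
mutual_information is made up for this example and is not taken from
any library.

```python
import math
from collections import Counter


def mutual_information(tags_a, tags_b):
    """Empirical mutual information (in bits) between two aligned tag sequences.

    Assumption: tags_a[i] and tags_b[i] are the tags assigned to the same
    token, so the sequences must have equal length.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("tag sequences must be aligned token by token")
    if not tags_a:
        raise ValueError("tag sequences must not be empty")

    n = len(tags_a)
    joint = Counter(zip(tags_a, tags_b))   # counts of (tag_a, tag_b) pairs
    count_a = Counter(tags_a)              # marginal counts for sequence A
    count_b = Counter(tags_b)              # marginal counts for sequence B

    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
    return mi
```

Because only the joint counts matter, the two sequences do not need to
share a tagset. Note that this is a plug-in estimate from raw counts,
with no correction for small samples.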

If your tagger can give you alternative taggings with probabilities,
then you can also use relative entropy per tag. This has many nice
properties, not least of which is that a tagger can get partial credit
for narrowing down the list of alternatives. In many applications,
having the right answer in the top k can be nearly as good as just
getting the single right answer.
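
A minimal Python sketch of relative entropy per tag, under one
simplifying assumption: the reference puts all its probability on a
single correct tag per token. In that case the relative entropy for a
token reduces to -log2 of the probability the tagger gave the correct
tag. The function name, the dict-per-token input format, and the
floor parameter (used to avoid log(0)) are all choices made for this
example.

```python
import math


def relative_entropy_per_tag(gold_tags, predicted_dists, floor=1e-12):
    """Average relative entropy (in bits) per token.

    gold_tags:       list of correct tags, one per token.
    predicted_dists: list of dicts mapping tag -> probability, one per token.
    floor:           hypothetical smoothing constant; a tagger that gives the
                     correct tag probability 0 would otherwise score infinity.
    """
    if len(gold_tags) != len(predicted_dists):
        raise ValueError("need exactly one distribution per token")
    if not gold_tags:
        raise ValueError("need at least one token")

    total = 0.0
    for gold, dist in zip(gold_tags, predicted_dists):
        q = max(dist.get(gold, 0.0), floor)
        # With a one-hot reference, KL(reference || tagger) = -log2 q(gold).
        total += -math.log2(q)
    return total / len(gold_tags)
```

A tagger that splits its probability evenly between the correct tag
and one alternative costs 1 bit on that token, which is how the
partial credit shows up: narrowing ten candidates down to two is
rewarded even when the top choice is wrong.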

Even if your tagger always commits to a single tag, you can build a
meta-tagger around it which gives probabilities for alternative
taggings. This can then be used with relative entropy to compare
against other taggers. The simplest way to do this would be to use a
held-out set to derive, for each tag the tagger outputs, a
distribution over the correct tags. The output of the meta-tagger
would then contain information about which mistakes the tagger is
known to make.
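
The held-out construction can be sketched in Python as below. The
names train_meta_tagger and meta_tag are invented for this example.
The sketch uses unsmoothed relative frequencies, and falls back to
putting all probability on the tagger's own output for tags never
seen in the held-out data; both are assumptions, and a real setup
would probably want smoothing.

```python
from collections import Counter, defaultdict


def train_meta_tagger(heldout_output_tags, heldout_gold_tags):
    """Estimate P(correct tag | tagger's output tag) from held-out data."""
    if len(heldout_output_tags) != len(heldout_gold_tags):
        raise ValueError("held-out output and gold tags must be aligned")

    counts = defaultdict(Counter)
    for out, gold in zip(heldout_output_tags, heldout_gold_tags):
        counts[out][gold] += 1

    table = {}
    for out, gold_counts in counts.items():
        total = sum(gold_counts.values())
        # Relative frequencies, no smoothing (assumption of this sketch).
        table[out] = {gold: c / total for gold, c in gold_counts.items()}
    return table


def meta_tag(output_tags, table):
    """Replace each single-tag output with a distribution over correct tags."""
    # Assumption: an output tag unseen in the held-out data is trusted fully.
    return [table.get(tag, {tag: 1.0}) for tag in output_tags]
```

The list of distributions returned by meta_tag has one dict per
token, so it can be scored with a per-tag relative entropy measure in
the same way as the output of a tagger that produces probabilities
natively.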

>>>>> "cb" == Chris Brew <Chris.Brew@edinburgh.ac.uk> writes:
>> I was wondering if anyone knows of the appropriate statistical
>> tests which could be applied to determine whether the
>> differences in tagging performance are statistically
>> significant?
cb> cf. van Halteren, Zavrel and Daelemans, Proceedings of
cb> Coling-98, vol. 1, pp. 491ff, footnote 7, using McNemar's
cb> chi-square. Since in POS tagging we are typically dealing with
cb> large corpora, even numerically small differences in error
cb> rate are likely to be statistically significant. Statistical
cb> significance is of course not the only relevant criterion.
cb> C
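
For reference, McNemar's chi-square as mentioned in the quoted
message can be sketched in Python as follows. This is the common
textbook form with a continuity correction; whether the cited paper
uses the correction is not something I have checked, and the function
name mcnemar is invented for this example. The test looks only at
tokens where exactly one of the two taggers is correct.

```python
import math


def mcnemar(gold_tags, tags_a, tags_b):
    """McNemar's chi-square (with continuity correction) for two taggers.

    Returns (chi2, p_value), using a chi-square distribution with
    1 degree of freedom.
    """
    if not (len(gold_tags) == len(tags_a) == len(tags_b)):
        raise ValueError("all three tag sequences must be aligned")

    # Discordant tokens: exactly one of the two taggers is right.
    only_a = sum(1 for g, a, b in zip(gold_tags, tags_a, tags_b)
                 if a == g and b != g)
    only_b = sum(1 for g, a, b in zip(gold_tags, tags_a, tags_b)
                 if a != g and b == g)

    if only_a + only_b == 0:
        return 0.0, 1.0  # the taggers never disagree on correctness

    chi2 = (abs(only_a - only_b) - 1) ** 2 / (only_a + only_b)
    # Survival function of chi-square with 1 d.o.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

With only a handful of discordant tokens the chi-square approximation
is poor and an exact binomial test would be the safer choice. On
large corpora that is rarely a concern, and as noted above the
harder question is whether a significant difference is also a large
one.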