sf> ... The problem is that we would like to generalise to other
sf> corpora of X and Ys from our experiment, but what we are
sf> actually measuring is a significance due to the compilation of
sf> our corpus, rather than tagging performance.
The pronoun "we" is used a bit loosely here.
You are definitely correct that "we" would like to generalise to other
corpora when we are operating in an academic mode of seeking Knowledge.
On the other hand, when we are in a very pragmatic and commercial
mode, then it is very often true that we most want to evaluate
performance on a very well characterized corpus. For instance, a
company that routes newswire to readers knows that they will be
routing very similar newswire in the foreseeable future. Similarly, a
software vendor knows that their FAQ database is not likely to change
very much over a reasonably short time period. Thus, either
of these users of NLP technology can compare the performance of
alternative approaches on their own data with reasonable confidence.
sf> Bottom line: take claims of significance with a large piece of
sf> salt. Even being generous, such claims can at best apply to
sf> the corpus on which they were measured (and usually the p
sf> values don't even make sense there), and it is foolhardy to
sf> generalise further.
The converse also holds!
If somebody does a test and finds that their results are *not*
statistically significant, then their results are almost certainly
unimportant. This happens far more often than might be imagined, so
tests of significance should always be run on experimental results.
At the very least, a good argument should be made as to why such a
test would be superfluous.
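For what it is worth, here is a minimal sketch of one such test for
the case of two taggers run on the same corpus. It is an exact
McNemar (sign) test on the items where exactly one tagger is right.
The function name and the toy data are my own inventions for
illustration, and the test assumes that the tagged items are
independent, which is only roughly true for running text.

```python
from math import comb


def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar (sign) test for two systems on the same items.

    Returns (a_only, b_only, p_value), where a_only counts the items that
    only system A got right and b_only the items that only B got right.
    Hypothetical helper for illustration, not taken from any library.
    """
    triples = list(zip(gold, pred_a, pred_b))
    # Only discordant items (exactly one system correct) carry information.
    a_only = sum(1 for g, a, b in triples if a == g and b != g)
    b_only = sum(1 for g, a, b in triples if b == g and a != g)
    n = a_only + b_only
    if n == 0:
        # The systems never disagree in correctness: no evidence either way.
        return a_only, b_only, 1.0
    k = min(a_only, b_only)
    # Under the null hypothesis each discordant item favours A or B with
    # probability 1/2, so the smaller count follows Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Toy data (made up): ten tokens, tagger A is right everywhere and
# tagger B is wrong on the first eight.
gold = ["N"] * 10
tags_a = ["N"] * 10
tags_b = ["V"] * 8 + ["N"] * 2
a_only, b_only, p = mcnemar_exact(gold, tags_a, tags_b)
print(a_only, b_only, p)
```

The point of pairing is that items which both taggers get right, or
both get wrong, say nothing about which tagger is better, so only the
disagreements enter the computation.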