> The sort of distortion that Adam Kilgarriff cites has been with us from the
> beginning. Look at the Brown Corpus, 1 million words (so large in its day)
> and look at the high frequency of the English word 'jabberwocky.'
> This is really raising questions about the conceptual foundations of the
> whole enterprise. Have we assumed that 'English' is not simply a collective
> term, representing a range of specializations and dialects that no one could
> possibly learn entirely? Have we assumed that "I speak English" has some
> denotational sense? If so, what?
> Have we assumed that 'English' has a boundary, and it is our job to find it?
> Probably not, but then we should avoid boundary-finding activities.
But isn't it the case that whenever we make any sort of decision as to what should be admitted or not admitted to a corpus, we are indulging in boundary-finding activities? The idea of a balanced corpus is that someone makes an informed but probably subjective judgement as to what proportions of which genres should be included. And if a document is partly in a foreign language, or (in this case) there is evidence that it might not hail from an American English native speaker, someone's got to decide if it's admissible.
Having said that, I would not want to argue that highly technical documents ought to be excluded from the corpus on the grounds that they might not be representative. Such documents are still valid linguistic data, no matter how much obscure vocabulary they contain. Linguistic citations in Comp Ling papers are in a different category, I think. They are not examples of language in use; they are *not* valid linguistic data and really should be excluded from corpora, as far as is practicable, if their inclusion might lead to skewage.
I wondered if it would, so I had a look at some ACL papers. A lot have some sort of linguistic exemplification, but quite often it consists of just one or two "John said that Mary will go" type sentences near the top of the paper, and that's all. So perhaps for practical purposes it doesn't really matter; a different pattern might well emerge from a set of theoretical linguistics papers, though.