RE: Corpora: Corpus of scientific texts

Christopher A. Brewster
Mon, 26 Oct 1998 14:30:02 +0200

On 26 October 1998 08:48, GCW wrote:
> On Fri, 23 Oct 1998, Adam Kilgarriff wrote:
> >
> > Aren't 'technical scientific corpora' the easiest of all to produce?
> > Increasingly, all the material is available online in a manner which
> > invites you to download it, for free, direct, without a publisher
> > intervening to create copyright problems.
> In this case, who controls the input? If you take what happens to be
> available on the net then you have little control over the selection
> process. Then, are we talking about 'technical' science in the sense of
> technical how-to-do-it manuals, or learned research papers? The latter are
> rarely available on-line for copyright reasons. Some scientists do put
> texts on their websites, but this is for self-publicity purposes,
> 'creating a research space' in the terminology of Swales. You cannot cover
> a sublanguage in this way.

I think the extent to which 'learned research papers' are unavailable on-line varies
considerably. While collecting material for my PhD, I have formed the impression that more than 50%
of the current generation of young professors in computational linguistics, NLP
and IR have put their doctorates and the majority of their publications on the web. This is not, I
believe, for self-publicity but to encourage interaction within the academic community. I would
expect other disciplines to be doing the same thing.

A simple way to cover a sub-language is to use the citation indexes, find the top 100 items
cited in the last ten years, and aim to retrieve 50% of these over the net. This gives you a
corpus of the texts that are, by citation count, the most influential (i.e. like getting top ratings
on TV) and thus both representative of and influential on the linguistic community
of the sub-language.
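
To make the selection procedure above concrete, here is a minimal sketch in Python. It assumes the citation index has already been exported as a flat list of cited-item identifiers (one entry per citation) and that you have a set of identifiers known to be retrievable on-line. The names `top_cited`, `build_corpus`, `citations` and `available_online` are hypothetical and chosen only for illustration; no real citation-index API is implied.

```python
from collections import Counter


def top_cited(citations, n=100):
    """Return the n most frequently cited item ids, most cited first."""
    return [item for item, _ in Counter(citations).most_common(n)]


def build_corpus(citations, available_online, n=100, target=0.5):
    """Select the top-n cited items and keep those retrievable on-line.

    citations        -- iterable of item ids, one entry per citation (assumed input)
    available_online -- set of item ids that can be downloaded (assumed input)
    target           -- fraction of the top-n we aim to retrieve (50% in the text)

    Returns (found, coverage, target_met).
    """
    top = top_cited(citations, n)
    found = [item for item in top if item in available_online]
    coverage = len(found) / len(top) if top else 0.0
    return found, coverage, coverage >= target
```

With a real export you would call `build_corpus(citations, available_online)` and check whether the returned coverage reaches the 50% target; the items in `found` are the candidates for the corpus. The sketch only selects items: it says nothing about how the citation counts were gathered or about what the unretrievable half might have contained.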

What is the flaw in my method?

Christopher Brewster
University of Patras & University of Birmingham