Dear Corpora members,
This is a summary of replies I received to my query from 18 Jan:
> For my research on textual manifestations of common knowledge, I am
> looking for a corpus of short English texts based on restricted vocabulary
> (up to ~500 different NP, VP heads), to be used for training machine
> learning tools sensitive to vocabulary size.
I would like to thank Brett Reynolds, Eric Atwell, Joel Walters and Andrew
Harley for providing pointers.
Here is the summary of replies:
(1) Andrew Harley <firstname.lastname@example.org>
from Cambridge University Press suggested using
learner's dictionaries whose definitions are written in a restricted vocabulary;
for example, the Cambridge learner's dictionary, which can be licensed. More info here:
He also suggested using ELT readers at different levels that might meet
the restricted vocabulary requirement. The first level restricts the
vocabulary to 400 headwords; at this level, there are 6 books of about 30
pages including pictures. It is possible to view samples from the readers
Readers have not yet been licensed for use as a corpus, but
Andrew Harley thinks it might be possible if there is a demand and if the
In a similar spirit, Brett Reynolds <email@example.com> suggested the Oxford
Bookworms Series of Graded Readers; more information can be found here:
Some short samples are available from the site.
(2) Joel Walters <firstname.lastname@example.org> has a small corpus of native
English texts collected for an experimental procedure involving writing
syntheses/summaries of two source texts. The corpus totals about 20,000
words, and individual texts range from 50 to 600 words.
(3) Eric Atwell referred me to Dr Caroline Lyon of the University of
Hertfordshire <C.M.Lyon@herts.ac.uk>, who used a restricted English corpus
for her 1994 PhD thesis: http://homepages.feis.herts.ac.uk/~comrcml/Lyon-thesis.ps
Thanks to all who replied,