The initial release of the LUCY Corpus is now freely available for downloading.
The LUCY Corpus is a treebank sampling modern written British English of the
following kinds:
* edited published prose
* the writing of young adults (e.g. A-level exam scripts, 1st-year
* spontaneous writing by 9- to 12-year-old children
Compilation of the LUCY Corpus was sponsored by the Economic and Social
Research Council (UK). The corpus is named after St Lucia or Lucy, patron
saint of writers.
The corpus is structurally annotated in conformity with the SUSANNE annotation
scheme, defined in my _English for the Computer_ (Clarendon, 1995).
Extensions to the scheme were developed in the LUCY project in order to
represent what is going on in cases where unskilled writers fail to produce
written structures that succeed in expressing their apparent intention.
Documentation for the LUCY Corpus, including a definition of the annotation
conventions just mentioned, can be read as a Web page at
www.grsampson.net/LucyDoc.html (13,000 words). The Corpus itself is
available via www.grsampson.net/Resources.html, as are earlier resources from
The initial LUCY release will undoubtedly contain mistakes. (That is
particularly likely, since pressure from the sponsor for early
publication meant that there was not enough time for all the checks that
would ideally have been applied.) Users who find errors are warmly urged
to contact me with details, which will be used to produce later, more
accurate releases. My e-mail address, in a form designed to foil spammers,
is grs2 followed by at-sign followed by sussex.ac.uk
Geoffrey Sampson MA PhD MBCS
Professor of Natural Language Computing
Department of Informatics
University of Sussex
Falmer, Brighton BN1 9QH, England
t +44 1273 678525
f +44 1273 671320