I would like to thank all the people who answered my query about proverbs. The various sources that were mentioned in their answers are listed at the end of this message.
An interesting by-product of my original query has been the ongoing discussion on what a corpus really is. Considering the wording I used ("the objective is to turn the corpus into a data base that could be used as an aid for reading comprehension"), I would indeed agree that the list I am intending to compile is simply a data base and will be used as such.
But by the same token, it seems to me that the same data base would qualify as a corpus if one were to use it to study, say, the use of determination or relative clauses in proverbs.
The EAGLES initiative's definition, which I found at the URL pointed to by John McNaught, is interesting: "A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language."
However, "a sample of the language" leaves room for various interpretations, as does the word "corpus" itself. In the same way that the word "body" has several meanings ("the main or central part", but also, "a mass or collection of material that is distinct from other masses"), the sample could be representative of the language as a whole, but also of only a small part of the language (in my case, popular wisdom).
Ultimately, I think that most people's idea of what is (or isn't) a corpus boils down to their conception of what counts as a valid method for data selection and extraction, as was pointed out by Lou Burnard and Sabine Bartsch. And I would argue that as long as the selection criteria are coherent and the data to be extracted are a "subcomponent" of the corpus (whether they are individual words or grammatical structures), the appellation can be justified, provided no invalid claim is made as to how representative of the language as a whole a given corpus really is.
Finally, I would like to add to the remarks made by David Powers and Ute Römer on what a "proverb" is. Oliver Mason said that "[a corpus] wouldn't be a corpus any more once you meddle with it, by eg extracting all proverbs". But what little research I have done on the subject shows that such extraction is just not possible unless you already have a list of proverbs at your disposal and search for occurrences of the words that make them up. What I also found, interestingly, is that proverbs are seldom quoted verbatim; more often they are slightly modified to allude to a given situation. This is particularly true of their use in the titles of magazine or newspaper articles (e.g. "Charity begins abroad").
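As a rough illustration of the list-based lookup described above, here is a minimal Python sketch. It assumes a small hand-made proverb list, a crude stopword set and an arbitrary overlap threshold (PROVERBS, STOPWORDS, find_allusions and the 0.6 value are all hypothetical choices of mine, not an established method). It simply checks how many of a proverb's content words turn up in a piece of text, which is enough to catch a modified title such as "Charity begins abroad".

```python
import re

# Hypothetical mini-list; a real study would load a full list of proverbs.
PROVERBS = [
    "Charity begins at home",
    "A stitch in time saves nine",
    "Too many cooks spoil the broth",
]

# Crude, assumed stopword set: function words carry little matching weight.
STOPWORDS = {"a", "an", "the", "at", "in", "of", "to", "and", "is"}


def content_words(text):
    """Return lowercase word tokens with common function words removed."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [w for w in tokens if w not in STOPWORDS]


def find_allusions(text, proverbs=PROVERBS, threshold=0.6):
    """Return (proverb, score) pairs for proverbs whose content words
    mostly occur in `text`. The score is the share of the proverb's
    content words found in the text; word order is ignored."""
    present = set(content_words(text))
    hits = []
    for proverb in proverbs:
        words = set(content_words(proverb))
        if not words:
            continue
        score = len(words & present) / len(words)
        if score >= threshold:
            hits.append((proverb, score))
    return hits


if __name__ == "__main__":
    for proverb, score in find_allusions("Charity begins abroad"):
        print(f"{proverb!r} (overlap {score:.2f})")
```

This is only a sketch: it ignores word order and inflection (so "cook" would not match "cooks"), and on longer texts it would need to be applied sentence by sentence to avoid chance matches.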
Once again, thanks to all the participants in this interesting debate. To think that none of this would have been possible 20 years ago.
Département de Langues Étrangères Appliquées
Centre de Recherche en Terminologie et en Traduction
Faculté des Langues
Université Lyon II
The following are snippets taken from the various messages I received:
http://www.shu.ac.uk/web-admin/phrases/list/ (not a very long list - but maybe it's a start.)
http://www.amazon.com/exec/obidos/Author=Spears%2C%20Richard%20A./102-667273 (Richard Spears has written many wonderful books on this topic already)
Moon (1998) uses the Oxford Hector Pilot Corpus to create a database of English fixed expressions and idioms (including proverbs). It's an extremely interesting study. The reference is:
Moon, R. (1998). Fixed Expressions and Idioms in English. Oxford: Clarendon Press.
The following URLs indicate where to purchase some electronic dictionaries:
http://www.cup.cam.ac.uk/elt/reference/data.htm (Cambridge International Dictionary of Idioms. This is available as SGML data for prices from 300 pounds (university research use) to 3000 pounds (full development licence).)
http://www.icp.grenet.fr/ELRA/catalog.html (bilingual French-English dictionary of idioms, proverbs and general expressions).
http://www.cup.org/esl/cdae (Cambridge Dictionary of American English, book and CD-ROM combined for only $20.95)
This archive was generated by hypermail 2b29 : Mon Jan 31 2000 - 11:35:15 MET