I tend to agree at one level, but a corpus of proverbs is a possibility
- e.g. the Bible contains one, and the dictionaries and collections of
proverbs are corpora - not so different from the corpus of Shakespeare
or a corpus of religious or legal writings or telephone conversations
or parent-child speech, although the biblical proverbs usually take a
more extended form than our English ones (some of which come from the
Bible anyway).
But once you go below sentence level, you are bringing in the kind of
assumptions we aim to avoid in corpus work. Even selection at 'sentence
level' is problematic due to processes of context and elision, stylistic
freedom in relation to punctuation, and the representation of clauses as
lists or separate sentences, etc., e.g.
What time is it? Three thirty!
I came, I saw, I conquered!
I came! I saw! I conquered!
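To make the point concrete, here is a minimal sketch of a deliberately naive, punctuation-based splitter applied to the three examples above. The function name `naive_sentences` and the splitting rule are assumptions made purely for illustration, not anyone's actual segmenter. The same content comes out as one, two or three "sentences" depending only on the writer's punctuation.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace.

    This is a deliberately simplistic rule (hypothetical, for illustration):
    it knows nothing about context, elision or clause structure.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

examples = [
    "What time is it? Three thirty!",
    "I came, I saw, I conquered!",
    "I came! I saw! I conquered!",
]

for text in examples:
    units = naive_sentences(text)
    # Print how many 'sentences' the naive rule finds, and what they are.
    print(len(units), units)
```

Any selection or statistic defined "per sentence" inherits whatever rule was used to draw these boundaries.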
Another tendency is for statistics about parsers to be based on
sentences restricted to fewer than X words, where X is typically around
20 and usually below the sentence length typical of the corpus they are
extracted from. Such practices should be deprecated except when
filtering is integral to a theory (e.g. of language acquisition -
attending to only certain types of utterance - but this doesn't alter
the corpus).
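As a rough sketch of why such length cut-offs matter, the following reports how much of a corpus survives a "fewer than X words" filter and how the mean sentence length shifts. The function name `length_filter_report`, the whitespace tokenisation and the toy data are all assumptions for illustration; no real corpus or parser evaluation is being reproduced here.

```python
def length_filter_report(sentences, max_words=20):
    """Compare a corpus with its 'fewer than max_words' subset.

    Words are counted by whitespace splitting (a simplifying assumption).
    The input list is not modified: the filter is only a view on the corpus.
    """
    lengths = [len(s.split()) for s in sentences]
    kept = [n for n in lengths if n < max_words]
    return {
        "total": len(lengths),
        "kept": len(kept),
        "coverage": len(kept) / len(lengths) if lengths else 0.0,
        "mean_all": sum(lengths) / len(lengths) if lengths else None,
        "mean_kept": sum(kept) / len(kept) if kept else None,
    }

# Toy 'corpus': five dummy sentences of 5, 12, 18, 25 and 40 words.
toy_corpus = [" ".join(["w"] * n) for n in (5, 12, 18, 25, 40)]

report = length_filter_report(toy_corpus, max_words=20)
print(report)
```

Reporting the coverage alongside any parser figures at least makes visible how far the evaluated subset departs from the corpus it was drawn from.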
From: Susan Hays <email@example.com>
To: CORPORA@hd.uib.no <CORPORA@hd.uib.no>
Date: Friday, January 28, 2000 9:15 AM
Subject: Corpora: What is a corpus
>Oliver has struck an important chord with my thinking. Many of the
>on this list request pre-filtered work. A corpus is a collection of
>list of phrases, verb forms, or other fragments.
>One of the real joys of working with corpora is the excitement of
>something you weren't looking for. The more the input to the corpus is
>the preconceptions of the researchers, the less likelihood that these
>insights will arise. Of course, the nature of the storage medium
>some filtering must occur, but it is important that these technical
>are kept in mind when examining the corpora. Only by looking for things
>looking for will we gain deep insights into the nature of language.
>-Paul Hays (currently writing from a borrowed eddress)
>Oliver Mason wrote:
>> François Maniez writes:
>> > I wondered whether anybody on the list knows about an online
>> >available for download and consisting of English proverbs and/or set
>> >phrases. The objective is to turn the corpus into a data base that
>> > [...]
>> Andrew Harley replies:
>> > Instead of a corpus, you might want to consider using an existing
>> > dictionary which gives examples of idioms in context, e.g. the
>> > International Dictionary of Idioms. This is available as SGML data
>> Sorry to appear pedantic, but how would a `corpus of proverbs' look
>> like? I would think no such thing could exist, just like you
>> have a corpus of past tense sentences. Instead, you have a corpus
>> say, written fiction, which you can use to compile a list/database of
>> proverbs, but that would not be a corpus, but a, erm, list or
>> database (or even a dictionary).
>> My understanding of `corpus' is that it is some more or less
>> homogeneous collection of utterances, but not `filtered', ie if you
>> selected all sentences containing proverbs you would end up with a
>> list, not a (sub)corpus.
>> Do other people think different/the same?
>> //\\ computer officer | corpus research | department of english |
>> //\\ humanities | university of birmingham | edgbaston | birmingham
>> \\// united kingdom | phone +44-(0)121-414-6206 | fax
>> \\// mobile 07050 104504 | http://www.clg.bham.ac.uk |